Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 04. Train in a remote Linux VM
* Create Workspace
* Create `train.py` file
* Create (or attach) a DSVM as a compute target
* Upload data files into the default datastore
* Configure & execute a run in a few different ways
    - Use system-built conda
    - Use existing Python environment
    - Use Docker 
* Find the best model in the run

## Prerequisites
Make sure you go through the [00. Installation and Configuration](00.configuration.ipynb) Notebook first if you haven't.

In [3]:
# Allow multiple displays per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" 

In [4]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 0.1.80


In [5]:
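# Load the python-dotenv IPython extension; its %dotenv magic is used below to read VM credentials from a .env file
%load_ext dotenv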

## Initialize Workspace

Initialize a workspace object from persisted configuration.

In [6]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location[:1], ws.subscription_id[:1], sep='\n')

Found the config file in: /workspace/amlsdk/AMLSDKNotebooks/aml_config/config.json
ghiordanDockerPower001ws
ghiordanDockerPower001rsg
e
e


## Create Experiment

An **experiment** is a logical container in an Azure ML workspace. It hosts run records, which can include run metrics and output artifacts from your experiments.

In [7]:
experiment_name = 'train-on-remote-vm'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

Let's also create a local folder to hold the training script.

In [8]:
import os
script_folder = './vm-run'
os.makedirs(script_folder, exist_ok=True)

## Upload data files into datastore
Every workspace comes with a default datastore (and you can register more), backed by the Azure storage account associated with the workspace. As the output below shows, the default datastore here is an Azure File share (`workspacefilestore`). We can use it to transfer data from the local machine to the cloud and to access that data from the compute target.

In [9]:
# get the default datastore
ds = ws.get_default_datastore()
print(ds.name, ds.datastore_type, ds.account_name[:2], ds.container_name[:2])

workspacefilestore AzureFile gh az


Load the diabetes dataset from `scikit-learn` and save its features and labels as two local files.

In [10]:
from sklearn.datasets import load_diabetes
import numpy as np

training_data = load_diabetes()
np.save(file='./features.npy', arr=training_data['data'])
np.save(file='./labels.npy', arr=training_data['target'])

Now let's upload the 2 files into the default datastore under a path named `diabetes`:

In [11]:
ds.upload_files(['./features.npy', './labels.npy'], target_path='diabetes', overwrite=True)

$AZUREML_DATAREFERENCE_c85c0794a8364c16a22e88436e443e20

## View `train.py`

For convenience, we created a training script for you. It is printed below as text, but you can also run `%pfile ./train.py` in a cell to display the file. Pay special attention to how the features and labels are loaded from files in the `data_folder` path, which is passed to the training script as an argument (shown later).

In [12]:
# copy train.py into the script folder
import shutil
shutil.copy('./train.py', os.path.join(script_folder, 'train.py'))

with open(os.path.join(script_folder, 'train.py'), 'r') as training_script:
    print(training_script.read())

'./vm-run/train.py'

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

import os
import argparse

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from azureml.core.run import Run
from sklearn.externals import joblib

import numpy as np

os.makedirs('./outputs', exist_ok=True)
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str,
                    dest='data_folder', help='data folder')
args = parser.parse_args()

print('Data folder is at:', args.data_folder)
print('List all files: ', os.listdir(args.data_folder))

X = np.load(os.path.join(args.data_folder, 'features.npy'))
y = np.load(os.path.join(args.data_folder, 'labels.npy'))

run = Run.get_context()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
data = {"train": {"X": X_train, "y": y_train},
        "test": {"X": X_test, "y": y_test}}

# list of numbers
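The `--data-folder` option defined in the script is how the notebook hands the data location to `train.py`: whatever string follows it in the run's `arguments` ends up in `args.data_folder` because of `dest='data_folder'`. A minimal stand-alone sketch of that parsing step, where the helper name `parse_data_folder` and the path `/tmp/example-data` are placeholders for illustration (on the VM, Azure ML supplies the real path at run time):

```python
import argparse
import os


def parse_data_folder(argv):
    """Hypothetical helper: parse the same --data-folder option that train.py defines."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str,
                        dest='data_folder', help='data folder')
    return parser.parse_args(argv).data_folder


# '/tmp/example-data' is a placeholder for the path the run passes in.
data_folder = parse_data_folder(['--data-folder', '/tmp/example-data'])

# train.py then looks for features.npy and labels.npy inside that folder.
features_path = os.path.join(data_folder, 'features.npy')
print(data_folder, features_path)
```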

## Create Linux DSVM as a compute target

**Note**: If creation fails with a message about Marketplace purchase eligibility, go to portal.azure.com, start creating a DSVM there, and select "Want to create programmatically" to enable programmatic creation. Once you've enabled it, you can exit without actually creating the VM.
 
**Note**: By default SSH runs on port 22 and you don't need to specify it. But if for security reasons you switch to a different port (such as 5022), you can specify the port number in the provisioning configuration object.

In [13]:
compute_target_name = 'ghiordanXRgpuvm'

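# Provisioning a new DSVM is disabled here because this notebook attaches an existing VM instead;
# change False to True to create one.
if False: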
    from azureml.core.compute import DsvmCompute
    from azureml.core.compute_target import ComputeTargetException

    try:
        dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)
        print('found existing:', dsvm_compute.name)
    except ComputeTargetException:
        print('creating new.')
        dsvm_config = DsvmCompute.provisioning_configuration(vm_size="Standard_D2_v2")
        dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)
        dsvm_compute.wait_for_completion(show_output=True)

## Attach an existing Linux DSVM
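You can also attach an existing Linux VM as a compute target. The default SSH port is 22. In this example, the connection details are read from environment variables that `%dotenv` loads from a `.env` file.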
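Because `os.getenv` returns `None` for a missing variable and returns strings otherwise, a small check before attaching can catch a typo in the `.env` file early and turn the SSH port into an integer. A minimal sketch, assuming the same variable names the attach cell uses; the helper name `read_vm_settings` is hypothetical:

```python
import os

# The variables the attach cell reads; the SSH port is treated as optional.
REQUIRED_VARS = ('COMPUTE_CONTEXT_VM_USER_NAME',
                 'COMPUTE_CONTEXT_VM_FQDN',
                 'COMPUTE_CONTEXT_VM_PWD')


def read_vm_settings(env=os.environ):
    """Hypothetical helper: collect VM connection settings from the environment.

    Raises KeyError naming every required variable that is missing or empty.
    The SSH port defaults to 22, the standard SSH port, when it is not set.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {
        'username': env['COMPUTE_CONTEXT_VM_USER_NAME'],
        'address': env['COMPUTE_CONTEXT_VM_FQDN'],
        'password': env['COMPUTE_CONTEXT_VM_PWD'],
        # Environment values are strings, so convert the port explicitly.
        'ssh_port': int(env.get('COMPUTE_CONTEXT_VM_SSH_PORT') or 22),
    }
```

The keys deliberately match the keyword arguments of the attach call, so the result could be unpacked into it, e.g. `RemoteCompute.attach(workspace=ws, name=compute_target_name, **read_vm_settings())`.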

In [14]:
%dotenv

from azureml.core.compute import RemoteCompute

# To connect with an SSH key instead of a username/password, pass private_key_file and private_key_passphrase.
# RemoteCompute.attach is deprecated in this SDK version; the warning in the output below points to
# RemoteCompute.attach_configuration followed by ComputeTarget.attach as the replacement.
attached_dsvm_compute = RemoteCompute.attach(workspace=ws,
                                             name=compute_target_name,
                                             username=os.getenv('COMPUTE_CONTEXT_VM_USER_NAME'),
                                             address=os.getenv('COMPUTE_CONTEXT_VM_FQDN'),
                                             # environment variables are strings; default to the standard SSH port
                                             ssh_port=int(os.getenv('COMPUTE_CONTEXT_VM_SSH_PORT', '22')),
                                             password=os.getenv('COMPUTE_CONTEXT_VM_PWD'))
attached_dsvm_compute.wait_for_completion(show_output=True)



    config = RemoteCompute.attach_configuration(username, address, ssh_port, password, private_key_file, private_key_passphrase)
    ComputeTarget.attach(workspace, name, config)
SucceededProvisioning operation finished, operation "Succeeded"


## Configure & Run
First, let's create a `DataReferenceConfiguration` object to tell the system which data folder to download to the compute target.

In [15]:
from azureml.core.runconfig import DataReferenceConfiguration
dr = DataReferenceConfiguration(datastore_name=ds.name,
                                path_on_datastore='diabetes',
                                mode='download',  # download files from datastore to compute target
                                overwrite=True)

Now we can try a few different ways to run the training script in the VM.

### Configure a Docker run with new conda environment on the VM
You can execute the script in a Docker container on the VM. If you choose this option, the system pulls down a base Docker image, builds a new conda environment in it if you ask it to (you can skip this step if you use a custom Docker image with a preconfigured Python environment), starts a container, and runs your script in it. The resulting image is also pushed to the Azure Container Registry (ACR) associated with your workspace and reused in subsequent runs if your dependencies don't change.

In [16]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies


# Create a new run configuration for a Python script
docker_run_config = RunConfiguration(framework="python")

# Set compute target to the Linux DSVM
docker_run_config.target = compute_target_name

# Use Docker in the remote VM
docker_run_config.environment.docker.enabled = True

# Use the default Azure ML CPU base image (hosted on Microsoft Container Registry, mcr.microsoft.com)
docker_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE
print('Base Docker image is:', docker_run_config.environment.docker.base_image)

# Set the data reference of the run configuration
docker_run_config.data_references = {ds.name: dr}

# Ask for scikit-learn to be installed in the conda environment built inside the image
docker_run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])

Base Docker image is: mcr.microsoft.com/azureml/base:0.1.4


### Submit the Experiment
Submit the script to run in a Docker container on the remote VM. The first time you run this, the system downloads the base image, layers the packages specified in the conda dependencies on top of it, creates a container, and then executes the script in that container.

In [17]:
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory=script_folder, 
                      script='train.py', 
                      run_config=docker_run_config,
                      # pass the datastore reference as a parameter to the training script
                      arguments=['--data-folder', str(ds.as_download())])
run = exp.submit(config=src)

In [18]:
run.wait_for_completion(show_output=True)

RunId: train-on-remote-vm_1542765450744

Streaming azureml-logs/60_control_log.txt

Streaming log file azureml-logs/60_control_log.txt
Logging experiment preparation status in history service.
Running: ['sudo', 'docker', 'build', '-f', 'azureml-setup/Dockerfile', '-t', 'azureml/azureml_f1e6e4024c66c3f4ee382b806410b613', '.']
Sending build context to Docker daemon  163.8kB
Step 1/13 : FROM mcr.microsoft.com/azureml/base:0.1.4
 ---> cafddcd9dc72
Step 2/13 : USER root
 ---> Using cache
 ---> d3b4123171d1
Step 3/13 : RUN mkdir -p $HOME/.cache
 ---> Using cache
 ---> adcd726554e4
Step 4/13 : WORKDIR /
 ---> Using cache
 ---> 642e2eed0294
Step 5/13 : COPY azureml-setup/99brokenproxy /etc/apt/apt.conf.d/
 ---> Using cache
 ---> 500265de8af1
Step 6/13 : RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.0; then conda install conda==4.4.11; fi
 ---> Using cache
 ---> 008fd2314e03
Step 7/13 : COPY azureml-setup/mutated_conda_dependencies.yml azureml-setup/mutated_conda_d

Collecting jmespath (from azure-cli-core>=2.0.38->azureml-core==0.1.80.*->azureml-defaults==0.1.80->-r /azureml-setup/condaenv.y836d7r7.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/b7/31/05c8d001f7f87f0f07289a5fc0fc3832e9a57f2dbd4d3b0fee70e0d51365/jmespath-0.9.3-py2.py3-none-any.whl
Collecting paramiko>=2.0.8 (from azure-cli-core>=2.0.38->azureml-core==0.1.80.*->azureml-defaults==0.1.80->-r /azureml-setup/condaenv.y836d7r7.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/cf/ae/94e70d49044ccc234bfdba20114fa947d7ba6eb68a2e452d89b920e62227/paramiko-2.4.2-py2.py3-none-any.whl (193kB)
Collecting azure-cli-nspkg>=2.0.0 (from azure-cli-core>=2.0.38->azureml-core==0.1.80.*->azureml-defaults==0.1.80->-r /azureml-setup/condaenv.y836d7r7.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/a7/85/601ef6484bf7a722daa76a4383c4ccfd4980b74ed6c2895392f53ed210d5/azure_cli_nspkg-3.0.3-py2.py3-none-any.whl
Coll

  Downloading https://files.pythonhosted.org/packages/57/41/05e79e5516db1cc0c967b3202388cde729f871c871b0a07bf24ff11adfcf/portalocker-1.2.1-py2.py3-none-any.whl
Collecting oauthlib>=0.6.2 (from requests-oauthlib>=0.5.0->msrest>=0.5.1->azureml-core==0.1.80.*->azureml-defaults==0.1.80->-r /azureml-setup/condaenv.y836d7r7.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/e6/d1/ddd9cfea3e736399b97ded5c2dd62d1322adef4a72d816f1ed1049d6a179/oauthlib-2.1.0-py2.py3-none-any.whl (121kB)
Collecting pycparser (from cffi!=1.11.3,>=1.7->cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*->azureml-core==0.1.80.*->azureml-defaults==0.1.80->-r /azureml-setup/condaenv.y836d7r7.requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/68/9e/49196946aee219aead1290e00d1e7fdeab8567783e83e1b9ab5585e6206a/pycparser-2.19.tar.gz (158kB)
Building wheels for collected packages: SecretStorage, pathspec, tabulate, antlr4-python3-runtime, pyyaml, pycparser
  Running setup.p


Streaming azureml-logs/80_driver_log.txt

  import imp
Data folder is at: workspacefilestore/diabetes
List all files:  ['features.npy', 'labels.npy']
alpha is 0.00, and mse is 3424.32
alpha is 0.05, and mse is 3408.92
alpha is 0.10, and mse is 3372.65
alpha is 0.15, and mse is 3345.15
alpha is 0.20, and mse is 3325.29
alpha is 0.25, and mse is 3311.56
alpha is 0.30, and mse is 3302.67
alpha is 0.35, and mse is 3297.66
alpha is 0.40, and mse is 3295.74
alpha is 0.45, and mse is 3296.32
alpha is 0.50, and mse is 3298.91
alpha is 0.55, and mse is 3303.14
alpha is 0.60, and mse is 3308.70
alpha is 0.65, and mse is 3315.36
alpha is 0.70, and mse is 3322.90
alpha is 0.75, and mse is 3331.17
alpha is 0.80, and mse is 3340.02
alpha is 0.85, and mse is 3349.36
alpha is 0.90, and mse is 3359.09
alpha is 0.95, and mse is 3369.13


The experiment completed successfully. Finalizing run...
Logging experiment finalizing status in history service
Cleaning up all outstanding Run operations, waiting 30

{'runId': 'train-on-remote-vm_1542765450744',
 'target': 'ghiordanXRgpuvm',
 'status': 'Finalizing',
 'startTimeUtc': '2018-11-21T02:00:15.703663Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': '2825de45-0c87-4739-8571-e77422aff451'},
 'runDefinition': {'Script': 'train.py',
  'Arguments': ['--data-folder', '$AZUREML_DATAREFERENCE_workspacefilestore'],
  'SourceDirectoryDataStore': None,
  'Framework': 0,
  'Communicator': 0,
  'Target': 'ghiordanXRgpuvm',
  'DataReferences': {'workspacefilestore': {'DataStoreName': 'workspacefilestore',
    'Mode': 1,
    'PathOnDataStore': 'diabetes',
    'PathOnCompute': None,
    'Overwrite': True}},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'NodeCount': 1,
  'Environment': {'Python': {'InterpreterPath': 'python',
    'UserManagedDependencies': False,
    'CondaDependencies': {'name': 'project_environment',
     'dependencies': ['python=3.6.2',
      {'pip': ['azureml-defaul

### View run history details

In [19]:
run

| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| train-on-remote-vm | train-on-remote-vm_1542765450744 | azureml.scriptrun | Finalizing | Link to Azure Portal | Link to Documentation |


### Find the best model

Now that the run has finished executing, we can find the best model from the metrics it logged.

In [20]:
# get all metrics logged in the run
metrics = run.get_metrics()
metrics

{'alpha': [0.0,
  0.05,
  0.1,
  0.15000000000000002,
  0.2,
  0.25,
  0.30000000000000004,
  0.35000000000000003,
  0.4,
  0.45,
  0.5,
  0.55,
  0.6000000000000001,
  0.65,
  0.7000000000000001,
  0.75,
  0.8,
  0.8500000000000001,
  0.9,
  0.9500000000000001],
 'mse': [3424.316688213733,
  3408.91531225893,
  3372.6496278100326,
  3345.1496434741894,
  3325.294679467877,
  3311.556250928974,
  3302.673633401725,
  3297.6587339442026,
  3295.741064355809,
  3296.316884705675,
  3298.909605807062,
  3303.1400555275163,
  3308.7042707723226,
  3315.3568399622563,
  3322.8983149039614,
  3331.1656169285875,
  3340.0246620321604,
  3349.3646443486023,
  3359.0935697484424,
  3369.1347399130477]}
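The long alpha values in this output, such as `0.15000000000000002`, are ordinary binary floating-point rounding rather than a problem with the run: they are exactly what multiples of 0.05 evaluate to in Python (assuming, as the values suggest, that the script builds its grid that way). The saved model files use two-decimal names, so formatting an alpha to two decimals is a safer way to match it to a file than comparing floats with `==`:

```python
# Multiples of 0.05 computed in binary floating point are not exactly
# the decimal values they look like.
alphas = [i * 0.05 for i in range(20)]

print(alphas[3])            # 0.15000000000000002
print(alphas[3] == 0.15)    # False
print(f'{alphas[3]:.2f}')   # 0.15

# Two-decimal formatting reproduces the naming of the saved model files.
model_files = [f'outputs/ridge_{a:.2f}.pkl' for a in alphas]
print(model_files[8])       # outputs/ridge_0.40.pkl
```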

In [21]:
# find the index where MSE is the smallest
indices = list(range(0, len(metrics['mse'])))
min_mse_index = min(indices, key=lambda x: metrics['mse'][x])

print('When alpha is {1:0.2f}, we have min MSE {0:0.2f}.'.format(
    metrics['mse'][min_mse_index], 
    metrics['alpha'][min_mse_index]
))

When alpha is 0.40, we have min MSE 3295.74.
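The index-based search above works; an equivalent and somewhat more direct approach is to pair each alpha with its MSE and keep the pair with the smallest MSE. A minimal sketch using a few of the values logged above, stored under a separate name `sample_metrics` so the real `metrics` dict is left untouched (the full dict from `run.get_metrics()` works the same way):

```python
# A subset of the logged metrics, in the same shape run.get_metrics() returns:
# one list of values per metric name.
sample_metrics = {
    'alpha': [0.35000000000000003, 0.4, 0.45],
    'mse': [3297.6587339442026, 3295.741064355809, 3296.316884705675],
}

# Pair each alpha with its MSE and keep the pair with the smallest MSE.
best_alpha, best_mse = min(zip(sample_metrics['alpha'], sample_metrics['mse']),
                           key=lambda pair: pair[1])

print('When alpha is {:0.2f}, we have min MSE {:0.2f}.'.format(best_alpha, best_mse))
```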


In [22]:
run.get_file_names()

['azureml-logs/60_control_log.txt',
 'azureml-logs/80_driver_log.txt',
 'outputs/ridge_0.85.pkl',
 'outputs/ridge_0.40.pkl',
 'outputs/ridge_0.05.pkl',
 'outputs/ridge_0.30.pkl',
 'outputs/ridge_0.55.pkl',
 'outputs/ridge_0.25.pkl',
 'outputs/ridge_0.20.pkl',
 'outputs/ridge_0.45.pkl',
 'outputs/ridge_0.90.pkl',
 'outputs/ridge_0.15.pkl',
 'outputs/ridge_0.80.pkl',
 'outputs/ridge_0.35.pkl',
 'outputs/ridge_0.50.pkl',
 'outputs/ridge_0.95.pkl',
 'outputs/ridge_0.60.pkl',
 'outputs/ridge_0.10.pkl',
 'outputs/ridge_0.65.pkl',
 'outputs/ridge_0.70.pkl',
 'outputs/ridge_0.75.pkl',
 'outputs/ridge_0.00.pkl',
 'driver_log',
 'azureml-logs/azureml.log']

We know from the earlier queries that ridge_0.40.pkl is the best-performing model, so we can register it with the workspace or download it locally.
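Instead of reading the winning file name off the listing by eye, you can build it from the best alpha and confirm that the run actually produced it. A sketch, assuming the two-decimal `outputs/ridge_<alpha>.pkl` naming seen in the file list above; `find_model_file` is a hypothetical helper, and `sample_file_names` stands in for `run.get_file_names()`:

```python
def find_model_file(file_names, alpha):
    """Hypothetical helper: return the run file saved for `alpha`, or None.

    Assumes models are saved as outputs/ridge_<alpha to 2 decimals>.pkl.
    """
    expected = 'outputs/ridge_{:.2f}.pkl'.format(alpha)
    return expected if expected in file_names else None


# Stand-in for run.get_file_names(); 0.4 is the best alpha found above.
sample_file_names = ['azureml-logs/80_driver_log.txt',
                     'outputs/ridge_0.35.pkl',
                     'outputs/ridge_0.40.pkl',
                     'outputs/ridge_0.45.pkl']
best_model_file = find_model_file(sample_file_names, 0.4)
print(best_model_file)  # outputs/ridge_0.40.pkl
```

Returning `None` instead of a guessed path makes a naming mismatch visible before anything is registered or downloaded.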

In [23]:
# Supply a model name and the path of the serialized model file within the run's outputs.
model = run.register_model(model_name='best_ridge_model', model_path='./outputs/ridge_0.40.pkl')

In [24]:
# Download the best model to a folder two levels above the notebook
model_local_file_path = os.path.join('.', '..', '..', 'ridge_0.40.pkl')
model_local_file_path
run.download_file(name='./outputs/ridge_0.40.pkl', output_file_path=model_local_file_path)

'./../../ridge_0.40.pkl'

In [25]:
# show ridge_0.40.pkl exists locally
!ls -l ./../../

total 304
-rw-r--r-- 1 root root 276011 Nov 11 19:22 00.configuration.html
-rw-rw-r-- 1 1003 1003  15170 Nov 21 01:56 00.configuration.ipynb
drwxrwxr-x 3 1003 1003   4096 Nov 11 21:49 01.getting-started
drwxrwxr-x 3 1003 1003   4096 Nov 14 22:46 10.register-model-create-image-deploy-service
drwxr-xr-x 2 root root   4096 Nov 10 01:29 aml_config
-rw-r--r-- 1 root root    658 Nov 21 02:00 ridge_0.40.pkl
