# Tutorial #1: Train an ML model on Azure Machine Learning service

In this lab, you train a classification machine learning model on remote compute resources. You'll use the training and deployment workflow for Azure Machine Learning service (preview) in a Python Jupyter notebook.

Learn how to:

> * Set up your development environment
> * Access and examine the data
> * Train a simple classification model on a remote cluster
> * Review training results, find and register the best model

## Set up your development environment

All the setup for your development work can be accomplished in a Python notebook.  Setup includes:

* Importing Python packages
* Creating or connecting to a workspace to enable communication between your local computer and remote resources
* Creating an experiment to track all your runs
* Creating a remote compute target to use for training

### Import packages

Import Python packages you need in this session. Also display the Azure Machine Learning SDK version.

In [1]:
import numpy as np

import azureml.core
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.20.0


## Workspace properties

Name | Description
---- | -----------
name            | A name you choose for the workspace.  We'll use the value already in the code.
subscription_id | The ID of the subscription the workspace will be assigned to.  You can get this from the Azure portal.
resource_group  | The name of the resource group that all Azure resources created for the workspace are associated with.  Grouping them makes resource management easier.
location        | The Azure data center location closest to you that supports creation of AMLS workspaces.

### Create the AMLS workspace...

You will be asked to log in to Azure. A code to enter during sign-in appears in the output message area.

In [2]:
from azureml.core import Workspace
ws = Workspace.create(name='docs-ws',
            subscription_id='c958680c-dc7a-403c-bb83-74f48dce46b3', 
            resource_group='docs-aml',
            create_resource_group = True,
            location='West US'
            )

Deploying KeyVault with name docswskeyvault04193cdacb.
Deploying StorageAccount with name docswsstoragef28c3087be0.
Deployed KeyVault with name docswskeyvault04193cdacb. Took 20.82 seconds.
Deploying AppInsights with name docswsinsights74131b2eef.
Deployed AppInsights with name docswsinsights74131b2eef. Took 26.23 seconds.
Deployed StorageAccount with name docswsstoragef28c3087be0. Took 22.79 seconds.
Deploying Workspace with name docs-ws.
Deployed Workspace with name docs-ws. Took 46.8 seconds.


If you already have an Azure Machine Learning workspace, use the cell below instead.

In [None]:
from azureml.core import Workspace
ws = Workspace.get(name='docs-ws',
            subscription_id='c958680c-dc7a-403c-bb83-74f48dce46b3', 
            resource_group='docs-aml')

### Connect to workspace in future work...

In the future, we can use the code below to connect back to this workspace. 
The code creates a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `ws`.  You don't need to use this now since we are still connected from when we created the workspace but this will come in handy later.

In [None]:
# Create the configuration file.
ws.write_config(path='.', file_name='config.json')
print('Configuration saved.')

In [None]:
# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')
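
For orientation, the sketch below mimics the round trip that `write_config` and `from_config` perform, using only the standard library. It assumes that **config.json** is a small JSON document with the keys `subscription_id`, `resource_group`, and `workspace_name`; the helper names `save_config` and `load_config` are hypothetical and are not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values for illustration only; substitute your own workspace details.
# The key names are an assumption about the layout of config.json.
workspace_details = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "docs-aml",
    "workspace_name": "docs-ws",
}

def save_config(details, folder):
    """Write the workspace details to <folder>/config.json and return the path."""
    path = Path(folder) / "config.json"
    path.write_text(json.dumps(details, indent=4))
    return path

def load_config(path):
    """Read the workspace details back from a config.json file."""
    return json.loads(Path(path).read_text())

# Round trip through a temporary folder so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    config_path = save_config(workspace_details, tmp)
    loaded = load_config(config_path)

print(loaded["workspace_name"], loaded["resource_group"], sep="\t")
```

Because the file holds only identifiers and no credentials, you still sign in to Azure when the workspace object is created.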

### Create experiment
Create an experiment to track the runs in your workspace. A workspace can have multiple experiments.

In [3]:
experiment_name = 'sklearn-diabetes'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Create or attach an existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process.

In [4]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-cluster")
# environment variables are strings, so convert the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print("found compute target: " + compute_name)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

creating new compute target...
Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-25T19:36:35.589000+00:00', 'errors': None, 'creationTime': '2021-01-25T19:36:29.730811+00:00', 'modifiedTime': '2021-01-25T19:36:47.379332+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


You now have the necessary packages and compute resources to train a model in the cloud. 

## Verify you have the data

You already explored the data in the last lab.  You need to copy the data into the cloud so it can be accessed by your cloud training environment.  We saved the model training data to a CSV file, so all we have to do is load it.

### Display a few rows of data to make sure the load worked.

In [5]:
import pandas as pd
df_diabetes = pd.read_csv('inputs/diabetes.csv')
df_diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [6]:
# Confirm the column data types one more time...
df_diabetes.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
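
The training script created later in this tutorial depends on the `Outcome` column being present. As a hedged illustration, the check below confirms the expected columns using only the standard library. The inline sample repeats three of the rows displayed above (without the index column), and the helper name `check_columns` is hypothetical.

```python
import csv
import io

# Three of the rows displayed above, without the index column.
sample_csv = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

def check_columns(csv_text, expected):
    """Return (missing column names, row count, distinct label values)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [name for name in expected if name not in reader.fieldnames]
    rows = list(reader)
    labels = {row.get("Outcome") for row in rows}
    return missing, len(rows), labels

missing, row_count, labels = check_columns(sample_csv, EXPECTED_COLUMNS)
print("missing columns:", missing)
print("rows:", row_count, "labels:", sorted(labels))
```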

### Create a directory
Create a directory to hold the code that is delivered from your computer to the remote resource.

In [7]:
import os
script_folder = os.path.join(os.getcwd(), "sklearn-diabetes")
os.makedirs(script_folder, exist_ok=True)

## Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created. 

### About training scripts...

To train a model in an Azure container, we need to get the model training script to the container. We start by saving the model training script to a Python script file, i.e. .py.  This will be uploaded to the Azure container later.  We don't need any exploratory code in this script, just what is needed to train the model.

In [8]:
%%writefile $script_folder/train.py

import os
import joblib
import argparse

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from azureml.core import Run
print('Libraries Imported')

# ***  Azure Machine Learning service specific code starts... ***

# let the user pass in 2 parameters: the location of the data files and the penalty (regularization type) of the logistic regression model
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--penalty', type=str, dest='penalty', default='l2', help='penalty')
args = parser.parse_args()


data_folder = args.data_folder
penalty = args.penalty

print('Data folder:', data_folder)

# get hold of the current run
run = Run.get_context()

# ***  Azure Machine Learning service specific code ends. ***

filepath = os.path.join(data_folder, 'diabetes.csv')

df_diabetes = pd.read_csv(filepath)
# Features data
X0 = df_diabetes.loc[:, df_diabetes.columns != 'Outcome']
# Label data
y = df_diabetes[['Outcome']]

# Scale the data
names = X0.columns
scaler = StandardScaler()
X = scaler.fit_transform(X0)
X = pd.DataFrame(X, columns=names)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, test_size=0.3)

# Fit the model
clf = LogisticRegression(penalty=penalty, random_state=0)
clf.fit(X_train, np.ravel(y_train))
print('Logistic regression model trained.')

# Predict using the test data...
print('Running the test dataset through...')
y_predtest = clf.predict(X_test)
print('Test dataset scored.')

# calculate accuracy on the test data
acc = clf.score(X_test, y_test)
print("accuracy = ", acc * 100, "%")

# ***  Azure Machine Learning service specific code starts... ***
run.log('data_dir', data_folder)
# np.float is deprecated in recent NumPy releases; the built-in float works everywhere
run.log('accuracy', float(acc))

os.makedirs('outputs', exist_ok=True)

# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=clf, filename='outputs/diabetes_model.pkl')
X_validate.to_json('outputs/validation_data.json', orient="split")

# ***  Azure Machine Learning service specific code ends. ***

Overwriting /home/atabordal/azureml-tutorial/diabetes/sklearn-diabetes/train.py
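
To make the two preprocessing steps in `train.py` concrete, here is a standard-library sketch rather than the scikit-learn implementation. It assumes that `StandardScaler` subtracts the mean and divides by the population standard deviation, and it ignores the rounding that `train_test_split` applies to sample counts. The helper names `standardize` and `split_fractions` are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    center = mean(values)
    spread = pstdev(values)
    return [(value - center) / spread for value in values]

def split_fractions(first_test_size, second_test_size):
    """Return the (train, test, validate) fractions left by two chained splits."""
    train = 1.0 - first_test_size
    validate = first_test_size * second_test_size
    test = first_test_size - validate
    return train, test, validate

# Glucose values from the five rows displayed earlier in the tutorial.
glucose = [148, 85, 183, 89, 137]
scaled = standardize(glucose)

# train.py calls train_test_split twice, each time with test_size=0.3.
train, test, validate = split_fractions(0.3, 0.3)

print([round(value, 3) for value in scaled])
print(f"train={train:.2f} test={test:.2f} validate={validate:.2f}")
```

Under these assumptions, roughly 70% of the rows train the model, about 21% are used for the accuracy score, and about 9% are saved as validation data.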


In [10]:
import shutil
# copy the data next to train.py; remove any earlier copy first so the cell can be re-run
shutil.rmtree(script_folder + '/inputs', ignore_errors=True)
shutil.copytree('./inputs', script_folder + '/inputs')

Notice how the script gets data and saves models:

+ The training script reads an argument to find the directory containing the data.  When you submit the job later, you pass the `inputs` directory, which is uploaded together with the script, for this argument:
`parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')`


+ The training script saves your model into a directory named outputs. <br/>
`joblib.dump(value=clf, filename='outputs/diabetes_model.pkl')`<br/>
Anything written in this directory is automatically uploaded into your workspace. You'll access your model from this directory later in the tutorial.
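
The argument handling can be tried locally without submitting a job. The sketch below rebuilds the parser from `train.py` and feeds it the same argument list that is used at submission time later in this tutorial; the wrapper function `build_parser` is a hypothetical convenience, not part of the script.

```python
import argparse

def build_parser():
    """Recreate the argument parser defined in train.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
    parser.add_argument('--penalty', type=str, dest='penalty', default='l2', help='penalty')
    return parser

parser = build_parser()

# Same argument list that is passed to the job at submission time.
explicit = parser.parse_args(['--data-folder', 'inputs', '--penalty', 'l2'])

# Leaving out --penalty falls back to the default declared in the parser.
defaulted = parser.parse_args(['--data-folder', 'inputs'])

print(explicit.data_folder, explicit.penalty)
print(defaulted.penalty)
```

Note that `dest='data_folder'` is what turns the dashed flag `--data-folder` into the attribute `args.data_folder`.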

### Configure the training job

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Configure the ScriptRunConfig by specifying:

* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. 
* The compute target.  In this case you will use the AmlCompute you created
* The training script name, train.py
* An environment that contains the libraries needed to run the script
* Arguments required from the training script. 

In this tutorial, the target is AmlCompute. All files in the script folder are uploaded into the cluster nodes for execution. The `--data-folder` argument is set to the `inputs` directory inside the script folder.

First, create the environment that contains: the scikit-learn library, azureml-dataset-runtime required for accessing the dataset, and azureml-defaults which contains the dependencies for logging metrics. The azureml-defaults package also contains the dependencies required for deploying the model as a web service later in part 2 of the tutorial.

Once the environment is defined, register it with the Workspace to re-use it in part 2 of the tutorial.

In [12]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# to install required packages
env = Environment('tutorial-env')
cd = CondaDependencies.create(pip_packages=['azureml-dataset-runtime[pandas,fuse]', 'azureml-defaults'], conda_packages = ['scikit-learn==0.22.1'])

env.python.conda_dependencies = cd

# Register environment to re-use later
env.register(workspace = ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20210104.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "tutorial-env",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-forge"

Then, create the ScriptRunConfig by specifying the training script, compute target and environment.

In [13]:
from azureml.core import ScriptRunConfig

args = ['--data-folder', 'inputs', '--penalty', 'l2']

src = ScriptRunConfig(source_directory=script_folder,
                      script='train.py', 
                      arguments=args,
                      compute_target=compute_target,
                      environment=env)

### Submit the job to the cluster

Run the experiment by submitting the ScriptRunConfig object. You can then navigate to the Azure portal to monitor the run.

In [14]:
run = exp.submit(config=src)
run

Experiment,Id,Type,Status,Details Page,Docs Page
sklearn-diabetes,sklearn-diabetes_1611603657_c3095b50,azureml.scriptrun,Preparing,Link to Azure Machine Learning studio,Link to Documentation


Since the call is asynchronous, it returns a Preparing or Running state as soon as the job is started.


## Monitor a remote run

In total, the first run takes **approximately 10 minutes**. But for subsequent runs, as long as the dependencies in the Azure ML environment don't change, the same image is reused and hence the container start up time is much faster.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the Azure ML environment. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading take **about 5 minutes**.

  This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.


You can check the progress of a running job in multiple ways. This tutorial uses a Jupyter widget as well as a `wait_for_completion` method. 

### Jupyter widget

Watch the progress of the run with a Jupyter widget.  Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [15]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

By the way, if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).

### Get log results upon completion

Model training happens in the background. You can use `wait_for_completion` to block and wait until the model has completed training before running more code. 

In [20]:
# set show_output=True for a verbose log
run.wait_for_completion(show_output=True) 

RunId: sklearn-diabetes_1611603657_c3095b50
Web View: https://ml.azure.com/experiments/sklearn-diabetes/runs/sklearn-diabetes_1611603657_c3095b50?wsid=/subscriptions/c958680c-dc7a-403c-bb83-74f48dce46b3/resourcegroups/docs-aml/workspaces/docs-ws

Streaming azureml-logs/20_image_build_log.txt

2021/01/25 19:41:19 Downloading source code...
2021/01/25 19:41:20 Finished downloading source code
2021/01/25 19:41:20 Creating Docker network: acb_default_network, driver: 'bridge'
2021/01/25 19:41:21 Successfully set up Docker network: acb_default_network
2021/01/25 19:41:21 Setting up Docker configuration...
2021/01/25 19:41:21 Successfully set up Docker configuration
2021/01/25 19:41:21 Logging in to registry: 514746dd62594ad690c43a3793280184.azurecr.io
2021/01/25 19:41:22 Successfully logged into 514746dd62594ad690c43a3793280184.azurecr.io
2021/01/25 19:41:22 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2021/01/25 19:41:22 Scanning 

{'runId': 'sklearn-diabetes_1611603657_c3095b50',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-25T19:51:07.904108Z',
 'endTimeUtc': '2021-01-25T19:53:22.374436Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '05a94487-e537-43c4-8a82-b046a6af587d',
  'azureml.git.repository_uri': 'https://github.com/atabordal/azureml-tutorial.git',
  'mlflow.source.git.repoURL': 'https://github.com/atabordal/azureml-tutorial.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': '40817d5f514240e4b842b74c7f7afc4e7db263b1',
  'mlflow.source.git.commit': '40817d5f514240e4b842b74c7f7afc4e7db263b1',
  'azureml.git.dirty': 'True',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--data-f
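
Conceptually, blocking on a run amounts to polling its status until it reaches a terminal state. The sketch below illustrates that idea with the standard library only; it is not the SDK's implementation of `wait_for_completion`. The function `wait_until_finished`, the set of terminal state names, and the fake status source are all assumptions made for illustration.

```python
import time

# Assumed terminal states; a real run may report others.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def wait_until_finished(get_status, poll_seconds=1.0, timeout_seconds=60.0):
    """Call get_status() repeatedly until it returns a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)

# Hypothetical stand-in for a remote run: yields one status per poll.
fake_statuses = iter(["Preparing", "Running", "Running", "Completed"])
final_status = wait_until_finished(lambda: next(fake_statuses), poll_seconds=0.01)
print(final_status)
```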

### Display run results

You now have a model trained on a remote cluster.  Retrieve all the metrics logged during the run, including the accuracy of the model:

In [21]:
print(run.get_metrics())

{'data_dir': 'inputs', 'accuracy': 0.8385093167701864}
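
If you submit several runs, for example with different `--penalty` values, the same metrics dictionaries can be compared to find the best one. The sketch below assumes you have collected them by run ID. Only the first entry comes from this tutorial; the other two entries and the helper `best_run_id` are hypothetical and exist purely to illustrate the selection.

```python
# Hypothetical metrics keyed by run ID. Only the first entry comes from this
# tutorial; the other two are invented purely to illustrate the selection.
metrics_by_run = {
    "sklearn-diabetes_1611603657_c3095b50": {"data_dir": "inputs", "accuracy": 0.8385093167701864},
    "hypothetical-run-a": {"data_dir": "inputs", "accuracy": 0.79},
    "hypothetical-run-b": {"data_dir": "inputs", "accuracy": 0.81},
}

def best_run_id(metrics, metric_name="accuracy"):
    """Return the ID of the run with the highest value for metric_name."""
    return max(metrics, key=lambda run_id: metrics[run_id][metric_name])

best_id = best_run_id(metrics_by_run)
print(best_id, metrics_by_run[best_id]["accuracy"])
```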


In the next tutorial you will explore this model in more detail.

## Register model

The last step in the training script wrote the file `outputs/diabetes_model.pkl` to a directory named `outputs` on the VM of the cluster where the job ran. `outputs` is a special directory: all content in it is automatically uploaded to your workspace.  This content appears in the run record in the experiment under your workspace, so the model file is now also available in your workspace.

You can see files associated with that run.

In [22]:
print(run.get_file_names())

['azureml-logs/20_image_build_log.txt', 'azureml-logs/55_azureml-execution-tvmps_7f857e056bacb7d0be6b8f4c97cf8d9a06c96bb552b28d6d990e6da1126f209c_d.txt', 'azureml-logs/65_job_prep-tvmps_7f857e056bacb7d0be6b8f4c97cf8d9a06c96bb552b28d6d990e6da1126f209c_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_7f857e056bacb7d0be6b8f4c97cf8d9a06c96bb552b28d6d990e6da1126f209c_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/100_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/diabetes_model.pkl', 'outputs/validation_data.json']
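
The list mixes log files with the artifacts your script produced. A small filter, sketched below with a shortened copy of the list above, picks out only the files under `outputs/`; the helper name `files_under` is hypothetical.

```python
# Shortened copy of the file list printed above.
file_names = [
    'azureml-logs/20_image_build_log.txt',
    'azureml-logs/70_driver_log.txt',
    'azureml-logs/process_info.json',
    'logs/azureml/100_azureml.log',
    'outputs/diabetes_model.pkl',
    'outputs/validation_data.json',
]

def files_under(names, folder):
    """Return the names that live inside the given top-level folder."""
    prefix = folder.rstrip('/') + '/'
    return [name for name in names if name.startswith(prefix)]

output_files = files_under(file_names, 'outputs')
print(output_files)
```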


Register the model in the workspace so that you (or other collaborators) can later query, examine, and deploy this model.

In [25]:
# register model 
model_name = "sklearn_diabetes"
model = run.register_model(model_path="outputs/diabetes_model.pkl",
                        model_name=model_name,
                        tags={"data": "diabetes", "model": "logistic_regression"},
                        description="Logistic regression model to predict diabetes")

print(model.name, model.id, model.version, sep='\t')

sklearn_diabetes	sklearn_diabetes:1	1


Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.