Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Tutorial #1: Train an image classification model with Azure Machine Learning accessing a Data Labeling Dataset export

In this tutorial, you train a machine learning model on remote compute resources. You'll use the training and deployment workflow for Azure Machine Learning service (preview) in a Python Jupyter notebook.  You can then use the notebook as a template to train your own machine learning model with your own data. This tutorial is **part one of a two-part tutorial series**.  

Learn how to:

> * Set up your development environment
> * Access and examine the data
> * Train a simple logistic regression model on a remote cluster
> * Review training results, find and register the best model

## Prerequisites

See prerequisites in the [Azure Machine Learning documentation](https://docs.microsoft.com/azure/machine-learning/service/tutorial-train-models-with-aml#prerequisites).

## Set up your development environment

All the setup for your development work can be accomplished in a Python notebook.  Setup includes:

* Importing Python packages
* Connecting to a workspace to enable communication between your local computer and remote resources
* Creating an experiment to track all your runs
* Creating a remote compute target to use for training

### Import packages

Import Python packages you need in this session. Also display the Azure Machine Learning SDK version.

In [64]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import azureml.core
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.76


### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `ws`.

In [65]:
# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

ailabml	westus2	ai-lab
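
As a rough illustration of what `Workspace.from_config()` expects to find, the sketch below writes and re-reads a minimal **config.json** using only the standard library. The three key names follow the usual layout of this file; the values are placeholders (assumptions, not real credentials), and the file is written to a temporary folder rather than your project folder.

```python
import json
import os
import tempfile

# Hypothetical placeholder values; substitute your own workspace details.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "ai-lab",
    "workspace_name": "ailabml",
}

# Write the file to a temporary folder so the sketch has no side effects.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read it back, as a loader would, and check that the expected keys are present.
with open(config_path) as f:
    loaded = json.load(f)

required = ("subscription_id", "resource_group", "workspace_name")
missing = [key for key in required if key not in loaded]
print("missing keys:", missing)
```

If a key is missing, it is usually easier to download a fresh config.json for your workspace than to edit the file by hand.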


### Create experiment

Create an experiment to track the runs in your workspace. A workspace can have multiple experiments.

In [66]:
experiment_name = 'ai-lab-defect-detection'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

**Creation of compute takes approximately 5 minutes.** If an AmlCompute target with that name already exists in your workspace, the code skips the creation process.

In [67]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-cluster")
# environment variables are strings, so convert the node counts to integers
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print("found compute target: " + compute_name)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

found compute target: cpu-cluster
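
Values read from environment variables are always strings, while the node counts passed to the cluster configuration are meant to be integers. The sketch below shows one way to read such settings safely; `env_int` is a hypothetical helper, not part of the Azure ML SDK, and the variable values are simulated.

```python
import os

def env_int(name, default):
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

# Simulate one value set outside the notebook; environment values are strings.
os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "6"
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)

max_nodes = env_int("AML_COMPUTE_CLUSTER_MAX_NODES", 4)
min_nodes = env_int("AML_COMPUTE_CLUSTER_MIN_NODES", 0)

raw_type = type(os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"]).__name__
print(raw_type, max_nodes, min_nodes)
```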


You now have the necessary packages and compute resources to train a model in the cloud. 

## Explore data

Before you train a model, you need to understand the data that you are using to train it. In this section you learn how to:

* Load the dataset exported from Data Labeling
* Train and evaluate a quick baseline model locally

### Download the Data Labeling exported Dataset

This code retrieves the data as a `TabularDataset` object, which is a subclass of `Dataset`. A `TabularDataset` references images and their labels and confidence. The class provides you with the ability to download or mount the files to your compute by creating a reference to the data source location. Additionally, you register the Dataset to your workspace for easy retrieval during training.

Follow the [how-to](https://aka.ms/azureml/howto/createdatasets) to learn more about Datasets and their usage in the SDK.

In [None]:
# install the azureml-contrib-dataset package
%pip install azureml-contrib-dataset

In [68]:
# ====================================================================================
# This code is just to demo/debug what we will do in the train.py script defined below
# ====================================================================================

#
import argparse
import os
import numpy as np
from tqdm import tqdm

from sklearn import svm, metrics, datasets
from sklearn.utils import Bunch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

from skimage.io import imread
from skimage.transform import resize

from azureml.core import Run, Workspace, Datastore
from azureml.contrib.dataset import Dataset

def load_image_files(dimension=(256, 256)):

    # get dataset
    subscription_id = 'c5ec24ce-9c5f-4da2-bf12-9ca8e9758d60'
    resource_group = 'ai-lab'
    workspace_name = 'ailabml'
    workspace = Workspace(subscription_id, resource_group, workspace_name)
    ds = Dataset.get_by_name(workspace, name='Light Bulbs-2019-12-08 00:35:33')
    df = ds.to_pandas_dataframe()

    # Images
    descr = "Defect Detection Dataset"
    images = []
    flat_data = []
    target = []
    categories = set()
    for i in tqdm(range(df.shape[0])):
        si = df.loc[i].image_url.to_pod()
        if i == 0:
            datastore = Datastore.get(workspace, si['arguments']['datastoreName'])
        categories.add(df.loc[i].label[0])
        datastore.download(target_path='.',prefix=si['resourceIdentifier'],overwrite=True,show_progress=False)
        img = imread(si['resourceIdentifier'], as_gray=True)
        img_resized = resize(img, dimension, anti_aliasing=True, mode='reflect')
        flat_data.append(img_resized.flatten()) 
        images.append(img_resized)
        target.append(df.loc[i].label[0])

    categories = list(categories)
    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)

    return Bunch(data=flat_data,
                 target=target,
                 target_names=categories,
                 images=images,
                 DESCR=descr)

# load train and test set
image_dataset = load_image_files()
# split
X_train, X_test, y_train, y_test = train_test_split(
    image_dataset.data, image_dataset.target, test_size=0.2,random_state=12)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep = '\n')

print('Train a logistic regression model with regularization rate of', 0.5)
clf = LogisticRegression(C=1.0/0.5, solver="liblinear", multi_class="auto", random_state=12)
clf.fit(X_train, y_train)

print('Predict the test set')
y_pred = clf.predict(X_test)

print("Classification report for - \n{}:\n{}\n".format(
    clf, metrics.classification_report(y_test, y_pred)))

100%|██████████| 34/34 [02:57<00:00,  5.22s/it]


(27, 65536)
(27,)
(7, 65536)
(7,)
Train a logistic regression model with regularization rate of 0.5
Predict the test set
Classification report for - 
LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=12, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False):
              precision    recall  f1-score   support

      broken       0.67      0.67      0.67         3
  not_broken       0.75      0.75      0.75         4

   micro avg       0.71      0.71      0.71         7
   macro avg       0.71      0.71      0.71         7
weighted avg       0.71      0.71      0.71         7
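
The shapes printed above follow from simple arithmetic: each 256 x 256 grayscale image is flattened into one row of 65,536 features, and the 34 labeled images are split 80/20. The sketch below reproduces the numbers with the standard library only; `split_sizes` is a hypothetical helper, and it assumes the test count is rounded up, which is consistent with the 27/7 split in the output.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up, which matches the 27/7 split above.
    """
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_features = 256 * 256  # one feature per pixel of a flattened grayscale image
n_train, n_test = split_sizes(34, 0.2)
print((n_train, n_features), (n_test, n_features))
```

With only 7 test images, each misclassification moves accuracy by about 14 percentage points, so the scores in the report should be read as a rough signal rather than a precise estimate.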




## Train on a remote cluster

For this task, submit the job to the remote training cluster you set up earlier.  To submit a job you:
* Create a directory
* Create a training script
* Create an estimator object
* Submit the job 

### Create a directory

Create a directory to deliver the necessary code from your computer to the remote resource.

In [53]:
import os
script_folder = os.path.join(os.getcwd(), "scripts")
os.makedirs(script_folder, exist_ok=True)

### Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created. 

In [54]:
%%writefile $script_folder/train.py

#
import argparse
import os
import numpy as np
from tqdm import tqdm

from sklearn import svm, metrics, datasets
from sklearn.utils import Bunch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

from skimage.io import imread
from skimage.transform import resize

from azureml.core import Run, Workspace, Datastore
from azureml.contrib.dataset import Dataset

def load_image_files(dimension=(256, 256)):
    
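    # get dataset (online run)
    run = Run.get_context()
    # the workspace is needed below to look up the datastore that holds the images
    workspace = run.experiment.workspace
    ds = run.input_datasets['Light Bulbs-2019-12-08 00:35:33']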
    
    # get dataset (offline run)
    #subscription_id = 'c5ec24ce-9c5f-4da2-bf12-9ca8e9758d60'
    #resource_group = 'ai-lab'
    #workspace_name = 'ailabml'
    #workspace = Workspace(subscription_id, resource_group, workspace_name)
    #ds = Dataset.get_by_name(workspace, name='Light Bulbs-2019-12-08 00:35:33')
    
    # dataset to dataframe
    df = ds.to_pandas_dataframe()

    # Images
    descr = "Defect Detection Dataset"
    images = []
    flat_data = []
    target = []
    categories = set()
    for i in tqdm(range(df.shape[0])):
        si = df.loc[i].image_url.to_pod()
        if i == 0:
            datastore = Datastore.get(workspace, si['arguments']['datastoreName'])
        categories.add(df.loc[i].label[0])
        datastore.download(target_path='.',prefix=si['resourceIdentifier'],overwrite=True,show_progress=False)
        img = imread(si['resourceIdentifier'], as_gray=True)
        img_resized = resize(img, dimension, anti_aliasing=True, mode='reflect')
        flat_data.append(img_resized.flatten()) 
        images.append(img_resized)
        target.append(df.loc[i].label[0])

    categories = list(categories)
    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)

    return Bunch(data=flat_data,
                 target=target,
                 target_names=categories,
                 images=images,
                 DESCR=descr)

# let the user pass in the regularization rate of the logistic regression model
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg', default=0.5, help='regularization rate')
args = parser.parse_args()

# load train and test set
image_dataset = load_image_files()
# split
X_train, X_test, y_train, y_test = train_test_split(
    image_dataset.data, image_dataset.target, test_size=0.2,random_state=12)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep = '\n')

# get hold of the current run
run = Run.get_context()

print('Train a logistic regression model with regularization rate of', args.reg)
clf = LogisticRegression(C=1.0/args.reg, solver="liblinear", multi_class="auto", random_state=12)
clf.fit(X_train, y_train)

print('Predict the test set')
y_pred = clf.predict(X_test)

print("Classification report for - \n{}:\n{}\n".format(
    clf, metrics.classification_report(y_test, y_pred)))

# calculate accuracy on the prediction
acc = np.average(y_pred == y_test)
print('Accuracy is', acc)

# np.float was removed from recent NumPy releases; the built-in float is equivalent here
run.log('regularization rate', float(args.reg))
run.log('accuracy', float(acc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=clf, filename='outputs/ai_lab_defect_detection_model.pkl')

Overwriting /mnt/azmnt/code/Users/aldelar/ai-lab/scripts/train.py


Notice how the script gets data and saves models:

+ The training script looks up the dataset by name in the run's input datasets instead of reading a data folder argument:
`ds = run.input_datasets['Light Bulbs-2019-12-08 00:35:33']`


+ The training script saves your model into a directory named `outputs`. <br/>
`joblib.dump(value=clf, filename='outputs/ai_lab_defect_detection_model.pkl')`<br/>
Anything written in this directory is automatically uploaded into your workspace. You'll access your model from this directory later in the tutorial.
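
The script's only command-line argument is `--regularization`, which it converts to scikit-learn's `C` parameter as `C = 1.0 / reg`. The sketch below isolates that step so you can see how the rate maps to `C`. It parses explicit argument lists instead of the real command line, and the value `0.1` is an arbitrary example, not a recommended setting.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg', default=0.5,
                    help='regularization rate')

# Parse explicit argument lists instead of sys.argv so the sketch runs anywhere.
default_args = parser.parse_args([])
custom_args = parser.parse_args(['--regularization', '0.1'])

for args in (default_args, custom_args):
    # C is the inverse of the rate: a larger rate gives a smaller C,
    # which means stronger regularization in LogisticRegression.
    print('reg =', args.reg, '-> C =', 1.0 / args.reg)
```

When submitting the job, a non-default rate would typically be supplied through the estimator's `script_params` dictionary, for example `{'--regularization': 0.1}`.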

This training script is self-contained and doesn't import helper modules such as `utils.py`, so `train.py` is the only file you need in the script folder.

### Create an estimator

An estimator object is used to submit the run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic `Estimator`. Create an `SKLearn` estimator for the scikit-learn model by specifying:

* The name of the estimator object, `est`
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. 
* The compute target.  In this case you will use the AmlCompute you created
* The training script name, train.py
* Parameters required from the training script 

In this tutorial, the target is AmlCompute. `script_params` is left empty, so `train.py` runs with its default regularization rate of 0.5.

In [55]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# to install required packages
env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=[
    'azureml-contrib-dataset',
    'azureml-core',
    'azureml-dataprep[pandas,fuse]',
    'azureml-dataprep-native',
    'numpy',
    'pandas',
    'scikit-image',
    'scikit-learn',
    'tqdm'
])

env.python.conda_dependencies = cd

In [56]:
from azureml.train.sklearn import SKLearn

script_params = {}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              compute_target=compute_target,
              environment_definition=env,
              entry_script='train.py')



### Submit the job to the cluster

Run the experiment by submitting the estimator object. You can navigate to the Azure portal to monitor the run.

In [57]:
run = exp.submit(config=est)
run

Experiment,Id,Type,Status,Details Page,Docs Page
ai-lab-defect-detection,ai-lab-defect-detection_1575935091_504b22a6,azureml.scriptrun,Starting,Link to Azure Machine Learning studio,Link to Documentation


Because the call is asynchronous, it returns immediately with an early state such as **Starting**, **Preparing**, or **Running**.

## Monitor a remote run

In total, the first run takes **approximately 10 minutes**. For subsequent runs, as long as the dependencies (the `pip_packages` list in the environment defined above) don't change, the same image is reused, so container startup is much faster.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the estimator. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**. 

  This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.
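
The stages above can be pictured as a sequence of run states that a client polls until a terminal state is reached. The sketch below simulates that loop without contacting Azure. The status feed is made up for illustration, `wait_until_done` is a hypothetical helper, and the mapping of states to stages (roughly, **Preparing** for image creation and **Queued** for scaling) is an approximation.

```python
import itertools
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def wait_until_done(get_status, poll_seconds=0.0):
    """Poll get_status() until it returns a terminal state; return the states seen."""
    seen = []
    while True:
        status = get_status()
        # Record a state only when it changes, so repeated polls don't add noise.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL_STATES:
            return seen
        time.sleep(poll_seconds)

# Simulated status feed standing in for a real run (hypothetical sequence).
feed = itertools.chain(
    ["Preparing", "Preparing", "Queued", "Running", "Running"],
    itertools.repeat("Completed"),
)
history = wait_until_done(lambda: next(feed))
print(" -> ".join(history))
```

In the notebook itself you don't need such a loop: the widget and `wait_for_completion` described next handle the waiting for you.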


You can check the progress of a running job in multiple ways. This tutorial uses a Jupyter widget as well as a `wait_for_completion` method. 

### Jupyter widget

Watch the progress of the run with a Jupyter widget.  Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [58]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'NOTSET',…

By the way, if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).

### Get log results upon completion

Model training happens in the background. You can use `wait_for_completion` to block and wait until the model has completed training before running more code. 

In [42]:
# specify show_output to True for a verbose log
run.wait_for_completion(show_output=True)

RunId: ai-lab-defect-detection_1575932264_871d14da
Web View: https://ml.azure.com/experiments/ai-lab-defect-detection/runs/ai-lab-defect-detection_1575932264_871d14da?wsid=/subscriptions/c5ec24ce-9c5f-4da2-bf12-9ca8e9758d60/resourcegroups/ai-lab/workspaces/ailabml

Streaming azureml-logs/55_azureml-execution-tvmps_39d9b49377e7c2f7513f5ce018114dd8d0f090f1f1e67efbee48e059809e9e8e_d.txt

2019-12-09T22:57:59Z Starting output-watcher...
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_d52ec22a82c6605a7e3742b0f8974fc7
Digest: sha256:dfde58f8b2c9e723ae24827302ac963776c6b0320bf10d83c5a8e5696e133be6
Status: Image is up to date for ailabml1b51bd50.azurecr.io/azureml/azureml_d52ec22a82c6605a7e3742b0f8974fc7:latest
162dd4ef4b846bc8a9dbd8d9c1cf7dc133fc343e67c48ae7ded1af606d3e6888
2019/12/09 22:58:02 Version: 3.0.01059.0002 Branch: master Commit: e8f402a4
2019/12/09 22:58:02 sshd runtime has already been installed in the container
ssh-keygen: /azureml-envs/azureml_5c493

{'runId': 'ai-lab-defect-detection_1575932264_871d14da',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2019-12-09T22:58:12.11497Z',
 'endTimeUtc': '2019-12-09T23:04:05.514823Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'a769a3cc-b42c-409c-83cb-ce3c2e316dec',
  'AzureML.DerivedImageName': 'azureml/azureml_d52ec22a82c6605a7e3742b0f8974fc7',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '9fe4e0e6-c822-4ab9-8d87-9d6623a99715'}, 'consumptionDetails': {'type': 'Reference'}}],
 'runDefinition': {'script': 'train.py',
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'cpu-cluster',
  'dataReferences': {},
  'data': {},
  'jobName': None,
  'maxRunDurationSeconds': None,
  'nodeCount': 1,
  'environment': {'name': 'my_env',
   'version': 'Autosave_2019-12-09T2

### Display run results

You now have a model trained on a remote cluster.  Retrieve all the metrics logged during the run, including the accuracy of the model:

In [43]:
print(run.get_metrics())

{'regularization rate': 0.5, 'accuracy': 0.7142857142857143}
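
The accuracy metric is the fraction of test images whose predicted label matches the true label, which is what `np.average(y_pred == y_test)` computes in the training script. The sketch below reproduces the logged value with plain Python. The label vectors are hypothetical, chosen only to be consistent with 5 correct predictions out of 7 test images; they are not the actual predictions from the run.

```python
# Hypothetical label vectors: 2 of 3 'broken' and 3 of 4 'not_broken'
# test images classified correctly, i.e. 5 of 7 overall.
y_test = ['broken', 'broken', 'broken',
          'not_broken', 'not_broken', 'not_broken', 'not_broken']
y_pred = ['broken', 'broken', 'not_broken',
          'not_broken', 'not_broken', 'not_broken', 'broken']

# Count the positions where prediction and truth agree.
correct = sum(p == t for p, t in zip(y_pred, y_test))
accuracy = correct / len(y_test)
print(correct, 'of', len(y_test), 'correct, accuracy =', accuracy)
```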


In the next tutorial you will explore this model in more detail.

## Register model

The last step in the training script wrote the file `outputs/ai_lab_defect_detection_model.pkl` to a directory named `outputs` on the VM of the cluster where the job ran. `outputs` is a special directory: all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace, so the model file is now also available in your workspace.

You can see files associated with that run.

In [44]:
print(run.get_file_names())

['azureml-logs/55_azureml-execution-tvmps_39d9b49377e7c2f7513f5ce018114dd8d0f090f1f1e67efbee48e059809e9e8e_d.txt', 'azureml-logs/65_job_prep-tvmps_39d9b49377e7c2f7513f5ce018114dd8d0f090f1f1e67efbee48e059809e9e8e_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_39d9b49377e7c2f7513f5ce018114dd8d0f090f1f1e67efbee48e059809e9e8e_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/140_azureml.log', 'logs/azureml/azureml.log', 'outputs/ai_lab_defect_detection_model.pkl']
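
To pick the model file out of a listing like this, you can filter on the `outputs/` prefix. The sketch below does so on a shortened copy of the list above, with the long execution log names omitted for readability.

```python
# A shortened copy of the file list printed above.
file_names = [
    'azureml-logs/70_driver_log.txt',
    'azureml-logs/process_info.json',
    'logs/azureml/azureml.log',
    'outputs/ai_lab_defect_detection_model.pkl',
]

# Keep only the files the training script wrote to the outputs directory.
model_files = [name for name in file_names if name.startswith('outputs/')]
print(model_files)
```

The path found this way is the value to pass as `model_path` when registering the model.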


Register the model in the workspace so that you (or other collaborators) can later query, examine, and deploy this model.

In [45]:
# register model 
model = run.register_model(model_name='ai_lab_defect_detection', model_path='outputs/ai_lab_defect_detection_model.pkl')
print(model.name, model.id, model.version, sep='\t')

ai_lab_defect_detection	ai_lab_defect_detection:4	4
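
The printed model id appears to combine the model name and version as `name:version`. The version shown here is 4, which suggests that earlier registrations under the same name already exist. As a small sketch under that assumption, the id can be split back into its parts:

```python
model_id = 'ai_lab_defect_detection:4'  # value printed above

# Split on the last colon so the version is separated from the name.
name, version_text = model_id.rsplit(':', 1)
version = int(version_text)
print(name, version)
```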


## Next steps

In this Azure Machine Learning tutorial, you used Python to:

> * Set up your development environment
> * Access and examine the data
> * Train a logistic regression model on a remote cluster using the popular scikit-learn machine learning library
> * Review training details and register the best model

You are ready to deploy this registered model using the instructions in the next part of the tutorial series:

> [Tutorial 2 - Deploy models](img-classification-part2-deploy.ipynb)
