Author: Kevin ALBERT  

Created: April 2020  

# Azure Machine Learning
_**Classification project with data residing on a data lake gen2 using remote compute with autoML and customML**_

## Contents
1. [AutoML](#AutoML)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Register](#Register)
1. [Deploy](#Deploy)
1. [Test](#Test)
1. [CustomML](#CustomML)
1. [Finetuning](#Finetuning)
1. [Pipelines](#Pipelines)

## Introduction

Data Factory writes the cleaned datasets to a Delta Lake on Data Lake Gen2.  
This notebook reads that data and uses remote compute to train a classification model with autoML.  
The example data is used to classify a patient as diabetic or non-diabetic based on 8 features.  

This notebook shows how to:
1. Setup packages
1. Setup workspace
1. Create an experiment
1. Load data
1. Setup compute
1. Configure autoML
1. Train pipelines
1. Explore the best pipeline
1. Inspect model properties
1. Register the model
1. Deploy model as webservice
1. Webservice inference test
1. customML inline method
1. customML script method
1. HyperParametertuning
1. Pipelines endpoint

## Setup

* required
  * **disable shield on Brave** web browser for the widgets to work
  * download **config.json** from the machine learning workspace portal
  * install extra azureml packages on **py37_default** when using **'local'** compute  
  * split the data into train and test datasets on the data lake; a validation dataset is not needed because cross-validation is used
* optional
  * register datastore(s) manually
  * register dataset(s) manually
  * register compute cluster(s) manually
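
The downloaded **config.json** is what `Workspace.from_config()` reads later in this notebook. A minimal sketch, using only the standard library, of checking that file before starting. The helper names `missing_config_keys` and `load_config` are illustrative and not part of the SDK; the three key names are the ones the portal download normally contains.

```python
import json
from pathlib import Path

# Keys that the portal's config.json normally contains.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in a parsed config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def load_config(path="config.json"):
    """Read config.json and fail early with a readable message."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_config_keys(config)
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")
    return config
```

Calling `load_config()` in the first cell fails early with a readable message when a key is missing.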

In [None]:
# update local
# ! /anaconda/envs/py37_default/bin/python -m pip install --upgrade --upgrade-strategy eager azureml-sdk[explain,automl] azureml-widgets

In [None]:
# update virtual env (will break it - do not use)
# ! pip install --upgrade --upgrade-strategy eager azureml-sdk[explain,automl] azureml-widgets

### Import open-source packages

In [1]:
import logging
import os
import pandas as pd
import numpy as np
import json
import requests
import joblib

### Import azure machine learning SDK packages

In [2]:
from azureml.core import Workspace, Dataset, Datastore, Run
from azureml.core.experiment import Experiment
from azureml.data.datapath import DataPath
from azureml.core.compute import ComputeTarget, AmlCompute, AksCompute
from azureml.core.model import Model, InferenceConfig
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.widgets import RunDetails
from azureml.core.webservice import Webservice, AciWebservice, AksWebservice
from azureml.exceptions import WebserviceException
from azureml.core.environment import Environment
from azureml.train.estimator import Estimator
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling, GridParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import PipelineData, Pipeline
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.pipeline.core.run import PipelineRun
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.explain.model._internal.explanation_client import ExplanationClient
import azureml.core
print("azureml.core version:", azureml.core.__version__)

azureml.core version: 1.16.0


In [3]:
!pip list |grep -i azureml

azureml-accel-models                  1.8.0
azureml-automl-core                   1.16.0
azureml-automl-runtime                1.16.0
azureml-cli-common                    1.8.0
azureml-contrib-dataset               1.8.0
azureml-contrib-fairness              1.8.0
azureml-contrib-gbdt                  1.8.0
azureml-contrib-interpret             1.8.0
azureml-contrib-notebook              1.8.0
azureml-contrib-pipeline-steps        1.8.0
azureml-contrib-reinforcementlearning 1.8.0
azureml-contrib-server                1.8.0
azureml-contrib-services              1.8.0
azureml-core                          1.16.0.post1
azureml-datadrift                     1.8.0
azureml-dataprep                      2.3.3
azureml-dataprep-native               23.0.0
azureml-dataprep-rslex                1.1.2
azureml-dataset-runtime               1.16.0
azureml-defaults                      1.16.0
azureml-explain-model                 1.16.0
azureml-interpret                     1.16

### Workspace

In [4]:
# load the workspace
ws = Workspace.from_config()

### Experiment

In [5]:
# choose an experiment name
experiment = Experiment(ws, 'automl-classification')

### Data

Data Factory has prepared the data from /bronze through /silver and /gold to /platinum for model training.  
**Note:** in this demonstration the files are in the /platinum folder of the datalake container on Data Lake Gen2  
  * /datalake/platinum/diabetes.csv
  * /datalake/platinum/diabetes.parquet
  * copy them from ../data/platinum/*

Register the datastore 'data lake gen2' as a **blob container**  
**optional:** manually register in ML workspace

In [6]:
ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="datalakestoragegen2",
    container_name="datalake",
    account_name="datalake15102020",
    account_key=os.environ["DATALAKE_ACCOUNT_KEY"], # hypothetical variable name; keep the storage key out of the notebook
    create_if_not_exists=False)
# list available datastores
ws.datastores

{'datalakestoragegen2': {
   "name": "datalakestoragegen2",
   "container_name": "datalake",
   "account_name": "datalake15102020",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspacefilestore': {
   "name": "workspacefilestore",
   "container_name": "azureml-filestore-d4c83106-e1e1-4e75-83ee-252b1c859da0",
   "account_name": "machinelstoragece02c4ea7",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspaceblobstore': {
   "name": "workspaceblobstore",
   "container_name": "azureml-blobstore-d4c83106-e1e1-4e75-83ee-252b1c859da0",
   "account_name": "machinelstoragece02c4ea7",
   "protocol": "https",
   "endpoint": "core.windows.net"
 }}

Register file(s) as a tabular dataset  
**Note:** do not import Delta Lake parquet file(s) into a tabular dataset  
**Fix:** import single gold/*.csv or gold/*.parquet file(s), as written by pandas, instead  

In [None]:
# load datastore
ds = Datastore.get(ws, 'datalakestoragegen2')
# show datastore settings
ds

**Option 1 Tabular:** loading *.parquet

In [7]:
# setup parquet file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet}
dataset = Dataset.Tabular.from_parquet_files(path=ds_path)
# show dataset settings
dataset

{
  "source": [
    "('datalakestoragegen2', 'platinum/diabetes.parquet')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ]
}

**Option 2 Tabular:** loading *.csv

In [None]:
# setup csv file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=ds_path)
# show dataset settings
dataset

**Option 3 Registered:** loading a registered dataset (manually register in ML workspace)

In [None]:
# list available datasets
ws.datasets

In [None]:
# load a registered dataset
dataset = Dataset.get_by_name(ws, 'diabetes_parquet_from_datastore_datalakegen2')
# show dataset settings
dataset

### Compute

Check the available VM size **names** before creating an auto-scaling cluster

In [8]:
# example: list all sizes with 1 vCPU, at least 2 GB of memory and no GPU
vm_df = pd.DataFrame(AmlCompute.supported_vmsizes(ws))
vm_df[(vm_df.vCPUs == 1) & (vm_df.memoryGB >= 2) & (vm_df.gpus == 0)]

Unnamed: 0,name,vCPUs,gpus,memoryGB,maxResourceVolumeMB
0,Standard_D1_v2,1,0,3.5,51200
14,Standard_DS1_v2,1,0,3.5,7168
68,Standard_D1,1,0,3.5,51200


**Option 1:** create a training cluster  

In [8]:
# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'
# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D1_v2',
                                                       min_nodes=0, # you are not paying if not using
                                                       max_nodes=10, # depending quota limits
                                                       vm_priority='dedicated', # {lowpriority, dedicated}
                                                       admin_username='ubuntu',
                                                       admin_user_password='ABCD1234abcd',
                                                       idle_seconds_before_scaledown=120, # {default: 120}
                                                      )
# Create the compute
training_cluster = ComputeTarget.create(ws, compute_name, compute_config)
training_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


**Option 2:** load an existing training cluster

In [9]:
# list all available training cluster(s):
for cluster in ws.compute_targets:
    print(cluster)

aml-cluster


In [None]:
# load the training cluster
compute_name = 'aml-cluster'
training_cluster = ComputeTarget(ws, name=compute_name)

## Train

### Configure autoML
Define settings to run the experiment.

|Property|Description|Options|
|-|-|-|
|**task**||<i>classification</i><br><i>regression</i><br><i>forecasting</i>|
|**compute_target**|execution on local DSVM serialized<br>execution on remote AML or AKS parallel|<i>local</i><br><i>training_cluster</i>|
|**primary_metric**|the metric you want to optimize<br>[metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml)|**classification:**<br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i><br><br>**regression:**<br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**training_data**|input dataset, containing both X_train and y_train|<i>DataFrame</i><br><i>Dataset</i><br><i>DatasetDefinition</i><br><i>TabularDataset</i>|
|**validation_data**|input dataset, covered with cross validation|N/A|
|**label_column_name**|the name of the 'target' or 'label' column||
|**enable_early_stopping**|stop the run if metric score is not improving|<i>True</i><br><i>False</i>|
|**n_cross_validations**|number of cross validation splits|5|
|**experiment_timeout_hours**|maximum time in hours before the experiment terminates (+15min)|<i>0.25</i>|
|**max_concurrent_iterations**|less than or equal to the number of cores per node|2|



**_You can find more information_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train)
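
With **n_cross_validations** set to 5, every row is used for validation exactly once, which is why no separate validation dataset is passed. A minimal sketch of that idea with plain contiguous folds. This is not AutoML's internal implementation (which may shuffle or stratify the rows), and the function name `k_fold_indices` is illustrative.

```python
def k_fold_indices(n_rows, n_splits=5):
    """Yield (train_idx, valid_idx) lists for contiguous k-fold splits."""
    if not 2 <= n_splits <= n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, valid_idx
        start = stop


folds = list(k_fold_indices(n_rows=10, n_splits=5))
for train_idx, valid_idx in folds:
    print(len(train_idx), "train rows, validation rows:", valid_idx)
```

Each model is trained 5 times, once per fold, and the reported metric is an aggregate over the folds.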

In [9]:
automl_settings = {
    "enable_early_stopping":True,
    "experiment_timeout_hours":0.25,
    "iterations":10, # number of runs
    "iteration_timeout_minutes":5,
    "max_concurrent_iterations":1,
    "max_cores_per_iteration":-1,
    #"experiment_exit_score":0.9920,
    "model_explainability":True,
    "n_cross_validations":5,
    "primary_metric":'AUC_weighted',
    "featurization":'auto',
    "verbosity":logging.INFO, # {INFO, DEBUG, CRITICAL, ERROR, WARNING} -- debug_log=<*.log>
}

automl_config = AutoMLConfig(task='classification',
                             debug_log='automl_errors.log',
                             compute_target='local', # {training_cluster or 'local'}
                             #blacklist_models=['KNN','LinearSVM'],
                             enable_onnx_compatible_models=True,
                             training_data=dataset,
                             label_column_name="Diabetic",
                             **automl_settings
                            )
# outputs "model.pkl" and "automl_errors.log"

### Train pipelines

In [11]:
automl_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_afb6fbc1-05a7-427a-9768-e30de3d147a4

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Classes are balanced in the training data.

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.

******************************************************************************************



Current status: BestRunExplainModel. Best run model explanations started
Current status: ModelExplanationDataSetSetup. Model explanations data setup completed
Current status: EngineeredFeatureExplanations. Computation of engineered features started
Current status: EngineeredFeatureExplanations. Computation of engineered features completed
Current status: BestRunExplainModel. Best run model explanations completed
****************************************************************************************************


### Optional: retrieve a run

In [None]:
runId = 'AutoML_afb6fbc1-05a7-427a-9768-e30de3d147a4'
automl_run = AutoMLRun(experiment, run_id=runId)

## Results

### Explore the best pipeline

In [12]:
RunDetails(automl_run).show()
automl_run.wait_for_completion() # get more parameter info

{'runId': 'AutoML_afb6fbc1-05a7-427a-9768-e30de3d147a4',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2020-04-20T15:17:33.469605Z',
 'endTimeUtc': '2020-04-20T15:21:04.102187Z',
 'properties': {'num_iterations': '10',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'local',
  'DataPrepJsonString': None,
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.3.0", "azureml-train": "1.3.0", "azureml-train-restclients-hyperdrive": "1.3.0", "azureml-train-core": "1.3.0", "azureml-train-automl": "1.3.0", "azureml-train-automl-runtime": "1.3.0", "azureml-train-automl-client": "1.3.0", "azureml-tensorboard": "1.0.85", "azureml-telemetry": "1.3.0", "azureml-sdk": "1.3.0", "azureml-pipeline": "1.3.0", "az

**option 1:** select any pipeline iteration 

In [None]:
best_run, fitted_model = automl_run.get_output(iteration=2)

**option 2:** select best pipeline iteration automatically

In [13]:
best_run, fitted_model = automl_run.get_output()

### Inspect model properties

In [14]:
# pipeline steps
for step in fitted_model.named_steps:
    print(step)

datatransformer
MaxAbsScaler
LightGBMClassifier


In [15]:
# model properties
fitted_model.named_steps

{'datatransformer': DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
         feature_sweeping_config=None, feature_sweeping_timeout=None,
         featurization_config=None, force_text_dnn=None,
         is_cross_validation=None, is_onnx_compatible=None, logger=None,
         observer=None, task=None, working_dir=None),
 'MaxAbsScaler': MaxAbsScaler(copy=True),
 'LightGBMClassifier': LightGBMClassifier(boosting_type='gbdt', class_weight=None,
           colsample_bytree=1.0, importance_type='split', learning_rate=0.1,
           max_depth=-1, min_child_samples=20, min_child_weight=0.001,
           min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
           objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
           silent=True, subsample=1.0, subsample_for_bin=200000,
           subsample_freq=0, verbose=-10)}
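
The fitted pipeline above chains a data transformer, a `MaxAbsScaler` and the classifier. As a rough illustration of the middle step only, max-abs scaling divides each column by its largest absolute value, so every feature ends up in the range [-1, 1]. The sketch below uses plain lists and made-up numbers; it is not the scikit-learn implementation.

```python
def max_abs_scale(rows):
    """Scale each column of a list-of-lists by its maximum absolute value."""
    n_cols = len(rows[0])
    scales = []
    for col in range(n_cols):
        max_abs = max(abs(row[col]) for row in rows)
        # An all-zero column is left unchanged instead of dividing by zero.
        scales.append(max_abs if max_abs != 0 else 1.0)
    return [[row[col] / scales[col] for col in range(n_cols)] for row in rows]


scaled = max_abs_scale([[1.0, -2.0], [2.0, 4.0]])
print(scaled)  # [[0.5, -0.5], [1.0, 1.0]]
```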

In [16]:
# show all metrics
best_run.get_metrics()

{'recall_score_micro': 0.9516,
 'confusion_matrix': 'aml://artifactId/ExperimentRun/dcid.AutoML_afb6fbc1-05a7-427a-9768-e30de3d147a4_0/confusion_matrix',
 'precision_score_weighted': 0.951529651244859,
 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_afb6fbc1-05a7-427a-9768-e30de3d147a4_0/accuracy_table',
 'log_loss': 0.11810503522790523,
 'AUC_micro': 0.9914617,
 'average_precision_score_micro': 0.9916787674263302,
 'precision_score_micro': 0.9516,
 'average_precision_score_weighted': 0.9907832843118888,
 'average_precision_score_macro': 0.9885533270268901,
 'AUC_macro': 0.9904426831775627,
 'f1_score_micro': 0.9516,
 'weighted_accuracy': 0.9576144671870581,
 'recall_score_weighted': 0.9516,
 'f1_score_macro': 0.9454409908953988,
 'matthews_correlation': 0.8909693671988277,
 'recall_score_macro': 0.9440819974618796,
 'norm_macro_recall': 0.8881639949237587,
 'accuracy': 0.9516,
 'f1_score_weighted': 0.951525872795924,
 'precision_score_macro': 0.9468968210340906,
 'balan
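
Two of the values in the (truncated) output above are directly related: `norm_macro_recall` rescales `recall_score_macro` so that random guessing scores 0 and a perfect model scores 1. For two classes the random baseline is 0.5, so (0.9441 - 0.5) / (1 - 0.5) ≈ 0.8882. A minimal sketch with illustrative function names and toy labels:

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recall values."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def norm_macro_recall(recall_macro, n_classes):
    """Rescale macro recall so that random guessing maps to 0 and perfect to 1."""
    baseline = 1.0 / n_classes
    return (recall_macro - baseline) / (1.0 - baseline)


# Toy labels, made up for illustration.
toy_recall = macro_recall([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
print(toy_recall, norm_macro_recall(toy_recall, n_classes=2))

# The macro recall reported by the run above.
print(norm_macro_recall(0.9440819974618796, n_classes=2))
```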

### Feature Importance

In [None]:
client = ExplanationClient.from_run(best_run)
engineered_explanations = client.download_model_explanation(raw=False)
feature_importance = engineered_explanations.get_feature_importance_dict() # get model feature importance values
columns = ["modelFeatureImportance_name", "modelFeatureImportance_value"]
pd.DataFrame(list(feature_importance.items()), columns=columns)
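
The dictionary returned by `get_feature_importance_dict()` maps feature names to importance values. A small sketch of ranking such a dictionary without pandas; the feature names and numbers below are hypothetical and not taken from the run above.

```python
def top_features(importance, k=3):
    """Return the k (name, value) pairs with the largest importance values."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical values for illustration only.
example = {"Age": 0.21, "BMI": 0.35, "Pregnancies": 0.9, "SerumInsulin": 0.12}
print(top_features(example, k=2))
```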

## Register

### Prepare

autoML has generated a scoring script, an environment file and the model

In [17]:
# get the score and environment files
model_name = best_run.properties['model_name'] # score.py script will look for the name of the registered model

# make a local copy of the best scoring script, environment file and the model file
script_file_name = 'inference/score.py'
conda_env_file_name = 'inference/env.yml'
model_pickle_file_name = 'inference/model.pkl'
model_onnx_file_name = 'inference/model.onnx'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)
best_run.download_file('outputs/conda_env_v_1_0_0.yml', conda_env_file_name)
best_run.download_file('outputs/model.pkl', model_pickle_file_name)
best_run.download_file('outputs/model.onnx', model_onnx_file_name)

In [18]:
! cat inference/env.yml

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - azureml-train-automl-runtime==1.3.0
  - inference-schema
  - azureml-explain-model==1.3.0
  - azureml-defaults==1.3.0
- numpy>=1.16.0,<=1.16.2
- pandas>=0.21.0,<=0.23.4
- scikit-learn>=0.19.0,<=0.20.3
- py-xgboost<=0.90
- fbprophet==0.5
- psutil>=5.2.2,<6.0.0
channels:
- anaconda
- conda-forge


In [19]:
! cat inference/score.py

# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
import json
import pickle
import numpy as np
import pandas as pd
import azureml.train.automl
from sklearn.externals import joblib
from azureml.core.model import Model

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType


input_sample = pd.DataFrame({'PatientID': pd.Series(['1354778'], dtype='int64'), 'Pregnancies': pd.Series(['0'], dtype='int64'), 'PlasmaGlucose': pd.Series(['171'], dtype='int64'), 'DiastolicBloodPressure': pd.Series(['80'], dtype='int64'), 'TricepsThickness': pd.Series(['34'], dtype='int64'), 'SerumInsulin': pd.Series(['23'], dtype='int64'), 'BMI': pd.Series(['43.50972593'], dtype='f
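
The generated script is truncated above. Entry scripts for Azure ML web services are built around two functions: `init()`, called once when the container starts, and `run()`, called for every request. The sketch below shows only that shape. `ThresholdModel` is a made-up stand-in for the unpickled model, and the error handling is an assumption, not a copy of the generated code.

```python
import json

model = None  # set by init() and reused by every call to run()


class ThresholdModel:
    """Stand-in model: flags a record when its third value exceeds 150."""

    def predict(self, rows):
        return [1 if row[2] > 150 else 0 for row in rows]


def init():
    """Called once at start-up; a real script would load model.pkl here."""
    global model
    model = ThresholdModel()


def run(raw_data):
    """Called per request with the JSON body as a string."""
    try:
        rows = json.loads(raw_data)["data"]
        return json.dumps({"result": model.predict(rows)})
    except Exception as exc:  # return the error to the caller instead of crashing
        return json.dumps({"error": str(exc)})


init()
print(run('{"data": [[5, 2, 180, 74, 24, 21, 24, 1.5, 22]]}'))
```

Running the two functions locally like this is a quick way to catch JSON handling mistakes before building the container.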

### Register the model

**Option 1:** from workspace /outputs folder with .register_model()

In [None]:
model = best_run.register_model(model_name=model_name, # registered model name used in scoring script init()
                                model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                                model_framework_version='0.22.2',
                                model_path='outputs/model.pkl', # fixed path in workspace {'model.pkl', 'model.onnx'}
                                tags={'Training context': 'autoML Training'},
                                properties={'AUC': best_run.get_metrics()['AUC_weighted'],
                                            'Accuracy': best_run.get_metrics()['accuracy']},
                                description="Classification model to predict diabetes")

**Option 2:** from local /path/model folder with Model.register()

In [20]:
model = Model.register(workspace=ws,
                       model_name=model_name, # registered model name used in scoring script init()
                       model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                       model_framework_version='0.22.2',
                       model_path='inference/model.pkl', # local file {'model.pkl', 'model.onnx'}
                       tags={'Training context': 'autoML Training'},
                       properties={'AUC': best_run.get_metrics()['AUC_weighted'],
                                   'Accuracy': best_run.get_metrics()['accuracy']},
                       description="Classification model to predict diabetes")

Registering model AutoMLafb6fbc100


**Optional:** Load the model

In [21]:
# list all registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

AutoMLafb6fbc100 version: 1
	 Training context : autoML Training
	 AUC : 0.9904426831775627
	 Accuracy : 0.9516




In [None]:
# load the registered model for deployment (latest version)
model = ws.models[model_name] # or replace with any registered modelname from Model.list(ws)
model

## Deploy

### Deploy model as webservice (ACI)

A Linux Azure Container Instance with 1 vCPU and 1 GB of RAM costs €28 per month

In [22]:
# Configure the scoring environment
service_name = "automl-projname-service" # only lowercase letters, numbers, or dashes

# Remove any existing service under the same name
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    print('"' + service_name + '" does not exist, creating the webservice...')

myenv = Environment.from_conda_specification(name="myenv", file_path=conda_env_file_name)
inference_config = InferenceConfig(entry_script=script_file_name, environment=myenv)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1)

# build the container from the environment, start the ACI webservice and deploy the inference script
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)

"automl-projname-service" does not exist, creating the webservice...
Running................................................................................................................................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


**Optional:** load a running webservice

In [23]:
# list available webservices
for i in ws.webservices:
    print(i)

automl-projname-service


In [24]:
service_name = "automl-projname-service" # only lowercase letters, numbers, or dashes
service = Webservice(ws, service_name)

In [25]:
# get webservice logs
print(service.get_logs())

2020-04-20T15:41:59,762511984+00:00 - iot-server/run 
2020-04-20T15:41:59,761930572+00:00 - gunicorn/run 
2020-04-20T15:41:59,764030716+00:00 - rsyslog/run 
2020-04-20T15:41:59,822608762+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_faa7a7b8ebfe27e4f27f49576cf50f1f/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_faa7a7b8ebfe27e4f27f49576cf50f1f/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_faa7a7b8ebfe27e4f27f49576cf50f1f/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_faa7a7b8ebfe27e4f27f49576cf50f1f/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_faa7a7b8ebfe27e4f27f49576cf50f1f/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

## Test

### Webservice inference test

Send an HTTP request with test data to the model to get a prediction.  
In this example we test whether a person is diabetic (1) or not diabetic (0).  
Each test record must be a list of 9 values (PatientID followed by the 8 features); the model returns a binary classification for each record.  
We demonstrate both the **service** and the **requests** method to send a prediction request.  
The 'Postman' application or the 'Rest Client' plugin in VSCode work as well.  

|Web API|Example value|Options|
|-|-|-|
|**HTTP method**|POST|<i>POST</i><br><i>GET</i>|
|**URI**|http://3bb0618b-ef7b-4b17-af32-a52f9c64f4d5.northeurope.azurecontainer.io/score||
|**Header**|{Content-Type: application/json}||
|**Body**|{"data": [[5, 2, 180, 74, 24, 21, 24, 1.5, 22], <br>[6, 0, 148, 58, 11, 179, 39, 0.16, 45]]}|<i>one or </i><br><i>more records</i>|
|**Response**|{"result": [1, 0]}|<i>json object</i>|

In [26]:
# get webservice URI
endpoint = service.scoring_uri

# raw test data
rawdata = [[5, 2, 180, 74, 24, 21, 24, 1.5, 22],
           [6, 0, 148, 58, 11, 179, 39, 0.16, 45]]

print("URI: " + endpoint)
print("Body: " + json.dumps({"data": rawdata})) # convert array to a serialized JSON formatted string object

URI: http://a839d171-f3dc-43c3-a1f3-d8ff9b64ae9f.westeurope.azurecontainer.io/score
Body: {"data": [[5, 2, 180, 74, 24, 21, 24, 1.5, 22], [6, 0, 148, 58, 11, 179, 39, 0.16, 45]]}
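
A record with a missing or extra value only fails once it reaches the service. A minimal sketch of checking the record length before building the body; `build_body` is an illustrative helper, and the count of 9 values per record is taken from the rawdata example above.

```python
import json

N_VALUES = 9  # values per record, as in the rawdata example


def build_body(records, n_values=N_VALUES):
    """Validate record lengths and return the JSON request body."""
    for i, record in enumerate(records):
        if len(record) != n_values:
            raise ValueError(f"record {i} has {len(record)} values, expected {n_values}")
    return json.dumps({"data": records})


print(build_body([[5, 2, 180, 74, 24, 21, 24, 1.5, 22]]))
```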


**Test 1:** service.run()

In [27]:
service.run(json.dumps({"data": rawdata}))

'{"result": [1, 0]}'

**Test 2:** requests.post()

In [28]:
response = requests.post(endpoint, json={"data": rawdata})
response.json()

'{"result": [1, 0]}'
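
Both tests above return the result as a JSON string rather than as a dictionary, so it needs one more decoding step before the predictions can be used. A small sketch that accepts either form; `parse_predictions` is an illustrative name.

```python
import json


def parse_predictions(payload):
    """Return the list under 'result' from a JSON string or a decoded dict."""
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload["result"]


print(parse_predictions('{"result": [1, 0]}'))
```

With the cells above this would be called as `parse_predictions(response.json())` or `parse_predictions(service.run(...))`.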

When you are finished testing your service, clean up the deployment with service.delete()

In [29]:
service.delete()

## CustomML

Inspired by the autoML results, customML is an alternative way to develop the model.  
Use the inline method to develop and test, train locally or on remote compute, then deploy and test the model.  

1. option 1: inline method
1. option 2: script method
   * create training script
   * create training environment
   * create and register dataset (File)
   * train model
1. create an inference script
1. create an inference environment
1. register the model
1. deploy the model
1. inference test

In [30]:
ws = Workspace.from_config()

### Option 1: Inline method

|log metric function|Description|Example|
|-|-|-|
|**log**|<i>Record a single named value</i>|run.log("accuracy", 0.95)|
|**log_list**|<i>Record a named list of values</i>|run.log_list("accuracies", [0.6, 0.7, 0.87])|
|**log_row**|<i>Record a row with multiple columns</i>|run.log_row("Y over X", x=1, y=0.4)|
|**log_table**|<i>Record a dictionary as a table</i>|run.log_table("Y over X", {"x":[1, 2, 3], "y":[0.6, 0.7, 0.89]})|
|**log_image**|<i>Record an image file or a plot</i>|run.log_image("ROC", plot=plt)|
|**upload_file**|<i>Upload any file to "./outputs"</i>|run.upload_file("best_model.pkl", "./model.pkl")|

https://aka.ms/AA70zf6

In [31]:
from azureml.core import Experiment
from azureml.core import Model
from azureml.core import Datastore
from azureml.core import Dataset
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Create an Azure ML experiment in your workspace
experiment = Experiment(workspace=ws, name="diabetes-training")
run = experiment.start_logging()
print("Starting experiment:", experiment.name)

# load the diabetes dataset (File method)
print("Loading data lake gen2 data in a pandas dataframe...")
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
dataset = Dataset.File.from_files(path=ds_path)
mount_context = dataset.mount(mount_point='/tmp/platinum') # read-only mount from delta lake
mount_context.start()
diabetes = pd.read_parquet('/tmp/platinum/diabetes.parquet') # {'/tmp/path/'} can load latest delta lake parquet files
mount_context.stop()

# load the diabetes dataset (Tabular method)
# print("Loading data lake gen2 data in a pandas dataframe...")
# ds = Datastore.get(ws, 'datalakestoragegen2')
# ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
# dataset = Dataset.Tabular.from_parquet_files(path=ds_path) # {delimited, json, parquet, sql}
# diabetes = dataset.to_pandas_dataframe() # create a pandas dataframe

# Separate features and labels as numpy array
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc)) # built-in float; the np.float alias was removed from recent NumPy releases

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model
model_file = 'diabetes_model.pkl'
joblib.dump(value=model, filename=model_file) # backup model local
run.upload_file(name='outputs/' + model_file,
                path_or_stream='./' + model_file) # save model to workspace

# Complete the run
run.complete()

Starting experiment: diabetes-training
Loading data lake gen2 data in a pandas dataframe...
Training a decision tree model
Accuracy: 0.8873333333333333
AUC: 0.8741181153291208
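
The AUC logged above can be read as the probability that a randomly chosen diabetic record receives a higher score than a randomly chosen non-diabetic one, with ties counting as half. A minimal pure-Python sketch of that definition, for intuition only; it compares every pair, so `roc_auc_score` remains the practical choice. The function name and the toy data are illustrative.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```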


### Option 2: Script method

Create training script

In [32]:
# Create a local folder for the experiment files
folder_name = 'diabetes_service'
experiment_folder = './' + folder_name
os.makedirs(folder_name, exist_ok=True)
print(folder_name, 'folder created')

diabetes_service folder created


In [33]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import argparse
import os
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import glob
print("libraries imported...")

# Set regularization hyperparameter (passed as an argument to the script)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate
print("argparse parameters loaded...")

# Get the experiment run context
run = Run.get_context()
print("run context loaded...")

# load the diabetes dataset (File method)
# Get the training data from the estimator input identified as 'diabetes'
mount = run.input_datasets['diabetes'] # read-only mount from delta lake as '/mnt/data'
print("delta lake mounted...")
diabetes = pd.read_parquet('/mnt/data/diabetes.parquet') # load any file(s) from this delta lake mounted folder
print("dataset loaded...")

# save data into workspace
diabetes.to_csv("outputs/dataset.csv", index=False) # {logs/  outputs/}
print("test: write dataset to workspace 'outputs/dataset.csv'")

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate',  float(reg))
# C is the inverse of the regularization rate
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Writing ./diabetes_service/diabetes_training.py
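
The script receives its hyperparameter on the command line and converts it before training: scikit-learn's `C` is the inverse of the regularization strength, so a rate of 0.1 becomes `C = 10`. The sketch below replays only that argument handling. The explicit list passed to `parse_args` stands in for the command line that the estimator builds, which is an assumption about the invocation.

```python
import argparse

# Same argument definition as in the training script
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate',
                    default=0.01, help='regularization rate')

# Stand-in for the command line built from the script parameters (assumed form)
args = parser.parse_args(['--regularization', '0.1'])
reg = args.reg_rate

# LogisticRegression takes the inverse: a smaller C means stronger regularization
C = 1 / reg
print('regularization rate:', reg, '-> C:', C)  # regularization rate: 0.1 -> C: 10.0

# Without the argument, the default rate of 0.01 applies
default_args = parser.parse_args([])
print('default C:', 1 / default_args.reg_rate)  # default C: 100.0
```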


Create training environment

In [34]:
myenv = Environment("training_environment")
myenv.docker.enabled = True
myenv.python.user_managed_dependencies = False
conda_packages = ['scikit-learn', 'joblib', 'python==3.6.2']
pip_packages = ['azureml-defaults', 'azureml-dataprep[pandas,fuse]', 'pyarrow', 'fastparquet']
myenv.python.conda_dependencies = CondaDependencies.create(conda_packages=conda_packages, pip_packages=pip_packages)
myenv.register(ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "username": null
        },
        "enabled": true,
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "training_environment",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-forge"
            ],
            "dependencies": [
                "python=3.6.2",
                {
                    "pip": [
         

In [35]:
# list environments
env_names = Environment.list(workspace=ws)
for env_name in env_names:
    print('Name:',env_name)

Name: training_environment
Name: AzureML-Tutorial
Name: AzureML-Minimal
Name: AzureML-Chainer-5.1.0-GPU
Name: AzureML-PyTorch-1.2-CPU
Name: AzureML-TensorFlow-1.12-CPU
Name: AzureML-TensorFlow-1.13-CPU
Name: AzureML-PyTorch-1.1-CPU
Name: AzureML-TensorFlow-1.10-CPU
Name: AzureML-PyTorch-1.0-GPU
Name: AzureML-TensorFlow-1.12-GPU
Name: AzureML-TensorFlow-1.13-GPU
Name: AzureML-Chainer-5.1.0-CPU
Name: AzureML-PyTorch-1.0-CPU
Name: AzureML-Scikit-learn-0.20.3
Name: AzureML-PyTorch-1.2-GPU
Name: AzureML-PyTorch-1.1-GPU
Name: AzureML-TensorFlow-1.10-GPU
Name: AzureML-PyTorch-1.3-GPU
Name: AzureML-TensorFlow-2.0-CPU
Name: AzureML-PyTorch-1.3-CPU
Name: AzureML-TensorFlow-2.0-GPU
Name: AzureML-PySpark-MmlSpark-0.15
Name: AzureML-AutoML
Name: AzureML-PyTorch-1.4-GPU
Name: AzureML-PyTorch-1.4-CPU
Name: AzureML-VowpalWabbit-8.8.0
Name: AzureML-Hyperdrive-ForecastDNN
Name: AzureML-AutoML-GPU
Name: AzureML-AutoML-DNN-GPU
Name: AzureML-AutoML-DNN
Name: AzureML-Designer-R
Name: AzureML-Designer-Recomm

Create and register the dataset (File)

In [36]:
# load the diabetes dataset (File method)
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/**')] # {path/*.parquet or path/**}
file_ds = Dataset.File.from_files(path=ds_path)
   
# Register the file dataset
try:
    file_ds = file_ds.register(workspace=ws,
                               name='diabetes file dataset',
                               description='diabetes files',
                               tags = {'format':'parquet'},
                               create_new_version=True)
    print('Dataset registered')
except Exception as ex:
    # report the failure instead of printing a success message
    print('Dataset registration failed:', ex)

Dataset registered


In [37]:
# show a list of registered dataset(s)
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, '\t version', dataset.version)

Datasets:
	 diabetes file dataset 	 version 1


In [38]:
# list of the file path(s)
for file_path in file_ds.to_path():
    print(file_path)

/diabetes.parquet


Train model

In [39]:
# Set the script parameters
script_params = {
    '--regularization': 0.1
}

# load the registered dataset by name
file_ds = Dataset.get_by_name(ws, "diabetes file dataset")

# load the docker environment
training_env = Environment.get(ws, 'training_environment')

# load the training compute cluster
training_cluster = ComputeTarget(ws, 'aml-cluster')

estimator = Estimator(source_directory=experiment_folder, # All the files in this directory are uploaded into the cluster nodes for execution
                      compute_target='local', # {'local', training_cluster}
                      entry_script='diabetes_training.py',
                      script_params=script_params,
                      environment_definition=training_env,
                      inputs=[file_ds.as_named_input('diabetes').as_mount(path_on_compute='/mnt/data')],
                     )

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)
# Run the experiment
run = experiment.submit(config=estimator)

# Show the run details while running
RunDetails(run).show()
run.wait_for_completion() # blocks until the run finishes and returns the run details

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'diabetes-training_1587397611_097b5668',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2020-04-20T15:53:22.844143Z',
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': '97af77b5-9a08-4d94-a47f-e81d6b5f8a8c',
  'azureml.git.repository_uri': 'https://github.com/albert-kevin/azuremachinelearning.git',
  'mlflow.source.git.repoURL': 'https://github.com/albert-kevin/azuremachinelearning.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '3b0b9f345b8ba523ffcf9a3cb26aecd0e097d138',
  'mlflow.source.git.commit': '3b0b9f345b8ba523ffcf9a3cb26aecd0e097d138',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [{'dataset': {'id': '642bff52-df41-4dd7-a6b7-353d164ed1ab'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'diabetes', 'mechanism': 'Mount', 'pathOnCompute': '/mnt/data'}}],
 'runDefinition': {'script': 'diabetes_training.py',
  'useAbsolutePath': False,
  'arguments': ['--regul

### Create inference script

In [40]:
# Create a local folder for the experiment files
folder_name = 'diabetes_service'
experiment_folder = './' + folder_name
os.makedirs(folder_name, exist_ok=True)
print(folder_name, 'folder created')

diabetes_service folder created


In [41]:
%%writefile $folder_name/diabetes_score.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load a registered model
    model_path = Model.get_model_path(model_name='diabetes_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Get the corresponding classname for each prediction (0 or 1)
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = []
    for prediction in predictions:
        predicted_classes.append(classnames[prediction])
    # Return the predictions as JSON
    return json.dumps(predicted_classes)

Writing diabetes_service/diabetes_score.py
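
Before deploying, the `init()` / `run()` contract of the scoring script can be exercised locally. The sketch below keeps the same JSON handling but swaps the registered model for a hypothetical `StubModel`, so it runs without a workspace. The stub's rule (glucose above 140) is arbitrary and its answers carry no medical meaning.

```python
import json


class StubModel:
    """Hypothetical stand-in for the trained classifier (arbitrary rule, for testing only)."""

    def predict(self, rows):
        # column 1 is PlasmaGlucose in the feature order used for training
        return [1 if row[1] > 140 else 0 for row in rows]


def init():
    global model
    model = StubModel()  # the real init() loads the registered .pkl file instead


def run(raw_data):
    # Same flow as the scoring script: parse JSON, predict, map to class names
    data = json.loads(raw_data)['data']
    predictions = model.predict(data)
    classnames = ['not-diabetic', 'diabetic']
    return json.dumps([classnames[p] for p in predictions])


init()
request_body = json.dumps({'data': [[9, 103, 78, 25, 304, 29.6, 1.28, 43],
                                    [0, 148, 58, 11, 179, 39, 0.16, 45]]})
response = run(request_body)
print(response)  # ["not-diabetic", "diabetic"]
```

A check like this catches mistakes in the request format or the class-name mapping early, before a slow container deployment reveals them.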


### Create inference environment

In [42]:
# Add the dependencies for our model (AzureML defaults is already included)
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")

# Save the environment config as a .yml file
env_file = folder_name + "/diabetes_env.yml"
with open(env_file, "w") as f:
    f.write(myenv.serialize_to_string())
print("Saved inference environment file in", env_file)

# Print the .yml file
with open(env_file,"r") as f:
    print(f.read())

Saved inference environment file in diabetes_service/diabetes_env.yml
# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults

- scikit-learn
channels:
- anaconda
- conda-forge



### Register the model

In [43]:
# define model name
model_name = 'diabetes_model'

# register model from the workspace 
run.register_model(model_name=model_name, # registered model name used in scoring script init()
                   model_path='outputs/diabetes_model.pkl', # fixed path in workspace {'model.pkl', 'model.onnx'}
                   tags={'Training context': 'Custom Training'},
                   properties={'AUC': run.get_metrics()['AUC'],
                               'Accuracy': run.get_metrics()['Accuracy']},
                   description="Classification model to predict diabetes",
                   model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                   model_framework_version='0.22.2')

print('Model trained and registered')

Model trained and registered


### Deploy the model

In [44]:
service_name = "diabetes-service"

# Remove any existing service under the same name
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    print('"' + service_name + '" does not exist, creating the webservice...')

# Configure the scoring environment
inference_config = InferenceConfig(runtime="python",
                                   source_directory=folder_name,
                                   entry_script="diabetes_score.py",
                                   conda_file="diabetes_env.yml")

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1)

# load the registered model
model = ws.models['diabetes_model']

service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)

service.wait_for_deployment(show_output=True)
print(service.state)

"diabetes-service" does not exist, creating the webservice...
Running...........................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


### Inference test

In [46]:
# get webservice URI
endpoint = service.scoring_uri

# raw test data
rawdata = [[9, 103, 78, 25, 304, 29.6, 1.28, 43],
           [0, 148, 58, 11, 179, 39, 0.16, 45]]

print("URI: " + endpoint)
print("Body: " + json.dumps({"data": rawdata})) # convert array to a serialized JSON formatted string object

service.run(json.dumps({"data": rawdata}))

URI: http://712a95c4-9b83-46f8-8600-01befdb4da0f.westeurope.azurecontainer.io/score
Body: {"data": [[9, 103, 78, 25, 304, 29.6, 1.28, 43], [0, 148, 58, 11, 179, 39, 0.16, 45]]}


'["diabetic", "not-diabetic"]'
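
Client applications normally call the scoring URI over HTTP instead of going through the SDK. The sketch below only builds such a request with the standard library and decodes the reply that `service.run()` returned above; it does not send anything, since the service is deleted afterwards. The `Content-Type` header and the unauthenticated call are assumptions about this deployment. A raw HTTP reply may also be JSON-encoded one extra time, depending on how the entry script returns its result; that detail is not verified here.

```python
import json
import urllib.request

# Scoring URI as printed above; the service may no longer exist
endpoint = 'http://712a95c4-9b83-46f8-8600-01befdb4da0f.westeurope.azurecontainer.io/score'
rawdata = [[9, 103, 78, 25, 304, 29.6, 1.28, 43],
           [0, 148, 58, 11, 179, 39, 0.16, 45]]

# Serialize the payload in the shape the scoring script expects: {"data": [...]}
body = json.dumps({'data': rawdata}).encode('utf-8')
request = urllib.request.Request(endpoint,
                                 data=body,
                                 headers={'Content-Type': 'application/json'},
                                 method='POST')

# Sending is left commented out because it needs a live service:
# with urllib.request.urlopen(request) as response:
#     reply = response.read().decode('utf-8')

# Value returned by service.run() above: a JSON-encoded list of class names
reply = '["diabetic", "not-diabetic"]'
predicted_classes = json.loads(reply)
for row, label in zip(rawdata, predicted_classes):
    print(row, '->', label)
```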

When you are finished testing your service, clean up the deployment with `service.delete()`.

In [47]:
service.delete()

## Finetuning

Tune the hyperparameters of the model with HyperDrive.  
A HyperDrive run starts one child run per sampled hyperparameter combination, so the metrics of all combinations tried can be compared side by side.  

[doc: how to tune hyperparameters](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters)  
[git: examples](https://github.com/microsoft/MLHyperparameterTuning)  

In [59]:
# Initialize workspace
ws = Workspace.from_config()

In [60]:
# Load the existing AmlCompute cluster
training_cluster = ComputeTarget(ws, 'aml-cluster')

In [61]:
# Create a project directory
project_folder = './diabetes_hyperdrive'
os.makedirs(project_folder, exist_ok=True)

In [62]:
# Experiment folder
experiment_folder = './' + project_folder

Prepare training script

In [63]:
%%writefile $experiment_folder/diabetes_training.py

import argparse
import os # needed for os.makedirs() further down
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import glob
print("libraries imported...")

# Get the experiment run context
run = Run.get_context()
print("run context loaded...")

# Set regularization hyperparameter (passed as an argument to the script)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument('--C', type=float, default=1.0, help='Inverse of regularization strength')
parser.add_argument('--solver', type=str, default='lbfgs', help='Algorithm to use in the optimization problem')
args = parser.parse_args()
reg = args.reg_rate
run.log('Inverse of regularization strength', float(args.C))
run.log('Algorithm to use in the optimization problem', str(args.solver))
print("argparse parameters loaded...")

# load the diabetes dataset (File method)
# Get the training data from the estimator input identified as 'diabetes'
mount = run.input_datasets['diabetes'] # read-only mount from delta lake as '/mnt/data'
print("delta lake mounted...")
diabetes = pd.read_parquet('/mnt/data/diabetes.parquet') # load any file(s) from this delta lake mounted folder
print("dataset loaded...")

# save data into workspace
diabetes.to_csv("outputs/dataset.csv", index=False) # {logs/  outputs/}
print("test: write dataset to workspace 'outputs/dataset.csv'")

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
# note: only C and solver reach the model; the regularization rate is logged but not used
print('Training a logistic regression model with C =', args.C, 'and solver =', args.solver)
run.log('Regularization Rate',  float(reg))
model = LogisticRegression(C=args.C, solver=args.solver).fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Overwriting ././diabetes_hyperdrive/diabetes_training.py


In [64]:
# Create an experiment
experiment = Experiment(ws, 'diabetes-hyperdrive-training')

In [65]:
# Create an estimator for the training script

# get the training compute cluster
training_cluster = ComputeTarget(ws, 'aml-cluster')

# Set the script parameters
script_params = {
    '--regularization': 0.1,
    '--C': 10,
    '--solver': 'lbfgs',
}

# Get the docker environment
training_env = Environment.get(ws, 'training_environment')

# get the registered dataset by name
file_ds = Dataset.get_by_name(ws, "diabetes file dataset")

estimator = Estimator(source_directory=experiment_folder, # All the files in this directory are uploaded into the cluster nodes for execution
                      compute_target=training_cluster, # a remote compute cluster is required for hyperparameter tuning
                      entry_script='diabetes_training.py',
                      script_params=script_params,
                      environment_definition=training_env,
                      inputs=[file_ds.as_named_input('diabetes').as_mount(path_on_compute='/mnt/data')],
                     )

In [66]:
# define the hyperparameter space

param_sampling = RandomParameterSampling( {
    '--regularization': choice(1, 0.333, 0.1, 0.033),
    '--C': choice(1, 3, 10, 30),
    '--solver': choice('lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'),
    } )

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=20,   # the model only uses C and solver: 4 x 5 = 20 distinct combinations (regularization is sampled but only logged)
                                         max_concurrent_runs=5,
                                        )
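
As a rough illustration of what random sampling over this discrete space means, the sketch below enumerates the grid with the standard library and then draws 20 random combinations. It does not use HyperDrive and is not how HyperDrive is implemented. The solver list holds five distinct values, with `'saga'` assumed as the fifth.

```python
import itertools
import random

# Search space mirroring the choice() lists; 'saga' as fifth solver is an assumption
space = {
    '--regularization': [1, 0.333, 0.1, 0.033],
    '--C': [1, 3, 10, 30],
    '--solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
}

# Full grid: every value of every parameter combined
all_combos = list(itertools.product(*space.values()))
print(len(all_combos))  # 80

# The training script only feeds C and solver into the model
effective = {(c, solver) for _, c, solver in all_combos}
print(len(effective))  # 20

# Random sampling draws each parameter independently, so repeats are possible
rng = random.Random(0)
sampled = [{name: rng.choice(values) for name, values in space.items()}
           for _ in range(20)]
distinct = {(s['--C'], s['--solver']) for s in sampled}
print(len(distinct), 'distinct (C, solver) pairs out of', len(sampled), 'samples')
```

Because the draws are independent, 20 random runs are not guaranteed to cover all 20 effective combinations; a grid over `--C` and `--solver` alone would.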

In [56]:
# start the HyperDrive experiment run (~25')
hyperdrive_run = experiment.submit(config=hyperdrive_run_config)

In [68]:
# Show the run details while running
RunDetails(hyperdrive_run).show()  # this cell returns immediately; the run itself keeps going in the background

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [None]:
# the RUN must FINISH first, then continue...

In [69]:
# Find best run
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])

['--C', '3', '--regularization', '0.1', '--solver', 'liblinear']
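
The argument list of the best run is flat, which is awkward to reuse. A small sketch that turns it into a dictionary follows; it assumes that flags and values strictly alternate, as they do in the output above.

```python
# Argument list as printed for the best run
arguments = ['--C', '3', '--regularization', '0.1', '--solver', 'liblinear']

# Pair each flag (even positions) with the value that follows it (odd positions)
best_params = dict(zip(arguments[0::2], arguments[1::2]))

# Values arrive as strings, so convert where a number is needed
best_C = float(best_params['--C'])
best_solver = best_params['--solver']
print(best_C, best_solver)  # 3.0 liblinear
```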


## Pipelines

Pipelines orchestrate machine learning operations, arranged sequentially or in parallel.  
A pipeline is a workflow of machine learning tasks in which each task is implemented as a step.  
Each step in the pipeline runs on its allocated compute target.  
A pipeline can be published as a REST endpoint, enabling client applications to initiate a pipeline run.

What follows is a simple **_2-step_** pipeline that trains and registers a model.  

1. storage
1. compute
1. environment
1. scripts
   * step 1: create a model
   * step 2: register the model
1. create pipeline
1. run pipeline
1. publish pipeline
1. call pipeline

### storage

The PipelineData object is a special kind of data reference that is used to pass data from the output of one pipeline step to the input of another, creating a dependency between them. You'll create one and use it as the output for the first step and the input for the second step. Note that you also need to pass it as a script argument so your code can access the datastore location referenced by the data reference.
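
The hand-off described above can be imitated locally: one function writes a file into a folder it receives as an argument, and a second function reads from the same folder. This is only an analogy, with a temporary directory playing the role of the PipelineData location. The function names and the placeholder model are hypothetical, and no Azure ML call is involved.

```python
import os
import pickle
import tempfile


def train_step(output_folder):
    """Step 1: write a placeholder model into the folder it was given."""
    os.makedirs(output_folder, exist_ok=True)
    model = {'kind': 'placeholder-model', 'threshold': 0.5}  # stand-in for a fitted model
    with open(os.path.join(output_folder, 'model.pkl'), 'wb') as f:
        pickle.dump(model, f)


def register_step(model_folder):
    """Step 2: read the model back from the same folder."""
    with open(os.path.join(model_folder, 'model.pkl'), 'rb') as f:
        return pickle.load(f)


# The temporary directory plays the role of the PipelineData location
with tempfile.TemporaryDirectory() as shared_root:
    shared_folder = os.path.join(shared_root, 'model_folder')
    train_step(shared_folder)              # path arrives like --output_folder would
    loaded = register_step(shared_folder)  # same path arrives like --model_folder would

print(loaded)  # {'kind': 'placeholder-model', 'threshold': 0.5}
```

The point of passing the location as a script argument is the same in both cases: neither step hard-codes where the other one stores its result.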

In [70]:
# Initialize workspace
ws = Workspace.from_config()

# load the training diabetes dataset (File method)
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
diabetes_ds = Dataset.File.from_files(path=ds_path)

# Create a PipelineData (temporary Data Reference)
# data lake gen2: container "datalake" > azureml > 0b93a7bc-9bf2-46a9-b9c4-5afdba292d08 > model_folder
model_folder = PipelineData("model_folder", datastore=ds)

### compute

In [71]:
# load compute cluster
pipeline_cluster = ComputeTarget(ws, 'aml-cluster')

### environment

In [72]:
# load the docker environment
training_env = Environment.get(ws, 'training_environment')

# Create a new runconfig object for the pipeline
pipeline_run_config = RunConfiguration()

# Use the compute you created above.
pipeline_run_config.target = pipeline_cluster

# Assign the environment to the run configuration
pipeline_run_config.environment = training_env

### scripts

In [73]:
# Create a folder for the pipeline step files
project_folder = 'diabetes_pipeline'
os.makedirs(project_folder, exist_ok=True)

In [74]:
# Experiment folder
experiment_folder = './' + project_folder

**step 1:** create a model

In [None]:
# estimator step

In [75]:
%%writefile $experiment_folder/train_diabetes.py
# Import libraries
from azureml.core import Run
import argparse
import os # needed for os.makedirs() further down
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--output_folder', type=str, dest='output_folder', default="diabetes_model", help='output folder')
args = parser.parse_args()
output_folder = args.output_folder

# Get the experiment run context
run = Run.get_context()

# load the diabetes data (passed as an input dataset)
print("Loading Data...")
#diabetes = run.input_datasets['diabetes_train'].to_pandas_dataframe()
mount = run.input_datasets['diabetes_train'] # read-only mount from delta lake as '/mnt/data'
print("delta lake mounted...")
diabetes = pd.read_parquet('/mnt/data/diabetes.parquet') # load any file(s) from this delta lake mounted folder
print("dataset loaded...")

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model
os.makedirs(output_folder, exist_ok=True)
output_path = output_folder + "/model.pkl"
joblib.dump(value=model, filename=output_path)

run.complete()

Writing ./diabetes_pipeline/train_diabetes.py


**step 2:** register the model

In [None]:
# python script step

In [76]:
%%writefile $experiment_folder/register_diabetes.py
# Import libraries
import argparse
import joblib
from azureml.core import Workspace, Model, Run

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--model_folder', type=str, dest='model_folder', default="diabetes_model", help='model location')
args = parser.parse_args()
model_folder = args.model_folder

# Get the experiment run context
run = Run.get_context()

# load the model
print("Loading model from " + model_folder)
model_file = model_folder + "/model.pkl"
model = joblib.load(model_file)

Model.register(workspace=run.experiment.workspace,
               model_path=model_file,
               model_name='diabetes_model',
               tags={'Training context':'Pipeline'})

run.complete()

Writing ./diabetes_pipeline/register_diabetes.py


### create pipeline

|Common kinds of step|Description|
|-|-|
|**PythonScriptStep**|<i>Runs a specified Python script</i>|
|**EstimatorStep**|<i>Runs an estimator</i>|
|**DataTransferStep**|<i>Uses Azure Data Factory to copy data between data stores</i>|
|**DatabricksStep**|<i>Runs a notebook, script, or compiled JAR on a databricks cluster</i>|
|**AdlaStep**|<i>Runs a U-SQL job in Azure Data Lake Analytics</i>|
|**[6 more steps](https://aka.ms/AA70rrh)**||

In [77]:
estimator = Estimator(source_directory=experiment_folder, 
                      compute_target=pipeline_cluster, #training_cluster
                      environment_definition=pipeline_run_config.environment, #training_env
                      entry_script='train_diabetes.py', #'diabetes_training.py'
                      # NO script_params=script_params,
                      # NO inputs=[file_ds.as_named_input('diabetes').as_mount(path_on_compute='/mnt/data')],
                     )

# Step 1, run the estimator to train the model
train_step = EstimatorStep(name="Train Model",
                           estimator=estimator,
                           compute_target=pipeline_cluster, # {'aml-cluster'}
                           estimator_entry_script_arguments=['--output_folder', model_folder],
                           inputs=[diabetes_ds.as_named_input('diabetes_train').as_mount(path_on_compute='/mnt/data')],
                           outputs=[model_folder],
                           allow_reuse=True)

# Step 2, run the model registration script
register_step = PythonScriptStep(name="Register Model",
                                 source_directory=experiment_folder,
                                 script_name="register_diabetes.py",
                                 compute_target=pipeline_cluster,
                                 runconfig=pipeline_run_config,
                                 inputs=[model_folder],
                                 arguments=['--model_folder', model_folder],
                                 allow_reuse=True)

# Construct the pipeline
pipeline = Pipeline(ws, [train_step, register_step])
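
Because `model_folder` is an output of the first step and an input of the second, the second step has to wait for the first. The toy scheduler below shows how such an order can be derived from inputs and outputs alone. It is a sketch of the idea, not Azure ML's implementation, and the `steps` dictionary and `execution_order` helper are hypothetical.

```python
# Each step lists the named data it consumes and produces (mirrors the two steps above)
steps = {
    'Register Model': {'inputs': ['model_folder'], 'outputs': []},
    'Train Model': {'inputs': [], 'outputs': ['model_folder']},
}


def execution_order(steps):
    """Order steps so that every input has been produced before it is consumed."""
    produced, ordered = set(), []
    remaining = dict(steps)
    while remaining:
        # a step is ready once all of its inputs exist
        ready = [name for name, io in remaining.items()
                 if all(i in produced for i in io['inputs'])]
        if not ready:
            raise ValueError('circular or unsatisfied dependency')
        for name in ready:
            ordered.append(name)
            produced.update(remaining.pop(name)['outputs'])
    return ordered


order = execution_order(steps)
print(order)  # ['Train Model', 'Register Model']
```

Steps that become ready in the same pass have no data dependency on each other, which is what allows them to run in parallel.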

### run pipeline

Run pipeline and verify it works

In [78]:
# Create an experiment
experiment = Experiment(ws, 'diabetes-training-pipeline')

# Run the pipeline
pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)

Created step Train Model [fc8c6401][cf50c66d-1485-4683-bb24-8faae2af198d], (This step will run and generate new outputs)
Created step Register Model [946d0f72][e6f4f986-650c-498d-b23c-2547cc5bd773], (This step will run and generate new outputs)

Submitted PipelineRun d9447ed8-c689-418f-a7ba-60ba05c743a8
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/diabetes-training-pipeline/runs/d9447ed8-c689-418f-a7ba-60ba05c743a8?wsid=/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourcegroups/myResourceGroup4/workspaces/machine_learning_workspace4


In [79]:
# Show run details
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

PipelineRunId: d9447ed8-c689-418f-a7ba-60ba05c743a8
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/diabetes-training-pipeline/runs/d9447ed8-c689-418f-a7ba-60ba05c743a8?wsid=/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourcegroups/myResourceGroup4/workspaces/machine_learning_workspace4
PipelineRun Status: Running


StepRunId: d3d253d8-f099-4441-b09c-0e12de2c9428
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/diabetes-training-pipeline/runs/d3d253d8-f099-4441-b09c-0e12de2c9428?wsid=/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourcegroups/myResourceGroup4/workspaces/machine_learning_workspace4
StepRun( Train Model ) Status: Running

Streaming azureml-logs/20_image_build_log.txt
2020/04/20 16:56:09 Downloading source code...
2020/04/20 16:56:10 Finished downloading source code
2020/04/20 16:56:11 Creating Docker network: acb_default_network, driver: 'bridge'
2020/04/20 16:56:11 Successfully set up Docker network: acb_def

mkl-2019.5           | 205.3 MB  | ########## | 100% 
python-3.6.2         | 27.0 MB   | ########## | 100% 
openssl-1.0.2u       | 3.1 MB    | ########## |

  Downloading azure_graphrbac-0.61.1-py2.py3-none-any.whl (141 kB)
Collecting SecretStorage
  Downloading SecretStorage-3.1.2-py3-none-any.whl (14 kB)
Collecting contextlib2
  Downloading contextlib2-0.6.0.post1-py2.py3-none-any.whl (9.8 kB)
Collecting azure-mgmt-authorization>=0.40.0
  Downloading azure_mgmt_authorization-0.60.0-py2.py3-none-any.whl (82 kB)
Collecting msrestazure>=0.4.33
  Downloading msrestazure-0.6.3-py2.py3-none-any.whl (40 kB)
Collecting azure-mgmt-resource<9.0.0,>=1.2.1
  Downloading azure_mgmt_resource-8.0.1-py2.py3-none-any.whl (758 kB)
Collecting pyopenssl
  Downloading pyOpenSSL-19.1.0-py2.py3-none-any.whl (53 kB)
Collecting ndg-httpsclient
  Downloading ndg_httpsclient-0.5.1-py3-none-any.whl (34 kB)
Collecting pathspec
  Downloading pathspec-0.8.0-py2.py3-none-any.whl (28 kB)
Collecting cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*
  Downloading cryptography-2.9-cp35-abi3-manylinux2010_x86_64.whl (2.7 MB)
Collecting azure-mgmt-keyvault>=0.40.0
  Downloading azur

#
# To activate this environment, use:
# > source activate /azureml-envs/azureml_b9a1534962684a800c586e9fce04292e
#
# To deactivate an active environment, use:
# > source deactivate
#


Removing intermediate container 75b154f6541a
 ---> 00371a6e2d79
Step 9/15 : ENV PATH /azureml-envs/azureml_b9a1534962684a800c586e9fce04292e/bin:$PATH
 ---> Running in 770e24e720a5
Removing intermediate container 770e24e720a5
 ---> 7c2b537ea211
Step 10/15 : ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_b9a1534962684a800c586e9fce04292e
 ---> Running in 611f6fced5d7
Removing intermediate container 611f6fced5d7
 ---> 829052d6621d
Step 11/15 : ENV LD_LIBRARY_PATH /azureml-envs/azureml_b9a1534962684a800c586e9fce04292e/lib:$LD_LIBRARY_PATH
 ---> Running in d5d352d2f2f7
Removing intermediate container d5d352d2f2f7
 ---> 6deb097054aa
Step 12/15 : COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
 ---> 187e5cfee18f
Step 13/15 

dataset loaded...
Training a decision tree model
Accuracy: 0.8906666666666667
AUC: 0.877118264590278


The experiment completed successfully. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.39716506004333496 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 125
Enter __exit__ of DatasetContextManager
Unmounting /mnt/data.
Finishing unmounting /mnt/data.
Exit __exit__ of DatasetContextManager
Engine process terminated with returncode=0

Streaming azureml-logs/75_job_post-tvmps_b8866aaa807c663004dafc0cfd1578194868072477aa849a9ae832d9e347ee40_d.txt
Starting job release. Current time:2020-04-20T17:05:33.834370
Logging experiment finalizing status in history service.
Starting the daemon thread to refresh tokens in background for process with pid = 389
Job release is complete. Current time:2020-04-20T17:05:36.753009

StepRun(Train Model) Execution Summary
StepRun( Train Model ) S




StepRunId: a98df897-13ea-4bb1-87e3-e89dee6b7e0b
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/diabetes-training-pipeline/runs/a98df897-13ea-4bb1-87e3-e89dee6b7e0b?wsid=/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourcegroups/myResourceGroup4/workspaces/machine_learning_workspace4
StepRun( Register Model ) Status: NotStarted
StepRun( Register Model ) Status: Queued
StepRun( Register Model ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_b8866aaa807c663004dafc0cfd1578194868072477aa849a9ae832d9e347ee40_d.txt
2020-04-20T17:07:27Z Starting output-watcher...
2020-04-20T17:07:27Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_25db707bbbfa2a62132a58e7c231c3f6
Digest: sha256:965429143f86820e4838d07e2d5b03ec38818a37342546db6aba4dd7ad4f3dfb
Status: Image is up to date for machinelearnc5ceeadf.azurecr.io/azureml/azureml_25db707bbbfa2a62132a58e7c23



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'd9447ed8-c689-418f-a7ba-60ba05c743a8', 'status': 'Completed', 'startTimeUtc': '2020-04-20T16:55:17.545117Z', 'endTimeUtc': '2020-04-20T17:08:15.049055Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://machinelstorage7defa4bf6.blob.core.windows.net/azureml/ExperimentRun/dcid.d9447ed8-c689-418f-a7ba-60ba05c743a8/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=xRdXxj8K9Nyju2kjC1VTAAE4486jtAkRkew34OmnReU%3D&st=2020-04-20T16%3A58%3A16Z&se=2020-04-21T01%3A08%3A16Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://machinelstorage7defa4bf6.blob.core.windows.net/azureml/ExperimentRun/dcid.d9447ed8-c689-418f-a7ba-60ba05c743a8/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=%2BYgjjbC1W%2Bpi6t09radcK5kHZv3Zd7%2BxEVoywRfv%2FIU%3D&st=2020-04-20T16%3A58%3A16Z&se=2020-04-

'Finished'

### publish pipeline

Publish the pipeline as a REST service endpoint so that it can be triggered with an HTTP request.

In [80]:
published_pipeline = pipeline.publish(name="Diabetes_Training_Pipeline",
                                      description="Trains diabetes model",
                                      version="1.0")
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)

https://westeurope.api.azureml.ms/pipelines/v1.0/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourceGroups/myResourceGroup4/providers/Microsoft.MachineLearningServices/workspaces/machine_learning_workspace4/PipelineRuns/PipelineSubmit/fea492c4-40bf-4740-aa79-2c778fe7c64d


### call pipeline

In [81]:
# Log in interactively and build the Authorization header (bearer token) for the REST call
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

In [82]:
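experiment_name = 'Run-diabetes-pipeline'

# Submit the published pipeline; the run is tracked under the given experiment name
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": experiment_name})
response.raise_for_status()  # fail early on an HTTP error instead of a KeyError below
run_id = response.json()["Id"]
run_id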

'20e79c0f-ab5c-4e0c-80eb-96bb1acbecf2'
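
The call above can be wrapped in a small helper so that the request body and the error handling live in one place. The following is a minimal sketch, not part of the original notebook: the names `build_submit_payload` and `submit_pipeline` are hypothetical, and the `ParameterAssignments` key is assumed to be useful only when the pipeline was defined with pipeline parameters (the pipeline published here has none). The HTTP function is passed in as an argument, so the sketch can be tried without network access.

```python
from typing import Any, Callable, Dict, Optional


def build_submit_payload(experiment_name: str,
                         parameters: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the JSON body for submitting a published pipeline (hypothetical helper)."""
    if not experiment_name:
        raise ValueError("experiment_name must not be empty")
    payload: Dict[str, Any] = {"ExperimentName": experiment_name}
    if parameters:
        # Assumption: only meaningful if the pipeline declares pipeline parameters.
        payload["ParameterAssignments"] = dict(parameters)
    return payload


def submit_pipeline(post: Callable[..., Any],
                    endpoint: str,
                    auth_header: Dict[str, str],
                    experiment_name: str,
                    parameters: Optional[Dict[str, Any]] = None) -> str:
    """Post the payload with the given HTTP function and return the run ID.

    `post` is expected to behave like `requests.post`: it accepts `headers`
    and `json` keyword arguments and returns an object that offers
    `raise_for_status()` and `json()`.
    """
    response = post(endpoint,
                    headers=auth_header,
                    json=build_submit_payload(experiment_name, parameters))
    response.raise_for_status()  # surface HTTP errors before reading the body
    body = response.json()
    if "Id" not in body:
        raise RuntimeError(f"No run ID in response: {body}")
    return body["Id"]
```

With the variables from the cells above, a call could look like `run_id = submit_pipeline(requests.post, rest_endpoint, auth_header, experiment_name)`.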

With the run ID, you can attach to the run and follow its progress in the RunDetails widget.

In [83]:
published_pipeline_run = PipelineRun(ws.experiments[experiment_name], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …
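
Where the widget is not available (for example in a script or a scheduled job), the run can be polled instead. The following is a minimal sketch under stated assumptions: `wait_for_terminal_status` is a hypothetical helper, and the set of terminal states is an assumption based on the statuses that appear in the logs above (`Completed`, `Finished`) plus `Failed` and `Canceled`. The status source is any zero-argument callable, so the helper does not depend on the SDK.

```python
import time
from typing import Callable

# Assumption: statuses after which the run will not change any more.
TERMINAL_STATES = frozenset({"Completed", "Finished", "Failed", "Canceled"})


def wait_for_terminal_status(get_status: Callable[[], str],
                             poll_seconds: float = 30.0,
                             timeout_seconds: float = 3600.0,
                             sleep: Callable[[float], None] = time.sleep,
                             clock: Callable[[], float] = time.monotonic) -> str:
    """Poll `get_status` until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Run still in state {status!r} after {timeout_seconds} s")
        sleep(poll_seconds)  # wait before asking again
```

With the run object from the cell above, this could be called as `wait_for_terminal_status(published_pipeline_run.get_status)`.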