Author: Kevin ALBERT  

Created: April 2020  

# Automated Machine Learning
_**Classification project with data residing on a data lake gen2 using remote compute with autoML and model registration**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Register](#Register)
1. [Deploy](#Deploy)
1. [Test](#Test)
1. [CustomML](#CustomML)
1. [Serverless](#Serverless)
1. [Finetuning](#Finetuning)

## Introduction

Cleaned datasets are created in Data Factory and written to a Delta Lake on Data Lake Storage Gen2.  
This notebook uses that data and remote compute to train a classification model with autoML.  
The example data is used to predict whether a patient is diabetic or non-diabetic based on 8 features.  

This notebook shows how to:
1. Setup packages
1. Setup workspace
1. Create an experiment
1. Load data
1. Setup compute
1. Configure autoML
1. Train pipelines
1. Explore the best pipeline
1. Inspect model properties
1. Register the model
1. Deploy model as webservice
1. Webservice inference test

## Setup

* required
  * disable Shields in the Brave web browser so the widgets work
  * download **config.json** from the machine learning workspace portal
  * install the extra azureml packages in **py37_default** when using **'local'** compute  
  * split the data into train and test datasets on the data lake; a separate validation dataset is not needed because autoML uses cross-validation
* optional
  * register datastore(s) manually
  * register dataset(s) manually
  * register compute cluster(s) manually
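
The **config.json** downloaded from the portal is a small JSON file. As a minimal sketch, the check below looks for the three keys that file normally carries before `Workspace.from_config()` is called. `missing_config_keys` is a hypothetical helper and the sample values are placeholders, not a real workspace.

```python
import json

# Keys found in the config.json offered for download by the workspace portal.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_text):
    """Return the required keys that are absent or empty in a config.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder content with an empty workspace name (illustration only).
sample = json.dumps({
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "myResourceGroup",
    "workspace_name": "",
})
print(missing_config_keys(sample))
```

An empty list means the file has all three keys filled in; anything else names what still has to be fixed.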

In [None]:
! /anaconda/envs/py37_default/bin/python -m pip install -U azureml-sdk[explain,automl] azureml-widgets

### Import open-source packages

In [2]:
import logging
# import os
# import random
# import re
# import lightgbm
import pandas as pd
import numpy as np
import json
import requests
import joblib
# import csv
# from matplotlib import pyplot as plt
# from matplotlib.pyplot import imshow
# from sklearn import datasets
# from shutil import copy2
# import seaborn as sns
# sns.set(color_codes='True')

# import warnings
# warnings.filterwarnings('ignore')

### Import azure machine learning SDK packages

In [81]:
#import azureml.core
from azureml.core import Workspace
#from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core.experiment import Experiment
from azureml.core import Dataset
from azureml.core import Datastore
from azureml.data.datapath import DataPath
from azureml.core.compute import ComputeTarget
from azureml.core.compute import AmlCompute
from azureml.core.compute import AksCompute
# from azureml.core.compute_target import ComputeTargetException
# from azureml.core.image import Image
from azureml.core.model import Model
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.widgets import RunDetails
from azureml.core import Run
from azureml.core.webservice import Webservice
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import AksWebservice
#from azureml.core import Webservice
from azureml.exceptions import WebserviceException
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.train.estimator import Estimator

#from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

### Workspace

In [3]:
# load the workspace
ws = Workspace.from_config()

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.
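
For unattended runs, the warning above suggests `ServicePrincipalAuthentication`. The sketch below only gathers the constructor arguments from environment variables so that no secret ends up in the notebook. The variable names (`AML_TENANT_ID`, `AML_SP_ID`, `AML_SP_PASSWORD`) and the helper `service_principal_kwargs` are assumptions made for illustration.

```python
import os


def service_principal_kwargs(env=None):
    """Collect ServicePrincipalAuthentication arguments from environment variables.

    The environment variable names are hypothetical; pick your own.
    """
    env = os.environ if env is None else env
    mapping = {
        "tenant_id": "AML_TENANT_ID",
        "service_principal_id": "AML_SP_ID",
        "service_principal_password": "AML_SP_PASSWORD",
    }
    missing = [var for var in mapping.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in mapping.items()}


# With azureml-core installed, the result could be used like this:
# from azureml.core.authentication import ServicePrincipalAuthentication
# auth = ServicePrincipalAuthentication(**service_principal_kwargs())
# ws = Workspace.from_config(auth=auth)
```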


### Experiment

In [7]:
# choose an experiment name
experiment = Experiment(ws, 'automl-classification')

### Data

Data Factory has prepared the data from /bronze to /silver to /gold and /platinum for model training  
**note:** this demonstration uses files in the /platinum folder of the Data Lake Gen2 `datalake` container  
  * /datalake/platinum/diabetes.csv
  * /datalake/platinum/diabetes.parquet
  * copy from ../data/platinum/*

Register the datastore 'data lake gen2' as a **blob container**  
**optional:** manually register in ML workspace

In [5]:
import os

ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="datalakestoragegen2",
    container_name="datalake",
    account_name="datalake21032020",
    # never store the storage key in the notebook; DATALAKE_ACCOUNT_KEY is a hypothetical environment variable name
    account_key=os.environ["DATALAKE_ACCOUNT_KEY"],
    create_if_not_exists=False)
# list available datastores
ws.datastores

{'datalakestoragegen2': {
   "name": "datalakestoragegen2",
   "container_name": "datalake",
   "account_name": "datalake21032020",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspacefilestore': {
   "name": "workspacefilestore",
   "container_name": "azureml-filestore-b02eb652-86d9-4686-a295-28fda46cf559",
   "account_name": "alehanwagaatet2212538382",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspaceblobstore': {
   "name": "workspaceblobstore",
   "container_name": "azureml-blobstore-b02eb652-86d9-4686-a295-28fda46cf559",
   "account_name": "alehanwagaatet2212538382",
   "protocol": "https",
   "endpoint": "core.windows.net"
 }}

Load file(s) into a tabular dataset  
**Note:** a tabular dataset cannot read Delta Lake parquet file(s) directly  
**Fix:** import single gold/*.csv or gold/*.parquet file(s) written with pandas instead  

In [8]:
# load datastore
ds = Datastore.get(ws, 'datalakestoragegen2')
# show datastore settings
ds

{
  "name": "datalakestoragegen2",
  "container_name": "datalake",
  "account_name": "datalake21032020",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

**Option 1 Tabular:** loading *.parquet

In [31]:
# setup parquet file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet}
dataset = Dataset.Tabular.from_parquet_files(path=ds_path)
# show dataset settings
dataset

{
  "source": [
    "('datalakestoragegen2', 'platinum/diabetes.parquet')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ]
}

**Option 2 Tabular:** loading *.csv

In [None]:
# setup csv file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=ds_path)
# show dataset settings
dataset

**Option 3 Registered:** loading a registered dataset (manually register in ML workspace)

In [None]:
# list available datasets
ws.datasets

In [None]:
# load a registered dataset
dataset = Dataset.get_by_name(ws, 'diabetes_parquet_from_datastore_datalakegen2')
# show dataset settings
dataset

### Compute

Check the available VM size **names** before creating an auto-scaling cluster

In [8]:
# example: list all VM sizes with 1 vCPU and no GPU
vm_df = pd.DataFrame(AmlCompute.supported_vmsizes(ws))
vm_df[(vm_df.vCPUs == 1) & (vm_df.gpus == 0)]

| |gpus|maxResourceVolumeMB|memoryGB|name|vCPUs|
|-|-|-|-|-|-|
|0|0|51200|3.5|Standard_D1_v2|1|
|8|0|7168|3.5|Standard_DS1_v2|1|
|18|0|51200|3.5|Standard_D1|1|


option 1: Create training cluster  

In [None]:
# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'
# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D1_v2',
                                                       min_nodes=0, # scales down to zero nodes when idle, so nothing is billed
                                                       max_nodes=10, # depends on your quota limits
                                                       vm_priority='dedicated', # {lowpriority, dedicated}
                                                       admin_username='ubuntu',
                                                       admin_user_password='ABCD1234abcd',
                                                       idle_seconds_before_scaledown=120, # {default: 120}
                                                      )
# Create the compute
training_cluster = ComputeTarget.create(ws, compute_name, compute_config)
training_cluster.wait_for_completion(show_output=True)

option 2: Load already known training cluster

In [9]:
# list all available training cluster(s):
for cluster in ws.compute_targets:
    print(cluster)

aml-cluster


In [10]:
# load the training cluster
compute_name = 'aml-cluster'
training_cluster = ComputeTarget(ws, name=compute_name)

## Train

### Configure autoML
Define settings to run the experiment.

|Property|Description|Options|
|-|-|-|
|**task**|the type of machine learning task|<i>classification</i><br><i>regression</i><br><i>forecasting</i>|
|**compute_target**|local DSVM: iterations run serially<br>remote AML or AKS: iterations run in parallel|<i>local</i><br><i>training_cluster</i>|
|**primary_metric**|the metric you want to optimize|**classification:**<br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i><br><br>**regression:**<br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**training_data**|input dataset, containing both X_train and y_train|<i>DataFrame</i><br><i>Dataset</i><br><i>DatasetDefinition</i><br><i>TabularDataset</i>|
|**validation_data**|validation dataset, not needed when cross-validation is used|N/A|
|**label_column_name**|the name of the 'target' or 'label' column||
|**enable_early_stopping**|stop the run if metric score is not improving|<i>True</i><br><i>False</i>|
|**n_cross_validations**|number of cross validation splits|5|
|**experiment_timeout_hours**|maximum time in hours before the experiment terminates (+15min)|<i>0.25</i>|
|**max_concurrent_iterations**|must be less than or equal to the number of cores per node|2|



**_You can find more information_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train)
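
The table can be turned into a quick sanity check before a run is submitted. This is a sketch only: `check_automl_settings` is a hypothetical helper, the metric lists are copied from the table above (forecasting metrics are not listed there, so they are not covered), and the concurrency rule is the one stated in the table.

```python
# Metric names copied from the "primary_metric" row of the table above.
PRIMARY_METRICS = {
    "classification": {
        "accuracy", "AUC_weighted", "average_precision_score_weighted",
        "norm_macro_recall", "precision_score_weighted",
    },
    "regression": {
        "spearman_correlation", "normalized_root_mean_squared_error",
        "r2_score", "normalized_mean_absolute_error",
    },
}


def check_automl_settings(task, settings, cores_per_node):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    allowed = PRIMARY_METRICS.get(task)
    metric = settings.get("primary_metric")
    if allowed is None:
        problems.append("no metric list recorded for task '%s'" % task)
    elif metric not in allowed:
        problems.append("primary_metric '%s' is not listed for %s" % (metric, task))
    concurrent = settings.get("max_concurrent_iterations", 1)
    if concurrent > cores_per_node:
        problems.append("max_concurrent_iterations %d exceeds %d cores per node"
                        % (concurrent, cores_per_node))
    return problems


settings = {"primary_metric": "AUC_weighted", "max_concurrent_iterations": 1}
print(check_automl_settings("classification", settings, cores_per_node=1))
```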

In [29]:
automl_settings = {
    "enable_early_stopping":True,
    #"experiment_timeout_hours":0.25,
    "iterations":5, # number of runs
    "iteration_timeout_minutes":5,
    "max_concurrent_iterations":1,
    "max_cores_per_iteration":-1,
    #"experiment_exit_score":0.9920,
    "model_explainability":True,
    "n_cross_validations":5,
    "primary_metric":'AUC_weighted',
    "featurization":'auto',
    "verbosity":logging.INFO, # {INFO, DEBUG, CRITICAL, ERROR, WARNING} -- debug_log=<*.log>
}

automl_config = AutoMLConfig(task='classification',
                             debug_log='automl_errors.log',
                             compute_target='local', # {training_cluster or 'local'}
                             #blacklist_models=['KNN','LinearSVM'],
                             enable_onnx_compatible_models=True,
                             training_data=dataset,
                             label_column_name="Diabetic",
                             **automl_settings
                            )
# outputs "model.pkl" and "automl_errors.log"
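
The `n_cross_validations` setting is the reason no separate validation dataset is supplied: every row is used for validation exactly once. The sketch below illustrates plain, unshuffled k-fold splitting in pure Python. It is not autoML's own splitting logic, which may shuffle or stratify, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, n_splits):
    """Yield (train_idx, valid_idx) index lists for n_splits contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


# Ten rows and five folds: each fold validates on two rows and trains on eight.
folds = list(k_fold_indices(n_rows=10, n_splits=5))
for train, valid in folds:
    print("validate on", valid, "train on", len(train), "rows")
```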

### Train pipelines

In [30]:
automl_run = experiment.submit(automl_config, show_output=True)

Running on local machine


DataException: DataException:
	Message: X should not be None
	InnerException None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "inner_error": {
            "code": "InvalidData"
        },
        "message": "X should not be None"
    }
}

### Optional: retrieve a run

In [10]:
runId = 'AutoML_a2a3083f-88d4-4094-8806-ab010e5ad643'
automl_run = AutoMLRun(experiment, run_id=runId)

## Results

### Explore the best pipeline

In [11]:
RunDetails(automl_run).show()
#automl_run.wait_for_completion()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

**option 1:** select any pipeline iteration 

In [None]:
best_run, fitted_model = automl_run.get_output(iteration=49)

**option 2:** select best pipeline iteration automatically

In [12]:
best_run, fitted_model = automl_run.get_output()

### Inspect model properties

In [15]:
# pipeline steps
for step in fitted_model.named_steps:
    print(step)

datatransformer
prefittedsoftvotingclassifier


In [24]:
# model properties
fitted_model.named_steps

{'datatransformer': DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
        feature_sweeping_config=None, feature_sweeping_timeout=None,
        featurization_config=None, force_text_dnn=None,
        is_cross_validation=None, is_onnx_compatible=None, logger=None,
        observer=None, task=None, working_dir=None), 'prefittedsoftvotingclassifier': PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('34', Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=[25, 75], with_centering=True,
       with_scaling=True)), ('lightgbmclassifier', LightGBMClassifier(boosting_type='gbdt', class_weight=None,
          colsample_bytree=0.99, importance_type=...ubsample=0.8415789473684211,
          subsample_for_bin=200000, subsample_freq=0, verbose=-10))]))],
               flatten_transform=None,
               weights=[0.7333333333333333, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666

In [17]:
# show all metrics
best_run.get_metrics()

{'average_precision_score_macro': 0.9901031726701672,
 'log_loss': 0.11217641416499095,
 'precision_score_micro': 0.9554,
 'f1_score_macro': 0.9497716749688514,
 'recall_score_micro': 0.9554,
 'balanced_accuracy': 0.9490746837008428,
 'recall_score_macro': 0.9490746837008428,
 'f1_score_micro': 0.9554,
 'average_precision_score_weighted': 0.992025634556071,
 'AUC_weighted': 0.9917229098997534,
 'f1_score_weighted': 0.9553619043715094,
 'accuracy': 0.9554,
 'AUC_micro': 0.9926484,
 'precision_score_macro': 0.9505294456109737,
 'matthews_correlation': 0.8995982190032044,
 'recall_score_weighted': 0.9554,
 'confusion_matrix': 'aml://artifactId/ExperimentRun/dcid.AutoML_a2a3083f-88d4-4094-8806-ab010e5ad643_49/confusion_matrix',
 'weighted_accuracy': 0.9604599701095635,
 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_a2a3083f-88d4-4094-8806-ab010e5ad643_49/accuracy_table',
 'norm_macro_recall': 0.8981493674016858,
 'AUC_macro': 0.9917229098997534,
 'precision_score_weighted':
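
Several of these metrics are related. For example, `norm_macro_recall` rescales `recall_score_macro` so that random guessing scores 0 and a perfect model scores 1. The sketch below checks that relationship against the values printed above, assuming the usual definition with a chance level of 1/C for C classes; `normalized_macro_recall` is a hypothetical helper.

```python
import math

# Values copied from the get_metrics() output above.
recall_score_macro = 0.9490746837008428
norm_macro_recall = 0.8981493674016858
n_classes = 2  # diabetic / not diabetic


def normalized_macro_recall(macro_recall, n_classes):
    """Rescale macro recall so that chance level maps to 0 and perfect recall to 1."""
    chance = 1.0 / n_classes
    return (macro_recall - chance) / (1.0 - chance)


computed = normalized_macro_recall(recall_score_macro, n_classes)
print(math.isclose(computed, norm_macro_recall))
```

In the same output, `balanced_accuracy` equals `recall_score_macro`, as both are the mean of the per-class recalls.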

## Register

### Prepare

autoML generated a scoring script, environment file and model

In [19]:
# get the score and environment files
model_name = best_run.properties['model_name'] # score.py script will look for the name of the registered model

# make a local copy of the best scoring script, environment file and the model file
script_file_name = 'inference/score.py'
conda_env_file_name = 'inference/env.yml'
model_pickle_file_name = 'inference/model.pkl'
model_onnx_file_name = 'inference/model.onnx'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)
best_run.download_file('outputs/conda_env_v_1_0_0.yml', conda_env_file_name)
best_run.download_file('outputs/model.pkl', model_pickle_file_name)
best_run.download_file('outputs/model.onnx', model_onnx_file_name)

In [20]:
! cat inference/env.yml

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - azureml-train-automl-runtime==1.2.0
  - inference-schema
  - azureml-explain-model==1.2.0
  - azureml-defaults==1.2.0
- numpy>=1.16.0,<=1.16.2
- pandas>=0.21.0,<=0.23.4
- scikit-learn>=0.19.0,<=0.20.3
- py-xgboost<=0.90
- fbprophet==0.5
- psutil>=5.2.2,<6.0.0
channels:
- anaconda
- conda-forge


In [21]:
! cat inference/score.py

# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
import json
import pickle
import numpy as np
import pandas as pd
import azureml.train.automl
from sklearn.externals import joblib
from azureml.core.model import Model

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType


input_sample = pd.DataFrame({'PatientID': pd.Series(['1354778.0'], dtype='float64'), 'Pregnancies': pd.Series(['0.0'], dtype='float64'), 'PlasmaGlucose': pd.Series(['171.0'], dtype='float64'), 'DiastolicBloodPressure': pd.Series(['80.0'], dtype='float64'), 'TricepsThickness': pd.Series(['34.0'], dtype='float64'), 'SerumInsulin': pd.Series(['23.0'], dtype='float64'), 'BMI': pd.Series([

### Register the model

**Option 1:** from workspace /outputs folder with .register_model()

In [2]:
model = best_run.register_model(model_name=model_name, # registered model name used in scoring script init()
                                #model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                                model_path='outputs/model.pkl', # fixed path in workspace {'model.pkl', 'model.onnx'}
                                tags={'Training context': 'autoML Training'},
                                properties={'AUC': best_run.get_metrics()['AUC_weighted'],
                                            'Accuracy': best_run.get_metrics()['accuracy']},
                                description="Classification model to predict diabetes")

NameError: name 'best_run' is not defined

**Option 2:** from local /path/model folder with Model.register()

In [28]:
model = Model.register(workspace=ws,
                       model_name=model_name, # registered model name used in scoring script init()
                       #model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                       model_path='inference/model.pkl', # local file {'model.pkl', 'model.onnx'}
                       tags={'Training context': 'autoML Training'},
                       properties={'AUC': best_run.get_metrics()['AUC_weighted'],
                                   'Accuracy': best_run.get_metrics()['accuracy']},
                       description="Classification model to predict diabetes")

Registering model AutoMLa2a3083f849


**Optional:** Load the model

In [30]:
# list all registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

AutoMLa2a3083f849 version: 2
	 Training context : autoML Training
	 AUC : 0.9917229098997534
	 Accuracy : 0.9554


AutoMLa2a3083f849 version: 1
	 Training context : autoML Training
	 AUC : 0.9917229098997534
	 Accuracy : 0.9554




In [31]:
# load the registered model for deployment (latest version)
model = ws.models[model_name] # or replace with any registered modelname from Model.list(ws)
model

Model(workspace=Workspace.create(name='alehanwagaatetnuzijn', subscription_id='43c1f93a-903d-4b23-a4bf-92bd7a150627', resource_group='myResourceGroup'), name=AutoMLa2a3083f849, id=AutoMLa2a3083f849:2, version=2, tags={'Training context': 'autoML Training'}, properties={'AUC': '0.9917229098997534', 'Accuracy': '0.9554'})

## Deploy

### Deploy model as webservice (ACI)

A Linux Azure Container Instance with 1 vCPU and 1 GB of RAM costs about €28 per month

In [40]:
# Configure the scoring environment
service_name = "automl-projname-service" # only lowercase letters, numbers, or dashes

# Remove any existing service under the same name
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    print('"' + service_name + '" does not exist, creating the webservice...')

myenv = Environment.from_conda_specification(name="myenv", file_path=conda_env_file_name)
inference_config = InferenceConfig(entry_script=script_file_name, environment=myenv)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1)

# build container from environment, start webservice ACI and deploy inference scripts
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)

webservice 'automl-projname-service' does not exist, creating webservice...
Running..................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
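
The service name comment in the cell above ("only lowercase letters, numbers, or dashes") can be checked before deployment. This sketch enforces just that stated rule; Azure may apply further limits, such as name length, that are not checked here. `is_valid_service_name` is a hypothetical helper.

```python
import re

# Only the rule stated in the deployment cell: lowercase letters, digits, dashes.
_NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_service_name(name):
    """Return True when the whole name uses only lowercase letters, digits or dashes."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_service_name("automl-projname-service"))
print(is_valid_service_name("AutoML_Service"))
```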


**Optional:** load a running webservice

In [45]:
# list available webservices
for i in ws.webservices:
    print(i)

automl-projname-service
automl-projname-service-4


In [33]:
service_name = "automl-projname-service" # only lowercase letters, numbers, or dashes
service = Webservice(ws, service_name)

In [None]:
# get webservice logs
print(service.get_logs())

## Test

### Webservice inference test

Send an HTTP POST request with test data to the model to get a prediction.  
In this example we test whether a person is diabetic (1) or not diabetic (0).  
Each test record must be a list of 9 values, from which the model predicts a binary class.  
We demonstrate both the **service** and the **requests** method to send a prediction request.  
The 'Postman' application or the 'REST Client' extension in VS Code work as well.  

|Web API|Example value|Options|
|-|-|-|
|**HTTP method**|POST|<i>POST</i><br><i>GET</i>|
|**URI**|http://3bb0618b-ef7b-4b17-af32-a52f9c64f4d5.northeurope.azurecontainer.io/score||
|**Header**|{Content-Type: application/json}||
|**Body**|{"data": [[5, 2, 180, 74, 24, 21, 24, 1.5, 22], <br>[6, 0, 148, 58, 11, 179, 39, 0.16, 45]]}|<i>one or </i><br><i>more records</i>|
|**Response**|{"result": [1, 0]}|<i>json object</i>|

In [109]:
# get webservice URI
endpoint = service.scoring_uri

# raw test data
rawdata = [[5, 2, 180, 74, 24, 21, 24, 1.5, 22],
           [6, 0, 148, 58, 11, 179, 39, 0.16, 45]]

print("URI: " + endpoint)
print("Body: " + json.dumps({"data": rawdata})) # convert array to a serialized JSON formatted string object

URI: http://a7de5f4e-8473-4f42-a5d2-5ff79eb7429d.northeurope.azurecontainer.io/score
Body: {"data": [[5, 2, 180, 74, 24, 21, 24, 1.5, 22], [6, 0, 148, 58, 11, 179, 39, 0.16, 45]]}


**Test 1:** service.run()

In [105]:
service.run(json.dumps({"data": rawdata}))

'{"result": [1, 0]}'

**Test 2:** requests.post()

In [106]:
response = requests.post(endpoint, json={"data": rawdata})
response.json()

'{"result": [1, 0]}'
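
Both tests return the prediction as a JSON string rather than a Python object, so it has to be decoded before the class labels can be used. The sketch below builds the request body and decodes such a response; `build_body` and `parse_predictions` are hypothetical helpers, and the record length of 9 is taken from the text above.

```python
import json

N_VALUES = 9  # values per record, as stated in the text above


def build_body(records):
    """Serialize test records into the JSON body expected by the scoring URI."""
    for record in records:
        if len(record) != N_VALUES:
            raise ValueError("expected %d values, got %d" % (N_VALUES, len(record)))
    return json.dumps({"data": records})


def parse_predictions(payload):
    """Decode a response that may still be a (possibly nested) JSON string."""
    while isinstance(payload, str):
        payload = json.loads(payload)
    return payload["result"]


rawdata = [[5, 2, 180, 74, 24, 21, 24, 1.5, 22],
           [6, 0, 148, 58, 11, 179, 39, 0.16, 45]]
print(build_body(rawdata))
print(parse_predictions('{"result": [1, 0]}'))
```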

When you are finished testing your service, clean up the deployment with service.delete()

In [111]:
#service.delete()

## CustomML

Inspired by the autoML results, this section develops a custom model (customML) as an alternative.  
The inline method is used to develop and test; the model is then trained locally or on remote compute, deployed and tested.  

1. option1: inline method
1. option2: script method
1. create an inference script
1. create an inference environment
1. register the model
1. deploy the model
1. inference test

in edit:

1. interactive inline method
1. create a scoring script
1. create environment
1. train model on remote compute
1. create an entry script
1. create an inference environment
1. register the model
1. deploy the model
1. inference test

In [32]:
ws = Workspace.from_config()

### Option 1: Inline method

In [33]:
from azureml.core import Experiment
from azureml.core import Model
from azureml.core import Datastore
from azureml.core import Dataset
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Create an Azure ML experiment in your workspace
experiment = Experiment(workspace=ws, name="diabetes-training")
run = experiment.start_logging()
print("Starting experiment:", experiment.name)

# load the diabetes dataset (File method)
print("Loading data lake gen2 data in a pandas dataframe...")
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
dataset = Dataset.File.from_files(path=ds_path)
mount_context = dataset.mount(mount_point='/tmp/platinum') # read-only mount from delta lake
mount_context.start()
diabetes = pd.read_parquet('/tmp/platinum/diabetes.parquet') # {'/tmp/path/'} can load latest delta lake parquet files
mount_context.stop()

# load the diabetes dataset (Tabular method)
# print("Loading data lake gen2 data in a pandas dataframe...")
# ds = Datastore.get(ws, 'datalakestoragegen2')
# ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
# dataset = Dataset.Tabular.from_parquet_files(path=ds_path) # {delimited, json, parquet, sql}
# diabetes = dataset.to_pandas_dataframe() # create a pandas dataframe

# Separate features and labels as numpy array
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))  # np.float was removed in NumPy 1.24

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))  # np.float was removed in NumPy 1.24

# Save the trained model
model_file = 'diabetes_model.pkl'
joblib.dump(value=model, filename=model_file) # backup model local
run.upload_file(name='outputs/' + model_file,
                path_or_stream='./' + model_file) # save model to workspace

# Complete the run
run.complete()

Starting experiment: diabetes-training
Loading data lake gen2 data in a pandas dataframe...
Training a decision tree model
Accuracy: 0.8873333333333333
AUC: 0.8741181153291208
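
The two logged numbers measure different things: accuracy counts correct hard predictions, while AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. As a pure-Python sketch of what `roc_auc_score` computes on the second column of `predict_proba` (the helper names and sample values are illustrative, not taken from the diabetes data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))
print(accuracy(labels, [0, 0, 0, 1]))
```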


### Option 2: Script method

Create training script

In [34]:
# Create a local folder
import os

# Create a folder for the experiment files
folder_name = 'diabetes_service'
experiment_folder = './' + folder_name
os.makedirs(folder_name, exist_ok=True)
print(folder_name, 'folder created')

diabetes_service folder created


In [80]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import argparse
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import glob
import os  # needed for os.makedirs() further down
print("libraries imported...")

# Set regularization hyperparameter (passed as an argument to the script)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate
print("argparse parameters loaded...")

# Get the experiment run context
run = Run.get_context()
print("run context loaded...")

# load the diabetes dataset (File method)
# Get the training data from the estimator input identified as 'diabetes'
mount = run.input_datasets['diabetes'] # read-only mount from delta lake as '/mnt/data'
print("delta lake mounted...")
diabetes = pd.read_parquet('/mnt/data/diabetes.parquet') # load any file(s) from this delta lake mounted folder
print("dataset loaded...")

# save data into workspace
diabetes.to_csv("outputs/dataset.csv", index=False) # {logs/  outputs/}
print("test: write dataset to workspace 'outputs/dataset.csv'")

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))  # np.float was removed in NumPy 1.24
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note: files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Overwriting ./diabetes_service/diabetes_training.py
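
The training script takes the regularization rate on the command line and passes its inverse to `LogisticRegression` as `C`, so a larger rate means a smaller `C` and stronger regularization. The sketch below replays just the argument parsing with an explicit argument list; `parse_regularization` is a hypothetical wrapper around the same `argparse` definition.

```python
import argparse


def parse_regularization(argv):
    """Parse --regularization the same way the training script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, dest='reg_rate',
                        default=0.01, help='regularization rate')
    return parser.parse_args(argv).reg_rate


reg = parse_regularization(['--regularization', '0.1'])
C = 1 / reg  # inverse regularization strength passed to LogisticRegression
print('regularization rate:', reg, 'C:', C)
```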


Create training environment

In [100]:
myenv = Environment("training_environment")
myenv.docker.enabled = True
myenv.python.user_managed_dependencies = False
conda_packages = ['scikit-learn', 'joblib', 'python==3.6.2']
pip_packages = ['azureml-defaults', 'azureml-dataprep[pandas,fuse]', 'pyarrow', 'fastparquet']
myenv.python.conda_dependencies = CondaDependencies.create(conda_packages=conda_packages, pip_packages=pip_packages)
myenv.register(ws)

{
    "name": "training_environment",
    "version": "2",
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "python": {
        "userManagedDependencies": false,
        "interpreterPath": "python",
        "condaDependenciesFile": null,
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-forge"
            ],
            "dependencies": [
                "python=3.6.2",
                {
                    "pip": [
                        "azureml-defaults",
                        "azureml-dataprep[pandas,fuse]",
                        "pyarrow",
                        "fastparquet"
                    ]
                },
                "scikit-learn",
                "joblib"
            ],
            "name": "azureml_b9a1534962684a800c586e9fce04292e"
        }
    },
    "docker": {
        "enabled": true,
        "baseImage": "mcr.microsoft.com/azu

In [102]:
# list environments
env_names = Environment.list(workspace=ws)
for env_name in env_names:
    print('Name:',env_name)

Name: training_environment
Name: AzureML-PyTorch-1.3-GPU
Name: AzureML-TensorFlow-2.0-CPU
Name: AzureML-Tutorial
Name: AzureML-PyTorch-1.3-CPU
Name: AzureML-TensorFlow-2.0-GPU
Name: AzureML-Chainer-5.1.0-GPU
Name: AzureML-Minimal
Name: AzureML-PyTorch-1.2-CPU
Name: AzureML-TensorFlow-1.12-CPU
Name: AzureML-TensorFlow-1.13-CPU
Name: AzureML-PyTorch-1.1-CPU
Name: AzureML-TensorFlow-1.10-CPU
Name: AzureML-PyTorch-1.0-GPU
Name: AzureML-TensorFlow-1.12-GPU
Name: AzureML-TensorFlow-1.13-GPU
Name: AzureML-Chainer-5.1.0-CPU
Name: AzureML-PyTorch-1.0-CPU
Name: AzureML-Scikit-learn-0.20.3
Name: AzureML-PyTorch-1.2-GPU
Name: AzureML-PyTorch-1.1-GPU
Name: AzureML-TensorFlow-1.10-GPU
Name: AzureML-PySpark-MmlSpark-0.15
Name: AzureML-AutoML
Name: AzureML-PyTorch-1.4-GPU
Name: AzureML-PyTorch-1.4-CPU
Name: AzureML-VowpalWabbit-8.8.0
Name: AzureML-Hyperdrive-ForecastDNN
Name: AzureML-AutoML-GPU
Name: AzureML-AutoML-DNN-GPU
Name: AzureML-AutoML-DNN
Name: AzureML-Designer-R
Name: AzureML-Designer-Recomm

Create and register file datasets

In [122]:
# load the diabetes dataset (File method)
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/**')] # {path/*.parquet or path/**}
file_ds = Dataset.File.from_files(path=ds_path)
   
# Register the file dataset
try:
    file_ds = file_ds.register(workspace=ws,
                               name='diabetes file dataset',
                               description='diabetes files',
                               tags={'format': 'parquet'},
                               create_new_version=True)
    # only report success when registration did not raise
    print('Dataset registered')
except Exception as ex:
    print(ex)

Dataset registered


In [123]:
# show a list of registered dataset(s)
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, '\t version', dataset.version)

Datasets:
	 diabetes file dataset 	 version 1


In [124]:
# list of the file path(s)
for file_path in file_ds.to_path():
    print(file_path)

/diabetes.csv
/diabetes.parquet
/folder/diabetes2.csv
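
The `platinum/**` pattern matches every file under the folder, so the dataset mixes CSV and parquet files. A script that reads the mounted files may want to keep only one format. The sketch below is a minimal illustration, assuming the three paths printed above; they are hard-coded so it runs without a workspace, whereas in the notebook they would come from `file_ds.to_path()`. `select_by_suffix` is a hypothetical helper, not part of the SDK.

```python
from pathlib import PurePosixPath

def select_by_suffix(paths, suffix):
    """Return the paths whose file extension equals `suffix` (e.g. '.parquet')."""
    return [p for p in paths if PurePosixPath(p).suffix == suffix]

# paths copied from the listing above (assumption: same dataset contents)
dataset_paths = ['/diabetes.csv', '/diabetes.parquet', '/folder/diabetes2.csv']

parquet_paths = select_by_suffix(dataset_paths, '.parquet')
csv_paths = select_by_suffix(dataset_paths, '.csv')
print(parquet_paths)
print(csv_paths)
```

Narrowing the datastore path itself (for example `path/*.parquet`, as noted in the cell comment) is the alternative when only one format should be registered.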


### Train model

In [125]:
# Set the script parameters
script_params = {
    '--regularization': 0.1
}

# get the registered dataset by name
file_ds = Dataset.get_by_name(ws, "diabetes file dataset")

# Get the docker environment
training_env = Environment.get(ws, 'training_environment')

# get the training compute cluster
training_cluster = ComputeTarget(ws, 'aml-cluster')

estimator = Estimator(source_directory=experiment_folder, # all files in this directory are uploaded to the compute target
                      compute_target='local', # pass training_cluster instead to run on the remote cluster
                      entry_script='diabetes_training.py',
                      script_params=script_params,
                      environment_definition=training_env,
                      inputs=[file_ds.as_named_input('diabetes').as_mount(path_on_compute='/mnt/data')])

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)
# Run the experiment
run = experiment.submit(config=estimator)

# Show the run details while running
RunDetails(run).show()


### Create inference script

In [44]:
import os

# Create a local folder for the service files
folder_name = 'diabetes_service'
experiment_folder = './' + folder_name
os.makedirs(folder_name, exist_ok=True)
print(folder_name, 'folder created')

diabetes_service folder created


In [45]:
%%writefile $folder_name/diabetes_score.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load a registered model
    model_path = Model.get_model_path(model_name='diabetes_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Get the corresponding classname for each prediction (0 or 1)
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = []
    for prediction in predictions:
        predicted_classes.append(classnames[prediction])
    # Return the predictions as JSON
    return json.dumps(predicted_classes)

Overwriting diabetes_service/diabetes_score.py
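
Before deploying, the body of `run()` can be exercised locally without the workspace. The sketch below mirrors the scoring logic above but takes the model as an argument; `StubModel` and `score` are hypothetical names, and the stub's rule (glucose above 150 means diabetic) is an arbitrary stand-in for the registered model, not its real behaviour.

```python
import json
import numpy as np

class StubModel:
    """Hypothetical stand-in for the registered model.

    Flags a row as diabetic (1) when the second feature exceeds 150.
    """
    def predict(self, data):
        return (data[:, 1] > 150).astype(int)

def score(raw_data, model):
    """Same steps as run() in diabetes_score.py, with the model passed in."""
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Map each prediction (0 or 1) to its class name
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = [classnames[int(p)] for p in predictions]
    # Return the predictions as JSON
    return json.dumps(predicted_classes)

payload = json.dumps({"data": [[2, 180, 74, 24, 21, 24, 1.5, 22],
                               [0, 148, 58, 11, 179, 39, 0.16, 45]]})
result = score(payload, StubModel())
print(result)
```

This checks only the JSON parsing and the class-name mapping; the predictions of the deployed service depend on the trained model.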


### Create inference environment

In [46]:
from azureml.core.conda_dependencies import CondaDependencies

# Add the dependencies for our model (azureml-defaults is already included)
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")

# Save the environment config as a .yml file
env_file = folder_name + "/diabetes_env.yml"
with open(env_file, "w") as f:
    f.write(myenv.serialize_to_string())
print("Saved inference environment file in", env_file)

# Print the .yml file
with open(env_file,"r") as f:
    print(f.read())

Saved inference environment file in diabetes_service/diabetes_env.yml
# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults

- scikit-learn
channels:
- anaconda
- conda-forge



### Register the model

In [47]:
# define model name
model_name = 'diabetes_model'

# register the model file produced by the training run
metrics = run.get_metrics()
run.register_model(model_name=model_name, # registered model name used in scoring script init()
                   model_path='outputs/diabetes_model.pkl', # path of the model file within the run outputs
                   tags={'Training context': 'Custom Training'},
                   properties={'AUC': metrics['AUC'],
                               'Accuracy': metrics['Accuracy']},
                   description="Classification model to predict diabetes")

print('Model trained and registered')

Model trained and registered


### Deploy the model

In [48]:
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import InferenceConfig, Model
from azureml.exceptions import WebserviceException

service_name = "diabetes-service"

# Remove any existing service under the same name
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    print('"' + service_name + '" does not exist, creating the webservice...')

# Configure the scoring environment
inference_config = InferenceConfig(runtime="python",
                                   source_directory=folder_name,
                                   entry_script="diabetes_score.py",
                                   conda_file="diabetes_env.yml")

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1)

# load the registered model
model = ws.models['diabetes_model']

service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)

service.wait_for_deployment(show_output=True)
print(service.state)

"diabetes-service" does not exist, creating the webservice...
Running...............................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


### Inference test

In [49]:
# get webservice URI
endpoint = service.scoring_uri

# raw test data
rawdata = [[2, 180, 74, 24, 21, 24, 1.5, 22],
           [0, 148, 58, 11, 179, 39, 0.16, 45]]

# serialize the test data once to a JSON string for the request body
body = json.dumps({"data": rawdata})

print("URI: " + endpoint)
print("Body: " + body)

service.run(body)

URI: http://76a76dec-913b-4bca-ae00-1721c8f5bcca.northeurope.azurecontainer.io/score
Body: {"data": [[2, 180, 74, 24, 21, 24, 1.5, 22], [0, 148, 58, 11, 179, 39, 0.16, 45]]}


'["not-diabetic", "not-diabetic"]'
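
`service.run()` goes through the SDK. A client without the SDK can POST the same body to the scoring URI. The sketch below uses only the standard library and assumes the service was deployed without authentication, as in the deployment configuration above. `build_scoring_request` and `score_over_http` are hypothetical helper names, and the URL is a placeholder rather than a real endpoint. Because `run()` already returns a JSON string, the decoded response may itself be a string that needs a second `json.loads`.

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, rows):
    """Build a POST request carrying {"data": rows} as a JSON body."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(scoring_uri,
                                  data=body,
                                  headers={"Content-Type": "application/json"},
                                  method="POST")

def score_over_http(scoring_uri, rows, timeout=30):
    """Send the rows to the webservice and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

rawdata = [[2, 180, 74, 24, 21, 24, 1.5, 22],
           [0, 148, 58, 11, 179, 39, 0.16, 45]]

# placeholder URI (assumption); in the notebook use service.scoring_uri
request = build_scoring_request("http://example.invalid/score", rawdata)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `score_over_http(endpoint, rawdata)` against the live endpoint would send the request; it is not executed here because it needs the deployed service.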

When you have finished testing the service, clean up the deployment with `service.delete()`.

In [50]:
service.delete()

## Serverless

## Finetuning