Author: Kevin ALBERT  

Created: April 2020  

# Automated Machine Learning
_**Classification project with data residing on a data lake gen2 using remote compute with autoML and model registration**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Deploy](#Deploy)
1. [Test](#Test)

## Introduction

Data Factory writes the cleaned datasets to a Data Lake Storage Gen2 account (Delta Lake).  
This notebook uses that data and remote compute to train a classification model with AutoML.  
The example data is used to classify patients as diabetic or non-diabetic based on 8 features.  

This notebook shows how to:
1. Setup packages
2. Setup workspace
3. Create an experiment
4. Load data
5. Setup compute
6. Configure AutoML
7. Train models
8. Explore the pipelines
9. Deploy model
10. Test the fitted model

## Setup

* required
  * disable Shields in the Brave web browser so that the widgets work
  * download **config.json** from the machine learning workspace portal
  * install the extra azureml packages in **py37_default** when using **'local'** compute  
  * split the data into train and test datasets on the data lake; a validation dataset is not needed because cross-validation is used
* optional
  * register datastore(s) manually
  * register dataset(s) manually
  * register compute cluster(s) manually
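
The **config.json** file from the portal is what `Workspace.from_config()` reads later in this notebook. A minimal sketch for checking the file before running the other cells, assuming it carries the three keys the portal download normally contains (`subscription_id`, `resource_group`, `workspace_name`). The helper name `check_workspace_config` is made up for this example and is not part of the SDK.

```python
import json
from pathlib import Path

# Keys normally present in the config.json downloaded from the workspace portal (assumption)
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path="config.json"):
    """Return the parsed config, or raise ValueError listing the missing keys."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config.json is missing: " + ", ".join(missing))
    return config
```

Calling `check_workspace_config()` before `Workspace.from_config()` gives a readable error when the file is incomplete.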

In [None]:
! /anaconda/envs/py37_default/bin/python -m pip install -U azureml-sdk[explain,automl] azureml-widgets

### Import open-source packages

In [1]:
import logging
# import os
# import random
# import re
# import lightgbm
import pandas as pd
# import numpy as np
# import json
# import csv
# from matplotlib import pyplot as plt
# from matplotlib.pyplot import imshow
# from sklearn import datasets
# from shutil import copy2
# import seaborn as sns
# sns.set(color_codes='True')

# import warnings
# warnings.filterwarnings('ignore')

### Import azure machine learning SDK packages

In [2]:
#import azureml.core
from azureml.core import Workspace
#from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core.experiment import Experiment
from azureml.core import Dataset
from azureml.core import Datastore
from azureml.data.datapath import DataPath
from azureml.core.compute import ComputeTarget
from azureml.core.compute import AmlCompute
from azureml.core.compute import AksCompute
# from azureml.core.compute_target import ComputeTargetException
# from azureml.core.image import Image
from azureml.core.model import Model
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.widgets import RunDetails
from azureml.core import Run
from azureml.core.webservice import Webservice
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import AksWebservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment

### Workspace

In [4]:
# load the workspace
ws = Workspace.from_config()

### Experiment

In [5]:
# choose an experiment name
experiment = Experiment(ws, 'automl-classification')

### Data

Data Factory has prepared the data from /bronze to /silver to /gold and /platinum for model training.  
**Note:** for this demonstration the files are in the /platinum folder of the Data Lake Gen2 container `datalake`:  
  * /datalake/platinum/diabetes.csv
  * /datalake/platinum/diabetes.parquet
  * copy from ../data/platinum/*
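
The setup notes ask for the data to be split into train and test datasets before it lands on the data lake. A minimal local sketch of such a split, using only the standard library. The function name `split_csv`, the 30% test fraction, and the output file names are assumptions for illustration, not the split the author used.

```python
import csv
import random


def split_csv(source, train_path, test_path, test_fraction=0.3, seed=42):
    """Shuffle the rows of a CSV file and write train and test files with the same header."""
    with open(source, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = int(round(len(rows) * test_fraction))
    parts = {test_path: rows[:n_test], train_path: rows[n_test:]}

    for path, part in parts.items():
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(part)
    # Number of data rows written to the train file and to the test file
    return len(rows) - n_test, n_test
```

For example, `split_csv('diabetes.csv', 'diabetes_train.csv', 'diabetes_test.csv')` writes two files that can then be copied to the /platinum folder.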

Register the Data Lake Gen2 storage as a **blob container** datastore  
**optional:** the datastore can also be registered manually in the ML workspace

In [None]:
import os

# read the storage key from an environment variable instead of hard-coding it in the notebook
# (DATALAKE_ACCOUNT_KEY is a variable name chosen for this example)
ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="datalakestoragegen2",
    container_name="datalake",
    account_name="datalake21032020",
    account_key=os.environ["DATALAKE_ACCOUNT_KEY"],
    create_if_not_exists=False)
# list available datastores
ws.datastores

Register file(s) as a tabular dataset  
**Note:** do not import Delta Lake parquet file(s) directly  
**Fix:** import single gold/*.csv or gold/*.parquet file(s) written with pandas instead  

In [6]:
# load datastore
ds = Datastore.get(ws, 'datalakestoragegen2')
# show datastore settings
ds

{
  "name": "datalakestoragegen2",
  "container_name": "datalake",
  "account_name": "datalake21032020",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

option 1: loading *.parquet

In [7]:
# setup parquet file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')]
dataset = Dataset.Tabular.from_parquet_files(path=ds_path)
# show dataset settings
dataset

{
  "source": [
    "('datalakestoragegen2', 'platinum/diabetes.parquet')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ]
}

option 2: loading *.csv

In [None]:
# setup csv file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=ds_path)
# show dataset settings
dataset

option 3: loading a registered dataset (manually register in ML workspace)

In [None]:
# list available datasets
ws.datasets

In [None]:
# load a registered dataset
dataset = Dataset.get_by_name(ws, 'diabetes_parquet_from_datastore_datalakegen2')
# show dataset settings
dataset

### Compute

Check the available VM size **names** before creating an auto-scaling cluster

In [8]:
# example: list all VM sizes with 1 vCPU and no GPU
vm_df = pd.DataFrame(AmlCompute.supported_vmsizes(ws))
vm_df[(vm_df.vCPUs == 1) & (vm_df.gpus == 0)]

| |gpus|maxResourceVolumeMB|memoryGB|name|vCPUs|
|-|-|-|-|-|-|
|0|0|51200|3.5|Standard_D1_v2|1|
|8|0|7168|3.5|Standard_DS1_v2|1|
|18|0|51200|3.5|Standard_D1|1|


option 1: Create training cluster  

In [None]:
# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'
# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D1_v2',
                                                       min_nodes=0, # scales down to zero nodes, so an idle cluster is not billed
                                                       max_nodes=10, # depends on the quota limits
                                                       vm_priority='dedicated', # {lowpriority, dedicated}
                                                       admin_username='ubuntu',
                                                       admin_user_password='ABCD1234abcd',
                                                       idle_seconds_before_scaledown=120, # {default: 120}
                                                      )
# Create the compute
training_cluster = ComputeTarget.create(ws, compute_name, compute_config)
training_cluster.wait_for_completion(show_output=True)

option 2: Load an existing training cluster

In [9]:
# list all available training cluster(s):
for cluster in ws.compute_targets:
    print(cluster)

aml-cluster


In [10]:
# load the training cluster
compute_name = 'aml-cluster'
training_cluster = ComputeTarget(ws, name=compute_name)

## Train

### Configure autoML
Define settings to run the experiment.

|Property|Description|Options|
|-|-|-|
|**task**|the type of problem to solve|<i>classification</i><br><i>regression</i><br><i>forecasting</i>|
|**compute_target**|local: iterations run one after another on the local DSVM<br>remote: iterations run in parallel on AML or AKS compute|<i>local</i><br><i>training_cluster</i>|
|**primary_metric**|the metric you want to optimize|**classification:**<br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i><br><br>**regression:**<br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**training_data**|input dataset, containing both X_train and y_train|<i>DataFrame</i><br><i>Dataset</i><br><i>DatasetDefinition</i><br><i>TabularDataset</i>|
|**validation_data**|separate validation dataset; not needed here because cross-validation is used|N/A|
|**label_column_name**|the name of the 'target' or 'label' column||
|**enable_early_stopping**|stop the run if metric score is not improving|<i>True</i><br><i>False</i>|
|**n_cross_validations**|number of cross validation splits|5|
|**experiment_timeout_hours**|maximum time in hours before the experiment terminates (+15min)|<i>0.25</i>|
|**max_concurrent_iterations**|maximum number of iterations run in parallel; on an AML cluster keep it less than or equal to the maximum number of nodes|2|



**_You can find more information_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train)
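
With **n_cross_validations** set to 5, every row is used for validation exactly once, which is why no separate validation dataset is registered. The idea can be sketched in plain Python as below. This illustrates k-fold splitting in general and is not AutoML's own implementation, which may shuffle or stratify the rows. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, n_splits=5):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    if not 2 <= n_splits <= n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    indices = list(range(n_rows))
    # The first (n_rows % n_splits) folds get one extra row
    fold_sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Show which rows are held out in each fold of a 10-row dataset
for fold, (train, validation) in enumerate(k_fold_indices(10, n_splits=5)):
    print("fold", fold, "validates on rows", validation)
```

Each model is trained five times, once per fold, and the reported metric is aggregated over the five validation parts.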

In [11]:
automl_settings = {
    "enable_early_stopping":True,
    #"experiment_timeout_hours":0.25,
    "iterations":10, # number of runs
    "iteration_timeout_minutes":5,
    "max_concurrent_iterations":1,
    "max_cores_per_iteration":-1,
    "experiment_exit_score":0.9920,
    "model_explainability":True,
    "n_cross_validations":5,
    "primary_metric":'AUC_weighted',
    "featurization":'auto',
    "verbosity":logging.INFO, # {INFO, DEBUG, CRITICAL, ERROR, WARNING} -- debug_log=<*.log>
}

automl_config = AutoMLConfig(task='classification',
                             debug_log='automl_errors.log',
                             compute_target='local', # {training_cluster or 'local'}
                             blacklist_models=['KNN','LinearSVM'],
                             enable_onnx_compatible_models=True,
                             training_data=dataset,
                             label_column_name="Diabetic",
                             **automl_settings
                            )
# outputs "model.pkl" and "automl_errors.log"
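
The primary metric chosen above, `AUC_weighted`, averages the one-vs-rest AUC of each class, weighted by the number of true rows in that class. A minimal pure-Python sketch of that calculation follows. It is meant to show what the number represents, not to reproduce AutoML's internal computation, and the function names and example scores are made up for illustration.

```python
def binary_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count half)."""
    positives = [score for label, score in zip(labels, scores) if label]
    negatives = [score for label, score in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def auc_weighted(y_true, class_scores):
    """One-vs-rest AUC per class, averaged with weights equal to each class's row count.

    class_scores maps each class label to its list of predicted scores.
    """
    total = len(y_true)
    result = 0.0
    for cls, scores in class_scores.items():
        is_cls = [y == cls for y in y_true]
        result += binary_auc(is_cls, scores) * sum(is_cls) / total
    return result


# Illustrative values only: four patients with predicted scores per class
y_true = [0, 0, 1, 1]
scores = {0: [0.9, 0.6, 0.65, 0.2], 1: [0.1, 0.4, 0.35, 0.8]}
print(auc_weighted(y_true, scores))
```

The pairwise comparison is quadratic in the number of rows, which is fine for a small illustration but not for large datasets.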

### Train models

In [12]:
automl_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_0a57b191-f184-41a3-b093-5e49d9a159eb

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Classes are balanced in the training data.

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.

******************************************************************************************

### Optional: retrieve a run

In [12]:
runId = 'AutoML_a2a3083f-88d4-4094-8806-ab010e5ad643'
automl_run = AutoMLRun(experiment, run_id=runId)

## Results

### Explore the pipelines

In [13]:
RunDetails(automl_run).show()
#automl_run.wait_for_completion()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

### option 1: select any pipeline iteration 

In [None]:
best_run, fitted_model = automl_run.get_output(iteration=49)

### option 2: select best pipeline iteration automatically

In [14]:
best_run, fitted_model = automl_run.get_output()

### inspect iteration

In [None]:
# pipeline steps
for step in fitted_model.named_steps:
    print(step)

In [None]:
# model properties
fitted_model.named_steps

In [None]:
# all run metrics
best_run.get_metrics()

## Deploy

### Register the model

In [15]:
# get the score and environment files
model_name = best_run.properties['model_name'] # score.py script will look for the name of the registered model

# make a local copy of the best scoring script, environment file and the model file
script_file_name = 'inference/score.py'
conda_env_file_name = 'inference/env.yml'
model_file_name = 'inference/model.pkl'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)
best_run.download_file('outputs/conda_env_v_1_0_0.yml', conda_env_file_name)
best_run.download_file('outputs/model.pkl', model_file_name)

In [16]:
model = best_run.register_model(model_name=model_name, # registered model name used in scoring script init()
                                model_path='outputs/model.pkl', # fixed path in workspace
                                tags={'Training context':'autoML Training'},
                                properties={'AUC': best_run.get_metrics()['AUC_weighted'],
                                            'Accuracy': best_run.get_metrics()['accuracy']})

### Load the model

In [21]:
# list all registered models
for registered_model in Model.list(ws): # separate loop variable, so `model` from the registration step is not overwritten
    print(registered_model.name, 'version:', registered_model.version)
    for tag_name, tag in registered_model.tags.items():
        print('\t', tag_name, ':', tag)
    for prop_name, prop in registered_model.properties.items():
        print('\t', prop_name, ':', prop)
    print('\n')

AutoMLa2a3083f849 version: 1
	 Training context : autoML Training
	 AUC : 0.9917229098997534
	 Accuracy : 0.9554




In [23]:
# load the registered model for deployment
model = ws.models[model_name] # or replace with any registered modelname from Model.list(ws)
model

Model(workspace=Workspace.create(name='alehanwagaatetnuzijn', subscription_id='43c1f93a-903d-4b23-a4bf-92bd7a150627', resource_group='myResourceGroup'), name=AutoMLa2a3083f849, id=AutoMLa2a3083f849:1, version=1, tags={'Training context': 'autoML Training'}, properties={'AUC': '0.9917229098997534', 'Accuracy': '0.9554'})

### option 1: deploy on Azure Container Instance (ACI)
A Linux ACI instance with 1 vCPU and 1 GB of RAM costs €28 per month

In [17]:
# Configure the scoring environment
service_name = "automl-projname-service" # only lowercase letters, numbers, or dashes

myenv = Environment.from_conda_specification(name="myenv", file_path=conda_env_file_name)
inference_config = InferenceConfig(entry_script=script_file_name, environment=myenv)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1)

# build the container from the environment, start the ACI webservice and deploy the inference script
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
service.wait_for_deployment(True)

Running.......................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
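
The deployment cell notes that the service name may only contain lowercase letters, numbers, or dashes. A small sketch that checks a candidate name against exactly that rule before deploying. Azure may enforce further rules (for example on length or on the first character) that are not checked here, and the helper name `is_valid_service_name` is hypothetical.

```python
import re

# Character rule quoted in the deployment cell: lowercase letters, numbers, or dashes
SERVICE_NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_service_name(name):
    """Return True when name uses only lowercase letters, digits, and dashes."""
    return SERVICE_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_service_name("automl-projname-service"))
print(is_valid_service_name("AutoML_Service"))
```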


In [27]:
# get webservice logs
print(service.get_logs())

/usr/sbin/nginx: /azureml-envs/azureml_29526d93bcbca0513e9c1ca0d57832a0/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29526d93bcbca0513e9c1ca0d57832a0/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29526d93bcbca0513e9c1ca0d57832a0/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29526d93bcbca0513e9c1ca0d57832a0/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_29526d93bcbca0513e9c1ca0d57832a0/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2020-04-07T13:37:14,623165703+00:00 - rsyslog/run 
2020-04-07T13:37:14,624618601+00:00 - iot-server/run 
2020-04-07T13:37:14,625049301+00:00 - gunicorn/run 
2020-04-07T13:37:14,628777396+00:00 - nginx/run 
rsyslogd

In [None]:
# list available webservices
ws.webservices

In [None]:
# delete webservice
#service.delete()

In [None]:
# HTTP requests to the web service
# determine the URL to which these applications must submit their requests
endpoint = service.scoring_uri
print(endpoint)

In [None]:
# send the patient data in JSON (or binary) format and receive the predicted class(es) back
import requests
import json

x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
         [0,148,58,11,179,39.19207553,0.160829008,45]]

# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Set the content type
headers = { 'Content-Type':'application/json' }

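response = requests.post(endpoint, input_json, headers=headers)
response.raise_for_status() # fail early on an HTTP error
result = json.loads(response.json())
# AutoML-generated scoring scripts usually wrap the predictions as {"result": [...]};
# a custom scoring script may return a plain list instead
predicted_classes = result["result"] if isinstance(result, dict) else result

for patient, predicted_class in zip(x_new, predicted_classes):
    print("Patient {}".format(patient), predicted_class)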
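
Because the model expects 8 features per patient, it can help to validate the rows before they are sent to the endpoint. A minimal sketch of such a check, assuming the `{"data": [...]}` payload shape used in the request above. The helper name `build_payload` is made up for this example.

```python
import json

N_FEATURES = 8  # the introduction states that the model uses 8 features


def build_payload(rows, n_features=N_FEATURES):
    """Validate the feature rows and return the JSON body for the scoring request."""
    for index, row in enumerate(rows):
        if len(row) != n_features:
            raise ValueError(
                "row {} has {} values, expected {}".format(index, len(row), n_features))
        # bool is a subclass of int, so it is excluded explicitly
        if not all(isinstance(value, (int, float)) and not isinstance(value, bool)
                   for value in row):
            raise ValueError("row {} contains a non-numeric value".format(index))
    return json.dumps({"data": rows})
```

`build_payload(x_new)` returns the same string as `json.dumps({"data": x_new})` when every row is well formed, and raises a `ValueError` otherwise.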

### option 2: Function Apps

## Test
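
Testing the fitted model comes down to comparing its predictions on the held-out test dataset with the true labels. A minimal sketch of that comparison in plain Python, assuming the `Diabetic` label is encoded as 1 for diabetic and 0 for non-diabetic and that both inputs are plain lists. The function name `evaluate` is hypothetical.

```python
from collections import Counter


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy plus confusion-matrix counts for a binary classifier."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    counts = Counter()
    for truth, prediction in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if prediction == positive else "fn"] += 1
        else:
            counts["fp" if prediction == positive else "tn"] += 1
    accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
    return {"accuracy": accuracy, "tp": counts["tp"], "tn": counts["tn"],
            "fp": counts["fp"], "fn": counts["fn"]}
```

The predictions could come from `fitted_model.predict(...)` on the test features or from the deployed web service; the true labels are the `Diabetic` column of the test dataset.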

## Manual model build + hyperparameter tuning...

In [None]:
# future work: look into hyperparameter tuning once a model has been chosen
# https://youtu.be/YlWCeY_CWEg?t=862