Author: Kevin ALBERT  

Created: April 2020  

In [1]:
from datetime import datetime
print ('latest TestRun: ' + datetime.now().strftime("%d %b %Y"))

latest TestRun: 05 Apr 2022


# Azure Machine Learning
_**Classification project with data residing on a data lake gen2 using remote compute with autoML and customML**_

## Contents
1. [AutoML](#AutoML)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Register](#Register)
1. [Deploy](#Deploy)
1. [Test](#Test)
1. [CustomML](#CustomML)
1. [Finetuning](#Finetuning)
1. [Pipelines](#Pipelines)

## Introduction

The cleaned datasets are created in Data Factory and written to a delta lake (Data Lake Gen2).  
This notebook uses the data lake data and remote compute to train a classification model with autoML.  
The example data is used to predict whether a patient is diabetic or non-diabetic based on 8 features.  

This notebook shows how to:
1. Setup packages
1. Setup workspace
1. Create an experiment
1. Load data
1. Setup compute
1. Configure autoML
1. Train pipelines
1. Explore the best pipeline
1. Inspect model properties
1. Register the model
1. Deploy model as webservice
1. Webservice inference test
1. customML inline method
1. customML script method
1. Hyperparameter tuning
1. Pipelines endpoint

## Setup

* required
  * **disable Shields in the Brave** web browser so the widgets work
  * download **config.json** from the machine learning workspace portal
  * install the extra azureml packages in **py37_default** when using **'local'** compute  
  * split the data into a train and a test dataset on the data lake; a validation dataset is not needed because cross-validation is used
* optional
  * register datastore(s) manually
  * register dataset(s) manually
  * register compute cluster(s) manually
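
`Workspace.from_config()` reads the workspace coordinates from the downloaded **config.json**. The sketch below is a small sanity check that could be run on that file before loading the workspace. It assumes the file contains the three keys `subscription_id`, `resource_group` and `workspace_name`; the helper name `missing_workspace_keys` and the placeholder values are hypothetical.

```python
import json

# keys assumed to be present in the config.json downloaded from the portal
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")

def missing_workspace_keys(config):
    """Return the required keys that are absent or empty in a parsed config.json."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]

# placeholder content, not a real workspace
example = json.loads("""
{
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>"
}
""")

print(missing_workspace_keys(example))                      # complete file
print(missing_workspace_keys({"workspace_name": "my-ws"}))  # incomplete file
```

An empty list means the file has everything the check looks for; any listed key would have to be added before loading the workspace.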

### Import open-source packages

In [2]:
# environment packages
import platform
import psutil
import os

# other packages
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_colwidth', 100) # default 50, the maximum width in characters of a column
pd.set_option('display.max_columns', 40)   # default 20, the maximum amount of columns in view 
pd.set_option('display.max_rows', 60)      # default 60, the maximum amount of rows in view
import logging
import json
import requests
import joblib
from sklearn.model_selection import train_test_split

### Import azure machine learning SDK packages

In [3]:
from azureml.core import Workspace, Dataset, Datastore, Run
from azureml.core.experiment import Experiment
from azureml.data.datapath import DataPath
from azureml.core.compute import ComputeTarget, AmlCompute, AksCompute
from azureml.core.model import Model, InferenceConfig
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.widgets import RunDetails
from azureml.core.webservice import Webservice, AciWebservice, AksWebservice
from azureml.exceptions import WebserviceException
from azureml.core.environment import Environment
from azureml.train.estimator import Estimator
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling, GridParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import PipelineData, Pipeline
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.pipeline.core.run import PipelineRun
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.interpret import ExplanationClient
import azureml.core
print("azureml.core version:", azureml.core.__version__)

azureml.core version: 1.40.0


### Versions

In [4]:
conda_version = ! conda -V
print(f"conda   : {conda_version[0].split()[1]}")
pip_version = ! pip -V
print(f"pip     : {pip_version[0].split()[1]}")
python_version = ! python -V
print(f"python  : {python_version[0].split()[1]}")
pandas_version = ! pip list |grep -ie "^pandas "
print(f"pandas  : {pandas_version[0].split()[1]}")
numpy_version = ! pip list |grep -ie "^numpy "
print(f"numpy   : {numpy_version[0].split()[1]}")
sklearn_version = ! pip list |grep -ie "^scikit-learn "
print(f"sklearn : {sklearn_version[0].split()[1]}")

!pip list |grep -i azureml

conda   : 4.12.0
pip     : 20.2.4
python  : 3.8.13
pandas  : 1.1.5
numpy   : 1.19.5
sklearn : 0.22.2.post1
azureml-automl-core                   1.40.0
azureml-automl-runtime                1.40.0
azureml-contrib-automl-pipeline-steps 1.40.0
azureml-contrib-dataset               1.40.0
azureml-core                          1.40.0
azureml-dataprep                      3.0.1
azureml-dataprep-native               38.0.0
azureml-dataprep-rslex                2.4.1
azureml-dataset-runtime               1.40.0
azureml-defaults                      1.40.0
azureml-inference-server-http         0.4.11
azureml-interpret                     1.40.0
azureml-mlflow                        1.40.0
azureml-pipeline-core                 1.40.0
azureml-pipeline-steps                1.40.0
azureml-telemetry                     1.40.0
azureml-train-automl-client           1.40.0
azureml-train-automl-runtime          1.40.0.post1
azureml-train-core                    1.40.0
azureml-train-restclients-hyperdri

### Workspace

![load the workspace](../../image/howto_automl/loadtheworkspace.png)

In [None]:
# load the workspace
ws = Workspace.from_config()

### Alternative Authentication 

See the chapter **Service Principal Authentication** in:  
[**how to authenticate**](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/manage-azureml-service/authentication-in-azureml/authentication-in-azureml.ipynb)

In [5]:
from azureml.core.authentication import ServicePrincipalAuthentication

svc_pr = ServicePrincipalAuthentication(
    tenant_id="73b49191-8db3-45ab-87b3-b8f956ac123b",
    service_principal_id="d7c04ded-ec80-4e62-a9d2-423b2553b83d",
    service_principal_password=os.environ["AZUREML_PASSWORD"]) # hypothetical variable name; keep the secret out of the notebook

ws = Workspace.from_config(auth=svc_pr)

### Experiment

In [6]:
# choose an experiment name
experiment = Experiment(ws, 'automl-classification')

### Data

In [7]:
# local backup copy of the data files
!ls -al ../../data/platinum/*

-rw-r--r-- 1 ubuntu root 517752 Apr  4 12:35 ../../data/platinum/diabetes.csv
-rw-r--r-- 1 ubuntu root 327574 Apr  4 12:35 ../../data/platinum/diabetes.parquet


Data Factory has prepped the data from /bronze to /silver to /gold and /platinum for model training  
**note:** for this demonstration the files are in the /platinum folder of the Data Lake Gen2 'datalake' container  
  * /datalake/platinum/diabetes.csv
  * /datalake/platinum/diabetes.parquet
  * copy from ../../data/platinum/*

Register the datastore 'data lake gen2' as a **blob container**  
(**optionally** use WebGUI to manually register in ML workspace)

In [8]:
ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="datalakestoragegen2",
    container_name="datalake",
    account_name="datalake04042022",
    account_key=os.environ["DATALAKE_ACCOUNT_KEY"], # hypothetical variable name; keep the key out of the notebook
    create_if_not_exists=False)
# list available datastores
ws.datastores

{'datalakestoragegen2': {
   "name": "datalakestoragegen2",
   "container_name": "datalake",
   "account_name": "datalake04042022",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspaceworkingdirectory': {
   "name": "workspaceworkingdirectory",
   "container_name": "code-391ff5ac-6576-460f-ba4d-7e03433c68b6",
   "account_name": "machinelstorage865aef211",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspacefilestore': {
   "name": "workspacefilestore",
   "container_name": "azureml-filestore-5224a85c-9ec5-4b58-86dd-d59b28efde48",
   "account_name": "machinelstorage865aef211",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspaceartifactstore': {
   "name": "workspaceartifactstore",
   "container_name": "azureml",
   "account_name": "machinelstorage865aef211",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspaceblobstore': {
   "name": "workspaceblobstore",
   "container_name": "azureml-blobstore-5224a85c

Register file(s) into a tabular dataset  
**Note:** do not import Delta lake parquet file(s)  
**Fix:** import single (pandas-written) gold/*.csv or gold/*.parquet file(s) instead  

In [None]:
# load datastore
ds = Datastore.get(ws, 'datalakestoragegen2')
# show datastore settings
ds

**Option 1 Tabular:** loading *.parquet

In [9]:
# setup parquet file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet}
dataset = Dataset.Tabular.from_parquet_files(path=ds_path)
# show dataset settings
dataset

{
  "source": [
    "('datalakestoragegen2', 'platinum/diabetes.parquet')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ]
}

**Option 2 Tabular:** loading *.csv

In [None]:
# setup csv file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=ds_path)
# show dataset settings
dataset

**Option 3 Registered:** loading a registered dataset (manually register in ML workspace)

In [None]:
# list available datasets
ws.datasets

In [None]:
# load a registered dataset
dataset = Dataset.get_by_name(ws, 'diabetes_parquet_from_datastore_datalakegen2')
# show dataset settings
dataset

#### DataSplit
Split the data into (train + validation) and test  
The model will learn from train + validation using cross validation  

In [10]:
# Load all records from the dataset into a pandas DataFrame
df = dataset.to_pandas_dataframe()
df.sample(5)

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
7470,1015023,1,75,64,45,568,20.721635,0.299076,23,0
6143,1601246,0,165,95,34,80,40.635156,0.28118,21,0
5708,1950687,2,175,63,44,268,25.821359,1.243973,62,1
5403,1859047,0,124,53,30,261,42.30854,0.103749,25,0
2002,1058480,4,100,70,41,93,28.399614,0.122858,21,1


In [11]:
# select the target variable
y = df["Diabetic"]

In [12]:
# select the features (x1, x2, x3, ...)
X = df.drop('Diabetic', axis='columns')

In [13]:
# calculate the target incidence (preferred > 5%)
np.sum(y)/len(y)

0.3344

In [14]:
# collect the categorical (object dtype) feature columns
categorical = [col for col, value in X.items() if value.dtype == 'object']
print(f"categorical: {categorical}")
# collect the remaining (numerical) feature columns
numerical = list(X.columns.difference(categorical))
print(f"numerical  : {numerical}")

categorical: []
numerical  : ['Age', 'BMI', 'DiabetesPedigree', 'DiastolicBloodPressure', 'PatientID', 'PlasmaGlucose', 'Pregnancies', 'SerumInsulin', 'TricepsThickness']


In [15]:
# split data into (train + validation) and test
x_train, x_test, y_train, y_test = train_test_split(X,                # features
                                                    y,                # target
                                                    test_size=0.2,    # 20% of the records go to the test set
                                                    random_state=101, # fixed seed for a reproducible split
                                                    stratify=y        # keep the same target incidence in train and test
                                                   )

In [16]:
# train + validation (~we call it training_data)
training_data = pd.concat([x_train, y_train], axis=1)
print(training_data.shape)
# test (~we call it validation_data)
validation_data = pd.concat([x_test, y_test], axis=1)
print(validation_data.shape)

(8000, 10)
(2000, 10)
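
To illustrate what `stratify=y` achieves, here is a pure-Python sketch of a stratified split. It is not scikit-learn's implementation. The labels are synthetic, built to match the target incidence of 0.3344 on 10,000 records seen above, and the helper names `stratified_split` and `incidence` are hypothetical.

```python
import random

def stratified_split(labels, test_size=0.2, seed=101):
    """Split record indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def incidence(labels, indices):
    """Share of positive labels among the selected records."""
    return sum(labels[i] for i in indices) / len(indices)

# synthetic target: 3344 positives out of 10000 records
labels = [1] * 3344 + [0] * 6656
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(incidence(labels, train_idx), incidence(labels, test_idx))
```

Both subsets end up with practically the same incidence as the full dataset, which is the property the stratified `train_test_split` call relies on.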


### Compute

Check the available VM size **names** before creating an auto-scaling cluster

In [17]:
# example: list all VM sizes with 1 vCPU, at least 2 GB memory and no GPU
vm_df = pd.DataFrame(AmlCompute.supported_vmsizes(ws))
vm_df[(vm_df.vCPUs == 1) & (vm_df.memoryGB >= 2) & (vm_df.gpus == 0)]

Unnamed: 0,name,vCPUs,gpus,memoryGB,maxResourceVolumeMB
0,Standard_D1,1,0,3.5,51200
13,Standard_D1_v2,1,0,3.5,51200
41,Standard_DS1_v2,1,0,3.5,7168


option 1: Create training cluster  

In [18]:
%%time
# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'
# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D1_v2',
                                                       min_nodes=0, # scale down to zero nodes: no cost while idle
                                                       max_nodes=10, # depends on quota limits
                                                       vm_priority='dedicated', # {lowpriority, dedicated}
                                                       admin_username='ubuntu',
                                                       admin_user_password=os.environ["AML_CLUSTER_ADMIN_PASSWORD"], # hypothetical variable name
                                                       idle_seconds_before_scaledown=120, # {default: 120}
                                                      )
# Create the compute
training_cluster = ComputeTarget.create(ws, compute_name, compute_config)
training_cluster.wait_for_completion(show_output=True)

InProgress..
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
CPU times: user 59.7 ms, sys: 9.19 ms, total: 68.9 ms
Wall time: 11.5 s


option 2: Load an existing training cluster

In [18]:
# list all available training cluster(s):
for cluster in ws.compute_targets:
    print(cluster)

aml-cluster


In [19]:
# load the training cluster
compute_name = 'aml-cluster'
training_cluster = ComputeTarget(ws, name=compute_name)

## Train

### Configure autoML
Define settings to run the experiment.

|Property|Description|Options|
|-|-|-|
|**task**|the type of problem to solve|<i>classification</i><br><i>regression</i><br><i>forecasting</i>|
|**compute_target**|serial execution on the local DSVM<br>parallel execution on remote AML compute or AKS|<i>local</i><br><i>training_cluster</i>|
|**primary_metric**|the metric you want to optimize<br>[metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml)|**classification:**<br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i><br><br>**regression:**<br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**training_data**|input dataset, containing both X_train and y_train|<i>DataFrame</i><br><i>Dataset</i><br><i>DatasetDefinition</i><br><i>TabularDataset</i>|
|**validation_data**|input dataset, covered with cross validation|N/A|
|**label_column_name**|the name of the 'target' or 'label' column||
|**enable_early_stopping**|stop the run if metric score is not improving|<i>True</i><br><i>False</i>|
|**n_cross_validations**|number of cross validation splits|5|
|**experiment_timeout_hours**|maximum time in hours before the experiment terminates (+15min)|<i>0.25</i>|
|**max_concurrent_iterations**|at most the number of cores on a local DSVM, or the number of nodes on an AmlCompute cluster|2|



**_You can find more information_** [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train)

In [20]:
automl_settings = {
    "enable_early_stopping":True,
    "experiment_timeout_hours":0.75, # (0.75 = 45min)
    "iterations":5, # number of runs
    "iteration_timeout_minutes":10,
    "max_concurrent_iterations":1,
    "max_cores_per_iteration":-1,
#     "experiment_exit_score":0.9920,
    "model_explainability":True,
#     "n_cross_validations":5,
    "primary_metric":'AUC_weighted',
    "featurization":'auto',
    "verbosity":logging.INFO, # {INFO, DEBUG, CRITICAL, ERROR, WARNING} -- debug_log=<*.log>
}

automl_config = AutoMLConfig(task='classification',
                             debug_log='automl_errors.log',
                             compute_target='local', # {training_cluster or 'local'}
#                              blocked_models=['KNN','LinearSVM'], # replaces the deprecated blacklist_models
                             enable_onnx_compatible_models=True,
                             training_data=training_data, # (train + validation) will use automatic cross_validation 
                             label_column_name="Diabetic",
                             **automl_settings
                            )
# outputs "model.pkl" and "automl_errors.log"
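
The timeout settings interact, so it can help to compare the worst-case iteration time with the experiment budget. The sketch below does that arithmetic for the values used above. It assumes iterations run in waves of `max_concurrent_iterations` and ignores featurization, ensembling and other overhead; the helper name `worst_case_minutes` is hypothetical.

```python
import math

def worst_case_minutes(iterations, iteration_timeout_minutes, max_concurrent_iterations):
    """Upper bound on training time if every iteration runs into its timeout."""
    waves = math.ceil(iterations / max_concurrent_iterations)
    return waves * iteration_timeout_minutes

# values copied from the automl_settings cell
settings = {
    "experiment_timeout_hours": 0.75,
    "iterations": 5,
    "iteration_timeout_minutes": 10,
    "max_concurrent_iterations": 1,
}

budget = settings["experiment_timeout_hours"] * 60
worst = worst_case_minutes(settings["iterations"],
                           settings["iteration_timeout_minutes"],
                           settings["max_concurrent_iterations"])

print(f"experiment budget : {budget:.0f} min")
print(f"worst case        : {worst} min")
print(f"budget sufficient : {worst <= budget}")
```

With these settings the worst case (50 minutes) exceeds the 45 minute budget, so the experiment timeout could end the run before all five iterations finish if each one is slow.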

### Train pipelines

In [21]:
%%time
automl_run = experiment.submit(automl_config, show_output=True)

2022-04-05:11:08:44,350 INFO     [modeling_bert.py:226] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2022-04-05:11:08:44,367 INFO     [modeling_xlnet.py:339] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


Running in the active local environment.


Experiment,Id,Type,Status,Details Page,Docs Page
automl-classification,AutoML_99a62dfa-d65a-49d9-934b-2a78d4895923,automl,Preparing,Link to Azure Machine Learning studio,Link to Documentation


Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+------------------------------+
|Number of folds               |
|3                             |
+------------------------------+

********************************************************************************************

TYPE:         Class balancing detection
STATUS:       PASSED
DESC

2022-04-05:11:22:33,349 INFO     [explanation_client.py:334] Using default datastore for uploads


Current status: EngineeredFeatureExplanations. Computation of engineered features completed
Current status: RawFeaturesExplanations. Computation of raw features started
Current status: RawFeaturesExplanations. Computation of raw features completed
Current status: BestRunExplainModel. Best run model explanations completed
********************************************************************************************
CPU times: user 7min 26s, sys: 52.6 s, total: 8min 19s
Wall time: 14min 6s


### Optional: retrieve a run

In [22]:
runId = 'AutoML_99a62dfa-d65a-49d9-934b-2a78d4895923'
automl_run = AutoMLRun(experiment, run_id=runId)

## Results

### Explore the best pipeline

In [23]:
RunDetails(automl_run).show()
automl_run.wait_for_completion() # get more parameter info

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

{'runId': 'AutoML_99a62dfa-d65a-49d9-934b-2a78d4895923',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2022-04-05T09:08:58.949983Z',
 'endTimeUtc': '2022-04-05T09:21:04.850886Z',
 'services': {},
 'properties': {'num_iterations': '5',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'local',
  'DataPrepJsonString': None,
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-dataprep-native": "38.0.0", "azureml-dataprep": "3.0.1", "azureml-dataprep-rslex": "2.4.1", "azureml-train-automl-runtime": "1.40.0.post1", "azureml-pipeline-steps": "1.40.0", "azureml-interpret": "1.40.0", "azureml-core": "1.40.0", "azureml-dataset-runtime": "1.40.0", "azureml-widgets": "1.40.0", "azureml-train-automl-client": "1.40.0", "

![automl_run](../../image/howto_automl/automl_run.png)

**option 1:** select any pipeline iteration 

In [24]:
best_run, fitted_model = automl_run.get_output(iteration=0)

**option 2:** select best pipeline iteration automatically

In [None]:
best_run, fitted_model = automl_run.get_output()

### Inspect model properties

In [25]:
# pipeline steps
for step in fitted_model.named_steps:
    print(step)

datatransformer
MaxAbsScaler
LightGBMClassifier


In [26]:
# model properties
fitted_model.named_steps

{'datatransformer': DataTransformer(enable_dnn=False, enable_feature_sweeping=False, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=True, task='classification'),
 'MaxAbsScaler': MaxAbsScaler(copy=True),
 'LightGBMClassifier': LightGBMClassifier(min_data_in_leaf=20, n_jobs=-1, problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'cpu'}), random_state=None)}

In [27]:
# show all metrics
best_run.get_metrics()

{'weighted_accuracy': 0.9528355112553637,
 'AUC_macro': 0.9889310060568878,
 'precision_score_macro': 0.9411211289424788,
 'norm_macro_recall': 0.8805391085180713,
 'recall_score_micro': 0.9472499827244025,
 'average_precision_score_macro': 0.9871352774186574,
 'precision_score_weighted': 0.9472033268188875,
 'balanced_accuracy': 0.9402695542590357,
 'AUC_weighted': 0.9889310060568879,
 'average_precision_score_weighted': 0.9894832131217713,
 'recall_score_weighted': 0.9472499827244025,
 'AUC_micro': 0.9902186358077073,
 'f1_score_micro': 0.9472499827244025,
 'f1_score_weighted': 0.9472230129853932,
 'precision_score_micro': 0.9472499827244025,
 'average_precision_score_micro': 0.9903872726940194,
 'matthews_correlation': 0.8813898032287959,
 'log_loss': 0.12722240856346553,
 'f1_score_macro': 0.9406907984438874,
 'recall_score_macro': 0.9402695542590357,
 'accuracy': 0.9472499827244025,
 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_99a62dfa-d65a-49d9-934b-2a78d4895923
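
Some of these metrics are directly related. As a worked check, the sketch below assumes the definition norm_macro_recall = (recall_score_macro - R) / (1 - R), where R = 1 / (number of classes) is the macro recall expected from random predictions. The function name is hypothetical; the input value is copied from the metrics above.

```python
def norm_macro_recall(recall_macro, n_classes):
    """Rescale macro recall so that random guessing scores 0 and a perfect model scores 1."""
    chance = 1.0 / n_classes
    return (recall_macro - chance) / (1.0 - chance)

recall_score_macro = 0.9402695542590357  # copied from the metrics output
result = norm_macro_recall(recall_score_macro, n_classes=2)
print(result)
```

For this binary problem the result agrees with the reported `norm_macro_recall` of about 0.8805.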

### Feature Importance

In [28]:
client = ExplanationClient.from_run(best_run)
engineered_explanations = client.download_model_explanation(raw=False)
feature_importance = engineered_explanations.get_feature_importance_dict() # get model feature importance values
columns = ["modelFeatureImportance_name", "modelFeatureImportance_value"]
pd.DataFrame(list(feature_importance.items()), columns=columns)

2022-04-05:11:24:22,583 INFO     [explanation_client.py:334] Using default datastore for uploads


ExplanationNotFoundException: ExplanationNotFoundException:
	Message: Explanation asset ID None was not found to match the supplied filters ['comment', 'raw'].
	InnerException None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Explanation asset ID None was not found to match the supplied filters ['comment', 'raw'].",
        "inner_error": {
            "code": "NotFound",
            "inner_error": {
                "code": "ExplanationFiltersNotFound"
            }
        }
    }
}

## Register

### Prepare

autoML generated a scoring script, environment file and model

In [29]:
# get the score and environment files
model_name = best_run.properties['model_name'] # score.py script will look for the name of the registered model

# make a local copy of the best scoring script, environment file and the model file
script_file_name = 'inference/score.py'
conda_env_file_name = 'inference/env.yml'
model_pickle_file_name = 'inference/model.pkl'
model_onnx_file_name = 'inference/model.onnx'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)
best_run.download_file('outputs/conda_env_v_1_0_0.yml', conda_env_file_name)
best_run.download_file('outputs/model.pkl', model_pickle_file_name)
best_run.download_file('outputs/model.onnx', model_onnx_file_name)

In [30]:
! cat inference/env.yml

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.8.13

- pip:
  - azureml-train-automl-runtime==1.40.0.post1
  - inference-schema
  - azureml-interpret==1.40.0
  - azureml-defaults==1.40.0
- numpy==1.19.5
- pandas==1.1.5
- scikit-learn==0.22.2.post1
- py-xgboost==1.3.3
- fbprophet==0.7.1
- holidays==0.10.3
- psutil==5.9.0
- pytorch==1.4.0
- cudatoolkit==10.1.243
channels:
- anaconda
- conda-forge


In [31]:
! cat inference/score.py

# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------
import json
import logging
import os
import pickle
import numpy as np
import pandas as pd
import joblib

import azureml.automl.core
from azureml.automl.core.shared import logging_utilities, log_server
from azureml.telemetry import INSTRUMENTATION_KEY

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType
from inference_schema.parameter_types.standard_py_parameter_type import StandardPythonParameterType

input_sample = pd.DataFrame({"Age": pd.Series([0], dtype="int64"), "BMI": pd.Series([0.0], dtype="float64"), "DiabetesPedigree": pd.Series([0.0], dtype="float64"), "DiastolicBloodPressure": pd.Series(

### Register the model

**Option 1:** from workspace /outputs folder with .register_model()

In [None]:
model = best_run.register_model(model_name=model_name, # registered model name used in scoring script init()
                                model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                                model_framework_version='0.22.2',
                                model_path='outputs/model.pkl', # fixed path in workspace {'model.pkl', 'model.onnx'}
                                tags={'Training context': 'autoML Training'},
                                properties={'AUC': best_run.get_metrics()['AUC_weighted'],
                                            'Accuracy': best_run.get_metrics()['accuracy']},
                                description="Classification model to predict diabetes")

**Option 2:** from local /path/model folder with Model.register()

In [32]:
model = Model.register(workspace=ws,
                       model_name=model_name, # registered model name used in scoring script init()
                       model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                       model_framework_version='0.22.2',
                       model_path='inference/model.pkl', # local file {'model.pkl', 'model.onnx'}
                       tags={'Training context': 'autoML Training'},
                       properties={'AUC': best_run.get_metrics()['AUC_weighted'],
                                   'Accuracy': best_run.get_metrics()['accuracy']},
                       description="Classification model to predict diabetes")

Registering model AutoML99a62dfad0


**Optional:** Load the model

In [33]:
# list all registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

AutoML99a62dfad0 version: 1
	 Training context : autoML Training
	 AUC : 0.9889310060568879
	 Accuracy : 0.9472499827244025




In [None]:
# load the registered model for deployment (latest version)
model = ws.models[model_name] # or replace with any registered modelname from Model.list(ws)
model

## Deploy

### Deploy model as webservice (ACI)

A Linux Azure Container Instance with 1 vCPU and 1 GB of RAM costs €28 per month

In [34]:
%%time
# Configure the scoring environment
service_name = "automl-projname-service" # only lowercase letters, numbers, or dashes

# Remove any existing service under the same name
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    print('"' + service_name + '" does not exist, creating the webservice...')

myenv = Environment.from_conda_specification(name="myenv", file_path=conda_env_file_name)
inference_config = InferenceConfig(entry_script=script_file_name, environment=myenv)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1)

# build the container from the environment, start the ACI webservice and deploy the inference scripts
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)

"automl-projname-service" does not exist, creating the webservice...
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-04-05 11:30:31+02:00 Creating Container Registry if not exists..
2022-04-05 11:40:31+02:00 Registering the environment.
2022-04-05 11:40:33+02:00 Building image..
2022-04-05 11:55:03+02:00 Generating deployment configuration..
2022-04-05 11:55:04+02:00 Submitting deployment to compute..
2022-04-05 11:55:10+02:00 Checking the status of deployment automl-projname-service..
2022-04-05 12:00:11+02:00 Checking the status of inference endpoint automl-projname-service.
Succeeded
ACI service creation operation finished, operation "Succeeded"
CPU times: user 25.9 s, sys: 3.97 s, total: 29.9 s
Wall time: 30min 38s


**Optional:** load a running webservice

In [35]:
# list available webservices
for i in ws.webservices:
    print(i)

automl-projname-service


In [36]:
service_name = "automl-projname-service" # only lowercase letters, numbers, or dashes
service = Webservice(ws, service_name)

In [37]:
# get webservice logs
print(service.get_logs())

2022-04-05T09:59:52,941744500+00:00 - iot-server/run 
2022-04-05T09:59:52,941745300+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
2022-04-05T09:59:52,967272100+00:00 - nginx/run 
2022-04-05T09:59:52,969077500+00:00 - rsyslog/run 
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2022-04-05T09:59:53,311964600+00:00 - iot-server/finish 1 0
2022-04-05T09:59:53,318243500+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (72)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 99
SPARK_HOME not set. Skipping PySpark Initialization.
Initializing logger
2022-04-05 10:00:08,085 | root | INFO | Starting up app insights client
logging socket was found. logging is available.
logging socket was found. logging is available.
2022-04-05 10:00:08,086 | root | INFO | Starting up request id generator
2022-04-05 10:00:08,086 | root | INFO | Star

## Test

### Webservice inference test

Send an HTTP request with test data to the model to get a prediction.  
In this example we test whether a person is diabetic (1) or not diabetic (0).  
The test data must be a list of 9 feature values per record for the binary classification.  
We demonstrate both the **service** and the **requests** method to send a prediction request.  
The 'Postman' application or the 'Rest Client' plugin in VSCode work as well.  

|Web API|Example value|Options|
|-|-|-|
|**HTTP method**|POST|<i>POST</i><br><i>GET</i>|
|**URI**|http://3bb0618b-ef7b-4b17-af32-a52f9c64f4d5.northeurope.azurecontainer.io/score||
|**Header**|{Content-Type: application/json}||
|**Body**|{"data": [[5, 2, 180, 74, 24, 21, 24, 1.5, 22], <br>[6, 0, 148, 58, 11, 179, 39, 0.16, 45]]}|<i>one or </i><br><i>more records</i>|
|**Response**|{"result": [1, 0]}|<i>json object</i>|
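
The request body from the table can be assembled and checked before it is sent. The sketch below is a minimal illustration and not part of the original notebook: `build_body` and `EXPECTED_FEATURES` are hypothetical names, and the check for 9 values per record only follows the statement above.

```python
import json

EXPECTED_FEATURES = 9  # assumption taken from the text above: 9 values per record


def build_body(records):
    """Serialize one or more records into the {"data": [...]} JSON body."""
    for i, record in enumerate(records):
        if len(record) != EXPECTED_FEATURES:
            raise ValueError(
                f"record {i} has {len(record)} values, expected {EXPECTED_FEATURES}"
            )
    return json.dumps({"data": records})


body = build_body([[5, 2, 180, 74, 24, 21, 24, 1.5, 22],
                   [6, 0, 148, 58, 11, 179, 39, 0.16, 45]])
print(body)
```

Failing early on a record of the wrong length tends to be easier to debug than an error returned by the deployed service.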

In [38]:
# get webservice URI
endpoint = service.scoring_uri

# raw test data
rawdata = [[5, 2, 180, 74, 24, 21, 24, 1.5, 22],
           [6, 0, 148, 58, 11, 179, 39, 0.16, 45]]

print("URI: " + endpoint)
print("Body: " + json.dumps({"data": rawdata})) # convert array to a serialized JSON formatted string object

URI: http://cb6a9619-f4c9-47fe-bca2-0b8841b3d018.westeurope.azurecontainer.io/score
Body: {"data": [[5, 2, 180, 74, 24, 21, 24, 1.5, 22], [6, 0, 148, 58, 11, 179, 39, 0.16, 45]]}


**Test 1:** service.run()

In [39]:
service.run(json.dumps({"data": rawdata}))

'{"result": [0, 0]}'

**Test 2:** requests.post()

In [40]:
response = requests.post(endpoint, json={"data": rawdata})
response.json()

'{"result": [0, 0]}'
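
The outputs above are shown in quotes, which suggests that the service returns the prediction as a JSON-formatted string rather than a parsed object. If so, one more `json.loads` is needed before the values can be used. A minimal sketch, where `decode_result` is a hypothetical helper and the input is the string shown in the output above:

```python
import json


def decode_result(payload):
    """Return the list of predictions from a scoring response.

    Accepts either a JSON string or an already parsed dict.
    """
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload["result"]


predictions = decode_result('{"result": [0, 0]}')
print(predictions)
```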

When you are finished testing your service, clean up the deployment with service.delete()

In [41]:
service.delete()

# CustomML

CustomML is an alternative development route, inspired by the autoML results.  
Use the inline method to develop and test, train locally or on remote compute, then deploy and test the model.  

1. option1: inline method
1. option2: script method
  * create training script
  * create training environment
  * create and register dataset (File)
  * train model
1. create an inference script
1. create an inference environment
1. register the model
1. deploy the model
1. inference test

In [None]:
ws = Workspace.from_config()

### Option 1: Inline method

|log metric function|Description|Example|
|-|-|-|
|**log**|<i>Record a single named value</i>|run.log("accuracy", 0.95)|
|**log_list**|<i>Record a named list of values</i>|run.log_list("accuracies", [0.6, 0.7, 0.87])|
|**log_row**|<i>Record a row with multiple columns</i>|run.log_row("Y over X", x=1, y=0.4)|
|**log_table**|<i>Record a dictionary as a table</i>|run.log_table("Y over X", {"x":[1, 2, 3], "y":[0.6, 0.7, 0.89]})|
|**log_image**|<i>Record an image file or a plot</i>|run.log_image("ROC", plot=plt)|
|**upload_file**|<i>Upload any file to "./outputs"</i>|run.upload_file("best_model.pkl", "./model.pkl")|

https://aka.ms/AA70zf6

In [42]:
from azureml.core import Experiment
from azureml.core import Model
from azureml.core import Datastore
from azureml.core import Dataset
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Create an Azure ML experiment in your workspace
experiment = Experiment(workspace=ws, name="diabetes-training")
run = experiment.start_logging()
print("Starting experiment:", experiment.name)

# load the diabetes dataset (File method)
print("Loading data lake gen2 data in a pandas dataframe...")
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
dataset = Dataset.File.from_files(path=ds_path)
mount_context = dataset.mount(mount_point='/tmp/platinum') # read-only mount from delta lake
mount_context.start()
diabetes = pd.read_parquet('/tmp/platinum/diabetes.parquet') # {'/tmp/path/'} can load latest delta lake parquet files
mount_context.stop()

# load the diabetes dataset (Tabular method)
# print("Loading data lake gen2 data in a pandas dataframe...")
# ds = Datastore.get(ws, 'datalakestoragegen2')
# ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
# dataset = Dataset.Tabular.from_parquet_files(path=ds_path) # {delimited, json, parquet, sql}
# diabetes = dataset.to_pandas_dataframe() # create a pandas dataframe

# Separate features and labels as numpy array
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc)) # np.float was removed in recent NumPy releases

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model
model_file = 'diabetes_model.pkl'
joblib.dump(value=model, filename=model_file) # backup model local
run.upload_file(name='outputs/' + model_file,
                path_or_stream='./' + model_file) # save model to workspace

# Complete the run
run.complete()

2022-04-05:12:06:17,79 INFO     [datastore_client.py:991] <azureml.core.authentication.ServicePrincipalAuthentication object at 0x7f96dc02a3a0>


Starting experiment: diabetes-training
Loading data lake gen2 data in a pandas dataframe...




Training a decision tree model
Accuracy: 0.8953333333333333
AUC: 0.8823424050947809
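
The two metrics logged above can be illustrated without any library. The sketch below uses invented toy labels and scores, not the diabetes data: accuracy is the fraction of matching predictions, and AUC is computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# toy labels and scores, invented for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))
print(auc(y_true, scores))
```

Accuracy depends on the 0.5 threshold used to turn scores into labels, while AUC only depends on how the scores rank the examples, which is why the two numbers can differ.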


### Option 2: Script method

Create training script

In [43]:
# Create a local folder for the experiment files
folder_name = 'diabetes_service'
experiment_folder = './' + folder_name
os.makedirs(folder_name, exist_ok=True)
print(folder_name, 'folder created')

diabetes_service folder created


In [44]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import argparse
import os
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import glob
print("libraries imported...")

# Set regularization hyperparameter (passed as an argument to the script)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate
print("argparse parameters loaded...")

# Get the experiment run context
run = Run.get_context()
print("run context loaded...")

# load the diabetes dataset (File method)
# Get the training data from the estimator input identified as 'diabetes'
mount = run.input_datasets['diabetes'] # read-only mount from delta lake as '/mnt/data'
print("delta lake mounted...")
diabetes = pd.read_parquet('/mnt/data/diabetes.parquet') # load any file(s) from this delta lake mounted folder
print("dataset loaded...")

# save data into workspace
diabetes.to_csv("outputs/dataset.csv", index=False) # {logs/  outputs/}
print("test: write dataset to workspace 'outputs/dataset.csv'")

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note: files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Writing ./diabetes_service/diabetes_training.py
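
The training script above reads its regularization rate from the command line and turns it into scikit-learn's `C`, which is the inverse of the regularization strength. The sketch below reproduces only that argument handling; the argument list passed to `parse_args` is a stand-in (an assumption) for what a submitted run would pass.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01,
                    help='regularization rate')

# simulate the command line of a run (assumed value, for illustration)
args = parser.parse_args(['--regularization', '0.1'])
reg = args.reg_rate
C = 1 / reg  # a smaller regularization rate gives a larger C

print(reg, C)

# with no argument, the default of 0.01 applies
default_args = parser.parse_args([])
print(default_args.reg_rate, 1 / default_args.reg_rate)
```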


Create training environment

In [45]:
myenv = Environment("training_environment")
myenv.docker.enabled = True
myenv.python.user_managed_dependencies = False
conda_packages = ['scikit-learn', 'joblib', 'python==3.6.2']
pip_packages = ['azureml-defaults', 'azureml-dataprep[pandas,fuse]', 'pyarrow', 'fastparquet']
myenv.python.conda_dependencies = CondaDependencies.create(conda_packages=conda_packages, pip_packages=pip_packages)
myenv.register(ws)



{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20220314.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": true,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "training_environment",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-f

In [46]:
# list environments
env_names = Environment.list(workspace=ws)
for env_name in env_names:
    print('Name:',env_name)

Name: training_environment
Name: AzureML-Triton
Name: AzureML-sklearn-0.24.1-ubuntu18.04-py37-cpu-inference
Name: AzureML-minimal-ubuntu18.04-py37-cpu-inference
Name: AzureML-tensorflow-2.4-ubuntu18.04-py37-cpu-inference
Name: AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11.0.3-gpu-inference
Name: AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu
Name: AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu
Name: AzureML-mlflow-ubuntu18.04-py37-cpu-inference
Name: AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu-inference
Name: AzureML-pytorch-1.10-ubuntu18.04-py37-cpu-inference
Name: AzureML-pytorch-1.9-ubuntu18.04-py37-cpu-inference
Name: AzureML-minimal-ubuntu18.04-py37-cuda11.0.3-gpu-inference
Name: AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11.0.3-gpu-inference
Name: AzureML-sklearn-0.24-ubuntu18.04-py37-cpu
Name: AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu
Name: AzureML-responsibleai-0.17-ubuntu20.04-py38-cpu
Name: AzureML-sklearn-1.0-ubuntu20.04-py38-cpu
Name: AzureML-tensorflow-2.6-ubuntu20.04-py

Create and register dataset (File)

In [47]:
# load the diabetes dataset (File method)
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/**')] # {path/*.parquet or path/**}
file_ds = Dataset.File.from_files(path=ds_path)
   
# Register the file dataset
try:
    file_ds = file_ds.register(workspace=ws,
                               name='diabetes file dataset',
                               description='diabetes files',
                               tags = {'format':'parquet'},
                               create_new_version=True)
    print('Dataset registered') # only reported when registration succeeded
except Exception as ex:
    print(ex)

2022-04-05:12:06:48,108 INFO     [datastore_client.py:991] <azureml.core.authentication.ServicePrincipalAuthentication object at 0x7f96dc02a3a0>


Dataset registered


In [48]:
# show a list of registered dataset(s)
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, '\t version', dataset.version)

Datasets:
	 diabetes file dataset 	 version 1


In [49]:
# list of the file path(s)
for file_path in file_ds.to_path():
    print(file_path)

/diabetes.csv
/diabetes.parquet
/pharma_ref.xlsx
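
The path comment above mentions two patterns, `path/*.parquet` and `path/**`. The sketch below shows the difference on a temporary local folder using `pathlib`. It assumes that the datastore patterns behave like ordinary glob patterns, which is an approximation, and the file names are copied from the listing above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "platinum"
    root.mkdir()
    for name in ("diabetes.csv", "diabetes.parquet", "pharma_ref.xlsx"):
        (root / name).touch()

    # '*.parquet' keeps only the parquet files in the folder
    only_parquet = sorted(p.name for p in root.glob("*.parquet"))
    # '**' walks the folder recursively and keeps every file
    everything = sorted(p.name for p in root.glob("**/*") if p.is_file())

print(only_parquet)
print(everything)
```

With `platinum/**` the dataset also contains the CSV and Excel files, so a script that only needs the parquet file has to pick it explicitly, as the training script does.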


Train model

In [50]:
%%time
# Set the script parameters
script_params = {
    '--regularization': 0.1
}

# load the registered dataset by name
file_ds = Dataset.get_by_name(ws, "diabetes file dataset")

# load the docker environment
training_env = Environment.get(ws, 'training_environment')

# load the training compute cluster
training_cluster = ComputeTarget(ws, 'aml-cluster')

estimator = Estimator(source_directory=experiment_folder, # All the files in this directory are uploaded into the cluster nodes for execution
                      compute_target='local', # {'local', training_cluster}
                      entry_script='diabetes_training.py',
                      script_params=script_params,
                      environment_definition=training_env,
                      inputs=[file_ds.as_named_input('diabetes').as_mount(path_on_compute='/mnt/data')],
                     )

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)
# Run the experiment
run = experiment.submit(config=estimator)

# Show the run details while running
RunDetails(run).show()
run.wait_for_completion() # block until the run finishes; returns the run details

2022-04-05:12:07:00,709 INFO     [_loggerfactory.py:154] ScriptRunSubmit


_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

CPU times: user 24.4 s, sys: 3.25 s, total: 27.6 s
Wall time: 11min 33s


{'runId': 'diabetes-training_1649153220_4fde8e7e',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2022-04-05T10:17:27.725978Z',
 'endTimeUtc': '2022-04-05T10:17:54.322245Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': '70d47910-1467-4bc7-b33b-f965eaa46de3',
  'azureml.git.repository_uri': 'https://github.com/albert-kevin/azuremachinelearning.git',
  'mlflow.source.git.repoURL': 'https://github.com/albert-kevin/azuremachinelearning.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '6fff6c6dbc2e7e872873ad08b697dec70d229e8e',
  'mlflow.source.git.commit': '6fff6c6dbc2e7e872873ad08b697dec70d229e8e',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [{'dataset': {'id': '5232b173-8663-4f1e-ae7e-4f55879e07ed'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'diabetes', 'mechanism': 'Mount', 'pathOnCompute': '/mnt/data'}}],
 'outputDatasets': [],
 'runDefinition': {'

![ScriptRun Widget](../../image/howto_automl/script_run_widget1.png)

### Create inference script

In [51]:
# Create a local folder for the experiment files
folder_name = 'diabetes_service'
experiment_folder = './' + folder_name
os.makedirs(folder_name, exist_ok=True)
print(folder_name, 'folder created')

diabetes_service folder created


In [52]:
%%writefile $folder_name/diabetes_score.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load a registered model
    model_path = Model.get_model_path(model_name='diabetes_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Get the corresponding classname for each prediction (0 or 1)
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = []
    for prediction in predictions:
        predicted_classes.append(classnames[prediction])
    # Return the predictions as JSON
    return json.dumps(predicted_classes)

Writing diabetes_service/diabetes_score.py
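
The `init()` / `run()` contract of the scoring script can be tried locally before deploying. The sketch below is not the deployed script: `StubModel` is a hypothetical stand-in with an invented rule, used instead of the registered model so that the example runs without Azure, joblib, or numpy.

```python
import json


class StubModel:
    """Hypothetical stand-in for the registered model: flags glucose above 140."""

    def predict(self, rows):
        return [1 if row[1] > 140 else 0 for row in rows]


def init():
    global model
    model = StubModel()  # the real script loads the model with joblib instead


def run(raw_data):
    # parse the JSON body, predict, and map each class index to its name
    data = json.loads(raw_data)['data']
    predictions = model.predict(data)
    classnames = ['not-diabetic', 'diabetic']
    return json.dumps([classnames[p] for p in predictions])


init()
result = run(json.dumps({"data": [[9, 103, 78, 25, 304, 29.6, 1.28, 43],
                                  [0, 148, 58, 11, 179, 39, 0.16, 45]]}))
print(result)
```

The labels produced by the stub follow its invented rule and say nothing about what the trained model predicts; the point is only the shape of the input and output.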


### Create inference environment

In [53]:
# Add the dependencies for our model (AzureML defaults is already included)
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")

# Save the environment config as a .yml file
env_file = folder_name + "/diabetes_env.yml"
with open(env_file, "w") as f:
    f.write(myenv.serialize_to_string())
print("Saved inference environment file in", env_file)

# Print the .yml file
with open(env_file,"r") as f:
    print(f.read())

Saved inference environment file in diabetes_service/diabetes_env.yml
# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
    # Required packages for AzureML execution, history, and data preparation.
  - azureml-defaults

- scikit-learn
channels:
- anaconda
- conda-forge



### Register the model

In [54]:
# define model name
model_name = 'diabetes_model'

# register model from the workspace 
run.register_model(model_name=model_name, # registered model name used in scoring script init()
                   model_path='outputs/diabetes_model.pkl', # fixed path in workspace {'model.pkl', 'model.onnx'}
                   tags={'Training context': 'Custom Training'},
                   properties={'AUC': run.get_metrics()['AUC'],
                               'Accuracy': run.get_metrics()['Accuracy']},
                   description="Classification model to predict diabetes",
                   model_framework=Model.Framework.SCIKITLEARN, # {TensorFlow, ScikitLearn, Onnx, Custom}
                   model_framework_version='0.22.2')

print('Model trained and registered')

Model trained and registered


### Deploy the model

In [55]:
%%time
service_name = "diabetes-service"

# Remove any existing service under the same name
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    print('"' + service_name + '" does not exist, creating the webservice...')

# Configure the scoring environment
inference_config = InferenceConfig(runtime="python",
                                   source_directory=folder_name,
                                   entry_script="diabetes_score.py",
                                   conda_file="diabetes_env.yml")

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1)

# load the registered model
model = ws.models['diabetes_model']

service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)

service.wait_for_deployment(show_output=True)
print(service.state)

"diabetes-service" does not exist, creating the webservice...
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-04-05 12:19:25+02:00 Creating Container Registry if not exists.
2022-04-05 12:19:25+02:00 Registering the environment.
2022-04-05 12:19:27+02:00 Building image..
2022-04-05 12:25:39+02:00 Generating deployment configuration..
2022-04-05 12:25:40+02:00 Submitting deployment to compute..
2022-04-05 12:25:46+02:00 Checking the status of deployment diabetes-service..
2022-04-05 12:27:40+02:00 Checking the status of inference endpoint diabetes-service.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
CPU times: user 12 s, sys: 1.83 s, total: 13.8 s
Wall time: 8min 23s


### Inference test

In [56]:
# get webservice URI
endpoint = service.scoring_uri

# raw test data
rawdata = [[9, 103, 78, 25, 304, 29.6, 1.28, 43],
           [0, 148, 58, 11, 179, 39, 0.16, 45]]

print("URI: " + endpoint)
print("Body: " + json.dumps({"data": rawdata})) # convert array to a serialized JSON formatted string object

service.run(json.dumps({"data": rawdata}))

URI: http://961f4622-8384-4fcb-86a0-e4aa8f56bbf1.westeurope.azurecontainer.io/score
Body: {"data": [[9, 103, 78, 25, 304, 29.6, 1.28, 43], [0, 148, 58, 11, 179, 39, 0.16, 45]]}


'["diabetic", "not-diabetic"]'

When you are finished testing your service, clean up the deployment with service.delete()

In [57]:
service.delete()

## Finetuning

Tune the model's hyperparameters with HyperDrive.  
A HyperDrive run lets you compare metrics across all hyperparameter combinations that were tried.  

[doc: how to tune hyperparameters](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters)  
[git: examples](https://github.com/microsoft/MLHyperparameterTuning)  

In [None]:
# Initialize workspace
ws = Workspace.from_config()

In [51]:
# Create AmlCompute
training_cluster = ComputeTarget(ws, 'aml-cluster')

In [52]:
# Create a project directory
project_folder = './diabetes_hyperdrive'
os.makedirs(project_folder, exist_ok=True)

In [53]:
# Experiment folder
experiment_folder = './' + project_folder

Prepare training script

In [54]:
%%writefile $experiment_folder/diabetes_training.py

import argparse
import os
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import glob
print("libraries imported...")

# Get the experiment run context
run = Run.get_context()
print("run context loaded...")

# Set regularization hyperparameter (passed as an argument to the script)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument('--C', type=float, default=1.0, help='Inverse of regularization strength')
parser.add_argument('--solver', type=str, default='lbfgs', help='Algorithm to use in the optimization problem')
args = parser.parse_args()
reg = args.reg_rate
run.log('Inverse of regularization strength', float(args.C))
run.log('Algorithm to use in the optimization problem', str(args.solver))
print("argparse parameters loaded...")

# load the diabetes dataset (File method)
# Get the training data from the estimator input identified as 'diabetes'
mount = run.input_datasets['diabetes'] # read-only mount from delta lake as '/mnt/data'
print("delta lake mounted...")
diabetes = pd.read_parquet('/mnt/data/diabetes.parquet') # load any file(s) from this delta lake mounted folder
print("dataset loaded...")

# save data into workspace
diabetes.to_csv("outputs/dataset.csv", index=False) # {logs/  outputs/}
print("test: write dataset to workspace 'outputs/dataset.csv'")

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=args.C, solver=args.solver).fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note: files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Writing ././diabetes_hyperdrive/diabetes_training.py


In [55]:
# Create an experiment name
experiment = Experiment(ws, 'diabetes-hyperdrive-training')

In [56]:
# Create an estimator that runs the scikit-learn training script

# get the training compute cluster
training_cluster = ComputeTarget(ws, 'aml-cluster')

# Set the script parameters
script_params = {
    '--regularization': 0.1,
    '--C': 10,
    '--solver': 'lbfgs',
}

# Get the docker environment
training_env = Environment.get(ws, 'training_environment')

# get the registered dataset by name
file_ds = Dataset.get_by_name(ws, "diabetes file dataset")

estimator = Estimator(source_directory=experiment_folder, # All the files in this directory are uploaded into the cluster nodes for execution
                      compute_target=training_cluster, # only compute allowed for hyperparameter tuning
                      entry_script='diabetes_training.py',
                      script_params=script_params,
                      environment_definition=training_env,
                      inputs=[file_ds.as_named_input('diabetes').as_mount(path_on_compute='/mnt/data')],
                     )



In [57]:
# define the hyperparameter space

param_sampling = RandomParameterSampling( {
    '--regularization': choice(1, 0.333, 0.1, 0.033),
    '--C': choice(1, 3, 10, 30),
    '--solver': choice('lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'),
    } )

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=20,   # 20 = C x solver = 4 x 5; the model only uses --C and --solver
                                         max_concurrent_runs=5,
                                        )
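
Random sampling does not enumerate the search space the way a grid does. The sketch below contrasts the two on an illustrative subset of the space above. It assumes plain independent random draws and makes no claim about how HyperDrive handles repeated combinations; the `space` dictionary and its contents are hypothetical.

```python
import itertools
import random

# illustrative subset of the search space above (names are hypothetical)
space = {
    'C': [1, 3, 10, 30],
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag'],
}

# full grid: every combination exactly once
grid = list(itertools.product(space['C'], space['solver']))

# random sampling: each draw picks every value independently
rng = random.Random(0)
draws = [(rng.choice(space['C']), rng.choice(space['solver'])) for _ in range(20)]

print(len(grid), "grid combinations")
print(len(set(draws)), "distinct combinations in 20 random draws")
```

Because draws can repeat, a run budget equal to the number of combinations does not by itself guarantee that every combination is tried.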

In [58]:
# start the HyperDrive experiment run (~25')
hyperdrive_run = experiment.submit(config=hyperdrive_run_config)



CPU times: user 325 ms, sys: 238 ms, total: 563 ms
Wall time: 3.97 s


This can take ~25min

In [59]:
# Show the run details while running
RunDetails(hyperdrive_run).show()  # the cell returns immediately; the run continues in the background

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

CPU times: user 25.3 ms, sys: 5.09 ms, total: 30.4 ms
Wall time: 316 ms


![hyperdrive_run](../../image/howto_automl/hyperdrive_run.png)

In [None]:
# the RUN must FINISH first, then continue...

In [60]:
# Find best run
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])

['--C', '3', '--regularization', '1', '--solver', 'liblinear']
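
The argument list of the best run is flat, with flags and values alternating. A small sketch to turn it into a dictionary, assuming every flag is followed by exactly one value (the list is copied from the output above, `best_params` is a hypothetical name):

```python
arguments = ['--C', '3', '--regularization', '1', '--solver', 'liblinear']

# pair each flag with the value that follows it and drop the leading dashes
best_params = {flag.lstrip('-'): value
               for flag, value in zip(arguments[::2], arguments[1::2])}

print(best_params)
```

The values stay strings, so they need a conversion such as `float(best_params['C'])` before being passed to a model.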


# Pipelines

Pipelines orchestrate machine learning operations, arranged sequentially or in parallel.  
A pipeline is a workflow of machine learning tasks in which each task is implemented as a step.  
Each step in the pipeline runs on its allocated compute target.  
A pipeline can be published as a REST endpoint, enabling client applications to initiate a pipeline run.

What follows is a simple **_2-step_** pipeline that trains and registers a model.  

1. storage
1. compute
1. environment
1. scripts
  * step 1: create a model
  * step 2: register the model
1. create pipeline
1. run pipeline
1. publish pipeline
1. call pipeline

### storage

The PipelineData object is a special kind of data reference that is used to pass data from the output of one pipeline step to the input of another, creating a dependency between them. You'll create one and use it as the output for the first step and the input for the second step. Note that you also need to pass it as a script argument so your code can access the datastore location referenced by the data reference.
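
The idea of one step writing to a folder that the next step reads can be sketched locally. The example below is only a stand-in under stated assumptions: a temporary directory plays the role of the PipelineData location, `pickle` replaces the model serialization, and `train_step` and `register_step` are hypothetical functions, not Azure ML pipeline steps.

```python
import os
import pickle
import tempfile


def train_step(output_folder):
    """Step 1: write a (dummy) model into the folder it was given."""
    os.makedirs(output_folder, exist_ok=True)
    with open(os.path.join(output_folder, "model.pkl"), "wb") as f:
        pickle.dump({"kind": "dummy-model"}, f)


def register_step(model_folder):
    """Step 2: read the model back from the same folder."""
    with open(os.path.join(model_folder, "model.pkl"), "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    shared_folder = os.path.join(tmp, "model_folder")  # plays the role of PipelineData
    train_step(shared_folder)
    loaded = register_step(shared_folder)

print(loaded)
```

Both functions receive the folder as an argument, which mirrors why the data reference also has to be passed to each script as an argument.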

In [61]:
# Initialize workspace
ws = Workspace.from_config()

# load the training diabetes dataset (File method)
ds = Datastore.get(ws, 'datalakestoragegen2')
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')] # {path/*.parquet or path/**}
diabetes_ds = Dataset.File.from_files(path=ds_path)

# Create a PipelineData (temporary Data Reference)
# data lake gen2: container "datalake" > azureml > 0b93a7bc-9bf2-46a9-b9c4-5afdba292d08 > model_folder
model_folder = PipelineData("model_folder", datastore=ds)

### compute

In [62]:
# load compute cluster
pipeline_cluster = ComputeTarget(ws, 'aml-cluster')

### environment

In [63]:
# load the docker environment
training_env = Environment.get(ws, 'training_environment')

# Create a new runconfig object for the pipeline
pipeline_run_config = RunConfiguration()

# Use the compute you created above.
pipeline_run_config.target = pipeline_cluster

# Assign the environment to the run configuration
pipeline_run_config.environment = training_env

### scripts

In [64]:
# Create a folder for the pipeline step files
project_folder = 'diabetes_pipeline'
os.makedirs(project_folder, exist_ok=True)

In [65]:
# Experiment folder
experiment_folder = './' + project_folder

**step 1:** create a model

In [None]:
# estimator step

In [66]:
%%writefile $experiment_folder/train_diabetes.py
# Import libraries
from azureml.core import Run
import argparse
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--output_folder', type=str, dest='output_folder', default="diabetes_model", help='output folder')
args = parser.parse_args()
output_folder = args.output_folder

# Get the experiment run context
run = Run.get_context()

# load the diabetes data (passed as an input dataset)
print("Loading Data...")
#diabetes = run.input_datasets['diabetes_train'].to_pandas_dataframe()
mount = run.input_datasets['diabetes_train'] # read-only mount from delta lake as '/mnt/data'
print("delta lake mounted...")
diabetes = pd.read_parquet('/mnt/data/diabetes.parquet') # load any file(s) from this delta lake mounted folder
print("dataset loaded...")

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model
os.makedirs(output_folder, exist_ok=True)
output_path = output_folder + "/model.pkl"
joblib.dump(value=model, filename=output_path)

run.complete()

Writing ./diabetes_pipeline/train_diabetes.py


**step 2:** register the model

In [None]:
# python script step

In [67]:
%%writefile $experiment_folder/register_diabetes.py
# Import libraries
import argparse
import joblib
from azureml.core import Workspace, Model, Run

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--model_folder', type=str, dest='model_folder', default="diabetes_model", help='model location')
args = parser.parse_args()
model_folder = args.model_folder

# Get the experiment run context
run = Run.get_context()

# load the model
print("Loading model from " + model_folder)
model_file = model_folder + "/model.pkl"
model = joblib.load(model_file)

Model.register(workspace=run.experiment.workspace,
               model_path=model_file,
               model_name='diabetes_model',
               tags={'Training context':'Pipeline'})

run.complete()

Writing ./diabetes_pipeline/register_diabetes.py


### create pipeline

|Common kinds of step|Description|
|-|-|
|**PythonScriptStep**|<i>Runs a specified Python script</i>|
|**EstimatorStep**|<i>Runs an estimator</i>|
|**DataTransferStep**|<i>Uses Azure Data Factory to copy data between data stores</i>|
|**DatabricksStep**|<i>Runs a notebook, script, or compiled JAR on a Databricks cluster</i>|
|**AdlaStep**|<i>Runs a U-SQL job in Azure Data Lake Analytics</i>|
|**[6 more steps](https://aka.ms/AA70rrh)**||

In [68]:
estimator = Estimator(source_directory=experiment_folder, 
                      compute_target=pipeline_cluster, #training_cluster
                      environment_definition=pipeline_run_config.environment, #training_env
                      entry_script='train_diabetes.py', #'diabetes_training.py'
                      # NO script_params=script_params,
                      # NO inputs=[file_ds.as_named_input('diabetes').as_mount(path_on_compute='/mnt/data')],
                     )

# Step 1, run the estimator to train the model
train_step = EstimatorStep(name="Train Model",
                           estimator=estimator,
                           compute_target=pipeline_cluster, # {'aml-cluster'}
                           estimator_entry_script_arguments=['--output_folder', model_folder],
                           inputs=[diabetes_ds.as_named_input('diabetes_train').as_mount(path_on_compute='/mnt/data')],
                           outputs=[model_folder],
                           allow_reuse=True)

# Step 2, run the model registration script
register_step = PythonScriptStep(name="Register Model",
                                 source_directory=experiment_folder,
                                 script_name="register_diabetes.py",
                                 compute_target=pipeline_cluster,
                                 runconfig=pipeline_run_config,
                                 inputs=[model_folder],
                                 arguments=['--model_folder', model_folder],
                                 allow_reuse=True)

# Construct the pipeline
pipeline = Pipeline(ws, [train_step, register_step])
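
The order of the list passed to `Pipeline` does not by itself decide which step runs first. Because `model_folder` appears in the `outputs` of the training step and in the `inputs` of the registration step, the registration step has to wait for the training step. The sketch below shows that idea on plain dictionaries. It is a conceptual illustration only, not the Azure ML implementation, and `order_steps` is a hypothetical helper.

```python
def order_steps(steps):
    """Order steps so that every step runs after the steps producing its inputs.

    `steps` maps a step name to a dict with 'inputs' and 'outputs' lists.
    """
    # Map each output name to the step that produces it
    producer = {out: name for name, spec in steps.items() for out in spec['outputs']}

    ordered = []
    remaining = dict(steps)
    while remaining:
        # A step is ready when each input is either external data
        # or was produced by a step that has already been scheduled
        ready = [name for name, spec in remaining.items()
                 if all(inp not in producer or producer[inp] in ordered
                        for inp in spec['inputs'])]
        if not ready:
            raise ValueError("circular dependency between steps")
        for name in sorted(ready):
            ordered.append(name)
            del remaining[name]
    return ordered


# The two steps of this notebook, deliberately listed in the "wrong" order
steps = {
    'Register Model': {'inputs': ['model_folder'], 'outputs': []},
    'Train Model': {'inputs': ['diabetes_train'], 'outputs': ['model_folder']},
}
execution_order = order_steps(steps)
print(execution_order)
```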



### run pipeline

Run the pipeline and verify that it works.

In [69]:
%%time
# Create an experiment
experiment = Experiment(ws, 'diabetes-training-pipeline')

# Run the pipeline
pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)

Created step Train Model [8bfdd59f][826a0671-8ad1-4c9e-a678-fbaab9404948], (This step will run and generate new outputs)
Created step Register Model [02634b9d][cad22272-034a-4b99-8a28-c3be1aba9876], (This step will run and generate new outputs)

Submitted PipelineRun fd40771b-9879-4c09-b4b0-41eb28ea2c02
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/diabetes-training-pipeline/runs/fd40771b-9879-4c09-b4b0-41eb28ea2c02?wsid=/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourcegroups/myResourceGroup02/workspaces/machine_learning_workspace02
CPU times: user 328 ms, sys: 617 ms, total: 945 ms
Wall time: 6.41 s


In [70]:
%%time
# Show run details
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

PipelineRunId: fd40771b-9879-4c09-b4b0-41eb28ea2c02
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/diabetes-training-pipeline/runs/fd40771b-9879-4c09-b4b0-41eb28ea2c02?wsid=/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourcegroups/myResourceGroup02/workspaces/machine_learning_workspace02
PipelineRun Status: Running


StepRunId: 2465180b-abca-4371-a291-6d5885c8e6a4
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/diabetes-training-pipeline/runs/2465180b-abca-4371-a291-6d5885c8e6a4?wsid=/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourcegroups/myResourceGroup02/workspaces/machine_learning_workspace02
StepRun( Train Model ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_10b995cbfa648953267dc71f615e5ee1469bbbdcec865bf581b238fb36ba6eb9_d.txt
2021-03-04T14:56:58Z Starting output-watcher...
2021-03-04T14:56:58Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2021-03-04T14:56:58Z Executing 'Copy

'Finished'

### publish pipeline

Publish the pipeline as a REST service endpoint.

In [71]:
%%time
published_pipeline = pipeline.publish(name="Diabetes_Training_Pipeline",
                                      description="Trains diabetes model",
                                      version="1.0")
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)

https://northeurope.api.azureml.ms/pipelines/v1.0/subscriptions/43c1f93a-903d-4b23-a4bf-92bd7a150627/resourceGroups/myResourceGroup02/providers/Microsoft.MachineLearningServices/workspaces/machine_learning_workspace02/PipelineRuns/PipelineSubmit/aff720bc-70bd-40a0-b922-a3e30bb88885
CPU times: user 57.3 ms, sys: 193 ms, total: 250 ms
Wall time: 672 ms


### call pipeline

In [72]:
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

In [73]:
experiment_name = 'Run-diabetes-pipeline'

response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": experiment_name})
run_id = response.json()["Id"]
run_id

'2596b0ee-c5e7-4984-b865-50ab419322e9'
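
The request above assumes the call succeeds and that the response contains an `Id` field. A slightly more defensive variant could look like the sketch below. The function `submit_published_pipeline` is a hypothetical helper, and `FakeResponse`, `fake_post`, and the example URL exist only so the sketch runs without network access or credentials.

```python
def submit_published_pipeline(post, rest_endpoint, auth_header, experiment_name):
    """POST to a published pipeline endpoint and return the new run ID.

    `post` is any callable that accepts the same arguments as `requests.post`.
    """
    response = post(rest_endpoint,
                    headers=auth_header,
                    json={"ExperimentName": experiment_name})
    response.raise_for_status()  # raise on 4xx/5xx instead of failing later
    body = response.json()
    if "Id" not in body:
        raise ValueError("response did not contain a run 'Id': %r" % (body,))
    return body["Id"]


class FakeResponse:
    """Stand-in for a requests.Response, used only for this offline demo."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # the fake response always counts as successful

    def json(self):
        return self._payload


def fake_post(url, headers=None, json=None):
    # Echo the experiment name back into a made-up run ID
    return FakeResponse({"Id": "run-for-" + json["ExperimentName"]})


demo_run_id = submit_published_pipeline(fake_post,
                                        "https://example.invalid/pipeline",
                                        {"Authorization": "Bearer <token>"},
                                        "Run-diabetes-pipeline")
print(demo_run_id)
```

In the notebook, the call would instead be `submit_published_pipeline(requests.post, rest_endpoint, auth_header, experiment_name)`.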

Since you have the run ID, you can use the RunDetails widget to monitor the run.

In [74]:
published_pipeline_run = PipelineRun(ws.experiments[experiment_name], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

![pipeline_run](../../image/howto_automl/pipeline_run.png)