# Hands-On Workshop
### Building Precision Marketing Campaign Solution with Azure ML

## Check Pre-requisites 

Check that the required Azure ML Python SDK modules are installed and note their versions. Every later step depends on them.   

AzureML Python SDK module and Class reference:    
https://docs.microsoft.com/en-us/python/api/azureml-core/?view=azure-ml-py

In [0]:
try:
    import azureml.core
    from azureml.core import Workspace, Dataset, Datastore, Environment, Experiment, Run, Model, ScriptRunConfig
    from azureml.core.webservice import AciWebservice
    from azureml.core.conda_dependencies import CondaDependencies
    print('Azure ML Python SDK version:', azureml.core.VERSION)

except Exception as e:
    print(e.args)
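
The cell above confirms the modules by importing them. As a complementary sketch, the installed version of each distribution can also be read without importing it, using the standard library's `importlib.metadata`. The helper name `installed_versions` and the package list are illustrative assumptions, not part of the workshop material.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative package list; adjust it to the modules your notebook needs.
report = installed_versions(["azureml-core", "scikit-learn", "xgboost"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A missing package shows up as `NOT INSTALLED` instead of raising, which makes it easier to see every gap in one pass.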

## Setup Project Workspace

[Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) class definition

In [0]:
import os

import azureml.core
from azureml.core import Workspace

from azureml.core.authentication import ServicePrincipalAuthentication

# never hardcode the client secret in a notebook: read it from an environment variable
# (the variable name AZUREML_SP_PASSWORD is illustrative; set it before running this cell)
sp = ServicePrincipalAuthentication(tenant_id="72f988bf-86f1-41af-91ab-2d7cd011db47", # tenantID
                                    service_principal_id="2cfbcca2-c1a0-4e4a-a43e-2ac27f068242", # clientId
                                    service_principal_password=os.environ["AZUREML_SP_PASSWORD"]) # clientSecret

# specify the workspace explicitly (subscription, resource group, workspace name)
subscription_id = '09ba1f2e-4799-434c-9f88-6ca60b368ac8'
resource_group = 'mlservicedemo'
workspace_name = 'mlservicedemo'

ws = Workspace(subscription_id, resource_group, workspace_name, auth = sp)
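
Service principal credentials are usually better supplied from outside the notebook. The following is a minimal sketch of one way to do that, assuming the three values live in environment variables. The variable names and the helper `load_service_principal_settings` are hypothetical; only the keyword names in the returned dict match the `ServicePrincipalAuthentication` call used above.

```python
import os

# Hypothetical variable names; use whatever your secret store exposes.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def load_service_principal_settings(env=None):
    """Return the service principal settings, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "tenant_id": env["AZURE_TENANT_ID"],
        "service_principal_id": env["AZURE_CLIENT_ID"],
        "service_principal_password": env["AZURE_CLIENT_SECRET"],
    }


# Demonstration with a fake environment (no real credentials involved).
fake_env = {
    "AZURE_TENANT_ID": "tenant-placeholder",
    "AZURE_CLIENT_ID": "client-placeholder",
    "AZURE_CLIENT_SECRET": "secret-placeholder",
}
settings = load_service_principal_settings(fake_env)
print(sorted(settings))
```

The returned dict could then be unpacked as `ServicePrincipalAuthentication(**settings)`.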

## Securely Access Shared Data via Datastore & Dataset

[Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore(class)?view=azure-ml-py) class definition    
[Dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset(class)?view=azure-ml-py) class definition

**IMPORTANT**   
Make sure the required Datastore and Dataset are configured, with the appropriate level of access granted for this hands-on session (e.g. via a SAS token).
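
A SAS token carries its own expiry in the `se` query field, so it can be checked before a session starts. The sketch below assumes the expiry uses the full `YYYY-MM-DDTHH:MM:SSZ` form and ignores the optional start time (`st`). The helper names and the example token are made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_expiry(sas_token):
    """Return the expiry ('se' field) of a SAS token as an aware UTC datetime."""
    fields = parse_qs(sas_token.lstrip("?"))
    expiry_text = fields["se"][0]
    parsed = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def sas_is_valid(sas_token, now=None):
    """True if the token has not yet expired at `now` (defaults to the current time)."""
    now = datetime.now(timezone.utc) if now is None else now
    return now < sas_expiry(sas_token)


# Made-up token; the signature is a placeholder, not a real credential.
example_token = "?sv=2020-08-04&sp=rl&se=2030-01-01T00:00:00Z&spr=https&sig=PLACEHOLDER"
print("expires:", sas_expiry(example_token).isoformat())
print("valid now:", sas_is_valid(example_token))
```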

In [0]:
# blob account https://storageblobdatabrick.blob.core.windows.net/amldata
# blob sasurl <redacted: keep the SAS token in a secret store, not in the notebook>

In [0]:
from azureml.core import Datastore

# name of the datastore to fetch; set to None to list all datastores instead
datastore_name = 'chikustoragebb'

if datastore_name:
    # get a named datastore from the current workspace
    datastore = Datastore.get(ws, datastore_name=datastore_name)
    print(datastore)
else:
    # list all registered datastores in current workspace
    for name, ds in ws.datastores.items():
        print(name, ds.datastore_type)

In [0]:
!pip install azureml-dataset-runtime --upgrade

In [0]:
from azureml.core import Dataset

# get the dataset with specified version
dataset_name = 'bank'
dataset_version = 1
dataset = Dataset.get_by_name(workspace=ws, name=dataset_name, version=dataset_version)

# store it into pandas DF
df = dataset.to_pandas_dataframe()
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


## Setup Experiment for Tracking & Reproducibility
[Experiment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py) class reference

In [0]:
from azureml.core import Experiment

# setup Experiment for tracking
experiment_name = 'exp-bank-marketing'
exp = Experiment(workspace=ws, name=experiment_name)
exp

Name,Workspace,Report Page,Docs Page
exp-bank-marketing,mlservicedemo,Link to Azure Machine Learning studio,Link to Documentation


## Assisted Exploratory Data Analysis

* use standard `Pandas Dataframe` to describe, inspect Dataframe
* use `Azure ML Data Profile` feature to run comprehensive data inspection

In [0]:
df.describe()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0
mean,41.170095,1422.657819,15.915284,263.961292,2.79363,39.766645,0.542579
std,10.576211,3009.638142,8.247667,259.856633,3.109807,100.121124,1.693562
min,19.0,-3313.0,1.0,4.0,1.0,-1.0,0.0
25%,33.0,69.0,9.0,104.0,1.0,-1.0,0.0
50%,39.0,444.0,16.0,185.0,2.0,-1.0,0.0
75%,49.0,1480.0,21.0,329.0,3.0,-1.0,0.0
max,87.0,71188.0,31.0,3025.0,50.0,871.0,25.0


### Opensource Data Profiler
Suited to small datasets for quick exploratory analysis. https://github.com/pandas-profiling/pandas-profiling

In [0]:
# if not installed run
# %pip install pandas-profiling

import pandas_profiling as pdp
pdp.ProfileReport(df)

### Azure Data Profiler
Suited to large datasets, leveraging the power of scalable clustered compute

* see Azure ML Studio UI Datasets for "Generate Data Profile"
* alternatively run profiling task using Azure ML Python SDK: [Dataset Profile](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.dataset_profile.datasetprofile?view=azure-ml-py) class reference

![Azure Data Profiler](https://jixjiastoragegbb.blob.core.windows.net/public/trendmicro_workshop/azureml_data_profiling.png?sv=2020-08-04&st=2022-01-19T09%3A40%3A14Z&se=2030-01-20T09%3A40%3A00Z&sr=b&sp=r&sig=I5gXrfNkaUjPXV7cX5q3YY8xQTf%2BVImuR3UQOjI0rg4%3D)

## Option 1 - Build Reproducible Model with AzureML infused MLOps practices
This workshop builds an XGBoost classifier through its Scikit-Learn-compatible API, but the same practice applies to any major machine learning framework (TensorFlow/Keras, PyTorch).

Also refer to "Building ID Masking AI Solution" for training custom CNN deep learning model on clustered compute with GPU.

### (1) Train on Local Compute Instance
This works just as it does in a Jupyter Notebook or your favorite local IDE, except that it runs in the cloud.

In [0]:
# inspect feature types
categorical_cols = [df.columns[idx] for idx, i in enumerate(df.dtypes) if i.name=='object']
numerical_cols = [df.columns[idx] for idx, i in enumerate(df.dtypes) if i.name!='object']

print(f'Categorical features: {categorical_cols}\nNumerical features: {numerical_cols}')

**Prepare XGBoost Classifier train script**    

* Adding `Azure ML Run Experiment Tracking` for reproducibility and asset tracking

In [0]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_curve, auc, f1_score
from xgboost import XGBClassifier
import joblib
import numpy as np


# initiate a Run to track training
run = exp.start_logging()

''' (1) Manual Feature Engineering '''

# split dataframe into categorical features, numerical features and label respectively
categorical_df = df[df.columns.drop('y')].select_dtypes(include=['object']).dropna()
numerical_df = df.select_dtypes(exclude=['object']).dropna()
label_df = df.loc[:, df.columns == 'y'].dropna()

# morph categorical features into numpy 2d array
X_cat = categorical_df.values

# morph label into numpy 1d array
Y = np.reshape(label_df.values,(-1))

# one-hot encode (OHE) categorical features one at a time
X_cat_encoded = None

for i in range(0, X_cat.shape[1]):
    # encode string feature (Xi)
    le = LabelEncoder()
    encoded_features = le.fit_transform(X_cat[:,i])
    encoded_features = encoded_features.reshape(X_cat.shape[0], 1)
    
    # perform OHE transformation
    ohe = OneHotEncoder(sparse=False, categories='auto')
    ohe_features = ohe.fit_transform(encoded_features) 
    
    # combine OHE features into one array
    if X_cat_encoded is None:
        X_cat_encoded = ohe_features
    else:
        X_cat_encoded = np.concatenate((X_cat_encoded, ohe_features), axis=1)

# encode string label (Y)
le = LabelEncoder()
Y_encoded = le.fit_transform(Y)

# combine encoded categorical features with numerical features to form a complete feature array (X)
X = np.hstack((X_cat_encoded, numerical_df.values))

# train/val split
X_train, X_test, y_train, y_test = train_test_split(X, Y_encoded, test_size=.2, random_state=42)


''' (2) Model Training with Hyperparam Tuning (skipped and reuse AutoML result) '''

# fit model on training data
# this part uses the best tuned hyperparameter settings from AutoML Run result
# track hyperparameters in run log
gamma = 5
max_depth = 6
max_leaves = 15
n_estimators = 100
reg_alpha = 2.395
reg_lambda = 1.04
subsample = 0.7
eta = 0.1
lr = 0.3

model = XGBClassifier(
                booster='gbtree', 
                colsample_bylevel=1,
                colsample_bynode=1, 
                colsample_bytree=0.8, 
                gamma=gamma,
                eta=eta,
                learning_rate=lr, 
                max_delta_step=0, 
                max_depth=max_depth,
                max_leaves=max_leaves,
                min_child_weight=1, 
                missing=np.nan, 
                n_estimators=n_estimators, 
                n_jobs=1,
                nthread=None, 
                objective='binary:logistic', 
                random_state=0,
                reg_alpha=reg_alpha, 
                reg_lambda=reg_lambda, 
                scale_pos_weight=1, 
                subsample=subsample)

model.fit(X_train, y_train)

# track hyper parameters for team visibility
run.log("lr", lr)
run.log("gamma", gamma)
run.log("alpha", reg_alpha)
run.log("lambda", reg_lambda)
run.log("max_depth", max_depth)
run.log("max_leaves", max_leaves)

# evaluate model using validation data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# calculate accuracy
accuracy = accuracy_score(y_test, predictions)
f1_weighted = f1_score(y_test, predictions, average='weighted')
fpr, tpr, thresholds = roc_curve(y_test, predictions)
auc_score = auc(fpr, tpr)

print(f"Validation Accuracy: {accuracy * 100.0:.2f}%")
print(f"ROC AUC score: {auc_score * 100.0:.2f}%")
print(f"F1 score: {f1_weighted:.3f}")

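# serialize and save (create the output directory first if it does not exist)
import os
model_name = 'bank-marketing-xgboost'
model_path = "outputs/model.pkl"
os.makedirs(os.path.dirname(model_path), exist_ok=True)
joblib.dump(value=model, filename=model_path)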

# post for tracking run artifacts
run.upload_file(name=model_name, path_or_stream=model_path)
run.complete()
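
The manual encoding above fits its encoders on the training frame only, so the same transformation is not directly available at inference time. One common alternative is to bundle preprocessing and the classifier into a single scikit-learn `Pipeline`. The sketch below is an illustration under stated assumptions, not the workshop's method: the data is a tiny made-up sample, and `LogisticRegression` stands in for `XGBClassifier` so the example does not depend on xgboost.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up sample that mimics a few of the bank dataset's columns.
train_df = pd.DataFrame({
    "age": [30, 33, 35, 59, 41, 28],
    "balance": [1787, 4789, 1350, 0, 295, 1137],
    "job": ["unemployed", "services", "management",
            "blue-collar", "technician", "blue-collar"],
    "marital": ["married", "married", "single", "married", "married", "single"],
    "y": ["no", "no", "yes", "no", "yes", "no"],
})
features = train_df.drop(columns="y")
categorical_cols = ["job", "marital"]

# One-hot encode the categorical columns; numerical columns pass through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),  # stand-in for XGBClassifier
])
pipeline.fit(features, train_df["y"])

# Raw rows, including categories never seen in training, can be scored directly.
new_rows = pd.DataFrame({
    "age": [45], "balance": [500], "job": ["retired"], "marital": ["divorced"],
})
print(pipeline.predict(new_rows))
```

Serializing the whole pipeline with `joblib.dump` would keep the encoders and the model in one artifact.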

In [0]:
# Show run outputs in UI
run

Experiment,Id,Type,Status,Details Page,Docs Page
exp-bank-marketing,8fdd3bb2-45b2-485f-b1ba-ec751d2240ae,,Running,Link to Azure Machine Learning studio,Link to Documentation


### (2) Train against Remote Clustered Compute

So far we have trained the model on the 'local' machine (the compute instance). The same method can submit the job to more scalable clustered compute targets (e.g. AKS, AML Compute Cluster, Azure Databricks, Azure Synapse) by changing a single line of code. 

Full list of supported compute targets:    
https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target

#### (A) Package training script into job

* Add FUSE-mounted dataset access for running across a multi-node cluster 
* Package a reusable training virtual environment and its dependencies (if not using a Curated Environment)
* Add an argument parser so the batch job control plane can control script behavior
* Add the AzureML packages required for Experiment Run tracking
* Keep experiments reproducible through logging, tracking and shared asset reuse
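
The argument-parser step in the list above can be sketched in isolation. This minimal example shows how flags with defaults are declared and how a list of command-line tokens, such as the one a job submission would supply, is turned into typed values. The flag subset and the helper name `build_parser` are chosen for illustration only.

```python
import argparse


def build_parser():
    """Hyperparameter flags for a training script, each with a default."""
    parser = argparse.ArgumentParser(description="Train a classifier")
    parser.add_argument("--lr", type=float, default=0.1, help="learning rate")
    parser.add_argument("--max_depth", type=int, default=10, help="maximum tree depth")
    parser.add_argument("--gamma", type=float, default=5.0, help="minimum split loss")
    return parser


# Passing an explicit list stands in for the values a job submission would supply.
args = build_parser().parse_args(["--lr", "0.5", "--max_depth", "5"])
print(vars(args))
```

Flags that are not supplied fall back to their defaults, so the same script runs unchanged both interactively and as a batch job.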

In [0]:
%%writefile train.py
from azureml.core import Run, Dataset, Workspace, Experiment
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_curve, auc, f1_score
from xgboost import XGBClassifier
import joblib
import numpy as np
import argparse

# get current run context from the control plane (batch job)
run = Run.get_context()

# get workspace from current run context
ws = run.experiment.workspace

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-l", "--lr", type=float, default=0.1)
ap.add_argument("-ml", "--max_leaves", type=int, default=10)
ap.add_argument("-md", "--max_depth", type=int, default=10)
ap.add_argument("-ra", "--reg_alpha", type=float, default=1.0)
ap.add_argument("-rl", "--reg_lambda", type=float, default=1.0)
ap.add_argument("-g", "--gamma", type=int, default=5)
args = ap.parse_args()

# get the dataset with specified version
dataset_name = 'bank'  # same registered dataset the notebook loads earlier
dataset_version = 1
dataset = Dataset.get_by_name(
    workspace=ws, 
    name=dataset_name, 
    version=dataset_version)

# store it into pandas DF
df = dataset.to_pandas_dataframe()


''' (1) Manual Feature Engineering '''

# split dataframe into categorical features, numerical features and label respectively
categorical_df = df[df.columns.drop('y')].select_dtypes(include=['object']).dropna()
numerical_df = df.select_dtypes(exclude=['object']).dropna()
label_df = df.loc[:, df.columns == 'y'].dropna()

# morph categorical features into numpy 2d array
X_cat = categorical_df.values

# morph label into numpy 1d array
Y = np.reshape(label_df.values,(-1))

# one-hot encode (OHE) categorical features one at a time
X_cat_encoded = None

for i in range(0, X_cat.shape[1]):
    # encode string feature (Xi)
    le = LabelEncoder()
    encoded_features = le.fit_transform(X_cat[:,i])
    encoded_features = encoded_features.reshape(X_cat.shape[0], 1)
    
    # perform OHE transformation
    ohe = OneHotEncoder(sparse=False, categories='auto')
    ohe_features = ohe.fit_transform(encoded_features) 
    
    # combine OHE features into one array
    if X_cat_encoded is None:
        X_cat_encoded = ohe_features
    else:
        X_cat_encoded = np.concatenate((X_cat_encoded, ohe_features), axis=1)

# encode string label (Y)
le = LabelEncoder()
Y_encoded = le.fit_transform(Y)

# combine encoded categorical features with numerical features to form a complete feature array (X)
X = np.hstack((X_cat_encoded, numerical_df.values))

# train/val split
X_train, X_test, y_train, y_test = train_test_split(X, Y_encoded, test_size=.2, random_state=42)


''' (2) Model Training with Hyperparam Tuning (skipped and reuse AutoML result) '''

# fit model on training data
# this part uses the best tuned hyperparameter settings from AutoML Run result
# track hyperparameters in run log
model = XGBClassifier(
                booster='gbtree', 
                colsample_bylevel=1,
                colsample_bynode=1, 
                colsample_bytree=1, 
                gamma=args.gamma,
                eta=0.1,
                learning_rate=args.lr, 
                max_delta_step=0, 
                max_depth=args.max_depth,
                max_leaves=args.max_leaves,
                min_child_weight=1, 
                missing=None, 
                n_estimators=100, 
                n_jobs=1,
                nthread=None, 
                objective='binary:logistic', 
                random_state=0,
                reg_alpha=args.reg_alpha, 
                reg_lambda=args.reg_lambda, 
                scale_pos_weight=1, 
                subsample=1)

model.fit(X_train, y_train)

# track hyper parameters for team visibility
run.log("lr", args.lr)
run.log("alpha", args.reg_alpha)
run.log("lambda", args.reg_lambda)
run.log("max_depth", args.max_depth)
run.log("max_leaves", args.max_leaves)
run.log("gamma", args.gamma)

# evaluate model using validation data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# calculate accuracy
accuracy = accuracy_score(y_test, predictions)
f1_weighted = f1_score(y_test, predictions, average='weighted')
fpr, tpr, thresholds = roc_curve(y_test, predictions)
auc_score = auc(fpr, tpr)

print(f"Validation Accuracy: {accuracy * 100.0:.2f}%")
print(f"ROC AUC score: {auc_score * 100.0:.2f}%")
print(f"F1 score: {f1_weighted:.3f}")

# serialize and save
model_name = 'bank-marketing-xgboost'
model_path = "outputs/model.pkl"
joblib.dump(value=model, filename=model_path)

# post for tracking run artifacts
run.upload_file(name=model_name, path_or_stream=model_path)
run.complete()

Overwriting train.py


#### (B) Package Environment for Reusable Custom Training

* Add training environment (either Curated or custom defined)
* Build a Docker environment image from a curated base image with the custom dependencies you define 

[Environment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment(class)?view=azure-ml-py) class reference

![image-alt-text](https://jixjiastoragegbb.blob.core.windows.net/public/trendmicro_workshop/azureml_custom_env_docker_image.png?sv=2020-08-04&st=2022-01-19T10%3A15%3A45Z&se=2033-01-20T10%3A15%3A00Z&sr=b&sp=r&sig=xJsrK0bAkNR6nCw%2Bf6076nYCAA7hWCIu9%2BLzA8ZcUDY%3D)

**Define Environment via YAML**

In [0]:
%%writefile conda_environment_train.yml

dependencies:
- python=3.8.1
- pip
- py-xgboost<=0.90
- pip:
  - azureml-dataset-runtime[pandas,fuse]
  - azureml-defaults
  - imutils==0.5.3
  - numpy==1.18.5
  - scikit-learn==0.22
  - inference-schema
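
In a conda environment file, conda packages are listed as plain strings under `dependencies`, while pip packages sit in a nested `pip` mapping. The sketch below illustrates that structure on an already-parsed spec (a Python dict rather than YAML, to avoid a YAML library dependency). The helper `split_dependencies` and the shortened package list are illustrative assumptions.

```python
def split_dependencies(spec):
    """Separate conda packages from pip packages in a parsed conda environment spec."""
    conda_packages, pip_packages = [], []
    for entry in spec.get("dependencies", []):
        if isinstance(entry, dict):
            # pip packages are nested under a {"pip": [...]} mapping
            pip_packages.extend(entry.get("pip", []))
        else:
            conda_packages.append(entry)
    return conda_packages, pip_packages


# Mirrors how a conda YAML file looks once parsed into Python objects.
spec = {
    "dependencies": [
        "python=3.8.1",
        "py-xgboost<=0.90",
        {"pip": ["azureml-defaults", "scikit-learn==0.22"]},
    ]
}
conda_packages, pip_packages = split_dependencies(spec)
print("conda:", conda_packages)
print("pip:", pip_packages)
```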

**Register Environment**

In [0]:
from azureml.core.environment import Environment

# option 1 - use a custom defined environment
env = Environment.from_conda_specification(
    name='trendmicro-xgboost-train-env', 
    file_path='./conda_environment_train.yml')

'''
# option 2 - use a curated environment that has already been built
env = Environment.get(workspace=ws, 
                      name="AzureML-xgboost-0.9-ubuntu18.04-py37-cpu-inference", 
                      version=1)
'''

# register env for reuse
env.register(workspace=ws)

#### (C) Batch training using Clustered Compute

* Submit against target compute (either `Local` or any supported `Compute Targets`)

[ComputeTarget](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget?view=azure-ml-py) class reference    
[AML Compute Provisioning](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute.amlcomputeprovisioningconfiguration?view=azure-ml-py) class reference
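
The "use the cluster if it exists, otherwise run locally" decision can be sketched without any Azure dependency. In the example below, `FakeCluster` and `FakeInstance` are hypothetical stand-ins for SDK compute classes, and the dict stands in for a workspace's mapping of compute names to compute objects.

```python
class FakeCluster:
    """Stand-in for a cluster object such as AmlCompute."""


class FakeInstance:
    """Stand-in for some other kind of compute."""


def pick_compute_target(available, name, expected_type, fallback="local"):
    """Return the named target if it exists with the expected type, else the fallback."""
    target = available.get(name)
    if isinstance(target, expected_type):
        return target
    return fallback


available = {"gpu-cluster": FakeCluster(), "dev-box": FakeInstance()}
print(pick_compute_target(available, "gpu-cluster", FakeCluster))  # the cluster object
print(pick_compute_target(available, "dev-box", FakeCluster))      # wrong type: fallback
print(pick_compute_target(available, "missing", FakeCluster))      # not found: fallback
```

Using `isinstance` rather than an exact type comparison also accepts subclasses of the expected compute type.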

In [0]:
# if azureml.widgets not installed
# %pip install azureml-widgets

from azureml.widgets import RunDetails
from azureml.core import ScriptRunConfig
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# define target compute cluster
compute_name = 'prod-ds5v2-x4-tm'

# check compute target availability
if compute_name in ws.compute_targets and isinstance(ws.compute_targets[compute_name], AmlCompute):
    compute_target = ws.compute_targets[compute_name]
    print("Found compute target! Set to use cluster: " + compute_name)
else:
    print(f"Cannot find {compute_name} or not qualified. Set to use LOCAL instance instead.")
    compute_target = 'local'

# script run config for batch train job
src = ScriptRunConfig(
    source_directory="./",
    script="train.py",
    arguments=['--lr', 0.5, '--max_leaves', 10, '--max_depth', 5, '--reg_alpha', 1.0, '--reg_lambda', 0.8, '--gamma', 5],
    compute_target=compute_target,
    environment=env,
)

# submit job
run = exp.submit(config=src)

# monitor the run
RunDetails(run).show()

Found compute target! Set to use cluster: prod-ds5v2-x4-tm


_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### (3) Register Best Model

* For asset tracking, sharing and reproducibility
* This part shows registering any arbitrarily generated models (not limited by runs or experiments)

[Model](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py) class reference
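
Instead of copying a run ID by hand, the best run can be picked programmatically from logged metrics. The sketch below works on a plain dict of made-up metrics keyed by hypothetical run IDs. With the SDK, such a mapping could presumably be assembled from `Experiment.get_runs()` and `Run.get_metrics()`, which is an assumption about how you would wire it up rather than something this workshop does.

```python
def best_run_id(run_metrics, metric, higher_is_better=True):
    """Pick the run ID with the best value for `metric`; runs missing it are skipped."""
    scored = {run_id: m[metric] for run_id, m in run_metrics.items() if metric in m}
    if not scored:
        raise ValueError(f"No run reported the metric {metric!r}")
    chooser = max if higher_is_better else min
    return chooser(scored, key=scored.get)


# Made-up metrics keyed by hypothetical run IDs.
run_metrics = {
    "run-a": {"accuracy": 0.89, "auc": 0.71},
    "run-b": {"accuracy": 0.91, "auc": 0.68},
    "run-c": {"accuracy": 0.90},
}
print(best_run_id(run_metrics, "accuracy"))  # run-b
print(best_run_id(run_metrics, "auc"))       # run-a
```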

In [0]:
from azureml.core.run import Run

# specify the best run's RUN ID from AutoML result page
# we can also programmatically search for the best run given a target metric of interest
best_run_id = '8fdd3bb2-45b2-485f-b1ba-ec751d2240ae'

# get run details
best_run = Run(experiment=exp, run_id=best_run_id)
best_run.get_details()

# Download the model from run history
best_run.download_file(name='outputs/model.pkl',output_file_path='./model/model.pkl')

In [0]:
%sh ls /databricks/driver/model

In [0]:
import sklearn
from azureml.core import Workspace
from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

best_model = Model.register(workspace=ws,
                       model_name='bank-marketing-xgboost',                        # Name of the registered model in your workspace.
                       model_path='./model/model.pkl',                        # Local file to upload and register as a model.
                       model_framework=Model.Framework.SCIKITLEARN,  # Framework used to create the model.
                       model_framework_version=sklearn.__version__,  # Version of scikit-learn used to create the model.
                       tags={'project':'trendmicro-workshop', 'algorithm':'xgboost'},
                       description='Model to predict campaign acceptance propensity')

print('Model name:', best_model.name)
print('Version:', best_model.version)

### (4) Define Inference Scoring Function `score.py` (and Inference Schema)

In [0]:
%%writefile score.py

import json
import pickle
import numpy as np
import pandas as pd
import os
import joblib
from azureml.core.model import Model

# setup swagger inference schema (for OpenAPI compatible clients)
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType


def init():
    global model
    
    # Load serialized model from ACI/AKS deploy path (AZUREML_MODEL_DIR)
    path = os.getenv('AZUREML_MODEL_DIR') 
    model_path = os.path.join(path, 'model.pkl')
    
    # Deserialize the model back into memory with joblib (for sk-learn)
    model = joblib.load(model_path)

# define inference input schema (swagger)
input_sample = pd.DataFrame({
    "age": pd.Series([0], dtype="int64"), 
    "job": pd.Series(["example_value"], dtype="object"), 
    "marital": pd.Series(["example_value"], dtype="object"), 
    "education": pd.Series(["example_value"], dtype="object"), 
    "default": pd.Series(["example_value"], dtype="object"), 
    "housing": pd.Series(["example_value"], dtype="object"), 
    "loan": pd.Series(["example_value"], dtype="object"), 
    "contact": pd.Series(["example_value"], dtype="object"),
    "month": pd.Series(["example_value"], dtype="object"), 
    "day_of_week": pd.Series(["example_value"], dtype="object"), 
    "duration": pd.Series([0], dtype="int64"), 
    "campaign": pd.Series([0], dtype="int64"), 
    "pdays": pd.Series([0], dtype="int64"), 
    "previous": pd.Series([0], dtype="int64"), 
    "poutcome": pd.Series(["example_value"], dtype="object"), 
    "emp.var.rate": pd.Series([0.0], dtype="float64"), 
    "cons.price.idx": pd.Series([0.0], dtype="float64"), 
    "cons.conf.idx": pd.Series([0.0], dtype="float64"), 
    "euribor3m": pd.Series([0.0], dtype="float64"), 
    "nr.employed": pd.Series([0.0], dtype="float64")
    })

# define output schema (swagger)
output_sample = np.array(["example_value"])

@input_schema('data', PandasParameterType(input_sample))
@output_schema(NumpyParameterType(output_sample))
def run(data):
    try:
        '''
        [Gin]
        Add your custom input data preprocessing steps here...
        For this workshop I'll showcase using AutoML's built-in data transformation pipeline
        Hence will skip this part and perform inference directly on raw input
        '''
        result = model.predict(data)
        print(result)
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error
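
The `init()`/`run()` contract of a scoring script can be exercised locally before deployment. The sketch below is a simplified stand-in, not the workshop's `score.py`: it uses a hypothetical `StubModel` in place of the deserialized model and omits the inference-schema decorators, so `run` receives the raw JSON body and parses it itself.

```python
import json


class StubModel:
    """Hypothetical stand-in for the deserialized model; predicts 'no' for every row."""

    def predict(self, rows):
        return ["no" for _ in rows]


model = None


def init():
    """Called once when the service starts; loads the model into a global."""
    global model
    model = StubModel()  # a real script would deserialize the model file here


def run(raw_data):
    """Called per request with the raw JSON body; returns a JSON-serializable result."""
    try:
        rows = json.loads(raw_data)["data"]
        return list(model.predict(rows))
    except Exception as exc:
        return {"error": str(exc)}


init()
payload = json.dumps({"data": [[30, "unemployed"], [33, "services"]]})
print(run(payload))
print(run("not json"))
```

Returning a structured error keeps the response JSON-serializable even when the payload is malformed.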

## Option 2 - Leverage Best Model from AutoML Run Results

The best model has already been fine-tuned by Azure ML's AutoML, so we can reuse its output and built-in data transformation pipelines in any CI/CD build and release process without reinventing the wheel.

#### (1) Fetch Run Information

In [0]:
from azureml.core.run import Run

# specify the best run's RUN ID from AutoML result page
# we can also programmatically search for the best run given a target metric of interest
best_run_id = 'AutoML_4fe602e9-4273-4cb5-a234-8e538c01ddfe_63'

# get run details
best_run = Run(experiment=exp, run_id=best_run_id)
best_run.get_details()

Out[56]: {'runId': 'AutoML_4fe602e9-4273-4cb5-a234-8e538c01ddfe_63',
 'target': 'prod-ds5v2-x4-tm',
 'status': 'Completed',
 'startTimeUtc': '2022-01-17T06:07:47.101228Z',
 'endTimeUtc': '2022-01-17T06:09:58.263576Z',
 'services': {},
 'properties': {'runTemplate': 'automl_child',
  'pipeline_id': '__AutoML_Ensemble__',
  'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'classification\',\'primary_metric\':\'AUC_weighted\',\'ensemble_iterations\':15,\'is_timeseries\':False,\'name\':\'exp-bank-marketing\',\'compute_target\':\'prod-ds5v2-x4-tm\',\'subscription_id\':\'d7d72c6d-f9bf-48e3-b11e-6b6c9196e6bc\',\'region\':\'japaneast\'}","ensemble_run_id":"AutoML_4fe602e9-4273-4cb5-a234-8e538c01ddfe_63","experiment_name":"exp-bank-marketing","workspace_name":"ws-trendmicro","subscription_id":"d7d72c6d-f9bf-48e3-b11e-6b6c919

#### (2) Fetch Model Artifacts

In [0]:
from azureml.core.model import Model
import os

# get best model's information from run details
best_model_path = best_run.get_details()['properties']['model_output_path']
best_model_name = best_run.get_details()['properties']['model_name']
print(f'Found best model: {best_model_name} ({best_model_path})')

# retrieve best model
best_model = Model(workspace=ws, name=best_model_name)

# download the model to local project (on local compute instance)
print('\nDownloading to local compute instance...')
best_model.download(target_dir=os.path.join(os.getcwd(), 'outputs'), exist_ok=True)

Found best model: AutoML4fe602e9463 (outputs/model.pkl)

Downloading to local compute instance...


Out[57]: '/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-cpu-ds5v2-tm/code/Users/jixinjia/bank-campaign-propensity/outputs/model.pkl'

#### (3) Fetch build environment definition (YAML)

In [0]:
import os

# fetch conda environment yaml definition from Run context
url = best_run.get_details()['properties']['conda_env_data_location']

# download it from the 'outputs' directory managed by AzureML (for all run artifacts)
base_name = os.path.basename(url)
best_run.download_file(os.path.join('outputs', base_name),'conda_environment.yml')

print('Fetched Conda environment yaml for restoring the build of the Best Run')

Fetched Conda environment yaml for restoring the build of the Best Run


#### (4) Fetch Scoring Function (score.py)

**IMPORTANT**   
This is a sample produced by AutoML. Do NOT use it if you plan to include custom data preprocessing steps in the inference process, as this workshop does.

In [0]:
import os

# fetch sample score.py produced by Run context
url = best_run.get_details()['properties']['scoring_data_location']

# download it from the 'outputs' directory managed by AzureML (for all run artifacts)
base_name = os.path.basename(url)
best_run.download_file(os.path.join('outputs', base_name), base_name)

print(f'Fetched Scoring Function ({base_name}) built by AutoML from Best Run')

Fetched Scoring Function (scoring_file_v_1_0_0.py) built by AutoML from Best Run


## Model Build, Package, Release and Deploy (CI/CD) with AzureML

The following steps are executed using AzureML's built-in MLOps features:

* Package all required dependencies, artifacts and model
* Generate docker build file
* Instantiate a Flask/Gunicorn and Nginx based webservice
* Build images
* Register image and push to private container registry
* Setup target inference compute (ACI/AKS/AzureMLCompute/Databricks)
* Deploy
* Logging and managed endpoint monitoring

In [0]:
%sh ls /databricks/driver/

In [0]:
import sklearn
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# reuse build environment from earlier steps (Option1/2)
env = Environment.from_conda_specification(name='trendmicro-xgboost-train-env', file_path='./conda_environment_train.yml')

''' 
[Gin]
alternatively, create a new inference environment on-the-fly
with custom defined dependencies that best suits the deployment environment
For this workshop I'll opt to use the ad-hoc environment
'''

# add custom pip / conda dependencies
env = Environment('sklearn0.22-xgboost-automl')

env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=[
        'azureml-defaults==1.37.0',
        'azureml-interpret==1.37.0',
        'azureml-train-automl-runtime==1.37.0',
        'inference-schema',
        'numpy>=1.16.0,<1.19.0',
        'pandas==0.25.1',
        'scikit-learn=={}'.format(sklearn.__version__)
    ],
    conda_packages = [
        'py-xgboost<=0.90'
    ])

# setup inference runtime config
inference_config = InferenceConfig(entry_script='./score.py', environment=env)

# setup inference target config (Azure Container Instance)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service_name = 'bank-marketing'

# trigger deployment (can also be controlled with the Azure CLI in addition to the Python SDK, for integration with external CI/CD pipelines)
service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[best_model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-01-19 11:07:16+00:00 Creating Container Registry if not exists.
2022-01-19 11:07:16+00:00 Registering the environment.
2022-01-19 11:07:18+00:00 Use the existing image.
2022-01-19 11:07:19+00:00 Submitting deployment to compute.
2022-01-19 11:07:22+00:00 Checking the status of deployment bank-marketing..
2022-01-19 11:11:04+00:00 Checking the status of inference endpoint bank-marketing.
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [0]:
# get service run log for debugging docker image build
service.get_logs()

## Inference Unit Test

In [0]:
import json
import random

# fetch 30 consecutive rows, starting at a random offset, from the feature DF as our test dataset
start_row = random.randint(10,1000)
end_row = start_row + 30
test_df = df.loc[:, df.columns != 'y'][start_row:end_row]

# RESTful call to the ACI-hosted model
input_payload = json.dumps({
    'data': test_df.values.tolist()
})

output = service.run(input_payload)

print(f'Testing with row {start_row} ~ {end_row}')
print(output)

Testing with row 854 ~ 884
['no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no', 'no']
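
Clients outside the SDK would call the same endpoint over plain HTTP. The sketch below only builds the POST request, without sending it, so the payload shape can be inspected offline. The URI is a placeholder, the helper `build_scoring_request` is hypothetical, and the bearer-token header only applies when key-based authentication is enabled on the service.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows, api_key=None):
    """Build (but do not send) a POST request for a scoring endpoint."""
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when key-based authentication is enabled on the service.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


# Placeholder URI; a deployed service exposes the real one as `service.scoring_uri`.
request = build_scoring_request(
    "http://localhost:5001/score", [[30, "unemployed", "married"]]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To send the request, pass it to `urllib.request.urlopen(request)` and read the response body.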


# End

In [0]:
service.delete()

&copy;2022 Microsoft   
Originally developed by Jixin Jia (Gin) for customer workshop