### Load your Libraries
All the libraries listed below are required to run this notebook.  

If you require a GPU to train your model (for example, you are training a deep learning model), use DEFAULT_GPU_IMAGE instead of DEFAULT_CPU_IMAGE.  

You can also modify the Hyperparameter run with some of the optional functions following this documentation: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters

In [1]:
# Import Python Libraries
import json
import logging
import numpy as np
import os
import pandas as pd
import pytz

# Load Azure libraries
import azureml.core
from azureml.core import Datastore, Dataset, Environment, Experiment, Model, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.runconfig import CondaDependencies, DEFAULT_CPU_IMAGE, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PipelineEndpoint, PipelineParameter, PipelineRun
from azureml.pipeline.core import PublishedPipeline, StepSequence, TrainingOutput
from azureml.pipeline.steps import PythonScriptStep, HyperDriveStep
from azureml.train.hyperdrive import HyperDriveRun, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import BayesianParameterSampling, uniform, choice
from azureml.widgets import RunDetails

# Modify this workbook with some of the optional Azure libraries below
from azureml.core.runconfig import DEFAULT_GPU_IMAGE
from azureml.train.hyperdrive import normal, GridParameterSampling, RandomParameterSampling
from azureml.train.hyperdrive import BanditPolicy, MedianStoppingPolicy, TruncationSelectionPolicy

### Connect your Workspace
When using an Azure Notebook, you must first connect it to your Azure Machine Learning Service to access objects within the Workspace.  

Use the code below and follow the instructions to sign in.  

Issues may also arise if you are using a different version of the Azure ML SDK.  If you encounter errors, <b>install the version this notebook was created with</b>.

In [2]:
# Check which version of the AzureML SDK you are using
print("You are currently using version " + azureml.core.VERSION + " of the Azure ML SDK")
print("This notebook was made using version 1.31.0 of the Azure ML SDK")

You are currently using version 1.31.0 of the Azure ML SDK
This notebook was made using version 1.31.0 of the Azure ML SDK


In [3]:
# Connect your Jupyter Notebook Server to your AMLS Workspace
ws = Workspace.from_config()

### Set your Remote Compute Target

When you submit this run, it will run on a cluster of virtual machines.  Specify the cluster below.

In [4]:
# Specify the Compute Cluster for running your Pipeline jobs remotely
computeName = 'automl-cluster'  # CHANGE THIS
computeTarget = ComputeTarget(ws, computeName) 

### Create an Environment which contains all the libraries needed for your scripts
When you submit this run, it will create a docker container using all of the packages you list in this object.

If a package is available through both conda and pip, <b>use the conda version</b>, as conda automatically reconciles package discrepancies.

In [5]:
# To find out which packages are available in Conda, uncomment and run the code below
#%conda list

In [6]:
# Give your environment a name
environment = Environment(name="XGBoostTrainingEnv") # CHANGE HERE
condaDep = CondaDependencies()

# Add conda packages
# CHANGE HERE TO MATCH SCRIPT
condaDep.add_conda_package("scikit-learn==0.22.1")
condaDep.add_conda_package("numpy==1.16.2")
condaDep.add_conda_package("matplotlib==3.2.1")
condaDep.add_conda_package("joblib==0.14.1")
condaDep.add_conda_package("xgboost==0.90")
condaDep.add_conda_package("seaborn==0.9.0")
condaDep.add_conda_package("pandas==0.23.4")
condaDep.add_conda_package("scipy==1.3.1")

# Add pip packages
# CHANGE HERE TO MATCH SCRIPT
condaDep.add_pip_package("azureml-defaults==1.31.0")
condaDep.add_pip_package("azureml-interpret==1.31.0")
condaDep.add_pip_package("azureml-explain-model==1.31.0")
condaDep.add_pip_package("pyarrow==1.0.1")
condaDep.add_pip_package("pytz==2021.1")
condaDep.add_pip_package("interpret-core==0.1.21")
condaDep.add_pip_package("lightgbm==2.3.0")

# Add the dependencies to the Python section of the environment
environment.python.conda_dependencies=condaDep

# Register the environment to your workspace
trainingEnvironment = environment.register(workspace=ws)

In [7]:
# Create a Run Configuration object to dockerize your environment
runConfig = RunConfiguration()
runConfig.docker.use_docker = True
runConfig.environment = environment
runConfig.environment.docker.base_image = DEFAULT_CPU_IMAGE 

### Create Dataset Registration, Training, Model Registration, and Metrics Output Scripts for your Pipeline
When you run this pipeline, it will run a series of .py scripts.  Specify the folder name and file names of your scripts here.

In [8]:
# Create a folder on your Jupyter Notebook server to store your .py files.
projectFolder = './XGB_Pipeline_Scripts/Training'
os.makedirs(projectFolder, exist_ok=True)

# Assign a file name for your .py file which will perform unit tests.
unitTestingFileName = 'XGB_Hyperdrive_Unit_Testing.py'

# Assign a file name for your .py file which will contain your shared functions.
sharedFunctionsFileName = 'XGB_Hyperdrive_Shared_Functions.py'

# Assign a file name for your .py file which will register your datasets.
datasetRegistrationFileName = 'XGB_Hyperdrive_Dataset_Registration.py'

# Assign a file name for your .py file which will execute your training script.
trainingFileName = 'XGB_Hyperdrive_Training.py'

# Assign a file name for your .py file which will register your model.
modelRegistrationFileName = 'XGB_Hyperdrive_Model_Registration.py'

# Assign a file name for your .py file which will output your Hyperdrive Metrics.
metricsOutputFileName = 'XGB_Hyperdrive_Metrics_Output.py'

# Create file path strings
sharedFunctionsFilePath = os.path.join(projectFolder, sharedFunctionsFileName)
unitTestingFilePath = os.path.join(projectFolder, unitTestingFileName)
datasetRegistrationFilePath = os.path.join(projectFolder, datasetRegistrationFileName)
trainingFilePath = os.path.join(projectFolder, trainingFileName)
modelRegistrationFilePath = os.path.join(projectFolder, modelRegistrationFileName)
metricsOutputFilePath = os.path.join(projectFolder, metricsOutputFileName)

### Create your Shared Functions Script
This script holds functions which you use across multiple .py files in the pipeline.

In [9]:
%%writefile $sharedFunctionsFilePath
# Load in libraries
import json
import numpy as np
import os
import pandas as pd
from itertools import zip_longest

# Creates a Python Dictionary object out of key-value pairs
def create_dict(keys, values):
    return dict(zip_longest(keys, values[:len(keys)]))

# Sets tags for Azure resources
def set_tags(tagNameList, tagValueList):
    return create_dict(tagNameList, tagValueList)

# Writes CSV files back to a storage account
def write_to_datastore(dataframe, workspace, datastore, folder, file, indexBoolean):
    os.makedirs(folder, exist_ok=True) 
    filePath = os.path.join(folder, file)
    dataframe.to_csv(filePath, index = indexBoolean)
    print('Data Written as CSV and saved to ' + filePath)
    
    # Upload to Datastore
    files = [filePath]
    datastore.upload_files(files=files, target_path=folder, overwrite = True)
    
# Split data into X (features) and Y (target) columns for machine learning
def split_x_y (dataframe, scoring_column):
    X = dataframe.drop(scoring_column, axis=1)
    Y = dataframe[scoring_column]
    return X, Y
    
# Make predictions using a classification model
def make_classification_predictions(model, dataframe, X, Y):
    dataframe.loc[:, 'Predictions'] = model.predict(X)
    for i in range(0, len(Y.unique())):
        dataframe.loc[:, 'Probability_' + str(Y.unique()[i])] = model.predict_proba(X)[:,i]
    return dataframe

# Saves local explanations as columns to a dataframe
def save_local_explanations(explainerModel, dataframe, features):
    localExplanation = explainerModel.explain_local(features)
    localImportanceNames = localExplanation.get_ranked_local_names()
    localImportanceValues = localExplanation.get_ranked_local_values()
    dataframe['ExplanationColumns'] = localImportanceNames[0]
    dataframe['ExplanationValues'] = localImportanceValues[0]
    # Build one JSON string per row; this avoids chained assignment on the dataframe
    dataframe['Explanations'] = [json.dumps(dict(zip(names, values)))
                                 for names, values in zip(dataframe['ExplanationColumns'],
                                                          dataframe['ExplanationValues'])]
    dataframe = dataframe.drop(['ExplanationColumns','ExplanationValues'], axis=1)
    return dataframe

# Saves global explanations as a dataframe of feature importances
def save_global_explanations(explainerModel, features):
    globalExplanation = explainerModel.explain_global(features)
    global_names = globalExplanation.get_ranked_global_names()
    global_values = globalExplanation.get_ranked_global_values()
    global_zipped = list(zip(global_names, global_values))
    dataframe = pd.DataFrame(global_zipped, columns=['Columns','Importance'])
    return dataframe

Overwriting ./XGB_Pipeline_Scripts/Training/XGB_Hyperdrive_Shared_Functions.py
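
As a quick illustration of how `create_dict` pairs keys with values, the sketch below copies the function from the script above and calls it on made-up inputs (the tag names and values are hypothetical). Because the value list is truncated to the number of keys before `zip_longest` is applied, surplus values are dropped and keys without a value map to `None`.

```python
from itertools import zip_longest

# Same definition as in the shared functions script
def create_dict(keys, values):
    return dict(zip_longest(keys, values[:len(keys)]))

# Equal-length inputs pair up one-to-one
print(create_dict(['Project', 'Owner'], ['Demo', 'Data Team']))

# Extra values are dropped because the value list is truncated to len(keys)
print(create_dict(['Project'], ['Demo', 'Unused']))

# Missing values are filled with None by zip_longest
print(create_dict(['Project', 'Owner'], ['Demo']))
```

This matters for `set_tags`: a tag name supplied without a matching value ends up with a `None` value rather than raising an error.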


### Create your Unit Testing Script
Use this script to perform unit tests for your functions.  If any fail, the whole pipeline will fail.

In [10]:
%%writefile $unitTestingFilePath
# Load in libraries
import unittest

# Load in functions from your other .py files to unit test
from XGB_Hyperdrive_Shared_Functions import create_dict, set_tags

class TestFunctions(unittest.TestCase):
      
    def setUp(self):
        pass
  
    # Returns True if the output is a dictionary and matches the correct format
    def test_create_dict(self):
        self.assertEqual(create_dict(['Key1', 'Key2'], [123, 456]), {'Key1': 123, 'Key2': 456})
  
    # Passes if set_tags returns a dictionary pairing each tag name with its value
    def test_set_tags(self):
        self.assertEqual(set_tags(['Project', 'Message'], ['Test', 'Testing']),
                         {'Project': 'Test', 'Message': 'Testing'})
        
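if __name__ == '__main__':
    result = unittest.main(argv=['first-arg-is-ignored'], exit=False).result
    print('Script Finished')
    # Exit with a non-zero code when a test fails so the pipeline step fails too
    if not result.wasSuccessful():
        raise SystemExit(1)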

Overwriting ./XGB_Pipeline_Scripts/Training/XGB_Hyperdrive_Unit_Testing.py


### Create your Dataset Registration Script
This script registers data on your datastore as datasets.  It will register two datasets, one for training data and the other for validation data. 

It expects data to be located in your datastore in the following format, "projectFolder/trainingInputFolder/2021-07-22/yourFile"

Feel free to name the folders and files whatever you wish, but the date folder must be <b>today's date</b>.
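
To make the expected layout concrete, the sketch below builds the dated path in the same way as the registration script: it formats a date as `YYYY-MM-DD` and joins it to the folder and file name. The helper name `dated_file_path` and the file name `train.csv` are placeholders for illustration, and the sketch uses the standard library's `timezone.utc` where the script uses a `pytz` time zone.

```python
import datetime as dt
import posixpath

def dated_file_path(folder, file_name, now):
    """Return folder/YYYY-MM-DD/file_name for the given timezone-aware datetime."""
    date_folder = now.strftime('%Y-%m-%d')
    # Forward slashes match the path format shown in the example above
    return posixpath.join(folder, date_folder, file_name)

# Path the script would look for today (UTC is used here for illustration)
today = dt.datetime.now(dt.timezone.utc)
print(dated_file_path('projectFolder/trainingInputFolder', 'train.csv', today))

# A fixed date reproduces the example path format
fixed = dt.datetime(2021, 7, 22, tzinfo=dt.timezone.utc)
print(dated_file_path('projectFolder/trainingInputFolder', 'yourFile', fixed))
```

Because the date is computed in the time zone you pass to the script, a run shortly after midnight may look for a different folder than you expect if the time zone is set incorrectly.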

In [11]:
%%writefile $datasetRegistrationFilePath
# Load in libraries
import argparse
import datetime as dt
import numpy as np
import os
import pandas as pd
import pytz
from itertools import zip_longest

# Load in Azure libraries
from azureml.core import Dataset, Datastore, Run, Workspace

# Load in functions from shared functions file
from XGB_Hyperdrive_Shared_Functions import create_dict, set_tags, write_to_datastore

# Define script-specific functions
def register_dataset(workspace, datastore, folder, file, datasetName, description, tags):
    filePath = os.path.join(folder, file)
    datastorePath = [(datastore, filePath)]
    dataset = Dataset.Tabular.from_delimited_files(datastorePath)
    dataset.register(workspace = workspace, 
                     create_new_version = True,
                     name = datasetName,
                     description = description,
                     tags = tags)
    
def init():
    # Set Arguments
    global args
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dataset_name', type=str,
                            help='Name of Training Dataset')
    parser.add_argument('--val_dataset_name', type=str,
                            help='Name of Validation Dataset')
    parser.add_argument('--datastore_path', type=str,
                            help='Location of file or files on Datastore')
    parser.add_argument('--datastore_name', type=str,
                            help='Name of Datastore')
    parser.add_argument('--train_file_name', type=str,
                            help='Name of training data file on Datastore')
    parser.add_argument('--val_file_name', type=str,
                            help='Name of validation data file on Datastore')
    parser.add_argument('--project_name', type=str,
                            help='Name of project')
    parser.add_argument('--project_description', type=str,
                            help='Description of project')
    parser.add_argument('--pytz_time_zone', type=str,
                            help='Time Zone associated with your data')
    args = parser.parse_args()

    # Set the Run context for logging
    global run
    run = Run.get_context()

def main():
    # Connect to your AMLS Workspace and set your Datastore
    ws = run.experiment.workspace
    datastoreName = args.datastore_name
    datastore = Datastore.get(ws, datastoreName)
    print('Datastore Set')
    
    # Set your Time Zone
    timeZone = pytz.timezone(args.pytz_time_zone)
    timeLocal = dt.datetime.now(timeZone).strftime('%Y-%m-%d')
    print('Time Zone Set')

    # Specify your File Names
    trainFile = timeLocal + '/' + args.train_file_name
    valFile = timeLocal + '/' + args.val_file_name
    print('File Names Set for Training and Validation Data.')
    
    # Set Tags and Description
    description = args.project_description
    trainTags = set_tags(['Project', 'Dataset Type', 'Date Created'],\
                         [args.project_name, 'Training', timeLocal])
    valTags = set_tags(['Project', 'Dataset Type', 'Date Created'],\
                       [args.project_name, 'Validation', timeLocal])
    print("Dataset Tags and Description Assigned")
    
    # Register your Training data as an Azure Tabular Dataset
    register_dataset(ws, datastore, args.datastore_path, trainFile, args.train_dataset_name, description, trainTags)
    print('Training Data Registered')
    
    # Register your Validation data as an Azure Tabular Dataset
    register_dataset(ws, datastore, args.datastore_path, valFile, args.val_dataset_name, description, valTags)
    print('Validation Data Registered')

if __name__ == '__main__':
    init()
    print('Script Initialized')
    main()
    print('Script Finished')

Overwriting ./XGB_Pipeline_Scripts/Training/XGB_Hyperdrive_Dataset_Registration.py


### Create your XGBoost Model Training Script
This script will train your model and tune its hyperparameters.  

It will output models, logs, charts, and metrics for all runs, along with predictions and explanations for your validation dataset.

Feel free to add in <b>custom metrics</b> or charts into the scoring and charting functions.
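
As one example of a custom metric, the sketch below computes specificity (the true negative rate) for binary labels in plain Python. The function name `specificity_score` is hypothetical and is not part of scikit-learn; inside the scoring functions, its result could be logged with `run.log` in the same way as the built-in metrics.

```python
def specificity_score(y_true, y_pred, negative_label=0):
    """Fraction of actual negatives that were predicted as negative."""
    true_negatives = 0
    actual_negatives = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == negative_label:
            actual_negatives += 1
            if predicted == negative_label:
                true_negatives += 1
    # Avoid dividing by zero when there are no negative examples
    if actual_negatives == 0:
        return 0.0
    return true_negatives / actual_negatives

# Made-up labels: 4 actual negatives, 3 of them predicted correctly
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(specificity_score(y_true, y_pred))
```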

In [12]:
%%writefile $trainingFilePath
# Load in Libraries
import argparse
import joblib
import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from interpret.ext.blackbox import MimicExplainer
from interpret.ext.glassbox import LGBMExplainableModel
from itertools import zip_longest
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, auc, precision_recall_curve
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve
from sklearn.model_selection import ShuffleSplit, cross_validate
os.environ['KMP_DUPLICATE_LIB_OK']='True'
from xgboost import XGBClassifier

# Load in Azure libraries
from azureml.core import Run, Dataset, Workspace, Experiment
from azureml.interpret import ExplanationClient

# Load in functions from shared functions file
from XGB_Hyperdrive_Shared_Functions import create_dict, save_local_explanations, make_classification_predictions
from XGB_Hyperdrive_Shared_Functions import split_x_y, save_global_explanations

# Define script-specific functions
def score_log_classification_training_data(model, features, target_column, cv_splits, bootstrap_sample_number):
    metrics_cv = cross_validate(model, features, target_column,\
                                scoring=["accuracy", "balanced_accuracy", "precision", "recall", "f1",], cv=cv_splits)
    
    # Get average of each metric across cross validation splits
    accuracy = np.mean(metrics_cv['test_accuracy'])
    balanced_accuracy = np.mean(metrics_cv['test_balanced_accuracy'])
    precision = np.mean(metrics_cv['test_precision'])
    recall = np.mean(metrics_cv['test_recall'])
    F1 = np.mean(metrics_cv['test_f1'])
    
    # Calculate Confidence Intervals for each of the metrics via bootstrapping cross-validated means
    resampled_mean_accuracy = []
    resampled_mean_balanced_accuracy = []
    resampled_mean_precision = []
    resampled_mean_recall = []
    resampled_mean_F1 = []
    metricsDF = pd.DataFrame(metrics_cv)
    
    for i in range(0, bootstrap_sample_number):
        resample = metricsDF.sample(frac=1, replace=True)
        mean_accuracy = np.mean(resample['test_accuracy'])
        mean_balanced_accuracy = np.mean(resample['test_balanced_accuracy'])
        mean_precision = np.mean(resample['test_precision'])
        mean_recall = np.mean(resample['test_recall'])
        mean_F1 = np.mean(resample['test_f1'])
        resampled_mean_accuracy.append(mean_accuracy)
        resampled_mean_balanced_accuracy.append(mean_balanced_accuracy)
        resampled_mean_precision.append(mean_precision)
        resampled_mean_recall.append(mean_recall)
        resampled_mean_F1.append(mean_F1)
    resampled_mean_accuracy.sort()
    resampled_mean_balanced_accuracy.sort()
    resampled_mean_precision.sort()
    resampled_mean_recall.sort()
    resampled_mean_F1.sort()
    lower_bound_index = int(np.floor(bootstrap_sample_number*(1-args.confidence_level)/2))
    upper_bound_index = int(np.floor(bootstrap_sample_number*(1+args.confidence_level)/2))
    
    print("Scoring Done for Training Data")
    
    # Log training metrics
    run.log('Accuracy Training', np.float(accuracy))
    run.log('Balanced Accuracy Training', np.float(balanced_accuracy))
    run.log('Recall Training', np.float(recall))
    run.log('Precision Training', np.float(precision))
    run.log('F1 Training', np.float(F1))
    run.log('Accuracy Training Lower CI', np.float(resampled_mean_accuracy[lower_bound_index]))
    run.log('Balanced Accuracy Training Lower CI', np.float(resampled_mean_balanced_accuracy[lower_bound_index]))
    run.log('Recall Training Lower CI', np.float(resampled_mean_recall[lower_bound_index]))
    run.log('Precision Training Lower CI', np.float(resampled_mean_precision[lower_bound_index]))
    run.log('F1 Training Lower CI', np.float(resampled_mean_F1[lower_bound_index]))
    run.log('Accuracy Training Upper CI', np.float(resampled_mean_accuracy[upper_bound_index]))
    run.log('Balanced Accuracy Training Upper CI', np.float(resampled_mean_balanced_accuracy[upper_bound_index]))
    run.log('Recall Training Upper CI', np.float(resampled_mean_recall[upper_bound_index]))
    run.log('Precision Training Upper CI', np.float(resampled_mean_precision[upper_bound_index]))
    run.log('F1 Training Upper CI', np.float(resampled_mean_F1[upper_bound_index]))
    run.log_list('Accuracy for all CV Splits', metrics_cv['test_accuracy'])
    run.log_list('Balanced Accuracy for all CV Splits', metrics_cv['test_balanced_accuracy'])
    run.log_list('Precision for all CV Splits', metrics_cv['test_precision'])
    run.log_list('Recall for all CV Splits', metrics_cv['test_recall'])
    run.log_list('F1 for all CV Splits', metrics_cv['test_f1'])
    return print("Metrics Logged for Training Data")
    
def score_log_classification_validation_data(classificationModel, features, target_column):
    val_binary_predictions = classificationModel.predict(features)
    val_accuracy = accuracy_score(target_column, val_binary_predictions)
    val_balanced_accuracy = balanced_accuracy_score(target_column, val_binary_predictions)
    val_precision = precision_score(target_column, val_binary_predictions)
    val_recall = recall_score(target_column, val_binary_predictions)
    val_F1 = f1_score(target_column, val_binary_predictions)
    print("Scoring Done for Validation Data")
    
    run.log('Accuracy Validation', np.float(val_accuracy))
    run.log('Balanced Accuracy Validation', np.float(val_balanced_accuracy))
    run.log('Recall Validation', np.float(val_recall))
    run.log('Precision Validation', np.float(val_precision))
    run.log('F1 Validation', np.float(val_F1))
    return print("Metrics Logged for Validation Data")

def log_classification_charts(dataType, classificationModel, features, targetColumn):
    # Get predictions and actuals to make Confusion Matrix and Precision Recall Curve
    binary_predictions = classificationModel.predict(features)
    probability_predictions = classificationModel.predict_proba(features)[:,1]
    print("Predictions Made for " + dataType + " Data Charts")
    
    # Log a Confusion Matrix
    data = pd.DataFrame(dict(s1 = targetColumn, s2 = binary_predictions)).reset_index()
    confusion_matrix = pd.crosstab(data['s1'], data['s2'], rownames=['Actual'], colnames=['Predicted'])
    fig = plt.figure()
    sns.heatmap(confusion_matrix, annot=True,cmap='Blues', fmt='g')
    plt.title("Confusion Matrix " + dataType)
    plt.close(fig)
    run.log_image(name='confusion-matrix-' + dataType.lower(), plot=fig)
    print("Confusion Matrix Logged for " + dataType + " Data")

    # Log a Precision / Recall Curve
    lr_precision, lr_recall, _ = precision_recall_curve(targetColumn, probability_predictions)
    lr_f1, lr_auc = f1_score(targetColumn, binary_predictions), auc(lr_recall, lr_precision)
    no_skill = len(targetColumn[targetColumn==1]) / len(targetColumn)
    fig2 = plt.figure()
    plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
    plt.plot(lr_recall, lr_precision, marker='.', label='Model')
    plt.title('Precision Recall Curve ' + dataType)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend()
    plt.close(fig2)
    run.log_image(name='precision-recall-curve-' + dataType.lower(), plot=fig2)
    print("Precision Recall Curve Logged for " + dataType + " Data")
    
    # Log a Receiver Operating Characteristic (ROC) Curve
    fpr, tpr, thresholds = roc_curve(targetColumn, probability_predictions) 
    fig3 = plt.figure()
    plt.plot(fpr, tpr, color='lightblue', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve ' + dataType)
    plt.legend()
    plt.close(fig3)
    run.log_image(name='roc-curve-' + dataType.lower(), plot=fig3)
    print("ROC Curve Logged for " + dataType + " Data")
    
def init():
    # Set Arguments.  These should be all of the hyperparameters you will tune.
    global args
    parser = argparse.ArgumentParser()
    # Hyperparameters
    parser.add_argument('--eta', type=float, default=0.1,
                        help='Learning Rate')
    parser.add_argument('--learning_rate', type=float, default=0.1,
                        help='Learning Rate')
    parser.add_argument('--scale_pos_weight', type=float, default=0.6,
                        help='Helps with Unbalanced Classes.  Should be Sum(Negative)/Sum(Positive)')
    parser.add_argument('--booster', type=str, default='gbtree',
                        help='The type of Boosting Algorithm')
    parser.add_argument('--min_child_weight', type=float, default=1,
                        help='Controls Overfitting')
    parser.add_argument('--max_depth', type=int, default=6,
                        help='Controls Overfitting')
    parser.add_argument('--gamma', type=float, default=0,
                        help='Make Algorithm Conservative')
    parser.add_argument('--subsample', type=float, default=1,
                        help='Controls Overfitting')
    parser.add_argument('--colsample_bytree', type=float, default=1,
                        help='Defines Sampling')
    parser.add_argument('--reg_lambda', type=float, default=1,
                        help='Controls Overfitting')
    parser.add_argument('--alpha', type=float, default=0,
                        help='Reduces Dimensionality')
    parser.add_argument('--objective', type=str, default='binary:logistic',
                        help='Defines Training Objective Metric')
    # Other Parameters
    parser.add_argument('--train_dataset_name', type=str,
                        help='Name of Training Dataset')
    parser.add_argument('--val_dataset_name', type=str,
                        help='Name of Validation Dataset')
    parser.add_argument('--target_column_name', type=str,
                        help='Name of variable to score')
    parser.add_argument('--k_folds', type=int, default = 10,
                        help='Number of folds to split your data into for cross validation')
    parser.add_argument('--shuffle_split_size', type=float,
                        help='Percentage of data to hold out for testing during cross validation')
    parser.add_argument('--confidence_level', type=float, default = 0.95,
                        help='Level of confidence to set for your confidence interval (for example, 0.95)')
    args = parser.parse_args()
    print(args)

    # Set the Run context for logging
    global run
    run = Run.get_context()
    
    # Log your hyperparameters
    run.log('eta',np.float(args.eta))
    run.log('learning_rate',np.float(args.learning_rate))
    run.log('scale_pos_weight',np.float(args.scale_pos_weight))
    run.log('booster',np.str(args.booster))
    run.log('min_child_weight',np.float(args.min_child_weight))
    run.log('max_depth',np.float(args.max_depth))
    run.log('gamma',np.float(args.gamma))
    run.log('subsample',np.float(args.subsample))
    run.log('colsample_bytree',np.float(args.colsample_bytree))
    run.log('reg_lambda',np.float(args.reg_lambda))
    run.log('alpha',np.float(args.alpha))
    run.log('objective',np.str(args.objective))

# Write your main function.  This will train and log your model.
def main():
    # Connect to your AMLS Workspace and retrieve your data
    ws = run.experiment.workspace
    training_dataset_name  = args.train_dataset_name
    train_dataset  = Dataset.get_by_name(ws, training_dataset_name, version='latest')
    val_dataset_name  = args.val_dataset_name
    val_dataset  = Dataset.get_by_name(ws, val_dataset_name, version='latest')
    print('Datasets Retrieved')
    
    # Transform your data to Pandas
    trainTab =  train_dataset
    trainDF = trainTab.to_pandas_dataframe()
    valTab =  val_dataset
    valDF = valTab.to_pandas_dataframe()
    print('Datasets Converted to Pandas')
    
    # Split out X and Y variables for both training and validation data
    X, Y = split_x_y(trainDF, args.target_column_name)
    val_X, val_Y = split_x_y(valDF, args.target_column_name)
    print("Data Ready for Scoring")
 
    # Set your model and hyperparameters
    hyperparameters = dict(eta=args.eta,\
                           learning_rate=args.learning_rate,\
                           scale_pos_weight=args.scale_pos_weight,\
                           booster = args.booster,\
                           min_child_weight = args.min_child_weight,\
                           max_depth = args.max_depth,\
                           gamma = args.gamma,\
                           subsample = args.subsample,\
                           colsample_bytree = args.colsample_bytree,\
                           reg_lambda = args.reg_lambda,\
                           alpha = args.alpha,\
                           objective = args.objective)
    
    model = XGBClassifier(**hyperparameters)
    print('Hyperparameters Set')
    
    # Fit your model
    xgbModel = model.fit(X,Y)
    print("Model Fit")
    
    # Score your training data with cross validation and log metrics
    ss = ShuffleSplit(n_splits=args.k_folds, test_size=args.shuffle_split_size, random_state = 33)
    bootstrap_sample_number = args.k_folds*100
    score_log_classification_training_data(model, X, Y, ss, bootstrap_sample_number)
    
    # Log a Confusion Matrix and Precision Recall Curve for your training data
    log_classification_charts("Training", xgbModel, X, Y)
    
    # Score your validation data and log metrics
    score_log_classification_validation_data(xgbModel, val_X, val_Y)
    print("Scoring Done for Validation Data")

    # Log a Confusion Matrix and Precision Recall Curve for your validation data
    log_classification_charts("Validation", xgbModel, val_X, val_Y)
    
    # Model Explanations
    client = ExplanationClient.from_run(run)
    explainer = MimicExplainer(xgbModel, 
                               X, 
                               LGBMExplainableModel,
                               classes = list(val_Y.unique()),
                               features = val_X.columns,
                               shap_values_output = 'probability',
                               model_task = 'classification')
    global_explanation = explainer.explain_global(X)
    print(global_explanation)
    client.upload_model_explanation(global_explanation, top_k=30)
    print("Global Explanations Created")
    
    # Save local Explanations in json format to a column in the Validation Set
    valDF = save_local_explanations(explainer, valDF, val_X)
    print("Explanations Saved to Validation Data")
    
    # Save Global Explanations as a pandas dataframe
    globalExplanations = save_global_explanations(explainer, val_X)
    print("Global Explanations Saved as Pandas Dataframe")
    
    # Make a folder in which to save your output
    os.makedirs('outputs', exist_ok=True)
    
    # Save your Model
    joblib.dump(xgbModel, 'outputs/XGBmodel.pkl')
    print("Model Saved")
    
    # Save your Explainer Model
    joblib.dump(explainer, 'outputs/LGBMexplainer.pkl')
    print("Explainer Model Saved")
    
    # Save your Validation Set Predictions
    valDF = make_classification_predictions(xgbModel, valDF, val_X, val_Y)
    valCSV = valDF.to_csv('outputs/validationPredictions.csv', index=False)
    print('Validation Predictions written to CSV file in logs')
    
    # Save your Global Explanations
    globalExplanationsCSV = globalExplanations.to_csv('outputs/globalExplanations.csv', index=False)
    print('Global Explanations written to CSV file in logs')

if __name__ == '__main__':
    init()
    print('Script Initialized')
    main()
    print('Script Finished')

Overwriting ./XGB_Pipeline_Scripts/Training/XGB_Hyperdrive_Training.py
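
The confidence intervals logged by the training script come from a percentile bootstrap over the per-split cross-validation scores. The sketch below shows the same idea on its own, using only the standard library; the function name `bootstrap_mean_ci`, the example scores, and the fixed seed are illustrative assumptions. Unlike the script, it clamps the upper index so that a confidence level of 1.0 cannot index past the end of the list.

```python
import math
import random
import statistics

def bootstrap_mean_ci(scores, n_resamples=1000, confidence_level=0.95, seed=33):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample the scores with replacement, keeping the same sample size
        resample = rng.choices(scores, k=len(scores))
        resampled_means.append(statistics.mean(resample))
    resampled_means.sort()
    lower_index = int(math.floor(n_resamples * (1 - confidence_level) / 2))
    upper_index = int(math.floor(n_resamples * (1 + confidence_level) / 2))
    # Clamp so the upper index always stays inside the list
    upper_index = min(upper_index, n_resamples - 1)
    return resampled_means[lower_index], resampled_means[upper_index]

# Hypothetical accuracy scores from ten cross-validation splits
cv_accuracy = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
lower, upper = bootstrap_mean_ci(cv_accuracy)
print("95% CI for mean accuracy: [{:.3f}, {:.3f}]".format(lower, upper))
```

With only ten splits the interval is based on few distinct values, so treat it as a rough indication of spread rather than a precise bound.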


### Set Pipeline Data to pass Best Model to Model Registration Step
Pipeline data will be used to pass in combined metrics for all Hyperdrive runs, along with the model and explainer for the <b>highest performing model</b>.

In [13]:
# Set your Default Datastore
defaultDatastore = ws.get_default_datastore()

# Hyperdrive Metrics
metricsOutputName = 'metrics_output'
metricsData = PipelineData(name = 'metrics_data',
                           datastore = defaultDatastore,
                           pipeline_output_name = metricsOutputName,
                           training_output = TrainingOutput("Metrics"))

# Hyperdrive Best Model
modelOutputName = 'model_output'
savedModel = PipelineData(name = 'saved_model',
                          datastore = defaultDatastore,
                          pipeline_output_name = modelOutputName,
                          training_output = TrainingOutput("Model", model_file="outputs/XGBmodel.pkl"))

# Hyperdrive Best Model Explanations
explainerOutputName = 'explainer_output'
explainerModel = PipelineData(name = 'explainer_model',
                              datastore = defaultDatastore,
                              pipeline_output_name = explainerOutputName,
                              training_output = TrainingOutput("Model", model_file="outputs/LGBMexplainer.pkl"))

### Create your Model Registration Script
This script will register your model.  

It will only do so after comparing your model's performance to older versions, if any exist, and after comparing your training results to your validation results.  

It will also save out predictions and explanations for your validation data to your datastore <b>using the best model</b>.
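
The two checks described above can be sketched in plain Python. This is a minimal illustration, not the registration script itself: `should_register` and its arguments are hypothetical names, the scores are made-up numbers, and the sketch assumes that a lower score wins when the metric goal is `MINIMIZE`.

```python
def should_register(new_score, lower_bound, upper_bound, metric_goal, old_score=None):
    """Return True when the new model should be registered.

    Hypothetical helper that mirrors the two checks described above.
    """
    maximize = metric_goal == 'MAXIMIZE'

    # Check 1: the validation score must not fall on the wrong side of the
    # cross-validation confidence interval (a possible sign of overfitting).
    if maximize and new_score < lower_bound:
        raise ValueError('Validation score is below the training confidence interval.')
    if not maximize and new_score > upper_bound:
        raise ValueError('Validation score is above the training confidence interval.')

    # Check 2: with no previous model, always register.
    if old_score is None:
        return True

    # Otherwise the new model must beat the previous one.
    return new_score > old_score if maximize else new_score < old_score


# No previous model registered
print(should_register(0.91, 0.88, 0.94, 'MAXIMIZE'))
# Previous model scored higher on the same validation data
print(should_register(0.91, 0.88, 0.94, 'MAXIMIZE', old_score=0.93))
```

Raising an error for the first check, rather than returning `False`, makes the pipeline step fail visibly when training and validation results disagree.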

In [14]:
%%writefile $modelRegistrationFilePath
# Load in libraries
import argparse
import datetime as dt
import joblib
import json
import os
import pandas as pd
import pytz
import scipy.stats as st
import shutil
from interpret.ext.blackbox import MimicExplainer
from interpret.ext.glassbox import LGBMExplainableModel
from itertools import zip_longest
from lightgbm import LGBMClassifier
from shutil import copy2
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Load in Azure libraries
from azureml.core import Dataset, Datastore, Experiment, Model, Run, Workspace

# Load in functions from shared functions file
from XGB_Hyperdrive_Shared_Functions import create_dict, set_tags, save_local_explanations, write_to_datastore
from XGB_Hyperdrive_Shared_Functions import make_classification_predictions, split_x_y, save_global_explanations

# Define script-specific functions
def register_model(workspace, modelName, modelPath, trainDataset, valDataset, description, tags):
    # Keep the returned Model object so its name and version can be reported
    model = Model.register(workspace = workspace, model_name = modelName, model_path = modelPath,
                           description = description, tags = tags,
                           datasets=[('Training', trainDataset),('Validation', valDataset)])
    print("Registered version {0} of model {1}".format(model.version, model.name))

def load_model_from_hd(modelFolder, savedModel):
    copy2(savedModel, modelFolder)
    modelPath = modelFolder + 'saved_model'
    model = joblib.load(modelPath)
    return model

def load_explainer_model_from_hd(modelFolder, explainerModel):
    copy2(explainerModel, modelFolder)
    modelPath = modelFolder + 'explainer_model'
    model = joblib.load(modelPath)
    return model

def load_model(modelName):
    modelPath = Model.get_model_path(modelName)
    model = joblib.load(modelPath)
    return model

def score_model(model, features, targetColumn, scoringMethod):
    predictions = model.predict(features)
    score = scoringMethod(targetColumn, predictions)
    return score

def set_scoring_method(scoringMethod):
    # Precision, recall, and F1 scorers are imported here so every branch resolves to a real function
    from sklearn.metrics import f1_score, precision_score, recall_score
    if scoringMethod == 'Accuracy Training':
        return accuracy_score
    elif scoringMethod == 'Balanced Accuracy Training':
        return balanced_accuracy_score
    elif scoringMethod == 'Precision Training':
        return precision_score
    elif scoringMethod == 'Recall Training':
        return recall_score
    elif scoringMethod == 'F1 Training':
        return f1_score
    else:
        print('Add your scoring metric to the set_scoring_method function')
        raise Exception('Scoring Metric not found in set_scoring_method function.  Add your metric to the function.')

def init():
    # Set Arguments.
    global args
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dataset_name', type=str,
                            help='Name of Training Dataset')
    parser.add_argument('--val_dataset_name', type=str,
                            help='Name of Validation Dataset')
    parser.add_argument('--datastore_name', type=str,
                            help='Name of Datastore')
    parser.add_argument('--project_name', type=str,
                            help='Name of project')
    parser.add_argument('--project_description', type=str,
                            help='Description of project')
    parser.add_argument('--pytz_time_zone', type=str,
                            help='Time Zone associated with your data')
    parser.add_argument('--target_column_name', type=str,
                            help='Name of variable to score')
    parser.add_argument('--k_folds', type=int, default = 10,
                            help='Number of folds to split your data into for cross validation')
    parser.add_argument('--confidence_level', type=float, default = 0.95,
                            help='Level of confidence to set for your confidence interval (for example, 0.95)')
    parser.add_argument('--model_name', type=str,
                            help='Name of model to register')
    parser.add_argument('--output_path', type=str,
                            help='Location to store output on Datastore')
    parser.add_argument('--scoring_metric', type=str,
                            help='Metric with which you scored your Hyperdrive Run')
    parser.add_argument('--metric_goal', type=str, default = 'MAXIMIZE',
                            help='Whether the scoring metric should be minimized or maximized')
    parser.add_argument('--saved_model', type=str, 
                            help='path to saved model file')
    parser.add_argument('--explainer_model', type=str, 
                            help='path to saved explanation file')
    parser.add_argument('--metrics_data', type=str,
                            help='Location of Hyperdrive Run Metrics File')
    args = parser.parse_args()

    # Set the Run context for logging
    global run
    run = Run.get_context()

def main():
    # Set scoring metric
    scoringMethod = set_scoring_method(args.scoring_metric)
    print ('Scoring Metric Set')
    
    # Retrieve your Metrics Data file
    with open(args.metrics_data) as metrics:
        metricsData = json.load(metrics)
    print('Metrics File Downloaded')
    
    # Turn the Metrics JSON file into a Pandas Dataframe, then transpose and sort it so the best run comes first.
    metrics = pd.DataFrame(metricsData)
    metricsTransposed = metrics.transpose().sort_values(by=args.scoring_metric,
                                                        ascending=(args.metric_goal == 'MINIMIZE'))
    print('Metrics Dataframe Created')
    
    # Connect to your AMLS Workspace and set your Datastore
    ws = run.experiment.workspace
    datastore = Datastore.get(ws, args.datastore_name)
    print('Datastore Set')
    
    # Retrieve your dataset
    trainDataset = Dataset.get_by_name(ws, args.train_dataset_name)
    valDataset = Dataset.get_by_name(ws, args.val_dataset_name)
    print('Datasets Retrieved')
    
    # Transform your data into Pandas dataframes
    trainDF = trainDataset.to_pandas_dataframe()
    valDF = valDataset.to_pandas_dataframe()
    print('Datasets Converted to Pandas')
    
    # Split out X and Y variables
    val_X, val_Y = split_x_y(valDF, args.target_column_name)
    print("Validation Data split into Feature and Target Columns")
    
    # Load your training model
    modelFolder = 'model/'
    os.makedirs(modelFolder, exist_ok=True)
    newModel = load_model_from_hd(modelFolder, args.saved_model)
    print("Training Model Loaded")
    
    # Load your explainer model
    explainer = load_explainer_model_from_hd(modelFolder, args.explainer_model)
    print("Explainer Model Loaded")
    
    # Save explanations to your validation data
    valDF = save_local_explanations(explainer, valDF, val_X)
    print("Explanations Saved to Validation Data")
    
    # Save Global Explanations as a pandas dataframe
    globalExplanations = save_global_explanations(explainer, val_X)
    print("Global Explanations Saved as Pandas Dataframe")
    
    # Add predictions to your validation data
    valDF = make_classification_predictions(newModel, valDF, val_X, val_Y)
    print('Predictions Added to Validation Data')
    
    # Set your Time Zone 
    timeZone = pytz.timezone(args.pytz_time_zone)
    timeLocal = dt.datetime.now(timeZone).strftime('%Y-%m-%d')
    print('Time Zone Set')
    
    # Make Output Directory
    datastorePath = args.output_path + '/' + timeLocal
    os.makedirs(datastorePath, exist_ok=True) 
    print('Output Directory Created')
    
    # Upload Validation Data with Predictions to Datastore
    write_to_datastore(valDF, ws, datastore, datastorePath, "validationPredictions.csv", False)
    print('Predictions with Explanations for Validation Data Loaded to Datastore')
    
    # Save your Global Explanations
    write_to_datastore(globalExplanations, ws, datastore, datastorePath, "globalExplanations.csv", False)
    print('Global Explanations Loaded to Datastore')
    
    # Calculate main scoring metric for the validation dataset
    newModelScore = score_model(newModel, val_X, val_Y, scoringMethod)
    print('Predictions Made for Validation Data')
    print('Validation Set ' + args.scoring_metric + ' is ' + str(round(newModelScore, 2)))
    
    # Retrieve confidence interval for the cross validated scoring metric
    lowerBoundColumn = args.scoring_metric + ' Lower CI'
    upperBoundColumn = args.scoring_metric + ' Upper CI'
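    # Use positional access: the rows are indexed by run ID, and the best run is sorted first
    lowerBound = metricsTransposed[lowerBoundColumn].iloc[0]
    upperBound = metricsTransposed[upperBoundColumn].iloc[0]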
    print('Model ' + str(args.scoring_metric) + ' is ' + str(args.confidence_level*100) +\
          '% likely to actually fall between ' + str(lowerBound) + ' and ' + str(upperBound))
    
    # Compare confidence interval of cross validation training metric with validation metric
    if ((args.metric_goal == 'MAXIMIZE' and newModelScore < lowerBound) or
            (args.metric_goal == 'MINIMIZE' and newModelScore > upperBound)):
        print('Model performance on training data is significantly different from performance on validation data.')
        raise Exception('Model performs differently on training and validation data.  '
                        'Please check to see if your model is overfitting.  Validation Set '
                        + args.scoring_metric + ' is ' + str(round(newModelScore, 2)) + '.  '
                        + 'Model ' + str(args.scoring_metric) + ' is ' + str(args.confidence_level*100)
                        + '% likely to actually fall between ' + str(lowerBound) + ' and ' + str(upperBound))
    else:
        print('Model is performing as expected on validation data.')
    
    # Check to see if previous model exists and compare it to new model
    modelDictionary = ws.models
    if args.model_name in modelDictionary.keys():
        print('Models Being Compared')
        oldModel = load_model(args.model_name)
        oldModelScore = score_model(oldModel, val_X, val_Y, scoringMethod)
        print('Previous Model Loaded and is being Compared to New Model')
        # A lower score is better only when the metric goal is MINIMIZE
        if args.metric_goal == 'MINIMIZE':
            newModelIsBetter = newModelScore < oldModelScore
        else:
            newModelIsBetter = newModelScore > oldModelScore
        registerFlag = 1 if newModelIsBetter else 0
    else:
        registerFlag = 1
        print('No Previous Models Found')
    
    # Set your model tags and description
    description = args.project_description
    explainTags = set_tags(['Algorithm', 'Project', 'Model Type', 'Explainer Type'],\
                            ['XGB', args.project_name, 'Explainer', 'Mimic'])
    trainTags = set_tags(['Algorithm', 'Project', 'Model Type'], ['XGB', args.project_name, 'Classification'])
    print("Model Tags and Description Assigned")
    
    # Register your new model and explainer model
    if registerFlag == 1:
        modelPath = modelFolder + 'saved_model'
        register_model(ws, args.model_name, modelPath, trainDataset, valDataset, description, trainTags)
        modelNameExplainer = args.model_name + '-Explainer'
        explainerPath = modelFolder + 'explainer_model'
        register_model(ws, modelNameExplainer, explainerPath, trainDataset, valDataset, description, explainTags)
    else:
        print('Old model outperforms new model and new model will not be registered.')
        
    # Remove files from compute cluster
    shutil.rmtree(datastorePath)

    # Remove model pickle file from compute cluster
    shutil.rmtree(modelFolder)

if __name__ == '__main__':
    init()
    print('Script Initialized')
    main()
    print('Script Finished')

Overwriting ./XGB_Pipeline_Scripts/Training/XGB_Hyperdrive_Model_Registration.py


### Create your Metrics Output Step
This script writes the metrics for all of your Hyperdrive runs to files on your datastore. 
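
As a rough illustration of what the script does with the metrics file, the sketch below ranks runs by a scoring metric using only the standard library. It assumes the file maps each run ID to a dictionary of metric names and values. The run IDs and numbers are made up, and `rank_runs` is a hypothetical helper rather than part of the script.

```python
import json

# Made-up stand-in for the contents of the Hyperdrive metrics file
metrics_json = json.dumps({
    'run_1': {'Balanced Accuracy Training': 0.81, 'Accuracy Training': 0.90},
    'run_2': {'Balanced Accuracy Training': 0.86, 'Accuracy Training': 0.88},
    'run_3': {'Balanced Accuracy Training': 0.79, 'Accuracy Training': 0.91},
})


def rank_runs(metrics_data, scoring_metric, descending=True):
    """Return (run_id, metrics) pairs sorted by the chosen metric."""
    return sorted(metrics_data.items(),
                  key=lambda item: item[1][scoring_metric],
                  reverse=descending)


ranked = rank_runs(json.loads(metrics_json), 'Balanced Accuracy Training')
for run_id, run_metrics in ranked:
    print(run_id, run_metrics['Balanced Accuracy Training'])
```

The script itself does the equivalent with a transposed pandas dataframe, so each row is one run and each column is one metric.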

In [15]:
%%writefile $metricsOutputFilePath
# Load in libraries
import argparse
import datetime as dt
import json
import os
import pandas as pd
import pytz
import shutil
from itertools import zip_longest
from shutil import copy2

# Load in Azure libraries
from azureml.core import Dataset, Datastore, Experiment, Model, Run, Workspace

# Load in functions from shared functions file
from XGB_Hyperdrive_Shared_Functions import write_to_datastore

def init():
    # Set Arguments.
    global args
    parser = argparse.ArgumentParser()
    parser.add_argument('--datastore_name', type=str,
                            help='Name of Datastore')
    parser.add_argument('--pytz_time_zone', type=str,
                            help='Time Zone associated with your data')
    parser.add_argument('--output_path', type=str,
                            help='Location to store output on Datastore')
    parser.add_argument('--scoring_metric', type=str,
                            help='Metric with which you scored your Hyperdrive Run')
    parser.add_argument('--metrics_data', type=str,
                            help='Location of Hyperdrive Run Metrics File')
    args = parser.parse_args()

    # Set the Run context for logging
    global run
    run = Run.get_context()

def main():   
    # Retrieve your Metrics Data file
    with open(args.metrics_data) as metrics:
        metricsData = json.load(metrics)
    print('Metrics File Downloaded')
    
    # Turn the Metrics JSON file into two pandas dataframes
    metrics = pd.DataFrame(metricsData)
    metricsTransposed = metrics.transpose().sort_values(by=args.scoring_metric, ascending=False)
    print('Metrics Dataframes Created')
    
    # Connect to your AMLS Workspace and set your Datastore
    ws = run.experiment.workspace
    datastore = Datastore.get(ws, args.datastore_name)
    print('Datastore Set')
    
    # Set your Time Zone 
    timeZone = pytz.timezone(args.pytz_time_zone)
    timeLocal = dt.datetime.now(timeZone).strftime('%Y-%m-%d')
    print('Time Zone Set')
    
    # Make Output Directory
    outputFolder = args.output_path + '/' + timeLocal
    os.makedirs(outputFolder, exist_ok=True) 
    print('Output Directory Created')
    
    # Upload csv files to Datastore
    write_to_datastore(metrics, ws, datastore, outputFolder, 'Metrics.csv', True)
    write_to_datastore(metricsTransposed, ws, datastore, outputFolder, 'MetricsTransposed.csv', True)
    print('Hyperdrive Metrics Data Loaded to Datastore')
    
    # Remove files from compute cluster
    shutil.rmtree(outputFolder)

if __name__ == '__main__':
    init()
    print('Script Initialized')
    main()
    print('Script Finished')

Overwriting ./XGB_Pipeline_Scripts/Training/XGB_Hyperdrive_Metrics_Output.py


### Set your Pipeline Parameters
These are all the parameters you can use to easily adapt this code to other projects.

In [16]:
# Dataset Registration Step Parameters
train_dataset_name_param = PipelineParameter(name="TrainDatasetName", default_value='None')
val_dataset_name_param = PipelineParameter(name="ValDatasetName", default_value='None')
datastore_name_param = PipelineParameter(name="DatastoreName", default_value='None')
datastore_path_param = PipelineParameter(name="DatastorePath", default_value='None')
train_file_name_param = PipelineParameter(name="TrainFileName", default_value='None')
val_file_name_param = PipelineParameter(name="ValFileName", default_value='None')
project_name_param = PipelineParameter(name="ProjectName", default_value='None')
project_description_param = PipelineParameter(name="ProjectDescription", default_value='None')
pytz_time_zone_param = PipelineParameter(name='PytzTimeZone', default_value='UTC')

# Hyperdrive Step Parameters
target_column_param = PipelineParameter(name="TargetColumn", default_value='None')
k_folds_param = PipelineParameter(name="KFolds", default_value=10)
shuffle_split_size_param = PipelineParameter(name="ShuffleSplitSize", default_value=0.1)
confidence_level_param = PipelineParameter(name="ConfidenceLevel", default_value = 0.95)

# Model Registration Step Parameters
model_name_param = PipelineParameter(name="ModelName", default_value='None')
output_path_param = PipelineParameter(name="OutputPath", default_value='None')
scoring_metric_param = PipelineParameter(name="ScoringMetric", default_value='None')
metric_goal_param = PipelineParameter(name="MetricGoal", default_value='MAXIMIZE')

### Configure your Unit Testing Step
Configure your unit testing step by specifying the folder and file names, the docker container run configuration, and the remote compute target.

In [17]:
unitTestingStep = PythonScriptStep(
    name = "unit-testing-step",
    source_directory = projectFolder,
    script_name = unitTestingFileName,
    arguments=[],
    compute_target=computeTarget,
    runconfig=runConfig,
    allow_reuse=False)

### Configure your Dataset Registration Step
Configure your dataset registration step by specifying the folder and file names, the docker container run configuration, the remote compute target, and parameter arguments.

In [18]:
datasetRegistrationStep = PythonScriptStep(
    name = "dataset-registration-step",
    source_directory = projectFolder,
    script_name = datasetRegistrationFileName,
    arguments=['--train_dataset_name', train_dataset_name_param,
               '--val_dataset_name', val_dataset_name_param,
               '--datastore_name', datastore_name_param,
               '--datastore_path', datastore_path_param,
               '--train_file_name', train_file_name_param,
               '--val_file_name', val_file_name_param,
               '--project_name', project_name_param,
               '--project_description', project_description_param,
               '--pytz_time_zone', pytz_time_zone_param],
    compute_target=computeTarget,
    runconfig=runConfig,
    allow_reuse=False)

### Configure your Hyperdrive Step
Configure your Hyperdrive step by specifying the folder and file names, the run environment, the remote compute target, and parameter arguments.  

Then, specify which <b>hyperparameters</b> you'd like to tune and the values that should be tested.

Next, set the scoring metric and whether that metric should be minimized or maximized, along with the desired number of runs to tune your model.

Finally, configure the step to output the best model, the best model explainer, and hyperdrive metrics data.

In [19]:
# Set your script run configuration
scriptRunConfig = ScriptRunConfig(source_directory = projectFolder,
                  script = trainingFileName,
                  compute_target = computeTarget,
                  environment = environment,
                  arguments = ['--train_dataset_name', train_dataset_name_param,
                               '--val_dataset_name', val_dataset_name_param,
                               '--target_column_name', target_column_param,
                               '--k_folds', k_folds_param,
                               '--shuffle_split_size', shuffle_split_size_param,
                               '--confidence_level', confidence_level_param])

In [20]:
hyperParams = BayesianParameterSampling({
                        '--eta': uniform(0.01, 0.5),
                        '--learning_rate': uniform(0.01,0.5),
                        '--min_child_weight': uniform(1,100),
                        '--max_depth': choice(range(3,11)),
                        '--gamma': uniform(0,10),
                        '--subsample': uniform(0.5,1),
                        '--colsample_bytree': uniform(0.5,1),
                        '--reg_lambda': uniform(0,10),
                        '--alpha': uniform(0,10),
                        '--scale_pos_weight': uniform(0,10),
                        })

In [21]:
# Set your Hyperdrive configurations 
scoringMetric = 'Balanced Accuracy Training'
metricGoal = PrimaryMetricGoal.MAXIMIZE
metricGoalString = metricGoal.name  # 'MAXIMIZE' or 'MINIMIZE'
hyperdriveConfig = HyperDriveConfig(run_config = scriptRunConfig,
                                     hyperparameter_sampling = hyperParams,
                                     primary_metric_name = scoringMetric,
                                     primary_metric_goal = metricGoal, # MAXIMIZE OR MINIMIZE
                                     max_total_runs = 20,      # Bayesian sampling works best with >= 20 runs per hyperparameter (200 here)
                                     max_concurrent_runs = 20)  # fewer concurrent runs let Bayesian sampling learn from completed runs

For best results with Bayesian Sampling we recommend using a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned. Recommendend value:200.
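
The recommendation above is simple arithmetic. A minimal sketch, where `recommended_total_runs` is a hypothetical helper and the list holds the ten hyperparameters sampled in this notebook:

```python
# Names of the hyperparameters sampled in this notebook
hyperparameter_names = ['--eta', '--learning_rate', '--min_child_weight', '--max_depth',
                        '--gamma', '--subsample', '--colsample_bytree', '--reg_lambda',
                        '--alpha', '--scale_pos_weight']


def recommended_total_runs(names, runs_per_hyperparameter=20):
    """Apply the 20-runs-per-hyperparameter rule of thumb quoted above."""
    return runs_per_hyperparameter * len(names)


print(recommended_total_runs(hyperparameter_names))  # 200
```

You could pass this value as `max_total_runs`, at the cost of a much longer tuning run.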


In [22]:
# Configure your Hyperdrive Step
hyperdriveTrainingStep = HyperDriveStep(
    name = 'xgb-model-training-step-with-hyperparameter-tuning',
    hyperdrive_config = hyperdriveConfig,
    inputs = [],
    outputs = [metricsData, savedModel, explainerModel],
    allow_reuse = False)

### Configure your Model Registration Step
Configure your model registration step by specifying the folder and file names, the docker container run configuration, the remote compute target, and parameter arguments.

Also, take in the best model, best model explanation, and hyperdrive metrics data as input into this step.

In [23]:
modelRegistrationStep = PythonScriptStep(
    name = "model-registration-step",
    source_directory = projectFolder,
    script_name = modelRegistrationFileName,
    inputs = [savedModel, explainerModel, metricsData],
    arguments = ['--train_dataset_name', train_dataset_name_param,
                 '--val_dataset_name', val_dataset_name_param,
                 '--datastore_name', datastore_name_param,
                 '--project_name', project_name_param,
                 '--project_description', project_description_param,
                 '--pytz_time_zone', pytz_time_zone_param,
                 '--target_column_name', target_column_param,
                 '--k_folds', k_folds_param,
                 '--confidence_level', confidence_level_param,
                 '--model_name', model_name_param,
                 '--output_path', output_path_param,
                 '--scoring_metric', scoring_metric_param,
                 '--metric_goal', metric_goal_param,
                 '--saved_model', savedModel,
                 '--explainer_model', explainerModel,
                 '--metrics_data', metricsData],
    compute_target = computeTarget,
    runconfig = runConfig,
    allow_reuse = False)

### Configure your Hyperdrive Run Metrics Output Step
Configure your metrics output step by specifying the folder and file names, the docker container run configuration, the remote compute target, and parameter arguments.

Also, take in the hyperdrive metrics data as input into this step.

In [24]:
metricsOutputStep = PythonScriptStep(
    name = "metrics-output-step",
    source_directory = projectFolder,
    script_name = metricsOutputFileName,
    inputs = [metricsData],
    arguments = ['--datastore_name', datastore_name_param,
                 '--pytz_time_zone', pytz_time_zone_param,
                 '--output_path', output_path_param,
                 '--scoring_metric', scoring_metric_param,
                 '--metrics_data', metricsData],
    compute_target = computeTarget,
    runconfig = runConfig,
    allow_reuse = False
)

### Run your Five-Step Pipeline
Specify the order in which to run your steps.  Then, pass in your parameters and <b>submit</b> your pipeline.
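
To reason about the ordering before you submit, here is a plain-Python model of it. It assumes that a nested list inside a step sequence means those steps run side by side once the previous step finishes. It uses step names as strings, does not call the Azure ML SDK, and `expand_sequence` is a hypothetical helper.

```python
def expand_sequence(sequence):
    """Map each step name to the step names it waits for."""
    dependencies = {}
    previous = []
    for entry in sequence:
        # A nested list is treated as a group of steps that share the same predecessors
        group = entry if isinstance(entry, list) else [entry]
        for step in group:
            dependencies[step] = list(previous)
        previous = group
    return dependencies


order = expand_sequence(['unit-testing-step',
                         'dataset-registration-step',
                         'xgb-model-training-step-with-hyperparameter-tuning',
                         ['model-registration-step', 'metrics-output-step']])
for step, waits_for in order.items():
    print(step, '<-', waits_for)
```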

In [25]:
# Create your pipeline
parallelSteps = [modelRegistrationStep, metricsOutputStep]
stepSequence = StepSequence(steps = [unitTestingStep, datasetRegistrationStep, hyperdriveTrainingStep, parallelSteps])
pipeline = Pipeline(workspace = ws, steps = stepSequence)

#### Parameter Explanation
<p><b>TrainDatasetName:</b> Name of your registered training dataset.  This can be anything you would like.</p>
<p><b>ValDatasetName:</b> Name of your registered validation dataset.  This can be anything you would like.</p>
<p><b>DatastoreName:</b> Name of your datastore.  This should be the datastore that holds your input data.</p>
<p><b>DatastorePath:</b> Root folder path which holds your data up to today's date.</p>
<p><b>TrainFileName:</b> Name of your training file located in your datastore.</p>
<p><b>ValFileName:</b> Name of your validation file located in your datastore.</p>
<p><b>ProjectName:</b> Name of your project.  This can be anything you would like.</p>
<p><b>ProjectDescription:</b> Description of your project.  This can be anything you would like.</p>
<p><b>PytzTimeZone:</b> Your timezone or the timezone in which the data is loaded.</p>
<p><b>TargetColumn:</b> Name of your target column for machine learning.</p>
<p><b>KFolds:</b> Number of times to split your data for cross validation.</p>
<p><b>ShuffleSplitSize:</b> Percentage of data to split for cross validation.</p>
<p><b>ConfidenceLevel:</b> Percentage used to create your confidence interval to compare validation and training results.</p>
<p><b>ModelName:</b> Name of your registered model.  This can be anything you like, as long as it follows the naming convention.</p>
<p><b>OutputPath:</b> Root folder path to output your results on your datastore.</p>
<p><b>ScoringMetric:</b> Metric you wish to maximize or minimize as part of hyperparameter tuning.  Set in the Hyperdrive pipeline step section.</p>
<p><b>MetricGoal:</b> Whether you should minimize or maximize your Hyperparameter Metric.  Set in the Hyperdrive pipeline step section.</p>
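
A quick check like the one below can catch a missing or misspelled parameter name before you submit the pipeline. It is a minimal sketch: `check_pipeline_parameters` is a hypothetical helper, and the expected names are copied from the pipeline parameters defined earlier in this notebook.

```python
# Parameter names defined in the "Set your Pipeline Parameters" cell
EXPECTED_PARAMETERS = {
    'TrainDatasetName', 'ValDatasetName', 'DatastoreName', 'DatastorePath',
    'TrainFileName', 'ValFileName', 'ProjectName', 'ProjectDescription',
    'PytzTimeZone', 'TargetColumn', 'KFolds', 'ShuffleSplitSize',
    'ConfidenceLevel', 'ModelName', 'OutputPath', 'ScoringMetric', 'MetricGoal',
}


def check_pipeline_parameters(parameters):
    """Return (missing, unexpected) parameter names as sorted lists."""
    supplied = set(parameters)
    missing = sorted(EXPECTED_PARAMETERS - supplied)
    unexpected = sorted(supplied - EXPECTED_PARAMETERS)
    return missing, unexpected


# A deliberately incomplete example with one misspelled key
missing, unexpected = check_pipeline_parameters({'TrainDatasetName': 'XGB Training Data',
                                                 'KFold': 10})
print('Missing:', missing)
print('Unexpected:', unexpected)
```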

In [26]:
# To get a list of Pytz Time Zones, uncomment and run the code below
#pytz.all_timezones

In [27]:
# Run your pipeline
pipelineName = 'XGB_Model_Training'
pipeline_run = Experiment(ws, pipelineName).submit(pipeline,pipeline_parameters=
                                                           {'TrainDatasetName': 'XGB Training Data',
                                                           'ValDatasetName': 'XGB Validation Data',
                                                           'DatastoreName': 'ds_dev',
                                                           'DatastorePath': 'XGB/XGB_Training_Input',
                                                           'TrainFileName': 'xgbTrainingData.csv',
                                                           'ValFileName': 'xgbValidationData.csv',
                                                           'ProjectName': 'XGB Test',
                                                           'ProjectDescription': 'XGB Test Run',
                                                           'PytzTimeZone': 'US/Eastern',
                                                           'TargetColumn': 'LeftCompany',
                                                           'KFolds': 10,
                                                           'ShuffleSplitSize': 0.1,
                                                           'ConfidenceLevel': 0.95,
                                                           'ModelName': 'Tuned-XGB-Model',
                                                           'OutputPath': 'XGB/XGB_Training_Output',
                                                           'ScoringMetric': scoringMetric,
                                                           'MetricGoal': metricGoalString}, 
                                                           show_output=True)

Created step unit-testing-step [84fad3e0][be6de64f-a337-4e5c-879b-bfaf3e6d9583], (This step will run and generate new outputs)
Created step dataset-registration-step [4dbe6b00][1aea2a40-4db2-4491-bc38-968112cfc047], (This step will run and generate new outputs)
Created step xgb-model-training-step-with-hyperparameter-tuning [cc2535e6][fa79cfb8-3a6e-476e-86d4-43232d1c17ba], (This step will run and generate new outputs)
Created step model-registration-step [5c64d6b9][1bec9d14-830f-41c0-baf5-0319c8d09cab], (This step will run and generate new outputs)
Created step metrics-output-step [0c0ab5be][d24f6134-fd59-422e-bad7-49264a890da3], (This step will run and generate new outputs)
Submitted PipelineRun 83aec045-2a62-4d8a-bf04-6ee38a921cde
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/83aec045-2a62-4d8a-bf04-6ee38a921cde?wsid=/subscriptions/47a7ec0c-37ad-428b-9114-b87ea1057632/resourcegroups/ml-teaching/workspaces/ml-teaching-workspace&tid=72f988bf-86f1-41af-91ab-2d7cd011db

In [28]:
# GUI to see your Pipeline Run
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion(show_output=True)

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

PipelineRunId: 83aec045-2a62-4d8a-bf04-6ee38a921cde
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/83aec045-2a62-4d8a-bf04-6ee38a921cde?wsid=/subscriptions/47a7ec0c-37ad-428b-9114-b87ea1057632/resourcegroups/ml-teaching/workspaces/ml-teaching-workspace&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
PipelineRun Status: Running


StepRunId: fd3eb2bb-e2a5-4897-ab7a-811a3dc4b5df
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/fd3eb2bb-e2a5-4897-ab7a-811a3dc4b5df?wsid=/subscriptions/47a7ec0c-37ad-428b-9114-b87ea1057632/resourcegroups/ml-teaching/workspaces/ml-teaching-workspace&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
StepRun( unit-testing-step ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_631863cdde1aa496170c871ac55bd751278663a436ac21a39c72a53629bacad1_d.txt
2021-07-22T21:55:26Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/ml-teaching-workspace/azureml/fd3eb2bb-e2a5-4897-ab7a-811a3dc4b5df/mounts/w

'Finished'

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### Publish your Pipeline
First, if you shut down your notebook, use the first cell to retrieve your pipeline run.

Second, publish your pipeline. 

Third, assign your published pipeline to a permanent endpoint.  

You now have an endpoint you can easily schedule either in AMLS or through <b>Azure Data Factory</b>.
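
A published pipeline can also be triggered with a REST call. The sketch below only builds the request headers and body and does not send anything. It assumes the request format used for published Azure ML pipelines (`ExperimentName` plus `ParameterAssignments`); `build_trigger_request` is a hypothetical helper and the token is a placeholder you would replace with a real Azure Active Directory token.

```python
import json


def build_trigger_request(aad_token, experiment_name, parameters):
    """Return the headers and JSON body for a published pipeline REST call."""
    headers = {'Authorization': 'Bearer ' + aad_token,
               'Content-Type': 'application/json'}
    body = {'ExperimentName': experiment_name,
            'ParameterAssignments': parameters}
    return headers, json.dumps(body)


headers, body = build_trigger_request('<your-token>',
                                      'XGB_Model_Training',
                                      {'KFolds': 10, 'ConfidenceLevel': 0.95})
print(body)
```

Any pipeline parameter left out of `ParameterAssignments` falls back to the default value set when the parameter was defined.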

In [29]:
# Retrieve a previously run pipeline if necessary by uncommenting and running the code below
#experiment_name = 'XGB_Model_Training'
#experiment = Experiment(ws, experiment_name)
#pipeline_run = PipelineRun(experiment, 'your-pipeline-run-id')

In [30]:
# Publish your Pipeline
published_pipeline = pipeline_run.publish_pipeline(
    name="XGB_Model_Training",
    description="XGB Model Training Pipeline for ADF Use",
    version="1.0")

published_pipeline

Name,Id,Status,Endpoint
XGB_Model_Training,ff43c875-d9ee-407c-83f2-b5235e3b5d2d,Active,REST Endpoint


In [31]:
# Attach your Published Pipeline to a Permanent Endpoint
pipelineEndpointName = "XGB Training Pipeline Endpoint"

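# Match on the exact endpoint name rather than a substring of the printed list
if any(endpoint.name == pipelineEndpointName for endpoint in PipelineEndpoint.list(ws)):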
    # Add a new Version to an existing Endpoint
    pipeline_endpoint = PipelineEndpoint.get(workspace = ws, name = pipelineEndpointName)
    pipeline_endpoint.add_default(published_pipeline)
else:
    # Create a new Endpoint
    pipeline_endpoint = PipelineEndpoint.publish(workspace = ws,
                                                name = pipelineEndpointName,
                                                pipeline = published_pipeline,
                                                description = "XGB Training Pipeline Endpoint")