### Load your Libraries
All the libraries listed below are required to run this notebook.  

If you require a GPU to train your model (for example, you are training a deep learning model), use DEFAULT_GPU_IMAGE instead of DEFAULT_CPU_IMAGE.  

You can also modify the hyperparameter tuning run with some of the optional functions described in this documentation: [hyperparameter tuning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters).

In [1]:
import os

root_folder = os.path.dirname(os.path.dirname(os.getcwd()))

if root_folder != '/home/brandon/projects/aml/tAMLplates':
    os.chdir('/home/brandon/projects/aml/tAMLplates')
else:
    os.chdir(root_folder)
reuse_prior_run = True

In [2]:
print(root_folder)

/home/brandon/projects/aml/e2eml


In [3]:
# Import Python Libraries
import json
import logging
import numpy as np
import os
import pandas as pd
import pytz

# Load Azure libraries
import azureml.core
from azureml.core import Datastore, Dataset, Environment, Experiment, Model, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.runconfig import CondaDependencies, DEFAULT_CPU_IMAGE, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PipelineEndpoint, PipelineParameter, PipelineRun
from azureml.pipeline.core import PublishedPipeline, StepSequence, TrainingOutput
from azureml.pipeline.steps import PythonScriptStep, HyperDriveStep
from azureml.train.hyperdrive import HyperDriveRun, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import BayesianParameterSampling, uniform, choice
from azureml.widgets import RunDetails

# Optional Azure libraries you can use to modify this notebook
from azureml.core.runconfig import DEFAULT_GPU_IMAGE
from azureml.train.hyperdrive import normal, GridParameterSampling, RandomParameterSampling
from azureml.train.hyperdrive import BanditPolicy, MedianStoppingPolicy, TruncationSelectionPolicy

# utility scripts and yaml files
from ml_service.util.env_variables import Env
from ml_service.util.attach_compute import get_compute

### Connect your Workspace
When using an Azure Notebook, you must first connect it to your Azure Machine Learning Service to access objects within the Workspace.  

Use the code below and follow the instructions to sign in.  

Issues may also arise if you use a different version of the Azure ML SDK.  If you encounter errors, <b>install the version this notebook was created with</b>.

In [None]:
# Check which version of the AzureML SDK you are using
print("You are currently using version " + azureml.core.VERSION + " of the Azure ML SDK")
print("This notebook was made using version 1.31.0 of the Azure ML SDK")
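
If the two versions differ, it can help to know whether your installed SDK is older or newer than the one the notebook was written against. The following is a minimal sketch using only the standard library. `parse_version` and `compare_sdk_versions` are hypothetical helpers, not part of the Azure ML SDK, and they assume plain dotted numeric version strings such as `1.31.0`.

```python
def parse_version(version):
    """Turn a dotted numeric version string such as '1.31.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def compare_sdk_versions(installed, expected):
    """Return 'match', 'older', or 'newer' for the installed version relative to the expected one."""
    installed_parts = parse_version(installed)
    expected_parts = parse_version(expected)
    if installed_parts == expected_parts:
        return "match"
    return "older" if installed_parts < expected_parts else "newer"


# Literal strings are used here; in the notebook you would pass azureml.core.VERSION
print(compare_sdk_versions("1.31.0", "1.31.0"))
print(compare_sdk_versions("1.28.0", "1.31.0"))
```

Comparing integer tuples rather than raw strings avoids ranking `1.9.0` above `1.31.0`.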

In [None]:
# Connect your Jupyter Notebook Server to your AMLS Workspace
e = Env()

#ws = Workspace.from_config()
ws = Workspace.get(
    name=e.workspace_name,
    subscription_id=os.getenv("MYSUBSCRIPTION"),
    resource_group=e.resource_group,
)
print("get_workspace:")
print(ws.name)
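
The cell above reads the subscription ID from the `MYSUBSCRIPTION` environment variable, and `os.getenv` returns `None` when that variable is missing. A small guard, sketched below, can surface the problem before the workspace call is made. `require_env` is a hypothetical helper introduced only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a descriptive error if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise EnvironmentError(
            f"Environment variable {name!r} is not set; export it before running this cell."
        )
    return value


# Usage in the notebook (left commented so the sketch runs without the variable):
# subscription_id = require_env("MYSUBSCRIPTION")
```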

### Set your Remote Compute Target

When you submit this run, it will run on a cluster of virtual machines.  Specify the cluster below.

In [None]:
computeTarget = get_compute(ws, e.compute_name, e.vm_size)
if computeTarget is not None:
    print("Using Azure Machine Learning compute:")
    print(computeTarget)

### Create an Environment which contains all the libraries needed for your scripts
When you submit this run, it will create a docker container using all of the packages you list in this object.

If a package is available through both conda and pip, <b>use the conda version</b>, as conda automatically reconciles package discrepancies.

In [None]:
# To find out which packages are available in Conda, uncomment and run the code below
#%conda list

In [None]:
# Give your environment a name
environment = Environment(name="XGBoostTrainingEnv") # CHANGE HERE
#condaDep = CondaDependencies()

# Add conda packages
# CHANGE HERE TO MATCH SCRIPT

condaDep = CondaDependencies.create(
    conda_packages=[
        "scikit-learn==0.22.1",
        "numpy==1.16.2",
        "matplotlib==3.2.1",
        "joblib==0.14.1",
        "xgboost==0.90",
        "seaborn==0.9.0",
        "pandas==0.23.4",
        "scipy==1.3.1",
        "pip"],
    pip_packages=[
        "azureml-defaults==1.31.0",
        "azureml-interpret==1.31.0",
        "azureml-explain-model==1.31.0",
        "pyarrow==1.0.1",
        "pytz==2021.1",
        "interpret-core==0.1.21",
        "lightgbm==2.3.0",
        "openpyxl"])

# Add the dependencies to the Python section of the environment
environment.python.conda_dependencies=condaDep

# Register the environment to your workspace
trainingEnvironment = environment.register(workspace=ws)
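
Most packages in the environment above are pinned to an exact version, but `pip` and `openpyxl` are not, so they may resolve to different versions whenever the image is rebuilt. The sketch below shows one way to list unpinned entries before registering the environment. `find_unpinned` is a hypothetical helper, and the lists are shortened copies of the ones in the cell above.

```python
def find_unpinned(packages):
    """Return the package specs that do not pin an exact version with '=='."""
    return [spec for spec in packages if "==" not in spec]


# Shortened copies of the package lists used in this notebook
conda_packages = ["scikit-learn==0.22.1", "numpy==1.16.2", "pip"]
pip_packages = ["azureml-defaults==1.31.0", "pyarrow==1.0.1", "openpyxl"]

print(find_unpinned(conda_packages))
print(find_unpinned(pip_packages))
```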

In [None]:
# Create a Run Configuration object to dockerize your environment
runConfig = RunConfiguration()
runConfig.docker.use_docker = True
runConfig.environment = environment
runConfig.environment.docker.base_image = DEFAULT_CPU_IMAGE 

### Create Dataset Registration, Training, Model Registration, and Metrics Output Scripts for your Pipeline
When you run this pipeline, it will run a series of .py scripts.  Specify the folder name and file names of your scripts here.

In [None]:
# Create a folder on your Jupyter Notebook server to store your .py files.
projectFolder = e.projectFolder
scriptFolder = e.scriptFolder
os.makedirs(projectFolder, exist_ok=True)

# Create file path strings
# sharedFunctionsFilePath = os.path.join(projectFolder, "training", e.sharedFunctionsFileName)
# unitTestingFilePath = os.path.join(projectFolder, "training", e.unitTestingFileName)
# datasetRegistrationFilePath = os.path.join(projectFolder, "training", e.datasetRegistrationFileName)
# trainingFilePath = os.path.join(projectFolder, "training", e.trainingFileName)
# modelRegistrationFilePath = os.path.join(projectFolder, "training", e.modelRegistrationFileName)
# metricsOutputFilePath = os.path.join(projectFolder, "training", e.metricsOutputFileName)

## Create or set datastore for saving data

In the following steps, you will use the datastore defined in the environment configuration file or, if none is defined, the default blob store that comes with Azure Machine Learning Service. You will then either work with the named dataset if it is already registered, or upload a file to the datastore and register it as a dataset.

In [None]:
if e.datastore_name:
    datastore_name = e.datastore_name
else:
    datastore_name = ws.get_default_datastore().name

runConfig.environment.environment_variables["DATASTORE_NAME"] = datastore_name
print(datastore_name)

In [None]:
dataset_name = e.dataset_name

if dataset_name not in ws.datasets:

    # Use a CSV to read in the data set.
    print(os.getcwd())
    path_to_local_folder = os.path.join("..","data")
    
    target_path = "XGB/XGB_Training_Input"
    file_name = "processed.cleveland.data.csv"

    path_and_file = os.path.join(path_to_local_folder, file_name)

    if not os.path.exists(path_and_file):
        raise Exception(
            'Could not find CSV dataset at "%s".'
            % path_and_file
        )  # NOQA: E501

    # Upload the file that was just checked to the selected datastore
    datastore = Datastore.get(ws, datastore_name)
    datastore.upload_files(
        files=[path_and_file],
        target_path=target_path,
        overwrite=True,
        show_progress=False,
    )

    # Register dataset
    path_on_datastore = os.path.join(target_path, os.path.basename(file_name))

    dataset = Dataset.Tabular.from_delimited_files(
        path=(datastore, path_on_datastore)
    )

    dataset = dataset.register(
        workspace=ws,
        name=dataset_name,
        description="heart disease training data",
        tags={"format": "CSV", "ml type": "multi-class classification"},
        create_new_version=True,
    )

### Set Pipeline Data to pass Best Model to Model Registration Step
Pipeline data is used to pass the combined metrics for all Hyperdrive runs, along with the model and explanation for the <b>highest performing model</b>.

In [None]:
# Get your datastore
datastore = Datastore.get(ws, datastore_name)

# Hyperdrive Metrics
metricsOutputName = 'metrics_output'
metricsData = PipelineData(name = 'metrics_data',
                           datastore = datastore,
                           pipeline_output_name = metricsOutputName,
                           training_output = TrainingOutput("Metrics"))

# Hyperdrive Best Model
modelOutputName = 'model_output'
savedModel = PipelineData(name = 'saved_model',
                          datastore = datastore,
                          pipeline_output_name = modelOutputName,
                          training_output = TrainingOutput("Model", model_file="outputs/XGBmodel.pkl"))

# Hyperdrive Best Model Explanations
explainerOutputName = 'explainer_output'
explainerModel = PipelineData(name = 'explainer_model',
                              datastore = datastore,
                              pipeline_output_name = explainerOutputName,
                              training_output = TrainingOutput("Model", model_file="outputs/LGBMexplainer.pkl"))

### Set your Pipeline Parameters
These are all the parameters you can use to easily adapt this code to other projects.

In [None]:
# Dataset Registration Step Parameters
train_dataset_name_param = PipelineParameter(name="TrainDatasetName", default_value='None')
val_dataset_name_param = PipelineParameter(name="ValDatasetName", default_value='None')
datastore_name_param = PipelineParameter(name="DatastoreName", default_value='None')
datastore_path_param = PipelineParameter(name="DatastorePath", default_value='None')
train_file_name_param = PipelineParameter(name="TrainFileName", default_value='None')
original_file_name_param = PipelineParameter(name="OriginalData", default_value='None')

val_file_name_param = PipelineParameter(name="ValFileName", default_value='None')
project_name_param = PipelineParameter(name="ProjectName", default_value='None')
project_description_param = PipelineParameter(name="ProjectDescription", default_value='None')
pytz_time_zone_param = PipelineParameter(name='PytzTimeZone', default_value='UTC')

# Hyperdrive Step Parameters
target_column_param = PipelineParameter(name="TargetColumn", default_value='None')
k_folds_param = PipelineParameter(name="KFolds", default_value=10)
shuffle_split_size_param = PipelineParameter(name="ShuffleSplitSize", default_value=0.1)
confidence_level_param = PipelineParameter(name="ConfidenceLevel", default_value = 0.95)

# Model Registration Step Parameters
model_name_param = PipelineParameter(name="ModelName", default_value='None')
output_path_param = PipelineParameter(name="OutputPath", default_value='None')
scoring_metric_param = PipelineParameter(name="ScoringMetric", default_value='None')
metric_goal_param = PipelineParameter(name="MetricGoal", default_value='MAXIMIZE')

In [None]:
splitData = PythonScriptStep(
    name = "split-data",
    source_directory = projectFolder,
    script_name = 'split/split_data.py',
    arguments=[
        "--folder_name", datastore_path_param,
        "--file_name", original_file_name_param,
        "--datastore_name", datastore_name_param,
        "--train_file_name", train_file_name_param,
        "--val_file_name", val_file_name_param,
        "--label_name", target_column_param,
        "--train_size", "0.80"],
    compute_target=computeTarget,
    runconfig=runConfig,
    allow_reuse=reuse_prior_run)
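
The `split/split_data.py` script itself is not shown in this notebook. As a rough sketch of how a script could receive the arguments passed by the step above, the example below parses the same flag names with `argparse`. The function name, the argument types, and the `workspaceblobstore` value are assumptions made for illustration, not the contents of the actual script.

```python
import argparse


def parse_split_args(argv=None):
    """Parse the command-line flags that the split-data step passes to its script."""
    parser = argparse.ArgumentParser(description="Split a data file into training and validation files")
    parser.add_argument("--folder_name", type=str, help="Folder on the datastore that holds the data")
    parser.add_argument("--file_name", type=str, help="Original, unsplit data file")
    parser.add_argument("--datastore_name", type=str, help="Datastore to read from and write to")
    parser.add_argument("--train_file_name", type=str, help="Output file for the training split")
    parser.add_argument("--val_file_name", type=str, help="Output file for the validation split")
    parser.add_argument("--label_name", type=str, help="Target column")
    parser.add_argument("--train_size", type=float, default=0.80, help="Fraction of rows used for training")
    return parser.parse_args(argv)


# Example values taken from the pipeline submission later in this notebook;
# the datastore name is a placeholder
args = parse_split_args([
    "--folder_name", "XGB/XGB_Training_Input",
    "--file_name", "processed.cleveland.data.csv",
    "--datastore_name", "workspaceblobstore",
    "--train_file_name", "xgbTrainingData.csv",
    "--val_file_name", "xgbValidationData.csv",
    "--label_name", "num",
    "--train_size", "0.80",
])
print(args.train_size, args.label_name)
```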

### Configure your Unit Testing Step
Configure your unit testing step by specifying the folder and file names, the Docker container run configuration, and the remote compute target.

In [None]:
unit_test_folder = os.path.join(projectFolder,"clusterScripts")
script_name = os.path.join("util",e.unitTestingFileName)
unitTestingStep = PythonScriptStep(
    name = "unit-testing-step",
    source_directory = unit_test_folder,
    script_name = script_name,
    arguments=[],
    compute_target=computeTarget,
    runconfig=runConfig,
    allow_reuse=reuse_prior_run)

### Configure your Dataset Registration Step
Configure your dataset registration step by specifying the folder and file names, the Docker container run configuration, the remote compute target, and parameter arguments.

In [None]:
register_folder = os.path.join(projectFolder,"clusterScripts")
script_name = os.path.join("register",e.datasetRegistrationFileName)
datasetRegistrationStep = PythonScriptStep(
    name = "dataset-registration-step",
    source_directory = register_folder,
    script_name = script_name,
    arguments=['--train_dataset_name', train_dataset_name_param,
               '--val_dataset_name', val_dataset_name_param,
               '--datastore_name', datastore_name_param,
               '--datastore_path', datastore_path_param,
               '--train_file_name', train_file_name_param,
               '--val_file_name', val_file_name_param,
               '--project_name', project_name_param,
               '--project_description', project_description_param,
               '--pytz_time_zone', pytz_time_zone_param],
    compute_target=computeTarget,
    runconfig=runConfig,
    allow_reuse=reuse_prior_run)

### Configure your Hyperdrive Step
Configure your Hyperdrive step by specifying the folder and file names, the run environment, the remote compute target, and parameter arguments.  

Then, specify which <b>hyperparameters</b> you'd like to tune and the values that should be tested.

Next, set the scoring metric and whether that metric should be minimized or maximized, along with the desired number of runs to tune your model.

Finally, configure the step to output the best model, the best model explainer, and hyperdrive metrics data.

In [None]:
# Set your script run configuration
training_folder = os.path.join(projectFolder, "clusterScripts")
script_name = os.path.join("training",e.trainingFileName)
scriptRunConfig = ScriptRunConfig(source_directory = training_folder,
                  script = script_name,
                  compute_target = computeTarget,
                  environment = environment,
                  arguments = ['--train_dataset_name', train_dataset_name_param,
                               '--val_dataset_name', val_dataset_name_param,
                               '--target_column_name', target_column_param,
                               '--k_folds', k_folds_param,
                               '--shuffle_split_size', shuffle_split_size_param,
                               '--confidence_level', confidence_level_param])

In [None]:
hyperParams = BayesianParameterSampling({
                        '--eta': uniform(0.01, 0.5),
                        '--learning_rate': uniform(0.01,0.5),
                        '--min_child_weight': uniform(1,100),
                        '--max_depth': choice(range(3,11)),
                        '--gamma': uniform(0,10),
                        '--subsample': uniform(0.5,1),
                        '--colsample_bytree': uniform(0.5,1),
                        '--reg_lambda': uniform(0,10),
                        '--alpha': uniform(0,10),
                        '--scale_pos_weight': uniform(0,10),
                        })
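
To make the search space concrete, the sketch below draws one configuration from three of the ranges above using only the standard library. It samples at random purely to show which values each expression admits: `uniform(low, high)` is a continuous range and `choice(range(3,11))` is the integers 3 through 10. Bayesian sampling instead picks new points based on earlier runs, and the tuple-based `space` format here is a hypothetical stand-in for the Hyperdrive expressions.

```python
import random

# Hypothetical plain-Python description of part of the search space above
space = {
    "--learning_rate": ("uniform", 0.01, 0.5),
    "--max_depth": ("choice", list(range(3, 11))),  # integers 3 through 10
    "--subsample": ("uniform", 0.5, 1.0),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter from its declared range."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "uniform":
            _, low, high = spec
            config[name] = rng.uniform(low, high)
        elif kind == "choice":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"Unknown distribution: {kind}")
    return config


rng = random.Random(0)  # fixed seed so repeated runs give the same draw
config = sample_configuration(space, rng)
print(config)
```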

In [None]:
# Set your Hyperdrive configurations 
scoringMetric = 'Balanced Accuracy Training'
metricGoal = PrimaryMetricGoal.MAXIMIZE
metricGoalString = metricGoal.name  # "MAXIMIZE" or "MINIMIZE"
hyperdriveConfig = HyperDriveConfig(run_config = scriptRunConfig,
                                     hyperparameter_sampling = hyperParams,
                                     primary_metric_name = scoringMetric,
                                     primary_metric_goal = metricGoal, # MAXIMIZE OR MINIMIZE
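                                     max_total_runs = 20,      # guideline: >= 20 times the number of hyperparameters (10 here, so 200); 20 keeps a test run short
                                     max_concurrent_runs = 20)  # with Bayesian sampling, fewer concurrent runs let later runs learn from earlier ones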
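
The guideline of at least 20 runs per tuned hyperparameter can be turned into a quick calculation. The sketch below applies it to the ten hyperparameters in the search space above; `recommended_min_runs` is a hypothetical helper, not an SDK function.

```python
def recommended_min_runs(num_hyperparameters, runs_per_hyperparameter=20):
    """Apply the rule of thumb of 20 runs per tuned hyperparameter."""
    return runs_per_hyperparameter * num_hyperparameters


# The ten hyperparameters tuned in this notebook
tuned = [
    "--eta", "--learning_rate", "--min_child_weight", "--max_depth", "--gamma",
    "--subsample", "--colsample_bytree", "--reg_lambda", "--alpha", "--scale_pos_weight",
]

print(recommended_min_runs(len(tuned)))  # 200
```

A budget of 20 total runs is therefore well below the guideline and is best treated as a smoke test rather than a full tuning run.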

In [None]:
# Configure your Hyperdrive Step
hyperdriveTrainingStep = HyperDriveStep(
    name = 'xgb-model-training-step-with-hyperparameter-tuning',
    hyperdrive_config = hyperdriveConfig,
    inputs = [],
    outputs = [metricsData, savedModel, explainerModel],
    allow_reuse = reuse_prior_run)

### Configure your Model Registration Step
Configure your model registration step by specifying the folder and file names, the Docker container run configuration, the remote compute target, and parameter arguments.

Also, take in the best model, best model explanation, and hyperdrive metrics data as input into this step.

In [None]:
script_name = os.path.join("register", e.modelRegistrationFileName)
modelRegistrationStep = PythonScriptStep(
    name = "model-registration-step",
    source_directory = register_folder,
    script_name = script_name,
    inputs = [savedModel, explainerModel, metricsData],
    arguments = ['--train_dataset_name', train_dataset_name_param,
                 '--val_dataset_name', val_dataset_name_param,
                 '--datastore_name', datastore_name_param,
                 '--project_name', project_name_param,
                 '--project_description', project_description_param,
                 '--pytz_time_zone', pytz_time_zone_param,
                 '--target_column_name', target_column_param,
                 '--k_folds', k_folds_param,
                 '--confidence_level', confidence_level_param,
                 '--model_name', model_name_param,
                 '--output_path', output_path_param,
                 '--scoring_metric', scoring_metric_param,
                 '--metric_goal', metric_goal_param,
                 '--saved_model', savedModel,
                 '--explainer_model', explainerModel,
                 '--metrics_data', metricsData],
    compute_target = computeTarget,
    runconfig = runConfig,
    allow_reuse = reuse_prior_run)

### Configure your Hyperdrive Run Metrics Output Step
Configure your metrics output step by specifying the folder and file names, the Docker container run configuration, the remote compute target, and parameter arguments.

Also, take in the hyperdrive metrics data as input into this step.

In [None]:
metrics_folder = os.path.join(projectFolder,"clusterScripts")
script_name = os.path.join("metrics", e.metricsOutputFileName )

metricsOutputStep = PythonScriptStep(
    name = "metrics-output-step",
    source_directory = metrics_folder,
    script_name = script_name,
    inputs = [metricsData],
    arguments = ['--datastore_name', datastore_name_param,
                 '--pytz_time_zone', pytz_time_zone_param,
                 '--output_path', output_path_param,
                 '--scoring_metric', scoring_metric_param,
                 '--metrics_data', metricsData],
    compute_target = computeTarget,
    runconfig = runConfig,
    allow_reuse = reuse_prior_run
)

### Run your Six-Step Pipeline
Specify the order in which to run your steps.  Then, pass in your parameters and <b>submit</b> your pipeline.

In [None]:
# Create your pipeline
parallelSteps = [modelRegistrationStep, metricsOutputStep]
stepSequence = StepSequence(steps = [unitTestingStep, splitData, datasetRegistrationStep, hyperdriveTrainingStep, parallelSteps])
pipeline = Pipeline(workspace = ws, steps = stepSequence)
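
The ordering implied by the sequence above can be pictured with a small plain-Python model. It assumes, as the variable name `parallelSteps` suggests, that steps in a nested list run side by side once the preceding step has finished. `expand_sequence` is a hypothetical helper that works on step names only and does not touch the Azure ML SDK.

```python
def expand_sequence(sequence):
    """Map each step name to the set of step names it waits for.

    Items run in order; a nested list is treated as a group of parallel steps.
    """
    dependencies = {}
    previous_group = []
    for item in sequence:
        group = item if isinstance(item, list) else [item]
        for step in group:
            dependencies[step] = set(previous_group)
        previous_group = group
    return dependencies


sequence = [
    "unit-testing-step",
    "split-data",
    "dataset-registration-step",
    "xgb-model-training-step-with-hyperparameter-tuning",
    ["model-registration-step", "metrics-output-step"],
]

for step, waits_for in expand_sequence(sequence).items():
    print(step, "waits for", sorted(waits_for))
```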

#### Parameter Explanation
<p><b>TrainDatasetName:</b> Name of your registered training dataset.  This can be anything you would like.</p>
<p><b>ValDatasetName:</b> Name of your registered validation dataset.  This can be anything you would like.</p>
<p><b>DatastoreName:</b> Name of your datastore.  This should be the datastore that holds your input data.</p>
<p><b>DatastorePath:</b> Root folder path which holds your data up to today's date.</p>
<p><b>TrainFileName:</b> Name of your training file located in your datastore.</p>
<p><b>ValFileName:</b> Name of your validation file located in your datastore.</p>
<p><b>OriginalData:</b> Name of the original, unsplit data file located in your datastore.</p>
<p><b>ProjectName:</b> Name of your project.  This can be anything you would like.</p>
<p><b>ProjectDescription:</b> Description of your project.  This can be anything you would like.</p>
<p><b>PytzTimeZone:</b> Your timezone or the timezone in which the data is loaded.</p>
<p><b>TargetColumn:</b> Name of your target column for machine learning.</p>
<p><b>KFolds:</b> Number of times to split your data for cross validation.</p>
<p><b>ShuffleSplitSize:</b> Percentage of data to split for cross validation.</p>
<p><b>ConfidenceLevel:</b> Percentage used to create your confidence interval to compare validation and training results.</p>
<p><b>ModelName:</b> Name of your registered model.  This can be anything you like, provided it follows the naming convention.</p>
<p><b>OutputPath:</b> Root folder path to output your results on your datastore.</p>
<p><b>ScoringMetric:</b> Metric you wish to maximize or minimize as part of hyperparameter tuning.  Set in the Hyperdrive pipeline step section.</p>
<p><b>MetricGoal:</b> Whether you should minimize or maximize your Hyperparameter Metric.  Set in the Hyperdrive pipeline step section.</p>
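
Because most of these parameters default to the string `'None'`, an omitted key falls back to that default without any warning. The sketch below compares a parameter dictionary against the names defined in this notebook before submission. `check_pipeline_parameters` is a hypothetical helper, and the dictionary passed to it is a deliberately incomplete example.

```python
# Parameter names defined earlier in this notebook
EXPECTED_PARAMETERS = {
    "TrainDatasetName", "ValDatasetName", "DatastoreName", "DatastorePath",
    "TrainFileName", "ValFileName", "OriginalData", "ProjectName",
    "ProjectDescription", "PytzTimeZone", "TargetColumn", "KFolds",
    "ShuffleSplitSize", "ConfidenceLevel", "ModelName", "OutputPath",
    "ScoringMetric", "MetricGoal",
}


def check_pipeline_parameters(parameters):
    """Return (missing, unexpected) parameter names, each as a sorted list."""
    supplied = set(parameters)
    missing = sorted(EXPECTED_PARAMETERS - supplied)
    unexpected = sorted(supplied - EXPECTED_PARAMETERS)
    return missing, unexpected


# Deliberately incomplete example with one misspelled key ("KFold")
missing, unexpected = check_pipeline_parameters(
    {"TrainDatasetName": "XGB Training Data", "KFold": 10}
)
print("Missing:", missing)
print("Unexpected:", unexpected)
```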

In [None]:
# To get a list of Pytz Time Zones, uncomment and run the code below
#pytz.all_timezones

In [None]:
# Run your pipeline
pipelineName = 'XGB_Model_Training'
pipeline_run = Experiment(ws, pipelineName).submit(pipeline,pipeline_parameters=
                                                           {'TrainDatasetName': 'XGB Training Data',
                                                           'ValDatasetName': 'XGB Validation Data',
                                                           'DatastoreName': datastore_name,
                                                           'DatastorePath': 'XGB/XGB_Training_Input',
                                                           'TrainFileName': 'xgbTrainingData.csv',
                                                           'ValFileName': 'xgbValidationData.csv',
                                                           'OriginalData': 'processed.cleveland.data.csv',
                                                           'ProjectName': 'XGB Test',
                                                           'ProjectDescription': 'XGB Test Run',
                                                           'PytzTimeZone': 'US/Eastern',
                                                           'TargetColumn': 'num',
                                                           'KFolds': 10,
                                                           'ShuffleSplitSize': 0.1,
                                                           'ConfidenceLevel': 0.95,
                                                           'ModelName': 'Tuned-XGB-Model',
                                                           'OutputPath': 'XGB/XGB_Training_Output',
                                                           'ScoringMetric': scoringMetric,
                                                           'MetricGoal': metricGoalString}, 
                                                           show_output=True)

In [None]:
# GUI to see your Pipeline Run
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion(show_output=True)

### Publish your Pipeline
First, if you shut down your notebook, use the first cell below to retrieve your pipeline run.

Second, publish your pipeline. 

Third, assign your published pipeline to a permanent endpoint.  

You now have an endpoint you can easily schedule either in AMLS or through <b>Azure Data Factory</b>.

In [None]:
# Retrieve a previously run pipeline if necessary by uncommenting and running the code below
#experiment_name = 'XGB_Model_Training'
#experiment = Experiment(ws, experiment_name)
#pipeline_run = PipelineRun(experiment, 'your-pipeline-run-id')

In [None]:
# Publish your Pipeline
published_pipeline = pipeline_run.publish_pipeline(
    name="XGB_Model_Training",
    description="XGB Model Training Pipeline for ADF Use", version="1.0")

published_pipeline

In [None]:
# Attach your Published Pipeline to a Permanent Endpoint
pipelineEndpointName = "XGB Training Pipeline Endpoint"

if any(endpoint.name == pipelineEndpointName for endpoint in PipelineEndpoint.list(ws)):
    # Add a new Version to an existing Endpoint
    pipeline_endpoint = PipelineEndpoint.get(workspace = ws, name = pipelineEndpointName)
    pipeline_endpoint.add_default(published_pipeline)
else:
    # Create a new Endpoint
    pipeline_endpoint = PipelineEndpoint.publish(workspace = ws,
                                                name = pipelineEndpointName,
                                                pipeline = published_pipeline,
                                                description = "XGB Training Pipeline Endpoint")
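
When the endpoint is triggered from outside the notebook, the caller supplies an experiment name and any parameter overrides. The sketch below builds such a request body, assuming the JSON shape Azure ML documents for REST calls to published pipelines (`ExperimentName` plus `ParameterAssignments`). The endpoint URL and authentication header are intentionally left out, and `build_pipeline_request` is a hypothetical helper.

```python
import json


def build_pipeline_request(experiment_name, parameters):
    """Build the JSON body for triggering a published pipeline with parameter overrides."""
    return {
        "ExperimentName": experiment_name,
        "ParameterAssignments": dict(parameters),
    }


# Parameter names and values taken from the submission cell in this notebook
body = build_pipeline_request(
    "XGB_Model_Training",
    {"TargetColumn": "num", "KFolds": 10, "PytzTimeZone": "US/Eastern"},
)
print(json.dumps(body, indent=2))
```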