# Optimizing an ML Pipeline in Azure

In this notebook, we will use a standard Scikit-learn Logistic Regression and a custom-coded model, the hyperparameters of which we will optimize using Azure HyperDrive. Next, we will find an optimal model for the same dataset using automated machine learning.

The main steps in this notebook are illustrated in the following diagram:

![](images/ch1_1.png)

The training script `train.py` has previously been set up as part of the pipeline: it creates a tabular dataset with `TabularDatasetFactory` and evaluates a Scikit-learn logistic regression on it. We will use this script to train a logistic regression model on the dataset.
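
HyperDrive hands each sampled hyperparameter to the training script as a command-line argument. A minimal sketch of how a script such as `train.py` might read them is shown here. The argument names `--C` and `--max_iter` match the sampler used later in this notebook, but the function name, defaults, and help text are assumptions and not the actual script.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters passed on the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Defaults are illustrative assumptions, not taken from train.py.
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


# Simulate the arguments one HyperDrive child run could receive.
args = parse_hyperparameters(["--C", "0.5", "--max_iter", "50"])
print(args.C, args.max_iter)
```

The parsed values would then typically be passed on to the model constructor inside the script.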

We will then use HyperDrive to find optimal hyperparameters for the logistic regression model. Manual hyperparameter tuning can be extremely time-consuming, so we let HyperDrive automate the search. This gives us a trained model with hyperparameters optimized by HyperDrive. Once we have a trained logistic regression model, we will create a tabular dataset in the notebook so that we can use AutoML to find another optimized model.

The idea here is that while we may have a model we think is optimal, and we can use HyperDrive to optimize that model's hyperparameters, there may be better machine learning algorithms to use. 

Finally, we will compare the results from the two methods and determine whether our HyperDrive-optimized logistic regression outperforms AutoML, or whether the sheer power of AutoML is enough to beat it.

# Part 1: Hyperparameter Tuning with HyperDrive

In this step, we will first set up the Azure workspace and then create a compute cluster, using exception handling to check for an existing compute cluster before creating a new one.

## Setting up Azure Workspace

We will first import the `Workspace` and `Experiment` classes from the `azureml.core` package. The same Azure configuration (*Workspace Name, Azure Region, Subscription ID, Resource Group*) will then be used to run both the HyperDrive and AutoML pipelines for the purpose of this project.

In [1]:
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="project1-hd9")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: udacityworkspace
Azure region: westeurope
Subscription id: 9abcc493-07de-4e5c-a655-1606fd996080
Resource group: cloud-shell-storage-westeurope


## Setting up Compute Cluster

We will use a Standard D2 v2 (2 cores, 7 GB RAM, 100 GB disk) compute cluster for this pipeline. If a cluster with the given name already exists, it is reused; otherwise, a new one is created.

In [2]:
from azureml.core.compute_target import ComputeTargetException
from azureml.core.compute import ComputeTarget, AmlCompute

# This is a Standard_D2_v2 (2 cores, 7 GB RAM, 100 GB disk) compute cluster I have previously created
# If it does not exist, this will create a new compute cluster
compute_cluster = "standard-D2v2"

try:
    cluster_compute = ComputeTarget(workspace=ws, name=compute_cluster)
    print(f"'{compute_cluster}' currently exists. Using '{compute_cluster}' as compute cluster")
except ComputeTargetException:
    print('Creating new AzureML compute target...')
    cluster_config = AmlCompute.provisioning_configuration(vm_size = "Standard_D2_v2", min_nodes=1, max_nodes=4)
    cluster_compute = ComputeTarget.create(ws, name = compute_cluster, provisioning_configuration = cluster_config)

cluster_compute.wait_for_completion(show_output = True)

'standard-D2v2' currently exists. Using 'standard-D2v2' as compute cluster
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Setting up HyperDrive

We will then set up the HyperDrive configuration, including the estimator, early termination policy, and parameter sampler. In addition, we will configure the primary metric and the maximum number of runs for this experiment.
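
To make the idea of random parameter sampling concrete, here is a plain-Python sketch that draws configurations from the same search space as the sampler in the next cell (a continuous range for `--C` and a discrete set for `--max_iter`). It only imitates the concept with the standard `random` module; the helper name `sample_config` and the fixed seed are assumptions for illustration and are not part of the Azure SDK.

```python
import random


def sample_config(rng):
    """Draw one hyperparameter configuration at random (hypothetical helper)."""
    return {
        "--C": rng.uniform(0.1, 1.0),                  # continuous range
        "--max_iter": rng.choice([10, 50, 100, 200]),  # discrete options
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
configs = [sample_config(rng) for _ in range(20)]  # 20 mirrors max_total_runs

# Show the first few sampled configurations.
for cfg in configs[:3]:
    print(cfg)
```

Each configuration corresponds to one child run that HyperDrive would launch with those arguments.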

In [3]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice, loguniform
from azureml.core import ScriptRunConfig, Experiment
import os

# Specify parameter sampler
ps = RandomParameterSampling(
    {
        '--C': uniform(0.1, 1.0), 
        '--max_iter': choice(10, 50, 100, 200)
    }
)

# Specify a Policy
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

if "training" not in os.listdir():
    os.mkdir("./training")

# Create a SKLearn estimator for use with train.py
est = SKLearn("./", 
              entry_script="train.py", 
              compute_target=compute_cluster)
# est = ScriptRunConfig("./", script="train.py", compute_target=compute_cluster)

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
hd_config = HyperDriveConfig(estimator = est,
                       hyperparameter_sampling = ps,
                       policy = policy,
                       primary_metric_name = 'Accuracy',  # must exactly match the metric name logged by train.py
                       primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
                       max_total_runs = 20,
                       max_concurrent_runs = 4)

'SKLearn' estimator is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or the AzureML-Tutorial curated environment.
'enabled' is deprecated. Please use the azureml.core.runconfig.DockerConfiguration object with the 'use_docker' param instead.
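
The `BanditPolicy` configured above stops poorly performing runs early. As a rough sketch of the rule for a maximization goal, a run is cancelled when its best metric falls below the best run's metric divided by `1 + slack_factor`; with `evaluation_interval=2`, the check is applied at every second interval at which metrics are reported. The function name and the accuracy values below are illustrative assumptions and not part of the Azure SDK.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (maximize goal)."""
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


# Illustrative values only: the best run so far has an accuracy of 0.90.
best_so_far = 0.90
for candidate in (0.85, 0.80):
    stop = bandit_should_terminate(candidate, best_so_far, slack_factor=0.1)
    print(f"accuracy={candidate:.2f} -> terminate: {stop}")
```

With a slack factor of 0.1, the cut-off sits at roughly 91% of the best metric, so only runs that lag well behind the leader are stopped.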


## Running HyperDrive

We will now run HyperDrive with the previous configurations.

In [None]:
# Submit your hyperdrive run to the experiment and show run details with the widget.
hd_run = exp.submit(hd_config, show_output = True)
RunDetails(hd_run).show()
hd_run.wait_for_completion(show_output = True)




RunId: HD_f61658f7-878f-4035-a75f-340d535df6a8
Web View: https://ml.azure.com/runs/HD_f61658f7-878f-4035-a75f-340d535df6a8?wsid=/subscriptions/9abcc493-07de-4e5c-a655-1606fd996080/resourcegroups/cloud-shell-storage-westeurope/workspaces/udacityworkspace&tid=15ce9348-be2a-462b-8fc0-e1765a9b204a

Streaming azureml-logs/hyperdrive.txt

<START>[2021-04-15T10:18:23.188551][API][INFO]Experiment created<END>
<START>[2021-04-15T10:18:23.991260][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>
<START>[2021-04-15T10:18:24.284397][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>
<START>[2021-04-15T10:18:24.2643581Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>


In [5]:
import joblib

best_metrics = hd_run.get_metrics()
best_metrics

{'HD_f61658f7-878f-4035-a75f-340d535df6a8_19': {'Regularization Strength:': 0.10708766947092146,
  'Max iterations:': 50,
  'Accuracy': 0.9071320182094081},
 'HD_f61658f7-878f-4035-a75f-340d535df6a8_18': {'Regularization Strength:': 0.7915878841049379,
  'Max iterations:': 50,
  'Accuracy': 0.9082448153768335},
 'HD_f61658f7-878f-4035-a75f-340d535df6a8_17': {'Regularization Strength:': 0.3594539317652813,
  'Max iterations:': 100,
  'Accuracy': 0.9070308548305513},
 'HD_f61658f7-878f-4035-a75f-340d535df6a8_16': {'Regularization Strength:': 0.9825129543652819,
  'Max iterations:': 10,
  'Accuracy': 0.9016691957511381},
 'HD_f61658f7-878f-4035-a75f-340d535df6a8_15': {'Regularization Strength:': 0.49564343761654905,
  'Max iterations:': 200,
  'Accuracy': 0.9070308548305513},
 'HD_f61658f7-878f-4035-a75f-340d535df6a8_14': {'Regularization Strength:': 0.26828336533745434,
  'Max iterations:': 50,
  'Accuracy': 0.9070308548305513},
 'HD_f61658f7-878f-4035-a75f-340d535df6a8_13': {'Regulariza
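
To pick a winner from a metrics dictionary like the one above, one option is to take the entry with the highest `Accuracy`. The sketch below does this for the six runs that are fully visible in the truncated output, so it may not reflect the best run overall.

```python
# Metrics copied from the visible part of the output, keyed by child run ID.
metrics = {
    "HD_f61658f7-878f-4035-a75f-340d535df6a8_19": {
        "Regularization Strength:": 0.10708766947092146,
        "Max iterations:": 50, "Accuracy": 0.9071320182094081},
    "HD_f61658f7-878f-4035-a75f-340d535df6a8_18": {
        "Regularization Strength:": 0.7915878841049379,
        "Max iterations:": 50, "Accuracy": 0.9082448153768335},
    "HD_f61658f7-878f-4035-a75f-340d535df6a8_17": {
        "Regularization Strength:": 0.3594539317652813,
        "Max iterations:": 100, "Accuracy": 0.9070308548305513},
    "HD_f61658f7-878f-4035-a75f-340d535df6a8_16": {
        "Regularization Strength:": 0.9825129543652819,
        "Max iterations:": 10, "Accuracy": 0.9016691957511381},
    "HD_f61658f7-878f-4035-a75f-340d535df6a8_15": {
        "Regularization Strength:": 0.49564343761654905,
        "Max iterations:": 200, "Accuracy": 0.9070308548305513},
    "HD_f61658f7-878f-4035-a75f-340d535df6a8_14": {
        "Regularization Strength:": 0.26828336533745434,
        "Max iterations:": 50, "Accuracy": 0.9070308548305513},
}

# Select the run ID whose logged accuracy is highest.
best_run_id = max(metrics, key=lambda run_id: metrics[run_id]["Accuracy"])
best = metrics[best_run_id]
print(best_run_id)
print(best)
```

Within the SDK, `hd_run.get_best_run_by_primary_metric()` returns the best child run directly, provided the configured primary metric name matches the name logged by the training script.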

In [19]:
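# Metrics from a separate HyperDrive run (note the different run ID in the output below);
# best_run_hd is presumably the result of get_metrics() on that run.
best_run_hd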

{'HD_14855822-6fbd-46ed-85d6-b9f2d055d5ca_19': {'Regularization Strength:': 0.3115435155674668,
  'Max iterations:': 10,
  'Accuracy': 0.9016691957511381},
 'HD_14855822-6fbd-46ed-85d6-b9f2d055d5ca_18': {'Regularization Strength:': 0.5968594773402093,
  'Max iterations:': 200,
  'Accuracy': 0.9067273646939807},
 'HD_14855822-6fbd-46ed-85d6-b9f2d055d5ca_17': {'Regularization Strength:': 0.36544561402942044,
  'Max iterations:': 200,
  'Accuracy': 0.9070308548305513},
 'HD_14855822-6fbd-46ed-85d6-b9f2d055d5ca_16': {'Regularization Strength:': 0.6248055311610956,
  'Max iterations:': 50,
  'Accuracy': 0.9082448153768335},
 'HD_14855822-6fbd-46ed-85d6-b9f2d055d5ca_15': {'Regularization Strength:': 0.19133412664374494,
  'Max iterations:': 50,
  'Accuracy': 0.9075366717248357},
 'HD_14855822-6fbd-46ed-85d6-b9f2d055d5ca_14': {'Regularization Strength:': 0.8936764009743738,
  'Max iterations:': 50,
  'Accuracy': 0.9075366717248357},
 'HD_14855822-6fbd-46ed-85d6-b9f2d055d5ca_13': {'Regularizat

# Part 2: Hyperparameter Tuning with AutoML

Now that the HyperDrive run has been set up, we will configure AutoML: we import the data from the specified URL, clean it, and create an `AutoMLConfig` in this notebook.

In [1]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory
# Data is available at: 
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
ds = TabularDatasetFactory.from_delimited_files(data)

## Setting up Training and Test Set

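We will use `clean_data` from `train.py` together with Scikit-learn's `train_test_split` to create training and test sets (an 80/20 split). Note that the `AutoMLConfig` below is given the `ds` dataset directly as `training_data`.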

In [2]:
from train import clean_data
from sklearn.model_selection import train_test_split
import pandas as pd

# Use the clean_data function to clean your data.
x, y = clean_data(ds)
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size = 0.20,
                                                    random_state = 26)

In [6]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
automl_config = AutoMLConfig(experiment_timeout_minutes = 30,
                            task = 'classification',
                            primary_metric = 'accuracy',
                            training_data = ds,
                            iterations = 20,
                            iteration_timeout_minutes = 10,
                            label_column_name = "y",
                            n_cross_validations = 5,
                            compute_target = compute_cluster)

In [7]:
from azureml.widgets import RunDetails
from azureml.core.experiment import Experiment

# Submit AutoML run with SDK

exp2 = Experiment(ws, "project1-auto4")
automl_run = exp2.submit(automl_config, show_output = True)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output = True)

Submitting remote run.
No run_configuration provided, running on standard-D2v2 with default configuration
Running on remote compute: standard-D2v2


Experiment,Id,Type,Status,Details Page,Docs Page
project1-auto4,AutoML_fffd8b83-2cdb-411b-834b-230936d70213,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.



Experiment,Id,Type,Status,Details Page,Docs Page
project1-auto4,AutoML_fffd8b83-2cdb-411b-834b-230936d70213,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation




****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+---------------------------------+---------------------------------+--------------------------------------+
|Size of the smallest class       |Name/Label of the smallest class |Number of samples in the training data|
|3692                             |yes                              |32950                                 |
+---------------------------------+---------------------------------+--------------------------------------+

********************************************
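
The guardrail above reports 3,692 'yes' rows out of 32,950. Assuming the label has only two classes, a model that always predicts the majority class would already reach roughly 88.8% accuracy, which is why accuracy alone can flatter a model on this dataset. The arithmetic below uses only the numbers from the guardrail table; the variable names are illustrative.

```python
smallest_class = 3692  # 'yes' rows reported by the guardrail
total_rows = 32950     # rows in the training data

# Share of the minority class and the accuracy of always predicting
# the majority class (assumes a binary label).
minority_share = smallest_class / total_rows
majority_baseline = 1 - minority_share

print(f"Minority share:    {minority_share:.4f}")
print(f"Majority baseline: {majority_baseline:.4f}")
```

Any accuracy reported for this dataset is best read against that baseline, or alongside a metric such as `AUC_weighted` that is less sensitive to class imbalance.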

{'runId': 'AutoML_fffd8b83-2cdb-411b-834b-230936d70213',
 'target': 'standard-D2v2',
 'status': 'Completed',
 'startTimeUtc': '2021-04-15T11:28:55.69088Z',
 'endTimeUtc': '2021-04-15T12:13:55.015701Z',
 'properties': {'num_iterations': '20',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'standard-D2v2',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"6c52a91e-11c2-40c0-8492-3444257ef13c\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.26.0", "azureml-train": "1.26.0", "azureml-train-restclients-hyperdrive": "1.26.0", "azureml-train-core": "1.26.0", "azureml-train-automl": "1.26.0", "azureml-train-automl-runtime": "1.26.0", "azureml-train-automl-client": "1.26.0", 

In [8]:
import joblib

best_run_automl = automl_run.get_metrics()
best_run_automl

{'experiment_status': ['DatasetEvaluation',
  'FeaturesGeneration',
  'DatasetFeaturization',
  'DatasetFeaturizationCompleted',
  'DatasetBalancing',
  'DatasetCrossValidationSplit',
  'ModelSelection',
  'BestRunExplainModel',
  'ModelExplanationDataSetSetup',
  'PickSurrogateModel',
  'EngineeredFeatureExplanations',
  'EngineeredFeatureExplanations',
  'RawFeaturesExplanations',
  'RawFeaturesExplanations',
  'BestRunExplainModel'],
 'experiment_status_description': ['Gathering dataset statistics.',
  'Generating features for the dataset.',
  'Beginning to fit featurizers and featurize the dataset.',
  'Completed fit featurizers and featurizing the dataset.',
  'Performing class balancing sweeping',
  'Generating individually featurized CV splits.',
  'Beginning model selection.',
  'Best run model explanations started',
  'Model explanations data setup completed',
  'Choosing LightGBM as the surrogate model for explanations',
  'Computation of engineered features started',
  'Comp

In [9]:
print('Best AutoML Model')
print('Accuracy: ', best_run_automl['accuracy'])
print('AUC_weighted: ', best_run_automl['AUC_weighted'])

Best AutoML Model
Accuracy:  0.9165098634294386
AUC_weighted:  0.9463799039851419
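
For the comparison between the two approaches, a quick calculation of the accuracy gap can help. The HyperDrive figure used here is the highest accuracy visible in the truncated HyperDrive output earlier in this notebook, so it is an assumption that no hidden run scored higher.

```python
# Highest accuracy visible in the truncated HyperDrive output (assumption:
# no run hidden by the truncation did better).
hyperdrive_accuracy = 0.9082448153768335
# Accuracy printed for the best AutoML model above.
automl_accuracy = 0.9165098634294386

gap = automl_accuracy - hyperdrive_accuracy
print(f"AutoML minus HyperDrive: {gap * 100:.2f} percentage points")
```

On these numbers AutoML is ahead by a little under one percentage point of accuracy.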


## Delete Compute Cluster

In [3]:
print(f'Deleting {compute_cluster} compute cluster...')
cluster_compute.delete()
print(f'{compute_cluster} deleted')

Deleting standard-D2v2 compute cluster...
standard-D2v2 deleted
