# Optimize Machine Learning Pipeline with Azure

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Tune Hyperparameters using Hyperdrive Pipeline](#id-hyperdrive)
1. [AutoML Pipeline](#id-automl)

## Introduction
We developed a classification model (Logistic Regression) using scikit-learn in `train.py`.
In this notebook we show how to optimize the machine learning pipeline using Azure ML in two ways:

1. Tune the hyperparameters of the logistic regression using Azure Hyperdrive.
2. Use Azure AutoML to find another model. 




## Setup

First, we set up the Azure environment.

### Evaluate AzureML SDK Version 

In [2]:
import azureml.core
import logging


In [3]:
#alert to changes in the SDK version
print("This notebook was created using version 1.19.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

This notebook was created using version 1.19.0 of the Azure ML SDK
You are currently using version 1.19.0 of the Azure ML SDK


### Create Workspace and Experiment

In [5]:
from azureml.core import Workspace, Experiment

#use json config file to access remote vscode
ws=Workspace.from_config()
#create experiment
exp = Experiment(ws, "udacity-project")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
run = exp.start_logging()

Note, we have launched a browser for you to login. For old experience with device code, use "az login --use-device-code"
Performing interactive authentication. Please follow the instructions on the terminal.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
Workspace name: quick-starts-ws-133834
Azure region: southcentralus
Subscription id: d4ad7261-832d-46b2-b093-22156001df5b
Resource group: aml-quickstarts-133834


### Create cluster resource in Azure

In [6]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# cluster name
amlcompute_cluster_name = "cpu-cluster"

# check whether the compute target already exists; if so, skip the creation step
try:
    compute_target=ComputeTarget(workspace=ws,name=amlcompute_cluster_name)
    print('The cluster already exists')
except ComputeTargetException:
    #it is a new cluster so let's configure it
    provisioning_config=AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",max_nodes=4)
    #create the cluster
    compute_target=ComputeTarget.create(ws,amlcompute_cluster_name,provisioning_config)

compute_target.wait_for_completion(show_output=True,min_node_count=None,timeout_in_minutes=20)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


Get Resource info

In [7]:
print(compute_target.get_status().serialize())


{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-07T13:54:28.818000+00:00', 'errors': None, 'creationTime': '2021-01-07T13:54:23.488875+00:00', 'modifiedTime': '2021-01-07T13:54:39.450382+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


<div id='id-hyperdrive'/>

## Tune hyperparameters using Hyperdrive Pipeline

The `train.py` script, which trains the scikit-learn model, must be connected to Azure HyperDrive. The next steps show how:
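
HyperDrive hands each sampled hyperparameter value to the training script as a command-line argument, so the script needs matching argument parsing. `train.py` is not shown in this notebook; the sketch below is an assumption about what that parsing could look like, with the argument names `--C` and `--max_iter` taken from the search space used here and the default values chosen only for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two hyperparameters sampled for each child run."""
    parser = argparse.ArgumentParser()
    # Inverse of regularization strength; smaller values regularize more strongly.
    # The default of 1.0 is an illustrative assumption.
    parser.add_argument("--C", type=float, default=1.0)
    # Maximum number of iterations to converge.
    # The default of 100 is an illustrative assumption.
    parser.add_argument("--max_iter", type=int, default=100)
    return parser.parse_args(argv)


# Simulate the arguments of one child run (illustrative values).
args = parse_args(["--C", "0.2", "--max_iter", "150"])
print(args.C, args.max_iter)
```

Passing `argv` explicitly keeps the function easy to try out in a notebook; inside the real script, `parse_args()` with no argument would read `sys.argv`.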

### Imports 


In [8]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os
from azureml.core import Environment
from azureml.core import ScriptRunConfig


### Scikit-learn Environment and Run Configuration

[Documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-scikit-learn) on training scikit-learn models on Azure.

In [9]:
%%writefile conda_dependencies.yml
dependencies:
- python=3.6.2
- scikit-learn
- pip:
  - azureml-defaults

Overwriting conda_dependencies.yml


In [10]:
from azureml.core import Environment
from azureml.core import ScriptRunConfig
sklearn_env = Environment.from_conda_specification(name = 'sklearn-env', file_path = 'conda_dependencies.yml')
src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target=compute_target,
                      environment=sklearn_env)

### HyperDrive Configuration

The documentation for hyperparameter tuning with HyperDrive can be found [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters).

We define the search space for the following hyperparameters:

* `--C` = Inverse of regularization strength. Smaller values cause stronger regularization.
* `--max_iter` = Maximum number of iterations to converge.

We use random sampling with a Bandit policy for early stopping.
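
To make the early-stopping rule concrete, here is a minimal sketch of how a Bandit policy with a slack factor decides whether a run should be terminated when the primary metric is maximized: a run is stopped once its metric falls below `best_metric / (1 + slack_factor)`. The function name and the accuracy values are hypothetical and serve only as an illustration; this is not the service's implementation.

```python
def bandit_should_stop(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the slack band around the best run.

    Assumes the primary metric is being maximized.
    """
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


best_so_far = 0.90  # illustrative value, not taken from the experiment
for accuracy in (0.89, 0.83, 0.80):
    stop = bandit_should_stop(accuracy, best_so_far, slack_factor=0.1)
    print(f"accuracy={accuracy:.2f} -> terminate: {stop}")
```

With a slack factor of 0.1 and a best accuracy of 0.90, the threshold is roughly 0.818, so only the last of the three illustrative runs would be terminated.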


In [11]:
# Specify parameter sampler
ps = RandomParameterSampling( {
       "C": choice(0.01,0.05,0.2,1,5,10,25),
       "max_iter": choice(100, 150, 200, 250)    
    }
)


# Specify a Policy
# (Bayesian sampling does not support early termination policies, so random sampling is used)
policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1)


if "training" not in os.listdir():
    os.mkdir("./training")


# Create a HyperDriveConfig using the run configuration, hyperparameter sampler, and policy.


hyperdrive_config = HyperDriveConfig(run_config=src,
                                    hyperparameter_sampling=ps,
                                    policy=policy,
                                    primary_metric_name="accuracy",
                                    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                    max_total_runs=20,
                                    max_concurrent_runs=4)
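
As a quick sanity check on the run budget, the sketch below counts the distinct combinations in the discrete search space defined above and compares that count with `max_total_runs`. The variable names are illustrative; the values are copied from the configuration cell.

```python
from itertools import product

# Values copied from the two choice(...) expressions of the sampler.
c_values = [0.01, 0.05, 0.2, 1, 5, 10, 25]
max_iter_values = [100, 150, 200, 250]
max_total_runs = 20

grid = list(product(c_values, max_iter_values))
print(len(grid))  # 28 distinct combinations

# Upper bound on the share of the grid that 20 runs can visit.
coverage = max_total_runs / len(grid)
print(f"{coverage:.0%}")  # 71%
```

With 28 combinations and 20 runs, the random sampler can visit at most about 71% of the grid, so the best run reported is the best among the sampled points rather than a guaranteed optimum of the whole space.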

### HyperDrive run

In [12]:
# Submit the HyperDrive run to the experiment and wait for it to complete.
hyperdrive_run = exp.submit(hyperdrive_config)

hyperdrive_run.wait_for_completion(show_output=True)

RunId: HD_d232f427-c99d-47bd-b809-8cc881fbbac6
Web View: https://ml.azure.com/experiments/udacity-project/runs/HD_d232f427-c99d-47bd-b809-8cc881fbbac6?wsid=/subscriptions/d4ad7261-832d-46b2-b093-22156001df5b/resourcegroups/aml-quickstarts-133834/workspaces/quick-starts-ws-133834

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-01-07T13:56:16.343032][API][INFO]Experiment created<END>\n""<START>[2021-01-07T13:56:16.799883][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2021-01-07T13:56:16.971369][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"

Execution Summary
RunId: HD_d232f427-c99d-47bd-b809-8cc881fbbac6
Web View: https://ml.azure.com/experiments/udacity-project/runs/HD_d232f427-c99d-47bd-b809-8cc881fbbac6?wsid=/subscriptions/d4ad7261-832d-46b2-b093-22156001df5b/resourcegroups/aml-quickstarts-133834/workspaces/quick-starts-ws-133834



{'runId': 'HD_d232f427-c99d-47bd-b809-8cc881fbbac6',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-07T13:56:16.064647Z',
 'endTimeUtc': '2021-01-07T14:16:24.212948Z',
 'properties': {'primary_metric_config': '{"name": "accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '98c12fe6-af4c-4f41-aeb9-e6eb3a5aea34',
  'score': '0.91350531107739',
  'best_child_run_id': 'HD_d232f427-c99d-47bd-b809-8cc881fbbac6_19',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg133834.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_d232f427-c99d-47bd-b809-8cc881fbbac6/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=13XX7m1OW8wzx%2BPyMa1GGpB%2FKKwviBUDc2VdaOzKHqM%3D&st=2021-01-07T14%3A07%3A20Z&se=2021-01-07T22%3A17%3A20Z&sp=r'}}

### Use Azure Widget to evaluate RunDetails

The widget works only while connected to the cloud; it cannot be saved for offline analysis.

In [13]:
#visualize experiment
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

### Get the Best Model 

In [16]:
# Get your best run and save the model from that run.
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print('Best Run Id: ', best_run.id)
print('Best Run info:',best_run_metrics)

Best Run Id:  HD_ec4b5867-df05-4955-8fa7-6670750c25d6_1
Best Run info: {'Regularization Strength:': 5.0, 'Max iterations:': 200, 'accuracy': 0.91350531107739}


### Saving the Best Model as joblib
Register the `outputs/model.joblib` file produced by the best run as a model. For more details, see the [documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py).

In [17]:
import joblib
print(best_run.get_file_names())
hyperdrive_model=best_run.register_model(model_name = 'hyperdrivemodel', model_path = 'outputs/model.joblib')

['azureml-logs/55_azureml-execution-tvmps_3b279520a196f972ef0469dc3d953f9e71beb5603a94d7c5242b27b04cac6d46_d.txt', 'azureml-logs/65_job_prep-tvmps_3b279520a196f972ef0469dc3d953f9e71beb5603a94d7c5242b27b04cac6d46_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_3b279520a196f972ef0469dc3d953f9e71beb5603a94d7c5242b27b04cac6d46_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/102_azureml.log', 'logs/azureml/dataprep/backgroundProcess.log', 'logs/azureml/dataprep/backgroundProcess_Telemetry.log', 'logs/azureml/dataprep/engine_spans_l_d587be3c-0843-4128-86ff-9c262dbb5c1f.jsonl', 'logs/azureml/dataprep/python_span_00ed3a90-e255-4cc8-ae55-28f220045c6d.jsonl', 'logs/azureml/dataprep/python_span_034b4c67-c9ed-4f2a-93e9-9b77c2335a4e.jsonl', 'logs/azureml/dataprep/python_span_0421c138-70c8-43ff-90e2-e13302f683aa.jsonl', 'logs/azureml/dataprep/python_span_06593dff-0068-4ee4-a829-d814d37ff826.jsonl', 'logs/azureml/dataprep/python_sp

<div id='id-automl'/>

## AutoML Pipeline
Create an AutoML pipeline to find a different model and compare it with the logistic regression whose hyperparameters were tuned using HyperDrive.

### Load the data
Because we are not using the `train.py` script to get the data, we need to load the dataset into an Azure `TabularDataset`.

In [18]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory
ds = TabularDatasetFactory.from_delimited_files(path="https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv")


### Split the data into training/validation and test sets
AutoML performs the training and validation; we can test the model afterwards with the remaining test data.
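
The idea of a seeded, approximate percentage split can be sketched in plain Python. This is only an illustration of the concept under the assumption that each record is assigned independently with the given probability; it is not the SDK's implementation, and the function name `approximate_split` is hypothetical.

```python
import random


def approximate_split(records, percentage, seed):
    """Send each record to the first part with probability `percentage`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for record in records:
        (first if rng.random() < percentage else second).append(record)
    return first, second


rows = list(range(1000))  # stand-in for dataset records
train_part, test_part = approximate_split(rows, percentage=0.8, seed=1)
print(len(train_part), len(test_part))
```

The two parts are disjoint and together cover all records, but their sizes only approximate the requested 80/20 ratio; reusing the same seed reproduces the same split.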

In [19]:

train_validation_data, test_data = ds.random_split(percentage=0.8, seed=1)


### AutoML Config
We need to configure the run as a classification task.
For more details see the [AutoMLConfig documentation](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py).
There is no need to split `train_validation_data` any further, because AutoML handles the training/validation split itself through cross-validation (`n_cross_validations`).

In [22]:
from azureml.train.automl import AutoMLConfig

# Parameters settings for the AutoMLConfig
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    "primary_metric" : 'accuracy',
    "enable_early_stopping": True,
    "featurization": 'auto',
    "debug_log": "automl_errors.log",
    "n_cross_validations": 5,
    "compute_target": compute_target,
}
#AutoML Configuration
automl_config = AutoMLConfig(
                             task = "classification",
                             training_data=train_validation_data,
                             label_column_name="y",                                
                             **automl_settings
                            )

### AutoML Run

We now run the AutoML configuration in the same experiment defined previously (`udacity-project`).

In [26]:
# Submit the autoML run
automl_run = exp.submit(automl_config, show_output = True)


Running on remote.
Running on remote compute: cpu-cluster
Parent Run ID: AutoML_cad533a8-a6b0-41fa-a4a6-45ddc242c6d4

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+---------------------------------+------------
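
The class balancing alert matters here because accuracy is the primary metric: on imbalanced data, a model that always predicts the majority class already scores well. The sketch below computes that baseline for a hypothetical label column; the 90/10 ratio is an assumption chosen for illustration, not the class distribution of the bank marketing data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels: 90 negatives and 10 positives (not the real class counts).
labels = [0] * 90 + [1] * 10
baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.9
```

Comparing a model's accuracy with this baseline on the real label column shows how much of the score is due to the class distribution alone.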

### Get the best model and its properties

In [27]:
# Retrieve and save your best automl model.
best_automl_run, fitted_automl_model = automl_run.get_output()
model_name = best_automl_run.properties['model_name']


In [29]:
description = 'AutoML Model trained on bank marketing data to predict if a client will subscribe to a term deposit'
tags = None
model = automl_run.register_model(model_name = model_name, description = description, tags = tags)
print(automl_run.model_id)  # id of the registered model

AutoMLcad533a8a36


In [20]:
print(automl_run.properties)

{'runTemplate': 'automl_child', 'pipeline_id': '__AutoML_Ensemble__', 'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'classification\',\'primary_metric\':\'accuracy\',\'verbosity\':20,\'ensemble_iterations\':15,\'is_timeseries\':False,\'name\':\'udacity-project\',\'compute_target\':\'cpu-cluster\',\'subscription_id\':\'510b94ba-e453-4417-988b-fbdc37b55ca7\',\'region\':\'southcentralus\',\'spark_service\':None}","ensemble_run_id":"AutoML_ac11b830-a542-45b6-bb02-1b43156666d2_23","experiment_name":"udacity-project","workspace_name":"quick-starts-ws-133565","subscription_id":"510b94ba-e453-4417-988b-fbdc37b55ca7","resource_group_name":"aml-quickstarts-133565"}}]}', 'training_percent': '100', 'predicted_cost': None, 'iteration': '23', '_aml_system_scenario_identification': 'Remote.Child', '_azureml.ComputeTargetType': 

### Get Best Model Information

In [35]:
print(best_automl_run)
print(fitted_automl_model)

Run(Experiment: udacity-project,
Id: AutoML_cad533a8-a6b0-41fa-a4a6-45ddc242c6d4_36,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                               reg_lambda=0.10416666666666667,
                                                                                               scale_pos_weight=1,
                     

Show the estimators used in the voting ensemble and their hyperparameters.

In [38]:
from pprint import pprint

# Function to list the hyperparameters 

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators' : list(e[0] for e in step[1].estimators), 'weights' : step[1].weights})
            print()

            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        
        else:
            pprint(step[1].get_params())
            print()
        
print_model(fitted_automl_model)

datatransformer
{'enable_dnn': None,
 'enable_feature_sweeping': None,
 'feature_sweeping_config': None,
 'feature_sweeping_timeout': None,
 'featurization_config': None,
 'force_text_dnn': None,
 'is_cross_validation': None,
 'is_onnx_compatible': None,
 'logger': None,
 'observer': None,
 'task': None,
 'working_dir': None}

prefittedsoftvotingclassifier
{'estimators': ['8', '1', '19', '18', '0', '4', '24', '32'],
 'weights': [0.08333333333333333,
             0.16666666666666666,
             0.08333333333333333,
             0.16666666666666666,
             0.16666666666666666,
             0.16666666666666666,
             0.08333333333333333,
             0.08333333333333333]}

8 - sparsenormalizer
{'copy': True, 'norm': 'max'}

8 - xgboostclassifier
{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 0.8,
 'eta': 0.2,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 6,
 'max_leaves': 0,
 'min_chi
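
The weights printed for the voting classifier can be read as follows: in soft voting, the class probabilities predicted by each estimator are combined by a weighted average, and the class with the highest averaged probability is predicted. Below is a minimal sketch of that combination using the eight weights from the output above. The per-estimator probabilities are hypothetical, and the function `soft_vote` is an illustration rather than the AutoML implementation.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-estimator class probabilities."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probabilities)) / total
        for k in range(n_classes)
    ]


# Weights as printed for the eight ensemble members: 1/12 or 2/12 each.
weights = [1/12, 2/12, 1/12, 2/12, 2/12, 2/12, 1/12, 1/12]

# Hypothetical [P(class 0), P(class 1)] outputs of the eight estimators
# for a single sample.
probabilities = [
    [0.9, 0.1], [0.6, 0.4], [0.8, 0.2], [0.4, 0.6],
    [0.7, 0.3], [0.3, 0.7], [0.5, 0.5], [0.6, 0.4],
]

averaged = soft_vote(probabilities, weights)
predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
print(averaged, predicted_class)
```

Estimators with weight 2/12 pull the averaged probability twice as strongly as those with weight 1/12, which is how the ensemble favors its better-performing members.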

### Save the model

We save the model with joblib. To save the model in ONNX format instead, set `enable_onnx_compatible_models=True` in the AutoML config and use the `OnnxConverter` class to save it.

In [36]:


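import os
import joblib

# make sure the output directory exists before writing the model file
os.makedirs('outputs', exist_ok=True)
joblib.dump(fitted_automl_model, 'outputs/model.joblib')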

['outputs/model.joblib']

### Clean-up Cluster resources

In [39]:
#deletes the compute cluster
compute_target.delete()