# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [3]:
from azureml.core import Workspace, Experiment

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException
import pandas as pd

from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os

In [4]:
# auth = InteractiveLoginAuthentication()
ws = Workspace.from_config()
# Choose a name for the experiment
experiment_name = 'classification-hyperdrive'
project_folder = './capstoneProject'
experiment = Experiment(ws, experiment_name)


print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = experiment.start_logging()


Workspace name: quick-starts-ws-162718
Azure region: southcentralus
Subscription id: aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee
Resource group: aml-quickstarts-162718


## Create or Attach an AmlCompute Cluster

In [5]:
# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "hyperdiveCompute"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D16_v3',# for GPU, use "STANDARD_NC6"
                                                           vm_priority = 'dedicated', # optional
                                                           max_nodes=6)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 1)
# For a more detailed view of current AmlCompute status, use get_status().

InProgress....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded............
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

This is a binary classification task, so logistic regression was used. Logistic regression applies the sigmoid function to model the probability of the positive class. Two hyperparameters were tuned: `C` and the maximum number of iterations (`max_iter`). `C` is the inverse of the regularization strength, so smaller values specify stronger regularization, which is used to mitigate overfitting. `max_iter` specifies the maximum number of iterations the solver may take to converge.
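To make the role of `C` concrete, here is a minimal pure-Python sketch of the sigmoid and of an L2-regularized logistic-regression objective. It assumes the convention scikit-learn documents for its L2 penalty (the data loss is scaled by `C` and added to half the squared weight norm); the function names and the toy inputs are illustrative only and are not taken from `train.py`.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def l2_objective(weights, bias, rows, labels, C):
    """Return C * (sum of log losses) + 0.5 * ||weights||^2.

    Labels are expected to be 0 or 1. A smaller C shrinks the data term,
    so the penalty on the weights dominates (stronger regularization).
    """
    data_loss = 0.0
    for x, y in zip(rows, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        p = sigmoid(score)
        data_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    penalty = 0.5 * sum(w * w for w in weights)
    return C * data_loss + penalty


# Same weights and data, two values of C from the ends of the tuned range.
weak_fit_pressure = l2_objective([2.0], 0.0, [[1.0]], [1], C=0.03)
strong_fit_pressure = l2_objective([2.0], 0.0, [[1.0]], [1], C=1.0)
print(weak_fit_pressure, strong_fit_pressure)
```

With the weights held fixed, lowering `C` reduces the contribution of the data loss while the penalty stays the same, which is why small values of `C` push the solver toward smaller weights.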
<br><br>
The _BanditPolicy_ early termination policy stops runs based on a slack criterion and an evaluation interval. The evaluation_interval is the frequency at which the policy is applied. The slack_factor is the ratio used to calculate the allowed distance from the best-performing experiment run. With the parameters defined in the cell below, the policy is applied at every second interval at which metrics are reported. For instance, suppose the best-performing run at interval 2 reported a primary metric of 0.8. With a slack factor of 0.1, any training run whose best metric at interval 2 is less than about 0.73 (0.8/(1+slack_factor)) will be terminated.
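The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch of the cutoff rule for a metric that is being maximized; `bandit_cutoff` and `should_terminate` are hypothetical helper names, not part of the Azure ML SDK.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being terminated."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor):
    """True when a run falls below the cutoff derived from the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


cutoff = bandit_cutoff(best_metric=0.8, slack_factor=0.1)
print(round(cutoff, 2))                    # 0.73
print(should_terminate(0.70, 0.8, 0.1))    # True: below the cutoff
print(should_terminate(0.75, 0.8, 0.1))    # False: within the allowed slack
```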

In [12]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

#TODO: Create the different params that you will be using during training
param_sampling = RandomParameterSampling({
    "--C": uniform(0.03, 1),
    "--max_iter": choice(1000, 1500, 2000)
})

if "training" not in os.listdir():
    os.mkdir("./training")
    
#TODO: Create your estimator and hyperdrive config
est = SKLearn(source_directory=".",
              compute_target=amlcompute_cluster_name,
              entry_script="train.py",
              pip_packages=['pandas'])

hyperdrive_run_config = HyperDriveConfig(estimator=est, 
                                         hyperparameter_sampling=param_sampling,
                                         policy=early_termination_policy, 
                                         primary_metric_name='Accuracy', 
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=20, 
                                         max_concurrent_runs=4)
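The following sketch illustrates how one sampled configuration from the search space above could reach the entry script as command-line arguments. It runs locally with the standard library only and does not call the Azure ML SDK. The contents of `train.py` are not shown in this notebook, so `parse_training_args` and its default values are assumptions about what such a script might contain; only the argument names `--C` and `--max_iter` and their ranges come from the cell above.

```python
import argparse
import random

# Search space copied from the sampling cell above.
C_RANGE = (0.03, 1.0)
MAX_ITER_CHOICES = (1000, 1500, 2000)


def sample_configuration(rng):
    """Draw one configuration, sampling each parameter independently."""
    return {
        "--C": rng.uniform(*C_RANGE),
        "--max_iter": rng.choice(MAX_ITER_CHOICES),
    }


def to_cli_arguments(config):
    """Flatten a configuration into a list of command-line arguments."""
    args = []
    for name, value in config.items():
        args.extend([name, str(value)])
    return args


def parse_training_args(argv):
    """Hypothetical parser for train.py; the defaults are assumed values."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--max_iter", type=int, default=100)
    return parser.parse_args(argv)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
config = sample_configuration(rng)
parsed = parse_training_args(to_cli_arguments(config))
print(parsed.C, parsed.max_iter)
```

Each of the 20 runs allowed by `max_total_runs` would receive its own pair of values in this form, which matches the `arguments` list shown in the run details later in the notebook.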




In [13]:
#TODO: Submit your experiment
hyperdrive_run = experiment.submit(config=hyperdrive_run_config)



## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [14]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [20]:
# Get the best run and save the model from that run.
best_run = hyperdrive_run.get_best_run_by_primary_metric()
metrics = best_run.get_metrics()
print(metrics)
best_run

{'Regularization Strength:': 0.8844302084136857, 'Max iterations:': 1000, 'Accuracy': 0.9748459958932238}


| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| classification-hyperdrive | HD_eb73f1a0-64aa-4bd3-b16e-c9cdd3347ca7_1 | azureml.scriptrun | Completed | Link to Azure Machine Learning studio | Link to Documentation |


In [21]:
display(best_run.get_details())

{'runId': 'HD_eb73f1a0-64aa-4bd3-b16e-c9cdd3347ca7_1',
 'target': 'hyperdiveCompute',
 'status': 'Completed',
 'startTimeUtc': '2021-11-04T12:05:31.708771Z',
 'endTimeUtc': '2021-11-04T12:05:58.190868Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'c86ea829-c12b-4926-a7af-6c812117aa74',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--C', '0.8844302084136857', '--max_iter', '1000'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'hyperdiveCompute',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'datacaches': [],
  'jobName': None,
  'maxRunDurationSeconds': None,
  'nodeCount': 1,
  'instanceTypes': [],
  'priority': None,
  'credentialPassthrough': False

In [30]:
#TODO: Save the best model
best_run.download_file('outputs/model.joblib', 'hyperdrive_model.pkl')


In [27]:
# Register the best model
model = best_run.register_model(model_name = 'hyperdrive_model', model_path='./outputs/')
model

Model(workspace=Workspace.create(name='quick-starts-ws-162718', subscription_id='aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee', resource_group='aml-quickstarts-162718'), name=hyperdrive_model, id=hyperdrive_model:2, version=2, tags={}, properties={})