## Hyperparameter Tuning using HyperDrive

This project is part of the Udacity Azure ML Nanodegree.

In [1]:
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.24.0


### Overview
Please refer to the GitHub README file for a comprehensive overview of the project, including details about the dataset.

This is a Mercedes-Benz used car price prediction project, so I train a Random Forest regression model and use HyperDrive to find the hyperparameters that give the best price predictions.

Steps in this notebook include:
- Experiment
- Compute
- Dataset
- HyperDrive Configuration
- Run Details
- Best Model

## Experiment

Creates the experiment called 'mercedes-price-prediction-experiment-hyperdrive'.

In [2]:
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

experiment_name = 'mercedes-price-prediction-experiment-hyperdrive'

experiment = Experiment(ws, experiment_name)

print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
print('\n')
print(experiment)

udacity-ws
udacity-rg
westeurope
009f51ba-ba0a-4c91-aadf-56aa26b996cb


Experiment(Name: mercedes-price-prediction-experiment-hyperdrive,
Workspace: udacity-ws)


## Compute

Uses the existing compute cluster.

If the cluster does not exist yet, it is created.

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "mercedes-cc"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                              max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

Loads the registered dataset called 'mercedes'.

Prints whether the dataset was found.

In [4]:
key = "mercedes"

# Look up the registered dataset by name in the workspace
if key in ws.datasets:
    dataset = ws.datasets[key]
    print("dataset found")
else:
    print("dataset not found")

dataset found


Prints summary statistics as a sanity check on the dataset. Each column contains 13,119 data points. The price ranges from 650 to 159,999 British pounds, which sounds reasonable for Mercedes-Benz cars given the average year of registration (2017).

In [5]:
df = dataset.to_pandas_dataframe()

df.describe()

| | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
| count | 13119.0 | 13119.0 | 13119.0 | 13119.0 | 13119.0 | 13119.0 |
| mean | 2017.296288 | 24698.59692 | 21949.559037 | 129.972178 | 55.155843 | 2.07153 |
| std | 2.224709 | 11842.675542 | 21176.512267 | 65.260286 | 15.220082 | 0.572426 |
| min | 1970.0 | 650.0 | 1.0 | 0.0 | 1.1 | 0.0 |
| 25% | 2016.0 | 17450.0 | 6097.5 | 125.0 | 45.6 | 1.8 |
| 50% | 2018.0 | 22480.0 | 15189.0 | 145.0 | 56.5 | 2.0 |
| 75% | 2019.0 | 28980.0 | 31779.5 | 145.0 | 64.2 | 2.1 |
| max | 2020.0 | 159999.0 | 259000.0 | 580.0 | 217.3 | 6.2 |
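
As a rough cross-check of the summary above, the quartiles can be turned into Tukey's 1.5 × IQR fences to see which columns contain extreme values. The sketch below hard-codes the minimum, quartiles and maximum from the `describe()` output; the helper name `iqr_fences` and the 1.5 multiplier are illustrative choices, not part of the original notebook.

```python
# Min, quartiles and max copied from the describe() output above.
summary = {
    "price":   {"min": 650.0, "q1": 17450.0, "q3": 28980.0, "max": 159999.0},
    "mileage": {"min": 1.0,   "q1": 6097.5,  "q3": 31779.5, "max": 259000.0},
    "mpg":     {"min": 1.1,   "q1": 45.6,    "q3": 64.2,    "max": 217.3},
}

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

for column, stats in summary.items():
    lower, upper = iqr_fences(stats["q1"], stats["q3"])
    flags = []
    if stats["min"] < lower:
        flags.append("low outliers")
    if stats["max"] > upper:
        flags.append("high outliers")
    print(f"{column}: fences=({lower:.1f}, {upper:.1f}) -> {', '.join(flags) or 'ok'}")
```

By this rule of thumb, the maximum price, the maximum mileage and both extremes of mpg fall outside the fences and may be worth inspecting before training.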


## HyperDrive Configuration

Defines the hyperparameter search space, the early termination policy and the run configuration.

In [8]:
from azureml.core import Environment
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive import choice
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.run import PrimaryMetricGoal

# Retrieves an existing environment from the workspace.
# The curated AzureML-Tutorial environment contains common data science packages.
env = Environment.get(workspace=ws, name="AzureML-Tutorial")

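# Creates an early termination policy.
# BanditPolicy Class: Defines an early termination policy based on slack criteria, and a frequency and delay interval for evaluation.
# In this example, the policy is applied at every interval in which metrics are reported (evaluation_interval=1), starting at evaluation interval 2 (delay_evaluation=2).
# Any run whose primary metric is not within the slack factor (0.1) of the best-performing run is terminated.
# The documented example for a maximized metric terminates runs below 1/(1+0.1), or about 91%, of the best run;
# here the metric (mae) is minimized, so the runs affected are those with a higher error than the slack allows.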
early_termination_policy = BanditPolicy(
    evaluation_interval=1,
    slack_factor = 0.1,
    slack_amount = None,
    delay_evaluation = 2
    )

# Specifies parameter sampler.
# RandomParameterSampling Class: Defines random sampling over a hyperparameter search space.
param_sampling = RandomParameterSampling(
    {
        '--max_depth': choice(range(1, 5)),
        '--min_samples_split': choice(2, 5),
        '--min_samples_leaf': choice(range(1, 5))
    }
)

# Creates ScriptRunConfig.
# ScriptRunConfig Class: Represents configuration information for submitting a training run in Azure Machine Learning.
src = ScriptRunConfig(
    source_directory = './',
    script = 'train.py',
    compute_target = cpu_cluster,
    environment = env
    )

# Creates a HyperDriveConfig using the hyperparameter sampler, policy and ScriptRunConfig.
# HyperDriveConfig Class: Configuration that defines a HyperDrive run.
hyperdrive_config = HyperDriveConfig(
    run_config = src,
    hyperparameter_sampling = param_sampling,
    policy = early_termination_policy,
    primary_metric_name = 'mae',
    primary_metric_goal = PrimaryMetricGoal.MINIMIZE,
    max_total_runs = 5,
    max_concurrent_runs = 3
    )
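
HyperDrive passes each sampled value to the training script as a command-line argument named after the key in the sampling dictionary, so `train.py` has to accept `--max_depth`, `--min_samples_split` and `--min_samples_leaf`. The project's actual `train.py` is not shown in this notebook; the following is only a minimal sketch of the argument-parsing part, and the default values and the `parse_hyperparameters` helper name are assumptions.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse the three hyperparameters sampled by HyperDrive (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Argument names mirror the keys of the sampling dictionary above.
    # The defaults are assumptions, used only when the script runs without HyperDrive.
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--min_samples_split", type=int, default=2)
    parser.add_argument("--min_samples_leaf", type=int, default=1)
    return parser.parse_args(argv)

# Simulate the arguments HyperDrive would append for one sampled configuration.
args = parse_hyperparameters(
    ["--max_depth", "4", "--min_samples_split", "5", "--min_samples_leaf", "2"]
)
print(args.max_depth, args.min_samples_split, args.min_samples_leaf)

# In the real script the parsed values would be passed to the regressor, and the
# resulting error logged under the name 'mae' (the primary_metric_name above),
# for example with Run.get_context().log('mae', value) from azureml.core.
```

The metric name logged by the script has to match `primary_metric_name` exactly, otherwise HyperDrive cannot compare the child runs.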

## Run Details

Submits the HyperDrive run, shows its progress in the RunDetails widget and waits for the run to complete.

In [17]:
from azureml.widgets import RunDetails

run = experiment.submit(hyperdrive_config)
RunDetails(run).show()
run.wait_for_completion(show_output=True)

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_63b9485a-c6a6-4589-ad58-b6404b078aca
Web View: https://ml.azure.com/experiments/mercedes-price-prediction-experiment-hyperdrive/runs/HD_63b9485a-c6a6-4589-ad58-b6404b078aca?wsid=/subscriptions/009f51ba-ba0a-4c91-aadf-56aa26b996cb/resourcegroups/udacity-rg/workspaces/udacity-ws

Streaming azureml-logs/hyperdrive.txt

<START>[2021-04-15T11:17:15.004373][API][INFO]Experiment created<END>
<START>[2021-04-15T11:17:15.831179][GENERATOR][INFO]Trying to sample '3' jobs from the hyperparameter space<END>
<START>[2021-04-15T11:17:16.1948400Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>
<START>[2021-04-15T11:17:16.082263][GENERATOR][INFO]Successfully sampled '3' jobs, they will soon be submitted to the execution target.<END>

Execution Summary
RunId: HD_63b9485a-c6a6-4589-ad58-b6404b078aca
Web View: https://ml.azure.com/experiments/mercedes-price-prediction-experiment-hyperdrive/runs/HD_63b9485a-c6a6-4589-ad5

{'runId': 'HD_63b9485a-c6a6-4589-ad58-b6404b078aca',
 'target': 'mercedes-cc',
 'status': 'Completed',
 'startTimeUtc': '2021-04-15T11:17:14.734969Z',
 'endTimeUtc': '2021-04-15T11:22:55.297336Z',
 'properties': {'primary_metric_config': '{"name": "mae", "goal": "minimize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '2ce6669a-847b-4c31-b5b0-57e600f567d1',
  'score': '4113.0',
  'best_child_run_id': 'HD_63b9485a-c6a6-4589-ad58-b6404b078aca_2',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://udacityws9125636447.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_63b9485a-c6a6-4589-ad58-b6404b078aca/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=nKInTHl%2FMD9oK34GEfv1bjQ8IxQ0dx%2Bl2ldRnwzlc0k%3D&st=2021-04-15T11%3A12%3A56Z&se=2021-04-15T19%3A22%3A56Z&sp=r'},
 'submittedBy': 'cz. official'}
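
To make the effect of the bandit policy concrete, the sketch below computes the cutoff implied by `slack_factor = 0.1` for the best score reported in the summary (4113.0). Azure's documentation describes the rule for a maximized metric; the mirrored form for a minimized metric used here, `best * (1 + slack_factor)`, is an assumption, and `bandit_cutoff` and `should_terminate` are hypothetical helper names rather than SDK functions. The two MAE values tested at the end are made up for illustration.

```python
def bandit_cutoff(best_metric, slack_factor, minimize=True):
    """Return the metric value beyond which a run would be cancelled.

    Maximize: documented rule, best / (1 + slack_factor).
    Minimize: assumed mirror image, best * (1 + slack_factor).
    """
    if minimize:
        return best_metric * (1 + slack_factor)
    return best_metric / (1 + slack_factor)

def should_terminate(run_metric, best_metric, slack_factor, minimize=True):
    """True if run_metric falls outside the allowed slack around best_metric."""
    cutoff = bandit_cutoff(best_metric, slack_factor, minimize)
    return run_metric > cutoff if minimize else run_metric < cutoff

best_mae = 4113.0  # 'score' from the run summary above
cutoff = bandit_cutoff(best_mae, slack_factor=0.1)
print(f"MAE cutoff: {cutoff:.1f}")

# Hypothetical child-run MAE values, not taken from the actual experiment.
print(should_terminate(4400.0, best_mae, 0.1))
print(should_terminate(4600.0, best_mae, 0.1))
```

Under this assumption, a child run would only be stopped early if its reported MAE were more than 10% above the best MAE seen so far.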

## Best Model

Gets the best run from the HyperDrive experiment and prints its ID and metrics.


In [38]:
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('Metrics: ', best_run_metrics)


Best Run Id:  HD_63b9485a-c6a6-4589-ad58-b6404b078aca_2
Metrics:  {'max_depth:': 3, 'min_samples_split:': 2, 'min_samples_leaf:': 1, 'mae': 4113}
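
The metric names in the output above carry a trailing colon (for example `'max_depth:'`), presumably because of how the training script logs them. If the best hyperparameters are needed as keyword arguments later on, the keys can be normalized first. The snippet below is a small sketch that works on a copy of the printed dictionary; the variable names `cleaned`, `best_mae` and `best_params` are illustrative.

```python
# Metrics dictionary copied from the output above.
best_run_metrics = {
    'max_depth:': 3,
    'min_samples_split:': 2,
    'min_samples_leaf:': 1,
    'mae': 4113,
}

# Strip the trailing colon from every key.
cleaned = {name.rstrip(':'): value for name, value in best_run_metrics.items()}

# Separate the primary metric from the hyperparameters.
best_mae = cleaned.pop('mae')
best_params = cleaned

print(best_params)
print(best_mae)
```

`best_params` could then be unpacked into an estimator constructor such as `RandomForestRegressor(**best_params)`, assuming the model is scikit-learn's random forest.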


Registers the best model in the workspace and downloads it to the local 'outputs' folder.

In [35]:
model = best_run.register_model(model_name="hd-model", model_path='outputs/hd-model.pkl')
model.download(target_dir="outputs", exist_ok=True)

'outputs/hd-model.pkl'