# Tuning Hyperparameters

Many machine learning algorithms require *hyperparameters* (parameter values that influence training, but can't be determined from the training data itself). For example, when training a logistic regression model, you can use a *regularization rate* hyperparameter to counteract overfitting; when training a convolutional neural network, you can use hyperparameters like *learning rate* and *batch size* to control how weights are adjusted and how many data items are processed in a mini-batch, respectively. The choice of hyperparameter values can significantly affect the performance of a trained model or the time taken to train it, and you often need to try multiple combinations to find the best one.

In this lab, you'll use a simple example of a logistic regression model with a single hyperparameter, but the principles apply to any kind of model you can train with Azure Machine Learning.
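
To make the effect of a regularization rate concrete, here is a minimal sketch in plain Python. It is not the lab's training code: it fits a one-feature logistic model by gradient descent with an L2 penalty, and the helper names (`sigmoid`, `fit_weight`), the toy data, and the step settings are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_weight(xs, ys, reg_rate, learning_rate=0.5, steps=2000):
    """Fit a one-feature logistic model (no intercept) by gradient descent.

    Minimizes the mean log-loss plus an L2 penalty of (reg_rate / 2) * w**2.
    """
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log-loss with respect to the weight
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / n
        # Add the gradient of the L2 penalty, then take one descent step
        w -= learning_rate * (grad + reg_rate * w)
    return w

# Toy, perfectly separable data (illustrative values only)
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

weak = fit_weight(xs, ys, reg_rate=0.01)
strong = fit_weight(xs, ys, reg_rate=1.0)
print('weight with reg_rate=0.01:', weak)
print('weight with reg_rate=1.0: ', strong)
```

The larger regularization rate yields a smaller weight: the penalty pulls the model toward a simpler fit, which is why the value has to be chosen by experiment rather than learned from the data.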

## Before You Start

Before you start this lab, ensure that you have completed the *Create an Azure Machine Learning Workspace* and *Create a Compute Instance* tasks in [Lab 1: Getting Started with Azure Machine Learning](./labdocs/Lab01.md). Then open this notebook in Jupyter on your Compute Instance.

## Connect to Your Workspace

First, connect to your workspace using the Azure ML SDK.

> **Note**: You may be prompted to authenticate. Copy the code, click the link provided to sign in to your Azure subscription, and then return to this notebook.

In [1]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.17.0 to work with ml-sdk


## Prepare Data for an Experiment

In this lab, you'll use a dataset containing details of diabetes patients. Run the cell below to upload the data and register it as a dataset (if a dataset named *diabetes dataset* is already registered, the code leaves it unchanged).

In [2]:
from azureml.core import Dataset

default_ds = ws.get_default_datastore()

if 'diabetes dataset' not in ws.datasets:
    default_ds.upload_files(files=['./data/diabetes.csv', './data/diabetes2.csv'], # Upload the diabetes csv files in /data
                        target_path='diabetes-data/', # Put it in a folder path in the datastore
                        overwrite=True, # Replace existing files of the same name
                        show_progress=True)

    # Create a tabular dataset from the path on the datastore (this may take a short while)
    tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))

    # Register the tabular dataset
    try:
        tab_data_set = tab_data_set.register(workspace=ws, 
                                name='diabetes dataset',
                                description='diabetes data',
                                tags = {'format':'CSV'},
                                create_new_version=True)
        print('Dataset registered.')
    except Exception as ex:
        print(ex)
else:
    print('Dataset already registered.')

Dataset already registered.


## Prepare a Training Script

Let's start by creating a folder for the training script you'll use to train a logistic regression model.

In [3]:
import os

experiment_folder = 'diabetes_training-hyperdrive'
os.makedirs(experiment_folder, exist_ok=True)

print('Folder ready.')

Folder ready.


Now create the Python script to train the model. This must include:

- A parameter for each hyperparameter you want to optimize (in this case, there's only the regularization hyperparameter)
- Code to log the performance metric you want to optimize for (in this case, you'll log both AUC and accuracy, so you can choose to optimize the model for either of these)
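
The first requirement can be tried in isolation. The sketch below reuses the same `argparse` definition as the training script and parses an explicit argument list instead of the real command line; the wrapper function `parse_training_args` is a hypothetical name introduced only for this illustration.

```python
import argparse

def parse_training_args(argv):
    """Parse the script's hyperparameter arguments from an explicit list."""
    parser = argparse.ArgumentParser()
    # One command-line argument per hyperparameter to tune
    parser.add_argument('--regularization', type=float, dest='reg_rate',
                        default=0.01, help='regularization rate')
    return parser.parse_args(argv)

# A value passed on the command line is converted to a float...
args = parse_training_args(['--regularization', '0.1'])
print('Supplied value:', args.reg_rate)

# ...and the default applies when the argument is omitted
defaults = parse_training_args([])
print('Default value:', defaults.reg_rate)
```

Because `dest='reg_rate'` is set, the parsed value is read from `args.reg_rate` rather than `args.regularization`.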

In [4]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
import joblib
from azureml.core import Run
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Set regularization parameter
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
print("Loading Data...")
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe() # Get the training data from the estimator input

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))  # np.float was removed in NumPy 1.24; the built-in float is equivalent
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# Save the trained model; files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Overwriting diabetes_training-hyperdrive/diabetes_training.py
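
The script passes `C=1/reg` to `LogisticRegression` because scikit-learn's `C` parameter is the inverse of regularization strength, so a larger regularization rate means a smaller `C`. The sketch below shows that mapping for the regularization rates this lab tries later; the helper name `to_inverse_strength` is an assumption made for illustration and is not part of the script.

```python
def to_inverse_strength(reg_rate):
    """Convert a regularization rate into scikit-learn's inverse-strength C."""
    if reg_rate <= 0:
        raise ValueError('regularization rate must be positive')
    return 1.0 / reg_rate

# The regularization rates explored later in this lab
reg_rates = [0.001, 0.005, 0.01, 0.05, 0.1, 1.0]

for reg in reg_rates:
    print('regularization rate', reg, '-> C =', to_inverse_strength(reg))
```

A rate of zero would cause a division by zero in the script, which is why the sketch rejects non-positive values.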


## Prepare a Compute Target

One of the benefits of cloud compute is that it scales on demand, enabling you to provision enough compute resources to process multiple runs of an experiment in parallel, each with different hyperparameter values.

You'll create an Azure Machine Learning compute cluster in your workspace (or use an existing one if you have created it previously).

> **Important**: Change the value of `cluster_name` in the code below to a unique name for your compute cluster before running it! Cluster names must be globally unique and between 2 and 16 characters in length. Valid characters are letters, digits, and the - character.
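
If you want to check a candidate name before submitting it, the length and character rules in the note above can be expressed as a regular expression. This is a minimal local sketch: `is_valid_cluster_name` is a hypothetical helper, it encodes only the rules stated in this lab, and it cannot tell whether the name is unique or whether the service applies further restrictions.

```python
import re

# 2 to 16 characters, each a letter, a digit, or a hyphen
CLUSTER_NAME_PATTERN = re.compile(r'[A-Za-z0-9-]{2,16}')

def is_valid_cluster_name(name):
    """Return True if the name satisfies the length and character rules."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None

for candidate in ['dj-cluster', 'a', 'bad_name', 'this-name-is-far-too-long']:
    print(candidate, '->', is_valid_cluster_name(candidate))
```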

In [5]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "dj-cluster"

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)
    

Found existing cluster, use it.


## Run a *Hyperdrive* Experiment

Azure Machine Learning includes a hyperparameter tuning capability through *Hyperdrive* experiments. These experiments launch multiple child runs, each with a different hyperparameter combination. The run producing the best model (as determined by the logged target performance metric for which you want to optimize) can be identified, and its trained model selected for registration and deployment.
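
Grid sampling, which this lab uses, tries every value of a discrete search space and, when there are several hyperparameters, every combination of their values. The sketch below enumerates such a grid in plain Python without calling the Azure ML SDK; `grid_combinations` and the `--hypothetical_param` argument are illustrative names that do not exist in this lab's training script.

```python
from itertools import product

def grid_combinations(search_space):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

# One hyperparameter: the grid is simply each of its values
single = grid_combinations({
    '--regularization': [0.001, 0.005, 0.01, 0.05, 0.1, 1.0],
})
print(len(single), 'child runs for one hyperparameter')

# Two hyperparameters: the grid is every pairing of their values
double = grid_combinations({
    '--regularization': [0.001, 0.005, 0.01, 0.05, 0.1, 1.0],
    '--hypothetical_param': [1, 2],  # assumption, for illustration only
})
print(len(double), 'child runs for two hyperparameters')
```

The number of runs is the product of the number of values per hyperparameter, so grids grow quickly as you add parameters.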

In [6]:
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive import GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.widgets import RunDetails


# Sample a range of parameter values
params = GridParameterSampling(
    {
        # There's only one parameter, so grid sampling will try each value - with multiple parameters it would try every combination
        '--regularization': choice(0.001, 0.005, 0.01, 0.05, 0.1, 1.0)
    }
)


# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
                          inputs=[diabetes_ds.as_named_input('diabetes')], # Pass the dataset as an input...
                          pip_packages=['azureml-sdk'], # ...so we need azureml-dataprep (it's in the SDK!)
                          entry_script='diabetes_training.py',
                          compute_target = training_cluster,)

# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(estimator=hyper_estimator, 
                          hyperparameter_sampling=params, 
                          policy=None, 
                          primary_metric_name='AUC', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=6,
                          max_concurrent_runs=4)

# Run the experiment
experiment = Experiment(workspace = ws, name = 'diabetes_training_hyperdrive')
run = experiment.submit(config=hyperdrive)

# Show the status in the notebook as the experiment runs
RunDetails(run).show()
run.wait_for_completion()



_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

{'runId': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69',
 'target': 'dj-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-11-02T16:11:17.301914Z',
 'endTimeUtc': '2020-11-02T16:17:06.599783Z',
 'properties': {'primary_metric_config': '{"name": "AUC", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '5acd054d-0551-4d8c-b7a5-cb9cb50352b6',
  'score': '0.856969468262725',
  'best_child_run_id': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69_5',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlsdk1289217328.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=eNPgAjo48X0VQ2H%2F%2B%2BjqOMA63Y%2F89m5LGGiDE6%2Flaok%3D&st=2020-11-02T16%3A07%3A09Z&se=2020-11-03T00%3A17%3A09Z&sp=r'}}

You can view the experiment run status in the widget above. You can also view the main Hyperdrive experiment run and its child runs in [Azure Machine Learning studio](https://ml.azure.com).

> **Note**: The widget may not refresh. You'll see summary information displayed below the widget when the run has completed.

## Determine the Best Performing Run

When all of the runs have finished, you can find the best one based on the performance metric you specified (in this case, the one with the best AUC).
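
Conceptually, finding the best run means ranking the child runs by the primary metric according to the chosen goal. The sketch below does this in plain Python, using the AUC values recorded in this lab's output; the list structure and the `rank_runs` helper are assumptions made for illustration, not the SDK's internal representation.

```python
# AUC per regularization rate, copied from this lab's recorded child runs
child_runs = [
    {'hyperparameters': {'--regularization': 0.001}, 'AUC': 0.8568283429230729},
    {'hyperparameters': {'--regularization': 0.005}, 'AUC': 0.8568570988700241},
    {'hyperparameters': {'--regularization': 0.01}, 'AUC': 0.8568309973181761},
    {'hyperparameters': {'--regularization': 0.05}, 'AUC': 0.8568436056949162},
    {'hyperparameters': {'--regularization': 0.1}, 'AUC': 0.8568613016622707},
    {'hyperparameters': {'--regularization': 1.0}, 'AUC': 0.856969468262725},
]

def rank_runs(runs, metric, maximize=True):
    """Sort runs from best to worst on the given metric."""
    return sorted(runs, key=lambda run: run[metric], reverse=maximize)

ranked = rank_runs(child_runs, 'AUC')
best = ranked[0]
print('Best hyperparameters:', best['hyperparameters'])
print('Best AUC:', best['AUC'])
```

Note how close the AUC values are: on this dataset the regularization rate makes little difference, which is itself a useful finding from a tuning run.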

In [7]:
for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)

best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['Accuracy'])
print(' -Regularization Rate:',parameter_values)

{'run_id': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69_5', 'hyperparameters': '{"--regularization": 1.0}', 'best_primary_metric': 0.856969468262725, 'status': 'Completed'}
{'run_id': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69_4', 'hyperparameters': '{"--regularization": 0.1}', 'best_primary_metric': 0.8568613016622707, 'status': 'Completed'}
{'run_id': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69_1', 'hyperparameters': '{"--regularization": 0.005}', 'best_primary_metric': 0.8568570988700241, 'status': 'Completed'}
{'run_id': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69_3', 'hyperparameters': '{"--regularization": 0.05}', 'best_primary_metric': 0.8568436056949162, 'status': 'Completed'}
{'run_id': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69_2', 'hyperparameters': '{"--regularization": 0.01}', 'best_primary_metric': 0.8568309973181761, 'status': 'Completed'}
{'run_id': 'HD_73d9f8de-6d15-4ef5-8bd4-7ad99d1b5f69_0', 'hyperparameters': '{"--regularization": 0.001}', 'best_primary_metric': 0.8568283429230729

Now that you've found the best run, you can register the model it trained.

In [8]:
from azureml.core import Model

# Register model
best_run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                        tags={'Training context':'Hyperdrive'},
                        properties={'AUC': best_run_metrics['AUC'], 'Accuracy': best_run_metrics['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 10
	 Training context : Hyperdrive
	 AUC : 0.856969468262725
	 Accuracy : 0.7891111111111111


diabetes_model version: 9
	 Training context : Pipeline


diabetes_model version: 8
	 Training context : Hyperdrive
	 AUC : 0.856969468262725
	 Accuracy : 0.7891111111111111


diabetes_model version: 7
	 Training context : Pipeline


diabetes_model_automl version: 1
	 Training context : Auto ML
	 AUC : 0.9904812577250306
	 Accuracy : 0.9520809898762654


diabetes_model version: 6
	 Training context : Inline Training
	 AUC : 0.8743619085526643
	 Accuracy : 0.8873333333333333


diabetes_model version: 5
	 Training context : Parameterized SKLearn Estimator
	 AUC : 0.8483904671874223
	 Accuracy : 0.7736666666666666


diabetes_model version: 4
	 Training context : Estimator
	 AUC : 0.8484929598487486
	 Accuracy : 0.774


diabetes_mitigated_20 version: 1


diabetes_mitigated_19 version: 1


diabetes_mitigated_18 version: 1


diabetes_mitigated_17 version: 1


diabetes_mitiga

> **More Information**: For more information about Hyperdrive, see the [Azure ML documentation](https://docs.microsoft.com/azure/machine-learning/how-to-tune-hyperparameters).