# Tuning Hyperparameters using Hyperdrive
Azure Machine Learning supports hyperparameter tuning through Hyperdrive experiments. A Hyperdrive experiment launches multiple child runs, each with a different hyperparameter configuration. After all runs are complete, the best model can be evaluated and registered in Azure Machine Learning Studio.

In this notebook, you will walk through the process of tuning hyperparameters to optimize a model.

*Note*: To execute the code in each cell, click on the cell and press SHIFT + ENTER. 

### What are hyperparameters?

Hyperparameters differ from model parameters in that they cannot be learned from the data. They are set before training the model. They are adjustable and need to be tuned in order to obtain a model with optimal performance. Examples of hyperparameters include the number of layers in a neural network and the learning rate of many machine learning algorithms, which determines how big a step the algorithm takes at each iteration. You can [learn more about hyperparameters here](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)). An example of a hyperparameter used in the scikit-learn package would be:

```
train_test_split( X, y, test_size=0.9, random_state=0)
```
The `test_size` argument is the proportion of the data to use in the test split, and `random_state` is the seed used by the random number generator. These hyperparameters can be fine-tuned in order to create the best possible model.
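
To make the difference between a hyperparameter and a learned parameter concrete, here is a minimal sketch in plain Python. It is not part of the Hyperdrive workflow: `fit_slope` is a hypothetical helper and the data is a toy example. The learning rate and iteration count are chosen before training, while the slope `w` is learned from the data.

```python
def fit_slope(xs, ys, learning_rate, n_iterations=200):
    """Fit y = w * x by gradient descent on the mean squared error.

    learning_rate and n_iterations are hyperparameters (set before training).
    w is a model parameter (learned from the data).
    """
    w = 0.0
    n = len(xs)
    for _ in range(n_iterations):
        # Gradient of mean((w * x - y) ** 2) with respect to w
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # The learning rate controls how big a step is taken
        w -= learning_rate * gradient
    return w


# Toy data generated with a true slope of 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for learning_rate in (0.001, 0.01, 0.05):
    w = fit_slope(xs, ys, learning_rate)
    print("learning_rate={} learned w={:.4f}".format(learning_rate, w))
```

With the smallest learning rate, the slope has not yet reached 2 after 200 iterations, while the larger rates converge. Finding a setting that works well is exactly the kind of search that hyperparameter tuning automates.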

### Login to Workspace
To log in to the workspace with the Azure ML Python SDK, you will need to authenticate again with Azure. When you run this cell for the first time, you are prompted to authenticate with Azure by clicking on a link and entering a security code on a web page.

This block of code imports the `azureml.core` package, which is used for interacting with Azure Machine Learning.
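
The saved config file that `Workspace.from_config()` loads is a small JSON file, typically named `config.json`, that identifies the workspace. The sketch below only illustrates the shape of that file. All values are placeholders, and `missing_keys` is a hypothetical helper that is not part of the SDK.

```python
import json

# Keys that identify the workspace; every value below is a placeholder
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_keys(config):
    """Return the required keys that are absent or empty in the config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


config = json.loads("""
{
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>"
}
""")

print("Missing keys:", missing_keys(config))
```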

In [None]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

### Prepare the Data
Use the [Diabetes open dataset](https://azure.microsoft.com/en-us/services/open-datasets/catalog/sample-diabetes/) to train a regression model. Executing the code below registers the diabetes dataset within the Machine Learning Studio Workspace as a tabular dataset to be used in experiments.


In [None]:
from azureml.opendatasets import Diabetes
from azureml.core import Dataset

if 'diabetes' not in ws.datasets:

    ds_name = 'diabetes'

    # Create a tabular dataset from the Diabetes open dataset
    tab_data_set = Diabetes.get_tabular_dataset()

    # Register the tabular dataset
    try:
        print("Registering Dataset")
        tab_data_set = tab_data_set.register(workspace=ws, 
                                name=ds_name,
                                description='Diabetes Sample',
                                tags = {'format':'CSV'},
                                create_new_version=True)
        print("Dataset is registered")
    except Exception as ex:
        print(ex)
else:
    print('Dataset already registered.')

### Set Up Compute
A compute instance needs to be selected to run the Hyperdrive experiment on. Executing the code below discovers the available compute instance and assigns it to a variable that is used in a later code cell.


In [None]:
from azureml.core.compute import ComputeTarget


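# Use the last compute target listed in the workspace
training_cluster = None
for compute in ComputeTarget.list(ws):
    training_cluster = ComputeTarget(workspace=ws, name=compute.name)

# Stop early instead of failing later with an undefined variable
if training_cluster is None:
    raise Exception("No compute target was found in the workspace")
print("Found compute instance!")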

### Create Training Script
A training script is needed so that it can be executed during each run. First, create a folder to hold the training script.

In [None]:
import os

experiment_folder = 'diabetes_training-hyperdrive'
os.makedirs(experiment_folder, exist_ok=True)

print('The folder has been created.')

A parameterized training script is created in the `experiment_folder` with parameters for optimizing the *alpha* and *tol* arguments of the algorithm. The script loads the Diabetes dataset that is passed in from the workspace and trains against it with the specified algorithm settings. Running the cell below will generate the script.
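
Each child run receives its hyperparameter values as command-line arguments. The sketch below mirrors the two arguments of the training script to show how they are parsed. `parse_hyperparameters` is a hypothetical helper, and the argument values are made up to stand in for what one child run might be given.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse the --alphas and --tols arguments from a list of strings."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--alphas', type=float, dest='alpha_value', default=0.01, help='alpha rate')
    parser.add_argument('--tols', type=float, dest='tol_value', default=0.01, help='tol rate')
    args = parser.parse_args(argv)
    return args.alpha_value, args.tol_value


# Illustrative values for a single child run
alpha, tol = parse_hyperparameters(['--alphas', '0.5', '--tols', '0.004'])
print("alpha:", alpha, "tol:", tol)

# With no arguments, the defaults are used
default_alpha, default_tol = parse_hyperparameters([])
print("defaults:", default_alpha, default_tol)
```

Because both arguments have defaults, the script can also be run on its own without any hyperparameter values supplied.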

In [None]:
%%writefile $experiment_folder/diabetes_training.py
import os
import argparse
import joblib
import math
from azureml.core import Run
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Set alphas and tols parameters
parser = argparse.ArgumentParser()
parser.add_argument('--alphas', type=float, dest='alpha_value', default=0.01, help='alpha rate')
parser.add_argument('--tols', type=float, dest='tol_value', default=0.01, help='tol rate')
args = parser.parse_args()
alpha = args.alpha_value
tol = args.tol_value

# Get the experiment run context
run = Run.get_context()

# Load the Diabetes dataset and split the data into training and test sets
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe()

X, y = diabetes[['AGE','BMI','S1','S2','S3','S4','S5','S6','SEX']].values, diabetes['Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=66)

# Train the model with the specified alpha and tol arguments
model = Ridge(alpha=alpha, tol=tol)
model.fit(X=X_train, y=y_train)
y_pred = model.predict(X=X_test)
rmse = math.sqrt(mean_squared_error(y_true=y_test, y_pred=y_pred))
run.log("rmse", rmse)

# Save the model to the outputs folder, which is automatically uploaded into the experiment record in Azure ML Studio
os.makedirs('outputs', exist_ok=True)
model_name = "model_alpha_" + str(alpha) + ".pkl"
filename = "outputs/" + model_name
joblib.dump(value=model, filename=filename)

run.complete()

### Run a Hyperdrive Experiment
Tuning hyperparameters is similar to tuning a musical instrument. You play with the settings to determine which result is best. The Hyperdrive package helps automate this process to reduce the amount of tedious work it would take to test each configuration by hand. Run the next cell to start by importing the required packages for the Hyperdrive experiment.

In [None]:
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, choice, uniform
from azureml.widgets import RunDetails
print("Packages imported!")



Hyperparameters are the desired settings used to tweak a given algorithm. Azure Machine Learning provides the ability to automate the selection of these settings. Currently, the sampling methods supported are random sampling, grid sampling, and Bayesian sampling.

Executing the code below will use random sampling, which randomly picks values from the defined search space. With random sampling, you can use continuous hyperparameters that choose within a range of values instead of statically calling out each value to use. This reduces some of the manual work in hyperparameter tuning.
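
The idea behind random sampling can be sketched in plain Python with the standard `random` module. This is only an illustration of the concept, not how Hyperdrive is implemented. `sample_configuration` is a hypothetical helper, and the search space matches the one used in the next cell: a discrete list of values for `--alphas` and a continuous range for `--tols`.

```python
import random


def sample_configuration(rng):
    """Draw one hyperparameter configuration from the search space."""
    return {
        # Discrete hyperparameter: pick one of the listed values
        '--alphas': rng.choice([0.001, 0.005, 0.01, 0.05, 0.1, 1.0, 2.0, 4.0, 8.0]),
        # Continuous hyperparameter: pick any value within the range
        '--tols': rng.uniform(0.001, 0.01),
    }


# A fixed seed keeps the sketch repeatable
rng = random.Random(0)
configurations = [sample_configuration(rng) for _ in range(5)]

for config in configurations:
    print(config)
```

Each printed dictionary corresponds to the arguments one child run would receive. Grid sampling, by contrast, would try every combination of a fixed list of values and so cannot use a continuous range.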


In [None]:
# Parameter values for random sampling
params = RandomParameterSampling(
    {
        '--alphas': choice(0.001, 0.005, 0.01, 0.05, 0.1, 1.0, 2.0, 4.0, 8.0),
        '--tols': uniform(0.001, 0.01),
    }
)

print("Hyperparameters are set!")



Run the next cell to create the estimator. The estimator defines the training script to use for each run, and the compute target to apply for the runs. Also, the dataset is passed through as an input so that each run can use the Diabetes dataset.

In [None]:
# Get the training Diabetes dataset
diabetes_ds = ws.datasets.get("diabetes")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    inputs=[diabetes_ds.as_named_input('diabetes')],
                    pip_packages=['azureml-sdk'],
                    entry_script='diabetes_training.py',
                    compute_target=training_cluster)


print("The estimator has been configured!")



Set up the Hyperdrive configuration, which defines the experiment settings. This includes the random sampling parameters as well as the estimator configuration.

In [None]:
# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(estimator=estimator, 
                          hyperparameter_sampling=params, 
                          policy=None, 
                          primary_metric_name='rmse', 
                          primary_metric_goal=PrimaryMetricGoal.MINIMIZE,  # a lower RMSE is better
                          max_total_runs=20,
                          max_concurrent_runs=4)

print("The Hyperdrive is ready to run!")

Run the experiment and review the results. This will take 10 to 20 minutes. The status will be displayed in the output as the experiment runs.

You can also switch over to Azure Machine Learning Studio and view the status of the run from the Experiments console.

In [None]:
# Run the experiment
experiment = Experiment(workspace = ws, name = 'diabetes_training_hyperdrive')
run = experiment.submit(config=hyperdrive)

# Show the status
RunDetails(run).show()
run.wait_for_completion()

### Get Best Performing Run
When all the runs have finished, you can execute the code below to determine the best performing run based on the primary metric used in the experiment.
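
Conceptually, picking the best run means comparing the primary metric across the child runs. For an error metric such as RMSE, the lowest value is best. The sketch below shows this in plain Python. `best_run_by_metric` is a hypothetical helper, and the child runs and their metric values are made up for illustration. They are not results from this experiment.

```python
def best_run_by_metric(runs, metric_name, goal="minimize"):
    """Return the run whose logged metric is best, or None if no run logged it."""
    scored = [run for run in runs if metric_name in run["metrics"]]
    if not scored:
        return None
    pick = min if goal == "minimize" else max
    return pick(scored, key=lambda run: run["metrics"][metric_name])


# Made-up child runs for illustration only; these are not real results
child_runs = [
    {"id": "run_0", "arguments": {"--alphas": 0.01, "--tols": 0.004}, "metrics": {"rmse": 58.2}},
    {"id": "run_1", "arguments": {"--alphas": 1.0, "--tols": 0.009}, "metrics": {"rmse": 60.7}},
    {"id": "run_2", "arguments": {"--alphas": 0.1, "--tols": 0.002}, "metrics": {"rmse": 57.5}},
    # A failed run that logged no metric is skipped
    {"id": "run_3", "arguments": {"--alphas": 8.0, "--tols": 0.006}, "metrics": {}},
]

best = best_run_by_metric(child_runs, "rmse", goal="minimize")
print(best["id"], best["arguments"], best["metrics"]["rmse"])
```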

In [None]:
best_run = run.get_best_run_by_primary_metric()
if best_run is None:
    raise Exception("No best run was found")
best_run



Automating the hyperparameter tuning process makes the machine learning workflow considerably more efficient. For more information on tuning hyperparameters, check out [Microsoft's documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters).

Don't forget to switch back to the Cloud Academy Lab and run the validation check to verify the Hyperdrive experiment.