# Hyperparameter Tuning using HyperDrive

Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [9]:
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import BayesianParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice

from azureml.widgets import RunDetails

from azureml.core import Dataset, Environment, Experiment, Workspace

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

## Dataset

Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [10]:
# Load the workspace from the local config file and create the experiment
ws = Workspace.from_config()
experiment_name = 'hyperdriveExperiment'

experiment = Experiment(ws, experiment_name)

In [11]:
# load the registered TabularDataset by name and convert it to a pandas DataFrame
dataset = Dataset.get_by_name(ws, name='police_projectNumbers')
police_records = dataset.to_pandas_dataframe()

# preview the first 5 rows of the dataset
police_records.head(5)

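| Unnamed: 0 | stop_date | stop_time | driver_gender | driver_age | driver_race | violation | search_conducted | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 122005 | 0.079861 | 1 | 20 | 5 | 6 | 0 | 3 | 0 | 15 | 0 |
| 1 | 1182005 | 0.34375 | 1 | 40 | 5 | 6 | 0 | 3 | 0 | 15 | 0 |
| 2 | 1232005 | 0.96875 | 1 | 33 | 5 | 6 | 0 | 3 | 0 | 15 | 0 |
| 3 | 2202005 | 0.71875 | 1 | 19 | 5 | 3 | 0 | 1 | 1 | 30 | 0 |
| 4 | 3142005 | 0.416667 | 0 | 21 | 5 | 6 | 0 | 3 | 0 | 15 | 0 |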


## Hyperdrive Configuration

TODO: Explain the model you are using and the reasons for choosing the hyperparameters, termination policy, and config settings.

In [12]:
# Reuse the compute cluster if it already exists; otherwise provision it
cluster_name = 'project-clusters'
try:
    compute_target = ComputeTarget(ws, cluster_name)
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_V2', min_nodes=1, max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Create the different params that you will be using during training

This experiment tunes two hyperparameters:

`--n-estimators`: the number of trees. More trees can improve performance, at the cost of longer training.

`--max-depth`: the maximum depth of each tree, which limits model complexity to balance fit against overfitting.
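
To make the search space concrete, the sketch below enumerates the candidate values that the two `range(...)` expressions in the sampling cell expand to. It uses only the Python standard library; the variable names are illustrative and do not appear elsewhere in this notebook.

```python
# Candidate values behind choice(range(50, 300, 40)) and choice(range(1, 40)).
n_estimators_options = list(range(50, 300, 40))
max_depth_options = list(range(1, 40))

print(n_estimators_options)  # [50, 90, 130, 170, 210, 250, 290]
print(max_depth_options[0], max_depth_options[-1])  # 1 39

# Number of distinct (n-estimators, max-depth) combinations available.
grid_size = len(n_estimators_options) * len(max_depth_options)
print(grid_size)  # 273
```

With 273 possible combinations, the sampler can only visit a small fraction of the space.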

In [13]:

param_sampling = BayesianParameterSampling(
   parameter_space={
       "--n-estimators": choice(range(50, 300, 40)),
       "--max-depth": choice(range(1, 40))
    }
)

#TODO: Create your estimator and hyperdrive config
est = SKLearn(
    source_directory="./",
    entry_script="train.py",
    script_params={'--input-data': dataset.id},
    compute_target=compute_target
)

hyperdrive_run_config = HyperDriveConfig(
    primary_metric_name="accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=30,
    max_concurrent_runs=4,
    hyperparameter_sampling=param_sampling,
    estimator=est
)

For best results with Bayesian Sampling we recommend using a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned. Recommendend value:40.
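
The recommendation in the message above is simple arithmetic. Here is a minimal sketch, assuming the rule is exactly 20 runs per tuned hyperparameter, as the message states. The helper name `recommended_bayesian_runs` is hypothetical and is not part of the Azure ML SDK.

```python
def recommended_bayesian_runs(num_hyperparameters, runs_per_hyperparameter=20):
    """Return the run budget suggested by the 20-runs-per-hyperparameter rule."""
    return runs_per_hyperparameter * num_hyperparameters


# The two hyperparameters tuned in this notebook.
tuned = ["--n-estimators", "--max-depth"]
recommended = recommended_bayesian_runs(len(tuned))

# max_total_runs as configured in HyperDriveConfig above.
configured = 30

print(recommended)                # 40
print(configured >= recommended)  # False
```

With two hyperparameters, the configured 30 runs fall short of the recommended 40, which is presumably why the message is shown.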


In [14]:
# Submit your experiment
hyperdrive_run = experiment.submit(hyperdrive_run_config)
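
Each child run launched by this submission invokes `train.py` with `--input-data` plus one sampled value each for `--n-estimators` and `--max-depth`. The script itself is not shown in this notebook, so the following is only a hypothetical sketch of how it might parse those arguments. The function name `parse_args`, the default values, and the example dataset ID are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that the estimator and HyperDrive pass to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str, help="ID of the registered dataset")
    # Defaults below are assumptions for illustration only.
    parser.add_argument("--n-estimators", type=int, default=100, help="Number of trees")
    parser.add_argument("--max-depth", type=int, default=None, help="Maximum tree depth")
    return parser.parse_args(argv)


# Simulate one child run's command line (placeholder dataset ID).
args = parse_args(
    ["--input-data", "example-dataset-id", "--n-estimators", "210", "--max-depth", "5"]
)
print(args.n_estimators, args.max_depth)  # 210 5
```

`argparse` converts the dashes in flag names to underscores, so `--n-estimators` is read as `args.n_estimators`.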



## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

In the cell below, use the `RunDetails` widget to show the different experiments.

In [15]:
RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=False)

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [18]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run.get_metrics()

{'Num estimators:': 210,
 'Max depth:': 5,
 'accuracy': 0.9906623551211714,
 'Confusion Matrix': 'aml://artifactId/ExperimentRun/dcid.HD_8c7677f1-714a-4760-ba71-fe4d58b1e882_1/Confusion Matrix'}

In [19]:
#TODO: Save the best model
model = best_run.register_model(model_name='model', model_path='outputs/model.pkl')
print(f"Model {model.name}.v{model.version} correctly saved")

Model model.v1 correctly saved


In [20]:
# Deprovision and delete the AmlCompute target. 
compute_target.delete()
print('Cluster Deleted')

Cluster Deleted
