# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

## Import dependencies

### Import libraries

In [1]:
# Environment libraries
from azureml.core import Environment

# Workspace and experiment Libraries
from azureml.core import Workspace, Experiment

# Compute cluster libraries
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Azure ML libraries
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import choice

# OS libraries
import os
import shutil

# Download model libraries
import joblib

### Create environment

In [2]:
%%writefile conda_dependencies.yml

dependencies:
- python=3.6.2
- scikit-learn
- pip:
  - azureml-defaults

Overwriting conda_dependencies.yml


In [3]:
sklearn_env = Environment.from_conda_specification(name = 'sklearn-env', file_path = './conda_dependencies.yml')

## Dataset

### Dataset analysis

The dataset selected for the project is the UCI [Estimation of Obesity Levels Data Set](https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+).   
The dataset includes data for estimating obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition.  
The data contains 17 attributes and 2,111 records. The records are labeled with the class variable NObesity (obesity level, stored in the NObeyesdad column), which allows classification of the data.  

The attributes are:
1. Gender: (categorical: Female, Male)
2. Age: (numerical)
3. Height: (numerical)
4. Weight: (numerical)
5. family_history_with_overweight: (categorical: yes, no)
6. FAVC: frequent consumption of high caloric food (categorical: yes, no)
7. FCVC: frequency of consumption of vegetables (numerical)
8. NCP: number of main meals (numerical)
9. CAEC: consumption of food between meals (categorical: Always, Frequently, no, Sometimes)
10. SMOKE: whether the person smokes (categorical: yes, no)
11. CH2O: daily consumption of water (numerical)
12. SCC: calorie consumption monitoring (categorical: yes, no)
13. FAF: physical activity frequency (numerical)
14. TUE: time using technology devices (numerical)
15. CALC: consumption of alcohol (categorical: Always, Frequently, no, Sometimes)
16. MTRANS: transportation used (categorical: Automobile, Bike, Motorbike, Public_Transportation, Walking)

The desired target is:  
17. NObeyesdad (categorical: Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III).  


In the [01. Exploratory Data Analysis notebook](01.%20Exploratory%20Data%20Analysis.ipynb), several analyses were conducted on the dataset used in the project.  

In the [02. Preprocessing notebook](02.%20Preprocessing.ipynb), several preprocessing tasks were applied to the dataset used in the project.  

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [4]:
# Create dataset folder
path = '../'
if "dataset" not in os.listdir(path):
    os.mkdir("../dataset")

In [5]:
# Download dataset
!wget -O ObesityDataSet.zip  'https://archive.ics.uci.edu/ml/machine-learning-databases/00544/ObesityDataSet_raw_and_data_sinthetic%20(2).zip'

--2021-01-12 02:59:38--  https://archive.ics.uci.edu/ml/machine-learning-databases/00544/ObesityDataSet_raw_and_data_sinthetic%20(2).zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 119205 (116K) [application/x-httpd-php]
Saving to: ‘ObesityDataSet.zip’


2021-01-12 02:59:38 (654 KB/s) - ‘ObesityDataSet.zip’ saved [119205/119205]



In [6]:
# Unzip dataset
!unzip ObesityDataSet.zip

Archive:  ObesityDataSet.zip
  inflating: ObesityDataSet_raw_and_data_sinthetic.arff  
  inflating: ObesityDataSet_raw_and_data_sinthetic.csv  


In [8]:
# Move dataset to folder
!mv ObesityDataSet* ../dataset/

### Initialize workspace and experiment

In [5]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'capstone-project'

experiment=Experiment(ws, experiment_name)

In [6]:
# Print workspace settings
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: udacity-ws
Azure region: eastus
Subscription id: 4ee8335e-198c-4d15-b7f8-70b9f3a46669
Resource group: udacity-rg


### Create / attach cluster

In [7]:
cluster_name = 'capstone-cluster'

# Verify that cluster does not exist already
try:
    cluster_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    cluster_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cluster_target = ComputeTarget.create(ws, cluster_name, cluster_config)

# Wait until provisioning completes (at least 1 node, up to 10 minutes)
cluster_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)

# Get cluster status
print(cluster_target.get_status().serialize())

Found existing cluster, use it.
Succeeded................................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-12T03:35:13.032000+00:00', 'errors': None, 'creationTime': '2021-01-12T03:01:15.044380+00:00', 'modifiedTime': '2021-01-12T03:01:31.649682+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

### Model selection
In the [03. Model Selection notebook](03.%20Model%20Selection.ipynb), different tests were conducted to determine the model to be used in the project.  
The model selected was *Decision Trees*.  


### Performance Metric selection
To evaluate the performance of the models, *Accuracy* was selected.  
Accuracy is a popular choice because it is very easy to understand and explain.  

For this dataset, none of the situations that call for a different metric apply: identifying every positive is not crucial (sensitivity/recall), there is no need for extra confidence in the predicted positives (precision), there is no need to cover all true negatives (specificity), and the class distribution is not uneven (F1).  

Reference: [How to select Performance Metrics for Classification Models](https://medium.com/analytics-vidhya/how-to-select-performance-metrics-for-classification-models-c847fe6b1ea3).  
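
For reference, accuracy is the fraction of predictions that match the true labels. The snippet below is a minimal sketch of that calculation in plain Python. The `accuracy` helper and the label lists are made up for illustration and are not taken from the experiment.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only (not results from this project).
y_true = ["Normal_Weight", "Obesity_Type_I", "Overweight_Level_I", "Normal_Weight"]
y_pred = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight", "Normal_Weight"]

print(accuracy(y_true, y_pred))  # 3 of 4 predictions match -> 0.75
```

Because every class counts equally in this ratio, the metric is most informative when the classes are roughly balanced, which is the assumption stated above.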


### Termination policy
An early termination policy improves computational efficiency by automatically terminating poorly performing runs.  
For aggressive savings, a *Bandit Policy* with a small allowable slack is recommended.  
The Bandit policy is defined by a slack factor (or slack amount) and an evaluation interval. It terminates early any run whose primary metric is not within the specified slack factor (or slack amount) of the best performing training run.  

References: [Tune hyperparameters for your model with Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters) and [Define a Bandit policy for early termination of HyperDrive runs](https://azure.github.io/azureml-sdk-for-r/reference/bandit_policy.html#:~:text=Bandit%20is%20an%20early%20termination,the%20best%20performing%20training%20run.).  

The values selected for the policy were a *slack factor* of 0.1 and an *evaluation interval* of 2.  
The policy is applied at every second interval at which metrics are reported. Any run whose accuracy is below 1 / (1 + 0.1), or about 91\%, of the best run's accuracy is terminated.  
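
The cutoff arithmetic can be checked with a few lines of Python. This is only a sketch of the slack-factor rule for a maximized metric as described above, not the HyperDrive implementation. The helper names and the accuracy values are hypothetical.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being terminated
    (slack-factor rule for a metric that is being maximized)."""
    return best_metric / (1 + slack_factor)


def is_terminated(run_metric, best_metric, slack_factor=0.1):
    """True if the run falls below the cutoff derived from the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


best_accuracy = 0.90  # hypothetical best accuracy at an evaluation interval
cutoff = bandit_cutoff(best_accuracy, 0.1)

print(round(cutoff, 4))                     # 0.9 / 1.1 -> 0.8182
print(is_terminated(0.80, best_accuracy))   # below the cutoff -> True
print(is_terminated(0.85, best_accuracy))   # above the cutoff -> False
```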


### Parameter sampler
For hyperparameter sampling, there are 3 types of sampling algorithms:
* Bayesian Sampling
* Grid Sampling
* Random Sampling

**Bayesian Sampling.**  
Picks samples based on how previous samples performed, so the new samples improve the primary metric.  
It is recommended if there is enough budget to explore the hyperparameter space. It does not support early termination policies.  

**Grid Sampling.**  
Performs a simple grid search over all possible values.  
It is recommended if there is enough budget to exhaustively search over the search space. Supports early termination of low-performance runs.  

**Random Sampling.**  
Hyperparameters are randomly selected from the defined search space.  
It supports early termination of low-performance runs.  

Random sampling was selected for the test, since it uses fewer resources than an exhaustive grid search and, unlike Bayesian sampling, accepts an early termination policy.  

Reference: [Tune hyperparameters for your model with Azure Machine Learning: define the search space](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#define-search-space).  
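
To make the difference between grid and random sampling concrete, here is a plain-Python sketch over a small discrete search space. It only illustrates the two ideas and is not how HyperDrive implements them. The helper names, the seed, and the use of independent draws are assumptions made for the example.

```python
import itertools
import random

# Discrete search space used for illustration (two hyperparameters, four values each).
search_space = {
    "max_depth": [5, 10, 50, 100],
    "min_samples_split": [2, 10, 50, 100],
}


def grid_samples(space):
    """Enumerate every combination of values (exhaustive grid search)."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def random_samples(space, n, seed=0):
    """Draw n configurations, picking each value independently at random."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]


grid = grid_samples(search_space)
draws = random_samples(search_space, n=20)

print(len(grid))   # 4 * 4 combinations -> 16
print(len(draws))  # -> 20
```

With only 16 distinct combinations, 20 independent draws in this sketch necessarily repeat at least one configuration. Grid search visits each combination exactly once, while random sampling lets the run budget be chosen independently of the size of the space.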


### Parameter space
In the [04. Hyperparameters Testing notebook](04.%20Hyperparameters%20Testing.ipynb), several tests were conducted to determine the values of the hyperparameters to test.  

The hyperparameters to test are:
* Maximum depth of the tree (max_depth).  
* Minimum number of samples required to split (min_samples_split).  

**Maximum depth.**  
This is the maximum depth of the tree.  
The default value is *None*.  
A value that is too high can cause overfitting. A value that is too low can cause underfitting.  
The values selected for the pre-test were 5, 10, 50, 100 to test underfitting and overfitting.  

**Minimum number of samples.**  
This is the minimum number of samples required to split an internal node.  
The default value is 2.  
The values selected for the sampler were 2, 10, 50, 100 to see the effect of selecting too few or too many samples.  

**Note:** *random_state* was set to 0 to obtain deterministic behavior during fitting.  
This parameter controls the randomness of the estimator. The features are always randomly permuted at each split, so with the default value *None* the best split found may vary across runs.  
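
The sampled values reach the training script as the command-line arguments `--max_depth` and `--min_samples_split`. The `train.py` script itself is not shown in this notebook, so the following is only a sketch of how such a script might parse them. The `parse_args` helper, the defaults, and the help texts are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed on the command line.

    Hypothetical helper: the real train.py may be organized differently.
    """
    parser = argparse.ArgumentParser(description="Train a decision tree classifier")
    parser.add_argument("--max_depth", type=int, default=None,
                        help="Maximum depth of the tree (None means unlimited)")
    parser.add_argument("--min_samples_split", type=int, default=2,
                        help="Minimum number of samples required to split a node")
    return parser.parse_args(argv)


# Simulate the arguments of one sampled run.
args = parse_args(["--max_depth", "10", "--min_samples_split", "2"])
print(args.max_depth, args.min_samples_split)  # 10 2

# In a training script, these values would then presumably be passed to the
# classifier (for example max_depth=args.max_depth) together with random_state=0.
```

Declaring the arguments with `type=int` converts the strings received from the command line into integers before they are used.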


### Configuration settings
The total number of runs was set to 20 and the maximum number of concurrent runs to 4, to avoid overconsumption of resources.  

In [8]:
# Create project folder
if 'training' not in os.listdir():
    os.mkdir('./training')

In [9]:
# Create project folder variable
project_folder = './training'

# Copy the training script and the dataset to the project folder
shutil.copy('train.py', project_folder)
shutil.copy('../dataset/ObesityDataSet_raw_and_data_sinthetic.csv', project_folder)

'./training/ObesityDataSet_raw_and_data_sinthetic.csv'

In [11]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

#TODO: Create the different params that you will be using during training
param_sampling = RandomParameterSampling(
    {
        '--max_depth': choice(5, 10, 50, 100),
        '--min_samples_split': choice(2, 10, 50, 100)
    }
)

#TODO: Create your estimator and hyperdrive config
estimator = ScriptRunConfig(source_directory=project_folder,
                            script='train.py',
                            compute_target=cluster_target,
                            environment=sklearn_env)

hyperdrive_run_config = HyperDriveConfig(run_config=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         policy=early_termination_policy,
                                         primary_metric_name='accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=20,
                                         max_concurrent_runs=4)

In [12]:
#TODO: Submit your experiment
hyperdrive_run = experiment.submit(config=hyperdrive_run_config)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [13]:
# Show details
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

![RunDetails 1](../images/HyperparameterRun1.png)
![RunDetails 2](../images/HyperparameterRun2.png)
![RunDetails 3](../images/HyperparameterRun3.png)

In [14]:
# Wait for completion
hyperdrive_run.wait_for_completion(show_output=True)
assert hyperdrive_run.get_status() == 'Completed'

RunId: HD_184d90ff-1d05-46dc-b342-d368a77a9a41
Web View: https://ml.azure.com/experiments/capstone-project/runs/HD_184d90ff-1d05-46dc-b342-d368a77a9a41?wsid=/subscriptions/4ee8335e-198c-4d15-b7f8-70b9f3a46669/resourcegroups/udacity-rg/workspaces/udacity-ws

Streaming azureml-logs/hyperdrive.txt

[2021-01-16T22:04:34.171440][API][INFO]Experiment created
[2021-01-16T22:04:35.111300][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.
[2021-01-16T22:04:34.695852][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space
[2021-01-16T22:04:37.2666613Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.

Execution Summary
RunId: HD_184d90ff-1d05-46dc-b342-d368a77a9a41
Web View: https://ml.azure.com/experiments/capstone-project/runs/HD_184d90ff-1d05-46dc-b342-d368a77a9a41?wsid=/subscriptions/4ee8335e-198c-4d15-b7f8-70b

![Hyperparameter Run](../images/HyperparameterRunId1.png)

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [15]:
# Get best run
best_run_h = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run_h.get_details()['runDefinition']['arguments'])

# List model files uploaded
print(best_run_h.get_file_names())

['--max_depth', '10', '--min_samples_split', '2']
['azureml-logs/55_azureml-execution-tvmps_0ded1c61a1d07443810c4a3abe4dfb7a1de58425f8da49729ce6266708167b93_d.txt', 'azureml-logs/65_job_prep-tvmps_0ded1c61a1d07443810c4a3abe4dfb7a1de58425f8da49729ce6266708167b93_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_0ded1c61a1d07443810c4a3abe4dfb7a1de58425f8da49729ce6266708167b93_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/102_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/model_h/model_h.joblib']


![Hyperparameter Best Model 1](../images/HyperparameterRunChildId1.png)
![Hyperparameter Best Model 2](../images/HyperparameterRunChildId2.png)

In [16]:
#TODO: Save the best model
os.makedirs('./model_h', exist_ok=True)

for f in best_run_h.get_file_names():
    if f.startswith('outputs/model_h'):
        output_file_path = os.path.join('./model_h', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        best_run_h.download_file(name=f, output_file_path=output_file_path)

Downloading from outputs/model_h/model_h.joblib to ./model_h/model_h.joblib ...


## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

### Model Accuracy Comparison

#### Hyperparameter Tuning Model
![Hyperparameter Best Model 2](../images/HyperparameterRunChildId2.png)

#### AutoML Model
![AutoML Best Model 3](../images/AutoMLRunChildId1.png)

Since the AutoML-generated model obtained a better accuracy, that model was deployed in the [automl notebook](automl.ipynb).

### Model Registration

The instructions in the notebook state: 

"Remember you have to deploy **only one** of the two models you trained. Perform the steps in the rest of this notebook **only if you wish to deploy this model**.  
TODO: In the cell below, **register the model**, create an inference config and deploy the model as a web service."  

Nevertheless, the project rubric requires the model to be registered. Therefore, the following cells register the best model using its run_id.

In [1]:
# Import libraries
from azureml.core import Workspace, Experiment
from azureml.core.run import Run

In [3]:
# Initialize workspace and experiment
ws = Workspace.from_config()
experiment_name = 'capstone-project'
experiment=Experiment(ws, experiment_name)

In [4]:
# Get the best run by its run id
best_run_h = Run(experiment, run_id = 'HD_184d90ff-1d05-46dc-b342-d368a77a9a41_11')

In [5]:
# Register model
registered_model = best_run_h.register_model(model_name = 'hyperdrive-model',
                                             model_path = 'outputs/model_h/model_h.joblib')

![Registered Model](../images/RegisteredModelH.png)

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service