# Hyperparameter Tuning using HyperDrive

In [1]:
# import all necessary packages

# setup workspace
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException
# load dataset
from train import read_data
# Hyperdrive Run
from azureml.core import Environment
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice, randint
from azureml.widgets import RunDetails
# Deploy model
from azureml.core import Model
from azureml.core.model import InferenceConfig
from azureml.core.webservice import LocalWebservice, AciWebservice
# download and consume model
import json
import os
import requests

## Dataset

The dataset I'm using for this project is the Heart Failure Prediction Dataset from Kaggle.

fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [2021-10-18] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.

The task is a binary classification: predict whether a person will develop heart disease from a set of 11 diagnostic features.<br>
A detailed description of the dataset can be found in the [README](./README.md).

### Setup workspace and experiment

Use `Workspace.from_config()` to load the workspace configuration that is available on the VM.
Set up an experiment with the name "heart-failure-experiment".

In [2]:
ws = Workspace.from_config()

# print some information about the workspace
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

# choose a name for experiment
experiment_name = 'heart-failure-experiment'

experiment=Experiment(ws, experiment_name)

Workspace name: quick-starts-ws-162277
Azure region: southcentralus
Subscription id: 81cefad3-d2c9-4f77-a466-99a7f541c7bb
Resource group: aml-quickstarts-162277


### Create a cluster

I am reusing the cluster from the AutoML run, or creating a new one if it doesn't exist.

In [3]:
cluster_name = "expcluster"

# Use existing cluster, if it exists
try:
    compute_target = ComputeTarget(workspace=ws, name = cluster_name)
    print('Found existing cluster, use it!')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_DS12_v2',
                                                          max_nodes=6, min_nodes=1)
    compute_target = ComputeTarget.create(workspace=ws, name=cluster_name, provisioning_configuration=compute_config)
compute_target.wait_for_completion(show_output=True)

Found existing cluster, use it!
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Create Dataset
I am reusing the dataset from the AutoML run, or creating a new one if it doesn't exist (see [automl.ipynb](./automl.ipynb)).
The upload of the data and preprocessing steps are defined in the [train.py](./train.py) script.

In [4]:
dataset=read_data()
df = dataset.to_pandas_dataframe()
df.describe()

found existing dataset. use it


| | Age | Sex | RestingBP | Cholesterol | FastingBS | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease | ChestPainType_ASY | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_LVH | RestingECG_Normal | RestingECG_ST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 0.78976 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.404139 | 0.887364 | 0.361656 | 0.553377 | 0.540305 | 0.188453 | 0.221133 | 0.050109 | 0.204793 | 0.601307 | 0.1939 |
| std | 9.432617 | 0.407701 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 0.490992 | 1.06657 | 0.607056 | 0.497414 | 0.498645 | 0.391287 | 0.415236 | 0.218289 | 0.40377 | 0.489896 | 0.395567 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 60.0 | 0.0 | -2.6 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.0 | 1.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 54.0 | 1.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.0 | 0.6 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 60.0 | 1.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.0 | 1.5 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| max | 77.0 | 1.0 | 200.0 | 603.0 | 1.0 | 202.0 | 1.0 | 6.2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Hyperdrive Configuration

#### Model
I'm using a `RandomForestClassifier` model from sklearn for this task. I chose this model because the input data is a mix of numerical and categorical features, on which decision-tree-based models tend to perform better than algorithms like Logistic Regression.
The model training is defined in the `main()` function of the [train.py](./train.py) script.
I'm tuning three parameters of this model using random parameter sampling:
- `n_estimators`: the number of trees in the RandomForest model. The tuning algorithm draws random integers below 10000 (`randint(10000)`).
- `max_depth`: the maximum depth of a tree. The depth is chosen from $[10, 100, 1000, 5000]$.
- `min_samples_split`: the minimum number of samples required to split an internal node. The value is a random integer below 50 (`randint(50)`).<br> **Attention** The value of this parameter must be at least 2! Any run with a lower number will fail.

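For early termination I use a `BanditPolicy` with a slack factor of $0.1$, applied at every second evaluation interval (`evaluation_interval=2`). With a metric that is maximized, a run is terminated when its best reported value falls below the current best value divided by $1 + 0.1$, that is, when it is roughly 9% worse than the best run so far.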
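The cutoff implied by a slack factor can be worked out by hand. The sketch below assumes the rule from the Azure ML documentation for a maximized metric (cutoff = best metric / (1 + slack factor)); the function name `bandit_cutoff` and the best-so-far accuracy of 0.92 are hypothetical values chosen for illustration.

```python
def bandit_cutoff(best_metric: float, slack_factor: float) -> float:
    """Lowest value a run may report without being cancelled (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


# Hypothetical best accuracy observed so far at an evaluation interval.
best_so_far = 0.92
cutoff = bandit_cutoff(best_so_far, slack_factor=0.1)

print(f"runs reporting less than {cutoff:.4f} accuracy would be terminated")
```

A larger slack factor lowers the cutoff and keeps more runs alive; a smaller one terminates runs more aggressively.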

The HyperDrive experiment will run on the cluster `expcluster` created above. To run the training script, some packages (`pandas`, `skl2onnx`, `azureml-defaults`) need to be installed on the cluster, and `python` and `scikit-learn` must be available. I defined these requirements in the environment file `conda_environment.yml`.

The primary metric to maximize is Accuracy. I also log other metrics for the model (see [train.py](./train.py) line 119ff). The experiment performs at most $50$ runs, of which up to 5 run in parallel (`max_concurrent_runs`).

The models are saved in `pkl` and `onnx` format in the `outputs/` folder (see [train.py](./train.py) line 135ff).

In [5]:
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

param_sampling = RandomParameterSampling({"--n_estimators": randint(10000),
    "--max_depth": choice(10, 100, 1000, 5000),
    "--min_samples_split": randint(50)})

env = Environment.from_conda_specification(name = 'env', file_path = './envs/conda_environment.yml')

src = ScriptRunConfig(source_directory = "./",
    script = "train.py",
    compute_target = "expcluster",
    environment = env)

hyperdrive_run_config = HyperDriveConfig(run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="Accuracy",
    primary_metric_goal= PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=50,
    max_concurrent_runs=5)

In [6]:
hyperdrive_run = experiment.submit(hyperdrive_run_config)

## Run Details

In [8]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

Current provisioning state of AmlCompute is "Deleting"




#### Screenshot of RunDetails Widget

<img src="./screenshots/hyperdrive_rundetails_1.png" />
<img src="./screenshots/hyperdrive_rundetails_2.png" />

##### OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

Some HyperDrive runs fail because the sampled value for `min_samples_split` was $0$ or $1$. This was expected, since `randint(50)` draws a random integer from $[0, 50)$.
Accuracy tends to be higher for low values of this parameter, since the trees can then grow much deeper.
A high number of estimators does not necessarily result in a better accuracy.
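How many sampled values are affected can be counted directly. The sketch below assumes that `randint(50)` draws uniformly from the integers 0 to 49; the helper name `is_valid_min_samples_split` is hypothetical and only mirrors the scikit-learn rule that integer values must be at least 2.

```python
def is_valid_min_samples_split(value: int) -> bool:
    """scikit-learn accepts integer min_samples_split values from 2 upwards."""
    return value >= 2


# Assumption: randint(50) samples uniformly from 0..49.
candidates = range(50)
invalid = [v for v in candidates if not is_valid_min_samples_split(v)]
share = len(invalid) / len(candidates)

print(invalid, f"{share:.0%}")
```

Under this assumption about 4% of the runs are expected to fail. One way to avoid them would be a search space that starts at 2, for example `choice(*range(2, 50))`.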

<img src="./screenshots/hyperdriverun_accuracy chart.png"/>

In [9]:
hyperdrive_run.get_children_sorted_by_primary_metric()

[{'run_id': 'HD_6749a8fe-5138-477d-bd0c-65cc00ad962c_12',
  'hyperparameters': '{"--max_depth": 100, "--min_samples_split": 2, "--n_estimators": 1416}',
  'best_primary_metric': 0.9217391304347826,
  'status': 'Completed'},
 {'run_id': 'HD_6749a8fe-5138-477d-bd0c-65cc00ad962c_1',
  'hyperparameters': '{"--max_depth": 5000, "--min_samples_split": 17, "--n_estimators": 8416}',
  'best_primary_metric': 0.9130434782608695,
  'status': 'Completed'},
 {'run_id': 'HD_6749a8fe-5138-477d-bd0c-65cc00ad962c_6',
  'hyperparameters': '{"--max_depth": 5000, "--min_samples_split": 31, "--n_estimators": 2072}',
  'best_primary_metric': 0.9,
  'status': 'Completed'},
 {'run_id': 'HD_6749a8fe-5138-477d-bd0c-65cc00ad962c_13',
  'hyperparameters': '{"--max_depth": 100, "--min_samples_split": 2, "--n_estimators": 7529}',
  'best_primary_metric': 0.8956521739130435,
  'status': 'Completed'},
 {'run_id': 'HD_6749a8fe-5138-477d-bd0c-65cc00ad962c_10',
  'hyperparameters': '{"--max_depth": 1000, "--min_samples_

## Best Model

#### Screenshot of Best Model with RunID and hyperparameters
<img src="./screenshots/Inkedhyperdrive_bestmodel_runid_LI.jpg"/>

<img src="./screenshots/hyperdrive_bestmodel_confusionmatrix.png" width=500 align="right"/>
<br> The best `RandomForestClassifier` model of this run consists of 1416 trees with a maximum depth of 100 and requires at least 2 samples to split an internal node.
It has an accuracy of $0.922$ and a precision of $0.896$. The confusion matrix shows a false negative rate of 3% and a false positive rate of 13%.
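The relation between these figures and a confusion matrix can be sketched as follows. The four counts are hypothetical: they were chosen so that the formulas reproduce the reported values and were not read from the screenshot. The sketch also assumes that the two percentages are rates per true class.

```python
# Hypothetical counts consistent with the reported metrics (not taken from the run).
tp, fp, fn, tn = 60, 7, 2, 46

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
false_negative_rate = fn / (fn + tp)   # share of sick patients predicted healthy
false_positive_rate = fp / (fp + tn)   # share of healthy patients predicted sick

print(f"accuracy={accuracy:.3f} precision={precision:.3f}")
print(f"FNR={false_negative_rate:.1%} FPR={false_positive_rate:.1%}")
```

For a medical screening task the false negative rate is usually the more critical of the two, since a missed heart disease is costlier than a false alarm.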

### Improvements for Future Work
In hindsight, the candidate values for `max_depth` were poorly chosen: because most features are binary, most of the decision trees stay quite shallow (around 15 to 20 levels), so the `max_depth` limit has no effect in most runs. Since the depth a tree reaches also depends on the minimum number of samples in a split, `max_depth` can be dropped from the search space in future work.<br>
This model might also suffer from sample bias, as described in the [AutoML run](automl.ipynb).<br>
I split my dataset into train and test data to fit the model. Since the dataset is quite small, cross-validation may be a better choice and should be considered for future runs.
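The idea behind k-fold cross-validation can be sketched in plain Python. The generator `k_fold_indices` below is a hypothetical helper, not part of train.py; it splits the 918 rows into contiguous folds without shuffling or stratification, both of which a real run would probably want.

```python
def k_fold_indices(n_samples: int, n_splits: int):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled k-fold splits."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=918, n_splits=5))
print([len(test_idx) for _, test_idx in folds])
```

Every row is used for testing exactly once, so the averaged score depends less on one particular split. In practice scikit-learn's `StratifiedKFold` or `cross_val_score` would be the usual tools for this.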

In [10]:
best_hyperdrive_run = hyperdrive_run.get_best_run_by_primary_metric()
print("best run details: ", best_hyperdrive_run.get_details())
print("best run metrics :", best_hyperdrive_run.get_metrics())

best run details:  {'runId': 'HD_6749a8fe-5138-477d-bd0c-65cc00ad962c_12', 'target': 'expcluster', 'status': 'Completed', 'startTimeUtc': '2021-10-28T08:14:21.067019Z', 'endTimeUtc': '2021-10-28T08:15:06.161548Z', 'services': {}, 'properties': {'_azureml.ComputeTargetType': 'amlcompute', 'ContentSnapshotId': '2b9207f9-2a08-4fe7-8f4e-deda53cee95b', 'ProcessInfoFile': 'azureml-logs/process_info.json', 'ProcessStatusFile': 'azureml-logs/process_status.json'}, 'inputDatasets': [{'dataset': {'id': 'd22cfed2-6179-4dad-b095-e740e3e3740d'}, 'consumptionDetails': {'type': 'Reference'}}], 'outputDatasets': [], 'runDefinition': {'script': 'train.py', 'command': '', 'useAbsolutePath': False, 'arguments': ['--max_depth', '100', '--min_samples_split', '2', '--n_estimators', '1416'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'expcluster', 'dataReferences': {}, 'data': {}, 'outputData': {}, 'datacaches': [], 'jobName': None, 'maxRunDurationSeconds': 259

I am saving this model in `pkl` and in `onnx` format.

In [11]:
for f in best_hyperdrive_run.get_file_names():
    if f.startswith('outputs/hyperdrive_model.onnx'):
        output_file_path = os.path.join('./hyperdrive_model', 'hyperdrive_model.onnx')
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        best_hyperdrive_run.download_file(name=f, output_file_path=output_file_path)
    elif f.startswith('outputs/model'):
        output_file_path = os.path.join('./hyperdrive_model', 'hyperdrive_model.pkl')
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        best_hyperdrive_run.download_file(name=f, output_file_path=output_file_path)

Downloading from outputs/hyperdrive_model.onnx to ./hyperdrive_model/hyperdrive_model.onnx ...
Downloading from outputs/model.pkl to ./hyperdrive_model/hyperdrive_model.pkl ...


## Model Deployment

### Register Model
I'm registering the saved `onnx` model for deployment.

In [12]:
# register the onnx model
description = "HyperDrive run heart-failure classification model"

model = Model.register(workspace=ws,
    model_name="hyperdrive_model",
    model_path='./hyperdrive_model/hyperdrive_model.onnx',
    model_framework=Model.Framework.ONNX,
    model_framework_version='1.3',
    description=description)

Registering model hyperdrive_model


### Local Deployment
First I deploy the model as a `LocalWebservice` for debugging purposes.

I created the environment file for the webservice using [write_env_file.py](./hyperdrive_model/write_env_file.py). To run the `onnx` model, `onnxruntime` needs to be installed on the server.
The scoring script for the `onnx` model is [score_onnx_model_version2.py](./hyperdrive_model/score_onnx_model_version2.py).
In this script, the `init()` function loads the model into an `onnxruntime.InferenceSession` object, and the `run()` function passes the input values to the model and returns its prediction. The model takes an array as input, but I find it more user-friendly to pass the input as a pandas DataFrame. The webservice therefore expects a `PandasParameterType`, which I convert into the array for the model.
I also used the `inference_schema` decorators to generate a swagger.json for the webservice.
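The conversion from named columns to a plain array can be illustrated without the webservice. This is a minimal sketch and not the content of score_onnx_model_version2.py: the names `FEATURE_ORDER` and `records_to_rows` are hypothetical, and the column order is assumed to match the dataframe shown in this notebook.

```python
# Assumed column order, taken from the dataframe without the HeartDisease label.
FEATURE_ORDER = [
    "Age", "Sex", "RestingBP", "Cholesterol", "FastingBS", "MaxHR",
    "ExerciseAngina", "Oldpeak", "ST_Slope", "ChestPainType_ASY",
    "ChestPainType_ATA", "ChestPainType_NAP", "ChestPainType_TA",
    "RestingECG_LVH", "RestingECG_Normal", "RestingECG_ST",
]


def records_to_rows(records):
    """Turn a list of {column: value} dicts into float rows in a fixed column order."""
    return [[float(record[name]) for name in FEATURE_ORDER] for record in records]


# First row of the dataset, as sent to the service later in this notebook.
sample = [{"Age": 40, "Sex": 1, "RestingBP": 140, "Cholesterol": 289,
           "FastingBS": 0, "MaxHR": 172, "ExerciseAngina": 0, "Oldpeak": 0.0,
           "ST_Slope": 1, "ChestPainType_ASY": 0, "ChestPainType_ATA": 1,
           "ChestPainType_NAP": 0, "ChestPainType_TA": 0, "RestingECG_LVH": 0,
           "RestingECG_Normal": 1, "RestingECG_ST": 0}]

rows = records_to_rows(sample)
print(len(rows), len(rows[0]))
```

Fixing the column order explicitly guards against clients that send the keys in a different order than the one the model was trained on.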

In [13]:
service_name = 'heart-failure-hyperdrive-service'

env = Environment.from_conda_specification(name="hyperdrive_env", file_path='./hyperdrive_model/hyperdrive_env.yml')
inference_config = InferenceConfig(entry_script='hyperdrive_model/score_onnx_model_version2.py',
                                   environment=env)

In [14]:
# deploy to local for debugging
deployment_config = LocalWebservice.deploy_configuration(port=6789)
test_service = Model.deploy(
    ws,
    name='test-service',
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
    overwrite=True
)
test_service.wait_for_deployment(show_output=True)

Downloading model hyperdrive_model:1 to /tmp/azureml_954a5mif/hyperdrive_model/1
Generating Docker build context.
2021/10/28 08:40:12 Downloading source code...
2021/10/28 08:40:13 Finished downloading source code
2021/10/28 08:40:13 Creating Docker network: acb_default_network, driver: 'bridge'
2021/10/28 08:40:14 Successfully set up Docker network: acb_default_network
2021/10/28 08:40:14 Setting up Docker configuration...
2021/10/28 08:40:14 Successfully set up Docker configuration
2021/10/28 08:40:14 Logging in to registry: 7d5d1cb457424e47883ec96d527005f3.azurecr.io
2021/10/28 08:40:15 Successfully logged into 7d5d1cb457424e47883ec96d527005f3.azurecr.io
2021/10/28 08:40:15 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_network'
2021/10/28 08:40:15 Scanning for dependencies...
2021/10/28 08:40:15 Successfully scanned dependencies
2021/10/28 08:40:15 Launching container with name: acb_step_0
Sending build context to Docker daemon  66.5

Get the swagger.json from the LocalWebservice.

In [15]:
r = requests.get(test_service.swagger_uri)
r.text

'{"swagger": "2.0", "info": {"title": "ML service", "description": "API specification for the Azure Machine Learning service ML service", "version": "1.0"}, "schemes": ["https"], "consumes": ["application/json"], "produces": ["application/json"], "securityDefinitions": {"Bearer": {"type": "apiKey", "name": "Authorization", "in": "header", "description": "For example: Bearer abc123"}}, "paths": {"/": {"get": {"operationId": "ServiceHealthCheck", "description": "Simple health check endpoint to ensure the service is up at any given point.", "responses": {"200": {"description": "If service is up and running, this response will be returned with the content \'Healthy\'", "schema": {"type": "string"}, "examples": {"application/json": "Healthy"}}, "default": {"description": "The service failed to execute due to an error.", "schema": {"$ref": "#/definitions/ErrorResponse"}}}}}, "/score": {"post": {"operationId": "RunMLService", "description": "Run web service\'s model and get the prediction out

The input for the webservice is the first row of the dataframe, a 40-year-old male patient. The model predicts no heart disease for this patient.

In [16]:
# get some testdata to send a request
data = df.head(1).drop("HeartDisease", axis=1).to_dict(orient="records")
body = {"Inputs": [data]}
print(body)

{'Inputs': [[{'Age': 40, 'Sex': 1, 'RestingBP': 140, 'Cholesterol': 289, 'FastingBS': 0, 'MaxHR': 172, 'ExerciseAngina': 0, 'Oldpeak': 0.0, 'ST_Slope': 1, 'ChestPainType_ASY': 0, 'ChestPainType_ATA': 1, 'ChestPainType_NAP': 0, 'ChestPainType_TA': 0, 'RestingECG_LVH': 0, 'RestingECG_Normal': 1, 'RestingECG_ST': 0}]]}


In [17]:
# test against local deployment
uri = test_service.scoring_uri
# health check against the root endpoint of the local service
requests.get("http://localhost:6789")
headers = {"Content-Type": "application/json"}
response = requests.post(uri, data=json.dumps(body), headers=headers)
print(response.json())

[0]


In [18]:
# local deployment is working, it can be deleted now
test_service.delete()

Container has been successfully cleaned up.


### Deploy as Webservice

After testing the model deployment on the LocalWebservice, I deploy the model on an Azure Container Instance (ACI) with 1 CPU core and 1 GB of memory. I enabled authentication and AppInsights for the webservice.
The `inference_config` is the same as for the LocalWebservice: I use the same environment and scoring script for the ACI deployment.

In [19]:
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
                                                enable_app_insights=True,
                                                auth_enabled=True)
service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-10-28 08:48:31+00:00 Creating Container Registry if not exists.
2021-10-28 08:48:31+00:00 Registering the environment.
2021-10-28 08:48:32+00:00 Use the existing image.
2021-10-28 08:48:32+00:00 Generating deployment configuration.
2021-10-28 08:48:33+00:00 Submitting deployment to compute.
2021-10-28 08:48:38+00:00 Checking the status of deployment heart-failure-hyperdrive-service..
2021-10-28 08:50:35+00:00 Checking the status of inference endpoint heart-failure-hyperdrive-service.
Succeeded
ACI service creation operation finished, operation "Succeeded"


#### Screenshot of active model endpoint
<img src="./screenshots/hyperdrive_model_endpointhealthy.png" width=600 align="left"/> 

The URL of the swagger documentation for the REST endpoint of this model is available through the `swagger_uri` attribute of the Webservice object. To consume the model, I need the scoring URI and, since authentication is enabled on the ACI, a key to authenticate my request.
I get those using the `scoring_uri` attribute and the `get_keys()` method of the Webservice object.

In [40]:
# send request to deployed web service
uri = service.scoring_uri
print(uri)
print(service.swagger_uri)
key, _ = service.get_keys()

http://1362c40e-a404-4dad-aa2b-df91df9dffcd.southcentralus.azurecontainer.io/score
http://1362c40e-a404-4dad-aa2b-df91df9dffcd.southcentralus.azurecontainer.io/swagger.json


The [swagger.json](./hyperdrive_model/swagger/swagger.json) can be visualized using the Swagger UI. The scripts to run the Swagger UI on the localhost can be found in [hyperdrive_model/swagger](./hyperdrive_model/swagger/).

<img src="./screenshots/hyperdrive_model_swaggerUI.png"/>

The input for the webservice is again the data of the 40-year-old male patient. The model predicts no heart disease for this patient.

In [41]:
print(body)

{'Inputs': [[{'Age': 40, 'Sex': 1, 'RestingBP': 140, 'Cholesterol': 289, 'FastingBS': 0, 'MaxHR': 172, 'ExerciseAngina': 0, 'Oldpeak': 0.0, 'ST_Slope': 1, 'ChestPainType_ASY': 0, 'ChestPainType_ATA': 1, 'ChestPainType_NAP': 0, 'ChestPainType_TA': 0, 'RestingECG_LVH': 0, 'RestingECG_Normal': 1, 'RestingECG_ST': 0}]]}


In [42]:
headers = {"Content-Type": "application/json"}
headers["Authorization"] = f"Bearer {key}"
response = requests.post(uri, data=json.dumps(body), headers=headers)
print(response.json())

[0]


### Service Logs

In [27]:
print(service.get_logs())

2021-10-28T08:50:26,961038800+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
2021-10-28T08:50:27,004851000+00:00 - rsyslog/run 
2021-10-28T08:50:27,019054900+00:00 - iot-server/run 
2021-10-28T08:50:27,045806200+00:00 - nginx/run 
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2021-10-28T08:50:27,511351100+00:00 - iot-server/finish 1 0
2021-10-28T08:50:27,513721500+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (66)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 94
SPARK_HOME not set. Skipping PySpark Initialization.
Initializing logger
2021-10-28 08:50:29,154 | root | INFO | Starting up app insights client
logging socket was found. logging is available.
logging socket was found. logging is available.
2021-10-28 08:50:29,155 | root | INFO | Starting up request id generator
2021-10-28 08:50:29,155 | root | INFO | Star

In [44]:
compute_target.delete()

In [45]:
service.delete()

**Submission Checklist**

- [x] I have registered the model.
- [x] I have deployed the model with the best accuracy as a webservice.
- [x] I have tested the webservice by sending a request to the model endpoint.
- [x] I have deleted the webservice and shut down all the computes that I have used.
- [x] I have taken a screenshot showing the model endpoint as active.
- [x] The project includes a file containing the environment details.

