# Hyperparameter Tuning using HyperDrive

Steps for creating the HyperDrive experiment:

1. Create Cluster
2. Download data
3. Click history generation
4. Register data
5. Build `train.py` entry point for hyperdrive
6. Define hyperdrive config object, using `train.py` as entry point
7. Submit hyperdrive for execution
8. Register the best performing model

Check the Azure ML SDK version.

In [1]:
# %load_ext lab_black

import logging
import os
import csv
import shutil

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources
from tqdm import tqdm
from utils import get_click_histories

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core import ScriptRunConfig, Environment

from azureml.widgets import RunDetails
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import (
    BayesianParameterSampling,
    RandomParameterSampling,
)
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice, loguniform

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.19.0


## Initialize workspace
Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.

In [2]:
# set up workspace
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

# create experiment
experiment_name = 'click-through-rate-prediction'
experiment = Experiment(ws, experiment_name)

Workspace name: quick-starts-ws-134610
Azure region: southcentralus
Subscription id: 3d1a56d2-7c81-4118-9790-f85d1acf0c77
Resource group: aml-quickstarts-134610
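
For readers who have not seen the file that `Workspace.from_config()` reads, the following is a minimal sketch of a `config.json`, assuming the three standard keys used by the Azure ML SDK. The helper name `build_workspace_config` and the placeholder values are illustrative and not part of this notebook.

```python
import json
import tempfile
from pathlib import Path


def build_workspace_config(subscription_id, resource_group, workspace_name):
    """Return the JSON text of a config.json for Workspace.from_config()."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    return json.dumps(config, indent=4)


# Placeholder values; replace them with the details of your own workspace.
config_text = build_workspace_config(
    subscription_id="<your-subscription-id>",
    resource_group="<your-resource-group>",
    workspace_name="<your-workspace-name>",
)

# Written to a temporary directory so the sketch has no side effects;
# in practice the file usually sits next to the notebook or in a .azureml folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(config_text)
    print(config_path.read_text())
```

The file can also be downloaded from the workspace overview page in the Azure portal, which avoids typing the values by hand.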


## Data

### Overview

The dataset is from Kaggle and was originally the competition dataset for the 2020 DIGIX CTR Prediction competition. The full dataset is 6 GB and contains 36 columns (some masked) describing advertising behavior over seven consecutive days. Specifically, the dataset includes:

* **label** - Whether the customer clicked on an advertisement
* **uid** - Unique user ID after data anonymization
* **task_id** - Unique ID of an ad task
* **adv_id** - Unique ID of an ad material
* **creat_type_cd** - Unique ID of an ad creative type
* **adv_prim_id** - Advertiser ID of an ad task
* **dev_id** - Developer ID of an ad task
* **inter_typ_cd** - Display form of an ad material
* **slot_id** - Ad slot ID
* **spread_app_id** - App ID of an ad task
* **tags** - App tag of an ad task
* **app_first_class** - App level-1 category of an ad task
* **app_second_class** - App level-2 category of an ad task
* **age** - User age
* **city** - Resident city of a user
* **city_rank** - Level of the resident city of a user
* **device_name** - Phone model used by a user
* **device_size** - Size of the phone used by a user
* **career** - User occupation
* **gender** - User gender
* **net_type** - Network status when a behavior occurs
* **residence** - Resident province of a user
* **his_app_size** - App storage size
* **his_on_shelf_time** - Release time
* **app_score** - App rating score
* **emui_dev** - EMUI version
* **list_time** - Model release time
* **device_price** - Device price
* **up_life_duration** - HUAWEI ID lifecycle
* **up_membership_grade** - Service membership level
* **membership_life_duration** - Membership lifecycle
* **consume_purchase** - Paid user tag
* **communication_onlinerate** - Active time by mobile phone
* **communication_avgonline_30d** - Daily active time by mobile phone
* **indu_name** - Ad industry information
* **pt_d** - Date when a behavior occurs

The task is to predict, for each advertisement, whether the customer will click on it in the near future.

In [3]:
found = False
train_key = "CTR-Training"
test_key = "CTR-Test"
description_text = "Advertising behavior data used to predict ad click-through"

if train_key in ws.datasets.keys() and test_key in ws.datasets.keys():
    print("Found existing dataset, using")
    found = True
    train_dataset = ws.datasets[train_key]
    test_dataset = ws.datasets[test_key]

if not found:
    # Create AML Dataset and register it into Workspace
    print(f"Did not find existing dataset with key {train_key} and {test_key}, creating")

    # run the shell script to download the data
    os.system("sh fetch_dataset.sh")

    # read data in with chunks
    num_of_chunk = 0
    chunkSize = 10 ** 6
    ctr_data = pd.DataFrame()
    n = 7

    # keep rows 7,000-7,999 of each label from every chunk (at most 2,000 rows per chunk)
    for chunk in tqdm(
        pd.read_csv("data/train_data.csv", iterator=True, sep="|", chunksize=chunkSize)
    ):
        num_of_chunk += 1
        chunk_label_0 = chunk[chunk["label"] == 0]
        chunk_label_1 = chunk[chunk["label"] == 1]
        ctr_data = pd.concat([ctr_data, chunk_label_0.iloc[n * 1000 : (n + 1) * 1000, :]], axis=0)
        ctr_data = pd.concat([ctr_data, chunk_label_1.iloc[n * 1000 : (n + 1) * 1000, :]], axis=0)
        
    
    # create historical click through rate features
    time_features = [
        "uid",
        "task_id",
        "adv_id",
        "adv_prim_id",
        "spread_app_id",
    ]

    for feature in time_features:
        for time_window in range(1, 7):
            ctr_data = get_click_histories(ctr_data, feature, time_window)

    # drop duplicate rows and save the sampled data
    ctr_data.drop_duplicates().to_csv("data/ctr_data.csv", index=False)

    # remove old files
    os.system("rm data/train_data.csv")

    # upload the data to default datastore
    ds = ws.get_default_datastore()
    ds.upload(
        src_dir="./data",
        target_path="click_through_rate",
        overwrite=True,
        show_progress=True,
    )

    dataset = Dataset.Tabular.from_delimited_files(
        path=ds.path("click_through_rate/ctr_data.csv")
    )

    train_dataset, test_dataset = dataset.random_split(percentage=0.8, seed=7)

    # Register Dataset in Workspace
    train_dataset = train_dataset.register(workspace=ws, name=train_key, description=description_text)
    test_dataset = test_dataset.register(workspace=ws, name=test_key, description=description_text)

Did not find existing dataset with key CTR-Training and CTR-Test, creating


42it [03:02,  4.35s/it]


Uploading an estimated of 1 files
Uploading ./data/ctr_data.csv
Uploaded ./data/ctr_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


## Create or Attach existing AmlCompute

Use Azure ML managed compute (AmlCompute) as the remote training compute resource.

If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process.

In [4]:
cluster_name = "compute-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
    
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_DS3_v2',                                                    
        min_nodes=2,                                                   
        max_nodes=5
    )

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Found existing compute target.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 2, 'targetNodeCount': 2, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 2, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-14T01:01:47.861000+00:00', 'errors': None, 'creationTime': '2021-01-14T00:59:24.252578+00:00', 'modifiedTime': '2021-01-14T00:59:41.029133+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 2, 'maxNodeCount': 5, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS3_V2'}


## Training Job Configuration

Create a directory that will contain all the necessary code from the local machine that we will need access to on the remote resource. This includes the training script and any additional files our training script depends on.

In [5]:
script_folder = "training"

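os.makedirs(script_folder, exist_ok=True)

# move each file on its own so that one missing file does not skip the others
for file_name in ("train.py", "fetch_dataset.sh", "utils.py"):
    try:
        shutil.move(file_name, script_folder)
    except OSError:
        # the file is missing or was already moved by a previous run
        pass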

Create the environment file.

In [6]:
%%writefile conda_dependencies.yml

dependencies:
    - python=3.6.2
    - scikit-learn
    - lightgbm
    - tqdm
    - pip:
        - azureml-defaults

Writing conda_dependencies.yml


In [7]:
env_folder = "envs"

os.makedirs(env_folder, exist_ok=True)

try:
    shutil.move("conda_dependencies.yml", env_folder)
except OSError:
    # the file is missing or was already moved by a previous run
    pass

Create an environment configuration.

In [8]:
training_env = Environment.from_conda_specification(
    name="training-env", file_path="./envs/conda_dependencies.yml"
)

### Configure the ScriptRunConfig

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on.

In [9]:
src = ScriptRunConfig(source_directory=script_folder,
                      script='train.py',
                      compute_target=compute_target,
                      environment=training_env)

### Create the Hyperparameter Tuning using HyperDrive

The model used for this problem is LightGBM, a gradient boosting framework built on tree-based learning algorithms and designed for training speed, accuracy, and large data volumes. It usually performs well on non-linear tabular problems. The tuned parameters control different aspects of the model; for more information about them, see the [LightGBM parameter tuning guide](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html). Specifically,

* `max_depth`, `max_bin` and `num_leaves` control the complexity of the trees built by the model. The lower these values are, the faster the model trains
* `learning_rate` controls the shrinkage applied at each boosting step. Lower values can give a more accurate model but generally need more boosting rounds, and therefore longer training; higher values converge faster but can overshoot and generalize worse
* `boosting` selects the boosting method used during training; the choices are ["gbdt", "rf", "dart", "goss"]
* `lambda_l1` and `lambda_l2` control the L1 and L2 regularization of the model, which reduces overfitting. Higher values make overfitting less likely, but values that are too high can lower accuracy
* `path_smooth` controls the smoothing applied to tree nodes, which helps prevent overfitting on leaves with few samples. The lower this value is, the less smoothing is applied
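
HyperDrive passes each sampled value to the entry script as a command-line argument named after the key in the search space. The notebook does not show `train.py`, so the following is only a hypothetical sketch of how such a script could read the parameters listed above; the function name `parse_args` is an assumption, and the defaults are LightGBM's documented defaults rather than values taken from the author's script.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that HyperDrive passes on the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py arguments")
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--num_leaves", type=int, default=31)
    parser.add_argument(
        "--boosting", type=str, default="gbdt", choices=["gbdt", "rf", "dart", "goss"]
    )
    parser.add_argument("--max_depth", type=int, default=-1)
    parser.add_argument("--lambda_l1", type=float, default=0.0)
    parser.add_argument("--lambda_l2", type=float, default=0.0)
    parser.add_argument("--path_smooth", type=float, default=0.0)
    parser.add_argument("--max_bin", type=int, default=255)
    return parser.parse_args(argv)


# Simulate one child run; unspecified parameters fall back to the defaults.
args = parse_args(["--learning_rate", "0.2", "--num_leaves", "121", "--boosting", "gbdt"])
params = vars(args)  # a plain dict that could be unpacked into the model constructor
print(params)
```

Keeping the argument names identical to the LightGBM parameter names means the parsed dictionary can be handed to the model without any renaming.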

Define the sampling method and create the early termination policy.

In [10]:
# Specify parameter sampler
ps = BayesianParameterSampling(
    {
        '--learning_rate': uniform(0.01, 0.5),
        '--num_leaves': choice(*list(range(10, 200, 1))), 
        '--boosting': choice(["gbdt", "rf", "dart", "goss"]),
        '--max_depth': choice(*list([-1] + list(range(1, 1000)))),
        '--lambda_l1': uniform(0, 50),
        '--lambda_l2': uniform(0, 50),
        '--path_smooth': uniform(0, 50),
        '--max_bin': choice(*list(range(10, 500, 1))),
    }
)
# Specify a Policy
policy = BanditPolicy(evaluation_interval=3, slack_factor=0.1)

The early termination policy defined here is a Bandit policy, which is based on a slack factor (or slack amount) and an evaluation interval. At each evaluation interval, it terminates any run whose primary metric is not within the specified slack of the best performing run. Such a policy cancels runs whose hyperparameters lead to poor performance, which saves runtime and compute cost on runs that would not be used. Note that `policy` is not passed to `HyperDriveConfig` in the configuration below, and Azure ML does not support early termination policies with Bayesian sampling, so no run in this experiment is terminated early.
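
To make the slack factor concrete, here is a small worked calculation. It follows the rule described in the Azure ML documentation for a metric that is being maximized: a run is terminated when its best metric falls below `best / (1 + slack_factor)`. The function names are illustrative and are not part of the SDK, and the AUC values are made up for the example.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being terminated (maximize goal)."""
    return best_metric / (1 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if the run falls outside the allowed slack of the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


best_auc = 0.80      # made-up best AUC at the current evaluation interval
slack_factor = 0.1   # same value as in the BanditPolicy above

print(f"cutoff: {bandit_cutoff(best_auc, slack_factor):.4f}")
for auc in (0.70, 0.75):
    print(auc, "terminate" if should_terminate(auc, best_auc, slack_factor) else "keep")
```

With these numbers a run reporting an AUC of 0.70 would be cancelled, while one reporting 0.75 would be allowed to continue.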

### Create HyperDrive Run Config

With the parameter sampler and early termination policy defined, a `HyperDriveConfig` can be created:
* To match the metric used in the AutoML experiment, the primary metric is set to "AUC_weighted"
* Higher AUC values indicate better performance, so the primary metric goal is set to "maximize"
* `max_total_runs` and `max_concurrent_runs` control the total run budget and the concurrency. For `BayesianParameterSampling`, Azure ML recommends at least 20 runs per tuned hyperparameter; with 8 hyperparameters this gives 160 runs, which is used here as the smallest budget that meets the recommendation
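
The run budget of 160 can be reproduced with a short calculation, assuming the Azure ML guidance of 20 runs per tuned hyperparameter for Bayesian sampling. The constant and function names below are illustrative only.

```python
RUNS_PER_HYPERPARAMETER = 20  # Azure ML guidance for Bayesian sampling

# The eight arguments tuned by the parameter sampler in this notebook.
search_space_keys = [
    "--learning_rate",
    "--num_leaves",
    "--boosting",
    "--max_depth",
    "--lambda_l1",
    "--lambda_l2",
    "--path_smooth",
    "--max_bin",
]


def recommended_min_runs(num_hyperparameters):
    """Smallest max_total_runs that satisfies the 20-runs-per-parameter guidance."""
    return RUNS_PER_HYPERPARAMETER * num_hyperparameters


print(recommended_min_runs(len(search_space_keys)))  # 160
```

Adding or removing a hyperparameter from the search space changes the recommended budget by 20 runs.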

In [11]:
hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=ps, 
                                     primary_metric_name='AUC_weighted',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=160,
                                     max_concurrent_runs=10)

### Submit for execution


In [12]:
hyperdrive_run = experiment.submit(hyperdrive_config)

### Run Details

Use the `RunDetails` widget to show the different runs of the experiment.

In [13]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [14]:
hyperdrive_run.wait_for_completion(show_output=False)

{'runId': 'HD_e1f4ddb3-7a76-4a07-97b0-3690923efa71',
 'target': 'compute-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-14T01:30:58.677049Z',
 'endTimeUtc': '2021-01-14T02:16:57.516746Z',
 'error': {'error': {'code': 'UserError',
   'message': 'User errors were found in at least one of the child runs.',
   'messageParameters': {},
   'details': []},
  'time': '0001-01-01T00:00:00.000Z'},
   'message': '{\n  "error": {\n    "code": "UserError",\n    "severity": null,\n    "message": "User errors were found in at least one of the child runs.",\n    "messageFormat": null,\n    "messageParameters": {},\n    "referenceCode": null,\n    "detailsUri": null,\n    "target": null,\n    "details": [],\n    "innerError": null,\n    "debugInfo": null\n  },\n  "correlation": null,\n  "environment": null,\n  "location": null,\n  "time": "0001-01-01T00:00:00+00:00",\n  "componentName": null\n}'}],
 'properties': {'primary_metric_config': '{"name": "AUC_weighted", "goal": "maximize"}',
  '

## Best Model

Get the best model from the hyperdrive experiments and display all the properties of the model.

Save the best model from HyperDrive

In [15]:
import joblib

# Get your best run and register the model from that run.
best_run = hyperdrive_run.get_best_run_by_primary_metric()

Register the best model.

In [16]:
# register the model
model = best_run.register_model(
    model_name='click-through-rate-predictions-HDrive', 
    model_path='./outputs/model.joblib',
    tags=best_run.get_metrics()
)

Get the best model parameters.

In [17]:
# assumes the model file is the last entry returned by get_file_names()
best_run.download_file(best_run.get_file_names()[-1], output_file_path='./outputs/')
joblib.load('./outputs/model.joblib')

Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.


LGBMClassifier(boosting='gbdt', boosting_type='gbdt', class_weight=None,
               colsample_bytree=1.0, importance_type='split',
               lambda_l1=37.29947430112857, lambda_l2=37.885765652758316,
               learning_rate=0.19930802226875916, max_bin=421, max_depth=243,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=121, objective=None,
               path_smooth=38.75810723080655, random_state=None, reg_alpha=0.0,
               reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)