# Hyperparameter Tuning using HyperDrive

In this notebook, we use `HyperDrive` to tune hyperparameters, train, select, and operationalize a time-series forecasting model that forecasts daily sales for the next 28 days of Walmart hobbies products in Texas.

The model tuned by HyperDrive is LightGBM, a gradient boosting framework frequently used in winning Kaggle competition entries.

The dataset is a subset of the data made available for [Kaggle's M5 Forecasting - Accuracy competition](https://www.kaggle.com/c/m5-forecasting-accuracy/data) and is available on [GitHub](https://github.com/dpbac/Forecasting-Walmart-sales-with-Azure/blob/master/data/walmart_tx_stores_10_items_with_day.csv).

More details about the dataset are given in the [Dataset](#Dataset) section.


In [1]:
import pandas as pd
import numpy as np
import os
import sys
import json
import azureml
import requests 

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.core import ScriptRunConfig

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.estimator import Estimator

from azureml.core.dataset import Dataset
from azureml.widgets import RunDetails
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.sampling import BayesianParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, quniform, choice

from azureml.core.runconfig import RunConfiguration
from azureml.core.runconfig import EnvironmentDefinition
from azureml.core.runconfig import CondaDependencies

from azureml.core.model import Model

from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig


from azureml.train.automl import constants

import warnings
warnings.filterwarnings("ignore")

from train import *

# Check system and core SDK version number
print("System version: {}".format(sys.version))
print("Azure ML SDK version:", azureml.core.VERSION)

System version: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
[GCC 7.3.0]
Azure ML SDK version: 1.20.0


# Initialize workspace and create an Azure ML experiment

To start, we need to initialize our workspace and create an Azure ML experiment. Keep in mind that accessing the Azure ML workspace requires authentication with Azure.

Make sure the config file is present at `.\config.json`. This file can be downloaded from the home page of Azure Machine Learning studio.
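
As a quick sanity check before calling `Workspace.from_config()`, the sketch below verifies that a config file contains the three keys the SDK expects (`subscription_id`, `resource_group`, `workspace_name`). The helper name `check_workspace_config` and the placeholder values are illustrative assumptions and are not part of the Azure ML SDK.

```python
import json
import os
import tempfile

# Keys that Workspace.from_config() reads from config.json
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path):
    """Return the required keys that are missing or empty (hypothetical helper)."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with placeholder values written to a temporary file
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    with open(config_path, "w") as f:
        json.dump(example, f)
    missing = check_workspace_config(config_path)

print("Missing keys:", missing)
```

In practice you would point `check_workspace_config` at your own `config.json` instead of the temporary file.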

In [2]:
#Define the workspace
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-138011
aml-quickstarts-138011
southcentralus
a24a24d5-8d87-4c8a-99b6-91ed2d2df51f


In [3]:
#Create an experiment
experiment_name = 'hyper-lgbm-walmart-forecasting'
experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
hyper-lgbm-walmart-forecasting,quick-starts-ws-138011,Link to Azure Machine Learning studio,Link to Documentation


In [4]:
dic_data = {'Workspace name': ws.name,
            'Azure region': ws.location,
            'Subscription id': ws.subscription_id,
            'Resource group': ws.resource_group,
            'Experiment Name': experiment.name}

df_data = pd.DataFrame.from_dict(data = dic_data, orient='index')

df_data.rename(columns={0:''}, inplace = True)
df_data

Unnamed: 0,Unnamed: 1
Workspace name,quick-starts-ws-138011
Azure region,southcentralus
Subscription id,a24a24d5-8d87-4c8a-99b6-91ed2d2df51f
Resource group,aml-quickstarts-138011
Experiment Name,hyper-lgbm-walmart-forecasting


# Create or Attach an AmlCompute cluster

In [5]:
# Define CPU cluster name
compute_target_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=compute_target_name)
    print("Found existing cpu-cluster. Use it.")
except ComputeTargetException:
    # Specify the configuration for the new cluster
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS12_V2",
                                                           min_nodes=1, # when inactive
                                                           max_nodes=4) # when busy
    # Create the cluster with the specified name and configuration
    compute_target = ComputeTarget.create(ws, compute_target_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# For a more detailed view of current AmlCompute status, use get_status()
print(compute_target.get_status().serialize())

Found existing cpu-cluster. Use it.

Running
{'errors': [], 'creationTime': '2021-02-07T19:51:40.730944+00:00', 'createdBy': {'userObjectId': '6349a418-5489-4b5c-99f7-2cf72c94fcdd', 'userTenantId': '660b3398-b80e-49d2-bc5b-ac1dc93b5254', 'userName': None}, 'modifiedTime': '2021-02-07T19:54:11.991799+00:00', 'state': 'Running', 'vmSize': 'STANDARD_DS12_V2'}


# Configure Docker environment

The remote compute will need to create a [Docker image](https://docs.docker.com/get-started/) for running the script. The Docker image is an encapsulated environment with necessary dependencies installed. In the following cell, we specify the conda packages and Python version that are needed for running the script.

In [6]:
env = EnvironmentDefinition()
env.python.user_managed_dependencies = False
env.python.conda_dependencies = CondaDependencies.create(
    conda_packages=["pandas", "numpy", "scipy", "scikit-learn", "lightgbm", "joblib"],
    python_version="3.6.2",
)
env.python.conda_dependencies.add_channel("conda-forge")
env.docker.enabled = True

# Dataset

## Overview

The dataset used in this project is a small subset of a much bigger dataset made available at Kaggle's competition [M5 Forecasting - Accuracy Estimate the unit sales of Walmart retail goods](https://www.kaggle.com/c/m5-forecasting-accuracy/overview/description).

The complete dataset covers stores in three US States (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. **The task is to forecast daily sales for the next 28 days.**

To demonstrate the use of Azure ML in forecasting, we used the following files from the available data to create a reduced dataset with **10 products from the 3 Walmart stores in Texas**.

* **calendar.csv** - Contains information about the dates on which the products are sold.
* **sell_prices.csv** - Contains information about the price of the products sold per store and date.
* **sales_train_evaluation.csv** - Includes sales [d_1 - d_1941] (labels used for the Public leaderboard)

Details on how the new dataset was created can be seen in notebook [01-walmart_data_preparation](http://localhost:8888/notebooks/Capstone%20Project/notebooks/01-walmart_data_preparation.ipynb).


In [7]:
time_column_name = 'date'
data = pd.read_csv("https://raw.githubusercontent.com/dpbac/Forecasting-Walmart-sales-with-Azure/master/data/walmart_tx_stores_10_items_with_day.csv", parse_dates=[time_column_name])
data.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,day,demand,date,wm_yr_wk,event_name_1,event_type_1,event_name_2,event_type_2,snap_TX,sell_price
0,HOBBIES_2_001_TX_1_evaluation,HOBBIES_2_001,HOBBIES_2,HOBBIES,TX_1,TX,d_1,0,2011-01-29,11101,,,,,0,
1,HOBBIES_2_002_TX_1_evaluation,HOBBIES_2_002,HOBBIES_2,HOBBIES,TX_1,TX,d_1,0,2011-01-29,11101,,,,,0,1.97
2,HOBBIES_2_003_TX_1_evaluation,HOBBIES_2_003,HOBBIES_2,HOBBIES,TX_1,TX,d_1,0,2011-01-29,11101,,,,,0,
3,HOBBIES_2_004_TX_1_evaluation,HOBBIES_2_004,HOBBIES_2,HOBBIES,TX_1,TX,d_1,0,2011-01-29,11101,,,,,0,
4,HOBBIES_2_005_TX_1_evaluation,HOBBIES_2_005,HOBBIES_2,HOBBIES,TX_1,TX,d_1,0,2011-01-29,11101,,,,,0,


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58230 entries, 0 to 58229
Data columns (total 16 columns):
id              58230 non-null object
item_id         58230 non-null object
dept_id         58230 non-null object
cat_id          58230 non-null object
store_id        58230 non-null object
state_id        58230 non-null object
day             58230 non-null object
demand          58230 non-null int64
date            58230 non-null datetime64[ns]
wm_yr_wk        58230 non-null int64
event_name_1    4740 non-null object
event_type_1    4740 non-null object
event_name_2    120 non-null object
event_type_2    120 non-null object
snap_TX         58230 non-null int64
sell_price      52938 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(3), object(11)
memory usage: 7.1+ MB


## Prepare Data

In [9]:
forecast_horizon = 28
gap = 0

data = create_features(data,forecast_horizon)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41988 entries, 10951 to 58229
Data columns (total 42 columns):
id                          41988 non-null category
item_id                     41988 non-null category
dept_id                     41988 non-null category
cat_id                      41988 non-null category
store_id                    41988 non-null category
state_id                    41988 non-null category
day                         41988 non-null category
demand                      41988 non-null int64
date                        41988 non-null datetime64[ns]
wm_yr_wk                    41988 non-null int64
event_name_1                41988 non-null category
event_type_1                41988 non-null category
event_name_2                41988 non-null category
event_type_2                41988 non-null category
snap_TX                     41988 non-null int64
sell_price                  41988 non-null float64
lag_t28                     41988 non-null float64
lag_t29 

In [11]:
# Create a training/testing split

df_train, df_test = split_train_test(data,forecast_horizon, gap)

# Separate features and labels
    
X_train=df_train.drop(['demand'],axis=1)
y_train=df_train['demand']
X_test=df_test.drop(['demand'],axis=1)
y_test=df_test['demand']
    
X_train.drop(columns='date',inplace=True)
X_test.drop(columns='date',inplace=True)

First day training dataset:2012-01-29 00:00:00
Last day training dataset:2016-04-24 00:00:00
First day test dataset:2016-04-25 00:00:00
Last day test dataset:2016-05-22 00:00:00
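
The `split_train_test` helper is imported from `train.py`, which is not shown in this notebook. The printed dates are consistent with holding out the final `forecast_horizon` days as the test set. The sketch below reproduces only the date arithmetic under that assumption; the function name `split_boundaries` and the interpretation of `gap` as a number of skipped days are hypothetical.

```python
from datetime import date, timedelta


def split_boundaries(last_day, forecast_horizon, gap=0):
    """Return (last training day, first test day) for a hold-out of the final days.

    Assumes the test set is the last `forecast_horizon` days and that `gap`
    days are skipped between the end of training and the start of testing.
    """
    first_test_day = last_day - timedelta(days=forecast_horizon - 1)
    last_train_day = first_test_day - timedelta(days=gap + 1)
    return last_train_day, first_test_day


last_train_day, first_test_day = split_boundaries(
    date(2016, 5, 22), forecast_horizon=28, gap=0
)
print(last_train_day, first_test_day)  # 2016-04-24 2016-04-25
```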


## Upload Data to Datastore

In [12]:
# save data locally
    
path_data = './data_walmart_tx.csv'
path_train = './train.csv'
path_test = './test.csv'

data.to_csv(path_data, index = None, header=True)
df_train.to_csv(path_train, index = None, header=True)
df_test.to_csv(path_test, index = None, header=True)

datastore = ws.get_default_datastore()
datastore.upload_files(files = ['./data_walmart_tx.csv','./train.csv', './test.csv'], 
                       target_path = 'dataset/', 
                       overwrite = True,
                       show_progress = True)

Uploading an estimated of 3 files
Uploading ./test.csv
Uploaded ./test.csv, 1 files out of an estimated total of 3
Uploading ./data_walmart_tx.csv
Uploaded ./data_walmart_tx.csv, 2 files out of an estimated total of 3
Uploading ./train.csv
Uploaded ./train.csv, 3 files out of an estimated total of 3
Uploaded 3 files


$AZUREML_DATAREFERENCE_63ba62fe8f204e7287f25eab12bd3b21

In [13]:
print(
    "Datastore type: " + datastore.datastore_type,
    "Account name: " + datastore.account_name,
    "Container name: " + datastore.container_name,
    sep="\n",
)

Datastore type: AzureBlob
Account name: mlstrg138011
Container name: azureml-blobstore-28ffa382-9ab2-454c-a434-008fe23a8960


In [14]:
# Get data reference object for the data path
ds_data = datastore.path('dataset/')
print(ds_data)

$AZUREML_DATAREFERENCE_bab57158ffa74fa691335474ffe39e7b


In [15]:
type(ds_data.as_mount())

azureml.data.data_reference.DataReference

In [16]:
from azureml.core.dataset import Dataset

df_temp = Dataset.Tabular.from_delimited_files(path=datastore.path('dataset/train.csv'))
df_temp = df_temp.to_pandas_dataframe()

In [17]:
type(datastore.path('dataset/train.csv'))

azureml.data.data_reference.DataReference

In [18]:
df_temp.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,day,demand,date,wm_yr_wk,...,day_of_week,week,month,year,is_month_start,is_month_end,is_weekend,lag_revenue_t1,rolling_revenue_std_t28,rolling_revenue_mean_t28
0,HOBBIES_2_002_TX_1_evaluation,HOBBIES_2_002,HOBBIES_2,HOBBIES,TX_1,TX,d_366,2,2012-01-29,11201,...,6,4,1,2012,0,0,1,0.0,1.86,0.63
1,HOBBIES_2_007_TX_1_evaluation,HOBBIES_2_007,HOBBIES_2,HOBBIES,TX_1,TX,d_366,0,2012-01-29,11201,...,6,4,1,2012,0,0,1,0.0,0.31,0.1
2,HOBBIES_2_009_TX_1_evaluation,HOBBIES_2_009,HOBBIES_2,HOBBIES,TX_1,TX,d_366,0,2012-01-29,11201,...,6,4,1,2012,0,0,1,0.0,7.39,3.39
3,HOBBIES_2_001_TX_2_evaluation,HOBBIES_2_001,HOBBIES_2,HOBBIES,TX_2,TX,d_366,0,2012-01-29,11201,...,6,4,1,2012,0,0,1,0.0,1.72,0.59
4,HOBBIES_2_002_TX_2_evaluation,HOBBIES_2_002,HOBBIES_2,HOBBIES,TX_2,TX,d_366,0,2012-01-29,11201,...,6,4,1,2012,0,0,1,0.0,2.76,2.04


In [19]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41148 entries, 0 to 41147
Data columns (total 42 columns):
id                          41148 non-null object
item_id                     41148 non-null object
dept_id                     41148 non-null object
cat_id                      41148 non-null object
store_id                    41148 non-null object
state_id                    41148 non-null object
day                         41148 non-null object
demand                      41148 non-null int64
date                        41148 non-null datetime64[ns]
wm_yr_wk                    41148 non-null int64
event_name_1                41148 non-null object
event_type_1                41148 non-null object
event_name_2                41148 non-null object
event_type_2                41148 non-null object
snap_TX                     41148 non-null int64
sell_price                  41148 non-null float64
lag_t28                     41148 non-null float64
lag_t29                     41148 

In [20]:
del df_temp

# Hyperdrive Configuration

## Tune Hyperparameters using HyperDrive

The following code tunes the hyperparameters of the LightGBM forecast model.

The parameter ranges for LightGBM were chosen based on the parameter tuning guides for different scenarios provided [here](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html).

The code below runs a parallel search of the hyperparameter space using `Bayesian sampling`, which does not support early termination policies. Therefore, we set `policy=None`.

For Bayesian sampling, the recommended maximum number of runs is at least 20 times the number of hyperparameters being tuned. With the 7 hyperparameters tuned here, that gives a recommended value of 140. To reduce the running time, we set the maximum number of HyperDrive child runs, `max_total_runs`, to `20`.
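
The value of 140 follows directly from the rule of thumb applied to the seven hyperparameters tuned in this notebook. A minimal sketch of the arithmetic (the variable names are illustrative):

```python
# Names of the hyperparameters tuned in this notebook
tuned_hyperparameters = [
    "--num-leaves",
    "--min-data-in-leaf",
    "--learning-rate",
    "--feature-fraction",
    "--bagging-fraction",
    "--bagging-freq",
    "--max-rounds",
]

RUNS_PER_HYPERPARAMETER = 20  # rule of thumb for Bayesian sampling

recommended_max_total_runs = RUNS_PER_HYPERPARAMETER * len(tuned_hyperparameters)
print(recommended_max_total_runs)  # 140
```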

To compare the performance of HyperDrive with that of AutoML, we chose `root_mean_squared_error` as the LightGBM [objective metric](https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective) and used the fact that `normalized_root_mean_squared_error` is the `root_mean_squared_error` divided by the range of the data. For more information, see this [link](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#metric-normalization).
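
The normalization described above can be sketched in plain Python. How `train.py` actually computes the logged `NRMSE` is not shown in this notebook, so the function below is only an illustration of the definition (RMSE divided by the range of the true values), and the toy numbers are made up for the example.

```python
import math


def nrmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the true values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("range of y_true is zero; NRMSE is undefined")
    return math.sqrt(mse) / value_range


# Toy values for illustration only (not taken from the Walmart data)
y_true = [0, 2, 4, 6]
y_pred = [1, 2, 3, 6]
print(nrmse(y_true, y_pred))
```

Because the result is scaled by the range of the target, values from differently scaled series become comparable, which is what makes the comparison with AutoML's normalized metric possible.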


In [21]:
# Increase this value if you want to achieve better performance
max_total_runs = 20


est = Estimator( 
    source_directory='./', # directory containing experiment configuration files (train.py)
    compute_target=compute_target, # compute target where training will happen
    entry_script='train.py',
    use_docker=True,
    script_params={"--data-folder": ds_data.as_mount()},
    environment_definition=env, #remove if there is an error
)



# Specify hyperparameter space
param_sampling = BayesianParameterSampling(
    {
        "--num-leaves": quniform(8, 128, 1),
        "--min-data-in-leaf": quniform(20, 500, 10),
        "--learning-rate": choice(
            1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1
        ),
        "--feature-fraction": uniform(0.2, 1),
        "--bagging-fraction": uniform(0.1, 1),
        "--bagging-freq": quniform(1, 20, 1),
        "--max-rounds": quniform(50, 2000, 10),
    }
)

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.

hyperdrive_config = HyperDriveConfig(
    estimator=est,
    hyperparameter_sampling=param_sampling,
    primary_metric_name='NRMSE',# normalized_root_mean_squared_error
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=max_total_runs, 
    max_concurrent_runs=4,
    policy=None, #Bayesian sampling does not support early termination policies.
)

'Estimator' is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or an Azure ML curated environment.
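
The search space above uses `uniform`, `quniform`, and `choice`. According to the Azure ML documentation, `quniform(low, high, q)` returns a value like `round(uniform(low, high) / q) * q`. The sketch below imitates these value ranges with the standard `random` module; it is not how HyperDrive's Bayesian sampler picks points (that depends on previous runs), and `sample_quniform` is a hypothetical helper.

```python
import random


def sample_quniform(low, high, q, rng):
    """Mimic the documented quniform behaviour: round(uniform(low, high) / q) * q."""
    return round(rng.uniform(low, high) / q) * q


rng = random.Random(0)  # fixed seed so the illustration is repeatable

learning_rates = [1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1]

# One illustrative draw from the same ranges as the notebook's search space
sample = {
    "--num-leaves": sample_quniform(8, 128, 1, rng),
    "--min-data-in-leaf": sample_quniform(20, 500, 10, rng),
    "--learning-rate": rng.choice(learning_rates),
    "--feature-fraction": rng.uniform(0.2, 1),
    "--bagging-fraction": rng.uniform(0.1, 1),
    "--bagging-freq": sample_quniform(1, 20, 1, rng),
    "--max-rounds": sample_quniform(50, 2000, 10, rng),
}
print(sample)
```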


In [22]:
# Submit hyperdrive run to the experiment 

hyperdrive_run = experiment.submit(config = hyperdrive_config)



## Run Details

With the `RunDetails` widget we can monitor the progress and metrics of the HyperDrive child runs.

In [23]:
# Show run details with the Jupyter widget
RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=True)
hyperdrive_run.get_metrics()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7
Web View: https://ml.azure.com/experiments/hyper-lgbm-walmart-forecasting/runs/HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7?wsid=/subscriptions/a24a24d5-8d87-4c8a-99b6-91ed2d2df51f/resourcegroups/aml-quickstarts-138011/workspaces/quick-starts-ws-138011

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-02-07T19:57:32.826539][API][INFO]Experiment created<END>\n""<START>[2021-02-07T19:57:33.629193][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2021-02-07T19:57:33.786704][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2021-02-07T19:57:34.5731848Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>

Execution Summary
RunId: HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7
Web View: https://ml.azure.com/experiments/hyper-lgbm-walmart-forecasting/runs/HD_826991c6-7a1f-4ea8-bd65-9483fae0

{'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_19': {'NRMSE': 0.14313439356177318},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_18': {'NRMSE': 0.14313439356177318},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_17': {'NRMSE': 0.14313439356177318},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_15': {'NRMSE': 0.143029112484634},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_16': {'NRMSE': 0.14461324476308365},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_14': {'NRMSE': 0.14501297230442048},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_13': {'NRMSE': 0.14382278386001782},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_12': {'NRMSE': 0.1446882366482226},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_11': {'NRMSE': 0.1433650124338112},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_10': {'NRMSE': 0.14390537589244803},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_9': {'NRMSE': 0.1448583286194304},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_8': {'NRMSE': 0.14331369266599964},
 'HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_7': {'NRMSE': 0.1

## Retrieve and Save Best Model

Here we retrieve and save the best model as well as display all the properties of the model.
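
Because the primary metric goal is `MINIMIZE`, the best run is the child run with the lowest `NRMSE`. The sketch below illustrates that selection in plain Python on three of the metric values reported by the run; it is not the SDK's implementation of `get_best_run_by_primary_metric()`.

```python
# Three of the child-run metrics reported by the HyperDrive run (subset, for illustration)
child_run_metrics = {
    "HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_19": {"NRMSE": 0.14313439356177318},
    "HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_15": {"NRMSE": 0.143029112484634},
    "HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_16": {"NRMSE": 0.14461324476308365},
}

# With a MINIMIZE goal, the best run is the one with the smallest primary metric
best_run_id = min(child_run_metrics, key=lambda run_id: child_run_metrics[run_id]["NRMSE"])
print(best_run_id)
```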

In [24]:
# Retrieve the best model and its hyperparameter values

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()["runDefinition"]["arguments"]


print('Best Run Id: ', best_run.id)
print('NRMSE:', best_run_metrics['NRMSE'])
print('Best model hyperparameter values', parameter_values)


Best Run Id:  HD_826991c6-7a1f-4ea8-bd65-9483fae0ebc7_15
NRMSE: 0.143029112484634
Best model hyperparameter values ['--data-folder', '$AZUREML_DATAREFERENCE_bab57158ffa74fa691335474ffe39e7b', '--num-leaves', '71', '--min-data-in-leaf', '500', '--learning-rate', '0.02', '--feature-fraction', '0.4992612616035378', '--bagging-fraction', '0.9584208428388806', '--bagging-freq', '17', '--max-rounds', '1640']
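
The arguments come back as a flat list of alternating flags and string values. If a dictionary is more convenient, for example to retrain the model locally, the list can be converted as in the sketch below. The conversion to snake_case keys and numeric types is an illustrative choice, not something the SDK does.

```python
# Argument list as printed for the best run
parameter_values = [
    "--data-folder", "$AZUREML_DATAREFERENCE_bab57158ffa74fa691335474ffe39e7b",
    "--num-leaves", "71",
    "--min-data-in-leaf", "500",
    "--learning-rate", "0.02",
    "--feature-fraction", "0.4992612616035378",
    "--bagging-fraction", "0.9584208428388806",
    "--bagging-freq", "17",
    "--max-rounds", "1640",
]

# Pair each flag with the value that follows it
raw_params = dict(zip(parameter_values[0::2], parameter_values[1::2]))
raw_params.pop("--data-folder")  # a data path, not a hyperparameter

# Strip the leading dashes and convert values: whole numbers to int, the rest to float
best_params = {
    name.lstrip("-").replace("-", "_"): (int(value) if value.isdigit() else float(value))
    for name, value in raw_params.items()
}
print(best_params)
```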


In [25]:
best_run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_12d9a62dd9d0276ac4b9a4604b318acdf140fe93395e087b60ff2d4b1288563c_d.txt',
 'azureml-logs/65_job_prep-tvmps_12d9a62dd9d0276ac4b9a4604b318acdf140fe93395e087b60ff2d4b1288563c_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_12d9a62dd9d0276ac4b9a4604b318acdf140fe93395e087b60ff2d4b1288563c_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'logs/azureml/98_azureml.log',
 'logs/azureml/job_prep_azureml.log',
 'logs/azureml/job_release_azureml.log',
 'outputs/bst-model.pkl']

In [26]:
# Save the best model
model = best_run.register_model(
    model_name="hd_lgbm_walmart_forecast", 
    model_path="./outputs/bst-model.pkl",
    description='Best HyperDrive Walmart forecasting model'
)
print("Model successfully saved.")

Model successfully saved.


# Clean Up Cluster

In [None]:
compute_target.delete()