# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [None]:
import os
import sys
import azureml
import pandas as pd
# import numpy as np
# import logging
import joblib
# import json

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

from azureml.widgets import RunDetails
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import BayesianParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, quniform, choice

from azureml.core.model import Model




# Initialize workspace and create an Azure ML experiment

To start, we need to initialize our workspace and create an Azure ML experiment. Keep in mind that accessing the Azure ML workspace requires authentication with Azure.

Make sure the config file is present at `.\config.json`. This file can be downloaded from the home page of Azure Machine Learning Studio.
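
As a quick sanity check before calling `Workspace.from_config()`, the sketch below verifies that a config file contains the keys the SDK expects. It assumes the standard `subscription_id`, `resource_group`, and `workspace_name` keys; the helper `missing_config_keys` and the placeholder values are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile

# Keys assumed to be required in the workspace config file
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path):
    """Return the required keys that are absent or empty in the JSON file at `path`."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with placeholder values written to a temporary file
example = {"subscription_id": "<subscription-id>", "resource_group": "<resource-group>"}
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "config.json")
    with open(example_path, "w", encoding="utf-8") as handle:
        json.dump(example, handle)
    missing = missing_config_keys(example_path)

print("Missing keys:", missing)
```

In the notebook, the same helper could be pointed at `./config.json` to get a clearer error than a failed authentication attempt.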

In [None]:
#Define the workspace
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

In [None]:
#Create an experiment
experiment_name = 'hyper-lgbm-walmart-forecasting'
experiment = Experiment(ws, experiment_name)
experiment

In [None]:
dic_data = {'Workspace name': ws.name,
            'Azure region': ws.location,
            'Subscription id': ws.subscription_id,
            'Resource group': ws.resource_group,
            'Experiment Name': experiment.name}

df_data = pd.DataFrame.from_dict(data = dic_data, orient='index')

df_data.rename(columns={0:''}, inplace = True)
df_data

# Create or Attach an AmlCompute cluster

In [None]:
# Define CPU cluster name
compute_target_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=compute_target_name)
    print("Found existing cpu-cluster. Use it.")
except ComputeTargetException:
    # Specify the configuration for the new cluster
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS12_V2",
                                                           min_nodes=1, # when inactive
                                                           max_nodes=4) # when busy
    # Create the cluster with the specified name and configuration
    compute_target = ComputeTarget.create(ws, compute_target_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# For a more detailed view of current AmlCompute status, use get_status()
print(compute_target.get_status().serialize())

## Configure Docker environment

The remote compute will need to create a [Docker image](https://docs.docker.com/get-started/) for running the script. The Docker image is an encapsulated environment with necessary dependencies installed. In the following cell, we specify the conda packages and Python version that are needed for running the script.

In [None]:
# env = EnvironmentDefinition()
# env.python.user_managed_dependencies = False
# env.python.conda_dependencies = CondaDependencies.create(
#     conda_packages=["pandas", "numpy", "scipy", "scikit-learn", "lightgbm", "joblib"],
#     python_version="3.6.9",
# )
# env.python.conda_dependencies.add_channel("conda-forge")
# env.docker.enabled = True

# Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

## Overview

The dataset used in this project is a small subset of a much bigger dataset made available in the Kaggle competition [M5 Forecasting - Accuracy: Estimate the unit sales of Walmart retail goods](https://www.kaggle.com/c/m5-forecasting-accuracy/overview/description).

The complete dataset covers stores in three US States (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. **The task is to forecast daily sales for the next 28 days.**
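
Because the task is a 28-day forecast, one common way to evaluate a model is to hold out the last 28 days of the series for validation. The sketch below illustrates that idea on synthetic data; the helper `split_by_horizon` and the made-up sales values are assumptions for illustration, not the split used by the training script.

```python
from datetime import date, timedelta

HORIZON_DAYS = 28  # forecast horizon stated in the task


def split_by_horizon(rows, horizon_days=HORIZON_DAYS):
    """Split (date, value) rows so the last `horizon_days` days form the validation set."""
    last_day = max(day for day, _ in rows)
    cutoff = last_day - timedelta(days=horizon_days)
    train = [row for row in rows if row[0] <= cutoff]
    valid = [row for row in rows if row[0] > cutoff]
    return train, valid


# Synthetic daily series: 100 consecutive days with made-up sales values
start = date(2016, 1, 1)
rows = [(start + timedelta(days=i), i % 7) for i in range(100)]

train_rows, valid_rows = split_by_horizon(rows)
print(len(train_rows), len(valid_rows))
```

Splitting by date rather than at random keeps future observations out of the training set, which matters for any forecasting metric.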

To demonstrate the use of Azure ML in forecasting, we used the following files to create a reduced dataset with **10 products from the 3 Walmart stores in Texas**.

* **calendar.csv** - Contains information about the dates on which the products are sold.
* **sell_prices.csv** - Contains information about the price of the products sold per store and date.
* **sales_train_evaluation.csv** - Includes sales [d_1 - d_1941] (labels used for the Public leaderboard)

Details on how the new dataset was created can be seen in notebook [01-walmart_data_preparation](http://localhost:8888/notebooks/Capstone%20Project/notebooks/01-walmart_data_preparation.ipynb).


In [None]:
time_column_name = 'date'
data = pd.read_csv("https://raw.githubusercontent.com/dpbac/Forecasting-Walmart-sales-with-Azure/master/data/walmart_tx_stores_10_items.csv?token=AEBB67JDCI7I3NESPI2HUHTACUSVS", parse_dates=[time_column_name])
data.head()

In [None]:
data.info()

## Upload Data to Datastore

In [None]:
# # save data locally

# path_data = './data_walmart_tx.csv'
# data.to_csv(path_data, index = None, header=True)

# datastore_path = 'dataset/'

# datastore = ws.get_default_datastore()
# datastore.upload_files(files = ['./data_walmart_tx.csv'], 
#                        target_path = datastore_path, 
#                        overwrite = True,
#                        show_progress = True)

In [None]:
# print(
#     "Datastore type: " + datastore.datastore_type,
#     "Account name: " + datastore.account_name,
#     "Container name: " + datastore.container_name,
#     sep="\n",
# )

In [None]:
# # Get data reference object for the data path
# datastore_data = datastore.path(datastore_path)
# print(datastore_data)

## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

Now we are ready to tune hyperparameters of the LightGBM forecast model by launching multiple runs on the cluster. In the following cell, we define the configuration of a HyperDrive job that does a parallel search of the hyperparameter space using a Bayesian sampling method. HyperDrive also supports random sampling of the parameter space.
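
HyperDrive passes each sampled value to the training script as a command-line argument, so the script needs a matching parser. The sketch below shows one way to read the arguments used in this notebook. The function `parse_hyperparameters` and the default values are assumptions for illustration and are not taken from the actual training script. Quantized values may arrive formatted as floats (for example `64.0`), so they are parsed as floats and cast to integers.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse HyperDrive-style arguments into LightGBM parameters and a round count."""
    parser = argparse.ArgumentParser(description="LightGBM training arguments")
    # Illustrative defaults; integer-like values are read as float, then cast
    parser.add_argument("--num-leaves", type=float, default=31)
    parser.add_argument("--min-data-in-leaf", type=float, default=20)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--feature-fraction", type=float, default=1.0)
    parser.add_argument("--bagging-fraction", type=float, default=1.0)
    parser.add_argument("--bagging-freq", type=float, default=0)
    parser.add_argument("--max-rounds", type=float, default=100)
    args = parser.parse_args(argv)

    params = {
        "num_leaves": int(args.num_leaves),
        "min_data_in_leaf": int(args.min_data_in_leaf),
        "learning_rate": args.learning_rate,
        "feature_fraction": args.feature_fraction,
        "bagging_fraction": args.bagging_fraction,
        "bagging_freq": int(args.bagging_freq),
    }
    return params, int(args.max_rounds)


# Example: values formatted the way a quantized sampler might emit them
params, max_rounds = parse_hyperparameters(
    ["--num-leaves", "64.0", "--learning-rate", "0.01", "--max-rounds", "500.0"]
)
print(params, max_rounds)
```

Casting inside the script keeps the search space definition unchanged while ensuring LightGBM receives integers where it expects them.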

For best results with Bayesian sampling, the maximum number of runs should be at least 20 times the number of hyperparameters being tuned. The search space was designed with 9 hyperparameters, which calls for at least 180 runs; with `--max-lag` and `--window-size` commented out, the 7 remaining hyperparameters call for at least 140. Nevertheless, we find that Bayesian search can achieve decent performance even with a small number of runs, so `max_total_runs` (the maximum number of HyperDrive child runs) is set to `20` to reduce the running time.
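
The rule of thumb above is simple arithmetic. The sketch below applies it to the argument names used in this notebook; the helper `recommended_min_runs` is hypothetical and only illustrates the calculation.

```python
RUNS_PER_HYPERPARAMETER = 20  # rule of thumb for Bayesian sampling


def recommended_min_runs(search_space):
    """Return the suggested minimum number of runs for a list of hyperparameters."""
    return RUNS_PER_HYPERPARAMETER * len(search_space)


# Hyperparameters sampled in this notebook
tuned = [
    "--num-leaves",
    "--min-data-in-leaf",
    "--learning-rate",
    "--feature-fraction",
    "--bagging-fraction",
    "--bagging-freq",
    "--max-rounds",
]
# Hyperparameters that are commented out in the search space
optional = ["--max-lag", "--window-size"]

runs_tuned = recommended_min_runs(tuned)
runs_all = recommended_min_runs(tuned + optional)
print(runs_tuned, runs_all)
```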

In [None]:
# Increase this value if you want to achieve better performance
# max_total_runs = 20
# script_folder = './'
# script_params = {"--data-folder": datastore_data.as_mount()}
# train_script_name = "train_TEMP.py" #change to train.py when everything works fine

# Early termination policy
# Note: HyperDrive does not support early termination with Bayesian sampling;
# this policy is only usable with random or grid sampling.
early_termination_policy = BanditPolicy(
    slack_factor=0.1,       # allowable slack as a ratio
    evaluation_interval=2,  # frequency for applying the policy
    delay_evaluation=5,     # delays the first policy evaluation for a specified number of intervals
)

# Specify hyperparameter space
param_sampling = BayesianParameterSampling(
    {
        "--num-leaves": quniform(8, 128, 1),
        "--min-data-in-leaf": quniform(20, 500, 10),
        "--learning-rate": choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),
        "--feature-fraction": uniform(0.2, 1),
        "--bagging-fraction": uniform(0.1, 1),
        "--bagging-freq": quniform(1, 20, 1),
        "--max-rounds": quniform(50, 2000, 10),
#         "--max-lag": quniform(3, 40, 1),
#         "--window-size": quniform(3, 40, 1),
    }
)

# Create an estimator for use with train.py
from azureml.train.estimator import Estimator

est = Estimator(
    source_directory='./',
    compute_target=compute_target,
    entry_script='train_TEMP.py',  # change to train.py when everything works fine
)
    
    
# #     source_directory=script_folder,
# #     script_params=script_params,
#     compute_target=compute_target,
# #     use_docker=True,
#     entry_script=train_script_name,
# #     environment_definition=env,


# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.

hyperdrive_config = HyperDriveConfig(
    estimator=est,
    hyperparameter_sampling=param_sampling,  # sampler defined above
    primary_metric_name='MAE',  # mean absolute error; must match the metric name logged by the training script
#     primary_metric_name="MAPE",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
    policy=None,  # Bayesian sampling does not support early termination policies
)


In [None]:
# # TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
# early_termination_policy = <your policy here>

# #TODO: Create the different params that you will be using during training
# param_sampling = <your params here>

# #TODO: Create your estimator and hyperdrive config
# estimator = <your estimator here>

# hyperdrive_run_config = <your config here>

In [None]:
# Submit hyperdrive run to the experiment 

hyperdrive_run = experiment.submit(config=hyperdrive_config)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [None]:
# Show run details with the Jupyter widget

RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=True)
hyperdrive_run.get_metrics()

## Retrieve and Save Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [None]:
# Get your best run and save the model from that run.

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('MAE:', best_run_metrics['MAE'])  # key matches primary_metric_name in the HyperDrive config

best_run

In [None]:
parameter_values = best_run.get_details()["runDefinition"]["arguments"]
print(parameter_values)
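
The arguments of a run are typically returned as a flat list of alternating names and values. The sketch below shows one way to turn such a list into a dictionary; the helper `arguments_to_dict` and the example list are hypothetical, and the alternating name/value layout is an assumption that should be checked against the printed `parameter_values`.

```python
def arguments_to_dict(arguments):
    """Pair up a flat ['--name', 'value', ...] list into a {name: value} dict."""
    names = arguments[0::2]   # items at even positions are argument names
    values = arguments[1::2]  # items at odd positions are their values
    return {name.lstrip("-"): value for name, value in zip(names, values)}


# Made-up example in the same shape as a HyperDrive argument list
example_arguments = ["--num-leaves", "64", "--learning-rate", "0.01"]
best_params = arguments_to_dict(example_arguments)
print(best_params)
```

A dictionary like this is convenient to log next to the registered model or to reuse when retraining with the best configuration.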

In [None]:
# Register the best model in the workspace
model = best_run.register_model(
    model_name="hd_lgbm_walmart_forecast", 
    model_path="./outputs/model.pkl",
    description='Best HyperDrive Walmart forecasting model'
)
print("Model successfully registered.")

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service