# Automated ML Time Series Forecasting for Orange Juice Sales data
## Introduction

In this notebook we use Azure Automated ML to forecast orange juice sales. The dataset comes from Dominick's Finer Foods and is openly available.
We start by installing and checking the required packages, then connect to an Azure ML workspace that was created beforehand.

Let us first install the Python SDK v2 for Azure ML : 

In [None]:
pip install --pre azure-ai-ml

If you would like to upgrade from an existing version, use the command below:

In [None]:
pip install --pre --upgrade azure-ai-ml

Let us now check the version of the Azure ML SDK that we installed:

In [None]:
pip show azure-ai-ml

For a complete list of the installed packages you can use the command **pip list**.

Next, we import the required libraries.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

Some later cells still import SDK v1 modules such as `azureml.core`, so we install the corresponding packages as well:

In [None]:
pip install azureml

In [None]:
pip install azureml-core

Now we need to define the subscription ID, the name of the resource group in which the Azure ML workspace was created, and the workspace name:

In [None]:
#Enter details of your AzureML workspace
subscription_id = '7567b7de-befe-40fa-b883-40bb316ee50c'
resource_group = 'DP100'
workspace = 'AzureMLLabTest'
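
Hard-coded workspace identifiers are easy to leak when a notebook is shared. As a minimal sketch, the same three values could be read from environment variables instead. The variable names `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP` and `AZURE_ML_WORKSPACE` and the helper `load_workspace_settings` are chosen here for illustration only; they are not required by the SDK.

```python
import os


def load_workspace_settings(env=None):
    """Read the workspace identifiers from a mapping (default: os.environ).

    The environment variable names are illustrative assumptions.
    """
    env = os.environ if env is None else env
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group": "AZURE_RESOURCE_GROUP",
        "workspace": "AZURE_ML_WORKSPACE",
    }
    # Fail early with a clear message if something is not configured.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Demonstration with an in-memory mapping instead of the real environment.
settings = load_workspace_settings(
    {
        "AZURE_SUBSCRIPTION_ID": "<your-subscription-id>",
        "AZURE_RESOURCE_GROUP": "<your-resource-group>",
        "AZURE_ML_WORKSPACE": "<your-workspace-name>",
    }
)
print(settings["resource_group"])
```

In the notebook, calling `load_workspace_settings()` without arguments would read the real environment, and the returned values could be assigned to `subscription_id`, `resource_group` and `workspace`.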

We use the subscription_id, resource_group and workspace variables defined above to connect to the Azure ML workspace via SDK v2. The MLClient constructor creates an Azure ML client object, which we use later to create a compute cluster in that workspace:

In [None]:
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

Once the connection is established, we create a compute resource for training. The try/except block first checks whether a compute cluster with the given name already exists; if it does not, a new one is created. Both calls go through the ``` ml_client ``` object that we created in the previous cell, which provides the connection to our workspace.

The ``` AmlCompute() ``` constructor only describes the cluster locally. The method ```begin_create_or_update()``` then sends that definition to the workspace and starts provisioning the cluster:


```
    compute = AmlCompute(
        name=cpu_compute_target, size="STANDARD_D2_V2", min_instances=0, max_instances=4
    )
```

In [None]:
from azure.ai.ml.entities import AmlCompute

# specify aml compute name.
cpu_compute_target = "cpu-cluster"

try:
    ml_client.compute.get(cpu_compute_target)
except Exception:
    print("Creating a new cpu compute target...")
    compute = AmlCompute(
        name=cpu_compute_target, size="STANDARD_D2_V2", min_instances=0, max_instances=4
    )
    ml_client.compute.begin_create_or_update(compute)

In [None]:
import json
import logging

import azureml.core
import pandas as pd

## Data assets
We will be using an open dataset of orange juice sales from Dominick's stores. It records the sales of different orange juice brands across different stores. We will predict future orange juice sales from this historical data.

In [None]:
time_column_name = "WeekStarting"
data = pd.read_csv("dominicks_OJ_original.csv", parse_dates=[time_column_name])

In [None]:
# NOTE: generic FeaturizationConfig example (SDK v1). The column names below
# ('aspiration', 'engine-size', ...) are not columns of the orange juice data;
# the configuration for this dataset is in the Featurization section.
from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
featurization_config.blocked_transformers = ['LabelEncoder']
featurization_config.drop_columns = ['aspiration', 'stroke']
featurization_config.add_column_purpose('engine-size', 'Numeric')
featurization_config.add_column_purpose('body-style', 'CategoricalHash')
# default strategy is mean; set transformer params for 3 columns
featurization_config.add_transformer_params('Imputer', ['engine-size'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['city-mpg'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['bore'], {"strategy": "most_frequent"})
featurization_config.add_transformer_params('HashOneHotEncoder', [], {"number_of_bits": 3})

In [None]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./"
)

# Validation MLTable defined locally, with local data to be uploaded
my_validation_data_input = Input(
    type=AssetTypes.MLTABLE, path="./"
)


Let us first look at a sample of the dataset. The first column, ``` WeekStarting ```, holds the time stamp. Other columns describe demographic information about the customers, such as ```Age```. The ``` Advert ``` column is a flag indicating whether there was a marketing campaign for that row.

In [None]:
data.head()

The data contains a column *Quantity* that shows how much orange juice was sold per *Store* for a given *Brand*. There is also a column called *logQuantity*, which is the natural logarithm of the *Quantity* column. Because it is computed directly from the target, it leaks the target into the features, so we need to remove it from the dataset before training.
For this, we use the ``` drop ``` method and specify with ```axis = 1 ``` that we want to delete a column. The argument ``` inplace = True ``` applies the operation in place, i.e. it modifies the existing dataframe instead of returning a copy.

In [None]:
# Drop the columns 'logQuantity' as it is a leaky feature.
data.drop("logQuantity", axis=1, inplace=True)
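
To make the leak concrete, the following sketch uses a small made-up frame (the quantities are illustrative, not taken from the dataset) and shows that the target can be recovered exactly from its logarithm, which is why a model given `logQuantity` would look deceptively accurate.

```python
import numpy as np
import pandas as pd

# Toy data: the values are invented for illustration only.
toy = pd.DataFrame({"Quantity": [8256, 6144, 3840]})
toy["logQuantity"] = np.log(toy["Quantity"])

# Exponentiating the "feature" reproduces the target up to floating-point error.
recovered = np.exp(toy["logQuantity"])
is_leak = bool(np.allclose(recovered, toy["Quantity"]))
print(is_leak)
```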

Let's see the resulting dataframe's sample : 

In [None]:
data.head()

Each combination of store and brand forms its own time series. This means that we need to specify the ID columns, *Store* and *Brand*, whose combination uniquely identifies each time series.

In [None]:
time_series_id_column_names = ["Store", "Brand"]
nseries = data.groupby(time_series_id_column_names).ngroups
print("Data contains {0} individual time-series.".format(nseries))

In [None]:
use_stores = [2, 5, 8]
data_subset = data[data.Store.isin(use_stores)]
nseries = data_subset.groupby(time_series_id_column_names).ngroups
print("Data subset contains {0} individual time-series.".format(nseries))

Since our data has a time component, the train-test split must respect time order: the test set has to come after the training set. For each time series, we hold out the last 20 periods as test data and train on everything before them.

In [None]:
n_test_periods = 20


def split_last_n_by_series_id(df, n):
    """Group df by series identifiers and split on last n rows for each group."""
    df_grouped = df.sort_values(time_column_name).groupby(  # Sort by ascending time
        time_series_id_column_names, group_keys=False
    )
    df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])
    df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])
    return df_head, df_tail


train, test = split_last_n_by_series_id(data_subset, n_test_periods)
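
The same kind of split can be sketched without `groupby.apply`, which also makes it easy to check that no test row precedes a training row of the same series. The toy frame, the helper name `split_last_n` and the brand and store values below are illustrative assumptions; only the column names follow the notebook.

```python
import pandas as pd

time_col = "WeekStarting"
id_cols = ["Store", "Brand"]

# Two toy series with five weekly observations each (made-up values).
weeks = pd.date_range("1990-06-14", periods=5, freq="7D")
toy = pd.DataFrame(
    {
        "WeekStarting": list(weeks) * 2,
        "Store": [2] * 5 + [5] * 5,
        "Brand": ["dominicks"] * 5 + ["tropicana"] * 5,
        "Quantity": list(range(10)),
    }
)


def split_last_n(df, n):
    """Hold out the last n rows (by time) of every series as test data."""
    ordered = df.sort_values(time_col)
    # Position of each row counted from the end of its own series (0 = latest).
    rank_from_end = ordered.groupby(id_cols).cumcount(ascending=False)
    return ordered[rank_from_end >= n], ordered[rank_from_end < n]


toy_train, toy_test = split_last_n(toy, 2)

# Sanity check: within every series, training ends before testing starts.
last_train = toy_train.groupby(id_cols)[time_col].max()
first_test = toy_test.groupby(id_cols)[time_col].min()
no_overlap = bool((last_train < first_test).all())
print(len(toy_train), len(toy_test), no_overlap)
```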

## Featurization ##

We now need to specify the target column and will do some data featurization.

In [None]:
target_column_name = "Quantity"

In [None]:
## Below is what we would do for featurization in SDK v1, but this will have to be changed to SDK v2

In [None]:
## The below code is using FeaturizationConfig class from SDK v1 which seems to be deprecated
## or the package is no longer same as in v1
from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# Force the CPWVOL5 feature to be numeric type.
featurization_config.add_column_purpose("CPWVOL5", "Numeric")
# Fill missing values in the target column, Quantity, with zeros.
featurization_config.add_transformer_params(
    "Imputer", ["Quantity"], {"strategy": "constant", "fill_value": 0}
)
# Fill missing values in the INCOME column with median value.
featurization_config.add_transformer_params(
    "Imputer", ["INCOME"], {"strategy": "median"}
)
# Fill missing values in the Price column with forward fill (last value carried forward).
featurization_config.add_transformer_params("Imputer", ["Price"], {"strategy": "ffill"})
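
Until the SDK v2 equivalent is settled, the three imputation rules above could also be applied directly in pandas before the data is written to disk. This is a sketch on a made-up frame, not the AutoML implementation: it assumes rows are already sorted by time within each series, uses a single global median for `INCOME`, and forward-fills `Price` separately per series.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "Store": [2, 2, 2, 5, 5, 5],
        "Brand": ["tropicana"] * 6,
        "Quantity": [100.0, np.nan, 120.0, 80.0, 90.0, np.nan],
        "INCOME": [10.5, 10.5, np.nan, 10.9, np.nan, 10.9],
        "Price": [2.5, np.nan, 2.7, 3.0, np.nan, np.nan],
    }
)

prepared = toy.copy()
# Missing target values become zero.
prepared["Quantity"] = prepared["Quantity"].fillna(0)
# Missing INCOME values get the median of the observed values.
prepared["INCOME"] = prepared["INCOME"].fillna(prepared["INCOME"].median())
# Missing Price values carry the last known price forward within each series.
prepared["Price"] = prepared.groupby(["Store", "Brand"])["Price"].ffill()

print(prepared)
```

Note that a forward fill cannot fill a gap at the very start of a series, so a leading missing price would remain missing.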

In [None]:
## Once the featurization is done, my plan is to upload the data as a .csv file 
## into the .\data folder and then create an ML table

## Train the algorithm using AutoML SDK


Once the compute target is created, we can define the training job. In the Azure ML Python SDK v2, we first define a ```job``` and then submit it.

To do that, we define a ```command``` that specifies the Python script used for training, the input data, the hyperparameters, the compute target and the environment.

We created the compute target earlier and now define the environment and the other components.

We start from the featurized data saved in the .\data folder and create an MLTable object, which is a series of lazily evaluated, immutable operations for loading data from the data source. No data is read from the source until the MLTable is asked to deliver it.
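
An MLTable input points at a folder that contains a file named `MLTable` describing how to read the data. The sketch below writes such a file next to a CSV. The helper name `write_mltable` and the file name `train.csv` are hypothetical, and the YAML layout follows the pattern used in published Azure ML AutoML examples; it should be checked against the MLTable schema for the SDK version in use.

```python
import tempfile
from pathlib import Path

# Minimal MLTable definition for a single comma-delimited file (assumed layout).
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{csv_name}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
"""


def write_mltable(folder, csv_name):
    """Create `folder` if needed and write an MLTable file that reads csv_name."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    mltable_path = folder / "MLTable"
    mltable_path.write_text(MLTABLE_TEMPLATE.format(csv_name=csv_name))
    return mltable_path


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = write_mltable(Path(tmp) / "data", "train.csv")
    content = path.read_text()

print(content)
```

In the notebook this would be preceded by something like `train.to_csv("./data/train.csv", index=False)` and followed by `write_mltable("./data", "train.csv")`, so that the `./data/` folder can be passed as the MLTable input.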

In [2]:
# Imports
from azure.ai.ml import automl, Input, MLClient

from azure.ai.ml.constants import AssetTypes

my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/"
)


In [None]:
## This is not working, TODO
### forecast_job = ForecastingJob(primary_metric=primary_metric, forecasting_settings=forecasting_settings, **kwargs)

# Create the AutoML forecasting job with the related factory function.
# Note: exp_name must be defined before this cell runs; it is not set in any cell above.

forecasting_job = automl.forecasting(
    compute=cpu_compute_target,
    # name="dpv2-forecasting-job-02",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    # validation_data = my_validation_data_input,
    target_column_name="Quantity",
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations=3,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Signature of the factory function, for reference (not executable code;
# Input and ForecastingSettings are abbreviated from their full module paths):
# automl.forecasting(training_data: Input, target_column_name: str,
#     primary_metric: str = None, enable_model_explainability: bool = None,
#     weight_column_name: str = None, validation_data: Input = None,
#     validation_data_size: float = None, n_cross_validations: Union[str, int] = None,
#     cv_split_column_names: List[str] = None, test_data: Input = None,
#     test_data_size: float = None, forecasting_settings: ForecastingSettings = None, **kwargs)

# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(forecasting_job)  
returned_job

## The below code is an alternative that I would like to use instead of the above approach, by using command

#job = command(
#    code="./src",  # local path where the code is stored
#    command="ls ${{inputs.input_data}}",
#    inputs=my_job_inputs,
#    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9",
#    compute="cpu-cluster",
#)

## submit the command
#returned_job = ml_client.jobs.create_or_update(job)

In [None]:
## All the below is TODO in SDK v2

## Retrieving forecasts from the model

The helper module run_forecast provides a function, run_remote_inference, that submits the test data to the best model found during the training run and retrieves the forecasts. This function uses a helper script, forecasting_script, which is uploaded and executed on the remote compute.

To produce predictions on the test set, we need to know the feature values at all dates in the test set. This requirement is reasonable for the OJ sales data, since the features consist mainly of price, which is usually set in advance, and customer demographics, which are approximately constant for each store over the 20-week forecast horizon of the test data.

In [None]:
from run_forecast import run_remote_inference

remote_run_infer = run_remote_inference(
    test_experiment=test_experiment,
    compute_target=compute_target,
    train_run=best_run,
    test_dataset=test_dataset,
    target_column_name=target_column_name,
)
remote_run_infer.wait_for_completion(show_output=False)

# download the forecast file to the local machine
remote_run_infer.download_file("outputs/predictions.csv", "predictions.csv")

## Evaluate



In [None]:
from azureml.automl.core.shared import constants
from azureml.automl.runtime.shared.score import scoring
from matplotlib import pyplot as plt

# load forecast data frame
fcst_df = pd.read_csv("predictions.csv", parse_dates=[time_column_name])
fcst_df.head()

# use automl scoring module
scores = scoring.score_regression(
    y_test=fcst_df[target_column_name],
    y_pred=fcst_df["predicted"],
    metrics=list(constants.Metric.SCALAR_REGRESSION_SET),
)

print("[Test data scores]\n")
for key, value in scores.items():
    print("{}:   {:.3f}".format(key, value))

# Plot outputs
%matplotlib inline
test_pred = plt.scatter(fcst_df[target_column_name], fcst_df["predicted"], color="b")
test_test = plt.scatter(
    fcst_df[target_column_name], fcst_df[target_column_name], color="g"
)
plt.legend(
    (test_pred, test_test), ("prediction", "truth"), loc="upper left", fontsize=8
)
plt.show()
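
The scoring cell above depends on SDK v1 modules. As a dependency-free sketch, a few of the same kinds of metrics can be computed by hand. The function name `regression_scores` is hypothetical, and normalizing the RMSE by the range of the actual values is an assumption about the metric definition; AutoML's own normalization for forecasting may differ, for example by being applied per series.

```python
import math


def regression_scores(y_true, y_pred):
    """Return MAE, RMSE and range-normalized RMSE for two equal-length sequences."""
    errors = [pred - actual for actual, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # Assumed normalization: divide by the spread of the actual values.
    value_range = max(y_true) - min(y_true)
    nrmse = rmse / value_range if value_range else float("nan")
    return {"mae": mae, "rmse": rmse, "nrmse": nrmse}


# Toy values, invented for illustration only.
toy_scores = regression_scores([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
for key, value in toy_scores.items():
    print("{}:   {:.3f}".format(key, value))
```

With the forecast frame from the notebook, the call would be `regression_scores(fcst_df[target_column_name].tolist(), fcst_df["predicted"].tolist())`.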