# Train a forecasting model with Automated Machine Learning

There are many kinds of machine learning algorithms that you can use to train a model, and it's not always easy to determine the most effective algorithm for your particular data and prediction requirements. Additionally, you can significantly affect the predictive performance of a model by preprocessing the training data, using techniques such as normalization and missing-feature imputation. In your quest to find the best model for your requirements, you may need to try many combinations of algorithms and preprocessing transformations, which takes a lot of time and compute resources.
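To make the two preprocessing techniques named above concrete, here is a minimal sketch in plain Python. The feature values are made up for illustration, and mean imputation and min-max scaling are only one possible choice for each technique, not necessarily what automated machine learning applies.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one missing value
prices = [1.0, None, 3.0, 5.0]

imputed = impute_mean(prices)
scaled = min_max_normalize(imputed)

print(imputed)
print(scaled)
```

Every combination of such transformations and algorithms is a separate candidate to train and evaluate, which is why the search quickly becomes expensive to do by hand.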

Azure Machine Learning enables you to automate the comparison of models trained using different algorithms and preprocessing options. You can use the visual interface in [Azure Machine Learning Studio](https://ml.azure.com) or the Python SDK (v2) to use this capability. The Python SDK gives you greater control over the settings for the automated machine learning job, but the visual interface is easier to use.

## Before you start

You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Run the cell below to verify that it is installed.

In [88]:
pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.0.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py38/lib/python3.8/site-packages
Requires: azure-common, pydash, marshmallow, colorama, isodate, strictyaml, azure-storage-blob, tqdm, jsonschema, azure-storage-file-share, msrest, typing-extensions, azure-mgmt-core, pyjwt, azure-storage-file-datalake, azure-core, pyyaml
Required-by: 
Note: you may need to restart the kernel to use updated packages.


If the package is not installed, run the cell below to install it.

In [89]:
pip install --pre azure-ai-ml

Note: you may need to restart the kernel to use updated packages.


## Connect to your workspace

With the required SDK packages installed, you're ready to connect to your workspace. We will instantiate an object ```ml_client``` of the class MLClient, which lets us manage workspaces, jobs, models, and other resources.

> **Note**: When working from a notebook on an Azure Machine Learning managed compute instance, you don't have to provide the values for the subscription ID, resource group, and workspace name. The values are retrieved from the workspace you currently work in.

In [90]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azureml.core import Workspace

# get details of the current Azure ML workspace
ws = Workspace.from_config()

# default authentication flow for Azure applications
default_azure_credential = DefaultAzureCredential()
subscription_id = ws.subscription_id
resource_group = ws.resource_group
workspace = ws.name

# client class to interact with Azure ML services and resources, e.g. workspaces, jobs, models and so on.
ml_client = MLClient(
   default_azure_credential,
   subscription_id,
   resource_group,
   workspace)

print('Ready to work with {}'.format(ws.name)) 

Ready to work with azuremllabtest


## Prepare data

You don't need to create a training script for automated machine learning, but you do need to load the training data.

In this case, you'll be using Azure Automated ML in order to predict orange juice sales. The dataset we use is taken from Dominick's Finer Foods, and is available openly.

To pass a dataset as an input to an automated machine learning job, the data must be in tabular form and include a target column. For the data to be interpreted as a tabular dataset, the input dataset must be an **MLTable**.

An MLTable data asset has already been created for you during set-up. You can explore the data asset by navigating to the **Data** page. You'll retrieve the data asset here by specifying its name `oj-training-table` and version `1`.

Note that in the current version of the SDK v2, we need a .yaml file that defines our MLTable. That file has to be placed in a local folder or a remote folder in the cloud, along with the data file (CSV or Parquet).
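As a rough illustration of such a definition file, the sketch below writes a minimal one next to a CSV file. It rests on several assumptions: that the definition file is named `MLTable`, that the `paths` and `read_delimited` keys are what your SDK version expects, and that the data file is called `oj_sales.csv` (a hypothetical name). The helper `write_mltable` is also hypothetical. Check the result against the definition shipped with this lab before relying on it.

```python
import tempfile
from pathlib import Path

# Assumed minimal MLTable definition for a single delimited text file.
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{csv_name}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
"""


def write_mltable(folder, csv_name):
    """Write an MLTable definition into `folder`, pointing at `csv_name`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    mltable_path = folder / "MLTable"
    mltable_path.write_text(MLTABLE_TEMPLATE.format(csv_name=csv_name))
    return mltable_path


# A temporary folder is used here so that no existing files are overwritten.
example_folder = tempfile.mkdtemp()
definition = write_mltable(example_folder, "oj_sales.csv")
print(definition.read_text())
```

The folder containing both the definition and the data file is what gets passed as the `path` of the data asset.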

In [93]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import Input

# creates a dataset based on the files in the local data folder
#my_training_data_input = Input(type=AssetTypes.MLTABLE, path="azureml:oj-training:1")

my_training_data = Data(
    type=AssetTypes.MLTABLE,
    path="./dataraw",
    name="oj-mltable-train",
    description="Orange Juice Data Asset SDK v2",
    version="4"
    )

ml_client.data.create_or_update(my_training_data)

Failed to download MLTable metadata jsonschema from "None", skipping validation
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azure/ai/ml/operations/_data_operations.py", line 276, in _try_get_mltable_metadata_jsonschema
    return download_mltable_metadata_schema(mltable_schema_url, self._requests_pipeline)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azure/ai/ml/_utils/_data_utils.py", line 30, in download_mltable_metadata_schema
    response = requests_pipeline.get(mltable_schema_url)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azure/ai/ml/_utils/_http_utils.py", line 63, in decorated
    return self.run(request, **kwargs).http_response
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 211, in run
    return first_node.send(pipeline_request)  # type: ignore
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azure/core/pipeline/_ba

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': ['./*.csv'], 'type': 'mltable', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'oj-mltable-train', 'description': 'Orange Juice Data Asset SDK v2', 'tags': {}, 'properties': {}, 'id': '/subscriptions/7567b7de-befe-40fa-b883-40bb316ee50c/resourceGroups/dp100/providers/Microsoft.MachineLearningServices/workspaces/azuremllabtest/data/oj-mltable-train/versions/4', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/annavanyan3/code/Users/annavanyan', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fcc00e0c2b0>, 'serialize': <msrest.serialization.Serializer object at 0x7fcc00d9a4f0>, 'version': '4', 'latest_version': None, 'path': 'azureml://subscriptions/7567b7de-befe-40fa-b883-40bb316ee50c/resourcegroups/dp100/workspaces/azuremllabtest/datastores/workspaceblobstore/paths/LocalUpload/acdde790a247cbb31f1331cb3c96771a/dataraw/

### Data featurization

Although many of the raw data fields can be used directly to train a model, it's often necessary to create additional (engineered) features that provide information that better differentiates patterns in the data. This process is called feature engineering (or data featurization): domain knowledge of the data is used to create features that help machine learning algorithms learn better.

Data featurization is one of the most important parts of a data science project because it can improve the prediction results significantly. Broadly, data featurization can be defined as encoding various forms of data as numerical data that ML algorithms can consume.

In Azure Machine Learning, data-scaling and normalization techniques are applied to make feature engineering easier. Collectively, these techniques and this feature engineering are called featurization in automated ML experiments.

One example of featurization is categorical feature encoding, the process of converting categorical data into a numeric format so that the converted values can be provided to different models. There are several techniques for encoding categorical features, for example one-hot encoding, binary encoding, and hash encoding.

With AutoML in Azure Machine Learning, we can either let the data be featurized automatically or define custom featurization.

Let us first have a look at the MLTable that we created earlier by loading the MLTable artifact into a Pandas dataframe. This will give us an idea of how to featurize the data further.

We can use the code below to load the MLTable into a dataframe:
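A minimal sketch of one-hot encoding in plain Python, using brand names that appear in the orange juice data. The helper `one_hot_encode` is hypothetical and only illustrates the idea; it is not how AutoML implements its encoders.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, encoded


brands = ["dominicks", "minute.maid", "tropicana", "dominicks"]
categories, encoded = one_hot_encode(brands)

print(categories)
for brand, row in zip(brands, encoded):
    print(brand, row)
```

Each category becomes its own column, so no artificial ordering is imposed on the brands, at the cost of one extra column per distinct value.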

```
import mltable

tbl = mltable.load(uri="./my_data")
df = tbl.to_pandas_dataframe()

```

For this code to work, the ```mltable``` package must be installed.

**Note**: The ```uri``` parameter in ```mltable.load()``` should be a valid path to a local or cloud folder which contains a valid MLTable file.

In [92]:
pip install mltable

Note: you may need to restart the kernel to use updated packages.


In [94]:
import mltable

tbl = mltable.load(uri="./dataraw")
df = tbl.to_pandas_dataframe()

In [95]:
df.head()

Unnamed: 0,WeekStarting,Store,Brand,Quantity,logQuantity,Advert,Price,Age60,COLLEGE,INCOME,Hincome150,Large HH,Minorities,WorkingWoman,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
0,6/14/2018,2,dominicks,10560,9.264828557,1,1.59,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.376926613
1,6/14/2018,2,minute.maid,4480,8.407378325,0,3.17,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.376926613
2,6/14/2018,2,tropicana,8256,9.018695488,0,3.87,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.376926613
3,6/14/2018,5,dominicks,1792,7.491087594,1,1.59,0.117368032,0.32122573,10.92237097,0.535883355,0.103091585,0.053875277,0.410568032,3.801997814,0.681818182,1.600573425,0.736306837
4,6/14/2018,5,minute.maid,4224,8.348537825,0,2.99,0.117368032,0.32122573,10.92237097,0.535883355,0.103091585,0.053875277,0.410568032,3.801997814,0.681818182,1.600573425,0.736306837


Let's remove the **logQuantity** column: it is the natural logarithm of the target column **Quantity**, so keeping it would leak the target into the training features.

In [96]:
df.drop("logQuantity", axis=1, inplace=True)
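The leak can be checked directly from the five rows displayed earlier. The sketch below copies the `Quantity` and `logQuantity` values from that output and compares `logQuantity` with the natural logarithm of `Quantity`; the tolerance of `1e-6` is an arbitrary choice.

```python
import math

# (Quantity, logQuantity) pairs copied from the df.head() output
rows = [
    (10560, 9.264828557),
    (4480, 8.407378325),
    (8256, 9.018695488),
    (1792, 7.491087594),
    (4224, 8.348537825),
]

# Largest absolute difference between ln(Quantity) and logQuantity
max_diff = max(abs(math.log(quantity) - log_quantity) for quantity, log_quantity in rows)

print(f"Largest difference: {max_diff:.2e}")
print("logQuantity is ln(Quantity):", max_diff < 1e-6)
```

A model given this column could recover the target exactly, so its validation scores would say nothing about real forecasting ability.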

Let's see what the dataframe looks like now without the leaking column.

In [97]:
df.head()

Unnamed: 0,WeekStarting,Store,Brand,Quantity,Advert,Price,Age60,COLLEGE,INCOME,Hincome150,Large HH,Minorities,WorkingWoman,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
0,6/14/2018,2,dominicks,10560,1,1.59,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.376926613
1,6/14/2018,2,minute.maid,4480,0,3.17,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.376926613
2,6/14/2018,2,tropicana,8256,0,3.87,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.376926613
3,6/14/2018,5,dominicks,1792,1,1.59,0.117368032,0.32122573,10.92237097,0.535883355,0.103091585,0.053875277,0.410568032,3.801997814,0.681818182,1.600573425,0.736306837
4,6/14/2018,5,minute.maid,4224,0,2.99,0.117368032,0.32122573,10.92237097,0.535883355,0.103091585,0.053875277,0.410568032,3.801997814,0.681818182,1.600573425,0.736306837


Let's now save this as a .csv file and use it as input to our ML job.

In [98]:
df.to_csv("./dataprep/oj_prepped.csv", sep=',', encoding='ascii')

Now we will use the newly created .csv file as an MLTable **Input** to our AutoML Forecasting job. Note that we also have the ``MLTable`` .yml definition of that file in the *dataprep* folder, which defines the schema and the transformations of our .csv file.

In [99]:
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./dataprep"
)

## Specialized Forecasting Parameters

To define forecasting parameters for your experiment training, you can leverage the ```.set_forecast_settings()``` method. The table below details the forecasting parameters we will be passing into our experiment.

| Property                  | Description                  |
| ------------------------- |:----------------------------:|
| **time_column_name**      | The name of your time column |
| **forecast_horizon**      | The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly)      |
| **frequency**      | Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to pandas documentation for more information.|
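To illustrate how the horizon and the frequency interact, the sketch below uses only the standard library. It maps a week-start date to the matching weekly pandas offset alias and lists the periods that a horizon of 5 would cover. Treating 6/14/2018 (the `WeekStarting` value of the sample rows) as the last observed week is an assumption made for the example, and both helper functions are hypothetical.

```python
from datetime import date, timedelta

# Weekly pandas offset aliases, indexed by date.weekday() (Monday == 0)
PANDAS_WEEKLY_ALIASES = ["W-MON", "W-TUE", "W-WED", "W-THU", "W-FRI", "W-SAT", "W-SUN"]


def weekly_alias(day):
    """Weekly offset alias anchored on the weekday of `day`."""
    return PANDAS_WEEKLY_ALIASES[day.weekday()]


def forecast_periods(last_observed, horizon, step=timedelta(weeks=1)):
    """Timestamps of the `horizon` periods following `last_observed`."""
    return [last_observed + step * i for i in range(1, horizon + 1)]


last_week = date(2018, 6, 14)  # assumed last observed week, for illustration
alias = weekly_alias(last_week)
periods = forecast_periods(last_week, horizon=5)

print(alias)
for period in periods:
    print(period.isoformat())
```

With a weekly frequency, a horizon of 5 therefore means five weekly predictions per time series, not five days or five rows.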

## Configure automated machine learning job

Now, you're ready to configure the automated machine learning experiment.

When you run the code below, it will create an automated machine learning job that:

- Uses the compute cluster named `aml-cluster`
- Sets `Quantity` as the target column
- Sets `normalized_root_mean_squared_error` as the primary metric
- Times out after `60` minutes of total training time
- Trains a maximum of `4` models
- Uses a forecast horizon of `5` time units, which translates to 5 weeks

## Compute configuration

We will need to configure a compute cluster where we will run the forecasting job.

In [100]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "aml-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS11_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=2,
        # How many seconds an idle node keeps running before the cluster scales down
        idle_time_before_scale_down=120,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's begin_create_or_update method
    # and wait for the long-running operation to return the created cluster
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

    print(
        f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
    )

You already have a cluster named aml-cluster, we'll reuse it as is.


## AutoML Forecasting Job Configuration

In [101]:
from azure.ai.ml import automl

# configure the forecasting job
forecasting_job = automl.forecasting(
    compute=cpu_compute_target,
    experiment_name="forecasting",
    training_data=my_training_data_input,
    # validation_data = my_validation_data_input,
    target_column_name="Quantity",
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations=3,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Limits are all optional
forecasting_job.set_limits(
    timeout_minutes=60,
    trial_timeout_minutes=20,
    max_trials=4,
    # max_concurrent_trials = 4,
    # max_cores_per_trial: -1,
    enable_early_termination=True,
)

# Specialized properties for Time Series Forecasting training
forecasting_job.set_forecast_settings(
    time_column_name="WeekStarting",
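    forecast_horizon=5,  # 5 periods ahead; with weekly data this is 5 weeks
    frequency="W-THU",  # weekly data; the sample rows start on Thursday 6/14/2018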
    target_lags=[12],
    target_rolling_window_size=4,
    # ADDITIONAL FORECASTING TRAINING PARAMS ---
    # time_series_id_column_names=["tid1", "tid2", "tid2"],
    # short_series_handling_config=ShortSeriesHandlingConfiguration.DROP,
    # use_stl="season",
    # seasonality=3,
)
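The `target_lags` and `target_rolling_window_size` settings ask for features derived from past values of the target. The sketch below shows roughly what such features look like, using made-up weekly quantities and smaller values (a lag of 2 instead of 12) so that the output stays readable. Both helpers are hypothetical; the features AutoML actually generates, including which aggregates it computes over the window, may differ.

```python
def lag_feature(series, lag):
    """Value of the series `lag` periods earlier (None where unavailable)."""
    return [series[i - lag] if i >= lag else None for i in range(len(series))]


def rolling_mean_feature(series, window):
    """Mean of the `window` values preceding each position (None where unavailable)."""
    return [
        sum(series[i - window:i]) / window if i >= window else None
        for i in range(len(series))
    ]


quantity = [10, 12, 14, 16, 18, 20]  # made-up weekly sales

lag_2 = lag_feature(quantity, 2)
roll_4 = rolling_mean_feature(quantity, 4)

print(lag_2)
print(roll_4)
```

Both features only look backwards, so they remain available at prediction time. Larger lags and windows leave more leading rows without a value.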

Once the forecasting job object is created, we can set the featurization mode on the job and then submit it.

In the SDK v2, a job is submitted by calling the ```create_or_update()``` method on ```ml_client.jobs```. As you may remember from the beginning of this exercise, ```ml_client``` is an object of the MLClient class that lets us work with jobs, workspaces, and other resources.

In [102]:
forecasting_job.set_featurization(
    mode="auto"
)

# Custom featurization template. The column names CACH, PRP and CHMIN are
# placeholders that do not exist in the orange juice data; replace them
# with columns of your own dataset before uncommenting.
#
# from azure.ai.ml.automl import ColumnTransformer
#
# transformer_params = {
#     "imputer": [
#         ColumnTransformer(fields=["CACH"], parameters={"strategy": "most_frequent"}),
#         ColumnTransformer(fields=["PRP"], parameters={"strategy": "most_frequent"}),
#     ],
# }
# forecasting_job.set_featurization(
#     mode="custom",
#     transformer_params=transformer_params,
#     blocked_transformers=["LabelEncoding"],
#     column_name_and_types={"CHMIN": "Categorical"},
# )

## Run an automated machine learning job

OK, you're ready to go. Let's run the automated machine learning experiment.

> **Note**: This may take some time!

In [83]:
# Submit the AutoML job to the backend
returned_job = ml_client.jobs.create_or_update(forecasting_job)

In [84]:
print(f"Created job: {returned_job}")

# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

Created job: ForecastingJob({'log_verbosity': <LogVerbosity.INFO: 'Info'>, 'target_column_name': 'Quantity', 'weight_column_name': None, 'validation_data_size': None, 'cv_split_column_names': None, 'n_cross_validations': 3, 'test_data_size': None, 'task_type': <TaskType.FORECASTING: 'Forecasting'>, 'training_data': {'type': 'mltable', 'path': 'azureml://datastores/workspaceblobstore/paths/LocalUpload/acdde790a247cbb31f1331cb3c96771a/dataav'}, 'validation_data': {'type': 'mltable'}, 'test_data': None, 'environment_id': None, 'environment_variables': None, 'outputs': {}, 'type': 'automl', 'status': 'NotStarted', 'log_files': None, 'name': 'ashy_goat_501wm98jwb', 'description': None, 'tags': {'my_custom_tag': 'My custom value'}, 'properties': {}, 'id': '/subscriptions/7567b7de-befe-40fa-b883-40bb316ee50c/resourceGroups/dp100/providers/Microsoft.MachineLearningServices/workspaces/azuremllabtest/jobs/ashy_goat_501wm98jwb', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared

'https://ml.azure.com/runs/ashy_goat_501wm98jwb?wsid=/subscriptions/7567b7de-befe-40fa-b883-40bb316ee50c/resourcegroups/dp100/workspaces/azuremllabtest&tid=72f988bf-86f1-41af-91ab-2d7cd011db47'

While the job is running, you can monitor it in the Studio.

## Metrics for Time Series Forecasting scenarios

| **Metric**                           | **Example Use Case**                  |
| ------------------------------------ |:-------------------------------------:|
|``normalized_root_mean_squared_error``| Price prediction (forecasting), Inventory optimization, Demand forecasting|
|``r2_score``                          | Price prediction (forecasting), Inventory optimization, Demand forecasting|
|``normalized_mean_absolute_error``    ||


``r2_score``, ``normalized_mean_absolute_error`` and ``normalized_root_mean_squared_error`` all measure prediction error. ``r2_score`` and ``normalized_root_mean_squared_error`` are both based on average squared errors, while ``normalized_mean_absolute_error`` is based on the average absolute value of errors.

Absolute value treats errors at all magnitudes alike and squared errors will have a much larger penalty for errors with larger absolute values. Depending on whether larger errors should be punished more or not, one can choose to optimize squared error or absolute error.
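The difference is easy to see on a small made-up example. The sketch below compares two sets of predictions with the same average absolute error: one is off by a little everywhere, the other is exact except for one large miss. Normalizing by the range of the actual values is an assumption made for this illustration, so check the AutoML documentation for the exact normalization it uses.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def normalized(metric_value, actual):
    """Divide an error metric by the range of the actual values (assumed normalization)."""
    return metric_value / (max(actual) - min(actual))


actual = [10.0, 20.0, 30.0, 40.0]
evenly_off = [12.0, 22.0, 32.0, 42.0]    # every prediction is off by 2
one_big_miss = [10.0, 20.0, 30.0, 48.0]  # three exact predictions, one off by 8

for name, predicted in [("evenly_off", evenly_off), ("one_big_miss", one_big_miss)]:
    print(
        name,
        "NMAE:", round(normalized(mae(actual, predicted), actual), 4),
        "NRMSE:", round(normalized(rmse(actual, predicted), actual), 4),
        "R2:", round(r2_score(actual, predicted), 4),
    )
```

Both prediction sets have the same normalized mean absolute error, but the squared-error metrics rate the set with the single large miss as clearly worse.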

## Delete Azure resources

When you finish exploring Azure Machine Learning, you should delete the resources you've created to avoid unnecessary Azure costs.

1. Close the Azure Machine Learning Studio tab and return to the Azure portal.
1. In the Azure portal, on the **Home** page, select **Resource groups**.
1. Select the **rg-dp100-labs** resource group.
1. At the top of the **Overview** page for your resource group, select **Delete resource group**. 
1. Enter the resource group name to confirm you want to delete it, and select **Delete**.