Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Training Pipeline - Automated ML
_**Training many models using Automated Machine Learning**_

---

This notebook demonstrates how to train and register 50 models using Automated Machine Learning. We use the [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) to parallelize the training of the 50 models. This notebook uses the Energy Dataset to predict the solar production of each home in each suburb. For more information about the data, refer to the Data Preparation notebook.

<span style="color:red"><b>NOTE: There are limits on how many runs can execute in parallel per workspace. We currently recommend a maximum parallelism of 20 runs per experiment per workspace. Increasing the parallelism beyond this limit may result in Too Many Requests errors (HTTP 429).</b></span>

<span style="color:red"><b>Please install the latest version of the SDK so that the AutoML dependencies are consistent.</b></span>

In [29]:
#!pip install --upgrade azureml-sdk[automl]

Also install the `azureml-pipeline-steps` package, which is required for ParallelRunStep.

In [30]:
#!pip install --upgrade azureml-pipeline-steps

### Prerequisites

At this point, you should have already:

1. Created your AML Workspace using the [00_Setup_AML_Workspace notebook](../../00_Setup_AML_Workspace.ipynb)
2. Run [01_Data_Preparation.ipynb](../../01_Data_Preparation.ipynb) to create the dataset

## 1.0 Set up workspace, datastore, experiment

In [31]:
import azureml.core
from azureml.core import Workspace, Datastore
import pandas as pd

# set up workspace
ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

# set up datastores
#dstore = ws.get_default_datastore()
dstore = Datastore.get(ws, datastore_name='energy')

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Default datastore name'] = dstore.name
# Show full column contents. Note: -1 is deprecated in pandas 1.0 and later; use None there.
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

| | |
|-|-|
|SDK version|1.14.0|
|Subscription ID|33125c98-8730-4ada-8519-4282d89758eb|
|Workspace|mme-test-sa|
|Resource Group|sample|
|Location|westus|
|Default datastore name|energy|


### Choose an experiment

In [32]:
from azureml.core import Experiment

experiment = Experiment(ws, 'manymodels-training-pipeline')

print('Experiment name: ' + experiment.name)

Experiment name: manymodels-training-pipeline


## 2.0 Call the registered filedataset

We use 50 datasets and ParallelRunStep to build 50 time-series models that predict the solar production of each home.

Each dataset represents one year's worth of data.

You will need to register the datasets in the Workspace first. We did so in the [data preparation notebook](../../01_Data_Preparation.ipynb).


In [33]:
from azureml.core.dataset import Dataset

Energy50 = Dataset.get_by_name(ws, name='Energy50_train')
Energy50_input = Energy50.as_named_input('train_50_models')

## 3.0 Build the training pipeline
Now that the dataset, workspace, and datastore are set up, we can put together a pipeline for training.

### Set up environment for ParallelRunStep

[Environment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py) defines a collection of resources that we will need to run our pipelines. We configure a reproducible Python environment for our training script. 

In [34]:
from scripts.helper import get_automl_environment
from azureml.core import Environment
train_env = get_automl_environment()

# Register the environment
train_env.register(workspace=ws)

# If the environment is already registered, retrieve it here instead
train_env = Environment.get(ws, "many_models_environment_automl")

### Choose a compute target

Currently, ParallelRunConfig only supports AmlCompute. You can switch to a different compute cluster if one fails.

This is the compute target we will pass into our ParallelRunConfig.

In [35]:
# AmlCompute is needed below for provisioning_configuration()
from azureml.core.compute import AmlCompute, ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "cpucluster"



found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D13_V2',
                                                           min_nodes=2,
                                                           max_nodes=20)
    # Create the cluster. GPU clusters are recommended if you use AutoML deep learning models.
    compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)

# For a more detailed view of current AmlCompute status, use get_status().

Found existing compute target.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Train

This dictionary defines the [AutoML settings](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py#parameters). For this forecasting task, we add the name of the time column and the maximum forecast horizon.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**blacklist_models**|Models in this list won't be used by AutoML. All supported models are listed [here](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py).|
|**iterations**|Number of models to train. This is optional but gives you greater control.|
|**iteration_timeout_minutes**|Maximum amount of time in minutes that a single model can train. This is optional and depends on the dataset. Explore your dataset first to get approximate training times. In this notebook we set it to 5 minutes.|
|**experiment_timeout_hours**|Maximum amount of time in hours that the experiment can take before it terminates. The settings in this notebook use `experiment_timeout_minutes` (60 minutes) instead.|
|**label_column_name**|The name of the label column.|
|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|
|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term.|
|**time_column_name**|The name of your time column.|
|**max_horizon**|The number of periods out you would like to predict past your training data. Periods are inferred from your data.|
|**grain_column_names**|The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp.|
|**group_column_names**|The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series.|
|**drop_column_names**|The names of columns to drop for forecasting tasks.|
|**track_child_runs**|Flag to enable or disable tracking of child runs. Only the best run (metrics and model) is tracked if the flag is set to False.|
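
To make the `n_cross_validations` and `max_horizon` rows more concrete, here is a minimal sketch of rolling origin validation on row indices. It is an illustration only: `rolling_origin_splits` is a hypothetical helper, the 30-row series is made up, and the exact splitting logic inside AutoML may differ.

```python
from typing import List, Tuple


def rolling_origin_splits(n_rows: int, horizon: int, n_splits: int) -> List[Tuple[range, range]]:
    """Return (train, validation) index ranges, earliest origin first.

    Every training window ends exactly where its validation window starts,
    so no future observation leaks into training.
    """
    if n_rows - n_splits * horizon < 1:
        raise ValueError("Not enough rows for the requested splits and horizon.")
    splits = []
    for i in range(n_splits, 0, -1):
        origin = n_rows - i * horizon  # first row of the validation window
        splits.append((range(0, origin), range(origin, origin + horizon)))
    return splits


# 30 time-ordered rows (made up), with the max_horizon (10) and
# n_cross_validations (2) values used in this notebook.
for train, validation in rolling_origin_splits(n_rows=30, horizon=10, n_splits=2):
    print(f"train rows {train.start}-{train.stop - 1}, "
          f"validate rows {validation.start}-{validation.stop - 1}")
```

Each later split trains on more history and validates on the next `horizon` rows, which keeps the evaluation temporally consistent.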

In [36]:
import logging
from scripts.helper import write_automl_settings_to_file

automl_settings = {
    "task" : 'forecasting',
    "primary_metric" : 'normalized_root_mean_squared_error',
    "iteration_timeout_minutes" : 5, # Adjust for your dataset; check how long training takes before setting this value
    "iterations" : 5,
    "experiment_timeout_minutes" : 60,
    "label_column_name" : 'Solar',
    "n_cross_validations" : 2,
    "verbosity" : logging.INFO,
    "debug_log": 'automl_oj_sales_debug.txt',
    "time_column_name": 'EndDate',
    "max_horizon" : 10,
    "max_cores_per_iteration": 4, # Depends on the VM type
    "max_concurrent_iterations": 10, # Depends on the VM type
    "enable_tf": True,  # Set to True if you're using GPU VMs; this uses TensorFlow/DL to forecast
    "group_column_names": ['Suburb', 'Home'],
    "grain_column_names": ['Suburb', 'Home'],
    "drop_column_names": ['DeviceNumber','Generalusage'],
    "blacklist_models": ['Average','Naive']
}

write_automl_settings_to_file(automl_settings)
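
The requirement that groups must not split individual time-series can be checked on the data before training. The sketch below is a hypothetical helper, not part of `scripts.helper`; the sample rows and the `Batch` column are made up for illustration.

```python
from collections import defaultdict


def series_split_across_groups(rows, grain_columns, group_columns):
    """Return the time-series keys that appear in more than one group."""
    groups_per_series = defaultdict(set)
    for row in rows:
        series_key = tuple(row[c] for c in grain_columns)
        group_key = tuple(row[c] for c in group_columns)
        groups_per_series[series_key].add(group_key)
    return sorted(key for key, groups in groups_per_series.items() if len(groups) > 1)


# Made-up rows; 'Batch' is a hypothetical column used only to show a violation.
rows = [
    {"Suburb": "Bondi", "Home": "home9", "Batch": "A"},
    {"Suburb": "Bondi", "Home": "home9", "Batch": "B"},
    {"Suburb": "Manly", "Home": "home6", "Batch": "A"},
]

# Same columns for grain and group, as in the settings above: nothing is split.
print(series_split_across_groups(rows, ["Suburb", "Home"], ["Suburb", "Home"]))

# Grouping by 'Batch' would split the Bondi/home9 series across two groups.
print(series_split_across_groups(rows, ["Suburb", "Home"], ["Batch"]))
```

Because this notebook uses the same columns for `group_column_names` and `grain_column_names`, each group holds exactly one whole time-series.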

### Set up ParallelRunConfig

[ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallel_run_config.parallelrunconfig) is the configuration for the ParallelRunStep. You will need to determine the number of workers and nodes appropriate for your use case. The process_count_per_node is based on the number of cores of the compute VM. The node_count determines the number of compute nodes to use; increasing the node count speeds up the training process.


* <b>node_count</b>: The number of compute nodes to be used for running the user script. We recommend starting with 3 and increasing the node_count if training takes too long.

* <b>process_count_per_node</b>: The number of processes per node.

* <b>run_invocation_timeout</b>: The run() method invocation timeout in seconds. Set the timeout to the maximum training time of one AutoML run (with some buffer). The default is 60 seconds.

<span style="color:red"><b>NOTE: As stated at the top of this notebook, keep the parallelism at or below 20 runs per experiment per workspace. Exceeding this limit may result in Too Many Requests errors (HTTP 429).</b></span>
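
As a rough sanity check against the limit in the note above, the sketch below estimates the parallelism from the two settings. It assumes each worker process handles one AutoML run at a time, so the upper bound is `node_count * process_count_per_node`; the helper names are hypothetical.

```python
MAX_RECOMMENDED_PARALLEL_RUNS = 20  # recommendation quoted in the note above


def total_parallel_runs(node_count: int, process_count_per_node: int) -> int:
    """Upper bound on simultaneous runs, assuming one run per worker process."""
    return node_count * process_count_per_node


def within_recommended_limit(node_count: int, process_count_per_node: int,
                             limit: int = MAX_RECOMMENDED_PARALLEL_RUNS) -> bool:
    return total_parallel_runs(node_count, process_count_per_node) <= limit


for nodes, procs in [(3, 5), (4, 5), (5, 5)]:
    runs = total_parallel_runs(nodes, procs)
    verdict = "ok" if within_recommended_limit(nodes, procs) else "may trigger HTTP 429"
    print(f"node_count={nodes}, process_count_per_node={procs}: {runs} parallel runs ({verdict})")
```

With the values used in the next cell (4 nodes, 5 processes per node), the estimate sits exactly at the recommended maximum of 20.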


In [37]:
from scripts.helper import build_parallel_run_config

# PLEASE MODIFY the following three settings based on your compute and experiment timeout.
node_count=4
process_count_per_node=5
run_invocation_timeout=3700 # timeout in seconds; in line with the AutoML experiment timeout or (number of iterations * iteration timeout)

parallel_run_config = build_parallel_run_config(train_env, compute, node_count, process_count_per_node, run_invocation_timeout)
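
The timeout value can be derived from the AutoML settings rather than typed by hand. This is a minimal sketch under stated assumptions: `suggest_run_invocation_timeout` is a hypothetical helper, and the 100-second buffer is chosen only because it reproduces the 3700 seconds used above.

```python
def suggest_run_invocation_timeout(automl_settings: dict, buffer_seconds: int = 100) -> int:
    """Return a timeout in seconds that covers one AutoML run plus a buffer."""
    if "experiment_timeout_minutes" in automl_settings:
        minutes = automl_settings["experiment_timeout_minutes"]
    elif "experiment_timeout_hours" in automl_settings:
        minutes = automl_settings["experiment_timeout_hours"] * 60
    else:
        # Fall back to the number of iterations times the per-iteration timeout.
        minutes = automl_settings["iterations"] * automl_settings["iteration_timeout_minutes"]
    return int(minutes * 60) + buffer_seconds


# Only the timing-related keys from the settings in this notebook.
timing_settings = {
    "iterations": 5,
    "iteration_timeout_minutes": 5,
    "experiment_timeout_minutes": 60,
}
print(suggest_run_invocation_timeout(timing_settings))
```

If the experiment timeout is omitted, the fallback of 5 iterations at 5 minutes each gives a much shorter timeout, so check which bound actually applies to your runs.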

### Set up ParallelRunStep

This [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallelrunstep?view=azure-ml-py) is the main step in our pipeline. First, we set up the output directory and define the pipeline's output name. The pipeline's output data is stored in the `energy` datastore retrieved earlier (`dstore`).

In [38]:
from azureml.pipeline.core import PipelineData

training_output_name = "training_output"

output_dir = PipelineData(name=training_output_name, 
                          datastore=dstore)

We specify the following parameters:

* <b>name</b>: We set a name for our ParallelRunStep.

* <b>parallel_run_config</b>: We then pass the previously defined ParallelRunConfig.

* <b>allow_reuse</b>: Indicates whether the step should reuse previous results when re-run with the same settings. 

* <b>inputs</b>: We use the registered FileDataset that we retrieved earlier in the notebook. _inputs_ points to a registered file dataset in AML studio that points to a path in the blob container. The number of files in that path determines the number of models that will be trained in the ParallelRunStep.

* <b>output</b>: The PipelineData object we just defined, which corresponds to the output directory.

* <b>models</b>: Zero or more model names already registered in the Azure Machine Learning model registry.
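
Since one model is trained per input file, the file names determine which model belongs to which home. The sketch below assumes a `<Suburb>_<Home>.csv` naming pattern, inferred from the `InputData` tags in the training output later in this notebook; `group_keys_from_file` is a hypothetical helper and the actual training script may derive these values differently.

```python
from pathlib import PurePosixPath


def group_keys_from_file(path: str, group_columns=("Suburb", "Home")) -> dict:
    """Split a file name such as 'Bondi_home9.csv' into its group column values."""
    parts = PurePosixPath(path).stem.split("_")
    if len(parts) != len(group_columns):
        raise ValueError(f"Unexpected file name: {path}")
    return dict(zip(group_columns, parts))


# File names taken from the InputData tags in the training output.
files = ["Bondi_home9.csv", "NorthBridge_home2.csv", "Bondi_home7.csv"]

print("models to train:", len(files))
for name in files:
    print(name, "->", group_keys_from_file(name))
```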


<span style="color:red"><b>Please upgrade azureml-pipeline-steps (>=1.6.0) if the following fails.</b></span>

In [39]:
from azureml.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="many-models-training",
    parallel_run_config=parallel_run_config,
    allow_reuse = False,
    inputs=[Energy50_input], # train 50 models
    #inputs=[filedst_all_models_inputs], # switch to this inputs if train all 11,973 models
    output=output_dir,
    #arguments=['--retrain_failed_models', 'True'], # Uncomment this if you want to retrain only failed models
)

## 4.0 Run the training pipeline

### Submit the pipeline to run

Next we submit our pipeline to run. The whole training pipeline takes about 1h 11m using a Standard_D13_V2 VM with our current ParallelRunConfig setting.

In [40]:
from azureml.pipeline.core import Pipeline
#from azureml.widgets import RunDetails

# Pass the steps as a list, as documented for Pipeline
pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
run = experiment.submit(pipeline)
#RunDetails(run).show()

Created step many-models-training [66a369aa][887e0fbe-5bb2-47e1-bcf9-a153fd2cd7e8], (This step will run and generate new outputs)
Using data reference train_50_models_0 for StepId [b3d4dc6d][8af75f2e-94ea-4912-92e8-6a50263a8ecd], (Consumers of this data are eligible to reuse prior runs.)
Submitted PipelineRun e8b0f6e9-5716-49e5-8245-e4b8c75f2ea1
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/manymodels-training-pipeline/runs/e8b0f6e9-5716-49e5-8245-e4b8c75f2ea1?wsid=/subscriptions/33125c98-8730-4ada-8519-4282d89758eb/resourcegroups/sample/workspaces/mme-test-sa


You can run the following command if you'd like to monitor the training process in the Jupyter notebook. It streams logs live while training.

**Note**: This command may not work for Notebook VM, however it should work on your local laptop.

In [28]:
run.wait_for_completion(show_output=True)

PipelineRunId: 83fc9301-7cce-4637-bf9e-445f4de4dd41
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/manymodels-training-pipeline/runs/83fc9301-7cce-4637-bf9e-445f4de4dd41?wsid=/subscriptions/33125c98-8730-4ada-8519-4282d89758eb/resourcegroups/sample/workspaces/mme-test-sa
PipelineRun Status: Running


StepRunId: 63511730-a6f0-4a6e-984c-ed972840081f
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/manymodels-training-pipeline/runs/63511730-a6f0-4a6e-984c-ed972840081f?wsid=/subscriptions/33125c98-8730-4ada-8519-4282d89758eb/resourcegroups/sample/workspaces/mme-test-sa
StepRun( many-models-training ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_0555844bd79b85553ca984d1883fc82dd82e538e0b5ee3605566724ba5136287_d.txt
Using default tag: latest
latest: Pulling from azureml/azureml_bafd6ade1c94d3b2014fc6b378f62160
8e097b52bfb8: Pulling fs layer
a613a9b4553c: Pulling fs layer
acc000f01536: Pulling fs layer
73eef93b7466: Pulling

ExperimentExecutionException: ExperimentExecutionException:
	Message: The output streaming for the run interrupted.
But the run is still executing on the compute target. 
Details for canceling the run can be found here: https://aka.ms/aml-docs-cancel-run
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "The output streaming for the run interrupted.\nBut the run is still executing on the compute target. \nDetails for canceling the run can be found here: https://aka.ms/aml-docs-cancel-run"
    }
}

Successfully trained and registered the Automated ML models.

## 5.0 Review outputs of the training pipeline

The training pipeline will train and register models to the Workspace. You can review trained models in the Azure Machine Learning Studio under 'Models'.
If there are any issues with training, you can go to 'many-models-training' run under the pipeline run and explore logs under 'Logs'.
You can look at the stdout and stderr output under `logs/user/worker/<ip>` for more details.


## 6.0 Get list of AutoML runs along with registered model names and tags

The following code snippet iterates through all the AutoML runs for the experiment and lists their details.

**Framework** - AutoML, **Dataset** - input data set, **Run** - AutoML run id, **Status** - AutoML run status,  **Model** - Registered model name, **Tags** - Tags for model, **StartTime** - Start time, **EndTime** - End time, **ErrorType** - ErrorType, **ErrorCode** - ErrorCode, **ErrorMessage** - Error Message

In [14]:
from scripts.helper import get_training_output
import os

training_results_name = "training_results"

training_file = get_training_output(run, training_results_name, training_output_name)
all_columns = ["Framework", "Dataset", "Run", "Status", "Model", "Tags", "StartTime", "EndTime" , "ErrorType", "ErrorCode", "ErrorMessage" ]
df = pd.read_csv(training_file, delimiter=" ", header=None, names=all_columns)
training_csv_file = "training.csv"
df.to_csv(training_csv_file)
print("Training output has", df.shape[0], "rows. Please open", os.path.abspath(training_csv_file), "to browse through all the output.")

Training output has 50 rows. Please open /mnt/batch/tasks/shared/LS_root/mounts/clusters/julian-ci/code/Users/julianle/Projects/ManyModelsforEnergy/Automated_ML/02_AutoML_Training_Pipeline/training.csv to browse through all the output.


In [15]:
df.head()

Unnamed: 0,Framework,Dataset,Run,Status,Model,Tags,StartTime,EndTime,ErrorType,ErrorCode,ErrorMessage
0,AutoML,Bondi_home9,AutoML_e8613d15-0355-44f1-a2be-290a396d4bed,Completed,automl_1704596300685ffc01afd572da5a30a03c784c09610095be3bae9f125e8737f3,"{'ModelType': 'AutoML', 'Suburb': 'Bondi', 'Home': 'home9', 'InputData': 'Bondi_home9.csv', 'StepRunId': 'ad9d6e9a-de2d-4460-9f6c-6a13c51f033a', 'RunId': '30060a76-710e-4a4c-8b32-366ef11a4f98', 'Hash': '1704596300685ffc01afd572da5a30a03c784c09610095be3bae9f125e8737f3'}",2020-09-24 06:09:45.449721,2020-09-24 06:21:57.250943,,,
1,AutoML,NorthBridge_home2,AutoML_9cc87304-7b43-42e5-b719-8394c889bc1c,Completed,automl_b8724c8c56fa9316b3d1fae724774bd51f53f7dc6dde0da0d8a7bee5844b7cd6,"{'ModelType': 'AutoML', 'Suburb': 'NorthBridge', 'Home': 'home2', 'InputData': 'NorthBridge_home2.csv', 'StepRunId': 'ad9d6e9a-de2d-4460-9f6c-6a13c51f033a', 'RunId': '30060a76-710e-4a4c-8b32-366ef11a4f98', 'Hash': 'b8724c8c56fa9316b3d1fae724774bd51f53f7dc6dde0da0d8a7bee5844b7cd6'}",2020-09-24 06:21:58.690233,2020-09-24 06:31:34.942275,,,
2,AutoML,Bondi_home7,AutoML_60b6eb39-b4c3-4043-98bc-04f17a563257,Completed,automl_fd1179616c935236e39bc13ae5aeecb1383956c61026e1fdc15cf8f0d8bb4e8e,"{'ModelType': 'AutoML', 'Suburb': 'Bondi', 'Home': 'home7', 'InputData': 'Bondi_home7.csv', 'StepRunId': 'ad9d6e9a-de2d-4460-9f6c-6a13c51f033a', 'RunId': '30060a76-710e-4a4c-8b32-366ef11a4f98', 'Hash': 'fd1179616c935236e39bc13ae5aeecb1383956c61026e1fdc15cf8f0d8bb4e8e'}",2020-09-24 06:31:36.732972,2020-09-24 06:40:37.129168,,,
3,AutoML,AlbertPark_home1,AutoML_eb1328f3-4a33-4304-802b-567e4e53ac3e,Completed,automl_bb3d06532a80b0cea738dd967087fbb6f22a4c013212621349906bdec735a156,"{'ModelType': 'AutoML', 'Suburb': 'AlbertPark', 'Home': 'home1', 'InputData': 'AlbertPark_home1.csv', 'StepRunId': 'ad9d6e9a-de2d-4460-9f6c-6a13c51f033a', 'RunId': '30060a76-710e-4a4c-8b32-366ef11a4f98', 'Hash': 'bb3d06532a80b0cea738dd967087fbb6f22a4c013212621349906bdec735a156'}",2020-09-24 06:09:55.945220,2020-09-24 06:23:35.374225,,,
4,AutoML,Manly_home6,AutoML_6e6befd2-e4e5-4bca-a7d0-c2b4b167e39c,Completed,automl_055917f29e69e427fdf02837ed27e226382c887b4b6eb53ffd0ae868d0871b93,"{'ModelType': 'AutoML', 'Suburb': 'Manly', 'Home': 'home6', 'InputData': 'Manly_home6.csv', 'StepRunId': 'ad9d6e9a-de2d-4460-9f6c-6a13c51f033a', 'RunId': '30060a76-710e-4a4c-8b32-366ef11a4f98', 'Hash': '055917f29e69e427fdf02837ed27e226382c887b4b6eb53ffd0ae868d0871b93'}",2020-09-24 06:23:37.002372,2020-09-24 06:34:15.474209,,,
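
The `StartTime` and `EndTime` columns make it easy to see how long each model took to train. The sketch below uses only the standard library and three rows copied from the output above; the helper name `training_seconds` is hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def training_seconds(start: str, end: str) -> float:
    """Elapsed seconds between two timestamps in the training output format."""
    elapsed = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return elapsed.total_seconds()


# (Dataset, Status, StartTime, EndTime) copied from the output above.
rows = [
    ("Bondi_home9", "Completed", "2020-09-24 06:09:45.449721", "2020-09-24 06:21:57.250943"),
    ("NorthBridge_home2", "Completed", "2020-09-24 06:21:58.690233", "2020-09-24 06:31:34.942275"),
    ("Bondi_home7", "Completed", "2020-09-24 06:31:36.732972", "2020-09-24 06:40:37.129168"),
]

durations = {
    dataset: training_seconds(start, end)
    for dataset, status, start, end in rows
    if status == "Completed"
}
slowest = max(durations, key=durations.get)

for dataset, seconds in durations.items():
    print(f"{dataset}: {seconds / 60:.1f} min")
print("slowest:", slowest)
```

Per-model durations like these are a practical basis for tuning `iteration_timeout_minutes` and `run_invocation_timeout`.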


## 7.0 Publish and schedule the pipeline (Optional)

### 7.1 Publish the pipeline

Once you have a pipeline you're happy with, you can publish it so that you can call it programmatically later on. See this [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#publish-a-pipeline) for additional information on publishing and calling pipelines.

In [None]:
# published_pipeline = pipeline.publish(name = 'automl_train_many_models',
#                                      description = 'train many models',
#                                      version = '1',
#                                      continue_on_step_failure = False)

### 7.2 Schedule the pipeline
You can also [schedule the pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines) to run on a time-based or change-based schedule. This could be used to automatically retrain models every month or based on another trigger such as data drift.

In [None]:
# from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
# training_pipeline_id = published_pipeline.id

# recurrence = ScheduleRecurrence(frequency="Month", interval=1, start_time="2020-01-01T09:00:00")
# recurring_schedule = Schedule.create(ws, name="automl_training_recurring_schedule", 
#                             description="Schedule Training Pipeline to run on the first day of every month",
#                             pipeline_id=training_pipeline_id, 
#                             experiment_name=experiment.name, 
#                             recurrence=recurrence)

## 8.0 Bookkeeping of workspace (Optional)

### 8.1 Cancel any runs that are running

Use the following to cancel any runs that are still running in a given experiment.

In [None]:
# from scripts.helper import cancel_runs_in_experiment
# failed_experiment =  'Please modify this and enter the experiment name'
# # Please note that the following script cancels all the currently running runs in the experiment
# cancel_runs_in_experiment(ws, failed_experiment)

## Next Steps

Now that you've trained and registered the models, move on to [03_AutoML_Forecasting_Pipeline.ipynb](../03_AutoML_Forecasting_Pipeline/03_AutoML_Forecasting_Pipeline.ipynb) to make forecasts with your models.