# Train a Model using AutoML
Automated machine learning (AutoML) is the process of automating the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity, all while sustaining model quality.

In this notebook, AutoML refers to the automated machine learning capability in Azure Machine Learning.

AutoML accepts either Azure Tabular Datasets or pandas DataFrames for local runs, but only Azure Tabular Datasets for remote runs.

For a list of AutoML algorithms, please consult this page: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train

In [None]:
# Load Azure Libraries
import azureml.core  # Needed for azureml.core.VERSION in the experiment summary cells
from azureml.core import Datastore
from azureml.core.dataset import Dataset
from azureml.core.workspace import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication
from azure.storage.blob import BlockBlobService
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.automl.core.featurization import FeaturizationConfig
from azureml.explain.model._internal.explanation_client import ExplanationClient

# Import other Libraries
import pandas as pd
import numpy as np
import logging
import json
import os
import math
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Retrieve your workspace from the configuration file
ws = Workspace.from_config()
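
`Workspace.from_config()` reads a JSON file, named `config.json` by default, from the current directory or a `.azureml` subfolder. The following is a minimal sketch of producing such a file, assuming placeholder values that you replace with your own. The helper name `write_workspace_config` is hypothetical and is not part of the Azure ML SDK.

```python
import json
import os
import tempfile


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three keys the workspace lookup needs."""
    config_dir = os.path.join(folder, ".azureml")
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "config.json")
    with open(config_path, "w") as config_file:
        json.dump(
            {
                "subscription_id": subscription_id,
                "resource_group": resource_group,
                "workspace_name": workspace_name,
            },
            config_file,
            indent=4,
        )
    return config_path


# Written to a temporary folder here; in practice use your notebook's folder.
demo_folder = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_folder,
    "<my-subscription-id>",
    "<my-resource-group>",
    "<my-workspace-name>",
)

# Read the file back to confirm its contents.
with open(config_path) as config_file:
    loaded_config = json.load(config_file)
print(loaded_config)
```

You can also download this file from the Azure portal instead of writing it by hand.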

In [None]:
# Retrieve your Datastore by name by filling in the lower case values between double quotes
datastore_name = "<my-datastore-name>"
datastore = Datastore.get(ws, datastore_name)

In [None]:
# Retrieve your Refined Datasets by name by filling in the lower case values between double quotes
dataset_name_test = "<my-transformed-test-dataset-name>"
dataset_name_train  = "<my-transformed-train-dataset-name>"

# Load Data in as Tabular Datasets
testing_data  = Dataset.get_by_name(ws, dataset_name_test, version='latest')
training_data = Dataset.get_by_name(ws, dataset_name_train, version='latest')

In [None]:
# Convert your tabular dataset to pandas data frames
testTransformedDF = testing_data.to_pandas_dataframe()
trainTransformedDF = training_data.to_pandas_dataframe()

In [None]:
# Retrieve your Compute Targets for Running AutoML
cpu_compute_target = ComputeTarget(ws, '<my-cpu-cluster>')
# Retrieve a GPU cluster for Deep Learning Runs
gpu_compute_target = ComputeTarget(ws, '<my-gpu-cluster>')

## Now, we're ready to configure our AutoML run
### First, drop all columns not required in your machine learning model and assign your label column
Use Tabular Datasets for remote runs; they are the only data format that works on remote compute.

In [2]:
# Drop any column that isn't appropriate to add into the model
# For example, ID columns, redundant columns or columns with only one value should be dropped
trainTab = training_data.drop_columns(['<MyIdColumn>','<MyRedundantColumn>','<MySingleValueColumn>'])

In [None]:
# Next, assign the name of the column you are trying to predict to a variable.
label = '<MyLabelColumn>'

### Next, configure your AutoML settings
There are numerous configuration options within AutoML; task, primary metric, featurization and explainability are the most important.<br>
Set <b>task</b> to classification, regression or forecasting, depending on the type of problem you are trying to solve.<br>
Set <b>Primary Metric</b> to the metric you want to optimize, such as accuracy for classification or RMSE (normalized_root_mean_squared_error in AutoML) for regression.<br>
Setting <b>Featurization</b> to auto automatically one-hot encodes categorical values, drops high-cardinality categorical columns, imputes missing values across all types of columns, generates numerous datetime features, and creates many features from text data.<br>
Set <b>Model Explainability</b> to True to obtain a ranked list of the features used by the AutoML model.
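
To make two of the featurization steps concrete, here is a minimal pure-Python sketch of one-hot encoding and mean imputation. It is an illustration only and is not AutoML's implementation: the function names are hypothetical, and AutoML may choose a different imputation strategy for your data.

```python
def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct category, in sorted order."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


colors = ["red", "blue", "red", "green"]   # a categorical column
ages = [30.0, None, 50.0, 40.0]            # a numeric column with a gap

encoded = one_hot_encode(colors)
imputed = impute_mean(ages)

print(encoded)
print(imputed)
```

Each category becomes its own column, and the missing age is filled with the mean of the three observed ages.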

For a list and explanation of configurations, click the link below: <br>
https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py

For a list of primary metrics based on problem type, click the following:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train

In [None]:
automl_settings = {
    #"experiment_timeout_minutes": 20,  # Use this for testing to limit the autoML run
    "enable_early_stopping" : True,    # Enable this to end the experiment once results stop improving.  Always use this.
    #"iteration_timeout_minutes": 5,   # Enable this to limit how long each model takes to run
    "max_concurrent_iterations": 4,    # Match this value to the max number of nodes in your cluster
    #"max_cores_per_iteration": -1,     # Only used for DNN 
    "n_cross_validations": 5,         # This is the number of splits to use for cross validation
    "featurization": 'auto',           # Set to auto to preprocess data
    "preprocess": True,                # Older True/False preprocessing flag; newer SDK versions use featurization instead
    "enable_dnn": False,               # Enables Deep Neural Networks for appropriate problems
    "enable_tf": False,                # Enables Tensorflow algorithms for appropriate problems
    "verbosity": logging.INFO,         # Enables logging
}

automl_config = AutoMLConfig(task = '<my-problem-type>',         # Classification, regression or forecasting
                             primary_metric = '<my-metric>',     # Select the metric to be optimized through autoML 
                             num_classes = 5,                    # Number of classes in the label column (classification only)
                             debug_log = 'automl_errors.log',    # Assigns the debug log name
                             compute_target=cpu_compute_target,  # Assign the remote cluster.  If blank, runs locally
                             experiment_exit_score = 0.99,       # Threshold to end autoML runs prematurely 
                             #blacklist_models = ['Prophet'],        # Use to blacklist models
                             #whitelist_models = ['KNN'],        # Runs only your selected models
                             enable_onnx_compatible_models=False,# Enables/disables enforcing onnx compatible models
                             training_data = trainTab,           # Sets the training data
                             label_column_name = label,          # Sets the column to predict in your training data
                             model_explainability=True,          # Enables/disables model explainability
                             enable_voting_ensemble=True,        # Enables/disables voting ensemble algorithm
                             enable_stack_ensemble=True,         # Enables/disables stack ensemble algorithm
                             **automl_settings
                            )
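
The `n_cross_validations` setting controls how many folds the training data is split into. The sketch below shows plain contiguous k-fold splitting, assuming no shuffling or stratification. AutoML's internal splitting logic is not shown in this notebook and may differ; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_rows, n_splits):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    # Spread any remainder rows over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_rows=10, n_splits=5))
for fold_number, (train, validation) in enumerate(folds, start=1):
    print(fold_number, "train:", train, "validation:", validation)
```

Every row is used for validation exactly once and for training in the remaining folds, so each candidate model is scored on data it did not see during fitting.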


## Set up your Experiment
An experiment is a grouping of many runs from a specified script. It always belongs to a workspace. When you submit a run, you provide an experiment name. Information for the run is stored under that experiment. If you submit a run and specify an experiment name that doesn't exist, a new experiment with that newly specified name is automatically created.

In [None]:
# choose a name for experiment appropriate to the project
experiment_name = '<my-experiment-name>'

experiment=Experiment(ws, experiment_name)

# Output a nice table with all of the essential experiment information
output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is rejected by recent pandas versions
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

In [None]:
# Train your model
remote_run = experiment.submit(automl_config, show_output = True)

In [None]:
# Run the Widget to best view Model Results
from azureml.widgets import RunDetails
RunDetails(remote_run).show() 

In [None]:
# Compare your results to the default model accuracy.  This is the number to beat.  Pick one depending on your Problem.

# For Classification problems, the default model accuracy is simply predicting the most common class every time.
# Replace MyLabelColumn with the name of your label column.
Default_Model_Accuracy = trainTransformedDF[trainTransformedDF.MyLabelColumn=='<my-most-common-value>'].MyLabelColumn.count()/trainTransformedDF.MyLabelColumn.count()
print(Default_Model_Accuracy)
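
As a self-contained illustration of this baseline, the sketch below computes majority-class accuracy on a small made-up list of labels. The data and the function name `majority_class_accuracy` are assumptions for demonstration only.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


labels = ["no", "yes", "no", "no", "yes"]  # made-up label column
baseline_label, baseline_accuracy = majority_class_accuracy(labels)
print(baseline_label, baseline_accuracy)
```

A trained classifier is only useful if its accuracy is clearly above this value.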

In [None]:
# Compare your results to the default model accuracy.  This is the number to beat.  Pick one depending on your Problem.

# For Regression problems, the default model score is the RMSE you get by always predicting the average.
# Replace MyPredictionColumn with the name of your label column.
# Local variables are used so that no helper columns are added to the training DataFrame.
default_prediction = np.mean(trainTransformedDF.MyPredictionColumn)
squared_error = (trainTransformedDF['MyPredictionColumn'] - default_prediction)**2
Default_Model_RMSE = np.sqrt(np.mean(squared_error))  # The square root turns MSE into RMSE
print(Default_Model_RMSE)
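
The RMSE of a model that always predicts the mean is the square root of the mean squared deviation from the mean, which is the population standard deviation of the target. The sketch below checks this on made-up numbers; the data and the function name `mean_predictor_rmse` are assumptions for illustration.

```python
import math
import statistics


def mean_predictor_rmse(targets):
    """RMSE obtained by predicting the mean of the targets for every row."""
    mean_value = sum(targets) / len(targets)
    squared_errors = [(t - mean_value) ** 2 for t in targets]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


targets = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up target column
baseline_rmse = mean_predictor_rmse(targets)

print(baseline_rmse)                # RMSE of the mean predictor
print(statistics.pstdev(targets))   # population standard deviation, same value
```

Leaving out the square root would report the mean squared error instead, which is on a different scale from the RMSE that AutoML reports.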

## Obtain and Register Model to Machine Learning Workspace
Next up is retrieving your model and registering it to your workspace.  Registering your model lets you deploy it, run it in pipelines, and store it for later use.

### Download your scoring file and your environment file to your local notebook.
To deploy models, Azure ML requires a scoring script that uses your model to make predictions on new data.  It also requires an environment file listing all of the packages needed to run the scoring script.  Here, we retrieve the best run with AutoML's get_output() function, then download both files from that run to the local VM.

In [None]:
best_run, fitted_model = remote_run.get_output()

In [None]:
# Create a directory to store your files first
inference_folder = os.path.join(os.getcwd(), 'inference')
os.makedirs(inference_folder, exist_ok=True)

# Specify names for your scoring script and environment file
script_file_name = 'inference/score.py'
environment_file_name = 'inference/AutoML.yml'

# Download the files locally
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)
best_run.download_file('outputs/conda_env_v_1_0_0.yml', environment_file_name)

In [None]:
# Give a meaningful description and tags to your autoML model
description = '<my-model-description>'
tags = {"<my-tag-name>": "<my-tag-value>", "<my-tag-name2>": "<my-tag-value2>"}

# Retrieve the model_name from the autoML run
model_name = best_run.properties['model_name']

# Register your model, set tags and description
model = remote_run.register_model(model_name = model_name, description = description, tags = tags)

# Print the Model ID
print(remote_run.model_id)

In [None]:
# Create an environment to register and give it a name
autoMLenv = Environment.from_conda_specification(name = "<my-automl-environment>",
                                             file_path = environment_file_name)

# Register the environment to your workspace
autoMLenv.register(workspace=ws)

## Most people can stop here and move on to the next notebook.  
If you want to use a pandas dataframe and train an AutoML model locally, follow the code below.

### First, drop all columns not required in your machine learning model and assign your label column
You can use pandas dataframes for local runs.  Tabular data will also work on local runs.

In [None]:
# Drop any column that isn't appropriate to add into the model
# For example, ID columns, redundant columns or columns with only one value should be dropped
trainDF = trainTransformedDF.drop(['<MyIdColumn>','<MyRedundantColumn>','<MySingleValueColumn>'], axis=1)

In [None]:
# Next, assign the name of the column you are trying to predict to a variable.
label = '<MyLabelColumn>'

### Next, configure your AutoML settings
There are numerous configuration options within AutoML; task, primary metric, featurization and explainability are the most important.<br>
Set <b>task</b> to classification, regression or forecasting, depending on the type of problem you are trying to solve.<br>
Set <b>Primary Metric</b> to the metric you want to optimize, such as accuracy for classification or RMSE (normalized_root_mean_squared_error in AutoML) for regression.<br>
Setting <b>Featurization</b> to auto automatically one-hot encodes categorical values, drops high-cardinality categorical columns, imputes missing values across all types of columns, generates numerous datetime features, and creates many features from text data.<br>
Set <b>Model Explainability</b> to True to obtain a ranked list of the features used by the AutoML model.

For a list and explanation of configurations, click the link below: <br>
https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py

For a list of primary metrics based on problem type, click the following:
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train

In [None]:
automl_local_settings = {
    #"experiment_timeout_minutes": 20,  # Use this for testing to limit the autoML run
    "enable_early_stopping" : True,    # Enable this to end the experiment once results stop improving.  Always use this.
    #"iteration_timeout_minutes": 5,   # Enable this to limit how long each model takes to run
    "max_concurrent_iterations": 4,    # Match this value to the max number of nodes in your cluster
    #"max_cores_per_iteration": -1,     # Only used for DNN 
    "n_cross_validations": 5,         # This is the number of splits to use for cross validation
    "featurization": 'auto',           # Set to auto to preprocess data
    "preprocess": True,                # Older True/False preprocessing flag; newer SDK versions use featurization instead
    "enable_dnn": False,               # Enables Deep Neural Networks for appropriate problems
    "enable_tf": False,                # Enables Tensorflow algorithms for appropriate problems
    "verbosity": logging.INFO,         # Enables logging
}

automl_local_config = AutoMLConfig(task = '<my-problem-type>',         # Classification, regression or forecasting
                             primary_metric = '<my-metric>',     # Select the metric to be optimized through autoML 
                             num_classes = 5,                    # Number of classes in the label column (classification only)
                             debug_log = 'automl_errors.log',    # Assigns the debug log name
                             #compute_target=cpu_cluster,         # Turn this off for local runs.
                             experiment_exit_score = 0.99,       # Threshold to end autoML runs prematurely 
                             #blacklist_models = ['Prophet'],        # Use to blacklist models
                             #whitelist_models = ['KNN'],        # Runs only your selected models
                             enable_onnx_compatible_models=False,# Enables/disables enforcing onnx compatible models
                             training_data = trainDF,           # Sets the training data
                             label_column_name = label,          # Sets the column to predict in your training data
                             model_explainability=True,          # Enables/disables model explainability
                             enable_voting_ensemble=True,        # Enables/disables voting ensemble algorithm
                             enable_stack_ensemble=True,         # Enables/disables stack ensemble algorithm
                             **automl_local_settings
                            )


## Set up your Experiment
An experiment is a grouping of many runs from a specified script. It always belongs to a workspace. When you submit a run, you provide an experiment name. Information for the run is stored under that experiment. If you submit a run and specify an experiment name that doesn't exist, a new experiment with that newly specified name is automatically created.

In [None]:
# choose a name for experiment appropriate to the project
local_experiment_name = '<my-experiment-name>'

local_experiment=Experiment(ws, local_experiment_name)

# Output a nice table with all of the essential experiment information
output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Experiment Name'] = local_experiment.name
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is rejected by recent pandas versions
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

In [None]:
# Train your model
# Local runs give more detailed output than remote runs.
local_run = local_experiment.submit(automl_local_config, show_output = True)

In [None]:
# Run the Widget to best view Model Results
from azureml.widgets import RunDetails
RunDetails(local_run).show() 

In [None]:
# Compare your results to the default model accuracy.  This is the number to beat.  Pick one depending on your Problem.

# For Classification problems, the default model accuracy is simply predicting the most common class every time.
# Replace MyLabelColumn with the name of your label column.
Default_Model_Accuracy = trainTransformedDF[trainTransformedDF.MyLabelColumn=='<my-most-common-value>'].MyLabelColumn.count()/trainTransformedDF.MyLabelColumn.count()
print(Default_Model_Accuracy)

In [None]:
# Compare your results to the default model accuracy.  This is the number to beat.  Pick one depending on your Problem.

# For Regression problems, the default model score is the RMSE you get by always predicting the average.
# Replace MyPredictionColumn with the name of your label column.
# Local variables are used so that no helper columns are added to the training DataFrame.
default_prediction = np.mean(trainTransformedDF.MyPredictionColumn)
squared_error = (trainTransformedDF['MyPredictionColumn'] - default_prediction)**2
Default_Model_RMSE = np.sqrt(np.mean(squared_error))  # The square root turns MSE into RMSE
print(Default_Model_RMSE)

## Obtain and Register Model to Machine Learning Workspace
Next up is retrieving your model and registering it to your workspace.  Registering your model lets you deploy it, run it in pipelines, and store it for later use.

### Download your scoring file and your environment file to your local notebook.
To deploy models, Azure ML requires a scoring script that uses your model to make predictions on new data.  It also requires an environment file listing all of the packages needed to run the scoring script.  Here, we retrieve the best run with AutoML's get_output() function, then download both files from that run to the local VM.

In [None]:
best_local_run, fitted_local_model = local_run.get_output()

In [None]:
# Create a directory to store your files first
inference_folder = os.path.join(os.getcwd(), 'inference')
os.makedirs(inference_folder, exist_ok=True)

# Specify names for your scoring script and environment file
script_file_name_local = 'inference/score_local.py'
environment_file_name_local = 'inference/AutoML_local.yml'

# Download the files locally
best_local_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name_local)
best_local_run.download_file('outputs/conda_env_v_1_0_0.yml', environment_file_name_local)

In [None]:
# Give a meaningful description and tags to your autoML model
description = '<my-model-description>'
tags = {"<my-tag-name>": "<my-tag-value>", "<my-tag-name2>": "<my-tag-value2>"}

# Retrieve the model_name from the autoML run
model_name_local = best_local_run.properties['model_name']

# Register your model, set tags and description
model_local = local_run.register_model(model_name = model_name_local, description = description, tags = tags)

# Print the Model ID
print(local_run.model_id)

In [None]:
# Create an environment to register and give it a name
autoMLenv_local = Environment.from_conda_specification(name = "<my-automl-environment>",
                                             file_path = environment_file_name_local)

# Register the environment to your workspace
autoMLenv_local.register(workspace=ws)

<br><br><br><br><br>



Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.