# Training Models

The central goal of machine learning is to train predictive models that applications can use. In Azure Machine Learning, you can use scripts to train models with common machine learning frameworks such as Scikit-Learn, TensorFlow, PyTorch, and SparkML. You can run these training scripts as experiments to track metrics and outputs, in particular the trained models.

## Connect to Your Workspace

First, connect to your workspace using the Azure ML SDK.

> **Note**: If you do not have a current authenticated session with your Azure subscription, you'll be prompted to authenticate. Follow the instructions to authenticate using the code provided.

In [1]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.13.0 to work with customer_360_ws
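
`Workspace.from_config()` reads the workspace details from a saved JSON configuration file, typically named `config.json`. As a minimal sketch of the shape of that file, the snippet below writes one into a temporary folder and reads it back. The three values are placeholders rather than real identifiers, and the temporary location is an assumption made only for illustration.

```python
import json
import os
import tempfile

# Placeholder values (assumptions for illustration); replace with your own details.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write the file to a temporary folder so nothing in your project is overwritten.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(workspace_config, f, indent=4)

# Read it back to confirm the three keys are present.
with open(config_path) as f:
    loaded_config = json.load(f)
print(sorted(loaded_config))
```

In a real project you would normally download this file from the Azure portal rather than write it by hand.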


## Create a Training Script

You're going to use a Python script to train a machine learning model based on the flight_delays data, so let's start by creating a folder for the script and data files.

In [2]:
import os

experiment_folder = 'flight_delays'
os.makedirs(experiment_folder, exist_ok=True)

print('Folder ready.')

Folder ready.


In [3]:
%%writefile $experiment_folder/flight_delays_training.py
# Import libraries
import argparse
import os
import json
import joblib
from azureml.core import Run
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns


# Set regularization parameter
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate




# Get the experiment run context
run = Run.get_context()

# Load the flight delays dataset
print("Loading Data...")


dataset = run.input_datasets['flight_delays_data'].to_pandas_dataframe().dropna() # Get the training data from the estimator input

# Remove target leakers and features that are not useful
target_leakers = ['DepDel15','ArrDelay','Cancelled','Year']
dataset.drop(columns=target_leakers, inplace=True)

# convert some variables to categorical features
columns_as_categorical = ['OriginAirportID','DestAirportID','ArrDel15']
dataset[columns_as_categorical] = dataset[columns_as_categorical].astype('object')

categorical_feature_mask = dataset.dtypes == object 
categorical_cols = dataset.columns[categorical_feature_mask].tolist()

le = LabelEncoder()
dataset[categorical_cols] = dataset[categorical_cols].apply(lambda col:le.fit_transform(col))
dataset = dataset.dropna()

# Separate features and labels
X, y = dataset[['Month', 'DayofMonth', 'DayOfWeek', 'Carrier', 'OriginAirportID', 'DestAirportID', 'CRSDepTime', 'DepDelay', 'CRSArrTime']].values, dataset['ArrDel15'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))


# work on confusion matrix

cnf_matrix = confusion_matrix(y_test, y_hat)

cnf_matrix_list = cnf_matrix.tolist()
cnf_matrix_json_ = json.dumps(cnf_matrix_list)
run.log_confusion_matrix(name='Confusion_matrix', value=cnf_matrix_json_)

class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
run.log_image(name='confusion_matrix_img', plot=fig)


os.makedirs('outputs', exist_ok=True)
# Note: files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/flight_delays_model.pkl')

run.complete()

Overwriting flight_delays/flight_delays_training.py
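
The script reads its regularization rate from the command line and passes the inverse to scikit-learn, because `LogisticRegression` takes `C`, the inverse of regularization strength. The sketch below isolates just that step, using the same argument definition as the script. Passing an explicit list to `parse_args` is a stand-in for the real command line, and the name `inverse_strength` is introduced here only for illustration.

```python
import argparse

# Same argument definition as in the training script
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate',
                    default=0.01, help='regularization rate')

# An explicit list stands in for the command line built from the script parameters
args = parser.parse_args(['--regularization', '0.1'])
reg = args.reg_rate

# scikit-learn's C is the inverse of regularization strength,
# so a larger rate gives a smaller C (stronger regularization)
inverse_strength = 1 / reg
print('reg =', reg, 'C =', inverse_strength)

# With no argument supplied, the default of 0.01 applies
default_reg = parser.parse_args([]).reg_rate
print('default reg =', default_reg)
```

Because `dest='reg_rate'` is set, the parsed value is read as `args.reg_rate` rather than `args.regularization`.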


## Define an Environment

When you run a Python script as an experiment in Azure Machine Learning, a Conda environment is created to define the execution context for the script. Azure Machine Learning provides a default environment that includes many common packages, including the **azureml-defaults** package that contains the libraries necessary for working with an experiment run, as well as popular packages like **pandas** and **numpy**.

You can also define your own environment and add packages by using **conda** or **pip**, to ensure your experiment has access to all the libraries it requires. 

Run the following cell to create an environment for the flight delays experiment.

In [4]:

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Create a Python environment for the experiment
flight_delays_env = Environment("flight-delays-experiment-env")
flight_delays_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
flight_delays_env.docker.enabled = True # Use a docker container

# Create a set of package dependencies (conda or pip as required)
flight_delays_packages = CondaDependencies.create(conda_packages=['scikit-learn'],
                                          pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]', 'matplotlib', 'seaborn'])

# Add the dependencies to the environment
flight_delays_env.python.conda_dependencies = flight_delays_packages

print(flight_delays_env.name, 'defined.')

# Register the environment
flight_delays_env.register(workspace=ws)
print(flight_delays_env.name, 'registered.')

flight-delays-experiment-env defined.
flight-delays-experiment-env registered.
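
To make the split between **conda** and **pip** packages more concrete, the sketch below builds a plain dictionary in the general shape of a conda specification, where pip packages are nested under a `pip` entry inside the dependency list. This is an approximation for illustration only: the name `illustrative-env` is hypothetical, and the specification Azure ML actually generates may contain additional entries such as a Python version and channels.

```python
import json

# The same package lists used when defining the environment
conda_packages = ['scikit-learn']
pip_packages = ['azureml-defaults', 'azureml-dataprep[pandas]', 'matplotlib', 'seaborn']

# In a conda specification, pip packages are nested under a 'pip' entry
# inside the list of conda dependencies.
conda_spec = {
    'name': 'illustrative-env',  # hypothetical name, not the registered environment
    'dependencies': conda_packages + ['pip', {'pip': pip_packages}],
}

print(json.dumps(conda_spec, indent=2))
```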


Now you can use the environment for the experiment by assigning it to an Estimator (or RunConfig).

The estimator code later in this notebook assigns the environment you created and submits an experiment. As the experiment runs, observe the run details in the widget. In the **azureml_logs/60_control_log.txt** output log, you'll see the conda environment being built.

## Run an Experiment on a Remote Compute Target

In many cases, your local compute resources may not be sufficient for a complex or long-running experiment that needs to process a large volume of data, and you may want to take advantage of the ability to dynamically create and use compute resources in the cloud.

Azure ML supports a range of compute targets, which you can define in your workspace and use to run experiments, paying for the resources only when you use them. In this case, we'll run the flight delays training experiment on a compute cluster with a unique name of your choosing, so let's verify that it exists (and if not, create it) so we can use it to run training experiments.

> **Important**: Change *aml-cluster* in the code below to a unique name for your compute cluster before running it!

In [5]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "aml-cluster"

try:
    # Get the cluster if it exists
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2', max_nodes=2)
    training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

training_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


Now you're ready to run the experiment on the compute you created. You can do this by specifying the **compute_target** parameter in the estimator. You can set this to either the name of the compute target or a **ComputeTarget** object.

You'll also reuse the environment you registered previously.

In [6]:
from azureml.train.sklearn import SKLearn
from azureml.core import Environment, Experiment
from azureml.widgets import RunDetails

# Get the environment
registered_env = Environment.get(ws, 'flight-delays-experiment-env')

# specify cluster name
cluster_name = "aml-cluster"

# Set the script parameters
script_params = {
    '--regularization': 0.1
}

# Get the training dataset
flight_delays_ds = ws.datasets.get("flight_delays_data")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                      inputs=[flight_delays_ds.as_named_input('flight_delays_data')],
                      script_params=script_params,
                      compute_target = cluster_name, # Run the experiment on the remote compute target
                      environment_definition = registered_env,
                      entry_script='flight_delays_training.py')

# Create an experiment
experiment = Experiment(workspace = ws, name = 'flight-delays-training')

# Run the experiment
run = experiment.submit(config=estimator)
# Show the run details while running
RunDetails(run).show()
run.wait_for_completion()





### Register Model

In [None]:
# Register the model
run.register_model(model_path='outputs/flight_delays_model.pkl', model_name='flight_delays_model',
                   tags={'Training context':'Parameterized SKLearn Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})
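
The registration call above looks up each metric by key, which raises a `KeyError` if the script did not log that metric. As a hedged alternative, the sketch below selects only the metrics that are present. The helper `select_properties` is hypothetical, and the values in `example_metrics` are made-up placeholders rather than results from a real run.

```python
def select_properties(metrics, keys=('AUC', 'Accuracy')):
    """Return only the requested metrics that are actually present."""
    return {key: metrics[key] for key in keys if key in metrics}


# Made-up placeholder values; a real run returns whatever the script logged
example_metrics = {'Regularization Rate': 0.1, 'Accuracy': 0.5, 'AUC': 0.5}

properties = select_properties(example_metrics)
print(properties)

# A metric that was never logged is skipped instead of raising KeyError
partial = select_properties({'Accuracy': 0.5})
print(partial)
```

With a real run, you could call `run.get_metrics()` once, pass the result to a helper like this, and use the returned dictionary as the `properties` argument.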