# Model Training with AutoML

In this lab you will use the automated machine learning (AutoML) capabilities within the Azure Machine Learning service to automatically train multiple models with varying algorithms and hyperparameters, select the best performing model, and register that model.

## Download the datasets

The following cell points to the dataset used by this lab and lists the downloaded files. Click into the cell and use `Shift + Enter` to execute it.

In [5]:
import uuid
import os

demoName = 'churn'
tempFolderName = '/mnt/demodata/{0}/'.format(demoName)

#List all downloaded files
dbutils.fs.ls(tempFolderName)

## Train a model using AutoML

This lab builds upon the lessons learned in the previous lab, but is self-contained, so you can work through it without having to run a previous lab.

In the following cells you load the data prepared in previous labs and acquire (or create) an instance of your Azure Machine Learning Workspace. Be sure the values for `subscription_id`, `resource_group`, `workspace_name` and `workspace_region` are set as directed by the comments. Execute each cell in turn.

In [9]:
# Step 1 - Load training data and prepare Workspace
###################################################
import os
import pickle
import numpy as np
import pandas as pd
from sklearn import linear_model
try:
    import joblib
except ImportError:  # older scikit-learn releases bundle joblib
    from sklearn.externals import joblib
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import azureml
from azureml.core import Workspace
from azureml.core.run import Run
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig

In [10]:
def preprocess(input_df):

  # Work on a copy so the caller's DataFrame is not modified
  df = input_df.drop_duplicates().copy()

  # Drop the identifier and date columns
  df = df.drop(['year','month','customerid'], axis=1)

  # One-hot encode every remaining categorical column,
  # using a distinct prefix per column so the generated names do not collide
  cat_cols = list(df.select_dtypes(include=['object','category']).columns)
  df = pd.get_dummies(data=df, columns=cat_cols, prefix=['oh_' + c for c in cat_cols])

  return df

In [11]:
def preprocess_simple(input_df):

  df = input_df.drop_duplicates()

  # Drop the columns that are not used as features
  df = df.drop(['year', 'customerid', 'month', 'state', 'occupation',
                'education', 'noadditionallines', 'gender'], axis=1)

  # Encode the binary answers as 1/0
  cols = ["homeowner","usesinternetservice","customersuspended","usesvoiceservice"]
  df[cols] = df[cols].replace({'Yes':1, 'No':0})
  df["maritalstatus"] = df["maritalstatus"].replace({'Married':1, 'Single':0})

  # One-hot encode any remaining categorical columns
  df = pd.get_dummies(data=df)

  return df
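
`pd.get_dummies` turns each remaining text column into one indicator column per distinct value. As a rough illustration of that idea without pandas, the hypothetical `one_hot` helper below does the same for a single column of made-up values. The column name `plan` and its values are assumptions for the example only; the real function also handles whole DataFrames, missing values, and dtypes.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one indicator column per distinct value."""
    categories = sorted(set(values))
    return {
        "%s_%s" % (prefix, category): [1 if v == category else 0 for v in values]
        for category in categories
    }

# Made-up column values for illustration (not taken from the lab dataset)
encoded = one_hot(["Basic", "Premium", "Basic"], prefix="plan")
print(encoded)
```

Each row has exactly one `1` across the generated columns, which is what lets a model treat the categories as unordered.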

In [12]:
# Verify AML SDK Installed
# view version history at https://pypi.org/project/azureml-sdk/#history 
print("SDK Version:", azureml.core.VERSION)

# Load our training data set
pathToCsvFile = os.path.join('/dbfs' + tempFolderName, 'CATelcoCustomerChurnTrainingSample.csv')
df = pd.read_csv(pathToCsvFile, delimiter=',')
df = preprocess_simple(df)

full_X = df.loc[:, df.columns != 'churn']
full_Y = df[["churn"]]

In [13]:
print(full_X.info(verbose=True))
print(full_Y.info(verbose=True))

In [14]:
# Read the AML config.json (downloaded from the Azure portal workspace) from Azure storage
ws = Workspace.from_config(_file_name = "/dbfs/mnt/demodata/config/config.json")

print("Workspace Provisioning complete.")
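
`Workspace.from_config` reads the workspace coordinates from a small JSON file. As a sketch, assuming the file holds the three keys `subscription_id`, `resource_group` and `workspace_name`, a helper like the hypothetical `check_workspace_config` below can catch a missing or empty value before you call the SDK. It uses only the standard library, does not contact Azure, and the values in the demo file are placeholders.

```python
import json
import os
import tempfile

# Keys assumed to be present in a workspace config.json
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")

def check_workspace_config(path):
    """Return the list of required keys that are missing or empty in the file."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]

# Usage with a throwaway file holding placeholder values (not real identifiers)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w") as f:
        json.dump({"subscription_id": "<subscription-id>",
                   "resource_group": "<resource-group>",
                   "workspace_name": ""}, f)
    missing = check_workspace_config(demo_path)
    print("Missing or empty keys:", missing)
```

In the lab you would point the helper at the mounted `config.json` path instead of the temporary file.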


To train a model using AutoML you need only provide a configuration that defines items such as the type of model (classification or regression), the performance metric to optimize, exit criteria in terms of maximum training time, iterations and desired performance, any algorithms that should not be used, and the path into which to output the results. This configuration is specified using the `AutoMLConfig` class, which is then used to drive the submission of an experiment via `experiment.submit`. When AutoML finishes the parent run, you can get the best performing run and model from the returned run object by using `run.get_output()`. Execute the following cell to define the helper function that wraps the AutoML job submission.

In [16]:
# Step 2 - Define a helper method that will use AutoML to train multiple models and pick the best one
##################################################################################################### 
def auto_train_model(ws, experiment_name, model_name, full_X, full_Y,training_set_percentage, training_target_accuracy):

    # start a training run by defining an experiment
    experiment = Experiment(ws, experiment_name)
    
    train_X, test_X, train_Y, test_Y = train_test_split(full_X, full_Y, train_size=training_set_percentage, random_state=42)

    train_Y_array = train_Y.values.flatten()

    # Configure the automated ML job
    # The model training is configured to run on the local machine
    # The values for all settings are documented at https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train
    # Notice we no longer have to scale the input values, as AutoML will try various data scaling approaches automatically
    automl_config = AutoMLConfig(task = 'classification',
                                 primary_metric = 'accuracy',
                                 iteration_timeout_minutes = 200,
                                 iterations = 10,
                                 blacklist_models = ['KNN','LinearSVM'],
                                 enable_stack_ensemble = False,
                                 X = train_X,
                                 y = train_Y_array,
                                 path='./outputs')

    # Execute the job
    run = experiment.submit(automl_config, show_output=True)

    # Get the run with the highest accuracy value.
    best_run, best_model = run.get_output()

    return (best_model, run, best_run)
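
Conceptually, the loop that AutoML automates looks like the sketch below: train several candidates, score each on the primary metric, stop when the iteration budget is spent or a target score is reached, and keep the best. This is a toy illustration in plain Python, not how the service is implemented. The candidate "models" are hypothetical threshold rules on a single made-up feature, and `pick_best` is an illustrative name.

```python
# Toy data: (feature value, label). Values are made up for illustration.
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 1), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

def make_threshold_model(threshold):
    """Hypothetical 'model': predict 1 when the feature exceeds the threshold."""
    return lambda x: 1 if x > threshold else 0

def accuracy(model, rows):
    """Fraction of rows the model labels correctly (the primary metric here)."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def pick_best(candidates, rows, max_iterations, target_score):
    """Try candidates in order; stop at the iteration limit or the target score."""
    best_name, best_score = None, -1.0
    for iteration, (name, model) in enumerate(candidates):
        if iteration >= max_iterations:
            break
        score = accuracy(model, rows)
        if score > best_score:
            best_name, best_score = name, score
        if best_score >= target_score:
            break
    return best_name, best_score

candidates = [("threshold=%.2f" % t, make_threshold_model(t)) for t in (0.15, 0.5, 0.38, 0.75)]
best_name, best_score = pick_best(candidates, data, max_iterations=10, target_score=0.99)
print(best_name, best_score)
```

The fourth candidate is never evaluated because the third already meets the target score, which mirrors how an exit criterion can end a run before the iteration limit.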

In [17]:
# Step 3 - Execute the AutoML driven training
# invoke the AutoML job to begin the training. Execute the following cell.
#############################################
experiment_name = "Experiment-AutoML-Churn"
model_name = "churnmodel"
training_set_percentage = 0.50
training_target_accuracy = 0.90
best_model, run, best_run = auto_train_model(ws, experiment_name, model_name, full_X, full_Y, training_set_percentage, training_target_accuracy)

# Examine some of the metrics for the best performing run
import pprint
pprint.pprint({k: v for k, v in best_run.get_metrics().items() if isinstance(v, float)})
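
The dictionary returned by `best_run.get_metrics()` can mix scalar scores with other entries, which is why the cell above keeps only the `float` values. A small stand-alone sketch of that filter follows, using a made-up dictionary in place of real run metrics (the keys and values are illustrative, not output from this lab):

```python
import pprint

# Made-up stand-in for the dictionary returned by best_run.get_metrics()
example_metrics = {
    "accuracy": 0.91,
    "AUC_weighted": 0.88,
    "confusion_matrix": "<artifact reference>",  # non-scalar entries are skipped
    "iteration": 7,                              # ints are skipped too
}

# Same comprehension as in the cell above: keep only float-valued metrics
scalar_metrics = {k: v for k, v in example_metrics.items() if isinstance(v, float)}
pprint.pprint(scalar_metrics)
```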

Try out the best model by executing the following cell.

In [19]:
# Step 4 - Try the best model
#############################
age = 20
annualincome = 137977
calldroprate = 0.05
callfailurerate = 0.03
callingnum = 4251042488
customerid = 4
customersuspended = "Yes"
education = "PhD or equivalent"
gender = "Male"
homeowner = "No"
maritalstatus = "Single"
monthlybilledamount = "74"
noadditionallines = r"\N"
numberofcomplaints = 1
numberofmonthunpaid = 7
numdayscontractequipmentplanexpiring = 73
occupation = "Technology Related Job"
penaltytoswitch = 76
state = "KY"
totalminsusedinlastmonth = 412
unpaidbalance = 159
usesinternetservice = "Yes"
usesvoiceservice = "No"
percentagecalloutsidenetwork = 0.94
totalcallduration = 834
avgcallduration = 834
year = 2015
month = 1
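# The model was trained on the output of preprocess_simple, so the raw sample
# must go through the same preprocessing and column order before predict()
sample_columns = ['age','annualincome','calldroprate','callfailurerate','callingnum','customerid','customersuspended','education','gender','homeowner','maritalstatus','monthlybilledamount','noadditionallines','numberofcomplaints','numberofmonthunpaid','numdayscontractequipmentplanexpiring','occupation','penaltytoswitch','state','totalminsusedinlastmonth','unpaidbalance','usesinternetservice','usesvoiceservice','percentagecalloutsidenetwork','totalcallduration','avgcallduration','year','month']
sample_values = [age,annualincome,calldroprate,callfailurerate,callingnum,customerid,customersuspended,education,gender,homeowner,maritalstatus,monthlybilledamount,noadditionallines,numberofcomplaints,numberofmonthunpaid,numdayscontractequipmentplanexpiring,occupation,penaltytoswitch,state,totalminsusedinlastmonth,unpaidbalance,usesinternetservice,usesvoiceservice,percentagecalloutsidenetwork,totalcallduration,avgcallduration,year,month]
sample = pd.DataFrame({name: [value] for name, value in zip(sample_columns, sample_values)})
# monthlybilledamount is given as a string above; it is assumed to be numeric in the training CSV
sample['monthlybilledamount'] = pd.to_numeric(sample['monthlybilledamount'])
# Align to the training columns; columns missing from the sample are filled with 0
sample = preprocess_simple(sample).reindex(columns=full_X.columns, fill_value=0)
print(best_model.predict(sample))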
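
A fitted model expects the same columns, in the same order, that it saw during training. The sketch below shows the idea in plain Python with a hypothetical helper `to_feature_row`: ignore fields that were dropped before training, map the Yes/No style answers to 1/0, and order the rest by the training column list. The column list here is a shortened, illustrative subset, not the full schema of the lab's dataset.

```python
# Illustrative subset of training columns, in training order (assumption)
training_columns = ["age", "annualincome", "customersuspended", "homeowner", "maritalstatus"]

# Encodings mirroring the replacements made in preprocess_simple
BINARY_MAP = {"Yes": 1, "No": 0, "Married": 1, "Single": 0}

def to_feature_row(record, columns):
    """Build one model-ready row: keep only training columns, in order, encoded."""
    row = []
    for name in columns:
        value = record[name]  # raises KeyError if a required field is missing
        row.append(BINARY_MAP.get(value, value) if isinstance(value, str) else value)
    return row

raw_record = {
    "customerid": 4, "state": "KY",  # dropped before training, so ignored here
    "age": 20, "annualincome": 137977,
    "customersuspended": "Yes", "homeowner": "No", "maritalstatus": "Single",
}
feature_row = to_feature_row(raw_record, training_columns)
print(feature_row)
```

Failing loudly on a missing field is a deliberate choice in this sketch: a silently defaulted feature would still produce a prediction, just not a trustworthy one.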

## Register an AutoML created model

You can register models created by AutoML with Azure Machine Learning just as you would any other model. Execute the following cell to register this model.

In [22]:
# Step 5 - Register the best performing model for later use and deployment
#################################################################
# notice the use of the root run (not best_run) to register the best model
run.register_model(description='AutoML trained Churn classifier')