Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
_**Classification with Local Compute**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Results](#Results)
1. [Test](#Test)



## Introduction

In this example we use scikit-learn's [digits dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) to show how you can use AutoML for a simple classification problem.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Configure AutoML using `AutoMLConfig`.
3. Train the model using local compute.
4. Explore the results.
5. Test the best fitted model.

## Setup

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [3]:
import logging

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

In [4]:
ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-classification'
project_folder = './sample_projects/automl-classification'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

Found the config file in: /home/nbuser/library/how-to-use-azureml/automated-machine-learning/classification/config.json


|Setting|Value|
|-|-|
|Experiment Name|automl-classification|
|Location|northeurope|
|Project Directory|./sample_projects/automl-classification|
|Resource Group|customerchurn|
|SDK version|1.0.17|
|Subscription ID|a2a1fc9f-5671-4479-8922-ad16e34c0fdc|
|Workspace Name|customerchurn|


## Data

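The first cell below loads the digits data with scikit-learn's [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) function. The cell after it loads a churn dataset from a CSV file and reassigns `X_train` and `y_train`, so the churn data is what AutoML trains on in the rest of this notebook.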

In [10]:
digits = datasets.load_digits()
print(type(digits))

# Exclude the first 100 rows from training so that they can be used for test.
X_train = digits.data[100:,:]
y_train = digits.target[100:]

print(X_train.shape)

<class 'sklearn.utils.Bunch'>
(1697, 64)
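
The hold-out split used above can be sketched on its own. This is a minimal illustration, assuming a placeholder matrix with the same shape as the digits data (1797 rows, 64 columns); the names `features`, `n_test`, `X_test_demo`, and `X_train_demo` are hypothetical and not part of the notebook. It shows that using the same boundary for both slices puts every row in exactly one partition.

```python
import numpy as np

# Stand-in feature matrix with the digits dataset's shape (1797 rows, 64 columns).
# The values are placeholders, not real pixel data.
features = np.arange(1797 * 64).reshape(1797, 64)

n_test = 100  # number of leading rows held out for testing

X_test_demo = features[:n_test]    # rows 0..99
X_train_demo = features[n_test:]   # rows 100..1796

print(X_test_demo.shape)   # (100, 64)
print(X_train_demo.shape)  # (1697, 64)

# Stacking the two partitions gives back the original matrix,
# so no row is lost or duplicated by the split.
reassembled = np.vstack([X_test_demo, X_train_demo])
```

Keeping the boundary in a single variable such as `n_test` avoids off-by-one gaps between the training and test slices.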


In [5]:
import pandas as pd

# Load the churn dataset.
data = pd.read_csv("churndataset.csv")
print("Data shape: " + str(data.shape))

# Hold out the first 100 rows for testing. X = features: the first 20 columns.
X_train = data.iloc[100:,0:20]
# y = label: the final (21st) column, churn, squeezed to a 1-D array.
y_train = data.iloc[100:,20:21].values
y_train = np.squeeze(y_train)

print("X Shape: " + str(X_train.shape) + " and y shape: " + str(y_train.shape))
print("X_TRAIN")
print(X_train.head())
print("Y_TRAIN")
print(y_train.shape)


Data shape: (3333, 21)
X Shape: (3233, 20) and y shape: (3233,)
X_TRAIN
    state  account length  area code phone number international plan  \
100  IA    98              510        379-6506     no                  
101  MA    108             415        347-7741     no                  
102  VT    135             415        354-3783     no                  
103  KY    95              408        401-7594     no                  
104  IN    122             408        397-4976     no                  

    voice mail plan  number vmail messages  total day minutes  \
100  yes             21                    161.20               
101  no              0                     178.30               
102  no              0                     151.70               
103  no              0                     135.00               
104  no              0                     170.50               

     total day calls  total day charge  total eve minutes  total eve calls  \
100  114             27.40
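
The label extraction in the cell above slices one column with `iloc[:, 20:21]` and then calls `np.squeeze`. The sketch below shows why the squeeze is needed, using a hypothetical three-row frame; the column names and values are made up for illustration and do not come from the churn data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; column names and values are made up for illustration.
frame = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10, 20, 30],
    "label": [False, True, False],
})

features = frame.iloc[:, 0:2]            # DataFrame holding the first two columns
label_2d = frame.iloc[:, 2:3].values     # a column slice keeps two dimensions
label_1d = np.squeeze(label_2d)          # drop the length-1 axis
label_direct = frame.iloc[:, 2].values   # a single integer position is already 1-D

print(label_2d.shape)      # (3, 1)
print(label_1d.shape)      # (3,)
print(label_direct.shape)  # (3,)
```

Either route yields the same 1-D label array; selecting the column by a single position simply skips the intermediate 2-D step.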

## Train

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross validation splits.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
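
To make `n_cross_validations` more concrete, here is a minimal sketch of how a dataset can be divided into validation folds. It is an illustration of the general k-fold idea only: the helper `k_fold_indices` is hypothetical, and AutoML may split the data differently (for example with shuffling or stratification). The row count 3233 matches the training shape printed earlier.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) pairs for contiguous folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional row each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(3233, 3))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)  # [1078, 1078, 1077]
```

Each row lands in exactly one validation fold, so every row is used for validation once and for training in the other two splits.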

In [6]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             iteration_timeout_minutes = 60,
                             iterations = 5,
                             n_cross_validations = 3,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             preprocess=True,
                             path = project_folder)
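
The configuration above optimizes `AUC_weighted`. As a rough intuition for what an AUC score measures, the sketch below computes the binary AUC as the fraction of (positive, negative) pairs that the scores rank correctly. The function `binary_auc` and the sample labels and scores are hypothetical; AutoML's weighted averaging of per-class scores is not reproduced here.

```python
def binary_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores, for illustration only.
labels = [False, False, True, False, True]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]
auc = binary_auc(labels, scores)

# A model that gives every row the same score ranks nothing, so its AUC is 0.5.
constant_auc = binary_auc(labels, [0.5] * len(labels))
```

Because AUC depends on ranking rather than on a fixed threshold, a model that always predicts the majority class does not score well on it, which can matter for imbalanced labels.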

Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.
In this example, we specify `show_output = True` to print currently running iterations to the console.

In [7]:
local_run = experiment.submit(automl_config, show_output = True)

Running on local machine
Parent Run ID: AutoML_672b2e01-41ed-4de0-9dd9-fac506706b54
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          100.0000    0:00:27       0.8611    0.8611
         1   MaxAbsScaler LightGBM                          100.0000    0:00:26       0.8798    0.8798
         2   MaxAbsScaler LightGBM                          100

In [75]:
local_run

|Experiment|Id|Type|Status|Details Page|Docs Page|
|-|-|-|-|-|-|
|automl-classification|AutoML_28e16be1-8b97-4369-905a-2e9a9e43cc68|automl|Completed|Link to Azure Portal|Link to Documentation|


Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:

In [76]:
local_run = local_run.continue_experiment(X = X_train, 
                                          y = y_train, 
                                          show_output = True,
                                          iterations = 6)

No run_configuration provided, running locally with default configuration
Running on local machine
Parent Run ID: AutoML_28e16be1-8b97-4369-905a-2e9a9e43cc68
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         5   MaxAbsScaler LogisticRegression                100.0000    0:00:30       0.8380    0.9021
         6   StandardScalerWrapper LightGBM                 100.0000    0:00:25       0.7802

## Results

#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [8]:
from azureml.widgets import RunDetails
RunDetails(local_run).show() 



#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
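
The cell above collects one metrics dictionary per iteration and turns the result into a table. The sketch below shows the shape of that table without needing a run, using hypothetical metric values inserted out of order; only the dictionary-to-DataFrame pattern is taken from the notebook.

```python
import pandas as pd

# Hypothetical metrics keyed by iteration number, deliberately out of order.
# The values are made up for illustration.
metricslist = {
    2: {"accuracy": 0.90, "log_loss": 0.33},
    0: {"accuracy": 0.87, "log_loss": 0.40},
    1: {"accuracy": 0.89, "log_loss": 0.38},
}

# Outer keys become columns (iterations); inner keys become rows (metric names).
# sort_index(axis=1) orders the columns by iteration number.
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
print(rundata)
```

With iterations as columns, reading across a row compares one metric over all child runs.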

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The fitted model includes the pipeline and any preprocessing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [9]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: automl-classification,
Id: AutoML_672b2e01-41ed-4de0-9dd9-fac506706b54_4,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(logger=None, task=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('LightGBM', Pipeline(memory=None,
     steps=[('standardscalerwrapper', <automl.client.core.common.model_wrappe...er object at 0x7f649f17bdd8>)]))],
               flatten_transform=None, weights=[0.2, 0.2, 0.6]))])
Y_transformer(['LabelEncoder', LabelEncoder()])


#### Best Model Based on Any Other Metric
Show the run and the model that has the smallest `log_loss` value:

In [None]:
lookup_metric = "log_loss"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)
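
Choosing a "best" run depends on the metric's direction: a higher AUC is better, while a smaller `log_loss` is better. The sketch below illustrates that idea only; how `get_output` selects the run internally is not shown in this notebook, and the helper `best_iteration` and all metric values are hypothetical.

```python
# Hypothetical per-iteration metrics; the values are made up for illustration.
metrics_by_iteration = {
    0: {"AUC_weighted": 0.86, "log_loss": 0.40},
    1: {"AUC_weighted": 0.88, "log_loss": 0.38},
    2: {"AUC_weighted": 0.84, "log_loss": 0.33},
}


def best_iteration(metrics, metric, higher_is_better):
    """Return the iteration whose value for `metric` is best in the given direction."""
    chooser = max if higher_is_better else min
    return chooser(metrics, key=lambda iteration: metrics[iteration][metric])


best_by_auc = best_iteration(metrics_by_iteration, "AUC_weighted", higher_is_better=True)
best_by_log_loss = best_iteration(metrics_by_iteration, "log_loss", higher_is_better=False)

print(best_by_auc)       # 1
print(best_by_log_loss)  # 2
```

In this made-up example the two metrics pick different iterations, which is why it can be worth checking more than the primary metric.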

#### Model from a Specific Iteration
Show the run and the model from iteration 3 (iterations are numbered from 0):

In [None]:
iteration = 3
third_run, third_model = local_run.get_output(iteration = iteration)
print(third_run)
print(third_model)

## Test 

#### Load Test Data

In [10]:
# Rows 0-98 of the churn data form the test set. Training started at row 100,
# so row 99 is used in neither set.
print(data.shape)
X_test = data.iloc[:99,0:20]
y_test = data.iloc[:99,20:21].values
y_test = np.squeeze(y_test)

print("X: " + str(X_test.shape) + " Y: " + str(y_test.shape))

(3333, 21)
X: (99, 20) Y: (99,)


#### Testing Our Best Fitted Model
We predict the churn label for 20 randomly selected test rows and compare each prediction with the actual label.

In [16]:
# Randomly select 20 test rows and compare predictions with the actual labels.
for index in np.random.choice(len(y_test), 20, replace = False):
    predicted = fitted_model.predict(X_test[index:index + 1])[0]
    label = y_test[index]
    print("Index: " + str(index))
    print(" Prediction: " + str(predicted))
    print(" Actual Label: " + str(label))

Index: 56
 Prediction: False
 Actual Label: False
Index: 18
 Prediction: False
 Actual Label: False
Index: 26
 Prediction: False
 Actual Label: False
Index: 96
 Prediction: False
 Actual Label: False
Index: 71
 Prediction: False
 Actual Label: False
Index: 72
 Prediction: False
 Actual Label: False
Index: 36
 Prediction: False
 Actual Label: False
Index: 10
 Prediction: False
 Actual Label: True
Index: 54
 Prediction: False
 Actual Label: True
Index: 38
 Prediction: False
 Actual Label: False
Index: 11
 Prediction: False
 Actual Label: False
Index: 33
 Prediction: True
 Actual Label: True
Index: 37
 Prediction: False
 Actual Label: False
Index: 2
 Prediction: False
 Actual Label: False
Index: 43
 Prediction: False
 Actual Label: False
Index: 83
 Prediction: False
 Actual Label: False
Index: 16
 Prediction: False
 Actual Label: False
Index: 15
 Prediction: True
 Actual Label: True
Index: 77
 Prediction: False
 Actual Label: True
Index: 44
 Prediction: False
 Actual Label: False
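
The 20 sampled predictions printed above can be condensed into confusion counts. The sketch below copies the (prediction, actual) pairs from that output in the order shown; the helper `outcome` is hypothetical. Since it covers only this random sample of 20 rows, the numbers are indicative rather than a full evaluation of the model.

```python
from collections import Counter

# (prediction, actual) pairs copied from the 20 sampled rows printed above.
pairs = [
    (False, False), (False, False), (False, False), (False, False),
    (False, False), (False, False), (False, False), (False, True),
    (False, True), (False, False), (False, False), (True, True),
    (False, False), (False, False), (False, False), (False, False),
    (False, False), (True, True), (False, True), (False, False),
]


def outcome(predicted, actual):
    """Classify one prediction as TP, FP, FN, or TN (True is the positive class)."""
    if predicted and actual:
        return "TP"
    if predicted and not actual:
        return "FP"
    if not predicted and actual:
        return "FN"
    return "TN"


counts = Counter(outcome(p, a) for p, a in pairs)
accuracy = (counts["TP"] + counts["TN"]) / len(pairs)
recall = counts["TP"] / (counts["TP"] + counts["FN"])

print(dict(counts))
print("accuracy:", accuracy)  # 0.85
print("recall:", recall)      # 0.4
```

In this sample the model is right on 17 of 20 rows but finds only 2 of the 5 actual churners, so accuracy alone overstates how well the positive class is detected.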
