Azure ML & Azure Databricks notebooks by Parashar Shah.

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

![04ACI](files/tables/tables_image4.JPG)

# Automated ML on Azure Databricks

In this example we use scikit-learn's <a href="http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset" target="_blank">digit dataset</a> to showcase how you can use AutoML for a simple classification problem.

In this notebook you will learn how to:
1. Create an Azure Machine Learning `Workspace` object and initialize your notebook directory so that you can easily reload the object from a configuration file.
2. Create an `Experiment` in an existing `Workspace`.
3. Configure Automated ML using `AutoMLConfig`.
4. Train the model using Azure Databricks.
5. Explore the results.
6. Test the best fitted model.

Before running this notebook, please follow the <a href="https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/azure-databricks" target="_blank">readme for using Automated ML on Azure Databricks</a> to install the necessary libraries on your cluster.

The AML SDK with Automated ML can be installed as a library from the Databricks GUI. To attach it, follow <a href="https://docs.databricks.com/user-guide/libraries.html" target="_blank">this link</a> and enter the string below as your PyPI package. You can attach the library to all clusters or to a single cluster.

**azureml-sdk with automated ml**
* Source: Upload Python Egg or PyPI
* PyPI Name: `azureml-sdk[automl_databricks]`
* Select Install Library

### Check the Azure ML Core SDK Version to Validate Your Installation

In [1]:
import azureml.core

print("SDK Version:", azureml.core.VERSION)

SDK Version: 1.5.0
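
If you want to go one step further than printing the version, you can compare it against a minimum. The sketch below is only an illustration: `parse_version`, `meets_minimum`, and the minimum `1.5.0` are hypothetical names and values, not part of the SDK. It compares versions numerically, because a plain string comparison would rank `1.10.2` below `1.5.0`.

```python
import re

def parse_version(version):
    """Turn a version string such as '1.5.0' into a tuple of ints: (1, 5, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as '1.5' so that tuples always have three items.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

# Hypothetical minimum; replace it with the version your project requires.
MINIMUM_VERSION = "1.5.0"

def meets_minimum(installed, minimum=MINIMUM_VERSION):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)

print(meets_minimum("1.5.0"))
print(meets_minimum("1.10.2"))
print(meets_minimum("1.0.85"))
```

In the notebook you would pass `azureml.core.VERSION` as the `installed` argument.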


## Initialize an Azure ML Workspace
### What is an Azure ML Workspace and Why Do I Need One?

An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows.  In particular, an Azure ML workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, operationalization, and the monitoring of operationalized models.


### What do I Need?

To create or access an Azure ML workspace, you need to import the Azure ML library and specify the following information:
* A name for your workspace. You can choose any name.
* Your subscription id. Use the `id` value from the output of the `az account show` command.
* The resource group name. The resource group organizes Azure resources and provides a default region for the resources in the group. The resource group will be created if it doesn't exist. Resource groups can be created and viewed in the [Azure portal](https://portal.azure.com).
* The workspace region. Supported regions include `eastus2`, `eastus`, `westcentralus`, `southeastasia`, `westeurope`, `australiaeast`, `westus2`, `southcentralus`.
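
Because a typo in any of these values only surfaces once the service call fails, it can help to check them locally first. The following is a minimal sketch, not an SDK feature: `check_workspace_settings` is a hypothetical helper, the region set is copied from the list above (the service may support others), and the subscription id is a placeholder.

```python
import uuid

# Regions copied from the list above; the service may support others.
SUPPORTED_REGIONS = {
    "eastus2", "eastus", "westcentralus", "southeastasia",
    "westeurope", "australiaeast", "westus2", "southcentralus",
}

def check_workspace_settings(subscription_id, resource_group, workspace_name, region):
    """Return a list of problems found; an empty list means the values look plausible."""
    problems = []
    try:
        uuid.UUID(subscription_id)  # subscription ids are GUIDs
    except ValueError:
        problems.append("subscription_id is not a valid GUID")
    if not resource_group:
        problems.append("resource_group is empty")
    if not workspace_name:
        problems.append("workspace_name is empty")
    if region not in SUPPORTED_REGIONS:
        problems.append("region '{}' is not in the supported list".format(region))
    return problems

# Placeholder values for illustration only.
ok = check_workspace_settings(
    "00000000-0000-0000-0000-000000000000", "automl", "automl", "westeurope"
)
bad = check_workspace_settings("not-a-guid", "automl", "automl", "mars-central")
print(ok)
print(bad)
```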

In [2]:
subscription_id = "78c7c665-8c7e-4de1-9ec3-f1e5a03f0d04" #you should be owner or contributor
resource_group = "automl" #you should be owner or contributor
workspace_name = "automl" #your workspace name
workspace_region = "westeurope" #your region

## Creating a Workspace
If you already have access to an Azure ML workspace you want to use, you can skip this cell. Otherwise, this cell creates an Azure ML workspace in the specified subscription, provided you have the correct permissions for the given `subscription_id`.

This will fail when:
1. The workspace already exists and `exist_ok` is not set to `True`. The cell below passes `exist_ok=True`, so an existing workspace is reused instead.
2. You do not have permission to create a workspace in the resource group.
3. You are not a subscription owner or contributor and no Azure ML workspaces have ever been created in this subscription.

If workspace creation fails for any reason other than the workspace already existing, please work with your IT administrator to obtain the appropriate permissions or to provision the required resources.

**Note:** Creation of a new workspace can take several minutes.

In [3]:
# Import the Workspace class and create the workspace (an existing one is reused because exist_ok=True).
from azureml.core import Workspace

ws = Workspace.create(name = workspace_name,
                      subscription_id = subscription_id,
                      resource_group = resource_group, 
                      location = workspace_region,
                      exist_ok=True)
ws.get_details()



Performing interactive authentication. Please follow the instructions on the terminal.
Interactive authentication successfully completed.


{'id': '/subscriptions/78c7c665-8c7e-4de1-9ec3-f1e5a03f0d04/resourceGroups/automl/providers/Microsoft.MachineLearningServices/workspaces/automl',
 'name': 'automl',
 'location': 'westeurope',
 'type': 'Microsoft.MachineLearningServices/workspaces',
 'tags': {},
 'sku': 'Basic',
 'workspaceid': 'c6dc0f0b-a6c8-4c96-950b-dbb7607d6b0c',
 'description': '',
 'friendlyName': '',
 'creationTime': '2020-05-24T05:53:46.6216949+00:00',
 'keyVault': '/subscriptions/78c7c665-8c7e-4de1-9ec3-f1e5a03f0d04/resourcegroups/automl/providers/microsoft.keyvault/vaults/automl9707804087',
 'applicationInsights': '/subscriptions/78c7c665-8c7e-4de1-9ec3-f1e5a03f0d04/resourcegroups/automl/providers/microsoft.insights/components/automl6121313671',
 'identityPrincipalId': '0dd0f81e-4e15-4213-9336-ba2991f0a77a',
 'identityTenantId': 'a7f1d862-decb-4991-9b4d-9aa7fed56118',
 'identityType': 'SystemAssigned',
 'storageAccount': '/subscriptions/78c7c665-8c7e-4de1-9ec3-f1e5a03f0d04/resourcegroups/automl/providers/micro

## Configuring Your Local Environment
You can validate that you have access to the specified workspace and write a configuration file to the default configuration location, `./.azureml/config.json`.

In [4]:
from azureml.core import Workspace

ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)

# Persist the subscription id, resource group name, and workspace name in .azureml/config.json.
ws.write_config()
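
To make it concrete what gets persisted, here is a small stand-alone sketch of such a configuration file. It assumes the file is plain JSON with the three keys `subscription_id`, `resource_group`, and `workspace_name`; the helpers `write_workspace_config` and `read_workspace_config` are hypothetical and do not call the SDK, and all values are placeholders.

```python
import json
import os
import tempfile

def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write the three workspace identifiers to <folder>/config.json and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "config.json")
    with open(path, "w") as f:
        json.dump(
            {
                "subscription_id": subscription_id,
                "resource_group": resource_group,
                "workspace_name": workspace_name,
            },
            f,
            indent=4,
        )
    return path

def read_workspace_config(path):
    """Load the identifiers back as a dictionary."""
    with open(path) as f:
        return json.load(f)

# Placeholder values, written to a throwaway folder.
demo_folder = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_folder, "00000000-0000-0000-0000-000000000000", "my-resource-group", "my-workspace"
)
loaded = read_workspace_config(config_path)
print(loaded["workspace_name"])
```

Since the file holds only identifiers and no credentials, reloading the workspace from it still requires authentication.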

## Create an Experiment

As part of the setup you have already created an Azure ML `Workspace` object. For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [5]:
import logging
import os
import random
import time

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [26]:
ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'DSVM-sin-target'
project_folder = './sample_projects/automl-porto-seguro'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

| | |
|-|-|
|SDK version|1.5.0|
|Subscription ID|78c7c665-8c7e-4de1-9ec3-f1e5a03f0d04|
|Workspace Name|automl|
|Resource Group|automl|
|Location|westeurope|
|Project Directory|./sample_projects/automl-porto-seguro|
|Experiment Name|DSVM-sin-target|


## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [27]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics = True)

Turning diagnostics collection on. 


## Load Training Data Using DataPrep

In [28]:
import pandas as pd

X = pd.read_csv('./train_prep.csv')
y = X[['target']]
X = X.drop(['target'], axis=1)
display(X)
#print(y)

Unnamed: 0,ps_ind_01,ps_ind_03,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,ps_ind_11_bin,ps_ind_12_bin,ps_ind_13_bin,...,ps_car_11_cat_oh_50,ps_car_11_cat_oh_4,ps_car_11_cat_oh_58,ps_car_11_cat_oh_9,ps_car_11_cat_oh_17,ps_car_11_cat_oh_11,ps_car_11_cat_oh_45,ps_car_11_cat_oh_14,ps_car_11_cat_oh_81,ps_car_11_cat_oh_47
0,2,5,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,7,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,5,9,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,2,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,5,4,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,2,3,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,5,4,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,5,3,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,2,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# Automated ML requires a Dataflow, which is different from a pandas DataFrame.
# If your data is in a DataFrame, use read_pandas_dataframe to convert it to a Dataflow before using dprep.

import azureml.dataprep as dprep
# You can use `auto_read_file` which intelligently figures out delimiters and datatypes of a file.
# The data referenced here was pulled from `sklearn.datasets.load_digits()`.
#simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'
#X_train = dprep.auto_read_file(simple_example_data_root + 'X.csv').skip(1)  # Remove the header row.
X_train = dprep.read_pandas_dataframe(X, temp_folder='./azureml/X', overwrite_ok=True)

# You can also use `read_csv` and `to_*` transformations to read (with overridable delimiter)
# and convert column types manually.
# Here we read a comma delimited file and convert all columns to integers.
#y_train = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))
y_train = dprep.read_pandas_dataframe(y, temp_folder='./azureml/y', overwrite_ok=True)

## Review the Data Preparation Result
You can peek at the result of a Dataflow at any range using `skip(i)` and `head(j)`. Doing so evaluates only `j` records for all the steps in the Dataflow, which makes it fast even against large datasets.

In [30]:
X_train.skip(1).head(5)

Unnamed: 0,ps_ind_01,ps_ind_03,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,ps_ind_11_bin,ps_ind_12_bin,ps_ind_13_bin,...,ps_car_11_cat_oh_50,ps_car_11_cat_oh_4,ps_car_11_cat_oh_58,ps_car_11_cat_oh_9,ps_car_11_cat_oh_17,ps_car_11_cat_oh_11,ps_car_11_cat_oh_45,ps_car_11_cat_oh_14,ps_car_11_cat_oh_81,ps_car_11_cat_oh_47
0,1,6,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,2,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,5,3,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,6,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,5,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
y_train.skip(1).head(5)

Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0
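
The idea behind peeking with `skip(i)` and `head(j)` is lazy evaluation: only the rows needed for the preview are produced. The sketch below imitates this with a plain Python generator and `itertools.islice`. It is an analogy, not how Dataflow is implemented; `records` and `peek` are hypothetical names, and unlike the behaviour described above, this version still has to step over the skipped rows.

```python
from itertools import islice

evaluated = []  # log of the rows the source actually produced

def records():
    """Stand-in for a large data source: yields rows lazily and logs each one."""
    for i in range(1_000_000):
        evaluated.append(i)
        yield {"row": i}

def peek(source, skip, head):
    """Return `head` rows after skipping `skip` rows, without reading the rest."""
    return list(islice(source, skip, skip + head))

preview = peek(records(), skip=1, head=5)
print([r["row"] for r in preview])
print(len(evaluated), "rows read out of 1000000")
```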


## Configure AutoML

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|`classification` or `regression`.|
|**primary_metric**|The metric that you want to optimize.<br>Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i><br>Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross-validation splits.|
|**spark_context**|Spark context object. For Databricks, use `spark_context=sc`.|
|**max_concurrent_iterations**|Maximum number of iterations to execute in parallel. This should be <= the number of worker nodes in your Azure Databricks cluster.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
|**preprocess**|Set this to `True` to enable pre-processing of the data, e.g. converting strings to numeric values using one-hot encoding.|
|**exit_score**|Target score for the experiment, associated with the primary metric. For example, `exit_score=0.995` ends the experiment once that score is reached.|
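
To illustrate what "number of cross-validation splits" means, the sketch below partitions row indices into folds so that every row is used for validation exactly once. It is a simplified illustration with a hypothetical helper `k_fold_indices`: it neither shuffles nor stratifies, and the splitting that AutoML performs internally may differ.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=5))
for fold_number, (train, validation) in enumerate(folds):
    print(fold_number, "validation rows:", validation)
```

With `n_cross_validations = 10`, each pipeline would be scored on ten such train/validation pairs and the scores combined.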

In [33]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             experiment_timeout_hours = 24,
                             iteration_timeout_minutes = 120,
                             iterations = 12,
                             n_cross_validations = 10,
                             max_concurrent_iterations = 6, 
                             featurization = "auto",
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             path = project_folder)
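
A rough sanity check on the time budget in the configuration above can be worked out by hand. The sketch assumes that iterations run in full batches of `max_concurrent_iterations` and that every iteration uses its whole timeout. Both are simplifications, and setup time and any final steps are ignored.

```python
import math

# Values taken from the AutoMLConfig cell above.
iterations = 12
max_concurrent_iterations = 6
iteration_timeout_minutes = 120
experiment_timeout_hours = 24

# Number of batches needed when at most 6 iterations run side by side.
batches = math.ceil(iterations / max_concurrent_iterations)

# Upper bound on training time if every iteration runs into its timeout.
worst_case_minutes = batches * iteration_timeout_minutes
fits_in_budget = worst_case_minutes <= experiment_timeout_hours * 60

print("batches:", batches)
print("worst-case minutes:", worst_case_minutes)
print("within experiment timeout:", fits_in_budget)
```

Under these assumptions the per-iteration limits, not the 24-hour experiment timeout, are the binding constraint for this configuration.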



## Train the Models

Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations, this can run for a while.
Setting `show_output = True` prints the currently running iterations to the console. The cell below sets `show_output = False`, which is preferable when you run many iterations; you can then follow progress in the portal.

In [None]:
local_run = experiment.submit(automl_config, show_output = False) # for runs with many iterations, keep show_output=False and monitor progress with the portal link below



## Explore the Results

#### Portal URL for Monitoring Runs

The following provides a link to the web interface where you can explore individual run details and status. `displayHTML` is built into Databricks notebooks; outside Databricks the call raises a `NameError`.

In [17]:
displayHTML("<a href='{}' target='_blank'>Your experiment in Azure Portal: {}</a>".format(local_run.get_portal_url(), local_run.id))

NameError: name 'displayHTML' is not defined

#### Retrieve All Child Runs After the Experiment Has Completed
You can also use SDK methods to fetch all the child runs and see individual metrics that we log. This can take some time.

In [18]:
# Collect the float-valued metrics of every child run, keyed by iteration number.
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

# One column per iteration, ordered by iteration number.
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
AUC_macro,1.0,1.0,0.63,1.0,1.0,1.0,1.0,0.61,1.0,0.61,1.0,1.0
AUC_micro,1.0,1.0,0.97,1.0,1.0,1.0,1.0,0.61,1.0,0.62,1.0,1.0
AUC_weighted,1.0,1.0,0.63,1.0,1.0,1.0,1.0,0.61,1.0,0.61,1.0,1.0
accuracy,1.0,1.0,0.96,1.0,1.0,1.0,1.0,0.58,1.0,0.6,1.0,0.96
average_precision_score_macro,1.0,1.0,0.52,1.0,1.0,1.0,1.0,0.51,1.0,0.51,1.0,1.0
average_precision_score_micro,1.0,1.0,0.97,1.0,1.0,1.0,1.0,0.6,1.0,0.58,1.0,1.0
average_precision_score_weighted,1.0,1.0,0.94,1.0,1.0,1.0,1.0,0.94,1.0,0.94,1.0,1.0
balanced_accuracy,1.0,1.0,0.5,1.0,1.0,1.0,1.0,0.58,1.0,0.58,1.0,0.5
f1_score_macro,1.0,1.0,0.49,1.0,1.0,1.0,1.0,0.41,1.0,0.42,1.0,0.49
f1_score_micro,1.0,1.0,0.96,1.0,1.0,1.0,1.0,0.58,1.0,0.6,1.0,0.96
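
Once the metrics are collected per iteration, finding the best iterations for a given metric is a small lookup. The sketch below rebuilds a dictionary with the same shape as `metricslist`, using the `AUC_weighted` and `accuracy` rows from the output above. `best_iterations` is a hypothetical helper, not an SDK method.

```python
# Rows copied from the output table above, one value per iteration 0..11.
auc_weighted = [1.0, 1.0, 0.63, 1.0, 1.0, 1.0, 1.0, 0.61, 1.0, 0.61, 1.0, 1.0]
accuracy = [1.0, 1.0, 0.96, 1.0, 1.0, 1.0, 1.0, 0.58, 1.0, 0.6, 1.0, 0.96]

# Same shape as metricslist: {iteration: {metric_name: value}}.
metrics_by_iteration = {
    i: {"AUC_weighted": auc, "accuracy": acc}
    for i, (auc, acc) in enumerate(zip(auc_weighted, accuracy))
}

def best_iterations(metrics, metric, higher_is_better=True):
    """Return the best value of `metric` and all iterations that reach it."""
    values = {i: m[metric] for i, m in metrics.items() if metric in m}
    best = max(values.values()) if higher_is_better else min(values.values())
    return best, sorted(i for i, v in values.items() if v == best)

best_auc, auc_winners = best_iterations(metrics_by_iteration, "AUC_weighted")
best_acc, acc_winners = best_iterations(metrics_by_iteration, "accuracy")
print(best_auc, auc_winners)
print(best_acc, acc_winners)
```

Several iterations tie on the rounded values shown, so a second metric can help to tell them apart.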


### Retrieve the Best Model After the Experiment Has Completed

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [19]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: automl-porto-seguro,
Id: AutoML_6ac1d226-dc67-4f7a-84a9-f99ed1564a97_11,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
        feature_sweeping_config=None, feature_sweeping_timeout=None,
        featurization_config=None, force_text_dnn=None,
        is_cross_validation=None, is_onnx_compatible=None, logger=None,
        obser...7f4cc8374630>,
           solver='lbfgs', tol=0.0001, verbose=0),
            training_cv_folds=5))])


#### Best Model Based on Any Other Metric
Show the run and the model that has the smallest `log_loss` value:

In [36]:
lookup_metric = "log_loss"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)
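
Unlike accuracy or AUC, `log_loss` is a metric where smaller is better, because it penalizes confident wrong predictions heavily. The following is a minimal sketch of the binary case in plain Python, assuming labels of 0 or 1 and predicted probabilities for class 1. The values AutoML logs are computed by the service and may use a different implementation.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0/1) and predicted P(class 1)."""
    total = 0.0
    for label, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident_and_right = log_loss(labels, [0.9, 0.1, 0.8])
confident_and_wrong = log_loss(labels, [0.1, 0.9, 0.2])
print(confident_and_right < confident_and_wrong)
```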

### Test the Best Fitted Model

#### Load Test Data
You can split the dataset beforehand, pass the training set to AutoML, and use the test set to evaluate the best model.

In [20]:
X_test = pd.read_csv('./test_prep.csv')
display(X_test)

Unnamed: 0,ps_ind_01,ps_ind_03,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,ps_ind_11_bin,ps_ind_12_bin,ps_ind_13_bin,...,ps_car_11_cat_oh_19,ps_car_11_cat_oh_1,ps_car_11_cat_oh_13,ps_car_11_cat_oh_73,ps_car_11_cat_oh_33,ps_car_11_cat_oh_79,ps_car_11_cat_oh_59,ps_car_11_cat_oh_58,ps_car_11_cat_oh_15,ps_car_11_cat_oh_63
0,0,8,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4,5,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,5,3,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,6,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,7,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,6,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,3,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,7,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,6,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
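
Splitting a dataset beforehand, as suggested at the start of this section, can be sketched in a few lines. The helper `train_test_split_rows` below is hypothetical and works on any list of rows; the 20% test fraction and the seed are arbitrary choices. It does not stratify by target, which may matter for imbalanced data.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 data rows
train_rows, test_rows = train_test_split_rows(rows)
print(len(train_rows), len(test_rows))
```

Splitting before any per-dataset encoding, and applying the same encoding to both parts, keeps the training and test sets on the same columns.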


#### Testing Our Best Fitted Model
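We now use the best fitted model to predict the target for the test data. The input passed to `predict` must have the same columns as the data the model was fitted on; the run below fails because the model was fitted on 208 columns while the test data has 207.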

In [21]:
y_test['target'] = fitted_model.predict(X_test)

DataErrorException: DataErrorException:
	Message: The fitted data has 208 columns but the input data has 207 columns.
	InnerException: None
	ErrorResponse 
{
    "error": {
        "code": "System",
        "inner_error": {
            "code": "DataError"
        },
        "message": "The fitted data has 208 columns but the input data has 207 columns."
    }
}
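
An error like this one is easier to resolve once you know which columns differ. The sketch below compares two lists of column names. `compare_columns` is a hypothetical helper and the column names are made up for illustration; they are not the actual difference between the two files.

```python
def compare_columns(train_columns, test_columns):
    """Report columns present in only one of the two datasets, keeping their order."""
    train_set, test_set = set(train_columns), set(test_columns)
    return {
        "missing_from_test": [c for c in train_columns if c not in test_set],
        "unexpected_in_test": [c for c in test_columns if c not in train_set],
    }

# Made-up column lists for illustration.
train_columns = ["feature_a", "feature_b", "category_oh_1", "category_oh_2"]
test_columns = ["feature_a", "feature_b", "category_oh_3"]

report = compare_columns(train_columns, test_columns)
print(report["missing_from_test"])
print(report["unexpected_in_test"])
```

With pandas data you could call it as `compare_columns(list(X.columns), list(X_test.columns))`. The displayed headers of the training and test frames above end in different one-hot columns, so it may be worth checking whether both files were encoded with the same categories.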

In [42]:
y_test

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [43]:
y_test.to_csv('/dbfs/titanic/results.csv', index=False)