<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Create-an-Experiment-in-Azure-ML-workspace" data-toc-modified-id="Create-an-Experiment-in-Azure-ML-workspace-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Create an Experiment in Azure ML workspace</a></span></li><li><span><a href="#HyperDrive-Pipeline" data-toc-modified-id="HyperDrive-Pipeline-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>HyperDrive Pipeline</a></span><ul class="toc-item"><li><span><a href="#Create-Resources-for-Training-Experiments" data-toc-modified-id="Create-Resources-for-Training-Experiments-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create Resources for Training Experiments</a></span></li><li><span><a href="#Hyperparameter-Tunning" data-toc-modified-id="Hyperparameter-Tunning-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Hyperparameter Tunning</a></span><ul class="toc-item"><li><span><a href="#Parameter-sampler" data-toc-modified-id="Parameter-sampler-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Parameter sampler</a></span></li><li><span><a href="#Early-Termination-Policy" data-toc-modified-id="Early-Termination-Policy-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Early Termination Policy</a></span></li><li><span><a href="#Create-a-SKLearn-Estimator" data-toc-modified-id="Create-a-SKLearn-Estimator-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>Create a SKLearn Estimator</a></span></li><li><span><a href="#Create-a-HyperDriveConfig" data-toc-modified-id="Create-a-HyperDriveConfig-3.2.4"><span class="toc-item-num">3.2.4&nbsp;&nbsp;</span>Create a <a href="https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig?view=azure-ml-py" target="_blank">HyperDriveConfig</a></a></span></li></ul></li></ul></li><li><span><a href="#AutoML-Run" 
data-toc-modified-id="AutoML-Run-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>AutoML Run</a></span><ul class="toc-item"><li><span><a href="#Create-Dataset" data-toc-modified-id="Create-Dataset-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Create Dataset</a></span></li><li><span><a href="#Inspect-Dataset" data-toc-modified-id="Inspect-Dataset-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Inspect Dataset</a></span></li><li><span><a href="#Clean-and-Split-Dataset" data-toc-modified-id="Clean-and-Split-Dataset-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Clean and Split Dataset</a></span></li><li><span><a href="#Configure-Experiment" data-toc-modified-id="Configure-Experiment-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Configure Experiment</a></span></li><li><span><a href="#Submitting-Training-Experiment" data-toc-modified-id="Submitting-Training-Experiment-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Submitting Training Experiment</a></span></li><li><span><a href="#Monitor-using-Widget" data-toc-modified-id="Monitor-using-Widget-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Monitor using <code>Widget</code></a></span></li><li><span><a href="#Retrieve-and-Save-Best-Model" data-toc-modified-id="Retrieve-and-Save-Best-Model-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Retrieve and Save Best Model</a></span></li></ul></li><li><span><a href="#Cleaning-Up-Cluster" data-toc-modified-id="Cleaning-Up-Cluster-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Cleaning Up Cluster</a></span></li></ul></div>

# Introduction

# Create an Experiment in Azure ML workspace

For this project we use an Azure Machine Learning Notebook VM, so we can skip setting up the environment.

To start, we need to initialize our workspace and create an Azure ML experiment. It is also worth remembering that accessing the Azure ML workspace requires authentication with Azure.

In [4]:
from azureml.core import Workspace, Experiment

# Initialize a workspace object for an existing Azure Machine Learning Workspace
ws = Workspace.get("quick-starts-ws-127549")

# Create an experiment
exp = Experiment(workspace=ws, name="udacity-project")

run = exp.start_logging()

In [5]:
import pandas as pd

dic_data = {'Workspace name': ws.name,
            'Azure region': ws.location,
            'Subscription id': ws.subscription_id,
            'Resource group': ws.resource_group,
            'Experiment Name': exp.name}

df_data = pd.DataFrame.from_dict(data = dic_data, orient='index')

df_data.rename(columns={0:''}, inplace = True)
df_data

Workspace name,quick-starts-ws-127549
Azure region,southcentralus
Subscription id,55e71b9d-a209-42c0-8818-ca9cc885909c
Resource group,aml-quickstarts-127549
Experiment Name,udacity-project


# HyperDrive Pipeline

## Create Resources for Training Experiments

Now that we have initialized our workspace and created our experiment, it is time to define our resources.

In this section we create a compute cluster for use by the notebook and any other operations we need.

To create a cluster we need to specify a compute configuration that defines the `type of machine` to be used and the `scalability behaviors`. We also need to give the cluster a name, which must be unique within the workspace. This name is used to address the cluster later.

For this project we use a CPU cluster with the following parameters:

* `type of the machine`:

    * `vm_size`: Defines the size of the virtual machine. We use here "STANDARD_D2_V2" (more details [here](https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-sizes-specs#dv2-series))

* `Scalability behaviors`:

    * `min_nodes`: Sets the minimum size of the cluster. With the minimum set to 0, the cluster shuts down all nodes while not in use. A higher value gives faster start-up times, but you are also billed while the cluster is idle.

    * `max_nodes`: Sets the maximum size of the cluster. A larger number allows more concurrency and greater distribution of scale-out jobs.



In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Define CPU cluster name
cpu_cluster_name = "cpu-cluster"


# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cpu-cluster")
except ComputeTargetException:
    
    # Specify the configuration for the new cluster
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                           min_nodes=0, # when inactive
                                                           max_nodes=4) # when busy

    # Create the cluster with the specified name and configuration
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
    # Wait for the cluster to complete, show the output log
    cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [8]:
# Check details about compute_targets (i.e. cpu_cluster)

compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)

cpu-c231120 ComputeInstance Succeeded
cpu-cluster AmlCompute Succeeded


## Hyperparameter Tunning

### Parameter sampler

In this HyperDrive example we use [`random sampling`](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.randomparametersampling?view=azure-ml-py) to try different hyperparameter configurations, with the goal of maximizing the chosen primary metric, accuracy. The `choice` function specifies a discrete set of options to sample from.

The hyperparameters and metric used are defined in the script `train.py`.
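To make the idea of random sampling over `choice` options concrete, here is a minimal plain-Python sketch. It is only an illustration of the concept, not HyperDrive's actual sampler: the helper `sample_configurations` and the use of Python's `random` module are assumptions made for this example, while the candidate values mirror the search space used in this notebook.

```python
import random

# Search space mirroring the one used in this notebook: each hyperparameter
# maps to a discrete list of candidate values (what `choice` expresses).
search_space = {
    "--C": [0.01, 0.1, 0.2, 0.5, 0.7, 1.0],
    "--max_iter": list(range(10, 110, 10)),
}


def sample_configurations(space, n_samples, seed=None):
    """Hypothetical helper: draw n_samples configurations, choosing each
    hyperparameter value uniformly at random from its candidate list."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_samples)
    ]


configs = sample_configurations(search_space, n_samples=4, seed=0)
for config in configs:
    print(config)
```

Each printed dictionary corresponds to one candidate run. Because the values are drawn independently, the same configuration can in principle be sampled more than once in this sketch.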

### Early Termination Policy

An early termination policy improves computational efficiency by terminating poorly performing runs.

This saves us from continuing to explore hyperparameters that show no promise of helping us reach our target metric.

The early termination policy we use is the [`Bandit Policy`](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.banditpolicy?preserve-view=true&view=azure-ml-py#&preserve-view=truedefinition). This policy is based on a `slack factor/slack amount` and an `evaluation interval`. Bandit terminates runs whose primary metric is not within the specified slack factor/slack amount of the best performing run.

This allows more aggressive savings than Median Stopping policy if we apply a smaller allowable slack.

The parameter `slack_factor`, which is the slack allowed with respect to the best performing training run, needs to be defined, while `evaluation_interval` and `delay_evaluation` are optional.

`evaluation_interval` sets how often the policy is applied. If `evaluation_interval` is not defined, the default value is one, i.e., the policy is applied every time the training script reports the primary metric.

Specifying `delay_evaluation` avoids premature termination of training runs by allowing all configurations to run for a minimum number of intervals. If specified, the policy applies at every multiple of `evaluation_interval` that is greater than or equal to `delay_evaluation`.

In our case, applying the Bandit policy with `slack_factor=0.1`, `evaluation_interval=2`, and `delay_evaluation=5` means the policy is applied at every other interval at which metrics are reported, once at least 5 intervals have passed (by the rule above, first at interval 6). Azure ML terminates any run whose primary metric is less than 1/(1 + 0.1), about 91%, of the best performing run's metric.
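The two rules described in this section can be sketched in a few lines of plain Python. This is only an illustration, not Azure ML's implementation: the function names `policy_applies` and `should_terminate` are hypothetical, intervals are assumed to be counted from 1, and the schedule takes the rule "every multiple of `evaluation_interval` that is greater than or equal to `delay_evaluation`" literally. The termination threshold for a maximized metric is the best metric divided by `1 + slack_factor`.

```python
def policy_applies(interval, evaluation_interval=1, delay_evaluation=0):
    """Hypothetical helper: True if the policy is evaluated at this interval,
    i.e. the interval is a multiple of evaluation_interval and is at least
    delay_evaluation."""
    return interval >= delay_evaluation and interval % evaluation_interval == 0


def should_terminate(run_metric, best_metric, slack_factor):
    """Hypothetical helper: Bandit rule for a metric that is maximized.
    A run is terminated if its metric is below best / (1 + slack_factor)."""
    return run_metric < best_metric / (1 + slack_factor)


# Intervals 1..10 at which the policy is checked for the settings used here.
checked = [i for i in range(1, 11)
           if policy_applies(i, evaluation_interval=2, delay_evaluation=5)]
print(checked)  # [6, 8, 10]

# Example values (made up for illustration): the best run reports 0.91.
best = 0.91
threshold = best / (1 + 0.1)
print(round(threshold, 4))                # 0.8273
print(should_terminate(0.80, best, 0.1))  # True: below the threshold
print(should_terminate(0.85, best, 0.1))  # False: within the slack
```

A smaller `slack_factor` raises the threshold toward the best metric, which is why a tighter slack terminates more runs.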

### Create a SKLearn Estimator

[SKLearn Class](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py) creates an estimator for training in Scikit-learn experiments.

### Create a [HyperDriveConfig](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriveconfig?view=azure-ml-py)

Now we are ready to configure a run configuration object.

We pass in the `parameter sampler`, `early termination policy`, and `estimator` that we just configured. We also specify the primary metric `Accuracy` that is recorded in the training runs, and we tell the service that we want to maximize this value.

Moreover, we set the maximum total number of runs (`max_total_runs`) to 4 and the maximum number of concurrent runs (`max_concurrent_runs`) to 4, which is the same as the maximum number of nodes in our compute cluster.


In [9]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os

# Specify parameter sampler

ps = RandomParameterSampling({
    '--C': choice(0.01, 0.1, 0.2, 0.5, 0.7, 1.0),
    '--max_iter': choice(range(10,110,10))
    }
)

# Specify a Policy
policy = BanditPolicy(slack_factor = 0.1, # specifies the allowable slack as a ratio
                      evaluation_interval=2, # frequency for applying the policy
                      delay_evaluation=5) # delays the first policy evaluation for a specified number of intervals

if "training" not in os.listdir():
    os.mkdir("./training")

# Create a SKLearn estimator for use with train.py

est = SKLearn( 
    source_directory='./', # directory containing experiment configuration files (train.py)
    compute_target=cpu_cluster, # compute target where training will happen
    vm_size="STANDARD_D2_V2", # VM size of the compute target
    vm_priority='lowpriority', # VM priority of the compute target (default value is 'dedicated')
    entry_script='train.py'
)

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.

hyperdrive_config = HyperDriveConfig(estimator=est,
                                     hyperparameter_sampling=ps,
                                     policy=policy,
                                     primary_metric_name='Accuracy',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=4,
                                     max_concurrent_runs=4)


In [10]:
# Submit hyperdrive run to the experiment 

hyperdrive_run = exp.submit(config = hyperdrive_config)

# Show run details with the Jupyter widget

RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=True)



_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7
Web View: https://ml.azure.com/experiments/udacity-project/runs/HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7?wsid=/subscriptions/55e71b9d-a209-42c0-8818-ca9cc885909c/resourcegroups/aml-quickstarts-127549/workspaces/quick-starts-ws-127549

Streaming azureml-logs/hyperdrive.txt

<START>[2020-11-23T14:39:19.649662][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>
<START>[2020-11-23T14:39:19.930344][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>
<START>[2020-11-23T14:39:19.138318][API][INFO]Experiment created<END>
<START>[2020-11-23T14:39:20.4789276Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>

Execution Summary
RunId: HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7
Web View: https://ml.azure.com/experiments/udacity-project/runs/HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7?wsid=/subscriptions/55e71

{'runId': 'HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-11-23T14:39:18.803266Z',
 'endTimeUtc': '2020-11-23T14:46:49.925747Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '26c3792e-5253-43f8-a916-e70710c8f45d',
  'score': '0.9108750632271118',
  'best_child_run_id': 'HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7_3',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg127549.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=kBCS7oPWfkz1lAJMl23SHVuC%2Fl4o8SzOXMBQroDogNY%3D&st=2020-11-23T14%3A37%3A10Z&se=2020-11-23T22%3A47%3A10Z&sp=r'}}

In [11]:
import joblib

# Get your best run and save the model from that run.

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('Accuracy:', best_run_metrics['Accuracy'])

best_run

Best Run Id:  HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7_3
Accuracy: 0.9108750632271118


Experiment,Id,Type,Status,Details Page,Docs Page
udacity-project,HD_e4a8f3f1-8636-4161-ae5b-58a20a6eadf7_3,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [12]:
# check metrics details

best_run_metrics

{'Regularization Strength:': 0.7,
 'Max iterations:': 100,
 'Accuracy': 0.9108750632271118}

In [13]:
# get name of files of best_run
best_run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_dad1ad448bab9b03d12706e0d3dfd042b705522863d87ca5d29363f64da9d37b_d.txt',
 'azureml-logs/65_job_prep-tvmps_dad1ad448bab9b03d12706e0d3dfd042b705522863d87ca5d29363f64da9d37b_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_dad1ad448bab9b03d12706e0d3dfd042b705522863d87ca5d29363f64da9d37b_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'logs/azureml/102_azureml.log',
 'logs/azureml/job_prep_azureml.log',
 'logs/azureml/job_release_azureml.log',
 'outputs/model.joblib']

In [14]:
# register the model, i.e., the output file of best_run, in the workspace
model = best_run.register_model(model_name='model_hd', model_path='outputs/model.joblib')

# AutoML Run

Now we use the same dataset to obtain a model by running AutoML.

## Create Dataset

In [15]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Create TabularDataset using TabularDatasetFactory

ds = TabularDatasetFactory.from_delimited_files(path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv")

## Inspect Dataset

In [16]:
# create a dataframe with ds data

ds_df = ds.to_pandas_dataframe()

In [17]:
ds_df.head()

,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,...,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,...,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,...,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1,no
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,no


In [18]:
ds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32950 entries, 0 to 32949
Data columns (total 21 columns):
age               32950 non-null int64
job               32950 non-null object
marital           32950 non-null object
education         32950 non-null object
default           32950 non-null object
housing           32950 non-null object
loan              32950 non-null object
contact           32950 non-null object
month             32950 non-null object
day_of_week       32950 non-null object
duration          32950 non-null int64
campaign          32950 non-null int64
pdays             32950 non-null int64
previous          32950 non-null int64
poutcome          32950 non-null object
emp.var.rate      32950 non-null float64
cons.price.idx    32950 non-null float64
cons.conf.idx     32950 non-null float64
euribor3m         32950 non-null float64
nr.employed       32950 non-null float64
y                 32950 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

In [19]:
ds_df.y.value_counts(normalize=True)

no     0.887951
yes    0.112049
Name: y, dtype: float64

The dataset contains 32950 entries and 21 columns, one of which is our target (`y`). The classes are imbalanced (about 89% `no` and 11% `yes`), so I stratify on `y` when splitting the data to keep the same percentage of each class in the train and test sets.
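As a rough illustration of what stratifying a split means, the sketch below splits each class separately by the same test fraction, so both subsets keep the original class proportions. It is a plain-Python toy, not scikit-learn's implementation: the helper names `stratified_split` and `yes_share` are hypothetical, and the 890/110 toy labels only approximate the imbalance reported above.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=None):
    """Hypothetical helper: return (train_idx, test_idx) where every class
    contributes the same fraction of its samples to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels with roughly the imbalance seen above: 890 "no" and 110 "yes".
labels = ["no"] * 890 + ["yes"] * 110
train_idx, test_idx = stratified_split(labels, test_size=0.30, seed=123)


def yes_share(indices):
    """Fraction of "yes" labels among the given sample indices."""
    counts = Counter(labels[i] for i in indices)
    return counts["yes"] / len(indices)


print(len(train_idx), len(test_idx))                                  # 700 300
print(round(yes_share(train_idx), 3), round(yes_share(test_idx), 3))  # 0.11 0.11
```

Without stratification, a purely random split could leave the smaller class noticeably over- or under-represented in the test set, which would make accuracy comparisons less reliable.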

## Clean and Split Dataset

In [20]:
from train import clean_data

# Use the clean_data function to clean your data.
x, y = clean_data(ds)

In [21]:
from sklearn.model_selection import train_test_split

# Split data into train and test sets, stratified on y

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=123, stratify = y)

In [22]:
# Create a new dataframe with only training data

df_train = pd.concat([x_train,y_train], axis=1)
df_train.reset_index(drop=True, inplace=True)

In [23]:
df_train.head()

,age,marital,default,housing,loan,month,day_of_week,duration,campaign,pdays,...,contact_telephone,education_basic.4y,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_professional.course,education_university.degree,education_unknown,y
0,31,1,0,1,0,5,2,161,5,999,...,0,0,0,1,0,0,0,0,0,0
1,48,1,0,0,0,11,3,1061,4,999,...,0,0,0,0,1,0,0,0,0,1
2,32,1,0,1,1,5,5,134,1,999,...,0,0,0,0,0,0,1,0,0,0
3,36,0,0,1,1,5,2,347,1,999,...,0,0,0,0,0,0,0,1,0,0
4,31,0,0,1,0,5,5,12,10,999,...,0,0,0,1,0,0,0,0,0,0


In [24]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23065 entries, 0 to 23064
Data columns (total 40 columns):
age                              23065 non-null int64
marital                          23065 non-null int64
default                          23065 non-null int64
housing                          23065 non-null int64
loan                             23065 non-null int64
month                            23065 non-null int64
day_of_week                      23065 non-null int64
duration                         23065 non-null int64
campaign                         23065 non-null int64
pdays                            23065 non-null int64
previous                         23065 non-null int64
poutcome                         23065 non-null int64
emp.var.rate                     23065 non-null float64
cons.price.idx                   23065 non-null float64
cons.conf.idx                    23065 non-null float64
euribor3m                        23065 non-null float64
nr.employed        

## Configure Experiment 

In [26]:
from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.

automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task="classification",
    primary_metric="accuracy",
    training_data=df_train,
    label_column_name='y',
    n_cross_validations=5)




## Submitting Training Experiment

In [27]:
# Submit your automl run

experiment_name = 'automl-experiment'

experiment = Experiment(ws, experiment_name)
automl_run = experiment.submit(config=automl_config, show_output=True)
automl_run

Running on local machine
Parent Run ID: AutoML_770c77d2-9751-454d-8d86-98b0c225ada9

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely p

Experiment,Id,Type,Status,Details Page,Docs Page
automl-experiment,AutoML_770c77d2-9751-454d-8d86-98b0c225ada9,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


## Monitor using `Widget`

Once more we use the `RunDetails` widget, this time to explore the results obtained with AutoML.

In [29]:
# from azureml.widgets import RunDetails
RunDetails(automl_run).show()

automl_run.wait_for_completion(show_output=True)

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…



****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a model's accuracy because the input data has bias towards one class.
+---------------------------------+---------------------------------+--------------------------------------+
|Size of the smallest class       |Name/Label of the smallest class |Number of samples in the training data|
|2584                             |1                                |23065                                 |
+---------------------------------+---------------------------------+--------------------------------------+

********************************************

{'runId': 'AutoML_770c77d2-9751-454d-8d86-98b0c225ada9',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2020-11-23T14:53:43.605472Z',
 'endTimeUtc': '2020-11-23T15:26:42.769357Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'local',
  'DataPrepJsonString': None,
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.18.0", "azureml-train": "1.18.0", "azureml-train-restclients-hyperdrive": "1.18.0", "azureml-train-core": "1.18.0", "azureml-train-automl": "1.18.0", "azureml-train-automl-runtime": "1.18.0", "azureml-train-automl-client": "1.18.0", "azureml-tensorboard": "1.18.0", "azureml-telemetry": "1.18.0", "azureml-sdk": "1.18.0", "azureml-samples": "0+unknow

## Retrieve and Save Best Model

Below we select the best model from all the training iterations using the `get_output` method.


In [30]:
# Retrieve model

best_run, fitted_model = automl_run.get_output()

# get name of files of best_run
best_run.get_file_names()

In [34]:
# save best model
best_run.register_model(model_name = "model.pkl", model_path = './outputs/')
print(fitted_model._final_estimator)

PreFittedSoftVotingClassifier(classification_labels=None,
                              estimators=[('1',
                                           Pipeline(memory=None,
                                                    steps=[('maxabsscaler',
                                                            MaxAbsScaler(copy=True)),
                                                           ('xgboostclassifier',
                                                            XGBoostClassifier(base_score=0.5,
                                                                              booster='gbtree',
                                                                              colsample_bylevel=1,
                                                                              colsample_bynode=1,
                                                                              colsample_bytree=1,
                                                                              gamma=0,
              

In [35]:
fitted_model.steps


# Cleaning Up Cluster

In [36]:
cpu_cluster.delete()