# HyperDrive vs. Automated Machine Learning
_**Classification using Bank Marketing Dataset**_

## Introduction

In this project, we create and optimize an ML pipeline. We are given a custom-coded model (a standard Scikit-learn Logistic Regression) whose hyperparameters we optimize using HyperDrive. We also use AutoML to build and optimize a model on the same dataset, so that we can compare the results of the two methods.

In this notebook we will:
1. Create a workspace and an experiment
2. Create a Compute Cluster
3. Configure HyperDrive: specify a parameter sampler and an early termination policy, create an SKLearn estimator that calls the training script, then create a HyperDriveConfig from the estimator, hyperparameter sampler, and policy
4. Submit the HyperDrive run, get the best run, and save the model from that run
5. Register the model from that run
6. Create a TabularDataset using TabularDatasetFactory and clean the data using the `clean_data` function
7. Configure and submit the AutoML run
8. Retrieve and save the best AutoML model
9. Register the model from that run


### Create a workspace and an experiment

In [1]:
from azureml.core import Workspace, Experiment
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
experiment_name = 'udacity-project'
exp=Experiment(ws, experiment_name)

run = exp.start_logging()

Workspace name: quick-starts-ws-127212
Azure region: southcentralus
Subscription id: 4910dccd-0348-46c4-a51f-d8c85e078b14
Resource group: aml-quickstarts-127212


### Create a Compute Cluster

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute

# Create the compute cluster with vm_size "Standard_D2_V2" and at most 4 nodes.
from azureml.core.compute_target import ComputeTargetException

# Choose a name for the CPU cluster
cpu_cluster_name = "cpu-cluster-mla"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                              max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

    cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Configure HyperDrive Run

In [3]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os
#import numpy as np

# Specify parameter sampler
ps = RandomParameterSampling({"--C": uniform(0.02, 1),
                             "--max_iter": choice(50, 100, 200)})

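# Specify an early termination policy: delay the first check by 5 intervals, then
# check every 2 intervals and stop runs outside a 10% slack of the best run
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1, delay_evaluation=5)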


mydir = os.getcwd()
# Create an SKLearn estimator for use with train_new.py
est = SKLearn(mydir, compute_target=cpu_cluster, entry_script='train_new.py')


# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(estimator=est,
                                    hyperparameter_sampling=ps,
                                    policy=policy,
                                    primary_metric_name='Accuracy',
                                    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                    max_total_runs=30,
                                    max_concurrent_runs=4)
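
The sampler and policy configured above can be illustrated without Azure. The following is a minimal sketch in plain Python, not the HyperDrive implementation: `sample_config` and `bandit_should_terminate` are hypothetical helper names, and the sketch ignores `evaluation_interval` and `delay_evaluation`. It assumes a maximizing goal, where a run is cancelled when its metric falls below the best metric divided by `1 + slack_factor`.

```python
import random


def sample_config(rng):
    """Draw one configuration from a search space shaped like `ps` (hypothetical helper)."""
    return {
        "--C": rng.uniform(0.02, 1.0),             # continuous range, like uniform(0.02, 1)
        "--max_iter": rng.choice([50, 100, 200]),  # discrete options, like choice(50, 100, 200)
    }


def bandit_should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True when a run falls outside the slack of the best run (maximizing goal)."""
    return run_metric < best_metric / (1.0 + slack_factor)


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]

# With a best accuracy of 0.90 and slack_factor=0.1, the cut-off is 0.90 / 1.1
cutoff = 0.90 / 1.1
for cfg in configs:
    print(cfg)
print("cut-off:", round(cutoff, 4))
print("terminate 0.80?", bandit_should_terminate(0.80, 0.90))
print("terminate 0.85?", bandit_should_terminate(0.85, 0.90))
```

Under these assumptions, a run at 0.80 accuracy would be stopped while a run at 0.85 would continue, because the cut-off sits at roughly 0.818.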

### Submit HyperDrive Run

In [4]:
hdr = exp.submit(config=hyperdrive_config, show_output=True)
RunDetails(hdr).show()




### Get your best run and save the model from that run.

In [5]:
import joblib
# Get your best run and save the model from that run.

best_run = hdr.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['Accuracy'])

Best Run Id:  HD_5f8e7496-7e5c-49b8-899f-f1cd157d21af_26

 Accuracy: 0.9177541729893779


### Register the model from that run

In [6]:
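os.makedirs('outputs', exist_ok=True)
# Caution: this serializes only the run ID string, not the fitted model.
# The registration below reads 'outputs/model.joblib' from the run's own outputs,
# so the registered model is whatever the training run stored at that path.
joblib.dump(value=best_run.id, filename='outputs/model.joblib')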

['outputs/model.joblib']

In [7]:
best_run.register_model(model_name='best_model', model_path='outputs/model.joblib')

Model(workspace=Workspace.create(name='quick-starts-ws-127212', subscription_id='4910dccd-0348-46c4-a51f-d8c85e078b14', resource_group='aml-quickstarts-127212'), name=best_model, id=best_model:1, version=1, tags={}, properties={})
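
Because the cell above passes `best_run.id` to `joblib.dump`, the local file holds a string rather than an estimator. As a point of comparison, here is a minimal sketch of saving and reloading a fitted scikit-learn model with joblib. It uses synthetic data and a temporary directory, both of which are assumptions for illustration and not part of this project's training script.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real project uses the bank marketing dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(C=0.5, max_iter=100).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(value=model, filename=model_path)  # serialize the estimator itself
    restored = joblib.load(model_path)             # read it back from disk

# The restored estimator should predict exactly like the original
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("restored type:", type(restored).__name__)
print("same predictions:", same_predictions)
```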

### Create TabularDataset using TabularDatasetFactory & clean the data using the clean_data function

In [8]:
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core.dataset import Dataset
from azureml.data.datapath import DataPath

url_path = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
dataset = TabularDatasetFactory.from_delimited_files(path=url_path)

In [9]:
from train_new import clean_data

x, y = clean_data(dataset)

In [10]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y)

import pandas as pd
dataset = pd.concat([x_train,y_train],axis=1)

### AutoML Settings

In [11]:
automl_settings = {
    'enable_early_stopping': True,
    'iteration_timeout_minutes':5,
    'max_concurrent_iterations':4,
    'featurization': 'auto'
}

### Configure AutoML Run

In [12]:

from azureml.train.automl import AutoMLConfig

# Set parameters for AutoMLConfig
# NOTE: DO NOT CHANGE THE experiment_timeout_minutes PARAMETER OR YOUR INSTANCE WILL TIME OUT.
# If you wish to run the experiment longer, you will need to run this notebook in your own
# Azure tenant, which will incur personal costs.
automl_config = AutoMLConfig(
    experiment_timeout_minutes=30,
    task='classification',
    primary_metric='accuracy',
    training_data=dataset,
    label_column_name='y',
    n_cross_validations=5,
    **automl_settings
    )
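
The `n_cross_validations=5` setting asks for each candidate model to be scored over five folds. A rough sketch of what five-fold accuracy means, using scikit-learn's `cross_val_score` on synthetic data rather than AutoML itself (the data and the model choice here are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; AutoML would receive the cleaned training DataFrame
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Fit and score the model on 5 different train/validation splits
scores = cross_val_score(
    LogisticRegression(max_iter=200), X, y, cv=5, scoring="accuracy"
)
mean_accuracy = scores.mean()

print("fold accuracies:", scores)
print("mean accuracy:", mean_accuracy)
```

Averaging over folds gives a less split-dependent estimate than a single hold-out score, at the cost of fitting the model five times.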

### Submit AutoML Run - Save and Register the best model

In [13]:
# Submit your automl run

automl_run = exp.submit(automl_config,show_output=True)

Running on local machine
Parent Run ID: AutoML_d8384813-5f0d-4b4b-8648-6ee0b62b6e03

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely p
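
The class balancing alert matters because accuracy, the primary metric used here, can look strong on imbalanced data even for a useless model. A small sketch with made-up labels (an assumed 90/10 class split, not the actual proportions of this dataset) shows a predictor that always outputs the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Assumed 90/10 split for illustration only
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

Accuracy rewards the majority-class guess, while balanced accuracy averages recall over both classes and exposes that the minority class is never detected.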

In [14]:
# Retrieve and save your best automl model.

automl_run.get_output()

(Run(Experiment: udacity-project,
 Id: AutoML_d8384813-5f0d-4b4b-8648-6ee0b62b6e03_33,
 Type: None,
 Status: Completed),
 Pipeline(memory=None,
          steps=[('datatransformer',
                  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                  feature_sweeping_config=None,
                                  feature_sweeping_timeout=None,
                                  featurization_config=None, force_text_dnn=None,
                                  is_cross_validation=None,
                                  is_onnx_compatible=None, logger=None,
                                  observer=None, task=None, working_dir=None)),
                 ('prefittedsoftvotingclassifier',...
                                                                                                   l1_ratio=0.836734693877551,
                                                                                                   learning_rate='constant',
         

In [15]:
best_automl_run, best_model = automl_run.get_output()
best_automl_run.register_model(model_name = "best_run_automl.pkl", model_path = './outputs/')

Model(workspace=Workspace.create(name='quick-starts-ws-127212', subscription_id='4910dccd-0348-46c4-a51f-d8c85e078b14', resource_group='aml-quickstarts-127212'), name=best_run_automl.pkl, id=best_run_automl.pkl:1, version=1, tags={}, properties={})
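
The best pipeline reported by `get_output()` ends in a step named `prefittedsoftvotingclassifier`. The idea of soft voting can be sketched with scikit-learn's `VotingClassifier`; this is not the AutoML class, and AutoML's ensemble may differ, for example by weighting its members. The data and member models below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=200)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of counting votes
).fit(X, y)

# Soft voting without weights is the plain mean of the members' probabilities
member_probas = [est.predict_proba(X) for est in ensemble.estimators_]
manual_average = np.mean(member_probas, axis=0)
ensemble_probas = ensemble.predict_proba(X)

print("matches manual average:", np.allclose(ensemble_probas, manual_average))
```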

### Proof of cluster clean up

In [None]:
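# Note: this deletes the entire workspace, not just the compute cluster.
# To remove only the cluster, cpu_cluster.delete() could be used instead.
ws.delete(delete_dependent_resources=False, no_wait=False)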