Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Udacity Capstone Project: Azure AutoML
This notebook demonstrates the use of AutoML in Azure Machine Learning Pipeline for the Udacity capstone project.

## Introduction
This example shows how to load data with an Azure ML `Dataset` and train on it with AutoML, first as a standalone run and then through an Azure ML Pipeline.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` cluster to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep` in a pipeline.
6. Train the model using `AmlCompute`.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [2]:
import os
import sys
import json
import azureml
import logging
import pickle
import requests
import pandas as pd
import numpy as np
from io import BytesIO
import joblib  # sklearn.externals.joblib is deprecated and removed in newer scikit-learn releases
from sklearn.metrics import confusion_matrix
from pprint import pprint
from matplotlib import pyplot as plt
from train import *


import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.automl.core.featurization import FeaturizationConfig
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core import Dataset  # Workspace is already imported above
from azureml.data.datapath import DataPath

from azureml.widgets import RunDetails
from azureml.train.automl import constants
from azureml.pipeline.steps import AutoMLStep
from azureml.pipeline.core import PipelineData, TrainingOutput
from azureml.pipeline.core import Pipeline

# Model deployment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import Model

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_rows', None)

# Check system and core SDK version number
print("System version: {}".format(sys.version))
print("SDK version:", azureml.core.VERSION)

System version: 3.6.13 |Anaconda, Inc.| (default, Feb 23 2021, 12:58:59) 
[GCC Clang 10.0.0 ]
SDK version: 1.23.0


## Initialize Workspace
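Initialize a workspace object. The cell below passes the subscription, resource group, and workspace name explicitly and authenticates interactively; if a persisted configuration file is present at `./config.json`, `Workspace.from_config()` can be used instead.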

In [3]:
interactive_auth = InteractiveLoginAuthentication(tenant_id="660b3398-b80e-49d2-bc5b-ac1dc93b5254")
ws = Workspace(subscription_id="81cefad3-d2c9-4f77-a466-99a7f541c7bb",
                   resource_group="aml-quickstarts-142415",
                   workspace_name="quick-starts-ws-142415",
                   auth=interactive_auth)

experiment_name = 'online_news_project'
experiment=Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
online_news_project,quick-starts-ws-142415,Link to Azure Machine Learning studio,Link to Documentation


In [4]:
dic_data = {'Workspace name': ws.name,
            'Azure region': ws.location,
            'Subscription id': ws.subscription_id,
            'Resource group': ws.resource_group,
            'Experiment Name': experiment.name}

az_data = pd.DataFrame.from_dict(data = dic_data, orient='index')
az_data.rename(columns={0:''}, inplace = True)
az_data

Unnamed: 0,Unnamed: 1
Workspace name,quick-starts-ws-142415
Azure region,southcentralus
Subscription id,81cefad3-d2c9-4f77-a466-99a7f541c7bb
Resource group,aml-quickstarts-142415
Experiment Name,online_news_project


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource.

**Udacity note:** There is no need to create a new compute target; the notebook can reuse the existing cluster.

In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Define CPU cluster name
compute_target_name = "cpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=compute_target_name)
    print("Found existing cpu-cluster. Use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS12_V2",
                                                           min_nodes=1, 
                                                           max_nodes=4) 
    compute_target = ComputeTarget.create(ws, compute_target_name, compute_config)

compute_target.wait_for_completion(show_output=True)

print(compute_target.get_status().serialize())

Found existing cpu-cluster. Use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 1, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-04-11T12:19:28.485000+00:00', 'errors': None, 'creationTime': '2021-04-11T12:18:03.615477+00:00', 'modifiedTime': '2021-04-11T12:18:18.955759+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS12_V2'}


In [6]:
# List every compute target in the workspace with its type and provisioning state
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)

notebook142415 ComputeInstance Succeeded
cpu-cluster AmlCompute Succeeded


## Dataset

**Udacity note:** Make sure the `key` matches the name of the uploaded dataset and that the description matches. If the name is unknown or hard to find, loop over `ws.datasets.keys()` and `print()` each one.
If the dataset *isn't* found because it was deleted, it can be recreated from the CSV link.
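
The lookup described in the note can be sketched as follows. This is only an illustration: `find_dataset_key` is a hypothetical helper, the dataset names are made up, and a plain `dict` stands in for `ws.datasets`, which is assumed here to behave like a mapping from dataset names to dataset objects.

```python
def find_dataset_key(datasets, key):
    """Return True if `key` is a registered name; otherwise print all names."""
    if key in datasets:
        return True
    # Key not found: list what is registered so a typo is easy to spot.
    for name in datasets.keys():
        print(name)
    return False


# A plain dict stands in for ws.datasets (assumption for illustration only).
registered = {"online-news-train": object(), "bank-marketing": object()}

found = find_dataset_key(registered, "online-news-train")
missing = find_dataset_key(registered, "online_news_train")  # prints both names
```

In the notebook itself, the same check would be run against `ws.datasets` before deciding whether to re-upload the CSV.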

In [7]:
DATA_LOC = "https://raw.githubusercontent.com/franckess/AzureML_Capstone/main/data/OnlineNewsPopularity.csv"
BORUTA_LOC = "https://github.com/franckess/AzureML_Capstone/releases/download/1.1/boruta_model_final.pkl"

# Loading data
df = pd.read_csv(DATA_LOC)

# Removing space character in the feature names
df.columns=df.columns.str.replace(' ','')

# Drop URL column
df = df.drop(['url'], axis=1)

# Perform Data pre-processing
df = corr_drop_cols(df)
df = create_label(df)
df = scaling_num(df)
df = feature_selection(df, BORUTA_LOC)
    
# Split train data into train & test
X_train, X_test, y_train, y_test = split_train_test(df)

m, k = X_train.shape
print("{} x {} table of data:".format(m, k))
X_train.info()

31715 x 47 table of data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 31715 entries, 38512 to 35050
Data columns (total 47 columns):
n_tokens_title                   31715 non-null float64
n_tokens_content                 31715 non-null float64
n_unique_tokens                  31715 non-null float64
num_hrefs                        31715 non-null float64
num_self_hrefs                   31715 non-null float64
num_imgs                         31715 non-null float64
num_videos                       31715 non-null float64
average_token_length             31715 non-null float64
num_keywords                     31715 non-null float64
data_channel_is_entertainment    31715 non-null int64
data_channel_is_bus              31715 non-null int64
data_channel_is_socmed           31715 non-null int64
data_channel_is_tech             31715 non-null int64
data_channel_is_world            31715 non-null int64
kw_min_min                       31715 non-null float64
kw_max_min                     

### Upload data to Azure Datastore

In [8]:
# Merge the X and y dataframes into a single table for the AutoML experiment
train_data = pd.concat([X_train, y_train], axis=1)
os.makedirs('./data', exist_ok=True)  # to_csv does not create missing directories
train_data.to_csv('./data/train_data.csv', index=None, header=True)

datastore = ws.get_default_datastore()
datastore.upload_files(files = ['./data/train_data.csv'],  target_path='data/', overwrite=True, show_progress=True)

datastore_path =[
    DataPath(datastore, 'data/train_data.csv')
]

# Upload the training data as a tabular dataset for access during training on remote compute
train_data = Dataset.Tabular.from_delimited_files(path=datastore_path)

Uploading an estimated of 1 files
Uploading ./data/train_data.csv
Uploaded ./data/train_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


In [9]:
print(
    "Datastore type: " + datastore.datastore_type,
    "Account name: " + datastore.account_name,
    "Container name: " + datastore.container_name,
    sep="\n",
)

Datastore type: AzureBlob
Account name: mlstrg142415
Container name: azureml-blobstore-e3f99bb8-a492-4d55-add2-2ab0bb5281ce


In [10]:
train_data

{
  "source": [
    "('workspaceblobstore', 'data/train_data.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}

## Train
This creates a general AutoML settings object.
**Udacity note:** These inputs must match what was used when training in the portal. For example, `label_column_name` must name the target column, which is `label` in this dataset.

In [11]:
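automl_settings = {
    "experiment_timeout_minutes": 60,  # maximum duration of the experiment, in minutes
    "max_concurrent_iterations": 9,    # iterations beyond the cluster's max_nodes (4) wait in a queue
    "primary_metric": 'accuracy'       # metric AutoML optimizes during model selection
}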

project_folder = './capstone-project'

automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=train_data,
                             label_column_name="label",   
                             path = project_folder,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "online_news_automl_errors.log",
                             n_cross_validations=5,
                             max_cores_per_iteration=-1,
                             verbosity=logging.INFO,
                             **automl_settings)
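
An `AmlCompute` cluster runs one AutoML iteration per node, so the concurrency actually achieved is bounded by the cluster size. The sketch below makes that bound explicit; `effective_concurrency` is a hypothetical helper, not part of the SDK, and the fallback of 1 is an assumption about the default when the setting is omitted.

```python
def effective_concurrency(settings, max_nodes):
    """Return how many AutoML iterations can run at the same time."""
    # Assumed default of 1 when the setting is not provided.
    requested = settings.get("max_concurrent_iterations", 1)
    # Each node runs one iteration, so the cluster size caps the value.
    return min(requested, max_nodes)


automl_settings = {
    "experiment_timeout_minutes": 60,
    "max_concurrent_iterations": 9,
    "primary_metric": "accuracy",
}
cluster_max_nodes = 4  # matches max_nodes used when provisioning cpu-cluster

concurrent = effective_concurrency(automl_settings, cluster_max_nodes)
print("Requested:", automl_settings["max_concurrent_iterations"],
      "| effective:", concurrent)
```

With the values used in this notebook, requesting 9 concurrent iterations on a 4-node cluster means the remaining iterations wait until a node is free.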

In [12]:
# Submit your automl run
automl_exp = Experiment(workspace=ws, name="Udacity_capstone_AutoML")  
automl_run = automl_exp.submit(automl_config, show_output = True)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)

Running on remote.
No run_configuration provided, running on cpu-cluster with default configuration
Running on remote compute: cpu-cluster
Parent Run ID: AutoML_690cf95e-ea6f-40dc-b297-5261ea2009ed

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.
              Learn more abo

{'runId': 'AutoML_690cf95e-ea6f-40dc-b297-5261ea2009ed',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-04-11T12:39:14.341114Z',
 'endTimeUtc': '2021-04-11T12:59:50.012852Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'cpu-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"5314e080-abe3-4da7-acf1-c105bcd56e1e\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"data/train_data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-142415\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"81cefad3-d2c9-4f77-a466-99a7f541c7bb\\\\\\", \\\\\\"work

## Examine Results

### Retrieve the Best Model

In [None]:
best_run, best_model = automl_run.get_output()
print(best_model.steps)

In [None]:
get_best_autoML_metrics = best_run.get_metrics()
for run_metric in get_best_autoML_metrics:
    metric = get_best_autoML_metrics[run_metric]
    print(run_metric,metric)

In [None]:
best_run.get_file_names()

In [None]:
# Save the best model locally
automl_model_name = best_run.properties['model_name']
os.makedirs("output", exist_ok=True)  # joblib.dump does not create missing directories
joblib.dump(best_model, filename="output/automl_model.pkl")
print("Model saved successfully!")

In [None]:
# Register best model
AutoML_model = best_run.register_model(model_name = 'best_autoML_model', model_path =  'outputs/model.pkl')
AutoML_model

In [None]:
best_run

In [None]:
def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0]+ ' - ')
        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):
            print("\nMeta Learner")
            pprint(step[1]._meta_learner)
            print()
            for estimator in step[1]._base_learners:
                print_model(estimator[1], estimator[0]+ ' - ')
        else:
            pprint(step[1].get_params())
            print()
            
print_model(best_model)

In [None]:
# Download the scoring file generated by AutoML; the deployment cell uses script_file_name
script_file_name = './automl_score.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)

In [None]:
# Download environment file
best_run.download_file('outputs/conda_env_v_1_0_0.yml', './AzureML_envFile.yml')

## Model Deployment

Create an inference config and deploy the model as a web service.

In [None]:
inference_config = InferenceConfig(entry_script=script_file_name)

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 2, 
                                               memory_gb = 2, 
                                               tags = {'Company': "Mashable", 'type': "capstone_Classifier"}, 
                                               description = 'sample service for Capstone Project AutoML Classifier for Online News popularity')

In [None]:
aci_service_name = 'capstone-automl'
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [AutoML_model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)
print(aci_service.scoring_uri)
print(aci_service.swagger_uri)

Test the deployed web service with a few rows from the held-out test set.

In [None]:
test_data = pd.concat([X_test, y_test], axis=1)
test_data = test_data[10:15]
display(test_data)

In [None]:
# remove label column
label_data = test_data.pop('label')

# convert test input data to dictionary form
input_data = json.dumps({'data': test_data.to_dict(orient='records')})

# print test input data
print(input_data)
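
The request body built above is a JSON object with a single `data` key holding one dictionary per row. A minimal sketch of that shape, using two made-up records with only two of the feature columns (real rows would carry every feature column):

```python
import json

# Hypothetical records; the values are invented for illustration.
records = [
    {"n_tokens_title": 0.5, "num_imgs": 0.1},
    {"n_tokens_title": 0.3, "num_imgs": 0.0},
]

# Same structure as test_data.to_dict(orient='records') wrapped under 'data'.
input_data = json.dumps({"data": records})

# Round-trip to confirm the payload is valid JSON with one entry per row.
decoded = json.loads(input_data)
print(len(decoded["data"]), "rows, columns:", sorted(decoded["data"][0]))
```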

In [None]:
output = aci_service.run(input_data)
print(output)
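
The same call can also be made over plain HTTP against `aci_service.scoring_uri`. The sketch below only builds the request and does not send it; `build_scoring_request` is a hypothetical helper, the URL is a placeholder, and it assumes the service was deployed without key authentication.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, payload):
    """Build a JSON POST request for a scoring endpoint without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; in the notebook this would be aci_service.scoring_uri.
request = build_scoring_request(
    "http://example.invalid/score",
    {"data": [{"n_tokens_title": 0.5}]},
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request).read()
```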

In [None]:
print(aci_service.get_logs())

### Create Pipeline and AutoMLStep

Define outputs for the AutoMLStep using TrainingOutput.

In [None]:
ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

In [None]:
# Create AutoMLStep
automl_step = AutoMLStep(name='automl_module',
                         automl_config=automl_config,
                         outputs=[metrics_data, model_data],
                         allow_reuse=True)

In [None]:
pipeline = Pipeline(description="pipeline_with_automlstep",
                    workspace=ws,    
                    steps=[automl_step])

pipeline_run = experiment.submit(pipeline)

In [None]:
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()

## Examine Pipeline Results

### Retrieve the metrics of all child runs

The outputs of the run above can be used as inputs to other pipeline steps. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [None]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

In [None]:
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df
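
Once the metrics are deserialized, the best child run can be picked programmatically. The sketch below assumes the JSON maps each child run ID to a dictionary of metric names, each holding a one-element list; that layout, the run IDs, and the scores are all assumptions made for illustration, so check the real file before relying on it.

```python
import json

# Made-up metrics in the assumed layout: {run_id: {metric_name: [value]}}.
metrics_json = json.dumps({
    "run_0": {"accuracy": [0.64], "AUC_weighted": [0.70]},
    "run_1": {"accuracy": [0.67], "AUC_weighted": [0.73]},
})


def best_run_by_metric(raw_json, metric):
    """Return (run_id, score) for the run with the highest value of `metric`."""
    runs = json.loads(raw_json)
    # Keep only runs that reported the metric, unwrapping the single-item list.
    scores = {
        run_id: values[metric][0]
        for run_id, values in runs.items()
        if metric in values
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


best_id, best_score = best_run_by_metric(metrics_json, "accuracy")
print(best_id, best_score)
```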

### Retrieve the Best Model

In [None]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

In [None]:
with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

In [None]:
best_model.steps

### Test Model

Test the best fitted model on the held-out test set.

In [None]:
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

In [None]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)
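
Summary metrics can be read directly off the confusion matrix. The sketch below assumes a binary label and the scikit-learn layout, where rows are true classes and columns are predicted classes; `binary_metrics` is a hypothetical helper and the counts are invented.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, and recall from a 2x2 confusion matrix."""
    # scikit-learn layout: [[TN, FP], [FN, TP]]
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented counts for illustration only.
example_cm = [[40, 10], [5, 45]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print("{}: {:.3f}".format(name, value))
```

In the notebook, passing `cm.tolist()` from the cell above would give the same breakdown for the real predictions.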