Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Azure Machine Learning Pipeline with AutoMLStep (Udacity Course 2)
This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline.

## Introduction
In this example we showcase how you can use AzureML Dataset to load data for AutoML via AML Pipeline. 

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing AmlCompute to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using AmlCompute.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("We are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

We are currently using version 1.23.0 of the Azure ML SDK


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

Note, we have launched a browser for you to login. For old experience with device code, use "az login --use-device-code"
Performing interactive authentication. Please follow the instructions on the terminal.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
quick-starts-ws-140144
aml-quickstarts-140144
southcentralus
6b4af8be-9931-443e-90f6-c4c34a1f9737
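
`Workspace.from_config()` reads the workspace coordinates from `config.json`. As a minimal sketch, the helper below checks that such a file contains the three keys found in the file the Azure portal generates (`subscription_id`, `resource_group`, `workspace_name`) before you try to connect. The helper name `missing_config_keys` and the placeholder values are hypothetical.

```python
import json
import os
import tempfile

# Keys found in the config.json that the Azure portal generates for a workspace.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path):
    """Return the required keys that are absent or empty in the JSON file at `path`."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with a throwaway file and placeholder values (not real IDs).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"subscription_id": "<subscription-id>",
                   "resource_group": "<resource-group>"}, f)
    missing = missing_config_keys(demo_path)

print("Missing keys:", missing)
```

Against a real setup you would call `missing_config_keys("config.json")` and expect an empty list.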


## Create an Azure ML experiment
Let's create an experiment named "automl-bank-experiment" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

The best practice is to keep each step's scripts and their dependent files in a separate folder and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, a small, step-specific folder also lets the step be reused when nothing in its `source_directory` has changed.
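
To see why a small, step-specific folder helps reuse, here is a minimal sketch that fingerprints a folder by hashing its file paths and contents. It only illustrates the idea that any file change yields a different snapshot identity. It is not how Azure ML computes snapshots, and `folder_fingerprint` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def folder_fingerprint(folder):
    """Hash the relative path and bytes of every file under `folder`."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, folder).encode("utf-8"))
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as step_dir:
    script = os.path.join(step_dir, "train.py")
    with open(script, "w", encoding="utf-8") as f:
        f.write("print('v1')\n")
    before = folder_fingerprint(step_dir)
    unchanged = folder_fingerprint(step_dir)

    # Editing any file in the folder changes the fingerprint.
    with open(script, "w", encoding="utf-8") as f:
        f.write("print('v2')\n")
    after = folder_fingerprint(step_dir)

print("Same content, same fingerprint:", before == unchanged)
print("Edited file, new fingerprint:", before != after)
```

Unrelated files placed in the same folder would change the fingerprint too, which is the reason for giving each step its own directory.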

*Udacity Note:* There is no need to create a new Azure ML experiment; reuse the experiment that was already created.


In [3]:
# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing exp name
experiment_name = 'automl-bank-experiment'
project_folder = './pipeline-bank-project'

experiment = Experiment(ws, experiment_name)

experiment

Name,Workspace,Report Page,Docs Page
automl-bank-experiment,quick-starts-ws-140144,Link to Azure Machine Learning studio,Link to Documentation


In [4]:

dic_data = {'Workspace name': ws.name,
            'Azure region': ws.location,
            'Subscription id': ws.subscription_id,
            'Resource group': ws.resource_group,
            'Experiment Name': experiment.name}

df_data = pd.DataFrame.from_dict(data = dic_data, orient='index')

df_data.rename(columns={0:''}, inplace = True)
df_data

Unnamed: 0,Unnamed: 1
Workspace name,quick-starts-ws-140144
Azure region,southcentralus
Subscription id,6b4af8be-9931-443e-90f6-c4c34a1f9737
Resource group,aml-quickstarts-140144
Experiment Name,automl-bank-experiment


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource.

**Udacity Note** There is no need to create a new compute target; the previous cluster can be reused.

In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "cpu-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',# for GPU, use "STANDARD_NC6"
                                                           #vm_priority = 'lowpriority', # optional
                                                           min_nodes=1,
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)
    
    compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)
# For a more detailed view of current AmlCompute status, use get_status().
print(compute_target.get_status().serialize())

Creating...
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded......................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 1, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-03-10T09:53:36.191000+00:00', 'errors': None, 'creationTime': '2021-03-10T09:51:27.453394+00:00', 'modifiedTime': '2021-03-10T09:51:42.881348+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS12_V2'}


In [6]:
# Check details about compute_targets (i.e. compute_target)

compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)

cpu-cluster AmlCompute Succeeded


## Data

**Udacity note:** Make sure `key` matches the name of the uploaded dataset and that the description matches. If the name is unknown or hard to find, loop over `ws.datasets.keys()` and `print()` each one.
If the dataset *isn't* found because it was deleted, the cell below recreates it from the CSV link.
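
The lookup described in the note can be sketched as follows. A plain list stands in for `ws.datasets.keys()` so the snippet runs without a workspace; `find_dataset_key` is a hypothetical helper, and every name except "BankMarketing Dataset" is made up.

```python
def find_dataset_key(available_names, wanted):
    """Return the registered name matching `wanted`, ignoring case and outer spaces."""
    target = wanted.strip().lower()
    for name in available_names:
        if name.strip().lower() == target:
            return name
    return None


# Stand-in for ws.datasets.keys(); the second and third names are made up.
registered = ["BankMarketing Dataset", "iris-demo", "taxi-sample"]

# Print every registered name, as the note suggests.
for name in registered:
    print(name)

match = find_dataset_key(registered, "bankmarketing dataset")
print("Matched key:", match)
```

With a live workspace you would pass `ws.datasets.keys()` as the first argument and use the returned name as `key`.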

In [7]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
found = False
key = "BankMarketing Dataset"
description_text = "Bank Marketing DataSet for Udacity Course 2"

if key in ws.datasets.keys(): 
        found = True
        dataset = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
        dataset = Dataset.Tabular.from_delimited_files(example_data)        
        #Register Dataset in Workspace
        dataset = dataset.register(workspace=ws,
                                   name=key,
                                   description=description_text)


df = dataset.to_pandas_dataframe()

# Describe dataset
m, k = df.shape
print("{} x {} table of data:".format(m, k))
display(df.head())
print("...")

32950 x 21 table of data:


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,...,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,...,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,...,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1,no
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,no


...


In [8]:
df.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

### Further review of the Dataset

#### Featurization

Featurization customization is an advanced AutoML feature that lets you change the default featurization behavior and column types through `FeaturizationConfig`. This notebook keeps the default (`featurization='auto'`) and only checks the class balance of the label column below.

In [9]:
df.y.value_counts(normalize=True)

no     0.887951
yes    0.112049
Name: y, dtype: float64

We are facing a case of `Imbalanced data`: **88.80% - no** and **11.20% - yes**.

**Automated ML** has built in capabilities to help deal with imbalanced data such as:

* Apply a weight column to the dataset.
* Run an experiment with sub-sampled data.
* Review performance metric for imbalanced data such as AUC_weighted and F1-score.
* Resampling to even the class imbalance.

## Train
This creates a general AutoML settings object.
**Udacity notes:** These inputs must match what was used when training in the portal. `label_column_name` has to be `y` for example.

In [10]:
automl_settings = {
    "experiment_timeout_minutes": 60, # define the duration of the experiment (in minutes).
    "max_concurrent_iterations": 5,
    "primary_metric" : 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="y",   
                             path = project_folder,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

**__NOTE__:** I use the parameter `blocked_models=['XGBoostClassifier']` in my `AutoMLConfig`, although the cell above does not show it, because:

* The AutoML training environment uses `Python 3.6` and `py-xgboost<=0.90`.
* My Python dev environment uses `Python 3.8.5`, which does not support `py-xgboost<=0.90`.

#### Create Pipeline and AutoMLStep

You can define outputs for the AutoMLStep using TrainingOutput.

In [11]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Create an `AutoMLStep`.

In [12]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

We can now create a pipeline using the `AutoMLStep` created above.

In [13]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [14]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [55cf9d05][c9ae4379-43a8-407a-8e96-1314f09c1a4e], (This step will run and generate new outputs)
Submitted PipelineRun f16a1ce1-6f4c-4839-a6eb-ca81e45a814c
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automl-bank-experiment/runs/f16a1ce1-6f4c-4839-a6eb-ca81e45a814c?wsid=/subscriptions/6b4af8be-9931-443e-90f6-c4c34a1f9737/resourcegroups/aml-quickstarts-140144/workspaces/quick-starts-ws-140144


In [15]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [16]:
pipeline_run.wait_for_completion()

PipelineRunId: f16a1ce1-6f4c-4839-a6eb-ca81e45a814c
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automl-bank-experiment/runs/f16a1ce1-6f4c-4839-a6eb-ca81e45a814c?wsid=/subscriptions/6b4af8be-9931-443e-90f6-c4c34a1f9737/resourcegroups/aml-quickstarts-140144/workspaces/quick-starts-ws-140144
PipelineRun Status: Running


StepRunId: 1dce1f34-1509-49d0-9a73-79b3f2a03411
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automl-bank-experiment/runs/1dce1f34-1509-49d0-9a73-79b3f2a03411?wsid=/subscriptions/6b4af8be-9931-443e-90f6-c4c34a1f9737/resourcegroups/aml-quickstarts-140144/workspaces/quick-starts-ws-140144
StepRun( automl_module ) Status: NotStarted
StepRun( automl_module ) Status: Queued
StepRun( automl_module ) Status: Running

StepRun(automl_module) Execution Summary
StepRun( automl_module ) Status: Finished



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'f16a1ce1-6f4c-4839-a6eb-ca81e45a814c', 'status': 'Comple

'Finished'

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in the pipeline. In this tutorial, we will examine the outputs by retrieving the output data and running some tests.

In [17]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/1dce1f34-1509-49d0-9a73-79b3f2a03411/metrics_data
Downloaded azureml/1dce1f34-1509-49d0-9a73-79b3f2a03411/metrics_data, 1 files out of an estimated total of 1


In [18]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,1dce1f34-1509-49d0-9a73-79b3f2a03411_7,1dce1f34-1509-49d0-9a73-79b3f2a03411_0,1dce1f34-1509-49d0-9a73-79b3f2a03411_8,1dce1f34-1509-49d0-9a73-79b3f2a03411_14,1dce1f34-1509-49d0-9a73-79b3f2a03411_10,1dce1f34-1509-49d0-9a73-79b3f2a03411_27,1dce1f34-1509-49d0-9a73-79b3f2a03411_21,1dce1f34-1509-49d0-9a73-79b3f2a03411_13,1dce1f34-1509-49d0-9a73-79b3f2a03411_31,1dce1f34-1509-49d0-9a73-79b3f2a03411_26,...,1dce1f34-1509-49d0-9a73-79b3f2a03411_18,1dce1f34-1509-49d0-9a73-79b3f2a03411_20,1dce1f34-1509-49d0-9a73-79b3f2a03411_24,1dce1f34-1509-49d0-9a73-79b3f2a03411_22,1dce1f34-1509-49d0-9a73-79b3f2a03411_25,1dce1f34-1509-49d0-9a73-79b3f2a03411_12,1dce1f34-1509-49d0-9a73-79b3f2a03411_16,1dce1f34-1509-49d0-9a73-79b3f2a03411_19,1dce1f34-1509-49d0-9a73-79b3f2a03411_29,1dce1f34-1509-49d0-9a73-79b3f2a03411_17
AUC_micro,[0.9591913991171614],[0.979695082216353],[0.9651563849212837],[0.9627301217414531],[0.8409726421372337],[0.9792746171257781],[0.9792565642982309],[0.9758990146932517],[0.9782191714581113],[0.9793693944704005],...,[0.8415028057870366],[0.9763877305247063],[0.9789654164009017],[0.9766515228619257],[0.9733974085902907],[0.8268028304254619],[0.8372877468735681],[0.9746105401802059],[0.9795984627464707],[0.8300112599906512]
precision_score_weighted,[0.8447687581189142],[0.9072720074188747],[0.788565560086672],[0.788565560086672],[0.8762921187995815],[0.9051091556982174],[0.9062798949414683],[0.8929725418691179],[0.9026893654928896],[0.9080335867085474],...,[0.8966294781600979],[0.895092517403297],[0.9113037932659365],[0.9000274768383943],[0.8843704047191627],[0.8788181388329129],[0.8914692469897325],[0.8890546332831104],[0.9074329436294158],[0.880811359793208]
average_precision_score_macro,[0.7171045179270072],[0.8151093723721079],[0.7382949076278038],[0.7319151069033817],[0.7242131699523829],[0.8150378454592331],[0.8126929119384294],[0.7985126174047921],[0.8043768280782629],[0.8085204474402641],...,[0.7192687147508846],[0.7953500733144905],[0.812573046671845],[0.7998321444303222],[0.7791553533474418],[0.7074092097970927],[0.7118838873357954],[0.7810523962199729],[0.8198704441305439],[0.713816524369288]
AUC_macro,[0.8578750090303364],[0.9450464668693166],[0.887867766237471],[0.8756689395328676],[0.859009589754134],[0.9438841004951403],[0.943998021661693],[0.9308878256246677],[0.9400441236128014],[0.9448491887516277],...,[0.8734155232871537],[0.9329981457709313],[0.9430264500867838],[0.9342679499932389],[0.9243114252741982],[0.850013985444024],[0.8529537072540925],[0.9285931939975585],[0.9442592067752529],[0.8324821662433985]
average_precision_score_micro,[0.9586854354179465],[0.9806603102489483],[0.9643687504090147],[0.9638724327121927],[0.8292355418289271],[0.980258627086168],[0.9802395848606664],[0.9766643355999638],[0.9792529378468265],[0.980352027134298],...,[0.8477036469540248],[0.9775048906893984],[0.9799669312828488],[0.9777871805237555],[0.9746121226103798],[0.8083696028796385],[0.794703907496258],[0.9757189583187845],[0.9805583578526404],[0.8085191533067357]
recall_score_micro,[0.8880121396054628],[0.9116843702579667],[0.8880121396054628],[0.8880121396054628],[0.7089529590288316],[0.9113808801213961],[0.9125948406676783],[0.9062215477996965],[0.9092564491654022],[0.9128983308042489],...,[0.7326251896813354],[0.9077389984825494],[0.9165402124430956],[0.9101669195751139],[0.9004552352048558],[0.7298937784522003],[0.8169954476479514],[0.9025796661608497],[0.9132018209408195],[0.7614567526555387]
norm_macro_recall,[0.002368263600612819],[0.5026785366965085],[0.0],[0.0],[0.4496320253701511],[0.4762858735901099],[0.4800211911893555],[0.24549085203770704],[0.4644204746900509],[0.5016773270945287],...,[0.5710201223680043],[0.3585080587647982],[0.510515016291653],[0.4109757023749321],[0.17742249192826853],[0.4708454432459568],[0.5547219860441941],[0.3455932884687698],[0.49017777259112316],[0.48981100200612415]
accuracy,[0.8880121396054628],[0.9116843702579667],[0.8880121396054628],[0.8880121396054628],[0.7089529590288316],[0.9113808801213961],[0.9125948406676783],[0.9062215477996965],[0.9092564491654022],[0.9128983308042489],...,[0.7326251896813354],[0.9077389984825494],[0.9165402124430956],[0.9101669195751139],[0.9004552352048558],[0.7298937784522003],[0.8169954476479514],[0.9025796661608497],[0.9132018209408195],[0.7614567526555387]
balanced_accuracy,[0.5011841318003064],[0.7513392683482543],[0.5],[0.5],[0.7248160126850756],[0.7381429367950549],[0.7400105955946777],[0.6227454260188535],[0.7322102373450254],[0.7508386635472644],...,[0.7855100611840021],[0.6792540293823991],[0.7552575081458265],[0.705487851187466],[0.5887112459641343],[0.7354227216229784],[0.777360993022097],[0.6727966442343849],[0.7450888862955616],[0.7449055010030621]
weighted_accuracy,[0.9840510704229208],[0.9514937218005303],[0.9843450583187134],[0.9843450583187134],[0.7050145918943272],[0.9543911754422495],[0.9554428403944659],[0.9766010009385309],[0.9532122345414047],[0.9531333625443325],...,[0.7194953065987924],[0.9644656358962787],[0.9565823452967744],[0.9609831957806473],[0.9778528352011013],[0.7285210914182785],[0.8268356106376938],[0.9596285749796182],[0.954939715235299],[0.765565980737067]
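
The metrics file maps each child run ID to a dictionary of metric names, each holding a one-element list. A minimal sketch of ranking child runs by one metric follows. To stay self-contained it uses three runs only, with run IDs shortened to their suffix and the `AUC_macro` and `accuracy` values copied from the table above; `rank_runs` is a hypothetical helper.

```python
import json

# Same shape as the downloaded metrics file: {run_id: {metric_name: [value]}}.
# Run IDs are shortened to their suffix; values are copied from the table above.
metrics_json = json.dumps({
    "run_7": {"AUC_macro": [0.8578750090303364], "accuracy": [0.8880121396054628]},
    "run_0": {"AUC_macro": [0.9450464668693166], "accuracy": [0.9116843702579667]},
    "run_29": {"AUC_macro": [0.9442592067752529], "accuracy": [0.9132018209408195]},
})


def rank_runs(metrics, metric_name):
    """Return (run_id, value) pairs sorted from best to worst for one metric."""
    scores = [(run_id, values[metric_name][0])
              for run_id, values in metrics.items()
              if metric_name in values]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_runs(json.loads(metrics_json), "AUC_macro")
for run_id, value in ranking:
    print(f"{run_id}: {value:.4f}")
```

With the real file you would pass `deserialized_metrics_output` and the primary metric name instead of the hand-built dictionary.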


### Retrieve the Best Model

If you are a Mac user, make sure that:

* **lightgbm** is installed: `brew install lightgbm`
* The version of **XGBoost** installed in your local environment matches the version installed in the AutoML training environment.

Install the correct version as follows:
```bash
conda install -c anaconda py-xgboost==<the_specific_version>
```


In [19]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/1dce1f34-1509-49d0-9a73-79b3f2a03411/model_data
Downloaded azureml/1dce1f34-1509-49d0-9a73-79b3f2a03411/model_data, 1 files out of an estimated total of 1


In [21]:
from azureml.core.environment import Environment

# Retrieve AutoML training environment for the best model
print(Environment.get(ws, "AzureML-AutoML").python.conda_dependencies.serialize_to_string())

channels:
- anaconda
- conda-forge
- pytorch
dependencies:
- python=3.6.2
- pip=20.2.4
- pip:
  - azureml-core==1.23.0
  - azureml-pipeline-core==1.23.0
  - azureml-telemetry==1.23.0
  - azureml-defaults==1.23.0
  - azureml-interpret==1.23.0
  - azureml-automl-core==1.23.0
  - azureml-automl-runtime==1.23.0
  - azureml-train-automl-client==1.23.0
  - azureml-train-automl-runtime==1.23.0
  - azureml-dataset-runtime==1.23.0
  - azureml-mlflow==1.23.0
  - inference-schema
  - py-cpuinfo==5.0.0
  - boto3==1.15.18
  - botocore==1.18.18
- numpy~=1.18.0
- scikit-learn==0.22.1
- pandas~=0.25.0
- py-xgboost<=0.90
- fbprophet==0.5
- holidays==0.9.11
- setuptools-git
- psutil>5.0.0,<6.0.0
name: azureml_661474bbe74e96b5d8added5888dfc85



In [20]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=None,
                                                                  enable_feature_sweeping=None,
                                                                  feature_sweeping_config=None,
                                                                  feature_sweeping_timeout=None,
                                                                  featurization_config=None,
                                                                  force_text_dnn=None,
                                                                  is_cross_validation=None,
                                                                  is_onnx_compatible=None,
                                                                  logger=None,
                                                              

In [21]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(classification_labels=None,
                                estimators=[('0',
                                             Pipeline(memory=None,
                                                      steps=[('maxabsscaler',
                                                              MaxAbsScaler(copy=True)),
                                                             ('lightgbmclassifier',
                                                              LightGBMClassifier(boosting_type='gbdt',
                                                          

### Test the Model
#### Load Test Data
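The test data must go through the same preparation steps as the training data; otherwise prediction may fail at the preprocessing step. Note that the cell below loads the same `bankmarketing_train.csv` file that was used for training, so the results show how well the model fits data it has already seen rather than how it generalizes.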

In [22]:
dataset_test = Dataset.Tabular.from_delimited_files(path='https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv')
df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['y'])]

y_test = df_test['y']
X_test = df_test.drop(['y'], axis=1)

# Describe dataset
m, k = X_test.shape
print("{} x {} table of data:".format(m, k))
display(X_test.head())
print("...")

32950 x 20 table of data:


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,371,1,999,1,failure,-1.8,92.89,-46.2,1.3,5099.1
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,285,2,999,0,nonexistent,1.1,93.99,-36.4,4.86,5191.0
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,52,1,999,1,failure,-1.8,92.89,-46.2,1.31,5099.1
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,355,4,999,0,nonexistent,1.4,94.47,-41.8,4.97,5228.1
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,189,2,999,0,nonexistent,1.4,93.92,-42.7,4.96,5228.1


...


#### Testing Our Best Fitted Model

We will use a confusion matrix to see how well our model performs.

In [23]:
from sklearn.metrics import confusion_matrix
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

In [24]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

Unnamed: 0,0,1
0,28729,529
1,1190,2502


## Publish and run from REST endpoint

Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.

Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform.


In [25]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing Train", 
    description="Training bankmarketing pipeline", 
    version="1.0")

published_pipeline


Name,Id,Status,Endpoint
Bankmarketing Train,3f067809-5669-45c7-9f05-da3bcb9604e5,Active,REST Endpoint


Authenticate once again to retrieve the `auth_header` so that the endpoint can be used.

In [26]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()



Get the REST URL from the `endpoint` property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header, and add a JSON payload object with the experiment name. This pipeline defines no `PipelineParameter` objects, so the payload needs nothing else.

Make the request to trigger the run. Access the Id key from the response dict to get the value of the run id.


In [27]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )

In [28]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  1087f5e9-2faf-4c57-890f-d1514ec5a736


Use the run id to monitor the status of the new run. This will take another 10-15 minutes and will look similar to the previous pipeline run, so you can skip watching the full output if you don't need to see another pipeline run.

In [29]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …
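
Outside a notebook, where the widget is not available, the run can be monitored by polling its status. The following is a minimal sketch: `wait_for_terminal_status` is a hypothetical helper, the set of terminal status strings is an assumption, and a fixed sequence stands in for `published_pipeline_run.get_status` so the snippet runs without a workspace.

```python
import time


def wait_for_terminal_status(get_status, poll_seconds=30.0, max_polls=60,
                             terminal=("Finished", "Completed", "Failed", "Canceled")):
    """Call `get_status()` until it returns a terminal value or polls run out."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        print("Status:", status)
        if status in terminal:  # assumed terminal states; adjust as needed
            break
        time.sleep(poll_seconds)
    return status


# Stand-in for published_pipeline_run.get_status, replaying a fixed sequence.
fake_statuses = iter(["NotStarted", "Running", "Finished"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
```

With a real run you would pass `published_pipeline_run.get_status` in place of the lambda and keep a non-zero polling interval.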