# Azure Machine Learning Pipeline with AutoMLStep 

This notebook is part of [Project 2: Operationalizing Machine Learning with Azure](https://github.com/dpbac/Operationalizing-Machine-Learning-with-Azure) of the `Udacity Nanodegree Program : Machine Learning Engineer with Azure`.

In this project we continue working with the Bank Marketing dataset introduced in [Project 1: Optimizing an ML Pipeline in Azure](https://github.com/dpbac/Optimizing-an-ML-Pipeline-in-Azure). We use Azure to configure a cloud-based machine learning production model, deploy it, and consume it.

This notebook demonstrates the use of `AutoMLStep` in an Azure Machine Learning pipeline. With the code presented here we create, publish, and consume a pipeline.

The best model is generated using AutoML for classification on the dataset available at https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

This notebook performs the following steps:

1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using `AmlCompute`.
7. Explore the results.
8. Test the best fitted model.
9. Publish and run from a REST endpoint.



## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
# Azure Machine Learning and Pipeline SDK-specific imports

import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.19.0


## Initialize Workspace

To start, we need to initialize our workspace and create an Azure ML experiment. Keep in mind that accessing the Azure ML workspace requires authentication with Azure.

Make sure the config file is present at `.\config.json`. This file can be downloaded from the home page of Azure Machine Learning Studio.
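As a quick sanity check before calling `Workspace.from_config()`, the sketch below verifies that a config file contains the three keys the SDK reads (`subscription_id`, `resource_group`, `workspace_name`). The helper name `missing_config_keys` and the placeholder values are assumptions made for illustration; the real file comes from your own workspace.

```python
import json

# Keys that Workspace.from_config() reads from config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_text):
    """Return the required keys that are absent from a config.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values only; the real file is downloaded from the studio.
sample_config = json.dumps({
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
})

print(missing_config_keys(sample_config))
```

To check a real file, read it with `open("config.json").read()` and pass the text to the helper.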

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-133370
aml-quickstarts-133370
southcentralus
610d6e37-4747-4a20-80eb-3aad70a55f43


## Create an Azure ML experiment

Let's create an experiment named `automl-bank-experiment` and define a folder to hold the training scripts, `./pipeline-bank-project`. The script runs will be recorded under the experiment in Azure.

The best practice is to use a separate folder for the scripts and dependent files of each step and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, since only that folder is snapshotted. Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, it also allows the step to be reused when nothing in its `source_directory` has changed.
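The folder-per-step layout described above could be prepared along these lines. This is only a sketch: the step names `prep` and `train` are hypothetical (this notebook uses a single `AutoMLStep`), and a temporary directory stands in for the project folder so the example does not touch your working directory.

```python
import os
import tempfile


def create_step_folders(root, step_names):
    """Create one sub-folder per pipeline step and return their paths."""
    paths = {}
    for step in step_names:
        path = os.path.join(root, step)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths[step] = path
    return paths


# Hypothetical step names, for illustration only.
root = tempfile.mkdtemp(prefix="pipeline-bank-project-")
step_folders = create_step_folders(root, ["prep", "train"])
print(sorted(step_folders))
```

Each returned path would then be passed as the `source_directory` of the corresponding step.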

In [3]:
# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing exp name
experiment_name = 'automl-bank-experiment'
project_folder = './pipeline-bank-project'

experiment = Experiment(ws, experiment_name)

experiment

Name,Workspace,Report Page,Docs Page
automl-bank-experiment,quick-starts-ws-133370,Link to Azure Machine Learning studio,Link to Documentation


In [4]:
dic_data = {'Workspace name': ws.name,
            'Azure region': ws.location,
            'Subscription id': ws.subscription_id,
            'Resource group': ws.resource_group,
            'Experiment Name': experiment.name}

df_data = pd.DataFrame.from_dict(data = dic_data, orient='index')

df_data.rename(columns={0:''}, inplace = True)
df_data

Unnamed: 0,Unnamed: 1
Workspace name,quick-starts-ws-133370
Azure region,southcentralus
Subscription id,610d6e37-4747-4a20-80eb-3aad70a55f43
Resource group,aml-quickstarts-133370
Experiment Name,automl-bank-experiment


### Create or Attach an AmlCompute cluster

Now that we have initialized our workspace and created our experiment, it is time to define our resources. This means we need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for our AutoML run.

In this section we create a compute cluster for use by the notebook, or reuse it if it already exists. This `AmlCompute` cluster is our training compute resource.

In order to create a cluster we need to specify a compute configuration that defines the `type of machine` to be used and the `scalability behaviors`. Also, it is necessary to define the name of the cluster which must be unique within the workspace. This name is used to address the cluster later.

For this project we use a CPU cluster with the following parameters:

* `type of the machine`:

    * `vm_size`: Defines the size of the virtual machine. We use here "STANDARD_DS12_V2" (more details [here](https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-sizes-specs#dv2-series))

* `Scalability behaviors`:

    * `min_nodes`: Sets the minimum size of the cluster. With a minimum of 0, the cluster shuts down all nodes while not in use. A higher value gives faster start-up times, but you are also billed while the cluster is idle. In this experiment we define `min_nodes` as 1.

    * `max_nodes`: Sets the maximum size of the cluster. A larger number allows for more concurrency and greater distributed processing of scale-out jobs.
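The effect of the two scaling bounds can be illustrated with a deliberately simplified model. The function below is an assumption made for illustration, not the actual Azure autoscaling algorithm: it supposes one job per node and simply clamps demand between `min_nodes` and `max_nodes`.

```python
def nodes_in_use(pending_jobs, min_nodes, max_nodes):
    """Clamp the number of requested nodes to the cluster's scaling bounds.

    Simplifying assumption: every pending job needs exactly one node.
    """
    return max(min_nodes, min(pending_jobs, max_nodes))


# Bounds used in this notebook: min_nodes=1, max_nodes=4.
for jobs in (0, 2, 10):
    print(jobs, "pending job(s) ->", nodes_in_use(jobs, min_nodes=1, max_nodes=4), "node(s)")
```

With `min_nodes=1` one node stays allocated even when nothing is queued, which is the idle cost mentioned above.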

In [5]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Define CPU cluster name
compute_target_name = "cpu-cluster"


# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=compute_target_name)
    print("Found existing cpu-cluster. Use it.")
except ComputeTargetException:
    # Specify the configuration for the new cluster
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS12_V2",
                                                           #vm_priority = 'lowpriority', # optional
                                                           min_nodes=1, # when inactive
                                                           max_nodes=4) # when busy

    # Create the cluster with the specified name and configuration
    compute_target = ComputeTarget.create(ws, compute_target_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)
#     compute_target.wait_for_completion(show_output=True)

# For a more detailed view of current AmlCompute status, use get_status()
print(compute_target.get_status().serialize())

Found existing cpu-cluster. Use it.
{'errors': [], 'creationTime': '2021-01-04T19:01:24.797224+00:00', 'createdBy': {'userObjectId': '1f09f37e-8721-4866-b902-0d688c898830', 'userTenantId': '660b3398-b80e-49d2-bc5b-ac1dc93b5254', 'userName': None}, 'modifiedTime': '2021-01-04T19:04:56.436713+00:00', 'state': 'Running', 'vmSize': 'STANDARD_DS12_V2'}


In [6]:
# Check details about compute_targets (i.e. compute_target)

compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)

cpu-cluster ComputeInstance Succeeded


## Dataset

The code below first tries to load the dataset from the workspace. The key is the dataset name, which is used to find the dataset and load it.

If the key is not found, i.e., the dataset is not in the workspace, the dataset is loaded from the web using the link given at the beginning of this notebook and registered in the workspace.

In [7]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file

found = False
key = "BankMarketing Dataset"
description_text = "Bank Marketing DataSet for 2nd Project Udacity Azure ML Engineer."

if key in ws.datasets.keys(): 
        found = True
        dataset = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
        # Create TabularDataset using TabularDatasetFactory
        dataset = Dataset.Tabular.from_delimited_files(example_data)        
        #Register Dataset in Workspace
        dataset = dataset.register(workspace=ws,
                                   name=key,
                                   description=description_text)


#create a data frame
df = dataset.to_pandas_dataframe()

### Inspect Dataset

In [8]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,...,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,...,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,...,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1,no
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,no


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32950 entries, 0 to 32949
Data columns (total 21 columns):
age               32950 non-null int64
job               32950 non-null object
marital           32950 non-null object
education         32950 non-null object
default           32950 non-null object
housing           32950 non-null object
loan              32950 non-null object
contact           32950 non-null object
month             32950 non-null object
day_of_week       32950 non-null object
duration          32950 non-null int64
campaign          32950 non-null int64
pdays             32950 non-null int64
previous          32950 non-null int64
poutcome          32950 non-null object
emp.var.rate      32950 non-null float64
cons.price.idx    32950 non-null float64
cons.conf.idx     32950 non-null float64
euribor3m         32950 non-null float64
nr.employed       32950 non-null float64
y                 32950 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

In [10]:
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0
mean,40.040212,257.335205,2.56173,962.17478,0.17478,0.076228,93.574243,-40.51868,3.615654,5166.859608
std,10.432313,257.3317,2.763646,187.646785,0.496503,1.572242,0.578636,4.623004,1.735748,72.208448
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,179.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,318.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


In [11]:
df.y.value_counts(normalize=True)

no     0.887951
yes    0.112049
Name: y, dtype: float64

The data is imbalanced: 88.80% of the rows have label `no` and 11.20% have label `yes`.

We could apply resampling techniques to balance the dataset before training. However, **Automated ML** has built-in capabilities to help deal with imbalanced data. For instance, the algorithms used by automated ML detect imbalance when the number of samples in the minority class is equal to or fewer than 20% of the number of samples in the majority class, where the minority class is the one with the fewest samples and the majority class is the one with the most. AutoML then runs an experiment with sub-sampled data to check whether using class weights would remedy the problem and improve performance.
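The 20% rule described above can be written as a short check. This is a sketch of the stated rule only, not AutoML's internal code; the function name `is_imbalanced` is an assumption, and the class shares are the ones printed by `df.y.value_counts(normalize=True)`.

```python
def is_imbalanced(class_sizes, threshold=0.20):
    """Return True if the minority class is at most `threshold` of the majority class."""
    minority = min(class_sizes.values())
    majority = max(class_sizes.values())
    return minority / majority <= threshold


# Class shares reported for the label column `y` above.
# Shares work as well as raw counts because only the ratio matters.
class_shares = {"no": 0.887951, "yes": 0.112049}

ratio = class_shares["yes"] / class_shares["no"]
print(f"minority/majority ratio: {ratio:.3f}")
print("imbalanced:", is_imbalanced(class_shares))
```

Note that the rule compares the minority class with the majority class, not with the whole dataset, so the relevant figure here is the ratio of `yes` to `no` rather than the 11.20% share.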

In addition, it is important to use a more appropriate metric such as **AUC_weighted**, a primary metric that weights the contribution of every class by the relative number of samples representing that class. It is therefore more robust against imbalance.

For more details, check this [link](https://docs.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls).
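The weighting idea behind a metric such as AUC_weighted can be sketched as a support-weighted mean of per-class scores, in the spirit of scikit-learn's `average='weighted'` option. The per-class scores and class sizes below are made-up numbers for a hypothetical three-class problem; they are not results from this notebook.

```python
def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its number of samples."""
    total = sum(supports.values())
    return sum(scores[label] * supports[label] / total for label in scores)


def macro_average(scores):
    """Unweighted mean of per-class scores, shown for comparison."""
    return sum(scores.values()) / len(scores)


# Hypothetical per-class scores and class sizes, for illustration only.
per_class_score = {"a": 0.9, "b": 0.8, "c": 0.6}
class_support = {"a": 80, "b": 15, "c": 5}

print("weighted:", weighted_average(per_class_score, class_support))
print("macro:   ", macro_average(per_class_score))
```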



## Train model

Now we use the [BankMarketing Dataset](https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv) to obtain a model by running AutoML.

AutoML is an automated machine learning capability included in Azure ML. It leverages the scalability of cloud compute to automatically try multiple pre-processing techniques and model-training algorithms in parallel to find the best-performing supervised machine learning model for a dataset, in this case the `BankMarketing` dataset we have just loaded.

### Configure AutoML

In [12]:
automl_settings = {
    "experiment_timeout_minutes": 60, # define how long, in minutes, the experiment should continue to run.
    "max_concurrent_iterations": 5,
    "primary_metric" : 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="y",   
                             path = project_folder,
                             enable_early_stopping= True,
                             featurization= 'auto', # as part of preprocessing, data guardrails and featurization steps are performed automatically.
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

### Create Pipeline and AutoMLStep

You can define outputs for the AutoMLStep using TrainingOutput.

In [13]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Create an AutoMLStep.

In [14]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

Now we create the pipeline using the `automl_step` just created.

In [15]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [16]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [3b413623][aa566e83-f548-43df-af85-6c3f6f5bd70d], (This step will run and generate new outputs)
Submitted PipelineRun e08963c5-8e15-4044-9920-c8ee9b2d071a
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automl-bank-experiment/runs/e08963c5-8e15-4044-9920-c8ee9b2d071a?wsid=/subscriptions/610d6e37-4747-4a20-80eb-3aad70a55f43/resourcegroups/aml-quickstarts-133370/workspaces/quick-starts-ws-133370


In [17]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [18]:
pipeline_run.wait_for_completion()

PipelineRunId: e08963c5-8e15-4044-9920-c8ee9b2d071a
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automl-bank-experiment/runs/e08963c5-8e15-4044-9920-c8ee9b2d071a?wsid=/subscriptions/610d6e37-4747-4a20-80eb-3aad70a55f43/resourcegroups/aml-quickstarts-133370/workspaces/quick-starts-ws-133370
PipelineRun Status: Running


This usually indicates a package conflict with one of the dependencies of azureml-core or azureml-pipeline-core.
Please check for package conflicts in your python environment






PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'e08963c5-8e15-4044-9920-c8ee9b2d071a', 'status': 'Completed', 'startTimeUtc': '2021-01-04T19:11:46.546679Z', 'endTimeUtc': '2021-01-04T19:41:14.497874Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://mlstrg133370.blob.core.windows.net/azureml/ExperimentRun/dcid.e08963c5-8e15-4044-9920-c8ee9b2d071a/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=gKA5JKdaeL3B%2B2SnQubRvrrageVdcdEQuwgudEYGTJI%3D&st=2021-01-04T19%3A01%3A49Z&se=2021-01-05T03%3A11%3A49Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://mlstrg133370.blob.core.windows.net/azureml/ExperimentRun/dcid.e08963c5-8e15-4044-9920-c8ee9b2d071a/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=p%2F2CV80%2BsscryIkYevpVSqepGG2ipvhCXncngxGjpwg%3D&st=2021-01-04T19%3A01%3A49Z&se=2021-01-05

'Finished'

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline.

In this section we inspect the pipeline outputs and retrieve the best model.

In [19]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a/metrics_data
Downloaded azureml/95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a/metrics_data, 1 files out of an estimated total of 1


In [20]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_0,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_22,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_13,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_19,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_17,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_25,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_8,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_7,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_21,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_37,...,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_28,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_29,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_2,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_4,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_11,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_26,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_31,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_3,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_34,95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_38
log_loss,[0.17775706110025447],[0.22589233807484954],[0.33655623030329523],[0.20678955773307725],[0.5721342290536302],[0.29313359336803707],[0.2746807819003103],[0.2759988121399161],[0.17981385781039308],[0.17806725250510322],...,[0.19281962981552705],[0.2684414628119108],[0.2425640556745006],[0.27446544902825754],[0.5152463881145037],[0.3304940796991938],[0.22699153043755016],[0.5135658713446012],[0.1816076074356531],[0.19996693332329674]
recall_score_macro,[0.7513392683482543],[0.5827905869626023],[0.6227454260188535],[0.6727966442343849],[0.7305273531204212],[0.5],[0.5],[0.5],[0.7400105955946777],[0.742390899643788],...,[0.7425377004966222],[0.5],[0.5379162985067991],[0.5],[0.7304189890839441],[0.6554005116264423],[0.5667133465593029],[0.802032798181707],[0.743733872745426],[0.739680872543517]
balanced_accuracy,[0.7513392683482543],[0.5827905869626023],[0.6227454260188535],[0.6727966442343849],[0.7305273531204212],[0.5],[0.5],[0.5],[0.7400105955946777],[0.742390899643788],...,[0.7425377004966222],[0.5],[0.5379162985067991],[0.5],[0.7304189890839441],[0.6554005116264423],[0.5667133465593029],[0.802032798181707],[0.743733872745426],[0.739680872543517]
average_precision_score_weighted,[0.9531771295804466],[0.945222197717833],[0.947605275820125],[0.9437150575561564],[0.9054675801419351],[0.948588659974036],[0.927174621933145],[0.9181594429175348],[0.9525161907226625],[0.9544988227892583],...,[0.9514170948185521],[0.9459201488993374],[0.9314178191754465],[0.9215625144281412],[0.9227952121601465],[0.9499966225005592],[0.9459165403870718],[0.9286625308091279],[0.9517838306968476],[0.9536313704564511]
precision_score_micro,[0.9116843702579667],[0.9004552352048558],[0.9062215477996965],[0.9025796661608497],[0.7317147192716237],[0.8880121396054628],[0.8880121396054628],[0.8880121396054628],[0.9125948406676783],[0.9147192716236723],...,[0.910773899848255],[0.8880121396054628],[0.8922610015174507],[0.8880121396054628],[0.7125948406676783],[0.9074355083459787],[0.8992412746585736],[0.812443095599393],[0.9128983308042489],[0.9141122913505311]
f1_score_micro,[0.9116843702579667],[0.9004552352048558],[0.9062215477996965],[0.9025796661608497],[0.7317147192716237],[0.8880121396054628],[0.8880121396054628],[0.8880121396054628],[0.9125948406676783],[0.9147192716236722],...,[0.9107738998482551],[0.8880121396054628],[0.8922610015174507],[0.8880121396054628],[0.7125948406676783],[0.9074355083459787],[0.8992412746585736],[0.812443095599393],[0.9128983308042489],[0.9141122913505311]
weighted_accuracy,[0.9514937218005303],[0.9793227746800656],[0.9766010009385309],[0.9596285749796182],[0.7320095101692564],[0.9843450583187134],[0.9843450583187134],[0.9843450583187134],[0.9554428403944659],[0.957503744982689],...,[0.9525423974350735],[0.9843450583187134],[0.9802352064129604],[0.9843450583187134],[0.7081695867509762],[0.9700089806001332],[0.9817989644773634],[0.8150276908544242],[0.9548972899190897],[0.9574188943502702]
matthews_correlation,[0.5323740218566827],[0.3256750549961802],[0.3976739324324451],[0.4276972780112856],[0.31179619696582217],[0.0],[0.0],[0.0],[0.5254139610791995],[0.5346413710313234],...,[0.5217153406413008],[0.0],[0.20382267725353595],[0.0],[0.30588958891371876],[0.43128813031641494],[0.30257645264619965],[0.438645402871231],[0.5296299768558868],[0.5302822053458942]
f1_score_weighted,[0.9091539479147899],[0.8719631449552753],[0.885603431576398],[0.892406452644354],[0.7784848543055262],[0.8353395018439429],[0.8353395018439429],[0.8353395018439429],[0.9086613440609772],[0.9105638077637765],...,[0.9075335307123638],[0.8353395018439429],[0.8531514455159263],[0.8353395018439429],[0.7641263997663074],[0.8929502722795345],[0.8664403742544295],[0.8405096466507697],[0.9092974412848348],[0.9097977989408933]
accuracy,[0.9116843702579667],[0.9004552352048558],[0.9062215477996965],[0.9025796661608497],[0.7317147192716237],[0.8880121396054628],[0.8880121396054628],[0.8880121396054628],[0.9125948406676783],[0.9147192716236723],...,[0.910773899848255],[0.8880121396054628],[0.8922610015174507],[0.8880121396054628],[0.7125948406676783],[0.9074355083459787],[0.8992412746585736],[0.812443095599393],[0.9128983308042489],[0.9141122913505311]
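The metrics file maps each child run id to a dictionary of metric names, each holding a one-element list. A minimal sketch of how the best run could be picked from that structure follows. It uses `accuracy` because the `AUC_weighted` row is not part of the truncated table above; the three accuracy values are copied from the table, and the helper name `best_run` is an assumption.

```python
import json

# Excerpt in the same shape as the metrics file: run id -> metric -> [value].
# The accuracy values are copied from three columns of the table above.
metrics_json = json.dumps({
    "95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_0": {"accuracy": [0.9116843702579667]},
    "95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_22": {"accuracy": [0.9004552352048558]},
    "95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a_37": {"accuracy": [0.9147192716236723]},
})


def best_run(metrics, metric_name):
    """Return the run id with the highest value of a higher-is-better metric."""
    return max(metrics, key=lambda run_id: metrics[run_id][metric_name][0])


metrics = json.loads(metrics_json)
best_run_id = best_run(metrics, "accuracy")
print(best_run_id, metrics[best_run_id]["accuracy"][0])
```

For a lower-is-better metric such as `log_loss`, `min` would be used instead of `max`.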


### Retrieve the Best Model

In [21]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a/model_data
Downloaded azureml/95d9e91f-19f6-4cdf-a85d-3ad7fc89f72a/model_data, 1 files out of an estimated total of 1


In [22]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=None,
                                                                  enable_feature_sweeping=None,
                                                                  feature_sweeping_config=None,
                                                                  feature_sweeping_timeout=None,
                                                                  featurization_config=None,
                                                                  force_text_dnn=None,
                                                                  is_cross_validation=None,
                                                                  is_onnx_compatible=None,
                                                                  logger=None,
                                                              

In [23]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(classification_labels=None,
                                estimators=[('0',
                                             Pipeline(memory=None,
                                                      steps=[('maxabsscaler',
                                                              MaxAbsScaler(copy=True)),
                                                             ('lightgbmclassifier',
                                                              LightGBMClassifier(boosting_type='gbdt',
                                                          

### Test the Model
#### Load Test Data


In [24]:
dataset_test = Dataset.Tabular.from_delimited_files(path='https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv')
df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['y'])]

y_test = df_test['y']
X_test = df_test.drop(['y'], axis=1)

#### Testing Our Best Fitted Model

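We use a confusion matrix to see how the model performs. Note that the data loaded above comes from the same file used for training (`bankmarketing_train.csv`), so the matrix describes the fit on the training data rather than performance on unseen data.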

In [25]:
from sklearn.metrics import confusion_matrix
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

In [26]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

Unnamed: 0,0,1
0,28859,399
1,1049,2643
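The confusion matrix can be summarized with a few standard metrics. The sketch below recomputes them from the four counts shown above, assuming scikit-learn's convention that rows are true labels, columns are predictions, and labels are sorted alphabetically (`no` at index 0, `yes` at index 1).

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
# With alphabetical label order, index 0 is "no" and index 1 is "yes".
cm = [[28859, 399],
      [1049, 2643]]

tn, fp = cm[0]  # true "no": predicted "no", predicted "yes"
fn, tp = cm[1]  # true "yes": predicted "no", predicted "yes"

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_yes = tp / (tp + fp)  # share of predicted "yes" that are correct
recall_yes = tp / (tp + fn)     # share of actual "yes" that are found

print(f"samples={total}")
print(f"accuracy={accuracy:.4f}")
print(f"precision(yes)={precision_yes:.4f}")
print(f"recall(yes)={recall_yes:.4f}")
```

Because the classes are imbalanced, precision and recall for the `yes` class say more about the model than accuracy alone.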


## Publish and run from REST endpoint

Publishing a pipeline makes it available for others to run. Here we use the Python SDK to publish our pipeline.

The following code publishes the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline, including run history and durations. You can also run the pipeline manually from the portal.

In addition, publishing a pipeline exposes an HTTP endpoint that allows other services, including external ones, to interact with the Azure ML pipeline.

In [27]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing Train", 
    description="Training bankmarketing pipeline", 
    version="1.0")

published_pipeline

Name,Id,Status,Endpoint
Bankmarketing Train,667bdb1c-539c-4ba3-bedc-31e8f53e3971,Active,REST Endpoint


Authenticate once again to retrieve the `auth_header` so that the endpoint can be used.

In [28]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

Get the REST URL from the `endpoint` property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header, and add a JSON payload object with the experiment name.

Make the request to trigger the run. Access the `Id` key of the response dictionary to get the run id.

In [29]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )
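To see what such a request contains without contacting any service, the sketch below builds an equivalent POST request with the standard library and inspects it instead of sending it. The endpoint URL and token are placeholders, and the `Authorization: Bearer ...` header shape is an assumption about what `get_authentication_header()` returns.

```python
import json
import urllib.request

# Hypothetical endpoint and token, for illustration only.
rest_endpoint = "https://example.invalid/pipelines/PIPELINE_ID"
auth_header = {"Authorization": "Bearer <access-token>"}

# Same payload as in the cell above, encoded as a JSON request body.
payload = json.dumps({"ExperimentName": "pipeline-rest-endpoint"}).encode("utf-8")

request = urllib.request.Request(
    rest_endpoint,
    data=payload,
    headers={**auth_header, "Content-Type": "application/json"},
    method="POST",
)

# Inspect the request instead of sending it.
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Passing `json=...` to `requests.post`, as done above, performs the same encoding and sets the content type automatically.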

In [30]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  93dce02d-12f2-431e-a7c2-f618ed20312c


Use the run id to monitor the status of the new run. This will look similar to the previous pipeline run, so if you don't need to see another pipeline run, you can skip watching the full output.

In [31]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …