# Automated Machine Learning 
**Continuous retraining using Pipelines**


## Introduction
In this example we use AutoML and Pipelines to enable continuous retraining of a model based on updates to the training dataset. We will create two pipelines:
* The first pipeline demonstrates a training dataset that gets updated over time.
* The second pipeline uses a pipeline `Schedule` to trigger continuous retraining.

In this notebook you will learn how to:
* Create an Experiment in an existing Workspace.
* Configure AutoML using AutoMLConfig.
* Create a data ingestion pipeline to update a dataset.
* Create a training pipeline to prepare data, run AutoML, register the model, and set up pipeline triggers.

## Setup
As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
import logging

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.
If needed, run `!pip install --upgrade --upgrade-strategy eager azureml-sdk`. If you are running on an AMLS compute instance, it is probably better to just rebuild the compute node.

In [None]:
print("This notebook was created using version 1.18.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

Accessing the Azure ML workspace requires authentication with Azure.

The default authentication is interactive authentication using the default tenant. Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.

If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
```
If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:
```
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
```
For more details, see aka.ms/aml-notebook-auth
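The three authentication options above can be summarized as a small decision rule. The sketch below is only an illustration: the environment variable names (`AML_TENANT_ID`, `AML_APP_ID`, `AML_APP_SECRET`) and the helper `choose_auth_mode` are hypothetical and not part of the Azure ML SDK. It returns a label describing which of the options described above would apply.

```python
import os


def choose_auth_mode(env=None):
    """Pick an authentication mode from (hypothetical) environment variables.

    Returns one of: "service_principal", "interactive_with_tenant",
    "interactive_default".
    """
    env = os.environ if env is None else env
    tenant = env.get("AML_TENANT_ID")    # hypothetical variable name
    app_id = env.get("AML_APP_ID")       # hypothetical variable name
    secret = env.get("AML_APP_SECRET")   # hypothetical variable name

    if tenant and app_id and secret:
        # Non-interactive environments: ServicePrincipalAuthentication
        return "service_principal"
    if tenant:
        # Multiple tenants: InteractiveLoginAuthentication(tenant_id=...)
        return "interactive_with_tenant"
    # Default: plain Workspace.from_config()
    return "interactive_default"
```

In a notebook you would branch on the returned label and build the matching `auth` object as shown in the snippets above.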

In [None]:
ws = Workspace.from_config()
output = {}
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
print(output)

In [None]:
dstor = ws.get_default_datastore()

# Choose a name for the run history container in the workspace.
experiment_name = 'ar-factoring-2class-autoretrain'
experiment = Experiment(ws, experiment_name)

output['Run History Name'] = experiment_name
# None means "no truncation"; newer pandas versions no longer accept -1 here
pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

If you look in your workspace the experiment is not yet created.  

## Compute 

#### Create or Attach existing AmlCompute

You will need to create a compute target for your AutoML run, or use an existing one.

**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute cluster with that name is already in your workspace, this code will skip the creation process.


In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster, or use the existing compute cluster
amlcompute_cluster_name = "automl"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

## Run Configuration

In [None]:
from azureml.core.runconfig import CondaDependencies, RunConfiguration

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target

conda_run_config.environment.docker.enabled = True

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]', 'applicationinsights', 'azureml-opendatasets', 'azureml-defaults'], 
                              conda_packages=['numpy==1.16.2'], 
                              pin_sdk_version=False)
conda_run_config.environment.python.conda_dependencies = cd

print('run config is ready')

## Data Ingestion Pipeline 
For this lab, we simply pull a copy of the existing data directly from our GitHub repo, overwriting the existing data. In the real world we would pull the latest data into the registered dataset. Making a copy of the data is sufficient because registering the copy records when the data was last updated. We can use that information later to decide whether to start a retraining run.

In the next cell we have a small Python program that copies the data to the existing dataset/datastore. A step in an AMLS pipeline needs a Python script, so the first line of the cell (`%%writefile`) writes the cell contents to a script file for us. Change any variables you need and run the cell. You should see that `upload_latest_data.py` is updated.
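If you need to generate the script outside a notebook, where the `%%writefile` magic is not available, the same effect can be sketched in plain Python. The helper name `write_script` is hypothetical; it only creates the folder if needed and writes the given source text to a file.

```python
from pathlib import Path


def write_script(folder, filename, source):
    """Write `source` to `folder/filename`, creating the folder if needed."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    script_path = target_dir / filename
    script_path.write_text(source, encoding="utf-8")
    return script_path
```

For example, `write_script('./ar-pipeline', 'upload_latest_data.py', source_text)` would produce the same file as the notebook cell, given the cell body as `source_text`.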

Let's create a subfolder just for this script file.  

In [None]:
import os

project_folder = './ar-pipeline'

# create the project folder if it does not exist yet
os.makedirs(project_folder, exist_ok=True)

In [None]:
%%writefile $project_folder/upload_latest_data.py

# vars to change
web_paths = ['https://raw.githubusercontent.com/davew-msft/MLOps-E2E/master/Lab43/WA_Fn-UseC_-Accounts-Receivable.csv']
# the name of your dataset in AMLS
ds_name = 'ar-factoring-2class'

import argparse
import os
from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
import traceback
from azureml.core import Dataset
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
from azureml.opendatasets import NoaaIsdWeather

run = Run.get_context()
ws = None
if type(run) == _OfflineRun:
    ws = Workspace.from_config()
else:
    ws = run.experiment.workspace


parser = argparse.ArgumentParser("split")
parser.add_argument("--descr", help="the descr has to be updated or the dataset will not be re-downloaded")
args = parser.parse_args()

print("Argument 1(descr): %s" % args.descr)
descr = args.descr or "default descr"

ar_ds = Dataset.Tabular.from_delimited_files(path=web_paths)
# create a new version of our dataset
ar_ds = ar_ds.register(workspace = ws,
                                 name = ds_name,
                                 description = descr,
                                 create_new_version = True)


Let's see where the file was written.  We will use this file as the code for the AMLS pipeline next.  Also, let's test that the .py file actually works!

In [None]:
!ls ./ar-pipeline/ -alF

In [None]:
%run ./ar-pipeline/upload_latest_data.py  --descr "Get Latest1"

If it worked we should see a new version of our dataset in the AMLS dataset UI.  Go check this now.


### Upload Data Step
The data ingestion pipeline has a single step with a script to get the latest data and upload it to our dataset/datastore as a new version. 

In [None]:
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep
from datetime import datetime  # datetime.now() is used when submitting the pipeline

ds_descr = PipelineParameter(name="descr", default_value="This is the default descr")
upload_data_step = PythonScriptStep(script_name="upload_latest_data.py", 
                                         allow_reuse=False,
                                         name="upload_latest_data",
                                         arguments=["--descr", ds_descr],
                                         compute_target=compute_target, 
                                         runconfig=conda_run_config,
                                         source_directory=project_folder)

If you look in AMLS, neither the pipeline nor the experiment has been created yet. We need to submit it first.

### Submit Pipeline Run

In [None]:
# this changes the description which is enough to cause a reload of the dataset
latest = datetime.now().strftime("%Y/%m/%d %H:%M:%S")

data_pipeline = Pipeline(
    description="pipeline to upload latest AR data",
    workspace=ws,    
    steps=[upload_data_step])
data_pipeline_run = experiment.submit(data_pipeline, pipeline_parameters={"descr":latest})

Now you can either monitor the run from the AMLS portal or wait for it in the next cell.

In [None]:
data_pipeline_run.wait_for_completion(show_output=True)

## Training Pipeline
### Prepare Training Data Step

This script checks whether new data is available since the model was last trained. If no new data is available, we cancel the remaining pipeline steps. We need to set the `allow_reuse` flag to `False` so that the step runs even when its inputs don't change. We also need the name of the model to look up when it was last trained.

First, like above, we need to create a `check_data.py` script that our Pipeline will use.  We'll build that using the tricks above, in our project folder.
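The decision the script makes can be sketched without any Azure ML objects. This is a minimal illustration, not the script itself: the helper name `needs_retraining` is hypothetical, and treating the timestamp stored in the dataset description as UTC is an assumption made so that it can be compared with a timezone-aware training time.

```python
from datetime import datetime, timezone

# Format the ingestion pipeline uses when it writes the dataset description
DESCRIPTION_FORMAT = "%Y/%m/%d %H:%M:%S"


def needs_retraining(description, last_train_time):
    """Return True when the dataset looks newer than the last trained model."""
    try:
        changed = datetime.strptime(description, DESCRIPTION_FORMAT)
    except (TypeError, ValueError):
        # Description is missing or not a timestamp: retrain anyway
        return True
    # Assumption: interpret the stored timestamp as UTC
    changed = changed.replace(tzinfo=timezone.utc)
    return changed > last_train_time
```

When no model has been registered yet, passing `datetime.min.replace(tzinfo=timezone.utc)` as `last_train_time` makes any valid timestamp count as new data.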

In [None]:
%%writefile $project_folder/check_data.py

# vars to change

import argparse
import os
import azureml.core
from datetime import datetime
import pandas as pd
import pytz
from azureml.core import Dataset, Model
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace

run = Run.get_context()
ws = None
if type(run) == _OfflineRun:
    ws = Workspace.from_config()
else:
    ws = run.experiment.workspace

print("Check for new data.")

parser = argparse.ArgumentParser("split")
parser.add_argument("--ds_name", help="input dataset name")
parser.add_argument("--model_name", help="name of the deployed model")

args = parser.parse_args()

print("Argument 1(ds_name): %s" % args.ds_name)
print("Argument 2(model_name): %s" % args.model_name)

# Get the latest registered model
try:
    model = Model(ws, args.model_name)
    last_train_time = model.created_time
    print("Model was last trained on {0}.".format(last_train_time))
except Exception as e:
    print("Could not get last model train time.")
    last_train_time = datetime.min.replace(tzinfo=pytz.UTC)

try: 
    train_ds = Dataset.get_by_name(ws, args.ds_name)
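    # The ingestion pipeline stores the update time in the dataset description.
    # Make it timezone-aware (assumed UTC) so it can be compared with model.created_time.
    description_format = "%Y/%m/%d %H:%M:%S"
    dataset_changed_time = datetime.strptime(train_ds.description, description_format).replace(tzinfo=pytz.UTC)
    print("Data was last updated on {0}.".format(dataset_changed_time))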
    if not dataset_changed_time > last_train_time:
        print("Cancelling run since there is no new data.")
        run.parent.cancel()
    else:
        # New data is available since the model was last trained
        print("Dataset was last updated on {0}. Retraining...".format(dataset_changed_time))
except Exception as e:
    print("Date not in the format we were expecting, do a re-train anyway.")



Now, let's test the script above, just to make sure it works.

In [None]:
# vars to change
# since we used automl the name is probably a little goofy.  Don't include the :1 which is the version indicator
registered_model_name = "AutoMLb9be0a22f28"
dataset_name = "ar-factoring-2class"
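If you copied the model identifier from the portal in the `name:version` form, a small helper can separate the two parts. The function name `split_model_id` is hypothetical and only illustrates the naming convention mentioned in the comment above.

```python
def split_model_id(model_id):
    """Split 'name:version' into (name, version); version is None if absent."""
    name, sep, version = model_id.partition(":")
    return name, (int(version) if sep else None)
```

Only the name part should be assigned to `registered_model_name`.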

In [None]:
%run $project_folder/check_data.py  --ds_name $dataset_name --model_name $registered_model_name

In [None]:
from azureml.pipeline.core import PipelineData

# The model name with which to register the trained model in the workspace.
model_name = PipelineParameter("model_name", default_value=registered_model_name)

In [None]:
data_prep_step = PythonScriptStep(script_name="check_data.py", 
                                         allow_reuse=False,
                                         name="check_data",
                                         arguments=["--ds_name", dataset_name,
                                                    "--model_name", registered_model_name],
                                         compute_target=compute_target, 
                                         runconfig=conda_run_config,
                                         source_directory=project_folder)

In [None]:
from azureml.core import Dataset
train_ds = Dataset.get_by_name(ws, dataset_name)
target_column_name="LatePayment"

In [None]:
# create an automl step for the pipeline
from azureml.train.automl import AutoMLConfig
from azureml.pipeline.steps import AutoMLStep

automl_settings = {
    "iteration_timeout_minutes": 3,
    "experiment_timeout_hours": 0.15,
    "n_cross_validations": 3,
    "primary_metric": 'accuracy',
    "max_concurrent_iterations": 3,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
    "enable_early_stopping": True
}

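automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = ".",
                             compute_target=compute_target,
                             training_data = train_ds,
                             label_column_name = target_column_name,
                             **automl_settings
                            )

# The register-model step and the training pipeline below reference `model_data`
# and `automl_step`, so define the AutoML outputs and the step here.
# The step and output names are illustrative choices.
from azureml.pipeline.core import PipelineData, TrainingOutput

metrics_data = PipelineData(name='metrics_data',
                            datastore=dstor,
                            pipeline_output_name='metrics_output',
                            training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                          datastore=dstor,
                          pipeline_output_name='best_model_output',
                          training_output=TrainingOutput(type='Model'))

automl_step = AutoMLStep(name='automl_module',
                         automl_config=automl_config,
                         outputs=[metrics_data, model_data],
                         allow_reuse=False)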


### Register Model Step
Script to register the model to the workspace. 

In [None]:
%%writefile $project_folder/register_model.py
# we need to build a py script to register our model in the pipeline
from azureml.core.model import Model, Dataset
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name")
parser.add_argument("--model_path")
parser.add_argument("--ds_name")
args = parser.parse_args()

print("Argument 1(model_name): %s" % args.model_name)
print("Argument 2(model_path): %s" % args.model_path)
print("Argument 3(ds_name): %s" % args.ds_name)

run = Run.get_context()
ws = None
if type(run) == _OfflineRun:
    ws = Workspace.from_config()
else:
    ws = run.experiment.workspace

train_ds = Dataset.get_by_name(ws, args.ds_name)
datasets = [(Dataset.Scenario.TRAINING, train_ds)]

# Register model with training dataset

model = Model.register(workspace=ws,
                       model_path=args.model_path,
                       model_name=args.model_name,
                       datasets=datasets)

print("Registered version {0} of model {1}".format(model.version, model.name))

In [None]:
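# Expose the dataset name as a pipeline parameter so it can be overridden at submit time
ds_name = PipelineParameter(name="ds_name", default_value=dataset_name)

register_model_step = PythonScriptStep(script_name="register_model.py",
                                       name="register_model",
                                       allow_reuse=False,
                                       arguments=["--model_name", model_name, "--model_path", model_data, "--ds_name", ds_name],
                                       inputs=[model_data],
                                       compute_target=compute_target,
                                       runconfig=conda_run_config,
                                       source_directory=project_folder)  # register_model.py lives in the project folder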

### Submit Pipeline Run

In [None]:
training_pipeline = Pipeline(
    description="training_pipeline",
    workspace=ws,    
    steps=[data_prep_step, automl_step, register_model_step])

In [None]:
training_pipeline_run = experiment.submit(training_pipeline, pipeline_parameters={
        "ds_name": dataset_name, "model_name": registered_model_name})

In [None]:
training_pipeline_run.wait_for_completion(show_output=False)

### Publish Retraining Pipeline and Schedule
Once we are happy with the pipeline, we can publish the training pipeline to the workspace and create a schedule to trigger on blob change. The schedule polls the blob store where the data is being uploaded and runs the retraining pipeline if there is a data change. A new version of the model will be registered to the workspace once the run is complete.
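The `polling_interval` of a datastore-based schedule is given in minutes, so the value 1440 used in this notebook means the blob store is checked about once a day. The sketch below is a simplified model that assumes polls happen at fixed intervals from a known first poll; the helper `first_poll_at_or_after` is hypothetical and only illustrates how long a data change may wait before it is noticed.

```python
from datetime import datetime, timedelta

POLLING_INTERVAL_MINUTES = 1440  # value used for polling_interval in this notebook
POLL_INTERVAL = timedelta(minutes=POLLING_INTERVAL_MINUTES)


def first_poll_at_or_after(change_time, first_poll, interval=POLL_INTERVAL):
    """Return the first poll instant that can observe a change at change_time."""
    if change_time <= first_poll:
        return first_poll
    elapsed = change_time - first_poll
    # Ceiling division on timedeltas: number of whole intervals needed
    polls_needed = -(-elapsed // interval)
    return first_poll + polls_needed * interval
```

Under this model, a change made six hours after a poll is picked up 18 hours later; a shorter `polling_interval` reduces that delay at the cost of more frequent polling.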

In [None]:
pipeline_name = "Retraining-Pipeline-AR-Factoring"

published_pipeline = training_pipeline.publish(
    name=pipeline_name, 
    description="Pipeline that retrains AutoML model")

published_pipeline

In [None]:
from azureml.pipeline.core import Schedule
schedule = Schedule.create(workspace=ws, name="RetrainingSchedule",
                           pipeline_parameters={"ds_name": dataset_name, "model_name": registered_model_name},
                           pipeline_id=published_pipeline.id, 
                           experiment_name=experiment_name, 
                           datastore=dstor,
                           wait_for_provisioning=True,
                           polling_interval=1440)

## Test Retraining
Here we set up the data ingestion pipeline to run on a schedule, to verify that the retraining pipeline runs as expected.



In [None]:
pipeline_name = "DataIngestion-Pipeline-AR Factoring Dataset"

published_pipeline = data_pipeline.publish(
    name=pipeline_name, 
    description="Pipeline that updates AR Factoring Dataset")

published_pipeline

In [None]:
from azureml.pipeline.core import Schedule
schedule = Schedule.create(workspace=ws, name="RetrainingSchedule-DataIngestion",
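                           # "descr" is the only parameter the ingestion pipeline defines;
                           # the value is fixed when the schedule is created
                           pipeline_parameters={"descr": latest},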
                           pipeline_id=published_pipeline.id, 
                           experiment_name=experiment_name, 
                           datastore=dstor,
                           wait_for_provisioning=True,
                           polling_interval=1440)