# Azure Machine Learning Responsible AI MLOps

## Overview

For more information, read the [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) overview or the [readme article](../README.md) on Azure Machine Learning Pipelines.
 

This notebook shows the construction of a machine learning service **pipeline** with MLOps techniques that runs jobs unattended in different compute clusters.

## Prerequisites and Azure Machine Learning Basics

Before anything else, set up a working config file that contains information about your workspace, subscription ID, and so on, located at:

- './notebooks/notebook-settings/config.json' 
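
For reference, the workspace config file read by `Workspace.from_config` is a small JSON document with three keys. The following is a minimal sketch that writes such a file; the helper name `write_workspace_config` and all values passed to it are placeholders assumed for illustration, not part of this project.

```python
import json
from pathlib import Path


def write_workspace_config(path, subscription_id, resource_group, workspace_name):
    """Write a workspace config file (hypothetical helper, placeholder values)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(path)
    # Create the settings folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return config
```

Call it once with your own subscription ID, resource group, and workspace name, pointing `path` at the location listed above.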

## Imports

In [None]:
import os
import sys
import requests
import azureml.core
import warnings
warnings.filterwarnings('ignore')

from azureml.widgets import RunDetails
from azureml.core.compute import AmlCompute
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import PipelineEndpoint
from azureml.pipeline.core import PipelineData, TrainingOutput
from azureml.core import Workspace, Run, Experiment, Datastore, Dataset
from azureml.core.runconfig import RunConfiguration, CondaDependencies
from azureml.core.authentication import InteractiveLoginAuthentication


## Pipeline-specific SDK imports

Here, we import key pipeline modules, whose use will be illustrated in the examples below.

In [None]:
from azureml.data import DataType
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData, StepSequence, TrainingOutput
from azureml.pipeline.steps import PythonScriptStep, AutoMLStep
from azureml.pipeline.core import PublishedPipeline, PipelineRun
from azureml.pipeline.core.graph import PipelineParameter

print("Pipeline SDK-specific imports completed")

### Initialize Workspace

Now you're ready to connect to your workspace using the Azure ML SDK.

Note: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [None]:
ws = Workspace.from_config("../notebooks-settings/config.json")
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

## Get the Default datastore (Azure Blob storage)

### Datastore concepts
A [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore%28class%29) is a place where data can be stored and then made accessible to a compute target, either by mounting the data or by copying it to the compute target.

A Datastore can be backed by either Azure Blob Storage (the workspace default, named workspaceblobstore) or Azure File Storage.

In the next steps, we will upload the dataset files to the Blob storage associated with the workspace. When to use [Azure Blobs](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction), [Azure Files](https://docs.microsoft.com/en-us/azure/storage/files/storage-files-introduction), or [Azure Disks](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/managed-disks-overview) is [detailed here](https://docs.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks).

**Please take good note of the concept of the datastore.**

Now we will get the Blob storage associated with the workspace.

The following call retrieves the Azure Blob store associated with your workspace.

Note that workspaceblobstore is **the name of this store; it CANNOT BE CHANGED and must be used as is**.

Unless the workspace default has been changed, Datastore(ws, "workspaceblobstore") is equivalent to simply Datastore(ws), which returns the default datastore.

In [None]:
def_blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(def_blob_store.name))

## Required data and script files for the tutorial
Sample files required to finish this tutorial are already copied to the project folder. The .py files provided in the samples don't contain much "ML work", although as a data scientist you will work on such scripts extensively. To complete this tutorial, the contents of these files are not very important. The one-line files are for demonstration purposes only.

#### Upload data to default datastore
The workspace's default datastore is the Azure Blob storage; the workspace also has an Azure File storage associated with it. In this notebook we upload the dataset files to the Blob storage.

get_default_datastore() gets the default datastore associated with your workspace. Here we reuse the def_blob_store object we obtained earlier.

Parameters of upload_dataset:

1. Workspace object
2. Blob Storage Datastore object
3. Azure Dataset name
4. Target path on the datastore
5. Local path of the dataset
6. Whether this dataset will be used by the Data Drift detector (True/False)
7. Dataset type

In the next cells we'll upload:

1. The complete dataset about Parkinson's patients, stored in four CSV files. This dataset contains information about factors relevant to Parkinson's disease, used to predict whether a patient needs treatment or not.

In [None]:
upload_dataset(ws, def_blob_store, 'parkinson',
                                  'parkinson/acc_final_v3_0.csv', "../../dataset/acc_final_v3_0.csv",
                                  use_datadrift=False, type_dataset="Standard")

In [None]:
upload_dataset(ws, def_blob_store, 'parkinson',
                                  'parkinson/acc_final_v3_1.csv', "../../dataset/acc_final_v3_1.csv",
                                  use_datadrift=False, type_dataset="Standard")

In [None]:
upload_dataset(ws, def_blob_store, 'parkinson',
                                  'parkinson/acc_final_v3_2.csv', "../../dataset/acc_final_v3_2.csv",
                                  use_datadrift=False, type_dataset="Standard")

In [None]:
upload_dataset(ws, def_blob_store, 'parkinson',
                                  'parkinson/acc_final_v3_3.csv', "../../dataset/acc_final_v3_3.csv",
                                  use_datadrift=False, type_dataset="Standard")

#### (Optional) See your files using Azure Portal
Once you have successfully uploaded the files, you can browse to them (or upload more files) using the [Azure Portal](https://portal.azure.com). At the portal, make sure you have selected your subscription (click *Resource Groups* and then select the subscription). Then look for your **Machine Learning Workspace** (it has your *alias* as the name). It has a link to your storage. Click on the storage link. It will take you to a page where you can see [Blobs](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction), [Files](https://docs.microsoft.com/en-us/azure/storage/files/storage-files-introduction), [Tables](https://docs.microsoft.com/en-us/azure/storage/tables/table-storage-overview), and [Queues](https://docs.microsoft.com/en-us/azure/storage/queues/storage-queues-introduction). We have just uploaded the dataset files to the Blob storage, so you should be able to see them under Blobs.

## Compute Targets
A compute target specifies where to execute your program such as a remote Docker on a VM, or a cluster. A compute target needs to be addressable and accessible by you.

**You need at least one compute target to send your payload to. We are planning to use Azure Machine Learning Compute exclusively for this tutorial for all steps. However in some cases you may require multiple compute targets as some steps may run in one compute target like Azure Machine Learning Compute, and some other steps in the same pipeline could run in a different compute target.**

*The examples below show creating/retrieving/attaching to an Azure Machine Learning Compute instance.*

#### List of Compute Targets on the workspace

In [None]:
cts = ws.compute_targets
for ct in cts:
    print(ct)

#### Retrieve or create an Azure Machine Learning compute
Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Azure Machine Learning Compute in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.

If the compute with the given name was not found in the previous cell, we will create it here. We will create an Azure Machine Learning Compute containing **STANDARD_DS3_V2 CPU VMs**, matching the vm_size set in the cell below. This process is broken down into the following steps:

1. Create the configuration
2. Create the Azure Machine Learning compute

**This process takes about 3 minutes and produces only sparse output along the way. Please make sure to wait until the call returns before moving to the next cell.**

In [None]:
aml_compute_name = "aml-compute"
vm_size = "STANDARD_DS3_V2"
aml_compute = get_compute_aml(ws, aml_compute_name, vm_size)
print("Azure Machine Learning Compute attached")
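
The `get_compute_aml` helper is not shown in this notebook. A minimal sketch of the retrieve-or-create pattern it presumably follows is given below; the function name `get_compute_aml_sketch` and the `max_nodes` default are assumptions for illustration, not the project's actual implementation.

```python
def get_compute_aml_sketch(ws, name, vm_size, max_nodes=4):
    """Return the compute target called `name`, creating it if it is missing.

    Hypothetical stand-in for get_compute_aml; max_nodes=4 is an assumed default.
    """
    # ws.compute_targets is a dict that maps names to compute target objects.
    existing = ws.compute_targets
    if name in existing:
        return existing[name]

    # Imported lazily so that the lookup path above has no SDK import cost.
    from azureml.core.compute import AmlCompute, ComputeTarget

    # Step 1: create the configuration.
    config = AmlCompute.provisioning_configuration(
        vm_size=vm_size, min_nodes=0, max_nodes=max_nodes
    )
    # Step 2: create the compute and block until provisioning finishes.
    target = ComputeTarget.create(ws, name, config)
    target.wait_for_completion(show_output=True)
    return target
```

Setting `min_nodes=0` lets the cluster scale down to zero nodes when idle, which is a common choice for tutorial workloads.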

## Create MLOps Pipeline Compute

Pipelines consist of one or more steps, which can be Python scripts, or specialized steps like an Auto ML training estimator or a data transfer step that copies data from one location to another. Each step can run in its own compute context.

In this notebook, you'll build a Responsible AI pipeline that contains the following three steps:

1. **Exploratory Analysis and Preprocessing**
2. **AutoML Training/Evaluation**
3. **Register AutoML Model**

## Create pipeline steps run configs

In this notebook, you'll use the same compute for all steps, but it's important to realize that each step runs independently, so you could specify a different compute context for each step if appropriate.

First, get the compute target you created in a previous cell (if it doesn't exist, it will be created).
Each step of the pipeline needs certain dependencies in order to execute and produce good results. For this reason, we define the run configurations separately and look at the dependencies each step needs.

#### Run Config - Analysis and preprocessing step

In [None]:
run_preprocessing_config = RunConfiguration(conda_dependencies=CondaDependencies.create(
        conda_packages=['numpy==1.18.1', 'pandas==1.0.4'],
        pip_packages=['azureml-sdk', 'matplotlib==3.1.3', 'seaborn==0.10.0', 'sklearn==0.0', 'pandas-profiling==2.8.0'])
    )
run_preprocessing_config.environment.docker.enabled = True

#### Run Config - Register Model step

In [None]:
run_automl_config = RunConfiguration(conda_dependencies=CondaDependencies.create(
        conda_packages=[],
        pip_packages=['azureml-sdk', 'azureml-train-automl-runtime==1.8.0.post1', 'xgboost==1.1.1'])
    )
run_automl_config.environment.docker.enabled = True

## Creating Steps in a Pipeline
A Step is a unit of execution. A step typically needs a target of execution (compute target) and a script to execute; it may require script arguments and inputs, and it can produce outputs. A step can also take a number of other parameters. Azure Machine Learning Pipelines provides the following built-in Steps:

- [**PythonScriptStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py): Add a step to run a Python script in a Pipeline.

- [**AutoML Step**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.automl_step.automlsteprun?view=azure-ml-py): Creates an AutoML step to manage, check status, and retrieve run details once an automated ML run is submitted in a pipeline.

- [**AdlaStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.adla_step.adlastep?view=azure-ml-py): Adds a step to run U-SQL script using Azure Data Lake Analytics.

- [**DataTransferStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.data_transfer_step.datatransferstep?view=azure-ml-py): Transfers data between Azure Blob and Data Lake accounts.

- [**DatabricksStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.databricks_step.databricksstep?view=azure-ml-py): Adds a Databricks notebook as a step in a Pipeline.

- [**HyperDriveStep**](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.hyper_drive_step.hyperdrivestep?view=azure-ml-py): Creates a HyperDrive step for hyperparameter tuning in a Pipeline.

The code in the following sections creates PythonScriptSteps that run on the Azure Machine Learning Compute we created above, using preprocessing.py and register_model.py, two of the files already made available in the project folder.

A **PythonScriptStep** is a basic, built-in step to run a Python script on a compute target. It takes a script name and, optionally, other parameters such as arguments for the script, the compute target, inputs, and outputs. If no compute target is specified, the default compute target for the workspace is used.

## Configure PipelineData

Here, we configure some outputs on the pipeline. The first output (preprocess_step_output) will be between **Exploratory Analysis and Preprocessing** and **AutoML Training/Evaluation**.

In [None]:
preprocess_step_output = PipelineData("hd_preprocessed", datastore=def_blob_store).as_dataset()

### Convert Dataset Columns to PipelineDataType

In [None]:
data_dict = convert_dataset_columns(ws,'./scripts/schema_dataset.json')

## Pipeline Parameters Builder

In [None]:
ppb = PipelineParameterBuilder("../utils/params_config/pipeline_parameters.json")
ppb.build()
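
The `PipelineParameterBuilder` class is a project helper whose source is not shown here. As a rough sketch of the idea, the class below reads a JSON file and turns each entry into a pipeline parameter object. The class name, the flat name-to-default JSON layout, and the `factory` argument are all assumptions for illustration; only the `PipelineParameter(name=..., default_value=...)` call is the SDK's own.

```python
import json


class PipelineParameterBuilderSketch:
    """Hypothetical stand-in for PipelineParameterBuilder (assumed JSON layout)."""

    def __init__(self, config_path, factory=None):
        self.config_path = config_path
        # The factory is injectable so the class can be tried without the SDK.
        self.factory = factory or self._default_factory
        self.parameters = {}

    @staticmethod
    def _default_factory(name, default_value):
        # Imported lazily; only needed when real PipelineParameter objects are built.
        from azureml.pipeline.core.graph import PipelineParameter
        return PipelineParameter(name=name, default_value=default_value)

    def build(self):
        # Assumed file layout: {"parameter_name": default_value, ...}
        with open(self.config_path) as handle:
            defaults = json.load(handle)
        self.parameters = {
            name: self.factory(name, value) for name, value in defaults.items()
        }
        return self

    def get_pipeline_parameter_obj(self, name):
        # Raises KeyError if the parameter was not declared in the JSON file.
        return self.parameters[name]
```

Keeping parameter defaults in a JSON file means they can be changed without editing the notebook, and the same names can later be overridden when the published pipeline is invoked.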

## Configure AutoML 

In [None]:
automl_settings = {
    "max_concurrent_iterations": 3,
    "primary_metric" : 'AUC_weighted',
    "featurization": 'auto',
    'model_explainability': True,
    'iterations': 1,
    'validation_size': 0.3,
    'enable_early_stopping': True,
    'label_column_name': 'Tremor',
    'enable_onnx_compatible_models': True,
    'task': 'classification'
}
automl_config = AutoMLConfig(compute_target=aml_compute,
                             training_data = preprocess_step_output.parse_delimited_files(set_column_types=data_dict),
                             path = ".",
                             **automl_settings
                            )

In [None]:
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=def_blob_store,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=def_blob_store,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

## Running Pipeline

Now you're ready to create and run a pipeline. First you need to define the steps for the pipeline and any data references that need to be passed between them. The PipelineData and parameter configurations were created in the previous cells.

In [None]:
project_folder = './scripts'

In [None]:
deploy_folder = '../deployment'

In [None]:
preprocessing_step = PythonScriptStep(name="Exploratory Analysis and Preprocessing",
                         script_name="preprocessing.py",
                         compute_target=aml_compute, 
                         source_directory=project_folder,
                         arguments=[
                              "--datastore", ppb.get_pipeline_parameter_obj('datastore_preprocessing_step'),
                              "--dataset_name", ppb.get_pipeline_parameter_obj('dataset_name_preprocessing_step'),
                              "--dataset_preprocessed_name", ppb.get_pipeline_parameter_obj('dataset_preprocessed_name_preprocessing_step'),
                              "--output_preprocess_dataset", preprocess_step_output,
                         ],
                         outputs=[preprocess_step_output],
                         runconfig=run_preprocessing_config,
                         allow_reuse=False)

automl_step = AutoMLStep(
                name='AutoML_training',
                automl_config=automl_config,
                inputs = [preprocess_step_output],
                outputs=[metrics_data, model_data],
                passthru_automl_config=False,
                allow_reuse=False)

register_model_step = PythonScriptStep(script_name="register_model.py",
                                  source_directory=project_folder,
                                  name="Register Model",
                                  arguments=["--model_name", ppb.get_pipeline_parameter_obj('model_name_register_model_step'),
                                             "--model_data", model_data,
                                             "--metrics_data", metrics_data,
                                             "--dataset_name", ppb.get_pipeline_parameter_obj('dataset_name_register_model_step')],
                                  inputs=[model_data, metrics_data],
                                  compute_target=aml_compute,
                                  runconfig=run_automl_config,
                                  allow_reuse=False)

steps = [preprocessing_step, automl_step,register_model_step]
print("Steps created")
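
Each PythonScriptStep passes its `arguments` list to the script on the command line. The contents of register_model.py are not shown in this notebook, so the following is only a sketch of how such a script might read the four arguments passed by the "Register Model" step; the function name and help texts are assumptions.

```python
import argparse


def parse_register_args(argv=None):
    """Parse the arguments the 'Register Model' step passes (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Register the best AutoML model")
    parser.add_argument("--model_name", required=True,
                        help="Name under which the model is registered")
    parser.add_argument("--model_data", required=True,
                        help="Path where the AutoML model output is made available")
    parser.add_argument("--metrics_data", required=True,
                        help="Path where the AutoML metrics output is made available")
    parser.add_argument("--dataset_name", required=True,
                        help="Name of the dataset used for training")
    # argv=None makes argparse read sys.argv, which is what a step script needs.
    return parser.parse_args(argv)
```

Inside the script, PipelineData arguments such as `--model_data` arrive as paths on the compute target, while PipelineParameter arguments arrive as their resolved values.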

### Allow reuse

**Note:** The flag *allow_reuse* determines whether the step should reuse previous results when run with the same settings/inputs. This flag's default value is *True*, because when inputs and parameters have not changed, we typically do not want to re-run a given pipeline step.

If *allow_reuse* is set to *False*, a new run will always be generated for this step during pipeline execution. Leaving the flag at *True* comes in handy in situations where you do *not* want to re-run an unchanged pipeline step. In this notebook every step sets *allow_reuse=False*, so each pipeline run executes all steps again.

## Build the pipeline
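Once we have the steps (or steps collection), we can build the [pipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py). By default, steps that have no data dependencies between them run in **parallel** once we submit the pipeline for run. The three steps in this notebook are linked through their PipelineData inputs and outputs, so they run one after another.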

A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment%28class%29?view=azure-ml-py#submit). When submit is called, a [PipelineRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created which in turn creates [StepRun](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow.

In [None]:
pipeline = Pipeline(workspace=ws, steps=steps)
print ("Pipeline is built")
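
Steps that exchange data, as the three steps here do through PipelineData, are ordered by those dependencies. The sketch below illustrates the idea with the standard library only (Python 3.9+); it is not how the SDK is implemented, and the dictionary layout is an assumption made for the example. The step and output names are taken from the cells above.

```python
from graphlib import TopologicalSorter

# Each step lists the named outputs it consumes and produces.
steps = {
    "Exploratory Analysis and Preprocessing": {
        "inputs": [], "outputs": ["hd_preprocessed"]},
    "AutoML_training": {
        "inputs": ["hd_preprocessed"], "outputs": ["metrics_data", "model_data"]},
    "Register Model": {
        "inputs": ["model_data", "metrics_data"], "outputs": []},
}


def execution_order(steps):
    """Order steps so that every producer runs before its consumers."""
    # Map each output name to the step that produces it.
    producer = {out: name for name, spec in steps.items() for out in spec["outputs"]}
    # For every step, collect the set of steps it depends on.
    graph = {name: {producer[i] for i in spec["inputs"]} for name, spec in steps.items()}
    # static_order raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

Steps with no path between them in this graph could run at the same time; here every step depends on the previous one, so the order is strictly sequential.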

## Validate the pipeline
You have the option to [validate](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py#validate) the pipeline prior to submitting it for run. The platform runs validation steps, such as checking for circular dependencies and validating parameters, even if you do not explicitly call the validate method.

In [None]:
pipeline.validate()
print("Pipeline validation complete")

## Publish the pipeline to use via REST
Once you are satisfied with the results of your experiment, you may want to publish the pipeline to get a REST endpoint so the pipeline can be invoked later.

In [None]:
published_training_pipeline = pipeline.publish(name="parkinson_pipeline",
                                            description="This pipeline train a model to detect parkinson disease")
print("The published pipeline ID is {}".format(published_training_pipeline.id))

In [None]:
pipeline_endpoint = PipelineEndpoint.publish(workspace=ws,
                                            name="parkinson_demo_v1",
                                            pipeline=published_training_pipeline,
                                            description="This http pipeline train a model to detect parkinson disease")
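
Once published, the pipeline can be triggered with an authenticated HTTP POST to its REST endpoint. The sketch below shows one way to do this; the helper names, the experiment name, and the parameter name in the usage comment are assumptions for illustration, while the request body keys `ExperimentName` and `ParameterAssignments` follow the documented REST contract for published pipelines.

```python
def build_pipeline_request(experiment_name, parameter_assignments=None):
    """Build the JSON body for a published-pipeline REST call."""
    body = {"ExperimentName": experiment_name}
    if parameter_assignments:
        # Overrides for PipelineParameter defaults, keyed by parameter name.
        body["ParameterAssignments"] = dict(parameter_assignments)
    return body


def trigger_pipeline(endpoint_url, auth_header, body):
    """POST the request and return the ID of the pipeline run that was started."""
    import requests  # imported lazily; only needed when a call is actually made

    response = requests.post(endpoint_url, headers=auth_header, json=body)
    response.raise_for_status()
    return response.json().get("Id")


# Usage sketch (hypothetical experiment and parameter names):
# auth_header = InteractiveLoginAuthentication().get_authentication_header()
# body = build_pipeline_request("parkinson-rest-run", {"dataset_name": "parkinson"})
# run_id = trigger_pipeline(published_training_pipeline.endpoint, auth_header, body)
```

The same body can be sent to `pipeline_endpoint.endpoint`, which always routes to the endpoint's default pipeline version.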