# Azure Machine Learning Pipeline

In this notebook, we will create an Azure Machine Learning pipeline that covers every stage of the machine learning lifecycle:
 
 1. Data Engineering
 2. Model Training
 3. Model Management
 4. Model Deployment (to the same environment)
 
![Data Engineering](./images/00-Pipeline.jpg)

## Problem Statement

### Product Review

We will work with a product review dataset. Each software review belongs to one of 5 classes.

In this multi-class classification problem, we will use scikit-learn to train a multinomial naive Bayes model, and use Azure Machine Learning to create experiments, manage models, and deploy a web service for consumption.

![Sample data](./images/sample_data.png)


## 1. Data Engineering 

**Input** : Raw Data 

**Output** : Registered Data Set (ProductReview)

Registering the data asset will enable you to:

* reuse and share the data asset in future pipelines
* use versions to track modifications to the data asset
* use the data asset from Azure ML designer, which is Azure ML's GUI for pipeline authoring

<img src="../e2e-ds-experience/media/dataset.PNG">

In [None]:
import os
os.makedirs('data_engineering',exist_ok=True)
os.makedirs('train', exist_ok=True)
os.makedirs('model_selection', exist_ok=True)
os.makedirs('model_deploy', exist_ok=True)

In [None]:
%%writefile data_engineering/data_engineering.py
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen
import requests

from azureml.core.run import Run
from azureml.core import Dataset, Datastore, Workspace
from azureml.data.datapath import DataPath

# get run context
run = Run.get_context()

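# Download data from source
url = "https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Software_5.json.gz"
# verify=False skips TLS certificate verification; drop it if the certificate validates in your environment
response = requests.get(url, stream=True, verify=False)
response.raise_for_status()

with open("Software_5.json.gz", "wb") as handle:
    # Write in 8 KB chunks; the default chunk size of 1 byte is very slow
    for chunk in response.iter_content(chunk_size=8192):
        handle.write(chunk)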

# Load the review records (one JSON object per line)
data = []
with gzip.open('Software_5.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))

# Total length of the list, which equals the total number of reviews
print(len(data))

# First row of the list
print(data[0])
df = pd.DataFrame.from_dict(data)

# Replace missing values with empty strings
df3 = df.fillna('')
print(df3.iloc[2])

# register dataset
workspace = run.experiment.workspace
default_datastore = Datastore.get_default(workspace)

ds_name = 'ProductReview'
data_path = DataPath(datastore=default_datastore, path_on_datastore='product_review')

ds = Dataset.Tabular.register_pandas_dataframe(df3, 
                                    default_datastore, 
                                    ds_name, 
                                    description=None, 
                                    tags=None, 
                                    show_progress=True)


![Dataset](./images/01-DataEngineeringOutput.jpb.jpg)
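
The download in the script above is a gzip-compressed JSON Lines file: one JSON object per line. The following is a minimal, standard-library-only sketch of that parsing step. The helper name `read_json_lines` and the two sample reviews are made up for illustration, and the data is built in memory, so nothing is downloaded.

```python
import gzip
import io
import json


def read_json_lines(gz_bytes):
    """Parse gzip-compressed JSON Lines data into a list of dicts."""
    records = []
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Two made-up reviews standing in for the downloaded file (hypothetical data).
sample_rows = [
    {"overall": 5.0, "reviewText": "Works well."},
    {"overall": 1.0, "reviewText": "Crashes on start."},
]
raw = "\n".join(json.dumps(row) for row in sample_rows).encode("utf-8")
sample_gz = gzip.compress(raw)

records = read_json_lines(sample_gz)
print(len(records))
print(records[0]["reviewText"])
```

Opening the file in text mode (`"rt"`) yields `str` lines directly, which keeps the decoding step explicit.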

## 2. Model Training

In this step, we are ready to do some experiments and select the best model for production use.

**Input** : Registered Data Set (ProductReview)

**Output** : Trained Model 

In [None]:
%%writefile train/train_2.py
# General libraries.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import classification_report,plot_confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from azureml.core.run import Run
from sklearn.model_selection import GridSearchCV
from azureml.core import Workspace, Dataset
import matplotlib.pyplot as plt
from joblib import dump
from sklearn.feature_extraction.text import CountVectorizer

run = Run.get_context()

# Get workspace from run context
workspace = run.experiment.workspace

# Load Data
dataset = Dataset.get_by_name(workspace, name='ProductReview')
data = dataset.to_pandas_dataframe()[['overall', 'reviewText']]

# Prepare X & Y
Y = data.pop('overall').to_numpy()
X = data.pop('reviewText').to_numpy()
train_x, test_x, train_y, test_y = train_test_split(X,Y, test_size = 0.1, random_state=1)

# Fit vectorizer
vec = CountVectorizer()
fitted_train_data = vec.fit_transform(train_x)
fitted_test_data = vec.transform(test_x)

# Train a Naive Bayes model
model = MultinomialNB()
params = {'alpha': [1.0e-5, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]}

clf = GridSearchCV(model, params, scoring = "f1_macro", verbose=0, cv = 5)
clf_result = clf.fit(fitted_train_data, train_y)
run.log("Best alpha ",clf_result.best_estimator_.alpha)
pred = clf.predict(fitted_test_data)
run.log("F1", metrics.f1_score(test_y, pred, average='weighted'))

plot_confusion_matrix(clf, fitted_test_data, test_y)  
plt.savefig('confusion_matrix.png')
run.log_image(name='Confusion-Matrix', path='./confusion_matrix.png')

# Save trained model
dump(vec, './vec.pkl')
dump(clf, './mnb.pkl')
run.upload_file(name='vec.pkl', path_or_stream='./vec.pkl')
run.upload_file(name='mnb.pkl', path_or_stream='./mnb.pkl')

![CM](./images/confusion_matrix.png)
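
Note that the training script tunes `alpha` with `scoring = "f1_macro"` but logs the test score with `average='weighted'`. The two averages can differ noticeably when classes are imbalanced. The sketch below computes both in plain Python on made-up labels; it illustrates the definitions and is not scikit-learn's implementation.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical imbalanced labels: eight 5-star reviews, two 1-star reviews.
y_true = [5, 5, 5, 5, 5, 5, 5, 5, 1, 1]
y_pred = [5, 5, 5, 5, 5, 5, 5, 5, 5, 1]

print("macro   :", round(macro_f1(y_true, y_pred), 4))
print("weighted:", round(weighted_f1(y_true, y_pred), 4))
```

On this sample the weighted score is higher than the macro score, because the well-predicted majority class dominates the weighted average.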


## 3. Model Selection

In this step, we will use a predefined metric, **F1**. We will list all of today's runs and select the model with the highest F1 score, which will be registered in the Model Registry and prepared for deployment.

**Input** : Trained Model

**Output** : Registered Model in Model Registry

In [None]:
%%writefile model_selection/model_select.py
import sklearn
from datetime import datetime, date
from azureml.core.run import Run
# from azureml.core import Experiment
# from azureml.core.model import Model

# get run context
run = Run.get_context()
workspace = run.experiment.workspace

# Get Experiment and runs for model select
# In this step, we will use F1
exp = run.experiment # Experiment.list(workspace, experiment_name='MLOps-Workshop')

# startTimeUtc is reported in UTC, so compare against the current UTC date
today = datetime.utcnow().date()

select_run = None
F1 = 0
for r in exp.get_runs():
    run_starttime = datetime.strptime(r.get_details()['startTimeUtc'][:10], '%Y-%m-%d').date()
    if run_starttime==today:
        for step in r.get_children():
            current_step_f1 = step.get_metrics(name='F1')
            if 'F1' in current_step_f1.keys() and F1<current_step_f1['F1']:
                F1=current_step_f1['F1']
                select_run = step
    
if select_run is not None:
    # Register the best model and its vectorizer
    mnb_model = select_run.register_model("ProductReview-NaiveBayes",
                            model_path="./mnb.pkl",
                            )

    vector    = select_run.register_model("ProductReview-CountVector", 
                            model_path="./vec.pkl",
                            )

![Model Register](./images/model_register.jpg)
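
The selection rule in the script above (keep only runs started today, then take the child step with the highest F1) can be tried without a workspace. This is a minimal sketch over plain dictionaries; the function name `pick_best_step`, the dictionary layout, and all run data are hypothetical stand-ins for the Azure ML run objects.

```python
from datetime import date, datetime


def pick_best_step(runs, day):
    """Return (step_name, f1) for the best-scoring step started on `day`, or None."""
    best = None
    for run in runs:
        started = datetime.strptime(run["startTimeUtc"][:10], "%Y-%m-%d").date()
        if started != day:
            continue  # ignore runs from other days
        for step in run["children"]:
            f1 = step["metrics"].get("F1")
            if f1 is not None and (best is None or f1 > best[1]):
                best = (step["name"], f1)
    return best


# Hypothetical run history: two runs on 2024-01-02 and one older run.
runs = [
    {"startTimeUtc": "2024-01-02T08:00:00.000Z",
     "children": [{"name": "train-a", "metrics": {"F1": 0.71}},
                  {"name": "prep-a", "metrics": {}}]},
    {"startTimeUtc": "2024-01-02T15:30:00.000Z",
     "children": [{"name": "train-b", "metrics": {"F1": 0.78}}]},
    {"startTimeUtc": "2024-01-01T09:00:00.000Z",
     "children": [{"name": "train-old", "metrics": {"F1": 0.95}}]},
]

best = pick_best_step(runs, date(2024, 1, 2))
print(best)
```

The older run has the highest F1 overall but is skipped by the date filter, which mirrors the behavior of the selection script.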


## 4. Model Deploy


**Input** : Registered Model in Model Registry

**Output** : Container Instance / Endpoint (Web Service)
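
The scoring script in this step exposes two functions: `init()`, which loads the models once, and `run(data)`, which receives the request body as a JSON string with a `data` key. The sketch below exercises that contract locally. `StubVectorizer` and `StubClassifier` are hypothetical stand-ins for the registered models, so the predictions are meaningless; only the input and output shapes are of interest.

```python
import json


class StubVectorizer:
    """Hypothetical stand-in for the fitted CountVectorizer."""

    def transform(self, texts):
        # Represent each text by its word count.
        return [len(text.split()) for text in texts]


class StubClassifier:
    """Hypothetical stand-in for the fitted classifier."""

    def predict(self, features):
        # Arbitrary rule: longer texts get 5.0, shorter ones 1.0.
        return [5.0 if n > 3 else 1.0 for n in features]


def init():
    # Load the (stub) models once, as the real script does at startup.
    global vec, clf
    vec = StubVectorizer()
    clf = StubClassifier()


def run(data):
    # Same flow as score.py: parse JSON, vectorize, predict, serialize.
    input_data = json.loads(data)["data"]
    pred = clf.predict(vec.transform(input_data))
    return json.dumps(list(pred))


init()
request_body = json.dumps({"data": ["great tool for web design work", "very bad"]})
response_body = run(request_body)
print(response_body)
```

Running the entry script this way before deployment can surface payload-format mistakes without waiting for a container to build.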

In [None]:
%%writefile model_deploy/score.py
import json, os, joblib
from azureml.core.model import Model

def init(): 
  global vec, clf
  print(Model.get_model_path('ProductReview-NaiveBayes'))
  vec = joblib.load(Model.get_model_path('ProductReview-CountVector'))
  clf = joblib.load(Model.get_model_path('ProductReview-NaiveBayes'))

def run(data): 
  input_data = json.loads(data)['data'] 
  fitted_data = vec.transform(input_data)
  pred = clf.predict(fitted_data)
  return json.dumps(pred.tolist())


In [None]:
%%writefile model_deploy/deploy.py
from azureml.core.model import InferenceConfig, Model
from azureml.core import Environment
from azureml.core.run import Run
from azureml.core.webservice import AciWebservice

# get run context
run = Run.get_context()
workspace = run.experiment.workspace

# inference config
service_name = 'product-review-service'
env = Environment.get(workspace=workspace, name="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu")

inference_config = InferenceConfig(entry_script='score.py', 
                            source_directory='.',
                            environment=env)

# deployment config
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

nb_model = Model(workspace, 'ProductReview-NaiveBayes')
vectorizer = Model(workspace, 'ProductReview-CountVector')

service = Model.deploy(
    workspace,
    name = service_name,
    models=[nb_model, vectorizer],
    inference_config= inference_config,
    deployment_config= deployment_config,
    overwrite=True,
)
service.wait_for_deployment(show_output=True)

## 5. Create pipeline

In [None]:
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData
import azureml.core
from azureml.core import Workspace, Environment, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
import os

workspace = Workspace.from_config()
# 1. Get ComputeTarget
aml_compute_target = "cpu-cluster"
try:
    aml_compute = AmlCompute(workspace, aml_compute_target)
    print("found existing compute target.")
except ComputeTargetException:
    print("creating new compute target")
    
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2",
                                                                min_nodes = 1, 
                                                                max_nodes = 4)    
    aml_compute = ComputeTarget.create(workspace, aml_compute_target, provisioning_config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)    
print("Azure Machine Learning Compute attached")

# 2. Get environment
env = Environment.get(workspace=workspace, name="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu")
# create a new runconfig object
runconfig = RunConfiguration()
runconfig.environment = env

# 3. Define steps
dataprep_step = PythonScriptStep(name="prep_data",
                                 script_name="data_engineering.py",
                                 source_directory="data_engineering",
                                 compute_target=aml_compute_target,
                                 runconfig=runconfig)

train_step = PythonScriptStep(name="train",
                              script_name="train_2.py",
                              source_directory="train",
                              compute_target=aml_compute_target,
                              runconfig=runconfig)
train_step.run_after(dataprep_step)

select_step = PythonScriptStep(name="select_model",
                               script_name="model_select.py",
                               source_directory="model_selection",
                               compute_target=aml_compute_target,
                               runconfig=runconfig)
select_step.run_after(train_step)

deploy_step = PythonScriptStep(name="deploy_model",
                               script_name="deploy.py",
                               source_directory="model_deploy",
                               compute_target=aml_compute_target,
                               runconfig=runconfig)
deploy_step.run_after(select_step)


# Run pipeline
experiment_name = 'MLOps-Workshop'
pipeline = Pipeline(workspace=workspace, steps=[deploy_step])
pipeline_run = Experiment(workspace, experiment_name).submit(pipeline)
print("Pipeline is submitted for execution")

![Published endpoint](./images/endpoint.jpg)
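
The pipeline is constructed from `deploy_step` alone; the earlier steps are reached through the `run_after` links. The following sketch shows that idea in plain Python as a depth-first walk over the dependencies. It is an illustration under the assumption of an acyclic graph, not the Azure ML SDK's implementation, and the function name `resolve_order` is hypothetical.

```python
def resolve_order(final_steps, run_after):
    """Return every step reachable from final_steps, predecessors first."""
    order, seen = [], set()

    def visit(step):
        if step in seen:
            return
        seen.add(step)
        for predecessor in run_after.get(step, []):
            visit(predecessor)
        order.append(step)  # appended only after all predecessors

    for step in final_steps:
        visit(step)
    return order


# Step names from this notebook; each maps to the steps it must run after.
run_after = {
    "train": ["prep_data"],
    "select_model": ["train"],
    "deploy_model": ["select_model"],
}

order = resolve_order(["deploy_model"], run_after)
print(order)
```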

### Test Endpoint

In [None]:
import requests, json

# replace uri below with your service endpoint
uri = '...'

headers = {"Content-Type": "application/json"}
comment1 = """
"I've been using Dreamweaver (and it's predecessor Macromedia's UltraDev) for many years.  
 This is a great tool for someone who is a relative novice at web design.  
 If you're a novice, a relative newcomer or just an experienced web 
 designer who wants a refresher course, this is a good way to do it."
"""
sample_input = json.dumps({
    'data': [comment1]
})
response = requests.post(uri, data=sample_input, headers=headers)
print(response.json())

In [None]:
comment2 = """
"It is the worst software I have ever used. Very bad UX"
"""
sample_input = json.dumps({
    'data': [comment2]
})
response = requests.post(uri, data=sample_input, headers=headers)
print(response.json())

### Publish Pipeline

- Can be published as a REST endpoint to run on different inputs and from different clients (Python, C#, Java, ...)
- Can be configured to accept parameters
- Can be versioned
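
When a published pipeline is triggered over REST, the request body is a small JSON document naming the experiment and, optionally, parameter values. The sketch below only builds that body and sends nothing. The helper name `build_run_request` and the `alpha` parameter are hypothetical, since the pipeline in this notebook defines no pipeline parameters.

```python
import json


def build_run_request(experiment_name, parameters=None):
    """Build the JSON body for a published-pipeline run request."""
    body = {"ExperimentName": experiment_name}
    if parameters:
        # Values for pipeline parameters, keyed by parameter name.
        body["ParameterAssignments"] = dict(parameters)
    return body


# "alpha" is a hypothetical pipeline parameter used only for illustration.
body = build_run_request("MLOps-Workshop", {"alpha": 0.1})
print(json.dumps(body, indent=2))

# Sending the request would look roughly like the line below, assuming
# `auth_header` holds a valid Azure AD authentication header:
# requests.post(published_pipeline.endpoint, headers=auth_header, json=body)
```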

In [None]:
published_pipeline = pipeline_run.publish_pipeline(
     name="MLOps-Workshop-Pipeline",
     description="Published Pipeline for MLOps Workshop",
     version="1.0")

## 6. Schedule Pipeline runs

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-trigger-published-pipeline
- Time-based schedules: for routine tasks (monitoring data drift)
- Change-based schedules: to react to irregular or unpredictable changes (new data uploaded, old data edited, ...); only Blob storage can be monitored

In [None]:
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule

# Create a time-based schedule
# Frequency can be Minute / Hour / Day / Week / Month
recurrence = ScheduleRecurrence(frequency="Month", interval=1)
recurring_schedule = Schedule.create(workspace, name="MonthlySchedule", 
                            description="Based on time",
                            pipeline_id=published_pipeline.id, 
                            experiment_name=experiment_name, 
                            recurrence=recurrence)

In [None]:
recurring_schedule

### Pipeline Schedule Management

- Enable / Disable

enable(wait_for_provisioning=False, wait_timeout=3600)

disable(wait_for_provisioning=False, wait_timeout=3600)

- Get (get the schedule with the given ID)

get(workspace, id, _workflow_provider=None, _service_endpoint=None)

- List (get all schedules in the current workspace)

list(workspace, active_only=True, pipeline_id=None, pipeline_endpoint_id=None, _workflow_provider=None, _service_endpoint=None)

https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-setup-schedule-for-a-published-pipeline.ipynb
