# Model Baseline and Schedule

This notebook takes you through the following steps:
1. Enable real-time inference data capture
2. Model Monitor - Baselining
3. Analyse initial monitoring schedule
4. Create monitoring schedule

## Step 1: Enable real-time inference data capture

To enable data capture for monitoring the model data quality, you specify the new capture option called `DataCaptureConfig`. 

You can capture the request payload, the response payload or both with this configuration. The capture config applies to all variants. Please provide the Endpoint name in the following cell:

In [None]:
from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker import RealTimePredictor
from sagemaker import session
import boto3

sm_session = session.Session(boto3.Session())
bucket = sm_session.default_bucket()
prefix='text-multiclass'

In [None]:
codepipeline = boto3.client('codepipeline')
sm = boto3.client('sagemaker')

pipeline_name = 'mlops1-text-multiclass'
training_job_name_mask='mlops1-text-multiclass-%s'
endpoint_name_mask='mlops1-text-multiclass-%s-%s'

# Get the current execution id for the latest successful prod deploy
response = codepipeline.get_pipeline_state( name=pipeline_name )
executionId = response['stageStates'][-1]['latestExecution']['pipelineExecutionId']
endpoint_name = endpoint_name_mask % ('prd', executionId)
print('endpoint name: {}'.format(endpoint_name))

In [None]:
s3_capture_prefix = '{}/datacapture'.format(prefix)
s3_capture_upload_path = 's3://{}/{}'.format(bucket, s3_capture_prefix)
print('data capture: {}'.format(s3_capture_upload_path))

In [None]:
from sagemaker.model_monitor import DataCaptureConfig
from sagemaker import RealTimePredictor
from sagemaker import session
from sagemaker.utils import name_from_base

import boto3
sm_session = session.Session(boto3.Session())

# Change parameters as you would like - adjust the sampling percentage,
#  choose to capture the request, the response, or both.
#  Learn more from our documentation
data_capture_config = DataCaptureConfig(
                        enable_capture = True,
                        sampling_percentage=100,
                        destination_s3_uri=s3_capture_upload_path,
                        kms_key_id=None,
                        capture_options=["REQUEST", "RESPONSE"],
                        csv_content_types=["text/csv"],
                        json_content_types=["application/json"])

# NOTE: The following doesn't work when created by CFN
# # Now it is time to apply the new configuration and wait for it to be applied
# predictor = RealTimePredictor(endpoint=endpoint_name)
# predictor.update_data_capture_config(data_capture_config=data_capture_config)
# sm_session.wait_for_endpoint(endpoint=endpoint_name)

endpoint = sm.describe_endpoint(EndpointName=endpoint_name)
if endpoint['EndpointStatus'] != 'InService':
    raise(Exception('Endpoint not InService'))

# Get the current endpoint config
endpoint_config_name = endpoint['EndpointConfigName']
new_config_name = name_from_base(base=endpoint_config_name)

# Create a new config from the existing adding data capture
new_tags = [{'Key': 'datacapture', 'Value': 'true'}] 
sm_session.create_endpoint_config_from_existing(
    endpoint_config_name, new_config_name, new_tags=new_tags, 
    new_data_capture_config_dict=data_capture_config._to_request_dict())

# Update the endpoint
sm_session.update_endpoint(endpoint_name=endpoint_name, endpoint_config_name=new_config_name)
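For reference, the capture settings above end up as a plain dictionary on the endpoint config. The sketch below builds such a dictionary by hand, assuming the field names of the `DataCaptureConfig` structure in the `CreateEndpointConfig` API; the helper name `build_data_capture_request` is hypothetical and not part of the SageMaker SDK, so check the field names against the boto3 documentation before relying on them.

```python
def build_data_capture_request(destination_s3_uri, sampling_percentage=100,
                               capture_request=True, capture_response=True):
    """Build a data capture dictionary (hypothetical helper, assumed API shape)."""
    if not 0 <= sampling_percentage <= 100:
        raise ValueError('sampling_percentage must be between 0 and 100')

    # Each capture mode is its own entry in the CaptureOptions list
    capture_options = []
    if capture_request:
        capture_options.append({'CaptureMode': 'Input'})
    if capture_response:
        capture_options.append({'CaptureMode': 'Output'})
    if not capture_options:
        raise ValueError('capture at least the request or the response')

    return {
        'EnableCapture': True,
        'InitialSamplingPercentage': sampling_percentage,
        'DestinationS3Uri': destination_s3_uri,
        'CaptureOptions': capture_options,
        'CaptureContentTypeHeader': {
            'CsvContentTypes': ['text/csv'],
            'JsonContentTypes': ['application/json'],
        },
    }


# The bucket and prefix below are placeholders
capture_request = build_data_capture_request('s3://example-bucket/text-multiclass/datacapture', 50)
print(capture_request)
```

Lowering the sampling percentage reduces the volume of captured data, at the cost of a smaller sample for each monitoring run.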

## Step 2: Model Monitor - Baselining

In addition to collecting the data, SageMaker allows you to monitor and evaluate the data observed by the Endpoints. 

For this:
1. We need to create a baseline against which we compare the real-time traffic.
1. Once a baseline is ready, we can set up a schedule to continuously evaluate the traffic against the baseline.

In [None]:
import pandas as pd

baseline_file = 'output/data/predictions.csv'

In [None]:
!head -3 $baseline_file

### Constraint suggestion with baseline/training dataset

Use the output predictions from the test dataset as the baseline and upload them to Amazon S3.

In [None]:
# copy over the training dataset to Amazon S3 (if you already have it in Amazon S3, you could reuse it)
baseline_prefix = prefix + '/baselining'
baseline_results_prefix = baseline_prefix + '/results'

baseline_data_uri = 's3://{}/{}'.format(bucket,baseline_file)
baseline_results_uri = 's3://{}/{}'.format(bucket, baseline_results_prefix)
print('Baseline data file: {}'.format(baseline_data_uri))
print('Baseline results uri: {}'.format(baseline_results_uri))

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(baseline_file).upload_file(baseline_file)
print('Uploaded baseline: {}'.format(baseline_file))

### Create a baselining job with the training dataset

Now that we have the training data ready in S3, let's kick off a job to `suggest` constraints. `DefaultModelMonitor.suggest_baseline(..)` kicks off a `ProcessingJob` using a SageMaker provided Model Monitor container to generate the constraints. Please edit the configurations to fit your needs.

In [None]:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker import get_execution_role

role = get_execution_role()

my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True
)

### Explore the generated constraints and statistics

In [None]:
baseline_job = my_default_monitor.latest_baselining_job
schema_df = pd.io.json.json_normalize(baseline_job.baseline_statistics().body_dict["features"])
schema_df.head(10)

In [None]:
constraints_df = pd.io.json.json_normalize(baseline_job.suggested_constraints().body_dict["features"])
constraints_df.head(10)

Before proceeding to enable monitoring, you can choose to edit the constraints file to fine-tune the constraints.
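As an illustration of such an edit, the sketch below relaxes the `completeness` requirement of a single feature. It works on an in-memory dictionary; the sample content is made up and only assumes the usual `features` list layout of `constraints.json`, and `relax_completeness` is a hypothetical helper. In practice you would load the generated file, apply the change, and upload the result back to S3.

```python
import copy
import json


def relax_completeness(constraints, feature_name, completeness):
    """Return a copy of the constraints with a new completeness for one feature."""
    updated = copy.deepcopy(constraints)  # leave the original untouched
    for feature in updated['features']:
        if feature['name'] == feature_name:
            feature['completeness'] = completeness
            return updated
    raise KeyError('feature not found: {}'.format(feature_name))


# Made-up sample standing in for the generated constraints.json
sample_constraints = {
    'version': 0.0,
    'features': [
        {'name': 'class_predictions', 'inferred_type': 'String', 'completeness': 1.0},
        {'name': 'class_probability', 'inferred_type': 'Fractional', 'completeness': 1.0,
         'num_constraints': {'is_non_negative': True}},
    ],
}

edited_constraints = relax_completeness(sample_constraints, 'class_probability', 0.95)
print(json.dumps(edited_constraints['features'][1], indent=2))
```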

## Step 3: Analyse initial monitoring schedule

We collected the data above. Here we proceed to analyze and monitor that data with MonitoringSchedules.

Start by sending some different data to the endpoint, so that there is captured data for the monitoring job to process.

### Inspect Captured data

In [None]:
s3_client = boto3.Session().client('s3')

# Get capture files for this new endpoint
results_prefix = s3_capture_prefix+'/'+endpoint_name
result = s3_client.list_objects(Bucket=bucket, Prefix=results_prefix)
if 'Contents' not in result:
    raise(Exception('No results available yet for location: {}'.format(results_prefix)))
else:
    capture_files = ['s3://{0}/{1}'.format(bucket, capture_file.get("Key")) 
                     for capture_file in result.get('Contents')][::-1]
    print("Captured Files: {}, top 3:".format(len(capture_files)))
    print("\n ".join(capture_files[:3]))

In [None]:
!mkdir -p baselining/output
!aws s3 cp {capture_files[1]} baselining/output/captured_data_example.jsonl
!head -1 baselining/output/captured_data_example.jsonl

Print the first payload from this file

In [None]:
import json

def parse_event_output(data):
    import csv
    from io import StringIO
    cols = ['class_predictions',
             'class_probabilities_<UNK>',
             'class_probabilities___label__eating_out',
             'class_probabilities___label__groceries',
             'class_probabilities___label__transport',
             'class_probabilities___label__shopping',
             'class_probabilities___label__health',
             'class_probabilities___label__travel',
             'class_probabilities___label__entertainment',
             'class_probabilities___label__education',
             'class_probabilities___label__home',
             'class_probabilities___label__utilities',
             'class_probability']
    for row in csv.DictReader(StringIO(data), fieldnames=cols):
        return dict(row) # Return first row only, or return list of dicts?

with open('baselining/output/captured_data_example.jsonl', 'r') as f:
    lines = f.read().split('\n')
    event = json.loads(lines[0])
    print('input: {}\n{}'.format(event['captureData']['endpointInput']['observedContentType'], 
                                 event['captureData']['endpointInput']['data'][:200]))
    print('output: {}\n{}'.format(event['captureData']['endpointOutput']['observedContentType'], 
                                  parse_event_output(event['captureData']['endpointOutput']['data'])))
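The cell above only looks at the first line of the capture file. A capture file is in JSON Lines format, so every non-empty line is its own event. The sketch below walks all events and summarizes each one; it runs on two made-up events rather than a real capture file, and the helper names `iter_capture_events` and `summarize_event` are hypothetical.

```python
import json


def iter_capture_events(jsonl_text):
    """Yield one parsed event per non-empty line of a JSON Lines document."""
    for line in jsonl_text.splitlines():
        if line.strip():
            yield json.loads(line)


def summarize_event(event):
    """Pick out the content types and payload sizes of a capture event."""
    capture = event['captureData']
    return {
        'input_type': capture['endpointInput']['observedContentType'],
        'output_type': capture['endpointOutput']['observedContentType'],
        'input_chars': len(capture['endpointInput']['data']),
        'output_chars': len(capture['endpointOutput']['data']),
    }


# Two made-up events standing in for the contents of a capture file
sample_events = [
    {'captureData': {
        'endpointInput': {'observedContentType': 'text/csv', 'data': 'coffee and a sandwich'},
        'endpointOutput': {'observedContentType': 'text/csv', 'data': '__label__eating_out,0.9'}}},
    {'captureData': {
        'endpointInput': {'observedContentType': 'text/csv', 'data': 'monthly train pass'},
        'endpointOutput': {'observedContentType': 'text/csv', 'data': '__label__transport,0.8'}}},
]
sample_jsonl = '\n'.join(json.dumps(e) for e in sample_events) + '\n'

summaries = [summarize_event(e) for e in iter_capture_events(sample_jsonl)]
for summary in summaries:
    print(summary)
```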

### Run an immediate schedule

Let's start by running a schedule on some drifted data.

In [None]:
import os, sys
from urllib.parse import urlparse
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

def get_model_monitor_container_uri(region):
    container_uri_format = '{0}.dkr.ecr.{1}.amazonaws.com/sagemaker-model-monitor-analyzer'
    
    regions_to_accounts = {
        'eu-north-1': '895015795356',
        'me-south-1': '607024016150',
        'ap-south-1': '126357580389',
        'eu-west-3': '680080141114',
        'us-east-2': '777275614652',
        'eu-west-1': '468650794304',
        'eu-central-1': '048819808253',
        'sa-east-1': '539772159869',
        'ap-east-1': '001633400207',
        'us-east-1': '156813124566',
        'ap-northeast-2': '709848358524',
        'eu-west-2': '749857270468',
        'ap-northeast-1': '574779866223',
        'us-west-2': '159807026194',
        'us-west-1': '890145073186',
        'ap-southeast-1': '245545462676',
        'ap-southeast-2': '563025443158',
        'ca-central-1': '536280801234'
    }
    
    container_uri = container_uri_format.format(regions_to_accounts[region], region)
    return container_uri

def get_file_name(url):
    a = urlparse(url)
    return os.path.basename(a.path)

def run_model_monitor_job_processor(region, instance_type, role, data_capture_path, statistics_path, constraints_path, reports_path,
                                    instance_count=1, preprocessor_path=None, postprocessor_path=None, publish_cloudwatch_metrics='Disabled'):
    
    data_capture_sub_path = data_capture_path[data_capture_path.rfind('datacapture/') :]
    data_capture_sub_path = data_capture_sub_path[data_capture_sub_path.find('/') + 1 :]
    processing_output_paths = reports_path + '/' + data_capture_sub_path
    
    input_1 = ProcessingInput(input_name='input_1',
                          source=data_capture_path,
                          destination='/opt/ml/processing/input/endpoint/' + data_capture_sub_path,
                          s3_data_type='S3Prefix',
                          s3_input_mode='File')

    baseline = ProcessingInput(input_name='baseline',
                               source=statistics_path,
                               destination='/opt/ml/processing/baseline/stats',
                               s3_data_type='S3Prefix',
                               s3_input_mode='File')

    constraints = ProcessingInput(input_name='constraints',
                                  source=constraints_path,
                                  destination='/opt/ml/processing/baseline/constraints',
                                  s3_data_type='S3Prefix',
                                  s3_input_mode='File')

    outputs = ProcessingOutput(output_name='result',
                               source='/opt/ml/processing/output',
                               destination=processing_output_paths,
                               s3_upload_mode='Continuous')

    env = {'baseline_constraints': '/opt/ml/processing/baseline/constraints/' + get_file_name(constraints_path),
           'baseline_statistics': '/opt/ml/processing/baseline/stats/' + get_file_name(statistics_path),
           'dataset_format': '{"sagemakerCaptureJson":{"captureIndexNames":["endpointInput","endpointOutput"]}}',
           'dataset_source': '/opt/ml/processing/input/endpoint',
           'output_path': '/opt/ml/processing/output',
           'publish_cloudwatch_metrics': publish_cloudwatch_metrics }
    
    inputs=[input_1, baseline, constraints]
    
    if postprocessor_path:
        env['post_analytics_processor_script'] = '/opt/ml/processing/code/postprocessing/' + get_file_name(postprocessor_path)
        
        post_processor_script = ProcessingInput(input_name='post_processor_script',
                                                source=postprocessor_path,
                                                destination='/opt/ml/processing/code/postprocessing',
                                                s3_data_type='S3Prefix',
                                                s3_input_mode='File')
        inputs.append(post_processor_script)

    if preprocessor_path:
        env['record_preprocessor_script'] = '/opt/ml/processing/code/preprocessing/' + get_file_name(preprocessor_path)
         
        pre_processor_script = ProcessingInput(input_name='pre_processor_script',
                                               source=preprocessor_path,
                                               destination='/opt/ml/processing/code/preprocessing',
                                               s3_data_type='S3Prefix',
                                               s3_input_mode='File')
        
        inputs.append(pre_processor_script) 
    
    processor = Processor(image_uri = get_model_monitor_container_uri(region),
                          instance_count = instance_count,
                          instance_type = instance_type,
                          role=role,
                          env = env)
    
    return processor.run(inputs=inputs, outputs=[outputs])
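The path handling at the top of `run_model_monitor_job_processor` is easy to check in isolation. The sketch below reproduces it as a standalone function: everything after the last `datacapture/` segment is kept and appended to the reports path, so reports mirror the capture layout. The function name `split_capture_path` and the example bucket and endpoint names are made up, and unlike the original slicing it raises an error when the marker is missing.

```python
def split_capture_path(data_capture_path, reports_path, marker='datacapture/'):
    """Return (capture sub path, report output path) for a capture location."""
    start = data_capture_path.rfind(marker)
    if start == -1:
        raise ValueError('{!r} not found in {!r}'.format(marker, data_capture_path))
    # Keep what follows the marker, e.g. <endpoint>/<variant>/<yyyy>/<mm>/<dd>/<hh>
    sub_path = data_capture_path[start + len(marker):]
    return sub_path, reports_path + '/' + sub_path


sub_path, output_path = split_capture_path(
    's3://example-bucket/text-multiclass/datacapture/example-endpoint/AllTraffic/2020/05/01/12',
    's3://example-bucket/text-multiclass/monitoring/reports')
print(sub_path)
print(output_path)
```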

In [None]:
%%writefile preprocessor.py
import json

def parse_event_output(data):
    import csv
    from io import StringIO
    cols = ['class_predictions',
             'class_probabilities_<UNK>',
             'class_probabilities___label__eating_out',
             'class_probabilities___label__groceries',
             'class_probabilities___label__transport',
             'class_probabilities___label__shopping',
             'class_probabilities___label__health',
             'class_probabilities___label__travel',
             'class_probabilities___label__entertainment',
             'class_probabilities___label__education',
             'class_probabilities___label__home',
             'class_probabilities___label__utilities',
             'class_probability'] # Define columns
    for row in csv.DictReader(StringIO(data), fieldnames=cols):
        return dict(row) # Return first row only, or is a list supported

def preprocess_handler(inference_record):
    try:
        # Parse the CSV with header
        data = inference_record.endpoint_output.data
        if inference_record.endpoint_output.encoding == 'CSV':
            data = parse_event_output(data)
        return data
    except Exception:
        # Return an undefined label
        return {'class_predictions': '__label__undefined', 'class_probabilities_<UNK>': 1.0 }

In [None]:
%%writefile postprocessor.py
def postprocess_handler():
    print("Hello from post-proc script!")

In [None]:
import boto3

monitoring_code_prefix = '{0}/monitoring/code'.format(prefix)
print(monitoring_code_prefix)

boto3.Session().resource('s3').Bucket(bucket).Object(monitoring_code_prefix + '/preprocessor.py').upload_file('preprocessor.py')
s3_preprocessor_path = 's3://{0}/{1}/monitoring/code/preprocessor.py'.format(bucket, prefix)
print(s3_preprocessor_path)

boto3.Session().resource('s3').Bucket(bucket).Object(monitoring_code_prefix + '/postprocessor.py').upload_file('postprocessor.py')
s3_postprocessor_path = 's3://{0}/{1}/monitoring/code/postprocessor.py'.format(bucket, prefix)
print(s3_postprocessor_path)

s3_reports_path = 's3://{0}/{1}/monitoring/reports'.format(bucket, prefix)
print(s3_reports_path)

In [None]:
# Use the folder of the last capture file, plus the baseline statistics and constraints
s3_data_capture_path = capture_files[-1][: capture_files[-1].rfind('/')]
s3_statistics_path = baseline_results_uri + '/statistics.json'
s3_constraints_path = baseline_results_uri + '/constraints.json'

print(s3_data_capture_path)
print(s3_postprocessor_path)
print(s3_statistics_path)
print(s3_constraints_path)
print(s3_reports_path)

In [None]:
region = boto3.Session().region_name

processor = run_model_monitor_job_processor(region, 'ml.m5.xlarge', role, 
                                s3_data_capture_path, s3_statistics_path, s3_constraints_path, s3_reports_path,
                                #preprocessor_path=s3_preprocessor_path,
                                postprocessor_path=s3_postprocessor_path)

### Analysis

When the monitoring job completes, monitoring reports are saved to Amazon S3. Let's list the generated reports.

In [None]:
s3_client = boto3.Session().client('s3')
monitoring_reports_prefix = '{}/monitoring/reports/{}'.format(prefix, endpoint_name)

result = s3_client.list_objects(Bucket=bucket, Prefix=monitoring_reports_prefix)
if 'Contents' in result:
    monitoring_reports = ['s3://{0}/{1}'.format(bucket, report_file.get("Key")) for report_file in result['Contents']]
    print("Monitoring Reports Files: ")
    print("\n ".join(monitoring_reports))
else:
    print('No monitoring reports found.')

Copy monitoring reports locally

In [None]:
!aws s3 cp {monitoring_reports[0]} monitoring/
!aws s3 cp {monitoring_reports[1]} monitoring/
!aws s3 cp {monitoring_reports[2]} monitoring/

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)

with open('monitoring/constraint_violations.json', 'r') as f:
    data = f.read()

violations_df = pd.io.json.json_normalize(json.loads(data)['violations'])
violations_df
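When the report is long, a count per check type gives a quick first impression. The sketch below does that with `collections.Counter` on a made-up list of violations; it assumes each entry carries a `constraint_check_type` field, as in the violations report, and the descriptions are placeholders.

```python
from collections import Counter


def count_violations_by_type(violations):
    """Count violations per constraint check type."""
    return Counter(v.get('constraint_check_type', 'unknown') for v in violations)


# Made-up entries standing in for json.loads(data)['violations']
sample_violations = [
    {'feature_name': 'class_predictions', 'constraint_check_type': 'data_type_check',
     'description': 'placeholder description'},
    {'feature_name': 'class_probability', 'constraint_check_type': 'baseline_drift_check',
     'description': 'placeholder description'},
    {'feature_name': 'class_probabilities_<UNK>', 'constraint_check_type': 'baseline_drift_check',
     'description': 'placeholder description'},
]

violation_counts = count_violations_by_type(sample_violations)
for check_type, count in violation_counts.most_common():
    print('{}: {}'.format(check_type, count))
```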

### Advanced Hints

You might be asking yourself which types of violations are monitored and how drift from the baseline is computed.

The types of violations monitored are listed here: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-interpreting-violations.html. Most of them use configurable thresholds, that are specified in the monitoring configuration section of the baseline constraints JSON. Let's take a look at this configuration from the baseline constraints file:

In [None]:
!aws s3 cp {s3_statistics_path} baseline/
!aws s3 cp {s3_constraints_path} baseline/

In [None]:
import json
with open("baseline/constraints.json", "r") as myfile:
    data = myfile.read()

print(json.dumps(json.loads(data)['monitoring_config'], indent=2))

This configuration is interpreted when the monitoring job is executed and is used to compare captured data to the baseline. If you want to customize this section, you have to update the constraints.json file and upload it back to Amazon S3 before launching the monitoring job.

When data distributions are compared to detect potential drift, you can choose either a Simple or a Robust comparison method. The Robust method is preferable when dealing with small datasets. Additional info: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-byoc-constraints.html.
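As a sketch of such a customization, the function below switches the comparison method and optionally the drift threshold in a constraints dictionary. It assumes the `monitoring_config` section holds a `distribution_constraints` block with `comparison_method` and `comparison_threshold` keys; compare this with the output of the previous cell before using it. The sample values and the helper name `set_comparison_method` are made up.

```python
import copy

VALID_METHODS = ('Simple', 'Robust')


def set_comparison_method(constraints, method, threshold=None):
    """Return a copy of the constraints with an updated distribution comparison."""
    if method not in VALID_METHODS:
        raise ValueError('method must be one of {}'.format(VALID_METHODS))
    updated = copy.deepcopy(constraints)  # leave the original untouched
    config = updated.setdefault('monitoring_config', {})
    distribution = config.setdefault('distribution_constraints', {})
    distribution['comparison_method'] = method
    if threshold is not None:
        distribution['comparison_threshold'] = threshold
    return updated


# Made-up sample standing in for the loaded constraints.json
sample_config = {
    'monitoring_config': {
        'distribution_constraints': {
            'perform_comparison': 'Enabled',
            'comparison_threshold': 0.1,
            'comparison_method': 'Robust',
        }
    }
}

simple_config = set_comparison_method(sample_config, 'Simple', threshold=0.2)
print(simple_config['monitoring_config']['distribution_constraints'])
```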

In [None]:
print('Schedule status: {}'.format(my_default_monitor.describe_schedule()['MonitoringScheduleStatus']))

### List Executions

The schedule starts jobs at the previously specified intervals. Here, you list the latest five executions. Note that if you are kicking this off after creating the hourly schedule, you might find the executions empty. You might have to wait until you cross the hour boundary (in UTC) to see executions kick off. The code below has the logic for waiting.

Note: Even for an hourly schedule, Amazon SageMaker has a buffer period of 20 minutes to schedule your execution. You might see your execution start in anywhere from zero to ~20 minutes from the hour boundary. This is expected and done for load balancing in the backend.

In [None]:
import time

mon_executions = my_default_monitor.list_executions()
print("We created an hourly schedule above and it will kick off executions ON the hour (plus 0 - 20 min buffer).")
print("We will have to wait till we hit the hour...")

while len(mon_executions) == 0:
    time.sleep(30)
    print("Waiting for the 1st execution to happen...")
    mon_executions = my_default_monitor.list_executions()
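To get a feel for how long the loop above may run, the timing rule described earlier can be sketched with the standard library alone: the next execution is expected between the next hour boundary (in UTC) and that boundary plus the buffer. The function name `next_execution_window` is hypothetical, and the 20 minute default simply restates the buffer mentioned above.

```python
from datetime import datetime, timedelta, timezone


def next_execution_window(now, buffer_minutes=20):
    """Return (earliest, latest) expected start of the next hourly execution."""
    hour_start = now.replace(minute=0, second=0, microsecond=0)
    next_hour = hour_start + timedelta(hours=1)
    return next_hour, next_hour + timedelta(minutes=buffer_minutes)


# Example with a fixed timestamp; use datetime.now(timezone.utc) in practice
example_now = datetime(2020, 5, 1, 12, 34, tzinfo=timezone.utc)
earliest, latest = next_execution_window(example_now)
print('Next execution expected between {:%H:%M} and {:%H:%M} UTC'.format(earliest, latest))
```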

### Inspect a specific execution (latest execution)
In the previous cell, you picked up the latest completed or failed scheduled execution. Here are the possible terminal states and what each of them means:
* Completed - The monitoring execution completed and no issues were found in the violations report.
* CompletedWithViolations - The execution completed, but constraint violations were detected.
* Failed - The monitoring execution failed, possibly because of a client error (for example, incorrect role permissions) or infrastructure issues. Examine FailureReason and ExitMessage to identify what exactly happened.
* Stopped - The job exceeded the maximum runtime or was manually stopped.
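The states above can be turned into a small decision helper. The sketch below takes a dictionary shaped like the output of `execution.describe()` and maps it to a short outcome. It assumes that violations are signalled either through the status itself or through an `ExitMessage` that starts with `CompletedWithViolations`; verify this against a real execution. The helper name `classify_execution` and the sample dictionaries are made up.

```python
def classify_execution(job):
    """Map a processing job description to a short outcome string."""
    status = job.get('ProcessingJobStatus', '')
    exit_message = job.get('ExitMessage', '') or ''

    if status == 'CompletedWithViolations':
        return 'violations'
    if status == 'Completed':
        # Assumption: violations are reported at the start of the exit message
        if exit_message.startswith('CompletedWithViolations'):
            return 'violations'
        return 'clean'
    if status == 'Failed':
        return 'failed: {}'.format(job.get('FailureReason', 'see ExitMessage'))
    if status == 'Stopped':
        return 'stopped'
    return 'not finished'


# Made-up descriptions covering each branch
print(classify_execution({'ProcessingJobStatus': 'Completed', 'ExitMessage': 'Completed'}))
print(classify_execution({'ProcessingJobStatus': 'Completed',
                          'ExitMessage': 'CompletedWithViolations: placeholder'}))
print(classify_execution({'ProcessingJobStatus': 'Failed', 'FailureReason': 'placeholder'}))
print(classify_execution({'ProcessingJobStatus': 'InProgress'}))
```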

In [None]:
mon_executions = my_default_monitor.list_executions()

# Get the latest completed execution
for execution in mon_executions[::-1]:
    latest_job = execution.describe()
    print('{:%Y-%m-%d %H:%M} {}\n{}'.format(latest_job['ProcessingEndTime'], 
                                            latest_job['ProcessingJobStatus'],
                                            latest_job['ProcessingJobArn']))
    if latest_job['ProcessingJobStatus'] == 'Completed':
        break
    time.sleep(1)
           
if latest_job['ProcessingJobStatus'] == 'Completed':
    execution.wait(logs=False)
    print("Latest execution result: {}".format(latest_job['ExitMessage']))
else:
    print("====STOP====\nNo completed executions to inspect further. \nPlease wait till an execution completes or investigate previously reported failures.")

### Visualize the schedule

In [None]:
from IPython.display import HTML, display
import json
import os
import boto3

import sagemaker
from sagemaker import session
from sagemaker.model_monitor import MonitoringExecution
from sagemaker.s3 import S3Downloader

In [None]:
!wget -O utils.py https://raw.githubusercontent.com/awslabs/amazon-sagemaker-examples/master/sagemaker_model_monitor/visualization/utils.py

import utils as mu

In [None]:
execution.describe()['ExitMessage']

In [None]:
exec_inputs = {inp['InputName']: inp for inp in execution.describe()['ProcessingInputs']}
exec_results = execution.output.destination

In [None]:
baseline_statistics_filepath = exec_inputs['baseline']['S3Input']['S3Uri'] if 'baseline' in exec_inputs else None
execution_statistics_filepath = os.path.join(exec_results, 'statistics.json')
violations_filepath = os.path.join(exec_results, 'constraint_violations.json')

baseline_statistics = json.loads(S3Downloader.read_file(baseline_statistics_filepath)) if baseline_statistics_filepath is not None else None
execution_statistics = json.loads(S3Downloader.read_file(execution_statistics_filepath))
violations = json.loads(S3Downloader.read_file(violations_filepath))['violations']

## Overview

The code below shows the violations and constraint checks across all features in a simple table.

In [None]:
mu.show_violation_df(baseline_statistics=baseline_statistics, latest_statistics=execution_statistics, violations=violations)

## Distributions

This section visualizes the distribution and renders the distribution statistics for all features

In [None]:
features = mu.get_features(execution_statistics)
feature_baselines = mu.get_features(baseline_statistics)
mu.show_distributions(features)

### Execution Stats vs Baseline

In [None]:
mu.show_distributions(features, feature_baselines)