# Assignment 6: Implement data and model quality monitoring
In this assignment you use [Amazon SageMaker Model Monitor](https://aws.amazon.com/sagemaker/model-monitor/) to implement continuous data quality monitoring for a real-time inference endpoint.

Refer to the notebook [`06-monitoring.ipynb`](../06-monitoring.ipynb) for code snippets and general guidance on the exercises in this assignment.

## Import packages

In [None]:
%pip install jsonlines tqdm

In [None]:
import boto3
import botocore
import sagemaker 
import json
import jsonlines
import random
from tqdm import trange
from sagemaker.predictor import Predictor
import time
from time import gmtime, strftime
from datetime import datetime, timedelta
import uuid
import pandas as pd
import numpy as np
from sagemaker.model_monitor import (
    DefaultModelMonitor,
    DataCaptureConfig,
    CronExpressionGenerator,
    ModelQualityMonitor,
    EndpointInput,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat
from utils.monitoring_utils import run_model_monitor_job
from sagemaker.s3 import S3Downloader, S3Uploader
from sagemaker.clarify import (
    BiasConfig,
    DataConfig,
    ModelConfig,
    ModelPredictedLabelConfig,
    SHAPConfig,
)
from urllib.parse import urlparse

In [None]:
sm = boto3.client("sagemaker")
s3 = boto3.client("s3")
session = sagemaker.Session()
pd.set_option("display.max_colwidth", None)

## Exercise 1: Check data capture configuration
Use any one of the existing inference endpoints you deployed in the previous notebook. Data capture is configured at 100% of the incoming data for the staging endpoint and at 80% for the production endpoint. Verify this configuration in the **Endpoint details** view in Studio UX.

![](../img/endpoints.png)

![](../img/endpoint-details-data-capture.png)

You can also use `boto3` to describe an endpoint.
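
As a minimal sketch of what to look for in the response, the helper below pulls the data capture settings out of a `describe_endpoint` result. The function name `summarize_data_capture` and the sample values are illustrative assumptions; the sample dictionary is a hand-written stand-in for a real response, so the block runs without AWS access.

```python
def summarize_data_capture(endpoint_description):
    """Extract data capture settings from a describe_endpoint response dict."""
    capture = endpoint_description.get("DataCaptureConfig")
    if capture is None:
        # Data capture is not configured for this endpoint
        return None
    return {
        "enabled": capture.get("EnableCapture", False),
        "status": capture.get("CaptureStatus"),
        "sampling_percentage": capture.get("CurrentSamplingPercentage"),
        "destination": capture.get("DestinationS3Uri"),
    }


# Hand-written stand-in for sm.describe_endpoint(EndpointName=ep_name);
# all values here are made up for illustration.
sample_description = {
    "EndpointName": "example-endpoint",
    "DataCaptureConfig": {
        "EnableCapture": True,
        "CaptureStatus": "Started",
        "CurrentSamplingPercentage": 100,
        "DestinationS3Uri": "s3://example-bucket/data-capture",
    },
}

print(summarize_data_capture(sample_description))
```

With a real endpoint you would pass `sm.describe_endpoint(EndpointName=ep_name)` instead of the sample dictionary.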

In [None]:
# Get the details of the endpoint
# ep_name = 
# sm.describe_endpoint()

In [None]:
# Get the S3 url where the capture data files are stored
# data_capture_uri = sm.describe_endpoint(EndpointName=ep_name)['DataCaptureConfig']['DestinationS3Uri']

## Exercise 2: Generate and view captured data
In this exercise you send data to the inference endpoint to generate captured data. Use the SageMaker Python SDK class [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor) to interact with the endpoint.

In [None]:
# Create a predictor from the endpoint name
# endpoint_name = 
# predictor = Predictor()

For test data you can use the test dataset in the `tmp` folder created in the `02-sagemaker-containers.ipynb` notebook. If you don't have the test dataset, you can generate it by running the model building pipeline and downloading the test dataset from the Amazon S3 bucket to the `tmp` folder.

In [None]:
# Load test data
# test_x = pd.read_csv()

In [None]:
# Send data to the endpoint
def generate_endpoint_traffic(predictor, data):
    l = len(data)
    print(f"Sending {l} vectors to the endpoint")
    for i in trange(l):
        predictions = np.array(predictor.predict(data.iloc[i].values), dtype=float).squeeze()
        time.sleep(0.001)

In [None]:
# Generate endpoint traffic
# generate_endpoint_traffic(predictor, test_x)

Wait several minutes for files with captured data to appear in the Amazon S3 bucket. 
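
Instead of waiting a fixed time, you could poll the capture prefix until files show up. The sketch below assumes a boto3-style S3 client passed in as `s3_client`; the function name, the timeout defaults, and the `sleep` parameter (there so the loop can be tested without real delays) are illustrative choices, not part of the assignment code.

```python
import time
from urllib.parse import urlparse


def wait_for_capture_files(s3_client, capture_uri, timeout_s=600, poll_s=30,
                           sleep=time.sleep):
    """Poll an S3 prefix until at least one .jsonl file appears.

    Returns the sorted list of matching keys, or an empty list on timeout.
    Only the first page of list_objects_v2 results is inspected.
    """
    parsed = urlparse(capture_uri)
    bucket, prefix = parsed.netloc, parsed.path.lstrip("/")
    waited = 0
    while True:
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys = [
            obj["Key"]
            for obj in response.get("Contents", [])
            if obj["Key"].endswith(".jsonl")
        ]
        if keys:
            return sorted(keys)
        if waited >= timeout_s:
            return []
        sleep(poll_s)
        waited += poll_s


# Usage in the notebook (requires AWS access):
# capture_keys = wait_for_capture_files(s3, data_capture_uri)
```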

Each inference request is captured in one line in the `jsonl` file. The line contains both the input and output merged together.
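
To illustrate the structure of one captured line, the sketch below parses a hand-written sample record. The sample values (event ID, timestamps, payloads) are made up, and the helper names `decode_payload` and `parse_capture_line` are hypothetical; check the field names against a real file from your capture prefix.

```python
import base64
import json


def decode_payload(payload):
    """Return the payload data as text, decoding it if it is Base64-encoded."""
    data = payload["data"]
    if payload.get("encoding") == "BASE64":
        return base64.b64decode(data).decode("utf-8")
    return data


def parse_capture_line(line):
    """Split one captured JSON line into event metadata, input, and output."""
    record = json.loads(line)
    capture = record["captureData"]
    metadata = record["eventMetadata"]
    return {
        "event_id": metadata["eventId"],
        "inference_time": metadata.get("inferenceTime"),
        "input": decode_payload(capture["endpointInput"]),
        "output": decode_payload(capture["endpointOutput"]),
    }


# Hand-written sample record; all values are illustrative only.
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "text/csv",
            "mode": "INPUT",
            "data": "34,1,0,0.5",
            "encoding": "CSV",
        },
        "endpointOutput": {
            "observedContentType": "text/csv",
            "mode": "OUTPUT",
            "data": "0.12",
            "encoding": "CSV",
        },
    },
    "eventMetadata": {
        "eventId": "11111111-2222-3333-4444-555555555555",
        "inferenceTime": "2024-01-01T00:00:00Z",
    },
    "eventVersion": "0",
})

parsed = parse_capture_line(sample_line)
print(parsed["input"], "->", parsed["output"])
```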

In [None]:
# List the files in the capture S3 prefix
# !aws s3 ls {data_capture_uri} --recursive

In [None]:
# Download the last captured dataset to Studio's EFS

In [None]:
# Print jsonl objects 

## Exercise 3: Run baseline data profiling
To enable data monitoring you must first create baseline statistics and constraints.

### Create a baselining job
To profile the data and create a baseline, use the baseline dataset `baseline.csv` which is produced by the model building pipeline. If you don't have the baseline dataset, execute the pipeline. Refer to the notebook `03-assignment-sagemaker-pipeline.ipynb` to get the S3 path to the baseline dataset.

Use [DefaultModelMonitor](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.DefaultModelMonitor) to interact with SageMaker Model Monitor functionality. To create a baseline, call the [`suggest_baseline`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.DefaultModelMonitor.suggest_baseline) method.

In [None]:
# Check if there is a baseline dataset under the specified S3 url
# !aws s3 ls {baseline_s3_url}/

In [None]:
# Set Amazon S3 paths corresponding to your environment
baseline_results_s3_url = "<where the baseline results will be stored>"
reports_s3_url = "<where the monitoring job reports will be stored>"
baseline_dataset_uri = "<points to the baseline dataset including file name>"
baseline_job_name = "<job name so you can recognize it in the SageMaker console>"

In [None]:
# Create DefaultModelMonitor
# data_monitor = DefaultModelMonitor()

# Run profiling job
# data_monitor.suggest_baseline()

Wait until the profiling job completes.

### See the generated statistics and constraints
The baselining job saves the baseline statistics to the `statistics.json` file and the suggested baseline constraints to the `constraints.json` file in the location you specify as the `output_s3_uri` parameter.

You can also access the statistics and constraints via the [`baseline_statistics()`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.BaseliningJob.baseline_statistics) and [`suggested_constraints()`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.BaseliningJob.suggested_constraints) methods of the `DefaultModelMonitor.latest_baselining_job` attribute.

Explore what statistics and constraints the profiling job generated.

In [None]:
# !aws s3 ls {baseline_results_s3_url}/

In [None]:
# Explore generated constraints and statistics for the baseline dataset
# baseline_job = data_monitor.latest_baselining_job

You can also load a normalized JSON from the `statistics.json` and `constraints.json` into a Pandas DataFrame.
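
To see what normalization does to the nested per-feature entries, here is a small standard-library sketch that flattens each feature into dot-separated column names, similar in spirit to `pd.json_normalize`. The `flatten` helper and the sample statistics are illustrative assumptions; a real `statistics.json` contains more fields, such as the distribution summaries.

```python
def flatten(nested, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with joined keys."""
    flat = {}
    for key, value in nested.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hand-written, shortened stand-in for the content of statistics.json
sample_statistics = {
    "version": 0.0,
    "dataset": {"item_count": 3},
    "features": [
        {
            "name": "age",
            "inferred_type": "Integral",
            "numerical_statistics": {
                "common": {"num_present": 3, "num_missing": 0},
                "mean": 40.0,
                "min": 30.0,
                "max": 50.0,
            },
        }
    ],
}

# One flat row per feature, ready to be passed to pd.DataFrame(rows)
rows = [flatten(feature) for feature in sample_statistics["features"]]
for column, value in rows[0].items():
    print(f"{column}: {value}")
```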

In [None]:
# statistics_df = pd.json_normalize(baseline_job.baseline_statistics().body_dict["features"])

## Exercise 4: Monitor data quality
After you have created the baseline constraints and statistics you can now validate if incoming data has the same statistical distribution and complies with all configured constraints.

You can either use scheduled executions of the Model Monitor analyzer or run the analyzer manually as a SageMaker processing job.

The Model Monitor compares the captured data with the baseline periodically based on a configured [monitoring schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-schedule-data-monitor.html).

If you run the analyzer manually, you provide the baseline statistics and constraints as SageMaker processing job parameters.

### Create a monitoring schedule
To create a monitoring schedule, use the [`create_monitoring_schedule()`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.DefaultModelMonitor.create_monitoring_schedule) method of the `DefaultModelMonitor` class. Use the [`CronExpressionGenerator`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.cron_expression_generator.CronExpressionGenerator) class to generate a cron expression string.
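
For orientation, the sketch below builds schedule expressions in the `cron(0 <hours> ? * * *)` shape that Model Monitor schedules accept, as far as the Developer Guide describes it. The three function names are hypothetical stand-ins written for this example; in the notebook, prefer the SDK's `CronExpressionGenerator` and verify the strings it returns.

```python
def hourly_cron():
    """Run at the start of every hour."""
    return "cron(0 * ? * * *)"


def daily_cron(hour=0):
    """Run once a day at the given hour (UTC, 0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return f"cron(0 {hour} ? * * *)"


def every_n_hours_cron(interval, starting_hour=0):
    """Run every `interval` hours, beginning at `starting_hour` (UTC)."""
    if not 1 <= interval <= 24:
        raise ValueError("interval must be between 1 and 24")
    if not 0 <= starting_hour <= 23:
        raise ValueError("starting_hour must be between 0 and 23")
    return f"cron(0 {starting_hour}/{interval} ? * * *)"


print(hourly_cron())
print(daily_cron(hour=6))
print(every_n_hours_cron(interval=6, starting_hour=12))
```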

When you created the monitoring baseline, you used the baseline dataset with all features but without the label. By default, Model Monitor concatenates the model's input and output, resulting in a dataset that contains all features plus the label. If you don't preprocess records before passing them to the Model Monitor analyzer, the number of columns in the baseline dataset won't match the number of columns in the record, and Model Monitor will report an `extra_column_check` violation. To avoid this, either include the label column in the baseline or remove the model output from the monitored records. You can use a [custom record preprocessing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-and-post-processing.html) script that returns only the input data without the label. See the notebook [`06-monitoring.ipynb`](../06-monitoring.ipynb) for more details.
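
A record preprocessing script of this kind could look roughly like the sketch below, which follows the `preprocess_handler` pattern from the linked Developer Guide page and keeps only the CSV input. It is not necessarily identical to the `record_preprocessor.py` file in this repository. The `SimpleNamespace` record at the bottom is a local stand-in for the object Model Monitor passes in, used only to try the handler outside a monitoring job.

```python
from types import SimpleNamespace


def preprocess_handler(inference_record):
    """Return only the endpoint input as a column-index -> value mapping.

    The model output is dropped, so the record has the same number of
    columns as a baseline dataset without the label.
    """
    input_data = inference_record.endpoint_input.data.rstrip("\n")
    # Zero-padded keys keep the columns in their original order
    return {
        str(i).zfill(20): value
        for i, value in enumerate(input_data.split(","))
    }


# Local stand-in for a captured record (illustrative values only)
fake_record = SimpleNamespace(
    endpoint_input=SimpleNamespace(data="0.5,12\n"),
    endpoint_output=SimpleNamespace(data="0.93"),
)

print(preprocess_handler(fake_record))
```

Only the handler function belongs in the script you upload to Amazon S3; the stand-in record is for local checking.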

In [None]:
# Explore the custom record preprocessing script
!pygmentize ../record_preprocessor.py

In [None]:
# Upload the record preprocessing script to S3

In [None]:
# Set monitoring schedule name and create monitoring schedule
# mon_schedule_name = # use a unique name for your monitoring schedule
# data_monitor.create_monitoring_schedule()

In [None]:
# Get monitoring schedule details
# data_monitor.describe_schedule()

### Generate compliant traffic
Generate some endpoint traffic using the `generate_endpoint_traffic` helper function.

In [None]:
# generate_endpoint_traffic(predictor, test_x)

In [None]:
### See the captured data under {data_capture_uri}

### Launch a manual monitoring job
If you don't want to wait for the next scheduled Model Monitor run, you can run the analyzer manually using a [built-in container](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-built-container.html) and a SageMaker [processing job](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html).

See the source code in this [repository](https://github.com/aws-samples/reinvent2019-aim362-sagemaker-debugger-model-monitor/tree/master/02_deploy_and_monitor). A copy of the [helper function](../utils/monitoring_utils.py) is also included in this repository.

In [None]:
!pygmentize ../utils/monitoring_utils.py

In [None]:
# Set the parameters and run a Model Monitor analyzer processing job
# (run_model_monitor_job is imported from utils.monitoring_utils above)
# run_model_monitor_job()

### Explore the monitoring job output
Because you run the analyzer as a SageMaker processing job, you can access all job details via the standard API. For example, you can retrieve the S3 URI of the job output.

In [None]:
analyzer_job_name = sm.list_processing_jobs(
    NameContains = 'sagemaker-model-monitor-analyzer',
    SortOrder='Descending',
    MaxResults=2
)['ProcessingJobSummaries'][0]['ProcessingJobName']

analyzer_job_info = sm.describe_processing_job(
    ProcessingJobName=analyzer_job_name
)

analyzer_job_output_s3_url = analyzer_job_info['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']

print(analyzer_job_output_s3_url)

In [None]:
# See the generated analyzer output
!aws s3 ls {analyzer_job_output_s3_url}/

In [None]:
# Load JSON files as Pandas DataFrame and explore generated statistics, constraints, and violations
# statistics = # load file
# constraints = # load file
# violations = pd.read_json(f"{analyzer_job_output_s3_url}/constraint_violations.json")

### Generate non-compliant traffic
Now generate some non-compliant traffic to your real-time inference endpoint and run the Model Monitor analyzer again.
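
One way to prepare non-compliant data is to corrupt a single feature in a share of the rows. The sketch below, written against plain Python lists, replaces values in one column with large negative numbers. This is intended to conflict with a non-negativity constraint, if the baseline suggested one for that feature, and to shift the feature's distribution away from the baseline. The function name, the corruption rule, and the sample rows are all illustrative assumptions.

```python
import random


def make_non_compliant(rows, column, fraction=0.5, seed=0):
    """Return a copy of `rows` with `column` corrupted in a share of rows.

    Corrupted values become large negative numbers. The input is not modified.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    corrupted = [list(row) for row in rows]
    if not corrupted:
        return corrupted
    count = min(len(corrupted), max(1, int(len(corrupted) * fraction)))
    rng = random.Random(seed)  # seeded for reproducible selection
    for index in rng.sample(range(len(corrupted)), count):
        corrupted[index][column] = -abs(corrupted[index][column]) * 1000 - 1
    return corrupted


# Illustrative feature vectors; in the notebook you could start from
# test_x.values.tolist() instead.
sample_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
bad_rows = make_non_compliant(sample_rows, column=1, fraction=0.5)
for original, changed in zip(sample_rows, bad_rows):
    print(original, "->", changed)
```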

In [None]:
# Remove previous data capture files

In [None]:
# Create or inject non-compliant data into requests
# Prepare a non-compliant dataset

In [None]:
# Generate traffic using non-compliant dataset

In [None]:
# Launch a Model Monitor analyzer processing job

In [None]:
# explore the analyzer report
# use the same code as in the previous section

### Work with scheduled executions and monitoring reports
Model Monitor scheduled executions offer a more abstract way of working with analyzer runs and monitoring reports. Instead of using the SageMaker processing job API, you can use the Python SDK [`ModelMonitor`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.ModelMonitor)-derived classes to access all scheduled executions, execution details, generated statistics, constraints, and violations for each execution.

The scheduled executions automatically process only the newest captured data since the last Model Monitor analyzer run. You can also [visualize data quality reports](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_model_monitor/visualization/SageMaker-Model-Monitor-Visualize.html) in SageMaker Studio.

Refer to [SageMaker Model Monitor development guide](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-interpreting-violations.html) for result interpretation.

#### List executions of a scheduled Model Monitoring job
Use [`list_executions()`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.ModelMonitor.list_executions) of the `ModelMonitor` Python SDK class.

In [None]:
# List all executions
# Get the latest execution details
# Get the execution output S3 url

#### Get the latest execution statistics and constraints
You can access the latest output with this code:
```
my_executions = my_monitor.list_executions()
latest_execution_statistics = my_executions[-1].statistics()
latest_execution_violations = my_executions[-1].constraint_violations()
```

In [None]:
# Write the code to print the latest statistics and constraint violations
# Hint: use Pandas DataFrame to visualize the reports

#### See the baseline and the latest data profiling statistics
Use [`latest_monitoring_statistics()`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.ModelMonitor.latest_monitoring_statistics) and [`baseline_statistics()`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.ModelMonitor.baseline_statistics) methods to load monitoring output.

In [None]:
# Write code here to see the latest monitoring statistics

#### See a violation report
Use [`latest_monitoring_constraint_violations()`](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.ModelMonitor.latest_monitoring_constraint_violations) to return the latest constraint violation report. 
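
Once you have the report as a dictionary, a short summary by check type and by feature can make it easier to read. The sketch below works on a hand-written sample with the `violations` list layout used by `constraint_violations.json`; the feature names and descriptions are made up, and `summarize_violations` is a hypothetical helper.

```python
from collections import Counter


def summarize_violations(report):
    """Count violations per check type and list the checks hit per feature."""
    violations = report.get("violations", [])
    by_type = Counter(v["constraint_check_type"] for v in violations)
    by_feature = {}
    for violation in violations:
        by_feature.setdefault(violation["feature_name"], []).append(
            violation["constraint_check_type"]
        )
    return by_type, by_feature


# Hand-written sample report; contents are illustrative only
sample_report = {
    "violations": [
        {
            "feature_name": "age",
            "constraint_check_type": "data_type_check",
            "description": "example description",
        },
        {
            "feature_name": "age",
            "constraint_check_type": "baseline_drift_check",
            "description": "example description",
        },
        {
            "feature_name": "duration",
            "constraint_check_type": "baseline_drift_check",
            "description": "example description",
        },
    ]
}

by_type, by_feature = summarize_violations(sample_report)
print(dict(by_type))
print(by_feature)
```

In the notebook, the dictionary could come from the `body_dict` attribute of the object returned by `latest_monitoring_constraint_violations()`.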

In [None]:
# Load the latest constraint violations report into a Pandas DataFrame

In [None]:
# Explore data monitoring results

---

## Exercise 5: Monitor model quality
Implementing model quality monitoring follows the same steps as data quality monitoring, with the addition of ground truth data ingestion.

See the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality.html) documentation on model quality monitoring and **Part 2: Monitor model quality** of the [step 6](../06-monitoring.ipynb) notebook.

### Create a model quality monitor

In [None]:
# model_monitor = ModelQualityMonitor(...)

### Run a model quality baseline job

In [None]:
# model_baseline_job = model_monitor.suggest_baseline(...)

### Inspect the generated baseline reports

In [None]:
# latest_model_baseline_job = model_monitor.latest_baselining_job

### Generate endpoint traffic

### Ingest ground truth data
Remember to correlate the ground truth labels with the inference input via the `EventId` identifier.
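
As a sketch of what ground truth ingestion could produce, the block below builds JSON Lines records in the layout described in the Developer Guide (`groundTruthData`, `eventMetadata.eventId`, `eventVersion`) and an object key partitioned by hour. The helper names, the `ground_truth.jsonl` file name, the prefix, and the sample event IDs are illustrative assumptions; the event IDs must match the ones in your captured data.

```python
import json
from datetime import datetime, timezone


def ground_truth_record(event_id, label):
    """Build one ground truth record linked to a captured inference event."""
    return {
        "groundTruthData": {"data": str(label), "encoding": "CSV"},
        "eventMetadata": {"eventId": event_id},
        "eventVersion": "0",
    }


def ground_truth_key(prefix, when, file_name="ground_truth.jsonl"):
    """Build an object key partitioned as <prefix>/yyyy/mm/dd/hh/<file>."""
    return f"{prefix.rstrip('/')}/{when.strftime('%Y/%m/%d/%H')}/{file_name}"


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record) for record in records)


# Illustrative event IDs and labels
labels = {"event-0001": 1, "event-0002": 0}
records = [ground_truth_record(eid, label) for eid, label in labels.items()]

upload_time = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
print(ground_truth_key("ground-truth/example-endpoint", upload_time))
print(to_jsonl(records))
```

The resulting text could then be uploaded to your ground truth location, for example with `S3Uploader.upload_string_as_file_body`.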

### Create a model monitoring schedule

In [None]:
# endpoint_input = EndpointInput(...)
# model_monitor.create_monitoring_schedule(...)

### Inspect model monitor executions and reports

In [None]:
# model_mon_executions = model_monitor.list_executions()

## Continue with the clean-up
After you have finished the assignments and experiments, you must clean up all created resources.

Navigate to the [clean-up](../99-clean-up.ipynb) notebook.