# Amazon SageMaker Model Monitoring and Clarify Explainability

Amazon SageMaker includes several integrated capabilities for monitoring deployed models for data quality, model quality, bias, and explainability.

In this lab, you will learn how to:
  * Capture inference requests, results, and metadata from the pipeline-deployed model.
  * Schedule a default (data quality) monitor to check for data drift on a regular schedule.
  * Schedule a model quality monitor to check model performance on a regular schedule.
  * Schedule a Clarify bias monitor to check predictions for bias drift on a regular schedule.
  * Schedule a Clarify explainability monitor to check predictions for feature attribution drift on a regular schedule.

## Setup

In [None]:
# !pip install -U sagemaker==2.101.1

In [None]:
from datetime import datetime, timedelta
import pandas as pd
import time
import csv
import json
import boto3
import sagemaker

region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

sagemaker_client = sagemaker_session.sagemaker_client
sagemaker_runtime_client = sagemaker_session.sagemaker_runtime_client

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

from sagemaker.clarify import (
    BiasConfig,
    DataConfig,
    ModelConfig,
    ModelPredictedLabelConfig,
    SHAPConfig,
)

from sagemaker.model_monitor import (
    BiasAnalysisConfig,
    CronExpressionGenerator,
    DataCaptureConfig,
    EndpointInput,
    ExplainabilityAnalysisConfig,
    ModelBiasMonitor,
    ModelExplainabilityMonitor,
    DefaultModelMonitor,
    ModelQualityMonitor,
)

from sagemaker.model_monitor.dataset_format import DatasetFormat

from sagemaker.s3 import S3Downloader, S3Uploader

In [None]:
print(f"AWS region: {region}")
# A different bucket can be used, but make sure the role for this notebook has
# the s3:PutObject permissions. This is the bucket into which the data is captured.
print(f"S3 Bucket: {default_bucket}")

# Endpoint metadata.
endpoint_name = "workshop-project-prod"
endpoint_instance_count = 1
endpoint_instance_type = "ml.m5.large"
print(f"Endpoint: {endpoint_name}")

prefix = "sagemaker/DEMO-xgboost-dm-model-monitoring"
s3_key = f"s3://{default_bucket}/{prefix}"
print(f"S3 key: {s3_key}")

s3_capture_upload_path = f"{s3_key}/data_capture"
s3_ground_truth_upload_path = f"{s3_key}/ground_truth_data/{datetime.now():%Y-%m-%d-%H-%M-%S}"
s3_baseline_results_path = f"{s3_key}/baselines"
s3_report_path = f"{s3_key}/reports"

print(f"Capture path: {s3_capture_upload_path}")
print(f"Ground truth path: {s3_ground_truth_upload_path}")
print(f"Baselines path: {s3_baseline_results_path}")
print(f"Report path: {s3_report_path}")

## Configure data capture and generate synthetic traffic

Data quality monitoring automatically monitors machine learning (ML) models in production and notifies you when data quality issues arise. ML models in production have to make predictions on real-life data that is not carefully curated like most training datasets. If the statistical nature of the data that your model receives while in production drifts away from the nature of the baseline data it was trained on, the model begins to lose accuracy in its predictions. Amazon SageMaker Model Monitor uses rules to detect data drift and alerts you when it happens.

In [None]:
# Create a Predictor Python object for real-time endpoint requests. https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html
predictor = Predictor(endpoint_name=endpoint_name, serializer=CSVSerializer())

In [None]:
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_upload_path,
)

In [None]:
# Update endpoint with s3_capture_upload_path.
predictor.update_data_capture_config(data_capture_config)

### Invoke the deployed model endpoint

Now send data to this endpoint to get inferences in real time. 

With data capture enabled in the previous step, the request and response payload, along with some additional metadata, is saved to the S3 location specified in `DataCaptureConfig`.

In [None]:
# Use the test set to create a file without headers, labels, or index to mirror data at inference time.
test_df = pd.read_csv("test.csv")
test_df.drop(["y_no", "y_yes"], axis=1).sample(180).to_csv(
    "test-samples-no-header.csv", header=False, index=False
)

In [None]:
print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")

# The file has no header row, so do not let pandas consume the first sample as one.
test_sample_df = pd.read_csv("test-samples-no-header.csv", header=None)

response = predictor.predict(data=test_sample_df.to_numpy())

print("Done!")

### View captured data

List the data capture files stored in Amazon S3. 

There should be different files from different time periods organized in S3 based on the hour in which the invocation occurred in the format: 

`s3://{destination-bucket-prefix}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh/filename.jsonl`
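
As a minimal sketch of that layout, the helper below builds the prefix under which captures for a given hour would be expected. The function name, the bucket, and the `AllTraffic` variant name are illustrative assumptions, and the sketch assumes the hour in the path is in UTC.

```python
from datetime import datetime, timezone


def capture_prefix_for_hour(base_uri: str, endpoint: str, variant: str, when: datetime) -> str:
    """Build the S3 prefix that should hold the capture files for one hour.

    Hypothetical helper: it only mirrors the documented path layout
    {base}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh.
    """
    return f"{base_uri}/{endpoint}/{variant}/{when:%Y/%m/%d/%H}"


example_prefix = capture_prefix_for_hour(
    "s3://my-bucket/data_capture",  # hypothetical bucket and prefix
    "workshop-project-prod",
    "AllTraffic",  # assumed variant name
    datetime(2022, 8, 1, 14, 30, tzinfo=timezone.utc),
)
print(example_prefix)
```

Listing a single hour's prefix this way can be quicker than listing every capture file for the endpoint.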

In [None]:
print("Waiting up to 30 seconds for captures to show up", end="")

for _ in range(30):
    capture_files = sorted(S3Downloader.list(f"{s3_capture_upload_path}/{endpoint_name}"))
    if capture_files:
        break
    print(".", end="", flush=True)
    time.sleep(1)

print("\nFound Capture Files:")
print("\n ".join(capture_files[-10:]))

Next, view the content of a single capture file by looking at the last few lines of the most recent file.

In [None]:
capture_file = S3Downloader.read_file(capture_files[-1]).split("\n")[-10:-1]
print(capture_file[-1])

A single captured line is shown below as formatted JSON.

In [None]:
print(json.dumps(json.loads(capture_file[-1]), indent=2))
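
To work with captures programmatically, it can help to pull out only the fields of interest. The sketch below parses one capture line into a small summary. The sample record is hand-written for illustration (the values are not real output), `summarize_capture` is a hypothetical helper, and the code assumes one CSV row per request with a single numeric output.

```python
import json

# Illustrative record only; real payloads and IDs will differ.
sample_line = json.dumps(
    {
        "captureData": {
            "endpointInput": {
                "observedContentType": "text/csv",
                "mode": "INPUT",
                "data": "34,1,0,1",
                "encoding": "CSV",
            },
            "endpointOutput": {
                "observedContentType": "text/csv; charset=utf-8",
                "mode": "OUTPUT",
                "data": "0.0271",
                "encoding": "CSV",
            },
        },
        "eventMetadata": {
            "eventId": "abc-123",
            "inferenceId": "7",
            "inferenceTime": "2022-08-01T14:30:05Z",
        },
        "eventVersion": "0",
    }
)


def summarize_capture(line: str) -> dict:
    """Extract the request features, the prediction, and an ID from one capture line."""
    record = json.loads(line)
    capture = record["captureData"]
    meta = record["eventMetadata"]
    return {
        # Prefer the caller-supplied inference ID; fall back to the event ID.
        "inference_id": meta.get("inferenceId", meta["eventId"]),
        "features": capture["endpointInput"]["data"].split(","),
        "prediction": float(capture["endpointOutput"]["data"]),
    }


summary = summarize_capture(sample_line)
print(summary)
```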

### Generate synthetic traffic

Start a thread to generate synthetic traffic to send continuously to the deployed model endpoint. 

The `WorkerThread` class will run continuously on the notebook kernel to generate predictions that are captured and sent to S3 until the kernel is restarted or the thread is explicitly terminated. 

See the cell in the `Cleanup` section to terminate the threads.

If there is no traffic, the monitoring jobs are marked as `Failed` since there is no data to process.

In [None]:
import threading

class WorkerThread(threading.Thread):
    def __init__(self, do_run, *args, **kwargs):
        super(WorkerThread, self).__init__(*args, **kwargs)
        self.__do_run = do_run
        self.__terminate_event = threading.Event()

    def terminate(self):
        self.__terminate_event.set()

    def run(self):
        while not self.__terminate_event.is_set():
            self.__do_run(self.__terminate_event)
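
The cooperative shutdown pattern used by `WorkerThread` can be tried without an endpoint. The sketch below is a stand-in, not the notebook's class: `StoppableWorker` and `fake_traffic` are hypothetical names, and the fake workload only records timestamps.

```python
import threading
import time


class StoppableWorker(threading.Thread):
    """Hypothetical stand-in for WorkerThread with the same run/terminate contract."""

    def __init__(self, do_run):
        super().__init__(daemon=True)
        self._do_run = do_run
        self._terminate_event = threading.Event()

    def terminate(self):
        self._terminate_event.set()

    def run(self):
        # Keep calling the workload until someone asks the thread to stop.
        while not self._terminate_event.is_set():
            self._do_run(self._terminate_event)


calls = []


def fake_traffic(terminate_event):
    calls.append(time.monotonic())
    # wait() returns as soon as the event is set, so shutdown is prompt.
    terminate_event.wait(timeout=0.05)


worker = StoppableWorker(do_run=fake_traffic)
worker.start()
time.sleep(0.2)
worker.terminate()
worker.join(timeout=2)
print(f"alive={worker.is_alive()}, iterations={len(calls)}")
```

Waiting on the event instead of calling `time.sleep` is one possible design choice: it lets a long pause end as soon as `terminate()` is called.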

In [None]:
def invoke_endpoint(terminate_event):
    with open("test-samples-no-header.csv", "r") as f:
        i = 0
        for row in f:
            payload = row.rstrip("\n")
            response = sagemaker_runtime_client.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="text/csv",
                Body=payload,
                InferenceId=str(i),  # unique ID per row
            )
            i += 1
            response["Body"].read()
            time.sleep(1)
            if terminate_event.is_set():
                break


# Keep invoking the endpoint with test data
invoke_endpoint_thread = WorkerThread(do_run=invoke_endpoint)
invoke_endpoint_thread.start()

### Generate synthetic ground truth data

Besides data capture, model quality and model bias monitoring executions also require ground truth data.

In real use cases, ground truth data should be regularly collected and uploaded to a designated S3 location.

In this example notebook, the code snippet below generates fake ground truth data. The first-party merge container combines captures with ground truth data, and the merged data is passed to the monitoring job for analysis. As with captures, the monitoring execution fails if there is no data to merge.

In [None]:
import random

def ground_truth_with_id(inference_id):
    # set random seed to get consistent results.
    random.seed(inference_id) 
    rand = random.random()
    # format required by the merge container.
    return {
        "groundTruthData": {
            # randomly generate positive labels 70% of the time.
            "data": "1" if rand < 0.7 else "0",
            "encoding": "CSV",
        },
        "eventMetadata": {
            "eventId": str(inference_id),
        },
        "eventVersion": "0",
    }


def upload_ground_truth(upload_time):
    # 180 is the number of rows in the data we're sending for inference.
    records = [ground_truth_with_id(i) for i in range(180)]
    fake_records = [json.dumps(r) for r in records]
    data_to_upload = "\n".join(fake_records)
    target_s3_uri = f"{s3_ground_truth_upload_path}/{upload_time:%Y/%m/%d/%H/%M%S}.jsonl"
    print(f"Uploading {len(fake_records)} records to", target_s3_uri)
    S3Uploader.upload_string_as_file_body(data_to_upload, target_s3_uri)
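
To make the merge step concrete, the sketch below joins capture records with ground truth records on their IDs. It illustrates the idea only and is not the merge container's implementation: `merge_by_inference_id` is a hypothetical helper, the records are hand-written, and the sketch assumes the capture's `inferenceId` (falling back to `eventId`) is matched against the ground truth `eventId`.

```python
def merge_by_inference_id(captures, ground_truth):
    """Pair each captured prediction with its ground truth label, if one exists."""
    labels = {g["eventMetadata"]["eventId"]: g["groundTruthData"]["data"] for g in ground_truth}
    merged = []
    for capture in captures:
        meta = capture["eventMetadata"]
        key = meta.get("inferenceId", meta["eventId"])
        if key in labels:
            merged.append(
                {
                    "id": key,
                    "prediction": capture["captureData"]["endpointOutput"]["data"],
                    "label": labels[key],
                }
            )
    return merged


# Hand-written sample records (illustrative values only).
captures = [
    {
        "captureData": {"endpointOutput": {"data": score}},
        "eventMetadata": {"eventId": f"evt-{i}", "inferenceId": i},
    }
    for i, score in [("0", "0.91"), ("1", "0.12"), ("5", "0.66")]
]
ground_truth = [
    {
        "groundTruthData": {"data": label, "encoding": "CSV"},
        "eventMetadata": {"eventId": i},
        "eventVersion": "0",
    }
    for i, label in [("0", "1"), ("1", "0"), ("2", "1")]
]

merged = merge_by_inference_id(captures, ground_truth)
print(merged)
```

Captures without a matching label (ID `5` here) and labels without a capture (ID `2`) are dropped, which is why an hour with no overlap leaves nothing to analyze.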

In [None]:
# Generate data for the last hour.
upload_ground_truth(datetime.utcnow() - timedelta(hours=1))

In [None]:
# Generate data once an hour.
def generate_fake_ground_truth(terminate_event):
    upload_ground_truth(datetime.utcnow())
    for _ in range(0, 60):
        time.sleep(60)
        if terminate_event.is_set():
            break


ground_truth_thread = WorkerThread(do_run=generate_fake_ground_truth)
ground_truth_thread.start()

## Monitor data quality

Configure `DefaultModelMonitor` to monitor for data drift.

In [None]:
data_quality_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

Start a data quality baseline processing job with `DefaultModelMonitor.suggest_baseline(..)` using the Amazon SageMaker Python SDK. This uses an Amazon SageMaker Model Monitor prebuilt container that generates baseline statistics and suggests baseline constraints for the dataset and writes them to the `output_s3_uri` location that you specify.

The baseline calculations of statistics and constraints are needed as a standard against which data drift and other data quality issues can be detected. 

SageMaker Model Monitor provides a built-in container that provides the ability to suggest the constraints automatically for CSV and flat JSON input. This sagemaker-model-monitor-analyzer container also provides you with a range of model monitoring capabilities, including constraint validation against a baseline, and emitting Amazon CloudWatch metrics. This container is based on Spark and is built with [Deequ "unit tests for data"](https://github.com/awslabs/deequ). All column names in your baseline dataset must be compliant with Spark. For column names, use only lowercase characters, and `_` as the only special character.
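
One way to satisfy the column-naming rule is to normalize headers before writing the baseline file. The sketch below is a hypothetical helper, not part of the SageMaker SDK. It lowercases each name and replaces runs of other characters with `_`. Keeping digits is an assumption, and the sketch does not guard against two names collapsing into the same result.

```python
import re


def to_spark_friendly(name: str) -> str:
    """Lowercase a column name and replace unsupported characters with underscores."""
    cleaned = re.sub(r"[^0-9a-z_]+", "_", name.strip().lower())
    return cleaned.strip("_")


# Illustrative column names only.
original_columns = ["Job.Admin", "emp.var.rate", " Cons Price-Idx ", "y_yes"]
clean_columns = [to_spark_friendly(c) for c in original_columns]
print(clean_columns)
```

With pandas, the same helper could be applied with `df.rename(columns=to_spark_friendly)` before saving the CSV.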

The training dataset that you used to train the model is usually a good baseline dataset. The training dataset schema and the inference dataset schema should match exactly (the number and order of the features). Note that the prediction/output columns are assumed to be the first columns in the training dataset. From the training dataset, you can ask SageMaker to suggest a set of baseline constraints and generate descriptive statistics to explore the data.

Note: the data quality baseline job can take about 5 minutes.

In [None]:
data_quality_baseline_job_name = f"DataQualityBaselineJob-{datetime.utcnow():%Y-%m-%d-%H%M}"

data_quality_baseline_job = data_quality_monitor.suggest_baseline(
    job_name=data_quality_baseline_job_name,
    baseline_dataset="train-headers.csv",
    dataset_format=DatasetFormat.csv(header=True),
)

data_quality_baseline_job.wait(logs=False)

The Amazon SageMaker Model Monitor prebuilt container computes statistics for each column (feature). The statistics are calculated for the baseline dataset and also for the current dataset that is being analyzed.

In [None]:
latest_data_quality_baseline_job = data_quality_monitor.latest_baselining_job
schema_df = pd.json_normalize(latest_data_quality_baseline_job.baseline_statistics().body_dict["features"])
schema_df.head(10)

In [None]:
constraints_df = pd.json_normalize(latest_data_quality_baseline_job.suggested_constraints().body_dict["features"])
constraints_df.head(10)

### Create a monitoring schedule

You can create a data monitoring schedule for the endpoint created earlier. 

Use the baseline resources (constraints and statistics) to compare against the real-time traffic hourly.

In [None]:
# Create a data quality monitoring schedule name.
data_quality_monitor_schedule_name = (
    f"xgboost-dm-data-monitoring-schedule-{datetime.utcnow():%Y-%m-%d-%H%M}"
)

In [None]:
# Create an EndpointInput that points the monitor at the endpoint's captured data.
endpointInput = EndpointInput(
    endpoint_name=predictor.endpoint_name,
    # probability_attribute="0",
    # probability_threshold_attribute=0.5,
    destination="/opt/ml/processing/input_data",
)

In [None]:
# Specify where to write the data quality monitoring reports.
data_quality_baseline_job_result_uri = f"{s3_baseline_results_path}/data_quality"

response = data_quality_monitor.create_monitoring_schedule(
    monitor_schedule_name=data_quality_monitor_schedule_name,
    endpoint_input=endpointInput,
    output_s3_uri=data_quality_baseline_job_result_uri,
    statistics=latest_data_quality_baseline_job.baseline_statistics(),
    constraints=latest_data_quality_baseline_job.suggested_constraints(),
    # Run the monitoring schedule every hour.
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)

In [None]:
# Describe the monitoring schedule.
# You will see the monitoring schedule in the 'Scheduled' status.
data_quality_monitor.describe_schedule()

In [None]:
# Check default model monitor created.
predictor.list_monitors()

In [None]:
# Initially there will be no executions, since the first execution happens at the top of the hour.
# Note that it is common for the execution to launch up to 20 min after the hour.
executions = data_quality_monitor.list_executions()
executions[:5]

## Monitor model quality

Model quality monitoring jobs monitor the performance of a model by comparing the predictions that the model makes with the actual ground truth labels that the model attempts to predict. To do this, model quality monitoring merges data that is captured from real-time inference with actual labels stored in S3, and then compares the predictions with the actual labels.
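
As a rough illustration of the comparison step, the sketch below computes a few binary classification metrics from predictions that have already been merged with labels. The function name and the sample values are illustrative assumptions, and the monitoring job itself reports a larger set of metrics than shown here.

```python
def binary_classification_metrics(predictions, labels):
    """Compute accuracy, precision, and recall for 0/1 predictions and labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative probabilities and ground truth labels.
probabilities = [0.9, 0.2, 0.7, 0.4, 0.6]
labels = [1, 0, 0, 1, 1]
threshold = 0.5

predictions = [1 if p > threshold else 0 for p in probabilities]
metrics = binary_classification_metrics(predictions, labels)
print(metrics)
```

A drop in metrics like these relative to the baseline is what a model quality constraint violation signals.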

### Define `ModelQualityMonitor`

In [None]:
model_quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=1800,
    sagemaker_session=sagemaker_session
)

### Run model quality baseline job

In [None]:
prediction_threshold = 0.5
model_quality_baseline_job_result_uri = f"{s3_baseline_results_path}/model-quality"
validate_dataset = "validation_with_predictions.csv"

In [None]:
limit = 200  # Need at least 200 samples to compute standard deviations
i = 0
with open(f"{validate_dataset}", "w") as baseline_file:
    baseline_file.write("probability,prediction,label\n")  # our header
    with open("validation.csv", "r") as f:
        for row in f:
            (label, input_cols) = row.split(",", 1)
            probability = float(predictor.predict(input_cols))
            prediction = "1" if probability > prediction_threshold else "0"
            baseline_file.write(f"{probability},{prediction},{label}\n")
            i += 1
            if i > limit:
                break
            print(".", end="", flush=True)
            time.sleep(0.5)
print()
print("Done!")

Call the `suggest_baseline` method of the `ModelQualityMonitor` object to run a baseline job.

Note: this step can take about 8-10 min.

In [None]:
model_quality_baseline_job_name = f"ModelQualityBaselineJob-{datetime.utcnow():%Y-%m-%d-%H%M}"

model_quality_baseline_job = model_quality_monitor.suggest_baseline(
    job_name=model_quality_baseline_job_name,
    baseline_dataset=validate_dataset, # The local CSV of validation data with predictions.
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri = model_quality_baseline_job_result_uri, # The S3 location to store the results.
    problem_type="BinaryClassification",
    inference_attribute= "prediction", # The column in the dataset that contains predictions.
    probability_attribute= "probability", # The column in the dataset that contains probabilities.
    ground_truth_attribute= "label" # The column in the dataset that contains ground truth labels.
)

model_quality_baseline_job.wait(logs=False)

View the suggested model quality baseline constraints.

In [None]:
latest_model_quality_baseline_job = model_quality_monitor.latest_baselining_job
pd.DataFrame(latest_model_quality_baseline_job.suggested_constraints().body_dict["binary_classification_constraints"]).T

### Schedule continuous model quality monitoring

You can create a model monitoring schedule for the endpoint created earlier.

Use the baseline resources (constraints and statistics) to compare against the real-time traffic hourly.

In [None]:
model_quality_monitor_schedule_name = (
    f"xgboost-dm-model-monitoring-schedule-{datetime.utcnow():%Y-%m-%d-%H%M}"
)

In [None]:
# Create an EndpointInput that tells the monitor where to find the predicted probability.
endpointInput = EndpointInput(
    endpoint_name=predictor.endpoint_name,
    probability_attribute="0",
    probability_threshold_attribute=0.5,
    destination="/opt/ml/processing/input_data",
)

In [None]:
response = model_quality_monitor.create_monitoring_schedule(
    monitor_schedule_name=model_quality_monitor_schedule_name,
    endpoint_input=endpointInput,
    output_s3_uri=model_quality_baseline_job_result_uri,
    problem_type="BinaryClassification",
    ground_truth_input=s3_ground_truth_upload_path,
    constraints=latest_model_quality_baseline_job.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    # enable_cloudwatch_metrics=True,
)

In [None]:
# Check that the model quality monitor is now attached to the endpoint.
predictor.list_monitors()

In [None]:
# You will see the monitoring schedule in the 'Scheduled' status.
model_quality_monitor.describe_schedule()

In [None]:
# Initially there will be no executions, since the first execution happens at the top of the hour.
# Note that it is common for the execution to launch up to 20 min after the hour.
executions = model_quality_monitor.list_executions()
executions[:5]

## Monitor model bias

The model bias monitor detects bias drift in machine learning models on a regular basis. As with the other monitoring types, the standard procedure is to run a baselining job first and then create a monitoring schedule.

In [None]:
model_bias_monitor = ModelBiasMonitor(
    role=role,
    sagemaker_session=sagemaker_session,
    max_runtime_in_seconds=1800,
)

### Create a model bias baseline job

#### Configure `DataConfig`

In [None]:
model_bias_baseline_job_result_uri = f"{s3_baseline_results_path}/model_bias"

# Load the training data so its column names can be passed as headers.
train_df = pd.read_csv("train-headers.csv")

model_bias_data_config = DataConfig(
    s3_data_input_path="train-headers.csv",
    s3_output_path=model_bias_baseline_job_result_uri,
    label="y_yes",
    headers=train_df.columns.to_list(),
    dataset_type="text/csv",
)

#### Configure `BiasConfig`

In [None]:
model_bias_config = BiasConfig(
    label_values_or_threshold=[1],
    facet_name="age",
    facet_values_or_threshold=[100],
)
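
To give a feel for what a bias metric measures, the sketch below computes the difference in positive proportions in predicted labels (DPPL) between two age groups. It is a simplified illustration, not Clarify's implementation: the function name, the sample data, and the threshold of 50 are assumptions, and the sketch assumes rows with a facet value above the threshold form the group of interest.

```python
def dppl(facet_values, predicted_labels, threshold):
    """Difference in positive prediction rates: (rate at or below threshold) - (rate above it)."""
    above = [y for v, y in zip(facet_values, predicted_labels) if v > threshold]
    at_or_below = [y for v, y in zip(facet_values, predicted_labels) if v <= threshold]
    if not above or not at_or_below:
        raise ValueError("Both groups need at least one row to compare proportions.")
    rate_above = sum(above) / len(above)
    rate_at_or_below = sum(at_or_below) / len(at_or_below)
    return rate_at_or_below - rate_above


# Illustrative ages and predicted labels (1 = positive outcome).
ages = [25, 40, 60, 70, 30, 65]
predicted = [1, 1, 0, 1, 0, 0]

dppl_value = dppl(ages, predicted, threshold=50)
print(f"DPPL: {dppl_value:.3f}")
```

A value near zero means both groups receive positive predictions at similar rates. Bias drift shows up as this value moving away from its baseline over time.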

#### Configure `ModelPredictedLabelConfig`

In [None]:
model_predicted_label_config = ModelPredictedLabelConfig(
    probability_threshold=0.5,
)

#### Configure `ModelConfig`

In [None]:
model_config = ModelConfig(
    # Replace with the name of the model deployed behind your endpoint.
    model_name="Model-gqHK8mt3zYu7",
    instance_count=endpoint_instance_count,
    instance_type=endpoint_instance_type,
    content_type="text/csv",
    accept_type="text/csv",
)

### Run model bias baseline job

In [None]:
model_bias_baseline_job = model_bias_monitor.suggest_baseline(
    model_config=model_config,
    data_config=model_bias_data_config,
    bias_config=model_bias_config,
    model_predicted_label_config=model_predicted_label_config,
)

model_bias_baseline_job.wait(logs=False)

print(f"ModelBiasMonitor baselining job: {model_bias_monitor.latest_baselining_job_name}")

In [None]:
model_bias_constraints = model_bias_monitor.suggested_constraints()

print(f"ModelBiasMonitor suggested constraints: {model_bias_constraints.file_s3_uri}")

print(S3Downloader.read_file(model_bias_constraints.file_s3_uri))

### Schedule continuous model bias monitoring

In [None]:
model_bias_analysis_config = None
if not model_bias_monitor.latest_baselining_job:
    # No baselining job in this session, so describe the analysis explicitly.
    model_bias_analysis_config = BiasAnalysisConfig(
        model_bias_config,
        headers=pd.read_csv("train-headers.csv", nrows=0).columns.to_list(),
        label="y_yes",
    )

model_bias_monitor.create_monitoring_schedule(
    analysis_config=model_bias_analysis_config,
    output_s3_uri=s3_report_path,
    endpoint_input=EndpointInput(
        endpoint_name=endpoint_name,
        destination="/opt/ml/processing/input/endpoint",
        start_time_offset="-PT1H",
        end_time_offset="-PT0H",
        probability_threshold_attribute=0.5,
    ),
    ground_truth_input=s3_ground_truth_upload_path,
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
print(f"Model bias monitoring schedule: {model_bias_monitor.monitoring_schedule_name}")

## Monitor feature attribution drift

### Define `ModelExplainabilityMonitor`

In [None]:
model_explainability_monitor = ModelExplainabilityMonitor(
    role=role,
    sagemaker_session=sagemaker_session,
    max_runtime_in_seconds=1800,
)

### Run explainability baseline job

#### Define data config

In [None]:
model_explainability_baseline_job_result_uri = f"{s3_baseline_results_path}/model-explainability"

# Load the training data so its column names can be passed as headers.
train_df = pd.read_csv("train-headers.csv")

model_explainability_data_config = DataConfig(
    s3_data_input_path="train-headers.csv",
    s3_output_path=model_explainability_baseline_job_result_uri,
    label="y_yes",
    headers=train_df.columns.to_list(),
    dataset_type="text/csv",
)

#### Define Clarify SHAP feature attributions config

In [None]:
# Here use the mean value of train dataset as SHAP baseline
shap_baseline = [list(train_df.mean())]

shap_config = SHAPConfig(
    baseline=shap_baseline,
    num_samples=100,
    agg_method="mean_abs",
    save_local_shap_values=False,
)
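
The `mean_abs` aggregation setting can be pictured as averaging the absolute per-record SHAP values for each feature. The sketch below shows that arithmetic on made-up numbers. `mean_abs_aggregate` is a hypothetical helper, and computing the SHAP values themselves is left to Clarify.

```python
def mean_abs_aggregate(local_shap_values):
    """Average the absolute SHAP value of each feature across all records."""
    num_records = len(local_shap_values)
    num_features = len(local_shap_values[0])
    return [
        sum(abs(row[j]) for row in local_shap_values) / num_records
        for j in range(num_features)
    ]


# Made-up local SHAP values: one row per record, one column per feature.
local_shap_values = [
    [0.2, -0.1, 0.0],
    [-0.4, 0.3, 0.1],
    [0.0, -0.2, 0.2],
]

global_importance = mean_abs_aggregate(local_shap_values)
print([round(v, 3) for v in global_importance])
```

Taking absolute values first keeps positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.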

In [None]:
model_config = ModelConfig(
    # Replace with the name of the model deployed behind your endpoint.
    model_name="Model-gqHK8mt3zYu7",
    instance_count=endpoint_instance_count,
    instance_type=endpoint_instance_type,
    content_type="text/csv",
    accept_type="text/csv",
)

#### Run baseline job

In [None]:
model_explainability_baseline_job_name = f"ModelExplainabilityBaselineJob-{datetime.utcnow():%Y-%m-%d-%H%M}"

model_explainability_baseline_job = model_explainability_monitor.suggest_baseline(
    job_name=model_explainability_baseline_job_name,
    data_config=model_explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
)

model_explainability_baseline_job.wait(logs=False)

In [None]:
model_explainability_constraints = model_explainability_monitor.suggested_constraints()

print(f"ModelExplainabilityMonitor suggested constraints: {model_explainability_constraints.file_s3_uri}")
print(S3Downloader.read_file(model_explainability_constraints.file_s3_uri))

### Schedule model explainability monitor

In [None]:
response = model_explainability_monitor.create_monitoring_schedule(
    output_s3_uri=f"{s3_report_path}/model-explainability",
    endpoint_input=endpoint_name,
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)

In [None]:
# Check that the explainability monitor is now attached to the endpoint.
predictor.list_monitors()

In [None]:
# Describe the monitoring schedule.
# You will see the monitoring schedule in the 'Scheduled' status.
model_explainability_monitor.describe_schedule()

In [None]:
# Initially there will be no executions, since the first execution happens at the top of the hour.
# Note that it is common for the execution to launch up to 20 min after the hour.
executions = model_explainability_monitor.list_executions()
executions

## Cleanup

First, stop the worker threads.

In [None]:
invoke_endpoint_thread.terminate()
ground_truth_thread.terminate()

Stop and delete all monitoring schedules attached to the endpoint.

In [None]:
model_monitors = predictor.list_monitors()

for monitor in model_monitors:
    monitor.stop_monitoring_schedule()
    monitor.delete_monitoring_schedule()

Finally, delete the endpoint and the model.

In [None]:
predictor.delete_endpoint()
predictor.delete_model()