### Introduction

Jupyter notebooks are divided into cells that can contain markdown or code that you can run interactively from the notebook interface. You can progress through the cells in the notebook by clicking the play button in the notebook tab's toolbar:

![](assets/2024-09-09-09-50-34.png)

Whenever you have completed a cell, click the play button to advance to the next cell and continue the lab.

After you click the play button, the status on the left-hand side of the bottom status bar changes from **Idle** to **Busy**:

![](assets/2024-09-09-09-50-00.png)

Wait for the status to change back to **Idle** before proceeding to the next cell.

### Notebook overview

This notebook guides you through the process of configuring an Amazon SageMaker Model Monitoring schedule for a pre-trained model.

In this lab, an Amazon SageMaker endpoint with a pre-trained model has been deployed for you. The model has been trained on a synthetic dataset using the XGBoost algorithm.

You will use the Python 3 programming language to interact with the Amazon SageMaker SDK to configure a model monitoring schedule for the endpoint. You will also examine the data that the endpoint receives and the data that the endpoint returns.


### Ensuring dependencies are installed

To begin with, you will ensure that the correct dependency versions are installed. The following cell uses the Python package installer `pip` to install specific versions of the libraries used in this notebook.

Ensuring that dependencies are using specific versions means that you can re-run the notebook over time without encountering issues due to changes in the libraries.

Run the following cell to ensure that the dependencies are installed.

In [None]:
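import sys

# Upgrade pip for the same interpreter that runs this notebook kernel
!{sys.executable} -m pip install --upgrade pip
# Install pinned versions so the notebook behaves the same on every run
!{sys.executable} -m pip install sagemaker==2.232.1 scikit-learn==1.5.2 pandas==2.2.3
!{sys.executable} -m pip install -U boto3==1.35.26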

*Note*: You may see errors and warnings from the `pip` dependency resolver. These are expected and can be ignored.
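
As an optional sanity check after installation, a sketch along these lines could confirm that the pinned versions are the ones actually installed. The `check_pins` helper is hypothetical and not part of the lab; it relies only on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINS = {
    "sagemaker": "2.232.1",
    "scikit-learn": "1.5.2",
    "pandas": "2.2.3",
    "boto3": "1.35.26",
}


def check_pins(pins):
    """Return {package: (expected, installed)} for every pin that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# An empty dict means every pinned package is installed at the expected version.
print(check_pins(PINS) or "All pinned versions match.")
```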

### Setting up the notebook session

To use the Amazon SageMaker SDK, you need to set up the notebook session with the appropriate permissions. The following cell:

- Imports the `boto3` and `sagemaker` libraries
- Creates a SageMaker session
- Defines the name of the pre-created SageMaker endpoint
- Retrieves the IAM role associated with the notebook instance
- Retrieves the name of a bucket that was created for you during lab setup

Run the following cell to proceed.

In [None]:
import boto3
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
endpoint_name = "lab-sagemaker-endpoint"

bucket = next(
    (
        bucket["Name"]
        for bucket in boto3.client("s3").list_buckets()["Buckets"]
        if bucket["Name"].startswith("lab-sagemaker-")
    ),
    None,
)
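
Note that the `next(..., None)` idiom leaves `bucket` set to `None` when no bucket name matches, which would only surface later as a malformed S3 URI. A minimal sketch of a stricter lookup, using a hypothetical `find_bucket` helper and made-up bucket names, might look like this:

```python
def find_bucket(bucket_names, prefix="lab-sagemaker-"):
    """Return the first bucket name starting with `prefix`, or raise if none match."""
    match = next((name for name in bucket_names if name.startswith(prefix)), None)
    if match is None:
        raise LookupError(f"No bucket found with prefix {prefix!r}")
    return match


# Hypothetical bucket names, for illustration only.
names = ["logs-archive", "lab-sagemaker-abc123"]
print(find_bucket(names))
```

Failing early with a clear message is a design choice; the lab's original cell works as long as the bucket exists.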

### Examining the synthetic dataset

The model that the Amazon SageMaker endpoint is using was trained on a synthetic dataset. The following code cell generates a sample of the synthetic data used and saves it to a CSV file.

The synthetic data is structured for binary classification. In this case, it creates 20 samples, each with 10 features, where 8 are informative and 2 are redundant. There are 2 target classes, and `random_state=42` ensures that the generated data is reproducible.

Binary classification has a wide variety of applications, including spam detection, fraud detection, and medical diagnosis.

The sample dataset is converted to a pandas DataFrame before being saved to a CSV file.

Run the following cell to generate the synthetic data and save it to a CSV file.

In [2]:
from sklearn.datasets import make_classification
import pandas as pd

X_normal, y_normal = make_classification(
    n_samples=20,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=2,
    random_state=42,
)

df_normal = pd.DataFrame(X_normal, columns=[f"feature_{i}" for i in range(1, 11)])
df_normal["target"] = y_normal

df_normal.to_csv("synthetic_normal_data.csv", index=False)

Run the following cell to display the sample data.

In [None]:
df_normal

### Sending data to the endpoint

To see how the model responds to normal data, you can send the synthetic data you have generated to the endpoint.

The following cell removes the target column from the synthetic data, creates a client for the `sagemaker-runtime` service, and sends the data to the endpoint.

Run the following cell to proceed.

In [None]:
inference_data = df_normal.drop("target", axis=1).values

runtime = boto3.client("sagemaker-runtime")

for row in inference_data:
    payload = ",".join(map(str, row))
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=payload
    )
    result = response["Body"].read()
    print(float(result))

In response, you will see the model's predictions for the synthetic data.
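
The request and response handling in the loop above can be sketched in isolation, without calling the endpoint. The helper names below are hypothetical, and the sketch assumes the response body holds a single score between 0 and 1; the 0.5 threshold is an illustrative choice that the lab does not specify.

```python
def to_csv_payload(row):
    """Serialize one row of feature values as a comma-separated string."""
    return ",".join(map(str, row))


def parse_prediction(body, threshold=0.5):
    """Parse a response body holding a single number and derive a class label.

    Assumes the body is a score between 0 and 1 (an assumption about the model's
    output); the threshold is illustrative only.
    """
    score = float(body)
    return score, int(score >= threshold)


payload = to_csv_payload([0.5, -1.25, 3.0])
print(payload)  # 0.5,-1.25,3.0
print(parse_prediction(b"0.87"))  # (0.87, 1)
```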

### Generating anomalous data

To see how the model responds to anomalous data, you can generate data with a different distribution from the normal sample you created earlier.

The following cell generates twenty samples of synthetic data with a different distribution from the original synthetic data. The number of features is the same (10), but the number of informative and redundant features is different. The data is then converted to a pandas DataFrame and saved to a CSV file.

This changed dataset is intended to represent drift in the data distribution that the model was trained on. This can occur in non-laboratory settings due to changes in the data source or changes in the data collection process. Drift can lead to a decrease in model performance.

Run the following cell to generate anomalous data.

In [4]:
from sklearn.datasets import make_classification
import pandas as pd

X_drifted, y_drifted = make_classification(
    n_features=10,
    n_samples=20,
    n_informative=4,
    n_redundant=6,
    n_classes=2,
    random_state=99,
)

# Build the DataFrame from the drifted arrays, not the normal ones
df_drifted = pd.DataFrame(X_drifted, columns=[f"feature_{i}" for i in range(1, 11)])
df_drifted["target"] = y_drifted

df_drifted.to_csv("synthetic_drifted_data.csv", index=False)

Run the following cell to view the anomalous data.

In [None]:
df_drifted
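
To build intuition for what drift means numerically, the sketch below compares the mean of one feature in two samples, expressed in baseline standard deviations. This is a deliberately simple illustration with a hypothetical `mean_shift` helper and made-up values; it is not the comparison that Amazon SageMaker Model Monitor performs.

```python
from statistics import mean, stdev


def mean_shift(baseline, current):
    """Return how far the current mean sits from the baseline mean,
    measured in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has zero variance")
    return abs(mean(current) - mean(baseline)) / spread


# Hypothetical values for a single feature, for illustration only.
baseline_values = [0.9, 1.1, 1.0, 0.8, 1.2]
current_values = [1.9, 2.1, 2.0, 1.8, 2.2]
print(round(mean_shift(baseline_values, current_values), 2))
```

In the notebook, the same function could be applied to a column from each DataFrame, for example `df_normal["feature_1"].tolist()` and `df_drifted["feature_1"].tolist()`.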

### Introducing data quality issues

As well as drift, data quality issues such as missing values or unexpected outliers can also affect model performance. The following cell introduces some data quality issues to the anomalous data.

The cell randomly selects 10% of the rows (two of the twenty) and sets their `feature_1` value to `np.nan`, representing missing values. It then draws a second, independent 10% sample of rows and multiplies their `feature_2` value by 10, representing unexpected outliers.

Run the following cell to introduce data quality issues to the anomalous data.

In [5]:
import numpy as np

df_drifted.loc[df_drifted.sample(frac=0.1).index, "feature_1"] = np.nan
df_drifted.loc[df_drifted.sample(frac=0.1).index, "feature_2"] *= 10
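
The two kinds of issue introduced above can be detected with a few lines of plain Python. The following is a minimal sketch with a hypothetical `find_quality_issues` helper; the bounds and sample values are made up for illustration.

```python
import math


def find_quality_issues(values, lower, upper):
    """Return (missing_indexes, out_of_range_indexes) for a list of floats."""
    missing = [i for i, v in enumerate(values) if math.isnan(v)]
    out_of_range = [
        i
        for i, v in enumerate(values)
        if not math.isnan(v) and not (lower <= v <= upper)
    ]
    return missing, out_of_range


# Hypothetical feature values and bounds, for illustration only.
sample = [0.4, float("nan"), -1.2, 37.5, 0.9]
print(find_quality_issues(sample, lower=-5.0, upper=5.0))  # ([1], [3])
```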

### Sending anomalous data to the endpoint

To see how the model responds to the anomalous data, you can send the anomalous data you have generated to the endpoint.
 
Run the following cell to proceed.

In [None]:
inference_data = df_drifted.drop("target", axis=1).values

for row in inference_data:
    payload = ",".join(map(str, row))
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=payload
    )
    result = response["Body"].read()
    print(float(result))

In response, you will see the model's predictions for the anomalous data.

### Preparing to configure model monitoring

To help you identify when a model is no longer performing as expected, Amazon SageMaker provides model monitoring. Model monitoring allows you to set up a schedule to monitor the data that the model receives and the data that the model returns.

The first step in configuring model monitoring is to enable data capture for the Amazon SageMaker endpoint you wish to monitor. Once enabled, data is captured as the endpoint receives requests and returns responses. The captured data is stored in Amazon S3.

The second step in configuring model monitoring is to create a baseline. The baseline is a dataset that represents the expected distribution of the data that the model receives and the data that the model returns. Generating the baseline data requires capturing data and using an Amazon SageMaker processing job.

Amazon SageMaker Model Monitor compares captured data against the baseline to detect when the model is no longer performing as expected.

In this lab, for the sake of time and convenience, data capture has been enabled on the endpoint for you, and baseline data has been provided.

### Observing data capture configuration on an endpoint

The following cell uses the `boto3` SageMaker client to retrieve the configuration of the endpoint. The `JSON` helper from the `IPython.display` module displays the configuration in a human-readable format.

In the output, the `CurrentSamplingPercentage` attribute is set to 100, meaning that all requests and responses are captured. The `DestinationS3Uri` attribute holds the Amazon S3 URI where the captured data is stored.

Run the following cell to see the configuration of the endpoint resource, and locate the `DataCaptureConfig` attribute to see the data capture configuration for the endpoint.

In [None]:
import json
from IPython.display import JSON

response = boto3.client("sagemaker").describe_endpoint(EndpointName=endpoint_name)
JSON(response, expanded=True)
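
If you prefer to read the data capture settings programmatically rather than in the JSON viewer, a sketch like the following could extract them. The `summarize_data_capture` helper is hypothetical, and the example dictionary is a hand-written stand-in for a real response; its bucket name and prefix are made up.

```python
def summarize_data_capture(endpoint_description):
    """Pull the data capture settings out of a `describe_endpoint`-style dict."""
    config = endpoint_description.get("DataCaptureConfig", {})
    return {
        "enabled": config.get("EnableCapture", False),
        "sampling_percentage": config.get("CurrentSamplingPercentage"),
        "destination": config.get("DestinationS3Uri"),
    }


# Hand-written stand-in for a real response; values are hypothetical.
example = {
    "EndpointName": "lab-sagemaker-endpoint",
    "DataCaptureConfig": {
        "EnableCapture": True,
        "CaptureStatus": "Started",
        "CurrentSamplingPercentage": 100,
        "DestinationS3Uri": "s3://lab-sagemaker-example/data_capture",
    },
}
print(summarize_data_capture(example))
```

In the notebook, passing the `response` variable from the cell above in place of `example` should give the live values.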

### Observing the baseline data

In this lab, a baseline for the model has been provided for you. The following cell creates variables containing Amazon S3 URIs for the baseline data.

The baseline data consists of a statistics JSON file and a constraints JSON file. The statistics file defines the expected distribution of the data, and the constraints file defines the constraints that the data should adhere to.

Run the following cell to proceed.

In [None]:
baseline_statistics = f"s3://{bucket}/baseline_output/statistics.json"
baseline_constraints = f"s3://{bucket}/baseline_output/constraints.json"

To view the contents of the statistics and constraints files, you can run the following three cells.

The first cell reads the files from the Amazon S3 bucket and the second and third cells display the contents of the statistics and constraints files using the JSON helper from the `IPython.display` library.

Run the following cells to proceed.

In [None]:
s3 = boto3.resource("s3")
statistics_content = (
    s3.Object(bucket, "baseline_output/statistics.json")
    .get()["Body"]
    .read()
    .decode("utf-8")
)
constraints_content = (
    s3.Object(bucket, "baseline_output/constraints.json")
    .get()["Body"]
    .read()
    .decode("utf-8")
)

In [None]:
JSON(json.loads(statistics_content), expanded=True)

In [None]:
JSON(json.loads(constraints_content), expanded=True)
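
To illustrate how constraints can be applied to incoming data, the sketch below checks a few rows against a simplified, hand-written constraints structure. The field names are an assumption loosely modeled on the general shape of a constraints file; a real `constraints.json` carries more fields, and the `check_constraints` helper is hypothetical rather than anything Model Monitor exposes.

```python
import math

# Simplified, hand-written constraints (an assumption, not the lab's actual file).
constraints = {
    "features": [
        {"name": "feature_1", "completeness": 1.0, "num_constraints": {"is_non_negative": False}},
        {"name": "feature_2", "completeness": 1.0, "num_constraints": {"is_non_negative": True}},
    ]
}


def check_constraints(rows, constraints):
    """Return a list of human-readable violations for `rows` (a list of dicts)."""
    violations = []
    for feature in constraints["features"]:
        name = feature["name"]
        values = [row.get(name) for row in rows]
        # Treat both absent keys and NaN as missing values.
        present = [v for v in values if v is not None and not math.isnan(v)]
        completeness = len(present) / len(values) if values else 1.0
        if completeness < feature["completeness"]:
            violations.append(
                f"{name}: completeness {completeness:.2f} below {feature['completeness']:.2f}"
            )
        if feature["num_constraints"].get("is_non_negative") and any(v < 0 for v in present):
            violations.append(f"{name}: negative value found")
    return violations


# Hypothetical rows, for illustration only.
rows = [
    {"feature_1": 0.3, "feature_2": 1.5},
    {"feature_1": float("nan"), "feature_2": -2.0},
]
print(check_constraints(rows, constraints))
```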

### Configuring model monitoring

The final step in configuring model monitoring is to create a monitoring schedule. The monitoring schedule defines the frequency at which the model is monitored and the Amazon S3 URI where the monitoring results are stored.

This is a two-step process. First, you create a `DefaultModelMonitor` object, which defines the instance type of the processing job that will be used to monitor the model.

Then, using the model monitor object, you create a monitoring schedule resource. The configuration of this resource specifies the following:

- Where the monitoring output will be stored
- S3 URIs of the baseline data
- The name of the endpoint to monitor
- A cron expression denoting how frequently the monitoring job should run

Run the following cell to configure model monitoring.

In [None]:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor import CronExpressionGenerator

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
    sagemaker_session=sagemaker_session,
)

monitor.create_monitoring_schedule(
    endpoint_input=endpoint_name,
    output_s3_uri=f"s3://{bucket}/monitoring_output",
    statistics=baseline_statistics,
    constraints=baseline_constraints,
    schedule_cron_expression=CronExpressionGenerator.daily(),
    enable_cloudwatch_metrics=True,
)
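
The schedule is passed to SageMaker as a six-field `cron(...)` string (minutes, hours, day of month, month, day of week, year). The sketch below builds such strings by hand so you can see their shape. It assumes, without the lab confirming it, that these match what `CronExpressionGenerator.daily()` and `CronExpressionGenerator.hourly()` return; the helper names are hypothetical.

```python
def daily_cron(hour=0):
    """Build a cron expression that fires once a day at `hour`:00 UTC.

    Assumption: mirrors the string format produced by the SageMaker SDK's
    CronExpressionGenerator.daily(hour=...).
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return f"cron(0 {hour} ? * * *)"


def hourly_cron():
    """Build a cron expression that fires at the top of every hour."""
    return "cron(0 * ? * * *)"


print(daily_cron())    # cron(0 0 ? * * *)
print(daily_cron(6))   # cron(0 6 ? * * *)
print(hourly_cron())   # cron(0 * ? * * *)
```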

Return to the lab step to complete the lab.