# SageMaker Data Quality Model Monitor for Batch Transform with SageMaker Pipelines On-demand

In this notebook, we use SageMaker Pipelines and SageMaker Model Monitor to monitor the data quality of a batch transform job.

Data quality monitoring automatically monitors machine learning (ML) models in production and notifies you when data quality issues arise. ML models in production have to make predictions on real-life data that is not carefully curated like most training datasets. If the statistical nature of the data that your model receives while in production drifts away from the nature of the baseline data it was trained on, the model begins to lose accuracy in its predictions.

We introduce a new step type called `MonitorBatchTransformStep` to do this.

In [None]:
import sys

! pip install --upgrade pip
!{sys.executable} -m pip install sagemaker==2.114.0
!{sys.executable} -m pip install -U boto3

If you run this notebook in SageMaker Studio, make sure the latest SageMaker Python SDK is installed, then restart the kernel by uncommenting and running the code in the next cell.

In [None]:
# import IPython
# IPython.Application.instance().kernel.do_shutdown(True)  # has to restart kernel so changes are used

In [None]:
import os
import boto3
import re
import time
import json
from sagemaker import get_execution_role, session
import pandas as pd

region = boto3.Session().region_name

role = get_execution_role()
print("RoleArn: {}".format(role))

In [None]:
bucket = session.Session(boto3.Session()).default_bucket()

print("Demo Bucket: {}".format(bucket))
prefix = f"sagemaker/demo-model-monitor-batch-transform/data-quality/{int(time.time())}"

reports_prefix = "{}/reports".format(prefix)
s3_report_path = "s3://{}/{}".format(bucket, reports_prefix)

transform_output_path = "s3://{}/{}/transform-outputs".format(bucket, prefix)

print("Transform Output path: {}".format(transform_output_path))
print("Report path: {}".format(s3_report_path))

## Construct a SageMaker Pipeline

Amazon SageMaker Model Building Pipelines is a tool for building machine learning pipelines that take advantage of direct SageMaker integration. We can use it to run a batch transform job with on-demand monitoring.

In this notebook, we show how to use SageMaker Pipelines to orchestrate on-demand monitoring of batch inference. In summary, we create and execute a pipeline to:

- Create a model
- Run a batch inference with the model
- Run a model monitoring job to evaluate the inference inputs/outputs.

In [None]:
from time import gmtime, strftime
import sagemaker
from sagemaker.model import Model
from sagemaker.image_uris import retrieve
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.model_step import ModelStep

In [None]:
pipeline_session = PipelineSession()

### Create a model

Here we take a pretrained model and upload it to S3. We use this model in our batch transform step.

In [None]:
model_file_name = "xgb-churn-prediction-model.tar.gz"

In [None]:
!aws s3 cp model/{model_file_name} s3://{bucket}/{prefix}/{model_file_name}

In [None]:
model_name = "DEMO-xgb-churn-pred-model-monitor-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# Reference the uploaded model artifact by its S3 URI
model_url = "s3://{}/{}/{}".format(bucket, prefix, model_file_name)

In [None]:
image_uri = retrieve("xgboost", boto3.Session().region_name, "0.90-1")

model = Model(
    image_uri=image_uri,
    model_data=model_url,
    role=role,
    sagemaker_session=pipeline_session,
)

create_model_step = ModelStep(
    name="CreateXGBoostModelStep",
    step_args=model.create(),
)

### Configure a transformer

We must first upload the dataset used to generate predictions to S3. We then define a transformer object to be used in the `MonitorBatchTransformStep`.

In [None]:
# Dataset used to get predictions

!aws s3 cp test_data/validation.csv s3://{bucket}/{prefix}/transform_input/validation/validation.csv

In [None]:
from sagemaker.transformer import Transformer
from sagemaker.workflow.parameters import ParameterString

In [None]:
transformer = Transformer(
    model_name=create_model_step.properties.ModelName,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    accept="text/csv",
    assemble_with="Line",
    output_path=transform_output_path,
    sagemaker_session=pipeline_session,
)

In [None]:
transform_input_param = ParameterString(
    name="transform_input",
    default_value=f"s3://{bucket}/{prefix}/transform_input/validation",
)

transform_arg = transformer.transform(
    transform_input_param,
    content_type="text/csv",
    split_type="Line",
    # exclude the ground truth (first column) from the validation set
    # when doing inference.
    input_filter="$[1:]",
)
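
To make the effect of `input_filter="$[1:]"` concrete, here is a minimal sketch that emulates it on a single CSV record in plain Python. It is an illustration only, not how SageMaker evaluates the filter; the helper name `drop_first_column` and the sample record are hypothetical.

```python
import csv
import io


def drop_first_column(csv_line: str) -> str:
    """Emulate input_filter="$[1:]" on one CSV record: keep columns 1..end."""
    fields = next(csv.reader([csv_line]))
    kept = fields[1:]  # column 0 holds the ground-truth label in validation.csv
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(kept)
    return out.getvalue()


# Hypothetical record: the label first, then three feature values.
record = "1,186,0.1,137.8"
print(drop_first_column(record))  # 186,0.1,137.8
```

Only the feature columns reach the model, so the label never influences the predictions.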

### Configure data quality monitoring

In this section, we first run a baseline job, and use the suggested constraints and statistics as the baseline for running the data quality monitoring job during pipeline execution.

In [None]:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker.workflow.check_job_config import CheckJobConfig
from sagemaker.workflow.quality_check_step import DataQualityCheckConfig

In [None]:
baseline_prefix = prefix + "/baselining"
baseline_data_prefix = baseline_prefix + "/data"
baseline_results_prefix = baseline_prefix + "/results"

baseline_data_uri = "s3://{}/{}".format(bucket, baseline_data_prefix)
baseline_results_uri = "s3://{}/{}".format(bucket, baseline_results_prefix)
print("Baseline data uri: {}".format(baseline_data_uri))
print("Baseline results uri: {}".format(baseline_results_uri))

### Generate a baseline for Model Monitor

We use the training dataset `training-dataset-with-header.csv` to generate a baseline for the data quality monitor, using the `suggest_baseline` method. The method produces a `statistics` file and a `constraints` file. Model Monitor compares the data passed to the transform job against these files and reports any violations it detects.

The `suggest_baseline` method has an argument called `baseline_dataset`. This is typically the dataset used during training. 

We upload the dataset used for baselining and the data used for inference to S3.
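
The idea of deriving constraints from a baseline and checking new data against them can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions (only completeness and non-negativity are checked); it is not Model Monitor's actual algorithm, and the function names and column names are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def suggest_toy_constraints(baseline_rows: List[Row]) -> Dict[str, dict]:
    """Derive per-column completeness and non-negativity from baseline rows."""
    constraints = {}
    for col in baseline_rows[0].keys():
        values = [row[col] for row in baseline_rows]
        present = [v for v in values if v is not None]
        constraints[col] = {
            "completeness": len(present) / len(values),
            "is_non_negative": all(v >= 0 for v in present),
        }
    return constraints


def find_toy_violations(constraints: Dict[str, dict], new_rows: List[Row]) -> List[str]:
    """Report columns in new_rows that break the baseline constraints."""
    violations = []
    for col, rule in constraints.items():
        values = [row.get(col) for row in new_rows]
        present = [v for v in values if v is not None]
        completeness = len(present) / len(values)
        if completeness < rule["completeness"]:
            violations.append(f"{col}: completeness dropped to {completeness:.2f}")
        if rule["is_non_negative"] and any(v < 0 for v in present):
            violations.append(f"{col}: negative value where baseline was non-negative")
    return violations


# Hypothetical data: two baseline rows and two incoming rows.
baseline = [
    {"day_mins": 120.0, "cust_serv_calls": 1.0},
    {"day_mins": 95.5, "cust_serv_calls": 0.0},
]
incoming = [
    {"day_mins": -3.0, "cust_serv_calls": 2.0},
    {"day_mins": 88.0, "cust_serv_calls": None},
]

toy_constraints = suggest_toy_constraints(baseline)
for message in find_toy_violations(toy_constraints, incoming):
    print(message)
```

The real baseline job computes much richer statistics, but the workflow is the same: derive expectations once from trusted data, then compare each new batch against them.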

In [None]:
training_data_file = "test_data/training-dataset-with-header.csv"

In [None]:
# Dataset used to generate statistics and constraints file

!aws s3 cp {training_data_file} {baseline_data_uri}/training-dataset-with-header.csv

In [None]:
my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

my_default_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri + "/training-dataset-with-header.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True,
    logs=False,
)

In [None]:
s3_client = boto3.Session().client("s3")
result = s3_client.list_objects(Bucket=bucket, Prefix=baseline_results_prefix)
report_files = [report_file.get("Key") for report_file in result.get("Contents")]
print("Found Files:")
print("\n ".join(report_files))

In [None]:
statistics_path = "{}/statistics.json".format(baseline_results_uri)
constraints_path = "{}/constraints.json".format(baseline_results_uri)
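
Before passing these files to the pipeline, it can help to look at what a constraints file contains. The sketch below parses a small, hand-written document shaped like a typical `constraints.json` and groups features by inferred type. The sample content and the helper `features_by_type` are assumptions for illustration; the file produced by your baseline job will contain different features and possibly additional fields.

```python
import json
from collections import defaultdict

# Hand-written sample modeled on the usual constraints layout (illustrative only).
sample_constraints_json = """
{
  "version": 0.0,
  "features": [
    {"name": "Churn", "inferred_type": "Integral", "completeness": 1.0,
     "num_constraints": {"is_non_negative": true}},
    {"name": "Day Mins", "inferred_type": "Fractional", "completeness": 1.0,
     "num_constraints": {"is_non_negative": true}},
    {"name": "State", "inferred_type": "String", "completeness": 1.0}
  ]
}
"""


def features_by_type(constraints: dict) -> dict:
    """Group feature names by the type inferred during baselining."""
    grouped = defaultdict(list)
    for feature in constraints.get("features", []):
        grouped[feature.get("inferred_type", "Unknown")].append(feature["name"])
    return dict(grouped)


sample_constraints = json.loads(sample_constraints_json)
print(features_by_type(sample_constraints))
```

To inspect the real file, you could download `constraints.json` from `baseline_results_uri` and pass its parsed contents to the same helper.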

### Configure the Data Quality Check

We create two configurations here: `CheckJobConfig` and `DataQualityCheckConfig`. `CheckJobConfig` configures the underlying processing job used by Model Monitor. This is where you specify the role, instance type, and so on.

`DataQualityCheckConfig` configures how Model Monitor runs the data quality check. Its `baseline_dataset` argument takes the dataset that is passed to the transform job. That dataset is compared against the baseline statistics and constraints files generated by the `suggest_baseline` method.

In [None]:
job_config = CheckJobConfig(role=role)
data_quality_config = DataQualityCheckConfig(
    baseline_dataset=transform_input_param,
    dataset_format=DatasetFormat.csv(header=False),
    output_s3_uri=s3_report_path,
)

### Use the `MonitorBatchTransformStep` to monitor the transform job

This step runs a batch transform job using the transformer object configured above and monitors the data passed to the transformer before executing the job.

The baselines calculated above must be passed to this step so that the incoming data can be compared against them to detect violations.

You can configure the step to fail when a data quality violation is found by setting the `fail_on_violation` flag to `True`.

In [None]:
from sagemaker.workflow.monitor_batch_transform_step import MonitorBatchTransformStep

transform_and_monitor_step = MonitorBatchTransformStep(
    name="MonitorCustomerChurnDataQuality",
    transform_step_args=transform_arg,
    monitor_configuration=data_quality_config,
    check_job_configuration=job_config,
    # since this is for data quality monitoring,
    # you could choose to run the monitoring job before the batch inference.
    monitor_before_transform=True,
    # if violation is detected in the monitoring, you can skip it and continue running batch transform
    fail_on_violation=False,
    supplied_baseline_statistics=statistics_path,
    supplied_baseline_constraints=constraints_path,
)
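
The interaction between `monitor_before_transform` and `fail_on_violation` can be summarized with a small simulation. This is a simplified mental model based on how the two flags are described in this notebook, not SDK code; the function `simulate_monitor_step` is hypothetical.

```python
from typing import List, Tuple


def simulate_monitor_step(
    monitor_before_transform: bool,
    fail_on_violation: bool,
    violations_found: bool,
) -> Tuple[List[str], str]:
    """Return (jobs that ran, final status) under the simplified model."""
    if monitor_before_transform:
        order = ["monitor", "transform"]
    else:
        order = ["transform", "monitor"]

    jobs_run = []
    for job in order:
        jobs_run.append(job)
        # A failing check stops the step; later jobs do not run.
        if job == "monitor" and violations_found and fail_on_violation:
            return jobs_run, "Failed"
    return jobs_run, "Succeeded"


# The configuration used in this notebook: check first, do not fail on violations.
print(simulate_monitor_step(True, False, violations_found=True))
# Stricter alternative: a violation stops the step before the transform runs.
print(simulate_monitor_step(True, True, violations_found=True))
```

Under this model, running the check first with `fail_on_violation=True` avoids spending transform compute on input data that has already failed the check.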

### Create and run the pipeline

In [None]:
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="MonitorDataQualityBatchTransformPipeline",
    parameters=[transform_input_param],
    steps=[create_model_step, transform_and_monitor_step],
)

In [None]:
pipeline.upsert(role_arn=role)

### Start a pipeline execution

In [None]:
execution = pipeline.start()

In [None]:
execution.wait()

### Read the model monitor reports

You must wait for the pipeline to finish executing before you can read the violation reports.

This pipeline succeeds even though violations are found by model monitor because `fail_on_violation` is set to `False`.

In [None]:
from sagemaker.model_monitor import MonitoringExecution

monitoring_step = [step for step in execution.list_steps() if "QualityCheck" in step["Metadata"]][0]
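
Indexing with `[0]` raises an `IndexError` if no quality check step is present, for example when the execution failed before the check ran. A slightly more defensive lookup might look like the sketch below. The helper `find_quality_check_step`, the step names, and the placeholder ARN strings are hypothetical; only the `Metadata` / `QualityCheck` keys come from the cell above.

```python
from typing import Any, Dict, List, Optional


def find_quality_check_step(steps: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first step whose metadata holds a QualityCheck entry, else None."""
    for step in steps:
        if "QualityCheck" in step.get("Metadata", {}):
            return step
    return None


# Hand-written stand-in for execution.list_steps() output (illustrative only).
sample_steps = [
    {"StepName": "example-transform", "Metadata": {"TransformJob": {"Arn": "example-arn-1"}}},
    {"StepName": "example-monitoring", "Metadata": {"QualityCheck": {"CheckJobArn": "example-arn-2"}}},
    {"StepName": "example-create-model", "Metadata": {"Model": {"Arn": "example-arn-3"}}},
]

check_step = find_quality_check_step(sample_steps)
if check_step is None:
    print("No quality check step found in this execution.")
else:
    print(check_step["Metadata"]["QualityCheck"]["CheckJobArn"])
```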

In [None]:
monitoring = MonitoringExecution.from_processing_arn(
    sagemaker_session=pipeline_session,
    processing_job_arn=monitoring_step["Metadata"]["QualityCheck"]["CheckJobArn"],
)
violation = monitoring.constraint_violations(file_name="constraint_violations.json")

In [None]:
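# None removes the column-width limit (-1 is no longer accepted by recent pandas versions)
pd.set_option("display.max_colwidth", None)

# pd.json_normalize replaces the deprecated pd.io.json.json_normalize
constraints_df = pd.json_normalize(violation.body_dict["violations"])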
constraints_df.head(10)
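
When a report contains many violations, a count per check type gives a quick overview. The sketch below does this in plain Python on a hand-written list shaped like the entries under `"violations"`. The sample entries and the helper `count_by_check_type` are illustrative assumptions; real reports will contain different features and descriptions.

```python
from collections import Counter
from typing import Dict, List

# Hand-written sample in the shape of violation entries (illustrative only).
sample_violations = [
    {
        "feature_name": "State",
        "constraint_check_type": "data_type_check",
        "description": "example description",
    },
    {
        "feature_name": "Day Mins",
        "constraint_check_type": "baseline_drift_check",
        "description": "example description",
    },
    {
        "feature_name": "Area Code",
        "constraint_check_type": "data_type_check",
        "description": "example description",
    },
]


def count_by_check_type(violations: List[Dict[str, str]]) -> Dict[str, int]:
    """Count violations per constraint_check_type."""
    return dict(Counter(v["constraint_check_type"] for v in violations))


print(count_by_check_type(sample_violations))
```

With the real report, the same helper could be applied to `violation.body_dict["violations"]`.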

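### Other commands

If you also create a monitoring schedule with `my_default_monitor`, you can stop and start it with the commands below. This notebook runs monitoring on demand and does not create a schedule.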

In [None]:
# my_default_monitor.stop_monitoring_schedule()
# my_default_monitor.start_monitoring_schedule()

### Delete the resources


In [None]:
# my_default_monitor.stop_monitoring_schedule()
# my_default_monitor.delete_monitoring_schedule()
# time.sleep(60)  # give the deletion time to complete