# Detecting Data Drift in NLP using Maximum Mean Discrepancy

## Introduction

Detecting data drift in NLP is challenging because of the complex nature of human language. NLP deep learning models are trained on a specific corpus of text, and data drift occurs when the distribution of the inference data diverges from that of the training data. Model monitoring is therefore an important part of production NLP applications: any change in data distribution (drift) during inference can degrade model performance (model decay). Refer to this [blog](https://aws.amazon.com/blogs/machine-learning/detect-nlp-data-drift-using-custom-amazon-sagemaker-model-monitor/) for more details on NLP data drift.

In this notebook, we implement a data drift detection technique using the [Maximum Mean Discrepancy](https://en.wikipedia.org/wiki/Kernel_embedding_of_distributions) (MMD) distance measure. MMD is a kernel-based two-sample test that measures the distance between two distributions. We compare training and inference sentence embeddings using MMD to determine whether data drift has occurred, using Amazon SageMaker custom model monitoring. You can establish a custom baseline, such as sentence embeddings, and evaluate inference data points against it for potential data drift.

## Solution Architecture

In this notebook, we implement an end-to-end solution to detect data drift in NLP using the Amazon SageMaker model monitoring feature. The high-level steps in this architecture are:

1. Fine-tune the `distilroberta-base` model using the HuggingFace Estimator
2. Deploy the model to a SageMaker real-time endpoint
3. Create a baseline using `sentence-transformers` sentence embeddings
4. Detect data drift using the Maximum Mean Discrepancy technique
5. Define a model monitoring schedule and inspect violation reports



<img src="images/nlp-data-drift-mmd.png" alt="nlp-data-drift" width="800" align="left"/>

#### Setup

To start, we install and upgrade required packages

In [None]:
!pip install torch --quiet
!pip install transformers --quiet
!pip install "sagemaker>=2.48.0" --upgrade --quiet
!pip install -U sentence-transformers --quiet

#### Imports

In [None]:
import os
import numpy as np
import pandas as pd
from time import gmtime, strftime
import boto3
import json

#sagemaker
import sagemaker
from sagemaker.s3 import S3Downloader, S3Uploader

from sagemaker.huggingface import HuggingFace
import time
from sagemaker import TrainingJobAnalytics
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

#### Variables

In [None]:
sagemaker_session_bucket=None
current_timestamp = strftime('%m-%d-%H-%M', gmtime())

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

region = sess.boto_region_name

local_train_dataset = "train.json"
local_test_dataset = "test.json"
remote_train_dataset = f"s3://{sagemaker_session_bucket}/nlp-data-drift-mmd/data"
remote_test_dataset = f"s3://{sagemaker_session_bucket}/nlp-data-drift-mmd/data"
base_name = "nlp-data-drift-mmd"
report_file_name = "constraints_violations.json"

train_job_name = f'{base_name}-{current_timestamp}'
model_monitor_job_name = f'{base_name}-{current_timestamp}'
endpoint_name = f'{base_name}-{current_timestamp}'
monitor_schedule_name = f'{base_name}-{current_timestamp}'
processing_job_name = f'{base_name}-{current_timestamp}'

prefix = "sagemaker/nlp-data-drift-mmd"
data_capture_prefix = f"{prefix}/data_capture"
s3_capture_upload_path = f"s3://{sagemaker_session_bucket}/{data_capture_prefix}"

sm_client = boto3.client('sagemaker')
s3 = boto3.client('s3')

account_id = boto3.client('sts').get_caller_identity().get('Account')

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {region}")

## Dataset

We will use the Corpus of Linguistic Acceptability (CoLA) dataset (https://nyu-mll.github.io/CoLA/), a set of 10,657 English sentences from published linguistics literature, each labeled as grammatical or ungrammatical. The script below does the following:

1. Download the data as a zip file
2. Extract content to a folder
3. Read `sentence` and `label` columns from a tsv file
4. Split dataset into train and test files
5. Store the data in JSON format, with each data point on a separate line
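
The last two steps can be sketched as follows. This is a minimal illustration, not the contents of `create_dataset.py`: the helper names `split_rows` and `write_json_lines`, the 80/20 split ratio, and the sample rows are all assumptions made for the example.

```python
import json
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_json_lines(rows, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


# Hypothetical rows with the two columns read from the tsv file.
rows = [{"sentence": f"Example sentence {i}.", "label": i % 2} for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Files written this way can be read back with `pd.read_json(path, lines=True)`, which is how this notebook loads the datasets in later cells.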

In [None]:
!pygmentize ./scripts/create_dataset.py

Let's execute the script to create the training and test data in JSON format

In [None]:
%%time
!mkdir -p ./data/
!python ./scripts/create_dataset.py

Upload train and test files to Amazon S3 bucket

In [None]:
# upload datasets
S3Uploader.upload(os.path.join('./data', local_train_dataset),remote_train_dataset)
S3Uploader.upload(os.path.join('./data',local_test_dataset),remote_test_dataset)

print(f"train dataset uploaded to: {remote_train_dataset}/{local_train_dataset}")
print(f"test dataset uploaded to: {remote_test_dataset}/{local_test_dataset}")

### Fine-tuning distilroberta-base model using HuggingFace Estimators

To create our SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In the Estimator we define the fine-tuning script (`entry_point`), the `instance_type` on which to launch the training job, and the `hyperparameters`, which include the model name, number of epochs, and training batch size:


```python
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            transformers_version='4.12',
                            pytorch_version='1.9',
                            py_version='py38',
                            role=role,
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilroberta-base'
                                               
                                                })
```

Let's take a look at the entry point training script

In [None]:
!pygmentize ./scripts/train.py

We set the following hyperparameters to pass to the training job

In [None]:
hyperparameters={'epochs': 1,                          # number of training epochs
                 'train_batch_size': 32,               # batch size for training
                 'eval_batch_size': 64,                # batch size for evaluation
                 'learning_rate': 3e-5,                # learning rate used during training
                 'model_id':'distilroberta-base', # pre-trained model
                 'fp16': True,                         # Whether to use 16-bit (mixed) precision training
                 'train_file': local_train_dataset,    # training dataset
                 'test_file': local_test_dataset,      # test dataset
                 }

Define metrics for training job by specifying a name and a regular expression for each metric that the training job monitors.

In [None]:
metric_definitions=[
    {'Name': 'eval_loss',               'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy',           'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1',                 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision',          'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"}]
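
To see what such a pattern captures, it can be tried against a single log line. The sketch below reuses the `eval_loss` pattern from the cell above as a raw string; the log line is a made-up example of the dict-style output a training script might print, not real output from this job.

```python
import re

# Same pattern as the 'eval_loss' entry above, written as a raw string.
eval_loss_regex = r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"

# Hypothetical log line; real training logs may be formatted differently.
log_line = "{'eval_loss': 0.4523, 'eval_accuracy': 0.8102, 'epoch': 1.0}"

match = re.search(eval_loss_regex, log_line)
# The outer parentheses capture the numeric value of the metric.
eval_loss = float(match.group(1)) if match else None
print(eval_loss)
```

If a metric never shows up in CloudWatch, testing its regex against a copied log line in this way is a quick first check.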

The estimator below runs a Hugging Face training script in a SageMaker training environment. It starts the SageMaker-managed Hugging Face environment using the pre-built Hugging Face Docker container and runs the training script that the user provides through the `entry_point` argument.

In [None]:

huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # fine-tuning script used in training job
    source_dir           = './scripts',       # directory where fine-tuning script is stored
    instance_type        = 'ml.p3.2xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = train_job_name,    # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    transformers_version = '4.12',            # the transformers version used in the training job
    pytorch_version      = '1.9',             # the pytorch version used in the training job
    py_version           = 'py38',            # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameter used for running the training job
    metric_definitions   = metric_definitions,# the metrics regex definitions to extract logs
    disable_profiler     = True               # disable sagemaker debugger profiler
)

After configuring the estimator class, use the class method `fit()` to start a training job. This step takes around 10 minutes to complete.

In [None]:
training_data = {
    'train': remote_train_dataset,
    'test': remote_test_dataset
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(training_data, wait=True)

Fetch training metrics data from CloudWatch Metrics for a specific training job.

In [None]:
training_job_name = huggingface_estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df

Capture real-time inference data from Amazon SageMaker endpoints. To enable data capture for monitoring the model data quality, specify the new capture option called `DataCaptureConfig` when deploying to an endpoint.

In [None]:
data_capture_config = DataCaptureConfig(
    enable_capture=True, sampling_percentage=100, destination_s3_uri=s3_capture_upload_path
)


Deploy the trained model to an Amazon SageMaker endpoint

In [None]:
predictor = huggingface_estimator.deploy(1,"ml.g4dn.xlarge", endpoint_name=endpoint_name, data_capture_config=data_capture_config,)

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = sagemaker.predictor.Predictor(endpoint_name=endpoint_name, 
                                                   sagemaker_session=sess,
                                                   serializer=JSONSerializer(),
                                                   deserializer=JSONDeserializer())

### Deploy trained model to SageMaker real-time endpoint

The `deploy()` method returns a predictor that provides `predict()` which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.

In [None]:
# read validation data, sample 10 random records and send them to the predictor
df_validation = pd.read_json("./data/validation.json", lines = True)

sentiment_input = {}
for row in df_validation.sample(n=10).iterrows():
    val_data = row[1][1]
    sentiment_input["inputs"] = val_data
    pred_resp = predictor.predict(sentiment_input)
    print(f"Model prediction : {pred_resp}")

#### View Captured Data

View the captured data by listing the data capture files stored in Amazon S3. Expect to see different files from different time periods, organized by the hour in which the invocation occurred. Note that it may take a few minutes for the data capture files to be created.

In [None]:
import boto3

s3_client = boto3.Session().client('s3')
time.sleep(120)

current_endpoint_capture_prefix = "{}/{}".format(data_capture_prefix, endpoint_name)
result = s3_client.list_objects(Bucket=sagemaker_session_bucket, Prefix=current_endpoint_capture_prefix)
capture_files = [capture_file.get("Key") for capture_file in result.get("Contents")]
print("Found Capture Files:")
print("\n ".join(capture_files))

In [None]:
def get_obj_body(obj_key):
    return s3_client.get_object(Bucket=sagemaker_session_bucket, Key=obj_key).get('Body').read().decode("utf-8")

capture_file = get_obj_body(capture_files[-1])
print(capture_file[:2000])

Note the captured input data format: the model monitoring schedule evaluates the captured data in string format. The preprocessing logic is included in the `evaluation.py` file.

In [None]:
import json

print(json.dumps(json.loads(capture_file.split('\n')[0]), indent=2))
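
Because the request payload is stored as a string inside each capture record, it has to be decoded a second time before the sentence can be embedded. The sketch below shows that double decoding on an illustrative record. The field layout is assumed to match the record printed above for JSON-encoded captures, the values are made up, and `extract_input_text` is a hypothetical helper rather than code taken from `evaluation.py`.

```python
import json

# Illustrative capture record; compare its layout with the record printed above.
capture_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "application/json",
            "mode": "INPUT",
            "data": json.dumps({"inputs": "The book was read by the student."}),
            "encoding": "JSON",
        },
        "endpointOutput": {
            "observedContentType": "application/json",
            "mode": "OUTPUT",
            "data": json.dumps([{"label": "LABEL_1", "score": 0.98}]),
            "encoding": "JSON",
        },
    },
    "eventMetadata": {"eventId": "example-id", "inferenceTime": "2022-01-01T00:00:00Z"},
    "eventVersion": "0",
})


def extract_input_text(line):
    """Return the sentence sent to the endpoint from one capture line."""
    record = json.loads(line)
    payload = record["captureData"]["endpointInput"]["data"]  # still a JSON string
    return json.loads(payload)["inputs"]


sentence = extract_input_text(capture_line)
print(sentence)
```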

### Establish Sentence Embedding Baseline

In this section, we will establish a baseline from the training data using `SentenceTransformers`. [SentenceTransformers](https://www.sbert.net/) is a Python framework for state-of-the-art sentence, text and image embeddings.

We load the pretrained SentenceTransformer model `all-distilroberta-v1`, which maps sentences and text to embeddings. The runtime and memory requirements of Transformer models (BERT, RoBERTa, etc.) grow quadratically with input length, so we limit the input to 128 tokens; longer inputs are truncated. We compute each sentence embedding with the `encode()` method and append it to a list. We will store this baseline as a NumPy array and use it during model monitoring schedule execution.

In [None]:
%%time
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-distilroberta-v1')
model.max_seq_length = 128
print(f"Max Sequence Length: {model.max_seq_length}")

df_sentences = pd.read_json('./data/train.json', lines=True)
sentences = df_sentences['sentence'].values

sentence_embeddings = []
counter = 0
for sent in sentences:
    print(f"Encoding sentence #: {counter}")
    sentence_embedding = model.encode(sent)
    sentence_embeddings.append(sentence_embedding[None])
    counter+=1

Save the `sentence_embeddings` list as a NumPy file that will be uploaded to S3

In [None]:
np.save('./data/embeddings.npy', sentence_embeddings)

Copy the .npy baseline data to Amazon S3. We will read it in the evaluation script in a later section.

In [None]:
!aws s3 cp ./data/embeddings.npy s3://{sagemaker_session_bucket}/{prefix}/baseline/

### Maximum Mean Discrepancy

Empirical estimation of MMD: we estimate MMD empirically using the following formula:


$$ MMD^{2}(X,Y) = \underbrace{\frac{1}{m (m-1)} \sum_{i} \sum_{j \neq i} k(\mathbf{x_{i}}, \mathbf{x_{j}})}_\text{A} - \underbrace{\frac{2}{m^{2}} \sum_{i} \sum_{j} k(\mathbf{x_{i}}, \mathbf{y_{j}})}_\text{B} + \underbrace{\frac{1}{m (m-1)} \sum_{i} \sum_{j \neq i} k(\mathbf{y_{i}}, \mathbf{y_{j}})}_\text{C} \tag{1} $$

Here $\mathbf{x_{i}}$ are the data points from the first distribution and $\mathbf{y_{j}}$ are the data points from the second, with $m$ points in each sample, so the MMD score indicates how far apart the underlying distributions are. The `evaluation.py` file demonstrates how to implement the above equation in PyTorch and use MMD as a distance measure to check that the inference distribution does not diverge from the training data.
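
As a rough illustration of equation (1), the sketch below computes the three terms A, B and C with plain Python lists. It is not the code from `evaluation.py`: the function names, the Gaussian kernel with `sigma=1.0`, and the toy two-dimensional samples are assumptions chosen to keep the example small and free of dependencies.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Empirical MMD^2 of equation (1) for two samples of equal size m."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("both samples need the same size m >= 2")
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    term_a = sum(kernel(xs[i], xs[j]) for i, j in pairs) / (m * (m - 1))
    term_b = 2.0 * sum(kernel(x, y) for x in xs for y in ys) / (m * m)
    term_c = sum(kernel(ys[i], ys[j]) for i, j in pairs) / (m * (m - 1))
    return term_a - term_b + term_c


# Toy samples: `similar` overlaps the baseline, `shifted` is far away from it.
baseline = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0]]
similar = [[0.0, 0.05], [0.1, 0.0], [-0.05, 0.1], [0.0, -0.1]]
shifted = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.1], [3.05, 3.0]]

print(mmd_squared(baseline, similar))
print(mmd_squared(baseline, shifted))
```

The score for the overlapping sample stays close to zero, while the shifted sample scores much higher. Because the diagonal terms are excluded from A and C, the estimate can come out slightly negative when the two samples are very similar.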

Given samples $X$ and $Y$, the maximum mean discrepancy is the distance between the feature means of $X$ and $Y$:

$$ MMD^{2}(X,Y) = \Vert \mu_{X} - \mu_{Y} \Vert^{2} _{\mathcal{F}} \tag{2} $$ 
First, we obtain the similarity matrices between $X$ and $X$, $X$ and $Y$, and finally $Y$ and $Y$ with a given distance metric, then plug the results into a kernel-specific function such as the exponential.
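
The three similarity matrices correspond to the three terms obtained when the squared norm in equation (2) is expanded. A short derivation, assuming $\mu_{X} = \mathbb{E}_{x \sim X}[\varphi(x)]$ for a feature map $\varphi$ with $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{F}}$ (the symbol $\varphi$ is introduced only for this derivation):

$$ \Vert \mu_{X} - \mu_{Y} \Vert^{2}_{\mathcal{F}} = \langle \mu_{X}, \mu_{X} \rangle_{\mathcal{F}} - 2 \langle \mu_{X}, \mu_{Y} \rangle_{\mathcal{F}} + \langle \mu_{Y}, \mu_{Y} \rangle_{\mathcal{F}} = \mathbb{E}_{x, x'}[k(x, x')] - 2\, \mathbb{E}_{x, y}[k(x, y)] + \mathbb{E}_{y, y'}[k(y, y')] $$

Replacing each expectation with a sample average gives the terms A, B and C of equation (1).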

For example, suppose the kernel in question is Gaussian, meaning

$$ k(\mathbf{x_{i}}, \mathbf{x_{j}}) = \exp \left(\frac{- \Vert \mathbf{x_{i}} - \mathbf{x_{j}} \Vert^{2}}{2\sigma^{2}}\right) = \exp \left(\frac{-1}{2\sigma^{2}} [\mathbf{x_{i}}^\intercal \mathbf{x_{i}} - 2 \mathbf{x_{i}}^\intercal \mathbf{x_{j}} + \mathbf{x_{j}}^\intercal \mathbf{x_{j}}]\right) $$

If we can construct a matrix whose element for every $i$ and $j$ is $[\mathbf{x_{i}}^\intercal \mathbf{x_{i}} - 2 \mathbf{x_{i}}^\intercal \mathbf{x_{j}} + \mathbf{x_{j}}^\intercal \mathbf{x_{j}}]$, then, after scaling by the bandwidth factor, we can pass that matrix to `torch.exp()` to obtain the kernel values for all pairs at once.
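
The same construction can be sketched without a tensor library. The example below builds the Gaussian kernel matrix from inner products only, which is the form that maps directly onto matrix products; `dot` and `gaussian_gram` are hypothetical helper names and `sigma=1.0` is an arbitrary choice for the illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_gram(xs, ys, sigma=1.0):
    """Kernel matrix K[i][j] = exp(-||x_i - y_j||^2 / (2 * sigma^2)).

    The squared distance is assembled from inner products:
    ||x - y||^2 = x.x - 2 * x.y + y.y
    """
    xx = [dot(x, x) for x in xs]
    yy = [dot(y, y) for y in ys]
    return [
        [
            math.exp(-(xx[i] - 2.0 * dot(x, y) + yy[j]) / (2.0 * sigma ** 2))
            for j, y in enumerate(ys)
        ]
        for i, x in enumerate(xs)
    ]


xs = [[1.0, 0.0], [0.0, 1.0]]
gram = gaussian_gram(xs, xs)
print(gram)
```

With a tensor library, the three pieces (`xx`, the cross products, and `yy`) would each be computed for all pairs in one matrix operation and then passed through an element-wise exponential.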

We will implement the MMD logic in the model monitor evaluation script.

In [None]:
!pygmentize docker/evaluation.py

### Set up Model monitor evaluation Script

Amazon SageMaker Model Monitor provides a prebuilt container with the ability to analyze the data captured from endpoints for tabular datasets. If you would like to bring your own container, Model Monitor provides extension points that you can leverage.

Under the hood, when you create a MonitoringSchedule, Model Monitor ultimately kicks off processing jobs. Hence the container needs to be aware of the processing job contract.

We need to create an evaluation script that is compatible with the container contract inputs and outputs:

[Container Contract Inputs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-byoc-contract-inputs.html)

[Container Contract Outputs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-byoc-contract-outputs.html)
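
As a rough idea of the output side of that contract, the sketch below writes a violations report for scores that exceed a threshold. It is not the code in `evaluation.py`. The `write_violations` helper, the `feature_name` and `constraint_check_type` values, and the sample scores are assumptions, and the report layout (a `violations` list with `feature_name`, `constraint_check_type` and `description` entries) reflects one reading of the contract outputs page and should be checked against it.

```python
import json
import os
import tempfile


def write_violations(scores, threshold, output_dir):
    """Write a violations report listing every score above the threshold."""
    violations = [
        {
            "feature_name": "sentence_embedding",      # hypothetical name
            "constraint_check_type": "mmd_threshold",  # hypothetical check type
            "description": f"MMD score {score:.4f} exceeds threshold {threshold}",
        }
        for score in scores
        if score > threshold
    ]
    os.makedirs(output_dir, exist_ok=True)
    report_path = os.path.join(output_dir, "constraints_violations.json")
    with open(report_path, "w", encoding="utf-8") as handle:
        json.dump({"violations": violations}, handle, indent=2)
    return report_path


# Inside the container the threshold would come from an environment variable
# and the output directory would be the processing output path. A temporary
# directory and made-up scores are used here so the sketch runs anywhere.
threshold = float(os.environ.get("THRESHOLD", "6.0"))
report_path = write_violations([1.2, 7.5], threshold, tempfile.mkdtemp())
print(report_path)
```

In this notebook, the `THRESHOLD` environment variable and the `/opt/ml/processing/resultdata` output path are set when the monitor and its output are defined in later cells.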


### Build and Push Image to ECR

The script below shows how to build the Docker image and push it to ECR so that it is ready for use by SageMaker.

In [None]:
ecr_repository = f'{base_name}-{current_timestamp}'
tag = ':latest'

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = f'{account_id}.dkr.ecr.{region}.{uri_suffix}/{ecr_repository + tag}'

In [None]:
# Creating the ECR repository and pushing the container image

# SageMaker Classic Notebook Instance:
!docker build -t $ecr_repository docker
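# `aws ecr get-login` was removed in AWS CLI v2; `get-login-password` works with v2 and recent v1 releases
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.{uri_suffix}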
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

# SageMaker Studio:
# !cd docker && sm-docker build . --repository $ecr_repository$tag

### Custom Model Monitor for detecting data drift

The `ModelMonitor` class assists in creating monitoring schedules for data captured by SageMaker endpoints. Use this class to provide your own container image containing the code that evaluates the captured data and detects any potential drift.

<div class="alert alert-info"> 💡 <strong> Note </strong>
The threshold value for average MMD can vary depending on the data and use case. You can run experiments with the MMD distance measure and specify a threshold value in the ModelMonitor that fits your use case.
</div>
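
One possible way to run such an experiment is to measure MMD scores on data that is known to be drift-free and place the threshold near the top of that range. The sketch below is only a heuristic under that assumption: `suggest_threshold`, the 99th-percentile rule, the `margin` parameter, and the synthetic scores are all illustrative and are not part of this notebook's evaluation code.

```python
import statistics


def suggest_threshold(in_distribution_scores, margin=1.0):
    """Suggest a threshold from MMD scores measured on drift-free data.

    Takes the 99th percentile of the observed scores, scaled by `margin`.
    """
    cut_points = statistics.quantiles(
        in_distribution_scores, n=100, method="inclusive"
    )
    return cut_points[98] * margin  # index 98 is the 99th percentile


# Synthetic scores standing in for MMD values from held-out training data.
scores = [i / 10 for i in range(1, 101)]
threshold = suggest_threshold(scores)
print(threshold)
```

A `margin` above 1.0 makes the monitor more tolerant, at the cost of missing smaller shifts.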

In [None]:
from sagemaker.model_monitor import ModelMonitor

monitor = ModelMonitor(
    base_job_name=model_monitor_job_name,
    role=role,
    image_uri=processing_repository_uri,
    instance_count=1,
    instance_type='ml.m5.large',
    env={ 'THRESHOLD':'6.0', 'bucket': sagemaker_session_bucket },
)

Create a monitoring schedule to monitor the Amazon SageMaker endpoint. We have established a sentence embedding baseline and stored it in S3 as a NumPy file. During model monitoring schedule execution, the S3 file is downloaded and MMD distance metrics are calculated for every data point captured through the data capture configuration.

In [None]:
from sagemaker.model_monitor import CronExpressionGenerator, MonitoringOutput
from sagemaker.processing import ProcessingInput, ProcessingOutput

destination = f's3://{sagemaker_session_bucket}/{prefix}/{endpoint_name}/monitoring_schedule'

processing_output = ProcessingOutput(
    output_name='result',
    source='/opt/ml/processing/resultdata',
    destination=destination,
    s3_upload_mode="EndOfJob"
)
output = MonitoringOutput(source=processing_output.source, destination=processing_output.destination, s3_upload_mode="EndOfJob")

monitor.create_monitoring_schedule(
    monitor_schedule_name=monitor_schedule_name,
    output=output,
    endpoint_input=predictor.endpoint_name,
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)

Describe and inspect the schedule. Note that the MonitoringScheduleStatus changes from 'Pending' to 'Scheduled'.

In [None]:
monitor.describe_schedule()

Note: Even for an hourly schedule, Amazon SageMaker has a buffer period of 20 minutes to schedule your execution. You might see your execution start in anywhere from zero to ~20 minutes from the hour boundary. This is expected and done for load balancing on the backend.

In [None]:
mon_executions = monitor.list_executions()
print(
    "We created an hourly schedule above and it will kick off executions ON the hour (plus 0 - 20 min buffer).\nWe will have to wait till we hit the hour..."
)

while len(mon_executions) == 0:
    print("Waiting for the 1st execution to happen...")
    time.sleep(60)
    mon_executions = monitor.list_executions()

Examine the latest monitoring schedule execution and check the `ProcessingJobStatus` of the execution

In [None]:
latest_execution = mon_executions[
    -1
]  # latest execution's index is -1, second to last is -2 and so on..
time.sleep(60)
latest_execution.wait(logs=False)

print("Latest execution status: {}".format(latest_execution.describe()["ProcessingJobStatus"]))
print("Latest execution result: {}".format(latest_execution.describe()["ExitMessage"]))

latest_job = latest_execution.describe()
if latest_job["ProcessingJobStatus"] != "Completed":
    print(
        "====STOP==== \n No completed executions to inspect further. Please wait till an execution completes or investigate previously reported failures."
    )

The model monitoring schedule execution creates a violations report if any data points exceed the MMD threshold, indicating data drift. Reports are uploaded to the S3 location below.

In [None]:
report_uri = latest_execution.output.destination
print("Report Uri: {}".format(report_uri))

In [None]:
from urllib.parse import urlparse

s3uri = urlparse(report_uri)
report_bucket = s3uri.netloc
report_key = s3uri.path.lstrip("/")
print("Report bucket: {}".format(report_bucket))
print("Report key: {}".format(report_key))

s3_client = boto3.Session().client("s3")
result = s3_client.list_objects(Bucket=report_bucket, Prefix=report_key)
report_files = [report_file.get("Key") for report_file in result.get("Contents")]
print("Found Report Files:")
print("\n ".join(report_files))

Let's check the violations report

In [None]:
try:
    s3.download_file(report_bucket, report_key+'/'+report_file_name, report_file_name)
    with open(report_file_name, 'r') as handle:
        parsed = json.load(handle)
    
    print(json.dumps(parsed, indent=4, sort_keys=True))
except Exception as e:
    print(str(e))

### Manually execute the processing job

You can test the SageMaker model monitoring schedule execution manually by launching a processing job with the ECR image. This can help shorten test cycles and simplify debugging.

In [None]:
from sagemaker.processing import Processor

processor = Processor(
    base_job_name=processing_job_name,
    role=role,
    image_uri=processing_repository_uri,
    instance_count=1,
    instance_type='ml.m5.large',
    env={ 'THRESHOLD':'6.0','bucket': sagemaker_session_bucket },
)
    
processor.run(
    [ProcessingInput(
        input_name='endpointdata',
        source = "s3://{}/{}/{}".format(sagemaker_session_bucket, data_capture_prefix,endpoint_name),
        #source=f's3://{sagemaker_session.default_bucket()}/{s3_prefix}/endpoint/data_capture',
        destination = '/opt/ml/processing/input/endpoint',
    )],
    [ProcessingOutput(
        output_name='result',
        source='/opt/ml/processing/resultdata',
        destination=destination,
    )],
)

### Cleanup

Lastly, please remember to delete the monitoring schedule and Amazon SageMaker endpoint to avoid charges:

In [None]:
#Delete the monitoring schedule
monitor.delete_monitoring_schedule()
time.sleep(20)

In [None]:
#Delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

## Conclusion

In this notebook, we discussed how to use the Maximum Mean Discrepancy distance measure to detect data drift in NLP applications with Amazon SageMaker custom model monitoring. You can use this pattern with other distance measures, such as cosine distance. Give it a try!