# Monitor shifts in operational data

This guide provides a beginner-friendly introduction to monitoring post-deployment data shifts.

Estimated time to complete: 5 minutes

Relevant ML stages: [Monitoring](../concepts/workflows/ML_Lifecycle.md#monitoring)

Relevant personas: Machine Learning Engineer, T&E Engineer

## What you'll do

- Construct embeddings using a pretrained neural network
- Compare the embeddings between a training and operational set
- Compare the label distributions between a training and operational set

## What you'll learn

- How to analyze embeddings for operational drift
- How to analyze label distributions

## What you'll need

- Knowledge of Python
- Beginner knowledge of PyTorch or neural networks


## Introduction

Monitoring is a critical step in the [AI/ML lifecycle](../concepts/workflows/ML_Lifecycle.md). When a model is deployed, data can, and generally will, drift from the distribution on which the model was originally trained.
One critical step in AI T&E is the detection of changes in the operational distribution so that they may be proactively addressed. While some change might not affect performance, significant deviation is often associated with model degradation.

For this tutorial, you will use the popular [2011 VOC](http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html) computer vision dataset to detect drift between the image distribution of the `train` split and the `val` split, which will represent an operational dataset in this guide. You will then determine whether the labels within these two datasets have high parity, that is, equivalent label distributions.


## Setup

You'll begin by importing the necessary libraries for this tutorial.


In [None]:
try:
    import google.colab  # noqa: F401

    %pip install -q dataeval
except Exception:
    pass

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torchvision import models

# Drift
from dataeval.detectors.drift import DriftCVM, DriftKS, DriftMMD
from dataeval.metrics.bias import label_parity
from dataeval.utils.data import Embeddings, Metadata
from dataeval.utils.data.datasets import VOCDetection

# Set a random seed
rng = np.random.default_rng(213)

# Set default torch device for notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)

**More on device**\
The device is set above because it will be used in subsequent steps. The device is the piece of hardware where the model, data, and other related objects are stored in memory. If a GPU is available, this notebook will use that hardware rather than the CPU. To force running only on the CPU, change `device` to `"cpu"`. For more information, see the [PyTorch device page](https://pytorch.org/tutorials/recipes/recipes/changing_default_device.html).


## Step 1: Constructing Embeddings

A common first step in many aspects of data monitoring is reducing images down to a smaller dimension. While this step is not always necessary, it is good practice to use embeddings over raw images to improve
the speed and memory efficiency of many workflows without sacrificing downstream performance.

In this step, you will use a [pretrained ResNet18 model](https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html) to reduce the dimensionality of the VOC dataset.

### Define model architecture

Below is a simple [PyTorch nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) that wraps the pre-trained ResNet18 referred to above.


In [None]:
# Define the embedding network
class EmbeddingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Load in pretrained resnet18 model
        self.model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Replace the final fully connected layer so the model outputs 128-dimensional embeddings
        self.model.fc = nn.Linear(self.model.fc.in_features, 128)

    def forward(self, x):
        """Run input data through the model"""

        return self.model(x)

The model can now be instantiated in the code below.


In [None]:
embedding_net = EmbeddingNet()

### Download VOC dataset

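With the model created on the device set at the beginning, you will now load the train and validation splits of the 2011 VOC dataset. The code below passes `download=False`, so if the data is not already present under `./data` you will likely need to set `download=True`. Batching is handled later by the `Embeddings` class through its `batch_size` argument, so the model processes the images in chunks rather than all at once.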


In [None]:
# Define pretrained model transformations
preprocess = models.ResNet18_Weights.DEFAULT.transforms()

# Load the training dataset
train_ds = VOCDetection("./data", year="2011", image_set="train", download=False, transform=preprocess)
# Load the "operational" dataset
operational_ds = VOCDetection("./data", year="2011", image_set="val", download=False, transform=preprocess)

print(train_ds.info())
print(operational_ds.info())

Note two properties of each dataset:

- Number of datapoints
- Resize size

Together, these two values give an estimate of how much memory each dataset occupies. The following step reduces this footprint by replacing each resized image with a much smaller model embedding.
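
To make the memory estimate concrete, here is a rough back-of-the-envelope sketch. It assumes each preprocessed image is a 3 x 224 x 224 `float32` tensor (an assumption about the default ResNet18 preprocessing, not something printed by `info()`), uses the dataset sizes reported later in this guide (5717 and 5823), and uses the 128-dimensional embedding defined in `EmbeddingNet`. The `megabytes` helper is hypothetical and exists only for this illustration.

```python
# Rough memory estimate: raw preprocessed images vs. 128-dimensional embeddings.
# ASSUMPTION: each preprocessed image is a 3 x 224 x 224 float32 tensor.
BYTES_PER_FLOAT32 = 4


def megabytes(num_items: int, values_per_item: int) -> float:
    """Approximate size in MB of `num_items` float32 arrays with `values_per_item` values each."""
    return num_items * values_per_item * BYTES_PER_FLOAT32 / 1e6


image_values = 3 * 224 * 224  # assumed values per preprocessed image
embedding_values = 128  # embedding dimension used by EmbeddingNet

# Dataset sizes taken from the shapes reported in this guide
for split_name, num_images in {"train": 5717, "operational": 5823}.items():
    raw_mb = megabytes(num_images, image_values)
    emb_mb = megabytes(num_images, embedding_values)
    print(f"{split_name}: images ~{raw_mb:,.0f} MB, embeddings ~{emb_mb:.1f} MB")
```

Under these assumptions, each embedding holds roughly a thousand times fewer values than the image it represents, which is why the drift detectors are run on embeddings rather than on raw pixels.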


### Extract Embeddings

Now it is time to process the datasets through your model. Aggregating the model outputs gives you the embeddings of the data. This will be helpful in determining drift between the training and operational splits.


Below you will use the `Embeddings` class to create embeddings for both the train and operational splits. The labels are not needed yet; they will be read from the datasets in a later step.


In [None]:
# Create training embeddings, processing the dataset in batches of 64
train_embs = Embeddings(train_ds, batch_size=64, model=embedding_net).to_tensor()

# Create operational embeddings, processing the dataset in batches of 64
operational_embs = Embeddings(operational_ds, batch_size=64, model=embedding_net).to_tensor()

Notice that the shape of the embeddings differs from the shape of the data before.

**Previously**

Training shape - (5717, 256)\
Operational shape - (5823, 256)

**After embeddings**


In [None]:
print(train_embs.shape)
print(operational_embs.shape)

The reduced shape of both the training and operational datasets should improve the speed and memory use of the upcoming drift algorithms, typically with little effect on the accuracy of their results.


## Step 2: Monitor drift

In this step, you will be checking for drift between the training embeddings and the operational embeddings from before. If drift is detected, a model trained on this training data should be retrained with new operational data. This can help mitigate performance degradation in a deployed model. Visit our [About Drift](../concepts/Drift.md) page to learn more.

### Drift detectors

DataEval offers a few drift detectors: {class}`.DriftMMD`, {class}`.DriftCVM`, and {class}`.DriftKS`.

Since each detector outputs a binary decision on whether drift is detected, a **majority vote** will be used to make the determination of drift.\
To learn more about these algorithms, see the [theory behind drift detection](../concepts/Drift.md#theory-behind-drift-detection) concept page.
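
The majority vote described above can be sketched as a small helper. This is a minimal illustration, not part of DataEval: `majority_vote` is a hypothetical function, and the decisions in `example_decisions` are made-up values used only to show the logic.

```python
def majority_vote(decisions: dict[str, bool]) -> bool:
    """Return True when more than half of the detectors report drift."""
    votes_for_drift = sum(decisions.values())
    return votes_for_drift > len(decisions) / 2


# Hypothetical detector outputs, for illustration only
example_decisions = {"MMD": True, "CVM": False, "KS": True}
print(f"Drift by majority vote? {majority_vote(example_decisions)}")
```

In this notebook, the dictionary could be built from each detector's `predict(...).drifted` value once the detectors have been fit. Using an odd number of detectors, as here, avoids ties.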

### Fit the detectors

Each drift detector needs a reference set that the operational set will be compared against. In the following code, you will set the reference data to the training embeddings.


In [None]:
# A type alias for all of the drift detectors
DriftDetector = DriftMMD | DriftCVM | DriftKS

# Create a mapping for the detectors to iterate over
detectors: dict[str, DriftDetector] = {
    "MMD": DriftMMD(train_embs),
    "CVM": DriftCVM(train_embs),
    "KS": DriftKS(train_embs),
}

### Make predictions

Now that the detectors are set up, predictions can be made against the operational embeddings you made earlier.


In [None]:
# Iterate and print the name of the detector class and its boolean drift prediction
for name, detector in detectors.items():
    print(f"{name} detected drift? {detector.predict(operational_embs).drifted}")

Did you expect these results?

There is no drift detected between the train and operational embeddings because they come from very similar distributions.\
Ideally, your training data and your validation data, which we used as operational, come from the same distribution. This is the purpose of [data splitters](https://scikit-learn.org/stable/api/sklearn.model_selection.html#splitters).

So how do we know if the detectors can detect drift?

Well, add some random Gaussian noise to the operational embeddings and find out.


In [None]:
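# Sample noisy embeddings from a normal distribution centered on each operational value
# (std=1.0 is the default standard deviation when only the mean is given)
noisy_embs = torch.normal(mean=operational_embs, std=1.0)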

In [None]:
# Iterate and print the name of the detector class and its boolean drift prediction
for name, detector in detectors.items():
    print(f"{name} detected drift? {detector.predict(noisy_embs).drifted}")

Now drift is detected!

Adding Gaussian noise was enough to cause a noticeable change in the drift detectors, but this is not always the case. There are many [types of drift](../concepts/Drift.md#formal-definition-and-types-of-drift) that data can and will experience.
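
To build intuition for why added noise registers as drift, the sketch below computes a two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical cumulative distribution functions, on a single synthetic feature. It is a simplified illustration under stated assumptions and not DataEval's implementation: the data is synthetic standard-normal values rather than real embeddings, `ks_statistic` and `demo_rng` are hypothetical names, and no p-value or multiple-feature correction is computed.

```python
import numpy as np


def ks_statistic(reference: np.ndarray, sample: np.ndarray) -> float:
    """Largest gap between the empirical CDFs of two 1-D samples."""
    reference = np.sort(reference)
    sample = np.sort(sample)
    # Evaluate both empirical CDFs at every observed value
    points = np.concatenate([reference, sample])
    cdf_reference = np.searchsorted(reference, points, side="right") / reference.size
    cdf_sample = np.searchsorted(sample, points, side="right") / sample.size
    return float(np.max(np.abs(cdf_reference - cdf_sample)))


# Synthetic stand-ins for one embedding feature (illustration only)
demo_rng = np.random.default_rng(0)
reference = demo_rng.normal(size=5000)
operational = demo_rng.normal(size=5000)  # same distribution as the reference
noisy = operational + demo_rng.normal(size=5000)  # unit-variance Gaussian noise added

clean_stat = ks_statistic(reference, operational)
noisy_stat = ks_statistic(reference, noisy)
print(f"KS statistic without noise: {clean_stat:.3f}")
print(f"KS statistic with noise:    {noisy_stat:.3f}")
```

Adding independent noise widens the distribution of the feature, so its empirical CDF pulls away from the reference and the statistic grows. A detector then decides whether a gap of that size is larger than chance alone would explain.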

In this step, you learned how to take your generated embeddings and detect drift between the training and operational image data. While no drift was detected originally, adding random noise to the operational embeddings shifted their distribution enough for the detectors to flag drift.

Next you will look at the labels' distributions.


## Step 3: Parity


Instead of looking at the images, you can compare the distributions of the labels using a method called [label parity](../concepts/LabelParity.md).\
There is parity between two sets of labels if the label frequencies are approximately equal.
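
One common way to quantify how far two sets of label frequencies are from "approximately equal" is a chi-squared comparison of observed and expected class counts. The sketch below is illustrative only and is not necessarily how `label_parity` computes its result: `chi_squared_statistic` is a hypothetical helper, the label lists are made up, and classes that never appear in the reference set are skipped as a simplification.

```python
from collections import Counter


def chi_squared_statistic(reference_labels, operational_labels, num_classes):
    """Chi-squared statistic comparing operational label counts to reference proportions."""
    reference_counts = Counter(reference_labels)
    operational_counts = Counter(operational_labels)
    # Rescale reference counts so expected counts sum to the operational set size
    scale = len(operational_labels) / len(reference_labels)

    statistic = 0.0
    for class_id in range(num_classes):
        expected = reference_counts[class_id] * scale
        observed = operational_counts[class_id]
        if expected == 0:
            continue  # simplification: skip classes absent from the reference set
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up labels for three classes, for illustration only
reference_labels = [0] * 50 + [1] * 30 + [2] * 20
matched_labels = [0] * 25 + [1] * 15 + [2] * 10  # same proportions as the reference
skewed_labels = [0] * 10 + [1] * 15 + [2] * 25  # class frequencies reversed

matched_stat = chi_squared_statistic(reference_labels, matched_labels, num_classes=3)
skewed_stat = chi_squared_statistic(reference_labels, skewed_labels, num_classes=3)
print(f"Matched proportions: {matched_stat:.2f}")
print(f"Skewed proportions:  {skewed_stat:.2f}")
```

A statistic near zero means the operational label frequencies closely follow the reference frequencies. Larger values indicate a bigger mismatch, which a statistical test converts into a small p-value.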

You will now compare the label distributions using the `label_parity` function.


In [None]:
# The VOC dataset has 20 classes
label_parity(Metadata(train_ds).targets.labels, Metadata(operational_ds).targets.labels, num_classes=20).p_value

The returned {class}`.ParityOutput` reports a `p_value` of approximately **0.95**. Because this is close to 1.0 and far above common significance thresholds such as 0.05, there is no evidence that the label distributions differ, so it can be said that the two distributions **have** parity.


## Conclusion

In this tutorial, you have learned to create embeddings from the VOC dataset, look for drift between two sets of data, and calculate the parity of two label distributions. These are important steps when monitoring data, because drift and lack of parity can affect a model's ability to match the performance recorded during model training. When data drift is detected or the label distributions lack parity, it is a good idea to consider retraining the model and incorporating operational data into the dataset.

---

## What's next

As a metrics library, DataEval plays a small but impactful role in data monitoring.\
Visit these additional resources for more information on other aspects:

- Read about the entire [monitoring](../concepts/workflows/ML_Lifecycle.md#monitoring) stage of the AI/ML lifecycle
- Explore DataEval's [API reference](../reference/autoapi/dataeval/index.rst) for drift and other monitoring tools
