# Data Monitoring Guide


## Introduction

Monitoring is a critical step in the AI/ML lifecycle. When a model is deployed, data can, and generally will, drift from the distribution on which the model was originally trained, or may be fundamentally different from the outset for a variety of reasons. One critical step in AI T&E is the detection of changes in the operational distribution so that one may proactively address them. While some changes will not affect performance, significant deviation is often associated with model degradation.

In this tutorial, you will walk through the steps of detecting drift and checking label parity between a training dataset and an operational dataset.

For this tutorial, you will use the VOC dataset, an image dataset used for computer vision competitions. You will compare the image distribution of the `train` split to that of the `val` split, treating the `val` split as if it were an operational dataset.


### What you'll need


You'll begin by importing the necessary libraries for this tutorial.


In [None]:
try:
    import google.colab  # noqa: F401

    # To test a version of DataEval other than the latest, append the version
    # specifier (==X.XX.X) to the package name in the statement below
    %pip install -q dataeval[torch]
except Exception:
    pass

    # install numpy
    %pip install numpy 

import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets, models

# Drift
from dataeval.detectors.drift import DriftCVM, DriftKS, DriftMMD
from dataeval.metrics.bias import label_parity

# Seed the random number generator for reproducibility
rng = np.random.default_rng(213)

### What you'll learn

- You'll learn how to detect drift on an object detection dataset
- You'll learn how to measure label parity between your training and operational datasets
- You'll learn how to use embeddings to process large image datasets efficiently


## Step 1: Constructing Embeddings


### Encoding Images

The first step in many aspects of data monitoring is reducing images to a lower-dimensional representation that the tools can operate on. To do this, you will use existing model weights from ResNet18 and apply them to the VOC dataset. A more in-depth look at this dataset and the construction of embeddings can be found in the [EDA Tutorial](./EDA_Part1.ipynb).


The first steps are defining the encoder network and embedding the training images.


In [None]:
# Define the embedding network
class EmbeddingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.model.fc = nn.Linear(self.model.fc.in_features, 128)

    def forward(self, x):
        x = self.model(x)
        return x


embedding_net = EmbeddingNet()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_net.to(device)


# Extract embeddings
def extract_embeddings(dataset, model):
    model.eval()

    embeddings = torch.empty(size=(0, 128)).to(device)
    with torch.no_grad():
        images = []
        for i, (img, _) in enumerate(dataset):
            images.append(img)
            if (i + 1) % 64 == 0:
                inputs = torch.stack(images, dim=0).to(device)
                outputs = model(inputs)
                embeddings = torch.vstack((embeddings, outputs))
                images = []
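        # Embed the final partial batch; skip it when the dataset size is an
        # exact multiple of 64, since torch.stack fails on an empty list
        if images:
            inputs = torch.stack(images, dim=0).to(device)
            outputs = model(inputs)
            embeddings = torch.vstack((embeddings, outputs))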
    return embeddings.detach().cpu().numpy()
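
The batching pattern used in `extract_embeddings` can be isolated into a small helper. The sketch below is illustrative only: `batched` is a hypothetical name, not part of PyTorch or DataEval. It groups any iterable into fixed-size batches and still yields the final, smaller batch when the number of items is not a multiple of the batch size.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # iterator exhausted: stop without yielding an empty batch
            return
        yield batch


# Five items in batches of two: the last batch holds the single leftover item.
print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Because an empty batch is never yielded, the caller does not need a special case for datasets whose size is an exact multiple of the batch size.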

Next, you will load the training dataset with the preprocessing that the pretrained model expects, and then run the model to get the image embeddings.


In [None]:
# Define pretrained model transformations
preprocess = models.ResNet18_Weights.DEFAULT.transforms()

# Load the dataset
dataset = datasets.VOCDetection("./data", year="2011", image_set="train", download=False, transform=preprocess)

# Create image embeddings
embeddings = extract_embeddings(dataset, embedding_net)

In [None]:
np.shape(embeddings)

Each image is now represented by a 128-dimensional embedding. Next, you will do the same for the operational dataset.


In [None]:
# Load the 'operational' dataset
op_dataset = datasets.VOCDetection("./data", year="2011", image_set="val", download=False, transform=preprocess)

# Create image embeddings
op_embeddings = extract_embeddings(op_dataset, embedding_net)

In [None]:
np.shape(op_embeddings)

## Step 2: Drift

Now that you have embedded both sets of images into 128-dimensional space, you would like to determine whether the `val` dataset has drifted from the `train` dataset.
You will use three DataEval drift detectors to make this determination. Each operates by comparing the distributions of embeddings between the two image sets and produces a p-value, where a small value means that it is very unlikely that the two sets of embeddings come from the same distribution, and therefore drift has likely occurred. Based on its p-value(s), each detector outputs a binary `is_drift` flag, which you will examine here.
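
To make the idea of a distribution-level p-value concrete, here is a minimal NumPy sketch of a kernel two-sample test based on the maximum mean discrepancy (MMD) with a permutation p-value. It is not the DataEval implementation: the function names, the RBF kernel with a median-heuristic bandwidth, and the number of permutations are all assumptions made for illustration.

```python
import numpy as np


def pairwise_sq_dists(z: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance between every pair of rows in z."""
    sq_norms = np.sum(z**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    return np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by rounding


def mmd2(kernel: np.ndarray, n_ref: int) -> float:
    """Biased MMD^2 estimate; the first n_ref rows/columns are the reference set."""
    k_xx = kernel[:n_ref, :n_ref]
    k_yy = kernel[n_ref:, n_ref:]
    k_xy = kernel[:n_ref, n_ref:]
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())


def mmd_permutation_test(x_ref, x_test, n_permutations=200, seed=0):
    """Return (observed MMD^2, permutation p-value) for two sets of embeddings."""
    rng = np.random.default_rng(seed)
    z = np.vstack([x_ref, x_test]).astype(np.float64)
    n_ref = len(x_ref)

    sq_dists = pairwise_sq_dists(z)
    # Median heuristic for the RBF bandwidth (an assumption of this sketch).
    positive = sq_dists[sq_dists > 0]
    sigma_sq = np.median(positive) / 2.0 if positive.size else 1.0
    kernel = np.exp(-sq_dists / (2.0 * sigma_sq))

    observed = mmd2(kernel, n_ref)
    # Shuffle which rows count as "reference" and recompute the statistic.
    exceed = 0
    for _ in range(n_permutations):
        idx = rng.permutation(len(z))
        if mmd2(kernel[np.ix_(idx, idx)], n_ref) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value
```

A small p-value (for example, below 0.05) means that few random relabelings of the pooled embeddings produce a discrepancy as large as the observed one, which is the evidence for drift that an `is_drift` flag summarizes.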


In [None]:
d1 = DriftMMD(embeddings)
d2 = DriftCVM(embeddings)
d3 = DriftKS(embeddings)
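
Univariate detectors such as KS-based ones are commonly applied to multivariate embeddings by testing each feature separately and then correcting for multiple comparisons. The sketch below shows that general recipe with `scipy.stats.ks_2samp` and a Bonferroni correction. It is an assumption about the overall approach, not a copy of DataEval's internals, and `ks_drift_per_feature` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drift_per_feature(x_ref, x_test, alpha=0.05):
    """Per-feature two-sample KS tests with a Bonferroni-corrected threshold."""
    x_ref = np.asarray(x_ref)
    x_test = np.asarray(x_test)
    n_features = x_ref.shape[1]

    # One KS test per embedding dimension.
    p_values = np.array(
        [ks_2samp(x_ref[:, j], x_test[:, j]).pvalue for j in range(n_features)]
    )

    # Bonferroni: divide the significance level by the number of tests.
    threshold = alpha / n_features
    is_drift = bool(np.any(p_values < threshold))
    return is_drift, p_values, threshold
```

With 128 embedding dimensions and `alpha=0.05`, a single feature would need a p-value below roughly 0.0004 to trigger a drift flag under this correction.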

In [None]:
d1.predict(op_embeddings).is_drift

In [None]:
d2.predict(op_embeddings).is_drift

In [None]:
d3.predict(op_embeddings).is_drift

Since these two image sets are random subsets of the same dataset, you unsurprisingly do not detect any drift. However, let's add some Gaussian noise to the operational embeddings to see how the drift detectors respond.


In [None]:
# Use the seeded generator defined earlier so the perturbation is reproducible
perturbed_op_embeddings = np.float32(op_embeddings + rng.normal(size=np.shape(op_embeddings)))
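
If you want to probe how sensitive a detector is, it can help to control the noise level explicitly. The helper below is a hypothetical sketch (`add_gaussian_noise` and its `scale` and `seed` parameters are not part of DataEval) that adds zero-mean Gaussian noise with a chosen standard deviation and returns `float32` embeddings.

```python
import numpy as np


def add_gaussian_noise(embeddings, scale=1.0, seed=0):
    """Return a float32 copy of `embeddings` with N(0, scale^2) noise added."""
    rng = np.random.default_rng(seed)  # seeded so the perturbation is repeatable
    noise = rng.normal(loc=0.0, scale=scale, size=np.shape(embeddings))
    return (np.asarray(embeddings) + noise).astype(np.float32)
```

Sweeping `scale` from small to large values and calling `predict` on each result shows roughly where a detector starts to flag drift.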

In [None]:
d1.predict(perturbed_op_embeddings).is_drift

In [None]:
d2.predict(perturbed_op_embeddings).is_drift

In [None]:
d3.predict(perturbed_op_embeddings).is_drift

When you perturb the operational embeddings, you find that drift is detected. To give a more realistic example, you can also look at an individual class from the operational set.


In [None]:
labels = []
for data in op_dataset:
    objects = data[1]["annotation"]["object"]
    names = []
    for each in objects:
        names.append(each["name"])
    labels.append(names)

In [None]:
# Subset embeddings of images which contain a chair
chair_embeddings = op_embeddings[[("chair" in i) for i in labels], :]

In [None]:
d1.predict(chair_embeddings).is_drift

In [None]:
d2.predict(chair_embeddings).is_drift

In [None]:
d3.predict(chair_embeddings).is_drift

In both cases, the drift detectors pick up on even very simple changes, while reporting no drift when the operational data is indistinguishable from the data on which the model was trained.


## Step 3: Parity


Another task you might want to perform in monitoring is checking the parity of classes between training and operational datasets. Two datasets have label parity if their label frequencies are (approximately) equal. Let's check whether the distribution of the objects in each image is the same between datasets.
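
A parity check of this kind is typically built on Pearson's chi-squared statistic, which compares observed label counts with the counts expected from the reference label frequencies. The sketch below computes that statistic with NumPy. It is an illustration under stated assumptions, not DataEval's implementation: `chi_squared_parity_statistic` is a hypothetical name, and skipping classes that never occur in the reference data is a simplification chosen here.

```python
import numpy as np


def chi_squared_parity_statistic(expected_labels, observed_labels, n_classes):
    """Pearson chi-squared statistic between two sets of integer labels."""
    expected_counts = np.bincount(expected_labels, minlength=n_classes).astype(float)
    observed_counts = np.bincount(observed_labels, minlength=n_classes).astype(float)

    # Rescale the reference counts so both sets have the same total.
    expected_scaled = expected_counts * observed_counts.sum() / expected_counts.sum()

    # Skip classes absent from the reference set to avoid dividing by zero
    # (a simplification made for this sketch).
    present = expected_scaled > 0
    diff = observed_counts[present] - expected_scaled[present]
    return float(np.sum(diff**2 / expected_scaled[present]))
```

A statistic near zero indicates matching label frequencies. A p-value can be obtained from the chi-squared distribution with `n_classes - 1` degrees of freedom, for example with `scipy.stats.chi2.sf(statistic, n_classes - 1)`.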


In [None]:
def flat_object_labels(ds):
    """Return the class name of every annotated object in the dataset as one flat list."""
    return [obj["name"] for _, target in ds for obj in target["annotation"]["object"]]


op_labels = flat_object_labels(op_dataset)
labels = flat_object_labels(dataset)

In [None]:
from sklearn import preprocessing

# Turn string labels into integer labels so the DataEval parity function can read them.
le = preprocessing.LabelEncoder()
le.fit(labels)
label_int = le.transform(labels)
op_label_int = le.transform(op_labels)
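
One caveat: `LabelEncoder.transform` raises a `ValueError` if the operational labels contain a class that was not seen during `fit`. A simple way around this is to build the mapping from the union of both label sets. The sketch below uses plain Python, and `build_label_mapping` and `encode` are hypothetical helper names introduced for illustration.

```python
def build_label_mapping(*label_lists):
    """Map each distinct class name across all lists to a stable integer index."""
    classes = sorted(set().union(*label_lists))
    return {name: idx for idx, name in enumerate(classes)}


def encode(labels, mapping):
    """Convert string labels to integers using a shared mapping."""
    return [mapping[name] for name in labels]


mapping = build_label_mapping(["cat", "dog"], ["dog", "bird"])
print(mapping)                           # {'bird': 0, 'cat': 1, 'dog': 2}
print(encode(["dog", "bird"], mapping))  # [2, 0]
```

Sorting the class names keeps the integer assignment deterministic across runs, so the same class always receives the same index in both datasets.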

In [None]:
# The third argument is the number of classes (VOC has 20 object classes)
label_parity(label_int, op_label_int, 20).p_value

As expected, there is no discernible difference in the distribution of classes between the two datasets: the p-value is extremely high, so the test gives no reason to reject the hypothesis that the label frequencies match.


## Conclusion


You have checked for potential issues in the operational dataset that may affect the model after deployment. Both drift and a lack of class parity can reduce a model's ability to achieve the performance recorded during training. If you detect that a dataset has drifted significantly, or that parity has been violated, consider retraining the model and incorporating operational data into that retraining.
