# Drift Detection Tutorial Using Multiple Drift Detectors


## _Problem Statement_

When evaluating and monitoring data after model deployment, it is important to test incoming data for potential drift which may affect model performance.


### _When to use_

The `dataeval.detectors` drift detection classes should be used when you would like to test new data for operational drift.


### _What you will need_

1. A set of image embeddings for each dataset (usually obtained with an AutoEncoder)
2. A Python environment with the following packages installed:
   - `dataeval[torch]` or `dataeval[all]`


### _Setting up_

Let's import the libraries needed to set up a minimal working example.


In [None]:
try:
    import google.colab  # noqa: F401

    # specify the version of DataEval (==X.XX.X) for versions other than the latest
    %pip install -q dataeval[torch]
except Exception:
    pass

In [1]:
from functools import partial

import numpy as np
import torch

from dataeval._internal.datasets import MNIST
from dataeval.detectors.drift import (
    DriftCVM,
    DriftKS,
    DriftMMD,
    preprocess_drift,
)
from dataeval.utils.torch.models import AriaAutoencoder

device = "cuda" if torch.cuda.is_available() else "cpu"


## Loading in data

Let's start by loading the MNIST dataset through DataEval's `MNIST` wrapper, then we will examine it.


In [2]:
# Load in the training mnist dataset and use the first 4000
train_ds = MNIST(root="./data/", train=True, download=True, size=4000, dtype=np.float32, channels="channels_first")

# Split out the images and labels
images, labels = train_ds.data, train_ds.targets

In [3]:
print("Number of samples: ", len(images))
print("Image shape:", images[0].shape)

Number of samples:  4000
Image shape: (1, 28, 28)


## Test reference against control

Let's check for drift between the first 2000 images and the second 2000 images from this sample.


In [4]:
data_reference = images[0:2000]
data_control = images[2000:]

To reduce the dimensionality of the data, we can pass the encoder of a simple autoencoder to each detector through `preprocess_fn`. This step is optional for MNIST, but it is highly recommended for datasets with higher dimensionality.
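
To give a feel for what such a preprocessing step does, here is a minimal sketch of batched encoding. The helper `encode_in_batches` and the `toy_encoder` are hypothetical names introduced for illustration only; this is not the implementation of `preprocess_drift`, and the toy encoder is not `AriaAutoencoder`.

```python
import numpy as np
import torch


def encode_in_batches(x, model, batch_size=64, device="cpu"):
    """Hypothetical stand-in for a drift preprocessing function.

    Runs `model` over `x` in mini-batches without tracking gradients and
    returns the flattened outputs as a single NumPy array.
    """
    model = model.to(device).eval()
    outputs = []
    with torch.no_grad():
        for start in range(0, len(x), batch_size):
            batch = torch.as_tensor(x[start : start + batch_size], dtype=torch.float32, device=device)
            outputs.append(model(batch).flatten(start_dim=1).cpu().numpy())
    return np.concatenate(outputs, axis=0)


# Toy encoder (an assumption for this sketch): 1x28x28 image -> 32 features
toy_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 32))

# Random stand-in images with the same shape as MNIST samples
fake_images = np.random.default_rng(0).random((200, 1, 28, 28), dtype=np.float32)
embeddings = encode_in_batches(fake_images, toy_encoder)
print(embeddings.shape)
```

Each 784-pixel image is reduced to a 32-dimensional vector, so a drift test applied afterwards compares far fewer features per sample.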

For the purposes of this tutorial, we will use three drift detectors: Maximum Mean Discrepancy (MMD), Cramér-von Mises (CVM), and Kolmogorov-Smirnov (KS).
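
KS and CVM are univariate two-sample tests, so one common way to apply them to multi-feature data is to test each feature separately and correct for multiple comparisons. The sketch below illustrates that idea with SciPy on synthetic data. The helper `featurewise_drift` and the Bonferroni correction are assumptions made for illustration and are not necessarily what `DriftKS` and `DriftCVM` do internally.

```python
import numpy as np
from scipy.stats import cramervonmises_2samp, ks_2samp


def featurewise_drift(x_ref, x_test, p_val=0.05):
    """Run KS and CVM two-sample tests on each feature column.

    Hypothetical helper: drift is flagged for a test when any feature's
    p-value falls below a Bonferroni-corrected threshold.
    """
    n_features = x_ref.shape[1]
    threshold = p_val / n_features  # Bonferroni correction
    results = {}
    for name, test in [("KS", ks_2samp), ("CVM", cramervonmises_2samp)]:
        p_vals = np.array([test(x_ref[:, i], x_test[:, i]).pvalue for i in range(n_features)])
        results[name] = bool((p_vals < threshold).any())
    return results


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
shifted = rng.normal(loc=1.0, size=(500, 8))  # every feature shifted by one standard deviation

print(featurewise_drift(reference, reference))  # identical samples
print(featurewise_drift(reference, shifted))  # clearly shifted samples
```

Comparing a sample against itself yields no drift, while a one-standard-deviation shift in every feature is flagged by both tests.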


In [5]:
# define encoder
encoder_net = AriaAutoencoder(1).encoder.to(device)

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=64, device=device)

# initialise drift detectors
detectors = [detector(data_reference, preprocess_fn=preprocess_fn) for detector in [DriftMMD, DriftCVM, DriftKS]]

We expect every detector to report no drift, since the reference and control sets both come from the same MNIST training data.


In [6]:
[(type(detector).__name__, detector.predict(data_control).is_drift) for detector in detectors]

[('DriftMMD', False), ('DriftCVM', False), ('DriftKS', False)]

## Loading in corrupted data

Now let's load in a corrupted MNIST dataset.


In [7]:
corruption = MNIST(
    root="./data",
    train=True,
    download=False,
    size=2000,
    dtype=np.float32,
    channels="channels_first",
    corruption="translate",
)
corrupted_images = corruption.data

Files already downloaded and verified


In [8]:
print("Number of corrupted samples: ", len(corrupted_images))
print("Corrupted image shape:", corrupted_images[0].shape)

Number of corrupted samples:  2000
Corrupted image shape: (1, 28, 28)


## Check for drift against corrupted data

Test for drift between the corrupted dataset and the original reference set using all 3 detectors.


In [9]:
[(type(detector).__name__, detector.predict(corrupted_images).is_drift) for detector in detectors]

[('DriftMMD', True), ('DriftCVM', True), ('DriftKS', True)]

We conclude that the translated MNIST images are significantly different from the original images according to all three drift detectors.
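
For readers curious about how a multivariate test such as MMD arrives at a "significantly different" verdict, the following NumPy sketch computes an unbiased MMD² estimate with an RBF kernel and derives a p-value from a permutation test. The function names, the median-heuristic bandwidth, and the number of permutations are assumptions chosen for illustration; this is not the `DriftMMD` implementation.

```python
import numpy as np


def mmd2_from_kernel(k, n):
    """Unbiased MMD^2 estimate from a kernel matrix whose first n rows/columns belong to sample X."""
    m = k.shape[0] - n
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * k_xy.mean()


def mmd_permutation_test(x, y, n_permutations=200, seed=0):
    """Return the observed MMD^2 and a permutation-test p-value (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    sq_dists = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(axis=-1)
    sigma2 = np.median(sq_dists[sq_dists > 0])  # median heuristic for the kernel bandwidth
    kernel = np.exp(-sq_dists / (2 * sigma2))
    n = len(x)
    observed = mmd2_from_kernel(kernel, n)
    permuted = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle sample membership and recompute the statistic under the null hypothesis
        idx = rng.permutation(len(pooled))
        permuted[i] = mmd2_from_kernel(kernel[np.ix_(idx, idx)], n)
    return float(observed), float((permuted >= observed).mean())


rng = np.random.default_rng(0)
ref_features = rng.normal(size=(100, 4))
drifted_features = rng.normal(loc=1.0, size=(100, 4))  # mean shifted in every dimension

mmd2, p_value = mmd_permutation_test(ref_features, drifted_features)
print(f"MMD^2 = {mmd2:.4f}, p-value = {p_value:.3f}")
```

Drift is declared when the p-value falls below a chosen significance level such as 0.05, meaning the observed statistic is larger than almost all statistics obtained after randomly reassigning samples to the two groups.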
