# HP Divergence Estimation Tutorial


## _Problem Statement_

When evaluating new test data or comparing two datasets, we often want a quantitative way to measure shifts in covariates. HP (Henze-Penrose) divergence is a nonparametric divergence measure of how far apart the distributions of two datasets are. A divergence of 0 means the two datasets are approximately identically distributed; a divergence of 1 means they are completely separable.
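
To make the idea concrete, here is a minimal sketch of one common way to estimate HP divergence: build a minimum spanning tree (MST) over the pooled samples and count the edges that join a point from one dataset to a point from the other. The function name `hp_divergence_mst` and the synthetic Gaussian data are assumptions for illustration; this is not DataEval's implementation, which may use a different estimator.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist


def hp_divergence_mst(a, b):
    """Illustrative MST-based HP divergence estimate (hypothetical helper)."""
    # Flatten each sample to a vector so images and embeddings are handled alike
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)

    # Pool both datasets and remember which dataset each point came from
    pooled = np.vstack((a, b))
    source = np.concatenate((np.zeros(n, dtype=int), np.ones(m, dtype=int)))

    # MST over the dense pairwise-distance graph. Note: SciPy treats zero
    # entries as missing edges, so exact duplicate points are not handled here.
    mst = minimum_spanning_tree(cdist(pooled, pooled)).tocoo()

    # Count MST edges whose endpoints come from different datasets
    cross_edges = int(np.sum(source[mst.row] != source[mst.col]))

    # Many cross edges means the datasets are well mixed (divergence near 0)
    estimate = max(0.0, 1.0 - cross_edges * (n + m) / (2.0 * n * m))
    return estimate, cross_edges


# Synthetic data (an assumption for this sketch, not the tutorial's MNIST data)
rng = np.random.default_rng(0)
same_a = rng.normal(size=(200, 2))
same_b = rng.normal(size=(200, 2))
shifted = rng.normal(loc=50.0, size=(200, 2))

div_same, _ = hp_divergence_mst(same_a, same_b)
div_far, cross_far = hp_divergence_mst(same_a, shifted)
print(f"same distribution: {div_same:.3f}")
print(f"well separated:    {div_far:.3f} (cross edges: {cross_far})")
```

Two samples drawn from the same distribution interleave in the tree, so many edges cross between them and the estimate stays near 0. Two far-apart clusters are joined by a single bridging edge, which pushes the estimate toward 1.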


### _When to use_

The `divergence` function should be used when you would like to know how far two datasets have diverged from one another, for example to measure operational drift.


### _What you will need_

1. A set of image embeddings for each dataset (usually obtained with an AutoEncoder)
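
If no trained autoencoder is at hand, a rough stand-in is to flatten each image and project it onto a few principal components. The sketch below does this with NumPy only. PCA is a substitute technique named here for illustration, not what this tutorial uses, and `pca_embed`, the random input images, and the choice of 16 components are all assumptions.

```python
import numpy as np


def pca_embed(images, n_components=16):
    """Project flattened images onto their top principal components (hypothetical helper)."""
    flat = np.asarray(images, dtype=float).reshape(len(images), -1)
    mean = flat.mean(axis=0)
    centered = flat - mean
    # Rows of vt are principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components


# Random stand-in images with the same shape as MNIST samples (an assumption)
rng = np.random.default_rng(0)
fake_images = rng.random((100, 1, 28, 28))
embeddings, mean, components = pca_embed(fake_images)
print(embeddings.shape)
```

When comparing two datasets, fit the projection on one of them (or on the pooled data) and apply the same `mean` and `components` to both, so that the embeddings live in the same space.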


### _Setting up_

Let's import the libraries needed to set up a minimal working example.


In [None]:
try:
    import google.colab  # noqa: F401

    # specify the version of DataEval (==X.XX.X) for versions other than the latest
    %pip install -q dataeval
except Exception:
    pass

In [1]:
from dataeval._internal.datasets import MNIST
from dataeval.metrics.estimators import divergence


## Loading in data

Let's start by loading the MNIST dataset,
then we will examine it.


In [2]:
# Load the MNIST training set and keep the first 4000 samples
train_ds = MNIST(root="./data/", train=True, download=True, size=4000, flatten=True)

# Split out the images and labels
images, labels = train_ds.data, train_ds.targets

In [3]:
print("Number of samples: ", len(images))
print("Image shape:", images[0].shape)

Number of samples:  4000
Image shape: (1, 28, 28)


## Calculate initial divergence

Let's calculate the divergence between the first 2000 images and the second 2000 images from this sample.


In [4]:
data_a = images[0:2000]
data_b = images[2000:]

In [5]:
div = divergence(data_a, data_b)
print(div)

DivergenceOutput(divergence=0.025000000000000022, errors=1950.0)


We estimate that the divergence between these (identically distributed) image sets is close to 0 (about 0.025).


## Loading in corrupted data

Now let's load a corrupted MNIST dataset, in which the images have been translated.


In [6]:
corruption = MNIST(root="./data", train=True, download=False, size=2000, flatten=True, corruption="translate")
corrupted_images = corruption.data

Files already downloaded and verified


In [7]:
print("Number of corrupted samples: ", len(corrupted_images))
print("Corrupted image shape:", corrupted_images[0].shape)

Number of corrupted samples:  2000
Corrupted image shape: (1, 28, 28)


## Calculate corrupted divergence

Now let's calculate the divergence between this corrupted dataset and the original images.


In [8]:
div = divergence(data_a, corrupted_images)
print(div)

DivergenceOutput(divergence=0.9655, errors=69.0)


With a divergence of about 0.97, close to the maximum of 1, we conclude that the translated MNIST images are distributed very differently from the original images.
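
The two printed results are consistent with a simple relation between the `errors` count and the reported divergence: `1 - errors * (N + M) / (2 * N * M)`, where `N` and `M` are the dataset sizes (2000 each here). The check below only verifies that arithmetic against the numbers shown above. The helper `divergence_from_errors` is hypothetical, and that DataEval computes the value this way is an assumption inferred from the outputs.

```python
def divergence_from_errors(errors, n, m):
    """Hypothetical helper: map an error count to a divergence in [0, 1]."""
    return max(0.0, 1.0 - errors * (n + m) / (2.0 * n * m))


# Error counts and dataset sizes taken from the two outputs above
identical = divergence_from_errors(1950.0, 2000, 2000)
translated = divergence_from_errors(69.0, 2000, 2000)

print(f"identical halves:  {identical:.4f}")
print(f"translated images: {translated:.4f}")
```

Read this way, `errors` falls as the datasets separate: 1950 for the two identically distributed halves versus 69 for the translated images.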
