# HP Divergence Estimation Tutorial


## _Problem Statement_

When evaluating new test data or comparing two datasets, we often want a quantitative way to measure shifts in covariates. HP (Henze-Penrose) divergence is a nonparametric divergence metric that estimates how far apart the distributions of two datasets are. A divergence of 0 means that the two datasets are approximately identically distributed. A divergence of 1 means the two datasets are completely separable.
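
To build some intuition for the 0-to-1 scale, here is a minimal NumPy-only sketch of a nearest-neighbor style estimate. It counts how many points have their nearest neighbor in the other dataset and rescales that count so that well-mixed data lands near 0 and fully separated data lands at 1. The function name `hp_divergence_nn` is hypothetical, and this is an illustration under those assumptions, not necessarily how `dataeval` computes its result.

```python
import numpy as np


def hp_divergence_nn(data_a, data_b):
    """Illustrative nearest-neighbor estimate of divergence (hypothetical helper)."""
    a = np.asarray(data_a, dtype=np.float64)
    b = np.asarray(data_b, dtype=np.float64)
    m, n = len(a), len(b)

    # Pool both datasets and remember which dataset each row came from
    data = np.vstack([a, b])
    labels = np.concatenate([np.zeros(m, dtype=int), np.ones(n, dtype=int)])

    # Squared Euclidean distances between all pairs of pooled points
    sq = np.sum(data**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (data @ data.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor

    # Count points whose nearest neighbor belongs to the other dataset
    nearest = np.argmin(d2, axis=1)
    errors = int(np.sum(labels[nearest] != labels))

    # Rescale so that well-mixed data gives ~0 and separated data gives 1
    return max(0.0, 1.0 - errors * (m + n) / (2.0 * m * n))


rng = np.random.default_rng(0)
same_a = rng.normal(size=(200, 2))
same_b = rng.normal(size=(200, 2))            # same distribution as same_a
far_b = rng.normal(loc=100.0, size=(200, 2))  # far away from same_a

d_same = hp_divergence_nn(same_a, same_b)
d_far = hp_divergence_nn(same_a, far_b)
print("same distribution:", d_same)
print("separated:", d_far)
```

When the two samples come from the same distribution, roughly half of all nearest neighbors fall in the other dataset, so the estimate sits near 0. When the samples do not overlap, no nearest neighbor crosses over and the estimate is 1.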


### _When to use_

The `divergence` function should be used when you would like to know how far two datasets have diverged from one another, for example to measure operational drift.


### _What you will need_

1. A set of image embeddings for each dataset (usually obtained with an autoencoder)
2. A Python environment with the following packages installed:
   - `dataeval`
   - `tensorflow-datasets`
   - `pytest`
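
If no trained autoencoder is at hand, a linear projection can serve as a rough stand-in for producing embeddings. The sketch below uses PCA (via NumPy's SVD) rather than an autoencoder, fits it on a reference dataset, and applies the same projection to both datasets so they share one embedding space. The helper names `fit_pca` and `embed` and the choice of 8 components are assumptions made for illustration.

```python
import numpy as np


def fit_pca(reference, n_components=8):
    """Fit a PCA projection on flattened reference images (hypothetical helper)."""
    x = np.asarray(reference, dtype=np.float64)
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]


def embed(flat_images, mean, components):
    """Project flattened images into the space fitted by fit_pca."""
    x = np.asarray(flat_images, dtype=np.float64)
    return (x - mean) @ components.T


# Random stand-ins for flattened 28x28 images
rng = np.random.default_rng(0)
reference = rng.random((100, 784))
new_data = rng.random((50, 784))

mean, components = fit_pca(reference)
emb_reference = embed(reference, mean, components)
emb_new = embed(new_data, mean, components)
print(emb_reference.shape, emb_new.shape)
```

Fitting the projection once and reusing it matters: embedding each dataset with its own projection would place them in different coordinate systems and make any distance between them meaningless.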


### _Setting up_

Let's import the libraries needed to set up a minimal working example.


In [None]:
try:
    import google.colab  # noqa: F401

    %pip install -q dataeval==v0.68.0
except Exception:
    pass

import os

from pytest import approx

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [None]:
import numpy as np
import tensorflow_datasets as tfds

from dataeval.metrics.estimators import divergence

## Loading in data

Let's start by loading the MNIST dataset from TensorFlow Datasets, then examine it.


In [None]:
# Load in the mnist dataset from tensorflow datasets
images, ds_info = tfds.load(
    "mnist",
    split="train[:4000]",
    with_info=True,
)
tfds.visualization.show_examples(images, ds_info)
images = images.shuffle(images.cardinality())
images = np.array([i["image"] for i in images])

In [None]:
print("Number of samples: ", len(images))
print("Image shape:", images[0].shape)

## Calculate initial divergence

Let's calculate the divergence between the first 2000 images and the remaining 2000 images from this sample.
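
The next cell reshapes each batch so that every image becomes one flat row of pixel values. As a small illustration on stand-in data (random arrays with the MNIST image shape, not the real dataset), the reshape works like this:

```python
import numpy as np

# Stand-in batch shaped like MNIST: (samples, height, width, channels)
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(3, 28, 28, 1), dtype=np.uint8)

# Keep the sample axis and collapse the rest; -1 lets NumPy infer 28 * 28 * 1
flat = batch.reshape((len(batch), -1))
print(batch.shape, "->", flat.shape)
```

Each row of `flat` holds the pixels of one image, so two flattened batches can be compared as sets of points in the same 784-dimensional space.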


In [None]:
data_a = images[0:2000].reshape((2000, -1))
data_b = images[2000:].reshape((2000, -1))

In [None]:
div = divergence(data_a, data_b)
print(div)

As expected, the estimated divergence between these identically distributed image sets is at or close to 0.


## Loading in corrupted data

Now let's load a corrupted version of MNIST: the `translate` variant of `mnist_corrupted`.


In [None]:
corrupted, ds_info = tfds.load(
    "mnist_corrupted/translate",
    split="train[:2000]",
    with_info=True,
)
tfds.visualization.show_examples(corrupted, ds_info)
corrupted = corrupted.shuffle(corrupted.cardinality())
corrupted = np.array([i["image"] for i in corrupted])

In [None]:
print("Number of corrupted samples: ", len(corrupted))
print("Corrupted image shape:", corrupted[0].shape)

## Calculate corrupted divergence

Now let's calculate the divergence between this corrupted dataset and the original images.


In [None]:
data_corrupted = corrupted.reshape((2000, -1))
div = divergence(data_a, data_corrupted)
print(div)

In [None]:
### TEST ASSERTION ###
print(div)
assert div.divergence == approx(0.96, abs=0.02)

With a divergence close to 1 (about 0.96 here), we conclude that the translated MNIST images are almost completely separable from the original images.
