# How to add intrinsic metadata factors


## Problem Statement

When analyzing a dataset, its metadata may be sparse or unavailable. In those cases
you may need to add metadata yourself, either as intrinsic values calculated from the
data or as additional information that was not originally available on the source
dataset.

This guide will show you how to add in the calculated statistics from DataEval's
{func}`.imagestats` function to the metadata for bias analysis.


### _When to use_

Adding metadata factors should be done when little or no metadata is available on the
dataset, or to gain insights specific to metadata of interest that is not present natively
in the dataset metadata.


### _What you will need_

1. A dataset to analyze
2. A Python environment with the following packages installed:
   - `dataeval[all]`


## _Getting Started_

First, import the libraries needed to set up the example.


In [None]:
try:
    import google.colab  # noqa: F401

    # specify the version of DataEval (==X.XX.X) for versions other than the latest
    %pip install -q dataeval[all]
except Exception:
    pass

In [None]:
from dataeval.metrics.bias import balance, diversity, parity
from dataeval.metrics.stats import imagestats
from dataeval.utils.data import Metadata, Select
from dataeval.utils.data.datasets import CIFAR10
from dataeval.utils.data.selections import Limit, Shuffle

## Load the dataset

Begin by loading in the CIFAR-10 dataset.

The CIFAR-10 dataset contains 60,000 images: 50,000 in the train set and 10,000
in the test set. We will use a shuffled sample of 20,000 images drawn from the two
sets combined.


In [None]:
# Load in the CIFAR10 dataset and limit to 20,000 images with random shuffling
cifar10 = Select(CIFAR10("data", image_set="base", download=True), [Limit(20000), Shuffle(seed=0)])
print(cifar10)


## Inspect the metadata

You can begin by inspecting the available factor names in the dataset.


In [None]:
metadata = Metadata(cifar10)
print(f"Factor names: {metadata.discrete_factor_names + metadata.continuous_factor_names}")


A quick check of the {func}`.balance` of the single available factor shows no mutual
information between the class labels and `batch_num`, which indicates the on-disk binary
file the image was extracted from.

In [None]:
# Balance at index 0 is always the class label, so index 1 is the first metadata factor
balance(metadata).balance[1]


## Add image statistics to the metadata

In order to perform additional bias analysis on the dataset when no meaningful metadata
are provided, you will augment the metadata with statistics of the images using the
{func}`.imagestats` function.

Begin by running `imagestats` on the dataset and adding the factors to the `metadata`.
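
To make the idea of an intrinsic factor concrete, here is a minimal sketch that derives a few per-image statistics from pixel values alone. It uses only the Python standard library. The toy images and the `intrinsic_factors` helper are hypothetical, and DataEval's own statistic definitions (normalization, per-channel handling) may differ.

```python
import statistics

# Two tiny hypothetical grayscale "images" as flat lists of intensities in [0, 1]
images = [
    [0.0, 0.25, 0.5, 0.75, 1.0],  # wide spread of intensities
    [0.4, 0.5, 0.5, 0.5, 0.6],    # nearly flat image
]


def intrinsic_factors(pixels):
    """Return a few statistics computed from the pixel values alone."""
    return {
        "mean": statistics.fmean(pixels),
        "std": statistics.pstdev(pixels),
        "var": statistics.pvariance(pixels),
    }


# Collect one value per image for each factor (one column per factor)
per_image = [intrinsic_factors(img) for img in images]
factors = {name: [row[name] for row in per_image] for name in per_image[0]}
print(factors)
```

Both toy images have the same mean intensity but very different spread, which is the kind of distinction these factors can expose when no other metadata is available.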


In [None]:
# Calculate image statistics
stats = imagestats(cifar10)

# Append the factors to the metadata
metadata.add_factors(stats.factors())

:::{note}

When calculating {func}`.imagestats` for an object detection dataset, you will want
to provide `per_box=True` to get statistics calculated for each target.

:::

Next, with the `imagestats` output added to the metadata as factors, you will exclude
the factors that are uniform or without significance.

Additionally, you will specify a binning strategy for the statistical factors, which
are, for our purposes, continuous. For this example, bin everything into 10
uniform-width bins.
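
As an illustration of what uniform-width binning means, the sketch below assigns values to equal-width bins spanning the observed range. The `uniform_width_bins` helper is hypothetical and only conveys the idea. How DataEval places bin edges internally may differ.

```python
def uniform_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land in bin n_bins, so clamp it to the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


values = [0.0, 0.05, 0.31, 0.55, 0.74, 0.99, 1.0]
bins = uniform_width_bins(values, 10)
print(bins)
```

Because the bin edges depend only on the minimum and maximum, a few extreme values can stretch the range and push most samples into a handful of bins.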

In [None]:
# Exclude dimension statistics (as CIFAR10 images are all of uniform shape) and the batch_num
metadata.exclude = ["aspect_ratio", "width", "height", "depth", "channels", "size", "batch_num"]

# Provide binning for the continuous statistical factors using 10 uniform-width bins for each factor
keys = ("mean", "std", "var", "skew", "kurtosis", "entropy", "brightness", "darkness", "sharpness", "contrast", "zeros")
metadata.continuous_factor_bins = dict.fromkeys(keys, 10)

## Perform bias analysis

Now you can run the bias analysis functions {func}`.balance`, {func}`.diversity` and
{func}`.parity` on the dataset metadata augmented with intrinsic statistical factors.


In [None]:
balance_output = balance(metadata)
_ = balance_output.plot()

Notice the very high mutual information between the variance and standard deviation
of image intensities, which is expected because the variance is the square of the
standard deviation. Mean image intensity correlates with brightness, darkness, and
contrast. However, none of the intrinsic factors correlate strongly with class label.
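
For intuition about these values, the following sketch computes mutual information between binned factors directly from counts. It is a plain plug-in estimate with one common normalization (dividing by the mean of the two entropies), which is an assumption made for illustration and not necessarily the estimator `balance` uses. The binned values are hypothetical.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def normalized_mi(xs, ys):
    """Mutual information divided by the mean of the two entropies."""
    denom = (entropy(xs) + entropy(ys)) / 2
    return mutual_information(xs, ys) / denom if denom > 0 else 0.0


# Hypothetical binned factors: the variance bin is fully determined by the std bin
std_bin = [0, 0, 1, 1, 2, 2, 3, 3]
var_bin = [0, 0, 0, 0, 1, 1, 3, 3]
# Hypothetical class labels spread evenly across every std bin
label = [0, 1, 0, 1, 0, 1, 0, 1]

print("std vs var:  ", normalized_mi(std_bin, var_bin))
print("std vs label:", normalized_mi(std_bin, label))
```

A factor that is a deterministic function of another scores high, while a factor whose bins contain every class in equal proportion scores zero.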

In [None]:
_ = balance_output.plot(plot_classwise=True)

Classwise balance also indicates minimal correlation between image statistics and
individual classes. Uniform mutual information between individual classes and all class
labels indicates balanced class representation in the subsampled dataset.

In [None]:
diversity_output = diversity(metadata)
_ = diversity_output.plot()

The diversity index also indicates uniform sampling of classes within the dataset. The
apparently low diversity of kurtosis across the dataset may indicate an inadequate binning
strategy (for metric computation) given that the other statistical moments appear to be
more evenly distributed. Further investigation and iteration could be done to assess
sensitivity to binning strategy.
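
The effect of binning on a diversity score can be sketched with a toy example. The code below uses normalized Shannon entropy of bin occupancy as the diversity measure, which is an assumption for illustration and may not match the index `diversity` computes by default. The helper functions and data are hypothetical.

```python
import math
from collections import Counter


def uniform_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def normalized_shannon_diversity(bin_ids, n_bins):
    """Shannon entropy of bin occupancy divided by log(n_bins); 1.0 is perfectly even."""
    n = len(bin_ids)
    h = -sum((c / n) * math.log(c / n) for c in Counter(bin_ids).values())
    return h / math.log(n_bins)


# Evenly spread values occupy every bin
even_values = list(range(10))
# Heavy-tailed values: one outlier stretches the range so the rest share one bin
heavy_tailed_values = [1, 1, 2, 2, 3, 3, 4, 5, 6, 100]

even = normalized_shannon_diversity(uniform_width_bins(even_values, 10), 10)
heavy = normalized_shannon_diversity(uniform_width_bins(heavy_tailed_values, 10), 10)
print(f"even: {even:.3f}, heavy-tailed: {heavy:.3f}")
```

In this sketch the low score for the heavy-tailed values comes from the bin edges rather than from a lack of variety in the data, which is why a low diversity value for a factor such as kurtosis is worth rechecking under a different binning strategy.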

In [None]:
parity_output = parity(metadata)
parity_output.to_dataframe()

You can now augment your datasets with additional metadata, either from external
sources or using `dataeval` statistical functions for insights into your data.
