# Identify bias and correlations

This guide provides a beginner-friendly introduction to dataset bias, including balance, diversity and parity.

Estimated time to complete: 15 minutes

Relevant ML stages: Data Engineering

Relevant personas: Data Engineer, T&E Engineer


## What you'll do

- Use DataEval to identify bias and correlations in the 2011 VOC dataset
- Analyze the results using plots and tables


## What you'll learn

- You will see how to identify bias and correlations present in a dataset.
- You will understand the potential impact of bias and correlations on your data and ways to mitigate them.


## What you'll need

- Basic familiarity with Python
- Basic understanding of your dataset structure, including but not limited to its metadata
- An environment with DataEval installed with the `all` extra


## Introduction

Identifying any biases or correlations present in a dataset is essential to
accurately interpreting your model's performance and its ability to generalize
to new data. A common cause of poor generalization is shortcut learning &mdash;
where a model uses secondary or background information to make predictions
&mdash; which is enabled or exacerbated by dataset sampling biases.

### Bias and correlations

Understanding biases or correlations present in your dataset is a key component
of creating meaningful data splits. Bias in data can lead to misleading
conclusions and poor model performance on operational data. There are many
different [types of bias](https://arxiv.org/abs/1908.09635). Some of these
biases arise during data collection, others during dataset development or
model development, and still others are a result of the user.

Not all forms of bias directly affect the dataset, and to address the
biases that do, you have to make a few assumptions:

1. All desired classes are present.
2. All available metadata is provided.
3. The metadata has been recorded correctly.

If any of the above assumptions are violated, then the analysis will not be
accurate. When using your own data, you should verify the above assumptions.
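
As a minimal sketch of how you might check the first two assumptions on your own data, the snippet below looks for expected classes that have no samples and counts missing values per metadata key. The records, the expected class list, and the helper name `check_assumptions` are all hypothetical and are not part of DataEval.

```python
# Hypothetical per-object metadata records; keys and values are illustrative only.
records = [
    {"name": "horse", "pose": "Left", "truncated": 0},
    {"name": "person", "pose": "Unspecified", "truncated": 1},
    {"name": "cat", "pose": None, "truncated": 0},
]
expected_classes = {"horse", "person", "cat", "dog"}
required_keys = {"name", "pose", "truncated"}


def check_assumptions(records, expected_classes, required_keys):
    """Return the classes with no samples and, per key, the number of missing values."""
    present = {r.get("name") for r in records}
    missing_classes = sorted(expected_classes - present)
    missing_values = {
        key: sum(1 for r in records if r.get(key) is None)
        for key in sorted(required_keys)
    }
    return missing_classes, missing_values


missing_classes, missing_values = check_assumptions(records, expected_classes, required_keys)
print(missing_classes)  # ['dog']
print(missing_values)  # {'name': 0, 'pose': 1, 'truncated': 0}
```

Checking that metadata was recorded correctly (the third assumption) usually requires domain knowledge and cannot be automated this simply.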

This guide does not focus on eliminating all bias; rather, it focuses on
identifying the bias that can be found when developing a dataset.

### DataEval metrics

DataEval has three dedicated functions for identifying and understanding the
bias or correlations that may be present in a dataset: {func}`.balance`,
{func}`.diversity` and {func}`.parity`.

The `balance` function measures correlational relationships between metadata
factors and classes by calculating the mutual information between the metadata
factors and the labels.
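
To build intuition for mutual information, here is a plain-Python sketch that computes it for two discrete sequences. It illustrates the concept only; DataEval's `balance` may estimate or normalize mutual information differently, and the helper name and toy data are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi


labels = ["cat", "cat", "dog", "dog"]
dependent_factor = ["Left", "Left", "Right", "Right"]  # fully determined by the label
independent_factor = ["Left", "Right", "Left", "Right"]  # unrelated to the label

mi_dependent = mutual_information(dependent_factor, labels)
mi_independent = mutual_information(independent_factor, labels)
print(mi_dependent, mi_independent)
```

A factor that is fully determined by the class gives the maximum value for this toy data (ln 2), while an unrelated factor gives zero.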

The `diversity` function measures the evenness or uniformity of the sampling
of metadata factors over a dataset using the inverse Simpson index or Shannon
index.
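
The two indices can be sketched in a few lines of plain Python. The normalization to the range 0 to 1 used here (dividing by the value for a perfectly uniform distribution over `num_categories`) is an assumption made for illustration and may not match DataEval's implementation exactly.

```python
import math
from collections import Counter


def normalized_inverse_simpson(values, num_categories):
    """Inverse Simpson index rescaled so 0 means one category and 1 means uniform."""
    if num_categories <= 1:
        return 0.0
    n = len(values)
    simpson = sum((c / n) ** 2 for c in Counter(values).values())
    return (1.0 / simpson - 1.0) / (num_categories - 1)


def normalized_shannon(values, num_categories):
    """Shannon entropy divided by its maximum, ln(num_categories)."""
    if num_categories <= 1:
        return 0.0
    n = len(values)
    entropy = sum((c / n) * math.log(n / c) for c in Counter(values).values())
    return entropy / math.log(num_categories)


uniform = ["a", "b", "c", "d"] * 25  # every category equally common
skewed = ["a"] * 97 + ["b", "c", "d"]  # one category dominates

for name, values in [("uniform", uniform), ("skewed", skewed)]:
    print(name, normalized_inverse_simpson(values, 4), normalized_shannon(values, 4))
```

The uniform sample scores close to 1 on both indices, while the skewed sample scores much closer to 0.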

The `parity` function measures the relationship between metadata factors
and classes using a chi-squared test.
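
As a rough illustration of the idea behind a chi-squared test of independence, the sketch below computes the test statistic and degrees of freedom from a contingency table of factor values and class labels. The helper name and toy data are hypothetical. It does not compute a p-value; in practice a library routine such as `scipy.stats.chi2_contingency` can do that.

```python
from collections import Counter


def chi_squared_statistic(factor_values, labels):
    """Chi-squared statistic and degrees of freedom for a factor-by-label table."""
    n = len(labels)
    observed = Counter(zip(factor_values, labels))
    factor_totals = Counter(factor_values)
    label_totals = Counter(labels)
    statistic = 0.0
    for f, f_total in factor_totals.items():
        for lbl, l_total in label_totals.items():
            # Expected count if the factor and the label were independent
            expected = f_total * l_total / n
            statistic += (observed[(f, lbl)] - expected) ** 2 / expected
    dof = (len(factor_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


labels = ["cat"] * 4 + ["dog"] * 4
dependent_factor = ["indoor"] * 4 + ["outdoor"] * 4
independent_factor = ["indoor", "indoor", "outdoor", "outdoor"] * 2

print(chi_squared_statistic(dependent_factor, labels))  # (8.0, 1)
print(chi_squared_statistic(independent_factor, labels))  # (0.0, 1)
```

The larger the statistic relative to its degrees of freedom, the stronger the evidence that the factor and the class label are not independent.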

These techniques help ensure that when you split the data for your projects,
you minimize things like shortcut learning and leakage between training and
testing sets.


## Importing the necessary libraries

You'll begin by importing the necessary libraries to walk through this guide.


In [1]:
try:
    import google.colab  # noqa: F401

    # specify the version of DataEval (==X.XX.X) for versions other than the latest
    %pip install -q dataeval[all]
except Exception:
    pass

In [2]:
# Load the functions from DataEval that are helpful for bias
# as well as the VOCDetection dataset for the tutorial
from dataeval.metrics.bias import balance, diversity, parity
from dataeval.utils.data.datasets import VOCDetection
from dataeval.utils.metadata import merge, preprocess

## Step 1: Load the data

You are going to work with the PASCAL VOC 2011 dataset.
This dataset is a small curated dataset that was used for a computer vision competition.
The images were used for classification, object detection, and segmentation.
This dataset was chosen because it has multiple classes and a variety of images and metadata.

If this data is already on your computer, you can change the file location from `"./data"` to wherever the data is stored.
Remember to also change the download value from `True` to `False`.

For the sake of ensuring that this tutorial runs quickly on most computers, you are going to analyze only the training dataset, which is a little under 6000 images.


In [None]:
# Download the 2011 train dataset and verify the size of the loaded dataset
ds = VOCDetection(root="./data", year="2011", image_set="train", download=True)
len(ds)

Before moving on, verify that the above code cell printed out 5717 for the size of the [dataset](http://host.robots.ox.ac.uk/pascal/VOC/voc2011/dbstats.html).

This ensures that everything is working as needed for the tutorial.


## Step 2: Structure the metadata

This guide focuses on evaluating labels and metadata of the dataset rather than
the images themselves. As each dataset has its own image and metadata formats, you
will need to understand how your particular metadata is structured.

Start by taking a look at the metadata structure of the VOC 2011 dataset on the first
dataset item. Each dataset item is composed of an (image, metadata) tuple.


In [None]:
# Check the label structure
ds[0][1]

The metadata is provided as a dictionary entry for each datum, such that the
aggregated data is a collection of _N_ metadata dictionaries each with a nested
list of _M_ objects in the image. In the example above, the first image contains
two objects: a _horse_ and a _person_ defined in the _annotation_/_object_ list.

To use this information with DataEval, you will use {func}`.merge` and
{func}`.preprocess` to prepare the metadata.

Begin by running `merge` on the aggregated metadata dictionaries to process the
collection into a single dictionary of metadata keys, each with a flattened list
of all object metadata values.


In [None]:
# Merge the dataset metadata entries into a single metadata dictionary
merged = merge(metadata=(d[1] for d in ds), image_index_key="filename")

Note that `merge` was unable to process the _annotation | object | part_ key
because it is a nested list. For this dataset, _part_ describes certain parts of a
_person_ object (such as _head_, _foot_ and _hand_), each with separate bounding box
coordinates. You can ignore this information for this example.


Now take a look at the same first two objects in the merged metadata dictionary.


In [None]:
{k: v[:2] for k, v in merged.items()}

Note that the nested _horse_ and _person_ objects from the first metadata
entry have been expanded to complete metadata entries for each object.
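
To make the flattening concrete, here is a plain-Python sketch of the general idea on a simplified, hypothetical metadata structure. It is not DataEval's `merge` implementation, and real VOC entries have more keys and deeper nesting.

```python
def flatten_objects(entries):
    """Expand image-level entries with nested object lists into one list per key."""
    columns = {}
    for entry in entries:
        # Image-level fields are repeated for every object in the image
        image_fields = {k: v for k, v in entry.items() if k != "object"}
        for obj in entry["object"]:
            row = {**image_fields, **obj}
            for key, value in row.items():
                columns.setdefault(key, []).append(value)
    return columns


# Hypothetical, simplified entries: two images containing three objects in total
entries = [
    {
        "filename": "a.jpg",
        "width": 500,
        "object": [
            {"name": "horse", "pose": "Left"},
            {"name": "person", "pose": "Unspecified"},
        ],
    },
    {"filename": "b.jpg", "width": 375, "object": [{"name": "cat", "pose": "Frontal"}]},
]

flat = flatten_objects(entries)
print(flat["filename"])  # ['a.jpg', 'a.jpg', 'b.jpg']
print(flat["name"])  # ['horse', 'person', 'cat']
```

Each list in the result has one value per object, which is why image-level values such as the filename appear once for every object in that image.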

Next you will run {func}`.preprocess` on the metadata to create the
{class}`.Metadata` object DataEval will use for the bias functions.

The primary functions performed by `preprocess` are extracting class labels
(identifiable through the key _name_), discretization of continuous data
(like _xmin_) into bins, and selecting keys of interest to include.
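
Discretization can be pictured with a small equal-width binning sketch. This is only one possible strategy, chosen here for illustration; the binning that `preprocess` applies may differ, and the helper name and values are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins equal-width bins, numbered from 0."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)
    width = (high - low) / num_bins
    # The maximum value would fall just past the last bin, so clamp it into that bin
    return [min(int((v - low) / width), num_bins - 1) for v in values]


xmin_values = [0, 10, 45, 120, 250, 499, 500]  # hypothetical pixel coordinates
binned = equal_width_bins(xmin_values, 5)
print(binned)  # [0, 0, 0, 1, 2, 4, 4]
```

After binning, a continuous factor can be treated like any other categorical factor when counting co-occurrences with class labels.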

In this example, you will want to include image information (_width_,
_height_) as well as object information (_name_, _pose_,
_truncated_, _occluded_, _xmin_, _ymin_, _xmax_, _ymax_, _difficult_)
to use for bias analysis.


In [19]:
metadata = preprocess(
    metadata=merged,
    class_labels="name",
    continuous_factor_bins={"width": 5, "height": 5, "xmin": 5, "ymin": 5, "xmax": 5, "ymax": 5},
    exclude=["folder", "filename", "database", "annotation", "image", "depth"],
    image_index_key="filename",
)

Now that the `Metadata` is ready to go, you can begin analyzing the dataset for bias!


## Step 3: Assess dataset balance


The {func}`.balance` function measures correlational relationships between metadata
factors and classes in a dataset. It analyzes the metadata factors against both the
classes and other factors to identify relationships.

The results can be retrieved using the _balance_ and _factors_ attributes of the output.


In [8]:
bal = balance(metadata)

The information provided by the `balance` function may be visually understood with a
heatmap. The {class}`.BalanceOutput` class contains a plot function to plot the results as
a heatmap.


In [None]:
_ = bal.plot()

The heatmap shows that the greatest correlations are in the bounding box locations
(_xmin_ with _xmax_ and _ymin_ with _ymax_) and the image dimensions (_height_ and
_width_).

Also the _ymax_ of the bounding box location is correlated with the _height_ of the
image. It is not surprising that _height_ and _width_ have correlation since many
of the images are similarly sized.

The correlations between _xmin_ and _xmax_ and between _ymin_ and _ymax_ suggest
that there is repetition in bounding box width and height across the objects.
In addition, the value of 0.08 between _pose_ and _class_ means that a few of
the classes have specific poses across a fair percentage of the images for that
class. An example of this would be most _pottedplant_ images having the same _pose_
value.

In addition to analyzing class and other factors, the balance function also analyzes
metadata factors with individual classes to identify relationships between only one
class and secondary factors.

Again, the plot function of the balance output class can plot the classwise
results as a heatmap. Set the `plot_classwise` parameter to `True` to use the
classwise results.


In [None]:
_ = bal.plot(plot_classwise=True)

The classwise heatmap shows that factors other than _class_ do not have any significant
correlation with a specific class.

Classwise balance shows correlation of individual classes with all class labels,
indicating relative class imbalance. In this case the _person_ class is over-represented
relative to most other classes.

This means that a model might learn a bias towards the _person_ class label due to its
frequency in the training set, which becomes a problem if the test/operational dataset
doesn't have the same imbalance.
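
A quick way to confirm this kind of class imbalance independently is to count label frequencies directly. The sketch below uses a small hypothetical label list rather than the actual VOC counts.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the samples, most frequent first."""
    counts = Counter(labels)
    total = len(labels)
    return {name: count / total for name, count in counts.most_common()}


# Hypothetical labels, not the real VOC 2011 distribution
labels = ["person"] * 6 + ["dog"] * 2 + ["cat"] * 2
proportions = class_proportions(labels)
print(proportions)  # {'person': 0.6, 'dog': 0.2, 'cat': 0.2}
```

Comparing these proportions between your training data and your test or operational data shows whether the imbalance is shared across them.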


## Step 4: Assess dataset diversity


The {func}`.diversity` function measures the evenness or uniformity of the sampling
of metadata factors over a dataset. Values near 1 indicate uniform sampling, while
values near 0 indicate imbalanced sampling, e.g. all values taking a single value.
For more information see the [Diversity](../concepts/Diversity.md) concept page.

The results can be retrieved using the _diversity_index_ attribute of the output.


In [11]:
div = diversity(metadata)

Again, it's often easiest to see the differences between the different factors when
visualizing them. The {class}`.DiversityOutput` class contains a plot function to plot the
results of the diversity function. It uses a bar chart to plot the factor-class analysis.


In [None]:
_ = div.plot()

In the results above, the factors _truncated_ and _occluded_ have values near
1, meaning that there is relatively little or no bias in these factors.

The factors of most interest are those with values between 0.1 and 0.4 because
this region represents skewed value distributions for the factor.

The following factors fall into this category:

- _class_
- _width_
- _height_
- _segmented_
- _difficult_

These factors contain bias that should be addressed either by adding or removing
data to even out the sampling. For instance, the _class_ factor highlights that
there is unevenness in the number of data points per class.

In addition to analyzing class, the diversity function also analyzes metadata
factors with individual classes to assess uniformity of metadata factors within
a class. As above, the plot function of the diversity output class can plot the
classwise results as a heatmap. Set the `plot_classwise` parameter to `True` to
use the classwise results.


In [None]:
_ = div.plot(plot_classwise=True)

These results expand the above results on a classwise basis.

Things to look for here are large variances for a given factor across the
different classes. For example, _pose_ has values ranging from 0.01 to 0.84,
which means that a few classes have almost uniform selection of the different
_pose_ values while other classes essentially only have one _pose_ value.
This makes sense as the _bottle_ or _pottedplant_ class does not have multiple
_pose_ directions, while the _person_ class does.

What needs to be further investigated are things like whether the _sofa_ class
should have a _pose_ direction, because a diversity value of 0.4 means that a
few of the images do while others do not.

Also, the _cat_ class has a low score, signifying that most of the images fall
into one or two categories rather than being spread evenly across the categories.
This highlights an error in the data collection process: the value was
not specified for most _cat_ images and therefore defaulted to "Unspecified".

An alternative error would be a dataset in which the _cat_ images have most
cats facing a specific direction, which would require additional data to
overcome the bias, but that is not the case for this dataset. It has plenty of
cats facing each direction, but only a few of them contain a _pose_ value.
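
One way to tell these two situations apart is to measure how often a placeholder value appears within each class. The following is a minimal sketch on hypothetical data; the helper name is an assumption, and the placeholder string follows the "Unspecified" value mentioned above.

```python
from collections import Counter


def placeholder_fraction(labels, values, placeholder="Unspecified"):
    """Fraction of samples per class whose factor value equals the placeholder."""
    totals = Counter(labels)
    hits = Counter(label for label, value in zip(labels, values) if value == placeholder)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical labels and pose values
labels = ["cat"] * 4 + ["person"] * 4
poses = ["Unspecified", "Unspecified", "Unspecified", "Left", "Left", "Right", "Frontal", "Rear"]
fractions = placeholder_fraction(labels, poses)
print(fractions)  # {'cat': 0.75, 'person': 0.0}
```

A high placeholder fraction points to missing annotations, whereas a low diversity score with few placeholders points to a genuine sampling skew.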


## Step 5: Assess dataset parity


The {func}`.parity` function measures the relationship between metadata factors
and classes using a chi-squared test. A high score with a low p-value suggests
that a metadata factor is strongly correlated with a class label.

The results can be retrieved using the _score_ and _p_value_ attributes of the output.


In [None]:
par = parity(metadata)

The warning above states that the metric works best when there are
more than 5 samples in each value-label combination. However,
because of the large number of total samples, the difference between
1 and 5 samples does not significantly affect the results.
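
If you want to see which value-label combinations are sparse, a sketch like the one below lists every combination observed fewer than a chosen number of times. The helper name, the toy data, and the cutoff of 5 are assumptions; adjust the cutoff to match the guidance in the warning you receive.

```python
from collections import Counter


def sparse_cells(factor_values, labels, min_count=5):
    """Return value-label combinations seen fewer than min_count times, including zero."""
    observed = Counter(zip(factor_values, labels))
    combos = [(f, lbl) for f in sorted(set(factor_values)) for lbl in sorted(set(labels))]
    return {combo: observed[combo] for combo in combos if observed[combo] < min_count}


# Hypothetical data: "Right" is rare for both classes
labels = ["cat"] * 9 + ["dog"] * 7
poses = ["Left"] * 8 + ["Right"] * 1 + ["Left"] * 6 + ["Right"] * 1
sparse = sparse_cells(poses, labels)
print(sparse)  # {('Right', 'cat'): 1, ('Right', 'dog'): 1}
```

Factors with many sparse combinations are candidates for coarser binning or for collecting more data before you rely on the test.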

When evaluating the results of parity for a large number of factors,
it may be easier to understand the results in a DataFrame.

The {class}`.ParityOutput` class contains a `to_dataframe` function to format
the results of the parity function as a DataFrame.


In [None]:
par.to_dataframe()

According to the results, all metadata are correlated with _class_ labels,
except for _image_ and _depth_. However, `parity` is based on the idea of
an expected frequency and how the observed differs from what is expected.
The expected frequencies are determined by sums of the values for each
metadata category.

This function works best when the expected frequencies for a given factor
for each individual class are known a priori. For the case above, the
expected frequency for the _pose_ metadata category shouldn't be the same
for all classes. The _diningtable_, _pottedplant_, and _bottle_ classes
only have a single value for _pose_ which automatically throws off the
metric because not all of the classes have an identical expected frequency
for _pose_.


## Conclusion


Having analyzed the dataset for bias with multiple metrics, you can conclude that
this dataset has bias. A model trained on this dataset could learn shortcuts and
underperform on operational data if these biases are not representative of the
biases in the operational dataset.

The metadata categories identified by the `balance`, `diversity` and `parity`
functions contain issues such as imbalanced classes and imbalanced parameters per
class. DataEval isn't able to tell you exactly why they are imbalanced, but it
highlights the categories that you need to check.

As you can see, the DataEval methods are here to help you gain a deep understanding
of your dataset and all of its strengths and limitations. They are designed to help you
create representative and reliable datasets.

Good luck with your data!

---


## What's next

Beyond identifying bias and correlations in a dataset, DataEval offers tutorials on other aspects of dataset analysis:

- To clean a dataset use the [Data Cleaning Guide](EDA_Part1.ipynb).
- To identify coverage gaps and outliers use the [Assessing the Data Space Guide](EDA_Part2.ipynb).
- To monitor data for shifts during operation use the [Data Monitoring Guide](Data_Monitoring.ipynb).

To learn more about the balance, diversity and parity functions, see the [Balance](../concepts/Balance.md), [Diversity](../concepts/Diversity.md) and [Parity](../concepts/Parity.md) concept pages.

## On your own

Once you are familiar with DataEval and dataset analysis, you will want to run this analysis on your own dataset.
When you do, make sure that you analyze all of your data and not just the training set.
