# Dataset Linting Tutorial


## Problem Statement

Exploratory data analysis (EDA) can be overwhelming because there are so many things to check:
duplicates in your dataset, bad or corrupted images, blurred images, images that are too bright or too dark, and the list goes on.

DAML provides a `Linter` class to assist with your EDA so that you can start training your models on high-quality data.


## When to use

The `Linter` class should be used during the initial EDA process, or whenever you want to verify that your dataset contains the right data.


### _What you will need_

1. A dataset to analyze


## _Getting Started_

Let's import the libraries needed to set up a minimal working example.


In [1]:
try:
    import google.colab  # noqa: F401

    %pip install -q daml
except Exception:
    pass

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [2]:
import numpy as np
import torch
import torchvision.datasets as datasets
import torchvision.transforms.v2 as v2
from scipy.ndimage import gaussian_filter

from daml.detectors import Linter

## Loading in the data

We are going to start by loading in torchvision's CIFAR-10 dataset.

The CIFAR-10 dataset contains 60,000 images - 50,000 in the train set and 10,000 in the test set.
For the purposes of this demonstration, we are just going to use the test set.


In [3]:
# Load in the cifar-10 dataset from torchvision
to_tensor = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True)])
dataset = datasets.CIFAR10("./data", train=False, download=True, transform=to_tensor)
test_data = np.array(dataset.data, dtype=float)

Files already downloaded and verified


## Linting the Dataset

Now we can begin finding the images that differ significantly from the rest of the data.


In [4]:
# Initialize the Linter class
lint = Linter(test_data)  # type: ignore

# Evaluate the data
results = lint.evaluate()

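The results are a dictionary keyed by image index. Each value is itself a dictionary that maps every flagged property to the value associated with it for that image. An image appears in the results if it has an issue in at least one of the properties listed below: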

- Brightness
- Blurriness
- Missing
- Zero
- Width
- Height
- Size
- Aspect Ratio
- Channels
- Depth


In [5]:
print(f"Total number of images with an issue: {len(results)}")

Total number of images with an issue: 396
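
Beyond the total, it can help to see which properties are flagged most often. The sketch below tallies issues per property with `collections.Counter`. It uses a small stand-in dictionary, `example_results`, with the same `{image_index: {property: value}}` shape as `results`, so that it runs on its own; in the notebook you would pass `results` instead.

```python
from collections import Counter

# Stand-in with the same shape as the Linter output:
# {image_index: {property_name: value}}
example_results = {
    16: {"zero": 59.0},
    21: {"zero": 128.0},
    75: {"blurriness": 112.36},
    218: {"brightness": 0.93},
}

# Count how many images were flagged for each property
issue_counts = Counter(
    prop for issues in example_results.values() for prop in issues
)

for prop, count in issue_counts.most_common():
    print(f"{prop}: {count}")
```

Because an image can be flagged for more than one property, the per-property counts may add up to more than the number of flagged images.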


In [6]:
# Show each image that has at least one issue
for image, issue in results.items():
    print(f"{image} - {issue}")

16 - {'zero': 59.0}
21 - {'zero': 128.0}
57 - {'zero': 642.0}
62 - {'zero': 51.0}
75 - {'blurriness': 112.36}
81 - {'zero': 316.0}
102 - {'zero': 139.0}
111 - {'zero': 81.0}
118 - {'zero': 566.0}
123 - {'blurriness': 114.45}
162 - {'zero': 96.0}
176 - {'blurriness': 107.41}
198 - {'zero': 328.0}
218 - {'brightness': 0.93}
244 - {'blurriness': 111.02}
276 - {'zero': 124.0}
286 - {'zero': 118.0}
294 - {'zero': 66.0}
322 - {'zero': 67.0}
374 - {'blurriness': 110.58}
385 - {'brightness': 0.91}
387 - {'blurriness': 114.98}
391 - {'blurriness': 107.77}
417 - {'zero': 179.0}
557 - {'zero': 79.0}
565 - {'zero': 56.0}
605 - {'zero': 510.0}
616 - {'zero': 50.0}
671 - {'zero': 112.0}
763 - {'zero': 51.0}
765 - {'blurriness': 112.97}
807 - {'blurriness': 110.56}
834 - {'blurriness': 107.74}
872 - {'zero': 469.0}
888 - {'zero': 289.0}
891 - {'blurriness': 114.14}
925 - {'zero': 1057.0}
943 - {'blurriness': 109.24}
947 - {'zero': 55.0}
994 - {'zero': 95.0}
1007 - {'zero': 118.0}
1052 - {'zero': 94.0