# Dataset Linting Tutorial


## Problem Statement

Exploratory data analysis (EDA) can be overwhelming because there are so many things to check:
duplicates in your dataset, bad or corrupted images, blurred images, overly bright or dark images, and the list goes on.

DAML created a `Linter` class to assist you with your EDA so you can start training your models on high-quality data.


### _When to use_

The `Linter` class should be used during the initial EDA process or when you want to verify that your dataset contains the right data.


### _What you will need_

1. A dataset to analyze


## _Getting Started_

Let's import the libraries needed to set up a minimal working example.


In [None]:
try:
    import google.colab  # noqa: F401

    %pip install -q daml
except Exception:
    pass

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

In [None]:
import numpy as np
import torch
import torchvision.datasets as datasets
import torchvision.transforms.v2 as v2

from daml.detectors import Linter

## Loading in the Data

We are going to start by loading in torchvision's CIFAR-10 dataset.

The CIFAR-10 dataset contains 60,000 images - 50,000 in the train set and 10,000 in the test set.
For the purposes of this demonstration, we are just going to use the test set.


In [None]:
# Load in the CIFAR-10 test set from torchvision
to_tensor = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True)])
dataset = datasets.CIFAR10("./data", train=False, download=True, transform=to_tensor)
# dataset.data holds the raw images as a NumPy array; convert it to float for analysis
test_data = np.array(dataset.data, dtype=float)
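
It can help to know what the converted array looks like before handing it to the linter. The sketch below uses a small random stand-in array (not the real CIFAR-10 data) with the same layout as torchvision's `CIFAR10.data` attribute: `(num_images, height, width, channels)` with `uint8` values. Because the array is taken from `.data` directly, the `transform` passed to the dataset is not applied to it, so the float values stay in the 0 to 255 range.

```python
import numpy as np

# Stand-in with the same layout as torchvision's CIFAR10 `.data` attribute:
# (num_images, height, width, channels), dtype uint8. Random values, not real images.
raw = np.random.default_rng(0).integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)

# Same conversion as in the cell above: the cast changes the dtype, not the values
as_float = np.array(raw, dtype=float)

print(as_float.shape, as_float.dtype)
print(as_float.min(), as_float.max())
```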

## Linting the Dataset

Now we can look for images that differ significantly from the rest of the data.


In [None]:
# Initialize the Linter class
lint = Linter(test_data)  # type: ignore

# Evaluate the data
results = lint.evaluate()

The results are a dictionary keyed by image. Each entry corresponds to an image with an issue in at least one of the properties listed below:

- Brightness
- Blurriness
- Missing
- Zero
- Width
- Height
- Size
- Aspect Ratio
- Channels
- Depth
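
To make the idea of a flagged property concrete, here is a minimal sketch of how a single property such as brightness could be checked by hand. It is not DAML's implementation: the function name `flag_brightness_outliers`, the z-score rule, and the threshold of 3 are all assumptions chosen for illustration, and the images are synthetic.

```python
import numpy as np


def flag_brightness_outliers(images: np.ndarray, z_threshold: float = 3.0) -> dict[int, float]:
    """Return {image_index: mean_brightness} for images whose mean pixel value
    lies more than z_threshold standard deviations from the dataset-wide mean.

    Illustrative only; this is an assumed rule, not the Linter's actual logic.
    """
    # One brightness value per image: average over every non-batch axis
    brightness = images.reshape(len(images), -1).mean(axis=1)
    mean, std = brightness.mean(), brightness.std()
    if std == 0:
        # All images are equally bright, so nothing stands out
        return {}
    z_scores = np.abs(brightness - mean) / std
    return {int(i): float(brightness[i]) for i in np.flatnonzero(z_scores > z_threshold)}


# Synthetic stand-in data: 201 noisy mid-gray images, one of which is nearly white
rng = np.random.default_rng(0)
images = rng.normal(128.0, 5.0, size=(201, 32, 32, 3)).clip(0, 255)
images[17] = 250.0

flagged = flag_brightness_outliers(images)
print(flagged)
```

Only the nearly white image at index 17 ends up in `flagged`. A real linter would run checks like this across every property in the list and merge the per-image findings.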


In [None]:
print(f"Total number of images with an issue: {len(results)}")

In [None]:
# Show each image that has at least one issue
for image, issue in results.items():
    print(f"{image} - {issue}")
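
When many images are flagged, a per-property tally is often easier to read than the full printout. The sketch below assumes each value in the results dictionary is a list of property names. That structure is an assumption, not something this tutorial confirms, so inspect one entry of `results` first and adapt the loop. `example_results` is hypothetical stand-in data, not output from `Linter`.

```python
from collections import Counter

# Hypothetical stand-in for the Linter output (NOT real results).
# Assumed shape: {image_index: [names of the properties flagged for that image]}
example_results = {
    3: ["brightness"],
    42: ["blurriness", "brightness"],
    108: ["zero"],
}

# Tally how many images were flagged for each property
issue_counts = Counter(name for issues in example_results.values() for name in issues)

for name, count in issue_counts.most_common():
    print(f"{name}: {count} image(s)")
```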