# How to visualize linting issues

## Problem Statement

Exploratory data analysis (EDA) can be overwhelming because there is so much to check:
duplicate images, corrupted images, blurred images, images that are too bright or too dark, and more.

DataEval provides linters, such as the `Outliers` class used in this guide, to assist with your EDA so you can start training your models on high-quality data.


### _When to use_

Use the `Outliers` linter during the initial EDA process, or whenever you need to verify that your dataset contains the data you expect.


### _What you will need_

1. A dataset to analyze
2. A Python environment with the following packages installed:
   - `dataeval` or `dataeval[all]`


## Getting Started

Let's import the libraries needed to set up a minimal working example.


In [None]:
try:
    import google.colab  # noqa: F401

    # specify the version of DataEval (==X.XX.X) for versions other than the latest
    %pip install -q dataeval maite-datasets
except Exception:
    pass

In [None]:
from maite_datasets.image_classification import CIFAR10

from dataeval.detectors.linters import Outliers

## Loading in the data

We are going to start by loading in the CIFAR-10 dataset.

The CIFAR-10 dataset contains 60,000 images: 50,000 in the training set and 10,000 in the test set.
For this demonstration, we will use only the test set.


In [None]:
# Load in the CIFAR10 dataset
testing_dataset = CIFAR10("./data", image_set="test", download=True)

## Linting the Dataset

Now we can look for images that differ significantly from the rest of the data.


In [None]:
# Initialize the Outliers class using the z-score method with a threshold of 3.5
outliers = Outliers(outlier_method="zscore", outlier_threshold=3.5)

# Evaluate the data
results = outliers.evaluate(testing_dataset)
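
To build intuition for the `outlier_method="zscore"` and `outlier_threshold=3.5` arguments used above, here is a minimal sketch of the general z-score technique using only the standard library. It is an illustration, not DataEval's internal implementation: the `zscore_outliers` helper and the brightness values are hypothetical, and the sketch assumes a value is flagged when its absolute distance from the mean, divided by the standard deviation, exceeds the threshold.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.5):
    """Return the indices of values whose absolute z-score exceeds threshold.

    Hypothetical helper for illustration only.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing can be an outlier
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical per-image brightness values; the last image is unusually bright
brightness = [0.48, 0.50, 0.52, 0.49, 0.51] * 4 + [0.99]
flagged = zscore_outliers(brightness)
print(flagged)  # [20]
```

A higher threshold flags fewer images, while a lower one is more sensitive and flags more.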

`results.issues` is a dictionary keyed by the index of each flagged image. Each value is a dictionary of the properties flagged for that image, which can include:

- Brightness
- Blurriness
- Missing
- Zero
- Width
- Height
- Size
- Aspect Ratio
- Channels
- Depth


In [None]:
print(f"Total number of images with an issue: {len(results.issues)}")

In [None]:
# Count how many images were flagged for each issue type
issue_count_by_type = {}
for issue in results.issues.values():
    for issue_type in issue:
        issue_count_by_type[issue_type] = issue_count_by_type.get(issue_type, 0) + 1
# Print the issue types from most to least frequent
for issue_type, count in sorted(issue_count_by_type.items(), key=lambda item: item[1], reverse=True):
    print(f"{issue_type:>10}: {count:<5}")
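
Once you know which issue types are most common, it can help to invert the mapping so that you can look up the images flagged for a given issue type. The sketch below assumes the same nested structure as `results.issues` (image index mapped to a dictionary of flagged properties). The `issues` dictionary and the `images_by_issue_type` helper are hypothetical stand-ins so the example runs on its own.

```python
from collections import defaultdict

# Hypothetical stand-in for results.issues: {image index: {flagged property: value}}
issues = {
    3: {"brightness": 0.98, "contrast": 0.02},
    17: {"sharpness": 0.01},
    42: {"brightness": 0.97},
}


def images_by_issue_type(issues):
    """Group image indices by the issue type they were flagged for."""
    grouped = defaultdict(list)
    for image_index, flagged in issues.items():
        for issue_type in flagged:
            grouped[issue_type].append(image_index)
    return dict(grouped)


by_type = images_by_issue_type(issues)
print(by_type["brightness"])  # [3, 42]
```

With the real results, you could pass `results.issues` in place of the stand-in and use the returned indices to pull the corresponding images from the dataset for inspection.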

In [None]:
### TEST ASSERTION CELL ###
assert len(results.issues) == 306
assert {
    "var",
    "mean",
    "skew",
    "kurtosis",
    "entropy",
    "brightness",
    "contrast",
    "zeros",
    "sharpness",
    "std",
    "darkness",
} == set(issue_count_by_type)