# Getting Started

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/cleanvision/blob/main/examples/demo.ipynb) 

In [None]:
!pip install -U pip
!pip install git+https://github.com/cleanlab/cleanvision.git

We plan to publish the package on PyPI soon!

## What is CleanVision?
**CleanVision** is built to automatically detect various issues in image datasets, such as images that are (near) duplicates, blurry, or over/under-exposed. This data-centric AI package is designed as a quick first step for any computer vision project: use it to find problems in your dataset that you may want to address before applying machine learning.

`cleanvision` provides a class **Imagelab** that you can use to find issues in your dataset, get a detailed summary of the issues found, and visualize them. It offers a clean, easy-to-understand interface for detailed investigation of issues in your dataset.

The easiest way to use the Imagelab class is to run it with the set of default predefined issue types. Here is the list of all issue types that cleanvision can detect.

|     | Issue Type      | Description                                                                                  | Issue Key        |
|-----|------------------|----------------------------------------------------------------------------------------------|------------------|
| 1   | Light            | Images that are too bright/washed out in the dataset                                         | light            |
| 2   | Dark             | Images that are irregularly dark                                                             | dark             |
| 3   | Odd Aspect Ratio | Images with an unusual aspect ratio (i.e. overly skinny/wide)                                                       | odd_aspect_ratio |
| 4   | Exact Duplicates | Images that are exact duplicates of each other                          | exact_duplicates |
| 5   | Near Duplicates  | Images that are almost visually identical to each other (e.g. same image with different filters)                                 | near_duplicates  |
| 6   | Blurry           | Images that are blurry or out of focus                                                  | blurry           |
| 7   | Grayscale        | Images that are grayscale (lacking color)                                                            | grayscale        |
| 8   | Low Information  | Images that lack much information (e.g. a completely black image with a few white dots) | low_information  |

The Issue Key column specifies the name used for each type of issue in the CleanVision code. The examples in this notebook show how to use these keys to detect particular issue types and to specify custom hyperparameters.

This notebook uses an example dataset, which you can download and unzip with the commands below.

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'

In [None]:
!unzip image_files.zip

## Examples

### Using Imagelab to detect default issue types

In [None]:
from cleanvision.imagelab import Imagelab

# Path to your dataset, you can specify your own dataset path
dataset_path = "./image_files/"

# Initialize imagelab with your dataset
imagelab = Imagelab(data_path=dataset_path)

# Visualize a few images from the dataset
imagelab.visualize(num_images=8)

# Find issues
imagelab.find_issues()

To look at the issues found in the dataset, use `imagelab.report()` as shown below. It reports the top issues found in the dataset, the number of images (`num_images`) found for each issue type, and some example images for each.

In [None]:
imagelab.report()

You can dig deeper into the analysis by looking at the following imagelab attributes:
- `imagelab.issue_summary`
- `imagelab.issues`
- `imagelab.info`

#### imagelab.issue_summary
A dataframe that contains a summary of all issue types detected.

In each row:\
`issue_type` - name of the issue\
`num_images` - number of images of that issue type found in the dataset

In [None]:
imagelab.issue_summary

#### imagelab.issues

This is a dataframe that lists the scores for every issue type for all images in the dataset. For each issue type, it also has a boolean column showing whether that image is flagged with the issue.
For example, `dark_score` is the score for issue type dark. All scores lie between 0 and 1, where lower values indicate a higher likelihood that the image has the issue. A common use case for `imagelab.issues` is to filter out all images with a given issue type, say `odd_aspect_ratio`.

In [None]:
# Get all images with blurry issue type
blurry_images = imagelab.issues[imagelab.issues["is_blurry_issue"] == True].index.tolist()
blurry_images

In [None]:
imagelab.visualize(image_files=blurry_images[:4])  # visualize the given image files

In [None]:
imagelab.visualize(issue_types=['blurry'])  # visualize specific issue types, in order of ascending scores
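
The filtering and ordering used above can be tried on a small stand-in table. This is only a sketch: it assumes `imagelab.issues` follows the `<issue>_score` / `is_<issue>_issue` column layout described in this section, and the file names and scores are made up for illustration.

```python
import pandas as pd

# Toy stand-in for imagelab.issues (hypothetical file names and scores).
issues = pd.DataFrame(
    {
        "blurry_score": [0.91, 0.12, 0.30, 0.75],
        "is_blurry_issue": [False, True, True, False],
    },
    index=["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
)

# A boolean column can be used directly as a row mask.
flagged = issues[issues["is_blurry_issue"]]

# Lower scores indicate more severe issues, so sort in ascending order.
worst_first = flagged.sort_values("blurry_score").index.tolist()
print(worst_first)
```

The resulting list of file names is the kind of value that can be passed as `image_files` to `imagelab.visualize`.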

#### imagelab.info

This is a nested dictionary that contains statistics on images or other useful information that was collected while checking for issues in the dataset.

In [None]:
# Possible keys: statistics, issue names
print(list(imagelab.info.keys()), "\n")

In [None]:
# statistics collected, you can further look into these by checking their values
print(list(imagelab.info["statistics"].keys()))
imagelab.info['statistics']['brightness']

#### Duplicate sets
`imagelab.info` can also be used to see which images are near or exact duplicates of each other.

`imagelab.issue_summary` shows the number of exact duplicate images but does not show how many sets of such images exist in the dataset. To see the number of exact duplicate sets, you can use `imagelab.info`.

In [None]:
imagelab.info['exact_duplicates']['num_sets']

You can also get the sets of image files for each duplicated set using `imagelab.info`

In [None]:
# This is a list of lists, where each nested list is a set of exact duplicate images.
# Similarly, you can also retrieve near duplicate sets.
imagelab.info['exact_duplicates']['sets']
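
To make the difference between the number of duplicate images and the number of duplicate sets concrete, here is a minimal sketch that groups byte-identical content with a SHA-256 digest. It illustrates the idea only and is not necessarily how CleanVision detects exact duplicates; `group_exact_duplicates` and the in-memory "files" are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose byte content is identical."""
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Only groups with more than one member count as duplicate sets.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical in-memory "files" standing in for image bytes on disk.
files = {
    "a.png": b"\x00\x01",
    "b.png": b"\x00\x01",
    "c.png": b"\xff",
    "d.png": b"\x00\x01",
}

sets = group_exact_duplicates(files)
num_sets = len(sets)                    # number of duplicate sets
num_images = sum(len(s) for s in sets)  # number of images involved
print(num_sets, num_images)
```

Here three images belong to a single set, which is why the image count and the set count can differ.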

### Using Imagelab to detect specific issues

If only a few issue types are relevant for your dataset, you can save time by skipping the other checks. To do so, specify `issue_types` as an argument.

`issue_types` is a dict whose keys are the issue types that you want to detect and whose values are dicts containing hyperparameters. For now we are using the default hyperparameters, so the inner dict is empty. You can find the keys in the table above, which lists all issue types supported by cleanvision.

In [None]:
# Initialize imagelab with your dataset
imagelab = Imagelab(data_path=dataset_path)

# specify issue types to detect
issue_types = {"exact_duplicates": {}}

# Find issues
imagelab.find_issues(issue_types)

# Show a report of the issues found
imagelab.report()
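
When several issue types are needed, the `issue_types` dict can be built programmatically. The sketch below uses a hypothetical helper, `build_issue_types`, that is not part of CleanVision; it maps each requested key to an empty hyperparameter dict (meaning defaults) and rejects keys that are not in the issue type table of this notebook.

```python
from typing import Any, Dict, Iterable

# Issue keys copied from the table of issue types in this notebook.
KNOWN_ISSUE_KEYS = {
    "light",
    "dark",
    "odd_aspect_ratio",
    "exact_duplicates",
    "near_duplicates",
    "blurry",
    "grayscale",
    "low_information",
}


def build_issue_types(keys: Iterable[str]) -> Dict[str, Dict[str, Any]]:
    """Map each requested issue key to an empty hyperparameter dict."""
    issue_types: Dict[str, Dict[str, Any]] = {}
    for key in keys:
        if key not in KNOWN_ISSUE_KEYS:
            raise ValueError(f"Unknown issue key: {key!r}")
        issue_types[key] = {}  # empty dict: use default hyperparameters
    return issue_types


issue_types = build_issue_types(["exact_duplicates", "blurry"])
print(issue_types)
```

The resulting dict has the same shape as the one passed to `imagelab.find_issues` above.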

### Check for additional types of issues using existing Imagelab instance

Suppose that, after detecting exact duplicates in the dataset, you also want to check for near duplicates. You can use the **same** imagelab instance to run the check for near duplicates.

In [None]:
issue_types = {"near_duplicates": {}}

imagelab.find_issues(issue_types)

imagelab.report()

### Save and load Imagelab

Imagelab also provides save and load functionality. You can save the instance together with its results, then load it later to view the results or run more checks.

In [None]:
# For saving, specify force=True to overwrite files
save_path = "./results"
imagelab.save(save_path)

In [None]:
# For loading
# specify data_path to help Imagelab check for any inconsistencies between dataset paths in the previous and current run
imagelab = Imagelab.load(save_path, dataset_path)

### Check for an issue with a different threshold

You can use the loaded imagelab instance to check for an issue type with a custom hyperparameter. Here is a table of hyperparameters that each issue type supports and their permissible values. 

`threshold` - All images with scores below this threshold will be flagged as having the issue.

`hash_size` - This controls how much detail about an image we want to keep for getting perceptual hash. Higher sizes imply more detail.

`hash_type` - Type of perceptual hash to use. Currently `whash` and `phash` are the supported hash types. Check [here](https://github.com/JohannesBuchner/imagehash) for more details on these hash types.

|   | Issue Key        | Hyperparameters                                   |
|---|------------------|---------------------------------------------------|
| 1 | light            | threshold (between 0 and 1)                       |
| 2 | dark             | threshold (between 0 and 1)                       |
| 3 | odd_aspect_ratio | threshold (between 0 and 1)                       |
| 4 | exact_duplicates | N/A                                               |
| 5 | near_duplicates  | hash_size (power of 2), hash_type (whash, phash)  |
| 6 | blurry           | threshold (between 0 and 1)                       |
| 7 | grayscale        | threshold (between 0 and 1)                       |
| 8 | low_information  | threshold (between 0 and 1)                       |
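
To give some intuition for `hash_size`, the sketch below implements a plain average hash, which is simpler than the `whash` and `phash` types named above and is used here only as a stand-in. It assumes each image has already been reduced to a `hash_size` x `hash_size` grayscale grid; the pixel values are hypothetical.

```python
from typing import List


def average_hash(gray: List[List[int]]) -> List[int]:
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Count the positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 2x2 grayscale thumbnails (hash_size = 2).
img_a = [[10, 200], [220, 30]]
img_b = [[12, 190], [225, 28]]   # slightly altered copy of img_a
img_c = [[200, 10], [30, 220]]   # a different pattern

hash_a = average_hash(img_a)
hash_b = average_hash(img_b)
hash_c = average_hash(img_c)

print(hamming_distance(hash_a, hash_b))  # small distance: likely near duplicates
print(hamming_distance(hash_a, hash_c))  # large distance: different images
```

A hash built this way has `hash_size * hash_size` bits, so a larger `hash_size` keeps more detail about the image and makes the comparison stricter.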

In [None]:
issue_types = {"dark": {"threshold": 0.2}}
imagelab.find_issues(issue_types)

imagelab.report()

Note that the number of images flagged with the dark issue is lower than in the previous run.
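
The effect of the threshold can be reproduced with a few made-up scores. This sketch assumes only what is stated above, namely that images with scores below the threshold are flagged; `flag_issues` and both threshold values are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def flag_issues(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the names of images whose score is below the threshold."""
    return [name for name, score in scores.items() if score < threshold]


# Hypothetical dark scores (lower means darker).
dark_scores = {"a.jpg": 0.05, "b.jpg": 0.25, "c.jpg": 0.60}

flagged_higher = flag_issues(dark_scores, threshold=0.3)
flagged_lower = flag_issues(dark_scores, threshold=0.2)

print(flagged_higher)
print(flagged_lower)
```

Lowering the threshold can only shrink the set of flagged images, or leave it unchanged.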

### Run imagelab for default issue types, but override hyperparameters for one or more issues

In [None]:
imagelab = Imagelab(data_path=dataset_path)

# Check for all default issue types
imagelab.find_issues()

# Specify an issue with custom hyperparameters
issue_types = {"odd_aspect_ratio": {"threshold": 0.2}}

# Run find issues again with specified issue types
imagelab.find_issues(issue_types)


# Pass list of issue_types to imagelab.report() to report only those issue_types
imagelab.report(["odd_aspect_ratio", "low_information"])
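
For intuition about what a threshold of 0.2 might mean for `odd_aspect_ratio`, here is a sketch that scores an image by the ratio of its shorter side to its longer side. This scoring rule is an assumption made for illustration and is not confirmed by this notebook; `aspect_ratio_score` and the image sizes are hypothetical.

```python
def aspect_ratio_score(width: int, height: int) -> float:
    """Ratio of the shorter side to the longer side, in (0, 1]."""
    return min(width, height) / max(width, height)


threshold = 0.2

# Hypothetical (width, height) pairs in pixels.
sizes = {
    "square.jpg": (500, 500),
    "banner.jpg": (1000, 100),
    "portrait.jpg": (300, 400),
}

scores = {name: aspect_ratio_score(w, h) for name, (w, h) in sizes.items()}
flagged = [name for name, score in scores.items() if score < threshold]

print(scores)
print(flagged)
```

Under this assumed rule, only images that are at least five times longer on one side than the other would fall below 0.2.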

### Customize report

The report can also be customized in various ways to help with the analysis. For example, you can change the verbosity to show more or less information on the issues found. The default is `verbosity=1`.

In [None]:
# Change verbosity
imagelab.report(verbosity=2)

Some issues may be prevalent in, say, more than 50% of the dataset because they reflect the nature of the data rather than real problems; for example, dark images in an astronomy dataset may not be an issue. You can use the `max_prevalence` parameter of `report` to exclude such issues. In this example, all issues present in more than 3% of the dataset are excluded.

In [None]:
imagelab.report(max_prevalence=0.03)
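
The prevalence filter described above amounts to comparing each issue's share of the dataset with `max_prevalence`. The sketch below is not CleanVision's implementation; `filter_by_prevalence`, the counts, and the dataset size are hypothetical.

```python
from typing import Dict


def filter_by_prevalence(
    num_images_by_issue: Dict[str, int], dataset_size: int, max_prevalence: float
) -> Dict[str, int]:
    """Keep issue types flagged in at most max_prevalence of the dataset."""
    return {
        issue: count
        for issue, count in num_images_by_issue.items()
        if count / dataset_size <= max_prevalence
    }


# Hypothetical issue counts for a dataset of 200 images.
counts = {"dark": 120, "blurry": 5, "grayscale": 3}

kept = filter_by_prevalence(counts, dataset_size=200, max_prevalence=0.03)
print(kept)
```

In this example the dark issue covers 60% of the images and is dropped, while the two rarer issues remain.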

### Visualize specific issues

Imagelab provides `imagelab.visualize`, which you can use to see examples of specific issues in your dataset.

`num_images` and `cell_size` are optional arguments that control the number of examples shown for each issue type and the size of each image in the image grid, respectively.

In [None]:
issue_types = ["grayscale"]
imagelab.visualize(issue_types=issue_types, num_images=8, cell_size=(3, 3))

## Advanced: Create your own issue type

You can also create a custom issue type by extending the `IssueManager` base class. CleanVision can then detect your custom issue along with the predefined issues in any image dataset. Here's an example of a custom issue manager, which can also be found in the `examples/` folder of the source code.

In [None]:
from typing import Any, Dict, List, Optional

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm

from cleanvision.issue_managers import register_issue_manager
from cleanvision.utils.base_issue_manager import IssueManager
from cleanvision.utils.utils import get_is_issue_colname, get_score_colname

ISSUE_NAME = "custom"


@register_issue_manager(ISSUE_NAME)
class CustomIssueManager(IssueManager):
    """
    Example class showing how you can self-define a custom type of issue that
    CleanVision can simultaneously check your data for alongside its built-in issue types.
    """

    issue_name: str = ISSUE_NAME
    visualization: str = "individual_images"

    def __init__(self) -> None:
        super().__init__()
        self.params = self.get_default_params()

    def get_default_params(self) -> Dict[str, Any]:
        return {"threshold": 0.4}

    def update_params(self, params: Dict[str, Any]) -> None:
        self.params = self.get_default_params()
        non_none_params = {k: v for k, v in params.items() if v is not None}
        self.params = {**self.params, **non_none_params}

    @staticmethod
    def calculate_mean_pixel_value(image: Image.Image) -> float:
        gray_image = image.convert("L")
        return np.mean(np.array(gray_image))

    def get_scores(self, raw_scores: "np.ndarray[Any, Any]") -> "np.ndarray[Any, Any]":
        scores = np.array(raw_scores)
        return scores / 255.0

    def mark_issue(
        self, scores: "np.ndarray[Any, Any]", threshold: float
    ) -> "np.ndarray[Any, Any]":
        return scores < threshold

    def update_summary(self, summary_dict: Dict[str, Any]) -> None:
        self.summary = pd.DataFrame({"issue_type": [self.issue_name]})
        for column_name, value in summary_dict.items():
            self.summary[column_name] = [value]

    def find_issues(
        self,
        *,
        params: Optional[Dict[str, Any]] = None,
        filepaths: Optional[List[str]] = None,
        imagelab_info: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> None:
        super().find_issues(**kwargs)
        assert params is not None
        assert imagelab_info is not None
        assert filepaths is not None

        self.update_params(params)

        raw_scores = []
        for path in tqdm(filepaths):
            # Use a context manager so each image file is closed after scoring
            with Image.open(path) as image:
                raw_scores.append(self.calculate_mean_pixel_value(image))

        self.issues = pd.DataFrame(index=filepaths)
        scores = self.get_scores(raw_scores)
        self.issues[get_score_colname(self.issue_name)] = scores
        self.issues[get_is_issue_colname(self.issue_name)] = self.mark_issue(
            scores, self.params["threshold"]
        )
        self.info[self.issue_name] = {"PixelValue": raw_scores}
        summary_dict = self._compute_summary(
            self.issues[get_is_issue_colname(self.issue_name)]
        )

        self.update_summary(summary_dict)
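
The scoring logic of the custom issue manager can be checked without any image files. The sketch below mirrors the steps above (mean grayscale pixel value, rescaled by 255, flagged when below the 0.4 threshold) using plain lists of pixel values; `brightness_score` and the pixel data are hypothetical.

```python
from statistics import fmean
from typing import List


def brightness_score(gray_pixels: List[int]) -> float:
    """Mean 8-bit pixel value rescaled to the range [0, 1]."""
    return fmean(gray_pixels) / 255.0


threshold = 0.4  # same default as get_default_params above

# Hypothetical grayscale pixel values for two tiny images.
dark_image = [0, 10, 20, 30]
bright_image = [200, 220, 240, 255]

dark_flagged = brightness_score(dark_image) < threshold
bright_flagged = brightness_score(bright_image) < threshold

print(dark_flagged, bright_flagged)
```

With this definition the custom issue flags images whose average brightness is low, which makes it easy to predict which images the manager will report.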

### Run imagelab on custom issue

In [None]:
imagelab = Imagelab(data_path=dataset_path)

issue_name = CustomIssueManager.issue_name


# To ensure your issue manager is registered, check list of possible issue types
# issue_name should be present in this list
imagelab.list_possible_issue_types()

In [None]:
issue_types = {issue_name: {}}
imagelab.find_issues(issue_types)
imagelab.report()