# Find typical issues in image datasets with Cleanvision

We extract typical issues (regarding brightness, blur, aspect ratio, SNR and duplicates) in image datasets with the [Cleanvision library](https://github.com/cleanlab/cleanvision). We then identify critical segments with Spotlight.

More information about this play can be found in the Spotlight documentation: [Find typical issues in image datasets with Cleanvision](https://renumics.com/docs/playbook/cv-issues)

For more data-centric AI workflows, check out our [Awesome Open Data-centric AI](https://github.com/Renumics/awesome-open-data-centric-ai) list on GitHub.

## tl;dr

In [None]:
# @title Install required packages with PIP

!pip install renumics-spotlight cleanvision datasets

In [None]:
from cleanvision.imagelab import Imagelab
import pandas as pd
from renumics import spotlight
import requests
import json


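def cv_issues_cleanvision(df, image_name="image"):
    """Run Cleanvision on the image paths stored in column `image_name`."""
    # Use the column passed by the caller instead of a hard-coded name.
    image_paths = df[image_name].to_list()
    imagelab = Imagelab(filepaths=image_paths)
    imagelab.find_issues()

    # Move the issue table's index into a regular column so the result
    # has a plain positional index.
    df_cv = imagelab.issues.reset_index()

    return df_cv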

## Step-by-step example on CIFAR-100

### Load CIFAR-100 from the Hugging Face Hub and convert it to a pandas DataFrame

In [None]:
import datasets
from renumics import spotlight

dataset = datasets.load_dataset("renumics/cifar100-enriched", split="all")

df = dataset.to_pandas()

### Compute heuristic scores for typical image issues with Cleanvision

In [None]:
df_cv = cv_issues_cleanvision(df)
df = pd.concat([df, df_cv], axis=1)
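
The `pd.concat(..., axis=1)` call above pairs rows by position, so it relies on both frames having the same row order. The sketch below shows one possible alternative on toy data: joining on the file path and then counting flagged images per issue type. The toy frames and the column names (`index`, `is_*_issue`, `*_score`) are assumptions about the shape of the Cleanvision output, not taken from this notebook's results.

```python
import pandas as pd

# Toy stand-in for the dataset frame: one row per image path.
df_toy = pd.DataFrame({"image": ["a.png", "b.png", "c.png"], "label": [0, 1, 2]})

# Toy stand-in for `imagelab.issues.reset_index()`. The column names are
# assumed, and the rows are deliberately in a different order than df_toy.
df_cv_toy = pd.DataFrame(
    {
        "index": ["c.png", "a.png", "b.png"],
        "is_dark_issue": [True, False, False],
        "dark_score": [0.10, 0.90, 0.80],
        "is_blurry_issue": [True, False, True],
        "blurry_score": [0.20, 0.95, 0.30],
    }
)

# Join on the file path so the result does not depend on row order.
merged = df_toy.merge(
    df_cv_toy, left_on="image", right_on="index", how="left"
).drop(columns="index")

# Count flagged images per issue type, most frequent first.
flag_columns = [
    c for c in merged.columns if c.startswith("is_") and c.endswith("_issue")
]
issue_counts = merged[flag_columns].sum().sort_values(ascending=False)
print(issue_counts)
```

A summary like `issue_counts` can help decide which issue columns to look at first in Spotlight.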

### Inspect errors and detect problematic data segments with Spotlight

> ⚠️ Running Spotlight in Colab currently has severe limitations (slow, no similarity map, no layouts) due to Colab restrictions (e.g. no websocket support). Run the notebook locally for the full Spotlight experience.

In [None]:
df_show = df.drop(columns=["embedding", "probabilities"])


# handle google colab differently
import sys

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    # Visualization in Google Colab only works in Chrome and does not support
    # websockets, so we need some workarounds to show anything at all.
    # Copy the slice so the column assignments below do not write to a view.
    df_show = df_show[:10000].copy()
    df_show["embx"] = [emb[0] for emb in df_show["embedding_reduced"]]
    df_show["emby"] = [emb[1] for emb in df_show["embedding_reduced"]]
    port = 50123
    layout_url = "https://raw.githubusercontent.com/Renumics/spotlight/main/playbook/veteran/cv_issues_colab.json"
    response = requests.get(layout_url)
    layout = spotlight.layout.nodes.Layout(**json.loads(response.text))
    spotlight.show(df_show, port=port, dtype={"image": spotlight.Image}, layout=layout)
    from google.colab.output import eval_js  # type: ignore

    print(str(eval_js(f"google.colab.kernel.proxyPort({port}, {{'cache': true}})")))

else:
    layout_url = "https://raw.githubusercontent.com/Renumics/spotlight/main/playbook/veteran/cv_issues.json"
    response = requests.get(layout_url)
    layout = spotlight.layout.nodes.Layout(**json.loads(response.text))
    spotlight.show(
        df_show,
        dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding},
        layout=layout,
    )