# Scoring the quality of synthetic data

This tutorial demonstrates how to format a dataset that has both real and synthetic examples so that Cleanlab Studio can analyze it. You will learn how to use simple functions that automatically assess the quality of your synthetic examples along multiple characteristics. You can use these quality scores to compare different synthetic data generators (including prompts being used to produce the synthetic data).

## Install and import dependencies

Make sure you have `wget`, `zip`, and `unzip` installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:

```bash
pip install cleanlab-studio datasets
```


In [None]:
from cleanlab_studio import Studio
from datasets import load_dataset
from typing import Dict

import os
import datasets
import pandas as pd

In [2]:
# code to render image_filename column of DataFrame as images

from IPython.core.display import HTML

def path_to_img_html(path: str) -> str:
    return f'<img src="{path}" width="100"/>'

def display(df: pd.DataFrame) -> HTML:
    # Returns an HTML object that the notebook renders when it is the last expression in a cell
    return HTML(df.to_html(escape=False, formatters=dict(image_filename=path_to_img_html)))

## Properly format your data

Usually, your real data has been stored separately from your synthetic data. Here we demonstrate how to merge these two data sources into a single dataset with a binary `real_or_synthetic` column. This is the proper format to assess your synthetic data quality with Cleanlab Studio. The proportion of synthetic vs real data can vary, but try to provide enough real data that Cleanlab's AI system can learn key characteristics of the real data.
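As a minimal sketch of this merge on a toy tabular dataset (the `feature` column and its values are made up purely for illustration), the two sources can be tagged and stacked with pandas:

```python
import pandas as pd

# Hypothetical toy data standing in for separately stored real and synthetic sources
real = pd.DataFrame({"feature": [0.1, 0.4, 0.9]})
synthetic = pd.DataFrame({"feature": [0.2, 0.5]})

# Tag each source, then stack them into a single dataset
real["real_or_synthetic"] = "real"
synthetic["real_or_synthetic"] = "synthetic"
combined = pd.concat([real, synthetic], ignore_index=True)

# Give every row a unique identifier
combined["id"] = [f"row_{i}" for i in range(len(combined))]

print(combined["real_or_synthetic"].value_counts().to_dict())
```

The same pattern applies to image data, where each row holds a path to an image instead of feature values.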

We'll use a combination of real and synthetic images of watermelons and pineapples as our tutorial dataset. We do not consider separate training/test data splits for simplicity.

In [None]:
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/synthetic-quality/real_images.zip'
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/synthetic-quality/synthetic_images.zip'

!unzip -qu real_images.zip -d images/ && unzip -qu synthetic_images.zip -d images/

In [None]:
pwd = os.getcwd()

def dataset_to_dataframe(
    dataset: datasets.Dataset,
    label_column: str,
    image_column: str,
    synthetic_flag: str,
) -> pd.DataFrame:

    # Convert to pandas dataframe
    df = dataset.to_pandas()

    # Create id column
    ids = [f"{synthetic_flag}_{i}" for i in range(len(df))]
    df["id"] = ids

    # Format image column
    image_to_image_filename = lambda row: row[image_column]["path"].split(pwd)[-1].strip("/")
    df["image_filename"] = df.apply(image_to_image_filename, axis=1)

    # Format label column
    int2str = dataset.features[label_column].int2str
    df["label"] = df[label_column].apply(int2str)

    # Add synthetic flag
    df["real_or_synthetic"] = synthetic_flag

    return df[["id", "image_filename", "label", "real_or_synthetic"]]


# Load dataset from local directory
real_dataset = load_dataset("imagefolder", data_dir="images/real", split="train")
synthetic_dataset = load_dataset("imagefolder", data_dir="images/synthetic", split="train")


# Convert to pandas dataframes
real_df = dataset_to_dataframe(
    real_dataset, label_column="label", image_column="image", synthetic_flag="real"
)
synthetic_df = dataset_to_dataframe(
    synthetic_dataset, label_column="label", image_column="image", synthetic_flag="synthetic"
)

# Combine real and synthetic dataframes
combined_dataset_df = pd.concat([real_df, synthetic_df], ignore_index=True)

In [None]:
display(combined_dataset_df.groupby(["label", "real_or_synthetic"]).sample(1))

![Random images in the dataset](./assets/synthetic_data/random_images.png)

Note that this DataFrame also contains `id` and `label` columns, which are relevant metadata that we'd like to see in our Cleanlab Studio results. Such metadata is optional; Cleanlab Studio will not use it to evaluate the quality of the synthetic data.

## Load the combined real+synthetic dataset into Cleanlab Studio

### Dataset Structure and Formatting

Cleanlab Studio requires a particular directory structure when uploading locally hosted image datasets, and we follow it here.

Here's the expected directory structure:

```
images
├── metadata.csv
├── real
│   ├── pineapple
│   │   ├── image_0.png
│   │   ├── image_1.png
│   │   ⋮
│   └── watermelon
│       ├── image_0.png
│       ├── image_1.png
│       ⋮
└── synthetic
    ├── pineapple
    │   ├── image_0.png
    │   ├── image_1.png
    │   ⋮
    └── watermelon
        ├── image_0.png
        ├── image_1.png
        ⋮
```

- **Parent Directory**: For our demonstration, `images/` serves as the top-level directory. It holds all images and the essential `metadata.csv` file.

- **Real & Synthetic Directories**: These are fundamental divisions of our dataset. The `real/` and `synthetic/` directories clearly differentiate between actual and generated images.

- **Sub-category Directories (Optional)**: Divisions such as `pineapple/` and `watermelon/` are for organizational clarity. Cleanlab Studio will not consider these in the synthetic data assessment we run here.

Ensure that your image dataset respects a similar structure.
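One quick way to sanity-check a layout like this is to count the image files in each sub-directory. The sketch below is only an illustration: `count_images` is a hypothetical helper (not part of Cleanlab Studio), and it runs against a throwaway directory tree rather than the tutorial's real `images/` folder.

```python
from collections import Counter
from pathlib import Path
import tempfile

def count_images(root: Path) -> Counter:
    """Count .png files per (real/synthetic, sub-category) directory under root."""
    counts = Counter()
    for path in root.glob("*/*/*.png"):
        source, category = path.relative_to(root).parts[:2]
        counts[(source, category)] += 1
    return counts

# Build a tiny throwaway tree that mimics the layout above
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "images"
    for source in ("real", "synthetic"):
        for category in ("pineapple", "watermelon"):
            folder = root / source / category
            folder.mkdir(parents=True)
            (folder / "image_0.png").touch()
    counts = count_images(root)

print(dict(counts))
```

To check your own data, call `count_images(Path("images"))` and compare the counts against what you expect for each source.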

We write the contents of our combined DataFrame to such a `metadata.csv` file and zip the `images/` directory for loading our data into Cleanlab Studio.

In [5]:
combined_dataset_df.to_csv("images/metadata.csv", index=False)

!zip -rq formatted_synthetic_data.zip images/

### Metadata.csv Contents
The `metadata.csv` plays a critical role in helping Cleanlab Studio understand and categorize each image. Below, we delve into the purpose and requirements of each column:

```
id,image_filename,label (optional metadata),real_or_synthetic
real_0,images/real/pineapple/image_0.png,pineapple,real
real_1,images/real/pineapple/image_1.png,pineapple,real
real_2,images/real/pineapple/image_10.png,pineapple,real
...
synthetic_199,images/synthetic/watermelon/image_97.png,watermelon,synthetic
synthetic_200,images/synthetic/watermelon/image_98.png,watermelon,synthetic
synthetic_201,images/synthetic/watermelon/image_99.png,watermelon,synthetic
```

- `id`: A unique identifier for each image. It doesn't have to follow the `real_`/`synthetic_` prefix format as demonstrated above. The chosen format in this tutorial is for clarity and convenience. However, it's crucial that each id is unique across the dataset.
- `image_filename`: This column represents the path to the image file. The path should be relative to the top directory where the `metadata.csv` is located.
- `real_or_synthetic`: *This column is essential for this tutorial*. It categorizes the image as either real or synthetic. Make sure you correctly specify this for every image.
- `label` (optional metadata): This column is optional. It provides a label for the image but doesn't impact the evaluation. It's a piece of additional metadata attached to each example, allowing further analysis of the results if desired.
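The column requirements above can be checked before uploading. The following is a minimal sketch, assuming the column names used in this tutorial; `validate_metadata` is a hypothetical helper written for illustration, not a Cleanlab Studio function.

```python
import pandas as pd

def validate_metadata(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in a metadata DataFrame."""
    problems = []
    for column in ("id", "image_filename", "real_or_synthetic"):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if problems:
        return problems
    if not df["id"].is_unique:
        problems.append("ids are not unique")
    if df["real_or_synthetic"].isna().any():
        problems.append("real_or_synthetic has missing values")
    if df["real_or_synthetic"].nunique() != 2:
        problems.append("real_or_synthetic must contain exactly two classes")
    return problems

# Toy metadata with a deliberately duplicated id
metadata = pd.DataFrame({
    "id": ["real_0", "synthetic_0", "synthetic_0"],
    "image_filename": [
        "images/real/pineapple/image_0.png",
        "images/synthetic/pineapple/image_0.png",
        "images/synthetic/pineapple/image_1.png",
    ],
    "real_or_synthetic": ["real", "synthetic", "synthetic"],
})
print(validate_metadata(metadata))  # ['ids are not unique']
```

An empty list means none of these particular checks failed; it does not guarantee that the upload will succeed.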

From this point, it's just a matter of uploading the zip-file to Cleanlab Studio.

If you're working with a large image dataset, you may want to host your images externally on S3. We [discuss this approach in the 'Finding Issues in Large-Scale Image Datasets' tutorial](https://help.cleanlab.ai/tutorials/large_image_datasets#prep-and-upload-dataset).


In [5]:
# you can find your API key by going to app.cleanlab.ai/upload, 
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"
studio = Studio(API_KEY)

dataset_id = studio.upload_dataset("formatted_synthetic_data.zip", dataset_name="SyntheticData", modality="image", id_column="id")

## Create a Project

Using your `dataset_id` from the previous step, you can create a Project in Cleanlab Studio with one line of code! This will train Cleanlab's AI to analyze your synthetic data for shortcomings relative to the real data, which takes some time (about 1-2 minutes for our tutorial dataset). You will get an email when the process has completed.


In [6]:
project_id = studio.create_project(
    dataset_id, 
    project_name="SyntheticDataProject", 
    modality="image", 
    model_type="fast", 
    label_column="real_or_synthetic", # Don't confuse this with the column named "label" in the dataset
)
print(f'Project successfully created and training has begun! project_id: {project_id}')

**Warning!** This next cell may take a long time to execute while the model is being trained. If your notebook has timed out during this process then you can resume work by re-running the cell (which should return the `cleanset_id` instantly if the project has completed training).

- In the **case of notebook timeout or closing of notebook**, rerun the code cells up to (but not including) the "*Load the combined real+synthetic dataset into Cleanlab Studio*" section. Then run the code below in a new cell, using your API key and the `project_id` printed above, to recover your `cleanset_id` (which you need to download `cleanlab_cols`). **DO NOT RUN `studio.create_project` A SECOND TIME!**

  ```python
  studio = Studio(API_KEY)
  cleanset_id = studio.get_latest_cleanset_id(project_id)
  project_status = studio.poll_cleanset_status(cleanset_id)
  ```

You can check your Project's status with the Python API using the following code:

In [None]:
cleanset_id = studio.get_latest_cleanset_id(project_id)
project_status = studio.poll_cleanset_status(cleanset_id)
print(f'Project training complete! cleanset_id: {cleanset_id}')

Once training is complete, you can proceed to fetching the cleanlab columns from your Project. 

In [None]:
cleanlab_cols = studio.download_cleanlab_columns(cleanset_id)

# Combine the dataset with the cleanlab columns
cleanset_df = combined_dataset_df.merge(cleanlab_cols, on='id')
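By default, `DataFrame.merge` performs an inner join, so rows whose `id` is missing from either side are dropped silently. As a hedged sketch on toy stand-ins for the two DataFrames (the ids and scores are made up), pandas' `indicator` and `validate` options can surface such mismatches:

```python
import pandas as pd

# Toy stand-ins for combined_dataset_df and cleanlab_cols (values are made up)
dataset = pd.DataFrame({"id": ["real_0", "real_1", "synthetic_0"]})
results = pd.DataFrame({"id": ["real_0", "synthetic_0"], "label_issue_score": [0.2, 0.7]})

# how="left" keeps every dataset row, indicator=True records whether a match was found,
# and validate="one_to_one" raises if ids are duplicated on either side
merged = dataset.merge(results, on="id", how="left", indicator=True, validate="one_to_one")
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(unmatched_ids)  # ['real_1']
```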

## Assess the synthetic data via quality scores

Now that we've obtained the results (`cleanlab_cols`) of our Project, we can score the quality of the synthetic data. Thus far, we have not explicitly told Cleanlab Studio that our goal is synthetic data assessment (we just ran a typical Project with a special `real_or_synthetic` column in our dataset). Now we will use functions intended specifically for synthetic data assessment.


Below are some helper functions for processing the dataframes, as well as adding a few helpful columns for the analysis.


In [9]:
from typing import Optional, Tuple

REAL_OR_SYNTH_COLUMN = "real_or_synthetic"

def _fidelity_score(cleanset_df: pd.DataFrame, target_type: str = "synthetic") -> float:
    """Scores the dataset based on how indistinguishable the synthetic data is from real data."""
    return _lqs_to_synthetic_score(cleanset_df, target_type)

def _representativeness_score(cleanset_df: pd.DataFrame, target_type: str = "real") -> float:
    """Scores the dataset on how well the real data is represented in the synthetic data."""
    return _lqs_to_synthetic_score(cleanset_df, target_type)

def _diversity_score(cleanset_df: pd.DataFrame, target_type: str = "synthetic") -> float:
    """Scores the diversity among synthetic examples; higher scores indicate less repetition of similar synthetic instances."""
    return _calculate_synthetic_duplicate_scores(cleanset_df, target_type, False)

def _freshness_score(cleanset_df: pd.DataFrame, target_type: str = "synthetic") -> float:
    """Scores the synthetic data's novelty; higher scores suggest synthetic data isn't just echoing real examples."""
    return _calculate_synthetic_duplicate_scores(cleanset_df, target_type, True)

def _lqs_to_synthetic_score(cleanset_df: pd.DataFrame, target_type: str) -> float:
    # 1. Filter rows by synthetic type
    indices = cleanset_df[REAL_OR_SYNTH_COLUMN] == target_type

    # 2. Get mean label quality score
    mean_lqs = cleanset_df.loc[indices, "label_issue_score"].mean()

    # 3. Compute the score
    score = 1 - mean_lqs
    return score


def _calculate_synthetic_duplicate_scores(cleanset_df: pd.DataFrame, target_type: str, contains_real: bool) -> float:
    
    # (0. Define the complementary class name)
    synthetic_class_names = cleanset_df[REAL_OR_SYNTH_COLUMN].unique()
    assert len(synthetic_class_names) == 2, "The dataset should contain both real and synthetic examples"
    complementary_class: str = synthetic_class_names[~(synthetic_class_names == target_type)][0]
    
    # 1. Extract rows that are marked as near duplicates
    df_near_duplicates = cleanset_df.query("is_near_duplicate")
    
    # 2. For each near-duplicate group, check whether it contains any example from the complementary class
    group_contains_real = df_near_duplicates.groupby('near_duplicate_id').apply(lambda group: (group[REAL_OR_SYNTH_COLUMN] == complementary_class).any())
    
    # 3. Filter the dataframe based on the synthetic type (real or synthetic) and whether it's a near duplicate or not
    filtered_df = df_near_duplicates[df_near_duplicates[REAL_OR_SYNTH_COLUMN] == target_type]
    
    # 4. Count the synthetic examples based on the condition (whether they duplicate real data or not)
    if contains_real:
        synthetic_duplicate_count = filtered_df[filtered_df['near_duplicate_id'].map(group_contains_real)].shape[0]
    else:
        synthetic_duplicate_count = filtered_df[~filtered_df['near_duplicate_id'].map(group_contains_real)].shape[0]
    
    # 5. Compute the total synthetic examples
    total_synthetic_examples = cleanset_df[cleanset_df[REAL_OR_SYNTH_COLUMN] == target_type].shape[0]
    
    # 6. Calculate the final score
    score = 1 - (synthetic_duplicate_count / total_synthetic_examples)
    
    return score


def _display_example_counts(cleanset_df: pd.DataFrame, real_type: str):
    """Displays the number of real and synthetic examples in the dataset."""
    num_real_examples = len(cleanset_df.query(f"{REAL_OR_SYNTH_COLUMN} == '{real_type}'"))
    num_synthetic_examples = len(cleanset_df) - num_real_examples
    print(f"Number of real examples: {num_real_examples}")
    print(f"Number of synthetic examples: {num_synthetic_examples}")


In [10]:
def score_synthetic_data_quality(cleanset_df: pd.DataFrame, synthetic_class_names: Optional[Tuple[str, str]] = None) -> Dict:
    """Computes the scores for a dataset consisting of real and synthetic data, to evaluate the quality of the synthetic data.
    
    Parameters
    ----------
    cleanset_df: pd.DataFrame
        The dataframe containing the dataset to score. It should contain a column named "real_or_synthetic" that indicates whether each example is real or synthetic.
        It should also have the cleanset columns provided by Cleanlab Studio.

    synthetic_class_names: Optional[Tuple[str, str]]
        The class names used in the "real_or_synthetic" column, given as a `(synthetic_name, real_name)` tuple (synthetic class first).
        If not provided, the function assumes the synthetic class is named "synthetic" and the real class is named "real".
    """
    
    # Configure the synthetic class names
    synthetic_type, real_type = synthetic_class_names or ("synthetic", "real")

    # Display the number of real and synthetic examples
    _display_example_counts(cleanset_df, real_type)

    # Compute the scores
    scores = {
        "fidelity": _fidelity_score(cleanset_df, target_type=synthetic_type),
        "representativeness": _representativeness_score(cleanset_df, target_type=real_type),
        "diversity": _diversity_score(cleanset_df, target_type=synthetic_type),
        "freshness": _freshness_score(cleanset_df, target_type=synthetic_type)
    }
    return scores

In [11]:
# Score the quality of the synthetic data in the dataset
score_synthetic_data_quality(cleanset_df)

Number of real examples: 200
Number of synthetic examples: 202


{'representativeness': 0.1389657307255,
 'fidelity': 0.13592233734455428,
 'diversity': 0.9108910891089109,
 'freshness': 0.9752475247524752}

The scores quantify four aspects of the synthetic data: how hard it is to distinguish from real data, how well it covers the real data, its internal diversity, and its novelty relative to real examples.

- **Fidelity**: Evaluates how indistinguishable the synthetic data appears from real data. Low values indicate there are many unrealistic-looking synthetic samples which are obviously fake.
- **Representativeness**: Measures how well represented the real data is amongst the synthetic data samples. Low values indicate there may exist tails of the real data distribution (or rare events) that the distribution of synthetic samples fails to capture. 
- **Diversity**: Evaluates how much variety there is among synthetic samples. Low values indicate an overly repetitive synthetic data generator that produced similar instances which look nearly duplicated.
- **Freshness**: Evaluates the novelty of the synthetic data. Low values indicate many synthetic samples are near duplicates of an example from the real dataset, i.e. the synthetic data generator may be memorizing the real data too closely and failing to generalize.

**Note**: For all quality scores, lower values indicate lower-quality synthetic data.
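To make the arithmetic behind these scores concrete, here is a small worked example that restates the formulas from the helper functions above on a six-row toy DataFrame. The values are made up for illustration and do not come from Cleanlab Studio.

```python
import pandas as pd

toy = pd.DataFrame({
    "real_or_synthetic": ["real", "real", "synthetic", "synthetic", "synthetic", "synthetic"],
    "label_issue_score": [0.5, 1.0, 0.75, 0.5, 1.0, 0.75],
    "is_near_duplicate": [True, False, True, True, True, False],
    "near_duplicate_id": [0, None, 0, 1, 1, None],
})
is_synth = toy["real_or_synthetic"] == "synthetic"
n_synth = int(is_synth.sum())

# Fidelity / representativeness: 1 - mean label_issue_score over synthetic / real rows
fidelity = 1 - toy.loc[is_synth, "label_issue_score"].mean()
representativeness = 1 - toy.loc[~is_synth, "label_issue_score"].mean()

# For every near-duplicate row, does its group contain at least one real example?
dupes = toy[toy["is_near_duplicate"]]
group_has_real = (
    dupes["real_or_synthetic"].eq("real").groupby(dupes["near_duplicate_id"]).transform("any")
)
synth_dupes = dupes["real_or_synthetic"].eq("synthetic")

# Diversity penalizes synthetic rows in synthetic-only groups;
# freshness penalizes synthetic rows in groups that include real data
diversity = 1 - (synth_dupes & ~group_has_real).sum() / n_synth
freshness = 1 - (synth_dupes & group_has_real).sum() / n_synth

print(fidelity, representativeness, diversity, freshness)  # 0.25 0.25 0.5 0.75
```

Here group `0` pairs one real image with one synthetic image (lowering freshness), while group `1` contains two synthetic images only (lowering diversity).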

## Evaluate synthetic samples based on quality scores

The fidelity score gives us an idea of how realistic the synthetic images look. A low score indicates that the synthetic data generator makes too many images that are easy to discern from real images.

In [None]:
# Show unrealistic synthetic images

columns = ["id", "image_filename", "label", "label_issue_score"]
unrealistic_synthetic_images = (
    cleanset_df
    .query("real_or_synthetic == 'synthetic'")
    .sort_values("label_issue_score",ascending=False)
    .head()
    [columns]
)
display(unrealistic_synthetic_images)


![Unrealistic synthetic examples](./assets/synthetic_data/unrealistic_synthetic.png)

The representativeness score indicates whether real examples are well represented in the synthetic data. When the representativeness score is low, it means that the synthetic generator fails to pick up certain patterns in the real data.

In [None]:
# Show underrepresented real images

columns = ["id", "image_filename", "label", "label_issue_score"]
underrepresented_real_images = (
    cleanset_df
    .query("real_or_synthetic == 'real'")
    .sort_values("label_issue_score",ascending=False)
    .head(5)
    [columns]
)
display(underrepresented_real_images)


![Underrepresented real examples](./assets/synthetic_data/underrepresented_real.png)

We use two metrics, the diversity and freshness scores, to measure the occurrence of near duplicates among synthetic examples. The diversity score quantifies how many synthetic examples only resemble other synthetic examples. The freshness score highlights the proportion of synthetic examples that are near duplicates of real instances.
To explore these near duplicates in detail, group them by the `near_duplicate_id` assigned to each identified instance.

For visualizing the set of duplicate images, we have to look up the associated image filenames and collect those. Because near duplicates can technically be exact duplicates, it's important to display some identifier for each example. We use the `id` column for this purpose. With the formatting of the `id` column, we can also get a sense of how many synthetic and real examples are in each group.
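Before rendering images, it can help to tabulate how many real and synthetic examples each group contains. The sketch below does this with `pd.crosstab` on a toy DataFrame. The rows are made up, but the column names match those used in this tutorial.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["real_0", "synthetic_0", "synthetic_1", "synthetic_2", "synthetic_3"],
    "real_or_synthetic": ["real", "synthetic", "synthetic", "synthetic", "synthetic"],
    "is_near_duplicate": [True, True, True, True, False],
    "near_duplicate_id": [0, 0, 1, 1, None],
})

# Keep only rows flagged as near duplicates, then count classes per group
dupes = toy[toy["is_near_duplicate"]]
composition = pd.crosstab(dupes["near_duplicate_id"], dupes["real_or_synthetic"])
print(composition)
```

Groups with a nonzero `real` count are the ones that affect freshness; groups with only `synthetic` members affect diversity.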

In [14]:
def get_near_duplicate_group(df: pd.DataFrame) -> pd.Series:
    # Create a dictionary with the near_duplicate_id as keys and list of indices as values
    near_duplicate_groups = df.groupby('near_duplicate_id').apply(lambda x: x.index.tolist())
    
    # For each row, if it is a near duplicate, get the group indices excluding its own index
    near_duplicate_group_column = df.apply(lambda row: [idx for idx in near_duplicate_groups.get(row['near_duplicate_id'], []) if idx != row.name] if row['is_near_duplicate'] else [], axis=1)
    
    return near_duplicate_group_column

def get_near_duplicate_groupings(df: pd.DataFrame) -> dict:
    groups = get_near_duplicate_group(df)
    return {index: group for index, group in groups.items()}


def get_associated_images_html(index, df, groupings):
    associated_ids = groupings.get(index, [])
    associated_paths = df.loc[associated_ids, 'image_filename'].tolist()

    img_htmls = [path_to_img_html(path) for path in associated_paths]

    # Add a caption showing the id of each image
    captions = [f'<figcaption style="text-align:center">{df.loc[idx]["id"]}</figcaption>' for idx in associated_ids]

    # Wrap each image in a figure tag, and add a caption
    img_htmls_with_captions = [f'<figure>{img_html}{caption}</figure>' for img_html, caption in zip(img_htmls, captions)]

    # Lay the images out side by side in a flex container, instead of one below the other
    img_htmls_side_by_side = '<div style="display:flex">' + ''.join(img_htmls_with_captions) + '</div>'
    return img_htmls_side_by_side

def display_with_associated_images(df: pd.DataFrame, n: int = 5) -> HTML:
    groupings = get_near_duplicate_groupings(df)
    filtered_df = df[df.is_near_duplicate].copy()
    filtered_df['associated_images'] = filtered_df.index.map(lambda index: get_associated_images_html(index, df, groupings))
    columns = ["id", "image_filename", "label", "real_or_synthetic", "associated_images", "near_duplicate_id"]
    return HTML(filtered_df[columns].groupby("near_duplicate_id").first().to_html(escape=False, formatters=dict(image_filename=path_to_img_html)))


First, we can look at sets of near duplicates that are exclusively synthetic. 

These sets contribute to the "diversity" score of the synthetic data, where a low score means that much of the synthetic data is likely being regurgitated over and over again.

In [None]:
def display_low_diversity_synthetic_examples(df: pd.DataFrame):
    groupings = get_near_duplicate_groupings(df)

    filtered_df = df[df.is_near_duplicate].copy()
    group_contains_real = filtered_df.groupby('near_duplicate_id').apply(lambda group: (group['real_or_synthetic'] == 'real').any())
    stale_near_duplicate_ids = group_contains_real[group_contains_real == False].index
    filtered_df['associated_images'] = filtered_df.index.map(lambda index: get_associated_images_html(index, df, groupings))
    filtered_df['is_stale'] = filtered_df['near_duplicate_id'].isin(stale_near_duplicate_ids)
    stale_near_duplicates = filtered_df.query("is_stale")

    
    columns = ["id", "image_filename", "label", "real_or_synthetic", "associated_images", "near_duplicate_id"]
    return HTML(stale_near_duplicates[columns].groupby("near_duplicate_id").first().to_html(escape=False, formatters=dict(image_filename=path_to_img_html)))


display_low_diversity_synthetic_examples(cleanset_df)

![Low diversity synthetic near duplicate examples](./assets/synthetic_data/diversity_issues.png)

The other group of near duplicates worth looking at in this tutorial is the set of synthetic examples that are similar to some real examples. These contribute to the "freshness" score that we calculated.

A low freshness score means that the synthetic data generator has likely memorized its training data and struggles to generate novel samples that are actually different from the real data.

In [None]:
def display_stale_near_duplicates(df: pd.DataFrame):
    groupings = get_near_duplicate_groupings(df)

    filtered_df = df[df.is_near_duplicate].copy()
    group_contains_real = filtered_df.groupby("near_duplicate_id").apply(lambda group: (group["real_or_synthetic"] == "real").any())
    memorized_near_duplicate_ids = group_contains_real[group_contains_real == True].index
    filtered_df["associated_images"] = filtered_df.index.map(lambda index: get_associated_images_html(index, df, groupings))
    filtered_df["is_memorized"] = filtered_df["near_duplicate_id"].isin(memorized_near_duplicate_ids)

    stale_near_duplicates = filtered_df.query("is_memorized")

    
    columns = ["id", "image_filename", "label", "real_or_synthetic", "associated_images", "near_duplicate_id"]
    return HTML(stale_near_duplicates[columns].groupby("near_duplicate_id").first().to_html(escape=False, formatters=dict(image_filename=path_to_img_html)))

display_stale_near_duplicates(cleanset_df)
    

![Low freshness near duplicate examples](./assets/synthetic_data/freshness_issues.png)

## Next steps

The quality of your synthetic dataset is vital. We've shown you how to assess it with one particular image dataset, but the code from this tutorial can be generally applied to any synthetic and real data you might have (including text or tabular data).

Try repeatedly running this notebook and the `score_synthetic_data_quality` function for every set of prompts (or other data generation settings you can tweak) to see if they improve the quality of your synthetic dataset. Note you can run many Cleanlab Studio projects simultaneously to speed up your synthetic data optimization and prompt engineering efforts.
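As a rough sketch of such a comparison, the score dictionaries from several runs can be collected into one table. The generator names and numbers below are placeholders invented for illustration; in practice each inner dictionary would be the output of `score_synthetic_data_quality` for one generator's Project.

```python
import pandas as pd

# Placeholder scores for two hypothetical generators (made-up numbers, not real results)
scores_by_generator = {
    "prompt_v1": {"fidelity": 0.10, "representativeness": 0.15, "diversity": 0.90, "freshness": 0.95},
    "prompt_v2": {"fidelity": 0.30, "representativeness": 0.25, "diversity": 0.85, "freshness": 0.99},
}

# One row per generator, one column per score
comparison = pd.DataFrame.from_dict(scores_by_generator, orient="index")

# Higher is better for all four scores, so idxmax picks the best generator per score
best_per_score = comparison.idxmax()

print(comparison)
print(best_per_score.to_dict())
```

No single generator needs to win on every score, so a table like this mainly helps you see which trade-offs a change in prompts or settings introduces.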