# Detecting Issues in Text Datasets

This is the recommended quickstart tutorial for analyzing text datasets via Cleanlab Studio's [Python API](/guide/quickstart/api/).

In this tutorial, we demonstrate the metadata Cleanlab Studio automatically generates for any text classification dataset. This metadata (returned as "Cleanlab columns") helps you discover various problems in your dataset and understand their severity. This entire notebook is run using the `cleanlab_studio` Python package, so you can audit your datasets programmatically.

## Install and import dependencies

Make sure you have `wget` installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:

In [None]:
!pip install cleanlab-studio

In [1]:
import numpy as np
import pandas as pd
import os

## Load dataset into Cleanlab Studio

The command below uses `wget` to download the tutorial dataset into a local `data` directory.

In [None]:
!wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/banking-text-quickstart.csv -P data

Here we'll use a variant of the [BANKING77](https://paperswithcode.com/dataset/banking77-oos) text dataset. This is a **multi-class classification** dataset where customer service requests are labeled as belonging to one of *K* classes (intent categories).

### Dataset Structure

The data is stored in a standard CSV file containing the following columns:

```
text,label
<a text example>,<a class label>
"<a text example with quotes, to escape commas as column separators>",<another class label>
...
```

You can similarly format any other text dataset and run the rest of this tutorial. Details on how to format your dataset can be found in [this guide](/guide/concepts/datasets/), which also outlines other format options.
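
Before uploading your own file, it can help to confirm that it parses into the expected `text` and `label` columns. The following is a minimal sketch under stated assumptions: the two sample rows and the `sample_csv` / `sample_df` names are hypothetical, and the checks are only a local sanity test, not something Cleanlab Studio requires.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same text,label layout as the tutorial CSV.
# The second row quotes its text because the text itself contains a comma.
sample_csv = io.StringIO(
    "text,label\n"
    "how do i change my pin?,change_pin\n"
    '"my card expires soon, what should i do?",card_about_to_expire\n'
)

sample_df = pd.read_csv(sample_csv)

# Confirm the expected columns exist and that no row is missing its text.
missing_columns = {"text", "label"} - set(sample_df.columns)
assert not missing_columns, f"Missing columns: {missing_columns}"
assert sample_df["text"].notna().all(), "Some rows have no text"

# Inspect how many examples each class has.
print(sample_df["label"].value_counts())
```

To check a real file, replace `sample_csv` with the path to your CSV.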

In [3]:
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/banking-text-quickstart.csv")

Next use your API key to instantiate a `studio` object, which analyzes your dataset.

In [4]:
from cleanlab_studio import Studio

# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

# initialize studio object
studio = Studio(API_KEY)

Load the data into Cleanlab Studio (more details/options can be found in [this guide](/guide/quickstart/api/#uploading-a-dataset)). This may take a while for big datasets.

In [None]:
dataset_id = studio.upload_dataset(dataset_path, dataset_name="banking77oos")
print(f"Dataset ID: {dataset_id}")

## Launch a Project

A Cleanlab Studio project automatically trains ML models to provide AI-based analysis of your dataset. Let's launch one.

In [None]:
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="banking77-oos project",
    modality="text",
    task_type="multi-class",
    model_type="regular",
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Once the project has launched successfully and you see your `project_id`, feel free to close this notebook. It will take some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.

You should only execute the above cell once per dataset. After launching the project, you can poll for its status to programmatically wait until the results are ready for review. Each project creates a [cleanset](/guide/concepts/cleanset/), an improved version of your original dataset that contains additional metadata for helping you clean up the data. The next code cell simply waits until this [cleanset](/guide/concepts/cleanset) has been created.

**Warning!** For big datasets, this next cell may take a long time to execute while Cleanlab's AI model is training. If your Jupyter notebook times out during this process, you can resume work by re-running the cell below (which should return instantly if the project has completed training; **do not** create a new project).

In [None]:
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)

Once the above cell completes execution, your project results are ready for review! At this point, you can optionally view your project in the [Cleanlab Studio web interface](https://app.cleanlab.ai/) and interactively improve your dataset. However, this tutorial will stick with a fully programmatic workflow.

## Download Cleanlab columns

We can fetch [Cleanlab columns](/guide/concepts/cleanlab_columns/) that store metadata for this [cleanset](/guide/concepts/cleanset) using its `cleanset_id`. These columns have the same length as your original dataset and provide metadata about each individual data point, such as which types of issues it exhibits and how severely.

If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call `studio.download_cleanlab_columns(cleanset_id)` with the `cleanset_id` printed from the previous cell.

In [8]:
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()

Unnamed: 0,cleanlab_row_ID,corrected_label,is_label_issue,label_issue_score,suggested_label,is_ambiguous,ambiguous_score,is_well_labeled,is_near_duplicate,near_duplicate_score,near_duplicate_cluster_id,is_outlier,outlier_score,is_initially_unlabeled
0,0,,False,0.335749,,False,0.910778,True,False,0.974842,,False,0.034647,False
1,1,,False,0.341566,,False,0.912265,True,False,0.963262,,False,0.038207,False
2,2,,False,0.231065,,False,0.847001,True,False,0.936913,,False,0.072151,False
3,3,,False,0.26755,,False,0.843165,True,False,0.974782,,False,0.043101,False
4,4,,False,0.349997,,False,0.910711,True,False,0.980708,,False,0.021091,False


## Examples of data issues

Details about all of the Cleanlab columns and their meanings can be found in [this guide](/guide/concepts/cleanlab_columns/). Here we briefly showcase some of the Cleanlab columns that correspond to issues detected in our tutorial dataset:
- **Label issue** indicates the given label of this data point is likely wrong. For such data, consider correcting their label to the `suggested_label` if it seems more appropriate.
- **Ambiguous** indicates this data point does not clearly belong to any of the classes (e.g. a borderline case). Multiple human annotators might disagree on how to label this data point, so you might consider refining your annotation instructions to clarify how to handle data points like this.
- **Outlier** indicates this data point is very different from the rest of the data (looks atypical). The presence of outliers may indicate problems in your data sources, so consider deleting such data from your dataset if appropriate.
- **Near duplicate** indicates there are other data points that are (exactly or nearly) identical to this data point. Duplicated data points can have an outsized impact on models/analytics, so consider deleting the extra copies from your dataset if appropriate.

The data points exhibiting each type of issue are indicated with boolean values in the respective `is_<issue>` column, and the severity of this issue in each data point is quantified in the respective `<issue>_score` column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
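
As a rough illustration of this `is_<issue>` / `<issue>_score` naming pattern, the sketch below tallies how many rows are flagged for each issue type and the highest score observed. The `toy_columns_df` DataFrame and its values are made up for illustration; with real results you would apply the same logic to the DataFrame of Cleanlab columns.

```python
import pandas as pd

# Hypothetical stand-in for the Cleanlab columns DataFrame (values are made up).
toy_columns_df = pd.DataFrame({
    "is_label_issue": [True, False, False, True],
    "label_issue_score": [0.85, 0.23, 0.31, 0.77],
    "is_outlier": [False, False, True, False],
    "outlier_score": [0.03, 0.04, 0.20, 0.02],
})

# Each issue type has a boolean flag column and a numeric score column.
issue_types = ["label_issue", "outlier"]
summary = pd.DataFrame({
    "num_flagged": {
        issue: int(toy_columns_df[f"is_{issue}"].sum()) for issue in issue_types
    },
    "max_score": {
        issue: toy_columns_df[f"{issue}_score"].max() for issue in issue_types
    },
})
print(summary)
```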

Let's go through some of the Cleanlab columns and types of data issues, starting with label issues (i.e. mislabeled data). We first create a `given_label` column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).

In [9]:
from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)

# Load the dataset into a DataFrame
df = pd.read_csv(dataset_path)

# Combine the dataset with the cleanlab columns
combined_dataset_df = df.merge(cleanlab_columns_df, left_index=True, right_on="cleanlab_row_ID")

# Set a "given_label" column to the original label
combined_dataset_df.rename(columns={"label": "given_label"}, inplace=True)
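
The merge above pairs each row of the original data with its metadata by matching the DataFrame index against `cleanlab_row_ID`. The sketch below illustrates that join on made-up data; the `toy_data` and `toy_cols` frames are hypothetical, and the `validate="one_to_one"` argument is an optional pandas safeguard added here for illustration, not something the tutorial's own cell uses.

```python
import pandas as pd

# Hypothetical original data: the default integer index plays the role of the row ID.
toy_data = pd.DataFrame({
    "text": ["how do i find my new pin?", "cancel transaction", "payment did not process"],
    "label": ["visa_or_mastercard", "cancel_transfer", "beneficiary_not_allowed"],
})

# Hypothetical metadata, deliberately shuffled to show that row order does not matter.
toy_cols = pd.DataFrame({
    "cleanlab_row_ID": [2, 0, 1],
    "is_label_issue": [False, True, False],
})

# validate="one_to_one" makes pandas raise an error if any row ID appears twice.
merged = toy_data.merge(
    toy_cols, left_index=True, right_on="cleanlab_row_ID", validate="one_to_one"
)
print(merged.sort_values("cleanlab_row_ID"))
```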

To see which text examples are estimated to be mislabeled, we filter by `is_label_issue`. We sort by `label_issue_score` to see which of these data points are *most likely* mislabeled.

In [10]:
samples_ranked_by_label_issue_score = combined_dataset_df.query("is_label_issue").sort_values("label_issue_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(samples_ranked_by_label_issue_score.head(5)[columns_to_display])

Unnamed: 0,cleanlab_row_ID,text,label_issue_score,is_label_issue,given_label,suggested_label
978,978,why am i being charge a fee when using an atm?,0.851647,True,card_about_to_expire,card_payment_fee_charged
974,974,can i change my pin on holiday?,0.769496,True,beneficiary_not_allowed,change_pin
960,960,how do i find my new pin?,0.767412,True,visa_or_mastercard,change_pin
980,980,why do i see extra charges for withdrawing my money?,0.765521,True,card_about_to_expire,card_payment_fee_charged
972,972,what atms will allow me to change my pin?,0.757016,True,beneficiary_not_allowed,change_pin


Note that in each of these examples, the `given_label` really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request). Data labeling is an error-prone process and annotators make mistakes! Luckily we can easily correct these data points by just using Cleanlab's `suggested_label` above, which seems like a much more suitable label in most cases.

While the boolean flags above can help estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. You can alternatively ignore the boolean `is_label_issue` flags and filter the data by thresholding the `label_issue_score` yourself (if, say, you find the default thresholds produce false positives/negatives).
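
A minimal sketch of such manual thresholding follows, using made-up data. The `toy_df` values and the `score_threshold` of 0.5 are assumptions chosen only for illustration; a suitable cutoff depends on your data and how many examples you can afford to review.

```python
import pandas as pd

# Hypothetical examples with made-up scores and flags.
toy_df = pd.DataFrame({
    "text": [
        "why am i being charged a fee at the atm?",
        "is my card accepted abroad?",
        "how do i top up?",
        "can i change my pin on holiday?",
    ],
    "label_issue_score": [0.85, 0.52, 0.31, 0.77],
    "is_label_issue": [True, False, False, True],
})

score_threshold = 0.5  # hypothetical cutoff, tune for your own data

# Keep every row at or above the threshold, most suspicious first,
# regardless of what the default boolean flag says.
custom_flagged = toy_df[toy_df["label_issue_score"] >= score_threshold].sort_values(
    "label_issue_score", ascending=False
)
print(custom_flagged[["text", "label_issue_score", "is_label_issue"]])
```

Lowering the threshold surfaces more candidates for review (here, one row that the boolean flag did not mark), at the cost of more false positives.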

Next, let's look at the ambiguous examples detected in the dataset.

In [11]:
samples_ranked_by_ambiguous_score = combined_dataset_df.query("is_ambiguous").sort_values("ambiguous_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "ambiguous_score", "is_ambiguous", "given_label", "suggested_label"]
display(samples_ranked_by_ambiguous_score.head(5)[columns_to_display])

Unnamed: 0,cleanlab_row_ID,text,ambiguous_score,is_ambiguous,given_label,suggested_label
954,954,i tried to withdraw 40 pounds but only 20 came out. did you steal my money?,0.982856,True,card_payment_fee_charged,
965,965,why haven't i gotten my payment yet?,0.979637,True,lost_or_stolen_phone,
662,662,payment did not process,0.976018,True,beneficiary_not_allowed,card_payment_fee_charged
958,958,how long do money transfers take? my friend really needs the money i sent a couple of hours ago but it's not there yet.,0.975079,True,supported_cards_and_currencies,
967,967,the card payment didn't work,0.972252,True,change_pin,


Next, let's look at the outliers detected in the dataset.

In [12]:
samples_ranked_by_outlier_score = combined_dataset_df.query("is_outlier").sort_values("outlier_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "outlier_score", "is_outlier", "given_label", "suggested_label"]
display(samples_ranked_by_outlier_score.head(5)[columns_to_display])


Unnamed: 0,cleanlab_row_ID,text,outlier_score,is_outlier,given_label,suggested_label
999,999,636C65616E6C616220697320617765736F6D6521,0.200036,True,cancel_transfer,
990,990,Connection Timed Out,0.19057,True,apple_pay_or_google_pay,
81,81,cancel transaction,0.178279,True,cancel_transfer,
998,998,https://github.com/cleanlab/cleanlab,0.174992,True,visa_or_mastercard,
662,662,payment did not process,0.167117,True,beneficiary_not_allowed,card_payment_fee_charged


Next, let's look at the near duplicates detected in the dataset.

In [13]:
# nunique() ignores missing values, so rows without a cluster ID are not counted
n_near_duplicate_sets = combined_dataset_df["near_duplicate_cluster_id"].nunique()
print(f"There are {n_near_duplicate_sets} sets of near duplicate texts in the dataset.")

There are 5 sets of near duplicate texts in the dataset.


Note that the near duplicate data points each have an associated `near_duplicate_cluster_id` integer. Data points that share the same ID are near duplicates of each other, so you can use this column to find the near duplicates of any data point. Remember that near duplicates also include *exact* duplicates (which have `near_duplicate_score` $=1$).
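
To see every set at once rather than one ID at a time, one option is to group rows by their cluster ID. This is a sketch on made-up data: the `toy_df` frame is hypothetical, and it assumes rows outside any near-duplicate set have a missing cluster ID, as in the output shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is not a near duplicate, so its cluster ID is missing.
toy_df = pd.DataFrame({
    "cleanlab_row_ID": [0, 1, 2, 3, 4],
    "near_duplicate_cluster_id": [np.nan, 0, 0, 1, 1],
})

# Drop rows without a cluster, then collect the row IDs belonging to each cluster.
cluster_members = (
    toy_df.dropna(subset=["near_duplicate_cluster_id"])
    .groupby("near_duplicate_cluster_id")["cleanlab_row_ID"]
    .agg(list)
)
print(cluster_members)
```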
 

Let's check out the near duplicates with id $= 3$:

In [15]:
near_duplicate_cluster_id = 3  # play with this value to see other sets of near duplicates
selected_samples_by_near_duplicate_cluster_id = combined_dataset_df.query("near_duplicate_cluster_id == @near_duplicate_cluster_id")

columns_to_display = ["cleanlab_row_ID", "text", "near_duplicate_score", "is_near_duplicate", "given_label"]
selected_samples_by_near_duplicate_cluster_id[columns_to_display]

Unnamed: 0,cleanlab_row_ID,text,near_duplicate_score,is_near_duplicate,given_label
475,475,what should i do if my phone is lost or stolen?,0.997346,True,lost_or_stolen_phone
481,481,what should i do if my smart phone is lost or stolen?,0.997346,True,lost_or_stolen_phone


## Improve the dataset based on the detected issues

Since the results of this analysis appear reasonable, let's use the Cleanlab columns to improve the quality of our dataset. For your own datasets, which actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets, so we caution against blindly copying the actions we perform below.

For data flagged by `is_label_issue`, we create a new `corrected_label` array, which holds the given label for data without detected label issues and the `suggested_label` for data with detected label issues.

In [16]:
corrected_label = np.where(combined_dataset_df["is_label_issue"],
                           combined_dataset_df["suggested_label"],
                           combined_dataset_df["given_label"])
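
To make the behavior of `np.where` concrete, here is the same selection applied to three made-up rows. The `toy_df` frame is hypothetical; the label names are borrowed from the examples shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first and third are flagged as label issues.
toy_df = pd.DataFrame({
    "given_label": ["card_about_to_expire", "change_pin", "visa_or_mastercard"],
    "suggested_label": ["card_payment_fee_charged", np.nan, "change_pin"],
    "is_label_issue": [True, False, True],
})

# Where the flag is True take the suggested label, otherwise keep the given label.
toy_corrected = np.where(
    toy_df["is_label_issue"], toy_df["suggested_label"], toy_df["given_label"]
)

# Count how many labels were actually changed.
num_changed = int((toy_corrected != toy_df["given_label"]).sum())
print(toy_corrected, num_changed)
```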

For data marked as outlier or ambiguous, we will simply exclude them from our dataset. Here we create a boolean vector `rows_to_exclude` to track which data points will be excluded.

In [17]:
# boolean Series marking which rows (outliers or ambiguous examples) will be excluded
rows_to_exclude = combined_dataset_df["is_outlier"] | combined_dataset_df["is_ambiguous"]

For each set of near duplicates, we only want to keep one of the data points that share a common `near_duplicate_cluster_id` (so that the resulting dataset will no longer contain any near duplicates).

In [18]:
near_duplicates_to_exclude = combined_dataset_df['is_near_duplicate'] & combined_dataset_df['near_duplicate_cluster_id'].duplicated(keep='first')

rows_to_exclude |= near_duplicates_to_exclude
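
The `&` with `is_near_duplicate` in the cell above matters because pandas treats missing values as duplicates of one another. The sketch below shows this on made-up data (the `toy_df` frame is hypothetical): without the extra condition, the second row with a missing cluster ID would be marked for exclusion too.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0 and 3 belong to no near-duplicate set.
toy_df = pd.DataFrame({
    "is_near_duplicate": [False, True, True, False, True, True],
    "near_duplicate_cluster_id": [np.nan, 0, 0, np.nan, 1, 1],
})

# Marks every repeat of a cluster ID after its first occurrence.
# Missing IDs count as repeats of each other, so row 3 is marked here as well.
repeated_cluster = toy_df["near_duplicate_cluster_id"].duplicated(keep="first")

# Restrict to genuine near duplicates so row 3 is kept.
to_exclude = toy_df["is_near_duplicate"] & repeated_cluster

print(toy_df[~to_exclude])
```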

We can check the total amount of excluded data:

In [19]:
print(f"Excluding {rows_to_exclude.sum()} text examples (out of {len(combined_dataset_df)})")

Excluding 29 text examples (out of 1000)


Finally, let's actually make a new version of our dataset with these changes. 

We build a new dataframe from the original, applying corrections and exclusions, and then save it as a separate CSV file. The new file has the same format as our original dataset, so you can use it as a drop-in replacement to get more reliable results in your ML and analytics pipelines without changing your existing modeling code.

In [20]:
new_dataset_filename = "improved_dataset.csv"  # where to save the new dataset

In [None]:
# Fetch the original dataset
fixed_dataset = combined_dataset_df[["text"]].copy()

# Add the corrected label column 
fixed_dataset["label"] = corrected_label

# Automatically exclude selected rows
fixed_dataset = fixed_dataset[~rows_to_exclude]

# Check if the file exists before saving
if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite, so please delete it first or specify a different new_dataset_filename.")
else:
    # Save the adjusted dataset to a CSV file
    fixed_dataset.to_csv(new_dataset_filename, index=False)
    print(f"Adjusted dataset saved to {new_dataset_filename}")
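
As an optional sanity check, you might reload a saved file and confirm it keeps the original `text,label` layout. The sketch below does this on made-up data written to a temporary directory; the `toy_fixed` frame and the `toy_path` location are hypothetical stand-ins for the real dataframe and filename.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the improved dataset.
toy_fixed = pd.DataFrame({
    "text": ["can i change my pin on holiday?", "how do i find my new pin?"],
    "label": ["change_pin", "change_pin"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "improved_dataset.csv")
    # index=False keeps the row index out of the file, matching the original format.
    toy_fixed.to_csv(toy_path, index=False)
    reloaded = pd.read_csv(toy_path)

# The reloaded file should have exactly the two original columns and the same rows.
assert list(reloaded.columns) == ["text", "label"]
assert len(reloaded) == len(toy_fixed)
print(reloaded)
```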
