# Detecting Issues in a Text Dataset with Cleanlab

Working through this page of the Hugging Face Open-Source AI Cookbook:

[https://huggingface.co/learn/cookbook/issues_in_text_dataset](https://huggingface.co/learn/cookbook/issues_in_text_dataset)

## Dataset

We use the first 1,000 customer service requests from the train split of the Banking77 dataset (`PolyAI/banking77`). Each request is labelled with its intent, and this subset covers 7 intent categories. You can run the same code on any text classification dataset.

## Overview

- Use a pretrained transformer model to extract the text embeddings from the customer service requests
- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities
- Run Cleanlab’s Datalab audit with these predictions and embeddings to identify problems such as label issues, outliers, and near duplicates in the dataset

In [1]:
!pip install cleanlab

Collecting cleanlab
  Downloading cleanlab-2.6.5-py3-none-any.whl.metadata (63 kB)
Downloading cleanlab-2.6.5-py3-none-any.whl (352 kB)
Installing collected packages: cleanlab
Successfully installed cleanlab-2.6.5


In [2]:
!pip install -U scikit-learn sentence-transformers datasets
!pip install -U "cleanlab[datalab]"

Collecting scikit-learn
  Downloading scikit_learn-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.0.0-py3-none-any.whl.metadata (10 kB)
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/aiohttp-3.9.1.dist-info/METADATA'

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/aiohttp-3.9.1.dist-info/METADATA'

In [4]:
pip install -U sentence-transformers

Collecting sentence-transformers
  Using cached sentence_transformers-3.0.0-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.0-py3-none-any.whl (224 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.0
Note: you may need to restart the kernel to use updated packages.
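
Since one of the installs above failed with an `OSError`, it can help to confirm which packages actually ended up installed before importing them. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this note and is not part of any of these libraries.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["cleanlab", "scikit-learn", "sentence-transformers", "datasets"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as not installed can then be retried on its own, as was done here for `sentence-transformers`.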


In [5]:
import re
import string
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

from cleanlab import Datalab



In [6]:
import random
import numpy as np

pd.set_option("display.max_colwidth", None)

SEED = 123456  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)

# Load and format the dataset

In [7]:
from datasets import load_dataset

dataset = load_dataset("PolyAI/banking77", split="train")
data = pd.DataFrame(dataset[:1000])
data.head()

| | text | label |
|---|---|---|
| 0 | I am still waiting on my card? | 11 |
| 1 | What can I do if my card still hasn't arrived after 2 weeks? | 11 |
| 2 | I have been waiting over a week. Is the card still coming? | 11 |
| 3 | Can I track my card while it is in the process of delivery? | 11 |
| 4 | How do I know if I will get my card, or if it is lost? | 11 |


In [8]:
raw_texts, labels = data["text"].values, data["label"].values
num_classes = len(set(labels))

In [9]:
print(f"This dataset has {num_classes} classes.")
print(f"Classes: {set(labels)}")

This dataset has 7 classes.
Classes: {32, 34, 36, 11, 13, 46, 17}
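
Because the subset is simply the first 1,000 rows, it is worth checking how many examples each class has (Datalab also runs a class imbalance check later). A minimal sketch with NumPy follows; `labels_demo` is a small made-up array standing in for the notebook's `labels`.

```python
import numpy as np

# Made-up stand-in for the notebook's `labels` array.
labels_demo = np.array([11, 11, 11, 13, 13, 32, 46, 46, 46, 46])

# Distinct label IDs and how often each occurs.
class_ids, counts = np.unique(labels_demo, return_counts=True)
class_counts = dict(zip(class_ids.tolist(), counts.tolist()))

for class_id, count in class_counts.items():
    print(f"class {class_id}: {count} examples ({count / len(labels_demo):.0%})")
```

Passing the real `labels` array instead of `labels_demo` gives the per-class counts for this dataset.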


In [10]:
i = 1  # change this to view other examples from the dataset
print(f"Example Label: {labels[i]}")
print(f"Example Text: {raw_texts[i]}")

Example Label: 11
Example Text: What can I do if my card still hasn't arrived after 2 weeks?


We will use numeric representations from a pretrained Transformer model as embeddings of our text. The Sentence Transformers library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model and run our data through the network to extract a vector embedding of each example.

In [11]:
transformer = SentenceTransformer("google/electra-small-discriminator")
text_embeddings = transformer.encode(raw_texts)




In [17]:
text_embeddings[0].shape

(256,)
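
Each request becomes a single 256-dimensional vector even though the model produces one vector per token. As far as I understand, when given a plain Transformer checkpoint like this one, Sentence Transformers averages the token vectors (mean pooling), skipping padding. The sketch below illustrates that idea in NumPy on made-up 2-dimensional token vectors; `mean_pool` is a hypothetical helper, not the library's code.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) float array
    attention_mask:   (batch, seq_len) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors.shape)  # one vector per sentence
```

The padded token in the first sentence has no effect on its vector, which is the point of applying the mask.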

Our subsequent ML model will directly operate on elements of text_embeddings in order to classify the customer service requests.

---



In [18]:
model = LogisticRegression(max_iter=400)

pred_probs = cross_val_predict(model, text_embeddings, labels, method="predict_proba")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
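
The warning above suggests scaling the features or raising `max_iter`. The sketch below shows one way to do that while still getting out-of-sample probabilities. It runs on synthetic data from `make_classification` rather than the real embeddings, and the pipeline and `max_iter=1000` are my own choices, not the cookbook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (text_embeddings, labels); sizes are arbitrary.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# The scaler sits inside the pipeline, so it is refit on each training fold
# and never sees the held-out fold.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each row's probabilities come from a model that was not trained on that row.
pred_probs_demo = cross_val_predict(clf, X_demo, y_demo, cv=5, method="predict_proba")

# Labels here are 0, 1, 2, so the argmax column index is the class itself.
predicted = pred_probs_demo.argmax(axis=1)
print(f"out-of-sample accuracy: {accuracy_score(y_demo, predicted):.3f}")
print(f"out-of-sample log loss: {log_loss(y_demo, pred_probs_demo):.3f}")
```

With label IDs such as 11, 13, 17 the columns of `pred_probs` follow the sorted class IDs, so an argmax column has to be mapped back through `np.unique(labels)` before comparing it with the given labels.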


In [20]:
data_dict = {"texts": raw_texts, "labels": labels}

In [21]:
lab = Datalab(data_dict, label_name="labels")
lab.find_issues(pred_probs=pred_probs, features=text_embeddings)

Finding null issues ...
Finding label issues ...




Finding outlier issues ...
Fitting OOD estimator based on provided features ...




Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...

Audit complete. 62 issues found in the dataset.


In [22]:
lab.report()

Here is a summary of the different kinds of issues found in the data:

    issue_type  num_issues
       outlier          37
near_duplicate          14
         label          10
       non_iid           1

Dataset Information: num_examples: 1000, num_classes: 7


---------------------- outlier issues ----------------------

About this issue:
	Examples that are very different from the rest of the dataset 
    (i.e. potentially out-of-distribution or rare/anomalous instances).
    

Number of examples with this issue: 37
Overall dataset quality in terms of this issue: 0.3671

Examples representing most severe instances of this issue:
     is_outlier_issue  outlier_score
791              True       0.024866
601              True       0.031162
863              True       0.060738
355              True       0.064199
157              True       0.065075


------------------ near_duplicate issues -------------------

About this issue:
	A (near) duplicate issue refers to two or more example

In [24]:
raw_texts[429], raw_texts[459]  # Datalab flags these two as near duplicates

('I purchased something overseas and the incorrect exchange rate was applied.',
 'I purchased something abroad and the incorrect exchange rate was applied.')

In [26]:
raw_texts[412], raw_texts[501], raw_texts[517]  # Datalab flags these three as near duplicates

("The exchange rate you are using is bad.This can't be the official interbank exchange rate.",
 "The exchange rate you are using is really bad.This can't be the official interbank exchange rate.",
 "The exchange rate you are using is really bad. This can't possibly be the official interbank exchange rate.")

In [27]:
raw_texts[607], raw_texts[698]  # Datalab flags these two as near duplicates

("There is an odd 1£ charge that appears as pending on my statement. What's the reason for that? I haven't purchased anything for a pound.",
 "There is a strange 1£ charge that appears as pending on my statement. What's the cause for that? I haven't purchased anything for a pound.")

In [30]:
raw_texts[397], labels[397]  # flagged by Datalab as a label issue, i.e. possibly mislabeled

# given label:     32 (exchange_rate)
# suggested label: 11 (card_arrival)

('I want to know your exchange rates.', 32)

In [31]:
raw_texts[485], labels[485]

# given label:     17 (card_payment_wrong_exchange_rate)
# suggested label: 34 (extra_charge_on_statement)
# The suggestion seems right (I didn't read every label description, but 17 does look wrong in any case)

('Was I charged more than I should of been for a currency exchange?', 17)

In [28]:
label_issues = lab.get_issues("label")
label_issues.head()

| | is_label_issue | label_score | given_label | predicted_label |
|---|---|---|---|---|
| 0 | False | 0.903878 | 11 | 11 |
| 1 | False | 0.86055 | 11 | 11 |
| 2 | False | 0.658273 | 11 | 11 |
| 3 | False | 0.69705 | 11 | 11 |
| 4 | False | 0.435318 | 11 | 11 |


This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled).
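
One simple way to build such a score is "self-confidence": the out-of-sample probability the model assigns to the label the example was given. The sketch below illustrates that idea with made-up numbers. It is my own illustration and may not match exactly what Datalab computes internally, and `self_confidence` is a hypothetical helper name.

```python
import numpy as np


def self_confidence(pred_probs, given_labels, classes):
    """Return, for each row, the predicted probability of its given label.

    pred_probs:   (n_examples, n_classes), columns ordered like `classes`
    given_labels: (n_examples,) label IDs as they appear in the dataset
    classes:      sorted array of the distinct label IDs
    """
    column_of = {label: j for j, label in enumerate(classes)}
    columns = np.array([column_of[label] for label in given_labels])
    return pred_probs[np.arange(len(given_labels)), columns]


classes_demo = np.array([11, 13, 32])
probs_demo = np.array([
    [0.90, 0.05, 0.05],  # model agrees with the given label 11
    [0.10, 0.20, 0.70],  # model prefers 32, but the row is labelled 11
    [0.30, 0.60, 0.10],  # model agrees with the given label 13
])
given_demo = np.array([11, 11, 13])

label_scores_demo = self_confidence(probs_demo, given_demo, classes_demo)
print(label_scores_demo)  # a low score means the label disagrees with the model
```

The second row gets the lowest score because the model puts most of its probability on a different class than the one it was given.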

We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 most likely mislabeled examples in our dataset.

In [32]:
identified_label_issues = label_issues[label_issues["is_label_issue"]]
lowest_quality_labels = label_issues["label_score"].argsort()[:5].to_numpy()

print(
    f"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\n"
    f"Here are indices of the top 5 most likely errors: \n {lowest_quality_labels}"
)

cleanlab found 10 potential label errors in the dataset.
Here are indices of the top 5 most likely errors: 
 [379 100 300 485 159]
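
One thing to keep in mind: `Series.argsort` returns positions, which equal the row labels here only because the DataFrame has the default 0..n-1 index. A sketch of an alternative that returns index labels directly is below; the small `scores` frame is made up and deliberately uses a non-default index.

```python
import pandas as pd

# Made-up scores with a non-default index, to show that labels are returned.
scores = pd.DataFrame(
    {
        "is_label_issue": [False, True, True, False],
        "label_score": [0.90, 0.05, 0.20, 0.65],
    },
    index=[10, 20, 30, 40],
)

# Index labels of the two lowest-scoring rows, lowest first.
worst_labels = scores["label_score"].nsmallest(2).index.to_list()

# Only the rows that were actually flagged, most suspicious first.
flagged = scores[scores["is_label_issue"]].sort_values("label_score")

print(worst_labels)
print(flagged)
```

On the real `label_issues` frame both approaches should pick the same rows, since its index is the default one.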


Let’s review some of the most likely label errors.

Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.

In [33]:
data_with_suggested_labels = pd.DataFrame(
    {"text": raw_texts, "given_label": labels, "suggested_label": label_issues["predicted_label"]}
)
data_with_suggested_labels.iloc[lowest_quality_labels]

| | text | given_label | suggested_label |
|---|---|---|---|
| 379 | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32 | 11 |
| 100 | can you share card tracking number? | 11 | 36 |
| 300 | If I need to cash foreign transfers, how does that work? | 32 | 46 |
| 485 | Was I charged more than I should of been for a currency exchange? | 17 | 34 |
| 159 | Is there any way to see my card in the app? | 13 | 11 |


**I read through the label lookup on the dataset page and tbh only 1-2 of these seem wrong**

[https://huggingface.co/datasets/PolyAI/banking77](https://huggingface.co/datasets/PolyAI/banking77)



---


# Outliers

According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via get_issues. We sort the resulting DataFrame by cleanlab’s outlier quality score to see the most severe outliers in our dataset.

In [34]:
outlier_issues = lab.get_issues("outlier")
outlier_issues.sort_values("outlier_score").head()

| | is_outlier_issue | outlier_score |
|---|---|---|
| 791 | True | 0.024866 |
| 601 | True | 0.031162 |
| 863 | True | 0.060738 |
| 355 | True | 0.064199 |
| 157 | True | 0.065075 |
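
To get a feel for what an outlier score over embeddings can look like, here is a simplified sketch based on the average cosine distance to the k nearest neighbors. This is only an illustration of the general idea under my own assumptions. The function name, `k=3`, and the rescaling are all made up, and this is not cleanlab's exact formula.

```python
import numpy as np


def knn_outlier_scores(features, k=3):
    """Score each row in [0, 1]; lower means farther from its k nearest neighbors."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit.T          # pairwise cosine distances
    np.fill_diagonal(distances, np.inf)      # a point is not its own neighbor
    nearest = np.sort(distances, axis=1)[:, :k]
    avg_knn_distance = nearest.mean(axis=1)
    # Rescale by the median so the score does not depend on the distance scale.
    scale = max(np.median(avg_knn_distance), 1e-12)
    return np.exp(-avg_knn_distance / scale)


rng = np.random.default_rng(0)
cluster = rng.normal(loc=[5.0, 5.0, 0.0], scale=0.1, size=(20, 3))
odd_one = np.array([[-5.0, 0.0, 5.0]])       # points in a very different direction
features_demo = np.vstack([cluster, odd_one])

outlier_scores_demo = knn_outlier_scores(features_demo)
print(outlier_scores_demo.argmin())  # index of the most outlying row
```

As in the report above, the lowest score marks the most atypical example.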


In [35]:
lowest_quality_outliers = outlier_issues["outlier_score"].argsort()[:5]

data.iloc[lowest_quality_outliers]

| | text | label |
|---|---|---|
| 791 | withdrawal pending meaning? | 46 |
| 601 | $1 charge in transaction. | 34 |
| 863 | My atm withdraw is stillpending | 46 |
| 355 | explain the interbank exchange rate | 32 |
| 157 | lost card found, want to put it back in app | 13 |


cleanlab flags entries that look atypical compared with the rest of the dataset. In this subset, the most severe outliers are terse or fragmentary requests (for example "withdrawal pending meaning?" and "My atm withdraw is stillpending") rather than clearly out-of-scope or nonsensical text. Carefully consider whether such outliers may detrimentally affect your data modeling, and consider removing them from the dataset if so.
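
If, after reviewing them, some flagged outliers should go, they can be filtered out with a boolean mask. This is a minimal sketch on made-up frames standing in for `data` and `lab.get_issues("outlier")`; it assumes both share the same row index, as they do in this notebook.

```python
import pandas as pd

# Made-up stand-ins for `data` and `lab.get_issues("outlier")`.
data_demo = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [11, 11, 46, 46]})
outliers_demo = pd.DataFrame(
    {
        "is_outlier_issue": [False, True, False, False],
        "outlier_score": [0.80, 0.02, 0.75, 0.40],
    }
)

# Keep only the rows that were not flagged, then renumber them.
keep = ~outliers_demo["is_outlier_issue"]
cleaned = data_demo[keep].reset_index(drop=True)
print(cleaned)
```

Resetting the index is optional, but note that the old row numbers from the report no longer apply afterwards.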

---

# Duplicates

According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via get_issues. We sort the resulting DataFrame by cleanlab’s near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated.

In [36]:
duplicate_issues = lab.get_issues("near_duplicate")
duplicate_issues.sort_values("near_duplicate_score").head()

| | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
|---|---|---|---|---|
| 459 | True | 0.009549 | [429] | 0.000566 |
| 429 | True | 0.009549 | [459] | 0.000566 |
| 412 | True | 0.046045 | [501] | 0.002781 |
| 501 | True | 0.046045 | [412, 517] | 0.002781 |
| 698 | True | 0.054626 | [607] | 0.003314 |
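
The `near_duplicate_sets` column lists, for each row, the other rows that sit very close to it in embedding space. A simplified sketch of that idea follows: flag every pair whose cosine distance falls under a fixed threshold. The function name and the `threshold=0.01` value are made up for illustration, and Datalab's actual threshold rule may differ.

```python
import numpy as np


def near_duplicate_sets(features, threshold=0.01):
    """For each row, list the other rows within `threshold` cosine distance."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit.T
    np.fill_diagonal(distances, np.inf)  # never pair a row with itself
    return [np.flatnonzero(row <= threshold).tolist() for row in distances]


features_demo = np.array([
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],  # almost the same direction as row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

sets_demo = near_duplicate_sets(features_demo)
print(sets_demo)
```

A row can belong to a larger set than its neighbor does, which is how 501 can list both 412 and 517 while 412 lists only 501.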


In [37]:
data.iloc[[459, 429]]

| | text | label |
|---|---|---|
| 459 | I purchased something abroad and the incorrect exchange rate was applied. | 17 |
| 429 | I purchased something overseas and the incorrect exchange rate was applied. | 17 |


In [38]:
data.iloc[[501, 412]]

| | text | label |
|---|---|---|
| 501 | The exchange rate you are using is really bad.This can't be the official interbank exchange rate. | 17 |
| 412 | The exchange rate you are using is bad.This can't be the official interbank exchange rate. | 17 |
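
One possible follow-up is to keep a single representative of each near-duplicate pair. The sketch below uses a simple greedy policy of my own choosing: walk the rows in order and drop any later row that a kept row lists as a near duplicate. The frames are made up stand-ins for `data` and `lab.get_issues("near_duplicate")`.

```python
import pandas as pd

# Made-up stand-ins for `data` and `lab.get_issues("near_duplicate")`.
data_demo = pd.DataFrame({"text": ["t0", "t1", "t1b", "t3", "t1c"]})
dupes_demo = pd.DataFrame(
    {
        "is_near_duplicate_issue": [False, True, True, False, True],
        "near_duplicate_sets": [[], [2], [1, 4], [], [2]],
    }
)

to_drop = set()
for idx, neighbors in dupes_demo["near_duplicate_sets"].items():
    if idx in to_drop:
        continue  # already removed as a duplicate of an earlier row
    # Keep the current row and drop every later row it is paired with.
    to_drop.update(n for n in neighbors if n > idx)

deduplicated = data_demo.drop(index=sorted(to_drop))
print(deduplicated)
```

With this policy a chain like 412, 501, 517 keeps 412 and 517 and drops 501, because 517 is only listed as a near duplicate of 501 and not of 412. Dropping the whole chain down to one row would be an equally reasonable choice.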
