# 🧐 Find label errors with cleanlab

In this tutorial, we will show you how you can find possible labeling errors in your data set with the help of [*cleanlab*](https://github.com/cgnorthcutt/cleanlab) and *Rubrix*.

## Introduction

As shown recently by [Curtis G. Northcutt et al.](https://arxiv.org/abs/2103.14749), label errors are pervasive even in the most-cited test sets used to benchmark progress in machine learning.
In the worst case, these label errors can destabilize benchmarks and tend to favor more complex, higher-capacity models over lower-capacity ones.

They introduce **confident learning**, a principled framework to “identify label errors, characterize label noise, and learn with noisy labels”. It is open-sourced as the [cleanlab Python package](https://github.com/cgnorthcutt/cleanlab), which supports finding, quantifying, and learning with label errors in data sets.
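
To give a feeling for the idea behind confident learning, here is a minimal sketch in plain Python. It is a simplified illustration and not cleanlab's actual implementation: the function name `find_label_error_candidates` and the toy data are made up. For each class, it uses the mean predicted probability over the examples carrying that label as a threshold, and flags an example when a class other than its given label is confidently predicted.

```python
def find_label_error_candidates(labels, probabilities):
    """Flag examples whose given label disagrees with a confidently predicted class.

    Simplified illustration of per-class thresholding; not cleanlab's implementation.
    """
    num_classes = len(probabilities[0])

    # Per-class threshold: mean predicted probability of class k over the
    # examples that carry the (possibly noisy) label k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for p, y in zip(probabilities, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    candidates = []
    for i, (p, y) in enumerate(zip(probabilities, labels)):
        # Classes whose probability reaches their own threshold
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        # Among the confident classes, pick the most probable one
        best = max(confident, key=lambda k: p[k])
        if best != y:
            candidates.append(i)
    return thresholds, candidates


# Toy data (made up): two classes, six examples
labels = [0, 0, 0, 1, 1, 1]
probabilities = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],  # labeled 0, but the model clearly prefers class 1
    [0.1, 0.9],
    [0.2, 0.8],
    [0.4, 0.6],
]
thresholds, candidates = find_label_error_candidates(labels, probabilities)
print(candidates)
```

Only the third example is flagged here: its given label is 0, but the probability for class 1 exceeds the threshold for class 1. The last example is left alone because no class reaches its threshold.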

This tutorial walks you through 5 basic steps to find and correct label errors in your data set:

1. 💾 Load the data set you want to check, and a model trained on it;
2. 💻 Make predictions for the test split of your data set;
3. 🧐 Get label error candidates with *cleanlab*;
4. 🔦 Uncover label errors with *Rubrix*;
5. 🖍 Correct label errors and load the corrected data set.

## Setup Rubrix

If you are new to Rubrix, visit and star Rubrix for updates: ⭐ [GitHub repository](https://github.com/recognai/rubrix)

If you have not installed and launched Rubrix, check the [Setup and Installation guide](../getting_started/setup&installation.rst).

Once installed, you only need to import Rubrix:

In [1]:
import rubrix as rb

### Install tutorial dependencies

Apart from [cleanlab](https://github.com/cgnorthcutt/cleanlab), we will also install the Hugging Face libraries [transformers](https://github.com/huggingface/transformers) and [datasets](https://github.com/huggingface/datasets), as well as [PyTorch](https://pytorch.org/), which provide us with the model and the data set we are going to investigate.

In [2]:
%pip install cleanlab torch transformers datasets -qqq

Note: you may need to restart the kernel to use updated packages.


### Imports

Let us import everything we need up front.

In [7]:
import rubrix as rb
from cleanlab.pruning import get_noise_indices

import torch
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification

## 1. Load model and data set

First, we load the tokenizer and a pretrained model from the Hugging Face Hub.

In [8]:
tokenizer = AutoTokenizer.from_pretrained("andi611/distilbert-base-uncased-ner-agnews")
model = AutoModelForSequenceClassification.from_pretrained("andi611/distilbert-base-uncased-ner-agnews")

We then get the test split of the AG News data set, which we will scan for label errors.

In [9]:
dataset = datasets.load_dataset("ag_news", split="test")

Using custom data configuration default
Reusing dataset ag_news (/Users/dani/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)


In [10]:
dataset.to_pandas().head()

| | text | label |
|---|---|---|
| 0 | Fears for T N pension after talks Unions repre... | 2 |
| 1 | The Race is On: Second Private Team Sets Launc... | 3 |
| 2 | Ky. Company Wins Grant to Study Peptides (AP) ... | 3 |
| 3 | Prediction Unit Helps Forecast Wildfires (AP) ... | 3 |
| 4 | Calif. Aims to Limit Farm-Related Smog (AP) AP... | 3 |


## 2. Make predictions

Now let us use the model to get predictions for our data set, and add those to our dataset instance. We will use the `.map` functionality of the *datasets* library to process our data batch-wise.

In [None]:
def get_model_predictions(batch):
    # batch is a dictionary of lists
    # truncation guards against texts longer than the model's maximum input length
    tokenized_input = tokenizer(
        batch["text"], padding=True, truncation=True, return_tensors="pt"
    )
    # get logits of the model prediction; gradients are not needed for inference
    with torch.no_grad():
        logits = model(**tokenized_input).logits
    # convert logits to probabilities
    probabilities = torch.softmax(logits, dim=1).detach().numpy()

    return {"probabilities": probabilities}
    
# Apply predictions batch-wise
dataset = dataset.map(
    get_model_predictions,
    batched=True,
    batch_size=16,
)

  0%|          | 0/475 [00:00<?, ?ba/s]
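
The step that turns logits into probabilities is a softmax over the class dimension. As a minimal sketch of what `torch.softmax` computes for a single example, here is a plain-Python version; the logits are made up for illustration and do not come from the model above.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits, one value per class
logits = [2.0, 0.5, -1.0, 0.0]
probabilities = softmax(logits)
print([round(p, 3) for p in probabilities])
```

The largest logit receives the largest probability, and each row of the resulting probability matrix sums to 1.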

## 3. Get label error candidates

To identify label error candidates, the *cleanlab* framework simply needs the probability matrix of our predictions (`n x m`, where `n` is the number of examples and `m` the number of labels) and the potentially noisy labels.
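
Before handing these two inputs to a label-error detector, it can help to sanity-check them. The following is a minimal sketch in plain Python. The helper `check_cleanlab_inputs` is hypothetical (it is not part of cleanlab) and the data is made up. It verifies that there is one probability row per label, that every row has `m` entries summing to 1, and that every label is a valid class index.

```python
def check_cleanlab_inputs(labels, probabilities, tolerance=1e-6):
    """Validate noisy labels and predicted probabilities.

    Hypothetical helper for illustration; returns (n, m) on success.
    """
    n = len(labels)
    if n == 0 or len(probabilities) != n:
        raise ValueError("need one probability row per label")
    m = len(probabilities[0])
    for row in probabilities:
        if len(row) != m:
            raise ValueError("all rows must have the same number of classes")
        if abs(sum(row) - 1.0) > tolerance:
            raise ValueError("each row must sum to 1")
    for label in labels:
        if not 0 <= label < m:
            raise ValueError("labels must be integers in [0, m)")
    return n, m


# Made-up example: three examples, four classes
labels = [2, 3, 0]
probabilities = [
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.10, 0.20, 0.60],
    [0.70, 0.10, 0.10, 0.10],
]
shape = check_cleanlab_inputs(labels, probabilities)
print(shape)
```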

In [3]:
# Output the data as numpy arrays
dataset.set_format("numpy")

# Get a boolean array of label error candidates
label_error_candidates = get_noise_indices(
    s=dataset["label"],
    psx=dataset["probabilities"],
)

In [4]:
frac = label_error_candidates.sum()/len(dataset)
print(
    f"Total: {len(dataset)}\n"
    f"Candidates: {label_error_candidates.sum()} ({100*frac:0.1f}%)"
)

Total: 7600
Candidates: 163 (2.1%)
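
When reviewing candidates by hand, one possible way to prioritize them is to look first at the examples where the model assigns the lowest probability to the given label. The sketch below illustrates this ordering in plain Python. The function name `rank_by_self_confidence` and the toy data are assumptions for illustration, not part of cleanlab or of the results above.

```python
def rank_by_self_confidence(labels, probabilities, is_candidate):
    """Return candidate indices, least confident in the given label first."""
    candidate_indices = [i for i, flagged in enumerate(is_candidate) if flagged]
    # Self-confidence: predicted probability of the (possibly noisy) given label
    return sorted(candidate_indices, key=lambda i: probabilities[i][labels[i]])


# Made-up example: four examples, four classes
labels = [0, 1, 2, 3]
probabilities = [
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.90, 0.03, 0.02],
    [0.60, 0.05, 0.30, 0.05],
    [0.02, 0.03, 0.94, 0.01],
]
is_candidate = [True, False, True, True]  # boolean mask of candidates
ranked = rank_by_self_confidence(labels, probabilities, is_candidate)
print(ranked)
```

In this toy example, the last example comes first because the model gives its label a probability of only 0.01.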


## 4. Uncover label errors in Rubrix

Now that we have a list of potential candidates, let us log them to *Rubrix* to uncover and correct the label errors.
First we switch to a pandas DataFrame to filter out our candidates.

In [5]:
candidates = dataset.to_pandas()[label_error_candidates]

Then we will turn those candidates into [TextClassificationRecords](../reference/python/python_client.rst#rubrix.client.models.TextClassificationRecord) that we will log to *Rubrix*.

In [21]:
def make_record(row):
    prediction = list(zip(dataset.features['label'].names, row.probabilities))
    annotation = dataset.features['label'].names[row.label]
        
    return rb.TextClassificationRecord(
        inputs=row["text"],
        prediction=prediction, 
        annotation=annotation, 
        annotation_agent="original_benchmark",
        status="Default"
    )
        
records = candidates.apply(make_record, axis=1)
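
To make the structure of `prediction` and `annotation` concrete, here is a small sketch that builds both for a single row without touching Rubrix. The label names are those of the AG News data set on the Hugging Face Hub. The row itself, including its text and probabilities, is made up for illustration.

```python
# Label names of the AG News data set, in index order
label_names = ["World", "Sports", "Business", "Sci/Tech"]

# Made-up row, mimicking one line of the candidates DataFrame
row = {
    "text": "Example headline (made up)",
    "label": 2,
    "probabilities": [0.05, 0.02, 0.03, 0.90],
}

# Prediction: one (label name, probability) pair per class
prediction = list(zip(label_names, row["probabilities"]))
# Annotation: the name of the given, possibly noisy, label
annotation = label_names[row["label"]]
# The label the model considers most likely
top_label, top_probability = max(prediction, key=lambda pair: pair[1])

print(annotation, top_label, top_probability)
```

Here the given label is `Business`, while the model favors `Sci/Tech`, which is the kind of disagreement a label error candidate shows.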

With our records at hand, we can now log them to *Rubrix* and save them in a dataset that we call `"agnews_label_errors"`.

In [22]:
rb.log(records, name="agnews_label_errors")

  0%|          | 0/163 [00:00<?, ?it/s]

163 records logged to http://localhost:6900/agnews_label_errors


BulkResponse(dataset='agnews_label_errors', processed=163, failed=0)

Scanning through the records in the [*Explore Mode*](../reference/rubrix_webapp_reference.rst#explore-mode) of *Rubrix*, we were able to find at least **30 clear cases** of label errors. 
A couple of examples are shown below, with the noisy label in the upper right corner of each example.
The model's predictions, together with their probabilities, are shown below each text.