# Finding Label Errors in Token Classification Datasets 

This tutorial shows how you can use cleanlab to find potential label errors in text datasets used for the NLP task of token classification. In token classification, our data consists of sentences (aka documents) in which every token (aka word) is labeled with one of K classes, and we train models to predict the class of each token in a new sentence. Example applications include part-of-speech tagging and entity recognition, the latter being the focus of this tutorial. Here, we use the CoNLL-2003 named entity recognition dataset, which contains 20,718 sentences and 301,361 tokens, where each token is labeled with one of the following classes:

- LOC (location entity)
- PER (person entity)
- ORG (organization entity)
- MISC (miscellaneous other type of entity)
- O (other type of word that does not correspond to an entity)
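To make the data format concrete, here is a minimal sketch of a single labeled sentence. The sentence, its labels, and the ordering of `class_names` are made up for illustration and are not taken from CoNLL-2003.

```python
# Hypothetical class ordering, chosen only for this illustration.
class_names = ["O", "MISC", "PER", "ORG", "LOC"]

# One toy sentence, split into word-level tokens (not from CoNLL-2003).
tokens = ["Alice", "works", "at", "Acme", "in", "Paris", "."]
# One integer class label per token, indexing into class_names.
token_labels = [2, 0, 0, 3, 0, 4, 0]

# Print each token next to the name of its class.
for token, label in zip(tokens, token_labels):
    print(f"{token:>6} -> {class_names[label]}")
```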

**Overview of what we'll do in this tutorial:** 
- Identify potential token label issues using cleanlab's `token_classification.filter.find_label_issues` method. 
- Rank sentences using cleanlab's `token_classification.rank.get_label_quality_scores` method. 
- TODO: (Clean Learning) Train a more robust model by removing problematic sentences. 

## 1. Install required dependencies and download data

You can use `pip` to install all packages required for this tutorial as follows: 

    !pip install cleanlab 

In [None]:
!wget -nc https://data.deepai.org/conll2003.zip && mkdir -p data
!unzip conll2003.zip -d data/ && rm conll2003.zip 
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/TokenClassification/pred_probs.npz' 

In [None]:
# Package installation (hidden on docs website).
# Package versions used: cleanlab==2.0.0 numpy==1.16.6 
# ericwang/cleanlab -b token_classification for now 

dependencies = ["cleanlab"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install git+https://github.com/cleanlab/cleanlab.git@0b4e9089d509d3d17ec026e59c58776a85453b2c
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

In [None]:
import numpy as np
from cleanlab.token_classification.filter import find_label_issues 
from cleanlab.token_classification.rank import get_label_quality_scores, issues_from_scores 
from cleanlab.internal.token_classification_utils import get_sentence, filter_sentence, mapping 
from cleanlab.token_classification.summary import display_issues, common_label_issues, filter_by_token 

np.set_printoptions(suppress=True)

## 2. Get `pred_probs` and `labels`, and read the data files

`pred_probs` are out-of-sample model-predicted probabilities for the CoNLL-2003 dataset (including the training, development, and test sets), obtained via cross-validation. To detect potential label issues, we first get `pred_probs` and `labels`, which are both in nested-list format, such that:

- `pred_probs` is a list of `np.arrays`, such that `pred_probs[i]` is the model-predicted probabilities for the tokens in the i'th sentence, and has shape `(N_i, K)`, where `N_i` is the number of word-level tokens of the `i`'th sentence. Each row of the matrix corresponds to a token `t` and contains the model-predicted probabilities that `t` belongs to each possible class, for each of the K classes. The columns must be ordered such that the probabilities correspond to class 0, 1, ..., K-1. 
        
- `labels` is a list of lists, such that `labels[i]` is a list of given token labels of the `i`'th sentence. For a dataset with K classes, labels must be in 0, 1, ..., K-1. All the classes (0, 1, ..., and K-1) MUST be present in `labels[i]` for some `i`.
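The format requirements above can be checked with a few lines of NumPy. The sketch below uses a hypothetical helper, `check_token_inputs`, and made-up toy inputs. It is not part of cleanlab, and the check that each row sums to 1 is an assumption about what valid probabilities look like.

```python
import numpy as np

def check_token_inputs(labels, pred_probs):
    """Hypothetical helper: sanity-check the nested-list format described above."""
    if len(labels) != len(pred_probs):
        raise ValueError("labels and pred_probs need one entry per sentence")
    num_classes = pred_probs[0].shape[1]
    for sent_labels, sent_probs in zip(labels, pred_probs):
        # Each sentence needs an (N_i, K) matrix, one row per token.
        if sent_probs.shape != (len(sent_labels), num_classes):
            raise ValueError("pred_probs[i] must have shape (N_i, K)")
        # Assumption: each row is a probability distribution over the K classes.
        if not np.allclose(sent_probs.sum(axis=1), 1.0):
            raise ValueError("each row of pred_probs[i] should sum to 1")
    # Every class 0, ..., K-1 must appear somewhere in the labels.
    seen = {label for sent_labels in labels for label in sent_labels}
    if seen != set(range(num_classes)):
        raise ValueError("labels must cover exactly the classes 0, ..., K-1")
    return num_classes

# Toy inputs: two sentences with 2 and 1 tokens, and K = 2 classes.
toy_pred_probs = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),
    np.array([[0.6, 0.4]]),
]
toy_labels = [[0, 1], [0]]
num_classes = check_token_inputs(toy_labels, toy_pred_probs)
```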

Here, unless otherwise specified, an index is a tuple `(i, j)` that refers to the `j`'th word-level token of the `i`'th sentence. Because each sentence has a different number of tokens, we store `pred_probs` and `labels` as `.npz` files, which can be easily converted to dictionaries. Use `read_npz` to retrieve `pred_probs` and `labels` in nested-list format.

<details><summary>Below is the code used to read the `.npz` file. </summary> 
    
    def read_npz(filepath): 
        data = dict(np.load(filepath)) 
        data = [data[str(i)] for i in range(len(data))] 
        return data 

</details> 

In [None]:
def read_npz(filepath): 
    data = dict(np.load(filepath)) 
    data = [data[str(i)] for i in range(len(data))] 
    return data 
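As a rough illustration of why `.npz` suits ragged data, the sketch below writes a list of differently sized arrays under the keys `"0"`, `"1"`, ... and reads them back with the same logic as `read_npz`. The `write_npz` helper and the toy arrays are hypothetical. This is one way such a file could be produced, not necessarily how `pred_probs.npz` was created.

```python
import os
import tempfile

import numpy as np

def write_npz(filepath, arrays):
    """Hypothetical counterpart to read_npz: store array i under the key str(i)."""
    np.savez(filepath, **{str(i): arr for i, arr in enumerate(arrays)})

def read_npz(filepath):
    """Load the arrays back in index order (same logic as in the tutorial)."""
    data = dict(np.load(filepath))
    return [data[str(i)] for i in range(len(data))]

# Two toy "sentences" with 3 and 1 tokens, and K = 5 classes.
toy = [np.full((3, 5), 0.2), np.full((1, 5), 0.2)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pred_probs.npz")
    write_npz(path, toy)
    restored = read_npz(path)
```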

In [None]:
pred_probs = read_npz('pred_probs.npz') 

`pred_probs` is a list of `np.array`s. In our example, the first sentence has 9 tokens and there are 5 classes, so `pred_probs[0]` has shape `(9, 5)`.

To visualize the results, we first read the original data files, which lets us display the word-level token label issues in context. `given_words` contains the word-level tokens in the dataset, such that `given_words[i]` is a list of the words of the `i`'th sentence; `given_labels` contains the given labels in the dataset, such that `given_labels[i]` is a list of the labels of the `i`'th sentence. Note that in CoNLL-2003, the `B-` and `I-` prefixes indicate whether a token is at the beginning of or inside an entity; this distinction is ignored in this tutorial. Therefore, we have two sets of entities:

    given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'] 
     
and 

    merged_entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 

<details><summary>Below is the code used for reading the files.</summary>

    filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] 
    given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
    merged_entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 
    entity_map = {entity: i for i, entity in enumerate(given_entities)} 

    def readfile(filepath, sep=' '): 
        data, sentence, label = [], [], []
        with open(filepath) as lines:  # close the file once reading is done
            for line in lines:
                # Blank lines and -DOCSTART- markers separate sentences
                if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                    if len(sentence) > 0:
                        data.append((sentence, label))
                        sentence, label = [], []
                    continue
                splits = line.split(sep) 
                word = splits[0]
                # Convert all-caps words to capitalized form
                if len(word) > 0 and word[0].isalpha() and word.isupper():
                    word = word[0] + word[1:].lower()
                sentence.append(word)
                # The last column is the entity tag; [:-1] drops the trailing newline
                label.append(entity_map[splits[-1][:-1]])

        # Keep the final sentence if the file does not end with a blank line
        if len(sentence) > 0:
            data.append((sentence, label))

        given_words = [d[0] for d in data] 
        given_labels = [d[1] for d in data] 

        return given_words, given_labels 

    given_words, given_labels = [], [] 

    for filepath in filepaths: 
        words, label = readfile(filepath) 
        given_words.extend(words) 
        given_labels.extend(label)

    sentences = list(map(get_sentence, given_words)) 

    sentences, mask = filter_sentence(sentences) 
    given_words = [words for m, words in zip(mask, given_words) if m] 
    given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

    maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
    labels = [mapping(labels, maps) for labels in given_labels] 
    
</details>

In [None]:
filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] 
given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'] 
merged_entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 
entity_map = {entity: i for i, entity in enumerate(given_entities)} 

In [None]:
def readfile(filepath, sep=' '): 
    data, sentence, label = [], [], []
    with open(filepath) as lines:  # close the file once reading is done
        for line in lines:
            # Blank lines and -DOCSTART- markers separate sentences
            if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                if len(sentence) > 0:
                    data.append((sentence, label))
                    sentence, label = [], []
                continue
            splits = line.split(sep) 
            word = splits[0]
            # Convert all-caps words to capitalized form
            if len(word) > 0 and word[0].isalpha() and word.isupper():
                word = word[0] + word[1:].lower()
            sentence.append(word)
            # The last column is the entity tag; [:-1] drops the trailing newline
            label.append(entity_map[splits[-1][:-1]])

    # Keep the final sentence if the file does not end with a blank line
    if len(sentence) > 0:
        data.append((sentence, label))
        
    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 
    
    return given_words, given_labels 

In [None]:
given_words, given_labels = [], [] 

for filepath in filepaths: 
    words, label = readfile(filepath) 
    given_words.extend(words) 
    given_labels.extend(label)
    
sentences = list(map(get_sentence, given_words)) 

sentences, mask = filter_sentence(sentences) 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
labels = [mapping(labels, maps) for labels in given_labels] 
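The `maps` list collapses the nine prefixed entity tags into the five merged classes: position `k` of `maps` holds the merged class index for `given_entities[k]`. The sketch below illustrates this with a plain list lookup. `merge_labels` is a hypothetical stand-in written for this example, on the assumption that cleanlab's `mapping` utility performs an equivalent element-wise lookup.

```python
given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
merged_entities = ['O', 'MISC', 'PER', 'ORG', 'LOC']
maps = [0, 1, 1, 2, 2, 3, 3, 4, 4]

def merge_labels(entity_labels, maps):
    """Hypothetical stand-in: look up the merged class of each given label."""
    return [maps[label] for label in entity_labels]

# Each given entity maps to the merged entity with its B-/I- prefix removed.
for old_index, entity in enumerate(given_entities):
    assert merged_entities[maps[old_index]] == entity.split('-')[-1]

example = [3, 4, 0, 7]                # B-PER, I-PER, O, B-LOC
merged = merge_labels(example, maps)  # PER, PER, O, LOC
```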

cleanlab expects the inputs to be in the following format: 

In [None]:
indices_to_preview = 3
for i in range(indices_to_preview):
    print('\nsentences[%d]:\t' % i + str(sentences[i])) 
    print('labels[%d]:\t' % i + str(labels[i])) 
    print('pred_probs[%d]:\n' % i + str(pred_probs[i])) 

## 3. Use cleanlab to find potential label issues 

Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. Here we request that the indices of the identified label issues be sorted by cleanlab's self-confidence score, which measures the quality of each given label via the probability assigned to it in our model's prediction. The returned `issues` is a list of tuples `(i, j)`, each referring to the `j`'th token of the `i`'th sentence. These are the tokens cleanlab thinks may be badly labeled in your dataset. 

In [None]:
issues = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence') 
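To illustrate the self-confidence score described above, the sketch below computes it by hand for a single toy sentence: for each token, it reads off the predicted probability of the token's given label. The labels and probabilities are made up, and the snippet only illustrates the score itself, not how `find_label_issues` decides which tokens to flag.

```python
import numpy as np

# Toy values for one 3-token sentence and K = 3 classes (made up for illustration).
toy_labels = np.array([0, 2, 1])
toy_pred_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
])

# Self-confidence: the predicted probability of each token's given label.
self_confidence = toy_pred_probs[np.arange(len(toy_labels)), toy_labels]

# Sort token positions so the least plausible given labels come first.
ranked_tokens = np.argsort(self_confidence)
```

In this toy sentence, the second token has the lowest self-confidence because the model puts most of its probability on a different class than the given label.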

## 4. Most likely issues 
Let's look at the top 20 examples cleanlab thinks are most likely to be incorrectly labeled. 

In [None]:
top = 20 
print('Cleanlab found %d potential label issues. ' % len(issues)) 
print('The top %d most likely label errors:' % top) 
print(issues[:top]) 

Below we display the top 20 potential label issues. Because `O` and `MISC` are hard to distinguish and sometimes ambiguous, we exclude confusions between them from the examples below. Such pairs are specified via the `exclude` argument, a list of tuples `(i, j)` such that tokens predicted as `merged_entities[j]` but labeled as `merged_entities[i]` are ignored. Here we ignore mislabels between `O` and `MISC`, whose indices are `0` and `1`. 

In [None]:
display_issues(issues, given_words, pred_probs=pred_probs, given_labels=labels, 
               exclude=[(0, 1), (1, 0)], class_names=merged_entities) 
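The effect of `exclude` can be sketched as a filter over `(given, predicted)` class pairs. The helper `drop_excluded` below is hypothetical and takes the predicted class to be the argmax of the predicted probabilities. It illustrates the idea and is not cleanlab's actual implementation.

```python
import numpy as np

def drop_excluded(issues, labels, pred_probs, exclude):
    """Hypothetical helper: keep issues whose (given, predicted) pair is not excluded."""
    kept = []
    for i, j in issues:
        given = labels[i][j]
        # Assumption: the predicted class is the most probable one.
        predicted = int(np.argmax(pred_probs[i][j]))
        if (given, predicted) not in exclude:
            kept.append((i, j))
    return kept

# Toy data: two sentences, K = 3 classes (made up for illustration).
toy_labels = [[0, 2], [1]]
toy_pred_probs = [
    np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]),
    np.array([[0.9, 0.05, 0.05]]),
]
toy_issues = [(0, 0), (0, 1), (1, 0)]

# Ignore confusions between classes 0 and 1 in both directions.
kept = drop_excluded(toy_issues, toy_labels, toy_pred_probs, exclude=[(0, 1), (1, 0)])
```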

More than half of the potential label issues correspond to tokens that are incorrectly labeled. As shown above, some examples are ambiguous and require manual checking. Observe that there are some edge cases, such as tokens that are simply punctuation like `/` and `(`. 

## 5. Most common word-level token mislabels

It may be useful to examine the most common word-level token mislabels.

In [None]:
info = common_label_issues(issues, given_words, 
                           labels=labels, 
                           pred_probs=pred_probs, 
                           class_names=merged_entities, 
                           exclude=[(0, 1), (1, 0)]) 

The information printed above is also stored in the DataFrame `info`, sorted by the number of mislabels in descending order.

## 6. Find sentences with issues for a particular word

Call `filter_by_token` to examine the token label issues of a specific token.

In [None]:
token_issues = filter_by_token('United', issues, given_words) 
display_issues(token_issues, given_words, pred_probs=pred_probs, given_labels=labels, 
               exclude=[(0, 1), (1, 0)], class_names=merged_entities) 

## 7. Sentence label quality score

Cleanlab can analyze every label in the dataset and provide a numerical score for each sentence. The score ranges between 0 and 1: a lower score indicates that the sentence is more likely to contain at least one error. 

In [None]:
scores, token_scores = get_label_quality_scores(labels, pred_probs) 
issues = issues_from_scores(scores, token_scores=token_scores) 
display_issues(issues, given_words, pred_probs=pred_probs, given_labels=labels, 
               exclude=[(0, 1), (1, 0)], class_names=merged_entities) 
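One simple way to turn token-level scores into a sentence-level score is to take the minimum over the sentence's tokens, so that a single suspicious token lowers the score of the whole sentence. The sketch below does this with self-confidence as the token score. The `toy_sentence_scores` helper and the data are hypothetical, and min-aggregation is an assumption made for this illustration rather than a statement of how `get_label_quality_scores` is configured.

```python
import numpy as np

def toy_sentence_scores(labels, pred_probs):
    """Hypothetical helper: sentence score = minimum self-confidence over its tokens."""
    token_scores = [
        probs[np.arange(len(sent_labels)), sent_labels]
        for sent_labels, probs in zip(labels, pred_probs)
    ]
    sentence_scores = np.array([scores.min() for scores in token_scores])
    return sentence_scores, token_scores

# Toy data: two 2-token sentences, K = 2 classes (made up for illustration).
toy_labels = [[0, 1], [1, 1]]
toy_pred_probs = [
    np.array([[0.9, 0.1], [0.6, 0.4]]),
    np.array([[0.2, 0.8], [0.3, 0.7]]),
]
sentence_scores, token_scores = toy_sentence_scores(toy_labels, toy_pred_probs)

# Sentences with the lowest scores are the most likely to contain an error.
worst_first = np.argsort(sentence_scores)
```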

In [None]:
highlighted_indices = [(2907, 0), (19392, 0), (9962, 4), (8904, 30), (19303, 0), 
                       (12918, 0), (9256, 0), (11855, 20), (18392, 4), (20426, 28), 
                       (19402, 21), (14744, 15), (19371, 0), (4645, 2), (83, 9), 
                       (10331, 3), (9430, 10), (6143, 25), (18367, 0), (12914, 3)] 

if not all(x in issues for x in highlighted_indices):
    raise Exception("Some highlighted examples are missing from ranked_label_issues.")