In [1]:
# %%capture 
# !pip uninstall cleanlab -y 
# !pip install git+https://github.com/ericwang1997/cleanlab.git@token_classification 
# !pip install termcolor 

In [2]:
# package versions: 
# cleanlab==2.0.0 (ericwang1997/cleanlab -b token_classification)
# termcolor==1.1.0 

# Token Classification Label Error Detection - Part 2 

In this tutorial, we show how you can use cleanlab to find potential label errors in an NLP token classification dataset. Here, we use CoNLL-2003, which contains 20,718 sentences and 301,361 tokens. In a standard NLP token classification task, the model predicts the entity class of each token within each sentence. In this example, the named entity classes are miscellaneous (`MISC`), location (`LOC`), person (`PER`), and organization (`ORG`). Tokens that are not part of a named entity are labeled `O`. 

**Overview of what we'll do in this tutorial:** 
- Identify potential token label issues using cleanlab's `token_classification.filter.find_label_issues` method. 
- Rank sentences using cleanlab's `token_classification.rank.get_label_quality_scores` method. 
- TODO: (Clean Learning) Train a more robust model by removing problematic sentences. 

## 1. Install and import the required dependencies 

In [3]:
import numpy as np
from termcolor import colored 
from cleanlab.token_classification.filter import find_label_issues 
from cleanlab.token_classification.rank import get_label_quality_scores
from utils import * 
# change to this after token_classification.rank is pushed to the official package 
# from cleanlab.token_classification.rank import get_label_quality_scores 

## 2. Get `pred_probs` and `labels` 

`pred_probs` are out-of-sample model-predicted probabilities for the full CoNLL-2003 dataset (training, development, and test sets), obtained via cross-validation. To detect potential label issues, we first get `pred_probs` and `labels`, which are both in nested-list format, such that: 

- `pred_probs` is a list of `np.array`s, such that `pred_probs[i]` holds the model-predicted probabilities for the tokens in the `i`'th sentence and has shape `(N_i, K)`, where `N_i` is the number of word-level tokens in the `i`'th sentence. Each row of the matrix corresponds to a token `t` and contains the model-predicted probabilities that `t` belongs to each of the `K` classes. The columns must be ordered such that the probabilities correspond to class 0, 1, ..., K-1. 
        
- `labels` is a list of lists, such that `labels[i]` is a list of token labels of the `i`'th sentence. Same format requirements as `cleanlab.rank.get_label_quality_scores`. 
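As a minimal sketch of this format (toy values, not CoNLL-2003 data), the snippet below builds `pred_probs` and `labels` for two sentences with `K = 5` classes and checks the shape requirements listed above. The helper name `check_format` is hypothetical and is not part of cleanlab.

```python
import numpy as np

K = 5  # number of classes, e.g. O, MISC, PER, ORG, LOC

# Toy data: sentence 0 has 3 tokens, sentence 1 has 2 tokens.
rng = np.random.default_rng(0)
pred_probs = [rng.dirichlet(np.ones(K), size=n) for n in (3, 2)]
labels = [[0, 2, 2], [4, 0]]


def check_format(labels, pred_probs, num_classes):
    """Hypothetical helper: verify the nested-list format described above."""
    assert len(labels) == len(pred_probs), "one entry per sentence"
    for sent_labels, sent_probs in zip(labels, pred_probs):
        # One row of probabilities per token, one column per class.
        assert sent_probs.shape == (len(sent_labels), num_classes)
        # Each row is a probability distribution over the classes.
        assert np.allclose(sent_probs.sum(axis=1), 1.0)
        # Labels are integer class indices in 0..K-1.
        assert all(0 <= y < num_classes for y in sent_labels)
    return True


print(check_format(labels, pred_probs, K))
```

Running a check like this before calling cleanlab can catch mismatched sentence lengths early.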

Here, indices are tuples `(i, j)` unless otherwise specified, where `(i, j)` refers to the `j`'th word-level token of the `i`'th sentence. Because each sentence has a different number of tokens, we store `pred_probs` and `labels` as `.npz` files, which can be easily converted to dictionaries. Use `read_npz` to retrieve `pred_probs` and `labels` in nested-list format. 
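The `read_npz` helper comes from the tutorial's `utils` module, whose code is not shown here. The sketch below illustrates one way variable-length per-sentence arrays could be saved and restored with `np.savez`, assuming one array per sentence stored under NumPy's default `arr_<index>` keys. The names `save_npz_sketch` and `read_npz_sketch` are hypothetical, and the real helper may use a different layout.

```python
import os
import tempfile

import numpy as np


def save_npz_sketch(path, nested):
    """Store one array per sentence; NumPy names them arr_0, arr_1, ..."""
    np.savez(path, *[np.asarray(arr) for arr in nested])


def read_npz_sketch(path):
    """Load the arrays back into a list ordered by sentence index."""
    with np.load(path) as data:
        return [data[f"arr_{i}"] for i in range(len(data.files))]


# Toy round trip with two sentences of different lengths.
labels = [[0, 2, 2], [4, 0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels.npz")
    save_npz_sketch(path, labels)
    restored = read_npz_sketch(path)

print([arr.tolist() for arr in restored])
```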

In [4]:
labels = read_npz('labels.npz') 
pred_probs = read_npz('pred_probs.npz') 

## 3. Use cleanlab to find potential label issues 

Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the probability assigned to it in our model’s prediction.
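To make the self-confidence score concrete, the sketch below computes it for every token on toy data as the predicted probability of the given label, then ranks tokens from least to most confident. It covers the ranking score only and does not reproduce how `find_label_issues` decides which tokens to flag. The function name `self_confidence` is an illustrative assumption.

```python
import numpy as np

# Toy inputs: 2 sentences, K = 3 classes.
labels = [[0, 1], [2, 0, 1]]
pred_probs = [
    np.array([[0.9, 0.05, 0.05], [0.7, 0.2, 0.1]]),
    np.array([[0.1, 0.1, 0.8], [0.3, 0.6, 0.1], [0.2, 0.5, 0.3]]),
]


def self_confidence(labels, pred_probs):
    """Map each (sentence, token) index to the probability of its given label."""
    scores = {}
    for i, (sent_labels, sent_probs) in enumerate(zip(labels, pred_probs)):
        for j, y in enumerate(sent_labels):
            scores[(i, j)] = float(sent_probs[j, y])
    return scores


scores = self_confidence(labels, pred_probs)
# Lowest self-confidence first: these tokens are the most suspicious.
ranked = sorted(scores, key=scores.get)
print(ranked[:2])
```

In this toy data, token `(0, 1)` is labeled class 1 but the model gives that class only 0.2 probability, so it ranks first.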

In [5]:
issues = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence') 
top = 20 
print('Cleanlab found %d potential label issues. ' % len(issues)) 
print('The top %d most likely label errors:' % top) 
print(str(issues[:top])) 

Cleanlab found 2255 potential label issues. 
The top 20 most likely label errors:
[(2907, 0), (19392, 0), (9962, 4), (8904, 30), (19303, 0), (12918, 0), (9256, 0), (11855, 20), (18392, 4), (20426, 28), (19402, 21), (14744, 15), (19371, 0), (4645, 2), (83, 9), (10331, 3), (9430, 10), (6143, 25), (18367, 0), (12914, 3)]


Let's look at the top 20 examples cleanlab thinks are most likely to be incorrectly labeled. We obtain the sentences from the original file to display the word-level token label issues in context. 

In [6]:
# %%capture 
# !wget https://data.deepai.org/conll2003.zip && mkdir data 
# !unzip conll2003.zip -d data/ && rm conll2003.zip 

In [7]:
# collapse this 
filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] 
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entity_map = {entity: i for i, entity in enumerate(entities)} 

def readfile(filepath, sep=' '): 
    # read all lines up front so the file handle is closed promptly
    with open(filepath) as f:
        lines = f.readlines()
    
    data, sentence, label = [], [], []
    for line in lines:
        if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
            if len(sentence) > 0:
                data.append((sentence, label))
                sentence, label = [], []
            continue
        splits = line.split(sep) 
        word = splits[0]
        if len(word) > 0 and word[0].isalpha() and word.isupper():
            word = word[0] + word[1:].lower()
        sentence.append(word)
        # strip the trailing newline (also safe if the last line has none)
        label.append(entity_map[splits[-1].strip()])

    if len(sentence) > 0:
        data.append((sentence, label))
        
    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 
    
    return given_words, given_labels 

given_words, given_labels = [], [] 

for filepath in filepaths: 
    words, label = readfile(filepath) 
    given_words.extend(words) 
    given_labels.extend(label)
    
sentences = list(map(get_sentence, given_words)) 

sentences, mask = filter_sentence(sentences) 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
given_labels = [mapping(labels, maps) for labels in given_labels] 
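The `maps` list in the cell above collapses the nine BIO tags in `entities` into five classes by merging each `B-`/`I-` pair. The `mapping` helper lives in `utils` and is not shown, so the sketch below assumes it is a plain index lookup. `mapping_sketch` and `merged_entities` are hypothetical names.

```python
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
merged_entities = ['O', 'MISC', 'PER', 'ORG', 'LOC']
maps = [0, 1, 1, 2, 2, 3, 3, 4, 4]


def mapping_sketch(sentence_labels, maps):
    """Replace each fine-grained tag index with its merged class index."""
    return [maps[label] for label in sentence_labels]


# Tag indices for a toy sentence: B-PER, I-PER, O, B-LOC.
tags = [3, 4, 0, 7]
merged = mapping_sketch(tags, maps)
print([merged_entities[c] for c in merged])
```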

We show the top 20 potential label issues. Given that `O` and `MISC` are hard to distinguish and can sometimes be ambiguous, they are excluded from the examples below. 
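The exclusion of `O`/`MISC` swaps can be sketched as a filter over the flagged indices. This assumes each tuple in `exclude` is a `(given label, predicted label)` pair and that the predicted label is the arg-max of `pred_probs`. `drop_excluded` is a hypothetical name, not a function from `utils` or cleanlab.

```python
import numpy as np

# Toy inputs with classes 0 = O, 1 = MISC, 2 = PER.
labels = [[0, 1, 2], [1, 0]]
pred_probs = [
    np.array([[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]),
    np.array([[0.6, 0.3, 0.1], [0.9, 0.05, 0.05]]),
]
issues = [(0, 0), (0, 2), (1, 0)]  # flagged (sentence, token) indices


def drop_excluded(issues, labels, pred_probs, exclude):
    """Keep issues whose (given, predicted) pair is not in `exclude`."""
    excluded = set(exclude)
    kept = []
    for i, j in issues:
        given = labels[i][j]
        predicted = int(np.argmax(pred_probs[i][j]))
        if (given, predicted) not in excluded:
            kept.append((i, j))
    return kept


kept = drop_excluded(issues, labels, pred_probs, exclude=[(0, 1), (1, 0)])
print(kept)
```

Only the `PER` versus `O` disagreement at `(0, 2)` survives; the two `O`/`MISC` swaps are dropped.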

In [8]:
show_issues(issues, labels, pred_probs, given_words, sentences, exclude=[(0, 1), (1, 0)], top=20) 

1. **Little** change from today's weather expected.
Given label: PER, predicted label: O

2. **Let**'s march together," Scalfaro, a northerner himself, said.
Given label: LOC, predicted label: O

3. 3. Nastja Rysich (**germany**) 3.75
Given label: LOC, predicted label: O

4. The Spla has fought Khartoum's government forces in the south since 1983 for greater autonomy or independence of the mainly Christian and animist region from the Moslem, Arabised **north**.
Given label: LOC, predicted label: O

5. **Mayor** Antonio Gonzalez Garcia, of the opposition Revolutionary Workers' Party, said in Wednesday's letter that army troops recently raided several local farms, stole cattle and raped women.
Given label: PER, predicted label: O

6. **Spring** Chg Hrw 12pct Chg White Chg
Given label: LOC, predicted label: O

7. " We have seen the photos but for the moment the palace has no comment," a spokeswoman for **Prince** Rainier told Reuters.
Given label: PER, p

More than half of the potential label issues are identified correctly. As shown above, some examples are ambiguous and require manual checking. 

## 4. Most common word-level token mislabels 

It may be useful to examine the most common word-level token mislabels. 
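A tally of this kind can be sketched with `collections.Counter`. The code below groups flagged tokens by word and counts each `(given, predicted)` swap on toy data. It is an illustration, not the implementation of the `common_token_issues` helper from `utils`, and `count_token_issues` is a hypothetical name.

```python
from collections import Counter, defaultdict

import numpy as np

entities = ['O', 'MISC', 'PER', 'ORG', 'LOC']
given_words = [['Chicago', 'won'], ['Chicago', 'Press']]
labels = [[3, 0], [3, 0]]
pred_probs = [
    np.array([[0.0, 0.0, 0.0, 0.1, 0.9], [0.9, 0.1, 0.0, 0.0, 0.0]]),
    np.array([[0.0, 0.0, 0.0, 0.2, 0.8], [0.1, 0.0, 0.0, 0.9, 0.0]]),
]
issues = [(0, 0), (1, 0), (1, 1)]  # flagged (sentence, token) indices


def count_token_issues(issues, given_words, labels, pred_probs):
    """Count (given, predicted) swaps per word among the flagged tokens."""
    counts = defaultdict(Counter)
    for i, j in issues:
        swap = (labels[i][j], int(np.argmax(pred_probs[i][j])))
        counts[given_words[i][j]][swap] += 1
    return counts


counts = count_token_issues(issues, given_words, labels, pred_probs)
# Report the words with the most flagged tokens first.
for word, swaps in sorted(counts.items(), key=lambda kv: -sum(kv[1].values())):
    print(f"'{word}' is flagged {sum(swaps.values())} times")
    for (given, predicted), n in swaps.most_common():
        print(f"  labeled as {entities[given]}, but predicted as {entities[predicted]} {n} times")
```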

In [9]:
entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 
common_token_issues(issues, given_words, labels, pred_probs, entities, exclude=[(0, 1), (1, 0)]) 

'/' is mislabeled 42 times
-----------------------------
labeled as O, but predicted as LOC 36 times
labeled as O, but predicted as PER 4 times
labeled as O, but predicted as ORG 2 times

'Chicago' is mislabeled 27 times
-----------------------------
labeled as ORG, but predicted as LOC 22 times
labeled as LOC, but predicted as ORG 3 times
labeled as MISC, but predicted as ORG 2 times

'U.s.' is mislabeled 21 times
-----------------------------
labeled as LOC, but predicted as ORG 8 times
labeled as ORG, but predicted as LOC 6 times
labeled as LOC, but predicted as O 3 times
labeled as LOC, but predicted as MISC 2 times
labeled as MISC, but predicted as LOC 1 times
labeled as MISC, but predicted as ORG 1 times

'Digest' is mislabeled 20 times
-----------------------------
labeled as O, but predicted as ORG 20 times

'Press' is mislabeled 20 times
-----------------------------
labeled as O, but predicted as ORG 20 times

'New' is mislabeled 17 times
-----------------------------
labeled

## 5. Find issue sentences containing a particular word 

Call `search_token` to examine the token label issues of a specific token. 
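Conceptually, such a search is a filter over the flagged indices. The sketch below keeps the flagged `(i, j)` pairs whose word matches the query exactly (case-sensitive). It is an assumption about the general idea, not the code of `search_token` in `utils`, and `search_token_sketch` is a hypothetical name.

```python
given_words = [['Manchester', 'United', 'won'], ['United', 'Nations']]
issues = [(0, 1), (0, 2), (1, 0)]  # flagged (sentence, token) indices


def search_token_sketch(token, issues, given_words):
    """Return the flagged (sentence, token) indices whose word equals `token`."""
    return [(i, j) for i, j in issues if given_words[i][j] == token]


matches = search_token_sketch('United', issues, given_words)
print(matches)
```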

In [10]:
token = 'United' 
_ = search_token(token, issues, labels, pred_probs, given_words, sentences, entities) 

Sentence 471: Soccer - Keane Signs Four-year Contract With Manchester **United**.
Given label: LOC, predicted label: ORG

Sentence 15658: **United** Nations 1996-08-29
Given label: ORG, predicted label: LOC

Sentence 19072: The Humane Society of the **United** States estimates that between 500,000 and one million bites are delivered by dogs each year, more than half of which are suffered by children.
Given label: LOC, predicted label: ORG

Sentence 19104: **United** Nations 1996-12-06
Given label: ORG, predicted label: LOC

Sentence 19879: 1. **United** States Iii (Brian Shimer, Randy Jones) one
Given label: ORG, predicted label: LOC

Sentence 19910: His father Clarence Woolmer represented **United** Province, now renamed Uttar Pradesh, in India's Ranji Trophy national championship and captained the state during 1949.
Given label: LOC, predicted label: ORG



## 6. Sentence label quality score 

Cleanlab can analyze every label in the dataset and provide a numerical score for each sentence. The score ranges between 0 and 1: a lower score indicates that the sentence is more likely to contain at least one error. 
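One simple way to obtain such a sentence score is to take the lowest per-token score in the sentence, so a single suspicious token pulls the whole sentence down. The sketch below does this with self-confidence as the token score. The text above does not state which aggregation `get_label_quality_scores` uses, so treat the minimum as an assumption. `sentence_scores_min` is a hypothetical name.

```python
import numpy as np

# Toy inputs: 2 sentences, K = 3 classes.
labels = [[0, 0], [2, 0, 1]]
pred_probs = [
    np.array([[0.9, 0.05, 0.05], [0.7, 0.2, 0.1]]),
    np.array([[0.1, 0.1, 0.8], [0.3, 0.6, 0.1], [0.2, 0.5, 0.3]]),
]


def sentence_scores_min(labels, pred_probs):
    """Score each sentence by the lowest self-confidence among its tokens."""
    sentence_scores, token_scores = [], []
    for sent_labels, sent_probs in zip(labels, pred_probs):
        # Probability the model assigns to each token's given label.
        tok = sent_probs[np.arange(len(sent_labels)), sent_labels]
        token_scores.append(tok)
        sentence_scores.append(float(tok.min()))
    return np.array(sentence_scores), token_scores


scores, token_scores = sentence_scores_min(labels, pred_probs)
# Sentences with the lowest score come first.
order = np.argsort(scores)
print(order.tolist())
```

Sorting by ascending score puts the sentences most likely to contain an error at the top of the review list.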

In [11]:
scores, token_scores = get_label_quality_scores(labels, pred_probs, return_token_info=True) 
show_sentence_issues(scores, token_scores, pred_probs, given_words, sentences, given_labels, entities)

Sentence 2907: score=0.000001
**Little** change from today's weather expected.
Given label: PER, suggested label: O

Sentence 19392: score=0.000001
**Let**'s march together," Scalfaro, a northerner himself, said.
Given label: LOC, suggested label: O

Sentence 9962: score=0.000001
3. Nastja Rysich (**germany**) 3.75
Given label: LOC, suggested label: O

Sentence 8904: score=0.000002
The Spla has fought Khartoum's government forces in the south since 1983 for greater autonomy or independence of the mainly Christian and animist region from the Moslem, Arabised **north**.
Given label: LOC, suggested label: O

Sentence 19303: score=0.000002
**Access** energy futures prices add to daytime gains.
Given label: MISC, suggested label: O

Sentence 12918: score=0.000002
**Mayor** Antonio Gonzalez Garcia, of the opposition Revolutionary Workers' Party, said in Wednesday's letter that army troops recently raided several local farms, stole cattle and raped women.
Given l