In [1]:
%%capture 
!pip install cleanlab termcolor 

In [2]:
# package versions: 
# cleanlab==2.0.0 
# termcolor==1.1.0 

# Token Classification Label Error Detection - Part 2 

In this tutorial, we show how you can use cleanlab to find potential label errors in a token classification dataset. Here, we use CoNLL-2003, which contains 3,449 sentences and 46,400 tokens (after filtering). 

**Overview of what we'll do in this tutorial:** 
- Identify potential token label issues using cleanlab's `find_label_issues` method. 
- Rank sentences using cleanlab's `token_classification.rank.get_label_quality_scores` method. 
- TODO: Train a more robust model by removing problematic sentences. 

## 1. Install and import the required dependencies 

In [3]:
import numpy as np
from termcolor import colored 
from cleanlab.filter import find_label_issues 
from rank import get_label_quality_scores 
from utils import * 
# change to this after token_classification.rank is pushed to the official package 
# from cleanlab.token_classification.rank import get_label_quality_scores 

## 2. Get `pred_probs` and `labels` 

For more information on how to get `pred_probs` and `labels`, see part 1. Recall that we also need the number of word-level tokens in each sentence, so we first convert both to a nested-list format (one entry per sentence) and then flatten them to one entry per token. 

In [4]:
labels = dict(np.load('labels.npz')) 
pred_probs = dict(np.load('pred_probs.npz')) 

labels_nl = to_nl(labels) 
pred_probs_nl = to_nl(pred_probs) 

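# flatten to one label and one row of predicted probabilities per token 
labels = [label for sentence_labels in labels_nl for label in sentence_labels] 
pred_probs = np.array([pred_prob for sentence_pred_probs in pred_probs_nl for pred_prob in sentence_pred_probs]) 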
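
The relationship between the nested and flattened formats can be sketched on toy data. Everything below (the `toy_*` names and the values) is made up for illustration and does not come from the dataset; it only shows that the per-sentence token counts are enough to go back from the flattened format to the nested one.

```python
import numpy as np

# Toy nested data (made up): 2 sentences with 2 and 1 tokens, 3 classes.
toy_labels_nl = [[0, 2], [1]]
toy_pred_probs_nl = [
    np.array([[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]]),
    np.array([[0.3, 0.6, 0.1]]),
]

# Flatten: one label and one probability row per token.
toy_labels_flat = [label for sentence in toy_labels_nl for label in sentence]
toy_pred_probs_flat = np.concatenate(toy_pred_probs_nl, axis=0)

# The number of tokens per sentence lets us recover the nested format.
toy_lengths = [len(sentence) for sentence in toy_labels_nl]
toy_pred_probs_back = np.split(toy_pred_probs_flat, np.cumsum(toy_lengths)[:-1])

print(toy_labels_flat, toy_pred_probs_flat.shape, len(toy_pred_probs_back))
```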

## 3. Use cleanlab to find potential label issues 

Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the probability assigned to it in our model’s prediction.
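
To make the self-confidence score concrete, here is a minimal sketch on made-up toy data. The helper name `self_confidence` and the `toy_*` values are assumptions for illustration, not part of cleanlab's API; the sketch only reproduces the definition given above (the predicted probability of the given label).

```python
import numpy as np

def self_confidence(labels, pred_probs):
    """Return, for each token, the predicted probability of its given label."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    return pred_probs[np.arange(len(labels)), labels]

# Toy data (made up): 3 tokens, 2 classes.
toy_labels = [0, 1, 1]
toy_pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.95, 0.05]])

toy_scores = self_confidence(toy_labels, toy_pred_probs)
toy_ranking = np.argsort(toy_scores)  # lowest self-confidence first

print(toy_scores, toy_ranking)
```

In this toy example the third token has the lowest score because the model assigns only 0.05 probability to its given label, so it would be ranked first.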

In [5]:
issues = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence') 
top = 10 
print('Cleanlab found %d potential label issues. ' % len(issues)) 
print('The top %d most likely label errors:' % top) 
print(str(issues[:top])) 

Cleanlab found 611 potential label issues. 
The top 10 most likely label errors:
[32203 31970 40344 34547  6727 31531 34553 46388 28663 13979]


Note that the indices are in the flattened format. We map them to a tuple `(i, j)`, which corresponds to the `j`'th word-level token of the `i`'th sentence. 

In [6]:
index_to_tuple = get_mapping(labels_nl) 
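
The implementation of `get_mapping` is not shown in this tutorial. As a rough sketch of the idea, a hypothetical equivalent (`build_index_to_tuple`, a name introduced here only for illustration) could enumerate the nested labels in order:

```python
def build_index_to_tuple(nested):
    """Map each flattened token index to (sentence index, token index).

    Hypothetical stand-in for `get_mapping`; the real helper may differ.
    """
    return [(i, j) for i, sentence in enumerate(nested) for j in range(len(sentence))]

# Toy nested labels (made up): 3 sentences with 2, 1 and 3 tokens.
toy_nested = [[0, 2], [0], [3, 0, 0]]
toy_mapping = build_index_to_tuple(toy_nested)

print(toy_mapping[3])  # flattened index 3 is the first token of the third sentence
```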

Let's look at the top 10 examples cleanlab thinks are most likely to be incorrectly labeled. We obtain the sentences from the original file to display the word-level token label issues in context. 

\* Uncomment the block below after tutorials are ready 

In [7]:
# %%capture 
# !wget https://data.deepai.org/conll2003.zip && mkdir data 
# !unzip conll2003.zip -d data/ && rm conll2003.zip 

In [8]:
# same as in tutorial part 1; probably want to collapse this block 
filepath = 'data/test.txt'
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entity_map = {entity: i for i, entity in enumerate(entities)} 

def readfile(filepath, sep=' '): 
    # read all lines up front so the file handle is closed promptly
    with open(filepath) as f:
        lines = f.readlines()
    
    data, sentence, label = [], [], []
    for line in lines:
        if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
            if len(sentence) > 0:
                data.append((sentence, label))
                sentence, label = [], []
            continue
        splits = line.split(sep) 
        word = splits[0]
        if len(word) > 0 and word[0].isalpha() and word.isupper():
            word = word[0] + word[1:].lower()
        sentence.append(word)
        # the tag is the last column; strip the trailing newline
        label.append(entity_map[splits[-1].strip()])

    if len(sentence) > 0:
        data.append((sentence, label))
        
    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 
    
    return given_words, given_labels 

given_words, given_labels = readfile(filepath) 
sentences = list(map(get_sentence, given_words)) 

sentences, mask = filter_sentence(sentences) 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
given_labels = [mapping(labels, maps) for labels in given_labels] 

In [9]:
entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 

def color_sentence(sentence, word): 
    # highlights the first occurrence of `word`; raises ValueError if it is absent
    start_idx = sentence.index(word) 
    before, after = sentence[:start_idx], sentence[start_idx + len(word):]
    return '%s%s%s' % (before, colored(word, 'red'), after) 

def print_issue(issue, show_labels=True): 
    i, j = index_to_tuple[issue] 
    issue_label = entities[labels[issue]] 
    predicted_label = entities[np.argmax(pred_probs[issue])] 
    issue_word = given_words[i][j] 
    print('%s' % color_sentence(sentences[i], issue_word)) 
    if show_labels: 
        print('Given label: %s, suggested label: %s\n' % (issue_label, predicted_label)) 

for idx, issue in enumerate(issues[:top]): 
    print('%d.' % (idx+1), end=' ') 
    print_issue(issue) 

1. A Reuter consensus survey sees medical equipment group Radiometer reporting largely unchanged earnings when it publishes first half 19996/97 results next [31mWednesday[0m.
Given label: ORG, suggested label: O

2. [31mLet[0m's march together," Scalfaro, a northerner himself, said.
Given label: LOC, suggested label: O

3. Scottish [31mpremier[0m division after Saturday's matches:
Given label: MISC, suggested label: O

4. [31mBut[0m 2 27/11/96 5,000 Burma
Given label: MISC, suggested label: O

5. [31m1.[0m Fc Cologne 16 8 2 6 31 27 26
Given label: ORG, suggested label: O

6. [31mNetwork[0m operators said the draft laws would hold them responsible for copyright infringements in the system and expose them to multi-billion-dollar liabilities.
Given label: MISC, suggested label: O

7. -- [31mBangkok[0m newsroom (662) 652-0642
Given label: MISC, suggested label: LOC

8. The lanky former Leeds United defender did not make his England debut until the age of 30 but eventually won

Let’s zoom in on some specific examples from the above: 

Given label is `ORG` but should be `O`: 

In [10]:
print_issue(issues[0], show_labels=False) 

A Reuter consensus survey sees medical equipment group Radiometer reporting largely unchanged earnings when it publishes first half 19996/97 results next [31mWednesday[0m.


Given label is `MISC` but should be `LOC`: 

In [11]:
print_issue(issues[6], show_labels=False) 

-- [31mBangkok[0m newsroom (662) 652-0642


Given label is `MISC` but should be `LOC`: 

In [12]:
print_issue(issues[8], show_labels=False) 

The basket comprises Algeria's Saharan Blend, Indonesia's Minas, Nigeria's Bonny Light, Saudi Arabia's Arabian Light, [31mDubai[0m of the Uae, Venezuela's Tia Juana and Mexico's Isthmus.


## 4. Most common word-level token mislabels 

It may be useful to examine the most common word-level token mislabels. 

In [13]:
words = [word for words in given_words for word in words]  
frequency = frequent_words(issues, words, labels, pred_probs) 
show_frequent_issues(frequency, entities, verbose=True) 

'Division' is mislabeled 30 times
-----------------------------
labeled as O, but predicted as MISC 30 times

'Czech' is mislabeled 13 times
-----------------------------
labeled as LOC, but predicted as MISC 12 times
labeled as ORG, but predicted as LOC 1 times

'League' is mislabeled 12 times
-----------------------------
labeled as O, but predicted as MISC 8 times
labeled as ORG, but predicted as MISC 2 times
labeled as LOC, but predicted as O 1 times
labeled as LOC, but predicted as ORG 1 times

'Conference' is mislabeled 10 times
-----------------------------
labeled as O, but predicted as MISC 10 times

'Hockey' is mislabeled 10 times
-----------------------------
labeled as O, but predicted as MISC 5 times
labeled as ORG, but predicted as MISC 5 times

'National' is mislabeled 9 times
-----------------------------
labeled as ORG, but predicted as MISC 8 times
labeled as O, but predicted as MISC 1 times

'Alpine' is mislabeled 8 times
-----------------------------
labeled as O, b

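As shown above, many of the flagged tokens involve confusion between `O` and `MISC`, which are inherently hard to differentiate. To filter these out, pass the corresponding label/prediction index pairs to `exclude`; here `(0, 1)` and `(1, 0)` cover `O`/`MISC` in both directions. 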
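
The internals of `frequent_words` are not shown here. The sketch below is a hypothetical illustration of the idea only: it counts (word, given label, predicted label) triples among flagged tokens and skips excluded pairs. The function name `count_mislabels`, its signature, and the `toy_*` data are assumptions; unlike `frequent_words`, it takes predicted labels directly rather than `pred_probs`.

```python
from collections import Counter

def count_mislabels(issue_indices, words, labels, predicted, exclude=()):
    """Count (word, given, predicted) triples among flagged tokens.

    Hypothetical sketch; pairs listed in `exclude` are skipped.
    """
    excluded = set(exclude)
    counts = Counter()
    for idx in issue_indices:
        pair = (labels[idx], predicted[idx])
        if pair in excluded:
            continue
        counts[(words[idx], *pair)] += 1
    return counts

# Toy flattened data (made up): class 0 = O, 1 = MISC, 4 = LOC.
toy_words = ["Division", "Czech", "Division", "Czech"]
toy_given = [0, 4, 0, 4]
toy_predicted = [1, 1, 1, 1]
toy_flagged = [0, 1, 2, 3]

toy_counts = count_mislabels(
    toy_flagged, toy_words, toy_given, toy_predicted, exclude=[(0, 1), (1, 0)]
)
print(toy_counts.most_common())
```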

In [14]:
frequency = frequent_words(issues, words, labels, pred_probs, exclude=[(0, 1), (1, 0)]) 
show_frequent_issues(frequency, entities, verbose=True) 

'Czech' is mislabeled 13 times
-----------------------------
labeled as LOC, but predicted as MISC 12 times
labeled as ORG, but predicted as LOC 1 times

'National' is mislabeled 8 times
-----------------------------
labeled as ORG, but predicted as MISC 8 times

'I' is mislabeled 6 times
-----------------------------
labeled as ORG, but predicted as O 5 times
labeled as MISC, but predicted as ORG 1 times

'Union' is mislabeled 6 times
-----------------------------
labeled as ORG, but predicted as MISC 6 times

'Hockey' is mislabeled 5 times
-----------------------------
labeled as ORG, but predicted as MISC 5 times

'United' is mislabeled 5 times
-----------------------------
labeled as LOC, but predicted as ORG 3 times
labeled as ORG, but predicted as LOC 2 times

'Fe' is mislabeled 4 times
-----------------------------
labeled as ORG, but predicted as LOC 3 times
labeled as LOC, but predicted as ORG 1 times

'Wto' is mislabeled 4 times
-----------------------------
labeled as ORG, b

## 5. Find issue sentences with a particular word 

If you want to examine a specific token, call `search_token` to get a list of the indices of the sentences in which that token was flagged as a potential label issue. 
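
As a rough illustration of what such a search involves, here is a hypothetical sketch (`find_issue_sentences` is a name introduced here; the actual `search_token` helper may behave differently). It maps each flagged flattened index back to its sentence and keeps the sentences where the flagged word matches the token.

```python
def find_issue_sentences(token, issue_indices, index_to_tuple, nested_words):
    """Return sorted indices of sentences where `token` is a flagged word.

    Hypothetical stand-in for `search_token`, for illustration only.
    """
    hits = set()
    for idx in issue_indices:
        i, j = index_to_tuple[idx]
        if nested_words[i][j] == token:
            hits.add(i)
    return sorted(hits)

# Toy data (made up): 3 sentences and their flattened-index mapping.
toy_words_nl = [["National", "Hockey"], ["Bangkok"], ["National", "League"]]
toy_index_to_tuple = [(0, 0), (0, 1), (1, 0), (2, 0), (2, 1)]
toy_issue_indices = [0, 2, 3]

toy_hits = find_issue_sentences(
    "National", toy_issue_indices, toy_index_to_tuple, toy_words_nl
)
print(toy_hits)
```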

\* Do you think we should also output the given and predicted labels? If so, the function would return more outputs, but I imagine this addition would be useful. It would also allow users to exclude certain types of mislabels, as above. 

In [15]:
token = 'National' 
indices_with_token = search_token(token, issues, index_to_tuple, given_words) 

for index in indices_with_token: 
    print('Sentence %d: %s\n' % (index, color_sentence(sentences[index], token))) 

Sentence 354: Standings of [31mNational[0m Hockey

Sentence 405: Results of [31mNational[0m Hockey

Sentence 485: [31mNational[0m Football League

Sentence 510: [31mNational[0m Football Conference

Sentence 552: Result of [31mNational[0m Football

Sentence 3302: Results of [31mNational[0m Basketball

Sentence 3315: Standings of [31mNational[0m Hockey

Sentence 3367: Results of [31mNational[0m Hockey

Sentence 3378: Vancouver Canucks star right wing Pavel Bure was suspended for one game by the [31mNational[0m Hockey League and fined$ 1,000 Friday for his hit on Buffalo Sabres defenceman Garry Galley on Wednesday.



## 6. Sentence label quality score 

Cleanlab can analyze every label in the dataset and provide a numerical score for each sentence. The score ranges between 0 and 1: a lower score indicates that the sentence is more likely to contain at least one error. 
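
One plausible way to obtain such a sentence score is to take the minimum token-level self-confidence within each sentence. The sketch below assumes that aggregation; it is not necessarily what `get_label_quality_scores` from the local `rank` module does, and the function name `sentence_scores_min` and the `toy_*` data are made up for illustration.

```python
import numpy as np

def sentence_scores_min(labels_nested, pred_probs_nested):
    """Score each sentence by its lowest token self-confidence (assumed aggregation)."""
    scores, token_scores = [], []
    for sent_labels, sent_probs in zip(labels_nested, pred_probs_nested):
        sent_probs = np.asarray(sent_probs)
        # probability the model assigns to each token's given label
        per_token = sent_probs[np.arange(len(sent_labels)), sent_labels]
        token_scores.append(per_token)
        scores.append(per_token.min())
    return np.array(scores), token_scores

# Toy nested data (made up): 2 sentences, 2 classes.
toy_sent_labels = [[0, 1], [1]]
toy_sent_probs = [
    np.array([[0.9, 0.1], [0.7, 0.3]]),
    np.array([[0.2, 0.8]]),
]

toy_sent_scores, toy_token_scores = sentence_scores_min(toy_sent_labels, toy_sent_probs)
toy_sent_ranking = np.argsort(toy_sent_scores)  # most suspicious sentence first
print(toy_sent_scores, toy_sent_ranking)
```

Under this assumption a single low-confidence token is enough to pull the whole sentence to the top of the ranking, which matches the intent of flagging sentences that contain at least one likely error.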

In [16]:
scores, token_scores = get_label_quality_scores(labels_nl, pred_probs_nl, return_token_info=True) 
ranking = np.argsort(scores) 

for r in ranking[:top]: 
    print('Sentence %d - score=%.6f' % (r, scores[r])) 
    issue_index = np.argmin(token_scores[r]) 
    issue_word = given_words[r][issue_index] 
    print(color_sentence(sentences[r], issue_word)) 
    given_label = given_labels[r][issue_index] 
    suggested_label = np.argmax(pred_probs_nl[r][issue_index]) 
    print('Given label: %s, suggested label: %s\n' % (entities[given_label], entities[suggested_label])) 

Sentence 2133 - score=0.000007
A Reuter consensus survey sees medical equipment group Radiometer reporting largely unchanged earnings when it publishes first half 19996/97 results next [31mWednesday[0m.
Given label: ORG, suggested label: O

Sentence 2123 - score=0.000009
[31mLet[0m's march together," Scalfaro, a northerner himself, said.
Given label: LOC, suggested label: O

Sentence 2770 - score=0.000010
Scottish [31mpremier[0m division after Saturday's matches:
Given label: MISC, suggested label: O

Sentence 2272 - score=0.000013
[31mBut[0m 2 27/11/96 5,000 Burma
Given label: MISC, suggested label: O

Sentence 604 - score=0.000015
[31m1.[0m Fc Cologne 16 8 2 6 31 27 26
Given label: ORG, suggested label: O

Sentence 2102 - score=0.000017
[31mNetwork[0m operators said the draft laws would hold them responsible for copyright infringements in the system and expose them to multi-billion-dollar liabilities.
Given label: MISC, suggested label: O

Sentence 2273 - score=0.000021
-