# Walkthrough: token classification with Cleanlab

This notebook simply follows the Cleanlab tutorial:

[https://docs.cleanlab.ai/stable/tutorials/token_classification.html](https://docs.cleanlab.ai/stable/tutorials/token_classification.html)

I am not taking any notes here; I just wanted to make sure the tutorial code runs OK in Kaggle.

TODO: annotate a custom dataset and run it through the same steps. That requires a model to make predictions so you can produce the `pred_probs` yourself (here the tutorial provides all the data already).
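
For the custom-dataset TODO, the key requirement is the layout of `pred_probs`: judging by the tutorial data loaded further down, it is a list with one array per sentence, each of shape (number of tokens, number of classes), with every row summing to 1. A minimal sketch, assuming a hypothetical model that returns per-token logits (the `logits_per_sentence` values are random placeholders), and ignoring the subword-to-word alignment a real tokenizer would need:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: turn (n_tokens, n_classes) logits into probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Placeholder logits for two sentences (3 tokens and 2 tokens), 5 classes each.
# In practice these would come from your own model, not a random generator.
rng = np.random.default_rng(0)
logits_per_sentence = [rng.normal(size=(3, 5)), rng.normal(size=(2, 5))]

my_pred_probs = [softmax(logits) for logits in logits_per_sentence]
print([p.shape for p in my_pred_probs])
```

The Cleanlab documentation recommends out-of-sample predictions (for example from cross-validation), so the model producing the logits should ideally not have been trained on the sentences it scores.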

In [1]:
!pip install cleanlab

Collecting cleanlab
  Downloading cleanlab-2.6.5-py3-none-any.whl.metadata (63 kB)
Downloading cleanlab-2.6.5-py3-none-any.whl (352 kB)
Installing collected packages: cleanlab
Successfully installed cleanlab-2.6.5


In [2]:
# Download the tutorial files: the CoNLL-2003 data and the precomputed "pred_probs".
# For a custom dataset you would need to run a model to get "pred_probs" yourself.

# mkdir -p avoids an error if the data directory already exists
!wget -nc https://data.deepai.org/conll2003.zip && mkdir -p data
!unzip conll2003.zip -d data/ && rm conll2003.zip
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/TokenClassification/pred_probs.npz'

--2024-06-12 20:56:34--  https://data.deepai.org/conll2003.zip
Resolving data.deepai.org (data.deepai.org)... 138.199.36.11, 2400:52e0:1e00::1077:1
Connecting to data.deepai.org (data.deepai.org)|138.199.36.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982975 (960K) [application/zip]
Saving to: 'conll2003.zip'


2024-06-12 20:56:34 (18.5 MB/s) - 'conll2003.zip' saved [982975/982975]

Archive:  conll2003.zip
  inflating: data/metadata           
  inflating: data/test.txt           
  inflating: data/train.txt          
  inflating: data/valid.txt          
--2024-06-12 20:56:36--  https://cleanlab-public.s3.amazonaws.com/TokenClassification/pred_probs.npz
Resolving cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)... 52.216.250.252, 54.231.233.209, 52.216.44.17, ...
Connecting to cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)|52.216.250.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170

In [3]:
import numpy as np
from cleanlab.token_classification.filter import find_label_issues 
from cleanlab.token_classification.rank import get_label_quality_scores, issues_from_scores 
from cleanlab.internal.token_classification_utils import get_sentence, filter_sentence, mapping 
from cleanlab.token_classification.summary import display_issues, common_label_issues, filter_by_token 

np.set_printoptions(suppress=True)

In [4]:
def read_npz(filepath): 
    data = dict(np.load(filepath)) 
    data = [data[str(i)] for i in range(len(data))] 
    return data 

In [5]:
pred_probs = read_npz('pred_probs.npz') 

In [8]:
print(type(pred_probs), type(pred_probs[0]))

<class 'list'> <class 'numpy.ndarray'>
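
`read_npz` assumes the archive stores one array per sentence under the string keys `'0'`, `'1'`, and so on. A sketch of writing your own predictions in that layout so the same loader can read them back; `write_npz`, the file name `my_pred_probs.npz`, and the toy arrays are made up for illustration:

```python
import os
import tempfile
import numpy as np

def write_npz(filepath, arrays):
    """Save a list of arrays under keys '0', '1', ... (the layout read_npz expects)."""
    np.savez(filepath, **{str(i): arr for i, arr in enumerate(arrays)})

def read_npz(filepath):
    """Same loader as in the notebook: rebuild the list in key order."""
    data = dict(np.load(filepath))
    return [data[str(i)] for i in range(len(data))]

# Two toy "sentences" of different lengths, 5 classes each
toy = [np.full((3, 5), 0.2), np.full((2, 5), 0.2)]
path = os.path.join(tempfile.mkdtemp(), 'my_pred_probs.npz')
write_npz(path, toy)

restored = read_npz(path)
print(len(restored), restored[0].shape, restored[1].shape)
```

A list of separate arrays is used instead of one big array because sentences have different numbers of tokens.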


In [9]:
given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 
entity_map = {entity: i for i, entity in enumerate(given_entities)} 

def readfile(filepath, sep=' '):
    data, sentence, label = [], [], []
    # use a context manager so the file is closed after reading
    with open(filepath) as lines:
        for line in lines:
            # blank lines and -DOCSTART- markers end the current sentence
            if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                if len(sentence) > 0:
                    data.append((sentence, label))
                    sentence, label = [], []
                continue
            splits = line.split(sep)
            word = splits[0]
            # turn all-caps words into capitalized form, e.g. "EU" -> "Eu"
            if len(word) > 0 and word[0].isalpha() and word.isupper():
                word = word[0] + word[1:].lower()
            sentence.append(word)
            # the last column is the NER tag; [:-1] drops the trailing newline
            label.append(entity_map[splits[-1][:-1]])

    if len(sentence) > 0:
        data.append((sentence, label))

    tokens = [d[0] for d in data]
    given_labels = [d[1] for d in data]
    return tokens, given_labels

In [10]:
filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] 
tokens, given_labels = [], [] 

for filepath in filepaths: 
    words, label = readfile(filepath) 
    tokens.extend(words) 
    given_labels.extend(label)
    
sentences = list(map(get_sentence, tokens)) 

sentences, mask = filter_sentence(sentences) 
tokens = [words for m, words in zip(mask, tokens) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
labels = [mapping(labels, maps) for labels in given_labels]

In [12]:
print(len(tokens), len(given_labels))

20718 20718
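
The `maps` list collapses the nine IOB tags into five entity classes: position `i` of `maps` holds the merged class for given tag `i`, so `B-PER` (3) and `I-PER` (4) both become `PER` (2). A plain-Python sketch of that remapping, using a hypothetical helper `merge_tags` as a stand-in rather than Cleanlab's own `mapping`:

```python
given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entities = ['O', 'MISC', 'PER', 'ORG', 'LOC']
maps = [0, 1, 1, 2, 2, 3, 3, 4, 4]

def merge_tags(tag_ids, maps):
    """Hypothetical stand-in for the remapping step: look up each tag id in maps."""
    return [maps[t] for t in tag_ids]

# "Eu rejects German call" in the given tag set: B-ORG, O, B-MISC, O
example = [5, 0, 1, 0]
merged = merge_tags(example, maps)

print([given_entities[t] for t in example])
print([entities[t] for t in merged])
```

Merging the B- and I- variants means the label check works at the entity-type level and does not flag boundary-only disagreements.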


In [13]:
indices_to_preview = 3  # increase this to view more examples
for i in range(indices_to_preview):
    print('\nsentences[%d]:\t' % i + str(sentences[i])) 
    print('labels[%d]:\t' % i + str(labels[i])) 
    print('pred_probs[%d]:\n' % i + str(pred_probs[i])) 


sentences[0]:	Eu rejects German call to boycott British lamb.
labels[0]:	[3, 0, 1, 0, 0, 0, 1, 0, 0]
pred_probs[0]:
[[0.00030412 0.00023826 0.99936208 0.00007009 0.00002545]
 [0.99998795 0.00000401 0.00000218 0.00000455 0.00000131]
 [0.00000749 0.99996115 0.00001371 0.0000087  0.00000895]
 [0.99998936 0.00000382 0.00000178 0.00000366 0.00000137]
 [0.99999101 0.00000266 0.00000174 0.0000035  0.00000109]
 [0.99998768 0.00000482 0.00000202 0.00000438 0.0000011 ]
 [0.00000465 0.99996392 0.00001105 0.0000116  0.00000878]
 [0.99998671 0.00000364 0.00000213 0.00000472 0.00000281]
 [0.99999073 0.00000211 0.00000159 0.00000442 0.00000115]]

sentences[1]:	Peter Blackburn
labels[1]:	[2, 2]
pred_probs[1]:
[[0.00000358 0.00000529 0.99995623 0.000022   0.0000129 ]
 [0.0000024  0.00001812 0.99994141 0.00001645 0.00002162]]

sentences[2]:	Brussels 1996-08-22
labels[2]:	[4, 0]
pred_probs[2]:
[[0.00001172 0.00000821 0.00004661 0.0000618  0.99987167]
 [0.99999061 0.00000201 0.00000195 0.00000408 0.00000

In [14]:
issues = find_label_issues(labels, pred_probs) 



In [15]:
top = 20  # increase this value to view more identified issues
print('Cleanlab found %d potential label issues. ' % len(issues)) 
print('The top %d most likely label errors:' % top) 
print(issues[:top]) 

Cleanlab found 2254 potential label issues. 
The top 20 most likely label errors:
[(2907, 0), (19392, 0), (9962, 4), (8904, 30), (19303, 0), (12918, 0), (9256, 0), (11855, 20), (18392, 4), (20426, 28), (19402, 21), (14744, 15), (19371, 0), (4645, 2), (83, 9), (10331, 3), (9430, 10), (6143, 25), (18367, 0), (12914, 3)]
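
Each entry in `issues` is a `(sentence index, token index)` pair, so it can index `tokens`, `labels` and `pred_probs` directly. A sketch on toy data; the toy sentences, labels, probabilities and issue list are invented for illustration, and `describe_issue` is a hypothetical helper, not part of Cleanlab:

```python
import numpy as np

entities = ['O', 'MISC', 'PER', 'ORG', 'LOC']

# Toy stand-ins for tokens, labels, pred_probs and issues
toy_tokens = [['Little', 'change', 'expected'], ['Peter', 'Blackburn']]
toy_labels = [[2, 0, 0], [2, 2]]
toy_pred_probs = [
    np.array([[0.96, 0.01, 0.01, 0.01, 0.01],
              [0.96, 0.01, 0.01, 0.01, 0.01],
              [0.96, 0.01, 0.01, 0.01, 0.01]]),
    np.array([[0.01, 0.01, 0.96, 0.01, 0.01],
              [0.01, 0.01, 0.96, 0.01, 0.01]]),
]
toy_issues = [(0, 0)]  # sentence 0, token 0

def describe_issue(issue, tokens, labels, pred_probs, class_names):
    """Return (token, given class, predicted class) for one (sentence, token) pair."""
    i, j = issue
    predicted = int(np.argmax(pred_probs[i][j]))
    return tokens[i][j], class_names[labels[i][j]], class_names[predicted]

for issue in toy_issues:
    print(issue, describe_issue(issue, toy_tokens, toy_labels, toy_pred_probs, entities))
```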


In [16]:
display_issues(issues, tokens, pred_probs=pred_probs, labels=labels, 
               exclude=[(0, 1), (1, 0)], class_names=entities) 

Sentence index: 2907, Token index: 0
Token: Little
Given label: PER, predicted label according to provided pred_probs: O
----
Little change from today's weather expected.


Sentence index: 19392, Token index: 0
Token: Let
Given label: LOC, predicted label according to provided pred_probs: O
----
Let's march together," Scalfaro, a northerner himself, said.


Sentence index: 9962, Token index: 4
Token: germany
Given label: LOC, predicted label according to provided pred_probs: O
----
3. Nastja Rysich (germany) 3.75


Sentence index: 8904, Token index: 30
Token: north
Given label: LOC, predicted label according to provided pred_probs: O
----
The Spla has fought Khartoum's government forces in the south since 1983 for greater autonomy or independence of the mainly Christian and animist region from the Moslem, Arabised north.


Sentence index: 12918, Token index: 0
Token: Mayor
Given label: PER, predicted label according to provided pred_probs: O
----
[3

In [17]:
info = common_label_issues(issues, tokens, 
                           labels=labels, 
                           pred_probs=pred_probs, 
                           class_names=entities, 
                           exclude=[(0, 1), (1, 0)]) 

Token '/' is potentially mislabeled 42 times throughout the dataset
---------------------------------------------------------------------------------------
labeled as class `O` but predicted to actually be class `LOC` 36 times
labeled as class `O` but predicted to actually be class `PER` 4 times
labeled as class `O` but predicted to actually be class `ORG` 2 times

Token 'Chicago' is potentially mislabeled 27 times throughout the dataset
---------------------------------------------------------------------------------------
labeled as class `ORG` but predicted to actually be class `LOC` 22 times
labeled as class `LOC` but predicted to actually be class `ORG` 3 times
labeled as class `MISC` but predicted to actually be class `ORG` 2 times

Token 'U.s.' is potentially mislabeled 21 times throughout the dataset
---------------------------------------------------------------------------------------
labeled as class `LOC` but predicted to actually be class `ORG` 8 times
labeled as class `OR

In [18]:
token_issues = filter_by_token('United', issues, tokens)

display_issues(token_issues, tokens, pred_probs=pred_probs, labels=labels, 
               exclude=[(0, 1), (1, 0)], class_names=entities) 

Sentence index: 471, Token index: 8
Token: United
Given label: LOC, predicted label according to provided pred_probs: ORG
----
Soccer - Keane Signs Four-year Contract With Manchester United.


Sentence index: 19072, Token index: 5
Token: United
Given label: LOC, predicted label according to provided pred_probs: ORG
----
The Humane Society of the United States estimates that between 500,000 and one million bites are delivered by dogs each year, more than half of which are suffered by children.


Sentence index: 19910, Token index: 5
Token: United
Given label: LOC, predicted label according to provided pred_probs: ORG
----
His father Clarence Woolmer represented United Province, now renamed Uttar Pradesh, in India's Ranji Trophy national championship and captained the state during 1949.


Sentence index: 15658, Token index: 0
Token: United
Given label: ORG, predicted label according to provided pred_probs: LOC
----
United Nations 1996-08-29


Sentence 

In [19]:
sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)
issues = issues_from_scores(sentence_scores, token_scores=token_scores) 
display_issues(issues, tokens, pred_probs=pred_probs, labels=labels, 
               exclude=[(0, 1), (1, 0)], class_names=entities) 

Sentence index: 2907, Token index: 0
Token: Little
Given label: PER, predicted label according to provided pred_probs: O
----
Little change from today's weather expected.


Sentence index: 19392, Token index: 0
Token: Let
Given label: LOC, predicted label according to provided pred_probs: O
----
Let's march together," Scalfaro, a northerner himself, said.


Sentence index: 9962, Token index: 4
Token: germany
Given label: LOC, predicted label according to provided pred_probs: O
----
3. Nastja Rysich (germany) 3.75


Sentence index: 8904, Token index: 30
Token: north
Given label: LOC, predicted label according to provided pred_probs: O
----
The Spla has fought Khartoum's government forces in the south since 1983 for greater autonomy or independence of the mainly Christian and animist region from the Moslem, Arabised north.


Sentence index: 12918, Token index: 0
Token: Mayor
Given label: PER, predicted label according to provided pred_probs: O
----
[3