# Token Classification Label Error Detection - Part 1 

In this tutorial, we show how to obtain model-predicted probabilities and labels for an NLP token-classification dataset. These are the inputs needed to identify potential label issues in the dataset. 

**Overview of what we'll do in this tutorial:** 
- Use pre-trained HuggingFace models to get the predictive probabilities 
- Reduce subword-level tokens to word-level tokens

\* In most of the NLP literature, tokens refer to words or punctuation marks, whereas most HuggingFace tokenizers break longer words down into subwords. To avoid confusion, we refer to the tokens given in the dataset as "word-level tokens", and to the tokens produced by the tokenizer as "subword-level tokens". 

## 1. Install the required dependencies 

We install `transformers` and `tqdm`, and set `TOKENIZERS_PARALLELISM` to `false` when the machine has more than one processor. 

In [1]:
!pip install transformers tqdm 
import numpy as np
import string
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from tqdm import tqdm 

import os 
if os.cpu_count() > 1: 
    os.environ["TOKENIZERS_PARALLELISM"] = "false" 


## 2. Fetch the CONLL-2003 dataset 

The CONLL-2003 dataset is in the following format: 

`-DOCSTART- -X- -X- O` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of first sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

`[empty line]` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of second sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 
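
To make the layout concrete, here is a minimal sketch that parses a small in-memory sample in this format. The sample text and the helper name `parse_conll` are made up for illustration and are not part of the dataset or of this tutorial's code; the sketch assumes columns are separated by whitespace and sentences by blank lines.

```python
import io

# Hypothetical sample that mimics the layout described above.
SAMPLE = """-DOCSTART- -X- -X- O

SOCCER NN B-NP O
- : O O
JAPAN NNP B-NP B-LOC

Nadim NNP B-NP B-PER
Ladki NNP I-NP I-PER
"""

def parse_conll(handle):
    """Group (word, ner_tag) pairs into sentences split on blank lines."""
    sentences, current = [], []
    for raw in handle:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # First column is the word, last column is the NER tag.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences

parsed = parse_conll(io.StringIO(SAMPLE))
for sentence in parsed:
    print(sentence)
```

Only the first and last columns are kept here, since the part-of-speech and chunk tags are not used in what follows.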

In our example, we focus on the `ner_tags` (named-entity recognition tags), which include: 

| `ner_tags` |             Description              |
|:----------:|:------------------------------------:|
|     `O`    |      Other (not a named entity)      |
|  `B-MISC`  | Beginning of a miscellaneous entity  |
|  `I-MISC`  |         Miscellaneous entity         |
|   `B-PER`  |     Beginning of a person entity     |
|   `I-PER`  |            Person entity             |
|   `B-ORG`  | Beginning of an organization entity  |
|   `I-ORG`  |         Organization entity          |
|   `B-LOC`  |    Beginning of a location entity    |
|   `I-LOC`  |           Location entity            | 

For more information, see [here](https://paperswithcode.com/dataset/conll-2003). Here, all-caps words are converted to lowercase except for the first character (e.g. `JAPAN` -> `Japan`). This is to discourage the tokenizer from breaking such words into multiple subwords. The `readfile` implementation is adapted from [here](https://github.com/kamalkraj/BERT-NER/blob/dev/run_ner.py#L92). 

In [2]:
filepath = 'data/conll.txt'
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entity_map = {entity: i for i, entity in enumerate(entities)} 

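def readfile(filepath, sep=' '): 
    """Read a CoNLL-formatted file into parallel lists of words and integer labels."""
    data, sentence, label = [], [], []
    # Use a context manager so the file is closed once it has been read.
    with open(filepath) as lines:
        for line in lines:
            # Blank lines and -DOCSTART- markers separate sentences.
            if len(line.strip()) == 0 or line.startswith('-DOCSTART'):
                if len(sentence) > 0:
                    data.append((sentence, label))
                    sentence, label = [], []
                continue
            splits = line.split(sep) 
            word = splits[0]
            # Convert all-caps words to capitalized form (e.g. JAPAN -> Japan).
            if len(word) > 0 and word[0].isalpha() and word.isupper():
                word = word[0] + word[1:].lower()
            sentence.append(word)
            # The NER tag is the last column; strip the trailing newline.
            label.append(entity_map[splits[-1].strip()])

    if len(sentence) > 0:
        data.append((sentence, label))
        
    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 
    
    return given_words, given_labels 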

given_words, given_labels = readfile(filepath) 
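
The capitalization rule inside `readfile` can be tried in isolation. The sketch below copies that condition into a standalone helper; the name `normalize_caps` and the sample words are chosen here for illustration only.

```python
def normalize_caps(word):
    """Lowercase everything but the first character of an all-caps word."""
    # Same condition as in readfile: starts with a letter and has no lowercase letters.
    if len(word) > 0 and word[0].isalpha() and word.isupper():
        return word[0] + word[1:].lower()
    return word

samples = ["JAPAN", "Japan", "U.S.", "1996-12-06", "-"]
normalized = [normalize_caps(w) for w in samples]
for before, after in zip(samples, normalized):
    print(before, "->", after)
```

Note that the rule also applies to abbreviations that start with a letter, so `U.S.` becomes `U.s.`, while words that start with a digit or punctuation mark are left unchanged.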

`given_words` and `given_labels` are nested lists, with one inner list per sentence. 

In [3]:
i = 0 

print('Word\t\tLabel\tEntity\n-------------------------------') 
for word, label in zip(given_words[i], given_labels[i]): 
    print('{:14s}{:3d}\t{:10s}'.format(word, label, entities[label])) 

Word		Label	Entity
-------------------------------
Soccer          0	O         
-               0	O         
Japan           7	B-LOC     
Get             0	O         
Lucky           0	O         
Win             0	O         
,               0	O         
China           3	B-PER     
In              0	O         
Surprise        0	O         
Defeat          0	O         
.               0	O         


Next, obtain the sentences with some minor pre-processing for readability. Sentences that contain `#`, or that are at most one character long, are removed. The first condition ensures that the symbol is not confused with the `#` that marks subwords after the sentence is tokenized. 

In [4]:
def get_sentence(words): 
    sentence = ''
    for word in words:
        if word not in string.punctuation or word in ['-', '(']:
            word = ' ' + word
        sentence += word
    sentence = sentence.replace(" '", "'").replace('( ', '(').strip()
    return sentence

sentences = list(map(get_sentence, given_words)) 

mask = [len(sentence) > 1 and '#' not in sentence for sentence in sentences] 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 
sentences = [sentence for m, sentence in zip(mask, sentences) if m] 

n = len(sentences) 

In [5]:
print('Number of sentences: %d' % n) 
print(sentences[0]) 

Number of sentences: 3449
Soccer - Japan Get Lucky Win, China In Surprise Defeat.


## 3. Pre-trained model 

In this example, we use `dslim/bert-base-NER` (`bert`) as our NLP model and tokenizer. Note that most tokenizers break sentences down into subword-level tokens, which are units smaller than word-level tokens. Since we are only interested in label issues at the word level, we first need to know how each sentence is tokenized. Feel free to try out an alternative model, `xlm-roberta-large-finetuned-conll03-english` (`xlm`)! 

In [6]:
model_name = 'dslim/bert-base-NER' 
# model_name = 'xlm-roberta-large-finetuned-conll03-english'
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline(task="ner", model=model, tokenizer=tokenizer)

For example, the following sentence: 

In [7]:
print(sentences[0]) 

Soccer - Japan Get Lucky Win, China In Surprise Defeat.


is tokenized into: 

In [8]:
token_ids = tokenizer(sentences[0])['input_ids'] 
tokens = [tokenizer.decode(token) for token in token_ids] 
print(tokens) 

['[CLS]', 'Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Sur', '##prise', 'De', '##fe', '##at', '.', '[SEP]']


`[CLS]` and `[SEP]` are special tokens that help the BERT model identify the start and end of each sentence, and `##` indicates that the token is a subword continuing the previous token. We strip the special tokens and the `#` markers here, and map the resulting subwords back to the original word-level tokens afterwards. 

In [9]:
tokens = [token.replace('#', '') for token in tokens][1:-1] 
print(tokens) 
sentence_tokens = [[tokenizer.decode(token) for token in tokenizer(sentence)['input_ids']] for sentence in sentences] 
sentence_tokens = [[token.replace('#', '') for token in sentence_token][1:-1] for sentence_token in sentence_tokens] 

['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Sur', 'prise', 'De', 'fe', 'at', '.']
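
As a quick sanity check on the `##` convention, the sketch below groups WordPiece-style tokens back into words in pure Python. The helper name `group_wordpieces` is hypothetical and is not used elsewhere in this tutorial, which aligns subwords with the given words by character position instead, since a tokenizer's word boundaries do not always match the dataset's.

```python
def group_wordpieces(tokens):
    """Group each leading piece with the '##' continuation pieces that follow it."""
    groups = []
    for token in tokens:
        if token.startswith("##") and groups:
            # Continuation piece: attach it to the current group without the marker.
            groups[-1].append(token[2:])
        else:
            # Any other token starts a new group.
            groups.append([token])
    return groups

tokens = ['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In',
          'Sur', '##prise', 'De', '##fe', '##at', '.']
groups = group_wordpieces(tokens)
words = ["".join(group) for group in groups]
print(groups)
print(words)
```

For this sentence, the groups reproduce the given words, with `Surprise` built from two pieces and `Defeat` from three.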


## 4. Get predictive probabilities 

In this example, we are more interested in severe types of mislabels, such as `B-LOC` vs. `B-PER`, instead of `B-LOC` vs. `I-LOC`. Therefore, we discard the `B-` and `I-` prefixes, and get the model-predictive probabilities for each subword-level token. Set `merge_entities=False` if you do not wish to merge*. 

\* The `xlm` model uses a slightly different set of named entities, so the prefixes must be discarded when using it. For more information, see [here](https://huggingface.co/xlm-roberta-large-finetuned-conll03-english/blob/main/config.json). 

In [10]:
merge_entities = True 
merge_entities = merge_entities or model_name == 'xlm-roberta-large-finetuned-conll03-english' 

if merge_entities: 
    merge = lambda idx: (idx+1) // 2 
    merge_list = lambda l: list(map(merge, l)) 
    given_labels = [merge_list(labels) for labels in given_labels] 

def get_probs(sentence, merge_method='dslim/bert-base-NER'): 
    def softmax(logit): 
        # Subtract the max logit for numerical stability; the result is unchanged.
        exp = np.exp(logit - np.max(logit)) 
        return exp / np.sum(exp) 
    
    forward = pipe.forward(pipe.preprocess(sentence)) 
    logits = forward['logits'][0].numpy() 
    probs = np.array([softmax(logit) for logit in logits]) 
    probs = probs[1:-1] 
    
    if not merge_entities: 
        return probs 
    
    if merge_method == 'dslim/bert-base-NER': 
        probs_merged = np.zeros([len(probs), 5]) 
        for i in range(9): 
            merged_idx = merge(i) 
            probs_merged[:, merged_idx] += probs[:, i] 
        return probs_merged 
    
    elif merge_method == 'xlm-roberta-large-finetuned-conll03-english': 
        def xlm_mapping(prob): 
            new_prob = np.zeros(5) 

            new_prob[0] = prob[7] 
            new_prob[1] = prob[1] + prob[4] 
            new_prob[2] = prob[6] 
            new_prob[3] = prob[2] + prob[5] 
            new_prob[4] = prob[0] + prob[3] 

            return new_prob 

        return np.array(list(map(xlm_mapping, probs))) 

sentence_probs = [get_probs(sentence, merge_method=model_name) for sentence in tqdm(sentences)] 

100%|███████████████████████████████████████████████████████████████████████████████| 3449/3449 [00:59<00:00, 57.55it/s]
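
To illustrate the merging step on its own, here is a small worked example of the `(idx + 1) // 2` mapping applied to one row of nine-class probabilities. The probability values and the `merged_names` list are made up for illustration; only the entity order and the index formula come from the code above.

```python
import numpy as np

entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
merged_names = ['O', 'MISC', 'PER', 'ORG', 'LOC']  # illustrative names for the merged classes

# (idx + 1) // 2 sends O to 0 and each B-/I- pair to one shared index.
mapping = {name: (idx + 1) // 2 for idx, name in enumerate(entities)}

# Made-up probabilities for a single subword (they sum to 1).
probs = np.array([[0.05, 0.0, 0.0, 0.10, 0.05, 0.0, 0.0, 0.60, 0.20]])

# Add each original column into the column of its merged class.
merged = np.zeros((probs.shape[0], len(merged_names)))
for idx in range(probs.shape[1]):
    merged[:, (idx + 1) // 2] += probs[:, idx]

print(mapping)
print(merged)
```

Because every original class is added to exactly one merged class, each merged row still sums to 1.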


## 5. Reducing from subword-level to word-level 

Reduce subword-level tokens to word-level tokens. Here, we show an example of how the reduction is implemented. Consider the following tokenization: 

In [11]:
print('Sentence:\t' + sentences[0]) 
print('Given words:\t' + str(given_words[0])) 
print('Subwords:\t' + str(sentence_tokens[0])) 

Sentence:	Soccer - Japan Get Lucky Win, China In Surprise Defeat.
Given words:	['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Surprise', 'Defeat', '.']
Subwords:	['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Sur', 'prise', 'De', 'fe', 'at', '.']


The token `Surprise` is split into two subwords, `Sur` and `prise`. In this case, we assign the average of the two subwords' predictive probabilities to the word-level token. Alternatively, we can take a weighted average in which the weight of each subword is proportional to its length, so that longer subwords contribute more to the result, although the benefit is insignificant for most datasets. 
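
The difference between the two averaging schemes can be seen on a small example. The probabilities below are made up for illustration; the weights are the subword lengths, as described above.

```python
import numpy as np

subwords = ['Sur', 'prise']
# Made-up 5-class probabilities for each subword (each row sums to 1).
scores = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.90, 0.02, 0.02, 0.03, 0.03],
])

# Plain average: both subwords count equally.
plain = np.average(scores, axis=0)

# Weighted average: weights proportional to subword length, here [3, 5].
weights = [len(s) for s in subwords]
weighted = np.average(scores, axis=0, weights=weights)

print(plain)
print(weighted)
```

Both results are valid probability vectors; the weighted version leans slightly towards the longer subword `prise`.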

Each tokenizer splits sentences differently. In some rare cases, a subword may span two word-level tokens, so the subwords do not line up with the given words. For example, consider the following tokenization: 

In [12]:
demo_sentence = 'Massachusetts Institute of Technology (MIT)' 
demo_given_words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')'] 
demo_subwords = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')'] 

print('Sentence:\t' + demo_sentence) 
print('Given words:\t' + str(demo_given_words)) 
print('Subwords:\t' + str(demo_subwords)) 

Sentence:	Massachusetts Institute of Technology (MIT)
Given words:	['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')']
Subwords:	['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')']


In this case, we assign the predictive probabilities of `(M` to `(`, and the average predictive probabilities of `(M` and `IT` to `MIT`. 

In [13]:
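def get_pred_probs_and_labels(scores, tokens, given_token, given_label, weighted=False): 
    # (i, j) points at character j of subword i, where the current word starts.
    i, j = 0, 0 
    pred_probs, labels = [], [] 
    for token, label in zip(given_token, given_label): 
        i_new, j_new = i, j 
        acc = 0  # number of characters of the word matched so far
        
        weights = []  # characters that each overlapping subword contributes
        while acc != len(token): 
            token_len = len(tokens[i_new][j_new:]) 
            remain = len(token) - acc 
            weights.append(min(remain, token_len)) 
            if token_len > remain: 
                # The word ends inside this subword; stay on it for the next word.
                acc += remain 
                j_new += remain 
            else: 
                # The subword is fully consumed; move on to the next one.
                acc += token_len 
                i_new += 1 
                j_new = 0 
        
        # Include a partially consumed last subword so that the slice has
        # exactly one row per entry in `weights`.
        end = i_new + 1 if j_new > 0 else i_new 
        if end - i > 1: 
            probs = np.average(scores[i:end], axis=0, weights=weights if weighted else None) 
        else: 
            probs = scores[i] 
        i, j = i_new, j_new 
        
        pred_probs.append(probs) 
        labels.append(label)
        
    return np.array(pred_probs), labels 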

pred_probs_and_labels = [get_pred_probs_and_labels(sentence_probs[i], 
                                                   sentence_tokens[i], 
                                                   given_words[i], 
                                                   given_labels[i]) for i in range(n)] 
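
Another way to see which subwords each word draws on is to compare character offsets. The sketch below is an alternative illustration and not the implementation used above; the helper name `align_by_offsets` is hypothetical, and it assumes that the words and the subwords spell out the same character sequence once joined.

```python
def align_by_offsets(words, subwords):
    """Return, for each word, the indices of the subwords it overlaps."""
    def spans(pieces):
        # Character span (start, end) of each piece in the concatenated string.
        out, start = [], 0
        for piece in pieces:
            out.append((start, start + len(piece)))
            start += len(piece)
        return out

    sub_spans = spans(subwords)
    alignment = []
    for w_start, w_end in spans(words):
        # Two half-open spans overlap when each starts before the other ends.
        alignment.append([k for k, (s_start, s_end) in enumerate(sub_spans)
                          if s_start < w_end and w_start < s_end])
    return alignment

words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')']
subwords = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')']
alignment = align_by_offsets(words, subwords)
print(alignment)
```

For the `(MIT)` example, `(` maps to the subword `(M` alone and `MIT` maps to both `(M` and `IT`, which matches the assignment described above.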

## 6. Save `pred_probs` and `labels` 

Lastly, save the predictive probabilities and the given labels, which are flattened for easier storage. In addition, we save the number of word-level tokens in each sentence, which is crucial information that is lost once the arrays are flattened. 

In [14]:
pred_probs = [pred_prob for pred_prob, _ in pred_probs_and_labels] 
labels = [label for _, label in pred_probs_and_labels] 

pred_probs = np.array([prob for pred_prob in pred_probs for prob in pred_prob]) 
labels = np.array([l for label in labels for l in label]) 

sentence_lengths = np.array([len(tokens) for tokens in given_words]) 

np.save('pred_probs.npy', pred_probs) 
np.save('labels.npy', labels) 
np.save('sentence_lengths.npy', sentence_lengths) 
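
The sentence lengths are what make the flattening reversible. As a minimal sketch, the example below rebuilds per-sentence arrays from flattened ones using `np.split`; the small arrays are stand-ins made up for illustration, and in practice they would be loaded from the saved files with `np.load`.

```python
import numpy as np

# Tiny stand-ins for the saved arrays: 2 sentences with 3 and 2 words, 5 classes.
sentence_lengths = np.array([3, 2])
pred_probs = np.arange(25, dtype=float).reshape(5, 5)
labels = np.array([0, 4, 0, 2, 0])

# Split points are the cumulative lengths, excluding the final total.
boundaries = np.cumsum(sentence_lengths)[:-1]
pred_probs_nested = np.split(pred_probs, boundaries)
labels_nested = np.split(labels, boundaries)

for probs, sentence_labels in zip(pred_probs_nested, labels_nested):
    print(probs.shape, sentence_labels)
```

Each recovered block has one row of probabilities per word-level token in the corresponding sentence.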