In [1]:
%%capture 
!pip install transformers tqdm 

In [2]:
# packages version: 
# numpy == 1.23.0 
# tqdm == 4.64.0 
# transformers == 4.20.1 

# Token Classification Label Error Detection - Part 1 

In this tutorial, we show how to retrieve the model-predicted probabilities and the given labels from an NLP token-classification dataset. These inputs are used to identify potential label issues in the dataset. 

**Overview of what we'll do in this tutorial:** 
- Use pre-trained HuggingFace models to get the predictive probabilities 
- Reduce subword-level tokens to word-level tokens

\* In most NLP literature, tokens typically refer to words or punctuation marks, while most HuggingFace tokenizers break longer words down into subwords. To avoid confusion, we refer to the given tokens as "word-level tokens" and to the tokens produced by the tokenizer as "subword-level tokens". 

## 1. Install the required dependencies 

Disable `TOKENIZERS_PARALLELISM` if multiple processors exist. 

In [3]:
# !pip install transformers tqdm 
import numpy as np
import string
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from utils import * 
from tqdm import tqdm 

import os 
if (os.cpu_count() or 1) > 1:  # os.cpu_count() may return None
    os.environ["TOKENIZERS_PARALLELISM"] = "false" 

## 2. Fetch the CONLL-2003 dataset 

In [4]:
%%capture 
!wget https://data.deepai.org/conll2003.zip && mkdir data 
!unzip conll2003.zip -d data/ && rm conll2003.zip 

The CONLL-2003 dataset uses the following format: 

`-DOCSTART- -X- -X- O` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of first sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

`[empty line]` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of second sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

In our example, we focus on the `ner_tags` (named-entity recognition tags), which include: 

| `ner_tags` |             Description              |
|:----------:|:------------------------------------:|
|     `O`    |      Other (not a named entity)      |
|  `B-MISC`  | Beginning of a miscellaneous entity  |
|  `I-MISC`  |         Miscellaneous entity         |
|   `B-PER`  |     Beginning of a person entity     |
|   `I-PER`  |            Person entity             |
|   `B-ORG`  | Beginning of an organization entity  |
|   `I-ORG`  |         Organization entity          |
|   `B-LOC`  |    Beginning of a location entity    |
|   `I-LOC`  |           Location entity            | 

For more information, see [here](https://paperswithcode.com/dataset/conll-2003). All-caps words are converted to lowercase except for the first character (e.g. `JAPAN` -> `Japan`), which discourages the tokenizer from breaking such words into multiple subwords. The `readfile` implementation is adapted from [here](https://github.com/kamalkraj/BERT-NER/blob/dev/run_ner.py#L92). 

In [5]:
filepath = 'data/test.txt'
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entity_map = {entity: i for i, entity in enumerate(entities)} 

def readfile(filepath, sep=' '): 
    # Read all lines up front so the file handle is closed promptly
    with open(filepath) as f:
        lines = f.readlines()
    
    data, sentence, label = [], [], []
    for line in lines:
        if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
            if len(sentence) > 0:
                data.append((sentence, label))
                sentence, label = [], []
            continue
        splits = line.split(sep) 
        word = splits[0]
        if len(word) > 0 and word[0].isalpha() and word.isupper():
            word = word[0] + word[1:].lower()
        sentence.append(word)
        # strip() drops the trailing newline and still works if the last line has none
        label.append(entity_map[splits[-1].strip()])

    if len(sentence) > 0:
        data.append((sentence, label))
        
    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 
    
    return given_words, given_labels 

given_words, given_labels = readfile(filepath) 
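
The casing rule applied inside `readfile` can be tried in isolation. The sketch below copies that condition into a standalone function; the name `normalize_caps` is hypothetical and is introduced here only for illustration.

```python
def normalize_caps(word):
    """Lowercase everything after the first character of an all-caps word.

    Mirrors the condition used in `readfile`: the word must be non-empty,
    start with a letter, and be entirely uppercase.
    """
    if len(word) > 0 and word[0].isalpha() and word.isupper():
        return word[0] + word[1:].lower()
    return word


for raw in ["JAPAN", "Japan", "1996-12-06", ","]:
    print(repr(raw), "->", repr(normalize_caps(raw)))
```

Words that are already mixed-case, numeric tokens, and punctuation pass through unchanged, since they fail at least one part of the condition.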

`given_words` and `given_labels` are nested lists, with one inner list per sentence. 

In [6]:
i = 0 

print('Word\t\tLabel\tEntity\n-------------------------------') 
for word, label in zip(given_words[i], given_labels[i]): 
    print('{:14s}{:3d}\t{:10s}'.format(word, label, entities[label])) 

Word		Label	Entity
-------------------------------
Soccer          0	O         
-               0	O         
Japan           7	B-LOC     
Get             0	O         
Lucky           0	O         
Win             0	O         
,               0	O         
China           3	B-PER     
In              0	O         
Surprise        0	O         
Defeat          0	O         
.               0	O         


Next, we join each list of words into a sentence, with some minor pre-processing for readability. Sentences that contain `#` or are at most one character long are removed. The `#` filter ensures that the symbol is not confused with the `##` marker that the tokenizer uses to flag subwords once the sentence is tokenized. 

In [7]:
sentences = list(map(get_sentence, given_words)) 

sentences, mask = filter_sentence(sentences) 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

In [8]:
print('Number of sentences: %d' % len(sentences)) 
print(sentences[0]) 

Number of sentences: 3449
Soccer - Japan Get Lucky Win, China In Surprise Defeat.
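
`get_sentence` and `filter_sentence` are imported from `utils` and are not shown in this tutorial. The following is a minimal sketch of helpers with the behavior described above, not the actual implementation: the names `join_words` and `filter_sentences` are hypothetical, and the spacing rule (most punctuation attaches to the preceding word, while `-` and `(` keep a leading space) is an assumption inferred from the printed example.

```python
import string


def join_words(words):
    """Join word-level tokens into one sentence string."""
    sentence = ""
    for word in words:
        # Assumption: '-' and '(' keep a leading space; other
        # punctuation is attached directly to the preceding word.
        if word not in string.punctuation or word in ("-", "("):
            sentence += " "
        sentence += word
    # Remove the space that would otherwise follow an opening parenthesis
    return sentence.replace("( ", "(").strip()


def filter_sentences(sentences):
    """Keep sentences longer than one character that contain no '#'."""
    mask = [len(s) > 1 and "#" not in s for s in sentences]
    kept = [s for s, keep in zip(sentences, mask) if keep]
    return kept, mask


words = ['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',',
         'China', 'In', 'Surprise', 'Defeat', '.']
print(join_words(words))
print(filter_sentences(["a", "ok then", "has # sign"]))
```

Returning the boolean mask alongside the kept sentences is what allows `given_words` and `given_labels` to be filtered consistently, as done in the cell above.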


## 3. Pre-trained model 

In this example, we use `dslim/bert-base-NER` as our NLP model and tokenizer. Note that most tokenizers break down sentences into subword-level tokens, which are units smaller than word-level tokens. Since we are only interested in label issues at a word level, we first need to know how each sentence is tokenized. 

In [9]:
model_name = 'dslim/bert-base-NER' 
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline(task="ner", model=model, tokenizer=tokenizer)

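# maps[i] is the merged class of entities[i] once the B-/I- prefixes are dropped:
# 0: O, 1: MISC, 2: PER, 3: ORG, 4: LOC
maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 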

For example, the following sentence: 

In [10]:
print(sentences[0]) 

Soccer - Japan Get Lucky Win, China In Surprise Defeat.


is tokenized into: 

In [11]:
token_ids = tokenizer(sentences[0])['input_ids'] 
tokens = [tokenizer.decode(token) for token in token_ids] 
print(tokens) 

['[CLS]', 'Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Sur', '##prise', 'De', '##fe', '##at', '.', '[SEP]']


`[CLS]` and `[SEP]` are special tokens that help the BERT model identify the start and end of each input, and the `##` prefix indicates that a token continues the preceding word. We strip the special tokens and the `##` markers here, and map the remaining subwords back to the original word-level tokens afterwards. 

In [12]:
tokens = [token.replace('#', '') for token in tokens][1:-1] 
print(tokens) 
sentence_tokens = [[tokenizer.decode(token) for token in tokenizer(sentence)['input_ids']] for sentence in sentences] 
sentence_tokens = [[token.replace('#', '') for token in sentence_token][1:-1] for sentence_token in sentence_tokens] 

['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Sur', 'prise', 'De', 'fe', 'at', '.']


## 4. Get predictive probabilities 

In this example, we are more interested in severe types of mislabels, such as `B-LOC` vs. `B-PER`, instead of `B-LOC` vs. `I-LOC`. Therefore, we discard the `B-` and `I-` prefixes, and get the model-predictive probabilities for each subword-level token. Set `merge_entities=False` if you do not wish to merge. 

In [13]:
merge_entities = True 

if merge_entities: 
    given_labels = [mapping(labels, maps) for labels in given_labels] 
    
sentence_probs = [get_probs(sentence, pipe, maps=maps) for sentence in tqdm(sentences)] 

100%|███████████████████████████████████████████████████████████████████████████████| 3449/3449 [01:01<00:00, 56.25it/s]
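
`mapping` and `get_probs` are imported from `utils` and are not shown in this tutorial. As a rough sketch of what merging the `B-` and `I-` classes could look like, the example below relabels with `maps` and sums the probability columns that share a merged class. Summing is an assumption rather than something the tutorial confirms, and the names `merge_labels` and `merge_probs` are hypothetical.

```python
import numpy as np

# Same merge table as in the tutorial: 0: O, 1: MISC, 2: PER, 3: ORG, 4: LOC
maps = [0, 1, 1, 2, 2, 3, 3, 4, 4]


def merge_labels(labels, maps):
    """Replace each 9-class label index with its merged class index."""
    return [maps[label] for label in labels]


def merge_probs(probs, maps):
    """Sum the probability columns that map to the same merged class."""
    probs = np.asarray(probs, dtype=float)
    merged = np.zeros((probs.shape[0], max(maps) + 1))
    for old_class, new_class in enumerate(maps):
        merged[:, new_class] += probs[:, old_class]
    return merged


# One token whose probability mass is split between B-LOC and I-LOC
token_probs = np.array([[0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.3]])
print(merge_labels([0, 7, 3], maps))
print(merge_probs(token_probs, maps))
```

Under this scheme each merged row still sums to one whenever the original row does, so the result remains a valid probability distribution over the five merged classes.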


## 5. Reducing from subword-level to word-level 

Reduce subword-level tokens to word-level tokens. Here, we show an example of how the reduction is implemented. Consider the following tokenization: 

In [14]:
print('Sentence:\t' + sentences[0]) 
print('Given words:\t' + str(given_words[0])) 
print('Subwords:\t' + str(sentence_tokens[0])) 

Sentence:	Soccer - Japan Get Lucky Win, China In Surprise Defeat.
Given words:	['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Surprise', 'Defeat', '.']
Subwords:	['Soccer', '-', 'Japan', 'Get', 'Lucky', 'Win', ',', 'China', 'In', 'Sur', 'prise', 'De', 'fe', 'at', '.']


The token `Surprise` is tokenized into two subwords `Sur` and `prise`. In this case, we assign the average predictive probabilities of the two subwords to the token. Alternatively, we can take the weighted average, such that the weight for each predictive probability is proportional to the length of its corresponding subword. This is to ensure that longer subwords have heavier weights on the average predictive probabilities, although the benefits are insignificant for most datasets. 
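
To make the two averaging options concrete, here is a small worked example for `Sur` and `prise`. The probability values are made up for illustration, and only two classes are used to keep the arithmetic easy to follow.

```python
import numpy as np

subwords = ["Sur", "prise"]
# Made-up predicted probabilities, one row per subword, two classes
subword_probs = np.array([[0.9, 0.1],
                          [0.6, 0.4]])

# Plain average: every subword counts equally
simple_avg = subword_probs.mean(axis=0)

# Weighted average: each subword is weighted by its character length
lengths = np.array([len(s) for s in subwords])  # [3, 5]
weighted_avg = np.average(subword_probs, axis=0, weights=lengths)

print(simple_avg)    # (0.9 + 0.6) / 2 and (0.1 + 0.4) / 2
print(weighted_avg)  # (3*0.9 + 5*0.6) / 8 and (3*0.1 + 5*0.4) / 8
```

With these numbers the plain average is `[0.75, 0.25]` and the length-weighted average is `[0.7125, 0.2875]`, so the longer subword `prise` pulls the result slightly toward its own prediction.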

Each tokenizer tokenizes sentences differently. In some rare cases, a subword may overlap two tokens, resulting in a misalignment in tokenization. For example, consider the following tokenization: 

In [15]:
demo_sentence = 'Massachusetts Institute of Technology (MIT)' 
demo_given_words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')'] 
demo_subwords = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')'] 

print('Sentence:\t' + demo_sentence) 
print('Given words:\t' + str(demo_given_words)) 
print('Subwords:\t' + str(demo_subwords)) 

Sentence:	Massachusetts Institute of Technology (MIT)
Given words:	['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')']
Subwords:	['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')']


In this case, we assign the predictive probabilities of `(M` to `(`, and the average predictive probabilities of `(M` and `IT` to `MIT`. 

In [16]:
pred_probs_and_labels = [get_pred_probs_and_labels(sentence_probs[i], 
                                                   sentence_tokens[i], 
                                                   given_words[i], 
                                                   given_labels[i]) for i in range(len(sentences))] 
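
`get_pred_probs_and_labels` comes from `utils` and is not shown here. The overlap rule described above can be sketched as follows: compute the character span of every subword and every word in the space-free text, then average the probabilities of all subwords that overlap each word. This is a simplified illustration under the assumption that the concatenated subwords spell out exactly the same characters as the concatenated words; the function name `reduce_to_words` is hypothetical.

```python
import numpy as np


def reduce_to_words(subwords, subword_probs, words):
    """Average the probabilities of every subword that overlaps each word."""
    subword_probs = np.asarray(subword_probs, dtype=float)

    # Character span [start, end) of each subword in the space-free text
    spans, pos = [], 0
    for subword in subwords:
        spans.append((pos, pos + len(subword)))
        pos += len(subword)

    word_probs, pos = [], 0
    for word in words:
        start, end = pos, pos + len(word)
        pos = end
        # A subword overlaps the word if their character spans intersect
        overlapping = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
        word_probs.append(subword_probs[overlapping].mean(axis=0))
    return np.array(word_probs)


demo_words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')']
demo_tokens = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')']
# Made-up two-class probabilities, one row per subword
demo_probs = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [1, 0]]

word_level = reduce_to_words(demo_tokens, demo_probs, demo_words)
print(word_level[4])  # '(' overlaps only '(M'
print(word_level[5])  # 'MIT' overlaps both '(M' and 'IT'
```

In this sketch `(` inherits the probabilities of `(M` unchanged, while `MIT` receives the average of `(M` and `IT`, matching the assignment described above.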

## 6. Save `pred_probs` and `labels` 

Lastly, save the predictive probabilities and the given labels, which are flattened for easier storage. In addition, we save the number of word-level tokens in each sentence, which is crucial information that is lost once they are flattened. 

In [17]:
pred_probs = [pred_prob for pred_prob, _ in pred_probs_and_labels] 
labels = [label for _, label in pred_probs_and_labels] 

pred_probs_dict = to_dict(pred_probs) 
labels_dict = to_dict(labels) 

np.savez('pred_probs.npz', **pred_probs_dict) 
np.savez('labels.npz', **labels_dict) 
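
`to_dict` is defined in `utils` and is not shown in this tutorial. One plausible reading, sketched below as an assumption, is that it keys each sentence's array by its index so that `np.savez` can store arrays of different lengths in one archive. The helper name `to_dict_sketch`, the demo data, and the temporary file path are all hypothetical.

```python
import os
import tempfile

import numpy as np


def to_dict_sketch(arrays):
    """Key each per-sentence array by its index, as a string."""
    return {str(i): np.asarray(a) for i, a in enumerate(arrays)}


# Two made-up sentences with 2 and 1 word-level tokens, two classes each
pred_probs_demo = [np.array([[0.9, 0.1], [0.2, 0.8]]),
                   np.array([[0.5, 0.5]])]

path = os.path.join(tempfile.mkdtemp(), "pred_probs_demo.npz")
np.savez(path, **to_dict_sketch(pred_probs_demo))

# Load the archive back and recover the per-sentence structure
with np.load(path) as archive:
    restored = [archive[str(i)] for i in range(len(archive.files))]

sentence_lengths = [len(p) for p in restored]
print(sentence_lengths)
```

Storing one array per sentence keeps the number of word-level tokens in each sentence recoverable after loading, which is the information the flattened form would otherwise lose.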