In [1]:
%%capture 
!pip install transformers tqdm 

In [2]:
# packages version: 
# numpy == 1.23.0 
# tqdm == 4.64.0 
# transformers == 4.20.1 

# Token Classification Label Error Detection - Part 1 

In this tutorial, we show how to obtain the model-predicted probabilities and labels for an NLP token-classification dataset. These are the inputs used to identify potential label issues in the dataset. 

**Overview of what we'll do in this tutorial:** 
- Use pre-trained HuggingFace models to get the predictive probabilities 
- Reduce subword-level tokens to word-level tokens

\* In most NLP literature, tokens typically refer to words or punctuation marks, whereas most HuggingFace tokenizers break longer words into subwords. To avoid confusion, we refer to the given tokens as "word-level tokens" and to the tokens produced by the tokenizer as "subword-level tokens". 

## 1. Install the required dependencies 

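We import the required packages and set `TOKENIZERS_PARALLELISM` to `false` when more than one CPU is available, so that the tokenizer does not attempt its own parallelism. 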

In [3]:
# !pip install transformers tqdm 
import numpy as np
import string
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from utils import * 
from tqdm import tqdm 
from bert import Ner 
from sklearn.metrics import balanced_accuracy_score 

import os 
if (os.cpu_count() or 1) > 1:  # os.cpu_count() can return None 
    os.environ["TOKENIZERS_PARALLELISM"] = "false" 
    
# import warnings
# warnings.filterwarnings('ignore')

## 2. Fetch the CoNLL-2003 dataset 

In [4]:
%%capture 
# !wget https://data.deepai.org/conll2003.zip && mkdir data 
# !unzip conll2003.zip -d data/ && rm conll2003.zip 

The CoNLL-2003 dataset is in the following format: 

`-DOCSTART- -X- -X- O` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of first sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

`[empty line]` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of second sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

In our example, we focus on the `ner_tags` (named-entity recognition tags), which include: 

| `ner_tags` |             Description              |
|:----------:|:------------------------------------:|
|     `O`    |      Other (not a named entity)      |
|  `B-MISC`  | Beginning of a miscellaneous entity  |
|  `I-MISC`  |         Miscellaneous entity         |
|   `B-PER`  |     Beginning of a person entity     |
|   `I-PER`  |            Person entity             |
|   `B-ORG`  | Beginning of an organization entity  |
|   `I-ORG`  |         Organization entity          |
|   `B-LOC`  |    Beginning of a location entity    |
|   `I-LOC`  |           Location entity            | 

For more information, see [here](https://paperswithcode.com/dataset/conll-2003). In this tutorial, all-caps words are converted to lowercase except for the first character (e.g. `JAPAN` -> `Japan`), which discourages the tokenizer from breaking such words into multiple subwords. The `readfile` implementation is adapted from [here](https://github.com/kamalkraj/BERT-NER/blob/dev/run_ner.py#L92). 

In [5]:
filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] 
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entity_map = {entity: i for i, entity in enumerate(entities)} 

def readfile(filepath, sep=' '): 
    # Read all lines up front so that the file handle is closed promptly
    with open(filepath) as f:
        lines = f.readlines()
    
    data, sentence, label = [], [], []
    for line in lines:
        if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
            if len(sentence) > 0:
                data.append((sentence, label))
                sentence, label = [], []
            continue
        splits = line.split(sep) 
        word = splits[0]
        if len(word) > 0 and word[0].isalpha() and word.isupper():
            word = word[0] + word[1:].lower()
        sentence.append(word)
        # strip() removes the trailing newline (and is safe if the last line has none)
        label.append(entity_map[splits[-1].strip()])

    if len(sentence) > 0:
        data.append((sentence, label))
        
    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 
    
    return given_words, given_labels 

given_words, given_labels = [], [] 

for filepath in filepaths: 
    words, labels = readfile(filepath) 
    given_words.extend(words) 
    given_labels.extend(labels)

`given_words` and `given_labels` are nested lists: one inner list per sentence, with one entry per word-level token. 

In [6]:
i = 0 

print('Word\t\tLabel\tEntity\n-------------------------------') 
for word, label in zip(given_words[i], given_labels[i]): 
    print('{:14s}{:3d}\t{:10s}'.format(word, label, entities[label])) 

Word		Label	Entity
-------------------------------
Eu              5	B-ORG     
rejects         0	O         
German          1	B-MISC    
call            0	O         
to              0	O         
boycott         0	O         
British         1	B-MISC    
lamb            0	O         
.               0	O         


Next, we build the sentences with some minor pre-processing for readability. Sentences that contain `#` or are at most one character long are removed. The first condition ensures that a literal `#` is not confused with the `#` characters that mark subwords after the sentence has been tokenized. 

In [7]:
sentences = list(map(get_sentence, given_words)) 

sentences, mask = filter_sentence(sentences) 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 
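
`get_sentence` and `filter_sentence` are imported from `utils` and are not shown in this notebook. The following is only a rough sketch of what such helpers might look like, based on the description above; the names `join_words` and `filter_short_or_hash` are hypothetical, and the real helpers may handle punctuation differently.

```python
import string

def join_words(words):
    """Join word-level tokens into a readable sentence (hypothetical helper)."""
    sentence = ""
    for word in words:
        # Attach punctuation directly to the preceding word; otherwise add a space.
        if sentence and word not in string.punctuation:
            sentence += " "
        sentence += word
    return sentence

def filter_short_or_hash(sentences):
    """Keep sentences longer than one character that contain no '#'.

    Returns the kept sentences and a boolean mask over the input, so that
    parallel lists (words, labels) can be filtered the same way.
    """
    mask = [len(s) > 1 and "#" not in s for s in sentences]
    kept = [s for s, m in zip(sentences, mask) if m]
    return kept, mask

example = join_words(['Eu', 'rejects', 'German', 'call', 'to',
                      'boycott', 'British', 'lamb', '.'])
kept, mask = filter_short_or_hash([example, ".", "a#b"])
```

Note that this simple joining rule also glues opening brackets to the preceding word, which a more careful implementation would avoid.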

In [8]:
print('Number of sentences: %d' % len(sentences)) 
print(sentences[0]) 

Number of sentences: 20718
Eu rejects German call to boycott British lamb.


## 3. Train models using cross-validation 

To identify potential label errors in the training dataset, we compute out-of-sample predicted probabilities using cross-validation. We first partition the dataset into k folds: 

In [9]:
lines = [[]] 
for filepath in filepaths: 
    with open(filepath) as f:  # close each file once it has been read 
        for line in f: 
            if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                if len(lines[-1]) > 0: 
                    lines.append([]) 
            else: 
                lines[-1].append(line) 
        
lines = lines[:-1] 
lines = [line for m, line in zip(mask, lines) if m] 

k = 5 
indicies = create_folds(lines, k=k) 

'folds/' already exists, skipping...
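
`create_folds` comes from `utils` and, judging by the output above, also writes the folds to disk. As a minimal sketch of just the index-partitioning step, assuming a shuffled split into near-equal parts (the function name `split_into_folds` and the `seed` parameter are hypothetical):

```python
import numpy as np

def split_into_folds(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # np.array_split allows folds whose sizes differ by at most one
    return [fold.tolist() for fold in np.array_split(shuffled, k)]

folds = split_into_folds(20718, k=5)
fold_sizes = [len(fold) for fold in folds]
```

Each sample index appears in exactly one fold. The model for fold `i` is then trained on all other folds and used to predict only the samples in fold `i`.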


We train one model for each training/testing pair: 

In [10]:
for i in range(k): 
    if os.path.exists('folds/fold%d/model/' % i): 
        print('Model %d already exists, skipping...' % i) 
    else: 
        os.system(
            "python3 run_ner.py --data_dir=folds/fold%d --bert_model=bert-base-cased " \
            "--task_name=ner --output_dir=folds/fold%d/model --max_seq_length=256 " \
            "--do_train --num_train_epochs 10 --warmup_proportion=0.1" % (i, i)) 

Model 0 already exists, skipping...
Model 1 already exists, skipping...
Model 2 already exists, skipping...
Model 3 already exists, skipping...
Model 4 already exists, skipping...


## 4. Compute out-of-sample predicted probabilities 

We obtain the predicted probabilities for each sample using the model for which that sample was held out of training. Note that most tokenizers break sentences down into subword-level tokens, which are units smaller than word-level tokens. For example, the following sentence: 

In [11]:
i = 0 
print(sentences[i]) 

Eu rejects German call to boycott British lamb.


is tokenized into: 

In [12]:
model = Ner("folds/fold0/model/") 
tokens = model.tokenize(sentences[i])[0] 
print(tokens) 
tokens = [token.replace('#', '') for token in tokens] 

['E', '##u', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.']


`##` indicates that the token continues the preceding subword within the same word. We strip these markers and manually map the subwords back to the original tokens afterwards. We collect both the predicted probabilities and the tokens for each sentence. 

In [13]:
sentence_tokens, sentence_probs = {}, {} 
for i in range(k): 
    model = Ner("folds/fold%d/model/" % i) 
    for index in tqdm(indicies[i]): 
        sentence_probs[index], sentence_tokens[index] = model.predict(sentences[index])  
        
sentence_tokens = [[token.replace('#', '') for token in sentence_tokens[i]] for i in range(len(sentences))] 
sentence_probs = [sentence_probs[i] for i in range(len(sentences))] 

100%|██████████| 4144/4144 [01:14<00:00, 55.86it/s]
100%|██████████| 4144/4144 [01:14<00:00, 55.54it/s]
100%|██████████| 4144/4144 [01:13<00:00, 56.07it/s]
100%|██████████| 4143/4143 [01:14<00:00, 55.93it/s]
100%|██████████| 4143/4143 [01:14<00:00, 55.74it/s]
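
The notebook strips the `#` characters and aligns subwords to words by their characters later on. As an aside, the `##` prefix itself can also be used to regroup WordPiece tokens into words. The sketch below illustrates that alternative; `group_wordpieces` is a hypothetical helper, not part of `utils` or `bert`.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens so that each '##' piece joins the previous group."""
    groups = []
    for token in tokens:
        if token.startswith("##") and groups:
            groups[-1].append(token[2:])  # drop the '##' continuation marker
        else:
            groups.append([token])
    return groups

tokens = ['E', '##u', 'rejects', 'German', 'call', 'to',
          'boycott', 'British', 'la', '##mb', '.']
words = ["".join(group) for group in group_wordpieces(tokens)]
```

This only recovers the given words when the tokenizer's word boundaries agree with the dataset's, which is not guaranteed in general.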


In this example, we are more interested in severe mislabels, such as `B-LOC` vs. `B-PER`, than in mild ones, such as `B-LOC` vs. `I-LOC`. Therefore, we discard the `B-` and `I-` prefixes and merge the model's predicted probabilities accordingly for each subword-level token. 

In [14]:
given_maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
model_maps = [-1, 0, 1, 1, 2, 2, 3, 3, 4, 4, -1, -1] 
given_labels = [mapping(labels, maps=given_maps) for labels in given_labels] 
sentence_probs = [merge_probs(pred_prob, maps=model_maps) for pred_prob in sentence_probs] 
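
In the cell above, each position in `given_maps` and `model_maps` appears to give the merged class index for the corresponding original class, with `-1` seemingly marking model outputs that have no counterpart among the given labels. `mapping` and `merge_probs` live in `utils` and are not shown; the sketch below is one plausible way to merge probability columns under that reading. The name `merge_class_probs` and the renormalization step are assumptions.

```python
import numpy as np

def merge_class_probs(probs, maps):
    """Sum the probability columns that map to the same merged class.

    probs: array of shape (n_tokens, n_original_classes)
    maps:  maps[j] is the merged class of original class j, or -1 to drop it
    """
    probs = np.asarray(probs, dtype=float)
    merged = np.zeros((probs.shape[0], max(maps) + 1))
    for column, target in enumerate(maps):
        if target >= 0:  # -1 marks columns to discard
            merged[:, target] += probs[:, column]
    # Renormalize so that each row sums to one after dropping columns
    # (assumes every row keeps some probability mass)
    return merged / merged.sum(axis=1, keepdims=True)

# Made-up probabilities for one token over four original classes
merged = merge_class_probs([[0.1, 0.2, 0.3, 0.4]], maps=[-1, 0, 1, 1])
```

Here the first column is dropped, the last two are summed, and the row is rescaled from a total of 0.9 back to 1.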

## 5. Reducing from subword-level to word-level 

When a sentence gets tokenized, each word-level token may be broken down into subword-level tokens, each of which generates a predictive probability. Given that we only have the labels for word-level tokens, we reduce the subword-level tokens to word-level tokens. 

\* For this example, most of the subword-to-word reduction is handled internally by the model, but for most other models it has to be done manually. Below, we show our method of reduction, which differs slightly from how the `bert` model does it. See [here](https://github.com/kamalkraj/BERT-NER/blob/dev/bert.py#L85) for more info. 

In [15]:
print('Sentence:\t' + sentences[0]) 
print('Given words:\t' + str(given_words[0])) 
print('Subwords:\t' + str(tokens)) 

Sentence:	Eu rejects German call to boycott British lamb.
Given words:	['Eu', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
Subwords:	['E', 'u', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', 'mb', '.']


The word `lamb` is tokenized into two subwords, `la` and `mb`. In this case, we assign the average of the two subwords' predicted probabilities to the word-level token. Alternatively, we can take a weighted average in which each subword's weight is proportional to its length, so that longer subwords contribute more to the average, although the benefit is insignificant for most datasets. 
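
As a small illustration of the two averaging options, here is a sketch using `np.average`. The helper name `average_subword_probs` and the probability values are made up for the example.

```python
import numpy as np

def average_subword_probs(subword_probs, subwords, weighted=False):
    """Reduce the probabilities of one word's subwords to a single row.

    With weighted=True, each subword is weighted by its character length.
    """
    subword_probs = np.asarray(subword_probs, dtype=float)
    weights = [len(s) for s in subwords] if weighted else None
    return np.average(subword_probs, axis=0, weights=weights)

# Two subwords, two classes (illustrative values only)
probs = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

plain = average_subword_probs(probs, ['la', 'mb'])
# Hypothetical split with unequal lengths, giving weights 5 and 1
length_weighted = average_subword_probs(probs, ['token', 's'], weighted=True)
```

For `la` and `mb` the two options coincide, since both subwords have the same length.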

Each tokenizer tokenizes sentences differently. In rare cases, a subword may straddle two word-level tokens, so the subwords do not align with the given words. For example, consider the following tokenization: 

In [16]:
demo_sentence = 'Massachusetts Institute of Technology (MIT)' 
demo_given_words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')'] 
demo_subwords = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')'] 

print('Sentence:\t' + demo_sentence) 
print('Given words:\t' + str(demo_given_words)) 
print('Subwords:\t' + str(demo_subwords)) 

Sentence:	Massachusetts Institute of Technology (MIT)
Given words:	['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')']
Subwords:	['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')']


In this case, we assign the predictive probabilities of `(M` to `(`, and the average predictive probabilities of `(M` and `IT` to `MIT`. 
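
The assignment rule described above can be sketched with character offsets: each word receives the average of all subwords whose character span overlaps its own. This is an illustrative re-implementation under stated assumptions, not the code of `get_pred_probs_and_labels`; `char_spans` and `reduce_to_words` are hypothetical names, and the sketch assumes non-empty words and that the subwords spell exactly the same characters as the given words.

```python
import numpy as np

def char_spans(pieces):
    """Return (start, end) character offsets of each piece in the concatenation."""
    spans, start = [], 0
    for piece in pieces:
        spans.append((start, start + len(piece)))
        start += len(piece)
    return spans

def reduce_to_words(subword_probs, subwords, words):
    """Average, for each word, the rows of every subword overlapping it."""
    subword_probs = np.asarray(subword_probs, dtype=float)
    if "".join(subwords) != "".join(words):
        raise ValueError("subwords and words must spell the same string")
    sub_spans = char_spans(subwords)
    word_probs = []
    for w_start, w_end in char_spans(words):
        overlapping = [i for i, (s_start, s_end) in enumerate(sub_spans)
                       if s_start < w_end and w_start < s_end]
        word_probs.append(subword_probs[overlapping].mean(axis=0))
    return np.array(word_probs)

words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')']
subwords = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')']
probs = np.eye(len(subwords))  # one distinct dummy row per subword
word_level = reduce_to_words(probs, subwords, words)
```

With these dummy rows, `(` inherits the row of `(M`, and `MIT` gets the mean of the rows of `(M` and `IT`. Tokenizers that emit unknown-token placeholders would violate the spelling assumption and need extra handling.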

In [17]:
pred_probs_and_labels = [get_pred_probs_and_labels(sentence_probs[i], 
                                                   sentence_tokens[i], 
                                                   given_words[i], 
                                                   given_labels[i]) for i in range(len(sentences))] 

## 6. Save `pred_probs` and `labels` 

Next, we save the predicted probabilities and the given labels, which are flattened for easier storage. We also save the number of word-level tokens in each sentence, which is crucial information that is lost once the arrays are flattened. 

In [18]:
pred_probs = [pred_prob for pred_prob, _ in pred_probs_and_labels] 
labels = [label for _, label in pred_probs_and_labels] 

pred_probs_dict = to_dict(pred_probs) 
labels_dict = to_dict(labels) 

np.savez('pred_probs.npz', **pred_probs_dict) 
np.savez('labels.npz', **labels_dict) 
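
The layout produced by `to_dict` is defined in `utils` and not shown here. To illustrate why the per-sentence lengths matter, the sketch below flattens a list of per-sentence arrays, saves it together with the lengths, and restores the nested structure. The helper names and the `flat`/`lengths` keys are assumptions made for this example, and an in-memory buffer stands in for a file on disk.

```python
import io
import numpy as np

def flatten_with_lengths(arrays):
    """Stack per-sentence arrays and record how many rows each contributed."""
    lengths = np.array([len(a) for a in arrays])
    return np.concatenate(arrays, axis=0), lengths

def unflatten(flat, lengths):
    """Split a flattened array back into per-sentence arrays."""
    return np.split(flat, np.cumsum(lengths)[:-1])

# Two sentences with 2 and 1 word-level tokens (illustrative values only)
per_sentence = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.6, 0.4]])]
flat, lengths = flatten_with_lengths(per_sentence)

buffer = io.BytesIO()  # in-memory stand-in for an .npz file
np.savez(buffer, flat=flat, lengths=lengths)
buffer.seek(0)
loaded = np.load(buffer)
restored = unflatten(loaded["flat"], loaded["lengths"])
```

Without `lengths`, the flattened array alone could not tell where one sentence ends and the next begins.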

## 7. Model evaluation  

Finally, we evaluate the model. We use the definitions of precision and recall introduced by CoNLL-2003: 

> *“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”*

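See [here](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/) for more info. Note that the cell below applies these definitions to individual word-level tokens, with the `B-` and `I-` prefixes merged, rather than to exact-match entity spans. 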

In [19]:
predictions = [pred_prob.argmax(axis=1) for pred_prob in pred_probs] 
predictions_flatten = [pred for prediction in predictions for pred in prediction] 
given_labels_flatten = [label for given_label in given_labels for label in given_label] 

counts = [0, 0, 0, 0] 
correct = 0 

for truth, prediction in zip(given_labels_flatten, predictions_flatten): 
    if truth != 0: 
        if truth == prediction: 
            counts[0] += 1 
        counts[1] += 1 
    if prediction != 0: 
        if truth == prediction: 
            counts[2] += 1 
        counts[3] += 1 
    if truth == prediction: 
        correct += 1 
        
precision = counts[2] / counts[3] 
recall = counts[0] / counts[1] 
f1 = 2 * precision * recall / (precision + recall) 
accuracy = correct / len(given_labels_flatten) 

balanced_accuracy = balanced_accuracy_score(given_labels_flatten, predictions_flatten) 

print('Precision\t\t%.3f\nRecall\t\t\t%.3f\nf1-score\t\t%.3f\nAccuracy\t\t%.3f\nBalanced Accuracy\t%.3f' % 
     (precision, recall, f1, accuracy, balanced_accuracy)) 

Precision		0.911
Recall			0.916
f1-score		0.913
Accuracy		0.976
Balanced Accuracy	0.923
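
The metrics above are computed per word-level token. For comparison, the exact-match definition quoted earlier can be sketched on BIO-tagged sequences as follows. This is a simplified illustration, not the official CoNLL evaluation script; `extract_entities` and `entity_prf` are hypothetical names, and an `I-` tag that does not continue an entity of the same type is leniently treated as starting a new one.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from BIO tags; end is exclusive."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return entities

def entity_prf(true_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    n_correct = len(true_entities & pred_entities)
    precision = n_correct / len(pred_entities) if pred_entities else 0.0
    recall = n_correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1

true_tags = ["B-ORG", "O", "B-MISC", "O", "B-PER", "I-PER"]
pred_tags = ["B-ORG", "O", "B-MISC", "O", "B-PER", "O"]
precision, recall, f1 = entity_prf(true_tags, pred_tags)
```

In this toy example the predicted person entity is one token too short, so it counts as wrong under exact matching even though its first token is tagged correctly.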
