# Training an Entity Recognition Model for Token Classification Tutorial

In this tutorial, we show how to obtain out-of-sample model-predicted probabilities and labels for an NLP token-classification dataset. These outputs can then be used to identify potential label issues in the dataset, as demonstrated [here](https://cleanlab.ai/). The specific token classification task we consider is Named Entity Recognition with the CoNLL-2003 dataset, and we train a Transformer network for this task using the HuggingFace transformers library.

**Overview of what we'll do in this tutorial:** 
- Read and process datasets in the CoNLL format 
- Compute out-of-sample predicted probabilities via cross-validation 
- Reduce subword-level tokens to word-level tokens 

\* In most of the NLP literature, tokens typically refer to words or punctuation marks, while most modern tokenizers break longer words down into subwords. To avoid confusion, we refer to the given tokens as "word-level tokens": these are the individual tokens in the original dataset, each with its own class label. We refer to the tokens produced by the tokenizer as "subword-level tokens". 

## 1. Install the required dependencies 

You can use `pip` to install all packages required for this tutorial as follows: 

    !pip install tqdm transformers 
    !pip install cleanlab 

In [None]:
!wget -nc https://data.deepai.org/conll2003.zip && mkdir data 
!unzip conll2003.zip -d data/ && rm conll2003.zip 

In [None]:
# Package installation (hidden on docs website).
# Package versions used: tqdm==4.64.0 transformers==4.22.0.dev0 cleanlab==2.0.0 numpy==1.23.0 sklearn==0.0 
# (cont): torch==1.12.0 nltk==3.7 pytorch_transformers==1.2.0 seqeval==1.2.2 
# ericwang/cleanlab -b token_classification for now 

dependencies = ["tqdm", "transformers", "cleanlab", "sklearn", "torch", "nltk", "pytorch_transformers", "seqeval"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

In [1]:
# !pip install transformers tqdm 
import numpy as np
import string
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from cleanlab.internal.token_classification_utils import * 
from tqdm import tqdm 
from bert import Ner 
from token_classification_tutorial_utils import * 
from sklearn.metrics import balanced_accuracy_score 

# Disable `TOKENIZERS_PARALLELISM` if multiple processors exist. 
import os 
if os.cpu_count() > 1: 
    os.environ["TOKENIZERS_PARALLELISM"] = "false" 

## 2. Fetch the CoNLL-2003 dataset 

The CoNLL-2003 dataset is in the following format: 

`-DOCSTART- -X- -X- O` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of first sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

`[empty line]` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of second sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

In our example, we focus on the `ner_tags` (named-entity recognition tags), which include: 

| `ner_tags` |             Description              |
|:----------:|:------------------------------------:|
|     `O`    |      Other (not a named entity)      |
|  `B-MISC`  | Beginning of a miscellaneous entity  |
|  `I-MISC`  |         Miscellaneous entity         |
|   `B-PER`  |     Beginning of a person entity     |
|   `I-PER`  |            Person entity             |
|   `B-ORG`  | Beginning of an organization entity  |
|   `I-ORG`  |         Organization entity          |
|   `B-LOC`  |    Beginning of a location entity    |
|   `I-LOC`  |           Location entity            | 

For more information, see [here](https://paperswithcode.com/dataset/conll-2003). Here, all-caps words are converted to lowercase except for the first character (e.g. `JAPAN` -> `Japan`). This is to discourage the tokenizer from breaking such words into multiple subwords. The `readfile` implementation is adapted from [here](https://github.com/kamalkraj/BERT-NER/blob/dev/run_ner.py#L92). 
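The casing rule described above can be sketched in isolation. The helper name `normalize_all_caps` is hypothetical (it is not part of the tutorial code or of any library); it only mirrors the condition used inside `readfile`.

```python
def normalize_all_caps(word):
    """Lowercase everything after the first character of an all-caps word.

    Hypothetical helper for illustration only; it uses the same condition
    as the `readfile` function in this tutorial.
    """
    if len(word) > 0 and word[0].isalpha() and word.isupper():
        return word[0] + word[1:].lower()
    return word


for raw in ["JAPAN", "EU", "Japan", "1996-08-22", "U.S."]:
    print(raw, "->", normalize_all_caps(raw))
```

Note that the rule is purely character-based, so an abbreviation such as `U.S.` becomes `U.s.`, while words that do not start with a letter are left unchanged.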

In [2]:
filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] 
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entity_map = {entity: i for i, entity in enumerate(entities)} 

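def readfile(filepath, sep=' '): 
    """Read a CoNLL-format file and return per-sentence lists of words and integer labels."""
    data, sentence, label = [], [], []
    with open(filepath) as lines:  # ensure the file is closed after reading
        for line in lines:
            # Document markers and blank lines end the current sentence.
            if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                if len(sentence) > 0:
                    data.append((sentence, label))
                    sentence, label = [], []
                continue
            # Strip only the trailing newline, so a final line without one still parses.
            splits = line.rstrip('\n').split(sep) 
            word = splits[0]
            # Convert all-caps words to capitalized form (e.g. JAPAN -> Japan).
            if len(word) > 0 and word[0].isalpha() and word.isupper():
                word = word[0] + word[1:].lower()
            sentence.append(word)
            label.append(entity_map[splits[-1]])

    if len(sentence) > 0:
        data.append((sentence, label))

    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 

    return given_words, given_labels 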

given_words, given_labels = [], [] 

for filepath in filepaths: 
    words, labels = readfile(filepath) 
    given_words.extend(words) 
    given_labels.extend(labels)

`given_words` and `given_labels` are in nested list format. 

In [3]:
i = 0 

print('Word\t\tLabel\tEntity\n-------------------------------') 
for word, label in zip(given_words[i], given_labels[i]): 
    print('{:14s}{:3d}\t{:10s}'.format(word, label, entities[label])) 

Word		Label	Entity
-------------------------------
Eu              5	B-ORG     
rejects         0	O         
German          1	B-MISC    
call            0	O         
to              0	O         
boycott         0	O         
British         1	B-MISC    
lamb            0	O         
.               0	O         


Next, we obtain the sentences with some minor pre-processing for readability. Sentences containing the `#` character are removed for simplicity, because this special character is later used by our tokenizer to mark subword-level tokens (see Section 4 for more details). Additionally, sentences with `len(sentence) <= 1` are removed. 

In [4]:
sentences = list(map(get_sentence, given_words)) 

sentences, mask = filter_sentence(sentences) 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

In [5]:
print('Number of sentences: %d' % len(sentences)) 
print(sentences[0]) 

Number of sentences: 20718
Eu rejects German call to boycott British lamb.
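For readers without the helper modules at hand, the pre-processing above can be approximated as follows. This is a simplified sketch, not the implementation of `get_sentence` or `filter_sentence`: the names `join_words_sketch` and `filter_sentences_sketch` are hypothetical, and the punctuation handling is deliberately minimal (a punctuation mark is simply attached to the preceding word).

```python
import string

PUNCTUATION = set(string.punctuation)


def join_words_sketch(words):
    """Join word-level tokens into a sentence, attaching punctuation to the previous word."""
    sentence = ""
    for word in words:
        if sentence and word not in PUNCTUATION:
            sentence += " "
        sentence += word
    return sentence


def filter_sentences_sketch(sentences):
    """Keep sentences longer than one character that do not contain '#'.

    Returns the kept sentences and a boolean mask over the input.
    """
    mask = [len(s) > 1 and "#" not in s for s in sentences]
    kept = [s for s, m in zip(sentences, mask) if m]
    return kept, mask


demo_words = ["Eu", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
demo_sentences = [join_words_sketch(demo_words), ".", "Item # 3"]
demo_kept, demo_mask = filter_sentences_sketch(demo_sentences)
print(demo_kept)
print(demo_mask)
```

The mask is what allows `given_words` and `given_labels` to be filtered consistently with `sentences`.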


## 3. Train Models using Cross-Validation 

To identify potential label errors in the training dataset, we compute the out-of-sample predicted probabilities using cross-validation. We first partition the dataset into `k` folds: 

In [6]:
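lines = [[]] 
for filepath in filepaths: 
    with open(filepath) as f:  # close each file after reading
        for line in f: 
            if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                if len(lines[-1]) > 0: 
                    lines.append([]) 
            else: 
                lines[-1].append(line) 

# Drop the trailing empty group, if any, without discarding a real sentence.
if len(lines[-1]) == 0: 
    lines = lines[:-1] 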
lines = [line for m, line in zip(mask, lines) if m] 

k = 5 
indices = create_folds(lines, k=k) 

'folds/' already exists, skipping...
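The helper `create_folds` comes from the tutorial's utility module, and its implementation is not shown here. As a rough sketch of the idea only, the partition of sentence indices into `k` held-out folds could look like the following. `make_fold_indices` is a hypothetical name, and splitting into contiguous blocks without shuffling is an assumption.

```python
import numpy as np


def make_fold_indices(num_sentences, k):
    """Split range(num_sentences) into k disjoint held-out index arrays (contiguous, unshuffled)."""
    return np.array_split(np.arange(num_sentences), k)


demo_indices = make_fold_indices(20718, 5)
print([len(fold) for fold in demo_indices])

# For fold 0, the model is trained on every sentence that is NOT held out.
demo_held_out = demo_indices[0]
demo_train = np.setdiff1d(np.arange(20718), demo_held_out)
print(len(demo_train), len(demo_held_out))
```

Each sentence appears in exactly one held-out fold, so every sentence later receives a prediction from a model that never saw it during training.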


We train one model for each fold's training/testing pair: 

- Warning: the following code takes a long time to execute. Running it on a GPU is strongly recommended; on a CPU it may be impractically slow. 

In [7]:
for i in range(k): 
    if os.path.exists('folds/fold%d/model/' % i): 
        print('Model %d already exists, skipping...' % i) 
    else: 
        print('Training model on fold %d (out of %d) of cross-validation...' % (i, k)) 
        os.system(
            "python3 run_ner.py --data_dir=folds/fold%d --bert_model=bert-base-cased " \
            "--task_name=ner --output_dir=folds/fold%d/model --max_seq_length=256 " \
            "--do_train --num_train_epochs 10 --warmup_proportion=0.1" % (i, i)) 
        print('Model %d saved' % i) 
        # TODO: make this a function in run_ner.py? 

Model 0 already exists, skipping...
Model 1 already exists, skipping...
Model 2 already exists, skipping...
Model 3 already exists, skipping...
Model 4 already exists, skipping...


## 4. Compute Out-of-Sample Predicted Probabilities 

We obtain the predicted probabilities for each sample using the model that was trained without that sample, i.e. the model for the fold in which the sample was held out. Note that most tokenizers break sentences down into subword-level tokens, which are units smaller than word-level tokens. For example, the following sentence: 

In [8]:
i = 0 
print(sentences[i]) 

Eu rejects German call to boycott British lamb.


is tokenized into: 

In [9]:
model = Ner("folds/fold0/model/") 
tokens = model.tokenize(sentences[i])[0] 
print(tokens) 
tokens = [token.replace('#', '') for token in tokens] 

['E', '##u', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.']
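In WordPiece-style output such as the list above, a leading `##` marks a piece that continues the previous token. A minimal sketch of regrouping such pieces into whole words is given below. `group_wordpieces` is a hypothetical helper, not part of the `bert` module or of cleanlab.

```python
def group_wordpieces(tokens):
    """Merge '##'-prefixed continuation pieces into the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


demo_pieces = ["E", "##u", "rejects", "German", "call", "to",
               "boycott", "British", "la", "##mb", "."]
print(group_wordpieces(demo_pieces))
```

The regrouped pieces match the word-level tokens only when the tokenizer splits along the same boundaries as the dataset, which is not guaranteed in general.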


`##` indicates that the token is a subword. We collect both the predicted probabilities and the tokens for each sentence. 

In [10]:
sentence_tokens, sentence_probs = {}, {} 
for i in range(k): 
    model = Ner("folds/fold%d/model/" % i) 
    for index in tqdm(indices[i]): 
        sentence_probs[index], sentence_tokens[index] = model.predict(sentences[index]) 
        
sentence_tokens = [sentence_tokens[i] for i in range(len(sentences))] 
sentence_probs = [sentence_probs[i] for i in range(len(sentences))] 

100%|██████████| 4144/4144 [01:14<00:00, 55.58it/s]
100%|██████████| 4144/4144 [01:14<00:00, 55.37it/s]
100%|██████████| 4144/4144 [01:14<00:00, 55.43it/s]
100%|██████████| 4143/4143 [01:14<00:00, 55.57it/s]
100%|██████████| 4143/4143 [01:14<00:00, 55.54it/s]


Most tokenizers partition sentences into subword-level tokens without altering the characters. However, you should verify whether any characters are modified, particularly for edge cases such as single or double quotation marks. In this example, a double quotation mark `"` is broken down into two `'`s or two `` ` ``s. Run the following to check whether any of the characters in the sentences have been modified. 

In [11]:
print(modified(given_words, sentence_tokens)) 

True
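The helper `modified` is defined in the tutorial's utility module. One plausible way to implement such a check, shown purely as an assumption, is to compare the concatenated characters of the given words with those of the tokens after removing the `#` markers. `chars_modified_sketch` is a hypothetical name.

```python
def chars_modified_sketch(given_words, sentence_tokens):
    """Return True if any sentence's characters differ between dataset words and model tokens."""
    for words, tokens in zip(given_words, sentence_tokens):
        original = "".join(words)
        tokenized = "".join(token.replace("#", "") for token in tokens)
        if original != tokenized:
            return True
    return False


demo_given = [["He", "said", '"', "hi", '"']]
demo_unchanged_tokens = [["He", "said", '"', "hi", '"']]
demo_altered_tokens = [["He", "said", "``", "hi", "''"]]
print(chars_modified_sketch(demo_given, demo_unchanged_tokens))  # False
print(chars_modified_sketch(demo_given, demo_altered_tokens))    # True
```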


Given that some characters have been modified, we need to map `sentence_tokens` back to the characters from the original dataset, so that we can compare the predicted labels with the given labels. Note that such mapping is different for different models, and is not required by most tokenizers. We cannot work with modified tokens directly because we do not have the given labels for the modified tokens. 

<details><summary>Below is the code used to map `sentence_tokens` back to the original characters.</summary>
    
    replace = [('#', ''), ('``', '"'), ("''", '"')] 
    sentence_tokens = [[process_token(token, replace) for token in sentence_tokens[i]] for i in range(len(sentences))] 

    for i in range(len(sentences)): 
        short = ''.join(given_words[i]) 
        if "''" in short: 
            processed_tokens, processed_probs = [], [] 
            for token, prob in zip(sentence_tokens[i], sentence_probs[i]): 
                if token != '"': 
                    processed_tokens.append(token) 
                    processed_probs.append(prob) 
                else: 
                    for _ in range(2): 
                        processed_tokens.append("'") 
                        processed_probs.append(prob) 
            sentence_tokens[i] = [token for token in processed_tokens] 
            sentence_probs[i] = np.array([prob for prob in processed_probs]) </details> 

In [12]:
replace = [('#', ''), ('``', '"'), ("''", '"')] 
sentence_tokens = [[process_token(token, replace) for token in sentence_tokens[i]] for i in range(len(sentences))] 

for i in range(len(sentences)): 
    short = ''.join(given_words[i]) 
    if "''" in short: 
        processed_tokens, processed_probs = [], [] 
        for token, prob in zip(sentence_tokens[i], sentence_probs[i]): 
            if token != '"': 
                processed_tokens.append(token) 
                processed_probs.append(prob) 
            else: 
                for _ in range(2): 
                    processed_tokens.append("'") 
                    processed_probs.append(prob) 
        sentence_tokens[i] = [token for token in processed_tokens] 
        sentence_probs[i] = np.array([prob for prob in processed_probs]) 

In this example, we are more interested in severe types of mislabels, such as `B-LOC` vs. `B-PER`, than in milder ones such as `B-LOC` vs. `I-LOC`. Therefore, we discard the `B-` and `I-` prefixes and get the model-predicted probabilities for each subword-level token. The merged entities are `[O, MISC, PER, ORG, LOC]`, which correspond to the classes in our token classification task. In a [subsequent notebook](https://cleanlab.ai/), we will use the probabilistic predictions from this trained model to identify instances where the class label was incorrectly chosen for particular tokens. The merging is specified by two mappings: 

- `given_maps` is an array of length equal to the original number of entities, such that `given_maps[i]` is the mapped entity of the i'th entity 
- `model_maps` is an array of length equal to the number of model-predicted labels, such that `model_maps[i]` is the mapped entity of the i'th model-predicted entity. If `model_maps[i] < 0`, the entity does not map to a valid named entity. This usually occurs when the model-predicted entities include start/end tags. If any entry of `model_maps` is negative, `pred_probs` will be normalized. 

In [13]:
given_maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
model_maps = [-1, 0, 1, 1, 2, 2, 3, 3, 4, 4, -1, -1] 
given_labels = [mapping(labels, maps=given_maps) for labels in given_labels] 
sentence_probs = [merge_probs(pred_prob, maps=model_maps) for pred_prob in sentence_probs] 

Specifically, `merge_probs` takes in two parameters: 

- `probs`: `np.array` of shape `(N, L)`, where `N` is the number of tokens in the sentence, and `L` is the number of classes of the model. 
- `maps`: `list` of length `L`, where `L` is the number of classes of the model, with details specified above in `model_maps`. 

and returns: 

- `probs_merged`: `np.array` of shape `(N, K)`, where `N` is the number of tokens in the sentence, and `K` is the number of classes of the new set of entities. 

such that $\mathrm{probs\_merged}[:, j] = \sum_{j' \,:\, \mathrm{maps}[j'] = j} \mathrm{probs}[:, j']$. If any element in `maps` is negative (does not map to anything in the new set of classes), `probs_merged` is normalized such that each row sums to 1. 
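The behavior described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation based only on the description in this section, not the library code. The names `map_labels_sketch` and `merge_probs_sketch` are hypothetical.

```python
import numpy as np


def map_labels_sketch(labels, maps):
    """Map each original class index to its merged class index."""
    return [maps[label] for label in labels]


def merge_probs_sketch(probs, maps):
    """Sum the columns of `probs` that share a target class in `maps`.

    Columns mapped to a negative index are dropped, and the rows are then
    renormalized so that each sums to 1.
    """
    probs = np.asarray(probs, dtype=float)
    merged = np.zeros((probs.shape[0], max(maps) + 1))
    for old_class, new_class in enumerate(maps):
        if new_class >= 0:
            merged[:, new_class] += probs[:, old_class]
    if min(maps) < 0:
        merged /= merged.sum(axis=1, keepdims=True)
    return merged


# Toy example: 4 model classes, the first of which maps to nothing.
demo_probs = np.array([[0.1, 0.2, 0.3, 0.4]])
demo_merged = merge_probs_sketch(demo_probs, maps=[-1, 0, 1, 1])
print(demo_merged)
print(map_labels_sketch([5, 0, 1], maps=[0, 1, 1, 2, 2, 3, 3, 4, 4]))
```

In the toy example the dropped column holds probability 0.1, so the remaining mass of 0.9 is rescaled to sum to 1.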

## 5. Reducing from subword-level to word-level 

When a sentence gets tokenized, each word-level token may be broken down into subword-level tokens, each of which generates a predicted probability. Given that we only have the labels for word-level tokens, we reduce the subword-level tokens to word-level tokens. 

\* For this example, most of the subword-to-word reduction is handled internally, but for most other models the reduction has to be done manually. In the following, we show our method of reduction, which is slightly different from how the `bert` model performs it. See [here](https://github.com/kamalkraj/BERT-NER/blob/dev/bert.py#L85) for more info. 

In [14]:
print('Sentence:\t' + sentences[0]) 
print('Given words:\t' + str(given_words[0])) 
print('Subwords:\t' + str(tokens)) 

Sentence:	Eu rejects German call to boycott British lamb.
Given words:	['Eu', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
Subwords:	['E', 'u', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', 'mb', '.']


The word `lamb` is tokenized into two subwords `la` and `mb`. In this case, we assign the average predicted probabilities of the two subwords to the token. Alternatively, we can take the weighted average, such that the weight for each predicted probability is proportional to the length of its corresponding subword. This is to ensure that longer subwords have heavier weights on the average predicted probabilities, although the benefits are insignificant for most datasets. 
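Both reductions are easy to express with NumPy. The sketch below uses made-up probabilities and a hypothetical split of a word into the subwords `boy` and `cott`, chosen so that the two averages differ. `reduce_subwords_sketch` is a hypothetical name and not part of `get_pred_probs`.

```python
import numpy as np


def reduce_subwords_sketch(sub_probs, subwords, weighted=False):
    """Average subword-level probabilities into one word-level probability vector.

    With weighted=True, each subword's weight is proportional to its length.
    """
    sub_probs = np.asarray(sub_probs, dtype=float)
    weights = [len(s) for s in subwords] if weighted else None
    return np.average(sub_probs, axis=0, weights=weights)


demo_subwords = ["boy", "cott"]
demo_sub_probs = [[0.9, 0.1],   # made-up probabilities for "boy"
                  [0.2, 0.8]]   # made-up probabilities for "cott"

demo_plain = reduce_subwords_sketch(demo_sub_probs, demo_subwords)
demo_weighted = reduce_subwords_sketch(demo_sub_probs, demo_subwords, weighted=True)
print(demo_plain)
print(demo_weighted)
```

For `la` and `mb`, which have equal length, the plain and the length-weighted averages coincide.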

Each tokenizer tokenizes sentences differently. In some rare cases, a subword-level token may span two word-level tokens, resulting in a misalignment in tokenization. For example, consider the following tokenization: 

In [15]:
demo_sentence = 'Massachusetts Institute of Technology (MIT)' 
demo_given_words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')'] 
demo_subwords = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')'] 

print('Sentence:\t' + demo_sentence) 
print('Given words:\t' + str(demo_given_words)) 
print('Subwords:\t' + str(demo_subwords)) 

Sentence:	Massachusetts Institute of Technology (MIT)
Given words:	['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')']
Subwords:	['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')']


In this case, we assign the predicted probabilities of `(M` to `(`, and the average predicted probabilities of `(M` and `IT` to `MIT`. 
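One way to realize this assignment, shown here as a sketch under assumptions rather than as the implementation of `get_pred_probs`, is to compare character spans: each word-level token receives the average of the probabilities of all subwords that overlap it. The function names are hypothetical, the probabilities are made up, and the sketch assumes that the words and the subwords spell out exactly the same character sequence.

```python
import numpy as np


def char_spans(tokens):
    """Return the (start, end) character span of each token in the concatenated string."""
    spans, start = [], 0
    for token in tokens:
        spans.append((start, start + len(token)))
        start += len(token)
    return spans


def align_probs_sketch(sub_probs, subwords, words):
    """Average, for each word, the probabilities of all subwords overlapping its span."""
    if "".join(subwords) != "".join(words):
        raise ValueError("words and subwords must contain the same characters")
    sub_probs = np.asarray(sub_probs, dtype=float)
    sub_spans = char_spans(subwords)
    word_probs = []
    for w_start, w_end in char_spans(words):
        overlapping = [i for i, (s, e) in enumerate(sub_spans) if s < w_end and e > w_start]
        word_probs.append(sub_probs[overlapping].mean(axis=0))
    return np.array(word_probs)


demo_words = ["(", "MIT", ")"]
demo_subwords = ["(M", "IT", ")"]
demo_sub_probs = [[0.2, 0.8],   # made-up probabilities for "(M"
                  [0.0, 1.0],   # made-up probabilities for "IT"
                  [0.9, 0.1]]   # made-up probabilities for ")"
demo_word_probs = align_probs_sketch(demo_sub_probs, demo_subwords, demo_words)
print(demo_word_probs)
```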

We use the method above to map the predicted probabilities for each token generated by the tokenizer to each token in the original dataset. `get_pred_probs` returns `pred_probs`, a nested list in which `pred_probs[i]` is a `np.array` of shape `(N, K)`, where `N` is the number of given tokens in sentence `i` and `K` is the number of classes. This is the expected format for methods in the `cleanlab.token_classification` package. 

In [16]:
pred_probs = [get_pred_probs(sentence_probs[i], sentence_tokens[i], given_words[i]) 
                         for i in range(len(sentences))] 

For example, we print each token of the first sentence together with its given label, its predicted probabilities, and its predicted label. 

In [17]:
entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 
for word, label, prob in zip(given_words[0], given_labels[0], pred_probs[0]): 
    print('Token: %s, given label: %s' % (word, entities[label])) 
    print('Predicted probabilities: %s' % str(np.round(prob, 7))) 
    print('Predicted label: %s\n' % entities[np.argmax(prob)]) 

Token: Eu, given label: ORG
Predicted probabilities: [3.041000e-04 2.383000e-04 9.993621e-01 7.010000e-05 2.550000e-05]
Predicted label: PER

Token: rejects, given label: O
Predicted probabilities: [9.99988e-01 4.00000e-06 2.20000e-06 4.50000e-06 1.30000e-06]
Predicted label: O

Token: German, given label: MISC
Predicted probabilities: [7.500000e-06 9.999611e-01 1.370000e-05 8.700000e-06 9.000000e-06]
Predicted label: MISC

Token: call, given label: O
Predicted probabilities: [9.999894e-01 3.800000e-06 1.800000e-06 3.700000e-06 1.400000e-06]
Predicted label: O

Token: to, given label: O
Predicted probabilities: [9.99991e-01 2.70000e-06 1.70000e-06 3.50000e-06 1.10000e-06]
Predicted label: O

Token: boycott, given label: O
Predicted probabilities: [9.999877e-01 4.800000e-06 2.000000e-06 4.400000e-06 1.100000e-06]
Predicted label: O

Token: British, given label: MISC
Predicted probabilities: [4.700000e-06 9.999639e-01 1.100000e-05 1.160000e-05 8.800000e-06]
Predicted label: MISC

Token: 

## 6. Save `pred_probs` 

Next, we save the predicted probabilities. `to_dict` converts `pred_probs` into a dictionary `d` where `d[str(i)]==pred_probs[i]`. The dictionary is saved as a `.npz` file. 

In [18]:
pred_probs_dict = to_dict(pred_probs) 
np.savez('pred_probs.npz', **pred_probs_dict) 
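To read the file back in a later session, the list can be rebuilt from the string keys. The sketch below is self-contained: it uses a small dictionary as a stand-in for the output of `to_dict` (following the `d[str(i)] == pred_probs[i]` convention described above) and writes to a temporary directory instead of `pred_probs.npz`.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for `pred_probs`: two sentences with 2 and 3 tokens, K = 2 classes.
demo_pred_probs = [np.array([[0.9, 0.1], [0.3, 0.7]]),
                   np.array([[0.5, 0.5], [0.8, 0.2], [0.1, 0.9]])]
demo_dict = {str(i): p for i, p in enumerate(demo_pred_probs)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pred_probs.npz")
    np.savez(path, **demo_dict)

    # Rebuild the list in sentence order from the string keys.
    with np.load(path) as data:
        demo_loaded = [data[str(i)] for i in range(len(data.files))]

print([p.shape for p in demo_loaded])
```

Indexing by `str(i)` rather than iterating over the stored keys keeps the sentences in their original order.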

## 7. Model evaluation  

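Finally, we evaluate the model's accuracy. We follow the definitions of precision and recall introduced by CoNLL-2003, applied here to individual tokens (a token counts as a named entity if its label is not `O`) rather than to whole entity spans: 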

> *“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”*

See [here](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/) for more info. 

In [19]:
predictions = [pred_prob.argmax(axis=1) for pred_prob in pred_probs] 
predictions_flatten = [pred for prediction in predictions for pred in prediction] 
given_labels_flatten = [label for given_label in given_labels for label in given_label] 

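# Token-level counts; class 0 is `O`, i.e. not a named entity. 
given_entities, given_entities_correct = 0, 0          # used for recall 
predicted_entities, predicted_entities_correct = 0, 0  # used for precision 
correct = 0 

for truth, prediction in zip(given_labels_flatten, predictions_flatten): 
    if truth != 0: 
        given_entities += 1 
        if truth == prediction: 
            given_entities_correct += 1 
    if prediction != 0: 
        predicted_entities += 1 
        if truth == prediction: 
            predicted_entities_correct += 1 
    if truth == prediction: 
        correct += 1 

precision = predicted_entities_correct / predicted_entities 
recall = given_entities_correct / given_entities 
f1 = 2 * precision * recall / (precision + recall) 
accuracy = correct / len(given_labels_flatten) 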

balanced_accuracy = balanced_accuracy_score(given_labels_flatten, predictions_flatten) 

print('Precision\t\t%.3f\nRecall\t\t\t%.3f\nf1-score\t\t%.3f\nAccuracy\t\t%.3f\nBalanced Accuracy\t%.3f' % 
     (precision, recall, f1, accuracy, balanced_accuracy)) 

Precision		0.949
Recall			0.952
f1-score		0.951
Accuracy		0.989
Balanced Accuracy	0.955
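The same token-level quantities can also be computed without an explicit loop. The sketch below is an equivalent NumPy formulation on a toy example. `token_level_metrics` is a hypothetical helper name, and class `0` is assumed to be `O` as in the cell above.

```python
import numpy as np


def token_level_metrics(truth, pred):
    """Token-level precision, recall, F1, and accuracy, treating class 0 as non-entity.

    Assumes at least one given and one predicted non-zero label; otherwise
    the corresponding mean is undefined (NaN).
    """
    truth, pred = np.asarray(truth), np.asarray(pred)
    match = truth == pred
    precision = float(match[pred != 0].mean())   # correct among predicted entities
    recall = float(match[truth != 0].mean())     # correct among given entities
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = float(match.mean())
    return precision, recall, f1, accuracy


demo_truth = [0, 1, 2, 0, 3]
demo_pred = [0, 1, 0, 2, 3]
demo_metrics = token_level_metrics(demo_truth, demo_pred)
print("precision=%.3f recall=%.3f f1=%.3f accuracy=%.3f" % demo_metrics)
```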


In [20]:
expected_words = ['Eu', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] 
expected_labels = [3, 0, 1, 0, 0, 0, 1, 0, 0] 
if given_words[0] != expected_words or given_labels[0] != expected_labels: 
    raise Exception("Something wrong with reading file") 