# Training Entity Recognition Model for Token Classification Tutorial

This notebook demonstrates how to train an NLP model for entity recognition and use it to produce out-of-sample predicted probabilities for each token. These are required inputs to find label issues in token classification datasets with cleanlab. The specific token classification task we consider here is Named Entity Recognition with the [CoNLL-2003 dataset](https://deepai.org/dataset/conll-2003-english), and we train a Transformer network from [HuggingFace's transformers library](https://github.com/huggingface/transformers). This notebook only covers how to produce the `pred_probs`; using them to find label issues is demonstrated in cleanlab's [Token Classification Tutorial](https://docs.cleanlab.ai/).

**Overview of what we'll do in this notebook:** 
- Read and process text datasets with per-token labels in the CoNLL format. 
- Compute out-of-sample predicted probabilities by training a Transformer network via cross-validation. 
- Aggregate subword-level tokens\* into word-level tokens which are more individually meaningful. 

\* In NLP, tokens typically refer to words or punctuation marks, but modern tokenizers used with Transformers often break down longer words into smaller subwords. To avoid confusion, we use "word-level tokens" to refer to the individual given tokens in the original dataset (with a separate class label provided for each such token). Tokens obtained from processing the raw text with tokenizers are referred to as "subword-level tokens".

## 1. Fetch data and load required dependencies

In [None]:
!wget -nc https://data.deepai.org/conll2003.zip && mkdir -p data 
!unzip conll2003.zip -d data/ && rm conll2003.zip 

In [None]:
# Package versions we used: tqdm==4.64.0 transformers==4.22.0.dev0 numpy==1.23.0 scikit-learn==1.1.1
import os 
import warnings 
import string 
import numpy as np 
from tqdm import tqdm 
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline 
from sklearn.metrics import balanced_accuracy_score 
import nltk 
from cleanlab.internal.token_classification_utils import get_sentence, filter_sentence, process_token, mapping, merge_probs 
from bert import Ner 
from token_classification_tutorial_utils import create_folds, modified, get_pred_probs, to_dict 
from run_ner import train 

nltk.download("punkt") 
warnings.filterwarnings("ignore") 

# Disable `TOKENIZERS_PARALLELISM` if multiple processors exist. 
if os.cpu_count() > 1: 
    os.environ["TOKENIZERS_PARALLELISM"] = "false" 

## 2. Load the CoNLL-2003 dataset

CoNLL-2003 data are in the following format:

`-DOCSTART- -X- -X- O` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of first sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

`[empty line]` 

`[word] [pos_tags] [chunk_tags] [ner_tags]` <- Start of second sentence 

`...`

`[word] [pos_tags] [chunk_tags] [ner_tags]` 

In our example, we focus on the `ner_tags` (named-entity recognition tags), which include the following classes: 

| `ner_tags` |             Description              |
|:----------:|:------------------------------------:|
|     `O`    |      Other (not a named entity)      |
|  `B-MISC`  | Beginning of a miscellaneous entity  |
|  `I-MISC`  |         Miscellaneous entity         |
|   `B-PER`  |     Beginning of a person entity     |
|   `I-PER`  |            Person entity             |
|   `B-ORG`  | Beginning of an organization entity  |
|   `I-ORG`  |         Organization entity          |
|   `B-LOC`  |    Beginning of a location entity    |
|   `I-LOC`  |           Location entity            | 

For more information, see [here](https://paperswithcode.com/dataset/conll-2003). We cast all-caps words into lowercase except for the first character (e.g. `JAPAN` -> `Japan`), to discourage the tokenizer from breaking such words into multiple subwords.
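
The casing rule can be tried in isolation. The following is a minimal sketch; the helper name `normalize_caps` is hypothetical and only mirrors the condition used inside `readfile` below.

```python
def normalize_caps(word):
    """Lowercase everything but the first character of an all-caps word."""
    # Only words that start with a letter and contain no lowercase letters change.
    if len(word) > 0 and word[0].isalpha() and word.isupper():
        return word[0] + word[1:].lower()
    return word


examples = ["JAPAN", "Japan", "EU", "1996-08-22", "U.S."]
normalized = [normalize_caps(w) for w in examples]
for before, after in zip(examples, normalized):
    print(before, "->", after)
```

Note that abbreviations written with periods are affected too: under this rule `U.S.` becomes `U.s.`, while words starting with a digit are left unchanged.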

In [None]:
filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] 
entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
entity_map = {entity: i for i, entity in enumerate(entities)} 

# This code is adapted from: https://github.com/kamalkraj/BERT-NER/blob/dev/run_ner.py 
def readfile(filepath, sep=' '): 
    """Read a CoNLL-format file into per-sentence word lists and label lists."""
    data, sentence, label = [], [], []
    with open(filepath) as lines:
        for line in lines:
            # Blank lines and -DOCSTART- markers end the current sentence.
            if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                if len(sentence) > 0:
                    data.append((sentence, label))
                    sentence, label = [], []
                continue
            splits = line.rstrip('\n').split(sep) 
            word = splits[0]
            # Cast all-caps words to capitalized form, e.g. JAPAN -> Japan.
            if len(word) > 0 and word[0].isalpha() and word.isupper():
                word = word[0] + word[1:].lower()
            sentence.append(word)
            label.append(entity_map[splits[-1]])

    # Keep the last sentence if the file does not end with a blank line.
    if len(sentence) > 0:
        data.append((sentence, label))

    given_words = [d[0] for d in data] 
    given_labels = [d[1] for d in data] 

    return given_words, given_labels 

given_words, given_labels = [], [] 

for filepath in filepaths: 
    words, labels = readfile(filepath) 
    given_words.extend(words) 
    given_labels.extend(labels)

`given_words` and `given_labels` above are strings/labels corresponding to each word-level token, represented as nested lists.

In [None]:
i = 0  # change this to view a different example from the dataset 

print('Word\t\tLabel\tEntity\n-------------------------------') 
for word, label in zip(given_words[i], given_labels[i]): 
    print('{:14s}{:3d}\t{:10s}'.format(word, label, entities[label])) 

We next apply minor pre-processing for readability. Sentences containing the `#` character are removed for simplicity, because the tokenizers used in HuggingFace later use this special character to mark subword tokens (see Section 4). We also remove single-token sentences with `len(sentence) <= 1`.
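
To make the filtering rule concrete, here is a small sketch of the two conditions described above. The function `filter_sentences_sketch` is a hypothetical stand-in written for illustration; the notebook itself relies on cleanlab's `filter_sentence`, whose internals are not shown here.

```python
def filter_sentences_sketch(sentences):
    """Return the kept sentences plus a boolean mask over the input list."""
    # Keep a sentence only if it is longer than one character and has no '#'.
    mask = [len(s) > 1 and "#" not in s for s in sentences]
    kept = [s for s, keep in zip(sentences, mask) if keep]
    return kept, mask


demo_sentences = ["Eu rejects German call .", "Ticket #42 was sold .", "."]
kept, demo_mask = filter_sentences_sketch(demo_sentences)
print(kept)
print(demo_mask)
```

Returning the mask alongside the filtered sentences is what allows the same selection to be applied to `given_words` and `given_labels`, as the next cell does.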

In [None]:
sentences = list(map(get_sentence, given_words)) 

sentences, mask = filter_sentence(sentences) 
given_words = [words for m, words in zip(mask, given_words) if m] 
given_labels = [labels for m, labels in zip(mask, given_labels) if m] 

In [None]:
print('Number of sentences: %d' % len(sentences)) 
print(sentences[0])  # display first sentence in the processed dataset

## 3. Train Token Classification Model using Cross-Validation 

To later find label issues in the training dataset, we first compute out-of-sample predicted probabilities (`pred_probs`) using cross-validation. We start by partitioning the dataset into `k = 5` disjoint folds: 

In [None]:
lines = [[]] 
for filepath in filepaths: 
    with open(filepath) as f: 
        for line in f: 
            if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\n':
                if len(lines[-1]) > 0: 
                    lines.append([]) 
            else: 
                lines[-1].append(line) 

# Drop the trailing empty group left behind by a final blank line. 
if len(lines[-1]) == 0: 
    lines = lines[:-1] 
lines = [line for m, line in zip(mask, lines) if m] 

k = 5 
indices = create_folds(lines, k=k) 
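
For intuition, a disjoint k-fold partition of sentence indices can be sketched as follows. This is an illustration under assumptions, not the code of `create_folds`: the name `make_fold_indices`, the shuffling, and the seed are all hypothetical, and whatever files `create_folds` may write are not reproduced.

```python
import numpy as np


def make_fold_indices(n_items, k, seed=0):
    """Split the indices 0..n_items-1 into k disjoint, nearly equal folds."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_items)
    return [fold.tolist() for fold in np.array_split(permutation, k)]


folds = make_fold_indices(n_items=12, k=5)
for i, held_out in enumerate(folds):
    # Everything outside fold i is used to train the model for fold i.
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    print("fold", i, "held out:", sorted(held_out), "train size:", len(train_idx))
```

Each index is held out exactly once, which is what makes the later predictions out-of-sample for every sentence.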

We train one model for each fold's training/testing pair: 

- Warning: the following code takes a long time to execute. Running it on a GPU is strongly recommended; on a CPU, training may be impractically slow.

In [None]:
for i in range(k): 
    if os.path.exists('folds/fold%d/model/' % i): 
        print('Model %d already exists, skipping...' % i) 
    else: 
        print('Training model on fold %d (out of %d) of cross-validation...' % (i, k)) 
        train(data_dir='folds/fold%d' % i, 
              bert_model='bert-base-cased', 
              task_name='ner', 
              output_dir='folds/fold%d/model' % i, 
              max_seq_length=256, 
              do_train=True, 
              num_train_epochs=10, 
              warmup_proportion=0.1,
              train_batch_size=16
        ) 
        print('Model %d saved' % i) 

## 4. Compute Out-of-Sample Predicted Probabilities 

We obtain predicted class probabilities for each token using the model for which this token was held out during training (i.e. its sentence was part of the validation fold above). Note that most modern tokenizers break sentences into subword-level tokens (smaller units than word-level tokens). For example, the following sentence:

In [None]:
i = 0  # change this to view a different example from the dataset 
print(sentences[i]) 

is tokenized into: 

In [None]:
model = Ner("folds/fold0/model/") 
tokens = model.tokenize(sentences[i])[0] 
print(tokens) 
tokens = [token.replace('#', '') for token in tokens] 

`##` is prepended by the tokenizer to indicate that a token is a subword continuing the preceding token. Let's collect both the predicted class probabilities and strings for each token in each sentence.

In [None]:
sentence_tokens, sentence_probs = {}, {} 
for i in range(k): 
    model = Ner("folds/fold%d/model/" % i) 
    for index in tqdm(indices[i]): 
        sentence_probs[index], sentence_tokens[index] = model.predict(sentences[index]) 
        
sentence_tokens = [sentence_tokens[i] for i in range(len(sentences))] 
sentence_probs = [sentence_probs[i] for i in range(len(sentences))] 

Most tokenizers partition sentences into subword-level tokens without altering the characters. However, you should verify whether any characters are modified, particularly for edge cases such as single or double quotation marks. In this example, the tokenizer has broken down the double quotation mark `"` into two `'`s or two `` ` ``s. The following code checks whether any characters in the sentences were modified.

In [None]:
print(modified(given_words, sentence_tokens)) 

Given that some characters were modified, we should map `sentence_tokens` back to the characters from the original dataset, so that we can better compare predicted labels with the given labels to spot label issues. This mapping differs between different models, and may not be required by many tokenizers. We should not work with modified tokens directly because we lack their given labels. 

In [None]:
# Code to map `sentence_tokens` back to the characters from the original text 
replace = [('#', ''), ('``', '"'), ("''", '"')] 
sentence_tokens = [[process_token(token, replace) for token in sentence_tokens[i]] for i in range(len(sentences))] 

for i in range(len(sentences)): 
    short = ''.join(given_words[i]) 
    if "''" in short: 
        processed_tokens, processed_probs = [], [] 
        for token, prob in zip(sentence_tokens[i], sentence_probs[i]): 
            if token != '"': 
                processed_tokens.append(token) 
                processed_probs.append(prob) 
            else: 
                for _ in range(2): 
                    processed_tokens.append("'") 
                    processed_probs.append(prob) 
        sentence_tokens[i] = [token for token in processed_tokens] 
        sentence_probs[i] = np.array([prob for prob in processed_probs]) 
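
The effect of the `replace` list in the cell above can be pictured as plain string substitution on each token. The sketch below uses a hypothetical helper, `apply_replacements`, that applies the pairs one after another; cleanlab's `process_token` may be implemented differently, so treat this only as an illustration of the intended result.

```python
def apply_replacements(token, replace):
    """Apply each (old, new) substitution to the token, in order."""
    for old, new in replace:
        token = token.replace(old, new)
    return token


replace_pairs = [('#', ''), ('``', '"'), ("''", '"')]
raw_tokens = ['``', 'boy', '##cott', "''"]
cleaned = [apply_replacements(t, replace_pairs) for t in raw_tokens]
print(cleaned)
```

After this step, the subword marker `#` is gone and both tokenizer quote variants are mapped back to a plain double quotation mark.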

In this example, we are more interested in severe types of mislabels, such as `B-LOC` vs. `B-PER`, instead of `B-LOC` vs. `I-LOC`. Therefore, we discard the `B-` and `I-` prefixes, and get the model-predicted probabilities for each subword-level token. The merged entities are `[O, MISC, PER, ORG, LOC]`, which correspond to the classes in our token classification task. In cleanlab's Token Classification tutorial, we will use the probabilistic predictions from this trained model to identify instances where the class label was incorrectly chosen for particular tokens. The merging is specified by two lists:

- `given_maps` is an array of length equal to the original number of entities, such that `given_maps[i]` is the mapped entity of the i'th entity 
- `model_maps` is an array of length equal to the number of model predicted labels, such that `model_maps[i]` is the mapped entity of the i'th model predicted entity. If `model_maps[i] < 0`, it indicates that the entity does not map to a valid named entity. This usually occurs when the model predicted entities include start/end tags. If `np.any(model_maps < 0)`, `pred_probs` will be normalized. 

In [None]:
given_maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] 
model_maps = [-1, 0, 1, 1, 2, 2, 3, 3, 4, 4, -1, -1] 
given_labels = [mapping(labels, maps=given_maps) for labels in given_labels] 
sentence_probs = [merge_probs(pred_prob, maps=model_maps) for pred_prob in sentence_probs] 

Specifically, `merge_probs` takes in two parameters: 

- `probs`: `np.array` of shape `(N, L)`, where `N` is the number of tokens in the sentence, and `L` is the number of classes of the model. 
- `maps`: `list` of length `L`, where `L` is the number of classes of the model, with details specified above in `model_maps`. 

and returns: 

- `probs_merged`: `np.array` of shape `(N, K)`, where `N` is the number of tokens in the sentence, and `K` is the number of classes of the new set of entities. 

such that `probs_merged[:, j]` is the sum of `probs[:, j']` over all `j'` with `maps[j'] == j`. If any element in `maps` is negative (does not map to anything in the new set of classes), `probs_merged` is normalized such that each row sums to 1.
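
The merging rule can be written out in a few lines of NumPy. This is a sketch of the behavior described above, not cleanlab's implementation of `merge_probs`; the name `merge_probs_sketch` and the toy inputs are assumptions, and rows whose mass lies entirely on discarded classes are not handled.

```python
import numpy as np


def merge_probs_sketch(probs, maps):
    """Sum the columns of `probs` that share a target class in `maps`."""
    probs = np.asarray(probs, dtype=float)
    maps = np.asarray(maps)
    n_merged = maps.max() + 1
    merged = np.zeros((probs.shape[0], n_merged))
    for old_class, new_class in enumerate(maps):
        if new_class >= 0:  # negative entries are discarded
            merged[:, new_class] += probs[:, old_class]
    if np.any(maps < 0):
        # Renormalize so each row sums to 1 after dropping classes.
        merged /= merged.sum(axis=1, keepdims=True)
    return merged


toy_probs = np.array([[0.1, 0.5, 0.2, 0.2],
                      [0.0, 0.1, 0.3, 0.6]])
toy_maps = [-1, 0, 1, 1]  # class 0 is dropped, classes 2 and 3 are merged
toy_merged = merge_probs_sketch(toy_probs, toy_maps)
print(toy_merged)
```

In the first row, the discarded class held probability 0.1, so the remaining mass (0.5 and 0.4) is rescaled to sum to 1.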

## 5. Reducing from subword-level to word-level 

When a sentence gets tokenized, each word-level token may be broken down into subword-level tokens, each of which generates a predicted probability. Given that we only have the labels for word-level tokens, we reduce the subword-level tokens to word-level tokens. 

Note: for this example, most of the subword-to-word reduction is handled internally, but for most other models the reduction has to be done manually. In the following, we show our method of reduction, which is slightly different from how the `bert` model reduces it. See [here](https://github.com/kamalkraj/BERT-NER/blob/dev/bert.py#L85) for more info.

In [None]:
print('Sentence:\t' + sentences[0]) 
print('Given words:\t' + str(given_words[0])) 
print('Subwords:\t' + str(tokens)) 

The word `lamb` is tokenized into two subwords `la` and `mb`. In this case, we assign the average predicted probabilities of the two subwords to the token. Alternatively, we can take the weighted average, such that the weight for each predicted probability is proportional to the length of its corresponding subword. This is to ensure that longer subwords have heavier weights on the average predicted probabilities, although the benefits are insignificant for most datasets. 
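
The two averaging options can be compared directly. The sketch below uses made-up probabilities over two classes, and the helper name `reduce_subwords` is hypothetical.

```python
import numpy as np


def reduce_subwords(subwords, probs, weighted=False):
    """Combine per-subword probabilities into one word-level distribution."""
    probs = np.asarray(probs, dtype=float)
    if not weighted:
        return probs.mean(axis=0)
    # Weight each subword by its number of characters.
    weights = [len(s) for s in subwords]
    return np.average(probs, axis=0, weights=weights)


# `la` and `mb` have equal length, so both options coincide for `lamb`.
lamb_plain = reduce_subwords(["la", "mb"], [[0.9, 0.1], [0.6, 0.4]])
lamb_weighted = reduce_subwords(["la", "mb"], [[0.9, 0.1], [0.6, 0.4]], weighted=True)

# A hypothetical uneven split shows where the two options differ.
uneven_plain = reduce_subwords(["token", "s"], [[0.9, 0.1], [0.3, 0.7]])
uneven_weighted = reduce_subwords(["token", "s"], [[0.9, 0.1], [0.3, 0.7]], weighted=True)
print(lamb_plain, lamb_weighted)
print(uneven_plain, uneven_weighted)
```

Both results remain valid probability distributions, since a convex combination of rows that each sum to 1 also sums to 1.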

Each tokenizer tokenizes sentences differently. In some rare cases, a subword may overlap two tokens, resulting in a misalignment in tokenization. For example, consider the following tokenization: 

In [None]:
demo_sentence = 'Massachusetts Institute of Technology (MIT)' 
demo_given_words = ['Massachusetts', 'Institute', 'of', 'Technology', '(', 'MIT', ')'] 
demo_subwords = ['Massachusetts', 'Institute', 'of', 'Technology', '(M', 'IT', ')'] 

print('Sentence:\t' + demo_sentence) 
print('Given words:\t' + str(demo_given_words)) 
print('Subwords:\t' + str(demo_subwords)) 

In this case, we assign the predicted probabilities of `(M` to `(`, and the average predicted probabilities of `(M` and `IT` to `MIT`. 
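
One way to realize this assignment is to compare character spans: each given word receives the average probability of every subword that overlaps it. The sketch below assumes the given words and the subwords spell out the same character sequence. The function names and the probabilities are made up for illustration; the notebook's own `get_pred_probs` may differ in its details.

```python
import numpy as np


def char_spans(tokens):
    """Return (start, end) character offsets of each token in the joined text."""
    spans, start = [], 0
    for token in tokens:
        spans.append((start, start + len(token)))
        start += len(token)
    return spans


def align_probs(given_words, subwords, subword_probs):
    """Average, for each given word, the probabilities of overlapping subwords."""
    subword_probs = np.asarray(subword_probs, dtype=float)
    sub_spans = char_spans(subwords)
    word_probs = []
    for w_start, w_end in char_spans(given_words):
        overlapping = [i for i, (s_start, s_end) in enumerate(sub_spans)
                       if s_start < w_end and w_start < s_end]
        word_probs.append(subword_probs[overlapping].mean(axis=0))
    return np.array(word_probs)


words = ['(', 'MIT', ')']
pieces = ['(M', 'IT', ')']
piece_probs = [[0.8, 0.2], [0.2, 0.8], [1.0, 0.0]]  # made-up values
aligned = align_probs(words, pieces, piece_probs)
print(aligned)
```

Here `(` takes the probabilities of `(M` alone, `MIT` takes the average of `(M` and `IT`, and `)` keeps its own.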

We use the method above to map the predicted probabilities for each token generated by the tokenizer to each token in the original dataset. `get_pred_probs` returns `pred_probs`, a nested list such that `pred_probs[i]` is a `np.array` of shape `(N, K)`, where `N` is the number of given tokens for sentence `i` and `K` is the number of classes. This is the expected format for methods in the `cleanlab.token_classification` package.

In [None]:
pred_probs = [get_pred_probs(sentence_probs[i], sentence_tokens[i], given_words[i]) 
                         for i in range(len(sentences))] 

For example, we can inspect each token of the first sentence together with its given label, predicted probabilities, and predicted label.

In [None]:
entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] 
for word, label, prob in zip(given_words[0], given_labels[0], pred_probs[0]): 
    print('Token: %s, given label: %s' % (word, entities[label])) 
    print('Predicted probabilities: %s' % str(np.round(prob, 7))) 
    print('Predicted label: %s\n' % entities[np.argmax(prob)]) 

## 6. Save `pred_probs` 

Next, we save the predicted probabilities. `to_dict` converts `pred_probs` into a dictionary `d` where `d[str(i)] == pred_probs[i]`. The dictionary is saved as a `.npz` file.

In [None]:
pred_probs_dict = to_dict(pred_probs) 
np.savez('pred_probs.npz', **pred_probs_dict) 
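
To use the saved file later, the per-sentence arrays have to be read back in index order. The following round trip is a sketch on toy data: it builds the dictionary by hand rather than calling `to_dict`, and writes to a temporary directory so that it does not touch `pred_probs.npz`.

```python
import os
import tempfile

import numpy as np

# Two toy "sentences" with different numbers of tokens and two classes.
toy_pred_probs = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5]])]
toy_dict = {str(i): p for i, p in enumerate(toy_pred_probs)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pred_probs.npz")
    np.savez(path, **toy_dict)
    with np.load(path) as loaded:
        # Keys are strings, so rebuild the list by counting up from 0.
        restored = [loaded[str(i)] for i in range(len(loaded.files))]

print([p.shape for p in restored])
```

Storing one array per sentence avoids padding, because sentences have different numbers of tokens and cannot be stacked into a single rectangular array.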

## 7. Model evaluation  

Lastly, we evaluate the model accuracy. We use the definition of precision and recall introduced by CoNLL-2003: 

> *“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”*

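See [here](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/) for more info. Note that the code below counts matches per word-level token over the merged classes, rather than per whole entity span.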

In [None]:
predictions = [pred_prob.argmax(axis=1) for pred_prob in pred_probs] 
predictions_flatten = [pred for prediction in predictions for pred in prediction] 
given_labels_flatten = [label for given_label in given_labels for label in given_label] 

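# Token-level counts; class 0 is `O`, every other class is an entity. 
true_entity_tokens = 0    # tokens whose given label is an entity 
true_entity_correct = 0   # ... of which the prediction matches 
pred_entity_tokens = 0    # tokens whose predicted label is an entity 
pred_entity_correct = 0   # ... of which the prediction matches 
correct = 0               # all tokens whose prediction matches the given label 

for truth, prediction in zip(given_labels_flatten, predictions_flatten): 
    if truth != 0: 
        true_entity_tokens += 1 
        if truth == prediction: 
            true_entity_correct += 1 
    if prediction != 0: 
        pred_entity_tokens += 1 
        if truth == prediction: 
            pred_entity_correct += 1 
    if truth == prediction: 
        correct += 1 

precision = pred_entity_correct / pred_entity_tokens 
recall = true_entity_correct / true_entity_tokens 
f1 = 2 * precision * recall / (precision + recall) 
accuracy = correct / len(given_labels_flatten) 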

balanced_accuracy = balanced_accuracy_score(given_labels_flatten, predictions_flatten) 

print('Precision\t\t%.3f\nRecall\t\t\t%.3f\nf1-score\t\t%.3f\nAccuracy\t\t%.3f\nBalanced Accuracy\t%.3f' % 
     (precision, recall, f1, accuracy, balanced_accuracy)) 
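
For comparison, the exact-match definition quoted above operates on whole entity spans rather than individual tokens. The sketch below shows one way to compute it from tag sequences that still carry their `B-`/`I-` prefixes, which the merged labels in this notebook no longer do. The function names are hypothetical, and treating an `I-` tag without a preceding `B-` as the start of an entity is an assumption.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a B-/I-/O tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            entities.add((start, i, current))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return entities


def span_prf(gold_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    n_correct = len(gold & pred)
    p = n_correct / len(pred) if pred else 0.0
    r = n_correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold_tags = ["B-ORG", "O", "B-MISC", "O", "B-PER", "I-PER"]
pred_tags = ["B-ORG", "O", "B-MISC", "O", "B-PER", "O"]
span_p, span_r, span_f = span_prf(gold_tags, pred_tags)
print(span_p, span_r, span_f)
```

In this toy example the predicted person entity is one token too short, so it counts as wrong under exact matching even though its first token is labeled correctly.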

In [None]:
expected_words = ['Eu', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] 
expected_labels = [3, 0, 1, 0, 0, 0, 1, 0, 0] 
if given_words[0] != expected_words or given_labels[0] != expected_labels: 
    raise Exception("Something wrong with reading file") 