# 5. Named Entity Recognition (NER)

## Setup

In [39]:
! pip install -q transformers datasets torch seqeval

In [40]:
from datasets import load_dataset, load_metric

from transformers import AutoTokenizer, DataCollatorForTokenClassification, AutoModelForTokenClassification, pipeline

## Data Load

In [41]:
data = load_dataset('conll2003')
data


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [42]:
data['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

**Note** that punctuation is included in the list of words, and the words have not been lowercased.

In [43]:
data['train'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

**Note** that POS tags, chunk tags, and NER tags are all sequences of `ClassLabel` values: each tag type has its own set of classes, and each class name is a string.

In [44]:
data['train'].features['ner_tags'].feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

The `names` property of the `ClassLabel` feature returns a list of strings, which we can use later to map label IDs back to label names.
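
As a small illustration of that mapping, the sketch below converts the `ner_tags` of the first training example back to label names. The values are copied from the outputs above so the snippet runs without downloading the dataset; the variable names `ner_label_names` and `tag_names` are only illustrative.

```python
# Label names copied from the output above (normally read from the dataset features).
ner_label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Tokens and NER tag IDs of the first training example.
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Map each integer ID back to its string label.
tag_names = [ner_label_names[i] for i in ner_tags]

for token, tag in zip(tokens, tag_names):
    print(f"{token}\t{tag}")
```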

In [45]:
# save for later
label_names = data['train'].features['ner_tags'].feature.names

## Data Tokenization

In [46]:
# We could also try using BERT directly
checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

If we want to tokenize a document that has already been split into words, like this one, we just need to pass the argument `is_split_into_words=True`:

In [47]:
idx = 0
t = tokenizer(data['train'][idx]['tokens'], is_split_into_words=True)
t

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [48]:
type(t)

transformers.tokenization_utils_base.BatchEncoding

Although the tokenized data looks like a dictionary, it is really a `BatchEncoding` object. This object has some useful methods; for example, we can call `tokens()` to see the tokens as strings instead of integer IDs.

In [49]:
t.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

We see the usual `[CLS]` and `[SEP]` [special tokens from BERT](https://www.neclab.eu/human-centric-ai-a-road-map-to-human-ai-collaboration/attending-to-future-tokens-for-bidirectional-sequence-generation#:~:text=The%20special%20tokens%20have%20specific,with%20the%20%5BMASK%5D%20token). The challenge: the tokenizer has split some words into sub-word tokens, so the input sequence (tokens) and the target sequence (one NER tag per word) no longer have the same length. We need to align the targets to the tokens.

To fix this, we will **expand** our dataset: every token of a word that was split into multiple tokens gets that word's target, with a `B-` tag switched to the matching `I-` tag on the continuation tokens.

But wait! What about the special tokens from BERT (e.g. `[CLS]` and `[SEP]`)? We need to create targets for those too.
- Why do we need to do this? A token classification model produces one output per input token, special tokens included. It's similar to an RNN applied to the same task, where we have x(t) → h(t) → y(t) for all t, so the target sequence must have one entry per token.
- What value should we set for them? -100. Why? Because that's the value the loss function ignores (it is the default `ignore_index` of PyTorch's cross-entropy loss, which Hugging Face models rely on), so these positions are not used to update the model weights.
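
To make "ignored by the loss" concrete, here is a minimal pure-Python sketch of a cross-entropy that skips positions labelled -100. It is an illustration only, not PyTorch's implementation; the helper name `masked_cross_entropy` and the toy logits are assumptions made for this example.

```python
import math

IGNORE_INDEX = -100  # sentinel used for special tokens (and, later, padding)

def masked_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # this position contributes nothing to the loss
        # numerically stable log-sum-exp over the class scores
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[target]
        count += 1
    return total / count if count else 0.0

# Three positions, two classes; the first position plays the role of [CLS].
logits = [[5.0, -5.0], [2.0, 0.0], [0.0, 2.0]]
loss_a = masked_cross_entropy(logits, [-100, 0, 1])

# Changing the scores at the ignored position leaves the loss unchanged.
logits_changed = [[-5.0, 5.0], [2.0, 0.0], [0.0, 2.0]]
loss_b = masked_cross_entropy(logits_changed, [-100, 0, 1])

print(loss_a, loss_b)
```

Because the ignored position never enters the sum, it also produces no gradient, which is why these positions do not influence the weight updates.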

We’ll need to write our own algorithm for this.

In [50]:
# value of i indicates it's the i'th word in the input sentence (counting from 0)
t.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

In [51]:
t.word_ids()[4]

3
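
As a quick way to read `word_ids()`, the sketch below groups the sub-tokens by the word they came from and reports which words were split. The token and word-ID lists are copied from the outputs above; `pieces_per_word` and `split_words` are illustrative names.

```python
# Values copied from the outputs above.
tokens = ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott',
          'British', 'la', '##mb', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

# Collect the sub-tokens that belong to each word; skip special tokens (None).
pieces_per_word = {}
for token, word_id in zip(tokens, word_ids):
    if word_id is None:
        continue
    pieces_per_word.setdefault(word_id, []).append(token)

# Words that were split into more than one sub-token.
split_words = {w: p for w, p in pieces_per_word.items() if len(p) > 1}
print(split_words)
```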

In [52]:
# Detail: map each B-<tag> ID to its I-<tag> ID, for every entity type
#['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
begin2inside = {
  1: 2,
  3: 4,
  5: 6,
  7: 8,
}

In [53]:
def align_targets(labels, word_ids):
    aligned_labels = []
    last_word = None
    for word in word_ids:
        if word is None:         # token like [CLS]
            label = -100
        elif word != last_word:  # new word!
            label = labels[word]
        else:                    # same word as before
            label = labels[word]

            # change B-<tag> to I-<tag> if necessary
            if label in begin2inside:
                label = begin2inside[label]

        # add the label 
        aligned_labels.append(label)

        # update last word
        last_word = word

    return aligned_labels

In [54]:
# try our function
labels = data['train'][idx]['ner_tags']
word_ids = t.word_ids()
aligned_targets = align_targets(labels, word_ids)
aligned_targets

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]

In [55]:
aligned_labels = [label_names[i] if i >= 0 else None for i in aligned_targets]
for x, y in zip(t.tokens(), aligned_labels):
    print(f"{x}\t{y}")

[CLS]	None
EU	B-ORG
rejects	O
German	B-MISC
call	O
to	O
boycott	O
British	B-MISC
la	O
##mb	O
.	O
[SEP]	None


In [56]:
# make up a fake input just to test that B vs I works correctly
words = [
  '[CLS]', 'Ger', '##man', 'call', 'to', 'boycott', 'Micro', '##soft', '[SEP]']
word_ids = [None, 0, 0, 1, 2, 3, 4, 4, None]
labels = [7, 0, 0, 0, 3]
aligned_targets = align_targets(labels, word_ids)
aligned_labels = [label_names[t] if t >= 0 else None for t in aligned_targets]
for x, y in zip(words, aligned_labels):
  print(f"{x}\t{y}")

[CLS]	None
Ger	B-MISC
##man	I-MISC
call	O
to	O
boycott	O
Micro	B-ORG
##soft	I-ORG
[SEP]	None


In [57]:
# tokenize both inputs and targets
def tokenize_fn(batch):
    # tokenize the input sequence first
    # this populates input_ids, attention_mask, etc.
    tokenized_inputs = tokenizer(
        batch['tokens'], truncation=True, is_split_into_words=True
    )

    labels_batch = batch['ner_tags'] # original targets
    aligned_labels_batch = []
    for i, labels in enumerate(labels_batch):
        word_ids = tokenized_inputs.word_ids(i)
        aligned_labels_batch.append(align_targets(labels, word_ids))

    # recall: the 'target' must be stored in key called 'labels'
    tokenized_inputs['labels'] = aligned_labels_batch

    return tokenized_inputs

The function above returns our tokenized inputs, which now contain the input IDs and attention masks as well as the aligned labels.

In [58]:
# want to remove these from model inputs - they are neither inputs nor targets
data["train"].column_names

['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags']

In [59]:
tokenized_datasets = data.map(
  tokenize_fn,
  batched=True,
  remove_columns=data["train"].column_names,
)

In [60]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

As we see, the dataset now contains only the columns we need.

### Data Collator

In [61]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [62]:
tokenized_datasets['train'][0:2]

{'input_ids': [[101,
   7270,
   22961,
   1528,
   1840,
   1106,
   21423,
   1418,
   2495,
   12913,
   119,
   102],
  [101, 1943, 14428, 102]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]],
 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100], [-100, 1, 2, -100]]}

Unfortunately, this dict of lists is not a format the data collator can work with. It expects a list containing one dict per sample.

In [63]:
[tokenized_datasets['train'][i] for i in range(2)]

[{'input_ids': [101,
   7270,
   22961,
   1528,
   1840,
   1106,
   21423,
   1418,
   2495,
   12913,
   119,
   102],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]},
 {'input_ids': [101, 1943, 14428, 102],
  'attention_mask': [1, 1, 1, 1],
  'labels': [-100, 1, 2, -100]}]
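
The reshaping from a dict of lists to a list of per-sample dicts can also be sketched generically. The helper `to_list_of_samples` below is hypothetical (it is not part of `datasets` or `transformers`), and the input is a shortened version of the slice shown above.

```python
def to_list_of_samples(columns):
    """Turn {'key': [v0, v1, ...]} into [{'key': v0}, {'key': v1}, ...]."""
    keys = list(columns)
    n_rows = len(columns[keys[0]]) if keys else 0
    return [{key: columns[key][i] for key in keys} for i in range(n_rows)]

# Shortened, made-up version of the columnar slice shown above.
columnar = {
    'input_ids': [[101, 7270, 102], [101, 1943, 14428, 102]],
    'attention_mask': [[1, 1, 1], [1, 1, 1, 1]],
    'labels': [[-100, 3, -100], [-100, 1, 2, -100]],
}

samples = to_list_of_samples(columnar)
print(samples[1])
```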

In [64]:
# example
batch = data_collator([tokenized_datasets['train'][i] for i in range(2)])
batch['labels']

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

**Note** that we get back a Torch tensor, with the labels of the shorter sample padded to the length of the longest one using -100. If these padded positions were counted during evaluation, they could distort our accuracy and give the false impression that the model is performing great, so we'll need to filter out the -100 positions when we compute metrics.
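
The effect on accuracy can be shown with a small pure-Python sketch. It pads the two label lists the way the collator output above does, then compares a naive accuracy (where padded and special positions are treated as ordinary `O` labels) with a masked one. The helpers `pad_labels` and `accuracy` and the all-`O` predictions are assumptions made for this illustration.

```python
def pad_labels(label_lists, pad_value):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]

def accuracy(preds, targets, skip=None):
    """Token accuracy; positions whose target equals `skip` are left out."""
    pairs = [(p, t) for p, t in zip(preds, targets) if t != skip]
    return sum(p == t for p, t in pairs) / len(pairs)

long_labels = [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
short_labels = [-100, 1, 2, -100]

# Padding with -100 marks the extra positions as "ignore".
padded = pad_labels([long_labels, short_labels], pad_value=-100)

# Hypothetical model that predicts 'O' (id 0) at every position of the short sample.
preds = [0] * len(padded[1])

# If padding and special tokens had been labelled 'O' instead of -100,
# each of those positions would count as a correct prediction.
wrongly_labelled = [0 if t == -100 else t for t in padded[1]]
naive = accuracy(preds, wrongly_labelled)
masked = accuracy(preds, padded[1], skip=-100)

print(f"naive={naive:.3f} masked={masked:.3f}")
```

In this toy case the model gets both real tokens wrong, yet the naive score is high because ten of the twelve positions are padding or special tokens.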

## Model evaluation

In [65]:
metric = load_metric("seqeval")


In [66]:
# Test it out. This metric assumes we're working with batches of sentences (we should pass a list of lists)
metric.compute(
    predictions=[[0, 0, 0]],
    references=[[0, 0, 1]]
)


{'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 0.6666666666666666}

In [67]:
# Same test with string labels (still not valid NER tag names)
metric.compute(
    predictions=[['A','A','A']],
    references=[['A','A','B']]
)



{'_': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 0.6666666666666666}

In [68]:
# Test it out, right formats now
metric.compute(
    predictions=[['O', 'O', 'I-ORG', 'B-MISC']],
    references=[['O', 'B-ORG', 'I-ORG', 'B-MISC']]
)

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'ORG': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'overall_precision': 0.5,
 'overall_recall': 0.5,
 'overall_f1': 0.5,
 'overall_accuracy': 0.75}

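**Note** that we get metrics for each entity type, and then overall metrics. 
**Note 2:** It may seem off that ORG precision is 0 even though some ORG tokens were predicted correctly (intuitively it should be 0.5). This is because seqeval evaluates at the entity level rather than the token level: a predicted entity counts as correct only if its type and its exact boundaries match the reference. Here the reference ORG entity spans two tokens (`B-ORG`, `I-ORG`) while the predicted one covers only the second token, so it does not count as a match.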
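
The entity-level scoring can be reproduced with a simplified sketch. This is a re-implementation for illustration, not seqeval's actual code: the helper `extract_entities` is hypothetical, and it assumes that an `I-X` tag which does not continue an `X` entity starts a new entity.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a list of BIO tags."""
    entities = set()
    current_type, start = None, None
    for i, tag in enumerate(tags + ['O']):  # trailing 'O' closes the last entity
        prefix, _, ent_type = tag.partition('-')
        continues = prefix == 'I' and ent_type == current_type
        if current_type is not None and not continues:
            # the running entity ended on the previous token
            entities.add((current_type, start, i - 1))
            current_type, start = None, None
        if prefix in ('B', 'I') and not continues:
            # a new entity starts here
            current_type, start = ent_type, i
    return entities

predictions = ['O', 'O', 'I-ORG', 'B-MISC']
references = ['O', 'B-ORG', 'I-ORG', 'B-MISC']

pred_entities = extract_entities(predictions)
true_entities = extract_entities(references)

# An entity is correct only if type, start and end all match.
correct = pred_entities & true_entities
precision = len(correct) / len(pred_entities)
recall = len(correct) / len(true_entities)

print(sorted(pred_entities), sorted(true_entities), precision, recall)
```

Only the MISC span matches exactly, so one of two predicted entities and one of two reference entities is correct, which agrees with the overall precision and recall of 0.5 reported above.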

Let's now write our compute metrics function.

In [69]:
import numpy as np

def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    preds = np.argmax(logits, axis=-1)

    # remove -100 from labels and predictions
    # and convert the label_ids to label names
    str_labels = [
        [label_names[t] for t in label if t != -100] for label in labels
    ]

    # do the same for predictions whenever true label is -100
    str_preds = [
        [label_names[p] for p, t in zip(pred, targ) if t != -100] \
            for pred, targ in zip(preds, labels)
    ]

    the_metrics = metric.compute(predictions=str_preds, references=str_labels)
    return {
        'precision': the_metrics['overall_precision'],
        'recall': the_metrics['overall_recall'],
        'f1': the_metrics['overall_f1'],
        'accuracy': the_metrics['overall_accuracy'],
    }
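
To see what the filtering step does on concrete values, here is a tiny worked example in plain Python (no NumPy). The logits, the shortened label set `short_label_names`, and the variable names are made up for illustration.

```python
short_label_names = ['O', 'B-PER', 'I-PER']  # shortened label set for the example

# Fake logits for one sentence: 4 positions, 3 classes.
logits = [
    [0.1, 0.2, 0.3],   # [CLS] position (ignored)
    [0.2, 2.5, 0.1],   # highest score at index 1 -> B-PER
    [0.3, 0.4, 1.9],   # highest score at index 2 -> I-PER
    [3.0, 0.1, 0.2],   # [SEP] position (ignored)
]
labels = [-100, 1, 2, -100]

# argmax over the class dimension, one prediction per position
pred_ids = [max(range(len(row)), key=row.__getitem__) for row in logits]

# keep only positions whose true label is not -100, then map IDs to names
str_preds = [short_label_names[p] for p, t in zip(pred_ids, labels) if t != -100]
str_labels = [short_label_names[t] for t in labels if t != -100]

print(str_preds, str_labels)
```

Note that the predictions at the ignored positions are dropped based on the true label, not on the predicted value, since the model always predicts some class there.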

## Model and trainer

We'll use `AutoModelForTokenClassification`, which is the appropriate class for our task. We need to tell it about our labels by passing the arguments `id2label` and `label2id`, defined from our input dataset. Reminder:
- Our `label_names` variable is a list, so it already stores a mapping from ID to label

In [70]:
id2label = {k:v for k, v in enumerate(label_names)}
label2id = {v:k for k, v in id2label.items()}

In [71]:
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [72]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "distilbert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

**NOTE:** I ran into an accelerator error that I was not able to fix.

In [73]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0929 | 0.086093 | 0.877898 | 0.911141 | 0.894211 | 0.976291 |
| 2 | 0.0457 | 0.075361 | 0.908852 | 0.931336 | 0.919957 | 0.981236 |
| 3 | 0.0324 | 0.070182 | 0.91895 | 0.93689 | 0.927833 | 0.982325 |


Checkpoint destination directory distilbert-finetuned-ner/checkpoint-1756 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory distilbert-finetuned-ner/checkpoint-3512 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory distilbert-finetuned-ner/checkpoint-5268 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=5268, training_loss=0.08099188565664792, metrics={'train_runtime': 314.6155, 'train_samples_per_second': 133.887, 'train_steps_per_second': 16.744, 'total_flos': 460336113849150.0, 'train_loss': 0.08099188565664792, 'epoch': 3.0})

In [74]:
trainer.save_model('my_saved_model') # Save model to disk

In [75]:
ner = pipeline(
  "token-classification",
  model='my_saved_model',
  aggregation_strategy="simple",
  device=0,
)

In [76]:
s = "Bill Gates was the CEO of Microsoft in Seattle, Washington."
ner(s)

[{'entity_group': 'PER',
  'score': 0.99922806,
  'word': 'Bill Gates',
  'start': 0,
  'end': 10},
 {'entity_group': 'ORG',
  'score': 0.9982521,
  'word': 'Microsoft',
  'start': 26,
  'end': 35},
 {'entity_group': 'LOC',
  'score': 0.99892324,
  'word': 'Seattle',
  'start': 39,
  'end': 46},
 {'entity_group': 'LOC',
  'score': 0.997781,
  'word': 'Washington',
  'start': 48,
  'end': 58}]
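
The `start` and `end` values in the pipeline output are character offsets into the input string. As a quick check, the sketch below slices the sentence with the offsets copied from the output above (scores omitted); it does not call the model.

```python
s = "Bill Gates was the CEO of Microsoft in Seattle, Washington."

# Entities copied from the pipeline output above (scores left out).
entities = [
    {'entity_group': 'PER', 'word': 'Bill Gates', 'start': 0, 'end': 10},
    {'entity_group': 'ORG', 'word': 'Microsoft', 'start': 26, 'end': 35},
    {'entity_group': 'LOC', 'word': 'Seattle', 'start': 39, 'end': 46},
    {'entity_group': 'LOC', 'word': 'Washington', 'start': 48, 'end': 58},
]

# Slicing with [start:end] recovers the surface text of each entity.
spans = [(e['entity_group'], s[e['start']:e['end']]) for e in entities]
for group, text in spans:
    print(f"{group}\t{text}")
```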