# Named Entity Recognition using Transformers

- This notebook is inspired by LazyProgrammer's course on Udemy: https://www.udemy.com/course/data-science-transformers-nlp/
- A few tweaks were made as part of a self-learning journey.
- Dataset: 'conll2003'


## Install and Import Packages

In [33]:
!pip install -q transformers[torch] datasets seqeval

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [38]:

import numpy as np

from datasets import load_dataset, load_metric

from transformers import (AutoTokenizer, 
                          DataCollatorForTokenClassification, 
                          AutoModelForTokenClassification, 
                          TrainingArguments, 
                          Trainer)

from huggingface_hub import notebook_login

In [5]:
notebook_login()

## The Data

In [3]:
# Load data from HuggingFace dataset
data = load_dataset('conll2003')
data

Downloading and preparing dataset conll2003/conll2003 (download: 959.94 KiB, generated: 9.78 MiB, post-processed: Unknown size, total: 10.72 MiB) to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee...

Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee. Subsequent calls will reuse this data.

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3454
    })
})

In [6]:
# Check a sample train data
data['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [7]:
# Check features
data['train'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(num_classes=47, names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(num_classes=23, names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

In [8]:
# Check 'ner_tags' features
data['train'].features['ner_tags']

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [10]:
# Check the names of features of 'ner_tags'
data['train'].features['ner_tags'].feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [12]:
# Save feature names
label_names = data['train'].features['ner_tags'].feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
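
The label names follow the IOB2 scheme: `B-<type>` marks the first word of an entity, `I-<type>` marks a continuation of it, and `O` marks words outside any entity. As a minimal sketch of how these tags group into entities, the snippet below decodes the tag ids of the first training example into spans. The helper `iob2_to_spans` is a hypothetical name introduced here for illustration; it is not part of `datasets` or `transformers`.

```python
# Label names copied from the notebook output above.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def iob2_to_spans(tag_ids):
    """Group IOB2 tag ids into (entity_type, start, end) spans, end exclusive.

    Hypothetical helper for illustration only.
    """
    spans = []
    current = None  # [entity_type, start] of the span being built, if any
    for i, tag_id in enumerate(tag_ids):
        prefix, _, entity_type = label_names[tag_id].partition('-')
        # An I- tag extends the open span only if the entity type matches.
        if prefix == 'I' and current is not None and current[0] == entity_type:
            continue
        # Anything else closes the open span, if there is one.
        if current is not None:
            spans.append((current[0], current[1], i))
            current = None
        # B- starts a new span; a stray I- is treated as a start as well.
        if prefix in ('B', 'I'):
            current = [entity_type, i]
    if current is not None:
        spans.append((current[0], current[1], len(tag_ids)))
    return spans


# First training example, copied from the output of data['train'][0].
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

for entity_type, start, end in iob2_to_spans(ner_tags):
    print(entity_type, tokens[start:end])
```

For this sentence the sketch yields one ORG span ("EU") and two MISC spans ("German", "British").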

## Tokenization

In [13]:
# Select a transformer model
# A cased model is used because casing carries signal for NER:
# "Bill" in "Bill Gates" is a person's name,
# while "bill" in "I paid the bill" is a common noun.
MODEL_CKPT = 'distilbert-base-cased'

In [14]:
# Load the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(MODEL_CKPT)

In [18]:
# Sanity check on the first training example
idx = 0
t = tokenizer(data['train'][idx]['tokens'], is_split_into_words=True)
t.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

In [19]:
t.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

### Target Alignment

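- After subword tokenization, the tokens no longer align with the targets, e.g. in the case above: 'lamb' -> 'la', '##mb'.
- The targets therefore have to be re-aligned to the tokens.
- When a word is split into multiple tokens, every token gets the word's target, except that a B-<tag> on the first token becomes I-<tag> on the remaining tokens.
- The special tokens [CLS] and [SEP] also need targets; they are set to -100 so that they are ignored.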

In [20]:
# Map each B-<tag> label id to the matching I-<tag> label id
begin2inside = {
    1: 2, 
    3: 4, 
    5: 6, 
    7: 8
}
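
The hard-coded mapping relies on each `I-<tag>` sitting directly after its `B-<tag>` in the label list. As a hedged alternative, the same dictionary can be derived from the label names, which avoids silent errors if the label order ever differs. This is a sketch that assumes every `B-` label has a matching `I-` label.

```python
# Label names copied from the notebook output above.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# For every B-<tag> id, look up the id of the I-<tag> with the same tag.
begin2inside = {
    idx: label_names.index('I-' + name[2:])
    for idx, name in enumerate(label_names)
    if name.startswith('B-')
}

print(begin2inside)  # {1: 2, 3: 4, 5: 6, 7: 8}
```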

In [21]:
def align_targets(labels, word_ids):
    """Expand word-level labels to token-level labels using the word ids."""
    aligned_labels = []
    last_word = None
    for word in word_ids:
        if word is None:
            # It's a special token like [CLS] or [SEP]
            label = -100  # -100 is ignored by the loss function
        elif word != last_word:
            # It's the first token of a new word
            label = labels[word]
        else:
            # It's a continuation token of the same word
            label = labels[word]

            # Change B-<tag> to I-<tag> if necessary
            if label in begin2inside:
                label = begin2inside[label]

        # Add the label
        aligned_labels.append(label)
        # Update the last word
        last_word = word

    return aligned_labels

In [22]:
# Try the alignment function
labels = data['train'][idx]['ner_tags']
word_ids = t.word_ids()
aligned_targets = align_targets(labels, word_ids)
aligned_targets

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]

In [25]:
# Use a loop variable other than `t`, which already holds the tokenizer output
aligned_labels = [label_names[i] if i >= 0 else None for i in aligned_targets]
for x, y in zip(t.tokens(), aligned_labels):
    print(f"{x}\t{y}")

[CLS]	None
EU	B-ORG
rejects	O
German	B-MISC
call	O
to	O
boycott	O
British	B-MISC
la	O
##mb	O
.	O
[SEP]	None
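
A different alignment strategy, not the one used in this notebook, labels only the first subword token of each word and masks the continuation tokens with -100, so that each word contributes to the loss exactly once. The sketch below shows that variant on the same example; `align_targets_first_only` is a hypothetical name, and the `labels` and `word_ids` values are copied from the outputs above.

```python
def align_targets_first_only(labels, word_ids, ignore_index=-100):
    """Label the first token of each word; mask everything else.

    Alternative to align_targets, shown for comparison only.
    """
    aligned = []
    last_word = None
    for word in word_ids:
        if word is None or word == last_word:
            # Special token or continuation subword: ignored by the loss
            aligned.append(ignore_index)
        else:
            # First subword of a new word keeps the word's label
            aligned.append(labels[word])
        last_word = word
    return aligned


labels = [3, 0, 7, 0, 0, 0, 7, 0, 0]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

print(align_targets_first_only(labels, word_ids))
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]
```

The only difference from the output above is the position of '##mb', which is masked instead of being labelled 'O'.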


In [27]:
# Function to Tokenize both inputs and targets
def tokenize_fn(batch):
    # Tokenize the input sequence first
    # This populates input_ids, attention_mask, etc.
    tokenized_inputs = tokenizer(
        batch['tokens'],
        truncation=True,
        is_split_into_words=True)
    # Original Targets
    labels_batch = batch['ner_tags']
    aligned_labels_batch = []
    for i, labels in enumerate(labels_batch):
        word_ids = tokenized_inputs.word_ids(i)
        aligned_labels_batch.append(align_targets(labels, word_ids))
        
    tokenized_inputs['labels'] = aligned_labels_batch
    return tokenized_inputs

In [28]:
# Map the tokenize_fn to datasets
tokenized_datasets = data.map(
    tokenize_fn, 
    batched=True, 
    remove_columns=data['train'].column_names)

In [30]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3454
    })
})

## Data Collator

In [31]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [32]:
# Example
batch = data_collator([tokenized_datasets['train'][i] for i in range(2)])
batch['labels']

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])
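
In the output above, the second example has only four real positions ([CLS], two word tokens, [SEP]); the remaining positions are padding. The collator pads the labels with -100 so that padded positions are ignored, just like special tokens. A minimal pure-Python sketch of that label padding follows, assuming right-side padding to the longest sequence in the batch; `pad_label_batch` is a hypothetical helper, not the collator's actual implementation.

```python
def pad_label_batch(label_lists, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels))
            for labels in label_lists]


# Aligned labels of the first two training examples before padding.
batch_labels = [
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],
    [-100, 1, 2, -100],
]

for row in pad_label_batch(batch_labels):
    print(row)
```

Both rows come out with 12 entries, matching the tensor printed above.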

## Metric

In [36]:
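# Instantiate the metric
# Note: load_metric was removed from recent `datasets` releases;
# there, use `evaluate.load('seqeval')` from the `evaluate` package instead
metric = load_metric('seqeval')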

In [45]:
# Function to compute metrics
def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    preds = np.argmax(logits, axis=-1)

    # Remove -100 from labels and predictions
    # Convert the label ids to label names
    str_labels = [
        [label_names[t] for t in label if t != -100] for label in labels
    ]
    str_preds = [
        [label_names[p] for p, t in zip(pred, targ) if t != -100]
        for pred, targ in zip(preds, labels)
    ]
    metrics_ = metric.compute(predictions=str_preds,
                              references=str_labels)
    return {
        'precision': metrics_['overall_precision'],
        'recall': metrics_['overall_recall'],
        'f1': metrics_['overall_f1'],
        'accuracy': metrics_['overall_accuracy']
    }
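
Precision, recall and F1 from seqeval are computed over whole entities, while accuracy is computed over individual tokens. The sketch below is a simplified, pure-Python illustration of entity-level scoring on a toy example; it is not seqeval's implementation, and the helpers `extract_spans` and `entity_prf` are hypothetical names. A predicted entity counts as correct only if its type and both boundaries match exactly.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in an IOB2 sequence."""
    spans = set()
    start, etype = None, None
    # A trailing 'O' sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, name = tag.partition('-')
        if prefix == 'I' and name == etype:
            continue  # the current span keeps going
        if etype is not None:
            spans.add((etype, start, i))
        # B- (or a stray I-) opens a new span; 'O' leaves none open.
        start, etype = (i, name) if prefix in ('B', 'I') else (None, None)
    return spans


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall and F1."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the model cuts the person entity one token short.
true_seqs = [['B-PER', 'I-PER', 'O', 'B-LOC']]
pred_seqs = [['B-PER', 'O', 'O', 'B-LOC']]

print(entity_prf(true_seqs, pred_seqs))  # (0.5, 0.5, 0.5)
```

Three of the four tokens are tagged correctly here (token accuracy 0.75), yet only one of the two entities is an exact match, which is why accuracy tends to sit well above F1 in the training results.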

## Fine-tune the model

In [39]:
# Create id2label, label2id dictionary
id2label = {k: v for k, v in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [40]:
# Instantiate Model
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_CKPT, 
    id2label=id2label, 
    label2id=label2id)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this 

In [41]:
model_name = f"{MODEL_CKPT}-finetuned-CONLL2003"

In [42]:
# Training Arguments
# Note: newer transformers releases rename `evaluation_strategy` to `eval_strategy`
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=5,
                                  learning_rate=2e-5,
                                  weight_decay=0.01,
                                  evaluation_strategy='epoch',
                                  disable_tqdm=False,
                                  push_to_hub=True)

In [47]:
# Instantiate trainer
trainer = Trainer(model=model, 
                 args=training_args, 
                 train_dataset=tokenized_datasets['train'], 
                 eval_dataset=tokenized_datasets['validation'], 
                 data_collator=data_collator, 
                 compute_metrics=compute_metrics, 
                 tokenizer=tokenizer)

/kaggle/working/distilbert-base-cased-finetuned-CONLL2003 is already a clone of https://huggingface.co/EulerianKnight/distilbert-base-cased-finetuned-CONLL2003. Make sure you pull the latest changes with `repo.git_pull()`.


In [48]:
# train the model
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0302 | 0.083189 | 0.905478 | 0.931841 | 0.918471 | 0.981191 |
| 2 | 0.024 | 0.086726 | 0.923663 | 0.938741 | 0.931141 | 0.983296 |
| 3 | 0.0123 | 0.090936 | 0.922368 | 0.94379 | 0.932956 | 0.984473 |
| 4 | 0.0059 | 0.096248 | 0.921839 | 0.9448 | 0.933178 | 0.98437 |
| 5 | 0.0026 | 0.098283 | 0.927629 | 0.946988 | 0.937209 | 0.984812 |


TrainOutput(global_step=8780, training_loss=0.014584794206065305, metrics={'train_runtime': 640.5511, 'train_samples_per_second': 109.609, 'train_steps_per_second': 13.707, 'total_flos': 767854087685244.0, 'train_loss': 0.014584794206065305, 'epoch': 5.0})
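
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a per-device batch size of 8 (the `TrainingArguments` default, since the notebook does not set one) and a single device; under those assumptions the step count and throughput figures above are reproduced.

```python
import math

num_train_examples = 14042   # rows in the train split
num_epochs = 5               # num_train_epochs
batch_size = 8               # assumed: default batch size, single device
train_runtime = 640.5511     # seconds, from TrainOutput

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1756 8780

# Throughput as reported in the metrics dictionary (approximately).
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(f"{samples_per_second:.3f} {steps_per_second:.3f}")
```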

In [49]:
trainer.push_to_hub(commit_message='First Commit')

To https://huggingface.co/EulerianKnight/distilbert-base-cased-finetuned-CONLL2003
   ec67e48..6c55bf3  main -> main

To https://huggingface.co/EulerianKnight/distilbert-base-cased-finetuned-CONLL2003
   6c55bf3..e2420a8  main -> main



'https://huggingface.co/EulerianKnight/distilbert-base-cased-finetuned-CONLL2003/commit/6c55bf31d243bc403a209b78f453d045395c8132'