
# Finetuning BERT for Named Entity Recognition

Named Entity Recognition (NER) identifies named entities in text and classifies them into categories such as places, people, organizations, and dates.

NER Workflow

Text Input -> Tokenization -> Entity Recognition -> Token Classification -> Output

Installing Necessary Dependencies

In [None]:
pip install transformers

In [None]:
pip install datasets

Loading the Libraries

In [63]:
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification
from transformers import TrainingArguments
from transformers import Trainer
import numpy as np
from transformers import pipeline
import evaluate

In [None]:
data = load_dataset('conllpp')
data

In [5]:
data['train'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

We will use tokens and ner_tags for the analysis.

Checking the first training example

In [6]:
pd.DataFrame(data['train'][:])[['tokens','ner_tags']].iloc[0]

tokens      [EU, rejects, German, call, to, boycott, Briti...
ner_tags                          [3, 0, 7, 0, 0, 0, 7, 0, 0]
Name: 0, dtype: object


Getting the index of each NER tag

In [7]:
tags = data['train'].features['ner_tags'].feature

In [8]:
indextotag = {idx:tag for idx, tag in enumerate(tags.names)}
tagtoindex = {tag:idx for idx, tag in enumerate(tags.names)}

The resulting index-to-tag and tag-to-index mappings

In [9]:
indextotag

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

In [10]:
tagtoindex

{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

Developing Tag Names

In [11]:
def develop_tags_names(batch):
  tag_name= {'ner_tags_str': [tags.int2str(idx) for idx in batch['ner_tags']]}
  return tag_name

The mapping is applied to every split (train, validation, and test)

In [None]:
data = data.map(develop_tags_names)

In [13]:
pd.DataFrame(data['train'][:])[['tokens','ner_tags','ner_tags_str']].iloc[0]

tokens          [EU, rejects, German, call, to, boycott, Briti...
ner_tags                              [3, 0, 7, 0, 0, 0, 7, 0, 0]
ner_tags_str            [B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]
Name: 0, dtype: object


The integer `ner_tags` indices are now also available as readable strings in `ner_tags_str`.

# Building the Model

Tokenization

In [None]:
checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [15]:
tokenizer.is_fast

True

In [16]:
data['train'][0]['tokens']

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

The dataset is already split into words, so this is the pretokenized version of the sentence.

In [17]:
inputs = data['train'][0]['tokens']
inputs = tokenizer(inputs, is_split_into_words=True)
inputs

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

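These are the input IDs and the attention mask produced by the tokenizer. The IDs 101 and 102 at either end are the special tokens `[CLS]` and `[SEP]`.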

In [18]:
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

The sentence has 9 words and therefore 9 NER tags, but tokenization produces 12 tokens (10 without `[CLS]` and `[SEP]`). The labels no longer line up with the tokens.

In [19]:
data['train'][0]['tokens']

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

The word `lamb` was split into two subword tokens (`la` and `##mb`), which causes the misalignment noted above.
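
The split of `lamb` into `la` and `##mb` comes from WordPiece's greedy longest-match-first lookup. Below is a minimal sketch of that lookup. It assumes a tiny hypothetical vocabulary (the real `distilbert-base-cased` vocabulary is far larger), and `wordpiece_split` is an illustrative helper, not part of the `transformers` API.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only to reproduce the split seen above.
toy_vocab = {"EU", "rejects", "British", "la", "##mb", "."}

print(wordpiece_split("lamb", toy_vocab))     # ['la', '##mb']
print(wordpiece_split("British", toy_vocab))  # ['British']
```

Words that are in the vocabulary as a whole stay intact, which is why only `lamb` is affected in this sentence.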

In [20]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

The two tokens with word id 7 (`la` and `##mb`) belong to the same word, `lamb`.
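
To make the role of the word ids concrete, the following sketch groups the subword tokens by word id. The two lists are copied from the outputs above, and `pieces_per_word` is just an illustrative name.

```python
from collections import defaultdict

# Values copied from the tokenizer outputs shown above.
tokens = ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott',
          'British', 'la', '##mb', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

pieces_per_word = defaultdict(list)
for token, word_id in zip(tokens, word_ids):
    if word_id is not None:  # skip special tokens such as [CLS] and [SEP]
        pieces_per_word[word_id].append(token)

print(pieces_per_word[7])    # ['la', '##mb']
print(len(pieces_per_word))  # 9
```

Grouping recovers the original 9 words, one per NER tag.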

In [21]:
print(data['train'][0]['tokens'])
print(data['train'][0]['ner_tags_str'])

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


Function to align the labels with tokens

In [22]:
def align_labels_with_tokens(labels, word_ids):
  new_labels = []
  current_word = None
  for word_id in word_ids:
    if word_id != current_word:
      # First token of a new word, or a special token that follows a word
      current_word = word_id
      label = -100 if word_id is None else labels[word_id]
      new_labels.append(label)

    elif word_id is None:
      new_labels.append(-100) # Special token: -100 is ignored by the loss

    else:
      # Continuation subword of the current word
      label = labels[word_id]

      if label % 2 == 1: # Odd ids are B-XXX; the next even id is the matching I-XXX
        label = label + 1
      new_labels.append(label)

  return new_labels

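Both subword tokens of word 7 now receive a label: the first keeps the word's label, and the continuation gets the matching I- label (or stays `O`).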

In [23]:
labels = data['train'][0]['ner_tags']
word_ids = inputs.word_ids()
print(labels, word_ids)

[3, 0, 7, 0, 0, 0, 7, 0, 0] [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]


In [24]:
align_labels_with_tokens(labels,word_ids)

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
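
The example sentence never exercises the odd-to-even branch, because its only split word (`lamb`) is tagged `O`. The sketch below uses a made-up two-word person name whose first word is split into two subword tokens. The function is repeated in condensed form so that the block runs on its own, and the `word_ids` list is hypothetical.

```python
def align_labels_with_tokens(labels, word_ids):
    """Condensed version of the alignment function, same behavior."""
    new_labels, current_word = [], None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)             # special token
        elif word_id != current_word:
            new_labels.append(labels[word_id])  # first subword keeps the word's label
        else:
            label = labels[word_id]
            # Continuation subword: B-XXX (odd id) becomes I-XXX (next even id)
            new_labels.append(label + 1 if label % 2 == 1 else label)
        current_word = word_id
    return new_labels

# Hypothetical input: word 0 is B-PER (1) and splits in two, word 1 is I-PER (2).
labels = [1, 2]
word_ids = [None, 0, 0, 1, None]
print(align_labels_with_tokens(labels, word_ids))  # [-100, 1, 2, 2, -100]
```

The second piece of the first word is relabelled from `B-PER` to `I-PER`, so an entity still has exactly one `B-` token.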

Tokenizing and Aligning Labels

In [25]:
def tokenize_align_labels(examples):
  tokenized_inputs = tokenizer(examples['tokens'], truncation = True, is_split_into_words = True)
  all_labels = examples['ner_tags']
  new_labels = []
  for i, labels in enumerate(all_labels):
    word_ids = tokenized_inputs.word_ids(i)
    new_labels.append(align_labels_with_tokens(labels, word_ids))
  tokenized_inputs['labels'] = new_labels
  return tokenized_inputs

In [None]:
tokenized_data = data.map(tokenize_align_labels, batched = True, remove_columns = data['train'].column_names)

In [27]:
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

These three columns are all that is needed to train the model. The token-label alignment is now complete.

# Data Collation and Metrics

Data collation is the step that assembles individual examples into batches, padding them to a common length.

In [28]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [29]:
batch = data_collator([tokenized_data['train'][i] for i in range(2)])

In [30]:
batch

{'input_ids': tensor([[  101,  7270, 22961,  1528,  1840,  1106, 21423,  1418,  2495, 12913,
           119,   102],
        [  101,  1943, 14428,   102,     0,     0,     0,     0,     0,     0,
             0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])}

In [31]:
[tokenized_data['train'][i] for i in range(2)]

[{'input_ids': [101,
   7270,
   22961,
   1528,
   1840,
   1106,
   21423,
   1418,
   2495,
   12913,
   119,
   102],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]},
 {'input_ids': [101, 1943, 14428, 102],
  'attention_mask': [1, 1, 1, 1],
  'labels': [-100, 1, 2, -100]}]
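
The padding visible in the batch above can be approximated in plain Python. This is only a sketch of the observable behavior: the real `DataCollatorForTokenClassification` also converts the lists to tensors and supports further options. `pad_batch` and its parameters are hypothetical names, and the pad values (0 for inputs, -100 for labels) are read off the output above.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every feature list to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# The first two training examples, copied from the output above.
features = [
    {"input_ids": [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102],
     "attention_mask": [1] * 12,
     "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]},
    {"input_ids": [101, 1943, 14428, 102],
     "attention_mask": [1] * 4,
     "labels": [-100, 1, 2, -100]},
]

batch = pad_batch(features)
print(batch["labels"][1])
# [-100, 1, 2, -100, -100, -100, -100, -100, -100, -100, -100, -100]
```

Padding the labels with -100 keeps the padded positions out of the loss, just like the special tokens.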

# Metrics

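NER is evaluated at the entity level: a prediction counts as correct only when the whole entity span and its type match the reference. The `seqeval` metric computes precision, recall, and F1 in this way.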

In [None]:
pip install seqeval

In [None]:
pip install evaluate

In [36]:
metric = evaluate.load('seqeval')


In [37]:
ner_feature = data['train'].features['ner_tags']
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [38]:
label_names = ner_feature.feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [39]:
labels = data['train'][0]['ner_tags']
labels = [label_names[i] for i in labels]
labels

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

In [40]:
predictions = labels.copy()
predictions[2] = 'O'
metric.compute(predictions=[predictions], references = [labels])

{'MISC': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}
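
The overall numbers above can be reproduced by hand. The sketch below extracts entity spans from the tag lists and compares them as sets. It assumes well-formed IOB2 tags, and `extract_entities` is an illustrative helper rather than the seqeval implementation, which handles more schemes and edge cases.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a list of IOB2 tags."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if not inside:
            if ent_type is not None:
                entities.add((ent_type, start, i))  # end index is exclusive
            if tag == "O":
                start, ent_type = None, None
            else:
                start, ent_type = i, tag[2:]  # a new span opens here
    return entities

labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
predictions = labels.copy()
predictions[2] = 'O'  # drop the first MISC entity, as in the cell above

gold = extract_entities(labels)
pred = extract_entities(predictions)
correct = len(gold & pred)

precision = correct / len(pred)  # 2 of 2 predicted entities are right
recall = correct / len(gold)     # 2 of 3 reference entities are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(p == g for p, g in zip(predictions, labels)) / len(labels)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# 1.0 0.667 0.8 0.889
```

Accuracy is the only token-level number here: 8 of the 9 tags agree, even though one of three entities was missed.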

Now computing the metrics over the full evaluation set

In [60]:
def computing_metrics(eval_predictions):
  logits, labels = eval_predictions
  predictions = np.argmax(logits, axis = -1)
  # Drop positions labelled -100 (special tokens and padding)
  true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
  # Look up names with the predicted id p; l only selects the positions to keep
  true_predictions = [[label_names[p] for p, l in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]
  all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
  return all_metrics
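
The masking step can be checked on a toy input. The sketch below uses plain Python instead of NumPy, with made-up one-hot "logits" for a single four-token sequence. `argmax` and `one_hot_logits` are hypothetical helpers for illustration.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def one_hot_logits(class_id, n=len(label_names)):
    """Hypothetical logits: 1.0 at class_id, 0.0 elsewhere."""
    return [1.0 if i == class_id else 0.0 for i in range(n)]

# One made-up sequence: [CLS] word word [SEP], gold labels B-PER I-PER.
labels = [[-100, 1, 2, -100]]
logits = [[one_hot_logits(0), one_hot_logits(1), one_hot_logits(0), one_hot_logits(5)]]

predictions = [[argmax(row) for row in seq] for seq in logits]

true_labels = [[label_names[l] for l in seq if l != -100] for seq in labels]
# Names are looked up with the predicted id p; the label l only decides
# which positions are kept.
true_predictions = [
    [label_names[p] for p, l in zip(pred_seq, label_seq) if l != -100]
    for pred_seq, label_seq in zip(predictions, labels)
]

print(true_labels)       # [['B-PER', 'I-PER']]
print(true_predictions)  # [['B-PER', 'O']]
```

Whatever the model predicts at the `[CLS]` and `[SEP]` positions is discarded, and the two lists differ wherever the model is wrong.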


# Model Training

In [43]:
id2label = {i:label for i, label in enumerate(label_names)}
label2id = {label:i for i, label in enumerate(label_names)}

In [44]:
print(id2label)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


In [55]:
model = AutoModelForTokenClassification.from_pretrained(
                                                    checkpoint,
                                                    id2label=id2label,
                                                    label2id=label2id)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
args = TrainingArguments("distilbert-finetuned-ner",
                         evaluation_strategy = "epoch",
                         save_strategy="epoch",
                         learning_rate = 2e-5,
                         num_train_epochs=1,
                         weight_decay=0.01)



In [61]:
trainer = Trainer(model=model,
                  args=args,
                  train_dataset = tokenized_data['train'],
                  eval_dataset = tokenized_data['validation'],
                  data_collator=data_collator,
                  compute_metrics=computing_metrics,
                  tokenizer=tokenizer)

trainer.train()

Epoch,Training Loss,Validation Loss,Loc,Misc,Org,Per,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
1,0.0487,0.074711,"{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1837}","{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 922}","{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1341}","{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1842}",1.0,1.0,1.0,1.0

The scores of 1.0 in this logged run do not measure model quality. In this run `true_predictions` was built from the reference ids `l` rather than the predicted ids `p`, so seqeval compared the labels with themselves. The non-zero validation loss confirms that the model's predictions are not perfect.


Trainer is attempting to log a value of "{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1837}" of type <class 'dict'> for key "eval/LOC" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 922}" of type <class 'dict'> for key "eval/MISC" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1341}" of type <class 'dict'> for key "eval/ORG" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1842}" of type <class 'dict'> for key "eval/PER" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped 

TrainOutput(global_step=1756, training_loss=0.06423409924691793, metrics={'train_runtime': 115.5788, 'train_samples_per_second': 121.484, 'train_steps_per_second': 15.193, 'total_flos': 153520489309824.0, 'train_loss': 0.06423409924691793, 'epoch': 1.0})
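
The step count and throughput in `TrainOutput` can be cross-checked with a little arithmetic. The sketch assumes the `Trainer` default of 8 examples per device and a single device, since no batch size is set in `TrainingArguments` above.

```python
import math

num_train_examples = 14041  # size of the train split reported above
train_runtime = 115.5788    # seconds, from TrainOutput
batch_size = 8              # assumption: default per_device_train_batch_size, one device
num_epochs = 1

steps = math.ceil(num_train_examples / batch_size) * num_epochs
print(steps)                                         # 1756
print(round(num_train_examples / train_runtime, 3))  # 121.484
print(round(steps / train_runtime, 3))               # 15.193
```

All three values agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output.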

In [70]:
checkpoint = "/content/distilbert-finetuned-ner/checkpoint-1756"
token_classifier = pipeline(
    "token-classification", model=checkpoint, aggregation_strategy="simple"
)

token_classifier("My name BadriNarayanan S. Being grateful is important than any other. He's from tamilnadu")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity_group': 'PER',
  'score': 0.98984885,
  'word': 'BadriNarayanan S.',
  'start': 8,
  'end': 25},
 {'entity_group': 'LOC',
  'score': 0.6601926,
  'word': 'tamilnadu',
  'start': 80,
  'end': 89}]
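
The `start` and `end` values are character offsets into the input string, so the entities can be sliced back out of the text. The sketch below marks them inline. The entity list is copied from the output above with the scores omitted, and `annotate` is a hypothetical helper.

```python
text = "My name BadriNarayanan S. Being grateful is important than any other. He's from tamilnadu"

# Entity groups copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "start": 8, "end": 25},
    {"entity_group": "LOC", "start": 80, "end": 89},
]

def annotate(text, entities):
    """Wrap each entity span as [span](TYPE), right to left so offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        span = text[ent["start"]:ent["end"]]
        text = text[:ent["start"]] + f"[{span}]({ent['entity_group']})" + text[ent["end"]:]
    return text

print(annotate(text, entities))
# My name [BadriNarayanan S.](PER) Being grateful is important than any other. He's from [tamilnadu](LOC)
```

Processing the spans from right to left means earlier offsets are not shifted by the inserted markup.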

# Conclusion

The DistilBERT model was fine-tuned to perform NER on the `conllpp` dataset. The data was tokenized and its labels were aligned with the subword tokens to make it compatible with the model. After fine-tuning, the model was queried and recognized the named entities in the sample sentence.