# Named entity recognition - fine-tuning BERT

In this notebook we'll take a look at the process needed to fine-tune a pretrained [BERT](https://arxiv.org/abs/1810.04805) model to recognize named entities in text.

First things first, let's make sure we have a GPU instance in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to GPU
- if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command (fingers crossed it's a Tesla P100!!):

In [None]:
!nvidia-smi

Let's install some additional libraries: *transformers* for the BERT implementation, *gdown* for downloading files from Google Drive, *seqeval* for evaluating sequence labeling predictions and *termcolor* for colorful output.

In [None]:
!pip install transformers
!pip install gdown
!pip install seqeval
!pip install termcolor

In [None]:
import torch
import torch.nn as nn
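from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of the PyTorch one
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertConfig, DistilBertModel, DistilBertTokenizerFast, get_linear_schedule_with_warmup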
from sklearn.model_selection import train_test_split
import os
from tqdm import tqdm, trange
import numpy as np
import pandas as pd
from dataclasses import dataclass
import gc
import pickle
from seqeval.metrics import f1_score
from termcolor import colored

## Data

We are working with the famous [CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) NER task, which has become an established benchmark for evaluating new approaches to NER. It consists of Reuters news stories and contains four different entity types: person (PER), location (LOC), organization (ORG) and other miscellaneous entities (MISC).
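
To make the file format concrete, here is a minimal sketch of how such a file can be split into sentences. It assumes the space-separated four-column layout (token, POS tag, chunk tag, entity tag) that is commonly used for this dataset, with blank lines between sentences and `-DOCSTART-` lines between documents. The sample text and the `parse_conll` helper are hand-written for illustration and are not part of the workshop code.

```python
# Hand-written sample in the assumed CoNLL column layout: token, POS, chunk, entity tag.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def parse_conll(text):
    """Split CoNLL-style text into sentences, each a list of (word, entity_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # a blank line or document marker closes the sentence collected so far
            if current:
                sentences.append(current)
                current = []
        else:
            columns = line.split(" ")
            current.append((columns[0], columns[-1]))  # first column = word, last = entity tag
    if current:
        sentences.append(current)
    return sentences

sentences = parse_conll(sample)
for sentence in sentences:
    print(sentence)
```

The `B-` prefix marks the first token of an entity and `I-` marks its continuation, so "Peter Blackburn" forms a single PER entity spanning two tokens.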

Let's load the data and look at an example.

In [None]:
!gdown https://drive.google.com/uc?id=1q_EUUDtZpR5OWG-LukjTYdIc1tTSXjtY
!gdown https://drive.google.com/uc?id=1PChc0nIp2oLuLuhOB_dKLQ-dnJ1_dbjX

In [None]:
with open("/content/conllpp_train.txt", "r") as f:
  lines = f.readlines()

for line in lines[2:15]:  # shows the first couple of examples
  print(line, end="")  # lines already end with a newline

In [None]:
class TokenClassificationDataset(Dataset):
  def __init__(self, data_path, max_seq_len, label2id=None, id2label=None, sep_token="[SEP]", cls_token="[CLS]", pad_token="[PAD]"):
    # special tokens in BERT: sep separates two sentences, cls holds the representation used for
    # sequence classification, pad is used to extend inputs to the same length
    self.sep_token, self.cls_token, self.pad_token = sep_token, cls_token, pad_token
    self.ignore_label_id = nn.CrossEntropyLoss().ignore_index  # we use this to ignore padding tokens when computing loss
    self.max_seq_len = max_seq_len
    self.label2id, self.id2label = label2id, id2label
    
    # reads sentences/labels from dataset file
    examples, self.unique_labels = self.read_examples_labels_from_file(data_path)

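    if label2id is None or id2label is None:
      # sort the labels so the label <-> id mapping is the same on every run (set order is not stable)
      sorted_labels = sorted(self.unique_labels)
      self.label2id = {label: i for i, label in enumerate(sorted_labels)}
      self.id2label = {i: label for i, label in enumerate(sorted_labels)}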
      
    self.inputs = self.convert_examples_to_inputs(examples)

  def __len__(self):
    return len(self.inputs)

  def __getitem__(self, i):
    return self.inputs[i]

  def read_examples_labels_from_file(self, data_path):
    examples = []
    unique_labels = set()
    with open(data_path, encoding="utf-8") as f:
      words, labels = [], []
      for line in f:
        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
          if words:
            examples.append({"words": words, "labels": labels})
            words, labels = [], []
        else:
          splits = line.strip().split(" ")
          words.append(splits[0])
          labels.append(splits[-1])
          unique_labels.add(splits[-1])
      if words:
        examples.append({"words": words, "labels": labels})
    return examples, unique_labels

  def convert_examples_to_inputs(self, examples):
    inputs = []
    for example in examples:
      inputs.append(
          self.convert_example_to_inputs(example["words"], example["labels"], self.label2id, self.max_seq_len, self.sep_token, 
                                         self.cls_token, self.pad_token, self.ignore_label_id)
      )

    return inputs

  @staticmethod
  def convert_example_to_inputs(words, labels, label2id, max_seq_len=512, sep_token="[SEP]", 
                                cls_token="[CLS]", pad_token="[PAD]", ignore_index=-100):
    tokens, label_ids = [], []
    for i, (word, label) in enumerate(zip(words, labels)):
      word_tokens = tokenizer.tokenize(word)
      if len(word_tokens) > 0:
        tokens.extend(word_tokens)
        label_ids.extend([label2id[label]] + [ignore_index] * (len(word_tokens) - 1))

    # truncate sentence if it's too long:
    if len(tokens) > max_seq_len - 2:  # 2, because we need to add CLS and SEP tokens
      tokens = tokens[:(max_seq_len - 2)]
      label_ids = label_ids[:(max_seq_len - 2)]

    # adding the separator to the end of the sentence
    tokens += [sep_token]
    label_ids += [ignore_index]

    # adding the classifier token to the start of the sentence
    tokens = [cls_token] + tokens
    label_ids = [ignore_index] + label_ids

    # The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
    input_mask = [1] * len(tokens)

    # padding
    padding_length = max_seq_len - len(tokens)
    tokens += [pad_token] * padding_length
    label_ids += [ignore_index] * padding_length
    input_mask += [0] * padding_length

    # convert tokens to input ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    return {"input_ids": input_ids, "labels": label_ids, "attention_mask": input_mask}

# used to combine inputs into a batch
def collate_batch_to_tensors(inputs):
  batch = {}
  for k in inputs[0].keys():
    batch[k] = torch.tensor([dat[k] for dat in inputs], dtype=torch.long)
  
  return batch
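
The label alignment done in `convert_example_to_inputs` can be easier to follow on a toy case. The sketch below mirrors the same steps (first subword keeps the label, remaining subwords and special tokens get the ignore index, then padding plus attention mask), but it uses a hypothetical `toy_tokenize` function that just cuts words into 4-character pieces. A real WordPiece tokenizer will split words differently.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def toy_tokenize(word):
    """Hypothetical stand-in for a subword tokenizer: cuts a word into 4-character pieces."""
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]

def align(words, labels, label2id, max_seq_len):
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        pieces = toy_tokenize(word)
        tokens.extend(pieces)
        # only the first piece carries the word's label, the rest are ignored by the loss
        label_ids.extend([label2id[label]] + [IGNORE_INDEX] * (len(pieces) - 1))

    # truncate, then add the special tokens (which never carry a label)
    tokens = ["[CLS]"] + tokens[:max_seq_len - 2] + ["[SEP]"]
    label_ids = [IGNORE_INDEX] + label_ids[:max_seq_len - 2] + [IGNORE_INDEX]

    # 1 for real tokens, 0 for padding
    attention_mask = [1] * len(tokens)
    padding = max_seq_len - len(tokens)
    return (tokens + ["[PAD]"] * padding,
            label_ids + [IGNORE_INDEX] * padding,
            attention_mask + [0] * padding)

label2id = {"O": 0, "B-PER": 1}
tokens, label_ids, attention_mask = align(["Valtteri", "won"], ["B-PER", "O"], label2id, max_seq_len=8)
print(tokens)
print(label_ids)
print(attention_mask)
```

Every position with label `-100` is skipped when the loss is computed, so the model is only trained (and later evaluated) on the first subword of each word.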

## Model

We've now implemented everything needed for the data side of the pipeline, so let's look at our model. BERT is simple to use for most downstream tasks because we can just add a classification layer on top of the representations it produces and still achieve good performance. To fine-tune the resulting model, we update the parameters of both BERT and the classification layer on the downstream dataset.

For the purposes of this workshop, we won't work with the BERT model directly, as we are constrained by computational power and time. Instead we opt for [DistilBERT](https://arxiv.org/abs/1910.01108). DistilBERT is a smaller version of BERT (same architecture, fewer layers) that is trained by distilling the knowledge of BERT into the new model. It is much faster and retains most of the representational power of the BERT base model, so it's perfect for our use case.

Named entity recognition is a token classification task, and a version of DistilBERT for this type of downstream task is already implemented in the *transformers* library as [*DistilBertForTokenClassification*](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertfortokenclassification). For demonstration purposes, we reimplement DistilBERT with a classification head in the cell below.

In [None]:
class DistilBertTokenClassificationModel(nn.Module):
  def __init__(self, bert_config, num_classes, dropout_prob=0.1):
    super(DistilBertTokenClassificationModel, self).__init__()
    self.num_classes = num_classes

    self.bert = DistilBertModel(bert_config)
    self.dropout = nn.Dropout(dropout_prob)
    self.classification_layer = nn.Linear(in_features=bert_config.hidden_size, out_features=num_classes)

  def forward(self, input_ids=None, attention_mask=None, labels=None):
    x = self.bert(input_ids, attention_mask)[0]  # produces token representations
    x = self.dropout(x)  # mitigates overfitting
    logits = self.classification_layer(x)  # classifies tokens into entity types

    if labels is None:
      return logits

    # compute the loss
    loss = nn.CrossEntropyLoss()(logits.view(-1, self.num_classes), labels.view(-1))

    return (loss, logits)

  def load(self, path_to_dir):
    self.bert = DistilBertModel.from_pretrained(path_to_dir)
    model_path = os.path.join(path_to_dir, "model.tar")
    if os.path.exists(model_path):
      checkpoint = torch.load(model_path, map_location="cpu")  # load on CPU first so this also works without a GPU
      self.dropout.load_state_dict(checkpoint["dropout"])
      self.classification_layer.load_state_dict(checkpoint["cls"])
    else:
      print("No model.tar in provided directory, only loading bert model.")

  def save_pretrained(self, path_to_dir):
    self.bert.save_pretrained(path_to_dir)
    torch.save(
        {"dropout": self.dropout.state_dict(), "cls": self.classification_layer.state_dict()},
        os.path.join(path_to_dir, "model.tar")
    )
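
The loss in `forward` relies on `nn.CrossEntropyLoss` skipping every position labelled `-100`. As a rough illustration of what that means, the sketch below recomputes the same quantity in plain Python: the mean negative log-likelihood over the positions that are not ignored. The function name `masked_cross_entropy` and the toy numbers are made up for this example.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding, special tokens and non-leading subwords contribute nothing
        # numerically stable log-sum-exp of the logits
        row_max = max(row)
        log_norm = row_max + math.log(sum(math.exp(value - row_max) for value in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else float("nan")

# three token positions, two classes; the last position is ignored
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
labels = [0, 1, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

Changing the logits of the ignored position leaves the loss unchanged, which is why padding does not influence training.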

Below we implement the Trainer class that contains the main train loop.

In [None]:
class Trainer:
  def __init__(self, model):
    self.model = model

  def train(self, train_dataset, val_dataset, device, run_config):
    self.model = self.model.to(device)
    # create output folder if it doesn't yet exist
    if not os.path.isdir(run_config.output_dir): 
      os.makedirs(run_config.output_dir)
    
    # train dataloader will serve us the training data in batches
    train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), 
                                  batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
    
    # optimizer and scheduler that modifies the learning rate during the training
    optimizer = AdamW(self.model.parameters(), lr=run_config.learning_rate)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=run_config.num_warmup_steps,
                                                num_training_steps=len(train_dataloader)*run_config.num_epochs)
    
    print("Training started:")
    print(f"\tNum examples = {len(train_dataset)}")
    print(f"\tNum Epochs = {run_config.num_epochs}")

    global_step = 0  # counts training steps across epochs; used for periodic checkpoints when save_steps > 0

    train_iterator = trange(0, int(run_config.num_epochs), desc="Epoch")
    for epoch in train_iterator:
      epoch_iterator = tqdm(train_dataloader, desc="Iteration", position=0, leave=True)
      self.model.train()
      epoch_losses = []
      for step, inputs in enumerate(epoch_iterator):
        # move batch to GPU
        if isinstance(inputs, dict):
            for k, v in inputs.items():
                inputs[k] = v.to(device)
        else:
            inputs = inputs.to(device)

        # forward pass - model also outputs a computed loss
        outputs = self.model(**inputs)
        loss = outputs[0]

        epoch_losses.append(loss.item())

        # backward pass - backpropagation
        self.model.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        epoch_iterator.set_description(f"Training loss = {loss.item():.4f}")

        # save_steps must be positive, otherwise the modulo below would divide by zero
        if run_config.save_steps > 0 and global_step > 0 and global_step % run_config.save_steps == 0:
          output_dir = os.path.join(run_config.output_dir, f"Step_{global_step}")
          self.model.save_pretrained(output_dir)
          val_loss = self.evaluate(self.model, val_dataset, device, run_config)
          print(f"After step {global_step + 1}: val loss = {val_loss:.4f}")
          self.model.train()  # evaluate() switches to eval mode, so switch back before training continues

        global_step += 1
      
      if run_config.save_each_epoch:
        output_dir = os.path.join(run_config.output_dir, f"Epoch_{epoch + 1}")
        self.model.save_pretrained(output_dir)  # use the trainer's model, not the global variable
        val_loss = self.evaluate(self.model, val_dataset, device, run_config)
        print(f"After epoch {epoch + 1}: val loss = {val_loss:.4f}")

  def evaluate(self, model, test_dataset, device, run_config):
    test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset),
                                 batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
    self.model.eval()
    losses = []
    for inputs in tqdm(test_dataloader, desc="Evaluating", position=0, leave=True):
      # move batch to GPU
      if isinstance(inputs, dict):
        for k, v in inputs.items():
          inputs[k] = v.to(device)
      else:
        inputs = inputs.to(device)

      with torch.no_grad():
        loss = model(**inputs)[0]
      losses.append(loss.item())

    return np.mean(losses)
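
The scheduler used in `train` changes the learning rate at every step. The sketch below shows the multiplier that, to my understanding, `get_linear_schedule_with_warmup` applies to the base learning rate: a linear ramp from 0 to 1 during warmup, then a linear decay to 0 at the last training step. The function name `linear_warmup_multiplier` is made up for this example, and the exact library behavior should be checked against the *transformers* documentation.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given step (assumed schedule shape)."""
    if step < num_warmup_steps:
        # warmup: grow linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # afterwards: decay linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-5
for step in (0, 1, 50, 100):
    print(step, base_lr * linear_warmup_multiplier(step, num_warmup_steps=1, num_training_steps=100))
```

With a single warmup step, as in the default of `RunConfig`, the schedule is almost a pure linear decay from the base learning rate down to zero.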

*RunConfig* holds the parameters for training/testing:

In [None]:
from typing import Callable, Optional

@dataclass
class RunConfig:
  learning_rate: float
  batch_size: int
  num_epochs: int
  num_warmup_steps: int = 1
  save_steps: int = -1
  save_each_epoch: bool = True
  output_dir: str = "/content/"
  collate_fn: Optional[Callable] = None  # merges a list of dataset items into a batch

## Training

We have now implemented everything we need to start fine-tuning. We can save the fine-tuned models to our Colab instance (available under `/content/`) or we can connect our Google Drive to Colab and use it as external storage. If you want to do the latter, run the cell below and follow the instructions.

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

Let's set the training parameters:

In [None]:
run_config = RunConfig(
    learning_rate = 3e-5,
    batch_size = 32,  # start with 32 and decrease if you get CUDA out of memory exception
    num_epochs = 3,
    save_each_epoch = True,
    output_dir = "/content/drive/My Drive/NLP-workshop/BERT-NER/",
    collate_fn = collate_batch_to_tensors
)

Instantiating the datasets and tokenizer.

In [None]:
max_len = 512  # max length of input, pretrained model only supports max_len up to 512, use smaller values for faster training

# instantiate a DistilBERT tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)

# datasets
train_dataset = TokenClassificationDataset("/content/conllpp_train.txt", max_len)
val_dataset = TokenClassificationDataset("/content/conllpp_dev.txt", max_len,
                                         train_dataset.label2id, train_dataset.id2label)

if not os.path.isdir(run_config.output_dir): 
  os.makedirs(run_config.output_dir)

# we save label2id, id2label to a pickle file for later use
with open(os.path.join(run_config.output_dir, "label2id.pkl"), "wb") as f:
  pickle.dump(train_dataset.label2id, f, pickle.HIGHEST_PROTOCOL)

with open(os.path.join(run_config.output_dir, "id2label.pkl"), "wb") as f:
  pickle.dump(train_dataset.id2label, f, pickle.HIGHEST_PROTOCOL)

Instantiate the model and start training!

In [None]:
model = DistilBertTokenClassificationModel(DistilBertConfig.from_pretrained("distilbert-base-uncased"), 
                                           num_classes=len(train_dataset.unique_labels))
model.load("distilbert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
trainer = Trainer(model)
trainer.train(train_dataset, val_dataset, device, run_config)

If you happen to get a CUDA out of memory exception, do the following:
- cause another exception so Python doesn't hold any references to trainer or model through the stored traceback, e.g. run the `1 / 0` cell below, which raises a ZeroDivisionError
- run the cell below that empties GPU cache
- decrease the batch_size in run_config and rerun that cell
- reinstantiate the model and rerun training

In [None]:
1 / 0

In [None]:
model = None
trainer = None
gc.collect()
torch.cuda.empty_cache()
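
The manual recovery steps above could also be wrapped in a small retry loop. The sketch below is only an illustration of that idea: `train_with_backoff` and `fake_train` are hypothetical names, the out-of-memory error is simulated with a plain `RuntimeError`, and the actual cleanup (dropping references, `gc.collect()`, `torch.cuda.empty_cache()`, reinstantiating the model) is left as a comment.

```python
def train_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving batch_size after each out-of-memory error."""
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_fn(batch_size)
        except RuntimeError as error:
            # re-raise anything that is not an out-of-memory error
            if "out of memory" not in str(error).lower():
                raise
        # reached only after an out-of-memory error; in a real session this is the place
        # to drop model/trainer references, call gc.collect() and torch.cuda.empty_cache(),
        # and reinstantiate the model before retrying
        print(f"Out of memory at batch_size={batch_size}, retrying with {batch_size // 2}")
        batch_size //= 2
    raise RuntimeError("Ran out of memory even at the smallest batch size")

def fake_train(batch_size):
    # hypothetical stand-in for training that only 'fits' when batch_size <= 8
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return "done"

used_batch_size, result = train_with_backoff(fake_train, batch_size=32)
print(used_batch_size, result)
```

The cleanup is placed after the `except` block on purpose, since an exception that is still being handled keeps references to the frames (and tensors) involved in the failed call.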

## Evaluation

After fine-tuning we have a BERT model specialized for detecting named entities in text. Let's see how it performs on the test set. For the purposes of this workshop we've prepared an already fine-tuned model. You can get all the necessary files for evaluation by running the cell below.

In [None]:
!mkdir /content/bert-ner
!gdown -O /content/nertest.txt https://drive.google.com/uc?id=1q9EEadm5lBfWWOM8mITE3YJvhxbhNczC
!gdown https://drive.google.com/uc?id=10itTQ54hZd7G4t66KomAcjyMbfNJSSrm
!gdown https://drive.google.com/uc?id=10hEbs1zAeOBcG-wNLgSBS4iO3Aby2YoX
!gdown -O /content/bert-ner/config.json https://drive.google.com/uc?id=1-kUkp3AV4nfgjEuG2B88m_G2Gqc3g2bw
!gdown -O /content/bert-ner/model.tar https://drive.google.com/uc?id=1-o6aUye7n54EUER2B8V_Ue8Fbp4kUCSL
!gdown -O /content/bert-ner/pytorch_model.bin https://drive.google.com/uc?id=1-koeiI83JqeTdm0jH5zDRUmJfs9st8rJ

Let's instantiate the dataset, tokenizer, model and reload label2id, id2label.

In [None]:
# the label mappings must be loaded before the dataset is built, since the dataset uses them
with open("/content/label2id.pkl", "rb") as f:
  label2id = pickle.load(f)

with open("/content/id2label.pkl", "rb") as f:
  id2label = pickle.load(f)

# the dataset tokenizes its examples with the global tokenizer, so create it first
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)
test_dataset = TokenClassificationDataset("/content/nertest.txt", 512, label2id, id2label)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertTokenClassificationModel(DistilBertConfig.from_pretrained("distilbert-base-uncased"), num_classes=len(id2label))
model.load("/content/bert-ner/")
model = model.to(device)

The following code is used to compute log loss and F1-score. As per the official BERT paper, we only consider the leading subword of each word for the label.

In [None]:
def align_predictions_and_labels(predictions, labels):
  label_list = [[] for _ in range(labels.shape[0])]
  preds_list = [[] for _ in range(labels.shape[0])]

  for i in range(labels.shape[0]):
    for j in range(labels.shape[1]):
      if labels[i, j] >= 0:  # otherwise, we ignore it
        label_list[i].append(id2label[labels[i][j]])
        preds_list[i].append(id2label[predictions[i][j]])
  return preds_list, label_list

def evaluate(model, test_dataset, device, run_config):
  test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset),
                                batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
  model.eval()
  losses = []
  predictions, gt_labels = None, None
  for inputs in tqdm(test_dataloader, desc="Evaluating", position=0, leave=True):
    # move batch to GPU
    for k, v in inputs.items():
      inputs[k] = v.to(device)
    
    with torch.no_grad():
      loss, logits = model(**inputs)
    losses.append(loss.item())
    if gt_labels is None:
      gt_labels = inputs["labels"].detach().cpu().numpy()
      predictions = logits.detach().cpu().numpy()
    else:
      gt_labels = np.append(gt_labels, inputs["labels"].detach().cpu().numpy(), axis=0)
      predictions = np.append(predictions, logits.detach().cpu().numpy(), axis=0)

  predictions = np.argmax(predictions, axis=2)
  predictions, labels = align_predictions_and_labels(predictions, gt_labels)

  f1 = f1_score(labels, predictions)

  return np.mean(losses), f1
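
It helps to know what the F1-score above measures. As far as I know, *seqeval* scores whole entities rather than individual tokens: a predicted entity only counts as correct if its type and both of its boundaries match the gold annotation. The sketch below is a simplified reimplementation of that idea for BIO tags. The helper names are made up, and edge cases may be handled differently by *seqeval* itself.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from a BIO tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the extra "O" closes a trailing span
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current
        if current is not None and not continues:
            spans.add((current, start, i))
            current = None
        # a "B-" tag always opens a span; a stray "I-" tag is leniently treated as an opening
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, entity_type
    return spans

def entity_f1(true_tags, pred_tags):
    """Micro-averaged F1 over exact entity matches across all sentences."""
    true_spans, pred_spans = set(), set()
    for k, (true_seq, pred_seq) in enumerate(zip(true_tags, pred_tags)):
        true_spans |= {(k,) + span for span in extract_spans(true_seq)}
        pred_spans |= {(k,) + span for span in extract_spans(pred_seq)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(y_true, y_pred)
print(round(score, 4))
```

In this example the person entity is found but the location is missed, so precision is 1, recall is 0.5 and the F1-score is 2/3. A prediction that tags only "B-PER" for the first token would not get partial credit for the person entity.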

In [None]:
logloss, f1 = evaluate(model, test_dataset, device, run_config)
print(f"\nTest log loss = {logloss:.4f}\nTest F1-score = {f1:.4f}")

With an F1-score of 0.90 we achieve decent performance. Let's now try our model on some extra data. We've selected some BBC articles for this, but you're free to experiment!

Sources for articles:

- https://www.bbc.com/sport/formula1/54316085
- https://www.bbc.com/news/entertainment-arts-54292947
- http://www.bbc.com/travel/story/20200914-in-guatemala-the-maya-world-untouched-for-centuries

In [None]:
label2color = {
    "B-PER" : "red",
    "I-PER" : "red",
    "B-ORG" : "blue",
    "I-ORG" : "blue",
    "B-LOC" : "green",
    "I-LOC" : "green",
    "B-MISC" : "yellow",
    "I-MISC" : "yellow",
    "O" : "white"
}

def tag_some_text(text, show_legend=True):
  if show_legend:
    print("Legend: " + colored("Person ", "red") + colored("Organization ", "blue") + 
          colored("Location ", "green") + colored("Misc. ", "yellow") + "\n\n")
  words = text.split()
  fake_labels = ["O"] * len(words)  # placeholder labels, one per word; they only mark the leading subwords
  inputs = TokenClassificationDataset.convert_example_to_inputs(words, fake_labels, label2id)
  input_ids = torch.tensor([inputs["input_ids"]], dtype=torch.long).to(device)
  attention_mask = torch.tensor([inputs["attention_mask"]], dtype=torch.long).to(device)

  model.eval()  # make sure dropout is disabled for inference
  with torch.no_grad():
    logits = model(input_ids, attention_mask)
  predictions = np.argmax(logits.cpu().numpy(), axis=2)
  predictions, _ = align_predictions_and_labels(predictions, np.array([inputs["labels"]]))
  colors = list(map(label2color.get, predictions[0]))
  # zip stops at the shorter list, so words cut off by truncation are simply not printed
  colored_words = [colored(word, color) for word, color in zip(words, colors)]
  print(" ".join(colored_words))

In [None]:
tag_some_text("Lewis Hamilton's quest for the all-time record of Formula 1 wins was put on hold when he was hit with penalties at the Russian Grand Prix. Hamilton's Mercedes team-mate Valtteri Bottas dominated after the world champion was given a 10-second penalty for doing two illegal practice starts. Bottas was on the better strategy - starting on the medium tyres while Hamilton was on softs after a chaotic qualifying session for the Briton - and was tracking Hamilton in the early laps waiting for the race to play out. Behind the top three, Racing Point's Sergio Perez and Renault's Daniel Ricciardo had equally lonely races, the Australian having sufficient pace to overcome a five-second penalty for failing to comply with rules regarding how to rejoin the track when a car runs wide at Turn Two. Ferrari's Charles Leclerc made excellent use of a long first stint on the medium tyres to vault up from 11th on the grid to finish sixth, ahead of the second Renault of Esteban Ocon, the Alpha Tauris of Daniil Kvyat and Pierre Gasly and Alexander Albon's Red Bull. What's next? The Eifel Grand Prix on 11 October as the Nurburgring returns to the F1 calendar for the first time since 2013. The 24-hour touring car race there this weekend has been hit with miserable wet and wintery conditions in the Eifel mountains. Will F1 face the same?")

In [None]:
tag_some_text("Sir David Attenborough has broken Jennifer Aniston's record for the fastest time to reach a million followers on Instagram. At 94 years young, the naturalist's follower count raced to seven figures in four hours 44 minutes on Thursday, according to Guinness World Records. His debut post said: \'Saving our planet is now a communications challenge.\' Last October, Friends star Aniston reached the milestone in five hours and 16 minutes. Sir David's Instagram debut precedes the release of a book and a Netflix documentary, both titled A Life On Our Planet.")

In [None]:
tag_some_text("Using Lidar, in 2016 the Foundation for Maya Cultural and Natural Heritage launched the largest archaeological survey ever undertaken of the Maya lowlands. In the first phase, whose results were published in 2018, they mapped 2,100km of the Maya Biosphere Reserve. Their hope in the further phases – the second one of which took place in summer 2019, while I was there – is to triple the coverage area. That would make the project the largest Lidar survey not only in Central America, but in the world.")