# Named entity recognition - fine-tuning BERT

In this notebook we'll take a look at the process needed to fine-tune a pretrained [BERT](https://arxiv.org/abs/1810.04805) model to recognize named entities in text.

Named entity recognition is a token classification task, which means that we classify each token into one of a predefined set of classes. Usually these classes are entity types such as *person*, *organization*, *location*, etc., and we have a special category for tokens that don't belong to any entity type: *other*.

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/ner.PNG" width="700"/>
</div>

## Setup

First things first, let's make sure we have a GPU instance in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to GPU
- if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command (fingers crossed it's a Tesla P100!!):

In [None]:
!nvidia-smi

Let's download the workshop's helper modules and install some additional libraries:

In [None]:
!wget https://github.com/andrejmiscic/NLP-workshop/raw/master/utils/ner_utils.py
!wget https://github.com/andrejmiscic/NLP-workshop/raw/master/utils/trainer.py

In [None]:
!pip install transformers  # pretrained BERT model
!pip install gdown  # loading from Drive
!pip install seqeval  # NER evaluation
!pip install termcolor  # NER visualization

In [None]:
import gc
import os

import numpy as np
import torch
import torch.nn as nn

from ner_utils import TokenClassificationDataset, collate_dict_batch_to_tensors, align_predictions_and_labels, token_cls_evaluate
from termcolor import colored
from trainer import Trainer, RunConfig
from transformers import DistilBertConfig, DistilBertModel, DistilBertTokenizerFast

## Data

We are working with the commonly used [CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) NER task which has been established as a benchmark to evaluate new approaches. It consists of Reuters news articles and contains four different entity types: person (PER), location (LOC), organization (ORG) and other miscellaneous entities (MISC).

Let's load the data and look at an example.

In [None]:
!wget https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/CoNLLP-NER/conllpp_train.txt
!wget https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/CoNLLP-NER/conllpp_dev.txt

In [None]:
with open("/content/conllpp_train.txt", "r") as f:
  lines = f.readlines()

for i in range(2,15):  # prints first two examples
  print(lines[i], end="")  # each line already ends with a newline
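
The file uses the CoNLL column layout: one token per line with the entity tag in the last column, a blank line between sentences, and `-DOCSTART-` lines at document boundaries. As a rough sketch of what a loader has to do with such a file (the `read_conll` helper and the hand-written sample lines are illustrative assumptions, not necessarily how `TokenClassificationDataset` parses the data):

```python
def read_conll(lines):
    """Group CoNLL-style lines into (words, tags) sentence pairs."""
    sentences, words, tags = [], [], []
    for line in lines:
        line = line.strip()
        # blank lines and document markers close the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        columns = line.split()
        words.append(columns[0])   # first column: the token
        tags.append(columns[-1])   # last column: the entity tag
    if words:
        sentences.append((words, tags))
    return sentences


# A few hand-written lines in the same layout
sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "German JJ B-NP B-MISC",
    "call NN I-NP O",
    "",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
]
parsed = read_conll(sample)
for words, tags in parsed:
    print(list(zip(words, tags)))
```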

In [None]:
# max length of input, pretrained model only supports max_len up to 512, use smaller values for faster training
max_len = 512

# we use tokenizer to prepare the inputs
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)

train_dataset = TokenClassificationDataset("/content/conllpp_train.txt", tokenizer, max_len)
val_dataset = TokenClassificationDataset("/content/conllpp_dev.txt", tokenizer, max_len)

# we'll use this mapping to convert from model class id to human readable class
class_list = train_dataset.class_list
print(f"Classes: {class_list}")

Looking at the classes, we can see that we have all the aforementioned entity types. Also notice prefixes *B* and *I*, which denote whether a particular word is at the *beginning* of the entity or *inside* it.
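
To make the *B*/*I* scheme concrete, here is a minimal sketch that groups a tag sequence back into entities. The `bio_to_spans` helper and the example sentence are our own illustration and are not part of `ner_utils`:

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        continues = prefix == "I" and entity == current
        if not continues:
            # close the entity that was open, if any
            if current is not None:
                spans.append((current, start, i))
            # B- opens a new entity; a stray I- is treated as a start too
            if prefix in ("B", "I"):
                start, current = i, entity
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


words = ["Lewis", "Hamilton", "drives", "for", "Mercedes", "in", "Formula", "1"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-MISC", "I-MISC"]

spans = bio_to_spans(tags)
for entity, start, end in spans:
    print(entity, " ".join(words[start:end]))
```

The *B* prefix matters when two entities of the same type are adjacent: without it, we could not tell where the first one ends and the second one begins.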

The tokenizer helps us convert our input sentences into a format that BERT will understand. Let's look at an example.

In [None]:
example = "This is an example of how to use a tokenizer."

# converts inputs to tokens from the vocabulary
tokens = tokenizer.tokenize(example)
print(f"Tokens: {tokens}")

# converts tokens to indices in the vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Ids: {ids}")
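
Because the tokenizer can split one word into several word pieces, the word-level NER labels have to be aligned with the tokens. A common convention is to keep the label on the first piece of each word and mask the remaining pieces with `-100`, which `nn.CrossEntropyLoss` ignores by default. Whether `TokenClassificationDataset` does exactly this is an assumption on our part. The word pieces and label ids below are written by hand for illustration rather than produced by the real tokenizer:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_pieces, word_label_ids):
    """Give each word's first piece its label and mask the remaining pieces."""
    tokens, label_ids = [], []
    for pieces, label_id in zip(word_pieces, word_label_ids):
        for k, piece in enumerate(pieces):
            tokens.append(piece)
            label_ids.append(label_id if k == 0 else IGNORE_INDEX)
    return tokens, label_ids


# Hypothetical WordPiece output for the words "Valtteri Bottas won"
word_pieces = [["val", "##tte", "##ri"], ["bot", "##tas"], ["won"]]
word_label_ids = [1, 2, 0]  # hypothetical ids standing for B-PER, I-PER, O

tokens, label_ids = align_labels(word_pieces, word_label_ids)
for token, label_id in zip(tokens, label_ids):
    print(f"{token:8} {label_id}")
```

Special tokens such as `[CLS]` and `[SEP]` would typically be masked in the same way, since they do not correspond to any word.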

## Model

We've now implemented everything needed for the data side of the pipeline, so let's look at our model. The simplicity of using BERT for most downstream tasks lies in the fact that we can just add a classification layer on top of the produced representations and achieve good performance. To fine-tune the resulting model, we update the parameters of both BERT and the classification layer on the downstream dataset.

For the purposes of this workshop, we won't work with the BERT model directly, as we are constrained by computational power and time. Instead, we opt for [DistilBERT](https://arxiv.org/abs/1910.01108). DistilBERT is a smaller version of BERT (same architecture, fewer layers) that is trained by distilling the knowledge of a large BERT model into the smaller model. It is much faster and retains most of the representational power of the BERT base model, so it's perfect for our use case.

Named entity recognition is a token classification task, and a special version of DistilBERT for this type of downstream task is already implemented in the *transformers* library, called [*DistilBertForTokenClassification*](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertfortokenclassification). For demonstration purposes we reimplement DistilBERT with a classification head in the cell below.

In [None]:
class DistilBertTokenClassificationModel(nn.Module):
  def __init__(self, bert_config, num_classes, dropout_prob=0.1):
    super(DistilBertTokenClassificationModel, self).__init__()
    self.num_classes = num_classes

    self.bert = DistilBertModel(bert_config)
    self.dropout = nn.Dropout(dropout_prob)
    self.classification_layer = nn.Linear(in_features=bert_config.hidden_size, out_features=num_classes)

  def forward(self, input_ids, attention_mask=None, labels=None):
    x = self.bert(input_ids, attention_mask)[0]  # produces token representations
    x = self.dropout(x)  # mitigates overfitting
    logits = self.classification_layer(x)  # classifies tokens into entity types

    if labels is None:
      return logits

    # compute the loss
    loss = nn.CrossEntropyLoss()(logits.view(-1, self.num_classes), labels.view(-1))

    return (loss, logits)

  def load(self, path_to_dir):
    self.bert = DistilBertModel.from_pretrained(path_to_dir)
    model_path = os.path.join(path_to_dir, "model.tar")
    if os.path.exists(model_path):
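      # load onto the CPU first so a checkpoint saved on a GPU also loads without CUDA
      checkpoint = torch.load(model_path, map_location="cpu")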
      self.dropout.load_state_dict(checkpoint["dropout"])
      self.classification_layer.load_state_dict(checkpoint["cls"])
    else:
      print("No model.tar in provided directory, only loading bert model.")

  def save_pretrained(self, path_to_dir):
    self.bert.save_pretrained(path_to_dir)
    torch.save(
        {"dropout": self.dropout.state_dict(), "cls": self.classification_layer.state_dict()},
        os.path.join(path_to_dir, "model.tar")
    )
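
The `view(-1, ...)` calls in `forward` flatten the batch and sequence dimensions, so every token becomes one classification example. The sketch below reproduces that loss in plain Python for a tiny made-up input: it averages the negative log-softmax of the correct class over all tokens and skips positions labeled `-100`, as `nn.CrossEntropyLoss` does with its default settings. The function name and the numbers are hypothetical:

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def token_cross_entropy(logits, labels):
    """Mean cross-entropy over tokens whose label is not IGNORE_INDEX.

    logits: list of per-token score lists, labels: list of class ids.
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_norm - scores[label])
    return sum(losses) / len(losses)


# three tokens, three classes; the last token is masked out
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 0.0]]
labels = [0, 1, IGNORE_INDEX]

loss = token_cross_entropy(logits, labels)
print(f"loss = {loss:.4f}")
```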

## Training

We have now implemented everything needed to start fine-tuning. We can save the fine-tuned models to our Colab instance (available under `/content/`) or we can connect our Google Drive to Colab and use it as external storage. If you want to do the latter, run the cell below and follow the instructions.

In [None]:
# optional if you want to save your models to Google Drive
from google.colab import drive
drive.mount("/content/drive/")

Let's set the training parameters:

In [None]:
run_config = RunConfig(
    learning_rate = 3e-5,
    batch_size = 32,  # start with 32 and decrease if you get CUDA out of memory exception
    num_epochs = 3,
    output_dir = "/content/drive/MyDrive/NLP-workshop/BERT-NER/",
    collate_fn = collate_dict_batch_to_tensors
)
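
To get a feeling for how `batch_size` and `num_epochs` interact, we can count the number of parameter updates. This is a back-of-the-envelope sketch that assumes one update per batch (no gradient accumulation) and that the last, smaller batch is kept; the dataset size is a made-up round number and `total_training_steps` is a hypothetical helper, not part of `trainer.py`:

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


num_examples = 14_000  # hypothetical training-set size
for batch_size in (32, 16, 8):
    steps = total_training_steps(num_examples, batch_size, num_epochs=3)
    print(f"batch_size={batch_size:2d} -> {steps} updates")
```

Halving the batch size roughly doubles the number of updates, so training with a smaller batch takes more steps to see the same data.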

Instantiate the model and start training!

In [None]:
model = DistilBertTokenClassificationModel(
    DistilBertConfig.from_pretrained("distilbert-base-uncased"), 
    num_classes=len(class_list)
)
model.load("distilbert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
trainer = Trainer(model)
trainer.train(train_dataset, val_dataset, device, run_config)

If you happen to get a CUDA out-of-memory exception, do the following:
- cause another exception so that Python no longer holds references to the trainer or model, e.g. run the cell below that raises a ZeroDivisionError
- run the cell after it, which drops the references and empties the GPU cache
- decrease `batch_size` in `run_config` and rerun that cell
- reinstantiate the model and rerun training

In [None]:
1 / 0

In [None]:
model = None
trainer = None
gc.collect()
torch.cuda.empty_cache()

## Evaluation

After fine-tuning we have a BERT model specialized for detecting named entities in text. Let's see how it performs on the test set. For the purposes of this workshop we've prepared a model that is already fine-tuned on CoNLL. You can get all the necessary files for evaluation by running the cell below.

In [None]:
!mkdir /content/bert-ner
!wget https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/CoNLLP-NER/conllpp_test.txt
!gdown -O /content/bert-ner/config.json https://drive.google.com/uc?id=1Tg_sFaL9Ouye8d6l6gJKFYgOJL3BpI7J
!gdown -O /content/bert-ner/model.tar https://drive.google.com/uc?id=1-5PKbK88VjIyHPJD1QZea09MCEzkERzT
!gdown -O /content/bert-ner/pytorch_model.bin https://drive.google.com/uc?id=1-78MPCczYFLDaZD7gIz4qFglN0b1INMJ

Let's instantiate everything we need.

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", do_lower_case=True)
max_len = 512
test_dataset = TokenClassificationDataset("/content/conllpp_test.txt", tokenizer, max_len)
class_list = train_dataset.class_list
id2label = {i: label for i, label in enumerate(class_list)}

In [None]:
# only run if you want to use the model we've already fine-tuned for you
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertTokenClassificationModel(
    DistilBertConfig.from_pretrained("distilbert-base-uncased"), 
    num_classes=len(class_list)
)
model.load("/content/bert-ner/")
model = model.to(device)

Evaluation of our model on the test set:

In [None]:
logloss, f1 = token_cls_evaluate(model, test_dataset, device, id2label)
print(f"\nTest log loss = {logloss:.4f}\nTest F1-score = {f1:.4f}")
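
NER is usually scored at the entity level, which is what the `seqeval` library does: a predicted entity counts as correct only if its type and both of its boundaries match the gold annotation. We assume `token_cls_evaluate` reports this kind of score. A minimal sketch of the computation, with entities given as hypothetical `(type, start, end)` tuples:

```python
def entity_f1(gold_spans, predicted_spans):
    """Precision, recall and F1 over exact-match entity spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [("PER", 0, 2), ("ORG", 4, 5), ("MISC", 6, 8)]
# the ORG prediction is one token too long, so it does not count
predicted = [("PER", 0, 2), ("ORG", 4, 6), ("MISC", 6, 8)]

precision, recall, f1 = entity_f1(gold, predicted)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
```

Note that a boundary error is penalized twice: the wrong span is a false positive and the missed gold span is a false negative.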

With an F1-score of 0.90 we achieve reasonably good performance. Let's now evaluate our model on some extra data. We've selected some BBC articles for this, but feel free to experiment!

Sources for articles:

- https://www.bbc.com/sport/formula1/54316085
- https://www.bbc.com/news/entertainment-arts-54292947
- http://www.bbc.com/travel/story/20200914-in-guatemala-the-maya-world-untouched-for-centuries

In [None]:
label2color = {
    "B-PER" : "red",
    "I-PER" : "red",
    "B-ORG" : "blue",
    "I-ORG" : "blue",
    "B-LOC" : "green",
    "I-LOC" : "green",
    "B-MISC" : "yellow",
    "I-MISC" : "yellow",
    "O" : "white"
}

def tag_some_text(text, show_legend=True):
    words = text.split()
    inputs = TokenClassificationDataset.convert_example_to_inputs(tokenizer, words, class_list=class_list)
    input_ids = torch.tensor([inputs["input_ids"]], dtype=torch.long).to(device)
    attention_mask = torch.tensor([inputs["attention_mask"]], dtype=torch.long).to(device)
      
    with torch.no_grad():
      logits = model(input_ids, attention_mask)
    predictions = np.argmax(logits.cpu().numpy(), axis=2)
    predictions, _ = align_predictions_and_labels(predictions, np.array([inputs["labels"]]), id2label)
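    colors = list(map(label2color.get, predictions[0]))
    # zip stops at the shorter sequence, so a truncated input cannot raise an IndexError
    colored_words = [colored(word, color) for word, color in zip(words, colors)]
    print(" ".join(colored_words))
    if show_legend:
      # one legend entry per entity type, in the same colors as the tagged text
      legend = {label.split("-")[-1]: color for label, color in label2color.items()}
      print("\nLegend: " + ", ".join(colored(name, color) for name, color in legend.items()))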
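
The step from logits to one label per word can be sketched as follows. We take the highest-scoring class for every token and keep only the positions that carry a real label, i.e. the first word piece of each word. That masked positions are marked with `-100` is an assumption about how `align_predictions_and_labels` works, and the logits below are made up:

```python
IGNORE_INDEX = -100  # assumed marker for special tokens and non-first word pieces


def decode_predictions(logits, label_ids, id2label):
    """Pick the best class per token and keep only positions with a real label."""
    decoded = []
    for scores, label_id in zip(logits, label_ids):
        if label_id == IGNORE_INDEX:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        decoded.append(id2label[best])
    return decoded


id2label = {0: "O", 1: "B-PER", 2: "I-PER"}  # hypothetical, shortened label set
# six tokens for three words; only the first piece of each word is kept
logits = [
    [0.1, 2.3, 0.2],
    [0.0, 0.1, 1.9],
    [0.3, 0.2, 1.1],
    [0.2, 0.4, 3.0],
    [1.0, 0.1, 0.2],
    [2.5, 0.1, 0.3],
]
label_ids = [1, IGNORE_INDEX, IGNORE_INDEX, 2, IGNORE_INDEX, 0]

word_labels = decode_predictions(logits, label_ids, id2label)
print(word_labels)
```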

In [None]:
# word wrap
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
tag_some_text("Lewis Hamilton's quest for the all-time record of Formula 1 wins was put on hold when he was hit with penalties at the Russian Grand Prix. Hamilton's Mercedes team-mate Valtteri Bottas dominated after the world champion was given a 10-second penalty for doing two illegal practice starts. Bottas was on the better strategy - starting on the medium tyres while Hamilton was on softs after a chaotic qualifying session for the Briton - and was tracking Hamilton in the early laps waiting for the race to play out. Behind the top three, Racing Point's Sergio Perez and Renault's Daniel Ricciardo had equally lonely races, the Australian having sufficient pace to overcome a five-second penalty for failing to comply with rules regarding how to rejoin the track when a car runs wide at Turn Two. Ferrari's Charles Leclerc made excellent use of a long first stint on the medium tyres to vault up from 11th on the grid to finish sixth, ahead of the second Renault of Esteban Ocon, the Alpha Tauris of Daniil Kvyat and Pierre Gasly and Alexander Albon's Red Bull. What's next? The Eifel Grand Prix on 11 October as the Nurburgring returns to the F1 calendar for the first time since 2013. The 24-hour touring car race there this weekend has been hit with miserable wet and wintery conditions in the Eifel mountains. Will F1 face the same?")

In [None]:
tag_some_text("Sir David Attenborough has broken Jennifer Aniston's record for the fastest time to reach a million followers on Instagram. At 94 years young, the naturalist's follower count raced to seven figures in four hours 44 minutes on Thursday, according to Guinness World Records. His debut post said: \'Saving our planet is now a communications challenge.\' Last October, Friends star Aniston reached the milestone in five hours and 16 minutes. Sir David's Instagram debut precedes the release of a book and a Netflix documentary, both titled A Life On Our Planet.")

In [None]:
tag_some_text("Using Lidar, in 2016 the Foundation for Maya Cultural and Natural Heritage launched the largest archaeological survey ever undertaken of the Maya lowlands. In the first phase, whose results were published in 2018, they mapped 2,100km of the Maya Biosphere Reserve. Their hope in the further phases – the second one of which took place in summer 2019, while I was there – is to triple the coverage area. That would make the project the largest Lidar survey not only in Central America, but in the world.")