<img src="data/images/lecture-notebook-header.png" />

# RNN-Based Named Entity Recognition (NER)

Training a Recurrent Neural Network (RNN) for Named Entity Recognition (NER) involves the following steps:

* **Data Preparation:** Prepare the training data for the RNN by dividing it into input and output sequences. Each input sequence corresponds to a sentence in the text, and each output sequence contains the named entity labels for the words in that sentence. (We completed this step in the previous notebook.)

* **Word Embedding:** Convert the input sequences into word embeddings, which are vector representations of words that capture semantic and syntactic information. Pre-trained word embeddings such as Word2Vec or GloVe can be used, or the embeddings can be trained from scratch using the training data.

* **Model Architecture:** Define the architecture of the RNN, including the number of layers, the type of RNN (such as LSTM or GRU), and the number of neurons in each layer. The output layer should have one neuron for each possible named entity label.

* **Training:** Train the RNN using the prepared data and the defined architecture. This involves optimizing the model's parameters (weights and biases) to minimize the loss function, which measures the difference between the predicted labels and the true labels.

* **Evaluation:** Evaluate the performance of the trained RNN on a validation set, using metrics such as precision, recall, and F1 score. Adjust the model architecture and training parameters as necessary to improve performance.

* **Prediction:** Use the trained RNN to predict the named entities in new text data, by feeding the input sequences through the model and extracting the output labels.
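
The word embedding step listed above can be sketched in a few lines of PyTorch. This is only a minimal illustration: the vocabulary size, the embedding size, and the random matrix standing in for pre-trained Word2Vec/GloVe vectors are all made-up values, not the ones used later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 10 word indices, 4-dimensional embeddings
vocab_size, embed_size = 10, 4

# Option 1: embeddings trained from scratch (randomly initialized)
embedding = nn.Embedding(vocab_size, embed_size)

# Option 2: embeddings initialized from a pre-trained matrix
# (random values stand in for real Word2Vec/GloVe vectors here)
pretrained = torch.randn(vocab_size, embed_size)
embedding_pretrained = nn.Embedding.from_pretrained(pretrained, freeze=True)

# A batch containing one sentence of 5 word indices
sentence = torch.LongTensor([[1, 5, 2, 7, 3]])

# Each word index is replaced by its embedding vector
vectors = embedding(sentence)
print(vectors.shape)  # torch.Size([1, 5, 4])
```

With `freeze=True`, the pre-trained vectors are not updated during training; setting it to `False` would fine-tune them together with the rest of the model.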

It is important to note that training an RNN for NER requires a large amount of labeled data, and can be computationally intensive. It is also important to carefully tune the hyperparameters of the model, such as the learning rate and batch size, to ensure good performance.

In this notebook, we will go through some of those basic steps.

## Setting up the Notebook

In [1]:
%reload_ext autoreload
%autoreload 2

### Importing Required Packages

In [2]:
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from tqdm import tqdm

In [None]:
from src.utils import Dict2Class
from src.sampler import BaseDataset, EqualLengthsBatchSampler
from src.rnn import VanillaRnnNER, PosRnnNER

### Checking/Setting the Device

PyTorch can train neural networks on a supported GPU, which often speeds up the training process significantly. If you have a supported GPU, feel free to use it. However, for this notebook it is not needed since our dataset is small and our network model is very simple. In fact, the speed-up from a GPU may be modest here, since initializing memory on the GPU and moving the data to the GPU involve some overhead.


In [None]:
use_cuda = torch.cuda.is_available()
#use_cuda = False
device = torch.device("cuda:0" if use_cuda else "cpu")
print(device)

## Load & Prepare Dataset

In the previous notebook, we prepared our dataset for the training. This mainly meant that we created the vocabularies and vectorized the words as well as the POS tags. So here, we now only have to load the vocabularies and the vectorized sequences and labels.

In [None]:
vocab_words = torch.load("data/datasets/gmb-ner/gmb-ner-token.vocab")
vocab_pos = torch.load("data/datasets/gmb-ner/gmb-ner-pos.vocab")
vocab_label = torch.load("data/datasets/gmb-ner/gmb-ner-label.vocab")

vocab_size_words = len(vocab_words)
vocab_size_pos = len(vocab_pos)
vocab_size_label = len(vocab_label)

print("Size of word vocabulary:\t{}".format(vocab_size_words))
print("Size of POS vocabulary:\t{}".format(vocab_size_pos))
print("Size of TAG vocabulary:\t{}".format(vocab_size_label))

For the training, we consider two models: the first model uses only the words as input, while the second model uses both the words and the POS tags. To this end, we load the vectorized sequences into two lists. Recall that we simply concatenated the sequence of token indices and the sequence of POS tag indices for each sentence. Thus, for the first model, we "cut the sequences in half" to keep only the token indices.

In [None]:
samples_vanilla, samples_pos = [], []

# Count the lines (= sentences) first; the file is closed again afterwards
with open("data/datasets/gmb-ner/gmb-ner-data-vectorized.txt", "rb") as file:
    num_sentences = sum(1 for _ in file)

print(num_sentences)

with open("data/datasets/gmb-ner/gmb-ner-data-vectorized.txt") as file:
    with tqdm(total=num_sentences) as t:
        for line in file:
            line = line.strip()
            inputs, targets = line.split(",")
            # Convert the space-separated indices to sequences of integers
            input_seq_pos = [ int(index) for index in inputs.split() ]
            input_seq_vanilla = input_seq_pos[:len(input_seq_pos)//2]
            target_seq = [ int(index) for index in targets.split() ]
            # Add (sequence,label) pair to list of samples
            samples_vanilla.append((input_seq_vanilla, target_seq))
            samples_pos.append((input_seq_pos, target_seq))
            t.update(1)
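
To make the line format concrete, the sketch below parses a single made-up line the same way as the loop above. The index values are hypothetical and do not come from the actual dataset file; the point is only to show how the first half of the input holds the token indices and the second half the POS tag indices.

```python
# Hypothetical line: 3 token indices, then 3 POS tag indices,
# then (after the comma) 3 label indices
line = "12 7 431 4 9 4,0 0 3\n"

inputs, targets = line.strip().split(",")

input_seq_pos = [int(index) for index in inputs.split()]

# First half = token indices, second half = POS tag indices
half = len(input_seq_pos) // 2
token_indices = input_seq_pos[:half]
pos_indices = input_seq_pos[half:]

target_seq = [int(index) for index in targets.split()]

print(token_indices)  # [12, 7, 431]
print(pos_indices)    # [4, 9, 4]
print(target_seq)     # [0, 0, 3]
```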

---

## Vanilla RNN for NER

### Prepare Training & Test Data

For the first model, we only consider the words/tokens as input. The image below, taken from the lecture slides, shows the overall architecture.

<img width="80%" src="data/images/ner-rnn-basic-architecture.png">

Let's first create the basic dataset from the data we have just loaded.

In [None]:
X = [ torch.LongTensor(inputs) for (inputs, _) in samples_vanilla ]
Y = [ torch.LongTensor(targets) for (_, targets) in samples_vanilla ]

In this notebook, we won't perform a proper evaluation since evaluating NER models can be quite tricky (cf. lecture slides). We can therefore use most samples for training and reserve only a few test samples for predicting the NER labels of some sentences.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.01, shuffle=True, random_state=0)

print("Number of training samples:\t{}".format(len(X_train)))
print("Number of test samples:\t\t{}".format(len(X_test)))

Lastly, we create the data loaders for convenient handling of the data when training the model. Again, we discussed these steps and utility classes in full detail in previous notebooks, so we skip those details here.

In [None]:
batch_size_train = 64
batch_size_test = 1

dataset_train = BaseDataset(X_train, Y_train)
sampler_train = EqualLengthsBatchSampler(batch_size_train, X_train, Y_train)
loader_train = DataLoader(dataset_train, batch_sampler=sampler_train, shuffle=False, drop_last=False)

dataset_test = BaseDataset(X_test, Y_test)
sampler_test = EqualLengthsBatchSampler(batch_size_test, X_test, Y_test)
loader_test = DataLoader(dataset_test, batch_sampler=sampler_test, shuffle=False, drop_last=False)
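
The actual `EqualLengthsBatchSampler` comes from the `src.sampler` module and is not shown here. As a rough illustration of the idea its name suggests, the sketch below groups samples by sequence length so that every batch can be stacked into one tensor without padding. The function `equal_length_batches()` and the toy data are hypothetical, and the sketch only looks at the input lengths; the real class may work differently.

```python
import random
from collections import defaultdict

import torch

def equal_length_batches(sequences, batch_size, shuffle=True, seed=0):
    """Return lists of sample indices; all samples in a list have the same length."""
    # Group sample indices by sequence length
    buckets = defaultdict(list)
    for idx, seq in enumerate(sequences):
        buckets[len(seq)].append(idx)

    # Split each group into chunks of at most batch_size indices
    batches = []
    for indices in buckets.values():
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    # Shuffle the order of the batches (not the content of each batch)
    if shuffle:
        random.Random(seed).shuffle(batches)
    return batches

# Toy data: sequences of lengths 2, 3, 2, 3, 2
X_toy = [torch.LongTensor(s) for s in ([1, 2], [3, 4, 5], [6, 7], [8, 9, 10], [11, 12])]

for batch in equal_length_batches(X_toy, batch_size=2):
    # Sequences in a batch share a length, so they stack without padding
    stacked = torch.stack([X_toy[i] for i in batch])
    print(batch, tuple(stacked.shape))
```

Note that some batches can end up smaller than `batch_size` when the number of sequences of a given length is not a multiple of it.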

### Create Model

Both models considered in this notebook can be found in the file `src/rnn.py`. Both are implemented to be easily configurable, which means that there are various parameters you can set to specify the exact network architecture. Since this first model only considers the word/token features, it has slightly fewer configuration parameters.

In [None]:
embed_size_words = 100
embed_size_pos = 50

bilstm_hidden_size = 256
bilstm_num_layers = 1
bilstm_dropout = 0.2

params = {
    "device": device,
    "vocab_size_words": vocab_size_words,
    "vocab_size_label": vocab_size_label,
    "embed_size": embed_size_words,
    "bilstm_hidden_size": bilstm_hidden_size,
    "bilstm_num_layers": bilstm_num_layers,
    "bilstm_dropout": bilstm_dropout,
    "linear_hidden_sizes": [256, 128],
    "linear_dropout": 0.2
}

params = Dict2Class(params)

model = VanillaRnnNER(params).to(device)
# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Define loss function
criterion = nn.CrossEntropyLoss()

print(model)
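
The actual `VanillaRnnNER` class is defined in `src/rnn.py` and is not reproduced here. To give an idea of what such a tagger can look like, here is a minimal sketch under stated assumptions: the class `TinyBiLstmTagger` is hypothetical, it uses a single bidirectional LSTM layer, and it leaves out the extra linear layers and dropout that the parameters above configure.

```python
import torch
import torch.nn as nn

class TinyBiLstmTagger(nn.Module):
    """Illustrative tagger: embedding -> BiLSTM -> linear layer over labels."""

    def __init__(self, vocab_size, num_labels, embed_size=100, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.bilstm = nn.LSTM(embed_size, hidden_size, batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence 2 * hidden_size
        self.out = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, inputs):
        embedded = self.embedding(inputs)   # (batch, seq_len, embed_size)
        hidden, _ = self.bilstm(embedded)   # (batch, seq_len, 2 * hidden_size)
        return self.out(hidden)             # (batch, seq_len, num_labels)

# Made-up sizes: 1000 words, 9 labels, 4 sentences with 12 tokens each
toy_model = TinyBiLstmTagger(vocab_size=1000, num_labels=9)
toy_inputs = torch.randint(0, 1000, (4, 12))
toy_outputs = toy_model(toy_inputs)
print(toy_outputs.shape)  # torch.Size([4, 12, 9])
```

The output contains one score per label for every token, which is the shape the training and prediction code in this notebook expects from a model.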

### Train Model

In the file `src/utils.py` we provide the method `train()` to train the model. This method contains all the training steps that we have seen multiple times in other lecture notebooks. Feel free to check out the method to remind yourself of what it does. It returns a list containing the loss of each epoch (summed over all batches of that epoch), which we will later use for plotting.


In [None]:

def train(model, loader, optimizer, criterion, num_epochs, device, verbose=False):

    losses = []
    
    # Set model to "train" mode
    model.train()
    
    print("Total Training Time (total number of epochs: {})".format(num_epochs))
    for epoch in range(1, num_epochs+1):

        # Initialize epoch loss (cumulative loss for all batches)
        epoch_loss = 0.0
        
        with tqdm(total=len(loader)) as pbar:
        
            for inputs, targets in loader:
                batch_size, seq_len = inputs.shape[0], inputs.shape[1]

                inputs, targets = inputs.to(device), targets.to(device)
                
                outputs = model(inputs)                

                loss = criterion(outputs.permute(0,2,1), targets)

                ### Pytorch magic! ###
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Keep track of overall epoch loss
                epoch_loss += loss.item()

                # The progress bar total is the number of batches, so advance by one batch
                pbar.update(1)
                
        if verbose is True:
            print("Loss:\t{:.3f} (epoch {})".format(epoch_loss, epoch))
            
        losses.append(epoch_loss)
        
    return losses
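
One detail of the training loop worth spelling out is the call `outputs.permute(0,2,1)`. For sequence outputs, `nn.CrossEntropyLoss` expects the class dimension in second position, i.e. a shape of `(batch, num_labels, seq_len)`, while the targets have the shape `(batch, seq_len)`. The sketch below uses random stand-in tensors with made-up sizes to show this, together with an equivalent formulation that flattens all tokens into one long batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, num_labels = 4, 12, 9

# Stand-ins for the model outputs and the true label indices
toy_outputs = torch.randn(batch_size, seq_len, num_labels)
toy_targets = torch.randint(0, num_labels, (batch_size, seq_len))

toy_criterion = nn.CrossEntropyLoss()

# CrossEntropyLoss expects the class dimension second: (batch, num_labels, seq_len)
loss_permuted = toy_criterion(toy_outputs.permute(0, 2, 1), toy_targets)

# Equivalent: treat every token as its own sample, (batch * seq_len, num_labels)
loss_flat = toy_criterion(toy_outputs.reshape(-1, num_labels), toy_targets.reshape(-1))

print(torch.allclose(loss_permuted, loss_flat))  # True
```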

In [None]:
num_epochs = 20

losses_vanilla = train(model, loader_train, optimizer, criterion, num_epochs, device, verbose=True)

### Test Model

Although we don't perform a proper evaluation, we can simply predict the NER labels for sentences from the test set. For this, we first need to define a method `eval_sample()` that predicts the NER labels for a given sentence and a given model.


In [None]:
def eval_sample(X, Y, model):
    model.eval()
    
    # We assume a batch size of 1!
    # No gradients are needed for prediction
    with torch.no_grad():
        outputs = model(X.to(device)).squeeze(0)
    
    # Print: predicted label / true label / word
    for idx in range(outputs.shape[0]):
        # Index of the label with the highest score for this token
        topi = outputs[idx].argmax().item()
        label_pred = vocab_label.lookup_token(topi)
        label_true = vocab_label.lookup_token(Y[0][idx].item())
        word = vocab_words.lookup_token(X[0][idx].item())
        print("{}\t{}\t{}".format(label_pred, label_true, word))

The code cell below picks a random sentence from the test set and predicts the NER labels. Just run the code cell multiple times for different sentences.

In [None]:
for X, Y in loader_test:
    eval_sample(X, Y, model)
    break

---

## RNN for NER with POS Tags

We now repeat the steps of creating the training and test datasets, this time for a modified RNN architecture that also considers the POS tags of the words. The figure below shows this modified architecture, which takes both the individual words and their POS tags as input features.

<img width="80%" src="data/images/ner-rnn-pos-architecture.png">

Let's first create the dataset from the data that also includes the POS tags.


In [None]:
X = [ torch.LongTensor(inputs) for (inputs, _) in samples_pos ]
Y = [ torch.LongTensor(targets) for (_, targets) in samples_pos ]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.01, shuffle=True, random_state=0)

print("Number of training samples:\t{}".format(len(X_train)))
print("Number of test samples:\t\t{}".format(len(X_test)))

In [None]:
batch_size_train = 64
batch_size_test = 1

dataset_train = BaseDataset(X_train, Y_train)
sampler_train = EqualLengthsBatchSampler(batch_size_train, X_train, Y_train)
loader_train = DataLoader(dataset_train, batch_sampler=sampler_train, shuffle=False, drop_last=False)

dataset_test = BaseDataset(X_test, Y_test)
sampler_test = EqualLengthsBatchSampler(batch_size_test, X_test, Y_test)
loader_test = DataLoader(dataset_test, batch_sampler=sampler_test, shuffle=False, drop_last=False)

### Create Model

As this model is a bit more complex, it also offers more input parameters for its configuration. Again, you can check out the full implementation of the model in the file `src/rnn.py`.


In [None]:
embed_size_words = 100
embed_size_char = 50
embed_size_pos = 50

bilstm_hidden_size = 256
bilstm_num_layers = 1
bilstm_dropout = 0.2

params = {
    "device": device,
    "vocab_size_words": vocab_size_words,
    "vocab_size_pos": vocab_size_pos,
    "vocab_size_tag": vocab_size_label,
    "embed_size_words": embed_size_words,
    "embed_size_pos": embed_size_pos,
    "bilstm_hidden_size": bilstm_hidden_size,
    "bilstm_num_layers": bilstm_num_layers,
    "bilstm_dropout": bilstm_dropout,
    "linear_hidden_sizes": [256],
    "linear_dropout": 0.2
}

params = Dict2Class(params)

model = PosRnnNER(params).to(device)
# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Define loss function
criterion = nn.CrossEntropyLoss()

print(model)
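
The actual `PosRnnNER` class is defined in `src/rnn.py` and is not reproduced here. One plausible way to combine both feature types, sketched below as an assumption rather than as the real implementation, is to split the concatenated input back into token and POS tag indices, embed each separately, and concatenate the two embeddings per token before the BiLSTM. The class `TinyPosTagger` and all sizes in the usage example are hypothetical.

```python
import torch
import torch.nn as nn

class TinyPosTagger(nn.Module):
    """Illustrative tagger using word and POS tag embeddings as input features."""

    def __init__(self, vocab_size_words, vocab_size_pos, num_labels,
                 embed_size_words=100, embed_size_pos=50, hidden_size=256):
        super().__init__()
        self.embed_words = nn.Embedding(vocab_size_words, embed_size_words)
        self.embed_pos = nn.Embedding(vocab_size_pos, embed_size_pos)
        self.bilstm = nn.LSTM(embed_size_words + embed_size_pos, hidden_size,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, inputs):
        # inputs: (batch, 2 * seq_len) = token indices followed by POS tag indices
        seq_len = inputs.shape[1] // 2
        words, pos = inputs[:, :seq_len], inputs[:, seq_len:]
        # Concatenate both embeddings per token along the feature dimension
        features = torch.cat([self.embed_words(words), self.embed_pos(pos)], dim=2)
        hidden, _ = self.bilstm(features)   # (batch, seq_len, 2 * hidden_size)
        return self.out(hidden)             # (batch, seq_len, num_labels)

# Made-up sizes: 1000 words, 40 POS tags, 9 labels, 4 sentences with 12 tokens each
toy_pos_model = TinyPosTagger(vocab_size_words=1000, vocab_size_pos=40, num_labels=9)
toy_words = torch.randint(0, 1000, (4, 12))
toy_pos = torch.randint(0, 40, (4, 12))
toy_pos_inputs = torch.cat([toy_words, toy_pos], dim=1)   # (4, 24)
print(toy_pos_model(toy_pos_inputs).shape)  # torch.Size([4, 12, 9])
```

Although the input is twice as long as the sentence, the output again has one row of label scores per token, so it lines up with the label sequence.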

### Train Model

Training this model simply involves calling the `train()` method again.

In [None]:
num_epochs = 20

losses_pos = train(model, loader_train, optimizer, criterion, num_epochs, device, verbose=True)

If you want, you can again run the code cell below multiple times to see how this model performs on sentences from the test data.

In [None]:
for X, Y in loader_test:
    eval_sample(X, Y, model)
    break

## Compare Models

A proper comparison of both models would require a proper evaluation, which is beyond the scope of this notebook. However, we can compare the per-epoch losses of both models to get at least a crude impression.


In [None]:
x = list(range(1, len(losses_vanilla)+1))

plt.figure()

plt.plot(x, losses_vanilla, lw=3)
plt.plot(x, losses_pos, lw=3)

font_axes = {'family':'serif','color':'black','size':16}

plt.gca().set_xticks(x)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.xlabel("Epoch", fontdict=font_axes)
plt.ylabel("Loss", fontdict=font_axes)
plt.legend(['Vanilla RNN', 'POS RNN'], loc='upper right', fontsize=16)
plt.tight_layout()
plt.show()

As the plot shows, the losses for the POS RNN decrease/converge faster than the losses for the Vanilla RNN. Again, while this is not a proper evaluation, it suggests that considering the POS tags makes the training more effective.
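
As a small step beyond comparing training losses, one could compute token-level precision, recall, and F1 score over all non-`O` labels. The sketch below is deliberately crude and rests on several assumptions: the function `token_level_scores()` is hypothetical, the label strings are made up for illustration, and a proper NER evaluation would typically score whole entities rather than individual tokens.

```python
def token_level_scores(true_labels, pred_labels, outside_label="O"):
    """Crude token-level precision, recall and F1 over all non-outside labels."""
    tp = fp = fn = 0
    for true, pred in zip(true_labels, pred_labels):
        if pred != outside_label and pred == true:
            tp += 1       # correctly predicted entity token
        else:
            if pred != outside_label:
                fp += 1   # predicted an entity label that is wrong
            if true != outside_label:
                fn += 1   # missed (or mislabeled) a true entity token
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Hypothetical true and predicted labels for a 6-token sentence
true_labels = ["B-per", "I-per", "O", "O", "B-geo", "O"]
pred_labels = ["B-per", "O", "O", "B-org", "B-geo", "O"]

precision, recall, f1 = token_level_scores(true_labels, pred_labels)
print("precision={:.3f} recall={:.3f} f1={:.3f}".format(precision, recall, f1))
# precision=0.667 recall=0.667 f1=0.667
```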

---

## Summary

Recurrent neural networks (RNNs) are a type of neural network commonly used for natural language processing (NLP) tasks, including named entity recognition (NER). RNNs are well-suited for NER because they can capture the contextual dependencies between words in a sentence, which is important for accurately identifying named entities.

To train an RNN for NER, the data is first prepared by dividing it into input and output sequences. The input sequences correspond to sentences in the text, and the output sequences correspond to the labels for the named entities in the sentence. The input sequences are then converted into word embeddings, which are vector representations of words that capture semantic and syntactic information.

The RNN architecture is then defined, which typically includes multiple layers of LSTM or GRU units. The output layer has one neuron for each possible named entity label. The model is then trained using the prepared data, optimizing the parameters to minimize the loss function. Once trained, the performance of the RNN is evaluated on a validation set using metrics such as precision, recall, and F1 score. The model is then used to predict the named entities in new text data.

One advantage of using RNNs for NER is that they can handle variable-length input sequences, which is important for processing text data. Additionally, RNNs can be used with pre-trained word embeddings, which can improve performance and reduce the amount of data required for training. Overall, RNNs are a powerful tool for NER, and have been used successfully in a wide range of NLP applications. However, training an RNN for NER can be computationally intensive, and requires a large amount of labeled data and careful tuning of hyperparameters to achieve good performance.