# Named Entity Recognition (NER) using GloVe

[Named Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) has many applications, for example in:
- Search Engine Efficiency
- Recommendation engine
- Resume parsing
- Customer service

Here we use a well-known [dataset from Kaggle](https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus).

This is the same dataset and general approach as the other notebook (Named Entity Recognition using Keras). The difference is that in that notebook the embeddings were learned as part of the training, whereas here we import pre-trained [GloVe](https://nlp.stanford.edu/data/glove.6B.zip) embeddings to improve performance.

## Loading packages

In [1]:
import pandas as pd
from tqdm.notebook import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from itertools import chain
from sklearn.model_selection import train_test_split
from keras_preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

## Loading data

The data is read from a single CSV file. It is split into train, validation, and test sub-datasets later, after preprocessing.

In [2]:
data = pd.read_csv("./data/ner_dataset.csv", encoding= 'unicode_escape')

## Data Visualization and Preprocessing

Checking the data format. The sentences follow one another row by row, and the `Sentence #` column is filled only on the first word of each sentence. We will use that later to break the data into separate sentences.

In [3]:
data.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


# Data Preprocessing

For processing the data we need to do a few tasks:
- Tokenize each word and each tag in the data to a numerical value. For this we create dictionaries that map each word and tag to an integer, then add columns to the data with the corresponding values for the words and tags.
- Separate the sentences. As the dataframe shows, only the first word of a new sentence carries the sentence number. We first fill each NaN with the number of its sentence using forward fill ([ffill](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)). Then we group the words belonging to the same sentence together.
- Pad the sentences to a common length and split the data into train, validation, and test datasets using sklearn.
- Use pre-trained GloVe word embeddings to map each tokenized word to its embedding.



## Tokenizing

First, we create a set from each of the `Word` and `Tag` columns of the data and build dictionaries that map every unique word and tag to an integer index. Then we map each word and tag in the data to its numerical value from the dictionaries and add them as new columns to the dataframe `data`.

In [4]:
word_set   = list(set(data['Word'].to_list()))
word_index = {word:index for  index, word in enumerate(word_set)}
index_word = {index:word for  index, word in enumerate(word_set)}

tag_set    = list(set(data['Tag'].to_list()))
tag_index = {tag:index for  index, tag in enumerate(tag_set)}
index_tag = {index:tag for  index, tag in enumerate(tag_set)}


print("Token for id 3 :", index_word[3])
print("Tag for id 4: ", index_tag[4])
print("Number of unique words : {}".format(len(word_index)))
print("Number of unique tags :  {}".format(len(tag_index)))

Token for id 3 : 1895
Tag for id 4:  I-geo
Number of unique words : 35178
Number of unique tags :  17
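
Because the iteration order of a `set` of strings can change between Python runs, the indices printed above are not guaranteed to be reproducible. The following is a minimal sketch of a deterministic alternative that also reserves dedicated padding and unknown-word entries. The helper `build_vocab` and the tokens `<PAD>` and `<UNK>` are illustrative assumptions, not part of this notebook.

```python
def build_vocab(tokens, specials=("<PAD>", "<UNK>")):
    """Map tokens to stable integer ids, with the special tokens first."""
    # sorted() makes the id assignment independent of set iteration order
    vocab = list(specials) + sorted(set(tokens) - set(specials))
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token


toy_words = ["Thousands", "of", "demonstrators", "have", "marched", "of"]
toy_word_index, toy_index_word = build_vocab(toy_words)

print(toy_word_index["<PAD>"])   # 0
print(toy_index_word[2])         # Thousands (uppercase sorts before lowercase)
```

With this layout, id 0 can be used as the padding value without colliding with a real word.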


Creating new columns with the mapped numerical values:

In [5]:
data['Word_index'] = data['Word'].map(word_index)
data['Tag_index'] = data['Tag'].map(tag_index)
data.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag,Word_index,Tag_index
0,Sentence: 1,Thousands,NNS,O,9980,13
1,,of,IN,O,32615,13
2,,demonstrators,NNS,O,12071,13
3,,have,VBP,O,34863,13
4,,marched,VBN,O,29093,13
5,,through,IN,O,13810,13
6,,London,NNP,B-geo,32120,0
7,,to,TO,O,29014,13
8,,protest,VB,O,1524,13
9,,the,DT,O,30893,13


## Separating the sentences

First we fill the NaN values with the sentence number for each word.

In [6]:
# forward-fill; equivalent to fillna(method='ffill'), which newer pandas versions deprecate
data_filled = data.ffill(axis=0)
data_filled.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag,Word_index,Tag_index
0,Sentence: 1,Thousands,NNS,O,9980,13
1,Sentence: 1,of,IN,O,32615,13
2,Sentence: 1,demonstrators,NNS,O,12071,13
3,Sentence: 1,have,VBP,O,34863,13
4,Sentence: 1,marched,VBN,O,29093,13
5,Sentence: 1,through,IN,O,13810,13
6,Sentence: 1,London,NNP,B-geo,32120,0
7,Sentence: 1,to,TO,O,29014,13
8,Sentence: 1,protest,VB,O,1524,13
9,Sentence: 1,the,DT,O,30893,13


Then we group the rows that share the same `Sentence #` value, collecting each column into one list per sentence.

In [7]:
data_grouped = data_filled.groupby(['Sentence #'],as_index = False)
data_ordered = data_grouped.agg(lambda x: list(x))
data_ordered.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag,Word_index,Tag_index
0,Sentence: 1,"[Thousands, of, demonstrators, have, marched, ...","[NNS, IN, NNS, VBP, VBN, IN, NNP, TO, VB, DT, ...","[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo...","[9980, 32615, 12071, 34863, 29093, 13810, 3212...","[13, 13, 13, 13, 13, 13, 0, 13, 13, 13, 13, 13..."
1,Sentence: 10,"[Iranian, officials, say, they, expect, to, ge...","[JJ, NNS, VBP, PRP, VBP, TO, VB, NN, TO, JJ, J...","[B-gpe, O, O, O, O, O, O, O, O, O, O, O, O, O,...","[28793, 20255, 33376, 13773, 29741, 29014, 316...","[12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 1..."
2,Sentence: 100,"[Helicopter, gunships, Saturday, pounded, mili...","[NN, NNS, NNP, VBD, JJ, NNS, IN, DT, NNP, JJ, ...","[O, O, B-tim, O, O, O, O, O, B-geo, O, O, O, O...","[14250, 31241, 325, 18595, 22093, 2854, 10591,...","[13, 13, 5, 13, 13, 13, 13, 13, 0, 13, 13, 13,..."
3,Sentence: 1000,"[They, left, after, a, tense, hour-long, stand...","[PRP, VBD, IN, DT, NN, JJ, NN, IN, NN, NNS, .]","[O, O, O, O, O, O, O, O, O, O, O]","[6556, 19800, 31014, 11037, 34349, 28296, 1071...","[13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13]"
4,Sentence: 10000,"[U.N., relief, coordinator, Jan, Egeland, said...","[NNP, NN, NN, NNP, NNP, VBD, NNP, ,, NNP, ,, J...","[B-geo, O, O, B-per, I-per, O, B-tim, O, B-geo...","[7468, 17046, 24925, 18233, 8628, 21915, 31096...","[0, 13, 13, 16, 1, 13, 5, 13, 0, 13, 12, 13, 1..."
5,Sentence: 10001,"[Mr., Egeland, said, the, latest, figures, sho...","[NNP, NNP, VBD, DT, JJS, NNS, VBP, CD, CD, NNS...","[B-per, I-per, O, O, O, O, O, O, O, O, O, O, O...","[2743, 8628, 21915, 30893, 26524, 11861, 13601...","[16, 1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13..."
6,Sentence: 10002,"[He, said, last, week, 's, tsunami, and, the, ...","[PRP, VBD, JJ, NN, POS, NN, CC, DT, JJ, NN, NN...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[11451, 21915, 8495, 18535, 34279, 31210, 1916...","[13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 1..."
7,Sentence: 10003,"[Some, 1,27,000, people, are, known, dead, .]","[DT, CD, NNS, VBP, VBN, JJ, .]","[O, O, O, O, O, O, O]","[1656, 16459, 30856, 23491, 5047, 7471, 20450]","[13, 13, 13, 13, 13, 13, 13]"
8,Sentence: 10004,"[Aid, is, being, rushed, to, the, region, ,, b...","[NNP, VBZ, VBG, VBN, TO, DT, NN, ,, CC, DT, NN...","[O, O, O, O, O, O, O, O, O, O, B-geo, O, O, O,...","[580, 13033, 29348, 3594, 29014, 30893, 401, 2...","[13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 0, 13..."
9,Sentence: 10005,"[Lebanese, politicians, are, condemning, Frida...","[JJ, NNS, VBP, VBG, NNP, POS, NN, NN, IN, DT, ...","[B-gpe, O, O, O, B-tim, O, O, O, O, O, O, O, O...","[21627, 14215, 23491, 14003, 6735, 34279, 5223...","[12, 13, 13, 13, 5, 13, 13, 13, 13, 13, 13, 13..."
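
The fill-and-group steps can be checked on a tiny hand-made frame. This is only a sketch with made-up rows. It forward-fills just the sentence column and passes `sort=False`, because the default `groupby` sorts the keys as strings, which is why `Sentence: 10` follows `Sentence: 1` in the output above.

```python
import numpy as np
import pandas as pd

# Toy data in the same layout as the dataset (values are made up)
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["Thousands", "marched", ".", "They", "left"],
    "Tag": ["O", "O", "O", "O", "O"],
})

# Forward-fill only the sentence column so the other columns stay untouched
toy["Sentence #"] = toy["Sentence #"].ffill()

# sort=False keeps the sentences in order of first appearance
toy_grouped = toy.groupby("Sentence #", sort=False, as_index=False).agg(list)

print(toy_grouped["Word"].tolist())
# [['Thousands', 'marched', '.'], ['They', 'left']]
```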


### Split the data into train, validation, and test

Before splitting, we are still not finished with preprocessing. All tokenized sentences and label sequences that go into the model must have the same length, so we pad every sequence to the length of the longest sentence in the data.

In [8]:
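# NOTE: the padding value reuses the id of the last vocabulary word, so padded
# positions are indistinguishable from that word; padded tags are set to 'O'
words_padding = len(word_index) - 1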
sentences  = data_ordered['Word_index'].tolist()
tags       = data_ordered['Tag_index'].tolist()
maxlen = max([len(s) for s in sentences])

sentences_padded = pad_sequences(sentences, maxlen = maxlen, dtype='int32', padding='post', value = words_padding)
tags_padded      = pad_sequences(tags     , maxlen = maxlen, dtype='int32', padding='post', value = tag_index['O'])
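
In the cell above, the padding value `len(word_index) - 1` is also the id of a real word. A minimal sketch of post-padding with a dedicated padding id, written in plain NumPy, is shown here. The helper `pad_post` and the toy ids are assumptions for illustration. Using an extra id like this would also require one additional row in the embedding matrix.

```python
import numpy as np


def pad_post(seqs, maxlen, value):
    """Right-pad (or cut off at maxlen) integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        trimmed = seq[:maxlen]
        out[row, :len(trimmed)] = trimmed
    return out


toy_sentences = [[4, 7, 2], [5], [3, 3, 8, 1]]
toy_vocab_size = 9             # assumed: real word ids are 0..8
toy_pad_id = toy_vocab_size    # a fresh id that no real word uses

toy_padded = pad_post(toy_sentences, maxlen=4, value=toy_pad_id)
print(toy_padded)
# [[4 7 2 9]
#  [5 9 9 9]
#  [3 3 8 1]]
```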

The tags are kept as integer class indices rather than one-hot vectors. The loss function defined later picks the log-probability of the true tag directly, so no one-hot encoding is needed.

Now the data is finally ready to be split between train, validation, and test.

In [9]:
x_train_val, x_test, y_train_val, y_test = train_test_split(sentences_padded, tags_padded, 
                                                            test_size = 0.1, train_size = 0.9, random_state = 42)

x_train, x_val, y_train, y_val = train_test_split(x_train_val, 
                                                  y_train_val,test_size = 0.2,train_size = 0.8, random_state = 42)


y_train = np.asarray(y_train, dtype = np.int64).squeeze()
    
y_val = np.asarray(y_val, dtype = np.int64).squeeze()

y_test = np.asarray(y_test, dtype = np.int64).squeeze()

print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(34530, 104)
(8633, 104)
(4796, 104)


## Glove embeddings

We use GloVe embeddings to improve performance by exploiting the word similarities they were pre-trained to capture.
- We first read the file.
- Then we create a dictionary mapping each word to its embedding.
- Then we create a matrix that maps each word index to its embedding.

In [10]:
word_emd = {}
# Each line has the form "<word> v1 v2 ... v100"; read the file line by line
with open("./data/glove.6B.100d.txt", "r", encoding="utf8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        # first field is the word, the rest are the embedding values
        word_emd[parts[0]] = np.asarray(parts[1:], dtype=np.float64)

Mapping each word index to its embedding:

In [11]:
vocab_length = len(word_index)
embed_dim = 100
weight_matrix = np.zeros((vocab_length, embed_dim))

for i in range(vocab_length):
    try:
        # glove.6B is uncased, so look words up in lowercase
        word = index_word[i].lower()
        weight_matrix[i] = word_emd[word]
    except KeyError:
        # words missing from GloVe get a random vector
        weight_matrix[i] = np.random.normal(scale = 0.6, size=(embed_dim, ))
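
It can be useful to know how many vocabulary words actually receive a pre-trained vector. The sketch below repeats the lookup on made-up three-dimensional "GloVe" lines and counts the hits. All names and values are illustrative assumptions.

```python
import numpy as np

# Two fake embedding lines in the GloVe text format (values are made up)
toy_glove_lines = [
    "london 0.1 0.2 0.3",
    "the 0.4 0.5 0.6",
]
toy_emb = {}
for line in toy_glove_lines:
    parts = line.split(" ")
    toy_emb[parts[0]] = np.asarray(parts[1:], dtype=np.float64)

toy_id_to_word = {0: "London", 1: "the", 2: "Egeland"}
toy_dim = 3
rng = np.random.default_rng(0)
toy_matrix = np.zeros((len(toy_id_to_word), toy_dim))

hits = 0
for i, w in toy_id_to_word.items():
    vec = toy_emb.get(w.lower())          # lowercase lookup, as in the cell above
    if vec is None:
        vec = rng.normal(scale=0.6, size=toy_dim)   # random vector for unknown words
    else:
        hits += 1
    toy_matrix[i] = vec

print(f"coverage: {hits}/{len(toy_id_to_word)}")   # coverage: 2/3
```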

Creating a torch embedding layer:

In [12]:
num_embeddings = weight_matrix.shape[0]
weights = torch.from_numpy(weight_matrix)
emb_layer = nn.Embedding(num_embeddings, embed_dim)
emb_layer.load_state_dict({'weight': weights})
emb_layer.weight.requires_grad = False
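
As an alternative to loading a state dict and then disabling gradients, PyTorch offers `nn.Embedding.from_pretrained`, which does both in one call. A small sketch with a random stand-in for `weight_matrix`:

```python
import numpy as np
import torch
import torch.nn as nn

# Random stand-in for weight_matrix: 5 words, 4 dimensions
toy_weights = np.random.default_rng(0).normal(size=(5, 4))

# from_pretrained expects a float tensor; freeze=True disables gradient updates
toy_layer = nn.Embedding.from_pretrained(
    torch.tensor(toy_weights, dtype=torch.float32), freeze=True
)

print(toy_layer.weight.requires_grad)             # False
print(toy_layer(torch.tensor([[0, 4]])).shape)    # torch.Size([1, 2, 4])
```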

# Setting up the model

The embedding layer created above is passed to the model's constructor and becomes the embedding layer of the model.
The model is an LSTM layer followed by a dense layer whose output size equals the number of tags. The forward pass returns the log-softmax (log-probabilities) over the tags for every token.

In [13]:
class Net(nn.Module):
    def __init__(self, weight_matrix, lstm_hidden_dim, num_of_tags, emb_layer, embed_dim):
        super(Net, self).__init__()

        #Embedding layer 
        self.embedding, embedding_dim = emb_layer, embed_dim

        #the LSTM takes the embedded sentence as input
        self.lstm = nn.LSTM(embedding_dim, lstm_hidden_dim, batch_first = True)

        #fc layer transforms the output to give the final output layer
        self.fc = nn.Linear(lstm_hidden_dim, num_of_tags)



    def forward(self, s):
        #apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim

        #run the LSTM along the sentences of length batch_max_len
        s, _ = self.lstm(s)     # dim: batch_size x batch_max_len x lstm_hidden_dim                

        #reshape the tensor so that each row contains one token
        s = s.reshape(-1, s.shape[2])  # dim: batch_size*batch_max_len x lstm_hidden_dim

        #apply the fully connected layer and obtain the output for each token
        s = self.fc(s)          # dim: batch_size*batch_max_len x num_tags

        return F.log_softmax(s, dim = 1)   # dim: batch_size*batch_max_len x num_tags
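
The shape comments in the forward pass can be verified with toy sizes. This sketch rebuilds the same three layers outside the class, with sizes chosen only for illustration, and checks that the result has one row per token and that each row is a valid log-probability distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes, chosen only for illustration
toy_vocab, toy_emb_dim, toy_hidden, toy_n_tags = 20, 8, 16, 5
toy_batch, toy_len = 3, 7

toy_embedding = nn.Embedding(toy_vocab, toy_emb_dim)
toy_lstm = nn.LSTM(toy_emb_dim, toy_hidden, batch_first=True)
toy_fc = nn.Linear(toy_hidden, toy_n_tags)

toy_tokens = torch.randint(0, toy_vocab, (toy_batch, toy_len))
h, _ = toy_lstm(toy_embedding(toy_tokens))       # (batch, seq_len, hidden)
flat = h.reshape(-1, h.shape[2])                 # (batch * seq_len, hidden)
toy_log_probs = F.log_softmax(toy_fc(flat), dim=1)

print(toy_log_probs.shape)                # torch.Size([21, 5])
print(toy_log_probs.exp().sum(dim=1))     # every entry is approximately 1
```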

## Defining the loss function

The output of the model that is sent to the loss function has shape (batch_size * max_len) x num_tags and holds the log-softmax values of the tag classes. To calculate the cross-entropy loss we simply pick the log-probability corresponding to the true label for every token in the batch. The cost is then the negative sum of these values divided by the total number of tokens. Note that padded positions, which carry the 'O' tag, are included in this average.

In [14]:
def loss_fn(outputs, batch_y):
    #reshape labels to give a flat vector of length batch_size * max_len
    batch_y = batch_y.view(-1)  

    #total number of tokens in the batch, padded positions included
    total_tags = batch_y.shape[0]

    #pick the log-probability of the true tag for each token
    outputs = outputs[range(outputs.shape[0]), batch_y]

    #average negative log-likelihood over all tokens (no padding mask is applied)
    return -torch.sum(outputs) / total_tags
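
The manual indexing in `loss_fn` computes the same quantity as PyTorch's built-in `F.nll_loss` with its default mean reduction. The sketch below checks this on random toy values and also shows one possible way to leave padded positions out of the average via `ignore_index`. Labelling padding as -100 is an assumption here; the notebook labels padding as 'O'.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_lp = F.log_softmax(torch.randn(6, 4), dim=1)   # 6 tokens, 4 tags
toy_labels = torch.tensor([0, 3, 1, 1, 2, 2])

# Manual version: pick the log-probability of the true tag, then average
manual = -toy_lp[torch.arange(6), toy_labels].sum() / 6
builtin = F.nll_loss(toy_lp, toy_labels)           # mean reduction by default
print(torch.allclose(manual, builtin))             # True

# Masked variant: positions labelled -100 are skipped in the average
masked_labels = toy_labels.clone()
masked_labels[4:] = -100       # pretend the last two tokens are padding
masked = F.nll_loss(toy_lp, masked_labels, ignore_index=-100)
expected = -toy_lp[torch.arange(4), toy_labels[:4]].mean()
print(torch.allclose(masked, expected))            # True
```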

## Defining the model and the optimizer

In [16]:
model = Net( weight_matrix, lstm_hidden_dim=128, num_of_tags=len(tag_index), emb_layer = emb_layer, embed_dim = embed_dim)

optimizer = optim.Adam( model.parameters(), lr = 1e-3 )

batch_size=128
indices = np.arange(y_train.shape[0])
epochs = 1500

# Training

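We could use the validation set to report accuracy during training, but that has been skipped. Note that each of the 1500 "epochs" below is a single optimizer step on one randomly drawn mini-batch of 128 sentences, not a full pass over the training data.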

In [17]:
losses = []
for e in tqdm(range(epochs)):
    np.random.shuffle(indices)
    batch_indices = indices[:batch_size]
    batch_x = x_train[batch_indices]
    batch_y = y_train[batch_indices]

    batch_x = torch.from_numpy(batch_x)
    batch_y = torch.from_numpy(batch_y)

    probs = model.forward(batch_x)
    loss = loss_fn(probs, batch_y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    #store a plain Python float so that np.mean works on the list
    losses.append(loss.item())
    if e % 100 == 0:
        print("Epoch: %d - %.6f" %(e, np.mean(losses)))
        losses = []

Epoch: 0 - 2.685585
Epoch: 100 - 0.339507
Epoch: 200 - 0.142585
Epoch: 300 - 0.114563
Epoch: 400 - 0.096106
Epoch: 500 - 0.082273
Epoch: 600 - 0.073100
Epoch: 700 - 0.066558
Epoch: 800 - 0.061846
Epoch: 900 - 0.057461
Epoch: 1000 - 0.054761
Epoch: 1100 - 0.052442
Epoch: 1200 - 0.050872
Epoch: 1300 - 0.048456
Epoch: 1400 - 0.047358
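
The loop above draws one random mini-batch per iteration and never looks at the validation set. A minimal sketch of an epoch-based alternative with a validation pass is given below. It runs on random stand-in data with a small toy model, and it swaps in `nn.CrossEntropyLoss` on raw logits, which combines log-softmax and negative log-likelihood. `ToyTagger` and all sizes are illustrative assumptions, not the notebook's model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Random stand-ins for (x_train, y_train) and (x_val, y_val)
tr_x, tr_y = torch.randint(0, 20, (64, 7)), torch.randint(0, 5, (64, 7))
va_x, va_y = torch.randint(0, 20, (16, 7)), torch.randint(0, 5, (16, 7))
train_loader = DataLoader(TensorDataset(tr_x, tr_y), batch_size=16, shuffle=True)


class ToyTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(20, 8)
        self.lstm = nn.LSTM(8, 16, batch_first=True)
        self.fc = nn.Linear(16, 5)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.fc(h)              # raw logits: (batch, seq_len, tags)


toy_model = ToyTagger()
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

val_history = []
for epoch in range(3):
    toy_model.train()
    for xb, yb in train_loader:        # one epoch = one full pass over the data
        loss = criterion(toy_model(xb).reshape(-1, 5), yb.reshape(-1))
        toy_opt.zero_grad()
        loss.backward()
        toy_opt.step()

    toy_model.eval()
    with torch.no_grad():              # no gradients needed for validation
        val_loss = criterion(toy_model(va_x).reshape(-1, 5), va_y.reshape(-1)).item()
    val_history.append(val_loss)
    print(f"epoch {epoch}: validation loss {val_loss:.4f}")
```

Tracking the validation loss per epoch makes it possible to stop training once it no longer improves.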


# Evaluating on the test dataset

For each of the tags we calculate precision, recall, and F1-score.

In [18]:
from sklearn.metrics import precision_recall_fscore_support

batch_size = 128
preds = []

model.eval()
with torch.no_grad():   # no gradients are needed for inference
    for k in range(0, x_test.shape[0], batch_size):
        # slicing past the end of the array is safe, so no special last-batch case
        x = torch.from_numpy(x_test[k:k+batch_size])
        log_probs = model(x).numpy()
        preds.append(np.argmax(log_probs, axis=1))

preds = np.hstack(preds)

labels = [ index_tag[i] for i in range(len(index_tag)) ]
y_test = y_test.reshape((-1,))

# pass the label ids explicitly so that p[i] always lines up with labels[i],
# even if a tag never occurs in y_test or preds
p, r, f, s = precision_recall_fscore_support( y_test, preds, labels = list(range(len(labels))), zero_division = 0)

for i in range(len(labels)):
    print("Label: %s \t- Precision: %.4f - Recall: %.4f - f1: %.4f - Support: %.4f" %(labels[i], p[i], r[i], f[i], s[i]) )

Label: B-geo 	- Precision: 0.7753 - Recall: 0.8185 - f1: 0.7963 - Support: 3797.0000
Label: I-per 	- Precision: 0.7974 - Recall: 0.8806 - f1: 0.8369 - Support: 1658.0000
Label: I-nat 	- Precision: 0.0000 - Recall: 0.0000 - f1: 0.0000 - Support: 7.0000
Label: I-gpe 	- Precision: 0.0000 - Recall: 0.0000 - f1: 0.0000 - Support: 16.0000
Label: I-geo 	- Precision: 0.6892 - Recall: 0.6003 - f1: 0.6417 - Support: 713.0000
Label: B-tim 	- Precision: 0.8950 - Recall: 0.7000 - f1: 0.7855 - Support: 2033.0000
Label: B-art 	- Precision: 0.0000 - Recall: 0.0000 - f1: 0.0000 - Support: 46.0000
Label: I-org 	- Precision: 0.6991 - Recall: 0.3378 - f1: 0.4556 - Support: 1699.0000
Label: I-tim 	- Precision: 0.8322 - Recall: 0.4068 - f1: 0.5465 - Support: 585.0000
Label: B-nat 	- Precision: 0.0000 - Recall: 0.0000 - f1: 0.0000 - Support: 20.0000
Label: I-eve 	- Precision: 0.0000 - Recall: 0.0000 - f1: 0.0000 - Support: 39.0000
Label: I-art 	- Precision: 0.0000 - Recall: 0.0000 - f1: 0.0000 - Support: 40.

We can see that the tags with very few occurrences have zero precision, recall, and F1-score. Sometimes this does not matter, because they make up a small percentage of the data. But if those tags are critical and carry information that cannot be missed, we need to oversample them.

This can be implemented using a sampler in a dataloader. An example of that is shown in the notebook for probability calibration.
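
As a rough illustration of the sampler idea, the sketch below uses `WeightedRandomSampler` from `torch.utils.data` to draw sentences that contain a rare tag about as often as those that do not. The toy tag ids, the 50/50 target, and the sentence-level weighting are assumptions made for this example and are not taken from the probability calibration notebook.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy padded tag matrix: 0 = 'O', 1 = a rare tag (ids are illustrative)
toy_tags = np.zeros((100, 6), dtype=np.int64)
toy_tags[:5, 2] = 1                    # only 5 of 100 sentences contain the rare tag
toy_tokens = np.zeros((100, 6), dtype=np.int64)

has_rare = (toy_tags == 1).any(axis=1)
# Weight sentences so that both groups are drawn about equally often
weights = np.where(has_rare, 0.5 / has_rare.sum(), 0.5 / (~has_rare).sum())

sampler = WeightedRandomSampler(
    torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(weights),
    replacement=True,                  # rare sentences must be drawn repeatedly
)
loader = DataLoader(
    TensorDataset(torch.from_numpy(toy_tokens), torch.from_numpy(toy_tags)),
    batch_size=20,
    sampler=sampler,
)

rare_seen = sum(int((yb == 1).any(dim=1).sum()) for _, yb in loader)
print(f"sentences with the rare tag in one pass: {rare_seen} of {len(weights)}")
```

Because tags are assigned per token while sampling happens per sentence, this only raises the frequency of rare tags indirectly. A class-weighted loss is another option that could be considered.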