# Named Entity Recognition (NER) using Keras

[Named Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) has many applications, for example in:
- Search Engine Efficiency
- Recommendation engine
- Resume parsing
- Customer service

Here we use a well-known dataset from Kaggle: [Data](https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus)


## Loading packages

In [1]:
import os
import random as rnd

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras.layers as tfl
from numpy import array
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (Bidirectional, Dense, Embedding, Flatten,
                                     LSTM, SpatialDropout1D, TimeDistributed)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot

## Loading data

The data has been converted into text files and is already split into train, validation, and test subsets.

In [2]:
#Train data
with open("./data/train_sentences.txt", 'r', encoding="utf8") as f:
    t_sentences = f.readlines()

with open("./data/train_labels.txt", 'r', encoding="utf8") as f:
    t_labels = f.readlines()

#Validation data
with open("./data/eval_sentences.txt", 'r', encoding="utf8") as f:
    v_sentences = f.readlines()

with open("./data/eval_labels.txt", 'r', encoding="utf8") as f:
    v_labels = f.readlines()

#Test data
with open("./data/test_sentences.txt", 'r', encoding="utf8") as f:
    test_sentences = f.readlines()

with open("./data/test_labels.txt", 'r', encoding="utf8") as f:
    test_labels = f.readlines()
    
#Tags
with open("./data/tags.txt", 'r', encoding="utf8") as f:
    tags = f.readlines()

## Data Visualization and Preprocessing

Let's check the first sentence in the training data, looking at both the sentence and the tag for each word:

In [3]:
print(t_sentences[0])
print(t_labels[0])

Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O
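
Pairing each word with its tag makes the alignment easier to check. A minimal sketch, using the sentence and label strings printed above as literals:

```python
# First training sentence and its labels, copied from the output above.
sentence = ("Thousands of demonstrators have marched through London to protest "
            "the war in Iraq and demand the withdrawal of British troops from "
            "that country .")
labels = "O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O"

words = sentence.split(' ')
tags = labels.split(' ')
assert len(words) == len(tags)  # one tag per token

# Keep only the tokens that carry an entity tag.
entities = [(w, t) for w, t in zip(words, tags) if t != 'O']
print(entities)
```

Only three of the 24 tokens carry an entity tag, which already hints at how dominant the 'O' class is.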



## Data Preprocessing

For processing the data we need to do a few tasks:
- Tokenize each sentence (in all train, validation, and test datasets)
- Tokenize the tag labels (Here, we have to make our own dictionary)
- Pad all sentences and labels to the maximum size

To tokenize the sentences we could use a pre-trained model with good word embeddings. As a test we don't do that here; instead we build a dictionary that assigns each word a number in the order we encounter it while going through all the words in the corpus.

The number of classes is small and we only need to go through our training dataset to find all of them. We create a dictionary to translate each tag to a number.
    

In [4]:
word_index = {}
tag_index  = {}

counter = 0
for sentence in t_sentences:
    sentence = sentence.strip('\n')
    sentence = sentence.split(' ')
    for word in sentence:
        if word not in word_index:
            word_index[word] = counter + 1
            counter += 1
for sentence in v_sentences:
    sentence = sentence.strip('\n')
    sentence = sentence.split(' ')
    for word in sentence:
        if word not in word_index:
            word_index[word] = counter + 1
            counter += 1
for sentence in test_sentences:
    sentence = sentence.strip('\n')
    sentence = sentence.split(' ')
    for word in sentence:
        if word not in word_index:
            word_index[word] = counter + 1
            counter += 1
            
# Note: counter equals the index of the last word added, so 'UNK' shares that index
word_index['UNK'] = counter
            
counter = 0
for tags in t_labels:
    tags = tags.strip('\n')
    tags = tags.split(' ')
    for tag in tags:
        if tag not in tag_index:
            tag_index[tag] = counter
            counter += 1



In [5]:
print("Number of unique words : {}".format(len(word_index)))
print("Number of unique tags :  {}".format(len(tag_index)))

Number of unique words : 35179
Number of unique tags :  17
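
The three vocabulary loops above repeat the same logic, so they can be folded into one helper. A minimal sketch on toy data, assuming a hypothetical function `build_index` with a `start` parameter, neither of which is part of the original notebook. Index 0 is left free for padding, and unlike the cell above, 'UNK' gets its own index after the last word:

```python
def build_index(lines, index=None, start=1):
    """Assign each unseen token the next free integer, in order of appearance.

    `build_index` and `start` are illustrative names, not from the notebook.
    """
    index = {} if index is None else index
    next_id = start + len(index)
    for line in lines:
        for token in line.strip('\n').split(' '):
            if token not in index:
                index[token] = next_id
                next_id += 1
    return index

# Toy data standing in for the sentence files.
train = ["the cat sat\n", "the dog ran\n"]
val = ["a cat ran\n"]

word_index = build_index(train)
word_index = build_index(val, word_index)      # extend the same dictionary
word_index['UNK'] = len(word_index) + 1        # one past the last word, no collision
tag_index = build_index(["O B-geo O\n"], start=0)

print(word_index)
print(tag_index)
```

With this scheme the largest index equals `len(word_index)`, so an embedding layer would need `input_dim = len(word_index) + 1`.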


### Tokenizing the sentences using the words dictionary

We create a function and pass train, validation, and test sentences to it.

In [6]:
max_len = 0

In [7]:
def tokenize_sentences(sentences, word_index, max_len = 0):
    tokenized_sentences = []
    for sentence in sentences:
        sentence = sentence.strip('\n')
        sentence = sentence.split(' ')
        max_len = max(max_len, len(sentence))
        tokenized_sentence = []
        for word in sentence:
            tokenized_sentence.append(word_index[word])
        tokenized_sentences.append(tokenized_sentence)
    return tokenized_sentences, max_len

In [8]:
x_train, max_len = tokenize_sentences(t_sentences, word_index, 0)
x_val, max_len = tokenize_sentences(v_sentences, word_index, max_len)
x_test, max_len = tokenize_sentences(test_sentences, word_index, max_len)
print(max_len)

104


### Tokenizing the tags using the tags dictionary

We create a function and pass train, validation, and test labels to it.

In [9]:
def encode_tag(tags_sequences, tag_index):
    encoded_tags_sequences = []
    for sequence in tags_sequences:
        sequence = sequence.strip('\n')
        sequence = sequence.split(' ')
        tags = []
        for tag in sequence:
            tags.append(tag_index[tag])
        encoded_tags_sequences.append(tags)
    return encoded_tags_sequences

In [10]:
y_train = encode_tag(t_labels, tag_index)
y_val   = encode_tag(v_labels, tag_index)
y_test  = encode_tag(test_labels, tag_index)

### Visualizing the data so far

In [11]:
print(x_train[0])
print(y_train[0])
Vocab_size = len(word_index)

print(Vocab_size)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 10, 16, 2, 17, 18, 19, 20, 21, 22]
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]
35179


## Padding

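We are still not finished with pre-processing. All the tokenized sentences and labels that go into the model must be the same size. The natural choice is the length of the longest sentence found while tokenizing (104), but here we set max_len to 50 to limit the amount of padding. Sentences longer than max_len are truncated by pad_sequences, which by default removes tokens from the beginning. Sentences are padded with 0 and labels with the code for the 'O' tag.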

In [12]:
max_len = 50
x_train_padded = pad_sequences(maxlen=max_len, sequences=x_train, padding="post", value=0)
y_train_padded = pad_sequences(maxlen=max_len, sequences=y_train, padding="post", value=tag_index['O'])

x_val_padded = pad_sequences(maxlen=max_len, sequences=x_val, padding="post", value=0)
y_val_padded = pad_sequences(maxlen=max_len, sequences=y_val, padding="post", value=tag_index['O'])

x_test_padded = pad_sequences(maxlen=max_len, sequences=x_test, padding="post", value=0)
y_test_padded = pad_sequences(maxlen=max_len, sequences=y_test, padding="post", value=tag_index['O'])
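
To see what `padding="post"` does, and what happens to sequences longer than `maxlen`, here is a pure-Python stand-in. It is a sketch that mimics the post-padding used above together with the Keras default of truncating from the front (`truncating="pre"`); `pad_post` is a hypothetical helper, not the Keras implementation:

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end; drop leading items of sequences longer than maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_post(toy, maxlen=5))
```

The short sequence gains trailing zeros, while the long one loses its first item. In the notebook, passing `truncating="post"` to `pad_sequences` would instead drop tokens from the end of long sentences.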

## Visualizing the pre-processed data

Now the sentences are tokenized into a number for each word, and the labels are encoded using the tag_index dictionary.

Processing the input data is usually the most time-consuming part. Now we are ready to build the model.

In [13]:
print(x_train_padded[0])
print(y_train_padded[0])

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 10 16  2 17 18 19 20 21 22
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0]
[0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]


## Setting up the model

The model is an embedding layer, followed by spatial dropout, one LSTM, and then a dense layer whose number of units equals our number of unique tags.

We could also use to_categorical to convert the integer tag codes into one-hot vectors.
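
For comparison, this is what one-hot labels look like. A small pure-Python sketch of the kind of encoding that `to_categorical` produces (the helper name `one_hot_labels` is hypothetical); with such targets the loss would be `categorical_crossentropy` instead of its sparse counterpart:

```python
def one_hot_labels(sequence, num_classes):
    """Turn a list of integer class ids into a list of one-hot rows."""
    return [[1.0 if c == label else 0.0 for c in range(num_classes)]
            for label in sequence]

encoded = one_hot_labels([0, 1, 0, 2], num_classes=3)
for row in encoded:
    print(row)
```

Each label becomes a row of length `num_classes`, so for 17 tags and 50 positions per sentence the targets grow by a factor of 17 compared with integer codes.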

In [14]:
model = Sequential(
        [
            tfl.Input(shape = (max_len, )),
            tfl.Embedding(input_dim  = Vocab_size, output_dim = 50, input_length = max_len),
            tfl.SpatialDropout1D(0.1),
            tfl.LSTM(units = 100, return_sequences=True, recurrent_dropout=0.1),
            tfl.Dense(len(tag_index), activation = 'softmax'),
        ])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            1758950   
                                                                 
 spatial_dropout1d (SpatialD  (None, 50, 50)           0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 50, 100)           60400     
                                                                 
 dense (Dense)               (None, 50, 17)            1717      
                                                                 
Total params: 1,821,067
Trainable params: 1,821,067
Non-trainable params: 0
_________________________________________________________________
None
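
The parameter counts in the summary can be checked by hand. A short sketch using the standard formulas for embedding, LSTM, and dense layers, with the sizes taken from the model above:

```python
vocab_size, embed_dim = 35179, 50   # Embedding(input_dim, output_dim)
lstm_units, num_tags = 100, 17

embedding_params = vocab_size * embed_dim
# An LSTM has 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * num_tags + num_tags   # weights + biases

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding table, which is why the vocabulary size dominates the model size.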


### Setting the optimizer

The key thing here is to use "sparse_categorical_crossentropy", because the targets are the integer index of each class and not one-hot vectors.
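
`sparse_categorical_crossentropy` and `categorical_crossentropy` compute the same quantity; they differ only in how the target is supplied. A minimal pure-Python sketch for a single token, using made-up probabilities:

```python
import math

probs = [0.7, 0.2, 0.1]     # softmax output for one token (made-up values)

# Sparse form: the target is the class id.
label = 1
sparse_loss = -math.log(probs[label])

# Dense form: the target is a one-hot vector.
one_hot = [0.0, 1.0, 0.0]
dense_loss = -sum(t * math.log(p) for t, p in zip(one_hot, probs))

print(sparse_loss, dense_loss)
```

Both forms reduce to the negative log-probability of the correct class, so the sparse variant simply avoids building the one-hot targets.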

In [15]:
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [16]:
history = model.fit(
    x=x_train_padded,
    y=y_train_padded,
    validation_data=(x_val_padded,y_val_padded),
    batch_size=32, 
    epochs=3,
    verbose=1
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


## Evaluating on the validation dataset

In [17]:
model.evaluate(x_val_padded, y_val_padded, verbose = 1)



[0.061315711587667465, 0.9815123677253723]
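
Because padded positions are labelled 'O' and are easy to predict, the accuracy reported by `evaluate` includes them. A hedged sketch of an accuracy that skips padding, assuming index 0 marks padding in the inputs as in this notebook; `masked_accuracy` is a hypothetical helper and the lists are toy values, not model output:

```python
def masked_accuracy(x_padded, y_true, y_pred, pad_id=0):
    """Token accuracy over positions where the input is not padding."""
    correct = total = 0
    for x_row, t_row, p_row in zip(x_padded, y_true, y_pred):
        for x, t, p in zip(x_row, t_row, p_row):
            if x != pad_id:
                total += 1
                correct += int(t == p)
    return correct / total if total else 0.0

# Toy batch: two sentences padded to length 5.
x = [[4, 9, 2, 0, 0], [7, 3, 0, 0, 0]]
y_true = [[0, 1, 0, 0, 0], [2, 0, 0, 0, 0]]
y_pred = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]   # a model that always predicts 'O'

plain = sum(t == p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)) / 10
print(plain, masked_accuracy(x, y_true, y_pred))
```

For real predictions, `y_pred` would be `np.argmax(model.predict(x_val_padded), axis=-1)`. The gap between the two numbers shows how much of the plain accuracy comes from padding alone.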

## Evaluating on the custom input

We may encounter unknown words when processing a custom sentence. An unknown token ('UNK') was added to the word dictionary when it was created, and the function below maps any unknown word to index Vocab_size - 1, the index that token was given, to see what happens!

This step is important because one trivial convergence point of the NN is to classify all data as 'O', since padding makes 'O' far more frequent than the other tags. Because of this we also reduced max_len from its real value of 104 to 50 to have less padding. Sentences longer than 50 tokens are comparatively rare, and adding more padding pushes the NN to converge to 'O'. There could be other techniques to improve this, but this is a simplification for now.
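
One way to judge a length cutoff is to check how many sentences it covers and how much padding it leaves. A minimal sketch with made-up lengths; in the notebook the lengths would come from `len(s)` for each tokenized sentence, and `coverage` and `padding_share` are hypothetical helpers:

```python
def coverage(lengths, cutoff):
    """Fraction of sentences whose length is at most `cutoff`."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

def padding_share(lengths, cutoff):
    """Fraction of positions that are padding after padding/truncating to `cutoff`."""
    real = sum(min(n, cutoff) for n in lengths)
    return 1 - real / (cutoff * len(lengths))

toy_lengths = [12, 20, 24, 31, 45, 60, 104]   # made-up sentence lengths
for cutoff in (50, 104):
    print(cutoff, coverage(toy_lengths, cutoff), padding_share(toy_lengths, cutoff))
```

A smaller cutoff truncates a few long sentences but lowers the share of padded positions, which is the trade-off described above.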

In [18]:
def tokenize_custom_sentences(sentences, word_index, max_len = 0):
    tokenized_sentences = []
    for sentence in sentences:
        sentence = sentence.strip('\n')
        sentence = sentence.strip('.')
        sentence = sentence.split(' ')
        max_len = max(max_len, len(sentence))
        tokenized_sentence = []
        for word in sentence:
            if word in word_index:
                tokenized_sentence.append(word_index[word])
            else:
                tokenized_sentence.append(Vocab_size - 1)
        tokenized_sentences.append(tokenized_sentence)
    return tokenized_sentences, max_len

### Tokenize and pad the input sentence

The custom sentence has to be processed the same as the training and validation data.

In [19]:
custom_sentence = "Monday morning we are going to have a meeting with Peter to asasdfdfd about San Diego traffic"
custom_sentence_processed = custom_sentence.strip('.\n').split(' ')
inputs = []
inputs.append(custom_sentence)
tokenized_input, _ = tokenize_custom_sentences(inputs, word_index)
padded_tokenized_input = pad_sequences(maxlen=max_len, sequences=tokenized_input, padding="post", value=0)

### Run the processed sentence through the NN

In [20]:
output_custom = model.predict(padded_tokenized_input)
custom = output_custom[0]



### Extract the tagged words and visualize the result

First we need to create the inverse dictionary of tag_index.
Then we need to find the max class for each word using argmax.

In [21]:
tags = list(tag_index.keys())  # inverse lookup: position i holds the tag encoded as i
output_decoded = np.argmax(custom, axis = 1)

# zip stops at the last real word, so padded positions are never indexed
for word, tag_id in zip(custom_sentence_processed, output_decoded):
    if tag_id != tag_index['O']:
        print(word + '\t\t' + tags[tag_id])
        

Monday		B-tim
morning		I-tim
Peter		B-per
San		B-geo
Diego		I-geo
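
The B- and I- prefixes mark the beginning and the continuation of an entity, so consecutive tags can be merged into spans. A minimal sketch, assuming the tags for the custom sentence are as printed above with every other word tagged 'O'; `merge_entities` is a hypothetical helper:

```python
def merge_entities(words, tags):
    """Group B-/I- tagged words into (text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition('-')
        if prefix == 'I' and current and etype == current_type:
            current.append(word)                 # continue the open span
            continue
        if current:                              # close the open span
            spans.append((' '.join(current), current_type))
            current, current_type = [], None
        if prefix in ('B', 'I'):                 # start a new span
            current, current_type = [word], etype
    if current:
        spans.append((' '.join(current), current_type))
    return spans

words = ("Monday morning we are going to have a meeting with Peter "
         "to asasdfdfd about San Diego traffic").split(' ')
tagged = {"Monday": "B-tim", "morning": "I-tim", "Peter": "B-per",
          "San": "B-geo", "Diego": "I-geo"}
tags = [tagged.get(w, 'O') for w in words]

print(merge_entities(words, tags))
```

This turns the per-word output into entities such as "San Diego" as a single geographic span.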
