# Named Entity Recognition (NER) using Deep Learning

Named-entity recognition (NER) is also known as entity identification, entity chunking and entity extraction. The objective is to identify entities like person names, organizations, locations etc. from unstructured text.

In this project, we will work with a dataset hosted on Kaggle. The dataset can be accessed from the Kaggle link below:
https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus

The sample data has already been annotated: each word is tagged with its part-of-speech (POS) tag and its NER tag. In this project we focus on predicting the NER tags. The Deep Learning approach of learning from previously annotated data is the same for either kind of tag, so only a minor change in the code is needed to do POS tagging instead.

In [13]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [2]:
data = pd.read_csv(r'Data/ner_dataset.csv', encoding='unicode_escape')
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


### Create dictionaries to encode the words and NER tags

In [4]:
def get_vocab_dict(data, token_or_tag='token'):
    """Build index<->token lookup tables for the 'Word' or 'Tag' column."""
    if token_or_tag == 'token':
        vocab = set(data['Word'])
    else:
        vocab = set(data['Tag'])

    # Assign one integer index per unique value, then invert the mapping
    idx2tok = {idx: tok for idx, tok in enumerate(vocab)}
    tok2idx = {tok: idx for idx, tok in idx2tok.items()}

    return idx2tok, tok2idx

In [5]:
idx2tag, tag2idx = get_vocab_dict(data, 'tag')
idx2word, word2idx = get_vocab_dict(data, 'token')

print("Index to Tags : \n")
print(idx2tag)
print("\nTags to Index : \n")
print(tag2idx)

Index to Tags : 

{0: 'B-art', 1: 'O', 2: 'I-per', 3: 'B-nat', 4: 'B-tim', 5: 'I-geo', 6: 'B-eve', 7: 'B-geo', 8: 'I-gpe', 9: 'I-org', 10: 'I-tim', 11: 'I-nat', 12: 'B-per', 13: 'I-art', 14: 'I-eve', 15: 'B-org', 16: 'B-gpe'}

Tags to Index : 

{'B-art': 0, 'O': 1, 'I-per': 2, 'B-nat': 3, 'B-tim': 4, 'I-geo': 5, 'B-eve': 6, 'B-geo': 7, 'I-gpe': 8, 'I-org': 9, 'I-tim': 10, 'I-nat': 11, 'B-per': 12, 'I-art': 13, 'I-eve': 14, 'B-org': 15, 'B-gpe': 16}
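
The same lookup tables can be built for any label column, which is what makes the switch to POS tagging a small change. Below is a minimal sketch, assuming a hypothetical helper `build_vocab` (not part of this notebook) that accepts any iterable of labels with no missing values. It sorts the labels so that the indices are identical on every run, whereas the iteration order of a set of strings can differ between Python processes.

```python
def build_vocab(labels):
    """Return (idx2label, label2idx) for any iterable of sortable labels."""
    # Sorting makes the index assignment deterministic across runs
    unique = sorted(set(labels))
    idx2label = dict(enumerate(unique))
    label2idx = {label: idx for idx, label in idx2label.items()}
    return idx2label, label2idx


# POS values copied from the first rows of the table shown earlier
pos_column = ["NNS", "IN", "NNS", "VBP", "VBN"]
idx2pos, pos2idx = build_vocab(pos_column)
print(pos2idx)  # {'IN': 0, 'NNS': 1, 'VBN': 2, 'VBP': 3}
```

With the real data this would presumably be called as `build_vocab(data['POS'])`, with `POS` taking the place of `Tag` in the later cells.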


### Add columns to convert the words and NER tags to numerically encoded values

This is necessary because ML models need their input in a numerical format.

In [6]:
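# Map each word and tag to its integer index
data['Word_idx'] = data['Word'].map(word2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
# Only the first word of a sentence carries the sentence label, so forward-fill it
data['Sentence #'] = data['Sentence #'].ffill()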
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag,Word_idx,Tag_idx
0,Sentence: 1,Thousands,NNS,O,17755,1
1,Sentence: 1,of,IN,O,29027,1
2,Sentence: 1,demonstrators,NNS,O,34706,1
3,Sentence: 1,have,VBP,O,11443,1
4,Sentence: 1,marched,VBN,O,30908,1


### Transform columns to aggregate the tokens/tags at a sentence level

In [7]:
# Select the columns with a list (double brackets), then collect each sentence's values into lists
data_group = data.groupby('Sentence #', as_index=False)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x : list(x))
data_group.head()


Unnamed: 0,Sentence #,Word,POS,Tag,Word_idx,Tag_idx
0,Sentence: 1,"[Thousands, of, demonstrators, have, marched, ...","[NNS, IN, NNS, VBP, VBN, IN, NNP, TO, VB, DT, ...","[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo...","[17755, 29027, 34706, 11443, 30908, 10389, 334...","[1, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 1, 7, 1, 1, ..."
1,Sentence: 10,"[Iranian, officials, say, they, expect, to, ge...","[JJ, NNS, VBP, PRP, VBP, TO, VB, NN, TO, JJ, J...","[B-gpe, O, O, O, O, O, O, O, O, O, O, O, O, O,...","[26574, 19104, 22117, 25087, 22100, 12616, 219...","[16, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..."
2,Sentence: 100,"[Helicopter, gunships, Saturday, pounded, mili...","[NN, NNS, NNP, VBD, JJ, NNS, IN, DT, NNP, JJ, ...","[O, O, B-tim, O, O, O, O, O, B-geo, O, O, O, O...","[8087, 34712, 15377, 2082, 32462, 24997, 13078...","[1, 1, 4, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 1, 15,..."
3,Sentence: 1000,"[They, left, after, a, tense, hour-long, stand...","[PRP, VBD, IN, DT, NN, JJ, NN, IN, NN, NNS, .]","[O, O, O, O, O, O, O, O, O, O, O]","[24125, 6952, 5610, 5157, 2551, 1506, 25443, 1...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
4,Sentence: 10000,"[U.N., relief, coordinator, Jan, Egeland, said...","[NNP, NN, NN, NNP, NNP, VBD, NNP, ,, NNP, ,, J...","[B-geo, O, O, B-per, I-per, O, B-tim, O, B-geo...","[24685, 1596, 3575, 15939, 34577, 5674, 7886, ...","[7, 1, 1, 12, 2, 1, 4, 1, 7, 1, 16, 1, 16, 1, ..."
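
For readers less familiar with pandas, the forward-fill and group-by steps can be pictured in plain Python. This is only an illustrative sketch with hypothetical names (`rows`, `group_by_sentence`), not the code used in this notebook: a row with a missing sentence label (`None`) inherits the last label seen, and rows sharing a label are collected into lists.

```python
def group_by_sentence(rows):
    """rows: iterable of (sentence_label_or_None, word, tag) tuples in file order."""
    grouped = {}
    current = None
    for label, word, tag in rows:
        if label is not None:
            # A labeled row marks the start of a new sentence
            current = label
        # Forward-fill: unlabeled rows belong to the current sentence
        words, tags = grouped.setdefault(current, ([], []))
        words.append(word)
        tags.append(tag)
    return grouped


rows = [
    ("Sentence: 1", "Thousands", "O"),
    (None, "of", "O"),
    (None, "demonstrators", "O"),
    ("Sentence: 2", "Iranian", "B-gpe"),  # made-up second sentence for illustration
    (None, "officials", "O"),
]
sentences = group_by_sentence(rows)
print(sentences["Sentence: 1"])
```

Unlike this sketch, which keeps file order, the pandas result above is sorted by the label string, which is why `Sentence: 10` follows `Sentence: 1`.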


### Pad the sequences to be of the same length

Keras expects every sequence of words/tags in a batch to have the same length, so we find the longest sequence and pad all the others to that length.

In [33]:
# Calculate the vocabulary size
n_tokens = len(set(data['Word']))

# Calculate the number of unique NER tags
n_tags = len(set(data['Tag']))

# Calculate the length of the longest sequence
max_length = max([len(x) for x in data_group['Word_idx']])

print(f"Max length of sequences : {max_length}")

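# Pad the tokens
# Note: n_tokens - 1 is also the index of a real word in word2idx, so padding is
# indistinguishable from that word. Padding with n_tokens instead would avoid the
# overlap (it stays a valid index because input_dim below is len(word2idx) + 1).
pad_tokens = pad_sequences(data_group['Word_idx'],maxlen=max_length, padding='post', value = n_tokens - 1)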

# Pad the tags
pad_tags = pad_sequences(data_group['Tag_idx'],maxlen=max_length, padding='post', value = tag2idx["O"])

# One-hot encode the tags
encoded_tags = [to_categorical(i , num_classes=n_tags) for i in pad_tags]

Max length of sequences : 104
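
To make the two steps concrete, here is a small NumPy sketch of what post-padding and one-hot encoding do. The helpers `pad_post` and `one_hot` are hypothetical stand-ins written for illustration, and `pad_post` assumes no sequence is longer than `maxlen`; the notebook itself relies on Keras' `pad_sequences` and `to_categorical`.

```python
import numpy as np


def pad_post(sequences, maxlen, value):
    """Right-pad each sequence to maxlen with value (sequences must fit in maxlen)."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        out[row, :len(seq)] = seq
    return out


def one_hot(indices, num_classes):
    """Map an integer array of shape (...) to one-hot vectors of shape (..., num_classes)."""
    return np.eye(num_classes, dtype=np.float32)[indices]


toy_tags = [[1, 7, 1], [16]]                    # two toy sentences of tag indices
padded = pad_post(toy_tags, maxlen=4, value=1)  # 1 plays the role of tag2idx["O"]
encoded = one_hot(padded, num_classes=17)
print(padded)         # [[ 1  7  1  1]
                      #  [16  1  1  1]]
print(encoded.shape)  # (2, 4, 17)
```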


#### Let us see the effect of these transformations on the data

In [34]:
print(f"Original words sequence : \n{data_group['Word_idx'][0]}")
print(f"\nPadded words sequence : \n{pad_tokens[0]}")

print(f"\nOriginal tags sequence : \n{data_group['Tag_idx'][0]}")
print(f"\nPadded tags sequence : \n{pad_tags[0]}")

print("\n\nEncoded tags : ")
for i in encoded_tags[0][:10]:
    print(i)

Original words sequence : 
[17755, 29027, 34706, 11443, 30908, 10389, 33495, 12616, 2825, 23743, 6437, 13078, 13876, 4854, 13298, 23743, 8528, 29027, 18277, 5011, 31552, 26038, 20246, 16765]

Padded words sequence : 
[17755 29027 34706 11443 30908 10389 33495 12616  2825 23743  6437 13078
 13876  4854 13298 23743  8528 29027 18277  5011 31552 26038 20246 16765
 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177
 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177
 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177
 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177
 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177
 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177 35177
 35177 35177 35177 35177 35177 35177 35177 35177]

Original tags sequence : 
[1, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 1, 16, 1, 1, 1, 1, 1]

Padded tags sequence : 
[ 1  1  1  1  1  1  7  

### Split into Train, Validation and Test sets

In [101]:
# Hold out 10% of the data as the test set
tokens, X_test, tags, y_test = train_test_split(pad_tokens, encoded_tags, test_size=0.1, random_state = 100)

# Split the remaining 90% into train (75%) and validation (25%) sets,
# i.e. 67.5% and 22.5% of the full data
X_train, X_val, y_train, y_val = train_test_split(tokens, tags, test_size=0.25, random_state = 100)

y_train = np.array(y_train)
y_val = np.array(y_val)
y_test = np.array(y_test)

print(f"X_train shape : {X_train.shape} \ny_train shape : {y_train.shape}")
print(f"\nX_val shape   : {X_val.shape} \ny_val shape   : {y_val.shape}")

X_train shape : (32372, 104) 
y_train shape : (32372, 104, 17)

X_val shape   : (10791, 104) 
y_val shape   : (10791, 104, 17)
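
The two-stage split leaves 10% of the sentences for testing, 22.5% (25% of the remaining 90%) for validation and 67.5% for training. Below is a short arithmetic sketch, assuming the held-out part of each split is rounded up; `split_sizes` is a hypothetical helper, and the total of 47959 sentences is the count implied by the shapes printed above under that rounding assumption.

```python
import math


def split_sizes(n_samples, test_frac=0.1, val_frac=0.25):
    """Return (n_train, n_val, n_test) for a test split followed by a validation split."""
    n_test = math.ceil(test_frac * n_samples)
    remaining = n_samples - n_test
    n_val = math.ceil(val_frac * remaining)
    n_train = remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(47959)
print(n_train, n_val, n_test)  # 32372 10791 4796
```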


## Build the Bidirectional-LSTM Model

In [92]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dropout, Dense, Embedding
from tensorflow.keras.utils import plot_model

Set the random seed for reproducibility

In [76]:
tf.random.set_seed(2)

Define the input and output dimensions for the model

In [77]:
input_dim = len(word2idx) + 1
output_dim = 64 # creating word vectors of length 64
input_length = max([len(s) for s in data_group['Word_idx']])
n_tags = len(tag2idx)

print(f"input_dim : {input_dim} \noutput_dim : {output_dim}\ninput_length : {input_length}\nnumber of tags : {n_tags}")

input_dim : 35179 
output_dim : 64
input_length : 104
number of tags : 17


In [89]:
model = Sequential()

model.add(Embedding(input_dim = input_dim,
                    output_dim = output_dim,
                    input_length = input_length))

model.add(Bidirectional(LSTM(units = output_dim, 
                             return_sequences = True,
                             dropout = 0.2,
                             recurrent_dropout = 0.2)))

model.add(Dense(n_tags, activation = 'softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 104, 64)           2251456   
_________________________________________________________________
bidirectional_9 (Bidirection (None, 104, 128)          66048     
_________________________________________________________________
dense_9 (Dense)              (None, 104, 17)           2193      
Total params: 2,319,697
Trainable params: 2,319,697
Non-trainable params: 0
_________________________________________________________________
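
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: an embedding stores one vector per index, each LSTM direction has four gates with input weights, recurrent weights and a bias, and the dense layer has a weight matrix plus a bias. The variable names are chosen for illustration only.

```python
input_dim, embed_dim, lstm_units, n_tags = 35179, 64, 64, 17

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = input_dim * embed_dim

# LSTM: 4 gates, each with (input + recurrent + bias) weights per unit
one_direction = 4 * (embed_dim + lstm_units + 1) * lstm_units
bilstm_params = 2 * one_direction  # forward and backward passes

# Dense: the input is the concatenation of both directions (2 * lstm_units)
dense_params = (2 * lstm_units) * n_tags + n_tags

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 2251456 66048 2193 2319697
```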


In [90]:
history = model.fit(X_train, np.array(y_train),
                    epochs = 10,
                    batch_size=512,
                    validation_data = (X_val, y_val),
                    verbose=1)

Train on 32372 samples, validate on 10791 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
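
Note that the `accuracy` metric used here counts every position, including the padded ones, and padded positions are labeled `O`. One way to check how much this affects the score is to compute accuracy over real tokens only. Below is a minimal NumPy sketch, assuming a hypothetical `lengths` array that holds each sentence's true length (the notebook does not compute one).

```python
import numpy as np


def masked_accuracy(y_true, y_pred, lengths):
    """Accuracy over non-padded positions only.

    y_true, y_pred: int arrays of shape (n_sentences, max_length) with tag indices.
    lengths: true (unpadded) length of each sentence.
    """
    positions = np.arange(y_true.shape[1])
    # mask[i, j] is True when position j is a real token of sentence i
    mask = positions[None, :] < np.asarray(lengths)[:, None]
    return float((y_true[mask] == y_pred[mask]).mean())


y_true = np.array([[1, 7, 1, 1], [16, 1, 1, 1]])  # toy data; trailing 1s are padding
y_pred = np.array([[1, 1, 1, 1], [16, 1, 1, 1]])
print((y_true == y_pred).mean())                        # 0.875 with padding counted
print(masked_accuracy(y_true, y_pred, lengths=[3, 1]))  # 0.75 on real tokens only
```

With the model above, `y_pred` could be obtained as `model.predict(X_val).argmax(axis=-1)` and `y_true` as `y_val.argmax(axis=-1)`.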


## Summary

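Human annotation quality is around 98%. We can see that with a Bidirectional LSTM we can get around 99% token-level accuracy on held-out data, which is phenomenal :). Keep in mind that this figure also counts the padded positions, which are all labeled `O` and are easy to predict, so accuracy on real tokens is likely lower. Hope you enjoyed this article.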