Dataset:

https://github.com/amankharwal/Website-data/blob/master/ner_dataset.csv?raw=true

Named entity recognition (NER) is an essential subtask of natural language processing. It identifies spans of text that refer to real-world entities, which may be single words or multi-word phrases, and classifies them into categories such as people, organizations, places, and dates.
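
To make the task concrete, here is a minimal sketch of how token-level tags turn into entities. It assumes a BIO-style scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The tag names `B-per`, `I-per`, and `B-geo`, the example sentence, and the helper `extract_entities` are illustrative assumptions, not taken from the dataset preview.

```python
def extract_entities(words, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continue the entity in progress.
            current_words.append(word)
        else:
            # "O" (or an inconsistent I- tag) closes any entity in progress.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

words = ["Steve", "Jobs", "visited", "India"]
tags = ["B-per", "I-per", "O", "B-geo"]
entities = extract_entities(words, tags)
print(entities)
```

A model for NER therefore only has to predict one tag per token; grouping the tags into entities is a simple post-processing step.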

### Import Libraries and the dataset

In [1]:
import pandas as pd
data = pd.read_csv('/content/ner_dataset.csv', encoding='ISO-8859-1')
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


I will train a neural network for the Named Entity Recognition (NER) task. To do that, the data first needs to be converted into a form that a neural network can consume. I’ll start by building the mappings between tokens (and tags) and integer indices:

In [2]:
def get_dict_map(data, token_or_tag):
    """Return (item -> index, index -> item) dictionaries for words or tags."""
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))

    # Note: set ordering is arbitrary, so the indices can differ between runs
    idx2tok = {idx: tok for idx, tok in enumerate(vocab)}
    tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok

token2idx,idx2token = get_dict_map(data,'token')
tag2idx, idx2tag = get_dict_map(data,'tag')
data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
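
Because the iteration order of a `set` of strings is not guaranteed to be the same from one Python session to the next, indices built this way can change between runs, which matters if a saved model is reloaded later. One way to make the mapping reproducible is to sort the vocabulary first. The sketch below is an alternative to `get_dict_map`, not the notebook's method; `build_index_maps` and the toy word list are hypothetical, and it assumes every item is a string.

```python
def build_index_maps(items):
    """Return (item2idx, idx2item) with indices assigned in sorted order."""
    vocab = sorted(set(items))  # sorting makes the index assignment reproducible
    item2idx = {item: idx for idx, item in enumerate(vocab)}
    idx2item = {idx: item for item, idx in item2idx.items()}
    return item2idx, idx2item

toy_words = ["Thousands", "of", "demonstrators", "have", "marched", "of"]
word2idx, idx2word = build_index_maps(toy_words)

encoded = [word2idx[w] for w in toy_words]   # words -> integer indices
decoded = [idx2word[i] for i in encoded]     # indices -> words (round trip)
print(encoded)
print(decoded)
```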

Now, I’m going to group the rows by sentence so that each sentence becomes one sequence of words, tags, and indices that can be fed to our neural network:

In [3]:
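# Forward-fill missing values so every row carries its sentence label
# (the label only appears on the first row of each sentence)
data_fillna = data.ffill(axis=0)
# Group by sentence and collect each column into a list
data_group = data_fillna.groupby(['Sentence #'], as_index=False
                                 )[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x: list(x))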
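
The forward fill matters because only the first row of each sentence carries a `Sentence #` label, as the preview above shows. Here is a rough pure-Python sketch of what the fill-then-group step does, using hypothetical toy rows rather than the real file:

```python
# Each row is (sentence_label, word, tag); None stands for a missing label.
rows = [
    ("Sentence: 1", "Thousands", "O"),
    (None, "of", "O"),
    (None, "demonstrators", "O"),
    ("Sentence: 2", "They", "O"),
    (None, "marched", "O"),
]

sentences = {}  # sentence label -> {"Word": [...], "Tag": [...]}
last_label = None
for label, word, tag in rows:
    # Forward fill: reuse the most recent label when this row has none.
    if label is not None:
        last_label = label
    group = sentences.setdefault(last_label, {"Word": [], "Tag": []})
    group["Word"].append(word)
    group["Tag"].append(tag)

print(sentences["Sentence: 1"]["Word"])
print(sentences["Sentence: 2"]["Word"])
```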


I will now divide the data into training, validation, and test sets. I am going to create a function that pads and splits the data, because LSTM layers only accept sequences of the same length. Each sentence, represented as a list of integer indices, must therefore be padded to the same length:
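
As a minimal illustration of post-padding, the sketch below pads variable-length index lists to the length of the longest one. The helper `pad_post` and the toy values are hypothetical; the notebook itself relies on Keras `pad_sequences` for this step.

```python
def pad_post(sequences, pad_value):
    """Right-pad every sequence with pad_value up to the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (maxlen - len(seq)) for seq in sequences]

toy_sentences = [[4, 7, 1], [2], [9, 3]]
padded = pad_post(toy_sentences, pad_value=10)
print(padded)
```

The padding value should be an index that no real token uses, so that padded positions cannot be mistaken for a word.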

In [4]:
from sklearn.model_selection import train_test_split
from keras.utils import pad_sequences
from keras.utils import to_categorical

def get_pad_train_test_val(data_group, data):

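    # Vocabulary size and number of distinct tags
    n_token = len(list(set(data['Word'].to_list())))
    n_tag   = len(list(set(data['Tag'].to_list())))

    # Pad tokens (X var)
    tokens = data_group['Word_idx'].tolist()
    maxlen = max([len(s) for s in tokens])
    # Pad with n_token, an index no real word uses (word indices run from 0 to n_token - 1);
    # the Embedding layer's input_dim of n_token + 1 leaves room for it
    pad_tokens = pad_sequences(tokens, maxlen=maxlen, dtype='int32', padding='post', value=n_token)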

    # Pad Tags (y var) and convert it into one hot encoding
    tags = data_group['Tag_idx'].tolist()
    pad_tags = pad_sequences(tags, maxlen=maxlen,dtype='int32', padding='post', value=tag2idx["O"])
    n_tags = len(tag2idx)
    pad_tags = [to_categorical(i, num_classes=n_tags) for i in pad_tags]

    # Split train, test and validation set
    tokens_, test_tokens, tags_, test_tags = train_test_split(pad_tokens, pad_tags, test_size=0.1, train_size=0.9, random_state=2023)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(tokens_, tags_,test_size=0.25,train_size=0.75,random_state=2023)

    print(
        'train_tokens length: ', len(train_tokens),
        '\ntest_tokens length: ', len(test_tokens),
        '\ntest_tags: ', len(test_tags),
        '\nval_tokens: ', len(val_tokens),
        '\nval_tags: ', len(val_tags),
    )
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags

train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group, data)

train_tokens length:  32372 
test_tokens length:  4796 
test_tags:  4796 
val_tokens:  10791 
val_tags:  10791
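
Because the second split is applied to the 90% that remains after the first, the effective proportions are about 67.5% train, 22.5% validation, and 10% test. The arithmetic below reproduces the printed sizes. It assumes that `train_test_split` rounds the held-out count up, and it takes the total number of sentences from the sum of the three sizes printed above.

```python
import math

n_sentences = 32372 + 10791 + 4796  # total, summed from the printed sizes

# First split: 10% held out for testing.
n_test = math.ceil(0.1 * n_sentences)
n_rest = n_sentences - n_test

# Second split: 25% of the remainder held out for validation.
n_val = math.ceil(0.25 * n_rest)
n_train = n_rest - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_sentences, 3), round(n_val / n_sentences, 3), round(n_test / n_sentences, 3))
```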


### Training a Neural Network for NER

I will now define and train the neural network. Let’s start by importing the packages we need and fixing the random seeds. Next, I’ll set the dimensions the layers need: the embedding input dimension (vocabulary size plus one), the embedding output dimension (also used as the number of LSTM units), the maximum sentence length, and the number of tags:

In [5]:
import numpy as np
import tensorflow
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from numpy.random import seed

seed(1)
tensorflow.random.set_seed(2)

input_dim = len(list(set(data['Word'].to_list())))+1
output_dim = 64
input_length = max([len(s) for s in data_group['Word_idx'].tolist()])
n_tags = len(tag2idx)

Now I will create a helper function that builds and compiles the model and prints a summary of each layer:

In [6]:
def get_bilstm_model():
    model = Sequential()

    # Add Embedding layer
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))

    # Add bidirectional LSTM
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode='concat'))

    # Add LSTM
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

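    # Add TimeDistributed layer; softmax gives a probability distribution over tags
    # for each token, which categorical cross-entropy expects
    model.add(TimeDistributed(Dense(n_tags, activation="softmax")))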

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
    model.summary()

    return model
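
Categorical cross-entropy assumes that the scores for each token form a probability distribution over the tags, which is what a softmax produces. The following sketch, using hypothetical scores for three tags, contrasts softmax and ReLU outputs:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def relu(scores):
    """Clamp negative scores to zero; the result is not normalized."""
    return [max(0.0, s) for s in scores]

raw_scores = [2.0, -1.0, 0.5]  # hypothetical per-tag scores for one token
probs = softmax(raw_scores)
clamped = relu(raw_scores)

print(sum(probs))    # softmax outputs sum to 1 (up to floating-point error)
print(sum(clamped))  # ReLU outputs generally do not
```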

Now I will create a function that trains our model for two epochs, one at a time, and records the training loss after each:

In [8]:
def train_model(X, y, model):
    loss = list()
    for i in range(2):
        # fit model for one epoch on this sequence
        hist = model.fit(X, y, batch_size=64, verbose=1, epochs=1, validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss

results = pd.DataFrame()
model_bilstm = get_bilstm_model()
results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm)



Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 104, 64)           2251456   
                                                                 
 bidirectional_1 (Bidirecti  (None, 104, 128)          66048     
 onal)                                                           
                                                                 
 lstm_3 (LSTM)               (None, 104, 64)           49408     
                                                                 
 time_distributed_1 (TimeDi  (None, 104, 17)           1105      
 stributed)                                                      
                                                                 
Total params: 2368017 (9.03 MB)
Trainable params: 2368017 (9.03 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
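
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and reads the embedding input size of 35,179 off the embedding row (2,251,456 / 64). The helper `lstm_params` is hypothetical.

```python
def lstm_params(input_size, units):
    """Parameters of one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_size + units) + units)

input_dim, output_dim, n_tags = 35179, 64, 17

embedding = input_dim * output_dim
bilstm = 2 * lstm_params(output_dim, output_dim)  # forward + backward directions
lstm = lstm_params(2 * output_dim, output_dim)    # input is the concatenated 128-dim output
dense = output_dim * n_tags + n_tags              # weights + biases

total = embedding + bilstm + lstm + dense
print(embedding, bilstm, lstm, dense, total)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model's size.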


### Testing the Named Entity Recognition Model

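Now, I will use the spaCy library in Python to see named entity recognition in action. Note that the code below runs spaCy's pretrained `en_core_web_sm` pipeline rather than the BiLSTM trained above. I will pass in a few lines about myself, and let’s see what we get after running the code: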

In [9]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
text = nlp('Hi, My name is Aman Kharwal \n I am from India \n I want to work with Google \n Steve Jobs is My Inspiration')
displacy.render(text, style = 'ent', jupyter=True)
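
To inspect the BiLSTM trained earlier, its per-token output probabilities would need to be mapped back to tag names with a lookup like `idx2tag`. A minimal sketch of that decoding step follows. The probabilities, the three-tag `toy_idx2tag` mapping, and the helper `decode_tags` are made-up stand-ins (in practice the probabilities would come from the model's `predict` output), and padded positions are dropped using the true sentence length.

```python
def decode_tags(probabilities, idx2tag, length):
    """Pick the most likely tag per token, ignoring padded positions."""
    tags = []
    for token_probs in probabilities[:length]:
        # Index of the highest probability for this token (argmax).
        best_idx = max(range(len(token_probs)), key=lambda i: token_probs[i])
        tags.append(idx2tag[best_idx])
    return tags

toy_idx2tag = {0: "O", 1: "B-per", 2: "I-per"}  # hypothetical tag mapping
toy_probs = [
    [0.1, 0.8, 0.1],    # token 1
    [0.2, 0.1, 0.7],    # token 2
    [0.9, 0.05, 0.05],  # token 3
    [0.9, 0.05, 0.05],  # padding position, ignored below
]
predicted = decode_tags(toy_probs, toy_idx2tag, length=3)
print(predicted)
```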