### Introduction
**Natural Language Processing (NLP)** is the field concerned with building applications and services that can understand human language. Some practical examples of NLP are:
1. Speech recognition, for example Google voice search.
    * Most smartphones today ship with voice recognition, which relies on NLP to understand what is being said.
    * Many laptop operating systems also include built-in speech recognition, such as Cortana.
2. Understanding content.
3. Analyzing sentiment, and so on.

 

### Named Entity Recognition (NER)

A named entity is any real-world object that has a name, such as a person, place, organization, or product. For example: **My name is Waqas and I am a Data Science trainer**. In this sentence:
* The **name** `Waqas`,
* The **field or subject** `Data Science`, and
* The **profession** `trainer` are named entities.

In machine learning, **Named Entity Recognition (NER)** is the **Natural Language Processing (NLP)** task of identifying **named entities** in a piece of text.
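In practice, NER output is often expressed as one tag per token. The sketch below assumes a BIO-style scheme (`B-` begins an entity, `I-` continues it, `O` marks a non-entity token). The tag names `B-per`, `I-per`, and `B-geo` and the helper `extract_entities` are illustrative assumptions, not something defined in this notebook.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) ends the entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical tagged sentence, for illustration only.
tokens = ["Waqas", "Ali", "is", "from", "Pakistan"]
tags = ["B-per", "I-per", "O", "O", "B-geo"]
entities = extract_entities(tokens, tags)
print(entities)
```

Grouping tokens this way turns a per-token tag sequence into the entity spans a reader usually cares about.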

In [8]:
# from google.colab import files
# uploaded = files.upload()
import pandas as pd
data = pd.read_csv('ner_dataset.csv', encoding= 'unicode_escape')
data.head()

| Unnamed: 0 | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | | of | IN | O |
| 2 | | demonstrators | NNS | O |
| 3 | | have | VBP | O |
| 4 | | marched | VBN | O |


In the data, each row holds a single word. The `Word` column will serve as our feature `X`, and the `Tag` column on the right will serve as our label `Y`.

### Data Preparation for Neural Networks

We need to modify the data so that it can easily be fed into a neural network. We will start by extracting the mappings that are required to train the neural network:

In [9]:
def get_dict_map(data, token_or_tag):
    tok2idx = {}
    idx2tok = {}
    
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))
    
    idx2tok = {idx:tok for  idx, tok in enumerate(vocab)}
    tok2idx = {tok:idx for  idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok

token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')
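To make the mapping step concrete, here is a minimal sketch on a made-up word list. It sorts the vocabulary before numbering it, which is an assumption added here so that the indices are reproducible; the names `toy_words`, `toy_token2idx`, and `toy_idx2token` are hypothetical.

```python
# A tiny made-up token column; "of" appears twice on purpose.
toy_words = ["Thousands", "of", "demonstrators", "of"]

# Sorting fixes the order; iterating over a plain set of strings
# may yield a different order from one run to the next.
toy_vocab = sorted(set(toy_words))

toy_token2idx = {tok: idx for idx, tok in enumerate(toy_vocab)}
toy_idx2token = {idx: tok for tok, idx in toy_token2idx.items()}

# Replace every word by its integer index.
toy_word_idx = [toy_token2idx[word] for word in toy_words]
print(toy_token2idx)
print(toy_word_idx)
```

Both dictionaries are kept because the forward mapping is needed to encode the input, while the reverse mapping is needed to turn model output back into readable words or tags.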

Now we will transform the columns to extract the sequential data for our neural network:

In [10]:
data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
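# Forward-fill the sentence label into the empty rows
# (fillna(method='ffill') is deprecated in recent pandas versions)
data_fillna = data.ffill(axis=0)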

In [11]:
# Group by sentence and collect each column into a list.
# The column selection must be a list (double brackets), not a tuple.
data_group = data_fillna.groupby(['Sentence #'],
                                 as_index=False)[['Word', 'POS', 'Tag',
                                                  'Word_idx', 'Tag_idx']].agg(lambda x: list(x))
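The forward-fill and group-by steps can be checked on a small made-up frame. This is only a sketch: the frame `toy` and its contents are hypothetical, and it keeps just the `Word` and `Tag` columns.

```python
import pandas as pd

# Made-up data in the same layout: only the first word of a sentence is labelled.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "marched"],
    "Tag": ["O", "O", "O", "O", "O"],
})

# Forward-fill copies the last seen sentence label down into the empty rows.
toy_filled = toy.ffill()

# Collect the words and tags of each sentence into Python lists.
toy_group = (
    toy_filled.groupby("Sentence #", as_index=False)[["Word", "Tag"]]
    .agg(lambda x: list(x))
)
print(toy_group)
```

After this step, each row of the grouped frame represents one sentence, which is the unit the network consumes.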


Now we will split the data into training, validation, and test sets. We will wrap this in a function that also pads the data, because `LSTM` layers accept only sequences of the same length. Every sentence, which is now a list of integers, must therefore be padded to the same length:

In [12]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
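As a rough illustration of what post-padding does, the plain-Python sketch below right-pads every sequence up to the length of the longest one. The helper `pad_post` and the index values are hypothetical. It is meant to mirror the effect of `pad_sequences(..., padding='post')` when `maxlen` equals the longest sentence, not to replace it.

```python
def pad_post(sequences, pad_value):
    """Right-pad every sequence with pad_value up to the longest length."""
    maxlen = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (maxlen - len(seq)) for seq in sequences]


# Two "sentences" already converted to word indices (made-up values).
toy_sentences = [[4, 1, 7], [2, 5]]

# 9 stands in for a padding index that no real word uses.
padded = pad_post(toy_sentences, pad_value=9)
print(padded)
```

Choosing a padding index that does not collide with a real word keeps the padded positions distinguishable from actual tokens.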

In [13]:
def get_pad_train_test_val(data_group, data):

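    # Vocabulary size and number of distinct tags
    n_token = len(list(set(data['Word'].to_list())))
    n_tag = len(list(set(data['Tag'].to_list())))

    # Pad tokens (X var). Word indices run from 0 to n_token - 1, so
    # n_token is used as the padding index to avoid clashing with a real word
    # (the Embedding layer is sized n_token + 1, which leaves room for it).
    tokens = data_group['Word_idx'].tolist()
    maxlen = max([len(s) for s in tokens])
    pad_tokens = pad_sequences(tokens, maxlen=maxlen, dtype='int32', padding='post',
                               value=n_token)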

    #Pad Tags (y var) and convert it into one hot encoding
    tags = data_group['Tag_idx'].tolist()
    pad_tags = pad_sequences(tags, maxlen=maxlen, dtype='int32', padding='post', value= tag2idx["O"])
    n_tags = len(tag2idx)
    pad_tags = [to_categorical(i, num_classes=n_tags) for i in pad_tags]
    
    #Split train, test and validation set
    tokens_, test_tokens, tags_, test_tags = train_test_split(pad_tokens, pad_tags, test_size=0.1, train_size=0.9, random_state=2020)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(tokens_,tags_,test_size = 0.25,train_size =0.75, random_state=2020)

    print(
        'train_tokens length:', len(train_tokens),
        '\ntrain_tags length:', len(train_tags),
        '\ntest_tokens length:', len(test_tokens),
        '\ntest_tags:', len(test_tags),
        '\nval_tokens:', len(val_tokens),
        '\nval_tags:', len(val_tags),
    )
    
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags

In [14]:
train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group, data)

train_tokens length: 32372 
train_tags length: 32372 
test_tokens length: 4796 
test_tags: 4796 
val_tokens: 10791 
val_tags: 10791
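The printed lengths follow from applying the two splits in sequence. The sketch below reproduces them under the assumption that the test share is rounded up and the train share is rounded down. The helper `split_sizes` is hypothetical and only does the arithmetic; it does not shuffle or split any data.

```python
import math


def split_sizes(n_samples, test_size, train_size):
    """Sizes for a fractional split: test rounded up, train rounded down."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor(train_size * n_samples)
    return n_train, n_test


# Total sentence count implied by the printed lengths.
n_sentences = 32372 + 10791 + 4796

# First split: 90% kept, 10% test. Second split: 75% train, 25% validation.
n_rest, n_test = split_sizes(n_sentences, test_size=0.1, train_size=0.9)
n_train, n_val = split_sizes(n_rest, test_size=0.25, train_size=0.75)
print(n_train, n_val, n_test)
```

Overall, roughly 67.5% of the sentences end up in the training set, 22.5% in the validation set, and 10% in the test set.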


Next, we will define and train the neural network architecture for our model. Let’s start by importing all the packages we need:

In [15]:
import numpy as np
import tensorflow
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from tensorflow.keras.utils import plot_model
from numpy.random import seed
seed(1)
tensorflow.random.set_seed(2)

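The cell below sets the dimensions that the layers need: the vocabulary size for the `Embedding` layer (`input_dim`), the embedding size (`output_dim`), the maximum sentence length (`input_length`), and the number of tags (`n_tags`):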

In [16]:
input_dim = len(list(set(data['Word'].to_list())))+1
output_dim = 64
input_length = max([len(s) for s in data_group['Word_idx'].tolist()])
n_tags = len(tag2idx)

Now we will create a helper function that builds and compiles the model and prints a summary of every layer.

In [17]:
def get_bilstm_lstm_model():
    model = Sequential()

    # Add Embedding layer
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))

    # Add bidirectional LSTM
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode = 'concat'))

    # Add LSTM
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

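    # Add TimeDistributed layer; softmax yields a probability distribution
    # over the tags at each time step, as categorical_crossentropy expects
    model.add(TimeDistributed(Dense(n_tags, activation="softmax")))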

    #Optimiser 
    # adam = k.optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999)

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

In [18]:
# Helper function to train Named Entity Recognition model:

def train_model(X, y, model):
    loss = list()
    for i in range(25):
        # fit model for one epoch on this sequence
        hist = model.fit(X, y, batch_size=1000, verbose=1, epochs=1, validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss

In [19]:
# Driver code:

results = pd.DataFrame()

model_bilstm_lstm = get_bilstm_lstm_model()

plot_model(model_bilstm_lstm)

results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm_lstm)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 104, 64)           2251456   
_________________________________________________________________
bidirectional (Bidirectional (None, 104, 128)          66048     
_________________________________________________________________
lstm_1 (LSTM)                (None, 104, 64)           49408     
_________________________________________________________________
time_distributed (TimeDistri (None, 104, 17)           1105      
Total params: 2,368,017
Trainable params: 2,368,017
Non-trainable params: 0
_________________________________________________________________
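The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical, and the layer sizes are read off the summary.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary index.
    return input_dim * output_dim


def lstm_params(input_size, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_size + units) * units + units)


def dense_params(input_size, units):
    # Weight matrix plus one bias per output unit.
    return input_size * units + units


vocab_rows = 2251456 // 64  # embedding rows implied by the summary
counts = {
    "embedding": embedding_params(vocab_rows, 64),
    "bidirectional": 2 * lstm_params(64, 64),  # forward + backward LSTM
    "lstm_1": lstm_params(128, 64),            # input is the 128-wide concat
    "time_distributed": dense_params(64, 17),
}
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the size of this model.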


Training runs for `25` epochs (one `fit` call per epoch inside `train_model`), and the training loss of each epoch is stored in `results`.
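Once a model like this is trained, its per-token output still has to be mapped back to tag names. The sketch below shows that decoding step with NumPy only. The array `toy_pred` is a made-up stand-in for what `model.predict` would return, and `toy_idx2tag` is a hypothetical mapping (the notebook builds the real one as `idx2tag`).

```python
import numpy as np

# Hypothetical index-to-tag mapping, for illustration only.
toy_idx2tag = {0: "O", 1: "B-per", 2: "B-geo"}

# Stand-in for model output: one sentence, four time steps, three tag scores each.
toy_pred = np.array([[
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.1, 0.7],
    [0.9, 0.05, 0.05],
]])

# argmax over the last axis picks the highest-scoring tag per token.
tag_ids = np.argmax(toy_pred, axis=-1)
decoded = [[toy_idx2tag[int(i)] for i in sentence] for sentence in tag_ids]
print(decoded)
```

For real sentences, the positions that correspond to padding would normally be dropped before the tags are reported.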

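Now let’s run NER on a piece of text. Note that the cells below use spaCy’s pretrained `en_core_web_sm` pipeline rather than the BiLSTM model trained above: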

In [7]:
# !pip install spacy --user

In [20]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
text = nlp('Hi, My name is Waqas Ali \n I am from Pakistan \n I want to work with Google \n Steve Jobs is My Inspiration')
displacy.render(text, style = 'ent', jupyter=True)

In [26]:
text = nlp('I love to visit Kashmir. Its my heart\n No one can deny Imran Khan and Pakistan\'s Army')
displacy.render(text, style = 'ent', jupyter=True)
