## Named Entity Recognition using Bidirectional LSTMs

In this notebook we train a bidirectional LSTM model for Named Entity Recognition on a Kaggle dataset.

Dataset from Kaggle: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus

Source website: https://www.depends-on-the-definition.com/introduction-named-entity-recognition-python/

In [41]:
import pandas as pd
import numpy as np
import string
import re
from collections import Counter
import itertools

In [42]:
data = pd.read_csv("dataset/hate_speech.tsv", sep='\t',header=None)

In [43]:
data.columns = ["text", "label"]

In [44]:
data.dropna(inplace=True)

In [45]:
data.reset_index(inplace=True)

In [46]:
data.head()

   index  text                                               label
0      0  Knowing ki Vikas kitna samjhata hai Priyanka a...  no
1      1  I am Muhajir .. Aur mere lye sab se Pehly Paki...  no
2      2  Doctor sab sahi me ke PhD (in hate politics) ...   no
3      3  Poore Desh me Patel OBC me aate Hain sirf gujr...  no
4      4  Sarkar banne ke bad Hindu hit me ek bhi faisla...  yes


In [47]:
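def clean(text):
    """Remove punctuation and digits, lowercase, and trim surrounding whitespace."""
    # text = re.sub(r"http\S+", "", text)  # URL removal (disabled)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.digits))
    # Strip last so that removed digits leave no leading or trailing spaces
    text = text.strip()
    return text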

In [48]:
data["text"] = data["text"].apply(clean)

In [49]:
data = pd.get_dummies(data, prefix=['label'], columns=['label'])

In [50]:
data = data.ffill()  # forward-fill; fillna(method="ffill") is deprecated in recent pandas

In [51]:
data.tail(5)

      index  text                                               label_no  label_yes
4573   4574  ye attankwadi indian agent hai jo terrorism ph...  1         0
4574   4575  bola na terrorism ko support karna band karoge...  1         0
4575   4576  lagta hai aap ne movie dekhi hai which is writ...  1         0
4576   4577  tum log terrorism ko support karna band kardo ...  1         0
4577   4578  mujhe pehele se hi pata tha so sallu fans ke b...  0         1


In [52]:
def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Build vocabulary
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    
#     print("Vocabulary: ", len(vocabulary), len(vocabulary_inv))
    return [vocabulary, vocabulary_inv]
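
As a quick check of `build_vocab`, the sketch below runs it on two toy sentences and encodes one of them as indices. The texts are made up for illustration, and splitting the cleaned text on whitespace is an assumption, since the notebook does not show its tokenization step.

```python
from collections import Counter
import itertools


def build_vocab(sentences):
    """Same logic as the cell above: the most frequent word gets index 0."""
    word_counts = Counter(itertools.chain(*sentences))
    vocabulary_inv = [word for word, _ in word_counts.most_common()]
    vocabulary = {word: i for i, word in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]


# Toy, already-cleaned texts (hypothetical examples, not from the dataset).
texts = ["ye movie acchi hai", "ye hai"]
# Assumption: tokens are obtained by splitting on whitespace.
toy_sentences = [t.split() for t in texts]

vocabulary, vocabulary_inv = build_vocab(toy_sentences)
# Encode the first sentence as a list of vocabulary indices.
encoded = [vocabulary[w] for w in toy_sentences[0]]

print(vocabulary_inv)
print(encoded)
```

Words that occur more often receive smaller indices, and `vocabulary_inv` lets you map an encoded sentence back to its tokens.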

In [15]:
max_len = 50
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

In [16]:
word2idx["Obama"]

26043

In [17]:
tag2idx["B-geo"]

4

Now we map the sentences to sequences of numbers and then pad each sequence to a fixed length.

In [18]:
from keras.preprocessing.sequence import pad_sequences
X = [[word2idx[w[0]] for w in s] for s in sentences]

Using TensorFlow backend.


In [19]:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words - 1)

In [20]:
y = [[tag2idx[w[2]] for w in s] for s in sentences]

In [21]:
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
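
The mapping and padding steps can be sketched without Keras on toy data. The two sentences, the (word, POS, tag) triple layout, and the trailing "ENDPAD" entry in `words` are assumptions for illustration, since the notebook does not show how `words`, `tags`, or `sentences` are built. `pad_post` is a hypothetical helper that mimics `pad_sequences(..., padding="post")`, including its default of dropping items from the front of over-long sequences.

```python
# Toy sentences as lists of (word, POS, tag) triples (made up for illustration).
toy_sentences = [
    [("Obama", "NNP", "B-per"), ("visited", "VBD", "O"), ("Paris", "NNP", "B-geo")],
    [("He", "PRP", "O"), ("left", "VBD", "O")],
]

# Assumption: "ENDPAD" is appended last, so index n_words - 1 is the padding word.
words = sorted({w for s in toy_sentences for w, _, _ in s}) + ["ENDPAD"]
tags = sorted({t for s in toy_sentences for _, _, t in s})
n_words = len(words)

word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}


def pad_post(seqs, maxlen, value):
    """Hypothetical stand-in for pad_sequences(padding="post")."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]  # keep the last maxlen items if the sequence is too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


toy_max_len = 4  # small value for the toy example; the notebook uses 50
X_toy = pad_post([[word2idx[w[0]] for w in s] for s in toy_sentences],
                 toy_max_len, n_words - 1)
y_toy = pad_post([[tag2idx[w[2]] for w in s] for s in toy_sentences],
                 toy_max_len, tag2idx["O"])

print(X_toy)
print(y_toy)
```

Every row ends up with the same length: word sequences are filled with the padding word index and tag sequences with the index of the "O" tag.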

In [22]:
from keras.utils import to_categorical

To train the network, we also need to convert the labels y to categorical (one-hot) form.

In [1]:
y = [to_categorical(i, num_classes=n_tags) for i in y]

NameError: name 'y' is not defined
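
For intuition, one-hot encoding turns each tag index into a vector with a single 1 at that index, and taking the argmax recovers the index again. The sketch below is a pure-Python stand-in for `to_categorical`, using made-up tag indices and an assumed `n_tags` of 3; `one_hot` is a hypothetical helper, not a Keras function.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with a 1 in that index's column."""
    return [[1 if c == i else 0 for c in range(num_classes)] for i in indices]


n_tags = 3              # assumed number of tag classes for the toy example
y_toy = [[2, 0, 1, 1]]  # one padded sentence of tag indices (made up)

y_cat = [one_hot(seq, n_tags) for seq in y_toy]
print(y_cat[0])

# Taking the argmax over the class axis recovers the original indices.
decoded = [[row.index(max(row)) for row in seq] for seq in y_cat]
print(decoded)
```

The same argmax step is what turns the model's softmax output back into tag indices at prediction time.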

In [24]:
from sklearn.model_selection import train_test_split

In [34]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)

Now we can fit an LSTM network with an embedding layer. We use the Keras functional API here, as it is better suited to complex architectures.

In [35]:
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

In [36]:
input = Input(shape=(max_len,))

model = Embedding(input_dim=n_words, output_dim=50, input_length=max_len)(input)
model = Dropout(0.5)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.5, recurrent_dropout=0.25))(model)
model = Dropout(0.5)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.5, recurrent_dropout=0.25))(model)
model = Dropout(0.5)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.5, recurrent_dropout=0.25))(model)

out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer
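
`TimeDistributed(Dense(...))` applies the same dense layer to every timestep, so each token gets its own probability distribution over the tags. A minimal pure-Python sketch of that idea follows; the weights, bias, and inputs are made-up numbers, and `time_distributed_dense` is a hypothetical helper rather than the Keras implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_distributed_dense(sequence, weights, bias):
    """Apply one shared dense layer plus softmax to every timestep vector.

    weights holds one weight vector per output class.
    """
    outputs = []
    for vec in sequence:
        scores = [sum(x * w for x, w in zip(vec, class_w)) + b
                  for class_w, b in zip(weights, bias)]
        outputs.append(softmax(scores))
    return outputs


# Toy input: 3 timesteps with 2 features each (stands in for the BiLSTM output).
sequence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [[2.0, 0.0], [0.0, 2.0]]  # 2 tag classes (made-up values)
bias = [0.0, 0.0]

probs = time_distributed_dense(sequence, weights, bias)
pred = [p.index(max(p)) for p in probs]  # most likely tag index per timestep
print(pred)
```

Because the weights are shared across timesteps, the output has one row of tag probabilities per token, which is why the labels need one one-hot vector per token as well.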

In [37]:
model = Model(input, out)

In [38]:
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

In [None]:
history = model.fit(X_tr, np.array(y_tr), batch_size=64, epochs=5, validation_split=0.3, verbose=1)

Train on 23499 samples, validate on 10072 samples
Epoch 1/5

In [None]:
hist = pd.DataFrame(history.history)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12,12))
plt.plot(hist["acc"])
plt.plot(hist["val_acc"])
plt.show()

In [None]:
plt.figure(figsize=(12,12))
plt.plot(hist["loss"])
plt.plot(hist["val_loss"])
plt.show()

Now let us look at some predictions.

In [None]:
i = 1005
p = model.predict(np.array([X_te[i]]))
p = np.argmax(p, axis=-1)
true = np.argmax(y_te[i], -1)
print("{:15}: {:5} {}".format("Word", "True", "Pred"))  # header uses the same layout as the rows
for w, t, pred in zip(X_te[i], true, p[0]):
    print("{:15}: {:5} {}".format(words[w], tags[t], tags[pred]))