# Named Entity Recognition using Deep Learning

Instead of using the traditional NLP approach for NER (which is similar to POS tagging), we will use a deep learning approach, building a simple model with TensorFlow and Keras.

We will use different embeddings (word2vec, doc2vec, GloVe), network layers and parameters in order to compare performance.

Inspired by the well-known blog post "Embed, encode, attend, predict" (https://explosion.ai/blog/deep-learning-formula-nlp), the high-level structure of the network is the following:

1. One-hot encoding
2. Word embeddings
3. LSTM layer
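
To make the relation between the first two steps concrete, here is a minimal sketch in plain Python. The vocabulary and the two-dimensional vectors are made up for illustration and are not part of the notebook's data. It shows that multiplying a one-hot vector by an embedding table selects a single row, which is why an embedding layer can work directly on integer word indices.

```python
# Toy vocabulary and embedding table; all values are made up for illustration.
vocab = {"<pad>": 0, "john": 1, "lives": 2, "in": 3, "paris": 4}
embedding_table = [
    [0.0, 0.0],   # <pad>
    [0.1, 0.9],   # john
    [0.4, 0.2],   # lives
    [0.3, 0.3],   # in
    [0.8, 0.7],   # paris
]

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]

def embed_via_matmul(index):
    """Multiply the one-hot row vector by the embedding table."""
    hot = one_hot(index, len(embedding_table))
    dim = len(embedding_table[0])
    return [sum(hot[r] * embedding_table[r][c] for r in range(len(hot)))
            for c in range(dim)]

def embed_via_lookup(index):
    """Select the row directly, which gives the same result as the product."""
    return list(embedding_table[index])

sentence = ["john", "lives", "in", "paris"]
ids = [vocab[w] for w in sentence]
vectors = [embed_via_lookup(i) for i in ids]
```

The sequence of vectors in `vectors` is the kind of input the LSTM layer in step 3 would consume, one vector per time step.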




## Loading Data

The first step is to load a simple dataset to build a small network and try out the concept.

In [1]:
from nltk.corpus import ConllCorpusReader
from keras.preprocessing.text import Tokenizer

# Raw strings keep the backslashes in the Windows path and in the regex literal
my_corpus = ConllCorpusReader(r'C:\Data', r'.*\.txt', columntypes=('words', 'pos', 'chunk'), encoding="utf-8")
my_corpus.iob_words('2.txt')

# iob_words yields (word, POS tag, IOB tag) triples; only the words and IOB tags are used for NER
all_data = [((word, tag), iob) for word, tag, iob in my_corpus.iob_words('2.txt')]
all_words = my_corpus.words('2.txt')
all_tags = [iob for word, tag, iob in my_corpus.iob_words('2.txt')]
all_sents = list(my_corpus.iob_sents('2.txt'))

sentences = list()
tags = list()
for sent in all_sents:
    word_reader = [word for word, tag, iob in sent]
    tag_reader = [iob for word, tag, iob in sent]
    sentences.append(' '.join(word_reader))
    tags.append(tag_reader)

print(len(all_sents), "sentences in corpus")
print(len(sentences), "sentences in corpus")
print(len(tags), "sentences in corpus")
print(len(all_words), "words in corpus")
print(len(all_tags), "IOB tags in corpus")

Using TensorFlow backend.


58 sentences in corpus
58 sentences in corpus
58 sentences in corpus
1491 words in corpus
1491 IOB tags in corpus
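
For readers without the corpus at hand, the following sketch shows the kind of three-column, blank-line-separated layout that a CoNLL-style reader expects, and how it splits into words and IOB tags per sentence. The sample text and the helper `read_conll` are hypothetical and only approximate what `ConllCorpusReader` does.

```python
# Hypothetical three-column sample (word, POS tag, IOB tag); not taken from the corpus.
sample = """\
John NNP B-PER
lives VBZ O
in IN O
Paris NNP B-LOC
. . O

Mary NNP B-PER
works VBZ O
. . O
"""

def read_conll(text):
    """Split CoNLL-style text into sentences of (word, pos, iob) triples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences

parsed = read_conll(sample)
words_per_sentence = [[word for word, _, _ in sent] for sent in parsed]
tags_per_sentence = [[iob for _, _, iob in sent] for sent in parsed]
```

Each inner list in `words_per_sentence` lines up position by position with the matching list in `tags_per_sentence`, which is the property the rest of the notebook relies on.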


Now, we need to encode and pad the text sentences.

In [2]:
from keras.preprocessing.sequence import pad_sequences

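# Note: Tokenizer lowercases the text and, with its default filters, drops punctuation,
# so an encoded sentence can be shorter than its list of IOB tags
t = Tokenizer()
t.fit_on_texts(sentences)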
word_index = t.word_index
vocab_size = len(word_index) + 1
print(vocab_size, "vocab size in corpus")
encoded_docs = t.texts_to_sequences(sentences)

max_sentlen = max([len(x) for x in encoded_docs])
padded_sentences = pad_sequences(encoded_docs, maxlen=max_sentlen, padding='post')
print(padded_sentences.shape)

520 vocab size in corpus
(58, 72)
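
The encoding and padding step can be illustrated without Keras. The sketch below is a simplified stand-in, assuming whitespace tokenization and lowercasing only (it does not filter punctuation the way the Keras `Tokenizer` does by default). The helper names and the two toy sentences are made up.

```python
def build_word_index(texts):
    """Assign indices starting at 1, most frequent word first; 0 is kept for padding."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending frequency; the sort is stable, so ties keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

def encode(texts, word_index):
    """Replace every word by its integer index."""
    return [[word_index[w] for w in text.lower().split()] for text in texts]

def pad_post(sequences, maxlen, value=0):
    """Append `value` to the end of each sequence until it has `maxlen` items."""
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]

toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_word_index = build_word_index(toy_texts)
toy_encoded = encode(toy_texts, toy_word_index)
toy_max_len = max(len(seq) for seq in toy_encoded)
toy_padded = pad_post(toy_encoded, toy_max_len)
```

Because index 0 is never assigned to a word, the padded positions can later be masked out by the embedding layer.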


We also need to encode the IOB label of each word as a one-hot vector and pad the label sequences to the same length as the sentences.

In [43]:
import numpy as np

# collect the unique labels in first-seen order
unique_list = list(dict.fromkeys(all_tags))
max_label = len(unique_list)

# map each label to an index starting at 1
label_index = {label: (index + 1) for index, label in enumerate(unique_list)}

def onehot_label(length, hot_index):
    # vector of `length` zeros with a single 1 at position `hot_index`
    return [1 if i == hot_index else 0 for i in range(length)]

#print(onehot_label(15,2))

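# encode labels
ll = list()
for s in tags:
    l = list()
    for tag in s:  # `tag`, not `t`, so the Tokenizer named `t` is not overwritten
        # label_index starts at 1, so subtract 1 to get a position in 0..max_label-1;
        # without this the last label would be encoded as an all-zero vector
        l.append(onehot_label(max_label, label_index[tag] - 1))
    ll.append(l)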
    
#pad labels
padded_labels = pad_sequences(ll, maxlen=max_sentlen, padding='post')
print(padded_labels.shape)


(58, 72, 11)
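
An alternative way to encode the labels, sketched below in plain Python, is to reserve class 0 explicitly for padding so that every position, padded or not, has a valid one-hot vector. This is an assumption about one possible design, not the encoding used in the cell above, and the helper names and toy tags are made up.

```python
def build_label_index(tag_sequences):
    """Map each distinct tag to 1..K in first-seen order; 0 stays free for padding."""
    label_index = {}
    for seq in tag_sequences:
        for tag in seq:
            if tag not in label_index:
                label_index[tag] = len(label_index) + 1
    return label_index

def encode_labels(tag_sequences, label_index, maxlen):
    """One-hot encode with K + 1 classes and pad each sequence with class 0."""
    num_classes = len(label_index) + 1
    encoded = []
    for seq in tag_sequences:
        ids = [label_index[tag] for tag in seq] + [0] * (maxlen - len(seq))
        encoded.append([[1 if c == i else 0 for c in range(num_classes)] for i in ids])
    return encoded

toy_tags = [["B-PER", "O", "O", "B-LOC"], ["O", "B-PER"]]
toy_index = build_label_index(toy_tags)
toy_onehot = encode_labels(toy_tags, toy_index, maxlen=4)
```

With this layout the last dimension has one more entry than there are distinct tags, so the output layer of a model trained on it would need `len(label_index) + 1` units.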


## Using GloVe embeddings

GloVe embeddings come in several versions. First we will use the smaller one, trained on 6 billion tokens, [available here](https://nlp.stanford.edu/projects/glove/).

In [44]:
from numpy import asarray

# load the whole embedding into memory
embeddings_index = dict()
# raw string so the backslashes in the Windows path are not treated as escapes
with open(r'C:\data\GloVe\6B\glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
embedding_size = len(embeddings_index['the'])

Loaded 400000 word vectors.
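
The GloVe text files store one token per line followed by its vector components, separated by spaces. The sketch below parses that layout from an in-memory string so it can be tried without downloading anything. The two sample lines and the helper `load_vectors` are made up for illustration.

```python
import io

# Two made-up lines in the GloVe text layout: a token followed by its float components.
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.5\n")

def load_vectors(handle):
    """Read `token v1 v2 ...` lines into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

vectors = load_vectors(fake_glove)
dims = {len(v) for v in vectors.values()}  # all vectors should share one dimension
```

Checking that `dims` contains a single value is a cheap way to catch a truncated or corrupted file before building the embedding matrix.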


In [45]:
from numpy import zeros

sent_size = len(all_sents)

# create a weight matrix for words in training docs (row 0 is left as zeros for padding)
embedding_matrix = zeros((vocab_size, embedding_size))
print(embedding_matrix.shape)
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print(word_index['that'])   # index the tokenizer assigned to the word 'that'
print(embedding_matrix[3])  # embedding row of the word that has index 3

(520, 100)
10
[-0.071953    0.23127     0.023731   -0.50638002  0.33923     0.19589999
 -0.32943001  0.18364    -0.18057001  0.28963     0.20448001 -0.54960001
  0.27399001  0.58327001  0.20468    -0.49228001  0.19973999 -0.070237
 -0.88049001  0.29484999  0.14071    -0.1009      0.99449003  0.36973
  0.44554001  0.28997999 -0.1376     -0.56365001 -0.029365   -0.4122
 -0.25268999  0.63181001 -0.44767001  0.24363001 -0.10813     0.25163999
  0.46967     0.37549999 -0.23613    -0.14128999 -0.44536999 -0.65736997
 -0.042421   -0.28636    -0.28810999  0.063766    0.20281    -0.53542
  0.41306999 -0.59722    -0.38613999  0.19389001 -0.17809001  1.66180003
 -0.011819   -2.3736999   0.058427   -0.26980001  1.2823      0.81924999
 -0.22322001  0.72931999 -0.053211    0.43507001  0.85010999 -0.42934999
  0.92663997  0.39050999  1.05850005 -0.24561    -0.18265    -0.53280002
  0.059518   -0.66018999  0.18990999  0.28836    -0.24339999  0.52784002
 -0.65762001 -0.14081     1.04910004  0.51340002 
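
Words that are missing from the pretrained vectors keep an all-zero row in the matrix, so it can be useful to measure how much of the vocabulary is covered. The following is a minimal sketch in plain Python with a made-up vocabulary and made-up vectors. `build_matrix` is a hypothetical helper, not part of the notebook.

```python
def build_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)
        else:
            matrix[i] = list(vec)
    return matrix, missing

toy_index = {"the": 1, "cat": 2, "zyxwv": 3}
toy_vectors = {"the": [0.1, -0.2], "cat": [0.5, 0.0]}
matrix, missing = build_matrix(toy_index, toy_vectors, dim=2)
coverage = 1 - len(missing) / len(toy_index)  # fraction of vocabulary with a pretrained vector
```

A low coverage value would suggest that many words are effectively invisible to the model unless the embedding layer is allowed to train.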

In [49]:
from keras.models import Sequential
from keras.layers import LSTM, Activation, Embedding

# the LSTM output is fed straight into the softmax, so its size must equal
# the number of label classes in padded_labels (last dimension, 11 here)
hidden_size = 11
out_size = len(label_index) + 1  # only needed by the commented-out dense layer below

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, weights=[embedding_matrix], input_length=max_sentlen, mask_zero=True))
model.add(LSTM(hidden_size, return_sequences=True))  
#model.add(TimeDistributedDense(out_size))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

print(padded_sentences.shape)
print(padded_labels.shape)

batch_size = 32
model.fit(padded_sentences, padded_labels, batch_size=batch_size, epochs=10)#, validation_data=(X_test, y_test))

(58, 72)
(58, 72, 11)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2926a18aac8>
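
Once a model like this is trained, calling `model.predict` on the padded sentences would return one probability vector per token position. The sketch below shows, in plain Python, how such output could be turned back into label names while skipping padded positions. The probabilities, token ids and the `index_to_label` mapping are all made up, and the real mapping depends on how the one-hot labels were built.

```python
def decode(probabilities, token_ids, index_to_label):
    """Pick the most probable class per token and skip padded positions (token id 0)."""
    decoded = []
    for probs_row, ids_row in zip(probabilities, token_ids):
        labels = []
        for probs, token_id in zip(probs_row, ids_row):
            if token_id == 0:
                continue  # padding position, no real word here
            best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
            labels.append(index_to_label[best])
        decoded.append(labels)
    return decoded

# Made-up values for one sentence of three words, padded to length four.
index_to_label = {0: "O", 1: "B-PER", 2: "B-LOC"}
token_ids = [[7, 3, 9, 0]]
probabilities = [[[0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.2, 0.1, 0.7], [0.3, 0.3, 0.4]]]
predicted = decode(probabilities, token_ids, index_to_label)
```

Comparing `predicted` with the gold tags sentence by sentence would be one simple way to evaluate the different embeddings and layer settings mentioned in the introduction.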