# Named Entity Recognition using Deep Learning

Instead of the traditional NLP approach to NER (which is similar to POS tagging), we will use a deep learning approach, building a simple model with TensorFlow and Keras.

We will use different embeddings (word2vec, doc2vec, GloVe), network layers and parameters in order to compare performance.

Inspired by the well-known blog post "Embed, encode, attend, predict" (https://explosion.ai/blog/deep-learning-formula-nlp), the high-level network structure is the following:

1. One-hot encoding
2. Word embeddings
3. LSTM layer




## Loading Data

The first step is to load a simple dataset to build a small network and try out the concept.

Using **ConllCorpusReader** from NLTK simplifies data consumption and transformation, as it can parse the data format with multiple tags (POS tags, IOB tags) and create the multiple versions of the data we need for the transformations.

In [1]:
from nltk.corpus import ConllCorpusReader
# Raw strings keep the backslashes in the Windows path and the regex intact
my_corpus = ConllCorpusReader(r'C:\Data', r'.*\.txt', columntypes=('words', 'pos', 'chunk'), encoding="utf-8")

# Read the data in different formats (all words, by sentence, all tags, tags by sentence)
all_data = [((word,tag),iob) for word,tag,iob in my_corpus.iob_words()]
all_words = my_corpus.words()
all_tags = [iob for word,tag,iob in my_corpus.iob_words()]
all_sents = [sent for sent in my_corpus.iob_sents()]

sentences = list()
tags = list()
for sent in all_sents:
    word_reader = [word for word, tag, iob in sent]
    tag_reader = [iob for word, tag, iob in sent]
    sentences.append(' '.join(word_reader))
    tags.append(tag_reader)

# Couple of control print outs to check all dimensions match correctly
print(len(all_sents), "sentences in corpus")
print(len(sentences), "sentences in corpus")
print(len(tags), "sentences in corpus")
print(len(all_words), "words in corpus")
print(len(all_tags), "IOB tags in corpus")

70554 sentences in corpus
70554 sentences in corpus
70554 sentences in corpus
1685626 words in corpus
1685626 IOB tags in corpus
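
To make the shapes of `sentences` and `tags` concrete, here is a minimal sketch that reproduces the same transformation on a toy corpus without NLTK. The sample text and the helper `read_conll` are hypothetical and only cover the simplest case (three whitespace-separated columns, blank line between sentences); the real reader handles more.

```python
# Toy CoNLL-style data: one token per line (word, POS tag, IOB tag),
# with a blank line between sentences. The content is made up.
sample = """\
John NNP B-PER
lives VBZ O
in IN O
Paris NNP B-LOC

He PRP O
works VBZ O
"""

def read_conll(text):
    """Split CoNLL-style text into sentences of (word, pos, iob) triples."""
    sents, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sents.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sents.append(current)
    return sents

toy_sents = read_conll(sample)
# The same two parallel structures built in the cell above
toy_sentences = [' '.join(word for word, pos, iob in sent) for sent in toy_sents]
toy_tags = [[iob for word, pos, iob in sent] for sent in toy_sents]

print(toy_sentences)
print(toy_tags)
```

Each entry of `toy_sentences` is one space-joined sentence, and the entry at the same position in `toy_tags` holds one IOB tag per word.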


As part of pre-processing we need to encode and pad the text sentences.

**Encoding**: transforming all words into integers. We identify all unique words and map each one to an integer. The IOB tags are handled the same way: we identify the unique tags in the source files and map them to integers.

**Padding**: To make all data items the same size, we identify the longest sentence in the set and we pad all other sentences.

In [2]:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
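# Tokenize the sentences and build the word-to-integer index.
# Note: the default `filters` strip punctuation, so a sentence can end up with
# fewer tokens than it has IOB tags; pass filters='' to keep them aligned.
t = Tokenizer()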
t.fit_on_texts(sentences)
word_index = t.word_index
vocab_size = len(word_index) + 1
print(vocab_size, "vocab size in corpus")
encoded_docs = t.texts_to_sequences(sentences)

max_sentlen = max([len(x) for x in encoded_docs])
padded_sentences = pad_sequences(encoded_docs, maxlen=max_sentlen, padding='post')
print(padded_sentences.shape)
print(max_sentlen, "max sentence length ")

Using TensorFlow backend.


44797 vocab size in corpus
(70554, 432)
432 max sentence length 
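
As an illustration of what encoding and post-padding do, here is a small pure-Python sketch on toy sentences. The helpers `build_word_index` and `encode_and_pad` are hypothetical; unlike the Keras `Tokenizer`, which orders its index by word frequency, this sketch numbers words in order of first appearance.

```python
def build_word_index(texts):
    """Assign each distinct lowercased word an integer, starting at 1."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1  # 0 is reserved for padding
    return index

def encode_and_pad(texts, index):
    """Map words to integers, then right-pad with 0 to the longest sentence."""
    encoded = [[index[w] for w in text.lower().split()] for text in texts]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]

toy_texts = ["John lives in Paris", "Paris rocks"]
toy_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_index) + 1  # +1 for the padding index 0
toy_padded = encode_and_pad(toy_texts, toy_index)

print(toy_index)
print(toy_padded)
```

Reserving index 0 for padding is why the vocabulary size is computed as `len(word_index) + 1` in the cell above.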


We also need to get the labels (IOB tags) for the text. This requires some additional pre-processing.

**One-hot encoding**: The network predicts one class per token, so each label is represented as a one-hot vector. Each label is first mapped to an integer (out of all the unique possible labels) and then turned into an array of "bits" in which only the position for that label is 1 (on) and every other position is 0 (off):

*For example:*

O: outside tag   - 1 \[1 0 0 0\]

I: inside tag    - 2 \[0 1 0 0\]

B: beginning tag - 3 \[0 0 1 0\]

etc.

**Padding**: Padding is also required here, to the length of the longest sentence, to make both the sentences and tags matrices of matching shapes.
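
The two steps can be sketched on a single toy sentence as follows. The helper `one_hot`, the three-tag mapping and the width of four bits are assumptions that mirror the example vectors above (the fourth tag is left unspecified there); padding rows are all zeros.

```python
def one_hot(label_id, n_labels):
    """Return n_labels bits with only position label_id - 1 set to 1."""
    return [1 if i == label_id - 1 else 0 for i in range(n_labels)]

# Hypothetical label ids, numbered from 1 as in the example above
toy_label_ids = {'O': 1, 'I': 2, 'B': 3}
n_labels = 4      # width of the example vectors above
max_len = 5       # pretend the longest sentence has 5 tokens

toy_sentence_tags = ['B', 'I', 'O']
vectors = [one_hot(toy_label_ids[tag], n_labels) for tag in toy_sentence_tags]

# Post-pad with all-zero rows up to max_len
padded_vectors = vectors + [[0] * n_labels for _ in range(max_len - len(vectors))]

for row in padded_vectors:
    print(row)
```

After this step every sentence contributes a `max_len` by `n_labels` block, so the label array lines up with the padded sentence array along its first two dimensions.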

In [3]:
import numpy as np

# create a list of unique labels (IOB tags), keeping first-seen order
unique_list = list(dict.fromkeys(all_tags))
# count how many unique tags
max_label = len(unique_list)

# create a hash of unique IOB tags
label_index = {label: (index + 1) for index, label in enumerate(unique_list)}

# create a one-hot list of specified length with specified position turned on
def onehot_label(length, hot_index):
    onehot = list()
    ind = 0
    
    for i in range(length):
        # set value to 1 for hot_index
        if ind == hot_index:
            onehot.append(1)
        else:
            # everything else is 0
            onehot.append(0)
        ind = ind + 1
    return onehot

# Encode all IOB tags into the matrix required for the network
encoded_labels = list()
# iterate through all the sentences
for sent_tags in tags:
    sent_labels = list()
    # iterate through all the tags
    for tag in sent_tags:
        # label_index is 1-based while one-hot positions are 0-based;
        # without the -1 the last label would become an all-zero vector
        sent_labels.append(onehot_label(max_label, label_index[tag] - 1))
    encoded_labels.append(sent_labels)

# IOB labels padding, to the max sentence length
padded_labels = pad_sequences(encoded_labels, maxlen=max_sentlen, padding='post')
print("Shape of padded tags:", padded_labels.shape)

MemoryError: 
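
A rough size estimate suggests why this step is memory-hungry on the full corpus. The sketch below uses the sentence count and maximum length printed earlier; the figure of 9 distinct tags is an assumption taken from the last dimension of the tag shape reported later in this notebook. It only counts the final padded array, not the nested Python lists built before it, so the real peak is likely higher.

```python
n_sentences = 70554   # sentences in corpus (printed above)
max_sentlen = 432     # longest sentence (printed above)
n_labels = 9          # assumption: number of distinct IOB tags

# Number of cells in the dense (sentences, max_sentlen, n_labels) array
elements = n_sentences * max_sentlen * n_labels

def size_in_gib(bytes_per_element):
    """Size of the dense label array for a given element width."""
    return elements * bytes_per_element / 2**30

bytes_int32 = elements * 4   # pad_sequences defaults to dtype='int32'
bytes_int8 = elements * 1    # 0/1 values also fit in a single byte

print(elements, "elements")
print(round(size_in_gib(4), 2), "GiB as int32")
print(round(size_in_gib(1), 2), "GiB as int8")
```

Under these assumptions the padded array alone is about 1 GiB with the default `int32`, and a quarter of that with `dtype='int8'`; truncating very long sentences would reduce it further.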

## Using GloVe embeddings

Instead of training our own embeddings, this step loads pre-trained embeddings.

The GloVe embedding data comes in several versions; here we use the smaller one, trained on 6 billion tokens, [available here](https://nlp.stanford.edu/projects/glove/).

Different GloVe embeddings are trained on different datasets (e.g., Wikipedia, web crawls, Twitter). In this case we chose embeddings trained on Wikipedia, as we think their more general domain is better suited to Named Entity Recognition tagging.

The same Wikipedia embeddings also come in several dimensionalities (50, 100, 200 and 300). Here we use the 100-dimension file.

In [4]:
from numpy import asarray
from numpy import zeros

# load the embedding file into memory
embeddings_index = dict()
# the context manager closes the file even if parsing fails
with open('C:\\GloVe\\6B\\glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        # each line is a word followed by its vector components
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))
# find the embedding vector dimensions size from the most common word in english: 'the'
# http://www.dictionary.com/e/commonwords/
embedding_size = len(embeddings_index['the'])

sent_size = len(all_sents)

# Translate our words with the embedding records:
# create a weight matrix for words in training docs
# based on the embeddings
embedding_matrix = zeros((vocab_size, embedding_size))
print(embedding_matrix.shape)
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.
(519, 100)
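
Words that are missing from GloVe keep an all-zero row in the embedding matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below shows the idea on toy data; the two-dimensional vectors, the word `zorgon` and the variable names are made up for illustration.

```python
# Hypothetical pre-trained vectors (2 dimensions instead of 100)
toy_embeddings = {'the': [0.1, 0.2], 'paris': [0.3, 0.4]}
# Hypothetical corpus vocabulary, indexed from 1 (0 is the padding row)
toy_word_index = {'the': 1, 'paris': 2, 'zorgon': 3}
dim = 2

# One row per index, including row 0 for padding
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is None:
        # No pre-trained vector: the row stays at zero
        missing.append(word)
    else:
        toy_matrix[i] = list(vector)

coverage = 1 - len(missing) / len(toy_word_index)
print("Missing words:", missing)
print("Coverage: %.0f%%" % (coverage * 100))
```

A low coverage on the real corpus would mean many tokens reach the LSTM as zero vectors, at least until training updates those rows.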


## Deep Model Creation using Keras

Once all the data has been transformed into the shapes the model requires, we can create the DNN model using Keras.

The model has just four layers:

1. **Input: Embedding.** Here is where we include the embedding vectors as the weights.
2. **Hidden: LSTM.**
3. **Hidden: Dropout.**
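4. **Output: Dense.** Uses sigmoid as the activation function and is wrapped in `TimeDistributed` so it is applied to every token. (Softmax is the more common pairing with the categorical cross-entropy loss used here.)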

What distinguishes this model from one that learns its own embeddings is that we pass the embedding matrix as the initial weights of the Embedding layer, so the model starts from the GloVe vectors.

In [5]:
from keras.models import Sequential
# All layers are importable from keras.layers; the unused Activation import
# and the duplicate numpy import are dropped
from keras.layers import Embedding, LSTM, Dropout, TimeDistributed, Dense

hidden_size = max_label
out_size = len(label_index) + 1

model = Sequential()

# Here we add the GloVe embeddings as the weights parameter
model.add(Embedding(vocab_size, embedding_size, weights=[embedding_matrix], input_length=max_sentlen, mask_zero=True))
model.add(LSTM(hidden_size, return_sequences=True))  
model.add(Dropout(0.1))
model.add(TimeDistributed(Dense(max_label, activation='sigmoid')))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print("Shape of sentences:", padded_sentences.shape)
print("Shape of IOB tags:", padded_labels.shape)
print(model.summary())

Shape of sentences: (61, 51)
Shape of IOB tags: (61, 51, 9)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 51, 100)           51900     
_________________________________________________________________
lstm_1 (LSTM)                (None, 51, 9)             3960      
_________________________________________________________________
dropout_1 (Dropout)          (None, 51, 9)             0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 51, 9)             90        
=================================================================
Total params: 55,950
Trainable params: 55,950
Non-trainable params: 0
_________________________________________________________________
None
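
The parameter counts in the summary can be checked by hand with the standard formulas for these layer types. The helper functions below are hypothetical names for that arithmetic; the sizes are read from the outputs above (a 519-row embedding matrix, 100 dimensions, 9 LSTM units and 9 output labels).

```python
def embedding_params(vocab_size, embedding_size):
    # One vector of embedding_size weights per vocabulary entry
    return vocab_size * embedding_size

def lstm_params(input_size, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_size + units + 1) * units

def dense_params(input_size, units):
    # One weight per input plus a bias, for every output unit
    return (input_size + 1) * units

vocab_size, embedding_size = 519, 100
hidden_size = 9   # LSTM units (max_label in the cell above)
n_labels = 9      # Dense output units

counts = {
    'embedding': embedding_params(vocab_size, embedding_size),
    'lstm': lstm_params(embedding_size, hidden_size),
    'dense': dense_params(hidden_size, n_labels),
}
total = sum(counts.values())
print(counts, "total:", total)
```

Dropout has no weights, which is why it contributes 0. Almost all of the parameters sit in the embedding layer, and the summary lists them as trainable, so the GloVe vectors are fine-tuned during training rather than frozen.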


## Split into train / test sets

To evaluate model performance, we split the data into training and test sets. The split percentage can be adjusted; here it is set to the traditional 70%-30%. We also fix the random seed to get a consistent split across runs.

In [6]:
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
n_samples = len(all_sents)
train_pct = 0.7
test_pct = 1 - train_pct
meaning_of_life = 42
sentences_train, sentences_test, tags_train, tags_test = train_test_split(padded_sentences, padded_labels, test_size=int(test_pct*n_samples), train_size=int(train_pct*n_samples), random_state=meaning_of_life)

print("Training sentences shape:", sentences_train.shape)
print("Training tags shape:",tags_train.shape)
print("Test sentences shape:", sentences_test.shape)
print("Test tags shape:", tags_test.shape)

Training sentences shape: (42, 51)
Training tags shape: (42, 51, 9)
Test sentences shape: (18, 51)
Test tags shape: (18, 51, 9)
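
The shapes above add up to 60 samples, while the sentence array printed earlier had 61 rows. The sketch below reproduces the size arithmetic from the cell above and shows where the remaining sample goes; computing the test size as the remainder is one possible alternative, not something the original code does.

```python
n_samples = 61            # rows in the sentence array printed earlier
train_pct = 0.7
test_pct = 1 - train_pct  # slightly above 0.3 in floating point

# Same computation as in the cell above: both sizes are truncated
n_train = int(train_pct * n_samples)
n_test = int(test_pct * n_samples)
unused = n_samples - n_train - n_test

# Alternative: give the test set whatever the training set does not use
n_test_remainder = n_samples - n_train

print("train:", n_train, "test:", n_test, "unused:", unused)
print("test size as remainder:", n_test_remainder)
```

Because both sizes are truncated independently, one sample ends up in neither set. Passing only `test_size=0.3` to `train_test_split` would also let scikit-learn assign every sample.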


## Model Performance Evaluation

The loop below calls `fit` with 1 to 9 epochs in turn and records the test accuracy after each call, which we then plot. Batch size is fixed at 20.


In [7]:
# Model performance evaluation for multiple epochs
import pandas as pd
import plotly
from plotly.graph_objs import Scatter, Layout
from sklearn.metrics import confusion_matrix, accuracy_score, precision_recall_fscore_support

batch_size = 20
epochs = list()
accuracies = list()

for n_epochs in range(1, 10):
    epochs.append(n_epochs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(sentences_train, tags_train, batch_size=batch_size, epochs=n_epochs, verbose=0)
    loss, accuracy = model.evaluate(sentences_test, tags_test)
    accuracies.append(accuracy)
    print('Epochs:', n_epochs, '- Accuracy', accuracy * 100, "%")
    
plotly.offline.init_notebook_mode(connected=True)

plotly.offline.iplot({
    "data": [Scatter(x=epochs, y=accuracies)],
    "layout": Layout(title="Accuracy vs Training Epochs")
})

Epochs: 1 - Accuracy 17.741936445236206 %
Epochs: 2 - Accuracy 57.52689838409424 %
Epochs: 3 - Accuracy 89.51615691184998 %
Epochs: 4 - Accuracy 93.2796061038971 %
Epochs: 5 - Accuracy 93.54841709136963 %
Epochs: 6 - Accuracy 93.54841709136963 %
Epochs: 7 - Accuracy 93.54841709136963 %
Epochs: 8 - Accuracy 93.54841709136963 %
Epochs: 9 - Accuracy 93.54841709136963 %
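
One detail of the loop above is worth working through: `model.compile` does not re-initialise the weights, and each `fit` call continues from wherever the previous one stopped. Under that reading, the accuracy printed for `n_epochs` reflects all the epochs trained so far rather than `n_epochs` alone. A small sketch of the arithmetic (no Keras needed; the variable names are illustrative):

```python
# Total number of epochs the model has seen after each loop iteration,
# assuming the weights carry over between fit() calls
cumulative_epochs = []
total = 0
for n_epochs in range(1, 10):
    total += n_epochs
    cumulative_epochs.append(total)

for label, seen in zip(range(1, 10), cumulative_epochs):
    print("Printed as", label, "epochs -> trained for", seen, "epochs in total")
```

To measure accuracy after exactly `n` epochs, two options are to rebuild the model inside the loop, or to call `fit` with `epochs=1` on each iteration and use the iteration number as the x-axis.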
