# LAB3 - NERC with CRF

In this lab we combine a bidirectional LSTM model with a CRF model. The so-called LSTM-CRF is a state-of-the-art approach to named entity recognition.

We are going to use the implementation provided by the keras-contrib package, which contains useful extensions to the official keras package.

Credits:
* https://www.depends-on-the-definition.com/sequence-tagging-lstm-crf/

In [1]:
import pandas as pd
import numpy as np
# adapt to your local path
data = pd.read_csv("../../../data/NERC_datasets/entity-annotated-corpus/ner_dataset.csv", encoding="latin1")
# forward-fill missing values so that every row carries its sentence id
data = data.ffill()
data.tail(10)


In [6]:
words = list(set(data["Word"].values))
words.append("ENDPAD")
n_words = len(words); n_words

35179

In [7]:
tags = list(set(data["Tag"].values))
n_tags = len(tags); n_tags

17

In [8]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            # no sentence with this number is left
            return None

In [9]:
getter = SentenceGetter(data)
sent = getter.get_next()
print(sent)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]
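To make the grouping step concrete, here is a minimal pure-Python sketch of what the forward fill and `SentenceGetter` do together. It assumes, as the forward fill in the loading cell suggests, that the sentence id is present only on the first row of each sentence. The toy rows and the helper name `group_sentences` are illustrative and not part of the notebook.

```python
# Toy rows in (sentence_id, word, pos, tag) layout; None marks a missing id.
rows = [
    ("Sentence: 1", "Thousands", "NNS", "O"),
    (None, "marched", "VBN", "O"),
    (None, "London", "NNP", "B-geo"),
    ("Sentence: 2", "Iraq", "NNP", "B-geo"),
    (None, ".", ".", "O"),
]

def group_sentences(rows):
    """Forward-fill the sentence id, then collect (word, pos, tag) triples per sentence."""
    sentences = {}
    current_id = None
    for sent_id, word, pos, tag in rows:
        if sent_id is not None:
            current_id = sent_id  # a new sentence starts on this row
        sentences.setdefault(current_id, []).append((word, pos, tag))
    return list(sentences.values())

toy_sentences = group_sentences(rows)
print(toy_sentences[0])
```

Each element of `toy_sentences` has the same shape as the entries of `getter.sentences`: a list of `(word, POS, tag)` triples.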


In [10]:
sentences = getter.sentences

## Prepare the data

Now we introduce dictionaries of words and tags.


In [11]:
max_len = 75
word2idx = {w: i + 1 for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}
word2idx["Obama"]
tag2idx["B-geo"]

16

Now we map the sentences to sequences of numbers and then pad the sequences. Note that we increased the index of the words by one to use zero as the padding value. This is done because we want to use the mask_zero parameter of the embedding layer to ignore inputs with value zero.

In [12]:
from keras.preprocessing.sequence import pad_sequences
X = [[word2idx[w[0]] for w in s] for s in sentences]



In [13]:
# pad with 0 so that mask_zero=True in the embedding layer ignores the padding
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=0)
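The index mapping and zero padding described above can be sketched without Keras. The helper `encode_and_pad` and the toy vocabulary are hypothetical. The sketch pads on the right and, for sequences that are too long, keeps the last `max_len` tokens, mirroring the `truncating="pre"` default of `pad_sequences`.

```python
def encode_and_pad(sentences, word2idx, max_len, pad_value=0):
    """Map each word to its index, then truncate or right-pad to max_len (max_len > 0)."""
    encoded = []
    for sent in sentences:
        ids = [word2idx[w] for w in sent][-max_len:]   # keep the last max_len tokens
        ids += [pad_value] * (max_len - len(ids))      # pad on the right ("post")
        encoded.append(ids)
    return encoded

toy_vocab = ["London", "Iraq", "the", "war"]
# Shift indices by one so that 0 stays free for padding.
toy_word2idx = {w: i + 1 for i, w in enumerate(toy_vocab)}

toy_X = encode_and_pad([["the", "war"], ["London", "Iraq", "the", "war"]],
                       toy_word2idx, max_len=3)
print(toy_X)  # [[3, 4, 0], [2, 3, 4]]
```

Because no real word maps to 0, a masked embedding layer can tell padding apart from actual tokens.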

And we need to do the same for our tag sequence.


In [15]:
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])

For training the network we also need to convert the labels y to categorical (one-hot) form.


In [16]:
from keras.utils import to_categorical
y = [to_categorical(i, num_classes=n_tags) for i in y]
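As a rough illustration of what the categorical conversion produces, the following sketch one-hot encodes a short tag-index sequence in plain Python. The helper `one_hot` is hypothetical and only mimics the shape of the result for a single sequence; it is not the Keras implementation.

```python
def one_hot(indices, num_classes):
    """Turn a list of class indices into a list of one-hot rows."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows

toy_tag_ids = [0, 2, 1]
toy_y = one_hot(toy_tag_ids, num_classes=3)
print(toy_y)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

Applied to every padded sentence, this yields one vector of length `n_tags` per token, which is the target format the network is trained on here.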

We split the data into a training set and a test set.


In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
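The split above is random, so the reported numbers change from run to run unless a seed is fixed (in scikit-learn, via the `random_state` argument of `train_test_split`). The following pure-Python sketch shows the idea of a seeded, aligned split. The helper `split_train_test` is hypothetical, and its rounding of the test size may differ from scikit-learn's.

```python
import random

def split_train_test(X, y, test_size=0.1, seed=0):
    """Shuffle indices with a fixed seed and split X and y consistently."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

toy_inputs = list(range(20))
toy_targets = [x * 10 for x in toy_inputs]
tr_X, te_X, tr_y, te_y = split_train_test(toy_inputs, toy_targets, test_size=0.1, seed=42)
print(len(tr_X), len(te_X))  # 18 2
```

Using the same index permutation for inputs and labels keeps every sentence paired with its own tag sequence.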

## Set up the LSTM-CRF

Now we can define and fit an LSTM-CRF network with an embedding layer.

In [20]:
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras_contrib.layers import CRF
#https://github.com/keras-team/keras-contrib

In [21]:
input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words + 1, output_dim=20,
                  input_length=max_len, mask_zero=True)(input)  # 20-dim embedding
model = Bidirectional(LSTM(units=50, return_sequences=True,
                           recurrent_dropout=0.1))(model)  # variational biLSTM
model = TimeDistributed(Dense(50, activation="relu"))(model)  # a dense layer as suggested by neuralNer
crf = CRF(n_tags)  # CRF layer
out = crf(model)  # output

In [22]:
model = Model(input, out)

In [26]:
model.compile(optimizer="rmsprop", loss=crf.loss_function, metrics=[crf.accuracy])
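To give an intuition for what a CRF output layer adds on top of per-token scores, here is a generic Viterbi decoding sketch in plain Python. It is not the keras-contrib code. The emission and transition scores are made-up toy values, with a strongly negative score standing in for a learned penalty on the invalid transition `O` to `I-geo`.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # best previous tag for ending in tag j at this position
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # follow the backpointers from the best final tag
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

toy_tag_names = ["O", "B-geo", "I-geo"]
toy_emissions = [[2.0, 0.5, 0.0], [0.0, 1.0, 1.2]]
toy_transitions = [[0.0, 0.0, -1e4],   # from O: I-geo is effectively forbidden
                   [0.0, 0.0, 0.5],    # from B-geo
                   [0.0, 0.0, 0.5]]    # from I-geo

greedy = [max(range(3), key=lambda j: e[j]) for e in toy_emissions]
decoded = viterbi_decode(toy_emissions, toy_transitions)
print([toy_tag_names[i] for i in greedy])   # ['O', 'I-geo']
print([toy_tag_names[i] for i in decoded])  # ['O', 'B-geo']
```

Picking the best tag per token independently yields the inconsistent sequence `O, I-geo`, while decoding over the whole sequence with transition scores yields `O, B-geo`.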



In [27]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 75)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 75, 20)            703600    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 75, 100)           28400     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 75, 50)            5050      
_________________________________________________________________
crf_1 (CRF)                  (None, 75, 17)            1190      
Total params: 738,240
Trainable params: 738,240
Non-trainable params: 0
_________________________________________________________________


In [28]:
history = model.fit(X_tr, np.array(y_tr), batch_size=32, epochs=5,
                    validation_split=0.1, verbose=1)

Train on 38846 samples, validate on 4317 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [29]:
hist = pd.DataFrame(history.history)

In [30]:
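import matplotlib.pyplot as plt
plt.style.use("ggplot")
plt.figure(figsize=(12,12))
# the name of the accuracy key depends on the keras / keras-contrib version, so look it up
acc_key = [k for k in hist.columns if "acc" in k and not k.startswith("val_")][0]
plt.plot(hist[acc_key], label=acc_key)
plt.plot(hist["val_" + acc_key], label="val_" + acc_key)
plt.legend()
plt.show()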

## Evaluation

Now we can evaluate our model systematically at the entity level with the seqeval package. You can find the details in this post; here we just apply it.

In [32]:
# pip install seqeval
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

In [33]:
test_pred = model.predict(X_te, verbose=1)



In [34]:
idx2tag = {i: w for w, i in tag2idx.items()}

def pred2label(pred):
    out = []
    for pred_i in pred:
        out_i = []
        for p in pred_i:
            p_i = np.argmax(p)
            out_i.append(idx2tag[p_i].replace("PAD", "O"))
        out.append(out_i)
    return out
    
pred_labels = pred2label(test_pred)
test_labels = pred2label(y_te)

In [35]:
print("F1-score: {:.1%}".format(f1_score(test_labels, pred_labels)))

F1-score: 82.9%


In [36]:
print(classification_report(test_labels, pred_labels))

           precision    recall  f1-score   support

      per       0.74      0.78      0.76      1636
      geo       0.83      0.89      0.86      3741
      tim       0.90      0.83      0.86      2020
      eve       0.00      0.00      0.00        39
      gpe       0.97      0.93      0.95      1631
      org       0.80      0.64      0.71      2112
      art       0.00      0.00      0.00        35
      nat       0.00      0.00      0.00        23

micro avg       0.84      0.81      0.83     11237
macro avg       0.84      0.81      0.82     11237
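The scores above are computed per entity rather than per token. The following simplified sketch shows the idea: extract entity spans from BIO tags and count a prediction as correct only if type and boundaries both match. The helpers `get_entities` and `entity_f1` are hypothetical stand-ins, not seqeval's implementation, and edge cases such as an `I-` tag without a preceding `B-` may be handled differently there.

```python
def get_entities(tags):
    """Extract (type, start, end) spans from a BIO-tagged sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if not inside:
            if ent_type is not None:
                entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]
    return entities

def entity_f1(true_seqs, pred_seqs):
    """Micro-averaged precision, recall and F1 over exact entity matches."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents, pred_ents = get_entities(true_tags), get_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [["B-per", "I-per", "O", "B-geo"]]
toy_pred = [["B-per", "O", "O", "B-geo"]]
print(entity_f1(toy_true, toy_pred))  # (0.5, 0.5, 0.5)
```

In this toy case three of four tokens are tagged correctly, yet only one of two entities matches exactly, which is why entity-level F1 is a stricter measure than token accuracy.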



Finally, we look at some predictions.

In [37]:
i = 1927
p = model.predict(np.array([X_te[i]]))
p = np.argmax(p, axis=-1)
true = np.argmax(y_te[i], -1)
print("{:15}||{:5}||{}".format("Word", "True", "Pred"))
print(30 * "=")
for w, t, pred in zip(X_te[i], true, p[0]):
    if w != 0:
        print("{:15}: {:5} {}".format(words[w-1], tags[t], tags[pred]))

Word           ||True ||Pred
Tens           : O     O
of             : O     O
thousands      : O     O
of             : O     O
flag-waving    : O     O
Lebanese       : B-gpe B-gpe
mourners       : O     O
packed         : O     O
central        : O     O
Beirut         : B-geo B-geo
Thursday       : B-tim B-tim
for            : O     O
the            : O     O
funeral        : O     O
of             : O     O
outspoken      : O     O
Syrian         : B-gpe B-gpe
critic         : O     O
Gebran         : B-per B-per
Tueni          : I-per I-per
.              : O     O

This looks pretty good, and it did not require any feature engineering. The power of the CRF is not really visible here, but on a dataset with more complicated named entities its benefit should be more pronounced.

## Inference with the LSTM-CRF

In [38]:
test_sentence = ["Hawking", "was", "a", "Fellow", "of", "the", "Royal", "Society", ",", "a", "lifetime", "member",
                 "of", "the", "Pontifical", "Academy", "of", "Sciences", ",", "and", "a", "recipient", "of",
                 "the", "Presidential", "Medal", "of", "Freedom", ",", "the", "highest", "civilian", "award",
                 "in", "the", "United", "States", "."]

Now we transform every word to its integer index. Note that we map unknown words to zero, which is also the padding index. Normally you would want to add an UNKNOWN token to your vocabulary. Then you cut the vocabulary on which you train the model and replace all uncommon words by the UNKNOWN token. We haven't done this for simplicity.

In [39]:
x_test_sent = pad_sequences(sequences=[[word2idx.get(w, 0) for w in test_sentence]],
                            padding="post", value=0, maxlen=max_len)
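The UNKNOWN-token approach mentioned above could look roughly like the following sketch. The reserved token names `PAD` and `UNK`, the cutoff `min_count`, and the helpers `build_vocab` and `encode` are assumptions made for illustration; the notebook itself does not implement this.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep words seen at least min_count times; reserve 0 for padding, 1 for unknowns."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {"PAD": 0, "UNK": 1}  # hypothetical reserved tokens
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map words to indices, sending rare or unseen words to the UNK index."""
    return [vocab.get(w, vocab["UNK"]) for w in sentence]

toy_corpus = [["the", "war", "in", "Iraq"], ["the", "war", "ended"]]
unk_vocab = build_vocab(toy_corpus)
print(encode(["the", "war", "in", "Hawking"], unk_vocab))  # [2, 3, 1, 1]
```

With such a vocabulary, rare training words and unseen test words share one trainable embedding instead of colliding with the padding index.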

And now we can predict with the model and see what we got.


In [40]:
p = model.predict(np.array([x_test_sent[0]]))
p = np.argmax(p, axis=-1)
print("{:15}||{}".format("Word", "Prediction"))
print(30 * "=")
for w, pred in zip(test_sentence, p[0]):
    print("{:15}: {:5}".format(w, tags[pred]))

Word           ||Prediction
Hawking        : I-eve
was            : O    
a              : O    
Fellow         : O    
of             : O    
the            : O    
Royal          : B-org
Society        : I-org
,              : O    
a              : O    
lifetime       : O    
member         : O    
of             : O    
the            : O    
Pontifical     : B-org
Academy        : I-org
of             : I-org
Sciences       : I-org
,              : O    
and            : O    
a              : O    
recipient      : O    
of             : O    
the            : O    
Presidential   : O    
Medal          : I-eve
of             : O    
Freedom        : B-geo
,              : O    
the            : O    
highest        : O    
civilian       : O    
award          : O    
in             : O    
the            : O    
United         : B-geo
States         : I-geo
.              : O    


References and further reading:

* Huang et al.: Bidirectional LSTM-CRF Models for Sequence Tagging [https://arxiv.org/pdf/1508.01991v1.pdf]
* Ma and Hovy: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF [https://arxiv.org/pdf/1603.01354.pdf]