<div style="text-align: right"><a href="http://ml-school.uni-koeln.de">Virtual Summer School "Deep Learning for
    Language Analysis"</a> <br/><strong>Text Analysis with Deep Learning</strong><br/>Aug 31 — Sep 4, 2020<br/>Nils Reiter<br/><a href="mailto:nils.reiter@uni-koeln.de">nils.reiter@uni-koeln.de</a></div>

# Exercise 3

This exercise is about sequence labeling. A sequence of items (words, in this case) must be tagged with a sequence of labels. In this case, the labels are named entity tags in the BIO scheme.
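
To make the BIO scheme concrete, here is a minimal sketch that reads entity spans off a tag sequence: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The helper `bio_to_spans`, the toy sentence and the tag names are illustrative assumptions and are not taken from the exercise data.

```python
def bio_to_spans(tags):
    """Collect (entity_type, start, end) spans from BIO tags; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # a running span ends at "O", at a new "B-", or at an "I-" of another type
        if tag == "O" or tag.startswith("B-") or tag[2:] != current:
            if current is not None:
                spans.append((current, start, i))
            start, current = None, None
        # "B-" opens a span; a stray "I-" is leniently treated as an opening tag
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


toy_words = ["Angela", "Merkel", "visited", "New", "York", "."]
toy_tags = ["B-per", "I-per", "O", "B-geo", "I-geo", "O"]

for entity_type, start, end in bio_to_spans(toy_tags):
    print(entity_type, toy_words[start:end])
```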

The data we will be using comes from the Groningen Meaning Bank (GMB). Its annotation scheme can be found [here](http://www.let.rug.nl/bjerva/gmb/manual.php). As always, we will first preprocess the data, and then create and train the model.

----

Download the data, if necessary.

In [None]:
! if ! [[ -f data/ner/gmb.csv ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/ner/gmb.csv > data/ner/gmb.csv; fi

Import into a pandas dataframe, and fill missing values. Also, let's look at the head of the table. We also directly encode the strings as integers, using the [numpy-function `np.unique(...)`](https://numpy.org/doc/stable/reference/generated/numpy.unique.html). This will allow us to convert the index numbers back into readable tag strings later on.

For padding (see below), we will be using `_____` as a "word", and `O` as a tag. `_____` needs to be added to the lists of unique words as well.
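
As a small illustration of this encoding, the following sketch applies `np.unique(..., return_inverse=True)` to a toy word list, appends the padding symbol, and decodes the indices again. The toy words and the `toy_` variable names are assumptions made for this example only.

```python
import numpy as np

toy_words = np.array(["the", "cat", "saw", "the", "dog"])

# toy_vocab holds the sorted unique words,
# toy_codes holds one index into toy_vocab per input word
toy_vocab, toy_codes = np.unique(toy_words, return_inverse=True)

# the padding symbol gets the next free index, i.e. the old vocabulary size
toy_pad_idx = len(toy_vocab)
toy_vocab = np.append(toy_vocab, "_____")

# indexing the vocabulary with the codes recovers the original words
toy_decoded = toy_vocab[toy_codes]
print(toy_codes, toy_decoded, toy_pad_idx)
```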

In [None]:
import pandas as pd
import numpy as np

# read in CSV file
data = pd.read_csv("data/ner/gmb.csv",encoding = 'latin1')

# the first column of the file contains the sentence number
# -- but only for the first token of each sentence.
# The following line fills the rows downwards.
# (fillna(method = 'ffill') is deprecated in recent pandas versions)
data = data.ffill()

# create a list of unique words and assign an integer number to it
unique_words, coded_words = np.unique(data["Word"], return_inverse=True)
data["Word_idx"] = coded_words
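# the padding symbol gets the next free index and is appended to the word list
# (list.append() returns None, so we use np.append and keep the result)
EMPTY_WORD_IDX = len(unique_words)
unique_words = np.append(unique_words, "_____")
num_words = len(unique_words)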

# create a list of unique tags and assign an integer number to it
unique_tags, coded_tags = np.unique(data["Tag"], return_inverse=True)
data["Tag_idx"]  = coded_tags
NO_TAG_IDX = unique_tags.tolist().index("O")
num_words_tag = len(unique_tags)

# for verification and inspection, we print out the table so far
data[1:20]

In this step, we convert the table in such a way that we can access individual sentences. The result of the function is a list of lists of tuples: one list per sentence, with each tuple containing the index of the word, its part-of-speech tag and the index of its named entity tag.
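
The shape of the result can be seen on a toy table. This is a minimal sketch of the same grouping idea, assuming the column names used in this notebook. The toy rows are invented, and a list comprehension over the groups is used here in place of `apply`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word_idx": [3, 0, 2, 1, 4],
    "POS": ["NNP", "VBZ", ".", "PRP", "VBZ"],
    "Tag_idx": [0, 1, 1, 1, 1],
})

# the sentence label is only given for the first token, so fill it downwards
toy["Sentence #"] = toy["Sentence #"].ffill()

# one list of (word index, POS tag, tag index) tuples per sentence
toy_sentences = [
    list(zip(group["Word_idx"].tolist(),
             group["POS"].tolist(),
             group["Tag_idx"].tolist()))
    for _, group in toy.groupby("Sentence #", sort=False)
]

print(toy_sentences[0])
```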

In [None]:
def get_sentences(data):
    # collect one (word index, POS tag, tag index) tuple per token of each sentence
    agg_func = lambda s:[(w,p,t) for w,p,t in zip(s["Word_idx"].values.tolist(),
                                                     s["POS"].values.tolist(),
                                                     s["Tag_idx"].values.tolist())]
    grouped = data.groupby("Sentence #").apply(agg_func)
    return [s for s in grouped]


sentences = get_sentences(data)

# print out the first sentence for verification
sentences[0]

In this block, we pad the sequences to the length of the longest sentence.
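
What padding and the subsequent one-hot encoding do to the data can be shown with plain NumPy. The function `pad_post` below is a hypothetical stand-in written for this illustration and is not the Keras implementation. It only covers the case used here: padding at the end, with no sentence longer than `max_len`.

```python
import numpy as np


def pad_post(sequences, max_len, value):
    """Pad integer sequences at the end with `value` up to length `max_len`."""
    padded = np.full((len(sequences), max_len), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_len]
        padded[row, :len(trimmed)] = trimmed
    return padded


toy_x = [[4, 1], [2, 0, 3]]      # word indices, two sentences
toy_y = [[1, 0], [0, 0, 2]]      # tag indices, assuming index 0 stands for "O"

toy_x_padded = pad_post(toy_x, max_len=3, value=5)   # 5 = index of the padding word
toy_y_padded = pad_post(toy_y, max_len=3, value=0)   # padded positions get tag "O"

# one-hot encoding: each tag index becomes a row of the identity matrix
toy_y_onehot = np.eye(3)[toy_y_padded]

print(toy_x_padded)
print(toy_y_onehot.shape)   # (sentences, max_len, number of tags)
```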

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# find the maximum length for the sentences
max_len = max([len(s) for s in sentences])

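# the sentences differ in length, so we keep them as plain lists here
# (recent NumPy versions refuse to build an array from ragged sequences)
x = [ [ w[0] for w in s ] for s in sentences ]
y = [ [ w[2] for w in s ] for s in sentences ]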

# shorter sentences are now padded to same length, using (index of) padding symbol
x = pad_sequences(maxlen = max_len, sequences = x, padding = 'post', value = EMPTY_WORD_IDX)

# we do the same for the y data
y = pad_sequences(maxlen = max_len, sequences = y, padding = 'post', value = NO_TAG_IDX)

# but we also convert the indices to keras categories
y = np.array([to_categorical(i, num_classes = num_words_tag) for i in  y])

Split the data into training and test data.

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.1,random_state=1)

Now we create the model architecture.

Things to try:
- Pretrained embeddings
- Dropout/Regularization
- Bidirectionality
- More dense layers
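
As a starting point for some of these ideas, the following sketch combines a larger embedding, a bidirectional LSTM and dropout. The function name `build_bilstm_tagger` and all hyperparameter values (embedding size, number of units, dropout rate, the `adam` optimizer) are assumptions chosen for illustration and not tuned settings. Pretrained embeddings are not covered here.

```python
from tensorflow.keras import models, layers


def build_bilstm_tagger(num_words, num_tags, max_len,
                        embedding_dim=50, rnn_units=32, dropout=0.2):
    """Sequence tagger: embedding -> bidirectional LSTM -> per-token softmax."""
    model = models.Sequential()
    model.add(layers.Input(shape=(max_len,)))
    model.add(layers.Embedding(input_dim=num_words, output_dim=embedding_dim))
    # the LSTM reads the sentence in both directions; dropout is applied to its inputs
    model.add(layers.Bidirectional(
        layers.LSTM(units=rnn_units, return_sequences=True, dropout=dropout)))
    # one probability distribution over the tags for every token position
    model.add(layers.Dense(num_tags, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With the variables of this notebook, it could be called as `build_bilstm_tagger(num_words, num_words_tag, max_len)`.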

In [None]:
from tensorflow.keras import models, layers, optimizers

model = models.Sequential()
model.add(layers.Input(shape = (max_len,)))
# the input length is already given by the Input layer above
model.add(layers.Embedding(input_dim = num_words, output_dim = 1))
model.add(layers.SimpleRNN(units = 5, return_sequences = True))
model.add(layers.Dense(num_words_tag, activation = 'softmax'))
model.summary()

model.compile(loss = 'categorical_crossentropy', metrics = ['accuracy'])

Run the training

In [None]:
history = model.fit(
    x_train, np.array(y_train),
    batch_size = 128,
    epochs = 1,
    verbose = 1
)

In [None]:
model.evaluate(x_test, np.array(y_test))

## Evaluation by class

So far, we have mostly looked at accuracy scores. For this task, however, accuracy may not give us the entire picture, because there are many different target classes, and the model might perform differently on each of them. We therefore look at an evaluation by class. For this, the [function `classification_report(...)` from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) can be used.
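
Why accuracy alone can be misleading is easy to see on invented data. In the sketch below, a "model" that always predicts the majority class reaches 80% accuracy, although it never finds a single entity. The helper `per_class_scores` and the toy labels are assumptions for this illustration. In the exercise itself, `classification_report(...)` does this computation.

```python
import numpy as np


def per_class_scores(y_true, y_pred, num_classes):
    """Return {class index: (precision, recall)}, using 0.0 where undefined."""
    scores = {}
    for c in range(num_classes):
        true_positives = np.sum((y_pred == c) & (y_true == c))
        predicted = np.sum(y_pred == c)
        actual = np.sum(y_true == c)
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        scores[c] = (float(precision), float(recall))
    return scores


# toy gold labels: class 0 plays the role of "O", classes 1 and 2 are entity tags
toy_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])
# a "model" that always predicts the majority class
toy_pred = np.zeros_like(toy_true)

toy_accuracy = float(np.mean(toy_true == toy_pred))
toy_scores = per_class_scores(toy_true, toy_pred, num_classes=3)

print("accuracy:", toy_accuracy)
for c, (precision, recall) in toy_scores.items():
    print("class", c, "precision", precision, "recall", recall)
```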

In [None]:
from sklearn.metrics import classification_report

Y_test = np.argmax(y_test, axis=2)

y_pred = np.argmax(model.predict(x_test), axis=2)


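# passing labels explicitly keeps the report aligned with unique_tags,
# even if some tag does not occur in the test data
print(classification_report(Y_test.flatten(), y_pred.flatten(), zero_division=0, labels=range(len(unique_tags)), target_names=unique_tags))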

# Challenge

! We will do this on Thursday, 9:30 ! 

In [None]:
! if ! [[ -f data/ner/challenge.wb.csv ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/ner/challenge.wb.csv > data/ner/challenge.wb.csv; fi

In [None]:
! if ! [[ -f data/ner/challenge.bc.csv ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/ner/challenge.bc.csv > data/ner/challenge.bc.csv; fi

In [None]:
! if ! [[ -f data/ner/challenge.nw.csv ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/ner/challenge.nw.csv > data/ner/challenge.nw.csv; fi

In [None]:
word_index = { w : i for i, w in enumerate( unique_words ) }
tag_index = { t : i for i, t in enumerate( unique_tags ) }

In [None]:
challenge = pd.read_csv("data/ner/challenge.wb.csv", header = 0, names=["Sentence #","Word","POS","Tag"])

challenge["Word_idx"] = [ word_index.get(w, EMPTY_WORD_IDX) for w in challenge["Word"]]
challenge["Tag_idx"] = [ tag_index[w] for w in challenge["Tag"]]

sentences = get_sentences(challenge)


In [None]:
x_challenge = [[w[0] for w in s] for s in sentences]
x_challenge = pad_sequences(maxlen = max_len,sequences = x_challenge,padding = 'post',value = EMPTY_WORD_IDX)
y_challenge = [[w[2] for w in s] for s in sentences]
y_challenge = pad_sequences(maxlen = max_len,sequences = y_challenge,padding = 'post',value = tag_index['O'])
y_challenge = [to_categorical(i, num_classes = num_words_tag) for i in  y_challenge]

In [None]:
model.evaluate(x_challenge, np.array(y_challenge))

In [None]:
Y_test = np.argmax(y_challenge, axis=2)

y_pred = np.argmax(model.predict(x_challenge), axis=2)

print(classification_report(Y_test.flatten(), y_pred.flatten(), zero_division=0, labels=range(len(unique_tags)), target_names=unique_tags))