# NER Workshop Exercise 2: Training a Custom NER Algorithm

In this exercise, we will train our own RNN-based Named Entity Recognition algorithm, using the CoNLL-2003 tagged dataset.

## Part 1: Loading CoNLL-2003 data

The [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) shared task was a joint effort by academics to provide approaches to named entity recognition, using a tagged dataset of named entities in English and German. We will be using the tagged English data from CoNLL-2003, found in the accompanying file *conll2003.zip*.

After uploading this file to the current directory, access the data as follows:

In [1]:
# ! unzip conll2003.zip

In [2]:
import pandas as pd
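def read_conll(filename):
  # quoting = 3 (csv.QUOTE_NONE) keeps quote characters as ordinary tokens;
  # skip_blank_lines = False keeps the empty rows that separate sentences
  df = pd.read_csv(filename,
                   sep = ' ', header = None, keep_default_na = False,
                   names = ['TOKEN', 'POS', 'CHUNK', 'NE'],
                   quoting = 3, skip_blank_lines = False)
  # every blank row starts a new sentence, so the running count of blank rows is the sentence id
  df['SENTENCE'] = (df.TOKEN == '').cumsum()
  # drop the separator rows; copy so columns can be added later without SettingWithCopyWarning
  return df[df.TOKEN != ''].copy()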
train_df = read_conll('conll2003/train.txt')
valid_df = read_conll('conll2003/valid.txt')
test_df = read_conll('conll2003/test.txt')

Note that the CoNLL-2003 data contains part-of-speech (POS) and chunk tags, but we will only be using the token text and named entity (NE) tags that are provided.
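The sentence numbering in read_conll() relies on the blank lines that separate sentences in the CoNLL files. As a rough illustration of the same idea without pandas, the sketch below counts blank lines cumulatively over a tiny hand-written sample (the sample lines are made up for illustration and are not read from the dataset):

```python
from itertools import accumulate

# A tiny hand-written sample in CoNLL layout: one token per line,
# with an empty line between sentences (illustrative only).
lines = [
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
]

# The first space-separated field is the token; a blank line gives ''.
tokens = [line.split(" ")[0] for line in lines]

# Running count of blank lines seen so far, analogous to
# (df.TOKEN == '').cumsum() in read_conll().
sentence_ids = list(accumulate(int(tok == "") for tok in tokens))

# Drop the separator rows, keeping (sentence_id, token) pairs.
rows = [(sid, tok) for sid, tok in zip(sentence_ids, tokens) if tok != ""]
print(rows)
# [(0, 'EU'), (0, 'rejects'), (1, 'Peter'), (1, 'Blackburn')]
```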

**Questions:**
  1. What percentages of the CoNLL-2003 data are training, validation, and testing data? (calculate directly)
  2. What do the tags in column 'NE' mean?

In [3]:
# sizes are counted in tokens (rows), not in sentences
train_size = train_df.shape[0]
test_size = test_df.shape[0]
valid_size = valid_df.shape[0]
total_size = train_size + test_size + valid_size
train_per = train_size/total_size*100
test_per = test_size/total_size*100
valid_per = valid_size/total_size*100
print("the train set is {}% of the data".format(train_per))
print("the test set is {}% of the data".format(test_per))
print("the validation set is {}% of the data".format(valid_per))

the train set is 67.55600027740076% of the data
the test set is 15.410932892134038% of the data
the validation set is 17.033066830465206% of the data


In [40]:
train_df['NE'].value_counts()

O         170524
B-LOC       7140
B-PER       6600
B-ORG       6321
I-PER       4528
I-ORG       3704
B-MISC      3438
I-LOC       1157
I-MISC      1155
Name: NE, dtype: int64

In the list above, the prefix B marks the Beginning of an entity, I marks a token Inside an entity, and O marks a token Outside any entity.
LOC is location, PER is person, ORG is organization, and MISC is miscellaneous.
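To make the B/I/O scheme concrete, here is a minimal sketch that groups a tagged token sequence into entity spans. The function name bio_to_spans and the example sentence are assumptions made for illustration; they are not part of the exercise code.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != current_type):
            # B- starts a new entity; a stray I- tag is treated the same way.
            flush()
            current_type, current_tokens = ent_type, [token]
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["Peter", "Blackburn", "visited", "the", "United", "States", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Peter Blackburn'), ('LOC', 'United States')]
```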

## Part 2: Feature calculation

In order to learn named entity recognition using RNNs, we must transform our input and output into numeric vectors by calculating relevant features. For our basic NER algorithm, we will simply use word indices as input and one-hot embeddings of NER tags as output.

**Questions:**

3. Save a list of the 5000 most common word tokens (values from column 'TOKEN') in our training data as a list 'vocab', and save a list of all unique entity tags (values from column 'NE') as a list 'ne_tags'. 
4. Create a function token2index(token) that takes in the value of a word token and returns a unique integer. It should return 1 for any token which is not found in 'vocab' (i.e. which is out-of-vocabulary) and a number >= 2 for every token found in 'vocab'.
5. Create a function ne_tag2index(ne_tag) which returns a unique integer >= 1 for every entity tag.
6. Add new columns 'token_index' and 'ne_index' to the CoNLL data DataFrames containing the values of token2index() and ne_tag2index() for each token and entity tag.
7. Generate training data feature matrix X_train of size (14987, 50) as follows:
  * Use train_df.groupby('SENTENCE').token_index.apply(list) to get a list of lists of token indices, one list for each sentence.
  * Use pad_sequences() from keras.preprocessing.sequence to pad every list of token indices with the value '0' at the beginning so they are all of length 50.
8. Generate output data feature matrix Y_train of size (14987, 50, 10) by applying the same method to the entity tag indices (column 'ne_index'), and then one-hot encoding using to_categorical() from keras.utils.
9. Apply steps 7-8 to the validation and testing data as well to generate matrices X_valid, Y_valid, X_test, Y_test.

In [41]:
from collections import Counter
vocab = Counter(train_df['TOKEN']).most_common(5000)
vocab = [word[0] for word in vocab]
ne_tags = list(train_df['NE'].unique())

In [42]:
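# build the lookup table once; vocab.index() would rescan the whole list for every token
token_indices = {token: i + 2 for i, token in enumerate(vocab)}
def token2index(token):
    # 0 is reserved for padding and 1 for out-of-vocabulary tokens
    return token_indices.get(token, 1)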

In [43]:
def ne_tag2index(ne_tag):
    return ne_tags.index(ne_tag) + 1

Add new columns 'token_index' and 'ne_index' to the CoNLL data DataFrames containing the values of token2index() and ne_tag2index() for each token and entity tag.

In [44]:
train_df["token_index"] = train_df['TOKEN'].apply(token2index)
train_df["ne_index"] = train_df['NE'].apply(ne_tag2index)

test_df["token_index"] = test_df['TOKEN'].apply(token2index)
test_df["ne_index"] = test_df['NE'].apply(ne_tag2index)

valid_df["token_index"] = valid_df['TOKEN'].apply(token2index)
valid_df["ne_index"] = valid_df['NE'].apply(ne_tag2index)

In [45]:
from keras.preprocessing.sequence import pad_sequences
new_train_df = train_df.groupby('SENTENCE').token_index.apply(list)

In [46]:
X_train = pad_sequences(new_train_df.to_list(), value = 0, maxlen = 50)

In [47]:
new_test_df = test_df.groupby('SENTENCE').token_index.apply(list)
new_valid_df = valid_df.groupby('SENTENCE').token_index.apply(list)

In [48]:
X_test = pad_sequences(new_test_df.to_list(), value = 0, maxlen = 50)
X_valid = pad_sequences(new_valid_df.to_list(), value = 0, maxlen = 50)

In [49]:
new_train_df2 = train_df.groupby('SENTENCE').ne_index.apply(list)
new_test_df2 = test_df.groupby('SENTENCE').ne_index.apply(list)
new_valid_df2 = valid_df.groupby('SENTENCE').ne_index.apply(list)

In [50]:
y_train = pad_sequences(new_train_df2.to_list(), value = 0, maxlen = 50)
y_test = pad_sequences(new_test_df2.to_list(), value = 0, maxlen = 50)
y_valid = pad_sequences(new_valid_df2.to_list(), value = 0, maxlen = 50)

In [51]:
from keras.utils import to_categorical
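# pass num_classes explicitly so every split gets the same last dimension (tags + padding label 0)
y_train = to_categorical(y_train, num_classes = len(ne_tags) + 1)
y_test = to_categorical(y_test, num_classes = len(ne_tags) + 1)
y_valid = to_categorical(y_valid, num_classes = len(ne_tags) + 1)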

In [52]:
y_train.shape

(14987, 50, 10)
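To see where the (14987, 50, 10) shape comes from, the following plain-Python sketch mimics what pad_sequences() (with its default pre-padding and pre-truncation) and to_categorical() do for a single sentence. The helper names pad_pre and one_hot and the tiny example values are hypothetical; the real Keras functions operate on whole arrays.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a list to exactly maxlen items."""
    seq = seq[-maxlen:]  # if too long, keep only the LAST maxlen items
    return [value] * (maxlen - len(seq)) + seq


def one_hot(index, num_classes):
    """Return a vector with 1.0 at position index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Hypothetical tag indices for a three-token sentence.
ne_indices = [3, 1, 1]

padded = pad_pre(ne_indices, maxlen=5)
print(padded)  # [0, 0, 3, 1, 1]

# Three tags plus the padding label 0 give 4 classes in this toy example.
encoded = [one_hot(i, num_classes=4) for i in padded]
print(encoded[0])  # [1.0, 0.0, 0.0, 0.0]  (padding position)
print(encoded[2])  # [0.0, 0.0, 0.0, 1.0]  (tag index 3)
```

With 14987 sentences, 50 positions per sentence, and 9 tags plus the padding label, the same two steps give the shape printed above. Note that sentences longer than 50 tokens lose their first tokens under the default truncation.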

## Part 3: Building and training the model

Now we are ready to build our network that will predict NER tags from the input words. The architecture will be roughly similar to our previous exercise on RNNs.

The following imports will help you:

In [53]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, TimeDistributed, Bidirectional

In [54]:
model = Sequential()
model.add(Embedding(mask_zero = True, input_dim = len(vocab) + 2, output_dim=200, input_length=50))
model.add(LSTM(128, return_sequences=True))
model.add(TimeDistributed(Dense(len(ne_tags) + 1, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 200)           1000400   
_________________________________________________________________
lstm_2 (LSTM)                (None, 50, 128)           168448    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 50, 10)            1290      
Total params: 1,170,138
Trainable params: 1,170,138
Non-trainable params: 0
_________________________________________________________________


**Questions:**

10. Build a sequential model 'model', and add the following layers with *model.add()*:
  * Embedding -- use embedding dimension 200, and make sure to set *input_length = 50* and *mask_zero = True* (to ignore the padding indices).
  * LSTM -- use hidden state dimension 128, and return the hidden state at each time step (*return_sequences = True*)
  * Fully-connected layer (*Dense()*) with softmax activation. Make sure that this is wrapped in *TimeDistributed()* so that it is applied to the output of our LSTM at each time step. Hint: The output dimension of *Dense* is the number of possible output labels, including the padding label '0'.

  Compile the model (*model.compile()*) with loss function 'categorical_crossentropy' and optimizer 'adam', and print a summary of the model (*model.summary()*). What is the expected shape of input for the model? (Hint: see *model.input_shape*, where *None* means that any number is allowed.)
11. Run the code below in (A) to train the model, changing the number of epochs so the model learns until it starts overfitting. How many epochs did you use for training?
12. Create a model *model2* that is the same as *model* but with the LSTM layer wrapped by *Bidirectional()*, so the model becomes a BiLSTM model. How does this change the final validation loss? Does the model improve?
13. Compare the performance of the two models on the test set data X_test and Y_test (Hint: use model.evaluate()).

In [55]:
#11 (A) 
model.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_valid, y_valid))

Train on 14987 samples, validate on 3466 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0xb3fcf5a90>

As far as I can tell, the validation loss bottoms out after 6 epochs. After that it begins to oscillate and then rises steadily, which suggests the model has started to overfit.
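Rather than reading the stopping point off the log by eye, the rule "stop once the validation loss has not improved for a few epochs" can be written down explicitly. The sketch below is a plain-Python illustration of that rule; the function name and the loss values are invented to mimic the pattern described above and are not taken from the actual training run.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training is considered stopped once the validation loss has failed
    to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs without triggering the stopping rule.
    return best_epoch, len(val_losses)


# Invented validation losses: falling until epoch 6, then oscillating and rising.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.38, 0.37, 0.39, 0.38, 0.41, 0.44]
print(early_stopping_epoch(val_losses))  # (6, 8)

# Keras provides the same rule as a callback, for example:
#   from keras.callbacks import EarlyStopping
#   model.fit(..., callbacks=[EarlyStopping(monitor='val_loss', patience=2,
#                                           restore_best_weights=True)])
```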

In [56]:
model2 = Sequential()
model2.add(Embedding(mask_zero = True, input_dim = len(vocab) + 2, output_dim=200, input_length=50))
model2.add(Bidirectional(LSTM(128, return_sequences=True)))
model2.add(TimeDistributed(Dense(len(ne_tags) + 1, activation='softmax')))
model2.compile(loss='categorical_crossentropy', optimizer='adam')
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 200)           1000400   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 50, 256)           336896    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 50, 10)            2570      
Total params: 1,339,866
Trainable params: 1,339,866
Non-trainable params: 0
_________________________________________________________________
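The parameter counts in the two summaries can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers and assumes the vocabulary holds exactly 5000 tokens (so the embedding has 5002 rows); the helper function names are made up for this illustration.

```python
def embedding_params(num_indices, dim):
    # One dim-sized vector per index.
    return num_indices * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units


emb = embedding_params(5000 + 2, 200)    # vocabulary + padding + out-of-vocabulary
lstm = lstm_params(200, 128)
dense = dense_params(128, 10)            # 9 tags + padding label
print(emb, lstm, dense, emb + lstm + dense)
# 1000400 168448 1290 1170138

# Bidirectional() runs two LSTMs (forward and backward) and concatenates
# their outputs, so the LSTM parameters double and Dense sees 256 inputs.
bilstm = 2 * lstm
dense2 = dense_params(2 * 128, 10)
print(bilstm, dense2, emb + bilstm + dense2)
# 336896 2570 1339866
```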


In [57]:
model2.fit(X_train, y_train, epochs = 10, batch_size = 128, validation_data = (X_valid, y_valid))

Train on 14987 samples, validate on 3466 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0xb41b45e48>

The model improves: both the training loss and the validation loss are lower for model2 than for model.

In [58]:
score = model.evaluate(X_test, y_test)




In [59]:
score2 = model2.evaluate(X_test, y_test)



In [63]:
print('Model 1 Test loss:', score)
print('Model 2 Test loss:', score2)

Model 1 Test loss: 0.3178489805602678
Model 2 Test loss: 0.2805885224725473


In [62]:
score

0.3178489805602678

## Bonus 1: Running on custom input

**Bonus question 1:**

What does your model predict as NER tags for the following test sentences?

Hint: Try using the following pipeline on each sentence:

* Tokenize with nltk.word_tokenize()
* Convert to array of indices with token2index() defined above
* Pad to length 50 with pad_sequences() from Keras
* Predict probabilities of NER tags with model2.predict()
* Find maximum likelihood tags using np.argmax() (with axis = 1), and ignore padding values

In [None]:
test_sentences = [
  "This is a test.",
  "I live in the United States.",
  "Israel is a country in the Middle East.",
  "UK joins US in Gulf mission after Iran taunts American allies",
  "The project was funded by EuroNanoMed-II, the Health Ministry, the Portuguese Foundation for Science and Technology, the Israel Science Foundation, the European Research Council’s Consolidator and Advanced Awards, the Saban Family Foundation – Melanoma Research Alliance’s Team Science Award and the Israel Cancer Research Fund."
]
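The last step of the pipeline above is the easiest to get wrong: because the padding is added at the beginning, the predictions for the real tokens sit in the last positions of the output. The sketch below shows one way to decode them. The function name decode_tags, the two-tag list, and the probability rows are all hypothetical; in the notebook, prob_rows would be something like model2.predict(padded)[0] and ne_tags the list built in Part 2.

```python
def decode_tags(tokens, prob_rows, index2tag):
    """Pair each token with its most likely tag.

    prob_rows holds one row of tag probabilities per padded position.
    Padding was added at the beginning, so the real tokens correspond
    to the LAST len(tokens) rows.
    """
    rows = list(prob_rows)
    n = min(len(tokens), len(rows))
    if n == 0:
        return []
    tokens, rows = tokens[-n:], rows[-n:]
    result = []
    for token, row in zip(tokens, rows):
        # Skip index 0 (the padding label) when looking for the best tag.
        best = max(range(1, len(row)), key=lambda i: row[i])
        result.append((token, index2tag[best]))
    return result


# Hypothetical tag list and inverse of ne_tag2index() (indices start at 1).
ne_tags = ["O", "B-LOC"]
index2tag = {i + 1: tag for i, tag in enumerate(ne_tags)}

tokens = ["I", "live", "in", "Paris"]
# Made-up probabilities for a sequence padded to length 6:
# columns are [padding, 'O', 'B-LOC'].
prob_rows = [
    [1.0, 0.0, 0.0],  # padding
    [1.0, 0.0, 0.0],  # padding
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
print(decode_tags(tokens, prob_rows, index2tag))
# [('I', 'O'), ('live', 'O'), ('in', 'O'), ('Paris', 'B-LOC')]
```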

## Bonus 2: Adding features

**Bonus question 2:**

In (B) below, add code to add a new column 'SHAPE' to the dataset. This column should represent the shape of the word token by:
* Replacing all capital letters with 'X'
* Replacing all lowercase letters with 'x'
* Replacing all digits with 'd'

For example, we should have the following:

* 'house' => 'xxxxx'
* 'Apple' => 'Xxxxx'
* 'R2D2' => 'XdXd'
* 'U.K.' => 'X.X.'

Hint: for a pandas Series, you can use series.str.replace() to easily replace text (pass regex = True when the pattern is a regular expression such as '[A-Z]').

In [None]:
def series2shape(series):
  ## (B) -- add bonus question code here
  raise NotImplementedError('series2shape() is not implemented yet')
  
train_df['SHAPE'] = series2shape(train_df.TOKEN)
valid_df['SHAPE'] = series2shape(valid_df.TOKEN)
test_df['SHAPE'] = series2shape(test_df.TOKEN)
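One possible way to fill in (B) is sketched below. It maps characters one at a time instead of chaining series.str.replace() calls, which avoids any ordering problem (the replacement characters 'x' and 'd' are themselves lowercase letters). The helper token2shape is a name introduced here for illustration, and series2shape assumes it receives a pandas Series.

```python
def token2shape(token):
    """Return the word shape: 'X' for capitals, 'x' for lowercase, 'd' for digits."""
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("d")
        elif ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        else:
            # Punctuation and other characters are kept unchanged.
            shape.append(ch)
    return "".join(shape)


def series2shape(series):
    # Assumes `series` is a pandas Series of token strings.
    return series.apply(token2shape)


for word in ["house", "Apple", "R2D2", "U.K."]:
    print(word, "=>", token2shape(word))
# house => xxxxx
# Apple => Xxxxx
# R2D2 => XdXd
# U.K. => X.X.
```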

Once you complete this, run the following code to see how adding this as a feature improves the performance of the model. For simplicity we only use the top 100 word shapes. How does the final loss change?

In [None]:
import numpy as np
from nltk import FreqDist
from tqdm import tqdm

shape_vocab = [w for w, f in FreqDist(train_df.SHAPE).most_common(n = 100)]
shape_set = set(shape_vocab)
def shape2index(shape):
  if shape in shape_set:
    return shape_vocab.index(shape) + 2
  else: # out-of-vocabulary shape
    return 1

n_words = 50
def df2features2(df):
  tqdm.pandas(desc = 'Shape indices')
  df['shape_index'] = df.SHAPE.progress_apply(shape2index)
  token_index_lists = df.groupby('SENTENCE').token_index.apply(list)
  ne_index_lists = df.groupby('SENTENCE').ne_index.apply(list)
  # use the shape indices here, not ne_index: feeding the tags in as a feature would leak the labels
  shape_index_lists = df.groupby('SENTENCE').shape_index.apply(list)
  # X has shape (2, n_sentences, n_words): X[0] holds token indices, X[1] shape indices
  X = np.stack([
      pad_sequences(token_index_lists, maxlen = n_words, value = 0),
      pad_sequences(shape_index_lists, maxlen = n_words, value = 0)
  ])
  Y = to_categorical(pad_sequences(ne_index_lists, maxlen = n_words, value = 0))
  return X, Y

X2_train, Y2_train = df2features2(train_df)
X2_valid, Y2_valid = df2features2(valid_df)
X2_test, Y2_test = df2features2(test_df)

In [None]:
from keras.models import Model
from keras.layers import Input, concatenate

input1 = Input(shape = (50,))
input2 = Input(shape = (50,))
embedded1 = Embedding(
    len(vocab) + 2, 200,
    input_length = 50, mask_zero = True)(input1)
embedded2 = Embedding(
    len(shape_vocab) + 2, 8,
    input_length = 50, mask_zero = True)(input2)
x = concatenate([embedded1, embedded2])
x = Bidirectional(LSTM(128, return_sequences = True))(x)
output = TimeDistributed(Dense(len(ne_tags) + 1, activation = 'softmax'))(x)
model3 = Model(inputs = [input1, input2], outputs = [output])
model3.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

In [None]:
model3.fit(
    [X2_train[0], X2_train[1]],
    Y2_train, epochs = 5, batch_size = 128,
    validation_data = ([X2_valid[0], X2_valid[1]], Y2_valid))

In [None]:
print("Model3 loss on test data:")
model3.evaluate([X2_test[0], X2_test[1]], Y2_test)