# Recognize named entities on Twitter with RNNs

In this assignment, we will use a recurrent neural network to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing systems. It extracts entities such as persons, organizations, and locations from text. In this task, we will experiment with recognizing named entities in tweets.

Suppose we want to extract the names of persons and organizations from the text. For the input text:

    Ian Goodfellow works for Google Brain

an NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Here the *B-* and *I-* prefixes stand for the beginning and the inside of an entity, while *O* stands for outside, that is, no tag. Markup with this prefix scheme is called *BIO markup*. It makes it possible to distinguish consecutive entities of the same type.
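To make the BIO scheme concrete, here is a minimal sketch that groups tagged tokens back into entities. The helper name `bio_to_spans` is hypothetical and not part of the assignment code. As a simplifying assumption, an *I-* tag that does not continue an open entity of the same type is treated like *O*.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # 'O' (or a stray I- tag) closes the open entity, if any.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


tokens = 'Ian Goodfellow works for Google Brain'.split()
tags = 'B-PER I-PER O O B-ORG I-ORG'.split()
spans = bio_to_spans(tokens, tags)
print(spans)  # [('PER', 'Ian Goodfellow'), ('ORG', 'Google Brain')]
```

The *B-* prefix is what separates two adjacent entities of the same type: `['B-PER', 'B-PER']` yields two persons, whereas `['B-PER', 'I-PER']` yields one.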

The solution will be based on neural networks, in particular on bidirectional RNNs (such as Bi-LSTMs or Bi-GRUs).

In [1]:
import numpy as np

### Load the Twitter Named Entity Recognition corpus

We will work with a corpus that contains tweets with NE tags. Every line of a file contains a token (a word or punctuation symbol) and its tag, separated by whitespace. Different tweets are separated by an empty line.

The function *read_data* reads a corpus from *file_path* and returns two lists: one with tokens and one with the corresponding tags. You need to complete this function by adding code that replaces every user nickname with the `<USR>` token and every URL with the `<URL>` token. You can assume that a URL is just a string that starts with *http://* or *https://*, and a nickname is a string that starts with the *@* symbol.

In [2]:
def read_data(file_path):
    tokens = []
    tags = []
    
    tweet_tokens = []
    tweet_tags = []
    for line in open(file_path, encoding='utf-8'):
        line = line.strip()
        if not line:
            if tweet_tokens:
                tokens.append(tweet_tokens)
                tags.append(tweet_tags)
            tweet_tokens = []
            tweet_tags = []
        else:
            token, tag = line.split()
            # Replace all users with <USR> token
            if '@' in token:
                token = '<USR>'
            # Replace all urls with <URL> token
            if 'http://' in token or 'https://' in token:
                token = '<URL>'

            tweet_tokens.append(token)
            tweet_tags.append(tag)
            
    return tokens, tags
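Note that `read_data` above tests whether the marker occurs anywhere in the token, which is slightly broader than the "starts with" rule in the task description (a token such as an email address also contains `@`). A minimal sketch of the stricter prefix-based rule, using a hypothetical helper `normalize_token` that is not part of the assignment code:

```python
def normalize_token(token):
    """Map nicknames and URLs to placeholder tokens using prefix checks."""
    if token.startswith('@'):
        return '<USR>'
    # str.startswith accepts a tuple of alternative prefixes.
    if token.startswith(('http://', 'https://')):
        return '<URL>'
    return token


examples = ['@someone', 'https://t.co/abc', 'me@example.com', 'RT']
normalized = [normalize_token(t) for t in examples]
print(normalized)  # ['<USR>', '<URL>', 'me@example.com', 'RT']
```

Switching `read_data` to this rule would change which tokens are replaced, so the vocabulary size and scores reported in this notebook would not necessarily be reproduced.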

And now we can load three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameters tuning;
 - *test* data for final evaluation of the model.

In [3]:
train_tokens, train_tags = read_data('data/train.txt')
validation_tokens, validation_tags = read_data('data/validation.txt')
test_tokens, test_tags = read_data('data/test.txt')

`train_tokens` and `train_tags` are lists of lists of tokens and tags, respectively, with one inner list per tweet. We can print the data by running the following cell:

In [4]:
for i in range(1):
    for token, tag in zip(train_tokens[i], train_tags[i]):
        print('%s\t%s' % (token, tag))
    print()

RT	O
<USR>	O
:	O
Online	O
ticket	O
sales	O
for	O
Ghostland	B-musicartist
Observatory	I-musicartist
extended	O
until	O
6	O
PM	O
EST	O
due	O
to	O
high	O
demand	O
.	O
Get	O
them	O
before	O
they	O
sell	O
out	O
...	O



In [21]:
len(train_tokens) 

5795

### Prepare dictionaries

To train a neural network, we will use two mappings: 
- {token}$\to${token id}: address the row in embeddings matrix for the current token;
- {tag}$\to${tag id}: one-hot ground truth probability distribution vectors for computing the loss at the output of the network.

We implement the function *build_dict* which will return {token or tag}$\to${index} and vice versa. 

In [6]:
def build_dict(tokens_or_tags, special_tokens):
    """
        tokens_or_tags: a list of lists of tokens or tags
        special_tokens: some special tokens
    """

    tokens_or_tags_flat_ = list(set([x for y in tokens_or_tags for x in y]))
    tokens_or_tags_flat = [x for x in tokens_or_tags_flat_ if x not in special_tokens]
    tokens_or_tags_flat = special_tokens + tokens_or_tags_flat
        
    idx2tok = dict(enumerate(tokens_or_tags_flat))
    tok2idx = {t:i for i, t in idx2tok.items()}

    return tok2idx, idx2tok
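Because `build_dict` collects the vocabulary through a `set`, the ids assigned to ordinary tokens can differ between Python sessions (string hashing is randomized by default). The sketch below is a hypothetical variant, `build_dict_sorted`, that sorts the vocabulary so the mapping is reproducible. It is an illustration on toy data, not the function used in the rest of the notebook.

```python
def build_dict_sorted(tokens_or_tags, special_tokens):
    """Like build_dict, but with a deterministic (sorted) id assignment."""
    unique = {x for seq in tokens_or_tags for x in seq}
    # Special tokens keep the first ids; the rest follow in sorted order.
    vocab = special_tokens + sorted(unique - set(special_tokens))
    idx2tok = dict(enumerate(vocab))
    tok2idx = {tok: idx for idx, tok in idx2tok.items()}
    return tok2idx, idx2tok


toy_corpus = [['RT', '<USR>', 'hello'], ['hello', 'world']]
tok2idx, idx2tok = build_dict_sorted(toy_corpus, ['<PAD>', '<UNK>'])
print(tok2idx)
```

A stable mapping matters mostly when a trained model is saved and reloaded later, since the embedding rows are addressed by these ids.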

After implementing the function *build_dict* you can make dictionaries for tokens and tags. Special tokens in our case will be:
 - `<UNK>` token for out of vocabulary tokens;
 - `<PAD>` token for padding sentence to the same length when we create batches of sentences.

In [7]:
special_tokens = ['<PAD>', '<UNK>']
special_tags = ['O']

# Create dictionaries from training and validation tokens 
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)

In [8]:
len(token2idx) # total of tokens

20458

We check the indices of our special tokens:

In [9]:
token2idx['<PAD>'], token2idx['<UNK>'], tag2idx['O']


(0, 1, 0)

The following helper functions map the tokens and tags of a sentence to ids and back.

In [10]:
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]

### Prepare test_tokens by replacing out-of-vocabulary tokens with the `<UNK>` token

In [11]:
for i in range(len(test_tokens)):
    for j in range(len(test_tokens[i])):
        if test_tokens[i][j] not in token2idx.keys():
            test_tokens[i][j] = '<UNK>'
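An alternative to rewriting the test data in place is to fall back to the `<UNK>` id at lookup time. The sketch below assumes a hypothetical helper `words2idxs_safe` and a toy dictionary; it is not part of the notebook's pipeline.

```python
def words2idxs_safe(tokens_list, token2idx, unk_token='<UNK>'):
    """Map tokens to ids, using the <UNK> id for out-of-vocabulary tokens."""
    unk_idx = token2idx[unk_token]
    return [token2idx.get(word, unk_idx) for word in tokens_list]


# Toy dictionary for illustration only.
toy_token2idx = {'<PAD>': 0, '<UNK>': 1, 'luggage': 2, 'hate': 3}
ids = words2idxs_safe(['i', 'hate', 'luggage'], toy_token2idx)
print(ids)  # [1, 3, 2]
```

This keeps the original tokens available for inspection, at the cost of passing the dictionary around explicitly.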
            

Double check the test_tokens:

In [12]:
for i in range(1):
    for token, tag in zip(test_tokens[i], test_tags[i]):
        print('%s\t%s' % (token, tag))
    print()

Man	O
i	O
hate	O
when	O
people	O
<UNK>	O
<UNK>	O
luggage	O
..	O
ima	O
just	O
rip	O
it	O
up	O
more	O
with	O
the	O
<UNK>	O
<UNK>	O
<UNK>	O



### Generate batches

Neural networks are usually trained with batches. This means that each weight update of the network is based on several sequences at once. _The tricky part is that all sequences within a batch need to have the same length_. So we will pad them with a special `<PAD>` token.

In [13]:
def batches_generator(batch_size, 
                      tokens, tags,
                      shuffle=True, 
                      allow_smaller_last_batch=True):
    """Generates padded batches of tokens and tags."""
    
    n_samples = len(tokens)
    if shuffle:
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)

    n_batches = n_samples // batch_size
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1

    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        current_batch_size = batch_end - batch_start
        x_list = []
        y_list = []
        max_len_token = 0
        for idx in order[batch_start: batch_end]:
            x_list.append(words2idxs(tokens[idx]))
            y_list.append(tags2idxs(tags[idx]))
            max_len_token = max(max_len_token, len(tags[idx]))

        # Fill in the data into numpy nd-arrays filled with padding indices.
        x = np.zeros([current_batch_size, max_len_token], dtype=np.int32) 
        y = np.zeros([current_batch_size, max_len_token], dtype=np.int32) 
        lengths = np.zeros(current_batch_size, dtype=np.int32)
        for n in range(current_batch_size):
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]
        yield x, y 
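The padding step can be seen in isolation on a toy batch. The sketch below uses a hypothetical helper `pad_batch` that also returns the sequence lengths and a boolean mask of real (non-padding) positions. `batches_generator` above computes `lengths` but yields only `x` and `y`.

```python
import numpy as np


def pad_batch(sequences, pad_index=0):
    """Right-pad id sequences to a common length."""
    lengths = np.array([len(seq) for seq in sequences], dtype=np.int32)
    batch = np.full((len(sequences), lengths.max()), pad_index, dtype=np.int32)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    # mask[i, t] is True where position t of sequence i holds a real token.
    mask = np.arange(batch.shape[1]) < lengths[:, None]
    return batch, lengths, mask


batch, lengths, mask = pad_batch([[5, 8, 2], [7], [4, 9]])
print(batch)
print(lengths)
print(mask)
```

Here every row has length 3, the longest sequence in the batch, and the mask marks 6 of the 9 positions as real tokens.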


### Build a recurrent neural network

This is the most important part of the assignment. Here we specify the network architecture from TensorFlow building blocks. We will create a GRU (or LSTM) network that produces a probability distribution over tags for each token in a sentence. To take into account both the left and the right context of a token, we will use a Bidirectional wrapper. Both LSTM and GRU have numerous parameters, as well as different implementations. In particular, we use a GRUCell (it could also be an LSTMCell) looped over time by an RNN layer. The usual alternative is the GRU layer (not its cell), which performs the loop itself. These options may differ slightly in performance depending on the particular task at hand. A Dense layer is used on top to perform the tag classification.

In [14]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers as L

Now, let us specify the layers of the neural network. First, we need to perform some preparatory steps:

- We use an Embedding layer with `mask_zero=True`, which automatically passes the masking tensor to the RNN layer. The mask helps the RNN ignore the padding timesteps. The same mask should also be used to mask the loss terms that correspond to padding. Instead of using an Embedding layer, we could also initialize a random embedding matrix and look up the input indices in it with `tf.nn.embedding_lookup`.
- The Bidirectional wrapper runs the forward and backward cells independently. We also use dropout, an important regularization technique for neural networks, which helps us control overfitting in this task.


Create the model with the following parameters:
 - *vocabulary_size*: number of tokens;
 - *n_tags*: number of tags;
 - *embedding_dim*: dimension of embeddings, recommended value: 200;
 - *n_hidden*: size of the hidden layers of the RNN, recommended value: 200;
 - *PAD_index*: the index of the padding token (`<PAD>`).
 

In [15]:
vocabulary_size = 20458 
n_tags = 21 
embedding_dim = 200 
n_hidden = 300 
PAD_index = 0
batch_size = 32 
learning_rate = 0.005 
learning_rate_decay = 1.4 
dropout = 0.5

### Model

In [16]:
model = tf.keras.Sequential()

forward_layer = L.RNN(L.GRUCell(n_hidden,dropout=dropout), 
                     return_sequences=True, 
                     )

backward_layer = L.RNN(L.GRUCell(n_hidden,dropout=dropout), 
                      return_sequences=True, 
                      go_backwards=True
                      )

model.add(L.Embedding(vocabulary_size, embedding_dim, mask_zero=True))
model.add(L.Bidirectional(forward_layer, backward_layer=backward_layer))
model.add(L.Dense(n_tags))
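For comparison, the alternative mentioned earlier (the `GRU` layer instead of `RNN(GRUCell)`) can be sketched as follows. The function name `build_gru_tagger` and the small toy sizes are assumptions for illustration. This is not the model trained in this notebook, and no claim is made about how its scores would compare.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers as L


def build_gru_tagger(vocabulary_size, n_tags, embedding_dim=200,
                     n_hidden=200, dropout=0.5):
    """Bidirectional tagger built from the GRU layer rather than GRUCell."""
    return tf.keras.Sequential([
        # mask_zero=True makes index 0 (<PAD>) produce a mask for the RNN.
        L.Embedding(vocabulary_size, embedding_dim, mask_zero=True),
        # Bidirectional clones the layer for the backward direction.
        L.Bidirectional(L.GRU(n_hidden, dropout=dropout,
                              return_sequences=True)),
        # One logit per tag at every timestep.
        L.Dense(n_tags),
    ])


# Toy sizes so the example runs quickly.
toy_model = build_gru_tagger(vocabulary_size=50, n_tags=5,
                             embedding_dim=8, n_hidden=16)
toy_batch = np.array([[4, 7, 9, 0, 0], [3, 2, 0, 0, 0]], dtype=np.int32)
logits = toy_model(toy_batch, training=False)
print(logits.shape)
```

The output has one vector of `n_tags` logits per token position, including padded positions, which is why padding has to be masked out again when the loss is computed.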


### Training

During training we do not need the predictions of the network, but we do need a loss function. We will use the cross-entropy loss, which TensorFlow implements efficiently as `tf.nn.softmax_cross_entropy_with_logits`. It must be applied to the logits of the model (not to softmax probabilities!). Also note that we do not want to take into account loss terms coming from `<PAD>` tokens, so we need to mask them out before computing the mean.
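There are two common ways to average a masked loss, and they differ in the denominator. The NumPy sketch below (a hypothetical helper `masked_cross_entropy`, with toy inputs) computes both: the mean over all positions, which is what applying `tf.reduce_mean` to the masked per-token losses does, and the mean over real tokens only.

```python
import numpy as np


def masked_cross_entropy(logits, tag_ids, token_ids, pad_index=0):
    """Return (mean over all positions, mean over non-padding positions).

    logits: [batch, time, n_tags]; tag_ids and token_ids: [batch, time].
    """
    # Numerically stable log-softmax over the tag axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the true tag at every position.
    per_token = -np.take_along_axis(log_probs, tag_ids[..., None], axis=-1)[..., 0]
    mask = (token_ids != pad_index).astype(np.float64)
    mean_over_all = (per_token * mask).mean()
    mean_over_real = (per_token * mask).sum() / mask.sum()
    return mean_over_all, mean_over_real


logits = np.zeros((1, 4, 3))          # uniform prediction over 3 tags
tag_ids = np.array([[1, 2, 0, 0]])
token_ids = np.array([[5, 6, 0, 0]])  # the last two positions are <PAD>
mean_all, mean_real = masked_cross_entropy(logits, tag_ids, token_ids)
print(mean_all, mean_real)
```

With two real tokens out of four positions, the first value is half of the second. Dividing by the number of real tokens makes the loss independent of how much padding a batch happens to contain.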

#### Training loop

In [18]:
clip_norm = tf.cast(1.0, tf.float32)

train_loss_results = []
optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)

n_epochs = 10 

print('Start training... \n')
for epoch in range(n_epochs):
    
    train_loss=0
    optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)

    #Train the model
    for x, y in batches_generator(batch_size, train_tokens, train_tags):
        
        y = tf.one_hot(y, n_tags)
        
        with tf.GradientTape() as tape:
            y_pred = model(x, training=True)
            loss_ = tf.nn.softmax_cross_entropy_with_logits(y,y_pred)
            loss = tf.reduce_mean(loss_ * tf.cast( x != 0, dtype=tf.float32))
 
        grads = tape.gradient(loss, model.trainable_variables)
        grads = [tf.clip_by_norm(grad, clip_norm) for grad in grads]
    
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        train_loss += loss
    
    learning_rate = learning_rate/learning_rate_decay
    train_loss_results.append(keras.backend.eval(train_loss))
    print('Epoch {}/{} \t Training Loss:{}'.format(epoch, n_epochs, keras.backend.eval(train_loss)))

print('...training finished.')

Start training... 

Epoch 0/10 	 Training Loss:39.618526458740234
Epoch 1/10 	 Training Loss:15.92908763885498
Epoch 2/10 	 Training Loss:7.241702556610107
Epoch 3/10 	 Training Loss:3.4963269233703613
Epoch 4/10 	 Training Loss:1.9717960357666016
Epoch 5/10 	 Training Loss:1.3010663986206055
Epoch 6/10 	 Training Loss:0.8103141188621521
Epoch 7/10 	 Training Loss:0.6042543053627014
Epoch 8/10 	 Training Loss:0.4550827741622925
Epoch 9/10 	 Training Loss:0.4174806475639343
...training finished.
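The training loop divides the learning rate by `learning_rate_decay` after every epoch and creates a fresh Adam optimizer with the new value. A short sketch of the resulting schedule, using the hyperparameter values set earlier (the name `initial_learning_rate` is introduced here only for illustration):

```python
initial_learning_rate = 0.005
learning_rate_decay = 1.4
n_epochs = 10

# Learning rate used in epoch e: initial_learning_rate / decay ** e.
schedule = [initial_learning_rate / learning_rate_decay ** epoch
            for epoch in range(n_epochs)]
for epoch, lr in enumerate(schedule):
    print('epoch %d: learning rate %.6f' % (epoch, lr))
```

One side effect of this design is that a new optimizer instance starts without the moment estimates Adam accumulated in the previous epoch.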


### Evaluation

To simplify the evaluation process, two functions are provided:
 - `predict_tags`: uses a model to get predictions and transforms indices to tokens and tags;
 - `eval_conll`: calculates precision, recall and F1 for the results.

In [19]:
from evaluation import precision_recall_f1

def predict_tags(model, x_batch):
    """Performs predictions and transforms indices to tokens and tags."""
    
    logits = model(x_batch, training=False)
    softmax_output = tf.nn.softmax(logits, axis=-1)
    tag_idxs_batch = tf.math.argmax(softmax_output,output_type=tf.dtypes.int32, axis=-1)
    tag_idxs_batch = keras.backend.eval(tag_idxs_batch)
    
    tags_batch, tokens_batch = [], []
    for tag_idxs, token_idxs in zip(tag_idxs_batch, x_batch):
        tags, tokens = [], []
        for tag_idx, token_idx in zip(tag_idxs, token_idxs):
            tags.append(idx2tag[tag_idx])
            tokens.append(idx2token[token_idx])
        tags_batch.append(tags)
        tokens_batch.append(tokens)
    return tags_batch, tokens_batch
        
def eval_conll(model, tokens, tags, short_report=True):
    """Computes NER quality measures using CONLL shared task script."""
    
    y_true, y_pred = [], []
    for x_batch, y_batch in batches_generator(1, tokens, tags):
        tags_batch, tokens_batch = predict_tags(model, x_batch)
        if len(x_batch[0]) != len(tags_batch[0]):
            raise Exception("Incorrect length of prediction for the input, "
                            "expected length: %i, got: %i" % (len(x_batch[0]), len(tags_batch[0])))
        predicted_tags = []
        ground_truth_tags = []
        for gt_tag_idx, pred_tag, token in zip(y_batch[0], tags_batch[0], tokens_batch[0]): 
            if token != '<PAD>':
                ground_truth_tags.append(idx2tag[gt_tag_idx])
                predicted_tags.append(pred_tag)

        # We extend every prediction and ground truth sequence with 'O' tag
        # to indicate a possible end of entity.
        y_true.extend(ground_truth_tags + ['O'])
        y_pred.extend(predicted_tags + ['O'])
        
    results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
    return results
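The overall scores in the reports follow from three phrase counts: true phrases, predicted ("found") phrases, and correct phrases. The sketch below applies the standard definitions to the counts from the validation report in this notebook (537 phrases, 423 found, 208 correct). The helper `precision_recall_f1_from_counts` is hypothetical and says nothing about how the provided `evaluation` module is implemented internally.

```python
def precision_recall_f1_from_counts(n_true, n_predicted, n_correct):
    """Entity-level precision, recall and F1 (in percent) from phrase counts."""
    precision = 100.0 * n_correct / n_predicted if n_predicted else 0.0
    recall = 100.0 * n_correct / n_true if n_true else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1_from_counts(
    n_true=537, n_predicted=423, n_correct=208)
print('precision: %.2f%%; recall: %.2f%%; F1: %.2f' % (precision, recall, f1))
# precision: 49.17%; recall: 38.73%; F1: 43.33
```

A predicted phrase counts as correct only if both its span and its type match the ground truth, which is why entity-level scores are much stricter than per-token accuracy.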

### Checking scores

In [20]:
print('Train data evaluation:')
eval_conll(model, train_tokens, train_tags, short_report=True)
print('Validation data evaluation:')
eval_conll(model, validation_tokens, validation_tags, short_report=True)

Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 4501 phrases; correct: 4467.

precision:  99.24%; recall:  99.51%; F1:  99.38

Validation data evaluation:
processed 12836 tokens with 537 phrases; found: 423 phrases; correct: 208.

precision:  49.17%; recall:  38.73%; F1:  43.33



OrderedDict([('company',
              OrderedDict([('precision', 67.44186046511628),
                           ('recall', 55.769230769230774),
                           ('f1', 61.05263157894737),
                           ('n_predicted_entities', 86),
                           ('n_true_entities', 104)])),
             ('facility',
              OrderedDict([('precision', 48.57142857142857),
                           ('recall', 50.0),
                           ('f1', 49.27536231884058),
                           ('n_predicted_entities', 35),
                           ('n_true_entities', 34)])),
             ('geo-loc',
              OrderedDict([('precision', 69.66292134831461),
                           ('recall', 54.86725663716814),
                           ('f1', 61.38613861386139),
                           ('n_predicted_entities', 89),
                           ('n_true_entities', 113)])),
             ('movie',
              OrderedDict([('precision', 0.0),
         

Now let us see the full quality reports for the final model on the train, validation, and test sets. You can expect an F-score of about 40% on the validation set.


In [23]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
eval_conll(model, train_tokens, train_tags, short_report=False)
print('-' * 20 + ' Validation set quality: ' + '-' * 20)
eval_conll(model, validation_tokens, validation_tags, short_report=False)

-------------------- Train set quality: --------------------
processed 105778 tokens with 4489 phrases; found: 4501 phrases; correct: 4467.

precision:  99.24%; recall:  99.51%; F1:  99.38

	     company: precision:   99.38%; recall:   99.38%; F1:   99.38; predicted:   643

	    facility: precision:   97.48%; recall:   98.41%; F1:   97.94; predicted:   317

	     geo-loc: precision:   99.80%; recall:   99.80%; F1:   99.80; predicted:   996

	       movie: precision:  100.00%; recall:  100.00%; F1:  100.00; predicted:    68

	 musicartist: precision:   99.14%; recall:   99.57%; F1:   99.35; predicted:   233

	       other: precision:   98.43%; recall:   99.60%; F1:   99.02; predicted:   766

	      person: precision:   99.89%; recall:   99.77%; F1:   99.83; predicted:   885

	     product: precision:   99.68%; recall:   99.37%; F1:   99.53; predicted:   317

	  sportsteam: precision:   99.54%; recall:   99.08%; F1:   99.31; predicted:   216

	      tvshow: precision:   95.00%; recall:  

OrderedDict([('company',
              OrderedDict([('precision', 67.44186046511628),
                           ('recall', 55.769230769230774),
                           ('f1', 61.05263157894737),
                           ('n_predicted_entities', 86),
                           ('n_true_entities', 104)])),
             ('facility',
              OrderedDict([('precision', 48.57142857142857),
                           ('recall', 50.0),
                           ('f1', 49.27536231884058),
                           ('n_predicted_entities', 35),
                           ('n_true_entities', 34)])),
             ('geo-loc',
              OrderedDict([('precision', 69.66292134831461),
                           ('recall', 54.86725663716814),
                           ('f1', 61.38613861386139),
                           ('n_predicted_entities', 89),
                           ('n_true_entities', 113)])),
             ('movie',
              OrderedDict([('precision', 0.0),
         

In [24]:
print('-' * 20 + ' Test set quality: ' + '-' * 20)
eval_conll(model, test_tokens, test_tags, short_report=False)


-------------------- Test set quality: --------------------
processed 13258 tokens with 604 phrases; found: 524 phrases; correct: 249.

precision:  47.52%; recall:  41.23%; F1:  44.15

	     company: precision:   62.71%; recall:   44.05%; F1:   51.75; predicted:    59

	    facility: precision:   36.07%; recall:   46.81%; F1:   40.74; predicted:    61

	     geo-loc: precision:   75.61%; recall:   56.36%; F1:   64.58; predicted:   123

	       movie: precision:   16.67%; recall:   12.50%; F1:   14.29; predicted:     6

	 musicartist: precision:   10.00%; recall:   11.11%; F1:   10.53; predicted:    30

	       other: precision:   33.33%; recall:   35.92%; F1:   34.58; predicted:   111

	      person: precision:   56.25%; recall:   43.27%; F1:   48.91; predicted:    80

	     product: precision:   14.81%; recall:   14.29%; F1:   14.55; predicted:    27

	  sportsteam: precision:   26.92%; recall:   22.58%; F1:   24.56; predicted:    26

	      tvshow: precision:    0.00%; recall:    0.0

OrderedDict([('company',
              OrderedDict([('precision', 62.71186440677966),
                           ('recall', 44.047619047619044),
                           ('f1', 51.74825174825175),
                           ('n_predicted_entities', 59),
                           ('n_true_entities', 84)])),
             ('facility',
              OrderedDict([('precision', 36.0655737704918),
                           ('recall', 46.808510638297875),
                           ('f1', 40.74074074074074),
                           ('n_predicted_entities', 61),
                           ('n_true_entities', 47)])),
             ('geo-loc',
              OrderedDict([('precision', 75.60975609756098),
                           ('recall', 56.36363636363636),
                           ('f1', 64.58333333333333),
                           ('n_predicted_entities', 123),
                           ('n_true_entities', 165)])),
             ('movie',
              OrderedDict([('precision', 16

### Conclusions

Could we say that our model is state of the art and the results are acceptable for the task? Definitely. Nowadays, bidirectional RNNs are among the state-of-the-art approaches to NER and outperform classical methods such as CRFs on this task. Even though we used a small training corpus (compared with the usual corpus sizes in deep learning), our results are quite good. In addition, this task has many possible named entity types, and for some of them we have only a few dozen training examples, which is definitely small. However, even better results could be obtained by combining several types of methods; see, for example, [this](https://arxiv.org/abs/1603.01354) paper if interested.