# Recognize named entities on Twitter with LSTMs

In this workshop, you will use a recurrent neural network to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing: it extracts entities such as persons, organizations, and locations from text. Here you will experiment with recognizing named entities in tweets.

For example, suppose we want to extract the names of persons and countries from a text. Then for the input text:

    Donald Trump is the president of the United States

a NER model needs to provide the following sequence of tags:

    B-PER I-PER   O O O O O   B-COUNTRY  I-COUNTRY

Here the *B-* and *I-* prefixes mark the beginning and the inside of an entity, while *O* means outside of any entity (no tag). Markup with this prefix scheme is called *BIO markup*. It makes it possible to distinguish consecutive entities of the same type.
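To make the role of the prefixes concrete, the following minimal sketch groups a BIO tag sequence into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the workshop code), and treating a stray *I-* tag as the start of a new entity is an assumption made for this illustration.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (entity_type, start, end) spans.

    `end` is exclusive. An I- tag that does not continue an entity of the
    same type is treated as the start of a new entity (a lenient choice).
    """
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition('-')
        continues = prefix == 'I' and entity_type == current_type
        if not continues:
            # Close the entity that was open, if any.
            if current_type is not None:
                spans.append((current_type, start, i))
            # Open a new entity on B- (or stray I-), otherwise reset.
            if prefix in ('B', 'I'):
                start, current_type = i, entity_type
            else:
                start, current_type = None, None
    # Close an entity that runs to the end of the sentence.
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


sentence_tags = ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'B-COUNTRY', 'I-COUNTRY']
print(bio_to_spans(sentence_tags))

# Two adjacent one-token persons stay separate thanks to the B- prefix.
print(bio_to_spans(['B-PER', 'B-PER']))
```

Without the *B-* prefix, the two adjacent persons in the second call could not be told apart from a single two-token person.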

The solution will be based on neural networks, specifically Bi-Directional Long Short-Term Memory networks (Bi-LSTMs).

### Libraries

For this task you will need the following libraries:
 - [PyTorch](https://pytorch.org/docs/stable/index.html): an open-source machine learning library.
 - [NumPy](http://www.numpy.org): a package for scientific computing.


Import libraries and download data

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

import sys
# Make the parent directory importable before loading the workshop helpers.
sys.path.append("..")

from common.evaluation import precision_recall_f1
import common.workshop as workshop

workshop.download_ner_generation()

**************************************************
train.txt
**************************************************
test.txt
**************************************************
validation.txt


### Setup execution device

Note: since this is a computationally heavy task, we need to use a GPU. Make sure that the cell below outputs 'cuda'.

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Execution device:',device)

Execution device: cuda


In [4]:
!ls ./ner/

test.txt  train.txt  validation.txt


In [5]:
DATA_DIR="ner"

### Read file 

Read the data from a file and replace user mentions and URLs with special tokens.

In [6]:
def read_data(file_path):
    tokens = []
    tags = []
    
    tweet_tokens = []
    tweet_tags = []
    with open(file_path, encoding='utf-8') as data_file:
        for line in data_file:
            line = line.strip()
            if not line:
                # A blank line marks the end of a tweet.
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()
                # Replace all user mentions with the <USR> token
                # and all URLs with the <URL> token.
                if token.startswith('@'):
                    token = '<USR>'
                elif token.startswith(('http://', 'https://')):
                    token = '<URL>'

                tweet_tokens.append(token)
                tweet_tags.append(tag)

    # Keep the last tweet when the file does not end with a blank line.
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags
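The token replacement can also be looked at in isolation. The sketch below applies the same two rules as `read_data` to a made-up tweet; the helper name `normalize_token` and the example tokens are hypothetical and only serve as an illustration.

```python
def normalize_token(token):
    """Map user mentions and links to placeholder tokens, as read_data does."""
    if token.startswith('@'):
        return '<USR>'
    if token.startswith(('http://', 'https://')):
        return '<URL>'
    return token


# A made-up tweet, already split into tokens.
tweet = ['RT', '@someuser', ':', 'see', 'https://example.com', 'now']
normalized = [normalize_token(t) for t in tweet]
print(normalized)
```

Collapsing all mentions and links into two placeholder tokens keeps the vocabulary small, since individual user names and URLs rarely repeat.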

And now we can load three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameters tuning;
 - *test* data for final evaluation of the model.

In [7]:

train_tokens, train_tags = read_data(DATA_DIR + '/train.txt')
validation_tokens, validation_tags = read_data(DATA_DIR + '/validation.txt')
test_tokens, test_tags = read_data(DATA_DIR + '/test.txt')


In [8]:
print (train_tokens[:2])

[['RT', '<USR>', ':', 'Online', 'ticket', 'sales', 'for', 'Ghostland', 'Observatory', 'extended', 'until', '6', 'PM', 'EST', 'due', 'to', 'high', 'demand', '.', 'Get', 'them', 'before', 'they', 'sell', 'out', '...'], ['Apple', 'MacBook', 'Pro', 'A1278', '13.3', '"', 'Laptop', '-', 'MD101LL/A', '(', 'June', ',', '2012', ')', '-', 'Full', 'read', 'by', 'eBay', '<URL>', '<URL>']]


Let's take a look at our data.

In [9]:

for i in range(3):
    for token, tag in zip(train_tokens[i], train_tags[i]):
        print('%s\t%s' % (token, tag))
    print()


RT	O
<USR>	O
:	O
Online	O
ticket	O
sales	O
for	O
Ghostland	B-musicartist
Observatory	I-musicartist
extended	O
until	O
6	O
PM	O
EST	O
due	O
to	O
high	O
demand	O
.	O
Get	O
them	O
before	O
they	O
sell	O
out	O
...	O

Apple	B-product
MacBook	I-product
Pro	I-product
A1278	I-product
13.3	I-product
"	I-product
Laptop	I-product
-	I-product
MD101LL/A	I-product
(	O
June	O
,	O
2012	O
)	O
-	O
Full	O
read	O
by	O
eBay	B-company
<URL>	O
<URL>	O

Happy	O
Birthday	O
<USR>	O
!	O
May	O
Allah	B-person
s.w.t	O
bless	O
you	O
with	O
goodness	O
and	O
happiness	O
.	O



### Prepare dictionaries

To train a neural network, we will use two mappings: 
- {token}$\to${token id}: addresses the row of the embedding matrix for the current token;
- {tag}$\to${tag id}: defines the one-hot ground-truth probability distribution vectors used to compute the loss at the output of the network.

The function *build_dict* below returns the {token or tag}$\to${index} mapping and its inverse. 

In [10]:
from collections import defaultdict

def build_dict(tokens_or_tags, special_tokens):
    """
        tokens_or_tags: a list of lists of tokens or tags
        special_tokens: some special tokens
    """
    # Create a dictionary with default value 0
    tok2idx = defaultdict(lambda: 0)
    idx2tok = []
    
    # Create mappings from tokens (or tags) to indices and vice versa.
    # Add special tokens (or tags) to the dictionaries.
    # The first special token must have index 0.
    
    # Mapping tok2idx should contain each token or tag only once. 
    # To do so, you should extract unique tokens/tags from the tokens_or_tags variable
    # and then index them (for example, you can add them into the list idx2tok
    # and for each token/tag save the index into tok2idx).
    
    idx = 0
    for token in special_tokens:
        idx2tok.append(token)
        tok2idx[token]=idx
        idx+=1
    
    for token_list in tokens_or_tags:
        for token in token_list:
            if token not in tok2idx:  # dict lookup is O(1); scanning the idx2tok list would be O(n)
                idx2tok.append(token)
                tok2idx[token]=idx
                idx+=1
    
    
    return tok2idx, idx2tok

In [11]:

special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']

# Create dictionaries 
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)


The following helper functions convert a sentence between tokens (or tags) and their ids. 

In [12]:
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]
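Because `build_dict` returns a `defaultdict` whose default value is 0, and `<UNK>` is the first special token, any token missing from the dictionary is mapped to the `<UNK>` index. The sketch below reproduces that behaviour on a tiny hand-made dictionary (the entries are illustrative, not taken from the dataset) and shows one side effect worth knowing: indexing a `defaultdict` with a missing key also inserts that key.

```python
from collections import defaultdict

# A tiny stand-in for token2idx, built the same way as in build_dict.
tok2idx = defaultdict(lambda: 0)
tok2idx.update({'<UNK>': 0, '<PAD>': 1, 'hello': 2})

size_before = len(tok2idx)
unk_id = tok2idx['never-seen']   # falls back to 0, the <UNK> index
size_after = len(tok2idx)        # the lookup above inserted the missing key

# dict.get returns a fallback without inserting anything.
safe_id = tok2idx.get('also-unseen', tok2idx['<UNK>'])

print(unk_id, size_before, size_after, safe_id, len(tok2idx))
```

This is why the vocabulary size should be taken from the list of tokens (`len(idx2token)` in this notebook) rather than from the size of the `defaultdict`.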

### Generate batches

Neural networks are usually trained on batches, meaning that each weight update is based on several sequences at once. The tricky part is that all sequences within a batch must have the same length, so we pad them with a special `<PAD>` token. It is also good practice to give the RNN the sequence lengths so that it can skip computations on the padding.
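One possible way to hand the sequence lengths to an LSTM in PyTorch is to pack the padded batch with `pack_padded_sequence` and unpack the result with `pad_packed_sequence`. The sketch below uses a toy random batch; all sizes are illustrative assumptions, and this is not necessarily how the workshop model is wired.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy batch: 3 sequences padded to length 5, embedding size 4 (illustrative values).
lengths = torch.tensor([5, 3, 2])
embedded = torch.randn(3, 5, 4)

lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True, bidirectional=True)

# enforce_sorted=False allows any batch order; the lengths tensor must live on the CPU.
packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)

# Unpack to a padded tensor again; total_length keeps the original padded width.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True, total_length=5)

print(output.shape)   # (batch, seq_len, 2 * hidden_size)
print(out_lengths)
```

After unpacking, the positions beyond each sequence length are filled with zeros, so the LSTM never processes the padding.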

In [13]:

def batches_generator(batch_size, tokens, tags,
                      shuffle=True, allow_smaller_last_batch=True):
    """Generates padded batches of tokens and tags."""
    
    n_samples = len(tokens)
    if shuffle:
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)

    n_batches = n_samples // batch_size
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1

    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        current_batch_size = batch_end - batch_start
        x_list = []
        y_list = []
        max_len_token = 0
        for idx in order[batch_start: batch_end]:
            x_list.append(words2idxs(tokens[idx]))
            y_list.append(tags2idxs(tags[idx]))
            max_len_token = max(max_len_token, len(tags[idx]))
            
        # Fill in the data into numpy nd-arrays filled with padding indices.
        x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['<PAD>']
        y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
        lengths = np.zeros(current_batch_size, dtype=np.int32)
        for n in range(current_batch_size):
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]
        yield x, y, lengths


In [16]:
# check the generator

batch_size= 10
x,y, _ = next(batches_generator(batch_size, train_tokens, train_tags))

print(x.shape,y.shape)

(10, 32) (10, 32)


## Build a recurrent neural network

This is the most important part of the assignment. Here we specify the network architecture from PyTorch building blocks. It is as fun and easy as building with Lego! We will create an LSTM network that produces a score for every tag for each token in a sentence, which softmax turns into a probability distribution over tags. To take into account both the left and right context of a token, we will use a bi-directional LSTM (Bi-LSTM). A dense layer on top performs the tag classification.  
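To get a feel for the sizes involved, here is a rough sketch that works out the output shape and the LSTM parameter count with plain arithmetic. The helper names `output_shape` and `lstm_param_count` are hypothetical, and the count assumes the usual PyTorch `nn.LSTM` layout: four gates, each with input weights, recurrent weights, and two bias vectors.

```python
def output_shape(batch_size, seq_length, n_tags):
    """Shape of the flattened tag scores: one row per token in the batch."""
    return (batch_size * seq_length, n_tags)


def lstm_param_count(input_size, hidden_size, num_layers=2, bidirectional=True):
    """Number of trainable parameters in a stacked (bi-directional) LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Four gates, each with input weights, recurrent weights and two biases.
        per_direction = 4 * hidden_size * (layer_input + hidden_size + 2)
        total += directions * per_direction
        # The next layer consumes the concatenated outputs of all directions.
        layer_input = directions * hidden_size
    return total


# Illustrative sizes: a batch of 4 sentences padded to 21 tokens, 21 tags,
# embeddings of size 30 and 40 hidden units per direction.
print(output_shape(batch_size=4, seq_length=21, n_tags=21))
print(lstm_param_count(input_size=30, hidden_size=40))
```

The second layer is larger than the first here because its input is the concatenation of the forward and backward outputs, which is also why the dense layer takes `2 * n_hidden` input features.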

In [17]:
## TODO: explain what architecture needs to be built, and provide documentation to corresponding blocks

import torch.nn as nn

class BiLstm(nn.Module):
    def __init__(self, vocab_size, embed_size, n_hidden, n_output):
        super(BiLstm, self).__init__()
        self.n_hidden = n_hidden
        self.n_output = n_output
        
        self.embed_layer = nn.Embedding(vocab_size, embed_size)
        
        self.lstm_layer = nn.LSTM(embed_size, n_hidden,
                                  num_layers = 2, batch_first = True, 
                                  bidirectional = True)
        
        self.linear_layer = nn.Linear(2*n_hidden, n_output)
        

    # input_tensor - shape (batch_size, seq_length)
    # seq_length and batch_size are accepted by the signature but are not used in the computation
    def forward(self, input_tensor, seq_length, batch_size):

        # e_tensor - (batch_size, seq_length, embed_size)
        e_tensor = self.embed_layer(input_tensor)
              
        # execute lstm layer
        lstm_out, _ = self.lstm_layer(e_tensor)

        
        # transform the output into a 2D matrix of shape (batch_size * seq_length, 2 * hidden_size);
        # the network is bidirectional, so the hidden dimension is doubled
        output_tensor = lstm_out.contiguous().view(-1, 2 * self.n_hidden)
        # execute linear layer
        output_tensor = self.linear_layer(output_tensor)
        
        return output_tensor
        


In [19]:

# Constructs pytorch tensor from numpy array
def construct_pytorch_tensor(numpy_tensor):
    tensor = torch.from_numpy(numpy_tensor).long()
    # Send tensor to the device
    tensor = tensor.to(device)
    return tensor


def construct_model(vocab_size, embed_size, hidden_size, output_size):
    bi_nn = BiLstm(vocab_size=vocab_size, embed_size=embed_size, 
                 n_hidden=hidden_size, n_output=output_size)
    bi_nn = bi_nn.to(device)
    return bi_nn


### Test the network

Let's test the created network. Play with the parameters to see how they affect the input and output tensors.

In [20]:
import numpy as np
import torch
import pdb

vocab_size = len(idx2token)
n_tags = len(idx2tag)
n_input = 100
n_hidden = 40
batch_size = 4
embed_size = 30

x,y, l = next(batches_generator(batch_size, train_tokens, train_tags))

# seq length
model_seq_length = x.shape[1]

input_tensor = construct_pytorch_tensor(x) # construct input tensor

target_tensor = construct_pytorch_tensor(y) # construct target tensor

bi_nn = construct_model(vocab_size, embed_size, n_hidden, n_tags) # init model

output_tensor = bi_nn(input_tensor, model_seq_length, batch_size) # execute forward
print(output_tensor.shape)




torch.Size([84, 21])


### Evaluation

The evaluation functions are defined below.

We use precision, recall, and the F1 score to measure performance.

Additional resources:

[precision/recall](https://en.wikipedia.org/wiki/Precision_and_recall)

[f1score](https://skymind.ai/wiki/accuracy-precision-recall-f1)
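As a rough illustration of these metrics for NER, the sketch below scores predictions at the entity level: a predicted entity counts as correct only if its type and both boundaries match the ground truth. This is an assumption about how such scores are commonly computed, not the source of `precision_recall_f1` from `common.evaluation`, which is not shown here; the helper names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    # A trailing 'O' sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, current = tag.partition('-')
        if prefix == 'I' and current == etype:
            continue  # still inside the same entity
        if etype is not None:
            spans.add((etype, start, i))
        start, etype = (i, current) if prefix in ('B', 'I') else (None, None)
    return spans


def span_precision_recall_f1(y_true, y_pred):
    """Entity-level precision, recall and F1 for two aligned tag sequences."""
    true_spans, pred_spans = extract_spans(y_true), extract_spans(y_pred)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


y_true = ['B-person', 'I-person', 'O', 'B-company', 'O']
y_pred = ['B-person', 'I-person', 'O', 'O', 'B-company']
print(span_precision_recall_f1(y_true, y_pred))
```

In this example the person is found exactly, while the company is predicted at the wrong position, so one of two predicted entities is correct and one of two true entities is recovered.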


In [65]:

def evaluate_on_data(model, tokens, tags):
    y_true_indx = []
    y_pred_indx = []
    batch_size = 32
    
    for i, (x, y, _) in enumerate(batches_generator(batch_size, tokens, tags)):
        input_tensor = torch.from_numpy(x).long().to(device)
        target_tensor = torch.from_numpy(y).long().to(device)
        
        batch_size = x.shape[0]
        seq_length = x.shape[1]

        output_tensor = model(input_tensor, seq_length, batch_size)
        # softmax keeps the order of the scores, so the argmax below picks the most likely tag
        output_tensor = F.softmax(output_tensor, dim = 1)
        _, output_inds = output_tensor.max(dim = 1)
        
        output_inds = output_inds.long()
        
        # flatten targets and predictions to one tag index per token (padding included)
        y_true_indx_batch = list(target_tensor.cpu().numpy().reshape(-1))
        y_pred_indx_batch = list(output_inds.cpu().detach().numpy().reshape(-1))
        y_true_indx = y_true_indx + y_true_indx_batch
        y_pred_indx = y_pred_indx + y_pred_indx_batch
        
    y_true = [idx2tag[idx] for idx in y_true_indx]
    y_pred = [idx2tag[idx] for idx in y_pred_indx]
    
    precision_recall_f1(y_true, y_pred, print_results=True, short_report=True)
    
def evaluate(model):
    # For evaluation we do not need to update gradients and compute derivatives
    with torch.no_grad():
        print('Evaluation on train set')
        evaluate_on_data(model, train_tokens, train_tags)
        print('Evaluation on validation set')
        evaluate_on_data(model, validation_tokens, validation_tags)
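Note that `evaluate_on_data` discards the sequence lengths returned by `batches_generator`, so padded positions are scored together with real tokens. With shuffled batches the amount of padding varies from run to run, which is probably why the "processed ... tokens" count in the reports changes between epochs. A minimal sketch of removing the padding before scoring, using a hypothetical helper `strip_padding`, could look like this:

```python
def strip_padding(batch_rows, lengths):
    """Flatten a padded batch, keeping only the first `length` items of each row."""
    flat = []
    for row, length in zip(batch_rows, lengths):
        flat.extend(row[:length])
    return flat


# Two padded rows of tag ids with true lengths 2 and 1 (illustrative values).
y_padded = [[3, 4, 0, 0],
            [5, 0, 0, 0]]
lengths = [2, 1]
print(strip_padding(y_padded, lengths))
```

The same helper would have to be applied to both the true and the predicted tag ids, after reshaping the predictions back to `(batch_size, seq_length)`.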




## Train loop

The training loop for a single epoch is defined below.

In [21]:

def train(model, optimizer, loss_fn, batch_size = 32):
    for batch_num, (input_data, target_data, _) in enumerate(batches_generator(batch_size, train_tokens, train_tags)):
        
        # The last batch can be smaller than others
        train_batch_size = input_data.shape[0]
        # Since each batch has different sequence length we need to update the variable for each batch
        train_seq_length = input_data.shape[1]

        input_tensor = construct_pytorch_tensor(input_data)
        target_tensor = construct_pytorch_tensor(target_data)
        
        # zero out the gradients
        optimizer.zero_grad()
        
        # run the forward pass to get one row of tag scores per token
        output_tensor = model(input_tensor, train_seq_length, train_batch_size)
    
        # the loss is computed over all tokens of the batch, so flatten the targets
        # to shape (batch_size * seq_length,) to match the model output
        target_tensor = target_tensor.view(-1)
        
        # calculate the loss with the loss function passed in as an argument
        loss = loss_fn(output_tensor, target_tensor)

        # calculate the gradients
        loss.backward()
        
        # update the parameters of the model
        optimizer.step()
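In this loop the padded positions also contribute to the loss, because `batches_generator` fills the padded targets with the `O` tag. One possible alternative, sketched below under the assumption that padded targets are marked with a dedicated value instead (here the hypothetical constant `PAD_TAG`), is to let `nn.CrossEntropyLoss` skip them through its `ignore_index` argument. The tensors are random toy data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_TAG = -100   # hypothetical marker for padded target positions
n_tags = 5

# Toy scores of shape (batch_size * seq_length, n_tags) and flattened targets.
logits = torch.randn(6, n_tags)
targets = torch.tensor([1, 0, 3, PAD_TAG, 2, PAD_TAG])

# Positions whose target equals ignore_index do not contribute to the loss.
masked_loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TAG)
masked_loss = masked_loss_fn(logits, targets)

# Reference: average the loss over the real positions only.
real = targets != PAD_TAG
reference = nn.CrossEntropyLoss()(logits[real], targets[real])

print(masked_loss.item(), reference.item())
```

Both values agree, which shows that the padded positions are excluded from the average rather than counted as correct `O` predictions.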


## Main train loop

In [67]:
n_epoch = 10

hidden_size = 256
batch_size = 32
embed_size = 256
n_tags = len(idx2tag)
vocab_size = len(idx2token)


model = construct_model(vocab_size, embed_size, hidden_size, n_tags) # init model
optimizer = torch.optim.Adam(model.parameters(), lr=0.005) # Adam optimizer
criterion = nn.CrossEntropyLoss() # Cross entropy loss

for epoch in range(n_epoch):
    print('Starting epoch: ', epoch)
    train(model, optimizer, criterion, batch_size)

    evaluate(model)


Starting epoch:  0
Evaluation on train set
processed 179408 tokens with 4489 phrases; found: 2302 phrases; correct: 716.

precision:  31.10%; recall:  15.95%; F1:  21.09

Evaluation on validation set
processed 22020 tokens with 537 phrases; found: 211 phrases; correct: 61.

precision:  28.91%; recall:  11.36%; F1:  16.31

Starting epoch:  1
Evaluation on train set
processed 180340 tokens with 4489 phrases; found: 4149 phrases; correct: 2243.

precision:  54.06%; recall:  49.97%; F1:  51.93

Evaluation on validation set
processed 22308 tokens with 537 phrases; found: 345 phrases; correct: 145.

precision:  42.03%; recall:  27.00%; F1:  32.88

Starting epoch:  2
Evaluation on train set
processed 181188 tokens with 4489 phrases; found: 4493 phrases; correct: 3298.

precision:  73.40%; recall:  73.47%; F1:  73.44

Evaluation on validation set
processed 22444 tokens with 537 phrases; found: 436 phrases; correct: 167.

precision:  38.30%; recall:  31.10%; F1:  34.33

Starting epoch:  3
Evalu

### Conclusions