# Recognize named entities on news data with CNN

In this tutorial, you will use a convolutional neural network to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing: it extracts entities such as persons, organizations, and locations from text. In this task you will experiment with recognizing named entities in news texts from the CoNLL-2003 dataset.

For example, we want to extract person and organization names from the text. Then for the input text:

    Ian Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Here the *B-* and *I-* prefixes mark the beginning and the inside of an entity, while *O* stands for "outside", i.e. no tag. Markup with this prefix scheme is called **BIO markup**. It makes it possible to distinguish consecutive entities of the same type.
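
To see why the *B-* prefix matters, the sketch below decodes a BIO tag sequence into entity spans. The helper name `bio_to_spans` is hypothetical and not part of any library; it assumes reasonably well-formed tags and treats a stray *I-* tag as the start of a new entity.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans according to BIO tags."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # A span continues only on an I- tag whose type matches the open span
        if tag.startswith('I-') and current_type == tag[2:]:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one
        if current_type is not None:
            spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
        # B- (or a stray I-) opens a new span; O opens nothing
        if tag != 'O':
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


tokens = 'Ian Goodfellow works for Google Brain'.split()
tags = ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG']
print(bio_to_spans(tokens, tags))
```

With this decoding, two adjacent tokens tagged `B-PER`, `B-PER` yield two separate persons, whereas without the prefixes they would be indistinguishable from one two-word name.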

The solution will be based on neural networks, in particular on Convolutional Neural Networks.

### Data

The following cell downloads all data required for this assignment into the folder `data/`. The download utility from the library is used to download and extract the archive.

In [None]:
import deeppavlov
from deeppavlov.core.data.utils import download_decompress
download_decompress('http://lnsigo.mipt.ru/export/deeppavlov_data/conll2003_v2.tar.gz', 'data/')

### Load the CoNLL-2003 Named Entity Recognition corpus

We will work with a corpus of news texts annotated with NE tags. A typical file with NER data contains lines with a token (a word or punctuation symbol) and its tag, separated by whitespace. In many cases additional information such as POS tags is included. Different documents are separated by lines **starting** with the **-DOCSTART-** token. Different sentences are separated by an empty line. Example:

    -DOCSTART- -X- -X- O

    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
    call NN I-NP O
    to TO B-VP O
    boycott VB I-VP O
    British JJ B-NP B-MISC
    lamb NN I-NP O
    . . O O

    Peter NNP B-NP B-PER
    Blackburn NNP I-NP I-PER

In this tutorial we will focus only on tokens and tags (the first and last elements of each line) and drop the POS and chunk information located in between.
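
The file layout described above can be parsed in a few lines of plain Python. This is only an illustrative sketch: the function name `parse_conll` is hypothetical, and it is not the implementation used by the library's dataset reader.

```python
def parse_conll(lines):
    """Turn CoNLL-style lines into a list of (tokens, tags) samples."""
    samples, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # A blank line or a document marker ends the current sentence
            if tokens:
                samples.append((tokens, tags))
                tokens, tags = [], []
            continue
        fields = line.split()
        tokens.append(fields[0])   # first column: the token
        tags.append(fields[-1])    # last column: the NER tag
    if tokens:
        samples.append((tokens, tags))
    return samples


example = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

for sample in parse_conll(example):
    print(sample)
```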

We start with the *Conll2003DatasetReader* class, which provides functionality for reading the dataset. It returns a dictionary with the fields *train*, *test*, and *valid*. Each field stores a list of samples. Each sample is a tuple of tokens and tags, both of which are lists. The following example depicts the structure returned by the *read* method:

    {'train': [(['Mr.', 'Dwag', 'are', 'derping', 'around'], ['B-PER', 'I-PER', 'O', 'O', 'O']), ....],
     'valid': [...],
     'test': [...]}

There are three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameter tuning;
 - *test* data for final evaluation of the model.
 

Each of these parts is stored in a separate txt file.

We will use [Conll2003DatasetReader](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/dataset_readers/conll2003_reader.py) from the library to read the data from text files to the format described above.

In [None]:
from deeppavlov.dataset_readers.conll2003_reader import Conll2003DatasetReader
dataset = Conll2003DatasetReader().read('data/')

You should always understand what kind of data you are dealing with. For this purpose, you can print the data by running the following cell:

In [None]:
for sample in dataset['train'][:4]:
    for token, tag in zip(*sample):
        print('%s\t%s' % (token, tag))
    print()

### Prepare dictionaries

To train a neural network, we will use two mappings: 
- {token}$\to${token id}: the index of the row in the embedding matrix for the current token;
- {tag}$\to${tag id}: the class index used to build one-hot ground-truth probability distribution vectors for computing the loss at the output of the network.

Token indices will be used to find the corresponding rows in the embedding matrix. Tag indices will be converted to one-hot vectors when the loss is computed.

The [SimpleVocabulary](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/core/data/simple_vocab.py) implemented in the library will be used to perform those mappings.

In [None]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

Now we need to build dictionaries for tokens and tags. Vocabularies sometimes contain special tokens, for instance an unknown-word token that is used whenever we encounter an out-of-vocabulary word. In our case the only special token will be `<UNK>` for out-of-vocabulary words.

In [None]:
special_tokens = ['<UNK>']

token_vocab = SimpleVocabulary(special_tokens, save_path='model/token.dict')
tag_vocab = SimpleVocabulary(save_path='model/tag.dict')
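
To make the role of `<UNK>` concrete, here is a plain-Python sketch of a token-to-index mapping. The class `TinyVocab` and its methods are hypothetical stand-ins written for illustration; they are not the `SimpleVocabulary` API and may order tokens differently.

```python
from collections import Counter


class TinyVocab:
    """Hypothetical stand-in for a vocabulary; not the SimpleVocabulary API."""

    def __init__(self, special_tokens=('<UNK>',)):
        # Special tokens get the first indices, so <UNK> has id 0 here
        self.idx2tok = list(special_tokens)
        self.tok2idx = {tok: i for i, tok in enumerate(self.idx2tok)}

    def fit(self, sentences):
        # Count tokens and assign ids by decreasing frequency
        counts = Counter(tok for sent in sentences for tok in sent)
        for tok, _ in counts.most_common():
            if tok not in self.tok2idx:
                self.tok2idx[tok] = len(self.idx2tok)
                self.idx2tok.append(tok)

    def encode(self, sentences):
        # Unseen tokens fall back to the id of <UNK>
        unk = self.tok2idx['<UNK>']
        return [[self.tok2idx.get(tok, unk) for tok in sent] for sent in sentences]

    def decode(self, batch):
        return [[self.idx2tok[i] for i in sent] for sent in batch]


vocab = TinyVocab()
vocab.fit([['EU', 'rejects', 'German', 'call'], ['German', 'lamb']])
ids = vocab.encode([['German', 'vodka']])
print(ids, vocab.decode(ids))
```

The word `vodka` was never seen during fitting, so it is encoded with the `<UNK>` index and decoded back as `<UNK>`.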

Let's fit the vocabularies on the train part of the data.

In [None]:
all_tokens_by_sentences = [tokens for tokens, tags in dataset['train']]
all_tags_by_sentences = [tags for tokens, tags in dataset['train']]

token_vocab.fit(all_tokens_by_sentences)
tag_vocab.fit(all_tags_by_sentences)


Try to get the indices. Keep in mind that we are working with batches of the following structure:
    
    [['utt0_tok0', 'utt0_tok1', ...], ['utt1_tok0', 'utt1_tok1', ...], ...]

In [None]:
token_vocab([['How', 'to', 'do', 'a', 'barrel', 'roll', '?']])

In [None]:
tag_vocab([['O', 'O', 'O'], ['B-ORG', 'I-ORG']])

Now we will try converting from indices to tokens.

In [None]:
import numpy as np
token_vocab([np.random.randint(0, 512, size=10)])

### Dataset Iterator

Neural networks are usually trained on batches of examples, which means that every weight update is based on several sequences. The tricky part is that all sequences within a batch need to have the same length, so we will pad them with a special `<UNK>` token. Likewise, token tags must also be padded. It is also good practice to provide recurrent networks with sequence lengths, so that they can skip computations for the padded parts. To save time, we will use the ready-made *DataLearningIterator* class from the library to generate batches.

An important concept in batch generation is shuffling, i.e. taking samples from the dataset in random order. It is important to train on shuffled data because a large number of consecutive samples of the same class may degrade the performance of the model.
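
The two ideas above, shuffling and padding to a common length, can be sketched in plain Python. The function `toy_batches` is hypothetical and unrelated to the library's iterator; it assumes sequences are already lists of indices and that `0` is an acceptable padding value.

```python
import random


def toy_batches(samples, batch_size, pad_value=0, shuffle=True, seed=None):
    """Yield (padded_batch, lengths) pairs from a list of index sequences."""
    order = list(range(len(samples)))
    if shuffle:
        # Visit the samples in random order
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = [samples[i] for i in order[start:start + batch_size]]
        max_len = max(len(seq) for seq in batch)
        # Right-pad every sequence to the longest one in this batch
        padded = [seq + [pad_value] * (max_len - len(seq)) for seq in batch]
        lengths = [len(seq) for seq in batch]
        yield padded, lengths


samples = [[5, 3], [7, 1, 4, 2], [9], [6, 6, 6]]
for padded, lengths in toy_batches(samples, batch_size=2, shuffle=False):
    print(padded, lengths)
```

Note that padding is done per batch, so different batches may end up with different lengths.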

In [None]:
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

Create the dataset iterator for the loaded dataset:

In [None]:
data_iterator = DataLearningIterator(dataset)

Try it out:

In [None]:
next(data_iterator.gen_batches(2, shuffle=True))

### Masking

One last thing about generating training data: we need to produce a binary mask that is one where tokens are present and zero elsewhere. This mask will stop backpropagation through the paddings. An example of such a mask:

    [[1, 1, 0, 0, 0],
     [1, 1, 1, 1, 1]]

For the sentences in the batch:

     [['The', 'roof'],
      ['This', 'is', 'my', 'domain', '!']]

The Mask preprocessing component from the library will be used.
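
For intuition, the same mask can be computed in plain Python. The function `make_mask` below is a hypothetical illustration and not the library's `Mask` component, which may differ in its return type.

```python
def make_mask(tokens_batch):
    """Return a list-of-lists mask: 1.0 for real tokens, 0.0 for padding."""
    max_len = max(len(tokens) for tokens in tokens_batch)
    return [[1.0] * len(tokens) + [0.0] * (max_len - len(tokens))
            for tokens in tokens_batch]


batch = [['The', 'roof'],
         ['This', 'is', 'my', 'domain', '!']]
for row in make_mask(batch):
    print(row)
```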

In [None]:
from deeppavlov.models.preprocessors.mask import Mask
get_mask = Mask()

Try it out:

In [None]:
get_mask([['Try', 'to', 'get', 'the', 'mask'], ['Check', 'paddings']])

## Build a Convolutional Neural Network

This is the most important part of the assignment. Here we will specify the network architecture based on `TensorFlow` building blocks. It's as fun and easy as building with Lego! We will create a Convolutional Neural Network (CNN) that produces a probability distribution over tags for each token in a sentence. A CNN takes into account both the left and right contexts of a token. A dense layer will be used on top to perform tag classification.

In [None]:
import tensorflow as tf
import numpy as np

np.random.seed(42)
tf.set_random_seed(42)

An essential part of almost every network in the NLP domain is word embeddings. We pass the text to the network as a series of tokens. Each token is represented by its index. For every token (index) we have a vector. Together the vectors form an embedding matrix. This matrix can either be pretrained using a common algorithm like Skip-Gram or CBOW, or be initialized with random values and trained along with the other parameters of the network. In this tutorial we will follow the second alternative.

We need to build a function that takes a tensor of token indices with shape [batch_size, num_tokens] and, for each index, retrieves the corresponding vector from the embedding matrix. That results in a new tensor with shape [batch_size, num_tokens, emb_dim].
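
The lookup itself is just row selection. The following sketch uses nested Python lists instead of tensors to show how the shape changes; the sizes are made up for illustration, and the Gaussian initialization with standard deviation 1 / sqrt(emb_dim) is one way to obtain a variance of 1 / emb_dim.

```python
import math
import random

random.seed(0)

vocabulary_size, emb_dim = 5, 4

# Gaussian initialization with variance 1 / emb_dim, i.e. std = 1 / sqrt(emb_dim)
std = 1.0 / math.sqrt(emb_dim)
emb_mat = [[random.gauss(0.0, std) for _ in range(emb_dim)]
           for _ in range(vocabulary_size)]

# A batch with one sentence of three token indices: shape [1, 3]
indices = [[0, 1, 2]]

# Lookup replaces each index by its row: shape [1, 3, emb_dim]
emb = [[emb_mat[i] for i in sentence] for sentence in indices]

print(len(emb), len(emb[0]), len(emb[0][0]))
```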

In [None]:
def get_embeddings(indices, vocabulary_size, emb_dim):
    # Initialize the random gaussian matrix with dimensions [vocabulary_size, embedding_dimension]
    # The **VARIANCE** of the random samples must be 1 / embedding_dimension
    
    # YOUR CODE HERE
    
    emb_mat = tf.Variable(emb_mat, trainable=True, dtype=tf.float32)
    emb = tf.nn.embedding_lookup(emb_mat, indices)
    return emb

Check whether it works:

In [None]:
indices = [[0, 1, 2]] # batch of indices of tokens
vocab_size = 5
emb_dim = 100

emb = get_embeddings(indices, vocab_size, emb_dim)
emb_shape = emb.get_shape().as_list()
assert emb_shape[0] == 1
assert emb_shape[1] == 3
assert emb_shape[2] == emb_dim
print('Embeddings are ready to deploy')

The body of the network is the convolutional layers. The basic idea behind convolutions is to apply the same dense layer to every n consecutive samples (tokens in our case). A simplified case is depicted below.

<img src="conv.png" width="400">

Here the number of input and output features equals 1.

Let's try it on a toy example:

In [None]:
# Create a tensor with shape [batch_size, number_of_tokens, number_of_features]
x = tf.random_normal(shape=[2, 10, 100])
y = tf.layers.conv1d(x, filters=200, kernel_size=8)
print(y)

As you can see, due to the absence of zero padding (zeros at the beginning and at the end of the input), the size of the resulting tensor along the token dimension is reduced. To use padding and preserve the size along the convolution dimension, pass the padding='same' parameter to the function.

In [None]:
y_with_padding = tf.layers.conv1d(x, filters=200, kernel_size=8, padding='same')
print(y_with_padding)
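
The effect of padding on the output length can be reproduced without TensorFlow. The sketch below implements a single-feature 1-D convolution over a plain list; the function name is hypothetical, and the way zeros are split between the two ends for `'same'` padding is an assumption that follows a common convention.

```python
def conv1d_single_feature(x, kernel, padding='valid'):
    """1-D convolution (cross-correlation) over a list of numbers, stride 1."""
    k = len(kernel)
    if padding == 'same':
        # Split k - 1 zeros between both ends; the extra one goes to the right
        left = (k - 1) // 2
        right = k - 1 - left
        x = [0.0] * left + list(x) + [0.0] * right
    return [sum(w * v for w, v in zip(kernel, x[i:i + k]))
            for i in range(len(x) - k + 1)]


x = [float(i) for i in range(10)]   # 10 tokens, one feature each
kernel = [1.0] * 8                  # kernel_size=8, all weights equal to 1

print(len(conv1d_single_feature(x, kernel)))                  # 10 - 8 + 1 = 3
print(len(conv1d_single_feature(x, kernel, padding='same')))  # 10
```

Without padding, an input of length n and a kernel of width k give n - k + 1 outputs; with `'same'` padding the output length equals the input length.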

Now stack a number of layers with the dimensionality given in n_hidden_list (a list with the number of hidden units in each layer).

In [None]:
def conv_net(units, n_hidden_list, cnn_filter_width, activation=tf.nn.relu):
    # Use activation(units) to apply activation to units
    
    ######################################
    ########## YOUR CODE HERE ############
    ######################################

    return units
    

Check the convnet:

In [None]:
n_hidden_list = [10, 20]
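# Tensor with dimensions [batch_size, number_of_tokens, number_of_features].
# dtype must be passed by keyword: the second positional argument of tf.Variable is `trainable`.
x = tf.Variable(np.random.randn(2, 10, 32), dtype=tf.float32)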
cnn_filter_width = 3
y = conv_net(x, n_hidden_list, cnn_filter_width)
output_shape = y.get_shape().as_list()
assert output_shape[0] == 2
assert output_shape[1] == 10
assert output_shape[2] == n_hidden_list[-1]
print('ConvNet is ready to deploy')

A common loss for the classification task is cross-entropy. Why classification? Because for each token the network must decide which tag to predict. The cross-entropy has the following form:

$$ H(P, Q) = -\mathbb{E}_{x \sim P} \log Q(x) $$

It measures the dissimilarity between the ground-truth distribution over the classes and the predicted distribution. In most cases the ground-truth distribution is one-hot. Luckily, this loss is already [implemented](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits_v2) in TensorFlow.
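
When the ground-truth distribution is one-hot, the expectation collapses to minus the log-probability that the model assigns to the true class. A minimal plain-Python sketch of this special case (the function names are illustrative, not a library API):

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(true_index, logits):
    """H(P, Q) for a one-hot P: minus the log-probability of the true class."""
    return -math.log(softmax(logits)[true_index])


logits = [2.0, 0.5, -1.0]
print(cross_entropy(0, logits))  # small: the true class has the largest logit
print(cross_entropy(2, logits))  # large: the true class has the smallest logit
```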

In [None]:
# The logits
l = tf.random_normal([1, 4, 3]) # shape [batch_size, number_of_tokens, number of classes]
indices = tf.placeholder(tf.int32, [1, 4])

# Make one-hot distribution from indices for 3 types of tag
p = tf.one_hot(indices, depth=3)
loss_tensor = tf.nn.softmax_cross_entropy_with_logits_v2(labels=p, logits=l)
print(loss_tensor)

All sentences in a batch must have the same length, so we pad each sentence to the maximal length. As a result there are paddings at the end, and pushing the network to predict those paddings usually results in deteriorated quality. We therefore multiply the loss tensor by the binary mask to prevent gradient flow from the paddings.

In [None]:
mask = tf.placeholder(tf.float32, shape=[1, 4])
loss_tensor *= mask

The last step to do is to compute the mean value of the loss tensor:

In [None]:
loss = tf.reduce_mean(loss_tensor)
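
One detail worth knowing: a plain mean over the masked tensor divides by the total number of positions, padded ones included. An alternative normalization, shown here only as an option and not as what this tutorial requires, divides by the number of real tokens. The numbers below are made up for illustration.

```python
# Per-token losses for a batch of 2 sentences padded to length 4
loss_per_token = [[0.5, 1.5, 9.0, 9.0],   # last two positions are padding
                  [1.0, 2.0, 3.0, 4.0]]
mask = [[1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0]]

# Zero out the losses at padded positions
masked = [[l * m for l, m in zip(l_row, m_row)]
          for l_row, m_row in zip(loss_per_token, mask)]

total = sum(sum(row) for row in masked)
n_positions = sum(len(row) for row in masked)
n_real_tokens = sum(sum(row) for row in mask)

mean_over_positions = total / n_positions   # what a plain mean computes
mean_over_tokens = total / n_real_tokens    # alternative normalization

print(mean_over_positions, mean_over_tokens)
```

Both variants block gradients from the paddings; they differ only in how the loss is scaled for batches with a lot of padding.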

Now define your own function that returns a scalar masked cross-entropy loss:

In [None]:
def masked_cross_entropy(logits, label_indices, number_of_tags, mask):
    
    ######################################
    ########## YOUR CODE HERE ############
    ######################################
    
    return loss

Check that all works fine:

In [None]:
logits = tf.placeholder(tf.float32, shape=[2, 3, 10])
label_indices = tf.placeholder(tf.int32, shape=[2, 3])
number_of_tags = 10
mask = tf.placeholder(tf.float32, shape=[2, 3])

loss = masked_cross_entropy(logits, label_indices, number_of_tags, mask)

assert len(loss.get_shape().as_list()) == 0

Put everything into a class:

In [None]:
import numpy as np
import tensorflow as tf

class NerNetwork:
    def __init__(self,
                 n_tokens,
                 n_tags,
                 token_emb_dim=100,
                 n_hidden_list=(128,),
                 cnn_filter_width=7,
                 use_batch_norm=False,
                 embeddings_dropout=False,
                 top_dropout=False,
                 **kwargs):
        
        # ================ Building inputs =================
        
        self.learning_rate_ph = tf.placeholder(tf.float32, [])
        self.dropout_keep_ph = tf.placeholder(tf.float32, [])
        self.token_ph = tf.placeholder(tf.int32, [None, None], name='token_ind_ph')
        self.mask_ph = tf.placeholder(tf.float32, [None, None], name='Mask_ph')
        self.y_ph = tf.placeholder(tf.int32, [None, None], name='y_ph')
        
        # ================== Building the network ==================
        
        # Now embed the indices of tokens using the get_embeddings function
        # with embedding dimension token_emb_dim
        
        ######################################
        ########## YOUR CODE HERE ############
        emb = 
        ######################################

        emb = tf.nn.dropout(emb, self.dropout_keep_ph, (tf.shape(emb)[0], 1, tf.shape(emb)[2]))
        
        # Build a multilayer CNN on top of the embeddings.
        # The number of units in each layer must match the
        # corresponding number from n_hidden_list.
        # Use ReLU activation.
        ######################################
        ########## YOUR CODE HERE ############
        units = 
        ######################################
        units = tf.nn.dropout(units, self.dropout_keep_ph, (tf.shape(units)[0], 1, tf.shape(units)[2]))
        logits = tf.layers.dense(units, n_tags, activation=None)
        self.predictions = tf.argmax(logits, 2)
        
        # ================= Loss and train ops =================
        # Use cross-entropy loss. 
        ######################################
        ########## YOUR CODE HERE ############
        self.loss = 
        ######################################

        # Create a training operation to update the network parameters.
        # We propose using the Adam optimizer, as it works well in
        # most cases. Check tf.train to find an implementation.
        # Store the train operation in the attribute self.train_op
        
        ######################################
        ########## YOUR CODE HERE ############
        self.train_op = 
        ######################################

        # ================= Initialize the session =================
        
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def __call__(self, tok_batch, mask_batch):
        feed_dict = {self.token_ph: tok_batch,
                     self.mask_ph: mask_batch,
                     self.dropout_keep_ph: 1.0}
        return self.sess.run(self.predictions, feed_dict)

    def train_on_batch(self, tok_batch, tag_batch, mask_batch, dropout_keep_prob, learning_rate):
        feed_dict = {self.token_ph: tok_batch,
                     self.y_ph: tag_batch,
                     self.mask_ph: mask_batch,
                     self.dropout_keep_ph: dropout_keep_prob,
                     self.learning_rate_ph: learning_rate}
        self.sess.run(self.train_op, feed_dict)


Now create an instance of the NerNetwork class:

In [None]:
nernet = NerNetwork(len(token_vocab),
                    len(tag_vocab),
                    n_hidden_list=[100, 100])

We often want to check the score on the validation part of the dataset every epoch. In most NER tasks the classes are imbalanced, so accuracy is not the best measure of performance. If 95% of the tags are 'O', then a silly classifier that always predicts 'O' gets 95% accuracy. To tackle this issue the F1-score is used. The $F_1$-score can be defined as:

$$ F_1 =  \frac{2 P R}{P + R}$$ 

where P is precision and R is recall.
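
The formula is easy to check by hand. The sketch below computes precision, recall, and $F_1$ from raw counts of true positives, false positives, and false negatives. The function name and the example counts are hypothetical; the library's evaluation function works on tag sequences instead.

```python
def precision_recall_f1_from_counts(tp, fp, fn):
    """Compute precision, recall and F1 from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# A classifier that always predicts 'O' finds none of the 5 entities
print(precision_recall_f1_from_counts(tp=0, fp=0, fn=5))
# A classifier that finds 4 of 5 entities and raises 1 false alarm
print(precision_recall_f1_from_counts(tp=4, fp=1, fn=1))
```

The always-'O' classifier gets an $F_1$ of zero despite its high accuracy, which is exactly why $F_1$ is preferred here.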

Let's write the evaluation function. We need to get all predictions for the given part of the dataset and compute $F_1$.

In [None]:
from deeppavlov.models.ner.evaluation import precision_recall_f1
# The function precision_recall_f1 takes two lists: y_true and y_predicted.
# The tag sequences of all sentences should be merged into one big list.
from deeppavlov.core.data.utils import zero_pad
# zero_pad takes a batch of lists of token indices, pads it with zeros to the
# maximal length and converts it to a numpy matrix
from itertools import chain


def eval_valid(network, batch_generator):
    total_true = []
    total_pred = []
    for x, y_true in batch_generator:

        # Prepare token indices from tokens batch
        x_inds = # YOUR CODE HERE

        # Pad the indices batch with zeros
        x_batch = # YOUR CODE HERE

        # Get the mask using get_mask
        mask = # YOUR CODE HERE
        
        # We call the instance of the NerNetwork because we have defined __call__ method
        y_inds = network(x_batch, mask)

        # For every sentence in the batch extract all tags up to paddings (use length of x element)
        y_inds = # YOUR CODE HERE
        y_pred = tag_vocab(y_inds)

        # Add fresh predictions 
        total_true.extend(chain(*y_true))
        total_pred.extend(chain(*y_pred))
    res = precision_recall_f1(total_true, total_pred, print_results=True)

Now let's check it:

In [None]:
eval_valid(nernet, data_iterator.gen_batches(16, data_type='valid'))

Set hyperparameters for the training procedure. You might want to start with the following recommended values:
- *batch_size*: 32;
- *n_epochs*: 10;
- starting value of *learning_rate*: 0.001;
- *learning_rate_decay*: the square root of 2;
- *dropout_keep_probability*: 0.7 for training (typical values for dropout probability range from 0.3 to 0.9).

A very efficient technique for learning rate management is dropping the learning rate after convergence. It is common to divide the learning rate by 2, 3, or 10.
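
As a simpler variant of this idea, the rate can be divided by a fixed factor after every epoch instead of waiting for convergence. This is an assumption made for illustration, not a schedule prescribed by the tutorial, and the variable names `lr` and `decay` are hypothetical.

```python
import math

lr = 0.001             # starting learning rate
decay = math.sqrt(2)   # divide the rate by this factor after each epoch
epochs = 4

schedule = []
for epoch in range(epochs):
    schedule.append(lr)
    # ... train for one epoch with the current lr ...
    lr /= decay

print(schedule)
```

With a decay of the square root of 2, the learning rate halves every two epochs.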

In [None]:
batch_size = # YOUR HYPERPARAMETER HERE
n_epochs = # YOUR HYPERPARAMETER HERE
learning_rate = # YOUR HYPERPARAMETER HERE
dropout_keep_prob = # YOUR HYPERPARAMETER HERE

Now we iterate through the dataset batch by batch and pass the data to the train op:

In [None]:
for epoch in range(n_epochs):
    for x, y in data_iterator.gen_batches(batch_size, 'train'):
        # Convert tokens to indices via Vocab
        x_inds = # YOUR CODE 
        # Convert tags to indices via Vocab
        y_inds = # YOUR CODE 
        
        # Pad every sample with zeros to the maximal length
        x_batch = zero_pad(x_inds)
        y_batch = zero_pad(y_inds)

        mask = get_mask(x)
        nernet.train_on_batch(x_batch, y_batch, mask, dropout_keep_prob, learning_rate)
    print('Evaluating the model on valid part of the dataset')
    eval_valid(nernet, data_iterator.gen_batches(batch_size, 'valid'))


Now evaluate the model on the test part:

In [None]:
eval_valid(nernet, data_iterator.gen_batches(batch_size, 'test'))

Let's try running the model on our own sentence:

In [None]:
sentence = 'Petr stole my vodka'
x = [sentence.split()]

x_inds = token_vocab(x)
x_batch = zero_pad(x_inds)
mask = get_mask(x)
y_inds = nernet(x_batch, mask)
print(x[0])
print(tag_vocab(y_inds)[0])