### Neural Named Entity Recognition

This notebook shows how to train a neural network to solve the Named Entity Recognition (NER) task.
In most cases, the NER task can be formulated as follows:

_Given a sequence of tokens (words and possibly punctuation symbols), assign a tag from a predefined set of tags to each token in the sequence._

The tags correspond to entity types. Common entity types in NER are:
- persons
- locations
- organizations
- expressions of time
- quantities
- monetary values 

Furthermore, to distinguish adjacent entities with the same tag, the BIO tagging scheme is used. "B" stands for the beginning of an entity,
"I" stands for the continuation of an entity, and "O" means the absence of an entity. Example with dropped punctuation:

    Bernhard        B-PER
    Riemann         I-PER
    Carl            B-PER
    Friedrich       I-PER
    Gauss           I-PER
    and             O
    Leonhard        B-PER
    Euler           I-PER

In the example above, PER is the person tag, and "B-" and "I-" are prefixes marking the beginnings and continuations of entities. Without such prefixes, it is impossible to separate Bernhard Riemann from Carl Friedrich Gauss.
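To make the role of the prefixes concrete, here is a minimal sketch that groups tokens into entities from their BIO tags. The helper name `bio_to_spans` is hypothetical and is not part of the `ner` package; it only illustrates the scheme described above.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (entity_type, entity_text) pairs using BIO tags.

    Hypothetical helper for illustration only; not part of the ner package.
    """
    spans = []
    current_type = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != current_type)
        if starts_entity:
            # Close the previous entity (if any) and open a new one.
            # An "I-" tag with no matching open entity is treated as a start.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith('I-'):
            # Continuation of the currently open entity
            current_tokens.append(token)
        else:
            # "O" tag: close the open entity, if there is one
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type = None
            current_tokens = []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


tokens = ['Bernhard', 'Riemann', 'Carl', 'Friedrich', 'Gauss', 'and', 'Leonhard', 'Euler']
tags = ['B-PER', 'I-PER', 'B-PER', 'I-PER', 'I-PER', 'O', 'B-PER', 'I-PER']
print(bio_to_spans(tokens, tags))
# [('PER', 'Bernhard Riemann'), ('PER', 'Carl Friedrich Gauss'), ('PER', 'Leonhard Euler')]
```

The "B-PER" tag on "Carl" is what closes "Bernhard Riemann" and opens a new entity. With a plain PER tag on every token, the first five tokens would merge into a single span.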

### Training data
To train the neural network, you need to have a dataset in the following format:

    EU B-ORG
    rejects O
    the O
    call O
    of O
    Germany B-LOC
    to O
    boycott O
    lamb O
    from O
    Great B-LOC
    Britain I-LOC
    . O
    
    China B-LOC
    says O
    time O
    right O
    for O
    Taiwan B-LOC
    talks O
    . O

    ...

The source text is tokenized and tagged. Each token has its own tag in BIO markup. A tag is separated from its token by whitespace. Sentences are separated by empty lines.

The dataset is a text file or a set of text files.
The dataset must be split into three partitions: train, validation, and test. The train set is used for training the network, namely adjusting the weights with gradient descent. The validation set is used for monitoring learning progress and early stopping. The test set is used for the final estimate of model quality. A typical split is 80% train, 10% validation, and 10% test.
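If your own data comes as a single list of tagged sentences, a split along these lines could look like the sketch below. The function `split_dataset` and its parameters are hypothetical names chosen for illustration, and the toy samples are placeholders. The CoNLL 2003 data used later in this notebook already ships with its own partitions and does not need this step.

```python
import random


def split_dataset(samples, valid_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle sentence-level samples and split them into three partitions.

    Hypothetical helper for illustration only; not part of the ner package.
    """
    samples = list(samples)
    # A fixed seed keeps the split reproducible between runs
    random.Random(seed).shuffle(samples)
    n_valid = int(len(samples) * valid_fraction)
    n_test = int(len(samples) * test_fraction)
    return {
        'train': samples[n_valid + n_test:],
        'test': samples[n_valid:n_valid + n_test],
        'valid': samples[:n_valid],
    }


# Toy data: 100 one-token sentences, each a (sentence_tokens, sentence_tags) pair
toy_samples = [(['token{}'.format(i)], ['O']) for i in range(100)]
parts = split_dataset(toy_samples)
for name in ['train', 'test', 'valid']:
    print(name, len(parts[name]))
```

Shuffling is done at the sentence level here. If sentences from the same document should not end up in different partitions, split by document instead.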

### Download CoNLL 2003 dataset
Now we download the CoNLL 2003 dataset from our server and assemble the dataset_dict data structure.
dataset_dict is a dictionary with the fields _'train'_, _'test'_, and _'valid'_. Each field holds a list of samples. Each sample is a pair (sentence_tokens, sentence_tags), where sentence_tokens is a list of tokens and sentence_tags is a list of tags.

In [2]:
from ner.utils import download_untar


conll_tar_url = 'http://lnsigo.mipt.ru/export/datasets/conll2003.tar.gz'
download_path = 'conll2003/'
download_untar(conll_tar_url, download_path)

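data_types = ['train', 'test', 'valid']
dataset_dict = dict()
for data_type in data_types:
    # Each partition is stored as <download_path><data_type>.txt
    with open(download_path + data_type + '.txt') as f:
        xy_list = list()
        tokens = list()
        tags = list()
        for line in f:
            items = line.split()
            # A data line holds a token and its tag; document separators are skipped
            if len(items) > 1 and '-DOCSTART-' not in items[0]:
                token, tag = items
                # Tokens that start with a digit are replaced by a single placeholder
                if token[0].isdigit():
                    tokens.append('#')
                else:
                    tokens.append(token)
                tags.append(tag)
            # An empty line or a separator closes the current sentence
            elif len(tokens) > 0:
                xy_list.append((tokens, tags,))
                tokens = list()
                tags = list()
        # Keep the last sentence if the file does not end with an empty line
        if len(tokens) > 0:
            xy_list.append((tokens, tags,))
        dataset_dict[data_type] = xy_list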

for key in dataset_dict:
    print('Number of samples (sentences) in {:<5}: {}'.format(key, len(dataset_dict[key])))

print('\nHere is a first two samples from the train part of the dataset:')
first_two_train_samples = dataset_dict['train'][:2]
for n, sample in enumerate(first_two_train_samples):
    # sample is a tuple of sentence_tokens and sentence_tags
    tokens, tags = sample
    print('Sentence {}'.format(n))
    print('Tokens: {}'.format(tokens))
    print('Tags:   {}'.format(tags))

Downloading from http://lnsigo.mipt.ru/export/datasets/conll2003.tar.gz to conll2003/conll2003.tar.gz


100%|██████████| 765k/765k [00:00<00:00, 67.9MB/s]

Extracting conll2003/conll2003.tar.gz archive into conll2003/





Number of samples (sentences) in train: 14041
Number of samples (sentences) in test : 3453
Number of samples (sentences) in valid: 3250

Here is a first two samples from the train part of the dataset:
Sentence 0
Tokens: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
Tags:   ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
Sentence 1
Tokens: ['Peter', 'Blackburn']
Tags:   ['B-PER', 'I-PER']


### Corpus (batch generator)
Now we have to create a Corpus instance. Corpus is a data provider: it builds the vocabularies that map tokens to indices and generates batches. The Corpus constructor has an optional parameter embeddings_file_path, so you can provide the model with pre-trained embeddings. The embeddings must be either a FastText bin file or a txt file with the following structure:

    400000 100
    the -0.038194 -0.24487 ...
    of -0.1529 -0.24279 ...

where the first line contains the total number of tokens and the embedding dimensionality, and each remaining line contains a token followed by its embedding vector.
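As an illustration of this txt layout, the sketch below reads such a file into a dictionary. It is not how Corpus reads embeddings internally; `load_text_embeddings` is a hypothetical helper, and the three-dimensional vectors are toy values made up for the example.

```python
import io


def load_text_embeddings(file_obj):
    """Read embeddings in the txt layout: a header line, then one token per line.

    Hypothetical helper for illustration only; not part of the ner package.
    """
    # Header: total number of tokens and embedding dimensionality
    header = file_obj.readline().split()
    n_tokens, dim = int(header[0]), int(header[1])
    embeddings = {}
    for line in file_obj:
        parts = line.rstrip().split(' ')
        # Skip blank or malformed lines that do not hold token + dim values
        if len(parts) != dim + 1:
            continue
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return n_tokens, dim, embeddings


# Toy file with two tokens and 3-dimensional vectors (made-up values)
toy_file = io.StringIO('2 3\nthe 0.1 -0.2 0.3\nof -0.4 0.5 0.6\n')
n_tokens, dim, embeddings = load_text_embeddings(toy_file)
print(n_tokens, dim, embeddings['the'])
# 2 3 [0.1, -0.2, 0.3]
```

When embeddings are provided, the dimensionality in the header has to match the token embedding size of the network.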


In [3]:
from ner.corpus import Corpus
corp = Corpus(dataset_dict, embeddings_file_path=None)

### Neural Network
Now we have to create the neural network. To do so, we use the NER class from the network module. The NER constructor takes the following arguments:

    token_embeddings_dim - token embeddings dimensionality (must agree with embeddings if they are provided)
    char_embeddings_dim - character embeddings dimensionality 
    use_crf - whether to use Conditional Random Fields on top of the network (True is recommended)
    use_capitalization - whether to include a binary capitalization feature in the input of the network.
                         If True, a binary feature indicating whether the word starts with a capital letter
                         will be concatenated to the word embeddings.
    n_filters - list of output feature dimensionalities, one per layer. For [100, 200] there will be two
                layers with 100 and 200 units respectively.
    embeddings_dropout - whether to use dropout on embeddings
    
There is a special argument that determines what type of network to build:
    
    net_type - one of the following: 'cnn', 'rnn', or 'cnn_highway'
    
For each net type there are a number of optional arguments. For convolutional neural networks ('cnn' and 'cnn_highway' net types) there are:

    filter_width - width of the convolutional filter (number of tokens under the filter)
    use_batch_norm - if True each layer will be provided with batch normalization

For the 'rnn' net type there is:
    
    cell_type - either lstm or gru

In [4]:
from ner.network import NER

model_params = {"filter_width": 7,
                "embeddings_dropout": True,
                "n_filters": [
                    128, 128,
                ],
                "token_embeddings_dim": 100,
                "char_embeddings_dim": 25,
                "use_batch_norm": True,
                "use_crf": True,
                "net_type": 'cnn',
                "use_capitalization": True,
               }

net = NER(corp, **model_params)

Number of parameters: 
Embeddings 2014025
ConvNet 228352
Classifier 1290
transitions:0 100
Total number of parameters equal 2243767



### Network training
To train the network the following parameters must be specified:

    dropout_rate - probability of dropping a hidden state value (from 0 to 1). 0.5 works well
                   in most cases
    epochs - number of epochs (10 - 100 typical)
    learning_rate - learning rate (0.0001 - 0.01 typical)
    batch_size - number of samples in the batch (4 - 64 typical)
    learning_rate_decay - multiplicative factor by which the learning rate is decreased every epoch (0.5 - 1 typical)
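To get a feeling for how learning_rate and learning_rate_decay interact, the sketch below lists the learning rate per epoch. It assumes the decay is applied once after every epoch, starting from the initial value; the exact moment at which the library applies it is not confirmed here. `learning_rate_schedule` is a hypothetical helper.

```python
def learning_rate_schedule(learning_rate, learning_rate_decay, epochs):
    """Learning rate used in each epoch, assuming one multiplicative decay per epoch.

    Hypothetical helper for illustration only; not part of the ner package.
    """
    return [learning_rate * learning_rate_decay ** epoch for epoch in range(epochs)]


schedule = learning_rate_schedule(learning_rate=0.005, learning_rate_decay=0.707, epochs=5)
for epoch, rate in enumerate(schedule):
    print('Epoch {}: learning rate {:.6f}'.format(epoch, rate))
```

A decay of 0.707 is close to 1/sqrt(2), so under this assumption the learning rate roughly halves every two epochs. A decay of 1 keeps the learning rate constant.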

In [5]:
learning_params = {'dropout_rate': 0.5,
                   'epochs': 5,
                   'learning_rate': 0.005,
                   'batch_size': 8,
                   'learning_rate_decay': 0.707}
results = net.fit(**learning_params)

Epoch 0
Eval on valid:
processed 54612 tokens with 5942 phrases; found: 5811 phrases; correct: 5091.

precision:  87.61%; recall:  85.68%; FB1:  86.63


Epoch 1
Eval on valid:
processed 54612 tokens with 5942 phrases; found: 5858 phrases; correct: 5254.

precision:  89.69%; recall:  88.42%; FB1:  89.05


Epoch 2
Eval on valid:
processed 54612 tokens with 5942 phrases; found: 5853 phrases; correct: 5265.

precision:  89.95%; recall:  88.61%; FB1:  89.28


Epoch 3
Eval on valid:
processed 54612 tokens with 5942 phrases; found: 5929 phrases; correct: 5328.

precision:  89.86%; recall:  89.67%; FB1:  89.76


Epoch 4
Eval on valid:
processed 54612 tokens with 5942 phrases; found: 5857 phrases; correct: 5295.

precision:  90.40%; recall:  89.11%; FB1:  89.75
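The precision, recall, and FB1 values in the evaluation lines above can be recomputed from the three phrase counts printed with them. The sketch below assumes the usual phrase-level definitions (precision = correct / found, recall = correct / total phrases, FB1 = harmonic mean of the two); `phrase_scores` is a hypothetical helper, and the counts are taken from the Epoch 0 output.

```python
def phrase_scores(n_gold, n_found, n_correct):
    """Phrase-level precision, recall, and F1 in percent.

    Hypothetical helper for illustration only; not part of the ner package.
    """
    precision = 100.0 * n_correct / n_found if n_found else 0.0
    recall = 100.0 * n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Epoch 0 on the validation set: 5942 phrases, 5811 found, 5091 correct
precision, recall, f1 = phrase_scores(n_gold=5942, n_found=5811, n_correct=5091)
print('precision: {:6.2f}%; recall: {:6.2f}%; FB1: {:6.2f}'.format(precision, recall, f1))
# precision:  87.61%; recall:  85.68%; FB1:  86.63
```

A phrase counts as correct only if both its boundaries and its type match, which is why these numbers are lower than per-token accuracy would be.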


Eval on train:
processed 217662 tokens with 23499 phrases; found: 23489 phrases; correct: 23444.

precision:  99.81%; recall:  99.77%; FB1:  99.79

	LOC: precision:  99.82%; recall:  99.90%; F1:  99.86 7146

	MISC: precision:  99.53%; 