# End-to-end Sequence Labeling via Bi-directional LSTM-CNNs

In this tutorial we demonstrate how to implement the Bi-directional LSTM-CNN architecture (published at ACL'16; [Link To Paper](http://www.aclweb.org/anthology/P16-1101)) for Named Entity Recognition using PyTorch. 
**Notice** that the original paper also has a CRF layer after the LSTM output, but we remove it in this tutorial for simplicity.
For an implementation of the original architecture, please check the GitHub repo linked below.

The main aim of this tutorial is to make the audience comfortable with PyTorch and to give a step-by-step walkthrough of the Bi-LSTM-CNN architecture for NER. Some familiarity with PyTorch (or any other deep learning framework) would definitely be a plus. 

The agenda of this tutorial is as follows:

1. Getting Ready with the data 
2. Network Definition. This includes
    * CNN Encoder for Character Level representation.
    * Bi-directional LSTM for Word-Level Encoding.
3. Training 
4. Model testing

This tutorial is modified from [this GitHub repo](https://github.com/TheAnig/NER-LSTM-CNN-Pytorch).

**Author:**
[**Yinghao Li**](https://yinghao-li.github.io/)

### Downloading data

Before starting, we need to download the dataset and the pre-trained GloVe embedding files onto the machine.
We can use the following commands.

Notice that this may take a while (2~3 mins on Colab).

In [1]:
!mkdir data
!wget -P ./data/ https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/eng.train -nc
!wget -P ./data/ https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/eng.testa -nc
!wget -P ./data/ https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/eng.testb -nc
!wget -P ./data/ http://nlp.stanford.edu/data/glove.6B.zip -nc

!unzip -n ./data/glove.6B.zip -d ./data/

mkdir: cannot create directory ‘data’: File exists
File ‘./data/eng.train’ already there; not retrieving.

File ‘./data/eng.testa’ already there; not retrieving.

File ‘./data/eng.testb’ already there; not retrieving.

File ‘./data/glove.6B.zip’ already there; not retrieving.

Archive:  ./data/glove.6B.zip


### Data Preparation

The paper uses the English data from the CoNLL 2003 shared task\[1\], which is present in the "data" directory of this project. We will later apply more preprocessing steps to generate the tag, word and character mappings. The data set contains four different types of named entities (PERSON, LOCATION, ORGANIZATION, and MISC) and uses the BIO tagging scheme.

BIO tagging Scheme:

    I - Word is inside a phrase of type TYPE
    B - If two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE 
    O - Word is not part of a phrase
    
Example of English-NER sentence available in the data:
    
    U.N.         NNP  I-NP  I-ORG 
    official     NN   I-NP  O 
    Ekeus        NNP  I-NP  I-PER 
    heads        VBZ  I-VP  O 
    for          IN   I-PP  O 
    Baghdad      NNP  I-NP  I-LOC 
    .            .    O     O 
    
Data Split (we use the same split as mentioned in the paper):

    Training Data - eng.train
    Validation Data - eng.testa
    Testing Data - eng.testb
    

To get started, we first import the necessary libraries.

In [2]:
import torch
import torch.nn as nn
from torch.autograd import Variable
from tqdm.auto import tqdm

import matplotlib.pyplot as plt

import os
import codecs
import re
import numpy as np

##### Define constants and parameters

We now define some constants and parameters that we will be using later.

In [3]:
# parameters for the Model
args = dict()
args['train'] = "./data/eng.train"  # Path to train file
args['dev'] = "./data/eng.testa"  # Path to dev file
args['test'] = "./data/eng.testb"  # Path to test file
args['tag_scheme'] = "BIO"  # BIO or BIOES
args['lower'] = True  #  Boolean variable to control lowercasing of words
args['zeros'] =  True  #  Boolean variable to control replacement of  all digits by 0 
args['char_dim'] = 30  # Char embedding dimension
args['word_dim'] = 100  # Token embedding dimension
args['word_lstm_dim'] = 200  # Token LSTM hidden layer size
args['word_bidirect'] = True  # Use a bidirectional LSTM for words
args['embedding_path'] = "./data/glove.6B.100d.txt"  # Location of pretrained embeddings
args['all_emb'] = 1  # Load all embeddings
args['dropout'] = 0.5  # Dropout on the input (0 = no dropout)
args['weights'] = ""  # Path to pretrained weights from a previous run
args['name'] = "self-trained-model"  #  Model name
models_path = "./models/"  # path to saved models

# GPU
args['use_gpu'] = torch.cuda.is_available()  # GPU Check
use_gpu = args['use_gpu']

args['reload'] = "./models/pre-trained-model.ckpt" 

# Constants
START_TAG = '<START>'
STOP_TAG = '<STOP>'

In [4]:
# paths to files 
# Where to store the mapping file
mapping_file = './data/mapping.pt'

# Where to store the model
name = args['name']
model_name = os.path.join(models_path, name)  # get_name(parameters)

if not os.path.exists(models_path):
    os.makedirs(models_path, exist_ok=True)

##### Load data and preprocess

First, the data is loaded from the train, dev and test files into lists of sentences.

Preprocessing:  
* All the digits in the words are replaced by 0

Why this preprocessing step?  
* For the Named Entity Recognition task, the exact value of numerical digits does not help in predicting the entity type.
So, we replace all the digits by 0, which lets the model concentrate on the more informative alphabetic characters.

Notice that this step is unnecessary for more advanced word embedding methods such as BERT.
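As a quick illustration of the effect (the sample tokens below are made up for this sketch), zeroing digits collapses many distinct numeric tokens into a handful of shapes, which shrinks the vocabulary the model has to learn:

```python
import re

# Hypothetical tokens, chosen only to illustrate the effect
tokens = ["1996", "2003", "15:30", "09:45", "B-52", "Baghdad"]

# Replace every digit with '0'; everything else is left untouched
normalized = [re.sub(r"\d", "0", t) for t in tokens]

print(normalized)
# ['0000', '0000', '00:00', '00:00', 'B-00', 'Baghdad']
print(len(set(tokens)), "->", len(set(normalized)), "distinct tokens")
# 6 -> 4 distinct tokens
```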

In [5]:
def zero_digits(s):
    """
    Replace every digit in a string by a zero.
    """
    return re.sub(r'\d', '0', s)

def load_sentences(path, zeros):
    """
    Load sentences. A line must contain at least a word and its tag.
    Sentences are separated by empty lines.
    """
    sentences = []
    sentence = []
    for line in codecs.open(path, 'r', 'utf8'):
        line = zero_digits(line.rstrip()) if zeros else line.rstrip()
        if not line:
            if len(sentence) > 0:
                if 'DOCSTART' not in sentence[0][0]:
                    sentences.append(sentence)
                sentence = []
        else:
            word = line.split()
            assert len(word) >= 2
            sentence.append(word)
    if len(sentence) > 0:
        if 'DOCSTART' not in sentence[0][0]:
            sentences.append(sentence)
    return sentences

In [6]:
train_sentences = load_sentences(args['train'], args['zeros'])
test_sentences = load_sentences(args['test'], args['zeros'])
dev_sentences = load_sentences(args['dev'], args['zeros'])

##### Update tagging scheme

Different types of tagging schemes can be used for NER. We update the tags for the train, test and dev data depending on the `tag_scheme` parameter.

The original data is labeled using the BIO-1 scheme, but a more common scheme is BIO-2.
The difference is that BIO-1 uses the prefix `B-` only for the first token of an entity that immediately follows another entity of the same type, while BIO-2 uses it for the first token of every entity.

There are also other tagging schemes such as `BIOES`. Please refer to [this wiki page](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for details.
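To make the difference concrete, the sketch below writes the tags of one short, made-up sequence in BIO-1 and derives the BIO-2 and BIOES versions from it. `bio1_to_bio2` and `bio2_to_bioes` are illustrative helpers written for this example only; they are not the functions used in the rest of the tutorial.

```python
def bio1_to_bio2(tags):
    """Give every entity-initial token a B- prefix (BIO-1 -> BIO-2)."""
    out = []
    for i, tag in enumerate(tags):
        prev = tags[i - 1] if i > 0 else "O"
        # An I- tag starts a new entity if it follows O or a different type
        if tag.startswith("I-") and (prev == "O" or prev[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        out.append(tag)
    return out

def bio2_to_bioes(tags):
    """Mark single-token entities with S- and entity-final tokens with E-."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt.startswith("I-")
        if tag.startswith("B-") and not continues:
            tag = "S-" + tag[2:]
        elif tag.startswith("I-") and not continues:
            tag = "E-" + tag[2:]
        out.append(tag)
    return out

# Made-up sequence: a one-token ORG, an O, then a two-token PER
bio1 = ["I-ORG", "O", "I-PER", "I-PER"]
bio2 = bio1_to_bio2(bio1)
bioes = bio2_to_bioes(bio2)

print(bio2)   # ['B-ORG', 'O', 'B-PER', 'I-PER']
print(bioes)  # ['S-ORG', 'O', 'B-PER', 'E-PER']
```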

In [7]:
def iob2(tags):
    """
    Check that tags have a valid BIO format.
    Tags in BIO1 format are converted to BIO2.
    """
    for i, tag in enumerate(tags):
        if tag == 'O':
            continue
        split = tag.split('-')
        if len(split) != 2 or split[0] not in ['I', 'B']:
            return False
        if split[0] == 'B':
            continue
        elif i == 0 or tags[i - 1] == 'O':   #  conversion IOB1 to IOB2
            tags[i] = 'B' + tag[1:]
        elif tags[i - 1][1:] == tag[1:]:
            continue
        else:   #  conversion IOB1 to IOB2
            tags[i] = 'B' + tag[1:]
    return True

def iob_iobes(tags):
    """
    the function is used to convert
    BIO -> BIOES tagging
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'B':
            if i + 1 != len(tags) and \
               tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('B-', 'S-'))
        elif tag.split('-')[0] == 'I':
            if i + 1 < len(tags) and \
                    tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('I-', 'E-'))
        else:
            raise Exception('Invalid IOB format!')
    return new_tags

def update_tag_scheme(sentences, tag_scheme):
    """
    Check and update sentences tagging scheme to BIO2
    Only BIO1 and BIO2 schemes are accepted for input data.
    """
    for i, s in enumerate(sentences):
        tags = [w[-1] for w in s]
         #  Check that tags are given in the BIO format
        if not iob2(tags):
            s_str = '\n'.join(' '.join(w) for w in s)
            raise Exception('Sentences should be given in BIO format! ' +
                            'Please check sentence %i:\n%s' % (i, s_str))
        if tag_scheme == 'BIOES':
            new_tags = iob_iobes(tags)
            for word, new_tag in zip(s, new_tags):
                word[-1] = new_tag
        elif tag_scheme == 'BIO':
            for word, tag in zip(s, tags):
                word[-1] = tag
        else:
            raise Exception('Wrong tagging scheme!')

In [8]:
update_tag_scheme(train_sentences, args['tag_scheme'])
update_tag_scheme(dev_sentences, args['tag_scheme'])
update_tag_scheme(test_sentences, args['tag_scheme'])

##### Create Mappings for Words, Characters and Tags

Now that we have updated the tag scheme, we have a list of sentences, each consisting of words along with their modified tags. Next, we want to map the individual words, tags and the characters in each word to unique numerical IDs, so that each unique word, character and tag in the vocabulary is represented by a particular integer ID. To do this, we first create functions that build these mappings for us.

##### Why is mapping important?

These indices for words, tags and characters help us employ matrix (tensor) operations inside the neural network architecture, which are considerably faster.
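A minimal sketch of the idea on a toy corpus follows. The sentences are hypothetical, and `collections.Counter` is used here for brevity instead of the tutorial's own helper functions; items are ordered by decreasing frequency, with ties broken alphabetically.

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenized and lowercased
sentences = [["the", "u.n.", "official"], ["the", "official", "left"]]

counts = Counter(word for sent in sentences for word in sent)
# Most frequent first; ties broken alphabetically so the IDs are reproducible
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
word_to_id = {word: idx for idx, (word, _) in enumerate(ordered)}
id_to_word = {idx: word for word, idx in word_to_id.items()}

# Each sentence becomes a list of integers that can be turned into a tensor
encoded = [[word_to_id[w] for w in sent] for sent in sentences]

print(word_to_id)  # {'official': 0, 'the': 1, 'left': 2, 'u.n.': 3}
print(encoded)     # [[1, 3, 0], [1, 0, 2]]
```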

In [9]:
def create_dico(item_list):
    """
    Create a dictionary of items from a list of list of items.
    """
    assert type(item_list) is list
    dico = {}
    for items in item_list:
        for item in items:
            if item not in dico:
                dico[item] = 1
            else:
                dico[item] += 1
    return dico

def create_mapping(dico):
    """
    Create a mapping (item to ID / ID to item) from a dictionary.
    Items are ordered by decreasing frequency.
    """
    sorted_items = sorted(dico.items(), key=lambda x: (-x[1], x[0]))
    id_to_item = {i: v[0] for i, v in enumerate(sorted_items)}
    item_to_id = {v: k for k, v in id_to_item.items()}
    return item_to_id, id_to_item

def word_mapping(sentences, lower):
    """
    Create a dictionary and a mapping of words, sorted by frequency.
    """
    words = [[x[0].lower() if lower else x[0] for x in s] for s in sentences]
    dico = create_dico(words)
    dico['<UNK>'] = 10000000  # UNK tag for unknown words
    word_to_id, id_to_word = create_mapping(dico)
    print("Found %i unique words (%i in total)" % (
        len(dico), sum(len(x) for x in words)
    ))
    return dico, word_to_id, id_to_word

def char_mapping(sentences):
    """
    Create a dictionary and mapping of characters, sorted by frequency.
    """
    chars = ["".join([w[0] for w in s]) for s in sentences]
    dico = create_dico(chars)
    char_to_id, id_to_char = create_mapping(dico)
    print("Found %i unique characters" % len(dico))
    return dico, char_to_id, id_to_char

def tag_mapping(sentences):
    """
    Create a dictionary and a mapping of tags, sorted by frequency.
    """
    tags = [[word[-1] for word in s] for s in sentences]
    dico = create_dico(tags)
    dico[START_TAG] = -1
    dico[STOP_TAG] = -2
    tag_to_id, id_to_tag = create_mapping(dico)
    print("Found %i unique named entity tags" % len(dico))
    return dico, tag_to_id, id_to_tag

In [10]:
dico_words, word_to_id, id_to_word = word_mapping(train_sentences, args['lower'])
dico_chars, char_to_id, id_to_char = char_mapping(train_sentences)
dico_tags, tag_to_id, id_to_tag = tag_mapping(train_sentences)

Found 17493 unique words (203621 in total)
Found 75 unique characters
Found 11 unique named entity tags


##### Preparing final dataset

The function `prepare_dataset` returns a list of dictionaries (one dictionary per sentence).

Each dictionary returned by the function contains:
1. a list of all words in the sentence
2. a list of word indices for all words in the sentence
3. a list of lists, containing the character IDs of each word in the sentence
4. a list of tag IDs for each word in the sentence.
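For a concrete picture, here is what one such dictionary could look like for a three-word sentence. The mappings below are tiny hypothetical stand-ins for the real ones built from the training set, and the sketch assumes lowercasing is enabled.

```python
# Toy mappings (hypothetical); the real ones are built from the training set
word_to_id = {"<UNK>": 0, "heads": 1, "for": 2}
char_to_id = {c: i for i, c in enumerate("adefhors")}
tag_to_id = {"O": 0, "B-LOC": 1}

sentence = [["heads", "O"], ["for", "O"], ["Oslo", "B-LOC"]]

str_words = [w[0] for w in sentence]
entry = {
    "str_words": str_words,
    # Words missing from the vocabulary fall back to <UNK>
    "words": [word_to_id.get(w.lower(), word_to_id["<UNK>"]) for w in str_words],
    # Characters never seen in training are dropped
    "chars": [[char_to_id[c] for c in w if c in char_to_id] for w in str_words],
    "tags": [tag_to_id[w[-1]] for w in sentence],
}

print(entry["words"])  # [1, 2, 0]
print(entry["chars"])  # [[4, 2, 0, 1, 7], [3, 5, 6], [7, 5]]
print(entry["tags"])   # [0, 0, 1]
```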

In [11]:
def lower_case(x, lower=False):
    if lower:
        return x.lower()  
    else:
        return x

In [12]:
def prepare_dataset(sentences, word_to_id, char_to_id, tag_to_id, lower=False):
    """
    Prepare the dataset. Return a list of dictionaries (one per sentence) containing:
        - word indexes
        - word char indexes
        - tag indexes
    """
    data = []
    for s in sentences:
        str_words = [w[0] for w in s]
        words = [word_to_id[lower_case(w,lower) if lower_case(w,lower) in word_to_id else '<UNK>']
                 for w in str_words]
         #  Skip characters that are not in the training set
        chars = [[char_to_id[c] for c in w if c in char_to_id]
                 for w in str_words]
        tags = [tag_to_id[w[-1]] for w in s]
        data.append({
            'str_words': str_words,
            'words': words,
            'chars': chars,
            'tags': tags,
        })
    return data

train_data = prepare_dataset(
    train_sentences, word_to_id, char_to_id, tag_to_id, args['lower']
)
dev_data = prepare_dataset(
    dev_sentences, word_to_id, char_to_id, tag_to_id, args['lower']
)
test_data = prepare_dataset(
    test_sentences, word_to_id, char_to_id, tag_to_id, args['lower']
)
print("{} / {} / {} sentences in train / dev / test.".format(len(train_data), len(dev_data), len(test_data)))

14041 / 3250 / 3453 sentences in train / dev / test.


We are done with the preprocessing step for the input data. It is now ready to be given as input to the model!

##### Load Word Embeddings

Now we move on to the next step: loading the pre-trained word embeddings.

The paper uses 100-dimensional GloVe vectors trained on the Wikipedia 2014 + Gigaword 5 corpus, which contains 6 billion tokens. The word embedding file (glove.6B.100d.txt) is placed in the data folder.
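Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses that layout from a small in-memory sample; the vectors are made up and use a toy dimension of 3 instead of 100.

```python
import io

dim = 3  # toy dimension; the tutorial uses 100

# Made-up vectors in the GloVe text layout: token followed by `dim` floats
sample = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "baghdad 0.5 0.0 -0.4\n"
    "broken 0.7\n"  # wrong number of fields, skipped below
)

embeddings = {}
for line in sample:
    fields = line.strip().split()
    # Keep only well-formed lines: one token plus exactly `dim` numbers
    if len(fields) == dim + 1:
        embeddings[fields[0]] = [float(x) for x in fields[1:]]

print(sorted(embeddings))     # ['baghdad', 'the']
print(embeddings["baghdad"])  # [0.5, 0.0, -0.4]
```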

In [13]:
all_word_embeds = {}
for line in codecs.open(args['embedding_path'], 'r', 'utf-8'):
    s = line.strip().split()
    if len(s) == args['word_dim'] + 1:
        all_word_embeds[s[0]] = np.array([float(x) for x in s[1:]])

# Initializing Word Embedding Matrix
word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), args['word_dim']))

for w in word_to_id:
    if w in all_word_embeds:
        word_embeds[word_to_id[w]] = all_word_embeds[w]
    elif w.lower() in all_word_embeds:
        word_embeds[word_to_id[w]] = all_word_embeds[w.lower()]

print('Loaded %i pretrained embeddings.' % len(all_word_embeds))

Loaded 400000 pretrained embeddings.


##### (Optional) Storing Processed Data for Reuse

We can store the preprocessed data and the embedding matrix for future reuse.
This helps us avoid the time taken by the preprocessing step when we tune the hyperparameters of the model.

In [None]:
mappings = {
    'word_to_id': word_to_id,
    'tag_to_id': tag_to_id,
    'char_to_id': char_to_id,
    'args': args,
    'word_embeds': word_embeds
}
torch.save(mappings, mapping_file)

print('word_to_id: ', len(word_to_id))

##### (Optional) Loading Stored Data

To load the data stored on the disk, we can simply use the `torch.load` function:

In [None]:
data_dict = torch.load(mapping_file)

word_to_id = data_dict['word_to_id']
tag_to_id = data_dict['tag_to_id']
char_to_id = data_dict['char_to_id']
args = data_dict['args']
word_embeds = data_dict['word_embeds']

### Model


The model that we are presenting is a complicated one, since it's a hybrid network using LSTMs and CNNs. To break down the complexity, we split the operations into individual functions that we can go over part by part. This hopefully makes the whole thing more easily digestible and gives a more intuitive understanding of the whole process.

##### Initialization of weights

We start with the `init_embedding` function, which initializes the embedding layer by sampling from a uniform distribution. 

The values are sampled uniformly from $-\sqrt{\frac{3}{V}}$ to $+\sqrt{\frac{3}{V}}$, where $V$ is the embedding dimension size.
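One way to read this bound: a uniform distribution on $[-a, a]$ has variance $a^2/3$, so choosing $a = \sqrt{3/V}$ gives every embedding component a variance of $1/V$. The short check below works this out for a few example sizes (the values of `V` are illustrative).

```python
import math

results = {}
for V in (25, 30, 100):  # example embedding sizes
    bound = math.sqrt(3.0 / V)
    variance = bound ** 2 / 3  # variance of Uniform(-bound, bound)
    results[V] = (bound, variance)
    print(f"V={V:3d}  bound={bound:.4f}  variance={variance:.4f}  1/V={1 / V:.4f}")
```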

In [14]:
def init_embedding(input_embedding):
    """
    Initialize embedding
    """
    bias = np.sqrt(3.0 / input_embedding.size(1))
    nn.init.uniform_(input_embedding, -bias, bias)

Similar to the initialization above, except this is for the linear layer.

In [15]:
def init_linear(input_linear):
    """
    Initialize linear transformation
    """
    bias = np.sqrt(6.0 / (input_linear.weight.size(0) + input_linear.weight.size(1)))
    nn.init.uniform_(input_linear.weight, -bias, bias)
    if input_linear.bias is not None:
        input_linear.bias.data.zero_()

This is the initialization scheme for the LSTM layers. 

The LSTM layers are initialized by uniform sampling from $-\sqrt{\frac{6}{r+c}}$ to $+\sqrt{\frac{6}{r+c}}$, where $r$ and $c$ are the numbers of rows and columns of the weight matrix of a single gate. PyTorch stacks the four gate matrices into one tensor, which is why the code divides the first dimension by 4.
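As a worked example, the sampling ranges for the first layer of this tutorial's model can be computed by hand. The sketch assumes the default sizes used here: a hidden size of 200 and an input size of 125 (100-dimensional word embeddings plus a 25-dimensional character representation).

```python
import math

hidden_size = 200      # args['word_lstm_dim']
input_size = 100 + 25  # word embedding + character CNN output (assumed defaults)

# nn.LSTM stacks the four gate matrices, so weight_ih_l0 has 4 * hidden_size rows
rows, cols = 4 * hidden_size, input_size
r, c = rows / 4, cols  # per-gate shape used in the formula
bound_ih = math.sqrt(6.0 / (r + c))

# weight_hh_l0 has shape (4 * hidden_size, hidden_size)
bound_hh = math.sqrt(6.0 / (hidden_size + hidden_size))

print(f"input-hidden range:  +/-{bound_ih:.4f}")
print(f"hidden-hidden range: +/-{bound_hh:.4f}")
```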

In [16]:
def init_lstm(input_lstm):
    """
    Initialize lstm
    
    PyTorch weights parameters:
    
        weight_ih_l[k]: the learnable input-hidden weights of the k-th layer,
            of shape `(4 * hidden_size, input_size)` for `k = 0`. Otherwise, the shape is
            `(4 * hidden_size, num_directions * hidden_size)`
            
        weight_hh_l[k]: the learnable hidden-hidden weights of the k-th layer,
            of shape `(4 * hidden_size, hidden_size)`            
    """
    
     #  Weights init for forward layer
    for ind in range(0, input_lstm.num_layers):
        
         # # Gets the weights Tensor from our model, for the input-hidden weights in our current layer
        weight = eval('input_lstm.weight_ih_l' + str(ind))
        
         #  Initialize the sampling range
        sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))
        
         #  Randomly sample from our sampling range using a uniform distribution and apply it to our current layer
        nn.init.uniform_(weight, -sampling_range, sampling_range)
        
         #  Similar to above but for the hidden-hidden weights of the current layer
        weight = eval('input_lstm.weight_hh_l' + str(ind))
        sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))
        nn.init.uniform_(weight, -sampling_range, sampling_range)
        
        
     #  We do the above again, for the backward layer if we are using a bi-directional LSTM (our final model uses this)
    if input_lstm.bidirectional:
        for ind in range(0, input_lstm.num_layers):
            weight = eval('input_lstm.weight_ih_l' + str(ind) + '_reverse')
            sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))
            nn.init.uniform_(weight, -sampling_range, sampling_range)
            weight = eval('input_lstm.weight_hh_l' + str(ind) + '_reverse')
            sampling_range = np.sqrt(6.0 / (weight.size(0) / 4 + weight.size(1)))
            nn.init.uniform_(weight, -sampling_range, sampling_range)

     #  Bias initialization steps
    
     #  We initialize them to zero except for the forget gate bias, which is initialized to 1
    if input_lstm.bias:
        for ind in range(0, input_lstm.num_layers):
            bias = eval('input_lstm.bias_ih_l' + str(ind))
            
             #  Initializing to zero
            bias.data.zero_()
            
             #  This is the range of indices for our forget gates for each LSTM cell
            bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1
            
             # Similar for the hidden-hidden layer
            bias = eval('input_lstm.bias_hh_l' + str(ind))
            bias.data.zero_()
            bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1
            
         #  Similar to above, we do for backward layer if we are using a bi-directional LSTM 
        if input_lstm.bidirectional:
            for ind in range(0, input_lstm.num_layers):
                bias = eval('input_lstm.bias_ih_l' + str(ind) + '_reverse')
                bias.data.zero_()
                bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1
                bias = eval('input_lstm.bias_hh_l' + str(ind) + '_reverse')
                bias.data.zero_()
                bias.data[input_lstm.hidden_size: 2 * input_lstm.hidden_size] = 1

##### Output Layer

In this tutorial, we use softmax to normalize the scores of each word into a vector that can be interpreted as the probabilities of the word belonging to each class.
The probability of a tag sequence $y$ is then the product of the probabilities of its individual tags.

Another type of output layer often used in combination with an LSTM is the linear-chain conditional random field (CRF).
Its advantage over softmax:
- Softmax predicts each tag independently and ignores dependencies between neighboring tags. This is a problem because, in NER, the neighboring tags heavily constrain which tag can be assigned.
A CRF solves this by taking the full tag sequence into account when assigning tags. 
- *Example: I-ORG cannot directly follow I-PER.*
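The limitation can be seen in a small, framework-free sketch. The scores below are made up; each word's tag is chosen independently, so nothing prevents an invalid transition from being predicted.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tags = ["O", "B-PER", "I-PER", "I-ORG"]
# Made-up scores for a two-word sentence (one row per word)
scores = [
    [0.1, 2.0, 0.3, 0.2],  # word 1: B-PER wins
    [0.2, 0.1, 1.0, 1.2],  # word 2: I-ORG narrowly beats I-PER
]

probs = [softmax(row) for row in scores]
best = [max(range(len(tags)), key=row.__getitem__) for row in probs]
predicted = [tags[i] for i in best]

# Probability of the predicted sequence = product of the per-word probabilities
seq_prob = math.prod(p[i] for p, i in zip(probs, best))

print(predicted)  # ['B-PER', 'I-ORG'], an invalid transition
print(round(seq_prob, 3))
```

A CRF layer would score the transition from `B-PER` to `I-ORG` as well, and could therefore prefer `I-PER` for the second word.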


### Details of the Model

##### 1. CNN model for generating character embeddings


Consider the word 'cat'. We pad it on both ends up to the maximum word length (this is mainly an implementation quirk, since we can't have variable-length layers at run time; our algorithm will ignore the pads).

We then apply a convolution layer on top, which captures local patterns across neighboring characters, and use max-pooling to extract the most salient features from the convolution output. This gives us a dense vector representation of each word, which is then concatenated with the pre-trained GloVe embedding obtained by a simple lookup.


<img src = "https://github.com/TheAnig/NER-LSTM-CNN-Pytorch/raw/master/images/cnn_model.png"></img>
<a href="http://www.aclweb.org/anthology/P16-1101">Image Source</a>


This snippet shows how the CNN is implemented in PyTorch:

`self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))`
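To see what this layer does to the tensor shapes, the following sketch traces a hypothetical sentence of 4 words, each padded to 7 characters, through the character embedding, the convolution and the max-pooling. The sizes (75 characters, 25-dimensional embeddings, 25 output channels) match the defaults used in this tutorial; the character IDs are random placeholders.

```python
import torch
import torch.nn as nn

n_words, max_word_len = 4, 7  # hypothetical sentence: 4 words padded to 7 chars
n_chars, char_dim, out_channels = 75, 25, 25

char_embeds = nn.Embedding(n_chars, char_dim)
char_cnn3 = nn.Conv2d(in_channels=1, out_channels=out_channels,
                      kernel_size=(3, char_dim), padding=(2, 0))

chars = torch.randint(0, n_chars, (n_words, max_word_len))  # placeholder IDs
x = char_embeds(chars).unsqueeze(1)  # (4, 1, 7, 25): one input channel per word
conv_out = char_cnn3(x)              # (4, 25, 9, 1): 7 + 2*2 - 3 + 1 = 9 positions
# Max over all character positions leaves one value per output channel
pooled = nn.functional.max_pool2d(conv_out, kernel_size=(conv_out.size(2), 1))
word_repr = pooled.view(n_words, out_channels)  # (4, 25)

print(tuple(x.shape), tuple(conv_out.shape), tuple(word_repr.shape))
```

Because the kernel spans the full embedding width, the convolution slides only along the character axis, looking at three characters at a time.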

##### 2. Rest of the model (LSTM based) that generates tags for the given sequence

The word embeddings (GloVe + character embedding) that we generated above are fed to a bi-directional LSTM model. The LSTM model has 2 directional layers: 
* The forward layer takes in a sequence of word vectors and generates a new vector based on what it has seen so far in the forward direction (from the first word up to the current word). This vector can be thought of as a summary of all the words it has seen. 

* The backward layer does the same but in the opposite direction, i.e., from the end of the sentence to the current word.

The forward vector and the backward vector at the current word are concatenated to generate a unified representation.

<img src = "https://github.com/TheAnig/NER-LSTM-CNN-Pytorch/raw/master/images/lstm_model.png"></img>
<a href="http://www.aclweb.org/anthology/P16-1101">Image Source</a>

This snippet shows how the BiLSTM is implemented in PyTorch:

`self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)`

Finally, we have a linear layer to map hidden vectors to tag space.
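A short shape trace may help here as well. The sketch below pushes a random stand-in for one 6-word sentence through a bidirectional LSTM and a linear layer, using the sizes from this tutorial (100 + 25 input features, hidden size 200, 11 tags); the input values themselves are placeholders.

```python
import torch
import torch.nn as nn

seq_len, embedding_dim, out_channels = 6, 100, 25
hidden_dim, tagset_size = 200, 11

lstm = nn.LSTM(embedding_dim + out_channels, hidden_dim, bidirectional=True)
hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)

# One sentence treated as a batch of size 1: (seq_len, batch, features)
embeds = torch.randn(seq_len, 1, embedding_dim + out_channels)
lstm_out, _ = lstm(embeds)                         # (6, 1, 400): forward + backward
lstm_out = lstm_out.view(seq_len, hidden_dim * 2)  # (6, 400)
tag_scores = hidden2tag(lstm_out)                  # (6, 11): one score per tag per word

print(tuple(tag_scores.shape))
```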

##### Main Model Implementation

The `get_lstm_features` function returns the LSTM's tag vectors. The function performs all the steps mentioned above for the model.

Steps:
1. It takes in characters and converts them to embeddings using our character CNN.
2. We concatenate the character embedding with the GloVe vectors and use the result as the features that we feed to the bidirectional LSTM. 
3. The bidirectional LSTM generates outputs based on this set of features.
4. The outputs are passed through a linear layer to convert them to tag space.

In [17]:
def get_lstm_features(self, sentence, chars):

    # Character embedding
    chars_embeds = self.char_embeds(chars).unsqueeze(1)

    # # Creating Character level representation using Convolutional Neural Network
    # # followed by a Maxpooling Layer
    chars_cnn_out3 = self.char_cnn3(chars_embeds)
    chars_embeds = nn.functional.max_pool2d(chars_cnn_out3, kernel_size=(chars_cnn_out3.size(2), 1))\
        .view(chars_cnn_out3.size(0), self.out_channels)

    # # Loading word embeddings
    embeds = self.word_embeds(sentence)

     # # We concatenate the word embeddings and the character level representation
     # # to create unified representation for each word
    embeds = torch.cat((embeds, chars_embeds), dim=1).unsqueeze(1)

     # # Dropout on the unified embeddings
    embeds = self.dropout(embeds)

    # # Word lstm
    # # Takes words as input and generates a output at each step
    lstm_out, _ = self.lstm(embeds)

    # # Reshaping the outputs from the lstm layer
    lstm_out = lstm_out.view(len(sentence), self.hidden_dim*2)

    # # Dropout on the lstm output
    lstm_out = self.dropout(lstm_out)

    # # Linear layer converts the output vectors to tag space
    lstm_feats = self.hidden2tag(lstm_out)

    return lstm_feats

##### Function for Negative log likelihood calculation

This is a helper function that calculates the negative log likelihood. 

The function takes as input the sentence, its gold tags and its characters. It computes the LSTM features (one score per tag for every word) and compares them with the gold tags using cross-entropy. 

Implementation detail: Notice that we do not apply any explicit log or softmax in this function. This is because `nn.functional.cross_entropy` applies log-softmax to the scores internally before computing the negative log likelihood, averaged over the words in the sentence.
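To make the computation explicit, the sketch below reproduces the same quantity by hand in plain Python: log-softmax over the tag scores of each word, then the average negative log probability of the gold tags. The scores and gold tags are made up; this mirrors what `nn.functional.cross_entropy` computes with its default mean reduction.

```python
import math

def log_softmax(row):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

# Made-up tag scores for a 2-word sentence over 3 tags, and the gold tag IDs
feats = [[2.0, 0.5, 0.1],
         [0.2, 0.3, 1.5]]
gold = [0, 2]

# Negative log likelihood of the gold tags, averaged over the words
nll = -sum(log_softmax(row)[t] for row, t in zip(feats, gold)) / len(gold)

print(round(nll, 3))
```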

In [18]:
def get_neg_log_likelihood(self, sentence, tags, chars):
     #  sentence, tags is a list of ints
     #  features is a 2D tensor, len(sentence) * self.tagset_size
    feats = self._get_lstm_features(sentence, chars)
    scores = nn.functional.cross_entropy(feats, tags)
    return scores

##### Main Model Class

In [19]:
class BiLSTM_CRF(nn.Module):

    def __init__(self,
                 vocab_size,
                 tag_to_ix,
                 embedding_dim,
                 hidden_dim,
                 char_to_ix=None,
                 pre_word_embeds=None,
                 char_out_dimension=25,
                 char_embedding_dim=25,
                 use_gpu=False,
                 dropout=0.1):
        '''
        Input parameters:
                
                vocab_size= Size of vocabulary (int)
                tag_to_ix = Dictionary that maps NER tags to indices
                embedding_dim = Dimension of word embeddings (int)
                hidden_dim = The hidden dimension of the LSTM layer (int)
                char_to_ix = Dictionary that maps characters to indices
                pre_word_embeds = Numpy array which provides mapping from word indices to word embeddings
                char_out_dimension = Output dimension from the CNN encoder for character
                char_embedding_dim = Dimension of the character embeddings
                use_gpu = defines availability of GPU, 
                    when True: CUDA function calls are made
                    else: Normal CPU function calls are made
                dropout = dropout ratio
        '''
        
        super(BiLSTM_CRF, self).__init__()
        
         # parameter initialization for the model
        self.use_gpu = use_gpu
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)
        self.out_channels = char_out_dimension

        if char_embedding_dim is not None:
            self.char_embedding_dim = char_embedding_dim
            
             # Initializing the character embedding layer
            self.char_embeds = nn.Embedding(len(char_to_ix), char_embedding_dim)
            init_embedding(self.char_embeds.weight)
            
             # Performing CNN encoding on the character embeddings
            self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))

         # Creating Embedding layer with dimension of ( number of words * dimension of each word)
        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        if pre_word_embeds is not None:
             # Initializes the word embeddings with pretrained word embeddings
            self.pre_word_embeds = True
            self.word_embeds.weight = nn.Parameter(torch.FloatTensor(pre_word_embeds))
        else:
            self.pre_word_embeds = False
    
         # Initializing the dropout layer, with dropout specified in parameters
        self.dropout = nn.Dropout(dropout)
        
         # Lstm Layer:
         # input dimension: word embedding dimension + character level representation
         # bidirectional=True, specifies that we are using the bidirectional LSTM
        self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)
        
         # Initializing the lstm layer using predefined function for initialization
        init_lstm(self.lstm)
        
         #  Linear layer which maps the output of the bidirectional LSTM into tag space.
        self.hidden2tag = nn.Linear(hidden_dim*2, self.tagset_size)
        
         # Initializing the linear layer using predefined function for initialization
        init_linear(self.hidden2tag) 

     # assigning the functions, which we have defined earlier
    _get_lstm_features = get_lstm_features
    neg_log_likelihood = get_neg_log_likelihood

    # define the forward function
    def forward(self, sentence, chars):
    
        '''
        The function takes the highest-scoring tag for each word
        independently (no Viterbi decoding, since there is no CRF layer)
        and returns the resulting sequence of tags for the sentence
        '''
    
        # Get the emission scores from the BiLSTM
        feats = self._get_lstm_features(sentence, chars)
    
        # Pick the best tag for each word, given the features.
        score, tag_seq = torch.max(feats, 1)
        tag_seq = list(tag_seq.cpu().data)

        return score, tag_seq

In [20]:
#creating the model using the Class defined above
model = BiLSTM_CRF(vocab_size=len(word_to_id),
                   tag_to_ix=tag_to_id,
                   embedding_dim=args['word_dim'],
                   hidden_dim=args['word_lstm_dim'],
                   use_gpu=use_gpu,
                   char_to_ix=char_to_id,
                   pre_word_embeds=word_embeds)
print("Model Initialized!!!")

Model Initialized!!!


In [21]:
if use_gpu:
    model.cuda()

##### Training Parameters

In [52]:
#Initializing the optimizer
#The best results in the paper were achieved using stochastic gradient descent (SGD) 
#learning rate=0.015 and momentum=0.9 
#decay_rate=0.05 

learning_rate = 0.015
momentum = 0.9
number_of_epochs = 5
decay_rate = 0.05
gradient_clip = 5.0
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

#variables which will be used in the training process
losses = []  # list to store all losses
loss = 0.0  # Loss Initialization
best_dev_F = -1.0  #  Current best F-1 Score on Dev Set
best_test_F = -1.0  #  Current best F-1 Score on Test Set
best_train_F = -1.0  #  Current best F-1 Score on Train Set
all_F = list()  #  List storing all the F-1 Scores
eval_every = len(train_data)  #  Calculate F-1 Score after this many iterations
plot_every = 2000  #  Store loss after this many iterations
count = 0  # Counts the number of iterations

### Evaluation

##### Helper functions for evaluation

In [29]:
def get_chunk_type(tok, idx_to_tag):
    """
    The function takes a tag id, looks up its tag name (e.g. "B-PER") and splits
    the name into its class (B) and its type (PER), as defined in BIOES
    
    Args:
        tok: id of token, ex 4
        idx_to_tag: dictionary {4: "B-PER", ...}

    Returns:
        tuple: "B", "PER"

    """
    
    tag_name = idx_to_tag[tok]
    tag_class = tag_name.split('-')[0]
    tag_type = tag_name.split('-')[-1]
    return tag_class, tag_type
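
As a quick check of the splitting logic, the sketch below applies the same two `split('-')` calls directly to tag names. `split_tag` is a hypothetical helper introduced only for this illustration; it skips the id-to-name lookup that `get_chunk_type` performs.

```python
def split_tag(tag_name):
    """Split a tag name such as "B-PER" into (class, type)."""
    tag_class = tag_name.split('-')[0]
    tag_type = tag_name.split('-')[-1]
    return tag_class, tag_type

print(split_tag("B-PER"))  # ('B', 'PER')
print(split_tag("O"))      # ('O', 'O')
```

Note that a tag without a hyphen, such as `"O"`, yields the same string for both parts.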

In [34]:
def get_chunks(seq, tags):
    """Given a sequence of tag ids, group entities and their positions

    Args:
        seq: [4, 4, 0, 0, ...] sequence of label ids
        tags: dictionary mapping tag names to ids; it must contain the "O" tag

    Returns:
        list of (chunk_type, chunk_start, chunk_end), where chunk_end is exclusive

    Example:
        seq = [4, 5, 0, 3]
        tags = {"O": 0, "B-PER": 4, "I-PER": 5, "B-LOC": 3}
        result = [("PER", 0, 2), ("LOC", 3, 4)]

    """
    
    # By default, a token lies outside any named entity (the "O" tag)
    default = tags["O"]
    
    idx_to_tag = {idx: tag for tag, idx in tags.items()}
    
    chunks = []
    
    chunk_type, chunk_start = None, None
    for i, tok in enumerate(seq):
        #  End of a chunk: an "O" tag closes the currently open chunk
        if tok == default and chunk_type is not None:
            #  Add a chunk.
            chunk = (chunk_type, chunk_start, i)
            chunks.append(chunk)
            chunk_type, chunk_start = None, None

        #  End of a chunk + start of a chunk!
        elif tok != default:
            tok_chunk_class, tok_chunk_type = get_chunk_type(tok, idx_to_tag)
            if chunk_type is None:
                #  Initialize chunk for each entity
                chunk_type, chunk_start = tok_chunk_type, i
            elif tok_chunk_type != chunk_type or tok_chunk_class == "B":
                #  If the chunk class is B, i.e., it is the beginning of a new named entity,
                #  or if the chunk type differs from the previous one, we close the
                #  current chunk and start a new entity
                chunk = (chunk_type, chunk_start, i)
                chunks.append(chunk)
                chunk_type, chunk_start = tok_chunk_type, i
        else:
            pass

    #  end condition
    if chunk_type is not None:
        chunk = (chunk_type, chunk_start, len(seq))
        chunks.append(chunk)

    return chunks
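
To see how the grouping behaves on edge cases (two adjacent `B-` tags, or a type change without a `B-` tag), here is a condensed restatement of the same logic that works on tag strings instead of ids. `chunk_spans` is a hypothetical name used only for this sketch, and the tag sequence is made up.

```python
def chunk_spans(tag_names):
    """Condensed restatement of get_chunks that works on tag strings."""
    chunks = []
    chunk_type, chunk_start = None, None
    for i, name in enumerate(tag_names):
        if name == "O":
            # An "O" tag closes the currently open chunk, if any
            if chunk_type is not None:
                chunks.append((chunk_type, chunk_start, i))
                chunk_type, chunk_start = None, None
            continue
        tag_class, tag_type = name.split("-")[0], name.split("-")[-1]
        if chunk_type is None:
            chunk_type, chunk_start = tag_type, i
        elif tag_type != chunk_type or tag_class == "B":
            # A new B- tag or a type change closes the chunk and opens a new one
            chunks.append((chunk_type, chunk_start, i))
            chunk_type, chunk_start = tag_type, i
    if chunk_type is not None:
        chunks.append((chunk_type, chunk_start, len(tag_names)))
    return chunks

tag_names = ["B-PER", "I-PER", "O", "B-LOC", "B-LOC", "I-ORG"]
spans = chunk_spans(tag_names)
print(spans)
# [('PER', 0, 2), ('LOC', 3, 4), ('LOC', 4, 5), ('ORG', 5, 6)]
```

The end index of every span is exclusive, so `tag_names[start:end]` recovers the tokens of a chunk.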

In [49]:
def evaluating(model, datas, best_F, dataset="Train"):
    '''
    Takes the model and a dataset and calculates the chunk-level F-1 score.
    If the new F-1 score improves on the previous best, it updates
     1) the flag that tells the caller to save the model
     2) the best F-1 score
    '''
    #  Initializations
    save = False  #  Flag that tells us if the model needs to be saved
    new_F = 0.0  #  Variable to store the current F1-Score (may not be the best)
    correct_preds, total_correct, total_preds = 0., 0., 0.  #  Count variables

    print('[Evaluation]')
    for data in tqdm(datas):
        ground_truth_id = data['tags']
        chars2 = data['chars']

        #  Pad each word to the length of the longest word in the sentence
        chars2_length = [len(c) for c in chars2]
        char_maxl = max(chars2_length)
        chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')
        for i, c in enumerate(chars2):
            chars2_mask[i, :chars2_length[i]] = c
        chars2_mask = torch.tensor(chars2_mask, dtype=torch.long)

        dwords = torch.tensor(data['words'], dtype=torch.long)

        #  Get the predicted tag ids from the model
        if use_gpu:
            _, out = model(dwords.cuda(), chars2_mask.cuda())
        else:
            _, out = model(dwords, chars2_mask)
        predicted_id = [idx.item() for idx in out]

        #  We use the get chunks function defined above to get the true chunks
        #  and the predicted chunks from true labels and predicted labels respectively
        lab_chunks      = set(get_chunks(ground_truth_id, tag_to_id))
        lab_pred_chunks = set(get_chunks(predicted_id, tag_to_id))

        #  Updating the count variables
        correct_preds += len(lab_chunks & lab_pred_chunks)
        total_preds   += len(lab_pred_chunks)
        total_correct += len(lab_chunks)

     #  Calculating the F1-Score
    p   = correct_preds / total_preds if correct_preds > 0 else 0
    r   = correct_preds / total_correct if correct_preds > 0 else 0
    new_F  = 2 * p * r / (p + r) if correct_preds > 0 else 0

    print("{}: new_F: {} best_F: {} ".format(dataset, new_F, best_F))

     #  If our current F1-Score is better than the previous best, we update the best
     #  to current F1 and we set the flag to indicate that we need to checkpoint this model

    if new_F > best_F:
        best_F = new_F
        save = True

    return best_F, new_F, save
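
The score computed above is chunk-level: a predicted entity only counts as correct when its type, start and end all match a gold entity. The sketch below works through the same precision, recall and F-1 formulas on made-up chunk sets; the chunks themselves are assumptions for illustration.

```python
# Hypothetical gold and predicted chunks as (type, start, end) tuples
gold = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 6, 8)}
pred = {("PER", 0, 2), ("LOC", 3, 5)}

# Only exact matches count: ("LOC", 3, 5) has the wrong end index
correct = len(gold & pred)

precision = correct / len(pred) if correct > 0 else 0
recall = correct / len(gold) if correct > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if correct > 0 else 0

print(correct, precision, round(recall, 4), round(f1, 4))
# 1 0.5 0.3333 0.4
```

In `evaluating`, the counts are accumulated over all sentences before the ratios are taken, so the result is a micro-averaged F-1 score.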

##### Helper function for performing Learning rate decay

In [36]:
def adjust_learning_rate(optimizer, lr):
    """
    shrink learning rate
    """
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
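
The training loop calls this helper once per epoch with `learning_rate/(1+decay_rate*count/len(train_data))`. Since `count` grows by one per sentence, `count/len(train_data)` equals the number of completed epochs. The sketch below tabulates the resulting schedule with the hyperparameters from this tutorial; the variable `schedule` is introduced only for this illustration.

```python
learning_rate = 0.015
decay_rate = 0.05
number_of_epochs = 5

# Learning rate applied after each completed epoch (1-based epoch count)
schedule = [learning_rate / (1 + decay_rate * epoch)
            for epoch in range(1, number_of_epochs + 1)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"after epoch {epoch}: lr = {lr:.6f}")
```

The decay is gentle: after five epochs the learning rate has dropped from 0.015 to 0.012.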

### Training Step

If `parameters['reload']` is set, we already have a model to load from, so we can skip the training.
We originally specified a pre-trained model because training is expensive, but we encourage readers to try training once they are done with the tutorial.

Notice that this example uses stochastic gradient descent with no batching: each update is computed from a single sentence.
**This is bad practice.**
It wastes computational resources and makes training relatively unstable.
PyTorch datasets and models are also designed to be trained in batches.
In practice, you should always train your models with batched data.

For a (relatively more) formal implementation of the LSTM-CRF model and training the model with batched data, you can refer to [this repo](https://github.com/Yinghao-Li/SupervisedNER) (also with `wandb` for training status monitoring, but unannotated).
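
As a rough idea of what batching involves, the sketch below pads variable-length sentences of word ids into one tensor and builds a mask that marks the real tokens. It is a minimal illustration, not the approach of the linked repo: the word ids are made up, and using `0` as the padding id is an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical sentences as word-id tensors of different lengths
sentences = [
    torch.tensor([4, 9, 2]),
    torch.tensor([7, 1]),
    torch.tensor([3, 8, 5, 6]),
]
lengths = torch.tensor([len(s) for s in sentences])

# Pad to the longest sentence; the result has shape (batch, max_len)
batch = pad_sequence(sentences, batch_first=True, padding_value=0)

# True for real tokens, False for padding; used to ignore padded
# positions when computing the loss
mask = torch.arange(batch.size(1)).unsqueeze(0) < lengths.unsqueeze(1)

print(batch.shape)  # torch.Size([3, 4])
```

With a mask like this, the per-token loss can be averaged over real tokens only, so padding does not affect the gradients.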

In [51]:

model.train()

for epoch in range(number_of_epochs):
    print(f"[Training Epoch {epoch}]")
    for i, index in enumerate(tqdm(np.random.permutation(len(train_data)))):
        count += 1
        data = train_data[index]

        # Reset the gradients before processing each data entry
        model.zero_grad()

        sentence_in = data['words']
        sentence_in = torch.tensor(sentence_in, dtype=torch.long)
        tags = data['tags']
        chars2 = data['chars']

        # Pad each word to the length of the longest word in the sentence
        chars2_length = [len(c) for c in chars2]
        char_maxl = max(chars2_length)
        chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')
        for j, c in enumerate(chars2):  # j avoids shadowing the outer loop index i
            chars2_mask[j, :chars2_length[j]] = c
        chars2_mask = torch.tensor(chars2_mask, dtype=torch.long)

        targets = torch.LongTensor(tags)

        # we calculate the negative log-likelihood for the predicted tags using the predefined function
        if use_gpu:
            neg_log_likelihood = model.neg_log_likelihood(sentence_in.cuda(), targets.cuda(), chars2_mask.cuda())
        else:
            neg_log_likelihood = model.neg_log_likelihood(sentence_in, targets, chars2_mask)
        loss += neg_log_likelihood.item() / len(data['words'])
        neg_log_likelihood.backward()

         # we use gradient clipping to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), gradient_clip)
        optimizer.step()

         # Storing loss
        if count % plot_every == 0:
            loss /= plot_every
            print(count, ': ', loss)
            if losses == []:
                losses.append(loss)
            losses.append(loss)
            loss = 0.0

    # Calculate model performance on the dev set after each epoch
    model.eval()
    best_dev_F, new_dev_F, save = evaluating(model, dev_data, best_dev_F, "Dev")
    if save:
        print("Saving Model to ", model_name)
        torch.save(model.state_dict(), model_name)

    all_F.append(new_dev_F)
    model.train()

    # Performing decay on the learning rate
    adjust_learning_rate(optimizer, lr=learning_rate/(1+decay_rate*count/len(train_data)))

plt.plot(losses)
plt.show()

model.load_state_dict(torch.load(model_name))

[Epoch 0]


  0%|          | 0/14041 [00:00<?, ?it/s]

2000 :  0.004135122453379485
4000 :  0.003045238210593582
6000 :  0.003235383934198727
8000 :  0.005122574703823977
10000 :  0.0037125269382747594
12000 :  0.0039194866480012774
14000 :  0.002970461088603627
[Evaluation]


  0%|          | 0/3250 [00:00<?, ?it/s]

Dev: new_F: 0.8756227167054135 best_F: -1.0 
Saving Model to  ./models/self-trained-model
[Epoch 1]


  0%|          | 0/14041 [00:00<?, ?it/s]

16000 :  0.002718712996897037
18000 :  0.0020489558989192852
20000 :  0.002730672256349012
22000 :  0.0024141069689381676
24000 :  0.0022730897092740126
26000 :  0.002657378973527973
28000 :  0.002210385241169689
[Evaluation]


  0%|          | 0/3250 [00:00<?, ?it/s]

Dev: new_F: 0.8846599371173259 best_F: 0.8756227167054135 
Saving Model to  ./models/self-trained-model
[Epoch 2]


  0%|          | 0/14041 [00:00<?, ?it/s]

30000 :  0.0020031995552588673
32000 :  0.00140047321152302
34000 :  0.0017967389758296586
36000 :  0.0018244125434712696
38000 :  0.001509135896878114
40000 :  0.0014881135090573207
42000 :  0.0012155722103242789
[Evaluation]


  0%|          | 0/3250 [00:00<?, ?it/s]

Dev: new_F: 0.8934582538761296 best_F: 0.8846599371173259 
Saving Model to  ./models/self-trained-model
[Epoch 3]


  0%|          | 0/14041 [00:00<?, ?it/s]

KeyboardInterrupt: 

### Model Testing

This is where readers can have some fun: try out how the trained model performs on sentences of your own. Feel free to play around.


##### LIVE: PRODUCTION!

In [55]:
model_testing_sentences = ['Jay is from India','Donald is the president of USA']

#parameters
lower=args['lower']

#preprocessing
final_test_data = []
for sentence in model_testing_sentences:
    s=sentence.split()
    str_words = [w for w in s]
    words = [word_to_id[lower_case(w,lower) if lower_case(w,lower) in word_to_id else '<UNK>'] for w in str_words]
    
     #  Skip characters that are not in the training set
    chars = [[char_to_id[c] for c in w if c in char_to_id] for w in str_words]
    
    final_test_data.append({
        'str_words': str_words,
        'words': words,
        'chars': chars,
    })

#prediction
predictions = []
print("Prediction:")
print("word : tag")
for data in final_test_data:
    words = data['str_words']
    chars2 = data['chars']

    #  Pad each word to the length of the longest word in the sentence
    chars2_length = [len(c) for c in chars2]
    char_maxl = max(chars2_length)
    chars2_mask = np.zeros((len(chars2_length), char_maxl), dtype='int')
    for i, c in enumerate(chars2):
        chars2_mask[i, :chars2_length[i]] = c
    chars2_mask = torch.tensor(chars2_mask, dtype=torch.long)
    dwords = torch.tensor(data['words'], dtype=torch.long)

    #  We are getting the predicted output from our model
    if use_gpu:
        _, predicted_id = model(dwords.cuda(), chars2_mask.cuda())
    else:
        _, predicted_id = model(dwords, chars2_mask)

    pred_chunks = get_chunks([idx.item() for idx in predicted_id], tag_to_id)
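    temp_list_tags=['NA']*len(words)
    for chunk_type, chunk_start, chunk_end in pred_chunks:
        # Label every token of the chunk, not only its first token
        temp_list_tags[chunk_start:chunk_end] = [chunk_type]*(chunk_end-chunk_start)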
        
    for word,tag in zip(words,temp_list_tags):
        print(word, ':', tag)
    print('\n')

Prediction:
word : tag
Jay : PER
is : NA
from : NA
India : LOC


Donald : PER
is : NA
the : NA
president : NA
of : NA
USA : LOC
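
The character-padding step is repeated in the evaluation, training and prediction cells. One possible cleanup is to factor it into a helper, sketched below. `pad_chars` is a hypothetical name that does not appear in the original code, and the character ids are made up; like the cells above, it pads with `0`.

```python
import numpy as np

def pad_chars(chars):
    """Right-pad each word's character ids with zeros to the longest word."""
    lengths = [len(c) for c in chars]
    padded = np.zeros((len(chars), max(lengths)), dtype='int')
    for i, c in enumerate(chars):
        padded[i, :lengths[i]] = c
    return padded

# Hypothetical character ids for a 3-word sentence
chars = [[5, 2, 9], [7], [3, 4]]
padded = pad_chars(chars)
print(padded.tolist())  # [[5, 2, 9], [7, 0, 0], [3, 4, 0]]
```

The result can be converted with `torch.tensor(padded, dtype=torch.long)` before it is passed to the model, as the cells above do.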




### References

1) Xuezhe Ma and Eduard Hovy. 2016. **End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF.** In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany. (https://arxiv.org/pdf/1603.01354.pdf)

2) Official PyTorch Tutorial: [**Advanced: Making Dynamic Decisions and the Bi-LSTM CRF**](http://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html#sphx-glr-beginner-nlp-advanced-tutorial-py)

3) [**Sequence Tagging with Tensorflow**](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html), using bi-LSTM + CRF with character embeddings for NER and POS, by Guillaume Genthial

4) GitHub Repository: [**Reference GitHub Repository**](https://github.com/jayavardhanr/End-to-end-Sequence-Labeling-via-Bi-directional-LSTM-CNNs-CRF-Tutorial)
