<a href="https://colab.research.google.com/github/alanwuha/ce7455-nlp/blob/master/assignment-2/Named_Entity_Recognition-LSTM-CNN-CRF-Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### End-to-end Sequence Labeling via Bi-Directional LSTM-CNNs-CRF

In this tutorial we demonstrate how to implement a state-of-the-art Bi-Directional LSTM-CNN-CRF architecture (published at ACL'16, [Link To Paper](https://www.aclweb.org/anthology/P16-1101/)) for Named Entity Recognition using PyTorch.

The main aim of this tutorial is to make the audience comfortable with PyTorch and to give a step-by-step walkthrough of the Bi-LSTM-CNN-CRF architecture for NER. Some familiarity with PyTorch (or any other deep learning framework) is a plus.

The agenda of this tutorial is as follows:
1. Getting Ready with the data
2. Network Definition. This includes
  - CNN Encoder for Character Level representation.
  - Bi-Directional LSTM for Word-Level Encoding.
  - Conditional Random Fields (CRF) for output decoding
3. Training
4. Model testing

#### Data Preparation

The paper uses the English data from the CoNLL 2003 shared task [1]. We will later apply more preprocessing steps to generate the tag, word and character mappings. The data set contains four types of named entities (PERSON, LOCATION, ORGANIZATION and MISC) and uses the BIO tagging scheme.

BIO tagging scheme:
```
  I - Word is inside a phrase of type TYPE
  B - If two phrases of the same type immediately follow each other, the first word of the second phrase will have a B-TYPE
  O - Word is not part of a phrase
```

Example of English-NER sentence available in the data:
```
  U.N.         NNP   I-NP   I-ORG
  official     NN    I-NP   O
  Ekeus        NNP   I-NP   I-PER
  heads        VBZ   I-VP   O
  for          IN    I-PP   O
  Baghdad      NNP   I-NP   I-LOC
  .            .     O      O
```
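
Each line of the data holds four whitespace-separated columns: the word, its part-of-speech tag, its syntactic chunk tag and its named entity tag. A minimal sketch of reading such lines is given here; `parse_conll_line` is a hypothetical helper introduced only for illustration and is not part of the tutorial code:

```python
def parse_conll_line(line):
    """Split one data line into its four columns (hypothetical helper)."""
    word, pos, chunk, ner = line.split()
    return {"word": word, "pos": pos, "chunk": chunk, "ner": ner}

sample = [
    "U.N.         NNP   I-NP   I-ORG",
    "official     NN    I-NP   O",
    "Ekeus        NNP   I-NP   I-PER",
]
rows = [parse_conll_line(line) for line in sample]

# Keep only the words that carry a named entity tag
entities = [(r["word"], r["ner"]) for r in rows if r["ner"] != "O"]
print(entities)  # [('U.N.', 'I-ORG'), ('Ekeus', 'I-PER')]
```

Only the first and last columns (word and named entity tag) are used for NER in the rest of this tutorial.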

Data Split (We use the same split as mentioned in paper):
```
  Training Data - eng.train
  Validation Data - eng.testa
  Testing Data - eng.testb
```

To get started we first import the necessary libraries.

In [0]:
from __future__ import print_function
from collections import OrderedDict

import torch
import torch.nn as nn
from torch.nn import init
from torch.autograd import Variable
from torch import autograd

import time
import _pickle as cPickle

import urllib
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 80
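try:
  plt.style.use('seaborn-pastel')
except OSError:
  plt.style.use('seaborn-v0_8-pastel') # style name used by newer matplotlib releases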

import os
import sys
import codecs
import re
import numpy as np

#### Define constants and parameters

We now define some constants and parameters that we will be using later.

In [0]:
# parameters for the Model
parameters = OrderedDict()
parameters['train'] = "./data/eng.train"  # Path to train file
parameters['dev'] = "./data/eng.testa"  # Path to dev file
parameters['test'] = "./data/eng.testb" # Path to test file
parameters['tag_scheme'] = "BIOES"  # BIO or BIOES
parameters['lower'] = True  # Boolean variable to control lowercasing of words
parameters['zeros'] = True # Boolean variable to control replacement of all digits by 0
parameters['char_dim'] = 30 # Char embedding dimension
parameters['word_dim'] = 100  # Token embedding dimension
parameters['word_lstm_dim'] = 200 # Token LSTM hidden layer size
parameters['word_bidirect'] = True # Use a bidirectional LSTM for words
parameters['embedding_path'] = "./data/glove.6B.100d.txt" # Location of pretrained embeddings
parameters['all_emb'] = 1 # Load all embeddings
parameters['crf'] = 1 # Use CRF (0 to disable)
parameters['dropout'] = 0.5 # Dropout on the input (0 = no dropout)
parameters['epoch'] = 50  # Number of epochs to run
parameters['weights'] = ""  # Path to pretrained weights from a previous run
parameters['name'] = "self-trained-model" # Model name
parameters['gradient_clip'] = 5.0
parameters['char_mode'] = "CNN"
models_path = "./models/" # Path to saved models

# GPU
parameters['use_gpu'] = torch.cuda.is_available() # GPU check
use_gpu = parameters['use_gpu']

parameters['reload'] = "./models/pre-trained-model"

# Constants
START_TAG = '<START>'
STOP_TAG = '<STOP>'

In [0]:
# paths to files
# to stored mapping file
mapping_file = './data/mapping.pkl'

# to stored model
name = parameters['name']
model_name = models_path + name # get_name(parameters)

if not os.path.exists(models_path):
  os.makedirs(models_path)

In [23]:
!rm -rf data
!mkdir data
!wget -P ./data https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/eng.testa
!wget -P ./data https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/eng.testb
!wget -P ./data https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/eng.train
!wget -P ./data https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/eng.train54019
!wget -P ./data https://raw.githubusercontent.com/TheAnig/NER-LSTM-CNN-Pytorch/master/data/mapping.pkl


#### Load data and preprocess

Firstly, the data is loaded from the train, dev and test files into a list of sentences.

Preprocessing:
```
  * All the digits in the words are replaced by 0
```

Why this preprocessing step?
```
  * For the Named Entity Recognition task, the specific digits in a number rarely help in predicting the entity. Replacing every digit by 0 collapses many distinct numeric tokens into a few shared patterns, so the model can concentrate on the more informative letters.
```
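
As a quick illustration of the effect, the sketch below applies the same digit replacement to a few made-up tokens. The name `zero_digits_demo` and the sample tokens are assumptions for this example only:

```python
import re

def zero_digits_demo(s):
    """Replace every digit in a string by a zero (illustrative copy of the idea)."""
    return re.sub(r"\d", "0", s)

tokens = ["1996-08-22", "3-1", "747", "Baghdad"]
normalized = [zero_digits_demo(t) for t in tokens]
print(normalized)  # ['0000-00-00', '0-0', '000', 'Baghdad']

# Distinct years collapse into a single vocabulary entry
years = ["1994", "1995", "1996"]
collapsed = {zero_digits_demo(y) for y in years}
print(collapsed)  # {'0000'}
```

Words without digits are left untouched, while dates and scores keep their shape but lose the exact values.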

In [0]:
def zero_digits(s):
  """
  Replace every digit in a string by a zero.
  """
  return re.sub(r'\d', '0', s)

def load_sentences(path, zeros):
  """
  Load sentences. A line must contain at least a word and its tag.
  Sentences are separated by empty lines.
  """
  sentences = []
  sentence = []
  # codecs.open decodes the file as UTF-8 while reading
  with codecs.open(path, 'r', 'utf8') as f:
    for line in f:
      # strip trailing whitespace, then optionally replace digits by 0
      line = zero_digits(line.rstrip()) if zeros else line.rstrip()
      if not line:
        # an empty line ends the current sentence
        if len(sentence) > 0:
          if 'DOCSTART' not in sentence[0][0]:
            sentences.append(sentence)
          sentence = []
      else:
        word = line.split()
        assert len(word) >= 2
        sentence.append(word)
  # keep the last sentence if the file does not end with an empty line
  if len(sentence) > 0:
    if 'DOCSTART' not in sentence[0][0]:
      sentences.append(sentence)
  return sentences

In [0]:
train_sentences = load_sentences(parameters['train'], parameters['zeros'])  # eng.train
test_sentences = load_sentences(parameters['test'], parameters['zeros'])    # eng.testb
dev_sentences = load_sentences(parameters['dev'], parameters['zeros'])      # eng.testa

#### Update tagging scheme

Different types of tagging schemes can be used for NER. We update the tags for the train, test and dev data according to `parameters['tag_scheme']`.

In the paper, the authors use the BIOES tagging scheme rather than BIO (which is used by the dataset). So we first need to convert the tags from BIO to BIOES.

BIOES tagging scheme:
```
  I - Word is inside a multi-word phrase of type TYPE (neither its first nor its last word)
  B - Word is the first word of a multi-word phrase of type TYPE
  O - Word is not part of a phrase
  E - Word is the last word of a multi-word phrase of type TYPE
  S - Word is a single-word phrase of type TYPE
```
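
To make the scheme concrete, here is a small self-contained sketch of the conversion on a made-up tag sequence. It assumes the input is already in BIO2 form (every phrase starts with a `B-` tag); `bio_to_bioes` is a hypothetical name used only for this example, and unlike the tutorial's conversion function it also compares the entity type of the next tag:

```python
def bio_to_bioes(tags):
    """Convert BIO2 tags to BIOES (illustrative sketch, assumes valid BIO2 input)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # does the phrase go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out

bio = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"]
bioes = bio_to_bioes(bio)
print(bioes)
# ['B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC', 'B-ORG', 'E-ORG']
```

The three-word PER phrase becomes `B`, `I`, `E`, the single-word LOC phrase becomes `S`, and the two-word ORG phrase becomes `B`, `E`.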

In [0]:
def iob2(tags):
  """
  Check that tags have a valid BIO format.
  Tags in BIO1 format are converted to BIO2.
  """
  for i, tag in enumerate(tags):
    if tag == 'O':
      continue
    split = tag.split('-')
    if len(split) != 2 or split[0] not in ['I', 'B']:
      return False
    if split[0] == 'B':
      continue
    elif i == 0 or tags[i - 1] == 'O':  # conversion IOB1 to IOB2
      tags[i] = 'B' + tag[1:]
    elif tags[i-1][1:] == tag[1:]:
      continue
    else: # conversion IOB1 to IOB2
      tags[i] = 'B' + tag[1:]
  return True

def iob_iobes(tags):
  """
  the function is used to convert
  BIO -> BIOES tagging
  """
  new_tags = []
  for i, tag in enumerate(tags):
    if tag == 'O':
      new_tags.append(tag)
    elif tag.split('-')[0] == 'B':
      if i + 1 != len(tags) and \
        tags[i + 1].split('-')[0] == 'I':
        new_tags.append(tag)
      else:
        new_tags.append(tag.replace('B-', 'S-'))
    elif tag.split('-')[0] == 'I':
      if i + 1 < len(tags) and \
        tags[i + 1].split('-')[0] == 'I':
        new_tags.append(tag)
      else:
        new_tags.append(tag.replace('I-', 'E-'))
    else:
      raise Exception('Invalid IOB format!')
  return new_tags

def update_tag_scheme(sentences, tag_scheme):
  """
  Check and update sentences tagging scheme to BIO2 or BIOES.
  Only BIO1 and BIO2 schemes are accepted for input data.
  """
  for i, s in enumerate(sentences):
    tags = [w[-1] for w in s]
    # Check that tags are given in the BIO format (converts BIO1 to BIO2 in place)
    if not iob2(tags):
      s_str = '\n'.join(' '.join(w) for w in s)
      raise Exception('Sentences should be given in BIO format! ' +
                      ' Please check sentence %i:\n%s' % (i, s_str))
    if tag_scheme == 'BIO':
      # keep the BIO2 tags produced by iob2
      for word, new_tag in zip(s, tags):
        word[-1] = new_tag
    elif tag_scheme == 'BIOES':
      new_tags = iob_iobes(tags)
      for word, new_tag in zip(s, new_tags):
        word[-1] = new_tag
    else:
      raise Exception('Wrong tagging scheme!')

In [57]:
print(train_sentences[0])
print(dev_sentences[0])
print(test_sentences[0])

[['EU', 'NNP', 'I-NP', 'I-ORG'], ['rejects', 'VBZ', 'I-VP', 'O'], ['German', 'JJ', 'I-NP', 'I-MISC'], ['call', 'NN', 'I-NP', 'O'], ['to', 'TO', 'I-VP', 'O'], ['boycott', 'VB', 'I-VP', 'O'], ['British', 'JJ', 'I-NP', 'I-MISC'], ['lamb', 'NN', 'I-NP', 'O'], ['.', '.', 'O', 'O']]
[['CRICKET', 'NNP', 'I-NP', 'O'], ['-', ':', 'O', 'O'], ['LEICESTERSHIRE', 'NNP', 'I-NP', 'I-ORG'], ['TAKE', 'NNP', 'I-NP', 'O'], ['OVER', 'IN', 'I-PP', 'O'], ['AT', 'NNP', 'I-NP', 'O'], ['TOP', 'NNP', 'I-NP', 'O'], ['AFTER', 'NNP', 'I-NP', 'O'], ['INNINGS', 'NNP', 'I-NP', 'O'], ['VICTORY', 'NN', 'I-NP', 'O'], ['.', '.', 'O', 'O']]
[['SOCCER', 'NN', 'I-NP', 'O'], ['-', ':', 'O', 'O'], ['JAPAN', 'NNP', 'I-NP', 'I-LOC'], ['GET', 'VB', 'I-VP', 'O'], ['LUCKY', 'NNP', 'I-NP', 'O'], ['WIN', 'NNP', 'I-NP', 'O'], [',', ',', 'O', 'O'], ['CHINA', 'NNP', 'I-NP', 'I-PER'], ['IN', 'IN', 'I-PP', 'O'], ['SURPRISE', 'DT', 'I-NP', 'O'], ['DEFEAT', 'NN', 'I-NP', 'O'], ['.', '.', 'O', 'O']]


In [0]:
update_tag_scheme(train_sentences, parameters['tag_scheme'])
update_tag_scheme(dev_sentences, parameters['tag_scheme'])
update_tag_scheme(test_sentences, parameters['tag_scheme'])

In [59]:
print(train_sentences[0])
print(dev_sentences[0])
print(test_sentences[0])

[['EU', 'NNP', 'I-NP', 'S-ORG'], ['rejects', 'VBZ', 'I-VP', 'O'], ['German', 'JJ', 'I-NP', 'S-MISC'], ['call', 'NN', 'I-NP', 'O'], ['to', 'TO', 'I-VP', 'O'], ['boycott', 'VB', 'I-VP', 'O'], ['British', 'JJ', 'I-NP', 'S-MISC'], ['lamb', 'NN', 'I-NP', 'O'], ['.', '.', 'O', 'O']]
[['CRICKET', 'NNP', 'I-NP', 'O'], ['-', ':', 'O', 'O'], ['LEICESTERSHIRE', 'NNP', 'I-NP', 'S-ORG'], ['TAKE', 'NNP', 'I-NP', 'O'], ['OVER', 'IN', 'I-PP', 'O'], ['AT', 'NNP', 'I-NP', 'O'], ['TOP', 'NNP', 'I-NP', 'O'], ['AFTER', 'NNP', 'I-NP', 'O'], ['INNINGS', 'NNP', 'I-NP', 'O'], ['VICTORY', 'NN', 'I-NP', 'O'], ['.', '.', 'O', 'O']]
[['SOCCER', 'NN', 'I-NP', 'O'], ['-', ':', 'O', 'O'], ['JAPAN', 'NNP', 'I-NP', 'S-LOC'], ['GET', 'VB', 'I-VP', 'O'], ['LUCKY', 'NNP', 'I-NP', 'O'], ['WIN', 'NNP', 'I-NP', 'O'], [',', ',', 'O', 'O'], ['CHINA', 'NNP', 'I-NP', 'S-PER'], ['IN', 'IN', 'I-PP', 'O'], ['SURPRISE', 'DT', 'I-NP', 'O'], ['DEFEAT', 'NN', 'I-NP', 'O'], ['.', '.', 'O', 'O']]


#### Create Mappings for Words, Characters and Tags

After updating the tag scheme, we have a list of sentences, each of which is a list of words along with their modified tags. We now map every unique word, tag and character in the vocabulary to an integer ID. To do this, we first create helper functions that build these mappings for us.

#### Why is mapping important?

These indices for words, tags and characters let us look up embeddings and employ matrix (tensor) operations inside the neural network architecture, which is considerably faster than working with raw strings.
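
The idea can be sketched on a toy corpus as follows. All names prefixed with `toy_` and the two sample sentences are assumptions made for this illustration; the ordering (decreasing frequency, ties broken alphabetically) follows the same convention as the mapping functions in this tutorial:

```python
from collections import Counter

# Toy corpus: two already-lowercased sentences (illustrative data only)
toy_sentences = [["eu", "rejects", "german", "call"],
                 ["german", "call", "to", "boycott"]]

# Count items, then sort by decreasing frequency (ties broken alphabetically)
counts = Counter(w for s in toy_sentences for w in s)
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
toy_word_to_id = {w: i for i, (w, _) in enumerate(ordered)}
toy_id_to_word = {i: w for w, i in toy_word_to_id.items()}

# A sentence becomes a list of integers ...
ids = [toy_word_to_id[w] for w in toy_sentences[0]]
print(ids)  # [3, 4, 1, 0]

# ... which can index rows of an embedding table (here a toy 2-dimensional one)
toy_embeddings = [[float(i), float(i) / 10] for i in range(len(toy_word_to_id))]
vectors = [toy_embeddings[i] for i in ids]
print(vectors[0])  # [3.0, 0.3]
```

In the real model the embedding table is a learned tensor, but the lookup by integer ID works the same way.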

In [0]:
def create_dico(item_list):
  """
  Create a dictionary of items from a list of list of items.
  """
  assert type(item_list) is list
  dico = {}
  for items in item_list:
    for item in items:
      if item not in dico:
        dico[item] = 1
      else:
        dico[item] += 1
  return dico

def create_mapping(dico):
  """
  Create a mapping (item to ID / ID to item) from a dictionary.
  Items are ordered by decreasing frequency.
  """
  sorted_items = sorted(dico.items(), key=lambda x: (-x[1], x[0]))
  id_to_item = {i: v[0] for i, v in enumerate(sorted_items)}
  item_to_id = {v: k for k, v in id_to_item.items()}
  return item_to_id, id_to_item

def word_mapping(sentences, lower):
  """
  Create a dictionary and a mapping of words, sorted by frequency.
  """
  # list of lists of words (lowercased if requested); x[0] is the word in each token entry
  words = [[x[0].lower() if lower else x[0] for x in s] for s in sentences]
  dico = create_dico(words)
  dico['<UNK>'] = 10000000 # large count so the <UNK> token for unknown words sorts first
  word_to_id, id_to_word = create_mapping(dico)
  print("Found %i unique words (%i in total)" % (len(dico), sum(len(x) for x in words)))
  return dico, word_to_id, id_to_word

def char_mapping(sentences):
  """
  Create a dictionary and mapping of characters, sorted by frequency.
  """
  chars = ["".join([w[0] for w in s]) for s in sentences]
  dico = create_dico(chars)
  char_to_id, id_to_char = create_mapping(dico)
  print("Found %i unique characters" % len(dico))
  return dico, char_to_id, id_to_char

def tag_mapping(sentences):
  """
  Create a dictionary and a mapping of tags, sorted by frequency.
  """
  tags = [[word[-1] for word in s] for s in sentences]
  dico = create_dico(tags)
  dico[START_TAG] = -1
  dico[STOP_TAG] = -2
  tag_to_id, id_to_tag = create_mapping(dico)
  print("Found %i unique named entity tags" % len(dico))
  return dico, tag_to_id, id_to_tag

In [71]:
dico_words, word_to_id, id_to_word = word_mapping(train_sentences, parameters['lower'])
dico_chars, char_to_id, id_to_char = char_mapping(train_sentences)
dico_tags, tag_to_id, id_to_tag = tag_mapping(train_sentences)

Found 17494 unique words (203622 in total)
Found 75 unique characters
Found 19 unique named entity tags
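
Once such a mapping exists, a sentence can be encoded as a list of IDs, with words missing from the vocabulary falling back to `<UNK>`. The sketch below uses a tiny hand-written mapping; `toy_word_to_id` and `encode` are hypothetical names for illustration and are not the tutorial's own data preparation code:

```python
# Hand-written toy mapping; <UNK> is given ID 0 here for illustration
toy_word_to_id = {"<UNK>": 0, "the": 1, "eu": 2, "rejects": 3}

def encode(words, word_to_id, lower=True):
    """Map words to IDs, using the <UNK> ID for out-of-vocabulary words."""
    unk = word_to_id["<UNK>"]
    return [word_to_id.get(w.lower() if lower else w, unk) for w in words]

ids = encode(["EU", "rejects", "Brussels"], toy_word_to_id)
print(ids)  # [2, 3, 0]
```

Here "Brussels" is not in the toy vocabulary, so it is mapped to the `<UNK>` ID.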
