# Named Entity Recognition - Model Training

If you are new to this project, I suggest reading the [Data Preprocessing notebook](https://github.com/akash1309/Named-Entity-Recognition/blob/master/Data_Preprocessing.ipynb) first.


Entity tags are encoded with the BIO annotation scheme. Each token of an entity is labeled with B or I so that multi-word entities can be detected: B denotes the beginning of an entity and I denotes a token inside an entity.
O denotes all other words, which are not named entities.
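
To make the scheme concrete, here is a small sketch that converts entity spans into BIO tags. The helper name `spans_to_bio`, the example sentence, and the `per`/`geo` annotations are illustrative assumptions, not something taken from the dataset files.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans into one BIO tag per token.

    `end` is exclusive. Tokens outside every span get the tag 'O'.
    """
    tags = ['O'] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = 'B-' + entity_type      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + entity_type      # remaining tokens of the entity
    return tags


tokens = ['John', 'Smith', 'lives', 'in', 'New', 'York', '.']
spans = [(0, 2, 'per'), (4, 6, 'geo')]        # hypothetical annotations
bio_tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

The two-word entity "John Smith" becomes `B-per I-per`, which is how the scheme keeps adjacent tokens of one entity together.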

## 1) Importing the libraries

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import spacy
import nltk
import os
import json
import warnings
import seaborn
import keras
from torch.utils.data import DataLoader,TensorDataset

In [0]:
# Special tokens for the vocab

PAD_WORD = '<pad>'
PAD_TAG = '0'
UNK_WORD = 'UNK'


## 2) Loading the text data

- In NLP, the input is `text`, which a model cannot process directly. So our first step is to build a dictionary that stores a `numerical value` for each `word`.

- In NLP applications, a sentence is represented by the sequence of indices of the words in the sentence. 
      For example if our vocabulary is {'is':1, 'John':2, 'Where':3, '.':4, '?':5} 
      then the sentence “Where is John ?” is represented as [3,1,2,5]. 

- We read the words.txt file and populate our vocabulary:

We will work with the full dataset; you can use the small one instead if you prefer.

In [0]:
# for words.txt

word_filepath = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/big/words.txt'
word_to_idx = {}
with open(word_filepath,'r') as f:

  for i,word in enumerate(set(f.read().splitlines())):
    word_to_idx[word] = i+2   # indices 0 and 1 are reserved for the padding and unknown tokens

word_to_idx['<pad>'] = 0  # padding
word_to_idx['UNK'] = 1    # unknown

idx_to_word = {index: word for word, index in word_to_idx.items()}
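
As a quick illustration of how such a mapping is used, the sketch below encodes and decodes a sentence with a toy vocabulary. The names `toy_word_to_idx`, `encode` and `decode` are assumptions made for this example; the toy vocabulary only mirrors the convention used above (index 0 for `<pad>`, index 1 for `UNK`).

```python
# Toy vocabulary following the same convention: 0 = padding, 1 = unknown word
toy_word_to_idx = {'<pad>': 0, 'UNK': 1, 'is': 2, 'John': 3, 'Where': 4, '?': 5}
toy_idx_to_word = {index: word for word, index in toy_word_to_idx.items()}


def encode(sentence, vocab):
    """Replace each token by its index, falling back to the UNK index."""
    return [vocab.get(token, vocab['UNK']) for token in sentence.split(' ')]


def decode(indices, reverse_vocab):
    """Map indices back to tokens and join them with spaces."""
    return ' '.join(reverse_vocab[i] for i in indices)


encoded = encode('Where is Mary ?', toy_word_to_idx)
decoded = decode(encoded, toy_idx_to_word)
print(encoded)   # [4, 2, 1, 5]
print(decoded)   # Where is UNK ?
```

"Mary" is not in the toy vocabulary, so it maps to the `UNK` index and cannot be recovered when decoding.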

In [23]:
print(word_to_idx['<pad>'])
print(word_to_idx['UNK'])
#print(idx_to_word)
print(len(word_to_idx))

0
1
35180


In a similar way, we build a mapping `tag_to_idx` from the labels in `tags.txt` to indices. Index `0` is reserved for padding, so the real tags receive indices in the range `[1, ..., NUM_TAGS]`.

In [0]:
# for tags.txt

tags_filepath = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/big/tags.txt'
tag_to_idx = {}

with open(tags_filepath,'r') as f:

  for i,word in enumerate(set(f.read().splitlines())):
    tag_to_idx[word] = i+1 # index 0 is reserved for padding
tag_to_idx['<pad>'] = 0 # padding
idx_to_tag = {index: word for word, index in tag_to_idx.items()}

In [25]:
print(tag_to_idx['<pad>'])
print(idx_to_tag[1])
print(len(tag_to_idx))

0
B-per
18


In addition to the words read from the English sentences, the vocabulary contains two special tokens: `UNK`, which represents any word that is not present in the vocabulary, and `<pad>`, which is used as a filler at the end of a sentence when the sentences in a batch have unequal lengths.

We are now ready to load our data. We read the sentences in our dataset and convert them to a sequence of indices by looking up the vocabulary:

In [0]:
# Function for sentences.txt file 

def encode_sentences(file_path):

  sentences = []
  
  with open(file_path) as f:
    for sent in f.read().splitlines():
      #replace each token by its index if it is in vocab else use index of UNK
      s = []
      for token in sent.split(' '):
        if token in word_to_idx:
          s.append(word_to_idx[token])
        else:
          s.append(word_to_idx['UNK'])  

      sentences.append(s)

  return sentences    


In [0]:
# Function for labels.txt file 

def encode_labels(file_path):

  labels = []
  with open(file_path) as f:
    for sentence in f.read().splitlines():
      l = [tag_to_idx[label] for label in sentence.split(' ')]
      labels.append(l)

  return labels   

In [28]:
# Lets apply the transformation

file_sentences = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/big/sentences.txt'
file_labels = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/big/labels.txt'

sentences = encode_sentences(file_sentences)
labels = encode_labels(file_labels)

print(len(sentences))
print("-------------")
print(len(labels))
# print(sentences[0])
# print(labels[0])

47959
-------------
47959
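
Matching totals do not guarantee that every sentence has exactly one tag per token. A minimal sanity check could look like the sketch below; `find_misaligned` and the toy lists are hypothetical names and data used only for illustration.

```python
def find_misaligned(encoded_sentences, encoded_labels):
    """Return the indices of examples whose token and tag counts differ."""
    return [
        i
        for i, (sent, tags) in enumerate(zip(encoded_sentences, encoded_labels))
        if len(sent) != len(tags)
    ]


toy_sentences = [[4, 2, 3, 5], [3, 2, 1]]
toy_labels = [[1, 1, 2, 1], [2, 1]]       # second example is missing one tag
bad = find_misaligned(toy_sentences, toy_labels)
print(bad)   # [1]
```

On the real data the call would be `find_misaligned(sentences, labels)`, and an empty list would mean every sentence lines up with its labels.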


## 3) Padding Sequences

- This is where it gets fun. When we sample a batch of sentences, not all the sentences usually have the same length. Let’s say we have a batch of sentences `batch_sentences` that is a Python list of lists, with its corresponding `batch_tags` which has a tag for each token in `batch_sentences`. 

- We pad the sentences at the end. We take the length of the longest sentence as the target length and append `<pad>` to every shorter sentence so that all sequences have the same length. The labels are padded in the same way: the index of the `<pad>` tag is appended to every label sequence so that it matches its sentence.

In [0]:
batch_sentences = sentences.copy()
batch_tags = labels.copy()

In [0]:
#compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_sentences])

#prepare numpy arrays for the data and labels, initialized with the index of '<pad>'
#(0 in both word_to_idx and tag_to_idx), so every position that is not overwritten
#below is treated as padding

batch_data = word_to_idx['<pad>']*np.ones((len(batch_sentences), batch_max_len))
batch_labels = tag_to_idx['<pad>']*np.ones((len(batch_sentences), batch_max_len))

#copy the data to the numpy array
for j in range(len(batch_sentences)):
  cur_len = len(batch_sentences[j])
  batch_data[j][:cur_len] = batch_sentences[j]
  batch_labels[j][:cur_len] = batch_tags[j]
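
The same idea can be shown on a toy batch in plain Python. The sketch below also returns a mask that marks real tokens with 1 and padding with 0, which is one common way to ignore padded positions later. The function name `pad_batch` and the mask itself are assumptions added for illustration; they are not part of the notebook's pipeline.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one.

    Returns the padded sequences and a mask with 1 for real tokens
    and 0 for padded positions.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


toy_batch = [[4, 2, 3, 5], [3, 2]]
padded, mask = pad_batch(toy_batch)
print(padded)   # [[4, 2, 3, 5], [3, 2, 0, 0]]
print(mask)     # [[1, 1, 1, 1], [1, 1, 0, 0]]
```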


In [53]:
# print(batch_data[0])
print(len(batch_data[0]))
print(type(batch_data))

104
<class 'numpy.ndarray'>


In [55]:
# print(batch_labels[0])
print(len(batch_labels[0]))
print(type(batch_labels))

104
<class 'numpy.ndarray'>


## 4) One hot encoding on labels

The toy example below follows this Stack Overflow answer: https://stackoverflow.com/questions/29831489/convert-array-of-indices-to-1-hot-encoded-numpy-array

In [47]:
a = np.array([1, 0, 3])
b = np.zeros((a.size, a.max()+1))
b[np.arange(a.size),a] = 1
print(b)

[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]]


In [0]:
labels = batch_labels.copy()

In [57]:
"""
for a in labels:
  b = np.zeros(a.size,a.max() + 1)
  b[np.arrange(a.size),a] = 1
  print(b)
  break
"""  

'\nfor a in labels:\n  b = np.zeros(a.size,a.max() + 1)\n  b[np.arrange(a.size),a] = 1\n  print(b)\n  break\n'

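The commented-out loop fails for two reasons. `np.zeros` expects the shape as a single tuple, so in `np.zeros(a.size, a.max() + 1)` the second argument is interpreted as a `dtype`, which raises `TypeError: data type not understood`. In addition, `np.arrange` is a typo for `np.arange`, and the padded labels are stored as floats, so they cannot be used as indices without casting to integers.
Uncomment the code above to see the error.

Keras provides a helper, `to_categorical`, that performs this conversion; let's try that.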

In [60]:
n_tags = len(idx_to_tag)
n_tags

18

In [0]:
from keras.utils import to_categorical
# One-hot encode each padded label sequence
n_tags = len(tag_to_idx)  # total tags + <pad>; not len(labels[0]), which is the padded sequence length
y = [to_categorical(i, num_classes=n_tags) for i in labels]


In [67]:
y[0]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)
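
As an alternative to the Keras helper, the same encoding can be sketched in NumPy alone by indexing an identity matrix with the label array. The function name `one_hot` and the toy labels are assumptions for illustration; the cast to integers is needed because the padded label array holds floats.

```python
import numpy as np


def one_hot(indices, num_classes):
    """One-hot encode an array of class indices of any shape.

    The result has the input shape plus a trailing axis of size num_classes.
    """
    indices = np.asarray(indices, dtype=np.int64)   # float labels must be cast first
    return np.eye(num_classes, dtype=np.float32)[indices]


# Two padded label sequences of length 3, with 0 as the padding index
toy_labels = np.array([[1., 2., 0.],
                       [3., 0., 0.]])
encoded = one_hot(toy_labels, num_classes=4)
print(encoded.shape)   # (2, 3, 4)
print(encoded[0])
```

Each row of the identity matrix is the one-hot vector for its own index, so fancy indexing picks the right vector for every label in one step.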

## 5) Batching Using TensorDataset and DataLoader


In [64]:
len(sentences)

47959

In [0]:
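X = torch.tensor(batch_data, dtype=torch.long)    # batch_data is a float array, so cast explicitly
y = torch.tensor(np.array(y), dtype=torch.long)   # stack the list of one-hot arrays before converting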

In [0]:
data = TensorDataset(X,y)

In [74]:
data

<torch.utils.data.dataset.TensorDataset at 0x7fd588ef90f0>

In [0]:
dataset = DataLoader(data,batch_size=32,shuffle=True)

In [76]:
dataset

<torch.utils.data.dataloader.DataLoader at 0x7fd588ef92b0>

In [80]:
for batch,sample in enumerate(dataset):
  print(batch)
  print("<---------->")
  print(sample)
  break

0
<---------->
[tensor([[32729, 34818, 14341,  ...,     0,     0,     0],
        [28198,  7538, 29931,  ...,     0,     0,     0],
        [31485,  7516,  8059,  ...,     0,     0,     0],
        ...,
        [30433, 17858,  7516,  ...,     0,     0,     0],
        [30433, 25062,  9279,  ...,     0,     0,     0],
        [ 2330,  2417, 17471,  ...,     0,     0,     0]]), tensor([[[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [1, 0, 0,  ..., 0, 0, 0],
         [1, 0, 0,  ..., 0, 0, 0],
         [1, 0, 0,  ..., 0, 0, 0]],

        [[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [1, 0, 0,  ..., 0, 0, 0],
         [1, 0, 0,  ..., 0, 0, 0],
         [1, 0, 0,  ..., 0, 0, 0]],

        [[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [1, 0, 0,  ..., 0, 0, 0],
         [1, 0,