# Named Entity Recognition - Model Training

If you are new to this, I suggest reading the [Data Preprocessing File](https://github.com/akash1309/Named-Entity-Recognition/blob/master/Data_Preprocessing.ipynb) first.

## 1) Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import spacy
import nltk
import os
import json
import warnings
import seaborn


## 2) Loading the text data

- In NLP, the input is `text`, which a model cannot process directly. So, our first step is to build a dictionary that stores a `numerical value` (an index) for each `word`.

- In NLP applications, a sentence is represented by the sequence of indices of its words.
      For example, if our vocabulary is {'is':1, 'John':2, 'Where':3, '.':4, '?':5},
      then the sentence “Where is John ?” is represented as [3,1,2,5].

- We read the `words.txt` file and populate our vocabulary:

In [0]:
# for words.txt

word_filepath = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/words.txt'
word_vocab = {}
with open(word_filepath,'r') as f:

  for i,word in enumerate(f.read().splitlines()):
    word_vocab[word] = i
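
The toy example from the list above can be checked with a few lines of Python. This is a minimal sketch: `toy_vocab` and `encode` are illustrative names, and the indices come from the example sentence, not from the real `words.txt`.

```python
# Toy vocabulary copied from the example above (not the real words.txt contents).
toy_vocab = {'is': 1, 'John': 2, 'Where': 3, '.': 4, '?': 5}

def encode(sentence, vocab):
    """Map each space-separated token to its index in vocab."""
    return [vocab[token] for token in sentence.split(' ')]

encoded = encode('Where is John ?', toy_vocab)
print(encoded)  # [3, 1, 2, 5]
```

Note that this sketch raises a `KeyError` for any word missing from the vocabulary; the `UNK` token described further down is what handles that case.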


In [0]:
#vocab

In a similar way, we load a mapping `tag_vocab` from the `labels` listed in `tags.txt` to indices. Doing so gives us label indices in the range `[0,1,...,NUM_TAGS-1]`.

In [0]:
# for tags.txt

tags_filepath = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/tags.txt'
tag_vocab = {}

with open(tags_filepath,'r') as f:

  # each line of tags.txt holds one tag; its line number becomes its index
  for i,tag in enumerate(f.read().splitlines()):
    tag_vocab[tag] = i

In [8]:
tag_vocab

{'B-art': 8,
 'B-geo': 1,
 'B-gpe': 2,
 'B-org': 5,
 'B-per': 3,
 'B-tim': 7,
 'I-art': 9,
 'I-geo': 4,
 'I-org': 6,
 'I-per': 10,
 'O': 0}
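
As a quick sanity check on the output above, the sketch below confirms that the indices cover `[0, NUM_TAGS-1]` without gaps and builds the reverse lookup that is typically needed to turn predicted indices back into tag strings. The names `idx_to_tag` and `decode_labels` are hypothetical helpers introduced here, not part of the original notebook.

```python
# Copy of the mapping printed above, so that this cell runs on its own.
tag_vocab = {'O': 0, 'B-geo': 1, 'B-gpe': 2, 'B-per': 3, 'I-geo': 4,
             'B-org': 5, 'I-org': 6, 'B-tim': 7, 'B-art': 8, 'I-art': 9,
             'I-per': 10}

NUM_TAGS = len(tag_vocab)

# Every index from 0 to NUM_TAGS-1 should appear exactly once.
assert sorted(tag_vocab.values()) == list(range(NUM_TAGS))

# Hypothetical reverse mapping: index -> tag string.
idx_to_tag = {idx: tag for tag, idx in tag_vocab.items()}

def decode_labels(indices):
    """Convert a list of tag indices back into tag strings."""
    return [idx_to_tag[i] for i in indices]

decoded = decode_labels([3, 10, 0])
print(decoded)  # ['B-per', 'I-per', 'O']
```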

In addition to words read from English sentences, `words.txt` contains two special tokens: an `UNK` token to represent any word that is not present in the vocabulary, and a `PAD` token that is used as a filler token at the end of a sentence when one batch has sentences of unequal lengths.
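
To make the roles of the two special tokens concrete, here is a small sketch of how `UNK` and `PAD` might be used when building a batch. Everything in it is an assumption for illustration: the toy vocabulary, the positions of `PAD` and `UNK` in it, and the helper names `encode_with_unk` and `pad_batch` are not taken from the real `words.txt` or from the rest of this notebook.

```python
# Toy vocabulary; the real indices of PAD and UNK depend on words.txt.
toy_vocab = {'PAD': 0, 'UNK': 1, 'Where': 2, 'is': 3, 'John': 4, '?': 5}

def encode_with_unk(sentence, vocab):
    """Replace each token by its index, falling back to UNK for unseen words."""
    return [vocab.get(token, vocab['UNK']) for token in sentence.split(' ')]

def pad_batch(sequences, pad_index):
    """Right-pad every sequence with pad_index up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

sentences = ['Where is John ?', 'Where is Mary']
encoded = [encode_with_unk(s, toy_vocab) for s in sentences]
batch = pad_batch(encoded, toy_vocab['PAD'])
print(batch)  # [[2, 3, 4, 5], [2, 3, 1, 0]]
```

In the second row, `Mary` is not in the toy vocabulary and becomes `UNK` (1), and the trailing 0 is the `PAD` filler that brings the row to the batch's maximum length.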

We are now ready to load our data. We read the sentences in our dataset (either train, validation or test) and convert them to a sequence of indices by looking up the vocabulary:

In [0]:
# Function for sentences.txt file 

def encode_sentences(file_path):

  sentences = []
  
  with open(file_path) as f:
    for sent in f.read().splitlines():
      #replace each token by its index if it is in vocab else use index of UNK
      s = []
      for token in sent.split(' '):
        if token in word_vocab:
          s.append(word_vocab[token])
        else:
          s.append(word_vocab['UNK'])  

      sentences.append(s)

  return sentences    


In [0]:
# Function for labels.txt file 

def encode_labels(file_path):

  labels = []
  with open(file_path) as f:
    for sentence in f.read().splitlines():
      l = [tag_vocab[label] for label in sentence.split(' ')]
      labels.append(l)

  return labels   

In [0]:
# For train file

train_file_sentences = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/train/sentences.txt'
train_file_labels = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/train/labels.txt'

train_sentences = encode_sentences(train_file_sentences)
train_labels = encode_labels(train_file_labels)

# print(train_sentences)
# print("-------------")
# print(train_labels)


In [0]:
# For validation file

val_file_sentences = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/val/sentences.txt'
val_file_labels = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/val/labels.txt'

val_sentences = encode_sentences(val_file_sentences)
val_labels = encode_labels(val_file_labels)

# print(val_sentences)
# print("-------------")
# print(val_labels)


In [0]:
# For test file

test_file_sentences = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/test/sentences.txt'
test_file_labels = 'drive/My Drive/Pytorch_DataSet/Named Entity Recognition/small/test/labels.txt'

test_sentences = encode_sentences(test_file_sentences)
test_labels = encode_labels(test_file_labels)

# print(test_sentences)
# print("-------------")
# print(test_labels)
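
Since every token needs exactly one label, it can be worth checking that the encoded sentences and labels line up before training. The sketch below uses a hypothetical helper, `find_misaligned`, and toy data; in the notebook it could be called on pairs such as `train_sentences` and `train_labels`.

```python
def find_misaligned(sentences, labels):
    """Return the indices of examples whose token and label counts differ."""
    if len(sentences) != len(labels):
        raise ValueError('number of sentences and label sequences differ')
    return [i for i, (s, l) in enumerate(zip(sentences, labels))
            if len(s) != len(l)]

# Toy data: the second example has three tokens but only two labels.
toy_sentences = [[3, 1, 2, 5], [7, 8, 9]]
toy_labels = [[0, 0, 3, 0], [0, 1]]

bad = find_misaligned(toy_sentences, toy_labels)
print(bad)  # [1]
```

An empty list means each sentence has as many labels as tokens.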
