## ENCODER DECODER NETWORK AND TEACHER FORCING

**References:**

Tutorials given in the competition document: [Competition Link](https://docs.google.com/document/d/1p74wG-bECCgbpyq5x_x2QJrf5RSf9FnMLGSAiyUkHLo/edit)

PyTorch NMT Tutorial : [Pytorch NMT](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

GitHub page, used to understand batch processing in PyTorch: [Github Pengyuchen](https://github.com/pengyuchen/PyTorch-Batch-Seq2seq)

A few Stack Overflow links were consulted for regex examples and for fixing some bugs.

The whole code is divided into two sections:

a)  Functions: definitions of all the required procedures.

b)  Execution: using those functions.

Expand or collapse to view each section and subsection.

Observations:
1.   Using the default learning settings works better with Adam.
2.   Training is run in blocks of 20 epochs at a time to avoid session timeouts.
3.   Saving and reloading the models does not work reliably because of randomness throughout the pipeline: the Language word2index and index2word mappings assign a different index to each word on every run. All randomness needs to be removed before the models can be saved and reused.
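One possible way around observation 3, assuming the vocabulary mapping is what changes between runs, is to build the index in a fixed (sorted) order and store it next to the model weights. The sketch below uses only the standard library; `build_vocab`, `save_vocab`, `load_vocab` and the JSON file name are hypothetical helpers, not part of this notebook, and the sketch leaves out the notebook's filtering of words that occur only once.

```python
import json
import os
import random
import tempfile

SPECIALS = ("START_TOKEN", "END_TOKEN", "PAD_TOKEN", "UNK_TOKEN")

def build_vocab(token_lists, specials=SPECIALS):
    """Assign indices in sorted order so the mapping never depends on input order."""
    words = sorted({w for tokens in token_lists for w in tokens} - set(specials))
    word2index = {tok: i for i, tok in enumerate(specials)}
    for w in words:
        word2index[w] = len(word2index)
    return word2index

def save_vocab(word2index, path):
    """Write the mapping as JSON so it can be restored alongside the weights."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(word2index, fp, ensure_ascii=False)

def load_vocab(path):
    """Read back a mapping written by save_vocab."""
    with open(path, "r", encoding="utf-8") as fp:
        return json.load(fp)

random.seed(0)  # fixes Python-level randomness such as random.random()

sentences = [["मैं", "घर", "जा", "रहा", "हूँ"], ["घर", "बड़ा", "है"]]
vocab = build_vocab(sentences)
vocab_path = os.path.join(tempfile.mkdtemp(), "hindi_vocab.json")  # hypothetical file name
save_vocab(vocab, vocab_path)
restored = load_vocab(vocab_path)
```

Reloading `restored` before calling `load_state_dict` would keep every embedding row attached to the same word. Seeding PyTorch as well (for example with `torch.manual_seed`) is a separate step that this sketch does not cover.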


NOTE: Change the directory locations to match where the data is stored in your Google Drive, and EXPAND/COLLAPSE the sections to view the code.

No packages other than the specified ones are imported.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
location = r"/content/drive/My Drive/Files/"                  
INDIC_NLP_LIB_HOME = location + "indic_nlp_library"
INDIC_NLP_RESOURCES = location + "indic_nlp_resources"
data_location        = location + 'NMT/'                   
model_location       = location + 'NMT/NMT_LOWLSTM_ATTN/' 
weekly_data_location = location + 'NMT/Weekly Data/'

### LIBRARIES  -
This subsection imports the required libraries. Download or clone the Indic NLP library and its resources to your Drive, and change the locations accordingly.
Google Colab does not ship with Morfessor and uses an old version of NLTK, so both packages need to be installed or updated.

In [None]:
import sys
sys.path.append(r'{}'.format(INDIC_NLP_LIB_HOME))
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp import loader
loader.load()

In [None]:
!pip install Morfessor
import csv
import re
import string
import spacy
import tqdm.notebook as tq
nlpen = spacy.load("en_core_web_sm")
import random
import pickle
from indicnlp.tokenize import sentence_tokenize
from indicnlp.tokenize import indic_tokenize
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

Collecting Morfessor
  Downloading https://files.pythonhosted.org/packages/39/e6/7afea30be2ee4d29ce9de0fa53acbb033163615f849515c0b1956ad074ee/Morfessor-2.0.6-py3-none-any.whl
Installing collected packages: Morfessor
Successfully installed Morfessor-2.0.6


In [None]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

!pip install -U nltk
import nltk
import sys
nltk.download('wordnet')
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import single_meteor_score
import numpy as np

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
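def read_csv(location, file_type):
    # Read the CSV and collect the text columns that match the file type.
    if file_type not in ('train', 'weekly'):
        raise ValueError("file_type must be 'train' or 'weekly'")
    df = {'hindi': []}
    if file_type == 'train':
        df['english'] = []
    # The context manager closes the file even if parsing fails.
    # Assumes the CSV is UTF-8 encoded (the default on Colab).
    with open(location, encoding='utf-8', newline='') as cFile:
        cReader = csv.reader(cFile, delimiter=',')
        header = next(cReader)                  # skip the header row
        for t in cReader:
            if file_type == 'train':
                df['hindi'].append(t[1])
                df['english'].append(t[2])
            else:
                df['hindi'].append(t[2])
    return df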

In [None]:
def train_test_split(dataset, test_split_percentage):

    total_len   = len(dataset)
    total_index = list(range(total_len))
    test_index = list( total_index[: int(test_split_percentage*total_len)] )
    train_index  = list( total_index[int(test_split_percentage*total_len) : ] )
    #np.random.shuffle(test_index)
    #np.random.shuffle(train_index)
    index = { 'train' : train_index, 'test' : test_index}
    train_df = [ dataset[i] for i in train_index ]
    test_df  = [ dataset[i] for i in test_index ]
    return index, train_df, test_df

### TEXT PROCESSING
This subsection processes the English and Hindi sentences.
Processing the 1 lakh (100,000) text pairs takes a long time, so instead of repeating the work on every run, the processed texts and tokens are stored using pickle.
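To make the digit handling concrete: the preprocessing functions in this notebook put a space after every digit, and `make_sentence` later removes spaces that sit between two digits. The following is a minimal standalone sketch of that round trip; `split_digits` and `join_digits` are hypothetical names introduced only for this illustration.

```python
import re

def split_digits(text):
    """Put a space after every ASCII digit, then collapse repeated spaces."""
    for d in "0123456789":
        text = text.replace(d, d + " ")
    return re.sub(" +", " ", text).strip()

def join_digits(text):
    """Remove a single space that sits directly between two digits."""
    return re.sub(r"(?<=\d) (?=\d)", "", text)

spaced = split_digits("born in 1947 in delhi")
restored = join_digits(spaced)
print(spaced)
print(restored)
```

Splitting digits keeps the vocabulary small, since any number is spelled with at most ten tokens. The round trip is lossy, though: two separate numbers such as `3 4` come back as `34`.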

In [None]:
english_nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
hindi_nums =   ['०', '१', '२', '३', '४', '५', '६', '७', '८', '९']

def clean_string( instr ):
    instr = instr.lower()
    instr = instr.replace(u'[', ' ')
    instr = instr.replace(u']', ' ')
    instr = instr.replace(u'{', ' ')
    instr = instr.replace(u'}', ' ')
    instr = instr.replace(u'(', ' ')
    instr = instr.replace(u')', ' ')
    instr = instr.replace(u'...', ' ')
    instr = instr.replace(u'..', ' ')
    instr = instr.replace(u'-', ' ')
    instr = instr.replace(u',', ' ')
    instr = instr.replace(u'"', ' ')
    instr = re.sub(' +',' ', instr)
    return instr
  
def preprocess_hindi( instr ):
    factory    = IndicNormalizerFactory()
    normalizer = factory.get_normalizer("hi",remove_nuktas=True)
    instr      = normalizer.normalize(instr)

    instr      = clean_string( instr )
    #instr = instr.replace(u'॥', '')
    for nums in hindi_nums:
        instr    = instr.replace(nums, nums + ' ')

    instr      = ItransTransliterator.from_itrans( instr , 'hi')  
    instr      = re.sub(' +',' ', instr)
    instr      = ItransTransliterator.from_itrans( instr , 'hi')
    instr      = instr.strip() #sentence_tokenize.sentence_split(instr, lang='hi')
    
    return instr

def preprocess_english( instr ):
    instr = clean_string(instr)

    instr = instr.replace("’", "'")
    instr = instr.replace("n\'t", " not")
    instr = instr.replace("'re" , " are")
    instr = instr.replace("'ve" , " have")
    instr = instr.replace("'s"  , " is")
    instr = instr.replace("'ll" , " will")
    instr = instr.replace("'m" , " am")
    #instr = re.sub(r'[^\w\s\\d]' , " " , instr)
    #instr = re.sub(r'[\d]' , ' ' , instr)

    for nums in english_nums:
        instr    = instr.replace(nums, nums + ' ')
    instr = re.sub(' +',' ', instr)
    instr = instr.strip()

    return instr

def get_hindi_tokens(sentence):
    return indic_tokenize.trivial_tokenize(sentence)

def get_english_tokens(sentence):
    tokens = []
    tokstr = nlpen(sentence)
    for token in tokstr:
        tokens.append(token.text)
    return tokens

In [None]:
# Load_From_file =
#   -1   : Process the texts and store/dump the files into the location
#    0   : Process the texts and do not store the files
#    1   : Directly load the processed text from the location

def process_pairs(df, load_from_file = 0, location = ''):
    if( load_from_file == 1):
        with open(location + r'pairs.pickle', 'rb') as handle:
            pairs = pickle.load(handle)
        with open(location + r'pairs_tokens.pickle', 'rb') as handle:
            pairs_tokens = pickle.load(handle)
        return pairs, pairs_tokens
    else:
        pairs = []
        pairs_tokens = []
        for i in tq.tqdm( range( len(df['hindi']) )):
            hinsen  = df['hindi'][i]
            hsent   = preprocess_hindi( hinsen )
            htokens = get_hindi_tokens(hsent)

            engsen  = df['english'][i]
            esent   = preprocess_english( engsen )
            etokens = get_english_tokens(esent)

            pairs.append( [hsent, esent] )
            pairs_tokens.append( [htokens, etokens] )

        if( load_from_file == -1):
            with open(location + r'pairs.pickle', 'wb') as handle:
                pickle.dump(pairs, handle, protocol=pickle.HIGHEST_PROTOCOL)
            with open(location + r'pairs_tokens.pickle', 'wb') as handle:
                pickle.dump(pairs_tokens, handle, protocol=pickle.HIGHEST_PROTOCOL)

        return pairs, pairs_tokens

### LANGUAGE
This subsection contains the class 'Language', which stores every token and its corresponding index. It also contains functions to convert a sentence to a tensor.

This subsection is adapted from the PyTorch tutorial on NMT.
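As a compact illustration of the idea, the sketch below counts words, drops those seen only once, and turns a token list into a fixed-length list of indices with start, end, padding and unknown markers. It is a simplified stand-in written with plain Python lists rather than tensors; `build_index` and `encode` are hypothetical names, and the reserved ids simply mirror the ones used in this notebook.

```python
from collections import Counter

START, END, PAD, UNK = 0, 1, 2, 3  # same reserved ids as in this notebook

def build_index(token_lists, min_count=2):
    """Map every word seen at least min_count times to an id starting at 4."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    word2index = {}
    for word, n in counts.items():
        if n >= min_count:
            word2index[word] = 4 + len(word2index)
    return word2index

def encode(tokens, word2index, max_length):
    """START + word ids (UNK for unknown words), truncated, then END and padding."""
    ids = [START] + [word2index.get(w, UNK) for w in tokens]
    ids = ids[: max_length - 1]          # leave room for END
    ids.append(END)
    ids.extend([PAD] * (max_length - len(ids)))
    return ids

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
w2i = build_index(corpus)
encoded = encode(["the", "dog", "sat"], w2i, max_length=6)
print(w2i)
print(encoded)
```

Here "sat" occurs only once in the toy corpus, so it is encoded as the unknown token just like the unseen word "dog".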

In [None]:
START_TOKEN = 0
END_TOKEN = 1
PAD_TOKEN = 2
UNK_TOKEN = 3

class Language:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {}
        self.num_words = 4
        self.word2index['START_TOKEN'] = START_TOKEN
        self.word2index['END_TOKEN']   = END_TOKEN
        self.word2index['PAD_TOKEN']   = PAD_TOKEN
        self.word2index['UNK_TOKEN']   = UNK_TOKEN
        self.index2word[START_TOKEN] = 'START_TOKEN'
        self.index2word[END_TOKEN] = 'END_TOKEN'
        self.index2word[PAD_TOKEN] = 'PAD_TOKEN'
        self.index2word[UNK_TOKEN] = 'UNK_TOKEN'

    def addWord(self, word):
        if word in self.word2count:
            self.word2count[word] = self.word2count[word] + 1
        else:
            self.word2count[word] = 1
            #self.word2index[word] = self.num_words
            #self.index2word[self.num_words] = word
            self.num_words = self.num_words + 1
    
    def addSentence(self, sentence_tokens):
        for word in sentence_tokens:
            self.addWord(word)
    
    def filter_words(self):
        self.num_words = 4
        for word in self.word2count:
            if( self.word2count[word] != 1):
                self.word2index[word] = self.num_words
                self.index2word[self.num_words] = word
                self.num_words = self.num_words + 1


def generate_language( pairs_tokens ):
    hindi   = Language('hindi')
    english = Language('english')
    for i in tq.tqdm( range(len(pairs_tokens)) ):
        hindi.addSentence(pairs_tokens[i][0])
        english.addSentence(pairs_tokens[i][1])
    hindi.filter_words()
    english.filter_words()
    return hindi, english

PROCESS TEXT TO TENSOR

In [None]:
def indexesFromSentence(lang, tokens, max_length):
    indexes = []
    indexes.append(START_TOKEN)
    for word in tokens:
        if word in lang.word2index.keys():
            indexes.append( lang.word2index[word] )
        else:
            indexes.append( lang.word2index['UNK_TOKEN'] )
    indexes = indexes[0:max_length-1]
    indexes.append(END_TOKEN)
    indexes.extend( [PAD_TOKEN]*( max_length - len(indexes)))
    return indexes

def tensorFromSentence(lang, sentence, max_length):
    indexes = indexesFromSentence(lang, sentence, max_length)
    return torch.tensor(indexes, dtype=torch.long, device=device)

def tensorsFromPair(pairs, input_lang, output_lang, max_length):
    res_pairs = []
    for pair in pairs:
        if( len(pair[0]) < max_length and len(pair[1]) < max_length): 
          input_tensor  = tensorFromSentence(input_lang, pair[0], max_length)
          target_tensor = tensorFromSentence(output_lang, pair[1], max_length)
          res_pairs.append( (input_tensor, target_tensor) )
    return res_pairs

### NEURAL MACHINE TRANSLATOR
This subsection contains three main classes: Encoder, Decoder, and seq2seq, which combines the encoder and the decoder.
It also contains functions to train, use, and evaluate the seq2seq model.
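The training loop in this notebook computes the teacher forcing ratio as `0.5 / (epoch + 0.1)`, and at each decoding step the model compares `random.random()` with that ratio to decide whether the next decoder input is the true target token or its own prediction. The sketch below isolates just that decision; `teacher_forcing_ratio` and `choose_next_input` are hypothetical helper names, and the tokens are plain strings rather than tensors.

```python
import random

def teacher_forcing_ratio(epoch, base=0.5):
    """Schedule taken from the training loop in this notebook: base / (epoch + 0.1)."""
    return base / (epoch + 0.1)

def choose_next_input(true_token, predicted_token, ratio, rng=random):
    """Return the ground-truth token with probability `ratio`, else the prediction."""
    return true_token if rng.random() < ratio else predicted_token

ratios = [round(teacher_forcing_ratio(e), 4) for e in range(4)]
print(ratios)
```

In epoch 0 the ratio is above 1, so the decoder is always fed the true token. From epoch 1 onward it falls below one half and keeps shrinking, so the model increasingly has to continue from its own predictions, as it must at translation time.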


ENCODER and DECODER

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_size, embed_size)
        self.dropout = nn.Dropout(0.4)
        self.rnn = nn.LSTM(embed_size, hidden_size, num_layers = 2, bidirectional = True, dropout = 0.5)
        self.fc_hidden = nn.Linear(hidden_size*2 , hidden_size)
        self.fc_cell = nn.Linear(hidden_size*2 , hidden_size)

    def forward(self, input):
        # input.shape :    [Sentence Length, Batch Size]
        # embedded.shape : [Sentence Length, Batch Size, Embedding Dimension]
        # output.shape :   [Sentence Length, Batch Size, Hidden Size * 2] (bidirectional)
        # hidden.shape :   [Layers * Directions = 2*2 , Batch Size, Hidden Size]
        # cell.shape   :   [Layers * Directions = 2*2 , Batch Size, Hidden Size]

        embed = self.embedding(input)
        embed = self.dropout(embed)
        output, (hidden, cell) = self.rnn(embed)

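        # Note: PyTorch orders the final states as [layer0_fwd, layer0_bwd, layer1_fwd, layer1_bwd],
        # so the slices 0:2 and 2:4 pair states across layers; slicing with [0::2] and [1::2]
        # would instead pair the forward and backward states of the same layer.
        hidden = torch.cat( (hidden[0:2,:,:], hidden[2:4,:,:]), dim = 2)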
        cell   = torch.cat( (cell[0:2,:,:], cell[2:4,:,:]), dim = 2)
        hidden = self.fc_hidden(hidden)
        cell   = self.fc_cell(cell)
        hidden = torch.tanh(hidden)
        cell   = torch.tanh(cell)

        return output, hidden, cell

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size * 4, hidden_size*2)
        self.v = nn.Linear(hidden_size*2, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        # hidden.shape          :   [Layers = 2, Batch Size, Hidden Size]
        # encoder_outputs.shape :   [Sen Len, Batch Size, Hidden_size*2]
        # After adjusting
        # hidden.shape          :   [Batch Size, Sen Length, Hidden_size*2]
        # encoder_outputs.shape :   [Batch Size, Sen Length, Hidden_size*2]

        src_len = encoder_outputs.shape[0]
        hidden = torch.cat( (hidden[0,:,:], hidden[1,:,:]), dim = 1)
        hidden = hidden.repeat(src_len, 1, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        hidden = hidden.permute(1, 0, 2)

        comb        = torch.cat((hidden, encoder_outputs), dim = 2)
        energy      = torch.tanh( self.attn(comb) )
        attention   = self.v(energy).squeeze(2)
        attention   = F.softmax(attention, dim=1)
        attention   = attention.unsqueeze(1)
        weights     = torch.bmm(attention, encoder_outputs)
        weights     = weights.permute(1,0,2)
        return weights

In [None]:
class DecoderAttn(nn.Module):
    def __init__(self, output_size, embed_size, hidden_size):
        super(DecoderAttn, self).__init__()
        self.embedding = nn.Embedding(output_size, embed_size)
        self.dropout = nn.Dropout(0.4)

        self.rnn   = nn.LSTM((hidden_size*2)+embed_size, hidden_size, num_layers = 2, dropout = 0.5)
        self.dense  = nn.Linear(hidden_size*3 + embed_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

        self.attention = Attention(hidden_size)

    def forward(self, target, hidden, cell,  encoder_outputs):
        # target.shape :   [Batch Size]
        # target.shape :   [1, Batch Size] after unsqueezing
        # embed.shape  :   [1, Batch Size, Embedding Size]
        # weights.shape:   [1, Batch Size, Hidden Size * 2] (attention context vector)
        # output.shape :   [1, Batch Size, Hidden Size]
        # hidden.shape :   [Layers = 2, Batch Size, Hidden Size]
        # preds.shape  :   [Batch Size, Output_Vocabulary_Size]
        
        target = target.unsqueeze(0)
        embed = self.embedding(target)
        embed = self.dropout(embed)

        weights = self.attention(hidden, encoder_outputs)
        rinput   = torch.cat((embed, weights), dim = 2)

        output, (hidden, cell) = self.rnn(rinput, (hidden, cell))
        dense_input = torch.cat((output, weights, embed), dim=2)
        preds = self.dense(dense_input[0])
        #preds = F.relu(preds)
        preds = self.softmax(preds)
        return preds, hidden.squeeze(0), cell.squeeze(0)

In [None]:
class seq2seq(nn.Module):
    def __init__(self, input_size, output_size, embed_size, hidden_size, max_length):
        super(seq2seq, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embed_size = embed_size
        self.max_length = max_length

        self.encoder = Encoder(input_size, embed_size, hidden_size).to(device)
        self.decoder = DecoderAttn(output_size, embed_size, hidden_size).to(device)

    def forward(self, src, target , teacher_forcing = 0.5):
        # If teacher forcing is 0.5, the next decoder input is the true target token about
        # half the time and the decoder's own previous prediction otherwise.
        # If teacher forcing is 0, the decoder always uses its previous prediction as input.

        # src.shape     = [Input Sentence Length, Batch Size]
        # target.shape  = [Output Sentence Length, Batch Size]
        # outputs.shape = [Output Sentence Length, Batch Size, Output Vocabulary Size]
        # Encode the source sentence, then decode the tokens one by one.

        batch_size, target_vocab_size = src.shape[1], self.output_size
        outputs = torch.zeros(self.max_length, batch_size, target_vocab_size).to(device)
        encoder_outputs, hidden, cell = self.encoder(src)
        
        dinput = src[0,:]
        for index in range(1, self.max_length):
            output, hidden, cell = self.decoder(dinput, hidden, cell, encoder_outputs)
            if random.random() < teacher_forcing:
                dinput = target[index]  
            else:
                dinput = output.argmax(1)
            outputs[index] = output

        return outputs

In [None]:
# Set model in training mode to activate dropouts
# Transpose the text tokens to adjust to pytorch
# Forward Pass on Encoder-Decoder
# Optimize network
def train( model, opt, lossfn, train_loader, r_epoch, val_loader=None,  save_model=0):

    
    if val_loader is not None:
        #print('Preparing Validating Tensors and Dataloaders ...')
        #val_loader = tensorsFromPair(val_tokens, hindi, english, MAX_LENGTH)
        #val_loader  = torch.utils.data.DataLoader(val_loader, batch_size=batch_size, shuffle=True)
        num_val_batches = len(val_loader)
        #print('Done')

    history = []
    num_train_batches = len(train_loader)
    tf_ratio_main = 0.5
    
    for epoch in range(r_epoch[0], r_epoch[1]):
        
        tf_ratio = tf_ratio_main / (epoch + 0.1)

        epoch_detail = {'Epoch':epoch}

        print('Train ...')
        train_loss = 0
        model.train()                                                           # Train Model
        for inS, outS in tq.tqdm( train_loader ):
            opt.zero_grad()
            loss = 0
            inS, outS =  inS.transpose(0, 1), outS.transpose(0, 1)
            predoutS = model(inS, target = outS, teacher_forcing=tf_ratio)
            # Reshape outputs
            outS, predoutS    = outS[1:].reshape(-1), predoutS[1:].reshape(-1, predoutS.shape[-1])   

            loss = lossfn(predoutS, outS)                                       # Compute Loss
            loss.backward()                                                     # Propagate Loss To the Network
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)               # Gradient Clipping
            opt.step()                                                          # Update the weights
            train_loss = train_loss + loss.item()
        epoch_detail['Train Loss'] = round( train_loss/num_train_batches, 5)
        
        if val_loader is not None:                                              # Validate Model
            val_loss = 0
            with torch.no_grad():
                print('Validating Model ... ')
                model.eval()
                for inS, outS in tq.tqdm( val_loader ):
                    loss = 0
                    inS =  inS.transpose(0, 1)
                    outS = outS.transpose(0, 1)
                    predoutS = model(inS, target = outS, teacher_forcing=0)
                    outS     = outS[1:].reshape(-1)
                    predoutS =  predoutS[1:].reshape(-1, predoutS.shape[-1])  
                    loss = lossfn(predoutS, outS)         # Compute Loss
                    val_loss = val_loss + loss.item()

            #scores = get_bleu_score(model, val_set, hindi, english, MAX_LENGTH)
            epoch_detail['Val Loss'] = round( val_loss/num_val_batches, 5)
            #epoch_detail['BLEU'] = scores['BLEU']
            #epoch_detail['METEOR'] = scores['METEOR']

        print('Epoch : ', epoch, ' Stats : ', epoch_detail)
        history.append(epoch_detail)

        if( save_model == 1):
            if( (epoch+1)%5 == 0):
                torch.save(model.state_dict(), model_location + 'bilstmattn_dict_' + str(epoch+1) )
            if( (epoch+1)%20 == 0):
                torch.save(model, model_location + 'bilstmattn_' +  str(epoch+1) )

    return history

In [None]:
# Set the model to evaluation mode to disable the dropout layers
# Get a tensor from the sentence and reshape it to [Sequence Length, Batch Size = 1]
def make_sentence(tokens):
    words = []
    for x in tokens:
        if x == 'UNK_TOKEN':                    # compare by value; `is` tests object identity
            words.append('<UNK>')
        elif x not in ['START_TOKEN', 'END_TOKEN', 'PAD_TOKEN']:
            words.append(x)
    # Rejoin digits that preprocessing separated with spaces
    return re.sub(r'(?<=\d) (?=\d)', '', ' '.join(words))

def translate(model, sentence, input_lang, output_lang, max_length):
    model.eval()
    with torch.no_grad():
        input = tensorFromSentence( input_lang, sentence, max_length= max_length)
        input = torch.transpose( input.unsqueeze(0) , 0 , 1)
        output = model(input, target=None, teacher_forcing = 0)
        dec_words = []
        for x in output.squeeze():
            i = x.argmax(0)
            dec_words.append( output_lang.index2word[ i.item() ] )
            if(i.item() == END_TOKEN ):
                break
    return make_sentence( dec_words )


### PERFORMANCE EVALUATION
The evaluation script is modified to report BLEU and METEOR scores.
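BLEU is built on clipped n-gram precision, which is not symmetric: NLTK's `sentence_bleu` takes the list of references first and the hypothesis (the model output) second, and swapping them changes the score. The sketch below shows only that precision component for a single reference, without BLEU's brevity penalty or averaging over n-gram orders; `ngrams` and `clipped_precision` are hypothetical helpers written for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, hypothesis, n=1):
    """Share of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total

reference = "the cat is on the mat".split()
hypothesis = "the the the".split()
forward = clipped_precision(reference, hypothesis)   # hypothesis scored against reference
swapped = clipped_precision(hypothesis, reference)   # arguments in the wrong order
print(forward, swapped)
```

With the arguments in the intended order, two of the three hypothesis tokens are credited; with the order swapped, only two of six are, so the two calls disagree.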

In [None]:
def get_bleu_score(model, pairs, input_lang, output_lang, max_length):
    total_num = len(pairs)
    total_bleu_scores = 0
    total_meteor_scores = 0
    
    for i in tq.tqdm( range(total_num) ):
        output    = translate(model, pairs[i][0], input_lang, output_lang, max_length)
        original  = make_sentence(pairs[i][1])
        # NLTK expects the reference first and the model output (hypothesis) second
        total_bleu_scores   += sentence_bleu([original.split(" ")], output.split(" "))
        total_meteor_scores += single_meteor_score(original, output)

    bleu_result = total_bleu_scores/total_num
    meteor_result = total_meteor_scores/total_num
    
    print()
    return { "BLEU": round(bleu_result, 5),
              "METEOR": round(meteor_result, 5) }

# EXECUTION
Executing the whole process.


1.   Read the training data
2.   Process all sentences (English and Hindi)
3.   Generate Language (word2index and index2word)
4.   Prepare tensors for all tokens.
5.   Create the seq2seq model and train the model
6.   Evaluate the performance
7.   Use the model for weekly translation



READ AND PROCESS FILE

In [None]:
MAX_LENGTH = 32
batch_size = 256


print('Reading Training Data ... ', end = '')
df = read_csv(data_location + 'train.csv', 'train')
print('Done')

print('Processing Strings ... ', end = '')
pairs, tokens = process_pairs(df, load_from_file=1, location = data_location + 'DataPairs/')
print('Done')

print('Preparing Language Word2vectors and inverse ... ', end = '')
# Generate Language Input and Output
hindi, english = generate_language(tokens)
print('Done, Hindi Token Count : ', hindi.num_words, '  English Token Count : ', english.num_words)

print('Filtering Texts...')
import json
with open(data_location+'exclude.json', 'r') as fp:
    excludeids = json.load(fp)
    excludeids = list(excludeids.keys())
for i in range(len(excludeids)):
    excludeids[i] = int(excludeids[i])
for index in sorted(excludeids, reverse=True):
    del pairs[index]
    del tokens[index]
print('Done, Dataset Size : ', len(tokens))

print('Splitting Dataset ... ', end = '')
index, train_tokens, val_tokens = train_test_split(tokens,  0.2)
#index, test_tokens, val_tokens = train_test_split(temp_tokens, 0.25)
print('Done')

print('Preparing Tensors ... ', end = '')
# Get Tensors for tokens and create Dataloaders
train_tensors = tensorsFromPair(train_tokens, hindi, english, MAX_LENGTH)
#test_tensors  = tensorsFromPair(test_tokens,  hindi, english, MAX_LENGTH)
val_tensors   = tensorsFromPair(val_tokens,   hindi, english, MAX_LENGTH)
print('Done')

print('Preparing Dataloaders ... ', end = '')
train_loader = torch.utils.data.DataLoader(train_tensors, batch_size=batch_size, shuffle=True)
val_loader   = torch.utils.data.DataLoader(val_tensors, batch_size=batch_size, shuffle=True)
pretrain_loader = torch.utils.data.DataLoader(train_tensors[0:batch_size], batch_size=batch_size, shuffle=True)
print('Done')
del train_tensors
#del test_tensors
del val_tensors

Reading Training Data ... Done
Processing Strings ... Done
Preparing Language Word2vectors and inverse ... 

Done, Hindi Token Count :  21104   English Token Count :  18988
Filtering Texts...
Done, Dataset Size :  96625
Splitting Dataset ... Done
Preparing Tensors ... Done
Preparing Dataloaders ... Done


TRAIN MODEL

In [None]:
# Model Parameters
print('Initialising Parameters')
hidden_size = 512
input_vocab_size = hindi.num_words + 1
output_vocab_size = english.num_words + 1
embedding_dim = 300
pretrain_epoch = 200
#save_losses
Losses = []

#Generate Model, optimizer, lossfn
print('Creating Models ... ', end = ' ')
model = seq2seq(input_vocab_size, output_vocab_size , embedding_dim, hidden_size, MAX_LENGTH)
optimizer = optim.Adam( model.parameters(), lr= 0.0005, weight_decay=1e-6 )
lossfn = nn.NLLLoss(ignore_index=PAD_TOKEN)
print('Done')

#load_model weights if available
load_model = 0
if(load_model==1):
    print('Loading Pretrained Weights .. :')
    model.load_state_dict( torch.load(model_location + 'bilstmattn_dict_20'))
model.eval() 

Initialising Parameters
Creating Models ...  Done


seq2seq(
  (encoder): Encoder(
    (embedding): Embedding(21105, 300)
    (dropout): Dropout(p=0.4, inplace=False)
    (rnn): LSTM(300, 512, num_layers=2, dropout=0.5, bidirectional=True)
    (fc_hidden): Linear(in_features=1024, out_features=512, bias=True)
    (fc_cell): Linear(in_features=1024, out_features=512, bias=True)
  )
  (decoder): DecoderAttn(
    (embedding): Embedding(18989, 300)
    (dropout): Dropout(p=0.4, inplace=False)
    (rnn): LSTM(1324, 512, num_layers=2, dropout=0.5)
    (dense): Linear(in_features=1836, out_features=18989, bias=True)
    (softmax): LogSoftmax(dim=1)
    (attention): Attention(
      (attn): Linear(in_features=2048, out_features=1024, bias=True)
      (v): Linear(in_features=1024, out_features=1, bias=False)
    )
  )
)

In [None]:
# Pretrain to Overfit Model on single batch
pretrain_hist = train(model, optimizer, lossfn, pretrain_loader, (0,pretrain_epoch))

In [None]:
# Final Train on all Training Data Set, # Append the losses
pretrain_epoch=0
epochs = 20
history = train(model, optimizer, lossfn, train_loader, (pretrain_epoch , pretrain_epoch + epochs),
                val_loader = val_loader, save_model = 1)
Losses.extend(history)


Train ...
Validating Model ...
Epoch :  0  Stats :  {'Epoch': 0, 'Train Loss': 5.5969, 'Val Loss': 6.21297}
Train ...
Validating Model ...
Epoch :  1  Stats :  {'Epoch': 1, 'Train Loss': 5.09099, 'Val Loss': 5.17933}
Train ...
Validating Model ...
Epoch :  2  Stats :  {'Epoch': 2, 'Train Loss': 4.7525, 'Val Loss': 4.90599}
Train ...
Validating Model ...
Epoch :  3  Stats :  {'Epoch': 3, 'Train Loss': 4.44805, 'Val Loss': 4.77474}
Train ...
Validating Model ...
Epoch :  4  Stats :  {'Epoch': 4, 'Train Loss': 4.15195, 'Val Loss': 4.71417}
Train ...
Validating Model ...
Epoch :  5  Stats :  {'Epoch': 5, 'Train Loss': 3.8967, 'Val Loss': 4.6649}
Train ...
Validating Model ...
Epoch :  6  Stats :  {'Epoch': 6, 'Train Loss': 3.66794, 'Val Loss': 4.65259}
Train ...
Validating Model ...
Epoch :  7  Stats :  {'Epoch': 7, 'Train Loss': 3.48026, 'Val Loss': 4.6554}
Train ...
Validating Model ...
Epoch :  8  Stats :  {'Epoch': 8, 'Train Loss': 3.32225, 'Val Loss': 4.65047}
Train ...
Validating Model ...
Epoch :  9  Stats :  {'Epoch': 9, 'Train Loss': 3.19276, 'Val Loss': 4.65088}
Train ...
Validating Model ...
Epoch :  10  Stats :  {'Epoch': 10, 'Train Loss': 3.07602, 'Val Loss': 4.66561}
Train ...
Validating Model ...
Epoch :  11  Stats :  {'Epoch': 11, 'Train Loss': 2.96989, 'Val Loss': 4.68379}
Train ...
Validating Model ...
Epoch :  12  Stats :  {'Epoch': 12, 'Train Loss': 2.87271, 'Val Loss': 4.69629}
Train ...
Validating Model ...
Epoch :  13  Stats :  {'Epoch': 13, 'Train Loss': 2.78586, 'Val Loss': 4.71815}
Train ...
Validating Model ...
Epoch :  14  Stats :  {'Epoch': 14, 'Train Loss': 2.70075, 'Val Loss': 4.73536}
Train ...
Validating Model ...
Epoch :  15  Stats :  {'Epoch': 15, 'Train Loss': 2.6256, 'Val Loss': 4.77486}
Train ...
Validating Model ...
Epoch :  16  Stats :  {'Epoch': 16, 'Train Loss': 2.55767, 'Val Loss': 4.79472}
Train ...
Validating Model ...
Epoch :  17  Stats :  {'Epoch': 17, 'Train Loss': 2.49274, 'Val Loss': 4.81913}
Train ...
Validating Model ...
Epoch :  18  Stats :  {'Epoch': 18, 'Train Loss': 2.42504, 'Val Loss': 4.83388}
Train ...
Validating Model ...
Epoch :  19  Stats :  {'Epoch': 19, 'Train Loss': 2.36604, 'Val Loss': 4.87352}


In [None]:
Losses

[{'Epoch': 0, 'Train Loss': 5.5969, 'Val Loss': 6.21297},
 {'Epoch': 1, 'Train Loss': 5.09099, 'Val Loss': 5.17933},
 {'Epoch': 2, 'Train Loss': 4.7525, 'Val Loss': 4.90599},
 {'Epoch': 3, 'Train Loss': 4.44805, 'Val Loss': 4.77474},
 {'Epoch': 4, 'Train Loss': 4.15195, 'Val Loss': 4.71417},
 {'Epoch': 5, 'Train Loss': 3.8967, 'Val Loss': 4.6649},
 {'Epoch': 6, 'Train Loss': 3.66794, 'Val Loss': 4.65259},
 {'Epoch': 7, 'Train Loss': 3.48026, 'Val Loss': 4.6554},
 {'Epoch': 8, 'Train Loss': 3.32225, 'Val Loss': 4.65047},
 {'Epoch': 9, 'Train Loss': 3.19276, 'Val Loss': 4.65088},
 {'Epoch': 10, 'Train Loss': 3.07602, 'Val Loss': 4.66561},
 {'Epoch': 11, 'Train Loss': 2.96989, 'Val Loss': 4.68379},
 {'Epoch': 12, 'Train Loss': 2.87271, 'Val Loss': 4.69629},
 {'Epoch': 13, 'Train Loss': 2.78586, 'Val Loss': 4.71815},
 {'Epoch': 14, 'Train Loss': 2.70075, 'Val Loss': 4.73536},
 {'Epoch': 15, 'Train Loss': 2.6256, 'Val Loss': 4.77486},
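 {'Epoch': 16, 'Train Loss': 2.55767, 'Val Loss': 4.79472},
 {'Epoch': 17, 'Train Loss': 2.49274, 'Val Loss': 4.81913},
 {'Epoch': 18, 'Train Loss': 2.42504, 'Val Loss': 4.83388},
 {'Epoch': 19, 'Train Loss': 2.36604, 'Val Loss': 4.87352}]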
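
The recorded stats show the validation loss bottoming out at epoch 8 (4.65047) while the training loss keeps falling, which suggests the later epochs mostly overfit. A minimal sketch of picking the checkpoint epoch by lowest validation loss, assuming the stats are a list of dicts shaped like `Losses`; only a few recorded entries are repeated here, and the helper names `best_epoch` and `epochs_since_best` are hypothetical, not part of the notebook:

```python
# A few of the recorded entries; in the notebook this would be the full `Losses` list.
losses = [
    {'Epoch': 6, 'Train Loss': 3.66794, 'Val Loss': 4.65259},
    {'Epoch': 7, 'Train Loss': 3.48026, 'Val Loss': 4.6554},
    {'Epoch': 8, 'Train Loss': 3.32225, 'Val Loss': 4.65047},
    {'Epoch': 9, 'Train Loss': 3.19276, 'Val Loss': 4.65088},
    {'Epoch': 10, 'Train Loss': 3.07602, 'Val Loss': 4.66561},
]


def best_epoch(stats):
    """Return the stats entry with the lowest validation loss."""
    return min(stats, key=lambda row: row['Val Loss'])


def epochs_since_best(stats):
    """Number of epochs trained after the best validation loss was reached."""
    return stats[-1]['Epoch'] - best_epoch(stats)['Epoch']


best = best_epoch(losses)
print('Best epoch:', best['Epoch'], 'Val Loss:', best['Val Loss'])
print('Epochs since best:', epochs_since_best(losses))
```

A check like this could be used to save the model only when the validation loss improves, or to stop training once `epochs_since_best` exceeds a chosen patience.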

In [None]:
# Save the full model, the encoder and the decoder, each both as a state_dict
# (weights only) and as a pickled module object.
torch.save(model.state_dict(), model_location + 'bilstm_np_dict_' + str(epochs) )
torch.save(model, model_location + 'bilstm_np_' + str(epochs) )
torch.save(model.encoder.state_dict(), model_location + 'bilstm_enc_dict_' + str(epochs) )
torch.save(model.encoder, model_location + 'bilstm_enc_' + str(epochs) )
torch.save(model.decoder.state_dict(), model_location + 'bilstm_dec_dict_' + str(epochs) )
torch.save(model.decoder, model_location + 'bilstm_dec_' + str(epochs) )
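
The observations at the top of the notebook note that reloaded models fail because the word-to-index mapping changes between runs. One possible workaround is to store the vocabulary next to the weights and reload it instead of rebuilding it. The sketch below assumes the language objects expose a plain `word2index` dict (the name is taken from the observations; the real structure of `hindi` and `english` is not shown here), and `save_vocab` / `load_vocab` are hypothetical helpers:

```python
import json
import os
import tempfile


def save_vocab(word2index, path):
    """Write the word -> index mapping to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(word2index, f, ensure_ascii=False)


def load_vocab(path):
    """Read the mapping back and rebuild the inverse index -> word mapping."""
    with open(path, encoding='utf-8') as f:
        word2index = json.load(f)
    index2word = {idx: word for word, idx in word2index.items()}
    return word2index, index2word


# Toy vocabulary standing in for the notebook's language objects (assumption).
toy_word2index = {'<SOS>': 0, '<EOS>': 1, 'नमस्ते': 2, 'दुनिया': 3}

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, 'hindi_vocab.json')
    save_vocab(toy_word2index, vocab_path)
    restored_w2i, restored_i2w = load_vocab(vocab_path)

print(restored_w2i == toy_word2index)
```

With the mapping restored from disk, the embedding rows in a saved `state_dict` line up with the same words they were trained on.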

### USE MODEL

In [None]:
get_bleu_score(model, val_tokens, hindi, english, MAX_LENGTH)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()






{'BLEU': 0.03613, 'METEOR': 0.32516}
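
The warnings above come from NLTK's BLEU implementation: when a hypothesis has no matching 2-, 3- or 4-grams, the unsmoothed score collapses to (almost) zero, which pulls down the corpus average. How `get_bleu_score` calls NLTK is not shown here, so the following is only a sketch of the smoothing the warning suggests, on a made-up sentence pair:

```python
import warnings

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Toy example (not from the dataset): the hypothesis shares unigrams and one
# bigram with the reference, but no 3-grams or 4-grams.
reference = "the cat sat on the mat".split()
hypothesis = "the cat is on a mat".split()

# Unsmoothed BLEU: triggers the same warning as in the output above.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    unsmoothed = sentence_bleu([reference], hypothesis)

# Smoothed BLEU: method1 replaces zero n-gram match counts with a small epsilon.
smoothed = sentence_bleu(
    [reference], hypothesis, smoothing_function=SmoothingFunction().method1
)

print('unsmoothed:', unsmoothed)
print('smoothed:  ', smoothed)
```

Smoothed and unsmoothed scores are not directly comparable, so whichever variant is chosen should be used consistently across experiments.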

### USE MODEL FOR WEEKLY TRANSLATION

In [None]:
print('Reading Weekly Data ... ', end = '')
week = read_csv(weekly_data_location + 'Week4/hindistatements.csv', file_type='weekly')    # Load weekly data 
print('Done')

print('Process Weekly Hindi Data ... ', end = '')
# Clean and tokenize every Hindi sentence before translation
week_processed = [get_hindi_tokens(preprocess_hindi(x)) for x in week['hindi']]
print('Done')

print('Translating all the sentences ... ', end = '')
translated_texts = []
for tokens in tq.tqdm(week_processed):
    translated_texts.append( translate(model, tokens, hindi, english, MAX_LENGTH) )
print('Done')

print('Storing translated Sentences ... ', end = '')
# Write one translation per line; explicit UTF-8 avoids platform-dependent encodings
with open(weekly_data_location + 'Week4/3bilstmattn_20.txt', 'w', encoding='utf-8') as f:
    for item in translated_texts:
        f.write("%s\n" % item)
print('Done')

Reading Weekly Data ... Done
Process Weekly Hindi Data ... Done
Translating all the sentences ... 

Done
Storing translated Sentences ... Done


In [None]:
#translate(model, week_processed[i], hindi, english, MAX_LENGTH)


#torch.save( tmodel.state_dict(), model_location + 'gru_dict_100')
#torch.save(model, location+ 'gru_enc_dec')

#tmodel = torch.load(model_location+ 'gru_100')
#tmodel.eval()

#tq.tqdm._instances.clear()