## ENCODER DECODER NETWORK AND TEACHER FORCING

**References:**

Tutorials given in the competition document: [Competition Link](https://docs.google.com/document/d/1p74wG-bECCgbpyq5x_x2QJrf5RSf9FnMLGSAiyUkHLo/edit)

PyTorch NMT Tutorial : [Pytorch NMT](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

GitHub page, used to understand batch processing in PyTorch: [Github Pengyuchen](https://github.com/pengyuchen/PyTorch-Batch-Seq2seq)

A few Stack Overflow answers were also consulted for regex examples and for fixing some bugs.

The code is divided into two sections:
a) Functions, which define all the required procedures, and b) Execution, which uses those functions. Expand or collapse to view each section and subsection.

Observations:
1.   Adam works better with its default settings (for example, the default learning rate).
2.   Train in runs of 20 epochs at a time to avoid failures caused by timeouts.
3.   Saving and reloading models does not work reliably because of randomness throughout the pipeline: the `word2index` and `index2word` mappings assign a different index to each word on every run. All sources of randomness need to be removed before saved models can be reused.
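
One possible way to address the third observation is to store the vocabulary mappings next to the model weights, so that a reloaded model sees the same token indices. The sketch below uses only the standard library; `save_vocab`, `load_vocab`, and the file name are hypothetical helpers that are not part of this notebook.

```python
import json
import os
import tempfile

def save_vocab(word2index, path):
    # Store the token -> index mapping as JSON so it can be restored exactly.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(word2index, handle, ensure_ascii=False)

def load_vocab(path):
    # Rebuild both directions of the mapping from the saved file.
    with open(path, "r", encoding="utf-8") as handle:
        word2index = json.load(handle)
    index2word = {index: word for word, index in word2index.items()}
    return word2index, index2word

# Toy vocabulary with the same special tokens the notebook uses.
vocab = {"START_TOKEN": 0, "END_TOKEN": 1, "PAD_TOKEN": 2, "UNK_TOKEN": 3, "घर": 4}
vocab_path = os.path.join(tempfile.mkdtemp(), "hindi_vocab.json")
save_vocab(vocab, vocab_path)
restored_word2index, restored_index2word = load_vocab(vocab_path)
```

Fixing the seeds with `random.seed` and `torch.manual_seed` before building the model is a complementary step.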


NOTE: Change the directory location to match the Google Drive location where the data is stored, and expand/collapse the sections to view the code.

No packages other than the specified ones are imported.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
location = r"/content/drive/My Drive/Files/"                  
INDIC_NLP_LIB_HOME = location + "indic_nlp_library"
INDIC_NLP_RESOURCES = location + "indic_nlp_resources"
data_location        = location + 'NMT/'                   
model_location       = location + 'NMT/NMT_LSTMATTN/' 
weekly_data_location = location + 'NMT/Weekly Data/'

### LIBRARIES
This subsection imports the required libraries. Download or clone the Indic NLP library and its resources to your drive, and change the location accordingly.
Google Colab does not ship with Morfessor and uses an old version of NLTK, so both packages need to be installed or updated.

In [None]:
import sys
sys.path.append(r'{}'.format(INDIC_NLP_LIB_HOME))
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp import loader
loader.load()

In [None]:
!pip install Morfessor
import csv
import re
import string
import spacy
import tqdm.notebook as tq
nlpen = spacy.load("en_core_web_sm")
import random
import pickle
from indicnlp.tokenize import sentence_tokenize
from indicnlp.tokenize import indic_tokenize
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

Collecting Morfessor
  Downloading https://files.pythonhosted.org/packages/39/e6/7afea30be2ee4d29ce9de0fa53acbb033163615f849515c0b1956ad074ee/Morfessor-2.0.6-py3-none-any.whl
Installing collected packages: Morfessor
Successfully installed Morfessor-2.0.6


In [None]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

!pip install -U nltk
import nltk
import sys
nltk.download('wordnet')
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import single_meteor_score
import numpy as np

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB)
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.6.2


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
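def read_csv(location, file_type):
    # Read a competition CSV into a dict of column lists.
    # 'train' files hold Hindi in column 1 and English in column 2;
    # 'weekly' files hold Hindi in column 2.
    if file_type not in ('train', 'weekly'):
        raise ValueError("file_type must be 'train' or 'weekly'")
    df = {'hindi': []}
    if file_type == 'train':
        df['english'] = []
    # The context manager closes the file even if parsing fails
    with open(location, newline='', encoding='utf-8') as cFile:
        cReader = csv.reader(cFile, delimiter=',')
        header = next(cReader)   # skip the header row
        for t in cReader:
            if file_type == 'train':
                df['hindi'].append(t[1])
                df['english'].append(t[2])
            else:
                df['hindi'].append(t[2])
    return df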

In [None]:
def train_test_split(dataset, test_split_percentage):

    total_len   = len(dataset)
    total_index = list(range(total_len))
    test_index = list( total_index[: int(test_split_percentage*total_len)] )
    train_index  = list( total_index[int(test_split_percentage*total_len) : ] )
    #np.random.shuffle(test_index)
    #np.random.shuffle(train_index)
    index = { 'train' : train_index, 'test' : test_index}
    train_df = [ dataset[i] for i in train_index ]
    test_df  = [ dataset[i] for i in test_index ]
    return index, train_df, test_df

### TEXT PROCESSING
This subsection processes the English and Hindi sentences.
Processing the 1 lakh (100,000) sentence pairs takes a long time, so instead of repeating the work on every run, the processed texts and tokens are stored using pickle.

In [None]:
english_nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
hindi_nums =   ['०', '१', '२', '३', '४', '५', '६', '७', '८', '९']

def clean_string( instr ):
    instr = instr.lower()
    instr = instr.replace(u'[', ' ')
    instr = instr.replace(u']', ' ')
    instr = instr.replace(u'{', ' ')
    instr = instr.replace(u'}', ' ')
    instr = instr.replace(u'(', ' ')
    instr = instr.replace(u')', ' ')
    instr = instr.replace(u'...', ' ')
    instr = instr.replace(u'..', ' ')
    instr = instr.replace(u'-', ' ')
    instr = instr.replace(u',', ' ')
    instr = instr.replace(u'"', ' ')
    instr = re.sub(' +',' ', instr)
    return instr
  
def preprocess_hindi( instr ):
    factory    = IndicNormalizerFactory()
    normalizer = factory.get_normalizer("hi",remove_nuktas=True)
    instr      = normalizer.normalize(instr)

    instr      = clean_string( instr )
    #instr = instr.replace(u'॥', '')
    for nums in hindi_nums:
        instr    = instr.replace(nums, nums + ' ')

    instr      = ItransTransliterator.from_itrans( instr , 'hi')  
    instr      = re.sub(' +',' ', instr)
    instr      = ItransTransliterator.from_itrans( instr , 'hi')
    instr      = instr.strip() #sentence_tokenize.sentence_split(instr, lang='hi')
    
    return instr

def preprocess_english( instr ):
    instr = clean_string(instr)

    instr = instr.replace("’", "'")
    instr = instr.replace("n\'t", " not")
    instr = instr.replace("'re" , " are")
    instr = instr.replace("'ve" , " have")
    instr = instr.replace("'s"  , " is")
    instr = instr.replace("'ll" , " will")
    instr = instr.replace("'m" , " am")
    #instr = re.sub(r'[^\w\s\\d]' , " " , instr)
    #instr = re.sub(r'[\d]' , ' ' , instr)

    for nums in english_nums:
        instr    = instr.replace(nums, nums + ' ')
    instr = re.sub(' +',' ', instr)
    instr = instr.strip()

    return instr

def get_hindi_tokens(sentence):
    return indic_tokenize.trivial_tokenize(sentence)

def get_english_tokens(sentence):
    tokens = []
    tokstr = nlpen(sentence)
    for token in tokstr:
        tokens.append(token.text)
    return tokens
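
To make the effect of the English cleaning steps concrete, here is a simplified, self-contained copy of that logic applied to one sentence. `expand_and_space` is a hypothetical name used only for this illustration, and the example sentence is made up.

```python
import re

def expand_and_space(text):
    # Simplified copy of the English cleaning steps, for illustration only.
    text = text.lower()
    for mark in ['[', ']', '{', '}', '(', ')', '...', '..', '-', ',', '"']:
        text = text.replace(mark, ' ')
    text = text.replace("’", "'")
    for short, full in [("n't", " not"), ("'re", " are"), ("'ve", " have"),
                        ("'s", " is"), ("'ll", " will"), ("'m", " am")]:
        text = text.replace(short, full)
    # Put a space after every digit so each digit becomes its own token.
    text = re.sub(r'([0-9])', r'\1 ', text)
    return re.sub(' +', ' ', text).strip()

example = expand_and_space("They don't know (yet), it's 42-year old!")
print(example)  # they do not know yet it is 4 2 year old!
```

Note that the `'s` rule also rewrites possessives (for example "john's" becomes "john is"), which is a trade-off of this simple string-replacement approach.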

In [None]:
# Load_From_file =
#   -1   : Process the texts and store/dump the files into the location
#    0   : Process the texts and do not store the files
#    1   : Directly load the processed text from the location

def process_pairs(df, load_from_file = 0, location = ''):
    if( load_from_file == 1):
        with open(location + r'pairs.pickle', 'rb') as handle:
            pairs = pickle.load(handle)
        with open(location + r'pairs_tokens.pickle', 'rb') as handle:
            pairs_tokens = pickle.load(handle)
        return pairs, pairs_tokens
    else:
        pairs = []
        pairs_tokens = []
        for i in tq.tqdm( range( len(df['hindi']) )):
            hinsen  = df['hindi'][i]
            hsent   = preprocess_hindi( hinsen )
            htokens = get_hindi_tokens(hsent)

            engsen  = df['english'][i]
            esent   = preprocess_english( engsen )
            etokens = get_english_tokens(esent)

            pairs.append( [hsent, esent] )
            pairs_tokens.append( [htokens, etokens] )

        if( load_from_file == -1):
            with open(location + r'pairs.pickle', 'wb') as handle:
                pickle.dump(pairs, handle, protocol=pickle.HIGHEST_PROTOCOL)
            with open(location + r'pairs_tokens.pickle', 'wb') as handle:
                pickle.dump(pairs_tokens, handle, protocol=pickle.HIGHEST_PROTOCOL)

        return pairs, pairs_tokens

### LANGUAGE
This subsection contains the class `Language`, which stores every token and its corresponding index. It also contains functions that convert a sentence to a tensor.

This subsection is adapted from the PyTorch tutorial on NMT.

In [None]:
START_TOKEN = 0
END_TOKEN = 1
PAD_TOKEN = 2
UNK_TOKEN = 3

class Language:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {}
        self.num_words = 4
        self.word2index['START_TOKEN'] = START_TOKEN
        self.word2index['END_TOKEN']   = END_TOKEN
        self.word2index['PAD_TOKEN']   = PAD_TOKEN
        self.word2index['UNK_TOKEN']   = UNK_TOKEN
        self.index2word[START_TOKEN] = 'START_TOKEN'
        self.index2word[END_TOKEN] = 'END_TOKEN'
        self.index2word[PAD_TOKEN] = 'PAD_TOKEN'
        self.index2word[UNK_TOKEN] = 'UNK_TOKEN'

    def addWord(self, word):
        if word in self.word2count:
            self.word2count[word] = self.word2count[word] + 1
        else:
            self.word2count[word] = 1
            #self.word2index[word] = self.num_words
            #self.index2word[self.num_words] = word
            self.num_words = self.num_words + 1
    
    def addSentence(self, sentence_tokens):
        for word in sentence_tokens:
            self.addWord(word)
    
    def filter_words(self):
        self.num_words = 4
        for word in self.word2count:
            if( self.word2count[word] != 1):
                self.word2index[word] = self.num_words
                self.index2word[self.num_words] = word
                self.num_words = self.num_words + 1


def generate_language( pairs_tokens ):
    hindi   = Language('hindi')
    english = Language('english')
    for i in tq.tqdm( range(len(pairs_tokens)) ):
        hindi.addSentence(pairs_tokens[i][0])
        english.addSentence(pairs_tokens[i][1])
    hindi.filter_words()
    english.filter_words()
    return hindi, english
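
The counting and filtering done by `Language` can be sketched in a few lines. This is a minimal stand-alone illustration, not a replacement for the class; `build_vocab` and `min_count` are hypothetical names, with `min_count=2` chosen to match the rule of dropping words that occur only once.

```python
from collections import Counter

SPECIAL_TOKENS = ["START_TOKEN", "END_TOKEN", "PAD_TOKEN", "UNK_TOKEN"]

def build_vocab(tokenized_sentences, min_count=2):
    # Count every token, then index only those seen at least min_count times.
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    word2index = {token: index for index, token in enumerate(SPECIAL_TOKENS)}
    for token, count in counts.items():
        if count >= min_count:
            word2index[token] = len(word2index)
    return word2index

sentences = [["i", "like", "tea"], ["i", "like", "coffee"]]
vocab = build_vocab(sentences)
print(vocab)
```

Words that fall below the threshold ("tea" and "coffee" here) receive no index and are later mapped to `UNK_TOKEN`.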

PROCESS TEXT TO TENSOR

In [None]:
def indexesFromSentence(lang, tokens, max_length):
    indexes = []
    indexes.append(START_TOKEN)
    for word in tokens:
        if word in lang.word2index.keys():
            indexes.append( lang.word2index[word] )
        else:
            indexes.append( lang.word2index['UNK_TOKEN'] )
    indexes = indexes[0:max_length-1]
    indexes.append(END_TOKEN)
    indexes.extend( [PAD_TOKEN]*( max_length - len(indexes)))
    return indexes

def tensorFromSentence(lang, sentence, max_length):
    indexes = indexesFromSentence(lang, sentence, max_length)
    return torch.tensor(indexes, dtype=torch.long, device=device)

def tensorsFromPair(pairs, input_lang, output_lang, max_length):
    res_pairs = []
    for pair in pairs:
        input_tensor  = tensorFromSentence(input_lang, pair[0], max_length)
        target_tensor = tensorFromSentence(output_lang, pair[1], max_length)
        res_pairs.append( (input_tensor, target_tensor) )
    return res_pairs
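
The fixed-length encoding used above (start token, token indices, end token, then padding) can be checked on a toy vocabulary. The sketch below mirrors the logic of `indexesFromSentence` with plain lists; `to_indexes` and `toy_vocab` are hypothetical names for this illustration.

```python
START_TOKEN, END_TOKEN, PAD_TOKEN, UNK_TOKEN = 0, 1, 2, 3

def to_indexes(word2index, tokens, max_length):
    # START + known/unknown token ids, truncated so that END always fits.
    indexes = [START_TOKEN] + [word2index.get(token, UNK_TOKEN) for token in tokens]
    indexes = indexes[:max_length - 1] + [END_TOKEN]
    # Pad on the right up to the fixed length.
    return indexes + [PAD_TOKEN] * (max_length - len(indexes))

toy_vocab = {"i": 4, "like": 5}
short = to_indexes(toy_vocab, ["i", "like", "tea"], max_length=8)
long = to_indexes(toy_vocab, ["i", "like"] * 5, max_length=6)
print(short)  # [0, 4, 5, 3, 1, 2, 2, 2]
print(long)   # [0, 4, 5, 4, 5, 1]
```

Every sequence ends up with exactly `max_length` entries, which is what allows the pairs to be batched by a `DataLoader`.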

### NEURAL MACHINE TRANSLATOR
This subsection contains three main classes: Encoder, Decoder, and seq2seq, which combines the encoder and decoder.
It also contains functions to train, use, and evaluate the seq2seq model.


ENCODER and DECODER

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_size, embed_size)
        self.dropout = nn.Dropout(0.2)
        self.rnn = nn.LSTM(embed_size, hidden_size, bidirectional = True)
        self.fc_hidden = nn.Linear(hidden_size * 2, hidden_size)
        self.fc_cell = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, input):
        # input.shape :    [Sentence Length, Batch Size]
        # embedded.shape : [Sentence Length, Batch Size, Embedding Dimension]
        # output.shape :   [Sentence Length, Batch Size, Hidden Size * 2]
        # hidden.shape :   [Directions = 2, Batch Size, Hidden Size] from the LSTM
        # cell.shape   :   [Directions = 2, Batch Size, Hidden Size] from the LSTM
        # After the linear layers, hidden and cell are [Batch Size, Hidden Size]

        embed = self.embedding(input)
        embed = self.dropout(embed)
        output, (hidden, cell) = self.rnn(embed)

        hidden = torch.cat( (hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        hidden = self.fc_hidden(hidden)

        cell = torch.cat( (cell[-2,:,:], cell[-1,:,:]), dim=1)
        cell = self.fc_cell(cell)
        #hidden = torch.tanh(hidden)

        return output, hidden, cell

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear((hidden_size * 2) + hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        # hidden.shape          :   [Batch Size, Hidden Size]
        # encoder_outputs.shape :   [Sen Len, Batch Size, Hidden_size*2]
        
        # After adjusting
        # hidden.shape          :   [Batch Size, Sen Length, Hidden Size]
        # encoder_outputs.shape :   [Batch Size, Sen Length, Hidden_size*2]

        src_len = encoder_outputs.shape[0]
        hidden = hidden.unsqueeze(1)
        hidden = hidden.repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        comb        = torch.cat((hidden, encoder_outputs), dim = 2)
        energy      = torch.tanh( self.attn(comb) )
        attention   = self.v(energy).squeeze(2)
        attention   = F.softmax(attention, dim=1)
        attention   = attention.unsqueeze(1)
        weights     = torch.bmm(attention, encoder_outputs)
        weights     = weights.permute(1,0,2)
        return weights
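
The computation in `Attention.forward` corresponds to additive attention. Written out, with notation that is assumed here rather than taken from the notebook ($s$ is the decoder hidden state, $h_j$ the encoder output at source position $j$, $T$ the source length, $W_a$ and $b_a$ the parameters of `self.attn`, and $v$ the weight of `self.v`):

$$
\begin{aligned}
e_j &= v^\top \tanh\big(W_a\,[s ; h_j] + b_a\big), \\
\alpha_j &= \frac{\exp(e_j)}{\sum_{k=1}^{T} \exp(e_k)}, \\
c &= \sum_{j=1}^{T} \alpha_j\, h_j .
\end{aligned}
$$

The value returned as `weights` in the code is the context vector $c$, computed for every sentence in the batch.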

In [None]:
class DecoderAttn(nn.Module):
    def __init__(self, output_size, embed_size, hidden_size):
        super(DecoderAttn, self).__init__()
        self.embedding = nn.Embedding(output_size, embed_size)
        self.dropout = nn.Dropout(0.2)

        self.rnn   = nn.LSTM((hidden_size*2)+embed_size, hidden_size)
        self.dense  = nn.Linear(hidden_size*3 + embed_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

        self.attention = Attention(hidden_size)

    def forward(self, target, hidden, cell,  encoder_outputs):
        # target.shape :   [Batch Size]
        # target.shape :   [1, Batch Size] after unsqueezing
        # embed.shape  :   [1, Batch Size, Embedding Size]
        # output.shape :   [1, Batch Size, Hidden Size] before squeezing
        # hidden.shape :   [Batch Size, Hidden Size]
        # preds.shape  :   [Batch Size, Output_Vocabulary_Size]

        target = target.unsqueeze(0)
        embed = self.embedding(target)
        embed = self.dropout(embed)
        
        weights = self.attention(hidden, encoder_outputs)
        rinput   = torch.cat((embed, weights), dim = 2)

        hidden = hidden.unsqueeze(0)
        cell = cell.unsqueeze(0)
        output, (hidden, cell) = self.rnn(rinput, (hidden,cell))

        dense_input = torch.cat((output, weights, embed), dim=2)
        preds = self.dense(dense_input[0])
        #preds = F.relu(preds)
        preds = self.softmax(preds)
        return preds, hidden.squeeze(0), cell.squeeze(0)

In [None]:
class seq2seq(nn.Module):
    def __init__(self, input_size, output_size, embed_size, hidden_size, max_length):
        super(seq2seq, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embed_size = embed_size
        self.max_length = max_length

        self.encoder = Encoder(input_size, embed_size, hidden_size).to(device)
        self.decoder = DecoderAttn(output_size, embed_size, hidden_size).to(device)

    def forward(self, src, target , teacher_forcing = 0.5):
        # If teacher forcing is set to 0.5, the true target token is used as the next
        # decoder input about half the time, and the predicted token is used otherwise.
        # If teacher forcing is 0, the previous prediction is always used as decoder input.

        # src.shape    = [Input Sentence Length, Batch Size]
        # target.shape = [Output Sentence Length, Batch Size]
        # outputs.shape = [Output Sentence Length, Batch Size, Output Vocabulary Size]
        # Encode the Source Sentence; Decode the tokens one by one.

        batch_size, target_vocab_size = src.shape[1], self.output_size
        outputs = torch.zeros(self.max_length, batch_size, target_vocab_size).to(device)
        encoder_outputs, hidden, cell = self.encoder(src)
        
        dinput = src[0,:]
        for index in range(1, self.max_length):
            output, hidden, cell = self.decoder(dinput, hidden, cell, encoder_outputs)
            if random.random() < teacher_forcing:
                dinput = target[index]  
            else:
                dinput = output.argmax(1)
            outputs[index] = output

        return outputs
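
The teacher-forcing decision in the loop above can be isolated from the network. The sketch below is a toy illustration only: `decode_inputs` and `toy_predict` are hypothetical, and the "model" is a stand-in function rather than a decoder.

```python
import random

def decode_inputs(targets, predict, teacher_forcing, rng):
    # Return the sequence of tokens fed to the decoder at each step.
    fed = [targets[0]]                      # first input is the start token
    for step in range(1, len(targets)):
        predicted = predict(fed[-1])        # model's guess for this step
        if rng.random() < teacher_forcing:
            fed.append(targets[step])       # feed the ground-truth token
        else:
            fed.append(predicted)           # feed the model's own prediction
    return fed

def toy_predict(token):
    # Hypothetical "model": always predicts the previous token plus 10.
    return token + 10

targets = [0, 7, 8, 9]
always_truth = decode_inputs(targets, toy_predict, 1.0, random.Random(0))
never_truth = decode_inputs(targets, toy_predict, 0.0, random.Random(0))
print(always_truth)  # [0, 7, 8, 9]
print(never_truth)   # [0, 10, 20, 30]
```

With a ratio of 0 the targets are never read, which is why `translate` can call the model with `target=None`.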

In [None]:
# Set model in training mode to activate dropouts
# Transpose the text tokens to adjust to pytorch
# Forward Pass on Encoder-Decoder
# Optimize network
def train( model, opt, lossfn, train_loader, r_epoch, save_model=0):
    model.train()
    history = []
    num_batches = len(train_loader)
    tf_ratio = 0.5

    for epoch in range(r_epoch[0], r_epoch[1]):
        epoch_loss = 0
        if((epoch+1) % 5 == 0):
            tf_ratio = tf_ratio - 0.1

        for inS, outS in tq.tqdm( train_loader ):
            opt.zero_grad()
            loss = 0

            inS =  inS.transpose(0, 1)
            outS = outS.transpose(0, 1)
            predoutS = model(inS, target = outS, teacher_forcing=tf_ratio)
            outS     = outS[1:].reshape(-1)       # Reshape outputs
            predoutS = predoutS[1:].reshape(-1, predoutS.shape[-1])

            loss = lossfn(predoutS, outS)         # Compute Loss
            loss.backward()                       # Backpropagate the loss through the network
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)  # Gradient Clipping
            opt.step()                            # Update the weights
            epoch_loss = epoch_loss + loss.item()
            

        print(' Epoch : ', epoch , '   loss  : ', epoch_loss / num_batches )
        history.append(epoch_loss / num_batches)

        if( save_model == 1):
            if( (epoch+1)%5 == 0):
                torch.save(model.state_dict(), model_location + 'bilstmattn_dict_' + str(epoch+1) )
            if( (epoch+1)%20 == 0):
                torch.save(model, model_location + 'bilstmattn_' +  str(epoch+1) )

    return history
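
The teacher-forcing ratio used by `train` follows a simple step schedule. The sketch below reproduces it outside the training loop so the values can be inspected; `teacher_forcing_schedule` is a hypothetical helper, and the clamp at zero is an added assumption (the loop above lets the ratio go negative, which has the same effect as zero).

```python
def teacher_forcing_schedule(start_epoch, end_epoch, initial=0.5, step=0.1, every=5):
    # Mirror the training loop: lower the ratio whenever (epoch + 1) is a
    # multiple of `every`, but never let it drop below zero (an added clamp).
    ratio = initial
    schedule = {}
    for epoch in range(start_epoch, end_epoch):
        if (epoch + 1) % every == 0:
            ratio = max(0.0, round(ratio - step, 10))
        schedule[epoch] = ratio
    return schedule

schedule = teacher_forcing_schedule(0, 30)
print(schedule[0], schedule[4], schedule[9], schedule[29])  # 0.5 0.4 0.3 0.0
```

Because `train` resets the ratio to 0.5 on every call, a resumed run such as epochs 30 to 40 starts again from 0.5 rather than continuing the earlier decay.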

In [None]:
# Set the model to evaluation mode to disable the dropout layers
# Get a tensor from the sentence and reshape it to [Sequence Length, Batch Size = 1]
def make_sentence(tokens):
    sentence = ''
    for x in tokens:
        if x == 'UNK_TOKEN':
            sentence = sentence + ' ' + '<UNK>'
        elif x not in ['START_TOKEN', 'END_TOKEN', 'PAD_TOKEN']:
            sentence = sentence + ' ' + x
    # Rejoin digits that were split apart during preprocessing
    return re.sub(r'(?<=\d) (?=\d)', '', sentence)[1:]

def translate(model, sentence, input_lang, output_lang, max_length):
    model.eval()
    with torch.no_grad():
        input = tensorFromSentence( input_lang, sentence, max_length= max_length)
        input = torch.transpose( input.unsqueeze(0) , 0 , 1)
        output = model(input, target=None, teacher_forcing = 0)
        dec_words = []
        for x in output.squeeze():
            i = x.argmax(0)
            dec_words.append( output_lang.index2word[ i.item() ] )
            if(i.item() == END_TOKEN ):
                break
    return make_sentence( dec_words )

def translate_beam(model, sentence, input_lang, output_lang, max_length, beam_length = 2):
    model.eval()
    with torch.no_grad():
        input = tensorFromSentence( input_lang, sentence, max_length= max_length)
        input = torch.transpose( input.unsqueeze(0) , 0 , 1)
        #output = model(input, target=None, teacher_forcing = 0)

        encoder_outputs, hidden, cell = model.encoder(input)

        sequences = [[list(), input[0,:] , hidden, cell, 0.0]]
        for index in range(1, max_length):
            all_candidates = list()
            for i in ( range(len(sequences)) ):
                seq, dinput, hidden, cell, score = sequences[i]
                output, hidden, cell = model.decoder(dinput, hidden, cell, encoder_outputs)
                topv, topi = output[0].data.topk(2)

                

                # Use the output_lang argument rather than the global `english`
                candidate1 = [seq + [ output_lang.index2word[ topi[0].item()]],
                              topi[0].unsqueeze(0),
                              hidden,
                              cell,
                              score - topv[0].item()]
                
                candidate2 = [seq + [ output_lang.index2word[ topi[1].item()]],
                              topi[1].unsqueeze(0),
                              hidden,
                              cell,
                              score - topv[1].item()]

                all_candidates.append(candidate1)
                all_candidates.append(candidate2)

            ordered = sorted(all_candidates, key=lambda tup:tup[4])
            sequences = ordered[:beam_length]
    translations = []
    for x in sequences:
        translations.append( [ make_sentence(x[0]), x[-1] ])      
    return translations[0][0]
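
To show why beam search can beat greedy decoding, here is a small stand-alone sketch on a toy next-token table. It is not the notebook's implementation: `beam_search`, `TOY_MODEL`, and `toy_next_log_probs` are hypothetical, the sketch expands every token rather than the top two, and it stops extending a hypothesis once it emits the end token.

```python
import math

def beam_search(next_log_probs, beam_width, max_steps, end_token):
    # Each hypothesis is (tokens, cumulative log-probability).
    beams = [([], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == end_token:
                candidates.append((tokens, score))   # finished: carry over
                continue
            previous = tokens[-1] if tokens else None
            for token, log_prob in next_log_probs(previous).items():
                candidates.append((tokens + [token], score + log_prob))
        # Keep the beam_width hypotheses with the highest total log-probability.
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
    return beams

# Toy model: the next-token distribution depends only on the previous token.
TOY_MODEL = {
    None: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.34, "b": 0.33, "<end>": 0.33},
    "b": {"<end>": 0.95, "a": 0.05},
}

def toy_next_log_probs(previous):
    return {token: math.log(p) for token, p in TOY_MODEL[previous].items()}

greedy = beam_search(toy_next_log_probs, beam_width=1, max_steps=3, end_token="<end>")[0][0]
beam = beam_search(toy_next_log_probs, beam_width=2, max_steps=3, end_token="<end>")[0][0]
print(greedy)  # ['a', 'a', 'a']
print(beam)    # ['b', '<end>']
```

In `translate_beam` above, each hypothesis is expanded with only its two most likely tokens (`topk(2)`) regardless of `beam_length`, and scores are stored as negative log-probabilities sorted in ascending order, which ranks hypotheses the same way as the descending sort used here.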

### PERFORMANCE EVALUATION
Evaluation script modified to report BLEU and METEOR scores.

In [None]:
def get_bleu_score(model, pairs, input_lang, output_lang, max_length, beam = 0):
    total_num = len(pairs)
    total_bleu_scores = 0
    total_meteor_scores = 0
    
    for i in tq.tqdm( range(total_num) ):
        if( beam == 0 ):
            output  = translate(model, pairs[i][0], input_lang, output_lang, max_length)
        else:
            output  = translate_beam(model, pairs[i][0], input_lang, output_lang, max_length, beam)
        original  = make_sentence(pairs[i][1])
        # NLTK expects the reference(s) first and the hypothesis second
        total_bleu_scores   += sentence_bleu([original.split(" ")], output.split(" "))
        total_meteor_scores += single_meteor_score(original, output)

    bleu_result = total_bleu_scores/total_num
    meteor_result = total_meteor_scores/total_num
    
    print()
    print("BLEU score: ",bleu_result)
    print("METEOR score: ",meteor_result)
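
BLEU is built from clipped n-gram precisions for n = 1 to 4. The sketch below computes those precisions by hand for one made-up sentence pair, as an aid to reading the scores; `ngram_precision` is a hypothetical helper and is not how NLTK is invoked above.

```python
from collections import Counter

def ngram_precision(reference, hypothesis, n):
    # Clipped n-gram precision: the building block that BLEU combines over n = 1..4.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp_counts, ref_counts = ngrams(hypothesis), ngrams(reference)
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()
precisions = [ngram_precision(reference, hypothesis, n) for n in range(1, 5)]
print(precisions)
```

Here the 4-gram precision is zero even though five of six words match. Since BLEU takes a geometric mean of the four precisions, a single zero drives the unsmoothed sentence score to zero, which is the situation the NLTK warnings in the evaluation output describe.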

# EXECUTION
Executing the whole process.


1.   Read the training data
2.   Process all sentences (English and Hindi)
3.   Generate Language (word2index and index2word)
4.   Prepare tensors for all tokens.
5.   Create the seq2seq model and train the model
6.   Evaluate the performance
7.   Use the model for weekly translation



READ AND PROCESS FILE

In [None]:
MAX_LENGTH = 32
batch_size = 256


print('Reading Training Data ... ', end = '')
df = read_csv(data_location + 'train.csv', 'train')
print('Done')

print('Processing Strings ... ', end = '')
pairs, tokens = process_pairs(df, load_from_file=1, location = data_location + 'DataPairs/')
print('Done')

print('Splitting Dataset ... ', end = '')
index, train_tokens, test_tokens = train_test_split(tokens,  0.2)
print('Done')

print('Preparing Language Word2vectors and inverse ... ', end = '')
# Generate Language Input and Output
hindi, english = generate_language(tokens)
print('Done, Hindi Token Count : ', hindi.num_words, '  English Token Count : ', english.num_words)

print('Preparing Tensors ... ', end = '')
# Get Tensors for tokens and create Dataloaders
train_tensors = tensorsFromPair(train_tokens, hindi, english, MAX_LENGTH)
test_tensors = tensorsFromPair(test_tokens, hindi, english, MAX_LENGTH)
print('Done')

print('Preparing Dataloaders ... ', end = '')
train_loader = torch.utils.data.DataLoader(train_tensors, batch_size=batch_size, shuffle=True)
test_loader  = torch.utils.data.DataLoader(test_tensors, batch_size=batch_size, shuffle=True)
pretrainloader = torch.utils.data.DataLoader(train_tensors[0:batch_size], batch_size=batch_size, shuffle=True)
print('Done')

Reading Training Data ... Done
Processing Strings ... Done
Splitting Dataset ... Done
Preparing Language Word2vectors and inverse ... 



Done, Hindi Token Count :  21104   English Token Count :  18988
Preparing Tensors ... Done
Preparing Dataloaders ... Done


TRAIN MODEL

In [None]:
# Model Parameters
print('Initialising Parameters')
hidden_size = 512
input_vocab_size = hindi.num_words + 1
output_vocab_size = english.num_words + 1
embedding_dim = 300
epochs = 20
pretrain_epoch = 0
#save_losses
Losses = []

#Generate Model, optimizer, lossfn
print('Creating Models ... ', end = ' ')
model = seq2seq(input_vocab_size, output_vocab_size , embedding_dim, hidden_size, MAX_LENGTH)
optimizer = optim.Adam( model.parameters())
lossfn = nn.NLLLoss(ignore_index=PAD_TOKEN)
print('Done')

#load_model weights if available
load_model = 1
if(load_model==1):
    print('Loading Pretrained Weights .. :')
    model.load_state_dict( torch.load(model_location + 'bilstmattn_dict_40'))
model.eval() 

Initialising Parameters
Creating Models ...  Done
Loading Pretrained Weights .. :


seq2seq(
  (encoder): Encoder(
    (embedding): Embedding(21105, 300)
    (dropout): Dropout(p=0.2, inplace=False)
    (rnn): LSTM(300, 512, bidirectional=True)
    (fc_hidden): Linear(in_features=1024, out_features=512, bias=True)
    (fc_cell): Linear(in_features=1024, out_features=512, bias=True)
  )
  (decoder): DecoderAttn(
    (embedding): Embedding(18989, 300)
    (dropout): Dropout(p=0.2, inplace=False)
    (rnn): LSTM(1324, 512)
    (dense): Linear(in_features=1836, out_features=18989, bias=True)
    (softmax): LogSoftmax(dim=1)
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
  )
)

In [None]:
# Pretrain to Overfit Model on single batch
train(model, optimizer, lossfn, pretrainloader, (0,200))


 Epoch :  178    loss  :  0.14534761011600494
 Epoch :  179    loss  :  0.1387416422367096
 Epoch :  180    loss  :  0.13657839596271515
 Epoch :  181    loss  :  0.12976236641407013
 Epoch :  182    loss  :  0.13843528926372528
 Epoch :  183    loss  :  0.14021731913089752
 Epoch :  184    loss  :  0.12769953906536102
 Epoch :  185    loss  :  0.1330758035182953
 Epoch :  186    loss  :  0.12409714609384537
 Epoch :  187    loss  :  0.1224239394068718
 Epoch :  188    loss  :  0.12244855612516403
 Epoch :  189    loss  :  0.1214105486869812
 Epoch :  190    loss  :  0.11312056332826614
 Epoch :  191    loss  :  0.10908375680446625
 Epoch :  192    loss  :  0.10826016962528229
 Epoch :  193    loss  :  0.11935558170080185
 Epoch :  194    loss  :  0.10815658420324326
 Epoch :  195    loss  :  0.10154125839471817
 Epoch :  196    loss  :  0.10198602825403214
 Epoch :  197    loss  :  0.09033168107271194
 Epoch :  198    loss  :  0.09616680443286896
 Epoch :  199    loss  :  0.10758029669523239


[9.880617141723633,
 9.26675033569336,
 8.024361610412598,
 6.803066253662109,
 6.195061206817627,
 5.67664098739624,
 5.444336891174316,
 5.369250297546387,
 5.331640243530273,
 5.3879570960998535,
 5.427333354949951,
 5.200872421264648,
 5.2414069175720215,
 5.126220226287842,
 5.169053554534912,
 5.132333278656006,
 5.139401435852051,
 5.080004692077637,
 4.824460506439209,
 4.987847328186035,
 4.751997947692871,
 4.933948993682861,
 4.955910682678223,
 4.916214942932129,
 4.914382457733154,
 4.871389865875244,
 4.841187000274658,
 4.78424596786499,
 4.760263442993164,
 4.724308490753174,
 4.661548614501953,
 4.610487461090088,
 4.5752949714660645,
 4.51602840423584,
 4.477916717529297,
 4.445582389831543,
 4.372894287109375,
 4.324489116668701,
 4.257239818572998,
 4.172518253326416,
 4.095345497131348,
 4.030576229095459,
 3.9869747161865234,
 3.903871774673462,
 3.841989755630493,
 3.7306408882141113,
 3.6819193363189697,
 3.644474983215332,
 3.5418481826782227,
 3.45557785034179

In [None]:
# Final Train on all Training Data Set, # Append the losses
pretrain_epoch=30
epochs = 10
history = train(model, optimizer, lossfn, train_loader, (pretrain_epoch , pretrain_epoch + epochs), save_model = 1)
Losses.extend(history)
Losses

 Epoch :  30    loss  :  1.2528340870514512
 Epoch :  31    loss  :  1.1872194083407521
 Epoch :  32    loss  :  1.1575264403596521
 Epoch :  33    loss  :  1.1352653423324228
 Epoch :  34    loss  :  1.2145114550366998
 Epoch :  35    loss  :  1.1967773906886578
 Epoch :  36    loss  :  1.1967438181862236
 Epoch :  37    loss  :  1.1592492908239365
 Epoch :  38    loss  :  1.1418753948062659
 Epoch :  39    loss  :  1.24089040979743


[1.2528340870514512,
 1.1872194083407521,
 1.1575264403596521,
 1.1352653423324228,
 1.2145114550366998,
 1.1967773906886578,
 1.1967438181862236,
 1.1592492908239365,
 1.1418753948062659,
 1.24089040979743]

In [None]:
# Save the full model, the encoder, and the decoder, each as a state dict and as a pickled module.
# Note: the filename suffix is the epoch count of the last run (`epochs`), not the cumulative total.
torch.save(model.state_dict(), model_location + 'bilstm_np_dict_' + str(epochs) )
torch.save(model, model_location + 'bilstm_np_' + str(epochs) )
torch.save(model.encoder.state_dict(), model_location + 'bilstm_enc_dict_' + str(epochs) )
torch.save(model.encoder, model_location + 'bilstm_enc_' + str(epochs) )
torch.save(model.decoder.state_dict(), model_location + 'bilstm_dec_dict_' + str(epochs) )
torch.save(model.decoder, model_location + 'bilstm_dec_' + str(epochs) )
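
The observations at the top of the notebook say that saved models cannot be reused because the word-to-index mapping changes between runs. One possible remedy is to store the vocabulary next to the checkpoint and reload it instead of rebuilding it. The sketch below uses only the standard library and a toy dictionary. It assumes the language objects keep a plain `word2index` dict, which is a guess based on the names in those observations; `save_vocab` and `load_vocab` are hypothetical helpers.

```python
import json
import os
import tempfile


def save_vocab(word2index, path):
    """Write the word -> index mapping as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(word2index, f, ensure_ascii=False)


def load_vocab(path):
    """Read the mapping back and rebuild the inverse index -> word dict."""
    with open(path, encoding="utf-8") as f:
        word2index = json.load(f)
    index2word = {index: word for word, index in word2index.items()}
    return word2index, index2word


# Toy vocabulary standing in for the notebook's language objects (assumption).
toy_word2index = {"<SOS>": 0, "<EOS>": 1, "यह": 2, "है": 3}
vocab_path = os.path.join(tempfile.mkdtemp(), "hindi_vocab.json")

save_vocab(toy_word2index, vocab_path)
restored_word2index, restored_index2word = load_vocab(vocab_path)
```

With the mapping restored from disk, the embedding rows in a reloaded state dict would line up with the same words they were trained on.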

### USE MODEL

In [None]:
get_bleu_score(model, test_tokens[0:5000], hindi, english, MAX_LENGTH, beam = 0)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()




BLEU score:  0.048585063745602654
METEOR score:  0.371080236895734
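
The warnings above come from sentences that share no 2-gram, 3-gram, or 4-gram with the reference, so their unsmoothed BLEU collapses to zero. In NLTK this is normally handled by passing a `smoothing_function` to `sentence_bleu`. The sketch below is a plain-Python illustration of the idea and not NLTK's implementation: it replaces a zero match count with a small `epsilon`. The function names and the `epsilon` default are assumptions made for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_sentence_bleu(reference, hypothesis, max_n=4, epsilon=0.1):
    """Single-reference BLEU where zero n-gram matches are replaced by epsilon."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = max(1, sum(hyp.values()))
        if matches == 0:
            matches = epsilon  # smoothing step; epsilon=0 disables it
        if matches == 0:
            return 0.0  # unsmoothed case: one empty order zeroes the score
        log_precision_sum += math.log(matches / total)
    # Brevity penalty for hypotheses no longer than the reference.
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "all this that you re doing he does nt deserve all this".split()
short_hypothesis = ["this", "he", "all"]  # unigram overlap only

unsmoothed = smoothed_sentence_bleu(reference, short_hypothesis, epsilon=0.0)
smoothed = smoothed_sentence_bleu(reference, short_hypothesis)
```

Here `unsmoothed` is zero while `smoothed` is small but positive, so short or partly correct translations still contribute to an averaged score.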


In [None]:
get_bleu_score(model, test_tokens[0:5000], hindi, english, MAX_LENGTH , beam = 2)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()




BLEU score:  0.04540611836691771
METEOR score:  0.3072566108387664


In [None]:
get_bleu_score(model, test_tokens[0:5000], hindi, english, MAX_LENGTH , beam = 3)

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()




BLEU score:  0.04422212105648841
METEOR score:  0.30498109585759126


### USE MODEL FOR WEEKLY TRANSLATION

In [None]:
# Load the weekly Hindi statements
print('Reading Weekly Data ... ', end = '')
week = read_csv(weekly_data_location + 'Week4/hindistatements.csv', file_type='weekly')
print('Done')

# Preprocess and tokenize every Hindi sentence
print('Process Weekly Hindi Data ... ', end = '')
week_processed = [get_hindi_tokens(preprocess_hindi(x)) for x in week['hindi']]
print('Done')

# Translate each sentence with beam search (beam width 2)
print('Translating all the sentences ... ', end = '')
translated_texts = []
for sentence_tokens in tq.tqdm(week_processed):
    translated_texts.append( translate_beam(model, sentence_tokens, hindi, english, MAX_LENGTH, beam=2) )
print('Done')

# Write one line per input sentence
print('Storing translated Sentences ... ', end = '')
with open(weekly_data_location + 'Week4/beam.txt', 'w') as f:
    for item in translated_texts:
        f.write("%s\n" % item)
print('Done')

Reading Weekly Data ... Done
Process Weekly Hindi Data ... Done
Translating all the sentences ... 
Done
Storing translated Sentences ... Done


In [None]:
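i = 95008
# Source tokens, reference tokens, beam-search candidates (beam width 4), and the greedy translation
(tokens[i][0],
 tokens[i][1],
 translate_beam(model, tokens[i][0], hindi, english, MAX_LENGTH, 4),
 translate(model, tokens[i][0], hindi, english, MAX_LENGTH))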

(['सभी',
  'यह',
  'है',
  'कि',
  'आप',
  'क्या',
  'कर',
  'रहे',
  'हैं',
  'वह',
  'यह',
  'सब',
  'लायक',
  'नहीं',
  'है'],
 ['all',
  'this',
  'that',
  'you',
  're',
  'doing',
  'he',
  'does',
  'nt',
  'deserve',
  'all',
  'this'],
 [['all this that you re doing he does nt deserve all this',
   5.911160113057122],
  ['all this that you re doing he does nt deserve all this all',
   6.126864475896582],
  ['all this that you re doing he does nt deserve all this all',
   6.179305584868416],
  ['all this is you re doing he does nt deserve all this', 6.765873832628131]],
 'all this that you re doing he does nt deserve all this')
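
In the output above, the beam search returns a list of `[sentence, score]` pairs, and the first pair matches the greedy translation. This suggests, without confirming, that a lower score is better. Under that assumption, the sketch below picks a single sentence from the candidates and drops repeated ones. `best_candidate` and `unique_sentences` are hypothetical helpers and are not defined in the notebook.

```python
# Candidates copied from the beam-search output above: [sentence, score] pairs.
candidates = [
    ["all this that you re doing he does nt deserve all this", 5.911160113057122],
    ["all this that you re doing he does nt deserve all this all", 6.126864475896582],
    ["all this that you re doing he does nt deserve all this all", 6.179305584868416],
    ["all this is you re doing he does nt deserve all this", 6.765873832628131],
]


def best_candidate(pairs):
    """Return the sentence with the lowest score (assumed: lower is better)."""
    sentence, _score = min(pairs, key=lambda pair: pair[1])
    return sentence


def unique_sentences(pairs):
    """Drop repeated sentences, keeping the best-scoring copy of each."""
    seen = {}
    for sentence, score in sorted(pairs, key=lambda pair: pair[1]):
        seen.setdefault(sentence, score)
    return list(seen.items())


print(best_candidate(candidates))
print(len(unique_sentences(candidates)), "distinct candidates")
```

Applying a selection step like this before writing a results file would store one plain sentence per line instead of the full candidate list.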

In [None]:
#translate(model, week_processed[i], hindi, english, MAX_LENGTH)


#torch.save( tmodel.state_dict(), model_location + 'gru_dict_100')
#torch.save(model, location+ 'gru_enc_dec')

#tmodel = torch.load(model_location+ 'gru_100')
#tmodel.eval()

#tq.tqdm._instances.clear()