# 1. Introduction

This notebook implements a sequence-to-sequence model for Hindi-to-English neural machine translation in PyTorch. The encoder and the decoder are each a unidirectional LSTM (Long Short-Term Memory) network with 2 layers. Every source and target sentence is wrapped in start-of-sequence (\<sos\>) and end-of-sequence (\<eos\>) tokens. The Indic NLP Library is used to tokenize both the Hindi and the English sentences. Cross-entropy loss is used to compute the loss that drives the updates of the model parameters.
The notebook is divided into the following sections:
1. Introduction
2. Installing the required packages
3. Pre-processing data
4. Building the Vocabulary
5. Model Architecture
6. Training the Model
7. Testing the Model
8. Generating Predictions
9. References

# 2. Installing the required packages

In [None]:
import csv
import torch
import random
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
#installing Indic NLP packages for hindi and english tokenizer
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
!git clone "https://github.com/anoopkunchukuttan/indic_nlp_library"

fatal: destination path 'indic_nlp_resources' already exists and is not an empty directory.
fatal: destination path 'indic_nlp_library' already exists and is not an empty directory.


In [None]:
INDIC_NLP_LIB_HOME=r"/content/indic_nlp_library" # path to local git repo for Indic NLP library
INDIC_NLP_RESOURCES="/content/indic_nlp_resources" # path to local git repo for Indic NLP Resources

In [None]:
import sys
sys.path.append(r'{}'.format(INDIC_NLP_LIB_HOME))
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp import loader
loader.load()
from indicnlp.tokenize import indic_tokenize

In [None]:
# from google.colab import drive
# drive.mount('/content/gdrive')

Mounted at /content/gdrive


# 3. Pre-processing data

In this section, the training data is prepared. All English letters are converted to lower case and clitics are expanded to their full words; for example, "you're" is replaced with "you are". Similarly, other clitics such as 're, 'm, 've, 's, 'll and n't are replaced with are, am, have, is, will and not respectively. Punctuation marks and symbols such as -, (, ), {, }, [, ], :, ', ", &, comma, #, /, \\, ♪, =, ¶ and ~ are also removed.
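
The clitic expansion described above can be sketched as an ordered list of string replacements. This is a minimal illustration, not the notebook's full rule set: `RULES` and `normalize` are hypothetical names, and only a handful of rules are included. The order matters, because a specific pattern such as "let's" has to be handled before the generic "'s" rule.

```python
# Hypothetical, shortened rule list; specific patterns come before generic ones.
RULES = [
    ("let's", "let us"),
    ("can't", "can not"),
    ("n't", " not"),
    ("'re", " are"),
    ("'m", " am"),
    ("'ve", " have"),
    ("'ll", " will"),
    ("'s", " is"),
    (",", " "),
    ("&", " and "),
]

def normalize(sentence):
    """Lower-case the sentence, apply the rules in order, collapse spaces."""
    text = sentence.lower()
    for src, trg in RULES:
        text = text.replace(src, trg)
    # split/join removes leading, trailing and repeated spaces
    return " ".join(text.split())

print(normalize("You're right, it's a happy film & I'm glad."))
print(normalize("Let's go"))
```

Keeping the rules in a list rather than relying on dictionary order makes the precedence explicit.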

In [None]:
train_set=[] #list to store pair of Hindi and English sentences
with open('train.csv', 'r', encoding='utf-8') as f: #reading the train.csv file
    csv_reader = csv.reader(f, delimiter=',')
    next(csv_reader, None) # skip the header row so the column names are not stored in the train data list
    for row in csv_reader:
      train_set.append([row[1],row[2].lower()]) # lower casing the english sentences while storing them in list
train_set[0:10] 

[['एल सालवाडोर मे, जिन दोनो पक्षों ने सिविल-युद्ध से वापसी ली, उन्होंने वही काम किये जो कैदियों की कश्मकश के निदान हैं।',
  "in el salvador, both sides that withdrew from their civil war took moves that had been proven to mirror a prisoner's dilemma strategy."],
 ['मैं उनके साथ कोई लेना देना नहीं है.', 'i have nothing to do with them.'],
 ['-हटाओ रिक.', 'fuck them, rick.'],
 ['क्योंकि यह एक खुशियों भरी फ़िल्म है.', "because it's a happy film."],
 ['The thought reaching the eyes...', 'the thought reaching the eyes...'],
 ['मैंने तुमे School से हटवा दिया .', 'i got you suspended.'],
 ['यह Vika, एक फूल है.', "it's a flower, vika."],
 ['पर मेरे लिए उसका यहुदी विरोधी होना उसके कार्यों को और भी प्रशंसनीय बनाता है क्योंकि उसके पास भी पक्षपात करने के वही कारण थे जो बाकी फौजियों के पास थे पर उसकी सच जानने और उसे बनाए रखने की प्रेरणा सबसे ऊपर थी',
  'but personally, for me, the fact that picquart was anti-semitic actually makes his actions more admirable, because he had the same prejudices, the 

In [None]:
len(train_set) #size of original train dataset

102322

In [None]:
#removing the punctuation, unnecessary spaces and expanding words like "can't" to "can not"
# replacements run in dictionary order: "let\'s" is listed before "\'s", and the bare "\'" rule comes after all contractions
processing_dict={"\'re":" are","\'m":" am", "let\'s":"let us","\'s":" is","\'ve":" have","\'ll":" will",
    "don\'t":"do not","didn't":"did not","can\'t":"can not","couldn\'t":"could not",
    "wouldn\'t":"would not", "doesn\'t":"does not", "isn\'t":"is not","won\'t":"will not",
    "weren\'t":"were not","hadn\'t":"had not","aren\'t":"are not", "hasn\'t":"has not",
    "wasn\'t":"was not", "shouldn\'t":"should not","ain\'t":"am not","-":" ", "(":" ",")":" ","{":" ",
    "}":" ","[":" ", "]":" ",":":" ", "\'":" ","\"":" ","&":" and ", ",":" ","#":" ", "/":" ","\\":" ",
    "♪":" ","=":" ","¶":" ","~":" ","  ":" "}
for i in range(0,len(train_set)):
  for j in range(0,2):
    for (src,trg) in processing_dict.items():
      if src in train_set[i][j]:
        train_set[i][j]=train_set[i][j].replace(src,trg) 
    train_set[i][0]=train_set[i][0].replace("..."," ")
    train_set[i][0]=train_set[i][0].replace(".","|")
    train_set[i][1]=train_set[i][1].replace("...."," ")
    train_set[i][1]=train_set[i][1].replace("..."," ")
    train_set[i][1]=train_set[i][1].replace(".."," ")
with open('pre_processed_train.csv', 'w') as f:
    write = csv.writer(f)
    write.writerow(['hindi','english'])
    write.writerows(train_set)


In [None]:
train_set[0:20] # training data after pre-processing

[['एल सालवाडोर मे जिन दोनो पक्षों ने सिविल युद्ध से वापसी ली उन्होंने वही काम किये जो कैदियों की कश्मकश के निदान हैं।',
  'in el salvador both sides that withdrew from their civil war took moves that had been proven to mirror a prisoner is dilemma strategy.'],
 ['मैं उनके साथ कोई लेना देना नहीं है|', 'i have nothing to do with them.'],
 [' हटाओ रिक|', 'fuck them rick.'],
 ['क्योंकि यह एक खुशियों भरी फ़िल्म है|', 'because it is a happy film.'],
 ['The thought reaching the eyes ', 'the thought reaching the eyes '],
 ['मैंने तुमे School से हटवा दिया |', 'i got you suspended.'],
 ['यह Vika एक फूल है|', 'it is a flower vika.'],
 ['पर मेरे लिए उसका यहुदी विरोधी होना उसके कार्यों को और भी प्रशंसनीय बनाता है क्योंकि उसके पास भी पक्षपात करने के वही कारण थे जो बाकी फौजियों के पास थे पर उसकी सच जानने और उसे बनाए रखने की प्रेरणा सबसे ऊपर थी',
  'but personally for me the fact that picquart was anti semitic actually makes his actions more admirable because he had the same prejudices the same reason

### Creating the train and validation sets

In [None]:
n=len(train_set)
train_ratio=0.80 # 80:20 ratio is used for train and validation/test data
train_size=int(n*train_ratio)
val_size=int(n-train_size)
train_ds, val_ds = train_set[:train_size],train_set[train_size:]
len(train_ds), len(val_ds)

(81857, 20465)
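
The split arithmetic can be checked in isolation. The sketch below uses a hypothetical helper, `split_sizes`, and toy data; it mirrors the cell above in that the split is positional (the first 80% of the pairs go to training) and no shuffling takes place.

```python
def split_sizes(n, train_ratio=0.80):
    """Return (train_size, val_size) for n examples."""
    train_size = int(n * train_ratio)  # int() truncates towards zero
    return train_size, n - train_size

# Toy data standing in for the [hindi, english] pairs
pairs = [["h%d" % i, "e%d" % i] for i in range(10)]
train_size, val_size = split_sizes(len(pairs))
train_part, val_part = pairs[:train_size], pairs[train_size:]

print(train_size, val_size)
print(split_sizes(102322))  # size of the original training data
```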

In [None]:
# saving the train and validation data in csv file

with open('train_ds.csv', 'w') as f:
    # using csv.writer method from CSV package
    write = csv.writer(f)
    write.writerows(train_ds)


with open('validation_ds.csv', 'w') as f:
    # using csv.writer method from CSV package
    write = csv.writer(f)
    write.writerows(val_ds)

# 4. Building the Vocabulary

---



In [None]:
# function to tokenize hindi text
def hindi_tokenizer(text_in_hindi):
  hindi_tokens=[]
  for token in indic_tokenize.trivial_tokenize(text_in_hindi): # trivial_tokenize of indicNLP is used for tokenization
    hindi_tokens.append(token)
  return hindi_tokens # tokens of a sentence are returned as list

In [None]:
# function to tokenize english text
def english_tokenizer(text_in_english):
  english_tokens=[]
  for token in indic_tokenize.trivial_tokenize(text_in_english): # trivial_tokenize of indicNLP is used for tokenization
    english_tokens.append(token)
  return english_tokens # tokens of a sentence are returned as list

In [None]:
sos_token='<sos>' # start of sequence token; appended at start of sentence
eos_token='<eos>' # end of sequence token; appended at end of sentence
unk_token='<unk>' # unknown token; used to represent a word if that word is not found in the dictionary
pad_token='<pad>' # token for padding; used to make all sentences of equal length in a batch

In [None]:
# dictionary to keep count of occurrence of each English word
E_wordCount={} 

# dictionary to find the index for a word in English
E_word2index={sos_token:0, eos_token:1, unk_token:2, pad_token:3}

# dictionary to find the English word for a particular index
E_index2word={0:sos_token, 1:eos_token, 2:unk_token, 3:pad_token} 

# dictionary to keep count of occurrence of each Hindi word
H_wordCount={} 

# dictionary to find the index for a word in Hindi
H_word2index={sos_token:0, eos_token:1, unk_token:2, pad_token:3}

# dictionary to find the Hindi word for a particular index
H_index2word={0:sos_token, 1:eos_token, 2:unk_token, 3:pad_token}

E_count=4 # keeps count of number of words so far in English dictionary
H_count=4 # keeps count of number of words so far in Hindi dictionary

In [None]:
# function to add a word in English dictionary
def E_updateDict(eng_sentence):
  global E_count
  tokens= english_tokenizer(eng_sentence) #generating tokens for the given sentence
  for token in tokens:
    if token not in E_word2index.keys(): # check if the token already exists in English dictionary
      # if the token is not present in English dictionary then add it to word2index and index2word English dictionary
      E_word2index[token]= E_count
      E_index2word[E_count]=token
      E_wordCount[token]=1
      E_count+=1 # increasing the count of words in English vocabulary
    else:
      E_wordCount[token]+=1 # if the token exists in dictionary then simply increase its count of occurrence

# function to add a word in Hindi dictionary
def H_updateDict(hindi_sentence):
  global H_count
  tokens= hindi_tokenizer(hindi_sentence) #generating tokens for the given sentence
  for token in tokens:
    if token not in H_word2index.keys(): # check if the token already exists in Hindi dictionary
      # if the token is not present in Hindi dictionary then add it to word2index and index2word Hindi dictionary
      H_word2index[token]= H_count
      H_index2word[H_count]=token
      H_wordCount[token]=1
      H_count+=1 # increasing the count of words in Hindi vocabulary
    else:
      H_wordCount[token]+=1 # if the token exists in dictionary then simply increase its count of occurrence

In [None]:
# reading the training pairs to create hindi and english vocabulary
for pair in train_ds:
  H_updateDict(pair[0]) # updating hindi vocabulary
  E_updateDict(pair[1]) # updating english vocabulary

In [None]:
# number of words in hindi and english vocabulary
print(H_count, E_count) 

41006 28207
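
The vocabulary logic of this section can be condensed into a small self-contained sketch. It is an illustration under assumptions: `build_vocab` and `encode` are hypothetical names, and whitespace splitting stands in for `indic_tokenize.trivial_tokenize`. The four special tokens keep indices 0 to 3, and words that are missing from the vocabulary map to the \<unk\> index.

```python
SPECIALS = ["<sos>", "<eos>", "<unk>", "<pad>"]

def build_vocab(sentences):
    """Build word2index, index2word and word counts from raw sentences."""
    word2index = {token: idx for idx, token in enumerate(SPECIALS)}
    counts = {}
    for sentence in sentences:
        for token in sentence.split():  # stand-in for trivial_tokenize
            counts[token] = counts.get(token, 0) + 1
            if token not in word2index:
                word2index[token] = len(word2index)
    index2word = {idx: token for token, idx in word2index.items()}
    return word2index, index2word, counts

def encode(sentence, word2index):
    """Map a sentence to indices, wrapped in <sos> and <eos>."""
    unk = word2index["<unk>"]
    ids = [word2index["<sos>"]]
    ids += [word2index.get(token, unk) for token in sentence.split()]
    ids.append(word2index["<eos>"])
    return ids

w2i, i2w, counts = build_vocab(["i got you", "i have nothing"])
print(len(w2i))
print(encode("i got nothing new", w2i))  # "new" is out of vocabulary
```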


# 5. Model Architecture

Defining the Encoder architecture

In [None]:
class Encoder(nn.Module):
  def __init__(self, input_size, embedding_size, hidden_size, layers, dropout_val):
    #input size is equal to hindi vocabulary size and embedding_size is equal to dimensions of embeddings
    super(Encoder, self).__init__()
    self.dropout = nn.Dropout(dropout_val)
    self.embedding = nn.Embedding(input_size,embedding_size)
    #input to LSTM has dimensions equal to embedding_size and output has dimensions equal to hidden_size
    self.lstm = nn.LSTM(embedding_size, hidden_size, layers, dropout = dropout_val)
  
  def forward(self,token_vec):
    #token_vec is a vector of indices mapping a word to its index in the vocabulary
    embedding = self.dropout(self.embedding(token_vec)) #embedding is a 3D tensor of shape (seq length, batch_size, embedding_size)
    outputs, (hidden,cell) = self.lstm(embedding) #the embedding is passed as input to the LSTM
    return hidden, cell #encoder output is not stored; only hidden and cell states are of concern in encoder

Defining the Decoder architecture

In [None]:
class Decoder(nn.Module):
   def __init__(self, input_size, embedding_size, hidden_size, output_size, layers, dropout_val):
     # here, input_size= size of english vocabulary, embedding_size= dimensions of embedding as defined, hidden_size as defined and output_size=english vocabulary size
     super(Decoder, self).__init__()
     self.dropout = nn.Dropout(dropout_val)
     self.embedding = nn.Embedding(input_size,embedding_size)
     self.lstm = nn.LSTM(embedding_size, hidden_size, layers, dropout = dropout_val) 
     self.fc = nn.Linear(hidden_size, output_size) 

   def forward(self,token_vec,hidden,cell): #token_vec is one dimensional i.e. shape(token_vec) = (batch_size). 
     # However the decoder predicts one word at a time, thus the required dimensions are (1, batch_size). 
     token_vec = token_vec.unsqueeze(0) # unsqueeze adds one more dimension to token_vec
     embedding = self.dropout(self.embedding(token_vec)) # embedding is a 3D tensor of shape (1, batch_size, embedding_size)
     outputs, (hidden,cell) = self.lstm(embedding, (hidden,cell)) # hidden and cell states are used to determine the next word in the sequence and output is the current predicted word. 
     predictions = self.fc(outputs) # shape(outputs)= (1, batch size, hidden size), shape(predictions)= (1, batch_size, english vocabulary size)
     predictions = predictions.squeeze(0) # remove the one extra dimension which was added using unsqueeze. 
     return predictions, hidden, cell

Defining the Seq2Seq class to define the model architecture

In [None]:
# Now we need to define a class which will define our model.
class seq2seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(seq2seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_force_ratio = 0.5): # teacher_force_ratio is the probability of using teacher forcing at each decoding step.
        # With teacher forcing the next input word to the decoder is the actual/target word; otherwise it is the previous predicted word.
        output_vec = torch.zeros(target.shape[0], source.shape[1], E_count).to(device)
        hidden, cell = self.encoder(source)
        input_token = target[0]
        for i in range(1, target.shape[0]):
            output, hidden, cell = self.decoder(input_token, hidden, cell) # Size of output = (batch_size, eng vocabulary size)          
            guess_val = output.argmax(1)
            if (random.random() < teacher_force_ratio): # true about half of the time if teacher_force_ratio is 0.5
              input_token = target[i]  #in this case next input to the decoder is target/actual word
            else:
              input_token = guess_val #in this case next input to the decoder is predicted word
            output_vec[i] = output
        return output_vec
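
The teacher-forcing decision in `forward` can be isolated from the network. The following is a minimal sketch in plain Python, assuming hypothetical helpers `choose_next_input` and `decoder_inputs`; the predictions are supplied as a fixed list instead of coming from a decoder.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_force_ratio, rng=random):
    """Return the target token with probability teacher_force_ratio, else the prediction."""
    if rng.random() < teacher_force_ratio:
        return target_token
    return predicted_token

def decoder_inputs(targets, predictions, teacher_force_ratio, seed=0):
    """List the token fed to the decoder at every step of one sentence."""
    rng = random.Random(seed)
    inputs = [targets[0]]  # decoding always starts from <sos>
    for step in range(1, len(targets) - 1):
        inputs.append(
            choose_next_input(targets[step], predictions[step], teacher_force_ratio, rng)
        )
    return inputs

targets = ["<sos>", "i", "am", "here", "<eos>"]
predictions = [None, "you", "are", "there", "<eos>"]  # position 0 is never predicted
print(decoder_inputs(targets, predictions, 1.0))  # always the target word
print(decoder_inputs(targets, predictions, 0.0))  # always the predicted word
```

A ratio of 0 corresponds to how `evaluate` calls the model later in the notebook, where teacher forcing is turned off.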

# 6. Training the Model

Setting hyperparameters for training

In [None]:
#Hyperparameters
batch_size = 32
learning_rate = 0.001
epochs =30
epoch_loss=0.0 # training loss in each epoch
layers = 2 # number of neural network layers in rnn
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder_input_size = H_count 
decoder_input_size = E_count
output_size = E_count
hidden_size = 512 # encoder and decoder have same hidden size
embedding_size = 250 # encoder and decoder embedding size
dropout = 0.5 # encoder and decoder dropout value

**Preparing data for training the model:** First, the training pairs are sorted by the length of the Hindi sentence, so that each batch holds sentences of similar length and needs little padding. For every batch, the maximum sentence length (in tokens, over both the Hindi and the English side, plus 2 for the \<sos\> and \<eos\> tokens) is computed and stored in a dictionary with batch_id as key. The H_sentenceToTensor and E_sentenceToTensor functions then convert each sentence to a vector of indices and append the \<pad\> token to every sentence that is shorter than the maximum length of its batch. Finally, DataLoader is used to create batches of the required batch size. Because the data is not shuffled, each batch contains sentences of the same length.
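
The batching scheme can be sketched without PyTorch. This is a simplified illustration: `pad` and `make_padded_batches` are hypothetical names, sentences are given as token lists, and the sort key is the number of source tokens.

```python
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"

def pad(tokens, max_len):
    """Wrap tokens in <sos>/<eos> and right-pad with <pad> up to max_len."""
    seq = [SOS] + list(tokens) + [EOS]
    return seq + [PAD] * (max_len - len(seq))

def make_padded_batches(pairs, batch_size):
    """Sort pairs by source length, cut into batches, pad each batch to its own maximum."""
    ordered = sorted(pairs, key=lambda pair: len(pair[0]))
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        # longest sentence on either side, plus 2 for <sos> and <eos>
        max_len = max(max(len(src), len(trg)) for src, trg in chunk) + 2
        batches.append([(pad(src, max_len), pad(trg, max_len)) for src, trg in chunk])
    return batches

pairs = [
    (["a", "b", "c"], ["x"]),
    (["a"], ["x", "y"]),
    (["a", "b"], ["x", "y", "z", "w"]),
]
batches = make_padded_batches(pairs, batch_size=2)
for batch in batches:
    print([len(src) for src, _ in batch])
```

Padding to a per-batch maximum instead of a global one keeps short batches short, which is the reason for sorting first.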

In [None]:
# sorting the training data according to length of hindi sentences
train_ds.sort(key= lambda x: len(x[0])) 

In [None]:
# finding the maximum length of sentence in a batch
max_length_train={} # stores maximum length of sentences in training data
batch_id=1
# computing maximum length for each batch of training data
for i in range(0,len(train_ds),batch_size):
  max_len=0 
  for pair in train_ds[i:i+batch_size]:
    E_maxlength=0
    H_maxlength=0
    for token in indic_tokenize.trivial_tokenize(pair[0]):
      H_maxlength+=1
    for token in indic_tokenize.trivial_tokenize(pair[1]):
      E_maxlength+=1
    max_len=max(max_len,E_maxlength, H_maxlength )
  max_length_train[batch_id]=max_len+2
  batch_id+=1

max_length_test={} # stores maximum length of sentences in test/validation data
batch_id=1
# computing maximum length for each batch of validation data
for i in range(0,len(val_ds),batch_size):
  max_len=0
  for pair in val_ds[i:i+batch_size]:
    E_maxlength=0
    H_maxlength=0
    for token in indic_tokenize.trivial_tokenize(pair[0]):
      H_maxlength+=1
    for token in indic_tokenize.trivial_tokenize(pair[1]):
      E_maxlength+=1
    max_len=max(max_len,E_maxlength, H_maxlength )
  max_length_test[batch_id]=max_len+2
  batch_id+=1   

In [None]:
# H_sentenceToTensor function takes a sentence, maximum length as argument and returns a tensor of indices with padding done, if required.
def H_sentenceToTensor(sentence,max_length):
  # append start of sequence token at beginning
  src_index=[H_word2index['<sos>']] 
  for token in hindi_tokenizer(sentence):
     # if the word is not present in dictionary then index corresponding to unknown token '<unk>' i.e. 2 is used
    src_index.append(H_word2index.get(token, 2))
  # append end of sequence token
  src_index.append(H_word2index['<eos>'])
  # check if length of sentence is less than maximum length, if yes, then append <pad> token
  if(len(src_index)<max_length):
    while(len(src_index)!=max_length):
      src_index.append(H_word2index['<pad>'])
  return torch.Tensor(src_index) # returning tensor of indices with length equal to max_length

# E_sentenceToTensor function takes a sentence, maximum length as argument and returns a tensor of indices with padding done, if required.
def E_sentenceToTensor(sentence,max_length):
  # append start of sequence token at beginning
  trg_index=[E_word2index['<sos>']]
  for token in english_tokenizer(sentence):
    # if the word is not present in dictionary then index corresponding to unknown token '<unk>' i.e. 2 is used
    trg_index.append(E_word2index.get(token,2))
  # append end of sequence token
  trg_index.append(E_word2index['<eos>'])
  # check if length of sentence is less than maximum length, if yes, then append <pad> token
  if(len(trg_index)<max_length): 
    while(len(trg_index)!=max_length):
      trg_index.append(E_word2index['<pad>'])
  return torch.Tensor(trg_index) # returning tensor of indices with length equal to max_length

In [None]:
train_tensor=[] # stores tensor of indexes of training data
test_tensor=[] # stores tensor of indexes of validation/test data

# finding tensor of indexes of training data
batch_id=1
for i in range(0,len(train_ds),batch_size):
  max_len=max_length_train[batch_id]
  for pair in train_ds[i:i+batch_size]:
    source_tensor=H_sentenceToTensor(pair[0],max_len)
    target_tensor=E_sentenceToTensor(pair[1],max_len)
    train_tensor.append([source_tensor, target_tensor])
  batch_id+=1

# finding tensor of indexes of validation/test data
batch_id=1
for i in range(0,len(val_ds),batch_size):
  max_len=max_length_test[batch_id]
  for pair in val_ds[i:i+batch_size]:
    source_tensor=H_sentenceToTensor(pair[0],max_len)
    target_tensor=E_sentenceToTensor(pair[1],max_len)
    test_tensor.append([source_tensor, target_tensor])
  batch_id+=1

In [None]:
print(len(train_tensor))

81857


In [None]:
# finding train and test iterator using data loader
# shuffle=false is used so that data remains sorted in batches
train_iterator = DataLoader(train_tensor, batch_size=batch_size,shuffle=False) 
test_iterator = DataLoader(test_tensor, batch_size=batch_size,shuffle=False)

In [None]:
# function to evaluate the validation loss in each epoch
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for i, (x,y) in enumerate(iterator):
      input_sentence = x.long().to(device) 
      target_sentence = y.long().to(device)
      # input_sentence and target_sentence have shape = (batch_size, maximum length) but we need shape to be (maximum length, batch_size ) so they are transposed
      input_sentence=torch.transpose(input_sentence, 0, 1)
      target_sentence=torch.transpose(target_sentence, 0, 1)
      output = model(input_sentence, target_sentence, 0) #turn off teacher forcing
      output_dim = output.shape[2]
      # output.shape() = (target len, batch size, output dim)
      output = output[1:].reshape(-1, output_dim)
      target_sentence = target_sentence[1:].reshape(-1)
      loss = criterion(output, target_sentence)
      epoch_loss += loss.item()
      del target_sentence,output,input_sentence
  return epoch_loss / len(iterator)

In [None]:
path = "phase3.pth"

In [None]:
encoder = Encoder(encoder_input_size, embedding_size, hidden_size, layers, dropout).to(device) #passing the inputs to encoder
decoder = Decoder(decoder_input_size, embedding_size, hidden_size, output_size,layers,dropout,).to(device) #passing the inputs to decoder
model = seq2seq(encoder, decoder).to(device)
pad_index = E_word2index['<pad>'] #finding the index of token <pad> in english vocabulary
criterion = nn.CrossEntropyLoss(ignore_index = pad_index) #padding token is being ignored while loss computation because we don't want to pay price for <pad> token
optimizer = optim.AdamW(model.parameters(), lr=learning_rate) # AdamW optimizer is used 
step = 0
for epoch in range(0,epochs):
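    print(f"[Epoch {epoch} / {epochs}]")
    model.train() # evaluate() switches the model to eval mode, so training mode is re-enabled in every epoch
    epoch_loss=0.0 # reset the accumulated training loss for this epoch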
    for id, (x,y) in enumerate(train_iterator):   # iterating over batches of train_iterator
      input_sentence = x.long().to(device)
      target_sentence = y.long().to(device)

      # input_sentence and target_sentence have shape = (batch_size, maximum length) but we need shape to be (maximum length, batch_size ) so they are transposed
      input_sentence=torch.transpose(input_sentence, 0, 1)
      target_sentence=torch.transpose(target_sentence, 0, 1)
      
      output = model(input_sentence, target_sentence) #forward propagation
      output = output[1:].reshape(-1, output.shape[2]) #removing the start token from model's prediction and reshaping it to fit the input format of the loss function
      
      target_sentence = target_sentence[1:].reshape(-1) #removing the start token from actual target translation
      optimizer.zero_grad() 
      loss = criterion(output, target_sentence)
      loss.backward() #backward propagation
      torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1) # clipping the gradients to keep them in reasonable range
      optimizer.step() #gradient descent. The optimizer iterates over all parameters (tensors) to be updated and their internally stored gradients are used.
      del target_sentence,output,input_sentence
      step += 1
      epoch_loss+=loss.item()
    if(epoch%5==0): # saving the model every 5 epochs
      torch.save(model,path)
    val_loss=evaluate(model, test_iterator, criterion)
    print("Train loss : ", loss.item()) # loss of the last training batch of this epoch
    print("Validation loss : ", val_loss)

[Epoch 0 / 30]
Train loss :  7.253509521484375
Validation loss :  6.6556614227592945
[Epoch 1 / 30]
Train loss :  6.753749370574951
Validation loss :  6.2911082722246645
[Epoch 2 / 30]
Train loss :  6.3020243644714355
Validation loss :  6.189002197980881
[Epoch 3 / 30]
Train loss :  5.885638236999512
Validation loss :  6.111462603509426
[Epoch 4 / 30]
Train loss :  5.538120269775391
Validation loss :  6.056318908184767
[Epoch 5 / 30]
Train loss :  5.267170429229736
Validation loss :  6.093760447204113
[Epoch 6 / 30]
Train loss :  5.208654880523682
Validation loss :  6.030258818715811
[Epoch 7 / 30]
Train loss :  4.856709957122803
Validation loss :  6.0159385114908215
[Epoch 8 / 30]
Train loss :  4.896815299987793
Validation loss :  6.125546030700207
[Epoch 9 / 30]
Train loss :  4.800422191619873
Validation loss :  6.063683351874351
[Epoch 10 / 30]
Train loss :  4.802103519439697
Validation loss :  6.140982373803854
[Epoch 11 / 30]
Train loss :  4.603856563568115
Validation loss :  6.15
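
The criterion in the training cell is created with `ignore_index` set to the \<pad\> index, so padded positions do not contribute to the loss. The sketch below reproduces that averaging in plain Python for illustration. `masked_cross_entropy` is a hypothetical helper, and returning 0.0 when every target is ignored is a choice made for this sketch only.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean cross-entropy over the positions whose target is not ignore_index."""
    total, kept = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing
        peak = max(row)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[target]  # negative log-probability of the target
        kept += 1
    return total / kept if kept else 0.0

PAD_INDEX = 3
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
targets = [1, PAD_INDEX]  # the second position is padding
print(masked_cross_entropy(logits, targets, PAD_INDEX))
```

With uniform scores over a vocabulary of size V, the loss per token is ln V, which gives a rough reference point when reading the loss values printed during training.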

In [None]:
path = "phase3.pth" # location to save the model
torch.save(model,path) #saving the trained model at defined location
# model.train()
# model = torch.load(path) #loading the model
model.eval()

seq2seq(
  (encoder): Encoder(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(41006, 250)
    (lstm): LSTM(250, 512, num_layers=2, dropout=0.5)
  )
  (decoder): Decoder(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(28207, 250)
    (lstm): LSTM(250, 512, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=512, out_features=28207, bias=True)
  )
)
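
The printed architecture is enough to work out the number of trainable parameters by hand. The sketch below does the arithmetic with hypothetical helper functions. It assumes the standard PyTorch LSTM layout, in which each layer holds an input-to-hidden and a hidden-to-hidden weight matrix for the four gates plus two bias vectors.

```python
def lstm_params(input_size, hidden_size, layers):
    """Parameter count of a unidirectional multi-layer LSTM with biases."""
    total = 0
    for layer in range(layers):
        in_features = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * (in_features + hidden_size)  # weight_ih + weight_hh
        total += 2 * 4 * hidden_size                            # bias_ih + bias_hh
    return total

def embedding_params(vocab_size, embedding_size):
    return vocab_size * embedding_size

def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights + bias

# Sizes taken from the model summary printed above
encoder_total = embedding_params(41006, 250) + lstm_params(250, 512, 2)
decoder_total = (embedding_params(28207, 250)
                 + lstm_params(250, 512, 2)
                 + linear_params(512, 28207))
print(encoder_total, decoder_total, encoder_total + decoder_total)
```

Most of the parameters sit in the two embedding tables and the output layer, all of which grow linearly with the vocabulary size.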

Function to translate a Hindi sentence (index vector) into an English sentence (index vector)

In [None]:
def hin_to_eng_translation(model, device, hindi_num_vec, max_length=50):
    hindi_tensor = torch.LongTensor(hindi_num_vec).unsqueeze(1).to(device)
    with torch.no_grad():
        hidden, cell = model.encoder(hindi_tensor)
    eng_num_vec = [E_word2index["<sos>"]] # adding index for <sos> token
    eos_idx=E_word2index["<eos>"] # index of the <eos> token, used to stop decoding
    for _ in range(max_length):
        curr_input = torch.LongTensor([eng_num_vec[-1]]).to(device)
        with torch.no_grad():
            output, hidden, cell = model.decoder(curr_input, hidden, cell)
            curr_output = output.argmax(1).item()
        eng_num_vec.append(curr_output) # appending the prediction in english index vector
        if (curr_output == eos_idx): # stop generating predictions once eos token is encountered
            break
    return eng_num_vec
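
The function above performs greedy decoding: at every step it keeps only the highest-scoring word and feeds it back as the next input. The loop can be sketched without a trained model. Here `greedy_decode` is a hypothetical helper, and `toy_step` is a made-up stand-in for the decoder that follows a fixed transition table.

```python
def greedy_decode(step_fn, state, sos_idx, eos_idx, max_length=50):
    """Feed back the arg-max token until <eos> is produced or max_length is reached."""
    output = [sos_idx]
    for _ in range(max_length):
        scores, state = step_fn(output[-1], state)
        best = max(range(len(scores)), key=scores.__getitem__)  # arg-max over the vocabulary
        output.append(best)
        if best == eos_idx:
            break
    return output

def toy_step(prev_token, state):
    """Made-up decoder step: scores one fixed successor per token highest."""
    transitions = {0: 4, 4: 5, 5: 1}
    scores = [0.0] * 6
    scores[transitions.get(prev_token, 1)] = 1.0
    return scores, state

print(greedy_decode(toy_step, None, sos_idx=0, eos_idx=1))
print(greedy_decode(toy_step, None, sos_idx=0, eos_idx=1, max_length=2))
```

When the length limit is hit first, the result does not end with \<eos\>, so any post-processing should check for the token before removing it.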

# 7. Testing the Model 
The reference and prediction files are generated here so that the BLEU and METEOR scores can be computed with the evaluation script.

Obtaining the predicted sentences for validation set

In [None]:
file1 = open("validation_prediction.txt","w") # to store the prediction of validation set
csv_file = open("validation_ds.csv",encoding='utf-8')
rows = csv.reader(csv_file)
for row in rows:
    hindi_sentence = row[0]
    hindi_sentence_token=[]

    # tokenize hindi sentence
    if type(hindi_sentence) == str:
      for t in indic_tokenize.trivial_tokenize(hindi_sentence):
        hindi_sentence_token.append(t)
    else:
        for t in hindi_sentence:
          hindi_sentence_token.append(t)
  
    hindi_sentence_token.insert(0,'<sos>') # prepend <sos> token
    hindi_sentence_token.append('<eos>') # append <eos> token
    hindi_num_vec = []

    # generating index vector for hindi sentences
    for t in hindi_sentence_token:
      hindi_num_vec.append(H_word2index.get(t,2))

    # call hin_to_eng_translation function to generate predictions
    eng_num_vec = hin_to_eng_translation(model, device, hindi_num_vec, max_length=50)
    # eng_num_vec is vector of indices of predicted english sentences. Now, we need to find the words corresponding to these indices

    english_sentence_list=[]
    for word_idx in eng_num_vec:
      english_sentence_list.append(E_index2word.get(word_idx,unk_token)) # fall back to the <unk> token for an unknown index

    english_sentence_list.pop(0) # remove <sos> token
    if english_sentence_list and english_sentence_list[-1]==eos_token:
      english_sentence_list.pop() # remove <eos> token; it is missing when decoding stopped at max_length

    # joining the predicted words into one string (an empty prediction gives an empty line)
    english_sentence=' '.join(english_sentence_list)
    # capitalise first letter of first word of predicted english sentence
    english_sentence=english_sentence[:1].upper()+english_sentence[1:]
    file1.write(english_sentence + "\n")
  
file1.close()

Saving the reference sentences for validation set

In [None]:
csv_file = open("validation_ds.csv",encoding='utf-8')
rows = csv.reader(csv_file)
file = open("validation_english.txt","w")
for row in rows:
    file.write(row[1]+"\n")
file.close()
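
The evaluation script itself is not part of this notebook. As a rough illustration of what BLEU measures, the sketch below computes clipped n-gram precision, the basic building block of the score, for one prediction and one reference. `ngrams` and `clipped_precision` are hypothetical helpers; the brevity penalty and the combination of several n-gram orders are left out.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # each n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

print(clipped_precision("the the the cat", "the cat sat", n=1))
print(clipped_precision("the the the cat", "the cat sat", n=2))
```

Clipping is what stops a prediction that repeats one correct word from getting a high score.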

# 8. Generating Predictions

### Obtaining translation for hindistatements.csv

In [None]:
file1 = open("predicted_text_phase3.txt","w") # file to store the predictions
csv_file = open("hindistatements.csv",encoding='utf-8')
rows = csv.reader(csv_file)
i=0
for row in rows:
    if (i==0):  # don't want to translate columns names for prediction
      i+=1
      continue
    hindi_sentence = row[2]
    hindi_sentence_token=[]
    # tokenize hindi sentence
    if type(hindi_sentence) == str:
      for t in indic_tokenize.trivial_tokenize(hindi_sentence):
        hindi_sentence_token.append(t)
    else:
        for t in hindi_sentence:
          hindi_sentence_token.append(t)
  
    hindi_sentence_token.insert(0, '<sos>') # prepend <sos> token
    hindi_sentence_token.append('<eos>') # append <eos> token
    hindi_num_vec = []

    # generating index vector for hindi sentences
    for t in hindi_sentence_token:
      hindi_num_vec.append(H_word2index.get(t,2))

    # call hin_to_eng_translation function to generate predictions
    eng_num_vec = hin_to_eng_translation(model, device, hindi_num_vec, max_length=50)
    # eng_num_vec is vector of indices of predicted english sentences. Now, we need to find the words corresponding to these indices

    english_sentence_list=[] 
    for word_idx in eng_num_vec:
      english_sentence_list.append(E_index2word.get(word_idx,unk_token)) # fall back to the <unk> token for an unknown index
    
    english_sentence_list.pop(0) # remove <sos> token
    if english_sentence_list and english_sentence_list[-1]==eos_token:
      english_sentence_list.pop() # remove <eos> token; it is missing when decoding stopped at max_length

    # joining the predicted words into one string (an empty prediction gives an empty line)
    english_sentence=' '.join(english_sentence_list)
    # capitalise first letter of first word of predicted english sentence
    english_sentence=english_sentence[:1].upper()+english_sentence[1:]
    file1.write(english_sentence + "\n")
  
file1.close()

# 9. References

[1] [https://arxiv.org/abs/1409.3215](https://arxiv.org/abs/1409.3215)

[2] [https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)
