Attempting to do machine translation following the [original seq2seq paper](https://paperswithcode.com/method/seq2seq). 

I will solve this problem in two parts.

 Part 1  : Converting the sentences into sequences. This will include removing NaN values, basic pre-processing (removing punctuation, converting to lower-case), tokenization and vocabulary creation.

Part 2 : Building and training the seq2seq model, following the [paper](https://paperswithcode.com/method/seq2seq) closely(relation between input-output of encoder-decoder,number of layers in LSTM etc.)

### Part 1 : Converting to sequences

Removing Nan Values, Converting to lower case and removing punctuations.

In [1]:
import torch
import torch.nn as nn
import pandas as pd
import string
import numpy as np
import random
import time
from collections import Counter
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

df = pd.read_csv('/kaggle/input/english-hindi-machine-translation/Hindi_English_Truncated_Corpus.csv')

df = df.dropna()  # Remove NaN values

# Converting English sentences to lowercase and removing punctuations from both languages
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['english_sentence'] = df['english_sentence'].str.lower().apply(remove_punctuation)
df['hindi_sentence'] = df['hindi_sentence'].apply(remove_punctuation)
df

Unnamed: 0,source,english_sentence,hindi_sentence
0,ted,politicians do not have permission to do what ...,राजनीतिज्ञों के पास जो कार्य करना चाहिए वह करन...
1,ted,id like to tell you about one such child,मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी
2,indic2012,this percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
3,ted,what we really mean is that theyre bad at not ...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
4,indic2012,the ending portion of these vedas is called up...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।
...,...,...,...
127602,indic2012,examples of art deco construction can be found...,आर्ट डेको शैली के निर्माण मैरीन ड्राइव और ओवल ...
127603,ted,and put it in our cheeks,और अपने गालों में डाल लेते हैं।
127604,tides,as for the other derivatives of sulphur the c...,जहां तक गंधक के अन्य उत्पादों का प्रश्न है दे...
127605,tides,its complicated functioning is defined thus in...,Zरचनाप्रकिया को उसने एक पहेली में यों बांधा है


Splitting the data into train, validation and test. Taking a small sample of the data because of limited compute resources.

In [2]:
# Split the data into train, validation, and test sets
train_df, val_test_df = train_test_split(df, test_size=0.9, random_state=42)
val_df, test_df = train_test_split(val_test_df, test_size=0.9, random_state=42)
train_df.count(),val_df.count(),test_df.count()

(source              12760
 english_sentence    12760
 hindi_sentence      12760
 dtype: int64,
 source              11484
 english_sentence    11484
 hindi_sentence      11484
 dtype: int64,
 source              103361
 english_sentence    103361
 hindi_sentence      103361
 dtype: int64)

Tokenizing the sentences. Source sentences are tokenized in reverse as it was one of the key source of improvement in the paper.

In [3]:
# Define tokens
START_TOKEN = 'SOS'
END_TOKEN = 'EOS'
OUT_OF_VOCAB_TOKEN = 'OOV'

def tokenize(df):
    df['english_sentence'] = df['english_sentence'].apply(lambda x: [END_TOKEN] + x.split()[::-1] + [START_TOKEN])
    df['hindi_sentence'] = df['hindi_sentence'].apply(lambda x: [START_TOKEN] + x.split() + [END_TOKEN])
    print(df.head(4))

# Tokenize the sentences and add EOS and SOS tokens
#train_df['english_sentence'] = train_df['english_sentence'].apply(lambda x: [END_TOKEN] + x.split()[::-1] + [START_TOKEN])
#train_df['hindi_sentence'] = train_df['hindi_sentence'].apply(lambda x: [START_TOKEN] + x.split() + [END_TOKEN])

tokenize(train_df)
tokenize(val_df)
tokenize(test_df)

           source                                   english_sentence  \
9425        tides  [EOS, incentives, other, and, currency, devalu...   
93313         ted  [EOS, that, and, complete, are, triumphs, our,...   
110779  indic2012  [EOS, malaria, cause, which, creatures, change...   
106155  indic2012  [EOS, festival, film, international, toronto, ...   

                                           hindi_sentence  
9425    [SOS, प्राकृतिक, रूप, से, कम, कीमत, का, लाभ, ज...  
93313   [SOS, अब, हमारी, विजय, पूरी, हो, चुकी, है, और,...  
110779  [SOS, उष्णकटिबंधीय, बीमारियां, tropical, disea...  
106155  [SOS, उनकी, पहली, अंग्रेजी, भाषा, की, फिल्म, र...  
          source                                   english_sentence  \
59399      tides  [EOS, afghanistan, in, loyalties, changing, of...   
17253      tides  [EOS, beginning, the, from, state, hidden, a, ...   
99756        ted  [EOS, directions, different, into, growing, ar...   
28652  indic2012  [EOS, sanadyataptilok, prasad, poetry

#### Creating vocabulary(First counter then vocabulary) using only the training set since it prevents information leakage.

Creating the frequency counter , words having frequency =1 will not be included in the vocabulary. 

In [4]:
# Define minimum word frequency for it to be included in vocabulary
MIN_WORD_FREQ = 2

# Count the frequency of each word in both languages
english_vocab_counter = Counter(word for sentence in train_df['english_sentence'] for word in sentence)
hindi_vocab_counter = Counter(word for sentence in train_df['hindi_sentence'] for word in sentence)

english_vocab_counter.most_common(10) ,hindi_vocab_counter.most_common(10)

([('the', 12836),
  ('EOS', 12760),
  ('SOS', 12760),
  ('of', 7374),
  ('and', 5896),
  ('to', 4807),
  ('in', 4772),
  ('a', 3660),
  ('is', 3040),
  ('that', 1760)],
 [('SOS', 12760),
  ('EOS', 12760),
  ('के', 8628),
  ('में', 6486),
  ('है', 5534),
  ('की', 4877),
  ('और', 4745),
  ('से', 3881),
  ('का', 3364),
  ('को', 3158)])

Creating the dictionary using words only from the training set. Creating the vocabulary and adding the 'OOV' token. Doing word: i+1 to keep 0 for padding

In [5]:
# Create vocabulary by including words that have a frequency of more than MIN_WORD_FREQ
#english_vocab = {word: i+1 for i, (word, freq) in enumerate(english_vocab_counter.items()) if freq >= MIN_WORD_FREQ}
#hindi_vocab = {word: i+1 for i, (word, freq) in enumerate(hindi_vocab_counter.items()) if freq >= MIN_WORD_FREQ}
# OOV token will be displayed when we encounter a word not in the vocabulary
i=1
english_vocab={}
english_index_to_word={}
for _, (word, freq) in enumerate(english_vocab_counter.items()):
    if freq >= MIN_WORD_FREQ and english_vocab.__contains__(word)==False:
        english_vocab[word]=i
        english_index_to_word[i]=word
        i+=1

i=1
hindi_vocab={}
hindi_index_to_word={}
for _, (word, freq) in enumerate(hindi_vocab_counter.items()):
    if freq >= MIN_WORD_FREQ and hindi_vocab.__contains__(word)==False:
        hindi_vocab[word]=i
        hindi_index_to_word[i]=word
        i+=1
english_vocab.update({OUT_OF_VOCAB_TOKEN: len(english_vocab)})
hindi_vocab.update({OUT_OF_VOCAB_TOKEN: len(hindi_vocab)})
len(english_vocab),len(hindi_vocab),len(hindi_index_to_word),len(english_index_to_word)

(9858, 11336, 11335, 9857)

Finally, converting sentences to sequences.

In [6]:
# Convert the words in the sentences to their corresponding index in the vocabulary
def create_sequences(df):
    df['english_sentence'] = df['english_sentence'].apply(lambda sentence: [english_vocab.get(word, english_vocab[OUT_OF_VOCAB_TOKEN]) for word in sentence])
    df['hindi_sentence'] = df['hindi_sentence'].apply(lambda sentence: [hindi_vocab.get(word, hindi_vocab[OUT_OF_VOCAB_TOKEN]) for word in sentence])
    print(df['hindi_sentence'][:5],df['english_sentence'][:5])

create_sequences(train_df)
create_sequences(val_df)
create_sequences(test_df)

9425      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
93313           [1, 26, 27, 28, 29, 22, 30, 12, 16, 31, 25]
110779    [1, 32, 33, 34, 35, 36, 37, 38, 39, 40, 38, 41...
106155    [1, 55, 56, 57, 58, 59, 60, 11335, 61, 11335, ...
68713     [1, 78, 79, 80, 12, 81, 82, 11335, 11335, 7, 8...
Name: hindi_sentence, dtype: object 9425      [1, 2, 3, 4, 5, 9857, 6, 7, 8, 9, 10, 11, 12, ...
93313                    [1, 18, 4, 19, 20, 21, 22, 18, 17]
110779    [1, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33...
106155    [1, 39, 40, 41, 42, 43, 16, 44, 9857, 45, 44, ...
68713     [1, 56, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65...
Name: english_sentence, dtype: object
59399     [1, 223, 11335, 42, 816, 5606, 5487, 714, 265,...
17253     [1, 510, 2176, 4291, 1793, 4, 3769, 74, 11335,...
99756                 [1, 279, 2341, 42, 4881, 523, 24, 25]
28652     [1, 11335, 59, 396, 7453, 11335, 4832, 11335, ...
113685    [1, 570, 105, 2770, 977, 11335, 999, 19, 570, ...
Name: hindi_sentence, dtyp

#### Creating the dataloaders.


In [7]:
def collate_fn(batch):
    english_sequences, hindi_sequences = zip(*batch)
    english_sequences = [torch.tensor(seq) for seq in english_sequences]
    hindi_sequences = [torch.tensor(seq) for seq in hindi_sequences]
    
    # Pad sequences
    english_sequences = pad_sequence(english_sequences, batch_first=True, padding_value=0)
    hindi_sequences = pad_sequence(hindi_sequences, batch_first=True, padding_value=0)
    
    return english_sequences, hindi_sequences


# Define a PyTorch Dataset
class TranslationDataset(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return torch.tensor(self.df.iloc[idx]['english_sentence']), torch.tensor(self.df.iloc[idx]['hindi_sentence'])

# Define a function to create data loaders
def create_data_loaders(train_df, val_df, test_df, batch_size=8):
    train_loader = DataLoader(TranslationDataset(train_df), batch_size=batch_size,collate_fn=collate_fn, shuffle=True)
    val_loader = DataLoader(TranslationDataset(val_df),  collate_fn=collate_fn, batch_size=batch_size)
    test_loader = DataLoader(TranslationDataset(test_df),  collate_fn=collate_fn, batch_size=batch_size)
    return train_loader, val_loader, test_loader

train_loader, val_loader, test_loader = create_data_loaders(train_df, val_df, test_df)


#### Visualizing the data in the dataloaders

Reduce the batch_size for visualization.

In [8]:
dataiter = iter(train_loader)
data = next(dataiter)
print(len(data[0]),len(data[1])) # = batch_size
src , trg = data
for i in range(len(src)):
        print(src[i])
        print(trg[i])
print(src.shape) # (batch_size,length_of_sequences)   

8 8
tensor([   1, 1480,   16,  179, 9857,    4, 2506,   63,    4,  732, 2657,   16,
          63, 9857, 3425,  230, 9857,  529,   63, 4102, 1186,   30,  593,  953,
         200, 1356, 3996, 6637,    4, 1340,   44, 3429, 6638,   17])
tensor([    1,  4681,   652,  7009,    48,  1636,    16, 11335,  4436,  2107,
           42,  4569,   302,  1412, 11335,    42,   583, 11335,   414,  7452,
           59,  7453,  2500,    42,  5260, 11335,    16,  2765,   414,  5744,
         1787,   742,    24,    25,     0,     0,     0,     0,     0])
tensor([   1, 1666,   16,    7,  521, 3611,   16,  582, 2038,   57, 2332, 1005,
         179, 1614,  290,   57, 3560,   74, 9857,   16,    4, 9857,   16,   17,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0])
tensor([    1,  4875,    82,  4028,    74,  8618,   302,    82,   253, 11335,
           16, 11335,   792,   478,   269,    25,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0

  english_sequences = [torch.tensor(seq) for seq in english_sequences]
  hindi_sequences = [torch.tensor(seq) for seq in hindi_sequences]


### This marks the end of the data preparation now we will define the seq2seq model and train it.

#### First we will define the encoder, then the decoder and then combine them to define the seq2seq model. After this we will train the model and do validation.

#### Encoder 




#### Visualizing the embeddings. 
Reducing the embedding_dim for decent visualization. Can also use embed_example.**weight.data** to see the weight matrix of embeddings. 

Every word has a vector of size **embedding_dim** associated with it.

In [9]:
input_dim = len(english_vocab)
embedding_dim=16
embed_example = nn.Embedding(input_dim, embedding_dim)
embed_example
embeddings = embed_example.weight.data
embeddings[:4]

tensor([[-0.2724,  0.2142,  0.3807,  0.4334,  0.3872, -1.1835, -1.1071, -0.1841,
         -0.4885, -1.2278,  0.6713, -1.2627, -2.3140, -1.3045,  1.2337, -0.4339],
        [ 0.7787, -0.3580,  0.8331, -1.0472, -1.7335, -1.0104,  0.0711,  1.5074,
          0.5364,  1.4165, -2.0296, -1.0663, -0.8677,  1.3982,  0.4945,  1.8224],
        [-2.1643,  0.0447,  0.4287,  0.1487,  1.5372,  0.6529,  0.4082,  0.9085,
         -1.7071, -1.5394, -0.3037, -0.7986,  0.0235, -0.4832, -1.1046,  0.5799],
        [ 0.7884,  0.4522, -2.5227, -0.7096,  0.4728,  1.5256, -0.3603,  1.3527,
         -0.2150, -1.1938, -0.4472,  0.4304,  2.4004,  3.1311,  0.7262,  0.8061]])

**Visualizing how dropout functions.**

In [10]:
input_dim = len(english_vocab)
embedding = nn.Embedding(input_dim, embedding_dim)
embedded=embedding(src)
print(embedded,embedded.shape)
dropout = nn.Dropout(0.5)
embedding = nn.Embedding(input_dim, embedding_dim)
embedded=dropout(embedding(src))
print(embedded,embedded.shape)

tensor([[[ 0.9847, -0.0953, -1.3843,  ...,  1.9782,  0.2396, -1.0359],
         [-1.3195,  0.3484, -0.0924,  ..., -0.7197, -1.6797, -2.5680],
         [ 0.1810, -0.8129,  1.7524,  ...,  0.4445, -0.0044, -2.8132],
         ...,
         [ 1.0120, -0.1411, -0.4691,  ...,  1.6046, -0.4548, -1.1192],
         [-0.1552,  0.4610,  1.7449,  ..., -0.3100,  0.5175,  0.2599],
         [ 0.1272, -0.1848,  1.2650,  ...,  0.8262, -1.0685,  0.1285]],

        [[ 0.9847, -0.0953, -1.3843,  ...,  1.9782,  0.2396, -1.0359],
         [-0.4735,  1.8389, -0.0459,  ..., -0.8488, -1.7114, -1.8881],
         [ 0.1810, -0.8129,  1.7524,  ...,  0.4445, -0.0044, -2.8132],
         ...,
         [-1.5033,  0.1097, -0.0202,  ..., -1.4170, -0.1458, -1.1922],
         [-1.5033,  0.1097, -0.0202,  ..., -1.4170, -0.1458, -1.1922],
         [-1.5033,  0.1097, -0.0202,  ..., -1.4170, -0.1458, -1.1922]],

        [[ 0.9847, -0.0953, -1.3843,  ...,  1.9782,  0.2396, -1.0359],
         [ 1.3387, -0.0381,  1.3680,  ...,  1

**Visualizing the LSTM with num_layers = 3.** 
* (hidden , cell) will have the values from all the layers stacked one over another. 
* The shape of hidden and cell will be (num_layers,batch_size,hidden_dim)
* The input shape required by LSTM if batch_first=True is (batch_size,sequence_length,input_length).
* In our case the batch_size = **8** , sequence_length is the **length of integer sequences**, input_length is = **embedding_dim**.

In [11]:
lstm = nn.LSTM(16, 8, num_layers = 3, dropout = 0.5,batch_first=True)
outputs, (enc_hidden, enc_cell) = lstm(embedded)
print(enc_hidden,enc_hidden.shape)
print(enc_cell,enc_cell.shape)

tensor([[[ 3.8312e-01,  2.0265e-02, -5.4521e-01,  1.9433e-01, -6.1937e-02,
           2.2705e-01,  1.0424e-01,  9.7161e-03],
         [ 3.8842e-01, -7.1287e-01,  5.7275e-01,  1.8696e-01, -3.7756e-02,
           3.1588e-01,  3.2942e-01, -1.3284e-01],
         [ 3.2524e-01, -4.3411e-01,  5.2736e-01, -2.8123e-01, -8.7427e-02,
           2.9250e-01,  1.4738e-01,  9.0907e-02],
         [ 9.3383e-02, -3.5597e-01,  3.5542e-01,  2.4250e-01, -1.9000e-03,
           9.3249e-04,  2.0033e-02,  7.9614e-02],
         [ 1.4495e-01, -4.5507e-01,  2.6401e-01,  2.2546e-01, -2.3199e-02,
           3.9713e-02, -2.8893e-01, -9.0405e-02],
         [ 1.4432e-01, -4.0595e-01,  4.0517e-01,  1.2881e-01, -3.2069e-02,
          -3.9454e-02, -3.0835e-01, -1.3270e-01],
         [ 3.1854e-01, -5.2704e-01,  5.6235e-01,  2.9652e-01,  6.3688e-02,
           9.9210e-02, -1.3904e-01, -2.2951e-01],
         [ 2.5299e-01, -5.7169e-01,  5.2099e-01,  2.3155e-02,  6.7252e-03,
           4.2182e-01,  5.3297e-02, -1.0422e-01]],

**Formally defining the Encoder**

In [12]:
import torch.optim as optim
from torchtext.data.metrics import bleu_score

# Encoder class
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        # For every word(or an integer in the input sequence) it creates a vector of size embedding_dim 
        self.embedding = nn.Embedding(input_dim, embedding_dim) 
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout = dropout,batch_first=True)
        self.dropout = nn.Dropout(dropout)
    def forward(self, src):
        #Shape of src is (batch_size,length of padded sequence)
        embedded = self.dropout(self.embedding(src))
        #Shape of embedded is (batch_size,length of one padded sequence,embedding_dim)
        outputs, (hidden, cell) = self.lstm(embedded)
        # Shape of hidden and cell both is (n_layers, batch_size, hidden_dim)
        return hidden, cell

### Decoder

* Expected input shape by the **DECODER**(Not the lstm in the decoder) = (batch_size)
* The input to the decoder is converted to shape (4,1) by unsqueeze operation. 
* The Embedding and dropout is similar to the Encoder layer. See Encoder for their visualization, expected shape etc.
* Input shape to lstm is = (batch_size,1,embedding_dim)
* The shape of hidden and cell is = (num_layers,batch_size,hidden_dim). Basically they contain the hidden and cell state at the final time step(here time step = 1) of each layer stacked on top of each other.
* Output only contains the values from the topmost layer and it's shape is (batch_size,1,hidden_dim)
* The output is passed through a fully connected layer.

In [13]:
# Decoder class
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout = dropout, batch_first=True )
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        input = input.unsqueeze(1)
        embedded = self.dropout(self.embedding(input))
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden, cell


#### Trying out the decoder and checking the output shapes

In [14]:
output_dim=len(hindi_vocab)
input=trg[:,1]
decoder=Decoder(output_dim,16,8,3,0.5)
output, hidden, cell = decoder(input, enc_hidden, enc_cell)
print(output.shape,hidden.shape,cell.shape)

torch.Size([8, 11336]) torch.Size([3, 8, 8]) torch.Size([3, 8, 8])


#### The Seq2Seq Class
* This basically takes the input in the training loop. Feeds to the encoder, gets the output, then feeds the context vector(final hidden and cell state of encoder for each layer) to the decoder and generates output by calling the decoder one word at a time, clubs it together and returns the output in the desired shape.
* Our decoder generates the output one word after another.
* We initially feed the decoder with the SOS token.
* Teacher_Forcing_Ratio : The next input to the decoder could be a word from the target sentence or the prediction generated in the previous step by the decoder. It is decided by teacher_forcing_ratio. For example if teacher forcing ratio is 0.6 or 60% then for 60% of the time input will be the original word.
* The output shape is: (length of a target sequence, batch_size, output_dim)
* Also setting the device for orders of magnitude faster performance.

In [15]:
# Seq2Seq class
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, source, target, teacher_forcing_ratio=0.5):
        batch_size = target.shape[0]
        trg_len = target.shape[1]
        trg_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        hidden, cell = self.encoder(source)
        input = target[:,0]
        
        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1) 
            input = target[:,t] if teacher_force else top1
            
        return outputs

#### Training the model and calculating validation

* Defining the hyperparameters and setting the device.
* Initializing the encoder,decoder and model. Initializing the loss function and optimizer.
* Shape of output and target before modification = (len(target), batch_size, output_dim) and (batch_size,len(target)) respectively.
* Shape of output and target after modification = ((len(target)-1)\*batch_size, output_dim) and (batch_size\*(len(target)-1)) respectively. This is consistent with the shapes expected by the optimizer.

This function initializes all the weights of the model from a uniform distribution of (-0.08, 0.08), same as mentioned in the paper.

In [16]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

In [17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyperparameters
input_dim = len(english_vocab)
output_dim=len(hindi_vocab)
enc_embedding_dim = dec_embedding_dim = 32
hidden_dim = 32
n_layers = 2
dropout = 0.5
num_epochs=10

# Initialize encoder, decoder and seq2seq model
encoder = Encoder(input_dim, enc_embedding_dim, hidden_dim, n_layers, dropout)
decoder = Decoder(output_dim, dec_embedding_dim, hidden_dim, n_layers, dropout)
model = Seq2Seq(encoder, decoder, device).to(device)
model.apply(init_weights)
# Loss function
criterion = nn.CrossEntropyLoss()
lr=0.01
# Optimizer
optimizer = optim.Adam(model.parameters(),lr=lr)
least_loss=float('inf')
# Training loop
for epoch in range(num_epochs):
    model.train()
    train_loss=0
    n=len(train_loader)
    s=time.time()
    #print(f"Training for Epoch={epoch} starting....")
    for i, (src, trg) in enumerate(train_loader):
        src, trg = src.to(device), trg.to(device)
        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[:,1:].reshape(-1)
        loss = criterion(output, trg)
        loss.backward()
        optimizer.step()
        train_loss+=loss.item()
        #print(f"Training batch {i} out of {n} time elapsed = {time.time()-s} seconds")

    print(f"Training loss after Epoch = {epoch} is:" ,train_loss/len(train_loader)) 
        
    #print(f"Validation for Epoch={epoch} starting....")
    s=time.time()
    # Validation
    val_loss=0
    n=len(val_loader)
    model.eval()
    with torch.no_grad():
        for i, (src, trg) in enumerate(val_loader):
            src, trg = src.to(device), trg.to(device)
            output = model(src, trg,0)
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[:,1:].reshape(-1)
            loss = criterion(output, trg)
            val_loss+=loss.item()     
            #print(f"Validation batch {i} out of {n} time elapsed = {time.time()-s} seconds")
        if val_loss<least_loss:
            least_loss=val_loss
            best_model=model
    print(f"validation loss after Epoch = {epoch} is:" ,val_loss/len(val_loader))

  english_sequences = [torch.tensor(seq) for seq in english_sequences]
  hindi_sequences = [torch.tensor(seq) for seq in hindi_sequences]


Training loss after Epoch = 0 is: 3.924301615628329
validation loss after Epoch = 0 is: 3.6940734579868635
Training loss after Epoch = 1 is: 3.806098015928717
validation loss after Epoch = 1 is: 3.668039105479764
Training loss after Epoch = 2 is: 3.7832354318385586
validation loss after Epoch = 2 is: 3.657054498003054
Training loss after Epoch = 3 is: 3.7843941264017995
validation loss after Epoch = 3 is: 3.643420433018533
Training loss after Epoch = 4 is: 3.7616198863355343
validation loss after Epoch = 4 is: 3.6382779989567973
Training loss after Epoch = 5 is: 3.7324353487887727
validation loss after Epoch = 5 is: 3.6374386232709486
Training loss after Epoch = 6 is: 3.7517867027778986
validation loss after Epoch = 6 is: 3.6417413399578136
Training loss after Epoch = 7 is: 3.744742383852274
validation loss after Epoch = 7 is: 3.6342693445244207
Training loss after Epoch = 8 is: 3.741188090273579
validation loss after Epoch = 8 is: 3.6369617019357126
Training loss after Epoch = 9 is: 3

In [18]:
torch.save(best_model.state_dict(), 'model.pt')