This notebook attempts English-to-Hindi machine translation following the [original seq2seq paper](https://paperswithcode.com/method/seq2seq).

I will solve this problem in two parts.

Part 1 : Converting the sentences into sequences. This includes removing NaN values, basic pre-processing (removing punctuation, converting to lower case), tokenization and vocabulary creation.

Part 2 : Building and training the seq2seq model, following the [paper](https://paperswithcode.com/method/seq2seq) closely (relation between the input and output of the encoder and decoder, number of LSTM layers, etc.).

### Part 1 : Converting to sequences

Removing NaN values, converting to lower case and removing punctuation.

In [1]:
import torch
import pandas as pd
import string
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

df = pd.read_csv('/kaggle/input/english-hindi-machine-translation/Hindi_English_Truncated_Corpus.csv')

df = df.dropna()  # Remove NaN values

# Lowercase the English sentences and strip ASCII punctuation (string.punctuation) from both languages
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['english_sentence'] = df['english_sentence'].str.lower().apply(remove_punctuation)
df['hindi_sentence'] = df['hindi_sentence'].apply(remove_punctuation)
df

Unnamed: 0,source,english_sentence,hindi_sentence
0,ted,politicians do not have permission to do what ...,राजनीतिज्ञों के पास जो कार्य करना चाहिए वह करन...
1,ted,id like to tell you about one such child,मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी
2,indic2012,this percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
3,ted,what we really mean is that theyre bad at not ...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
4,indic2012,the ending portion of these vedas is called up...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।
...,...,...,...
127602,indic2012,examples of art deco construction can be found...,आर्ट डेको शैली के निर्माण मैरीन ड्राइव और ओवल ...
127603,ted,and put it in our cheeks,और अपने गालों में डाल लेते हैं।
127604,tides,as for the other derivatives of sulphur the c...,जहां तक गंधक के अन्य उत्पादों का प्रश्न है दे...
127605,tides,its complicated functioning is defined thus in...,Zरचनाप्रकिया को उसने एक पहेली में यों बांधा है
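
`string.punctuation` contains only ASCII characters, which is why the Hindi danda (।) is still present in rows 2, 4 and 127603 of the output above. If that matters for the vocabulary, one possible variant is sketched below. The helper name `strip_all_punctuation` is hypothetical and not part of the original notebook; it additionally drops every character whose Unicode category starts with `P`, while leaving Devanagari vowel signs (category `M`) untouched.

```python
import string
import unicodedata


def strip_all_punctuation(text):
    """Drop ASCII punctuation plus any character in a Unicode punctuation category."""
    return "".join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )


# The danda (U+0964, category Po) is removed, the vowel signs are kept.
cleaned_hindi = strip_all_punctuation("अंतिम भाग उपनिषद कहलाता है।")
cleaned_english = strip_all_punctuation("it's complicated!")
```

It could be applied with `df['hindi_sentence'].apply(strip_all_punctuation)` in place of `remove_punctuation`, at the cost of changing the token counts reported later in this notebook.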


Before tokenizing or creating vocabularies, split the data into train, validation and test sets. Building the vocabulary from the training set alone prevents "information leakage" from the validation and test sets.

In [2]:
# Split the data into train, validation, and test sets
train_df, val_test_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=42)


In [3]:
train_df.count(),val_df.count(),test_df.count()

(source              102084
 english_sentence    102084
 hindi_sentence      102084
 dtype: int64,
 source              12760
 english_sentence    12760
 hindi_sentence      12760
 dtype: int64,
 source              12761
 english_sentence    12761
 hindi_sentence      12761
 dtype: int64)
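
The two calls to `train_test_split` give an 80/10/10 split. The arithmetic can be checked with a small sketch, assuming (as I understand scikit-learn's behaviour for a fractional `test_size`) that the held-out part is rounded up and the remainder goes to the other part. `split_sizes` is a hypothetical helper written for this illustration only.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(test_fraction * n_rows)
    return n_rows - n_held_out, n_held_out


# Total number of rows after dropna(), taken from the counts printed above
n_total = 102084 + 12760 + 12761

n_train, n_val_test = split_sizes(n_total, 0.2)   # first split: 80% / 20%
n_val, n_test = split_sizes(n_val_test, 0.5)      # second split: 10% / 10%
print(n_train, n_val, n_test)
```

Under that assumption the sketch reproduces the three counts shown above, including the odd row that ends up in the test set.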

Tokenizing the sentences. The source (English) sentences are tokenized in reverse word order, because the paper reports that reversing the source was one of its key sources of improvement.

In [4]:
# Define tokens
START_TOKEN = 'SOS'
END_TOKEN = 'EOS'
OUT_OF_VOCAB_TOKEN = 'OOV'


# Tokenize the sentences and add EOS and SOS tokens
train_df['english_sentence'] = train_df['english_sentence'].apply(lambda x: [START_TOKEN] + x.split()[::-1] + [END_TOKEN])
train_df['hindi_sentence'] = train_df['hindi_sentence'].apply(lambda x: [START_TOKEN] + x.split() + [END_TOKEN])

train_df.head(10)

Unnamed: 0,source,english_sentence,hindi_sentence
82661,tides,"[SOS, unions, trade, strong, of, up, building,...","[SOS, इसलिए, मजदूर, वर्ग, की, पहली, जरूरत, यह,..."
121426,indic2012,"[SOS, 1830, of, decade, the, during, marble, i...","[SOS, इस, तथ्य, के, भी, कोई, साक्ष्य, नहीं, है..."
30572,indic2012,"[SOS, pradesh, uttar, in, districts, 70, are, ...","[SOS, उत्तर, प्रदेश, में, ७०, जिले, हैं, EOS]"
25371,ted,"[SOS, schoolhouse, the, to, way, the, on, scho...","[SOS, या, तो, स्कूल, में, या, स्कूल, आतेजाते, ..."
56266,indic2012,"[SOS, road, northsouth, long, the, crosses, it...","[SOS, इसके, बाद, एक, बडा़, खुला, स्थान, है, जह..."
19147,tides,"[SOS, control, to, difficult, more, seem, pku,...","[SOS, फ्खू, जैस, उत्परिवर्तन, पर, नियंत्रण, पा..."
26135,ted,"[SOS, dollars, 5000, around, costing, was, tim...","[SOS, उस, समय, 5000, डालर, की, थी, EOS]"
77079,tides,"[SOS, them, harmonise, to, attempt, an, make, ...","[SOS, दार्शनिक, और, इतिहास, उनमें, से, एक, या,..."
39591,tides,"[SOS, 4344, pages, see, order, on, goods, of, ...","[SOS, डिपाजऋटि, के, बारे, में, जऋऊण्श्छ्ष्यादा..."
14882,tides,"[SOS, it, of, amount, some, needs, badly, subc...","[SOS, यह, सच, है, कि, वाजपेयी, ने, एक, पक्के, ..."
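
The two lambdas above can also be written as named functions, which makes the reversal easier to see and to reuse on other splits. This is only a restatement under the same token names; `tokenize_source` and `tokenize_target` are hypothetical names, not part of the original notebook.

```python
START_TOKEN = "SOS"
END_TOKEN = "EOS"


def tokenize_source(sentence):
    """Whitespace-tokenize and reverse the word order, keeping SOS first and EOS last."""
    return [START_TOKEN] + sentence.split()[::-1] + [END_TOKEN]


def tokenize_target(sentence):
    """Whitespace-tokenize in the original order."""
    return [START_TOKEN] + sentence.split() + [END_TOKEN]


print(tokenize_source("and put it in our cheeks"))
# ['SOS', 'cheeks', 'our', 'in', 'it', 'put', 'and', 'EOS']
```

Only the words are reversed; the SOS and EOS markers stay at the start and end of the source sequence, exactly as in the cell above.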


Creating the frequency counter. Words that occur only once will not be included in the vocabulary.

In [5]:
# Define minimum word frequency for it to be included in vocabulary
MIN_WORD_FREQ = 2

# Count the frequency of each word in both languages
english_vocab_counter = Counter(word for sentence in train_df['english_sentence'] for word in sentence)
hindi_vocab_counter = Counter(word for sentence in train_df['hindi_sentence'] for word in sentence)

english_vocab_counter.most_common(10) ,hindi_vocab_counter.most_common(10)

([('the', 103848),
  ('SOS', 102084),
  ('EOS', 102084),
  ('of', 59641),
  ('and', 47380),
  ('to', 38179),
  ('in', 37816),
  ('a', 29128),
  ('is', 23864),
  ('that', 14843)],
 [('SOS', 102084),
  ('EOS', 102084),
  ('के', 70335),
  ('में', 51408),
  ('है', 45851),
  ('की', 39470),
  ('और', 38008),
  ('से', 30831),
  ('का', 26611),
  ('को', 25161)])

Creating the vocabulary and adding the 'OOV' token.

In [6]:
# Create the vocabulary from words that occur at least MIN_WORD_FREQ times.
# NOTE: enumerate() runs over the unfiltered counter, so the kept indices are not
# contiguous and the largest index can be greater than len(vocab).
english_vocab = {word: i for i, (word, freq) in enumerate(english_vocab_counter.items()) if freq >= MIN_WORD_FREQ}
hindi_vocab = {word: i for i, (word, freq) in enumerate(hindi_vocab_counter.items()) if freq >= MIN_WORD_FREQ}
# The OOV index stands in for any word that is not in the vocabulary.
# NOTE: because of the gaps above, len(vocab) may coincide with an index that is already in use.
english_vocab.update({OUT_OF_VOCAB_TOKEN: len(english_vocab)})
hindi_vocab.update({OUT_OF_VOCAB_TOKEN: len(hindi_vocab)})
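
Because `enumerate` runs before the frequency filter in the cell above, the surviving indices have gaps. A possible alternative that yields contiguous indices is sketched below. The function `build_vocab` and the extra `PAD` token are assumptions made for this illustration, and switching to it would change every index printed in this notebook.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=2, specials=("PAD", "OOV")):
    """Map each sufficiently frequent token to a contiguous index 0..len(vocab)-1."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {tok: i for i, tok in enumerate(specials)}  # special tokens come first
    for tok, freq in counter.items():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)  # next free index, so no gaps appear
    return vocab


toy_sentences = [["SOS", "a", "b", "EOS"], ["SOS", "a", "c", "EOS"]]
toy_vocab = build_vocab(toy_sentences)
print(toy_vocab)
# {'PAD': 0, 'OOV': 1, 'SOS': 2, 'a': 3, 'EOS': 4}
```

With contiguous indices, `len(vocab)` is a safe size for an embedding table and the OOV index cannot collide with a real word.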


Finally, converting sentences to sequences.

In [7]:
# Convert the words in the sentences to their corresponding index in the vocabulary
train_df['english_sentence'] = train_df['english_sentence'].apply(lambda sentence: [english_vocab.get(word, english_vocab[OUT_OF_VOCAB_TOKEN]) for word in sentence])
train_df['hindi_sentence'] = train_df['hindi_sentence'].apply(lambda sentence: [hindi_vocab.get(word, hindi_vocab[OUT_OF_VOCAB_TOKEN]) for word in sentence])
train_df['hindi_sentence'][:20]

82661     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 41227, ...
121426    [0, 28, 29, 30, 31, 32, 33, 34, 16, 9, 35, 36,...
30572                       [0, 55, 56, 24, 57, 58, 16, 27]
25371                   [0, 59, 60, 61, 24, 59, 61, 62, 27]
56266     [0, 63, 64, 10, 65, 66, 67, 8, 68, 7, 69, 70, ...
19147            [0, 74, 75, 76, 77, 78, 79, 80, 81, 8, 27]
26135                        [0, 82, 83, 84, 85, 4, 86, 27]
77079     [0, 87, 88, 89, 90, 91, 10, 59, 92, 30, 93, 94...
39591     [0, 41227, 30, 104, 24, 105, 106, 30, 107, 108...
14882     [0, 7, 119, 8, 9, 120, 41, 10, 121, 122, 4, 12...
44539     [0, 131, 30, 132, 4, 133, 134, 24, 135, 41227,...
77431                           [0, 169, 170, 171, 172, 27]
84191                       [0, 173, 174, 34, 175, 176, 27]
71777                                [0, 177, 178, 179, 27]
92433     [0, 180, 181, 127, 182, 183, 8, 184, 185, 186,...
77639     [0, 193, 194, 64, 195, 196, 197, 41, 198, 77, ...
28978                                   

In [8]:
train_df['hindi_sentence'][:20],train_df['english_sentence'][:20]

(82661     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 41227, ...
 121426    [0, 28, 29, 30, 31, 32, 33, 34, 16, 9, 35, 36,...
 30572                       [0, 55, 56, 24, 57, 58, 16, 27]
 25371                   [0, 59, 60, 61, 24, 59, 61, 62, 27]
 56266     [0, 63, 64, 10, 65, 66, 67, 8, 68, 7, 69, 70, ...
 19147            [0, 74, 75, 76, 77, 78, 79, 80, 81, 8, 27]
 26135                        [0, 82, 83, 84, 85, 4, 86, 27]
 77079     [0, 87, 88, 89, 90, 91, 10, 59, 92, 30, 93, 94...
 39591     [0, 41227, 30, 104, 24, 105, 106, 30, 107, 108...
 14882     [0, 7, 119, 8, 9, 120, 41, 10, 121, 122, 4, 12...
 44539     [0, 131, 30, 132, 4, 133, 134, 24, 135, 41227,...
 77431                           [0, 169, 170, 171, 172, 27]
 84191                       [0, 173, 174, 34, 175, 176, 27]
 71777                                [0, 177, 178, 179, 27]
 92433     [0, 180, 181, 127, 182, 183, 8, 184, 185, 186,...
 77639     [0, 193, 194, 64, 195, 196, 197, 41, 198, 77, ...
 28978                  
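
Only `train_df` has been tokenized and converted so far, while the validation and test frames still hold plain strings. They need the same two steps, using the vocabulary built from the training set, before a DataLoader can turn them into tensors. A minimal sketch of the lookup step follows; `encode` is a hypothetical helper and the toy vocabulary is made up for the example.

```python
def encode(tokens, vocab, oov_token="OOV"):
    """Replace each token by its index, falling back to the OOV index for unseen words."""
    oov_index = vocab[oov_token]
    return [vocab.get(tok, oov_index) for tok in tokens]


# Toy vocabulary standing in for english_vocab / hindi_vocab
toy_vocab = {"SOS": 0, "EOS": 1, "hello": 2, "OOV": 3}

print(encode(["SOS", "hello", "unseen", "EOS"], toy_vocab))
# [0, 2, 3, 1]
```

In the notebook this could be applied as `val_df['english_sentence'].apply(lambda s: encode(s, english_vocab))` once the column has been tokenized the same way as the training data.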

Creating the dataloaders.

In [10]:
def collate_fn(batch):
    # Each item is an (english, hindi) pair of 1-D tensors produced by TranslationDataset,
    # so the sequences do not need to be wrapped in torch.tensor again
    english_sequences, hindi_sequences = zip(*batch)

    # Pad every sequence to the longest one in the batch, reusing the EOS index
    # as the padding value (23 for English and 27 for Hindi in this run)
    english_sequences = pad_sequence(list(english_sequences), batch_first=True, padding_value=english_vocab[END_TOKEN])
    hindi_sequences = pad_sequence(list(hindi_sequences), batch_first=True, padding_value=hindi_vocab[END_TOKEN])

    return english_sequences, hindi_sequences


# Define a PyTorch Dataset
class TranslationDataset(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return torch.tensor(self.df.iloc[idx]['english_sentence']), torch.tensor(self.df.iloc[idx]['hindi_sentence'])

# Define a function to create data loaders
def create_data_loaders(train_df, val_df, test_df, batch_size=4):
    train_loader = DataLoader(TranslationDataset(train_df), batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
    val_loader = DataLoader(TranslationDataset(val_df), batch_size=batch_size, collate_fn=collate_fn)
    test_loader = DataLoader(TranslationDataset(test_df), batch_size=batch_size, collate_fn=collate_fn)
    return train_loader, val_loader, test_loader

train_loader, val_loader, test_loader = create_data_loaders(train_df, val_df, test_df)
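
The collate function above pads each batch to its longest sequence and reuses the EOS index as filler. A common alternative, sketched here with a dedicated padding index, makes it possible to tell real tokens from filler afterwards. `PAD_INDEX = 0` and the toy sequences are assumptions for the example, not values from this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 0  # hypothetical index reserved for padding only

sequences = [torch.tensor([5, 6, 7]), torch.tensor([8])]

# Shorter sequences are filled up to the longest one in the batch
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_INDEX)
print(padded.shape)  # torch.Size([2, 3])

# A boolean mask marks the real tokens and recovers the original lengths
mask = padded != PAD_INDEX
lengths = mask.sum(dim=1)
```

With a separate padding index, a loss such as `nn.CrossEntropyLoss(ignore_index=PAD_INDEX)` can skip the padded positions, which is not possible when padding and EOS share one index.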


#### Visualizing the data in the dataloaders

A small batch_size (4 here) keeps the printed batch readable.

In [11]:
dataiter = iter(train_loader)
data = next(dataiter)
print(len(data[0]), len(data[1]))  # both equal batch_size
src, trg = data
for i in range(len(src)):
    print(src[i])
    print(trg[i])
print(src.shape)  # (batch_size, length of the padded sequences)

4 4
tensor([  0, 355, 299,  23,  23,  23,  23,  23,  23,  23,  23,  23,  23,  23,
         23,  23,  23,  23,  23,  23,  23,  23,  23,  23,  23,  23,  23,  23])
tensor([   0, 3184,  423,   27,   27,   27,   27,   27,   27,   27,   27,   27,
          27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
          27,   27,   27,   27,   27,   27,   27,   27])
tensor([    0,  7156,  2552,   520,     7,   449,  4807,    17, 32445,     7,
            8,  6025,   222,     4,   848,   284,    68,   449,  7023,    85,
         2159,     7,     4,  3012,  3424,  7824,     7,    23])
tensor([    0,   795,    45,  1574,  3875,  1929,    45,    31,   277, 39376,
           24, 32022,  1335,    88,    64,    30,  6287,    24,  6456,  3412,
         1335,  3460,     4,  1390,   949,    30,  2767,  3460,   204,   287,
          387,    27])
tensor([   0, 2387, 3619,   23,   23,   23,   23,   23,   23,   23,   23,   23,
          23,   23,   23,   23,   23,   23,   23,   23,   23,   


### Part 2 : Building and training the seq2seq model

This marks the end of the data preparation. Now we will define the seq2seq model and train it.

First we will define the encoder, then the decoder, and then combine them into the seq2seq model. After this we will train the model and validate it.

#### Encoder

#### Visualizing the embeddings
The embedding_dim is again kept small so that the output stays readable. **embed_example.weight.data** gives the weight matrix of the embeddings.

Every word has a vector of size **embedding_dim** associated with it.

In [12]:
import torch
import torch.nn as nn
# The vocabulary indices are not contiguous, so size the table by the largest index
input_dim = max(english_vocab.values()) + 1
embedding_dim = 16
embed_example = nn.Embedding(input_dim, embedding_dim)
embeddings = embed_example.weight.data
embeddings[:4]

tensor([[ 1.5633, -0.7613,  0.6382,  0.6315, -1.3828, -0.0220, -1.2445,  0.5709,
         -0.3984,  0.7237,  0.3416, -0.7214, -0.7264,  0.8674,  0.0475, -0.1087],
        [-0.1380,  0.1000, -0.9086,  1.1975,  1.4482, -0.3140, -0.5042,  0.6986,
          0.8681, -0.6629, -1.2156,  0.6783,  0.0185, -0.7259,  0.7487,  2.1836],
        [-0.0807, -0.2238,  0.1027, -1.8286,  0.7095,  0.6904,  2.5964,  0.8055,
         -1.1085,  0.2683,  1.5151, -1.1142,  1.4237, -0.4702, -1.1880, -0.3092],
        [ 1.5518,  1.4537,  0.9522,  0.8927,  2.0210, -0.2655,  1.5467, -0.5011,
          0.2822,  0.2574, -1.3491,  0.9236,  3.0456,  0.7430, -0.6057, -0.7287]])
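
An embedding layer is a lookup table: passing a tensor of indices returns the matching rows of the weight matrix. The sketch below checks this on a toy table; the sizes and indices are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# A batch of 2 sequences with 3 word indices each
indices = torch.tensor([[0, 3, 3], [9, 1, 0]])
vectors = toy_embedding(indices)

print(vectors.shape)  # torch.Size([2, 3, 4])
# The layer output equals plain row indexing into the weight matrix
print(torch.equal(vectors, toy_embedding.weight[indices]))  # True
```

This also shows why every index in a batch must be smaller than `num_embeddings`: a larger index has no row to look up.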

**Visualizing how dropout functions.**

In [13]:
embedding = nn.Embedding(input_dim, embedding_dim)
embedded = embedding(src)
print(embedded, embedded.shape)
# Apply dropout to the same embeddings so that the two printouts are comparable
dropout = nn.Dropout(0.5)
embedded = dropout(embedded)
print(embedded, embedded.shape)

tensor([[[ 0.1142, -0.1283, -0.9150,  ...,  0.4571, -0.4290, -0.5260],
         [ 2.5914,  0.4068,  0.6622,  ...,  0.1918,  0.3976,  1.4082],
         [-1.0043,  1.4600, -1.5398,  ..., -0.1862, -0.3014,  2.0512],
         ...,
         [-0.6021, -1.4109,  0.5248,  ...,  0.5734,  0.0155, -0.2278],
         [-0.6021, -1.4109,  0.5248,  ...,  0.5734,  0.0155, -0.2278],
         [-0.6021, -1.4109,  0.5248,  ...,  0.5734,  0.0155, -0.2278]],

        [[ 0.1142, -0.1283, -0.9150,  ...,  0.4571, -0.4290, -0.5260],
         [ 1.8804,  0.7643, -0.7959,  ...,  0.7558,  0.4821,  0.4082],
         [ 0.4552, -0.9760, -1.6125,  ..., -0.7051,  0.9978,  0.9181],
         ...,
         [-0.6554, -0.8692,  1.5923,  ..., -0.3927, -0.7336, -2.0615],
         [ 1.1552,  0.5541,  0.3543,  ..., -0.5692,  0.2688,  0.8878],
         [-0.6021, -1.4109,  0.5248,  ...,  0.5734,  0.0155, -0.2278]],

        [[ 0.1142, -0.1283, -0.9150,  ...,  0.4571, -0.4290, -0.5260],
         [-0.3253, -0.3183, -1.1601,  ..., -0
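
The effect of dropout is easier to see on a tensor of ones than on random embeddings. In training mode `nn.Dropout(p)` zeroes each element with probability `p` and scales the survivors by `1 / (1 - p)`; in evaluation mode it passes the input through unchanged. A small sketch with toy input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

dropout.train()          # training mode: random zeroing is active
train_out = dropout(x)   # every entry is either 0.0 or 1 / (1 - 0.5) = 2.0

dropout.eval()           # evaluation mode: dropout is a no-op
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

The rescaling keeps the expected value of each activation the same in both modes, which is why no extra correction is needed at inference time.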

**Visualizing the LSTM with num_layers = 1.**
* (hidden, cell) hold the final states of all layers, stacked one over another.
* The shape of hidden and cell is (num_layers, batch_size, hidden_dim), which is (1, 4, 20) in the output below. With num_layers = 3 the first dimension would be 3.
* With batch_first=True the LSTM expects input of shape (batch_size, sequence_length, input_size).
* In our case batch_size = **4**, sequence_length is the **length of the padded integer sequences** and input_size = **embedding_dim**.

In [14]:
# LSTM dropout acts only between stacked layers, so it is left out for a single layer
lstm = nn.LSTM(16, 20, num_layers=1, batch_first=True)
outputs, (hidden, cell) = lstm(embedded)
print(hidden, hidden.shape)
print(cell, cell.shape)

tensor([[[ 0.1721, -0.1985,  0.1973, -0.2764, -0.1204, -0.2277, -0.0452,
          -0.1104,  0.1064, -0.1619, -0.2747,  0.0829, -0.1066, -0.2344,
           0.1671,  0.1116,  0.2070,  0.0980,  0.3262,  0.1509],
         [ 0.0846, -0.2682,  0.0512, -0.2109,  0.0399, -0.2178, -0.0642,
           0.1358, -0.3584,  0.0268, -0.1120, -0.0938,  0.3835, -0.2323,
          -0.0599,  0.0755,  0.1129,  0.2590,  0.3183,  0.2175],
         [ 0.0921, -0.2663,  0.1675, -0.2652, -0.0566, -0.4089, -0.0888,
           0.0254, -0.0530,  0.0987,  0.1629, -0.2235, -0.0694, -0.2750,
           0.0862,  0.2463,  0.2717,  0.3436,  0.5060,  0.0202],
         [ 0.1912,  0.0278,  0.2036, -0.3391, -0.0596, -0.1821,  0.0443,
          -0.0048,  0.0123, -0.0503, -0.1012,  0.3097, -0.1153, -0.1413,
           0.0286,  0.1502,  0.0647,  0.1248,  0.3103,  0.0258]]],
       grad_fn=<StackBackward0>) torch.Size([1, 4, 20])
tensor([[[ 0.2078, -0.4637,  0.3857, -0.5982, -0.3008, -0.3525, -0.0558,
          -0.2931,  0.190
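
For comparison, here is a sketch with three stacked layers, using random input in place of the real embeddings. The sizes mirror the ones used in this section (batch of 4, embedding size 16, hidden size 20); the sequence length of 28 is just an example value.

```python
import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim, hidden_dim, num_layers = 4, 28, 16, 20, 3

# dropout is applied between the stacked layers, not after the last one
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, dropout=0.5, batch_first=True)

x = torch.randn(batch_size, seq_len, embedding_dim)  # stand-in for embedded tokens
outputs, (hidden, cell) = lstm(x)

print(outputs.shape)  # torch.Size([4, 28, 20]): top-layer output at every time step
print(hidden.shape)   # torch.Size([3, 4, 20]): final hidden state of each layer
print(cell.shape)     # torch.Size([3, 4, 20]): final cell state of each layer
```

For a unidirectional LSTM like this one, `hidden[-1]` is the same as `outputs[:, -1, :]`, the top layer's output at the last time step.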



**Formally defining the Encoder**

In [15]:
import torch.optim as optim
from torchtext.data.metrics import bleu_score

# Encoder class
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        # Maps every word index in the input sequence to a vector of size embedding_dim
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # The LSTM's own dropout acts between stacked layers, so it only matters for n_layers > 1
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: (batch_size, padded sequence length)
        embedded = self.dropout(self.embedding(src))
        # embedded: (batch_size, padded sequence length, embedding_dim)
        outputs, (hidden, cell) = self.lstm(embedded)
        # hidden and cell: (n_layers, batch_size, hidden_dim); the per-step outputs are not returned
        return hidden, cell
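
A quick shape check of the encoder on random indices is sketched below. The class is repeated in compact form only so that the snippet runs on its own, and all sizes are small toy values chosen for the example. For reference, the paper describes a much larger configuration (4 LSTM layers with 1000 cells each and 1000-dimensional embeddings).

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Compact copy of the encoder above, repeated so this sketch is self-contained."""

    def __init__(self, input_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        _, (hidden, cell) = self.lstm(embedded)
        return hidden, cell


# Toy sizes: vocabulary of 50 words, batch of 4 sequences, each 28 tokens long
toy_encoder = Encoder(input_dim=50, embedding_dim=16, hidden_dim=20, n_layers=2, dropout=0.5)
toy_src = torch.randint(0, 50, (4, 28))

hidden, cell = toy_encoder(toy_src)
print(hidden.shape, cell.shape)  # torch.Size([2, 4, 20]) torch.Size([2, 4, 20])
```

The pair `(hidden, cell)` is the fixed-size summary of the source sentence that a decoder with the same `n_layers` and `hidden_dim` can take as its initial state.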