This notebook attempts English-to-Hindi machine translation, following the [original seq2seq paper](https://paperswithcode.com/method/seq2seq).

I will solve this problem in two parts.

Part 1: Converting the sentences into sequences. This includes removing NaN values, basic pre-processing (removing punctuation, converting to lowercase), tokenization, and vocabulary creation.

Part 2: Building and training the seq2seq model, following the [paper](https://paperswithcode.com/method/seq2seq) closely.

### Part 1: Converting to sequences

Removing NaN values, converting to lowercase, and removing punctuation.

In [1]:
import torch
import pandas as pd
import string
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

df = pd.read_csv('/kaggle/input/english-hindi-machine-translation/Hindi_English_Truncated_Corpus.csv')

df = df.dropna()  # Remove NaN values

# Convert the English sentences to lowercase and remove punctuation from both languages.
# Note: string.punctuation covers ASCII punctuation only, so the Hindi danda (।) is not removed.
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['english_sentence'] = df['english_sentence'].str.lower().apply(remove_punctuation)
df['hindi_sentence'] = df['hindi_sentence'].apply(remove_punctuation)
df

Unnamed: 0,source,english_sentence,hindi_sentence
0,ted,politicians do not have permission to do what ...,राजनीतिज्ञों के पास जो कार्य करना चाहिए वह करन...
1,ted,id like to tell you about one such child,मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी
2,indic2012,this percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
3,ted,what we really mean is that theyre bad at not ...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
4,indic2012,the ending portion of these vedas is called up...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।
...,...,...,...
127602,indic2012,examples of art deco construction can be found...,आर्ट डेको शैली के निर्माण मैरीन ड्राइव और ओवल ...
127603,ted,and put it in our cheeks,और अपने गालों में डाल लेते हैं।
127604,tides,as for the other derivatives of sulphur the c...,जहां तक गंधक के अन्य उत्पादों का प्रश्न है दे...
127605,tides,its complicated functioning is defined thus in...,Zरचनाप्रकिया को उसने एक पहेली में यों बांधा है
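
Since `string.punctuation` contains only ASCII punctuation, the Devanagari danda (।) survives in the Hindi column, as the output above shows (for example `है।`). Below is a minimal sketch of one way to strip it as well, assuming the danda and double danda are the only additional marks of interest. `EXTRA_PUNCTUATION` and `remove_punctuation_extended` are illustrative names, not part of the notebook.

```python
import string

# Assumption: besides ASCII punctuation we only want to drop the danda and double danda.
EXTRA_PUNCTUATION = "\u0964\u0965"  # '।' and '॥'
PUNCT_TABLE = str.maketrans('', '', string.punctuation + EXTRA_PUNCTUATION)

def remove_punctuation_extended(text):
    """Remove ASCII punctuation plus the Devanagari danda marks."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation_extended("इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।"))
```

Building the translation table once and reusing it avoids recreating it for every row when the function is passed to `apply`.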


Before tokenizing or building vocabularies, split the data into train, validation, and test sets. Building the vocabulary from the training set only prevents "information leakage" from the validation and test sets into training.

In [2]:
# Split the data into train, validation, and test sets
train_df, val_test_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=42)


In [3]:
train_df.count(),val_df.count(),test_df.count()

(source              102084
 english_sentence    102084
 hindi_sentence      102084
 dtype: int64,
 source              12760
 english_sentence    12760
 hindi_sentence      12760
 dtype: int64,
 source              12761
 english_sentence    12761
 hindi_sentence      12761
 dtype: int64)
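
The two-stage split gives roughly an 80/10/10 partition. The counts above can be reproduced by hand, assuming the held-out share is rounded up to a whole row at each stage (an assumption that is consistent with the printed counts). `split_sizes` is a hypothetical helper used only for this check.

```python
import math
from fractions import Fraction

def split_sizes(n_rows, holdout_fraction):
    """Return (kept, held_out) row counts, rounding the held-out share up."""
    n_holdout = math.ceil(n_rows * holdout_fraction)
    return n_rows - n_holdout, n_holdout

total_rows = 102084 + 12760 + 12761  # rows left after dropna(), taken from the counts above
n_train, n_val_test = split_sizes(total_rows, Fraction(1, 5))  # test_size=0.2
n_val, n_test = split_sizes(n_val_test, Fraction(1, 2))        # test_size=0.5
print(n_train, n_val, n_test)  # 102084 12760 12761
```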

Tokenizing the sentences. The source (English) sentences are tokenized in reverse word order, since reversing the source was one of the key sources of improvement reported in the paper.

In [4]:
# Define tokens
START_TOKEN = 'SOS'
END_TOKEN = 'EOS'
OUT_OF_VOCAB_TOKEN = 'OOV'


# Tokenize the sentences and add EOS and SOS tokens
train_df['english_sentence'] = train_df['english_sentence'].apply(lambda x: [START_TOKEN] + x.split()[::-1] + [END_TOKEN])
train_df['hindi_sentence'] = train_df['hindi_sentence'].apply(lambda x: [START_TOKEN] + x.split() + [END_TOKEN])

train_df.head(10)

Unnamed: 0,source,english_sentence,hindi_sentence
82661,tides,"[SOS, unions, trade, strong, of, up, building,...","[SOS, इसलिए, मजदूर, वर्ग, की, पहली, जरूरत, यह,..."
121426,indic2012,"[SOS, 1830, of, decade, the, during, marble, i...","[SOS, इस, तथ्य, के, भी, कोई, साक्ष्य, नहीं, है..."
30572,indic2012,"[SOS, pradesh, uttar, in, districts, 70, are, ...","[SOS, उत्तर, प्रदेश, में, ७०, जिले, हैं, EOS]"
25371,ted,"[SOS, schoolhouse, the, to, way, the, on, scho...","[SOS, या, तो, स्कूल, में, या, स्कूल, आतेजाते, ..."
56266,indic2012,"[SOS, road, northsouth, long, the, crosses, it...","[SOS, इसके, बाद, एक, बडा़, खुला, स्थान, है, जह..."
19147,tides,"[SOS, control, to, difficult, more, seem, pku,...","[SOS, फ्खू, जैस, उत्परिवर्तन, पर, नियंत्रण, पा..."
26135,ted,"[SOS, dollars, 5000, around, costing, was, tim...","[SOS, उस, समय, 5000, डालर, की, थी, EOS]"
77079,tides,"[SOS, them, harmonise, to, attempt, an, make, ...","[SOS, दार्शनिक, और, इतिहास, उनमें, से, एक, या,..."
39591,tides,"[SOS, 4344, pages, see, order, on, goods, of, ...","[SOS, डिपाजऋटि, के, बारे, में, जऋऊण्श्छ्ष्यादा..."
14882,tides,"[SOS, it, of, amount, some, needs, badly, subc...","[SOS, यह, सच, है, कि, वाजपेयी, ने, एक, पक्के, ..."
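
To make the reversal concrete, here is a small sketch of the tokenization step on a single sentence pair. `tokenize` is a hypothetical helper that mirrors the lambdas above: only the source side is reversed, while the target keeps its natural order.

```python
START_TOKEN = 'SOS'
END_TOKEN = 'EOS'

def tokenize(sentence, reverse=False):
    """Split on whitespace, optionally reverse the words, and wrap in SOS/EOS."""
    words = sentence.split()
    if reverse:
        words = words[::-1]
    return [START_TOKEN] + words + [END_TOKEN]

# Source side: word order is reversed between the SOS and EOS markers.
print(tokenize("and put it in our cheeks", reverse=True))
# ['SOS', 'cheeks', 'our', 'in', 'it', 'put', 'and', 'EOS']

# Target side: word order is left unchanged.
print(tokenize("और अपने गालों में डाल लेते हैं"))
```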


Creating the frequency counter. Words that occur only once (frequency = 1) will not be included in the vocabulary.

In [5]:
# Define minimum word frequency for it to be included in vocabulary
MIN_WORD_FREQ = 2

# Count the frequency of each word in both languages
english_vocab_counter = Counter(word for sentence in train_df['english_sentence'] for word in sentence)
hindi_vocab_counter = Counter(word for sentence in train_df['hindi_sentence'] for word in sentence)

english_vocab_counter.most_common(10) ,hindi_vocab_counter.most_common(10)

([('the', 103848),
  ('SOS', 102084),
  ('EOS', 102084),
  ('of', 59641),
  ('and', 47380),
  ('to', 38179),
  ('in', 37816),
  ('a', 29128),
  ('is', 23864),
  ('that', 14843)],
 [('SOS', 102084),
  ('EOS', 102084),
  ('के', 70335),
  ('में', 51408),
  ('है', 45851),
  ('की', 39470),
  ('और', 38008),
  ('से', 30831),
  ('का', 26611),
  ('को', 25161)])

Creating the vocabulary and adding the 'OOV' token.

In [6]:
# Create the vocabulary from words whose frequency is at least MIN_WORD_FREQ
# Note: enumerate() runs over the unfiltered counter, so the kept indices are not contiguous
english_vocab = {word: i for i, (word, freq) in enumerate(english_vocab_counter.items()) if freq >= MIN_WORD_FREQ}
hindi_vocab = {word: i for i, (word, freq) in enumerate(hindi_vocab_counter.items()) if freq >= MIN_WORD_FREQ}
# The OOV token stands in for any word that is not in the vocabulary
# Note: because of the gaps, len(vocab) may coincide with an index that is already assigned to a word
english_vocab.update({OUT_OF_VOCAB_TOKEN: len(english_vocab)})
hindi_vocab.update({OUT_OF_VOCAB_TOKEN: len(hindi_vocab)})
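
One caveat with the cell above: `enumerate` runs over every word in the counter, including the ones that are then filtered out, so the kept indices have gaps and `len(vocab)` can coincide with an index already given to a real word. Below is a minimal sketch of an alternative that filters first and then numbers the survivors, so the indices are contiguous and the OOV index is unique. `build_vocab` is a hypothetical helper, and adopting it would change the index values shown in this notebook.

```python
from collections import Counter

def build_vocab(counter, min_freq, oov_token='OOV'):
    """Map each sufficiently frequent word to a contiguous index, with OOV last."""
    kept = [word for word, freq in counter.items() if freq >= min_freq]
    vocab = {word: i for i, word in enumerate(kept)}
    vocab[oov_token] = len(vocab)  # next free index, so it cannot collide
    return vocab

toy_counter = Counter(['SOS', 'a', 'b', 'a', 'EOS', 'SOS', 'c', 'a', 'EOS'])
toy_vocab = build_vocab(toy_counter, min_freq=2)
print(toy_vocab)  # {'SOS': 0, 'a': 1, 'EOS': 2, 'OOV': 3}
```

With contiguous indices, the embedding table only needs `len(vocab)` rows.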


Finally, converting sentences to sequences.

In [7]:
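# Convert the words in the sentences to their corresponding index in the vocabulary
train_df['english_sentence'] = train_df['english_sentence'].apply(lambda sentence: [english_vocab.get(word, english_vocab[OUT_OF_VOCAB_TOKEN]) for word in sentence])
train_df['hindi_sentence'] = train_df['hindi_sentence'].apply(lambda sentence: [hindi_vocab.get(word, hindi_vocab[OUT_OF_VOCAB_TOKEN]) for word in sentence])
# val_df and test_df still hold raw strings: tokenize them as in In [4], then encode with the training vocabularies
for split_df in (val_df, test_df):
    split_df['english_sentence'] = split_df['english_sentence'].apply(lambda x: [english_vocab.get(w, english_vocab[OUT_OF_VOCAB_TOKEN]) for w in [START_TOKEN] + x.split()[::-1] + [END_TOKEN]])
    split_df['hindi_sentence'] = split_df['hindi_sentence'].apply(lambda x: [hindi_vocab.get(w, hindi_vocab[OUT_OF_VOCAB_TOKEN]) for w in [START_TOKEN] + x.split() + [END_TOKEN]])
train_df['hindi_sentence'][:20]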

82661     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 41227, ...
121426    [0, 28, 29, 30, 31, 32, 33, 34, 16, 9, 35, 36,...
30572                       [0, 55, 56, 24, 57, 58, 16, 27]
25371                   [0, 59, 60, 61, 24, 59, 61, 62, 27]
56266     [0, 63, 64, 10, 65, 66, 67, 8, 68, 7, 69, 70, ...
19147            [0, 74, 75, 76, 77, 78, 79, 80, 81, 8, 27]
26135                        [0, 82, 83, 84, 85, 4, 86, 27]
77079     [0, 87, 88, 89, 90, 91, 10, 59, 92, 30, 93, 94...
39591     [0, 41227, 30, 104, 24, 105, 106, 30, 107, 108...
14882     [0, 7, 119, 8, 9, 120, 41, 10, 121, 122, 4, 12...
44539     [0, 131, 30, 132, 4, 133, 134, 24, 135, 41227,...
77431                           [0, 169, 170, 171, 172, 27]
84191                       [0, 173, 174, 34, 175, 176, 27]
71777                                [0, 177, 178, 179, 27]
92433     [0, 180, 181, 127, 182, 183, 8, 184, 185, 186,...
77639     [0, 193, 194, 64, 195, 196, 197, 41, 198, 77, ...
28978                                   
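
As a sanity check on the encoding, it can help to map a sequence back to tokens. The sketch below uses a toy vocabulary (`toy_vocab`, `encode`, and `decode` are illustrative names, not from the notebook) and shows how a word missing from the vocabulary comes back as `OOV`.

```python
OUT_OF_VOCAB_TOKEN = 'OOV'

# Toy vocabulary for illustration only; the real one is built from the training set.
toy_vocab = {'SOS': 0, 'EOS': 1, 'tea': 2, 'like': 3, 'i': 4, OUT_OF_VOCAB_TOKEN: 5}
index_to_word = {index: word for word, index in toy_vocab.items()}

def encode(tokens, vocab):
    """Replace each token with its index, falling back to the OOV index."""
    oov_index = vocab[OUT_OF_VOCAB_TOKEN]
    return [vocab.get(token, oov_index) for token in tokens]

def decode(indices, inverse_vocab):
    """Replace each index with its token."""
    return [inverse_vocab[i] for i in indices]

encoded = encode(['SOS', 'coffee', 'like', 'i', 'EOS'], toy_vocab)
print(encoded)                         # [0, 5, 3, 4, 1]
print(decode(encoded, index_to_word))  # ['SOS', 'OOV', 'like', 'i', 'EOS']
```

The round trip is lossy by design: every out-of-vocabulary word collapses to the same index.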

Creating the dataloaders.

In [9]:
# Define a PyTorch Dataset
class TranslationDataset(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Each item is a (source, target) pair of 1-D index tensors; lengths vary per sentence
        row = self.df.iloc[idx]
        return torch.tensor(row['english_sentence']), torch.tensor(row['hindi_sentence'])

# Define a function to create data loaders
def create_data_loaders(train_df, val_df, test_df, batch_size=32):
    train_loader = DataLoader(TranslationDataset(train_df), batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(TranslationDataset(val_df), batch_size=batch_size)
    test_loader = DataLoader(TranslationDataset(test_df), batch_size=batch_size)
    return train_loader, val_loader, test_loader

train_loader, val_loader, test_loader = create_data_loaders(train_df, val_df, test_df)
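
Because the sentences have different lengths, a batch has to be padded to a common length before it can be stacked into one tensor; `DataLoader`'s default collation expects equally sized tensors. The sketch below shows the padding logic in plain Python. `PAD_INDEX` and `pad_batch` are assumptions for illustration: the vocabularies above do not define a padding token, so one would have to be added. The padded lists could then be converted with `torch.tensor` inside a function passed as `collate_fn` to `DataLoader`.

```python
# Placeholder value; assumption: replace it with the index of a dedicated padding token.
PAD_INDEX = -1

def pad_batch(sequences, pad_index=PAD_INDEX):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

batch = [[0, 55, 56, 24, 57, 58, 16, 27], [0, 177, 178, 179, 27]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)  # [8, 5]
```

The original lengths are returned as well, since they are typically needed later to mask the padding in the loss or to pack the sequences.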


### This marks the end of the data preparation. Next, we will define the seq2seq model and train it.