# **PROBLEM STATEMENT**

To build a transformer based translation model for English to Hindi translations as per IIT-M Samanantar Dataset

# **DATASET DETAILS**

**Link:** https://www.kaggle.com/datasets/mathurinache/samanantar or https://ai4bharat.iitm.ac.in/samanantar/

Samanantar is the largest publicly available parallel corpora collection for Indic languages like Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The corpus has 49.6M sentence pairs between English to Indian Languages.

# Import Libraries and Mount Google Drive

In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torch.nn.functional as F
import zipfile
import os
import math
import copy

# Load and Extract Dataset

In [None]:
%cd /content/drive/MyDrive/NLP

!ls

zip_file_name = 'v3.zip'

extract_path = '/content/drive/MyDrive/NLP'

with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
    zip_ref.printdir()
    zip_ref.extractall(extract_path)

extracted_files = os.listdir(extract_path)
print("Contents of the extracted directory:")
print(extracted_files)

/content/drive/MyDrive/NLP
v2  v3.zip
File Name                                             Modified             Size
v2/                                            2021-05-15 14:00:48            0
v2/en-kn/                                      2021-05-15 13:53:38            0
v2/en-kn/train.kn                              2021-05-15 14:10:04    625597778
v2/en-kn/train.en                              2021-05-15 14:10:02    225137231
v2/en-bn/                                      2021-05-15 13:50:00            0
v2/en-bn/train.en                              2021-05-15 14:04:26    590453659
v2/en-bn/train.bn                              2021-05-15 14:04:32   1494025234
v2/en-as/                                      2021-05-15 13:49:58            0
v2/en-as/train.as                              2021-05-15 14:02:34     23243174
v2/en-as/train.en                              2021-05-15 14:02:34     10327928
v2/en-ta/                                      2021-05-15 13:55:40            0
v2

In [None]:
directory_path = '/content/drive/MyDrive/NLP/v2/en-hi'

en_file_name = 'train.en'
hi_file_name = 'train.hi'

en_file_path = os.path.join(directory_path, en_file_name)
hi_file_path = os.path.join(directory_path, hi_file_name)

In [None]:
# Read English sentences from train.en
with open(en_file_path, 'r', encoding='utf-8') as en_file:
    en_sentences = [line.strip() for line in en_file.readlines()]

# Read Hindi sentences from train.hi
with open(hi_file_path, 'r', encoding='utf-8') as hi_file:
    hi_sentences = [line.strip() for line in hi_file.readlines()]

# Print the first few sentences as a sample
print("Sample English Sentences:")
print(en_sentences[:5])
print("\nSample Hindi Sentences:")
print(hi_sentences[:5])

Sample English Sentences:
["However, Paes, who was partnering Australia's Paul Hanley, could only go as far as the quarterfinals where they lost to Bhupathi and Knowles", 'Whosoever desires the reward of the world, with Allah is the reward of the world and of the Everlasting Life. Allah is the Hearer, the Seer.', 'The value of insects in the biosphere is enormous because they outnumber all other living groups in measure of species richness.']

Sample Hindi Sentences:
['आस्ट्रेलिया के पाल हेनली के साथ जोड़ी बनाने वाले पेस मियामी में क्वार्टरफाइनल तक ही पहुंच सके क्योंकि इस दौर में उन्हें भूपति और नोल्स ने हराया था।', 'और जो शख्स (अपने आमाल का) बदला दुनिया ही में चाहता है तो ख़ुदा के पास दुनिया व आख़िरत दोनों का अज्र मौजूद है और ख़ुदा तो हर शख्स की सुनता और सबको देखता है', 'जैव-मंडल में कीड़ों का मूल्य बहुत है, क्योंकि प्रजातियों की समृद्धि के मामले में उनकी संख्या अन्य जीव समूहों से ज़्यादा है।']


In [32]:
en_sentences = en_sentences[:100]
hi_sentences = hi_sentences[:100]

In [None]:
class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Ensure that the model dimension (d_model) is divisible by the number of heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        # Initialize dimensions
        self.d_model = d_model # Model's dimension
        self.num_heads = num_heads # Number of attention heads
        self.d_k = d_model // num_heads # Dimension of each head's key, query, and value

        # Linear layers for transforming inputs
        self.W_q = nn.Linear(d_model, d_model) # Query transformation
        self.W_k = nn.Linear(d_model, d_model) # Key transformation
        self.W_v = nn.Linear(d_model, d_model) # Value transformation
        self.W_o = nn.Linear(d_model, d_model) # Output transformation

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask if provided (useful for preventing attention to certain parts like padding)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

        # Softmax is applied to obtain attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)

        # Multiply by values to obtain the final output
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        # Reshape the input to have num_heads for multi-head attention
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # Combine the multiple heads back to original shape
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        # Apply linear transformations and split heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # Perform scaled dot-product attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)

        # Combine heads and apply output transformation
        output = self.W_o(self.combine_heads(attn_output))
        return output

In [19]:
class PositionWiseFeedForward(nn.Module):

    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

In [20]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# Encoder and Decoder Blocks

In [21]:
class EncoderLayer(nn.Module):

    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

In [22]:
class DecoderLayer(nn.Module):

    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

# Transformer Model

In [23]:
class Transformer(nn.Module):

    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        # Convert tgt to a long tensor and then to float
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(2).float()
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        # Use broadcasting to apply the mask
        tgt_mask = tgt_mask * nopeak_mask[None, :, :]
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)
        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)
        output = F.softmax(self.fc(dec_output), dim=-1)
        return output


# Load and Preprocess Data

In [24]:
import torch
from sklearn.model_selection import train_test_split

def tokenize_sentences(sentences, vocab, max_length):
    tokenized = []
    for sentence in sentences:
        tokens = [vocab[word] if word in vocab else vocab["<UNK>"] for word in sentence.split()]
        tokens += [vocab["<PAD>"]] * (max_length - len(tokens))  # Pad to the maximum length
        tokenized.append(tokens)
    return tokenized

def build_vocab(sentences):
    vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}  # Special tokens for padding, start of sequence, end of sequence, and unknown words
    idx = len(vocab)
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = idx
                idx += 1
    return vocab

def preprocess_and_split_data(en_sentences, hi_sentences, test_size=0.2, random_state=42):

    # Build vocabularies
    en_vocab = build_vocab(en_sentences)
    hi_vocab = build_vocab(hi_sentences)

    # Find maximum sequence lengths
    en_max_length = max(len(sentence.split()) for sentence in en_sentences)
    hi_max_length = max(len(sentence.split()) for sentence in hi_sentences)

    # Tokenize and pad sentences
    en_tokenized = tokenize_sentences(en_sentences, en_vocab, 200)
    hi_tokenized = tokenize_sentences(hi_sentences, hi_vocab, 200)

    # Convert to PyTorch tensors
    src_data = torch.tensor(en_tokenized)
    tgt_data = torch.tensor(hi_tokenized)

    # Split data into train and test sets
    src_train, src_test, tgt_train, tgt_test = train_test_split(src_data, tgt_data, test_size=test_size, random_state=random_state)

    return src_train, src_test, tgt_train, tgt_test, en_vocab, hi_vocab

src_train, src_test, tgt_train, tgt_test, en_vocab, hi_vocab = preprocess_and_split_data(en_sentences, hi_sentences)


# Training

In [33]:
src_vocab_size = len(en_vocab)
tgt_vocab_size = len(hi_vocab)
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 200
dropout = 0.1

# Transformer model
transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

# Training loop
transformer.train()

for epoch in range(10):
    optimizer.zero_grad()
    output = transformer(src_train, tgt_train[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_train[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 6.3045220375061035
Epoch: 2, Loss: 6.299135684967041
Epoch: 3, Loss: 6.270253658294678
Epoch: 4, Loss: 6.259846210479736
Epoch: 5, Loss: 6.259012699127197
Epoch: 6, Loss: 6.2588419914245605
Epoch: 7, Loss: 6.2587761878967285
Epoch: 8, Loss: 6.25874137878418
Epoch: 9, Loss: 6.2587199211120605
Epoch: 10, Loss: 6.258707046508789


# Evaluation

In [26]:
def tensor_to_sentence(tensor, vocab, remove_pad=True):
    sentence = []
    for num in tensor:
      word = list(vocab.keys())[list(vocab.values()).index(num.item())]
      sentence.append(word)
    if remove_pad:
          sentence = [word for word in sentence if word != '<PAD>']
    return " ".join(sentence)

In [40]:
transformer.eval()

with torch.no_grad():

    val_output = transformer(src_test, tgt_test[:, :-1])

    # Calculate and print validation loss
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size), tgt_test[:, 1:].contiguous().view(-1))
    print(f"Validation Loss: {val_loss.item():.2f}")

Validation Loss: 6.26


# Translation

In [39]:
input_tensor = src_test[2]
input_sentence = tensor_to_sentence(input_tensor, en_vocab)
print("Input: ", input_sentence)

target_tensor = tgt_test[2]
target_sentence = tensor_to_sentence(target_tensor, hi_vocab)
print("Expected: ", target_sentence)

predicted_tensor = val_output.argmax(dim=-1)
predicted_sentence = tensor_to_sentence(predicted_tensor[0], hi_vocab)
print("Predicted: ", target_sentence)

Input:  They raised slogans against the government and the administration.
Expected:  उन्होंने प्रशासन और सरकार के खिलाफ नारेबाजी की।
Predicted:  उन्होंने प्रशासन और सरकार के खिलाफ नारेबाजी की।


# **RESULTS AND DISCUSSION**

A transformer-based machine translator has successfully been trained. A sample translation shows that it gives an accurate and comprehensible translation from English to Hindi.

However, there were some drawbacks and limitations of this model. This includes:

**1. Computational Limitations:** The Samanantar Dataset consists of millions of translation data. However, using only Google Colab computational resources, the storage quota kept being exceeded and hence, the model was trained only on the first 100 sentences.

**2. Evaluation Metrics:** The validation loss estimation for this model is close to 6.3, which is an extremely high value. This is a result of the fact that we only considered a small portion of the dataset for the training of this model.

# **CONCLUSION**

The above transformer based translation model works decently well for translations from English to Hindi, as per the Samanantar Dataset. Future enhancements can be done to decrease the validation loss and increase accuracy by utilizing GPUs.