<a href="https://colab.research.google.com/github/astromanish/NMT/blob/main/NMT_Final_Phase.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 1. Introduction


This notebook implements a sequence-to-sequence model with an attention mechanism for Hindi-to-English neural machine translation in PyTorch. The encoder is a single-layer bidirectional GRU; the decoder combines an attention layer with a single-layer unidirectional GRU. Each source and target sentence is wrapped with a start-of-sequence (\<sos\>) token and an end-of-sequence (\<eos\>) token. IndicNLP is used to tokenize Hindi sentences and NLTK is used to tokenize English sentences. The model is trained with cross-entropy loss, and its parameters are updated with the AdamW optimizer.
The notebook is divided into the following sections:

1. Introduction
2. Installing the required packages
3. Pre-processing data
4. Building the Vocabulary
5. Model Architecture
6. Training the Model
7. Testing the Model
8. Generating Predictions


# 2. Installing the required packages


In [None]:
import csv
import torch
import re
import random
import numpy as np
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from nltk.tokenize import RegexpTokenizer
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
# cloning the Indic NLP library and its resources (used for Hindi tokenization)
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
!git clone "https://github.com/anoopkunchukuttan/indic_nlp_library"

In [None]:
INDIC_NLP_LIB_HOME = (
    "./indic_nlp_library/"  # path to local git repo for Indic NLP library
)
INDIC_NLP_RESOURCES = (
    "./indic_nlp_resources/"  # path to local git repo for Indic NLP Resources
)

In [None]:
import sys

sys.path.append(r"{}".format(INDIC_NLP_LIB_HOME))
from indicnlp import common

common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp import loader

loader.load()
from indicnlp.tokenize import indic_tokenize

# 3. Pre-processing data


In this section, the training data is prepared. The following steps are carried out to clean the data in the raw training file (raw_train.csv):

1. All English letters are converted to lower case.
2. Commas, double quotes, and symbols such as -, (, ), {, }, [, ], :, #, /, \\, ♪, =, ¶, ~ are removed from the Hindi and English sentences. '&' is replaced with 'and'.
3. In Hindi sentences, periods are replaced with the purna viram marker ('|'), so that sentences end with '|' rather than '.'.
4. All Devanagari numerals are replaced by Western Arabic numerals, for example, "१" is replaced with "1".
5. Sentence pairs whose Hindi sentence contains any English (Latin) letter are removed from the train set. For example, sentences like "मैंने तुमे School से हटवा दिया." are deleted.
6. Multiple occurrences of periods (.) and spaces are replaced with a single occurrence.
7. Sentence pairs in which either sentence has more than 70 tokens are removed from the train set.
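
The cleaning rules above can be tried on a single sentence before running them over the whole file. The sketch below is a minimal, standard-library illustration of steps 3 to 6; the helper names `contains_latin` and `clean_hindi` and the sample sentences are assumptions made for this example and are not part of the notebook's pipeline.

```python
import re

# Translation table from Devanagari digits to Western Arabic digits (step 4).
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")


def contains_latin(text):
    """Return True if the text contains any ASCII letter (step 5)."""
    return re.search(r"[A-Za-z]", text) is not None


def clean_hindi(text):
    """Apply steps 3, 4 and 6 to one Hindi sentence (hypothetical helper)."""
    text = re.sub(r"\.{2,}", ".", text)  # collapse runs of periods
    text = re.sub(r" {2,}", " ", text)  # collapse runs of spaces
    text = text.replace(".", "|")  # a period becomes the '|' marker
    return text.translate(DEVANAGARI_DIGITS)


print(contains_latin("मैंने तुमे School से हटवा दिया."))
print(clean_hindi("मेरे पास  १२ किताबें हैं..."))
```

Using a translation table and two regular expressions keeps each rule to one line, which makes it easy to check a rule in isolation when the cleaned data looks wrong.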


In [None]:
train_set = []  # list to store pairs of Hindi and English sentences
with open("../data/raw_train.csv", "r", encoding="utf-8", newline="") as f:  # reading the raw training file
    csv_reader = csv.reader(f, delimiter=",")
    next(csv_reader, None)  # skip the header row so the column names are not stored as data
    for row in csv_reader:
        # keep the pair only if the Hindi sentence contains no English (ASCII) letter
        if re.search(r"[A-Za-z]", row[1]) is None:
            # lower-casing the English sentence while storing the pair
            train_set.append([row[1], row[2].lower()])
train_set[0:10]

In [None]:
# removing the punctuation, unnecessary spaces and replacing Devanagari numerals
processing_dict = {
    "-": " ",
    "(": " ",
    ")": " ",
    "{": " ",
    "}": " ",
    "[": " ",
    "]": " ",
    ":": " ",
    '"': " ",
    "&": " and ",
    "#": " ",
    "/": " ",
    "\\": " ",
    "♪": " ",
    "=": " ",
    "¶": " ",
    "~": " ",
    "  ": " ",
    "%": " ",
    ",": "",
    "♫": " ",
}
# translation table mapping Devanagari numerals to Western Arabic numerals
devanagari_digits = str.maketrans("०१२३४५६७८९", "0123456789")

for pair in train_set:
    for j in range(2):  # j = 0 is the Hindi sentence, j = 1 is the English sentence
        for src, trg in processing_dict.items():
            pair[j] = pair[j].replace(src, trg)

        # multiple occurrences of period are replaced with a single occurrence
        pair[j] = re.sub(r"\.{2,}", ".", pair[j])

        # runs of spaces left behind by the replacements above are collapsed to one space
        pair[j] = re.sub(r" {2,}", " ", pair[j])

    # periods in the Hindi sentence are replaced with the '|' marker
    pair[0] = pair[0].replace(".", "|")

    # replacing the Devanagari numerals with Western Arabic numerals
    pair[0] = pair[0].translate(devanagari_digits)

# storing the pre-processed data in a separate file
with open("../data/pre_processed_train.csv", "w", encoding="utf-8", newline="") as f:
    write = csv.writer(f)
    write.writerow(["hindi", "english"])  # adding column names
    write.writerows(train_set)

In [None]:
train_set[0:20]  # training data after pre-processing

### Defining the Hindi and English Tokenizers


For tokenization of Hindi sentences, indic_tokenize.trivial_tokenize() from IndicNLP is used. This tokenizer splits the Hindi text on whitespace and separates punctuation marks into their own tokens.


In [None]:
# function to tokenize Hindi text
def hindi_tokenizer(text_in_hindi):
    # trivial_tokenize of IndicNLP is used for tokenization; the tokens of a sentence are returned as a list
    return list(indic_tokenize.trivial_tokenize(text_in_hindi))

For tokenization of English sentences, the Regexp tokenizer of NLTK is used. This tokenizer is selected because it keeps words that contain an apostrophe together. For example, if the sentence is "I'll be there", the Regexp tokenizer produces ["I'll", 'be', 'there'], whereas word_tokenize() of NLTK produces ['I', "'ll", 'be', 'there'].
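
The behaviour described above can be checked without NLTK. The sketch below applies the character class `[\w']+` with Python's `re.findall`, which, to the best of my understanding, is how a `RegexpTokenizer` that is not in gap mode extracts its tokens. Using `re` directly is a simplification for this example, and `tokenize_english` is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r"[\w']+")  # runs of word characters and apostrophes
END_PUNCTUATION = (".", "!", "?")


def tokenize_english(text):
    """Split text into word tokens and keep a sentence-final punctuation mark."""
    tokens = PATTERN.findall(text)
    # the pattern never matches punctuation, so a final ". ! ?" is added by hand
    if text and text[-1] in END_PUNCTUATION:
        tokens.append(text[-1])
    return tokens


print(tokenize_english("i'll be there."))
```

The `if text` guard matters for empty strings, where indexing `text[-1]` would otherwise raise an `IndexError`.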


In [None]:
regexp_tokenizer = RegexpTokenizer(r"[\w']+")  # a token is a run of word characters and apostrophes
end_punctuation = [".", "!", "?"]


def english_tokenizer(text_in_english):
    english_tokens = regexp_tokenizer.tokenize(text_in_english)
    # the Regexp tokenizer doesn't emit punctuation such as ". ! ?" as tokens, so a sentence-final punctuation mark is appended as a token
    if text_in_english and text_in_english[-1] in end_punctuation:
        english_tokens.append(text_in_english[-1])
    return english_tokens  # tokens of the English sentence are returned as a list

Now, we need to define a maximum limit on the length of the sentences considered for training. I used a limit of 70, and all sentence pairs in which either the Hindi or the English sentence is longer than 70 were eliminated.

Note: Length here means the number of tokens, not the number of characters.


In [None]:
train_set_trimmed = []  # list to store training pairs whose sentences have at most 70 tokens
for pair in train_set:
    if len(pair[0]) > 0 and len(pair[1]) > 0:
        l1 = len(hindi_tokenizer(pair[0]))  # number of tokens in Hindi sentence
        l2 = len(english_tokenizer(pair[1]))  # number of tokens in English sentence
        # keep the pair only if both sentences have between 1 and 70 tokens
        if 1 <= l1 <= 70 and 1 <= l2 <= 70:
            train_set_trimmed.append(pair)
# print(len(train_set_trimmed))
print(train_set_trimmed[0:10])

### Creating Train and validation set


In [None]:
n = len(train_set_trimmed)  # length of trimmed dataset
train_ratio = 0.90  # 90:10 ratio is used for train and validation/test data
train_size = int(n * train_ratio)  # size of training data
val_size = int(n - train_size)  # size of validation data

# storing train and validation data in a separate list
train_ds, val_ds = train_set_trimmed[:train_size], train_set_trimmed[train_size:]

# length of train and validation data
len(train_ds), len(val_ds)

In [None]:
# saving the train and validation data in csv files

with open("../data/final_train.csv", "w", encoding="utf-8", newline="") as f:
    # using csv.writer method from CSV package
    write = csv.writer(f)
    write.writerows(train_ds)


with open("../data/final_validation.csv", "w", encoding="utf-8", newline="") as f:
    # using csv.writer method from CSV package
    write = csv.writer(f)
    write.writerows(val_ds)

# 4. Building the Vocabulary

---


To build the Hindi and English vocabularies, each sentence is first tokenized using the tokenizer functions, and each token is then looked up in the dictionary. If the token is not present in the dictionary, it is assigned the next free index and added. Two dictionaries are maintained for each language: one maps a word to its index (word2index) and the other maps an index back to its word (index2word). A third dictionary keeps the number of occurrences of each word in the corpus. These counts can be used to limit the size of the vocabulary by keeping only the most frequent words. However, in this code all the words present in the training corpus are included in the vocabulary.
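
As an illustration of how the count dictionary could limit the vocabulary, the sketch below builds word2index and index2word from already tokenized sentences and drops tokens seen fewer than `min_count` times. The function `build_vocab`, the `min_count` parameter, and the toy sentences are assumptions made for this example; the notebook itself keeps every word.

```python
from collections import Counter

SPECIAL_TOKENS = ["<sos>", "<eos>", "<unk>", "<pad>"]  # indices 0 to 3


def build_vocab(tokenized_sentences, min_count=1):
    """Build word2index/index2word, keeping tokens seen at least min_count times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    word2index = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    for token, count in counts.items():
        if count >= min_count and token not in word2index:
            word2index[token] = len(word2index)  # next free index
    index2word = {idx: tok for tok, idx in word2index.items()}
    return word2index, index2word, counts


sentences = [["i", "am", "here"], ["i", "am", "there"], ["you", "are", "here"]]
w2i, i2w, counts = build_vocab(sentences, min_count=2)
print(w2i)
# a token that was dropped falls back to the <unk> index at lookup time
print(w2i.get("there", w2i["<unk>"]))
```

With `min_count=1` this reduces to the behaviour of the notebook, where every token of the training set receives an index.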


In [None]:
sos_token = "<sos>"  # start of sequence token; appended at start of sentence
eos_token = "<eos>"  # end of sequence token; appended at end of sentence
unk_token = "<unk>"  # unknown token; used to represent a word if that word is not found in the dictionary
pad_token = (
    "<pad>"  # token for padding; used to make all sentences of equal length in a batch
)

### English Vocabulary

In [None]:
# dictionary to keep count of occurrence of each English word
E_wordCount = {}

# dictionary to find the index for a word in English
E_word2index = {sos_token: 0, eos_token: 1, unk_token: 2, pad_token: 3}

# dictionary to find the English word for a particular index
E_index2word = {0: sos_token, 1: eos_token, 2: unk_token, 3: pad_token}

E_count = 4  # keeps count of number of words so far in English dictionary

### Hindi Vocabulary

In [None]:
# dictionary to keep count of occurrence of each Hindi word
H_wordCount = {}

# dictionary to find the index for a word in Hindi
H_word2index = {sos_token: 0, eos_token: 1, unk_token: 2, pad_token: 3}

# dictionary to find the Hindi word for a particular index
H_index2word = {0: sos_token, 1: eos_token, 2: unk_token, 3: pad_token}

H_count = 4  # keeps count of number of words so far in Hindi dictionary

Defining functions to update the dictionary.


In [None]:
# function to add the words of a sentence to the English dictionary
def E_updateDict(eng_sentence):
    global E_count
    tokens = english_tokenizer(eng_sentence)  # generating tokens for the given sentence
    for token in tokens:
        # every occurrence of the token is counted exactly once
        E_wordCount[token] = E_wordCount.get(token, 0) + 1
        if token not in E_word2index:  # check if the token already exists in English dictionary
            # if the token is not present in English dictionary then add it to word2index and index2word English dictionary
            E_word2index[token] = E_count
            E_index2word[E_count] = token
            E_count += 1  # increasing the count of words in English vocabulary


# function to add a word in Hindi dictionary
def H_updateDict(hindi_sentence):
    global H_count
    tokens = hindi_tokenizer(hindi_sentence)  # generating tokens for the given sentence
    for token in tokens:
        H_wordCount[token] = H_wordCount.get(token, 0) + 1
        if (
            token not in H_word2index.keys()
        ):  # check if the token already exists in Hindi dictionary
            # if the token is not present in Hindi dictionary then add it to word2index and index2word Hindi dictionary
            H_word2index[token] = H_count
            H_index2word[H_count] = token
            H_count += 1  # increasing the count of words in Hindi vocabulary

In [None]:
# reading the training pairs to create hindi and english vocabulary
for pair in train_ds:
    H_updateDict(pair[0])  # updating hindi vocabulary
    E_updateDict(pair[1])  # updating english vocabulary

In [None]:
# number of words in hindi and english vocabulary
print(H_count, E_count)

# 5. Model Architecture


#### Defining the Encoder architecture

For the encoder, a bidirectional GRU is used: the forward RNN reads the embedded sentence from left to right and the backward RNN reads it from right to left. Because the encoder is bidirectional, we get two final hidden states (context vectors), one from each RNN. Since the decoder is unidirectional, these two vectors are concatenated, passed through a linear layer, and then the tanh activation function is applied to produce the initial hidden state of the decoder.
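
The tensor shapes involved can be checked in isolation. The sketch below runs a bidirectional `nn.GRU` on random input and merges the two final hidden states as described above; the toy sizes and the random input are assumptions chosen only for this demonstration.

```python
import torch
import torch.nn as nn

seq_len, batch_size, embedding_size, hidden_size = 5, 3, 8, 16  # toy sizes (assumed)

gru = nn.GRU(embedding_size, hidden_size, bidirectional=True)
linear = nn.Linear(hidden_size * 2, hidden_size)

embedded = torch.randn(seq_len, batch_size, embedding_size)  # stands in for embedded tokens
outputs, hidden = gru(embedded)
# outputs: [seq_len, batch_size, 2 * hidden_size]
# hidden:  [2, batch_size, hidden_size] (final forward state, final backward state)

# concatenate the two final states and project them back to hidden_size
merged = torch.tanh(linear(torch.cat((hidden[0], hidden[1]), dim=1)))

print(outputs.shape, hidden.shape, merged.shape)
```

For a single-layer GRU, `hidden[0]` and `hidden[1]` are the same tensors that the encoder addresses as `encoder_hidden[-2]` and `encoder_hidden[-1]`.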


In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, dropout_val):
        # input_size is equal to hindi vocabulary size and embedding_size is equal to dimensions of embeddings
        super().__init__()
        self.dropout = nn.Dropout(dropout_val)
        self.embedding = nn.Embedding(input_size, embedding_size)
        # to make the GRU bidirectional we pass bidirectional=True parameter
        self.gru = nn.GRU(embedding_size, hidden_size, bidirectional=True)
        # fully connected linear layer
        self.linear = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, token_vec):
        # token_vec is a vector of indices mapping a word to its index in the vocabulary. token_vec.shape()=[max_batch_length, batch size]

        embedding = self.dropout(
            self.embedding(token_vec)
        )  # embedding is a 3D tensor of shape (seq length, batch_size, embedding_size)

        encoder_outputs, encoder_hidden = self.gru(
            embedding
        )  # the embedding is passed as input to the GRU
        # encoder_outputs has dimensions [seq length, batch size, 2*hidden_size]
        # encoder_hidden has dimensions [2, batch_size, hidden_size]
        # in encoder_outputs and encoder_hidden we have 2 due to the bidirectional nature of GRU encoder. These are hidden states of both the forward RNN and backward RNN.

        # concatenating the final hidden states of both directions, passing them through a linear layer and then applying the tanh activation function
        # encoder_hidden[-2,:,:] is the final hidden state of the forward direction and encoder_hidden[-1,:,:] is the final hidden state of the backward direction
        vec = torch.cat((encoder_hidden[-2, :, :], encoder_hidden[-1, :, :]), dim=1)
        linear_layer_vec = self.linear(vec)
        encoder_hidden = torch.tanh(linear_layer_vec)

        return (
            encoder_outputs,
            encoder_hidden,
        )  # returning the encoder output and hidden states
        # encoder_outputs.shape=[seq_length, batch_size, 2*hidden_size], encoder_hidden.shape=[batch_size, hidden_size]

#### Defining the Attention Layer

This layer returns a vector of attention scores whose elements sum to 1. Each score indicates how important the corresponding word of the input sentence is for predicting the next word of the target sentence.
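
In equation form, the computation of this layer can be written as follows. The symbols are notation introduced for this sketch rather than names used in the notebook: $s_{t-1}$ is the previous decoder hidden state, $h_i$ is the encoder output at source position $i$, $T$ is the source length, $W_a$ and $b_a$ are the weights and bias of `attention_layer`, and $v$ is the weight vector of `weight_vector` (which has no bias).

$$
E_{t,i} = \tanh\left(W_a\,[s_{t-1};\,h_i] + b_a\right), \qquad
e_{t,i} = v^{\top} E_{t,i}, \qquad
a_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}
$$

The softmax guarantees $0 \le a_{t,i} \le 1$ and $\sum_{i=1}^{T} a_{t,i} = 1$. The decoder then uses these scores to form the weighted sum of encoder outputs:

$$
c_t = \sum_{i=1}^{T} a_{t,i}\, h_i
$$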


In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attention_layer = nn.Linear((hidden_size * 2) + hidden_size, hidden_size)
        self.weight_vector = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden is a 2D tensor of dimensions [batch_size, hidden_size] and encoder_outputs has dimensions [seq length, batch_size, 2*hidden_size]

        batch_size = encoder_outputs.shape[1]
        input_len = encoder_outputs.shape[0]

        # we need to calculate the energy between encoder states and previous decoder hidden state. For this, first we repeat the decoder hidden states input_len times
        hidden = hidden.unsqueeze(1)
        hidden = hidden.repeat(
            1, input_len, 1
        )  # hidden=(batch_size, seq_length, hidden_size)
        encoder_outputs = encoder_outputs.permute(
            1, 0, 2
        )  # encoder_outputs=(batch_size, seq_length, 2*hidden_size)

        # The decoder hidden states and encoder outputs are concatenated, passed through a linear layer, and then the tanh activation function is applied
        new_vec = torch.cat(
            (hidden, encoder_outputs), dim=2
        )  # (batch_size, seq_length, 2*hidden_size+hidden_size)
        a = self.attention_layer(new_vec)  # (batch_size, seq_length, hidden_size)
        energy_values = torch.tanh(a)  # (batch_size, seq_length, hidden_size)

        # computing the attention vector which tells the attention score for each encoder hidden state
        attention_vector = self.weight_vector(
            energy_values
        )  # (batch_size, seq_length, 1)
        attention_vector = attention_vector.squeeze(2)

        # now the attention vector is passed through a softmax layer to ensure that each score is between 0 and 1 and sum of all the scores is 1.
        return F.softmax(attention_vector, dim=1)

#### Defining the Decoder architecture


In [None]:
class Decoder(nn.Module):
    def __init__(
        self, attention, embedding_size, hidden_size, output_size, dropout_val
    ):
        # here, embedding_size= dimensions of embedding as defined, hidden_size as defined and output_size=english vocabulary size
        super().__init__()
        self.attention = attention
        self.output_size = output_size
        self.dropout = nn.Dropout(dropout_val)
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.gru = nn.GRU((hidden_size * 2) + embedding_size, hidden_size)
        self.linear_decoder = nn.Linear(
            (hidden_size * 2) + hidden_size + embedding_size, output_size
        )

    def forward(self, token_vec, hidden, encoder_outputs):
        # token_vec is one dimensional i.e. shape(token_vec) = (batch_size), hidden is a 2D vector of shape [batch_size, hidden_size]
        # encoder_outputs have shape [seq len, batch_size, 2*hidden_size]
        token_vec = token_vec.unsqueeze(0)  # adding one more dimension to the token_vec

        embedded_input = self.embedding(
            token_vec
        )  # passing the input token_vec to embedding, [1, batch_size, embedding_size]
        embedded_input = self.dropout(
            embedded_input
        )  # a 3D tensor of size [1, batch_size, embedding_size]

        # previous decoder hidden state and encoder outputs are passed to the attention layer to get attention scores
        a = self.attention(hidden, encoder_outputs).unsqueeze(1)
        # a has dimension [batch_size, 1, seq length] i.e. a score is defined for each token in source sentence

        encoder_outputs = encoder_outputs.permute(
            1, 0, 2
        )  # (batch_size, seq len, 2*hidden_size)
        weighted_vectors = torch.bmm(a, encoder_outputs).permute(
            1, 0, 2
        )  # bmm gives (batch_size, 1, 2*hidden_size); permute makes it (1, batch_size, 2*hidden_size)
        input_to_gru = torch.cat(
            (embedded_input, weighted_vectors), dim=2
        )  # vector with dimensions= [1, batch_size, 2*hidden_size + embedding_size]

        decoder_output, hidden = self.gru(input_to_gru, hidden.unsqueeze(0))
        embedded_input = embedded_input.squeeze(0)
        decoder_output = decoder_output.squeeze(0)
        weighted_vectors = weighted_vectors.squeeze(0)

        # decoder output, weight vectors and embedded input are passed through a linear layer to predict the next token in target sentence
        predicted_tokens = self.linear_decoder(
            torch.cat((decoder_output, weighted_vectors, embedded_input), dim=1)
        )
        # predicted_tokens contains a score for every word of the English vocabulary and has dimensions [batch_size, output_size]
        hidden = hidden.squeeze(0)

        return predicted_tokens, hidden

#### Defining the Seq2Seq model


In [None]:
class seq2seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, input_token, target_token, teacher_force_ratio=0.5):
        # teacher_force_ratio is the probability that the next input word to the decoder is the actual/target word rather than the previously predicted word.
        # input_token is a tensor of shape [seq length, batch_size] and target_token has shape [target_length, batch_size]

        batch_size = input_token.shape[1]
        target_len = target_token.shape[0]
        english_dict_size = self.decoder.output_size

        # tensor to store decoder outputs, it is initially initialised to all zeroes
        predicted_vector = torch.zeros(target_len, batch_size, english_dict_size).to(
            self.device
        )
        encoder_outputs, hidden = self.encoder(input_token)
        # encoder_outputs stores the forward and backward hidden states for every position of the input sequence; hidden is the combined final hidden state used to initialise the decoder
        token_vec = target_token[0, :]  # the first input to the decoder is the <sos> token

        for i in range(1, target_len):
            output_token, hidden = self.decoder(
                token_vec, hidden, encoder_outputs
            )  # embedded input token, previous hidden states and encoder hidden states are passed to the decoder to obtain the prediction
            predicted_vector[
                i
            ] = output_token  # output is appended to the prediction tokens

            if (
                random.random() < teacher_force_ratio
            ):  # half of the times this will be true if teacher_force_ratio is 0.5
                token_vec = target_token[
                    i
                ]  # in this case next input to the decoder is target/actual word
            else:
                token_vec = output_token.argmax(
                    1
                )  # in this case next input to the decoder is predicted word

        return predicted_vector
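
The teacher forcing decision inside the loop above can be looked at on its own. The sketch below is a plain-Python illustration of that single step; `choose_next_input`, the string tokens, and the seeded generator are assumptions made for this example.

```python
import random


def choose_next_input(target_token, predicted_token, teacher_force_ratio, rng=random):
    """Return the ground-truth token with probability teacher_force_ratio,
    otherwise the model's own prediction."""
    if rng.random() < teacher_force_ratio:
        return target_token
    return predicted_token


rng = random.Random(0)  # seeded only to make the demonstration repeatable
picks = [choose_next_input("gold", "pred", 0.5, rng) for _ in range(1000)]
print(picks.count("gold") / len(picks))  # fraction of steps that used the target word
```

A ratio of 1.0 always feeds the target word and a ratio of 0 always feeds the prediction, which is why the evaluation function passes 0 to switch teacher forcing off.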

# 6. Training the Model


Setting optimal hyperparameters for Training


In [None]:
# Hyperparameters
batch_size = 50
learning_rate = 0.001
epochs = 25
epoch_loss = 0.0  # training loss in each epoch
layers = 1  # number of neural network layers in rnn
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_size = H_count
output_size = E_count
hidden_size = 512  # encoder and decoder have same hidden size
embedding_size = 256  # encoder and decoder embedding size
dropout = 0.5  # encoder and decoder dropout value

**Preparing data for training the Model:** First, the training data is sorted by the length of the Hindi sentences, and then index vectors for the sentences are built using the H_sentenceToTensor and E_sentenceToTensor functions. Sorting places sentences of similar length in the same batch, so less padding is needed. To give every sentence in a batch the same length, I calculated the maximum sentence length in each batch and stored it in a dictionary keyed by batch_id. The "\<pad\>" token is then appended to every sentence that is shorter than the maximum length of its batch. After that, DataLoader is used to create batches of the required batch size. Each batch then contains sentences of the same length.
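
The sort, group, and pad procedure can be sketched on plain lists of indices. The example below is a simplified, single-language illustration (the notebook pads the Hindi and English sentences of a batch to one shared length); `pad_batches` and the toy sequences are assumptions made for this example.

```python
SOS, EOS, PAD = 0, 1, 3  # special-token indices used in this notebook's vocabularies


def pad_batches(index_sequences, batch_size):
    """Wrap each sequence in <sos>/<eos> and pad it to its batch's maximum length."""
    ordered = sorted(index_sequences, key=len)  # similar lengths end up together
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = [[SOS] + seq + [EOS] for seq in ordered[start : start + batch_size]]
        max_len = max(len(seq) for seq in chunk)
        batches.append([seq + [PAD] * (max_len - len(seq)) for seq in chunk])
    return batches


sequences = [[7, 8, 9, 10], [5], [4, 6], [11, 12, 13]]
for batch in pad_batches(sequences, batch_size=2):
    print(batch)
```

Because the sequences are sorted first, each batch needs only as much padding as the gap between its shortest and longest sentence.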


In [None]:
# sorting the training data by the length (in characters) of the Hindi sentences
train_ds.sort(key=lambda x: len(x[0]))

In [None]:
# finding the maximum length of sentence in a batch
def batch_max_lengths(dataset, batch_size):
    max_lengths = {}  # maps batch_id (starting at 1) to the maximum sentence length in that batch
    for batch_id, i in enumerate(range(0, len(dataset), batch_size), start=1):
        max_len = 0
        for pair in dataset[i : i + batch_size]:
            H_length = len(hindi_tokenizer(pair[0]))  # number of tokens in Hindi sentence
            E_length = len(english_tokenizer(pair[1]))  # number of tokens in English sentence
            max_len = max(max_len, E_length, H_length)
        max_lengths[batch_id] = max_len + 2  # +2 for the <sos> and <eos> tokens
    return max_lengths


# maximum length for each batch of training data
max_length_train = batch_max_lengths(train_ds, batch_size)

# maximum length for each batch of test/validation data
max_length_test = batch_max_lengths(val_ds, batch_size)

In [None]:
# H_sentenceToTensor function takes a sentence, maximum length as argument and returns a tensor of indices with padding done, if required.
def H_sentenceToTensor(sentence, max_length):
    # append start of sequence token at beginning
    src_index = [H_word2index["<sos>"]]
    for token in hindi_tokenizer(sentence):
        # if the word is not present in dictionary then index corresponding to unknown token '<unk>' i.e. 2 is used
        src_index.append(H_word2index.get(token, 2))
    # append end of sequence token
    src_index.append(H_word2index["<eos>"])
    # check if length of sentence is less than maximum length, if yes, then append <pad> token
    if len(src_index) < max_length:
        while len(src_index) != max_length:
            src_index.append(H_word2index["<pad>"])
    return torch.Tensor(
        src_index
    )  # returning tensor of indices with length equal to max_length


# E_sentenceToTensor function takes a sentence, maximum length as argument and returns a tensor of indices with padding done, if required.
def E_sentenceToTensor(sentence, max_length):
    # append start of sequence token at beginning
    trg_index = [E_word2index["<sos>"]]
    for token in english_tokenizer(sentence):
        # if the word is not present in dictionary then index corresponding to unknown token '<unk>' i.e. 2 is used
        trg_index.append(E_word2index.get(token, 2))
    # append end of sequence token
    trg_index.append(E_word2index["<eos>"])
    # check if length of sentence is less than maximum length, if yes, then append <pad> token
    if len(trg_index) < max_length:
        while len(trg_index) != max_length:
            trg_index.append(E_word2index["<pad>"])
    return torch.Tensor(
        trg_index
    )  # returning tensor of indices with length equal to max_length

In [None]:
train_tensor = []  # stores tensor of indexes of training data
test_tensor = []  # stores tensor of indexes of validation/test data

# finding tensor of indexes of training data
batch_id = 1
for i in range(0, len(train_ds), batch_size):
    max_len = max_length_train[batch_id]
    for pair in train_ds[i : i + batch_size]:
        source_tensor = H_sentenceToTensor(pair[0], max_len)
        target_tensor = E_sentenceToTensor(pair[1], max_len)
        train_tensor.append([source_tensor, target_tensor])
    batch_id += 1

# finding tensor of indexes of validation/test data
batch_id = 1
for i in range(0, len(val_ds), batch_size):
    max_len = max_length_test[batch_id]
    for pair in val_ds[i : i + batch_size]:
        source_tensor = H_sentenceToTensor(pair[0], max_len)
        target_tensor = E_sentenceToTensor(pair[1], max_len)
        test_tensor.append([source_tensor, target_tensor])
    batch_id += 1

In [None]:
# finding train and test iterator using data loader
# shuffle=false is used so that data remains sorted in batches
train_iterator = DataLoader(train_tensor, batch_size=batch_size, shuffle=False)
test_iterator = DataLoader(test_tensor, batch_size=batch_size, shuffle=False)

In [None]:
# function to evaluate the validation loss in each epoch
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(iterator):
            input_sentence = x.long()
            target_sentence = y.long()
            # input_sentence and target_sentence have shape = (batch_size, maximum length) but we need shape to be (maximum length, batch_size ) so they are transposed
            input_sentence = torch.transpose(input_sentence, 0, 1).to(device)
            target_sentence = torch.transpose(target_sentence, 0, 1).to(device)

            output = model(
                input_sentence, target_sentence, 0
            )  # turn off teacher forcing
            output_dim = output.shape[2]
            output = output[1:].reshape(-1, output_dim)

            target_sentence = target_sentence[1:].reshape(-1)
            loss = criterion(output, target_sentence)
            epoch_loss += loss.item()
            del target_sentence, output, input_sentence
    return epoch_loss / len(iterator)

In [None]:
# defining path to store the model in different epochs
path = "final_phase.pth"

In [None]:
attention = Attention(hidden_size)
encoder = Encoder(input_size, embedding_size, hidden_size, dropout).to(device)
decoder = Decoder(attention, embedding_size, hidden_size, output_size, dropout).to(
    device
)

model = seq2seq(encoder, decoder, device).to(device)

In [None]:
pad_index = E_word2index[
    "<pad>"
]  # finding the index of token <pad> in english vocabulary
criterion = nn.CrossEntropyLoss(
    ignore_index=pad_index
)  # padding token is being ignored while loss computation because we don't want to pay price for <pad> token
optimizer = optim.AdamW(model.parameters(), lr=learning_rate)  # AdamW optimizer is used
step = 0
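
To make the effect of `ignore_index` concrete, the sketch below recomputes the loss by hand in plain Python: positions whose target is the padding index are skipped, and the mean is taken over the remaining positions only. It assumes mean reduction, which is the default of `nn.CrossEntropyLoss`; the function `masked_cross_entropy` and the toy numbers are assumptions made for this example, and returning 0.0 when every position is padding is a choice of this sketch.

```python
import math


def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded positions contribute nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))  # log of the softmax denominator
        total += log_norm - row[target]
        counted += 1
    return total / counted if counted else 0.0


logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
targets = [0, 3]  # the second position is padding (index 3)
print(masked_cross_entropy(logits, targets, ignore_index=3))
```

Only the first position counts here, so a confident prediction for the padded position neither lowers nor raises the loss.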

Initialising the weights of the model using Normal distribution with mean 0 and standard deviation 0.01.


In [None]:
def init_weights(model):
    for name, parameter in model.named_parameters():
        if "weight" in name:
            nn.init.normal_(parameter.data, mean=0, std=0.01)
        else:
            nn.init.constant_(parameter.data, 0)


model.apply(init_weights)

The model is stored after every 5 epochs, and predictions on the validation set are then generated from these stored models. The model that gave the best score with the provided evaluation script is submitted. The model was trained for 35 epochs in two parts (due to Colab runtime limitations). The best score was obtained by training the model for 25 epochs.


In [None]:
for epoch in range(1, epochs + 1):
    epoch_loss = 0
    print(f"[Epoch {epoch} / {epochs}]")
    model.train()  # training mode (enables dropout)
    for x, y in train_iterator:  # iterating over batches of train_iterator
        input_sentence = x.long()
        target_sentence = y.long()

        # input_sentence and target_sentence have shape = (batch_size, maximum length) but we need shape to be (maximum length, batch_size ) so they are transposed
        input_sentence = torch.transpose(input_sentence, 0, 1).to(device)
        target_sentence = torch.transpose(target_sentence, 0, 1).to(device)

        output = model(input_sentence, target_sentence)  # forward propagation
        output = output[1:].reshape(
            -1, output.shape[-1]
        )  # dropping the first time step (the <sos> position) and flattening the result so it fits the loss function

        target_sentence = target_sentence[1:].reshape(
            -1
        )  # removing the start token from actual target translation
        optimizer.zero_grad()
        loss = criterion(output, target_sentence)

        loss.backward()  # backward propagation
        torch.nn.utils.clip_grad_norm_(
            model.parameters(), max_norm=1
        )  # clipping the gradients to keep them in reasonable range
        optimizer.step()  # parameter update using the gradients computed by backward()
        del target_sentence, output, input_sentence
        step += 1
        epoch_loss += loss.item()  # accumulating the batch loss into the epoch total
    if epoch % 5 == 0:  # saving the model after every 5 epochs (each save overwrites the file at `path`)
        torch.save(model, path)
    val_loss = evaluate(model, test_iterator, criterion)
    print("Train loss : ", epoch_loss / len(train_iterator))
    print("Validation loss : ", val_loss)
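
The gradient clipping step in the loop above can be observed in isolation. This is a minimal sketch on a hypothetical `nn.Linear` layer with an artificially large loss; it only demonstrates that `clip_grad_norm_` rescales the gradients so that their total norm does not exceed `max_norm`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 10)  # hypothetical stand-in for the model

# A deliberately large loss so that the gradients are large as well.
loss = (layer(torch.randn(32, 10)) * 100).pow(2).mean()
loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(float(norm_before), float(norm_after))
```

Only the magnitude of the update changes; the direction of the gradient is preserved because all gradients are scaled by the same factor.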

In [None]:
torch.save(model, path)  # saving the trained model at defined location
# model.train()
# model = torch.load(path) #loading the model
model.eval()

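Function to translate a Hindi sentence (index vector) into an English sentence (index vector) using greedy decoding: at every step the most probable word is chosen and fed back to the decoder.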


In [None]:
def hin_to_eng_translation(model, device, hindi_num_vec, max_length=70):
    hindi_tensor = torch.LongTensor(hindi_num_vec).unsqueeze(1).to(device)
    with torch.no_grad():
        encoder_states, hidden = model.encoder(hindi_tensor)
    eng_num_vec = [E_word2index["<sos>"]]  # decoding starts from the <sos> token
    eos_idx = E_word2index["<eos>"]  # index of the <eos> token, used as the stopping condition
    for _ in range(max_length):
        curr_input = torch.LongTensor([eng_num_vec[-1]]).to(device)
        with torch.no_grad():
            output, hidden = model.decoder(curr_input, hidden, encoder_states)
            curr_output = output.argmax(1).item()
        eng_num_vec.append(
            curr_output
        )  # appending the prediction in english index vector
        if (
            curr_output == eos_idx
        ):  # stop generating predictions once eos token is encountered
            break
    return eng_num_vec
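
The control flow of this decoding loop can be seen without a trained model. In the sketch below, `step_fn` is a hypothetical stand-in for "run the decoder once and take the argmax"; the toy functions passed in are made up for illustration.

```python
def greedy_decode(step_fn, sos_idx, eos_idx, max_length=70):
    """Greedy decoding: feed the last predicted index back in until <eos> or max_length."""
    sequence = [sos_idx]
    for _ in range(max_length):
        next_idx = step_fn(sequence[-1])  # stand-in for decoder + argmax
        sequence.append(next_idx)
        if next_idx == eos_idx:
            break
    return sequence


# Hypothetical toy "decoder": always predicts the next integer, so 1 -> 2 -> 3 -> 4.
SOS, EOS = 1, 4
print(greedy_decode(lambda idx: idx + 1, SOS, EOS))  # [1, 2, 3, 4]

# If <eos> is never produced, the loop stops after max_length steps
# and the returned sequence does not end in <eos>.
print(greedy_decode(lambda idx: 7, SOS, EOS, max_length=3))  # [1, 7, 7, 7]
```

The second call shows why code that post-processes the result should not assume the last element is always `<eos>`.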

# 7. Testing the Model

Obtaining the reference and prediction files needed to compute the BLEU and METEOR scores with the evaluation script.


Obtaining the predicted sentences for the validation set.


In [None]:
file1 = open(
    "validation_prediction.txt", "w"
)  # to store the prediction of validation set
csv_file = open("validation_ds.csv", encoding="utf-8")
rows = csv.reader(csv_file)
for row in rows:
    hindi_sentence = row[0]
    hindi_sentence_token = []

    # tokenize hindi sentence
    if isinstance(hindi_sentence, str):
        for t in hindi_tokenizer(hindi_sentence):
            hindi_sentence_token.append(t)
    else:
        for t in hindi_sentence:
            hindi_sentence_token.append(t)

    hindi_sentence_token.insert(0, "<sos>")  # prepend <sos> token
    hindi_sentence_token.append("<eos>")  # append <eos> token
    hindi_num_vec = []

    # generating index vector for hindi sentences
    for t in hindi_sentence_token:
        hindi_num_vec.append(H_word2index.get(t, 2))  # index 2 is for <unk> token

    # call hin_to_eng_translation function to generate predictions
    eng_num_vec = hin_to_eng_translation(model, device, hindi_num_vec, max_length=70)
    # eng_num_vec is vector of indices of predicted english sentences. Now, we need to find the words corresponding to these indices

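    english_sentence_list = []
    for word_idx in eng_num_vec:
        english_sentence_list.append(
            E_index2word.get(word_idx, "<unk>")
        )  # fall back to the <unk> token string for an unknown index

    english_sentence_list.pop(0)  # remove <sos> token
    if english_sentence_list and english_sentence_list[-1] == "<eos>":
        english_sentence_list.pop()  # remove <eos> token (absent if max_length was reached first)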

    # joining the predicted words into a single space-separated string
    english_sentence = " ".join(english_sentence_list)
    file1.write(english_sentence + "\n")

file1.close()
csv_file.close()
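
The index-to-word conversion in this cell can be collected into a small helper. This is a sketch with a made-up toy vocabulary; the helper name `indices_to_sentence` is hypothetical. It removes `<eos>` only when it is actually the last token, which matters when decoding stops at `max_length` before `<eos>` is produced.

```python
def indices_to_sentence(indices, index2word, unk_token="<unk>"):
    """Map predicted indices to words and strip the <sos>/<eos> markers."""
    words = [index2word.get(idx, unk_token) for idx in indices]
    if words and words[0] == "<sos>":
        words = words[1:]
    if words and words[-1] == "<eos>":
        words = words[:-1]
    return " ".join(words)


# Hypothetical toy vocabulary, for illustration only.
toy_index2word = {0: "<pad>", 1: "<sos>", 2: "<unk>", 3: "<eos>", 4: "hello", 5: "world"}

print(indices_to_sentence([1, 4, 5, 3], toy_index2word))  # hello world
print(indices_to_sentence([1, 4, 5], toy_index2word))  # hello world
print(indices_to_sentence([1, 4, 99, 3], toy_index2word))  # hello <unk>
```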

Saving the reference sentences for the validation set.


In [None]:
# `with` closes both files even if an error occurs while writing
with open("validation_ds.csv", encoding="utf-8") as csv_file, open(
    "validation_english.txt", "w", encoding="utf-8"
) as file:
    for row in csv.reader(csv_file):
        file.write(row[1] + "\n")
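
Before scoring, it can be useful to confirm that the prediction and reference files contain the same number of lines. The sketch below assumes the evaluation script compares the two files line by line, which is not confirmed here; the file names and contents are made up for illustration.

```python
import os
import tempfile


def count_lines(file_path):
    with open(file_path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Hypothetical files standing in for validation_prediction.txt and validation_english.txt.
tmp_dir = tempfile.mkdtemp()
pred_path = os.path.join(tmp_dir, "prediction.txt")
ref_path = os.path.join(tmp_dir, "reference.txt")
with open(pred_path, "w", encoding="utf-8") as handle:
    handle.write("hello world\nhow are you\n")
with open(ref_path, "w", encoding="utf-8") as handle:
    handle.write("hello world\nhow are you ?\n")

same_length = count_lines(pred_path) == count_lines(ref_path)
print(same_length)  # True
```

A mismatch would typically point to a reference sentence that itself contains a line break, or to a header row present in only one of the files.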

# 8. Generating Predictions


### Obtaining translation for testhindistatements.csv

To obtain predictions for the test statements, the Hindi statements are first cleaned in the same way as the training set.


In [None]:
prediction = []  # contains statements for which prediction is to be generated

# reading the testhindistatements.csv file
with open("testhindistatements.csv", encoding="utf-8") as csv_file:
    rows = csv.reader(csv_file)
    for row in rows:
        prediction.append(row[2])  # the statement is read from the third column

prediction = prediction[1:]  # removing the column name
prediction[0:10]
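
The same reading step can be tried on an in-memory file. This is a sketch in which the column names and rows are invented for illustration; only the position of the Hindi statement (third column) follows the cell above. It skips the header with `next()` rather than slicing it off afterwards.

```python
import csv
import io

# Hypothetical file contents: a header row followed by data rows,
# with the Hindi statement in the third column (index 2).
toy_csv = io.StringIO(
    "id,source,hindi\n"
    "0,a,पहला वाक्य\n"
    "1,b,दूसरा वाक्य\n"
)

reader = csv.reader(toy_csv)
next(reader)  # skip the header row
statements = [row[2] for row in reader]

print(statements)  # ['पहला वाक्य', 'दूसरा वाक्य']
```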

In [None]:
# pre-processing of hindi statements is carried out
for i in range(len(prediction)):
    # processing_dict is defined in Section 3: Pre-processing data
    for src, trg in processing_dict.items():
        if src in hindi_tokenizer(prediction[i]):
            prediction[i] = prediction[i].replace(src, trg)
    prediction[i] = prediction[i].replace("...", " ")
    prediction[i] = prediction[i].replace(".", "|")
prediction[0:10]  # cleaned hindi sentences
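
The cleaning rule applied in this cell can be written as a standalone function. This is a sketch only: `toy_replacements` is a made-up table (the real `processing_dict` is defined in Section 3), and whitespace splitting stands in for the Hindi tokenizer.

```python
def clean_sentence(sentence, replacements, tokenize=str.split):
    """Apply token-gated replacements, then normalise ellipses and full stops."""
    for src, trg in replacements.items():
        if src in tokenize(sentence):  # replace only when src occurs as a whole token
            sentence = sentence.replace(src, trg)
    sentence = sentence.replace("...", " ")
    return sentence.replace(".", "|")


# Hypothetical replacement table, for illustration only.
toy_replacements = {"&": "and"}

print(clean_sentence("a & b...c.", toy_replacements))  # a and b c|
```

Note that the token check only decides whether a replacement happens; `str.replace` then rewrites every occurrence of `src` in the sentence, including occurrences inside longer words.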

In [None]:
file1 = open("answer.txt", "w")  # file to store the predictions

for hindi_sentence in prediction:
    hindi_sentence_token = []
    # tokenize hindi sentence
    if isinstance(hindi_sentence, str):
        for t in indic_tokenize.trivial_tokenize(hindi_sentence):
            hindi_sentence_token.append(t)
    else:
        for t in hindi_sentence:
            hindi_sentence_token.append(t)

    hindi_sentence_token.insert(0, "<sos>")  # prepend <sos> token
    hindi_sentence_token.append("<eos>")  # append <eos> token
    hindi_num_vec = []

    # generating index vector for hindi sentences
    for t in hindi_sentence_token:
        hindi_num_vec.append(H_word2index.get(t, 2))  # index 2 is for <unk> token

    # call hin_to_eng_translation function to generate predictions
    eng_num_vec = hin_to_eng_translation(model, device, hindi_num_vec, max_length=70)

    # eng_num_vec is vector of indices of predicted english sentences. Now, we need to find the words corresponding to these indices

    english_sentence_list = []
    for word_idx in eng_num_vec:
        english_sentence_list.append(
            E_index2word.get(word_idx, "<unk>")
        )  # fall back to the <unk> token string for an unknown index
    # english_sentence_list contains the predicted english words. Now, we need to obtain the sentence as a string from this list of words

    english_sentence_list.pop(0)  # remove <sos> token
    if english_sentence_list and english_sentence_list[-1] == "<eos>":
        english_sentence_list.pop()  # remove <eos> token (absent if max_length was reached first)

    # joining the predicted words into a single space-separated string
    english_sentence = " ".join(english_sentence_list)
    file1.write(english_sentence + "\n")

file1.close()
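
The token-to-index step used before translation can likewise be expressed as a helper. This is a sketch with a made-up toy vocabulary; the helper name `encode_tokens` is hypothetical, and index 2 for `<unk>` follows the comment in the cell above.

```python
def encode_tokens(tokens, word2index, unk_index=2):
    """Wrap tokens in <sos>/<eos> and map them to indices, using unk_index for unknown words."""
    wrapped = ["<sos>"] + list(tokens) + ["<eos>"]
    return [word2index.get(token, unk_index) for token in wrapped]


# Hypothetical toy vocabulary, for illustration only.
toy_word2index = {"<pad>": 0, "<sos>": 1, "<unk>": 2, "<eos>": 3, "नमस्ते": 4}

print(encode_tokens(["नमस्ते", "दुनिया"], toy_word2index))  # [1, 4, 2, 3]
```

The second word is not in the toy vocabulary, so it is mapped to the `<unk>` index.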

# 9. References

[1] [https://arxiv.org/abs/1409.0473](https://arxiv.org/abs/1409.0473)

[2] [https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)
