# Data loading and cleaning

**Note:** To run the notebook, upload the provided "train.csv" and "hindistatements_week1.csv" files for phase 1 to the "Colab Notebooks" folder in Google Drive.

In [1]:
import string
import numpy as np
import pandas as pd
import random

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

We read the data directly from Drive. For safety, we keep a copy of the original data and then drop the first column from the working dataframe.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
data=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/train.csv")
data_copy=data.copy()
data.drop(data.columns[0],axis=1,inplace=True)

**Creating columns for sentence length for both Hindi and English**

Sentence length can play an important role in cleaning the data, so we create two new columns holding the word counts of the Hindi and English sentences.

In [4]:
data['len_hindi']=data['hindi'].apply(lambda x:len(x.split(' ')))
data['len_english']=data['english'].apply(lambda x: len(x.split(' ')))
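The counts above depend on how `split(' ')` treats repeated or trailing spaces: every extra space produces an empty-string token. The sketch below contrasts that with `split()` on a made-up sentence; the helper names are illustrative and not part of the notebook.

```python
def count_single_space(sentence):
    # Mirrors the notebook: split on exactly one space character.
    return len(sentence.split(' '))

def count_whitespace(sentence):
    # Alternative: split on any run of whitespace, ignoring leading/trailing space.
    return len(sentence.split())

sample = "the  thought reaching "   # double space and a trailing space
print(count_single_space(sample))   # empty strings are counted as tokens
print(count_whitespace(sample))     # only the real words are counted
```

Either convention can work for a length filter, provided the same one is used wherever lengths are compared.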

**Checking for null values**

Checking for null values is generally the first step in data analysis. Here we check whether any null values are present in the given dataset.

In [5]:
data.isnull().sum()

hindi          0
english        0
len_hindi      0
len_english    0
dtype: int64

There are no null values in the data.

**Dropping duplicates from the data**

It may be the case that there are some duplicate entries present in the dataset. We can remove such entries.

In [6]:
print(f"Shape of the dataset before removing the duplicates: {data.shape}")
data.drop_duplicates(inplace=True)
print(f"Shape of the dataset after removing the duplicates: {data.shape}")

Shape of the dataset before removing the duplicates: (102322, 4)
Shape of the dataset after removing the duplicates: (102296, 4)


The shapes show that 26 duplicate rows were present and have been removed.

**Converting all characters to lowercase**

We can perform lowercase normalization on the whole data.

In [7]:
data['hindi']=data['hindi'].apply(lambda x: x.lower())
data['english']=data['english'].apply(lambda x: x.lower())

**Removing punctuation**

Both the Hindi and English sentences contain various punctuation marks. We remove them as part of data cleaning.



In [8]:
def remove_punctuations(sentence):
    punctuations=list(string.punctuation)
    cleaned=""
    for letter in sentence:
        if letter not in punctuations:
            cleaned+=letter
    return cleaned  

In [9]:
data['hindi']=data['hindi'].apply(lambda x: remove_punctuations(x))
data['english']=data['english'].apply(lambda x: remove_punctuations(x))
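Note that `string.punctuation` contains ASCII punctuation only, so the Devanagari danda (`।`) is left in place by `remove_punctuations`. The sketch below shows one possible variant based on `str.translate`. Removing the danda and double danda is an assumption made here for illustration, not something the notebook does, and `strip_punctuation` is a hypothetical name.

```python
import string

# Assumption: the danda and double danda should be removed along with ASCII punctuation.
DEVANAGARI_MARKS = "\u0964\u0965"  # the danda and the double danda
_PUNCT_TABLE = str.maketrans("", "", string.punctuation + DEVANAGARI_MARKS)

def strip_punctuation(sentence):
    # translate() deletes every character listed in the table in a single pass.
    return sentence.translate(_PUNCT_TABLE)

print(strip_punctuation("hello, world!"))
print(strip_punctuation("यह एक फ़िल्म है।"))
```

A translation table also avoids building the result one character at a time, which may matter on a dataset of this size.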

**Removing mixed sentences (those samples which have english words in the hindi sentences)**

On inspecting the data, we find some samples in which English words appear inside the Hindi sentences. We treat these sentences as outliers and remove them from the dataset.

In [10]:
def is_mixed(sentence):
    letters="abcdefghijklmnopqrstuvwxyz"
    for ch in letters:
        if ch in sentence:
            return True
    return False
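`is_mixed` flags a sentence as soon as it contains a single Latin letter, and it relies on the text having been lowercased beforehand. The same check can be sketched with a regular expression; `has_latin_letter` is an illustrative name, not part of the notebook.

```python
import re

_LATIN = re.compile(r"[a-z]")

def has_latin_letter(sentence):
    # True if any lowercase ASCII letter occurs anywhere in the sentence.
    return _LATIN.search(sentence) is not None

print(has_latin_letter("हटाओ रिक"))
print(has_latin_letter("the thought reaching the eyes"))
```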

In [11]:
data['is_mixed']=data['hindi'].apply(lambda x : is_mixed(x))
data.head()

| | hindi | english | len_hindi | len_english | is_mixed |
|---|---|---|---|---|---|
| 0 | एल सालवाडोर मे जिन दोनो पक्षों ने सिविलयुद्ध स... | in el salvador both sides that withdrew from t... | 22 | 23 | False |
| 1 | मैं उनके साथ कोई लेना देना नहीं है | i have nothing to do with them | 8 | 7 | False |
| 2 | हटाओ रिक | fuck them rick | 2 | 3 | False |
| 3 | क्योंकि यह एक खुशियों भरी फ़िल्म है | because its a happy film | 7 | 5 | False |
| 4 | the thought reaching the eyes | the thought reaching the eyes | 5 | 5 | True |


In [12]:
data['is_mixed'].value_counts(normalize=True)*100

False    94.430867
True      5.569133
Name: is_mixed, dtype: float64

In [13]:
data=data[data['is_mixed']==False]

**Changing the Encoding of the data**

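The text may contain characters that cannot be encoded cleanly. We re-encode both columns and silently drop any character that fails to encode: the Hindi column is encoded as UTF-8 and the English column as ASCII.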

In [14]:
data['hindi']=data['hindi'].str.encode('utf-8',errors='ignore').str.decode('utf-8')
data['english']=data['english'].str.encode('ascii',errors='ignore').str.decode('utf-8')
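To make the effect of `errors='ignore'` concrete, here is a small plain-Python sketch of the same round trip on single strings. The helper names and sample strings are made up for illustration.

```python
def to_ascii(text):
    # Characters outside the ASCII range are dropped, not replaced.
    return text.encode("ascii", errors="ignore").decode("utf-8")

def to_utf8(text):
    # Valid text survives a UTF-8 round trip unchanged.
    return text.encode("utf-8", errors="ignore").decode("utf-8")

print(to_ascii("café naïve"))
print(to_utf8("हटाओ रिक") == "हटाओ रिक")
```

One consequence is that an English sentence made up entirely of non-ASCII characters becomes an empty string.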

**Dropping any row having NULL values**

In [15]:
# Collect the index labels of every row that contains at least one null value.
null_indices=data.index[data.isnull().any(axis=1)].tolist()

In [16]:
data.drop(null_indices,inplace=True)

**Saving the processed Dataframe**

We will save this processed dataset into drive, so we don't have to repeat these steps again and again.

In [17]:
data.to_csv("/content/drive/MyDrive/Colab Notebooks/processed.csv")

# Loading processed Dataset and Vocabulary building

We will load the processed dataframe directly and build the vocabulary for source and target language.

### References:
1. https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
2. https://www.youtube.com/watch?v=B8g-PNT2W2Q
3. https://www.youtube.com/watch?v=EoGUlvhRYpk

In [18]:
data=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/processed.csv')

In [19]:
# Any sentence that became an empty string during cleaning is read back
# from the CSV as NaN, so drop every row that contains a null value.
data.dropna(inplace=True)

In [20]:
SOS_token=0
EOS_token=1
PAD_token=2
MAX_LENGTH=25

class Vocab_class:
    def __init__(self):
        self.word_to_index={"<SOS>":0,"<EOS>":1,"<PAD>":2,"<UKN>":3}
        self.word_counts={}
        self.index_to_word={0:"<SOS>", 1:"<EOS>", 2:"<PAD>", 3:"<UKN>"}
        self.num_of_words=4
        
    def sentence_add(self, sentence):
        words=sentence.split(" ")
        for word in words:
            if word not in self.word_to_index:
                self.word_to_index[word]=self.num_of_words
                self.word_counts[word]=1
                
                self.index_to_word[self.num_of_words]=word
                self.num_of_words+=1
            else:
                self.word_counts[word]+=1
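As a quick illustration of the index assignment performed by `sentence_add`, the following standalone sketch builds the same kind of mapping with a plain function. `build_vocab` and the sample sentences are illustrative; the reserved indices 0 to 3 follow `Vocab_class`.

```python
def build_vocab(sentences):
    # Reserved tokens occupy indices 0-3, as in Vocab_class.
    word_to_index = {"<SOS>": 0, "<EOS>": 1, "<PAD>": 2, "<UKN>": 3}
    word_counts = {}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word_to_index:
                # New words get the next free index.
                word_to_index[word] = len(word_to_index)
            word_counts[word] = word_counts.get(word, 0) + 1
    return word_to_index, word_counts

w2i, counts = build_vocab(["i have nothing", "i have them"])
print(w2i)
print(counts)
```

New words receive consecutive indices in order of first appearance, so the vocabulary size equals the number of distinct words plus the four reserved tokens.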

In [21]:
# We scan each sentence from the dataset and add the words in the corresponding vocabularies.
hindi_lang=Vocab_class()
eng_lang=Vocab_class()
pairs=[]
# For the sentences which are shorter than the max_length, we use "<PAD>" tokens for them.
for index,row in data.iterrows():
    if row['len_hindi']<MAX_LENGTH and row['len_english']<MAX_LENGTH:
        pair=[row['hindi'].strip(), row['english'].strip()]
        hin_extra=MAX_LENGTH-len(row['hindi'].strip().split(" "))
        eng_extra=MAX_LENGTH-len(row['english'].strip().split(" "))
        hindi_lang.sentence_add(pair[0])
        eng_lang.sentence_add(pair[1])
        pair[0]=pair[0].split(" ")
        pair[0].insert(0,"<SOS>")
        pair[0].append("<EOS>")
        pair[0]=pair[0]+["<PAD>"]*(hin_extra)

        pair[1]=pair[1].split(" ")
        pair[1].insert(0,"<SOS>")
        pair[1].append("<EOS>")
        pair[1]=pair[1]+["<PAD>"]*(eng_extra)

        pair[0]=" ".join(pair[0])
        pair[1]=" ".join(pair[1])
        pairs.append(pair)
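Because each sentence receives `<SOS>`, `<EOS>`, and enough `<PAD>` tokens to fill the gap up to `MAX_LENGTH`, every padded sequence ends up with exactly `MAX_LENGTH + 2` tokens. The sketch below isolates that step; `pad_sentence` is an illustrative helper, and a small `max_length` is used only to keep the output short.

```python
def pad_sentence(sentence, max_length=25):
    tokens = sentence.strip().split(" ")
    extra = max_length - len(tokens)
    if extra < 0:
        raise ValueError("sentence is longer than max_length")
    # <SOS> + words + <EOS> + padding always totals max_length + 2 tokens.
    return ["<SOS>"] + tokens + ["<EOS>"] + ["<PAD>"] * extra

padded = pad_sentence("i have nothing to do with them", max_length=10)
print(len(padded))
print(padded)
```

The fixed length is what allows the per-sentence tensors to be concatenated into a batch later on.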

In [22]:
print(f"Hindi vocabulary size : {hindi_lang.num_of_words}")
print(f"English vocabulary size: {eng_lang.num_of_words}")

Hindi vocabulary size : 36160
English vocabulary size: 25527


In [23]:
def pair_to_tensor(pair):
    '''
    A function to convert a given pair to tensors corresponding to index in vocabulary
    '''
    hindi_sentence=pair[0]
    eng_sentence=pair[1]
    indexes_hindi=[hindi_lang.word_to_index[word] for word in hindi_sentence.split(' ')]
    indexes_eng=[eng_lang.word_to_index[word] for word in eng_sentence.split(' ')]
    hindi_tensor=torch.tensor(indexes_hindi, dtype=torch.long, device=device).view(-1,1)
    eng_tensor=torch.tensor(indexes_eng, dtype=torch.long, device=device).view(-1,1)
    return (hindi_tensor, eng_tensor)

In [24]:
# Now we can convert all the sentences into tensors for further processing.
hin_tensors=[]
eng_tensors=[]
for pair in pairs:
    hin,eng=pair_to_tensor(pair)
    hin_tensors.append(hin)
    eng_tensors.append(eng)

# Seq2Seq Model Implementation

#### References:
1. https://machinelearningmastery.com/encoder-decoder-recurrent-neural-network-models-neural-machine-translation/
2. https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346


### Encoder RNN Implementation

In [25]:
class EncoderLSTM(nn.Module):
    def __init__(self,size_input,size_embbeding,size_hidden,layers,p):
        super(EncoderLSTM,self).__init__()
        self.size_input=size_input
        self.size_embbeding=size_embbeding
        self.size_hidden=size_hidden
        self.layers=layers
        self.dropout=nn.Dropout(p)
        self.tag=True

        self.embbed_layer=nn.Embedding(self.size_input,self.size_embbeding)
        self.lstm=nn.LSTM(self.size_embbeding,self.size_hidden,self.layers,dropout=p)

    def forward(self, x):
        # print(x.shape)
        embbeding=self.dropout(self.embbed_layer(x))
        # print(embbeding.shape)
        output, (hidden_st,cell_st) = self.lstm(embbeding)
        return hidden_st, cell_st

### Decoder RNN Implementation


In [26]:
class DecoderLSTM(nn.Module):
    def __init__(self,size_input,size_embbeding,size_hidden,layers,p,size_output):
        super(DecoderLSTM,self).__init__()
        self.size_input=size_input
        self.size_embbeding=size_embbeding
        self.size_hidden=size_hidden
        self.layers=layers
        self.size_output=size_output
        self.dropout=nn.Dropout(p)
        self.tag=True

        self.embbed_layer=nn.Embedding(self.size_input,self.size_embbeding)
        self.lstm=nn.LSTM(self.size_embbeding,self.size_hidden,self.layers,dropout=p)
        self.fc=nn.Linear(self.size_hidden,self.size_output)

    def forward(self,x,hidden_st,cell_st):
        x=x.unsqueeze(0)
        embbeding=self.dropout(self.embbed_layer(x))
        outputs, (hidden_st, cell_st) = self.lstm(embbeding, (hidden_st,cell_st))
        preds=self.fc(outputs)
        preds=preds.squeeze(0)
        return preds,hidden_st,cell_st

### Encoder-Decoder Interface Implementation

In [27]:
class Seq2seq_model(nn.Module):
    def __init__(self,encoder_net,decoder_net):
        super(Seq2seq_model,self).__init__()
        self.encoder_net=encoder_net
        self.decoder_net=decoder_net

    def forward(self,src,target,teacher_forcing=0.5):
        batch_length=src.shape[1]
        target_len=target.shape[0]
        target_vocab_len=eng_lang.num_of_words

        outputs=torch.zeros(target_len,batch_length,target_vocab_len).to(device)
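        # The encoder's final states initialise the decoder.
        hidden_st,cell_st=self.encoder_net(src)
        x=target[0]

        for i in range(1,target_len):
            # Feed the updated states back in at every step; reusing the encoder
            # states each time would discard what the decoder has already emitted.
            output,hidden_st,cell_st=self.decoder_net(x,hidden_st,cell_st)
            outputs[i]=output
            pred=output.argmax(1)
            # Teacher forcing: use the ground-truth token with the given probability.
            x=target[i] if random.random()<teacher_forcing else pred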

        return outputs
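The `teacher_forcing` argument controls which token is fed to the decoder at each step: the ground-truth token with that probability, otherwise the model's own prediction. The sketch below reproduces only this selection logic in plain Python. `toy_step` is a hypothetical stand-in for the decoder and always predicts token 9.

```python
import random

def decoder_inputs(target, step, teacher_forcing=0.5, rng=random):
    """Return the tokens fed to the decoder at each time step."""
    x = target[0]          # always start from the first target token (<SOS>)
    state = None
    fed = []
    for i in range(1, len(target)):
        fed.append(x)
        pred, state = step(x, state)
        # Teacher forcing: use the ground-truth token with the given probability.
        x = target[i] if rng.random() < teacher_forcing else pred
    return fed

def toy_step(token, state):
    # Hypothetical stand-in for the decoder: always predicts token 9.
    return 9, state

target = [0, 5, 6, 7, 1]
print(decoder_inputs(target, toy_step, teacher_forcing=1.0))  # ground truth only
print(decoder_inputs(target, toy_step, teacher_forcing=0.0))  # predictions only
```

With a ratio of 1.0 the decoder only ever sees ground-truth tokens, and with 0.0 it sees its own predictions after the first step; the default of 0.5 mixes the two.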

**Creating objects of Encoder, Decoder and Seq2Seq model**

In [28]:
encoder_ip_size=hindi_lang.num_of_words
encoder_embbeding_size=300
encoder_hidden_size=1024
encoder_layers=2
encoder_dropout=float(0.5)

encoder_net=EncoderLSTM(encoder_ip_size, encoder_embbeding_size, encoder_hidden_size, encoder_layers,
                        encoder_dropout).to(device)

print(encoder_net)

EncoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embbed_layer): Embedding(36160, 300)
  (lstm): LSTM(300, 1024, num_layers=2, dropout=0.5)
)


In [29]:
decoder_ip_size=eng_lang.num_of_words
decoder_embbed_size=300
decoder_hidden_size=1024
decoder_layers=2
decoder_dropout=float(0.5)
decoder_op_size=eng_lang.num_of_words

decoder_net=DecoderLSTM(decoder_ip_size,decoder_embbed_size,decoder_hidden_size,
                        decoder_layers, decoder_dropout, decoder_op_size).to(device)

print(decoder_net)

DecoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embbed_layer): Embedding(25527, 300)
  (lstm): LSTM(300, 1024, num_layers=2, dropout=0.5)
  (fc): Linear(in_features=1024, out_features=25527, bias=True)
)


In [30]:
model=Seq2seq_model(encoder_net, decoder_net)
print(model)

Seq2seq_model(
  (encoder_net): EncoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embbed_layer): Embedding(36160, 300)
    (lstm): LSTM(300, 1024, num_layers=2, dropout=0.5)
  )
  (decoder_net): DecoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embbed_layer): Embedding(25527, 300)
    (lstm): LSTM(300, 1024, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=1024, out_features=25527, bias=True)
  )
)


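## Setting up model training

**Note:** Set `train_model=True` to train the model, or `train_model=False` to load the saved model from `PATH` instead. Set `model_available=True` to resume training from the saved model.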

In [31]:
model_available=False

In [None]:
batch_size=64
optimizer=optim.Adagrad(model.parameters(),lr=0.01)
PATH="/content/drive/MyDrive/Colab Notebooks/phase1_v2.pth"

epochs=50
epoch_loss=0.0
padding_idx=eng_lang.word_to_index["<PAD>"]
criterion=nn.CrossEntropyLoss(ignore_index=padding_idx)

train_model=True

if train_model==False:
    model=torch.load(PATH)
else:
    if model_available==True:
        model=torch.load(PATH)
    batches=len(pairs)//batch_size
    for epoch in range(epochs):
        print(f"epoch {epoch+1}/{epochs}")
        model.train()  # enable dropout for training
        cur_batch=0
        for idx in range(0,len(pairs),batch_size):
            cur_batch+=1
            if(cur_batch%100==0):
                print(f"    running batch {cur_batch} of {batches}")
            if idx+batch_size < len(pairs):
                src_batch=hin_tensors[idx:idx+batch_size]
                target_batch=eng_tensors[idx:idx+batch_size]
            else:
                src_batch=hin_tensors[idx:]
                target_batch=eng_tensors[idx:]

            src_batch=torch.cat(src_batch,dim=1)     #max_len*batch_size
            target_batch=torch.cat(target_batch,dim=1)

            output=model(src_batch,target_batch)
            output=output[1:].reshape(-1,output.shape[2])
            target=target_batch[1:].reshape(-1)

            optimizer.zero_grad()
            loss=criterion(output,target)

            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

            optimizer.step()
            epoch_loss += loss.item()
        
        # Note: this reports the loss of the final batch only, not the epoch average.
        print(f"Epoch loss : {loss.item()}")
        
        torch.save(model,PATH)
        model_available=True

epoch 1/50
    running batch 100 of 1360
    running batch 200 of 1360
    running batch 300 of 1360
    running batch 400 of 1360
    running batch 500 of 1360
    running batch 600 of 1360
    running batch 700 of 1360
    running batch 800 of 1360
    running batch 900 of 1360
    running batch 1000 of 1360
    running batch 1100 of 1360
    running batch 1200 of 1360
    running batch 1300 of 1360
Epoch loss : 5.565928936004639
epoch 2/50
    running batch 100 of 1360
    running batch 200 of 1360
    running batch 300 of 1360
    running batch 400 of 1360
    running batch 500 of 1360
    running batch 600 of 1360
    running batch 700 of 1360
    running batch 800 of 1360
    running batch 900 of 1360
    running batch 1000 of 1360
    running batch 1100 of 1360
    running batch 1200 of 1360
    running batch 1300 of 1360
Epoch loss : 5.020123481750488
epoch 3/50
    running batch 100 of 1360
    running batch 200 of 1360
    running batch 300 of 1360
    running batch 400 of 13
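The criterion used above is created with `ignore_index=padding_idx`, so positions whose target is `<PAD>` do not contribute to the loss, and the mean is taken over the remaining positions. The plain-Python sketch below illustrates that idea on toy numbers; it is not PyTorch's implementation, and `masked_cross_entropy` is an illustrative name.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    total, kept = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing
        peak = max(row)
        # Numerically stable log-sum-exp of the logits.
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        kept += 1
    return total / kept if kept else float("nan")

PAD = 2
logits = [[0.0, 0.0, 0.0, 0.0]] * 3   # uniform scores over 4 classes
targets = [1, PAD, 3]                 # the middle position is padding
print(masked_cross_entropy(logits, targets, ignore_index=PAD))
```

With uniform scores over four classes the result is log(4), regardless of how many padded positions are present.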

In [None]:
def clean_sentence(sentence):
    punctuations=list(string.punctuation)
    cleaned=""
    for letter in sentence:
        if letter=='<' or letter=='>' or letter not in punctuations:
            cleaned+=letter
    return cleaned  

def predict_translation(model,sentence,device,max_length=MAX_LENGTH):
    # Switch to evaluation mode so dropout is disabled while translating.
    model.eval()
    sentence=clean_sentence(sentence)
    tokens=sentence.split(" ")
    indexes=[]
    for token in tokens:
        if token in hindi_lang.word_to_index:
            indexes.append(hindi_lang.word_to_index[token])
        else:
            indexes.append(hindi_lang.word_to_index["<UKN>"])
    tensor_of_sentence=torch.LongTensor(indexes).unsqueeze(1).to(device)
    with torch.no_grad():
        hidden,cell=model.encoder_net(tensor_of_sentence)
    outputs=[SOS_token]
    for _ in range(max_length):
        prev_word=torch.LongTensor([outputs[-1]]).to(device)
        with torch.no_grad():
            output,hidden,cell=model.decoder_net(prev_word, hidden,cell)
            pred=output.argmax(1).item()

        outputs.append(pred)

        if eng_lang.index_to_word[pred] =="<EOS>":
            break
    
    final=[]

    for i in outputs:
        # outputs holds integer indices, so compare against the index of <PAD>.
        if i == PAD_token:
            break
        final.append(i)

    final = [eng_lang.index_to_word[idx] for idx in final]
    translated=" ".join(final)
    return translated
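`predict_translation` uses greedy decoding: at each step it keeps the highest-scoring token and stops at `<EOS>` or after `max_length` steps. The loop can be sketched independently of the model as follows; `scripted_step` is a hypothetical decoder that emits tokens 4, 5, and then `<EOS>` (index 1).

```python
def greedy_decode(step, state, sos=0, eos=1, max_length=25):
    outputs = [sos]
    for _ in range(max_length):
        scores, state = step(outputs[-1], state)
        # Greedy choice: index of the highest score.
        pred = max(range(len(scores)), key=lambda k: scores[k])
        outputs.append(pred)
        if pred == eos:
            break
    return outputs

def scripted_step(prev_token, state):
    # Hypothetical decoder: emits tokens 4, 5, then <EOS> (index 1).
    script = [4, 5, 1]
    scores = [0.0] * 6
    scores[script[min(state, len(script) - 1)]] = 1.0
    return scores, state + 1

print(greedy_decode(scripted_step, state=0))
print(greedy_decode(scripted_step, state=0, max_length=2))  # cut off before <EOS>
```

The second call shows that the output does not end with `<EOS>` when the length limit is reached first.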

In [None]:
test_sentences=[pair[0] for pair in pairs[50:100]]
actual_sentences=[pair[1] for pair in pairs[50:100]]
pred_sentences=[]

for idx,i in enumerate(test_sentences):
    translated=predict_translation(model,i,device)
    print("*"*20)
    print(f"Hindi: {i}")
    print(f"Actual: {actual_sentences[idx]}")
    print(f"Predicted: {translated}")
    print("*"*20)

## Generating Validation Set results

In [None]:
val_data=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/testhindistatements.csv")
val_data.head()
sentences=val_data['hindi']

In [None]:
sentences=sentences.apply(lambda x : x.strip())
sentences

In [None]:
fp=open("/content/drive/MyDrive/Colab Notebooks/answer_week1_test.txt","w")

In [None]:
count=1
for sentence in sentences:
    translated=predict_translation(model,sentence,device)
    translated=translated.split(" ")[1:-1]
    translated=" ".join(translated)
    fp.write(translated+'\n')
    print(f"sentence : {count}")
    count+=1
fp.close()
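The slice `[1:-1]` above assumes every prediction starts with `<SOS>` and ends with `<EOS>`. If decoding stops at the length limit instead, the last token is a real word and gets dropped. One possible alternative, sketched here under that assumption, is to filter the special tokens by name; `strip_special` is an illustrative helper, not part of the notebook.

```python
def strip_special(tokens, special=("<SOS>", "<EOS>", "<PAD>")):
    # Keep every token that is not one of the reserved markers.
    return [t for t in tokens if t not in special]

with_eos = "<SOS> i have nothing <EOS>".split(" ")
without_eos = "<SOS> i have nothing".split(" ")
print(" ".join(strip_special(with_eos)))
print(" ".join(strip_special(without_eos)))
```

Both inputs yield the same sentence, whereas slicing would drop the final word from the second one.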