# <p align = "center"> **English To Hindi Translation - Transformers**</p>
<div align = "center">
    <img src = "https://peak-translations.co.uk/wp-content/uploads/2018/06/Creative-Hindi-alphabet-texture-background-2_1180x400_acf_cropped.jpg">
         </img>
</div>
    

### Downloading Required Libraries and Pretrained Embedding Models

In [1]:
!pip install fasttext
!pip install inltk
!pip install gunzip

## Outputs are cleared to keep the notebook clean

In [2]:
# Download the pretrained Fasttext Embeddings For Hindi Vocabulary
! wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.bin.gz


# Download the pretrained Fasttext Embeddings For English Vocabulary
! wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

## Outputs are cleared to keep the notebook clean

In [3]:
# The downloads are zipped, so they are unzipped to get ".bin" files

! gunzip /content/cc.en.300.bin.gz
! gunzip /content/cc.hi.300.bin.gz

In [4]:
import os
from google.colab import drive


import numpy as np
import pandas as pd
import tensorflow as tf

import nltk
import fasttext
import re

### Connecting to Drive

In [5]:
# The dataset is stored in Google Drive.
# Mounting Drive avoids re-uploading the dataset in every Colab session.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
datasets_path = '/content/drive/My Drive/Machine Learning Datasets/Hindi-English'
data_path = datasets_path + '/data.csv'

### Datasets

In [7]:
# Read only the required columns from the dataset to save memory.
# The embedding vectors are huge, so memory must be used carefully to avoid crashing the session.
df = pd.read_csv(data_path, usecols = ['english_sentence', 'hindi_sentence'])

In [8]:
df.head(10)

Unnamed: 0,english_sentence,hindi_sentence
0,politicians do not have permission to do what ...,"राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह कर..."
1,id like to tell you about one such child,मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहू...
2,this percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
3,what we really mean is that theyre bad at not ...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
4,the ending portion of these vedas is called up...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।
5,the then governor of kashmir resisted transfer...,कश्मीर के तत्कालीन गवर्नर ने इस हस्तांतरण का व...
6,in this lies the circumstances of people befor...,इसमें तुमसे पूर्व गुज़रे हुए लोगों के हालात हैं।
7,and who are we to say even that they are wrong,और हम होते कौन हैं यह कहने भी वाले कि वे गलत हैं
8,“”global warming“” refer to warming caused in ...,ग्लोबल वॉर्मिंग से आशय हाल ही के दशकों में हुई...
9,you may want your child to go to a school that...,हो सकता है कि आप चाहते हों कि आप का नऋर्नमेनटे...


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 127607 entries, 0 to 127606
Data columns (total 2 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   english_sentence  127575 non-null  object
 1   hindi_sentence    127607 non-null  object
dtypes: object(2)
memory usage: 1.9+ MB


In [10]:
df.dropna(inplace = True)

#### Note-
    -> Since this is a Machine Translation project, we choose not to remove stopwords and punctuation from the text.
    -> This is to enable our model to generalize to all sorts of text.


### Text Vectorization
    -> Give a unique integer identity to each word in both the English and Hindi corpora.
    -> Convert the texts into vectors, with the respective integers representing the words.
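
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not the TensorFlow layer used later in the notebook: the helper names `build_word_index` and `vectorize` are hypothetical, the reserved ids 0 and 1 mirror the `''` and `'[UNK]'` entries seen in the vocabulary output, and the alphabetical tie-break is an assumption made so the result is deterministic.

```python
from collections import Counter


def build_word_index(sentences):
    """Assign an integer id to every word, most frequent words first."""
    counts = Counter(word for s in sentences for word in s.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    # Ids 0 and 1 are reserved for the padding and unknown-word entries.
    vocab = ["", "[UNK]"] + ordered
    return {word: idx for idx, word in enumerate(vocab)}


def vectorize(sentence, word_index):
    """Replace each word by its id; unseen words map to the [UNK] id."""
    unk = word_index["[UNK]"]
    return [word_index.get(word, unk) for word in sentence.split()]


corpus = ["the cat sat", "the dog sat down"]
word_index = build_word_index(corpus)
ids = vectorize("the bird sat", word_index)
print(ids)  # [3, 1, 2]
```

The word `bird` never appears in the toy corpus, so it falls back to the `[UNK]` id.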

In [11]:
# tokenizer = nltk.tokenize.WhitespaceTokenizer()
# df['english_sentence'] = df.english_sentence.apply(lambda x: tokenizer.tokenize(x))
# df['hindi_sentence'] = df.hindi_sentence.apply(lambda x: tokenizer.tokenize(x))

In [12]:
df.head()

Unnamed: 0,english_sentence,hindi_sentence
0,politicians do not have permission to do what ...,"राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह कर..."
1,id like to tell you about one such child,मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहू...
2,this percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।
3,what we really mean is that theyre bad at not ...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते
4,the ending portion of these vedas is called up...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।


In [13]:
tokenizer = nltk.tokenize.WhitespaceTokenizer()
df['Eng_Token'] = df['english_sentence'].apply(lambda x: len(tokenizer.tokenize(x)))
df['Hin_Token'] = df['hindi_sentence'].apply(lambda x: len(tokenizer.tokenize(x)))
df.head()

Unnamed: 0,english_sentence,hindi_sentence,Eng_Token,Hin_Token
0,politicians do not have permission to do what ...,"राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह कर...",12,14
1,id like to tell you about one such child,मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहू...,9,11
2,this percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।,10,9
3,what we really mean is that theyre bad at not ...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते,12,11
4,the ending portion of these vedas is called up...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।,9,8


In [14]:
print(df.Eng_Token.max(), df.Hin_Token.max())

398 418


In [15]:
df[['Eng_Token', 'Hin_Token']].describe(include = 'all')

Unnamed: 0,Eng_Token,Hin_Token
count,127575.0,127575.0
mean,15.180819,17.885581
std,13.623336,16.554123
min,1.0,1.0
25%,7.0,8.0
50%,11.0,13.0
75%,20.0,24.0
max,398.0,418.0


In [16]:
print(df['Eng_Token'].quantile(0.98),
df['Hin_Token'].quantile(0.98))

50.0 60.0


#### Note-
    -> We observe that about 98 percent of the samples have at most 50 tokens in English and 60 tokens in Hindi.
    -> So, we drop the samples whose texts are too long, using a slightly more lenient cutoff of 75 tokens for both languages.
    -> Including them would inevitably force much longer vectors for representing sentences, most of which would be padding.
    -> That would increase memory and compute cost and could degrade model performance.
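
A small plain-Python sketch of this reasoning, using made-up token counts rather than the real dataset. The helper names are hypothetical, and the nearest-rank quantile used here is a simplification: `pandas.Series.quantile` interpolates linearly by default, so its values can differ slightly.

```python
import math


def nearest_rank_quantile(lengths, q):
    """Smallest length such that at least a fraction q of samples are no longer."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def retained_fraction(lengths, cutoff):
    """Fraction of samples that survive a maximum-length cutoff."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)


# Toy token counts: mostly short sentences plus one very long outlier.
token_counts = [5, 7, 9, 11, 12, 14, 20, 24, 60, 398]
cutoff = 75

q90 = nearest_rank_quantile(token_counts, 0.9)
kept = [n for n in token_counts if n <= cutoff]

print(q90)                                      # 60
print(retained_fraction(token_counts, cutoff))  # 0.9
print(max(token_counts), max(kept))             # 398 60
```

Dropping the single outlier shrinks the longest sentence from 398 tokens to 60, which is the padded length every other sample would otherwise have to carry.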

In [17]:
# Remove rows whose sentences are too long (more than 75 tokens in either language).
df = (df[df['Eng_Token'] <= 75]).copy()
df = (df[df['Hin_Token'] <= 75]).copy()

In [18]:
df[['Eng_Token', 'Hin_Token']].describe(include = 'all')

Unnamed: 0,Eng_Token,Hin_Token
count,126196.0,126196.0
mean,14.44297,16.873792
std,10.837149,12.768491
min,1.0,1.0
25%,7.0,8.0
50%,11.0,13.0
75%,20.0,23.0
max,75.0,75.0


In [19]:
df.head()

Unnamed: 0,english_sentence,hindi_sentence,Eng_Token,Hin_Token
0,politicians do not have permission to do what ...,"राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह कर...",12,14
1,id like to tell you about one such child,मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहू...,9,11
2,this percentage is even greater than the perce...,यह प्रतिशत भारत में हिन्दुओं प्रतिशत से अधिक है।,10,9
3,what we really mean is that theyre bad at not ...,हम ये नहीं कहना चाहते कि वो ध्यान नहीं दे पाते,12,11
4,the ending portion of these vedas is called up...,इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता है।,9,8


In [20]:
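def preprocess_text(text, num_tokens):
    # Wrap every sentence in start-of-sentence and end-of-sentence markers
    text = text.apply(lambda x: " ".join(['<SOS>', x, '<EOS>']))
    # Append <PAD> tokens so that every sentence reaches 75 tokens plus the two markers
    text = text + num_tokens.apply(lambda x: (75-x) * " <PAD> ")
    return text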

df['english_sentence'] = preprocess_text(df.english_sentence, df.Eng_Token)
df['hindi_sentence'] = preprocess_text(df.hindi_sentence, df.Hin_Token)

# df['english_sentence'] = df['english_sentence'].apply(lambda x: " ".join(['<SOS>', x, '<EOS>']))
# df['hindi_sentence'] = df['hindi_sentence'].apply(lambda x: " ".join(['<SOS>', x, '<EOS>']))

In [21]:
df.head()

Unnamed: 0,english_sentence,hindi_sentence,Eng_Token,Hin_Token
0,<SOS> politicians do not have permission to do...,"<SOS> राजनीतिज्ञों के पास जो कार्य करना चाहिए,...",12,14
1,<SOS> id like to tell you about one such child...,<SOS> मई आपको ऐसे ही एक बच्चे के बारे में बतान...,9,11
2,<SOS> this percentage is even greater than the...,<SOS> यह प्रतिशत भारत में हिन्दुओं प्रतिशत से ...,10,9
3,<SOS> what we really mean is that theyre bad a...,<SOS> हम ये नहीं कहना चाहते कि वो ध्यान नहीं द...,12,11
4,<SOS> the ending portion of these vedas is cal...,<SOS> इन्हीं वेदों का अंतिम भाग उपनिषद कहलाता ...,9,8


In [22]:
df.to_csv(datasets_path + '/Preprocessed_data.csv', index = False)

In [23]:
# Update the number of tokens for the English and Hindi texts
# Every sample should now have 77 tokens: 75 (words plus <PAD> tokens) + 2 (<SOS> and <EOS>)
df['Eng_Token'] = df['english_sentence'].apply(lambda x: len(tokenizer.tokenize(x)))
df['Hin_Token'] = df['hindi_sentence'].apply(lambda x: len(tokenizer.tokenize(x)))

In [24]:
df.describe(include = 'all')

Unnamed: 0,english_sentence,hindi_sentence,Eng_Token,Hin_Token
count,126196,126196,126196.0,126196.0
unique,121936,96943,,
top,<SOS> laughter <EOS> <PAD> <PAD> <PAD> <PAD...,<SOS> (हँसी) <EOS> <PAD> <PAD> <PAD> <PAD> ...,,
freq,558,212,,
mean,,,77.0,77.0
std,,,0.0,0.0
min,,,77.0,77.0
25%,,,77.0,77.0
50%,,,77.0,77.0
75%,,,77.0,77.0


In [25]:
df.english_sentence[0]

'<SOS> politicians do not have permission to do what needs to be done <EOS> <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD> '

In [26]:
df.hindi_sentence[0]

'<SOS> राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है . <EOS> <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD>  <PAD> '

In [27]:
def get_text_vectorizer(text_ds):

    # A TensorFlow layer that maps each word in the given text to an integer based on its vocabulary
    # With its default settings, the layer lowercases the text and strips punctuation, so '<SOS>' is stored as 'sos'
    text_vectorizer = tf.keras.layers.TextVectorization()
    
    # Since we don't already have a vocabulary, we let the TextVectorization layer adapt to our dataset
    text_vectorizer.adapt(text_ds)

    # After adapting to the dataset, the layer orders the words by frequency (most frequent first) to form its vocabulary
    vocab = text_vectorizer.get_vocabulary()

    # The layer has no method that returns the index of a word, so we build a dictionary, as it will be used frequently
    word_index = dict(zip(vocab, range(len(vocab))))
    print("Total Number of Unique words in the text -", len(vocab))
    return word_index, text_vectorizer, vocab

In [28]:
eng_word_index, eng_text_vectorizer, eng_vocab = get_text_vectorizer(df.english_sentence)

Total Number of Unique words in the text - 73480


In [29]:
hin_word_index, hin_text_vectorizer, hindi_vocab = get_text_vectorizer(df.hindi_sentence)

Total Number of Unique words in the text - 82387


In [30]:
np.array(eng_vocab)

array(['', '[UNK]', 'pad', ..., '003', '0001', '00'], dtype='<U96')
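
The vocabulary above contains `'pad'` rather than `'<PAD>'`. This is consistent with the default `standardize="lower_and_strip_punctuation"` setting of `TextVectorization`, which lowercases the text and removes punctuation before splitting on whitespace. A rough plain-Python approximation of that step is sketched below; the `standardize` helper is hypothetical, and using `string.punctuation` as the set of stripped characters is an assumption that should match the layer's default closely but is not taken from its source.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
STRIP_PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")


def standardize(text):
    """Lowercase the text and delete punctuation characters."""
    return STRIP_PUNCTUATION.sub("", text.lower())


tokens = standardize("<SOS> id like to tell you <EOS> <PAD>").split()
print(tokens)
# ['sos', 'id', 'like', 'to', 'tell', 'you', 'eos', 'pad']
```

Any later code that looks up the special tokens in the vocabulary therefore needs to use the standardized forms `sos`, `eos` and `pad`.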

In [31]:
np.save(datasets_path + "/English_Vectorizer_weights.npy", eng_text_vectorizer.get_weights())
np.save(datasets_path + "/Hindi_Vectorizer_weights.npy", hin_text_vectorizer.get_weights())

In [32]:
feature = eng_text_vectorizer(df.english_sentence)               # English Sentences
target = hin_text_vectorizer(df.hindi_sentence)                  # Hindi Sentences

np.save((datasets_path + '/English_vectorized.npy'), feature.numpy())
np.save((datasets_path + '/Hindi_vectorized.npy'), target.numpy())


In [33]:
print(
f"""
Shape of English vectors - {feature.shape}
Shape of Hindi vectors - {target.shape}
"""
)


Shape of English vectors - (126196, 77)
Shape of Hindi vectors - (126196, 77)



##### Note-
    -> We restricted the maximum number of tokens (created using the whitespace tokenizer) to 75, padding shorter sentences with <PAD>.
    -> We added 2 new tokens to every text sample, <SOS> and <EOS>, meaning Start Of Sentence and End Of Sentence.
    -> So, as expected, the vectorized representation of each text has length 77.
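
The arithmetic behind the length of 77 can be checked with a token-level sketch. The `wrap_and_pad` helper is hypothetical and works on Python lists instead of the pandas Series handled by `preprocess_text`, but it applies the same rule: one start marker, the words, one end marker, then enough `<PAD>` tokens to fill 75 word slots.

```python
def wrap_and_pad(sentence, max_tokens=75):
    """Return the tokens of a sentence wrapped in markers and padded to a fixed length."""
    words = sentence.split()
    if len(words) > max_tokens:
        raise ValueError("sentence is longer than max_tokens")
    padding = ["<PAD>"] * (max_tokens - len(words))
    return ["<SOS>"] + words + ["<EOS>"] + padding


padded = wrap_and_pad("id like to tell you about one such child")
print(len(padded))            # 77
print(padded.count("<PAD>"))  # 66
print(padded[:3])             # ['<SOS>', 'id', 'like']
```

The sentence has 9 words, so 66 padding tokens are added: 1 + 9 + 1 + 66 = 77.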

### Embedding Vectors For Our Vocabulary

In [34]:
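# Initialize vectors for our <SOS>, <EOS> and <PAD> tokens:
# two different random vectors for <SOS> and <EOS>, and an all-zero vector for <PAD>
np.random.seed(7)
sos_vec = np.random.rand(300, )

np.random.seed(8)
eos_vec = np.random.rand(300, )

# The padding vector is all zeros, so it needs no random seed
pad_vec = np.zeros((300,))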

In [None]:
embedding_dims = 300
hindi_embed_model = fasttext.load_model('/content/cc.hi.300.bin')
english_embed_model = fasttext.load_model('/content/cc.en.300.bin')

In [36]:
def get_embedding_matrix(word_index, embedding_model):
    num_tokens = len(word_index) + 2

    embedding_matrix = np.zeros((num_tokens, embedding_dims), dtype = np.float32)
    
    for word, idx in word_index.items():

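        # The default TextVectorization standardization lowercases text and strips punctuation,
        # so the special tokens are stored in the vocabulary as 'sos', 'eos' and 'pad'
        if(word == 'sos'):
            embedding_matrix[idx] = sos_vec
            continue

        if(word == 'eos'):
            embedding_matrix[idx] = eos_vec
            continue

        if(word == 'pad'):
            embedding_matrix[idx] = pad_vec
            continue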
            
        emb_vector = embedding_model.get_word_vector(word)
        embedding_matrix[idx] = emb_vector

    return embedding_matrix
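
A toy version of this lookup may make the row alignment easier to see. Everything here is an illustrative assumption: `toy_word_vector` is a hypothetical stand-in for a pretrained model's word lookup, the vectors have 4 dimensions instead of 300, and the special tokens are written in the lowercased form in which they appear in the vocabulary. Row `i` of the matrix holds the vector of the word whose id is `i`, which is what lets an embedding layer index the matrix directly with the vectorized sentences.

```python
import numpy as np

EMBEDDING_DIMS = 4  # toy size; the notebook uses 300


def toy_word_vector(word):
    """Hypothetical stand-in for a pretrained embedding lookup (deterministic per word)."""
    rng = np.random.default_rng(sum(word.encode("utf-8")))
    return rng.random(EMBEDDING_DIMS).astype(np.float32)


def build_embedding_matrix(word_index, special_vectors, lookup):
    """Place each word's vector in the row given by its integer id."""
    matrix = np.zeros((len(word_index), EMBEDDING_DIMS), dtype=np.float32)
    for word, idx in word_index.items():
        if word in special_vectors:
            matrix[idx] = special_vectors[word]
        else:
            matrix[idx] = lookup(word)
    return matrix


word_index = {"": 0, "[UNK]": 1, "pad": 2, "sos": 3, "eos": 4, "child": 5}
special_vectors = {
    "pad": np.zeros(EMBEDDING_DIMS, dtype=np.float32),
    "sos": np.full(EMBEDDING_DIMS, 0.5, dtype=np.float32),
    "eos": np.full(EMBEDDING_DIMS, 0.25, dtype=np.float32),
}

matrix = build_embedding_matrix(word_index, special_vectors, toy_word_vector)
print(matrix.shape)  # (6, 4)
```

Keeping the special vectors in a dictionary avoids one `if` branch per token and makes it obvious which vocabulary entries bypass the pretrained model.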


In [37]:
eng_embedding_matrix = get_embedding_matrix(eng_word_index, english_embed_model)
hin_embedding_matrix = get_embedding_matrix(hin_word_index, hindi_embed_model)

In [38]:
np.save((datasets_path + '/English_embedding_matrix.npy'), eng_embedding_matrix)
np.save((datasets_path + '/Hindi_embedding_matrix.npy'), hin_embedding_matrix)


# END