# Introduction

1. Neural Machine Translation (NMT) is the task of translating text from one language to another using artificial neural network models.
2. An NMT model generally consists of an encoder, which encodes a source sentence into a fixed-length vector, and a decoder, which generates a translation from that vector.
3. The task can be thought of as a prediction problem: given a sequence of words in the source language as input, predict the output sequence of words in the target language.
4. The dataset comes from http://www.manythings.org/anki/, which provides tab-delimited bilingual sentence pairs in separate files, one per language pair.
5. For this project, use the French-English language pairs so that all students' projects can be evaluated uniformly.

# Step-1: Download and clean the data
1. Download the data as a zip file and extract the corresponding txt file. Read this txt file and prepare a list of language phrase pairs.
2. Next, clean these pairs. Some of the cleaning operations are:

*   Remove non-printable characters, if any
*   Remove punctuation and non-alphabetic characters
*   Convert to lowercase
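
The cleaning operations listed above can be sketched as a single helper. This is only an illustration, not the cleaning code used later in this notebook: the name `clean_phrase` is hypothetical, and treating "non-alphabetic" as any word for which `str.isalpha()` is false is an assumption.

```python
import string
import unicodedata

# Hypothetical helper (not part of the notebook): applies the three
# cleaning operations listed above to one phrase.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_phrase(phrase):
    # normalise to composed form so accented letters stay single characters
    phrase = unicodedata.normalize('NFC', phrase)
    # replace non-printable characters (control codes, no-break spaces) with a space
    phrase = ''.join(ch if ch.isprintable() else ' ' for ch in phrase)
    # remove ASCII punctuation
    phrase = phrase.translate(_PUNCT_TABLE)
    # keep only fully alphabetic words, converted to lowercase
    words = [w.lower() for w in phrase.split() if w.isalpha()]
    return ' '.join(words)

print(clean_phrase("Prenez vos jambes à vos cous !"))
print(clean_phrase("Run!\u202f"))
```

Note that stripping punctuation also removes apostrophes, so a contraction such as "you're" becomes "youre".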



# Step-2: Split and prepare the data for training the model
1. After cleaning the data, split it into train and test sets.
2. Then create a separate tokenizer for each of the source and target languages.
3. After creating the tokenizers, encode and pad the input (source language) and output (target language) sequences with respect to their individual tokenizers and maximum sequence lengths.
4. In this problem you will essentially be predicting words in the target language, so the output sequences need to be converted to one-hot encoding.
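
To make the encode, pad and one-hot steps concrete, here is a small plain-Python sketch. The three functions are hypothetical stand-ins for the roles that the Keras `Tokenizer`, `pad_sequences` and `to_categorical` play in this notebook; unlike the Keras `Tokenizer`, this toy vocabulary numbers words in order of first appearance rather than by frequency.

```python
# Plain-Python stand-ins (hypothetical names), for illustration only.

def build_vocab(lines):
    # index 0 is reserved for padding, so word indices start at 1
    vocab = {}
    for line in lines:
        for word in line.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_and_pad(vocab, length, lines):
    encoded = []
    for line in lines:
        # integer-encode, keeping at most the last `length` words
        ids = [vocab[w] for w in line.split()][-length:]
        # left-pad with zeros up to the fixed length
        encoded.append([0] * (length - len(ids)) + ids)
    return encoded

def one_hot(sequences, vocab_size):
    # each integer becomes a vector with a single 1 at that index
    return [[[1 if k == i else 0 for k in range(vocab_size)] for i in seq]
            for seq in sequences]

target = ["go", "i was third"]
vocab = build_vocab(target)
vocab_size = len(vocab) + 1          # +1 for the padding index 0
padded = encode_and_pad(vocab, 3, target)
print(padded)                        # [[0, 0, 1], [2, 3, 4]]
print(one_hot(padded, vocab_size)[0])
```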


# Step-3: Define and train the RNN-based Encoder-Decoder model
1. First, define a sequential model consisting mainly of two parts: an Encoder and a Decoder.
2. In the Encoder, the input sequence is passed through an Embedding layer (to train the word embeddings for the source language), and the output of the Embedding layer may then be passed through one or more RNN/LSTM layers.
3. To connect this Encoder to the Decoder (yet to be defined), we can use a RepeatVector layer. (This is because the shape of the Encoder's output is not the same as the input shape the Decoder expects.)
4. Next, stack up the Decoder, in which you may add one or more RNN/LSTM layers followed by a final TimeDistributed Dense output layer that produces an output separately for each timestep.
5. The model is now defined and can be trained on the training data prepared in the previous step. Here, you may experiment with the number of epochs, the optimizer, and the batch size to get the best results.
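
The shape mismatch mentioned in point 3 is easiest to see with a toy example. The sketch below mimics what a RepeatVector layer does using plain Python lists; `repeat_vector` is a hypothetical stand-in rather than the Keras layer, and the tiny sizes are chosen only for illustration.

```python
def repeat_vector(vector, n):
    # copy the single encoder summary vector n times, one copy per
    # decoder timestep: shape (units,) -> (n, units)
    return [list(vector) for _ in range(n)]

encoder_output = [0.1, -0.4, 0.7]        # one vector with 3 "units"
decoder_input = repeat_vector(encoder_output, 5)

print(len(decoder_input), len(decoder_input[0]))   # 5 3
```

An encoder LSTM that returns only its final state yields one vector per sentence, while a decoder LSTM expects a sequence of vectors, one per output timestep. Repeating the vector is a simple way to bridge the two.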

# Step-4: Evaluating the model
Use the BLEU score to evaluate your model, using the NLTK library.
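
As a rough intuition for what BLEU measures, the sketch below computes an unsmoothed, single-reference BLEU-1 (clipped unigram precision multiplied by the brevity penalty) in plain Python. The function `bleu1` is hypothetical and is not the NLTK implementation; NLTK's `corpus_bleu` aggregates counts over the whole corpus and can apply smoothing, so its numbers will generally differ.

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    # clipped unigram precision times the brevity penalty (single reference)
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # each candidate word is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(candidate)
    # penalise candidates that are shorter than the reference
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * precision

score = bleu1("look closer".split(), "look look closer".split())
print(round(score, 4))   # 0.6667
```

Here the candidate repeats "look", but the reference contains it only once, so only two of the three candidate words are credited.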

## Importing required modules

In [1]:
import string
from nltk.tokenize import word_tokenize
import numpy as np
import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Embedding,Bidirectional,LSTM,Dense,RepeatVector,TimeDistributed
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping,ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from nltk.translate.bleu_score import SmoothingFunction,corpus_bleu
smoothie = SmoothingFunction().method4

## Reading the text file and cleaning it

In [3]:
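# fra.txt is UTF-8 encoded; state the encoding explicitly so accented
# characters are decoded correctly regardless of the platform default
with open("fra.txt","r",encoding="utf-8") as f:
    text=f.readlines()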

In [4]:
text

['Go.\tVa !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)\n',
 'Go.\tMarche.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8090732 (Micsmithel)\n',
 'Go.\tEn route !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8267435 (felix63)\n',
 'Go.\tBouge !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #9022935 (Micsmithel)\n',
 'Hi.\tSalut !\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)\n',
 'Hi.\tSalut.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)\n',
 'Run!\tCours\u202f!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)\n',
 'Run!\tCourez\u202f!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)\n',
 'Run!\tPrenez vos jambes à vos cous !\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #2077449 (sacredceltic)\n',
 'Run!\tFile !\tCC-BY 2.0 (France) Attribution:

In [5]:
eng_fre=[]
for t in text:
    a=t.split("\t")
    eng_fre.append([a[0],a[1]])
eng_fre=np.array(eng_fre)
eng_fre,len(eng_fre)

(array([['Go.', 'Va !'],
        ['Go.', 'Marche.'],
        ['Go.', 'En route !'],
        ...,
        ['Since there are usually multiple websites on any given topic, I usually just click the back button when I arrive on any webpage that has pop-up advertising. I just go to the next page found by Google and hope for something less irritating.',
         "Puisqu'il y a de multiples sites web sur chaque sujet, je clique d'habitude sur le bouton retour arrière lorsque j'atterris sur n'importe quelle page qui contient des publicités surgissantes. Je me rends juste sur la prochaine page proposée par Google et espère tomber sur quelque chose de moins irritant."],
        ["If someone who doesn't know your background says that you sound like a native speaker, it means they probably noticed something about your speaking that made them realize you weren't a native speaker. In other words, you don't really sound like a native speaker.",
         "Si quelqu'un qui ne connaît pas vos antéc

In [6]:
# keep the first 13,000 pairs, then strip ASCII punctuation from both columns
eng_fre=eng_fre[:13000]
punct_table=str.maketrans('', '', string.punctuation)
eng_fre[:,0] = [s.translate(punct_table) for s in eng_fre[:,0]]
eng_fre[:,1] = [s.translate(punct_table) for s in eng_fre[:,1]]

In [7]:
for i in range(len(eng_fre)):
    eng_fre[i,0]=eng_fre[i,0].lower()
    eng_fre[i,1]=eng_fre[i,1].lower()
eng_fre

array([['go', 'va '],
       ['go', 'marche'],
       ['go', 'en route '],
       ...,
       ['youre special', 'vous êtes spécial'],
       ['youre the pro', 'cest vous le professionnel'],
       ['youre the pro', 'cest toi le professionnel']], dtype='<U360')

In [8]:
eng=eng_fre[:,0]
fre=eng_fre[:,1]

In [9]:
max_eng_length=max((len(line.split()) for line in eng))
max_fre_length=max((len(line.split()) for line in fre))
max_eng_length,max_fre_length

(5, 10)

In [10]:
np.random.shuffle(eng_fre)
train=eng_fre[:10000]
test=eng_fre[10000:]

## Tokenization and Padding

In [11]:
def tokenization(lines):
    tokenizer=Tokenizer(oov_token="<UNK>")
    tokenizer.fit_on_texts(lines)
    return tokenizer

In [12]:
eng_tokenizer=tokenization(eng_fre[:,0])
eng_vocab_size=len(eng_tokenizer.word_index)+1 
eng_vocab_size

2468

In [13]:
fre_tokenizer=tokenization(eng_fre[:,1])
fre_vocab_size=len(fre_tokenizer.word_index)+1 
fre_vocab_size

5766

In [14]:
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    sequences = tokenizer.texts_to_sequences(lines)
    seq = pad_sequences(sequences, maxlen=length)
    return seq

## Preparing training and testing data

In [15]:
# max_fre_length=0
# max_eng_length=0
# max_fre_length1=0
# max_eng_length1=0
trainX=encode_sequences(fre_tokenizer,max_fre_length,train[:,1])
trainY=encode_sequences(eng_tokenizer,max_eng_length,train[:,0])

testX=encode_sequences(fre_tokenizer,max_fre_length,test[:,1])
testY=encode_sequences(eng_tokenizer,max_eng_length,test[:,0])

In [16]:
def one_hot_encoding(sequences, vocab_size):
    y_1 = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        y_1.append(encoded)
    y = np.array(y_1)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y

In [17]:
trainY=one_hot_encoding(trainY,eng_vocab_size)
testY=one_hot_encoding(testY,eng_vocab_size)

In [18]:
trainX.shape,trainY.shape

((10000, 10), (10000, 5, 2468))

In [19]:
testX.shape,testY.shape

((3000, 10), (3000, 5, 2468))
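
One-hot encoding the targets is memory-hungry, which a little arithmetic on the shapes printed above makes clear. The sketch assumes 4 bytes per value (float32 one-hot entries, int32 integer indices); the actual dtypes in a given environment may differ.

```python
# Sizes taken from the training shapes printed above; 4 bytes per value
# is an assumption (float32 one-hot targets, int32 integer targets).
samples, timesteps, vocab_size = 10000, 5, 2468
bytes_per_value = 4

one_hot_bytes = samples * timesteps * vocab_size * bytes_per_value
integer_bytes = samples * timesteps * bytes_per_value

print(one_hot_bytes)                    # 493600000
print(integer_bytes)                    # 200000
print(one_hot_bytes // integer_bytes)   # 2468
```

Under these assumptions the one-hot training targets take roughly 494 MB, against 0.2 MB for the integer sequences. Keras also offers a `sparse_categorical_crossentropy` loss that accepts integer class indices directly, which avoids materialising the one-hot array.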

## Preparing and running the model

In [20]:
model=Sequential()
model.add(Embedding(fre_vocab_size,512,input_length=max_fre_length,mask_zero=True)) 
model.add(LSTM(512))
model.add(RepeatVector(max_eng_length))
model.add(LSTM(512,return_sequences=True))
model.add(TimeDistributed(Dense(eng_vocab_size,activation="softmax")))
# adam=Adam(learning_rate=0.01)
model.compile(loss="categorical_crossentropy",optimizer="rmsprop",metrics=["acc"])  # targets are one-hot encoded, so categorical_crossentropy is used; sparse_categorical_crossentropy on integer targets would avoid the large memory cost of one-hot encoding

In [21]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 10, 512)           2952192   
                                                                 
 lstm (LSTM)                 (None, 512)               2099200   
                                                                 
 repeat_vector (RepeatVector  (None, 5, 512)           0         
 )                                                               
                                                                 
 lstm_1 (LSTM)               (None, 5, 512)            2099200   
                                                                 
 time_distributed (TimeDistr  (None, 5, 2468)          1266084   
 ibuted)                                                         
                                                                 
Total params: 8,416,676
Trainable params: 8,416,676
Non-
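
The parameter counts in the summary above can be reproduced by hand. The helper names below are hypothetical; the formulas are the standard ones for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer with bias.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

counts = [
    embedding_params(5766, 512),   # embedding
    lstm_params(512, 512),         # lstm (encoder)
    0,                             # repeat_vector has no weights
    lstm_params(512, 512),         # lstm_1 (decoder)
    dense_params(512, 2468),       # time_distributed Dense
]
print(counts)        # [2952192, 2099200, 0, 2099200, 1266084]
print(sum(counts))   # 8416676
```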

In [22]:
filename = 'best_model'
checkpoint = ModelCheckpoint(filename, monitor='val_acc', verbose=1, save_best_only=True)
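# note: this EarlyStopping callback is defined but not passed to model.fit below, so training runs for all 50 epochs
es=EarlyStopping(monitor="val_acc",min_delta=0.01,patience=5)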

In [23]:
model.fit(trainX,trainY,epochs=50,verbose=1,batch_size=64,callbacks=[checkpoint],validation_data=(testX,testY))

Epoch 1/50
Epoch 1: val_acc improved from -inf to 0.47980, saving model to best_model
INFO:tensorflow:Assets written to: best_model\assets
Epoch 2/50
Epoch 2: val_acc improved from 0.47980 to 0.51833, saving model to best_model
Epoch 3/50
Epoch 3: val_acc improved from 0.51833 to 0.54600, saving model to best_model
Epoch 4/50
Epoch 4: val_acc improved from 0.54600 to 0.56347, saving model to best_model
Epoch 5/50
Epoch 5: val_acc improved from 0.56347 to 0.58220, saving model to best_model
Epoch 6/50
Epoch 6: val_acc improved from 0.58220 to 0.59453, saving model to best_model
Epoch 7/50
Epoch 7: val_acc did not improve from 0.59453
Epoch 8/50
Epoch 8: val_acc improved from 0.59453 to 0.62680, saving model to best_model
Epoch 9/50
Epoch 9: val_acc improved from 0.62680 to 0.63833, saving model to best_model
Epoch 10/50
Epoch 10: val_acc improved from 0.63833 to 0.63847, saving model to best_model
Epoch 11/50
Epoch 11: val_acc improved from 0.63847 to 0.65227, saving model to best_model
Epoch 12/50
Epoch 12: val_acc improved from 0.65227 to 0.66287, saving model to best_model
Epoch 13/50
Epoch 13: val_acc improved from 0.66287 to 0.66893, saving model to best_model
Epoch 14/50
Epoch 14: val_acc improved from 0.66893 to 0.66900, saving model to best_model
Epoch 15/50
Epoch 15: val_acc improved from 0.66900 to 0.67353, saving model to best_model
Epoch 16/50
Epoch 16: val_acc improved from 0.67353 to 0.67973, saving model to best_model
Epoch 17/50
Epoch 17: val_acc improved from 0.67973 to 0.68720, saving model to best_model
Epoch 18/50
Epoch 18: val_acc improved from 0.68720 to 0.69320, saving model to best_model
Epoch 19/50
Epoch 19: val_acc improved from 0.69320 to 0.69367, saving model to best_model
Epoch 20/50
Epoch 20: val_acc improved from 0.69367 to 0.69833, saving model to best_model
Epoch 21/50
Epoch 21: val_acc did not improve from 0.69833
Epoch 22/50
Epoch 22: val_acc did not improve from 0.69833
Epoch 23/50
Epoch 23: val_acc improved from 0.69833 to 0.70560, saving model to best_model
Epoch 24/50
Epoch 24: val_acc did not improve from 0.70560
Epoch 25/50
Epoch 25: val_acc improved from 0.70560 to 0.70793, saving model to best_model
Epoch 26/50
Epoch 26: val_acc improved from 0.70793 to 0.71340, saving model to best_model
Epoch 27/50
Epoch 27: val_acc did not improve from 0.71340
Epoch 28/50
Epoch 28: val_acc did not improve from 0.71340
Epoch 29/50
Epoch 29: val_acc did not improve from 0.71340
Epoch 30/50
Epoch 30: val_acc did not improve from 0.71340
Epoch 31/50
Epoch 31: val_acc did not improve from 0.71340
Epoch 32/50
Epoch 32: val_acc improved from 0.71340 to 0.71500, saving model to best_model
Epoch 33/50
Epoch 33: val_acc did not improve from 0.71500
Epoch 34/50
Epoch 34: val_acc did not improve from 0.71500
Epoch 35/50
Epoch 35: val_acc did not improve from 0.71500
Epoch 36/50
Epoch 36: val_acc did not improve from 0.71500
Epoch 37/50
Epoch 37: val_acc improved from 0.71500 to 0.71553, saving model to best_model
Epoch 38/50
Epoch 38: val_acc did not improve from 0.71553
Epoch 39/50
Epoch 39: val_acc did not improve from 0.71553
Epoch 40/50
Epoch 40: val_acc did not improve from 0.71553
Epoch 41/50
Epoch 41: val_acc did not improve from 0.71553
Epoch 42/50
Epoch 42: val_acc improved from 0.71553 to 0.71680, saving model to best_model
Epoch 43/50
Epoch 43: val_acc did not improve from 0.71680
Epoch 44/50
Epoch 44: val_acc did not improve from 0.71680
Epoch 45/50
Epoch 45: val_acc did not improve from 0.71680
Epoch 46/50
Epoch 46: val_acc did not improve from 0.71680
Epoch 47/50
Epoch 47: val_acc did not improve from 0.71680
Epoch 48/50
Epoch 48: val_acc did not improve from 0.71680
Epoch 49/50
Epoch 49: val_acc did not improve from 0.71680
Epoch 50/50
Epoch 50: val_acc did not improve from 0.71680


<keras.callbacks.History at 0x22c1b6976d0>

In [32]:
from keras.models import load_model
model=load_model("best_model")

In [33]:
eng_tokenizer.word_index.items()

dict_items([('<UNK>', 1), ('i', 2), ('tom', 3), ('it', 4), ('you', 5), ('im', 6), ('is', 7), ('a', 8), ('me', 9), ('we', 10), ('youre', 11), ('was', 12), ('go', 13), ('its', 14), ('he', 15), ('were', 16), ('are', 17), ('be', 18), ('this', 19), ('that', 20), ('dont', 21), ('get', 22), ('the', 23), ('to', 24), ('do', 25), ('they', 26), ('not', 27), ('up', 28), ('ill', 29), ('can', 30), ('have', 31), ('no', 32), ('come', 33), ('she', 34), ('thats', 35), ('here', 36), ('my', 37), ('in', 38), ('let', 39), ('out', 40), ('take', 41), ('stop', 42), ('did', 43), ('who', 44), ('need', 45), ('theyre', 46), ('like', 47), ('all', 48), ('him', 49), ('love', 50), ('toms', 51), ('keep', 52), ('got', 53), ('well', 54), ('stay', 55), ('us', 56), ('look', 57), ('help', 58), ('am', 59), ('what', 60), ('want', 61), ('please', 62), ('hes', 63), ('on', 64), ('lets', 65), ('home', 66), ('your', 67), ('try', 68), ('lost', 69), ('how', 70), ('saw', 71), ('one', 72), ('see', 73), ('must', 74), ('so', 75), ('back

### Defining the functions to predict a sequence and compute the BLEU score

In [34]:
def word_for_id(integer, tokenizer):
    # map an integer index back to its word (None if the index is unknown)
    return tokenizer.index_word.get(int(integer))

def predict_seq(model, tokenizer, source):
    # generate the target sentence for one encoded source sequence
    prediction = model.predict(source, verbose=0)[0]
    # pick the most probable word index at each timestep
    integers = [np.argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        # index 0 is the padding value, so skip it
        if i!=0:
            word = word_for_id(i, tokenizer)
            if word is None:
                break
            target.append(word)
    return ' '.join(target)

In [35]:
def bleu_score(model, tokenizer, sources, raw_dataset):
    # compute corpus-level BLEU-1 to BLEU-4 scores for a model
    actual, predicted = [], []
    for i, source in enumerate(sources):
        # translate the encoded source text (reshaped to a batch of one)
        source = source.reshape((1, source.shape[0]))
        translation = predict_seq(model, tokenizer, source)
        # each raw pair is stored as [english (target), french (source)]
        raw_target, raw_src = raw_dataset[i]
        actual.append([raw_target.split()])
        predicted.append(translation.split())
        # print the first few examples for a quick visual check
        if i <= 10:
            print('source = ',raw_src,'<--->', ' target = ',raw_target,'<--->','  predicted = ',translation)
    # calculating BLEU score; the n-gram weights for each score sum to 1
    print('-------------------------------------------')
    print('BLEU Score :')
    print('BLEU score-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0),smoothing_function=smoothie,auto_reweigh=False))
    print('BLEU score-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0),smoothing_function=smoothie,auto_reweigh=False))
    print('BLEU score-3: %f' % corpus_bleu(actual, predicted, weights=(1/3, 1/3, 1/3, 0),smoothing_function=smoothie,auto_reweigh=False))
    print('BLEU score-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25),smoothing_function=smoothie,auto_reweigh=False))

### Evaluating the model on training dataset

In [36]:
bleu_score(model,eng_tokenizer,trainX,train)

source =  goûte ça <--->  target =  taste this <--->   predicted =  taste this
source =  estce que tom va bien  <--->  target =  is tom ok <--->   predicted =  is tom ok
source =  vous aton tiré dessus  <--->  target =  were you shot <--->   predicted =  were you shot
source =  vous sentezvous seule  <--->  target =  are you lonely <--->   predicted =  are you lonely
source =  jétais troisième <--->  target =  i was third <--->   predicted =  i was third
source =  tom la pris <--->  target =  tom took it <--->   predicted =  tom took it
source =  estu esseulé  <--->  target =  are you lonely <--->   predicted =  are you lonely
source =  rapprochezvous pour voir <--->  target =  look closer <--->   predicted =  look look closer
source =  nous sommes toutes là <--->  target =  were all here <--->   predicted =  were all here
source =  demandez à mes amis <--->  target =  ask my friends <--->   predicted =  ask my friends
source =  je déteste repasser <--->  target =  i hate 

### Evaluating the model on testing dataset

In [37]:
bleu_score(model, eng_tokenizer, testX, test)

source =  cest devenu viral <--->  target =  it went viral <--->   predicted =  he hurts that
source =  vous avez lair malade <--->  target =  you look sick <--->   predicted =  you look tired
source =  tom appellera <--->  target =  tomll call <--->   predicted =  is good
source =  cest si triste <--->  target =  thats so sad <--->   predicted =  thats so cool
source =  jai eu besoin daide <--->  target =  i needed help <--->   predicted =  i need help
source =  ignoreles <--->  target =  ignore them <--->   predicted =  shut up
source =  fermez la porte <--->  target =  shut the door <--->   predicted =  close the door
source =  je ne suis pas seule <--->  target =  im not alone <--->   predicted =  im not alone
source =  ça aura lieu <--->  target =  itll happen <--->   predicted =  it smells work
source =  ne souriez pas <--->  target =  dont smile <--->   predicted =  dont stop
source =  il se pourrait que tom gagne <--->  target =  tom might win <--->   predicted =  tom may talk