# Assignment 2

## Model for translating sentences from French to English

Language translation is a very important tool when travelling to a country whose language you do not know. This model will be helpful for users who know French but not English.

![image-2.png](attachment:image-2.png)

## Research

Many Recurrent Neural Network (RNN) architectures are available for training such a model, for example Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). I am using a GRU to train this model.
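For reference, one common way to write the GRU update for an input $x_t$ and previous state $h_{t-1}$ is sketched below. The symbols ($W$, $U$, $b$ for weights and biases, $\sigma$ for the sigmoid, $\odot$ for the element-wise product) are notation chosen here, not taken from the notebook, and libraries differ in details such as which term the update gate $z_t$ multiplies.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

A GRU has two gates and no separate cell state, so it has fewer parameters than an LSTM of the same width.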

## Model Creation

### Importing statements

In [1]:
#import
import os
import string
import numpy as np
import tensorflow as tf
from keras.models import load_model
from keras.models import Model, Sequential
from keras.layers import Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import GRU, Input, Dense, TimeDistributed, Dropout

### Loading data

The data is loaded from the French and English data files. We then check the number of sentences in each file and print the first sentence of each to see what the data looks like.

In [2]:
#method for loading data
def load_data(path):
    #reading the whole file (assumes the files are UTF-8 encoded, so accented French characters load correctly)
    with open(path, "r", encoding="utf-8") as file:
        data = file.read()
    #one sentence per line
    return data.split('\n')

In [3]:
#loading english data and french data
english_sentences = load_data('english_data')
french_sentences = load_data('french_data')

In [4]:
#number of sentences in each language and the first sentence of each
print("English sentences: {}".format(len(english_sentences)))
print("French_sentences: {}".format(len(french_sentences)))
print(english_sentences[0])
print(french_sentences[0])

English sentences: 137861
French_sentences: 137861
new jersey is sometimes quiet during autumn , and it is snowy in april .
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .


It can be seen that we need to remove punctuation from our sentences.

### Removing punctuation from English and French sentences

As we can see, our data contains some punctuation. We remove it before processing the data further, because punctuation can reduce accuracy and make the model more prone to error.

In [5]:
#translation table that deletes ASCII punctuation and curly quotes
input1 = str.maketrans('', '', string.punctuation + '“”')

#stripping punctuation from the English and French sentences
english_sentences = [w.translate(input1) for w in english_sentences]
french_sentences = [w.translate(input1) for w in french_sentences]

#output after removing punctuation
print(english_sentences[0])
print(french_sentences[0])

new jersey is sometimes quiet during autumn  and it is snowy in april 
new jersey est parfois calme pendant l automne  et il est neigeux en avril 
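In the training data the apostrophe is already separated by a space (`l' automne`), so deleting it leaves `l automne`. A user who types `l'automne` without the space would instead get `lautomne`, a word the tokenizer has not seen. The sketch below contrasts deleting punctuation with replacing it by a space. The helper `strip_punctuation` and the sample sentence are hypothetical and not part of the notebook.

```python
import string

# Characters to strip: ASCII punctuation plus curly quotes, as in the cell above.
PUNCTUATION = string.punctuation + '“”'

# Table that deletes punctuation (the notebook's approach).
delete_table = str.maketrans('', '', PUNCTUATION)
# Table that maps punctuation to a space (assumed alternative, for illustration).
space_table = str.maketrans({ch: ' ' for ch in PUNCTUATION})


def strip_punctuation(sentence):
    """Hypothetical helper: replace punctuation with spaces, then collapse whitespace."""
    return ' '.join(sentence.translate(space_table).split())


typed = "l'automne est calme , et il est neigeux ."
deleted = typed.translate(delete_table)   # apostrophe removed, words fused
spaced = strip_punctuation(typed)         # apostrophe becomes a word boundary

print(deleted.split())
print(spaced.split())
```

With the space-based variant, typed input is split the same way as the training sentences, at the cost of one extra whitespace-collapsing step.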


We will now keep only sentences with 16 words or fewer.

### Setting a word limit on sentences in both languages

A word limit is needed so that we can remove sentences with hundreds of words. This keeps the padded sequences short and should improve accuracy.

In [6]:
#removing data having more than 16 words in a sentence.
english_data = []
french_data = []
for i in range(len(french_sentences)):
    #condition on how sentences will be removed
    if len(english_sentences[i].split())<=16 and len(french_sentences[i].split())<=16:
        #adding sentence in new list
        english_data.append(english_sentences[i])
        french_data.append(french_sentences[i]) 

In [7]:
#length of english and french sentences after removing sentences more than 16 words 
print(len(english_data))
print(len(french_data))

135594
135594


As can be seen, our data has shrunk slightly, because a few sentences had more than 16 words.

In [8]:
#printing one English and one French sentence after removing punctuation and capping the length at 16 words
print(english_data[0])
print(french_data[0])

new jersey is sometimes quiet during autumn  and it is snowy in april 
new jersey est parfois calme pendant l automne  et il est neigeux en avril 


### Tokenizer

Tokenization is the process of converting a sequence of characters into a sequence of tokens. Here it assigns a number to each unique word. After tokenizing the data we also know how many unique words the data contains and which words occur most often, among other things the tokenizer provides. It is a basic building block for models like this one.
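To make the idea concrete, here is a minimal pure-Python sketch of what word-level tokenization and padding do. It only roughly mimics the Keras `Tokenizer` and `pad_sequences`; the function names and the two toy sentences are assumptions made for illustration.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign ids 1..N to words, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Map each sentence to a list of ids, skipping words not in the index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_post(sequences, maxlen):
    """Right-pad with zeros; assumes no sequence is longer than maxlen."""
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]


texts = ["new jersey est calme", "paris est calme en avril"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = pad_post(sequences, maxlen=6)

# One extra slot for the padding id 0, which is why the notebook adds 1.
vocab_size = len(word_index) + 1

print(word_index)
print(padded)
print(vocab_size)
```

The `+ 1` on the vocabulary size matters because the embedding and softmax layers must have a slot for id 0 as well as for every word id.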

In [9]:
#creating a tokenizer for the French data
french_tokenizer = Tokenizer()
#building the French vocabulary and converting sentences to id sequences
french_tokenizer.fit_on_texts(french_data)
sequence_french_data = french_tokenizer.texts_to_sequences(french_data)

#creating a tokenizer for the English data
english_tokenizer = Tokenizer()
#building the English vocabulary and converting sentences to id sequences
english_tokenizer.fit_on_texts(english_data)
sequence_english_data = english_tokenizer.texts_to_sequences(english_data)

#padding sentences shorter than 16 words with zeros at the end
sequence_french_data = pad_sequences(sequence_french_data, maxlen=16, padding='post')
sequence_english_data = pad_sequences(sequence_english_data, maxlen=16, padding='post')

#adding a trailing axis to the English targets: (samples, 16) -> (samples, 16, 1)
sequence_english_data = sequence_english_data.reshape(*sequence_english_data.shape, 1)

In [10]:
#french vocab length
french_vocab = len(french_tokenizer.word_index)+1

#english vocab length
english_vocab = len(english_tokenizer.word_index)+1

#making sure the French data has shape (samples, 16)
sequence_french_data = sequence_french_data.reshape((-1, sequence_english_data.shape[-2]))

### Model

We build a sequential model around a GRU layer. The model is trained for 4 epochs, because it starts to overfit after that. The hidden dense layer uses the ReLU activation function and the output layer uses softmax.

In [11]:
#creating model
model = Sequential()

#adding embedding layer
model.add(Embedding(french_vocab,512, input_length=sequence_french_data.shape[1], input_shape=sequence_french_data.shape[1:]))

#gru network
model.add(GRU(512, return_sequences=True))    
model.add(TimeDistributed(Dense(1024, activation='relu')))
model.add(Dropout(0.4))
model.add(TimeDistributed(Dense(english_vocab, activation='softmax'))) 

#model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 16, 512)           176640    
_________________________________________________________________
gru (GRU)                    (None, 16, 512)           1575936   
_________________________________________________________________
time_distributed (TimeDistri (None, 16, 1024)          525312    
_________________________________________________________________
dropout (Dropout)            (None, 16, 1024)          0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 16, 200)           205000    
Total params: 2,482,888
Trainable params: 2,482,888
Non-trainable params: 0
_________________________________________________________________
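The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below does the arithmetic. The vocabulary sizes are inferred from the summary, and the GRU formula assumes the variant with separate input and recurrent bias vectors, which is the one that matches the reported count.

```python
# Sizes read off the summary above; the vocabulary sizes are inferred from it.
french_vocab = 345    # 176,640 embedding parameters / 512 dimensions
english_vocab = 200   # width of the final softmax layer
embed_dim = 512
gru_units = 512
dense_units = 1024

# One 512-dimensional vector per French word id.
embedding = french_vocab * embed_dim

# Three weight blocks (update gate, reset gate, candidate state), each with
# input weights, recurrent weights and two bias vectors (assumed GRU variant).
gru = 3 * (gru_units * (embed_dim + gru_units) + 2 * gru_units)

# Dense layers: weights plus one bias per output unit.
hidden = gru_units * dense_units + dense_units
output = dense_units * english_vocab + english_vocab

total = embedding + gru + hidden + output
print(embedding, gru, hidden, output, total)
```

Most of the parameters sit in the GRU layer, so changing the number of GRU units is the quickest way to grow or shrink this model.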


In [12]:
#compiling the model with its loss, optimizer and accuracy metric
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

### Model training

After training, the model shows an accuracy of 84%. This is the per-token accuracy that Keras reports on the training data. It may be improved by adding more data.
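Because the targets are padded with zeros up to 16 tokens and no masking is configured, a per-token accuracy also counts the padding positions, which are usually easy to predict. The sketch below shows how the same predictions score with and without padding. The function `token_accuracy` and the toy ids are made up for illustration and are not the metric code Keras runs.

```python
def token_accuracy(y_true, y_pred, ignore_padding=False, pad_id=0):
    """Fraction of positions where the predicted id equals the target id."""
    pairs = [(t, p)
             for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)
             if not (ignore_padding and t == pad_id)]
    return sum(t == p for t, p in pairs) / len(pairs)


# One toy sentence padded to length 8: two of four real words are correct.
y_true = [[4, 7, 2, 9, 0, 0, 0, 0]]
y_pred = [[4, 7, 3, 5, 0, 0, 0, 0]]

with_padding = token_accuracy(y_true, y_pred)
without_padding = token_accuracy(y_true, y_pred, ignore_padding=True)
print(with_padding, without_padding)
```

A held-out validation set and a padding-aware metric would give a more conservative picture of translation quality than the training accuracy alone.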

In [13]:
#training the model for four epochs, as it overfits after that
model.fit(sequence_french_data, sequence_english_data, batch_size=1024, epochs=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x134be30a0>

### Saving model

The model is saved so that we don't have to train it again and again. We can use the saved model for predicting outputs.

In [14]:
#saving model
model.save('my_model.h5')

### Output prediction

In [15]:
#output prediction
def output():
    
    #loading the saved model
    saved_model = load_model("my_model.h5")
    #reading a sentence in French
    sentence = input ("Enter sentence in french language : ")
    #removing punctuation from the French sentence
    input1 = str.maketrans('', '', string.punctuation+ '“”')
    sentence = sentence.translate(input1)
    
    #encoding the sentence with the French tokenizer
    encoded = french_tokenizer.texts_to_sequences([sentence])[0]
    #padding the encoded sequence to 16 tokens
    input_sentence = pad_sequences([encoded], maxlen=16, padding='post')
    
    #predicting word probabilities for each of the 16 positions
    prediction = saved_model.predict(input_sentence)[0]
    
    #decoding the output from word ids back to words (id 0 is padding)
    index_to_words = {index: word for word, index in english_tokenizer.word_index.items()}
    index_to_words[0] = ''
    translation = ' '.join([index_to_words[predict] for predict in np.argmax(prediction, 1)])
    
    return translation
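The decoding step at the end of the function is a greedy one: at each of the 16 positions it takes the word with the highest probability. A minimal pure-Python sketch of that idea follows. The tiny vocabulary and probability rows are invented for illustration, and unlike the function above this version drops the empty padding entries instead of joining them.

```python
def greedy_decode(prediction, index_to_word):
    """Pick the most probable word id at each time step and join the words."""
    ids = [max(range(len(step)), key=step.__getitem__) for step in prediction]
    words = [index_to_word.get(i, '') for i in ids]
    # Skip padding (id 0 maps to '') so the result has no trailing spaces.
    return ' '.join(w for w in words if w)


# Toy vocabulary: id 0 is padding, as in the notebook.
index_to_word = {0: '', 1: 'paris', 2: 'is', 3: 'cold'}

# One row of probabilities per time step, one column per word id.
prediction = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.2, 0.7],
    [0.9, 0.0, 0.1, 0.0],
]

decoded = greedy_decode(prediction, index_to_word)
print(decoded)
```

Greedy decoding is simple and fast, but each position is chosen independently, so one wrong word is never corrected by its neighbours.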


### Outputs

The outputs are shown below. The loop runs until no is entered in the input field. To translate a sentence, enter yes.

In [16]:
#starting model
print('Welcome to the language translation model from French to English \n')
print('Note: Maximum 16 words in a sentence \n')

#flag that keeps the loop running; set to False to stop
i = True

#loop until the user no longer wants to translate from French to English
while i:
    
    #asking whether to translate another sentence (yes) or stop (no)
    start = input ("Do you want to start model (y/n) : ")
    print("\n")
    #if yes, read a French sentence and print its translation
    if start in ('y', 'Y', 'yes', 'Yes'):
        print("Output in english language :  {}\n".format(output()))
    #if no, stop the loop
    elif start in ('n', 'N', 'No', 'no'):
        print("You closed the model. Thankyou for using the model. \n")
        i = False
    else:
        #any input other than yes or no
        print("Invalid input")

Welcome to the language translation model from French to English 

Note: Maximum 16 words in a sentence 

Do you want to start model (y/n) : y


Enter sentence in french language : paris est relaxant en décembre , mais il est généralement froid en juillet
Output in english language :  paris is relaxing during december but it is usually chilly in july    

Do you want to start model (y/n) : y


Enter sentence in french language : ce chat était mon animal le plus aimé
Output in english language :  that cat was my most feared animal         

Do you want to start model (y/n) : y


Enter sentence in french language : elle déteste ce petit camion rouge
Output in english language :  she dislikes that little black truck          

Do you want to start model (y/n) : y


Enter sentence in french language : je prévois de visiter la californie en mai prochain
Output in english language :  i plan to visit france in april         

Do you want to start model (y/n) : n


You closed the model. Thanky

## Insight

This model gives an accuracy of 84%, which may be improved by adding more data to the training process. The sample translations above still contain errors, for example `rouge` translated as `black`.

## Recommendation

I highly recommend using the GRU architecture, as it has given better accuracy on the same data.

## Conclusion

This model can be added to an Android application and brought into use in the near future, once its accuracy has been improved by adding more data to the data already available.