# <center><u> <h1>English-to-French Translator</h1></u></center>

Savez-vous comment créer une application de traduction de langue?<br>
Did you understand the above sentence?<br>
Well, after googling it, I found that it means:<br>
Do you know how to create a language translator app?<br>

We all know Google Translate, which lets us convert text from one language to another and is very useful for learning and understanding new languages.
<br>
<br>


![](https://daleonai.com/images/2019-11-05-improving-machine-translation-with-the-google-translation-api-advanced/1.png)


In this project, I aim to translate English phrases into French using a recurrent neural network (RNN), a type of deep learning model.

## Introduction
In this notebook, I built a deep neural network that functions as part of an end-to-end machine translation pipeline. The completed pipeline will accept English text as input and return the French translation.

- Preprocess: convert the text to sequences of integers.
- Models: create a model that accepts a sequence of integers as input and returns a probability distribution over possible translations.
- Prediction: run the model on English text.

In [1]:
# Now import the required libraries
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import sparse_categorical_crossentropy
import tensorflow as tf

In [2]:
# Importing Google Drive for Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load Data
The data is stored in two files, small_vocab_en and small_vocab_fr, which are loaded here from Google Drive. The small_vocab_en file contains English sentences, and the small_vocab_fr file contains their French translations.

In [3]:
# Define the paths for English and French data
english_data = "/content/drive/MyDrive/Deep Learning Projects/English To French Translator/small_vocab_en.txt"
french_data = "/content/drive/MyDrive/Deep Learning Projects/English To French Translator/small_vocab_fr.txt"

- The load_data function opens the file at the given path and reads its whole content as a single string.
- It then splits that string on newline characters with split("\n") and returns the resulting list, in which each line of the file is its own element.

In [4]:
# Define a function that takes the path to a text file
def load_data(path):
  # Open the file as UTF-8 so accented French characters are read correctly
  with open(path, "r", encoding="utf-8") as f:
    # Read the whole file as a single string
    data = f.read()
  # Return a list with one line of the file per element
  return data.split("\n")
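One detail worth knowing about splitting on "\n": a file that ends with a newline produces a trailing empty string. The sketch below is not part of the original notebook. It writes a hypothetical two-line file (sample_fr.txt) to a temporary directory, loads it the same way through a helper named load_lines (also hypothetical), and filters out empty lines.

```python
import os
import tempfile

def load_lines(path):
    # Same approach as load_data: read everything, then split on newlines.
    with open(path, "r", encoding="utf-8") as f:
        return f.read().split("\n")

# Hypothetical sample file, created only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_fr.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("il fait froid .\nelle aime les pommes .\n")
    lines = load_lines(sample_path)

print(lines)

# The final empty string comes from the newline at the end of the file.
cleaned = [line for line in lines if line]
print(cleaned)
```

Whether the empty entry matters depends on the later steps; it does not add any word to the vocabulary, but it does count as a sentence.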

Now load the English and French data into variables.

In [5]:
# Load English and French data
english_sentences = load_data(english_data)
french_sentences = load_data(french_data)

## Analysis of Dataset
Let's look at a few examples from the dataset in both languages.

In [6]:
# Print sample sentences from both languages
for i in range(5):
    print("Sample: ", i)
    print("English: ", english_sentences[i])
    print("French: ", french_sentences[i])
    print("-" * 50)

Sample:  0
English:  new jersey is sometimes quiet during autumn , and it is snowy in april .
French:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
--------------------------------------------------
Sample:  1
English:  the united states is usually chilly during july , and it is usually freezing in november .
French:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
--------------------------------------------------
Sample:  2
English:  california is usually quiet during march , and it is usually hot in june .
French:  california est généralement calme en mars , et il est généralement chaud en juin .
--------------------------------------------------
Sample:  3
English:  the united states is sometimes mild during june , and it is cold in september .
French:  les états-unis est parfois légère en juin , et il fait froid en septembre .
--------------------------------------------------
Sample:  4
English:  your le

## Convert to Vocabulary
The complexity of the problem is determined by the complexity of the vocabulary: the more distinct words the model has to handle, the harder the problem. Let's look at the complexity of the dataset.

In [7]:
# Import collections for counting words
import collections

In [8]:
# Count the number of unique words in English and French
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
print("English Vocabulary Size:", len(english_words_counter))
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])
print("French Vocabulary Size:", len(french_words_counter))

English Vocabulary Size: 227
French Vocabulary Size: 355
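Vocabulary size is only one view of complexity; how often each word occurs is another. The following is a minimal sketch that uses two hand-written sentences in place of the real dataset (sample_sentences is a hypothetical stand-in for english_sentences) and reports the total word count, the number of unique words, and the most frequent words.

```python
import collections

# Hypothetical stand-in for english_sentences.
sample_sentences = [
    "new jersey is sometimes quiet during autumn",
    "california is usually quiet during march",
]

# Count every whitespace-separated word across all sentences.
counter = collections.Counter(
    word for sentence in sample_sentences for word in sentence.split()
)

total_words = sum(counter.values())
print("Total words:", total_words)
print("Unique words:", len(counter))
print("Most common:", counter.most_common(3))
```

The same three lines of reporting can be applied to english_words_counter and french_words_counter from the cell above.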


## Tokenize (IMPLEMENTATION)
For a neural network to predict on text data, the text first has to be turned into data the network can understand. Text data like "dog" is a sequence of ASCII character encodings. Since a neural network is a series of multiplication and addition operations, the input data needs to be numbers.

We can turn each character into a number or each word into a number. These are called character and word ids, respectively. Character ids are used for character level models that generate text predictions for each character. A word level model uses word ids that generate text predictions for each word. Word level models tend to learn better, since they are lower in complexity.

Turn each sentence into a sequence of word ids using Keras's Tokenizer class.

In [9]:
# Define a function to tokenize text
def tokenize(x):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer

- The code tokenizes the sentences in the text_sentences list, which returns the sequences of word ids and the fitted tokenizer.<br>
- Next, it prints the tokenizer's word index, which maps each word to its id.

In [10]:
# Tokenize sample text
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .'
]
text_tokenized, text_tokenizer = tokenize(text_sentences)
print("Word Index:")
print(text_tokenizer.word_index)
print()

Word Index:
{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}
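To make the word-to-id mapping concrete, here is a minimal pure-Python sketch of the idea. It is not Keras's implementation: it only lowercases and splits on whitespace (the Keras Tokenizer also filters punctuation), and the helper names build_word_index and to_ids are hypothetical. More frequent words receive smaller ids, and id 0 is left unused so it can serve as padding later.

```python
from collections import Counter

def build_word_index(sentences):
    # Count lowercased, whitespace-separated words across all sentences.
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Assign ids starting at 1, most frequent word first.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_ids(sentences, word_index):
    # Replace each known word by its id; unknown words are skipped.
    return [
        [word_index[w] for w in s.lower().split() if w in word_index]
        for s in sentences
    ]

sample = ["the quick brown fox", "the lazy dog"]
word_index = build_word_index(sample)
ids = to_ids(sample, word_index)

print(word_index)
print(ids)
```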



## Padding (IMPLEMENTATION)
When batching sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we add padding to the end of the sequences to make them the same length.

We make sure all the English sequences have the same length and all the French sequences have the same length by adding padding to the end of each sequence using Keras's pad_sequences function.

In [11]:
# Define a function to pad sequences
def pad(x, length=None):
    return pad_sequences(x, maxlen=length, padding="post")
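As an illustration of what padding="post" does, here is a small pure-Python sketch. The helper pad_post is hypothetical and works on plain lists rather than NumPy arrays; it appends zeros to the end of each sequence and, like the Keras default, drops ids from the front when a sequence is longer than the requested length.

```python
def pad_post(sequences, length=None, value=0):
    # Default to the longest sequence, as pad_sequences does when maxlen is None.
    if length is None:
        length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep at most the last `length` ids (truncation from the front).
        kept = list(seq)[-length:] if length > 0 else []
        # Append the padding value until the target length is reached.
        padded.append(kept + [value] * (length - len(kept)))
    return padded

full = pad_post([[5, 3, 8], [7]])
short = pad_post([[5, 3, 8], [7]], length=2)

print(full)
print(short)
```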

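- The preprocess function prepares both datasets. It tokenizes the English and French sentences into sequences of word ids, then passes them to pad, which appends zeros to each sequence until all sequences in a language have the same length.
- Finally, the French labels are reshaped to add a trailing dimension of size 1, which gives them the three-dimensional shape used with the sparse_categorical_crossentropy loss here.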

In [12]:
# Define a preprocessing function
def preprocess(x, y):
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

In [13]:
# Preprocess the data
preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer = preprocess(english_sentences, french_sentences)

In [14]:
# Print preprocessing information
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print("Data Preprocessed")
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344


## Create Model

I will begin by training four relatively simple architectures:
- Model 1 is a simple RNN
- Model 2 is an RNN with Embedding
- Model 3 is a Bidirectional RNN
- Model 4 is an optional Encoder-Decoder RNN

After experimenting with the four simple architectures, I will construct a deeper architecture that is designed to outperform all four models.

## Ids Back to Text
The neural network translates the input into word ids, which isn't the final form we want. We want the French translation. The function logits_to_text bridges the gap between the logits from the neural network and the French translation.

In [15]:
# Define a function to convert logits to text
def logits_to_text(logits, tokenizer):
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return " ".join([index_to_words[prediction] for prediction in np.argmax(logits, axis=1)])
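The decoding step can be illustrated without a trained model. The sketch below uses hand-made probabilities and a hypothetical four-entry vocabulary (neither comes from the notebook), and a helper named decode that picks the most likely id at each time step in plain Python, which is what np.argmax(logits, axis=1) does for a 2D array.

```python
def decode(probabilities, index_to_word):
    # For each time step, pick the id with the highest probability (argmax).
    ids = [max(range(len(step)), key=step.__getitem__) for step in probabilities]
    return " ".join(index_to_word[i] for i in ids)

# Hypothetical vocabulary; id 0 is reserved for padding.
index_to_word = {0: "<PAD>", 1: "il", 2: "est", 3: "facile"}

# One row per time step, one column per vocabulary id.
probabilities = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.1, 0.8],
    [0.9, 0.0, 0.1, 0.0],
]

decoded = decode(probabilities, index_to_word)
print(decoded)
```

The trailing <PAD> token appears because the last time step is most likely padding; such tokens need to be removed before showing the translation.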

## Building Model
Here I use an RNN built from GRU units for translation.

- The function takes the input shape, the output sequence length, and the English and French vocabulary sizes. It builds a Sequential Keras model: an Embedding layer maps each English word id to a 256-dimensional vector, a GRU layer with 256 units returns an output at every time step, and two TimeDistributed Dense layers (with a Dropout of 0.5 between them) produce a softmax distribution over the French vocabulary at each position. The model is compiled with the sparse_categorical_crossentropy loss and the Adam optimizer with a learning rate of 0.005. The function returns the compiled model; training happens in a later cell.

In [16]:
# Define the embedding model
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build an RNN model that uses a word embedding on the English input
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate = 0.005
    model = Sequential()
    # Map each English word id to a 256-dimensional vector
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1]))
    # Return the GRU output at every time step, not only the last one
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation="relu")))
    model.add(Dropout(0.5))
    # Softmax over the French vocabulary at each position
    model.add(TimeDistributed(Dense(french_vocab_size, activation="softmax")))
    model.compile(loss=sparse_categorical_crossentropy, optimizer=Adam(learning_rate), metrics=["accuracy"])
    return model

In [17]:
# Pad the English input to the French sequence length so input and output have the same number of time steps
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

Finally, call the model function and train the model.

In [18]:
# Train the model
simple_rnn_model = embed_model(tmp_x.shape, preproc_french_sentences.shape[1], len(english_tokenizer.word_index) + 1, len(french_tokenizer.word_index) + 1)
history = simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=20, validation_split=0.2)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Printing model summary

In [19]:
#print Model summary
simple_rnn_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 21, 256)           51200     
                                                                 
 gru (GRU)                   (None, 21, 256)           394752    
                                                                 
 time_distributed (TimeDist  (None, 21, 1024)          263168    
 ributed)                                                        
                                                                 
 dropout (Dropout)           (None, 21, 1024)          0         
                                                                 
 time_distributed_1 (TimeDi  (None, 21, 345)           353625    
 stributed)                                                      
                                                                 
Total params: 1062745 (4.05 MB)
Trainable params: 106274
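The per-layer parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the Keras GRU default of reset_after=True, which gives each of the three weight blocks two bias vectors; the variable names are chosen for this illustration only.

```python
vocab_en = 200      # English vocabulary size plus 1 for the padding id (199 + 1)
vocab_fr = 345      # French vocabulary size plus 1 for the padding id (344 + 1)
embed_dim = 256
gru_units = 256
dense_units = 1024

# One 256-dimensional vector per English word id.
embedding = vocab_en * embed_dim

# Three weight blocks (update gate, reset gate, candidate state), each with
# input weights, recurrent weights and two bias vectors (reset_after=True assumed).
gru = 3 * (embed_dim * gru_units + gru_units * gru_units + 2 * gru_units)

# Dense layers: weights plus one bias per output unit.
dense_hidden = gru_units * dense_units + dense_units
dense_output = dense_units * vocab_fr + vocab_fr

total = embedding + gru + dense_hidden + dense_output
print(embedding, gru, dense_hidden, dense_output, total)
```

Dropout adds no parameters, which is why it shows 0 in the summary.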

## Saving our model

In [20]:
# Save the model
simple_rnn_model.save('model.h5')



## Arbitrary Predictions
Now we run predictions with the model on user input.

In [21]:
# Define the final predictions function
def final_predictions(text):
    # Let the tokenizer lowercase the text and map words to ids; words not seen in training are skipped
    sentence = english_tokenizer.texts_to_sequences([text])
    # Pad to the sequence length the model was trained on
    sentence = pad_sequences(sentence, maxlen=preproc_french_sentences.shape[-2], padding='post')
    # Predict, convert the output to words, and drop the padding tokens
    translated_text = logits_to_text(simple_rnn_model.predict(sentence[:1])[0], french_tokenizer)
    translated_text = " ".join([word for word in translated_text.split() if word != '<PAD>'])
    return translated_text

In [22]:
final_predictions(input())

it's easy


'il est facile'
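The model can only translate words it saw during training, so it can help to check the input before predicting. The following is a minimal sketch, assuming a word-to-id dictionary like the one exposed by english_tokenizer.word_index; the function split_known_unknown and the miniature vocabulary toy_word_index are hypothetical and exist only for this illustration.

```python
def split_known_unknown(text, word_index):
    # Separate the words that have an id from those that do not.
    known_ids, unknown_words = [], []
    for word in text.lower().split():
        if word in word_index:
            known_ids.append(word_index[word])
        else:
            unknown_words.append(word)
    return known_ids, unknown_words

# Hypothetical miniature vocabulary, standing in for english_tokenizer.word_index.
toy_word_index = {"it's": 1, "easy": 2, "she": 3, "likes": 4}

ids, unknown = split_known_unknown("It's easy today", toy_word_index)
print("Known ids:", ids)
print("Unknown words:", unknown)
```

Reporting the unknown words back to the user is one way to explain why part of a sentence was left out of the translation.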

## Implementation
Enter your input here to get predictions. We will use Gradio for the implementation part. Refer to the video for detailed information.

In [23]:
# Install Gradio
!pip install gradio



After installing, we import gradio as gr

- The code wraps final_predictions in a translate_text function and passes it to gr.Interface.
- Both the input and the output of the interface are plain "text" components.
- Calling launch() starts the app; in Colab this also creates a temporary public share link.

In [24]:
# Import Gradio
import gradio as gr

In [25]:
# Define the Gradio interface
def translate_text(text):
    translated_text = final_predictions(text)
    return translated_text

gr.Interface(fn=translate_text, inputs="text", outputs="text").launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://3ad4ba87b65abdf3e9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


