# Machine Translation

A deep neural network that accepts English text as input and returns its French translation.

This notebook is based on the Natural Language Processing [capstone project](https://github.com/udacity/aind2-nlp-capstone) of [Udacity's Artificial Intelligence Nanodegree](https://www.udacity.com/course/artificial-intelligence-nanodegree--nd889).

Dataset: a reduced-vocabulary set taken from [WMT](http://www.statmt.org/). The `small_vocab_en` file contains English sentences, and the `small_vocab_fr` file contains their French translations. The sentences have already been preprocessed: punctuation has been delimited with spaces, and all text has been converted to lowercase.
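
The dataset files are already normalized, so the notebook does not need the step below. As an illustration only, here is a minimal sketch of how an English sentence could be lowercased and its punctuation delimited with spaces. The function name `normalize` and the set of punctuation marks are assumptions, not the procedure actually used to build the dataset.

```python
import re


def normalize(sentence):
    """Lowercase a sentence and put spaces around punctuation marks."""
    lowered = sentence.lower()
    # Surround each punctuation mark with spaces so it becomes its own token.
    spaced = re.sub(r"([.,!?])", r" \1 ", lowered)
    # Collapse the extra whitespace introduced above.
    return " ".join(spaced.split())


example = normalize("New Jersey is sometimes quiet during autumn, and it is snowy in April.")
print(example)
```

Applied to this sentence, the sketch yields the same layout as the first English sample in the dataset. French apostrophes such as `l'` would need additional handling that is not shown here.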

## Load the data

In [1]:
with open('data/small_vocab_en', "r") as f:
    english_sentences = f.read().split('\n')
with open('data/small_vocab_fr', "r") as f:
    french_sentences = f.read().split('\n')  
    
for i in range(2):
    print("sample {}:".format(i))
    print("{}  \n{} \n".format(english_sentences[i], french_sentences[i]))

sample 0:
new jersey is sometimes quiet during autumn , and it is snowy in april .  
new jersey est parfois calme pendant l' automne , et il est neigeux en avril . 

sample 1:
the united states is usually chilly during july , and it is usually freezing in november .  
les états-unis est généralement froid en juillet , et il gèle habituellement en novembre . 



In [2]:
import collections

words = dict()
words["English"] = [word for sentence in english_sentences for word in sentence.split()]
words["French"] = [word for sentence in french_sentences for word in sentence.split()]

for key, value in words.items():
    print("{}: {} words, {} unique words".format(key, len(value),len(collections.Counter(value))))

English: 1823250 words, 227 unique words
French: 1961295 words, 355 unique words


## Preprocess

### Tokenize
Convert each word to a numerical word id. Word-level ids keep the model's complexity low.

In [5]:
from keras.preprocessing.text import Tokenizer


def tokenize(x):
    """
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    tokens = tokenizer.texts_to_sequences(x)
    
    return tokens, tokenizer
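
To make the idea of word ids concrete, the following is a simplified pure-Python stand-in, not the Keras `Tokenizer` implementation. It assigns ids by descending word frequency, starting at 1 so that 0 stays free for padding. The helper names `build_word_index` and `texts_to_ids` are hypothetical, and the sketch skips the lowercasing and punctuation filtering that the Keras `Tokenizer` applies by default.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each word to an integer id, most frequent word first, starting at 1."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # most_common() orders by descending count; id 0 is left free for padding.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(sentences, word_index):
    """Replace every known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in s.split() if w in word_index] for s in sentences]


toy_sentences = ["the cat sat", "the dog sat on the mat"]
toy_index = build_word_index(toy_sentences)
toy_ids = texts_to_ids(toy_sentences, toy_index)
print(toy_index)
print(toy_ids)
```

Here `the` is the most frequent word and receives id 1, followed by `sat` with id 2.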

### Padding
When batching sequences of word ids together, every sequence must have the same length. Since sentences vary in length, we pad the end of each sequence so that all of them reach the same length.

In [6]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences


def pad(x, length=None):
    """
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    return pad_sequences(x, maxlen=length, padding='post')
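
The effect of post-padding can be sketched without Keras. The helper `pad_post` below is hypothetical and only illustrates the behavior described above: it appends zeros until every sequence reaches the target length. Truncation of over-long sequences is deliberately left out.

```python
def pad_post(sequences, length=None, pad_id=0):
    """Right-pad every sequence with pad_id up to a common length."""
    if length is None:
        # Default to the length of the longest sequence.
        length = max(len(seq) for seq in sequences)
    # Assumes no sequence is longer than `length`; truncation is not handled here.
    return [list(seq) + [pad_id] * (length - len(seq)) for seq in sequences]


padded = pad_post([[1, 3, 2], [1, 4, 2, 5, 1, 6]])
print(padded)
```

The shorter sequence gains three trailing zeros, while the longest one is left unchanged.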

### Preprocess Pipeline

In [8]:
def preprocess(x, y):
    """
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk
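
The comment in `preprocess` refers to sparse categorical cross-entropy, where labels stay integer ids instead of one-hot vectors. As a rough sketch of what that loss measures for a single sentence, the hypothetical function below averages the negative log-probability assigned to the correct id at each time step. Each label is wrapped in a list to mirror the trailing dimension of 1 added by the reshape. This is an illustration in plain Python, not the Keras implementation.

```python
import math


def sparse_crossentropy(label_ids, probabilities):
    """Average of -log p(correct id) over the time steps of one sentence.

    label_ids:     list of [id] entries, one per time step
    probabilities: list of per-step probability distributions over the vocabulary
    """
    losses = [-math.log(step_probs[label[0]])
              for label, step_probs in zip(label_ids, probabilities)]
    return sum(losses) / len(losses)


# Toy example: two time steps, vocabulary of three ids.
toy_labels = [[2], [0]]
toy_probs = [[0.1, 0.2, 0.7],
             [0.5, 0.25, 0.25]]
toy_loss = sparse_crossentropy(toy_labels, toy_probs)
print(round(toy_loss, 4))
```

The loss shrinks as the model puts more probability on the correct word id at each step.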



In [9]:
preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)

print('Data Preprocessed')

Data Preprocessed


## Model


### Ids Back to Text
The function `logits_to_text` bridges the gap between the logits from the neural network and the French translation.

In [11]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
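
The same decoding idea can be tried on toy data without NumPy or a fitted tokenizer. In the sketch below, `ids_to_text`, the three-word vocabulary, and the scores are all made up for illustration. At every time step the id with the highest score is chosen and mapped back to its word.

```python
def ids_to_text(step_scores, word_index):
    """Pick the highest-scoring id at each time step and join the words."""
    index_to_word = {idx: word for word, idx in word_index.items()}
    index_to_word[0] = '<PAD>'
    # Pure-Python argmax over each row of scores.
    best_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in step_scores]
    return ' '.join(index_to_word[i] for i in best_ids)


toy_word_index = {'il': 1, 'est': 2, 'calme': 3}  # hypothetical vocabulary
toy_scores = [
    [0.1, 0.7, 0.1, 0.1],  # highest score at id 1
    [0.1, 0.1, 0.6, 0.2],  # highest score at id 2
    [0.0, 0.1, 0.1, 0.8],  # highest score at id 3
    [0.9, 0.0, 0.1, 0.0],  # highest score at id 0 (padding)
]
decoded = ids_to_text(toy_scores, toy_word_index)
print(decoded)
```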

### RNN Model 
A model that combines an encoder-decoder architecture, an embedding layer, and bidirectional RNNs. An embedding is a vector representation of a word that lies close to similar words in n-dimensional space, where n is the size of the embedding vectors. The encoder compresses the sentence into a fixed-size representation. The decoder takes this representation as input and predicts the translation as output.
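
What "close to similar words" means can be illustrated with cosine similarity. The three-dimensional vectors below are hand-picked, hypothetical values; in the model the embedding vectors are learned during training and their size is set by `vector_size`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, chosen by hand for illustration only.
toy_embeddings = {
    'april': [0.9, 0.1, 0.0],
    'july':  [0.8, 0.2, 0.1],
    'truck': [0.0, 0.1, 0.9],
}
sim_months = cosine_similarity(toy_embeddings['april'], toy_embeddings['july'])
sim_mixed = cosine_similarity(toy_embeddings['april'], toy_embeddings['truck'])
print(sim_months > sim_mixed)
```

With these toy values the two month names point in nearly the same direction, while `april` and `truck` are almost orthogonal.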

In [12]:
from keras.layers import GRU, Input, Dense, TimeDistributed, LSTM, Bidirectional, RepeatVector, Activation
# Embedding and Dropout are imported from keras.layers directly; this path also works on newer Keras releases
from keras.layers import Embedding, Dropout
from keras.models import Model, Sequential
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy


def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a model that incorporates embedding, encoder-decoder, and bidirectional RNN layers
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate = 0.01
    
    model = Sequential()
    
    vector_size = english_vocab_size // 10    
    
    model.add(Embedding(english_vocab_size, vector_size, input_shape = input_shape[1:], mask_zero=False))   
    model.add(Bidirectional(GRU(output_sequence_length)))
    model.add(Dense(128, activation='relu'))    

    model.add(RepeatVector(output_sequence_length))
    model.add(Bidirectional(GRU(128, return_sequences=True)))
    model.add(TimeDistributed(Dense(french_vocab_size, activation="softmax")))
    
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model


print('Final Model Loaded')

Final Model Loaded


## Prediction

In [13]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences


def final_predictions(x, y, x_tk, y_tk):
    """
    Gets predictions using the final model
    :param x: Preprocessed English data
    :param y: Preprocessed French data
    :param x_tk: English tokenizer
    :param y_tk: French tokenizer
    """
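    # Build the network with model_final; it is trained further below.
    # Note: Keras word ids start at 1 (0 is reserved for padding), so
    # len(word_index) + 1 is the usual choice of vocabulary size. The values
    # passed here are the ones that produced the summary shown below.
    model = model_final(
        x.shape,
        y.shape[1],
        len(x_tk.word_index),
        len(y_tk.word_index))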
    
    model.summary()  # summary() prints the table itself and returns None
    model.fit(x, y, batch_size=1024, epochs=10, validation_split=0.2)

    print(logits_to_text(model.predict(x[:1])[0], y_tk))
    

    
    ## DON'T EDIT ANYTHING BELOW THIS LINE
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'

    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, len(sentences))

    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
    # y holds word ids (not scores), so read each id directly instead of taking an argmax
    print(' '.join([y_id_to_word[int(word_id[0])] for word_id in y[0]]))


final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 15, 19)            3781      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 42)                5166      
_________________________________________________________________
dense_1 (Dense)              (None, 128)               5504      
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 21, 128)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 21, 256)           197376    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 21, 344)           88408     
=================================================================
Total params: 300,235.0
Trainable params: 300,235.0
Non-trainable params: 0.0
_________________________________________________________________
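
The parameter counts in the summary can be reproduced by hand, which is a useful check on the layer sizes. The sketch below assumes the GRU formulation with a single bias vector per gate, which matches the counts shown here; newer Keras releases default to `reset_after=True`, which adds a second set of biases and gives different totals. The input sizes 199 and 19 are inferred from the summary (3781 = 199 × 19, and 199 // 10 = 19). The helper names are hypothetical.

```python
def embedding_params(vocab_size, vector_size):
    """One vector of length vector_size per vocabulary entry."""
    return vocab_size * vector_size


def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


def gru_params(inputs, units):
    """Update gate, reset gate and candidate state, each with input weights,
    recurrent weights and a single bias vector (an assumption, see above)."""
    return 3 * (units * (inputs + units) + units)


counts = {
    'embedding_1': embedding_params(199, 19),
    'bidirectional_1': 2 * gru_params(19, 21),     # forward + backward GRU
    'dense_1': dense_params(42, 128),              # 42 = 2 * 21 concatenated outputs
    'repeat_vector_1': 0,                          # no trainable weights
    'bidirectional_2': 2 * gru_params(128, 128),
    'time_distributed_1': dense_params(256, 344),  # same Dense applied at every step
}
total = sum(counts.values())
for name, value in counts.items():
    print(name, value)
print('total', total)
```

Each entry agrees with the corresponding row of the summary, and the sum matches the reported total of 300,235 parameters.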

## Results

**Validation accuracy: > 93%**

A)
_Target:_     il a vu un vieux camion jaune . <br>
_Predicted:_  il a vu un vieux camion jaune <br>

B)
_Target:_    new jersey est parfois calme pendant l' automne , et il est neigeux en avril . <br>
_Predicted:_ new jersey est parfois calme pendant l'automne et il est neigeux en avril <br>