# Artificial Intelligence Nanodegree
## Machine Translation Project
In this notebook, sections that end with **'(IMPLEMENTATION)'** in the header indicate that the following blocks of code will require additional functionality which you must provide. Please be sure to read the instructions carefully!

## Introduction
In this notebook, you will build a deep neural network that functions as part of an end-to-end machine translation pipeline. Your completed pipeline will accept English text as input and return the French translation.

- **Preprocess** - You'll convert text to sequences of integers.
- **Models** - You'll create models that accept a sequence of integers as input and return a probability distribution over possible translations. After learning about the basic types of neural networks that are often used for machine translation, you will carry out your own investigations and design your own model!
- **Prediction** - You'll run the model on English text.
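
As a rough illustration of the preprocessing step, the sketch below maps words to integer ids and pads each sentence to a fixed length. The helper names `build_vocab` and `encode` are hypothetical and are not part of `helper.py` or Keras; id 0 is reserved for padding by assumption.

```python
def build_vocab(sentences):
    """Map each distinct word to an integer id, reserving 0 for padding."""
    words = sorted({w for s in sentences for w in s.split()})
    return {w: i for i, w in enumerate(words, start=1)}


def encode(sentence, vocab, max_len):
    """Turn a sentence into a list of ids, right-padded with 0."""
    ids = [vocab[w] for w in sentence.split()]
    return ids + [0] * (max_len - len(ids))


demo_sentences = ["the cat sat", "the dog"]
demo_vocab = build_vocab(demo_sentences)
demo_encoded = [encode(s, demo_vocab, max_len=3) for s in demo_sentences]
print(demo_vocab)
print(demo_encoded)
```

In the notebook itself, the imported Keras `Tokenizer` and `pad_sequences` are the natural tools for this job; the sketch only shows the idea.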

In [None]:
%load_ext autoreload
%aimport helper, tests
%autoreload 1

In [None]:
import collections

import helper
import numpy as np
import re
#import project_tests as tests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import  Dropout, GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, SimpleRNN, LSTM
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
from keras.models import load_model

### Verify access to the GPU
The following test applies only if you expect to be using a GPU, e.g., while running in a Udacity Workspace or using an AWS instance with GPU support. Run the next cell, and verify that the device_type is "GPU".
- If the device is not GPU & you are running from a Udacity Workspace, then save your workspace with the icon at the top, then click "enable" at the bottom of the workspace.
- If the device is not GPU & you are running from an AWS instance, then refer to the cloud computing instructions in the classroom to verify your setup steps.

In [None]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

### Model 4: Encoder-Decoder (OPTIONAL)
Time to look at encoder-decoder models. This model is made up of an encoder and a decoder. The encoder creates a matrix representation of the sentence. The decoder takes this matrix as input and predicts the translation as output.

Create an encoder-decoder model in the cell below.

# Resources:

### Encoder Decoder
https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
https://keras.io/examples/nlp/lstm_seq2seq/

### Attention
https://github.com/keras-team/keras/issues/4962
https://stackoverflow.com/questions/42918446/how-to-add-an-attention-mechanism-in-keras
https://github.com/philipperemy/keras-attention-mechanism


### Importing data

In [None]:
# Importing the dataset (the context managers close each file after reading)
with open('data/movie_lines.txt', encoding='utf-8', errors='ignore') as f:
    lines = f.read().split('\n')
with open('data/movie_conversations.txt', encoding='utf-8', errors='ignore') as f:
    conversations = f.read().split('\n')


# Load English data
#english_sentences = helper.load_data('data/small_vocab_en')
# Load French data
#french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')

### Format data

In [None]:

# Creating a dictionary that maps each line and its id
id2line = {}
for line in lines:
    _line = line.split(' +++$+++ ')
    if len(_line) == 5:
        id2line[_line[0]] = _line[4]

# Creating a list of all of the conversations
conversations_ids = []
for conversation in conversations[:-1]:
    _conversation = conversation.split(' +++$+++ ')[-1][1:-1].replace("'", "").replace(" ", "")
    conversations_ids.append(_conversation.split(','))

# Getting separately the questions and the answers
questions = []
answers = []
for conversation in conversations_ids:
    for i in range(len(conversation) - 1):
        questions.append(id2line[conversation[i]])
        answers.append(id2line[conversation[i+1]])
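
To make the parsing above easier to follow, here is a small self-contained sketch of the same steps on made-up rows. The sample rows and the `demo_*` names are hypothetical; only the ` +++$+++ ` separator and the five-field layout are taken from the cell above.

```python
SEP = " +++$+++ "

# Hypothetical rows in the layout the formatting cell expects.
demo_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there.",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Hi!",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ How are you?",
]
demo_conversation = "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']"

# Map each line id to its text, skipping malformed rows.
demo_id2line = {}
for row in demo_lines:
    fields = row.split(SEP)
    if len(fields) == 5:
        demo_id2line[fields[0]] = fields[4]

# Strip the brackets, quotes, and spaces from the id list, then split it.
raw_ids = demo_conversation.split(SEP)[-1][1:-1]
demo_ids = raw_ids.replace("'", "").replace(" ", "").split(",")

# Each line is paired with the line that follows it in the conversation.
demo_pairs = [(demo_id2line[a], demo_id2line[b]) for a, b in zip(demo_ids, demo_ids[1:])]
print(demo_pairs)
```

Note that every line except the first and last appears twice: once as an answer and once as the next question.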


### Clean data

In [None]:
# Doing a first cleaning of the texts
def clean_text(text):
    text = text.lower()
    text = re.sub(r"ain't", "am not", text)
    text = re.sub(r"aren't", "are not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"can't've", "cannot have", text)
    text = re.sub(r"'cause", "because", text)
    text = re.sub(r"could've", "could have", text)
    text = re.sub(r"couldn't", "could not", text)
    text = re.sub(r"couldn't've", "could not have", text)
    text = re.sub(r"didn't", "did not", text)
    text = re.sub(r"doesn't", "does not", text)
    text = re.sub(r"don't", "do not", text)
    text = re.sub(r"hadn't", "had not", text)
    text = re.sub(r"hadn't've", "had not have", text)
    text = re.sub(r"hasn't", "has not", text)
    text = re.sub(r"haven't", "have not", text)
    text = re.sub(r"he'd", "he would", text)
    text = re.sub(r"he'd've", "he would have", text)
    text = re.sub(r"he'll", "he will", text)
    text = re.sub(r"he'll've", "he will have", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"how'd", "how did", text)
    text = re.sub(r"how'd'y", "how do you", text)
    text = re.sub(r"how'll", "how will", text)
    text = re.sub(r"how's", "how is", text)
    # The text is already lowercased, so the patterns must use a lowercase "i".
    text = re.sub(r"\bi'd've", "i would have", text)
    text = re.sub(r"\bi'd", "i would", text)
    text = re.sub(r"\bi'll've", "i will have", text)
    text = re.sub(r"\bi'll", "i will", text)
    text = re.sub(r"\bi'm", "i am", text)
    text = re.sub(r"\bi've", "i have", text)
    text = re.sub(r"isn't", "is not", text)
    text = re.sub(r"it'd", "it would", text)
    text = re.sub(r"it'd've", "it would have", text)
    text = re.sub(r"it'll", "it will", text)
    text = re.sub(r"it'll've", "it will have", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"let's", "let us", text)
    text = re.sub(r"ma'am", "madam", text)
    text = re.sub(r"mayn't", "may not", text)
    text = re.sub(r"might've", "might have", text)
    text = re.sub(r"mightn't", "might not", text)
    text = re.sub(r"mightn't've", "might not have", text)
    text = re.sub(r"must've", "must have", text)
    text = re.sub(r"mustn't", "must not", text)
    text = re.sub(r"mustn't've", "must not have", text)
    text = re.sub(r"needn't", "need not", text)
    text = re.sub(r"needn't've", "need not have", text)
    text = re.sub(r"o'clock", "of the clock", text)
    text = re.sub(r"oughtn't", "ought not", text)
    text = re.sub(r"oughtn't've", "ought not have", text)
    text = re.sub(r"shan't", "shall not", text)
    text = re.sub(r"sha'n't", "shall not", text)
    text = re.sub(r"shan't've", "shall not have", text)
    text = re.sub(r"she'd", "she would", text)
    text = re.sub(r"she'd've", "she would have", text)
    text = re.sub(r"she'll", "she will", text)
    text = re.sub(r"she'll've", "she will have", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"should've", "should have", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"shouldn't've", "should not have", text)
    text = re.sub(r"so've", "so have", text)
    text = re.sub(r"so's", "so as", text)
    text = re.sub(r"that'd", "that would", text)
    text = re.sub(r"that'd've", "that would have", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"there'd", "there would", text)
    text = re.sub(r"there'd've", "there would have", text)
    text = re.sub(r"there's", "there is", text)
    text = re.sub(r"they'd", "they would", text)
    text = re.sub(r"they'd've", "they would have", text)
    text = re.sub(r"they'll", "they will", text)
    text = re.sub(r"they'll've", "they will have", text)
    text = re.sub(r"they're", "they are", text)
    text = re.sub(r"they've", "they have", text)
    text = re.sub(r"to've", "to have", text)
    text = re.sub(r"wasn't", "was not", text)
    text = re.sub(r"we'd", "we would", text)
    text = re.sub(r"we'd've", "we would have", text)
    text = re.sub(r"we'll", "we will", text)
    text = re.sub(r"we'll've", "we will have", text)
    text = re.sub(r"we're", "we are", text)
    text = re.sub(r"we've", "we have", text)
    text = re.sub(r"weren't", "were not", text)
    text = re.sub(r"what'll", "what will", text)
    text = re.sub(r"what'll've", "what will have", text)
    text = re.sub(r"what're", "what are", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"what've", "what have", text)
    text = re.sub(r"when's", "when is", text)
    text = re.sub(r"when've", "when have", text)
    text = re.sub(r"where'd", "where did", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"where've", "where have", text)
    text = re.sub(r"who'll", "who will", text)
    text = re.sub(r"who'll've", "who will have", text)
    text = re.sub(r"who's", "who is", text)
    text = re.sub(r"who've", "who have", text)
    text = re.sub(r"why's", "why is", text)
    text = re.sub(r"why've", "why have", text)
    text = re.sub(r"will've", "will have", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"won't've", "will not have", text)
    text = re.sub(r"would've", "would have", text)
    text = re.sub(r"wouldn't", "would not", text)
    text = re.sub(r"wouldn't've", "would not have", text)
    text = re.sub(r"y'all", "you all", text)
    text = re.sub(r"y'all'd", "you all would", text)
    text = re.sub(r"y'all'd've", "you all would have", text)
    text = re.sub(r"y'all're", "you all are", text)
    text = re.sub(r"y'all've", "you all have", text)
    text = re.sub(r"you'd", "you would", text)
    text = re.sub(r"you'd've", "you would have", text)
    text = re.sub(r"you'll", "you will", text)
    text = re.sub(r"you'll've", "you will have", text)
    text = re.sub(r"you're", "you are", text)
    text = re.sub(r"you've", "you have", text)
    text = re.sub(r"[-()\"#/@;:<>{}+=~|.?,]", "", text)
    return text

# Cleaning the questions
clean_questions = []
for question in questions:
    clean_questions.append(clean_text(question))
    
# Cleaning the answers
clean_answers = []
for answer in answers:
    clean_answers.append(clean_text(answer))
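
One thing to watch in a long chain of `re.sub` calls is ordering: once `can't` has been rewritten to `cannot`, a later pattern such as `can't've` can no longer match. The sketch below is one possible alternative, not the notebook's method: it keeps the contractions in a dictionary and tries the longest keys first. The table is a hypothetical four-entry subset and `expand_contractions` is an illustrative name.

```python
import re

# Hypothetical subset of the contraction table. Longer forms share a prefix
# with shorter ones, so they have to be tried first.
CONTRACTIONS = {
    "can't": "cannot",
    "can't've": "cannot have",
    "i'd": "i would",
    "i'd've": "i would have",
}

# Build one alternation with the longest keys first, so that "can't've"
# wins over "can't" at the same position.
_contraction_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
)


def expand_contractions(text):
    """Lowercase the text and expand every known contraction in one pass."""
    return _contraction_pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text.lower())


expanded = expand_contractions("I'd've said you can't've seen it")
print(expanded)
```

A single compiled pattern also scans each sentence once instead of once per contraction.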


In [None]:
question_words_counter = collections.Counter([word for sentence in clean_questions for word in sentence.split()])
answer_words_counter = collections.Counter([word for sentence in clean_answers for word in sentence.split()])

print('{} Question words.'.format(len([word for sentence in clean_questions for word in sentence.split()])))
print('{} unique question words.'.format(len(question_words_counter)))
print('10 Most common words in the question dataset:')
print('"' + '" "'.join(list(zip(*question_words_counter.most_common(10)))[0]) + '"')
print()
print('{} Answer words.'.format(len([word for sentence in clean_answers for word in sentence.split()])))
print('{} unique answer words.'.format(len(answer_words_counter)))
print('10 Most common words in the answer dataset:')
print('"' + '" "'.join(list(zip(*answer_words_counter.most_common(10)))[0]) + '"')

In [None]:
print(clean_questions[3])

### Configuration

In [None]:
batch_size = 64  # Batch size for training.
epochs = 30  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = "fra.txt"
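
`num_samples` matters for memory as well as training time, because the one-hot tensors built later in this notebook have shape (samples, max sequence length, vocabulary size). The sketch below estimates the size of one such dense `float32` array. The function name and the example sizes are assumptions for illustration, not measurements from this dataset.

```python
def one_hot_bytes(num_samples, max_seq_len, vocab_size, bytes_per_value=4):
    """Bytes needed for a dense float32 array of shape (samples, steps, vocab)."""
    return num_samples * max_seq_len * vocab_size * bytes_per_value


# Hypothetical sizes, chosen only for illustration.
estimated = one_hot_bytes(num_samples=10_000, max_seq_len=50, vocab_size=8_000)
print(f"{estimated / 1e9:.1f} GB")
```

If the estimate is too large for the machine at hand, lowering `num_samples` or truncating long sentences are the simplest levers.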

### Prepare the data

In [None]:
t_input_sentence = [
    'The quick brown fox jumps over the lazy dog Jesus difficult alien .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']

t_target_sentence = [
    'les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .', 
    'california est généralement calme en mars , et il est généralement chaud en juin .',
    'les états-unis est parfois légère en juin , et il fait froid en septembre .'
]

# Vectorize the data.
input_texts = []
target_texts = []
input_vocab = {" "}
target_vocab = {"\t", "\n", " "}

# Create input_text list
for line in clean_questions[0:num_samples]:
    input_text = []
    for word in line.split():
        input_text.append(word)
        input_vocab.add(word)
    input_texts.append(input_text)

# Create target_text list
for line in clean_answers[0:num_samples]:
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = []
    target_text.append("\t")
    for word in line.split():
        target_text.append(word)
        target_vocab.add(word)
    target_text.append("\n")
    target_texts.append(target_text)
    


In [None]:
ztest = np.zeros((1,2))   


In [None]:
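# Sort the vocabularies so that token ids are identical from run to run
# (the iteration order of a set of strings can change between sessions,
# which would break a model saved in one session and reloaded in another).
input_token_index = {}
for i, word in enumerate(sorted(input_vocab)):
    input_token_index[word] = i

target_token_index = {}
for i, word in enumerate(sorted(target_vocab)):
    target_token_index[word] = i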
    
#encoder_tokens = collections.Counter([word for sentence in t_input_sentence for word in sentence.split()])
#decoder_tokens = collections.Counter([word for sentence in t_target_sentence for word in sentence.split()])
num_encoder_tokens = len(input_token_index)
num_decoder_tokens = len(target_token_index)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)


# One-hot tensors with shape (samples, timesteps, vocabulary size).
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, word in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[word]] = 1.0
    # Pad the rest of the input with the " " token. Using len() keeps this
    # correct even when a cleaned sentence is empty and the loop never runs.
    encoder_input_data[i, len(input_text) :, input_token_index[" "]] = 1.0
    for t, word in enumerate(target_text):
        # decoder_input_data holds the full target, start token included.
        decoder_input_data[i, t, target_token_index[word]] = 1.0
        if t > 0:
            # decoder_target_data is ahead by one timestep
            # and does not include the start token.
            decoder_target_data[i, t - 1, target_token_index[word]] = 1.0
    # Pad the remaining decoder timesteps with the " " token.
    decoder_input_data[i, len(target_text) :, target_token_index[" "]] = 1.0
    decoder_target_data[i, len(target_text) - 1 :, target_token_index[" "]] = 1.0
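
The one-timestep offset between `decoder_input_data` and `decoder_target_data` is easier to see without the one-hot encoding. The sketch below uses plain token lists and a made-up answer. It drops the padding, and unlike the cell above it also leaves the end token out of the decoder input, so treat it as a simplified picture.

```python
START, END = "\t", "\n"

# Hypothetical cleaned answer, split into word tokens.
demo_answer = "i am fine".split()
demo_target_text = [START] + demo_answer + [END]

demo_decoder_input = demo_target_text[:-1]   # what the decoder is fed at each step
demo_decoder_target = demo_target_text[1:]   # what it should predict at that step

for fed, expected in zip(demo_decoder_input, demo_decoder_target):
    print(repr(fed), "->", repr(expected))
```

At every step the decoder sees the correct previous token and is trained to predict the next one, which is why the targets never contain the start token.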


### Build the model

In [None]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)


In [None]:
model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)
# Save model
model.save("s2s")

In [None]:
# Define sampling models
# Restore the model and construct the encoder and decoder.
model = load_model("s2s")

encoder_inputs = model.input[0]  # input_1
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # lstm_1
encoder_states = [state_h_enc, state_c_enc]
encoder_model = Model(encoder_inputs, encoder_states)

decoder_inputs = model.input[1]  # input_2
decoder_state_input_h = Input(shape=(latent_dim,), name="input_3")
decoder_state_input_c = Input(shape=(latent_dim,), name="input_4")
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = []
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence.append(sampled_char)

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    # Join the word tokens into a readable sentence and drop the end token.
    return " ".join(decoded_sentence).strip()

In [None]:
for seq_index in range(20):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index : seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print("-")
    print("Input sentence:", input_texts[seq_index])
    print("Decoded sentence:", decoded_sentence)
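
The sampling loop in `decode_sequence` is a greedy decoder: at each step it keeps only the highest-scoring token and feeds it back in. The toy sketch below shows that control flow without Keras. The `NEXT_SCORES` table is a hypothetical stand-in for `decoder_model.predict`, and it ignores the recurrent state that the real decoder carries between steps.

```python
# Hypothetical next-token scores standing in for the decoder model:
# each current token maps to scores over the candidate next tokens.
NEXT_SCORES = {
    "\t": {"hello": 0.7, "there": 0.2, "\n": 0.1},
    "hello": {"hello": 0.1, "there": 0.8, "\n": 0.1},
    "there": {"hello": 0.1, "there": 0.1, "\n": 0.8},
}


def greedy_decode(max_len=10):
    """Start from the start token and always follow the best-scoring token."""
    token, decoded = "\t", []
    while len(decoded) < max_len:
        scores = NEXT_SCORES[token]
        token = max(scores, key=scores.get)  # argmax over next-token scores
        if token == "\n":  # end-of-sequence token
            break
        decoded.append(token)
    return " ".join(decoded)


demo_decoded = greedy_decode()
print(demo_decoded)
```

Greedy decoding is simple and fast, but it can miss a better overall sentence; beam search is the usual next step if that becomes a problem.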

### PAUSE

In [None]:
def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train an encoder-decoder model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # OPTIONAL: Implement
    return None
tests.test_encdec_model(encdec_model)


# OPTIONAL: Train and Print prediction(s)

### Train the model

In [None]:
model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)
# Save model
model.save("s2s")

### Model 5: Custom (IMPLEMENTATION)
Use everything you learned from the previous models to create a model that incorporates embedding and a bidirectional RNN into one model.

In [None]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a model that incorporates embedding, encoder-decoder, and bidirectional RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Implement
    return None
tests.test_model_final(model_final)


print('Final Model Loaded')
# TODO: Train the final model

## Prediction (IMPLEMENTATION)

In [None]:
def final_predictions(x, y, x_tk, y_tk):
    """
    Gets predictions using the final model
    :param x: Preprocessed English data
    :param y: Preprocessed French data
    :param x_tk: English tokenizer
    :param y_tk: French tokenizer
    """
    # TODO: Train neural network using model_final
    model = None

    
    ## DON'T EDIT ANYTHING BELOW THIS LINE
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'

    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, len(sentences))

    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
    print(' '.join([y_id_to_word[np.max(x)] for x in y[0]]))


final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)

## Submission
When you're ready to submit, complete the following steps:
1. Review the [rubric](https://review.udacity.com/#!/rubrics/1004/view) to ensure your submission meets all requirements to pass
2. Generate an HTML version of this notebook

  - Run the next cell to attempt automatic generation (this is the recommended method in Workspaces)
  - Navigate to **FILE -> Download as -> HTML (.html)**
  - Manually generate a copy using `nbconvert` from your shell terminal
```
$ pip install nbconvert
$ python -m nbconvert machine_translation.ipynb
```
  
3. Submit the project

  - If you are in a Workspace, simply click the "Submit Project" button (bottom towards the right)
  
  - Otherwise, add the following files into a zip archive and submit them:
    - `helper.py`
    - `machine_translation.ipynb`
    - `machine_translation.html`
      - You can export the notebook by navigating to **File -> Download as -> HTML (.html)**.

### Generate the html

**Save your notebook before running the next cell to generate the HTML output.** Then submit your project.

In [None]:
# Save before you run this cell!
!!jupyter nbconvert *.ipynb

## Optional Enhancements

This project focuses on learning various network architectures for machine translation, but it does not follow the best practice of splitting the data into separate training and test sets, so the reported model accuracy is overstated. Use the [`sklearn.model_selection.train_test_split()`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to create separate training and test datasets, then retrain each of the models using only the training set and evaluate the prediction accuracy on the held-out test set. Does the "best" model change?
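
As a minimal sketch of the idea, the function below shuffles paired examples and holds out a fraction for testing. It is a standard-library stand-in written for illustration, not scikit-learn's `train_test_split`; the name `split_pairs`, its parameters, and the toy data are all assumptions.

```python
import random


def split_pairs(inputs, targets, test_fraction=0.2, seed=0):
    """Shuffle paired examples and hold out a fraction for testing."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


# Toy paired data; the same index always refers to the same sentence pair.
demo_x = [f"en {i}" for i in range(10)]
demo_y = [f"fr {i}" for i in range(10)]
x_train, x_test, y_train, y_test = split_pairs(demo_x, demo_y)
print(len(x_train), len(x_test))
```

The important property is that inputs and targets are split with the same indices, so each English sentence stays paired with its translation and no pair appears in both sets.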