# Neural Machine Translation with Luong Attention Mechanism

You will build a Neural Machine Translation (NMT) model to translate human readable dates ("25th of June, 2009") into machine readable dates ("2009-06-25"). You will do this using an attention model, a sequence to sequence model that learns which parts of the input to focus on at each output step.

In [None]:
import numpy as np
import tqdm
from faker import Faker
from babel.dates import format_date
from nmt_utils import load_dataset_v2, preprocess_data, string_to_int, int_to_string, softmax
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
import os

## 1 - Translating human readable dates into machine readable dates

The model you will build here could be used to translate from one language to another, such as translating from English to Hindi. However, language translation requires massive datasets and usually takes days of training on GPUs. To give you a place to experiment with these models even without using massive datasets, we will instead use a simpler "date translation" task. 

The network will input a date written in a variety of possible formats (*e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987"*) and translate it into a standardized, machine readable date (*e.g. "1958-08-29", "1968-03-30", "1987-06-24"*). We will have the network learn to output dates in the common machine-readable format YYYY-MM-DD. 
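
To make the target format concrete, two of the example pairs can be reproduced with the standard library when the input format is known in advance. This is only a sketch of the input/output mapping, not of the model: `to_iso` is a hypothetical helper, and parsing month names with `%B` assumes an English locale.

```python
from datetime import datetime

def to_iso(text, fmt):
    """Parse `text` with a known strptime format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, fmt).date().isoformat()

print(to_iso("03/30/1968", "%m/%d/%Y"))
print(to_iso("24 June 1987", "%d %B %Y"))  # month names assume an English locale
```

The network has to learn the same mapping from examples alone, without being told which format each input uses.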



<!-- 
Take a look at [nmt_utils.py](./nmt_utils.py) to see all the formatting. Count and figure out how the formats work, you will need this knowledge later. !--> 

### 1.1 - Dataset

We will train the model on a dataset of 60000 human readable dates and their equivalent, standardized, machine readable dates. Let's run the following cells to load the dataset and print some examples. 

In [None]:
m = 60000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset_v2(m)

In [None]:
dataset[:10]

In [None]:
human_vocab

In [None]:
machine_vocab

You've loaded:
- `dataset`: a list of tuples of (human readable date, machine readable date)
- `human_vocab`: a Python dictionary mapping all characters used in the human readable dates to an integer-valued index 
- `machine_vocab`: a python dictionary mapping all characters used in machine readable dates to an integer-valued index. These indices are not necessarily consistent with `human_vocab`. 
- `inv_machine_vocab`: the inverse dictionary of `machine_vocab`, mapping from indices back to characters. 
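
The relationship between a vocabulary and its inverse can be illustrated with a toy example. The dictionaries below are made up for the illustration (`toy_machine_vocab` and `toy_inv_machine_vocab` are hypothetical names); the real ones are returned by `load_dataset_v2` and may use different indices.

```python
# Toy stand-ins for machine_vocab / inv_machine_vocab, built by hand for this example.
chars = ["-"] + [str(d) for d in range(10)]
toy_machine_vocab = {ch: i for i, ch in enumerate(chars)}
toy_inv_machine_vocab = {i: ch for ch, i in toy_machine_vocab.items()}

date = "2009-06-25"
indices = [toy_machine_vocab[ch] for ch in date]                  # characters -> indices
decoded = "".join(toy_inv_machine_vocab[i] for i in indices)      # indices -> characters

print(indices)
print(decoded)
```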

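Let's preprocess the data and map the raw text to index values. We will use Tx=30 (which we assume is the maximum length of a human readable date; longer inputs are truncated) and Ty=10 (since "YYYY-MM-DD" is 10 characters long). The targets are preprocessed to length Ty+1 so that, during training, position `t` can be fed to the decoder while position `t+1` serves as the label, which gives Ty prediction steps. 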

In [None]:
Tx = 30
Ty = 10

X, Y = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty+1)

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)

del X, Y

print("X_train.shape:", X_train.shape)
print("Y_train.shape:", Y_train.shape)
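
The fixed-length encoding used above (truncate to a maximum length, then pad) can be sketched as follows. This is a minimal illustration, not the actual `nmt_utils` code: `encode_fixed_length`, the toy vocabulary, and the `<unk>`/`<pad>` tokens are assumptions made for the example.

```python
import numpy as np

def encode_fixed_length(text, length, vocab, unk="<unk>", pad="<pad>"):
    """Map characters to indices, truncate to `length`, then right-pad with `pad`."""
    ids = [vocab.get(ch, vocab[unk]) for ch in text[:length]]
    ids += [vocab[pad]] * (length - len(ids))
    return np.array(ids)

# Hypothetical toy vocabulary; the real human_vocab is built from the dataset.
toy_vocab = {ch: i for i, ch in enumerate("0123456789/ ")}
toy_vocab["<unk>"] = len(toy_vocab)
toy_vocab["<pad>"] = len(toy_vocab)

short = encode_fixed_length("03/30/1968", 30, toy_vocab)   # padded up to 30
long_ = encode_fixed_length("1" * 40, 30, toy_vocab)       # truncated down to 30
print(short.shape, long_.shape)
```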

## 2 - Neural machine translation with attention

If you had to translate a book's paragraph from French to English, you would not read the whole paragraph, then close the book and translate. Even during the translation process, you would read/re-read and focus on the parts of the French paragraph corresponding to the parts of the English you are writing down. 

The attention mechanism tells a Neural Machine Translation model which parts of the input it should focus on at each output step. 


### Luong attention mechanism
<img src="https://i.imgur.com/46R4XQV.png" style="width:500px;height:500px;"> <br>

<caption><center> Luong attention mechanism</center></caption>

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg">

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg">
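
For reference, the quantities computed by the `Decoder` below can be written as follows, with $h_t$ the decoder hidden state at output step $t$ and $\bar{h}_s$ the encoder output at source position $s$. The score is the additive form that the code implements; the weight names $W_1$, $W_2$, $v_a$ and $W_c$ are notation chosen here, and bias terms are omitted.

$$
\begin{aligned}
\text{score}(h_t, \bar{h}_s) &= v_a^\top \tanh(W_1 h_t + W_2 \bar{h}_s) \\
\alpha_{ts} &= \frac{\exp\big(\text{score}(h_t, \bar{h}_s)\big)}{\sum_{s'=1}^{T_x} \exp\big(\text{score}(h_t, \bar{h}_{s'})\big)} \\
c_t &= \sum_{s=1}^{T_x} \alpha_{ts}\, \bar{h}_s \\
a_t &= \tanh\big(W_c\,[c_t; h_t]\big)
\end{aligned}
$$

Here $\alpha_{ts}$ are the attention weights, $c_t$ is the context vector, and $a_t$ is the attention vector from which the output logits are computed.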

In [None]:
import tensorflow as tf

In [None]:
# TensorFlow 1.x API; eager execution is enabled by default in TensorFlow 2.x.
tf.enable_eager_execution()

L = tf.keras.layers

In [None]:
tf.keras.backend.clear_session()

#### Define hyperparameters

In [None]:
WORD_EMBED_SIZE = 32
ENCODER_LSTM_UNITS = 32
DECODER_LSTM_UNITS = 32

In [None]:
def lstm(units):
    if tf.test.is_gpu_available():
        return L.CuDNNLSTM(units=units, return_sequences=True, return_state=True, recurrent_initializer="glorot_uniform")
    else:
        return L.LSTM(units=units, return_sequences=True, return_state=True, recurrent_activation="sigmoid", recurrent_initializer="glorot_uniform")

In [None]:
class Encoder(tf.keras.Model):
    
    def __init__(self, len_vocab, embedding_size, lstm_units):
        super(Encoder, self).__init__()
        self.word_embed = L.Embedding(len_vocab, embedding_size)
        self.lstm = lstm(lstm_units)
        
    def call(self, x, hidden):
        x = self.word_embed(x)
        output, h_state, c_state = self.lstm(x, initial_state=hidden)
        return output, (h_state, c_state)

In [None]:
class Decoder(tf.keras.Model):
    
    def __init__(self, len_vocab, embedding_size, lstm_units):
        super(Decoder, self).__init__()
        self.embedding = L.Embedding(len_vocab, embedding_size)
        self.lstm = lstm(lstm_units)
        # Size the attention layers from the constructor argument rather than the global constant.
        self.s1_score = L.Dense(units=lstm_units)
        self.s2_score = L.Dense(units=lstm_units)
        self.score = L.Dense(units=1)
        self.context = L.Dot(axes=1)
        self.attention = L.Dense(units=lstm_units, activation="tanh")
        self.logits = L.Dense(units=len_vocab)

    def call(self, x, hidden, encoder_output):
        """
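        Run one decoding step with attention over the encoder outputs.
        
        Parameters
        ----------
        x: token indices fed to the decoder at this step. shape = (batch_size, 1)
        hidden: tuple (h_state, c_state) used as the initial LSTM state;
                each has shape = (batch_size, DECODER_LSTM_UNITS)
        encoder_output: hidden state output at every step of encoder. 
                            shape = (batch_size, encoder_seq_length, ENCODER_LSTM_UNITS)

        Returns
        -------
        out: unnormalized scores over the output vocabulary. shape = (batch_size, len_vocab)
        (h_state, c_state): updated LSTM state to pass to the next step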
        """
        # word_embed shape = (batch_size, 1, embedding_size)
        word_embed = self.embedding(x) 
        
        # h_decoder shape = (batch_size, 1, DECODER_LSTM_UNITS)
        # h_state, c_state shape = (batch_size, DECODER_LSTM_UNITS)
        h_decoder, h_state, c_state = self.lstm(word_embed, initial_state=hidden)
        
        # Additive score; broadcasting over the time axis gives
        # shape = (batch_size, encoder_seq_length, DECODER_LSTM_UNITS)
        score = tf.nn.tanh(self.s1_score(h_decoder) + self.s2_score(encoder_output))
        score = self.score(score) # shape = (batch_size, encoder_seq_length, 1)

        alignment = tf.nn.softmax(score, axis=1) # shape = (batch_size, encoder_seq_length, 1)
        
        context = self.context([alignment, encoder_output]) # shape = (batch_size, 1, ENCODER_LSTM_UNITS)
        # Drop the singleton axis instead of reshaping with DECODER_LSTM_UNITS,
        # which would only be correct when encoder and decoder sizes match.
        context = tf.squeeze(context, axis=1) # shape = (batch_size, ENCODER_LSTM_UNITS)

        attention = self.attention(tf.concat([context, h_state], axis=1)) # shape = (batch_size, DECODER_LSTM_UNITS)
        out = self.logits(attention) # shape = (batch_size, len_vocab)
        return out, (h_state, c_state)
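
The shape flow of the attention step in `Decoder.call` can be checked in isolation with NumPy. This is a sketch under assumptions: the sizes and random weights are toy values, biases are left out, and the names `W1`, `W2` and `v` stand in for the `s1_score`, `s2_score` and `score` layers.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, T, units = 2, 5, 4                     # toy sizes, not the notebook's

h_t = rng.normal(size=(batch, 1, units))      # decoder output for one step
enc = rng.normal(size=(batch, T, units))      # encoder outputs for all source steps
W1 = rng.normal(size=(units, units))
W2 = rng.normal(size=(units, units))
v = rng.normal(size=(units, 1))

# Additive score: broadcasting adds the single decoder step to every source position.
score = np.tanh(h_t @ W1 + enc @ W2) @ v      # (batch, T, 1)

# Softmax over the source-time axis (axis=1), shifted for numerical stability.
e = np.exp(score - score.max(axis=1, keepdims=True))
alignment = e / e.sum(axis=1, keepdims=True)  # (batch, T, 1)

# Context vector: alignment-weighted sum of the encoder outputs.
context = (alignment * enc).sum(axis=1)       # (batch, units)

print(score.shape, alignment.shape, context.shape)
```

The softmax is taken over axis 1 so that, for each example, the weights over the source positions sum to one.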

In [None]:
optimizer = tf.train.AdamOptimizer(learning_rate=0.005, beta1=0.9, beta2=0.999)


def loss_function(real, pred):
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred)
    return tf.reduce_mean(loss_)
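
What `loss_function` computes for one time step can be sketched in NumPy: the negative log-probability of the correct class under a softmax over the logits, averaged over the batch. The function name `sparse_softmax_xent` and the toy inputs are assumptions for the illustration.

```python
import numpy as np

def sparse_softmax_xent(labels, logits):
    """Mean over the batch of -log softmax(logits)[label]."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Uniform logits over 11 classes (an arbitrary toy size) give a loss of log(11),
# whatever the labels are.
logits = np.zeros((3, 11))
labels = np.array([0, 5, 10])
print(sparse_softmax_xent(labels, logits), np.log(11))
```

"Sparse" refers to the labels being integer class indices rather than one-hot vectors, which is why the index arrays from preprocessing can be passed in directly.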

In [None]:
encoder = Encoder(len(human_vocab), WORD_EMBED_SIZE, ENCODER_LSTM_UNITS)
decoder = Decoder(len(machine_vocab), WORD_EMBED_SIZE//4, DECODER_LSTM_UNITS)

In [None]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [None]:
epochs = 5
batch_size = 64

# Ceiling division: the last batch may be smaller than batch_size.
num_batches_train = (X_train.shape[0] + batch_size - 1) // batch_size

num_batches_val = (X_val.shape[0] + batch_size - 1) // batch_size

val_loss_min = 1e6

data_train = tf.concat([X_train, Y_train], axis=1)

data_val = tf.concat([X_val, Y_val], axis=1)

for e in range(epochs):
    
    data_train = tf.random.shuffle(data_train)
    
    X_train, Y_train = data_train[:, :Tx], data_train[:, Tx:]
    
    pbar = tqdm.tqdm_notebook(range(0, num_batches_train), desc="Epoch " + str(e+1))
    
    train_loss = 0
    
    for it in pbar:
        loss = 0
        start = it*batch_size
        end = (it+1)*batch_size
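        # Use the actual number of rows in the slice; the last batch may be shorter than batch_size.
        hidden = tf.zeros(shape=(int(X_train[start:end].shape[0]), ENCODER_LSTM_UNITS))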
        with tf.GradientTape() as tape:
            encoder_output, encoder_hidden = encoder(X_train[start:end], (hidden, hidden))
            decoder_hidden = encoder_hidden
            for t in range(0, Y_train[start:end, :-1].shape[1]):
                logits, decoder_hidden = decoder(tf.expand_dims(Y_train[start:end, t], axis=1), decoder_hidden, encoder_output)
                loss += loss_function(Y_train[start:end, t+1], logits)
    
        batch_loss = (loss / int(Y_train[start:end, :-1].shape[1]))
        
        train_loss += batch_loss
        
        pbar.set_description("Epoch %s - Training loss: %f" % (e+1, (train_loss / (it+1))))
        
        variables = encoder.variables + decoder.variables
        
        gradients = tape.gradient(loss, variables)
        
        optimizer.apply_gradients(zip(gradients, variables))
    
    val_loss = 0
    
    data_val = tf.random.shuffle(data_val)
    
    X_val, Y_val = data_val[:, :Tx], data_val[:, Tx:]
    
    for it in range(num_batches_val):
        loss = 0
        start = it*batch_size
        end = (it+1)*batch_size
        
        hidden = tf.zeros(shape=(X_val[start:end].shape[0], ENCODER_LSTM_UNITS))
        
        encoder_output, encoder_hidden = encoder(X_val[start:end], (hidden, hidden))
        decoder_hidden = encoder_hidden
        for t in range(0, Y_val[start:end, :-1].shape[1]):
            logits, decoder_hidden = decoder(tf.expand_dims(Y_val[start:end, t], axis=1), decoder_hidden, encoder_output)
            loss += loss_function(Y_val[start:end, t+1], logits)

        val_loss += (loss/int(Y_val[start:end, :-1].shape[1]))

    print("Val loss: %f" %  (float(val_loss)/int(num_batches_val)))
    
    if val_loss_min > val_loss:
        val_loss_min = val_loss
        checkpoint.save(file_prefix=checkpoint_prefix)
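
The inner loop over `t` uses teacher forcing: at each step the decoder is fed the true character at position `t` and is scored against the true character at position `t+1`. A small sketch of that shift on a single string, assuming each target row starts with the `#` token that the inference code below feeds first:

```python
target = "#2009-06-25"          # assumed layout: start token followed by the 10-character date

decoder_inputs = target[:-1]    # what the decoder is fed at steps 0..Ty-1
decoder_targets = target[1:]    # what it is trained to predict at those steps

for t, (fed, expected) in enumerate(zip(decoder_inputs, decoder_targets)):
    print(t, repr(fed), "->", repr(expected))
```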


You can now see the results on new examples.

In [None]:
EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']

for example in EXAMPLES:
    source = string_to_int(example, Tx, human_vocab)
    source = np.array([source])
    
    hidden = tf.zeros(shape=(1, ENCODER_LSTM_UNITS))

    encoder_output, encoder_state = encoder(source, (hidden, hidden))
    decoder_state = encoder_state
    sentence = [machine_vocab["#"]]
    for t in range(Ty):
        logits, decoder_state = decoder(np.array([[sentence[-1]]]), decoder_state, encoder_output)
        prediction = softmax(logits)
        prediction = np.argmax(prediction, axis=-1)
        sentence.append(prediction[0])
        
    output = [inv_machine_vocab[s] for s in sentence[1:]]
    
    print("source:", example)
    print("output:", ''.join(output))
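
The loop above is greedy decoding: at every step the most likely character is chosen and fed back as the next input. The same pattern, factored into a function and run against a stand-in for the decoder, might look like this. `greedy_decode`, `step_fn` and `toy_step` are hypothetical names introduced for the sketch.

```python
import numpy as np

def greedy_decode(step_fn, start_id, state, max_len):
    """Feed back the argmax prediction at each step; return the generated ids."""
    ids = [start_id]
    for _ in range(max_len):
        logits, state = step_fn(ids[-1], state)
        ids.append(int(np.argmax(logits)))
    return ids[1:]  # drop the start token

# Hypothetical stand-in for the decoder: always predicts (previous id + 1) mod 5.
def toy_step(prev_id, state):
    logits = np.zeros(5)
    logits[(prev_id + 1) % 5] = 1.0
    return logits, state

print(greedy_decode(toy_step, start_id=0, state=None, max_len=4))
```

Because the argmax of the logits equals the argmax of their softmax, the normalization step is not needed to pick the predicted character.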