# Neural Machine Translation with Bahdanau Attention Mechanism

You will build a Neural Machine Translation (NMT) model to translate human readable dates ("25th of June, 2009") into machine readable dates ("2009-06-25"). You will do this using an attention model, one of the most sophisticated sequence to sequence models. 

This notebook was produced together with NVIDIA's Deep Learning Institute. 

Let's load all the packages you will need for this assignment.

In [1]:
import numpy as np

from faker import Faker
from babel.dates import format_date
from nmt_utils import load_dataset, preprocess_data, string_to_int, int_to_string, softmax, plot_attention_map
import matplotlib.pyplot as plt
%matplotlib inline
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


## 1 - Translating human readable dates into machine readable dates

The model you will build here could be used to translate from one language to another, such as translating from English to Hindi. However, language translation requires massive datasets and usually takes days of training on GPUs. To give you a place to experiment with these models even without using massive datasets, we will instead use a simpler "date translation" task. 

The network will input a date written in a variety of possible formats (*e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987"*) and translate them into standardized, machine readable dates (*e.g. "1958-08-29", "1968-03-30", "1987-06-24"*). We will have the network learn to output dates in the common machine-readable format YYYY-MM-DD. 



<!-- 
Take a look at [nmt_utils.py](./nmt_utils.py) to see all the formatting. Count and figure out how the formats work, you will need this knowledge later. !--> 

### 1.1 - Dataset

We will train the model on a dataset of 60000 human readable dates and their equivalent, standardized, machine readable dates. Let's run the following cells to load the dataset and print some examples. 

In [2]:
m = 60000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

100%|█████████████████████████████████████████████████████████████████████████| 60000/60000 [00:01<00:00, 34448.87it/s]


In [3]:
dataset[:10]

[('9 may 1998', '1998-05-09'),
 ('10.09.70', '1970-09-10'),
 ('4/28/90', '1990-04-28'),
 ('thursday january 26 1995', '1995-01-26'),
 ('monday march 7 1983', '1983-03-07'),
 ('sunday may 22 1988', '1988-05-22'),
 ('tuesday july 8 2008', '2008-07-08'),
 ('08 sep 1999', '1999-09-08'),
 ('1 jan 1981', '1981-01-01'),
 ('monday may 22 1995', '1995-05-22')]

In [4]:
human_vocab

{'<pad>': 0,
 '<unk>': 1,
 ' ': 2,
 '.': 3,
 '/': 4,
 '0': 5,
 '1': 6,
 '2': 7,
 '3': 8,
 '4': 9,
 '5': 10,
 '6': 11,
 '7': 12,
 '8': 13,
 '9': 14,
 'a': 15,
 'b': 16,
 'c': 17,
 'd': 18,
 'e': 19,
 'f': 20,
 'g': 21,
 'h': 22,
 'i': 23,
 'j': 24,
 'l': 25,
 'm': 26,
 'n': 27,
 'o': 28,
 'p': 29,
 'r': 30,
 's': 31,
 't': 32,
 'u': 33,
 'v': 34,
 'w': 35,
 'y': 36}

In [5]:
machine_vocab

{'-': 0,
 '0': 1,
 '1': 2,
 '2': 3,
 '3': 4,
 '4': 5,
 '5': 6,
 '6': 7,
 '7': 8,
 '8': 9,
 '9': 10}

You've loaded:
- `dataset`: a list of tuples of (human readable date, machine readable date)
- `human_vocab`: a python dictionary mapping all characters used in the human readable dates to an integer-valued index 
- `machine_vocab`: a python dictionary mapping all characters used in machine readable dates to an integer-valued index. These indices are not necessarily consistent with `human_vocab`. 
- `inv_machine_vocab`: the inverse dictionary of `machine_vocab`, mapping from indices back to characters. 

Let's preprocess the data and map the raw text data into the index values. We will also use Tx=30 (which we assume is the maximum length of the human readable date; if we get a longer input, we would have to truncate it) and Ty=10 (since "YYYY-MM-DD" is 10 characters long). 

In [36]:
Tx = 30
Ty = 10

X, Y = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

Y = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), Y)))

print("X.shape:", X.shape)
print("Y.shape:", Y.shape)

X.shape: (60000, 30)
Y.shape: (60000, 10, 11)


## 2 - Neural machine translation with attention

If you had to translate a book's paragraph from French to English, you would not read the whole paragraph, then close the book and translate. Even during the translation process, you would read/re-read and focus on the parts of the French paragraph corresponding to the parts of the English you are writing down. 

The attention mechanism tells a Neural Machine Translation model where it should pay attention to at any step. 


### Attention mechanism

In this part, you will implement the attention mechanism presented in the lecture videos. Here is a figure to remind you how the model works. The diagram on the left shows the attention model. The diagram on the right shows what one "Attention" step does to calculate the attention variables $\alpha^{\langle t, t' \rangle}$, which are used to compute the context variable $context^{\langle t \rangle}$ for each timestep in the output ($t=1, \ldots, T_y$). 

<table>
<td> 
<img src="https://i.imgur.com/fuOZgQl.png" style="width:500;height:500px;"> <br>
</td> 
<td> 
<img src="https://i.imgur.com/CEgMHFc.png" style="width:500;height:500px;"> <br>
</td> 
</table>
<caption><center> **Figure 1**: Neural machine translation with attention</center></caption>



Here are some properties of the model that you may notice: 

- There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes *before* the attention mechanism, we will call it *pre-attention* Bi-LSTM. The LSTM at the top of the diagram comes *after* the attention mechanism, so we will call it the *post-attention* LSTM. The pre-attention Bi-LSTM goes through $T_x$ time steps; the post-attention LSTM goes through $T_y$ time steps. 

- The post-attention LSTM passes $s^{\langle t \rangle}, c^{\langle t \rangle}$ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-activation sequence model, so the state captured by the RNN output activations $s^{\langle t\rangle}$. But since we are using an LSTM here, the LSTM has both the output activation $s^{\langle t\rangle}$ and the hidden cell state $c^{\langle t\rangle}$. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-activation LSTM at time $t$ does will not take the specific generated $y^{\langle t-1 \rangle}$ as input; it only takes $s^{\langle t\rangle}$ and $c^{\langle t\rangle}$ as input. We have designed the model this way, because (unlike language generation where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date. 

- We use $a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}; \overleftarrow{a}^{\langle t \rangle}]$ to represent the concatenation of the activations of both the forward-direction and backward-directions of the pre-attention Bi-LSTM. 

- The diagram on the right uses a `RepeatVector` node to copy $s^{\langle t-1 \rangle}$'s value $T_x$ times, and then `Concatenation` to concatenate $s^{\langle t-1 \rangle}$ and $a^{\langle t \rangle}$ to compute $e^{\langle t, t'}$, which is then passed through a softmax to compute $\alpha^{\langle t, t' \rangle}$. We'll explain how to use `RepeatVector` and `Concatenation` in Keras below. 

In [37]:
import keras

In [38]:
L = keras.layers
M = keras.models

In [43]:
keras.backend.clear_session()

#### Define hyperparameter 

In [44]:
WORD_EMBED_SIZE = 32
PRE_ATTENTION_LSTM_UNITS = 32
POST_ATTENTION_LSTM_UNITS = PRE_ATTENTION_LSTM_UNITS*2

#### Define tensor(node) in the graph

In [45]:
X_input = L.Input(shape=(Tx,))
s0_input = L.Input(shape=(POST_ATTENTION_LSTM_UNITS,))
c0_input = L.Input(shape=(POST_ATTENTION_LSTM_UNITS,))

word_embed_tensor = L.Embedding(len(human_vocab), WORD_EMBED_SIZE)
pre_attention_LSTM_tensor = L.Bidirectional(L.LSTM(PRE_ATTENTION_LSTM_UNITS, return_sequences=True))

repeat_s_prev_tensor = L.RepeatVector(Tx)
concat_tensor = L.Concatenate()
e1_tensor = L.Dense(units=32, activation="tanh")
e2_tensor = L.Dense(units=1, activation="tanh")
alpha_tensor = L.Activation(activation=softmax)
context_tensor = L.Dot(axes=1)

post_attention_LSTM_tensor = L.LSTM(units=POST_ATTENTION_LSTM_UNITS, return_state=True)
output_tensor = L.Dense(len(machine_vocab), activation=softmax)

#### Connect the nodes in the graph

In [46]:
word_embed = word_embed_tensor(X_input)
h = pre_attention_LSTM_tensor(word_embed)

s = s0_input
c = c0_input
outputs = []

for t in range(Ty):
    s_prev = repeat_s_prev_tensor(s0_input)
    flow = concat_tensor([h, s_prev])
    flow = e1_tensor(flow)
    flow = e2_tensor(flow)
    flow = alpha_tensor(flow)
    context = context_tensor([flow, h])
    
    s, _, c = post_attention_LSTM_tensor(inputs=context, initial_state=[s, c])

    out = output_tensor(s)
    
    outputs.append(out)
    
model = M.Model(inputs=[X_input, s0_input, c0_input], outputs=outputs)
    

In [47]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 30)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 30, 32)       1184        input_1[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 64)           0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 30, 64)       16640       embedding_1[0][0]                
__________________________________________________________________________________________________
repeat_vec

As usual, after creating your model in Keras, you need to compile it and define what loss, optimizer and metrics you want to use. Compile your model using `categorical_crossentropy` loss, a custom [Adam](https://keras.io/optimizers/#adam) [optimizer](https://keras.io/optimizers/#usage-of-optimizers) (`learning rate = 0.005`, $\beta_1 = 0.9$, $\beta_2 = 0.999$, `decay = 0.01`)  and `['accuracy']` metrics:

In [48]:
opt = keras.optimizers.Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [49]:
s0 = np.zeros((Y_train.shape[0], POST_ATTENTION_LSTM_UNITS))
c0 = np.zeros((Y_train.shape[0], POST_ATTENTION_LSTM_UNITS))
outputs = list(Y_train.swapaxes(0, 1))

Let's now fit the model.

In [50]:
from keras_tqdm import TQDMNotebookCallback
file_path = "model/bahdanau_model.hdf5"
checkpoints = keras.callbacks.ModelCheckpoint(file_path, save_best_only=True)
model.fit([X, s0, c0], outputs, epochs=5, batch_size=100, verbose=0, 
            callbacks=[TQDMNotebookCallback(), checkpoints])

HBox(children=(IntProgress(value=0, description='Training', max=5, style=ProgressStyle(description_width='init…

HBox(children=(IntProgress(value=0, description='Epoch 0', max=48000, style=ProgressStyle(description_width='i…

HBox(children=(IntProgress(value=0, description='Epoch 1', max=48000, style=ProgressStyle(description_width='i…

HBox(children=(IntProgress(value=0, description='Epoch 2', max=48000, style=ProgressStyle(description_width='i…

HBox(children=(IntProgress(value=0, description='Epoch 3', max=48000, style=ProgressStyle(description_width='i…

HBox(children=(IntProgress(value=0, description='Epoch 4', max=48000, style=ProgressStyle(description_width='i…




<keras.callbacks.History at 0x1f97c8f7940>

Load the model

In [42]:
model = M.load_model(file_path)

You can now see the results on new examples.

In [52]:
EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']
for example in EXAMPLES:
    source = string_to_int(example, Tx, human_vocab)
    source = np.array([source])
    prediction = model.predict([source, np.zeros(shape=(1, POST_ATTENTION_LSTM_UNITS)), np.zeros(shape=(1, POST_ATTENTION_LSTM_UNITS))])
    prediction = np.argmax(prediction, axis = -1)
    output = [inv_machine_vocab[int(i)] for i in prediction]
    
    print("source:", example)
    print("output:", ''.join(output))

source: 3 May 1979
output: 1999-05-03
source: 5 April 09
output: 2009-04-04
source: 21th of August 2016
output: 2011-08-16
source: Tue 10 Jul 2007
output: 2007-07-10
source: Saturday May 9 2018
output: 2018-09-08
source: March 3 2001
output: 2000-03-03
source: March 3rd 2001
output: 2000-01-03
source: 1 March 2001
output: 2000-01-11
