# Machine Translation and Attention
In this notebook, we will implement a model for neural machine translation (NMT) with attention. This notebook is adapted from the [TensorFlow tutorial on NMT](https://www.tensorflow.org/tutorials/seq2seq) as well as the [TensorFlow NMT package](https://github.com/tensorflow/nmt/).

In [1]:
%matplotlib inline

import collections
from functools import partial
import math
import matplotlib.pyplot as plt
import os
import random
import time
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange

import tensorflow as tf

# Helper TensorFlow functions
from utils import maybe_download

# The encoder-decoder architecture
from nmt.model import AttentionalModel, LSTMCell
from nmt.utils import vocab_utils
from nmt.train import train


## Data
We'll train our model on a small-scale dataset: an English-Vietnamese parallel corpus of TED talks (133K sentence pairs) provided by the IWSLT Evaluation Campaign (https://sites.google.com/site/iwsltevaluation2015/).

In [2]:
out_dir = os.path.join('datasets', 'nmt_data_vi')
site_prefix = "https://nlp.stanford.edu/projects/nmt/data/"

maybe_download(site_prefix + 'iwslt15.en-vi/train.en', out_dir, 13603614)
maybe_download(site_prefix + 'iwslt15.en-vi/train.vi', out_dir, 18074646)

maybe_download(site_prefix + 'iwslt15.en-vi/tst2012.en', out_dir, 140250)
maybe_download(site_prefix + 'iwslt15.en-vi/tst2012.vi', out_dir, 188396)

maybe_download(site_prefix + 'iwslt15.en-vi/tst2013.en', out_dir, 132264)
maybe_download(site_prefix + 'iwslt15.en-vi/tst2013.vi', out_dir, 183855)

maybe_download(site_prefix + 'iwslt15.en-vi/vocab.en', out_dir, 139741)
maybe_download(site_prefix + 'iwslt15.en-vi/vocab.vi', out_dir, 46767)

Downloading train.en...
Finished!
Found and verified datasets/nmt_data_vi/train.en
Downloading train.vi...
Finished!
Found and verified datasets/nmt_data_vi/train.vi
Downloading tst2012.en...
Finished!
Found and verified datasets/nmt_data_vi/tst2012.en
Downloading tst2012.vi...
Finished!
Found and verified datasets/nmt_data_vi/tst2012.vi
Downloading tst2013.en...
Finished!
Found and verified datasets/nmt_data_vi/tst2013.en
Downloading tst2013.vi...
Finished!
Found and verified datasets/nmt_data_vi/tst2013.vi
Downloading vocab.en...
Finished!
Found and verified datasets/nmt_data_vi/vocab.en
Downloading vocab.vi...
Finished!
Found and verified datasets/nmt_data_vi/vocab.vi


'datasets/nmt_data_vi/vocab.vi'

## Introduction to NMT

<figure>
    <img src='images/encdec.jpg' alt='missing' />
    <figcaption>**Figure 1.** Example of a general, *encoder-decoder* approach to NMT. An encoder converts a source sentence into a representation which is passed through a decoder to produce a translation</figcaption>
</figure>

A neural machine translation (NMT) system reads in a source sentence using an *encoder*, and then uses a *decoder* to emit a translation. NMT models vary in terms of their exact architectures. A natural choice for sequential data is the recurrent neural network (RNN). Usually an RNN is used for both the encoder and decoder. The RNN models, however, differ in terms of: (a) directionality – unidirectional or bidirectional (whether they read the source sentence forwards only, or both forwards and backwards); (b) depth – single- or multi-layer; and (c) type – often either a vanilla RNN, a long short-term memory (LSTM) network, or a gated recurrent unit (GRU).

We will consider a deep multi-layer RNN which is bi-directional (it reads the input sequence both forwards and backwards) and uses LSTM units with attention. At a high level, the NMT model consists of two recurrent neural networks: the encoder recurrent network simply consumes the input source words without making any prediction; the decoder, on the other hand, processes the target sentence while predicting the next words.

<figure>
    <img src='images/seq2seq.jpg' alt='missing' />
    <figcaption>**Figure 2.** Example of a neural machine translation system for translating a source sentence "I am a student" into a target sentence "Je suis étudiant".  Here, $<s>$ marks the start of the decoding process while $</s>$ tells the decoder to stop.
    </figcaption>
</figure>

At the bottom layer, the encoder and decoder recurrent networks receive as input the following: first, the source sentence, then a boundary marker $<s>$ which indicates the transition from the encoding to the decoding mode, and finally the target sentence. We now go into the details of how the model deals with source and target sentences.

### Embedding
Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. For this embedding layer to work, a vocabulary is first chosen for each language. Usually, a vocabulary size $V$ is selected, and only the most frequent $V$ words in the corpus are treated as unique. All other words are converted to an "unknown" token $<$UNK$>$ and all get the same embedding. The embedding weights, one set per language, are usually learned during training (but pretrained word embeddings may be used instead).
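The lookup described above can be sketched in plain NumPy. This is only an illustration: the toy corpus, the vocabulary size of 3, the embedding dimension of 4, and the helper names `build_vocab` and `embed` are assumptions made for the example and are not part of the `nmt` package.

```python
import collections

import numpy as np

UNK = "<UNK>"


def build_vocab(sentences, vocab_size):
    """Map the `vocab_size` most frequent words to ids; id 0 is reserved for UNK."""
    counts = collections.Counter(w for s in sentences for w in s.split())
    words = [w for w, _ in counts.most_common(vocab_size)]
    return {w: i for i, w in enumerate([UNK] + words)}


def embed(sentence, vocab, embeddings):
    """Look up one embedding row per token; out-of-vocabulary words share the UNK row."""
    ids = [vocab.get(w, vocab[UNK]) for w in sentence.split()]
    return ids, embeddings[ids]


# Hypothetical toy corpus standing in for one side of the parallel data.
corpus = ["i am a student", "i am a teacher", "a student reads"]
vocab = build_vocab(corpus, vocab_size=3)

# Randomly initialised embedding matrix, one row per vocabulary entry.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))

# "doctor" never appears in the corpus, so it maps to the UNK id.
ids, vectors = embed("i am a doctor", vocab, embeddings)
```

In a real model the embedding matrix would be a trainable variable (or loaded from pretrained vectors) rather than a fixed random array.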

### Encoder
Once retrieved, the word embeddings are then fed as input into the main network, which consists of two multi-layer recurrent neural networks -- an encoder for the source language and a decoder for the target language. These two networks, in principle, can share the same weights; however, in practice, we often use two different sets of parameters (such models do a better job when fitting large training datasets). The encoder uses zero vectors as its starting states (before it sees the source sequence). In TensorFlow:

    # Build RNN cell
    encoder_cell = YourEncoderRNNCell(num_units)

    # Run Dynamic RNN
    #   encoder_outputs: [max_time, batch_size, num_units]
    #   encoder_state: [batch_size, num_units]
    encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
        encoder_cell, encoder_emb_inp,
        sequence_length=source_sequence_length, time_major=True)

### Decoder
The decoder also needs to have access to the source information, and one simple way to achieve that is to initialize it with the last hidden state of the encoder, `encoder_state`. In Figure 2, we pass the hidden state at the source word "student" to the decoder side.

    # Build RNN cell
    decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
    
    # Helper
    helper = tf.contrib.seq2seq.TrainingHelper(
        decoder_emb_inp, decoder_lengths, time_major=True)

    # Decoder
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state, output_layer=projection_layer)
    
    # Dynamic decoding
    outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
    logits = outputs.rnn_output

### Loss
Given the logits above, we are now ready to compute the training loss:

    crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=decoder_outputs, logits=logits)
    train_loss = (tf.reduce_sum(crossent * target_weights) / batch_size)

Here, `target_weights` is a zero-one matrix of the same size as `decoder_outputs`. It masks padding positions beyond each target sequence's length with the value 0.

Important note: It's worth pointing out that we should divide the loss by `batch_size`, so our hyperparameters are "invariant" to `batch_size`. Some people divide the loss by (`batch_size * num_time_steps`), which plays down the errors made on short sentences. More subtly, the same hyperparameters (applied to the former way) can't be used for the latter way. For example, if both approaches use SGD with a learning rate of `1.0`, the latter approach effectively uses a much smaller learning rate of `1 / num_time_steps`.
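To make the difference between the two normalizations concrete, here is a small NumPy sketch. It assumes a batch-major layout (`[batch, time, vocab]`) for readability, whereas the TensorFlow snippets above are time-major; the helper `masked_xent` and the random toy data are illustrative assumptions.

```python
import numpy as np


def masked_xent(logits, labels, target_weights):
    """Per-position cross-entropy, zeroed at padding positions.

    logits: [batch, time, vocab]; labels and target_weights: [batch, time].
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    return -picked * target_weights


rng = np.random.default_rng(0)
batch_size, num_time_steps, vocab_size = 2, 4, 5
logits = rng.normal(size=(batch_size, num_time_steps, vocab_size))
labels = rng.integers(0, vocab_size, size=(batch_size, num_time_steps))
target_weights = np.array([[1., 1., 1., 1.],
                           [1., 1., 0., 0.]])  # second sentence has length 2

crossent = masked_xent(logits, labels, target_weights)
loss_per_sentence = crossent.sum() / batch_size
loss_per_position = crossent.sum() / (batch_size * num_time_steps)
```

The two losses differ exactly by the factor `num_time_steps`, and so do their gradients, which is why a learning rate tuned for one normalization does not transfer to the other.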

### How to generate translations at test time

While you're training your NMT models (and once you have trained models), you can obtain translations given previously unseen source sentences. At test time, we only have access to the source sentence; i.e., `encoder_inputs`. There are many ways to perform decoding given those inputs. Decoding methods include greedy, sampling, and beam-search decoding. Here, we will discuss the greedy decoding strategy.

The idea is simple and illustrated in Figure 3:

1. We still encode the source sentence in the same way as during training to obtain an `encoder_state`, and this `encoder_state` is used to initialize the decoder.

2. The decoding (translation) process is started as soon as the decoder receives a starting symbol $<$s$>$.

3. For each timestep on the decoder side, we treat the recurrent network's output as a set of logits. We choose the most likely word, the id associated with the maximum logit value, as the emitted word (this is the "greedy" behavior). For example in Figure 3, the word "moi" has the highest translation probability in the first decoding step. We then feed this word as input to the next timestep. (At training time, however, we may feed in the true target as input to the next timestep in a process called *teacher forcing*.)

4. The process continues until the end-of-sentence marker $<$/s$>$ is produced as an output symbol.

<figure>
    <img src='images/greedy_dec.jpg' alt='missing' />
    <figcaption>**Figure 3.** Example of how a trained NMT model produces a translation for a source sentence "Je suis étudiant" using greedy search.
    </figcaption>
</figure>
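The greedy procedure in steps 1 to 4 can be sketched as a short loop. This is a toy illustration, not the decoder used by the `nmt` package: the token ids `SOS_ID` and `EOS_ID`, the function `greedy_decode`, and the lookup table `TOY_LOGITS` (which stands in for a trained decoder) are all hypothetical.

```python
import numpy as np

SOS_ID, EOS_ID = 0, 1  # assumed ids for <s> and </s>


def greedy_decode(step_fn, initial_state, max_steps=20):
    """Repeatedly feed back the argmax token until </s> or `max_steps`."""
    token, state, output = SOS_ID, initial_state, []
    for _ in range(max_steps):
        logits, state = step_fn(token, state)
        token = int(np.argmax(logits))  # the "greedy" choice
        if token == EOS_ID:
            break
        output.append(token)
    return output


# Toy stand-in for a trained decoder: row i holds the logits emitted after
# reading token i (hypothetical values).
TOY_LOGITS = np.array([
    [0.0, 0.1, 2.0, 0.3],  # after <s>: token 2 scores highest
    [0.0, 0.0, 0.0, 0.0],  # after </s>: never reached
    [0.0, 0.2, 0.1, 1.5],  # after token 2: token 3 scores highest
    [0.0, 3.0, 0.1, 0.2],  # after token 3: </s> scores highest
])


def toy_step(token, state):
    """Return the logits for the next token; the state is passed through unchanged."""
    return TOY_LOGITS[token], state


translation = greedy_decode(toy_step, initial_state=None)
```

In a real system `step_fn` would run one step of the decoder RNN, and `initial_state` would be the `encoder_state`. The `max_steps` cap guards against a model that never emits the end-of-sentence marker.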

## Introduction to Attention

The attention mechanism was first introduced by Bahdanau et al., 2015 [1] and then later refined by Luong et al., 2015 [2] and others. The key idea of the attention mechanism is to establish direct short-cut connections between the target and the source by paying "attention" to relevant source content as we translate (produce output tokens). A nice byproduct of the attention mechanism is an easy-to-visualize alignment matrix between the source and target sentences that we will visualize at the end of this notebook.
 
Remember that in a vanilla seq2seq model, we pass the last source state $h_{s_{T_s}}$ from the encoder to the decoder when starting the decoding process. This works well for short and medium-length sentences; however, for long sentences, the single fixed-size hidden state becomes an information bottleneck. Instead of discarding all of the hidden states computed in the source RNN, the attention mechanism provides an approach that allows the decoder to peek at them (treating them as a dynamic memory of the source information). By doing so, the attention mechanism improves the translation of longer sentences. Nowadays, attention mechanisms are the *de facto* standard and have been successfully applied to many other tasks (including image caption generation, speech recognition, and text summarization).

<figure>
    <img src='images/att.jpg' alt='missing' />
    <figcaption>**Figure 4.** Example of an attention-based NMT system with the first step of the attention computation in detail. For clarity, the embedding and projection layers are omitted.
    </figcaption>
</figure>

### How do we actually attend over the input sequence?

There are many different ways of formalizing attention. These variants depend on the form of a *scoring* function and an *attention* function (and on whether the previous state of the decoder $h_{t_{i-1}}$ is used instead of $h_{t_{i}}$ in the scoring function as originally suggested in Bahdanau et al. (2015); **we will stick to using $h_{t_{i}}$** in this notebook). Luong et al. (2015) demonstrate that only a few choices actually matter:

1. First, the basic form of attention, i.e., **direct connections between target and source**, needs to be present. 

2. Second, it's important to **feed the attention vector to the next timestep** to inform the network about past attention decisions.

3. Lastly, **choices of the scoring function** can often result in different performance. See Luong et al. (2015) for further details.

### A general framework for computing attention

The attention computation happens at every decoder time step. It consists of the following stages:

1. The current target (decoder) hidden state $h_{t_i}$ is compared with all source (encoder) states $h_{s_j}$ to derive *attention weights* $\alpha_{ij}$.
2. Based on the attention weights we compute a *context vector* $c_{i}$ as the weighted average of the source states.
3. We combine the context vector $c_{i}$ with the current target hidden state $h_{t_i}$ to yield the final *attention vector* $a_i$.
4. The attention vector $a_i$ is fed as an input to the next time step (*input feeding*). 

The first three steps can be summarized by the equations below:

$$\large\begin{align*}
\alpha_{ij} &= \frac{
    \exp(\text{score}(h_{t_i}, h_{s_j}))
}{
    \sum_{k=1}^{T_s}{\exp(\text{score}(h_{t_i}, h_{s_k}))}
} \tag{attention weights} \\\\
c_{i} &= \sum_{j=1}^{T_s} \alpha_{ij} h_{s_j} \tag{context vector} \\\\
a_{i} &= f(c_{i}, h_{t_i}) \tag{attention vector} \\\\
\end{align*}$$

Here, the function `score` is used to compare the target hidden state $h_{t_i}$ with each of the source hidden states $h_{s_j}$, and the result is normalized over the source timesteps $j = 1, \dots, T_s$ to produce attention weights $\alpha_{ij}$ (which define a distribution over source positions $j$ for a given target timestep $i$). (There are various choices of the scoring function; we will consider three below.) Note that we make use of the current decoder (or *target*) hidden state $h_{t_i}$, which is computed as a function of the previous hidden state $h_{t_{i-1}}$ and the embedding of the input token $x_{i}$ (which is either the emission or the ground truth token from the previous timestep) using the standard formula for a recurrent cell. Optionally, in the case of *input feeding*, we combine $h_{t_{i-1}}$ with the context vector from the previous timestep, $c_{i-1}$ (which may require a change in the size of the kernel matrix, depending on how the combination is implemented). The encoder (or *source*) hidden states $h_{s_j}$ for $j=1, \dots, T_s$ are similarly the standard hidden states for a recurrent cell.

We can also vectorize the computation of the context vector $c_i$ for every target timestep as follows: Given the source hidden states $h_{s_1}, \dots, h_{s_{T_s}}$, we construct a matrix $H_s$ of size `hidden_size` $\times$ `input_seq_len` by stacking the source hidden states into columns. Attention allows us to dynamically weight certain timesteps of the input sequence in a fixed-size vector $c_i$ by taking a convex combination of the columns of $H_s$. In particular, we calculate a nonnegative and normalized attention weight vector $\vec{\alpha}_i = [\alpha_{i1}, \dots, \alpha_{iT_s}]^T$ that weights the source hidden states in the computation

$$\large c_i = H_s\vec{\alpha}_i~.$$



The attention vector $a_i$, which is used to derive the softmax logits and thereafter the loss, is obtained from $c_i$ and $h_{t_i}$ by a function $f$. The function $f$ is commonly a concatenation followed by a $\tanh$ layer:

$$\large a_{i} = \tanh(W_a[c_i; h_{t_i}])$$

but could take other forms. We then compute the predictive distribution over output tokens as

$$\large p(y_i \mid y_1, \dots y_{i-1}, x_i) = \text{softmax}(W_s a_{i})~.$$
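As a rough NumPy sketch of the three equations above for a single decoder timestep: the helper names `softmax` and `attend` and the dimensions are illustrative assumptions. The scoring function `neg_sq_dist` is only a placeholder chosen so that the example runs; it is not one of the three scoring functions studied below.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(h_t, H_s, W_a, score_fn):
    """One attention step for a single decoder state.

    h_t: [hidden]; H_s: [hidden, src_len] (source states as columns);
    W_a: [att_size, 2 * hidden].
    """
    scores = np.array([score_fn(h_t, H_s[:, j]) for j in range(H_s.shape[1])])
    alpha = softmax(scores)                      # attention weights
    c = H_s @ alpha                              # context vector
    a = np.tanh(W_a @ np.concatenate([c, h_t]))  # attention vector
    return alpha, c, a


def neg_sq_dist(h_t, h_s):
    """Placeholder score (an assumption for this sketch): closer states score higher."""
    return -np.sum((h_t - h_s) ** 2)


rng = np.random.default_rng(0)
hidden, src_len, att_size = 4, 6, 3
H_s = rng.normal(size=(hidden, src_len))
h_t = rng.normal(size=hidden)
W_a = rng.normal(size=(att_size, 2 * hidden))

alpha, c, a = attend(h_t, H_s, W_a, score_fn=neg_sq_dist)
```

The explicit Python loop over source positions is kept for readability; a vectorized version would compute all scores with a single matrix operation.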

## Q1. LSTM cell with attention (8 pts)

In the block below, you will implement the method `call`, which computes a single step of an LSTM cell using a method `attention` that computes an attention vector with some score function, as described above. **Complete the skeleton below**; assume `inputs` is already the input embedding (i.e., there is no need to construct an embedding matrix).

In [None]:
class LSTMCellWithAttention(LSTMCell):
    
    def __init__(self, num_units, memory):
        super(LSTMCellWithAttention, self).__init__(num_units)
        self.memory = memory
        
    def attention(self, target_state, source_states):
        raise NotImplementedError("The subclass must implement this method!")

    def call(self, inputs, state):
        """Run this LSTM cell with attention on inputs, conditional on state."""
        
        # Cell and hidden states of the LSTM
        c, h = state
        
        # Source (encoder) states to attend over
        source_states = self.memory
        
        # Cell activation (e.g., tanh, relu, etc.)
        activation = self._activation
        
        # LSTM cell parameters
        kernel = self._kernel
        bias = self._bias
        forget_bias = self._forget_bias
        
        ### YOUR CODE HERE
        raise NotImplementedError("Need to implement an LSTM cell with "
                                  "attention.")
        
        ### END YOUR CODE
        ### Your code should compute attention vector, new_c and new_h

        # Adhering to convention
        new_state = tf.contrib.rnn.LSTMStateTuple(new_c, new_h)
    
        return attention_vector, new_state 

We can implement a "dummy" version of attention in order to test that the LSTM cell step function is working correctly:

In [None]:
class LSTMCellWithDummyAttention(LSTMCellWithAttention):

    def attention(self, target_state, source_states):
        """Just return the target state so that the update becomes the vanilla
        LSTM update."""
        return target_state

## Q2A. Dot-product Attention (8 pts)

We first consider the simplest version of attention, which simply calculates the similarity between $h_{t_i}$ and $h_{s_j}$ by computing their dot product:

$$\large\begin{align*}
\text{score}(h_{t_i}, h_{s_j})&=h_{t_i}^\mathrm{\,T}\, h_{s_j}~.
\end{align*}$$

This computation has no additional parameters, but it limits the expressivity of the model since it forces the input and output encodings to be close in order to have a high score.

For this question, **implement the `attention` method of the following LSTM cell using dot-product attention.** Your code should be fewer than ten lines and *not* make use of any higher-level primitives from `tf.nn` or `tf.layers`, etc. (6 pts). As a further step, **vectorize the operation** so that you can compute $\text{score}(\cdot, h_{s_j})$ for every word in the source sentence in parallel (2 pts).

In [None]:
class LSTMCellWithDotProductAttention(LSTMCellWithAttention):
        
    def build(self, inputs_shape):
        super(LSTMCellWithDotProductAttention, self).build(inputs_shape)
        self._W_c = self.add_variable("W_c", 
                                      shape=[self._num_units + self._num_units, 
                                             256])

    def attention(self, target_state, source_states):
        """Return the attention vector computed from attending over
        source_states using a function of target_state and source_states."""
        
        ### YOUR CODE HERE
        raise NotImplementedError("Need to implement dot-product attention.")
        
        ### END YOUR CODE
        
        ### Your code should compute the context vector c
        attention_vector = tf.tanh(tf.matmul(tf.concat([c, target_state], -1), self._W_c))
        
        return attention_vector

## Q2B. Bilinear Attention (8 pts)

To make the score function more expressive, we may consider using a bilinear function of the form

$$\large\begin{align*}
\text{score}(h_{t_i}, h_{s_j})&=h_{t_i}^\mathrm{\,T} W_\text{att} h_{s_j}~,
\end{align*}$$

which transforms the source encoding $h_{s_j}$ by a linear transformation parameterized by $W_\text{att}$ before taking the dot product. This formulation adds additional parameters that must be learned, but increases expressivity and also allows the source and target encodings to be of different dimensionality (if we so wish).
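The point about differing dimensionalities can be illustrated with a tiny NumPy check on a single pair of states. The sizes 3 and 5 and the random values are arbitrary assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
target_dim, source_dim = 3, 5          # deliberately different sizes
h_t = rng.normal(size=target_dim)      # one target (decoder) state
h_s = rng.normal(size=source_dim)      # one source (encoder) state
W_att = rng.normal(size=(target_dim, source_dim))

# The bilinear score h_t^T W_att h_s is a scalar even though the sizes differ.
bilinear_score = h_t @ W_att @ h_s

# A plain dot product, by contrast, is undefined for mismatched sizes.
try:
    np.dot(h_t, h_s)
    dot_product_defined = True
except ValueError:
    dot_product_defined = False
```

In this notebook both encodings have `num_units` dimensions, so the extra flexibility is not strictly needed here.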

For this question, **implement the `attention` method of the following LSTM cell using bilinear attention.** Your code should be fewer than ten lines and *not* make use of any higher-level primitives from `tf.nn` or `tf.layers`, etc. (6 pts). As a further step, **vectorize the operation** so that you can compute $\text{score}(\cdot, h_{s_j})$ for every word in the source sentence in parallel (2 pts).

In [None]:
class LSTMCellWithBilinearAttention(LSTMCellWithAttention):
    
    def build(self, inputs_shape):
        super(LSTMCellWithBilinearAttention, self).build(inputs_shape)
        self._W_att = self.add_variable("W_att", 
                                        shape=[self._num_units, 
                                               self._num_units])
        self._W_c = self.add_variable("W_c", 
                                      shape=[self._num_units + self._num_units, 
                                             256])

    def attention(self, target_state, source_states):
        """Return the attention vector computed from attending over
        source_states using a function of target_state and source_states."""
        
        ### YOUR CODE HERE
        raise NotImplementedError("Need to implement bilinear attention "
                                  "using the weight matrix self._W_att.")
       
        ### END YOUR CODE
        
        ### Your code should compute the context vector c
        attention_vector = tf.tanh(tf.matmul(tf.concat([c, target_state], -1), self._W_c))
        
        return attention_vector

## Q2C. Feedforward Attention (8 pts)

Instead of simply using a linear transformation, why don't we use an even more expressive feedforward neural network to compute the score?

$$\large\begin{align*}
\text{score}(h_{t_i}, h_{s_j})&=W_{\text{att}_2} \tanh( W_{\text{att}_1} [h_{t_i}; h_{s_j}])~,
\end{align*}$$

where $[v_1; v_2]$ denotes a concatenation of the vectors $v_1$ and $v_2$, and $W_{\text{att}_1}$ and $W_{\text{att}_2}$ are learned parameter matrices. The feedforward approach typically has fewer parameters (depending on the size of the hidden layer) than the bilinear attention mechanism (which requires `source_embedding_dim` $\times$ `target_embedding_dim` parameters).
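How the two parameter counts compare depends on the width of the hidden layer, which a quick calculation makes explicit. The sketch below counts only the weight matrices in the two score equations (no biases); the helper names and the hidden widths 64 and 512 are illustrative assumptions.

```python
def bilinear_params(target_dim, source_dim):
    """W_att has one entry per (target, source) dimension pair."""
    return target_dim * source_dim


def feedforward_params(target_dim, source_dim, att_hidden):
    """W_att_1 maps [h_t; h_s] to the hidden layer; W_att_2 maps it to a scalar."""
    return (target_dim + source_dim) * att_hidden + att_hidden


num_units = 512
counts = {
    "bilinear": bilinear_params(num_units, num_units),
    "feedforward, hidden=64": feedforward_params(num_units, num_units, 64),
    "feedforward, hidden=512": feedforward_params(num_units, num_units, 512),
}
for name, n in counts.items():
    print(f"{name}: {n}")
```

With a narrow hidden layer (64 units) the feedforward scorer is about four times smaller than the bilinear one (65,600 versus 262,144 weights). With a hidden layer as wide as `num_units`, as in the `W_att_1` shape of the skeleton below, it is about twice as large (524,800 weights).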

For this question, **implement the `attention` method of the following LSTM cell using feedforward attention.** Your code should be fewer than ten lines and *not* make use of any higher-level primitives from `tf.nn` or `tf.layers`, etc. (6 pts). As a further step, **vectorize the operation** so that you can compute $\text{score}(\cdot, h_{s_j})$ for every word in the source sentence in parallel (2 pts).

In [None]:
class LSTMCellWithFeedForwardAttention(LSTMCellWithAttention):
    
    def build(self, inputs_shape):
        super(LSTMCellWithFeedForwardAttention, self).build(inputs_shape)

        self._W_att_1 = self.add_variable("W_att_1", 
                                          shape=[self._num_units + self._num_units, 
                                                 self._num_units])
        self._W_att_2 = self.add_variable("W_att_2", 
                                          shape=[self._num_units, 1])
        self._W_c = self.add_variable("W_c", 
                                      shape=[self._num_units + self._num_units, 
                                             256])
        
    def attention(self, target_state, source_states):
        """Return the attention vector computed from attending over
        source_states using a function of target_state and source_states."""
        
        ### YOUR CODE HERE
        raise NotImplementedError("Need to implement feedforward attention "
                                  "using the weight matrices self._W_att_1 "
                                  "and self._W_att_2.")

        ### END YOUR CODE
        
        ### Your code should compute the context vector c
        attention_vector = tf.tanh(tf.matmul(tf.concat([c, target_state], -1), self._W_c))
        
        return attention_vector

## Hyperparameter settings

You may find it useful to tune some of these parameters, although doing so is not required.

In [None]:
def create_standard_hparams(data_path, out_dir):
    
    hparams = tf.contrib.training.HParams(
        
        # Data
        src="vi",
        tgt="en",
        train_prefix=os.path.join(data_path, "train"),
        dev_prefix=os.path.join(data_path, "tst2012"),
        test_prefix=os.path.join(data_path, "tst2013"),
        vocab_prefix="",
        embed_prefix="",
        out_dir=out_dir,
        src_vocab_file=os.path.join(data_path, "vocab.vi"),
        tgt_vocab_file=os.path.join(data_path, "vocab.en"),
        src_embed_file="",
        tgt_embed_file="",
        src_file=os.path.join(data_path, "train.vi"),
        tgt_file=os.path.join(data_path, "train.en"),
        dev_src_file=os.path.join(data_path, "tst2012.vi"),
        dev_tgt_file=os.path.join(data_path, "tst2012.en"),
        test_src_file=os.path.join(data_path, "tst2013.vi"),
        test_tgt_file=os.path.join(data_path, "tst2013.en"),

        # Networks
        num_units=512,
        num_layers=1,
        num_encoder_layers=1,
        num_decoder_layers=1,
        num_encoder_residual_layers=0,
        num_decoder_residual_layers=0,
        dropout=0.2,
        unit_type="lstm",
        encoder_type="uni",
        residual=False,
        time_major=True,
        num_embeddings_partitions=0,

        # Train
        optimizer="adam",
        batch_size=128,
        init_op="uniform",
        init_weight=0.1,
        max_gradient_norm=100.0,
        learning_rate=0.001,
        warmup_steps=0,
        warmup_scheme="t2t",
        decay_scheme="luong234",
        colocate_gradients_with_ops=True,
        num_train_steps=12000,

        # Data constraints
        num_buckets=5,
        max_train=0,
        src_max_len=25,
        tgt_max_len=25,
        src_max_len_infer=0,
        tgt_max_len_infer=0,

        # Data format
        sos="<s>",
        eos="</s>",
        subword_option="",
        check_special_token=True,

        # Misc
        forget_bias=1.0,
        num_gpus=1,
        epoch_step=0,  # record where we were within an epoch.
        steps_per_stats=100,
        steps_per_external_eval=0,
        share_vocab=False,
        metrics=["bleu"],
        log_device_placement=False,
        random_seed=None,
        # only enable beam search during inference when beam_width > 0.
        beam_width=0,
        length_penalty_weight=0.0,
        override_loaded_hparams=True,
        num_keep_ckpts=5,
        avg_ckpts=False,
        num_intra_threads=0,
        num_inter_threads=0,

        # For inference
        inference_indices=None,
        infer_batch_size=32,
        sampling_temperature=0.0,
        num_translations_per_input=1,
        
    )
    
    src_vocab_size, _ = vocab_utils.check_vocab(hparams.src_vocab_file, hparams.out_dir)
    tgt_vocab_size, _ = vocab_utils.check_vocab(hparams.tgt_vocab_file, hparams.out_dir)
    hparams.add_hparam('src_vocab_size', src_vocab_size)
    hparams.add_hparam('tgt_vocab_size', tgt_vocab_size)
    
    out_dir = hparams.out_dir
    if not tf.gfile.Exists(out_dir):
        tf.gfile.MakeDirs(out_dir)
         
    for metric in hparams.metrics:
        hparams.add_hparam("best_" + metric, 0)  # larger is better
        best_metric_dir = os.path.join(hparams.out_dir, "best_" + metric)
        hparams.add_hparam("best_" + metric + "_dir", best_metric_dir)
        tf.gfile.MakeDirs(best_metric_dir)

        if hparams.avg_ckpts:
            hparams.add_hparam("avg_best_" + metric, 0)  # larger is better
            best_metric_dir = os.path.join(hparams.out_dir, "avg_best_" + metric)
            hparams.add_hparam("avg_best_" + metric + "_dir", best_metric_dir)
            tf.gfile.MakeDirs(best_metric_dir)

    return hparams

## Q3. Training (8 pts)

For this question, **train at least two of the models that use the attention modules you defined above**. Did you notice any difference in the training or evaluation of the different models? **Provide a brief written answer below.**

*Note*: Make sure you **remove the model checkpoints** in the appropriate folders (`nmt_model_dotprodatt`, `nmt_model_bilinearatt` or `nmt_model_ffatt`) if you would like to start training from scratch. (It's safe to delete all the files saved in the directory, or move them elsewhere.) Otherwise, the saved parameters will automatically be reloaded from the latest checkpoint and training will resume where it left off.
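One possible way to clear a model directory from within the notebook is sketched below. The helper name `reset_model_dir` is an assumption made for this example; deleting the folder by hand works equally well.

```python
import os
import shutil


def reset_model_dir(model_dir):
    """Delete `model_dir` and everything in it, if it exists.

    Returns True if a directory was removed and False if there was nothing to remove.
    """
    if os.path.isdir(model_dir):
        shutil.rmtree(model_dir)
        return True
    return False
```

For example, calling `reset_model_dir("nmt_model_dotprodatt")` before the corresponding training cell would force training to start from scratch. The deletion is irreversible, so move the directory elsewhere instead if you may want the checkpoints later.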

**Your written answer here!**

In [None]:
# If desired as a baseline, train a vanilla LSTM model without attention
hparams = create_standard_hparams(
    data_path=os.path.join("datasets", "nmt_data_vi"), 
    out_dir="nmt_model_noatt"
)
hparams.add_hparam("attention_cell_class", LSTMCellWithDummyAttention)
train(hparams, AttentionalModel)

In [None]:
# Train an LSTM model with dot-product attention
hparams = create_standard_hparams(data_path=os.path.join("datasets", "nmt_data_vi"), 
                                  out_dir="nmt_model_dotprodatt")
hparams.add_hparam("attention_cell_class", LSTMCellWithDotProductAttention)
train(hparams, AttentionalModel)

In [None]:
# Train an LSTM model with bilinear attention
hparams = create_standard_hparams(data_path=os.path.join("datasets", "nmt_data_vi"),
                                  out_dir="nmt_model_bilinearatt")
hparams.add_hparam("attention_cell_class", LSTMCellWithBilinearAttention)
train(hparams, AttentionalModel)

In [None]:
# Train an LSTM model with feedforward attention
hparams = create_standard_hparams(data_path=os.path.join("datasets", "nmt_data_vi"), 
                                  out_dir="nmt_model_ffatt")
hparams.add_hparam("attention_cell_class", LSTMCellWithFeedForwardAttention)
train(hparams, AttentionalModel)