# Machine Translation and Attention
In this notebook, we will implement a model for neural machine translation (NMT) with attention. This notebook is adapted from the [TensorFlow tutorial on NMT](https://www.tensorflow.org/tutorials/seq2seq) as well as the [TensorFlow NMT package](https://github.com/tensorflow/nmt/).

In [1]:
%matplotlib inline

import collections
from functools import partial
import math
import matplotlib.pyplot as plt
import os
import random
import time
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange

import tensorflow as tf

# Helper TensorFlow functions
from utils import maybe_download

# The encoder-decoder architecture
from nmt.model import AttentionalModel, LSTMCell
from nmt.utils import vocab_utils
from nmt.train import train



## Data
We'll train our model on a small-scale dataset: an English-Vietnamese parallel corpus of TED talks (133K sentence pairs) provided by the IWSLT Evaluation Campaign (https://sites.google.com/site/iwsltevaluation2015/).

In [2]:
out_dir = os.path.join('datasets', 'nmt_data_vi')
site_prefix = "https://nlp.stanford.edu/projects/nmt/data/"

maybe_download(site_prefix + 'iwslt15.en-vi/train.en', out_dir, 13603614)
maybe_download(site_prefix + 'iwslt15.en-vi/train.vi', out_dir, 18074646)

maybe_download(site_prefix + 'iwslt15.en-vi/tst2012.en', out_dir, 140250)
maybe_download(site_prefix + 'iwslt15.en-vi/tst2012.vi', out_dir, 188396)

maybe_download(site_prefix + 'iwslt15.en-vi/tst2013.en', out_dir, 132264)
maybe_download(site_prefix + 'iwslt15.en-vi/tst2013.vi', out_dir, 183855)

maybe_download(site_prefix + 'iwslt15.en-vi/vocab.en', out_dir, 139741)
maybe_download(site_prefix + 'iwslt15.en-vi/vocab.vi', out_dir, 46767)

Found and verified datasets/nmt_data_vi/train.en
Found and verified datasets/nmt_data_vi/train.vi
Found and verified datasets/nmt_data_vi/tst2012.en
Found and verified datasets/nmt_data_vi/tst2012.vi
Found and verified datasets/nmt_data_vi/tst2013.en
Found and verified datasets/nmt_data_vi/tst2013.vi
Found and verified datasets/nmt_data_vi/vocab.en
Found and verified datasets/nmt_data_vi/vocab.vi


'vocab.vi'

## Introduction to NMT

<figure>
    <img src='images/encdec.jpg' alt='missing' />
    <figcaption>**Figure 1.** Example of a general, *encoder-decoder* approach to NMT. An encoder converts a source sentence into a representation which is passed through a decoder to produce a translation.</figcaption>
</figure>

A neural machine translation (NMT) system reads in a source sentence using an *encoder*, and then uses a *decoder* to emit a translation. NMT models vary in terms of their exact architectures. A natural choice for sequential data is the recurrent neural network (RNN). Usually an RNN is used for both the encoder and decoder. The RNN models, however, differ in terms of: (a) directionality – unidirectional or bidirectional (whether they read the source sentence forwards only, or both forwards and backwards); (b) depth – single- or multi-layer; and (c) type – often either a vanilla RNN, a long short-term memory (LSTM) network, or a gated recurrent unit (GRU).

We will consider a deep multi-layer RNN which is bi-directional (it reads the input sequence both forwards and backwards) and uses LSTM units with attention. At a high level, the NMT model consists of two recurrent neural networks: the encoder recurrent network simply consumes the input source words without making any prediction; the decoder, on the other hand, processes the target sentence while predicting the next words.

<figure>
    <img src='images/seq2seq.jpg' alt='missing' />
    <figcaption>**Figure 2.** Example of a neural machine translation system for translating a source sentence "I am a student" into a target sentence "Je suis étudiant".  Here, $<s>$ marks the start of the decoding process while $</s>$ tells the decoder to stop.
    </figcaption>
</figure>

At the bottom layer, the encoder and decoder recurrent networks receive as input the following: first, the source sentence, then a boundary marker $<s>$ which indicates the transition from the encoding to the decoding mode, and finally the target sentence. We now go into the details of how the model deals with source and target sentences.

### Embedding
Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. For this embedding layer to work, a vocabulary is first chosen for each language. Usually, a vocabulary size $V$ is selected, and only the most frequent $V$ words in the corpus are treated as unique. All other words are converted to an "unknown" token $<$UNK$>$ and all get the same embedding. The embedding weights, one set per language, are usually learned during training (but pretrained word embeddings may be used instead).
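
The vocabulary step described above can be sketched in plain Python. This is a minimal illustration, not the preprocessing used by the `nmt` package: the helper names `build_vocab` and `encode`, the toy corpus, and the choice of id 0 for the unknown token are all assumptions made for the example.

```python
from collections import Counter

def build_vocab(sentences, vocab_size, unk="<unk>"):
    """Keep the vocab_size most frequent words; id 0 is reserved for unknowns."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    words = [unk] + [w for w, _ in counts.most_common(vocab_size)]
    return {w: i for i, w in enumerate(words)}

def encode(sentence, vocab, unk="<unk>"):
    """Map each word to its id; all out-of-vocabulary words share the unknown id."""
    return [vocab.get(word, vocab[unk]) for word in sentence.split()]

# Toy corpus (hypothetical): "the" occurs 3 times, "cat" twice, the rest once.
corpus = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocab(corpus, vocab_size=2)
ids = encode("the dog sat", vocab)  # "dog" and "sat" both map to the unknown id
```

Each id would then index a row of that language's embedding matrix, so every unknown word retrieves the same embedding vector.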

### Encoder
Once retrieved, the word embeddings are then fed as input into the main network, which consists of two multi-layer recurrent neural networks -- an encoder for the source language and a decoder for the target language. These two networks, in principle, can share the same weights; however, in practice, we often use two different sets of parameters (such models do a better job when fitting large training datasets). The encoder uses zero vectors as its starting states (before it sees the source sequence). In TensorFlow:

    # Build RNN cell
    encoder_cell = YourEncoderRNNCell(num_units)

    # Run Dynamic RNN
    #   encoder_outputs: [max_time, batch_size, num_units]
    #   encoder_state: [batch_size, num_units]
    encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
        encoder_cell, encoder_emb_inp,
        sequence_length=source_sequence_length, time_major=True)

### Decoder
The decoder also needs to have access to the source information, and one simple way to achieve that is to initialize it with the last hidden state of the encoder, `encoder_state`. In Figure 2, we pass the hidden state at the source word "student" to the decoder side.

    # Build RNN cell
    decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
    
    # Helper
    helper = tf.contrib.seq2seq.TrainingHelper(
        decoder_emb_inp, decoder_lengths, time_major=True)

    # Decoder
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state, output_layer=projection_layer)
    
    # Dynamic decoding
    outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
    logits = outputs.rnn_output

### Loss
Given the logits above, we are now ready to compute the training loss:

    crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=decoder_outputs, logits=logits)
    train_loss = (tf.reduce_sum(crossent * target_weights) / batch_size)

Here, `target_weights` is a zero-one matrix of the same size as `decoder_outputs`. It masks padding positions outside of the target sequence lengths with the value 0.

Important note: It's worth pointing out that we should divide the loss by `batch_size`, so our hyperparameters are "invariant" to `batch_size`. Some people divide the loss by (`batch_size * num_time_steps`), which plays down the errors made on short sentences. More subtly, the same hyperparameters (applied to the former way) can't be used for the latter way. For example, if both approaches use SGD with a learning rate of `1.0`, the latter approach effectively uses a much smaller learning rate of `1 / num_time_steps`.
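
To make the effect of the two normalizations concrete, here is a small NumPy sketch. The function name `masked_loss`, the toy per-position loss of 2.0, and the sentence lengths are assumptions chosen for illustration; the arrays are laid out time-major, `[max_time, batch_size]`.

```python
import numpy as np

def masked_loss(crossent, target_weights, per_position=False):
    """Sum the masked per-position losses, then normalize.

    crossent, target_weights: arrays of shape [max_time, batch_size].
    """
    total = np.sum(crossent * target_weights)
    num_time_steps, batch_size = crossent.shape
    if per_position:
        return total / (batch_size * num_time_steps)
    return total / batch_size

# Two target sentences of lengths 4 and 2, padded to max_time = 4.
crossent = np.full((4, 2), 2.0)  # pretend every position has loss 2.0
target_weights = np.array([[1, 1], [1, 1], [1, 0], [1, 0]], dtype=float)

loss_per_sentence = masked_loss(crossent, target_weights)                     # 12 / 2
loss_per_position = masked_loss(crossent, target_weights, per_position=True)  # 12 / 8
```

The two losses differ by exactly a factor of `num_time_steps` (here 4), so their gradients do too, which is why a learning rate tuned for one normalization does not transfer to the other.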

### How to generate translations at test time

While you're training your NMT models (and once you have trained models), you can obtain translations given previously unseen source sentences. At test time, we only have access to the source sentence; i.e., `encoder_inputs`. There are many ways to perform decoding given those inputs. Decoding methods include greedy, sampling, and beam-search decoding. Here, we will discuss the greedy decoding strategy.

The idea is simple and illustrated in Figure 3:

1. We still encode the source sentence in the same way as during training to obtain an `encoder_state`, and this `encoder_state` is used to initialize the decoder.

2. The decoding (translation) process is started as soon as the decoder receives a starting symbol $<$s$>$.

3. For each timestep on the decoder side, we treat the recurrent network's output as a set of logits. We choose the most likely word, the id associated with the maximum logit value, as the emitted word (this is the "greedy" behavior). For example in Figure 3, the word "moi" has the highest translation probability in the first decoding step. We then feed this word as input to the next timestep. (At training time, however, we may feed in the true target as input to the next timestep in a process called *teacher forcing*.)

4. The process continues until the end-of-sentence marker $<$/s$>$ is produced as an output symbol.

<figure>
    <img src='images/greedy_dec.jpg' alt='missing' />
    <figcaption>**Figure 3.** Example of how a trained NMT model produces a translation for a source sentence "Je suis étudiant" using greedy search.
    </figcaption>
</figure>
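
The greedy procedure above can be sketched as a short loop. Everything here is illustrative: `greedy_decode`, `toy_step`, and the token ids are hypothetical stand-ins, and the "decoder" is a scripted toy rather than a trained recurrent network.

```python
def greedy_decode(step_fn, initial_state, sos_id, eos_id, max_steps=20):
    """Feed the arg-max token back in at each step until eos_id is emitted."""
    state, token, output = initial_state, sos_id, []
    for _ in range(max_steps):
        logits, state = step_fn(token, state)
        token = max(range(len(logits)), key=logits.__getitem__)  # arg-max id
        if token == eos_id:
            break
        output.append(token)
    return output

# Hypothetical toy decoder: the state is a step counter and the logits
# favor a fixed script of ids (3, then 4, then the end-of-sentence id 1).
SCRIPT = [3, 4, 1]

def toy_step(token, state):
    logits = [0.0] * 5
    logits[SCRIPT[min(state, len(SCRIPT) - 1)]] = 1.0
    return logits, state + 1

translation = greedy_decode(toy_step, initial_state=0, sos_id=0, eos_id=1)
```

In a real model, `initial_state` would be the `encoder_state` and `step_fn` would run one step of the decoder cell followed by the projection layer. The `max_steps` cap guards against a model that never emits the end-of-sentence marker.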

## Introduction to Attention

The attention mechanism was first introduced by Bahdanau et al., 2015 [1] and then later refined by Luong et al., 2015 [2] and others. The key idea of the attention mechanism is to establish direct short-cut connections between the target and the source by paying "attention" to relevant source content as we translate (produce output tokens). A nice byproduct of the attention mechanism is an easy-to-visualize alignment matrix between the source and target sentences that we will visualize at the end of this notebook.
 
Remember that in a vanilla seq2seq model, we pass the last source state $h_{s_{T_s}}$ from the encoder to the decoder when starting the decoding process. This works well for short and medium-length sentences; however, for long sentences, the single fixed-size hidden state becomes an information bottleneck. Instead of discarding all of the hidden states computed in the source RNN, the attention mechanism provides an approach that allows the decoder to peek at them (treating them as a dynamic memory of the source information). By doing so, the attention mechanism improves the translation of longer sentences. Nowadays, attention mechanisms are the *de facto* standard and have been successfully applied to many other tasks (including image caption generation, speech recognition, and text summarization).

<figure>
    <img src='images/att.jpg' alt='missing' />
    <figcaption>**Figure 4.** Example of an attention-based NMT system with the first step of the attention computation in detail. For clarity, the embedding and projection layers are omitted.
    </figcaption>
</figure>

### How do we actually attend over the input sequence?

There are many different ways of formalizing attention. These variants depend on the form of a *scoring* function and an *attention* function (and on whether the previous state of the decoder $h_{t_{i-1}}$ is used instead of $h_{t_{i}}$ in the scoring function as originally suggested in Bahdanau et al. (2015); **we will stick to using $h_{t_{i}}$** in this notebook). Luong et al. (2015) demonstrate that only a few choices actually matter:

1. First, the basic form of attention, i.e., **direct connections between target and source**, needs to be present. 

2. Second, it's important to **feed the attention vector to the next timestep** to inform the network about past attention decisions.

3. Lastly, **choices of the scoring function** can often result in different performance. See Luong et al. (2015) for further details.

### A general framework for computing attention

The attention computation happens at every decoder time step. It consists of the following stages:

1. The current target (decoder) hidden state $h_{t_i}$ is compared with all source (encoder) states $h_{s_j}$ to derive *attention weights* $\alpha_{ij}$.
2. Based on the attention weights we compute a *context vector* $c_{i}$ as the weighted average of the source states.
3. We combine the context vector $c_{i}$ with the current target hidden state $h_{t_i}$ to yield the final *attention vector* $a_i$.
4. The attention vector $a_i$ is fed as an input to the next time step (*input feeding*). 

The first three steps can be summarized by the equations below:

$$\large\begin{align*}
\alpha_{ij} &= \frac{
    \exp(\text{score}(h_{t_i}, h_{s_j}))
}{
    \sum_{k=1}^{T_s}{\exp(\text{score}(h_{t_i}, h_{s_k}))}
} \tag{attention weights} \\\\
c_{i} &= \sum_{j=1}^{T_s} \alpha_{ij} h_{s_j} \tag{context vector} \\\\
a_{i} &= f(c_{i}, h_{t_i}) \tag{attention vector} \\\\
\end{align*}$$

Here, the function `score` is used to compare the target hidden state $h_{t_i}$ with each of the source hidden states $h_{s_j}$, and the result is normalized over the source timesteps $j = 1, \dots, T_s$ to produce attention weights $\alpha_{ij}$ (which define a distribution over source positions $j$ for a given target timestep $i$). (There are various choices of the scoring function; we will consider three below.) Note that we make use of the current decoder (or *target*) hidden state $h_{t_i}$, which is computed as a function of the previous hidden state $h_{t_{i-1}}$ and the embedding of the input token $x_{i}$ (which is either the emission or the ground truth token from the previous timestep) using the standard formula for a recurrent cell. Optionally, in the case of *input feeding*, we combine $h_{t_{i-1}}$ with the context vector from the previous timestep, $c_{i-1}$ (which may require a change in the size of the kernel matrix, depending on how the combination is implemented). The encoder (or *source*) hidden states $h_{s_j}$ for $j=1, \dots, T_s$ are similarly the standard hidden states of a recurrent cell.

We can also vectorize the computation of the context vector $c_i$ for every target timestep as follows: Given the source hidden states $h_{s_1}, \dots, h_{s_{T_s}}$, we construct a matrix $H_s$ of size `hidden_size` $\times$ `input_seq_len` by stacking the source hidden states into columns. Attention allows us to dynamically weight certain timesteps of the input sequence in a fixed size vector $c_i$ by taking a convex combination of the columns of $H_s$. In particular, we calculate a nonzero and normalized attention weight vector $\vec{\alpha}_i = [\alpha_{i1}, \dots, \alpha_{iT_s}]^T$ that weights the source hidden states in the computation

$$\large c_i = H_s\vec{\alpha}_i~.$$
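
As a quick numerical check of the matrix form, the following NumPy sketch computes the attention weights and the context vector for one target timestep. The sizes and random values are arbitrary, and the dot product is assumed as the scoring function purely for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size, input_seq_len = 4, 6

H_s = rng.normal(size=(hidden_size, input_seq_len))  # source states as columns
h_t = rng.normal(size=hidden_size)                   # current target state

scores = H_s.T @ h_t      # assumed dot-product score for every source position j
alpha = softmax(scores)   # attention weights, shape [input_seq_len]
c = H_s @ alpha           # context vector, shape [hidden_size]
```

Because the entries of `alpha` are positive and sum to one, `c` is a convex combination of the columns of `H_s`, and its size does not depend on the length of the source sentence.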



The attention vector $a_i$, from which the softmax logits and thereafter the loss are derived, is obtained by applying a function $f$ to the context vector $c_i$ and the target hidden state $h_{t_i}$. The function $f$ is commonly a concatenation followed by a $\tanh$ layer:

$$\large a_{i} = \tanh(W_a[c_i; h_{t_i}])$$

but could take other forms. We then compute the predictive distribution over output tokens as

$$\large p(y_i \mid y_1, \dots y_{i-1}, x_i) = \text{softmax}(W_s a_{i})~.$$
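
The last two equations can be traced in a few lines of NumPy. This is only a shape-level sketch: the sizes are arbitrary, and the random matrices stand in for the learned parameters $W_a$ and $W_s$.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, att_size, vocab_size = 4, 3, 10   # illustrative sizes

c = rng.normal(size=hidden_size)               # context vector c_i
h_t = rng.normal(size=hidden_size)             # target hidden state h_{t_i}
W_a = rng.normal(size=(att_size, 2 * hidden_size))
W_s = rng.normal(size=(vocab_size, att_size))

a = np.tanh(W_a @ np.concatenate([c, h_t]))    # attention vector a_i
logits = W_s @ a
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the vocabulary
```

Note that $W_a$ must have `2 * hidden_size` columns because it acts on the concatenation $[c_i; h_{t_i}]$, while its number of rows sets the size of the attention vector.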

## Q1. LSTM cell with attention (8 pts)

In the block below, you will implement the method `call`, which computes a single step of an LSTM cell using a method `attention` that computes an attention vector with some score function, as described above. **Complete the skeleton below**; assume `inputs` is already the input embedding (i.e., there is no need to construct an embedding matrix).

In [3]:
class LSTMCellWithAttention(LSTMCell):
    
    def __init__(self, num_units, memory):
        super(LSTMCellWithAttention, self).__init__(num_units)
        self.memory = memory
        
    def attention(self, target_state, source_states):
        raise NotImplementedError("The subclass must implement this method!")

    def call(self, inputs, state):
        """Run this LSTM cell with attention on inputs, conditional on state."""
        
        # Cell and hidden states of the LSTM
        c, h = state
        
        # Source (encoder) states to attend over
        source_states = self.memory
        
        # Cell activation (e.g., tanh, relu, etc.)
        activation = self._activation
        
        # LSTM cell parameters
        kernel = self._kernel
        bias = self._bias
        forget_bias = self._forget_bias
        
        ### YOUR CODE HERE
        # shapes of tensors
        # input [batch, state_size] or [batch, num_units] 
        # source_states [batch, input_length, state_size]
        # c [batch, state_size]
        # h [batch, state_size]
        
        lstm_matrix = tf.matmul(tf.concat([inputs, h], 1), kernel) # [batch, 4*state_size]
        lstm_matrix = tf.add(lstm_matrix, bias)
        i, g, f, o = tf.split(lstm_matrix, 4, 1) # each size [batch, state_size]
        new_c = tf.sigmoid(f + forget_bias) * c + tf.sigmoid(i) * activation(g)
        new_h = tf.sigmoid(o) * activation(new_c)
        
        attention_vector = self.attention(new_h, source_states)  # target_state is the new hidden state h_{t_i}
        ### END YOUR CODE
        ### Your code should compute attention vector, new_c and new_h

        # Adhering to convention
        new_state = tf.contrib.rnn.LSTMStateTuple(new_c, new_h)
    
        return attention_vector, new_state 

We can implement a "dummy" version of attention in order to test that the LSTM cell step function is working correctly:

In [5]:
class LSTMCellWithDummyAttention(LSTMCellWithAttention):

    def attention(self, target_state, source_states):
        """Just return the target state so that the update becomes the vanilla
        LSTM update."""
        return target_state

## Q2A. Dot-product Attention (8 pts)

We first consider the simplest version of attention, which simply calculates the similarity between $h_{t_i}$ and $h_{s_j}$ by computing their dot product:

$$\large\begin{align*}
\text{score}(h_{t_i}, h_{s_j})&=h_{t_i}^\mathrm{\,T}\, h_{s_j}~.
\end{align*}$$

This computation has no additional parameters, but it limits the expressivity of the model since it forces the input and output encodings to be close in order to have a high score.
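
To illustrate what vectorizing the dot-product score means, the NumPy sketch below computes the scores for a whole batch twice: once with explicit loops and once with a single batched product. The shapes follow the `[batch, input_length, state_size]` convention; the sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, input_length, state_size = 2, 5, 3

source_states = rng.normal(size=(batch, input_length, state_size))
target_state = rng.normal(size=(batch, state_size))

# Loop version: one dot product per (example, source position) pair.
scores_loop = np.zeros((batch, input_length))
for b in range(batch):
    for j in range(input_length):
        scores_loop[b, j] = target_state[b] @ source_states[b, j]

# Vectorized version: a single batched matrix-vector product.
scores_vec = np.einsum("bjd,bd->bj", source_states, target_state)
```

Both versions produce the same `[batch, input_length]` array of scores; the vectorized form simply avoids the per-position loop.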

For this question, **implement the `__call__` function of the following LSTM cell using dot-product attention.** Your code should be fewer than ten lines and *not* make use of any higher-level primitives from `tf.nn` or `tf.layers`, etc. (6 pts). As a further step, **vectorize the operation** so that you can compute $\text{score}(\cdot, h_{s_j})$ for every word in the source sentence in parallel (2 pts).

In [4]:
class LSTMCellWithDotProductAttention(LSTMCellWithAttention):
        
    def build(self, inputs_shape):
        super(LSTMCellWithDotProductAttention, self).build(inputs_shape)
        self._W_c = self.add_variable("W_c", 
                                      shape=[self._num_units + self._num_units, 
                                             256])

    def attention(self, target_state, source_states):
        """Return the attention vector computed from attending over
        source_states using a function of target_state and source_states."""
        
        ### YOUR CODE HERE
        #raise NotImplementedError("Need to implement dot-product attention.")
        
        # shapes of tensors
        # source_states [batch, input_length, state_size]
        # target_state [batch, state_size]

        scores = tf.matmul(source_states, tf.expand_dims(target_state, -1)) # [batch, input_length, 1]
        scores = scores - tf.reduce_max(scores, 1, keepdims=True)
        scores_exp = tf.exp(scores)
        scores = scores_exp/tf.reduce_sum(scores_exp, 1, keepdims=True)
        c = tf.squeeze(tf.matmul(source_states, scores, transpose_a=True), -1) # [batch, state_size]
        
        ### END YOUR CODE
        
        ### Your code should compute the context vector c
        attention_vector = tf.tanh(tf.matmul(tf.concat([c, target_state], -1), self._W_c))
        
        return attention_vector

## Q2B. Bilinear Attention (8 pts)

To make the score function more expressive, we may consider using a bilinear function of the form

$$\large\begin{align*}
\text{score}(h_{t_i}, h_{s_j})&=h_{t_i}^\mathrm{\,T} W_\text{att} h_{s_j}~,
\end{align*}$$

which transforms the source encoding $h_{s_j}$ by a linear transformation parameterized by $W_\text{att}$ before taking the dot product. This formulation adds additional parameters that must be learned, but increases expressivity and also allows the source and target encodings to be of different dimensionality (if we so wish).
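
A small NumPy sketch of the bilinear score, deliberately using different source and target dimensionalities to show that the formulation allows it. The sizes are arbitrary and the random `W_att` stands in for a learned parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
input_length, source_dim, target_dim = 5, 3, 4

H_s = rng.normal(size=(input_length, source_dim))   # source states as rows
h_t = rng.normal(size=target_dim)                   # target state
W_att = rng.normal(size=(target_dim, source_dim))   # learned in a real model

# score(h_t, h_{s_j}) = h_t^T W_att h_{s_j}, computed for all j at once.
scores = H_s @ (W_att.T @ h_t)                      # shape [input_length]
```

Computing `W_att.T @ h_t` once and reusing it for every source position keeps the cost per target timestep at one matrix-vector product plus `input_length` dot products.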

For this question, **implement the `__call__` function of the following LSTM cell using bilinear attention.** Your code should be fewer than ten lines and *not* make use of any higher-level primitives from `tf.nn` or `tf.layers`, etc. (6 pts). As a further step, **vectorize the operation** so that you can compute $\text{score}(\cdot, h_{s_j})$ for every word in the source sentence in parallel (2 pts).

In [11]:
class LSTMCellWithBilinearAttention(LSTMCellWithAttention):
    
    def build(self, inputs_shape):
        super(LSTMCellWithBilinearAttention, self).build(inputs_shape)
        self._W_att = self.add_variable("W_att", 
                                        shape=[self._num_units, 
                                               self._num_units])
        self._W_c = self.add_variable("W_c", 
                                      shape=[self._num_units + self._num_units, 
                                             256])

    def attention(self, target_state, source_states):
        """Return the attention vector computed from attending over
        source_states using a function of target_state and source_states."""
        
        ### YOUR CODE HERE
        
        # shapes of tensors
        # source_states [batch, input_length, state_size]
        # target_state [batch, state_size]
        
        batch_size = tf.shape(target_state)[0]
        W_att_batch = tf.tile(tf.expand_dims(self._W_att, 0), [batch_size, 1, 1]) # [batch, state_size, state_size]
        target_state_W_att = tf.matmul(W_att_batch,  tf.expand_dims(target_state, -1)) # [batch, state_size, 1]
        scores = tf.matmul(source_states, target_state_W_att) # [batch, input_length, 1]
        scores = scores - tf.reduce_max(scores, 1, keepdims=True)
        scores_exp = tf.exp(scores)
        scores = scores_exp/tf.reduce_sum(scores_exp, 1, keepdims=True)
        c = tf.squeeze(tf.matmul(source_states, scores, transpose_a=True), -1) # [batch, state_size]
        
       
        ### END YOUR CODE
        
        ### Your code should compute the context vector c
        attention_vector = tf.tanh(tf.matmul(tf.concat([c, target_state], -1), self._W_c))
        
        return attention_vector

## Q2C. Feedforward Attention (8 pts)

Instead of simply using a linear transformation, why don't we use an even more expressive feedforward neural network to compute the score?

$$\large\begin{align*}
\text{score}(h_{t_i}, h_{s_j})&=W_{\text{att}_2} \tanh( W_{\text{att}_1} [h_{t_i}; h_{s_j}])~,
\end{align*}$$

where $[v_1; v_2]$ denotes a concatenation of the vectors $v_1$ and $v_2$, and $W_{\text{att}_1}$ and $W_{\text{att}_2}$ are learned parameter matrices. The feedforward approach typically has fewer parameters (depending on the size of the hidden layer) than the bilinear attention mechanism (which requires `source_embedding_dim` $\times$ `target_embedding_dim` parameters).
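
The feedforward score and the parameter comparison can both be sketched in NumPy. The sizes, the random weights, and the helper `num_params` are assumptions for illustration; bias terms are omitted, as in the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
input_length, state_size, hidden = 5, 3, 4

H_s = rng.normal(size=(input_length, state_size))    # source states as rows
h_t = rng.normal(size=state_size)                    # target state
W_att_1 = rng.normal(size=(hidden, 2 * state_size))
W_att_2 = rng.normal(size=(1, hidden))

# Pair the target state with every source state: rows are [h_t; h_{s_j}].
pairs = np.concatenate([np.tile(h_t, (input_length, 1)), H_s], axis=1)
scores = (np.tanh(pairs @ W_att_1.T) @ W_att_2.T).squeeze(-1)  # [input_length]

def num_params(source_dim, target_dim, hidden):
    """Return (feedforward, bilinear) parameter counts for the two scorers."""
    feedforward = hidden * (source_dim + target_dim) + hidden
    bilinear = source_dim * target_dim
    return feedforward, bilinear

narrow = num_params(512, 512, hidden=128)   # feedforward is smaller here
wide = num_params(512, 512, hidden=512)     # feedforward is larger here
```

The comparison shows why the parameter claim depends on the hidden layer: with 512-dimensional states, a 128-unit hidden layer needs about half the parameters of the bilinear form, whereas a hidden layer as wide as the states (as in the `W_att_1` shape used in the skeleton below) needs about twice as many.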

For this question, **implement the `__call__` function of the following LSTM cell using feedforward attention.** Your code should be fewer than ten lines and *not* make use of any higher-level primitives from `tf.nn` or `tf.layers`, etc. (6 pts). As a further step, **vectorize the operation** so that you can compute $\text{score}(\cdot, h_{s_j})$ for every word in the source sentence in parallel (2 pts).

In [13]:
class LSTMCellWithFeedForwardAttention(LSTMCellWithAttention):
    
    def build(self, inputs_shape):
        super(LSTMCellWithFeedForwardAttention, self).build(inputs_shape)

        self._W_att_1 = self.add_variable("W_att_1", 
                                          shape=[self._num_units + self._num_units, 
                                                 self._num_units])
        self._W_att_2 = self.add_variable("W_att_2", 
                                          shape=[self._num_units, 1])
        self._W_c = self.add_variable("W_c", 
                                      shape=[self._num_units + self._num_units, 
                                             256])
        
    def attention(self, target_state, source_states):
        """Return the attention vector computed from attending over
        source_states using a function of target_state and source_states."""
        
        ### YOUR CODE HERE
        
        # shapes of tensors
        # source_states [batch, input_length, state_size]
        # target_state [batch, state_size]
        # W_att_1 [2*state_size, state_size]
        # W_att_2 [state_size, 1]
        
        input_length = tf.shape(source_states)[1]
        batch_size = tf.shape(source_states)[0]
        target_state_tile = tf.tile(tf.expand_dims(target_state, 1), [1, input_length, 1]) # [batch, input_length, state_size]
        state_concat = tf.concat([source_states, target_state_tile], 2) # [batch, input_length, 2*state_size]
        W_att_1_batch = tf.tile(tf.expand_dims(self._W_att_1, 0), [batch_size, 1, 1]) # [batch_size, 2*state_size, state_size]
        W_att_2_batch = tf.tile(tf.expand_dims(self._W_att_2, 0), [batch_size, 1, 1]) # [batch_size, state_size, 1]
        temp = tf.tanh(tf.matmul(state_concat, W_att_1_batch)) # [batch_size, input_length, state_size]
        scores = tf.matmul(temp, W_att_2_batch) # [batch_size, input_length, 1]
                                
        scores = scores - tf.reduce_max(scores, 1, keepdims=True)
        scores_exp = tf.exp(scores)
        scores = scores_exp/tf.reduce_sum(scores_exp, 1, keepdims=True)
        c = tf.squeeze(tf.matmul(source_states, scores, transpose_a=True), -1) # [batch, state_size]
        
                                
        ### END YOUR CODE
        
        ### Your code should compute the context vector c
        attention_vector = tf.tanh(tf.matmul(tf.concat([c, target_state], -1), self._W_c))
        
        return attention_vector

## Hyperparameter settings

You may find it useful to tune some of these parameters (but not necessarily).

In [5]:
def create_standard_hparams(data_path, out_dir):
    
    hparams = tf.contrib.training.HParams(
        
        # Data
        src="vi",
        tgt="en",
        train_prefix=os.path.join(data_path, "train"),
        dev_prefix=os.path.join(data_path, "tst2012"),
        test_prefix=os.path.join(data_path, "tst2013"),
        vocab_prefix="",
        embed_prefix="",
        out_dir=out_dir,
        src_vocab_file=os.path.join(data_path, "vocab.vi"),
        tgt_vocab_file=os.path.join(data_path, "vocab.en"),
        src_embed_file="",
        tgt_embed_file="",
        src_file=os.path.join(data_path, "train.vi"),
        tgt_file=os.path.join(data_path, "train.en"),
        dev_src_file=os.path.join(data_path, "tst2012.vi"),
        dev_tgt_file=os.path.join(data_path, "tst2012.en"),
        test_src_file=os.path.join(data_path, "tst2013.vi"),
        test_tgt_file=os.path.join(data_path, "tst2013.en"),

        # Networks
        num_units=512,
        num_layers=1,
        num_encoder_layers=1,
        num_decoder_layers=1,
        num_encoder_residual_layers=0,
        num_decoder_residual_layers=0,
        dropout=0.2,
        unit_type="lstm",
        encoder_type="uni",
        residual=False,
        time_major=True,
        num_embeddings_partitions=0,

        # Train
        optimizer="adam",
        batch_size=128,
        init_op="uniform",
        init_weight=0.1,
        max_gradient_norm=100.0,
        learning_rate=0.001,
        warmup_steps=0,
        warmup_scheme="t2t",
        decay_scheme="luong234",
        colocate_gradients_with_ops=True,
        num_train_steps=12000,

        # Data constraints
        num_buckets=5,
        max_train=0,
        src_max_len=25,
        tgt_max_len=25,
        src_max_len_infer=0,
        tgt_max_len_infer=0,

        # Data format
        sos="<s>",
        eos="</s>",
        subword_option="",
        check_special_token=True,

        # Misc
        forget_bias=1.0,
        num_gpus=1,
        epoch_step=0,  # record where we were within an epoch.
        steps_per_stats=100,
        steps_per_external_eval=0,
        share_vocab=False,
        metrics=["bleu"],
        log_device_placement=False,
        random_seed=None,
        # only enable beam search during inference when beam_width > 0.
        beam_width=0,
        length_penalty_weight=0.0,
        override_loaded_hparams=True,
        num_keep_ckpts=5,
        avg_ckpts=False,
        num_intra_threads=0,
        num_inter_threads=0,

        # For inference
        inference_indices=None,
        infer_batch_size=32,
        sampling_temperature=0.0,
        num_translations_per_input=1,
        
    )
    
    src_vocab_size, _ = vocab_utils.check_vocab(hparams.src_vocab_file, hparams.out_dir)
    tgt_vocab_size, _ = vocab_utils.check_vocab(hparams.tgt_vocab_file, hparams.out_dir)
    hparams.add_hparam('src_vocab_size', src_vocab_size)
    hparams.add_hparam('tgt_vocab_size', tgt_vocab_size)
    
    out_dir = hparams.out_dir
    if not tf.gfile.Exists(out_dir):
        tf.gfile.MakeDirs(out_dir)
         
    for metric in hparams.metrics:
        hparams.add_hparam("best_" + metric, 0)  # larger is better
        best_metric_dir = os.path.join(hparams.out_dir, "best_" + metric)
        hparams.add_hparam("best_" + metric + "_dir", best_metric_dir)
        tf.gfile.MakeDirs(best_metric_dir)

        if hparams.avg_ckpts:
            hparams.add_hparam("avg_best_" + metric, 0)  # larger is better
            best_metric_dir = os.path.join(hparams.out_dir, "avg_best_" + metric)
            hparams.add_hparam("avg_best_" + metric + "_dir", best_metric_dir)
            tf.gfile.MakeDirs(best_metric_dir)

    return hparams
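
For orientation when tuning `learning_rate`, `num_train_steps`, and `decay_scheme`, the sketch below reproduces the schedule that the training log reports for these defaults (`start_decay_step=8000, decay_steps 1000, decay_factor 0.5`). The function `learning_rate_at` is hypothetical, and the staircase (step-wise) halving is an assumption about how the decay is applied, not something confirmed by this notebook.

```python
def learning_rate_at(step, base_lr=0.001, start_decay_step=8000,
                     decay_steps=1000, decay_factor=0.5):
    """Constant rate until start_decay_step, then multiply by decay_factor
    once every decay_steps (assumed staircase behavior)."""
    if step < start_decay_step:
        return base_lr
    num_decays = (step - start_decay_step) // decay_steps
    return base_lr * decay_factor ** num_decays

# Learning rate at a few points of a 12000-step run.
schedule = {s: learning_rate_at(s) for s in (0, 5000, 8000, 9000, 11999)}
```

Under these assumptions the rate stays at 0.001 for the first 8000 steps and is halved every 1000 steps thereafter.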

## Q3. Training (8 pts)

For this question, **train at least two of the models that use the attention modules you defined above**. Did you notice any difference in the training or evaluation of the different models? **Provide a brief written answer below.**

*Note*: Make sure you **remove the model checkpoints** in the appropriate folders (`nmt_model_dotprod_att`, `nmt_model_bilinear_att` or `nmt_model_feedforward_att`) if you would like to start training from scratch. (It's safe to delete all the files saved in the directory, or move them elsewhere.) Otherwise, the saved parameters will automatically be reloaded from the latest checkpoint and training will resume where it left off.

**Your written answer here!**

In [7]:
# If desired as a baseline, train a vanilla LSTM model without attention
hparams = create_standard_hparams(
    data_path=os.path.join("datasets", "nmt_data_vi"), 
    out_dir="nmt_model_noatt"
)
hparams.add_hparam("attention_cell_class", LSTMCellWithDummyAttention)
train(hparams, AttentionalModel)

# Vocab file datasets/nmt_data_vi/vocab.vi exists
# Vocab file datasets/nmt_data_vi/vocab.en exists
# creating train graph ...
  num_layers = 1, num_residual_layers=0
  cell 0  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  cell 0  DropoutWrapper, dropout=0.2   DropoutWrapper  DeviceWrapper, device=/gpu:0
  learning_rate=0.001, warmup_steps=0, warmup_scheme=t2t
  decay_scheme=luong234, start_decay_step=8000, decay_steps 1000, decay_factor 0.5
# Trainable variables
  embeddings/encoder/embedding_encoder:0, (7709, 512), /device:GPU:0
  embeddings/decoder/embedding_decoder:0, (17191, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_dummy_attention/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_dummy_attention/bias:0, (2048,), /device:GPU:0
  dynamic_seq

  eval dev: perplexity 45.10, time 1s, Mon Apr  2 05:27:35 2018.
  eval test: perplexity 50.96, time 1s, Mon Apr  2 05:27:36 2018.
# Finished an epoch, step 2086. Perform external evaluation
INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-2000
  loaded infer model parameters from nmt_model_noatt/translate.ckpt-2000, time 0.09s
  # 157
    src: Bà nói : &quot; nào , hãy chắc chắn là con sẽ không làm thế chứ &quot; . Tôi nói &quot; chắc chắn ạ &quot;
    ref: She said , &quot; Now you make sure you don &apos;t do that . &quot; I said , &quot; Sure . &quot;
    nmt: And she said , &quot; I &apos;m not going to say , &quot; I &apos;m not going to be a <unk> . &quot;
INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-2000
  loaded infer model parameters from nmt_model_noatt/translate.ckpt-2000, time 0.09s
# External evaluation, global step 2000
  decoding to output nmt_model_noatt/output_dev.
  done, num sentences 1553, num translations per inp

  step 4400 lr 0.001 step-time 0.22s wps 20.81K ppl 18.01 gN 6.24 bleu 6.90, Mon Apr  2 05:38:14 2018
  step 4500 lr 0.001 step-time 0.22s wps 20.70K ppl 17.78 gN 6.18 bleu 6.90, Mon Apr  2 05:38:36 2018
  step 4600 lr 0.001 step-time 0.22s wps 20.83K ppl 18.30 gN 6.19 bleu 6.90, Mon Apr  2 05:38:58 2018
  step 4700 lr 0.001 step-time 0.22s wps 20.79K ppl 18.15 gN 6.17 bleu 6.90, Mon Apr  2 05:39:20 2018
  step 4800 lr 0.001 step-time 0.22s wps 20.82K ppl 18.37 gN 6.18 bleu 6.90, Mon Apr  2 05:39:43 2018
  step 4900 lr 0.001 step-time 0.22s wps 20.75K ppl 18.03 gN 6.09 bleu 6.90, Mon Apr  2 05:40:05 2018
  step 5000 lr 0.001 step-time 0.22s wps 20.78K ppl 18.23 gN 6.11 bleu 6.90, Mon Apr  2 05:40:27 2018
# Save eval, global step 5000
INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-5000
  loaded infer model parameters from nmt_model_noatt/translate.ckpt-5000, time 0.09s
  # 781
    src: Và hai ngày sau tôi đến ca trực cấp cứu tiếp theo , và đó là lúc cấp trên củ

INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-7000
  loaded eval model parameters from nmt_model_noatt/translate.ckpt-7000, time 0.10s
  eval dev: perplexity 29.69, time 1s, Mon Apr  2 05:49:41 2018.
  eval test: perplexity 34.04, time 1s, Mon Apr  2 05:49:43 2018.
  step 7100 lr 0.001 step-time 0.22s wps 20.58K ppl 12.40 gN 6.56 bleu 7.70, Mon Apr  2 05:50:05 2018
  step 7200 lr 0.001 step-time 0.22s wps 20.62K ppl 12.54 gN 6.46 bleu 7.70, Mon Apr  2 05:50:28 2018
  step 7300 lr 0.001 step-time 0.22s wps 20.36K ppl 12.48 gN 6.46 bleu 7.70, Mon Apr  2 05:50:49 2018
# Finished an epoch, step 7301. Perform external evaluation
INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-7000
  loaded infer model parameters from nmt_model_noatt/translate.ckpt-7000, time 0.09s
  # 825
    src: Ta không thể tống khứ vấn đề này được .
    ref: We can &apos;t get rid of it .
    nmt: We can &apos;t have to be the same thing .
INFO:tensorflow:Restoring par

  loaded infer model parameters from nmt_model_noatt/translate.ckpt-10000, time 0.09s
  # 101
    src: Chúng tôi kể chuyện cho bà và cam đoan với bà là chúng tôi luôn ở bên bà .
    ref: We told her stories and assured her that we were still with her .
    nmt: We talked to her mother and she left her to her .
INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-10000
  loaded eval model parameters from nmt_model_noatt/translate.ckpt-10000, time 0.10s
  eval dev: perplexity 31.50, time 1s, Mon Apr  2 06:02:40 2018.
  eval test: perplexity 36.12, time 1s, Mon Apr  2 06:02:41 2018.
INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-10000
  loaded infer model parameters from nmt_model_noatt/translate.ckpt-10000, time 0.09s
  # 390
    src: Thay vì chỉ biết quyên góp tiền , chúng tôi có thể giúp được gì ?
    ref: Other than writing a check , what could we do ?
    nmt: So instead of just knowing how much money we could do ?
INFO:tensorflow:Restori

  loaded eval model parameters from nmt_model_noatt/translate.ckpt-12000, time 0.10s
  eval dev: perplexity 32.45, time 1s, Mon Apr  2 06:12:05 2018.
  eval test: perplexity 37.13, time 1s, Mon Apr  2 06:12:06 2018.
INFO:tensorflow:Restoring parameters from nmt_model_noatt/translate.ckpt-12000
  loaded infer model parameters from nmt_model_noatt/translate.ckpt-12000, time 0.09s
# External evaluation, global step 12000
  decoding to output nmt_model_noatt/output_dev.
  done, num sentences 1553, num translations per input 1, time 9s, Mon Apr  2 06:12:15 2018.
  bleu dev: 8.1
  saving hparams to nmt_model_noatt/hparams
# External evaluation, global step 12000
  decoding to output nmt_model_noatt/output_test.
  done, num sentences 1268, num translations per input 1, time 8s, Mon Apr  2 06:12:25 2018.
  bleu test: 7.3
  saving hparams to nmt_model_noatt/hparams
# Final, step 12000 lr 0.000125 step-time 0.22s wps 20.62K ppl 6.10 gN 6.82 dev ppl 32.45, dev bleu 8.1, test ppl 37.13, test bleu 

({'dev_ppl': 32.448353852135405,
  'dev_scores': {'bleu': 8.058157810604447},
  'test_ppl': 37.13180638014357,
  'test_scores': {'bleu': 7.297478762553148}},
 12000)
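
The log above reports `learning_rate=0.001` with `decay_scheme=luong234, start_decay_step=8000, decay_steps 1000, decay_factor 0.5`. The sketch below shows how such a schedule could be computed, assuming staircase decay (the rate is multiplied by `decay_factor` once per `decay_steps` after `start_decay_step`). The function name `decayed_learning_rate` is illustrative. Steps that fall exactly on a boundary are printed with the previous value in the logs (for example, step 10000 shows `lr 0.0005`), so the boundary behaviour here should be treated as an assumption.

```python
def decayed_learning_rate(step, base_lr=0.001, start_decay_step=8000,
                          decay_steps=1000, decay_factor=0.5):
    """Staircase decay: constant until `start_decay_step`, then multiplied
    by `decay_factor` once for every full `decay_steps` that have elapsed."""
    if step < start_decay_step:
        return base_lr
    num_decays = (step - start_decay_step) // decay_steps
    return base_lr * decay_factor ** num_decays


# Compare against the `lr` column of the training log at a few interior steps.
for step in (5000, 9400, 11500):
    print(step, decayed_learning_rate(step))
```

With these settings the rate stays at 0.001 for the first 8000 steps and is halved every 1000 steps afterwards, which is consistent with the `lr 0.0005` and `lr 0.000125` entries that appear later in the logs.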

In [8]:
# Train an LSTM model with dot-product attention
hparams = create_standard_hparams(data_path=os.path.join("datasets", "nmt_data_vi"), 
                                  out_dir="nmt_model_dotprodatt")
hparams.add_hparam("attention_cell_class", LSTMCellWithDotProductAttention)
train(hparams, AttentionalModel)

# Vocab file datasets/nmt_data_vi/vocab.vi exists
# Vocab file datasets/nmt_data_vi/vocab.en exists
# creating train graph ...
  num_layers = 1, num_residual_layers=0
  cell 0  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  cell 0  DropoutWrapper, dropout=0.2   DropoutWrapper  DeviceWrapper, device=/gpu:0
  learning_rate=0.001, warmup_steps=0, warmup_scheme=t2t
  decay_scheme=luong234, start_decay_step=8000, decay_steps 1000, decay_factor 0.5
# Trainable variables
  embeddings/encoder/embedding_encoder:0, (7709, 512), /device:GPU:0
  embeddings/decoder/embedding_decoder:0, (17191, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_dot_product_attention/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_dot_product_attention/bias:0, (2048,), /device:GPU:0
 

    ref: In a hospital system where medical knowledge is doubling every two or three years , we can &apos;t keep up with it .
    nmt: In a hospital where the medical economy is more than two or three years , we can &apos;t even be able to go .
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-2000
  loaded eval model parameters from nmt_model_dotprodatt/translate.ckpt-2000, time 0.09s
  eval dev: perplexity 23.57, time 1s, Mon Apr  2 07:12:10 2018.
  eval test: perplexity 23.55, time 1s, Mon Apr  2 07:12:12 2018.
# Finished an epoch, step 2086. Perform external evaluation
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-2000
  loaded infer model parameters from nmt_model_dotprodatt/translate.ckpt-2000, time 0.08s
  # 538
    src: Nhiều người đã dừng lại , khoảng 60 % khi chúng tôi đưa ra 24 loại mứt , Và khi chỉ có 6 loại , thì chỉ có 40 % .
    ref: More people stopped when there were 24 , about 60 percent , than when there wer

  step 4600 lr 0.001 step-time 0.23s wps 19.54K ppl 9.61 gN 6.30 bleu 14.03, Mon Apr  2 07:24:12 2018
  step 4700 lr 0.001 step-time 0.24s wps 19.60K ppl 9.81 gN 6.53 bleu 14.03, Mon Apr  2 07:24:35 2018
  step 4800 lr 0.001 step-time 0.24s wps 19.69K ppl 9.71 gN 6.34 bleu 14.03, Mon Apr  2 07:24:59 2018
  step 4900 lr 0.001 step-time 0.24s wps 19.47K ppl 9.72 gN 6.31 bleu 14.03, Mon Apr  2 07:25:23 2018
  step 5000 lr 0.001 step-time 0.24s wps 19.65K ppl 9.68 gN 6.21 bleu 14.03, Mon Apr  2 07:25:46 2018
# Save eval, global step 5000
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-5000
  loaded infer model parameters from nmt_model_dotprodatt/translate.ckpt-5000, time 0.08s
  # 535
    src: Họ có hơn 348 loại mứt khác nhau .
    ref: They had 348 different kinds of jam .
    nmt: They have more than <unk> different <unk> .
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-5000
  loaded eval model parameters from nmt_model_dotpro

  step 7300 lr 0.001 step-time 0.23s wps 19.25K ppl 7.51 gN 6.29 bleu 14.61, Mon Apr  2 07:36:31 2018
# Finished an epoch, step 7301. Perform external evaluation
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-7000
  loaded infer model parameters from nmt_model_dotprodatt/translate.ckpt-7000, time 0.08s
  # 140
    src: bà ôm tôi chặt đến mức tôi thấy khó thở rồi sau đó bà để tôi đi
    ref: And she &apos;d squeeze me so tight I could barely breathe and then she &apos;d let me go .
    nmt: She was holding my breath down to me , and then she left me with breath after she left .
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-7000
  loaded infer model parameters from nmt_model_dotprodatt/translate.ckpt-7000, time 0.08s
# External evaluation, global step 7000
  decoding to output nmt_model_dotprodatt/output_dev.
  done, num sentences 1553, num translations per input 1, time 9s, Mon Apr  2 07:36:41 2018.
  bleu dev: 14.4
  saving

    ref: The other thing is we know where all the gas stations are .
    nmt: Another thing is we know , all of the gas stations .
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-10000
  loaded eval model parameters from nmt_model_dotprodatt/translate.ckpt-10000, time 0.10s
  eval dev: perplexity 16.40, time 1s, Mon Apr  2 07:49:00 2018.
  eval test: perplexity 15.77, time 1s, Mon Apr  2 07:49:01 2018.
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-10000
  loaded infer model parameters from nmt_model_dotprodatt/translate.ckpt-10000, time 0.08s
  # 826
    src: Chúng ta có những định kiến dựa vào kinh nghiệm sẵn có , ví dụ tôi có thể chấp nhận là một bệnh nhân đau ngực có tiền sử bệnh hoàn hảo .
    ref: We have our cognitive biases , so that I can take a perfect history on a patient with chest pain .
    nmt: We have these stereotypes that are available , for example , to accept that patient &apos;s <unk> is perfect , and I c

  eval test: perplexity 16.34, time 1s, Mon Apr  2 07:58:37 2018.
INFO:tensorflow:Restoring parameters from nmt_model_dotprodatt/translate.ckpt-12000
  loaded infer model parameters from nmt_model_dotprodatt/translate.ckpt-12000, time 0.08s
# External evaluation, global step 12000
  decoding to output nmt_model_dotprodatt/output_dev.
  done, num sentences 1553, num translations per input 1, time 9s, Mon Apr  2 07:58:46 2018.
  bleu dev: 14.6
  saving hparams to nmt_model_dotprodatt/hparams
# External evaluation, global step 12000
  decoding to output nmt_model_dotprodatt/output_test.
  done, num sentences 1268, num translations per input 1, time 8s, Mon Apr  2 07:58:55 2018.
  bleu test: 15.2
  saving hparams to nmt_model_dotprodatt/hparams
# Final, step 12000 lr 0.000125 step-time 0.24s wps 19.59K ppl 4.33 gN 6.07 dev ppl 16.95, dev bleu 14.6, test ppl 16.34, test bleu 15.2, Mon Apr  2 07:58:56 2018
# Done training!, time 3328s, Mon Apr  2 07:58:56 2018.
# Start evaluating saved best 

({'dev_ppl': 16.94916214271354,
  'dev_scores': {'bleu': 14.60982733339068},
  'test_ppl': 16.34162359243638,
  'test_scores': {'bleu': 15.20867011933523}},
 12000)
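
The `ppl` and `perplexity` values in these logs can be read as the exponential of the average per-token cross-entropy. The sketch below illustrates that definition in plain Python with made-up token probabilities; the `perplexity` helper is an illustration and not the package's own code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log probabilities that the
    model assigned to each of the N reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# Illustrative input: the model gives probability 1/4 to every reference token.
uniform_guess = [math.log(0.25)] * 10
print(perplexity(uniform_guess))
```

A model that assigns probability 1/4 to every reference token has perplexity 4 (up to floating-point rounding), so a dev perplexity near 17 can be thought of as the model being about as uncertain as a uniform choice among 17 words at each position.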

In [12]:
# Train an LSTM model with bilinear attention
hparams = create_standard_hparams(data_path=os.path.join("datasets", "nmt_data_vi"),
                                  out_dir="nmt_model_bilinearatt")
hparams.add_hparam("attention_cell_class", LSTMCellWithBilinearAttention)
train(hparams, AttentionalModel)

# Vocab file datasets/nmt_data_vi/vocab.vi exists
# Vocab file datasets/nmt_data_vi/vocab.en exists
# creating train graph ...
  num_layers = 1, num_residual_layers=0
  cell 0  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  cell 0  DropoutWrapper, dropout=0.2   DropoutWrapper  DeviceWrapper, device=/gpu:0
  learning_rate=0.001, warmup_steps=0, warmup_scheme=t2t
  decay_scheme=luong234, start_decay_step=8000, decay_steps 1000, decay_factor 0.5
# Trainable variables
  embeddings/encoder/embedding_encoder:0, (7709, 512), /device:GPU:0
  embeddings/decoder/embedding_decoder:0, (17191, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_bilinear_attention/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_bilinear_attention/bias:0, (2048,), /device:GPU:0
  dynam

  loaded infer model parameters from nmt_model_bilinearatt/translate.ckpt-2000, time 0.08s
  # 1437
    src: Giống như viên gạch bê tông , transistor cho phép bạn xây những mạch điện lớn và phức tạp hơn , từng viên gạch một .
    ref: Like the concrete block , the transistor allows you to build much larger , more complex circuits , one brick at a time .
    nmt: Like the <unk> , the <unk> for you , the <unk> you can build the <unk> and more complex .
INFO:tensorflow:Restoring parameters from nmt_model_bilinearatt/translate.ckpt-2000
  loaded eval model parameters from nmt_model_bilinearatt/translate.ckpt-2000, time 0.09s
  eval dev: perplexity 24.88, time 2s, Mon Apr  2 08:56:40 2018.
  eval test: perplexity 25.35, time 2s, Mon Apr  2 08:56:43 2018.
# Finished an epoch, step 2086. Perform external evaluation
INFO:tensorflow:Restoring parameters from nmt_model_bilinearatt/translate.ckpt-2000
  loaded infer model parameters from nmt_model_bilinearatt/translate.ckpt-2000, time 0.08s
  # 1

  step 4300 lr 0.001 step-time 0.39s wps 11.75K ppl 9.76 gN 6.94 bleu 14.81, Mon Apr  2 09:13:35 2018
  step 4400 lr 0.001 step-time 0.39s wps 11.74K ppl 9.72 gN 6.94 bleu 14.81, Mon Apr  2 09:14:14 2018
  step 4500 lr 0.001 step-time 0.39s wps 11.78K ppl 9.92 gN 7.11 bleu 14.81, Mon Apr  2 09:14:53 2018
  step 4600 lr 0.001 step-time 0.39s wps 11.77K ppl 10.01 gN 7.10 bleu 14.81, Mon Apr  2 09:15:32 2018
  step 4700 lr 0.001 step-time 0.39s wps 11.77K ppl 9.94 gN 7.14 bleu 14.81, Mon Apr  2 09:16:11 2018
  step 4800 lr 0.001 step-time 0.39s wps 11.75K ppl 10.16 gN 6.96 bleu 14.81, Mon Apr  2 09:16:51 2018
  step 4900 lr 0.001 step-time 0.39s wps 11.72K ppl 9.95 gN 7.03 bleu 14.81, Mon Apr  2 09:17:30 2018
  step 5000 lr 0.001 step-time 0.40s wps 11.63K ppl 10.25 gN 7.01 bleu 14.81, Mon Apr  2 09:18:10 2018
# Save eval, global step 5000
INFO:tensorflow:Restoring parameters from nmt_model_bilinearatt/translate.ckpt-5000
  loaded infer model parameters from nmt_model_bilinearatt/translat

  step 6400 lr 0.001 step-time 0.39s wps 11.73K ppl 6.86 gN 6.71 bleu 15.54, Mon Apr  2 09:29:02 2018
  step 6500 lr 0.001 step-time 0.40s wps 11.72K ppl 7.14 gN 6.97 bleu 15.54, Mon Apr  2 09:29:41 2018
  step 6600 lr 0.001 step-time 0.39s wps 11.67K ppl 7.22 gN 6.95 bleu 15.54, Mon Apr  2 09:30:21 2018
  step 6700 lr 0.001 step-time 0.39s wps 11.72K ppl 7.39 gN 8.22 bleu 15.54, Mon Apr  2 09:31:00 2018
  step 6800 lr 0.001 step-time 0.39s wps 11.68K ppl 7.36 gN 7.09 bleu 15.54, Mon Apr  2 09:31:39 2018
  step 6900 lr 0.001 step-time 0.40s wps 11.66K ppl 7.47 gN 7.07 bleu 15.54, Mon Apr  2 09:32:19 2018
  step 7000 lr 0.001 step-time 0.40s wps 11.69K ppl 7.48 gN 7.03 bleu 15.54, Mon Apr  2 09:32:59 2018
# Save eval, global step 7000
INFO:tensorflow:Restoring parameters from nmt_model_bilinearatt/translate.ckpt-7000
  loaded infer model parameters from nmt_model_bilinearatt/translate.ckpt-7000, time 0.08s
  # 1404
    src: Khó có thể nói làm thế nào hình ảnh được tạo ra
    ref: Make i

INFO:tensorflow:Restoring parameters from nmt_model_bilinearatt/translate.ckpt-9000
  loaded infer model parameters from nmt_model_bilinearatt/translate.ckpt-9000, time 0.08s
# External evaluation, global step 9000
  decoding to output nmt_model_bilinearatt/output_dev.
  done, num sentences 1553, num translations per input 1, time 10s, Mon Apr  2 09:50:07 2018.
  bleu dev: 15.5
  saving hparams to nmt_model_bilinearatt/hparams
# External evaluation, global step 9000
  decoding to output nmt_model_bilinearatt/output_test.
  done, num sentences 1268, num translations per input 1, time 9s, Mon Apr  2 09:50:17 2018.
  bleu test: 16.0
  saving hparams to nmt_model_bilinearatt/hparams
  step 9400 lr 0.0005 step-time 0.50s wps 8.93K ppl 5.60 gN 6.75 bleu 15.54, Mon Apr  2 09:50:34 2018
  step 9500 lr 0.0005 step-time 0.40s wps 11.57K ppl 4.68 gN 6.35 bleu 15.54, Mon Apr  2 09:51:14 2018
  step 9600 lr 0.0005 step-time 0.39s wps 11.78K ppl 4.79 gN 6.56 bleu 15.54, Mon Apr  2 09:51:53 2018
  st

  done, num sentences 1553, num translations per input 1, time 10s, Mon Apr  2 10:05:21 2018.
  bleu dev: 15.8
  saving hparams to nmt_model_bilinearatt/hparams
# External evaluation, global step 11000
  decoding to output nmt_model_bilinearatt/output_test.
  done, num sentences 1268, num translations per input 1, time 9s, Mon Apr  2 10:05:31 2018.
  bleu test: 16.6
  saving hparams to nmt_model_bilinearatt/hparams
  step 11500 lr 0.000125 step-time 0.50s wps 8.89K ppl 4.33 gN 6.56 bleu 15.83, Mon Apr  2 10:05:55 2018
  step 11600 lr 0.000125 step-time 0.39s wps 11.74K ppl 4.17 gN 6.32 bleu 15.83, Mon Apr  2 10:06:34 2018
  step 11700 lr 0.000125 step-time 0.39s wps 11.85K ppl 4.23 gN 6.49 bleu 15.83, Mon Apr  2 10:07:13 2018
  step 11800 lr 0.000125 step-time 0.39s wps 11.82K ppl 4.16 gN 6.39 bleu 15.83, Mon Apr  2 10:07:52 2018
  step 11900 lr 0.000125 step-time 0.39s wps 11.82K ppl 4.14 gN 6.44 bleu 15.83, Mon Apr  2 10:08:31 2018
  step 12000 lr 0.000125 step-time 0.39s wps 11.85K 

({'dev_ppl': 17.442222185685058,
  'dev_scores': {'bleu': 15.35520479604916},
  'test_ppl': 16.691589883678144,
  'test_scores': {'bleu': 16.070008518562382}},
 12000)

In [14]:
# Train an LSTM model with feedforward attention
hparams = create_standard_hparams(data_path=os.path.join("datasets", "nmt_data_vi"), 
                                  out_dir="nmt_model_ffatt")
hparams.add_hparam("attention_cell_class", LSTMCellWithFeedForwardAttention)
train(hparams, AttentionalModel)

# Vocab file datasets/nmt_data_vi/vocab.vi exists
# Vocab file datasets/nmt_data_vi/vocab.en exists
# creating train graph ...
  num_layers = 1, num_residual_layers=0
  cell 0  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  cell 0  DropoutWrapper, dropout=0.2   DropoutWrapper  DeviceWrapper, device=/gpu:0
  learning_rate=0.001, warmup_steps=0, warmup_scheme=t2t
  decay_scheme=luong234, start_decay_step=8000, decay_steps 1000, decay_factor 0.5
# Trainable variables
  embeddings/encoder/embedding_encoder:0, (7709, 512), /device:GPU:0
  embeddings/decoder/embedding_decoder:0, (17191, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/basic_lstm_cell/bias:0, (2048,), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_feed_forward_attention/kernel:0, (1024, 2048), /device:GPU:0
  dynamic_seq2seq/decoder/lstm_cell_with_feed_forward_attention/bias:0, (2048,), /device:GPU:0

  step 2000 lr 0.001 step-time 0.71s wps 6.45K ppl 20.48 gN 6.69 bleu 7.25, Mon Apr  2 10:36:08 2018
# Save eval, global step 2000
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-2000
  loaded infer model parameters from nmt_model_ffatt/translate.ckpt-2000, time 0.08s
  # 1473
    src: Và chúng tôi muốn vật liệu này tiếp cận được với mọi người .
    ref: And we want to make this material accessible to everyone .
    nmt: And we want this material to be able to get people .
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-2000
  loaded eval model parameters from nmt_model_ffatt/translate.ckpt-2000, time 0.10s
  eval dev: perplexity 21.64, time 5s, Mon Apr  2 10:36:14 2018.
  eval test: perplexity 22.24, time 5s, Mon Apr  2 10:36:20 2018.
# Finished an epoch, step 2086. Perform external evaluation
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-2000
  loaded infer model parameters from nmt_model_ffatt/translate.ckp

  step 4600 lr 0.001 step-time 0.72s wps 6.46K ppl 9.22 gN 6.65 bleu 13.56, Mon Apr  2 11:09:53 2018
  step 4700 lr 0.001 step-time 0.71s wps 6.46K ppl 9.28 gN 6.83 bleu 13.56, Mon Apr  2 11:11:04 2018
  step 4800 lr 0.001 step-time 0.72s wps 6.46K ppl 9.30 gN 6.77 bleu 13.56, Mon Apr  2 11:12:16 2018
  step 4900 lr 0.001 step-time 0.71s wps 6.47K ppl 9.16 gN 6.59 bleu 13.56, Mon Apr  2 11:13:27 2018
  step 5000 lr 0.001 step-time 0.71s wps 6.48K ppl 9.35 gN 6.80 bleu 13.56, Mon Apr  2 11:14:38 2018
# Save eval, global step 5000
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-5000
  loaded infer model parameters from nmt_model_ffatt/translate.ckpt-5000, time 0.08s
  # 609
    src: Nếu tôi chỉ ra cho bạn 600 loại tạp chí Và tôi chia nó ra làm 10 loại so với khi tôi chỉ cho bạn 400 tạp chí và chia nó ra thành 20 loại Bạn tin rằng tôi đã đưa cho bạn nhiều sự lựa chọn và những trải nghiệm lựa chọn tốt hơn nếu tôi cho bạn 400 hơn là tôi chỉ cho bạn 600
    ref: If I

  loaded infer model parameters from nmt_model_ffatt/translate.ckpt-7000, time 0.08s
  # 326
    src: và vì thế , con người chúng ta có phẩm giá cơ bản phải được luật pháp bảo vệ .
    ref: And because of that there &apos;s this basic human dignity that must be respected by law .
    nmt: And so , as humans , we have the basic dignity of protecting the protection .
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-7000
  loaded eval model parameters from nmt_model_ffatt/translate.ckpt-7000, time 0.10s
  eval dev: perplexity 14.89, time 5s, Mon Apr  2 11:41:05 2018.
  eval test: perplexity 14.72, time 5s, Mon Apr  2 11:41:11 2018.
  step 7100 lr 0.001 step-time 0.71s wps 6.46K ppl 7.07 gN 6.75 bleu 14.20, Mon Apr  2 11:42:22 2018
  step 7200 lr 0.001 step-time 0.72s wps 6.46K ppl 7.17 gN 6.80 bleu 14.20, Mon Apr  2 11:43:34 2018
  step 7300 lr 0.001 step-time 0.69s wps 6.43K ppl 7.11 gN 6.84 bleu 14.20, Mon Apr  2 11:44:43 2018
# Finished an epoch, step 7301. Perf

  step 10000 lr 0.0005 step-time 0.72s wps 6.45K ppl 4.59 gN 6.64 bleu 14.27, Mon Apr  2 12:19:38 2018
# Save eval, global step 10000
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-10000
  loaded infer model parameters from nmt_model_ffatt/translate.ckpt-10000, time 0.08s
  # 341
    src: Quan toà chứng nhận đó là một người trưởng thành , nhưng tôi thấy cậu ấy vẫn còn là một đứa trẻ
    ref: And the judge has certified him as an adult , but I see this kid .
    nmt: The judge was an adult , but I saw him still as a child .
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-10000
  loaded eval model parameters from nmt_model_ffatt/translate.ckpt-10000, time 0.09s
  eval dev: perplexity 15.89, time 5s, Mon Apr  2 12:19:44 2018.
  eval test: perplexity 15.94, time 5s, Mon Apr  2 12:19:50 2018.
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-10000
  loaded infer model parameters from nmt_model_ffatt/translate.ckpt-100

  eval dev: perplexity 16.20, time 5s, Mon Apr  2 12:46:06 2018.
  eval test: perplexity 15.93, time 5s, Mon Apr  2 12:46:12 2018.
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-12000
  loaded infer model parameters from nmt_model_ffatt/translate.ckpt-12000, time 0.08s
  # 476
    src: Và điều đó , với mức độ rộng lớn hơn , là cái chúng ta muốn làm hiện nay .
    ref: And that , to a large extent , is what we want to do now .
    nmt: And that , for a greater level , is what we want to do today .
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-12000
  loaded eval model parameters from nmt_model_ffatt/translate.ckpt-12000, time 0.10s
  eval dev: perplexity 16.20, time 5s, Mon Apr  2 12:46:20 2018.
  eval test: perplexity 15.93, time 5s, Mon Apr  2 12:46:26 2018.
INFO:tensorflow:Restoring parameters from nmt_model_ffatt/translate.ckpt-12000
  loaded infer model parameters from nmt_model_ffatt/translate.ckpt-12000, time 0.08s
# External e

({'dev_ppl': 16.2045087774781,
  'dev_scores': {'bleu': 13.786509957677342},
  'test_ppl': 15.926899547146027,
  'test_scores': {'bleu': 13.457804929373632}},
 12000)
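
To compare the four runs side by side, the final metrics returned by each `train(...)` call can be collected into a small table. The sketch below copies the values from the outputs above (rounded to two decimals) and adds an approximate step time read from the `step-time` column of each log; the names `RESULTS` and `format_results` are illustrative.

```python
# (model, dev_ppl, dev_bleu, test_ppl, test_bleu, approx. step time in seconds)
# Metrics are rounded from the tuples returned by train(); step times are
# typical values read from the training logs.
RESULTS = [
    ("no attention", 32.45, 8.06, 37.13, 7.30, 0.22),
    ("dot-product", 16.95, 14.61, 16.34, 15.21, 0.24),
    ("bilinear", 17.44, 15.36, 16.69, 16.07, 0.39),
    ("feedforward", 16.20, 13.79, 15.93, 13.46, 0.71),
]


def format_results(results):
    """Return a fixed-width text table with one header line and one row per model."""
    header = "{:<14}{:>9}{:>10}{:>10}{:>11}{:>12}".format(
        "model", "dev ppl", "dev BLEU", "test ppl", "test BLEU", "step (s)")
    row_format = "{:<14}{:>9.2f}{:>10.2f}{:>10.2f}{:>11.2f}{:>12.2f}"
    lines = [header]
    for row in results:
        lines.append(row_format.format(*row))
    return "\n".join(lines)


# Sort by test BLEU, best first.
print(format_results(sorted(RESULTS, key=lambda r: r[4], reverse=True)))
```

In these runs, all three attention variants roughly halve the test perplexity and roughly double the test BLEU of the no-attention baseline, while the step time grows from about 0.22s to about 0.71s.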