# Transformer Architecture and Machine Translation with the Transformer

In this notebook, you will learn how the Transformer architecture introduced in [Vaswani et al., 2017] works. You will then load a pretrained Transformer model and translate a few sentences yourself using beam search with the `BeamSearchSampler`.

## Preparation

We start with some usual preparation such as importing libraries and setting the environment.


In [None]:
import random
import math

import numpy as np
import mxnet as mx
from mxnet import gluon, nd
from mxnet.gluon import nn
import gluonnlp as nlp
import d2l

np.random.seed(100)
random.seed(100)
mx.random.seed(10000)
ctx = d2l.try_gpu(0)

mx.set_np_shape(True)

import warnings
warnings.filterwarnings('ignore')

# Attention Mechanism


In :numref:`chapter_seq2seq`, we encode the source sequence input information in the recurrent unit state and then pass it to the decoder to generate the target sequence. A token in the target sequence may closely relate to some tokens in the source sequence instead of the whole source sequence. For example, when translating "Hello world." to "Bonjour le monde.", "Bonjour" maps to "Hello" and "monde" maps to "world". In the seq2seq model, the decoder may implicitly select the corresponding information from the state passed by the encoder. The attention mechanism, however, makes this selection explicit.

Attention is a generalized pooling method with bias alignment over inputs. The core component of the attention mechanism is the attention layer, or simply attention. An input of the attention layer is called a query. For a query, the attention layer returns an output based on its memory, which is a set of key-value pairs. To be more specific, assume a query $\mathbf{q}\in\mathbb R^{d_q}$, and a memory that contains $n$ key-value pairs, $(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_n, \mathbf{v}_n)$, with $\mathbf{k}_i\in\mathbb R^{d_k}$, $\mathbf{v}_i\in\mathbb R^{d_v}$. The attention layer then returns an output $\mathbf o\in\mathbb R^{d_v}$ with the same shape as a value.

<center>
<img src="../img/attention.svg" width="33%"/>
</center>



To compute the output, we first assume there is a score function $\alpha$ which measures the similarity between the query and a key. Then we compute all $n$ scores $a_1, \ldots, a_n$ by

$$a_i = \alpha(\mathbf q, \mathbf k_i).$$

Next we use softmax to obtain the attention weights

$$b_1, \ldots, b_n = \textrm{softmax}(a_1, \ldots, a_n).$$

The output is then a weighted sum of the values

$$\mathbf o = \sum_{i=1}^n b_i \mathbf v_i.$$
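The three steps above (scores, softmax, weighted sum) can be sketched for a single query in plain NumPy. This is only an illustrative sketch: the helper names `softmax` and `attend`, the toy data, and the use of a plain inner product as the score function are assumptions made for this example and are not part of GluonNLP.

```python
import numpy as np

def softmax(a):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(a - a.max())
    return e / e.sum()

def attend(q, K, V, score_fn):
    """Return the attention output and the weights for a single query."""
    a = np.array([score_fn(q, k) for k in K])  # scores a_1, ..., a_n
    b = softmax(a)                             # attention weights b_1, ..., b_n
    return b @ V, b                            # o = sum_i b_i v_i

# Toy memory with n = 3 key-value pairs, d_k = 2 and d_v = 2.
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

o, b = attend(q, K, V, score_fn=np.dot)
print('weights:', b)
print('output: ', o)
```

The first and third keys are identical, so they receive the same weight, and both weigh more than the second key, which is orthogonal to the query.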

Different choices of the score function lead to different attention layers. We will discuss two commonly used attention layers in the rest of this section. Before diving into the implementation, we first introduce a masked version of the softmax operator. The implementation also relies on the batched dot-product operator `nd.batch_dot`.

## Enforcing Causality: Masked Softmax

<center>
<img src="../img/causal-attention.png" width="40%"/>
</center>

The masked softmax enables enforcing causality when computing attention weights.
It takes the attention scores and a mask as input and filters out masked scores when computing the attention weights.

In [None]:
attention_scores = nd.random.uniform(shape=(2,2,4))
mask = nd.ones(shape=(2,2,4))
mask[0, :, 2:] = 0
mask[1, :, 3:] = 0
nlp.model.attention_cell._masked_softmax(nd, attention_scores, mask, attention_scores.dtype)

In this example, the batch holds two samples, each with 2 queries and 4 keys.
Through `mask` we specify that the first sample attends only to the first 2 keys, while the second sample attends to the first 3 keys. Masked positions receive an attention weight of zero, and the remaining weights in each row still sum to one.
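To make the masking step concrete, here is a minimal NumPy sketch of a masked softmax. It is not GluonNLP's implementation: the function name `masked_softmax` and the choice of `-1e18` as the fill value are assumptions for this illustration.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis that ignores positions where mask == 0."""
    # Replace masked scores by a very negative number so that exp() gives 0.
    masked = np.where(mask > 0, scores, -1e18)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Multiply by the mask so that masked positions are exactly zero.
    return weights * mask

rng = np.random.RandomState(0)
scores = rng.uniform(size=(2, 2, 4))
mask = np.ones((2, 2, 4))
mask[0, :, 2:] = 0   # first sample: only the first 2 keys are visible
mask[1, :, 3:] = 0   # second sample: only the first 3 keys are visible

weights = masked_softmax(scores, mask)
print(weights)
```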

## Computing Attention Weights: Dot Product Attention

The dot product assumes the query has the same dimension as the keys, namely $\mathbf q, \mathbf k_i \in\mathbb R^d$ for all $i$. It computes the score by an inner product between the query and a key, often then divided by $\sqrt{d}$ to make the scores less sensitive to the dimension $d$. In other words,

$$\alpha(\mathbf q, \mathbf k) = \langle \mathbf q, \mathbf k \rangle /\sqrt{d}.$$

Assume $\mathbf Q\in\mathbb R^{m\times d}$ contains $m$ queries and $\mathbf K\in\mathbb R^{n\times d}$ has all $n$ keys. We can compute all $mn$ scores by

$$\alpha(\mathbf Q, \mathbf K) = \mathbf Q \mathbf K^T /\sqrt{d}.$$
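As a quick sanity check of the matrix form, the following NumPy sketch computes all $mn$ scores with one matrix product and compares them with the scores computed pair by pair. The sizes and random inputs are arbitrary values chosen for this illustration.

```python
import math
import numpy as np

rng = np.random.RandomState(0)
m, n, d = 3, 4, 8                  # illustrative sizes
Q = rng.normal(size=(m, d))        # m queries
K = rng.normal(size=(n, d))        # n keys

# All m * n scaled dot-product scores at once, shape (m, n).
scores = Q @ K.T / math.sqrt(d)

# The same scores, computed one (query, key) pair at a time.
loop_scores = np.array(
    [[np.dot(q, k) / math.sqrt(d) for k in K] for q in Q])

print(scores.shape, np.allclose(scores, loop_scores))
```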

Now let's implement this layer that supports a batch of queries and key-value pairs. In addition, it supports randomly dropping some attention weights as a regularization.

In [None]:
class DotProductAttention(nn.Block): 
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, #queries, d)
    # key: (batch_size, #kv_pairs, d)
    # value: (batch_size, #kv_pairs, dim_v)
    # mask: (batch_size, #queries, #kv_pairs)
    def forward(self, query, key, value, mask=None):
        d = query.shape[-1]
        # set transpose_b=True to swap the last two dimensions of key
        scores = nd.batch_dot(query, key, transpose_b=True) / math.sqrt(d)
        attention_weights = nlp.model.attention_cell._masked_softmax(mx.nd, scores, mask, scores.dtype)
        attention_weights = self.dropout(attention_weights)
        return nd.batch_dot(attention_weights, value)

In [None]:
atten = DotProductAttention(dropout=0.5)
atten.initialize()
X = nd.broadcast_axis(nd.arange(5).reshape((1,5,1)), axis=0, size=2)
mask = nd.ones(shape=(2,5,5))
mask[0, :, 2:] = 0
mask[1, :, 4:] = 0
print(atten(X, X, X, mask))

This layer is available in GluonNLP as `nlp.model.DotProductAttentionCell`.

## Multi-Head Attention

<center>
<img src="../img/multi-head-attention.svg" width="50%"/>
</center>

A multi-head attention layer consists of $h$ parallel attention layers, each of which is called a head. For each head, we use three dense layers with hidden sizes $p_q$, $p_k$ and $p_v$ to project the queries, keys and values, respectively, before feeding them into the attention layer. The outputs of these $h$ heads are concatenated and then projected by another dense layer.

To be more specific, assume we have the learnable parameters
$\mathbf W_q^{(i)}\in\mathbb R^{p_q\times d_q}$,
$\mathbf W_k^{(i)}\in\mathbb R^{p_k\times d_k}$,
and $\mathbf W_v^{(i)}\in\mathbb R^{p_v\times d_v}$,
 for $i=1,\ldots,h$, and $\mathbf W_o\in\mathbb R^{d_o\times h p_v}$. Then the output for each head can be obtained by

$$\mathbf o^{(i)} = \textrm{attention}(\mathbf W_q^{(i)}\mathbf q, \mathbf W_k^{(i)}\mathbf k,\mathbf W_v^{(i)}\mathbf v),$$

where $\text{attention}$ can be any attention layer introduced before. Since we already have learnable parameters, the simple dot product attention is used.

Then we concatenate all outputs and project them to obtain the multi-head attention output

$$\mathbf o = \mathbf W_o \begin{bmatrix}\mathbf o^{(1)}\\\vdots\\\mathbf o^{(h)}\end{bmatrix}.$$

In practice, we often use $p_q=p_k=p_v=d_o/h$. The hyper-parameters for a multi-head attention layer are therefore the number of heads $h$ and the output feature size $d_o$.

In [None]:
class MultiHeadAttention(nn.Block):
    def __init__(self, units, num_heads, dropout, **kwargs):  # units = d_o
        super(MultiHeadAttention, self).__init__(**kwargs)
        assert units % num_heads == 0
        self.num_heads = num_heads
        self.attention = DotProductAttention(dropout)
        self.W_q = nn.Dense(units, use_bias=False, flatten=False)
        self.W_k = nn.Dense(units, use_bias=False, flatten=False)
        self.W_v = nn.Dense(units, use_bias=False, flatten=False)
        # Output projection W_o, applied to the concatenated heads
        self.W_o = nn.Dense(units, use_bias=False, flatten=False)

    # query, key, and value shape: (batch_size, num_items, dim)
    # mask shape is (batch_size, query_length, memory_length)
    def forward(self, query, key, value, mask):
        # Project and transpose from (batch_size, num_items, units) to
        # (batch_size * num_heads, num_items, p), where units = p * num_heads.
        query, key, value = [transpose_qkv(X, self.num_heads) for X in (
            self.W_q(query), self.W_k(key), self.W_v(value))]
        if mask is not None:
            # Replicate mask for each of the num_heads heads
            mask = nd.broadcast_axis(nd.expand_dims(mask, axis=1),
                                    axis=1, size=self.num_heads)\
                    .reshape(shape=(-1, 0, 0), reverse=True)
        output = self.attention(query, key, value, mask)
        # Transpose from (batch_size * num_heads, num_items, p) back to
        # (batch_size, num_items, units), then apply the output projection
        return self.W_o(transpose_output(output, self.num_heads))

Here are the definitions of the transpose functions.

In [None]:
def transpose_qkv(X, num_heads):
    # Shape after reshape: (batch_size, num_items, num_heads, p)
    # 0 means copying the shape element, -1 means inferring its value
    X = X.reshape((0, 0, num_heads, -1))
    # Swap the num_items and the num_heads dimensions
    X = X.transpose((0, 2, 1, 3))
    # Merge the first two dimensions. Use reverse=True to infer
    # shape from right to left
    return X.reshape((-1, 0, 0), reverse=True)

def transpose_output(X, num_heads):
    # A reversed version of transpose_qkv
    X = X.reshape((-1, num_heads, 0, 0), reverse=True)
    X = X.transpose((0, 2, 1, 3))
    return X.reshape((0, 0, -1))
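The reshape codes `0`, `-1` and `reverse=True` are specific to MXNet, so it can help to see the same shape manipulation with explicit sizes. The NumPy sketch below mirrors the two functions above; the names `split_heads` and `merge_heads` are hypothetical and only used in this illustration.

```python
import numpy as np

def split_heads(X, num_heads):
    # (batch_size, num_items, units) -> (batch_size * num_heads, num_items, p)
    batch_size, num_items, units = X.shape
    X = X.reshape(batch_size, num_items, num_heads, units // num_heads)
    X = X.transpose(0, 2, 1, 3)   # (batch_size, num_heads, num_items, p)
    return X.reshape(batch_size * num_heads, num_items, -1)

def merge_heads(X, num_heads):
    # (batch_size * num_heads, num_items, p) -> (batch_size, num_items, units)
    _, num_items, p = X.shape
    X = X.reshape(-1, num_heads, num_items, p)
    X = X.transpose(0, 2, 1, 3)   # (batch_size, num_items, num_heads, p)
    return X.reshape(X.shape[0], num_items, num_heads * p)

X_demo = np.arange(2 * 5 * 6, dtype=float).reshape(2, 5, 6)
H = split_heads(X_demo, num_heads=3)
print(H.shape)                                            # (6, 5, 2)
print(np.array_equal(merge_heads(H, num_heads=3), X_demo))  # True
```

After splitting, the heads of the first sample come first, followed by the heads of the second sample. Each head sees a contiguous slice of the feature dimension, e.g. `H[1]` holds features 2 and 3 of the first sample.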

Create a multi-head attention layer with output size $d_o$ equal to 100. The output shares the same batch size and sequence length as the input, but its last dimension is equal to $d_o$.

In [None]:
cell = MultiHeadAttention(units=100, num_heads=10, dropout=0.5)
cell.initialize()
cell(X, X, X, mask).shape

This layer is available in GluonNLP as `nlp.model.MultiHeadAttentionCell`.

# Transformer Architecture

The Transformer model is also based on the encoder-decoder architecture. It
differs from the seq2seq model, however, in that the transformer replaces the
recurrent layers in seq2seq with attention layers. To deal with sequential
inputs, each item in the sequence is used as the query, the key and the
value (see the self-attention figure in the Transformer Encoder section). The
layer therefore outputs a sequence of the same length as its input. We call
such an attention layer a self-attention layer.


<!-- Compared to a recurrent layer, output items of a self-attention layer can be computed in parallel and, therefore, it is easy to obtain a high-efficient implementation. -->

The transformer architecture, with a comparison to the seq2seq model with
attention, is shown in the figure below. The two models are similar
overall: the source sequence embeddings are fed into $n$ repeated
blocks. The outputs of the last block are then used as attention memory for the
decoder. The target sequence embeddings are similarly fed into $n$ repeated
blocks in the decoder, and the final outputs are obtained by applying a dense
layer with vocabulary size to the last block's outputs.

<center>
<img src="../img/transformer.png" width="37%"/>
</center>

It can also be seen that the transformer differs from the seq2seq model with attention in three major places:

1. A recurrent layer in seq2seq is replaced with a transformer block. For the encoder, this block contains a self-attention layer (multi-head attention) and a network with two dense layers (position-wise FFN). For the decoder, another multi-head attention layer is used to take the encoder state.
1. The encoder state is passed to every transformer block in the decoder, instead of being used as an additional input of the first recurrent layer as in seq2seq.
1. Since the self-attention layer does not distinguish the item order in a sequence, a positional encoding layer is used to add sequential information into each sequence item.

In the rest of this section, we will explain every new layer introduced by the transformer, and then load a pretrained model to translate a few sentences.

## Position-wise Feed-Forward Networks

The position-wise feed-forward network accepts a 3-dimensional input with shape (batch size, sequence length, feature size). It consists of two dense layers that apply to the last dimension, which means the same dense layers are used for each position in the sequence, hence the name position-wise.

In [None]:
class PositionWiseFFN(nn.Block):
    def __init__(self, units, hidden_size, **kwargs):
        super(PositionWiseFFN, self).__init__(**kwargs)
        # The same dense layers are applied at every position in the sequence, hence position-wise.
        self.ffn_1 = nn.Dense(hidden_size, flatten=False, activation='relu')
        self.ffn_2 = nn.Dense(units, flatten=False)

    def forward(self, X):
        # shape of X: (batch size, sequence length, feature size)
        return self.ffn_2(self.ffn_1(X))

Similar to the multi-head attention, the position-wise feed-forward network only changes the last dimension size of the input. In addition, if two items in the input sequence are identical, the corresponding outputs will be identical as well.

In [None]:
ffn = PositionWiseFFN(4, 8)
ffn.initialize()
ffn(nd.ones((2, 3, 4)))
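The claim that identical input items give identical outputs can be checked with a small NumPy sketch. The random weights and the function name `position_wise_ffn` are assumptions for this illustration and stand in for the two `nn.Dense` layers above.

```python
import numpy as np

rng = np.random.RandomState(0)
units, hidden_size = 4, 8
W1, b1 = rng.normal(size=(units, hidden_size)), np.zeros(hidden_size)
W2, b2 = rng.normal(size=(hidden_size, units)), np.zeros(units)

def position_wise_ffn(X):
    # The matrix product broadcasts over the leading (batch, position) axes,
    # so every position goes through the same two dense layers.
    hidden = np.maximum(X @ W1 + b1, 0.0)   # dense layer followed by ReLU
    return hidden @ W2 + b2

X_demo = rng.normal(size=(2, 3, units))
X_demo[0, 2] = X_demo[0, 0]                 # make two positions identical
Y_demo = position_wise_ffn(X_demo)

print(Y_demo.shape)
print(np.allclose(Y_demo[0, 0], Y_demo[0, 2]))
```

Applying the function to a single position on its own, e.g. `position_wise_ffn(X_demo[1, 1])`, gives the same vector as the corresponding row of the batched output.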

## Add and Norm

The input and the output of a multi-head attention layer or a position-wise feed-forward network are combined by a block that contains a residual structure and a layer normalization layer.

Layer normalization is similar to batch normalization, but the mean and variance are calculated along the last dimension, e.g. `X.mean(axis=-1)`, instead of the first (batch) dimension, e.g. `X.mean(axis=0)`.

In [None]:
layer = nn.LayerNorm()  # Normalize along channel-dimension
layer.initialize()
batch = nn.BatchNorm()  # Normalize along batch-dimension
batch.initialize()
X = nd.array([[1,2],[2,3]])
# compute mean and variance from X in the training mode.
with mx.autograd.record():
    print('layer norm:',layer(X), '\nbatch norm:', batch(X))
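The difference between the two normalizations comes down to the axis over which the statistics are computed. The NumPy sketch below reproduces this for the same input; it leaves out the learnable scale and shift parameters, and the value of `eps` is an assumption for this illustration.

```python
import numpy as np

def normalize(X, axis, eps=1e-5):
    # Subtract the mean and divide by the standard deviation along `axis`.
    mean = X.mean(axis=axis, keepdims=True)
    var = X.var(axis=axis, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

X_demo = np.array([[1.0, 2.0], [2.0, 3.0]])
layer_out = normalize(X_demo, axis=-1)   # statistics per row (per example)
batch_out = normalize(X_demo, axis=0)    # statistics per column (per feature)

print('layer norm:\n', layer_out)
print('batch norm:\n', batch_out)
```

With layer normalization each row becomes approximately `[-1, 1]`, while with batch normalization each column becomes approximately `[-1, 1]`.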

The connection block accepts two inputs $X$ and $Y$, the input and output of another block. Within this connection block, we apply dropout to $Y$.

In [None]:
class AddNorm(nn.Block):
    def __init__(self, dropout, **kwargs):
        super(AddNorm, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm()

    def forward(self, X, Y):
        return self.norm(self.dropout(Y) + X)

Due to the residual connection, $X$ and $Y$ should have the same shape.

In [None]:
add_norm = AddNorm(0.5)
add_norm.initialize()
add_norm(nd.ones((2,3,4)), nd.ones((2,3,4))).shape

## Positional Encoding

Unlike a recurrent layer, neither the multi-head attention layer nor the position-wise feed-forward network takes the order of the items in the sequence into account: the feed-forward network processes each position independently, and attention treats its key-value pairs as an unordered set. This property allows us to parallelize the computation, but it means that the order of the sequence is not modeled. The transformer model therefore adds positional information to the input sequence.

Assume $X\in\mathbb R^{l\times d}$ is the embedding of an example, where $l$ is the sequence length and $d$ is the embedding size. This layer will create a positional encoding $P\in\mathbb R^{l\times d}$ and output $P+X$, with $P$ defined as follows:

$$P_{i,2j} = \sin(i/10000^{2j/d}),\quad P_{i,2j+1} = \cos(i/10000^{2j/d}),$$

for $i=0,\ldots,l-1$ and $j=0,\ldots,\lfloor(d-1)/2\rfloor$.

In [None]:
def position_encoding_init(max_length, dim):
    X = nd.arange(0, max_length).reshape((-1,1)) / nd.power(
            10000, nd.arange(0, dim, 2)/dim)
    position_weight = nd.zeros((max_length, dim))

    position_weight[:, 0::2] = nd.sin(X)
    position_weight[:, 1::2] = nd.cos(X)
    return position_weight


class PositionalEncoding(nn.Block):
    def __init__(self, units, dropout=0, max_len=1000):
        super(PositionalEncoding, self).__init__()
        self._max_len = max_len
        self._units = units
        self.position_weight = position_encoding_init(max_len, units)
        self.dropout = nn.Dropout(dropout)

    def forward(self, X):
        pos_seq = mx.nd.arange(X.shape[1]).expand_dims(0)
        emb = nd.Embedding(pos_seq, self.position_weight, self._max_len, self._units)
        return self.dropout(X + emb)
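To connect the code with the formula for $P$, here is a NumPy sketch that builds the same table and checks a few entries against the definition. The function name `positional_encoding` is hypothetical, and the sketch assumes an even embedding size so that sine and cosine columns pair up.

```python
import math
import numpy as np

def positional_encoding(length, dim):
    # Assumes an even `dim`.
    pos = np.arange(length).reshape(-1, 1)                 # position i
    freq = np.power(10000.0, np.arange(0, dim, 2) / dim)   # 10000^(2j/d)
    angles = pos / freq                                    # shape (length, dim/2)
    P = np.zeros((length, dim))
    P[:, 0::2] = np.sin(angles)   # even columns: P[i, 2j]
    P[:, 1::2] = np.cos(angles)   # odd columns:  P[i, 2j+1]
    return P

P = positional_encoding(100, 20)

# Check one entry pair against the formula, with i = 7 and j = 3.
i, j, d = 7, 3, 20
angle = i / 10000 ** (2 * j / d)
print(np.isclose(P[i, 2 * j], math.sin(angle)),
      np.isclose(P[i, 2 * j + 1], math.cos(angle)))
```

At position 0 every sine column is 0 and every cosine column is 1, and each (sine, cosine) column pair shares one frequency.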

Now we visualize the positional encoding values for 4 dimensions. As can be seen, dimensions 4 and 5 (counting from 0) have the same frequency but a different offset, since one uses sine and the other cosine. Dimensions 6 and 7 have a lower frequency.

In [None]:
%matplotlib inline

from matplotlib import pyplot as plt

pe = PositionalEncoding(20)
pe.initialize()
X = nd.zeros((1, 100, 20))
Y = pe(X)
_ = plt.plot(np.arange(100), Y.asnumpy()[0, :,4:8])

## Transformer Encoder

<img src="../img/self-attention.svg" width="50%"/>

Now we define the transformer block for the encoder, which contains a multi-head attention layer, a position-wise feed-forward network, and two connection blocks.

Due to the residual connections, this block will not change the input shape. It means the `units` argument should be equal to the input's last dimension size.

In [None]:
class EncoderBlock(nn.Block):
    def __init__(self, units, hidden_size, num_heads, dropout, **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        self.attention = MultiHeadAttention(units, num_heads, dropout)
        self.add_1 = AddNorm(dropout)
        self.ffn = PositionWiseFFN(units, hidden_size)
        self.add_2 = AddNorm(dropout)

    def forward(self, X, mask):
        Y = self.add_1(X, self.attention(X, X, X, mask))
        return self.add_2(Y, self.ffn(Y))
    
encoder_blk = EncoderBlock(24, 48, 8, 0.5)
encoder_blk.initialize()
mask = nd.ones(shape=(2, 100, 100))
mask[0, :, 2:] = 0
mask[1, :, 3:] = 0
encoder_blk(nd.ones((2, 100, 24)), mask).shape

The encoder stacks $n$ blocks. Again due to the residual connections, the embedding layer size $d$ is the same as the transformer block output size. Also note that we multiply the embedding output by $\sqrt{d}$ so that its values are not too small compared to the positional encodings.

In [None]:
class TransformerEncoder(nn.Block):
    def __init__(self, vocab_size, units, hidden_size,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.units = units
        self.embed = nn.Embedding(vocab_size, units)
        self.pos_encoding = PositionalEncoding(units, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add(
                EncoderBlock(units, hidden_size, num_heads, dropout))

    def forward(self, X, mask, *args):
        X = self.pos_encoding(self.embed(X) * math.sqrt(self.units))
        for blk in self.blks:
            X = blk(X, mask)
        return X
    
encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5)
encoder.initialize()
encoder(nd.ones((2, 100)), mask).shape

## Transformer Decoder

<center>
<img src="../img/self-attention-predict.svg" width="50%"/>
</center>

Let us first look at how the decoder behaves during prediction. Similar to the seq2seq model, we call the forward function $T$ times to generate a sequence of length $T$. At time step $t$, assume $\mathbf x_t$ is the current input, i.e. the query. Then the keys and values of the self-attention layer consist of the current query together with all past queries $\mathbf x_1, \ldots, \mathbf x_{t-1}$.

During training, the whole target sequence is available at once, so the output for the $t$-th query could depend on all $T$ key-value pairs, which is inconsistent with the behavior during prediction. We can eliminate this inconsistency by specifying the valid length to be $t$ for the $t$-th query, i.e. by masking out all key-value pairs after position $t$.
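Giving the $t$-th query a valid length of $t$ corresponds to a lower-triangular mask of shape (query length, memory length). The NumPy sketch below builds such a mask and shows its effect on uniform scores. The names `causal_mask` and `dec_mask` and the sizes are illustrative assumptions, not part of the notebook's API.

```python
import numpy as np

T = 5  # illustrative target sequence length

# Row t (0-based) has ones in columns 0..t: query t may attend to itself
# and to all earlier positions, but not to later ones.
causal_mask = np.tril(np.ones((T, T)))

# Replicate for a batch: shape (batch_size, query_length, memory_length).
batch_size = 2
dec_mask = np.broadcast_to(causal_mask, (batch_size, T, T))

# With identical scores everywhere, a masked softmax spreads the weight
# uniformly over the visible positions of each row.
scores = np.zeros((T, T))
weights = np.exp(scores) * causal_mask
weights = weights / weights.sum(axis=-1, keepdims=True)

print(causal_mask)
print(weights)
```

A mask of this shape matches the `(batch_size, query_length, memory_length)` layout expected by the attention layers defined earlier.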

Another difference compared to the encoder transformer block is that the decoder block has an additional multi-head attention layer that accepts the encoder outputs as keys and values.

In [None]:
class DecoderBlock(nn.Block):
    # i means it's the i-th block in the decoder
    def __init__(self, units, hidden_size, num_heads, dropout, i, **kwargs):
        super(DecoderBlock, self).__init__(**kwargs)
        self.i = i
        self.attention_1 = MultiHeadAttention(units, num_heads, dropout)
        self.add_1 = AddNorm(dropout)
        self.attention_2 = MultiHeadAttention(units, num_heads, dropout)
        self.add_2 = AddNorm(dropout)
        self.ffn = PositionWiseFFN(units, hidden_size)
        self.add_3 = AddNorm(dropout)

    def forward(self, X, state, dec_mask=None):
        # key_values contains the past queries for this block
        enc_outputs, enc_mask, key_values = state
        key_values[self.i] = nd.concat(key_values[self.i], X, dim=1)
        X2 = self.attention_1(X, key_values[self.i], key_values[self.i], dec_mask)
        Y = self.add_1(X, X2)
        Y2 = self.attention_2(Y, enc_outputs, enc_outputs, enc_mask)
        Z = self.add_2(Y, Y2)
        return self.add_3(Z, self.ffn(Z)), [enc_outputs, enc_mask, key_values]

decoder_blk = DecoderBlock(24, 48, 8, 0.5, i=0)
decoder_blk.initialize()

X = nd.ones((2, 100, 24))
enc_outputs = encoder_blk(X, mask)
key_values = [nd.zeros(shape=(2,0,24))]
state = [enc_outputs, mask, key_values]

# Similar to the encoder block, `units` should be equal to the last dimension size of X.
decoder_blk(X, state)[0].shape

The construction of the decoder is identical to that of the encoder, except for the additional final dense layer that computes the confidence scores.

In [None]:
class TransformerDecoder(nn.Block):
    def __init__(self, vocab_size, units, hidden_size,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.units = units
        self.num_layers = num_layers
        self.embed = nn.Embedding(vocab_size, units)
        self.pos_encoding = PositionalEncoding(units, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add(
                DecoderBlock(units, hidden_size, num_heads, dropout, i))
        self.dense = nn.Dense(vocab_size, flatten=False)

    def init_state(self, enc_outputs, enc_valid_length, *args):
        batch_size = enc_valid_length.shape[0]
        empty_key_values = nd.zeros(shape=(batch_size, 0, self.units))
        return [enc_outputs, enc_valid_length, [empty_key_values]*self.num_layers]

    def forward(self, X, state):
        X = self.pos_encoding(self.embed(X) * math.sqrt(self.units))
        for blk in self.blks:
            X, state = blk(X, state)
        return self.dense(X), state

## Use the Pretrained Transformer model

Next, we load the pretrained Transformer model from the GluonNLP model zoo. The call returns the model together with the source and target vocabularies.

In [None]:
import nmt

with mx.np_shape(False):
    wmt_transformer_model, wmt_src_vocab, wmt_tgt_vocab = \
        nlp.model.get_model('transformer_en_de_512',
                            dataset_name='WMT2014',
                            pretrained=True,
                            ctx=ctx)

# we are using mixed vocab of EN-DE, so the source and target language vocab are the same
print('#Source Vocab:', len(wmt_src_vocab), ', #Target Vocab:', len(wmt_tgt_vocab))

In [None]:
print(wmt_transformer_model) # Print the model

### Translation Example

In [None]:
from gluonnlp.model import BeamSearchScorer, BeamSearchSampler
import hyperparameters as hparams
print('Beam Size =', hparams.beam_size,
      ', Length penalty Alpha=', hparams.lp_alpha,
      ', Length penalty K=', hparams.lp_k)

def decode_logprob(step_input, states):
    out, states, _ = wmt_transformer_model.decode_step(step_input, states)
    return mx.nd.log_softmax(out), states


sampler = BeamSearchSampler(
    decoder=decode_logprob,
    beam_size=hparams.beam_size,
    eos_id=wmt_tgt_vocab.token_to_idx[wmt_tgt_vocab.eos_token],
    scorer=nlp.model.BeamSearchScorer(alpha=hparams.lp_alpha, K=hparams.lp_k),
    max_length=200)


def translate(src_sentence, src_vocab, tgt_vocab):
    src_sentence.append(src_vocab[src_vocab.eos_token])
    src_nd = mx.nd.array([src_sentence], dtype=np.int32, ctx=ctx)
    src_valid_length = mx.nd.array([src_nd.shape[1]], ctx=ctx)
    encoder_outputs, _ = wmt_transformer_model.encode(src_nd, valid_length=src_valid_length)
    decoder_states = wmt_transformer_model.decoder.init_state_from_encoder(encoder_outputs,
                                                                 src_valid_length)
    
    bos_idx = tgt_vocab.token_to_idx[tgt_vocab.bos_token]
    inputs = mx.nd.full(shape=(1,), ctx=ctx, dtype=np.float32, val=bos_idx)

    samples, scores, sample_valid_length = sampler(inputs, decoder_states)

    max_score_sample = samples[:, 0, :].asnumpy()
    
    sample_valid_length = sample_valid_length[:, 0].asnumpy()
    translation_out = [
        wmt_tgt_vocab.idx_to_token[ele] for ele in
         max_score_sample[0][1:(sample_valid_length[0] - 1)]]
    return nmt.bleu._bpe_to_words(translation_out)
    
print('Translating English to German:')

sample_sentence = 'We love natural language processing .'.split()
print(sample_sentence)
sample_src_seq = wmt_src_vocab[sample_sentence]
sample_tgt_seq = translate(sample_src_seq, wmt_src_vocab, wmt_tgt_vocab)
print('The German translation is:')
print(sample_tgt_seq)

If you'd like to train your own transformer models, you may find the training scripts in our
[scripts](https://github.com/dmlc/gluon-nlp/tree/master/scripts/machine_translation).

## References

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.