# Encoder-Decoder model with Attention 

References: 
* https://github.com/tensorflow/nmt
* https://arxiv.org/abs/1508.04025v5

<img src="images/attention_mechanism.jpg" style="width: 500px;">

At timestep $t$, the decoder computes its output as follows:

1. Compute $\text{score}(\mathbf{h}_{t},\overline{\mathbf{h}}_{s})$ for $s=1,\ldots,S$, where $\overline{\mathbf{h}}_{s}$ is the encoder output at timestep $s$ and $\mathbf{h}_{t}$ is the hidden state of the decoder (more explicitly, the hidden state of the GRU layer in the decoder) at timestep $t$.

    * The decoder's hidden state is initialized with the encoder's hidden state at its final timestep. The encoder and the decoder must therefore have the same number of units, denoted `endec_units` below.

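1. Compute the attention weights $\alpha_{ts}$ by applying a softmax to the scores over the source positions: $\alpha_{ts}=\frac{\exp\left(\text{score}(\mathbf{h}_{t},\overline{\mathbf{h}}_{s})\right)}{\sum_{s'=1}^{S}\exp\left(\text{score}(\mathbf{h}_{t},\overline{\mathbf{h}}_{s'})\right)}$.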

1. Compute the context vector: $\mathbf{c}_{t}=\sum_{s=1}^{S}\alpha_{ts}\overline{\mathbf{h}}_{s}$.

1. Pass the concatenated vector $[\mathbf{c}_{t};\mathbf{e}_{t}]$ to the decoder's GRU layer, where $\mathbf{e}_{t}$ is the embedding of the decoder input at timestep $t$.

1. Pass the GRU output to a dense layer whose output dimension is the decoder vocabulary size, giving one logit per target token.
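
Steps 2 to 4 can be traced for a single example (no batch dimension) with a minimal NumPy sketch. The scores are made-up numbers standing in for the result of step 1, and the sizes `S`, `units`, and `embedding_dim` are illustrative assumptions rather than values taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
S, units, embedding_dim = 4, 3, 2           # illustrative sizes (assumed)

enc_outputs = rng.normal(size=(S, units))   # encoder outputs, one row per s
scores = np.array([0.5, 2.0, -1.0, 0.0])    # stand-in for the scores of step 1

# Step 2: softmax over the source positions gives the attention weights.
exp_scores = np.exp(scores - scores.max())  # subtract the max for stability
alpha = exp_scores / exp_scores.sum()       # shape (S,), sums to 1

# Step 3: the context vector is the weighted sum of the encoder outputs.
context = alpha @ enc_outputs               # shape (units,)

# Step 4: concatenate with the embedding e_t of the current decoder input.
e_t = rng.normal(size=(embedding_dim,))     # random stand-in for an embedding
gru_input = np.concatenate([context, e_t])  # shape (units + embedding_dim,)

print(alpha.round(3), context.shape, gru_input.shape)
```

The source position with the largest score receives the largest weight, so the context vector is dominated by the corresponding encoder output.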

Two common ways of computing the scores are:

* $\text{score}(\mathbf{h}_{t},\overline{\mathbf{h}}_{s})=\mathbf{h}_{t}^{T}\mathbf{W}\overline{\mathbf{h}}_{s}$
(Luong's multiplicative style)

* $\text{score}(\mathbf{h}_{t},\overline{\mathbf{h}}_{s})=\mathbf{v}_{a}^{T}\tanh\left(\mathbf{W}_{1}\mathbf{h}_{t}+\mathbf{W}_{2}\overline{\mathbf{h}}_{s}\right)$
(Bahdanau's additive style)
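
Both scoring functions are short enough to write out directly. The sketch below uses NumPy for a single decoder state and `S` encoder outputs. The function names `luong_score` and `bahdanau_score`, the sizes, and the randomly drawn parameters are assumptions made for illustration only.

```python
import numpy as np

def luong_score(h_t, h_bar, W):
    """Multiplicative score h_t^T W h_bar_s for every source position s."""
    # h_t: (d_dec,), h_bar: (S, d_enc), W: (d_dec, d_enc) -> result: (S,)
    return h_bar @ (W.T @ h_t)

def bahdanau_score(h_t, h_bar, W1, W2, v_a):
    """Additive score v_a^T tanh(W1 h_t + W2 h_bar_s) for every s."""
    # W1: (units, d_dec), W2: (units, d_enc), v_a: (units,) -> result: (S,)
    return np.tanh(W1 @ h_t + h_bar @ W2.T) @ v_a

rng = np.random.default_rng(0)
S, d_dec, d_enc, units = 5, 4, 4, 8         # illustrative sizes (assumed)
h_t = rng.normal(size=d_dec)                # decoder hidden state
h_bar = rng.normal(size=(S, d_enc))         # encoder outputs

# Random parameters; in a real model these are learned.
W = rng.normal(size=(d_dec, d_enc))
W1 = rng.normal(size=(units, d_dec))
W2 = rng.normal(size=(units, d_enc))
v_a = rng.normal(size=units)

mult = luong_score(h_t, h_bar, W)
add = bahdanau_score(h_t, h_bar, W1, W2, v_a)
print(mult.shape, add.shape)                # one score per source position
```

In the additive form, `units` is a free choice that need not equal the size of either hidden state, whereas the multiplicative form only needs a single matrix of shape `(d_dec, d_enc)`.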

__Encoder__:

* The encoder consists of an embedding layer and a GRU layer. Unlike the decoder, it has no dense output layer: the encoder and the decoder together form the whole model, so only the decoder produces predictions.

* If `x` is a batch of shape (batch_size, seq_length), then `encoder(x, hidden)` returns the tuple (output, state) produced by the GRU layer, where 
    * output.shape is (batch_size, seq_length, endec_units) and  
    * state.shape is (batch_size, endec_units).

* The GRU layer is created with the parameters `return_sequences=True` and `return_state=True`.

* Note that  
    * if `return_sequences=False` and `return_state=False` (the defaults), the GRU returns only the output at the last timestep, with shape (batch_size, units);
    * if `return_sequences=True` and `return_state=False`, the GRU returns only the outputs at all timesteps, with shape (batch_size, seq_length, units);
    * if `return_sequences=True` and `return_state=True`, the GRU returns both the outputs and the final state, with shapes (batch_size, seq_length, units) and (batch_size, units), respectively.
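
These three cases can be checked on a random input. The snippet below is a small sketch that assumes TensorFlow 2.x is installed; the sizes are arbitrary.

```python
import tensorflow as tf

batch_size, seq_length, features, units = 2, 5, 3, 4  # arbitrary sizes
x = tf.random.normal((batch_size, seq_length, features))

# Defaults: only the output at the last timestep is returned.
last_output = tf.keras.layers.GRU(units)(x)
print(last_output.shape)          # (batch_size, units)

# return_sequences=True: the output at every timestep is returned.
all_outputs = tf.keras.layers.GRU(units, return_sequences=True)(x)
print(all_outputs.shape)          # (batch_size, seq_length, units)

# Both flags: per-timestep outputs plus the final hidden state.
output, state = tf.keras.layers.GRU(
    units, return_sequences=True, return_state=True)(x)
print(output.shape, state.shape)  # (batch_size, seq_length, units), (batch_size, units)
```

For a GRU without masking, `state` coincides with the output at the last timestep, `output[:, -1, :]`.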

```python
import tensorflow as tf


class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        # x.shape: (batch_size, seq_length)
        x = self.embedding(x)
        # x.shape: (batch_size, seq_length, embedding_dim)
        output, state = self.gru(x, initial_state=hidden)
        # output.shape: (batch_size, seq_length, enc_units)
        # state.shape: (batch_size, enc_units)
        return output, state

    def initialize_hidden_state(self):
        # All-zero initial hidden state for the GRU.
        return tf.zeros((self.batch_sz, self.enc_units))
```

__BahdanauAttention__:

* Recall Bahdanau's additive style:
$\text{score}(\mathbf{h}_{t},\overline{\mathbf{h}}_{s})=\mathbf{v}_{a}^{T}\tanh\left(\mathbf{W}_{1}\mathbf{h}_{t}+\mathbf{W}_{2}\overline{\mathbf{h}}_{s}\right)$

* The layer `BahdanauAttention` computes the context vector and the attention weights. 
    * inputs: the decoder hidden state ($\mathbf{h}_t$) and the encoder outputs ($\overline{\mathbf{h}}_{s}$ for $s=1,\ldots,S$)
    * outputs: the context vector and the attention weights

* The parameter `units` in `__init__()` is the output dimension of both $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$. In principle it is independent of `endec_units`, but the decoder's `__init__()` sets it to `endec_units`. Likewise, the last dimensions of $\mathbf{h}_t$ and $\overline{\mathbf{h}}_{s}$ need not match in Bahdanau's additive formula, but they are equal in this example.

* The parameter `query` in `call()` is $\mathbf{h}_t$. Its shape is (batch_size, endec_units).

* The parameter `values` in `call()` holds the encoder outputs $\overline{\mathbf{h}}_{s}$ for all $s$. Its shape is (batch_size, enc_seq_length, endec_units).

* The output `context_vector` has shape (batch_size, endec_units). 

* The output `attention_weights` has shape (batch_size, enc_seq_length, 1).

```python
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
        
    def call(self, query, values):
        # query.shape: (batch_size, endec_units)
        # values.shape: (batch_size, enc_seq_length, endec_units)
        score = self.V(tf.nn.tanh(self.W1(tf.expand_dims(query,1)) + self.W2(values)))
        # score.shape: (batch_size, enc_seq_length, 1)
        
        attention_weights = tf.nn.softmax(score, axis=1)
        # attention_weights.shape: (batch_size, enc_seq_length, 1)
        
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        # context_vector.shape: (batch_size, endec_units)
        
        return context_vector, attention_weights
```
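
The shape comments in `call()` follow from broadcasting. The NumPy sketch below traces them with random matrices standing in for the three `Dense` layers (an assumption for illustration: biases are omitted and nothing is trained).

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, enc_seq_length, endec_units, units = 2, 6, 4, 4  # arbitrary sizes

query = rng.normal(size=(batch_size, endec_units))
values = rng.normal(size=(batch_size, enc_seq_length, endec_units))

# Random matrices stand in for the Dense layers W1, W2 and V (no biases).
W1 = rng.normal(size=(endec_units, units))
W2 = rng.normal(size=(endec_units, units))
V = rng.normal(size=(units, 1))

# Insert a time axis so the query broadcasts against every encoder timestep.
q = query[:, None, :] @ W1   # (batch_size, 1, units)
k = values @ W2              # (batch_size, enc_seq_length, units)
score = np.tanh(q + k) @ V   # (batch_size, enc_seq_length, 1)

# Softmax over axis 1 normalises across encoder timesteps.
e = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = e / e.sum(axis=1, keepdims=True)

# The trailing axis of size 1 broadcasts over endec_units in the product.
context_vector = (attention_weights * values).sum(axis=1)
print(score.shape, attention_weights.shape, context_vector.shape)
```

Taking the softmax over any axis other than 1 would normalise across the wrong dimension; with `axis=1` the weights for each example sum to 1 over the encoder timesteps.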

__Decoder__:

```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units, 
                                       return_sequences=True, 
                                       return_state=True, 
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(self.dec_units)
        
    def call(self, x, hidden, enc_output):
        # x.shape: (batch_size, 1)
        # hidden.shape: (batch_size, endec_units)
        # enc_output.shape: (batch_size, enc_seq_length, endec_units)
        
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # context_vector.shape: (batch_size, endec_units)
        # attention_weights.shape: (batch_size, enc_seq_length, 1)
        
        x = self.embedding(x)
        # x.shape: (batch_size, 1, embedding_dim)
        
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # x.shape: (batch_size, 1, endec_units + embedding_dim)
        
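        # No initial_state is passed, so the GRU starts from zeros on each call
        # and `hidden` affects this step only through the attention context.
        output, state = self.gru(x)
        # output.shape = (batch_size, 1, endec_units)
        # state.shape = (batch_size, endec_units)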
        
        output = tf.reshape(output, (-1, output.shape[2]))
        # output.shape = (batch_size, endec_units)
        
        output = self.fc(output)
        # output.shape = (batch_size, vocab_size)
        
        return output, state, attention_weights 
```