# Attention: Intuition

- State-of-the-art NLP models rely on **Attention** and on its most sophisticated application, **transformers**.
- In this unit, we will provide an intuitive understanding of the **Attention** mechanism in deep learning.

## Self-Attention

- The self-attention operation is fundamental to state-of-the-art NLP.
- It is a simple sequence-to-sequence operation: a sequence of vectors (**input vectors**) goes in, and a sequence of vectors comes out.
- The self-attention operation builds upon the assumption that, among all the input vectors, some are more similar to one another than others.
- Therefore, when generating each output vector, the Self-Attention layer may give more weight to the input vectors that are more similar to the vector being processed.

- How do we know which input vectors are more similar or more connected to each other? The simplest way is to compute the **dot product** of the two vectors, which is closely related to cosine similarity (cosine similarity is the dot product of the two vectors after each is scaled to unit length).
- Therefore, in Self-Attention, each input vector (**Query**) is compared to all the input vectors (**Keys**), itself included, to get the weights or similarity measures.
- Its output vector is then a weighted sum over all the input vectors, weighted by these similarity measures between input vectors.
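
The three steps above (compare, weight, sum) can be sketched in a few lines of NumPy. This is only a minimal, parameter-free illustration: the function name `simple_self_attention`, the toy input `X`, and the use of a softmax to turn dot products into weights that sum to one are assumptions added here, not details given in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def simple_self_attention(X):
    """Parameter-free self-attention over the rows of X.

    X has shape (sequence_length, embedding_dim).
    """
    scores = X @ X.T                     # dot product of every Query with every Key
    weights = softmax(scores, axis=-1)   # assumed normalization: each row sums to 1
    outputs = weights @ X                # weighted sum over all the input vectors
    return outputs, weights

# Three toy input vectors; the first two point in similar directions.
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
outputs, weights = simple_self_attention(X)
```

Row `i` of `weights` holds the weights used to build output vector `i`. In this toy input, the first vector puts more weight on the second vector than on the third, because the first two are more similar.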

![](../images/seq2seq-self-atten.gif)

- For instance, in this example, the word $walks$ may be more relevant to *who* is doing the walking, i.e., $cats$, or *where* the agent is walking, i.e., $street$, and less relevant to grammatical words like $the$.
- Therefore, an effective Self-Attention layer should create the output vector of $walks$ (i.e., the weighted sum) by assigning higher weights to these relevant tokens (as indicated by the widths of the arrows) and lower weights to the irrelevant tokens.

- Simply put, the Self-Attention layer transforms each input vector into an output vector by taking into consideration how that input vector (Query) is connected to the rest of the input vectors (Keys).

## Sequence Model with Attention

### Vanilla Encoder-Decoder Model

- Attention arose as an effective mechanism for many-to-many sequence-to-sequence models.
- A typical application is machine translation.
- In the vanilla RNN Encoder-Decoder model, after the encoder processes the input sequence across all time steps, the decoder takes the encoder's output at the last time step as its input for decoding.

![](../images/seq2seq-vanilla-rnn.jpeg)

- Two sequence models need to be trained, an encoder and a decoder:
    - Encoder:
        - A vanilla version of the seq-to-seq model takes the **last** return state $h_t$ of the encoder as the initial and only input for the decoder.
        - If the encoder uses the LSTM cell, the output of the encoder would be the last return state and the last memory cell, i.e., $h_t$ and $c_t$.
    - Decoder:
        - During the training stage, at each time step the decoder takes the previous return state $h_{t-1}$ and the current ground-truth target token $Y_t$ (concatenated) as the input for the LSTM, rather than its own earlier prediction. This is referred to as **teacher forcing**.
        - During the testing stage, the decoder decodes the output one token at a time, taking the previous return state $h_{t-1}$ and its own previous output $Y_{t-1}$ (concatenated) as the inputs of the LSTM. That is, there is no **teacher forcing** during the testing stage.
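
The difference between the two decoding regimes can be sketched with a toy decoder. Everything here is hypothetical and untrained: the parameters `E`, `W`, `V`, the helper names, and the token ids are made up for illustration, and a plain tanh RNN cell stands in for the LSTM to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 5, 4

# Hypothetical, randomly initialised parameters (untrained, illustration only).
E = rng.normal(size=(vocab_size, hidden_dim))      # token embeddings
W = rng.normal(size=(2 * hidden_dim, hidden_dim))  # recurrent cell weights
V = rng.normal(size=(hidden_dim, vocab_size))      # output projection

def decoder_step(h_prev, token_id):
    """One step of a plain tanh RNN cell (a stand-in for the LSTM)."""
    x = np.concatenate([h_prev, E[token_id]])  # previous state + input token
    h = np.tanh(x @ W)
    logits = h @ V
    return h, logits

def decode_teacher_forcing(h_enc, decoder_inputs):
    """Training: every step is fed a ground-truth token."""
    h, all_logits = h_enc, []
    for token_id in decoder_inputs:
        h, logits = decoder_step(h, token_id)
        all_logits.append(logits)
    return np.stack(all_logits)

def decode_greedy(h_enc, start_id, max_len):
    """Testing: every step is fed the decoder's own previous prediction."""
    h, token_id, predicted = h_enc, start_id, []
    for _ in range(max_len):
        h, logits = decoder_step(h, token_id)
        token_id = int(np.argmax(logits))
        predicted.append(token_id)
    return predicted

h_enc = rng.normal(size=hidden_dim)  # stands in for the encoder's last state
train_logits = decode_teacher_forcing(h_enc, decoder_inputs=[0, 3, 1])
test_tokens = decode_greedy(h_enc, start_id=0, max_len=3)
```

With teacher forcing, the inputs are known in advance, so a mistake at one step does not affect the input of the next. In the greedy loop, each prediction is fed back in, so the decoder has to live with its own earlier outputs.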

### Peeky Encoder-Decoder Model

![](../images/seq2seq-peeky.jpeg)

- In the vanilla encoder-decoder model, the decoder can only access the last hidden state from the encoder, and only at its first time step.
- A variant of the seq-to-seq model makes the last return state $h_{t}$ from the encoder available to every time step in the decoder.
- An intuitive understanding of this **peeky** approach is that during the decoding stage (i.e., translation), the context from the source input should be made available to all decoding steps.
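
One common way to let every decoding step "peek" is to append the encoder's last state to the decoder input at each time step. The sketch below shows only that concatenation, with made-up dimensions and random values; the variable names are hypothetical, and whether the figure's model concatenates at the input, at the output layer, or both is not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
target_len, embed_dim, hidden_dim = 4, 3, 5

h_enc = rng.normal(size=hidden_dim)                        # last encoder state
decoder_inputs = rng.normal(size=(target_len, embed_dim))  # embedded target tokens

# Repeat the encoder state once per decoding step and append it to each input.
peek = np.tile(h_enc, (target_len, 1))                     # (target_len, hidden_dim)
peeky_inputs = np.concatenate([decoder_inputs, peek], axis=-1)
```

Each row of `peeky_inputs` now carries both the token for that step and the same summary of the source sentence, so the decoder no longer has to remember the encoder state across all of its time steps.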

### Attention-based Encoder-Decoder Model

- Compared to the Peeky Encoder-Decoder Model, the Attention-based Encoder-Decoder Model goes one step further: the decoder can access not only the encoder's hidden state at the last time step, but all the hidden states from the encoder.
- This is where the Attention mechanism comes in.

![](../images/seq2seq-enc-dec-attn.gif)

- The Attention mechanism can be seen as a much more sophisticated design of the peeky approach.
- The idea is that during the decoding stage, we need to consider the pairwise relationship (similarity) between the decoder state $h_{t}$ and **ALL** the return states from the encoder.
- An intuitive understanding is as follows. When decoding the translation of $Y_{1}$, it is very likely that its translation is more relevant to some of the input words and less relevant to the others.
- Therefore, the Attention mechanism first needs to determine the relative pairwise relationship between the decoder state $h_{1}$ and all the encoder hidden states in order to generate the **attention outputs**.
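
For a single decoding step, the pairwise comparison can be sketched as follows. The numbers are toy values chosen by hand, dot-product scoring is just one possible choice, and the softmax normalization is an assumption added for the illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical toy values: 4 encoder time steps, hidden size 3.
encoder_states = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0],
                           [1.0, 1.0, 0.0]])
decoder_state = np.array([2.0, 0.0, 0.0])  # h_1 of the decoder

scores = encoder_states @ decoder_state    # one similarity score per encoder step
attention_weights = softmax(scores)        # normalised so the weights sum to 1
```

Here the first and fourth encoder states overlap with the decoder state, so they receive the largest (and equal) weights, while the second and third receive the smallest.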

![](../images/seq2seq-attention-weights.jpeg)

- There are many proposals regarding how to compute the attention weights.
- TensorFlow (Keras) currently provides three types of [Attention layers](https://keras.io/api/layers/attention_layers/):
    - `Attention` Layer: Luong-style attention (i.e., simple dot-product) [Luong2015](https://arxiv.org/pdf/1508.04025.pdf)
    - `AdditiveAttention` Layer: Bahdanau-style attention [Bahdanau2015](https://arxiv.org/pdf/1409.0473.pdf)
    - `MultiHeadAttention` Layer: transformer-style attention [“Attention is All you Need” (Vaswani, et al., 2017)](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)
- The **Attention** layer then transforms all the hidden states of the encoder into a **context vector** for each decoding step, indicating how relevant each input token is to that step.
- These context vectors (from the Attention layer) can then contribute to the decoding process (translation).
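
The dot-product and additive proposals differ only in how the score between a query and a key is computed; the rest of the pipeline (normalize, then take a weighted sum of encoder states) is shared. The sketch below follows the general form of the two papers rather than the internals of the Keras layers, and all parameters (`W_q`, `W_k`, `v`) are hypothetical, untrained random values.

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, hidden_dim, attn_dim = 4, 3, 5

encoder_states = rng.normal(size=(src_len, hidden_dim))
decoder_state = rng.normal(size=hidden_dim)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_score(query, keys):
    """Luong-style: plain dot product between the query and each key."""
    return keys @ query

# Hypothetical, untrained parameters of the additive scorer.
W_q = rng.normal(size=(hidden_dim, attn_dim))
W_k = rng.normal(size=(hidden_dim, attn_dim))
v = rng.normal(size=attn_dim)

def additive_score(query, keys):
    """Bahdanau-style: a small feed-forward network scores each (query, key) pair."""
    return np.tanh(query @ W_q + keys @ W_k) @ v

def context_vector(score_fn):
    """Turn scores into weights, then into a weighted sum of encoder states."""
    weights = softmax(score_fn(decoder_state, encoder_states))
    return weights @ encoder_states, weights

dot_context, dot_weights = context_vector(dot_score)
add_context, add_weights = context_vector(additive_score)
```

Both variants return a context vector with the same size as one encoder hidden state. The additive scorer has its own learnable parameters, whereas the plain dot product has none.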

## Usage of the Attention Layer in Keras

- When defining the Attention layer, we need to specify the **Query** tensors and the **Key** tensors.
- In Self-Attention layers, the Queries and the Keys are both the input vectors.
- In the Attention layer of a sequence-to-sequence model, the Query is the hidden state from the Decoder, and the Keys are all the hidden states from the Encoder.
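
A minimal sketch of both usages with `tf.keras.layers.Attention` is given below. The tensor shapes and random values are made up for illustration. Note that the Keras layer is called with a list `[query, value]` (optionally `[query, value, key]`); when no key is passed, the value tensor is also used as the key, which is the situation described above.

```python
import numpy as np
import tensorflow as tf

batch, src_len, tgt_len, dim = 2, 5, 3, 4
rng = np.random.default_rng(0)

# Hypothetical hidden states: (batch, time steps, hidden size).
encoder_states = tf.constant(rng.normal(size=(batch, src_len, dim)), dtype=tf.float32)
decoder_states = tf.constant(rng.normal(size=(batch, tgt_len, dim)), dtype=tf.float32)

attention = tf.keras.layers.Attention()  # dot-product (Luong-style) scoring

# Self-attention: the same tensor serves as Query and as Key/Value.
self_attended = attention([encoder_states, encoder_states])

# Encoder-decoder attention: decoder states are the Queries,
# encoder states are the Keys/Values.
context, weights = attention(
    [decoder_states, encoder_states], return_attention_scores=True
)
```

`context` has one vector per decoding step, with shape `(batch, tgt_len, dim)`, and `weights` has shape `(batch, tgt_len, src_len)`, i.e., one distribution over the source positions for each decoding step.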

## References

- Please see Lilian Weng's very nice review, [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html).