## Introduction

Attention is, to some extent, motivated by how we pay visual attention to different regions of an image or correlate words in one sentence.

In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: to predict or infer one element, such as a pixel in an image or a word in a sentence, we use the attention vector to estimate how strongly it is correlated with the other elements, and take the sum of their values weighted by the attention vector as the approximation of the target.
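The idea of "a weighted sum of values" can be sketched in a few lines of NumPy. This is only an illustration: the function name `attend`, the toy scores, and the use of a softmax to turn scores into weights are assumptions made for the example, not something fixed by the description above.

```python
import numpy as np

def attend(scores, values):
    """Turn raw relevance scores into weights and average the values."""
    scores = np.asarray(scores, dtype=float)
    values = np.asarray(values, dtype=float)
    # Softmax (an assumed choice): subtract the max for numerical
    # stability, exponentiate, then normalize so the weights sum to 1.
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()
    # The approximation of the target is the weighted sum of the values.
    return weights, weights @ values

# Three context elements, each with a 2-dimensional value vector.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical relevance scores of each element to the target.
scores = [2.0, 0.0, 0.0]

weights, approximation = attend(scores, values)
print(weights)        # the attention vector (importance weights)
print(approximation)  # weighted sum of the value vectors
```

The first element receives the largest weight because it has the highest score, so the result is pulled toward its value vector.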

The seq2seq model aims to transform an input sequence (source) into a new one (target), and both sequences can be of arbitrary length. Examples of transformation tasks include machine translation between multiple languages in either text or audio, question-answer dialog generation, and even parsing sentences into grammar trees. A critical and apparent disadvantage of its fixed-length context vector design is its inability to remember long sentences: often it has forgotten the first part of the input by the time it finishes processing the whole sequence. The attention mechanism was born to resolve this problem.

### Self Attention Mechanism
Self-attention, also known as intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence. It has been shown to be very useful in machine reading, abstractive summarization, and image description generation.

## Transformer Model

It is actually possible to do seq2seq modeling without recurrent network units by using the transformer model. The proposed “transformer” model is built entirely on self-attention mechanisms and does not use a sequence-aligned recurrent architecture.

### Key Value and Query

The major component in the transformer is the multi-head self-attention unit. The transformer views the encoded representation of the input as a set of key-value pairs, $(\mathbf{K}, \mathbf{V})$, both of dimension $n$ (input sequence length); in the context of neural machine translation (NMT), both the keys and values are the encoder hidden states. In the decoder, the previous output is compressed into a query ($\mathbf{Q}$ of dimension $m$), and the next output is produced by mapping this query against the set of keys and values.

The transformer adopts the scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with all the keys.
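As a minimal sketch of scaled dot-product attention, the NumPy code below computes the query-key dot products, scales them, normalizes them with a softmax, and uses the result to weight the values. Two details are assumptions taken from the original Transformer paper rather than from the text above: the scaling factor is the square root of the key dimension, and the weights are normalized with a softmax. The function names and the toy shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v)."""
    d_k = K.shape[-1]
    # Dot product of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # One row of weights per query; each row sums to 1.
    weights = softmax(scores, axis=-1)
    # Output: weighted sum of the values for each query.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # 2 queries
K = rng.normal(size=(3, 4))  # 3 keys
V = rng.normal(size=(3, 5))  # 3 values

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # one output vector per query
print(weights.shape)  # one weight per (query, key) pair
```

The scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.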

### Multi-Head Self-Attention

<img src="https://lilianweng.github.io/lil-log/assets/images/multi-head-attention.png" style="height: 350px;" />  

Multi-head scaled dot-product attention mechanism
                                   
Rather than only computing the attention once, the multi-head mechanism runs through the scaled dot-product attention multiple times in parallel. The independent attention outputs are simply concatenated and linearly transformed into the expected dimensions.
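The following sketch shows one way to run several attention heads in parallel, concatenate them, and apply a final linear transformation. The projection matrices `W_q`, `W_k`, `W_v`, `W_o`, the choice of splitting the model dimension evenly across heads, and the toy sizes are assumptions made for illustration (random weights stand in for learned ones).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model % num_heads == 0

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Scaled dot-product attention, computed for all heads at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ V  # (num_heads, seq_len, d_head)
    # Concatenate the heads back into (seq_len, d_model) ...
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    # ... and linearly transform into the expected dimension.
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # same shape as the input
```

Each head only sees its own slice of the projected queries, keys, and values, so different heads are free to attend to different positions.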

### Encoder

<img src="https://lilianweng.github.io/lil-log/assets/images/transformer-encoder.png" style="height: 250px;" />  

The transformer’s encoder

The encoder generates an attention-based representation with the capability to locate a specific piece of information in a potentially infinitely large context. It is built as follows:

1. A stack of N=6 identical layers.
2. Each layer has a multi-head self-attention layer and a simple position-wise fully connected feed-forward network.
3. Each sub-layer adopts a residual connection and a layer normalization. All the sub-layers output data of the same dimension.
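The wiring described in the list above can be sketched as follows. To keep the example short, a single-head self-attention stands in for the multi-head sub-layer. The ReLU feed-forward network, the post-sub-layer normalization order, and a layer normalization without learned gain and bias are all simplifying assumptions (the first two follow the original Transformer paper); the parameter names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector (no learned gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, W_q, W_k, W_v):
    # Single-head stand-in for the multi-head sub-layer.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise: the same two-layer network at every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    # Each sub-layer: residual connection, then layer normalization.
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

def init_params(rng, d_model, d_ff):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "W_q": w(d_model, d_model), "W_k": w(d_model, d_model),
        "W_v": w(d_model, d_model),
        "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "W2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, N = 8, 32, 6
layers = [init_params(rng, d_model, d_ff) for _ in range(N)]

x = rng.normal(size=(5, d_model))  # 5 positions
for p in layers:                   # a stack of N identical layers
    x = encoder_layer(x, p)
print(x.shape)  # every sub-layer keeps the same dimension
```

Because every sub-layer outputs the same dimension as its input, the residual additions are always well-defined and layers can be stacked freely.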

### Decoder

<img src="https://lilianweng.github.io/lil-log/assets/images/transformer-decoder.png" style="height: 300px;" />  

The transformer's decoder

The decoder is able to retrieve information from the encoded representation.

1. A stack of N = 6 identical layers.
2. Each layer has two sub-layers of multi-head attention mechanisms and one sub-layer of fully-connected feed-forward network.
3. Similar to the encoder, each sub-layer adopts a residual connection and a layer normalization.
4. The first multi-head attention sub-layer is modified to prevent positions from attending to subsequent positions, as we don’t want to look into the future of the target sequence when predicting the current position.
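The masking in the last item can be sketched by setting the attention scores of future positions to negative infinity before the softmax, so that their weights become exactly zero. This is one common way to implement it and is shown here for a single head; the function name and toy shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    """Q, K, V: (seq_len, d). Position i may only attend to j <= i."""
    seq_len = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # True strictly above the diagonal marks the "future" positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # exp(-inf) = 0, so masked positions receive zero weight.
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = masked_self_attention(Q, K, V)
print(np.round(weights, 2))  # lower-triangular weight matrix
```

The first position can only attend to itself, so its row of weights is a single 1 followed by zeros.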

### Full Architecture

Finally, here is the complete view of the transformer’s architecture:

1. Both the source and target sequences first go through embedding layers to produce data of the same dimension $d_\text{model}$.
2. To preserve the position information, a sinusoid-wave-based positional encoding is applied and summed with the embedding output.
3. A linear layer followed by a softmax is applied to the final decoder output.

<img src="https://lilianweng.github.io/lil-log/assets/images/transformer.png" style="height: 500px;" />  
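The first two steps (embedding, then adding a positional encoding) can be sketched as below. The specific formula, with sines on even dimensions, cosines on odd dimensions, and a base of 10000, is the one from the original Transformer paper and is an assumption here, since the text above only says the encoding is sinusoid-based. The vocabulary size, the random embedding table, and the token ids are made up for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # even dimension indices
    # Wavelengths grow geometrically with the dimension index.
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
# Hypothetical embedding table; in a real model this is learned.
embedding = rng.normal(size=(vocab_size, d_model))

tokens = np.array([1, 4, 2])  # made-up token ids
pe = sinusoidal_positional_encoding(len(tokens), d_model)
# The positional encoding is summed with the embedding output.
x = embedding[tokens] + pe
print(x.shape)
```

Since the encoding depends only on the position and the dimension index, it needs no training and can be computed for any sequence length.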

## Implementation