# Transformer

The Transformer was introduced in "Attention is All you Need" [(Vaswani, et al., 2017)](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). It still has an encoder and a decoder, but it does not follow the recurrent seq2seq framework, in which the sentence is read one word at a time and each step depends on the previous one. Instead, the translation is generated by considering the attention between words. This includes not only attention between the words in the source and target sentences, but also attention between the words within the source, and within the target.


One important property of the Transformer is that the words of a sentence can be processed in parallel. Because each word is given (i) the word itself, (ii) the sentence, and (iii) its position, its processing does not depend on the processing of the previous words, as it does in recurrent seq2seq models.

"Attention is All you Need" depicts the Transformer architecture as follows:

![transformer](images/transformer.png)

The architecture is structured as follows:

* Encoder:
  * Positional Encoding
  * Encoder layer (x 6)
    * Self attention (multiheaded)
    * Feed forward

* Decoder:
  * Positional Encoding
  * Decoder layer (x 6)
    * Masked Self attention
    * Encoder-decoder attention
    * Feed forward
  * Linear and Softmax Layer


In the following subsections we describe each of the sublayers.


## Positional Encoding

In the Transformer, the words are processed independently (in parallel). The positional encoding adds information about the position of each word, so that the model knows which words are close to which.

This is done by creating vectors that are added to the input vectors. The key idea is that if two words are close to each other, their positional vectors should be similar to each other.


Imagine we had to represent the position as vectors of binary values:


> 0: 0000
>
> 1: 0001
>
> 2: 0010
>
> 3: 0011
>
> 4: 0100
>
> 5: 0101
>
> ...


In this list, the fourth (last) column changes between 0 and 1 at every position. The third column, however, changes between 0 and 1 every two positions. A (non-binary) generalization of this can be:

>$pos: [\sin(\omega_1 pos),\sin(\omega_2 pos),...,\sin(\omega_{d_x} pos)]$

where the last component of the vector has a higher frequency $\omega_{d_x}$ than the first component $\omega_1$.


In the Transformer, this kind of vector is used to encode the position. It creates $n$ (the length of the sentence) vectors $t_1,...,t_n$. Each $t_i$ is added to the input $x_i$, so the input becomes: $x_i \leftarrow x_i+t_i$
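
As an illustration, the sinusoidal encoding defined in the paper can be sketched with NumPy. The paper alternates sine and cosine across the dimensions and uses the frequencies $1/10000^{2i/d_{model}}$, so (unlike in the binary analogy above) the first dimensions oscillate fastest. The function name `positional_encoding`, the toy sizes, and the random input `x` are assumptions made for this example, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(n, d_model):
    """Return an (n, d_model) matrix whose row i is the position vector t_i."""
    pos = np.arange(n)[:, None]                    # positions 0..n-1, shape (n, 1)
    even = np.arange(0, d_model, 2)[None, :]       # even dimension indices (the "2i" of the paper)
    angles = pos / np.power(10000.0, even / d_model)
    t = np.zeros((n, d_model))
    t[:, 0::2] = np.sin(angles)                    # even dimensions use sine
    t[:, 1::2] = np.cos(angles)                    # odd dimensions use cosine
    return t

# Toy usage: add the position vectors to (random, illustrative) word vectors.
n, d_model = 5, 8
x = np.random.default_rng(0).normal(size=(n, d_model))
t = positional_encoding(n, d_model)
x_pos = x + t
print(x_pos.shape)  # (5, 8)
```

With this construction, the vectors of nearby positions are closer to each other than those of distant positions, which is the property asked for above.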



## Self Attention

Self attention is a mechanism to consider how a word is related to the rest of the words in the sentence.


From each vector, three smaller vectors are extracted: Q (query), K (key) and V (value).
* V: The content of the vector that is passed on to the output.
* Q and K: Used to compute the attention score (by "comparing" a vector Q to every vector K of the input).

The output is a sequence of vectors $(z_1, z_2, ...,z_n)$, where each $z_c$ corresponds to a weighted sum of the $V$ vectors of the input: $z_{c}=\sum_{i=1}^{n}att_i*V_i$

where each $att_i$ represents how relevant the vector $x_i$ (from the input sequence) is for $x_c$. Typically, $att_c$ will be high (as $x_c$ tends to be very relevant to "itself"), while less relevant words receive a low weight. The attention vector is:

$$att=softmax(\frac{Q_c*K_1^{T}}{\sqrt{d_k}} , \frac{Q_c*K_2^{T}}{\sqrt{d_k}}, ...,\frac{Q_c*K_n^{T}}{\sqrt{d_k}})$$


"Attention is All you Need" depicts the scaled dot-product attention as follows:

![scaled_dotproduct_attention](images/scaled_dotproduct_attention.png)

Given the input vectors (a sequence of vectors, e.g. the embeddings of each word), proceed as follows:

1. Create the three vectors (Q, K, and V) from each input vector. This is done by multiplying each vector $X$ of the input as: $X*W_Q=Q$, $X*W_K=K$ and $X*W_V=V$. The matrices $W_Q$, $W_K$, $W_V$ need to be trained.
2. Then, for each  $Q_c$:

   2.1. Obtain a score for each $K_i$ of the input against the current $Q_c$: $(Q_c*K_1^{T},Q_c*K_2^{T},...,Q_c*K_n^{T})$.
   
   2.2. Scale the scores by dividing by $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors, 64 in the paper): $(\frac{Q_c*K_1^{T}}{\sqrt{d_k}},\frac{Q_c*K_2^{T}}{\sqrt{d_k}},...,\frac{Q_c*K_n^{T}}{\sqrt{d_k}})$.

   2.3. Apply softmax so that the scores lie in the range [0, 1] and sum to 1: $softmax_i(\frac{Q_c*K_i^{T}}{\sqrt{d_k}})$.

   2.4. Multiply each vector $V_i$ of the input by its weight, i.e. the (softmax-ed) attention score computed in the previous step: $softmax_i(\frac{Q_c*K_i^{T}}{\sqrt{d_k}})*V_i$.

   2.5. Sum the weighted value vectors to obtain $z_c$.
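
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the toy dimensions, and the random matrices standing in for the trained $W_Q$, $W_K$ and $W_V$ are all assumptions. All $Q_c$ are handled at once by stacking them as the rows of a matrix.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # step 1: one Q, K, V row per input vector
    d_k = K.shape[-1]
    scores = Q @ K.T                           # step 2.1: row c holds Q_c * K_i^T for every i
    scores = scores / np.sqrt(d_k)             # step 2.2: scale by sqrt(d_k)
    att = softmax(scores, axis=-1)             # step 2.3: each row lies in [0, 1] and sums to 1
    Z = att @ V                                # steps 2.4 and 2.5: weighted sum of the V_i
    return Z, att

# Toy usage with random (untrained) matrices.
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 3
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z, att = self_attention(X, W_Q, W_K, W_V)
print(Z.shape, att.shape)  # (4, 3) (4, 4)
```

Row $c$ of `att` is the attention vector of $x_c$, and row $c$ of `Z` is the output $z_c$.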


#### Multi-headed Attention

The mechanism explained above has one limitation: a single attention can hardly capture several kinds of relation at once, as happens with homonyms. For example, the word "light" can mean "bright" or "not heavy". To address this, we can produce several output vectors (so they capture the different senses) using different attentions. For this reason, the self attention architecture is replicated several times ($h$ heads). This implies having several weight matrices: $W_{Q_1}$,.., $W_{Q_h}$ for extracting several $Q$ vectors, $W_{K_1}$,.., $W_{K_h}$ for several $K$ vectors, and $W_{V_1}$,.., $W_{V_h}$ for several $V$ vectors.

In the end, each vector of the output $Z_i$ will be a concatenation of the outputs of the different heads: $Z_i=concat(Z_{i_1},...,Z_{i_h})$. In the paper, this concatenation is then multiplied by one more trained matrix, $W_O$.
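
A minimal sketch of this replication, assuming NumPy and random (untrained) matrices, could look as follows. The helper names and toy sizes are assumptions made for the example. The final multiplication by an output matrix `W_O` follows the paper, which projects the concatenated heads back to the model dimension.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for all positions at once.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_O):
    # heads: list of (W_Q, W_K, W_V) tuples, one tuple per head.
    Z_heads = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    Z = np.concatenate(Z_heads, axis=-1)       # row i is concat(Z_i1, ..., Z_ih)
    return Z @ W_O                             # project back to d_model dimensions

# Toy usage: h = 2 heads, each working with d_k = d_model / h dimensions.
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
Z = multi_head_attention(X, heads, W_O)
print(Z.shape)  # (4, 8)
```

Because each head has its own matrices, each one can assign different attention weights to the same sentence.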



## Masked Self Attention

As seen before, the self attention mechanism compares a word with every other word in the sentence. However, the aim of the decoder is to predict the following words, so it cannot consider the words after the current position when computing the attention: these words are masked. In the decoder, self attention is therefore computed considering only the words that have already been produced.
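
One common way to implement the mask, sketched below with NumPy, is to set the scores of the following positions to $-\infty$ before the softmax so that their weights become exactly 0. The function name and the toy inputs are assumptions made for this example.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # future[c, i] is True when position i comes after position c.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # masked scores become 0 after softmax
    att = softmax(scores)
    return att @ V, att

# Toy usage with random vectors.
rng = np.random.default_rng(0)
n, d_k = 4, 3
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
Z, att = masked_self_attention(Q, K, V)
print(np.round(att, 2))  # lower-triangular: row c only attends to positions <= c
```

The first word can only attend to itself, so the first row of `att` is `[1, 0, 0, 0]`.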


## Encoder-decoder attention

In order to perform the translation, the Transformer computes the attention between the source sentence and the sentence that is being produced.


Encoder-decoder attention works in a similar way to self-attention, but the three vectors do not all come from the same place: the query $Q_{decoder}$ is obtained from the current decoder vector, whereas the keys and values ($K_{encoder}$ and $V_{encoder}$) are obtained from the output of the encoder. To compute the attention, the current decoder vector $Q_{decoder}$ is queried against the keys and values of the encoder.
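
As a rough sketch (assuming NumPy, random untrained matrices, and hypothetical names such as `encoder_decoder_attention`), the only change with respect to self attention is where the inputs of $Q$, $K$ and $V$ come from:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec, enc, W_Q, W_K, W_V):
    Q = dec @ W_Q                              # queries from the decoder vectors, shape (m, d_k)
    K = enc @ W_K                              # keys from the encoder output, shape (n, d_k)
    V = enc @ W_V                              # values from the encoder output, shape (n, d_k)
    att = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (m, n): target words vs. source words
    return att @ V, att

# Toy usage: 5 source words, 3 target words produced so far.
rng = np.random.default_rng(0)
n, m, d_model, d_k = 5, 3, 8, 4
enc = rng.normal(size=(n, d_model))            # stand-in for the encoder output
dec = rng.normal(size=(m, d_model))            # stand-in for the decoder vectors
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z, att = encoder_decoder_attention(dec, enc, W_Q, W_K, W_V)
print(Z.shape, att.shape)  # (3, 4) (3, 5)
```

Each row of `att` tells how much one target word attends to each of the source words.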



# References

http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

http://jalammar.github.io/illustrated-transformer/

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
