# Transformers

* RNN->GRU->LSTM (increasing complexity)
    * sequential (dependent on previous tokens)
    * can we run the computation in parallel?
* [paper](https://arxiv.org/pdf/1706.03762)
* Attention + CNN (attention-based representations with CNN-style parallel processing)
    * self-attention (computing all attention representations at once)
    * multi-head attention (multiple versions of attention representations)

## Self-attention

* A(q,K,V) = attention-based vector representation of a word
    * calculate A for each word, ie $A^{<1>},...,A^{<T_x>}$
    * representation depends on the context of the word (the other words in the sentence)
    * transformer attention $A(q,K,V) = \sum_i \frac{exp(q \cdot k^{<i>})}{\sum_j exp(q \cdot k^{<j>})} v^{<i>}$
        * for every word we have query (q), key (k) and value (v)
        * ie we have a third word in a sentence $x^{<3>}$
            * $q^{<3>} = W^Q \cdot x^{<3>}$,
            * $k^{<3>} = W^K \cdot x^{<3>}$,
            * $v^{<3>} = W^V \cdot x^{<3>}$,
            * $W$ are learned parameters
        * intuition
            * $q^{<3>}$ is a question about $x^{<3>}$
            * we compute $q^{<3>} \cdot k^{<1>}$ to measure how well the first word answers that question; we compute this for every word in the sentence
            * softmax over the results gives the weights, and the weighted sum of the values $v^{<i>}$ is $A^{<3>}$
        * in literature $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
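
The steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function names, the toy sizes and the random weights are assumptions, and words are stored as rows, so the projections are written `X @ W_q` rather than $W^Q \cdot x$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (T, d_model) word embeddings, one row per word."""
    Q = X @ W_q                        # queries, (T, d_k)
    K = X @ W_k                        # keys,    (T, d_k)
    V = X @ W_v                        # values,  (T, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (T, T): every query against every key
    weights = softmax(scores, axis=-1) # each row sums to 1
    return weights @ V, weights        # A for all words at once

# Toy example with random (untrained) weights.
rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 3
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
A, weights = self_attention(X, W_q, W_k, W_v)
print(A.shape)  # (4, 3)
```

Row `t` of `weights` holds the softmax weights used to build $A^{<t>}$, so all representations come out of two matrix products instead of a loop over words.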

## Multi-head attention

* applying self-attention to a sequence once is called a "head"; repeating it several times with different learned weights gives "multi-head" attention
* MultiHead(Q,K,V)
    * first head
        * matrices $W^Q_1, W^K_1, W^V_1$ are learned and capture one kind of relationship between words
        * $Attention(W^Q_1Q, W^K_1K, W^V_1V)$
        * mechanism asks and answers a first question (eg what is happening?)
    * second head
        * matrices $W^Q_2, W^K_2, W^V_2$ are learned and capture another kind of relationship
        * $Attention(W^Q_2Q, W^K_2K, W^V_2V)$
        * mechanism asks and answers a second question (eg when is it happening?)
    * ...
    * attention heads are concatenated and multiplied by learned weights
    * all heads can be computed in parallel
* TLDR
    * $head_i = Attention(W^Q_iQ, W^K_iK, W^V_iV)$
    * $MultiHead(Q,K,V) = concat(head_1, head_2, ..., head_h)W^O$
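
A rough NumPy sketch of the TLDR, under a few assumptions: words are stored as rows (so projections are right-multiplications), each head gets its own query, key and value projection, and the helper names, sizes and random weights are made up for illustration. The loop over heads is written sequentially for readability; the heads do not depend on each other, which is why they can be computed in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head (hypothetical layout)."""
    outputs = [attention(Q @ W_q, K @ W_k, V @ W_v) for W_q, W_k, W_v in heads]
    # Concatenate head outputs along the feature axis, then mix with W_o.
    return np.concatenate(outputs, axis=-1) @ W_o

# Toy self-attention case: Q, K and V all come from the same sequence X.
rng = np.random.default_rng(0)
T, d_model, h = 5, 8, 2
d_head = d_model // h
X = rng.normal(size=(T, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
out = multi_head(X, X, X, heads, W_o)
print(out.shape)  # (5, 8)
```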

## Transformer network
    
* encoder (repeated n-times)
    * input: the sequence's word embeddings, incl \<SOS\>, \<EOS\> tokens
    * multi-head attention layer
    * feed-forward neural network
* decoder (repeated n-times)
    * input: the start of the output sequence (the part of the translation generated so far)
    * multi-head attention layer over the translation so far, which supplies Q, K and V
    * multi-head attention layer, inputs Q from previous layer and K,V from encoder
    * feed-forward neural network
    * linear & softmax layers (applied once, after the last decoder block)
    * output: next token in sequence
* positional encoding is part of input to both encoder and decoder
    * represents position of input, through sin & cos functions
    * same dimension as the word embedding vector
    * each dimension reads its value off a sin/cos function of a different frequency, evaluated at the word's position
    * added directly to the word embedding vector
    * also passed to the net through residual connections (input to enc and dec)
* add & norm layers (residual connection + layer normalization) after each multi-head attention and feed-forward layer, to speed up convergence
* masked multi-head attention
    * used during training to mask "the future" part of the sequence, so each position only attends to earlier positions
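
Two items from the list above, positional encoding and masking, can be sketched in NumPy. The sin/cos formula with base 10000 follows the linked paper; the function names, the toy sizes and the assumption that `d_model` is even are illustrative choices, not part of the notes.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones (d_model even)."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2) even dimension indices
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_mask(T):
    """True where position t may attend to position s, i.e. s <= t."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    # Masked-out ("future") positions get -inf, hence zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

T, d_model = 4, 6
embeddings = np.zeros((T, d_model))              # stand-in for word embeddings
x = embeddings + positional_encoding(T, d_model)  # added directly to the embeddings

# With equal scores, position 1 splits its attention between positions 0 and 1
# and puts zero weight on the future positions 2 and 3.
weights = masked_softmax(np.zeros((T, T)), causal_mask(T))
print(weights[1])
```

Because the mask is lower-triangular, the whole target sequence can be fed to the decoder in one pass during training while each position still only sees the tokens before it.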
    