# Fundamental concepts of transformer architecture

### Positional encoding
* the position of a token matters for the meaning of a sentence
* sin & cos waves
    * $PE(pos,2i) = sin(\frac{pos}{10000^\frac{2i}{d_{model}}})$ (even dimensions)
    * $PE(pos,2i+1) = cos(\frac{pos}{10000^\frac{2i}{d_{model}}})$ (odd dimensions)
    * where $pos$ is the position of the token in the sequence
    * $i$ indexes the sin/cos pair, so each pair of embedding dimensions gets a wave of a different frequency
    * $0 \leq i < \frac{d_{model}}{2}$
* example
    * Sentence "Transformers are awesome"

    **Embeddings**

| Token | D1 | D2 | D3 | D4 |
| -------- | ------- |------- |------- |------- |
| Transformers | 0.2 | 0.4 | 0.1 | 0.3 |
| are | 0.5 | 0.2 | 0.7 | 0.9 |
| awesome | 0.6 | 0.6 | 0.4 | 0.2 |

Position (pos) -> 0,1,2  
Dimensions -> 0,1,2,3 (so $i$ -> 0,1 with $d_{model} = 4$)  


**Example calculation**

$PE(0,0) = sin(0/10000^{(2*0/4)}) = sin(0) = 0$  
$PE(0,1) = cos(0/10000^{(2*0/4)}) = cos(0) = 1$  
$PE(0,2) = sin(0/10000^{(2*1/4)}) = sin(0) = 0$  
$PE(0,3) = cos(0/10000^{(2*1/4)}) = cos(0) = 1$  

**Positional encodings**

| Token | D1 | D2 | D3 | D4 |
| -------- | ------- |------- |------- |------- |
| Transformers | 0 | 1 | 0 | 1 |
| are | 0.84 | 0.54 | 0.01 | 1.00 |
| awesome | 0.91 | -0.42 | 0.02 | 1.00 |
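
The table above can be recomputed directly from the two formulas. A minimal sketch, assuming $d_{model} = 4$ and three positions as in the example; the function name `positional_encoding` is made up for illustration:

```python
import math

def positional_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            i = dim // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / base ** (2 * i / d_model)
            # even dimensions use sin, odd dimensions use cos
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=3, d_model=4)
for token, row in zip(["Transformers", "are", "awesome"], pe):
    print(token, [round(v, 2) for v in row])
```

Columns D1/D2 use the fast wave ($i = 0$), columns D3/D4 the slow one ($i = 1$), which is why D3 and D4 barely change over three positions.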

* the maximum sequence length has to be fixed in advance (here it is capped at the vocabulary size), so the encoding matrix is (vocab size) x (embedding size)
* each column corresponds to a different positional function (one sin or cos wave), each row to a position in the sequence
* combining many functions of different frequencies generally makes the encoding of each position unique
* the positional encoding is added element-wise to the embedding vector, so information about token order is preserved
* positional encoding can be learnable (GPT)
* segment embeddings are added in some models (BERT), indicating which segment (e.g. sentence A or B) a token belongs to
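
A small sketch of the addition step, assuming NumPy and the example embeddings from the table; `sinusoidal_pe` is an illustrative helper, not a library function. It also shows that the same token placed at two positions ends up with two different input vectors:

```python
import numpy as np

# Token embeddings from the example table (rows: Transformers, are, awesome).
embeddings = np.array([
    [0.2, 0.4, 0.1, 0.3],
    [0.5, 0.2, 0.7, 0.9],
    [0.6, 0.6, 0.4, 0.2],
])

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    pos = np.arange(num_positions)[:, None]   # (positions, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model / 2)
    angles = pos / base ** (2 * i / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

# Element-wise sum: the shape of the embeddings is unchanged.
encoder_input = embeddings + sinusoidal_pe(*embeddings.shape)
print(encoder_input.round(2))

# The same token at positions 0 and 1 no longer has identical vectors.
repeated = np.stack([embeddings[0], embeddings[0]]) + sinusoidal_pe(2, 4)
print(repeated.round(2))
```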

### Attention (translation example)

* query, dictionary {key, value}
* query -> one-hot encoded vectors, composing a query matrix
* keys -> one-hot encoded vectors, composing a key matrix
* values -> one-hot encoded vectors, composing a value matrix
* rows of keys and values need to be aligned (the row of a key corresponds to the row of its value)
* $Attention (q_{sous}, K, V) = q_{sous}^T \ K^T \ V$, where we query a key matrix and retrieve the value based on the key
    * $h = q_{word}^T \ K^T \ V$
    * retrieving translated word with $\hat w = argmax_i\{hV^T\}$
* can be extended to word embeddings: using an embedding representation rather than one-hot encoding allows dealing with unseen words
* the approach can be further refined by applying softmax to the first part of the search (the query-key scores)
* $Attention (q_{ci-dessous}, K, V) = softmax_r (q_{ci-dessous}^T \ K^T) \ V$
    * retrieving translated word again with $\hat w = argmax_i\{hV^T\}$
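
The lookup can be tried on a toy dictionary. A minimal sketch, assuming NumPy; the three word pairs and the helper `lookup` are made up for illustration, and one-hot keys and values are stored as rows of `K` and `V`:

```python
import numpy as np

# Hypothetical toy dictionary: row r of K (French key) pairs with row r of V (English value).
french = ["sous", "ci-dessous", "chat"]
english = ["under", "below", "cat"]
K = np.eye(len(french))    # one-hot keys, one per row
V = np.eye(len(english))   # one-hot values, aligned with the rows of K

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lookup(q, soft=False):
    scores = q @ K.T                 # similarity of the query to every key
    if soft:
        scores = softmax(scores)     # turn scores into weights that sum to 1
    h = scores @ V                   # weighted combination of the values
    return english[int(np.argmax(h @ V.T))]

q_sous = K[french.index("sous")]     # one-hot query
print(lookup(q_sous))                # hard lookup
print(lookup(q_sous, soft=True))     # softmax-weighted lookup

# A dense, embedding-like query that is closest to the second key.
print(lookup(np.array([0.1, 0.8, 0.3]), soft=True))
```

With one-hot queries the softmax changes nothing about the winner; it starts to matter once queries and keys are dense embeddings and several keys match partially.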

### Self-attention mechanism (translation example)

* core of a language transformer
* each word in a sequence attends to every other word in parallel
* each sequence is transformed into a matrix representation $X$ (one embedding per column), where the sequence length is not fixed
    * query projections with bias $Q = W_qX+b_q1^T$, where $W$ and $b$ are learnable params
    * key projections with bias $K = W_kX+b_k1^T$, with similar learnable params
    * value projections with bias $V = W_vX+b_v1^T$, with similar learnable params
    * positional encodings are usually added to the embeddings before this operation
* $Attention (Q, K, V) = V softmax_c(\frac{K^T \ Q}{\sqrt D})$
    * $H' =  V softmax_c(\frac{K^T \ Q}{\sqrt D})$, where $D$ represents the dimension of the embeddings
    * $H'$ consists of enhanced (contextual) embeddings as columns, one per token of the input sequence
    * additional learnable output projection can be used such as $H = W^{[0]} H' + b^{[0]}1^T$, using additional learnable params
* the contextual embeddings are averaged and passed to a simple neural net
* its first layer output is $z^{[1]} = \frac{1}{T} \sum_{t=1}^T \ h_t$ (average embedding), with the same dimension as the embeddings
* the net outputs logits $z^{[2]}$, whose dimension matches the vocab size
* $z^{[2]} = W^{[2]}z^{[1]}+b^{[2]}$
* predictions are produced with $\hat w_t = argmax_i(z^{[2]})$; the dot product in the previously described approach is replaced by the simple neural net
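
The whole chain above, from projections to the predicted index, fits in a few lines. A sketch assuming NumPy, the column convention of the formulas (one embedding per column of $X$), and randomly initialized stand-ins for the learnable parameters; all sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, vocab_size = 4, 3, 5            # embedding dim, sequence length, vocab size (toy values)
X = rng.normal(size=(D, T))           # one embedding per column

def softmax_cols(S):
    """Softmax applied to each column, so every column sums to 1."""
    E = np.exp(S - S.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def project(W, b, X):
    """W X + b 1^T: the bias vector is broadcast to every position."""
    return W @ X + b[:, None]

# Random stand-ins for the learnable parameters.
W_q, W_k, W_v, W_o = (rng.normal(size=(D, D)) for _ in range(4))
b_q, b_k, b_v, b_o = (rng.normal(size=D) for _ in range(4))
W_2, b_2 = rng.normal(size=(vocab_size, D)), rng.normal(size=vocab_size)

Q, K, V = project(W_q, b_q, X), project(W_k, b_k, X), project(W_v, b_v, X)
A = softmax_cols(K.T @ Q / np.sqrt(D))   # T x T attention weights
H_prime = V @ A                          # D x T contextual embeddings
H = project(W_o, b_o, H_prime)           # optional output projection

z1 = H.mean(axis=1)                      # average embedding, length D
z2 = W_2 @ z1 + b_2                      # logits, length vocab_size
w_hat = int(np.argmax(z2))               # predicted vocabulary index
print(H.shape, z2.shape, w_hat)
```

Changing `T` changes only the number of columns, which is why the sequence length does not need to be fixed.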

### Transformers

* scaled dot-product with multiple heads
    * foundational piece of transformer models
    * the core mechanism computes a dot product between queries and keys, scales the result (to ensure numerical stability), and applies softmax to get attention weights
    * in some cases masking is incorporated to hide future words
    * attention weights are multiplied by the values for the final output
    * self-attention -> Q, K, V all come from the same input, each derived through its own learnable matrix
    * cross-attention -> Q comes from one input (the decoder, target language), K and V come from another (the encoder output, source language)
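
A minimal sketch of scaled dot-product attention with optional masking, assuming NumPy and the common row convention (one token per row). The learnable projections are left out to keep it short, so `X` is used directly as Q, K and V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (T_q, d), K: (T_k, d), V: (T_k, d_v); one token per row."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Position t may only attend to positions <= t.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))       # 4 tokens, 8-dim embeddings

# Self-attention with a mask that hides future words.
out, w = scaled_dot_product_attention(X, X, X, causal=True)
print(w.round(2))                 # upper triangle is zero

# Cross-attention: queries from one sequence, keys and values from a second one.
memory = rng.normal(size=(6, 8))  # second sequence with 6 tokens
cross_out, cross_w = scaled_dot_product_attention(X, memory, memory)
print(cross_out.shape, cross_w.shape)
```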

* multi-head attention
    * instead of computing a single attention output, multihead attention computes multiple outputs in parallel
    * process
        * splitting inputs -> input embeddings are split into smaller vectors
        * parallel attention -> each split vector is processed by its own attention mechanism
        * concatenation -> outputs of all heads are concatenated and passed through a final linear layer to produce output
    * benefits -> each head can focus on different parts of the input seq, capturing diverse dependencies (subject-verb relationship vs adjectives)
* transformer architecture
    * the core idea is to improve efficiency by stacking multiple layers of attention and feed-forward nets together
    * components
        * encoder -> processes the input seq using multihead self-attention, followed by a feed-forward net
            * multi-head attention computes attention over the input
            * add & norm adds the attention output to the original input and applies layer normalization
            * feed-forward net is applied to each position independently
        * decoder
            * similar to encoder, but includes masking to prevent leakage of future tokens
            * uses cross-attention layers for translation
    * stacking
        * multiple enc and dec layers can be stacked to capture complex relationships
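
The split / parallel attention / concatenate steps and one encoder layer can be sketched as follows, assuming NumPy and toy sizes. Two choices here are assumptions rather than something stated in the notes: inputs are projected first and then split per head, and a second add & norm follows the feed-forward net (the layout of the original Transformer). Learnable scale and shift of layer normalization are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads, d_ff = 5, 8, 2, 16    # toy sizes (assumptions)
d_head = d_model // n_heads

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalizes each position's vector to zero mean and unit variance.
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    def split(m):                          # (T, d_model) -> (n_heads, T, d_head)
        return m.reshape(T, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # one T x T map per head
    heads = weights @ V                                            # (n_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)          # concatenate the heads
    return concat @ W_o                                            # final linear layer

def encoder_layer(x, p):
    attn = multi_head_attention(x, p["W_q"], p["W_k"], p["W_v"], p["W_o"])
    x = layer_norm(x + attn)                                       # add & norm
    ffn = np.maximum(0, x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]  # applied per position
    return layer_norm(x + ffn)                                     # add & norm

def init_params():
    p = {name: rng.normal(scale=0.1, size=(d_model, d_model))
         for name in ("W_q", "W_k", "W_v", "W_o")}
    p.update(W_1=rng.normal(scale=0.1, size=(d_model, d_ff)), b_1=np.zeros(d_ff),
             W_2=rng.normal(scale=0.1, size=(d_ff, d_model)), b_2=np.zeros(d_model))
    return p

x = rng.normal(size=(T, d_model))
for params in [init_params() for _ in range(2)]:   # a stack of two encoder layers
    x = encoder_layer(x, params)
print(x.shape)  # (5, 8)
```

Every layer maps a `(T, d_model)` input to a `(T, d_model)` output, which is what makes stacking possible.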

### Transformers for classification (encoder)

* transformers for text classification
    * transformers can leverage word-ordering and contextual information in the classification task
    * attention mechanism might be helpful here :)

* creating the text pipe
    * iterators
    * data split
    * tokenization to split sentences into words and to build vocabulary
    * padding to standardize input size
    * data loader objects for train, valid, test; labels and sequences
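
The pipeline steps can be sketched in plain Python. The dataset, the whitespace tokenizer and the helper names (`tokenize`, `encode`, `collate`, `batches`) are all hypothetical; a real pipeline would typically use a library tokenizer and data loader:

```python
from collections import Counter

# Hypothetical toy dataset of (label, text) pairs.
data = [(1, "transformers are awesome"), (0, "this movie was boring"),
        (1, "awesome movie"), (0, "boring")]
train, valid = data[:3], data[3:]      # simple split; real splits are usually shuffled

def tokenize(text):
    return text.lower().split()        # whitespace tokenizer (assumption)

# Build the vocabulary from the training split; reserve ids for padding and unknown words.
counts = Counter(tok for _, text in train for tok in tokenize(text))
vocab = {"<pad>": 0, "<unk>": 1}
for tok, _ in counts.most_common():
    vocab[tok] = len(vocab)

def encode(text):
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

def collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    labels = [label for label, _ in batch]
    seqs = [encode(text) for _, text in batch]
    max_len = max(len(s) for s in seqs)
    padded = [s + [vocab["<pad>"]] * (max_len - len(s)) for s in seqs]
    return labels, padded

def batches(split, batch_size):
    """Iterate over a split in fixed-size batches of (labels, padded sequences)."""
    for start in range(0, len(split), batch_size):
        yield collate(split[start:start + batch_size])

for labels, padded in batches(train, batch_size=2):
    print(labels, padded)
```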

* creating the model
    * embedding layers -> converting input indices into embeddings
    * positional encoding -> informs the encoder of the token order, so the model can use both order and semantic meaning
    * transformer encoder -> multiple encoder layers stacked together, each layer consists of self-attention and ffn, outputs contextual embedding
    * classification layer -> output layer 

* model forward method
    * input tensor -> [seq, batch]
    * embedding layer -> input converted into shape [seq, batch, emb_dim]
    * positional encoding -> added
    * transformer encoder -> processes emb+pos into contextual vecs
    * mean pooling -> averaging contextual embeddings into one
    * classifier -> linear layer predicts the labels
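
The forward pass can be sketched in PyTorch, whose `nn.TransformerEncoder` uses the [seq, batch, emb_dim] layout by default. The class name `TextClassifier` and all sizes are assumptions for illustration, and the fixed sinusoidal table could be swapped for a learnable one:

```python
import math
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Hypothetical encoder-only classifier; all sizes are illustrative."""

    def __init__(self, vocab_size, num_classes, emb_dim=32, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Fixed sinusoidal table of shape [max_len, 1, emb_dim], broadcast over the batch.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, emb_dim, 2) * (-math.log(10000.0) / emb_dim))
        pe = torch.zeros(max_len, 1, emb_dim)
        pe[:, 0, 0::2] = torch.sin(pos * div)
        pe[:, 0, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=n_heads, dim_feedforward=64)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, tokens):              # tokens: [seq, batch]
        x = self.embedding(tokens)          # [seq, batch, emb_dim]
        x = x + self.pe[: x.size(0)]        # add positional encoding
        x = self.encoder(x)                 # contextual vectors, same shape
        x = x.mean(dim=0)                   # mean pooling over the sequence: [batch, emb_dim]
        return self.classifier(x)           # logits: [batch, num_classes]

model = TextClassifier(vocab_size=100, num_classes=4)
tokens = torch.randint(0, 100, (10, 3))     # 10 tokens per sequence, batch of 3
print(model(tokens).shape)                  # torch.Size([3, 4])
```

Note that this plain mean also averages over padded positions; passing a padding mask to the encoder and excluding padded positions from the mean is a common refinement.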

* training as usual
    * cross-entropy loss, SGD, learning rate scheduler
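
A minimal training loop with these three pieces, assuming PyTorch. The tiny model and random data are stand-ins so the loop runs on its own; `StepLR` is just one possible scheduler and the hyper-parameter values are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in model and random data; swap in the real classifier and data loaders.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
features = torch.randn(64, 8)
labels = torch.randint(0, 3, (64,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

losses = []
for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)   # expects raw logits and class indices
    loss.backward()
    optimizer.step()
    scheduler.step()                            # once per epoch, after the optimizer step
    losses.append(loss.item())

print(losses[0], losses[-1], scheduler.get_last_lr())
```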

* practical considerations
    * padding -> make sure all sequences in a batch have the same length
    * positional encoding -> essential to understand word order
    * mean-pooling -> condenses the embeddings into a single vector
    * hyper-params -> number of encoding layers, embedding dimension, number of attention heads.