# Transformers

## NLP Transformer

Using multi-head attention from [Attention is All You Need](https://arxiv.org/abs/1706.03762), we can build an attention-only architecture called the **transformer**. The key benefit of the transformer architecture is scalability via parallelism: the matrix operations in each head run independently of the other heads.
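
To make the parallelism concrete, here is a minimal NumPy sketch of multi-head attention. The function name, the weight layout `(h, d_model, d_head)`, and the toy sizes are assumptions chosen for illustration, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o):
    """x_q: (T_q, d_model), x_kv: (T_kv, d_model).
    w_q, w_k, w_v: (h, d_model, d_head); w_o: (h * d_head, d_model)."""
    # One projection per head; einsum batches the h independent matmuls.
    q = np.einsum("td,hdk->htk", x_q, w_q)   # (h, T_q, d_head)
    k = np.einsum("td,hdk->htk", x_kv, w_k)  # (h, T_kv, d_head)
    v = np.einsum("td,hdk->htk", x_kv, w_v)  # (h, T_kv, d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (h, T_q, T_kv)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                      # (h, T_q, d_head)
    # Concatenate the heads along the feature axis, then mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], -1)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 4          # toy sizes, assumed for the example
d_head = d_model // h
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(h * d_head, d_model))

out, weights = multi_head_attention(x, x, w_q, w_k, w_v, w_o)
print(out.shape, weights.shape)
```

No head reads another head's queries, keys, or values, so the per-head work can be dispatched as one batched matrix multiplication.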

![Transformer](assets/transformer.png)

The transformer works in two phases: encoding and decoding.

### Encoding

We obtain the embeddings $X$ for an input sequence, using the same example from Andrew Ng's lecture, i.e. "Jane visite l'Afrique en septembre".

$X$ is projected by learned weight matrices into the queries $Q$, keys $K$, and values $V$, which are used to compute the output of each head. The head outputs are concatenated, summed with the residual (the block's input), and normalized. The normalized output is then fed into a non-linear unit such as a simple feed-forward neural network. The output of that network is summed with its residual and normalized again. The encoder block is repeated multiple times. Each time, the output has the same shape as the block's input, which is what allows the blocks to be stacked.
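
The encoder block described above can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the parameter names (`w_q`, `w_1`, ...), the toy sizes, the ReLU non-linearity, and the omission of learnable layer-norm scale and shift are choices made here for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (learnable scale/shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, p):
    # Per-head projections: (T, d_model) -> (h, T, d_head).
    q, k, v = (np.einsum("td,hdk->htk", x, p[n]) for n in ("w_q", "w_k", "w_v"))
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    heads = weights @ v
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)  # concatenate heads
    return concat @ p["w_o"]

def encoder_block(x, p):
    x = layer_norm(x + self_attention(x, p))            # attention, add & norm
    hidden = np.maximum(0.0, x @ p["w_1"] + p["b_1"])   # feed-forward with ReLU
    return layer_norm(x + hidden @ p["w_2"] + p["b_2"]) # add & norm again

def init_block(rng, d_model=16, h=4, d_ff=32, scale=0.1):
    d_head = d_model // h
    return {
        "w_q": rng.normal(scale=scale, size=(h, d_model, d_head)),
        "w_k": rng.normal(scale=scale, size=(h, d_model, d_head)),
        "w_v": rng.normal(scale=scale, size=(h, d_model, d_head)),
        "w_o": rng.normal(scale=scale, size=(h * d_head, d_model)),
        "w_1": rng.normal(scale=scale, size=(d_model, d_ff)),
        "b_1": np.zeros(d_ff),
        "w_2": rng.normal(scale=scale, size=(d_ff, d_model)),
        "b_2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                 # 5 tokens, d_model = 16
blocks = [init_block(rng) for _ in range(3)] # the block repeated 3 times
out = x
for p in blocks:
    out = encoder_block(out, p)
print(x.shape, out.shape)                    # the shape is preserved
```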

<img src="assets/encoder.png" alt="Encoder" width="400"/>


### Decoding

First, the `SOS` (start of sentence) token is fed into the decoder to generate the query matrix $Q$; the decoder does not supply $K$ and $V$ for this attention step. The decoded sequence is always the "query", while the keys and values come from the encoded sequence. (Each decoder block also applies masked self-attention over the tokens decoded so far before this step.) The decoding attention blocks run `N` times, just like the encoding attention blocks.
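
The split between "query from the decoder" and "keys and values from the encoder" can be sketched with a single attention head in NumPy. The function name, weight shapes, and random inputs are assumptions for illustration only.

```python
import numpy as np

def cross_attention(dec_states, enc_outputs, w_q, w_k, w_v):
    """dec_states: (T_dec, d), enc_outputs: (T_enc, d)."""
    q = dec_states @ w_q     # queries come from the decoded sequence
    k = enc_outputs @ w_k    # keys come from the encoded sequence
    v = enc_outputs @ w_v    # values come from the encoded sequence
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (T_dec, T_enc)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
enc_outputs = rng.normal(size=(5, d))  # 5 encoded source tokens
dec_states = rng.normal(size=(1, d))   # only SOS has been decoded so far
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

context, weights = cross_attention(dec_states, enc_outputs, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

Each row of `weights` says how much one decoded position attends to each encoded source token.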

<img src="assets/decoder.png" alt="Decoder" width="600"/>

The decoder output, passed through a final linear layer and softmax, should predict the next word in the sequence, ideally `Jane`. Then `SOS` and `Jane` are fed into the decoder blocks again, and the same logic repeats until the `EOS` (end of sentence) token is produced.
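
The generate-and-feed-back loop can be sketched in plain Python. Note that `toy_next_token_logits` is a hypothetical stand-in for the whole decoder stack: it looks up a canned translation so the loop is runnable, and it does not model attention at all. The vocabulary and the greedy (arg-max) choice are also assumptions.

```python
VOCAB = ["SOS", "EOS", "Jane", "visits", "Africa", "in", "September"]

def toy_next_token_logits(source_tokens, decoded_tokens):
    # Hypothetical stand-in for decoder blocks + output layer (canned answer).
    target = ["Jane", "visits", "Africa", "in", "September", "EOS"]
    step = len(decoded_tokens) - 1            # words generated so far
    next_word = target[min(step, len(target) - 1)]
    return [1.0 if word == next_word else 0.0 for word in VOCAB]

def greedy_decode(source_tokens, next_token_logits, max_len=20):
    decoded = ["SOS"]                         # decoding starts from SOS
    while len(decoded) < max_len:
        logits = next_token_logits(source_tokens, decoded)
        best = max(range(len(VOCAB)), key=logits.__getitem__)
        if VOCAB[best] == "EOS":              # stop once EOS is predicted
            break
        decoded.append(VOCAB[best])           # feed the new word back in
    return decoded[1:]                        # drop the SOS marker

source = "Jane visite l'Afrique en septembre".split()
translation = greedy_decode(source, toy_next_token_logits)
print(" ".join(translation))
```

In a real model the stand-in would be replaced by a forward pass through the decoder, and beam search is a common alternative to the greedy choice used here.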

## PyTorch Example

Source: https://pytorch.org/tutorials/beginner/transformer_tutorial.html

## Vision Transformer

Taking the idea from NLP, [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929) applies the transformer to computer vision tasks. Whereas the standard transformer receives a 1D sequence of token embeddings, the vision transformer receives a sequence of flattened 2D patches:

$$
x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}
$$

The original image $x \in \mathbb{R}^{H \times W \times C}$ has height $H$, width $W$, and $C$ channels. It is split into square patches of shape `(P, P)`, where `P` is the number of pixels along each side of a patch. $P^2$ is the total number of pixels in each patch and $C$ is the number of color channels, which is 3 or 4 in most cases. `N` is the number of patches in the image, $N = HW / P^2$. Essentially, this reshapes the image into a word-like sequence.
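
As a sketch of this reshaping, the NumPy snippet below splits an image into non-overlapping patches and flattens each one. The function name `patchify` and the 224x224x3 image with `P = 16` are assumptions for illustration.

```python
import numpy as np

def patchify(image, patch_size):
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches."""
    h, w, c = image.shape
    p = patch_size
    if h % p != 0 or w % p != 0:
        raise ValueError("H and W must be divisible by the patch size")
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C)
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    # Flatten each (P, P, C) patch into one row: N = (H/P) * (W/P) rows.
    return patches.reshape(-1, p * p * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
x_p = patchify(image, patch_size=16)
print(x_p.shape)  # N = 224 * 224 / 16**2 = 196 patches, each of length 16 * 16 * 3 = 768
```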

Each flattened patch in $x_p$ is linearly projected by $E$ into an embedding space of dimension `D`. Position embeddings $E_{pos}$ are then added to the patch embeddings to retain positional information.

$$
E \in \mathbb{R}^{(P^2 \cdot C) \times D}
$$

$$
E_{pos} \in  \mathbb{R}^{(N+1) \times D}
$$

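$$
z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \ldots;\, x_p^N E] + E_{pos}
$$

Here $x_p^i$ is the $i$-th flattened patch and $x_{class}$ is a learnable class token prepended to the sequence, which is why $E_{pos}$ has $N + 1$ rows.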
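
A minimal NumPy sketch of the embedding step follows. As in the ViT paper, a learnable class token is prepended to the projected patches, which accounts for the $N + 1$ rows of $E_{pos}$. The random initial values, the zero class token, and `D = 64` are assumptions; in a trained model all three parameters are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, C, D = 196, 16, 3, 64                 # toy sizes, assumed for the example

x_p = rng.normal(size=(N, P * P * C))       # flattened patches
E = rng.normal(scale=0.02, size=(P * P * C, D))   # patch projection
E_pos = rng.normal(scale=0.02, size=(N + 1, D))   # position embeddings
x_class = np.zeros((1, D))                  # class token (zeros as a placeholder)

def embed_patches(x_p, E, E_pos, x_class):
    # Project every patch to D dimensions, prepend the class token,
    # then add one position embedding per token.
    tokens = np.concatenate([x_class, x_p @ E], axis=0)  # (N + 1, D)
    return tokens + E_pos

z_0 = embed_patches(x_p, E, E_pos, x_class)
print(z_0.shape)
```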

Multiple layers of multi-head self-attention (MSA) and multi-layer perceptron (MLP, basically a feed-forward network like above) are applied to $z$. Let `L` denote the number of layers.

$$
z_l^{\prime} = \text{MSA}\,(\text{LayerNorm}\,(z_{l-1})) + z_{l-1} \;\; l = 1, ..., L
$$

$$
z_l = \text{MLP}\,(\text{LayerNorm}\,(z_l^{\prime})) + z_l^{\prime} \;\; l = 1, ..., L
$$

The final output $y$ comes from the last layer: it is the layer-normalized state of the first token (index 0, the class token), $z_L^0$.

$$
y = \text{LayerNorm}\,(z^0_L)
$$
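
The layer equations and the final read-out can be sketched in NumPy as below. This is a sketch under assumptions: the parameter names, toy sizes, the tanh approximation of GELU, and the omission of learnable layer-norm parameters and biases are simplifications made here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Learnable scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def msa(z, p):
    # Multi-head self-attention: queries, keys, and values all come from z.
    q, k, v = (np.einsum("td,hdk->htk", z, p[n]) for n in ("w_q", "w_k", "w_v"))
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    heads = weights @ v
    return heads.transpose(1, 0, 2).reshape(z.shape[0], -1) @ p["w_o"]

def mlp(z, p):
    return gelu(z @ p["w_1"]) @ p["w_2"]

def vit_encoder(z_0, layers):
    z = z_0
    for p in layers:                                  # l = 1, ..., L
        z_prime = msa(layer_norm(z), p) + z           # z'_l
        z = mlp(layer_norm(z_prime), p) + z_prime     # z_l
    return layer_norm(z[0])                           # y = LayerNorm(z_L^0)

def init_layer(rng, d=32, h=4, d_mlp=64, scale=0.1):
    d_head = d // h
    shapes = {"w_q": (h, d, d_head), "w_k": (h, d, d_head), "w_v": (h, d, d_head),
              "w_o": (h * d_head, d), "w_1": (d, d_mlp), "w_2": (d_mlp, d)}
    return {name: rng.normal(scale=scale, size=s) for name, s in shapes.items()}

rng = np.random.default_rng(0)
z_0 = rng.normal(size=(10, 32))                       # 9 patches + 1 class token
layers = [init_layer(rng) for _ in range(2)]          # L = 2
y = vit_encoder(z_0, layers)
print(y.shape)
```

For classification, `y` would typically be passed to a small classification head.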

![Vision Transformer](assets/vision-transformer.png)

Source: https://towardsdatascience.com/implementing-visualttransformer-in-pytorch-184f9f16f632