#### READ/WATCH THIS LATER:

<ul>
  <li><a href="https://e2eml.school/transformers">✅ Transformers from Scratch - Brandon Rohrer</a></li>
  <li><a href='https://youtu.be/rPFkX5fJdRY?si=TjTc9X4ltfCZmIhP'>✅ Transformers: Zero to Hero - CodeEmporium</a></li>
</ul>

## Introduction 🥸

- Transformer models have revolutionized how we solve machine learning problems that involve sequential data.

- They have advanced the SOTA by a significant margin compared to the previous leaders, RNN-based models such as LSTMs and GRUs.

- *One of the primary reasons that the Transformer model is so performant is that it has access to the whole sequence of items (e.g. sequence of tokens), as opposed to RNN-based models, which look at one item at a time.* 

## Transformer Architecture 🤖

$\rightarrow$ Original Paper: [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762)

#### $\rightarrow$ Key Points on Transformers

- *A Transformer is a type of Seq2Seq model.*
- *Transformer models can work with both image & text data.*
- *The Transformer model takes in a sequence of inputs and maps it to a sequence of outputs.*
- *Just like a Seq2Seq model, the Transformer consists of an encoder and a decoder.*

#### $\rightarrow$ Overview of how Transformer Model works

Let’s understand how the Transformer model works in the context of a language translation task:

- *The encoder takes in a sequence of source language tokens and produces a sequence of interim outputs.* 

- *Then the decoder takes in a sequence of target language tokens and predicts the next token for each time step (the teacher forcing technique).* 

    - Teacher Forcing Technique: During training, the decoder is given the encoder's outputs and the ground-truth previous target token (rather than the token it predicted itself), and its objective is to predict the next token.<br></br>

- *Both the encoder and the decoder use attention mechanisms to improve performance.* 

    - For example, the decoder uses attention to inspect all the past encoder states and previous decoder inputs.<br></br>

- *The attention mechanism is conceptually similar to Bahdanau attention.* [ [🔗Link to my Notes on this.](https://github.com/avr2002/NLP-with-Tensorflow/blob/main/Ch_09_Seq-to-Seq%20Learning/01_Seq-to-Seq%20Learning-NMT.ipynb) ]
    
<div align='center'>
    <img src='images/enc_dec.png'/>
</div>
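
The teacher forcing setup described above can be sketched as a simple shift of the target sequence. This is only an illustration: the `<s>` and `</s>` start/end markers and the variable names are assumptions made for this sketch, not something fixed by the Transformer itself.

```python
# Hypothetical target sentence with assumed start/end markers.
target_tokens = ["<s>", "les", "chiens", "sont", "super", "</s>"]

# With teacher forcing, the decoder input is the target shifted right by one:
# at each step the decoder sees the true previous token, not its own guess.
decoder_inputs = target_tokens[:-1]
decoder_labels = target_tokens[1:]

for step, (seen, expected) in enumerate(zip(decoder_inputs, decoder_labels)):
    print(f"step {step}: input={seen!r:10} -> predict {expected!r}")
```

Each decoder input lines up with the token that follows it in the target, which is what the decoder is trained to predict at that time step.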

## Encoder & Decoder ☯️

### Overview: Encoder, Decoder, Self-Attention Layer, FCC-Layer

#### $\rightarrow$ Understanding basic building blocks of a Transformer layer:

1. The encoder and decoder have almost the same architecture, with a few differences. Both are designed to consume a whole sequence of i/p items at a time, but their goals during the task differ:

    1. *The encoder produces a latent representation from its inputs, whereas*
    
    2. *The decoder produces a target o/p from its inputs and the encoder's outputs.*

2. To perform these computations, these inputs are propagated through several stacked layers. 
    - *Each layer within these models takes in a sequence of elements and outputs another sequence of elements.* 
    
    - *Each layer is also made from several sub-layers that encapsulate different computations performed on a sequence of input tokens to produce a sequence of outputs.*
    
3. **A layer found in the Transformer mainly comprises the following two sub-layers:**
    - A **Self-Attention Layer**
    - *A Fully Connected Layer $\text{(FCC-L)}$*<br></br>
    
    
<div align='center'>
    <h5>Diff. b/w Self-Attention & FCC Layer</h5>
    <img src='images/self_attention_and_fcc_layers.png'/>
</div>

#### $\rightarrow$ Self-Attention Layer

4. The self-attention layer produces its o/p using matrix multiplications & activation functions, similar to a FCC-Layer. It takes in a seq. of i/ps and produces a seq. of o/ps.

    - *However,* **a special characteristic of the self-attention layer** *is that, when producing an output at each time step, it has access to all the other inputs in that sequence.* 

    - *This makes learning and remembering long sequences of inputs much easier for this layer, since every position is directly connected to every other position.*

        - *For comparison, RNNs struggle to remember long sequences of inputs as they need to go through each input sequentially.*<br></br> 

    - *Additionally, by design, the self-attention layer can select and combine different inputs at each time step based on the task it’s solving.* 

**This makes Transformers very powerful in sequential learning tasks.** 

* **

##### $\rightarrow$ Let's understand this:
>*Additionally, by design, the self-attention layer can select and combine different inputs at each time step based on the task it’s solving.*

$\textit{Why is it important to selectively combine different input elements this way?}$

- In an NLP context, the self-attention layer enables the model to peek at other words while processing a certain word. 

- This means that while the encoder is processing the word $\textit{it}$ in the sentence $\textit{I kicked the ball and it disappeared}$, the model can attend to the word $\textit{ball}$ and understand the context of the word $\textit{it}$. 

- **By doing this, the Transformer can learn dependencies and disambiguate words, which leads to better language understanding.**


<div align='center'>
    <a href='https://amzn.eu/d/7bbPm74'><i>Intuitive explanation of self-attention by the author</i></a>
</div>
<div align='center'>
    <img src='images/self_attention_exp.png'/>
</div>

* **

#### $\rightarrow$ The Fully Connected Layer $\textit{(FCC-L)}$

- The self-attention layer is followed by a **FCC-Layer**, which has all the i/p nodes connected to all the o/p nodes, (optionally) followed by a non-linear activation function.

-  It takes the o/p elements produced by the self-attention sub-layer and produces a hidden representation for each o/p element. 

- Unlike the self-attention layer, the fully connected layer treats individual sequence items independently, performing computations on them in an element-wise fashion.

- They introduce non-linear transformations while making the model deeper, thus allowing the model to perform better.
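
To make the "treats individual sequence items independently" point concrete, here is a minimal NumPy sketch. The sizes, the random weights, and the choice of ReLU as the activation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_in, d_out = 4, 8, 16            # illustrative sizes, not from the source
x = rng.normal(size=(T, d_in))       # one sequence of T attended token vectors
W = rng.normal(size=(d_in, d_out))   # weights shared across all positions
b = np.zeros(d_out)

def fully_connected(tokens):
    """Dense layer followed by ReLU, applied along the last axis."""
    return np.maximum(tokens @ W + b, 0.0)

# Apply the layer to the whole sequence at once...
whole_sequence = fully_connected(x)

# ...and to each token on its own. The results match because no token
# ever looks at another token inside this layer.
one_by_one = np.stack([fully_connected(x[t]) for t in range(T)])

print(np.allclose(whole_sequence, one_by_one))
```

The same weights are reused at every position, so mixing information across tokens is left entirely to the self-attention sub-layer.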

### $\rightarrow$ a bit more on Encoder & Decoder

Breakdown of Encoder's & Decoder's sub-layers 

$\textit{Before diving in, let's establish some basics:}$

- *The encoder takes in an i/p seq. and the decoder takes in an i/p seq. as well (a diff. sequence to the encoder i/p). Then the decoder produces an o/p seq.*

- Let's call a single item in these sequences a $\textit{token}$.

* **

$\large{\textit{Encoder:}}$

The encoder consists of a stack of layers, where each layer consists of $2$ sub-layers:
- A $\textit{Self-Attention Layer}$ :
    - *Generates a latent representation for each encoder i/p token in the sequence.* 
    
    - *For each i/p token, this layer looks at the whole seq. and selects other tokens in the seq. that enrich the semantics of the generated hidden o/p for that token (i.e., $\textit{‘attended’ representation}$).*

- A $\textit{FCC-Layer}$ :
    - Generates an element-wise deeper hidden representation of the **attended representation of the encoder**.
    
* **

$\large{\textit{Decoder:}}$

The decoder layer consists of $3$ sub-layers:
- $\textit{A Masked Self-Attention Layer}$ :
    - *Each decoder i/p token looks only at the tokens to its left (i.e., earlier positions in the sequence).*
    
    - *The decoder needs to mask words to the right to prevent the model from seeing words in the future.* 
    
    - *Having access to successive words during prediction can make the prediction task trivial for the decoder.* 

- $\textit{An Attention Layer}$ :
    - For each i/p token in the decoder, it looks at both the $\textit{encoder's outputs}$ and the $\textit{decoder’s masked attended output}$ to generate a semantically rich hidden output.
    
    - *Since this layer is not only focused on decoder inputs, we’ll call this an* $\textit{attention layer.}$

- $\textit{A FCC-Layer}$ :
    - Generates an element-wise deeper hidden representation of the *attended representation of the decoder*.
    
* **
$\textit{English to French Translation :}$
$$\textit{dogs are great} \rightarrow \textit{les chiens sont super}$$

<div align='center'>
    <img src='images/transformer_eng_to_french.png' title='Source: NLP with TensorFlow by Thushan Ganegedara'/>
</div>
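
The masking used by the decoder's masked self-attention layer can be sketched with a lower-triangular mask. This is a rough illustration under a few assumptions: the scores are random stand-ins rather than real query-key products, and each position is allowed to attend to itself as well as to the positions on its left.

```python
import numpy as np

T = 4  # illustrative sequence length

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = np.tril(np.ones((T, T), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))  # stand-in for raw query-key scores

# Blocked positions get -inf so that softmax assigns them zero weight.
masked_scores = np.where(causal_mask, scores, -np.inf)

# Row-wise softmax (subtracting the row max for numerical stability).
shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

Every entry above the diagonal ends up as zero, so no position can draw on tokens to its right, while each row still sums to one.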

## *Mechanics of the* $\textit{Self-Attention Layer}$ 🔧 

### Computing the output of the $\textit{self-attention layer}$

$\rightarrow$ Definition: $\textit{Query (Q), Key (K), and Value (V)}$

1. *To understand the computations of the self-attention technique, there are 3 key concepts:* $\textit{Query (Q), Key (K), and Value (V)}$.

2. The $\textit{query}$ and the $\textit{key}$ are used to generate an $\textit{affinity matrix}$. 

3. *For the decoder's attention layer, the affinity matrix’s position $i,j$ represents how similar the decoder input $i$ $(query, Q)$ is to the encoder state $j$ $(key, K)$.* 

4. *Then, we create a weighted average of encoder states $(value, V)$ for each position, where the weights are given by the $\textit{affinity matrix}$.*

* **

<div align='center'>
    <img src='images/example_q_k_v.png' title='Example as given by the author: NLP with TensorFlow by Thushan Ganegedara'/>
</div>

* **

- To compute the $\textit{query, key, and value}$, we use [linear projection](https://stackoverflow.com/questions/37889914/what-is-a-projection-layer-in-the-context-of-neural-networks) of the actual i/ps provided using weight matrices. The 3 weight matrices are:
    - Query weights matrix $(W_q)$
    - Key weights matrix $(W_k)$
    - Value weights matrix $(W_v)$<br></br>

- Each weight matrix produces one o/p for a given token (at position $i$) in a given i/p seq., by multiplying the token's input vector with that weight matrix. This gives 3 o/ps per token in total:

$$Q_i = W_q q_i, K_i = W_k k_i \text{ and } V_i = W_v v_i$$

$Q, K, \text{ and } V$ are $[B, T, d]$ *sized tensors*, where $B = \text{batch size}$, $T = \text{number of time-steps}$, and $d$ is a hyperparameter that sets the dimensionality of the latent representation (written $d_k$ in the formulas below).
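
As a rough sketch of these projections, the snippet below builds $Q$, $K$, and $V$ for the self-attention case, where the query, key, and value inputs are all the same token sequence. The sizes, the input width `d_model`, and the random weights are assumptions for illustration. The code uses the row-vector convention (`x @ W`), which is the transposed form of $W q_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

B, T, d_model, d = 2, 5, 8, 4          # illustrative sizes; d_model is an assumed input width
x = rng.normal(size=(B, T, d_model))   # token representations fed to the layer

# One weight matrix per role; in a real model these are learned.
W_q = rng.normal(size=(d_model, d))
W_k = rng.normal(size=(d_model, d))
W_v = rng.normal(size=(d_model, d))

# Row-vector convention: each token vector is multiplied from the right.
Q = x @ W_q
K = x @ W_k
V = x @ W_v

print(Q.shape, K.shape, V.shape)
```

All three results have shape `(B, T, d)`, matching the $[B, T, d]$ size given above.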

- **These are then used to compute the affinity matrix:**

<div align='center'>
    <img src='images/affinity_matrix.png' title='Affinity Matrix: NLP with TensorFlow by Thushan Ganegedara'/>
</div>


- The $\textit{affinity matrix P}$ is computed as: $$P = \text{softmax} \left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right)$$

- *Then the final $\textit{attended output of the self-attention layer}$ is computed as follows:* $$h = P \cdot V = \text{softmax} \left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right)V$$


Here, $\textit{Q = queries tensor}$, $\textit{K = keys tensor}$, and $\textit{V = values tensor}$. 

This is what makes Transformer models so powerful; unlike LSTM models, which must step through a sequence one token at a time, Transformer models process all tokens in a sequence together through a few matrix multiplications, making these models highly parallelizable.
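
The two formulas above translate almost line for line into NumPy. The following is a minimal sketch, assuming random $[B, T, d]$ tensors in place of real projected queries, keys, and values; the function names are chosen for this example only.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (h, P) for tensors shaped [B, T, d]."""
    d_k = Q.shape[-1]
    # Swap the last two axes of K so each query is compared with every key.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # [B, T, T]
    P = softmax(scores, axis=-1)                       # affinity matrix
    h = P @ V                                          # [B, T, d]
    return h, P

rng = np.random.default_rng(0)
B, T, d = 2, 5, 4  # illustrative sizes
Q, K, V = (rng.normal(size=(B, T, d)) for _ in range(3))

h, P = scaled_dot_product_attention(Q, K, V)
print(h.shape, P.shape)
```

Each row of `P` sums to one, so every row of `h` is a weighted average of the rows of `V`, and the whole sequence is handled by two batched matrix multiplications rather than a step-by-step loop.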

### Embedding layers in the Transformer