<a href="https://colab.research.google.com/github/chaitragopalappa/MIE590-690D/blob/main/8_Attention_Transformers_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN, Attention, Transformers, LLM

Sources:
* Chapter 14, Probabilistic Machine Learning: An Introduction by Kevin Murphy  
* Dive into Deep Learning, by Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola https://d2l.ai/index.html
* HuggingFace LLM course https://huggingface.co/learn/llm-course/chapter0/1

Note: In some instances I have directly copied text from sources. This notebook is used for class lecture only.

---
---

**Seq2Seq problem**: Learn functions of the form $f_θ : \mathbb{R}^{TD} → \mathbb{R}^{T'C}$, i.e., a map from one sequence of length $T$ to another of length $T'$.

**Seq2Seq models** typically rely on an **encoder-decoder mechanism**:
* *Encoder: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.*
* *Decoder: The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.*

---

**Common architectures of encoder-decoder mechanism**
* CNN with transposed convolution (U-Net) (we saw this in the lecture on CNN/DL for image data)
* RNN as encoder and decoder
* RNN with attention
* Transformers (multi-head self-attention) (Attention is all you need, by Vaswani et al., 2017) (Google)

---
---



**Seq2Seq RNN as encoder and decoder**

  * aligned case (we saw this in RNN lecture): $T' = T$, i.e., input and output sequences have the same length
  * unaligned case: $T' \neq T$, i.e., input and output sequences have different lengths.

**Unaligned case**  
Encoder: Suppose the input sequence is $x_1, \ldots, x_T$, where $x_t$ is the $t^{th}$ token. At time step $t$, the RNN transforms the input feature vector $\mathbf{x_t}$ for $x_t$ and the hidden state $h_{t-1}$ from the previous time step into the current hidden state $h_t$.
We have $$h_t=f(\mathbf{x_t},h_{t-1})$$

In general, the encoder transforms the hidden states at all time steps into a context variable through a customized function
$$c=q(h_1,...,h_T)$$
The context variable could be just the hidden state corresponding to the encoder RNN’s representation after processing the final token of the input sequence (last state of an RNN or average pooling over a biRNN).
We then generate the output sequence using an RNN as the decoder:
$y_{1:T'} = f_d(c)$
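The encoder recurrence and the context function can be sketched in NumPy as follows. This is a minimal illustration, assuming a vanilla tanh RNN cell for $f$ and randomly initialised weights; the names `W_xh`, `W_hh`, `encoder_step`, and `encode` are hypothetical and not taken from the sources.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4  # input length, input feature size, hidden size (illustrative)

# Hypothetical, randomly initialised parameters of a vanilla tanh RNN encoder.
W_xh = rng.normal(scale=0.5, size=(D, H))
W_hh = rng.normal(scale=0.5, size=(H, H))
b_h = np.zeros(H)

def encoder_step(x_t, h_prev):
    """One recurrence step h_t = f(x_t, h_{t-1})."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def encode(X):
    """Run the encoder over x_1..x_T and return all hidden states, shape (T, H)."""
    h = np.zeros(H)
    states = []
    for x_t in X:
        h = encoder_step(x_t, h)
        states.append(h)
    return np.stack(states)

X = rng.normal(size=(T, D))
hidden_states = encode(X)

# Two common choices for the context function q(h_1, ..., h_T):
c_last = hidden_states[-1]           # final hidden state
c_mean = hidden_states.mean(axis=0)  # average pooling over time
```

Either `c_last` or `c_mean` is a fixed-length vector regardless of $T$, which is what allows the decoder to produce an output sequence of a different length $T'$.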


<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.7.png?raw=true" height="200" width ="500">

*Embedded from TextBook; Figure 15.7: Encoder-decoder RNN architecture for mapping sequence $x_{1:T}$ to sequence $y_{1:T ′}$*.

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.8_A.png?raw=true" height="400" width ="500">

*Figure 15.8: Illustration of a seq2seq model for translating English to French. The - character represents the end of a sentence.*

---
---

**ATTENTION**

**Overview**:
In all of the neural networks we have considered so far, the hidden activations are a linear combination of the input activations, followed by a nonlinearity: $Z = φ(XW)$

However, we can imagine a more flexible model in which the weights depend on the inputs, i.e.,
$Z = φ(XW(X))$. This kind of multiplicative interaction is called attention. More generally,
we can write $Z = φ(VW(Q, K))$, where
* $Q$ are a set of queries (derived from X) used to describe what each input is “looking for” (the figure below depicts a single token $x$; in matrix form we take the full $X$),
* $K$ are a set of keys (derived from X) used to describe what each input vector contains, and
* $V$ are a set of values (derived from X) used to describe how each input should be transmitted to the output.

**General equation**:
$$Attn(q, (k_1, v_1), \ldots , (k_m, v_m)) = Attn(q, (k_{1:m}, v_{1:m})) =
\sum_{i=1}^m \alpha_i(q, k_{1:m})v_i $$
where
* $\alpha_i(q, k_{1:m})$ is the $i$’th attention weight;
*  $0 ≤ \alpha_i(q, k_{1:m}) ≤ 1$ for each $i$
* $\sum_i \alpha_i(q, k_{1:m}) = 1$

The attention weights are defined as:
$$
\alpha_i(q, k_{1:m})
= \text{softmax}_i\big( [\, a(q, k_1), \ldots, a(q, k_m) \,] \big)
= \frac{\exp(a(q, k_i))}{\sum_{j=1}^m \exp(a(q, k_j))}.
$$
* $a(q, k)$ is the attention score; it computes the similarity between query $q$ and key $k$.
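The general equation can be sketched directly in NumPy. This is a minimal illustration, assuming the plain dot product as the score function $a(q, k)$ (the notes leave it generic at this point); the helper names `softmax` and `attention` are illustrative.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(q, keys, values, score_fn):
    """Attn(q, (k_1, v_1), ..., (k_m, v_m)) = sum_i alpha_i(q, k_{1:m}) v_i."""
    scores = np.array([score_fn(q, k) for k in keys])  # a(q, k_i), shape (m,)
    weights = softmax(scores)                          # alpha_i, shape (m,)
    return weights @ values, weights

# Toy data: m = 3 key-value pairs; the dot product is an assumed example score.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
q = np.array([1.0, 0.0])

output, weights = attention(q, keys, values, score_fn=np.dot)
```

The weights are non-negative and sum to 1, so the output is a convex combination of the values; keys that score higher against the query contribute more.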

---

**Bahdanau (additive) attention** https://arxiv.org/pdf/1409.0473

Computes the attention score using a feed-forward network with
a single hidden layer.
  $$a(q,k)=w_v^T \tanh(W_qq+W_kk)$$
* $W_q$, $W_k$, and $w_v$ are learnable parameters
* In this general case, the query $q ∈ \mathbb{R}^q$ and the key $k ∈ \mathbb{R}^k$ may
have different sizes. Then $W_q\in \mathbb{R}^{h\times q}$, $W_k\in \mathbb{R}^{h\times k}$, and $w_v\in \mathbb{R}^{h}$, where $h$ is the number of hidden units.  
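A minimal NumPy sketch of the additive score, assuming randomly initialised parameters (in practice they are learned) and illustrative sizes; the function name `additive_score` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_k, h = 4, 6, 8  # query size, key size, hidden size (illustrative values)

# Hypothetical randomly initialised parameters; in practice these are learned.
W_q = rng.normal(size=(h, d_q))
W_k = rng.normal(size=(h, d_k))
w_v = rng.normal(size=h)

def additive_score(q, k):
    """a(q, k) = w_v^T tanh(W_q q + W_k k); returns a scalar."""
    return w_v @ np.tanh(W_q @ q + W_k @ k)

q = rng.normal(size=d_q)
keys = rng.normal(size=(5, d_k))  # m = 5 keys
scores = np.array([additive_score(q, k) for k in keys])

# Softmax turns the scores into attention weights.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
```

Because $W_q$ and $W_k$ project into a shared hidden space of size $h$, the query and the keys do not need to have the same dimension.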

---
**Luong attention (general)**  
${\displaystyle {\text{Attention}}(Q,K,V)={\text{softmax}}(QWK^{T})V}$
* $W$ is learnable weight matrix
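In matrix form this is a bilinear score followed by a row-wise softmax. The sketch below assumes one query per row of `Q`, one key/value per row of `K` and `V`, and a randomly initialised `W`; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_q, d_k, d_v = 2, 4, 3, 5, 6  # illustrative sizes

Q = rng.normal(size=(n, d_q))    # n queries
K = rng.normal(size=(m, d_k))    # m keys
V = rng.normal(size=(m, d_v))    # m values
W = rng.normal(size=(d_q, d_k))  # hypothetical learnable matrix (random here)

scores = Q @ W @ K.T                                 # (n, m) bilinear scores
scores = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
output = weights @ V                                 # (n, d_v)
```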

---
**Scaled dot-product attention**  
Introduced in [Attention is all you need by Vaswani et al.](https://arxiv.org/pdf/1706.03762). Similar to Luong attention, except that queries and keys both have length $d$, there is no weight matrix between them, and the score is scaled by $1/\sqrt{d}$.

$$Attn(Q,K,V) = softmax\big(\frac{QK^T}{\sqrt{d}} \big) V $$
* $a(q,k)=\frac{q^Tk}{\sqrt{d}}$ is the attention score between query $q$ and key $k$; in matrix form the scores are $\frac{QK^T}{\sqrt{d}}$, with one row per query and one column per key;
* $softmax\big(\frac{QK^T}{\sqrt{d}} \big)$, applied row-wise, gives the attention weights;
* Further, this is **self-attention** when queries, keys, and values all come from $X$, i.e., they are linear projections of the input (note: inputs can be outputs from previous layers). With one token per row of $X$,
  * $Q = XW_q$,
  * $K = XW_k$, and
  * $V = XW_v$.
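A minimal NumPy sketch of scaled dot-product self-attention. It assumes tokens are stored as rows of `X`, so the projections are applied as `X @ W`; the projection matrices are random placeholders and the sizes are illustrative.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention; X has one token per row."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n): row i scores query i against every key
    weights = softmax(scores)      # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d = 6, 3, 4  # sequence length, embedding size, projection size
X = rng.normal(size=(n, d_model))
# Hypothetical random projections; in a trained model these are learned.
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))

Z, A = self_attention(X, W_q, W_k, W_v)
```

`A` is the $n \times n$ attention matrix and `Z` holds one output vector per input token.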

---
[WORKED EXAMPLE](https://medium.com/@saraswatp/understanding-scaled-dot-product-attention-in-transformer-models-5fe02b0f150c)   
Word embeddings and tokens: each word (token) is converted to a vector.

```
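import numpy as np

# Toy 3-dimensional embedding for each word in the vocabulary
embeddings = {
    'the': np.array([0.1, 0.2, 0.3]),
    'cat': np.array([0.4, 0.5, 0.6]),
    'sat': np.array([0.7, 0.8, 0.9]),
    'on': np.array([1.0, 1.1, 1.2]),
    'mat': np.array([1.3, 1.4, 1.5])
}

# Sentence "the cat sat on the mat": look up one embedding per token
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
embedded_tokens = np.array([embeddings[t] for t in tokens])
print(embedded_tokens.shape)  # (6, 3)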
```

In the above example, the word-embedding size is 3 and the sequence length is 6 (the word "the" appears twice).

---

**VISUAL UNDERSTANDING OF ATTENTION**
<img src="https://upload.wikimedia.org/wikipedia/commons/8/81/Attention-qkv.png?raw=true" height="400" width ="800">  
Source: Embedded from Wikipedia: https://en.wikipedia.org/wiki/Attention_(machine_learning)  



[**EXAMPLE CODE**](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/15/attention_torch.ipynb)



**Masked Attention**

Sequences are typically of different lengths. We might want to pad sequences to a fixed length (for efficient minibatching), in which case we should “mask out” the padded locations. This is called masked attention. We can implement this efficiently by setting the attention score for the masked entries to a large negative number, such as $-10^6$, so that the corresponding softmax weights will be effectively 0.
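A minimal sketch of the masking trick, assuming each sequence is padded on the right and its true length is known; the function name `masked_softmax` and the argument `valid_len` are illustrative.

```python
import numpy as np

def masked_softmax(scores, valid_len, fill=-1e6):
    """Softmax over the last axis, ignoring positions >= valid_len in each row.

    scores: (batch, m) attention scores; valid_len: (batch,) true lengths.
    """
    m = scores.shape[-1]
    mask = np.arange(m)[None, :] < valid_len[:, None]  # True for real tokens
    masked = np.where(mask, scores, fill)              # padded scores -> -1e6
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0, 4.0],
                   [1.0, 2.0, 3.0, 4.0]])
valid_len = np.array([2, 4])  # the first sequence has 2 real tokens, then padding
weights = masked_softmax(scores, valid_len)
```

In the first row the two padded positions receive zero weight and the remaining weights renormalise over the two real tokens; the second row is an ordinary softmax.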

---
---




**Seq2Seq: RNN with attention**

Recall the seq2seq model above: it used an RNN decoder of the form $h^d_t = f_d(h^d_{t-1}, y_{t-1}, c)$, where $c$ is a fixed-length context vector, representing the encoding of the input
$x_{1:T}$ . Usually we set $c = h^e_T$, which is the final state of the encoder RNN (or we use a bidirectional RNN with average pooling).

However, for tasks such as machine translation, this can result in poor performance, since the output does not have access to the input words themselves. We can avoid this bottleneck by allowing the output words to directly “look at” the input words. But which inputs should it look at? After all, word order is not always preserved across languages (e.g., German often puts verbs at the end of a sentence), so we need to infer the alignment between source and target. We can solve this problem (in a differentiable way) by using (soft) attention.

In particular, we can **replace the fixed context vector $c$ in the decoder with a dynamic context vector $c_t$ computed using attention** as follows:
$$c_t = \sum_{i=1}^T \alpha_i(h^d_{t-1}, h^e_{1:T})h^e_i$$
where,
* the query is the hidden state of the decoder at the previous step, $h^d_{t−1}$,
* the keys are all the hidden states from the encoder ($h^e_{1:T}$), and
* the values are also the hidden states from the encoder.
(When the RNN has multiple hidden layers, we usually take the top layer from the encoder, as the keys and values, and the top layer of the decoder as the query.)
* This context vector is concatenated with the input vector of the decoder, $y_{t−1}$, and fed into the decoder, along with the previous hidden state $h^d_{t−1}$, to create $h^d_t$ .
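One decoding step of this computation can be sketched as follows. This is a minimal illustration with random stand-ins for the encoder and decoder states; it assumes a scaled dot-product score between the decoder query and the encoder states (an additive score would work equally well), and the function name `context_vector` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 5, 4
enc_states = rng.normal(size=(T, H))  # h^e_1..h^e_T: keys and values
h_dec_prev = rng.normal(size=H)       # h^d_{t-1}: the query
y_prev = rng.normal(size=3)           # embedding of the previous output token

def context_vector(query, enc_states):
    """c_t = sum_i alpha_i(query, h^e_{1:T}) h^e_i with a scaled dot-product score."""
    scores = enc_states @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ enc_states, weights

c_t, alpha = context_vector(h_dec_prev, enc_states)

# The context is concatenated with the decoder input before the RNN update.
decoder_input = np.concatenate([y_prev, c_t])
```

Unlike the fixed context $c$, `c_t` is recomputed at every decoding step, so each output position can weight the encoder states differently.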

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.18.png?raw=true" height="400" width ="600">

*Embedded from TextBook; Figure 15.18: Illustration of seq2seq with attention for English to French translation*

We can train this model in the usual way on sentence pairs, and then use it to perform machine
translation. (See [RNN with attention-based decoder EXAMPLE CODE](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/15/nmt_attention_torch.ipynb).) We can also visualize the attention weights computed at each step of decoding, to get an idea of which parts of the input the model thinks are most relevant for generating the corresponding output. Some examples are shown in Figure 15.19.

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.19_A.png?raw=true" height="400" width ="500">

*Embedded from TextBook; Figure 15.19: Illustration of the attention heatmaps (attention weights ~softmax of attention scores) generated while translating a sentence from Spanish to English.*

* [EXAMPLE Code](https://www.tensorflow.org/text/tutorials/nmt_with_attention)

---
---
**Soft vs. hard attention**  
If we force the attention heatmap to be sparse, so that each output can only attend to one input location instead of a weighted combination of all of them, the method is called hard attention. The figure below compares these two approaches for an image captioning problem. Unfortunately, hard
attention results in a nondifferentiable training objective, and requires methods such as reinforcement learning to fit the model.
<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.22_A.png?raw=true" height="400" width ="500">

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.22_B.png?raw=true" height="400" width ="500">

*Embedded from TextBook; Figure 15.22: Image captioning using attention. (a) Soft attention. Generates “a woman is throwing a frisbee in a park”. (b) Hard attention. Generates “a man and a woman playing frisbee in a field”*
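The difference between the two can be seen on a toy example. The sketch below uses made-up weights and values; sampling the attended location from the weights is one assumed way to realise hard attention (taking the argmax would be a deterministic alternative).

```python
import numpy as np

values = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # one value per input location
weights = np.array([0.2, 0.5, 0.3])                      # attention weights, sum to 1

# Soft attention: weighted combination of all locations (differentiable in the weights).
soft_output = weights @ values

# Hard attention: attend to a single location, here sampled from the weights.
rng = np.random.default_rng(0)
idx = rng.choice(len(weights), p=weights)
one_hot = np.eye(len(weights))[idx]
hard_output = one_hot @ values  # equals values[idx]
```

The discrete choice of `idx` is what makes the hard-attention objective nondifferentiable.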

---
---

**Seq2vec with attention (text classification)**  
We can also use attention with sequence classifiers. One study applied an RNN classifier
to the problem of predicting if a patient will die or not. The input is a set of electronic health
records, which is a time series containing structured data, as well as unstructured text (clinical
notes). Attention is useful for identifying “relevant” parts of the input.

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.20.png?raw=true" height="500" width ="900">

*Embedded from TextBook; Figure 15.20: Example of an electronic health record. In this example, 24h after admission to the hospital, the RNN classifier predicts the risk of death as 19.9%; the patient ultimately died 10 days after admission. The “relevant” keywords from the input clinical notes are shown in red, as identified by an attention mechanism.*

---
---


**TRANSFORMER**

The transformer model is a seq2seq model which uses attention in the encoder as well as the decoder, thus eliminating the need for RNNs.

**Self attention** (see earlier discussions in ATTENTION overview)   
In the context of seq2seq using an RNN with attention, the decoder attended to the input sequence in order to capture contextual embeddings of each input. However, rather than the decoder attending to
the encoder, we can modify the model so the **encoder attends to itself**. This is called self attention.

Given a sequence of input tokens $x_1, \ldots , x_n$, where $x_i ∈ \mathbb{R}^d$, self-attention can generate a sequence of outputs of the same size using
$y_i = Attn(x_i, (x_1, x_1), \ldots , (x_n, x_n))$

where the query is $x_i$, and the keys and values are all the (valid) inputs $x_1, . . . , x_n$. To use this in a decoder, we can set $x_i = y_{i−1}$, and $n = i − 1$, so all the previously generated outputs are available. At training time, all the outputs are already known, so we can evaluate the above function in parallel, overcoming the sequential bottleneck of using RNNs.
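The parallel evaluation at training time is usually done with a causal mask, so that position $i$ cannot see later positions. The sketch below is a minimal illustration that uses the inputs directly as queries, keys, and values (no learned projections) and a scaled dot-product score, both of which are simplifying assumptions.

```python
import numpy as np

def causal_self_attention(X):
    """y_i = Attn(x_i, (x_1, x_1), ..., (x_i, x_i)) for all i at once."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where key index > query index
    scores = np.where(future, -1e6, scores)             # mask out future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y, A = causal_self_attention(X)

# Changing the last token leaves the outputs at earlier positions unchanged.
X_changed = X.copy()
X_changed[3] += 1.0
Y_changed, _ = causal_self_attention(X_changed)
```

All rows of `Y` are computed in one matrix product, yet each row depends only on the current and earlier tokens, which is what an RNN would otherwise enforce sequentially.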

In addition to improved speed, self-attention can give improved representations of context. As an example, consider translating the English sentences “The animal didn’t cross the street because
it was too tired” and “The animal didn’t cross the street because it was too wide” into French. To generate a pronoun of the correct gender in French, we need to know what “it” refers to (this is called
coreference resolution). In the first case, the word “it” refers to the animal. In the second case, the word “it” now refers to the street.

Figure 15.23 illustrates how self attention applied to the English sentence is able to resolve this ambiguity. In the first sentence, the representation for “it” depends on the earlier representations of
“animal”, whereas in the latter, it depends on the earlier representations of “street”.

<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.23.png?raw=true" height="200" width ="500">

*Embedded from TextBook; Figure 15.23: Illustration of how encoder self-attention for the word “it” differs depending on the input context.*

---

**Multi-head attention**  
If we think of an attention matrix as being like a kernel matrix, it is natural
to want to use multiple attention matrices, to capture different notions of similarity. This is the basic idea behind multi-headed attention (MHA). Given a query $q \in \mathbb{R}^{d_q}$, keys $k_j \in \mathbb{R}^{d_k}$, and values $v_j \in \mathbb{R}^{d_v}$, we define the $i$-th attention head as:

$$
h_i = \text{Attn}\big(W_i^{(q)} q,\; \{ W_i^{(k)} k_j,\; W_i^{(v)} v_j \}\big) \in \mathbb{R}^{p_v}
$$

where  
$W_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$,  
$W_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$,  
and $W_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$  
are projection matrices.

We then stack the $h$ heads together and project to $\mathbb{R}^{p_o}$ using:

$$
h = \text{MHA}(q,\{k_j, v_j\}) = W_o
\begin{pmatrix}
h_1 \\
\vdots \\
h_h
\end{pmatrix}
\in \mathbb{R}^{p_o}
$$

where each $h_i$ is defined above, and $W_o \in \mathbb{R}^{p_o \times h p_v}$.

If we set $p_qh = p_kh = p_v h= p_o$,  we can compute all the output heads in parallel.
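A minimal NumPy sketch of multi-head attention with an explicit loop over heads. It assumes queries, keys, and values are stored as rows, so the projection matrices are the transposes of the $W_i^{(q)}, W_i^{(k)}, W_i^{(v)}, W_o$ above, and that each head uses a scaled dot-product score; all weights are random placeholders.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Q: (n, d_q); K: (m, d_k); V: (m, d_v).

    W_q: (h, d_q, p), W_k: (h, d_k, p), W_v: (h, d_v, p_v), W_o: (h * p_v, p_o).
    """
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        q_i, k_i, v_i = Q @ Wq_i, K @ Wk_i, V @ Wv_i  # per-head projections
        weights = softmax(q_i @ k_i.T / np.sqrt(q_i.shape[-1]))
        heads.append(weights @ v_i)                   # head output, (n, p_v)
    return np.concatenate(heads, axis=-1) @ W_o       # stack heads, then project

rng = np.random.default_rng(0)
n, m, d, h, p, p_o = 3, 5, 8, 2, 4, 8  # illustrative sizes with h * p = p_o
Q = rng.normal(size=(n, d))
K = V = rng.normal(size=(m, d))
W_q, W_k, W_v = (rng.normal(size=(h, d, p)) for _ in range(3))
W_o = rng.normal(size=(h * p, p_o))

out = multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o)
```

The loop is written for clarity; when all heads share the same projection size, the per-head products can be batched into a single tensor operation.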

[**CODE EXAMPLE**](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/15/multi_head_attention_torch.ipynb)

---
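**Positional encoding**  
The performance of “vanilla” self-attention can be low, since attention is permutation invariant (reordering the input tokens only reorders the outputs in the same way), and hence ignores the input word ordering.  

For example, in "The dog chased another dog", the outputs from multi-head attention will be identical for both occurrences of 'dog'.

To overcome this, we can concatenate the word embeddings with a positional embedding, so that the model knows what order the words occur in. (The original Transformer of Vaswani et al. adds the positional encoding to the word embedding instead of concatenating.)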

[SEE HUGGING FACE VISUAL](https://huggingface.co/blog/designing-positional-encoding)
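The "dog" example can be checked numerically. The sketch below assumes random toy word embeddings, self-attention without learned projections, and the sinusoidal positional encoding of Vaswani et al. as one possible choice of positional embedding; the helper names are illustrative.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal encoding: even columns use sin, odd columns use cos (d must be even)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

def self_attention(X):
    """Scaled dot-product self-attention without learned projections."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["the", "dog", "chased", "another"]}
tokens = ["the", "dog", "chased", "another", "dog"]
X = np.stack([vocab[w] for w in tokens])

# Without positions, both occurrences of "dog" get the same output.
Y_plain = self_attention(X)

# Concatenating a positional embedding makes the two occurrences distinguishable.
X_pos = np.concatenate([X, sinusoidal_positions(len(tokens), 4)], axis=1)
Y_pos = self_attention(X_pos)
```

Rows 1 and 4 of `Y_plain` coincide because the two "dog" tokens have identical embeddings, while the corresponding rows of `Y_pos` differ.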

---

**Putting it all together --> Transformer**

A transformer is a seq2seq model that uses self-attention for the encoder and decoder rather than an RNN. The encoder uses a series of encoder blocks, each of which uses multi-headed attention, residual connections, feedforward layers, and layer normalization:
```
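# Pseudocode: LayerNorm, MultiHeadAttn, FeedForward, Embed and POS
# (positional encoding) stand for the corresponding layers.
def EncoderBlock(X):
    Z = LayerNorm(MultiHeadAttn(Q=X, K=X, V=X) + X)   # self-attention + residual
    E = LayerNorm(FeedForward(Z) + Z)                 # feed-forward + residual
    return E

def Encoder(X, N):
    E = POS(Embed(X))
    for n in range(N):
        E = EncoderBlock(E)
    return E

def DecoderBlock(Y, E):
    Z = LayerNorm(MultiHeadAttn(Q=Y, K=Y, V=Y) + Y)    # (masked) self-attention over outputs
    Z2 = LayerNorm(MultiHeadAttn(Q=Z, K=E, V=E) + Z)   # attention over encoder outputs
    D = LayerNorm(FeedForward(Z2) + Z2)
    return D

def Decoder(Y, E, N):
    D = POS(Embed(Y))
    for n in range(N):
        D = DecoderBlock(D, E)
    return D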
```
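For a version that actually runs, the encoder half of the pseudocode can be sketched in NumPy. This is a minimal illustration under several assumptions: random untrained weights, layer normalization without learned gain and bias, a two-layer ReLU feed-forward network without biases, and a random matrix standing in for `POS(Embed(X))`. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_heads = 8, 16, 2  # illustrative sizes

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(X, eps=1e-5):
    """Normalise each token (row) to zero mean and unit variance; no learned gain/bias."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def init_block():
    """Hypothetical random parameters for one encoder block."""
    s = 1.0 / np.sqrt(d_model)
    return {
        "W_q": rng.normal(scale=s, size=(d_model, d_model)),
        "W_k": rng.normal(scale=s, size=(d_model, d_model)),
        "W_v": rng.normal(scale=s, size=(d_model, d_model)),
        "W_o": rng.normal(scale=s, size=(d_model, d_model)),
        "W_1": rng.normal(scale=s, size=(d_model, d_ff)),
        "W_2": rng.normal(scale=1.0 / np.sqrt(d_ff), size=(d_ff, d_model)),
    }

def multi_head_attn(X, p):
    n = X.shape[0]
    d_head = d_model // n_heads
    # Project, then split the feature axis into heads: (n_heads, n, d_head).
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ p["W_q"]), split(X @ p["W_k"]), split(X @ p["W_v"])
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (n_heads, n, n)
    heads = weights @ V                                            # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)          # re-join the heads
    return concat @ p["W_o"]

def feed_forward(Z, p):
    return np.maximum(Z @ p["W_1"], 0.0) @ p["W_2"]  # ReLU MLP, biases omitted

def encoder_block(X, p):
    Z = layer_norm(multi_head_attn(X, p) + X)
    return layer_norm(feed_forward(Z, p) + Z)

X = rng.normal(size=(5, d_model))          # stands in for POS(Embed(tokens))
blocks = [init_block() for _ in range(3)]  # N = 3 encoder blocks
E = X
for p in blocks:
    E = encoder_block(E, p)
```

The output `E` has the same shape as the input, which is what allows the blocks to be stacked $N$ times.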


<img src="https://github.com/chaitragopalappa/MIE590-690D/blob/main/images/Multi-head-attention.png?raw=true" height="500" width ="800">


*Embedded from TextBook; Figure 15.26: The Transformer*

---
---

**Comparing transformers, CNNs and RNNs**
The figure compares three different architectures for mapping a sequence $x_{1:n}$ to another sequence $y_{1:n}$: a 1d CNN, an RNN, and an attention-based model.
<img src="https://github.com/probml/pml-book/blob/main/book1-figures/Figure_15.27.png?raw=true" height="300" width ="500">
Source: Embedded from textbook *Figure 15.27: Comparison of (1d) CNNs, RNNs and self-attention models.*

---
---

**LARGE LANGUAGE MODELS (LLMs)**

**Transformers (Attention is all you need, by Vaswani et al., 2017)**
https://arxiv.org/pdf/1706.03762  
The article that started it all. The original article was motivated mainly by machine translation (encoder-decoder mechanism); its encoder and decoder stacks were later also used on their own, in encoder-only and decoder-only models. The article also **anticipated** the future that we know today: "*We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs
such as images, audio and video. Making generation less sequential is another research goals of ours.*"

---

**Large Language Models (LLMs) (models built using transformer architectures): early history**
* (Attention is all you need, by Vaswani et al., 2017) (Google)
* *June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results* (**generative LLM, also called causal LLM or autoregressive model; built on a decoder-only transformer**)
* *October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences* (**non-generative LLM for learning representations; built on an encoder-only transformer**)
* *February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns*
* *October 2019: T5, A multi-task focused implementation of the sequence-to-sequence Transformer architecture.*
* 2019: BART, or Bidirectional and Auto-Regressive Transformers. It is a denoising autoencoder that combines the strengths of BERT's bidirectional encoder for understanding text and GPT's auto-regressive decoder for generating text.
* *May 2020, GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)*
* *January 2022: InstructGPT, which uses reinforcement learning with human feedback (RLHF) to fine-tune GPT-3 to follow an instruction in a prompt and provide a detailed response.*
* *November 2022:* **ChatGPT**. Instruction fine-tuning (a type of fine-tuning that teaches the model to behave like a chatbot) turned a GPT-3.5 model into ChatGPT.
<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_chrono-dark.svg?raw=true" height="400" width ="800">  
Source: Embedded from [HuggingFace LLM Course](https://huggingface.co/learn/llm-course/chapter1/3)

---

**Transformer architectures: broad categories depending on the task**:

* *Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.*
* *Decoder-only models: Good for generative tasks such as text generation.*
* *Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.*


---
---

**Autoencoders**  
Autoencoders are used for tasks like dimensionality reduction, data denoising, feature extraction, and anomaly detection.

An autoencoder is a neural network that maps inputs $x$ to a low-dimensional latent space using an
encoder, $z = f_e(x)$, and then attempts to reconstruct the inputs using a decoder, $\hat{x} = f_d(z)$. The
model is trained to minimize
$$L(θ) = ||r(x) - x||_2^2$$
where $r(x) = f_d(f_e(x))$. (We can also replace squared error with more general conditional log
likelihoods.)
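The reconstruction loss can be illustrated with the simplest possible case. The sketch below assumes a linear autoencoder whose encoder and decoder are given in closed form by the top principal directions of the data (the PCA solution), rather than a neural network trained by gradient descent; the synthetic data and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2  # samples, input dimension, latent dimension

# Synthetic data lying close to a k-dimensional subspace of R^d.
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.01 * rng.normal(size=(n, d))

mean = X.mean(axis=0)
# Principal directions from the SVD of the centred data.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:k].T  # (d, k): columns span the latent subspace

def f_e(x):
    """Encoder: project onto the k-dimensional latent space."""
    return (x - mean) @ W

def f_d(z):
    """Decoder: map latent codes back to the input space."""
    return z @ W.T + mean

def reconstruction_loss(x):
    """L = ||f_d(f_e(x)) - x||_2^2, computed over the last axis."""
    return np.sum((f_d(f_e(x)) - x) ** 2, axis=-1)

train_loss = reconstruction_loss(X).mean()

# A point displaced along a direction the latent space cannot represent
# has a much larger reconstruction error.
outlier = mean + 5.0 * Vt[-1]
outlier_loss = reconstruction_loss(outlier)
```

The gap between `train_loss` and `outlier_loss` is the signal that reconstruction-based anomaly detection relies on.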

Components and process
* Encoder: The first part of the network that takes the input data and compresses it into a lower-dimensional representation called the latent space. It captures the most relevant features of the input.
* Bottleneck (or latent space): This is the compressed representation of the data, which has fewer dimensions than the original input. The size of this layer is a critical design choice.
* Decoder: The second part of the network that takes the compressed representation from the latent space and reconstructs the original input data from it.

Applications
* Dimensionality Reduction: By compressing data into a lower-dimensional space, autoencoders can be used to reduce the number of features while retaining the most important information.
* Anomaly Detection: Autoencoders are trained to reconstruct normal data. When they encounter an anomaly, they will have a higher reconstruction error, allowing for the detection of unusual patterns.
* Denoising: By training on noisy data and using a clean version as the target, an autoencoder can learn to remove noise and produce a cleaner output.
* Feature Extraction: The latent space representation can be used as a set of features for other machine learning tasks, like classification.
* Generative Tasks: Modified autoencoders, like Variational Autoencoders (VAEs), can be used for generating new data that is similar to the training data.

**EXAMPLES**
* [MNIST image reconstruction using a convolutional autoencoder.](https://colab.research.google.com/github/probml/pyprobml/blob/master/notebooks/book1/20/ae_mnist_conv.ipynb)
* [Time-series anomaly detection using RNN-autoencoder](https://www.kaggle.com/code/mineshjethva/timeseries-anomaly-detection-using-rnn-autoencoder)

---
---