## 2 - Self Attention
<hr>

Self-attention, most commonly implemented as **scaled dot-product attention,** is a core building block of modern NLP and deep learning models. It plays a central role in tasks such as machine translation, text summarization, and sentiment analysis. Self-attention lets a model weigh the importance of different parts of an input sequence when making predictions, and so capture dependencies between words.

<br>

<div style="text-align:center">
    <img src="images/selfattention.png" width=800>
    <caption><center><font color="purple"><b><u>Figure 1:</u></b> Self-Attention and Multi-Head Attention</font></center></caption>
</div>

### 2.1 - Overview

We have a set of input data points $x_1, x_2, \cdots, x_n$, each a $d$-dimensional vector. We will produce a set of outputs $y_1, y_2, \cdots, y_n$, also $d$-dimensional vectors:

$$y_i = \sum_{j=1}^n W_{ij} x_j$$

i.e., each output is a weighted average of all inputs, where the weights $W_{ij}$ are row-normalized so that each row sums to 1. Crucially, the weights here are not the same as the (learned) parameters in a neural network layer. Instead, they are derived from the inputs. For example, one option is to start from the dot products:

$$w_{ij} = x_i^T x_j$$

and apply the softmax function so that we get row-normalization:

$$W_{ij} = \frac{\exp(w_{ij})}{\sum_{k=1}^n \exp(w_{ik})}$$

and use these weights to construct the outputs. That's basically self-attention in a nutshell. In the above definition of the self-attention layer, observe that each data point $x_i$ plays three roles:

- It is compared with every data point to construct the weights for its own output $y_i$, i.e., in the dot-product example above, the sequence of weights:

$$w_{i1} = x_i^T x_1, w_{i2} = x_i^T x_2, \cdots, w_{in} = x_i^T x_n$$

- It is compared with every data point $x_j$ to construct the weights for that point's output $y_j$, i.e., the weights:

$$w_{1i} = x_1^T x_i, w_{2i} = x_2^T x_i, \cdots, w_{ni} = x_n^T x_i$$

- Once all the weights $w_{ij}$ have been constructed, it is one of the vectors that are summed, with those weights, to synthesize each actual output $y_1, \cdots, y_n$

These three roles are called the **query, key, and value** respectively.
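
The parameter-free version of self-attention described in this overview can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `simple_self_attention`, the toy sizes, and the random inputs are assumptions made for the example.

```python
import numpy as np

def simple_self_attention(X):
    """Parameter-free self-attention. X has shape (n, d), one row per x_i."""
    # Raw weights w_ij = x_i^T x_j for every pair (i, j).
    raw = X @ X.T
    # Row-wise softmax. Subtracting the row maximum avoids overflow in exp
    # and does not change the result.
    raw = raw - raw.max(axis=1, keepdims=True)
    W = np.exp(raw)
    W = W / W.sum(axis=1, keepdims=True)
    # y_i = sum_j W_ij x_j
    return W @ X, W

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # n = 4 points, d = 3 (arbitrary toy sizes)
Y, W = simple_self_attention(X)
print(W.sum(axis=1))          # every row of W sums to 1 (up to rounding)
print(Y.shape)                # (4, 3)
```

Note that the same array `X` appears three times in the function: twice in `X @ X.T` (the query and key roles) and once in `W @ X` (the value role).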

### 2.2 - The Quartet: Q, K, V, and Self-Attention

At the heart of self-attention is the quartet of Query $(Q)$, Key $(K)$, Value $(V)$, and Self-Attention itself, the operation that combines the other three:

- **Query $(Q)$:** Think of the queries as the elements seeking information. For each word in the input sequence, a query vector is calculated. These queries represent what you want to pay attention to within the sequence.

- **Key $(K)$:** Keys are like signposts. They help identify and locate important elements in the sequence. Like queries, key vectors are computed for each word.

- **Value $(V)$:** Values carry the information. Once again, for each word, a value vector is computed. These vectors hold the content that is aggregated into the output, while the queries and keys determine how much weight each value receives.

Consider a sequence of input words represented as vectors, $X = \left[ x_1, x_2, \cdots, x_n \right]$, where each $x_i$ is a vector representation of the $i$th word in the sequence. The vectors are stacked as the rows of $X$, so $X$ has one row per word.

#### Step 1: Linear Transformations
First, we transform the input vectors into three different vectors - Queries (Q), Keys (K), and Values (V) - using trainable weight matrices $W_Q, W_K$ and $W_V$ respectively.

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Here, $Q, K, V$ are matrices formed by stacking all query, key, and value vectors, respectively.
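
As a rough sketch of this step, the projections are just three matrix products. The sizes below, the separate value dimension `d_v`, and the random stand-ins for the trained matrices are all assumptions made for illustration. The inputs are stored as the rows of `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 8, 4, 4   # toy sizes, chosen only for illustration

X = rng.normal(size=(n, d))   # row i is the vector x_i

# In a real model these matrices are learned; here they are random stand-ins.
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)   # (5, 4) (5, 4) (5, 4)

# Row i of Q is the query for word i: q_i = x_i W_Q
print(np.allclose(Q[2], X[2] @ W_Q))
```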

#### Step 2: Scaled Dot-Product Attention
The self-attention score for each word is computed using a scaled dot-product of the query with all keys:

$$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V $$

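- $\frac{QK^T}{\sqrt{d_k}}$ calculates the dot product of every query with every key, scaled down by the square root of the key dimension $d_k$. Without this scaling, the dot products tend to grow with $d_k$ and push the softmax into a regime where gradients are very small.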

- The softmax function is applied row-wise, turning each row of scores into weights over the values in $V$ that sum to 1.

- This result is then multiplied by the value vectors $V$ to obtain the final output of the self-attention layer for each word.
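
The three bullet points above map onto three lines of NumPy. The following is a minimal sketch under stated assumptions: the helper names `softmax` and `scaled_dot_product_attention` are chosen for this example, the inputs are random, and masking and batching are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): entry (i, j) compares q_i with k_j
    weights = softmax(scores, axis=-1)   # row-wise, so each row sums to 1
    return weights @ V, weights          # row i is the output for word i

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 6))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (5, 6)
print(weights.shape)   # (5, 5)
```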

#### Step 3: Output of Self-Attention
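The output for each word is a weighted sum of the values, where the weight assigned to each value comes from the scaled dot product of the word's query with the corresponding key, normalized by the softmax.

$$\text{Self-Attention Output}_i = \sum_{j=1}^n \text{Attention Weights}_{ij} \, v_j, \quad \text{Attention Weights} = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) $$

Here $v_j$ is the $j$th row of $V$.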

## 3 - Self Attention in Transformers
<hr>

Transformers, the backbone of modern NLP models, prominently feature self-attention. In a transformer architecture, self-attention is applied in parallel multiple times, followed by feedforward layers.

- **Query, Key, and Value:** Each input vector $x_i$ is linearly transformed into three vectors: query $(q_i)$, key $(k_i)$, and value $(v_i)$. These transformations are achieved through learned weight matrices $W_Q, W_K, W_V$. These vectors are used to compute attention scores.

- **Attention Scores:** The attention score between a query vector $q_i$ and a key vector $k_j$ is computed as their dot product:

$$\text{Attention Score}(q_i, k_j) = q_i \cdot k_j$$

- **Scaled Attention:** To stabilize training and control gradient magnitudes, the dot products are scaled down by a factor of $\sqrt{d_k}$ where $d_k$ is the dimension of the key vectors:

$$\text{Scaled Attention}(q_i, k_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$

- **Attention Weights:** For each query $q_i$, the scaled attention scores over all keys are passed through a softmax function to obtain attention weights that sum to 1:

$$\text{Attention Weights}(q_i, k_j) = \frac{\exp\left(\text{Scaled Attention}(q_i, k_j)\right)}{\sum_{l=1}^n \exp\left(\text{Scaled Attention}(q_i, k_l)\right)}$$

- **Weighted Sum:** Finally, for each position $i$, the attention weights are used to compute a weighted sum of the value vectors:

$$\text{Self-Attention}(X)_i = \sum_{j=1}^n \text{Attention Weights}(q_i, k_j) \cdot v_j$$
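
The five bullet points above can be followed one position at a time and compared against the matrix form. This is a sketch under stated assumptions: the function `attend_one`, the toy sizes, and the random weight matrices are hypothetical and serve only to illustrate the steps.

```python
import numpy as np

def attend_one(i, X, W_Q, W_K, W_V):
    """Output for position i, following the five bullet points in order."""
    q_i = X[i] @ W_Q                        # query for position i
    K = X @ W_K                             # rows are the keys k_j
    V = X @ W_V                             # rows are the values v_j
    scores = K @ q_i                        # q_i . k_j for every j
    scaled = scores / np.sqrt(K.shape[1])   # divide by sqrt(d_k)
    e = np.exp(scaled - scaled.max())       # softmax over j (max subtracted for stability)
    weights = e / e.sum()
    return weights @ V                      # sum_j weight_ij * v_j

rng = np.random.default_rng(1)
n, d, d_k = 4, 6, 3
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

per_token = np.stack([attend_one(i, X, W_Q, W_K, W_V) for i in range(n)])

# Same computation in matrix form, with the softmax taken over each row.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
S = Q @ K.T / np.sqrt(d_k)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
print(np.allclose(per_token, A @ V))   # True
```

The loop is easier to read against the bullet points, while the matrix form is what makes the computation efficient in practice.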

## 4 - Multi-Head Attention
<hr>

In practical applications, self-attention is often extended to multi-head attention. Instead of relying on a single set of learned transformations $(W_Q, W_K, W_V)$, multi-head attention uses multiple sets of transformations, or "heads." Each head can learn to focus on different aspects or relationships within the input sequence. The outputs of these heads are concatenated and passed through a further learned linear transformation to produce the final self-attention output. This mechanism allows models to capture various types of information simultaneously.
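
A rough sketch of this idea follows, assuming each head has its own $(W_Q, W_K, W_V)$ triple and that the concatenated head outputs are combined by one output matrix, called `W_O` here. The name `W_O`, the choice `d_head = d // num_heads`, and the random weights are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)               # (n, d_head)
    concat = np.concatenate(outputs, axis=-1)     # (n, num_heads * d_head)
    return concat @ W_O                           # final linear combination

rng = np.random.default_rng(0)
n, d, num_heads = 5, 8, 2
d_head = d // num_heads   # a common choice, assumed here

X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3))
         for _ in range(num_heads)]
W_O = rng.normal(size=(num_heads * d_head, d))

Y = multi_head_attention(X, heads, W_O)
print(Y.shape)   # (5, 8)
```

The explicit loop over heads keeps the sketch close to the description; real implementations usually compute all heads in one batched operation.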

## 5 - Positional Encoding
<hr>

One critical aspect of self-attention is that it doesn't inherently capture the sequential order of elements in the input sequence, as it computes attention based on content alone. To address this limitation, positional encodings are added to the input embeddings in transformers. These encodings provide the model with information about the positions of words in the sequence, enabling it to distinguish between words with the same content but different positions.
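
The order-blindness described above can be checked numerically. The sketch below uses parameter-free self-attention for brevity and, as an assumption, the sinusoidal encoding scheme, which is one common choice but is not the only one. The function names, the permutation, and the toy sizes are hypothetical.

```python
import numpy as np

def self_attention(X):
    """Parameter-free self-attention (dot-product weights, row-wise softmax)."""
    S = X @ X.T
    A = np.exp(S - S.max(axis=1, keepdims=True))
    return (A / A.sum(axis=1, keepdims=True)) @ X

def sinusoidal_encoding(n, d):
    """Sinusoidal positional encodings of shape (n, d); d is assumed even."""
    pos = np.arange(n)[:, None]
    freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(pos * freq)   # even columns
    pe[:, 1::2] = np.cos(pos * freq)   # odd columns
    return pe

rng = np.random.default_rng(0)
n, d = 4, 6
X = rng.normal(size=(n, d))
perm = np.array([2, 0, 3, 1])   # an arbitrary reordering of the sequence

# Without positions: shuffling the inputs just shuffles the outputs.
print(np.allclose(self_attention(X[perm]), self_attention(X)[perm]))   # True

# With positions added, a word moved to a new position gets a different output.
pe = sinusoidal_encoding(n, d)
print(np.allclose(self_attention(X[perm] + pe),
                  self_attention(X + pe)[perm]))                       # False
```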