# Self-Attention in Transformers: A Deep Dive into the Architecture (Detailed Explanation)

## Project Overview

This Jupyter Notebook provides a step-by-step explanation of the **self-attention layer** within the Transformer architecture. It covers how the layer transforms fixed word embeddings into contextual embeddings and the mathematical intuition behind scaled dot product attention: deriving Query, Key, and Value vectors, computing and scaling attention scores, generating attention weights, and forming contextual embeddings as a weighted sum of Value vectors.

## Table of Contents

1.  **Revisiting Self-Attention: From Fixed to Contextual Embeddings**
2.  **Self-Attention as Scaled Dot Product Attention**
3.  **Step 1: Token Embeddings**
4.  **Step 2: Linear Transformation - Deriving Query, Key, and Value (Q, K, V) Vectors**
    * What are Q, K, V Vectors?
    * Importance of Query Vectors
    * Importance of Key Vectors
    * Importance of Value Vectors
    * Practical Example of Q, K, V Derivation
5.  **Step 3: Compute Attention Scores (Dot Product of Q and K)**
    * Example Calculation
6.  **Step 4: Scaling the Attention Scores**
    * Why Scaling is Necessary: Variance Control and Softmax Saturation
    * Softmax Saturation Explained
    * The Problem of Vanishing Gradients (without scaling)
    * The Solution: Scaling and its Impact on Softmax Output
    * Why Divide by $\sqrt{d_k}$?
7.  **Step 5: Apply Softmax to Scaled Scores (Generating Attention Weights)**
8.  **Step 6: Weighted Sum of Value Vectors (Generating Contextual Embeddings)**
    * Example Calculation
9.  **Summary of Self-Attention Steps**

---

## 1. Revisiting Self-Attention: From Fixed to Contextual Embeddings

The **self-attention layer** is at the heart of the Transformer's ability to understand language nuances. Its primary goal is to transform initial, fixed word **vectors** into **contextual embeddings**.

* **Fixed Vectors**: When words like "the," "cat," "sat" are initially fed into the model, they are converted into numerical vectors (e.g., using an embedding layer). These are "fixed" because the vector for "cat" would be the same regardless of the surrounding words. For example:
    * "the" → [1, 0, 1, 0]
    * "cat" → [0, 1, 0, 1]
    * "sat" → [1, 1, 1, 1] (Example dimensions and values, usually much larger)

* **Contextual Embeddings**: The self-attention layer takes these fixed vectors and processes them to produce new vectors (e.g., $Z_1, Z_2, Z_3$). These new vectors are **contextual embeddings** because they incorporate the "importance" and relationship of other words (tokens) in the input sequence.

Self-attention, implemented here as **Scaled Dot Product Attention**, is a mechanism that allows the model to **weigh the importance of different tokens in the input sequence relative to each other**.

---

## 2. Self-Attention as Scaled Dot Product Attention

The entire self-attention mechanism operates on the concept of **scaled dot product attention**. This involves a series of mathematical operations that we will break down step-by-step.

---

## 3. Step 1: Token Embeddings

The very first step is to convert the input words into their numerical **token embeddings**. This is typically done using an **embedding layer**.

* **Process**: Each word in the input sequence (e.g., "the", "cat", "sat") is mapped to a fixed-size vector.
* **Example** (dimension = 4):
    * $E_{the}$ = [1, 0, 1, 0]
    * $E_{cat}$ = [0, 1, 0, 1]
    * $E_{sat}$ = [1, 1, 1, 1]
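
As a minimal sketch, the toy embeddings above can be stored as rows of a lookup table and gathered by token index. The `vocab` mapping and all variable names here are illustrative assumptions, not part of any particular library; in a real model the table is learned during training.

```python
import numpy as np

# Hypothetical toy vocabulary: token -> row index in the embedding table.
vocab = {"the": 0, "cat": 1, "sat": 2}

# Embedding table with one 4-dimensional row per token (values from the example above).
embedding_table = np.array([
    [1, 0, 1, 0],  # "the"
    [0, 1, 0, 1],  # "cat"
    [1, 1, 1, 1],  # "sat"
], dtype=float)

tokens = ["the", "cat", "sat"]
token_ids = [vocab[t] for t in tokens]

# Row lookup: E has shape (sequence_length, embedding_dim).
E = embedding_table[token_ids]
print(E.shape)  # (3, 4)
```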

---

## 4. Step 2: Linear Transformation - Deriving Query, Key, and Value (Q, K, V) Vectors - *Detailed*

This step is where the magic begins to transform a static word meaning into something dynamic and context-aware. Instead of just one embedding for each word, we create three specialized representations: the Query, Key, and Value vectors.

**Mechanism:**
For each initial word embedding $E_x$ (where 'x' is a specific word in your input sequence), we apply three different linear transformations. This is done by multiplying $E_x$ with three distinct, learned weight matrices: $W_Q$, $W_K$, and $W_V$.

* **Query vector ($Q_x$)**: $Q_x = E_x \cdot W_Q$
* **Key vector ($K_x$)**: $K_x = E_x \cdot W_K$
* **Value vector ($V_x$)**: $V_x = E_x \cdot W_V$

**Key Attributes & Their Definitions:**

* **Weight Matrices ($W_Q, W_K, W_V$)**:
    * **Definition**: These are trainable parameter matrices unique to each attention head (in multi-head attention) and shared across all positions within a single head. They are typically initialized randomly and are *learned* during the model's training process (via backpropagation and gradient descent) to optimally transform the input embeddings.
    * **Purpose**: They project the original word embedding into different sub-spaces. Each sub-space (Query, Key, Value) is designed to capture different aspects of the word's meaning relevant to the attention mechanism. Without these learned transformations, the self-attention layer wouldn't be able to learn complex relationships between words.

* **Query Vector ($Q$)**:
    * **Definition**: A vector that represents the "question" or "query" for the current word being processed. It's what the word is looking for in other words.
    * **Role**: When computing attention for a specific token at position 'i', its $Q_i$ vector is used to compare against all other Key vectors in the sequence. It's the "seeker" of relevant information.
    ![alt](images/query-vector.png)

* **Key Vector ($K$)**:
    * **Definition**: A vector that represents the "label" or "descriptor" of a word, which can be matched against queries.
    * **Role**: Every word in the sequence has a Key vector. These $K$ vectors are what Query vectors "look up" to determine relevance. When $Q_i$ is multiplied by $K_j$, it's like asking: "How well does the information I'm looking for (from $Q_i$) match the description of this other word (from $K_j$)?"
    ![alt](images/key-vector.png)

* **Value Vector ($V$)**:
    * **Definition**: A vector that contains the actual information content or "payload" of a word.
    * **Role**: Once the attention mechanism determines *how much* to focus on each word (using Q and K), the Value vectors are the ones whose information is actually summed up, weighted by the attention weights. $V$ carries the rich, abstract representation of the word that will contribute to the output contextual embedding.
    ![alt](images/value-vector.png)

**Practical Implication:**
This step allows each word to adopt three distinct "roles" in the attention process. The fact that $W_Q, W_K, W_V$ are different means that the model can learn nuanced ways to ask questions (Query), describe information (Key), and provide content (Value). This disentanglement of roles is crucial for the richness of context that self-attention provides.

### Practical Example of Q, K, V Derivation

Assume, for illustrative purposes, that $W_Q, W_K, W_V$ are all set to **identity matrices** (in reality, they are learned and are not typically identity matrices):

* $E_{the}$ = [1, 0, 1, 0]
* $Q_{the}$ = $K_{the}$ = $V_{the}$ = [1, 0, 1, 0]
* $Q_{cat}$ = [0, 1, 0, 1] (and $K_{cat}$, $V_{cat}$ would also be [0,1,0,1] if $W_Q, W_K, W_V$ were identity matrices)
* $Q_{sat}$ = [1, 1, 1, 1] (and $K_{sat}$, $V_{sat}$ would also be [1,1,1,1] if $W_Q, W_K, W_V$ were identity matrices)
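
The derivation above can be sketched in NumPy as three matrix multiplications. The identity matrices mirror the simplifying assumption in the text. The randomly initialized `W_Q_random` at the end is a hypothetical stand-in for a learned matrix, included only to show that a projection can change the vector size.

```python
import numpy as np

# Token embeddings for "the", "cat", "sat" (one row per token).
E = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
], dtype=float)

# Illustrative assumption from the text: identity projections.
W_Q = np.eye(4)
W_K = np.eye(4)
W_V = np.eye(4)

Q = E @ W_Q  # queries, shape (3, 4)
K = E @ W_K  # keys,    shape (3, 4)
V = E @ W_V  # values,  shape (3, 4)
print(Q[0])  # [1. 0. 1. 0.]

# Hypothetical non-identity projection from 4 dimensions down to 2.
rng = np.random.default_rng(0)
W_Q_random = rng.normal(size=(4, 2))
Q_small = E @ W_Q_random
print(Q_small.shape)  # (3, 2)
```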

---

## 5. Step 3: Compute Attention Scores (Dot Product of Q and K) - *Detailed*

This is where the "attention" truly begins to form. We determine how much each word in the sequence is "relevant" to the current word we are focusing on.

**Mechanism:**
For a given Query vector $Q_i$ (representing the $i$-th word in the sequence), we calculate a dot product with every Key vector $K_j$ (representing the $j$-th word in the sequence, including $K_i$ itself).

* **Formula**: $\text{Score}(Q_i, K_j) = Q_i \cdot K_j^T$ (The superscript 'T' denotes transpose, ensuring the vectors are compatible for a dot product, resulting in a scalar value).
* **Mathematical Intuition**: The dot product of two vectors is a measure of their similarity and alignment.
    * If $Q_i$ and $K_j$ point in similar directions (meaning they capture similar semantic or syntactic features), their dot product will be large.
    * If they are orthogonal (sharing no common direction), their dot product will be zero.
    * If they point in opposite directions, the dot product will be negative. Negative scores need no special handling: the softmax in Step 5 maps them to small positive weights.
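
A small sketch of the three cases, using hand-picked vectors (the key vectors below are assumptions chosen for illustration, not values from a trained model):

```python
import numpy as np

q = np.array([1.0, 0.0, 1.0, 0.0])

k_aligned = np.array([2.0, 0.0, 2.0, 0.0])     # same direction as q
k_orthogonal = np.array([0.0, 1.0, 0.0, 1.0])  # shares no non-zero component with q
k_opposite = -q                                # opposite direction

print(q @ k_aligned)     # 4.0
print(q @ k_orthogonal)  # 0.0
print(q @ k_opposite)    # -2.0
```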

**Key Attributes & Their Definitions:**

* **Attention Score (or Raw Score)**:
    * **Definition**: A scalar value resulting from the dot product of a Query vector with a Key vector. It quantifies the raw, unnormalized affinity or relevance between the querying token and the keyed token.
    * **Purpose**: These scores are the initial indicators of how much "attention" or "focus" the current word (represented by its Query) should pay to another word (represented by its Key). A higher score means greater potential relevance.

### Example Calculation (for Query of "the"):

Using our example from Step 2 with identity matrices:
$Q_{the} = [1, 0, 1, 0]$
$K_{the} = [1, 0, 1, 0]$
$K_{cat} = [0, 1, 0, 1]$
$K_{sat} = [1, 1, 1, 1]$

1.  **Score for $Q_{the}$ and $K_{the}$**:
    * $\text{Score}(Q_{the}, K_{the}) = [1, 0, 1, 0] \cdot [1, 0, 1, 0]^T = 1\times1 + 0\times0 + 1\times1 + 0\times0 = 2$

2.  **Score for $Q_{the}$ and $K_{cat}$**:
    * $\text{Score}(Q_{the}, K_{cat}) = [1, 0, 1, 0] \cdot [0, 1, 0, 1]^T = 1\times0 + 0\times1 + 1\times0 + 0\times1 = 0$

3.  **Score for $Q_{the}$ and $K_{sat}$**:
    * $\text{Score}(Q_{the}, K_{sat}) = [1, 0, 1, 0] \cdot [1, 1, 1, 1]^T = 1\times1 + 0\times1 + 1\times1 + 0\times1 = 2$

Example scores for "the": [2, 0, 2]

This step is repeated for all Query vectors. Our complete attention scores matrix (for all query-key pairs) would be:

$$
\begin{array}{c|ccc}
 & K_{the} & K_{cat} & K_{sat} \\
\hline
Q_{the} & 2 & 0 & 2 \\
Q_{cat} & 0 & 2 & 2 \\
Q_{sat} & 2 & 2 & 4
\end{array}
$$
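
All query-key dot products can be computed at once as a single matrix product. The sketch below assumes the identity-matrix setup from Step 2, so `Q` and `K` both equal the embeddings.

```python
import numpy as np

# Rows: "the", "cat", "sat". With identity projections, Q and K equal the embeddings.
Q = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
], dtype=float)
K = Q.copy()

# scores[i, j] is the dot product of query i with key j.
scores = Q @ K.T
print(scores)
# [[2. 0. 2.]
#  [0. 2. 2.]
#  [2. 2. 4.]]
```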

---

## 6. Step 4: Scaling the Attention Scores - *Detailed*

Before we turn these raw scores into probabilities, we perform a critical scaling operation.

**Mechanism:**
Each attention score is divided by the square root of the dimension of the Key vectors.

* **Formula**: $\text{Scaled Score} = \frac{\text{Score}}{\sqrt{d_k}}$

**Key Attributes & Their Definitions:**

* **$d_k$ (Dimension of Key Vectors)**:
    * **Definition**: This refers to the dimensionality (number of elements) of the Key vectors. Query vectors must have the same dimensionality ($d_q = d_k$), since the dot product is only defined for vectors of equal length.
    * **Purpose**: This value is central to the scaling factor. For instance, if your initial word embeddings are 512 dimensions, and the learned weight matrices $W_Q, W_K, W_V$ transform these into Query, Key, and Value vectors of 64 dimensions, then $d_k=64$. It is a hyperparameter determining the size of the vectors within the attention mechanism.

* **Square Root ($\sqrt{d_k}$)**:
    * **Definition**: The specific factor used for scaling.
    * **Purpose**: This factor is derived from statistical properties. As the dimension ($d_k$) of the Query and Key vectors increases, the magnitude of their dot products tends to grow. Specifically, if the elements of $Q$ and $K$ are drawn independently from a distribution with mean 0 and variance 1, then the dot product $Q \cdot K^T$ will have a mean of 0 and a variance of $d_k$. Dividing by $\sqrt{d_k}$ effectively normalizes the variance of the dot products back to 1. This prevents the scores from becoming too large.
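
The variance claim can be checked empirically. The sketch below assumes the idealized setting described above (independent standard-normal elements); the sample count and the choice of dimensions are arbitrary, and real query and key vectors will not follow this distribution exactly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
num_samples = 10_000

# Maps d_k -> (variance of raw dot products, variance after scaling).
variances = {}
for d_k in (4, 64, 256):
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = np.sum(q * k, axis=1)    # one dot product per sampled (q, k) pair
    scaled = dots / np.sqrt(d_k)    # apply the 1 / sqrt(d_k) factor
    variances[d_k] = (dots.var(), scaled.var())
    print(f"d_k={d_k}: raw variance={dots.var():.1f}, scaled variance={scaled.var():.2f}")
```

The raw variance should land near $d_k$ in each case, while the scaled variance should stay near 1.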

### Why is Scaling Absolutely Necessary?

1.  **Variance Control**:
    * Without scaling, the raw attention scores (dot products) can become very large in magnitude, especially when $d_k$ is large.
    * Large scores do not cause exploding gradients through softmax, because the softmax derivative is bounded. The problem is the opposite: large-magnitude inputs push softmax into its saturated regime, where gradients become extremely small (see the next point), which slows or stalls learning.
    * Dividing by $\sqrt{d_k}$ keeps the variance of the scores near 1 regardless of $d_k$, so the softmax inputs stay in a range where gradients remain useful.

2.  **Combating Softmax Saturation**:
    * The softmax function ($\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$) is highly sensitive to the magnitude of its inputs.
    * If the input values (attention scores) are very large, the exponential function $e^{x_i}$ will cause one value to become disproportionately large compared to others. This leads to an output distribution where one probability is very close to 1, and all others are very close to 0. This is known as "softmax saturation."
    * **Consequence of Saturation**:
        * **Loss of Nuance**: The attention mechanism effectively becomes a hard attention, almost exclusively focusing on a single token and ignoring the subtle influences of other tokens in the context. This defeats the purpose of learning complex, distributed dependencies.
        * **Vanishing Gradients**: When softmax output values are either extremely close to 0 or 1, the derivative (gradient) of the softmax function with respect to its input approaches 0. During backpropagation, these tiny gradients for the $W_Q, W_K, W_V$ matrices would mean that the model learns very little or very slowly. This prevents the attention mechanism from effectively adjusting its focus and improving its understanding of relationships.

**In essence, scaling ensures that the attention weights are not too sharp, allowing the model to attend to multiple relevant tokens simultaneously and enabling effective gradient flow during training.**
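
To make the saturation effect concrete, the sketch below evaluates the softmax Jacobian, $\frac{\partial p_i}{\partial x_j} = p_i(\delta_{ij} - p_j)$, for increasingly large inputs. The base scores and the multipliers are arbitrary assumptions chosen only to show the trend.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

def softmax_jacobian(x):
    """Jacobian of softmax: J[i, j] = p[i] * (delta_ij - p[j])."""
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

base_scores = np.array([1.0, 0.0, 2.0])
max_grads = []
for factor in (1.0, 10.0, 100.0):
    x = base_scores * factor
    p = softmax(x)
    max_grad = np.abs(softmax_jacobian(x)).max()
    max_grads.append(max_grad)
    print(f"factor={factor:>5}: largest weight={p.max():.4f}, largest |gradient|={max_grad:.2e}")
```

As the inputs grow, the largest weight approaches 1 and every entry of the Jacobian shrinks toward 0, which is the vanishing-gradient behavior described above.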

### Example (with scaling):

Our example with $d_k = 4$: divide by 2.
* Scaled scores for "the": [1, 0, 1]
* Scaled scores for "cat": [0, 1, 1]
* Scaled scores for "sat": [1, 1, 2]

Let's illustrate the impact on Softmax:
Consider original (unscaled) scores: `[6, 4]` (e.g., from a higher $d_k$ or just larger dot products).
* Softmax(`[6, 4]`): $P_1 = \frac{e^6}{e^6 + e^4} \approx 0.88$, $P_2 = \frac{e^4}{e^6 + e^4} \approx 0.12$. (Very skewed)

Now, applying scaling with $d_k = 4$ (dividing by $\sqrt{4} = 2$), `[6, 4]` becomes `[3, 2]`:
* Softmax(`[3, 2]`): $P_1 = \frac{e^3}{e^3 + e^2} \approx 0.73$, $P_2 = \frac{e^2}{e^3 + e^2} \approx 0.27$. (Much more balanced, allowing more nuanced attention.)
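
The same comparison in code, as a quick check of the numbers above (assuming $d_k = 4$, so the scaling factor is 2):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

unscaled = np.array([6.0, 4.0])
d_k = 4
scaled = unscaled / np.sqrt(d_k)  # [3. 2.]

print(np.round(softmax(unscaled), 2))  # [0.88 0.12]
print(np.round(softmax(scaled), 2))    # [0.73 0.27]
```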

---

## 7. Step 5: Apply Softmax to Scaled Scores (Generating Attention Weights) - *Detailed*

This step transforms the scaled relevance scores into a meaningful probability distribution, indicating the actual "attention" given to each word.

**Mechanism:**
The softmax function is applied independently to each row of the scaled attention scores matrix. For each query, its set of scaled scores is converted into probabilities.

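* **Formula**: $\text{Attention Weights}_{i,j} = \frac{\exp\left(Q_i \cdot K_j^T / \sqrt{d_k}\right)}{\sum_{m=1}^{N} \exp\left(Q_i \cdot K_m^T / \sqrt{d_k}\right)}$, i.e., softmax is taken over all keys $j = 1, \dots, N$ for a fixed query $i$.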
    * For a given query $Q_i$, we have a vector of scaled scores: $[\text{scaled\_score}_{i,1}, \text{scaled\_score}_{i,2}, \dots, \text{scaled\_score}_{i,N}]$.
    * Softmax takes this vector and outputs a new vector of probabilities $[\text{weight}_{i,1}, \text{weight}_{i,2}, \dots, \text{weight}_{i,N}]$.

**Key Attributes & Their Definitions:**

* **Softmax Function**:
    * **Definition**: A mathematical function that converts a vector of arbitrary real values (like our scaled attention scores) into a probability distribution. It squashes values between 0 and 1, and ensures that the sum of the resulting probabilities for any given row is 1.
    * **Properties**:
        * **Outputs Probabilities**: Each output value is a non-negative number between 0 and 1.
        * **Sums to One**: The sum of all output probabilities for a given input vector (i.e., for a single query's attention weights) is exactly 1.
        * **Exaggerates Differences**: The exponential nature of softmax means that larger input values will result in disproportionately larger probabilities, effectively highlighting the most relevant items while still giving some (potentially small) weight to less relevant ones (provided scaling prevented saturation). This "soft" selection is why it's called soft attention.

* **Attention Weights**:
    * **Definition**: The output of the softmax function. These are probabilistic values (between 0 and 1) that represent the learned "importance" or "contribution" of each token in the sequence to the contextual representation of the current querying token.
    * **Purpose**: These weights dictate *how much* information from each Value vector (from Step 2) will be incorporated into the final contextual embedding. A high attention weight for $V_j$ (when computing for $Q_i$) means that $V_j$ provides significant contextual information for $Q_i$. They are the "coefficients" for the weighted sum in the next step.

### Examples of Attention Weights:

* For "the" (from scaled scores [1, 0, 1]):
    * $P_{the,the} = \frac{e^1}{e^1 + e^0 + e^1} \approx 0.4223$
    * $P_{the,cat} = \frac{e^0}{e^1 + e^0 + e^1} \approx 0.1554$
    * $P_{the,sat} = \frac{e^1}{e^1 + e^0 + e^1} \approx 0.4223$
    * Resulting attention weights for "the" → [0.4223, 0.1554, 0.4223] (sums to approx 1)

* For "cat" (from scaled scores [0, 1, 1]):
    * Resulting attention weights for "cat" → [0.1554, 0.4223, 0.4223]

* For "sat" (from scaled scores [1, 1, 2]):
    * Resulting attention weights for "sat" → [0.2119, 0.2119, 0.5761] (sums to approximately 1 after rounding)

This matrix of attention weights is the heart of "attention"—it explicitly shows how each word focuses on every other word in the sequence. These weights will directly determine the blend of information in the final contextual embeddings.
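
A minimal sketch of the scaling and row-wise softmax for the running example, assuming the raw score matrix from Step 3 and $d_k = 4$:

```python
import numpy as np

# Raw scores from Step 3 (rows: queries, columns: keys).
scores = np.array([
    [2.0, 0.0, 2.0],
    [0.0, 2.0, 2.0],
    [2.0, 2.0, 4.0],
])
d_k = 4
scaled_scores = scores / np.sqrt(d_k)

# Row-wise softmax; subtracting each row's max improves numerical stability
# without changing the result.
shifted = scaled_scores - scaled_scores.max(axis=1, keepdims=True)
exps = np.exp(shifted)
attention_weights = exps / exps.sum(axis=1, keepdims=True)

print(np.round(attention_weights, 3))
# [[0.422 0.155 0.422]
#  [0.155 0.422 0.422]
#  [0.212 0.212 0.576]]
```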

---

## 8. Step 6: Weighted Sum of Value Vectors (Generating Contextual Embeddings) - *Detailed*

This is the grand finale! With the attention weights in hand, we now combine the Value vectors to produce the final **contextual embeddings**. This step aggregates the information from the entire input sequence, weighted by how relevant each part of the sequence is to the current token.

**Mechanism:**
For each query token (represented by $Q_i$), its new contextual embedding ($Z_i$) is calculated as a weighted sum of *all* Value vectors ($V_j$) in the input sequence. The weights used for this summation are precisely the attention weights (obtained from Step 5) that correspond to $Q_i$ attending to each $K_j$.

* **Formula (for a single contextual embedding $Z_i$)**:
    $Z_i = \sum_{j=1}^{N} \text{AttentionWeight}(Q_i, K_j) \cdot V_j$
    where:
    * $N$ is the total number of tokens in the sequence.
    * $\text{AttentionWeight}(Q_i, K_j)$ is the scalar attention weight (probability) that token $i$ assigns to token $j$.
    * $V_j$ is the Value vector for token $j$.

* **Formula (in matrix form for all contextual embeddings $Z$)**:
    $Z = \text{AttentionWeights} \cdot V_{\text{matrix}}$
    where:
    * $Z$ is the matrix of all output contextual embeddings (each row is $Z_i$).
    * $\text{AttentionWeights}$ is the $N \times N$ matrix of attention weights (from Step 5), where each row sums to 1.
    * $V_{\text{matrix}}$ is the $N \times d_v$ matrix of Value vectors (from Step 2), where $d_v$ is the dimension of the Value vectors.

**Key Attributes & Their Definitions:**

* **Contextual Embedding ($Z_i$)**:
    * **Definition**: The final output vector for a specific token $i$ after the self-attention layer has processed it. It's a numerical representation of the token that encapsulates its meaning *in the context of the entire input sequence*.
    * **Purpose**: Unlike the initial fixed embedding, $Z_i$ is rich with information about the relationships between token $i$ and all other tokens. This contextualized representation is what makes Transformers so powerful for understanding language, allowing them to resolve ambiguities (e.g., "bank" in "river bank" vs. "money bank") and capture long-range dependencies.

* **Weighted Sum**:
    * **Definition**: A sum where each term is multiplied by a coefficient (its weight).
    * **Purpose**: The attention weights act as these coefficients. They dictate how much emphasis or "contribution" each Value vector provides to the final contextual embedding. If token $i$ pays a lot of attention to token $j$ (high $\text{AttentionWeight}(Q_i, K_j)$), then $V_j$ will have a large influence on $Z_i$. Conversely, if attention is low, $V_j$ will contribute little. This selective aggregation of information is the core idea of attention.

### Example Calculation (for "the"):

We want to find $Z_{the}$, the contextual embedding for "the."
From Step 5, its attention weights are `[0.4223, 0.1554, 0.4223]`.
From Step 2 (using identity matrices for simplicity), the Value vectors are:
* $V_{the}$ = `[1, 0, 1, 0]`
* $V_{cat}$ = `[0, 1, 0, 1]`
* $V_{sat}$ = `[1, 1, 1, 1]`

Let's compute $Z_{the}$ by performing the weighted sum:

$Z_{the} = (0.4223 \cdot V_{the}) + (0.1554 \cdot V_{cat}) + (0.4223 \cdot V_{sat})$

Breaking it down component-wise:

$Z_{the} = (0.4223 \cdot [1, 0, 1, 0]) + (0.1554 \cdot [0, 1, 0, 1]) + (0.4223 \cdot [1, 1, 1, 1])$

$Z_{the} = [0.4223, 0, 0.4223, 0] \quad \text{(Contribution from } V_{the} \text{)}$
$+ [0, 0.1554, 0, 0.1554] \quad \text{(Contribution from } V_{cat} \text{)}$
$+ [0.4223, 0.4223, 0.4223, 0.4223] \quad \text{(Contribution from } V_{sat} \text{)}$

Now, summing these up element by element:

$Z_{the}[0] = 0.4223 + 0 + 0.4223 = 0.8446$
$Z_{the}[1] = 0 + 0.1554 + 0.4223 = 0.5777$
$Z_{the}[2] = 0.4223 + 0 + 0.4223 = 0.8446$
$Z_{the}[3] = 0 + 0.1554 + 0.4223 = 0.5777$

Thus, $Z_{the} = [0.8446, 0.5777, 0.8446, 0.5777]$

This $Z_{the}$ is the **contextual embedding** for "the." Notice how different it is from its original fixed embedding `[1, 0, 1, 0]`. It's a blend of information from "the" itself, "cat," and "sat," all carefully weighted by how relevant they were deemed to "the" by the attention mechanism. This dynamic combination allows the model to capture the contextual meaning of "the" within "the cat sat."

Similarly, contextual embeddings ($Z_{cat}$ and $Z_{sat}$) would be computed for "cat" and "sat" using their respective attention weights and the same set of Value vectors. These contextualized representations are then passed to the next sub-layer in the Transformer encoder (the Feed-Forward Neural Network) or to the decoder.
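
All three contextual embeddings can be obtained with one matrix product, $Z = \text{AttentionWeights} \cdot V_{\text{matrix}}$. The sketch below recomputes the attention weights from the scaled scores rather than using the rounded values, and keeps the identity-matrix assumption for the Value vectors.

```python
import numpy as np

# Value vectors (identity projection assumed, so V equals the embeddings).
V = np.array([
    [1, 0, 1, 0],  # "the"
    [0, 1, 0, 1],  # "cat"
    [1, 1, 1, 1],  # "sat"
], dtype=float)

# Scaled scores from Step 4, followed by a row-wise softmax (Step 5).
scaled_scores = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
])
exps = np.exp(scaled_scores - scaled_scores.max(axis=1, keepdims=True))
attention_weights = exps / exps.sum(axis=1, keepdims=True)

# Each row of Z is a weighted sum of the rows of V.
Z = attention_weights @ V
print(np.round(Z, 4))
# [[0.8446 0.5777 0.8446 0.5777]
#  [0.5777 0.8446 0.5777 0.8446]
#  [0.7881 0.7881 0.7881 0.7881]]
```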

---

## 9. Summary of Self-Attention Steps

1. Embed words into vectors
2. Derive Q, K, V using linear layers
3. Compute attention scores (dot product Q and K)
4. Scale scores using $\sqrt{d_k}$
5. Apply softmax to get attention weights
6. Multiply attention weights with V to get contextual embedding
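
The six steps can be combined into one small function. This is a single-head sketch under the assumptions used throughout this notebook (no masking, no batching, no multi-head splitting); the function name `self_attention` and its signature are illustrative, not taken from any library.

```python
import numpy as np

def self_attention(E, W_Q, W_K, W_V):
    """Single-head scaled dot product self-attention (illustrative sketch).

    E: (N, d_model) token embeddings; W_Q, W_K: (d_model, d_k); W_V: (d_model, d_v).
    Returns the contextual embeddings Z (N, d_v) and the attention weights (N, N).
    """
    # Step 2: linear projections.
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    # Steps 3 and 4: dot-product scores, scaled by sqrt(d_k).
    d_k = K.shape[-1]
    scaled_scores = (Q @ K.T) / np.sqrt(d_k)
    # Step 5: row-wise softmax (max subtracted for numerical stability).
    exps = np.exp(scaled_scores - scaled_scores.max(axis=-1, keepdims=True))
    weights = exps / exps.sum(axis=-1, keepdims=True)
    # Step 6: weighted sum of value vectors.
    return weights @ V, weights

# Step 1: toy embeddings for "the", "cat", "sat".
E = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
], dtype=float)

Z, weights = self_attention(E, np.eye(4), np.eye(4), np.eye(4))
print(np.round(Z[0], 4))  # [0.8446 0.5777 0.8446 0.5777]
```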