### **Understanding Self-Attention in Simple Terms**  

Self-Attention is a mechanism that allows a model to **focus on different parts of the input sentence when processing each token**. It helps capture relationships between words, even if they are far apart in a sentence.  

![](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*tfauBl5knVDQqSoobIwlqw.png)

---

## **📌 1. What Do Q, K, V Represent?**  

### **1️⃣ Query (Q) → "What should I focus on?"**  
- **Each token asks: "Who is important to me?"**
- It's a transformed version of the input token, used to determine **how much attention to pay to other tokens**.

### **2️⃣ Key (K) → "What am I?"**  
- **Each token declares: "This is what I represent."**
- It helps measure **how relevant a token is** to another token’s Query.

### **3️⃣ Value (V) → "What information do I carry?"**  
- **Each token provides information that might be useful to others.**
- It holds the actual content that will be passed on after computing attention.

---

## **📌 2. How Does Self-Attention Work?**
Let's say we have the sentence:  
**"The quick brown fox jumps"**  

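Each word is represented as a vector (embedding). Multiplying that embedding by three learned weight matrices ($W_Q$, $W_K$, $W_V$) gives the token's Q, K, and V vectors. The table below shows the first four tokens.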

| Token  | Query (Q) | Key (K) | Value (V) |
|--------|----------|---------|---------|
| "The"  | "Who should I attend to?" | "I represent 'The'" | "My info is 'The'" |
| "quick"  | "Who should I attend to?" | "I represent 'quick'" | "My info is 'quick'" |
| "brown"  | "Who should I attend to?" | "I represent 'brown'" | "My info is 'brown'" |
| "fox"  | "Who should I attend to?" | "I represent 'fox'" | "My info is 'fox'" |
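
To make the table concrete, here is a minimal sketch of how Q, K, and V could be produced from embeddings. The sizes `d_model` and `d_k`, the seed, and the random matrices are assumptions for illustration; in a trained model the embeddings and the projection matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

tokens = ["The", "quick", "brown", "fox"]
d_model, d_k = 3, 2  # assumed toy sizes: embedding width and Q/K/V width

# One embedding row per token (random stand-ins for learned embeddings)
X = rng.normal(size=(len(tokens), d_model))

# Assumed projection matrices; in a real model these are learned
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every token gets its own query, key, and value row
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

for name, M in (("Q", Q), ("K", K), ("V", V)):
    print(name, M.shape)  # each has shape (4, 2): one row per token
```

Each row of `Q`, `K`, and `V` belongs to one token, so row 1 of `Q` is the query for "quick".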

---

## **📌 3. How Is Attention Computed?**
### **1️⃣ Compute Scores (How similar is Q to K?)**
We take the **dot product of Query and Key** to compute a **score** for each word pair:

$$
\text{Score} = \frac{QK^T}{\sqrt{d_k}}
$$

- If **Q and K point in similar directions**, the score is **high** (the key's token is relevant to the query's token).
- If **Q and K are dissimilar**, the score is **low** (the key's token is less relevant).

🔹 **Example Scores (before softmax):**

| Q (query) | K (key) for "quick" | K (key) for "brown" | K (key) for "fox" |
|-----------|----------------|----------------|--------------|
| "quick"   | **1.0**  (self-related) | **0.8** (related) | **0.2** (not related) |
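
The score computation can be sketched for a single query against two keys. The vectors below are hypothetical and chosen only so that one key is aligned with the query and the other is orthogonal to it.

```python
import numpy as np

d_k = 4  # assumed key/query dimension for this toy example

q = np.array([1.0, 0.0, 1.0, 0.0])            # hypothetical query
k_similar = np.array([0.9, 0.1, 0.8, 0.0])    # points roughly the same way as q
k_different = np.array([0.0, 1.0, 0.0, 1.0])  # orthogonal to q

# Scaled dot-product score: (q . k) / sqrt(d_k)
score_similar = q @ k_similar / np.sqrt(d_k)      # (0.9 + 0.8) / 2, about 0.85
score_different = q @ k_different / np.sqrt(d_k)  # 0 / 2 = 0.0

print(score_similar, score_different)
```

For a full sequence, `Q @ K.T` computes all of these pairwise scores at once, one row per query.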

#### Why divide by $\sqrt{d_k}$?

| Aspect            | Effect of dividing by $\sqrt{d_k}$   |
|-------------------|---------------------------|
| **Numerical Stability**  | Keeps the attention scores in a manageable range, avoiding extreme values going into softmax. |
| **Geometric**         | A dot product sums $d_k$ terms, so its typical magnitude grows with the vector dimension. Dividing by $\sqrt{d_k}$ removes that dimension-dependent growth. It does not normalize by vector length (that would be cosine similarity), so vector magnitudes still affect the score. |
| **Statistical**      | If the components of Q and K are independent with mean 0 and variance 1, their dot product has variance $d_k$. Dividing by $\sqrt{d_k}$ brings the variance back to 1, so softmax behaves predictably as the dimension increases. |

##### **In short**, the $\sqrt{d_k}$ term acts as a **scaling factor** to:
1. Prevent the dot product from becoming too large as $d_k$ grows.
2. Keep the scores on a comparable scale regardless of the vector dimension.
3. Ensure that the softmax function behaves in a stable, predictable way across different vector dimensions.

> The **softmax** function is sensitive to the scale of its inputs. If the input scores are too large (which can happen without the $\sqrt{d_k}$ scaling), the output probabilities become extreme (close to 0 or 1). In that saturated region the gradients of softmax are very small, which makes training difficult. The $\sqrt{d_k}$ scaling keeps the scores within a range where softmax can **distribute attention more evenly**, preventing extreme focusing on just one token.
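
The effect of the scaling can be checked empirically. The sketch below assumes query and key components drawn from a standard normal distribution; the sample count, seed, and example scores are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an arbitrary choice)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Part 1: spread of raw vs. scaled dot products for unit-variance vectors
n_samples = 2000  # assumed sample count, large enough for a rough estimate
stds = {}
for d_k in (4, 64, 1024):
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per row
    stds[d_k] = (dots.std(), (dots / np.sqrt(d_k)).std())
    print(f"d_k={d_k:5d}  raw std={stds[d_k][0]:6.2f}  scaled std={stds[d_k][1]:.2f}")

# Part 2: the same hypothetical scores with and without scaling (d_k = 64)
raw_scores = np.array([8.0, 16.0, 24.0])
sharp = softmax(raw_scores)               # almost all weight on the last entry
soft = softmax(raw_scores / np.sqrt(64))  # weight is spread across entries
print(sharp.round(4), soft.round(4))
```

The raw standard deviation should grow roughly like $\sqrt{d_k}$ while the scaled one stays near 1, and the unscaled softmax should come out close to one-hot.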

---

### **2️⃣ Apply Softmax (Normalize Scores to Probabilities)**
Softmax is applied to each row of scores, so every token's attention weights are positive and sum to **1**, making them interpretable as probabilities:

$$
\text{Attention Weights} = \text{softmax}(\text{Score})
$$

🔹 **Example (after softmax normalization of the scores 1.0, 0.8, 0.2):**

| Token | Weight (for "quick") |
|-------|--------------------|
| quick | **0.44** (strongest focus) |
| brown | **0.36** (some focus) |
| fox   | **0.20** (less focus) |
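
As a quick check, the softmax step can be sketched on the example scores from step 1 (1.0, 0.8, 0.2). Subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import numpy as np

# Scores of "quick" against quick, brown, fox (from the example in step 1)
scores = np.array([1.0, 0.8, 0.2])

# Subtracting the max avoids overflow for large scores
exp_scores = np.exp(scores - scores.max())
weights = exp_scores / exp_scores.sum()

print(weights.round(2))  # approximately 0.44, 0.36, 0.20
print(weights.sum())     # the weights sum to 1 (up to floating-point error)
```

Note how the gap between the scores 1.0 and 0.8 turns into a fairly small gap between the weights: softmax depends on score differences, not on their absolute values.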

---

### **3️⃣ Compute the Final Output**
The **final attention output** for token $i$ is a weighted sum of all Value (V) vectors, using that token's attention weights:

$$
\text{Output}_i = \sum_j \text{Attention Weights}_{ij} \, V_j
$$

- If **"quick" attends more to "brown"**, its final representation will **contain more information from "brown"**.
- If **"fox" attends mostly to itself**, its output stays close to its own Value vector.
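
The weighted sum can be sketched for one token. The weights and the 2-D Value vectors below are hypothetical, picked so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical attention weights for "quick" over (quick, brown, fox)
weights = np.array([0.3, 0.6, 0.1])

# Hypothetical 2-D Value vectors, one row per token
V = np.array([
    [1.0, 0.0],  # quick
    [0.0, 1.0],  # brown
    [1.0, 1.0],  # fox
])

# Explicit weighted sum: sum_j weights[j] * V[j]
output_loop = sum(w * v for w, v in zip(weights, V))

# The same computation as a single vector-matrix product
output = weights @ V

print(output)  # close to [0.4, 0.7]: dominated by the Value vector of "brown"
```

Because "brown" has the largest weight, the second component (where its Value vector is 1) ends up largest in the output.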

---

## **📌 4. Example with Code**
Here’s an example using NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so runs are reproducible

# Simulated word embeddings (4 tokens, 3 dimensions each)
X = np.array([
    [0.1, 0.3, 0.5],  # The
    [0.2, 0.4, 0.6],  # quick
    [0.3, 0.5, 0.7],  # brown
    [0.4, 0.6, 0.8]   # fox
])

# Random weight matrices (these are learned during training in a real model)
d_model = X.shape[1]
W_Q, W_K, W_V = (rng.random((d_model, d_model)) for _ in range(3))

# Compute Q, K, V
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Compute scaled scores: Q @ K.T / sqrt(d_k)
d_k = K.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)

# Apply softmax row-wise (subtracting the row max avoids overflow)
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Compute final output: each row is a weighted sum of the rows of V
output = attention_weights @ V

print("\nAttention Weights:\n", attention_weights)
print("\nSelf-Attention Output:\n", output)
```

🔹 **Example Output (Attention Weights):** the numbers below are illustrative; the actual values depend on the random weight matrices.
```
Attention Weights:
[[0.40  0.35  0.15  0.10]
 [0.30  0.40  0.20  0.10]
 [0.20  0.35  0.30  0.15]
 [0.10  0.25  0.30  0.35]]
```
- "quick" **attends more to "brown" (0.20)** than to "fox" (0.10).
- "fox" focuses **mostly on itself and brown**.

🔹 **Final Output (New Representation):** again, illustrative values.
```
Self-Attention Output:
[
  [0.31, 0.45, 0.59],
  [0.32, 0.46, 0.60],
  [0.33, 0.48, 0.61],
  [0.35, 0.50, 0.63]
]
```
- "quick" now **contains information from "brown"**.
- "fox" learned a bit about "brown", but still mostly represents itself.

---

## **📌 5. Summary**
| **Component** | **Meaning** |
|-------------|------------|
| **Q (Query)** | "What information should I focus on?" |
| **K (Key)** | "What does this token represent?" |
| **V (Value)** | "What information does this token contain?" |
| **Score** | Similarity between Q and K (how relevant is a token?) |
| **Attention Weights** | Softmax-normalized scores (focus level on each token) |
| **Output** | The new representation of each token after considering others |
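
Putting the rows of this table together gives the usual compact form of scaled dot-product attention, with softmax applied to each row of the score matrix:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$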

✅ **Self-Attention helps words attend to other words and improve their representation.**  
