## Self Attention in Transformers

Self-attention lets a model **pay attention** to different parts of its input: for each element, it weighs how relevant every other element is.

### Imagine a Sentence:

Take the sentence: **The cat sat on the mat.**

When the model reads this sentence, **self-attention** helps it determine which other words are important for understanding each word.

For example, when processing the word **"cat"**, the model might focus on:
- **"the"** — to recognize that it's a specific cat.
- **"sat"** — to understand the action the cat is performing.

This allows the model to build a deeper understanding of the sentence by considering how words relate to each other.

### How Transformers Understand Language: The Complete Attention Mechanism

Let's understand how a Transformer processes the sentence: **"The cat sat on the mat"**


### Step 1: Input Tokenization and Embedding

**Process:** The sentence is first broken down into individual tokens (words or sub-words), and each token is converted into a high-dimensional vector representation called an embedding.

```
Input sentence: "The cat sat on the mat"
Tokens: ["The", "cat", "sat", "on", "the", "mat"]
Token positions: [0, 1, 2, 3, 4, 5]
```

**Embedding Details:** Each token is converted into a dense vector, typically with 512 or 768 dimensions. These embeddings are learned during training and capture semantic meaning.

```
"cat" → [0.2, -0.1, 0.8, 0.3, ..., 0.5]  (d_model = 512 dimensions)
"sat" → [0.1, 0.4, -0.2, 0.7, ..., -0.3] (d_model = 512 dimensions)
```

**Mathematical Representation:**
```
X has shape (n, d_model)
where:
  n = sequence length (e.g., 6 tokens in above sentence)
  d_model = embedding dimension (e.g., 512)
```
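
A minimal sketch of the lookup, assuming a toy vocabulary built from the sentence itself and a randomly initialised embedding table (in a trained model the table is learned; `vocab` and `embedding_table` are illustrative names, not part of any library):

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
# Hypothetical toy vocabulary: one id per distinct token string.
vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
d_model = 512

# Random stand-in for a learned embedding table, one row per vocabulary entry.
embedding_table = rng.standard_normal((len(vocab), d_model))

token_ids = np.array([vocab[tok] for tok in tokens])
X = embedding_table[token_ids]  # row lookup -> shape (n, d_model)

print(X.shape)  # (6, 512)
```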

### Step 2: Creating Query, Key, and Value Vectors

The attention mechanism works by transforming each input embedding into three different vector types. This is the core innovation that allows the model to determine what information is relevant.

#### Query (Q) Vectors - "What am I looking for?"

**Explanation:** The Query vector represents what information a particular token is seeking from other tokens in the sequence. Think of it as a search query that each word uses to find relevant context from other words.

**Detailed Example:** When processing the word "cat", its Query vector encodes questions like:
- "What actions are associated with me?"
- "What descriptive words modify me?"
- "What spatial relationships involve me?"

**Mathematical Transformation:**
```
Q = X × W_q
Where:
 W_q has shape (d_model, d_k)
 Typically: d_model = 512, d_k = 64  # dimension reduction for efficiency
 If X has shape (n, d_model), e.g., n = 6

Then Q will have shape (n, d_k) = (6, 64)
```

#### Key (K) Vectors - "What information do I offer?"

**Explanation:** The Key vector represents what information each token can provide to others. It acts like a searchable index or label that describes the token's content and role in the sentence.

**Detailed Example:** The word "sat" has a Key vector that announces:
- "I am a past-tense action verb"
- "I describe what the subject (cat) did"
- "I connect the subject to a location"

**Mathematical Transformation:**
```
K = X × W_k
Where:
 W_k has shape (d_model, d_k)
 Typically: d_model = 512, d_k = 64
 If X has shape (n, d_model), e.g., n = 6

Then K will have shape (n, d_k) = (6, 64)

```

#### Value (V) Vectors - "What is my actual content?"

**Explanation:** The Value vector contains the actual information content that will be used in the final representation. If a token is deemed relevant (through Query-Key matching), its Value vector contributes to the output.

**Detailed Example:** The Value vector for "cat" contains:
- Semantic information about animals
- Syntactic information about being a noun
- Contextual information about its role as a subject

**Mathematical Transformation:**
```
V = X × W_v
Where:
 W_v has shape (d_model, d_v)
 Typically: d_model = 512, d_v = 64
 If X has shape (n, d_model), e.g., n = 6

Then V will have shape (n, d_v) = (6, 64)
```

#### Representation of Q, K, V Creation:

| Token | Original Embedding | Query Vector | Key Vector | Value Vector |
|-------|-------------------|--------------|------------|-------------|
| "The" | [512 dims] → | [64 dims] | [64 dims] | [64 dims] |
| "cat" | [512 dims] → | [64 dims] | [64 dims] | [64 dims] |
| "sat" | [512 dims] → | [64 dims] | [64 dims] | [64 dims] |
| "on" | [512 dims] → | [64 dims] | [64 dims] | [64 dims] |
| "the" | [512 dims] → | [64 dims] | [64 dims] | [64 dims] |
| "mat" | [512 dims] → | [64 dims] | [64 dims] | [64 dims] |
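
The three projections can be sketched in NumPy as follows. The weight matrices here are random stand-ins for learned parameters, so only the shapes are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 6, 512, 64, 64

X = rng.standard_normal((n, d_model))  # stand-in for the token embeddings

# Random stand-ins for the learned projection matrices.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

Q = X @ W_q  # (6, 512) @ (512, 64) -> (6, 64)
K = X @ W_k  # (6, 512) @ (512, 64) -> (6, 64)
V = X @ W_v  # (6, 512) @ (512, 64) -> (6, 64)

print(Q.shape, K.shape, V.shape)  # (6, 64) (6, 64) (6, 64)
```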


### Step 3: Computing Attention Scores

**Overview:** For each token, we calculate how much attention it should pay to every token in the sequence (including itself) by comparing its Query vector with all Key vectors.

#### Similarity Calculation (Dot Product)

**Explanation:** The dot product between Query and Key vectors measures their similarity. A high dot product means the Query is looking for exactly what the Key offers, indicating strong relevance.

**Mathematical Formula:**
```
Attention_Scores = Q × K^T
Where:
 Q has shape (n, d_k)
 K^T (transpose of K) has shape (d_k, n)

Resulting Attention_Scores will have shape (n, n) = (6, 6)

```

**Example - Attention Score Matrix:**
```
        The   cat   sat   on    the   mat
The   [ 2.1   1.2   0.8   0.9   2.0   0.7 ]
cat   [ 1.1   3.2   4.1   0.6   1.0   1.8 ]
sat   [ 0.9   4.0   2.8   2.1   0.8   2.2 ]
on    [ 0.7   1.5   2.3   1.9   1.1   3.1 ]
the   [ 1.8   1.0   0.9   1.2   2.1   0.8 ]
mat   [ 0.6   2.1   1.9   2.8   0.7   2.5 ]
```
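
As a small check of what each entry means, the sketch below uses random Q and K (synthetic data, not the values in the example matrix) and confirms that entry `[i, j]` of the score matrix is the dot product of token i's query with token j's key:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 6, 64
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

scores = Q @ K.T  # shape (n, n)

# Entry [i, j] compares the query of token i with the key of token j.
i, j = 1, 2  # e.g. "cat" (row 1) scored against "sat" (column 2)
assert np.isclose(scores[i, j], np.dot(Q[i], K[j]))

print(scores.shape)  # (6, 6)
```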

#### Scaling for Stability

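**Why Scaling is Needed:** The magnitude of the dot products grows with d_k. Large scores push the softmax toward a nearly one-hot output, where its gradients are very small and training becomes difficult. Dividing by sqrt(d_k) keeps the scores in a range where this is less likely.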

**Mathematical Formula:**
```
Scaled_Scores = Attention_Scores / sqrt(d_k)
For d_k = 64: Scaled_Scores = Attention_Scores / sqrt(64) = Attention_Scores / 8
```
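
To see the effect numerically, the sketch below compares softmax on unscaled and scaled scores for one random query against six random keys. The data is synthetic, and `softmax` is a small helper defined only for this example:

```python
import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal(d_k)
keys = rng.standard_normal((6, d_k))

raw = keys @ q               # six unscaled scores, spread grows with d_k
scaled = raw / np.sqrt(d_k)  # same scores divided by sqrt(64) = 8

print("max weight, unscaled:", softmax(raw).max())
print("max weight, scaled:  ", softmax(scaled).max())
```

The unscaled distribution is at least as peaked as the scaled one, since dividing the scores by a constant greater than 1 can only flatten the softmax output.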


### Step 4: Converting Scores to Probabilities (Softmax)

**Purpose:** The softmax function converts raw attention scores into a probability distribution, ensuring all attention weights for a token sum to 1.0.

**Mathematical Formula:**
```
Attention_Weights[i, j] = exp(Scaled_Scores[i, j]) / sum(exp(Scaled_Scores[i, k]) for k in range(n))

```

**Example - After Softmax:**
```
Attention weights for "cat" looking at all tokens:
"The" → 0.05  (very low attention)
"cat" → 0.15  (moderate self-attention)
"sat" → 0.65  (high attention - verb relates strongly to subject)
"on"  → 0.03  (very low attention)
"the" → 0.07  (low attention)
"mat" → 0.05  (very low attention)

Sum = 1.00 (probabilities must sum to 1)
```
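
The formula can be applied directly to the score matrix from Step 3. This is a sketch, with `softmax_rows` as a helper name chosen for this example:

```python
import numpy as np

# Raw scores copied from the Step 3 example
# (rows and columns in the order: The, cat, sat, on, the, mat).
scores = np.array([
    [2.1, 1.2, 0.8, 0.9, 2.0, 0.7],
    [1.1, 3.2, 4.1, 0.6, 1.0, 1.8],
    [0.9, 4.0, 2.8, 2.1, 0.8, 2.2],
    [0.7, 1.5, 2.3, 1.9, 1.1, 3.1],
    [1.8, 1.0, 0.9, 1.2, 2.1, 0.8],
    [0.6, 2.1, 1.9, 2.8, 0.7, 2.5],
])
d_k = 64
scaled = scores / np.sqrt(d_k)

def softmax_rows(x):
    # Normalise each row independently so that every row sums to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax_rows(scaled)
print(weights.sum(axis=-1))  # every row sums to 1
print(weights[1].argmax())   # 2, i.e. "cat" attends most to "sat"
```

On these particular scores the distribution is much flatter than the weights listed above ("sat" receives about 0.21 rather than 0.65), which suggests the listed weights are illustrative rather than derived from this matrix.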

### Step 5: Computing Final Output

**Process:** The attention weights are used to compute a weighted average of all Value vectors, creating a context-aware representation for each token.

**Mathematical Formula:**
```
Output = Attention_Weights × V
Where:
 Attention_Weights has shape (n, n)
 V has shape (n, d_v)

Resulting Output will have shape (n, d_v) = (6, 64)

```

**Detailed Example for "cat":**
```
cat_output = (
    0.05 * V_"The" +    # Contribution from "The"
    0.15 * V_"cat" +    # Contribution from "cat" itself
    0.65 * V_"sat" +    # Highest attention weight, meaning "sat" is most relevant
    0.03 * V_"on" +     # Smaller contributions from other words
    0.07 * V_"the" +
    0.05 * V_"mat"
)

cat_output is the output vector for the word "cat" after applying attention.
It is a weighted sum of the value vectors (V) of all words in the sequence.
The weights (like 0.05, 0.15, etc.) are the attention weights for "cat", i.e. its scores after scaling and softmax.
```

**Result Interpretation:** The output representation for "cat" now includes:
- Primarily information from "sat" (65% weight) - understanding the action
- Some self-information (15% weight) - retaining its identity
- Small contributions from other tokens - maintaining broader context
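
The weighted sum for "cat" can be sketched with the example weights and random stand-in value vectors (the values of `V` are synthetic, so only the structure of the computation is meaningful):

```python
import numpy as np

rng = np.random.default_rng(0)
d_v = 64
tokens = ["The", "cat", "sat", "on", "the", "mat"]
V = rng.standard_normal((len(tokens), d_v))  # random stand-in for value vectors

# Attention weights for "cat", taken from the example above.
cat_weights = np.array([0.05, 0.15, 0.65, 0.03, 0.07, 0.05])

# Explicit weighted sum over the six value vectors ...
cat_output = sum(w * v for w, v in zip(cat_weights, V))

# ... which equals one row of Attention_Weights @ V.
assert np.allclose(cat_output, cat_weights @ V)
print(cat_output.shape)  # (64,)
```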


### Complete Mathematical Summary

**The entire attention mechanism in one formula:**
```
Attention(Q, K, V) = softmax((Q × K^T) / sqrt(d_k)) × V

X shape: (n, d_model) = (6, 512)
W_q, W_k shapes: (d_model, d_k) = (512, 64)
W_v shape: (d_model, d_v) = (512, 64)

Q = X × W_q    # (6, 512) × (512, 64) = (6, 64)
K = X × W_k    # (6, 512) × (512, 64) = (6, 64)
V = X × W_v    # (6, 512) × (512, 64) = (6, 64)
```


### Why This Mechanism is Powerful

**Parallel Processing:** Unlike RNNs that process tokens sequentially, attention processes all tokens simultaneously, making it much faster and allowing for better parallelization.

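**Long-Range Dependencies:** Any token can directly attend to any other token, regardless of distance. The path between two positions is a single step, which avoids the long chains of recurrent updates that made distant dependencies hard for RNNs to learn (the vanishing gradient problem).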

**Dynamic Context:** The attention weights are computed dynamically based on the content, allowing the model to focus on different aspects depending on what's most relevant for the current context.

**Multi-Head Attention:** In practice, transformers use multiple attention heads (commonly 8 to 12), each of which can learn to focus on a different type of relationship, such as:
- Syntactic relationships (subject-verb, adjective-noun)
- Semantic relationships (word meanings and associations)
- Positional relationships (word order and sequence)
- Discourse relationships (coreference)

This comprehensive attention mechanism enables Transformers to build rich, contextual representations that capture both local and global dependencies in language.
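
A minimal multi-head sketch, assuming 8 heads of 64 dimensions each (8 × 64 = 512) and random stand-in weights. The output projection `W_o` and the helper names `split_heads` and `softmax_rows` are not defined in this document; they follow the standard formulation in which the heads are concatenated and then mixed by one more linear layer:

```python
import numpy as np

def softmax_rows(x):
    # Softmax over the last axis, applied independently to every row.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 512, 8
d_k = d_model // num_heads  # 64 dimensions per head

X = rng.standard_normal((n, d_model))
# Random stand-ins for learned weights; each matrix covers all heads at once.
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

def split_heads(M):
    # (n, d_model) -> (num_heads, n, d_k)
    return M.reshape(n, num_heads, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, n, n)
weights = softmax_rows(scores)                    # one attention matrix per head
heads = weights @ V                               # (num_heads, n, d_k)

# Concatenate the heads back to (n, d_model), then mix them with W_o.
concat = heads.transpose(1, 0, 2).reshape(n, d_model)
output = concat @ W_o
print(output.shape)  # (6, 512)
```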

\begin{equation}
    \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V
\end{equation}


#### Generate Data

In [None]:
import numpy as np
import math

In [None]:
seq_len, d_k, d_v = 4, 8, 8
Q = np.random.randn(seq_len, d_k) # Q (Query): shape (seq_len, d_k) -> (4, 8)
K = np.random.randn(seq_len, d_k) # K (Key): shape (seq_len, d_k) -> (4, 8)
V = np.random.randn(seq_len, d_v) # V (Value): shape (seq_len, d_v) -> (4, 8)

Here, **seq_len = 4** means the sequence consists of 4 tokens (words), **d_k = 8** is the dimension of each query and key vector, and **d_v = 8** is the dimension of each value vector.

In [None]:
print("Q\n", Q)
print("K\n", K)
print("V\n", V)

Q
 [[ 0.09734245 -1.23944871 -1.95007434 -1.21171088 -1.39929826 -0.31226623
   0.31715713 -0.04633441]
 [-1.08375397  0.66662904 -0.98343286  0.0560969   0.89519004  0.10004913
  -1.0552281   0.69236504]
 [ 0.28716318 -1.82738228  0.81813136 -1.68986197  0.04376673  0.3502005
  -0.0423009  -0.41868561]
 [ 1.50431526 -0.22491201 -0.05686595  0.33269655  0.02673335 -0.1195548
  -0.45053287  0.6156462 ]]
K
 [[ 0.86628805  0.94090168 -0.26610346 -0.64998308  1.0927421   0.84253231
   0.87369975  1.05338847]
 [-1.424108   -0.32713153  0.71388877  0.61280272  0.17939416 -0.87583657
  -0.52110976 -1.161473  ]
 [-0.76329217  1.07186251 -0.48726729  0.67996207  0.26643412 -0.89958272
   1.88396001  0.49325744]
 [-1.71032029 -1.1453007  -0.91363038 -0.85842559 -1.24469381 -0.30221369
   1.13283269  0.7531194 ]]
V
 [[ 1.07902494 -1.26877275  0.23930751 -0.35502214 -0.95741432  0.12459074
   0.76431504  1.23321664]
 [ 1.7843734  -2.08439949  0.436698    0.21342351  2.24085292 -0.47206793
   0.788

#### Self Attention

\begin{equation}
    \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V
\end{equation}


In [None]:
np.matmul(Q, K.T) # Q has shape (4, 8) and K.T has shape (8, 4), so the result will have shape (4, 4)

array([[-1.33923421, -1.95682855, -0.79378427,  6.23532513],
       [ 0.78350543,  0.47631351,  0.56110739,  0.12197776],
       [-0.72506282, -0.05318572, -4.31516902,  1.78136061],
       [ 1.07380337, -2.27622329, -1.56581858, -2.59277681]])

**Why we need sqrt(d_k) in the denominator**

In [None]:
print("Q variance:", Q.var())
print("K variance:", K.var())
print("Q.K(T) variance:", np.matmul(Q, K.T).var())

Q variance: 0.7053850744157935
K variance: 0.867173339673601
Q.K(T) variance: 5.176241421279855


In [None]:
scaled = np.matmul(Q, K.T) / math.sqrt(d_k)  # divide the raw scores by sqrt(d_k)
print("Q variance:", Q.var())
print("K variance:", K.var())
print("normalized Q.K(T) variance:", scaled.var())

Q variance: 0.7053850744157935
K variance: 0.867173339673601
normalized Q.K(T) variance: 0.6470301776599817


**Notice the reduction in variance of the product**
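
The reduction follows from a short variance argument, sketched here under the assumption that the components of a query $q$ and a key $k$ are independent with zero mean and unit variance (the random Q and K above only approximate this):

$$
\operatorname{Var}(q \cdot k) = \operatorname{Var}\left( \sum_{i=1}^{d_k} q_i k_i \right) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k,
\qquad
\operatorname{Var}\left( \frac{q \cdot k}{\sqrt{d_k}} \right) = \frac{d_k}{d_k} = 1
$$

Dividing by $\sqrt{d_k}$ divides the variance by $d_k$. With $d_k = 8$, that matches the drop seen above from about 5.18 to about 0.65.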

In [None]:
scaled

array([[-0.47349079, -0.69184337, -0.28064512,  2.20452034],
       [ 0.277011  ,  0.16840226,  0.19838142,  0.04312565],
       [-0.25634842, -0.01880399, -1.52564264,  0.62980608],
       [ 0.37964682, -0.80476646, -0.55360047, -0.91668503]])

#### Masking
- Masking ensures that a token cannot draw context from tokens that come after it, which have not been generated yet.
- It is not required in the encoder, but it is required in the decoder.

In [None]:
mask = np.triu(np.ones((seq_len, seq_len)), k=1)
mask

array([[0., 1., 1., 1.],
       [0., 0., 1., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.]])

The mask has shape (**seq_len**, **seq_len**), the same as the attention score matrix it is added to.

In [None]:
mask[mask == 1] = -np.inf
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])

In [None]:
scaled + mask

array([[-0.47349079,        -inf,        -inf,        -inf],
       [ 0.277011  ,  0.16840226,        -inf,        -inf],
       [-0.25634842, -0.01880399, -1.52564264,        -inf],
       [ 0.37964682, -0.80476646, -0.55360047, -0.91668503]])
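
The two-step construction above (build a 0/1 matrix, then overwrite the ones with `-inf`) can also be written as a single expression. A minimal sketch, where `causal_mask` is a hypothetical helper name:

```python
import numpy as np

def causal_mask(seq_len):
    # Hypothetical helper: 0 on and below the diagonal, -inf strictly above it.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

m = causal_mask(4)
print(m)
```

Because the mask is additive, `scaled + causal_mask(seq_len)` leaves visible positions unchanged and sets future positions to `-inf`.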

#### Softmax

$$
\text{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$


In [None]:
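def softmax(x):
    exp_x = np.exp(x)
    # exp_x.sum() has no axis argument, so it normalises over the whole matrix:
    # all entries together sum to 1, but individual rows do not. The row-wise
    # version (axis=-1) is defined in the Function section.
    total = exp_x.sum()
    return exp_x / total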

In [None]:
attention = softmax(scaled + mask)

In [None]:
attention

array([[0.07803034, 0.        , 0.        , 0.        ],
       [0.16527315, 0.14826346, 0.        , 0.        ],
       [0.09695434, 0.12295084, 0.02724707, 0.        ],
       [0.18313716, 0.05602635, 0.07202319, 0.05009411]])
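
The rows printed above do not sum to 1 (the first row totals about 0.078), because `exp_x.sum()` without an `axis` argument adds up every entry of the matrix. A minimal sketch of the difference, using the masked scores printed earlier:

```python
import numpy as np

# Masked, scaled scores copied from the output of `scaled + mask` above.
masked = np.array([
    [-0.47349079, -np.inf, -np.inf, -np.inf],
    [0.277011, 0.16840226, -np.inf, -np.inf],
    [-0.25634842, -0.01880399, -1.52564264, -np.inf],
    [0.37964682, -0.80476646, -0.55360047, -0.91668503],
])

exp_x = np.exp(masked)  # exp(-inf) is exactly 0, so masked positions drop out

whole_matrix = exp_x / exp_x.sum()                    # one normaliser for all entries
row_wise = exp_x / exp_x.sum(axis=-1, keepdims=True)  # one normaliser per row

print(whole_matrix.sum(axis=-1))  # row sums differ; they add up to 1 only in total
print(row_wise.sum(axis=-1))      # [1. 1. 1. 1.]
print(row_wise[0])                # [1. 0. 0. 0.]
```

With row-wise normalisation the first token, which can only see itself, places all of its attention on itself.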

In [None]:
new_V = np.matmul(attention, V)
new_V

array([[ 0.08419669, -0.09900277,  0.01867325, -0.0277025 , -0.07470737,
         0.00972186,  0.05963976,  0.09622832],
       [ 0.44289122, -0.51873434,  0.10429746, -0.02703272,  0.17400172,
        -0.04939892,  0.24317385,  0.39681055],
       [ 0.29852493, -0.34882499,  0.0530059 ,  0.02716019,  0.16791201,
        -0.09774164,  0.18362196,  0.27451335],
       [ 0.30384966, -0.22216965, -0.09406749,  0.03175239, -0.08274747,
        -0.12484244,  0.22354814,  0.34666975]])

### Function

In [None]:
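def softmax(x):
  # Row-wise softmax; subtracting the row maximum avoids overflow in exp.
  exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
  return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def self_attention(Q, K, V, mask=None):
  d_k = Q.shape[-1]
  # Scaled dot-product scores, shape (seq_len, seq_len).
  scaled = np.matmul(Q, K.T) / math.sqrt(d_k)
  if mask is not None:
    # Additive mask: -inf entries receive zero weight after softmax.
    scaled = scaled + mask
  attention = softmax(scaled)
  out = np.matmul(attention, V)
  return out, attention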
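
One way to sanity-check a function like this is to test the properties the mask is meant to guarantee. The sketch below repeats the function so that it runs on its own, uses freshly generated random inputs, and checks that each row of weights sums to 1, that no weight falls on future positions, and that changing the last token does not affect earlier outputs:

```python
import math
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def self_attention(Q, K, V, mask=None):
    # Mirrors the function above, repeated so this block is self-contained.
    scaled = np.matmul(Q, K.T) / math.sqrt(Q.shape[-1])
    if mask is not None:
        scaled = scaled + mask
    attention = softmax(scaled)
    return np.matmul(attention, V), attention

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8
Q, K, V = (rng.standard_normal((seq_len, d)) for d in (d_k, d_k, d_v))
mask = np.where(np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1), -np.inf, 0.0)

out, attention = self_attention(Q, K, V, mask=mask)

# Each row is a probability distribution over the visible positions.
assert np.allclose(attention.sum(axis=-1), 1.0)
# No weight is placed on future positions.
assert np.all(attention[np.triu_indices(seq_len, k=1)] == 0)

# Changing the last token's key and value leaves earlier outputs unchanged.
K2, V2 = K.copy(), V.copy()
K2[-1] += 1.0
V2[-1] += 1.0
out2, _ = self_attention(Q, K2, V2, mask=mask)
assert np.allclose(out[:-1], out2[:-1])
```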

In [None]:
values, attention = self_attention(Q, K, V, mask=mask)
print("Q\n", Q)
print("K\n", K)
print("V\n", V)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[-1.55805137e+00  1.74599065e+00  6.13182772e-03  9.16655390e-02
  -4.19406664e-01  1.20209171e+00 -3.22952013e-01 -1.57484697e+00]
 [-6.03090864e-01 -1.39311745e+00  5.41038476e-04  8.68501500e-01
   1.01740729e+00 -1.23332745e+00  1.31092668e+00  1.07903594e+00]
 [ 4.15527794e-01  2.58613492e-01  1.41064692e+00  1.89567492e+00
  -1.88668455e-01  1.22294262e+00  1.73747406e+00  1.29566400e+00]
 [-1.65589298e+00  9.20265360e-01  1.67641560e-01  1.55976837e+00
   6.85358948e-02  1.00433167e+00  3.21650406e-01 -1.95328813e-01]]
K
 [[ 0.31691824 -0.74019652 -0.16705587  1.15992309 -1.55371717 -0.86020114
   0.75528109 -0.38178461]
 [ 0.2937555  -1.16828174  0.65689693  0.63906619  0.52966841  0.56031207
  -0.1390223  -0.35731796]
 [-0.14005376  1.02710414  0.37303933 -0.28437786 -1.08703429  0.01543534
  -0.38521217  0.77901054]
 [ 1.85600985  1.40490388 -0.41918586  0.43513747  1.02341881  0.765293
   1.34713076  0.31635243]]
V
 [[ 0.99755393  1.56166281  0.2986221   0.47543551  0.81