### üß† Self-Attention  

Embeddings of tokens are **independent of each other**, which means they **cannot capture the context** of a sentence.  

**Self-Attention** helps overcome this by converting these **independent token embeddings** into **contextualized embeddings** ‚Äî where the representation of each word depends on all other words in the sentence.  

In simple terms,  
> The embedding of a word becomes a **weighted sum** of all other embeddings in the same sentence.

---

#### üì• Input  
Input shape: (batch_size, context_len)


---

#### ‚öôÔ∏è Problem with Simple Dot Product  
If we only take the **dot product** of embeddings to compute attention scores, the same words across different sentences would get **identical scores**, even though their meanings may differ depending on context.

---

#### üí° Solution ‚Äî Introduce Query, Key, and Value  

We introduce **three learnable matrices** that transform each embedding into different representations:

\[
\begin{aligned}
W_Q &: (\text{context\_len}, \text{context\_len}) \\
W_K &: (\text{context\_len}, \text{context\_len}) \\
W_V &: (\text{context\_len}, \text{context\_len})
\end{aligned}
\]

These are used to generate:
- **Q (Query):** What the current word is *looking for* in other words.  
- **K (Key):** What each word *contains* that others might look for.  
- **V (Value):** The actual information content of each word.  

---

#### üß© Self-Attention Formula  

The **contextualized embedding** for each token is computed as:
$$

\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V

$$

Where:
- \( QK^T \) gives **similarity scores** between tokens  
- \( d_k \) is the **dimension of the key vector**, used for scaling  
- **softmax** normalizes the scores into probabilities  
- Finally, multiplying by \( V \) gives a **weighted sum** of values (contextualized embeddings)


In [26]:
import torch
# this is one sentence input
inputs = torch.rand(6 , 3)
print(f"Context size : ({inputs.shape[0]}), Embedding dim : ({inputs.shape[1]})")
inputs

Context size : (6), Embedding dim : (3)


tensor([[0.2428, 0.9521, 0.8071],
        [0.4264, 0.6224, 0.9260],
        [0.9122, 0.2526, 0.2183],
        [0.4540, 0.9181, 0.6634],
        [0.4926, 0.6716, 0.5819],
        [0.8921, 0.5329, 0.0452]])

In [None]:
# for making query, key, value 
din = inputs.shape[1]
dout = 3

Wq = torch.nn.Parameter(torch.rand(din, dout), requires_grad=False)
Wk = torch.nn.Parameter(torch.rand(din, dout), requires_grad=False)
Wv = torch.nn.Parameter(torch.rand(din, dout), requires_grad=False)

In [12]:
queries = inputs @ Wq.weight
keys = inputs @ Wk.weight
values = inputs @ Wv.weight

print(f"Queries shape : {queries.shape}")
print(f"Keys shape : {keys.shape}")
print(f"Values shape : {values.shape}")

Queries shape : torch.Size([6, 3])
Keys shape : torch.Size([6, 3])
Values shape : torch.Size([6, 3])


In [18]:
# Now we need to calculate attention score by getting dot product
# For now it should be 6 x 6 matrix as we have 6 words so each word will have 6 weights 

attention_scores = queries @ keys.T
print(f"Shape of attention scores : {attention_scores.shape}")
attention_scores

Shape of attention scores : torch.Size([6, 6])


tensor([[-1.8795e+00, -2.8734e+00, -1.0750e+00, -1.1436e+00, -2.5227e+00,
         -1.1112e-01],
        [-2.3830e+00, -3.7004e+00, -1.3775e+00, -1.4084e+00, -3.2607e+00,
         -3.9009e-02],
        [-8.9538e-01, -1.3941e+00, -4.2938e-01, -4.8184e-01, -1.1525e+00,
         -4.5575e-04],
        [-1.4712e+00, -2.2130e+00, -7.8770e-01, -8.9895e-01, -1.8970e+00,
         -1.4746e-01],
        [-1.9428e+00, -3.0416e+00, -1.0526e+00, -1.0919e+00, -2.6192e+00,
          1.8658e-02],
        [-9.7863e-01, -1.3951e+00, -5.2670e-01, -6.6487e-01, -1.1975e+00,
         -2.3683e-01]], grad_fn=<MmBackward0>)

### ‚öñÔ∏è Why Do We Divide by ‚àöd in Self-Attention?

In the **self-attention mechanism**, we compute similarity between tokens using the dot product of **Query (Q)** and **Key (K)** vectors:

$$
\text{scores} = QK^T
$$

---

### üí° The Problem: Large Dot Products

When the embedding dimension \( d_k \) is large, each element in \( Q \) and \( K \) contributes to the sum ‚Äî making the **dot product values grow very large**.

These large values are passed through a **softmax** function:

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$

If \( x_i \) are too large:
- \( e^{x_i} \) becomes extremely large  
- Softmax output becomes **very peaky** (one value ‚âà 1, others ‚âà 0)  
- Gradients become **tiny** ‚Üí leading to **unstable training**

---

### ‚úÖ The Solution: Scaling by \( \sqrt{d_k} \)

To fix this, we **scale down** the dot product by dividing it by the **square root of the key dimension**:

$$
\text{Scaled Scores} = \frac{QK^T}{\sqrt{d_k}}
$$

This ensures that:
- The variance of the dot product remains roughly **constant (‚âà 1)**  
- Softmax values stay in a **reasonable range**  
- Training becomes **stable and smooth**

---

### üß† Intuitive Analogy

Think of \( \sqrt{d_k} \) as a **normalization factor**.

When vectors have more dimensions (larger \( d_k \)), their dot products naturally become larger ‚Äî like louder ‚Äúsignals‚Äù.  
Dividing by \( \sqrt{d_k} \) **keeps the volume consistent**, so the model doesn‚Äôt get overwhelmed.

---

### üî¢ Example (Conceptually)

If elements of \( Q \) and \( K \) each have mean 0 and variance 1:

$$
\text{Var}(QK^T) = d_k
$$

Then dividing by \( \sqrt{d_k} \) gives:

$$
\text{Var}\left(\frac{QK^T}{\sqrt{d_k}}\right) = 1
$$

‚úÖ This keeps the scores well-scaled across different embedding dimensions, ensuring **consistent attention behavior**.

---


In [23]:
attention_scores_new = attention_scores / torch.sqrt(torch.tensor(din, dtype=torch.float32))

In [24]:
# Now we need to apply softmax on the scores to make them stable and interpretable 
attention_scores_sclaed = torch.softmax(attention_scores_new, dim=-1)
attention_scores_sclaed

tensor([[0.1227, 0.0691, 0.1952, 0.1877, 0.0846, 0.3406],
        [0.1055, 0.0493, 0.1884, 0.1851, 0.0635, 0.4081],
        [0.1456, 0.1092, 0.1906, 0.1849, 0.1255, 0.2441],
        [0.1341, 0.0874, 0.1990, 0.1866, 0.1049, 0.2880],
        [0.1161, 0.0615, 0.1940, 0.1897, 0.0785, 0.3602],
        [0.1493, 0.1174, 0.1938, 0.1789, 0.1316, 0.2291]],
       grad_fn=<SoftmaxBackward0>)

In [20]:
# Now we need to get the contextualized embeddings for each input word in sentence
context_inputs = attention_scores_sclaed @ values
print(f"Shape of context inputs : {context_inputs.shape}")

Shape of context inputs : torch.Size([6, 3])


In [25]:
print(context_inputs)

tensor([[ 0.1820, -0.1630, -1.4938],
        [ 0.1591, -0.0893, -1.5360],
        [ 0.2222, -0.3157, -1.4101],
        [ 0.2032, -0.2354, -1.4522],
        [ 0.1751, -0.1382, -1.5092],
        [ 0.2294, -0.3440, -1.3900]], grad_fn=<MmBackward0>)


See how different contextualized word embeddings are as compared to input embeddings

#### Self-Attention Class

In [30]:
import torch.nn as nn 

class SelfAttention(nn.Module):
    def __init__(self, din, dout):
        super().__init__()
        self.Wq = torch.nn.Parameter(torch.rand(din, dout))
        self.Wk = torch.nn.Parameter(torch.rand(din, dout))
        self.Wv = torch.nn.Parameter(torch.rand(din, dout))

    def forward(self, x):
        # x is input senetence (context_len, embedding_dim)
        # calculate key, query and value
        keys = x @ self.Wk
        queries = x @ self.Wq 
        values = x @ self.Wv 

        # attention scores 
        attn_scores = queries @ keys.T 
        # scaling attention scores 
        attn_scores = attn_scores / torch.sqrt(torch.tensor(x.shape[1], dtype=torch.float32))

        scaled_attn_weights = torch.softmax(attn_scores, dim=-1)
        contextualized_inputs = scaled_attn_weights @ values 
        return contextualized_inputs

In [32]:
self_attention = SelfAttention(3, 3)
inputs2 = torch.rand(6, 3)
inputs2

tensor([[0.7492, 0.9378, 0.9723],
        [0.0987, 0.7818, 0.0461],
        [0.0569, 0.5981, 0.5543],
        [0.1916, 0.9791, 0.7765],
        [0.4310, 0.1584, 0.6319],
        [0.6062, 0.5110, 0.2282]])

In [33]:
contextualized_inputs2 = self_attention(inputs2)
contextualized_inputs2

tensor([[0.4659, 1.3326, 1.3747],
        [0.4154, 1.1671, 1.2156],
        [0.4203, 1.1841, 1.2316],
        [0.4483, 1.2763, 1.3203],
        [0.4114, 1.1536, 1.2027],
        [0.4207, 1.1842, 1.2328]], grad_fn=<MmBackward0>)