## Self Attention in Transformers

Self-attention lets a model **pay attention** to different parts of its input: when it processes one token, it weighs how relevant every token in the same sequence is.

### Imagine a Sentence:

Let’s say we have a sentence: **The cat sat on the mat.**

Self-attention helps the model figure out which words in this sentence should be focused on when processing each word. For example, when processing the word **cat**, the model might pay attention to words like **the** (before it) and **sat** (after it) to understand its meaning better.

### The Process:

1. **Input:** The sentence is broken down into words: "The", "cat", "sat", "on", "the", "mat".
2. **Embedding:** Each word gets converted into a vector.
3. **Queries, Keys, and Values:**
   - Each word gets three vectors: Query (Q), Key (K), and Value (V). These are created by multiplying the word's embedding vector by three different learned weight matrices.
   - The **Query (Q)**: What the word is looking for
   - The **Key (K)**: What the word offers to others
   - The **Value (V)**: The actual information the word contains
4. **Attention Mechanism:**
   - Each word's **Query (Q)** vector is compared with the **Key (K)** vectors of every word in the sentence, including its own.
   - Each comparison is a dot product, which gives a **score**: the higher the score, the more attention the word pays to that position. The scores are divided by sqrt(d_k) and passed through a softmax, so that for each word they become positive weights that sum to 1.
   - These weights are then used to calculate a weighted average of the **Value (V)** vectors, which gives the output for that word.


\begin{equation}
    \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V
\end{equation}
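
Steps 2 and 3 of the process can be sketched in a few lines of NumPy. This is only an illustration: the embeddings `X` and the weight matrices `W_q`, `W_k`, `W_v` are random stand-ins (a trained model would learn them), and `d_model = 16` is an assumed size chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, d_k, d_v = 16, 8, 8  # illustrative sizes, not taken from a real model

# Step 2: one embedding vector per token (random stand-ins for learned embeddings).
X = rng.standard_normal((len(tokens), d_model))

# Step 3: three separate weight matrices (learned in a real model, random here).
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token offers to others
V = X @ W_v  # the information each token carries

print(Q.shape, K.shape, V.shape)  # (6, 8) (6, 8) (6, 8)
```

Each token ends up with one row in each of `Q`, `K`, and `V`, which is the shape the attention formula expects.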


#### Generate Data

In [1]:
import numpy as np
import math

In [2]:
seq_len, d_k, d_v = 4, 8, 8
Q = np.random.randn(seq_len, d_k) # Q (Query): shape (seq_len, d_k) -> (4, 8)
K = np.random.randn(seq_len, d_k) # K (Key): shape (seq_len, d_k) -> (4, 8)
V = np.random.randn(seq_len, d_v) # V (Value): shape (seq_len, d_v) -> (4, 8)

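Here, **seq_len = 4** means the sequence consists of 4 tokens (words), **d_k = 8** is the dimension of each query and key vector, and **d_v = 8** is the dimension of each value vector. The random matrices stand in for the Q, K, and V that a real model would compute from its embeddings.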

In [3]:
print("Q\n", Q)
print("K\n", K)
print("V\n", V)

Q
 [[ 0.09734245 -1.23944871 -1.95007434 -1.21171088 -1.39929826 -0.31226623
   0.31715713 -0.04633441]
 [-1.08375397  0.66662904 -0.98343286  0.0560969   0.89519004  0.10004913
  -1.0552281   0.69236504]
 [ 0.28716318 -1.82738228  0.81813136 -1.68986197  0.04376673  0.3502005
  -0.0423009  -0.41868561]
 [ 1.50431526 -0.22491201 -0.05686595  0.33269655  0.02673335 -0.1195548
  -0.45053287  0.6156462 ]]
K
 [[ 0.86628805  0.94090168 -0.26610346 -0.64998308  1.0927421   0.84253231
   0.87369975  1.05338847]
 [-1.424108   -0.32713153  0.71388877  0.61280272  0.17939416 -0.87583657
  -0.52110976 -1.161473  ]
 [-0.76329217  1.07186251 -0.48726729  0.67996207  0.26643412 -0.89958272
   1.88396001  0.49325744]
 [-1.71032029 -1.1453007  -0.91363038 -0.85842559 -1.24469381 -0.30221369
   1.13283269  0.7531194 ]]
V
 [[ 1.07902494 -1.26877275  0.23930751 -0.35502214 -0.95741432  0.12459074
   0.76431504  1.23321664]
 [ 1.7843734  -2.08439949  0.436698    0.21342351  2.24085292 -0.47206793
   0.788

#### Self Attention

\begin{equation}
    \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V
\end{equation}


In [4]:
np.matmul(Q, K.T) # Q has shape (4, 8) and K.T has shape (8, 4), so the result will have shape (4, 4)

array([[-1.33923421, -1.95682855, -0.79378427,  6.23532513],
       [ 0.78350543,  0.47631351,  0.56110739,  0.12197776],
       [-0.72506282, -0.05318572, -4.31516902,  1.78136061],
       [ 1.07380337, -2.27622329, -1.56581858, -2.59277681]])
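
To make the meaning of this matrix concrete, the small sketch below checks that entry (i, j) of the product is the dot product of query i with key j. It uses its own randomly generated `Q` and `K` (with an assumed seed), so the numbers differ from the ones printed above.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 queries of dimension 8
K = rng.standard_normal((4, 8))  # 4 keys of dimension 8

scores = Q @ K.T  # shape (4, 4): one score per (query, key) pair

# Entry (i, j) is the dot product of query i with key j.
i, j = 2, 3
matches = np.isclose(scores[i, j], np.dot(Q[i], K[j]))
print(scores.shape, matches)  # (4, 4) True
```

Row i of the matrix therefore holds the raw scores of token i against every token in the sequence.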

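**Why we need sqrt(d_k) in the denominator**

Each entry of Q.K(T) is a sum of d_k products. If the entries of Q and K have roughly unit variance, the variance of that sum grows roughly in proportion to d_k. Large scores push the softmax towards putting almost all weight on one position, so the scores are divided by sqrt(d_k) to bring their variance back to the scale of the inputs.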

In [5]:
print("Q variance:", Q.var()) 
print("K variance:", K.var())
print("Q.K(T) variance:", np.matmul(Q, K.T).var())

Q variance: 0.7053850744157935
K variance: 0.867173339673601
Q.K(T) variance: 5.176241421279855


In [7]:
scaled = np.matmul(Q, K.T) / math.sqrt(d_k)
print("Q variance:", Q.var()) 
print("K variance:", K.var())
print("normalized Q.K(T) variance:", scaled.var())

Q variance: 0.7053850744157935
K variance: 0.867173339673601
normalized Q.K(T) variance: 0.6470301776599817


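**Notice the reduction in the variance of the product: dividing by sqrt(d_k) cuts it by a factor of d_k = 8 (5.176 / 8 ≈ 0.647), back to the same scale as Q and K.**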
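
With only a 4 x 4 score matrix the variance estimate is noisy. The sketch below repeats the experiment with many random query/key pairs, assuming standard-normal entries, to show how the variance of the raw dot product tracks `d_k` while the scaled version stays near 1. The sample count, the seed, and the list of dimensions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # number of random (query, key) pairs per dimension

results = {}
for d_k in (8, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)       # one dot product per pair
    scaled = dots / np.sqrt(d_k)       # same scaling as in the attention formula
    results[d_k] = (dots.var(), scaled.var())
    print(f"d_k={d_k:3d}  var(q.k)={dots.var():7.2f}  var(scaled)={scaled.var():.3f}")
```

The raw variance should come out close to `d_k` in each case, and the scaled variance close to 1.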

In [8]:
scaled

array([[-0.47349079, -0.69184337, -0.28064512,  2.20452034],
       [ 0.277011  ,  0.16840226,  0.19838142,  0.04312565],
       [-0.25634842, -0.01880399, -1.52564264,  0.62980608],
       [ 0.37964682, -0.80476646, -0.55360047, -0.91668503]])

#### Masking
- This ensures a word cannot get context from words that come after it in the sequence, which at generation time have not been produced yet.
- Not required in the encoder, but required in the decoder.

In [9]:
mask = np.triu(np.ones((seq_len, seq_len)), k=1)
mask

array([[0., 1., 1., 1.],
       [0., 0., 1., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.]])

The mask has shape **(seq_len, seq_len)**, the same as the score matrix, so it can be added to the scores element-wise.

In [10]:
mask[mask == 1] = -np.inf
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])

In [11]:
scaled + mask

array([[-0.47349079,        -inf,        -inf,        -inf],
       [ 0.277011  ,  0.16840226,        -inf,        -inf],
       [-0.25634842, -0.01880399, -1.52564264,        -inf],
       [ 0.37964682, -0.80476646, -0.55360047, -0.91668503]])
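
The same masked scores can also be built from a boolean mask, which some readers find easier to reason about. The sketch below is one possible alternative, not the method used in this notebook: the `scores` matrix is made up, and the names `allowed` and `masked_scores` are introduced only for this example. It also shows why `-inf` is used: `exp(-inf)` is exactly 0, so masked positions get zero weight once the exponentials are normalized row by row.

```python
import numpy as np

seq_len = 4
# Made-up scores, just to have something to mask.
scores = np.arange(seq_len * seq_len, dtype=float).reshape(seq_len, seq_len)

# allowed[i, j] is True when position i may attend to position j (j <= i).
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked_scores = np.where(allowed, scores, -np.inf)

# The additive mask from the cells above gives the same result.
additive = np.triu(np.ones((seq_len, seq_len)), k=1)
additive[additive == 1] = -np.inf
same = np.array_equal(masked_scores, scores + additive)

# exp(-inf) == 0, so masked positions contribute nothing after normalization.
weights = np.exp(masked_scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
print(same)
print(weights.round(3))
```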

#### Softmax

$$
\text{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$
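
As a small worked example of the formula on a single row, the sketch below computes the softmax of `[1, 2, 3]`. It uses a common variant that subtracts the row maximum before exponentiating; this leaves the result unchanged mathematically but avoids overflow for large inputs. The function name `stable_softmax` is introduced here for illustration only.

```python
import numpy as np

def stable_softmax(x):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

row = np.array([1.0, 2.0, 3.0])
small = stable_softmax(row)
print(small.round(3))  # [0.09  0.245 0.665]

# Shifting every input by the same constant does not change the weights,
# even though np.exp(1000.0) on its own would overflow to inf.
big = stable_softmax(row + 1000.0)
print(np.allclose(small, big))  # True
```

Larger inputs get exponentially larger weights, and the weights always sum to 1.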


In [12]:
def softmax(x):
    # Normalize each row separately (axis=-1) so that every row of weights sums to 1.
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

In [13]:
attention = softmax(scaled + mask)

In [14]:
np.round(attention, 3)

array([[1.   , 0.   , 0.   , 0.   ],
       [0.527, 0.473, 0.   , 0.   ],
       [0.392, 0.497, 0.11 , 0.   ],
       [0.507, 0.155, 0.199, 0.139]])

Each row sums to 1 (up to rounding), and the masked positions get exactly zero weight.

In [15]:
new_V = np.matmul(attention, V)
np.round(new_V, 3)

array([[ 1.079, -1.269,  0.239, -0.355, -0.957,  0.125,  0.764,  1.233],
       [ 1.413, -1.654,  0.333, -0.086,  0.555, -0.158,  0.776,  1.266],
       [ 1.208, -1.411,  0.214,  0.11 ,  0.679, -0.395,  0.743,  1.111],
       [ 0.841, -0.615, -0.26 ,  0.088, -0.229, -0.346,  0.619,  0.96 ]])

### Function

In [15]:
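def softmax(x):
  # Row-wise softmax: normalize over the last axis so each row sums to 1.
  exp_x = np.exp(x)
  return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def self_attention(Q, K, V, mask=None):
  d_k = Q.shape[-1]
  # Scaled dot-product scores, shape (seq_len, seq_len).
  scaled = np.matmul(Q, K.T) / math.sqrt(d_k)
  if mask is not None:
    # Additive mask: 0 keeps a position, -inf removes it.
    scaled = scaled + mask
  attention = softmax(scaled)
  # Weighted average of the value vectors, shape (seq_len, d_v).
  out = np.matmul(attention, V)
  return out, attention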
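
A function of this shape can be sanity-checked with three properties that follow from the formula and the causal mask: every row of attention weights sums to 1, no weight falls on a future position, and the first output row equals the first value vector (the first token can only attend to itself). The sketch below restates the function compactly so that it runs on its own; the seed and the random inputs are assumptions made for the check.

```python
import math
import numpy as np

def softmax(x):
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def self_attention(Q, K, V, mask=None):
    scores = np.matmul(Q, K.T) / math.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask
    attention = softmax(scores)
    return np.matmul(attention, V), attention

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_v))

# Causal mask built the same way as in the Masking section.
mask = np.triu(np.ones((seq_len, seq_len)), k=1)
mask[mask == 1] = -np.inf

out, attention = self_attention(Q, K, V, mask=mask)

rows_sum_to_one = np.allclose(attention.sum(axis=-1), 1.0)
no_future_weight = bool(np.all(np.triu(attention, k=1) == 0))
first_row_is_v0 = np.allclose(out[0], V[0])
print(rows_sum_to_one, no_future_weight, first_row_is_v0)  # True True True
```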

In [16]:
values, attention = self_attention(Q, K, V, mask=mask)
print("Q\n", Q)
print("K\n", K)
print("V\n", V)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[-1.55805137e+00  1.74599065e+00  6.13182772e-03  9.16655390e-02
  -4.19406664e-01  1.20209171e+00 -3.22952013e-01 -1.57484697e+00]
 [-6.03090864e-01 -1.39311745e+00  5.41038476e-04  8.68501500e-01
   1.01740729e+00 -1.23332745e+00  1.31092668e+00  1.07903594e+00]
 [ 4.15527794e-01  2.58613492e-01  1.41064692e+00  1.89567492e+00
  -1.88668455e-01  1.22294262e+00  1.73747406e+00  1.29566400e+00]
 [-1.65589298e+00  9.20265360e-01  1.67641560e-01  1.55976837e+00
   6.85358948e-02  1.00433167e+00  3.21650406e-01 -1.95328813e-01]]
K
 [[ 0.31691824 -0.74019652 -0.16705587  1.15992309 -1.55371717 -0.86020114
   0.75528109 -0.38178461]
 [ 0.2937555  -1.16828174  0.65689693  0.63906619  0.52966841  0.56031207
  -0.1390223  -0.35731796]
 [-0.14005376  1.02710414  0.37303933 -0.28437786 -1.08703429  0.01543534
  -0.38521217  0.77901054]
 [ 1.85600985  1.40490388 -0.41918586  0.43513747  1.02341881  0.765293
   1.34713076  0.31635243]]
V
 [[ 0.99755393  1.56166281  0.2986221   0.47543551  0.81