## Self Attention in Transformers

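Self-attention lets a model **pay attention** to different parts of the input: for each token, it decides how much every token in the sequence should contribute to that token's representation.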

### Imagine a Sentence:

Let’s say we have a sentence: **The cat sat on the mat.**

Self-attention helps the model figure out which words in this sentence should be focused on when processing each word. For example, when processing the word **cat**, the model might pay attention to words like **the** (before it) and **sat** (after it) to understand its meaning better.

### The Process:

1. **Input:** The sentence is broken down into words: "The", "cat", "sat", "on", "the", "mat".
2. **Embedding:** Each word gets converted into a vector.
3. **Queries, Keys, and Values:**
   - Each word gets three vectors: Query (Q), Key (K), and Value (V). These are created by multiplying the original word vector by three different learned weight matrices.
   - The **Query (Q)** vector represents what the word you're currently focusing on is looking for.
   - The **Key (K)** vector represents what a word offers; it is compared against a query to measure how relevant that word is to the current word.
   - The **Value (V)** vector contains the meaning or information that a word passes along when it is attended to.
4. **Attention Mechanism:**
   - For each word, the model compares its **Query (Q)** vector with the **Key (K)** vectors of every word in the sentence (including itself) to figure out how much attention each word should get.
   - The comparison is a dot product, which gives a **score** for every pair of words. The scores are scaled and passed through a softmax so that, for each word, they become weights that sum to 1.
   - These weights are then used to calculate a weighted average of the **Value (V)** vectors, which gives the output for that word.
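The steps above can be sketched in NumPy. This is a minimal illustration rather than a trained model: the embeddings `X`, the weight matrices `W_q`, `W_k`, `W_v`, and the size `d_model` are hypothetical random stand-ins, whereas in a real Transformer the embeddings and weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, d_k, d_v = 16, 8, 8  # illustrative sizes, chosen arbitrarily

# Step 2: one embedding vector per token (random stand-ins for learned embeddings)
X = rng.standard_normal((len(tokens), d_model))

# Step 3: three separate weight matrices (random here, learned in practice)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 4: scaled scores -> softmax weights -> weighted average of values
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability only
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V

print(Q.shape, K.shape, V.shape)  # (6, 8) (6, 8) (6, 8)
print(output.shape)               # (6, 8)
```

Each row of `weights` belongs to one token and sums to 1, so each row of `output` is a weighted average of the rows of `V`.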


\begin{equation}
    \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V
\end{equation}
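
Written out for a single token $i$, the same formula reads as follows. The symbols $q_i$, $k_j$, $v_j$ (rows of $Q$, $K$, $V$) and $\alpha_{ij}$ (the attention weight) are notation introduced here for illustration:

\begin{equation}
    \alpha_{ij} = \frac{\exp\left( q_i \cdot k_j / \sqrt{d_k} \right)}{\sum_{l} \exp\left( q_i \cdot k_l / \sqrt{d_k} \right)}, \qquad \text{output}_i = \sum_{j} \alpha_{ij} \, v_j
\end{equation}

The weights $\alpha_{ij}$ are non-negative and sum to 1 over $j$, so each output is a weighted average of the value vectors.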


#### Generate Data

In [1]:
import numpy as np
import math

In [2]:
seq_len, d_k, d_v = 4, 8, 8
Q = np.random.randn(seq_len, d_k) # Q (Query): shape (seq_len, d_k) -> (4, 8)
K = np.random.randn(seq_len, d_k) # K (Key): shape (seq_len, d_k) -> (4, 8)
V = np.random.randn(seq_len, d_v) # V (Value): shape (seq_len, d_v) -> (4, 8)

Here, **seq_len = 4** means the sequence consists of 4 tokens (words), and **d_k = 8** is the dimension of each query and key vector. The value dimension **d_v** is also 8 in this example.

In [3]:
print("Q\n", Q)
print("K\n", K)
print("V\n", V)

Q
 [[-1.55805137e+00  1.74599065e+00  6.13182772e-03  9.16655390e-02
  -4.19406664e-01  1.20209171e+00 -3.22952013e-01 -1.57484697e+00]
 [-6.03090864e-01 -1.39311745e+00  5.41038476e-04  8.68501500e-01
   1.01740729e+00 -1.23332745e+00  1.31092668e+00  1.07903594e+00]
 [ 4.15527794e-01  2.58613492e-01  1.41064692e+00  1.89567492e+00
  -1.88668455e-01  1.22294262e+00  1.73747406e+00  1.29566400e+00]
 [-1.65589298e+00  9.20265360e-01  1.67641560e-01  1.55976837e+00
   6.85358948e-02  1.00433167e+00  3.21650406e-01 -1.95328813e-01]]
K
 [[ 0.31691824 -0.74019652 -0.16705587  1.15992309 -1.55371717 -0.86020114
   0.75528109 -0.38178461]
 [ 0.2937555  -1.16828174  0.65689693  0.63906619  0.52966841  0.56031207
  -0.1390223  -0.35731796]
 [-0.14005376  1.02710414  0.37303933 -0.28437786 -1.08703429  0.01543534
  -0.38521217  0.77901054]
 [ 1.85600985  1.40490388 -0.41918586  0.43513747  1.02341881  0.765293
   1.34713076  0.31635243]]
V
 [[ 0.99755393  1.56166281  0.2986221   0.47543551  0.81

#### Self Attention

\begin{equation}
    \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V
\end{equation}


In [4]:
np.matmul(Q, K.T) # Q has shape (4, 8) and K.T has shape (8, 4), so the result will have shape (4, 4)

array([[-1.70591902, -1.37586814,  1.35979172, -0.84403445],
       [ 1.9056601 ,  1.28581105, -2.38259031, -0.49412577],
       [ 1.9622213 ,  1.83882766,  0.75857093,  4.86142051],
       [-0.07765669,  0.16947995,  0.46103018,  0.03823273]])
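
Each entry of this matrix is a single dot product: entry (i, j) compares the query of token i with the key of token j. A small sketch checks this, using freshly generated random `Q` and `K` with an arbitrary seed (not the values printed above):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

scores = np.matmul(Q, K.T)  # shape (seq_len, seq_len)

# Entry (i, j) is the dot product of query i with key j
i, j = 2, 3
assert np.isclose(scores[i, j], np.dot(Q[i], K[j]))
print(scores.shape)  # (4, 4)
```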

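**Why we need sqrt(d_k) in the denominator**

Each entry of Q.K(T) is a sum of d_k products, so its variance grows with d_k. Large scores push the softmax toward a one-hot output, where gradients become very small. Dividing by sqrt(d_k) keeps the scores on a scale comparable to Q and K.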

In [5]:
print("Q variance:", Q.var()) 
print("K variance:", K.var())
print("Q.K(T) variance:", np.matmul(Q, K.T).var())

Q variance: 1.0633619664558474
K variance: 0.6293201691477228
Q.K(T) variance: 2.9059453138911335


In [6]:
normalized = np.matmul(Q, K.T) / math.sqrt(d_k)
print("Q variance:", Q.var()) 
print("K variance:", K.var())
print("normalized Q.K(T) variance:", normalized.var())

Q variance: 1.0633619664558474
K variance: 0.6293201691477228
normalized Q.K(T) variance: 0.3632431642363916


**Notice the reduction in the variance of the product:** dividing by sqrt(d_k) divides the variance by d_k = 8, from about 2.91 to about 0.36.
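
To see why the scaling factor is specifically sqrt(d_k), the sketch below repeats the experiment for several values of d_k with many more samples. It assumes the entries of Q and K are independent standard normal values, as in this notebook; the sample count `n`, the list of dimensions, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # number of query/key vectors; large so the estimates are stable

for d_k in (8, 64, 512):
    Q = rng.standard_normal((n, d_k))
    K = rng.standard_normal((n, d_k))
    raw = Q @ K.T
    scaled = raw / np.sqrt(d_k)
    # The raw variance is close to d_k; the scaled variance is close to 1
    print(d_k, round(raw.var(), 2), round(scaled.var(), 2))
```

Under these assumptions the unscaled variance tracks d_k, while the scaled variance stays near 1 regardless of the dimension.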

In [7]:
normalized

array([[-0.60313345, -0.48644285,  0.48075897, -0.29841124],
       [ 0.67375259,  0.45460286, -0.84237288, -0.17469984],
       [ 0.69374999,  0.65012375,  0.26819532,  1.71877171],
       [-0.02745579,  0.05992021,  0.16299878,  0.01351731]])

#### Masking
- Masking ensures that a word doesn't get context from words that come after it, that is, words that have not been generated yet.
- It is not required in the encoder, but it is required in the decoder.

In [8]:
mask = np.triu(np.ones((seq_len, seq_len)), k=1)
mask

array([[0., 1., 1., 1.],
       [0., 0., 1., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.]])

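The mask has shape **(seq_len, seq_len)**, with one row and one column per token. The entries above the diagonal (marked 1) are the future positions to hide. Setting them to -inf makes their softmax weight exactly zero.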

In [9]:
mask[mask == 1] = -np.inf
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])

In [10]:
normalized + mask

array([[-0.60313345,        -inf,        -inf,        -inf],
       [ 0.67375259,  0.45460286,        -inf,        -inf],
       [ 0.69374999,  0.65012375,  0.26819532,        -inf],
       [-0.02745579,  0.05992021,  0.16299878,  0.01351731]])
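
The same additive mask can also be built in one step. The sketch below uses `np.where` on a boolean upper-triangular matrix; this is an alternative construction, not the one used in the cells above, and the `scores` matrix is a made-up example. It checks that both constructions agree.

```python
import numpy as np

seq_len = 4

# Two-step construction, as in the cells above
mask_a = np.triu(np.ones((seq_len, seq_len)), k=1)
mask_a[mask_a == 1] = -np.inf

# One-step construction: True above the diagonal marks future positions
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
mask_b = np.where(future, -np.inf, 0.0)

assert np.array_equal(mask_a, mask_b)

# Adding the mask leaves allowed scores unchanged and hides future ones
scores = np.arange(16, dtype=float).reshape(seq_len, seq_len)
masked = scores + mask_b
assert np.all(masked[~future] == scores[~future])
assert np.all(np.isneginf(masked[future]))
```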

#### Softmax

$$
\text{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$
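
In the attention formula the softmax is applied to each row of the score matrix separately, so every token's weights sum to 1. A minimal sketch of a row-wise softmax follows. The name `row_softmax`, the example scores, and the max-subtraction step for numerical stability are additions for illustration.

```python
import numpy as np

def row_softmax(x):
    """Softmax over the last axis, so each row sums to 1."""
    # Subtracting the row max avoids overflow in np.exp and leaves the result unchanged
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, -np.inf]])  # large and masked entries
weights = row_softmax(scores)

print(weights.sum(axis=-1))  # each row sums to 1
print(weights[1])            # 0.5, 0.5, 0 (the masked entry gets zero weight)
```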


In [11]:
def softmax(x):
    # Subtract the row-wise max for numerical stability (this does not change the result)
    exp_x = np.exp(x - x.max(axis=-1, keepdims=True))
    # Normalize each row separately so that every row of weights sums to 1
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

In [12]:
attention = softmax(normalized + mask)

In [13]:
np.round(attention, 4)

array([[1.    , 0.    , 0.    , 0.    ],
       [0.5546, 0.4454, 0.    , 0.    ],
       [0.383 , 0.3667, 0.2503, 0.    ],
       [0.2303, 0.2513, 0.2786, 0.2399]])

Each row of the attention matrix sums to 1, and the masked (future) positions get a weight of exactly 0.

In [14]:
new_V = np.matmul(attention, V)
np.round(new_V, 4)

array([[ 0.9976,  1.5617,  0.2986,  0.4754,  0.818 ,  0.0595, -0.1169,
        -0.1013],
       [-0.515 ,  0.9761, -0.424 ,  0.1073,  0.4287,  0.2919, -0.8555,
        -0.4005],
       [-0.2244,  0.6816, -0.6278, -0.0243, -0.1581, -0.2618, -0.5948,
         0.013 ],
       [-0.1262,  0.026 , -0.899 , -0.0527, -0.1617, -0.1561, -0.013 ,
         0.1974]])

### Function

In [None]:
def softmax(x):
  return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

def self_attention(Q, K, V, mask=None):
  d_k = Q.shape[-1]
  normalized = np.matmul(Q, K.T) / math.sqrt(d_k)
  if mask is not None:
    normalized = normalized + mask
  attention = softmax(normalized)
  out = np.matmul(attention, V)
  return out, attention

In [None]:
values, attention = self_attention(Q, K, V, mask=mask)
print("Q\n", Q)
print("K\n", K)
print("V\n", V)
print("New V\n", values)
print("Attention\n", attention)
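
One way to check that the causal mask works is to change the last token and confirm that the outputs for earlier tokens do not move. The sketch below redefines the two functions so that it runs on its own, and it uses fresh random inputs with an arbitrary seed. The perturbation of `K` and `V` is a hypothetical test, not part of the notebook.

```python
import math
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

def self_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.T) / math.sqrt(d_k)
    if mask is not None:
        scores = scores + mask
    attention = softmax(scores)
    return np.matmul(attention, V), attention

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_v))

# Causal mask: -inf above the diagonal, 0 elsewhere
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
mask = np.where(future, -np.inf, 0.0)

out, attention = self_attention(Q, K, V, mask=mask)

# Every row of attention weights sums to 1, and future positions get weight 0
assert np.allclose(attention.sum(axis=-1), 1.0)
assert np.allclose(np.triu(attention, k=1), 0.0)

# Perturb the last token's key and value: earlier outputs must not change
K2, V2 = K.copy(), V.copy()
K2[-1] += 1.0
V2[-1] += 1.0
out2, _ = self_attention(Q, K2, V2, mask=mask)
assert np.allclose(out[:-1], out2[:-1])
```

Without the mask, the last assertion would generally fail, because every token would then attend to the modified last token.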