# A First Look at Self-Attention

## Why Self-Attention Helps With Meaning

Consider the following review:

> **“The movie was *not* good, but the soundtrack was amazing.”**

A simple bag-of-words classifier sees two positive words, *good* and *amazing*, and will probably predict positive sentiment. Because it ignores word order, it cannot tell that “not” negates “good.”  

Self-attention can discover that "not" modifies "good" while leaving "amazing" untouched.
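
To make the failure mode concrete, here is a minimal sketch of a lexicon-based bag-of-words scorer. The word lists `POSITIVE` and `NEGATIVE` and the function `bow_score` are made up for this illustration; a trained classifier would learn its weights, but it would share the same blindness to word order.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, chosen only for this illustration.
POSITIVE = {"good", "amazing"}
NEGATIVE = {"bad", "awful"}

def bow_score(text):
    """Count positive minus negative words, ignoring word order entirely."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos - neg

review = "The movie was not good, but the soundtrack was amazing."
no_negation = "The movie was good, but the soundtrack was amazing."

# "not" is just another token the lexicon knows nothing about,
# so removing it does not change the score.
print(bow_score(review), bow_score(no_negation))
```

Both sentences receive the same score, even though a human reader would rate them differently.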

### 1. Tokenise the sentence

| position | token |
|:-------:|-------|
| 0 | The |
| 1 | movie |
| 2 | was |
| 3 | **not** |
| 4 | **good** |
| 5 | , |
| 6 | but |
| 7 | the |
| 8 | soundtrack |
| 9 | was |
|10 | **amazing** |
|11 | . |

Each token is mapped to a small vector (embedding).  
For illustration, imagine every token is already a 4-D vector.  
The exact numbers are not important; they are learned during training.
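
As a rough sketch of this step, the snippet below splits the review into the same twelve tokens and looks each one up in a 4-D embedding table. The regular-expression tokeniser and the random table are assumptions made for illustration; real systems use trained tokenisers and learned embeddings.

```python
import re
import numpy as np

sentence = "The movie was not good, but the soundtrack was amazing."

# Split into words and punctuation marks (a deliberately naive tokeniser).
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Build a vocabulary and a random 4-D embedding table.
# In a trained model these numbers would be learned, not random.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

token_ids = np.array([vocab[tok] for tok in tokens])
X = embedding_table[token_ids]          # one 4-D row per token

for pos, tok in enumerate(tokens):
    print(pos, tok)
print("X shape:", X.shape)
```

Note that the two occurrences of “was” share the same embedding row; only positional information (introduced later) can tell them apart.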



### 2. Compute attention scores (conceptually)

Focus on the token at position 4, **“good.”**

* Query $q_{good}$ is compared with every key $k_j$.
* Large dot products mean higher relevance.
* After scaling and the softmax, we obtain a **weight** for every token, including “good” itself.

Suppose the softmax gives (rounded):

| key token $j$ | weight $w_{4j}$ |
|-----------------|-------------------|
| The             | 0.01 |
| movie           | 0.02 |
| was (1st)       | 0.03 |
| **not**         | **0.55** |
| **good**        | 0.10 |
| ,               | 0.02 |
| but             | 0.05 |
| the             | 0.02 |
| soundtrack      | 0.03 |
| was (2nd)       | 0.05 |
| amazing         | 0.11 |
| .               | 0.01 |

*The model assigns more than half of the total weight to “not,” capturing the local negation, and a moderate share to “amazing,” which influences the overall sentiment.*


### 3. Weighted sum of value vectors

$$
\text{output}(\text{good})
      =\sum_{j=0}^{11} w_{4j}\,v_j .
$$

Because $w_{4,3}=0.55$ is large, the output vector encodes that **“good” is negated**.  

Later layers (or a classifier head) can use this context-rich vector to predict a negative contribution from *“not good,”* while recognising the strong positive signal from *“amazing.”*
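
The weighted sum can be sketched in a few lines of NumPy. The weights are the ones from the table in step 2; the value vectors are random placeholders (an assumption, since the text does not specify them), so only the mechanics are meaningful, not the resulting numbers.

```python
import numpy as np

tokens = ["The", "movie", "was", "not", "good", ",", "but",
          "the", "soundtrack", "was", "amazing", "."]

# Attention weights for the query "good", copied from the table in step 2.
w_good = np.array([0.01, 0.02, 0.03, 0.55, 0.10, 0.02,
                   0.05, 0.02, 0.03, 0.05, 0.11, 0.01])

# Placeholder value vectors (random stand-ins for learned values).
rng = np.random.default_rng(0)
V = rng.normal(size=(len(tokens), 4))

# output(good) = sum_j w_4j * v_j, written as a vector-matrix product.
output_good = w_good @ V

# The same sum written out term by term, to show the two forms agree.
explicit = sum(w * v for w, v in zip(w_good, V))

print("weights sum to:", w_good.sum())
print("largest weight on:", tokens[int(np.argmax(w_good))])
print("output(good):", output_good)
```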

### 4. Key points 

* **Context matters.** Self-attention lets every token look at the entire sentence, so “not” can influence “good.”  
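* **Parallel computation.** Unlike an RNN, which processes tokens one after another, self-attention handles all tokens at once, and any two tokens interact directly no matter how far apart they are. The price is a score matrix that grows quadratically with sentence length.  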
* **Dynamic meaning.** The same word can mean something different in another sentence; the attention pattern adapts.


### Query **Q**, Key **K**, and Value **V**

Let  

* $X \in \mathbb{R}^{n \times d_{\text{model}}}$ be the matrix whose rows are the token-embedding vectors after positional encoding  
  $\bigl[x_1^{\top};\,x_2^{\top};\,\dots;\,x_n^{\top}\bigr]$.

For **each attention head** $h$ the model keeps three learned weight matrices  

$$
W_Q^{(h)},\; W_K^{(h)},\; W_V^{(h)} \in \mathbb{R}^{d_{\text{model}}\times d_k}.
$$

Multiplying the embedding matrix by those weights yields  

$$
\boxed{Q^{(h)} \;=\; X\,W_Q^{(h)}}, \qquad
\boxed{K^{(h)} \;=\; X\,W_K^{(h)}}, \qquad
\boxed{V^{(h)} \;=\; X\,W_V^{(h)}}.
$$

* **Shape:** each of $Q^{(h)}, K^{(h)}, V^{(h)}$ is $n \times d_k$.  
  (One $d_k$-dimensional row per token.)
* **Rows:** $q_i,\;k_i,\;v_i$ — the *query*, *key*, and *value* vectors for token $i$.


| Symbol |   Intuitive meaning |
|--------|---------------------------------|
| $q_i$  | “What does **this** token need or look for?” |
| $k_j$  | “What attributes does token *j* offer to others?” |
| $v_j$  |  “The information token *j* will contribute if it is attended to.” |
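
The projections for one head can be sketched as follows. The sizes `n`, `d_model`, and `d_k` and the random matrices are illustrative assumptions; in a trained model the three weight matrices are learned.

```python
import numpy as np

n, d_model, d_k = 12, 8, 4     # illustrative sizes, not taken from a real model
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))        # token embeddings after positional encoding

# One head's projection matrices; random here, learned in practice.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each is (n, d_k)

print(Q.shape, K.shape, V.shape)

# Row i of Q is q_i, the query vector for token i.
q_4 = Q[4]
```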








In [None]:
import numpy as np

# ---------------------------------------------------------------------
# 1) Token list – indices are handy for debugging
# ---------------------------------------------------------------------
tokens = ["The", "movie", "was", "not", "good", ",", "but",
          "the", "soundtrack", "was", "amazing", "."]

n_tokens, d = len(tokens), 2                # d_model = 2
Q = np.zeros((n_tokens, d))
K = np.zeros((n_tokens, d))
V = np.zeros((n_tokens, d))

# ---------------------------------------------------------------------
# 2) Hand-crafted vectors for the three special words
# ---------------------------------------------------------------------
special_qk = {3: [1, 1],      # “not”
              4: [0, 1],      # “good”
             10: [1, 0]}      # “amazing”

special_v  = {3: [1, 0],
              4: [0, 1],
             10: [1, 1]}

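for idx, vec in special_qk.items():
    Q[idx] = K[idx] = vec

# "not" gets a stronger key along the axis that "good" queries (q_good = [0, 1]),
# so q_good · k_not = 3 while q_good · k_good = 1, and "not" dominates the softmax.
K[3] = [1, 3]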

for idx, vec in special_v.items():
    V[idx] = vec

# ---------------------------------------------------------------------
# 3) Scaled dot-product attention  (decoder style → add causal mask)
# ---------------------------------------------------------------------
scores = Q @ K.T                           # (n_tokens × n_tokens)
scores /= np.sqrt(d)                       # scale by √d_k

# ---- causal (look-ahead) mask ----
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # j > i region
scores[mask] = -1e9                      # −∞ ⇒ 0 after softmax

# Softmax row by row
exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# ---------------------------------------------------------------------
# 4) Attention output = weights · V
# ---------------------------------------------------------------------
outputs = weights @ V                      # (n_tokens × d)

# ---------------------------------------------------------------------
# 5) Inspect the row that corresponds to the *query* word “good”
# ---------------------------------------------------------------------
idx_good = 4
print(f"\nMasked attention weights FROM ‘{tokens[idx_good]}’ (index {idx_good})")
for j, w in enumerate(weights[idx_good]):
    print(f"  → {tokens[j]:<11s}  w = {w:.2f}")

print("\nResulting context vector for 'good':", outputs[idx_good])



Masked attention weights FROM ‘good’ (index 4)
  → The          w = 0.07
  → movie        w = 0.07
  → was          w = 0.07
  → not          w = 0.62
  → good         w = 0.15
  → ,            w = 0.00
  → but          w = 0.00
  → the          w = 0.00
  → soundtrack   w = 0.00
  → was          w = 0.00
  → amazing      w = 0.00
  → .            w = 0.00

Resulting context vector for 'good': [0.62393289 0.15168853]
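
The zeros to the right of “good” come from the causal mask, which hides every position j > i from query i. Here is a minimal sketch of the mask on its own, using four placeholder positions and all-zero scores so that only the mask shapes the result:

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                      # equal scores everywhere

# True above the diagonal marks the "future" positions j > i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                         # exp(-inf) = 0 in the softmax

# Row-wise softmax; subtracting the row maximum keeps exp() well behaved.
shifted = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights)
```

Row i spreads its weight evenly over positions 0 to i and gives exactly zero to everything after it. An encoder-style model would skip the mask and let every token see the whole sentence.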


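Putting the pieces together, scaled dot-product attention is

$$
\operatorname{Attention}(Q,K,V)
   = \operatorname{softmax}\!\left(\frac{Q\,K^{\top}}{\sqrt{d_k}}\right)V ,
$$

where  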

* **$Q$** = queries (one per input position)  
* **$K$** = keys (one per input position)  
* **$V$** = values (one per input position)  
* $d_k$ = dimensionality of the keys (used for scaling).
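
The role of $d_k$ in the scaling is easiest to see numerically. Assuming query and key components are independent with zero mean and unit variance (an idealisation, not something stated above), the dot product of a query and a key has variance $d_k$, so dividing by $\sqrt{d_k}$ brings the scores back to roughly unit scale before the softmax. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_pairs = 10_000

# Random queries and keys with unit-variance components (an assumption).
q = rng.normal(size=(num_pairs, d_k))
k = rng.normal(size=(num_pairs, d_k))

raw = np.sum(q * k, axis=1)          # unscaled dot products
scaled = raw / np.sqrt(d_k)          # scaled as in attention

print("variance of raw scores   :", raw.var())
print("variance of scaled scores:", scaled.var())
```

Without the scaling, large $d_k$ would push the softmax towards near one-hot rows, which tends to make gradients very small.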

 

## A Toy Example (3-word mini-sentence)

Assume each token of the sentence *“She did not”* already has 2-dimensional query, key, and value vectors (for clarity the numbers are tiny integers):

| token | $q_i$ | $k_i$ | $v_i$ |
|-------|-------|-------|-------|
| She   | $\begin{bmatrix}1\\0\end{bmatrix}$ | $\begin{bmatrix}1\\0\end{bmatrix}$ | $\begin{bmatrix}1\\1\end{bmatrix}$ |
| did   | $\begin{bmatrix}0\\1\end{bmatrix}$ | $\begin{bmatrix}0\\1\end{bmatrix}$ | $\begin{bmatrix}0\\2\end{bmatrix}$ |
| not   | $\begin{bmatrix}1\\1\end{bmatrix}$ | $\begin{bmatrix}1\\1\end{bmatrix}$ | $\begin{bmatrix}2\\1\end{bmatrix}$ |

### 1. Compute the *raw* attention scores  
For each pair $(q_i,k_j)$ take the dot product:

|  | **She** | **did** | **not** |
|---|--------|--------|--------|
| **She** | $1\cdot1+0\cdot0 = 1$ | $1\cdot0+0\cdot1 = 0$ | $1\cdot1+0\cdot1 = 1$ |
| **did** | $0\cdot1+1\cdot0 = 0$ | $0\cdot0+1\cdot1 = 1$ | $0\cdot1+1\cdot1 = 1$ |
| **not** | $1\cdot1+1\cdot0 = 1$ | $1\cdot0+1\cdot1 = 1$ | $1\cdot1+1\cdot1 = 2$ |
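
The whole score table is a single matrix product, as this short sketch shows (rows and columns follow the order She, did, not):

```python
import numpy as np

# Rows follow the table order: She, did, not.
Q = np.array([[1, 0], [0, 1], [1, 1]])
K = np.array([[1, 0], [0, 1], [1, 1]])

scores = Q @ K.T        # scores[i, j] = q_i · k_j
print(scores)
```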

### 2. Scale and apply $\operatorname{softmax}$ row-wise  
With $d_k=2$, divide by $\sqrt{2}\approx1.41$, then apply $\operatorname{softmax}$ to each row  
(result rounded to two decimals):

|  | She | did | not |
|---|-----|-----|-----|
| **She** | 0.40 | 0.20 | 0.40 |
| **did** | 0.20 | 0.40 | 0.40 |
| **not** | 0.25 | 0.25 | 0.50 |

These are the **attention weights**; each row sums to 1.

### 3. Weighted sum of the values  
For the first token “She”:

$$
\text{output}(\text{She}) =
0.40\,v_{\text{She}} \;+\; 0.20\,v_{\text{did}} \;+\; 0.40\,v_{\text{not}}
= 0.40\!\begin{bmatrix}1\\1\end{bmatrix}
  +0.20\!\begin{bmatrix}0\\2\end{bmatrix}
  +0.40\!\begin{bmatrix}2\\1\end{bmatrix}
= \begin{bmatrix}1.20\\1.20\end{bmatrix}.
$$

The same happens for “did” and “not”.  
Each output vector now **blends information from the entire sequence**, with larger weights on more relevant words.
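
All three steps can be checked end to end with a few lines of NumPy. This is a sketch of plain (unmasked) scaled dot-product attention; the function name `attention` is illustrative and not a library API.

```python
import numpy as np

def attention(Q, K, V):
    """Unmasked scaled dot-product attention; returns (outputs, weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights

# Rows follow the table order: She, did, not.
Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = Q.copy()                                   # in this toy example k_i = q_i
V = np.array([[1., 1.], [0., 2.], [2., 1.]])

outputs, weights = attention(Q, K, V)
with np.printoptions(precision=2, suppress=True):
    print("weights:\n", weights)
    print("outputs:\n", outputs)
```

Printing with two decimals makes it easy to compare the result against the hand calculation.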

 


In [None]:
import tensorflow as tf
from tensorflow.keras import layers, Model
import numpy as np

# ------------------------------------------------------------------
# 0.  Reproducibility
# ------------------------------------------------------------------
tf.random.set_seed(0)
np.random.seed(0)

# ------------------------------------------------------------------
# 1.  Synthetic data  (NumPy arrays only!)
# ------------------------------------------------------------------
VOCAB_SIZE  = 51   
SEQ_LEN     = 6      
NUM_SAMPLES = 8_000   
EPOCHS = 3          

tokens_np = np.random.randint(1, VOCAB_SIZE, size=(NUM_SAMPLES, SEQ_LEN))
tokens_np = np.concatenate(
    [np.zeros((NUM_SAMPLES, 1), dtype=int), tokens_np], axis=1   # prepend [CLS]
)                                   # shape (N, 7)

labels_np = (tokens_np[:, 3] == 42).astype("float32")

# indices where the label is 1
pos_idx = np.where(labels_np == 1)[0]

print("Indices with label 1:", pos_idx)

# show the actual label values (all 1.0) to double-check
print("Their label values:", labels_np[pos_idx])

# if you also want to see the sentences themselves:
print("Sentences with label 1:\n", tokens_np[pos_idx])
# ------------------------------------------------------------------
# 2.  Positional-embedding block
# ------------------------------------------------------------------
class PositionalEmbedding(layers.Layer):
    def __init__(self, vocab, d_model, max_len, **kw):
        super().__init__(**kw)
        self.tok = layers.Embedding(vocab,   d_model, name="tok_emb")
        self.pos = layers.Embedding(max_len, d_model, name="pos_emb")

    def call(self, tok_ids):                       # (B, L)
        L = tf.shape(tok_ids)[1]
        pos_ids = tf.range(L)                      # 0 … L-1
        return self.tok(tok_ids) + self.pos(pos_ids)

# ------------------------------------------------------------------
# 3.  Model
# ------------------------------------------------------------------
D_MODEL   = 8
NUM_HEADS = 1
KEY_DIM   = D_MODEL // NUM_HEADS

inp      = layers.Input((SEQ_LEN + 1,), dtype="int32", name="tokens")
emb      = PositionalEmbedding(VOCAB_SIZE, D_MODEL, SEQ_LEN + 1,
                               name="embed")(inp)

attn_out, attn_scores = layers.MultiHeadAttention(
        num_heads=NUM_HEADS,
        key_dim=KEY_DIM,
        output_shape=D_MODEL,
        name="self_attn")(emb, emb, return_attention_scores=True)

x        = layers.LayerNormalization(epsilon=1e-6, name="ln")(emb + attn_out)
cls_vec  = x[:, 0, :]                           # [CLS]
probs    = layers.Dense(1, activation="sigmoid", name="classifier")(cls_vec)  # P(label = 1), not raw logits

model = Model(inp, probs, name="TinySelfAttention")
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()

# ------------------------------------------------------------------
# 4.  Train
# ------------------------------------------------------------------
train_ds = (
    tf.data.Dataset.from_tensor_slices((tokens_np, labels_np))
      .shuffle(NUM_SAMPLES)
      .batch(16)
)
model.fit(train_ds, epochs=EPOCHS, verbose=0)

# ------------------------------------------------------------------
# 5.  Attention matrix for the first sample
# ------------------------------------------------------------------
attn_extractor = Model(inp, attn_scores)       # model that outputs only A
A = attn_extractor.predict(tokens_np[:1])      # shape (1, 1, 7, 7)

print("\nAttention weights  (sample 0 · head 0):")
with np.printoptions(precision=4, suppress=True):
    print(A[0, 0])


Indices with label 1: [  10   11  135  208  222  300  349  401  427  448  462  478  482  488
  493  564  570  585  590  607  620  675  723  800  831  884  933  973
 1043 1108 1115 1137 1151 1166 1215 1389 1496 1513 1598 1638 1647 1680
 1688 1808 1830 1857 1893 1901 1903 1922 1944 2080 2087 2121 2130 2199
 2269 2335 2391 2443 2444 2508 2531 2571 2595 2642 2666 2674 2693 2718
 2804 2856 2867 2890 2985 2993 3025 3036 3094 3163 3213 3365 3418 3496
 3504 3523 3539 3547 3556 3560 3588 3729 3761 3808 3824 3831 3934 3939
 4043 4084 4092 4128 4148 4184 4221 4290 4294 4360 4579 4630 4685 4723
 4729 4743 4751 4786 4821 5048 5133 5150 5163 5165 5278 5294 5316 5378
 5440 5518 5544 5572 5802 5902 5918 5972 6146 6172 6232 6373 6428 6470
 6594 6611 6634 6673 6751 6849 7018 7035 7101 7214 7307 7311 7330 7417
 7482 7552 7565 7587 7599 7765 7849 7850 7888 7895 7898 7909 7911 7966]
Their label values: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step

Attention weights  (sample 0 · head 0):
[[0.1647 0.1109 0.1221 0.2472 0.1152 0.1147 0.1252]
 [0.1619 0.1164 0.126  0.2272 0.1201 0.1197 0.1287]
 [0.1633 0.1138 0.1241 0.2369 0.1177 0.1173 0.127 ]
 [0.1579 0.1227 0.1307 0.2048 0.1258 0.1254 0.1327]
 [0.1636 0.1132 0.1237 0.2389 0.1172 0.1168 0.1266]
 [0.161  0.1178 0.127  0.2224 0.1213 0.1209 0.1296]
 [0.1604 0.1188 0.1277 0.2188 0.1222 0.1218 0.1302]]
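
A quick way to read this matrix is to check that each row is a probability distribution and to ask which key position each query attends to most. The sketch below pastes the printed values into a NumPy array named `A00` (a name chosen here for illustration); in a live session one would use `A[0, 0]` directly.

```python
import numpy as np

# Attention weights for sample 0, head 0, copied from the output above.
A00 = np.array([
    [0.1647, 0.1109, 0.1221, 0.2472, 0.1152, 0.1147, 0.1252],
    [0.1619, 0.1164, 0.1260, 0.2272, 0.1201, 0.1197, 0.1287],
    [0.1633, 0.1138, 0.1241, 0.2369, 0.1177, 0.1173, 0.1270],
    [0.1579, 0.1227, 0.1307, 0.2048, 0.1258, 0.1254, 0.1327],
    [0.1636, 0.1132, 0.1237, 0.2389, 0.1172, 0.1168, 0.1266],
    [0.1610, 0.1178, 0.1270, 0.2224, 0.1213, 0.1209, 0.1296],
    [0.1604, 0.1188, 0.1277, 0.2188, 0.1222, 0.1218, 0.1302],
])

row_sums = A00.sum(axis=1)        # each row should be close to 1
top_key = A00.argmax(axis=1)      # most-attended key position per query

print("row sums:", np.round(row_sums, 3))
print("most-attended key per query:", top_key)
```

Every query row, including the [CLS] row at index 0, puts its largest weight on key position 3, which is the position that determines the label.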


In [45]:
print("Sentence-0:", tokens_np[0])
print("Token at pos-3:", tokens_np[0, 3])
pred = model.predict(tokens_np[:1])[0, 0]
print("Model output for sample-0:", pred)   # should be ≪ 0.5

idx = np.where(labels_np == 1)[0][0]   # first positive sample
print("Sentence-idx:", tokens_np[idx])
print(model.predict(tokens_np[idx:idx+1])[0, 0])  # should be ≫ 0.5


Sentence-0: [ 0 45 48  1  4  4 40]
Token at pos-3: 1
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step
Model output for sample-0: 0.00061386
Sentence-idx: [ 0 16  5 42 43 32  2]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step
0.99714214
