# A First Look at Self-Attention

## Why Self-Attention Helps With Meaning

Consider the following review:

> **“The movie was *not* good, but the soundtrack was amazing.”**

A simple bag-of-words classifier will see both *good* and *amazing* (positive) and probably predict a positive sentiment, missing the negation “not.”  

Self-attention can discover that "not" modifies "good" while leaving "amazing" untouched.
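
To make the failure concrete, here is a minimal sketch of a bag-of-words scorer. The lexicon `LEXICON` and the function `bow_score` are hypothetical names with made-up scores, introduced only for this illustration. Real classifiers learn their weights, but a model that only counts words has the same blind spot.

```python
# Hypothetical sentiment lexicon: word -> score (illustrative values only).
LEXICON = {"good": 1.0, "amazing": 1.0, "bad": -1.0, "awful": -1.0}

def bow_score(sentence: str) -> float:
    """Sum per-word scores, ignoring word order and context."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

review = "The movie was not good, but the soundtrack was amazing."
no_negation = "The movie was good, but the soundtrack was amazing."

# "not" carries no score of its own, so both sentences get the same total.
print(bow_score(review), bow_score(no_negation))
```

Both sentences receive the same score because the scorer has no way to attach "not" to "good."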

### 1. Tokenise the sentence

| position | token |
|:-------:|-------|
| 0 | The |
| 1 | movie |
| 2 | was |
| 3 | **not** |
| 4 | **good** |
| 5 | , |
| 6 | but |
| 7 | the |
| 8 | soundtrack |
| 9 | was |
|10 | **amazing** |
|11 | . |

Each token is mapped to a small vector (embedding).  
For illustration, imagine every token is already a 4-D vector.  
The exact numbers are not important; they are learned during training.
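
The lookup itself can be sketched in a few lines. This assumes lower-cased tokens and a random 4-D embedding table standing in for learned embeddings; the names `vocab` and `embedding_table` are chosen here for illustration.

```python
import numpy as np

tokens = ["The", "movie", "was", "not", "good", ",", "but",
          "the", "soundtrack", "was", "amazing", "."]

# Build a vocabulary: each distinct (lower-cased) token gets an integer id.
vocab = {tok: i for i, tok in enumerate(sorted({t.lower() for t in tokens}))}
token_ids = np.array([vocab[t.lower()] for t in tokens])

# Assumption: random numbers stand in for embeddings learned during training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

X = embedding_table[token_ids]   # one 4-D row per position
print(X.shape)                   # (12, 4)
```

Repeated tokens such as "was" share one row of the table, so their input vectors are identical before any positional information is added.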



### 2. Compute attention scores (conceptually)

Focus on the token at position 4, **“good.”**

* Query $q_{good}$ is compared with every key $k_j$.
* Large dot products mean higher relevance.
* After scaling and the softmax, we obtain a **weight** for every token, including “good” itself.

Suppose the softmax gives (rounded):

| key token $j$ | weight $w_{4j}$ |
|-----------------|-------------------|
| The             | 0.01 |
| movie           | 0.02 |
| was (1st)       | 0.03 |
| **not**         | **0.55** |
| **good**        | 0.10 |
| ,               | 0.02 |
| but             | 0.05 |
| the             | 0.02 |
| soundtrack      | 0.03 |
| was (2nd)       | 0.05 |
| amazing         | 0.11 |
| .               | 0.01 |

*The model assigns more than half of the total weight to “not,” capturing the local negation, and a moderate share to “amazing,” which influences the overall sentiment.*


### 3. Weighted sum of value vectors

$$
\text{output}(\text{good})
      =\sum_{j=0}^{11} w_{4j}\,v_j .
$$

Because $w_{4,3}=0.55$ is large, the output vector encodes that **“good” is negated**.  

Later layers (or a classifier head) can use this context-rich vector to predict a negative contribution from *“not good,”* while recognising the strong positive signal from *“amazing.”*
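
The weighted sum can be sketched with the illustrative weights from the table. The value vectors are an assumption: random 4-D numbers stand in for learned ones, so only the mechanics are meaningful here.

```python
import numpy as np

# Illustrative weights from "good" (position 4) to all 12 tokens, copied from the table.
w_good = np.array([0.01, 0.02, 0.03, 0.55, 0.10, 0.02,
                   0.05, 0.02, 0.03, 0.05, 0.11, 0.01])

# Assumption: random 4-D value vectors stand in for learned ones.
rng = np.random.default_rng(0)
V = rng.normal(size=(12, 4))

output_good = w_good @ V               # same as sum_j w_good[j] * V[j]

print(round(float(w_good.sum()), 2))   # 1.0: softmax weights form a distribution
print(int(w_good.argmax()))            # 3: the position of "not"
print(output_good.shape)               # (4,)
```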

### 4. Key points 

* **Context matters.** Self-attention lets every token look at the entire sentence, so “not” can influence “good.”  
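* **Parallel computation.** Unlike an RNN, which reads tokens one after another, self-attention processes all tokens at once. Any two tokens are connected in a single step regardless of their distance, although the number of pairwise comparisons grows quadratically with sentence length.  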
* **Dynamic meaning.** The same word can mean something different in another sentence; the attention pattern adapts.


## A Toy Example on the Full Sentence

> **“The movie was *not* good, but the soundtrack was amazing.”**

1. **Start with an input embedding**  
   For every token $i$ in the sentence you have a fixed-size vector  
   $$
     x_i \in \mathbb{R}^{d_{\text{model}}}.
   $$

2. **Project that same vector three different ways**  
   The self-attention layer contains three trainable weight matrices  
   $W_Q,\,W_K,\,W_V \in \mathbb{R}^{d_k\times d_{\text{model}}}$.
   It computes  
   $$
     q_i = W_Q x_i, \qquad
     k_i = W_K x_i, \qquad
     v_i = W_V x_i .
   $$

3. **Interpretation**  
   * **Query $q_i$**: “What am I looking for in the other tokens?”  
   * **Key $k_i$**: “How well do I match what others might look for?”  
   * **Value $v_i$**: “The information I will contribute if I am selected.”

   During training the matrices learn to make queries and keys align for
   linguistically relevant relations (negation, subject–verb agreement,
   coreference, and so on).
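
The three projections can be sketched in NumPy. This is an illustration under stated assumptions: the sizes `d_model = 4` and `d_k = 2` are arbitrary, random matrices stand in for trained ones, and each matrix is stored with shape `(d_k, d_model)` so that the product $W x_i$ is defined.

```python
import numpy as np

d_model, d_k = 4, 2                      # assumed toy sizes
rng = np.random.default_rng(0)

x_i = rng.normal(size=d_model)           # embedding of one token
# Assumption: random matrices stand in for the trained W_Q, W_K, W_V.
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))
W_V = rng.normal(size=(d_k, d_model))

q_i, k_i, v_i = W_Q @ x_i, W_K @ x_i, W_V @ x_i
print(q_i.shape, k_i.shape, v_i.shape)   # (2,) (2,) (2,)

# For a whole sentence stored row-wise as X with shape (n, d_model):
X = rng.normal(size=(12, d_model))
Q = X @ W_Q.T                            # row i equals W_Q @ X[i]
```

Storing the sentence row-wise lets one matrix product compute all queries at once, which is how the projections are usually batched.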



To keep arithmetic tiny we use **2-dimensional** vectors and *hand-craft* them
for the three important words; all others are zeros.

| token (index) | query $q_i$ | key $k_i$ | value $v_i$ |
|--------------|--------------|-------------|---------------|
| not (3)      | $[1,1]$    | $[1,1]$   | $[1,0]$ |
| good (4)     | $[0,1]$    | $[0,1]$   | $[0,1]$ |
| amazing (10) | $[1,0]$    | $[1,0]$   | $[1,1]$ |
| all others   | $[0,0]$    | $[0,0]$   | $[0,0]$ |

### Attention weights **from “good” to every token**

1. Dot products of $q_{\text{good}}=[0,1]$ with every $k_j$:

   * $q\cdot k_{3} = 1$ → token 3 (“not”)
   * $q\cdot k_{4} = 1$ → token 4 (“good” itself)
   * all other dot products = 0  

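2. Scale by $\sqrt{d_k}=\sqrt{2}\approx1.41$  
   non-zero scores become $1/1.41 \approx 0.707$.

3. Softmax across 12 tokens: the two non-zero scores contribute
   $e^{0.707}\approx2.03$ each and the ten zero scores contribute $e^{0}=1$ each,
   so the denominator is $2\cdot2.03+10\cdot1\approx14.06$ and

   $$
   w_{4,3}=w_{4,4}\approx\frac{2.03}{14.06}\approx0.14,\quad
   w_{4,j\neq3,4}\approx\frac{1}{14.06}\approx0.07 .
   $$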



In [1]:
import numpy as np

# 12 tokens × 2-dimensional toy vectors
Q = np.zeros((12, 2), dtype="float32")
K = np.zeros_like(Q)
V = np.zeros_like(Q)

# encode three words with non-zero vectors (same numbers as the table above)
Q[3] = K[3] = [1, 1]    # token 3  = "not"
Q[4] = K[4] = [0, 1]    # token 4  = "good"   ← the query we will inspect
Q[10] = K[10] = [1, 0]  # token 10 = "amazing"
V[3], V[4], V[10] = [1, 0], [0, 1], [1, 1]   # values from the table (not needed for the weights)

dk = K.shape[-1]
scores   = Q[4] @ K.T / np.sqrt(dk)          # dot products from “good” to all keys
weights  = np.exp(scores) / np.exp(scores).sum()

print("Attention weights from 'good':")
print(np.round(weights, 2))


Attention weights from 'good':
[0.07 0.07 0.07 0.14 0.14 0.07 0.07 0.07 0.07 0.07 0.07 0.07]


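In matrix form, the whole computation is

$$
\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V ,
$$

where  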

* **$Q$** = queries (one per input position)  
* **$K$** = keys (one per input position)  
* **$V$** = values (one per input position)  
* $d_k$ = dimensionality of the keys (used for scaling).
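
These definitions fit into one small function. The sketch below is a plain NumPy version for a single head, without masking or batching; the function name `scaled_dot_product_attention` is chosen here for illustration. It is applied to the 12-token toy example with the hand-crafted queries, keys and values from the table.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for row-wise queries, keys and values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n, n) scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over each row
    return weights @ V, weights

# The 12-token toy example, with values taken from the hand-crafted table.
Q = np.zeros((12, 2)); K = np.zeros((12, 2)); V = np.zeros((12, 2))
Q[3] = K[3] = [1, 1]; V[3] = [1, 0]      # "not"
Q[4] = K[4] = [0, 1]; V[4] = [0, 1]      # "good"
Q[10] = K[10] = [1, 0]; V[10] = [1, 1]   # "amazing"

outputs, weights = scaled_dot_product_attention(Q, K, V)
print(np.round(weights[4], 2))   # attention row for "good"
print(np.round(outputs[4], 2))   # contextual vector for "good"
```

Subtracting the row maximum before exponentiating does not change the softmax result; it only avoids overflow for large scores.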

 

## A Toy Example (3-word mini-sentence)

Assume the sentence *“She **did** not”* has already been converted to three 2-dimensional vectors (for clarity the numbers are tiny integers):

| token | $q_i$ | $k_i$ | $v_i$ |
|-------|-------|-------|-------|
| She   | $\begin{bmatrix}1\\0\end{bmatrix}$ | $\begin{bmatrix}1\\0\end{bmatrix}$ | $\begin{bmatrix}1\\1\end{bmatrix}$ |
| did   | $\begin{bmatrix}0\\1\end{bmatrix}$ | $\begin{bmatrix}0\\1\end{bmatrix}$ | $\begin{bmatrix}0\\2\end{bmatrix}$ |
| not   | $\begin{bmatrix}1\\1\end{bmatrix}$ | $\begin{bmatrix}1\\1\end{bmatrix}$ | $\begin{bmatrix}2\\1\end{bmatrix}$ |

### 1. Compute the *raw* attention scores  
For each pair $(q_i,k_j)$ take the dot product:

|  | **She** | **did** | **not** |
|---|--------|--------|--------|
| **She** | $1\cdot1+0\cdot0 = 1$ | $1\cdot0+0\cdot1 = 0$ | $1\cdot1+0\cdot1 = 1$ |
| **did** | $0\cdot1+1\cdot0 = 0$ | $0\cdot0+1\cdot1 = 1$ | $0\cdot1+1\cdot1 = 1$ |
| **not** | $1\cdot1+1\cdot0 = 1$ | $1\cdot0+1\cdot1 = 1$ | $1\cdot1+1\cdot1 = 2$ |
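
The same table comes out of a single matrix product. This is a small sketch using the vectors from the table, with rows ordered She, did, not.

```python
import numpy as np

# Rows are the tokens She, did, not; the numbers come from the table above.
Q = np.array([[1, 0], [0, 1], [1, 1]])
K = Q.copy()            # in this toy example every key equals its query

scores = Q @ K.T        # scores[i, j] = q_i · k_j
print(scores)
```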

### 2. Scale and apply $\operatorname{softmax}$ row-wise  
With $d_k=2$, divide by $\sqrt{2}\approx1.41$, then apply $\operatorname{softmax}$ to each row  
(result rounded to two decimals):

|  | She | did | not |
|---|-----|-----|-----|
| **She** | 0.40 | 0.20 | 0.40 |
| **did** | 0.20 | 0.40 | 0.40 |
| **not** | 0.25 | 0.25 | 0.50 |

These are the **attention weights**.

### 3. Weighted sum of the values  
For the first token “She”:

$$
\text{output}(\text{She}) =
0.40\,v_{\text{She}} \;+\; 0.20\,v_{\text{did}} \;+\; 0.40\,v_{\text{not}}
= 0.40\!\begin{bmatrix}1\\1\end{bmatrix}
  +0.20\!\begin{bmatrix}0\\2\end{bmatrix}
  +0.40\!\begin{bmatrix}2\\1\end{bmatrix}
= \begin{bmatrix}1.20\\1.20\end{bmatrix}.
$$

The same happens for “did” and “not”.  
Each output vector now **blends information from the entire sequence**, with larger weights on more relevant words.
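
All three steps of this mini-example can be checked numerically. The sketch below uses the query, key and value vectors from the first table of this section and prints the full weight matrix and every output vector.

```python
import numpy as np

Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # She, did, not
K = Q.copy()
V = np.array([[1.0, 1.0], [0.0, 2.0], [2.0, 1.0]])

scaled = Q @ K.T / np.sqrt(K.shape[-1])               # divide by sqrt(d_k)
weights = np.exp(scaled) / np.exp(scaled).sum(axis=1, keepdims=True)  # row-wise softmax
outputs = weights @ V                                 # one output row per token

print(np.round(weights, 2))
print(np.round(outputs, 2))
```

Each row of `weights` sums to 1, and each row of `outputs` is the blend of value vectors for one token.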

 


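Returning to the 12-token review, the weights that “good” assigns in the first toy example can be read as follows:

| index | token              | weight | interpretation                              |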
|:----:|--------------------|:------:|---------------------------------------------|
| 0    | The                | 0.07   | almost ignored                              |
| 1    | movie              | 0.07   | almost ignored                              |
| 2    | was (first)        | 0.07   | almost ignored                              |
| 3    | **not**            | **0.14** | most relevant external word (negation)      |
| 4    | **good** (itself)  | **0.14** | self-focus helps preserve the base meaning  |
| 5    | ,                  | 0.07   | punctuation, little influence               |
| 6    | but                | 0.07   | discourse marker, low weight here           |
| 7    | the                | 0.07   | almost ignored                              |
| 8    | soundtrack         | 0.07   | almost ignored                              |
| 9    | was (second)       | 0.07   | almost ignored                              |
| 10   | amazing            | 0.07   | small influence, sentiment elsewhere        |
| 11   | .                  | 0.07   | negligible influence                        |

Key observations:

* The two highest weights (0.14) fall on **“not”** and **“good”** itself.  
  The model therefore combines the negation signal with the word it negates.
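* Every other token, including “amazing,” has a score of 0 and receives the same baseline weight of about $1/14.06 \approx 0.07$, slightly below the uniform share $1/12 \approx 0.08$.
  Individually they contribute little to the representation of “good,” although together they still hold about 70% of the weight.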

In a fully trained model you would expect an even sharper focus on **“not”**
and a near-zero weight on punctuation or stop-words. 

The principle is: weights tell you **which words the model uses as
evidence when forming the contextual meaning of the current word.**
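
One way to read a row of attention weights is to rank the tokens by weight. The sketch below does this for “good” using the hand-crafted keys from the toy example; it illustrates the inspection step only and is not a claim about how a trained model behaves.

```python
import numpy as np

tokens = ["The", "movie", "was", "not", "good", ",", "but",
          "the", "soundtrack", "was", "amazing", "."]

# Keys from the hand-crafted toy example; every unlisted token keeps a zero key.
K = np.zeros((12, 2))
K[3], K[4], K[10] = [1, 1], [0, 1], [1, 0]
q_good = np.array([0.0, 1.0])

scores = K @ q_good / np.sqrt(K.shape[-1])
weights = np.exp(scores) / np.exp(scores).sum()

# Sort positions from highest to lowest weight and show the top three.
order = np.argsort(-weights, kind="stable")
for j in order[:3]:
    print(f"{tokens[j]:>10s}  {weights[j]:.2f}")
```

With these toy vectors “not” and “good” tie for first place, and every remaining token shares the same baseline weight.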








In [2]:
import tensorflow as tf
from tensorflow.keras import layers, Model
import numpy as np

# ------------------------------------------------------------------
# 0.  Reproducibility
# ------------------------------------------------------------------
tf.random.set_seed(0)
np.random.seed(0)

# ------------------------------------------------------------------
# 1.  Synthetic data  (NumPy arrays only!)
# ------------------------------------------------------------------
VOCAB_SIZE  = 51         # token ids 0..50; id 0 is reserved for [CLS]
SEQ_LEN     = 6          # NOT counting [CLS]
NUM_SAMPLES = 8_000      # number of synthetic sentences
EPOCHS      = 3          # a few epochs are fine with this much data

tokens_np = np.random.randint(1, VOCAB_SIZE, size=(NUM_SAMPLES, SEQ_LEN))
tokens_np = np.concatenate(
    [np.zeros((NUM_SAMPLES, 1), dtype=int), tokens_np], axis=1   # prepend [CLS]
)                                   # shape (N, 7)

# label = 1 when the token at position 3 (position 0 is [CLS]) equals 42
labels_np = (tokens_np[:, 3] == 42).astype("float32")

# indices where the label is 1
pos_idx = np.where(labels_np == 1)[0]

print("Indices with label 1:", pos_idx)

# show the actual label values (all 1.0) to double-check
print("Their label values:", labels_np[pos_idx])

# if you also want to see the sentences themselves:
print("Sentences with label 1:\n", tokens_np[pos_idx])
# ------------------------------------------------------------------
# 2.  Positional-embedding block
# ------------------------------------------------------------------
class PositionalEmbedding(layers.Layer):
    def __init__(self, vocab, d_model, max_len, **kw):
        super().__init__(**kw)
        self.tok = layers.Embedding(vocab,   d_model, name="tok_emb")
        self.pos = layers.Embedding(max_len, d_model, name="pos_emb")

    def call(self, tok_ids):                       # (B, L)
        L = tf.shape(tok_ids)[1]
        pos_ids = tf.range(L)                      # 0 … L-1
        return self.tok(tok_ids) + self.pos(pos_ids)

# ------------------------------------------------------------------
# 3.  Model
# ------------------------------------------------------------------
D_MODEL   = 8
NUM_HEADS = 1
KEY_DIM   = D_MODEL // NUM_HEADS

inp      = layers.Input((SEQ_LEN + 1,), dtype="int32", name="tokens")
emb      = PositionalEmbedding(VOCAB_SIZE, D_MODEL, SEQ_LEN + 1,
                               name="embed")(inp)

attn_out, attn_scores = layers.MultiHeadAttention(
        num_heads=NUM_HEADS,
        key_dim=KEY_DIM,
        output_shape=D_MODEL,
        name="self_attn")(emb, emb, return_attention_scores=True)

x        = layers.LayerNormalization(epsilon=1e-6, name="ln")(emb + attn_out)
cls_vec  = x[:, 0, :]                           # [CLS]
# the sigmoid output is a probability, not a logit
probs    = layers.Dense(1, activation="sigmoid", name="classifier")(cls_vec)

model = Model(inp, probs, name="TinySelfAttention")
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()

# ------------------------------------------------------------------
# 4.  Train
# ------------------------------------------------------------------
train_ds = (
    tf.data.Dataset.from_tensor_slices((tokens_np, labels_np))
      .shuffle(NUM_SAMPLES)
      .batch(16)
)
model.fit(train_ds, epochs=EPOCHS, verbose=0)

# ------------------------------------------------------------------
# 5.  Attention matrix for the first sample
# ------------------------------------------------------------------
attn_extractor = Model(inp, attn_scores)       # model that outputs only A
A = attn_extractor.predict(tokens_np[:1])      # shape (1, 1, 7, 7)

print("\nAttention weights  (sample 0 · head 0):")
with np.printoptions(precision=4, suppress=True):
    print(A[0, 0])


Indices with label 1: [  10   11  135  208  222  300  349  401  427  448  462  478  482  488
  493  564  570  585  590  607  620  675  723  800  831  884  933  973
 1043 1108 1115 1137 1151 1166 1215 1389 1496 1513 1598 1638 1647 1680
 1688 1808 1830 1857 1893 1901 1903 1922 1944 2080 2087 2121 2130 2199
 2269 2335 2391 2443 2444 2508 2531 2571 2595 2642 2666 2674 2693 2718
 2804 2856 2867 2890 2985 2993 3025 3036 3094 3163 3213 3365 3418 3496
 3504 3523 3539 3547 3556 3560 3588 3729 3761 3808 3824 3831 3934 3939
 4043 4084 4092 4128 4148 4184 4221 4290 4294 4360 4579 4630 4685 4723
 4729 4743 4751 4786 4821 5048 5133 5150 5163 5165 5278 5294 5316 5378
 5440 5518 5544 5572 5802 5902 5918 5972 6146 6172 6232 6373 6428 6470
 6594 6611 6634 6673 6751 6849 7018 7035 7101 7214 7307 7311 7330 7417
 7482 7552 7565 7587 7599 7765 7849 7850 7888 7895 7898 7909 7911 7966]
Their label values: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1

In [18]:
print("Sentence-0:", tokens_np[0])
print("Token at pos-3:", tokens_np[0, 3])
pred = model.predict(tokens_np[:1])[0, 0]
print("Model output for sample-0:", pred)   # should be ≪ 0.5

idx = np.where(labels_np == 1)[0][0]   # first positive sample
print("Sentence-idx:", tokens_np[idx])
print(model.predict(tokens_np[idx:idx+1])[0, 0])  # should be ≫ 0.5


Sentence-0: [ 0 45 48  1  4  4 40]
Token at pos-3: 1
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Model output for sample-0: 0.00041184496
Sentence-idx: [ 0 16  5 42 43 32  2]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
0.99762386
