# Transformers and the Attention Mechanism

I'm learning about transformers because I've only ever really learnt about them in theory. The mathematical work I did was on earlier architectures like RNNs, LSTMs, CNNs, linear models, etc., which have just been totally devoured by the general-purpose nature and stupid effectiveness of transformers.

Given how important they are, I thought I'd get my hands dirty, follow what Karpathy did (except in python), and code one up from scratch to try to learn this tech better.

The first thing to understand is how a transformer differs from other architectures.

Well, one of the key differences is the **attention mechanism**. So what is the attention mechanism? It's got a bit of history, but I'll just go into what it is in transformers.

## The Attention Mechanism

Attention has to do with one special feature of transformers, and that's the contextualised representation of inputs (e.g. words).

In word2vec, for example, words are represented by **static** vectors, which don't change depending on the context around them. That means, for example, that the 'bank' in river 'bank' and the 'bank' in memory 'bank' are represented with the same vector.
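To make the 'static' part concrete, here is a minimal sketch of a lookup-table embedding. The vocabulary, the vector values, and the `embed` helper are all made up for illustration; they are not real word2vec vectors or a real word2vec API.

```python
import numpy as np

# Hypothetical toy vocabulary with made-up 3-d vectors (not real word2vec values)
static_embeddings = {
    "river":  np.array([0.9, 0.1, 0.0]),
    "memory": np.array([0.0, 0.2, 0.9]),
    "bank":   np.array([0.5, 0.5, 0.5]),
}

def embed(sentence):
    """Look up one fixed vector per word, ignoring the neighbouring words."""
    return [static_embeddings[word] for word in sentence.split()]

river_bank = embed("river bank")
memory_bank = embed("memory bank")

# "bank" gets exactly the same vector in both phrases
print(np.array_equal(river_bank[1], memory_bank[1]))  # True
```

Because the lookup never sees the neighbouring words, there is no way for the two senses of 'bank' to end up with different vectors.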

In transformers, however, the context matters! The vector representation of a word changes depending on the words around it. How does it do this? Through attention!

Attention tells us what other words to focus on for an input word (we'll keep using words as inputs here for a while, but note that's not necessary or fundamental). It does this using three other vectors known as the **query, key, and value** vectors, and some matrix maths.

## QKV Matrices

The query vector represents the word we're asking about: it's the word doing the 'attending', the one we're hoping to find relevant other words for. The key vector is the representation of a word that queries get matched against. The value vector is what a word contributes to the output once it's been matched; it isn't simply the input embedding, but a linear transformation of it that gets mixed with the other words' values to give a representation in context.

An attention score is the similarity between a query vector and a key vector. It's calculated between one word's query and the key of every word in the context, including the word itself.

These vectors are created as linear transformations of input embeddings, but we'll get to how those are calculated later.

So for example, in the sentence "I love coding", each word has three vectors, which may look like this:

In [2]:
import numpy as np
# Input embeddings for "I love coding"
embeddings = np.array([
    [1, 0, 1], # I
    [0, 1, 0], # love
    [1, 1, 1], # coding
])

# Predefined weights for query, key, and value vectors
W_query = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1]
])
W_key = np.array([
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1]
])
W_value = np.array([
    [1, 2, 1],
    [0, 1, 0],
    [1, 1, 0]
])

# Step 1: Transform input embeddings into the Query, Key, and Value vectors
queries = embeddings @ W_query
keys = embeddings @ W_key
values = embeddings @ W_value

# Print these values 
print("Queries:\n", queries)
print("Keys:\n", keys)
print("Values:\n", values)


Queries:
 [[2 1 2]
 [0 1 0]
 [2 2 2]]
Keys:
 [[0 1 2]
 [1 0 0]
 [1 1 2]]
Values:
 [[2 3 1]
 [0 1 0]
 [2 4 1]]


Okay, so we took some input embedding vectors and some weight matrices, and multiplied them out to get query, key, and value vectors, one of each for every input word.

Next, we want to do the attention calculation, which works out how relevant each word in the context is to a given word. It does this by taking the dot product (a similarity measurement between vectors) between that word's query vector and every key vector, including its own.

One question you might have is, why aren't the query vectors and the key vectors for a word the same if we're just using them to look up similarities between words in context?

This is because the query vector represents what a word is **'searching'** for, and the key vector is what a word **'has to offer the searcher'**. This separation is key to allowing the same word to behave differently in different contexts.

The query and key weight matrices in transformers are learned during training to optimise the attention mechanism. It turns out that specialising the roles of the vectors this way, so that we measure an affinity between a query and a key rather than a plain similarity between two word vectors, gives better results.
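One way to see the difference between 'affinity' and plain similarity is to compute every query-key score at once and compare it with the dot products of the raw embeddings. This is just a sketch reusing the toy embeddings and weights from the cell above; the names `scores` and `similarity` are mine.

```python
import numpy as np

# Same toy numbers as above
embeddings = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])  # I, love, coding
W_query = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])
W_key = np.array([[0, 1, 1], [1, 0, 0], [0, 0, 1]])

queries = embeddings @ W_query
keys = embeddings @ W_key

# scores[i, j] = query of word i dotted with key of word j
scores = queries @ keys.T
print("Query-key scores:\n", scores)

# Plain similarity between the raw embeddings, for comparison
similarity = embeddings @ embeddings.T
print("Embedding similarity:\n", similarity)

# Similarity is symmetric, query-key affinity doesn't have to be
print(scores[0, 1], scores[1, 0])  # "I" -> "love" vs "love" -> "I": 2 1
```

With separate query and key matrices, how much "I" attends to "love" (2) can differ from how much "love" attends to "I" (1), which a plain dot product between the embeddings can never express.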

Now we have that down, let's calculate the attention.

In [3]:
# Step 2: Select the query for the word "love"
query_love = queries[1]  # Index 1 corresponds to "love"

# Step 3: Compute dot products of the query with all keys
attention_scores = keys @ query_love  # Matrix-vector multiplication

# Print the raw attention scores
print("Raw Attention Scores for 'love':", attention_scores)

Raw Attention Scores for 'love': [1 0 1]


Great, so this is telling us that the word "love" has a raw attention score of 1 with "I", 0 with "love" (itself), and 1 with "coding".

But this on its own isn't all that useful if we want to use it as a kind of scaling factor: the raw scores aren't on any fixed scale and don't add up to 1.

To turn this vector into one where each element represents a weight and all weights sum to 1, we can use **softmax**:

In [4]:
# Step 4: Apply softmax function to the raw attention scores
def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))    # Subtract the max so the largest score becomes 0 (numerical stability; the result is unchanged)
    return exp_scores / np.sum(exp_scores)  # Normalise so the exponentiated scores add up to 1

attention_probs = softmax(attention_scores)

# Print normalised attention probabilities
print("Attention Probabilities for 'love':", attention_probs)

Attention Probabilities for 'love': [0.4223188 0.1553624 0.4223188]


So what's going on here? We see that higher raw attention scores are boosted relative to low attention scores, which are relatively suppressed.
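To get a feel for this, here is a small experiment of my own (not part of the original walkthrough) that multiplies the same raw scores by a few arbitrary factors before applying softmax. The factors are just illustrative choices.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max doesn't change the result, it only avoids overflow
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)

scores = np.array([1, 0, 1])  # raw attention scores for "love"

# Stretch the gap between high and low scores and watch the weights
for factor in (1, 2, 5):
    weights = softmax(factor * scores)
    print(f"scores x{factor}:", np.round(weights, 4))
```

The weights always sum to 1, and the bigger the gap between scores, the more of the weight goes to the highest-scoring words. Note that a raw score of 0 still gets a small non-zero weight.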

Okay, so next we will use these probabilities to **weight the value vectors** and aggregate them into a single output vector for "love".

In [5]:
# Step 5: Calculate output vector for "love"
output_love = attention_probs @ values
print("Output vector for 'love':", output_love)

Output vector for 'love': [1.68927519 3.11159399 0.8446376 ]


This output vector represents the input word "love", but **contextualised by the words "I" and "coding"**: it's a weighted mix of all three value vectors, with most of the weight on "I" and "coding".
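The same five steps can be run for every word at once with matrix operations. Below is a rough sketch under my own assumptions: the `attention` and `softmax_rows` functions and the `scale` flag are names I made up, not a library API. The standard scaled dot-product attention formulation also divides the scores by the square root of the key dimension, which this post has skipped so far, so `scale` defaults to `False` to match the numbers above.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax: each row becomes weights that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

def attention(queries, keys, values, scale=False):
    """Hypothetical helper: attention outputs for all words in one go."""
    scores = queries @ keys.T                        # every query against every key
    if scale:
        scores = scores / np.sqrt(keys.shape[-1])    # optional 1/sqrt(d_k) scaling
    weights = softmax_rows(scores)                   # one weight row per word
    return weights @ values                          # weighted mix of value vectors

# Same toy numbers as in the walkthrough
embeddings = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])  # I, love, coding
W_query = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])
W_key = np.array([[0, 1, 1], [1, 0, 0], [0, 0, 1]])
W_value = np.array([[1, 2, 1], [0, 1, 0], [1, 1, 0]])

outputs = attention(embeddings @ W_query, embeddings @ W_key, embeddings @ W_value)
print("Output vector for 'love':", outputs[1])
```

Row 1 of `outputs` should match the `output_love` vector computed step by step above, and rows 0 and 2 give the contextualised vectors for "I" and "coding".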

## Summary
We covered a bit about the attention mechanism in transformers: why it's useful, and how we use the query, key, and value matrices to calculate attention scores and the resulting contextualised output vector. Great job!

Next up will probably be how the QKV weight matrices get learned during training.

Thanks for reading!