# 8 The Attention Mechanism from Scratch

In [25]:
import numpy as np
from scipy.special import softmax

## 8.3 The General Attention Mechanism with NumPy and SciPy

Consider the following four fabricated word embeddings. (In practice, word embeddings are usually the output of an encoder.)

In [4]:
word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])
word_4 = np.array([0, 0, 1])

Next, define the following randomly initialized query, key and value weight matrices. (Again, in practice, these would be weights learned during training.)

In [9]:
np.random.seed(42)

W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

Note how the number of rows of each of these matrices equals the dimensionality of our word embeddings. We will now calculate the query, key and value _vectors_ using matrix multiplication.

In [34]:
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V

query_1.shape, key_1.shape, value_1.shape

((3,), (3,), (3,))
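Because the fabricated embeddings contain only zeros and ones, each product above simply picks out (or sums) rows of the corresponding weight matrix. The short check below illustrates this. It re-creates the seeded `W_Q` so that it runs on its own, and it is only a sanity check, not part of the attention computation.

```python
import numpy as np

np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))

word_1 = np.array([1, 0, 0])
word_3 = np.array([1, 1, 0])

# A one-hot embedding selects a single row of the projection matrix.
assert np.array_equal(word_1 @ W_Q, W_Q[0])

# An embedding with two ones sums the corresponding rows.
assert np.array_equal(word_3 @ W_Q, W_Q[0] + W_Q[1])
```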

Next, let us calculate the "alignment scores" for the first word using its query vector and all the key vectors.

In [33]:
scores_1 = np.array(
    [
        np.dot(query_1, key_1),
        np.dot(query_1, key_2),
        np.dot(query_1, key_3),
        np.dot(query_1, key_4),
    ]
)
scores_1.shape

(4,)

Next, we can calculate the "attention weights" by applying a softmax function to the scores. But first, it is customary to divide the scores by the square root of the dimensionality of the keys, in order to control their variance and keep the gradients stable.

In [28]:
weights_1 = softmax(scores_1 / np.sqrt(key_1.shape[0]))
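To see why the scaling helps, consider a rough experiment. It assumes (this is not stated in the text above) that query and key components are independent draws from a standard normal distribution. Under that assumption the variance of the raw dot product grows roughly linearly with the key dimensionality, while the scaled version stays close to 1. The seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_samples = 10_000

variances = {}
for d_k in (4, 64, 256):
    # Assumption: i.i.d. standard normal query and key components
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))

    raw = np.sum(q * k, axis=1)      # unscaled dot products
    scaled = raw / np.sqrt(d_k)      # divided by sqrt(d_k), as in the cell above

    variances[d_k] = (raw.var(), scaled.var())
    print(d_k, variances[d_k])
```

Large unscaled scores push the softmax towards a near one-hot output, where its gradient is close to zero. Keeping the variance near 1 avoids that regime.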

Finally, we calculate attention as the weighted sum of the four value vectors.

In [39]:
attention_1 = (
    weights_1[0] * value_1
    + weights_1[1] * value_2
    + weights_1[2] * value_3
    + weights_1[3] * value_4
)
attention_1

array([0.98522025, 1.74174051, 0.75652026])

Of course, we could do this for all four token embeddings in parallel using matrix algebra:

In [48]:
words = np.array([word_1, word_2, word_3, word_4])

Q = words @ W_Q
K = words @ W_K
V = words @ W_V

scores = Q @ K.T
weights = softmax(scores / np.sqrt(K.shape[1]), axis=1)
attention = weights @ V

attention

array([[0.98522025, 1.74174051, 0.75652026],
       [0.90965265, 1.40965265, 0.5       ],
       [0.99851226, 1.75849334, 0.75998108],
       [0.99560386, 1.90407309, 0.90846923]])
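The matrix form can be wrapped in a small helper. The function name `scaled_dot_product_attention` and its return signature are choices made for this sketch, not something defined earlier. The block rebuilds the same seeded matrices so that it is self-contained, then checks that the batched result agrees with a query-by-query computation and that each row of weights sums to 1.

```python
import numpy as np
from scipy.special import softmax


def scaled_dot_product_attention(Q, K, V):
    """Return (attention, weights) for row-wise queries, keys and values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # normalize over the keys
    return weights @ V, weights


np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))
words = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

Q, K, V = words @ W_Q, words @ W_K, words @ W_V
attention, weights = scaled_dot_product_attention(Q, K, V)

# Recompute one query at a time and compare with the batched result.
looped = np.array([softmax(q @ K.T / np.sqrt(K.shape[1])) @ V for q in Q])
assert np.allclose(attention, looped)

# Each row of weights is a probability distribution over the four words.
assert np.allclose(weights.sum(axis=1), 1.0)
```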

What the queries, keys and values are depends on the specific architecture. For instance, in the Bahdanau attention mechanism, the query would be analogous to the previous decoder state $s_{t-1}$, the keys would be analogous to the encoded inputs $h_{i}$ (the concatenated forward and backward hidden states), and the values would be the same vectors as the keys.
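As a rough illustration of that mapping, the sketch below uses the additive scoring usually associated with Bahdanau attention, $e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$. All sizes are hypothetical, and `W_a`, `U_a` and `v_a` are random stand-ins for parameters that would be learned during training.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical sizes, chosen only for illustration
n_inputs, enc_dim, dec_dim, attn_dim = 5, 8, 6, 4

h = rng.standard_normal((n_inputs, enc_dim))  # encoder states h_i (keys and values)
s_prev = rng.standard_normal(dec_dim)         # previous decoder state s_{t-1} (query)

# Random stand-ins for parameters that would be learned in training
W_a = rng.standard_normal((dec_dim, attn_dim))
U_a = rng.standard_normal((enc_dim, attn_dim))
v_a = rng.standard_normal(attn_dim)

# Additive alignment scores, one per encoder state
scores = np.tanh(s_prev @ W_a + h @ U_a) @ v_a  # shape (n_inputs,)
weights = softmax(scores)

# The values are the encoder states themselves
context = weights @ h                           # shape (enc_dim,)
```

The structure is the same as before (scores, softmax, weighted sum of values). Only the scoring function differs from the dot product used in this section.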

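**Note:** The attention computation itself (dot-product scores, scaling, softmax and weighted sum) has no learnable parameters! All the learning is in what feeds it: the word embeddings and, where they are used, the projection matrices `W_Q`, `W_K` and `W_V`. So in the simple attention mechanism, a large part of the behaviour of the model comes from the parameters _upstream_.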
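One way to read this note (an interpretation, not something the text spells out) is to drop the projection matrices altogether and use the embeddings directly as queries, keys and values. In the sketch below nothing could be trained, so the output depends entirely on the embeddings passed in.

```python
import numpy as np
from scipy.special import softmax

# The four fabricated embeddings from earlier, stacked row-wise
X = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)

# Queries, keys and values are all the embeddings themselves
scores = X @ X.T / np.sqrt(X.shape[1])
weights = softmax(scores, axis=1)
attention = weights @ X
```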