# Scaled Dot-Product Attention

## Dot Product

The dot product allows us to calculate the similarity between two vectors. In scaled dot-product attention, this is used to determine how similar a **Query** vector for one token is to the **Key** vector of another token.

The dot product is calculated by multiplying the corresponding components of two vectors, then summing the resulting products.

$$
\begin{align*}
\vec{a} = [a_{1}, a_{2}, a_{3}, ..., a_{n}] \\
\vec{b} = [b_{1}, b_{2}, b_{3}, ..., b_{n}] \\
\vec{a} \cdot \vec{b} = a_{1}b_{1} + a_{2}b_{2} + a_{3}b_{3} + ... + a_{n}b_{n} \\
\end{align*}
$$

or

$$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n}a_{i}b_{i}$$

This can also be written as the matrix product of vector $a$ and the transpose of vector $b$, treating both as row vectors:

$$\vec{a} \cdot \vec{b} = \vec{a}\,\vec{b}^{T}$$
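As a minimal sketch of the matrix form, the snippet below compares the component-wise sum with the product of a row vector and a transposed row vector. The values and the names `a_row` and `b_row` are made up for illustration and are not the vectors used in the rest of this notebook.

```python
import numpy as np

# Hypothetical example vectors, stored as 1 x 3 row vectors
a_row = np.array([[1.0, 2.0, 3.0]])
b_row = np.array([[4.0, 5.0, 6.0]])

# Component-wise definition: sum of a_i * b_i
elementwise_sum = float(np.sum(a_row * b_row))

# Matrix form: (1 x 3) times (3 x 1) gives a 1 x 1 matrix holding the dot product
matrix_form = a_row @ b_row.T

print(elementwise_sum)  # 32.0
print(matrix_form)      # [[32.]]
```

Both forms give 1*4 + 2*5 + 3*6 = 32; the matrix form simply wraps the result in a 1 x 1 matrix.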



In [37]:
import numpy as np
import pandas as pd

np.random.seed(42)

a = np.random.random(5)
b = np.random.random(5)
dot_product = [i*x for i, x in zip(a, b)]

print(f"a = {a}")
print(f"b = {b}")
print()
for i, product in enumerate(dot_product):
    print(f"{a[i]:.2f} * {b[i]:.2f} = {product:.2f}")

print("___________________")
print(f"              {sum(dot_product):.2f}")

a = [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
b = [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]

0.37 * 0.16 = 0.06
0.95 * 0.06 = 0.06
0.73 * 0.87 = 0.63
0.60 * 0.60 = 0.36
0.16 * 0.71 = 0.11
___________________
              1.22


The sum, 1.22, is the dot product of the two vectors, which we use as their similarity score.

We can also calculate the dot product in NumPy using `np.dot(a, b)`:

In [18]:
print(f"{np.dot(a, b):.2f}")

1.22
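To see why the dot product works as a similarity score, here is a small sketch with hand-picked 2D vectors (all names and values are hypothetical). Vectors pointing the same way give a large positive result, perpendicular vectors give zero, and opposing vectors give a negative result.

```python
import numpy as np

reference = np.array([1.0, 2.0])

same_direction = np.array([2.0, 4.0])   # reference scaled by 2
perpendicular = np.array([-2.0, 1.0])   # at a right angle to reference
opposite = np.array([-1.0, -2.0])       # reference flipped

print(np.dot(reference, same_direction))  # 10.0
print(np.dot(reference, perpendicular))   # 0.0
print(np.dot(reference, opposite))        # -5.0
```

Note that the result also grows with the length of the vectors, so the dot product measures alignment and magnitude together.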


## Scaled Dot-Product Attention

Let's look at how we can calculate the dot products between matrices built from multiple Query and Key vectors.

In [56]:
q1 = np.random.random(5)
q2 = np.random.random(5)
q3 = np.random.random(5)
q4 = np.random.random(5)

# Start with an example sentence. Each word will be a token and get a key and query vector
example_sentence="machine learning is fun".split()
print(f"Example sentence split by word: {example_sentence}")

# We map each word to its index
word_to_idx = {word: i for i, word in enumerate(example_sentence)}
print(f"\nWord to index mapping: {word_to_idx}")

# Next, we build query, key, and value vectors with 5 random numbers per word
# Each ROW corresponds to a WORD and each COLUMN corresponds to an EMBEDDING DIMENSION
EMBEDDING_DIM = 5
query_vectors = np.array([np.random.random(EMBEDDING_DIM) for _ in example_sentence])
key_vectors = np.array([np.random.random(EMBEDDING_DIM) for _ in example_sentence])
value_vectors = np.array([np.random.random(EMBEDDING_DIM) for _ in example_sentence])

column_map = {i: f"Dim {i+1}" for i in range(EMBEDDING_DIM)}

# Let's print out the words and their embeddings
def convert_embedding_to_df(matrix, word_to_idx):
    """Helper function to convert a matrix where each row is a word embedding to a Pandas df"""
    matrix_df = pd.DataFrame()

    for i, word in enumerate(word_to_idx):
        matrix_df[word] = matrix[i]

    matrix_df = matrix_df.T.rename(columns=column_map)
    
    return matrix_df

print("\nQuery Matrix:")
query_vector_df = convert_embedding_to_df(query_vectors, word_to_idx)
print(query_vector_df.head())

print("\nKey Matrix:")
key_vector_df = convert_embedding_to_df(key_vectors, word_to_idx)
print(key_vector_df.head())


Example sentence split by word: ['machine', 'learning', 'is', 'fun']

Word to index mapping: {'machine': 0, 'learning': 1, 'is': 2, 'fun': 3}

Query Matrix:
             Dim 1     Dim 2     Dim 3     Dim 4     Dim 5
machine   0.164266  0.814575  0.665197  0.523065  0.358830
learning  0.877201  0.392445  0.816599  0.439135  0.376944
is        0.462680  0.301378  0.747609  0.502720  0.232213
fun       0.899575  0.383891  0.543553  0.906472  0.624238

Key Matrix:
             Dim 1     Dim 2     Dim 3     Dim 4     Dim 5
machine   0.116898  0.939832  0.627708  0.334906  0.139272
learning  0.794025  0.620073  0.533461  0.893893  0.788597
is        0.151675  0.311722  0.248489  0.743946  0.033532
fun       0.569890  0.762459  0.876766  0.342082  0.821257


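Now that we have the Query and Key matrices, we want the **Dot Product** between every word's Query vector and every word's Key vector. As above, for a single pair this is $query_{1}key_{1} + query_{2}key_{2} + ... + query_{n}key_{n}$. To compute all pairs at once, we multiply the Query matrix by the **Transpose** of the Key matrix. Matrix multiplication pairs each row of the left matrix with each column of the right matrix, so transposing the Key matrix turns each Key vector into a column. Without the transpose, each row of the Query matrix would be paired with a column of the Key matrix, which is not what we want (and here the two $4 \times 5$ matrices could not be multiplied at all).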

$$\vec{\text{Query}} \cdot \vec{\text{Key}} = \text{Query} \times \text{Key}^{T}$$

In [58]:
matrix_dot_product = np.dot(query_vectors, key_vectors.T)

# Each entry is the similarity between one word's Query vector (row) and another word's Key vector (column)
def convert_query_key_dot_product_to_df(matrix_dot_product, word_to_idx):
    """Helper function to label the rows (Queries) and columns (Keys) with their words"""
    words = list(word_to_idx)
    return pd.DataFrame(data=matrix_dot_product, index=words, columns=words)
    
matrix_dot_product_df = convert_query_key_dot_product_to_df(matrix_dot_product, word_to_idx)
print(matrix_dot_product_df)

           machine  learning        is       fun
machine   1.427468  1.740921  0.845295  1.771538
learning  1.183528  2.065285  0.797632  1.974885
is        1.007316  1.585576  0.731679  1.511621
fun       1.197666  2.544851  1.086476  2.104676
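As a quick sanity check on the transpose, the sketch below builds the score matrix twice: once with a single matrix product and once by taking the dot product of every Query row with every Key row. The matrices `Q` and `K` are fresh random data with an arbitrary seed, not the ones printed above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
Q = rng.random((4, 5))          # 4 tokens, 5 dimensions (hypothetical data)
K = rng.random((4, 5))

# One matrix product: (4 x 5) times (5 x 4) gives a 4 x 4 score matrix
scores = Q @ K.T

# The same matrix, built one dot product at a time
pairwise = np.array([[np.dot(q, k) for k in K] for q in Q])

print(scores.shape)
print(np.allclose(scores, pairwise))
```

Entry `[i, j]` of `scores` is the dot product of Query `i` with Key `j`, which is why each row of the result belongs to one Query word.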


Each entry of this matrix is then scaled by dividing it by the square root of the number of embedding dimensions of the keys, $\sqrt{d_{k}}$. Each key here is a vector of 5 numbers, so we divide by $\sqrt{5}$. The scaling matters because dot products tend to grow in magnitude as $d_{k}$ grows, which pushes the softmax into regions where its gradients are extremely small. Dividing by $\sqrt{d_{k}}$ counteracts this vanishing gradient effect.

In [None]:
scaled_matrix_dot_product = matrix_dot_product / np.sqrt(EMBEDDING_DIM)
scaled_matrix_dot_product_df = convert_query_key_dot_product_to_df(scaled_matrix_dot_product, word_to_idx)
print(scaled_matrix_dot_product_df)

           machine  learning        is       fun
machine   0.638383  0.778563  0.378028  0.792256
learning  0.529290  0.923623  0.356712  0.883196
is        0.450485  0.709091  0.327217  0.676017
fun       0.535612  1.138092  0.485887  0.941240
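The effect of the scaling can be illustrated with a small experiment. This is a sketch under a stated assumption: every vector component is an independent standard-normal draw, which is not how the vectors in this notebook were generated. Under that assumption the spread of the raw dot products grows roughly like $\sqrt{d_{k}}$, while the scaled values keep a spread close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability
n_pairs = 5_000                 # number of random query/key pairs per setting

raw_std = {}
scaled_std = {}
for d_k in (4, 64, 512):
    # Assumption: every component is an independent standard-normal draw
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per pair
    raw_std[d_k] = dots.std()
    scaled_std[d_k] = (dots / np.sqrt(d_k)).std()
    print(f"d_k={d_k:3d}  std(dot)={raw_std[d_k]:6.2f}  std(scaled)={scaled_std[d_k]:4.2f}")
```

Keeping the scores in a similar range regardless of $d_{k}$ means the softmax that follows is less likely to saturate.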


A **Softmax** function is then applied to each row of the scaled dot product matrix. Softmax converts every row of scores into positive weights between 0 and 1 that sum to 1, so larger scores receive a larger share of the weight. The normalization runs along each row (`axis=-1`) because each row holds the scores of one word's Query against every Key.

In [62]:
def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    # Normalize along `axis` so that each row sums to 1
    return e_x / e_x.sum(axis=axis, keepdims=True)

attention_weights = softmax(scaled_matrix_dot_product)
attention_weights_df = convert_query_key_dot_product_to_df(attention_weights, word_to_idx)
print(attention_weights_df.round(3))

          machine  learning     is    fun
machine     0.245     0.281  0.189  0.285
learning    0.211     0.312  0.177  0.300
is          0.226     0.292  0.199  0.283
fun         0.189     0.346  0.180  0.284


This matrix, called the **attention weight** matrix, represents how strongly each word's Query matches every word's Key. In a trained model the Query and Key vectors are learned from the data, which allows the algorithm to learn to associate related words.

The attention weights are then used to weight the value vectors: each word's output is a weighted sum of all the Value vectors. You can think of the Keys and Queries as asking, "which of the words are most relevant to one another?", while the Value is the information that gets returned. For translation, the Value might correspond to a word in a different language. For example, if the Query for the word "machine" matches closely with the Key for "learning", the output for "machine" draws heavily on the Value for "learning", which in a trained German translation model might correspond to "lernen".

In summary, the Queries ask the question: "where is the information that's relevant to me?", the Keys answer: "I am ___ relevant to you", and the Values contain the information that is passed forward.

In [70]:
attention_output = np.matmul(attention_weights, value_vectors)

attention_output_df = convert_embedding_to_df(attention_output, word_to_idx)
print(attention_output_df.round(3))

          Dim 1  Dim 2  Dim 3  Dim 4  Dim 5
machine   0.459  0.540  0.396  0.497  0.680
learning  0.466  0.521  0.417  0.508  0.684
is        0.464  0.531  0.403  0.500  0.671
fun       0.456  0.503  0.433  0.518  0.675
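The steps in this section (dot products, scaling, softmax, weighting the values) can be collected into one function. The sketch below is a minimal version for a single sentence with no masking or batching. The names `softmax_rows` and `scaled_dot_product_attention` are chosen for this illustration, the inputs are fresh random data with an arbitrary seed, and the softmax is normalized along each row so that every word's weights sum to 1.

```python
import numpy as np

def softmax_rows(x):
    """Softmax along the last axis, so each row sums to 1."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # numerical stability
    e_x = np.exp(shifted)
    return e_x / e_x.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Return (output, weights) for 2D arrays with one row per token."""
    d_k = keys.shape[-1]                           # key dimension
    scores = queries @ keys.T / np.sqrt(d_k)       # scaled Query-Key dot products
    weights = softmax_rows(scores)                 # one weight distribution per Query
    return weights @ values, weights               # weighted sum of Value vectors

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
Q = rng.random((4, 5))          # 4 tokens, 5 dimensions (hypothetical data)
K = rng.random((4, 5))
V = rng.random((4, 5))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=1))  # each row sums to 1, up to floating-point error
print(output.shape)         # (4, 5)
```

Because every row of `weights` is non-negative and sums to 1, each output row is a weighted average of the Value vectors and stays within their range.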
