The paper starts by discussing scaled dot-product attention. *Attention* refers to a mechanism that allows "modeling of dependencies without regard to their distance in the input or output sequences". In other words, attention allows the model to *attend* to different parts of the input, regardless of how far apart they are, when learning to approximate a function.

The common example used to illustrate attention is how different words in a sentence relate to each other. Consider the sentence "A big red dog jumped over a small pond". As a reader, it's easy to see that the words "big", "red", and "jumped" all relate to the dog, or are at least more relevant to understanding what the dog is doing than the word "small".
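
For reference, the paper writes scaled dot-product attention as the formula below. Here $d_k$ is the dimension of each key (called `d_keys` in the code that follows), and the softmax is applied row-wise, so every query gets its own set of weights over the keys.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```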

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import math

In [14]:
def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """
    Performs scaled dot-product attention as defined in the Transformer paper.

    :param Q: The query matrix of shape (seq_len, d_keys)
    :param K: The key matrix of shape (seq_len, d_keys)
    :param V: The value matrix of shape (seq_len, d_values)
    :return: The attention-weighted values of shape (seq_len, d_values)
    """
    d_keys = Q.shape[-1]

    # scale the dot products so their magnitude doesn't grow with d_keys
    scaling_factor = 1 / math.sqrt(d_keys)

    # softmax over the key dimension (last axis), so each query's weights sum to 1
    return F.softmax(Q @ K.T * scaling_factor, dim=-1) @ V
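
The paper motivates the scaling factor by noting that, for large key dimensions, the dot products grow large in magnitude and push the softmax into regions with very small gradients. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance `d_keys`. The sketch below is a rough illustration of that with random inputs. It is not code from the paper, and the names `q_demo`, `k_demo`, `raw_var`, and `scaled_var` are made up for this example.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

d_keys = 100
q_demo = torch.randn((1000, d_keys))  # unit-variance random "queries"
k_demo = torch.randn((1000, d_keys))  # unit-variance random "keys"

raw = q_demo @ k_demo.T            # unscaled dot products
scaled = raw / math.sqrt(d_keys)   # same scores after the 1/sqrt(d_keys) scaling

raw_var = raw.var().item()         # expected to be close to d_keys
scaled_var = scaled.var().item()   # expected to be close to 1

print(f"raw variance:    {raw_var:.1f}")
print(f"scaled variance: {scaled_var:.2f}")
```

The unscaled variance should land near `d_keys`, while the scaled one should land near 1, whatever `d_keys` is set to.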

In [15]:
# let's try this out with some random values
d_keys = 100
d_values = 25
seq_len = 1000  # number of positions (rows) in Q, K, and V

Q = torch.randn((seq_len, d_keys))
K = torch.randn((seq_len, d_keys))
V = torch.randn((seq_len, d_values))

scaled_scores = scaled_dot_product_attention(Q, K, V)

In [17]:
print(scaled_scores.shape)

torch.Size([1000, 25])
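
The output shape alone doesn't show whether the softmax ran over the right axis. One quick sanity check, sketched here under the assumption that each query's weights over the keys should sum to 1, is to look at the intermediate weight matrix directly. The helper `attention_weights` is a hypothetical name introduced for this check. It isn't part of the paper or of PyTorch.

```python
import math
import torch
import torch.nn.functional as F

def attention_weights(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: softmax(Q K^T / sqrt(d_keys)), shape (n_queries, n_keys)."""
    d_keys = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d_keys)
    # normalize across keys (last axis) so each row is a probability distribution
    return F.softmax(scores, dim=-1)

torch.manual_seed(0)
q_small = torch.randn((8, 16))
k_small = torch.randn((8, 16))

weights = attention_weights(q_small, k_small)
row_sums = weights.sum(dim=-1)  # one sum per query

print(weights.shape)
print(torch.allclose(row_sums, torch.ones(8), atol=1e-5))
```

If the softmax were taken over `dim=0` instead, the columns would sum to 1 rather than the rows, and this check would generally fail.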
