# Self Attention In Transformers

Self attention is the mechanism by which transformers enrich the embedding of each token in the input sequence with information from the other tokens in that sequence.

> In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.

For now, we'll focus on the scaled dot product self attention mechanism. It is introspective: each token in a sequence is imbued with information from the other tokens in the same sequence with which it has a learned relationship.
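
For reference, the scaled dot product attention defined in the original Transformer paper ("Attention Is All You Need") can be written as below. Here $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the key dimension. In self attention all three are projections of the same input sequence.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$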

In [2]:
import torch

Let's assume that the tensor below represents the embeddings of a batch of 64 subsequences, each 256 tokens long, with each token represented by an embedding of length 512.

In [4]:
encoded_sentence = torch.randn((64, 256, 512), dtype=torch.float)
encoded_sentence

tensor([[[ 7.2068e-01, -7.3312e-01, -3.1714e-01,  ...,  1.4042e+00,
          -8.1633e-02,  2.3458e-02],
         [ 1.1932e+00,  2.4985e+00, -7.4632e-01,  ...,  1.3621e+00,
           2.1245e-01,  3.1476e-01],
         [ 1.0477e-01,  8.8112e-01, -3.3800e-01,  ...,  8.6387e-02,
          -1.3226e+00,  1.1879e+00],
         ...,
         [ 1.1468e+00, -5.8442e-01,  3.2023e-01,  ...,  1.3805e+00,
           7.1827e-01, -2.0997e-01],
         [-6.4437e-01,  9.8400e-01, -3.8320e-01,  ..., -5.1742e-01,
          -1.2141e+00,  1.0175e+00],
         [-1.3575e+00,  6.1631e-01, -1.1711e-01,  ...,  7.1650e-01,
          -4.4457e-01, -1.2195e+00]],

        [[-8.8185e-02,  7.6894e-01, -3.1657e-01,  ..., -4.5393e-01,
          -5.9143e-01, -1.1855e+00],
         [ 7.7802e-01, -2.2019e-02,  7.4432e-01,  ..., -6.5237e-01,
           1.7421e+00,  2.1597e-01],
         [ 2.0892e-02,  1.3980e+00, -4.8068e-01,  ..., -3.8617e-01,
           1.1633e-02, -8.2565e-02],
         ...
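
Printing the full tensor is not very informative. As a small sketch, the layout described above can be checked by unpacking the shape instead. The names `batch_size`, `seq_len`, `d_model` and `first_token` are only illustrative and do not appear elsewhere in this notebook.

```python
import torch

# Same layout as the tensor above: (batch, sequence length, embedding size).
encoded_sentence = torch.randn((64, 256, 512), dtype=torch.float)

batch_size, seq_len, d_model = encoded_sentence.shape
print(batch_size, seq_len, d_model)  # 64 256 512

# The embedding of a single token is a vector of length d_model.
first_token = encoded_sentence[0, 0]
print(first_token.shape)  # torch.Size([512])
```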

Self attention works by projecting the input sequence into queries, keys and values, scoring each query against every key from the same sequence, and using the normalized scores to take a weighted sum of the values. The result is then added back to the original sequence through a residual connection.
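
The steps above can be traced on a tiny input to make the shapes concrete. This is a minimal sketch, not a definitive implementation: the sizes and the projection matrices `w_q`, `w_k` and `w_v` are hypothetical, and random values stand in for weights that a real model would learn. It also assumes the value dimension equals `d_model`, so that the result can be added back to the input.

```python
import math

import torch

torch.manual_seed(0)

# Tiny hypothetical sizes so the shapes are easy to follow.
batch, seq_len, d_model, d_key = 2, 4, 8, 8

x = torch.randn(batch, seq_len, d_model)

# Hypothetical projection matrices; in a real model these are learned.
w_q = torch.randn(d_model, d_key)
w_k = torch.randn(d_model, d_key)
w_v = torch.randn(d_model, d_model)  # assumption: d_value == d_model

queries = x @ w_q  # (batch, seq_len, d_key)
keys = x @ w_k     # (batch, seq_len, d_key)
values = x @ w_v   # (batch, seq_len, d_model)

# Score every query against every key, scaled by sqrt(d_key).
scores = queries @ keys.transpose(-2, -1) / math.sqrt(d_key)  # (batch, seq_len, seq_len)

# Softmax turns each row of scores into weights that sum to 1.
weights = torch.softmax(scores, dim=-1)

# Weighted sum of the values, then the residual addition to the input.
attended = weights @ values  # (batch, seq_len, d_model)
output = x + attended
```

Each row of `weights` says how much one token draws from every token in the same sequence, which is why the rows sum to 1.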

In [None]:
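import math

from torch import nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int, d_key: int, d_value: int) -> None:
        super().__init__()
        # Learned projection matrices from the model dimension to key, value and query space.
        self.key_proj = nn.Parameter(torch.randn((d_model, d_key), dtype=torch.float))
        self.value_proj = nn.Parameter(torch.randn((d_model, d_value), dtype=torch.float))
        self.query_proj = nn.Parameter(torch.randn((d_model, d_key), dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed completion (standard scaled dot product attention); x is (batch, seq_len, d_model).
        queries, keys, values = x @ self.query_proj, x @ self.key_proj, x @ self.value_proj
        # Score each query against every key, scaled by sqrt(d_key): (batch, seq_len, seq_len).
        scores = queries @ keys.transpose(-2, -1) / math.sqrt(keys.size(-1))
        weights = torch.softmax(scores, dim=-1)
        # Weighted sum of values; the residual add to x is left to the caller (needs d_value == d_model).
        return weights @ values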