# **Attention is all you need**

## https://arxiv.org/pdf/1706.03762




## The mathematical trick in self-attention


In [3]:
# consider the following toy example:
import torch

torch.manual_seed(42)

a = torch.ones(3,3)

b = torch.randint(0,10,(3,2)).float()

c = a @ b

print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])
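
Because every row of `a` is all ones, each row of `c` is simply the column-wise sum of `b`. The short check below is a sketch that hard-codes the values of `b` printed above (an assumption made so the check does not depend on the random seed):

```python
import torch

# Values of b copied from the printed output above (hard-coded on purpose)
b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])
a = torch.ones(3, 3)

c = a @ b

# Summing b over its rows gives one total per column
col_sums = b.sum(dim=0)

# Every row of c equals those column sums
assert torch.equal(c, col_sums.expand(3, 2))
print(col_sums)
```

Multiplying by a matrix of ones therefore mixes all rows together, including "future" rows, which is what the lower-triangular mask is meant to prevent.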


### Aside: Using torch.tril()

`torch.tril()` returns the lower triangular part of a matrix: it keeps all elements on and below the main diagonal and sets everything above it to zero. For example:

In [None]:
import torch
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
                  
print(torch.tril(x))
# Output:
# [[1, 0, 0],
#  [4, 5, 0],
#  [7, 8, 9]]

## Applying the trick

In [4]:
# consider the following toy example:
import torch

torch.manual_seed(42)

a = torch.tril(torch.ones(3,3))

b = torch.randint(0,10,(3,2)).float()

c = a @ b

print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

# Go through the matrix multiplication step by step:

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])
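
Going through the multiplication row by row: row `i` of `a` has ones in positions `0..i` and zeros afterwards, so row `i` of `c` is the sum of rows `0..i` of `b`. The sketch below checks this with the values of `b` printed above (hard-coded as an assumption, so the result does not depend on the seed):

```python
import torch

b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])
a = torch.tril(torch.ones(3, 3))
c = a @ b

# Row i of c is 1*b[0] + ... + 1*b[i] + 0*(later rows): a running sum
for i in range(3):
    assert torch.equal(c[i], b[: i + 1].sum(dim=0))

# The same result in a single call
assert torch.equal(c, torch.cumsum(b, dim=0))
print(c)
```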


## Applying the weighted average
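With a lower-triangular weight matrix, each token attends only to itself and to the tokens before it. Normalizing each row so that it sums to 1 turns the running sum into a running average.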

In [5]:
# keep everything the same, but divide each row of a by its sum so that a @ b computes a weighted average

a = a / torch.sum(a, 1, keepdim=True)

c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
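
One way to convince yourself that the normalized matrix really averages the past is to compare it against a running mean computed directly. This is a minimal sketch, again hard-coding the printed values of `b`; the names `counts` and `running_mean` are introduced here for illustration only:

```python
import torch

b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])
a = torch.tril(torch.ones(3, 3))
a = a / a.sum(dim=1, keepdim=True)  # each row now sums to 1

c = a @ b

# Running mean: cumulative sum divided by the number of rows seen so far
counts = torch.arange(1, 4).float().unsqueeze(1)  # [[1.], [2.], [3.]]
running_mean = torch.cumsum(b, dim=0) / counts

assert torch.allclose(a.sum(dim=1), torch.ones(3))
assert torch.allclose(c, running_mean)
print(c)
```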


## Applying the same trick with softmax

In [10]:
# Let's take an example
import torch.nn.functional as F # Importing the module for softmax

torch.manual_seed(42)
B, T, C = 4, 8, 2 # batch, time, channels

# Batch represents the number of sequences in the batch
# Time represents the number of tokens in the sequence
# Channels represents the number of features in the input

# This results in a tensor of shape (B, T, C)
x = torch.randn(B, T, C)


# Apply the same trick however this time we use the softmax function
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1) # Normalize each row; the -inf entries become 0

wei

# xbow = wei @ x # bow --> bag of words


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [20]:
# Let's take an example
import torch.nn.functional as F # Importing the module for softmax

torch.manual_seed(42)
B, T, C = 4, 8, 2 # batch, time, channels

# Batch represents the number of sequences in the batch (e.g. the number of independent text sequences processed together)
# Time represents the number of tokens in the sequence (e.g. the number of letters in the sequence)
# Channels represents the number of features in the input (e.g. the length of the vector that represents the token/word)

# This results in a tensor of shape (B, T, C)
x = torch.randn(B, T, C)


# Apply the same trick however this time we use the softmax function
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T)) # Set initially to 0 (uniform affinities); later these scores will be computed from the data instead of being fixed
wei = wei.masked_fill(tril == 0, float('-inf')) # Set the upper triangular part to -inf, so that tokens cannot communicate with tokens from the future
wei = F.softmax(wei, dim=-1) # Apply the softmax function to normalize each row of weights

# bow --> bag of words
xbow = wei @ x # Then we apply the attention matrix to the input to get the output

print(wei)
print('---')
print(xbow.size())
print(xbow[:,:,1])

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
---
torch.Size([4, 8, 2])
tensor([[ 1.4873, -0.3091, -0.6176, -0.8644, -0.3617, -0.5354, -0.5388, -0.3762],
        [-0.1596,  0.1400,  0.4528,  0.7597,  0.8671,  0.9450,  0.8160,  0.8215],
        [-0.8712,  0.4231,  0.1405, -0.0882,  0.1285,  0.0069,  0.3092,  0.2095],
        [-0.6581, -0.0662,  0.3530,  0.0808,  0.0718,  0.1724,  0.4113,  0.5329]])


**What does each dimension represent?**

- B: Batch size
- T: Time (number of tokens in the sequence)
- C: Channels (number of features in the input)

To summarize: we create an attention matrix that replaces each token with the average of itself and all the tokens before it.
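
The masked-softmax construction can be cross-checked against an explicit loop that averages tokens `0..t` one position at a time. The sketch below assumes the same `B, T, C = 4, 8, 2` setup as the cell above; `xbow_loop` is a name introduced here for the reference result:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
B, T, C = 4, 8, 2  # batch, time, channels
x = torch.randn(B, T, C)

# Reference: average tokens 0..t explicitly for every sequence and position
xbow_loop = torch.zeros(B, T, C)
for b_idx in range(B):
    for t in range(T):
        xbow_loop[b_idx, t] = x[b_idx, : t + 1].mean(dim=0)

# Same computation with the masked-softmax trick
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow = wei @ x  # (T, T) is broadcast over the batch: (B, T, T) @ (B, T, C)

assert torch.allclose(xbow, xbow_loop, atol=1e-6)
print(xbow.shape)
```

The matrix form gives the same numbers as the loop, but as a single batched matrix multiplication.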

#### Applying the query and key

In [13]:
import torch.nn as nn

# In the code cell above our weights are uniform, i.e. all the previous (and current) tokens are given equal weight in producing the output.
# We will now learn how to create an attention matrix that is not uniform.


# New toy model with a larger channel dimension
torch.manual_seed(42)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B, T, C)

# We implement a single head of self attention.
head_size = 16 # Project the input to a smaller dimension

# Do not include a bias term:
# the attention scores should depend only on how the query and key projections relate to each other (their dot product),
# not on a constant offset added to either of them
query = nn.Linear(C, head_size, bias=False) 
key = nn.Linear(C, head_size, bias=False)

k = key(x) # (B, T, C) -> (B, T, 16)
q = query(x) # (B, T, C) -> (B, T, 16)

## 3.2.1 Scaled Dot-Product Attention from the paper (the 1/sqrt(head_size) scaling is omitted in this cell)

# The wei matrix gives us the raw affinity (dot product) between every pair of tokens in the sequence
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -> (B, T, T)


In [14]:
# Are we missing something?
wei[0]

tensor([[-0.3332, -0.6597,  0.3630, -0.1001,  0.0136, -0.5833,  1.1351, -1.2784],
        [-1.1723,  0.7869, -1.5219,  0.8649, -1.6202,  1.2025, -1.9940, -0.4554],
        [-1.0216, -1.2725,  0.7821, -0.0335, -1.9888, -0.3281,  1.5545, -1.4118],
        [-0.0545,  1.6851, -1.7215,  1.0221, -0.3327,  0.9147, -1.8037,  0.6392],
        [-1.0950,  0.1159, -0.3494, -0.1350, -1.2506,  0.9809, -0.5062, -0.5780],
        [ 0.2735,  0.5450,  0.2884, -0.3078, -0.8928, -0.4859, -2.6109,  1.9291],
        [ 0.1340,  0.2356, -0.1021,  0.1440, -2.2674,  1.7589, -1.0739,  1.6689],
        [-0.8490, -0.1962, -1.4271, -0.3019,  3.0561,  0.1650,  1.6430,  0.1103]],
       grad_fn=<SelectBackward0>)

In [None]:
# Yes: a token must not see the tokens that come after it (the future tokens)
# We have already built this mask above with torch.tril and masked_fill

# CODE HERE
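
One possible way to complete this step is sketched below; it is not necessarily how **gpt.py** does it. It rebuilds `q` and `k` so that it runs on its own, applies the same `tril` / `masked_fill` / `softmax` steps used earlier, and additionally multiplies the scores by `head_size ** -0.5`, the scaling from section 3.2.1 of the paper, which the cell above leaves out. Because of that scaling and the fresh layer initialization, the numbers will not match the printed `wei[0]`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
q, k = query(x), key(x)  # each (B, T, head_size)

# Scaled dot-product scores between every pair of positions: (B, T, T)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5

# Causal mask: position t may only look at positions 0..t
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)  # each row becomes a distribution over the past

print(wei[0])
```

After the softmax, every row sums to 1 and all entries above the diagonal are exactly 0, but the non-zero weights are no longer uniform.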


In [22]:
# Let's check the output
print(xbow.shape)
print(xbow[0])

# xbow was computed with the lower-triangular wei matrix, so row t is the average of tokens 0..t
# This means that each token only sees itself and the tokens before it

torch.Size([4, 8, 2])
tensor([[ 1.9269,  1.4873],
        [ 1.4138, -0.3091],
        [ 1.1687, -0.6176],
        [ 0.8657, -0.8644],
        [ 0.5422, -0.3617],
        [ 0.3864, -0.5354],
        [ 0.2272, -0.5388],
        [ 0.1027, -0.3762]])


#### Adding the value

You have probably heard of the query, key and value. So far, however, we have only used the query and key.

So what is the value?

In [27]:
# We create a new linear layer to project the input to the value
value = nn.Linear(C, head_size, bias=False)

v = value(x) # (B, T, C) -> (B, T, 16)

# We apply the attention matrix to the value instead of the raw x
vbow_out = wei @ v # (B, T, T) @ (B, T, 16) -> (B, T, 16)

# Note: you first have to apply the lower-triangular mask and softmax to wei, as shown above
print(vbow_out.shape)
print(vbow_out[0,:,1])

torch.Size([4, 8, 16])
tensor([-0.3704, -0.0500,  0.0160,  0.0762,  0.0193,  0.0577,  0.0694,  0.0524],
       grad_fn=<SelectBackward0>)
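
Putting the query, key, value and mask together, a single masked self-attention head might look like the sketch below. The function name `single_head_attention` is hypothetical, and the `1/sqrt(head_size)` scaling is taken from section 3.2.1 of the paper rather than from the cells above. The final check illustrates the purpose of the mask: changing the last token leaves the outputs at all earlier positions unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def single_head_attention(x, query, key, value):
    """Masked (causal) self-attention for one head; x has shape (B, T, C)."""
    B, T, C = x.shape
    q, k, v = query(x), key(x), value(x)  # each (B, T, head_size)
    wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
    tril = torch.tril(torch.ones(T, T, device=x.device))
    wei = wei.masked_fill(tril == 0, float('-inf'))  # hide future positions
    wei = F.softmax(wei, dim=-1)
    return wei @ v  # (B, T, head_size)


torch.manual_seed(42)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

out = single_head_attention(x, query, key, value)

# Alter only the last token and recompute
x_changed = x.clone()
x_changed[:, -1] = 0.0
out_changed = single_head_attention(x_changed, query, key, value)

# Earlier positions cannot see the last token, so their outputs are unchanged
assert torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-6)
print(out.shape)
```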


#### Notes

1. Batches do not communicate with each other.
2. For some applications like sentiment analysis, we do not need to mask the future tokens since we want all the tokens to talk to each other fully. So in this case we use the encoder block, where we remove the mask. In 'decoder' blocks we keep the mask, because we want to model language.
3. Self attention vs cross attention:
    - Self attention: the query, key and value are all coming from the same source.
    - Cross attention: the queries come from one source, while the keys and values come from another. For example, in an encoder-decoder transformer the queries can be produced from x while the keys and values are produced from the encoder output, a whole separate source.
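
The difference between self attention and cross attention shows up mainly in the tensor shapes. The sketch below is an illustration under assumptions: `enc_out` is a random stand-in for an encoder output, and `T_dec` and `T_enc` are hypothetical sequence lengths chosen to be different on purpose.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
B, T_dec, T_enc, C, head_size = 4, 8, 10, 32, 16

x = torch.randn(B, T_dec, C)        # decoder-side input: source of the queries
enc_out = torch.randn(B, T_enc, C)  # stand-in encoder output: source of keys and values

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x)        # (B, T_dec, head_size)
k = key(enc_out)    # (B, T_enc, head_size)
v = value(enc_out)  # (B, T_enc, head_size)

# No causal mask here: every query position may look at every encoder position
wei = F.softmax(q @ k.transpose(-2, -1) * head_size ** -0.5, dim=-1)  # (B, T_dec, T_enc)
out = wei @ v  # (B, T_dec, head_size)

print(wei.shape, out.shape)
```

The attention matrix is no longer square: it has one row per query position and one column per encoder position.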




### Refer to the forward method in the **gpt.py** file for the full implementation

There are many additional optimizations that we have not shown in this tutorial but that are implemented in the **gpt.py** file. In his video, Dr. Karpathy discusses:

1. Multi-head attention
2. Feed-forward layer structure in transformers
3. Residual connections
4. Layer and batch normalization
5. Scaling up the model (which is why you may not be able to run the code on your own machine)
6. General concepts that he does not use himself but that are used in industry, such as RLHF
