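Moving on to the self-attention math trick. We start with a tensor of shape (B, T, C), where B is the batch size, T is the time dimension (the position in the sequence), and C is the number of channels.
Each token is really a single row vector whose C entries are its channels (columns); in the nn.Embedding table C is the vocab size, while the toy example below uses C = 2.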

In [1]:
import torch
torch.manual_seed(1337)
B,T,C = 4,8,2 

x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [2]:
x[0]


tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [3]:
# Now let's pass information about the previous tokens (vectors) along the time dimension
# to the current token.
# The simplest way is to average the current token's vector with those of all the tokens before it.
# Averaging is not the best approach because it is very lossy, but it will do for now.
xbow = torch.zeros((B,T,C))
for b in range(B): 
    for t in range(T): 
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev, 0) # dim=0 averages over the time steps, giving one mean per channel (column)

In [4]:
print(xbow.shape)
print(x.shape)

torch.Size([4, 8, 2])
torch.Size([4, 8, 2])


In [5]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [6]:
#There is a better and faster way to calculate this average. 
torch.manual_seed(13)
a = torch.ones(3,3)
b = torch.randint(0,10, (3,2)).float()
c = a @ b 
print(a)
print('--' * 10)
print(b)
print('--' * 10)
print(c)
# Multiplying by a matrix of ones sums each column of b over all of its rows, so every row of c holds the same column sums.

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--------------------
tensor([[8., 2.],
        [4., 6.],
        [8., 6.]])
--------------------
tensor([[20., 14.],
        [20., 14.],
        [20., 14.]])


In [7]:
#Now see this 
torch.manual_seed(13)
a = torch.tril(torch.ones(3,3)) #tril gives you a lower triangle with the upper triangle as 0s. 
b = torch.randint(0,10, (3,2)).float()
c = a @ b 
print(a)
print('--' * 10)
print(b)
print('--' * 10)
print(c)
# Now row i of c holds the sum of rows 0..i of b, i.e. a running (cumulative) sum down each column.
# To turn these sums into averages, we can normalize each row of a so that it sums to 1.


tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--------------------
tensor([[8., 2.],
        [4., 6.],
        [8., 6.]])
--------------------
tensor([[ 8.,  2.],
        [12.,  8.],
        [20., 14.]])


In [8]:
torch.manual_seed(13)
a = torch.tril(torch.ones(3,3)) #tril gives you a lower triangle with the upper triangle as 0s. 
a = a / torch.sum(a, 1, keepdim=True) # sum along each row (dim=1); keepdim=True keeps the shape (3,1) so the division broadcasts across each row
b = torch.randint(0,10, (3,2)).float()
c = a @ b 
print(a)
print('--' * 10)
print(b)
print('--' * 10)
print(c)
# Now row i of c holds, column-wise, the average of rows 0..i of b.

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--------------------
tensor([[8., 2.],
        [4., 6.],
        [8., 6.]])
--------------------
tensor([[8.0000, 2.0000],
        [6.0000, 4.0000],
        [6.6667, 4.6667]])
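
As a quick sanity check on the cell above, here is a minimal sketch that compares the normalized lower-triangular product with a running mean computed directly. The names `counts` and `running_mean` are illustrative and not part of the notebook; `b` is typed in by hand from the printed output above.

```python
import torch

# Same small example as above: 3 rows (time steps), 2 columns.
b = torch.tensor([[8., 2.], [4., 6.], [8., 6.]])

# Row-normalized lower-triangular matrix: row i averages rows 0..i of b.
a = torch.tril(torch.ones(3, 3))
a = a / a.sum(dim=1, keepdim=True)
c = a @ b

# Reference: cumulative sum divided by the number of rows seen so far.
counts = torch.arange(1, 4).float().unsqueeze(1)  # shape (3, 1)
running_mean = b.cumsum(dim=0) / counts

print(torch.allclose(c, running_mean))
```

Both routes compute the same quantity; the matrix form is the one that carries over to the batched case.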


In [9]:
# Now let's go back to x and xbow and apply this trick there instead of the for loop.

torch.manual_seed(1337)
B,T,C = 4,8,2 

x = torch.randn(B,T,C)
print(x.shape)
print(x[0])

torch.Size([4, 8, 2])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])


In [10]:
# Create a weight matrix that plays the same role as tensor a above.
wei = torch.tril(torch.ones(T,T)) 
wei = wei/ torch.sum(wei, 1, keepdim=True)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [11]:
# Our tensor b is simply x.
xbow2 = wei @ x # (T,T) @ (B,T,C): PyTorch broadcasts wei to (B,T,T) and does a batched matmul, giving (B,T,C)
xbow2.shape

torch.Size([4, 8, 2])
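
To make the broadcasting step concrete, the sketch below applies the same (T, T) matrix to each batch element separately and compares the result with the single broadcasted product. The names `out_broadcast` and `out_loop` are illustrative, and the small `atol` is an assumption added to absorb float32 rounding differences.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# Broadcasted product: (T, T) @ (B, T, C) -> (B, T, C)
out_broadcast = wei @ x

# Explicit version: multiply each (T, C) batch element by the same (T, T) matrix.
out_loop = torch.stack([wei @ x[b] for b in range(B)])

print(out_broadcast.shape)
print(torch.allclose(out_broadcast, out_loop, atol=1e-6))
```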

In [12]:
xbow2[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [13]:
xbow[0] #xbow2 is equal to xbow

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
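
Rather than comparing the printed tensors by eye, the equality can be checked programmatically. This is a minimal sketch that rebuilds both versions; the explicit `atol=1e-6` is an assumption, added because the loop and the matrix product can differ by tiny float32 rounding errors.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loops, averaging x[b, :t+1] over the time dimension.
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)

# Version 2: one matrix multiply with a row-normalized lower-triangular matrix.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
xbow2 = wei @ x

# Compare with a small absolute tolerance for float32 rounding.
print(torch.allclose(xbow, xbow2, atol=1e-6))
```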

Now let's try another way to calculate the average, using softmax. This lets previous tokens influence the current token while making sure the current token (vector) cannot see future tokens.

In [14]:
tril = torch.tril(torch.ones(T,T))
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [15]:
wei = torch.zeros((T,T)) # weight matrix: decides how much each past token can influence the current token
wei 

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [16]:
wei = wei.masked_fill(tril==0, float('-inf')) # -inf turns into 0 after softmax, so future tokens cannot influence the current token
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [17]:
from torch.nn import functional as F 
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
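
The softmax output above is the same matrix as the row-normalized `tril` from earlier. The sketch below checks this directly; `wei_norm`, `wei_soft`, and `scores` are illustrative names, not from the notebook.

```python
import torch
from torch.nn import functional as F

T = 8
tril = torch.tril(torch.ones(T, T))

# Route 1: normalize the lower-triangular ones matrix directly.
wei_norm = tril / tril.sum(dim=1, keepdim=True)

# Route 2: start from zeros, mask the future with -inf, then softmax.
# exp(0) = 1 for allowed positions and exp(-inf) = 0 for masked ones,
# so each row becomes a uniform average over the allowed positions.
scores = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei_soft = F.softmax(scores, dim=-1)

print(torch.allclose(wei_norm, wei_soft))
print(wei_soft.sum(dim=-1))  # each row sums to 1 (up to rounding)
```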

In [18]:
xbow3 = wei @ x
xbow3.shape

torch.Size([4, 8, 2])

In [19]:
xbow3[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
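
Starting from zeros makes every allowed token count equally. As a hedged illustration of why the softmax form is useful, the sketch below swaps the zeros for random scores, a hypothetical stand-in for data-dependent weights that is not part of the notebook. The mask still blocks the future, but the weights over the past are no longer uniform.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T = 4
tril = torch.tril(torch.ones(T, T))

# Hypothetical scores: random numbers standing in for learned affinities.
scores = torch.randn(T, T)
scores = scores.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(scores, dim=-1)

print(wei)
# Future positions still get exactly zero weight, and each row still sums to 1.
print(torch.all(wei.triu(diagonal=1) == 0))
print(torch.allclose(wei.sum(dim=-1), torch.ones(T)))
```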