#### Downloading the Tiny Shakespeare dataset we will be working with. Let's start the fun.

In [54]:
!pip install wget



In [55]:
!python -m wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


Saved under input (2).txt


#### Let's look at the dataset for a bit

In [56]:
with open('../input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [57]:
print("Length of the dataset in Characters: ", len(text))

Length of the dataset in Characters:  1115394


In [58]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [59]:
# Here we get the vocabulary, i.e. unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


### Now we will develop a strategy to tokenize the input text

Tokenizing means converting the raw text, a string, into a sequence of integers according to some vocabulary of possible elements.
Here we are building a character-level language model, so we translate individual characters into integers.

In [60]:
# create a mapping from characters to integers, and back
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encoder = lambda s: [stoi[c] for c in s] # encoder: takes a string and returns the list of integers given by the stoi mapping
decoder = lambda l: ''.join([itos[i] for i in l]) # decoder: takes a list of integers and returns the corresponding string

print(encoder("hii there"))
print(decoder(encoder("hii there")))


[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### This is just one of many ways to build the mapping. For example:
1. Google uses SentencePiece, which encodes text into integers using a different scheme and vocabulary. It is a subword tokenizer, i.e. it is neither strictly word-based nor character-based, but somewhere in between.
2. OpenAI has a library called tiktoken, a BPE (byte-pair encoding) tokenizer that is also used by the GPT models.

These popular tokenizers have a much larger vocabulary, so the same text is encoded as a shorter sequence of integers (without losing any information). For this simple project we will use the character-level tokenizer.
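
The link between token granularity and sequence length can be seen without installing either library. The sketch below is only an illustration: it compares the character-level scheme used in this notebook with a naive whitespace split, which is a stand-in and not how SentencePiece or tiktoken actually work. The `sample` string and the `build_vocab` helper are made up for the example.

```python
# Toy comparison of two tokenization granularities.
# NOTE: the whitespace split is a stand-in for illustration only; real subword
# tokenizers (SentencePiece, tiktoken) use learned vocabularies, not whitespace.
sample = "we know it, we know it. we will speak, we will speak."

def build_vocab(tokens):
    """Map every distinct token to an integer id (sorted for determinism)."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

char_tokens = list(sample)       # character level: one token per character
word_tokens = sample.split(" ")  # word level: one token per whitespace-separated chunk

char_vocab = build_vocab(char_tokens)
word_vocab = build_vocab(word_tokens)

char_ids = [char_vocab[t] for t in char_tokens]
word_ids = [word_vocab[t] for t in word_tokens]

print("character level:", len(char_ids), "tokens")
print("word level:     ", len(word_ids), "tokens")
```

The coarser tokens give a much shorter sequence for the same text. The price is the vocabulary: on a corpus the size of this dataset, a word-level vocabulary would be far larger than the 65 characters found above.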

In [61]:
# Encoding the whole file
import torch
data = torch.tensor(encoder(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

### Separating the data into training and validation splits

In [62]:
n = int(0.9 * len(data)) # the first 90% is used for training, the remaining 10% for validation
train_data = data[:n]
val_data = data[n:]

### Understanding the input structure and block size

In [63]:
block_size = 8
data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [64]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for idx in range(block_size):
    context = x[:idx+1]
    target = y[idx]
    print(f"When input is {context} the target: {target}")

When input is tensor([18]) the target: 47
When input is tensor([18, 47]) the target: 56
When input is tensor([18, 47, 56]) the target: 57
When input is tensor([18, 47, 56, 57]) the target: 58
When input is tensor([18, 47, 56, 57, 58]) the target: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


For the implementation above, two approaches come to mind:
1. Train on every prefix of the chunk, as done in this implementation
2. Always use a full window of `block_size` tokens with a single target

At first the 2nd option looks better, since every prediction is made with the full context, which gives the transformer more input to relate to.
On a deeper level, though, the 1st approach is used because it gets the transformer used to seeing contexts from as little as one token all the way up to block_size. We want the transformer to be comfortable with everything in between, because this is useful at inference time: when sampling, we can start generation from as little as one character of context.
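
To make the difference between the two options concrete, the sketch below builds the training examples each one would produce from the same chunk. It uses plain Python lists instead of tensors, and `prefix_examples` and `fixed_window_examples` are hypothetical helper names introduced only for this comparison.

```python
# Plain-Python illustration; the chunk is the first block_size + 1 token ids shown above.
block_size = 8
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]

def prefix_examples(chunk):
    """Option 1: every prefix of the chunk predicts the token that follows it."""
    return [(chunk[:t + 1], chunk[t + 1]) for t in range(len(chunk) - 1)]

def fixed_window_examples(chunk):
    """Option 2: only the full window is used, with a single target."""
    return [(chunk[:-1], chunk[-1])]

for context, target in prefix_examples(chunk):
    print(f"context {context} -> target {target}")

print(len(prefix_examples(chunk)), "examples from option 1")
print(len(fixed_window_examples(chunk)), "example from option 2")
```

Option 1 yields `block_size` examples per chunk, with context lengths running from 1 to `block_size`, while option 2 yields a single example that only ever shows the model a full-length context.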

### Batch Dimension
The input above is a single data chunk: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]).

We will add a batch dimension to the input, i.e. multiple data chunks stacked together along a new dimension of the tensor (an array-like structure that is well suited to GPUs). This makes training more efficient because the chunks can be processed in parallel.

In [65]:
torch.manual_seed(1337) # fixes the random seed so that the results are reproducible
batch_size = 4 # how many independent sequences (data chunks) will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of inputs x and targets y
    data = train_data if split == 'train' else val_data
    # ix holds the starting index of each chunk in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,)) # one-dimensional tensor of batch_size random integers, each in the range 0 to (len(data) - block_size - 1)
    x = torch.stack([data[i:i+block_size] for i in ix]) # inputs to the transformer
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # targets: the same chunks shifted forward by one position
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('------------------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b,:t+1]
        target = yb[b,t]
        print(f"When input is {context.tolist()} the target: {target.tolist()}")


inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
------------------
When input is [24] the target: 43
When input is [24, 43] the target: 58
When input is [24, 43, 58] the target: 5
When input is [24, 43, 58, 5] the target: 57
When input is [24, 43, 58, 5, 57] the target: 1
When input is [24, 43, 58, 5, 57, 1] the target: 46
When input is [24, 43, 58, 5, 57, 1, 46] the target: 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
When input is [44] the target: 53
When input is [44, 53] the target: 56
When input is [44, 53, 56] the target: 1
When input is [44, 53, 56, 1] the target: 58
When input is [44, 53, 56, 1, 58] the target: 46
When in

### Now we will use the bigram language model
We will start with the simplest neural network for this study, the bigram language model. The bigram model is explained in another file, 'BigramModel.py'.


In [66]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B, T) tensor of Integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            return logits, None

        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T) # -1 also works instead of B*T, letting PyTorch infer the size
        loss = F.cross_entropy(logits, targets) # cross_entropy expects the class dimension second, i.e. (N, C) or (B, C, T), hence the flattening to (B*T, C) above

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is a (B,T) array of indices in the current context
        for _ in range(max_new_tokens):
            #get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            # print(logits.shape)
            logits = logits[:, -1, :] # becomes (B, C), because we only need the scores over all characters at the last position
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append the sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)

        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decoder(m.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
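
A quick sanity check on the initial loss printed above: if the untrained model spread its probability evenly over all 65 characters, the cross-entropy would be the negative log of 1/65. The sketch below computes that baseline in plain Python. Treating the uniform guess as the reference point is an assumption made here for comparison, not something the notebook computes.

```python
import math

vocab_size = 65  # number of unique characters found in the dataset above

# Cross-entropy of a perfectly uniform prediction: -ln(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"loss of a uniform guess: {uniform_loss:.4f}")
```

The measured 4.8786 is somewhat higher than this baseline, which is plausible because the randomly initialised embedding table produces uneven rather than uniform predictions.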


### Now we are going to train the model
Optimizer: AdamW, a variant of Adam with decoupled weight decay. A learning rate around 1e-4 is a common choice, but for this small network we can go as high as 1e-3.

In [67]:
# creating a PyTorch Optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [68]:
batch_size = 32 # bigger batch size for more parallel processing
for steps in range(1000):

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True) # clear the gradients from the previous step; otherwise gradients would accumulate across batches
    loss.backward() # computes the gradients of the loss w.r.t. the model parameters
    optimizer.step() # updates the model parameters using the computed gradients
    # print(loss)

print(loss.item())

3.721843719482422
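
The value printed above is the loss on the last training batch only, so it is noisy, and the validation split created earlier is never consulted. One common way to get a steadier number is to average the loss over several batches from each split. The sketch below shows the idea on random stand-in data. `estimate_loss`, `eval_iters`, the `splits` dictionary, and the bare `nn.Embedding` standing in for the model are all assumptions made for this example, not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

# Stand-in data so the sketch runs on its own; in the notebook these would be
# the real train_data / val_data tensors and the trained model `m`.
vocab_size, block_size, batch_size = 65, 8, 32
data = torch.randint(0, vocab_size, (10_000,))
n = int(0.9 * len(data))
splits = {'train': data[:n], 'val': data[n:]}
model = nn.Embedding(vocab_size, vocab_size)  # bigram lookup table only

def get_batch(split):
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss(eval_iters=50):
    """Average the loss over several random batches for each split."""
    out = {}
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = model(xb)  # (B, T, C)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
        out[split] = losses.mean().item()
    return out

print(estimate_loss())
```

Calling such a helper every few hundred steps inside the training loop would show whether the training and validation losses move together or start to drift apart.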


In [69]:
print(decoder(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


olylvLLko'TMyatyIoconxad.?-tNSqYPsx&bF.oiR;BD$dZBMZv'K f bRSmIKptRPly:AUC&$zLK,qUEy&Ay;ZxjKVhmrdagC-bTop-QJe.H?x
JGF&pwst-P sti.hlEsu;w:w a BG:tLhMk,epdhlay'sVzLq--ERwXUzDnq-bn czXxxI&V&Pynnl,s,Ioto!uvixwC-IJXElrgm C-.bcoCPJ
IMphsevhO AL!-K:AIkpre,
rPHEJUzV;P?uN3b?ohoRiBUENoV3B&jumNL;Aik,
xf -IEKROn JSyYWW?n 'ay;:weO'AqVzPyoiBL? seAX3Dot,iy.xyIcf r!!ul-Koi:x pZrAQly'v'a;vEzN
BwowKo'MBqF$PPFb
CjYX3beT,lZ qdda!wfgmJP
DUfNXmnQU mvcv?nlnQF$JUAAywNocd  bGSPyAlprNeQnq-GRSVUP.Ja!IBoDqfI&xJM AXEHV&DKvRS


## Building Self-Attention block

### The mathematical trick in self-attention

In [70]:
# Consider the following toy example

torch.manual_seed(1337)
B,T,C = 4,8,2 
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [71]:
x

tensor([[[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]],

        [[ 1.3488, -0.1396],
         [ 0.2858,  0.9651],
         [-2.0371,  0.4931],
         [ 1.4870,  0.5910],
         [ 0.1260, -1.5627],
         [-1.1601, -0.3348],
         [ 0.4478, -0.8016],
         [ 1.5236,  2.5086]],

        [[-0.6631, -0.2513],
         [ 1.0101,  0.1215],
         [ 0.1584,  1.1340],
         [-1.1539, -0.2984],
         [-0.5075, -0.9239],
         [ 0.5467, -1.4948],
         [-1.2057,  0.5718],
         [-0.5974, -0.6937]],

        [[ 1.6455, -0.8030],
         [ 1.3514, -0.2759],
         [-1.5108,  2.1048],
         [ 2.7630, -1.7465],
         [ 1.4516, -1.5103],
         [ 0.8212, -0.2115],
         [ 0.7789,  1.5333],
         [ 1.6097, -0.4032]]])

Within each sequence of the batch, we want the tokens to *interact with/talk to* each other; this is the crux of an attention block.

But in this generation scenario, we want them to communicate in such a way that the 4th token can interact with the 1st, 2nd, and 3rd tokens (the previous context), but not with the 5th, 6th, ... tokens (the future), since the future is what we are trying to predict.

Now, what is the easiest way for the tokens to communicate? If we are the 4th token and want to communicate with the past, one way is to just take the average of the previous tokens (1, 2 and 3), which technically gives back a feature vector that summarizes the 4th token in the context of the previous timesteps/tokens.

Obviously, just averaging is an extremely weak form of interaction between the tokens. This communication is extremely lossy, since we lose a ton of information, such as the positions of the tokens being averaged (i.e. their spatial arrangement).

But let's build it first before moving on to the next steps.

In [72]:
# xbow is short for "x bag of words", a term often used when we average things up
# we want xbow[b,t] = mean_{i<=t} x[b,i]

xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1, C)  (for the 4th token this is the 1st, 2nd, 3rd and 4th tokens, i.e. the token itself is included, which differs slightly from the description above)
        xbow[b, t] = torch.mean(xprev, 0) # (C)

In [73]:
x

tensor([[[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]],

        [[ 1.3488, -0.1396],
         [ 0.2858,  0.9651],
         [-2.0371,  0.4931],
         [ 1.4870,  0.5910],
         [ 0.1260, -1.5627],
         [-1.1601, -0.3348],
         [ 0.4478, -0.8016],
         [ 1.5236,  2.5086]],

        [[-0.6631, -0.2513],
         [ 1.0101,  0.1215],
         [ 0.1584,  1.1340],
         [-1.1539, -0.2984],
         [-0.5075, -0.9239],
         [ 0.5467, -1.4948],
         [-1.2057,  0.5718],
         [-0.5974, -0.6937]],

        [[ 1.6455, -0.8030],
         [ 1.3514, -0.2759],
         [-1.5108,  2.1048],
         [ 2.7630, -1.7465],
         [ 1.4516, -1.5103],
         [ 0.8212, -0.2115],
         [ 0.7789,  1.5333],
         [ 1.6097, -0.4032]]])

In [74]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [75]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [76]:
# The loop above works, but it is very inefficient.
torch.manual_seed(42)
a = torch.ones(3,3)
b = torch.randint(0, 10, (3,2)).float()
c = a @ b

print('a =')
print(a)
print('-----')
print('b =')
print(b)
print('------')
print('c =')
print(c)

a =
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
-----
b =
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
------
c =
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


In [77]:
# this gives us a clue on how to use matrix multiplication for faster processing: multiply by a lower triangular matrix.


torch.manual_seed(42)
a = torch.tril(torch.ones(3,3))
b = torch.randint(0, 10, (3,2)).float()
c = a @ b

print('a =')
print(a)
print('-----')
print('b =')
print(b)
print('------')
print('c =')
print(c)

# c now holds the running sums of the rows of b
# we just need to turn the sums into averages
# to do that, we only need to normalize the rows of a

a =
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
-----
b =
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
------
c =
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


In [78]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3,3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3,2)).float()
c = a @ b

print('a =')
print(a)
print('-----')
print('b =')
print(b)
print('------')
print('c =')
print(c)

# now that's the stuff

a =
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
-----
b =
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
------
c =
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


Now, let's implement what we have learned

In [79]:
wei = torch.tril(torch.ones(T,T)) # short for weights
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x
# wei is of size (T,T) and x is of size (B,T,C)
# PyTorch broadcasts wei to (B,T,T), so for each batch element (T,T) @ (T,C) is computed, giving the desired (T,C) output of running averages, just like the loop version
# (B,T,T) @ (B,T,C) ---> (B,T,C)
torch.allclose(xbow, xbow2)

if torch.allclose(xbow, xbow2):
    print("Both are the same")
else:
    print("BOW are different")


# The output should be True, but a few values of xbow and xbow2 differ by a minute amount, most likely floating-point rounding, so allclose with its default tolerances reports a difference


BOW are different
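
The mismatch reported above is most likely floating-point rounding: the loop and the matrix product add the same numbers in a different order. The sketch below rebuilds both versions, prints the largest absolute difference, and repeats the comparison with a looser absolute tolerance. The value `atol=1e-6` is an assumption chosen to sit above float32 rounding noise, not a setting taken from the notebook.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loop over batch and time
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(0)

# Version 2: normalized lower triangular matrix
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x

max_diff = (xbow - xbow2).abs().max().item()
print("largest absolute difference:", max_diff)
# atol=1e-6 is an assumed tolerance, see the note above
print(torch.allclose(xbow, xbow2, atol=1e-6))
```

If the largest difference is many orders of magnitude smaller than the values themselves, the two versions can be treated as equivalent for the purposes of this notebook.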



Now let's make one more change on top of these two versions, this time using softmax

In [80]:
tril = torch.tril(torch.ones(T,T))
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [81]:
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print(wei)
print('-----')
wei = F.softmax(wei, dim=-1)
# softmax acts as a normalization function here
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
-----


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

The reason we use the softmax version in our code is that the weights can be viewed as interaction strengths, or affinities, between tokens.

Currently the affinity between a token and each previous token is set to the same value (0), but these affinities do not have to stay at 0. They will be data dependent: some tokens will find other tokens more or less interesting, and depending on their values, the affinities will take on different amounts.
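
To see what unequal affinities do, the sketch below replaces the all-zero starting matrix with made-up random scores and applies the same mask and softmax. The random `scores` are only a stand-in for affinities that would come from the data, and the small `T = 4` is chosen to keep the printout readable.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T = 4
tril = torch.tril(torch.ones(T, T))

# Made-up affinity scores standing in for values that would come from the data
scores = torch.randn(T, T)
wei = scores.masked_fill(tril == 0, float('-inf'))  # hide the future
wei = F.softmax(wei, dim=-1)                        # each row becomes a distribution
print(wei)
```

Each row still sums to 1 and still ignores future positions, but the weight is no longer spread evenly over the past: tokens with higher scores receive a larger share.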

In [82]:
xbow3 = wei @ x
torch.allclose(xbow2, xbow3)


True

## Self-Attention

Attention, sergeants! In the last segment we discussed what we want our weights to look like when they are data independent. Now we will take things up a notch and make `wei` data dependent as well.

Before:

In [87]:
torch.manual_seed(1337)
B,T,C = 4,8,32
x = torch.randn(B,T,C)

tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape


torch.Size([4, 8, 32])

In [89]:
# Version 4: self-Attention
torch.manual_seed(1337)
B, T, C = 4, 8, 32 # Batch, Time, Channels
x = torch.randn(B,T,C)

# let's see a single head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False) # linear projection producing the key: what the token contains
query = nn.Linear(C, head_size, bias=False) # linear projection producing the query: what the token is looking for
value = nn.Linear(C, head_size, bias=False) # linear projection producing the value: what the token will share if another token is interested

k = key(x) # (B,T,16)
q = query(x) # (B,T,16)
wei = q @ k.transpose(-2, -1) # (B,T,16) @ (B,16,T) --> (B,T,T)
# Essentially, this is the matrix of dot products between the query vector of each token and the key vectors of every token in the same sequence. If a query and a key are aligned, the value is high; otherwise it is lower

tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril == 0, float('-inf')) # This makes it a decoder block
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v

out.shape

torch.Size([4, 8, 16])

Notes:
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens. This is different from convolution, where information has a specific layout in space (sort of like a template). In abstract terms, convolution is more hard-coded than attention.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block, just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally divides `wei` by sqrt(head_size), i.e. multiplies it by 1/sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, `wei` has unit variance too, and softmax stays diffuse and does not saturate too much. Illustration below.

In [104]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) #* head_size**-0.5

In [105]:
k.var()

tensor(1.0104)

In [106]:
q.var()

tensor(1.0204)

In [107]:
wei.var()

tensor(17.6841)

In [108]:
# k = torch.randn(B,T,head_size)
# q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [109]:
k.var()

tensor(1.0104)

In [110]:
q.var()

tensor(1.0204)

In [111]:
wei.var()

tensor(1.1053)
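
The two measured variances (about 17.7 without scaling and about 1.1 with it) are close to what a short calculation predicts. The derivation below assumes that the components of q and k are independent with zero mean and unit variance, which holds for the `torch.randn` samples used here but is only an approximation for learned projections. The symbol d stands for `head_size`.

$$
\operatorname{Var}(q \cdot k)
= \operatorname{Var}\Big(\sum_{i=1}^{d} q_i k_i\Big)
= \sum_{i=1}^{d} \operatorname{Var}(q_i k_i)
= \sum_{i=1}^{d} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
= d
$$

$$
\operatorname{Var}\Big(\frac{q \cdot k}{\sqrt{d}}\Big)
= \frac{1}{d}\,\operatorname{Var}(q \cdot k)
= 1
$$

With d = 16 this predicts a variance of 16 before scaling and 1 after it. The measured values differ a little from those figures because they are estimated from a small sample.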

In [94]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [95]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

We can see that in the `*8` case the distribution is peaky and sharp: softmax pushes the highest value even higher while shrinking the others (something like a triangle that keeps getting sharper at the top). In other words, as the inputs grow in magnitude, softmax converges towards a one-hot vector.
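
Pulling the notes above together, a single attention head with scaling and an optional causal mask might be sketched as follows. This is an illustrative module under stated assumptions, not the notebook's own implementation: the class name `Head`, the `n_embd` and `block_size` parameters, and the `causal` flag are introduced here for the example.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of scaled self-attention (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size, causal=True):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.causal = causal  # True: "decoder" block, False: "encoder" block
        # a buffer is stored with the module but is not a trainable parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scaled affinities: (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        if self.causal:
            # hide future positions so each token only sees itself and the past
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # (B, T, head_size)

torch.manual_seed(1337)
x = torch.randn(4, 8, 32)  # (B, T, C)
head = Head(n_embd=32, head_size=16, block_size=8)
print(head(x).shape)
```

Passing `causal=False` skips the masking line, which corresponds to the "encoder" variant described in the notes, where every token can attend to every other token.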