# Nano GPT From Scratch

## Data Loading

- Download the data
- Read and Save 
- Basic EDA

In [None]:
!wget 

In [3]:
#Read in to inspect it
with open("input.txt", "r", encoding='utf-8') as f:
    text = f.read()

In [4]:
print(f"Length of the dataset is {len(text)}")

Length of the dataset is 1115394


In [5]:
#lets look at first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



Unique charcaters in the dataset and `vocab_size`

In [12]:
#unique characters in the dataset
chars = list(set(text))
vocab_size = len(chars)

print(f"Unique characters in the dataset are : {''.join(chars)}")
print(f"Total vocab size is {vocab_size}")

Unique characters in the dataset are : Mdyluc.z$NjnOe,qwvAIGC
ftakVUpmPHs&Y;b-EXxQo FZKRW'S3!Tg?DirhJL:B
Total vocab size is 65


## Data Encoder-Decoder Mapping

- Create a mapping from characters to integer
- `encoder` - Takes a string and output a list of integers
- `decoder` - take a list of inetgers, outout a string

In [14]:
#create a mapping 
stoi = {c:i for i, c in enumerate(chars)}
itos = {i:c for i, c in enumerate(chars)}

#define encoder and decoder
encoder = lambda s:[stoi[c] for c in s]
decoder = lambda l:''.join([itos[i] for i in l])

In [15]:
#test the above mappers

s = "hii there!"

print(f"Encoded value: {encoder(s)}")
print(f"Decoded value: {decoder(encoder(s))}")

Encoded value: [60, 58, 58, 44, 24, 60, 13, 59, 13, 53]
Decoded value: hii there!


Okay, this seems working fine!

Note that we are making the LM on character level tokenization. It is simple or primitive way to tokenization but in real world we use more sophesticated way to tokenization. Like google uses `SentencePiece` and OpenAI uses `tiktoken` tokenizer which utilizes Byte pair tokenization.

There is tradeoff between vocab size and sequence length. If you are building a LM on character level then you will have small vocab size but larger sequence length and if you are doing on subword level then you will have large vocab size but small sequence length.

## Encode the data

- Do data transformation so that we can directly feed it to model

In [16]:
import torch

In [17]:
data = torch.LongTensor(encoder(text))

print(data.shape, data.dtype)

#this is how GPT will look at the texts
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([45, 58, 59, 33, 24, 44, 21, 58, 24, 58,  7, 13, 11, 63, 22, 64, 13, 23,
        43, 59, 13, 44, 16, 13, 44, 29, 59, 43,  5, 13, 13,  1, 44, 25, 11,  2,
        44, 23,  4, 59, 24, 60, 13, 59, 14, 44, 60, 13, 25, 59, 44, 30, 13, 44,
        33, 29, 13, 25, 26,  6, 22, 22, 18,  3,  3, 63, 22, 51, 29, 13, 25, 26,
        14, 44, 33, 29, 13, 25, 26,  6, 22, 22, 45, 58, 59, 33, 24, 44, 21, 58,
        24, 58,  7, 13, 11, 63, 22, 35, 43,  4, 44, 25, 59, 13, 44, 25,  3,  3,
        44, 59, 13, 33, 43,  3, 17, 13,  1, 44, 59, 25, 24, 60, 13, 59, 44, 24,
        43, 44,  1, 58, 13, 44, 24, 60, 25, 11, 44, 24, 43, 44, 23, 25, 30, 58,
        33, 60, 56, 22, 22, 18,  3,  3, 63, 22, 48, 13, 33, 43,  3, 17, 13,  1,
         6, 44, 59, 13, 33, 43,  3, 17, 13,  1,  6, 22, 22, 45, 58, 59, 33, 24,
        44, 21, 58, 24, 58,  7, 13, 11, 63, 22, 45, 58, 59, 33, 24, 14, 44,  2,
        43,  4, 44, 26, 11, 43, 16, 44, 21, 25, 58,  4, 33, 44,  0, 25, 59,  5,
      

## Train and Validation Split

- First 90% will be train data and rest will be validation data
- We don't feed entire text to the transformers once because that will be hugely computational expensive.
- We rather take random samples as small chunks from the data and then do training
- `block_size`/`context_length` - chunk size

In [19]:
#train/test split -> 
n = int(0.9*len(data))
train_data = data[:n]
valid_data = data[n:]

print(f"Train size : {len(train_data)}")
print(f"Validation size : {len(valid_data)}")

Train size : 1003854
Validation size : 111540


In [27]:
block_size = 8
train_data[:block_size+1]
decoder(train_data[:block_size+1].tolist())

'First Cit'

In [29]:
#print train and test tokens from block size

x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target : {target}")

when input is tensor([45]) the target : 58
when input is tensor([45, 58]) the target : 59
when input is tensor([45, 58, 59]) the target : 33
when input is tensor([45, 58, 59, 33]) the target : 24
when input is tensor([45, 58, 59, 33, 24]) the target : 44
when input is tensor([45, 58, 59, 33, 24, 44]) the target : 21
when input is tensor([45, 58, 59, 33, 24, 44, 21]) the target : 58
when input is tensor([45, 58, 59, 33, 24, 44, 21, 58]) the target : 24


This is done to make sure transformer sees the context as little as of length one and as big as of context_size. This is also done to increase the efficiency.

Transformer will not see characters more than the block size for predicting the next token.

`min_batches` - done for efficiency to facilatates parallel processing

### Introduce batch dimension

For parallel processing we will introduce batch dimension in our training set. This helps in faster training as GPU can utilize the data in batches and process them in parallel. 

- set manual seed
- `batch_size` - How many independent sequences will be process in parallel
- `block_size` - maxm context length for prediction


In [86]:
len(train_data)

1003854

Function to retrun train/valid data based on split

In [133]:
#set manual seed
torch.manual_seed(1337)

batch_size = 4 
block_size = 8

def get_batch(split):

    #generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else valid_data
    ix = torch.randint(len(data) - block_size, (batch_size, ))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y


In [134]:
#let's test the above function
xb, yb = get_batch('train')

print('inputs')
print("xb shape : ", xb.shape)
print(xb)
print("targets")
print("yb shape : ", yb.shape)
print(yb)


inputs
xb shape :  torch.Size([4, 8])
tensor([[62, 13, 24, 50, 33, 44, 60, 13],
        [23, 43, 59, 44, 24, 60, 25, 24],
        [11, 24, 44, 24, 60, 25, 24, 44],
        [ 0, 39, 12, 63, 22, 19, 44, 29]])
targets
yb shape :  torch.Size([4, 8])
tensor([[13, 24, 50, 33, 44, 60, 13, 25],
        [43, 59, 44, 24, 60, 25, 24, 44],
        [24, 44, 24, 60, 25, 24, 44, 60],
        [39, 12, 63, 22, 19, 44, 29, 25]])


In [135]:
#lets print some examples

#iterate over batch dimension
for b in range(batch_size):
    print(f"Batch No. {b}")
    #iterate over time dimension(context_length)
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when context is {context}, target : {target}")


Batch No. 0
when context is tensor([62]), target : 13
when context is tensor([62, 13]), target : 24
when context is tensor([62, 13, 24]), target : 50
when context is tensor([62, 13, 24, 50]), target : 33
when context is tensor([62, 13, 24, 50, 33]), target : 44
when context is tensor([62, 13, 24, 50, 33, 44]), target : 60
when context is tensor([62, 13, 24, 50, 33, 44, 60]), target : 13
when context is tensor([62, 13, 24, 50, 33, 44, 60, 13]), target : 25
Batch No. 1
when context is tensor([23]), target : 43
when context is tensor([23, 43]), target : 59
when context is tensor([23, 43, 59]), target : 44
when context is tensor([23, 43, 59, 44]), target : 24
when context is tensor([23, 43, 59, 44, 24]), target : 60
when context is tensor([23, 43, 59, 44, 24, 60]), target : 25
when context is tensor([23, 43, 59, 44, 24, 60, 25]), target : 24
when context is tensor([23, 43, 59, 44, 24, 60, 25, 24]), target : 44
Batch No. 2
when context is tensor([11]), target : 24
when context is tensor([11

Now we are done with data preparation step. Lets build a Biagram language model which just infer next token based on immediate preceeding token.

## BigramLanguage Model

In [136]:
#imports
import torch
import torch.nn as nn 
from torch.nn import functional as F 
torch.manual_seed(1337)

<torch._C.Generator at 0x7fad987249b0>

In [137]:
vocab_size

65

In [138]:
yb

tensor([[13, 24, 50, 33, 44, 60, 13, 25],
        [43, 59, 44, 24, 60, 25, 24, 44],
        [24, 44, 24, 60, 25, 24, 44, 60],
        [39, 12, 63, 22, 19, 44, 29, 25]])

In [139]:
nn.Embedding(vocab_size, vocab_size)(torch.tensor(10))

tensor([-1.3441, -0.2827, -0.6887, -0.6897,  0.5899,  0.5532,  0.0651, -1.7956,
         1.3145,  1.7042,  0.5254, -1.2803, -1.1621,  0.6652,  0.0291,  3.6271,
        -0.1357, -0.4648, -1.4324,  0.1254, -1.1245,  0.4881, -0.6896, -0.7080,
        -0.3152,  0.7196, -0.0178, -1.2635,  0.8914, -1.2858, -2.1067, -1.9922,
         0.7629, -0.5948,  0.9828, -0.4151, -0.2026, -1.8955,  0.6117,  0.1095,
         0.0157, -1.0636,  0.8398,  0.4211, -2.0257,  1.0383,  0.5182,  0.5283,
        -0.5648,  0.0383,  0.3049, -2.0662, -1.1418, -0.1391,  1.0827,  1.1522,
         0.5198, -0.8982,  0.3749, -0.0422,  0.7197,  1.8447,  1.4385, -1.3166,
         1.2690], grad_fn=<EmbeddingBackward0>)

In [140]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        #each token directly reads off the logits for the next token from the lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):

        #idx and targets both are of dim (B, T)
        logits = self.token_embedding_table(idx) # (B, T, C) C -> vocab size
        B, T, C = logits.shape 
        #pytorch expects dimensions as below
        logits = logits.view(B*T, C)
        print(logits)
        targets = targets.view(B*T)
        print(targets)
        #loss - measures quality of target over prediction
        loss = F.cross_entropy(logits, targets)

        return logits, loss

m = BigramLanguageModel(vocab_size=vocab_size)
logits, loss = m(xb, yb)

print(logits.shape)
print(loss)

tensor([[-0.7582, -1.8711,  0.7141,  ..., -0.5707, -0.4843, -0.0299],
        [-1.6906,  0.6377,  0.6544,  ..., -2.2905,  1.0941,  1.0316],
        [-1.1730,  0.1125,  1.3759,  ...,  0.2143,  1.5742, -0.1005],
        ...,
        [ 0.6917,  2.0363,  0.2135,  ..., -0.1197,  0.5410, -1.7943],
        [-1.2892,  1.5030, -0.5783,  ...,  0.2987,  0.0178,  0.1400],
        [ 0.3585,  2.0936, -0.8058,  ...,  1.9319,  0.4377, -0.1681]],
       grad_fn=<ViewBackward0>)
tensor([13, 24, 50, 33, 44, 60, 13, 25, 43, 59, 44, 24, 60, 25, 24, 44, 24, 44,
        24, 60, 25, 24, 44, 60, 39, 12, 63, 22, 19, 44, 29, 25])
torch.Size([32, 65])
tensor(4.4300, grad_fn=<NllLossBackward0>)


Add method for generating next tokens in above `BigramLanguageModel` for generating next tokens based on current token.

In [148]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        #each token directly reads off the logits for the next token from the lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):

        #idx and targets both are of dim (B, T)
        logits = self.token_embedding_table(idx) # (B, T, C) C -> vocab size
        B, T, C = logits.shape 
        #pytorch expects dimensions as below
        if targets is None:
            loss = None
        else:
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            #loss - measures quality of target over prediction
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #get the prediction
            logits, loss = self.forward(idx)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
            
model = BigramLanguageModel(vocab_size=vocab_size)

idx = torch.zeros((1, 1), dtype=torch.long)

print(decoder(m.generate(idx, max_new_tokens=100)[0].tolist())) # [0] - unplug single dimension

MI:

Cioncome Lo t aricl, ly. st othessce.
KE&, we
Co,
I's, vesoosfad thay,
NI blWe wnckee IAY pie
OR


In [149]:
#create an optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


In [None]:
eval_iters = 1000

In [150]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "valid"]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [146]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

3.5621116161346436
3.5735630989074707
3.5994362831115723
3.442075729370117
3.4601173400878906
3.5711328983306885
3.506955862045288
3.5818910598754883
3.5547502040863037
3.5862114429473877
3.6111860275268555
3.5587995052337646
3.533806324005127
3.5791144371032715
3.5628952980041504
3.4882655143737793
3.6718530654907227
3.469181537628174
3.4715819358825684
3.661914587020874
3.542369842529297
3.6369903087615967
3.545372724533081
3.529508113861084
3.5342860221862793
3.4788224697113037
3.5055735111236572
3.4876248836517334
3.553682327270508
3.4603562355041504
3.595341920852661
3.7065069675445557
3.6074111461639404
3.5682461261749268
3.5516371726989746
3.573221445083618
3.539686679840088
3.7132062911987305
3.576834201812744
3.5698750019073486
3.5373458862304688
3.566281318664551
3.613236904144287
3.612971544265747
3.5716552734375
3.6020519733428955
3.5616631507873535
3.651553153991699
3.4886226654052734
3.6022706031799316
3.6424553394317627
3.5684168338775635
3.5132052898406982
3.59797048568

In [147]:

print(decoder(m.generate(idx, max_new_tokens=1000)[0].tolist())) # [0] - unplug single dimension

Mand Whe
Mace, aslldouse y:
ININave, re. t schar d,
FLend Tost soe t wa t this herell ly f prigoonownder on:

RKILEYONRInd y brdotorear, hanrs m.
Totasome agh Yerdershareweng!
Thocof; tue;
Wik any tare te, me tir ldyendis, ture letharlid, sthisthe Mag thinou the n I th;
GLout id mue ad. st ILONVIs s w te hyierinofre:

As mu VORityath o hovat nthat h.
INous fo mas trtor busweand llou bldeangiggious brere imsautond: r wit ismiks oulland PThemor athot and hanondo t lito BEDuseefom meay fas witelofeuruse toura le herownd, ougoo
GLLO:

CYowhio t?
I'd's?
Tale feg ble Ve Bo r fithy.
TUTEGe u sthe deaues, our hood yeagou at t RI Con f, le, f tord ba mise r rch he a he ng anoners; arr.
Bus qck kse h!
Gorar she ond are f f a suronotind monon
LORTENTUShag Py. beng th utht bile we'de thafirthe bead,
ANRI ne nthadangowisitse averayoust he doflen chagondatom'st ofan pad athe
shate pld bll t GICone sttealing iknot foppre.
Min RD ENAR je,
Torsise d s ss qur,
Het
Yere hthan boull 's semy f t; o tas:
KE

In above setting tokens are not interecting with each other we are just taking current token to predict the next token.

## Mathematical trick in self-attention

In [151]:
torch.manual_seed(1337)

B, T, C = 4, 8, 2
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [152]:
#We want x[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]
        xbow[b, t] = torch.mean(xprev, 0)
        


In [153]:
xbow

tensor([[[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]],

        [[ 1.3488, -0.1396],
         [ 0.8173,  0.4127],
         [-0.1342,  0.4395],
         [ 0.2711,  0.4774],
         [ 0.2421,  0.0694],
         [ 0.0084,  0.0020],
         [ 0.0712, -0.1128],
         [ 0.2527,  0.2149]],

        [[-0.6631, -0.2513],
         [ 0.1735, -0.0649],
         [ 0.1685,  0.3348],
         [-0.1621,  0.1765],
         [-0.2312, -0.0436],
         [-0.1015, -0.2855],
         [-0.2593, -0.1630],
         [-0.3015, -0.2293]],

        [[ 1.6455, -0.8030],
         [ 1.4985, -0.5395],
         [ 0.4954,  0.3420],
         [ 1.0623, -0.1802],
         [ 1.1401, -0.4462],
         [ 1.0870, -0.4071],
         [ 1.0430, -0.1299],
         [ 1.1138, -0.1641]]])

Similar result can be achieved using matrix multiplication. `torch.tril` truncates all the elements above diag to zeros.

In [154]:
torch.manual_seed(42)
a = torch.ones((3, 3))

torch.tril(a)

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [157]:
wei = torch.tril(torch.ones(T, T))
wei = wei/wei.sum(1, keepdim=True)
xbow2 = wei @ x
torch.allclose(xbow, xbow2)

True

## Rewriting code for transformer

Lets change the above code of `BigramLanguageModel` by introducing a linear interection layer.


In [None]:
#global variables
vocab_size = vocab_size
n_embd = 32

In [None]:
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        #define the embedding layer
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd=32)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.lm_head= nn.Linear(n_embd, vocab_size)
    
    def forward(self, idx, targets=None):

        #idx and targets both are of dim (B, T)
        token_embd = self.token_embedding_table(idx) # (B, T, C) C -> n_embd size
        pos_emd = self.position_embedding_table(torch.arange(T)) # T, C
        logits = self.lm_head(token_embd) # (B, T, C) C -> vocab size
        B, T, C = logits.shape 
        #pytorch expects dimensions as below
        if targets is None:
            loss = None
        else:
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            #loss - measures quality of target over prediction
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #get the prediction
            logits, loss = self.forward(idx)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
            
model = BigramLanguageModel(vocab_size=vocab_size)

idx = torch.zeros((1, 1), dtype=torch.long)

print(decoder(m.generate(idx, max_new_tokens=100)[0].tolist())) # [0] - unplug single dimension

Different tokens will find different other tokens more or less interesting and we want to be data dependent. For example a vowel will look for consonent in past.

How self attention solves it?

- Every single node or token at every position will emit two vectors 
    - query vector - what am i looking for 
    - key - what do i contain
    - value - communicate value

if key and query(for other token are align) - then good match

In [161]:
# version 4: self-attention
torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B, T, C)

#let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
v = value(x)
#interection happens when 
wei = q @ k.transpose(-2, -1) # (B, T, T)

tril = torch.tril(torch.ones(T, T))

wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim = 1)

out = wei @ v

out.shape

torch.Size([4, 8, 16])

**Notes**
- Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.

- Attention supports arbitary connectivity of nodes.

- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.

- Each example across batch dimension is of course processed completely independently and never talk to each other.

- `self-attention` just means the keys and values are produced from the source as queries. In `cross-attention` keys and queries comes from the separate source(eg. encode module) and it used when we have seperate source of nodes from which we want to pull information from to our node.

- `Scaled Attention` - Additional divides wei by 1/sqrt(head_size). This makes it so when input Q, K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much.
    - control variation during initialization

Illustartion below:

In [167]:
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [168]:
k.var()

tensor(1.0966)

In [169]:
q.var()

tensor(0.9416)

In [170]:
wei.var()

tensor(1.0065)

## Self head Implementation

In [183]:
class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size)
        self.query = nn.Linear(n_embd, head_size)
        self.value = nn.Linear(n_embd, head_size)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):

        B, T, C = x.shape
        k = self.key(x) # (B, T, C) C - head_size
        q = self.query(x) # (B, T, C) - C- head_size
        #compute attenstion score (affinities)
        wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, T)
        #make decoder block
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf")) #(B, T, T)
        wei = F.softmax(wei, dim=-1) #(B, T, T)
        #perform weighted aggregate of the values
        v = self.value(x) #(B, T, C)
        out = wei @ v # (B, T, C)

        return out


Rewrite the Bigram Language Model using Head node

In [184]:
n_embd = 32

In [209]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        self.sa_head = Head(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [210]:
model = BigramLanguageModel()

In [212]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.214176654815674
4.199700832366943
4.239514350891113
4.218862056732178
4.186700820922852
4.226130485534668
4.24436092376709
4.208166599273682
4.214605331420898
4.240248203277588
4.221784591674805
4.196395397186279
4.189698219299316
4.235991954803467
4.233548641204834
4.220687389373779
4.22767448425293
4.208745002746582
4.22116756439209
4.2244648933410645
4.211333274841309
4.230423927307129
4.179531097412109
4.190152168273926
4.197641372680664
4.201192855834961
4.208263874053955
4.25800085067749
4.182485580444336
4.213343143463135
4.199041366577148
4.230070114135742
4.185204982757568
4.181177616119385
4.216397285461426
4.197425842285156
4.252711296081543
4.205231666564941
4.218751430511475
4.222886562347412
4.222411155700684
4.238119602203369
4.251081943511963
4.187753200531006
4.22334623336792
4.220522880554199
4.220126152038574
4.2478485107421875
4.224193572998047
4.238748073577881
4.197943210601807
4.197122097015381
4.20084810256958
4.225310802459717
4.224092960357666
4.218008041381

## Multihead attention

- Multi communication channels
- multiple attention in parallel


In [213]:
class MultiHeadAttention(nn.Module):

    """ multiple heads of self attention in parallel """

    def __init__(self, num_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_head)])

    def forward(self, x):

        return torch.cat([h(x) for h in self.heads], dim=-1)


In [214]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        self.sa_head = MultiHeadAttention(4, n_embd//4) #group convolution
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [215]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.20404577255249
4.210916996002197
4.215507984161377
4.202003002166748
4.242471218109131
4.210153579711914
4.257818698883057
4.198739528656006
4.1947340965271
4.241333484649658
4.260656356811523
4.193447589874268
4.181999683380127
4.23184871673584
4.191461086273193
4.223646640777588
4.237530708312988
4.205941200256348
4.188686847686768
4.195237159729004
4.236495018005371
4.234069347381592
4.235219478607178
4.203344345092773
4.202004909515381
4.2177019119262695
4.2579569816589355
4.234684467315674
4.21341609954834
4.210850238800049
4.219239234924316
4.221963405609131
4.224623680114746
4.236835956573486
4.198772430419922
4.228184700012207
4.209569931030273
4.214357376098633
4.20311164855957
4.203883647918701
4.195871353149414
4.231898784637451
4.222784519195557
4.2008819580078125
4.225440979003906
4.199305057525635
4.227909564971924
4.204291343688965
4.225578784942627
4.216580390930176
4.228621959686279
4.240225791931152
4.192161560058594
4.232958793640137
4.232292652130127
4.20242023468

Multihead attention helps in creating multiple channel of communication

## Introduce feed forward layer in decoder block


In [216]:
class FeedForward(nn.Module):
    """"Simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU()
        )
    def forward(self, x):
        return self.net(x)

Add it in the Bigram Model - self attention is like communication to fetch the data once the data is there they need to think about it individually.

In [217]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        self.sa_head = MultiHeadAttention(4, n_embd//4) #group convolution
        self.ffwd = FeedForward(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        x = self.ffwd(x) # token level
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [218]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

4.204038619995117
4.211588382720947
4.199848651885986
4.229249477386475
4.210080623626709
4.191341400146484
4.211798191070557
4.227687835693359
4.2225341796875
4.189696788787842
4.252888202667236
4.193850517272949
4.183999538421631
4.220370292663574
4.211663246154785
4.241042137145996
4.220822811126709
4.223501682281494
4.219017028808594
4.206402778625488
4.187771797180176
4.240336894989014
4.2052130699157715
4.1895365715026855
4.192267894744873
4.200622081756592
4.231438636779785
4.244041919708252
4.211033821105957
4.246546745300293
4.220117092132568
4.227297306060791
4.224756717681885
4.208759307861328
4.2256855964660645
4.205783367156982
4.20211124420166
4.221246242523193
4.197579860687256
4.208597183227539
4.203438758850098
4.227713108062744
4.233147621154785
4.195008754730225
4.223347187042236
4.215117454528809
4.219760894775391
4.227991580963135
4.2010498046875
4.2074055671691895
4.182341575622559
4.200370788574219
4.196925640106201
4.239253044128418
4.203332424163818
4.237870693

## Creating a block excluding the cross head attention

In [219]:
class Block(nn.Module):
    """" Transformer block : communication followed by computation """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(num_head=n_head, head_size=head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

In [None]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        #introduce the above block nn
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.sa_head(x) #apply one head of self attention
        x = self.ffwd(x) # token level
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

To make the deep nn remain optimizable is to use residual/skip connetions

- addition distribute gradients equally to all of its branches

In [None]:
class Block(nn.Module):
    """" Transformer block : communication followed by computation """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(num_head=n_head, head_size=head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x) # forked off do some communicationa and come back
        x = x + self.ffwd(x)
        return x

Add projection in multihead attention

In [221]:
class MultiHeadAttention(nn.Module):

    """ multiple heads of self attention in parallel """

    def __init__(self, num_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_head)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out


Add projection in FFWD

In [223]:
class FeedForward(nn.Module):
    """"Simple linear layer followed by a non-linearity"""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd),
        )
    def forward(self, x):
        return self.net(x)

In [224]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        #introduce the above block nn
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.blocks(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [226]:
model = BigramLanguageModel()

In [228]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

AttributeError: 'BigramLanguageModel' object has no attribute 'sa_head'

Another way to optimization deep nn is doing **layer normalization**

    - Row normalization

Add layernorm in Blocks

In [229]:
class Block(nn.Module):
    """" Transformer block : communication followed by computation """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(num_head=n_head, head_size=head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)


    def forward(self, x):
        x = x + self.sa(self.ln1(x)) # forked off do some communicationa and come back
        x = x + self.ffwd(self.ln2(x))
        return x

In [None]:
class BigramLanguageModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.positional_embdding_table = nn.Embedding(block_size, n_embd)
        #self attention head with head size as n_emd
        #introduce the above block nn
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd),
        )

    def forward(self, idx, targets = None):
        B, T = idx.shape
        # idx and target both are of dim (B, T)
        tok_embd = self.token_embedding_table(idx) #(B, T, C)
        pos_embd = self.positional_embdding_table(torch.arange(T)) # (T, C)
        x = tok_embd + pos_embd # (B, T, C)
        x = self.blocks(x) #apply one head of self attention
        logits = self.lm_head(x) #(B, T, vocab_size)
        # print(logits.shape)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    #because of positional encoding we need to crop the idx - if idx is more than block size then we will be getting error from pos embd table

    #inference
    def generate(self, idx, max_new_tokens):
        #ids is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):
            #crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            #get the prediction
            logits, loss = self.forward(idx_cond)
            #focus only on last time stamp

            logits = logits[:, -1, :] #become (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
        return idx
        

In [230]:
model = BigramLanguageModel()

In [231]:
batch_size = 32

epochs = 10000

for epoch in range(epochs):

    #sample a batch of data
    xb, yb = get_batch('train')

    #evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())


AttributeError: 'BigramLanguageModel' object has no attribute 'sa_head'

Add droput layer