# The purpose of this notebook:

The purpose of this notebook is to develop a small scale LLM (an SLM?) of my own, and study the word to word relations uncovered. LLMs have found a lot of use in the study of cognition and speech processing in general, and I think a deeper understanding of the Transformer architecture used in them will help me as a scientist.

If that's not your jam, no worries: this is how ChatGPT works, in miniature. It is based on Andrej Karpathy's [GPT from Scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY) video; do check him out on YouTube.
 ***

A little mini-goal that I have set for myself is to write better comments, so I hope I achieve that here.

# Tokenizing

In [None]:
#I am using this doc from Andrej Karpathy, in which all of Shakespeare's works have been concatenated, as my training data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-12-04 10:42:30--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-12-04 10:42:30 (32.7 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [None]:
chars = sorted(list(set(text))) #This splits up the above text into individual characters and sorts the unique ones
print("no. of unique characters is ", len(chars), "\nThey are ", chars)

no. of unique characters is  65 
They are  ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


As you can see, apart from the 52 upper- and lowercase letters, the 'vocabulary' that we will use also contains punctuation, the space, the line break and a single digit ('3').

Next, we need to ***tokenise*** the above characters. Tokenising means converting these characters into the integers that represent them.

The terminology below (string to integer, or s to i) is borrowed from Andrej Karpathy once again. In this step we initialise two Python dictionaries: `stoi`, which maps each character to its list index (an integer), and `itos`, which maps each index back to its character. Together they allow encoding and decoding.

In [None]:
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

`decode` used to be `decode = lambda l: [itos[c] for c in l]`.
The issue with that version is that it returns a list of separate characters rather than a single string, so the text never comes back as words. Hence the `''.join()` call.

In [None]:
#Testing this out on strings and numbers
print(encode("test run! one two three"))
print(decode(encode("test run! one two three")))

[58, 43, 57, 58, 1, 56, 59, 52, 2, 1, 53, 52, 43, 1, 58, 61, 53, 1, 58, 46, 56, 43, 43]
test run! one two three


Now we encode the source text.

In [None]:
import torch # we use PyTorch: https://pytorch.org, a very handy package for deep learning in general.
# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

Something fun to note is that real tokenisers, as used in GPT and so forth, don't work on a character level.
They use subword units: syllables, sections of words, entire words, etc. So the no. of distinct tokens in GPT-2 is around 50,000!
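
To give a feel for where subword units come from, here is a minimal sketch of a single merge step in the style of byte-pair encoding (BPE), the family of algorithms that GPT-2's tokeniser belongs to. The helper names `most_frequent_pair` and `merge_pair` and the toy string are made up for this illustration; this is not the real GPT-2 tokeniser.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_tokens = list("aaabdaaabac")  # start from single characters, as in this notebook
pair = most_frequent_pair(toy_tokens)
toy_merged = merge_pair(toy_tokens, pair)
print(pair, toy_merged)
```

Repeating this step many times grows the vocabulary from single characters towards frequent chunks of words, at the cost of a larger embedding table.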

In [None]:
#Train-test split (separating out some data that is unfamiliar to the model, that we can fairly test the model on)
split_pt = int(0.8 * len(data)) #the int is important, because you can't index using a float variable (Imagine asking, "what's the 37.6th item in your list?")
test =  data[split_pt:]
train = data[:split_pt]

Training a deep network is never done in one go. Instead of one gulp, the data is fed into the transformer in bite sized chunks. There can be multiple bites at a time, though (the metaphor is getting a little confusing at this point).

Essentially, instead of asking the model to learn all the connections between the words in the entirety of the data, it is tasked with studying what comes next after one word, two words, or so. It learns in small sentences. However, multiple small sentences can be processed simultaneously without interference.

"The maximum context length for the prediction" is given by how big the bite is.

GPUs are excellent at parallel processing, which drastically cuts down on training and testing time. One appealing thing about transformer networks, and a lot of DL models, is that they essentially boil down to a number of matrix multiplications, which are easily parallelised.

As such, we use two variables, 'block size' and 'batch size', to set how big each chunk of data is and how many chunks of data are fed into the transformer simultaneously.
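
To make the 'small sentences' idea concrete, here is a minimal sketch of how one chunk of `block_size + 1` tokens yields `block_size` training examples, each pairing a growing context with the token that follows it. The token ids are the encoding of "First Cit" under the character mapping above; the variable names are only for illustration.

```python
block_size = 8
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # block_size + 1 token ids ("First Cit")

examples = []
for t in range(block_size):
    context = chunk[: t + 1]   # everything seen so far
    target = chunk[t + 1]      # the token that comes next
    examples.append((context, target))
    print(f"when the input is {context} the target is {target}")
```

A batch simply stacks several such chunks, taken from random positions in the text, so that they can be processed in parallel.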

In [None]:
torch.manual_seed(1337) #The 1337 is a pun, leet, which tells you how old Andrej is. More importantly, setting a 'seed' makes sure that the same 'random' number is generated each time you run this code
block_size = 8
batch_size = 4

def batching(split):
  data = train if split=='train' else test #batching('train') draws from train, anything else draws from test
  ix = torch.randint(len(data) - block_size, (batch_size,)) #batch_size random start positions, each less than len(data) - block_size
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])#these are the targets, basically the next character for each data point in x
  return x, y
xb, yb = batching('train')

In [None]:
print(xb) #This is our input to the transformer, 4 chunks of 8 characters each
print("\n",yb)

tensor([[58, 63,  8,  0,  0, 19, 24, 27],
        [39, 59, 45, 46, 58,  1, 46, 43],
        [49, 43, 57,  1, 53, 50, 42,  1],
        [52, 41, 47, 43, 52, 58,  1, 56]])

 tensor([[63,  8,  0,  0, 19, 24, 27, 33],
        [59, 45, 46, 58,  1, 46, 43,  1],
        [43, 57,  1, 53, 50, 42,  1, 46],
        [41, 47, 43, 52, 58,  1, 56, 47]])


# The basic structure of the model
We start with the simplest possible language model: a bigram model, which predicts the next character from the current character alone.

In [None]:
import torch.nn as nn #Essential for building neural nets in torch
from torch.nn import functional as F


The basic structure of the single character, no attention, model that we're building goes like this:
```
class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__() #this bit inherits the characteristics of nn.Module
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) #https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
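  def forward(self,idx, targets=None):
    logits = self.token_embedding_table(idx) #reads off the logits (unnormalised scores) of the next token from the embedding table
    loss = F.cross_entropy(logits, targets) #conceptually; the shapes need rearranging first, as explained below

    return logits,loss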
```
However, in the actual code we need to rearrange the tensors, because `F.cross_entropy` expects logits of the form (minibatch, channels, ...), with the channel (class) dimension in second place, while by default we have (batch, time, channel). So we keep the channel dimension where it is expected by flattening batch and time into one extended dimension: the logits become (B*T, C) and the targets become (B*T).
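
The flattening step can be sketched in plain Python, without PyTorch. This is a minimal illustration under stated assumptions: the nested lists stand in for tensors, and `cross_entropy` is a hand-written stand-in that computes the mean negative log-probability of the correct class, not the library function itself.

```python
import math

def cross_entropy(flat_logits, flat_targets):
    """Mean negative log-probability of the correct class, one row per prediction."""
    total = 0.0
    for row, target in zip(flat_logits, flat_targets):
        log_norm = math.log(sum(math.exp(v) for v in row))  # log of the softmax denominator
        total += log_norm - row[target]                      # -log softmax(row)[target]
    return total / len(flat_targets)

# Toy logits with shape (B=2, T=3, C=4) and targets with shape (B=2, T=3)
logits = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
targets = [[0, 1, 2], [3, 0, 1]]

# Collapse batch and time into one dimension: (B, T, C) -> (B*T, C) and (B, T) -> (B*T,)
flat_logits = [row for sequence in logits for row in sequence]
flat_targets = [t for sequence in targets for t in sequence]

loss = cross_entropy(flat_logits, flat_targets)
print(len(flat_logits), len(flat_logits[0]), loss)
```

Because every logit is zero here, each prediction is uniform over the 4 classes and the loss comes out as ln(4).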


In [None]:
class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__() #this bit inherits the characteristics of nn.Module
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) #https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

  def forward(self,idx, targets=None):#targets is None, optional, to determine if it's train or test, else when we call forward from generate, there will be an error due to lack of targets
    logits = self.token_embedding_table(idx) #reads off probabilities/logits of next token from the embedding table
    if targets is None:
      loss = None
    else:
      B,T,C = logits.shape
      logits = logits.view(B*T, C)
      targets = targets.view(B*T)
          #batch no, timept, channel no. See above text note for explanation
      loss = F.cross_entropy(logits, targets)
    return logits,loss

  def generate(self,idx,max_new_tokens):
    #max_new_tokens is the number of new characters/tokens to be generated after the input
    for _ in range(max_new_tokens):
      logits,loss = self.forward(idx)
      logits = logits[:,-1,:]#last element in the time dimension, most recent timepoint
      probs = F.softmax(logits, dim=-1) #(B,c)
      idx_next = torch.multinomial(probs,num_samples=1) #(B,1)
      idx = torch.cat((idx,idx_next),dim=1) #(B,T+1)
    return idx


In [None]:
m = BigramLanguageModel(vocab_size=len(chars))
logits,loss= m(xb,yb)
print(logits.shape)
print("The loss is ",loss)

torch.Size([32, 65])
The loss is  tensor(4.6627, grad_fn=<NllLossBackward0>)


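If the model were guessing purely by luck, i.e. giving every character the same probability, the loss would be -log(likelihood) = -ln(1/len(chars)) = -ln(1/65), or about 4.17. Our current loss is a little higher than that.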
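
The baseline figure can be checked with a couple of lines of plain Python. This is just the arithmetic from the sentence above; `uniform_loss` is a name chosen for this illustration.

```python
import math

vocab_size = 65                           # number of unique characters found above
uniform_loss = -math.log(1 / vocab_size)  # loss when every character is equally likely
print(f"{uniform_loss:.2f}")
```

An untrained model starts slightly above this value because its random initial logits are not perfectly uniform.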

In [None]:
#Generating new tokens
idx = torch.zeros((1,1),dtype=torch.long) #Starting with a new line makes sense, and encode("\n") gives [0]
print(decode(m.generate(idx,max_new_tokens=1000)[0].tolist()))


l-QYjt'CL?jLDuQcLzy'RIo;'KdhpV
vLixa,nswYZwLEPS'ptIZqOZJ$CA$zy-QTkeMk x.gQSFCLg!iW3fO!3DGXAqTsq3pdgq!LznIeJydZJSrFSrPLR!:VwWSmFNxbjPiNYQ:sry,OfKrxfvJI$WS3JqCbB-TSQXeKroeZfPL&,:opkl;Bvtz$LmOMyDjxxaZWtpv,OxZQsWZalk'uxajqgoSXAWt'e.Q$.lE-aV
;spkRHcpkdot:u'-NGEzkMPy'hZCWhv.w.q!f'mOxF&IDRR,x
?$Ox?xj.BHJsGhwVtcuyoMIRfhoPL&fg-NwJmOQalcEDveP$IYUMv&JMHkzd:O;yXCV?wy.RRyMys-fg;kHOB EacboP g;txxfPL
NTMlX'FNYcpkHSGHNuoKXe..ehnsarggGFrSjIr!SXJ?KeMl!.?,MlbDP!sfyfBPeNqwjLtIxiwDDjSJzydFm$CfhqkCe,n:kyRBubVbxdojhEzAtV
l;Undhmj.KZaOZJnHlrAaAQcn-iugqTxJ;Ig,NqE&HOxzYcLyHaxyj'ak'StIhPBfJi3Y.uFYc$'NqtvDXhot;tXacKz$FU&V.bESfOng;;:N
OoeAgkcLo'dF&$ydutvA$VrIJdTkBHcb-T itZmY&qEh;lg

O;kHQYQNCd yeXhfUOm FvDmVehVerKkDF bv3pPyXAg;ukn:OajcSl;.kHF3Ml?llX
xVtIrK-kHE;:sZElrIZ
tTx-wBPfqTgLNcCy,abjxFg;tVxFdFlpdoimjDRSJs&UPL?$kas:uvg
k!kDptzkcusoCZJ afBkAs:Naj'aIUrXN!LwJIY..kjwFfduwRjfvsroed.iuHg.WFEhyC?t:WtkBfGTZKjfh;&tkwFRkNsThFj.eJ rCoh3,OxjtffUPGXyTTC$m Xcn?q$CGukHMXnAQTwNUzmZED3fmOEX
fJOdFvLwC.Q!Np'&fhTHrbOrCC IRTyprP

As you can see, the text is gibberish. Now it's time to train the model. In order to train a model, you need an optimiser, like Stochastic Gradient Descent (SGD), or the most common and well used optimiser, Adam (short for Adaptive Moment Estimation).
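
Before handing the job to PyTorch, it may help to see what an optimiser step amounts to. Below is a minimal sketch of plain gradient descent on a made-up one-parameter loss; `loss_fn`, `grad_fn` and the numbers are assumptions for illustration only. Adam follows the same loop but additionally adapts the step size for each parameter.

```python
def loss_fn(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def grad_fn(w):
    return 2.0 * (w - 3.0)         # derivative of the toy loss

w = 0.0                            # initial parameter value
learning_rate = 0.1
for step in range(100):
    grad = grad_fn(w)              # plays the role of loss.backward()
    w = w - learning_rate * grad   # plays the role of optimizer.step()
print(w, loss_fn(w))
```

Each step moves the parameter against the gradient, so the loss shrinks until `w` settles near the minimum.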



In [None]:
#Optimiser loop
batch_size = 32
optimizer = torch.optim.Adam(m.parameters(),lr=1e-3) #A learning rate of 1e-3 works because this is a very basic model; 3e-4 is a more standard default for bigger ones
for steps in range(10000):
  xb,yb = batching('train') #batching expects the split name as a string
  _,loss = m(xb,yb)
  optimizer.zero_grad(set_to_none=True) #At each step, the optimiser doesn't start with any value of gradient (Look up https://builtin.com/data-science/gradient-descent)
  loss.backward()#calculates the gradient(loss) using backpropagation
  optimizer.step()#changes parameters in the opposite direction of gradient, so that loss decreases
print(loss.item())

2.4418282508850098


In [None]:
print(decode(m.generate(idx,max_new_tokens=1000)[0].tolist()))


n.
Paure wagepplirn!
meftou ow prink, avewist th;thomayo alingienco, An he ware whiougou he s imaror?
Bu ne-ithof acat: bel,
Fothind at wrt:
HQUCANTh ros!
ANQNThes med thestw cos wand herrs yorfold madlous jouney biPoer bngabtestouMatswo IONThathery ththe tonty th, fourid irtys ndyod pp qur awe ainowhemy azur:
I,
Ishit tinghast ha tteredef seasiomamy.
Makine,

TISARThe ientr?
GRO:
DELAre y l, ure t codausioftotierr,
her tr fed
th h'er?
Sh woo!qual,
A y kise ugive n telo,
ANCEThisk, thoa iroro s lly ndst onindave S:

me; d monderecr.
UCofre,
Pe lllit, ddff; choManome ureswir anqur t h sele s wame me Jo mshe 'ro'e s terena h w towant ak.
KEMIRSe dorome hels, to aly f t, mmelel meauns athed fr yeak, LELUMo tce be e hend VIO:
Whell a t te ghs: t wed.
Thintsheshuine,
'sowathepon.

ENTO:
LOMI I sethethor, l f ghato, wn nz'SI tar
Thohor llo:
PedHAPENCHWee are, PELirr tyowail I towithengory t ce stheer.

yo awes ntherem; ssted bee iteme ur,
TMIAd cou pe Lof IORTams webear, mss be'Zxpans! alis

Running the optimiser for several thousand steps drops the loss, giving us a more coherent kind of gibberish.

# Self-Attention

Self attention basically builds in connections between different tokens at each step.

In [None]:
import torch.nn as nn #Essential for building neural nets in torch
from torch.nn import functional as F

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [None]:
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
print(wei)
xbow3 = wei @ x #weighted matrix * our matrix

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


As you can see, we take a lower-triangular matrix of ones (`torch.tril()`), fill the positions where it is zero with -inf (so that, at each step, the tokens that come later get no weight at all), and then apply softmax so that the values in each row sum to one. Each row of the result gives the contribution of the current token and every earlier token to the prediction of the next one; for now, all of them contribute equally.
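
The same masking trick can be reproduced in plain Python, which makes it easy to see that the result is a running average over the past. This is a minimal sketch; `causal_average_weights` and the toy inputs are names and values made up for this illustration.

```python
import math

def causal_average_weights(T):
    """Row t gives equal weight to positions 0..t and zero weight to later ones."""
    weights = []
    for t in range(T):
        # scores start at 0; future positions (j > t) are masked to -inf
        scores = [0.0 if j <= t else float("-inf") for j in range(T)]
        exps = [math.exp(s) for s in scores]       # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])  # softmax over the row
    return weights

wei_rows = causal_average_weights(4)
x_toy = [1.0, 3.0, 5.0, 7.0]                       # one made-up feature per time step
running_mean = [sum(w * x for w, x in zip(row, x_toy)) for row in wei_rows]
print(running_mean)
```

Each entry of `running_mean` is the average of the inputs up to and including that time step, which is exactly what `wei @ x` computes for every channel.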

**Notes from Andrej Karpathy:**

- Attention is a communication mechanism. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block, just delete the single line that does masking with tril, allowing all tokens to communicate. The block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally multiplies wei by 1/sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, wei will have unit variance too, and softmax will stay diffuse and not saturate too much. Illustration below.

In [None]:
head_size = 16
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1)
print("variance of q: ",q.var()," k: ",k.var())
print("variance of wei without normalising: ",wei.var())
wei = q @ k.transpose(-2, -1) * head_size**-0.5
print("variance of wei after normalising:", wei.var())
print("Wei feeds into a softmax, and during initialisation, if wei takes on very positive or very negative numbers, the softmax tends towards one-hot vectors - it would peak at the largest value")

variance of q:  tensor(1.0449)  k:  tensor(1.0554)
variance of wei without normalising:  tensor(18.1082)
variance of wei after normalising: tensor(1.1318)
Wei feeds into a softmax, and during initialisation, if wei takes on very positive or very negative numbers, the softmax tends towards one-hot vectors - it would peak at the largest value
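
The saturation effect mentioned in the printed message can be seen directly by pushing the same scores through softmax at two different magnitudes. A minimal sketch in plain Python, with made-up scores and a hand-written `softmax` helper:

```python
import math

def softmax(values):
    shifted = [v - max(values) for v in values]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, -0.2, 0.3, -0.2, 0.5]             # made-up attention scores
diffuse = softmax(scores)
peaked = softmax([8 * s for s in scores])        # same scores with 8x larger magnitude
print([round(p, 3) for p in diffuse])
print([round(p, 3) for p in peaked])
```

With the larger scores, most of the probability mass lands on the single largest entry, so a token would effectively attend to just one other token. Scaling wei keeps the scores in the range where the weights stay spread out.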


# Building the model, component by component

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
vocab_size = len(chars)
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3 #Self-Attention can't tolerate very high learning rates
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

**Self attention**
- Self attention here is built not on the uniform average that we built earlier using softmax, but on unequal contributions: each token needs different context cues, after all (vowels, consonants)
- In this self-attention model, each token at each position emits three vectors: a *key*, a *query* and a *value*. The key and the query represent what the token 'contains' and what it 'looks for' respectively; the value is what it passes on to the tokens that attend to it
- The uniform weight matrix, wei, is replaced by the dot product of each token's query with the keys of itself and all previous tokens (the masking and softmax are then done as before)
- If a key and query are aligned, the dot product will be higher than otherwise, and the weight of that key will be higher

```
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C) #random matrix of shape (B,T,C)

#Self-Attention Head:
head_size = 16
key = nn.Linear(C,head_size, bias=False) #no bias: key, query and value are plain linear projections of x
query = nn.Linear(C, head_size,bias = False)
value = nn.Linear(C, head_size , bias = False)

k = key(x) #(B,T,16/head_size)
q = query(x) #(B,T,16)
v = value(x)
wei = q @ k.transpose(-2,-1) #(B,T,16) @ (B,16,T) --> (B,T,T)

tril = torch.tril(torch.ones(T,T)) #Because same for each batch
wei = wei.masked_fill(tril==0,float('-inf'))#Remove this if nodes from the present need context from the future : e.g. sentiment analysis encoders
wei = F.softmax(wei,dim=-1)
#out = wei @ x
out = wei @ v #

```



In [None]:
"""
print(out.shape)
print(wei[0])
print("\nThis is clearly from the untrained values of weights, but they're definitely disproportionate, not equal values")
"""

'\nprint(out.shape)\nprint(wei[0])\nprint("\nThis is clearly from the untrained values of weights, but they\'re definitely disproportionate, not equal values")\n'

**A single head of attention**
- We start with a single 'head' of attention

In [None]:
class Head(nn.Module):
  """one head of attention"""
  def __init__(self,head_size):
    super().__init__()
    self.key = nn.Linear(n_embd,head_size, bias=False) #no bias: key, query and value are plain linear projections of x
    self.query = nn.Linear(n_embd, head_size,bias = False)
    self.value = nn.Linear(n_embd, head_size , bias = False)
    self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size))) #This replaces the tril that was initialised separately, see the masked_fill line below. A buffer is stored with the module but is not a trainable parameter

    self.dropout = nn.Dropout(dropout)

  def forward(self,x):
    B,T,C = x.shape
    k = self.key(x) #(B,T,head_size)
    q = self.query(x) #(B,T,head_size)

    #compute attention scores, aka 'affinities'
    wei = q @ k.transpose(-2,-1) * k.shape[-1] **(-0.5) #(B,T,T), scaled by 1/sqrt(head_size) as described in the notes above
    wei = wei.masked_fill(self.tril[:T,:T]==0,float('-inf')) #Remove this if nodes from the present need context from the future : e.g. sentiment analysis encoders
    wei = F.softmax(wei,dim=-1)
    wei = self.dropout(wei)

    v = self.value(x)
    out = wei @ v
    return out



**Multi-Headed Attention**
- Now we add multi-headed attention, which basically means we add multiple heads (like the ones we created above) in parallel, like so:

In [None]:
class MultiHeadAttention(nn.Module):
  def __init__(self, num_heads, head_size):
    super().__init__()
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

  def forward(self,x):
    return torch.cat([h(x) for h in self.heads], dim = -1)

**Feedforward Block**
- The feedforward block is simply a multi-layer perceptron or MLP (nn.Linear)
- Here we use a single linear layer from n_embd nodes to n_embd nodes, with a ReLU activation function
- According to Andrej, this essentially is an opportunity to 'think on' the data developed during the interconnected knowledge transfer from the attention block

In [None]:
class FeedForward(nn.Module):
  def __init__(self,n_embd):
    super().__init__()
    self.ffwd = nn.Sequential(
        nn.Linear(n_embd,n_embd),
        nn.ReLU(),
    )

  def forward(self,x):
    return self.ffwd(x)

**Block arrangement**
- Blocks that alternate communication (attention) with computation (the MLP) make up the structure of the Transformer

In [None]:
class Block(nn.Module):
  def __init__(self,n_embd,n_heads):
    super().__init__()
    head_size = n_embd//n_heads
    self.sa = MultiHeadAttention(n_heads,head_size)
    self.ffwd = FeedForward(n_embd)

  def forward(self,x):
    x = self.sa(x)
    x = self.ffwd(x)
    return x

**Skip connections and norm**
- With these attention blocks, MLPs and more building up, the deep neural network is actually becoming fairly, well, deep.
- The deeper a network gets, the harder it becomes to optimise, and the more expensive it becomes to train
- In this case, we use two tricks:
    - **Residual Connections aka Skip Connections**
      - Source: [Deep Residual Learning for Image Recognition - He et al., 2015](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)
      - The input is added back onto the output, i.e. if the deep net takes a node A and performs some computations on it
      - Say A -> ReLU(Linear(Attention(A))) -> B
      - A 'residual' or skip connection straight from A -> B can be made, based solely on +
      - The end result basically is ReLU(Lin(Att(A))) + A
      - During backpropagation, an addition passes the incoming gradient unchanged to both of its inputs, so the gradient of the loss forks: one copy flows straight back along the skip path to A, the other goes through the block
      - In the video's explanation, the blocks contribute very little at initialisation, so the skip path dominates at first and each block then comes online slowly during optimisation
    - **Layer Norm**
      - Layer normalisation is very similar to batch normalisation, which ensures that across the batch dimension, each neuron has zero mean and unit standard deviation
      - Layer norm, however, doesn't normalise the columns, it normalises the rows: each token's feature vector is normalised on its own, independently of the rest of the batch.
      - Another thing to note: since the publication of the *Attention is all you need* paper, not much has changed when it comes to transformer structure. However, nowadays layer normalisation is done prior to the attention blocks rather than afterwards (pre-norm).
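
Both tricks are small enough to sketch in plain Python. This is a minimal illustration under stated assumptions: `layer_norm` leaves out the trainable scale and shift that `nn.LayerNorm` also has, and `sublayer` is a made-up stand-in for an attention or MLP block whose contribution is small.

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalise one token's feature vector to zero mean and (about) unit variance."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]

def sublayer(x):
    return 0.01 * x ** 2          # stand-in for attention/MLP with a small contribution

def residual(x):
    return x + sublayer(x)        # skip connection: the input is added back to the output

token = [2.0, 4.0, 6.0, 8.0]      # made-up features of a single token
normed = layer_norm(token)

# Numerical derivative of the residual branch: 1 (identity path) + sublayer'(x)
h = 1e-6
slope = (residual(1.0 + h) - residual(1.0 - h)) / (2 * h)
print(normed, slope)
```

The slope stays close to 1 even though the sublayer's own derivative is tiny, which is the sense in which the skip path keeps gradients flowing through a deep stack.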

# Final (complete) model

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

#updated hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 3e-4 #Self-Attention can't tolerate very high learning rates
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

In [None]:
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [None]:
class Head(nn.Module):
  """one head of attention"""
  def __init__(self,head_size):
    super().__init__()
    self.key = nn.Linear(n_embd,head_size, bias=False) #no bias: key, query and value are plain linear projections of x
    self.query = nn.Linear(n_embd, head_size,bias = False)
    self.value = nn.Linear(n_embd, head_size , bias = False)
    self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size))) #This replaces the tril that was initialised separately, see the masked_fill line below. A buffer is stored with the module but is not a trainable parameter

    self.dropout = nn.Dropout(dropout)

  def forward(self,x):
    B,T,C = x.shape
    k = self.key(x) #(B,T,head_size)
    q = self.query(x) #(B,T,head_size)

    #compute attention scores, aka 'affinities'
    wei = q @ k.transpose(-2,-1) * k.shape[-1] **(-0.5) #(B,T,T), scaled by 1/sqrt(head_size) as described in the notes above
    wei = wei.masked_fill(self.tril[:T,:T]==0,float('-inf')) #Remove this if nodes from the present need context from the future : e.g. sentiment analysis encoders
    wei = F.softmax(wei,dim=-1)
    wei = self.dropout(wei)

    v = self.value(x)
    out = wei @ v
    return out



In [None]:
#Now we update the classes for use with skip connections; ## marks the new lines
class MultiHeadAttention(nn.Module):
  def __init__(self, num_heads, head_size):
    super().__init__()
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
    self.proj = nn.Linear(n_embd,n_embd)  ##

  def forward(self,x):
    out = torch.cat([h(x) for h in self.heads], dim = -1)
    out = self.proj(out)  ##
    return out

class FeedForward(nn.Module):
  def __init__(self,n_embd):
    super().__init__()
    self.ffwd = nn.Sequential(
        nn.Linear(n_embd,4 * n_embd), #The 4 * n_embd is taken directly from the Attention Is All You Need paper, where the inner layer of ffwd is 4 times the input size
        nn.ReLU(),
        nn.Linear(4 * n_embd,n_embd), ##
        nn.Dropout(dropout), ##
    )

  def forward(self,x):
    return self.ffwd(x)

In [None]:
#Updated block
class Block(nn.Module):
  def __init__(self,n_embd,n_heads):
    super().__init__()
    head_size = n_embd//n_heads
    self.sa = MultiHeadAttention(n_heads,head_size)
    self.ffwd = FeedForward(n_embd)
    self.ln1 = nn.LayerNorm(n_embd) #Normalises across the features of each token; its scale and shift are also trainable
    self.ln2 = nn.LayerNorm(n_embd)

  def forward(self,x):
    x = x + self.sa(self.ln1(x))
    x = x + self.ffwd(self.ln2(x))
    return x

In [None]:
#The cooler model, now incorporating the above 'Head' of attention


class SmallLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token is looked up in an embedding table, and each position in a second one
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size,n_embd)
        #self.sa_head = Head(n_embd) #For single head attention
        self.sa_heads = MultiHeadAttention(n_head, n_embd//n_head) #n_head heads of (n_embd//n_head)-dimensional attention, here 6 heads of 64 dimensions. // takes the floor of the quotient
        self.ffwd = FeedForward(n_embd)
        '''self.blocks = nn.Sequential(
            Block(n_embd,n_heads=4),
            Block(n_embd,n_heads=4),
            Block(n_embd,n_heads=4),
            nn.LayerNorm(n_embd),
        )'''#replaced by the next chunk
        self.blocks = nn.Sequential(*[Block(n_embd, n_heads=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd,vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B,T = idx.shape

        tok_embed = self.token_embedding_table(idx)
        pos_embed = self.position_embedding_table(torch.arange(T,device=device))
        embed = tok_embed + pos_embed
        embed = self.sa_heads(embed)
        embed = self.ffwd(embed)
        embed = self.blocks(embed)
        embed = self.ln_f(embed) # (B,T,C)
        logits = self.lm_head(embed) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_crop = idx[:,-block_size:]
            # get the predictions
            logits, loss = self(idx_crop)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


In [None]:
model = SmallLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
#max_iters = 5000
for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


12.560705 M parameters
step 0: train loss 4.3500, val loss 4.3486
step 100: train loss 3.0778, val loss 3.1211
step 200: train loss 2.5301, val loss 2.5234
step 300: train loss 2.4576, val loss 2.4628
step 400: train loss 2.3926, val loss 2.4201
step 500: train loss 2.2683, val loss 2.3069
step 600: train loss 2.0914, val loss 2.1430
step 700: train loss 1.9197, val loss 2.0114
step 800: train loss 1.7689, val loss 1.9071
step 900: train loss 1.6778, val loss 1.8550
step 1000: train loss 1.5946, val loss 1.7880
step 1100: train loss 1.5311, val loss 1.7306
step 1200: train loss 1.4908, val loss 1.6971
step 1300: train loss 1.4483, val loss 1.6569
step 1400: train loss 1.4127, val loss 1.6333
step 1500: train loss 1.3887, val loss 1.6104
step 1600: train loss 1.3621, val loss 1.5834


KeyboardInterrupt: 

In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


HASTINGS:
How, Thy granding of thy biock have nought?

GLOUCESTER:
This crown of thy spirit
He or back end of powern your honour's breds:
Ere buid his his siders' smonutunt power
And as out once must and seal down me more.

ANGELO:
A pherdic no, name take me sleep.

SICINIUS:
As it was, lest, by be spite: 'tis is no flesh.

PARIS:
How now, my might to world, and is to off the land?

RATCLIF:
And the glory upon him.

MARIANA:
By all; and thinks and, here did my name.

QUEEN MARGARET:
Not I thanks, what would I do himself have word,
He had you ever mistress, gentleman me
The cursely of speck, and so leave wise and easure.
Call we must I see a foot of me;
And give answer all to her insto wife,
That shall nor with a saf pather's ere lay,
Which my brother's for my kindle flear;
The toold hour eyes daughter's watching all,
And my comfors of his is thing sound's from myself.
Did driven is speak streetch well!

GREEN:
Ten his love, lawbury. I disgrace; and of ild you king
The eye steet no thi

Now, this isn't exactly GPT, not just because of the scale (we have around 12 million parameters, GPT has billions), but also because what we have trained here corresponds only to the pre-training stage of GPT.
OpenAI went on to fine-tune GPT using supervised learning, showing it thousands of prompts paired with example responses that indicate how a chat assistant should answer. Otherwise, it was likely to barf out text that matched what it learned on the internet - answering questions with questions, links, entire articles, or so on.
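
The parameter count printed during training (12.560705 M) can be checked by hand. A minimal sketch, assuming the layer shapes defined in the final model above, including the stand-alone `sa_heads` and `ffwd` layers that run before the stacked blocks; the helper `linear` is a made-up name for this illustration.

```python
# Hyperparameters from the final model in this notebook
vocab_size, block_size = 65, 256
n_embd, n_head, n_layer = 384, 6, 6
head_size = n_embd // n_head

def linear(n_in, n_out, bias=True):
    """Number of parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * n_embd  # one scale and one shift vector
# key, query and value projections per head (no bias), plus the output projection
attention = n_head * 3 * linear(n_embd, head_size, bias=False) + linear(n_embd, n_embd)
feed_forward = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = attention + feed_forward + 2 * layer_norm

total = (
    vocab_size * n_embd           # token embedding table
    + block_size * n_embd         # position embedding table
    + attention + feed_forward    # the stand-alone sa_heads and ffwd layers
    + n_layer * block             # the stacked Blocks
    + layer_norm                  # final layer norm
    + linear(n_embd, vocab_size)  # language-model head
)
print(total / 1e6, "M parameters")
```

Almost all of the parameters sit in the attention and feed-forward layers; the two embedding tables together account for only about 0.12 million.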

Regardless, I think this has been a very fun learning experience.
I hope the section-by-section elaboration that I've written out has been helpful.
Once again, I do have a primary source for this, and it's Andrej Karpathy's ['GPT from scratch' video](https://www.youtube.com/watch?v=kCc8FmEb1nY). If you're not familiar with his work, I definitely recommend you check him out: he has very thorough tutorials and explanations about deep learning and AI, just like this one.

In [None]:
print("Good luck, happy Transforming! :)")