<a href="https://colab.research.google.com/github/classic-mathematician/LLMs/blob/main/gpt_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-09-13 06:01:35--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-09-13 06:01:36 (18.1 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
# reading and opening input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
# dataset length in chars
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [None]:
# First 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [None]:
# Take the set of the text so that only the unique characters remain, then sort them into a list
chars = sorted(list(set(text)))
# The vocabulary size is the number of unique characters
vocab_size = len(chars)
# print the unique characters as a single string
print(''.join(chars))
# print the vocabulary size
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [None]:
# preliminary encoding

# string-to-int dictionary, built with a dict comprehension over the sorted characters
stoi = { ch:i for i,ch in enumerate(chars) }

# int-to-string dictionary, the inverse mapping
itos = { i:ch for i,ch in enumerate(chars) }

# encode: string -> list of ints (looks up every character of the string in the stoi dictionary)
encode = lambda s: [stoi[c] for c in s]

# decode: list of ints -> string (looks up every int in the itos dictionary and joins the characters)
decode = lambda l: ''.join([itos[i] for i in l])

# example
print(encode("magic"))
print(decode(encode("magic")))

[51, 39, 45, 47, 41]
magic
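
A small sketch of the same character-level tokenizer on a made-up string (`tiny_text` is a hypothetical stand-in for the dataset, not part of the notebook). It shows the round trip and one limitation worth knowing: a character that never appeared in the text has no entry in `stoi`, so encoding it raises a `KeyError`.

```python
# Hypothetical tiny corpus, used only for illustration.
tiny_text = "hello world"

chars = sorted(set(tiny_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # string -> list of ints
    return [stoi[c] for c in s]

def decode(ids):
    # list of ints -> string
    return "".join(itos[i] for i in ids)

ids = encode("hold")
roundtrip = decode(ids)
print(ids, roundtrip)

# 'y' does not occur in tiny_text, so it has no index
try:
    encode("hey")
    unseen_failed = False
except KeyError:
    unseen_failed = True
print("unseen character rejected:", unseen_failed)
```

For the Shakespeare text this is not a problem in practice, as long as prompts only use the 65 characters listed above.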


In [None]:
# we use PyTorch
import torch

# we encode the whole text with our encode function and store it in a torch.tensor with dtype long
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)

# this is how the gpt will see the information
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [None]:
# Classic ML split into training and validation sets.

# first 90% for training, last 10% for validation
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [None]:
# context length of the transformer; this is the time dimension (T)
block_size = 8
# +1 since the idea is to use 8 tokens to predict the next one, so we need an offset of 1.
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [None]:
# THIS IS HOW THE TRANSFORMER WILL TRAIN ON THESE EXAMPLES
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [None]:
# manually setting a seed for reproducibility
torch.manual_seed(1337)

# batch size of 4: the number of independent sequences we process in parallel
batch_size = 4
# context size
block_size = 8

def get_batch(split):
    # pick the split to sample the batches from
    data = train_data if split == 'train' else val_data
    # sample batch_size random start indexes; the upper bound len(data) - block_size keeps i+block_size+1 within bounds
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # inputs: for each start index i, the block_size tokens data[i:i+block_size], stacked into a (batch_size, block_size) tensor
    x = torch.stack([data[i:i+block_size] for i in ix])
    # targets: the same windows shifted one position to the right
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y


xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53
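
One way to see the one-position shift between inputs and targets is to run the same sampling logic on a toy corpus where the next token is predictable. The sketch below assumes a made-up dataset `toy_data` (the integers 0..99) and a hypothetical helper `get_toy_batch` that mirrors `get_batch`; on this data every target must equal its input plus one.

```python
import torch

torch.manual_seed(0)

# Toy "corpus": the integers 0..99, so the token after k is always k + 1.
toy_data = torch.arange(100)
block_size = 8
batch_size = 4

def get_toy_batch(data):
    # randint's upper bound is exclusive, so the largest start index is
    # len(data) - block_size - 1 and the target slice never runs past the end
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_toy_batch(toy_data)
print(xb.shape, yb.shape)
# on this toy data the targets are exactly the inputs shifted by one
print(torch.equal(yb, xb + 1))
```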

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# seed setting for reproducibility
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # declaration of an embedding table using nn.Embedding of size vocab_size x vocab_size
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)


    # forward pass: runs an input through the model and returns (logits, loss)
    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        #print('idx: ', idx)
        #print('idx shape: ', idx.shape)
        # the embedding lookup maps each integer in the (B,T) input to a row of length C, which gives a (B,T,C) tensor
        logits = self.token_embedding_table(idx) # (B,T,C)
        #print('logits: ', logits)
        #print('logits shape: ', logits.shape)


        # without targets (e.g. during generation) there is no loss to compute
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            # F.cross_entropy expects (N, C) logits and (N,) targets, so we flatten the batch and time dimensions into one
            logits = logits.view(B*T, C)
            # we do the same with the targets
            targets = targets.view(B*T)

            # F.cross_entropy applies log-softmax to the logits internally and compares the resulting distribution with the correct class index, which is why the targets need no embedding
            loss = F.cross_entropy(logits, targets)


        return logits, loss


    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # we run the forward function and get the logits
            logits, loss = self(idx)
            # keep only the last time step: its logits are the prediction for the next token
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution instead of always picking the most likely token (NOTE: multinomial returns the integer index of the class that was sampled)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append the sampled index to the running sequence of tokens
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)


print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
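
A useful sanity check on the loss printed above: a model that assigns equal probability to all 65 characters would score ln(65), roughly 4.17, so the 4.8786 of the untrained model means the randomly initialized embedding table does somewhat worse than uniform guessing. The sketch below computes cross-entropy by hand for a single position; `cross_entropy_single` is a hypothetical helper written for illustration, not PyTorch's implementation.

```python
import math

def cross_entropy_single(logits, target):
    # numerically stable log-sum-exp: subtract the max before exponentiating
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # negative log-probability of the target class
    return log_sum_exp - logits[target]

vocab_size = 65

# all logits equal -> uniform distribution over the vocabulary
uniform_loss = cross_entropy_single([0.0] * vocab_size, target=3)
print(round(uniform_loss, 4))

# raising the logit of the correct class lowers the loss
confident = [0.0] * vocab_size
confident[3] = 5.0
confident_loss = cross_entropy_single(confident, target=3)
print(round(confident_loss, 4))
```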


In [None]:
# create a PyTorch AdamW optimizer (Adam with decoupled weight decay)
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32

# THIS IS THE TRAINING LOOP
for steps in range(10000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)

    # we set all the gradients from the previous step to 0
    optimizer.zero_grad(set_to_none=True)

    # we compute the gradients
    loss.backward()

    # we apply the gradients to the parameters
    optimizer.step()

print(loss.item())



2.5727508068084717


In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=300)[0].tolist()))


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind tt hinig t ouchos tes; st yo hind wotte grotonear 'so it t jod weancotha:
h hay.JUCle n prids, r loncave w hollular s O:
HIs; ht 


Self attention trick


In [None]:
# TOY EXAMPLE

# seed for reproducibility
torch.manual_seed(1337)

# smaller example of same shape, 4 batches, 8 context size and two channels
B,T,C = 4,8,2 # batch, time, channels

# we initialize the tensor randomly
x = torch.randn(B,T,C)

# printing the shape
x.shape

torch.Size([4, 8, 2])

In [None]:
# We want xbow[b,t] = mean_{i<=t} x[b,i]: each position becomes the average of itself and all previous tokens, a crude way of summarizing the context so far


# xbow is short for "bag of words" (an average that ignores token order); it starts as all zeros
xbow = torch.zeros((B,T,C))

# for every batch
for b in range(B):

    # for every token!
    for t in range(T):

        # for each t we take the slice of the first t+1 tokens of the sequence, e.g. with made-up token values:

        # t=0 [2]
        # t=1 [2, 4]
        # t=2 [2, 4, 3]
        # t=3 [2, 4, 3, 8]
        # t=4 [2, 4, 3, 8, 7]
        # t=5 [2, 4, 3, 8, 7, 9]

        # each slice is averaged over the time dimension and the result is stored at position t of xbow (which starts out as all zeros)

        xprev = x[b,:t+1] # (t+1,C)


        xbow[b,t] = torch.mean(xprev, 0)


In [None]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [None]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
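
To check the printed `xbow[0]` by hand, here is a plain-Python running mean over the first three rows of `x[0]` as printed above. The inputs are the rounded four-decimal values from the output, so the results agree with `xbow[0]` only to about that precision.

```python
# First three rows of x[0], copied from the printed output (rounded values).
rows = [
    [0.1808, -0.0700],
    [-0.3596, -0.9152],
    [0.6258, 0.0255],
]

running_sum = [0.0, 0.0]
running_means = []
for t, row in enumerate(rows):
    # accumulate the per-channel sum of rows 0..t
    running_sum = [s + v for s, v in zip(running_sum, row)]
    # divide by the number of rows seen so far to get the mean
    running_means.append([s / (t + 1) for s in running_sum])

for mean in running_means:
    print([round(m, 4) for m in mean])
```

Keeping a running sum avoids re-reading the whole prefix at every step, which is the same saving the matrix formulation later gets in a single multiply.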

In [None]:

# each row in here represents a token
"""tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])"""

'tensor([[ 0.1808, -0.0700],\n        [-0.3596, -0.9152],\n        [ 0.6258,  0.0255],\n        [ 0.9545,  0.0643],\n        [ 0.3612,  1.1679],\n        [-1.3499, -0.5102],\n        [ 0.2360, -0.2398],\n        [-0.9211,  1.5433]])'

# THIS IS HIGHLY INEFFICIENT!

# So we do this!

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"

# seed for reproducibility
torch.manual_seed(42)

# lower-triangular matrix of 1s: row t only covers positions <= t, so the future cannot influence the past
a = torch.tril(torch.ones(3, 3))

# normalize every row to sum to 1, so each visible position gets the same weight
a = a / torch.sum(a, 1, keepdim=True)

# random integer matrix of shape (3,2), cast to float
b = torch.randint(0,10,(3,2)).float()

# matrix multiplication is used with @
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


# End of the aside

In [None]:
# version 2: using matrix multiply for a weighted aggregation

# same construction as above, but of size T by T
wei = torch.tril(torch.ones(T, T))
# same normalization, so row t averages the current token and all previous ones
wei = wei / wei.sum(1, keepdim=True)
print(wei)

xbow2 = wei @ x # (T, T) @ (B, T, C) ----> (B, T, C): PyTorch broadcasts (T, T) across the batch dimension, as if it were (B, T, T)

# check that both versions agree up to floating-point tolerance
torch.allclose(xbow, xbow2)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


False
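
The `False` above does not necessarily mean the two computations disagree. `torch.allclose` checks `|a - b| <= atol + rtol * |b|` with defaults `rtol=1e-05` and `atol=1e-08`, which is strict for float32 values close to zero. A likely explanation (an assumption, since the result can depend on the PyTorch version and hardware) is rounding: the loop and the matrix multiply add the same numbers in a different order. The sketch below rebuilds both versions and compares them with a looser absolute tolerance.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 1: explicit loops over batch and time
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(0)

# version 2: lower-triangular averaging matrix
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x

# size of the largest disagreement between the two results
max_diff = (xbow - xbow2).abs().max().item()
print("largest absolute difference:", max_diff)

# same comparison with an absolute tolerance suited to float32
print(torch.allclose(xbow, xbow2, atol=1e-6))
```

If the largest difference is many orders of magnitude below the values themselves, the two versions can be treated as equivalent.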

In [None]:
# version 3: use Softmax

# again we build the lower triangle of size (T, T) that attention uses as a mask
tril = torch.tril(torch.ones(T, T))
print(tril)

# the raw affinities start as zeros, with shape (T, T)
wei = torch.zeros((T,T))

# positions where tril == 0 (the future) are set to -inf so that softmax gives them zero weight
wei = wei.masked_fill(tril == 0, float('-inf'))
print(wei)

# softmax over each row reproduces the averaging matrix from version 2
wei = F.softmax(wei, dim=-1)

# same operation
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

# The weights in the lower triangle are a uniform average only in this toy example. In real attention they come from data-dependent affinities, so the model decides how important each token is relative to the others.

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])


False
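
The closing comment of the cell above says the lower-triangular weights are a plain average only in the toy case. A plain-Python sketch of one row of a masked softmax makes that concrete (`masked_softmax_row` is a hypothetical helper, not a PyTorch function): with all-zero scores the visible positions share the weight equally, while unequal scores shift weight toward the higher-scoring positions. Future positions get zero weight either way.

```python
import math

def masked_softmax_row(scores, t):
    # positions after t are the future: set them to -inf so exp() maps them to 0
    masked = [s if i <= t else float("-inf") for i, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# all-zero affinities: uniform average over positions 0..2
uniform = masked_softmax_row([0.0, 0.0, 0.0, 0.0], t=2)
# unequal affinities: position 0 gets most of the weight, position 3 is masked
weighted = masked_softmax_row([2.0, 0.5, 1.0, 3.0], t=2)

print([round(w, 4) for w in uniform])
print([round(w, 4) for w in weighted])
```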

In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels (more channels than before, so each token carries a wider vector)

# same random initialization for toy example
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16 # this is the depth of the key, query and value vectors

# linear projections, so that we do not work with the raw token vectors directly; they also reduce the dimensionality from C to head_size
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) * head_size**-0.5 # (B, T, 16) @ (B, 16, T) ---> (B, T, T) # the batch dimension is matched automatically; transposing the last two dimensions of k gives (T, 16) @ (16, T) = (T, T), the affinities
# why we scale by head_size**-0.5: without it the variance of wei grows with head_size, and softmax then concentrates on the largest entries instead of staying diffuse, especially at initialization
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))

# the causal mask is what makes this a decoder-style block: each token attends only to itself and to earlier tokens. Without the mask every token could see every other one, as in an encoder block
wei = wei.masked_fill(tril == 0, float('-inf'))
# softmax turns the masked affinities into weights; future positions get weight 0
wei = F.softmax(wei, dim=-1)

# the values are another linear projection of x; these are what gets aggregated
v = value(x)

# the output is the affinity-weighted sum of the values, same mechanism as before
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])
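
A quick way to convince yourself that the mask does its job is to change only the last token and check that the outputs at all earlier positions stay the same. The sketch below repeats the single head from the cell above inside a hypothetical helper, `causal_head`; the perturbation size of 1.0 is an arbitrary choice.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 32, 16

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
tril = torch.tril(torch.ones(T, T))

def causal_head(x):
    k, q, v = key(x), query(x), value(x)
    wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T, T)
    wei = wei.masked_fill(tril == 0, float("-inf"))    # hide the future
    wei = F.softmax(wei, dim=-1)
    return wei @ v                                      # (B, T, head_size)

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1, :] += 1.0  # perturb only the last token

with torch.no_grad():
    out = causal_head(x)
    out_changed = causal_head(x_changed)

# positions 0..T-2 never see the last token, so their outputs must not move
past_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1])
# the last position does see it, so its output changes
last_changed = not torch.allclose(out[:, -1], out_changed[:, -1])
print(past_unchanged, last_changed)
```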

In [None]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [None]:
k.var()

tensor(1.0449)

In [None]:
q.var()

tensor(1.0700)

In [None]:
wei.var()

tensor(1.0918)

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
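
Why `head_size**-0.5` in particular: if the entries of `q` and `k` are independent with zero mean and unit variance, their dot product is a sum of `head_size` terms of unit variance, so its variance is about `head_size`. Multiplying by `head_size**-0.5` brings it back to about 1, which keeps softmax away from the peaky regime shown above. The sketch below checks this with the standard library only; the sample count and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)
head_size = 16
num_samples = 5000

def random_dot(n):
    # dot product of two independent standard-normal vectors of length n
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(n))

raw = [random_dot(head_size) for _ in range(num_samples)]
scaled = [d * head_size ** -0.5 for d in raw]

var_raw = statistics.pvariance(raw)        # expected to be close to head_size
var_scaled = statistics.pvariance(scaled)  # expected to be close to 1
print(round(var_raw, 2), round(var_scaled, 2))
```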

# Final unified code

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel? (->B, T, C) The B.
block_size = 32 # what is the maximum context length for predictions? The T
max_iters = 5000 # Iterations to train the model
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu' # using GPU if available
eval_iters = 200
n_embd = 64 # the dimensions of the embeddings
n_head = 4 # number of attention heads
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# the tiny Shakespeare dataset as a plain-text file
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))

# as explained, we are using unique chars as the vocab size
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y

    # sample random start indexes and grab windows of block_size tokens from the corpus at those positions
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    # move the batch to the selected device (the GPU when CUDA is available)
    x, y = x.to(device), y.to(device)
    return x, y


# estimate the loss as an average over eval_iters batches, which is less noisy than the loss of a single batch
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


# One head of self-attention: the keys, queries and values are all produced from the same input.
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        # key, query and value are linear projections (fully connected layers without bias) from n_embd to head_size, which reduces the dimensionality
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # tril is the lower triangle used to mask the future out of the affinity matrix. Registering it as a buffer keeps it out of the parameters, so the optimizer never updates it, while it still moves with the module across devices.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    # forward pass: the data flows through this head
    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by the square root of the head size (not of C)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # future positions (where tril == 0) are set to -inf so that softmax gives them zero weight
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # dropout on the attention weights for regularization (the rate is the `dropout` hyperparameter, 0.0 in this cell)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)

        # each output token is the sum of the value vectors, weighted by that token's row of the affinity matrix
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        # several independent heads, all applied to the same input
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # linear projection applied to the concatenated head outputs (this assumes num_heads * head_size == n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        # dropout for regularization, since the model is getting larger
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

        # run every head on x and concatenate the outputs along the channel dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # apply the output projection, then dropout
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()

        # two linear layers with a ReLU in between: n_embd -> 4 * n_embd -> n_embd, followed by dropout
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    # the input flows through the transformation.
    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        # in multi-head attention the head outputs are concatenated, so head_size is n_embd divided by the number of heads
        head_size = n_embd // n_head
        # the self-attention sub-layer
        self.sa = MultiHeadAttention(n_head, head_size)
        # the feed-forward sub-layer defined above
        self.ffwd = FeedFoward(n_embd)
        # layer normalization: normalizes each token's feature vector (per row) instead of each feature across the batch (per column)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # residual (skip) connections: each sub-layer's output is added to its input, which gives gradients a direct path and helps against vanishing gradients; layer norm is applied before each sub-layer
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# the full language model; it keeps the BigramLanguageModel name from the earlier cell, but it is now a small transformer
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token index looks up an n_embd-dimensional embedding vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        # these are the positional embeddings, one vector per position in the context
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        # we stack n_layer transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        # a final layer norm is applied after the last block
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        # linear head that maps each n_embd vector to vocab_size logits
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'
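
The `FileNotFoundError` above means the cell ran in a session where `input.txt` was not present in the working directory. A plausible cause, though this is a guess, is a fresh runtime in which the `!wget` cell at the top was not re-run. One way to make the unified cell independent of that earlier cell is to fetch the file only when it is missing. The sketch below uses the standard library and the same URL as the `wget` command; `ensure_dataset` is a hypothetical helper name, not part of the notebook.

```python
import os
import urllib.request

# Same URL as the wget command at the top of the notebook.
DATA_URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

def ensure_dataset(path="input.txt", url=DATA_URL):
    # download only when the file is not already there
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

Calling `ensure_dataset()` right before the `with open('input.txt', ...)` line would then download the file on first use and do nothing on later runs.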

# GPT 0

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------

torch.manual_seed(1337)

#wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings: each maps an integer index to an n_embd-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

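# generate from the model (eval mode disables dropout; no_grad skips autograd bookkeeping)
model.eval()
context = torch.zeros((1, 1), dtype=torch.long, device=device)
with torch.no_grad():
    print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
    sample = decode(m.generate(context, max_new_tokens=10000)[0].tolist())
with open('more.txt', 'w', encoding='utf-8') as f:
    f.write(sample)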

10.788929 M parameters
step 0: train loss 4.2221, val loss 4.2306
step 500: train loss 1.7600, val loss 1.9146
step 1000: train loss 1.3903, val loss 1.5987
step 1500: train loss 1.2644, val loss 1.5271
step 2000: train loss 1.1835, val loss 1.4978
step 2500: train loss 1.1233, val loss 1.4910
step 3000: train loss 1.0718, val loss 1.4804
step 3500: train loss 1.0179, val loss 1.5127
step 4000: train loss 0.9604, val loss 1.5102
step 4500: train loss 0.9125, val loss 1.5351
step 4999: train loss 0.8589, val loss 1.5565

But with prison, I will steal for the fimker.

KING HENRY VI:
To prevent it, as I love this country's cause.

HENRY BOLINGBROKE:
I thank bhop my follow. Walk ye were so?

NORTHUMBERLAND:
My lord, I hearison! Who may love me accurse
Some chold or flights then men shows to great the cur
Ye cause who fled the trick that did princely action?
Take my captiving sound, althoughts thy crown.

RICHMOND NE:
God neit will he not make it wise this!

DUKE VINCENTIO:
Worthy Prince fo
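
The log above shows the validation loss bottoming out around step 3000 and then rising while the training loss keeps falling, which is the usual sign of overfitting. A small sketch that reads this off the logged numbers (copied from the output above). Picking the checkpoint with the lowest validation loss is one common heuristic, not something the notebook itself does:

```python
# (step, train loss, val loss) copied from the training output above.
log = [
    (0, 4.2221, 4.2306),
    (500, 1.7600, 1.9146),
    (1000, 1.3903, 1.5987),
    (1500, 1.2644, 1.5271),
    (2000, 1.1835, 1.4978),
    (2500, 1.1233, 1.4910),
    (3000, 1.0718, 1.4804),
    (3500, 1.0179, 1.5127),
    (4000, 0.9604, 1.5102),
    (4500, 0.9125, 1.5351),
    (4999, 0.8589, 1.5565),
]

# Evaluation point with the lowest validation loss.
best_step, _, best_val = min(log, key=lambda row: row[2])
print(f"lowest val loss {best_val:.4f} at step {best_step}")

# Generalization gap (val - train) at each evaluation point.
gaps = [(step, round(val - train, 4)) for step, train, val in log]
print("first gap:", gaps[0], "last gap:", gaps[-1])
```

With these numbers, stopping near step 3000 (or lowering `max_iters`) would likely give a model that generalizes slightly better than the final one.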

In [None]:
# After training the model
torch.save(model.state_dict(), 'gpt_language_model.pth')
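
`state_dict()` stores only the weights, so reloading still depends on the hyperparameters and the `chars` vocabulary being recreated identically. One option, sketched below with a tiny stand-in module rather than `GPTLanguageModel`, is to bundle them into the same file. The dictionary keys and the file name `demo_checkpoint.pth` are illustrative assumptions:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in for the trained model so the sketch runs in isolation.
demo_model = nn.Linear(4, 3)
demo_chars = ['\n', ' ', 'a', 'b']  # stands in for the notebook's `chars`

# Bundle weights, constructor arguments, and vocabulary in one dictionary.
checkpoint = {
    'state_dict': demo_model.state_dict(),
    'config': {'in_features': 4, 'out_features': 3},
    'chars': demo_chars,
}

path = os.path.join(tempfile.mkdtemp(), 'demo_checkpoint.pth')
torch.save(checkpoint, path)

# map_location lets a GPU-trained checkpoint load on a CPU-only machine.
loaded = torch.load(path, map_location='cpu')
restored = nn.Linear(**loaded['config'])
restored.load_state_dict(loaded['state_dict'])
```

For the notebook's model, the `config` entry would hold values such as `n_embd`, `n_head`, `n_layer`, and `block_size` instead.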

In [None]:
# Load the model
model = GPTLanguageModel()  # Make sure to initialize the model structure
model.load_state_dict(torch.load('gpt_language_model.pth', map_location=device))  # map_location allows loading a GPU checkpoint on CPU
model.to(device)  # Move the model to the same device as used during training (CPU or GPU)
model.eval()  # Set the model to evaluation mode (not training mode)
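
A quick sanity check after reloading is that the model still has the 10.788929 M parameters printed during training. That number can be reproduced by hand from the hyperparameters. The sketch below assumes `vocab_size = 65`, which is what Tiny Shakespeare yields but is not printed in this section:

```python
# Hyperparameters from the training cell; vocab_size = 65 is an assumption.
vocab_size, block_size = 65, 256
n_embd, n_head, n_layer = 384, 6, 6
head_size = n_embd // n_head

# Token and position embedding tables.
embeddings = vocab_size * n_embd + block_size * n_embd

# One Transformer block.
attention = n_head * 3 * n_embd * head_size            # key, query, value (no bias)
attention += (n_head * head_size) * n_embd + n_embd    # output projection (with bias)
feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                         # ln1 and ln2: weight + bias each
per_block = attention + feed_forward + layer_norms

# Final LayerNorm and the language-model head.
final_ln = 2 * n_embd
lm_head = n_embd * vocab_size + vocab_size

total = embeddings + n_layer * per_block + final_ln + lm_head
print(total / 1e6, 'M parameters')
```

Comparing `total` with `sum(p.numel() for p in model.parameters())` on the reloaded model is a cheap way to catch a mismatched hyperparameter before `load_state_dict` is even called.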

In [None]:
# Your custom prompt
prompt = "Who is stronger, Gojo or Sukuna?"

# Encode the prompt into indices
encoded_prompt = torch.tensor([encode(prompt)], dtype=torch.long, device=device)

# Generate text based on the prompt (no gradients are needed for sampling)
with torch.no_grad():
    generated_indices = model.generate(encoded_prompt, max_new_tokens=500)

# Decode the generated indices back into a string
generated_text = decode(generated_indices[0].tolist())

# Print the generated text
print(generated_text)

# Optionally save the generated text to a file
with open('more.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)

Who is stronger, Gojo or Sukuna? the king will:
Was fortroll'n king by the wisdom to my thoughty mlts?
Or were the matter, then, being our word's legs,
For ince must be be glad on the friar's life,
That kiss either and the loss of chese.

RICHARD:
Your cousin of this regally is dead;
Having, to set it out companion, this clue;
And living to have stay'd therein the rest thou how;
That one hath more worth to shorive them oated?
What hath the wounded me? Whe would go to this houst
Shall do with what grace this fill?
I, how there?
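
The prompt above works because every character in it also occurs in the training text. `encode` indexes `stoi` directly, so a character the model has never seen (an emoji or an accented letter, say) raises `KeyError`. A hedged sketch of a pre-check follows. `encode_known` is a hypothetical helper, and the small `chars` list here only stands in for the notebook's vocabulary:

```python
# Stand-in vocabulary; in the notebook, `chars` and `stoi` come from the training cell.
chars = sorted(set("Who is stronger, Gojo or Sukuna?\n"))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode_known(s, stoi):
    """Encode `s`, dropping characters missing from the vocabulary.

    Returns the list of indices and the sorted list of dropped characters.
    """
    unknown = sorted({c for c in s if c not in stoi})
    ids = [stoi[c] for c in s if c in stoi]
    return ids, unknown

ids, unknown = encode_known("Gojo or Sukuna #1?", stoi)
print(''.join(itos[i] for i in ids), unknown)
```

Dropping unknown characters is only one policy. Raising a clearer error that lists them, or mapping them to a placeholder character, would be equally reasonable.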


In [None]:
with open('more.txt', 'w', encoding='utf-8') as f:  # `model` is the reloaded copy, already in eval mode
    print(f.write(decode(model.generate(context, max_new_tokens=10000)[0].tolist())))

10001
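
The `10001` is the return value of `write`: one starting token from `context` plus 10000 sampled tokens, each decoding to a single character. Only the last `block_size` tokens are fed back to the model at each step, so the sequence can grow well past the context window. A minimal sketch of that bookkeeping, with no model involved (`generate_lengths` is a hypothetical name):

```python
def generate_lengths(prompt_len, max_new_tokens, block_size=256):
    """Track the sequence length and the largest context fed to the model, as generate() would."""
    seq_len = prompt_len
    largest_context = 0
    for _ in range(max_new_tokens):
        # generate() crops to idx[:, -block_size:] before each forward pass
        largest_context = max(largest_context, min(seq_len, block_size))
        seq_len += 1  # one sampled token is appended per step
    return seq_len, largest_context

# The unconditional 10000-token sample and the 500-token prompted sample.
print(generate_lengths(1, 10000))
print(generate_lengths(len("Who is stronger, Gojo or Sukuna?"), 500))
```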