In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-08-29 12:55:37--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.9’


2025-08-29 12:55:38 (2.43 MB/s) - ‘input.txt.9’ saved [1115394/1115394]



In [2]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

In [4]:
vocab_size

65

Tokenize: map each character to an integer according to a fixed vocabulary of possible characters, so that a string becomes a sequence of integers

In [5]:
stoi = { ch:i for i, ch in enumerate(chars) }
itos = { i:ch for i, ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: takes in a string, outputs integer list
decode = lambda l: ''.join([itos[i] for i in l]) # input integer list, output string
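A quick way to convince yourself that the two mappings are inverses is a round trip on a tiny string. This is a minimal sketch on a toy text; the names `toy_text`, `toy_encode` and `toy_decode` are made up for illustration and are not part of the notebook.

```python
# Toy text standing in for the Shakespeare file (an assumption for illustration).
toy_text = "hello world"

# Same construction as above: the sorted unique characters form the vocabulary.
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    # string -> list of integers
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    # list of integers -> string
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(toy_decode(ids))  # hello
```

Encoding a character that is not in the vocabulary raises a `KeyError`, which is fine here because the vocabulary is built from the full text.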

Now let's tokenize the entire tinyshakespeare dataset and store it as a PyTorch tensor

In [6]:
import torch
import torch.nn as nn
from torch.nn import functional as F
data = torch.tensor(encode(text), dtype=torch.long)

# then let's split the data into training and validation data
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
print(n)

1003854


It is computationally expensive to feed all of the data into the transformer at once. A way around this is to train on small chunks of the data: the context length, or block size, is the maximum number of tokens in a chunk that the transformer sees at one time

In [7]:
block_size = 8
train_data[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

A chunk of block_size + 1 characters actually packs block_size training examples, one per prefix. The transformer looks at 18 and learns that 47 comes next. Then it looks at 18, 47 and learns that 56 comes next, then at 18, 47, 56 to predict 57, and so on, up to a context of block_size tokens.

In [8]:
x = train_data[:block_size]
y = train_data[1:block_size + 1] # targets: x shifted by one, so y[t] is the character that follows x[:t+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context} the target is {target}')

when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


Now we will stack several of these blocks into a batch: each row of the batch is an independent block of block_size tokens, and the rows are processed in parallel.

In [9]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8

# based on batch and block sizes, generate small batch of data with inputs x and targets y
def get_batch(split):
    # data is training data if specified, otherwise it's validation
    data = train_data if split == 'train' else val_data
    
    # pick batch_size random starting offsets into the data
    # (four of them here, because the batch size is four);
    # subtract block_size from the upper bound so that a block
    # starting at the last offset still fits inside the data
    ix = torch.randint(len(data) - block_size, size=(batch_size,))
    print(ix)
    
    # build the inputs: one block of block_size tokens per offset,
    # stacked with torch.stack into a (batch_size, block_size) tensor
    x = torch.stack([data[i:i+block_size] for i in ix])
    print(f'x: {x}')
    
    # the targets are the same blocks shifted one position to the right,
    # so y[b, t] is the token that follows x[b, :t+1]
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y

xb, yb = get_batch('train')
print('------')
print('inputs:')
print(xb.shape)
print(xb)

print('targets:')
print(yb.shape)
print(yb)
print('------')

for b in range(batch_size): # iterating along the batch dimension
    for t in range(block_size): # iterating along the block
        context = xb[b, :t+1] # add one because we start at zero
        target = yb[b, t] # y is already shifted so no need to add anything here
        print(f'when input is {context.tolist()} the target is {target}')

tensor([ 76049, 234249, 934904, 560986])
x: tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
------
inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
------
when input is [24] the target is 43
when input is [24, 43] the target is 58
when input is [24, 43, 58] the target is 5
when input is [24, 43, 58, 5] the target is 57
when input is [24, 43, 58, 5, 57] the target is 1
when input is [24, 43, 58, 5, 57, 1] the target is 46
when input is [24, 43, 58, 5, 57, 1, 46] the target is 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the tar

In [10]:
print(xb) # this is our input into a transformer

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


Now we're going to set up a simple bigram language model

In [11]:
torch.manual_seed(1337)

# here the batch holds 4 sequences, block_size is 8, and the channel dimension is the vocab size
# each input token in idx looks up its own row of the token embedding table, and that
# row is read directly as the logits (scores) for the next character of the sequence
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    # (batch, time, channel) = (4, 8, 65)
    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx) # (batch, time, channel)
        
        if targets is None:
            loss = None            
        else:
            # need to reshape logits so that it fits into cross entropy functional form
            # bc pytorch expects you to pass in your logits this way
            B, T, C = logits.shape
            #print(f'B, T, C: {B}, {T}, {C}')
            logits = logits.view(B*T, C)

            # do the same for targets: flatten (B, T) into (B*T)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    
    # now let's generate from the trained model
    # idx is the current context of some characters in a batch -- (B, T)
    # generate extends it to be (B, T+1), then again to (B, T+2) ...
    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            print(f'logits.shape: {logits.shape}')
            
            # focus only on the last time step, since its logits are
            # the predictions for the token that comes next
            logits = logits[:, -1, :] # becomes (B, C)

            # apply softmax over the channel (vocab) dimension to get probabilities;
            # dim=0 would normalize across the batch instead, which for B = 1 sets every entry to 1
            probs = F.softmax(logits, dim=-1) # (B, C)
            
            # sample from the distribution
            # one sample per batch
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append sampled index to a running sequence of indices throughout gen process
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            
        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

# a single 0 token kicks off the generation, so the context starts as a (1, 1) tensor
#idx = torch.zeros((1, 1), dtype=torch.long)

# then m.generate continues the sequence for 100 new tokens
# generate works on whole batches, so we index element 0 of the result
# to pull out the single sequence (the batch size is 1 here)
# this is just an example; generation could start from any other token
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
logits.shape: torch.Size([1, 1, 65])
logits.shape: torch.Size([1, 2, 65])
logits.shape: torch.Size([1, 3, 65])
logits.shape: torch.Size([1, 4, 65])
logits.shape: torch.Size([1, 5, 65])
logits.shape: torch.Size([1, 6, 65])
logits.shape: torch.Size([1, 7, 65])
logits.shape: torch.Size([1, 8, 65])
logits.shape: torch.Size([1, 9, 65])
logits.shape: torch.Size([1, 10, 65])
logits.shape: torch.Size([1, 11, 65])
logits.shape: torch.Size([1, 12, 65])
logits.shape: torch.Size([1, 13, 65])
logits.shape: torch.Size([1, 14, 65])
logits.shape: torch.Size([1, 15, 65])
logits.shape: torch.Size([1, 16, 65])
logits.shape: torch.Size([1, 17, 65])
logits.shape: torch.Size([1, 18, 65])
logits.shape: torch.Size([1, 19, 65])
logits.shape: torch.Size([1, 20, 65])
logits.shape: torch.Size([1, 21, 65])
logits.shape: torch.Size([1, 22, 65])
logits.shape: torch.Size([1, 23, 65])
logits.shape: torch.Size([1, 24, 65])
logits.shape: torch.Size([1, 25, 

At initialization we would ideally expect the model to predict a uniform distribution over the vocabulary, in which case the loss would be $-\mathrm{ln}(1/65) \approx 4.17$. We end up getting $4.88$, which is worse. The randomly initialized embedding table doesn't produce uniform predictions: some characters get more probability than others for no good reason, and those guesses are wrong on average.
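The uniform baseline is easy to check by hand. The sketch below uses only the standard library; `n_chars` is a stand-in name for the vocabulary size of 65 found above, and the "confident but wrong" case is a made-up illustration of how a non-uniform guess can score worse than the uniform one.

```python
import math

n_chars = 65  # number of distinct characters found above

# Cross-entropy of a model that gives probability 1/n_chars to every character.
uniform_loss = -math.log(1.0 / n_chars)
print(f"{uniform_loss:.4f}")  # 4.1744

# Hypothetical confident-but-wrong model: 0.9 on one wrong character and the
# remaining 0.1 spread evenly over the other 64, one of which is the true one.
wrong_loss = -math.log(0.1 / (n_chars - 1))
print(f"{wrong_loss:.4f}")  # 6.4615
```

So a loss above 4.17 at initialization does not mean the model knows something; it means its random preferences are, on average, misplaced.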

### Checkpoint at 35:00: now let's train the model

For this we'll need an optimizer. One standard optimizer is Adam; here we use AdamW, a variant of Adam with decoupled weight decay

In [11]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

Never forget to zero your gradients out!

In [12]:
batch_size = 32
# 100 steps doesn't really optimize too much --- went from 4.7 to 4.58; 4.17 is the loss of a uniform guess, not a lower bound
# 1_000 got me to 3.63, a couple more times yields 
for steps in range(10_000):
    xb, yb = get_batch('train')
    
    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
#print(loss.item())

In [13]:
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=200)[0].tolist()))


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind tt hinig t ouchos tes; st yo 


Now onto a mathematical trick that is commonly used in self attention. Begin at 42:00

The trick concerns how a token gathers information from its context. The model should not have access to future tokens; it should only be able to reference tokens it has already seen in order to predict the next one.

The simplest way for a token to summarize its past is to average the channels of all tokens up to and including itself. That average is a feature vector that summarizes the history. We'll be working directly with the channels of x in order to make this happen.

In [14]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

bow = bag of words, a common term for representing a sequence by an average (or sum) of its tokens that ignores their order. Here it is the average of the history up to each token.

each xprev is the input up to and including token t within a given batch element

remember that C = channels, the number of features per token. In the bigram model above it happened to equal the vocabulary size; we're just using C = 2 as an example for now since we want to understand the mathematical trick.

so we're looking at each batch element. The T dimension is here because this is our time series: one token per time step. Because C is just 2 in this toy example, each token is a vector of two numbers, and we average those vectors over time. For now x is initialized with random floats; the idea is that in a real model these values are learned during training (for example with Adam) so that the model converges onto something useful

In [15]:
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # we're in some batch b, and looking at all of the previous tokens up to token t. keeping dim C
        xbow[b, t] = torch.mean(xprev, 0) # then we'll store the average into xbow as a one dimensional vector

However this is pretty inefficient, and we want to speed things up a bit. 

To illustrate how we can speed things up, we'll play with some easy examples

In [30]:
torch.manual_seed(42)
# the below allows you to more easily see where each element of c comes from
#a = torch.tensor(((1., 1., 1.), (2., 2., 2.), (3., 3., 3.)))
#a = torch.ones(3, 3) # the standard a, without pulling out the triangular matrix
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, dim=1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a=')
print(a)
print('----------')
print('b=')
print(b)
print('----------')
print('c=')
print(c)
print('----------')

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
----------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
----------
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
----------


Now let's implement this with our weights and create a new xbow, xbow2

In [78]:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
# the following matrix multiplication is (T, T) @ (B, T, C)
# the shapes don't line up as they are, so PyTorch broadcasts wei:
# it adds a batch dimension on the left and reuses the same (T, T) matrix for every batch element
# ==> (B, T, T) @ (B, T, C) then becomes (B, T, C) = xbow2
xbow2 = wei @ x 
torch.allclose(xbow, xbow2, rtol=1e-05, atol=1e-07, equal_nan=False)

True

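And now for method #3, which uses softmax to create the weights. Future positions are set to -inf before the softmax, so they end up with zero weight, and the remaining zeros turn into a uniform average over the past. This is the form that self attention will build on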

In [90]:
tril = torch.tril(torch.ones(T, T)) # create the triangular matrix
wei = torch.zeros((T, T)) # instantiate weights
wei = wei.masked_fill(tril == 0, float('-inf')) # future positions get -inf, so softmax gives them zero weight
wei = F.softmax(wei, dim=-1) # softmax on every row
xbow3 = wei @ x
torch.allclose(xbow2, xbow3)

True
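To see why the softmax version reproduces the plain running average, it helps to write the masked softmax out by hand. This is a minimal sketch in plain Python for a small context length; `T_toy` and the local `softmax` helper are names chosen for illustration, not part of the notebook.

```python
import math

T_toy = 4  # small context length, chosen only for illustration

def softmax(row):
    # Subtract the max for numerical stability; math.exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Start from all-zero scores and set future positions (column > row) to -inf.
masked = [
    [0.0 if j <= i else float("-inf") for j in range(T_toy)]
    for i in range(T_toy)
]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.3333, 0.3333, 0.3333, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row is the same as the corresponding row of the normalized lower-triangular matrix from method #2. The advantage of this form is that the zeros can later be replaced by data-dependent scores without changing anything else.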

Question for comp mech-y implementations: Could we replace self attention with causal states? If so, how much better could we predict the next token?

#### Full self attention

And now onto version 4, where we'll be implementing the full version of self attention

In the averaging versions above, every past token gets the same weight. But different tokens find different parts of the past relevant: for example, if the current token is a vowel, it may be much more interested in the consonants in its past. So rather than averaging the past uniformly, we want to gather the past information that is specifically relevant to the present token, in a data-dependent way. This is the problem that self attention solves

How it works is that each token (going to stop here because I feel like I'm going at 80% and getting more distracted. Currently at 1:02:00. Also my bigram.py doesn't work I haven't tried to figure out why yet, and I'm hoping it'll get resolved once we implement self attention into the bigram model)

Alright, now onto the full version of self attention

Each token will have two characterizing vectors: Key and query. 

**Query:** What am I looking for

**Key:** What do I contain

Dot products between queries and keys now become the raw weights, replacing the zeros from before; the -inf mask for future tokens is still applied on top. A large dot product means one token's query matches another token's key, so that past token contributes more to the weighted sum. A small dot product means it contributes less

To me right now, keys and queries are basically the same. They're both linear layers and don't have any differences between them right now besides the random numbers that they'll contain
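One way to see that keys and queries play different roles, even though both are plain linear layers, is that the score token i gives to token j uses i's query and j's key, so it generally differs from the score j gives to i. The sketch below uses hand-picked 2D vectors for two hypothetical tokens `a` and `b`; none of these values come from the notebook.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical query and key vectors for two tokens (values chosen by hand).
queries = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
keys = {"a": [0.5, 1.0], "b": [3.0, 0.0]}

score_a_to_b = dot(queries["a"], keys["b"])  # how strongly a looks at b
score_b_to_a = dot(queries["b"], keys["a"])  # how strongly b looks at a

print(score_a_to_b, score_b_to_a)  # 3.0 2.0
```

If a single layer were used for both roles, the score matrix before masking would be symmetric, since the dot product of two vectors does not depend on their order. Separate key and query layers let "what I look for" differ from "what I contain".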

In [8]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# create a single Head of self attention
# v = value(x) is the "public" version of x: what a token shares with the others, so we aggregate v rather than the raw x
head_size = 16
key = nn.Linear(C, head_size, bias=False) # (dim_input, dim_output, bias)
query = nn.Linear(C, head_size, bias=False) # (dim_input, dim_output, bias)
value = nn.Linear(C, head_size, bias=False) # (dim_input, dim_output, bias)

# with bias=False each layer is a plain matrix multiply: x @ W^T for that layer's weight matrix W
k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)
v = value(x) # (B, T, head_size)

# do batch matrix multiplication---the last two dimensions will multiply like regular matrices
wei = q @ k.transpose(-2, -1) * head_size**-0.5 # (B, T, 16) @ (B, 16, T) ---> (B, T, T), the last two dimensions do regular matmul

# and now we do the same thing as before
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ v

In [9]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3966, 0.6034, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3069, 0.2892, 0.4039, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3233, 0.2175, 0.2443, 0.2149, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1479, 0.2034, 0.1663, 0.1455, 0.3369, 0.0000, 0.0000, 0.0000],
        [0.1259, 0.2490, 0.1324, 0.1062, 0.3141, 0.0724, 0.0000, 0.0000],
        [0.1598, 0.1990, 0.1140, 0.1125, 0.1418, 0.1669, 0.1061, 0.0000],
        [0.0845, 0.1197, 0.1078, 0.1537, 0.1086, 0.1146, 0.1558, 0.1553]],
       grad_fn=<SelectBackward0>)

Now the weights are data-dependent: each batch element gets its own (T, T) weight matrix, computed from its own tokens, and the key and query layers that produce it are trainable. Past tokens whose keys align well with the current token's query get larger weights. Of course when we're starting out, this is just randomized. But as we train, the weights will be tuned, for example depending on whether the current token is a vowel or a consonant. 

Attention is a communication mechanism. Picture the tokens as nodes in a directed graph, where every node receives an edge from itself and from every earlier node. Each node aggregates the value vectors of the nodes pointing to it as a weighted sum, and the weights depend on the data, so the more relevant nodes contribute more

There is also no notion of space in the self attention head: it acts on a set of vectors that are multiplied with each other, and nothing in it knows where a token sits in the sequence. That is why positional information has to be given to the tokens separately

Elements across the batch dimension never talk to each other; batching lets them be processed in parallel and independently

We're building a decoder: the triangular mask makes the model autoregressive, so each token only sees the past. You can delete the masking to get an "encoder" block, which is useful for things like analyzing the sentiment of a sentence. In that case you'd want all nodes to talk to each other and not just the past tokens. Here we keep the triangular mask, because letting a token see the future would give away the next token that it is supposed to predict
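The difference between the two styles is only whether the mask is applied before the softmax. Here is a minimal sketch in plain Python, assuming a hypothetical 3x3 matrix of raw scores; `attention_weights` is an illustrative helper, not a PyTorch function.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row-wise softmax of a square score matrix, optionally with a causal mask."""
    n = len(scores)
    out = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        out.append(softmax(row))
    return out

# Hypothetical raw scores for three tokens.
scores = [
    [0.2, 1.0, -0.5],
    [0.7, 0.1, 0.3],
    [-0.4, 0.9, 0.0],
]

decoder_style = attention_weights(scores, causal=True)
encoder_style = attention_weights(scores, causal=False)

print([round(w, 3) for w in decoder_style[0]])  # first token sees only itself
print([round(w, 3) for w in encoder_style[0]])  # first token sees all three
```

With the mask, the first row collapses to `[1.0, 0.0, 0.0]`; without it, every token distributes weight over the whole sequence, including tokens that come later.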

It is called self attention because the keys, values and queries all come from the same source, x. In encoder-decoder transformers there is also cross attention, where the queries come from x but the keys and values come from a separate source (for example the encoder's output), so the tokens can pull in information from outside their own sequence

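Now if we don't divide by the square root of the head size, then the variance of the weights will be on the order of the head size. If q and k have roughly unit variance (here the inputs are random Gaussian), each dot product sums head_size terms, so its variance is about head_size. Multiplying by head_size**-0.5 brings the variance back to about 1, so the scale of the weights doesn't depend on the head size. 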
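The variance claim can be checked numerically without PyTorch. This is a rough Monte Carlo sketch under the assumption that query and key entries are independent standard normals; `head_size_demo`, `q_vec` and `k_vec` are illustrative names, kept separate from the notebook's variables.

```python
import random

random.seed(0)
head_size_demo = 16   # same value as head_size above, under a separate name
n_samples = 20_000

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

raw, scaled = [], []
for _ in range(n_samples):
    # Independent standard-normal vectors of length head_size_demo.
    q_vec = [random.gauss(0.0, 1.0) for _ in range(head_size_demo)]
    k_vec = [random.gauss(0.0, 1.0) for _ in range(head_size_demo)]
    score = sum(a * b for a, b in zip(q_vec, k_vec))
    raw.append(score)
    scaled.append(score * head_size_demo ** -0.5)

print(f"variance without scaling: {variance(raw):.2f}")   # close to 16
print(f"variance with scaling:    {variance(scaled):.2f}")  # close to 1
```

The exact digits depend on the random seed, but the unscaled variance should land near the head size and the scaled one near 1.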

If the inputs to softmax (the weights) are too large in magnitude, its output converges toward a one-hot vector, so each token would effectively attend to just one other token. We want diffuse weights, especially at initialization, so we don't want sparse weights. I just tried this and indeed the scaling makes even this case more diffuse than it originally was
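The saturation effect is easy to reproduce with a handful of numbers. The sketch below applies a hand-written softmax to a made-up score vector and then to the same vector multiplied by 8; the scores are hypothetical and only meant to show the trend.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, -0.2, 0.3, -0.2, 0.5]  # hypothetical attention scores

diffuse = softmax(scores)
sharp = softmax([8 * s for s in scores])  # same scores, 8x larger in magnitude

print([round(p, 4) for p in diffuse])
print([round(p, 4) for p in sharp])
```

The ordering of the entries is the same in both cases, but after scaling up most of the probability mass sits on the largest score. Keeping the scores at unit scale is what keeps the first distribution from collapsing into the second.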

Current questions: 
1) I'm not seeing how the different keys and queries are communicating with each other within this directed graph picture. I know that the weights are going to be optimized. Well actually the weights are dot products of the queries and keys, so that means we'll be optimizing the key and query layers as well. This means that, based on the next token in the training data, we'll be adjusting which past tokens each token attends to, so that the prediction of what follows improves

Now we're going to move onto the blocks in the transformer. A block intersperses communication (self attention) with computation on each token, and we're going to replicate the block and stack the copies so that the sequence is processed several times in depth. (Is this like an RNN kind of thing?) Currently at 1:27:00

Now we're going to implement LayerNorm, which is another innovation that is added to the transformer


In [11]:
# we normalize each row (the features of one example) instead of each column of the batch
class LayerNorm1d:
    
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        
    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # mean over the features of each row (not over the batch)
        xvar = x.var(1, keepdim=True) # variance over the features of each row
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out
    
    def parameters(self):
        return [self.gamma, self.beta]
    
torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size of 32, with 100 dimensional vectors
x = module(x)

In [18]:
x[:, 0].mean(), x[:, 0].std(), x[0, :].mean(), x[0, :].std()

(tensor(0.1469), tensor(0.8803), tensor(-9.5367e-09), tensor(1.0000))
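The printed numbers say that a column (one feature across the batch) is not normalized, while a row (one example) has mean 0 and standard deviation 1. Below is a minimal sketch of the same row-wise computation in plain Python for one made-up feature vector. It uses the sample variance (dividing by n - 1), which is what `Tensor.var` computes by default; as far as I know, `torch.nn.LayerNorm` uses the biased estimator instead, so treat the choice here as an assumption that mirrors the class above.

```python
import statistics

eps = 1e-5
row = [2.0, -1.0, 0.5, 3.5, -4.0]  # one hypothetical feature vector

mean = statistics.fmean(row)
var = statistics.variance(row)  # sample variance, divides by n - 1
normalized = [(v - mean) / (var + eps) ** 0.5 for v in row]

print(statistics.fmean(normalized))   # approximately 0
print(statistics.stdev(normalized))   # approximately 1
```

The standard deviation comes out slightly below 1 because of `eps` in the denominator, which is only there to avoid dividing by zero for constant rows.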

Currently at 1:35:00