# This notebook builds toward a transformer trained on the Tiny Shakespeare dataset that can generate endless (but randomly sampled) Shakespeare-like text.

## First, let's download the dataset:

In [3]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2026-02-07 23:21:34--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2026-02-07 23:21:34 (13.2 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



### Now let's read it:

In [4]:
with open('input.txt', 'r', encoding = 'utf-8') as f:
    text = f.read() #saves the entire file to one large string

In [5]:
print(f"Length of dataset in characters: {len(text)}")

Length of dataset in characters: 1115394


Next, let's collect the unique characters in this text to build our vocabulary:

In [6]:
chars = sorted(list(set(text)))
vocab_amount = len(chars)
print(''.join(chars))
print(vocab_amount)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


Now let's tokenize the input: we convert the raw text into a sequence of integers, one per character, where each integer is that character's index in the vocabulary:

In [7]:
str_int = { ch : i for i, ch in enumerate(chars) } # for encoding
int_str = { i : ch for i, ch in enumerate(chars) } # for decoding
encode = lambda s: [str_int[ch] for ch in s]
decode = lambda i: ''.join(int_str[d] for d in i)

# let's test it out

print(encode("hello, david is awesome"))
print(decode(encode("hello, david is awesome")))

[46, 43, 50, 50, 53, 6, 1, 42, 39, 60, 47, 42, 1, 47, 57, 1, 39, 61, 43, 57, 53, 51, 43]
hello, david is awesome
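
The round trip above can also be checked on its own. The following is a minimal sketch in plain Python, assuming a short hypothetical sample string in place of the full corpus; the names `sample_text`, `stoi`, `itos`, `encode_text` and `decode_ids` are illustrative and not part of the notebook. It also shows what happens when a character is missing from the vocabulary.

```python
# Hypothetical sample text standing in for the Tiny Shakespeare corpus.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

sample_chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(sample_chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}             # integer id -> character

def encode_text(s):
    """Map a string to a list of integer ids, one per character."""
    return [stoi[ch] for ch in s]

def decode_ids(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

round_trip = decode_ids(encode_text("hear me"))
print(round_trip)

# A character outside the vocabulary has no id, so encoding it raises KeyError.
try:
    encode_text("hear me @")
    unknown_char_rejected = False
except KeyError:
    unknown_char_rejected = True
print(unknown_char_rejected)
```

A character-level tokenizer like this can only encode characters it saw while building the vocabulary, which is fine here because the model is trained and sampled on the same 65 characters.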


To encode the entire text dataset we need to import PyTorch.

In [8]:
import torch

In [9]:
data = torch.tensor(encode(text), dtype = torch.long) # Take all of the text from tiny shakespeare and encode it, then wrap to a tensor.

print(data.shape, data.dtype)
print(data[:1000]) # Only the first 1,000 characters tokenized

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

Let's now separate our data into train and validation sets, using a 90-10 split.
This means we train on the first 90% of the data and withhold the last 10% for validation, so
we can see how much the model overfits. We don't want this language model simply memorizing the dataset.

In [10]:
n = int(.9 * len(data))
print(n)
train = data[:n]
val = data[n:]

1003854
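
The split sizes can be verified with plain arithmetic. This is a small sketch that uses only the dataset length printed earlier; the variable names are illustrative.

```python
total_chars = 1115394              # dataset length reported above
n_train = int(0.9 * total_chars)   # int() truncates 1003854.6 down to 1003854
n_val = total_chars - n_train      # everything after the cut goes to validation

print(n_train, n_val)  # 1003854 111540

# The same slicing on a toy list: the two parts never overlap and together cover everything.
toy = list(range(20))
cut = int(0.9 * len(toy))
toy_train, toy_val = toy[:cut], toy[cut:]
print(toy_train[-1], toy_val)
```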


### Now it's time to actually implement a transformer to train and learn these patterns

#### It's important to note that training a transformer doesn't mean feeding it the entire dataset at once, because for a large dataset that is far too computationally demanding. Instead, we sample random chunks from the training set, each at most k tokens long. That maximum length is typically called "block_size" or "context_length". In our example block_size will be 8. Modern models use far larger values: context lengths have grown from roughly 512-2048 tokens in the GPT-1 to GPT-3 era to 8K-128K+ in models like Opus 4.6, helped in part by more efficient attention implementations.

In [11]:
block_size = 8
train[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

We take 9 tokens in this example because every position in the chunk supplies its own training example: the context is everything up to that position and the target is the token that follows. 9 tokens therefore yield 8 (context, target) pairs.

e.g: \
Given 18, the model should predict that 47 comes next. \
Given 18, 47, the model should predict that 56 comes next, etc.

In [12]:
x = train[:block_size]
y = train[1:block_size + 1]

for i in range(block_size):
    context = x[:i + 1]
    target = y[i]
    print(f"When the input is {context} the target is {target}.")

When the input is tensor([18]) the target is 47.
When the input is tensor([18, 47]) the target is 56.
When the input is tensor([18, 47, 56]) the target is 57.
When the input is tensor([18, 47, 56, 57]) the target is 58.
When the input is tensor([18, 47, 56, 57, 58]) the target is 1.
When the input is tensor([18, 47, 56, 57, 58,  1]) the target is 15.
When the input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47.
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58.


Note: the for loop is only for visualization; the blocks x and y are what actually get fed into PyTorch. \
Imagine an input x, a block containing [0, 1, 2, 3, 4, 5, 6, 7]  \
And a target y, another block containing [1, 2, 3, 4, 5, 6, 7, 8] \
The model predicts what the next token would be at each position (without looking into the future), then compares the prediction against y to compute loss.

#### The cool part about GPUs is that many cores can work on completely separate things without ever having to communicate with each other, so now let's generalize the above example to a batch of several chunks processed in parallel.

In [13]:
torch.manual_seed(1337) # fix the random seed so the results are reproducible
batch_size = 4 # number of independent sequences processed in parallel in each forward/backward pass
block_size = 8 # maximum context length for predictions

def get_batch(split):
    data = train if split == 'train' else val # local name that shadows the global `data` inside this function
    ix = torch.randint(len(data) - block_size, (batch_size,)) # random start offsets; repeated or overlapping windows are possible, which is fine because we want purely random samples
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y
    

Let's try it out: you should see that the targets are just the inputs offset by 1.

In [14]:
xb, yb = get_batch('train')
print(f'inputs:\n{xb.shape}\n{xb}\ntargets:\n{yb.shape}\n{yb}')

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


#### This gives us 4 completely independent sequences (x) to feed into the model, whose predictions are then compared against our targets (y). Since every position is its own example, that is 4 x 8 = 32 training examples per batch.
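
The batching logic can be sketched without PyTorch to make the "offset by 1" relationship explicit. This is an illustrative plain-Python version that uses the standard `random` module in place of `torch.randint`; the function name `get_batch_lists` and the toy token list are assumptions, not part of the notebook.

```python
import random

def get_batch_lists(tokens, block_size, batch_size, rng):
    """Sample batch_size random windows; targets are the inputs shifted by one."""
    xs, ys = [], []
    for _ in range(batch_size):
        # The highest valid start leaves room for block_size inputs plus one target.
        start = rng.randrange(len(tokens) - block_size)
        xs.append(tokens[start:start + block_size])
        ys.append(tokens[start + 1:start + block_size + 1])
    return xs, ys

rng = random.Random(1337)        # fixed seed, an arbitrary choice
toy_tokens = list(range(100))    # hypothetical stand-in for the encoded text
xs, ys = get_batch_lists(toy_tokens, block_size=8, batch_size=4, rng=rng)

for x_row, y_row in zip(xs, ys):
    print(x_row, "->", y_row)

# Every position in every row is one (context, target) training example.
n_examples = sum(len(row) for row in xs)
print(n_examples)
```

Because the toy tokens are just 0 to 99, each target row is visibly the input row plus one, which is harder to see in the real character ids.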

### Now it's time to feed this to a neural network. For simplicity we will start with a bigram language model.

In [26]:
import torch
import torch.nn as nn # neural network building blocks that hold state (layers with learnable weights)
from torch.nn import functional as F # stateless functions (softmax, relu, cross_entropy, etc.) that keep no weights of their own

torch.manual_seed(1337) #same seed as before

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__() # run nn.Module's own setup so that layers assigned below are registered as parameters
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # a 65 x 65 table (in this case) initialized with random normal values; training adjusts it

    def forward(self, idx, targets = None): # None by default so generate() can use it.

        logits = self.token_embedding_table(idx) # (B, T) -> (B, T, C): each token id is replaced by its row of the table, e.g. token 43 becomes row 43, a vector of 65 scores for the next token

        if targets is None:
            loss = None # no targets to compare against, so there is no loss
        else:
            # F.cross_entropy expects the class dimension second: (N, C), or (B, C, T) for sequences, not (B, T, C).
            B, T, C = logits.shape
            logits  = logits.view(B*T, C) # flatten batch and time: the (4, 8, 65) tensor becomes 32 rows of 65 logits
            targets = targets.view(B*T) # same idea: (4, 8) becomes 32 target ids
        
            loss = F.cross_entropy(logits, targets) 
        """
        Cross entropy compares the predictions in logits with the expected answers. It first applies softmax to each row of logits,
        turning the scores into probabilities that sum to 1. The loss at each position is then -ln() of the probability assigned to
        the correct token, so a confident correct prediction costs almost nothing and a confident wrong one costs a lot. Finally the
        per-position losses are averaged, and that mean is the value returned as loss.
        """
        
        return logits, loss

    def generate(self, idx, max_new_tokens):

        for _ in range(max_new_tokens):

            logits, loss = self(idx) # grab predictions
            logits = logits[:, -1, :] # becomes (B, C). this is a bigram model so we only care about the most recent output to generate a new token.
            probs = F.softmax(logits, dim=-1) # (B, C)
            idx_next  = torch.multinomial(probs, num_samples = 1) # (B, 1). Selects a random value based on percentages. even something with a .01% of being chosen COULD be chosen
            idx = torch.cat((idx, idx_next), dim = 1) #(B, T + 1) appends the value to idx.
            
        return idx

    """ 
    This generate function looks wasteful: we feed it the ENTIRE context but only read the last position to make a prediction.
    That's because a bigram model only cares about the last token. Later models will actually use the whole context
    (or at least as much of it as fits in block_size), so we keep generate general now and future code can reuse it
    unchanged.
    """

        
m = BigramLanguageModel(vocab_amount) # creates a bigram language model backed by a vocab_amount x vocab_amount table

logits, loss = m(xb, yb) # calling m(...) goes through nn.Module.__call__, which runs forward
print(logits.shape)
print(loss)


torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
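
The cross-entropy steps described in the docstring of `forward` (softmax, then -ln of the probability of the correct token, then the mean) can be reproduced by hand. This is a minimal plain-Python sketch with toy numbers, not the PyTorch implementation; the helper names `softmax` and `cross_entropy` are illustrative, and each row plays the role of one of the B*T flattened positions.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logit_rows, targets):
    """Mean of -ln(probability assigned to the correct class) over all rows."""
    losses = []
    for logits, target in zip(logit_rows, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Two positions, three classes (toy numbers, not taken from the model).
logit_rows = [[2.0, 0.0, 0.0],   # favours class 0, which is the correct answer
              [0.0, 0.0, 0.0]]   # uniform: no preference at all
targets = [0, 1]

loss = cross_entropy(logit_rows, targets)
print(round(loss, 4))
```

The first row contributes a small loss because the model favours the right class, while the uniform second row contributes exactly ln(3).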


In our first run, we got a loss of ~4.8786. That is high, but expected for a freshly initialized embedding table. If the model were perfectly unsure and gave all 65 characters equal probability, the loss would be -ln(1/65) = ln(65) ≈ 4.17. Our loss is higher because the randomly initialized logits do NOT give equal probabilities: by chance some characters get more probability than others, and whenever the correct character is one of the unlucky ones, the penalty outweighs what the lucky guesses save.
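
Both numbers in that argument can be checked numerically. The sketch below is plain Python and assumes the logits at initialization behave like independent standard normal draws, which matches how the table is described above; the function name `random_logit_loss`, the seed and the trial count are arbitrary choices for illustration.

```python
import math
import random

vocab = 65
uniform_loss = math.log(vocab)   # -ln(1/65), the loss of a perfectly unsure model
print(round(uniform_loss, 3))    # 4.174

def random_logit_loss(rng, vocab):
    """Cross-entropy of one position with N(0, 1) logits and a random target."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]
    target = rng.randrange(vocab)
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target]   # equals -ln(softmax(logits)[target])

rng = random.Random(0)
trials = 5000
avg_random_loss = sum(random_logit_loss(rng, vocab) for _ in range(trials)) / trials
print(avg_random_loss > uniform_loss)
```

Averaged over many positions, random logits land above ln(65), so a starting loss somewhat higher than 4.17 is what an untrained table should produce.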

In [25]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype = torch.long), max_new_tokens = 100)[0].tolist()))
# Generates 100 new tokens starting from a single token 0, which is the newline character in our vocabulary.


SBLBcHAUhk&PHdhcOb
nhJ?FJU?pRiOLQeUN!BxjPLiq-GJdUV'hsnla!murI!IM?SPNPq?VgC'R
pD3cLv-bxn-tL!upg
SZ!Uv
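
The sampling step inside `generate` (softmax followed by `torch.multinomial`) can be imitated in plain Python to see why even unlikely tokens occasionally appear. This is a sketch using `random.choices` as a stand-in for the PyTorch call; the function name `sample_next` and the toy logits are hypothetical.

```python
import math
import random

def sample_next(logits, rng):
    """Sample one class index in proportion to softmax(logits)."""
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]   # unnormalised probabilities
    # random.choices treats weights as relative, so no explicit normalisation is needed.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(1337)             # fixed seed, an arbitrary choice
toy_logits = [3.0, 1.0, 0.0, -2.0]    # hypothetical logits for a 4-token vocabulary

counts = [0] * len(toy_logits)
for _ in range(10_000):
    counts[sample_next(toy_logits, rng)] += 1
print(counts)
```

The highest logit wins most draws, but the lowest one is still picked now and then, which is what keeps generated text varied instead of repeating the single most likely character.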



#### This output is pretty garbage, but that's to be expected since the model's weights are still completely random. It's now time to train.

In [30]:
optimizer = torch.optim.AdamW(m.parameters(), lr = 1e-3)

This is a PyTorch optimizer.
An optimizer is what actually updates the weights created in `__init__`.
AdamW is a smarter variant of plain gradient descent that asks 2 main questions:
 - How big were recent gradients? (If a weight has only been receiving tiny gradients, its step is scaled up; if they have been large, it is scaled down.)
 - What direction have gradients been going? (If they keep pointing in one direction, carry the momentum.)

lr, or learning rate, of 1e-3 is the scaling factor applied to each update.

e.g.

new_weight = old_weight - lr x (gradient-based update)
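
To make the "gradient-based update" concrete, here is a sketch of the textbook AdamW rule for a single scalar weight. It is not PyTorch's code: the function name `adamw_step` is hypothetical, the default hyperparameters are assumed to mirror commonly used AdamW defaults, and the toy objective and the larger learning rate of 0.1 are chosen only so the effect is visible in a few hundred steps.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight (textbook form)."""
    w = w * (1.0 - lr * weight_decay)          # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad       # running mean of gradients (direction)
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # running mean of squared gradients (size)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective (an assumption for illustration): f(w) = (w - 3)^2, gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adamw_step(w, grad, m, v, t, lr=0.1)
    history.append(w)

print(history[0], history[-1])
```

Dividing by the square root of `v_hat` is the "how big were recent gradients" question, and using `m_hat` instead of the raw gradient is the "what direction" question. On the very first step the two cancel, so the weight moves by almost exactly lr regardless of how large the gradient is.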

In [37]:
batch_size = 32

for steps in range(10000):

    xb, yb = get_batch('train') # sample a fresh batch

    logits, loss = m(xb, yb) # forward pass: logits and loss for this batch
    optimizer.zero_grad(set_to_none = True) # clear the gradients left over from the previous step
    loss.backward() # backward pass: fills self.token_embedding_table.grad, which has the same shape as the table and says how each weight should change
    optimizer.step() # update the weights using .grad

print(loss.item())

2.5693893432617188


#### After running the above a couple of times and seeing the loss go down, let's check out the new guess:

In [39]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype = torch.long), max_new_tokens = 400)[0].tolist()))


T athe ce! oungleasitasend devibathe pe tulaitowitt,
ICIn witere at?
obl,
Thalul y ou s s t pordel be
Whay st ILI s Ithelonew,
LO Ed athothake, t u were Youeje the-thery blouther bowat towithaianattt kntr I se y ce sech is arsenased tthentsman lemout:
I
We;


NGle, LAndr!
Fimau f,
QUTirourthithak, malit I myon Ispoupe mom;
H:
hades cks thaker de benlleall in
OUETCHUETHare t VABUSAmie irrowaks

Had



#### Still pretty crap, but it's a huge improvement over before. It knows to add spaces fairly regularly, rarely uses symbols or digits, and even produced a real word at the end.

The issue stopping this from creating coherent Shakespeare is that a bigram model predicts each token from the single previous token alone. Tokens never coordinate __with__ each other across the context, which is what makes words and sentences function.
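
That limitation is easy to see in a count-based bigram table, which is roughly what the trained embedding table approximates. The sketch below is plain Python on a hypothetical toy string; the function name `bigram_probs` is illustrative and not part of the notebook.

```python
from collections import Counter, defaultdict

def bigram_probs(text):
    """Estimate P(next char | current char) by counting adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        pair_counts[current][following] += 1
    probs = {}
    for current, counter in pair_counts.items():
        total = sum(counter.values())
        probs[current] = {ch: count / total for ch, count in counter.items()}
    return probs

toy_text = "the then them"   # hypothetical sample, not from the dataset
table = bigram_probs(toy_text)

print(table["t"])  # 't' is always followed by 'h' in this sample
print(table["e"])  # after 'e' the options split evenly between ' ', 'n' and 'm'
```

After seeing "e", the table cannot tell whether the word so far is "the" or something else, because the earlier characters are invisible to it. A model that looks at the whole context could use them to choose between ending the word and continuing it.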