# This notebook trains a transformer on the Tiny Shakespeare dataset to generate endless (but completely random) Shakespeare-like text.

## First, let's download the dataset:

In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2026-02-06 20:30:56--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2026-02-06 20:30:56 (12.2 MB/s) - ‘input.txt’ saved [1115394/1115394]



### Now let's read it:

In [3]:
with open('input.txt', 'r', encoding = 'utf-8') as f:
    text = f.read() # read the entire file into one large string

In [4]:
print(f"Length of dataset in characters: {len(text)}")

Length of dataset in characters: 1115394


Then let's collect the unique characters in the text to build our vocabulary:

In [6]:
chars = sorted(list(set(text)))
vocab_amount = len(chars)
print(''.join(chars))
print(vocab_amount)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


Now let's tokenize the input, converting the raw text into a sequence of integers drawn from the vocabulary:

In [7]:
str_int = { ch : i for i, ch in enumerate(chars) } # for encoding
int_str = { i : ch for i, ch in enumerate(chars) } # for decoding
encode = lambda s: [str_int[ch] for ch in s]
decode = lambda i: ''.join(int_str[d] for d in i)

# let's test it out

print(encode("hello, david is awesome"))
print(decode(encode("hello, david is awesome")))

[46, 43, 50, 50, 53, 6, 1, 42, 39, 60, 47, 42, 1, 47, 57, 1, 39, 61, 43, 57, 53, 51, 43]
hello, david is awesome
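
As a quick sanity check on this kind of character-level tokenizer, the sketch below confirms that decoding an encoded string returns the original, and shows what happens with a character that is not in the vocabulary. The toy string and the `toy_`-prefixed names are assumptions chosen so the block runs on its own without overwriting the notebook's variables.

```python
# Toy corpus standing in for input.txt (an assumption, only to keep the sketch self-contained).
toy_text = "hello, david is awesome"

toy_chars = sorted(set(toy_text))
toy_str_int = {ch: i for i, ch in enumerate(toy_chars)}  # for encoding
toy_int_str = {i: ch for i, ch in enumerate(toy_chars)}  # for decoding

def toy_encode(s):
    return [toy_str_int[ch] for ch in s]

def toy_decode(ids):
    return ''.join(toy_int_str[i] for i in ids)

# Encoding followed by decoding should give back the original string.
round_trip_ok = toy_decode(toy_encode("david")) == "david"

# A character that never appeared in the corpus has no entry in the lookup table.
try:
    toy_encode("Z")
    unknown_raises = False
except KeyError:
    unknown_raises = True

print(round_trip_ok, unknown_raises)
```

Because the vocabulary is built from the training text itself, any prompt passed to the model later has to use only characters that appear in that text.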


To encode the entire text dataset, we will store it in a PyTorch tensor, so we need to import PyTorch.

In [8]:
import torch

In [22]:
data = torch.tensor(encode(text), dtype = torch.long) # Encode all of the Tiny Shakespeare text, then wrap the result in a tensor.

print(data.shape, data.dtype)
print(data[:1000]) # Only the first 1,000 characters tokenized

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

Let's now separate our data into training and validation sets, using a 90-10 split.
This means that we train on the first 90% of the data and withhold the last 10% for validation, so
we can see how much the model overfits; we don't want it memorizing the entire dataset.

In [15]:
n = int(.9 * len(data))
print(n)
train = data[:n]
val = data[n:]

1003854


### Now it's time to actually implement a transformer that can learn these patterns

#### It's important to note that we don't train a transformer by feeding it the entire dataset at once, because with large data that would be very computationally demanding. Instead, we sample random chunks from the training set, each at most k tokens long, where k is typically referred to as "block_size" or "context_length". In our example block_size will be 8. Modern context lengths are far larger: they grew from 512 to 2,048 tokens across GPT-1 through GPT-3 to 8K-128K+ in models like Opus 4.6, thanks in part to improvements in attention.

In [16]:
block_size = 8
train[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

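We take 9 tokens in this example because a chunk of block_size + 1 tokens contains block_size training examples: every token except the last is part of an input, and the token that follows is its target. Therefore 9 tokens yield 8 (context, target) pairs.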

e.g.: \
Given 18, the model should predict that 47 likely comes next. \
Given 18, 47, the model should predict that 56 likely comes next, etc.

In [21]:
x = train[:block_size]
y = train[1:block_size + 1]

for i in range(block_size):
    context = x[:i + 1]
    target = y[i]
    print(f"When the input is {context} the target is {target}.")

When the input is tensor([18]) the target is 47.
When the input is tensor([18, 47]) the target is 56.
When the input is tensor([18, 47, 56]) the target is 57.
When the input is tensor([18, 47, 56, 57]) the target is 58.
When the input is tensor([18, 47, 56, 57, 58]) the target is 1.
When the input is tensor([18, 47, 56, 57, 58,  1]) the target is 15.
When the input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47.
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58.


Note: the for loop is only for visualization; the blocks x and y are what actually get fed into PyTorch. \
Imagine an input x, a block containing [0, 1, 2, 3, 4, 5, 6, 7] \
And a target y, another block containing [1, 2, 3, 4, 5, 6, 7, 8] \
The model predicts the next token at each position (without looking into the future), then compares its predictions against y to compute the loss.
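
To make the "compare against y to compute the loss" step concrete, here is a minimal sketch using the imagined blocks above. There is no model yet, so the all-zero scores are a stand-in (an assumption) for a model that considers every character equally likely; the `toy_`-prefixed names are hypothetical.

```python
import math
import torch
from torch.nn import functional as F

toy_vocab_size = 65            # same size as the vocabulary printed earlier
toy_x = torch.arange(0, 8)     # inputs  [0, 1, ..., 7]
toy_y = torch.arange(1, 9)     # targets [1, 2, ..., 8]

# Stand-in for a model's output (an assumption): one row of scores per position,
# all zeros, so every character is equally likely at every position.
uniform_logits = torch.zeros(len(toy_x), toy_vocab_size)

# cross_entropy compares the 8 predictions with the 8 targets
# and averages the per-position losses into a single number.
baseline_loss = F.cross_entropy(uniform_logits, toy_y)

print(baseline_loss.item(), math.log(toy_vocab_size))
```

Both printed values should be about 4.17, since a uniform guess over 65 characters has a loss of ln(65). That gives a useful baseline: a trained model should score well below it.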

#### The cool part about GPUs is that many cores can work on completely separate things without ever having to communicate with each other, so now let's generalize the above example to batches of chunks processed in parallel.

In [23]:
torch.manual_seed(1337) # fix the random seed so the results are reproducible
batch_size = 4 # number of independent sequences processed in parallel in each forward and backward pass
block_size = 8 # maximum context length for predictions

def get_batch(split):
    data = train if split == 'train' else val # local variable that shadows the global `data` tensor
    ix = torch.randint(len(data) - block_size, (batch_size,)) # random start offsets, sampled with replacement, so duplicate or overlapping chunks are possible (and fine)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return x, y
    

Let's try it out: you should see that the targets are just the inputs offset by 1.

In [24]:
xb, yb = get_batch('train')
print(f'inputs:\n{xb.shape}\n{xb}\ntargets:\n{yb.shape}\n{yb}')

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
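
The offset-by-1 relationship can also be checked in code rather than by eye. The sketch below repeats the sampling logic on a stand-in dataset where token i is simply the integer i, which makes the shift obvious; `toy_data`, `toy_get_batch` and the other `toy_` names are hypothetical and exist only so the block runs on its own.

```python
import torch

toy_data = torch.arange(100)   # stand-in dataset (an assumption): token i is the integer i
toy_block_size = 8
toy_batch_size = 4

def toy_get_batch(source):
    # Same sampling logic as get_batch, with the data passed in explicitly.
    ix = torch.randint(len(source) - toy_block_size, (toy_batch_size,))
    x = torch.stack([source[i:i + toy_block_size] for i in ix])
    y = torch.stack([source[i + 1:i + toy_block_size + 1] for i in ix])
    return x, y

tx, ty = toy_get_batch(toy_data)

# Every target row is the input row shifted one step to the left.
shifted_ok = torch.equal(tx[:, 1:], ty[:, :-1])
# On this particular stand-in dataset that also means target = input + 1.
plus_one_ok = torch.equal(ty, tx + 1)

print(shifted_ok, plus_one_ok)
```

Subtracting block_size from the upper bound of `torch.randint` is what keeps the slice `i + 1 : i + block_size + 1` inside the data for every sampled offset.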


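#### This gives us 4 independent sequences (x) that will be fed into the transformer, whose predictions will then be compared to our targets (y). Since each sequence of 8 tokens contains 8 (context, target) pairs, the batch holds 32 training examples in total.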
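
To see what one batch contains, the sketch below unrolls it the same way the earlier single-chunk loop did. Each of the 4 rows holds 8 (context, target) pairs, so the batch carries 32 pairs in total. The tensors are copied by hand from the output printed above; `xb_demo`, `yb_demo` and `pairs` are hypothetical names.

```python
import torch

# The batch printed above, copied by hand so the sketch runs on its own.
xb_demo = torch.tensor([[24, 43, 58,  5, 57,  1, 46, 43],
                        [44, 53, 56,  1, 58, 46, 39, 58],
                        [52, 58,  1, 58, 46, 39, 58,  1],
                        [25, 17, 27, 10,  0, 21,  1, 54]])
yb_demo = torch.tensor([[43, 58,  5, 57,  1, 46, 43, 39],
                        [53, 56,  1, 58, 46, 39, 58,  1],
                        [58,  1, 58, 46, 39, 58,  1, 46],
                        [17, 27, 10,  0, 21,  1, 54, 39]])

pairs = []
for b in range(xb_demo.shape[0]):        # batch dimension
    for t in range(xb_demo.shape[1]):    # time dimension
        context = xb_demo[b, :t + 1].tolist()  # everything up to and including position t
        target = yb_demo[b, t].item()          # the token that follows that context
        pairs.append((context, target))

print(len(pairs))   # 32
print(pairs[0])     # ([24], 43)
print(pairs[8])     # ([44], 53)
```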

### Now it's time to feed this to a neural network. For simplicity we will start with a bigram language model.
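
For orientation: a bigram model predicts the next character from the current character alone and ignores everything earlier in the context. The sketch below illustrates the idea by counting character pairs in a toy string. It is a count-based illustration under stated assumptions rather than a neural network, and the toy string and all `toy_` names are hypothetical.

```python
import torch

toy_sample = "hello hello"                       # hypothetical toy text
toy_vocab = sorted(set(toy_sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_vocab)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}
toy_ids = [toy_stoi[ch] for ch in toy_sample]

# counts[i, j] = how often character j directly follows character i
V = len(toy_vocab)
toy_counts = torch.zeros(V, V)
for cur, nxt in zip(toy_ids[:-1], toy_ids[1:]):
    toy_counts[cur, nxt] += 1

# Normalize each row into a probability distribution over the next character.
# clamp(min=1) guards against dividing by zero for characters with no successor.
toy_probs = toy_counts / toy_counts.sum(dim=1, keepdim=True).clamp(min=1)

most_likely_after_h = toy_itos[toy_probs[toy_stoi['h']].argmax().item()]
print(most_likely_after_h)                       # 'e'
print(toy_probs[toy_stoi['l']])                  # 'l' is followed by 'l' or 'o', each half the time
```

A neural bigram model learns a table of the same shape (one row of next-character scores per current character) by gradient descent instead of by counting.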

In [None]:
import torch
import torch.nn as nn # neural network layers (modules that hold their own weights)
from torch.nn import functional as F # stateless function versions of those layers; you pass in any weights yourself