In [3]:
import torch

with open("data/input.txt", "r") as f:
    text = f.read()

print(f"Text length: {len(text)}")

first_60 = text[:60]
print(f"{first_60}")


Text length: 1115393
First Citizen:
Before we proceed any further, hear me speak.


## Loading and Previewing the Text

In this code cell, we read in the Shakespeare text file, `data/input.txt` (about 1.1 million characters). We then:
- Print out the total length of the text.
- Print the first 60 characters.

This is a quick sanity check that the data loaded correctly. It also gives a first look at how the text is structured: a speaker name on its own line, followed by that speaker's lines.

In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

print('|'.join(chars))
print(f"Unique characters: {vocab_size}")


| |!|$|&|'|,|-|.|3|:|;|?|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
Unique characters: 65


## Identifying the Character Set

Here, we:
- Extract all unique characters by converting the text into a Python `set`.
- Sort them so that we have a consistent ordering of characters.
- Print them to see exactly which characters appear in the text.
- Print the total count, which in this case is 65.

For this project, each **character** maps to a single integer. Modern LLMs use the same idea, but they map larger sub-word tokens, rather than single characters, to integers.
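As a small illustration of the steps above, the sketch below builds a vocabulary from a toy string rather than the Shakespeare file. The names `toy_text`, `toy_chars`, and `toy_vocab_size` are made up for this example. It also shows why the printed character list starts with a blank line and a space: the newline and space characters sort before punctuation and letters.

```python
# Toy illustration only: a short string stands in for the full text.
toy_text = "To be,\nor not"

# set() removes duplicates; sorted() gives a stable, reproducible order.
toy_chars = sorted(set(toy_text))
toy_vocab_size = len(toy_chars)

print(toy_chars)       # ['\n', ' ', ',', 'T', 'b', 'e', 'n', 'o', 'r', 't']
print(toy_vocab_size)  # 10
```

Sorting matters because a Python `set` has no guaranteed order, and the integer assigned to each character depends on its position in this list.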

In [3]:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode(first_60))

[18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8]


## Encoding and Decoding

We define two main dictionaries:
- **`stoi` (string-to-integer)**: Maps each character to an integer.
- **`itos` (integer-to-string)**: Maps back from the integer to the character.

We then define two lambda functions:
- `encode(s)`: Converts a string into a list of integer indices.
- `decode(l)`: Converts a list of indices back into a string.

Finally, we print the integer-encoded version of the first 60 characters to verify our mapping is working.
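A round-trip check is a slightly stronger test than printing the encoded integers. The sketch below rebuilds the same kind of mapping on a toy string, using illustrative `toy_`-prefixed names so it does not overwrite the notebook's `stoi`, `itos`, `encode`, or `decode`.

```python
# Toy vocabulary, built the same way as in the cell above.
toy_chars = sorted(set("hello world"))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character of s to its integer ID."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(ids)  # [3, 2, 4, 4, 5]

# Decoding the encoded string should give back the original.
assert toy_decode(ids) == "hello"

# A character that never appeared in the source text has no ID.
try:
    toy_encode("hey")
except KeyError as err:
    print(f"unknown character: {err}")
```

The `KeyError` case is worth keeping in mind: a character-level vocabulary built this way can only encode characters that occur in the training text.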

In [4]:
data = torch.tensor(encode(text))

# split into train and validation
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

block_size = 8

# a chunk of block_size + 1 = 9 tokens yields 8 prediction examples
train_data[:block_size+1]

x = train_data[:block_size]
y = train_data[1:block_size+1]

print(x)

# training on every prefix gets the transformer used to contexts of length 1 up to block_size
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is {target}")


tensor([18, 47, 56, 57, 58,  1, 15, 47])
when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


## Dataset Splitting and Context-Target Pairs

1. We **split** the data into a training set (90%) and a validation set (10%).
2. We set a `block_size` of 8, meaning our context window includes 8 tokens (characters).
3. We illustrate how each prefix of the block serves as the context for predicting the *next* token, so one block of 8 tokens yields 8 training examples.

In practice, the training loop will sample chunks of the text in random order, so the model sees a variety of patterns.
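The random sampling mentioned above could look roughly like the sketch below. The `BatchLoader` class from `src` is not shown in this notebook, so `get_batch` here is a hypothetical stand-in and may differ from the real implementation.

```python
import torch

def get_batch(data, block_size, batch_size):
    """Hypothetical helper: sample random (input, target) windows from a 1-D tensor."""
    # The last valid start must leave room for block_size inputs plus one shifted target.
    starts = torch.randint(len(data) - block_size, (batch_size,))
    xb = torch.stack([data[s : s + block_size] for s in starts])
    # Targets are the same windows shifted one position to the right.
    yb = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return xb, yb

# Toy data: the integers 0..99 make the one-step shift easy to see.
toy_data = torch.arange(100)
xb, yb = get_batch(toy_data, block_size=8, batch_size=4)
print(xb.shape, yb.shape)
print(xb[0].tolist())
print(yb[0].tolist())
```

Each row of `yb` is the matching row of `xb` shifted by one, which is the same context-to-target relationship printed in the cell above, just for several random positions at once.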

In [9]:
from src import SimpleBigramLanguageModel, BatchLoader, Evaluator, Trainer

batch_size = 32
block_size = 16
max_iters = 4500
eval_interval = 500
learning_rate = 3e-3

# Setup data and model
torch.manual_seed(1337)
train_loader = BatchLoader(train_data, block_size=block_size, batch_size=batch_size)
val_loader = BatchLoader(val_data, block_size=block_size, batch_size=batch_size)

# model = SimpleBigramLanguageModel(vocab_size, n_embed, block_size)
model = SimpleBigramLanguageModel(vocab_size, block_size)

# Setup training components
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
evaluator = Evaluator(model, train_loader, val_loader, vocab_size)
trainer = Trainer(model, optimizer, train_loader, evaluator, max_iters, eval_interval)

# Train the model
final_losses = trainer.train()

step 0: perplexity: 112.3, 
step 500: perplexity: 27.1, 
step 1000: perplexity: 15.6, 
step 1500: perplexity: 13.1, 
step 2000: perplexity: 12.4, 
step 2500: perplexity: 12.1, 
step 3000: perplexity: 11.9, 
step 3500: perplexity: 11.9, 
step 4000: perplexity: 11.8, 
step 4500: perplexity: 11.8, 
step 5000: perplexity: 11.7, 


## Training a Simple Bigram Language Model

Here, alongside the model, we import three helpers from `src`:
- **`BatchLoader`**: A small class to handle data in mini-batches.
- **`Evaluator`**: A helper to measure perplexity on training/validation sets.
- **`Trainer`**: Orchestrates the training loop.

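We instantiate `SimpleBigramLanguageModel` with `vocab_size` and `block_size`. The key settings are:
- `block_size = 16`, passed to the model and to both loaders
- `batch_size = 32`, passed to both loaders

We run up to `max_iters = 4500` steps, checking perplexity every `eval_interval = 500` steps. The log above continues to step 5000, so it was probably captured from a run with a larger `max_iters`.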

A final perplexity of ~11.7 means that, on average, the model is about as uncertain as if it were choosing uniformly among 11.7 characters at each step. That is a clear improvement over guessing uniformly across all 65 characters (perplexity 65), but still quite high.
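To make the perplexity numbers concrete, here is a minimal sketch using the standard definition: the exponential of the average negative log-probability the model assigns to the correct next token. How the `Evaluator` in `src` actually computes it is not shown in this notebook, so treat the helper name `perplexity` and this formula as an assumption about it.

```python
import math

def perplexity(target_probs):
    """exp(mean negative log-probability assigned to the true next tokens)."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)

# A model that spreads probability evenly over 65 characters has perplexity 65.
uniform_ppl = perplexity([1 / 65] * 10)
print(round(uniform_ppl, 1))

# A model that always gives the right character probability 0.5 has perplexity 2.
print(round(perplexity([0.5] * 4), 1))

# Under this definition, a perplexity of 11.7 corresponds to an
# average cross-entropy loss of ln(11.7) nats per character.
print(round(math.log(11.7), 2))  # 2.46
```

Under this definition, lower is better, and a perplexity of 1 would mean the model always assigns probability 1 to the correct next character.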

In [12]:
# Generate some text
context = torch.zeros((1, 1), dtype=torch.long)
generated_text = decode(model.generate(context, max_new_tokens=490)[0].tolist())
print(generated_text)


O:
Anele er fe co,
LLamer squsethaittthtr ayit tifod rer; y e ined guratosoulyequg.
BUEEd tavaperelee athavis u warray, n
We by bronond man, d cr miowivero agarlan
has,

Binksue; ain'lilavealeamy y t Isoup uge o'sth r.
What Beeethisunded orachigorsh kn, Ta cheneinhit we t,
Fr s ide Bus ithikee me;
Bul ake har apy ave I arillevVIO hineeo n:
TI ad by andulcavis, scld
Atlithe day;

AO: T:
G butor benkeave y'd,
Gecknfime ttinthalond sBy wapiorasonou haverl the heayet asen d bor t man pe, t


## Generated Text (Simple Bigram Model)

Using `model.generate()`, we sample from our learned distribution to produce text. This snippet:
- Creates a minimal starting context: a single token with ID 0, which is the newline character in our vocabulary.
- Asks the model for the next 490 characters.
- Decodes the token IDs back to characters.

The text has a loosely "Shakespeare-like" shape (short lines, speaker-style labels ending in a colon) but is mostly nonsense. This is expected for a bigram character-level model.
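The body of `model.generate()` lives in `src` and is not shown here. As a rough sketch of what a sampling loop of this kind usually does, the example below uses a random logits table as a hypothetical stand-in for a trained bigram model; `toy_vocab`, `logits_table`, and `toy_generate` are illustrative names.

```python
import torch

torch.manual_seed(0)
toy_vocab = 5
# Stand-in for a trained model: row i holds next-token logits given current token i.
logits_table = torch.randn(toy_vocab, toy_vocab)

def toy_generate(idx, max_new_tokens):
    """Append max_new_tokens sampled tokens to idx, which has shape (batch, time)."""
    for _ in range(max_new_tokens):
        # A bigram model only looks at the most recent token.
        logits = logits_table[idx[:, -1]]                   # (batch, vocab)
        probs = torch.softmax(logits, dim=-1)               # logits -> probabilities
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token per row
        idx = torch.cat([idx, next_id], dim=1)              # grow the sequence
    return idx

start = torch.zeros((1, 1), dtype=torch.long)  # same starting context as above
sampled = toy_generate(start, max_new_tokens=10)
print(sampled[0].tolist())
```

Sampling with `torch.multinomial`, rather than always taking the most likely token, is what makes each generated passage different.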


In [15]:
from src import BigramLanguageModel

# Model parameters
batch_size = 32
block_size = 16
max_iters = 4001
eval_interval = 500
learning_rate = 3e-3
n_embed = 64
n_heads = 4
n_layer = 1
dropout = 0.1
#

# model = SimpleBigramLanguageModel(vocab_size, n_embed, block_size)
model = BigramLanguageModel(vocab_size, n_embed, block_size, n_layer, n_heads, dropout)

# Setup training components
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
evaluator = Evaluator(model, train_loader, val_loader, vocab_size)
trainer = Trainer(model, optimizer, train_loader, evaluator, max_iters, eval_interval)

# Train the model
final_losses = trainer.train()

step 0: perplexity: 65.9, 
step 500: perplexity: 8.7, 
step 1000: perplexity: 7.6, 
step 1500: perplexity: 7.1, 
step 2000: perplexity: 6.8, 
step 2500: perplexity: 6.6, 
step 3000: perplexity: 6.4, 
step 3500: perplexity: 6.4, 
step 4000: perplexity: 6.2, 
step 4500: perplexity: 6.2, 
step 5000: perplexity: 6.1, 


## Introducing Self-Attention

Here, we switch to a more advanced architecture: **`BigramLanguageModel`** that implements a simplified version of self-attention. Key differences:
- `n_heads = 4` means attention is computed by four heads in parallel (multi-head attention).
- `dropout = 0.1` helps prevent overfitting.
- `n_embed = 64` sets the dimension of the embeddings; larger values let the model learn more nuanced patterns.

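During training, the perplexity **drops faster** and ends much lower than with the simple bigram model: about 6.1 versus about 11.7. This demonstrates the power of attention-based layers. Note that the log extends to step 5000 even though this cell sets `max_iters = 4001`, so it was probably captured from a longer run.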
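The `src` implementation of `BigramLanguageModel` is not shown in this notebook, so the following is only a minimal sketch of one causal self-attention head in the usual scaled dot-product form. The toy sizes and the layer names `key`, `query`, and `value` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 4, 8, 4   # toy batch, time, channel, and head sizes
x = torch.randn(B, T, C)          # stand-in for token + position embeddings

key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                # each (B, T, head_size)

# Similarity between every query position and every key position.
scores = q @ k.transpose(-2, -1) / head_size**0.5   # (B, T, T)

# Causal mask: position t may only attend to positions <= t.
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = F.softmax(scores, dim=-1)                 # each row sums to 1
attended = weights @ v                              # (B, T, head_size)
print(attended.shape)
```

The mask is what keeps the model from looking ahead: each position mixes information only from itself and earlier positions, which is what next-character prediction requires. A multi-head layer typically runs several such heads side by side and concatenates their outputs.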

In [57]:
# Generate some text
context = torch.zeros((1, 1), dtype=torch.long)
generated_text = decode(model.generate(context, max_new_tokens=300)[0].tolist())
print(generated_text)


your bes, you as: nowful hear of nold; ward. bath hed lequieds a firt nobor, we have's can to fordids me forim stul of them grabless wind mons to rewind, gays, buthit gner upon
to til Roe sive ve thoughd pods.
That thols, and tame, you seas, fightr, dors soo stae woung, tet, is of com staiser.

Sher.'t Caren: out be het righs. your peaut wapes, in sie;
This to my, this, twer one and med fea kin, we balwans fight rove Rorand don-hil man
I word, seer no, awrad youghtrise:
Stto hy pove, and hond an


## Generated Text (Self-Attention Model)

Now that we have introduced self-attention, the output text, while still nonsensical in parts, contains more coherent words and partial sentences.

You can see phrases like *"your bes, you as:"* and *"Sher.'t Caren:"* that, though random, are starting to look more like Shakespeare's English.

By adjusting hyperparameters (e.g., more layers, larger embeddings), we can push this further.

In [40]:
# Model parameters
batch_size = 32
block_size = 6
max_iters = 4001
eval_interval = 500
learning_rate = 3e-3
n_embed = 192
n_heads = 3
n_layer = 4
dropout = 0.1
#

# model = SimpleBigramLanguageModel(vocab_size, n_embed, block_size)
model = BigramLanguageModel(vocab_size, n_embed, block_size, n_layer, n_heads, dropout)

# Setup training components
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
evaluator = Evaluator(model, train_loader, val_loader, vocab_size)
trainer = Trainer(model, optimizer, train_loader, evaluator, max_iters, eval_interval)

# Train the model
final_losses = trainer.train()

# 2.09 5.6

step 0: perplexity: 4412.1, 
step 500: perplexity: 7.6, 
step 1000: perplexity: 6.7, 
step 1500: perplexity: 6.2, 
step 2000: perplexity: 6.0, 
step 2500: perplexity: 5.8, 
step 3000: perplexity: 5.7, 
step 3500: perplexity: 5.5, 
step 4000: perplexity: 5.5, 


## Scaling Up Further

To demonstrate the impact of scaling, we adjust:
- **`block_size`** to a smaller 6, down from 16 (just for demonstration)
- **`n_embed`** to 192, up from 64
- **`n_layer`** to 4, up from 1
- **`n_heads`** to 3, down from 4

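Even with the shorter context, perplexity drops to about 5.5, showing how deeper networks and bigger embeddings improve predictive power. In practice, you'd also raise the `block_size` to allow the model to see longer context. Note that this cell does not rebuild `train_loader` and `val_loader`, which were created with `block_size = 16`; rebuild them if the batches should match the new `block_size`.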
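One way to read the `n_embed` and `n_heads` settings together is through the per-head dimension. The sketch below assumes the common convention that the embedding is split evenly across heads; whether `BigramLanguageModel` in `src` does this is not shown here, and `head_size` is a hypothetical helper.

```python
def head_size(n_embed, n_heads):
    """Per-head dimension, assuming the embedding is split evenly across heads."""
    if n_embed % n_heads != 0:
        raise ValueError("n_embed must be divisible by n_heads")
    return n_embed // n_heads

print(head_size(64, 4))    # 16 for the previous model's settings
print(head_size(192, 3))   # 64 for this model's settings
```

Under that assumption, this configuration has fewer heads than the previous one, but each head works with a four times larger vector.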


In [42]:
# Generate some text
context = torch.zeros((1, 1), dtype=torch.long)
generated_text = decode(model.generate(context, max_new_tokens=400)[0].tolist())
print(generated_text)


NOR YORK:
Love fring.

LADY CAPULET:
King Esell own so it not by
here to I will love thee.

SICINIUS:
Action; porder.
Who on the me proction.
Now you the words my mercest fornight
That a reconk lay to before-jasts a contern:
O grath?

COMINIUS:
No fro to'th with your gruerm of mror now
on yous: to't!' then't Cabsughtly senator
Your chage
Fausenue to the such in shive.
Will bord, my exaincin, and M


## Final Text Generation

Our scaled-up model now produces text that looks much more like a Shakespeare script.
Speaker headings such as _"LADY CAPULET"_, _"SICINIUS"_, and _"COMINIUS"_ match real character names, and the lines beneath them read like fragments of dialogue. While still not correct English, the output is closer in style to Shakespeare.
