# How Language Models Work

In this notebook we build a simple language model from scratch, using the **Complete Works of Shakespeare** as our dataset.

## Loading and Encoding Text

We start by reading in the text file containing the **Complete Works of Shakespeare**, then print its length and its first 60 characters:

In [1]:
import torch

with open("data/input.txt", "r") as f:
    text = f.read()

print(f"Text length: {len(text)}")

first_60 = text[:60]
print(f"{first_60}")


Text length: 1115393
First Citizen:
Before we proceed any further, hear me speak.


#### Text Encoding

The first step is to convert the text into a numerical format. The simplest way to do this is to create a mapping between each character and a number.

Here, we:
- Extract all unique characters by converting the text into a Python `set`.
- Sort them so that we have a consistent ordering of characters.

In [2]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

print('|'.join(chars))
print(f"Unique characters: {vocab_size}")


| |!|$|&|'|,|-|.|3|:|;|?|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
Unique characters: 65


In our text there are 65 unique characters: 26 uppercase letters, 26 lowercase letters, and 13 others (the digit `3`, punctuation, the space, and the newline). So we can simply map each character to an integer from 0 to 64.

Because `chars` is already sorted, each character's position in that list serves as its number. To do this we define two main dictionaries:
- **`stoi` (string-to-integer)**: Maps each character to an integer.
- **`itos` (integer-to-string)**: Maps back from the integer to the character.

We then define two lambda functions:
- `encode(s)`: Converts a string into a list of integer indices.
- `decode(l)`: Converts a list of indices back into a string.

Below are the first 60 characters encoded as integers:

In [3]:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode(first_60))

[18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8]
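A quick way to check that the two mappings agree is a round trip: decoding an encoded string should give back the original. The sketch below rebuilds the mappings on a short sample string rather than the full corpus so that it runs on its own; the `sample_*` names are illustrative and not part of the notebook.

```python
# Round-trip check on a tiny sample; the sample text is illustrative only.
sample = "First Citizen:\nBefore we proceed"

sample_chars = sorted(set(sample))                       # unique characters, stable order
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}  # character -> integer
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}  # integer -> character


def encode_sample(s):
    """Convert a string into a list of integer indices."""
    return [sample_stoi[c] for c in s]


def decode_sample(ids):
    """Convert a list of integer indices back into a string."""
    return "".join(sample_itos[i] for i in ids)


ids = encode_sample(sample)
roundtrip = decode_sample(ids)
print(ids[:5])
print(roundtrip == sample)
```

Note that a character that never occurred in the text used to build the vocabulary has no entry in the mapping, so encoding it raises a `KeyError`.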


## Dataset Splitting and Context-Target Pairs

1. We **split** the data into a training set (90%) and a validation set (10%).
2. We set a `block_size` of 8, meaning our context window includes 8 tokens (characters).
3. We illustrate how each token in the block is used to predict the *next* token.

In practice, the training loop will sample chunks of the text in random order, so the model sees a variety of patterns.

In [4]:
data = torch.tensor(encode(text))

# split into train and validation
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

block_size = 8

# 9 items yield 8 prediction examples
train_data[:block_size+1]

x = train_data[:block_size]
y = train_data[1:block_size+1]

print(x)

# every prefix of the block is a training example, so the model sees contexts of length 1 up to block_size
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is {target}")


tensor([18, 47, 56, 57, 58,  1, 15, 47])
when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58
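The random sampling of chunks mentioned above can be sketched as follows. This is a minimal illustration using plain Python lists and the standard `random` module; `get_batch`, `toy_data`, and `rng` are hypothetical names, and this is not the `BatchLoader` used later in the notebook, whose code is not shown here.

```python
import random


def get_batch(data, block_size, batch_size, rng):
    """Sample `batch_size` random (input, target) windows from `data`."""
    # The last valid start leaves room for block_size inputs plus one shifted target.
    max_start = len(data) - block_size - 1
    starts = [rng.randint(0, max_start) for _ in range(batch_size)]
    xb = [data[i : i + block_size] for i in starts]
    yb = [data[i + 1 : i + block_size + 1] for i in starts]  # inputs shifted by one
    return xb, yb


toy_data = list(range(100))   # stand-in for the encoded text
rng = random.Random(1337)     # seeded so the sketch is repeatable
xb, yb = get_batch(toy_data, block_size=8, batch_size=4, rng=rng)

for x_row, y_row in zip(xb, yb):
    print(x_row, "->", y_row)
```

Each target row is the input row shifted one position to the right, which is the same relationship as `x` and `y` in the cell above, just drawn from random offsets.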


## Training a Simple Bigram Language Model

Here, we use the following helper classes from `src`:
- **`BatchLoader`**: A small class to handle data in mini-batches.
- **`Evaluator`**: A helper to measure perplexity on training/validation sets.
- **`Trainer`**: Orchestrates the training loop.

We instantiate `SimpleBigramLanguageModel` with:
- `block_size = 16`
- `batch_size = 32`

We run up to `max_iters = 4001` steps, checking perplexity every `eval_interval = 500` steps.

A final perplexity of ~11.8 means that, on average, the model is about as uncertain as if it were choosing uniformly among roughly 12 characters at each step. That is a clear improvement over a uniform guess across all 65 characters, but still quite high.
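Perplexity is commonly defined as the exponential of the average negative log-likelihood (the cross-entropy loss, in nats) of the correct next token. The sketch below assumes that definition; how `Evaluator` computes it is not shown in this notebook, and the `perplexity` helper here is a hypothetical stand-in.

```python
import math


def perplexity(probs_of_correct):
    """exp of the mean negative log-likelihood of the correct next characters."""
    nll = [-math.log(p) for p in probs_of_correct]
    return math.exp(sum(nll) / len(nll))


vocab = 65
# A model that spreads its probability evenly over the whole vocabulary.
uniform = perplexity([1 / vocab] * 10)
print(round(uniform, 1))

# Cross-entropy loss (in nats) that corresponds to a perplexity of 11.8.
print(round(math.log(11.8), 2))
```

Under this definition a uniform guess over 65 characters gives a perplexity of 65, and a perplexity of 11.8 corresponds to a loss of about 2.47 nats.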

In [5]:
from src import SimpleBigramLanguageModel, BatchLoader, Evaluator, Trainer

batch_size = 32
block_size = 16
max_iters = 4001
eval_interval = 500
learning_rate = 3e-3

# Setup data and model
torch.manual_seed(1337)
train_loader = BatchLoader(train_data, block_size=block_size, batch_size=batch_size)
val_loader = BatchLoader(val_data, block_size=block_size, batch_size=batch_size)

# model = SimpleBigramLanguageModel(vocab_size, n_embed, block_size)
model = SimpleBigramLanguageModel(vocab_size, block_size)

# Setup training components
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
evaluator = Evaluator(model, train_loader, val_loader, vocab_size)
trainer = Trainer(model, optimizer, train_loader, evaluator, max_iters, eval_interval)

# Train the model
final_losses = trainer.train()

step 0: perplexity: 112.3, 
step 500: perplexity: 27.1, 
step 1000: perplexity: 15.6, 
step 1500: perplexity: 13.1, 
step 2000: perplexity: 12.4, 
step 2500: perplexity: 12.1, 
step 3000: perplexity: 11.9, 
step 3500: perplexity: 11.9, 
step 4000: perplexity: 11.8, 


## Generated Text (Simple Bigram Model)

Using `model.generate()`, we sample from our learned distribution to produce text. This snippet:
- Creates a minimal starting context: a single token with index 0, which is the newline character in our vocabulary.
- Asks the model for the next 490 characters.
- Decodes the token IDs back to characters.

The text is somewhat "Shakespeare-like" but still full of nonsense. This is expected for a bigram character-level model.


In [6]:
# Generate some text
context = torch.zeros((1, 1), dtype=torch.long)
generated_text = decode(model.generate(context, max_new_tokens=490)[0].tolist())
print(generated_text)


Wadoust ftes inupuctararirtowir gs wingucr, as aith helpr;
Ju ove d rangos blthiok
Pl ghese tringhan then ande

I cus, Bu d:
TEENGOKENCILA iby ELOLoristhe,
Plethatird fu$go tckid t y ollf ta he cere. G ha hearee d ld beat gu bean ane as tiorseade marethioathowow alr, t ot anand are rend.
LEOLUENTITENCKICOMEOLLLI e.
KEThof aren:
NIVI miecohyor, erinst flawhant fe ere bons thand athe it ilee-OR f: in, he spoth,
Melladalee isaloveen ol it gllinde me h chin oug--ccas hed as o tot thit ance
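Generation of this kind is a loop: look up a probability distribution for the next token, sample from it, append the result, and repeat. The sketch below shows that loop with a hand-written probability table over a three-character vocabulary. The table, `toy_vocab`, and `toy_generate` are made up for illustration and are not what `model.generate()` does internally, since that code is not shown here.

```python
import random

# Hypothetical next-character probabilities for a 3-character vocabulary.
toy_vocab = ["\n", "a", "b"]
next_probs = {
    0: [0.0, 0.6, 0.4],  # distribution after "\n"
    1: [0.2, 0.3, 0.5],  # distribution after "a"
    2: [0.3, 0.6, 0.1],  # distribution after "b"
}


def toy_generate(start_token, max_new_tokens, rng):
    """Repeatedly sample the next token given only the previous token."""
    tokens = [start_token]
    for _ in range(max_new_tokens):
        probs = next_probs[tokens[-1]]
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


rng = random.Random(1337)
sampled = toy_generate(start_token=0, max_new_tokens=20, rng=rng)
print("".join(toy_vocab[t] for t in sampled))
```

Because each step conditions only on the previous character, the output can pick up common letter pairs but has no way to keep a whole word consistent, which matches the kind of text shown above.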


## Introducing Self-Attention

Here, we switch to a more advanced architecture: **`BigramLanguageModel`**, which implements a simplified version of self-attention. Key differences:
- We have `n_heads = 4`, meaning we use multi-head attention.
- `dropout = 0.1` helps prevent overfitting.
- `n_embed = 64` increases the dimension of our embeddings, letting the model learn more nuanced patterns.
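To make the idea of self-attention concrete, here is a minimal single-head sketch in plain Python. It assumes the standard scaled dot-product formulation with a causal mask, so each position can only look at itself and earlier positions. The `src` implementation is not shown in this notebook, so treat this as an illustration of the general technique rather than a description of that class; the toy `q`, `k`, and `v` values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(val - m) for val in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention; position t sees only 0..t."""
    d = len(q[0])
    out, weights = [], []
    for t in range(len(q)):
        # Scores against the current and earlier positions only (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the visible value vectors.
        out.append(
            [sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(len(v[0]))]
        )
    return out, weights


# Toy inputs: 3 positions, 2 dimensions. In a real model q, k, and v come
# from learned linear projections of the token embeddings.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = causal_self_attention(q, k, v)
for t, w in enumerate(weights):
    print(t, [round(x, 2) for x in w])
```

The first position can only attend to itself, so its output is simply its own value vector. Later positions mix information from everything before them, which is what lets the model use more than one character of context.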

During training, the perplexity now **drops faster** and ends much lower than with the simple bigram model, reaching about 6.3 by step 4000. This demonstrates the power of attention-based layers.

In [7]:
from src import BigramLanguageModel

# Model parameters
batch_size = 32
block_size = 16
max_iters = 4001
eval_interval = 500
learning_rate = 3e-3
n_embed = 64
n_heads = 4
n_layer = 1
dropout = 0.1
#

# model = SimpleBigramLanguageModel(vocab_size, n_embed, block_size)
model = BigramLanguageModel(vocab_size, n_embed, block_size, n_layer, n_heads, dropout)

# Setup training components
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
evaluator = Evaluator(model, train_loader, val_loader, vocab_size)
trainer = Trainer(model, optimizer, train_loader, evaluator, max_iters, eval_interval)

# Train the model
final_losses = trainer.train()

step 0: perplexity: 66.7, 
step 500: perplexity: 8.5, 
step 1000: perplexity: 7.5, 
step 1500: perplexity: 7.0, 
step 2000: perplexity: 6.8, 
step 2500: perplexity: 6.6, 
step 3000: perplexity: 6.4, 
step 3500: perplexity: 6.3, 
step 4000: perplexity: 6.3, 


#### Generated Text (Self-Attention Model)

Now that we have introduced self-attention, the output text, while still nonsensical in parts, contains more coherent words and partial sentences.

You can see fragments like *"Casens for would would his, joy"* and a speaker heading like *"EDWARWICK:"* that, though random, are starting to look more like 16th-century English.

By adjusting hyperparameters (e.g., more layers, larger embeddings), we can push this further.

In [8]:
# Generate some text
context = torch.zeros((1, 1), dtype=torch.long)
generated_text = decode(model.generate(context, max_new_tokens=300)[0].tolist())
print(generated_text)


HESS Nairs, as grown, fatendam trongoon body honour cinsfel spother dee
Casens for would would his, joy banderve mawn that all mades,
Or, in, fair io findance;
So mell, Thees by ken denhower chies.

EDWARWICK:
Her coldwick owy titry
Are, prikens to Prolost: Iuble your fom us:
Your thate see, but hav


## Scaling Up Further

To demonstrate the impact of scaling, we adjust:
- **`n_embed`** to 192
- **`n_layer`** to 3
- **`n_heads`** to 3
- **`max_iters`** to 8001

The `block_size` stays at a small 16 here (just for demonstration).

Even with that short context, perplexity falls to about 5.5 by step 4000 and about 5.1 by step 8000, showing how deeper networks and bigger embeddings improve predictive power. In practice, you'd also raise the `block_size` to allow the model to see longer context.
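For a rough sense of how much larger the scaled-up model is, the sketch below compares the two attention configurations using the values set in the training cells of this notebook. It rests on two assumptions that the notebook does not confirm: that `n_embed` is split evenly across heads, and that each layer is a standard transformer block with about `4 * n_embed**2` attention weights and `8 * n_embed**2` weights in a 4x-wide feed-forward layer. Both helper functions are hypothetical.

```python
def head_size(n_embed, n_heads):
    """Per-head dimension, assuming n_embed is split evenly across heads."""
    if n_embed % n_heads != 0:
        raise ValueError("n_embed must be divisible by n_heads")
    return n_embed // n_heads


def approx_block_params(n_embed, n_layer):
    """Rough weight count: 4*d^2 for attention plus 8*d^2 for a 4x MLP, per layer."""
    return n_layer * 12 * n_embed ** 2


configs = {
    "self-attention": {"n_embed": 64, "n_heads": 4, "n_layer": 1},
    "scaled-up": {"n_embed": 192, "n_heads": 3, "n_layer": 3},
}
for name, cfg in configs.items():
    hs = head_size(cfg["n_embed"], cfg["n_heads"])
    params = approx_block_params(cfg["n_embed"], cfg["n_layer"])
    print(f"{name}: head_size={hs}, approx block params={params:,}")
```

Under these assumptions the scaled-up configuration has roughly 27 times as many block weights as the first attention model, ignoring embeddings, biases, and the output layer.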


In [15]:
# Model parameters
batch_size = 32
block_size = 16
max_iters = 8001
eval_interval = 500
learning_rate = 3e-3
n_embed = 192
n_heads = 3
n_layer = 3
dropout = 0.1
#

# model = SimpleBigramLanguageModel(vocab_size, n_embed, block_size)
model = BigramLanguageModel(vocab_size, n_embed, block_size, n_layer, n_heads, dropout)

# Setup training components
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
evaluator = Evaluator(model, train_loader, val_loader, vocab_size)
trainer = Trainer(model, optimizer, train_loader, evaluator, max_iters, eval_interval)

# Train the model
final_losses = trainer.train()

# 2.09 5.6

step 0: perplexity: 290.1, 
step 500: perplexity: 7.6, 
step 1000: perplexity: 6.7, 
step 1500: perplexity: 6.3, 
step 2000: perplexity: 6.2, 
step 2500: perplexity: 5.9, 
step 3000: perplexity: 5.7, 
step 3500: perplexity: 5.6, 
step 4000: perplexity: 5.5, 
step 4500: perplexity: 5.5, 
step 5000: perplexity: 5.4, 
step 5500: perplexity: 5.4, 
step 6000: perplexity: 5.3, 
step 6500: perplexity: 5.2, 
step 7000: perplexity: 5.1, 
step 7500: perplexity: 5.2, 
step 8000: perplexity: 5.1, 


#### Final Text Generation

Our scaled-up model now produces text that hints at characters, names, and partial phrases resembling Shakespeare.
We see speaker headings such as _"PRINCE ELIZABETH:"_, near-miss names like _"Bolingbrokd"_, and partially coherent lines that mimic dialogue. While still not perfect English, it's closer in style to Shakespeare.


In [16]:
# Generate some text
context = torch.zeros((1, 1), dtype=torch.long)
generated_text = decode(model.generate(context, max_new_tokens=400)[0].tolist())
print(generated_text)


It's of it, Bolingbrokd say me
Was my let little?

PRINCE ELIZABETH:
I, andeed Isabelgarern face, boy:
Tyknow't hair, Venousin, that death know Klaint us hoy are hope's hope and will-harlst.
What'struct by intale
The rihour cory or than an,,
Not justice precition 'tis the nuch and eyes?

DUKE OF UntOLYCUS:
Younharis!

DUSHERDOLIO:
The divicers.
Upone this my throng and reat. No: even our me.
They 


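## Using a Cloud GPU

The cells below sync the project files to a RunPod instance and start training there. The outputs shown come from a run in which authentication and the SSH session failed.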

In [1]:
import json
from src.cloud_utils import sync_to_pod, run_on_pod

# First, sync files
print("Syncing files to RunPod...")
sync_success = sync_to_pod()


Syncing files to RunPod...


AuthenticationError: Unauthorized request, please check your API key.

In [2]:
# Define your configuration
config = {
    "batch_size": 32, #128
    "block_size": 16, #512
    "max_iters": 10000, #10000
    "eval_interval": 500, #500
    "learning_rate": 1e-4, #1e-4
    "n_embed": 64, #512
    "n_heads": 4, #8
    "n_layer": 1,
    "dropout": 0.2
}

# Convert config to JSON string
config_str = json.dumps(config)

# Install requirements
# print("Installing requirements...")
# run_remote_command(ssh, "cd shakespeare && pip install torch numpy tqdm")

# Start training with config
print("\nStarting training...")
run_on_pod(f"cd shakespeare && python cloud_train.py '{config_str}'")

Setting up training configuration...
Error: Your SSH client doesn't support PTY
Writing configuration...
Running command: cd ~/shakespeare && python cloud_train.py config.json
Error: Your SSH client doesn't support PTY


True
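When a JSON string is passed as a command-line argument through a remote shell, quoting matters: a stray quote or space in the JSON can split it into several arguments. The sketch below uses the standard library's `shlex.quote` to build the command and `shlex.split` to simulate what the remote script would receive. How `run_on_pod` and `cloud_train.py` actually handle the argument is not shown in this notebook, so this is only an illustration, and `example_config` is a made-up subset of the configuration above.

```python
import json
import shlex

example_config = {"batch_size": 32, "block_size": 16, "learning_rate": 1e-4}

# shlex.quote escapes the JSON so a POSIX shell passes it as a single argument.
remote_cmd = (
    "cd shakespeare && python cloud_train.py "
    + shlex.quote(json.dumps(example_config))
)
print(remote_cmd)

# Simulate the shell's word splitting and recover the last argument.
received = shlex.split(remote_cmd)[-1]
parsed = json.loads(received)
print(parsed == example_config)
```

Another option, which the log output above suggests may already be in use, is to write the configuration to a file on the pod and pass only the file name.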