A 100M parameter language model built from scratch in PyTorch, following the OLMo2 architecture by Allen AI.
This is a hands-on learning project for understanding how modern LLMs work — every component (attention, RoPE, SwiGLU, normalization) is implemented from first principles with detailed comments explaining the math and design decisions.
- OLMo2 architecture: RMSNorm, Rotary Position Embeddings, SwiGLU FFN, QK norm, reordered norm
- 101.8M trainable parameters (configurable)
- bf16 mixed-precision training on NVIDIA GPUs
- Gradio web dashboard with start/pause/resume/stop controls and live charts
- Unattended training with auto-resume from checkpoints, graceful signal handling
- FineWeb-Edu integration for real-world training data (streaming from HuggingFace)
- lm-eval-harness benchmarks (HellaSwag, ARC-Easy, LAMBADA)
- Heavily commented code — every function explains the math and the "why"
```
Token IDs ──> Embedding ──> [TransformerBlock x 12] ──> RMSNorm ──> Linear ──> Logits
```

Each TransformerBlock:

```
input x
 ├──> Attention(x) ──> RMSNorm ──> + (residual) = h       [reordered norm]
 └──> SwiGLU(h)    ──> RMSNorm ──> + (residual) = output
```
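As a concrete illustration, the reordered-norm residual pattern can be sketched in plain Python (a pure-Python stand-in for the tensor ops; `attn` and `ffn` below are placeholder callables, not the real sublayers):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of the features, then scale.
    # No mean subtraction and no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def transformer_block(x, attn, ffn, w_attn, w_ffn):
    # OLMo2 reordered norm: normalize the sublayer OUTPUT before adding the
    # residual (vanilla pre-norm instead normalizes the sublayer INPUT).
    h = [a + b for a, b in zip(x, rms_norm(attn(x), w_attn))]
    return [a + b for a, b in zip(h, rms_norm(ffn(h), w_ffn))]
```

Because only the sublayer output is normalized, the residual stream itself stays un-normalized until the final RMSNorm before the LM head.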
| Hyperparameter | Value |
|---|---|
| d_model | 512 |
| n_layers | 12 |
| n_heads | 8 |
| head_dim | 64 |
| FFN hidden dim | 2048 (SwiGLU) |
| vocab_size | 50,304 (GPT-2 tokenizer) |
| context_length | 2048 |
| Normalization | RMSNorm (no bias, eps=1e-6) |
| Position encoding | RoPE (theta=500,000) |
| Activation | SwiGLU (SiLU-gated) |
| QK norm | RMSNorm on Q and K |
| Norm placement | Reordered (post-sublayer, pre-residual) |
| Total parameters | 101,857,280 |
Parameter breakdown:
- Token embedding: 25.8M
- 12 transformer blocks: 50.3M (4.2M each)
- Final norm + LM head: 25.8M
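The reported total can be reproduced with a little arithmetic. This sketch assumes untied input/output embeddings, no bias terms, and Q/K norms sized to head_dim (assumptions inferred from the breakdown above):

```python
d_model, n_layers, vocab, ffn_hidden, head_dim = 512, 12, 50304, 2048, 64

embedding = vocab * d_model              # token embedding: ~25.8M
lm_head = vocab * d_model                # untied output projection: ~25.8M
attention = 4 * d_model * d_model        # Wq, Wk, Wv, Wo (no bias)
swiglu = 3 * d_model * ffn_hidden        # gate, up, and down projections
norms = 2 * d_model + 2 * head_dim       # two block RMSNorms + Q/K norms
block = attention + swiglu + norms       # ~4.2M per block
total = embedding + lm_head + n_layers * block + d_model  # + final RMSNorm
print(f"{total:,}")  # 101,857,280
```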
- Python 3.10+
- NVIDIA GPU with CUDA support (tested on DGX Spark with GB10, CUDA 13.0)
- ~2 GB disk for the model + checkpoints
- For FineWeb-Edu: additional disk for training data (configurable)
```bash
git clone <repo-url> ailmo
cd ailmo
./setup.sh
```

This creates a virtual environment, installs PyTorch + dependencies, verifies GPU access, and prepares the default dataset (TinyShakespeare).
If you prefer manual setup:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu130  # adjust for your CUDA
pip install tiktoken numpy gradio pandas datasets lm-eval
python3 data.py  # prepare TinyShakespeare
```

To launch the dashboard:

```bash
./start.sh
```

This opens a Gradio web dashboard at http://localhost:7860 (pass --port 8080 for a different port).
Dashboard tabs:
- Training — configure hyperparameters, start/pause/resume/stop training, manage checkpoints
- Monitoring — live charts for loss, validation loss, throughput (tok/s), learning rate
- Generate — test the model with custom prompts, temperature, and top-k controls
- Logs — real-time training log stream
Workflow: Set "Steps per chunk" (e.g., 50), enable "Auto-pause", click "Apply Configuration", then "Start / Resume". The model trains for one chunk, pauses, and waits. Check Monitoring for charts, Generate to test output, then resume when ready.
```bash
./train_full.sh
```

Runs training to completion without interaction. Features:
- Auto-resume: if interrupted, re-run the same command to pick up from the last checkpoint
- Graceful shutdown: Ctrl+C saves a checkpoint before exiting
- Rolling checkpoints: keeps the last N checkpoints (default 5) to save disk space
- JSONL logging: every step is logged to results/training_log.jsonl
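The JSONL format is trivial to produce; a minimal logger in the same spirit (the field names match the sample log records in this README, but this helper itself is hypothetical, not the project's code):

```python
import json
import time

def log_step(path, step, loss, lr, tok_s):
    # Append one JSON object per line; files like this load cleanly with
    # pandas.read_json(..., lines=True) or line-by-line json.loads.
    record = {"step": step, "loss": round(loss, 4), "lr": lr, "tok_s": tok_s,
              "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S")}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Appending (rather than rewriting) keeps the log crash-safe: an interrupted run loses at most the final partial line.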
Common options:

```bash
# Train longer
./train_full.sh --max-steps 50000

# Custom hyperparameters
./train_full.sh --lr 1e-3 --batch-size 64 --grad-accum 2

# Use FineWeb-Edu data (prepare it first — see Data section below)
./train_full.sh --data-dir data/fineweb/

# Adjust checkpoint behavior
./train_full.sh --save-interval 1000 --keep-checkpoints 3 --eval-interval 500
```

All CLI options for train_full.py:
| Option | Default | Description |
|---|---|---|
| `--max-steps` | 5000 | Total training steps |
| `--lr` | 3e-4 | Peak learning rate (cosine schedule with warmup) |
| `--batch-size` | 32 | Micro batch size per step |
| `--grad-accum` | 4 | Gradient accumulation steps (effective batch = batch-size * grad-accum) |
| `--data-dir` | data | Directory containing train.bin and val.bin |
| `--checkpoint-dir` | checkpoints | Where to save model checkpoints |
| `--keep-checkpoints` | 5 | Rolling window of checkpoints to keep |
| `--eval-interval` | 250 | Run validation every N steps |
| `--save-interval` | 500 | Save checkpoint every N steps |
| `--log-interval` | 10 | Print metrics every N steps |
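With the default batch size, gradient accumulation, and context length, the tokens processed per optimizer step work out as:

```python
batch_size, grad_accum, context_length = 32, 4, 2048

effective_batch = batch_size * grad_accum           # 128 sequences per optimizer step
tokens_per_step = effective_batch * context_length  # 262,144 tokens per step
print(effective_batch, tokens_per_step)
```

Gradient accumulation trades wall-clock time for memory: only `batch_size` sequences are resident on the GPU at once, but gradients are summed across `grad_accum` micro-batches before each optimizer step.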
After training, generate text from the command line:
```bash
source .venv/bin/activate
python3 generate.py --checkpoint checkpoints/final.pt --prompt "To be, or not to be," --temperature 0.8 --top-k 50
```

Or use the Generate tab in the Gradio dashboard.
The complete works of Shakespeare (~1MB, 338K tokens). Good for smoke testing and debugging — the model can memorize patterns within a few thousand steps.
```bash
./prepare_data.sh
# or: python3 prepare_data.py
```

A high-quality, educationally focused subset of web text from HuggingFace, filtered from 15 trillion tokens with a Llama-3-based classifier. Data is streamed, so you only download what you need.
```bash
# Download 1 GB (~250M tokens)
./prepare_data.sh --dataset fineweb-edu

# Download 5 GB (~1.25B tokens)
./prepare_data.sh --dataset fineweb-edu --size 5

# Custom output directory
./prepare_data.sh --dataset fineweb-edu --size 5 --output data/fineweb/
```

Approximate token counts per size:
| Size | Tokens | Training time (est.) |
|---|---|---|
| 1 GB | ~250M | A few hours |
| 5 GB | ~1.25B | ~1 day |
| 10 GB | ~2.5B | ~2 days |
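The time estimates follow directly from throughput. For example, at roughly 19,500 tok/s (the figure shown in the sample training log in this README; actual throughput is hardware dependent), one pass over 250M tokens takes:

```python
tokens = 250_000_000      # ~1 GB of FineWeb-Edu
tok_per_sec = 19_500      # example throughput (assumption: sustained)
hours = tokens / tok_per_sec / 3600
print(f"{hours:.1f} h")   # ~3.6 hours per epoch over the data
```

Multiple epochs (or more data) scale linearly, which is where the ~1 day and ~2 day estimates come from.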
All CLI options for prepare_data.py:
| Option | Default | Description |
|---|---|---|
| `--dataset` | tiny_shakespeare | Dataset to prepare (tiny_shakespeare or fineweb-edu) |
| `--size` | 1.0 | Size in GB for FineWeb-Edu |
| `--output` | data | Output directory for .bin files |
| `--val-fraction` | 0.02 | Fraction reserved for validation |
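The .bin files are flat arrays of uint16 token IDs, which works because vocab_size 50,304 < 65,536. A sketch of that layout (illustrative, not the actual prepare_data.py code; `write_bins` is a hypothetical helper):

```python
import array
import os

def write_bins(token_ids, out_dir, val_fraction=0.02):
    # Split the token stream and dump each part as a raw uint16 array,
    # the layout a numpy memmap can later read back without copying.
    n_val = max(1, int(len(token_ids) * val_fraction))
    splits = {"train.bin": token_ids[:-n_val], "val.bin": token_ids[-n_val:]}
    for name, chunk in splits.items():
        with open(os.path.join(out_dir, name), "wb") as f:
            array.array("H", chunk).tofile(f)  # "H" = unsigned 16-bit
```

Two bytes per token means 1 GB of .bin files holds ~500M token IDs, and random access for batching needs no parsing at all.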
Evaluate your trained model using EleutherAI's lm-evaluation-harness, the industry-standard framework used by HuggingFace's Open LLM Leaderboard.
```bash
# Evaluate latest checkpoint with default benchmarks
./evaluate.sh

# Specific checkpoint
./evaluate.sh --checkpoint checkpoints/step_5000.pt

# Custom benchmarks
./evaluate.sh --tasks hellaswag,arc_easy,lambada_openai

# Few-shot evaluation
./evaluate.sh --num-fewshot 5
```

| Benchmark | Type | Random Baseline | Good 100M |
|---|---|---|---|
| HellaSwag | Commonsense reasoning | ~25% | 26-30% |
| ARC-Easy | Elementary science QA | ~25% | 25-30% |
| LAMBADA | Next-word prediction | ~0% | 5-15% |
Results are saved to results/eval_results/ as JSON files. If lm-eval is not installed, the script falls back to built-in perplexity evaluation on the validation set.
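The fallback metric relates directly to the training loss: perplexity is the exponential of the mean cross-entropy. For example, taking the sample final validation loss from this README:

```python
import math

val_loss = 3.62                 # example final_val_loss from a sample run
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")      # ~37.3: on average the model is as uncertain
                                # as a uniform choice over ~37 tokens
```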
All CLI options for evaluate.py:
| Option | Default | Description |
|---|---|---|
| `--checkpoint` | (required) | Path to model checkpoint |
| `--tasks` | hellaswag,arc_easy,lambada_openai | Comma-separated benchmark names |
| `--output` | results/eval_results | Output directory for results JSON |
| `--batch-size` | 16 | Evaluation batch size |
| `--num-fewshot` | 0 | Number of few-shot examples |
```
ailmo/
├── model.py            # Full model: RMSNorm, RoPE, Attention, SwiGLU, Block, LLM
├── configs.py          # ModelConfig and TrainConfig dataclasses
├── data.py             # TokenDataset class and basic data utilities
├── generate.py         # Text generation with temperature and top-k sampling
│
├── train.py            # Simple standalone training loop (no frills)
├── train_full.py       # Production training: auto-resume, signals, JSONL logging
├── engine.py           # TrainingEngine with pause/resume/stop (for dashboard)
├── dashboard.py        # Gradio web dashboard
│
├── prepare_data.py     # Data download and tokenization (Shakespeare or FineWeb-Edu)
├── evaluate.py         # lm-eval-harness benchmark wrapper
│
├── setup.sh            # One-time environment setup
├── start.sh            # Launch Gradio dashboard
├── train_full.sh       # Launch unattended training
├── evaluate.sh         # Launch evaluation (auto-finds latest checkpoint)
├── prepare_data.sh     # Launch data preparation
│
├── requirements.txt    # Python dependencies
├── .gitignore          # Excludes data/, checkpoints/, results/
├── CLAUDE.md           # Project context for AI assistants
│
├── data/               # Tokenized training data (.bin files)
│   ├── train.bin       # Training tokens (memory-mapped uint16)
│   ├── val.bin         # Validation tokens
│   └── metadata.json   # Dataset info (for FineWeb-Edu)
│
├── checkpoints/        # Model checkpoints
│   ├── step_1000.pt    # Periodic checkpoints
│   ├── step_2000.pt
│   └── final.pt        # Final checkpoint after training completes
│
└── results/            # Training outputs
    ├── training_log.jsonl      # Per-step metrics (loss, lr, tok/s, timestamp)
    ├── training_summary.json   # Final stats (loss, time, throughput)
    └── eval_results/           # Benchmark results per checkpoint
        └── final_eval.json
```
One JSON object per logged step:

```json
{"step": 100, "loss": 6.2341, "lr": 0.000285, "tok_s": 19500, "timestamp": "2026-03-29T20:15:30"}
{"step": 250, "loss": 4.8912, "lr": 0.000300, "tok_s": 20100, "timestamp": "2026-03-29T20:18:45", "val_loss": 5.1023}
```

Load with pandas:

```python
import pandas as pd

df = pd.read_json("results/training_log.jsonl", lines=True)
df.plot(x="step", y="loss")
```

results/training_summary.json records the final stats:

```json
{
  "final_step": 5000,
  "final_train_loss": 3.45,
  "final_val_loss": 3.62,
  "total_time_seconds": 1800.5,
  "avg_tokens_per_sec": 19500,
  "model_params": 101857280,
  "checkpoint": "checkpoints/final.pt"
}
```

Benchmark results land in results/eval_results/ (e.g. final_eval.json):

```json
{
  "checkpoint": "checkpoints/final.pt",
  "tasks": ["hellaswag"],
  "num_fewshot": 0,
  "results": {
    "hellaswag": {
      "acc,none": 0.2634,
      "acc_norm,none": 0.2701
    }
  }
}
```

The code is designed to be read and learned from. Start with these files in order:
1. configs.py — All hyperparameters in two dataclasses. Read this first to understand the model dimensions.
2. model.py — The core of the project. Every class has a 10+ line docstring explaining:
   - What the component does and why it exists
   - The mathematical formula with variable names mapped to code
   - How it differs from the vanilla Transformer
   - References to the relevant papers

   Read bottom-up: RMSNorm -> RoPE -> Attention -> SwiGLUFFN -> TransformerBlock -> LLM
3. data.py — How text becomes numbers: tokenization with tiktoken, memory-mapped binary files for efficient random access.
4. train.py — A clean, minimal training loop. Good for understanding the basics: forward pass, loss, backward, optimizer step, learning rate schedule.
5. generate.py — Autoregressive text generation: temperature scaling, top-k sampling, greedy decoding.
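The sampling procedure (temperature scaling followed by top-k) can be sketched in pure Python. This is an illustrative sketch of the technique, not the project's actual implementation:

```python
import math
import random

def sample_top_k(logits, k=50, temperature=0.8, rng=None):
    # 1) temperature-scale, 2) keep only the k largest logits,
    # 3) softmax over the survivors, 4) sample from that distribution.
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    kth = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
    masked = [s if s >= kth else float("-inf") for s in scaled]
    m = max(masked)
    weights = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for i, w in enumerate(weights):              # inverse-CDF sampling
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1
```

Lower temperature sharpens the distribution toward the argmax; `k=1` degenerates to greedy decoding.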
| Concept | File | What to learn |
|---|---|---|
| RMSNorm | model.py | Simpler alternative to LayerNorm; why we upcast to float32 |
| RoPE | model.py | Encoding position through rotation; complex number trick |
| QK Norm | model.py | Preventing attention logit explosion in bf16 training |
| SwiGLU | model.py | Gated MLPs; why 3 weight matrices beat 2 |
| Reordered Norm | model.py | OLMo2's norm placement innovation |
| Cosine LR Schedule | train.py | Warmup + cosine decay; why it works |
| Gradient Accumulation | train_full.py | Simulating large batches on limited memory |
| Mixed Precision | train_full.py | bf16 autocast for speed + memory savings |
| Causal Masking | model.py | Why language models can only look backward |
| Memory-mapped Data | data.py | Handling datasets larger than RAM |
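The warmup + cosine schedule from the table fits in a few lines. The warmup length and floor below are illustrative values, not the project's exact defaults:

```python
import math

def lr_at(step, max_steps=5000, peak_lr=3e-4, warmup=100, min_lr=3e-5):
    # Linear warmup to the peak, then cosine decay down to the floor.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)  # 0 -> 1 over decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids large unstable updates while the optimizer state is cold; the cosine tail lets the loss settle near the end of training.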
On a DGX Spark (GB10 GPU, 128GB unified memory):
```
Step 0:    loss=10.83  (random, ~ln(50304))
Step 100:  loss=7.50   (learning basic token frequencies)
Step 500:  loss=5.20   (learning common phrases)
Step 1000: loss=4.30   (learning grammar patterns)
Step 5000: loss=3.50   (coherent short passages)
```
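The starting loss is not arbitrary: before any learning, the cross-entropy of a uniform prediction over the vocabulary is ln(vocab_size):

```python
import math

vocab_size = 50_304
print(round(math.log(vocab_size), 2))  # 10.83, matching the step-0 loss
```

Checking your first logged loss against this value is a quick sanity test that the data pipeline and loss computation are wired correctly.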
With FineWeb-Edu data and 50K+ steps, expect:
- HellaSwag: 27-30%
- Coherent paragraph-level generation
- Basic factual knowledge
This is an educational project. The code is provided as-is for learning purposes.
- OLMo2: Open Language Model 2 — Allen AI, 2025
- RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al., 2021
- GLU Variants Improve Transformer — Shazeer, 2020
- Root Mean Square Layer Normalization — Zhang & Sennrich, 2019
- FineWeb-Edu Dataset — HuggingFace, 2024
- lm-evaluation-harness — EleutherAI