A 100M parameter language model built from scratch in PyTorch, following the OLMo2 architecture by Allen AI.
This is a hands-on learning project for understanding how modern LLMs work — every component (attention, RoPE, SwiGLU, normalization) is implemented from first principles with detailed comments explaining the math and design decisions.
- OLMo2 architecture: RMSNorm, Rotary Position Embeddings, SwiGLU FFN, QK norm, reordered norm
- 101.8M trainable parameters (configurable)
- bf16 mixed-precision training on NVIDIA GPUs
- Gradio web dashboard with start/pause/resume/stop controls and live charts
- Unattended training with auto-resume from checkpoints, graceful signal handling
- FineWeb-Edu integration for real-world training data (streaming from HuggingFace)
- lm-eval-harness benchmarks (HellaSwag, ARC-Easy, LAMBADA)
- Heavily commented code — every function explains the math and the "why"
```
Token IDs ──> Embedding ──> [TransformerBlock x 12] ──> RMSNorm ──> Linear ──> Logits
```

Each TransformerBlock:

```
input x
 ├──> Attention(x) ──> RMSNorm ──> + (residual) = h       [reordered norm]
 └──> SwiGLU(h)    ──> RMSNorm ──> + (residual) = output
```
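As a concrete illustration, the reordered-norm residual pattern can be sketched in plain Python (a pure-Python stand-in for the tensor ops; `attn` and `ffn` below are placeholder callables, not the real sublayers):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of the features, then scale.
    # No mean subtraction and no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def transformer_block(x, attn, ffn, w_attn, w_ffn):
    # OLMo2 reordered norm: normalize the sublayer OUTPUT before adding the
    # residual (vanilla pre-norm instead normalizes the sublayer INPUT).
    h = [a + b for a, b in zip(x, rms_norm(attn(x), w_attn))]
    return [a + b for a, b in zip(h, rms_norm(ffn(h), w_ffn))]
```

Because only the sublayer output is normalized, the residual stream itself stays un-normalized until the final RMSNorm before the LM head.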
| Hyperparameter | Value |
|---|---|
| d_model | 512 |
| n_layers | 12 |
| n_heads | 8 |
| head_dim | 64 |
| FFN hidden dim | 2048 (SwiGLU) |
| vocab_size | 50,304 (GPT-2 tokenizer) |
| context_length | 2048 |
| Normalization | RMSNorm (no bias, eps=1e-6) |
| Position encoding | RoPE (theta=500,000) |
| Activation | SwiGLU (SiLU-gated) |
| QK norm | RMSNorm on Q and K |
| Norm placement | Reordered (post-sublayer, pre-residual) |
| Total parameters | 101,857,280 |
Parameter breakdown:
- Token embedding: 25.8M
- 12 transformer blocks: 50.3M (4.2M each)
- Final norm + LM head: 25.8M
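The reported total can be reproduced with a little arithmetic. This sketch assumes untied input/output embeddings, no bias terms, and Q/K norms sized to head_dim (assumptions inferred from the breakdown above):

```python
d_model, n_layers, vocab, ffn_hidden, head_dim = 512, 12, 50304, 2048, 64

embedding = vocab * d_model              # token embedding: ~25.8M
lm_head = vocab * d_model                # untied output projection: ~25.8M
attention = 4 * d_model * d_model        # Wq, Wk, Wv, Wo (no bias)
swiglu = 3 * d_model * ffn_hidden        # gate, up, and down projections
norms = 2 * d_model + 2 * head_dim       # two block RMSNorms + Q/K norms
block = attention + swiglu + norms       # ~4.2M per block
total = embedding + lm_head + n_layers * block + d_model  # + final RMSNorm
print(f"{total:,}")  # 101,857,280
```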
- Python 3.10+
- NVIDIA GPU with CUDA support (tested on DGX Spark with GB10, CUDA 13.0)
- ~2 GB disk for the model + checkpoints
- For FineWeb-Edu: additional disk for training data (configurable)
```bash
git clone <repo-url> ailmo
cd ailmo
./setup.sh
```

This creates a virtual environment, installs PyTorch + dependencies, verifies GPU access, and prepares the default dataset (TinyShakespeare).
If you prefer manual setup:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu130  # adjust for your CUDA
pip install tiktoken numpy gradio pandas datasets lm-eval
python3 data.py  # prepare TinyShakespeare
```

To launch the dashboard:

```bash
./start.sh
```

This opens a Gradio web dashboard at http://localhost:7860 (pass --port 8080 for a different port).
Dashboard tabs:
- Training — configure hyperparameters, start/pause/resume/stop training, manage checkpoints
- Monitoring — live charts for loss, validation loss, throughput (tok/s), learning rate
- Generate — test the model with custom prompts, temperature, and top-k controls
- Logs — real-time training log stream
Workflow: Set "Steps per chunk" (e.g., 50), enable "Auto-pause", click "Apply Configuration", then "Start / Resume". The model trains for one chunk, pauses, and waits. Check Monitoring for charts, Generate to test output, then resume when ready.
```bash
./train_full.sh
```

Runs training to completion without interaction. Features:
- Auto-resume: if interrupted, re-run the same command to pick up from the last checkpoint
- Graceful shutdown: Ctrl+C saves a checkpoint before exiting
- Rolling checkpoints: keeps the last N checkpoints (default 5) to save disk space
- JSONL logging: every step is logged to results/training_log.jsonl
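The JSONL format is trivial to produce; a minimal logger in the same spirit (the field names match the sample log records in this README, but this helper itself is hypothetical, not the project's code):

```python
import json
import time

def log_step(path, step, loss, lr, tok_s):
    # Append one JSON object per line; files like this load cleanly with
    # pandas.read_json(..., lines=True) or line-by-line json.loads.
    record = {"step": step, "loss": round(loss, 4), "lr": lr, "tok_s": tok_s,
              "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S")}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Appending (rather than rewriting) keeps the log crash-safe: an interrupted run loses at most the final partial line.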
Common options:

```bash
# Train longer
./train_full.sh --max-steps 50000

# Custom hyperparameters
./train_full.sh --lr 1e-3 --batch-size 64 --grad-accum 2

# Use FineWeb-Edu data (prepare it first — see Data section below)
./train_full.sh --data-dir data/fineweb/

# Adjust checkpoint behavior
./train_full.sh --save-interval 1000 --keep-checkpoints 3 --eval-interval 500
```

All CLI options for train_full.py:
| Option | Default | Description |
|---|---|---|
| `--max-steps` | 5000 | Total training steps |
| `--lr` | 3e-4 | Peak learning rate (cosine schedule with warmup) |
| `--batch-size` | 32 | Micro batch size per step |
| `--grad-accum` | 4 | Gradient accumulation steps (effective batch = batch-size * grad-accum) |
| `--data-dir` | data | Directory containing train.bin and val.bin |
| `--checkpoint-dir` | checkpoints | Where to save model checkpoints |
| `--keep-checkpoints` | 5 | Rolling window of checkpoints to keep |
| `--eval-interval` | 250 | Run validation every N steps |
| `--save-interval` | 500 | Save checkpoint every N steps |
| `--log-interval` | 10 | Print metrics every N steps |
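With the default batch size, gradient accumulation, and context length, the tokens processed per optimizer step work out as:

```python
batch_size, grad_accum, context_length = 32, 4, 2048

effective_batch = batch_size * grad_accum           # 128 sequences per optimizer step
tokens_per_step = effective_batch * context_length  # 262,144 tokens per step
print(effective_batch, tokens_per_step)
```

Gradient accumulation trades wall-clock time for memory: only `batch_size` sequences are resident on the GPU at once, but gradients are summed across `grad_accum` micro-batches before each optimizer step.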
After training, generate text from the command line:
```bash
source .venv/bin/activate
python3 generate.py --checkpoint checkpoints/final.pt --prompt "To be, or not to be," --temperature 0.8 --top-k 50
```

Or use the Generate tab in the Gradio dashboard.
The complete works of Shakespeare (~1MB, 338K tokens). Good for smoke testing and debugging — the model can memorize patterns within a few thousand steps.
```bash
./prepare_data.sh
# or: python3 prepare_data.py
```

A high-quality, educationally focused subset of web text from HuggingFace, filtered from 15 trillion tokens with a Llama-3-based classifier. Data is streamed, so you only download what you need.
```bash
# Download 1 GB (~250M tokens)
./prepare_data.sh --dataset fineweb-edu

# Download 5 GB (~1.25B tokens)
./prepare_data.sh --dataset fineweb-edu --size 5

# Custom output directory
./prepare_data.sh --dataset fineweb-edu --size 5 --output data/fineweb/
```

Approximate token counts per size:
| Size | Tokens | Training time (est.) |
|---|---|---|
| 1 GB | ~250M | A few hours |
| 5 GB | ~1.25B | ~1 day |
| 10 GB | ~2.5B | ~2 days |
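The time estimates follow directly from throughput. For example, at roughly 19,500 tok/s (the figure shown in the sample training log in this README; actual throughput is hardware dependent), one pass over 250M tokens takes:

```python
tokens = 250_000_000      # ~1 GB of FineWeb-Edu
tok_per_sec = 19_500      # example throughput (assumption: sustained)
hours = tokens / tok_per_sec / 3600
print(f"{hours:.1f} h")   # ~3.6 hours per epoch over the data
```

Multiple epochs (or more data) scale linearly, which is where the ~1 day and ~2 day estimates come from.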
All CLI options for prepare_data.py:
| Option | Default | Description |
|---|---|---|
| `--dataset` | tiny_shakespeare | Dataset to prepare (tiny_shakespeare or fineweb-edu) |
| `--size` | 1.0 | Size in GB for FineWeb-Edu |
| `--output` | data | Output directory for .bin files |
| `--val-fraction` | 0.02 | Fraction reserved for validation |
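The .bin files are flat arrays of uint16 token IDs, which works because vocab_size 50,304 < 65,536. A sketch of that layout (illustrative, not the actual prepare_data.py code; `write_bins` is a hypothetical helper):

```python
import array
import os

def write_bins(token_ids, out_dir, val_fraction=0.02):
    # Split the token stream and dump each part as a raw uint16 array,
    # the layout a numpy memmap can later read back without copying.
    n_val = max(1, int(len(token_ids) * val_fraction))
    splits = {"train.bin": token_ids[:-n_val], "val.bin": token_ids[-n_val:]}
    for name, chunk in splits.items():
        with open(os.path.join(out_dir, name), "wb") as f:
            array.array("H", chunk).tofile(f)  # "H" = unsigned 16-bit
```

Two bytes per token means 1 GB of .bin files holds ~500M token IDs, and random access for batching needs no parsing at all.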
Evaluate your trained model using EleutherAI's lm-evaluation-harness, the industry-standard framework used by HuggingFace's Open LLM Leaderboard.
```bash
# Evaluate latest checkpoint with default benchmarks
./evaluate.sh

# Specific checkpoint
./evaluate.sh --checkpoint checkpoints/step_5000.pt

# Custom benchmarks
./evaluate.sh --tasks hellaswag,arc_easy,lambada_openai

# Few-shot evaluation
./evaluate.sh --num-fewshot 5
```

| Benchmark | Type | Random Baseline | Good 100M |
|---|---|---|---|
| HellaSwag | Commonsense reasoning | ~25% | 26-30% |
| ARC-Easy | Elementary science QA | ~25% | 25-30% |
| LAMBADA | Next-word prediction | ~0% | 5-15% |
Results are saved to results/eval_results/ as JSON files. If lm-eval is not installed, the script falls back to built-in perplexity evaluation on the validation set.
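The fallback metric relates directly to the training loss: perplexity is the exponential of the mean cross-entropy. For example, taking the sample final validation loss from this README:

```python
import math

val_loss = 3.62                 # example final_val_loss from a sample run
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")      # ~37.3: on average the model is as uncertain
                                # as a uniform choice over ~37 tokens
```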
All CLI options for evaluate.py:
| Option | Default | Description |
|---|---|---|
| `--checkpoint` | (required) | Path to model checkpoint |
| `--tasks` | hellaswag,arc_easy,lambada_openai | Comma-separated benchmark names |
| `--output` | results/eval_results | Output directory for results JSON |
| `--batch-size` | 16 | Evaluation batch size |
| `--num-fewshot` | 0 | Number of few-shot examples |
```
ailmo/
├── model.py            # Full model: RMSNorm, RoPE, Attention, SwiGLU, Block, LLM
├── configs.py          # ModelConfig and TrainConfig dataclasses
├── data.py             # TokenDataset class and basic data utilities
├── generate.py         # Text generation with temperature and top-k sampling
│
├── train.py            # Simple standalone training loop (no frills)
├── train_full.py       # Production training: auto-resume, signals, JSONL logging
├── engine.py           # TrainingEngine with pause/resume/stop (for dashboard)
├── dashboard.py        # Gradio web dashboard
│
├── prepare_data.py     # Data download and tokenization (Shakespeare or FineWeb-Edu)
├── evaluate.py         # lm-eval-harness benchmark wrapper
│
├── setup.sh            # One-time environment setup
├── start.sh            # Launch Gradio dashboard
├── train_full.sh       # Launch unattended training
├── evaluate.sh         # Launch evaluation (auto-finds latest checkpoint)
├── prepare_data.sh     # Launch data preparation
│
├── requirements.txt    # Python dependencies
├── .gitignore          # Excludes data/, checkpoints/, results/
├── CLAUDE.md           # Project context for AI assistants
│
├── data/               # Tokenized training data (.bin files)
│   ├── train.bin       # Training tokens (memory-mapped uint16)
│   ├── val.bin         # Validation tokens
│   └── metadata.json   # Dataset info (for FineWeb-Edu)
│
├── checkpoints/        # Model checkpoints
│   ├── step_1000.pt    # Periodic checkpoints
│   ├── step_2000.pt
│   └── final.pt        # Final checkpoint after training completes
│
└── results/            # Training outputs
    ├── training_log.jsonl      # Per-step metrics (loss, lr, tok/s, timestamp)
    ├── training_summary.json   # Final stats (loss, time, throughput)
    └── eval_results/           # Benchmark results per checkpoint
        └── final_eval.json
```
One JSON object per logged step:

```json
{"step": 100, "loss": 6.2341, "lr": 0.000285, "tok_s": 19500, "timestamp": "2026-03-29T20:15:30"}
{"step": 250, "loss": 4.8912, "lr": 0.000300, "tok_s": 20100, "timestamp": "2026-03-29T20:18:45", "val_loss": 5.1023}
```

Load with pandas:

```python
import pandas as pd

df = pd.read_json("results/training_log.jsonl", lines=True)
df.plot(x="step", y="loss")
```

results/training_summary.json records the final stats:

```json
{
  "final_step": 5000,
  "final_train_loss": 3.45,
  "final_val_loss": 3.62,
  "total_time_seconds": 1800.5,
  "avg_tokens_per_sec": 19500,
  "model_params": 101857280,
  "checkpoint": "checkpoints/final.pt"
}
```

Benchmark results land in results/eval_results/ (e.g. final_eval.json):

```json
{
  "checkpoint": "checkpoints/final.pt",
  "tasks": ["hellaswag"],
  "num_fewshot": 0,
  "results": {
    "hellaswag": {
      "acc,none": 0.2634,
      "acc_norm,none": 0.2701
    }
  }
}
```

The code is designed to be read and learned from. Start with these files in order:
1. configs.py — All hyperparameters in two dataclasses. Read this first to understand the model dimensions.
2. model.py — The core of the project. Every class has a 10+ line docstring explaining:
   - What the component does and why it exists
   - The mathematical formula with variable names mapped to code
   - How it differs from the vanilla Transformer
   - References to the relevant papers

   Read bottom-up: RMSNorm -> RoPE -> Attention -> SwiGLUFFN -> TransformerBlock -> LLM
3. data.py — How text becomes numbers: tokenization with tiktoken, memory-mapped binary files for efficient random access.
4. train.py — A clean, minimal training loop. Good for understanding the basics: forward pass, loss, backward, optimizer step, learning rate schedule.
5. generate.py — Autoregressive text generation: temperature scaling, top-k sampling, greedy decoding.
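The sampling procedure (temperature scaling followed by top-k) can be sketched in pure Python. This is an illustrative sketch of the technique, not the project's actual implementation:

```python
import math
import random

def sample_top_k(logits, k=50, temperature=0.8, rng=None):
    # 1) temperature-scale, 2) keep only the k largest logits,
    # 3) softmax over the survivors, 4) sample from that distribution.
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    kth = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
    masked = [s if s >= kth else float("-inf") for s in scaled]
    m = max(masked)
    weights = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for i, w in enumerate(weights):              # inverse-CDF sampling
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1
```

Lower temperature sharpens the distribution toward the argmax; `k=1` degenerates to greedy decoding.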
| Concept | File | What to learn |
|---|---|---|
| RMSNorm | model.py | Simpler alternative to LayerNorm; why we upcast to float32 |
| RoPE | model.py | Encoding position through rotation; complex number trick |
| QK Norm | model.py | Preventing attention logit explosion in bf16 training |
| SwiGLU | model.py | Gated MLPs; why 3 weight matrices beat 2 |
| Reordered Norm | model.py | OLMo2's norm placement innovation |
| Cosine LR Schedule | train.py | Warmup + cosine decay; why it works |
| Gradient Accumulation | train_full.py | Simulating large batches on limited memory |
| Mixed Precision | train_full.py | bf16 autocast for speed + memory savings |
| Causal Masking | model.py | Why language models can only look backward |
| Memory-mapped Data | data.py | Handling datasets larger than RAM |
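The warmup + cosine schedule from the table fits in a few lines. The warmup length and floor below are illustrative values, not the project's exact defaults:

```python
import math

def lr_at(step, max_steps=5000, peak_lr=3e-4, warmup=100, min_lr=3e-5):
    # Linear warmup to the peak, then cosine decay down to the floor.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)  # 0 -> 1 over decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids large unstable updates while the optimizer state is cold; the cosine tail lets the loss settle near the end of training.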
On a DGX Spark (GB10 GPU, 128GB unified memory):
```
Step 0:    loss=10.83  (random, ~ln(50304))
Step 100:  loss=7.50   (learning basic token frequencies)
Step 500:  loss=5.20   (learning common phrases)
Step 1000: loss=4.30   (learning grammar patterns)
Step 5000: loss=3.50   (coherent short passages)
```
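The starting loss is not arbitrary: before any learning, the cross-entropy of a uniform prediction over the vocabulary is ln(vocab_size):

```python
import math

vocab_size = 50_304
print(round(math.log(vocab_size), 2))  # 10.83, matching the step-0 loss
```

Checking your first logged loss against this value is a quick sanity test that the data pipeline and loss computation are wired correctly.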
With FineWeb-Edu data and 50K+ steps, expect:
- HellaSwag: 27-30%
- Coherent paragraph-level generation
- Basic factual knowledge
This is an educational project. The code is provided as-is for learning purposes.
- OLMo2: Open Language Model 2 — Allen AI, 2025
- RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al., 2021
- GLU Variants Improve Transformer — Shazeer, 2020
- Root Mean Square Layer Normalization — Zhang & Sennrich, 2019
- FineWeb-Edu Dataset — HuggingFace, 2024
- lm-evaluation-harness — EleutherAI