# ailmo

A 100M parameter language model built from scratch in PyTorch, following the OLMo2 architecture from the Allen Institute for AI (Ai2).

This is a hands-on learning project for understanding how modern LLMs work — every component (attention, RoPE, SwiGLU, normalization) is implemented from first principles with detailed comments explaining the math and design decisions.

## Features

- OLMo2 architecture: RMSNorm, Rotary Position Embeddings, SwiGLU FFN, QK norm, reordered norm
- 101.8M trainable parameters (configurable)
- bf16 mixed-precision training on NVIDIA GPUs
- Gradio web dashboard with start/pause/resume/stop controls and live charts
- Unattended training with auto-resume from checkpoints, graceful signal handling
- FineWeb-Edu integration for real-world training data (streaming from HuggingFace)
- lm-eval-harness benchmarks (HellaSwag, ARC-Easy, LAMBADA)
- Heavily commented code — every function explains the math and the "why"

## Architecture

```
Token IDs ──> Embedding ──> [TransformerBlock x 12] ──> RMSNorm ──> Linear ──> Logits
```

Each TransformerBlock:

```
x ──> Attention ──> RMSNorm ──> (+ x) = h         [reordered norm]
h ──> SwiGLU    ──> RMSNorm ──> (+ h) = output
```
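
A minimal NumPy sketch of this placement (the `attention` and `swiglu` callables below are identity stand-ins for illustration, not the project's modules):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def block(x, attention, swiglu):
    # Reordered norm (OLMo2): normalize the sublayer *output*, then add the
    # residual -- pre-norm would instead normalize the sublayer *input*.
    h = x + rms_norm(attention(x))
    return h + rms_norm(swiglu(h))

x = np.random.randn(4, 512)
out = block(x, attention=lambda t: t, swiglu=lambda t: t)
print(out.shape)  # (4, 512)
```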

| Hyperparameter | Value |
|---|---|
| d_model | 512 |
| n_layers | 12 |
| n_heads | 8 |
| head_dim | 64 |
| FFN hidden dim | 2048 (SwiGLU) |
| vocab_size | 50,304 (GPT-2 tokenizer) |
| context_length | 2048 |
| Normalization | RMSNorm (no bias, eps=1e-6) |
| Position encoding | RoPE (theta=500,000) |
| Activation | SwiGLU (SiLU-gated) |
| QK norm | RMSNorm on Q and K |
| Norm placement | Reordered (post-sublayer, pre-residual) |
| Total parameters | 101,857,280 |

Parameter breakdown:

- Token embedding: 25.8M
- 12 transformer blocks: 50.3M (4.2M each)
- Final norm + LM head: 25.8M
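
These numbers can be checked by hand. A quick sanity sketch, assuming untied embedding/LM-head weights, bias-free linear layers, and QK-norm scales of size `head_dim` (assumptions inferred from the table above — they are what make the arithmetic land on the stated total):

```python
d_model, n_layers, head_dim, ffn, vocab = 512, 12, 64, 2048, 50304

embedding = vocab * d_model              # token embedding table
attn      = 4 * d_model * d_model        # Wq, Wk, Wv, Wo (no biases)
swiglu    = 3 * d_model * ffn            # gate, up, down projections
norms     = 2 * d_model                  # attention-norm + FFN-norm weights
qk_norm   = 2 * head_dim                 # RMSNorm scales for Q and K
block     = attn + swiglu + norms + qk_norm

# embedding + blocks + final norm + untied LM head
total = embedding + n_layers * block + d_model + vocab * d_model
print(total)  # 101857280
```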

## Requirements

- Python 3.10+
- NVIDIA GPU with CUDA support (tested on DGX Spark with GB10, CUDA 13.0)
- ~2 GB disk for the model + checkpoints
- For FineWeb-Edu: additional disk for training data (configurable)

## Setup

```bash
git clone <repo-url> ailmo
cd ailmo
./setup.sh
```

This creates a virtual environment, installs PyTorch + dependencies, verifies GPU access, and prepares the default dataset (TinyShakespeare).

If you prefer manual setup:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu130  # adjust for your CUDA
pip install tiktoken numpy gradio pandas datasets lm-eval
python3 prepare_data.py  # prepare TinyShakespeare
```

## Quick Start

### 1. Interactive Dashboard

```bash
./start.sh
```

Opens a Gradio web dashboard at http://localhost:7860 (or pass `--port 8080` to use a different port).

Dashboard tabs:

- Training — configure hyperparameters, start/pause/resume/stop training, manage checkpoints
- Monitoring — live charts for loss, validation loss, throughput (tok/s), learning rate
- Generate — test the model with custom prompts, temperature, and top-k controls
- Logs — real-time training log stream

Workflow: Set "Steps per chunk" (e.g., 50), enable "Auto-pause", click "Apply Configuration", then "Start / Resume". The model trains for one chunk, pauses, and waits. Check Monitoring for charts, Generate to test output, then resume when ready.

### 2. Unattended Training

```bash
./train_full.sh
```

Runs training to completion without interaction. Features:

- Auto-resume: if interrupted, re-run the same command to pick up from the last checkpoint
- Graceful shutdown: Ctrl+C saves a checkpoint before exiting
- Rolling checkpoints: keeps the last N checkpoints (default 5) to save disk space
- JSONL logging: every step is logged to `results/training_log.jsonl`

Common options:

```bash
# Train longer
./train_full.sh --max-steps 50000

# Custom hyperparameters
./train_full.sh --lr 1e-3 --batch-size 64 --grad-accum 2

# Use FineWeb-Edu data (prepare it first — see Data section below)
./train_full.sh --data-dir data/fineweb/

# Adjust checkpoint behavior
./train_full.sh --save-interval 1000 --keep-checkpoints 3 --eval-interval 500
```

All CLI options for `train_full.py`:

| Option | Default | Description |
|---|---|---|
| `--max-steps` | 5000 | Total training steps |
| `--lr` | 3e-4 | Peak learning rate (cosine schedule with warmup) |
| `--batch-size` | 32 | Micro batch size per step |
| `--grad-accum` | 4 | Gradient accumulation steps (effective batch = batch-size × grad-accum) |
| `--data-dir` | data | Directory containing `train.bin` and `val.bin` |
| `--checkpoint-dir` | checkpoints | Where to save model checkpoints |
| `--keep-checkpoints` | 5 | Rolling window of checkpoints to keep |
| `--eval-interval` | 250 | Run validation every N steps |
| `--save-interval` | 500 | Save checkpoint every N steps |
| `--log-interval` | 10 | Print metrics every N steps |
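
The warmup-plus-cosine schedule mentioned for `--lr` reduces to a few lines. A sketch, where the warmup length and minimum LR are illustrative assumptions, not the script's actual defaults:

```python
import math

def lr_at(step, max_steps=5000, peak_lr=3e-4, warmup=100, min_lr=3e-5):
    # Hypothetical warmup + cosine-decay schedule: linear ramp for `warmup`
    # steps, then cosine decay from peak_lr down to min_lr.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)  # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))

print(lr_at(0), lr_at(2550), lr_at(5000))
```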

### 3. Text Generation

After training, generate text from the command line:

```bash
source .venv/bin/activate
python3 generate.py --checkpoint checkpoints/final.pt --prompt "To be, or not to be," --temperature 0.8 --top-k 50
```

Or use the Generate tab in the Gradio dashboard.
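
Temperature and top-k sampling boil down to scaling the logits and renormalizing over the k best candidates. A minimal NumPy sketch, not the project's `generate.py` code:

```python
import numpy as np

def sample_top_k(logits, temperature=0.8, k=50, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / temperature                 # <1 sharpens, >1 flattens
    top = np.argsort(scaled)[-k:]                 # indices of the k largest logits
    p = np.exp(scaled[top] - scaled[top].max())   # numerically stable softmax
    p /= p.sum()
    return int(top[rng.choice(k, p=p)])

logits = np.random.default_rng(0).standard_normal(50304)
print(sample_top_k(logits))
```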

## Training Data

### TinyShakespeare (default)

The complete works of Shakespeare (~1MB, 338K tokens). Good for smoke testing and debugging — the model can memorize patterns within a few thousand steps.

```bash
./prepare_data.sh
# or: python3 prepare_data.py
```

### FineWeb-Edu (recommended for real training)

A high-quality, educationally focused subset of web text from HuggingFace, filtered from the 15-trillion-token FineWeb corpus with a quality classifier trained on Llama 3 annotations. Data is streamed, so you only download what you need.

```bash
# Download 1GB (~250M tokens)
./prepare_data.sh --dataset fineweb-edu

# Download 5GB (~1.25B tokens)
./prepare_data.sh --dataset fineweb-edu --size 5

# Custom output directory
./prepare_data.sh --dataset fineweb-edu --size 5 --output data/fineweb/
```

Approximate token counts per size:

| Size | Tokens | Training time (est.) |
|---|---|---|
| 1 GB | ~250M | A few hours |
| 5 GB | ~1.25B | ~1 day |
| 10 GB | ~2.5B | ~2 days |

All CLI options for `prepare_data.py`:

| Option | Default | Description |
|---|---|---|
| `--dataset` | tiny_shakespeare | Dataset to prepare (`tiny_shakespeare` or `fineweb-edu`) |
| `--size` | 1.0 | Size in GB for FineWeb-Edu |
| `--output` | data | Output directory for `.bin` files |
| `--val-fraction` | 0.02 | Fraction reserved for validation |
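
The `.bin` files are raw uint16 token IDs, which is why they can be memory-mapped for cheap random access. A self-contained sketch of the read path (a synthetic file stands in for a real `train.bin`):

```python
import os
import tempfile
import numpy as np

# Write a stand-in token file: raw uint16 IDs, no header (GPT-2's 50,304-token
# vocabulary fits in 16 bits).
path = os.path.join(tempfile.mkdtemp(), "train.bin")
np.arange(5000, dtype=np.uint16).tofile(path)

tokens = np.memmap(path, dtype=np.uint16, mode="r")  # no RAM copy of the file
ctx = 2048
i = 0                                             # any offset < len(tokens) - ctx
x = tokens[i : i + ctx].astype(np.int64)          # input context
y = tokens[i + 1 : i + ctx + 1].astype(np.int64)  # next-token targets
print(len(tokens), x[0], y[0])  # 5000 0 1
```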

## Evaluation

Evaluate your trained model using EleutherAI's lm-evaluation-harness, the industry-standard framework used by HuggingFace's Open LLM Leaderboard.

```bash
# Evaluate latest checkpoint with default benchmarks
./evaluate.sh

# Specific checkpoint
./evaluate.sh --checkpoint checkpoints/step_5000.pt

# Custom benchmarks
./evaluate.sh --tasks hellaswag,arc_easy,lambada_openai

# Few-shot evaluation
./evaluate.sh --num-fewshot 5
```

### Recommended Benchmarks for 100M Models

| Benchmark | Type | Random baseline | Good 100M |
|---|---|---|---|
| HellaSwag | Commonsense reasoning | ~25% | 26-30% |
| ARC-Easy | Elementary science QA | ~25% | 25-30% |
| LAMBADA | Next-word prediction | ~0% | 5-15% |

Results are saved to `results/eval_results/` as JSON files. If `lm-eval` is not installed, the script falls back to built-in perplexity evaluation on the validation set.
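
The fallback metric is easy to interpret: perplexity is the exponential of the mean cross-entropy loss, so, for instance, a validation loss of 3.62 corresponds to:

```python
import math

val_loss = 3.62               # mean cross-entropy in nats
print(math.exp(val_loss))     # perplexity, roughly 37
```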

All CLI options for `evaluate.py`:

| Option | Default | Description |
|---|---|---|
| `--checkpoint` | (required) | Path to model checkpoint |
| `--tasks` | hellaswag,arc_easy,lambada_openai | Comma-separated benchmark names |
| `--output` | results/eval_results | Output directory for results JSON |
| `--batch-size` | 16 | Evaluation batch size |
| `--num-fewshot` | 0 | Number of few-shot examples |

## Project Structure

```
ailmo/
├── model.py             # Full model: RMSNorm, RoPE, Attention, SwiGLU, Block, LLM
├── configs.py           # ModelConfig and TrainConfig dataclasses
├── data.py              # TokenDataset class and basic data utilities
├── generate.py          # Text generation with temperature and top-k sampling
│
├── train.py             # Simple standalone training loop (no frills)
├── train_full.py        # Production training: auto-resume, signals, JSONL logging
├── engine.py            # TrainingEngine with pause/resume/stop (for dashboard)
├── dashboard.py         # Gradio web dashboard
│
├── prepare_data.py      # Data download and tokenization (Shakespeare or FineWeb-Edu)
├── evaluate.py          # lm-eval-harness benchmark wrapper
│
├── setup.sh             # One-time environment setup
├── start.sh             # Launch Gradio dashboard
├── train_full.sh        # Launch unattended training
├── evaluate.sh          # Launch evaluation (auto-finds latest checkpoint)
├── prepare_data.sh      # Launch data preparation
│
├── requirements.txt     # Python dependencies
├── .gitignore           # Excludes data/, checkpoints/, results/
├── CLAUDE.md            # Project context for AI assistants
│
├── data/                # Tokenized training data (.bin files)
│   ├── train.bin        # Training tokens (memory-mapped uint16)
│   ├── val.bin          # Validation tokens
│   └── metadata.json    # Dataset info (for FineWeb-Edu)
│
├── checkpoints/         # Model checkpoints
│   ├── step_1000.pt     # Periodic checkpoints
│   ├── step_2000.pt
│   └── final.pt         # Final checkpoint after training completes
│
└── results/             # Training outputs
    ├── training_log.jsonl    # Per-step metrics (loss, lr, tok/s, timestamp)
    ├── training_summary.json # Final stats (loss, time, throughput)
    └── eval_results/         # Benchmark results per checkpoint
        └── final_eval.json
```

## Results Format

### `training_log.jsonl`

One JSON object per logged step:

```jsonl
{"step": 100, "loss": 6.2341, "lr": 0.000285, "tok_s": 19500, "timestamp": "2026-03-29T20:15:30"}
{"step": 250, "loss": 4.8912, "lr": 0.000300, "tok_s": 20100, "timestamp": "2026-03-29T20:18:45", "val_loss": 5.1023}
```

Load with pandas:

```python
import pandas as pd

df = pd.read_json("results/training_log.jsonl", lines=True)
df.plot(x="step", y="loss")
```

### `training_summary.json`

```json
{
  "final_step": 5000,
  "final_train_loss": 3.45,
  "final_val_loss": 3.62,
  "total_time_seconds": 1800.5,
  "avg_tokens_per_sec": 19500,
  "model_params": 101857280,
  "checkpoint": "checkpoints/final.pt"
}
```

### Eval results

```json
{
  "checkpoint": "checkpoints/final.pt",
  "tasks": ["hellaswag"],
  "num_fewshot": 0,
  "results": {
    "hellaswag": {
      "acc,none": 0.2634,
      "acc_norm,none": 0.2701
    }
  }
}
```

## Understanding the Code

The code is designed to be read and learned from. Start with these files in order:

1. `configs.py` — All hyperparameters in two dataclasses. Read this first to understand the model dimensions.

2. `model.py` — The core of the project. Every class has a 10+ line docstring explaining:

   - What the component does and why it exists
   - The mathematical formula with variable names mapped to code
   - How it differs from the vanilla Transformer
   - References to the relevant papers

   Read bottom-up: `RMSNorm` → `RoPE` → `Attention` → `SwiGLUFFN` → `TransformerBlock` → `LLM`

3. `data.py` — How text becomes numbers. Tokenization with tiktoken, memory-mapped binary files for efficient random access.

4. `train.py` — A clean, minimal training loop. Good for understanding the basics: forward pass, loss, backward, optimizer step, learning rate schedule.

5. `generate.py` — Autoregressive text generation. Temperature scaling, top-k sampling, greedy decoding.

## Key Concepts Implemented

| Concept | File | What to learn |
|---|---|---|
| RMSNorm | model.py | Simpler alternative to LayerNorm; why we upcast to float32 |
| RoPE | model.py | Encoding position through rotation; complex number trick |
| QK norm | model.py | Preventing attention logit explosion in bf16 training |
| SwiGLU | model.py | Gated MLPs; why 3 weight matrices beat 2 |
| Reordered norm | model.py | OLMo2's norm placement innovation |
| Cosine LR schedule | train.py | Warmup + cosine decay; why it works |
| Gradient accumulation | train_full.py | Simulating large batches on limited memory |
| Mixed precision | train_full.py | bf16 autocast for speed + memory savings |
| Causal masking | model.py | Why language models can only look backward |
| Memory-mapped data | data.py | Handling datasets larger than RAM |
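
As a taste of what `model.py` walks through: RoPE reduces to rotating each (even, odd) feature pair by a position-dependent angle. A NumPy sketch (the project itself uses the complex-number formulation):

```python
import numpy as np

def rope(x, theta=500_000.0):
    # x: (seq_len, head_dim) for one attention head. Each (even, odd) pair of
    # features is rotated by angle position * theta^(-2i/head_dim).
    T, d = x.shape
    freqs = theta ** (-np.arange(0, d, 2) / d)    # (d/2,) per-pair frequencies
    ang = np.outer(np.arange(T), freqs)           # (T, d/2) rotation angles
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[:, 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

x = np.random.default_rng(0).standard_normal((16, 64))
out = rope(x)
print(out.shape)  # (16, 64)
```

Because each pair is only rotated, vector norms (and hence attention scale) are preserved, and position 0 is left unchanged.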

## Typical Training Run

On a DGX Spark (GB10 GPU, 128GB unified memory):

```
Step 0:      loss=10.83  (random, ~ln(50304))
Step 100:    loss=7.50   (learning basic token frequencies)
Step 500:    loss=5.20   (learning common phrases)
Step 1000:   loss=4.30   (learning grammar patterns)
Step 5000:   loss=3.50   (coherent short passages)
```
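
The step-0 value is exactly what an untrained model should score: with a uniform distribution over the 50,304-token vocabulary, cross-entropy is ln(vocab_size):

```python
import math

print(math.log(50304))  # ~10.826, matching the observed step-0 loss
```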

With FineWeb-Edu data and 50K+ steps, expect:

- HellaSwag: 27-30%
- Coherent paragraph-level generation
- Basic factual knowledge

## License

This is an educational project. The code is provided as-is for learning purposes.
