Turbo-Muon + LaCT (MuonTTT)

val_bpb: TBD (3-seed mean) | 8xH100 SXM

What this is

This combines the PR #1089 Turbo-Muon stack with a novel test-time training approach:

PR #1089's Turbo-Muon stack (1.1091 BPB) -- faster Newton-Schulz, EngramLite, parameter banking, mixed-precision GPTQ, brotli compression
LaCT (Large Chunk TTT with Muon) -- test-time training using the Muon optimizer instead of SGD for weight adaptation at evaluation time

Standard TTT methods use SGD to adapt model weights on each chunk of the evaluation data. LaCT replaces SGD with the Muon optimizer, which applies Newton-Schulz orthogonalized updates. This gives higher-quality weight updates per step, so fewer training epochs are needed (2 vs 5 for SGD-based TTT), meaning more chunks can be processed within the evaluation time budget.

Key advantages of MuonTTT over SGD-based TTT:

Higher-quality updates per step via Newton-Schulz orthogonalization
2 epochs sufficient vs 5 for SGD, so each chunk processes faster
Large-chunk processing achieves ~70% GPU utilization vs <5% for per-token TTT
More chunks processed within the fixed evaluation budget

Architecture

11 layers, 512d, 8 query heads, 4 KV heads (GQA). MLP 3.5x with per-layer LeakyReLU^2 slopes (ASQU v3). XSA on all layers. Partial RoPE (16/64 dims). U-Net skip connections with sigmoid gates. EngramLite bigram+trigram hash embeddings (2 heads, 8192 buckets). SmearGate. ValueEmbedding on layers 9-10.

Turbo-Muon

Standard Muon uses 5 Newton-Schulz iterations. Turbo-Muon cuts that to 4:

AOL preconditioning via left-Gram Gershgorin scaling
Polar Express coefficients (Amsel et al., arXiv:2505.16932)
Post-NS row+column normalization

Fewer iterations = faster steps = more training in 600s.

LaCT (MuonTTT) -- our novel contribution

At evaluation time, for each chunk of the test data:

Temporarily adapt model weights on the chunk using the Muon optimizer (Newton-Schulz orthogonalized updates)
Run 2 epochs of TTT at lr=0.02 -- Muon's higher-quality updates mean fewer epochs are needed compared to SGD (which typically requires 5)
Evaluate the adapted model on the chunk, then restore weights for the next chunk

The Muon optimizer reuses the same Newton-Schulz iteration pipeline from training, so the TTT step inherits Turbo-Muon's speedups (4 NS iterations with Polar Express coefficients). This keeps per-chunk TTT overhead low enough to process all chunks within the evaluation budget.

Compression

Brotli q=11 + byte-shuffle. Mixed-precision: base int5, sensitive groups promoted to int6/int7 based on Hessian trace. Selective +/-1,+/-2 pruning to hit 16MB exactly.

Results

Seed	PR #1089 baseline	This fork (LaCT)	Delta
42	1.1086	TBD	--
1337	1.1090	TBD	--
2025	1.1096	TBD	--

How to run

# One-shot deploy to RunPod (from your local machine)
bash deploy_runpod.sh root@<runpod-ip> 42

# Or all 3 seeds
bash deploy_3seeds.sh root@<runpod-ip>

# Or manually on the pod
pip install brotli
TTT_ENABLED=1 TTT_OPTIMIZER=muon TTT_EPOCHS=2 TTT_LR=0.02 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py

# Disable LaCT to match PR #1089 exactly (for comparison)
TTT_ENABLED=0 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py

Files

train_gpt.py -- compressed self-extracting submission (~33KB)
train_gpt_human.py -- human-readable source (~150KB)
deploy_runpod.sh -- one-shot deploy to RunPod (upload + install + run)
deploy_3seeds.sh -- run all 3 seeds
run.sh -- on-pod launch with preflight checks
compress.py -- builds train_gpt.py from train_gpt_human.py
submission.json / README.md

Credits

Base code: PR #1089 by @mikeapedia (Turbo-Muon + EngramLite + ParamBanking). LaCT (MuonTTT) is our novel contribution -- applying the Muon optimizer to test-time training for higher-quality weight adaptation. Lineage: PR #609, #399, #493, #265/#287 (XSA). Turbo-Muon: Polar Express (Amsel et al., arXiv:2505.16932).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Turbo-Muon + LaCT (MuonTTT)

What this is

Architecture

Turbo-Muon

LaCT (MuonTTT) -- our novel contribution

Compression

Results

How to run

Files

Credits

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
compress.py		compress.py
deploy_3seeds.sh		deploy_3seeds.sh
deploy_runpod.sh		deploy_runpod.sh
run.sh		run.sh
submission.json		submission.json
train_gpt.py		train_gpt.py
train_gpt_human.py		train_gpt_human.py

Folders and files

Latest commit

History

Repository files navigation

Turbo-Muon + LaCT (MuonTTT)

What this is

Architecture

Turbo-Muon

LaCT (MuonTTT) -- our novel contribution

Compression

Results

How to run

Files

Credits

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages