val_bpb: TBD (3-seed mean) | 8xH100 SXM
This combines the PR #1089 Turbo-Muon stack with a novel test-time training approach:
- PR #1089's Turbo-Muon stack (1.1091 BPB) -- faster Newton-Schulz, EngramLite, parameter banking, mixed-precision GPTQ, brotli compression
- LaCT (Large Chunk TTT with Muon) -- test-time training using the Muon optimizer instead of SGD for weight adaptation at evaluation time
Standard TTT methods use SGD to adapt model weights on each chunk of the evaluation data. LaCT replaces SGD with the Muon optimizer, which applies Newton-Schulz orthogonalized updates. This gives higher-quality weight updates per step, so fewer training epochs are needed (2 vs 5 for SGD-based TTT), meaning more chunks can be processed within the evaluation time budget.
Key advantages of MuonTTT over SGD-based TTT:
- Higher-quality updates per step via Newton-Schulz orthogonalization
- 2 epochs sufficient vs 5 for SGD, so each chunk processes faster
- Large-chunk processing achieves ~70% GPU utilization vs <5% for per-token TTT
- More chunks processed within the fixed evaluation budget
11 layers, 512d, 8 query heads, 4 KV heads (GQA). MLP 3.5x with per-layer LeakyReLU^2 slopes (ASQU v3). XSA on all layers. Partial RoPE (16/64 dims). U-Net skip connections with sigmoid gates. EngramLite bigram+trigram hash embeddings (2 heads, 8192 buckets). SmearGate. ValueEmbedding on layers 9-10.
Standard Muon uses 5 Newton-Schulz iterations. Turbo-Muon cuts that to 4:
- AOL preconditioning via left-Gram Gershgorin scaling
- Polar Express coefficients (Amsel et al., arXiv:2505.16932)
- Post-NS row+column normalization
Fewer iterations = faster steps = more training in 600s.
At evaluation time, for each chunk of the test data:
- Temporarily adapt model weights on the chunk using the Muon optimizer (Newton-Schulz orthogonalized updates)
- Run 2 epochs of TTT at lr=0.02 -- Muon's higher-quality updates mean fewer epochs are needed compared to SGD (which typically requires 5)
- Evaluate the adapted model on the chunk, then restore weights for the next chunk
The Muon optimizer reuses the same Newton-Schulz iteration pipeline from training, so the TTT step inherits Turbo-Muon's speedups (4 NS iterations with Polar Express coefficients). This keeps per-chunk TTT overhead low enough to process all chunks within the evaluation budget.
Brotli q=11 + byte-shuffle. Mixed-precision: base int5, sensitive groups promoted to int6/int7 based on Hessian trace. Selective +/-1,+/-2 pruning to hit 16MB exactly.
| Seed | PR #1089 baseline | This fork (LaCT) | Delta |
|---|---|---|---|
| 42 | 1.1086 | TBD | -- |
| 1337 | 1.1090 | TBD | -- |
| 2025 | 1.1096 | TBD | -- |
# One-shot deploy to RunPod (from your local machine)
bash deploy_runpod.sh root@<runpod-ip> 42
# Or all 3 seeds
bash deploy_3seeds.sh root@<runpod-ip>
# Or manually on the pod
pip install brotli
TTT_ENABLED=1 TTT_OPTIMIZER=muon TTT_EPOCHS=2 TTT_LR=0.02 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
# Disable LaCT to match PR #1089 exactly (for comparison)
TTT_ENABLED=0 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.pytrain_gpt.py-- compressed self-extracting submission (~33KB)train_gpt_human.py-- human-readable source (~150KB)deploy_runpod.sh-- one-shot deploy to RunPod (upload + install + run)deploy_3seeds.sh-- run all 3 seedsrun.sh-- on-pod launch with preflight checkscompress.py-- builds train_gpt.py from train_gpt_human.pysubmission.json/README.md
Base code: PR #1089 by @mikeapedia (Turbo-Muon + EngramLite + ParamBanking). LaCT (MuonTTT) is our novel contribution -- applying the Muon optimizer to test-time training for higher-quality weight adaptation. Lineage: PR #609, #399, #493, #265/#287 (XSA). Turbo-Muon: Polar Express (Amsel et al., arXiv:2505.16932).