# EPIC: Parameter Golf Hackathon Submission

- **Status:** IN-FLIGHT
- **Priority:** CRITICAL (P0)
- **Deadline:** 30 April 2026 (~9 days remaining)
- **Owner:** LEAD
- **Spec:** `.trinity/specs/epic-parameter-golf.md`
## Executive Summary

OpenAI Model Craft Challenge: Parameter Golf — train the best language model that fits in a 16 MB artifact (weights + code) and trains in <10 minutes on 8xH100, scored by tokenizer-agnostic bits-per-byte (BPB) on the FineWeb validation set.
Our strategy: byte-level Trinity 3^k architecture + GF16 quantization + phi-init.
| Metric | Value |
|---|---|
| SOTA (2026-04-20) | 1.0810 BPB (bigbag) |
| Our target | < 1.15 BPB |
| OpenAI baseline | 1.2244 BPB (9-layer, dim=512, vocab=1024) |
| Artifact budget | 16 MB (weights + train_gpt.py combined) |
| Training budget | 10 min wall-clock / 8xH100 SXM |
| Eval metric | tokenizer-agnostic bits-per-byte |
| Submission deadline | 30 April 2026 |
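The scoring rule is simple to state: total negative log-likelihood over the validation text, normalized by its byte count. A minimal sketch of our internal definition (the official harness belongs to the organizers):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Tokenizer-agnostic BPB: total NLL (in nats) over the validation
    text, divided by its UTF-8 byte count, converted to bits.
    For a pure byte-level model, tokens == bytes and this reduces to
    mean cross-entropy per byte / ln(2)."""
    return total_nll_nats / (total_bytes * math.log(2))
```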
## Research & SOTA Analysis

### Competition Landscape

The leaderboard dropped from 1.2244 to ~1.08 BPB in ~4 weeks. Key techniques from top submissions:

- **Sliding window evaluation:** stride-64 eval instead of a fixed context; moved ~1.20 to ~1.19 BPB (see the sketch after this list)
- **EMA / SWA weight averaging:** smoother final weights that compress and generalize better
- **N-gram caching:** hybrid neural + n-gram approach pushed sub-1.0 BPB (controversial: compression vs. LM quality)
- **Muon optimizer:** replaced AdamW; Newton-Schulz orthogonalization; 35% faster convergence (proven in the NanoGPT speedrun)
- **Modernized architecture:** RoPE, QK-Norm, ReLU^2 (from the modded-nanogpt lineage)
- **Hardware lottery:** top pods run 83-88 ms/step, slow pods 260 ms/step; same code, 3x variance
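The sliding-window trick is cheap to replicate. A sketch of stride-64 evaluation, assuming a PyTorch-style model that maps a `(1, T)` byte tensor to `(1, T, vocab)` logits (that model API is an assumption, not our actual interface):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, ids, ctx_len=512, stride=64):
    """Total NLL (nats) over a byte sequence using overlapping windows.

    Standard strided-perplexity recipe: each window scores only the
    bytes the previous window did not, so nearly every byte is
    predicted with close-to-full left context."""
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(ids) - 1, stride):
        end = min(begin + ctx_len, len(ids))
        window = ids[begin:end].unsqueeze(0)            # (1, T)
        logits = model(window)                          # (1, T, V) assumed
        logp = F.log_softmax(logits[0, :-1].float(), dim=-1)
        tok_nll = -logp.gather(1, window[0, 1:, None]).squeeze(1)
        n_new = end - max(prev_end, begin + 1)          # not-yet-scored bytes
        nll_sum += tok_nll[-n_new:].sum().item()
        n_scored += n_new
        prev_end = end
        if end == len(ids):
            break
    return nll_sum, n_scored
```

BPB is then `nll_sum / (n_scored * math.log(2))`; the stride trades roughly `ctx_len/stride` times more eval compute for the BPB gain reported above.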
### Relevant Research

| Paper / Project | Key Insight | Relevance |
|---|---|---|
| Byte Latent Transformer (BLT) — Meta, ACL 2025 | Byte-level LLM matching tokenizer-based models at scale via adaptive patching | Validates byte-level approach; patch grouping strategy |
| EvaByte — HKU, 2025 | 6.5B byte-level LM with multibyte prediction + EVA attention; 5-10x faster than vanilla | Multibyte prediction technique for byte models |
| ByteFlow — 2025 | Adaptive byte compression + chunking outperforms BPE | End-to-end tokenizer-free modeling is feasible |
| Bolmo — 2026 | Byteifying existing subword LMs at 1B/7B scale | Byteification strategy; vocab-free approach |
| Muon Optimizer — Jordan et al. | Momentum + Newton-Schulz orthogonalization; SOTA for the NanoGPT speedrun | Critical optimizer choice; 35% speedup over AdamW |
| NorMuon — 2025 | Muon + neuron-wise adaptive scaling | More stable Muon variant for small models |
| Unweight — Cloudflare, 2026 | Lossless 15-22% weight compression via Huffman + palette encoding | Post-training compression to fit the 16 MB budget |
| Phi-init — Academia, 2026 | Golden ratio as a structural prior for width/depth ratio in neural architectures | Phi-based initialization for the Trinity architecture |
| Fibonacci Init — 2024 | Fibonacci-sequence weight init; 94% accuracy on small datasets | Supports the phi-based init hypothesis |
| modded-nanogpt — Keller Jordan | RoPE + QK-Norm + ReLU^2 + Muon; cumulative 20% world-record (WR) time drop in 3 months | Architecture recipe for fast convergence |
### Our Differentiators (Trinity 3^k)

- **All dimensions are powers of 3:** hidden=243 (3^5), heads=27 (3^3), vocab=729 (3^6); hardware-aligned, cache-friendly
- **GF16 quantization:** custom 16-bit golden-float format using a phi-based mantissa distribution
- **Tied embeddings:** in/out share weights; critical for the 16 MB budget (embeddings are 15%+ of small-model params)
- **Phi-init:** weight initialization using golden-ratio scaling (validated in issue "Ch.4 — Golden Cut (The Platonic Rupture) [Lane B]", refs #30, #68); see the sketch after this list
- **Byte-level:** no tokenizer dependency; directly optimizes the BPB metric
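For concreteness, one shape phi-scaled Xavier init can take. The exact rule lives in the Trinity codebase (refs #30, #68), so the `1/phi` gain below is an illustrative assumption, not the committed design:

```python
import math
import torch

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def phi_xavier_(weight: torch.Tensor, gain: float = 1.0) -> torch.Tensor:
    """Xavier-uniform init with the gain scaled by 1/phi.

    Hypothetical illustration: shrinks the standard Xavier bound by
    the golden ratio as a structural prior."""
    fan_out, fan_in = weight.shape
    bound = (gain / PHI) * math.sqrt(6.0 / (fan_in + fan_out))
    with torch.no_grad():
        return weight.uniform_(-bound, bound)
```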
## What We Know (Validated Results)

| Issue | What We Proved | Status |
|---|---|---|
| #55 | BPB 0 -> 5.48 realistic (was broken) | GREEN |
| #57 | Weight-delta tests: T1, T3, T4, T5 pass; T2 fails (diagnostic) | GREEN |
| #64 | simple_backward fix: tied-embedding in/out split; BPB 0.0983 | GREEN |
| #65 | Finite-difference grad check: 20/20 rel_err < 5e-3, sign agreement 100% | GREEN |
| #66 | Vocab audit: single source 729, active 256 | GREEN |
| #67 | Overfit-100: BPB 7.9997 -> 1.2000; all 7 gates GREEN (commit 85bf8b7) | GREEN |
| #68 | Trinity config: hidden=243, 2.70 MB, all dims 3^k | GREEN |
Wall-time measurement: 18.47 step/sec production throughput (GAMMA agent).
## Architecture Spec
Model: Trinity-3k Byte-Level Transformer
- vocab_size: 729 (3^6) — byte-level, no tokenizer
- hidden_dim: 243 (3^5)
- n_heads: 27 (3^3)
- head_dim: 9 (3^2)
- n_layers: TBD (maximize within 16MB)
- activation: ReLU^2
- position: RoPE
- normalization: QK-Norm + RMSNorm
- embeddings: tied (in == out)
- optimizer: Muon (hidden layers) + AdamW (embeddings/norm)
- quantization: GF16 -> INT4 post-training (fit 16MB)
- init: phi-scaled Xavier
Size budget breakdown (16 MB = 16,777,216 bytes):

| Component | Formula | Estimate |
|---|---|---|
| Embeddings (tied) | 729 * 243 * 2 B | ~354 KB |
| Per-layer (QKV + O + MLP) | (4 * 243^2 + 2 * 243 * 972) * 2 B | ~1.4 MB/layer |
| train_gpt.py | ~50 KB | 50 KB |
| Available for layers | 16 MB - 354 KB - 50 KB | ~15.6 MB |
| Max layers (FP16) | 15.6 MB / 1.4 MB | ~11 layers |
| Max layers (INT4) | 15.6 MB / 0.35 MB | ~44 layers (if INT4 post-training) |
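The table's arithmetic is easy to re-derive. A quick sanity script, using the conservative 16,000,000-byte artifact gate from Phase 4 below rather than 16,777,216, and FP16 = 2 bytes per parameter:

```python
V, D, MLP_W = 729, 243, 972                   # vocab (3^6), hidden (3^5), 4x MLP
emb = V * D * 2                               # tied embeddings: 354,294 B (~354 KB)
per_layer = (4 * D * D + 2 * D * MLP_W) * 2   # QKV+O + MLP: 1,417,176 B (~1.4 MB)
avail = 16_000_000 - emb - 50_000             # minus embeddings and ~50 KB of code

print(f"available for layers: {avail / 1e6:.1f} MB")     # ~15.6 MB
print(f"max FP16 layers: {avail // per_layer}")          # 11
print(f"max INT4 layers: {avail // (per_layer // 4)}")   # 44 (raw 4x packing;
                                                         # group scales eat a bit)
```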
## Decomposed Plan

### Phase 0: Infrastructure (Day 0-1, Apr 20-21)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Integrate trios-proto + trios-core into one-shot training pipeline | #107 | ALPHA | `cargo build --release` passes |
| Set up 8xH100 RunPod environment | — | LEAD | SSH + NCCL test passes |
| Implement FineWeb data loader (cached dataset, no network; sketch below) | — | GAMMA | Loads 1M bytes in <1s |
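A minimal sketch of the cached byte loader; the cache path and on-disk layout are hypothetical, the point is memmap plus zero network at train time:

```python
import numpy as np

def load_fineweb_bytes(path="data/fineweb_cache.bin"):
    """Memory-map a pre-downloaded FineWeb shard as raw uint8 bytes.
    `path` is a hypothetical cache location; memmap makes a 1M-byte
    read effectively instant and needs no network access."""
    return np.memmap(path, dtype=np.uint8, mode="r")

def batches(data, batch_size, seq_len, rng: np.random.Generator):
    """Yield (batch_size, seq_len + 1) uint8 windows sampled uniformly;
    inputs are window[:, :-1], targets are window[:, 1:]."""
    while True:
        starts = rng.integers(0, len(data) - seq_len - 1, size=batch_size)
        yield np.stack([data[s : s + seq_len + 1] for s in starts])
```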
### Phase 1: Backward Pass Fix (Day 2, Apr 21)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Fix backward pass for tied embeddings (CE masking + diverse data + grad-collapse warning; sketch below) | #67 | GAMMA | overfit-100 BPB < 2.0, 7/7 gates GREEN |
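The production pipeline is Rust (Phase 0), but the tied-weight gradient flow this gate checks is easiest to state in PyTorch: the shared matrix receives gradient from both the lookup path and the logit path, which is exactly what issue #64 had to split correctly. Module and helper names below are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedHead(nn.Module):
    """Input embedding and output projection share one matrix, so its
    gradient is the sum of the embedding-lookup and logit-projection
    contributions."""
    def __init__(self, vocab=729, dim=243):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def embed(self, ids):        # input path
        return self.emb(ids)

    def logits(self, h):         # output path, tied to the same weights
        return h @ self.emb.weight.t()

def masked_ce(logits, targets, pad_id=-100):
    # CE masking: positions marked pad_id (padding / inactive vocab)
    # contribute nothing to the loss or the gradient.
    return F.cross_entropy(logits.flatten(0, -2), targets.flatten(),
                           ignore_index=pad_id)

def warn_grad_collapse(model, eps=1e-12):
    # Gate for the grad-collapse failure mode: an all-zero gradient.
    for name, p in model.named_parameters():
        if p.grad is not None and p.grad.abs().max() < eps:
            print(f"WARNING: gradient collapse on {name}")
```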
### Phase 2: Optimizer Integration (Day 2-3, Apr 21-22)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Implement Muon optimizer for 2D weight matrices (sketch below) | #69 | DELTA | step/sec >= 18, loss decreasing monotonically |
| NQA 15K pre-training run (baseline measurement) | #70 | DELTA | BPB on val set recorded, loss curve saved |
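The core of Muon for reference. The Newton-Schulz quintic and its coefficients follow Keller Jordan's public implementation; `muon_step` is a simplified single-tensor sketch, and the real run should use the upstream optimizer:

```python
import torch

def zeropower_via_newtonschulz5(G, steps=5):
    """Newton-Schulz iteration approximating U V^T from the SVD of G
    (the orthogonalization at the heart of Muon). Quintic coefficients
    as in the reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T
    X = X / (X.norm() + 1e-7)   # Frobenius bounds spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

def muon_step(p, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update for a 2D weight: momentum accumulation, then
    orthogonalize the Nesterov update. The upstream optimizer also
    rescales lr by sqrt(max(1, rows/cols)), omitted here."""
    buf.mul_(momentum).add_(grad)
    update = zeropower_via_newtonschulz5(grad + momentum * buf)
    p.data.add_(update.to(p.dtype), alpha=-lr)
```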
### Phase 3: Architecture Scaling (Day 3-5, Apr 22-24)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Layer-count sweep: 8, 11, 16, 22 layers (FP16 vs INT4 tradeoff) | — | GAMMA | Best BPB/size ratio identified |
| MLP ratio sweep: 3x vs 4x hidden dim | — | GAMMA | Optimal config locked |
| Attention: RoPE + QK-Norm integration (sketch below) | — | DELTA | Grad check passes, BPB improves |
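One wrinkle for Trinity: standard RoPE rotates coordinate pairs and so wants an even head dim, while head_dim=9 is odd. A partial-rotary sketch that rotates the first 8 dims and passes the 9th through; the partial-rotary choice is our assumption, not settled design:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary embeddings for x of shape (B, H, T, Dh), half-split style.
    Rotates the largest even prefix of Dh and passes the rest through,
    so an odd head_dim like 9 still works."""
    B, H, T, Dh = x.shape
    rot = Dh - (Dh % 2)
    half = rot // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(T, dtype=x.dtype, device=x.device)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()          # (T, half)
    x1, x2 = x[..., :half], x[..., half:rot]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x[..., rot:]], dim=-1)

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """QK-Norm: RMS-normalize queries and keys along head_dim before
    attention to stabilize logit scale (learnable scales, used by some
    variants, omitted here)."""
    q = q / (q.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    k = k / (k.pow(2).mean(-1, keepdim=True) + eps).sqrt()
    return q, k
```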
### Phase 4: Quantization (Day 5-6, Apr 24-25)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Implement GF16 training format | — | GAMMA | Training stable, no NaNs |
| Post-training INT4 quantization (AWQ-style; sketch below) | — | DELTA | BPB degradation < 0.02 vs FP16 |
| Artifact size verification | — | ALPHA | `ls -la artifact.bin` shows < 16,000,000 bytes |
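A minimal sketch of symmetric group-wise INT4, with grouping as in AWQ but without the activation-aware scale search; group size 81 = 3^4 is an assumed value chosen to keep the 3^k alignment:

```python
import torch

def quantize_int4(w: torch.Tensor, group: int = 81):
    """Symmetric 4-bit group quantization of a 2D weight.
    Returns int8 codes in [-8, 7] plus one FP scale per group."""
    rows, cols = w.shape
    assert cols % group == 0
    wg = w.reshape(rows, cols // group, group)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp_(min=1e-8) / 7.0
    q = (wg / scale).round().clamp_(-8, 7).to(torch.int8)
    return q.reshape(rows, cols), scale.squeeze(-1)

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, group: int = 81):
    rows, cols = q.shape
    qg = q.reshape(rows, cols // group, group).float()
    return (qg * scale.unsqueeze(-1)).reshape(rows, cols)
```

Packing two codes per byte plus FP16 group scales is what gives the roughly 4x size drop the layer-budget table assumes.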
### Phase 5: Training Optimization (Day 6-7, Apr 25-26)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Full 60K-step training run (5 seeds) | #71 | GAMMA | All 5 runs complete in <10 min each |
| EMA weight averaging (decay=0.999; sketch below) | — | DELTA | EMA-weights BPB < raw-weights BPB |
| Learning-rate schedule: warmup + cosine decay | — | GAMMA | Loss curve smooth, no spikes |
| Sliding-window eval (stride=64) | — | DELTA | BPB improvement measured |
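Both averaging and scheduling are small pieces; a sketch of EMA tracking and warmup + cosine LR, where the peak LR and warmup length are placeholders rather than tuned values:

```python
import math
import torch

class EMA:
    """Exponential moving average of model weights (decay=0.999).
    Evaluate and submit the EMA copy: it is typically smoother and
    compresses better than the raw final weights."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone()
                       for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].lerp_(v, 1.0 - self.decay)

def lr_at(step, total=60_000, warmup=1_000, peak=0.02, floor=0.0):
    """Linear warmup, then cosine decay to `floor`."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(total - warmup, 1)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```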
### Phase 6: Entropy Sweep & Selection (Day 8-9, Apr 27-28)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Entropy sweep across 5 seed runs | #72 | GAMMA | Best candidate identified |
| Ablation study: log all config deltas | — | DELTA | Table of (config, BPB) pairs |
| Select final candidate (lowest BPB, reproducible; sketch below) | — | LEAD | p-value < 0.01 across seeds |
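One way to operationalize the "p-value < 0.01 across seeds" gate: a paired t-test of per-seed BPB between the candidate and the incumbent config. The scores below are PLACEHOLDERS purely to make the snippet runnable, not real results:

```python
from scipy.stats import ttest_rel

# Per-seed BPB for two configs -- placeholder numbers, not measurements.
incumbent = [1.162, 1.158, 1.165, 1.160, 1.163]
candidate = [1.148, 1.145, 1.151, 1.147, 1.149]

stat, p = ttest_rel(candidate, incumbent)
if p < 0.01 and sum(candidate) < sum(incumbent):
    print(f"adopt candidate (paired t-test p = {p:.4f})")
else:
    print(f"keep incumbent (p = {p:.4f})")
```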
### Phase 7: Submission (Day 9-10, Apr 29-30)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Package artifact: weights + train_gpt.py < 16 MB | — | ALPHA | Size verified |
| Write short write-up (techniques, architecture, results) | — | LEAD | Markdown ready |
| Submit PR to openai/parameter-golf | — | LEAD | PR accepted, score on leaderboard |
| Zenodo deposit (reproducibility archive) | — | DELTA | DOI assigned |
## Risk Matrix

| Risk | Impact | Mitigation |
|---|---|---|
| INT4 quantization degrades BPB by >0.05 | HIGH | Fall back to FP16 with fewer layers; try mixed INT4/INT8 |
| Muon unstable on 3^k dimensions | MEDIUM | Fall back to AdamW; use Muon only for attention/MLP |
| 10-min wall-clock exceeded | HIGH | Profile step/sec early; reduce layers or batch size |
| RunPod hardware lottery (slow pod) | MEDIUM | Request pod swap; measure ms/step before full run |
| N-gram caching ruled out by organizers | LOW | Pure neural approach; our architecture should still beat baseline |
## Success Criteria
## Timeline

| Day | Date | Action | Refs |
|---|---|---|---|
| 0-1 | Apr 20-21 | Infrastructure: one-shot integration, RunPod setup | #107 |
| 2 | Apr 21 | Fix backward pass | #67 |
| 2-3 | Apr 21-22 | Muon optimizer + NQA 15K baseline | #69, #70 |
| 3-5 | Apr 22-24 | Architecture scaling: layer/MLP/attention sweeps | — |
| 5-6 | Apr 24-25 | GF16 training + INT4 post-quantization | — |
| 6-7 | Apr 25-26 | Full 60K training (5 seeds) + EMA + sliding eval | #71 |
| 8 | Apr 27 | Entropy sweep | #72 |
| 9 | Apr 28-29 | Select candidate, write-up | — |
| 10 | Apr 29-30 | Submit PR + Zenodo | — |
## Issues

| Issue | Description | Owner | Status |
|---|---|---|---|
| #1 | Phase 1: Fix backward pass (#67) | GAMMA | pending |
| #2 | Phase 2: NQA 15K pre-training (#70) | DELTA | pending |
| #3 | Phase 5: Run #71, 5 seeds x 60K | GAMMA | pending |
| #4 | Phase 6: Entropy sweep (#72) | DELTA | pending |
| #5 | Phase 7: Submit + Zenodo | LEAD | pending |
Refs: PG-ONE-SHOT, #67, #69, #70, #71, #72, #107