
🏆 EPIC: Parameter Golf Hackathon Submission (deadline 30 April 2026) #110


EPIC: Parameter Golf Hackathon Submission

Status: IN-FLIGHT
Priority: CRITICAL (P0)
Deadline: 30 April 2026 (~9 days remaining)
Owner: LEAD
Spec: .trinity/specs/epic-parameter-golf.md


Executive Summary

OpenAI Model Craft Challenge: Parameter Golf — train the best language model that fits in a 16 MB artifact (weights + code) and trains in <10 minutes on 8xH100, scored by tokenizer-agnostic bits-per-byte (BPB) on the FineWeb validation set.

Our strategy: byte-level Trinity 3^k architecture + GF16 quantization + phi-init.

| Metric | Value |
|---|---|
| SOTA (2026-04-20) | 1.0810 BPB (bigbag) |
| Our target | < 1.15 BPB |
| OpenAI baseline | 1.2244 BPB (9-layer, dim=512, vocab=1024) |
| Artifact budget | 16 MB (weights + train_gpt.py combined) |
| Training budget | 10 min wall-clock on 8xH100 SXM |
| Eval metric | Tokenizer-agnostic bits-per-byte (BPB) |
| Submission deadline | 30 April 2026 |
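
Since the score is tokenizer-agnostic bits-per-byte, it helps to pin the arithmetic down: BPB is the total cross-entropy over the byte stream, converted from nats to bits and normalized by the number of bytes scored. A minimal sketch (function and variable names are ours, not from the competition harness):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed negative log-likelihood (in nats) over a byte stream
    into tokenizer-agnostic bits-per-byte."""
    # nats -> bits: divide by ln(2); normalize by the number of bytes scored.
    return total_nll_nats / (n_bytes * math.log(2))

# Example: the OpenAI baseline at 1.2244 BPB corresponds to an average
# per-byte cross-entropy of about 1.2244 * ln(2) ~= 0.849 nats.
```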

Research & SOTA Analysis

Competition Landscape

The leaderboard dropped from 1.2244 to ~1.08 BPB in ~4 weeks. Key techniques from top submissions:

  1. Sliding-window evaluation: stride-64 eval instead of a fixed context; dropped ~1.20 to ~1.19 BPB
  2. EMA / SWA weight averaging: smoother final weights that compress and generalize better
  3. N-gram caching: hybrid neural + n-gram approach pushed sub-1.0 BPB (controversial: compression vs LM quality)
  4. Muon optimizer: replaced AdamW; Newton-Schulz orthogonalization; 35% faster convergence, proven in the NanoGPT speedrun (see the sketch after this list)
  5. Modernized arch: RoPE, QK-Norm, ReLU^2 (from the modded-nanogpt lineage)
  6. Hardware lottery: top pods run 83-88 ms/step, slow pods 260 ms/step; same code, ~3x variance
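
The core of Muon is a quintic Newton-Schulz iteration that approximately orthogonalizes the momentum-smoothed gradient of each 2D weight matrix before the update. A condensed sketch adapted from the public Muon reference code (the function name is ours; coefficients and the normalization follow that implementation):

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient/momentum matrix, as in Muon."""
    assert g.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315       # quintic iteration coefficients
    x = g.bfloat16()
    if g.size(0) > g.size(1):                # iterate on the wide orientation
        x = x.T
    x = x / (x.norm() + 1e-7)                # bound the spectral norm before iterating
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    if g.size(0) > g.size(1):
        x = x.T
    return x.to(g.dtype)

# Muon then applies roughly: W <- W - lr * newton_schulz_orthogonalize(momentum_buffer)
```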

Relevant Research

| Paper / Project | Key Insight | Relevance |
|---|---|---|
| Byte Latent Transformer (BLT) (Meta, ACL 2025) | Byte-level LLM matching tokenized models at scale via adaptive patching | Validates byte-level approach; patch-grouping strategy |
| EvaByte (HKU, 2025) | 6.5B byte-level LM with multibyte prediction + EVA attention; 5-10x faster than vanilla | Multibyte prediction technique for byte models |
| ByteFlow (2025) | Adaptive byte compression + chunking outperforms BPE | End-to-end tokenizer-free modeling is feasible |
| Bolmo (2026) | Byteifying existing subword LMs at 1B/7B scale | Byteification strategy; vocab-free approach |
| Muon Optimizer (Jordan et al.) | Momentum + Newton-Schulz orthogonalization; SOTA for the NanoGPT speedrun | Critical optimizer choice; 35% speedup over AdamW |
| NorMuon (2025) | Muon + neuron-wise adaptive scaling | More stable Muon variant for small models |
| Unweight (Cloudflare, 2026) | Lossless 15-22% weight compression via Huffman + palette encoding | Post-training compression to fit the 16 MB budget |
| Phi-init (2026) | Golden ratio as structural prior for width/depth ratio in neural architectures | Phi-based initialization for the Trinity architecture |
| Fibonacci Init (2024) | Fibonacci-sequence weight init; 94% accuracy on small datasets | Supports the phi-based init hypothesis |
| modded-nanogpt (Keller Jordan) | RoPE + QK-Norm + ReLU^2 + Muon; cumulative 20% world-record wall-clock drop in 3 months | Architecture recipe for fast convergence |

Our Differentiators (Trinity 3^k)

  1. All dimensions are powers of 3: hidden=243 (3^5), heads=27 (3^3), vocab=729 (3^6); hardware-aligned and cache-friendly
  2. GF16 quantization: custom 16-bit golden-float format using a phi-based mantissa distribution
  3. Tied embeddings: input and output share weights; critical for the 16 MB budget (embeddings are 15%+ of small-model params)
  4. Phi-init: weight initialization using golden-ratio scaling (validated in issue "✂ Ch.4 — Golden Cut (The Platonic Rupture) [Lane B]", refs #30, #68); a sketch follows this list
  5. Byte-level: no tokenizer dependency; directly optimizes the BPB metric
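
The exact phi-init recipe lives in the #30/#68 lane and is not restated here; as an illustration of the idea, one plausible reading is Xavier/Glorot-uniform initialization with the golden ratio folded into the gain. The function name and the exact placement of phi are our assumption:

```python
import math
import torch

PHI = (1 + math.sqrt(5)) / 2  # golden ratio ~= 1.618

def phi_xavier_(weight: torch.Tensor) -> torch.Tensor:
    """Xavier-uniform init with the bound shrunk by sqrt(phi).
    Illustrative sketch only; the Trinity phi-init recipe is tracked in #30/#68."""
    fan_out, fan_in = weight.shape
    bound = math.sqrt(6.0 / (fan_in + fan_out)) / math.sqrt(PHI)
    with torch.no_grad():
        return weight.uniform_(-bound, bound)
```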

What We Know (Validated Results)

| Issue | What We Proved | Status |
|---|---|---|
| #55 | BPB 0 -> 5.48 realistic (was broken) | GREEN |
| #57 | Weight-delta tests: T1, T3, T4, T5 pass; T2 fails (diagnostic) | GREEN |
| #64 | simple_backward fix: tied embeddings in/out split; BPB 0.0983 | GREEN |
| #65 | Finite-difference grad check: 20/20 rel_err < 5e-3, sign match 100% | GREEN |
| #66 | Vocab audit: single source 729, active 256 | GREEN |
| #67 | Overfit-100: BPB 7.9997 -> 1.2000; all 7 gates GREEN (commit 85bf8b7) | GREEN |
| #68 | Trinity config: hidden=243, 2.70 MB, all dims 3^k | GREEN |

Wall-time measurement: 18.47 step/sec production throughput (GAMMA agent).


Architecture Spec

Model: Trinity-3k Byte-Level Transformer
- vocab_size: 729 (3^6) — byte-level, no tokenizer
- hidden_dim: 243 (3^5)
- n_heads: 27 (3^3)
- head_dim: 9 (3^2)
- n_layers: TBD (maximize within 16MB)
- activation: ReLU^2
- position: RoPE
- normalization: QK-Norm + RMSNorm
- embeddings: tied (in == out)
- optimizer: Muon (hidden layers) + AdamW (embeddings/norm)
- quantization: GF16 -> INT4 post-training (fit 16MB)
- init: phi-scaled Xavier
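
For concreteness, the spec above can be pinned down as a config object that enforces the powers-of-3 constraint. A minimal sketch with illustrative names; n_layers stays open per the spec, and the default here just mirrors the FP16 estimate from the size budget below:

```python
from dataclasses import dataclass

def is_power_of_3(n: int) -> bool:
    if n < 1:
        return False
    while n % 3 == 0:
        n //= 3
    return n == 1

@dataclass
class Trinity3kConfig:
    vocab_size: int = 729   # 3^6, byte-level (256 active byte ids)
    hidden_dim: int = 243   # 3^5
    n_heads: int = 27       # 3^3
    head_dim: int = 9       # 3^2
    mlp_dim: int = 972      # 4 * hidden_dim; ratio swept in Phase 3
    n_layers: int = 11      # TBD; 11 matches the FP16 size-budget estimate
    tie_embeddings: bool = True

    def __post_init__(self):
        assert self.n_heads * self.head_dim == self.hidden_dim
        for d in (self.vocab_size, self.hidden_dim, self.n_heads, self.head_dim):
            assert is_power_of_3(d), f"{d} is not a power of 3"
```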

Size budget breakdown (16 MB = 16,777,216 bytes):

| Component | Formula | Estimate |
|---|---|---|
| Embeddings (tied) | 729 * 243 * 2 B | ~354 KB |
| Per-layer (QKV+O+MLP) | ~(4*243^2 + 2*243*972) * 2 B | ~1.4 MB/layer |
| train_gpt.py | ~50 KB | 50 KB |
| Available for layers | 16 MB - 354 KB - 50 KB | ~15.6 MB |
| Max layers (FP16) | 15.6 / 1.4 | ~11 layers |
| Max layers (INT4) | 15.6 / 0.35 | ~44 layers (if INT4 post-train) |
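
The table's estimates reproduce with a few lines of arithmetic. This sketch uses the conservative 16,000,000-byte limit from the success criteria (16 MiB would be 16,777,216 bytes) and ignores norm parameters and code overhead beyond ~50 KB:

```python
BUDGET = 16_000_000          # conservative artifact limit from the success criteria
CODE = 50_000                # train_gpt.py estimate

hidden, mlp, vocab = 243, 972, 729
emb_fp16 = vocab * hidden * 2                      # tied embeddings, fp16: ~354 KB
layer_params = 4 * hidden**2 + 2 * hidden * mlp    # QKV+O (~4d^2) + MLP up/down (~2*d*4d)
layer_fp16 = layer_params * 2                      # ~1.4 MB per layer in fp16
layer_int4 = layer_params // 2                     # ~0.35 MB per layer at 4 bits/param

available = BUDGET - emb_fp16 - CODE               # ~15.6 MB left for transformer layers
print(available // layer_fp16)                     # -> 11 layers in fp16
print(available // layer_int4)                     # -> 44 layers if post-trained to INT4
```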

Decomposed Plan

Phase 0: Infrastructure (Day 0-1, Apr 20-21)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Integrate trios-proto + trios-core into one-shot training pipeline | #107 | ALPHA | `cargo build --release` passes |
| Set up 8xH100 RunPod environment | | LEAD | SSH + NCCL test passes |
| Implement FineWeb data loader (cached dataset, no network) | | GAMMA | Loads 1M bytes in <1s |

Phase 1: Backward Pass Fix (Day 2, Apr 21)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Fix backward pass for tied embeddings (CE masking + diverse data + grad-collapse warning) | #67 | GAMMA | Overfit-100 BPB < 2.0, 7/7 gates GREEN |
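
The two pieces of this gate interact: with tied embeddings the single shared matrix accumulates gradients from both the input and the output path, and CE masking keeps padding positions out of both the loss and those gradients. A minimal sketch of the wiring (module and constant names are ours; the backbone stands in for the Trinity stack):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_ID = 728  # an id outside the 256 active bytes, reserved for padding (illustrative)

class TiedByteLM(nn.Module):
    def __init__(self, vocab: int = 729, hidden: int = 243):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.backbone = nn.Identity()   # placeholder for the transformer layers

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.backbone(self.embed(tokens))
        # Tied output head: reuse the embedding matrix, so its gradient
        # accumulates from both the input and the output paths.
        return h @ self.embed.weight.T

def masked_ce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # ignore_index drops padding positions so they contribute neither loss
    # nor gradient: the "CE masking" piece of the #67 fix.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=PAD_ID
    )
```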

Phase 2: Optimizer Integration (Day 2-3, Apr 21-22)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Implement Muon optimizer for 2D weight matrices | #69 | DELTA | step/sec >= 18, loss decreasing monotonically |
| NQA 15K pre-training run (baseline measurement) | #70 | DELTA | BPB on val set recorded, loss curve saved |

Phase 3: Architecture Scaling (Day 3-5, Apr 22-24)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Layer-count sweep: 8, 11, 16, 22 layers (FP16 vs INT4 tradeoff) | | GAMMA | Best BPB/size ratio identified |
| MLP ratio sweep: 3x vs 4x hidden dim | | GAMMA | Optimal config locked |
| Attention: RoPE + QK-Norm integration | | DELTA | Grad check passes, BPB improves |

Phase 4: Quantization (Day 5-6, Apr 24-25)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Implement GF16 training format | | GAMMA | Training stable, no NaN |
| Post-training INT4 quantization (AWQ-style) | | DELTA | BPB degradation < 0.02 vs FP16 |
| Artifact size verification | | ALPHA | `ls -la artifact.bin` < 16,000,000 bytes |
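
A proper AWQ pass is activation-aware; as a placeholder for measuring the gate above, a simpler per-output-channel symmetric INT4 round trip is enough to check BPB degradation against FP16. All names here are ours, and codes are stored one per int8 for clarity (real packing puts two 4-bit codes per byte to hit the size budget):

```python
import torch

def quantize_int4_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel INT4 quantization of a 2D weight matrix.
    Returns int8 codes in [-8, 7] plus per-channel fp16 scales."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # map max |w| to code 7
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale.half()

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale.float()

# Usage: replace each layer weight with dequantize_int4(*quantize_int4_per_channel(w)),
# re-measure BPB, and require < 0.02 degradation vs the FP16 checkpoint.
```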

Phase 5: Training Optimization (Day 6-7, Apr 25-26)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Full 60K-step training run (5 seeds) | #71 | GAMMA | All 5 runs complete in <10 min each |
| EMA weight averaging (decay=0.999) | | DELTA | EMA-weights BPB < raw-weights BPB |
| Learning-rate schedule: warmup + cosine decay | | GAMMA | Loss curve smooth, no spikes |
| Sliding-window eval (stride=64) | | DELTA | BPB improvement measured |
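
Two of these items benefit from being concrete. Sliding-window eval scores every byte with close to a full left context by advancing the window by `stride` and counting loss only on the new bytes; EMA keeps a shadow copy of the weights updated multiplicatively. Both sketches assume a `model(x)` returning per-position logits and a 1D tensor of byte ids; names and shapes are ours:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, data: torch.Tensor, ctx: int = 1024, stride: int = 64) -> float:
    """BPB with a stride-64 sliding window: each chunk of `stride` new bytes is
    predicted with up to `ctx` bytes of preceding context."""
    nll, scored = 0.0, 0
    for start in range(0, data.numel() - 1, stride):
        lo = max(0, start + stride - ctx)
        window = data[lo : start + stride + 1].unsqueeze(0)   # context + targets
        logits = model(window[:, :-1])
        n_new = min(stride, data.numel() - 1 - start)
        # Only the last n_new positions are new bytes; earlier ones are context.
        loss = F.cross_entropy(logits[0, -n_new:], window[0, -n_new:], reduction="sum")
        nll += loss.item()
        scored += n_new
    return nll / (scored * math.log(2))   # nats per byte -> bits per byte

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999) -> None:
    """Exponential moving average of weights; the EMA copy is what gets evaluated."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```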

Phase 6: Entropy Sweep & Selection (Day 8-9, Apr 27-28)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Entropy sweep across 5 seed runs | #72 | GAMMA | Best candidate identified |
| Ablation study: log all config deltas | | DELTA | Table of (config, BPB) pairs |
| Select final candidate (lowest BPB, reproducible) | | LEAD | p-value < 0.01 across seeds |

Phase 7: Submission (Day 9-10, Apr 29-30)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Package artifact: weights + train_gpt.py < 16 MB | | ALPHA | Size verified |
| Write short write-up (techniques, architecture, results) | | LEAD | Markdown ready |
| Submit PR to openai/parameter-golf | | LEAD | PR accepted, score on leaderboard |
| Zenodo deposit (reproducibility archive) | | DELTA | DOI assigned |

Risk Matrix

| Risk | Impact | Mitigation |
|---|---|---|
| INT4 quantization degrades BPB by >0.05 | HIGH | Fall back to FP16 with fewer layers; try mixed INT4/INT8 |
| Muon unstable on 3^k dimensions | MEDIUM | Fall back to AdamW; Muon only for attention/MLP |
| 10-min wall-clock exceeded | HIGH | Profile step/sec early; reduce layers or batch size |
| RunPod hardware lottery (slow pod) | MEDIUM | Request pod swap; measure ms/step before full run |
| N-gram caching ruled out by organizers | LOW | Pure neural approach; our arch should still beat baseline |

Success Criteria

  • BPB < 1.15 on FineWeb validation
  • Artifact < 16,000,000 bytes
  • Training wall-clock < 10 minutes on 8xH100
  • Reproducible: p < 0.01 across 5 seeds
  • PR submitted to openai/parameter-golf before April 30

Timeline

| Day | Date | Action | Refs |
|---|---|---|---|
| 0-1 | Apr 20-21 | Infrastructure: one-shot integration, RunPod setup | #107 |
| 2 | Apr 21 | Fix backward pass | #67 |
| 2-3 | Apr 21-22 | Muon optimizer + NQA 15K baseline | #69, #70 |
| 3-5 | Apr 22-24 | Architecture scaling: layer/MLP/attention sweeps | |
| 5-6 | Apr 24-25 | GF16 training + INT4 post-quantization | |
| 6-7 | Apr 25-26 | Full 60K training (5 seeds) + EMA + sliding eval | #71 |
| 8 | Apr 27 | Entropy sweep | #72 |
| 9 | Apr 28-29 | Select candidate, write-up | |
| 10 | Apr 29-30 | Submit PR + Zenodo | |

Issues

| Issue | Description | Owner | Status |
|---|---|---|---|
| #1 | Phase 2: Fix backward pass (#67) | GAMMA | pending |
| #2 | Phase 2: NQA 15K pre-training (#70) | DELTA | pending |
| #3 | Phase 3: Run #71, 5 seeds x 60K | GAMMA | pending |
| #4 | Phase 4: #72 entropy sweep | DELTA | pending |
| #5 | Phase 7: Submit + Zenodo | LEAD | pending |

Refs: PG-ONE-SHOT, #67, #69, #70, #71, #72, #107
