
🏆 EPIC: Parameter Golf Hackathon Submission (deadline 30 April 2026) #110


EPIC: Parameter Golf Hackathon Submission

Status: IN-FLIGHT
Priority: CRITICAL (P0)
Deadline: 30 April 2026 (~9 days remaining)
Owner: LEAD
Spec: .trinity/specs/epic-parameter-golf.md


Executive Summary

OpenAI Model Craft Challenge: Parameter Golf — train the best language model that fits in a 16 MB artifact (weights + code) and trains in <10 minutes on 8xH100, scored by tokenizer-agnostic bits-per-byte (BPB) on the FineWeb validation set.

Our strategy: byte-level Trinity 3^k architecture + GF16 quantization + phi-init.

| Metric | Value |
|---|---|
| SOTA (2026-04-20) | 1.0810 BPB (bigbag) |
| Our target | < 1.15 BPB |
| OpenAI baseline | 1.2244 BPB (9-layer, dim=512, vocab=1024) |
| Artifact budget | 16 MB (weights + train_gpt.py combined) |
| Training budget | 10 min wall-clock on 8xH100 SXM |
| Eval metric | Tokenizer-agnostic bits-per-byte (BPB) |
| Submission deadline | 30 April 2026 |
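
Since the score is tokenizer-agnostic bits-per-byte, it helps to pin the arithmetic down: BPB is the total cross-entropy over the byte stream, converted from nats to bits and normalized by the number of bytes scored. A minimal sketch (function and variable names are ours, not from the competition harness):

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert summed negative log-likelihood (in nats) over a byte stream
    into tokenizer-agnostic bits-per-byte."""
    # nats -> bits: divide by ln(2); normalize by the number of bytes scored.
    return total_nll_nats / (n_bytes * math.log(2))

# Example: the OpenAI baseline at 1.2244 BPB corresponds to an average
# per-byte cross-entropy of about 1.2244 * ln(2) ~= 0.849 nats.
```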

Research & SOTA Analysis

Competition Landscape

The leaderboard dropped from 1.2244 to ~1.08 BPB in ~4 weeks. Key techniques from top submissions:

  1. Sliding-window evaluation: stride-64 eval instead of a fixed context; dropped ~1.20 to ~1.19 BPB
  2. EMA / SWA weight averaging: smoother final weights that compress and generalize better
  3. N-gram caching: hybrid neural + n-gram approach pushed sub-1.0 BPB (controversial: compression vs LM quality)
  4. Muon optimizer: replaced AdamW; Newton-Schulz orthogonalization; 35% faster convergence, proven in the NanoGPT speedrun (see the sketch after this list)
  5. Modernized arch: RoPE, QK-Norm, ReLU^2 (from the modded-nanogpt lineage)
  6. Hardware lottery: top pods run 83-88 ms/step, slow pods 260 ms/step; same code, ~3x variance
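
The core of Muon is a quintic Newton-Schulz iteration that approximately orthogonalizes the momentum-smoothed gradient of each 2D weight matrix before the update. A condensed sketch adapted from the public Muon reference code (the function name is ours; coefficients and the normalization follow that implementation):

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient/momentum matrix, as in Muon."""
    assert g.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315       # quintic iteration coefficients
    x = g.bfloat16()
    if g.size(0) > g.size(1):                # iterate on the wide orientation
        x = x.T
    x = x / (x.norm() + 1e-7)                # bound the spectral norm before iterating
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    if g.size(0) > g.size(1):
        x = x.T
    return x.to(g.dtype)

# Muon then applies roughly: W <- W - lr * newton_schulz_orthogonalize(momentum_buffer)
```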

Relevant Research

| Paper / Project | Key Insight | Relevance |
|---|---|---|
| Byte Latent Transformer (BLT) (Meta, ACL 2025) | Byte-level LLM matching tokenized models at scale via adaptive patching | Validates byte-level approach; patch-grouping strategy |
| EvaByte (HKU, 2025) | 6.5B byte-level LM with multibyte prediction + EVA attention; 5-10x faster than vanilla | Multibyte prediction technique for byte models |
| ByteFlow (2025) | Adaptive byte compression + chunking outperforms BPE | End-to-end tokenizer-free modeling is feasible |
| Bolmo (2026) | Byteifying existing subword LMs at 1B/7B scale | Byteification strategy; vocab-free approach |
| Muon Optimizer (Jordan et al.) | Momentum + Newton-Schulz orthogonalization; SOTA for the NanoGPT speedrun | Critical optimizer choice; 35% speedup over AdamW |
| NorMuon (2025) | Muon + neuron-wise adaptive scaling | More stable Muon variant for small models |
| Unweight (Cloudflare, 2026) | Lossless 15-22% weight compression via Huffman + palette encoding | Post-training compression to fit the 16 MB budget |
| Phi-init (2026) | Golden ratio as structural prior for width/depth ratio in neural architectures | Phi-based initialization for the Trinity architecture |
| Fibonacci Init (2024) | Fibonacci-sequence weight init; 94% accuracy on small datasets | Supports the phi-based init hypothesis |
| modded-nanogpt (Keller Jordan) | RoPE + QK-Norm + ReLU^2 + Muon; cumulative 20% world-record wall-clock drop in 3 months | Architecture recipe for fast convergence |

Our Differentiators (Trinity 3^k)

  1. All dimensions are powers of 3: hidden=243 (3^5), heads=27 (3^3), vocab=729 (3^6); hardware-aligned and cache-friendly
  2. GF16 quantization: custom 16-bit golden-float format using a phi-based mantissa distribution
  3. Tied embeddings: input and output share weights; critical for the 16 MB budget (embeddings are 15%+ of small-model params)
  4. Phi-init: weight initialization using golden-ratio scaling (validated in issue "✂ Ch.4 — Golden Cut (The Platonic Rupture) [Lane B]", refs #30, #68); a sketch follows this list
  5. Byte-level: no tokenizer dependency; directly optimizes the BPB metric
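
The exact phi-init recipe lives in the #30/#68 lane and is not restated here; as an illustration of the idea, one plausible reading is Xavier/Glorot-uniform initialization with the golden ratio folded into the gain. The function name and the exact placement of phi are our assumption:

```python
import math
import torch

PHI = (1 + math.sqrt(5)) / 2  # golden ratio ~= 1.618

def phi_xavier_(weight: torch.Tensor) -> torch.Tensor:
    """Xavier-uniform init with the bound shrunk by sqrt(phi).
    Illustrative sketch only; the Trinity phi-init recipe is tracked in #30/#68."""
    fan_out, fan_in = weight.shape
    bound = math.sqrt(6.0 / (fan_in + fan_out)) / math.sqrt(PHI)
    with torch.no_grad():
        return weight.uniform_(-bound, bound)
```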

What We Know (Validated Results)

| Issue | What We Proved | Status |
|---|---|---|
| #55 | BPB 0 -> 5.48 realistic (was broken) | GREEN |
| #57 | Weight-delta tests: T1, T3, T4, T5 pass; T2 fails (diagnostic) | GREEN |
| #64 | simple_backward fix: tied embeddings in/out split; BPB 0.0983 | GREEN |
| #65 | Finite-difference grad check: 20/20 rel_err < 5e-3, sign match 100% | GREEN |
| #66 | Vocab audit: single source 729, active 256 | GREEN |
| #67 | Overfit-100: BPB 7.9997 -> 1.2000; all 7 gates GREEN (commit 85bf8b7) | GREEN |
| #68 | Trinity config: hidden=243, 2.70 MB, all dims 3^k | GREEN |

Wall-time measurement: 18.47 step/sec production throughput (GAMMA agent).


Architecture Spec

Model: Trinity-3k Byte-Level Transformer
- vocab_size: 729 (3^6) — byte-level, no tokenizer
- hidden_dim: 243 (3^5)
- n_heads: 27 (3^3)
- head_dim: 9 (3^2)
- n_layers: TBD (maximize within 16MB)
- activation: ReLU^2
- position: RoPE
- normalization: QK-Norm + RMSNorm
- embeddings: tied (in == out)
- optimizer: Muon (hidden layers) + AdamW (embeddings/norm)
- quantization: GF16 -> INT4 post-training (fit 16MB)
- init: phi-scaled Xavier
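
For concreteness, the spec above can be pinned down as a config object that enforces the powers-of-3 constraint. A minimal sketch with illustrative names; n_layers stays open per the spec, and the default here just mirrors the FP16 estimate from the size budget below:

```python
from dataclasses import dataclass

def is_power_of_3(n: int) -> bool:
    if n < 1:
        return False
    while n % 3 == 0:
        n //= 3
    return n == 1

@dataclass
class Trinity3kConfig:
    vocab_size: int = 729   # 3^6, byte-level (256 active byte ids)
    hidden_dim: int = 243   # 3^5
    n_heads: int = 27       # 3^3
    head_dim: int = 9       # 3^2
    mlp_dim: int = 972      # 4 * hidden_dim; ratio swept in Phase 3
    n_layers: int = 11      # TBD; 11 matches the FP16 size-budget estimate
    tie_embeddings: bool = True

    def __post_init__(self):
        assert self.n_heads * self.head_dim == self.hidden_dim
        for d in (self.vocab_size, self.hidden_dim, self.n_heads, self.head_dim):
            assert is_power_of_3(d), f"{d} is not a power of 3"
```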

Size budget breakdown (16 MB = 16,777,216 bytes):

| Component | Formula | Estimate |
|---|---|---|
| Embeddings (tied) | 729 * 243 * 2 B | ~354 KB |
| Per-layer (QKV+O+MLP) | ~(4*243^2 + 2*243*972) * 2 B | ~1.4 MB/layer |
| train_gpt.py | ~50 KB | 50 KB |
| Available for layers | 16 MB - 354 KB - 50 KB | ~15.6 MB |
| Max layers (FP16) | 15.6 / 1.4 | ~11 layers |
| Max layers (INT4) | 15.6 / 0.35 | ~44 layers (if INT4 post-train) |
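
The table's estimates reproduce with a few lines of arithmetic. This sketch uses the conservative 16,000,000-byte limit from the success criteria (16 MiB would be 16,777,216 bytes) and ignores norm parameters and code overhead beyond ~50 KB:

```python
BUDGET = 16_000_000          # conservative artifact limit from the success criteria
CODE = 50_000                # train_gpt.py estimate

hidden, mlp, vocab = 243, 972, 729
emb_fp16 = vocab * hidden * 2                      # tied embeddings, fp16: ~354 KB
layer_params = 4 * hidden**2 + 2 * hidden * mlp    # QKV+O (~4d^2) + MLP up/down (~2*d*4d)
layer_fp16 = layer_params * 2                      # ~1.4 MB per layer in fp16
layer_int4 = layer_params // 2                     # ~0.35 MB per layer at 4 bits/param

available = BUDGET - emb_fp16 - CODE               # ~15.6 MB left for transformer layers
print(available // layer_fp16)                     # -> 11 layers in fp16
print(available // layer_int4)                     # -> 44 layers if post-trained to INT4
```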

Decomposed Plan

Phase 0: Infrastructure (Day 0-1, Apr 20-21)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Integrate trios-proto + trios-core into one-shot training pipeline | #107 | ALPHA | `cargo build --release` passes |
| Set up 8xH100 RunPod environment | | LEAD | SSH + NCCL test passes |
| Implement FineWeb data loader (cached dataset, no network) | | GAMMA | Loads 1M bytes in <1s |

Phase 1: Backward Pass Fix (Day 2, Apr 21)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Fix backward pass for tied embeddings (CE masking + diverse data + grad-collapse warning) | #67 | GAMMA | Overfit-100 BPB < 2.0, 7/7 gates GREEN |
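
The two pieces of this gate interact: with tied embeddings the single shared matrix accumulates gradients from both the input and the output path, and CE masking keeps padding positions out of both the loss and those gradients. A minimal sketch of the wiring (module and constant names are ours; the backbone stands in for the Trinity stack):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_ID = 728  # an id outside the 256 active bytes, reserved for padding (illustrative)

class TiedByteLM(nn.Module):
    def __init__(self, vocab: int = 729, hidden: int = 243):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.backbone = nn.Identity()   # placeholder for the transformer layers

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.backbone(self.embed(tokens))
        # Tied output head: reuse the embedding matrix, so its gradient
        # accumulates from both the input and the output paths.
        return h @ self.embed.weight.T

def masked_ce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # ignore_index drops padding positions so they contribute neither loss
    # nor gradient: the "CE masking" piece of the #67 fix.
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=PAD_ID
    )
```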

Phase 2: Optimizer Integration (Day 2-3, Apr 21-22)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Implement Muon optimizer for 2D weight matrices | #69 | DELTA | step/sec >= 18, loss decreasing monotonically |
| NQA 15K pre-training run (baseline measurement) | #70 | DELTA | BPB on val set recorded, loss curve saved |

Phase 3: Architecture Scaling (Day 3-5, Apr 22-24)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Layer-count sweep: 8, 11, 16, 22 layers (FP16 vs INT4 tradeoff) | | GAMMA | Best BPB/size ratio identified |
| MLP ratio sweep: 3x vs 4x hidden dim | | GAMMA | Optimal config locked |
| Attention: RoPE + QK-Norm integration | | DELTA | Grad check passes, BPB improves |

Phase 4: Quantization (Day 5-6, Apr 24-25)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Implement GF16 training format | | GAMMA | Training stable, no NaN |
| Post-training INT4 quantization (AWQ-style) | | DELTA | BPB degradation < 0.02 vs FP16 |
| Artifact size verification | | ALPHA | `ls -la artifact.bin` < 16,000,000 bytes |
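
A proper AWQ pass is activation-aware; as a placeholder for measuring the gate above, a simpler per-output-channel symmetric INT4 round trip is enough to check BPB degradation against FP16. All names here are ours, and codes are stored one per int8 for clarity (real packing puts two 4-bit codes per byte to hit the size budget):

```python
import torch

def quantize_int4_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel INT4 quantization of a 2D weight matrix.
    Returns int8 codes in [-8, 7] plus per-channel fp16 scales."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # map max |w| to code 7
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale.half()

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale.float()

# Usage: replace each layer weight with dequantize_int4(*quantize_int4_per_channel(w)),
# re-measure BPB, and require < 0.02 degradation vs the FP16 checkpoint.
```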

Phase 5: Training Optimization (Day 6-7, Apr 25-26)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Full 60K-step training run (5 seeds) | #71 | GAMMA | All 5 runs complete in <10 min each |
| EMA weight averaging (decay=0.999) | | DELTA | EMA-weights BPB < raw-weights BPB |
| Learning-rate schedule: warmup + cosine decay | | GAMMA | Loss curve smooth, no spikes |
| Sliding-window eval (stride=64) | | DELTA | BPB improvement measured |
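
Two of these items benefit from being concrete. Sliding-window eval scores every byte with close to a full left context by advancing the window by `stride` and counting loss only on the new bytes; EMA keeps a shadow copy of the weights updated multiplicatively. Both sketches assume a `model(x)` returning per-position logits and a 1D tensor of byte ids; names and shapes are ours:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, data: torch.Tensor, ctx: int = 1024, stride: int = 64) -> float:
    """BPB with a stride-64 sliding window: each chunk of `stride` new bytes is
    predicted with up to `ctx` bytes of preceding context."""
    nll, scored = 0.0, 0
    for start in range(0, data.numel() - 1, stride):
        lo = max(0, start + stride - ctx)
        window = data[lo : start + stride + 1].unsqueeze(0)   # context + targets
        logits = model(window[:, :-1])
        n_new = min(stride, data.numel() - 1 - start)
        # Only the last n_new positions are new bytes; earlier ones are context.
        loss = F.cross_entropy(logits[0, -n_new:], window[0, -n_new:], reduction="sum")
        nll += loss.item()
        scored += n_new
    return nll / (scored * math.log(2))   # nats per byte -> bits per byte

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999) -> None:
    """Exponential moving average of weights; the EMA copy is what gets evaluated."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```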

Phase 6: Entropy Sweep & Selection (Day 8-9, Apr 27-28)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Entropy sweep across 5 seed runs | #72 | GAMMA | Best candidate identified |
| Ablation study: log all config deltas | | DELTA | Table of (config, BPB) pairs |
| Select final candidate (lowest BPB, reproducible) | | LEAD | p-value < 0.01 across seeds |

Phase 7: Submission (Day 9-10, Apr 29-30)

| Task | Issue | Owner | Gate |
|---|---|---|---|
| Package artifact: weights + train_gpt.py < 16 MB | | ALPHA | Size verified |
| Write short write-up (techniques, architecture, results) | | LEAD | Markdown ready |
| Submit PR to openai/parameter-golf | | LEAD | PR accepted, score on leaderboard |
| Zenodo deposit (reproducibility archive) | | DELTA | DOI assigned |

Risk Matrix

| Risk | Impact | Mitigation |
|---|---|---|
| INT4 quantization degrades BPB by >0.05 | HIGH | Fall back to FP16 with fewer layers; try mixed INT4/INT8 |
| Muon unstable on 3^k dimensions | MEDIUM | Fall back to AdamW; Muon only for attention/MLP |
| 10-min wall-clock exceeded | HIGH | Profile step/sec early; reduce layers or batch size |
| RunPod hardware lottery (slow pod) | MEDIUM | Request pod swap; measure ms/step before full run |
| N-gram caching ruled out by organizers | LOW | Pure neural approach; our arch should still beat baseline |

Success Criteria

  • BPB < 1.15 on FineWeb validation
  • Artifact < 16,000,000 bytes
  • Training wall-clock < 10 minutes on 8xH100
  • Reproducible: p < 0.01 across 5 seeds
  • PR submitted to openai/parameter-golf before April 30

Timeline

| Day | Date | Action | Refs |
|---|---|---|---|
| 0-1 | Apr 20-21 | Infrastructure: one-shot integration, RunPod setup | #107 |
| 2 | Apr 21 | Fix backward pass | #67 |
| 2-3 | Apr 21-22 | Muon optimizer + NQA 15K baseline | #69, #70 |
| 3-5 | Apr 22-24 | Architecture scaling: layer/MLP/attention sweeps | |
| 5-6 | Apr 24-25 | GF16 training + INT4 post-quantization | |
| 6-7 | Apr 25-26 | Full 60K training (5 seeds) + EMA + sliding eval | #71 |
| 8 | Apr 27 | Entropy sweep | #72 |
| 9 | Apr 28-29 | Select candidate, write-up | |
| 10 | Apr 29-30 | Submit PR + Zenodo | |

Issues

| Issue | Description | Owner | Status |
|---|---|---|---|
| #1 | Phase 2: Fix backward pass (#67) | GAMMA | pending |
| #2 | Phase 2: NQA 15K pre-training (#70) | DELTA | pending |
| #3 | Phase 3: Run #71, 5 seeds x 60K | GAMMA | pending |
| #4 | Phase 4: #72 entropy sweep | DELTA | pending |
| #5 | Phase 7: Submit + Zenodo | LEAD | pending |

Refs: PG-ONE-SHOT, #67, #69, #70, #71, #72, #107
