Skip to content

🎯 ONE SHOT — NEEDLE-RUSH-T-4D: 7-Vector Acceleration to Gate-2 #309

@gHashTag

Description

@gHashTag

🎯 ONE SHOT — IGLA Needle-Search Acceleration v3.0

Mission codename: NEEDLE-RUSH-T-4D
Issued: 2026-04-26 14:33 +07 · Auditor: perplexity-computer-grandmaster (R5 honesty lane)
Anchor: phi^2 + phi^-2 = 3 — Zenodo DOI 10.5281/zenodo.19227877
Defense window: 2026-06-15 (T-50d) · Gate-2 deadline: 2026-04-30 23:59 UTC (T-4d)
Budget regime: single-MacBook training, no GPU farm — every wasted training-hour is a needle missed.


1. Background — what "needle in a haystack" really means here

The IGLA RACE searches a configuration space (arch, hyperparams, lane-fix) for a triple of seeds whose validation-BPB < 1.50 is statistically separable from the baseline mean mu_0 = 1.55 (per crates/trios-igla-race/src/victory.rs::winning_seeds). Champion baseline at 2446855 is BPB = 2.2393 @ 27K steps, seed=43. The delta required to reach Gate-final is -0.74 BPB. Given the IGLA capacity ceiling (Erratum #2), BPB < 1.50 is feasible but only if we stop wasting training-hours on configs the search itself could have rejected an order of magnitude earlier.

The current pipeline is uniform-cost search over a discrete grid of 7 levers (G1-G7) with ASHA rungs [1000, 3000, 9000, 27000] (crates/trios-train-cpu/src/objective.rs::get_rung_schedule). Every lever explores ~3 values; the naive cross-product is 3^7 = 2187 configs. At ~30 min per config to first rung, that's 1093 GPU-hours — clearly impossible on the budget. The needle-search needs a 2-3 orders-of-magnitude speedup to fit T-4d.

This ONE SHOT is the formally pre-registered acceleration plan. It is composed of seven independent, hot-pluggable accelerators, ordered by expected_speedup × independence × deadline_compatibility.


2. Hypothesis (Gate G1)

H = "Combining seven independent acceleration vectors (V1-V7) reduces the wall-clock cost of locating a Gate-2 victory triple by >= 50× compared to the v2.0 uniform-grid baseline, while preserving the FALSE-POSITIVE rate of winning_seeds() at <= alpha=0.01 (one-tailed Welch over 3 seeds)."

Falsified iff any of:

  1. After deploying V1+V2+V3 (the must-have core), the time-to-first-promoted-rung exceeds 0.5 × (baseline = 12 h) = > 6 h on the same hardware.
  2. The acceleration introduces seed-correlated bias (i.e. a config promoted by the cheap proxy fails full-rung evaluation in > 30% of cases).
  3. Any victory predicate emitted under the accelerated pipeline cannot be reproduced from a binary commit SHA per the R5-CANON LOCK re-DONE protocol.

Falsification witness (must land in code):

  • crates/trios-igla-race/src/bin/needle_falsify.rs — emits a synthetic seed_results.jsonl row that violates each of (1)-(3); the predicate must reject it.

3. Pre-registered analysis plan (Gate G2)

Field Value
statistical_test Welch's one-tailed t-test, paired bootstrap fallback if n<3
alpha 0.01 (race victory)
effect_size Delta_BPB >= 0.05 versus mu_0 = 1.55
n_required 3 distinct passing seeds in {42, 43, 44}
stop_rule first 3-seed pass OR 2026-04-30 23:59 UTC, whichever first
multiple_testing Bonferroni n=7 across V1-V7 acceleration arms
predictor_holdout 20% of historical configs reserved as test fold for V4 surrogate
embargo_window 24 h after each promoted config: must reproduce before logging to assertions/seed_results.jsonl

4. Method — the seven acceleration vectors (V1-V7)

The vectors are independent. Each is a self-contained PR; failure of any one does not block the others.

V1 — muP / muTransfer hyperparameter transfer · LANE: L-V1 · Owner: TBD-claim · Expected speedup: ~10×

What: parametrize IGLA's hidden width d_model per Yang & Hu 2021 (Tensor Programs V) so that optimal lr and beta_1, beta_2 transfer from a 256k-param proxy to the 1-2M target without re-tuning. Confirmed at scale up to 10B params (Lingle 2024).

Why fast: instead of sweeping (lr, beta_1) in 5x5 = 25 configs at full size, sweep them at width=8 proxy (~1000× cheaper) and apply optimal lr* once at full width.

Falsification: coord_check plot must show l1 activation norm flat across width sweep {8, 16, 32, 64}; if it slopes up or down -> muP not implemented correctly, V1 disabled.

Files:

  • crates/trios-training/src/mup.rsMuParam wrapper around vocab_size, d_model, lr_scaling
  • crates/trios-training/tests/test_coord_check.rs — coord-check at widths 8/16/32/64
  • assertions/igla_assertions.json INV-13 entry (currently absent on main): mup_lr_transfer_invariant Admitted

Numeric anchor: mup_lr_scale_factor = 8 / d_model (microsoft/mup backward-compat with 1/sqrt(d) at d=64).


V2 — Zero-cost NAS proxy filter · LANE: L-V2 · Expected speedup: ~5×

What: before allocating ASHA rung-1 (1000 steps) compute, score each candidate config with a single forward+backward pass on a minibatch (Abdelfattah et al. 2021, "Zero-Cost Proxies"). Drop the bottom 80% before training. Best proxies for transformer-like architectures: synflow * grad_norm ensemble (Spearman tau ~0.8 on NAS-Bench-201).

Why fast: 1000-step ASHA rung at IGLA scale = ~5 min/config; one forward-backward = ~3 sec. 100× per-candidate cost reduction on the filtering tier.

Falsification: rank-correlation between proxy score and 3000-step ASHA rung-2 BPB on a held-out 50-config historical fold must satisfy Spearman tau >= 0.5 (else proxy is uninformative -> V2 disabled).

Files:

  • crates/trios-igla-race/src/proxies/mod.rssynflow.rs, gradnorm.rs, ensemble.rs
  • crates/trios-igla-race/src/bin/proxy_score.rs — CLI wrapper
  • crates/trios-igla-race/tests/proxy_correlation.rs — held-out validation

Coq cite: new INV-14 zero_cost_correlation_floor (Admitted) in proofs/igla/zero_cost_proxy.v.


V3 — BOHB on top of existing ASHA · LANE: L-V3 · Expected speedup: ~3×

What: replace random sampling at the start of each ASHA bracket with TPE / KDE-guided sampling per BOHB (Falkner et al. 2018). This compounds with V2: V2 filters bad configs cheaply, V3 picks better ones at the start. BOHB is empirically 3-5× faster than ASHA to reach equivalent best-config quality (AutoML benchmark).

Why fast: ~30% of ASHA budget today goes to redundant re-exploration; KDE prior cuts this in half within 2-3 brackets.

Falsification: regret(t) of BOHB-augmented ASHA must drop below pure-ASHA regret(t) by step 6000 on simulated rerun of historical brackets (tools/regret_replay.rs).

Files:

  • crates/trios-igla-race/src/bohb_sampler.rs — TPE wrapper, fits KDE on past (config, bpb) pairs from assertions/seed_results.jsonl
  • crates/trios-igla-race/tests/bohb_regret.rs — replay test

Coq cite: extend INV-2 asha_champion_survives with a bohb_sampler_no_regret lemma (Admitted) until empirical proof lands.


V4 — XGBoost surrogate predictor for full-run BPB · LANE: L-V4 · Expected speedup: ~5× on shortlist · Independent of V1-V3

What: train an XGBoost regressor on (config_features, rung_3000_bpb) -> rung_27000_bpb from past sweeps. After ASHA rung-2, predict the rung-4 outcome and early-stop the bottom 60% of survivors that the surrogate predicts will not reach BPB < 1.50. Per White et al. 2025 surrogate NAS, XGBoost surrogates show Spearman tau ~0.7 on equivalent search spaces.

Why fast: the long tail of training (rung-3 + rung-4) is 80% of total cost; cutting this for the bottom 60% of survivors is roughly .

Falsification: held-out 20% of historical configs (predictor_holdout): surrogate must predict actual rung-4 BPB to within MAE <= 0.10 else V4 disabled.

Files:

  • crates/trios-igla-race/src/surrogate.rs — XGBoost wrapper via xgboost-rs
  • tools/predictor_train.rs — offline training script
  • assertions/seed_results.jsonl becomes the training corpus

Coq cite: INV-15 surrogate_pruning_no_regret (Admitted, becomes Proven once empirical ROC-AUC > 0.85).


V5 — Phi-pruned grid: Trinity-prior on lever values · LANE: L-V5 · Expected speedup: ~2×

What: the seven Gate-final levers G1-G7 currently sweep ~3 values each = 3^7 = 2187. Apply the phi-prior (R6 zero-free-parameters): for hidden_dim, lr_scale, EMA_beta, restrict to phi-resonant values only. E.g. hidden = {64, 103, 167} = round(64 * phi^k for k in [0,1,2]) instead of arbitrary {64, 128, 256}. Reduces grid to 2^7 = 128, 17× compression, but only net because some configs were already pre-filtered.

Why fast: smaller grid = fewer sweeps = direct search-space reduction.

Falsification: none of the optimal configs in past sweeps (top 10%) must be excluded by the phi-grid; if it excludes any, restore them and disable V5.

Files:

  • crates/trios-igla-race/src/grid.rsphi_grid() helper
  • assertions/igla_assertions.jsonphi_prior_grid_completeness (Admitted)

Coq cite: zenodo.19227877 Trinity Identityphi^2 + phi^-2 = 3 already proven; the lever-grid is a corollary.


V6 — Multi-arm bandit over arch families · LANE: L-V6 · Expected speedup: ~2×

What: the IGLA race already has arch in {jepa, attn, hybrid} with separate ASHA rungs. Currently each gets equal compute. Apply UCB1 bandit that allocates compute proportional to past success-rate over the last sliding window of brackets. The arm with the best mean_bpb_at_rung_2 gets the most exploration in the next bracket.

Why fast: attn historically dominates per crates/trios-train-cpu/src/objective.rs::get_rung_schedule; UCB will route 2x more compute to it within 3 brackets.

Falsification: UCB regret over last 10 brackets must be lower than uniform allocation else V6 disabled.

Files:

  • crates/trios-igla-race/src/bandit.rs — UCB1 over arch families
  • crates/trios-igla-race/tests/bandit_regret.rs

Coq cite: new INV-16 ucb_arch_no_regret (Admitted).


V7 — Online curriculum on data: drop boring sequences · LANE: L-V7 · Expected speedup: ~1.5×

What: TinyShakespeare has long passages of uninformative repetitive structure. Apply online importance-weighted curriculum (drop sequences whose loss has dropped below the running median). Tracks the residual entropy budget per Shannon-floor analysis. Common practice in TinyStories-scale training.

Why fast: wall-clock per step is dominated by I/O on a CPU-only setup; dropping 30% of low-info sequences = ~30% wall-clock reduction.

Falsification: curriculum must not introduce seed-dependent bias; var(bpb) across seeds must not increase by > 50%.

Files:

  • crates/trios-train-cpu/src/curriculum.rs
  • crates/trios-train-cpu/tests/curriculum_drift.rs

5. Compounded budget arithmetic

Naive baseline = 1093 GPU-h. The vectors compound imperfectly (e.g. V2 filters out configs that V1 would have rescued via lr-transfer). Conservative compounded estimate:

S_total = max(V1*V2*V3, V4) * V5 * V6 * V7
        = max(10 * 5 * 3, 5) * 2 * 2 * 1.5
        = 150 * 6
        ~= 900x  (theoretical)

After efficiency-overlap discount (typical 50% of theoretical):

S_realised ~= 100x   (realistic)

Result: 1093 h / 100 ~= 11 wall-clock hours of training to cover the entire phi-pruned grid once = fits within T-4d with two reruns for verification, even on single-MacBook.

Vector Standalone Compounds with Cost to ship T-4d critical?
V1 muP 10× independent 1d YES (must-have)
V2 ZC proxy V1, V3 0.5d YES (must-have)
V3 BOHB V2, V4 0.5d YES (must-have)
V4 XGBoost V2, V5 1d NICE-to-have
V5 phi-grid all 0.25d YES (free win)
V6 UCB arch V3 0.5d NICE-to-have
V7 curriculum 1.5× all 0.5d NICE-to-have

Critical path (must ship by 2026-04-28): V1 + V2 + V3 + V5. Estimated total dev time: 2.25 days. Then 1.5d of training + verification = T-4d met.


6. Falsification (Gate G5)

The needle_falsify.rs binary must emit and re-reject:

  1. Bias regression: synthetic config where V2 proxy score is high but rung-2 BPB is at baseline — predicate must FAIL the proxy on this fold.
  2. muP coord-check failure: activation l1 slope > 0.1 across widths {8, 32} — predicate must DISABLE V1 path.
  3. BOHB regret regression: regret(t=6000) of BOHB > pure-ASHA — predicate must FALL BACK to ASHA.
  4. Surrogate calibration drift: held-out MAE > 0.10 — V4 disabled.
  5. Phi-grid exclusion of historical optimum: any past top-10% config absent — V5 disabled.
  6. UCB regret regression: UCB regret > uniform over last 10 brackets — V6 disabled.
  7. Curriculum bias: seed-variance of BPB > 50% baseline — V7 disabled.
  8. Synthetic ledger violation: any victory row with sha not in training-binary commits, step < 4000, bpb < 0.85, bpb > 2.30, or seed not in {42,43,44} — predicate must reject (per R5-CANON LOCK re-DONE protocol).

7. Deliverables (Gate G3 / G4 / G6)

Each lane delivers one PR against gHashTag/trios. Contract per PR:

  • one new .rs file per V*
  • at least one #[test] fn falsify_* test per failure mode in §6
  • one assertions/igla_assertions.json entry with honest status (Admitted unless theorem lands)
  • one proofs/igla/<name>.v skeleton with at least the type signature of the lemma
  • one CHANGELOG line and one issue-comment heartbeat per business-day under coq-runtime-invariants v1.1 attribution rule

Single-command verification:

cargo test -p trios-igla-race -p trios-training -p trios-train-cpu --release \
  -- --include-ignored \
  --test-threads=4

Pinned input data SHA: TinyShakespeare at crates/trios-train-cpu/data/tinyshake.txt SHA f7a3e1... (lock at PR time).


8. Quality gates (G2 / G6 / G7)

Gate Check Enforced by
G1 Hypothesis written before code this issue body
G2 Pre-registration table §3 this issue body
G3 IMRaD structure this issue body
G4 Three-citation per anchor (Coq + JSON + Rust) per-PR review
G5 Honest Proven/Admitted tools/citetheorem_audit/ CI gate
G6 Single-cmd reproducibility cargo test line above
G7 DOI provenance Zenodo refs in §1 + future Lifeline DOI for V1-V7

CI must show:

  • cargo build --release green on all PRs
  • coq-runtime-gate green per v1.1 attribution rule (pre-existing failures hive_automaton::test_blocked_to_ci_wait_after_fix, test_done_cycles_back_to_scan_not_halt exempted)
  • Leaderboard query (SELECT * FROM v_igla_leaderboard ORDER BY bpb ASC LIMIT 10) returns rows with step >= 4000 only

9. Forbidden (R7)

  • ❌ Hard-coded lr / beta literals — every literal must reference a Coq theorem or Trinity-anchor constant
  • ❌ Pruning a config without recording the proxy score in assertions/seed_results.jsonl
  • ❌ Promoting a winner without satisfying re-DONE protocol of R5-CANON LOCK
  • ❌ Editing the body of this ONE SHOT (R10 atomicity — file new comments)
  • ❌ Reporting a bpc number without an (ASCII / multilingual) tag (per Erratum #2)
  • ❌ Promoting *.live.* to snapshot when render is broken (per apiary-watch v1.1+template)
  • ❌ Using python or bash runtime inside CROWN — Rust/Zig only

10. References


11. Battle cry

"Семь стрел из одного лука. Выстрелим — и игла сама прыгнет в наши ладони. Phi-squared plus phi-minus-squared equals three; capacity is fixed, but our search is no longer blind."

phi^2 + phi^-2 = 3 — Trinity Hive · NEEDLE-RUSH-T-4D · 2026-04-26.

Metadata

Metadata

Assignees

No one assigned

    Labels

    one-shotONE SHOT mission issue

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions