docs: add debug system design — commands, TUI panel, metrics architecture #4
Merged
…ture Design document for a feature-gated debug system that provides runtime visibility into NPC state, inference pipeline, background tasks, and performance metrics. Includes slash commands (/debug *) and a toggleable live TUI panel with tabbed views. https://claude.ai/code/session_012egaxMhLdaCuMgjHpR6359
dmooney pushed a commit that referenced this pull request on Mar 22, 2026
…igation
- Fix phase status inconsistencies: Phases 1-3 now correctly shown as complete across README.md, docs/index.md, phase plans, and roadmap
- Update module trees in CLAUDE.md and architecture overview to reflect actual src/ structure (npc submodules, debug.rs, geo_tool, gui submodules)
- Remove outdated Bevy ECS reference from architecture overview
- Add ADR-012 documenting the hierarchical documentation organization
- Add research/ section to docs/index.md
- Update known issue #4 to reflect that ShortTermMemory exists but isn't wired into LLM prompts yet
- Clean up maybe-bad-ideas.md (separate shipped items)
- Add "Key Design Docs" column to phase status table in docs/index.md
- Improve README.md with documentation tree diagram
https://claude.ai/code/session_01VCXxoKAD8dYAr99LHaEgzq
dmooney pushed a commit that referenced this pull request on Apr 3, 2026
- Replace `map_or(false, ...)` with `is_some_and(...)` in debug_snapshot.rs to satisfy clippy::unnecessary_map_or.
- Move known-issue #4 (NPC memory not wired into prompts) to the Resolved section: `build_enhanced_context_with_config()` already injects short-term memories, long-term recall, and gossip into Tier 1 prompts.
https://claude.ai/code/session_01BEC62B2awxqMsferWdBknk
dmooney pushed a commit that referenced this pull request on Apr 3, 2026
- Replace `map_or(false, ...)` with `is_some_and(...)` in debug_snapshot.rs to satisfy clippy::unnecessary_map_or
- Move known issue #4 (NPC memory not in prompts) to Resolved: build_enhanced_context() already injects short-term memory, long-term recall, reactions, and gossip into all Tier 1 prompts
https://claude.ai/code/session_01Ts3W9TiNfbLhSuHC6Xyqdf
This was referenced Apr 3, 2026
dmooney added a commit that referenced this pull request on Apr 5, 2026
- Replace `map_or(false, ...)` with `is_some_and(...)` in debug_snapshot.rs to satisfy clippy::unnecessary_map_or.
- Move known-issue #4 (NPC memory not wired into prompts) to the Resolved section: `build_enhanced_context_with_config()` already injects short-term memories, long-term recall, and gossip into Tier 1 prompts.
https://claude.ai/code/session_01BEC62B2awxqMsferWdBknk
Co-authored-by: Claude <noreply@anthropic.com>
dmooney added a commit that referenced this pull request on May 14, 2026
Re-judged all 4 cached sample sets with x-ai/grok-4.3 (non-reasoning, $1.25/$2.50 per M) to test judge sensitivity. 193 calls, 0 errors, ~43 min, ~$0.40 total.
Added deepseek/deepseek-v4-pro to the candidate roster (deepseek's current flagship, $0.435/$0.87 per M). Surprise result: v4-pro scores *worse* than v3.2 (7.77 vs 7.91) on 1820 Irish dialogue, likely from heavier reasoning/structure RLHF that hurts persona fidelity.
Judge findings:
- 2.36-point top-to-bottom spread vs 0.81 under mistral-large. Grok discriminates harder.
- Top-3 robust across judges: qwen3-max / mistral-large / qwen3-235b within 0.06 of each other under both. Real ceiling for this slice.
- grok-3-mini lifts to #4: same judge-family bias as previous sweeps. Discount by ~0.3 mentally.
- gpt-4o-mini drops 1.67 points under stricter grok judge (8.27 → 6.60), the biggest cross-judge gap of any candidate.
kimi-k2.6 tried as judge first; failed 70%+ of calls (reasoning tokens ate the max_tokens budget before content emission, even at 8K). Added a reasoning-field fallback to eval_lib's `call_chat` so short reply→content-empty cases recover (k2.6 still not viable as judge under JSON-schema mode).
Bumped score_multiaxis _JUDGE_MAX_TOKENS 2000 → 8000 to give thinking judges headroom. Records kimi failure as evidence: multiaxis_20260514T175833Z.json (10 of 14 errors).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request on May 15, 2026
…e view
eval_lib gains a `reasoning` kwarg on call_chat that gets passed through
to OpenRouter. When a target model is known to be reasoning-class
(kimi-k2.5/k2.6/k2-thinking, glm-4.7, o1/o3/o4, deepseek-r1, opus 4.7,
sonnet 4.6) the default switches to `{"enabled": false}` so cached
dialogue replies are the in-character answer rather than truncated
thought process.
Result for kimi-k2.5: 0/15 valid replies before → 15/15 valid now,
44.5 s wall vs prior timeouts. Judged with both pinned judges:
grok-4.3 → 8.56 total (n=15) -- #4 in grok ranking
mistral-large → 8.96 total (n=15) -- #4 in mistral ranking
Both judges rank kimi-k2.5 #4, a consistent cross-judge placement (between
qwen3-235b and claude-haiku-4.5).
Static leaderboard page now exposes an "average" view in the judge
selector: per-candidate mean across each distinct judge it was scored
by. Only emitted when ≥2 judges agree on coverage (otherwise the
"average" of one row is just the row).
Removed the failed kimi-k2.6 judging run
(multiaxis_20260514T175833Z.json). It had been kept solely as a
failure-mode artifact and is no longer useful now that the broader story
is documented in the leaderboard md.
Coverage matrix dropped its hardcoded 3-column layout for a dynamic
one driven by `--judge-cols` CSS var, so adding/removing judges no
longer requires HTML edits.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request on May 15, 2026
* feat(eval): rundale-bench ELO mode (pairwise judge)
The absolute 5-axis rubric saturated near ceiling — gpt-oss-120b:free
scored 4.82/5 with no headroom for stronger models to differentiate.
Pairwise ELO replaces it as the dialogue-ranking primary.
Changes:
- judge_pairwise_v1.json — new pinned judge config (qwen3-235b on
OpenRouter, rubric_sha256 b5664f96…dc7c0), pairwise rubric requiring
one-sentence reason per verdict.
- grade.py::grade_pairwise — judge picks A | B | tie + reason.
Non-Latin script in one reply auto-disqualifies that side (judge
can't be trusted to enforce that consistently).
- rundale_bench.py --mode elo — repeated --target flags, runs every
candidate over the dialogue slice once (cached replies), schedules
one pairwise match per (a, b) pair per prompt with A/B position
randomized to absorb first-position bias. K=32 → K=16 after 50
matches per candidate. Bootstrap 5/95 CI via 500 resamples.
- test_grade.py — 5 new pairwise tests (winner, tie, invalid winner,
non-Latin auto-DQ, rubric tamper). 27/27 pass.
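The non-Latin auto-disqualification rule in grade_pairwise can be sketched roughly as below. This is a minimal illustration, not the code from grade.py: the helper names (`has_non_latin_letters`, `pairwise_precheck`) are hypothetical, the check relies on Unicode character names, and treating "both replies contain non-Latin script" as a tie is an assumption the commit message does not state.

```python
import unicodedata


def has_non_latin_letters(text: str) -> bool:
    """True if any alphabetic character is outside the Latin script.

    Accented Irish letters such as "á" count as Latin because their
    Unicode names start with "LATIN".
    """
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


def pairwise_precheck(reply_a: str, reply_b: str):
    """Return 'A', 'B' or 'tie' when script alone decides, else None.

    None means the pairwise judge should be asked as usual.
    """
    bad_a = has_non_latin_letters(reply_a)
    bad_b = has_non_latin_letters(reply_b)
    if bad_a and bad_b:
        return "tie"  # assumption: neither side can win
    if bad_a:
        return "B"    # A is disqualified, so B wins
    if bad_b:
        return "A"    # B is disqualified, so A wins
    return None
```

Running the check before the judge call keeps the disqualification deterministic, which matches the stated motivation that the judge cannot be trusted to enforce it consistently.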
Smoke (3 candidates × 10 prompts × 1 pair per prompt = 60 calls,
~$0.0013, 346s):
1646.2 [CI 1598.6–1694.0] qwen/qwen3-235b-a22b-2507
1497.1 [CI 1437.0–1561.2] openai/gpt-oss-120b:free
1356.6 [CI 1306.3–1399.8] mistralai/mistral-small-24b-instruct-2501
290-point spread with non-overlapping CIs between top and bottom.
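The match scheduling described for `--mode elo` (one pairwise match per candidate pair per prompt, with A/B position randomized) might look like the following sketch. The function name `schedule_matches` and the seeded `random.Random` are illustrative assumptions, not the actual rundale_bench.py implementation.

```python
import itertools
import random


def schedule_matches(candidates, prompt_ids, seed=0):
    """Build one (prompt_id, side_a, side_b) match per pair per prompt.

    The presentation order of each pair is flipped at random so that any
    first-position bias in the judge averages out across the schedule.
    """
    rng = random.Random(seed)
    schedule = []
    for prompt_id in prompt_ids:
        for a, b in itertools.combinations(candidates, 2):
            if rng.random() < 0.5:
                a, b = b, a  # swap which reply the judge sees first
            schedule.append((prompt_id, a, b))
    return schedule
```

With 3 candidates and 10 prompts this yields 3 pairs × 10 prompts = 30 matches.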
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): cache_dialogue_replies + rubric_lab for offline rubric tuning
Goal: iterate the judging rubric without re-querying candidate models.
Generate replies once, cache to JSON, then prototype rubrics offline.
Tools:
- cache_dialogue_replies.py: --target ... --prompts N, writes
docs/proofs/rundale-bench/dialogue_samples_<UTC>.json with one record
per (candidate, prompt) including reply, usage, latency, error.
- rubric_lab.py: loads a cached samples file + a rubric text file,
scores all samples in absolute (1-N) or pairwise (ELO) mode, prints
per-candidate distribution. Zero candidate spend per iteration —
only judge cost.
Three cache snapshots committed as evidence:
003651Z.json 6 free OpenRouter candidates × 15 prompts
— 65/90 useful (gemma-4 0/15, qwen-next 6/15 throttled)
— kept as baseline showing why free tier is unreliable
004513Z.json 6 paid cheap (<$1/M out) × 15 prompts
qwen3-235b, mistral-small-24b, gemma-3-27b, phi-4,
deepseek-v3.2, gpt-oss-120b
90/90 ok, $0.0009, 205s
005721Z.json 6 mid-tier ($1-$5/M out) × 15 prompts
claude-haiku-4.5, gpt-4o-mini, gemini-2.5-flash,
mistral-large-2512, kimi-k2.5, grok-3-mini
90/90 ok, 324s
245 useful samples across 17 distinct candidates total. Rubric
iteration now an offline activity that costs nothing per pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): 12-candidate ELO sweep with mistral-large judge
Kimi 2.6 attempted as judge but failed: reasoning model emits all
output in `reasoning` field, content is null, every match returned
empty → all ties. Same problem hit z-ai/glm-4.7. Switched to a
non-thinking judge.
mistral-large-2512 ($0.50/$1.50, 1-2s/call) judged 816 pairwise
matches over the 12 paid OpenRouter candidates × 15 dialogue prompts:
1898.9 qwen/qwen3-235b-a22b-2507
1768.4 anthropic/claude-haiku-4.5
1705.0 google/gemma-3-27b-it
1682.6 mistralai/mistral-large-2512 (judge self-bias)
1622.8 moonshotai/kimi-k2.5 (n=11 only; reasoning empty)
1484.9 x-ai/grok-3-mini
1473.3 deepseek/deepseek-v3.2
1356.8 openai/gpt-oss-120b
1340.9 google/gemini-2.5-flash
1305.2 mistralai/mistral-small-24b-instruct-2501
1242.6 openai/gpt-4o-mini
1118.6 microsoft/phi-4
780-point top-to-bottom spread. Headline: claude-haiku-4.5 is the
standout cheap-tier dialogue model; phi-4 is unsuitable. qwen3-235b
top spot is suspect — it was the prior judge pin so may carry
training bias toward its own style.
Tooling updates:
- rubric_lab.py: per-match progress log (flush=True), max_tokens=2000
for judge calls (reasoning models need budget), empty-content guard
that raises "reasoning model truncated?" instead of silent 0-score.
- Documented reasoning-class judge incompatibility in leaderboard
caveats — until call_chat handles `reasoning` fallback, kimi-* and
glm-* are unusable as judges OR candidates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(eval): justfile recipes for rundale-bench workflows
Wraps the tasks executed this session so they're rerunnable without
remembering target-spec strings, env-loading boilerplate, or rubric_lab
flags. Recipes:
Correctness gates (no API spend):
- test — run grade.py unit tests
- manifest — rebuild MANIFEST.json after slice edits
- split — re-run holdout split + manifest
Single-target probes:
- intent target [limit=20] — deterministic intent grading
- dialogue target [judge limit] — absolute 5-axis rubric
- bench target [judge limit] — --slice all sweep
Multi-target ELO:
- elo "spec_a spec_b ..." [limit=10] — in-bench --mode elo, fresh replies
Cached-reply workflow (rubric iteration, no candidate spend):
- cache-cheap [prompts=15] — 6 paid-cheap fleet (~$0.001, ~3min)
- cache-mid [prompts=15] — 6 mid-tier fleet (~$0.05)
- cache-all [prompts=15] — both, back to back
- cache targets="..." [prompts]
- elo-from-cache samples [rubric judge output] — pairwise over cache
- absolute-from-cache samples rubric [axis_scale judge output]
Convenience:
- list-samples — show cached sample files
- sample-stats — n_samples / ok / empty / cost per cache
Run from repo root with `-f`:
just -f parish/testing/rundale-bench/justfile <recipe> ...
ENV_FILE points at <repo>/.env; recipes source it before invoking python
so OPENROUTER_API_KEY (and equivalents) reach call_chat. Default judge
is mistral-large-2512 (non-thinking, ~$0.001/call, 1-2s wall) — override
with `judge=...` per recipe.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): multi-axis 0-10 scoring + just recipe
Per-axis 0-10 dialogue scorer over cached samples. Five axes (character,
authenticity, language, responsiveness, craft) + a judge-emitted total —
complements pairwise ELO by exposing *why* a candidate ranks where it does.
- `score_multiaxis.py` — judge emits one integer per axis (0-10) plus a
float `total`. Per-record + per-candidate aggregates written to JSON.
Default judge `mistralai/mistral-large-2512`, calibration anchors
baked into the rubric (0/3/6/8/10 gradient).
- `justfile multiaxis` recipe — accepts positional args only (samples,
judge, rubric); output is auto-stamped to docs/proofs/rundale-bench/.
An earlier draft accepted `output=...` as a named arg, but just silently
parses that as the next positional, sending `--judge-model 'output=...'` and
earning HTTP 400 from every call. The named arg was removed.
Evidence: ran both cached sweeps end-to-end. 76+88 calls, ~$0 spend
(well under mistral-large's $0.50/$1.50 floor). Three top candidates
agree with the ELO ranking — gemma-3-27b / qwen3-235b / claude-haiku-4.5
cluster within 0.10 of each other, mistral-large self-bias still present.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): qwen3-max flagship probe + aggregator bug fix
Added qwen/qwen3-max (Chinese flagship, $1.20/$6 per M, non-reasoning)
to the multi-axis 0-10 leaderboard. Result: 9.03 total — ties
gemma-3-27b and edges qwen3-235b by 0.03, well inside rubric noise.
~17× the cost of qwen3-235b for zero meaningful gain on this slice.
Also fixed an aggregator bug in score_multiaxis.py: judge-call errors
(HTTP 503, rate-limit drops, schema-parse failures) were contributing
zeros to per-candidate means. A first qwen3-max pass hit two 503s after
4 retries each, pulling its headline mean to 7.87 — nonsense for a
flagship. Errored records still land in `per_record` for inspection
but no longer skew aggregates.
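A minimal sketch of the aggregator fix described above: errored records are skipped when computing per-candidate means instead of counting as zeros. The record fields (`candidate`, `total`, `error`) and the function name are assumptions for illustration; the real score_multiaxis.py schema may differ.

```python
from statistics import mean


def aggregate_totals(per_record):
    """Mean judge total per candidate, ignoring records that errored.

    Errored records stay in per_record for inspection; they simply do
    not contribute to the aggregate.
    """
    by_candidate = {}
    for rec in per_record:
        if rec.get("error"):
            continue  # e.g. HTTP 503 or schema-parse failure
        by_candidate.setdefault(rec["candidate"], []).append(rec["total"])
    return {cand: mean(totals) for cand, totals in by_candidate.items()}
```

A candidate whose every record errored is simply absent from the result, which avoids reporting a misleading 0.0.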
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): grok-4.3 judge re-sweep + deepseek-v4-pro candidate
Re-judged all 4 cached sample sets with x-ai/grok-4.3 (non-reasoning,
$1.25/$2.50 per M) to test judge sensitivity. 193 calls, 0 errors,
~43 min, ~$0.40 total.
Added deepseek/deepseek-v4-pro to the candidate roster (deepseek's
current flagship, $0.435/$0.87 per M). Surprise result: v4-pro scores
*worse* than v3.2 (7.77 vs 7.91) on 1820 Irish dialogue — likely from
heavier reasoning/structure RLHF that hurts persona fidelity.
Judge findings:
- 2.36-point top-to-bottom spread vs 0.81 under mistral-large. Grok
discriminates harder.
- Top-3 robust across judges: qwen3-max / mistral-large / qwen3-235b
within 0.06 of each other under both. Real ceiling for this slice.
- grok-3-mini lifts to #4 — same judge-family bias as previous sweeps.
Discount by ~0.3 mentally.
- gpt-4o-mini drops 1.67 points under stricter grok judge (8.27 → 6.60)
— biggest cross-judge gap of any candidate.
kimi-k2.6 tried as judge first; failed 70%+ of calls (reasoning tokens
ate the max_tokens budget before content emission, even at 8K). Added
a reasoning-field fallback to eval_lib's `call_chat` so short
reply→content-empty cases recover (k2.6 still not viable as judge
under JSON-schema mode).
Bumped score_multiaxis _JUDGE_MAX_TOKENS 2000 → 8000 to give thinking
judges headroom. Records kimi failure as evidence:
multiaxis_20260514T175833Z.json (10 of 14 errors).
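The reasoning-field fallback can be illustrated with a small helper. This sketch assumes an OpenAI-style chat completion dict in which the message may carry a `reasoning` string next to `content`; the helper name `extract_reply` is hypothetical and is not the actual `call_chat` code.

```python
def extract_reply(response: dict) -> str:
    """Return message content, falling back to the reasoning field.

    Some reasoning-class models return null or empty content and put
    their text in `reasoning`; this recovers that text instead of
    treating the call as an empty reply.
    """
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    if content:
        return content
    return (message.get("reasoning") or "").strip()
```

The fallback only helps when the reasoning text happens to contain a usable answer, which is consistent with the note that k2.6 remains unviable under JSON-schema mode.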
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): TTFT/tok-s/JSON-compliance probe + gemma-4/qwen-2.5 candidates
Adds a streaming-mode perf probe complementing the rubric-quality
bench. Three new dimensions per candidate:
- ttft_ms (time-to-first-content-token via stream:true)
- tok/s (completion_tokens / generation seconds)
- json compliance rate, both free-form and schema-enforced
Implementation:
- eval_lib.py grows `call_chat_streaming(target, …)` that parses SSE,
records TTFT on first content delta, accumulates token usage from the
final usage line. Also retries on body-level 502/503 (xAI grok-4.3
capacity errors land inside 200 OK responses).
- bench_perf.py runs the probe and dumps perf_<UTC>.json with per-call
records + per-candidate medians + p90s.
- justfile adds `perf` (single target) and `perf-many` (pre-quoted
--target … flags) recipes.
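The streaming measurement can be sketched as a pure function over SSE lines. This is an illustration only: it assumes OpenAI-style `data: {...}` chunks with `choices[].delta.content` and a final chunk carrying `usage`, it measures generation time from the first content token to the end of the stream (the commit does not say where "generation seconds" starts), and `measure_stream` is a hypothetical name rather than the real `call_chat_streaming`.

```python
import json
import time


def measure_stream(lines, clock=time.monotonic):
    """Compute TTFT (ms) and tokens/second from an iterable of SSE lines."""
    start = clock()
    first_content_at = None
    usage = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip comments, keep-alives and blank lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # usually only on the final chunk
        for choice in chunk.get("choices") or []:
            has_text = (choice.get("delta") or {}).get("content")
            if has_text and first_content_at is None:
                first_content_at = clock()
    end = clock()
    if first_content_at is None:
        return {"ttft_ms": None, "tok_per_s": None}
    gen_seconds = end - first_content_at
    tokens = usage.get("completion_tokens", 0)
    return {
        "ttft_ms": (first_content_at - start) * 1000.0,
        "tok_per_s": tokens / gen_seconds if gen_seconds > 0 else None,
    }
```

Passing the clock as a parameter keeps the function testable without a live stream.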
Initial measurements on 4 candidates (10 prompts each):
(JF% = free-form JSON compliance, JS% = schema-enforced JSON compliance)
candidate                    TTFT p50  Total p50  tok/s p50  JF%   JS%
qwen/qwen3-235b-a22b-2507    363ms     1696ms     54.2       100%  100%
google/gemma-3-27b-it        380ms     2328ms     37.9       100%  100%
google/gemma-4-31b-it        1160ms    3214ms     21.9       100%  100%
qwen/qwen-2.5-72b-instruct   0ms       5242ms     0.0        10%   90%
qwen3-235b is fastest across all three perf axes AND tied for top-3
quality under both judges. Strong default.
gemma-4 is slower than gemma-3 in every dimension (TTFT 3×, tok/s ~½)
AND scores 0.23 lower under the grok-4.3 judge. Newer is not an upgrade.
qwen-2.5-72b-instruct returns "Provider returned error" 400 from
OpenRouter on 14/15 cache calls; streaming all errored. Free-form JSON
10%, schema 90%. Different provider routing required before this
candidate can be fairly evaluated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(eval): cache dialogue replies for 14 new candidates (high + mid tier)
Cached 15 dialogue replies for each candidate to enable offline rubric
judging without re-spending. Judging deferred — user wants more samples
collected before re-judging the full set.
High tier (6, $0.x to $5/$30 per M):
- anthropic/claude-opus-4.7
- anthropic/claude-sonnet-4.6
- openai/gpt-5.5
- openai/gpt-5.4
- google/gemini-2.5-pro
- x-ai/grok-4.3
Mid tier (8):
- meta-llama/llama-3.3-70b-instruct
- meta-llama/llama-4-maverick
- meta-llama/llama-4-scout
- openai/gpt-5.4-mini
- mistralai/mistral-medium-3.1
- z-ai/glm-4.6
- nousresearch/hermes-4-405b
- amazon/nova-pro-v1
All 210 calls returned non-empty replies. Cache files:
- dialogue_samples_20260514T221845Z.json (high tier, 90 samples, ~6min)
- dialogue_samples_20260515T181515Z.json (mid tier, 120 samples, ~6min)
Brings the candidate pool to 28 unique models cached for dialogue. Ready
for a multi-judge re-sweep when bench iteration resumes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): static leaderboard page for rundale-bench
Aggregates every multiaxis_*.json + perf_*.json + dialogue_samples_*.json
into a single static HTML page (`docs/proofs/rundale-bench/leaderboard.html`).
Open in any browser — no server required.
Sections:
- Stat header (cached / judged / unjudged / judge count / row counts)
- Quality table (sortable + filterable by candidate or judge)
- Perf table (TTFT / total / tok/s / JSON compliance pills)
- Judge coverage matrix (✓ / · per candidate × judge)
- Unjudged backlog list
Vanilla JS, no deps. Dark theme by default with auto-light via
prefers-color-scheme. Pure DOM rendering with HTML escaping — no
innerHTML on untrusted strings.
Regenerate after any new run with:
just -f parish/testing/rundale-bench/justfile leaderboard
Current state baked into the page: 30 cached candidates, 16 judged,
14 unjudged, 3 distinct judges, 31 quality rows + 5 perf rows.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): re-cache kimi-k2.5 sans reasoning; add cross-judge average view
eval_lib gains a `reasoning` kwarg on call_chat that gets passed through
to OpenRouter. When a target model is known to be reasoning-class
(kimi-k2.5/k2.6/k2-thinking, glm-4.7, o1/o3/o4, deepseek-r1, opus 4.7,
sonnet 4.6) the default switches to `{"enabled": false}` so cached
dialogue replies are the in-character answer rather than truncated
thought process.
Result for kimi-k2.5: 0/15 valid replies before → 15/15 valid now,
44.5 s wall vs prior timeouts. Judged with both pinned judges:
grok-4.3 → 8.56 total (n=15) -- #4 in grok ranking
mistral-large → 8.96 total (n=15) -- #4 in mistral ranking
Both judges rank kimi-k2.5 #4, a consistent cross-judge placement (between
qwen3-235b and claude-haiku-4.5).
Static leaderboard page now exposes an "average" view in the judge
selector: per-candidate mean across each distinct judge it was scored
by. Only emitted when ≥2 judges agree on coverage (otherwise the
"average" of one row is just the row).
Removed the failed kimi-k2.6 judging run
(multiaxis_20260514T175833Z.json). It had been kept solely as a
failure-mode artifact and is no longer useful now that the broader story
is documented in the leaderboard md.
Coverage matrix dropped its hardcoded 3-column layout for a dynamic
one driven by `--judge-cols` CSS var, so adding/removing judges no
longer requires HTML edits.
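The cross-judge "average" view described in this commit amounts to a per-candidate mean over distinct judges, emitted only when at least two judges scored the candidate. The sketch below is in Python for illustration (the leaderboard page itself is vanilla JS), and the row fields and function name are assumptions.

```python
from statistics import mean


def cross_judge_average(rows, min_judges=2):
    """Mean total per candidate across distinct judges.

    rows: dicts with 'candidate', 'judge' and 'total'. Candidates scored
    by fewer than min_judges judges are omitted, since the average of a
    single row is just that row.
    """
    scores = {}
    for row in rows:
        scores.setdefault(row["candidate"], {})[row["judge"]] = row["total"]
    return {
        cand: mean(list(by_judge.values()))
        for cand, by_judge in scores.items()
        if len(by_judge) >= min_judges
    }
```

Using the kimi-k2.5 numbers above, grok-4.3 at 8.56 and mistral-large at 8.96 average to 8.76.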
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): judge all 14 high+mid candidates, parallelize, fix reasoning leaks
Judging: every cached candidate now has scores from both pinned judges
(grok-4.3 + mistral-large). 29 of 30 cached are fully judged; the
remaining qwen-2.5-72b-instruct is broken on OpenRouter and unscorable.
Parallelism: score_multiaxis grows a `--workers` flag (default 8).
ThreadPoolExecutor over the eligible records — judge calls are the
hot path, all I/O-bound. Wall-time went from estimated 1.5-2 h serial
to ~3 min for 4 judging passes across 14 new candidates.
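The `--workers` parallelism can be sketched with the standard library's ThreadPoolExecutor. The function names and the per-record error capture are illustrative assumptions; only the use of a thread pool with a default of 8 workers comes from the commit message.

```python
from concurrent.futures import ThreadPoolExecutor


def _judge_safely(judge_one, record):
    """Run one judge call and capture failures instead of raising."""
    try:
        return {"record": record, "result": judge_one(record), "error": None}
    except Exception as exc:  # one bad call should not abort the sweep
        return {"record": record, "result": None, "error": str(exc)}


def judge_all(records, judge_one, workers=8):
    """Judge every record concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda rec: _judge_safely(judge_one, rec), records))
```

Threads suit this workload because each judge call spends nearly all of its time waiting on the network.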
Reasoning leaks fixed:
- gemini-2.5-pro first cache had every reply truncated mid-sentence
("Ah, the night-ter"). max_tokens=200 consumed by internal thinking
before any content emitted. First-judge scores were nonsense (grok
2.91, mistral 5.77).
- glm-4.6 first cache had 4/15 replies as meta chain-of-thought
("1. **Deconstruct the Persona:**..."). Mistral was lenient enough
to score it 9.29; grok flagged it down to 4.43. 4.86-point judge
disagreement was the smoking gun.
eval_lib.call_chat already auto-disables reasoning for known
reasoning-class models via REASONING_MODEL_PREFIXES, but the
suppression dict was hardcoded to `{"enabled": false}`. Google rejects
that with HTTP 400; only `reasoning: {effort: "low"}` works. Added a
per-provider `_default_reasoning_for(model_id)` so the dict matches
what each provider accepts.
Added prefixes: google/gemini-2.5-pro, google/gemini-3, z-ai/glm-4.6.
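A per-provider suppression default along the lines described above could look like this sketch. The prefix tuple is an illustrative subset, the exact prefix strings are assumptions, and `default_reasoning_for` only approximates the `_default_reasoning_for(model_id)` helper named in the commit.

```python
# Illustrative subset; the real REASONING_MODEL_PREFIXES list is longer.
REASONING_MODEL_PREFIXES = (
    "moonshotai/kimi-k2",
    "z-ai/glm-4.6",
    "z-ai/glm-4.7",
    "google/gemini-2.5-pro",
    "google/gemini-3",
)


def default_reasoning_for(model_id: str):
    """Return the reasoning dict to send, or None for non-reasoning models."""
    if not model_id.startswith(REASONING_MODEL_PREFIXES):
        return None
    if model_id.startswith("google/"):
        # Per the commit, Google rejects {"enabled": false} with HTTP 400.
        return {"effort": "low"}
    return {"enabled": False}
```

Keeping the provider branch in one function means a new provider quirk needs a single edit rather than a change at every call site.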
After re-cache + re-judge, both candidates land in the normal range
with cross-judge spread under 1.0:
gemini-2.5-pro grok=8.12 mistral=8.97 Δ=0.85
z-ai/glm-4.6 grok=7.89 mistral=8.76 Δ=0.87
Headline standings (grok-4.3 / mistral-large):
openai/gpt-5.5 8.75 / 9.05 ← high-tier top
openai/gpt-5.4 8.55 / 9.17 ← mistral's #1 high
openai/gpt-5.4-mini 8.63 / 9.03 ← mid-tier top
anthropic/claude-opus-4.7 8.64 / 9.05
anthropic/sonnet-4.6 8.53 / 9.13
mistral-medium-3.1 8.59 / 9.09
Also stripped earlier n=1 outliers from the leaderboard data:
- kimi-k2.5 single-record rows in early mid-tier sweeps
- qwen-2.5-72b-instruct single-record row (provider broken)
Leaderboard page regenerated (quality=57 rows, perf=5, cached=30,
unjudged=1).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): fill perf table for 26 remaining candidates; dedupe by latest
Ran bench_perf on 4 parallel shards (~6-7 candidates each). 760 calls
across 26 new candidates, ~10 min wall, <$0.001 spend. Combined with
the original 4-candidate probe, the perf table now covers every cached
candidate.
build_leaderboard_page now keeps only the latest perf measurement per
candidate so smoke runs do not pollute the table. (Earlier 3-prompt
gemma-4 smoke was overwriting the 10-prompt sweep result.)
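The "keep only the latest measurement" rule can be sketched as follows. It is written in Python for illustration, and the row fields (`candidate`, `measured_at`) are assumptions; the sketch relies on UTC stamps such as `20260514T221845Z` sorting correctly as plain strings.

```python
def latest_per_candidate(perf_rows):
    """Keep the most recent perf row for each candidate.

    Timestamps in the form YYYYMMDDTHHMMSSZ compare correctly as
    strings, so no datetime parsing is needed.
    """
    latest = {}
    for row in perf_rows:
        current = latest.get(row["candidate"])
        if current is None or row["measured_at"] > current["measured_at"]:
            latest[row["candidate"]] = row
    return latest
```

Selecting by timestamp rather than by file-iteration order makes the result independent of how the perf files happen to be listed on disk.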
Notable perf findings:
- gpt-5.4-mini fastest at the top of mid: TTFT 561 ms / 75.5 tok-s
- gpt-5.5 surprisingly slow: TTFT 2242 ms / 60 tok-s — pay flagship
premium and wait twice as long as gpt-5.4-mini
- glm-4.6 total p50 = 16883 ms (huge tail) — reasoning still active
for some calls despite suppression; investigate before recommending
- phi-4 free-form JSON 0% / schema 100% — strict schema is mandatory
- mistral-medium-3.1 free-form JSON 10% / schema 100% — same shape
qwen-2.5-72b-instruct re-cache attempt forcing provider.order=
["DeepInfra", "Together"] recovered some replies (4/15) but still
flaky; not viable as a judge or candidate via OpenRouter under any
provider sort/order I tried. Documented as unjudged.
Final state: 29/30 cached candidates judged, 30/30 perf-measured.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eval): address gemini-code-assist review feedback
1. ELO match recording for errored candidates (HIGH).
When candidate A errored mid-sweep, `matches.append((b, a, 0.0))`
recorded a loss against B (the winner) instead of A. Bootstrap CI
inherited the same flip. Fixed to canonical `(a, b, 0.0)` ordering
so score_a=0.0 correctly means A lost.
2. Null-safe judge `reason` strings.
`str(out.get("reason", ""))` returns the literal string "None" when
the judge emits `{"reason": null}` (some providers do this even
with strict json_schema). Replaced with `str(out.get("reason") or
"")` everywhere — grade.py, rubric_lab.py, score_multiaxis.py.
3. Target identity collision in ELO mode.
`replies[(t.model, prompt_id)]` collided when two `--target` flags
shared the same `model` string (e.g. comparing the same model
across providers / base_urls). Switched to `model@base_url` as the
canonical id and added a startup check that rejects duplicates.
4. Bootstrap CI now mirrors dynamic K.
Main accumulator drops K from 32 → 16 after 50 matches per
candidate. _bootstrap_ci was using k_initial constant for every
match, so late-resampled matches moved ratings more than reality.
Bootstrap now tracks match counts per iteration and applies the
same threshold.
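Fix 4 can be illustrated by having the bootstrap reuse the same rating routine as the main accumulator, so each resample applies the K = 32 → 16 switch after 50 matches per candidate. This is a sketch under assumptions: a standard logistic ELO with a 400-point scale and 1500 start, a simple index-based 5/95 percentile, and hypothetical function names.

```python
import random


def elo_ratings(matches, k_initial=32, k_late=16, threshold=50, start=1500.0):
    """Rate candidates from (a, b, score_a) matches; score_a=0.0 means A lost."""
    ratings, played = {}, {}
    for a, b, score_a in matches:
        r_a, r_b = ratings.get(a, start), ratings.get(b, start)
        exp_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        # K depends on how many matches each side has already played.
        k_a = k_initial if played.get(a, 0) < threshold else k_late
        k_b = k_initial if played.get(b, 0) < threshold else k_late
        ratings[a] = r_a + k_a * (score_a - exp_a)
        ratings[b] = r_b - k_b * (score_a - exp_a)
        played[a] = played.get(a, 0) + 1
        played[b] = played.get(b, 0) + 1
    return ratings


def bootstrap_ci(matches, resamples=500, seed=0):
    """5/95 percentile interval per candidate over resampled match lists."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(resamples):
        resampled = [rng.choice(matches) for _ in matches]
        # Same routine as the main run, so match counts restart per resample.
        for name, rating in elo_ratings(resampled).items():
            samples.setdefault(name, []).append(rating)
    ci = {}
    for name, values in samples.items():
        values.sort()
        lo = values[int(0.05 * (len(values) - 1))]
        hi = values[int(0.95 * (len(values) - 1))]
        ci[name] = (lo, hi)
    return ci
```

Sharing one update routine between the point estimate and the bootstrap removes the chance of the two drifting apart again.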
All 27 grade.py unit tests still pass.
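The null-safety issue from fix 2 is easy to reproduce in isolation. The snippet below is a small demonstration using hypothetical helper names; the two expressions themselves are the ones quoted in the commit message.

```python
def naive_reason(out: dict) -> str:
    """Original form: the default only applies when the key is missing."""
    return str(out.get("reason", ""))


def safe_reason(out: dict) -> str:
    """Fixed form: a present-but-null value also collapses to ''."""
    return str(out.get("reason") or "")


judge_output = {"reason": None}  # what a judge emitting {"reason": null} parses to
print(repr(naive_reason(judge_output)))  # the literal string 'None'
print(repr(safe_reason(judge_output)))   # the empty string ''
```

The `or ""` form also maps other falsy values such as `0` or `[]` to an empty string, which is acceptable for a free-text reason field.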
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 16, 2026
Merged
dmooney added a commit that referenced this pull request on May 23, 2026
Six fixes spanning gemini-code-assist and chatgpt-codex-connector feedback on the runtime-loaded provider refactor.
- mod_source.rs: wrap discover_mods_in + register_provider_mods_once in tokio::task::spawn_blocking, and propagate register errors instead of warn-and-continue. Sync filesystem I/O on the executor stalled the Tokio runtime on slow disks; silent registry failures turned actionable startup errors into late, confusing fallback behaviour (gemini P0, codex P2 #5).
- parish-config provider.rs: drop the cfg(debug_assertions) gate on the auto-loader and rename it to ensure_mods_loaded. The release/debug skew meant any startup path that resolved provider config before the explicit bootstrap saw an empty registry, silently falling back to the simulator or panicking on Provider::from_id(...).expect(...). The auto-loader is now always-on; it consults PARISH_MODS_DIR first (operator override for packaged builds), then walks up from CARGO_MANIFEST_DIR (dev tree). Same idempotent Once guard (codex P1 #2 + #4).
- resolve_cloud_config: replace Provider::from_id("openrouter").expect(...) with an ok_or_else that returns a structured ParishError::Config. Operators who omit PARISH_CLOUD_PROVIDER on a deployment without the openrouter mod now see an actionable message instead of a crashed binary (codex P1 #2).
- ProviderMod: add explicit `keyless: bool` field (TOML default false). Set true in simulator/ollama/vllm/vllm_mlx/lmstudio. The wizard's keyless guard is now driven by this flag instead of !requires_api_key, which had mislabelled `custom` as keyless and let users save it with no model name (codex P2 #6 regression).
- ByokOnboarding.svelte: on listAvailableProviders failure, fall back to FALLBACK_FEATURED (anthropic/openai/openrouter/groq/google) with a visible banner instead of an empty grid. A transient API blip no longer hard-blocks first-run onboarding (codex P2 #7).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request on May 23, 2026
* refactor: runtime-load LLM providers from mods/<id>/
Replaces the compile-time-embedded provider catalog (parish-config/build.rs
scanning providers/*.toml) with a runtime-mod-loaded design.
Builtins (5, parish-config/src/builtin_providers/):
- simulator, ollama, vllm, vllm_mlx, custom
- inlined via include_str! — engine manages local processes / downloads,
or serves as universal escape hatch (custom)
Provider mods (19, mods/<id>/):
- anthropic, cohere, deepseek, github_models, google, groq, lmstudio,
mistral, moonshot, nvidia-nim, openai, openrouter, qwen, scaleway,
siliconflow, together, vercel-ai, xai, zhipu
- one mod per provider; each declares kind = "providers" in mod.toml
- discovered + registered into ProviderRegistry via discover_mods +
register_provider_mods_once (OnceLock-guarded, called from
load_setting_mod_sync + LocalDiskModSource::list_mods)
ModKind::Providers variant + load_providers_from_mod helper in
parish-core, with traversal-rejection + duplicate-id checks.
ProviderRegistry rewired with RwLock-backed interior mutability so
post-init register_mod_providers merges cleanly. Last-wins on collision
with WARN log; identical re-registration is silent no-op.
Backend IPC handle_list_available_providers (parish-core) returns
featured/other split. Wired to Tauri command list_available_providers +
MCP bridge /api/available-providers route.
ShowPreset listing now reads dynamically from registry instead of a
hardcoded string.
Debug-build auto-loader parish_config::ensure_test_mods_loaded walks
the workspace mods/ tree so unit tests + dev runs see the same registry
as production without each test calling setup manually.
New tests:
- parish-config: builtin_providers_parse_and_register, register_mod_
providers_merges_new_ids, register_mod_providers_last_wins_on_collision
- parish-core: discover_mods_classifies_providers_kind,
load_providers_from_mod_{parses_multiple_tomls_in_lex_order,
empty_when_directory_missing, rejects_symlink_traversal,
rejects_duplicate_ids_within_one_mod}
Proof bundle at docs/proofs/provider-mods-runtime/ — live CLI transcript
verifies switching to a mod-loaded provider (openai), a runtime-added
mod (test-provider, registered without recompile), and back to a
builtin (simulator).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(proofs): relocate provider-mods-runtime bundle to .proofs/
Rule 10 update (commit 554410e) requires proof bundles to live in
gitignored .proofs/<task-id>/ and attach to the PR via
`just attach-proof`. Move the bundle out of tracked docs/proofs/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* refactor(providers): drop cloud ctors + wire UI/server to runtime registry
Follow-on cleanup to commit edc2fc7b. No deferrals left.
- Remove Provider::{anthropic, openai, openrouter, google, groq, xai,
mistral, deepseek, together, lmstudio, github_models} convenience
constructors. Every callsite now uses Provider::from_id(id) with an
appropriate .expect() (tests) or .unwrap_or_default() (runtime defaults
that previously fell to openrouter; the simulator builtin is the
fallback when openrouter mod is absent).
- parish-server: add /api/list-available-providers handler + route, so
the web UI can enumerate the runtime provider registry. Matches Tauri
command list_available_providers and the MCP bridge route.
- parish/apps/ui:
- lib/ipc.ts: add listAvailableProviders() + AvailableProvidersResponse
types.
- components/ByokOnboarding.svelte: fetch featured/other lists at
mount, drop static imports.
- lib/byokProviders.ts: collapse to a thin type-adapter
(toByokMeta + findProvider). Hand-curated FEATURED_PROVIDERS and
OTHER_PROVIDERS arrays are gone; adding a provider is a TOML drop
under mods/<id>/ with no TS edit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(mods): rename provider mods with -provider suffix
mods/<id>/ -> mods/<id>-provider/ for all 19 cloud provider mods. The
mod-id field in each mod.toml is updated to match the new directory name
(<id>-provider). The provider id inside each TOML (id = "<bare>") stays
unchanged — that is the registry key Provider::from_id(...) and
parish.toml's provider field still target.
Also removes the throwaway mods/test-provider/ from the repo. It is a
fixture artifact created on-demand by the verification script per its
header instructions; shipping it would make the no-recompile claim
tautological. The captured proof transcript still demonstrates the
add-and-discover flow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(providers): address bot review on PR #1049
Six fixes spanning gemini-code-assist and chatgpt-codex-connector
feedback on the runtime-loaded provider refactor.
- mod_source.rs: wrap discover_mods_in + register_provider_mods_once in
tokio::task::spawn_blocking, and propagate register errors instead of
warn-and-continue. Sync filesystem I/O on the executor stalled the
Tokio runtime on slow disks; silent registry failures turned
actionable startup errors into late, confusing fallback behaviour
(gemini P0, codex P2 #5).
- parish-config provider.rs: drop the cfg(debug_assertions) gate on the
auto-loader and rename it to ensure_mods_loaded. The release/debug
skew meant any startup path that resolved provider config before the
explicit bootstrap saw an empty registry, silently falling back to
the simulator or panicking on Provider::from_id(...).expect(...).
The auto-loader is now always-on; it consults PARISH_MODS_DIR first
(operator override for packaged builds), then walks up from
CARGO_MANIFEST_DIR (dev tree). Same idempotent Once guard
(codex P1 #2 + #4).
- resolve_cloud_config: replace Provider::from_id("openrouter")
.expect(...) with an ok_or_else that returns a structured
ParishError::Config. Operators who omit PARISH_CLOUD_PROVIDER on a
deployment without the openrouter mod now see an actionable message
instead of a crashed binary (codex P1 #2).
- ProviderMod: add explicit `keyless: bool` field (TOML default false).
Set true in simulator/ollama/vllm/vllm_mlx/lmstudio. The wizard's
keyless guard is now driven by this flag instead of
!requires_api_key, which had mislabelled `custom` as keyless and let
users save it with no model name (codex P2 #6 regression).
- ByokOnboarding.svelte: on listAvailableProviders failure, fall back
to FALLBACK_FEATURED (anthropic/openai/openrouter/groq/google) with
a visible banner instead of an empty grid. A transient API blip no
longer hard-blocks first-run onboarding (codex P2 #7).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
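The resolution order of the always-on auto-loader (PARISH_MODS_DIR first, then a walk up from CARGO_MANIFEST_DIR, behind an idempotent guard) might look roughly like the sketch below. Everything except the names `ensure_mods_loaded`, `PARISH_MODS_DIR`, and `CARGO_MANIFEST_DIR` is an assumption: the helper functions are hypothetical, the sketch caches a directory path rather than populating a provider registry, it treats an empty override as unset, and it uses `std::sync::OnceLock` where the commit message mentions a `Once` guard.

```rust
use std::env;
use std::path::{Path, PathBuf};
use std::sync::OnceLock;

// Hypothetical cache for the resolved directory; the real loader fills a
// provider registry instead of remembering a path.
static MODS_DIR: OnceLock<Option<PathBuf>> = OnceLock::new();

/// Walk up from `start` and return the first ancestor with a `mods/` child.
fn find_mods_dir_from(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .map(|dir| dir.join("mods"))
        .find(|candidate| candidate.is_dir())
}

/// Operator override wins; otherwise fall back to the dev-tree walk.
/// Assumption: an empty override string is treated as "not set".
fn resolve_mods_dir(override_dir: Option<&str>, manifest_dir: &Path) -> Option<PathBuf> {
    match override_dir {
        Some(dir) if !dir.is_empty() => Some(PathBuf::from(dir)),
        _ => find_mods_dir_from(manifest_dir),
    }
}

/// Idempotent entry point: the closure runs at most once per process,
/// and every later call returns the cached result.
fn ensure_mods_loaded() -> Option<&'static Path> {
    MODS_DIR
        .get_or_init(|| {
            let override_dir = env::var("PARISH_MODS_DIR").ok();
            // option_env! keeps the sketch compilable outside Cargo.
            let manifest_dir = option_env!("CARGO_MANIFEST_DIR").unwrap_or(".");
            resolve_mods_dir(override_dir.as_deref(), Path::new(manifest_dir))
        })
        .as_deref()
}
```

Keeping the resolution logic in a pure function (`resolve_mods_dir`) separate from the guarded entry point makes the ordering testable without touching process-wide state.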
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 26, 2026
dmooney added a commit that referenced this pull request May 26, 2026
#1138)
* fix(npc): forbid mid-conversation farewells in Tier-1 prompt (TODO #4)
Cycle 1 of the demo audit caught Cormac Duffy closing turn 2 with
"Slán abhaile" while the conversation continued five more turns. The
ALLOWED IRISH PHRASES section legitimises Slán abhaile and Slán leat as
period-appropriate vocabulary, but gave the model no instruction about
when those phrases are welcome. Add a NEVER FAREWELL MID-CONVERSATION
section that gates Slán*, Goodbye, Farewell, and the English gloss
"safe home" behind an explicit player departure cue, and direct the
model to land each non-departing reply on a question, observation, or
offer instead.
Mirror the directive into mods/rundale/prompts/tier1_system.txt so the
mod artefact does not drift from the canonical Rust prompt, and extend
test_tier1_system_no_unsubstituted_placeholders to pin the new header
and each gated token so a future refactor cannot drop the section
silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: retrigger after auto-cancel
* ci: retrigger after Actions incident resolved
* ci: retrigger after Rust quality gate timeout (cache cold)
* test(inference): route default-frequency-penalty test through Interactive lane
#1127's test_inference_queue_send_default_omits_frequency_penalty sent
on the Background lane but received from the Interactive receiver
(`irx`). The message went to `_brx`, which nobody reads, so
`irx.recv()` blocked indefinitely. PR #1127's CI runs were both
cancelled before the test could surface the hang, and the merge to main
stalled the Rust quality gate at the 30-minute timeout on every
subsequent PR.
Swap `InferencePriority::Background` to `Interactive` so the send lands
on the lane `irx` actually drains. Add a comment pointing future
readers at the failure mode so the lane-mismatch trap isn't re-laid.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
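The lane-mismatch hang described above can be reproduced in miniature. The sketch below is an assumption-laden illustration rather than the project's queue: `InferenceQueue`, its fields, and `recv_within` are hypothetical, requests are plain strings, and `std::sync::mpsc` stands in for whatever channel type the real queue uses. Only `InferencePriority::Background`, `InferencePriority::Interactive`, `irx`, and `_brx` come from the commit message.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InferencePriority {
    Interactive,
    Background,
}

/// Hypothetical two-lane queue: one sender per priority lane.
struct InferenceQueue {
    interactive: Sender<String>,
    background: Sender<String>,
}

impl InferenceQueue {
    /// Returns the queue plus the (interactive, background) receivers.
    fn new() -> (Self, Receiver<String>, Receiver<String>) {
        let (itx, irx) = channel();
        let (btx, brx) = channel();
        (InferenceQueue { interactive: itx, background: btx }, irx, brx)
    }

    /// Route a request onto the lane matching its priority.
    fn send(&self, priority: InferencePriority, request: &str) {
        let lane = match priority {
            InferencePriority::Interactive => &self.interactive,
            InferencePriority::Background => &self.background,
        };
        lane.send(request.to_string()).expect("receiver dropped");
    }
}

/// Bounded receive: a wrong-lane send shows up as a timeout error
/// instead of an indefinite block.
fn recv_within(rx: &Receiver<String>, millis: u64) -> Result<String, RecvTimeoutError> {
    rx.recv_timeout(Duration::from_millis(millis))
}
```

Sending on `Background` and then reading `irx` with a plain blocking receive would wait forever, which matches the stalled quality gate. A bounded receive in tests is one possible guard in addition to the fix the commit applies: the mismatch fails within the timeout rather than consuming the whole CI budget.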