docs: add debug system design — commands, TUI panel, metrics architecture by dmooney · Pull Request #4 · dmooney/Rundale

dmooney · 2026-03-18T17:49:09Z

No description provided.

…ture

Design document for a feature-gated debug system that provides runtime visibility into NPC state, inference pipeline, background tasks, and performance metrics. Includes slash commands (/debug *) and a toggleable live TUI panel with tabbed views.

https://claude.ai/code/session_012egaxMhLdaCuMgjHpR6359

…igation

- Fix phase status inconsistencies: Phases 1-3 now correctly shown as complete across README.md, docs/index.md, phase plans, and roadmap
- Update module trees in CLAUDE.md and architecture overview to reflect actual src/ structure (npc submodules, debug.rs, geo_tool, gui submodules)
- Remove outdated Bevy ECS reference from architecture overview
- Add ADR-012 documenting the hierarchical documentation organization
- Add research/ section to docs/index.md
- Update known issue #4 to reflect that ShortTermMemory exists but isn't wired into LLM prompts yet
- Clean up maybe-bad-ideas.md (separate shipped items)
- Add "Key Design Docs" column to phase status table in docs/index.md
- Improve README.md with documentation tree diagram

https://claude.ai/code/session_01VCXxoKAD8dYAr99LHaEgzq

- Replace `map_or(false, ...)` with `is_some_and(...)` in debug_snapshot.rs to satisfy clippy::unnecessary_map_or.
- Move known-issue #4 (NPC memory not wired into prompts) to Resolved section: `build_enhanced_context_with_config()` already injects short-term memories, long-term recall, and gossip into Tier 1 prompts.

https://claude.ai/code/session_01BEC62B2awxqMsferWdBknk

- Replace `map_or(false, ...)` with `is_some_and(...)` in debug_snapshot.rs to satisfy clippy::unnecessary_map_or
- Move known issue #4 (NPC memory not in prompts) to Resolved: build_enhanced_context() already injects short-term memory, long-term recall, reactions, and gossip into all Tier 1 prompts

https://claude.ai/code/session_01Ts3W9TiNfbLhSuHC6Xyqdf

- Replace `map_or(false, ...)` with `is_some_and(...)` in debug_snapshot.rs to satisfy clippy::unnecessary_map_or.
- Move known-issue #4 (NPC memory not wired into prompts) to Resolved section: `build_enhanced_context_with_config()` already injects short-term memories, long-term recall, and gossip into Tier 1 prompts.

https://claude.ai/code/session_01BEC62B2awxqMsferWdBknk

Co-authored-by: Claude <noreply@anthropic.com>



* feat(eval): rundale-bench ELO mode (pairwise judge)

The absolute 5-axis rubric saturated near ceiling: gpt-oss-120b:free scored 4.82/5 with no headroom for stronger models to differentiate. Pairwise ELO replaces it as the dialogue-ranking primary.

Changes:
- judge_pairwise_v1.json: new pinned judge config (qwen3-235b on OpenRouter, rubric_sha256 b5664f96…dc7c0), pairwise rubric requiring one-sentence reason per verdict.
- grade.py::grade_pairwise: judge picks A | B | tie + reason. Non-Latin script in one reply auto-disqualifies that side (judge can't be trusted to enforce that consistently).
- rundale_bench.py --mode elo: repeated --target flags, runs every candidate over the dialogue slice once (cached replies), schedules one pairwise match per (a, b) pair per prompt with A/B position randomized to absorb first-position bias. K=32 → K=16 after 50 matches per candidate. Bootstrap 5/95 CI via 500 resamples.
- test_grade.py: 5 new pairwise tests (winner, tie, invalid winner, non-Latin auto-DQ, rubric tamper). 27/27 pass.

Smoke (3 candidates × 10 prompts × 1 pair per prompt = 60 calls, ~$0.0013, 346s):

    1646.2 [CI 1598.6–1694.0]  qwen/qwen3-235b-a22b-2507
    1497.1 [CI 1437.0–1561.2]  openai/gpt-oss-120b:free
    1356.6 [CI 1306.3–1399.8]  mistralai/mistral-small-24b-instruct-2501

290-point spread with non-overlapping CIs between top and bottom.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): cache_dialogue_replies + rubric_lab for offline rubric tuning

Goal: iterate the judging rubric without re-querying candidate models. Generate replies once, cache to JSON, then prototype rubrics offline.

Tools:
- cache_dialogue_replies.py: --target ... --prompts N, writes docs/proofs/rundale-bench/dialogue_samples_<UTC>.json with one record per (candidate, prompt) including reply, usage, latency, error.
- rubric_lab.py: loads a cached samples file + a rubric text file, scores all samples in absolute (1-N) or pairwise (ELO) mode, prints per-candidate distribution.
Zero candidate spend per iteration, only judge cost.

Three cache snapshots committed as evidence:

    003651Z.json  6 free OpenRouter candidates × 15 prompts
                  65/90 useful (gemma-4 0/15, qwen-next 6/15 throttled)
                  kept as baseline showing why free tier is unreliable
    004513Z.json  6 paid cheap (<$1/M out) × 15 prompts
                  qwen3-235b, mistral-small-24b, gemma-3-27b, phi-4, deepseek-v3.2, gpt-oss-120b
                  90/90 ok, $0.0009, 205s
    005721Z.json  6 mid-tier ($1-$5/M out) × 15 prompts
                  claude-haiku-4.5, gpt-4o-mini, gemini-2.5-flash, mistral-large-2512, kimi-k2.5, grok-3-mini
                  90/90 ok, 324s

245 useful samples across 17 distinct candidates total. Rubric iteration now an offline activity that costs nothing per pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): 12-candidate ELO sweep with mistral-large judge

Kimi 2.6 attempted as judge but failed: reasoning model emits all output in `reasoning` field, content is null, every match returned empty → all ties. Same problem hit z-ai/glm-4.7. Switched to a non-thinking judge.

mistral-large-2512 ($0.50/$1.50, 1-2s/call) judged 816 pairwise matches over the 12 paid OpenRouter candidates × 15 dialogue prompts:

    1898.9  qwen/qwen3-235b-a22b-2507
    1768.4  anthropic/claude-haiku-4.5
    1705.0  google/gemma-3-27b-it
    1682.6  mistralai/mistral-large-2512 (judge self-bias)
    1622.8  moonshotai/kimi-k2.5 (n=11 only; reasoning empty)
    1484.9  x-ai/grok-3-mini
    1473.3  deepseek/deepseek-v3.2
    1356.8  openai/gpt-oss-120b
    1340.9  google/gemini-2.5-flash
    1305.2  mistralai/mistral-small-24b-instruct-2501
    1242.6  openai/gpt-4o-mini
    1118.6  microsoft/phi-4

780-point top-to-bottom spread. Headline: claude-haiku-4.5 is the standout cheap-tier dialogue model; phi-4 is unsuitable. qwen3-235b top spot is suspect: it was the prior judge pin so may carry training bias toward its own style.
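
For readers unfamiliar with the rating scheme, here is a minimal Python sketch of the pairwise ELO accumulator and bootstrap CI described in these commit messages. It is not the code in rundale_bench.py. The names `run_elo`, `bootstrap_ci` and `k_for`, the 1500 starting rating, and the choice to apply K per candidate are all assumptions made for illustration; only the K=32 → 16 switch after 50 matches, the 5/95 interval, and the 500 resamples come from the text.

```python
import random
from collections import defaultdict

K_INITIAL = 32        # K-factor for a candidate's early matches (from the commit text)
K_SETTLED = 16        # K-factor after the switch (from the commit text)
K_SWITCH_AFTER = 50   # matches per candidate before K drops (from the commit text)
START_RATING = 1500.0 # assumption: the source does not state the starting rating


def k_for(played: int) -> int:
    """Return the K-factor for a candidate that has played `played` matches."""
    return K_INITIAL if played < K_SWITCH_AFTER else K_SETTLED


def run_elo(matches):
    """Replay matches in order and return {candidate: rating}.

    Each match is (a, b, score_a) where score_a is 1.0 if A won,
    0.5 for a tie and 0.0 if A lost.
    """
    ratings = defaultdict(lambda: START_RATING)
    played = defaultdict(int)
    for a, b, score_a in matches:
        # Expected score of A given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = score_a - expected_a
        ratings[a] += k_for(played[a]) * delta
        ratings[b] -= k_for(played[b]) * delta
        played[a] += 1
        played[b] += 1
    return dict(ratings)


def bootstrap_ci(matches, resamples=500, seed=0):
    """Resample matches with replacement and return {candidate: (p5, p95)}.

    Every resample is replayed through run_elo, so the bootstrap uses the
    same dynamic K schedule as the main accumulator.
    """
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(resamples):
        resampled = [rng.choice(matches) for _ in matches]
        for name, rating in run_elo(resampled).items():
            samples[name].append(rating)
    intervals = {}
    for name, values in samples.items():
        values.sort()
        low = values[int(0.05 * (len(values) - 1))]
        high = values[int(0.95 * (len(values) - 1))]
        intervals[name] = (low, high)
    return intervals
```

With both candidates at 1500, a single win moves the winner to 1516 and the loser to 1484, because the expected score is 0.5 and K is 32. Randomizing which reply is shown as A happens before the judge call and is not modelled here.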
Tooling updates:
- rubric_lab.py: per-match progress log (flush=True), max_tokens=2000 for judge calls (reasoning models need budget), empty-content guard that raises "reasoning model truncated?" instead of silent 0-score.
- Documented reasoning-class judge incompatibility in leaderboard caveats: until call_chat handles `reasoning` fallback, kimi-* and glm-* are unusable as judges OR candidates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(eval): justfile recipes for rundale-bench workflows

Wraps the tasks executed this session so they're rerunnable without remembering target-spec strings, env-loading boilerplate, or rubric_lab flags.

Recipes:

Correctness gates (no API spend):
- test: run grade.py unit tests
- manifest: rebuild MANIFEST.json after slice edits
- split: re-run holdout split + manifest

Single-target probes:
- intent target [limit=20]: deterministic intent grading
- dialogue target [judge limit]: absolute 5-axis rubric
- bench target [judge limit]: --slice all sweep

Multi-target ELO:
- elo "spec_a spec_b ..." [limit=10]: in-bench --mode elo, fresh replies

Cached-reply workflow (rubric iteration, no candidate spend):
- cache-cheap [prompts=15]: 6 paid-cheap fleet (~$0.001, ~3min)
- cache-mid [prompts=15]: 6 mid-tier fleet (~$0.05)
- cache-all [prompts=15]: both, back to back
- cache targets="..." [prompts]
- elo-from-cache samples [rubric judge output]: pairwise over cache
- absolute-from-cache samples rubric [axis_scale judge output]

Convenience:
- list-samples: show cached sample files
- sample-stats: n_samples / ok / empty / cost per cache

Run from repo root with `-f`:

    just -f parish/testing/rundale-bench/justfile <recipe> ...

ENV_FILE points at <repo>/.env; recipes source it before invoking python so OPENROUTER_API_KEY (and equivalents) reach call_chat. Default judge is mistral-large-2512 (non-thinking, ~$0.001/call, 1-2s wall); override with `judge=...` per recipe.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): multi-axis 0-10 scoring + just recipe

Per-axis 0-10 dialogue scorer over cached samples. Five axes (character, authenticity, language, responsiveness, craft) + a judge-emitted total. Complements pairwise ELO by exposing *why* a candidate ranks where it does.

- `score_multiaxis.py`: judge emits one integer per axis (0-10) plus a float `total`. Per-record + per-candidate aggregates written to JSON. Default judge `mistralai/mistral-large-2512`, calibration anchors baked into the rubric (0/3/6/8/10 gradient).
- `justfile multiaxis` recipe: accepts positional args only (samples, judge, rubric); output is auto-stamped to docs/proofs/rundale-bench/. Earlier draft accepted `output=...` as a named arg which just silently parses as the next positional, sending `--judge-model 'output=...'` and earning HTTP 400 from every call. Removed.

Evidence: ran both cached sweeps end-to-end. 76+88 calls, ~$0 spend (well under mistral-large's $0.50/$1.50 floor). Three top candidates agree with the ELO ranking: gemma-3-27b / qwen3-235b / claude-haiku-4.5 cluster within 0.10 of each other, mistral-large self-bias still present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): qwen3-max flagship probe + aggregator bug fix

Added qwen/qwen3-max (Chinese flagship, $1.20/$6 per M, non-reasoning) to the multi-axis 0-10 leaderboard. Result: 9.03 total, which ties gemma-3-27b and edges qwen3-235b by 0.03, well inside rubric noise. ~17× the cost of qwen3-235b for zero meaningful gain on this slice.

Also fixed an aggregator bug in score_multiaxis.py: judge-call errors (HTTP 503, rate-limit drops, schema-parse failures) were contributing zeros to per-candidate means. A first qwen3-max pass hit two 503s after 4 retries each, pulling its headline mean to 7.87, which is nonsense for a flagship. Errored records still land in `per_record` for inspection but no longer skew aggregates.
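
The aggregator fix amounts to filtering errored records out before averaging. The sketch below shows the idea in Python. It is not the code in score_multiaxis.py: the record shape (`candidate`, `error`, `total`, and one key per axis) and the function name `aggregate` are hypothetical, chosen only to make the example self-contained.

```python
from statistics import mean

# Axis names are taken from the commit text; the record layout is an assumption.
AXES = ("character", "authenticity", "language", "responsiveness", "craft")


def aggregate(per_record):
    """Return per-candidate means, ignoring records whose judge call errored.

    Errored records stay in `per_record` for inspection; they are only
    skipped here so they cannot drag a mean towards zero.
    """
    scored = {}
    for rec in per_record:
        if rec.get("error"):
            continue  # judge call failed: no score to average
        scored.setdefault(rec["candidate"], []).append(rec)

    summary = {}
    for candidate, recs in scored.items():
        row = {"n": len(recs), "total": mean([r["total"] for r in recs])}
        for axis in AXES:
            row[axis] = mean([r[axis] for r in recs])
        summary[candidate] = row
    return summary
```

A candidate whose every judge call failed simply has no row in the summary, which keeps it visibly unjudged instead of giving it a misleading score of zero.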
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): grok-4.3 judge re-sweep + deepseek-v4-pro candidate

Re-judged all 4 cached sample sets with x-ai/grok-4.3 (non-reasoning, $1.25/$2.50 per M) to test judge sensitivity. 193 calls, 0 errors, ~43 min, ~$0.40 total.

Added deepseek/deepseek-v4-pro to the candidate roster (deepseek's current flagship, $0.435/$0.87 per M). Surprise result: v4-pro scores *worse* than v3.2 (7.77 vs 7.91) on 1820 Irish dialogue, likely from heavier reasoning/structure RLHF that hurts persona fidelity.

Judge findings:
- 2.36-point top-to-bottom spread vs 0.81 under mistral-large. Grok discriminates harder.
- Top-3 robust across judges: qwen3-max / mistral-large / qwen3-235b within 0.06 of each other under both. Real ceiling for this slice.
- grok-3-mini lifts to #4: same judge-family bias as previous sweeps. Discount by ~0.3 mentally.
- gpt-4o-mini drops 1.67 points under stricter grok judge (8.27 → 6.60), the biggest cross-judge gap of any candidate.

kimi-k2.6 tried as judge first; failed 70%+ of calls (reasoning tokens ate the max_tokens budget before content emission, even at 8K). Added a reasoning-field fallback to eval_lib's `call_chat` so short reply→content-empty cases recover (k2.6 still not viable as judge under JSON-schema mode). Bumped score_multiaxis _JUDGE_MAX_TOKENS 2000 → 8000 to give thinking judges headroom.

Records kimi failure as evidence: multiaxis_20260514T175833Z.json (10 of 14 errors).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): TTFT/tok-s/JSON-compliance probe + gemma-4/qwen-2.5 candidates

Adds a streaming-mode perf probe complementing the rubric-quality bench.
Three new dimensions per candidate:
- ttft_ms (time-to-first-content-token via stream:true)
- tok/s (completion_tokens / generation seconds)
- json compliance rate, both free-form and schema-enforced

Implementation:
- eval_lib.py grows `call_chat_streaming(target, …)` that parses SSE, records TTFT on first content delta, accumulates token usage from the final usage line. Also retries on body-level 502/503 (xAI grok-4.3 capacity errors land inside 200 OK responses).
- bench_perf.py runs the probe and dumps perf_<UTC>.json with per-call records + per-candidate medians + p90s.
- justfile adds `perf` (single target) and `perf-many` (pre-quoted --target … flags) recipes.

Initial measurements on 4 candidates (10 prompts each; JF% = free-form JSON compliance, JS% = schema-enforced JSON compliance):

    candidate                   TTFT p50  Total p50  tok/s p50  JF%   JS%
    qwen/qwen3-235b-a22b-2507   363ms     1696ms     54.2       100%  100%
    google/gemma-3-27b-it       380ms     2328ms     37.9       100%  100%
    google/gemma-4-31b-it       1160ms    3214ms     21.9       100%  100%
    qwen/qwen-2.5-72b-instruct  0ms       5242ms     0.0        10%   90%

qwen3-235b is fastest across all three perf axes AND tied for top-3 quality under both judges. Strong default.

gemma-4 is slower than gemma-3 in every dimension (TTFT 3×, tok/s ~½) AND scores 0.23 lower under the grok-4.3 judge. Newer is not an upgrade.

qwen-2.5-72b-instruct returns "Provider returned error" 400 from OpenRouter on 14/15 cache calls; streaming all errored. Free-form JSON 10%, schema 90%. Different provider routing required before this candidate can be fairly evaluated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(eval): cache dialogue replies for 14 new candidates (high + mid tier)

Cached 15 dialogue replies for each candidate to enable offline rubric judging without re-spending. Judging deferred: user wants more samples collected before re-judging the full set.
High tier (6, $0.x to $5/$30 per M):
- anthropic/claude-opus-4.7
- anthropic/claude-sonnet-4.6
- openai/gpt-5.5
- openai/gpt-5.4
- google/gemini-2.5-pro
- x-ai/grok-4.3

Mid tier (8):
- meta-llama/llama-3.3-70b-instruct
- meta-llama/llama-4-maverick
- meta-llama/llama-4-scout
- openai/gpt-5.4-mini
- mistralai/mistral-medium-3.1
- z-ai/glm-4.6
- nousresearch/hermes-4-405b
- amazon/nova-pro-v1

All 210 calls returned non-empty replies.

Cache files:
- dialogue_samples_20260514T221845Z.json (high tier, 90 samples, ~6min)
- dialogue_samples_20260515T181515Z.json (mid tier, 120 samples, ~6min)

Brings the candidate pool to 28 unique models cached for dialogue. Ready for a multi-judge re-sweep when bench iteration resumes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): static leaderboard page for rundale-bench

Aggregates every multiaxis_*.json + perf_*.json + dialogue_samples_*.json into a single static HTML page (`docs/proofs/rundale-bench/leaderboard.html`). Open in any browser; no server required.

Sections:
- Stat header (cached / judged / unjudged / judge count / row counts)
- Quality table (sortable + filterable by candidate or judge)
- Perf table (TTFT / total / tok/s / JSON compliance pills)
- Judge coverage matrix (✓ / · per candidate × judge)
- Unjudged backlog list

Vanilla JS, no deps. Dark theme by default with auto-light via prefers-color-scheme. Pure DOM rendering with HTML escaping; no innerHTML on untrusted strings.

Regenerate after any new run with:

    just -f parish/testing/rundale-bench/justfile leaderboard

Current state baked into the page: 30 cached candidates, 16 judged, 14 unjudged, 3 distinct judges, 31 quality rows + 5 perf rows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): re-cache kimi-k2.5 sans reasoning; add cross-judge average view

eval_lib gains a `reasoning` kwarg on call_chat that gets passed through to OpenRouter.
When a target model is known to be reasoning-class (kimi-k2.5/k2.6/k2-thinking, glm-4.7, o1/o3/o4, deepseek-r1, opus 4.7, sonnet 4.6) the default switches to `{"enabled": false}` so cached dialogue replies are the in-character answer rather than truncated thought process.

Result for kimi-k2.5: 0/15 valid replies before → 15/15 valid now, 44.5 s wall vs prior timeouts. Judged with both pinned judges:

    grok-4.3      → 8.56 total (n=15) -- #4 in grok ranking
    mistral-large → 8.96 total (n=15) -- #4 in mistral ranking

Both rank kimi-k2.5 #4 in cross-judge consistent way (between qwen3-235b and claude-haiku-4.5).

Static leaderboard page now exposes an "average" view in the judge selector: per-candidate mean across each distinct judge it was scored by. Only emitted when ≥2 judges agree on coverage (otherwise the "average" of one row is just the row).

Removed the failed kimi-k2.6 judging run (multiaxis_20260514T175833Z.json). It was kept solely as a failure-mode artifact and is no longer useful now that the broader story is documented in the leaderboard md.

Coverage matrix dropped its hardcoded 3-column layout for a dynamic one driven by `--judge-cols` CSS var, so adding/removing judges no longer requires HTML edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): judge all 14 high+mid candidates, parallelize, fix reasoning leaks

Judging: every cached candidate now has scores from both pinned judges (grok-4.3 + mistral-large). 29 of 30 cached are fully judged; the remaining qwen-2.5-72b-instruct is broken on OpenRouter and unscorable.

Parallelism: score_multiaxis grows a `--workers` flag (default 8). ThreadPoolExecutor over the eligible records; judge calls are the hot path, all I/O-bound. Wall-time went from estimated 1.5-2 h serial to ~3 min for 4 judging passes across 14 new candidates.

Reasoning leaks fixed:
- gemini-2.5-pro first cache had every reply truncated mid-sentence ("Ah, the night-ter").
  max_tokens=200 consumed by internal thinking before any content emitted. First-judge scores were nonsense (grok 2.91, mistral 5.77).
- glm-4.6 first cache had 4/15 replies as meta chain-of-thought ("1. **Deconstruct the Persona:**..."). Mistral was lenient enough to score it 9.29; grok flagged it down to 4.43. 4.86-point judge disagreement was the smoking gun.

eval_lib.call_chat already auto-disables reasoning for known reasoning-class models via REASONING_MODEL_PREFIXES, but the suppression dict was hardcoded to `{"enabled": false}`. Google rejects that with HTTP 400; only `reasoning: {effort: "low"}` works. Added a per-provider `_default_reasoning_for(model_id)` so the dict matches what each provider accepts. Added prefixes: google/gemini-2.5-pro, google/gemini-3, z-ai/glm-4.6.

After re-cache + re-judge, both candidates land in the normal range with cross-judge spread under 1.0:

    gemini-2.5-pro  grok=8.12  mistral=8.97  Δ=0.85
    z-ai/glm-4.6    grok=7.89  mistral=8.76  Δ=0.87

Headline standings (grok-4.3 / mistral-large):

    openai/gpt-5.5             8.75 / 9.05  ← high-tier top
    openai/gpt-5.4             8.55 / 9.17  ← mistral's #1 high
    openai/gpt-5.4-mini        8.63 / 9.03  ← mid-tier top
    anthropic/claude-opus-4.7  8.64 / 9.05
    anthropic/sonnet-4.6       8.53 / 9.13
    mistral-medium-3.1         8.59 / 9.09

Also stripped earlier n=1 outliers from the leaderboard data:
- kimi-k2.5 single-record rows in early mid-tier sweeps
- qwen-2.5-72b-instruct single-record row (provider broken)

Leaderboard page regenerated (quality=57 rows, perf=5, cached=30, unjudged=1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): fill perf table for 26 remaining candidates; dedupe by latest

Ran bench_perf on 4 parallel shards (~6-7 candidates each). 760 calls across 26 new candidates, ~10 min wall, <$0.001 spend. Combined with the original 4-candidate probe, the perf table now covers every cached candidate.

build_leaderboard_page now keeps only the latest perf measurement per candidate so smoke runs do not pollute the table.
(Earlier 3-prompt gemma-4 smoke was overwriting the 10-prompt sweep result.)

Notable perf findings:
- gpt-5.4-mini fastest at the top of mid: TTFT 561 ms / 75.5 tok-s
- gpt-5.5 surprisingly slow: TTFT 2242 ms / 60 tok-s. Pay flagship premium and wait twice as long as gpt-5.4-mini
- glm-4.6 total p50 = 16883 ms (huge tail): reasoning still active for some calls despite suppression; investigate before recommending
- phi-4 free-form JSON 0% / schema 100%: strict schema is mandatory
- mistral-medium-3.1 free-form JSON 10% / schema 100%: same shape

qwen-2.5-72b-instruct re-cache attempt forcing provider.order=["DeepInfra", "Together"] recovered some replies (4/15) but still flaky; not viable as a judge or candidate via OpenRouter under any provider sort/order I tried. Documented as unjudged.

Final state: 29/30 cached candidates judged, 30/30 perf-measured.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(eval): address gemini-code-assist review feedback

1. ELO match recording for errored candidates (HIGH). When candidate A errored mid-sweep, `matches.append((b, a, 0.0))` recorded a loss against B (the winner) instead of A. Bootstrap CI inherited the same flip. Fixed to canonical `(a, b, 0.0)` ordering so score_a=0.0 correctly means A lost.

2. Null-safe judge `reason` strings. `str(out.get("reason", ""))` returns the literal string "None" when the judge emits `{"reason": null}` (some providers do this even with strict json_schema). Replaced with `str(out.get("reason") or "")` everywhere: grade.py, rubric_lab.py, score_multiaxis.py.

3. Target identity collision in ELO mode. `replies[(t.model, prompt_id)]` collided when two `--target` flags shared the same `model` string (e.g. comparing the same model across providers / base_urls). Switched to `model@base_url` as the canonical id and added a startup check that rejects duplicates.

4. Bootstrap CI now mirrors dynamic K. Main accumulator drops K from 32 → 16 after 50 matches per candidate.
   _bootstrap_ci was using k_initial constant for every match, so late-resampled matches moved ratings more than reality. Bootstrap now tracks match counts per iteration and applies the same threshold.

All 27 grade.py unit tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
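
Two of the review fixes above are easy to see in isolation. The following Python sketch reproduces the `"None"` pitfall from item 2 and a duplicate check in the spirit of item 3. The helper names `reason_of` and `canonical_ids` are hypothetical and are not taken from grade.py or rundale_bench.py; only the `str(out.get("reason") or "")` expression and the `model@base_url` id format come from the commit message.

```python
def reason_of(verdict: dict) -> str:
    """Return the judge's reason as a string, mapping a missing or null reason to ""."""
    # dict.get's default only applies when the key is absent. A key that is
    # present with value None returns None, and str(None) is "None".
    return str(verdict.get("reason") or "")


def canonical_ids(targets):
    """Build `model@base_url` ids and reject duplicates up front.

    `targets` is an iterable of (model, base_url) pairs; this shape is an
    assumption made for the example.
    """
    seen = set()
    ids = []
    for model, base_url in targets:
        target_id = f"{model}@{base_url}"
        if target_id in seen:
            raise ValueError(f"duplicate target: {target_id}")
        seen.add(target_id)
        ids.append(target_id)
    return ids
```

Keying replies by the combined id lets the same model string appear twice, once per endpoint, without one entry overwriting the other.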


* refactor: runtime-load LLM providers from mods/<id>/

Replaces the compile-time-embedded provider catalog (parish-config/build.rs scanning providers/*.toml) with a runtime-mod-loaded design.

Builtins (5, parish-config/src/builtin_providers/):
- simulator, ollama, vllm, vllm_mlx, custom
- inlined via include_str!; the engine manages local processes / downloads, or serves as universal escape hatch (custom)

Provider mods (19, mods/<id>/):
- anthropic, cohere, deepseek, github_models, google, groq, lmstudio, mistral, moonshot, nvidia-nim, openai, openrouter, qwen, scaleway, siliconflow, together, vercel-ai, xai, zhipu
- one mod per provider; each declares kind = "providers" in mod.toml
- discovered + registered into ProviderRegistry via discover_mods + register_provider_mods_once (OnceLock-guarded, called from load_setting_mod_sync + LocalDiskModSource::list_mods)

ModKind::Providers variant + load_providers_from_mod helper in parish-core, with traversal-rejection + duplicate-id checks.

ProviderRegistry rewired with RwLock-backed interior mutability so post-init register_mod_providers merges cleanly. Last-wins on collision with WARN log; identical re-registration is silent no-op.

Backend IPC handle_list_available_providers (parish-core) returns featured/other split. Wired to Tauri command list_available_providers + MCP bridge /api/available-providers route. ShowPreset listing now reads dynamically from registry instead of a hardcoded string.

Debug-build auto-loader parish_config::ensure_test_mods_loaded walks the workspace mods/ tree so unit tests + dev runs see the same registry as production without each test calling setup manually.
New tests:
- parish-config: builtin_providers_parse_and_register, register_mod_providers_merges_new_ids, register_mod_providers_last_wins_on_collision
- parish-core: discover_mods_classifies_providers_kind, load_providers_from_mod_{parses_multiple_tomls_in_lex_order, empty_when_directory_missing, rejects_symlink_traversal, rejects_duplicate_ids_within_one_mod}

Proof bundle at docs/proofs/provider-mods-runtime/: live CLI transcript verifies switching to a mod-loaded provider (openai), a runtime-added mod (test-provider, registered without recompile), and back to a builtin (simulator).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(proofs): relocate provider-mods-runtime bundle to .proofs/

Rule 10 update (commit 554410e) requires proof bundles to live in gitignored .proofs/<task-id>/ and attach to the PR via `just attach-proof`. Move the bundle out of tracked docs/proofs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(providers): drop cloud ctors + wire UI/server to runtime registry

Follow-on cleanup to commit edc2fc7b. No deferrals left.

- Remove Provider::{anthropic, openai, openrouter, google, groq, xai, mistral, deepseek, together, lmstudio, github_models} convenience constructors. Every callsite now uses Provider::from_id(id) with an appropriate .expect() (tests) or .unwrap_or_default() (runtime defaults that previously fell to openrouter; the simulator builtin is the fallback when openrouter mod is absent).
- parish-server: add /api/list-available-providers handler + route, so the web UI can enumerate the runtime provider registry. Matches Tauri command list_available_providers and the MCP bridge route.
- parish/apps/ui:
  - lib/ipc.ts: add listAvailableProviders() + AvailableProvidersResponse types.
  - components/ByokOnboarding.svelte: fetch featured/other lists at mount, drop static imports.
  - lib/byokProviders.ts: collapse to a thin type-adapter (toByokMeta + findProvider).
    Hand-curated FEATURED_PROVIDERS and OTHER_PROVIDERS arrays are gone; adding a provider is a TOML drop under mods/<id>/ with no TS edit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(mods): rename provider mods with -provider suffix

mods/<id>/ -> mods/<id>-provider/ for all 19 cloud provider mods. The mod-id field in each mod.toml is updated to match the new directory name (<id>-provider). The provider id inside each TOML (id = "<bare>") stays unchanged: that is the registry key Provider::from_id(...) and parish.toml's provider field still target.

Also removes the throwaway mods/test-provider/ from the repo. It is a fixture artifact created on-demand by the verification script per its header instructions; shipping it would make the no-recompile claim tautological. The captured proof transcript still demonstrates the add-and-discover flow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(providers): address bot review on PR #1049

Six fixes spanning gemini-code-assist and chatgpt-codex-connector feedback on the runtime-loaded provider refactor.

- mod_source.rs: wrap discover_mods_in + register_provider_mods_once in tokio::task::spawn_blocking, and propagate register errors instead of warn-and-continue. Sync filesystem I/O on the executor stalled the Tokio runtime on slow disks; silent registry failures turned actionable startup errors into late, confusing fallback behaviour (gemini P0, codex P2 #5).
- parish-config provider.rs: drop the cfg(debug_assertions) gate on the auto-loader and rename it to ensure_mods_loaded. The release/debug skew meant any startup path that resolved provider config before the explicit bootstrap saw an empty registry, silently falling back to the simulator or panicking on Provider::from_id(...).expect(...). The auto-loader is now always-on; it consults PARISH_MODS_DIR first (operator override for packaged builds), then walks up from CARGO_MANIFEST_DIR (dev tree). Same idempotent Once guard (codex P1 #2 + #4).
- resolve_cloud_config: replace Provider::from_id("openrouter") .expect(...) with an ok_or_else that returns a structured ParishError::Config. Operators who omit PARISH_CLOUD_PROVIDER on a deployment without the openrouter mod now see an actionable message instead of a crashed binary (codex P1 #2). - ProviderMod: add explicit `keyless: bool` field (TOML default false). Set true in simulator/ollama/vllm/vllm_mlx/lmstudio. The wizard's keyless guard is now driven by this flag instead of !requires_api_key, which had mislabelled `custom` as keyless and let users save it with no model name (codex P2 #6 regression). - ByokOnboarding.svelte: on listAvailableProviders failure, fall back to FALLBACK_FEATURED (anthropic/openai/openrouter/groq/google) with a visible banner instead of an empty grid. A transient API blip no longer hard-blocks first-run onboarding (codex P2 #7). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
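The resolve_cloud_config fix described above (swapping `.expect(...)` for `ok_or_else` so a missing provider becomes a structured error) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's code: `Registry`, its `with_ids` and `from_id` methods, the `Provider` struct, the shape of `ParishError`, and `resolve_default_provider` are all hypothetical stand-ins, since the source names only `Provider::from_id`, `ParishError::Config`, and `PARISH_CLOUD_PROVIDER`.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical stand-in for the project's error type; only the `Config`
/// variant is named in the commit message.
#[derive(Debug, PartialEq)]
pub enum ParishError {
    Config(String),
}

impl fmt::Display for ParishError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParishError::Config(msg) => write!(f, "configuration error: {msg}"),
        }
    }
}

impl std::error::Error for ParishError {}

/// Hypothetical provider record; the real `Provider` type is not shown in the source.
#[derive(Debug, Clone, PartialEq)]
pub struct Provider {
    pub id: String,
}

/// Hypothetical registry keyed by the bare provider id.
pub struct Registry {
    providers: HashMap<String, Provider>,
}

impl Registry {
    /// Build a registry containing one provider per id (illustration only).
    pub fn with_ids(ids: &[&str]) -> Self {
        let providers = ids
            .iter()
            .map(|id| (id.to_string(), Provider { id: id.to_string() }))
            .collect();
        Registry { providers }
    }

    /// Look up a provider; `None` means the mod was never registered.
    pub fn from_id(&self, id: &str) -> Option<Provider> {
        self.providers.get(id).cloned()
    }
}

/// Resolve the default cloud provider without panicking: a missing
/// registration is turned into an error the caller can report.
pub fn resolve_default_provider(registry: &Registry) -> Result<Provider, ParishError> {
    registry.from_id("openrouter").ok_or_else(|| {
        ParishError::Config(
            "no cloud provider available: set PARISH_CLOUD_PROVIDER or install the openrouter provider mod"
                .to_string(),
        )
    })
}
```

With a registry built from `["ollama"]` only, `resolve_default_provider` returns `Err(ParishError::Config(..))` rather than aborting the process, which is the behaviour change the commit message describes.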

#1138)

* fix(npc): forbid mid-conversation farewells in Tier-1 prompt (TODO #4)

Cycle 1 of the demo audit caught Cormac Duffy closing turn 2 with "Slán abhaile" while the conversation continued five more turns. The ALLOWED IRISH PHRASES section legitimises Slán abhaile and Slán leat as period-appropriate vocabulary, but gave the model no instruction about when those phrases are welcome.

Add a NEVER FAREWELL MID-CONVERSATION section that gates Slán*, Goodbye, Farewell, and the English gloss "safe home" behind an explicit player departure cue, and direct the model to land each non-departing reply on a question, observation, or offer instead.

Mirror the directive into mods/rundale/prompts/tier1_system.txt so the mod artefact does not drift from the canonical Rust prompt, and extend test_tier1_system_no_unsubstituted_placeholders to pin the new header and each gated token so a future refactor cannot drop the section silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger after auto-cancel

* ci: retrigger after Actions incident resolved

* ci: retrigger after Rust quality gate timeout (cache cold)

* test(inference): route default-frequency-penalty test through Interactive lane

#1127's test_inference_queue_send_default_omits_frequency_penalty sent on the Background lane but received from the Interactive receiver (`irx`). The message went to `_brx`, which nobody reads, so `irx.recv()` blocked indefinitely. PR #1127's CI runs were both cancelled before the test could surface the hang, and the merge to main stalled the Rust quality gate at the 30-minute timeout on every subsequent PR.

Swap `InferencePriority::Background` to `Interactive` so the send lands on the lane `irx` actually drains. Add a comment pointing future readers at the failure mode so the lane-mismatch trap isn't re-laid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
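The lane-mismatch hang described in the last commit can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the project's queue: it swaps in `std::sync::mpsc` channels for whatever async channels the real code uses, and `InferenceQueue::new`, `send`, and `recv_bounded` are hypothetical names (only `InferencePriority`, `irx`, and `_brx` appear in the source). It also shows one way to make such a test fail fast: receive with a timeout instead of an unbounded `recv()`.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

/// The two lanes named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InferencePriority {
    Interactive,
    Background,
}

/// Hypothetical two-lane queue: each priority has its own channel.
pub struct InferenceQueue {
    interactive_tx: Sender<String>,
    background_tx: Sender<String>,
}

impl InferenceQueue {
    /// Returns the queue plus the interactive and background receivers.
    pub fn new() -> (Self, Receiver<String>, Receiver<String>) {
        let (interactive_tx, irx) = channel();
        let (background_tx, brx) = channel();
        (
            InferenceQueue {
                interactive_tx,
                background_tx,
            },
            irx,
            brx,
        )
    }

    /// Route the request onto the lane that matches its priority.
    pub fn send(&self, priority: InferencePriority, request: &str) {
        let tx = match priority {
            InferencePriority::Interactive => &self.interactive_tx,
            InferencePriority::Background => &self.background_tx,
        };
        tx.send(request.to_string())
            .expect("receiver for this lane was dropped");
    }
}

/// Receive with an upper bound on the wait, so a lane mismatch surfaces
/// as a timeout error instead of blocking the test run indefinitely.
pub fn recv_bounded(rx: &Receiver<String>, wait: Duration) -> Result<String, RecvTimeoutError> {
    rx.recv_timeout(wait)
}
```

Sending with `InferencePriority::Background` and then calling `recv_bounded` on the interactive receiver yields `Err(RecvTimeoutError::Timeout)`, while the same message is available on the background receiver. An unbounded `recv()` in the same position would wait for as long as the sender stays alive, which matches the stalled quality gate described above.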

dmooney merged commit 8b94695 into main Mar 18, 2026

dmooney deleted the claude/debug-commands-ui-G9PqK branch March 18, 2026 17:49

dmooney mentioned this pull request Apr 3, 2026

Wire NPC conversation memory into LLM prompts #205

Merged

This was referenced Apr 3, 2026

fix: resolve clippy warning and update known-issues.md #207

Merged

NPC conversation awareness #214

Merged

dmooney mentioned this pull request May 23, 2026

Runtime-loaded LLM provider mods (mods/<id>-provider/) #1049

Merged

6 tasks

dmooney merged 1 commit into main from claude/debug-commands-ui-G9PqK

dmooney commented Mar 18, 2026

2 participants
