docs: add debug system design — commands, TUI panel, metrics architecture#4

Merged: dmooney merged 1 commit into main from claude/debug-commands-ui-G9PqK on Mar 18, 2026

@dmooney dmooney commented Mar 18, 2026

No description provided.

…ture

Design document for a feature-gated debug system that provides runtime
visibility into NPC state, inference pipeline, background tasks, and
performance metrics. Includes slash commands (/debug *) and a toggleable
live TUI panel with tabbed views.

https://claude.ai/code/session_012egaxMhLdaCuMgjHpR6359
@dmooney dmooney merged commit 8b94695 into main Mar 18, 2026
@dmooney dmooney deleted the claude/debug-commands-ui-G9PqK branch March 18, 2026 17:49
dmooney pushed a commit that referenced this pull request Mar 22, 2026
…igation

- Fix phase status inconsistencies: Phases 1-3 now correctly shown as
  complete across README.md, docs/index.md, phase plans, and roadmap
- Update module trees in CLAUDE.md and architecture overview to reflect
  actual src/ structure (npc submodules, debug.rs, geo_tool, gui submodules)
- Remove outdated Bevy ECS reference from architecture overview
- Add ADR-012 documenting the hierarchical documentation organization
- Add research/ section to docs/index.md
- Update known issue #4 to reflect that ShortTermMemory exists but isn't
  wired into LLM prompts yet
- Clean up maybe-bad-ideas.md (separate shipped items)
- Add "Key Design Docs" column to phase status table in docs/index.md
- Improve README.md with documentation tree diagram

https://claude.ai/code/session_01VCXxoKAD8dYAr99LHaEgzq
dmooney pushed a commit that referenced this pull request Apr 3, 2026
- Replace `map_or(false, ...)` with `is_some_and(...)` in
  debug_snapshot.rs to satisfy clippy::unnecessary_map_or.
- Move known-issue #4 (NPC memory not wired into prompts) to Resolved
  section — `build_enhanced_context_with_config()` already injects
  short-term memories, long-term recall, and gossip into Tier 1 prompts.

https://claude.ai/code/session_01BEC62B2awxqMsferWdBknk
dmooney pushed a commit that referenced this pull request Apr 3, 2026
- Replace `map_or(false, ...)` with `is_some_and(...)` in
  debug_snapshot.rs to satisfy clippy::unnecessary_map_or
- Move known issue #4 (NPC memory not in prompts) to Resolved:
  build_enhanced_context() already injects short-term memory,
  long-term recall, reactions, and gossip into all Tier 1 prompts

https://claude.ai/code/session_01Ts3W9TiNfbLhSuHC6Xyqdf
dmooney added a commit that referenced this pull request Apr 5, 2026
- Replace `map_or(false, ...)` with `is_some_and(...)` in
  debug_snapshot.rs to satisfy clippy::unnecessary_map_or.
- Move known-issue #4 (NPC memory not wired into prompts) to Resolved
  section — `build_enhanced_context_with_config()` already injects
  short-term memories, long-term recall, and gossip into Tier 1 prompts.

https://claude.ai/code/session_01BEC62B2awxqMsferWdBknk

Co-authored-by: Claude <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 14, 2026
Re-judged all 4 cached sample sets with x-ai/grok-4.3 (non-reasoning,
$1.25/$2.50 per M) to test judge sensitivity. 193 calls, 0 errors,
~43 min, ~$0.40 total.

Added deepseek/deepseek-v4-pro to the candidate roster (deepseek's
current flagship, $0.435/$0.87 per M). Surprise result: v4-pro scores
*worse* than v3.2 (7.77 vs 7.91) on 1820 Irish dialogue — likely from
heavier reasoning/structure RLHF that hurts persona fidelity.

Judge findings:
- 2.36-point top-to-bottom spread vs 0.81 under mistral-large. Grok
  discriminates harder.
- Top-3 robust across judges: qwen3-max / mistral-large / qwen3-235b
  within 0.06 of each other under both. Real ceiling for this slice.
- grok-3-mini lifts to #4 — same judge-family bias as previous sweeps.
  Discount by ~0.3 mentally.
- gpt-4o-mini drops 1.67 points under stricter grok judge (8.27 → 6.60)
  — biggest cross-judge gap of any candidate.

kimi-k2.6 tried as judge first; failed 70%+ of calls (reasoning tokens
ate the max_tokens budget before content emission, even at 8K). Added
a reasoning-field fallback to eval_lib's `call_chat` so short
reply→content-empty cases recover (k2.6 still not viable as judge
under JSON-schema mode).

Bumped score_multiaxis _JUDGE_MAX_TOKENS 2000 → 8000 to give thinking
judges headroom. Records kimi failure as evidence:
multiaxis_20260514T175833Z.json (10 of 14 errors).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 15, 2026
…e view

eval_lib gains a `reasoning` kwarg on call_chat that gets passed through
to OpenRouter. When a target model is known to be reasoning-class
(kimi-k2.5/k2.6/k2-thinking, glm-4.7, o1/o3/o4, deepseek-r1, opus 4.7,
sonnet 4.6) the default switches to `{"enabled": false}` so cached
dialogue replies are the in-character answer rather than truncated
thought process.

Result for kimi-k2.5: 0/15 valid replies before → 15/15 valid now,
44.5 s wall vs prior timeouts. Judged with both pinned judges:
  grok-4.3        → 8.56 total (n=15)  -- #4 in grok ranking
  mistral-large   → 8.96 total (n=15)  -- #4 in mistral ranking

Both judges rank kimi-k2.5 #4, a consistent cross-judge result (between
qwen3-235b and claude-haiku-4.5).

Static leaderboard page now exposes an "average" view in the judge
selector: per-candidate mean across each distinct judge it was scored
by. Only emitted when ≥2 judges agree on coverage (otherwise the
"average" of one row is just the row).

Removed the failed kimi-k2.6 judging run
(multiaxis_20260514T175833Z.json). It had been kept solely as a
failure-mode artifact and is no longer useful now that the broader
story is documented in the leaderboard md.

Coverage matrix dropped its hardcoded 3-column layout for a dynamic
one driven by `--judge-cols` CSS var, so adding/removing judges no
longer requires HTML edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 15, 2026
* feat(eval): rundale-bench ELO mode (pairwise judge)

The absolute 5-axis rubric saturated near ceiling — gpt-oss-120b:free
scored 4.82/5 with no headroom for stronger models to differentiate.
Pairwise ELO replaces it as the dialogue-ranking primary.

Changes:
- judge_pairwise_v1.json — new pinned judge config (qwen3-235b on
  OpenRouter, rubric_sha256 b5664f96…dc7c0), pairwise rubric requiring
  one-sentence reason per verdict.
- grade.py::grade_pairwise — judge picks A | B | tie + reason.
  Non-Latin script in one reply auto-disqualifies that side (judge
  can't be trusted to enforce that consistently).
- rundale_bench.py --mode elo — repeated --target flags, runs every
  candidate over the dialogue slice once (cached replies), schedules
  one pairwise match per (a, b) pair per prompt with A/B position
  randomized to absorb first-position bias. K=32 → K=16 after 50
  matches per candidate. Bootstrap 5/95 CI via 500 resamples.
- test_grade.py — 5 new pairwise tests (winner, tie, invalid winner,
  non-Latin auto-DQ, rubric tamper). 27/27 pass.
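
A minimal sketch of the non-Latin auto-disqualification step described
above, not the actual grade.py code. The helper names `has_non_latin`
and `auto_verdict` are hypothetical, and treating "both replies contain
non-Latin script" as a case for the judge is an assumption; the commit
only specifies the one-sided case.

```python
import unicodedata


def has_non_latin(text):
    """True if any alphabetic character is outside the Latin script."""
    for ch in text:
        # Accented Latin letters (e.g. Irish fadas) still have LATIN names.
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


def auto_verdict(reply_a, reply_b):
    """Return 'A' or 'B' when one side is auto-disqualified, else None."""
    bad_a, bad_b = has_non_latin(reply_a), has_non_latin(reply_b)
    if bad_a and not bad_b:
        return "B"  # A is disqualified, so B wins without a judge call
    if bad_b and not bad_a:
        return "A"
    return None  # assumption: otherwise defer to the pairwise judge
```

Running this check before the judge call keeps the rule deterministic
instead of relying on the judge to enforce it.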

Smoke (3 candidates × 10 prompts × 1 pair per prompt = 60 calls,
~$0.0013, 346s):

  1646.2  [CI 1598.6–1694.0]  qwen/qwen3-235b-a22b-2507
  1497.1  [CI 1437.0–1561.2]  openai/gpt-oss-120b:free
  1356.6  [CI 1306.3–1399.8]  mistralai/mistral-small-24b-instruct-2501

290-point spread with non-overlapping CIs between top and bottom.
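
The rating update behind these numbers can be sketched as follows. This
is a standard Elo update with the K schedule from the commit (32, then
16 after 50 matches per candidate). The starting rating of 1500, the
per-candidate K lookup, and the name `run_elo` are assumptions, not
taken from rundale_bench.py.

```python
from collections import defaultdict

K_INITIAL = 32.0      # from the commit message
K_SETTLED = 16.0      # from the commit message
K_SWITCH_AFTER = 50   # matches per candidate before K drops


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def run_elo(matches, start=1500.0):
    """matches: iterable of (a, b, score_a), score_a in {1.0, 0.5, 0.0}."""
    ratings = defaultdict(lambda: start)   # assumed starting rating
    played = defaultdict(int)
    for a, b, score_a in matches:
        exp_a = expected_score(ratings[a], ratings[b])
        # Each side uses its own K, based on how many matches it has played.
        k_a = K_INITIAL if played[a] < K_SWITCH_AFTER else K_SETTLED
        k_b = K_INITIAL if played[b] < K_SWITCH_AFTER else K_SETTLED
        ratings[a] += k_a * (score_a - exp_a)
        ratings[b] += k_b * ((1.0 - score_a) - (1.0 - exp_a))
        played[a] += 1
        played[b] += 1
    return dict(ratings)
```

A tuple `(a, b, 0.0)` means A lost; keeping that ordering canonical
matters because the update is not symmetric in how it reads `score_a`.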

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): cache_dialogue_replies + rubric_lab for offline rubric tuning

Goal: iterate the judging rubric without re-querying candidate models.
Generate replies once, cache to JSON, then prototype rubrics offline.

Tools:
- cache_dialogue_replies.py: --target ... --prompts N, writes
  docs/proofs/rundale-bench/dialogue_samples_<UTC>.json with one record
  per (candidate, prompt) including reply, usage, latency, error.
- rubric_lab.py: loads a cached samples file + a rubric text file,
  scores all samples in absolute (1-N) or pairwise (ELO) mode, prints
  per-candidate distribution. Zero candidate spend per iteration —
  only judge cost.
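
One way to picture the cached record is sketched below. The field list
(reply, usage, latency, error) comes from the description above; the
class name, exact field names, and the definition of "useful" as
"no error and non-empty reply" are assumptions, not the real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRecord:
    """Hypothetical shape of one (candidate, prompt) cache entry."""
    candidate: str
    prompt_id: str
    reply: str
    usage: dict
    latency_s: float
    error: Optional[str] = None

    @property
    def useful(self):
        # Assumed definition: the call succeeded and produced some text.
        return self.error is None and bool(self.reply.strip())


def count_useful(records):
    """Number of records a rubric pass could actually score."""
    return sum(1 for rec in records if rec.useful)
```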

Three cache snapshots committed as evidence:

  003651Z.json  6 free OpenRouter candidates × 15 prompts
                — 65/90 useful (gemma-4 0/15, qwen-next 6/15 throttled)
                — kept as baseline showing why free tier is unreliable

  004513Z.json  6 paid cheap (<$1/M out) × 15 prompts
                qwen3-235b, mistral-small-24b, gemma-3-27b, phi-4,
                deepseek-v3.2, gpt-oss-120b
                90/90 ok, $0.0009, 205s

  005721Z.json  6 mid-tier ($1-$5/M out) × 15 prompts
                claude-haiku-4.5, gpt-4o-mini, gemini-2.5-flash,
                mistral-large-2512, kimi-k2.5, grok-3-mini
                90/90 ok, 324s

245 useful samples across 17 distinct candidates total. Rubric
iteration now an offline activity that costs nothing per pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): 12-candidate ELO sweep with mistral-large judge

Kimi 2.6 attempted as judge but failed: reasoning model emits all
output in `reasoning` field, content is null, every match returned
empty → all ties. Same problem hit z-ai/glm-4.7. Switched to a
non-thinking judge.

mistral-large-2512 ($0.50/$1.50, 1-2s/call) judged 816 pairwise
matches over the 12 paid OpenRouter candidates × 15 dialogue prompts:

  1898.9  qwen/qwen3-235b-a22b-2507
  1768.4  anthropic/claude-haiku-4.5
  1705.0  google/gemma-3-27b-it
  1682.6  mistralai/mistral-large-2512    (judge self-bias)
  1622.8  moonshotai/kimi-k2.5            (n=11 only; reasoning empty)
  1484.9  x-ai/grok-3-mini
  1473.3  deepseek/deepseek-v3.2
  1356.8  openai/gpt-oss-120b
  1340.9  google/gemini-2.5-flash
  1305.2  mistralai/mistral-small-24b-instruct-2501
  1242.6  openai/gpt-4o-mini
  1118.6  microsoft/phi-4

780-point top-to-bottom spread. Headline: claude-haiku-4.5 is the
standout cheap-tier dialogue model; phi-4 is unsuitable. qwen3-235b
top spot is suspect — it was the prior judge pin so may carry
training bias toward its own style.

Tooling updates:
- rubric_lab.py: per-match progress log (flush=True), max_tokens=2000
  for judge calls (reasoning models need budget), empty-content guard
  that raises "reasoning model truncated?" instead of silent 0-score.
- Documented reasoning-class judge incompatibility in leaderboard
  caveats — until call_chat handles `reasoning` fallback, kimi-* and
  glm-* are unusable as judges OR candidates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(eval): justfile recipes for rundale-bench workflows

Wraps the tasks executed this session so they're rerunnable without
remembering target-spec strings, env-loading boilerplate, or rubric_lab
flags. Recipes:

Correctness gates (no API spend):
- test        — run grade.py unit tests
- manifest    — rebuild MANIFEST.json after slice edits
- split       — re-run holdout split + manifest

Single-target probes:
- intent target [limit=20]    — deterministic intent grading
- dialogue target [judge limit] — absolute 5-axis rubric
- bench target [judge limit]   — --slice all sweep

Multi-target ELO:
- elo "spec_a spec_b ..." [limit=10] — in-bench --mode elo, fresh replies

Cached-reply workflow (rubric iteration, no candidate spend):
- cache-cheap [prompts=15]   — 6 paid-cheap fleet (~$0.001, ~3min)
- cache-mid [prompts=15]     — 6 mid-tier fleet (~$0.05)
- cache-all [prompts=15]     — both, back to back
- cache targets="..." [prompts]
- elo-from-cache samples [rubric judge output] — pairwise over cache
- absolute-from-cache samples rubric [axis_scale judge output]

Convenience:
- list-samples   — show cached sample files
- sample-stats   — n_samples / ok / empty / cost per cache

Run from repo root with `-f`:
    just -f parish/testing/rundale-bench/justfile <recipe> ...

ENV_FILE points at <repo>/.env; recipes source it before invoking python
so OPENROUTER_API_KEY (and equivalents) reach call_chat. Default judge
is mistral-large-2512 (non-thinking, ~$0.001/call, 1-2s wall) — override
with `judge=...` per recipe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): multi-axis 0-10 scoring + just recipe

Per-axis 0-10 dialogue scorer over cached samples. Five axes (character,
authenticity, language, responsiveness, craft) + a judge-emitted total —
complements pairwise ELO by exposing *why* a candidate ranks where it does.

- `score_multiaxis.py` — judge emits one integer per axis (0-10) plus a
  float `total`. Per-record + per-candidate aggregates written to JSON.
  Default judge `mistralai/mistral-large-2512`, calibration anchors
  baked into the rubric (0/3/6/8/10 gradient).
- `justfile multiaxis` recipe — accepts positional args only (samples,
  judge, rubric); output is auto-stamped to docs/proofs/rundale-bench/.
  An earlier draft accepted `output=...` as a named arg, which `just`
  silently parses as the next positional, sending
  `--judge-model 'output=...'` and earning HTTP 400 from every call.
  Removed.

Evidence: ran both cached sweeps end-to-end. 76+88 calls, ~$0 spend
(well under mistral-large's $0.50/$1.50 floor). Three top candidates
agree with the ELO ranking — gemma-3-27b / qwen3-235b / claude-haiku-4.5
cluster within 0.10 of each other, mistral-large self-bias still present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): qwen3-max flagship probe + aggregator bug fix

Added qwen/qwen3-max (Chinese flagship, $1.20/$6 per M, non-reasoning)
to the multi-axis 0-10 leaderboard. Result: 9.03 total — ties
gemma-3-27b and edges qwen3-235b by 0.03, well inside rubric noise.
~17× the cost of qwen3-235b for zero meaningful gain on this slice.

Also fixed an aggregator bug in score_multiaxis.py: judge-call errors
(HTTP 503, rate-limit drops, schema-parse failures) were contributing
zeros to per-candidate means. A first qwen3-max pass hit two 503s after
4 retries each, pulling its headline mean to 7.87 — nonsense for a
flagship. Errored records still land in `per_record` for inspection
but no longer skew aggregates.
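
A minimal sketch of the fixed aggregation, assuming each record is a
dict with hypothetical `candidate`, `total`, and `error` keys (the real
score_multiaxis.py field names are not shown in this commit):

```python
from statistics import mean


def aggregate_totals(per_record):
    """Mean `total` per candidate, skipping records whose judge call errored."""
    by_candidate = {}
    for rec in per_record:
        if rec.get("error"):
            # Errored calls stay in per_record for inspection but must not
            # contribute a zero to the mean.
            continue
        by_candidate.setdefault(rec["candidate"], []).append(rec["total"])
    return {cand: mean(vals) for cand, vals in by_candidate.items()}
```

A candidate whose calls all errored simply has no aggregate row, which
is easier to spot than a misleadingly low mean.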

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): grok-4.3 judge re-sweep + deepseek-v4-pro candidate

Re-judged all 4 cached sample sets with x-ai/grok-4.3 (non-reasoning,
$1.25/$2.50 per M) to test judge sensitivity. 193 calls, 0 errors,
~43 min, ~$0.40 total.

Added deepseek/deepseek-v4-pro to the candidate roster (deepseek's
current flagship, $0.435/$0.87 per M). Surprise result: v4-pro scores
*worse* than v3.2 (7.77 vs 7.91) on 1820 Irish dialogue — likely from
heavier reasoning/structure RLHF that hurts persona fidelity.

Judge findings:
- 2.36-point top-to-bottom spread vs 0.81 under mistral-large. Grok
  discriminates harder.
- Top-3 robust across judges: qwen3-max / mistral-large / qwen3-235b
  within 0.06 of each other under both. Real ceiling for this slice.
- grok-3-mini lifts to #4 — same judge-family bias as previous sweeps.
  Discount by ~0.3 mentally.
- gpt-4o-mini drops 1.67 points under stricter grok judge (8.27 → 6.60)
  — biggest cross-judge gap of any candidate.

kimi-k2.6 tried as judge first; failed 70%+ of calls (reasoning tokens
ate the max_tokens budget before content emission, even at 8K). Added
a reasoning-field fallback to eval_lib's `call_chat` so short
reply→content-empty cases recover (k2.6 still not viable as judge
under JSON-schema mode).
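
The fallback can be sketched roughly like this. The `content` and
`reasoning` message fields are the ones named in these commits; the
function name `extract_reply` is hypothetical and this is not the
actual `call_chat` code.

```python
def extract_reply(message):
    """Prefer `content`; fall back to `reasoning`; raise if both are empty."""
    content = (message.get("content") or "").strip()
    if content:
        return content
    # Reasoning-class models may put all output in `reasoning` and leave
    # `content` null when the token budget runs out.
    reasoning = (message.get("reasoning") or "").strip()
    if reasoning:
        return reasoning
    raise ValueError("empty reply: reasoning model truncated?")
```

Raising on a fully empty message avoids recording a silent zero score
for what is really a transport or budget failure.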

Bumped score_multiaxis _JUDGE_MAX_TOKENS 2000 → 8000 to give thinking
judges headroom. Records kimi failure as evidence:
multiaxis_20260514T175833Z.json (10 of 14 errors).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): TTFT/tok-s/JSON-compliance probe + gemma-4/qwen-2.5 candidates

Adds a streaming-mode perf probe complementing the rubric-quality
bench. Three new dimensions per candidate:
  - ttft_ms (time-to-first-content-token via stream:true)
  - tok/s   (completion_tokens / generation seconds)
  - json compliance rate, both free-form and schema-enforced

Implementation:
- eval_lib.py grows `call_chat_streaming(target, …)` that parses SSE,
  records TTFT on first content delta, accumulates token usage from the
  final usage line. Also retries on body-level 502/503 (xAI grok-4.3
  capacity errors land inside 200 OK responses).
- bench_perf.py runs the probe and dumps perf_<UTC>.json with per-call
  records + per-candidate medians + p90s.
- justfile adds `perf` (single target) and `perf-many` (pre-quoted
  --target … flags) recipes.
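
A minimal sketch of the TTFT and tok/s measurement over an SSE stream,
assuming OpenAI-style chunks (`choices[].delta.content`, a final chunk
carrying `usage`, and a `[DONE]` sentinel). `measure_stream` is a
hypothetical name, and measuring "generation seconds" from the first
content token to the end of the stream is an assumption.

```python
import json
import time


def measure_stream(lines, clock=time.monotonic):
    """Consume SSE lines; return reply text, TTFT in ms, and tokens/sec."""
    start = clock()
    first_at = None
    end = start
    parts, usage = [], {}
    for raw in lines:
        now = clock()
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            end = now
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices") or []:
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                if first_at is None:
                    first_at = now  # first content delta marks TTFT
                parts.append(piece)
        end = now
    ttft_ms = None if first_at is None else (first_at - start) * 1000.0
    gen_s = 0.0 if first_at is None else end - first_at
    tokens = usage.get("completion_tokens", 0)
    tok_s = tokens / gen_s if gen_s > 0 else 0.0
    return {"text": "".join(parts), "ttft_ms": ttft_ms, "tok_s": tok_s}
```

Injecting `clock` keeps the function testable without a network call;
a stream that never emits content reports `ttft_ms` as None and
`tok_s` as 0.0.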

Initial measurements on 4 candidates (10 prompts each):

  candidate                  TTFT p50  Total p50  tok/s p50  JF%  JS%
  qwen/qwen3-235b-a22b-2507     363ms     1696ms     54.2   100% 100%
  google/gemma-3-27b-it         380ms     2328ms     37.9   100% 100%
  google/gemma-4-31b-it        1160ms     3214ms     21.9   100% 100%
  qwen/qwen-2.5-72b-instruct     0ms     5242ms      0.0    10%  90%

qwen3-235b is fastest across all three perf axes AND tied for top-3
quality under both judges. Strong default.

gemma-4 is slower than gemma-3 in every dimension (TTFT 3×, tok/s ~½)
AND scores 0.23 lower under the grok-4.3 judge. Newer is not an upgrade.

qwen-2.5-72b-instruct returns "Provider returned error" 400 from
OpenRouter on 14/15 cache calls; streaming all errored. Free-form JSON
10%, schema 90%. Different provider routing required before this
candidate can be fairly evaluated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(eval): cache dialogue replies for 14 new candidates (high + mid tier)

Cached 15 dialogue replies for each candidate to enable offline rubric
judging without re-spending. Judging deferred — user wants more samples
collected before re-judging the full set.

High tier (6, $0.x to $5/$30 per M):
  - anthropic/claude-opus-4.7
  - anthropic/claude-sonnet-4.6
  - openai/gpt-5.5
  - openai/gpt-5.4
  - google/gemini-2.5-pro
  - x-ai/grok-4.3

Mid tier (8):
  - meta-llama/llama-3.3-70b-instruct
  - meta-llama/llama-4-maverick
  - meta-llama/llama-4-scout
  - openai/gpt-5.4-mini
  - mistralai/mistral-medium-3.1
  - z-ai/glm-4.6
  - nousresearch/hermes-4-405b
  - amazon/nova-pro-v1

All 210 calls returned non-empty replies. Cache files:
  - dialogue_samples_20260514T221845Z.json (high tier, 90 samples, ~6min)
  - dialogue_samples_20260515T181515Z.json (mid tier, 120 samples, ~6min)

Brings the candidate pool to 28 unique models cached for dialogue. Ready
for a multi-judge re-sweep when bench iteration resumes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): static leaderboard page for rundale-bench

Aggregates every multiaxis_*.json + perf_*.json + dialogue_samples_*.json
into a single static HTML page (`docs/proofs/rundale-bench/leaderboard.html`).
Open in any browser — no server required.

Sections:
  - Stat header (cached / judged / unjudged / judge count / row counts)
  - Quality table (sortable + filterable by candidate or judge)
  - Perf table (TTFT / total / tok/s / JSON compliance pills)
  - Judge coverage matrix (✓ / · per candidate × judge)
  - Unjudged backlog list

Vanilla JS, no deps. Dark theme by default with auto-light via
prefers-color-scheme. Pure DOM rendering with HTML escaping — no
innerHTML on untrusted strings.

Regenerate after any new run with:
  just -f parish/testing/rundale-bench/justfile leaderboard

Current state baked into the page: 30 cached candidates, 16 judged,
14 unjudged, 3 distinct judges, 31 quality rows + 5 perf rows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): re-cache kimi-k2.5 sans reasoning; add cross-judge average view

eval_lib gains a `reasoning` kwarg on call_chat that gets passed through
to OpenRouter. When a target model is known to be reasoning-class
(kimi-k2.5/k2.6/k2-thinking, glm-4.7, o1/o3/o4, deepseek-r1, opus 4.7,
sonnet 4.6) the default switches to `{"enabled": false}` so cached
dialogue replies are the in-character answer rather than truncated
thought process.

Result for kimi-k2.5: 0/15 valid replies before → 15/15 valid now,
44.5 s wall vs prior timeouts. Judged with both pinned judges:
  grok-4.3        → 8.56 total (n=15)  -- #4 in grok ranking
  mistral-large   → 8.96 total (n=15)  -- #4 in mistral ranking

Both judges rank kimi-k2.5 #4, a consistent cross-judge result (between
qwen3-235b and claude-haiku-4.5).

Static leaderboard page now exposes an "average" view in the judge
selector: per-candidate mean across each distinct judge it was scored
by. Only emitted when ≥2 judges agree on coverage (otherwise the
"average" of one row is just the row).
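
The "average" view can be sketched as below. This reads "≥2 judges" as
"the candidate was scored by at least two distinct judges", which is an
interpretation; the row shape `(candidate, judge, total)` and the name
`cross_judge_average` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cross_judge_average(rows):
    """rows: iterable of (candidate, judge, total) -> {candidate: mean}."""
    per_judge = defaultdict(lambda: defaultdict(list))
    for candidate, judge, total in rows:
        per_judge[candidate][judge].append(total)
    averages = {}
    for candidate, judges in per_judge.items():
        if len(judges) < 2:
            continue  # an "average" of one judge is just that judge's row
        # Average per judge first so a judge with more rows has no extra weight.
        averages[candidate] = mean(mean(vals) for vals in judges.values())
    return averages
```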

Removed the failed kimi-k2.6 judging run
(multiaxis_20260514T175833Z.json). It had been kept solely as a
failure-mode artifact and is no longer useful now that the broader
story is documented in the leaderboard md.

Coverage matrix dropped its hardcoded 3-column layout for a dynamic
one driven by `--judge-cols` CSS var, so adding/removing judges no
longer requires HTML edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): judge all 14 high+mid candidates, parallelize, fix reasoning leaks

Judging: every cached candidate now has scores from both pinned judges
(grok-4.3 + mistral-large). 29 of 30 cached are fully judged; the
remaining qwen-2.5-72b-instruct is broken on OpenRouter and unscorable.

Parallelism: score_multiaxis grows a `--workers` flag (default 8).
ThreadPoolExecutor over the eligible records — judge calls are the
hot path, all I/O-bound. Wall-time went from estimated 1.5-2 h serial
to ~3 min for 4 judging passes across 14 new candidates.
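
A minimal sketch of the parallel judging loop, assuming a `judge_one`
callable that scores a single record. `judge_all` and the result dict
shape are hypothetical; only the ThreadPoolExecutor approach and the
default of 8 workers come from the commit.

```python
from concurrent.futures import ThreadPoolExecutor


def judge_all(records, judge_one, workers=8):
    """Judge records concurrently; results come back in input order."""

    def safe(record):
        # Capture per-record failures so one bad call does not abort the pass.
        try:
            return {"record": record, "scores": judge_one(record), "error": None}
        except Exception as exc:
            return {"record": record, "scores": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe, records))
```

Threads suit this workload because each call spends nearly all its time
waiting on HTTP; `pool.map` preserves input order, so results line up
with the cached samples.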

Reasoning leaks fixed:
- gemini-2.5-pro first cache had every reply truncated mid-sentence
  ("Ah, the night-ter"). max_tokens=200 consumed by internal thinking
  before any content emitted. First-judge scores were nonsense (grok
  2.91, mistral 5.77).
- glm-4.6 first cache had 4/15 replies as meta chain-of-thought
  ("1. **Deconstruct the Persona:**..."). Mistral was lenient enough
  to score it 9.29; grok flagged it down to 4.43. 4.86-point judge
  disagreement was the smoking gun.

eval_lib.call_chat already auto-disables reasoning for known
reasoning-class models via REASONING_MODEL_PREFIXES, but the
suppression dict was hardcoded to `{"enabled": false}`. Google rejects
that with HTTP 400; only `reasoning: {effort: "low"}` works. Added a
per-provider `_default_reasoning_for(model_id)` so the dict matches
what each provider accepts.

Added prefixes: google/gemini-2.5-pro, google/gemini-3, z-ai/glm-4.6.

After re-cache + re-judge, both candidates land in the normal range
with cross-judge spread under 1.0:
  gemini-2.5-pro  grok=8.12  mistral=8.97  Δ=0.85
  z-ai/glm-4.6    grok=7.89  mistral=8.76  Δ=0.87

Headline standings (grok-4.3 / mistral-large):
  openai/gpt-5.5            8.75 / 9.05    ← high-tier top
  openai/gpt-5.4            8.55 / 9.17    ← mistral's #1 high
  openai/gpt-5.4-mini       8.63 / 9.03    ← mid-tier top
  anthropic/claude-opus-4.7 8.64 / 9.05
  anthropic/sonnet-4.6      8.53 / 9.13
  mistral-medium-3.1        8.59 / 9.09

Also stripped earlier n=1 outliers from the leaderboard data:
  - kimi-k2.5 single-record rows in early mid-tier sweeps
  - qwen-2.5-72b-instruct single-record row (provider broken)

Leaderboard page regenerated (quality=57 rows, perf=5, cached=30,
unjudged=1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(eval): fill perf table for 26 remaining candidates; dedupe by latest

Ran bench_perf on 4 parallel shards (~6-7 candidates each). 760 calls
across 26 new candidates, ~10 min wall, <$0.001 spend. Combined with
the original 4-candidate probe, the perf table now covers every cached
candidate.

build_leaderboard_page now keeps only the latest perf measurement per
candidate so smoke runs do not pollute the table. (Earlier 3-prompt
gemma-4 smoke was overwriting the 10-prompt sweep result.)
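
A sketch of the dedupe rule, assuming each perf row carries a
`candidate` id and a sortable UTC `stamp` such as `20260514T221845Z`
(both field names are hypothetical, as is `latest_per_candidate`):

```python
def latest_per_candidate(perf_rows):
    """Keep only the most recent perf row for each candidate."""
    latest = {}
    for row in perf_rows:
        current = latest.get(row["candidate"])
        # Fixed-width UTC stamps compare correctly as plain strings.
        if current is None or row["stamp"] > current["stamp"]:
            latest[row["candidate"]] = row
    return latest
```

Comparing stamps rather than relying on file iteration order means an
older smoke run cannot replace a newer full sweep.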

Notable perf findings:
- gpt-5.4-mini fastest at the top of mid: TTFT 561 ms / 75.5 tok-s
- gpt-5.5 surprisingly slow: TTFT 2242 ms / 60 tok-s — pay flagship
  premium and wait twice as long as gpt-5.4-mini
- glm-4.6 total p50 = 16883 ms (huge tail) — reasoning still active
  for some calls despite suppression; investigate before recommending
- phi-4 free-form JSON 0% / schema 100% — strict schema is mandatory
- mistral-medium-3.1 free-form JSON 10% / schema 100% — same shape

qwen-2.5-72b-instruct re-cache attempt forcing provider.order=
["DeepInfra", "Together"] recovered some replies (4/15) but still
flaky; not viable as a judge or candidate via OpenRouter under any
provider sort/order I tried. Documented as unjudged.

Final state: 29/30 cached candidates judged, 30/30 perf-measured.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(eval): address gemini-code-assist review feedback

1. ELO match recording for errored candidates (HIGH).
   When candidate A errored mid-sweep, `matches.append((b, a, 0.0))`
   recorded a loss against B (the winner) instead of A. Bootstrap CI
   inherited the same flip. Fixed to canonical `(a, b, 0.0)` ordering
   so score_a=0.0 correctly means A lost.

2. Null-safe judge `reason` strings.
   `str(out.get("reason", ""))` returns the literal string "None" when
   the judge emits `{"reason": null}` (some providers do this even
   with strict json_schema). Replaced with `str(out.get("reason") or
   "")` everywhere — grade.py, rubric_lab.py, score_multiaxis.py.

3. Target identity collision in ELO mode.
   `replies[(t.model, prompt_id)]` collided when two `--target` flags
   shared the same `model` string (e.g. comparing the same model
   across providers / base_urls). Switched to `model@base_url` as the
   canonical id and added a startup check that rejects duplicates.

4. Bootstrap CI now mirrors dynamic K.
   Main accumulator drops K from 32 → 16 after 50 matches per
   candidate. _bootstrap_ci was using k_initial constant for every
   match, so late-resampled matches moved ratings more than reality.
   Bootstrap now tracks match counts per iteration and applies the
   same threshold.
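
The null-safety fix in item 2 is easy to reproduce in isolation. The
helper name `reason_text` is hypothetical; the two expressions are the
ones quoted above.

```python
def reason_text(out):
    """Return the judge's reason as a string, treating null as empty."""
    return str(out.get("reason") or "")


# The old form only applies the default when the key is missing, so an
# explicit null is stringified to the literal "None".
old_form = str({"reason": None}.get("reason", ""))
```

Here `old_form` is the string `"None"`, while
`reason_text({"reason": None})` is the empty string.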

All 27 grade.py unit tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 23, 2026
Six fixes spanning gemini-code-assist and chatgpt-codex-connector
feedback on the runtime-loaded provider refactor.

- mod_source.rs: wrap discover_mods_in + register_provider_mods_once in
  tokio::task::spawn_blocking, and propagate register errors instead of
  warn-and-continue. Sync filesystem I/O on the executor stalled the
  Tokio runtime on slow disks; silent registry failures turned
  actionable startup errors into late, confusing fallback behaviour
  (gemini P0, codex P2 #5).

- parish-config provider.rs: drop the cfg(debug_assertions) gate on the
  auto-loader and rename it to ensure_mods_loaded. The release/debug
  skew meant any startup path that resolved provider config before the
  explicit bootstrap saw an empty registry, silently falling back to
  the simulator or panicking on Provider::from_id(...).expect(...).
  The auto-loader is now always-on; it consults PARISH_MODS_DIR first
  (operator override for packaged builds), then walks up from
  CARGO_MANIFEST_DIR (dev tree). Same idempotent Once guard
  (codex P1 #2 + #4).

- resolve_cloud_config: replace Provider::from_id("openrouter")
  .expect(...) with an ok_or_else that returns a structured
  ParishError::Config. Operators who omit PARISH_CLOUD_PROVIDER on a
  deployment without the openrouter mod now see an actionable message
  instead of a crashed binary (codex P1 #2).

- ProviderMod: add explicit `keyless: bool` field (TOML default false).
  Set true in simulator/ollama/vllm/vllm_mlx/lmstudio. The wizard's
  keyless guard is now driven by this flag instead of
  !requires_api_key, which had mislabelled `custom` as keyless and let
  users save it with no model name (codex P2 #6 regression).

- ByokOnboarding.svelte: on listAvailableProviders failure, fall back
  to FALLBACK_FEATURED (anthropic/openai/openrouter/groq/google) with
  a visible banner instead of an empty grid. A transient API blip no
  longer hard-blocks first-run onboarding (codex P2 #7).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 23, 2026
* refactor: runtime-load LLM providers from mods/<id>/

Replaces the compile-time-embedded provider catalog (parish-config/build.rs
scanning providers/*.toml) with a runtime-mod-loaded design.

Builtins (5, parish-config/src/builtin_providers/):
- simulator, ollama, vllm, vllm_mlx, custom
- inlined via include_str! — engine manages local processes / downloads,
  or serves as universal escape hatch (custom)

Provider mods (19, mods/<id>/):
- anthropic, cohere, deepseek, github_models, google, groq, lmstudio,
  mistral, moonshot, nvidia-nim, openai, openrouter, qwen, scaleway,
  siliconflow, together, vercel-ai, xai, zhipu
- one mod per provider; each declares kind = "providers" in mod.toml
- discovered + registered into ProviderRegistry via discover_mods +
  register_provider_mods_once (OnceLock-guarded, called from
  load_setting_mod_sync + LocalDiskModSource::list_mods)

ModKind::Providers variant + load_providers_from_mod helper in
parish-core, with traversal-rejection + duplicate-id checks.

ProviderRegistry rewired with RwLock-backed interior mutability so
post-init register_mod_providers merges cleanly. Last-wins on collision
with WARN log; identical re-registration is silent no-op.

Backend IPC handle_list_available_providers (parish-core) returns
featured/other split. Wired to Tauri command list_available_providers +
MCP bridge /api/available-providers route.

ShowPreset listing now reads dynamically from registry instead of a
hardcoded string.

Debug-build auto-loader parish_config::ensure_test_mods_loaded walks
the workspace mods/ tree so unit tests + dev runs see the same registry
as production without each test calling setup manually.

New tests:
- parish-config: builtin_providers_parse_and_register, register_mod_
  providers_merges_new_ids, register_mod_providers_last_wins_on_collision
- parish-core: discover_mods_classifies_providers_kind,
  load_providers_from_mod_{parses_multiple_tomls_in_lex_order,
  empty_when_directory_missing, rejects_symlink_traversal,
  rejects_duplicate_ids_within_one_mod}

Proof bundle at docs/proofs/provider-mods-runtime/ — live CLI transcript
verifies switching to a mod-loaded provider (openai), a runtime-added
mod (test-provider, registered without recompile), and back to a
builtin (simulator).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(proofs): relocate provider-mods-runtime bundle to .proofs/

Rule 10 update (commit 554410e) requires proof bundles to live in
gitignored .proofs/<task-id>/ and attach to the PR via
`just attach-proof`. Move the bundle out of tracked docs/proofs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(providers): drop cloud ctors + wire UI/server to runtime registry

Follow-on cleanup to commit edc2fc7b. No deferrals left.

- Remove Provider::{anthropic, openai, openrouter, google, groq, xai,
  mistral, deepseek, together, lmstudio, github_models} convenience
  constructors. Every callsite now uses Provider::from_id(id) with an
  appropriate .expect() (tests) or .unwrap_or_default() (runtime defaults
  that previously fell to openrouter; the simulator builtin is the
  fallback when openrouter mod is absent).
- parish-server: add /api/list-available-providers handler + route, so
  the web UI can enumerate the runtime provider registry. Matches Tauri
  command list_available_providers and the MCP bridge route.
- parish/apps/ui:
  - lib/ipc.ts: add listAvailableProviders() + AvailableProvidersResponse
    types.
  - components/ByokOnboarding.svelte: fetch featured/other lists at
    mount, drop static imports.
  - lib/byokProviders.ts: collapse to a thin type-adapter
    (toByokMeta + findProvider). Hand-curated FEATURED_PROVIDERS and
    OTHER_PROVIDERS arrays are gone; adding a provider is a TOML drop
    under mods/<id>/ with no TS edit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(mods): rename provider mods with -provider suffix

mods/<id>/ -> mods/<id>-provider/ for all 19 cloud provider mods. The
mod-id field in each mod.toml is updated to match the new directory name
(<id>-provider). The provider id inside each TOML (id = "<bare>") stays
unchanged — that is the registry key Provider::from_id(...) and
parish.toml's provider field still target.

Also removes the throwaway mods/test-provider/ from the repo. It is a
fixture artifact created on-demand by the verification script per its
header instructions; shipping it would make the no-recompile claim
tautological. The captured proof transcript still demonstrates the
add-and-discover flow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(providers): address bot review on PR #1049

Six fixes spanning gemini-code-assist and chatgpt-codex-connector
feedback on the runtime-loaded provider refactor.

- mod_source.rs: wrap discover_mods_in + register_provider_mods_once in
  tokio::task::spawn_blocking, and propagate register errors instead of
  warn-and-continue. Sync filesystem I/O on the executor stalled the
  Tokio runtime on slow disks; silent registry failures turned
  actionable startup errors into late, confusing fallback behaviour
  (gemini P0, codex P2 #5).

- parish-config provider.rs: drop the cfg(debug_assertions) gate on the
  auto-loader and rename it to ensure_mods_loaded. The release/debug
  skew meant any startup path that resolved provider config before the
  explicit bootstrap saw an empty registry, silently falling back to
  the simulator or panicking on Provider::from_id(...).expect(...).
  The auto-loader is now always-on; it consults PARISH_MODS_DIR first
  (operator override for packaged builds), then walks up from
  CARGO_MANIFEST_DIR (dev tree). Same idempotent Once guard
  (codex P1 #2 + #4).

- resolve_cloud_config: replace Provider::from_id("openrouter")
  .expect(...) with an ok_or_else that returns a structured
  ParishError::Config. Operators who omit PARISH_CLOUD_PROVIDER on a
  deployment without the openrouter mod now see an actionable message
  instead of a crashed binary (codex P1 #2).

- ProviderMod: add explicit `keyless: bool` field (TOML default false).
  Set true in simulator/ollama/vllm/vllm_mlx/lmstudio. The wizard's
  keyless guard is now driven by this flag instead of
  !requires_api_key, which had mislabelled `custom` as keyless and let
  users save it with no model name (codex P2 #6 regression).

- ByokOnboarding.svelte: on listAvailableProviders failure, fall back
  to FALLBACK_FEATURED (anthropic/openai/openrouter/groq/google) with
  a visible banner instead of an empty grid. A transient API blip no
  longer hard-blocks first-run onboarding (codex P2 #7).
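
Two of the fixes above, the structured error in place of `.expect(...)` and the explicit `keyless` flag, can be sketched together. This is a simplified illustration under stated assumptions: the `HashMap` registry, the field set on `ProviderMod`, the `resolve_cloud_provider` and `can_skip_key_step` helpers, and the error text are all hypothetical; only the names `ProviderMod`, `keyless`, `requires_api_key`, and `ParishError::Config` come from the commit message.

```rust
use std::collections::HashMap;
use std::fmt;

/// Simplified stand-in for the real ProviderMod (assumed fields only).
#[derive(Debug, Clone, PartialEq)]
struct ProviderMod {
    id: String,
    requires_api_key: bool,
    /// Explicit opt-in; false when the field is omitted.
    keyless: bool,
}

/// Stand-in for the project's error type; only the Config variant is sketched.
#[derive(Debug, PartialEq)]
enum ParishError {
    Config(String),
}

impl fmt::Display for ParishError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParishError::Config(msg) => write!(f, "config error: {}", msg),
        }
    }
}

/// Resolve the cloud provider, defaulting to "openrouter" when none is
/// requested. A missing registry entry becomes an error value the caller
/// can report, rather than a panic.
fn resolve_cloud_provider<'a>(
    registry: &'a HashMap<String, ProviderMod>,
    requested: Option<&str>,
) -> Result<&'a ProviderMod, ParishError> {
    let id = requested.unwrap_or("openrouter");
    registry.get(id).ok_or_else(|| {
        ParishError::Config(format!(
            "provider '{}' is not registered; set PARISH_CLOUD_PROVIDER or install its mod",
            id
        ))
    })
}

/// The wizard's guard keys off the explicit flag, not !requires_api_key.
fn can_skip_key_step(provider: &ProviderMod) -> bool {
    provider.keyless
}
```

With this shape, a provider such as `custom` that has `requires_api_key = false` but does not set `keyless` is no longer treated as keyless.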

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 26, 2026
#1138)

* fix(npc): forbid mid-conversation farewells in Tier-1 prompt (TODO #4)

Cycle 1 of the demo audit caught Cormac Duffy closing turn 2 with
"Slán abhaile" while the conversation continued five more turns. The
ALLOWED IRISH PHRASES section legitimises Slán abhaile and Slán leat
as period-appropriate vocabulary, but gives the model no instruction
about when those phrases are welcome. Add a NEVER FAREWELL MID-
CONVERSATION section that gates Slán*, Goodbye, Farewell, and the
English gloss "safe home" behind an explicit player departure cue, and
direct the model to land each non-departing reply on a question,
observation, or offer instead.

Mirror the directive into mods/rundale/prompts/tier1_system.txt so the
mod artefact does not drift from the canonical Rust prompt, and extend
test_tier1_system_no_unsubstituted_placeholders to pin the new header
and each gated token so a future refactor cannot drop the section
silently.
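
A pinning check of this kind might look like the sketch below. It is illustrative only: `missing_tokens` is a hypothetical helper and `REQUIRED` is an assumed token list derived from the description above; the real test and the real prompt text are not reproduced here.

```rust
/// Tokens the prompt is expected to contain: the section header plus each
/// gated farewell (list assumed from the commit description).
const REQUIRED: [&str; 5] = [
    "NEVER FAREWELL MID-CONVERSATION",
    "Slán",
    "Goodbye",
    "Farewell",
    "safe home",
];

/// Return every required token that is absent from the prompt, so a
/// failing test can name exactly what was dropped.
fn missing_tokens<'a>(prompt: &str, required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|token| !prompt.contains(*token))
        .collect()
}
```

Asserting that `missing_tokens(prompt, &REQUIRED)` is empty gives a failure message that lists the dropped tokens instead of a bare boolean.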

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger after auto-cancel

* ci: retrigger after Actions incident resolved

* ci: retrigger after Rust quality gate timeout (cache cold)

* test(inference): route default-frequency-penalty test through Interactive lane

#1127's test_inference_queue_send_default_omits_frequency_penalty sent on
the Background lane but received from the Interactive receiver (`irx`).
The message went to `_brx`, which nobody reads, so `irx.recv()` blocked
indefinitely. PR #1127's CI runs were both cancelled before the test
could surface the hang, and the merge to main stalled the Rust quality
gate at the 30-minute timeout on every subsequent PR.

Swap `InferencePriority::Background` to `Interactive` so the send lands
on the lane `irx` actually drains. Add a comment pointing future readers
at the failure mode so the lane-mismatch trap isn't re-laid.
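
The lane mismatch can be reproduced in miniature with two standard-library channels. This is a sketch under assumptions: the real queue is not shown in this PR, so `InferenceQueue` here is a hypothetical stand-in built on `std::sync::mpsc`, and `recv_timeout` stands in for the blocking receive so the failure shows up as a timeout rather than a hang.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Priority lanes, named after the commit message.
#[derive(Clone, Copy)]
enum InferencePriority {
    Interactive,
    Background,
}

/// Hypothetical two-lane queue: one sender per lane.
struct InferenceQueue {
    interactive: Sender<String>,
    background: Sender<String>,
}

impl InferenceQueue {
    /// Returns the queue plus the interactive and background receivers
    /// (the `irx` and `_brx` of the commit message).
    fn new() -> (Self, Receiver<String>, Receiver<String>) {
        let (itx, irx) = channel();
        let (btx, brx) = channel();
        let queue = InferenceQueue {
            interactive: itx,
            background: btx,
        };
        (queue, irx, brx)
    }

    /// Route a request to the lane matching its priority.
    fn send(&self, priority: InferencePriority, request: String) {
        let lane = match priority {
            InferencePriority::Interactive => &self.interactive,
            InferencePriority::Background => &self.background,
        };
        lane.send(request).expect("receiver dropped");
    }
}
```

A message sent with `Background` never reaches the interactive receiver, so a test that reads only from `irx` must send with `Interactive`.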

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants