feat: initial project scaffold with design document#1
Merged
Conversation
Add DESIGN.md (full game design spec), CLAUDE.md (dev guide), Cargo.toml with all dependencies, and src/ module structure with placeholder modules for tui, world, npc, inference, persistence, and input systems. https://claude.ai/code/session_01KCed6diC3MzLsLF7gp7PX6
dmooney
pushed a commit
that referenced
this pull request
Mar 22, 2026
Add endnote references, fact-check notes, and bibliographies to: - names-naming-conventions.md (16 endnotes, 3 corrections) - culture-daily-life.md (14 endnotes, 2 corrections) - food-drink.md (13 endnotes, all verified) - architecture-housing.md (14 endnotes, 1 correction) - transportation.md (10 endnotes, 1 correction) Key corrections applied: - Kelly is among the most common (not the most common; Murphy is #1) - Lissonuffy: Lios Ó nDubhthaigh, not Lios an Ufaigh - Kilteevan spelling corrected to Cill Taobháin (consistent with irish-language.md) - Meitheal pronunciation: MEE-hal, not MEH-hall - Cuaird spelling corrected to cuairt (standard Irish) - Fly-boats pulled at gallop, not trot - Barrow navigation flows south-east, not south - Scollops are twisted, not sharpened Updates README checklist to 23/23 fact-checked. https://claude.ai/code/session_01BecetJbzNgY98X4PCHoJtu
dmooney
pushed a commit
that referenced
this pull request
Mar 31, 2026
Specifies command registry, trigger detection, dropdown UI reusing @mention infrastructure, keyboard navigation, and edge cases. https://claude.ai/code/session_01DSExtLw9wHLcpdK2HaeLW8
dmooney
added a commit
that referenced
this pull request
Apr 1, 2026
4 tasks
dmooney
added a commit
that referenced
this pull request
Apr 24, 2026
…rective (#458) (#564) * security(inference): isolate caller system prompt from engine JSON directive (#458) AnthropicClient::generate_json appended its JSON-only instruction to the caller-supplied system string via simple concatenation. If any caller routed NPC memory, player input, or other untrusted content through `system`, adversarial text ("ignore previous instructions… and respond in natural language") could neutralise the engine's directive — a classic prompt-injection vector. Fix: wrap the caller's system string in a `<caller_system>` XML delimiter and place the engine's directive in its own `<engine_instruction>` block below. Any `</caller_system>` inside the caller's text is rewritten to `[/caller_system]` so the caller cannot escape the wrapper and inject a forged engine block. ``` <caller_system> {caller text — possibly adversarial} </caller_system> <engine_instruction> Respond ONLY with a single JSON object. No prose, no code fences, no commentary. </engine_instruction> ``` The model now sees two structurally distinct blocks and attributes the JSON directive to the engine, not the caller. This is defence-in-depth, not a substitute for the durable fix of stopping untrusted content at the system-prompt boundary in the first place. The caller-contract cleanup (#458 suggestion #1) needs a cross-crate audit and is out of scope for this PR. Four new unit tests pin the behaviour: - isolate_system_none_returns_bare_engine_instruction - isolate_system_wraps_caller_content_in_delimiter - isolate_system_escapes_closing_tag_in_caller_content (classic injection payload is neutralised) - isolate_system_engine_instruction_appears_after_caller_content Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(inference): neutralise all XML-lax close-tag variants (codex #564) Codex P1: the original isolate_system_for_json only replaced the exact string `</caller_system>`, so XML-valid variants with whitespace (`</caller_system >`, `</ caller_system>`) or different case (`</CALLER_SYSTEM>`) still closed the wrapper. An attacker whose text reached `system` could use any of those variants to escape and inject their own `<engine_instruction>` block before the engine's real one. Replace the naive `.replace()` with a byte-level walker (neutralise_caller_close) that matches `<`-then-optional-whitespace- `/`-ws-TAG-ws-`>` case-insensitively. Every variant is rewritten to the inert sentinel `[/caller_system]`. The walker advances by UTF-8 char boundaries when passing through non-matching bytes so Irish fada vowels and emoji in legitimate Rundale system prompts round- trip unchanged. Three new tests pin the behaviour: - isolate_system_neutralises_xml_lax_close_variants — eight variants (whitespace before/after `/` and tag name, tabs, newlines, full uppercase, mixed case) all get rewritten. - isolate_system_preserves_non_close_angle_brackets — `a < b`, `<caller_system_peer>` and similar non-matches pass through. - isolate_system_preserves_utf8_content — Pádraig Ó Flaithbheartaigh and the 👍 emoji survive the byte walk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 3, 2026
7 tasks
dmooney
added a commit
that referenced
this pull request
May 15, 2026
…ing leaks
Judging: every cached candidate now has scores from both pinned judges
(grok-4.3 + mistral-large). 29 of 30 cached are fully judged; the
remaining qwen-2.5-72b-instruct is broken on OpenRouter and unscorable.
Parallelism: score_multiaxis grows a `--workers` flag (default 8).
ThreadPoolExecutor over the eligible records — judge calls are the
hot path, all I/O-bound. Wall-time went from estimated 1.5-2 h serial
to ~3 min for 4 judging passes across 14 new candidates.
Reasoning leaks fixed:
- gemini-2.5-pro first cache had every reply truncated mid-sentence
("Ah, the night-ter"). max_tokens=200 consumed by internal thinking
before any content emitted. First-judge scores were nonsense (grok
2.91, mistral 5.77).
- glm-4.6 first cache had 4/15 replies as meta chain-of-thought
("1. **Deconstruct the Persona:**..."). Mistral was lenient enough
to score it 9.29; grok flagged it down to 4.43. 4.86-point judge
disagreement was the smoking gun.
eval_lib.call_chat already auto-disables reasoning for known
reasoning-class models via REASONING_MODEL_PREFIXES, but the
suppression dict was hardcoded to `{"enabled": false}`. Google rejects
that with HTTP 400; only `reasoning: {effort: "low"}` works. Added a
per-provider `_default_reasoning_for(model_id)` so the dict matches
what each provider accepts.
Added prefixes: google/gemini-2.5-pro, google/gemini-3, z-ai/glm-4.6.
After re-cache + re-judge, both candidates land in the normal range
with cross-judge spread under 1.0:
gemini-2.5-pro grok=8.12 mistral=8.97 Δ=0.85
z-ai/glm-4.6 grok=7.89 mistral=8.76 Δ=0.87
Headline standings (grok-4.3 / mistral-large):
openai/gpt-5.5 8.75 / 9.05 ← high-tier top
openai/gpt-5.4 8.55 / 9.17 ← mistral's #1 high
openai/gpt-5.4-mini 8.63 / 9.03 ← mid-tier top
anthropic/claude-opus-4.7 8.64 / 9.05
anthropic/sonnet-4.6 8.53 / 9.13
mistral-medium-3.1 8.59 / 9.09
Also stripped earlier n=1 outliers from the leaderboard data:
- kimi-k2.5 single-record rows in early mid-tier sweeps
- qwen-2.5-72b-instruct single-record row (provider broken)
Leaderboard page regenerated (quality=57 rows, perf=5, cached=30,
unjudged=1).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmooney
added a commit
that referenced
this pull request
May 15, 2026
* feat(eval): rundale-bench ELO mode (pairwise judge)
The absolute 5-axis rubric saturated near ceiling — gpt-oss-120b:free
scored 4.82/5 with no headroom for stronger models to differentiate.
Pairwise ELO replaces it as the dialogue-ranking primary.
Changes:
- judge_pairwise_v1.json — new pinned judge config (qwen3-235b on
OpenRouter, rubric_sha256 b5664f96…dc7c0), pairwise rubric requiring
one-sentence reason per verdict.
- grade.py::grade_pairwise — judge picks A | B | tie + reason.
Non-Latin script in one reply auto-disqualifies that side (judge
can't be trusted to enforce that consistently).
- rundale_bench.py --mode elo — repeated --target flags, runs every
candidate over the dialogue slice once (cached replies), schedules
one pairwise match per (a, b) pair per prompt with A/B position
randomized to absorb first-position bias. K=32 → K=16 after 50
matches per candidate. Bootstrap 5/95 CI via 500 resamples.
- test_grade.py — 5 new pairwise tests (winner, tie, invalid winner,
non-Latin auto-DQ, rubric tamper). 27/27 pass.
Smoke (3 candidates × 10 prompts × 1 pair per prompt = 60 calls,
~$0.0013, 346s):
1646.2 [CI 1598.6–1694.0] qwen/qwen3-235b-a22b-2507
1497.1 [CI 1437.0–1561.2] openai/gpt-oss-120b:free
1356.6 [CI 1306.3–1399.8] mistralai/mistral-small-24b-instruct-2501
290-point spread with non-overlapping CIs between top and bottom.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): cache_dialogue_replies + rubric_lab for offline rubric tuning
Goal: iterate the judging rubric without re-querying candidate models.
Generate replies once, cache to JSON, then prototype rubrics offline.
Tools:
- cache_dialogue_replies.py: --target ... --prompts N, writes
docs/proofs/rundale-bench/dialogue_samples_<UTC>.json with one record
per (candidate, prompt) including reply, usage, latency, error.
- rubric_lab.py: loads a cached samples file + a rubric text file,
scores all samples in absolute (1-N) or pairwise (ELO) mode, prints
per-candidate distribution. Zero candidate spend per iteration —
only judge cost.
Three cache snapshots committed as evidence:
003651Z.json 6 free OpenRouter candidates × 15 prompts
— 65/90 useful (gemma-4 0/15, qwen-next 6/15 throttled)
— kept as baseline showing why free tier is unreliable
004513Z.json 6 paid cheap (<$1/M out) × 15 prompts
qwen3-235b, mistral-small-24b, gemma-3-27b, phi-4,
deepseek-v3.2, gpt-oss-120b
90/90 ok, $0.0009, 205s
005721Z.json 6 mid-tier ($1-$5/M out) × 15 prompts
claude-haiku-4.5, gpt-4o-mini, gemini-2.5-flash,
mistral-large-2512, kimi-k2.5, grok-3-mini
90/90 ok, 324s
245 useful samples across 17 distinct candidates total. Rubric
iteration now an offline activity that costs nothing per pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): 12-candidate ELO sweep with mistral-large judge
Kimi 2.6 attempted as judge but failed: reasoning model emits all
output in `reasoning` field, content is null, every match returned
empty → all ties. Same problem hit z-ai/glm-4.7. Switched to a
non-thinking judge.
mistral-large-2512 ($0.50/$1.50, 1-2s/call) judged 816 pairwise
matches over the 12 paid OpenRouter candidates × 15 dialogue prompts:
1898.9 qwen/qwen3-235b-a22b-2507
1768.4 anthropic/claude-haiku-4.5
1705.0 google/gemma-3-27b-it
1682.6 mistralai/mistral-large-2512 (judge self-bias)
1622.8 moonshotai/kimi-k2.5 (n=11 only; reasoning empty)
1484.9 x-ai/grok-3-mini
1473.3 deepseek/deepseek-v3.2
1356.8 openai/gpt-oss-120b
1340.9 google/gemini-2.5-flash
1305.2 mistralai/mistral-small-24b-instruct-2501
1242.6 openai/gpt-4o-mini
1118.6 microsoft/phi-4
780-point top-to-bottom spread. Headline: claude-haiku-4.5 is the
standout cheap-tier dialogue model; phi-4 is unsuitable. qwen3-235b
top spot is suspect — it was the prior judge pin so may carry
training bias toward its own style.
Tooling updates:
- rubric_lab.py: per-match progress log (flush=True), max_tokens=2000
for judge calls (reasoning models need budget), empty-content guard
that raises "reasoning model truncated?" instead of silent 0-score.
- Documented reasoning-class judge incompatibility in leaderboard
caveats — until call_chat handles `reasoning` fallback, kimi-* and
glm-* are unusable as judges OR candidates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(eval): justfile recipes for rundale-bench workflows
Wraps the tasks executed this session so they're rerunnable without
remembering target-spec strings, env-loading boilerplate, or rubric_lab
flags. Recipes:
Correctness gates (no API spend):
- test — run grade.py unit tests
- manifest — rebuild MANIFEST.json after slice edits
- split — re-run holdout split + manifest
Single-target probes:
- intent target [limit=20] — deterministic intent grading
- dialogue target [judge limit] — absolute 5-axis rubric
- bench target [judge limit] — --slice all sweep
Multi-target ELO:
- elo "spec_a spec_b ..." [limit=10] — in-bench --mode elo, fresh replies
Cached-reply workflow (rubric iteration, no candidate spend):
- cache-cheap [prompts=15] — 6 paid-cheap fleet (~$0.001, ~3min)
- cache-mid [prompts=15] — 6 mid-tier fleet (~$0.05)
- cache-all [prompts=15] — both, back to back
- cache targets="..." [prompts]
- elo-from-cache samples [rubric judge output] — pairwise over cache
- absolute-from-cache samples rubric [axis_scale judge output]
Convenience:
- list-samples — show cached sample files
- sample-stats — n_samples / ok / empty / cost per cache
Run from repo root with `-f`:
just -f parish/testing/rundale-bench/justfile <recipe> ...
ENV_FILE points at <repo>/.env; recipes source it before invoking python
so OPENROUTER_API_KEY (and equivalents) reach call_chat. Default judge
is mistral-large-2512 (non-thinking, ~$0.001/call, 1-2s wall) — override
with `judge=...` per recipe.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): multi-axis 0-10 scoring + just recipe
Per-axis 0-10 dialogue scorer over cached samples. Five axes (character,
authenticity, language, responsiveness, craft) + a judge-emitted total —
complements pairwise ELO by exposing *why* a candidate ranks where it does.
- `score_multiaxis.py` — judge emits one integer per axis (0-10) plus a
float `total`. Per-record + per-candidate aggregates written to JSON.
Default judge `mistralai/mistral-large-2512`, calibration anchors
baked into the rubric (0/3/6/8/10 gradient).
- `justfile multiaxis` recipe — accepts positional args only (samples,
judge, rubric); output is auto-stamped to docs/proofs/rundale-bench/.
Earlier draft accepted `output=...` as a named arg which just silently
parses as the next positional, sending `--judge-model 'output=...'` and
earning HTTP 400 from every call. Removed.
Evidence: ran both cached sweeps end-to-end. 76+88 calls, ~$0 spend
(well under mistral-large's $0.50/$1.50 floor). Three top candidates
agree with the ELO ranking — gemma-3-27b / qwen3-235b / claude-haiku-4.5
cluster within 0.10 of each other, mistral-large self-bias still present.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): qwen3-max flagship probe + aggregator bug fix
Added qwen/qwen3-max (Chinese flagship, $1.20/$6 per M, non-reasoning)
to the multi-axis 0-10 leaderboard. Result: 9.03 total — ties
gemma-3-27b and edges qwen3-235b by 0.03, well inside rubric noise.
~17× the cost of qwen3-235b for zero meaningful gain on this slice.
Also fixed an aggregator bug in score_multiaxis.py: judge-call errors
(HTTP 503, rate-limit drops, schema-parse failures) were contributing
zeros to per-candidate means. A first qwen3-max pass hit two 503s after
4 retries each, pulling its headline mean to 7.87 — nonsense for a
flagship. Errored records still land in `per_record` for inspection
but no longer skew aggregates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): grok-4.3 judge re-sweep + deepseek-v4-pro candidate
Re-judged all 4 cached sample sets with x-ai/grok-4.3 (non-reasoning,
$1.25/$2.50 per M) to test judge sensitivity. 193 calls, 0 errors,
~43 min, ~$0.40 total.
Added deepseek/deepseek-v4-pro to the candidate roster (deepseek's
current flagship, $0.435/$0.87 per M). Surprise result: v4-pro scores
*worse* than v3.2 (7.77 vs 7.91) on 1820 Irish dialogue — likely from
heavier reasoning/structure RLHF that hurts persona fidelity.
Judge findings:
- 2.36-point top-to-bottom spread vs 0.81 under mistral-large. Grok
discriminates harder.
- Top-3 robust across judges: qwen3-max / mistral-large / qwen3-235b
within 0.06 of each other under both. Real ceiling for this slice.
- grok-3-mini lifts to #4 — same judge-family bias as previous sweeps.
Discount by ~0.3 mentally.
- gpt-4o-mini drops 1.67 points under stricter grok judge (8.27 → 6.60)
— biggest cross-judge gap of any candidate.
kimi-k2.6 tried as judge first; failed 70%+ of calls (reasoning tokens
ate the max_tokens budget before content emission, even at 8K). Added
a reasoning-field fallback to eval_lib's `call_chat` so short
reply→content-empty cases recover (k2.6 still not viable as judge
under JSON-schema mode).
Bumped score_multiaxis _JUDGE_MAX_TOKENS 2000 → 8000 to give thinking
judges headroom. Records kimi failure as evidence:
multiaxis_20260514T175833Z.json (10 of 14 errors).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): TTFT/tok-s/JSON-compliance probe + gemma-4/qwen-2.5 candidates
Adds a streaming-mode perf probe complementing the rubric-quality
bench. Three new dimensions per candidate:
- ttft_ms (time-to-first-content-token via stream:true)
- tok/s (completion_tokens / generation seconds)
- json compliance rate, both free-form and schema-enforced
Implementation:
- eval_lib.py grows `call_chat_streaming(target, …)` that parses SSE,
records TTFT on first content delta, accumulates token usage from the
final usage line. Also retries on body-level 502/503 (xAI grok-4.3
capacity errors land inside 200 OK responses).
- bench_perf.py runs the probe and dumps perf_<UTC>.json with per-call
records + per-candidate medians + p90s.
- justfile adds `perf` (single target) and `perf-many` (pre-quoted
--target … flags) recipes.
Initial measurements on 4 candidates (10 prompts each):
candidate TTFT p50 Total p50 tok/s p50 JF% JS%
qwen/qwen3-235b-a22b-2507 363ms 1696ms 54.2 100% 100%
google/gemma-3-27b-it 380ms 2328ms 37.9 100% 100%
google/gemma-4-31b-it 1160ms 3214ms 21.9 100% 100%
qwen/qwen-2.5-72b-instruct 0ms 5242ms 0.0 10% 90%
qwen3-235b is fastest across all three perf axes AND tied for top-3
quality under both judges. Strong default.
gemma-4 is slower than gemma-3 in every dimension (TTFT 3×, tok/s ~½)
AND scores 0.23 lower under the grok-4.3 judge. Newer is not an upgrade.
qwen-2.5-72b-instruct returns "Provider returned error" 400 from
OpenRouter on 14/15 cache calls; streaming all errored. Free-form JSON
10%, schema 90%. Different provider routing required before this
candidate can be fairly evaluated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(eval): cache dialogue replies for 14 new candidates (high + mid tier)
Cached 15 dialogue replies for each candidate to enable offline rubric
judging without re-spending. Judging deferred — user wants more samples
collected before re-judging the full set.
High tier (6, $0.x to $5/$30 per M):
- anthropic/claude-opus-4.7
- anthropic/claude-sonnet-4.6
- openai/gpt-5.5
- openai/gpt-5.4
- google/gemini-2.5-pro
- x-ai/grok-4.3
Mid tier (8):
- meta-llama/llama-3.3-70b-instruct
- meta-llama/llama-4-maverick
- meta-llama/llama-4-scout
- openai/gpt-5.4-mini
- mistralai/mistral-medium-3.1
- z-ai/glm-4.6
- nousresearch/hermes-4-405b
- amazon/nova-pro-v1
All 210 calls returned non-empty replies. Cache files:
- dialogue_samples_20260514T221845Z.json (high tier, 90 samples, ~6min)
- dialogue_samples_20260515T181515Z.json (mid tier, 120 samples, ~6min)
Brings the candidate pool to 28 unique models cached for dialogue. Ready
for a multi-judge re-sweep when bench iteration resumes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): static leaderboard page for rundale-bench
Aggregates every multiaxis_*.json + perf_*.json + dialogue_samples_*.json
into a single static HTML page (`docs/proofs/rundale-bench/leaderboard.html`).
Open in any browser — no server required.
Sections:
- Stat header (cached / judged / unjudged / judge count / row counts)
- Quality table (sortable + filterable by candidate or judge)
- Perf table (TTFT / total / tok/s / JSON compliance pills)
- Judge coverage matrix (✓ / · per candidate × judge)
- Unjudged backlog list
Vanilla JS, no deps. Dark theme by default with auto-light via
prefers-color-scheme. Pure DOM rendering with HTML escaping — no
innerHTML on untrusted strings.
Regenerate after any new run with:
just -f parish/testing/rundale-bench/justfile leaderboard
Current state baked into the page: 30 cached candidates, 16 judged,
14 unjudged, 3 distinct judges, 31 quality rows + 5 perf rows.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): re-cache kimi-k2.5 sans reasoning; add cross-judge average view
eval_lib gains a `reasoning` kwarg on call_chat that gets passed through
to OpenRouter. When a target model is known to be reasoning-class
(kimi-k2.5/k2.6/k2-thinking, glm-4.7, o1/o3/o4, deepseek-r1, opus 4.7,
sonnet 4.6) the default switches to `{"enabled": false}` so cached
dialogue replies are the in-character answer rather than truncated
thought process.
Result for kimi-k2.5: 0/15 valid replies before → 15/15 valid now,
44.5 s wall vs prior timeouts. Judged with both pinned judges:
grok-4.3 → 8.56 total (n=15) -- #4 in grok ranking
mistral-large → 8.96 total (n=15) -- #4 in mistral ranking
Both rank kimi-k2.5 #4 in cross-judge consistent way (between
qwen3-235b and claude-haiku-4.5).
Static leaderboard page now exposes an "average" view in the judge
selector: per-candidate mean across each distinct judge it was scored
by. Only emitted when ≥2 judges agree on coverage (otherwise the
"average" of one row is just the row).
Removed the failed kimi-k2.6 judging run
(multiaxis_20260514T175833Z.json) — kept solely as a failure-mode
artifact, no longer useful now that the broader story is documented
in the leaderboard md.
Coverage matrix dropped its hardcoded 3-column layout for a dynamic
one driven by `--judge-cols` CSS var, so adding/removing judges no
longer requires HTML edits.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): judge all 14 high+mid candidates, parallelize, fix reasoning leaks
Judging: every cached candidate now has scores from both pinned judges
(grok-4.3 + mistral-large). 29 of 30 cached are fully judged; the
remaining qwen-2.5-72b-instruct is broken on OpenRouter and unscorable.
Parallelism: score_multiaxis grows a `--workers` flag (default 8).
ThreadPoolExecutor over the eligible records — judge calls are the
hot path, all I/O-bound. Wall-time went from estimated 1.5-2 h serial
to ~3 min for 4 judging passes across 14 new candidates.
Reasoning leaks fixed:
- gemini-2.5-pro first cache had every reply truncated mid-sentence
("Ah, the night-ter"). max_tokens=200 consumed by internal thinking
before any content emitted. First-judge scores were nonsense (grok
2.91, mistral 5.77).
- glm-4.6 first cache had 4/15 replies as meta chain-of-thought
("1. **Deconstruct the Persona:**..."). Mistral was lenient enough
to score it 9.29; grok flagged it down to 4.43. 4.86-point judge
disagreement was the smoking gun.
eval_lib.call_chat already auto-disables reasoning for known
reasoning-class models via REASONING_MODEL_PREFIXES, but the
suppression dict was hardcoded to `{"enabled": false}`. Google rejects
that with HTTP 400; only `reasoning: {effort: "low"}` works. Added a
per-provider `_default_reasoning_for(model_id)` so the dict matches
what each provider accepts.
Added prefixes: google/gemini-2.5-pro, google/gemini-3, z-ai/glm-4.6.
After re-cache + re-judge, both candidates land in the normal range
with cross-judge spread under 1.0:
gemini-2.5-pro grok=8.12 mistral=8.97 Δ=0.85
z-ai/glm-4.6 grok=7.89 mistral=8.76 Δ=0.87
Headline standings (grok-4.3 / mistral-large):
openai/gpt-5.5 8.75 / 9.05 ← high-tier top
openai/gpt-5.4 8.55 / 9.17 ← mistral's #1 high
openai/gpt-5.4-mini 8.63 / 9.03 ← mid-tier top
anthropic/claude-opus-4.7 8.64 / 9.05
anthropic/sonnet-4.6 8.53 / 9.13
mistral-medium-3.1 8.59 / 9.09
Also stripped earlier n=1 outliers from the leaderboard data:
- kimi-k2.5 single-record rows in early mid-tier sweeps
- qwen-2.5-72b-instruct single-record row (provider broken)
Leaderboard page regenerated (quality=57 rows, perf=5, cached=30,
unjudged=1).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eval): fill perf table for 26 remaining candidates; dedupe by latest
Ran bench_perf on 4 parallel shards (~6-7 candidates each). 760 calls
across 26 new candidates, ~10 min wall, <\$0.001 spend. Combined with
the original 4-candidate probe, the perf table now covers every cached
candidate.
build_leaderboard_page now keeps only the latest perf measurement per
candidate so smoke runs do not pollute the table. (Earlier 3-prompt
gemma-4 smoke was overwriting the 10-prompt sweep result.)
Notable perf findings:
- gpt-5.4-mini fastest at the top of mid: TTFT 561 ms / 75.5 tok-s
- gpt-5.5 surprisingly slow: TTFT 2242 ms / 60 tok-s — pay flagship
premium and wait twice as long as gpt-5.4-mini
- glm-4.6 total p50 = 16883 ms (huge tail) — reasoning still active
for some calls despite suppression; investigate before recommending
- phi-4 free-form JSON 0% / schema 100% — strict schema is mandatory
- mistral-medium-3.1 free-form JSON 10% / schema 100% — same shape
qwen-2.5-72b-instruct re-cache attempt forcing provider.order=
["DeepInfra", "Together"] recovered some replies (4/15) but still
flaky; not viable as a judge or candidate via OpenRouter under any
provider sort/order I tried. Documented as unjudged.
Final state: 29/30 cached candidates judged, 30/30 perf-measured.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eval): address gemini-code-assist review feedback
1. ELO match recording for errored candidates (HIGH).
When candidate A errored mid-sweep, `matches.append((b, a, 0.0))`
recorded a loss against B (the winner) instead of A. Bootstrap CI
inherited the same flip. Fixed to canonical `(a, b, 0.0)` ordering
so score_a=0.0 correctly means A lost.
2. Null-safe judge `reason` strings.
`str(out.get("reason", ""))` returns the literal string "None" when
the judge emits `{"reason": null}` (some providers do this even
with strict json_schema). Replaced with `str(out.get("reason") or
"")` everywhere — grade.py, rubric_lab.py, score_multiaxis.py.
3. Target identity collision in ELO mode.
`replies[(t.model, prompt_id)]` collided when two `--target` flags
shared the same `model` string (e.g. comparing the same model
across providers / base_urls). Switched to `model@base_url` as the
canonical id and added a startup check that rejects duplicates.
4. Bootstrap CI now mirrors dynamic K.
Main accumulator drops K from 32 → 16 after 50 matches per
candidate. _bootstrap_ci was using k_initial constant for every
match, so late-resampled matches moved ratings more than reality.
Bootstrap now tracks match counts per iteration and applies the
same threshold.
All 27 grade.py unit tests still pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 16, 2026
Merged
This was referenced May 25, 2026
dmooney
added a commit
that referenced
this pull request
May 26, 2026
…1129) Across 38+ observed demo turns (cycles 1-5) the auto-player emitted exactly 1 movement action. Per TODO #30's audit, the action grammar is fine — `go to X` resolves to `Move` via the input parser landed in round 4. The blocker is prompt shape: 1. Movement was 1 of 4 few-shot examples in `build_demo_system_prompt` with no explicit cadence rule. 2. "travel widely" in `mods/rundale/demo-prompt.txt` was buried mid- sentence and read as stylistic, not directive. 3. The CRITICAL paragraph listed `"go to Z"` among forbidden command- form intents — a direct contradiction that biased the model away from the correct movement command shape. Fix carves movement out as a first-class action: - `demo-prompt.txt`: new top-level MOVEMENT paragraph citing the "You can go to: ..." surface, the 3-5-turn cadence, and the three canonical verbs. CRITICAL paragraph rewritten so `go to X` / `walk to X` / `head to X` is the documented exception to the no-command-form rule (other meta-commands like "ask about X" stay forbidden). - `build_demo_system_prompt`: new MOVEMENT CADENCE section with the 3-5-turn rule + "if only one location in last 5 turns, move next" override. Few-shots expanded 4 → 6 examples, 3 of them now movement actions covering all three verbs. Live transcript via parish-engine --headless --script proves the engine resolves both `go to the forge` and `go to the holy well` as `result:"moved"` with proper narration + minutes-elapsed — confirming the prompt is the load-bearing lever, not the schema. LLM-emits- movement evidence requires a follow-up `just demo` cycle against a real model; documented in acceptance-criteria.md as a post-merge observable. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 26, 2026
dmooney
added a commit
that referenced
this pull request
May 26, 2026
…#1139) * fix(demo): direct auto-player to move when NPCs here: none (TODO #12) Cycles 2, 7, and 9 of the demo audit caught the auto-player stranded at empty locations — 4 turns at The Mill after Brendan + Cormac departed, 18 sterile turns at the abandoned Hedge School. The LLM-as- player kept speaking aloud ("I'll wait here by the mill", "Sittin' here, I notice a book half-open on the table") instead of moving. The MOVEMENT CADENCE directive from TODO #1/#30 handles the general "after 3-5 turns, move" rhythm but not the specific signal NPCs here: none. Add a WHEN ALONE section to build_demo_system_prompt that quotes the verbatim "NPCs here: none" cue, closes the speech-at-nobody loophole explicitly, and pins the next action to one of the three movement verbs already taught in the cadence block. Pin the header, the cue, and the move-only instruction in demo_system_prompt_carries_alone_move_directive so a future refactor cannot drop the section silently. A companion engine-side fix (TODO #46 — surface a system response when the player speaks at an empty location) is deferred so the impact of this directive can be measured first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger after Actions incident resolved * ci: retrigger after Rust quality gate timeout (cache cold) * test(inference): route default-frequency-penalty test through Interactive lane #1127's test_inference_queue_send_default_omits_frequency_penalty sent on the Background lane but received from the Interactive receiver (`irx`). The message went to `_brx`, which nobody reads, so `irx.recv()` blocked indefinitely. PR #1127's CI runs were both cancelled before the test could surface the hang, and the merge to main stalled the Rust quality gate at the 30-minute timeout on every subsequent PR. Swap `InferencePriority::Background` to `Interactive` so the send lands on the lane `irx` actually drains. Add a comment pointing future readers at the failure mode so the lane-mismatch trap isn't re-laid. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.