fix: demo bug sweep + LLM-output quality detectors + preset multi-slot routing#990
Conversation
…ster - extract_action_from_response handles bare `action": "..."}` and trailing `"}` suffix cases (covered by new demo tests) - auto-player system prompt adds 1820 fact preamble + modern-register blacklist + period-voice examples Fixes the cascade where corrupt `action": "...[Your Name]..."}` player input poisoned NPC dialogue context.
…elic build_tier1_system_prompt (parish-npc) and the mod-shipped mirror mods/rundale/prompts/tier1_system.txt gain four clauses in lockstep: - STAY IN YOUR LANE: midwife knows herbs/births; farmer knows land/beasts; priest knows souls/gossip; teacher knows pupils/books. Outside-lane questions redirect to the right role, not improvised. - 1820 fact preamble: Penal Laws repealed 1782 (teaching in Irish tolerated, not illegal); Catholic Emancipation pending 1829; Famine not until 1845; British Crown rules. - Irish-phrase whitelist (Slán abhaile / Slán leat / Dia dhuit / Go raibh maith agat / Céad míle fáilte / Sláinte / mo chara / sídhe) with an explicit "do not invent or extend Irish phrases" guard. - Modern-register blacklist (fascinating, amazing, definitely, totally, decided to visit, healing properties, taking in the sights). Live demo evidence that motivated this: - Aoife: "teaching in Irish is outlawed" (turn 6, 7) — false for 1820 - Aoife: "Slán abhaile go connachtú" (turn 8) — hallucinated Gaelic - grok-4.3 dialogue-0005: midwife reply read as farmer/tracker (score 2/5) - grok-4.3 dialogue-0003: "healing properties" modern register (score 2/5) Nine new contract assertions in test_tier1_system_no_unsubstituted_placeholders lock the new clauses against silent removal. Proof bundle written to docs/proofs/dialogue-1820-grounding/ (acceptance-criteria.md, evidence.md, cli_script.txt, judge.md — Verdict: sufficient). just check exit 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the auto-player only flipped `demoEnabled = false` on hitting its turn limit, leaving the Tauri window alive indefinitely. Scripted demos (`just demo 2 10 > log`) hung forever and had to be killed. When `config.auto_start` is true the demo was launched from the CLI, so submit `/quit` after the last turn to actually exit the process. UI- initiated demos still stop quietly so the window stays open. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`resolve_npc_targets` always fell back to the first co-located NPC when the resolved target list was empty. That meant "talk to Aoife Brennan" in a location where only Peig is present routed the message to Peig, who then replied as if addressed. Suppress the fallback when `target_names` is non-empty but unresolved. The empty-list branch in `handle_npc_conversation` then surfaces "No one here answers to that name just now." instead of silently mis-routing the dialogue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arrival
When the reaction client is the offline Markov simulator (the wired
provider for the `reaction` and `simulation` categories in the default
loadout), `stream_reaction_texts` was streaming its bigram nonsense
into the chat bubble word-by-word. Players saw the embedded corpus
("bridget from the new collection... saying things... God help us")
rendered as if an NPC had spoken it.
Detect `AnyClient::Simulator` at the dispatch site and emit the
deterministic canned line instead — same behaviour as the no-client
fallback. The simulator stays useful for test fixtures; it just stops
producing readable prose for surfaces visible to the player.
Adds a regression test that drives `stream_reaction_texts` with a
simulator client and asserts the canned text wins.
Also adds a guard test against an unrelated chat-source mistagging
report: every `GameMessage` produced by `apply_movement` must carry
`source: "system"`. Location descriptions in an NPC bubble was the
visible symptom; this asserts the contract at the message-construction
boundary so any future drift fails at compile-time-adjacent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The status-bar clock froze whenever `snap.paused || snap.inference_paused` was true. Demo runs flipped `inference_paused` several times per turn (once per LLM call), causing the displayed digits to oscillate between running and frozen — visually noisy and indistinguishable from the user-initiated pause. Honour only `snap.paused` for the rAF freeze. The "⏸ Paused" badge already uses `snap.paused` alone, so removing `inference_paused` from the freeze aligns the moving digits with the indicator. The clock is still anchored to the latest snapshot's `game_epoch_ms`, so any actual time delta during inference is corrected on the next anchor without a visible jump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#283) `handle_game_input` captures `tick_generation` before the world lock is released for intent parsing, then compares after re-acquiring. A drift emits "World shifted while your words were in the air." to the chat. The world-tick task fired every 5 s and unconditionally bumped the counter, even though it called `inference_pause()` semantics elsewhere to freeze the clock. A 10-turn demo surfaced the TOCTOU warning four times — roughly every multi-second LLM call — even though nothing player-relevant had actually changed: weather, tier processing, and schedules all key off the frozen clock, so their work was a no-op. Skip the counter bump while the clock is inference-paused. Same logic applied in both `parish-tauri::setup` and `parish-server::session` so the parity contract holds. A 3-turn demo after the change shows zero "World shifted" emissions (previously 4 per 10 turns ≈ 40% of intents). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emit per-turn stream-end User-reported blank NPC reply during `just demo 2 10` traced not to the emoji-reaction path (which already filters None and never emits text) but to the NPC arrival-reaction streaming triggered when the auto-player crosses locations. Root cause: parish-core/movement.rs:188 and the duplicate orchestration in parish-tauri/commands.rs:1108 emitted the per-NPC placeholder via bare `text_log(npc_name, "")` with no stream_turn_id. The UI's onTextLog handler only treats an empty-content text-log as a streaming placeholder when stream_turn_id != null; without it, the empty bubble is inserted into textLog directly and never reaped (finalizeStreamingEntry only runs on stream-turn-end). Compounding it, the arrival-reaction path emitted a single batch-level stream-end and never per-turn stream-turn-end, so the UI cleanup hook would not fire for empty reactions even if the placeholder did have a turn id. Three fixes: 1. Placeholder ties to turn (movement.rs, tauri/commands.rs): text_log(npc_name, "") → text_log_for_stream_turn(npc_name, "", turn_id). The UI's streaming-placeholder code path now recognises and reaps it. 2. Per-turn finalisation (game_session.rs stream_reaction_texts): added emit_stream_turn_end callback called exactly once per reaction after stream_npc_tokens returns — success, timeout, or empty. Updated movement.rs and tauri/commands.rs to thread the callback; tauri side emits EVENT_STREAM_TURN_END with StreamTurnEndPayload. 3. max_tokens floor (emoji_reactions.rs): infer_player_message_reaction was Some(40) — too tight for multi- codepoint emoji tokenisation. Bumped to Some(80) with a comment noting the tokenisation risk. (stream_reaction_texts already at 100.) Regression test stream_reaction_texts_emits_stream_turn_end_for_each_reaction (async_llm_integration.rs) asserts 1:1 placeholder/turn-end pairing on empty canned reactions across two NPCs. HTTP-500 and timeout tests updated to assert stream-turn-end still fires on zero-token output. 431 parish-core tests + 435 parish-npc tests pass; clippy + fmt clean. Note: rule #12 violation flagged but not fixed here — the arrival- reaction orchestration is duplicated between parish-core/movement.rs and parish-tauri/commands.rs:1108-1112. Both patched in lockstep; the unification belongs in a separate refactor PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iring + fixture rubrics
New parish-npc::quality module — sibling to parish-npc::anachronism — with
six pure-function detectors targeting structural pathologies that no
existing judge slice covers:
1. JSON envelope leak (`action": "..."}` literal in chat text)
2. Template token leak (`[Your Name]`, `{npc_name}`, `{{location}}`)
3. Simulator corpus overlap (4-gram match vs parish_inference::simulator::CORPUS)
4. Hallucinated Gaelic (fada-bearing words not in GAELIC_VOCAB whitelist)
5. Modern register (`fascinating`, `decided to visit`, `healing properties`)
6. Reaction emoji monoculture (Shannon-style diversity check)
Each detector returns `Vec<QualityIssue>` with kind + detail + optional
byte-span. Aggregate `detect_all_text_issues(text)` runs all single-text
detectors at once. 19 unit tests cover positive samples (including the
exact strings from the May 17 live demo) + clean controls.
Runtime wiring (emits `WARN site=... kind=... detail=...`):
- parish-tauri/commands.rs: `get_llm_player_action` runs detectors on
the parsed player action after `extract_action_from_response`.
- parish-core/game_loop/npc_turn.rs: NPC reply path runs detectors on
`parsed.dialogue` after the existing non-empty guard.
Fixture rubrics (parish/crates/parish-cli/tests/eval_baselines.rs):
Three new asserts run all three "safe-on-simulator-output" detectors
across every BASELINED_FIXTURES entry — `rubric_no_json_envelope_leak_in_fixtures`,
`rubric_no_template_tokens_in_fixtures`, `rubric_no_hallucinated_gaelic_in_fixtures`.
Skipped on fixtures: simulator-corpus-overlap (would always fire on
simulator backend) and modern-register (soft signal, runtime-WARN only).
Made parish_inference::simulator::CORPUS `pub` so detectors can build a
4-gram set without duplicating the corpus. Documented intent at the
declaration site to prevent reuse.
Verification:
- cargo test -p parish-npc quality::tests → 19 passed
- cargo test -p parish --test eval_baselines → 12 passed (3 new)
- cargo clippy --no-deps → clean
- cargo fmt --check → clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cherry-pick of ae08e83b (per-turn stream-end) added a 14th argument to stream_reaction_texts but the existing in-file test at line 800 was introduced by a different cherry-pick (5ad8807) that predated the new signature, so auto-merge couldn't reconcile it. Adds the no-op `|_turn_id| {}` callback to make the test compile against the merged signature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…slot loadouts Two-slot vllm-mlx (14B :8000 + 1.5B :8001) was 404ing every reaction / simulation call when the user removed those category overrides from parish.toml. fill_missing_models_from_presets() auto-pulled the preset model (1.5B) but inherited base_url from the user-level base (:8000) — where only the 14B is loaded. Model/URL mismatch → 404. Fix: extend ProviderPreset with a `base_urls` block keyed by category, plumb it through ProviderMod::preset_base_url + Provider::preset_base_url, and have fill_missing_models_from_presets populate category_base_url alongside category_model when the preset supplies a URL. vllm_mlx.toml's recommended preset now declares the canonical Apple Silicon two-slot loadout: dialogue + simulation on :8000 (14B for narrative weight), intent + reaction on :8001 (1.5B for low-latency). Older presets that omit `[presets.base_urls]` continue to inherit the user's base URL (single-slot providers — Ollama, OpenAI, Anthropic etc). Schema is additive — `base_urls` defaults to all-None via PresetBaseUrls::default; existing TOMLs that don't ship the block keep their old behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parse_npc_stream_response previously had a binary fallback: full JSON
parse succeeds, or treat the whole raw text as dialogue. When the LLM
emits well-formed-but-truncated JSON like
`{"dialogue": "Aye, the process of milling..."` (max_tokens cutoff
or network blip), serde_json::from_str rejects it as malformed, the
fallback fires, and the raw `{"dialogue": "..."}` wrapper surfaces
as user-visible text in the chat bubble.
Live 2026-05-17 demo at The Mill: Brendan + Cormac Duffy replies
landed `{"dialogue": "Aye, 'tis the matter of charges..."` verbatim
in 7 of 10 NPC replies. Nora Duffy's replies, which happened to
complete within max_tokens, parsed cleanly.
Add a third tier:
1. Full JSON parse — preferred, captures dialogue + metadata.
2. **Heuristic dialogue extraction (NEW)** — when JSON parse fails,
walk the bytes after the first `"dialogue":` key and extract the
inner string, honouring JSON-style backslash escapes and stopping
at an unescaped closing quote (or end of stream).
3. Raw text — non-JSON providers, empty stream.
Three new test cases cover the truncated-mid-string,
truncated-at-escape, and truncated-with-empty-value paths. Existing
JSON / plain-text / empty fallback tests continue to pass (10 total).
Parallel fix to c05feb4 (`extract_action_from_response`) which
covered the player-action path; this covers the NPC-reply path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…false positives detect_simulator_corpus_overlap previously flagged a single 4-gram match against the simulator corpus as a quality issue. Live 2026-05-17 demo: Brendan Duffy reply "Me da'll be the one to tell ye how it's done around here" tripped the detector on "be the one to" — a common English idiom that happens to appear verbatim in the corpus but is not Markov leakage. Bumped to 5-gram. True simulator output chains many consecutive corpus words (e.g. the user-reported gibberish "Can quite explain bridget from the new collection for them..." still produces 5-gram matches like "can quite explain bridget from"). Organic English rarely matches 5 corpus words in a row by chance. Net effect: zero false positives on live demo output, gibberish detection preserved. New regression test `corpus_overlap_no_false_positive_on_common_idiom` asserts the Brendan-style reply does not trip. Existing `corpus_overlap_actual_demo_gibberish` updated to expect 5-gram (still flags). 20 quality tests pass (was 19, +1 false-positive guard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🤖 Hi @dmooney, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
There was a problem hiding this comment.
Code Review
This pull request implements historical grounding for NPC dialogue, enforcing 1820-specific facts, persona constraints, and Gaelic language guardrails. It introduces a quality module for detecting structural issues like JSON leaks and modern register terms, alongside heuristic improvements for parsing truncated LLM responses. Additional changes include multi-slot provider support and fixes for clock-freezing and TOCTOU race conditions. Review feedback identifies a quote mismatch in the heuristic dialogue parser and recommends using Unicode-aware regex for modern register detection to prevent false positives.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74c3a367ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !self.category_base_url.contains_key(&cat) | ||
| && let Some(u) = p.preset_base_url(cat) | ||
| { | ||
| self.category_base_url.insert(cat, u.to_string()); |
There was a problem hiding this comment.
Avoid overriding URL for user-specified category models
fill_missing_models_from_presets now fills category_base_url whenever it is missing, even when that category already has an explicit user model override. On multi-slot providers (like vllm-mlx), this can silently reroute a user-chosen model to the preset slot URL (e.g. force :8001) and produce model/endpoint mismatches or 404s unless the user also manually sets category_base_url. The base URL should only be auto-filled when the model is also auto-filled from the preset (or when no explicit model override exists).
Useful? React with 👍 / 👎.
| '"' => break, | ||
| other => out.push(other), |
There was a problem hiding this comment.
Handle single-quoted dialogue terminators in heuristic parser
The heuristic parser accepts single-quoted dialogue values (.strip_prefix('\'')), but the scan loop only stops on an unescaped double quote. For malformed/pseudo-JSON like {'dialogue':'Aye there'}, this consumes through the trailing structure and returns leaked wrapper text (e.g. Aye there'}) instead of clean dialogue, so the truncation-recovery path still surfaces artifacts for this input class.
Useful? React with 👍 / 👎.
| if !self.category_base_url.contains_key(&cat) | ||
| && let Some(u) = p.preset_base_url(cat) | ||
| { | ||
| self.category_base_url.insert(cat, u.to_string()); |
There was a problem hiding this comment.
Avoid forcing override-client path for base-equivalent slots
Auto-populating category_base_url from presets makes every category look overridden, so resolve_category_client stops reusing the already-bootstrapped base client and rebuilds per-category clients instead. In that override path it uses InferenceConfig::default(), which drops runtime client tuning from setup (timeouts/retries/default limiter/config) for all affected categories. This regression is triggered by the new base-url autofill even when the preset URL is effectively the same base routing.
Useful? React with 👍 / 👎.
…t_is_simulator test After merging origin/main (which absorbed the 5 prior-session prior-art fixes via PR #986), the merged tree carried two definitions of `stream_reaction_texts_skips_llm_when_client_is_simulator` — one with the new 14-arg signature (from this PR's `e1e94fd7` fixup), one with the old 13-arg signature (from main's #986). Clippy + cargo test rejected both as E0428 (redefinition) + E0061 (arg mismatch). Keep the 14-arg version that matches the post-merge `stream_reaction_texts` signature; drop the 13-arg copy. No behavior change — both test bodies were byte-identical except for the missing callback arg. 2796 workspace tests pass; clippy `-D warnings` + fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…preset auto-fill Three findings from gemini-code-assist + chatgpt-codex-connector on PR #990. **Heuristic dialogue parser — single-quote terminator (gemini P-medium / codex P2)** `extract_dialogue_field_heuristic` accepted `'` as an opening quote but the scan loop only stopped on `"`. For pseudo-JSON like `{'dialogue':'Aye, "good", said he'}`, the inner double quote terminated the body and the trailing `'}` leaked into the user-visible dialogue. Track the opener char and stop only on a matching closer. New tests: - `test_parse_npc_stream_response_single_quoted_with_inner_double_quote` - `test_parse_npc_stream_response_double_quoted_with_inner_single_quote` **Modern-register phrase boundary + Unicode (gemini P-medium)** `detect_modern_register`'s phrase branch used `lower.contains(term)` with no word-boundary check, so "undecided to visit" matched "decided to visit". The single-word branch was ASCII-byte-based, so non-ASCII letters bordering a term (e.g. `fascinatingñ`) were treated as boundaries rather than word chars. Unify both branches around a single Unicode-aware boundary predicate that uses `char::is_alphanumeric` (which handles non-ASCII letters correctly) and treats `_` + `'` as word chars to match Hiberno-English register. New tests: - `modern_register_phrase_respects_word_boundary` - `modern_register_word_boundary_unicode` **Preset auto-fill: pair model+URL + respect user base (codex P1 ×2)** `fill_missing_models_from_presets` filled `category_base_url` independently of `category_model`. Two failure modes: 1. **Silent reroute of user model**: user sets `category_overrides.simulation.model` to their own model but leaves `base_url` empty. Old behaviour: we'd inject the preset's slot URL (`:8001`), routing the user's model to the wrong slot → 404 or wrong-model response. 2. **Localhost clobbering**: user sets `base_url = "http://remote-gpu:8000"` for a colocated GPU box. Preset's hardcoded `http://localhost:8001` for intent/reaction would override → traffic silently lands on a local port that doesn't exist on the remote host. Two guards: - **Pair model and URL fills.** Only auto-fill `category_base_url` on the same pass we auto-fill `category_model`. If the user has set the model themselves, leave URL inheritance to the base — they own routing for that category. - **Don't override a user-set base.** If `self.base_url != provider.default_base_url()`, the user is off the canonical path and is responsible for category routing. Skip auto-fill. Side benefit: addresses codex's third finding — when every category looks "overridden", `resolve_category_client` rebuilds per-category clients with `InferenceConfig::default()` and drops the runtime client tuning from setup. The new gating keeps the base client reused when categories cleanly inherit. Refactored using `Entry` API to satisfy `clippy::map_entry` (the naive `contains_key` + `insert` pattern was the lint trigger). All tests pass: 2254 workspace lib, 462 parish-npc; clippy `-D warnings` clean; fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-player invented its name on demand and Qwen2.5-14B produced "Br Q" when Peig asked "Who might ye be" (post-fix run of #990). Add a paragraph to mods/rundale/demo-prompt.txt pinning the player as "Aiden Carney" from over by the Shannon, with an explicit "always answer with this exact name, never single letters" rule and a negative constraint that names the "Br Q" failure mode by string. Mod-content change only — no Rust edit. The demo extra_prompt already flows through parish-tauri::commands::get_llm_player_action. Live demo (12 turns, transcript at docs/proofs/992-demo-player-name/): Peig: "Who might ye be, if I might ask it so bold?" Player: "Aye, my name's Aiden. Came over by the Shannon, wonderin' the countryside." Closes #992. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1001) Closes #996. PR #990 added `[presets.base_urls]` to vllm_mlx.toml so each category on Apple Silicon routes to the slot where its preset model is actually loaded. The sister TOML for Linux/Windows CUDA/ROCm (vllm.toml) was left behind: fill_missing_models_from_presets auto-picks the preset model (e.g. Qwen3-8B) but inherits the user-level base URL (:8000, where only the 14B is loaded) → 404 storm equivalent to the pre-#990 macOS bug. Layout matches the primary suggestion in the issue (each unique Qwen3 model size on its own port): :8000 — 14B (dialogue, narrative weight) :8001 — 8B (simulation + reaction, shared mid-tier) :8002 — 4B (intent, low-latency) Downstream VllmProcess::ensure_slots dedups the duplicate 8B slot, so only two extra processes spawn beyond the base. Regression guard: new parish-core unit test vllm_preset_supplies_per_category_base_url exercises the full schema round-trip — Provider::preset_base_url → fill_missing_models_from_presets → vllm_extra_slots — and asserts URLs/models for all four categories plus dialogue-elision in the extra-slot list. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…se_url is empty (#993) (#1006) After PR #990 plumbed `[presets.base_urls]` into vllm_mlx.toml, demo runs still emitted one `404 Not Found for url http://localhost:8000` per fresh save inside `infer_player_message_reaction`. Root cause was two-fold in `GameConfig::resolve_category_client`: 1. `has_override` did not include `category_model.contains_key(&cat)`. A category that had a model override but no URL/provider/key override fell through to the base client — the 1.5B reaction model sent at the 14B `:8000` slot returns 404 from vLLM ("model not found"). 2. The URL fallback chain for per-category clients went `category_base_url → base_url → provider.default_base_url()`, with no step that consulted the resolved provider's `preset_base_url(cat)`. When `fill_missing_models_from_presets` was prevented from populating `category_base_url` (e.g. because `self.base_url != p.default_base_url()` after a partial hydration pass), the per-category client still landed on the base URL — the same misroute. Fix: - Add `category_model.contains_key(&cat)` to `has_override`. - Insert `provider.preset_base_url(cat)` between the explicit `category_base_url` lookup and the `base_url` fallback. For single-slot providers (Ollama, Anthropic, OpenAI) whose preset omits `[presets.base_urls]`, `preset_base_url` returns `None` and the chain is inert — verified by a regression test. - Simplify `fill_missing_models_from_presets`: model and URL are filled independently (both vacant-only), removing the `filled_model` gate that previously stranded a user-supplied model on the base URL. Adds four regression tests in `parish-core::ipc::config::tests`: - `resolve_category_client_model_only_override_triggers_per_category_client` - `resolve_category_client_falls_back_to_preset_base_url` - `resolve_category_client_preset_fallback_is_inert_for_single_slot_provider` - `issue_993_user_config_hydration_routes_reaction_to_slot_8001` The fourth test replays the user's actual parish.toml from #993 (vllm-mlx base + `[category_overrides.intent]` only) and asserts the Reaction client resolves to `:8001` with the 1.5B model. Proof bundle: docs/proofs/993-residual-reaction-404/ - acceptance-criteria.md (5 criteria, written first) - transcript.txt (live headless CLI run; 0 occurrences of "404") - evidence.md, judge.md (verdict: sufficient; AC: met) Workspace: 2804 tests pass; clippy + fmt clean; just agent-check passes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(reactions): wire emoji-monoculture sensor + temp=1.0 (#995) The diversity detector landed in #990 but had no runtime call site — it was reachable only from tests. This change wires it to a session-scoped capacity-8 ring buffer on `NpcManager` and feeds the buffer from every reaction-emit path (CLI headless, CLI script harness, web server, Tauri). A debounced `tracing::warn!` fires once the detector reports `Some(QualityIssue)`; the flag clears when diversity recovers so future crossings re-arm the sensor. The same buffer is exposed via a new `/debug reactions` CLI subcommand so live play-tests can verify wiring without depending on log-file access. `infer_player_message_reaction` now passes an explicit `temperature = 1.0` (was `None`) so small-model reaction sampling explores beyond the most-likely-safe `🤔` / `😊` collapse the issue describes. The 80-token output cap from #984 is preserved; the output schema is a closed palette so widening the distribution cannot break correctness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address gemini review on record_reaction_emoji - Replace while loop with if when popping front of the ring buffer (push_back adds at most one element per call, so a single pop is sufficient). - Early-return before allocating the snapshot Vec<&str> when the buffer is below the detector's minimum sample threshold. - Expose DEFAULT_EMOJI_MIN_SAMPLES from quality.rs so the threshold lives next to the detector that owns it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing (#994) (#1007) * fix(bench): mirror runtime tier-1 prompt in dialogue eval (#994) The rundale-bench dialogue sampler used a hardcoded 280-char `DIALOGUE_SYS` that did not reflect the runtime tier-1 grounding that #990 added (1820 fact preamble, anti-stage-Irish cultural guard, persona binding, ga-IE phrase whitelist, Latin-script guard). Result: prompt-quality work showed no measurable lift in the bench because the bench wasn't running the prompt under test. This change adds `build_dialogue_system_prompt()` to `eval_lib.py` that reads `mods/rundale/prompts/tier1_system.txt`, substitutes the bench's Brigid O'Brien persona slots, and appends a Python mirror of `parish_npc::language_directive` for en-IE/ga-IE. The five scripts that previously hardcoded a Brigid prompt now share that helper: - parish/testing/rundale-bench/cache_dialogue_replies.py - parish/testing/rundale-bench/rundale_bench.py - parish/testing/rundale-bench/bench_perf.py - parish/scripts/local-eval/gen_samples.py - parish/scripts/local-eval/gen_dlg.py Resulting prompt grows from ~280 to ~3894 chars and now carries every guardrail the live runtime applies, so future tier-1 prompt regressions will surface on the leaderboard. No runtime behaviour change. Past leaderboard scores will need re-benching against the new prompt; that work is intentionally deferred to a follow-up so this prompt change can be reviewed in isolation (issue point #3). Proof bundle: docs/proofs/994-bench-prompt-mirror/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(bench): rebench grok-4.3 under new prompt — grok-4.3 judge Switched judge from mistral-large-2512 to x-ai/grok-4.3 at user request (mistral too generous). Headline numbers re-stated against the stricter judge. Surfaced a bench-scoring artifact: the NEW prompt's runtime template emits dialogue + `---` + JSON metadata, and the judge penalizes the JSON scaffolding as anachronistic even though the runtime strips it before showing the player. On `dialogue-0011` the same advice scores 8.4 with the OLD prompt's plain-text format and 3.4 with the NEW prompt's templated format — +5-point swing on scaffolding the player never sees. Bench should strip the metadata block before judging. After stripping, grok-judge totals: OLD raw 8.09 NEW raw 7.45 (scaffolding penalty) NEW stripped 8.21 <-- apples-to-apples delta vs OLD +0.12 Per-axis (NEW stripped vs OLD): character -0.40 authenticity -0.34 language +0.87 <- GA_IE whitelist + Latin-script guard responsiveness +0.60 <- persona binding + "react to" clause craft -0.14 Both discriminative judges (mistral, grok) agree on sign; magnitude inside N=15 noise floor. Two follow-ups noted in rebench-findings.md: 1. Bench should strip the metadata block before judging (today's +5-point swing is a measurement artifact, not a prompt regression). 2. Reconcile the divergence between the mod template (---/JSON format) and the Rust builder (JSON-first format). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bench): strip runtime metadata envelope before judging Bench was sending raw model output (dialogue + `---` + JSON metadata) to the judge, but the runtime parses the metadata and shows the player only the dialogue line. Grok-4.3 (as judge) penalised the JSON scaffolding as anachronistic, depressing every NEW-prompt score by ~+0.76 points. Adds `grade.extract_dialogue_for_judging(reply)` that handles both runtime envelope shapes: - Mod-template format: `<dialogue>\n---\n{...}` → text before `\n---` - Rust-builder JSON-first: `{"dialogue": "...", ...}` → parsed `.dialogue` field - Legacy plain text: pass-through Falls through to verbatim return on malformed envelopes so the judge scores broken output as the player would see it. Never raises. Wired into every site that sends a reply to the judge: - grade.grade_dialogue - grade.grade_reaction - grade.grade_pairwise - score_multiaxis.score_one - rubric_lab.absolute_judge - rubric_lab.pairwise_judge `_non_latin` still scans the full raw reply so non-Latin leaks inside metadata are still flagged. Tests: 10 new `test_extract_dialogue_*` cases plus a SyntaxError-immune fall-through; 38/38 pass. Rebench under grok-4.3 (same NEW cache as #994 bundle): pre-fix raw 7.45 <- scaffolding penalty manual-stripped 8.21 <- one-off compare in #994 bundle post-fix auto-strip 8.29 <- this fix Recovers +0.84 of measurement artifact. NEW prompt now slightly exceeds OLD on the same judge (8.29 vs 8.09 = +0.20) — first apples-to-apples confirmation that the runtime tier-1 grounding outperforms the pre-#994 hardcoded bench prompt. Proof bundle: docs/proofs/994b-bench-strip-metadata/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bench): address review on extract_dialogue_for_judging Four bot reviews on grade.py, all legit. Mirror runtime more fully so malformed-output cases don't drift from `parish_npc::parse_npc_stream_response`. 1. Strip Markdown JSON fences (`` ```json … ``` ``) before envelope detection. Mirrors `parish_npc::strip_json_fence`. Without this, fenced replies (Anthropic) bypass extraction and ship raw scaffolding to the judge. (codex PRRT_kwDORqdnvs6CzukS) 2. Accept `---` at start of reply (no leading newline). Some models skip the dialogue line and emit only the metadata block; the runtime splits on `---` itself, so the bench must too. Regex now matches `(?:\n|^)[ \t]*---`. (codex PRRT_kwDORqdnvs6CzukN) 3. Recover the `"dialogue"` field from truncated JSON envelopes (max_tokens cutoff, network blip). Adds `_extract_dialogue_field_heuristic` mirroring the Rust runtime helper of the same name. Without this, max-tokens-truncated outputs were judged as raw envelope. (codex PRRT_kwDORqdnvs6CzukL) 4. Replace the hand-rolled brace/string tracker with `json.JSONDecoder().raw_decode()` — cleaner, tolerates trailing junk, drops 25 lines of state machine code. (gemini PRRT_kwDORqdnvs6Czu7H) Tests: 8 new cases (start-of-string dash, fenced/bare-fenced JSON, truncated recovery with and without escaped quotes, fenced+truncated combo, trailing-junk JSON). Updated one existing test whose assertion was the old hand-rolled fall-through behaviour. 46/46 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bench): tighten review issues on extract_dialogue + judge rubrics Four more codex reviews, all legit, on the strip-metadata helper + the rubrics it feeds: 1. **Judge rubrics still said 1-3 sentences** while the new shared prompt instructs 2-4. Pre-#994 craft scores systematically penalised any reply >3 sentences. Updated `judge_v1.json`, `judge_pairwise_v1.json`, and `score_multiaxis.DEFAULT_RUBRIC` to 2-4. Recomputed `rubric_sha256` on both pinned configs. (codex PRRT_kwDORqdnvs6C0CIU) 2. **Truncated-JSON recovery was too permissive.** Used `re.search` so a `dialogue` field anywhere in malformed input was recovered; runtime only recovers when `dialogue` is the leading key. Anchored the regex with `^\s*\{?\s*` and switched to `re.match`. (codex PRRT_kwDORqdnvs6C0CIZ) 3. **Fence stripping was case-insensitive** while the runtime `parish_npc::strip_json_fence` is case-sensitive lowercase. An uppercase ```JSON wrapper was being unwrapped by bench but kept raw by runtime. Dropped `re.IGNORECASE`. (codex PRRT_kwDORqdnvs6C0CIe) 4. **`---` delimiter required trailing newline.** Runtime splits on bare `---` anywhere (`parish-core::game_session::cleaned`, `parish-npc::reactions::arrival_reactions`), so inline forms like `dialogue---{action:…}` were not stripped by bench. Replaced the regex with `stripped.find("---")`. (codex PRRT_kwDORqdnvs6C0CIf) Tests: 3 new cases (inline `---{json}`, uppercase fence fall-through, non-leading `dialogue` key fall-through). 49/49 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bench): try JSON-first parse before --- split Codex review PRRT_kwDORqdnvs6C0Je2: a valid JSON reply whose dialogue text contains `---` (em-dash spelled out, e.g. `{"dialogue":"Well---maybe",...}`) was being mangled into a partial envelope by the delimiter split that ran first. Runtime `parish_npc::parse_npc_stream_response` parses JSON before falling back to other strategies. Bench now matches that order: 1. fence strip 2. JSON-first parse (raw_decode) 3. truncated-JSON heuristic 4. bare `---` split (mod-template format) 5. verbatim fallback JSON dialogue containing `---` survives intact; the `---` split is only reached for non-JSON envelopes that don't contain a parseable `{` prefix. Test: `test_extract_dialogue_json_dialogue_contains_em_dash`. 50/50 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bench): match runtime JSON-parse semantics exactly Two more codex reviews on `extract_dialogue_for_judging`: 1. **Missing dialogue field returns "" not raw envelope.** Runtime `NpcJsonResponse` uses `#[serde(default)]`, so a parseable JSON envelope without a `dialogue` key surfaces as empty dialogue to the player. Bench was falling through and judging the raw envelope. Same for non-string dialogue values (null, number). (codex PRRT_kwDORqdnvs6C0Zc8) 2. **`json.loads` instead of `raw_decode`.** Runtime `serde_json::from_str` requires the entire input to be consumed (modulo surrounding whitespace) and rejects trailing junk; `raw_decode` would accept it, drifting from runtime semantics. Switched to `loads`. Trailing-junk replies still work because the truncated-JSON heuristic kicks in next and recovers the leading `"dialogue"` field — same chain runtime follows. (codex PRRT_kwDORqdnvs6C0ZdL) Tests: 3 new cases (missing field → "", non-string field → "", `{}` → ""). Updated existing trailing-junk test comment to reflect the new fall-through path. 52/52 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
End-to-end fix for the 12 bugs surfaced in the May 17
just demo 2 10run + grok-4.3 dialogue eval, plus a newparish-npc::qualitydetector module for ongoing regression catch.13 commits, all from focused fixes — no scope creep. Engine + UI + prompt + parser + config schema layers, in that order.
Bugs fixed (live-demo + grok eval evidence)
action": "..."}in chat input)c05feb41—extract_action_from_responsestrips leading/trailing JSON wrappers[Your Name]template parrot in NPC replies5ad8807c+f6ab62b9— chat path skips simulator; preset routes simulation to 14B not 1.5Bcc311526— tier-1 system prompt adds 1820 fact preamblecc311526— 8-phrase whitelist + improvisation guardc05feb41— auto-player prompt adds modern-register blacklistf6ab62b9— reaction routes to real LLM (1.5B@8001) not simulator2fc616ce— placeholder ties tostream_turn_id; per-turnstream-turn-end; max_tokens 40→80e4ba9d82(#283) — don't bump tick generation while clock is inference-paused0f3cc098—addressed_tohonored; absent name suppresses fallback5ad8807c—apply_movementGameMessages now contractuallysource: "system"4323d710— UI keeps clock running visually during transient LLM holds--demo-max-turns Ndoesn't terminate processa3c4f0bb— submit/quitafter final turn whenconfig.auto_start{"dialogue":...on truncated streame44455b3— heuristic recovery extracts dialogue stringsimulator-corpus-overlapdetector false-positive on "be the one to"74c3a367— 4-gram → 5-gram thresholdNew eval scaffolding
e9c354f1 feat(quality)— newparish-npc::qualitymodule with 6 pure-function detectors:action": "..."}literal in chat text)[Your Name],{npc_name},{{location}})parish_inference::simulator::CORPUS)fascinating,decided to visit,healing properties)Wired into:
get_llm_player_action(parish-tauri/commands.rs) — emitsWARN site=demo-player-actionper issueWARN site=npc-replyparish-cli/tests/eval_baselines.rs(assert no JSON / no template / no Gaelic in baselined fixtures)20 unit tests covering positive + negative samples — including the exact bug strings from the May 17 demo.
Schema additive change
f6ab62b9 fix(config)—ProviderPresetgains optionalbase_urlsblock per category.fill_missing_models_from_presetspopulatescategory_base_urlalongsidecategory_model. Existing TOMLs that don't ship[presets.base_urls]keep their old single-slot behaviour.vllm_mlx.tomldeclares the canonical Apple Silicon two-slot loadout::8000): 14B for dialogue + simulation (narrative weight):8001): 1.5B for intent + reaction (low-latency)Replaces the previous user-config workaround that pointed simulation + reaction at the simulator backend.
Verification
cargo test --workspace --lib→ 2242 passed, 7 ignoredcargo test --workspace --tests→ 2784 passedcargo test -p parish-npc quality::tests→ 20 passedcargo test -p parish --test eval_baselines→ 12 passed (3 new structural rubrics)cargo clippy --workspace --no-deps→ cleancargo fmt --check→ cleanjust ui-test→ 401 passed (33 files, 12.45s)Live demo verification (
just demo 2 5on fresh save):--demo-max-turns NGrok-4.3 dialogue eval baseline (10 prompts vs 14B): bench uses its own hardcoded prompt distinct from the runtime tier-1 prompt; eval score is essentially unchanged by this PR by design. Eval delta requires updating
DIALOGUE_SYSincache_dialogue_replies.pyto mirror the new tier-1 grounding, which is intentionally NOT bundled here (separate PR territory).Test plan
cargo test --workspacecargo clippy --workspace --no-depsjust ui-testjust demo 2 5on fresh save → confirm 0 JSON leaks in NPC replies, 0 quality WARN, exit 0🤖 Generated with Claude Code