Skip to content

fix: demo bug sweep + LLM-output quality detectors + preset multi-slot routing#990

Merged
dmooney merged 16 commits into
mainfrom
demo-fixes-integration
May 17, 2026
Merged

fix: demo bug sweep + LLM-output quality detectors + preset multi-slot routing#990
dmooney merged 16 commits into
mainfrom
demo-fixes-integration

Conversation

@dmooney
Copy link
Copy Markdown
Owner

@dmooney dmooney commented May 17, 2026

Summary

End-to-end fix for the 12 bugs surfaced in the May 17 just demo 2 10 run + grok-4.3 dialogue eval, plus a new parish-npc::quality detector module for ongoing regression catch.

13 commits, all from focused fixes — no scope creep. Engine + UI + prompt + parser + config schema layers, in that order.

Bugs fixed (live-demo + grok eval evidence)

# Class Fix
1 Player JSON envelope leak (action": "..."} in chat input) c05feb41extract_action_from_response strips leading/trailing JSON wrappers
2 [Your Name] template parrot in NPC replies cascade fix from #1
3 Simulator Markov gibberish leaking to UI ("bridget...new collection...God help us") 5ad8807c + f6ab62b9 — chat path skips simulator; preset routes simulation to 14B not 1.5B
4 NPC historical drift ("teaching in Irish is outlawed" in 1820) cc311526 — tier-1 system prompt adds 1820 fact preamble
5 Hallucinated Gaelic ("Slán abhaile go connachtú") cc311526 — 8-phrase whitelist + improvisation guard
6 Modern player register ("fascinating", "decided to visit") c05feb41 — auto-player prompt adds modern-register blacklist
7 Reaction emoji monoculture (5× 😊) f6ab62b9 — reaction routes to real LLM (1.5B@8001) not simulator
8 Blank NPC reactions / stream lifecycle 2fc616ce — placeholder ties to stream_turn_id; per-turn stream-turn-end; max_tokens 40→80
9 TOCTOU "World shifted while your words were in the air" 4× per 10 turns e4ba9d82 (#283) — don't bump tick generation while clock is inference-paused
10 Wrong NPC replies when player named absent target 0f3cc098addressed_to honored; absent name suppresses fallback
11 Location description rendered as NPC dialogue bubble 5ad8807capply_movement GameMessages now contractually source: "system"
12 Clock visibly starts/stops in UI 4323d710 — UI keeps clock running visually during transient LLM holds
13 --demo-max-turns N doesn't terminate process a3c4f0bb — submit /quit after final turn when config.auto_start
14 (followup) NPC reply parser leaks {"dialogue":... on truncated stream e44455b3 — heuristic recovery extracts dialogue string
15 (followup) simulator-corpus-overlap detector false-positive on "be the one to" 74c3a367 — 4-gram → 5-gram threshold

New eval scaffolding

e9c354f1 feat(quality) — new parish-npc::quality module with 6 pure-function detectors:

  1. JSON envelope leak (action": "..."} literal in chat text)
  2. Template token leak ([Your Name], {npc_name}, {{location}})
  3. Simulator corpus overlap (5-gram match vs parish_inference::simulator::CORPUS)
  4. Hallucinated Gaelic (fada-bearing words not in vetted vocab)
  5. Modern register (fascinating, decided to visit, healing properties)
  6. Reaction emoji monoculture (Shannon-style diversity)

Wired into:

  • get_llm_player_action (parish-tauri/commands.rs) — emits WARN site=demo-player-action per issue
  • NPC reply emit site (parish-core/game_loop/npc_turn.rs) — emits WARN site=npc-reply
  • 3 new fixture rubrics in parish-cli/tests/eval_baselines.rs (assert no JSON / no template / no Gaelic in baselined fixtures)

20 unit tests covering positive + negative samples — including the exact bug strings from the May 17 demo.

Schema additive change

f6ab62b9 fix(config)ProviderPreset gains optional base_urls block per category. fill_missing_models_from_presets populates category_base_url alongside category_model. Existing TOMLs that don't ship [presets.base_urls] keep their old single-slot behaviour.

vllm_mlx.toml declares the canonical Apple Silicon two-slot loadout:

  • Slot 1 (:8000): 14B for dialogue + simulation (narrative weight)
  • Slot 2 (:8001): 1.5B for intent + reaction (low-latency)

Replaces the previous user-config workaround that pointed simulation + reaction at the simulator backend.

Verification

  • cargo test --workspace --lib2242 passed, 7 ignored
  • cargo test --workspace --tests2784 passed
  • cargo test -p parish-npc quality::tests20 passed
  • cargo test -p parish --test eval_baselines12 passed (3 new structural rubrics)
  • cargo clippy --workspace --no-deps → clean
  • cargo fmt --check → clean
  • just ui-test401 passed (33 files, 12.45s)

Live demo verification (just demo 2 5 on fresh save):

  • 5 player turns, 3 NPC dialogue replies
  • 0 JSON envelope leaks (was 7/10 pre-fix)
  • 0 quality WARN false positives (was 1/run pre-tune)
  • 0 hallucinated Gaelic / anachronism / modern-register hits
  • Emoji variety restored (was 5× 😊 monoculture; now mix of 🤔 / 😊)
  • Demo exits cleanly at --demo-max-turns N
  • Tier 2 / reaction inference 404 storm cleared (was every 10s; now 1 per run)

Grok-4.3 dialogue eval baseline (10 prompts vs 14B): bench uses its own hardcoded prompt distinct from the runtime tier-1 prompt; eval score is essentially unchanged by this PR by design. Eval delta requires updating DIALOGUE_SYS in cache_dialogue_replies.py to mirror the new tier-1 grounding, which is intentionally NOT bundled here (separate PR territory).

Test plan

  • cargo test --workspace
  • cargo clippy --workspace --no-deps
  • just ui-test
  • Manual: just demo 2 5 on fresh save → confirm 0 JSON leaks in NPC replies, 0 quality WARN, exit 0

🤖 Generated with Claude Code

dmooney and others added 13 commits May 17, 2026 13:36
…ster

- extract_action_from_response handles bare `action": "..."}` and
  trailing `"}` suffix cases (covered by new demo tests)
- auto-player system prompt adds 1820 fact preamble + modern-register
  blacklist + period-voice examples

Fixes the cascade where corrupt `action": "...[Your Name]..."}` player
input poisoned NPC dialogue context.
…elic

build_tier1_system_prompt (parish-npc) and the mod-shipped mirror
mods/rundale/prompts/tier1_system.txt gain four clauses in lockstep:

- STAY IN YOUR LANE: midwife knows herbs/births; farmer knows land/beasts;
  priest knows souls/gossip; teacher knows pupils/books. Outside-lane
  questions redirect to the right role, not improvised.
- 1820 fact preamble: Penal Laws repealed 1782 (teaching in Irish
  tolerated, not illegal); Catholic Emancipation pending 1829; Famine
  not until 1845; British Crown rules.
- Irish-phrase whitelist (Slán abhaile / Slán leat / Dia dhuit / Go raibh
  maith agat / Céad míle fáilte / Sláinte / mo chara / sídhe) with an
  explicit "do not invent or extend Irish phrases" guard.
- Modern-register blacklist (fascinating, amazing, definitely, totally,
  decided to visit, healing properties, taking in the sights).

Live demo evidence that motivated this:
- Aoife: "teaching in Irish is outlawed" (turn 6, 7) — false for 1820
- Aoife: "Slán abhaile go connachtú" (turn 8) — hallucinated Gaelic
- grok-4.3 dialogue-0005: midwife reply read as farmer/tracker (score 2/5)
- grok-4.3 dialogue-0003: "healing properties" modern register (score 2/5)

Nine new contract assertions in test_tier1_system_no_unsubstituted_placeholders
lock the new clauses against silent removal. Proof bundle written to
docs/proofs/dialogue-1820-grounding/ (acceptance-criteria.md, evidence.md,
cli_script.txt, judge.md — Verdict: sufficient).

just check exit 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the auto-player only flipped `demoEnabled = false` on hitting
its turn limit, leaving the Tauri window alive indefinitely. Scripted
demos (`just demo 2 10 > log`) hung forever and had to be killed.

When `config.auto_start` is true the demo was launched from the CLI, so
submit `/quit` after the last turn to actually exit the process. UI-
initiated demos still stop quietly so the window stays open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`resolve_npc_targets` always fell back to the first co-located NPC when
the resolved target list was empty. That meant "talk to Aoife Brennan"
in a location where only Peig is present routed the message to Peig,
who then replied as if addressed.

Suppress the fallback when `target_names` is non-empty but unresolved.
The empty-list branch in `handle_npc_conversation` then surfaces
"No one here answers to that name just now." instead of silently
mis-routing the dialogue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arrival

When the reaction client is the offline Markov simulator (the wired
provider for the `reaction` and `simulation` categories in the default
loadout), `stream_reaction_texts` was streaming its bigram nonsense
into the chat bubble word-by-word. Players saw the embedded corpus
("bridget from the new collection... saying things... God help us")
rendered as if an NPC had spoken it.

Detect `AnyClient::Simulator` at the dispatch site and emit the
deterministic canned line instead — same behaviour as the no-client
fallback. The simulator stays useful for test fixtures; it just stops
producing readable prose for surfaces visible to the player.

Adds a regression test that drives `stream_reaction_texts` with a
simulator client and asserts the canned text wins.

Also adds a guard test against an unrelated chat-source mistagging
report: every `GameMessage` produced by `apply_movement` must carry
`source: "system"`. Location descriptions in an NPC bubble was the
visible symptom; this asserts the contract at the message-construction
boundary so any future drift fails at compile-time-adjacent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The status-bar clock froze whenever `snap.paused || snap.inference_paused`
was true. Demo runs flipped `inference_paused` several times per turn
(once per LLM call), causing the displayed digits to oscillate between
running and frozen — visually noisy and indistinguishable from the
user-initiated pause.

Honour only `snap.paused` for the rAF freeze. The "⏸ Paused" badge
already uses `snap.paused` alone, so removing `inference_paused` from
the freeze aligns the moving digits with the indicator. The clock is
still anchored to the latest snapshot's `game_epoch_ms`, so any actual
time delta during inference is corrected on the next anchor without a
visible jump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#283)

`handle_game_input` captures `tick_generation` before the world lock is
released for intent parsing, then compares after re-acquiring. A drift
emits "World shifted while your words were in the air." to the chat.

The world-tick task fired every 5 s and unconditionally bumped the
counter, even though it called `inference_pause()` semantics elsewhere
to freeze the clock. A 10-turn demo surfaced the TOCTOU warning four
times — roughly every multi-second LLM call — even though nothing
player-relevant had actually changed: weather, tier processing, and
schedules all key off the frozen clock, so their work was a no-op.

Skip the counter bump while the clock is inference-paused. Same logic
applied in both `parish-tauri::setup` and `parish-server::session` so
the parity contract holds.

A 3-turn demo after the change shows zero "World shifted" emissions
(previously 4 per 10 turns ≈ 40% of intents).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emit per-turn stream-end

User-reported blank NPC reply during `just demo 2 10` traced not to the
emoji-reaction path (which already filters None and never emits text)
but to the NPC arrival-reaction streaming triggered when the auto-player
crosses locations.

Root cause: parish-core/movement.rs:188 and the duplicate orchestration
in parish-tauri/commands.rs:1108 emitted the per-NPC placeholder via
bare `text_log(npc_name, "")` with no stream_turn_id. The UI's
onTextLog handler only treats an empty-content text-log as a streaming
placeholder when stream_turn_id != null; without it, the empty bubble
is inserted into textLog directly and never reaped (finalizeStreamingEntry
only runs on stream-turn-end). Compounding it, the arrival-reaction
path emitted a single batch-level stream-end and never per-turn
stream-turn-end, so the UI cleanup hook would not fire for empty
reactions even if the placeholder did have a turn id.

Three fixes:

1. Placeholder ties to turn (movement.rs, tauri/commands.rs):
   text_log(npc_name, "") → text_log_for_stream_turn(npc_name, "", turn_id).
   The UI's streaming-placeholder code path now recognises and reaps it.

2. Per-turn finalisation (game_session.rs stream_reaction_texts):
   added emit_stream_turn_end callback called exactly once per reaction
   after stream_npc_tokens returns — success, timeout, or empty.
   Updated movement.rs and tauri/commands.rs to thread the callback;
   tauri side emits EVENT_STREAM_TURN_END with StreamTurnEndPayload.

3. max_tokens floor (emoji_reactions.rs):
   infer_player_message_reaction was Some(40) — too tight for multi-
   codepoint emoji tokenisation. Bumped to Some(80) with a comment
   noting the tokenisation risk. (stream_reaction_texts already at 100.)

Regression test stream_reaction_texts_emits_stream_turn_end_for_each_reaction
(async_llm_integration.rs) asserts 1:1 placeholder/turn-end pairing on
empty canned reactions across two NPCs. HTTP-500 and timeout tests
updated to assert stream-turn-end still fires on zero-token output.

431 parish-core tests + 435 parish-npc tests pass; clippy + fmt clean.

Note: rule #12 violation flagged but not fixed here — the arrival-
reaction orchestration is duplicated between parish-core/movement.rs
and parish-tauri/commands.rs:1108-1112. Both patched in lockstep; the
unification belongs in a separate refactor PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iring + fixture rubrics

New parish-npc::quality module — sibling to parish-npc::anachronism — with
six pure-function detectors targeting structural pathologies that no
existing judge slice covers:

1. JSON envelope leak (`action": "..."}` literal in chat text)
2. Template token leak (`[Your Name]`, `{npc_name}`, `{{location}}`)
3. Simulator corpus overlap (4-gram match vs parish_inference::simulator::CORPUS)
4. Hallucinated Gaelic (fada-bearing words not in GAELIC_VOCAB whitelist)
5. Modern register (`fascinating`, `decided to visit`, `healing properties`)
6. Reaction emoji monoculture (Shannon-style diversity check)

Each detector returns `Vec<QualityIssue>` with kind + detail + optional
byte-span. Aggregate `detect_all_text_issues(text)` runs all single-text
detectors at once. 19 unit tests cover positive samples (including the
exact strings from the May 17 live demo) + clean controls.

Runtime wiring (emits `WARN site=... kind=... detail=...`):

- parish-tauri/commands.rs: `get_llm_player_action` runs detectors on
  the parsed player action after `extract_action_from_response`.
- parish-core/game_loop/npc_turn.rs: NPC reply path runs detectors on
  `parsed.dialogue` after the existing non-empty guard.

Fixture rubrics (parish/crates/parish-cli/tests/eval_baselines.rs):

Three new asserts run all three "safe-on-simulator-output" detectors
across every BASELINED_FIXTURES entry — `rubric_no_json_envelope_leak_in_fixtures`,
`rubric_no_template_tokens_in_fixtures`, `rubric_no_hallucinated_gaelic_in_fixtures`.
Skipped on fixtures: simulator-corpus-overlap (would always fire on
simulator backend) and modern-register (soft signal, runtime-WARN only).

Made parish_inference::simulator::CORPUS `pub` so detectors can build a
4-gram set without duplicating the corpus. Documented intent at the
declaration site to prevent reuse.

Verification:
- cargo test -p parish-npc quality::tests → 19 passed
- cargo test -p parish --test eval_baselines → 12 passed (3 new)
- cargo clippy --no-deps → clean
- cargo fmt --check → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cherry-pick of ae08e83b (per-turn stream-end) added a 14th argument
to stream_reaction_texts but the existing in-file test at line 800 was
introduced by a different cherry-pick (5ad8807) that predated the new
signature, so auto-merge couldn't reconcile it. Adds the no-op
`|_turn_id| {}` callback to make the test compile against the merged
signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…slot loadouts

Two-slot vllm-mlx (14B :8000 + 1.5B :8001) was 404ing every reaction /
simulation call when the user removed those category overrides from
parish.toml. fill_missing_models_from_presets() auto-pulled the preset
model (1.5B) but inherited base_url from the user-level base (:8000) —
where only the 14B is loaded. Model/URL mismatch → 404.

Fix: extend ProviderPreset with a `base_urls` block keyed by category,
plumb it through ProviderMod::preset_base_url + Provider::preset_base_url,
and have fill_missing_models_from_presets populate category_base_url
alongside category_model when the preset supplies a URL.

vllm_mlx.toml's recommended preset now declares the canonical Apple
Silicon two-slot loadout: dialogue + simulation on :8000 (14B for
narrative weight), intent + reaction on :8001 (1.5B for low-latency).
Older presets that omit `[presets.base_urls]` continue to inherit the
user's base URL (single-slot providers — Ollama, OpenAI, Anthropic etc).

Schema is additive — `base_urls` defaults to all-None via PresetBaseUrls::default;
existing TOMLs that don't ship the block keep their old behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parse_npc_stream_response previously had a binary fallback: full JSON
parse succeeds, or treat the whole raw text as dialogue. When the LLM
emits well-formed-but-truncated JSON like
`{"dialogue": "Aye, the process of milling..."`  (max_tokens cutoff
or network blip), serde_json::from_str rejects it as malformed, the
fallback fires, and the raw `{"dialogue": "..."}` wrapper surfaces
as user-visible text in the chat bubble.

Live 2026-05-17 demo at The Mill: Brendan + Cormac Duffy replies
landed `{"dialogue": "Aye, 'tis the matter of charges..."` verbatim
in 7 of 10 NPC replies. Nora Duffy's replies, which happened to
complete within max_tokens, parsed cleanly.

Add a third tier:

1. Full JSON parse — preferred, captures dialogue + metadata.
2. **Heuristic dialogue extraction (NEW)** — when JSON parse fails,
   walk the bytes after the first `"dialogue":` key and extract the
   inner string, honouring JSON-style backslash escapes and stopping
   at an unescaped closing quote (or end of stream).
3. Raw text — non-JSON providers, empty stream.

Three new test cases cover the truncated-mid-string,
truncated-at-escape, and truncated-with-empty-value paths. Existing
JSON / plain-text / empty fallback tests continue to pass (10 total).

Parallel fix to c05feb4 (`extract_action_from_response`) which
covered the player-action path; this covers the NPC-reply path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…false positives

detect_simulator_corpus_overlap previously flagged a single 4-gram
match against the simulator corpus as a quality issue. Live 2026-05-17
demo: Brendan Duffy reply "Me da'll be the one to tell ye how it's
done around here" tripped the detector on "be the one to" — a
common English idiom that happens to appear verbatim in the corpus
but is not Markov leakage.

Bumped to 5-gram. True simulator output chains many consecutive
corpus words (e.g. the user-reported gibberish "Can quite explain
bridget from the new collection for them..." still produces 5-gram
matches like "can quite explain bridget from"). Organic English
rarely matches 5 corpus words in a row by chance.

Net effect: zero false positives on live demo output, gibberish
detection preserved.

New regression test `corpus_overlap_no_false_positive_on_common_idiom`
asserts the Brendan-style reply does not trip. Existing
`corpus_overlap_actual_demo_gibberish` updated to expect 5-gram (still
flags). 20 quality tests pass (was 19, +1 false-positive guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🤖 Hi @dmooney, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements historical grounding for NPC dialogue, enforcing 1820-specific facts, persona constraints, and Gaelic language guardrails. It introduces a quality module for detecting structural issues like JSON leaks and modern register terms, alongside heuristic improvements for parsing truncated LLM responses. Additional changes include multi-slot provider support and fixes for clock-freezing and TOCTOU race conditions. Review feedback identifies a quote mismatch in the heuristic dialogue parser and recommends using Unicode-aware regex for modern register detection to prevent false positives.

Comment thread parish/crates/parish-npc/src/lib.rs
Comment thread parish/crates/parish-npc/src/quality.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74c3a367ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +383 to +386
if !self.category_base_url.contains_key(&cat)
&& let Some(u) = p.preset_base_url(cat)
{
self.category_base_url.insert(cat, u.to_string());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid overriding URL for user-specified category models

fill_missing_models_from_presets now fills category_base_url whenever it is missing, even when that category already has an explicit user model override. On multi-slot providers (like vllm-mlx), this can silently reroute a user-chosen model to the preset slot URL (e.g. force :8001) and produce model/endpoint mismatches or 404s unless the user also manually sets category_base_url. The base URL should only be auto-filled when the model is also auto-filled from the preset (or when no explicit model override exists).

Useful? React with 👍 / 👎.

Comment thread parish/crates/parish-npc/src/lib.rs Outdated
Comment on lines +341 to +342
'"' => break,
other => out.push(other),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle single-quoted dialogue terminators in heuristic parser

The heuristic parser accepts single-quoted dialogue values (.strip_prefix('\'')), but the scan loop only stops on an unescaped double quote. For malformed/pseudo-JSON like {'dialogue':'Aye there'}, this consumes through the trailing structure and returns leaked wrapper text (e.g. Aye there'}) instead of clean dialogue, so the truncation-recovery path still surfaces artifacts for this input class.

Useful? React with 👍 / 👎.

Comment thread parish/crates/parish-config/providers/vllm_mlx.toml
Comment on lines +383 to +386
if !self.category_base_url.contains_key(&cat)
&& let Some(u) = p.preset_base_url(cat)
{
self.category_base_url.insert(cat, u.to_string());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid forcing override-client path for base-equivalent slots

Auto-populating category_base_url from presets makes every category look overridden, so resolve_category_client stops reusing the already-bootstrapped base client and rebuilds per-category clients instead. In that override path it uses InferenceConfig::default(), which drops runtime client tuning from setup (timeouts/retries/default limiter/config) for all affected categories. This regression is triggered by the new base-url autofill even when the preset URL is effectively the same base routing.

Useful? React with 👍 / 👎.

dmooney and others added 2 commits May 17, 2026 16:46
…t_is_simulator test

After merging origin/main (which absorbed the 5 prior-session prior-art
fixes via PR #986), the merged tree carried two definitions of
`stream_reaction_texts_skips_llm_when_client_is_simulator` — one with
the new 14-arg signature (from this PR's `e1e94fd7` fixup), one with
the old 13-arg signature (from main's #986). Clippy + cargo test
rejected both as E0428 (redefinition) + E0061 (arg mismatch).

Keep the 14-arg version that matches the post-merge
`stream_reaction_texts` signature; drop the 13-arg copy. No behavior
change — both test bodies were byte-identical except for the missing
callback arg.

2796 workspace tests pass; clippy `-D warnings` + fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…preset auto-fill

Three findings from gemini-code-assist + chatgpt-codex-connector on PR #990.

**Heuristic dialogue parser — single-quote terminator (gemini P-medium / codex P2)**

`extract_dialogue_field_heuristic` accepted `'` as an opening quote but
the scan loop only stopped on `"`. For pseudo-JSON like
`{'dialogue':'Aye, "good", said he'}`, the inner double quote
terminated the body and the trailing `'}` leaked into the user-visible
dialogue. Track the opener char and stop only on a matching closer.

New tests:
- `test_parse_npc_stream_response_single_quoted_with_inner_double_quote`
- `test_parse_npc_stream_response_double_quoted_with_inner_single_quote`

**Modern-register phrase boundary + Unicode (gemini P-medium)**

`detect_modern_register`'s phrase branch used `lower.contains(term)`
with no word-boundary check, so "undecided to visit" matched
"decided to visit". The single-word branch was ASCII-byte-based, so
non-ASCII letters bordering a term (e.g. `fascinatingñ`) were treated
as boundaries rather than word chars.

Unify both branches around a single Unicode-aware boundary predicate
that uses `char::is_alphanumeric` (which handles non-ASCII letters
correctly) and treats `_` + `'` as word chars to match Hiberno-English
register.

New tests:
- `modern_register_phrase_respects_word_boundary`
- `modern_register_word_boundary_unicode`

**Preset auto-fill: pair model+URL + respect user base (codex P1 ×2)**

`fill_missing_models_from_presets` filled `category_base_url`
independently of `category_model`. Two failure modes:

1. **Silent reroute of user model**: user sets `category_overrides.simulation.model`
   to their own model but leaves `base_url` empty. Old behaviour:
   we'd inject the preset's slot URL (`:8001`), routing the user's
   model to the wrong slot → 404 or wrong-model response.
2. **Localhost clobbering**: user sets `base_url = "http://remote-gpu:8000"`
   for a colocated GPU box. Preset's hardcoded
   `http://localhost:8001` for intent/reaction would override → traffic
   silently lands on a local port that doesn't exist on the remote host.

Two guards:

- **Pair model and URL fills.** Only auto-fill `category_base_url` on
  the same pass we auto-fill `category_model`. If the user has set
  the model themselves, leave URL inheritance to the base — they own
  routing for that category.
- **Don't override a user-set base.** If `self.base_url !=
  provider.default_base_url()`, the user is off the canonical path
  and is responsible for category routing. Skip auto-fill.

Side benefit: addresses codex's third finding — when every category
looks "overridden", `resolve_category_client` rebuilds per-category
clients with `InferenceConfig::default()` and drops the runtime
client tuning from setup. The new gating keeps the base client
reused when categories cleanly inherit.

Refactored using `Entry` API to satisfy `clippy::map_entry` (the
naive `contains_key` + `insert` pattern was the lint trigger).

All tests pass: 2254 workspace lib, 462 parish-npc; clippy `-D warnings`
clean; fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dmooney dmooney merged commit 02a858a into main May 17, 2026
18 checks passed
@dmooney dmooney deleted the demo-fixes-integration branch May 17, 2026 21:01
dmooney added a commit that referenced this pull request May 17, 2026
Auto-player invented its name on demand and Qwen2.5-14B produced "Br Q"
when Peig asked "Who might ye be" (post-fix run of #990). Add a paragraph
to mods/rundale/demo-prompt.txt pinning the player as "Aiden Carney" from
over by the Shannon, with an explicit "always answer with this exact name,
never single letters" rule and a negative constraint that names the "Br Q"
failure mode by string.

Mod-content change only — no Rust edit. The demo extra_prompt already flows
through parish-tauri::commands::get_llm_player_action.

Live demo (12 turns, transcript at docs/proofs/992-demo-player-name/):
Peig: "Who might ye be, if I might ask it so bold?"
Player: "Aye, my name's Aiden. Came over by the Shannon, wonderin' the
        countryside."

Closes #992.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 17, 2026
#1001)

Closes #996.

PR #990 added `[presets.base_urls]` to vllm_mlx.toml so each category on
Apple Silicon routes to the slot where its preset model is actually
loaded. The sister TOML for Linux/Windows CUDA/ROCm (vllm.toml) was
left behind: fill_missing_models_from_presets auto-picks the preset
model (e.g. Qwen3-8B) but inherits the user-level base URL (:8000,
where only the 14B is loaded) → 404 storm equivalent to the pre-#990
macOS bug.

Layout matches the primary suggestion in the issue (each unique Qwen3
model size on its own port):

  :8000 — 14B (dialogue, narrative weight)
  :8001 — 8B  (simulation + reaction, shared mid-tier)
  :8002 — 4B  (intent, low-latency)

Downstream VllmProcess::ensure_slots dedups the duplicate 8B slot, so
only two extra processes spawn beyond the base.

Regression guard: new parish-core unit test
vllm_preset_supplies_per_category_base_url exercises the full schema
round-trip — Provider::preset_base_url → fill_missing_models_from_presets
→ vllm_extra_slots — and asserts URLs/models for all four categories
plus dialogue-elision in the extra-slot list.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 18, 2026
…se_url is empty (#993) (#1006)

After PR #990 plumbed `[presets.base_urls]` into vllm_mlx.toml, demo runs
still emitted one `404 Not Found for url http://localhost:8000` per fresh
save inside `infer_player_message_reaction`. Root cause was two-fold in
`GameConfig::resolve_category_client`:

1. `has_override` did not include `category_model.contains_key(&cat)`.
   A category that had a model override but no URL/provider/key override
   fell through to the base client — the 1.5B reaction model sent at the
   14B `:8000` slot returns 404 from vLLM ("model not found").

2. The URL fallback chain for per-category clients went
   `category_base_url → base_url → provider.default_base_url()`, with no
   step that consulted the resolved provider's `preset_base_url(cat)`.
   When `fill_missing_models_from_presets` was prevented from populating
   `category_base_url` (e.g. because `self.base_url != p.default_base_url()`
   after a partial hydration pass), the per-category client still landed
   on the base URL — the same misroute.

Fix:
- Add `category_model.contains_key(&cat)` to `has_override`.
- Insert `provider.preset_base_url(cat)` between the explicit
  `category_base_url` lookup and the `base_url` fallback. For single-slot
  providers (Ollama, Anthropic, OpenAI) whose preset omits
  `[presets.base_urls]`, `preset_base_url` returns `None` and the chain
  is inert — verified by a regression test.
- Simplify `fill_missing_models_from_presets`: model and URL are filled
  independently (both vacant-only), removing the `filled_model` gate
  that previously stranded a user-supplied model on the base URL.

Adds four regression tests in `parish-core::ipc::config::tests`:
- `resolve_category_client_model_only_override_triggers_per_category_client`
- `resolve_category_client_falls_back_to_preset_base_url`
- `resolve_category_client_preset_fallback_is_inert_for_single_slot_provider`
- `issue_993_user_config_hydration_routes_reaction_to_slot_8001`

The fourth test replays the user's actual parish.toml from #993
(vllm-mlx base + `[category_overrides.intent]` only) and asserts the
Reaction client resolves to `:8001` with the 1.5B model.

Proof bundle: docs/proofs/993-residual-reaction-404/
- acceptance-criteria.md (5 criteria, written first)
- transcript.txt (live headless CLI run; 0 occurrences of "404")
- evidence.md, judge.md (verdict: sufficient; AC: met)

Workspace: 2804 tests pass; clippy + fmt clean; just agent-check passes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 18, 2026
* fix(reactions): wire emoji-monoculture sensor + temp=1.0 (#995)

The diversity detector landed in #990 but had no runtime call site —
it was reachable only from tests. This change wires it to a
session-scoped capacity-8 ring buffer on `NpcManager` and feeds the
buffer from every reaction-emit path (CLI headless, CLI script
harness, web server, Tauri). A debounced `tracing::warn!` fires once
the detector reports `Some(QualityIssue)`; the flag clears when
diversity recovers so future crossings re-arm the sensor.

The same buffer is exposed via a new `/debug reactions` CLI
subcommand so live play-tests can verify wiring without depending
on log-file access.

`infer_player_message_reaction` now passes an explicit
`temperature = 1.0` (was `None`) so small-model reaction sampling
explores beyond the most-likely-safe `🤔` / `😊` collapse the issue
describes. The 80-token output cap from #984 is preserved; the
output schema is a closed palette so widening the distribution
cannot break correctness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address gemini review on record_reaction_emoji

- Replace while loop with if when popping front of the ring buffer
  (push_back adds at most one element per call, so a single pop is
  sufficient).
- Early-return before allocating the snapshot Vec<&str> when the
  buffer is below the detector's minimum sample threshold.
- Expose DEFAULT_EMOJI_MIN_SAMPLES from quality.rs so the threshold
  lives next to the detector that owns it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dmooney added a commit that referenced this pull request May 18, 2026
…ing (#994) (#1007)

* fix(bench): mirror runtime tier-1 prompt in dialogue eval (#994)

The rundale-bench dialogue sampler used a hardcoded 280-char `DIALOGUE_SYS`
that did not reflect the runtime tier-1 grounding that #990 added
(1820 fact preamble, anti-stage-Irish cultural guard, persona binding,
ga-IE phrase whitelist, Latin-script guard). Result: prompt-quality work
showed no measurable lift in the bench because the bench wasn't running
the prompt under test.

This change adds `build_dialogue_system_prompt()` to `eval_lib.py` that
reads `mods/rundale/prompts/tier1_system.txt`, substitutes the bench's
Brigid O'Brien persona slots, and appends a Python mirror of
`parish_npc::language_directive` for en-IE/ga-IE. The five scripts that
previously hardcoded a Brigid prompt now share that helper:

- parish/testing/rundale-bench/cache_dialogue_replies.py
- parish/testing/rundale-bench/rundale_bench.py
- parish/testing/rundale-bench/bench_perf.py
- parish/scripts/local-eval/gen_samples.py
- parish/scripts/local-eval/gen_dlg.py

Resulting prompt grows from ~280 to ~3894 chars and now carries every
guardrail the live runtime applies, so future tier-1 prompt regressions
will surface on the leaderboard.

No runtime behaviour change. Past leaderboard scores will need
re-benching against the new prompt; that work is intentionally deferred
to a follow-up so this prompt change can be reviewed in isolation
(issue point #3).

Proof bundle: docs/proofs/994-bench-prompt-mirror/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(bench): rebench grok-4.3 under new prompt — grok-4.3 judge

Switched judge from mistral-large-2512 to x-ai/grok-4.3 at user
request (mistral too generous). Headline numbers re-stated against
the stricter judge.

Surfaced a bench-scoring artifact: the NEW prompt's runtime template
emits dialogue + `---` + JSON metadata, and the judge penalizes the
JSON scaffolding as anachronistic even though the runtime strips it
before showing the player. On `dialogue-0011` the same advice scores
8.4 with the OLD prompt's plain-text format and 3.4 with the NEW
prompt's templated format — +5-point swing on scaffolding the player
never sees. Bench should strip the metadata block before judging.

After stripping, grok-judge totals:

  OLD raw            8.09
  NEW raw            7.45   (scaffolding penalty)
  NEW stripped       8.21   <-- apples-to-apples
  delta vs OLD      +0.12

Per-axis (NEW stripped vs OLD):
  character      -0.40
  authenticity   -0.34
  language       +0.87   <- GA_IE whitelist + Latin-script guard
  responsiveness +0.60   <- persona binding + "react to" clause
  craft          -0.14

Both discriminative judges (mistral, grok) agree on sign; magnitude
inside N=15 noise floor.

Two follow-ups noted in rebench-findings.md:
  1. Bench should strip the metadata block before judging (today's
     +5-point swing is a measurement artifact, not a prompt regression).
  2. Reconcile the divergence between the mod template (---/JSON
     format) and the Rust builder (JSON-first format).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bench): strip runtime metadata envelope before judging

Bench was sending raw model output (dialogue + `---` + JSON metadata)
to the judge, but the runtime parses the metadata and shows the
player only the dialogue line. Grok-4.3 (as judge) penalised the
JSON scaffolding as anachronistic, depressing every NEW-prompt score
by ~+0.76 points.

Adds `grade.extract_dialogue_for_judging(reply)` that handles both
runtime envelope shapes:

  - Mod-template format: `<dialogue>\n---\n{...}` → text before `\n---`
  - Rust-builder JSON-first: `{"dialogue": "...", ...}` → parsed
    `.dialogue` field
  - Legacy plain text: pass-through

Falls through to verbatim return on malformed envelopes so the judge
scores broken output as the player would see it. Never raises.

Wired into every site that sends a reply to the judge:

  - grade.grade_dialogue
  - grade.grade_reaction
  - grade.grade_pairwise
  - score_multiaxis.score_one
  - rubric_lab.absolute_judge
  - rubric_lab.pairwise_judge

`_non_latin` still scans the full raw reply so non-Latin leaks inside
metadata are still flagged.

Tests: 10 new `test_extract_dialogue_*` cases plus a
SyntaxError-immune fall-through; 38/38 pass.

Rebench under grok-4.3 (same NEW cache as #994 bundle):

  pre-fix raw          7.45   <- scaffolding penalty
  manual-stripped      8.21   <- one-off compare in #994 bundle
  post-fix auto-strip  8.29   <- this fix

Recovers +0.84 of measurement artifact. NEW prompt now slightly
exceeds OLD on the same judge (8.29 vs 8.09 = +0.20) — first
apples-to-apples confirmation that the runtime tier-1 grounding
outperforms the pre-#994 hardcoded bench prompt.

Proof bundle: docs/proofs/994b-bench-strip-metadata/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bench): address review on extract_dialogue_for_judging

Four bot reviews on grade.py, all legit. Mirror runtime more fully so
malformed-output cases don't drift from `parish_npc::parse_npc_stream_response`.

1. Strip Markdown JSON fences (`` ```json … ``` ``) before envelope
   detection. Mirrors `parish_npc::strip_json_fence`. Without this,
   fenced replies (Anthropic) bypass extraction and ship raw scaffolding
   to the judge. (codex PRRT_kwDORqdnvs6CzukS)

2. Accept `---` at start of reply (no leading newline). Some models
   skip the dialogue line and emit only the metadata block; the runtime
   splits on `---` itself, so the bench must too. Regex now matches
   `(?:\n|^)[ \t]*---`. (codex PRRT_kwDORqdnvs6CzukN)

3. Recover the `"dialogue"` field from truncated JSON envelopes
   (max_tokens cutoff, network blip). Adds
   `_extract_dialogue_field_heuristic` mirroring the Rust runtime helper
   of the same name. Without this, max-tokens-truncated outputs were
   judged as raw envelope. (codex PRRT_kwDORqdnvs6CzukL)

4. Replace the hand-rolled brace/string tracker with
   `json.JSONDecoder().raw_decode()` — cleaner, tolerates trailing
   junk, drops 25 lines of state machine code. (gemini PRRT_kwDORqdnvs6Czu7H)

Tests: 8 new cases (start-of-string dash, fenced/bare-fenced JSON,
truncated recovery with and without escaped quotes, fenced+truncated
combo, trailing-junk JSON). Updated one existing test whose assertion
was the old hand-rolled fall-through behaviour. 46/46 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bench): tighten review issues on extract_dialogue + judge rubrics

Four more codex reviews, all legit, on the strip-metadata helper +
the rubrics it feeds:

1. **Judge rubrics still said 1-3 sentences** while the new shared
   prompt instructs 2-4. Pre-#994 craft scores systematically
   penalised any reply >3 sentences. Updated `judge_v1.json`,
   `judge_pairwise_v1.json`, and `score_multiaxis.DEFAULT_RUBRIC`
   to 2-4. Recomputed `rubric_sha256` on both pinned configs.
   (codex PRRT_kwDORqdnvs6C0CIU)

2. **Truncated-JSON recovery was too permissive.** Used
   `re.search` so a `dialogue` field anywhere in malformed input was
   recovered; runtime only recovers when `dialogue` is the leading
   key. Anchored the regex with `^\s*\{?\s*` and switched to
   `re.match`. (codex PRRT_kwDORqdnvs6C0CIZ)

3. **Fence stripping was case-insensitive** while the runtime
   `parish_npc::strip_json_fence` is case-sensitive lowercase. An
   uppercase ```JSON wrapper was being unwrapped by bench but kept
   raw by runtime. Dropped `re.IGNORECASE`. (codex PRRT_kwDORqdnvs6C0CIe)

4. **`---` delimiter required trailing newline.** Runtime splits on
   bare `---` anywhere (`parish-core::game_session::cleaned`,
   `parish-npc::reactions::arrival_reactions`), so inline forms like
   `dialogue---{action:…}` were not stripped by bench. Replaced the
   regex with `stripped.find("---")`. (codex PRRT_kwDORqdnvs6C0CIf)

Tests: 3 new cases (inline `---{json}`, uppercase fence fall-through,
non-leading `dialogue` key fall-through). 49/49 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bench): try JSON-first parse before --- split

Codex review PRRT_kwDORqdnvs6C0Je2: a valid JSON reply whose dialogue
text contains `---` (em-dash spelled out, e.g.
`{"dialogue":"Well---maybe",...}`) was being mangled into a partial
envelope by the delimiter split that ran first.

Runtime `parish_npc::parse_npc_stream_response` parses JSON before
falling back to other strategies. Bench now matches that order:

  1. fence strip
  2. JSON-first parse (raw_decode)
  3. truncated-JSON heuristic
  4. bare `---` split (mod-template format)
  5. verbatim fallback

JSON dialogue containing `---` survives intact; the `---` split is
only reached for non-JSON envelopes that don't contain a parseable
`{` prefix.

Test: `test_extract_dialogue_json_dialogue_contains_em_dash`. 50/50 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bench): match runtime JSON-parse semantics exactly

Two more codex reviews on `extract_dialogue_for_judging`:

1. **Missing dialogue field returns "" not raw envelope.** Runtime
   `NpcJsonResponse` uses `#[serde(default)]`, so a parseable JSON
   envelope without a `dialogue` key surfaces as empty dialogue to
   the player. Bench was falling through and judging the raw
   envelope. Same for non-string dialogue values (null, number).
   (codex PRRT_kwDORqdnvs6C0Zc8)

2. **`json.loads` instead of `raw_decode`.** Runtime
   `serde_json::from_str` requires the entire input to be consumed
   (modulo surrounding whitespace) and rejects trailing junk;
   `raw_decode` would accept it, drifting from runtime semantics.
   Switched to `loads`. Trailing-junk replies still work because
   the truncated-JSON heuristic kicks in next and recovers the
   leading `"dialogue"` field — same chain runtime follows.
   (codex PRRT_kwDORqdnvs6C0ZdL)

Tests: 3 new cases (missing field → "", non-string field → "",
`{}` → ""). Updated existing trailing-junk test comment to reflect
the new fall-through path. 52/52 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant