fix(bench): mirror runtime tier-1 prompt + strip metadata before judging (#994)#1007
Conversation
The rundale-bench dialogue sampler used a hardcoded 280-char `DIALOGUE_SYS` that did not reflect the runtime tier-1 grounding that #990 added (1820 fact preamble, anti-stage-Irish cultural guard, persona binding, ga-IE phrase whitelist, Latin-script guard). Result: prompt-quality work showed no measurable lift in the bench because the bench wasn't running the prompt under test. This change adds `build_dialogue_system_prompt()` to `eval_lib.py` that reads `mods/rundale/prompts/tier1_system.txt`, substitutes the bench's Brigid O'Brien persona slots, and appends a Python mirror of `parish_npc::language_directive` for en-IE/ga-IE. The five scripts that previously hardcoded a Brigid prompt now share that helper: - parish/testing/rundale-bench/cache_dialogue_replies.py - parish/testing/rundale-bench/rundale_bench.py - parish/testing/rundale-bench/bench_perf.py - parish/scripts/local-eval/gen_samples.py - parish/scripts/local-eval/gen_dlg.py Resulting prompt grows from ~280 to ~3894 chars and now carries every guardrail the live runtime applies, so future tier-1 prompt regressions will surface on the leaderboard. No runtime behaviour change. Past leaderboard scores will need re-benching against the new prompt; that work is intentionally deferred to a follow-up so this prompt change can be reviewed in isolation (issue point #3). Proof bundle: docs/proofs/994-bench-prompt-mirror/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switched judge from mistral-large-2512 to x-ai/grok-4.3 at user
request (mistral too generous). Headline numbers re-stated against
the stricter judge.
Surfaced a bench-scoring artifact: the NEW prompt's runtime template
emits dialogue + `---` + JSON metadata, and the judge penalizes the
JSON scaffolding as anachronistic even though the runtime strips it
before showing the player. On `dialogue-0011` the same advice scores
8.4 with the OLD prompt's plain-text format and 3.4 with the NEW
prompt's templated format — +5-point swing on scaffolding the player
never sees. Bench should strip the metadata block before judging.
After stripping, grok-judge totals:
OLD raw 8.09
NEW raw 7.45 (scaffolding penalty)
NEW stripped 8.21 <-- apples-to-apples
delta vs OLD +0.12
Per-axis (NEW stripped vs OLD):
character -0.40
authenticity -0.34
language +0.87 <- GA_IE whitelist + Latin-script guard
responsiveness +0.60 <- persona binding + "react to" clause
craft -0.14
Both discriminative judges (mistral, grok) agree on sign; magnitude
inside N=15 noise floor.
Two follow-ups noted in rebench-findings.md:
1. Bench should strip the metadata block before judging (today's
+5-point swing is a measurement artifact, not a prompt regression).
2. Reconcile the divergence between the mod template (---/JSON
format) and the Rust builder (JSON-first format).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bench was sending raw model output (dialogue + `---` + JSON metadata)
to the judge, but the runtime parses the metadata and shows the
player only the dialogue line. Grok-4.3 (as judge) penalised the
JSON scaffolding as anachronistic, depressing every NEW-prompt score
by ~+0.76 points.
Adds `grade.extract_dialogue_for_judging(reply)` that handles both
runtime envelope shapes:
- Mod-template format: `<dialogue>\n---\n{...}` → text before `\n---`
- Rust-builder JSON-first: `{"dialogue": "...", ...}` → parsed
`.dialogue` field
- Legacy plain text: pass-through
Falls through to verbatim return on malformed envelopes so the judge
scores broken output as the player would see it. Never raises.
Wired into every site that sends a reply to the judge:
- grade.grade_dialogue
- grade.grade_reaction
- grade.grade_pairwise
- score_multiaxis.score_one
- rubric_lab.absolute_judge
- rubric_lab.pairwise_judge
`_non_latin` still scans the full raw reply so non-Latin leaks inside
metadata are still flagged.
Tests: 10 new `test_extract_dialogue_*` cases plus a
SyntaxError-immune fall-through; 38/38 pass.
Rebench under grok-4.3 (same NEW cache as #994 bundle):
pre-fix raw 7.45 <- scaffolding penalty
manual-stripped 8.21 <- one-off compare in #994 bundle
post-fix auto-strip 8.29 <- this fix
Recovers +0.84 of measurement artifact. NEW prompt now slightly
exceeds OLD on the same judge (8.29 vs 8.09 = +0.20) — first
apples-to-apples confirmation that the runtime tier-1 grounding
outperforms the pre-#994 hardcoded bench prompt.
Proof bundle: docs/proofs/994b-bench-strip-metadata/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🤖 Hi @dmooney, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2bca98f573
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Code Review
This pull request updates the dialogue system prompt across the evaluation pipeline to mirror the runtime tier-1 grounding, ensuring that prompt-quality improvements are accurately reflected in the leaderboard. It introduces a shared helper, build_dialogue_system_prompt, which reads from a central template and appends the correct language directive, replacing hardcoded prompts in five different scripts. Additionally, it adds a metadata-stripping utility to ensure the judge evaluates only the dialogue visible to the player, preventing the JSON scaffolding from negatively impacting quality scores. I have kept the review comment regarding the JSON parsing logic as it provides a more robust and maintainable implementation.
Four bot reviews on grade.py, all legit. Mirror runtime more fully so malformed-output cases don't drift from `parish_npc::parse_npc_stream_response`. 1. Strip Markdown JSON fences (`` ```json … ``` ``) before envelope detection. Mirrors `parish_npc::strip_json_fence`. Without this, fenced replies (Anthropic) bypass extraction and ship raw scaffolding to the judge. (codex PRRT_kwDORqdnvs6CzukS) 2. Accept `---` at start of reply (no leading newline). Some models skip the dialogue line and emit only the metadata block; the runtime splits on `---` itself, so the bench must too. Regex now matches `(?:\n|^)[ \t]*---`. (codex PRRT_kwDORqdnvs6CzukN) 3. Recover the `"dialogue"` field from truncated JSON envelopes (max_tokens cutoff, network blip). Adds `_extract_dialogue_field_heuristic` mirroring the Rust runtime helper of the same name. Without this, max-tokens-truncated outputs were judged as raw envelope. (codex PRRT_kwDORqdnvs6CzukL) 4. Replace the hand-rolled brace/string tracker with `json.JSONDecoder().raw_decode()` — cleaner, tolerates trailing junk, drops 25 lines of state machine code. (gemini PRRT_kwDORqdnvs6Czu7H) Tests: 8 new cases (start-of-string dash, fenced/bare-fenced JSON, truncated recovery with and without escaped quotes, fenced+truncated combo, trailing-junk JSON). Updated one existing test whose assertion was the old hand-rolled fall-through behaviour. 46/46 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49726e4fca
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Four more codex reviews, all legit, on the strip-metadata helper + the rubrics it feeds: 1. **Judge rubrics still said 1-3 sentences** while the new shared prompt instructs 2-4. Pre-#994 craft scores systematically penalised any reply >3 sentences. Updated `judge_v1.json`, `judge_pairwise_v1.json`, and `score_multiaxis.DEFAULT_RUBRIC` to 2-4. Recomputed `rubric_sha256` on both pinned configs. (codex PRRT_kwDORqdnvs6C0CIU) 2. **Truncated-JSON recovery was too permissive.** Used `re.search` so a `dialogue` field anywhere in malformed input was recovered; runtime only recovers when `dialogue` is the leading key. Anchored the regex with `^\s*\{?\s*` and switched to `re.match`. (codex PRRT_kwDORqdnvs6C0CIZ) 3. **Fence stripping was case-insensitive** while the runtime `parish_npc::strip_json_fence` is case-sensitive lowercase. An uppercase ```JSON wrapper was being unwrapped by bench but kept raw by runtime. Dropped `re.IGNORECASE`. (codex PRRT_kwDORqdnvs6C0CIe) 4. **`---` delimiter required trailing newline.** Runtime splits on bare `---` anywhere (`parish-core::game_session::cleaned`, `parish-npc::reactions::arrival_reactions`), so inline forms like `dialogue---{action:…}` were not stripped by bench. Replaced the regex with `stripped.find("---")`. (codex PRRT_kwDORqdnvs6C0CIf) Tests: 3 new cases (inline `---{json}`, uppercase fence fall-through, non-leading `dialogue` key fall-through). 49/49 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bfda4ca59f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex review PRRT_kwDORqdnvs6C0Je2: a valid JSON reply whose dialogue
text contains `---` (em-dash spelled out, e.g.
`{"dialogue":"Well---maybe",...}`) was being mangled into a partial
envelope by the delimiter split that ran first.
Runtime `parish_npc::parse_npc_stream_response` parses JSON before
falling back to other strategies. Bench now matches that order:
1. fence strip
2. JSON-first parse (raw_decode)
3. truncated-JSON heuristic
4. bare `---` split (mod-template format)
5. verbatim fallback
JSON dialogue containing `---` survives intact; the `---` split is
only reached for non-JSON envelopes that don't contain a parseable
`{` prefix.
Test: `test_extract_dialogue_json_dialogue_contains_em_dash`. 50/50 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 398b76c120
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Two more codex reviews on `extract_dialogue_for_judging`:
1. **Missing dialogue field returns "" not raw envelope.** Runtime
`NpcJsonResponse` uses `#[serde(default)]`, so a parseable JSON
envelope without a `dialogue` key surfaces as empty dialogue to
the player. Bench was falling through and judging the raw
envelope. Same for non-string dialogue values (null, number).
(codex PRRT_kwDORqdnvs6C0Zc8)
2. **`json.loads` instead of `raw_decode`.** Runtime
`serde_json::from_str` requires the entire input to be consumed
(modulo surrounding whitespace) and rejects trailing junk;
`raw_decode` would accept it, drifting from runtime semantics.
Switched to `loads`. Trailing-junk replies still work because
the truncated-JSON heuristic kicks in next and recovers the
leading `"dialogue"` field — same chain runtime follows.
(codex PRRT_kwDORqdnvs6C0ZdL)
Tests: 3 new cases (missing field → "", non-string field → "",
`{}` → ""). Updated existing trailing-junk test comment to reflect
the new fall-through path. 52/52 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 20b55a2712
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| dialogue = obj.get("dialogue", "") | ||
| return dialogue if isinstance(dialogue, str) else "" |
There was a problem hiding this comment.
Handle non-string
dialogue with runtime fallback
When a parsed JSON envelope has "dialogue" present but non-string (for example {"dialogue": null}), this branch returns "". In runtime, parse_npc_stream_response deserializes into NpcJsonResponse where dialogue is a String; #[serde(default)] covers missing fields but not wrong types, so that payload falls back to raw text rather than empty dialogue. Scoring these replies as empty text reintroduces bench/runtime drift for malformed-but-parseable outputs and can skew judge results.
Useful? React with 👍 / 👎.
| # field is absent or non-string, return ``""`` — runtime's | ||
| # ``NpcJsonResponse`` uses ``#[serde(default)]``, so the player | ||
| # sees an empty dialogue rather than the raw envelope. | ||
| if stripped.startswith("{"): |
There was a problem hiding this comment.
Recover truncated dialogue even without opening brace
Gating JSON handling behind stripped.startswith("{") skips the truncated-JSON heuristic for malformed outputs like "dialogue": "... that omit the opening {. Runtime extract_dialogue_field_heuristic explicitly tolerates optional leading { and still recovers the player-visible dialogue in this case, so bench currently feeds raw scaffolding to judges where runtime would show dialogue, creating another bench/runtime mismatch on cutoff responses.
Useful? React with 👍 / 👎.
| # envelope detection. Case-sensitive lowercase ```json (matching the | ||
| # runtime parser): uppercase variants fall through as legacy text so | ||
| # bench/runtime treat them identically. | ||
| _JSON_FENCE_RE = re.compile(r"^```(?:json)?\s*\n?") |
There was a problem hiding this comment.
Keep fence stripping semantics aligned with runtime prefixes
This regex accepts any whitespace after opening backticks, so inputs like ``` json\n{...}\n``` are unwrapped and then parsed as JSON. Runtime strip_json_fence only treats exact ```json or bare ``` prefixes as structured fences; ``` json falls through to raw-text handling. That means bench can extract clean dialogue from fenced outputs that runtime would not parse, reintroducing bench/runtime drift for malformed provider formatting.
Useful? React with 👍 / 👎.
Summary
Closes #994.
Three commits, eval-pipeline only, no runtime behaviour change:
fix(bench): mirror runtime tier-1 prompt in dialogue eval—replaces the hardcoded 4-line
DIALOGUE_SYSin 5 bench scripts witha shared helper that reads
mods/rundale/prompts/tier1_system.txtand appends a Python mirror of
parish_npc::language_directiveforen-IE / ga-IE. Prompt grows 280 → 3894 chars and now carries the
four runtime tier-1 improvements (1820 fact preamble, anti-stage-
Irish cultural guard, persona binding, GA_IE phrase whitelist +
Latin-script guard).
docs(bench): rebench grok-4.3 under new prompt— apples-to-apples re-judge of grok-4.3 under the OLD vs NEW bench prompt. Three
judges tried;
judge_v1(qwen3-235b, 1-5) saturated for bothprompts. Mistral-large too generous per user. Grok-4.3 selected.
Surfaces a measurement artifact: bench sent raw model envelope to
the judge but the runtime strips it before showing the player.
fix(bench): strip runtime metadata envelope before judging—adds
grade.extract_dialogue_for_judging(reply)handling bothruntime envelope formats (mod-template
---/JSON and Rust-builderJSON-first). Wired into all 6 judge-call sites in
grade.py,score_multiaxis.py, andrubric_lab.py. 10 new unit tests; 38/38pass.
Quality lift (post-fix, grok-4.3 judge, n=15)
Same NEW grok-4.3 cache, same judge, before/after the strip fix:
Per-axis NEW (auto-stripped) vs OLD:
First apples-to-apples confirmation that the runtime tier-1 grounding
(landed in #990) outperforms the pre-#994 hardcoded bench prompt under
a discriminative judge. Lifts concentrate on language and
responsiveness — the axes the runtime upgrades were designed to lift.
Implication for past leaderboard rows
The leaderboard's pre-#994 rows used the old hardcoded prompt and have
no envelope, so their scores stand. Future NEW-prompt rows will be
fairly judged once this lands. Past judgments under judge_v1 (1-5
scale) are saturated for top-tier candidates and should be retired in
favour of the 0-10 multi-axis rubric; that retirement is out of scope
for this PR.
Out of scope
parish_npc::build_tier1_system_prompt(Rust builder, JSON-firstformat) and
mods/rundale/prompts/tier1_system.txt(mod template,---+JSON format). The helper handles both. Aligning them is aseparate task.
of judge_v1.
Proof bundles
docs/proofs/994-bench-prompt-mirror/— prompt mirror + multi-judgerebench data
docs/proofs/994b-bench-strip-metadata/— strip fix + post-fixrebench
Test plan
python3 parish/testing/rundale-bench/test_grade.py→ 38/38 passedjust agent-check→ passed (proof bundles + judge verdicts present)DIALOGUE_SYS/SYSTEMto the 3894-char shared promptscore_multiaxisagainst NEW grok-4.3 cacheunder grok-4.3 judge produces 8.29 (≥ 8.0 AC threshold)
🤖 Generated with Claude Code