fix(bench): mirror runtime tier-1 prompt + strip metadata before judging (#994) by dmooney · Pull Request #1007 · dmooney/Rundale

dmooney · 2026-05-18T11:03:48Z

Summary

Closes #994.

Three commits, eval-pipeline only, no runtime behaviour change:

fix(bench): mirror runtime tier-1 prompt in dialogue eval —
replaces the hardcoded 4-line DIALOGUE_SYS in 5 bench scripts with
a shared helper that reads mods/rundale/prompts/tier1_system.txt
and appends a Python mirror of parish_npc::language_directive for
en-IE / ga-IE. Prompt grows 280 → 3894 chars and now carries the
four runtime tier-1 improvements (1820 fact preamble, anti-stage-
Irish cultural guard, persona binding, GA_IE phrase whitelist +
Latin-script guard).
docs(bench): rebench grok-4.3 under new prompt — apples-to-
apples re-judge of grok-4.3 under the OLD vs NEW bench prompt. Three
judges tried; judge_v1 (qwen3-235b, 1-5) saturated for both
prompts. Mistral-large too generous per user. Grok-4.3 selected.
Surfaces a measurement artifact: bench sent raw model envelope to
the judge but the runtime strips it before showing the player.
fix(bench): strip runtime metadata envelope before judging —
adds grade.extract_dialogue_for_judging(reply) handling both
runtime envelope formats (mod-template ---/JSON and Rust-builder
JSON-first). Wired into all 6 judge-call sites in grade.py,
score_multiaxis.py, and rubric_lab.py. 10 new unit tests; 38/38
pass.

Quality lift (post-fix, grok-4.3 judge, n=15)

Same NEW grok-4.3 cache, same judge, before/after the strip fix:

Run	Total
OLD-prompt baseline	8.09
NEW raw (pre-strip)	7.45
NEW manual-stripped	8.21
NEW auto-stripped	8.29

Per-axis NEW (auto-stripped) vs OLD:

Axis	OLD	NEW	Δ
character	8.07	8.13	+0.06
authenticity	8.87	8.93	+0.06
language	7.73	8.40	+0.67
responsiveness	7.73	8.20	+0.47
craft	8.07	7.80	-0.27
total	8.09	8.29	+0.20

First apples-to-apples confirmation that the runtime tier-1 grounding
(landed in #990) outperforms the pre-#994 hardcoded bench prompt under
a discriminative judge. Lifts concentrate on language and
responsiveness — the axes the runtime upgrades were designed to lift.

Implication for past leaderboard rows

The leaderboard's pre-#994 rows used the old hardcoded prompt and have
no envelope, so their scores stand. Future NEW-prompt rows will be
fairly judged once this lands. Past judgments under judge_v1 (1-5
scale) are saturated for top-tier candidates and should be retired in
favour of the 0-10 multi-axis rubric; that retirement is out of scope
for this PR.

Out of scope

Reconciling the runtime divergence between
parish_npc::build_tier1_system_prompt (Rust builder, JSON-first
format) and mods/rundale/prompts/tier1_system.txt (mod template,
---+JSON format). The helper handles both. Aligning them is a
separate task.
Re-baselining historic leaderboard rows beyond the noted retirement
of judge_v1.

Proof bundles

docs/proofs/994-bench-prompt-mirror/ — prompt mirror + multi-judge
rebench data
docs/proofs/994b-bench-strip-metadata/ — strip fix + post-fix
rebench

Test plan

python3 parish/testing/rundale-bench/test_grade.py → 38/38 passed
just agent-check → passed (proof bundles + judge verdicts present)
All 5 bench scripts import cleanly and resolve DIALOGUE_SYS /
SYSTEM to the 3894-char shared prompt
Live judge run: score_multiaxis against NEW grok-4.3 cache
under grok-4.3 judge produces 8.29 (≥ 8.0 AC threshold)

🤖 Generated with Claude Code

The rundale-bench dialogue sampler used a hardcoded 280-char `DIALOGUE_SYS` that did not reflect the runtime tier-1 grounding that #990 added (1820 fact preamble, anti-stage-Irish cultural guard, persona binding, ga-IE phrase whitelist, Latin-script guard). Result: prompt-quality work showed no measurable lift in the bench because the bench wasn't running the prompt under test. This change adds `build_dialogue_system_prompt()` to `eval_lib.py` that reads `mods/rundale/prompts/tier1_system.txt`, substitutes the bench's Brigid O'Brien persona slots, and appends a Python mirror of `parish_npc::language_directive` for en-IE/ga-IE. The five scripts that previously hardcoded a Brigid prompt now share that helper: - parish/testing/rundale-bench/cache_dialogue_replies.py - parish/testing/rundale-bench/rundale_bench.py - parish/testing/rundale-bench/bench_perf.py - parish/scripts/local-eval/gen_samples.py - parish/scripts/local-eval/gen_dlg.py Resulting prompt grows from ~280 to ~3894 chars and now carries every guardrail the live runtime applies, so future tier-1 prompt regressions will surface on the leaderboard. No runtime behaviour change. Past leaderboard scores will need re-benching against the new prompt; that work is intentionally deferred to a follow-up so this prompt change can be reviewed in isolation (issue point #3). Proof bundle: docs/proofs/994-bench-prompt-mirror/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switched judge from mistral-large-2512 to x-ai/grok-4.3 at user request (mistral too generous). Headline numbers re-stated against the stricter judge. Surfaced a bench-scoring artifact: the NEW prompt's runtime template emits dialogue + `---` + JSON metadata, and the judge penalizes the JSON scaffolding as anachronistic even though the runtime strips it before showing the player. On `dialogue-0011` the same advice scores 8.4 with the OLD prompt's plain-text format and 3.4 with the NEW prompt's templated format — +5-point swing on scaffolding the player never sees. Bench should strip the metadata block before judging. After stripping, grok-judge totals: OLD raw 8.09 NEW raw 7.45 (scaffolding penalty) NEW stripped 8.21 <-- apples-to-apples delta vs OLD +0.12 Per-axis (NEW stripped vs OLD): character -0.40 authenticity -0.34 language +0.87 <- GA_IE whitelist + Latin-script guard responsiveness +0.60 <- persona binding + "react to" clause craft -0.14 Both discriminative judges (mistral, grok) agree on sign; magnitude inside N=15 noise floor. Two follow-ups noted in rebench-findings.md: 1. Bench should strip the metadata block before judging (today's +5-point swing is a measurement artifact, not a prompt regression). 2. Reconcile the divergence between the mod template (---/JSON format) and the Rust builder (JSON-first format). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bench was sending raw model output (dialogue + `---` + JSON metadata) to the judge, but the runtime parses the metadata and shows the player only the dialogue line. Grok-4.3 (as judge) penalised the JSON scaffolding as anachronistic, depressing every NEW-prompt score by ~+0.76 points. Adds `grade.extract_dialogue_for_judging(reply)` that handles both runtime envelope shapes: - Mod-template format: `<dialogue>\n---\n{...}` → text before `\n---` - Rust-builder JSON-first: `{"dialogue": "...", ...}` → parsed `.dialogue` field - Legacy plain text: pass-through Falls through to verbatim return on malformed envelopes so the judge scores broken output as the player would see it. Never raises. Wired into every site that sends a reply to the judge: - grade.grade_dialogue - grade.grade_reaction - grade.grade_pairwise - score_multiaxis.score_one - rubric_lab.absolute_judge - rubric_lab.pairwise_judge `_non_latin` still scans the full raw reply so non-Latin leaks inside metadata are still flagged. Tests: 10 new `test_extract_dialogue_*` cases plus a SyntaxError-immune fall-through; 38/38 pass. Rebench under grok-4.3 (same NEW cache as #994 bundle): pre-fix raw 7.45 <- scaffolding penalty manual-stripped 8.21 <- one-off compare in #994 bundle post-fix auto-strip 8.29 <- this fix Recovers +0.84 of measurement artifact. NEW prompt now slightly exceeds OLD on the same judge (8.29 vs 8.09 = +0.20) — first apples-to-apples confirmation that the runtime tier-1 grounding outperforms the pre-#994 hardcoded bench prompt. Proof bundle: docs/proofs/994b-bench-strip-metadata/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-18T11:03:58Z

🤖 Hi @dmooney, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2bca98f573

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

gemini-code-assist

Code Review

This pull request updates the dialogue system prompt across the evaluation pipeline to mirror the runtime tier-1 grounding, ensuring that prompt-quality improvements are accurately reflected in the leaderboard. It introduces a shared helper, build_dialogue_system_prompt, which reads from a central template and appends the correct language directive, replacing hardcoded prompts in five different scripts. Additionally, it adds a metadata-stripping utility to ensure the judge evaluates only the dialogue visible to the player, preventing the JSON scaffolding from negatively impacting quality scores. I have kept the review comment regarding the JSON parsing logic as it provides a more robust and maintainable implementation.

Four bot reviews on grade.py, all legit. Mirror runtime more fully so malformed-output cases don't drift from `parish_npc::parse_npc_stream_response`. 1. Strip Markdown JSON fences (`` ```json … ``` ``) before envelope detection. Mirrors `parish_npc::strip_json_fence`. Without this, fenced replies (Anthropic) bypass extraction and ship raw scaffolding to the judge. (codex PRRT_kwDORqdnvs6CzukS) 2. Accept `---` at start of reply (no leading newline). Some models skip the dialogue line and emit only the metadata block; the runtime splits on `---` itself, so the bench must too. Regex now matches `(?:\n|^)[ \t]*---`. (codex PRRT_kwDORqdnvs6CzukN) 3. Recover the `"dialogue"` field from truncated JSON envelopes (max_tokens cutoff, network blip). Adds `_extract_dialogue_field_heuristic` mirroring the Rust runtime helper of the same name. Without this, max-tokens-truncated outputs were judged as raw envelope. (codex PRRT_kwDORqdnvs6CzukL) 4. Replace the hand-rolled brace/string tracker with `json.JSONDecoder().raw_decode()` — cleaner, tolerates trailing junk, drops 25 lines of state machine code. (gemini PRRT_kwDORqdnvs6Czu7H) Tests: 8 new cases (start-of-string dash, fenced/bare-fenced JSON, truncated recovery with and without escaped quotes, fenced+truncated combo, trailing-junk JSON). Updated one existing test whose assertion was the old hand-rolled fall-through behaviour. 46/46 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49726e4fca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Four more codex reviews, all legit, on the strip-metadata helper + the rubrics it feeds: 1. **Judge rubrics still said 1-3 sentences** while the new shared prompt instructs 2-4. Pre-#994 craft scores systematically penalised any reply >3 sentences. Updated `judge_v1.json`, `judge_pairwise_v1.json`, and `score_multiaxis.DEFAULT_RUBRIC` to 2-4. Recomputed `rubric_sha256` on both pinned configs. (codex PRRT_kwDORqdnvs6C0CIU) 2. **Truncated-JSON recovery was too permissive.** Used `re.search` so a `dialogue` field anywhere in malformed input was recovered; runtime only recovers when `dialogue` is the leading key. Anchored the regex with `^\s*\{?\s*` and switched to `re.match`. (codex PRRT_kwDORqdnvs6C0CIZ) 3. **Fence stripping was case-insensitive** while the runtime `parish_npc::strip_json_fence` is case-sensitive lowercase. An uppercase ```JSON wrapper was being unwrapped by bench but kept raw by runtime. Dropped `re.IGNORECASE`. (codex PRRT_kwDORqdnvs6C0CIe) 4. **`---` delimiter required trailing newline.** Runtime splits on bare `---` anywhere (`parish-core::game_session::cleaned`, `parish-npc::reactions::arrival_reactions`), so inline forms like `dialogue---{action:…}` were not stripped by bench. Replaced the regex with `stripped.find("---")`. (codex PRRT_kwDORqdnvs6C0CIf) Tests: 3 new cases (inline `---{json}`, uppercase fence fall-through, non-leading `dialogue` key fall-through). 49/49 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfda4ca59f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Codex review PRRT_kwDORqdnvs6C0Je2: a valid JSON reply whose dialogue text contains `---` (em-dash spelled out, e.g. `{"dialogue":"Well---maybe",...}`) was being mangled into a partial envelope by the delimiter split that ran first. Runtime `parish_npc::parse_npc_stream_response` parses JSON before falling back to other strategies. Bench now matches that order: 1. fence strip 2. JSON-first parse (raw_decode) 3. truncated-JSON heuristic 4. bare `---` split (mod-template format) 5. verbatim fallback JSON dialogue containing `---` survives intact; the `---` split is only reached for non-JSON envelopes that don't contain a parseable `{` prefix. Test: `test_extract_dialogue_json_dialogue_contains_em_dash`. 50/50 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 398b76c120

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Two more codex reviews on `extract_dialogue_for_judging`: 1. **Missing dialogue field returns "" not raw envelope.** Runtime `NpcJsonResponse` uses `#[serde(default)]`, so a parseable JSON envelope without a `dialogue` key surfaces as empty dialogue to the player. Bench was falling through and judging the raw envelope. Same for non-string dialogue values (null, number). (codex PRRT_kwDORqdnvs6C0Zc8) 2. **`json.loads` instead of `raw_decode`.** Runtime `serde_json::from_str` requires the entire input to be consumed (modulo surrounding whitespace) and rejects trailing junk; `raw_decode` would accept it, drifting from runtime semantics. Switched to `loads`. Trailing-junk replies still work because the truncated-JSON heuristic kicks in next and recovers the leading `"dialogue"` field — same chain runtime follows. (codex PRRT_kwDORqdnvs6C0ZdL) Tests: 3 new cases (missing field → "", non-string field → "", `{}` → ""). Updated existing trailing-junk test comment to reflect the new fall-through path. 52/52 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20b55a2712

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-18T12:01:57Z

+            dialogue = obj.get("dialogue", "")
+            return dialogue if isinstance(dialogue, str) else ""


Handle non-string dialogue with runtime fallback

When a parsed JSON envelope has "dialogue" present but non-string (for example {"dialogue": null}), this branch returns "". In runtime, parse_npc_stream_response deserializes into NpcJsonResponse where dialogue is a String; #[serde(default)] covers missing fields but not wrong types, so that payload falls back to raw text rather than empty dialogue. Scoring these replies as empty text reintroduces bench/runtime drift for malformed-but-parseable outputs and can skew judge results.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-18T12:01:57Z

+    # field is absent or non-string, return ``""`` — runtime's
+    # ``NpcJsonResponse`` uses ``#[serde(default)]``, so the player
+    # sees an empty dialogue rather than the raw envelope.
+    if stripped.startswith("{"):


Recover truncated dialogue even without opening brace

Gating JSON handling behind stripped.startswith("{") skips the truncated-JSON heuristic for malformed outputs like "dialogue": "... that omit the opening {. Runtime extract_dialogue_field_heuristic explicitly tolerates optional leading { and still recovers the player-visible dialogue in this case, so bench currently feeds raw scaffolding to judges where runtime would show dialogue, creating another bench/runtime mismatch on cutoff responses.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-18T12:01:58Z

+# envelope detection. Case-sensitive lowercase ```json (matching the
+# runtime parser): uppercase variants fall through as legacy text so
+# bench/runtime treat them identically.
+_JSON_FENCE_RE = re.compile(r"^```(?:json)?\s*\n?")


Keep fence stripping semantics aligned with runtime prefixes

This regex accepts any whitespace after opening backticks, so inputs like ``` json\n{...}\n``` are unwrapped and then parsed as JSON. Runtime strip_json_fence only treats exact ```json or bare ``` prefixes as structured fences; ``` json falls through to raw-text handling. That means bench can extract clean dialogue from fenced outputs that runtime would not parse, reintroducing bench/runtime drift for malformed provider formatting.

Useful? React with 👍 / 👎.

dmooney and others added 3 commits May 18, 2026 07:02

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

Comment thread parish/testing/rundale-bench/grade.py Outdated

Comment thread parish/testing/rundale-bench/grade.py Outdated

Comment thread parish/testing/rundale-bench/grade.py Outdated

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

Comment thread parish/testing/rundale-bench/grade.py Outdated

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

Comment thread parish/testing/rundale-bench/rundale_bench.py

Comment thread parish/testing/rundale-bench/grade.py

Comment thread parish/testing/rundale-bench/grade.py Outdated

Comment thread parish/testing/rundale-bench/grade.py Outdated

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

Comment thread parish/testing/rundale-bench/grade.py Outdated

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

Comment thread parish/testing/rundale-bench/grade.py Outdated

Comment thread parish/testing/rundale-bench/grade.py Outdated

dmooney merged commit a38a823 into main May 18, 2026
18 checks passed

dmooney deleted the claude/brave-jackson-739f39 branch May 18, 2026 12:00

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

		dialogue = obj.get("dialogue", "")
		return dialogue if isinstance(dialogue, str) else ""

Conversation

dmooney commented May 18, 2026

Summary

Quality lift (post-fix, grok-4.3 judge, n=15)

Implication for past leaderboard rows

Out of scope

Proof bundles

Test plan

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant