Skip to content

fix(bench): mirror runtime tier-1 prompt + strip metadata before judging (#994)#1007

Merged
dmooney merged 7 commits into
mainfrom
claude/brave-jackson-739f39
May 18, 2026
Merged

fix(bench): mirror runtime tier-1 prompt + strip metadata before judging (#994)#1007
dmooney merged 7 commits into
mainfrom
claude/brave-jackson-739f39

Conversation

@dmooney
Copy link
Copy Markdown
Owner

@dmooney dmooney commented May 18, 2026

Summary

Closes #994.

Three commits, eval-pipeline only, no runtime behaviour change:

  1. fix(bench): mirror runtime tier-1 prompt in dialogue eval
    replaces the hardcoded 4-line DIALOGUE_SYS in 5 bench scripts with
    a shared helper that reads mods/rundale/prompts/tier1_system.txt
    and appends a Python mirror of parish_npc::language_directive for
    en-IE / ga-IE. Prompt grows 280 → 3894 chars and now carries the
    four runtime tier-1 improvements (1820 fact preamble, anti-stage-
    Irish cultural guard, persona binding, GA_IE phrase whitelist +
    Latin-script guard).

  2. docs(bench): rebench grok-4.3 under new prompt — apples-to-
    apples re-judge of grok-4.3 under the OLD vs NEW bench prompt. Three
    judges tried; judge_v1 (qwen3-235b, 1-5) saturated for both
    prompts. Mistral-large too generous per user. Grok-4.3 selected.
    Surfaces a measurement artifact: bench sent raw model envelope to
    the judge but the runtime strips it before showing the player.

  3. fix(bench): strip runtime metadata envelope before judging
    adds grade.extract_dialogue_for_judging(reply) handling both
    runtime envelope formats (mod-template ---/JSON and Rust-builder
    JSON-first). Wired into all 6 judge-call sites in grade.py,
    score_multiaxis.py, and rubric_lab.py. 10 new unit tests; 38/38
    pass.

Quality lift (post-fix, grok-4.3 judge, n=15)

Same NEW grok-4.3 cache, same judge, before/after the strip fix:

Run Total
OLD-prompt baseline 8.09
NEW raw (pre-strip) 7.45
NEW manual-stripped 8.21
NEW auto-stripped 8.29

Per-axis NEW (auto-stripped) vs OLD:

Axis OLD NEW Δ
character 8.07 8.13 +0.06
authenticity 8.87 8.93 +0.06
language 7.73 8.40 +0.67
responsiveness 7.73 8.20 +0.47
craft 8.07 7.80 -0.27
total 8.09 8.29 +0.20

First apples-to-apples confirmation that the runtime tier-1 grounding
(landed in #990) outperforms the pre-#994 hardcoded bench prompt under
a discriminative judge. Lifts concentrate on language and
responsiveness — the axes the runtime upgrades were designed to lift.

Implication for past leaderboard rows

The leaderboard's pre-#994 rows used the old hardcoded prompt and have
no envelope, so their scores stand. Future NEW-prompt rows will be
fairly judged once this lands. Past judgments under judge_v1 (1-5
scale) are saturated for top-tier candidates and should be retired in
favour of the 0-10 multi-axis rubric; that retirement is out of scope
for this PR.

Out of scope

  • Reconciling the runtime divergence between
    parish_npc::build_tier1_system_prompt (Rust builder, JSON-first
    format) and mods/rundale/prompts/tier1_system.txt (mod template,
    ---+JSON format). The helper handles both. Aligning them is a
    separate task.
  • Re-baselining historic leaderboard rows beyond the noted retirement
    of judge_v1.

Proof bundles

  • docs/proofs/994-bench-prompt-mirror/ — prompt mirror + multi-judge
    rebench data
  • docs/proofs/994b-bench-strip-metadata/ — strip fix + post-fix
    rebench

Test plan

  • python3 parish/testing/rundale-bench/test_grade.py → 38/38 passed
  • just agent-check → passed (proof bundles + judge verdicts present)
  • All 5 bench scripts import cleanly and resolve DIALOGUE_SYS /
    SYSTEM to the 3894-char shared prompt
  • Live judge run: score_multiaxis against NEW grok-4.3 cache
    under grok-4.3 judge produces 8.29 (≥ 8.0 AC threshold)

🤖 Generated with Claude Code

dmooney and others added 3 commits May 18, 2026 07:02
The rundale-bench dialogue sampler used a hardcoded 280-char `DIALOGUE_SYS`
that did not reflect the runtime tier-1 grounding that #990 added
(1820 fact preamble, anti-stage-Irish cultural guard, persona binding,
ga-IE phrase whitelist, Latin-script guard). Result: prompt-quality work
showed no measurable lift in the bench because the bench wasn't running
the prompt under test.

This change adds `build_dialogue_system_prompt()` to `eval_lib.py` that
reads `mods/rundale/prompts/tier1_system.txt`, substitutes the bench's
Brigid O'Brien persona slots, and appends a Python mirror of
`parish_npc::language_directive` for en-IE/ga-IE. The five scripts that
previously hardcoded a Brigid prompt now share that helper:

- parish/testing/rundale-bench/cache_dialogue_replies.py
- parish/testing/rundale-bench/rundale_bench.py
- parish/testing/rundale-bench/bench_perf.py
- parish/scripts/local-eval/gen_samples.py
- parish/scripts/local-eval/gen_dlg.py

Resulting prompt grows from ~280 to ~3894 chars and now carries every
guardrail the live runtime applies, so future tier-1 prompt regressions
will surface on the leaderboard.

No runtime behaviour change. Past leaderboard scores will need
re-benching against the new prompt; that work is intentionally deferred
to a follow-up so this prompt change can be reviewed in isolation
(issue point #3).

Proof bundle: docs/proofs/994-bench-prompt-mirror/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switched judge from mistral-large-2512 to x-ai/grok-4.3 at user
request (mistral too generous). Headline numbers re-stated against
the stricter judge.

Surfaced a bench-scoring artifact: the NEW prompt's runtime template
emits dialogue + `---` + JSON metadata, and the judge penalizes the
JSON scaffolding as anachronistic even though the runtime strips it
before showing the player. On `dialogue-0011` the same advice scores
8.4 with the OLD prompt's plain-text format and 3.4 with the NEW
prompt's templated format — +5-point swing on scaffolding the player
never sees. Bench should strip the metadata block before judging.

After stripping, grok-judge totals:

  OLD raw            8.09
  NEW raw            7.45   (scaffolding penalty)
  NEW stripped       8.21   <-- apples-to-apples
  delta vs OLD      +0.12

Per-axis (NEW stripped vs OLD):
  character      -0.40
  authenticity   -0.34
  language       +0.87   <- GA_IE whitelist + Latin-script guard
  responsiveness +0.60   <- persona binding + "react to" clause
  craft          -0.14

Both discriminative judges (mistral, grok) agree on sign; magnitude
inside N=15 noise floor.

Two follow-ups noted in rebench-findings.md:
  1. Bench should strip the metadata block before judging (today's
     +5-point swing is a measurement artifact, not a prompt regression).
  2. Reconcile the divergence between the mod template (---/JSON
     format) and the Rust builder (JSON-first format).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bench was sending raw model output (dialogue + `---` + JSON metadata)
to the judge, but the runtime parses the metadata and shows the
player only the dialogue line. Grok-4.3 (as judge) penalised the
JSON scaffolding as anachronistic, depressing every NEW-prompt score
by ~+0.76 points.

Adds `grade.extract_dialogue_for_judging(reply)` that handles both
runtime envelope shapes:

  - Mod-template format: `<dialogue>\n---\n{...}` → text before `\n---`
  - Rust-builder JSON-first: `{"dialogue": "...", ...}` → parsed
    `.dialogue` field
  - Legacy plain text: pass-through

Falls through to verbatim return on malformed envelopes so the judge
scores broken output as the player would see it. Never raises.

Wired into every site that sends a reply to the judge:

  - grade.grade_dialogue
  - grade.grade_reaction
  - grade.grade_pairwise
  - score_multiaxis.score_one
  - rubric_lab.absolute_judge
  - rubric_lab.pairwise_judge

`_non_latin` still scans the full raw reply so non-Latin leaks inside
metadata are still flagged.

Tests: 10 new `test_extract_dialogue_*` cases plus a
SyntaxError-immune fall-through; 38/38 pass.

Rebench under grok-4.3 (same NEW cache as #994 bundle):

  pre-fix raw          7.45   <- scaffolding penalty
  manual-stripped      8.21   <- one-off compare in #994 bundle
  post-fix auto-strip  8.29   <- this fix

Recovers +0.84 of measurement artifact. NEW prompt now slightly
exceeds OLD on the same judge (8.29 vs 8.09 = +0.20) — first
apples-to-apples confirmation that the runtime tier-1 grounding
outperforms the pre-#994 hardcoded bench prompt.

Proof bundle: docs/proofs/994b-bench-strip-metadata/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🤖 Hi @dmooney, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2bca98f573

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread parish/testing/rundale-bench/grade.py Outdated
Comment thread parish/testing/rundale-bench/grade.py Outdated
Comment thread parish/testing/rundale-bench/grade.py Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the dialogue system prompt across the evaluation pipeline to mirror the runtime tier-1 grounding, ensuring that prompt-quality improvements are accurately reflected in the leaderboard. It introduces a shared helper, build_dialogue_system_prompt, which reads from a central template and appends the correct language directive, replacing hardcoded prompts in five different scripts. Additionally, it adds a metadata-stripping utility to ensure the judge evaluates only the dialogue visible to the player, preventing the JSON scaffolding from negatively impacting quality scores. I have kept the review comment regarding the JSON parsing logic as it provides a more robust and maintainable implementation.

Comment thread parish/testing/rundale-bench/grade.py Outdated
Four bot reviews on grade.py, all legit. Mirror runtime more fully so
malformed-output cases don't drift from `parish_npc::parse_npc_stream_response`.

1. Strip Markdown JSON fences (`` ```json … ``` ``) before envelope
   detection. Mirrors `parish_npc::strip_json_fence`. Without this,
   fenced replies (Anthropic) bypass extraction and ship raw scaffolding
   to the judge. (codex PRRT_kwDORqdnvs6CzukS)

2. Accept `---` at start of reply (no leading newline). Some models
   skip the dialogue line and emit only the metadata block; the runtime
   splits on `---` itself, so the bench must too. Regex now matches
   `(?:\n|^)[ \t]*---`. (codex PRRT_kwDORqdnvs6CzukN)

3. Recover the `"dialogue"` field from truncated JSON envelopes
   (max_tokens cutoff, network blip). Adds
   `_extract_dialogue_field_heuristic` mirroring the Rust runtime helper
   of the same name. Without this, max-tokens-truncated outputs were
   judged as raw envelope. (codex PRRT_kwDORqdnvs6CzukL)

4. Replace the hand-rolled brace/string tracker with
   `json.JSONDecoder().raw_decode()` — cleaner, tolerates trailing
   junk, drops 25 lines of state machine code. (gemini PRRT_kwDORqdnvs6Czu7H)

Tests: 8 new cases (start-of-string dash, fenced/bare-fenced JSON,
truncated recovery with and without escaped quotes, fenced+truncated
combo, trailing-junk JSON). Updated one existing test whose assertion
was the old hand-rolled fall-through behaviour. 46/46 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49726e4fca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread parish/testing/rundale-bench/rundale_bench.py
Comment thread parish/testing/rundale-bench/grade.py
Comment thread parish/testing/rundale-bench/grade.py Outdated
Comment thread parish/testing/rundale-bench/grade.py Outdated
Four more codex reviews, all legit, on the strip-metadata helper +
the rubrics it feeds:

1. **Judge rubrics still said 1-3 sentences** while the new shared
   prompt instructs 2-4. Pre-#994 craft scores systematically
   penalised any reply >3 sentences. Updated `judge_v1.json`,
   `judge_pairwise_v1.json`, and `score_multiaxis.DEFAULT_RUBRIC`
   to 2-4. Recomputed `rubric_sha256` on both pinned configs.
   (codex PRRT_kwDORqdnvs6C0CIU)

2. **Truncated-JSON recovery was too permissive.** Used
   `re.search` so a `dialogue` field anywhere in malformed input was
   recovered; runtime only recovers when `dialogue` is the leading
   key. Anchored the regex with `^\s*\{?\s*` and switched to
   `re.match`. (codex PRRT_kwDORqdnvs6C0CIZ)

3. **Fence stripping was case-insensitive** while the runtime
   `parish_npc::strip_json_fence` is case-sensitive lowercase. An
   uppercase ```JSON wrapper was being unwrapped by bench but kept
   raw by runtime. Dropped `re.IGNORECASE`. (codex PRRT_kwDORqdnvs6C0CIe)

4. **`---` delimiter required trailing newline.** Runtime splits on
   bare `---` anywhere (`parish-core::game_session::cleaned`,
   `parish-npc::reactions::arrival_reactions`), so inline forms like
   `dialogue---{action:…}` were not stripped by bench. Replaced the
   regex with `stripped.find("---")`. (codex PRRT_kwDORqdnvs6C0CIf)

Tests: 3 new cases (inline `---{json}`, uppercase fence fall-through,
non-leading `dialogue` key fall-through). 49/49 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfda4ca59f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread parish/testing/rundale-bench/grade.py Outdated
Codex review PRRT_kwDORqdnvs6C0Je2: a valid JSON reply whose dialogue
text contains `---` (em-dash spelled out, e.g.
`{"dialogue":"Well---maybe",...}`) was being mangled into a partial
envelope by the delimiter split that ran first.

Runtime `parish_npc::parse_npc_stream_response` parses JSON before
falling back to other strategies. Bench now matches that order:

  1. fence strip
  2. JSON-first parse (raw_decode)
  3. truncated-JSON heuristic
  4. bare `---` split (mod-template format)
  5. verbatim fallback

JSON dialogue containing `---` survives intact; the `---` split is
only reached for non-JSON envelopes that don't contain a parseable
`{` prefix.

Test: `test_extract_dialogue_json_dialogue_contains_em_dash`. 50/50 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 398b76c120

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread parish/testing/rundale-bench/grade.py Outdated
Comment thread parish/testing/rundale-bench/grade.py Outdated
Two more codex reviews on `extract_dialogue_for_judging`:

1. **Missing dialogue field returns "" not raw envelope.** Runtime
   `NpcJsonResponse` uses `#[serde(default)]`, so a parseable JSON
   envelope without a `dialogue` key surfaces as empty dialogue to
   the player. Bench was falling through and judging the raw
   envelope. Same for non-string dialogue values (null, number).
   (codex PRRT_kwDORqdnvs6C0Zc8)

2. **`json.loads` instead of `raw_decode`.** Runtime
   `serde_json::from_str` requires the entire input to be consumed
   (modulo surrounding whitespace) and rejects trailing junk;
   `raw_decode` would accept it, drifting from runtime semantics.
   Switched to `loads`. Trailing-junk replies still work because
   the truncated-JSON heuristic kicks in next and recovers the
   leading `"dialogue"` field — same chain runtime follows.
   (codex PRRT_kwDORqdnvs6C0ZdL)

Tests: 3 new cases (missing field → "", non-string field → "",
`{}` → ""). Updated existing trailing-junk test comment to reflect
the new fall-through path. 52/52 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dmooney dmooney merged commit a38a823 into main May 18, 2026
18 checks passed
@dmooney dmooney deleted the claude/brave-jackson-739f39 branch May 18, 2026 12:00
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20b55a2712

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +153 to +154
dialogue = obj.get("dialogue", "")
return dialogue if isinstance(dialogue, str) else ""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle non-string dialogue with runtime fallback

When a parsed JSON envelope has "dialogue" present but non-string (for example {"dialogue": null}), this branch returns "". In runtime, parse_npc_stream_response deserializes into NpcJsonResponse where dialogue is a String; #[serde(default)] covers missing fields but not wrong types, so that payload falls back to raw text rather than empty dialogue. Scoring these replies as empty text reintroduces bench/runtime drift for malformed-but-parseable outputs and can skew judge results.

Useful? React with 👍 / 👎.

# field is absent or non-string, return ``""`` — runtime's
# ``NpcJsonResponse`` uses ``#[serde(default)]``, so the player
# sees an empty dialogue rather than the raw envelope.
if stripped.startswith("{"):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Recover truncated dialogue even without opening brace

Gating JSON handling behind stripped.startswith("{") skips the truncated-JSON heuristic for malformed outputs like "dialogue": "... that omit the opening {. Runtime extract_dialogue_field_heuristic explicitly tolerates optional leading { and still recovers the player-visible dialogue in this case, so bench currently feeds raw scaffolding to judges where runtime would show dialogue, creating another bench/runtime mismatch on cutoff responses.

Useful? React with 👍 / 👎.

# envelope detection. Case-sensitive lowercase ```json (matching the
# runtime parser): uppercase variants fall through as legacy text so
# bench/runtime treat them identically.
_JSON_FENCE_RE = re.compile(r"^```(?:json)?\s*\n?")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep fence stripping semantics aligned with runtime prefixes

This regex accepts any whitespace after opening backticks, so inputs like ``` json\n{...}\n``` are unwrapped and then parsed as JSON. Runtime strip_json_fence only treats exact ```json or bare ``` prefixes as structured fences; ``` json falls through to raw-text handling. That means bench can extract clean dialogue from fenced outputs that runtime would not parse, reintroducing bench/runtime drift for malformed provider formatting.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rundale-bench: dialogue eval prompt is hardcoded and doesn't reflect runtime tier-1 grounding

1 participant