refactor(scoring): harden evaluator paths in MAX, scoring, and Ouroboros #96
Merged
…e combined regression check
Three judgment-layer fixes from a deep-research audit of patina's evaluator paths:
1. MAX mode no longer reverse-extracts the original via `prompt.split('## Input Text')`.
The CLI now passes `sourceText` to `runMaxMode`, removing a brittle dependency
on prompt-builder's internal headers.
2. `extractJson` is replaced with `parseStrictJson` plus a `callAndParseJson` wrapper
that retries once at temperature 0 on schema failure and surfaces
`error: 'schema-failure'` instead of silently returning null. Failures are
logged to stderr so partial breakage is visible in operation.
3. Ouroboros regression detection now compares `combinedScore` (AI-likeness +
inverted fidelity, profile-weighted) instead of raw AI-likeness delta.
Original is treated as fidelity=100 to give iteration 1 a valid baseline.
`target-score` semantics unchanged for backward compatibility.
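
The strict-parse-and-retry behaviour in item 2 could look roughly like the sketch below. Only the names `parseStrictJson` and `callAndParseJson` and the `error: 'schema-failure'` value come from this PR; the signatures, the injected `callModel` function, the `requiredNumberFields` parameter, the default temperature, and the log wording are assumptions made for illustration, not the merged code.

```javascript
// Hypothetical sketch: names mirror the PR text, bodies and signatures are assumptions.

// Parse the whole response as JSON and check that required numeric fields exist.
// Returns { ok: true, value } or { ok: false, reason }.
function parseStrictJson(raw, requiredNumberFields) {
  let value;
  try {
    value = JSON.parse(raw.trim());
  } catch (err) {
    return { ok: false, reason: `invalid JSON: ${err.message}` };
  }
  if (value === null || typeof value !== 'object' || Array.isArray(value)) {
    return { ok: false, reason: 'top-level value is not an object' };
  }
  for (const field of requiredNumberFields) {
    if (typeof value[field] !== 'number' || !Number.isFinite(value[field])) {
      return { ok: false, reason: `missing or non-numeric field "${field}"` };
    }
  }
  return { ok: true, value };
}

// callModel is an injected async function: (prompt, { temperature }) => string.
// On a schema failure, retry exactly once at temperature 0, then give up loudly.
async function callAndParseJson(callModel, prompt, requiredNumberFields, temperature = 0.7) {
  let parsed = parseStrictJson(await callModel(prompt, { temperature }), requiredNumberFields);
  if (!parsed.ok) {
    console.error(`[scoring] schema failure, retrying at temperature 0: ${parsed.reason}`);
    parsed = parseStrictJson(await callModel(prompt, { temperature: 0 }), requiredNumberFields);
  }
  if (!parsed.ok) {
    console.error(`[scoring] schema failure after retry: ${parsed.reason}`);
    return { error: 'schema-failure', reason: parsed.reason };
  }
  return { result: parsed.value };
}
```

Injecting `callModel` keeps the sketch testable without a live model; callers can then branch on `error` instead of guessing what a bare `null` meant.
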
All 69 existing e2e tests pass.
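
To illustrate why item 3 compares `combinedScore` rather than raw AI-likeness, here is a minimal sketch. It assumes both scores are on a 0-100 scale, that a lower combined score is better, and that the profile supplies two weights; the weight values, the field names, and the helpers `baselineFor` and `isRegression` are hypothetical and are not taken from `src/ouroboros.js`.

```javascript
// Hypothetical sketch: weights, scales, and field names are assumptions.

// Both inputs are assumed to be on a 0-100 scale. Lower combined score is better:
// low AI-likeness and high fidelity, so fidelity is inverted as (100 - fidelity).
function combinedScore({ aiLikeness, fidelity }, weights = { aiLikeness: 0.6, fidelity: 0.4 }) {
  return weights.aiLikeness * aiLikeness + weights.fidelity * (100 - fidelity);
}

// The original text is treated as perfectly faithful to itself (fidelity = 100),
// which gives iteration 1 something valid to be compared against.
function baselineFor(originalAiLikeness) {
  return { aiLikeness: originalAiLikeness, fidelity: 100 };
}

// A regression is an iteration whose combined score is higher than the previous one.
function isRegression(previous, current, weights) {
  return combinedScore(current, weights) > combinedScore(previous, weights);
}

const original = baselineFor(80);
const iteration1 = { aiLikeness: 60, fidelity: 40 };
// AI-likeness improved (80 -> 60), but fidelity collapsed (100 -> 40).
console.log(isRegression(original, iteration1));
```

With these assumed weights, a raw AI-likeness delta would accept iteration 1, while the combined comparison flags it because the loss in fidelity outweighs the gain.
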
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 5, 2026
Summary
Three judgment-layer fixes from a deep-research audit of patina's evaluator paths. All P0 in scope, low-blast-radius, backward-compatible.
1. `runMaxMode` now accepts `sourceText` directly. Removes the brittle `prompt.split('## Input Text')[1]` reverse-extraction that silently breaks if prompt-builder's headers ever change.
2. `scoreText` / `scoreMPS` / `scoreFidelity` now use `parseStrictJson` + a single retry at temperature 0 on schema failure. Failures surface as `error: 'schema-failure'` and log to stderr instead of silently returning null.
3. Ouroboros regression detection now compares `combinedScore` delta (AI-likeness + inverted fidelity, profile-weighted) instead of raw AI-likeness delta. Original is treated as fidelity=100 for a valid iteration-1 baseline. `target-score` semantics unchanged.

Why
Audit findings, by file:line:
- `src/max-mode.js:33`: original recovered via prompt-string slicing
- `src/scoring.js:69,120`: schema failures returned `null`, hiding partial breakage
- `src/ouroboros.js:96-101`: `combinedScore` was computed but never used in decision logic

Test plan
- `npm test`: 69/69 e2e tests pass
- `node --check` on all modified files
- Legacy references (`extractJson`, `prompt.split`): none remain
- `--ouroboros` with intentional regression to verify combined-delta rollback path
- `--models` MAX path with a long source to confirm MPS evaluator receives original

Out of scope (follow-ups)
- `--api-key-stdin`, `--config`, manifest output

🤖 Generated with Claude Code