refactor(scoring): harden evaluator paths in MAX, scoring, and Ouroboros by devswha · Pull Request #96 · devswha/patina

devswha · 2026-05-05T19:12:47Z

Summary

Three judgment-layer fixes from a deep-research audit of patina's evaluator paths. All P0 in scope, low-blast-radius, backward-compatible.

MAX source text passing — runMaxMode now accepts sourceText directly. Removes the brittle prompt.split('## Input Text')[1] reverse-extraction that silently breaks if prompt-builder's headers ever change.
Strict JSON parsing with retry — scoreText / scoreMPS / scoreFidelity now use parseStrictJson + a single retry at temperature 0 on schema failure. Failures surface as error: 'schema-failure' and log to stderr instead of silently returning null.
Ouroboros regression on combined score — Iteration regression detection now uses combinedScore delta (AI-likeness + inverted fidelity, profile-weighted) instead of raw AI-likeness delta. Original is treated as fidelity=100 for a valid iteration-1 baseline. target-score semantics unchanged.

Why

Audit findings, by file:line:

src/max-mode.js:33 — original recovered via prompt-string slicing
src/scoring.js:69,120 — schema failures returned null, hiding partial breakage
src/ouroboros.js:96-101 — combinedScore was computed but never used in decision logic

Test plan

npm test — 69/69 e2e tests pass
node --check on all modified files
Grep for stale references (extractJson, prompt.split) — none remain
Manual smoke: --ouroboros with intentional regression to verify combined-delta rollback path
Manual smoke: --models MAX path with a long source to confirm MPS evaluator receives original

Out of scope (follow-ups)

zh/ja lexicon files
Phase 3 self-audit simplification (depends on this PR's MPS path)
Evaluator model separation (cost question)
--api-key-stdin, --config, manifest output

🤖 Generated with Claude Code

…e combined regression check Three judgment-layer fixes from a deep-research audit of patina's evaluator paths: 1. MAX mode no longer reverse-extracts the original via `prompt.split('## Input Text')`. The CLI now passes `sourceText` to `runMaxMode`, removing a brittle dependency on prompt-builder's internal headers. 2. `extractJson` is replaced with `parseStrictJson` + `callAndParseJson` wrapper that retries once at temperature 0 on schema failure and surfaces `error: 'schema-failure'` instead of silently returning null. Failures are logged to stderr so partial breakage is visible in operation. 3. Ouroboros regression detection now compares `combinedScore` (AI-likeness + inverted fidelity, profile-weighted) instead of raw AI-likeness delta. Original is treated as fidelity=100 to give iteration 1 a valid baseline. `target-score` semantics unchanged for backward compatibility. All 69 existing e2e tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…105) - Add .claude/ to the existing OpenClaw / Claude Code workspace section in .gitignore. The dir is local IDE state with no place in the repo. - Archive the GPT deep-research audit report under docs/audits/. The actionable findings already shipped (#96, #103) or are tracked (#104).

devswha merged commit d2d0aea into main May 5, 2026
3 checks passed

devswha deleted the refactor/scoring-robustness branch May 5, 2026 19:19

This was referenced May 5, 2026

refactor(prompt-builder): skip Phase 3 self-audit in Ouroboros mode #103

Merged

Add zh/ja AI-lexicon files for stylometry overlap detection #104

Open

chore: repo housekeeping — gitignore .claude, archive audit report #105

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(scoring): harden evaluator paths in MAX, scoring, and Ouroboros#96

refactor(scoring): harden evaluator paths in MAX, scoring, and Ouroboros#96
devswha merged 1 commit intomainfrom
refactor/scoring-robustness

devswha commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

devswha commented May 5, 2026

Summary

Why

Test plan

Out of scope (follow-ups)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant