feat: harden pointwise model-judge reporting #17

Merged
aryeko merged 1 commit into main from feat/model-judge-hardening-v018 on Jul 4, 2026

@aryeko aryeko commented Jul 4, 2026

Summary

  • add generic pointwise verdict summary and calibration-report helpers
  • fail closed on pointwise judge result/provider/prompt/rubric mismatches and required run metadata
  • harden pointwise artifact/output path validation against absolute, escaping, whitespace, and non-normalized paths
  • update docs, skills, changelog, and version metadata for v0.1.8
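
The path hardening described in the summary can be pictured with a small sketch. This is not the eval-kit implementation: the helper name `assertSafeRelativePath` and the exact rules are assumptions chosen for illustration. In particular, it treats "whitespace" as leading or trailing whitespace and treats any backslash as invalid.

```javascript
// Hypothetical helper: the name and the exact rules are assumptions,
// not the eval-kit API. It accepts only normalized, forward-slash,
// relative paths that stay inside the bundle directory.
function assertSafeRelativePath(candidate) {
  if (typeof candidate !== "string" || candidate.length === 0) {
    throw new Error("path must be a non-empty string");
  }
  // Assumption: "whitespace" means leading or trailing whitespace.
  if (candidate !== candidate.trim()) {
    throw new Error(`path has surrounding whitespace: ${JSON.stringify(candidate)}`);
  }
  // Reject backslashes so Windows-style separators cannot slip through.
  if (candidate.includes("\\")) {
    throw new Error(`path must use forward slashes: ${candidate}`);
  }
  // Reject POSIX absolute paths and drive-letter prefixes such as "C:".
  if (candidate.startsWith("/") || /^[A-Za-z]:/.test(candidate)) {
    throw new Error(`path must be relative: ${candidate}`);
  }
  // Empty, "." and ".." segments mean the path is non-normalized or escaping.
  for (const segment of candidate.split("/")) {
    if (segment === "" || segment === "." || segment === "..") {
      throw new Error(`path is not normalized or escapes the bundle: ${candidate}`);
    }
  }
  return candidate;
}
```

Rejecting a suspicious path outright, rather than normalizing it, matches the fail-closed stance of the rest of the change: a caller that passes `a/../b` gets an error instead of a silently rewritten path.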

Boundaries

  • eval-kit mechanics and generic policy only
  • no consumer semantics, prompts, fixtures, or configs changed
  • no provider-backed evals run

Verification

  • pnpm format
  • pnpm test -- tests/pointwise.test.mjs
  • pnpm install --frozen-lockfile
  • pnpm check
  • git diff --check

Independent pre-PR review approved after two fix loops for pointwise artifact-path validation.

@aryeko aryeko merged commit 5c72926 into main Jul 4, 2026
1 check passed
@aryeko aryeko deleted the feat/model-judge-hardening-v018 branch July 4, 2026 03:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 556a852600

Comment thread src/sdk.mjs
-  for (const item of finalResult.items) {
-    counts[item.verdict] = (counts[item.verdict] ?? 0) + 1;
-  }
+  const counts = countPointwiseVerdicts(finalResult.items);

P2: Validate pointwise verdicts before writing artifacts

When an adapter's canonicalizeExpectedItemMetadata returns a misspelled or otherwise unrecognized verdict, this new helper throws, but pointwise-result.json has already been overwritten just above. If the same run id is reused, the command exits before regenerating the report or manifest, leaving the previous manifest and report pointing at a replaced, invalid result file. Move the verdict validation and counting before the result artifacts are written so that failed runs do not corrupt an existing bundle.
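
A minimal sketch of the suggested ordering follows. It is illustrative only: `countVerdictsStrict` stands in for `countPointwiseVerdicts`, whose real behavior is not shown in this thread, and the verdict vocabulary, the `writeArtifact` callback, and the report file name are all assumptions.

```javascript
// Assumed verdict vocabulary; the real set lives in eval-kit and may differ.
const KNOWN_VERDICTS = ["pass", "fail", "unclear"];

// Illustrative stand-in for countPointwiseVerdicts: counts known verdicts
// and throws on anything outside the assumed vocabulary.
function countVerdictsStrict(items) {
  const counts = Object.fromEntries(KNOWN_VERDICTS.map((verdict) => [verdict, 0]));
  for (const item of items) {
    if (!Object.hasOwn(counts, item.verdict)) {
      throw new Error(`unknown pointwise verdict: ${JSON.stringify(item.verdict)}`);
    }
    counts[item.verdict] += 1;
  }
  return counts;
}

// writeArtifact is a hypothetical callback: (fileName, contents) => void.
function finalizeRun(finalResult, writeArtifact) {
  // Validate and count first: if this throws, no artifact has been touched.
  const counts = countVerdictsStrict(finalResult.items);
  writeArtifact("pointwise-result.json", JSON.stringify(finalResult, null, 2));
  // "pointwise-report.json" is a placeholder name for the report artifact.
  writeArtifact("pointwise-report.json", JSON.stringify({ counts }, null, 2));
  return counts;
}
```

With this ordering, a run that reuses an existing run id and hits an invalid verdict exits with the previous result, report, and manifest still consistent with one another.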
