Merged
23 changes: 22 additions & 1 deletion CHANGELOG.md
@@ -12,6 +12,26 @@ documented with migration notes.
- Additional docs for suite-specific adoption.
- Better compatibility tests for Promptfoo variable contracts.

## [0.1.8] - 2026-07-04

### Added

- Added generic pointwise summary helpers for advisory verdict counts and calibration notes.
- Documented the shared pointwise report summary pattern for curated manual evidence.

### Fixed

- Hardened pointwise judge result handling so provider, prompt version, rubric version, and run
manifest metadata must match the configured run before the result bundle is written.
- Added regression tests for malformed or missing pointwise run metadata.

### Notes

- Deterministic `run-case` and manual `report` compatibility are preserved.
- Consumer repos still own judge semantics, prompts, fixtures, and calibration policy.
- No npm package is published.
- Consumers may pin `github:agentic-workflow-kit/eval-kit#v0.1.8`.

## [0.1.7] - 2026-07-04

### Fixed
@@ -135,7 +155,8 @@ documented with migration notes.
- Suite-specific presets remain deferred.
- Consumer repos own their own semantics, prompts, cases, and pass/fail policies.

[Unreleased]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.7...main
[Unreleased]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.8...main
[0.1.8]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.7...v0.1.8
[0.1.7]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.6...v0.1.7
[0.1.6]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.5...v0.1.6
[0.1.5]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.4...v0.1.5
5 changes: 3 additions & 2 deletions README.md
@@ -11,7 +11,7 @@ Shared evaluation infrastructure for `agentic-workflow-kit` repositories.
```json
{
"devDependencies": {
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
}
}
```
@@ -70,7 +70,7 @@ Install from a Git tag in a consumer repo:
```json
{
"devDependencies": {
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
},
"scripts": {
"eval:doctor": "eval-kit doctor --config evals/eval-kit.config.json",
@@ -196,6 +196,7 @@ v0.1.4
v0.1.5
v0.1.6
v0.1.7
v0.1.8
v0.2.0
```

2 changes: 1 addition & 1 deletion docs/design/consumer-integration.md
@@ -9,7 +9,7 @@ Consumer repos should adopt eval-kit through a pinned Git tag and keep their eva
```json
{
"devDependencies": {
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
}
}
```
2 changes: 1 addition & 1 deletion docs/guides/consumer-integration.md
@@ -18,7 +18,7 @@ If you cannot state the eval goal, do not bootstrap a suite yet. Empty harnesses
```json
{
"devDependencies": {
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
}
}
```
10 changes: 10 additions & 0 deletions docs/guides/model-judge-calibration-reporting.md
@@ -43,5 +43,15 @@ Manual reports should be written for reviewer handoff, not CI:
risks;
- state that model-judge evidence cannot upgrade deterministic red or yellow results.

Eval-kit exposes `countPointwiseVerdicts` and `formatPointwiseCalibrationSummary` as a shared
summary pattern. Consumers may use these helpers when writing curated notes or report hooks, but the
consumer still owns expected-good/expected-bad labels, critical-item policy, and false-pass or
false-fail interpretation.
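
The counting half of that pattern can be sketched without eval-kit at all. The block below is a minimal local illustration, not the implementation behind `countPointwiseVerdicts`: the function name `tallyVerdicts`, the item shape (`id`, `verdict`), and the choice to count unrecognized labels as `unknown` are all assumptions made for the example.

```js
// Hypothetical local sketch; this is NOT eval-kit's implementation.
const VERDICTS = ["covered", "partial", "missing", "contradicted", "unknown"];

// Tally verdict labels. Assumption: an unrecognized label counts as "unknown".
function tallyVerdicts(items) {
  const counts = Object.fromEntries(VERDICTS.map((verdict) => [verdict, 0]));
  for (const item of items) {
    const key = VERDICTS.includes(item.verdict) ? item.verdict : "unknown";
    counts[key] += 1;
  }
  return counts;
}

// Hypothetical judged items for illustration only.
const counts = tallyVerdicts([
  { id: "req-1", verdict: "covered" },
  { id: "req-2", verdict: "partial" },
  { id: "req-3", verdict: "covered" },
  { id: "req-4", verdict: "not-a-verdict" },
]);

console.log(counts);
// { covered: 2, partial: 1, missing: 0, contradicted: 0, unknown: 1 }
```

Starting every bucket at zero keeps the summary shape stable across runs, so a curated note can always list all five counts even when some are empty.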

For pointwise result bundles, eval-kit fails closed when required run metadata is absent or
mismatched. A valid pointwise run records run id, one case id, model, provider, reasoning effort when
present, prompt version, rubric version, runner version, and the artifact/output paths for the
pointwise result bundle.

Keep raw provider bundles under ignored `evals/results/` paths unless a human curates and commits a
summary.
2 changes: 1 addition & 1 deletion docs/guides/quickstart.md
@@ -7,7 +7,7 @@ This guide adds a generic deterministic eval suite to a consumer repo.
```json
{
"devDependencies": {
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
}
}
```
13 changes: 13 additions & 0 deletions docs/reference/adapter-contract.md
@@ -136,6 +136,19 @@ export const canonicalizeExpectedItemMetadata = (actualItems, expectedItems) =>
}));
```

Eval-kit exports generic pointwise helpers for consumers that curate summaries:

```js
import {
countPointwiseVerdicts,
formatPointwiseCalibrationSummary,
} from "@agentic-workflow-kit/eval-kit";
```

Use these helpers to report advisory counts for `covered`, `partial`, `missing`, `contradicted`, and
`unknown`, plus expected-good/expected-bad calibration labels and false-pass/false-fail notes. The
helpers do not define consumer semantics.
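
Because the semantics stay with the consumer, the false-pass and false-fail notes have to come from consumer code. One way a consumer might derive them is sketched below. This is an assumption-laden illustration: `calibrationNote`, the fixture shape, and the `passed` flag (the consumer's own pass/fail decision) are hypothetical and are not part of the eval-kit API.

```js
// Hypothetical consumer-side sketch; names and shapes are assumptions.
// `label` is the calibration label, `passed` is the consumer's own decision.
function calibrationNote(fixture) {
  if (fixture.label === "expected-bad" && fixture.passed) return "false-pass";
  if (fixture.label === "expected-good" && !fixture.passed) return "false-fail";
  return "matches-label";
}

// Illustrative fixtures only.
const fixtures = [
  { id: "good-1", label: "expected-good", passed: true },
  { id: "good-2", label: "expected-good", passed: false },
  { id: "bad-1", label: "expected-bad", passed: true },
  { id: "bad-2", label: "expected-bad", passed: false },
];

const notes = fixtures.map(
  (fixture) => `${fixture.id}: ${fixture.label} -> ${calibrationNote(fixture)}`,
);

console.log(notes.join("\n"));
```

Note that an expected-bad fixture that fails is reported as `matches-label`, not as a defect in the suite.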

## Pairwise judge hook

Required for `judge-pairwise`:
18 changes: 9 additions & 9 deletions docs/reference/release-process.md
@@ -34,7 +34,7 @@ Consumers depend on tags like:
Title:

```text
chore(release): v0.1.7
chore(release): v0.1.8
```

Required changes:
@@ -63,18 +63,18 @@ git checkout main
git pull --ff-only
git rev-parse HEAD

git tag -a v0.1.7 -m "v0.1.7"
git push origin v0.1.7
git tag -a v0.1.8 -m "v0.1.8"
git push origin v0.1.8
```

Verify:

```bash
git rev-parse v0.1.7^{}
git show --no-patch --decorate v0.1.7
git rev-parse v0.1.8^{}
git show --no-patch --decorate v0.1.8
```

`v0.1.7^{}` must point to the release commit. With an annotated tag, `git rev-parse v0.1.7`
`v0.1.8^{}` must point to the release commit. With an annotated tag, `git rev-parse v0.1.8`
returns the tag object; `^{}` dereferences to the commit.

## GitHub Release
@@ -93,7 +93,7 @@ For each consumer repo:

```json
{
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
"@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
}
```

@@ -108,7 +108,7 @@ pnpm check
3. Run consumer smoke commands, for example in `technical-design`:

```bash
pnpm eval:case -- --case case-tiny-laundry-pickup-v1 --candidate evals/cases/case-tiny-laundry-pickup-v1/reference-design.md --run-id verify-eval-kit-v0.1.7
pnpm eval:case -- --case case-tiny-laundry-pickup-v1 --candidate evals/cases/case-tiny-laundry-pickup-v1/reference-design.md --run-id verify-eval-kit-v0.1.8
```

4. Open a PR with dependency, lockfile, and any compatibility fixes.
@@ -120,7 +120,7 @@ Do not move the tag.
Create a new patch release:

```text
v0.1.7 -> v0.1.8
v0.1.8 -> v0.1.9
```

Then open consumer bump PRs.
8 changes: 7 additions & 1 deletion docs/reference/results.md
@@ -39,7 +39,7 @@ Current schema:
"run_type": "deterministic",
"runner": {
"id": "generic-eval-case",
"version": "0.1.7"
"version": "0.1.8"
},
"case_ids": ["case-example-v1"],
"started_at": "2026-07-03T00:00:00.000Z",
@@ -113,3 +113,9 @@ CLI candidate labels. Its `randomization.original_order` field records the origi
candidate keys were displayed as Candidate A/B for the model judge.

Treat these as potentially sensitive.

Pointwise `judge-coverage` manifests fail closed if required run metadata is missing or mismatched.
Required pointwise metadata includes the run id, exactly one case id, model, provider, reasoning
effort when supplied, prompt version, rubric version, runner version, and artifact/output paths for
the pointwise report, structured pointwise result, Promptfoo config, raw Promptfoo results, and HTML
report.
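
"Fail closed" here means the mismatch check runs before anything is written. A minimal sketch of that ordering follows, assuming a hypothetical `writeBundleIfCoherent` wrapper and a three-field comparison; the field names mirror the metadata listed above, but neither the function names nor the error text come from eval-kit.

```js
// Hypothetical sketch of a fail-closed write; not eval-kit's implementation.
// Assumption: these three fields are compared against the configured run.
const MATCHED_FIELDS = ["provider", "prompt_version", "rubric_version"];

// Return the names of fields whose reported value differs from the configured run.
function findMismatches(configured, reported) {
  return MATCHED_FIELDS.filter((field) => configured[field] !== reported[field]);
}

// Call `write` only when the metadata matches; otherwise throw before writing.
function writeBundleIfCoherent(configured, reported, write) {
  const mismatches = findMismatches(configured, reported);
  if (mismatches.length > 0) {
    throw new Error(`pointwise metadata mismatch: ${mismatches.join(", ")}`);
  }
  return write(reported);
}

// Placeholder values for illustration only.
const configured = { provider: "example-provider", prompt_version: "p1", rubric_version: "r1" };
const reported = { provider: "example-provider", prompt_version: "p2", rubric_version: "r1" };

let written = false;
try {
  writeBundleIfCoherent(configured, reported, () => {
    written = true;
  });
} catch (error) {
  console.error(error.message);
}
```

In this example the write callback never runs, because `prompt_version` differs between the configured run and the reported result.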
14 changes: 14 additions & 0 deletions docs/schemas.md
@@ -117,6 +117,20 @@ Optional model-run fields:
- `randomization`
- `provenance.parent_run_ids`

For `judge-coverage` pointwise runs, eval-kit additionally validates the run metadata before
writing the manifest. Required pointwise metadata is:

- `run_id`;
- exactly one `case_ids` entry matching the judged case;
- `model`;
- `provider`;
- `reasoning_effort` when supplied by the run command;
- `prompt_version`;
- `rubric_version`;
- `runner.version`;
- artifact and output paths for the pointwise report, structured pointwise result, Promptfoo config,
raw Promptfoo results, and Promptfoo HTML report.
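
A presence check over the fields in this list might look like the sketch below. It is an illustration under stated assumptions, not the behavior of `validatePointwiseRunMetadata`: the function name `findPointwiseMetadataProblems` and the message wording are hypothetical, and artifact and output paths are left out because their key names are not given here.

```js
// Hypothetical sketch; not eval-kit's validatePointwiseRunMetadata.
// Artifact/output paths are omitted because their key names are not listed above.
const REQUIRED_STRING_FIELDS = ["run_id", "model", "provider", "prompt_version", "rubric_version"];

const isNonEmptyString = (value) => typeof value === "string" && value.trim() !== "";

// Return a list of human-readable problems; an empty list means the check passed.
function findPointwiseMetadataProblems(manifest) {
  const problems = [];
  for (const field of REQUIRED_STRING_FIELDS) {
    if (!isNonEmptyString(manifest[field])) problems.push(`${field} is missing`);
  }
  if (!Array.isArray(manifest.case_ids) || manifest.case_ids.length !== 1) {
    problems.push("case_ids must contain exactly one entry");
  }
  if (!isNonEmptyString(manifest.runner?.version)) {
    problems.push("runner.version is missing");
  }
  // reasoning_effort is only checked when the run supplied it.
  if ("reasoning_effort" in manifest && !isNonEmptyString(manifest.reasoning_effort)) {
    problems.push("reasoning_effort is empty");
  }
  return problems;
}

// Placeholder manifest for illustration only.
const problems = findPointwiseMetadataProblems({
  run_id: "run-example",
  case_ids: ["case-example-v1", "case-other-v1"],
  model: "example-model",
  provider: "example-provider",
  prompt_version: "p1",
  runner: { version: "0.1.8" },
});

console.log(problems);
// [ 'rubric_version is missing', 'case_ids must contain exactly one entry' ]
```

Collecting every problem before reporting, instead of stopping at the first, gives a reviewer the full picture of a malformed manifest in one pass.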

### `finding.schema.json`

Generic minimal finding shape:
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "@agentic-workflow-kit/eval-kit",
"version": "0.1.7",
"version": "0.1.8",
"description": "Portable eval runner primitives for local eval suites.",
"private": true,
"type": "module",
2 changes: 2 additions & 0 deletions skills/bootstrap-eval-suite/SKILL.md
@@ -29,6 +29,8 @@ standard two-config pattern:
- Document the local calibration policy before treating pointwise results as more than raw advisory
evidence. The policy should define expected-good and expected-bad fixture labels, `partial` and
`unknown` handling, and where curated summaries live.
- For curated summaries, use the shared count shape for `covered`, `partial`, `missing`,
`contradicted`, and `unknown`, then add consumer-owned false-pass and false-fail notes.

## Boundaries

3 changes: 3 additions & 0 deletions skills/review-eval-suite/SKILL.md
@@ -29,6 +29,9 @@ Use this skill when auditing or reviewing an eval-kit suite.
passes.
- Treat `partial` as non-covered unless the consumer explicitly documents why a non-critical partial
is acceptable. Repeated `unknown` verdicts are calibration or prompt-quality risks.
- Verify pointwise run metadata before trusting manual judge evidence: run id, one case id, model,
provider, reasoning effort when present, prompt version, rubric version, runner version, and
artifact/output paths must be present and coherent.
- Treat run-producing semantic portfolios as local on-demand evidence before significant changes, not default CI.
- Do not claim suite readiness without command evidence.

7 changes: 6 additions & 1 deletion skills/run-eval-suite/SKILL.md
@@ -29,6 +29,9 @@ Use this skill when executing a local eval-kit suite.
scripts before any manual `eval:judge:coverage` run.
- For pointwise model-judge summaries, treat `partial`, `missing`, `contradicted`, and `unknown` as
non-covered unless the consumer policy explicitly accepts the item.
- Prefer the eval-kit pointwise summary helpers for curated report counts, and record
expected-good/expected-bad labels plus false-pass/false-fail notes when summarizing manual judge
evidence.
- Expected-bad fixtures should remain adverse on their intended defect. Do not describe an adverse
bad-fixture result as a failed eval when it matches the calibration label.
- Preserve raw outputs according to the consumer repo's artifact policy.
@@ -38,4 +41,6 @@ Use this skill when executing a local eval-kit suite.
Report the config path, cases run, result directories, verdicts, report paths, and any skipped or
advisory-only checks. For model-assisted runs, state that provider calls were explicitly requested.
Report deterministic evidence first, then model-judge counts for `covered`, `partial`, `missing`,
`contradicted`, and `unknown`.
`contradicted`, and `unknown`. If a pointwise result manifest is missing run id, case id, model,
provider, prompt version, rubric version, runner version, or artifact paths, treat that run as
invalid evidence.
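
When reporting model-judge counts, the rule that anything other than `covered` is non-covered unless consumer policy accepts the item can be sketched as follows. The function name `nonCoveredItems`, the item shape, and the idea of expressing the policy as a set of accepted item ids are assumptions made for the example, not eval-kit behavior.

```js
// Hypothetical sketch; the policy shape (a Set of accepted item ids) is an assumption.
// Every verdict other than "covered" is non-covered unless the consumer accepts the item.
function nonCoveredItems(items, acceptedIds = new Set()) {
  return items.filter((item) => item.verdict !== "covered" && !acceptedIds.has(item.id));
}

// Illustrative judged items only.
const judged = [
  { id: "req-1", verdict: "covered" },
  { id: "req-2", verdict: "partial" },
  { id: "req-3", verdict: "missing" },
  { id: "req-4", verdict: "unknown" },
];

// Assumed consumer policy: the partial result on req-2 is documented as acceptable.
const remaining = nonCoveredItems(judged, new Set(["req-2"]));

console.log(remaining.map((item) => item.id));
// [ 'req-3', 'req-4' ]
```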
7 changes: 7 additions & 0 deletions src/index.mjs
@@ -20,6 +20,13 @@ export {
runPromptfooRaw,
} from "./promptfoo.mjs";
export { aggregateVerdict, criticalBlockerCount } from "./verdict.mjs";
export {
POINTWISE_VERDICTS,
countPointwiseVerdicts,
formatPointwiseCalibrationSummary,
formatPointwiseVerdictCounts,
validatePointwiseRunMetadata,
} from "./pointwise.mjs";

export { loadConfig } from "./config.mjs";
export {