agentic-workflow-kit · aryeko · Jul 4, 2026 · Jul 4, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,26 @@ documented with migration notes.
 - Additional docs for suite-specific adoption.
 - Better compatibility tests for Promptfoo variable contracts.
 
+## [0.1.8] - 2026-07-04
+
+### Added
+
+- Added generic pointwise summary helpers for advisory verdict counts and calibration notes.
+- Documented the shared pointwise report summary pattern for curated manual evidence.
+
+### Fixed
+
+- Hardened pointwise judge result handling so provider, prompt version, rubric version, and run
+  manifest metadata must match the configured run before the result bundle is written.
+- Added regression tests for malformed or missing pointwise run metadata.
+
+### Notes
+
+- Deterministic `run-case` and manual `report` compatibility are preserved.
+- Consumer repos still own judge semantics, prompts, fixtures, and calibration policy.
+- No npm package is published.
+- Consumers may pin `github:agentic-workflow-kit/eval-kit#v0.1.8`.
+
 ## [0.1.7] - 2026-07-04
 
 ### Fixed
@@ -135,7 +155,8 @@ documented with migration notes.
 - Suite-specific presets remain deferred.
 - Consumer repos own their own semantics, prompts, cases, and pass/fail policies.
 
-[Unreleased]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.7...main
+[Unreleased]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.8...main
+[0.1.8]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.7...v0.1.8
 [0.1.7]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.6...v0.1.7
 [0.1.6]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.5...v0.1.6
 [0.1.5]: https://github.com/agentic-workflow-kit/eval-kit/compare/v0.1.4...v0.1.5

diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@ Shared evaluation infrastructure for `agentic-workflow-kit` repositories.
 ```json
 {
   "devDependencies": {
-    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
+    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
   }
 }
 ```
@@ -70,7 +70,7 @@ Install from a Git tag in a consumer repo:
 ```json
 {
   "devDependencies": {
-    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
+    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
   },
   "scripts": {
     "eval:doctor": "eval-kit doctor --config evals/eval-kit.config.json",
@@ -196,6 +196,7 @@ v0.1.4
 v0.1.5
 v0.1.6
 v0.1.7
+v0.1.8
 v0.2.0
 ```
 

diff --git a/docs/design/consumer-integration.md b/docs/design/consumer-integration.md
@@ -9,7 +9,7 @@ Consumer repos should adopt eval-kit through a pinned Git tag and keep their eva
 ```json
 {
   "devDependencies": {
-    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
+    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
   }
 }
 ```

diff --git a/docs/guides/consumer-integration.md b/docs/guides/consumer-integration.md
@@ -18,7 +18,7 @@ If you cannot state the eval goal, do not bootstrap a suite yet. Empty harnesses
 ```json
 {
   "devDependencies": {
-    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
+    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
   }
 }
 ```

diff --git a/docs/guides/model-judge-calibration-reporting.md b/docs/guides/model-judge-calibration-reporting.md
@@ -43,5 +43,15 @@ Manual reports should be written for reviewer handoff, not CI:
   risks;
 - state that model-judge evidence cannot upgrade deterministic red or yellow results.
 
+Eval-kit exposes `countPointwiseVerdicts` and `formatPointwiseCalibrationSummary` as a shared
+summary pattern. Consumers may use these helpers when writing curated notes or report hooks, but the
+consumer still owns expected-good/expected-bad labels, critical-item policy, and false-pass or
+false-fail interpretation.
+
+For pointwise result bundles, eval-kit fails closed when required run metadata is absent or
+mismatched. A valid pointwise run records run id, one case id, model, provider, reasoning effort when
+present, prompt version, rubric version, runner version, and the artifact/output paths for the
+pointwise result bundle.
+
 Keep raw provider bundles under ignored `evals/results/` paths unless a human curates and commits a
 summary.
diff --git a/docs/guides/quickstart.md b/docs/guides/quickstart.md
@@ -7,7 +7,7 @@ This guide adds a generic deterministic eval suite to a consumer repo.
 ```json
 {
   "devDependencies": {
-    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
+    "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
   }
 }
 ```

diff --git a/docs/reference/adapter-contract.md b/docs/reference/adapter-contract.md
@@ -136,6 +136,19 @@ export const canonicalizeExpectedItemMetadata = (actualItems, expectedItems) =>
   }));
 ```
 
+Eval-kit exports generic pointwise helpers for consumers that curate summaries:
+
+```js
+import {
+  countPointwiseVerdicts,
+  formatPointwiseCalibrationSummary,
+} from "@agentic-workflow-kit/eval-kit";
+```
+
+Use these helpers to report advisory counts for `covered`, `partial`, `missing`, `contradicted`, and
+`unknown`, plus expected-good/expected-bad calibration labels and false-pass/false-fail notes. The
+helpers do not define consumer semantics.
+
 ## Pairwise judge hook
 
 Required for `judge-pairwise`:

diff --git a/docs/reference/release-process.md b/docs/reference/release-process.md
@@ -34,7 +34,7 @@ Consumers depend on tags like:
 Title:
 
 ```text
-chore(release): v0.1.7
+chore(release): v0.1.8
 ```
 
 Required changes:
@@ -63,18 +63,18 @@ git checkout main
 git pull --ff-only
 git rev-parse HEAD
 
-git tag -a v0.1.7 -m "v0.1.7"
-git push origin v0.1.7
+git tag -a v0.1.8 -m "v0.1.8"
+git push origin v0.1.8
 ```
 
 Verify:
 
 ```bash
-git rev-parse v0.1.7^{}
-git show --no-patch --decorate v0.1.7
+git rev-parse v0.1.8^{}
+git show --no-patch --decorate v0.1.8
 ```
 
-`v0.1.7^{}` must point to the release commit. With an annotated tag, `git rev-parse v0.1.7`
+`v0.1.8^{}` must point to the release commit. With an annotated tag, `git rev-parse v0.1.8`
 returns the tag object; `^{}` dereferences to the commit.
 
 ## GitHub Release
@@ -93,7 +93,7 @@ For each consumer repo:
 
 ```json
 {
-  "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.7"
+  "@agentic-workflow-kit/eval-kit": "github:agentic-workflow-kit/eval-kit#v0.1.8"
 }
 ```
 
@@ -108,7 +108,7 @@ pnpm check
 3. Run consumer smoke commands, for example in `technical-design`:
 
 ```bash
-pnpm eval:case -- --case case-tiny-laundry-pickup-v1 --candidate evals/cases/case-tiny-laundry-pickup-v1/reference-design.md --run-id verify-eval-kit-v0.1.7
+pnpm eval:case -- --case case-tiny-laundry-pickup-v1 --candidate evals/cases/case-tiny-laundry-pickup-v1/reference-design.md --run-id verify-eval-kit-v0.1.8
 ```
 
 4. Open a PR with dependency, lockfile, and any compatibility fixes.
@@ -120,7 +120,7 @@ Do not move the tag.
 Create a new patch release:
 
 ```text
-v0.1.7 -> v0.1.8
+v0.1.8 -> v0.1.9
 ```
 
 Then open consumer bump PRs.

diff --git a/docs/reference/results.md b/docs/reference/results.md
@@ -39,7 +39,7 @@ Current schema:
   "run_type": "deterministic",
   "runner": {
     "id": "generic-eval-case",
-    "version": "0.1.7"
+    "version": "0.1.8"
   },
   "case_ids": ["case-example-v1"],
   "started_at": "2026-07-03T00:00:00.000Z",
@@ -113,3 +113,9 @@ CLI candidate labels. Its `randomization.original_order` field records the origi
 candidate keys were displayed as Candidate A/B for the model judge.
 
 Treat these as potentially sensitive.
+
+Pointwise `judge-coverage` manifests fail closed if required run metadata is missing or mismatched.
+Required pointwise metadata includes the run id, exactly one case id, model, provider, reasoning
+effort when supplied, prompt version, rubric version, runner version, and artifact/output paths for
+the pointwise report, structured pointwise result, Promptfoo config, raw Promptfoo results, and HTML
+report.
diff --git a/docs/schemas.md b/docs/schemas.md
@@ -117,6 +117,20 @@ Optional model-run fields:
 - `randomization`
 - `provenance.parent_run_ids`
 
+For `judge-coverage` pointwise runs, eval-kit additionally validates the run metadata before
+writing the manifest. Required pointwise metadata is:
+
+- `run_id`;
+- exactly one `case_ids` entry matching the judged case;
+- `model`;
+- `provider`;
+- `reasoning_effort` when supplied by the run command;
+- `prompt_version`;
+- `rubric_version`;
+- `runner.version`;
+- artifact and output paths for the pointwise report, structured pointwise result, Promptfoo config,
+  raw Promptfoo results, and Promptfoo HTML report.
+
 ### `finding.schema.json`
 
 Generic minimal finding shape:

diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@agentic-workflow-kit/eval-kit",
-  "version": "0.1.7",
+  "version": "0.1.8",
   "description": "Portable eval runner primitives for local eval suites.",
   "private": true,
   "type": "module",

diff --git a/skills/bootstrap-eval-suite/SKILL.md b/skills/bootstrap-eval-suite/SKILL.md
@@ -29,6 +29,8 @@ standard two-config pattern:
 - Document the local calibration policy before treating pointwise results as more than raw advisory
   evidence. The policy should define expected-good and expected-bad fixture labels, `partial` and
   `unknown` handling, and where curated summaries live.
+- For curated summaries, use the shared count shape for `covered`, `partial`, `missing`,
+  `contradicted`, and `unknown`, then add consumer-owned false-pass and false-fail notes.
 
 ## Boundaries
 

diff --git a/skills/review-eval-suite/SKILL.md b/skills/review-eval-suite/SKILL.md
@@ -29,6 +29,9 @@ Use this skill when auditing or reviewing an eval-kit suite.
   passes.
 - Treat `partial` as non-covered unless the consumer explicitly documents why a non-critical partial
   is acceptable. Repeated `unknown` verdicts are calibration or prompt-quality risks.
+- Verify pointwise run metadata before trusting manual judge evidence: run id, one case id, model,
+  provider, reasoning effort when present, prompt version, rubric version, runner version, and
+  artifact/output paths must be present and coherent.
 - Treat run-producing semantic portfolios as local on-demand evidence before significant changes, not default CI.
 - Do not claim suite readiness without command evidence.
 

diff --git a/skills/run-eval-suite/SKILL.md b/skills/run-eval-suite/SKILL.md
@@ -29,6 +29,9 @@ Use this skill when executing a local eval-kit suite.
   scripts before any manual `eval:judge:coverage` run.
 - For pointwise model-judge summaries, treat `partial`, `missing`, `contradicted`, and `unknown` as
   non-covered unless the consumer policy explicitly accepts the item.
+- Prefer the eval-kit pointwise summary helpers for curated report counts, and record
+  expected-good/expected-bad labels plus false-pass/false-fail notes when summarizing manual judge
+  evidence.
 - Expected-bad fixtures should remain adverse on their intended defect. Do not describe an adverse
   bad-fixture result as a failed eval when it matches the calibration label.
 - Preserve raw outputs according to the consumer repo's artifact policy.
@@ -38,4 +41,6 @@ Use this skill when executing a local eval-kit suite.
 Report the config path, cases run, result directories, verdicts, report paths, and any skipped or
 advisory-only checks. For model-assisted runs, state that provider calls were explicitly requested.
 Report deterministic evidence first, then model-judge counts for `covered`, `partial`, `missing`,
-`contradicted`, and `unknown`.
+`contradicted`, and `unknown`. If a pointwise result manifest is missing run id, case id, model,
+provider, prompt version, rubric version, runner version, or artifact paths, treat that run as
+invalid evidence.
diff --git a/src/index.mjs b/src/index.mjs
@@ -20,6 +20,13 @@ export {
   runPromptfooRaw,
 } from "./promptfoo.mjs";
 export { aggregateVerdict, criticalBlockerCount } from "./verdict.mjs";
+export {
+  POINTWISE_VERDICTS,
+  countPointwiseVerdicts,
+  formatPointwiseCalibrationSummary,
+  formatPointwiseVerdictCounts,
+  validatePointwiseRunMetadata,
+} from "./pointwise.mjs";
 
 export { loadConfig } from "./config.mjs";
 export {
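
Reviewer note: the docs in this diff describe two behaviors, advisory verdict counts over `covered`, `partial`, `missing`, `contradicted`, and `unknown`, and fail-closed validation of pointwise run metadata. The sketch below illustrates both under stated assumptions. It is not the code in `src/pointwise.mjs`: the function names `sketchCountVerdicts` and `sketchValidateRunMetadata`, the flat manifest layout, and the `expected` argument are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical sketch. Names, signatures, and the flat manifest layout are
// assumptions for illustration, not eval-kit's actual exports.
const SKETCH_VERDICTS = ["covered", "partial", "missing", "contradicted", "unknown"];

// Count verdicts per category. Assumption: an unrecognized or absent verdict
// is bucketed as "unknown" rather than dropped.
function sketchCountVerdicts(items) {
  const counts = Object.fromEntries(SKETCH_VERDICTS.map((v) => [v, 0]));
  for (const item of items) {
    const verdict = SKETCH_VERDICTS.includes(item.verdict) ? item.verdict : "unknown";
    counts[verdict] += 1;
  }
  return counts;
}

// Fail closed: throw before anything is written if required metadata is
// missing or does not match the configured run.
function sketchValidateRunMetadata(manifest, expected) {
  const problems = [];
  if (typeof manifest.run_id !== "string" || manifest.run_id === "") {
    problems.push("run_id missing");
  }
  if (!Array.isArray(manifest.case_ids) || manifest.case_ids.length !== 1) {
    problems.push("case_ids must contain exactly one entry");
  }
  for (const key of ["model", "provider", "prompt_version", "rubric_version"]) {
    if (manifest[key] === undefined || manifest[key] === "") {
      problems.push(`${key} missing`);
    } else if (manifest[key] !== expected[key]) {
      problems.push(`${key} mismatch`);
    }
  }
  // reasoning_effort is only checked when the run command supplied one.
  if (expected.reasoning_effort !== undefined &&
      manifest.reasoning_effort !== expected.reasoning_effort) {
    problems.push("reasoning_effort mismatch");
  }
  if (problems.length > 0) {
    throw new Error(`invalid pointwise run metadata: ${problems.join("; ")}`);
  }
  return manifest;
}
```

Throwing on the first invalid manifest, instead of returning a warning, matches the "fails closed" wording in the docs: a caller cannot accidentally write a result bundle for a run whose provenance is unclear. Runner version and artifact path checks are omitted here to keep the sketch short.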