fix: prompt-surface correctness (P1/P2/P3/B/C5) by damienen · Pull Request #28 · fastxyz/skill-optimizer

damienen · 2026-04-17T04:19:46Z

Summary

P1: Prompt-surface benchmarks no longer hard-FAIL on coverage violations — coverage is reported but doesn't veto the verdict for surface: prompt
P2: Each prompt task is now scored against its own capability's criteria (not all against caps[0]) — GeneratedTask.capabilityId field added, generation tags it, runner resolves per-task criteria via resolveCriteriaForTask
P3: Empty criteria no longer produce a vacuous 1.0 score — evaluator sets noActiveCriteria: true, score: 0; runner surfaces an actionable SKILL.md error
Bug B: openai/ model IDs (e.g. gpt-5.4) are now exempt from dot→hyphen rewriting in fix.ts, matching openrouter/ exemption already present
C5: Dead src/discovery/prompt.ts (and its test file) deleted — active module is src/project/discover-prompt.ts

Test Plan

npm run build — clean
npm run typecheck — clean
npm run lint — clean
npm test — all suites pass including new smoke-verdict-prompt, smoke-prompt-criteria, smoke-changelog-coverage
npx tsx src/cli.ts --help — CLI still works

New test files

tests/smoke-verdict-prompt.ts — 5 scenarios: P1 regression guard, P2 caps[0]-collapse guard, P3 noActiveCriteria guard, verdict floor math, coverage exemption
tests/smoke-prompt-criteria.ts — unit tests for resolveCriteriaForTask (match, distinct per cap, throws on unknown, throws on missing)
tests/smoke-changelog-coverage.ts — release hygiene: every CHANGELOG Fixed/Added item must have a matching test file reference

🤖 Generated with Claude Code

OpenAI's direct-API model IDs use dots in version numbers (gpt-5.4, gpt-4.1). fix.ts already exempted openrouter/ but not openai/, so a manufactured model-id-bad-format issue would corrupt an openai/ ID. Defense-in-depth: validate.ts already skips emitting the issue for openai/, but fix.ts must independently respect the documented invariant in CLAUDE.md. Also wires smoke-model-ids.ts into npm test — it existed on disk but was not in the test script.
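As a rough illustration of the exemption described above, here is a minimal sketch of a dot→hyphen rewrite that skips exempt provider prefixes. The names `DOT_EXEMPT_PREFIXES` and `rewriteModelIdSketch` are hypothetical and are not taken from fix.ts. The assumption that non-exempt IDs get every dot in the model part rewritten is also illustrative, as is the anthropic/ example ID in the comment.

```typescript
// Hypothetical sketch: provider prefixes whose model IDs keep dots verbatim.
const DOT_EXEMPT_PREFIXES: readonly string[] = ["openrouter/", "openai/"];

// Rewrites dots to hyphens in the model part of an ID, unless the provider is exempt.
function rewriteModelIdSketch(id: string): string {
  if (DOT_EXEMPT_PREFIXES.some((prefix) => id.startsWith(prefix))) {
    return id; // e.g. openai/gpt-5.4 is returned untouched
  }
  const slash = id.indexOf("/");
  const provider = id.slice(0, slash + 1); // empty string when there is no slash
  const model = id.slice(slash + 1);
  return provider + model.replace(/\./g, "-");
}

console.log(rewriteModelIdSketch("openai/gpt-5.4"));
console.log(rewriteModelIdSketch("anthropic/claude-sonnet-4.6")); // illustrative ID only
```

Keeping the exemption list in one constant means the fixer can enforce the invariant on its own, regardless of what the validator emits.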

The active prompt discovery lives in src/project/discover-prompt.ts, imported by snapshot.ts and benchmark/runner.ts. src/discovery/prompt.ts was a dead parallel implementation — only referenced by its own test file. Removing both the dead module and its test file to keep the codebase lean. The live discover-prompt.ts API (discoverPromptCapabilities) is covered by other smoke tests in this PR.

Prompt-surface tasks don't guarantee 1:1 capability coverage the way SDK/CLI/MCP tasks do, so coverageViolation=true was hard-FAILing every prompt benchmark regardless of actual scores. computeVerdict now only appends the coverage-violation reason when config.surface !== 'prompt'. Coverage is still computed and appears in the report. Regression guard in new smoke-verdict-prompt.ts locks in both halves of the behavior: prompt PASSes with coverageViolation=true + scores above floor; mcp still FAILs under identical conditions.
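The verdict rule described above can be sketched as follows. This is a simplified stand-in, not the real computeVerdict: the input shape (`VerdictInputSketch`), the single aggregate `score`/`floor` pair, and the reason strings are assumptions made for illustration.

```typescript
type Surface = "sdk" | "cli" | "mcp" | "prompt";

// Hypothetical, simplified input; the real config and report types are richer.
interface VerdictInputSketch {
  surface: Surface;
  coverageViolation: boolean;
  score: number; // aggregate score in [0, 1]
  floor: number; // minimum passing score
}

interface VerdictSketch {
  result: "PASS" | "FAIL";
  reasons: string[];
}

function computeVerdictSketch(input: VerdictInputSketch): VerdictSketch {
  const reasons: string[] = [];
  if (input.score < input.floor) {
    reasons.push(`score ${input.score.toFixed(3)} is below floor ${input.floor.toFixed(3)}`);
  }
  // Coverage is still computed upstream; it only vetoes non-prompt surfaces.
  if (input.coverageViolation && input.surface !== "prompt") {
    reasons.push("coverage violation");
  }
  return { result: reasons.length === 0 ? "PASS" : "FAIL", reasons };
}

const shared = { coverageViolation: true, score: 0.9, floor: 0.7 };
console.log(computeVerdictSketch({ surface: "prompt", ...shared }).result);
console.log(computeVerdictSketch({ surface: "mcp", ...shared }).result);
```

The two calls at the end mirror the two halves of the regression guard: identical inputs, differing only in surface.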

Pure refactor. Moves the caps→criteria lookup out of runner.ts into src/benchmark/prompt-criteria.ts so it can be unit-tested without running the full LLM pipeline. Behavior is unchanged in this commit — tasks missing capabilityId are logged as eval errors (FAIL with message) rather than silently vacuously passing; capabilityId tagging by the generator lands in a later commit. Adds optional capabilityId to GeneratedTask (SDK/CLI/MCP generators don't set it). Runtime enforcement (throw on missing/unknown) lives in resolveCriteriaForTask — no silent fallback, per the no-legacy-compat policy. New smoke-prompt-criteria.ts locks in: match, distinct-per-capability, throws-on-unknown, throws-on-missing, and noActiveCriteria flagging.
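A minimal sketch of the lookup this commit extracts, under simplifying assumptions: the real function takes `PromptCapabilityWithSection[]` and derives criteria from each capability's section, whereas the hypothetical types below attach ready-made criteria to each capability. The wording of the unknown-id error is also invented for illustration.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface CriteriaSketch { mustInclude: string[] }
interface CapabilitySketch { key: string; criteria: CriteriaSketch }
interface PromptTaskSketch { id: string; capabilityId?: string }

function resolveCriteriaSketch(
  task: PromptTaskSketch,
  caps: readonly CapabilitySketch[],
): { criteria: CriteriaSketch } {
  if (!task.capabilityId) {
    // No silent fallback to caps[0]: a missing id is an error.
    throw new Error(`Task ${task.id}: prompt-surface task is missing capabilityId.`);
  }
  const cap = caps.find((c) => c.key === task.capabilityId);
  if (!cap) {
    throw new Error(`Task ${task.id}: unknown capabilityId "${task.capabilityId}".`);
  }
  return { criteria: cap.criteria };
}

const caps: CapabilitySketch[] = [
  { key: "summarize", criteria: { mustInclude: ["Summary"] } },
  { key: "translate", criteria: { mustInclude: ["Translation"] } },
];
console.log(resolveCriteriaSketch({ id: "t2", capabilityId: "translate" }, caps).criteria);
```

Throwing rather than defaulting is what makes the caps[0] collapse impossible to reintroduce silently.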

Previously, when every criteria category was empty the evaluator returned score: 1.0 — any response (including an empty string) scored a perfect pass. Now the evaluator returns score: 0 and noActiveCriteria: true. The runner treats that flag as an evaluation error with an actionable message pointing at the SKILL.md section for the offending capability. Evaluator stays dumb (no pass/fail policy). Runner is the policy layer.
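The split between a policy-free evaluator and a policy-owning runner might look like the sketch below. The criteria shape, the function names, and the error text are assumptions. The 0.5 pass threshold is borrowed from the line removed in the runner diff in this PR; whether the current runner still uses it is not confirmed here.

```typescript
// Hypothetical, simplified criteria: plain substring checks only.
interface EvalCriteriaSketch { mustInclude: string[]; mustNotInclude: string[] }
interface EvalResultSketch { score: number; noActiveCriteria: boolean }

// Evaluator: reports facts only, no pass/fail decision.
function evaluateSketch(response: string, criteria: EvalCriteriaSketch): EvalResultSketch {
  const total = criteria.mustInclude.length + criteria.mustNotInclude.length;
  if (total === 0) {
    // Nothing to check: flag it instead of returning a vacuous 1.0.
    return { score: 0, noActiveCriteria: true };
  }
  const hits =
    criteria.mustInclude.filter((s) => response.includes(s)).length +
    criteria.mustNotInclude.filter((s) => !response.includes(s)).length;
  return { score: hits / total, noActiveCriteria: false };
}

// Runner: owns the policy and turns the flag into an actionable error.
function applyRunnerPolicySketch(
  result: EvalResultSketch,
  capabilityId: string,
): { passed: boolean; error?: string } {
  if (result.noActiveCriteria) {
    return {
      passed: false,
      error: `No active criteria for capability "${capabilityId}"; check its section in SKILL.md.`,
    };
  }
  return { passed: result.score >= 0.5 }; // assumed threshold
}

const empty = evaluateSketch("", { mustInclude: [], mustNotInclude: [] });
console.log(applyRunnerPolicySketch(empty, "summarize"));
```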

The runner's caps[0] global was collapsing every prompt-surface task into evaluation against the first discovered capability regardless of what the task actually exercised. With this commit, each generated prompt-surface task is tagged at generation time with the action key of the capability it exercises, and the runner looks up criteria per-task via resolveCriteriaForTask (wired in previous commits). No legacy compat. Prompt-surface tasks lacking capabilityId fail to load; users regenerate with `skill-optimizer generate-tasks`.

Regression guards:
- smoke-generation.ts: valid tagging plus rejection of unknown ids
- smoke-verdict-prompt.ts: three caps produce distinct criteria (caps[0]-collapse detector), P3 regression guard via evaluator, and mock-LLM verdict matrix (threshold + weight math)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

CHANGELOG gets a Fixed block covering P1/P2/P3/Bug B/C5 for v1.1.0. README prompt templates section now reflects per-capability scoring. SKILL.md audited for any guidance contradicting the fixed behavior.

smoke-changelog-coverage.ts parses the top block of CHANGELOG.md and asserts every item in Added/Fixed has at least one test file referencing relevant keywords. This guards against "shipped feature, forgot the test", the class of omission that let P1/P2/P3 slip past v1.1.0 before this PR. smoke-release.ts also gains an assertion that the CHANGELOG contains a section header matching the current package.json version.
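The version-header assertion could be approximated as below. This is only a sketch: the function name is hypothetical, and the accepted header shapes (a Markdown heading such as `## [1.1.0]`, `## 1.1.0`, or `## v1.1.0`) are an assumption about the CHANGELOG format rather than something stated in this PR.

```typescript
// Escapes characters that have special meaning inside a regular expression.
function escapeForRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns true when some Markdown heading line names exactly this version.
function changelogHasVersionHeaderSketch(changelog: string, version: string): boolean {
  const escaped = escapeForRegExp(version);
  const header = new RegExp(`^#{1,6}\\s+\\[?v?${escaped}\\]?(\\s|$)`, "m");
  return header.test(changelog);
}

const sample = "# Changelog\n\n## [1.1.0]\n\n### Fixed\n- example entry\n";
console.log(changelogHasVersionHeaderSketch(sample, "1.1.0"));
```

In a real check the `version` argument would come from package.json; it is passed in directly here to keep the sketch free of file I/O.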

Copilot

Pull request overview

This PR fixes several correctness issues in the prompt-surface benchmark pipeline: verdict computation no longer hard-fails on coverage for prompt runs, prompt tasks are scored against the criteria for their own capability (via a new capabilityId), and empty criteria no longer yield a vacuous passing score.

Changes:

Update prompt-surface scoring to resolve per-task evaluation criteria using capabilityId, and treat empty criteria as an evaluation error (noActiveCriteria, score 0).
Adjust verdict logic so scopeCoverage.coverageViolation does not veto prompt-surface runs (still enforced for other surfaces).
Add/extend smoke tests for verdict behavior, prompt criteria resolution, model-id rewriting exemptions, and changelog/test linkage; remove dead prompt discovery module/tests.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/smoke-verdict-prompt.ts | New smoke coverage for prompt-surface verdict policy + per-cap criteria + noActiveCriteria behavior |
| tests/smoke-release.ts | Adds release hygiene check: CHANGELOG must include current package version header |
| tests/smoke-prompt-evaluator.ts | Updates evaluator expectations for `noActiveCriteria` and non-vacuous empty-criteria scoring |
| tests/smoke-prompt-criteria.ts | New unit smoke tests for per-task criteria resolution (`resolveCriteriaForTask`) |
| tests/smoke-model-ids.ts | Adds defense-in-depth test ensuring `openai/` IDs are not dot→hyphen rewritten in `applyFixes` |
| tests/smoke-generation.ts | Adds prompt-surface generation/grounding tests for `capabilityId` tagging and rejection paths |
| tests/smoke-discovery-prompt.ts | Removes obsolete smoke tests for deleted `src/discovery/prompt.ts` |
| tests/smoke-changelog-coverage.ts | New “CHANGELOG entries must have matching test token” guard |
| src/tasks/types.ts | Extends generated-task type with optional `capabilityId` (prompt surface) |
| src/tasks/ground.ts | Enforces `capabilityId` presence/validity for prompt-surface tasks during grounding |
| src/tasks/generate.ts | Prompts for and parses `capabilityId` on prompt-surface task generation |
| src/project/fix.ts | Exempts `openai/` IDs from dot→hyphen rewriting (matching `openrouter/` exemption) |
| src/discovery/prompt.ts | Deletes dead prompt discovery module |
| src/benchmark/scoring.ts | Makes coverage violation non-blocking for prompt-surface verdicts |
| src/benchmark/runner.ts | Resolves prompt evaluation criteria per-task via `capabilityId`; handles `noActiveCriteria` as failure |
| src/benchmark/prompt-evaluator.ts | Adds `noActiveCriteria` to results; empty criteria now score 0 instead of 1.0 |
| src/benchmark/prompt-criteria.ts | New helper to resolve criteria per task/capability (`resolveCriteriaForTask`) |
| package.json | Updates test script to include new smoke tests and remove deleted one |
| README.md | Documents `capabilityId` and prompt-surface coverage/verdict semantics |
| CHANGELOG.md | Records fixes for prompt-surface scoring/verdict, model-id rewrite exemption, and module deletion |

Copilot · 2026-04-17T04:28:26Z

+      if (config.surface === 'prompt') {
        try {
-          const promptResult = evaluatePromptResponse(rawResponse, promptEvalCriteria);
-          taskResult.metrics.toolRecall = promptResult.score;
-          taskResult.metrics.taskPassed = promptResult.score >= 0.5;
-          console.log(`  [${slug}] Prompt score: ${promptResult.score.toFixed(3)} → ${taskResult.metrics.taskPassed ? 'PASS' : 'FAIL'}`);
+          const { criteria } = resolveCriteriaForTask(task, promptCaps);
+          const promptResult = evaluatePromptResponse(rawResponse, criteria);
+          if (promptResult.noActiveCriteria) {


resolveCriteriaForTask(task, promptCaps) relies on task.capabilityId, but benchmark tasks are loaded via loadTasks() (TaskDefinition) which currently discards unknown fields when normalizing JSON. That means even if tasks.json contains capabilityId, it will be dropped and prompt runs will consistently hit the "missing capabilityId" error path and score 0.

Fix: plumb capabilityId through the benchmark task model and loader (e.g., add optional capabilityId?: string to TaskDefinition in src/benchmark/types.ts and preserve it in normalizeTaskDefinition / JSON parsing), so the runner receives the field for prompt surface tasks.

Copilot · 2026-04-17T04:28:27Z

-  const tasks = parseGeneratedTasks(completion);
+
+  // For prompt surface, pass the known capability keys so parseGeneratedTasks
+  // can validate and attach capabilityId to each task.


The comment says parseGeneratedTasks "can validate and attach capabilityId", but the current implementation only attaches capabilityId when present and leaves validation (membership in knownCapabilityKeys) to grounding. Consider either updating the comment to match reality or adding an explicit check that rawCapabilityId is in knownCapabilityKeys so bad values fail fast during parsing.

Suggested change

-  // can validate and attach capabilityId to each task.
+  // can attach capabilityId metadata; membership validation is handled later.

damienen · 2026-04-17T04:33:20Z

Code review

Found 1 issue:

capabilityId is written to tasks.generated.json by freeze.ts but silently dropped by normalizeTaskDefinition in config.ts — the function's input type and return value only include {id, prompt, expected_actions, verify, expected_fetches}, and TaskDefinition has no capabilityId field. When the runner calls resolveCriteriaForTask(task, promptCaps), task.capabilityId is always undefined, so the function throws on every prompt-surface task. TypeScript does not catch this because TaskDefinition is structurally assignable to GeneratedTask (capabilityId is optional on GeneratedTask). Every prompt-surface benchmark run will report 100% task failures with "missing capabilityId — Regenerate tasks", even on freshly generated tasks.

Fix: add capabilityId?: string to TaskDefinition and update normalizeTaskDefinition to read and pass it through.

skill-optimizer/src/benchmark/config.ts

Lines 67 to 115 in 3a5ae71

    
    function normalizeTaskDefinition(
      task: { id?: unknown; prompt?: unknown; expected_actions?: unknown; verify?: unknown; expected_fetches?: unknown },
      resolvedPath: string,
      index: number,
    ): TaskDefinition {
      if (typeof task.id !== 'string' || task.id.trim() === '') {
        throw new Error(`Tasks file ${resolvedPath}: task at index ${index} must include a non-empty string id`);
      }
      if (!isSafeTaskId(task.id)) {
        throw new Error(`Tasks file ${resolvedPath}: task id "${task.id}" must match ${SAFE_TASK_ID.toString()} and cannot be . or ..`);
      }
      if (typeof task.prompt !== 'string' || task.prompt.trim() === '') {
        throw new Error(`Tasks file ${resolvedPath}: task ${task.id} must include a non-empty string prompt`);
      }
      const rawExpectedActions = Array.isArray(task.expected_actions) ? task.expected_actions : null;
      if (!rawExpectedActions) {
        throw new Error(`Tasks file ${resolvedPath}: task at index ${index} must include an expected_actions array`);
      }
      const expected_actions = rawExpectedActions.map((rawAction, actionIndex) => normalizeExpectedAction(rawAction, resolvedPath, index, actionIndex));
      const rawVerify = Array.isArray(task.verify) ? task.verify : undefined;
      if (rawVerify !== undefined) {
        for (let i = 0; i < rawVerify.length; i++) {
          if (!rawVerify[i] || typeof rawVerify[i] !== 'object') {
            throw new Error(`Tasks file ${resolvedPath}: task ${task.id} verify[${i}] must be an object`);
          }
        }
      }
      const rawFetches = Array.isArray(task.expected_fetches) ? task.expected_fetches : undefined;
      if (rawFetches !== undefined) {
        for (let i = 0; i < rawFetches.length; i++) {
          if (typeof rawFetches[i] !== 'string' || !(rawFetches[i] as string).trim()) {
            throw new Error(`Tasks file ${resolvedPath}: task ${task.id} expected_fetches[${i}] must be a non-empty string`);
          }
        }
      }
      return {
        id: task.id,
        prompt: task.prompt,
        expected_actions,
        verify: rawVerify as TaskDefinition['verify'] | undefined,
        expected_fetches: rawFetches as string[] | undefined,
      };
    }

skill-optimizer/src/benchmark/types.ts

Lines 141 to 149 in 3a5ae71

    
    export interface TaskDefinition {
      id: string;
      prompt: string;
      expected_actions: ExpectedAction[];
      verify?: TaskVerification[];
      expected_fetches?: string[];
    }

skill-optimizer/src/benchmark/prompt-criteria.ts

Lines 24 to 31 in 3a5ae71

    
      caps: readonly PromptCapabilityWithSection[],
    ): ResolvedPromptCriteria {
      if (!task.capabilityId) {
        throw new Error(
          `Task ${task.id}: prompt-surface task is missing capabilityId. ` +
          `Regenerate tasks with \`skill-optimizer generate-tasks\`.`,
        );
      }
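To make the failure mode in this review concrete, the sketch below contrasts a normalizer that rebuilds the task from a fixed field list (and so drops capabilityId) with one that reads and passes the field through. Both functions and the trimmed-down `TaskDefinitionSketch` type are hypothetical; the real normalizeTaskDefinition validates more fields, and the validation applied to capabilityId here is an assumption.

```typescript
// Hypothetical, trimmed-down task shape for illustration.
interface TaskDefinitionSketch {
  id: string;
  prompt: string;
  capabilityId?: string;
}

// Mirrors the bug: only listed fields are copied, so capabilityId is lost.
function normalizeDroppingSketch(raw: Record<string, unknown>): TaskDefinitionSketch {
  return { id: String(raw.id), prompt: String(raw.prompt) };
}

// Mirrors the fix: the optional field is validated and preserved when present.
function normalizePreservingSketch(raw: Record<string, unknown>): TaskDefinitionSketch {
  const task: TaskDefinitionSketch = { id: String(raw.id), prompt: String(raw.prompt) };
  const capabilityId = raw.capabilityId;
  if (capabilityId !== undefined) {
    if (typeof capabilityId !== "string" || capabilityId.trim() === "") {
      throw new Error(`task ${task.id}: capabilityId must be a non-empty string`);
    }
    task.capabilityId = capabilityId;
  }
  return task;
}

const rawTask = { id: "t1", prompt: "Summarize the text", capabilityId: "summarize" };
console.log(normalizeDroppingSketch(rawTask).capabilityId);
console.log(normalizePreservingSketch(rawTask).capabilityId);
```

Because the field is optional on the type, the dropping version type-checks cleanly, which is why the compiler gave no warning about the lost field.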

…coverage check

- Add capabilityId?: string to TaskDefinition and update normalizeTaskDefinition to read and pass it through. Without this, resolveCriteriaForTask threw on every prompt-surface task because loadTasks silently dropped the field.
- smoke-changelog-coverage: require ≥2 tokens to co-occur in a single test file (whole-word match) instead of any one token anywhere in the corpus. This prevents false passes on generic words like "prompt" or "coverage".
- generate.ts comment: clarify that parseGeneratedTasks attaches capabilityId; membership validation is in ground.ts, not here.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
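The tightened changelog check (at least two tokens, whole-word, in the same test file) can be sketched as below. How the real test extracts tokens from a CHANGELOG item is not shown in this PR, so the token lists and function names here are assumptions; test file contents are passed in as strings to avoid file I/O.

```typescript
// Escapes characters that have special meaning inside a regular expression.
function escapeRegExpSketch(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Counts how many distinct tokens appear as whole words in one file's text.
function tokensFoundInFile(tokens: readonly string[], fileText: string): number {
  const unique = Array.from(new Set(tokens.map((t) => t.toLowerCase())));
  return unique.filter((t) => new RegExp(`\\b${escapeRegExpSketch(t)}\\b`, "i").test(fileText)).length;
}

// An item is covered only if a single file contains at least minTokens tokens.
function changelogItemIsCovered(
  tokens: readonly string[],
  testFileTexts: readonly string[],
  minTokens = 2,
): boolean {
  return testFileTexts.some((text) => tokensFoundInFile(tokens, text) >= minTokens);
}

const itemTokens = ["prompt", "coverage"];
// Tokens split across two files: not covered under the co-occurrence rule.
console.log(changelogItemIsCovered(itemTokens, ["tests the prompt path", "tests coverage math"]));
// Both tokens in one file: covered.
console.log(changelogItemIsCovered(itemTokens, ["prompt coverage is not a veto"]));
```

Requiring co-occurrence in one file means a generic word matching in an unrelated test no longer counts as coverage on its own.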

- discover-prompt.ts: _output capabilities now store section: section.body (full markdown with fences) instead of section: snippet (extracted content without fences). generateCriteriaFromCapability requires fences to extract format patterns, so passing bare snippet produced empty criteria and forced noActiveCriteria/FAIL for every output-format task.
- coverage.ts: actionNamesOf falls back to capabilityId when expected_actions is empty. Prompt tasks always have expected_actions:[], so coverage showed 0/N covered. capabilityId matches action.name for prompt capabilities (key===name in capabilityToAction), so this correctly attributes coverage.
- Tests: regression guards added in smoke-prompt-criteria and smoke-coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
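The coverage fallback described above might be sketched like this. The types and the `coveredCountSketch` helper are hypothetical simplifications; only the fallback rule itself (use capabilityId when expected_actions is empty) comes from the commit message.

```typescript
// Hypothetical, simplified shapes for illustration.
interface ExpectedActionSketch { name: string }
interface CoverageTaskSketch {
  expected_actions: ExpectedActionSketch[];
  capabilityId?: string;
}

// Action names a task counts toward; prompt tasks fall back to capabilityId.
function actionNamesOfSketch(task: CoverageTaskSketch): string[] {
  if (task.expected_actions.length > 0) {
    return task.expected_actions.map((action) => action.name);
  }
  return task.capabilityId ? [task.capabilityId] : [];
}

// Number of in-scope action names exercised by at least one task.
function coveredCountSketch(tasks: readonly CoverageTaskSketch[], inScope: readonly string[]): number {
  const seen = new Set<string>();
  for (const task of tasks) {
    for (const name of actionNamesOfSketch(task)) seen.add(name);
  }
  return inScope.filter((name) => seen.has(name)).length;
}

const promptTasks: CoverageTaskSketch[] = [
  { expected_actions: [], capabilityId: "summarize" },
  { expected_actions: [], capabilityId: "translate" },
];
console.log(coveredCountSketch(promptTasks, ["summarize", "translate", "classify"]));
```

Without the fallback, both prompt tasks above would contribute no names and the count would be zero, which is the 0/N symptom the commit describes.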

* chore: ignore .worktrees/ directory

* feat: stable task IDs + optimizer loop diagram (#24)

* chore: ignore .worktrees/ directory

* feat: stable task IDs + optimizer loop diagram in README
- fix(tasks): derive task IDs from sha1(action names) instead of LLM-supplied id field, which changed on every regeneration and broke --task filters. Action names come from the discovered surface and are stable across runs when the surface hasn't changed. Duplicate action-name sets get a -1/-2 numeric suffix.
- docs: add horizontal optimizer-loop SVG diagram to README top, showing the full init → baseline → iterate (analyze/mutate/re-benchmark/accept-reject) → output flow at a glance.
Closes #17
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add PNG version of optimizer loop diagram
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: use SVG in README (PNG kept as companion file)
SVG renders natively on GitHub and scales without pixelation. PNG is included alongside as a companion for external use cases (email, Office docs, tools that don't render SVG).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Copilot review on PR #24
- Delete SAFE_TASK_ID / isSafeTaskId from src/tasks/generate.ts: dead after the stable-ID refactor (still used in src/benchmark/config.ts for external task-file validation, so that copy stays).
- Extend stable task IDs to the prompt surface: fall back to a SHA-1 hash of the prompt text when expected_actions is empty, so prompt-surface --task filters survive regeneration.
- Rewrite four README links (optimizer-loop.svg, docs/reference/*.md, CONTRIBUTING.md) to absolute github.com URLs. docs/ and CONTRIBUTING.md are not in the npm tarball's files field, so relative paths 404 when users view the README on npmjs.com.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: final 1.1.0 polish - CHANGELOG, error message, README consistency
- CHANGELOG.md: fill in all additions and fixes that landed on development after the initial 1.1.0 bump (stable task IDs, Codex auth, SKILL folder, diagram, model-ID slug overhaul, error message).
- src/errors.ts + docs/reference/errors.md: fix E_MODEL_ID_FORMAT. It was "missing the openrouter/ prefix"; it now lists all three valid provider prefixes (openrouter/, anthropic/, openai/).
- README.md: use catalog-correct openrouter/google/gemini-2.5-flash (dot, not hyphen) in answers.json example; change "skill-optimizer benchmark" to "skill-optimizer run" for consistency with other examples.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tasks): stable dedup suffix order + accurate CHANGELOG wording
Sort validated tasks by (id, prompt) before the dedup counter loop so that numeric suffixes assigned to same-action-hash tasks are determined by content order, not LLM output order. Previously, if the model regenerated two create_wallet tasks in swapped order, the -1/-2 suffixes would swap between runs, making --task filters unstable for multi-variant cases.
Also soften the CHANGELOG entry for "stable task IDs": clarifies that SDK/CLI/MCP IDs are stable across regenerations (action names come from discovered code), while prompt-surface IDs are only stable when the LLM produces identical wording.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update README with Fast team and payment info (#27)

* Update README with Fast team and payment info
Added information about the Fast team and payment infrastructure for AI agents. Requested by Jessy.

* Update README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------
Co-authored-by: dmn <damian.ovidiu27@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: prompt-surface correctness (P1/P2/P3/B/C5) (#28)

* fix: exempt openai/ model IDs from dot→hyphen rewrite
OpenAI's direct-API model IDs use dots in version numbers (gpt-5.4, gpt-4.1). fix.ts already exempted openrouter/ but not openai/, so a manufactured model-id-bad-format issue would corrupt an openai/ ID. Defense-in-depth: validate.ts already skips emitting the issue for openai/, but fix.ts must independently respect the documented invariant in CLAUDE.md. Also wires smoke-model-ids.ts into npm test; it existed on disk but was not in the test script.

* chore: remove unused src/discovery/prompt.ts and its tests
The active prompt discovery lives in src/project/discover-prompt.ts, imported by snapshot.ts and benchmark/runner.ts. src/discovery/prompt.ts was a dead parallel implementation, only referenced by its own test file. Removing both the dead module and its test file to keep the codebase lean. The live discover-prompt.ts API (discoverPromptCapabilities) is covered by other smoke tests in this PR.

* fix: prompt surface not blocked by coverage violation
Prompt-surface tasks don't guarantee 1:1 capability coverage the way SDK/CLI/MCP tasks do, so coverageViolation=true was hard-FAILing every prompt benchmark regardless of actual scores. computeVerdict now only appends the coverage-violation reason when config.surface !== 'prompt'. Coverage is still computed and appears in the report. Regression guard in new smoke-verdict-prompt.ts locks in both halves of the behavior: prompt PASSes with coverageViolation=true + scores above floor; mcp still FAILs under identical conditions.

* refactor: extract resolveCriteriaForTask from runner
Pure refactor. Moves the caps→criteria lookup out of runner.ts into src/benchmark/prompt-criteria.ts so it can be unit-tested without running the full LLM pipeline. Behavior is unchanged in this commit: tasks missing capabilityId are logged as eval errors (FAIL with message) rather than silently vacuously passing; capabilityId tagging by the generator lands in a later commit. Adds optional capabilityId to GeneratedTask (SDK/CLI/MCP generators don't set it). Runtime enforcement (throw on missing/unknown) lives in resolveCriteriaForTask, with no silent fallback, per the no-legacy-compat policy. New smoke-prompt-criteria.ts locks in: match, distinct-per-capability, throws-on-unknown, throws-on-missing, and noActiveCriteria flagging.

* fix: evaluator flags noActiveCriteria instead of vacuous 1.0 pass
Previously, when every criteria category was empty the evaluator returned score: 1.0, so any response (including an empty string) scored a perfect pass. Now the evaluator returns score: 0 and noActiveCriteria: true. The runner treats that flag as an evaluation error with an actionable message pointing at the SKILL.md section for the offending capability. Evaluator stays dumb (no pass/fail policy). Runner is the policy layer.

* feat: per-capability prompt scoring via capabilityId
The runner's caps[0] global was collapsing every prompt-surface task into evaluation against the first discovered capability regardless of what the task actually exercised. With this commit, each generated prompt-surface task is tagged at generation time with the action key of the capability it exercises, and the runner looks up criteria per-task via resolveCriteriaForTask (wired in previous commits). No legacy compat. Prompt-surface tasks lacking capabilityId fail to load; users regenerate with `skill-optimizer generate-tasks`.
Regression guards:
- smoke-generation.ts: valid tagging plus rejection of unknown ids
- smoke-verdict-prompt.ts: three caps produce distinct criteria (caps[0]-collapse detector), P3 regression guard via evaluator, and mock-LLM verdict matrix (threshold + weight math)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: v1.1.0 correctness fixes
CHANGELOG gets a Fixed block covering P1/P2/P3/Bug B/C5 for v1.1.0. README prompt templates section reflects per-capability scoring. SKILL.md audit for any guidance contradicting the fixed behavior.

* test: release-readiness coverage check
smoke-changelog-coverage.ts parses the top block of CHANGELOG.md and asserts every item in Added/Fixed has at least one test file referencing relevant keywords. Guards against 'shipped feature, forgot the test', the class that let P1/P2/P3 slip past v1.1.0 before this PR. smoke-release.ts also gains an assertion that the CHANGELOG contains a section header matching the current package.json version.

* fix: plumb capabilityId through TaskDefinition and tighten changelog coverage check
- Add capabilityId?: string to TaskDefinition and update normalizeTaskDefinition to read and pass it through. Without this, resolveCriteriaForTask threw on every prompt-surface task because loadTasks silently dropped the field.
- smoke-changelog-coverage: require ≥2 tokens to co-occur in a single test file (whole-word match) instead of any one token anywhere in the corpus. This prevents false passes on generic words like "prompt" or "coverage".
- generate.ts comment: clarify that parseGeneratedTasks attaches capabilityId; membership validation is in ground.ts, not here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: output capability criteria and prompt coverage computation
- discover-prompt.ts: _output capabilities now store section: section.body (full markdown with fences) instead of section: snippet (extracted content without fences). generateCriteriaFromCapability requires fences to extract format patterns, so passing bare snippet produced empty criteria and forced noActiveCriteria/FAIL for every output-format task.
- coverage.ts: actionNamesOf falls back to capabilityId when expected_actions is empty. Prompt tasks always have expected_actions:[], so coverage showed 0/N covered. capabilityId matches action.name for prompt capabilities (key===name in capabilityToAction), so this correctly attributes coverage.
- Tests: regression guards added in smoke-prompt-criteria and smoke-coverage.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Xiaohong Chen <xiaohong.chen@pi2.network>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
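The "stable dedup suffix order" fix in the commit list above (sort by (id, prompt) before assigning numeric suffixes) can be illustrated with the sketch below. It takes the hash-derived ids as given rather than computing SHA-1, and it assumes every member of a duplicate group receives a suffix starting at -1; the source only says duplicates get a -1/-2 suffix, so that detail and all names here are hypothetical.

```typescript
// Hypothetical task shape: id is assumed to be the hash-derived base id.
interface IdTaskSketch { id: string; prompt: string }

// Plain code-unit string comparison, so ordering does not depend on locale.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function assignStableSuffixesSketch(tasks: readonly IdTaskSketch[]): IdTaskSketch[] {
  // Sort by content first so suffixes never depend on LLM output order.
  const sorted = [...tasks].sort(
    (a, b) => compareStrings(a.id, b.id) || compareStrings(a.prompt, b.prompt),
  );
  const totals = new Map<string, number>();
  for (const task of sorted) totals.set(task.id, (totals.get(task.id) ?? 0) + 1);

  const seen = new Map<string, number>();
  return sorted.map((task) => {
    if ((totals.get(task.id) ?? 0) < 2) return { ...task }; // unique id: no suffix
    const n = (seen.get(task.id) ?? 0) + 1;
    seen.set(task.id, n);
    return { ...task, id: `${task.id}-${n}` };
  });
}

const runA = [
  { id: "abc123", prompt: "Create a wallet named B" },
  { id: "abc123", prompt: "Create a wallet named A" },
];
const runB = [...runA].reverse(); // same tasks, swapped order
console.log(JSON.stringify(assignStableSuffixesSketch(runA)) === JSON.stringify(assignStableSuffixesSketch(runB)));
```

Sorting on a copy keeps the caller's array untouched, and comparing by content is what makes the suffix assignment reproducible across regenerations that only reorder tasks.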

* chore: ignore .worktrees/ directory * feat: stable task IDs + optimizer loop diagram (#24) * chore: ignore .worktrees/ directory * feat: stable task IDs + optimizer loop diagram in README - fix(tasks): derive task IDs from sha1(action names) instead of LLM-supplied id field, which changed on every regeneration and broke --task filters. Action names come from the discovered surface and are stable across runs when the surface hasn't changed. Duplicate action-name sets get a -1/-2 numeric suffix. - docs: add horizontal optimizer-loop SVG diagram to README top, showing the full init → baseline → iterate (analyze/mutate/ re-benchmark/accept-reject) → output flow at a glance. Closes #17 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: add PNG version of optimizer loop diagram Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: use SVG in README (PNG kept as companion file) SVG renders natively on GitHub and scales without pixelation. PNG is included alongside as a companion for external use cases (email, Office docs, tools that don't render SVG). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address Copilot review on PR #24 - Delete SAFE_TASK_ID / isSafeTaskId from src/tasks/generate.ts — dead after the stable-ID refactor (still used in src/benchmark/config.ts for external task-file validation, so that copy stays). - Extend stable task IDs to the prompt surface: fall back to a SHA-1 hash of the prompt text when expected_actions is empty, so prompt-surface --task filters survive regeneration. - Rewrite four README links (optimizer-loop.svg, docs/reference/*.md, CONTRIBUTING.md) to absolute github.com URLs — docs/ and CONTRIBUTING.md are not in the npm tarball's files field, so relative paths 404 when users view the README on npmjs.com. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: final 1.1.0 polish — CHANGELOG, error message, README consistency - CHANGELOG.md: fill in all additions and fixes that landed on development after the initial 1.1.0 bump (stable task IDs, Codex auth, SKILL folder, diagram, model-ID slug overhaul, error message). - src/errors.ts + docs/reference/errors.md: fix E_MODEL_ID_FORMAT — was "missing the openrouter/ prefix"; now lists all three valid provider prefixes (openrouter/, anthropic/, openai/). - README.md: use catalog-correct openrouter/google/gemini-2.5-flash (dot, not hyphen) in answers.json example; change "skill-optimizer benchmark" to "skill-optimizer run" for consistency with other examples. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(tasks): stable dedup suffix order + accurate CHANGELOG wording Sort validated tasks by (id, prompt) before the dedup counter loop so that numeric suffixes assigned to same-action-hash tasks are determined by content order, not LLM output order. Previously, if the model regenerated two create_wallet tasks in swapped order, the -1/-2 suffixes would swap between runs, making --task filters unstable for multi-variant cases. Also soften the CHANGELOG entry for "stable task IDs": clarifies that SDK/CLI/MCP IDs are stable across regenerations (action names come from discovered code), while prompt-surface IDs are only stable when the LLM produces identical wording. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Update README with Fast team and payment info (#27) * Update README with Fast team and payment info Added information about the Fast team and payment infrastructure for AI agents. Requested by Jessy. 
* Update README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: dmn <damian.ovidiu27@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix: prompt-surface correctness (P1/P2/P3/B/C5) (#28) * fix: exempt openai/ model IDs from dot→hyphen rewrite OpenAI's direct-API model IDs use dots in version numbers (gpt-5.4, gpt-4.1). fix.ts already exempted openrouter/ but not openai/, so a manufactured model-id-bad-format issue would corrupt an openai/ ID. Defense-in-depth: validate.ts already skips emitting the issue for openai/, but fix.ts must independently respect the documented invariant in CLAUDE.md. Also wires smoke-model-ids.ts into npm test — it existed on disk but was not in the test script. * chore: remove unused src/discovery/prompt.ts and its tests The active prompt discovery lives in src/project/discover-prompt.ts, imported by snapshot.ts and benchmark/runner.ts. src/discovery/prompt.ts was a dead parallel implementation — only referenced by its own test file. Removing both the dead module and its test file to keep the codebase lean. The live discover-prompt.ts API (discoverPromptCapabilities) is covered by other smoke tests in this PR. * fix: prompt surface not blocked by coverage violation Prompt-surface tasks don't guarantee 1:1 capability coverage the way SDK/CLI/MCP tasks do, so coverageViolation=true was hard-FAILing every prompt benchmark regardless of actual scores. computeVerdict now only appends the coverage-violation reason when config.surface !== 'prompt'. Coverage is still computed and appears in the report. Regression guard in new smoke-verdict-prompt.ts locks in both halves of the behavior: prompt PASSes with coverageViolation=true + scores above floor; mcp still FAILs under identical conditions. * refactor: extract resolveCriteriaForTask from runner Pure refactor. 
Moves the caps→criteria lookup out of runner.ts into src/benchmark/prompt-criteria.ts so it can be unit-tested without running the full LLM pipeline. Behavior is unchanged in this commit — tasks missing capabilityId are logged as eval errors (FAIL with message) rather than silently vacuously passing; capabilityId tagging by the generator lands in a later commit.

Adds optional capabilityId to GeneratedTask (SDK/CLI/MCP generators don't set it). Runtime enforcement (throw on missing/unknown) lives in resolveCriteriaForTask — no silent fallback, per the no-legacy-compat policy.

New smoke-prompt-criteria.ts locks in: match, distinct-per-capability, throws-on-unknown, throws-on-missing, and noActiveCriteria flagging.

* fix: evaluator flags noActiveCriteria instead of vacuous 1.0 pass

Previously, when every criteria category was empty the evaluator returned score: 1.0 — any response (including an empty string) scored a perfect pass. Now the evaluator returns score: 0 and noActiveCriteria: true. The runner treats that flag as an evaluation error with an actionable message pointing at the SKILL.md section for the offending capability.

Evaluator stays dumb (no pass/fail policy). Runner is the policy layer.

* feat: per-capability prompt scoring via capabilityId

The runner's caps[0] global was collapsing every prompt-surface task into evaluation against the first discovered capability regardless of what the task actually exercised. With this commit, each generated prompt-surface task is tagged at generation time with the action key of the capability it exercises, and the runner looks up criteria per-task via resolveCriteriaForTask (wired in previous commits).

No legacy compat. Prompt-surface tasks lacking capabilityId fail to load; users regenerate with `skill-optimizer generate-tasks`.
Regression guards:
- smoke-generation.ts: valid tagging plus rejection of unknown ids
- smoke-verdict-prompt.ts: three caps produce distinct criteria (caps[0]-collapse detector), P3 regression guard via evaluator, and mock-LLM verdict matrix (threshold + weight math)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: v1.1.0 correctness fixes

CHANGELOG gets a Fixed block covering P1/P2/P3/Bug B/C5 for v1.1.0. README prompt templates section reflects per-capability scoring. SKILL.md audit for any guidance contradicting the fixed behavior.

* test: release-readiness coverage check

smoke-changelog-coverage.ts parses the top block of CHANGELOG.md and asserts every item in Added/Fixed has at least one test file referencing relevant keywords. Guards against 'shipped feature, forgot the test' — the class that let P1/P2/P3 slip past v1.1.0 before this PR.

smoke-release.ts also gains an assertion that the CHANGELOG contains a section header matching the current package.json version.

* fix: plumb capabilityId through TaskDefinition and tighten changelog coverage check

- Add capabilityId?: string to TaskDefinition and update normalizeTaskDefinition to read and pass it through — without this, resolveCriteriaForTask threw on every prompt-surface task because loadTasks silently dropped the field
- smoke-changelog-coverage: require ≥2 tokens to co-occur in a single test file (whole-word match) instead of any one token anywhere in the corpus — prevents false-passes on generic words like "prompt" or "coverage"
- generate.ts comment: clarify that parseGeneratedTasks attaches capabilityId; membership validation is in ground.ts, not here

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: output capability criteria and prompt coverage computation

- discover-prompt.ts: _output capabilities now store section: section.body (full markdown with fences) instead of section: snippet (extracted content without fences).
  generateCriteriaFromCapability requires fences to extract format patterns, so passing bare snippet produced empty criteria and forced noActiveCriteria/FAIL for every output-format task.
- coverage.ts: actionNamesOf falls back to capabilityId when expected_actions is empty — prompt tasks always have expected_actions:[] so coverage showed 0/N covered. capabilityId matches action.name for prompt capabilities (key===name in capabilityToAction), so this correctly attributes coverage.
- Tests: regression guards added in smoke-prompt-criteria and smoke-coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address PR #26 review findings (prompt surface, docs, precision, error quality) (#30)

* docs: add implementation plan for PR #26 review fixes (13 issues)

* fix(preflight): exempt prompt surface from maxTasks check; surface-aware discovery hints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(init): add prompt surface next-steps guidance

* fix(wizard): accept anthropic/ and openai/ model IDs in custom model validator

* fix(generate): allow missing expected_actions on prompt surface in validateTask

When `knownCapabilityKeys` is defined (prompt surface), LLMs may omit `expected_actions` entirely even though the prompt requests an empty array. Fall back to `[]` instead of throwing, so task generation is not blocked.

* fix(docs): correct config path to .skill-optimizer/ and update stale model ID

Replace all occurrences of `skill-optimizer/skill-optimizer.json` (without dot) with `.skill-optimizer/skill-optimizer.json` (with dot) to match the actual path written by `src/init/scaffold.ts`. Also update stale `openrouter/openai/gpt-4o` model ID in `SKILL/references/setup.md` to `openrouter/openai/gpt-4o-mini`.
* fix(snapshot): include snapshotPath in unsupported-format error message

* fix(runner): set toolPrecision=1.0 for prompt surface tasks

* fix(docs): correct apiKeyEnv description and loop.ts agent cwd comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(generate): guard generateCandidateTasksWithCoverage against prompt surface

* fix(tasks): replace brittle string match with NoTextBlocksError class

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: OpenClaw Agent (basd) <basd@openclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Xiaohong Chen <xiaohong.chen@pi2.network>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
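
For readers skimming the commit log above, the P2 and P3 fixes can be sketched as follows. This is a minimal illustration, not the code in src/benchmark/prompt-criteria.ts: the `Criteria` and `PromptCapability` shapes, the `mustContain`/`mustNotContain` categories, the `evaluate` name, and the error wording are all assumptions. Only the names `resolveCriteriaForTask`, `capabilityId`, `GeneratedTask`, and `noActiveCriteria` come from the commit messages.

```typescript
// Hypothetical shapes for illustration; the real skill-optimizer types may differ.
interface Criteria {
  mustContain: string[];
  mustNotContain: string[];
}

interface PromptCapability {
  key: string; // action key the generator tags tasks with
  criteria: Criteria;
}

interface GeneratedTask {
  id: string;
  prompt: string;
  capabilityId?: string; // optional: SDK/CLI/MCP generators don't set it
}

interface EvalResult {
  score: number;
  noActiveCriteria: boolean;
}

// Per-task lookup: throws on missing or unknown ids, never falls back to caps[0].
function resolveCriteriaForTask(task: GeneratedTask, caps: PromptCapability[]): Criteria {
  if (!task.capabilityId) {
    throw new Error('Task "' + task.id + '" has no capabilityId');
  }
  for (const cap of caps) {
    if (cap.key === task.capabilityId) return cap.criteria;
  }
  throw new Error('Task "' + task.id + '" references unknown capability "' + task.capabilityId + '"');
}

// The evaluator only reports; deciding that the flag means FAIL is left to the caller.
function evaluate(response: string, criteria: Criteria): EvalResult {
  const total = criteria.mustContain.length + criteria.mustNotContain.length;
  if (total === 0) {
    // Empty criteria must not yield a vacuous perfect score.
    return { score: 0, noActiveCriteria: true };
  }
  const hits =
    criteria.mustContain.filter((s) => response.indexOf(s) !== -1).length +
    criteria.mustNotContain.filter((s) => response.indexOf(s) === -1).length;
  return { score: hits / total, noActiveCriteria: false };
}
```

Under these assumptions, a runner would call `resolveCriteriaForTask` once per task and treat `noActiveCriteria: true` as an evaluation error rather than a score, which keeps the pass/fail policy out of the evaluator.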

OpenClaw Agent (basd) and others added 9 commits April 17, 2026 01:18

docs: v1.1.0 correctness fixes

179f97e

CHANGELOG gets a Fixed block covering P1/P2/P3/Bug B/C5 for v1.1.0. README prompt templates section reflects per-capability scoring. SKILL.md audit for any guidance contradicting the fixed behavior.
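
The Fixed block added here is what smoke-changelog-coverage.ts later checks. A rough sketch of the tightened rule from the commit log (at least two tokens co-occurring, as whole words, in a single test file) might look like the following. The function names, the `Record<string, string>` corpus shape, and the case-insensitive matching are assumptions for illustration, and tokens are assumed to start and end with word characters.

```typescript
// Escape regex metacharacters so a token is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Whole-word, case-insensitive match of one token (case-insensitivity is an assumption).
function containsToken(text: string, token: string): boolean {
  return new RegExp("\\b" + escapeRegExp(token) + "\\b", "i").test(text);
}

// An item counts as covered only when at least `minTokens` of its tokens
// appear in the SAME test file, not scattered across the corpus.
// `testFiles` maps a hypothetical file path to that file's contents.
function isCovered(
  tokens: string[],
  testFiles: Record<string, string>,
  minTokens: number = 2
): boolean {
  return Object.keys(testFiles).some((path) => {
    const hits = tokens.filter((t) => containsToken(testFiles[path], t)).length;
    return hits >= minTokens;
  });
}
```

With this rule, a CHANGELOG item whose only matches are one generic word in each of two different files is reported as uncovered, which is the false-pass the commit message describes.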

Merge branch 'development' into fix/prompt-surface-correctness

3a5ae71

damienen requested a review from Copilot April 17, 2026 04:22

Copilot started reviewing on behalf of damienen April 17, 2026 04:22

Copilot AI reviewed Apr 17, 2026

OpenClaw Agent (basd) and others added 2 commits April 17, 2026 04:42

damienen merged commit 39869c4 into development Apr 17, 2026
3 checks passed

damienen deleted the fix/prompt-surface-correctness branch April 17, 2026 05:23

damienen merged 11 commits into development from fix/prompt-surface-correctness


2 participants

	// can validate and attach capabilityId to each task.
	// can attach capabilityId metadata; membership validation is handled later.
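
The reworded comment separates two responsibilities: parsing attaches `capabilityId` as metadata, and a later grounding step checks that it names a known capability. A minimal sketch of that split, assuming hypothetical `parseTask` and `groundTasks` helpers (the real functions in generate.ts and ground.ts may be shaped differently), could look like this. It also shows the prompt-surface fallback to an empty `expected_actions` mentioned in the commit log.

```typescript
// Hypothetical shapes for illustration only.
interface RawTask {
  id: string;
  prompt: string;
  capabilityId?: unknown;
  expected_actions?: unknown;
}

interface ParsedTask {
  id: string;
  prompt: string;
  capabilityId?: string;
  expected_actions: string[];
}

// Parse step: attach capabilityId as metadata; no membership check happens here.
function parseTask(raw: RawTask, isPromptSurface: boolean): ParsedTask {
  let actions: string[];
  if (Array.isArray(raw.expected_actions)) {
    actions = raw.expected_actions.map(String);
  } else if (isPromptSurface) {
    actions = []; // prompt surface: tolerate an omitted expected_actions
  } else {
    throw new Error('Task "' + raw.id + '" is missing expected_actions');
  }
  const parsed: ParsedTask = { id: raw.id, prompt: raw.prompt, expected_actions: actions };
  if (typeof raw.capabilityId === "string") {
    parsed.capabilityId = raw.capabilityId;
  }
  return parsed;
}

// Grounding step: reject tasks whose capabilityId is absent or not a known key.
function groundTasks(tasks: ParsedTask[], knownKeys: string[]): ParsedTask[] {
  for (const task of tasks) {
    if (!task.capabilityId || knownKeys.indexOf(task.capabilityId) === -1) {
      throw new Error('Task "' + task.id + '" has an unknown or missing capabilityId');
    }
  }
  return tasks;
}
```

Keeping the membership check out of the parser means a malformed id surfaces in one place, with the list of known keys at hand for the error message.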
