Add Docker skill eval workbench#40

Merged
bucurdavid merged 18 commits into development from refactor/evals
May 2, 2026

Conversation

Collaborator

@bucurdavid bucurdavid commented Apr 29, 2026

Summary

  • Replace the old benchmark/optimizer stack with the Docker workbench CLI for cases, suites, graders, traces, trials, and reference-solution verification.
  • Add example workbench suites and package the canonical skill-optimizer skill.
  • Add plugin distribution for Claude Code, OpenCode, Codex, Cursor, and Gemini.

Test Plan

  • npm test
  • npm run typecheck
  • npm pack --dry-run --json package verifier

@bucurdavid bucurdavid marked this pull request as draft May 1, 2026 07:27
@bucurdavid bucurdavid marked this pull request as ready for review May 1, 2026 07:27
@bucurdavid bucurdavid requested a review from Copilot May 1, 2026 21:28
Contributor

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Clarify command-skill eval guidance and keep packaged examples from drifting from the supported workbench schema.
Validate standalone run-case model refs early and keep partially-started MCP service containers visible to cleanup.
@bucurdavid bucurdavid merged commit 5da0655 into development May 2, 2026
3 checks passed
@bucurdavid bucurdavid deleted the refactor/evals branch May 2, 2026 06:04

2 participants