Benchmark and self-optimize SDK, CLI, and MCP guidance so every agent model can use your tool reliably.
skill-optimizer runs your SDK / CLI / MCP docs against multiple LLMs, measures whether they call the right actions with the right arguments, and iteratively rewrites your SKILL.md / docs until a floor score is met across every model.
```bash
git clone https://github.com/bucurdavid/skill-optimizer
cd skill-optimizer
npm install
export OPENROUTER_API_KEY=sk-or-...
# Scaffold config for your surface type (sdk | cli | mcp)
npx tsx src/cli.ts init cli
```

`init cli` creates a `skill-optimizer/` directory with:

- `skill-optimizer.json`: the main config (task generation enabled by default)
- `cli-commands.json`: command manifest template (used as fallback if code-first discovery finds nothing)

`init sdk` creates `skill-optimizer.json` only. `init mcp` creates `skill-optimizer.json` and `tools.json`.
Open skill-optimizer/skill-optimizer.json and fill in these fields:
| Field | What it does | Set it to |
|---|---|---|
| `target.repoPath` | Root of the project being benchmarked | Absolute or relative path to your repo |
| `target.discovery.sources` | Source files to scan for callable methods/commands/tools | e.g. `["../src/index.ts"]` or `["../src/server.ts"]` |
| `target.skill` | Docs file the optimizer will edit | Path to your `SKILL.md` or equivalent guidance doc |
| `benchmark.models` | Models to benchmark | Valid OpenRouter model IDs |
For CLI and MCP surfaces: if code-first discovery yields nothing, edit the companion manifest (cli-commands.json or tools.json) with your real commands/tools — the config already points to it as a fallback.
Tasks are generated automatically from your discovered surface — you don't need to write them manually.
Then run the benchmark:
```bash
npx tsx src/cli.ts run --config ./skill-optimizer/skill-optimizer.json
```

How the pipeline works:

- Discover callable surface (SDK methods / CLI commands / MCP tools) via tree-sitter or a manifest.
- Scope the surface with `target.scope.include` / `target.scope.exclude` globs.
- Generate tasks: one prompt per in-scope action, coverage-guaranteed.
- Benchmark: every configured model attempts every task; a static evaluator checks action calls and arguments.
- Verdict: PASS/FAIL against two gates (per-model floor, weighted average).
- Optimize: mutate `SKILL.md` / docs inside `allowedPaths`, re-benchmark, accept only if both gates hold, roll back if not.
- Recommendations: on FAIL, one critic call summarizes what to improve manually.
All configuration lives in a single skill-optimizer.json file.
`target` fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `surface` | `"sdk" \| "cli" \| "mcp"` | required | Type of callable surface |
| `repoPath` | `string` | `.` | Path to the target repo |
| `skill` | `string \| { source: string; cache?: boolean }` | none | Path to SKILL.md |
| `discovery.mode` | `"auto" \| "manifest"` | `"auto"` | How to discover actions |
| `discovery.sources` | `string[]` | none | Source files for tree-sitter discovery |
| `discovery.language` | `"typescript" \| "python" \| "rust"` | none | Language for code-first discovery |
| `discovery.fallbackManifest` | `string` | none | Path to manifest JSON when code-first discovery is incomplete |
| `sdk.language` | `"typescript" \| "python" \| "rust"` | none | SDK language |
| `sdk.entrypoints` | `string[]` | none | SDK entry files |
| `cli.commands` | `string` | none | Path to CLI commands manifest JSON |
| `mcp.tools` | `string` | none | Path to MCP tools manifest JSON |
`benchmark` fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `format` | `"pi"` | `"pi"` | LLM transport format |
| `apiKeyEnv` | `string` | `OPENROUTER_API_KEY` | Env var name for the API key |
| `timeout` | `number` | `240000` | Timeout per model call, in milliseconds |
| `models` | `Array<{ id: string; name: string; tier: "flagship"\|"mid"\|"low"; weight?: number }>` | required | Models to benchmark |
| `taskGeneration.enabled` | `boolean` | `false` | Whether to generate tasks automatically |
| `taskGeneration.maxTasks` | `number` | `10` | Max tasks to generate (must be >= scope size) |
| `taskGeneration.seed` | `number` | `1` | RNG seed for reproducible generation |
| `output.dir` | `string` | `benchmark-results/` | Where reports are saved |
| `verdict.perModelFloor` | `number` | `0.6` | Minimum per-model pass fraction for a PASS verdict |
| `verdict.targetWeightedAverage` | `number` | `0.7` | Minimum weighted average across all models for a PASS verdict |
`optimize` fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `model` | `string` | none | Model for mutation (e.g. `openrouter/anthropic/claude-sonnet-4-6`) |
| `apiKeyEnv` | `string` | none | Env var for the optimizer's API key |
| `thinkingLevel` | `"off"\|"minimal"\|"low"\|"medium"\|"high"\|"xhigh"` | `"medium"` | Reasoning depth for mutation calls |
| `allowedPaths` | `string[]` | none | Paths the optimizer may edit (safety boundary) |
| `validation` | `string[]` | none | Shell commands to run to validate each mutation |
| `requireCleanGit` | `boolean` | `true` | Require clean git state before starting |
| `maxIterations` | `number` | `5` | Maximum optimization iterations |
| `minImprovement` | `number` | `0.02` | Minimum weighted-average gain per accepted iteration |
| `reportContextMaxBytes` | `number` | `16000` | Byte budget for mutation context |
```json
{
  "name": "my-mcp-project",
  "target": {
    "surface": "mcp",
    "repoPath": ".",
    "skill": "./SKILL.md",
    "discovery": {
      "mode": "auto",
      "sources": ["./src/server.ts"]
    }
  },
  "benchmark": {
    "format": "pi",
    "apiKeyEnv": "OPENROUTER_API_KEY",
    "models": [
      { "id": "openrouter/anthropic/claude-sonnet-4-6", "name": "Claude Sonnet", "tier": "flagship", "weight": 2 },
      { "id": "openrouter/openai/gpt-4o-mini", "name": "GPT-4o mini", "tier": "mid", "weight": 1 }
    ],
    "taskGeneration": {
      "enabled": true,
      "maxTasks": 20,
      "seed": 1
    },
    "output": { "dir": "./benchmark-results" },
    "verdict": {
      "perModelFloor": 0.6,
      "targetWeightedAverage": 0.7
    }
  },
  "optimize": {
    "model": "openrouter/anthropic/claude-sonnet-4-6",
    "apiKeyEnv": "OPENROUTER_API_KEY",
    "thinkingLevel": "medium",
    "allowedPaths": ["SKILL.md"],
    "requireCleanGit": true,
    "maxIterations": 5,
    "minImprovement": 0.02,
    "reportContextMaxBytes": 16000
  }
}
```

Every benchmark run produces one of two verdicts: PASS or FAIL.
Two gates must both be satisfied for a PASS:
- `benchmark.verdict.perModelFloor` (default `0.6`): every model must pass at least this fraction of tasks. A single model below the floor fails the run, regardless of the average.
- `benchmark.verdict.targetWeightedAverage` (default `0.7`): the weighted average score across all models must reach this threshold.

benchmark.models[].weight (default 1.0): heavier-weighted models count more toward the weighted average. Use higher weights for flagship models you care most about.
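
As a rough illustration of how the two gates combine, here is a minimal sketch. The `ModelResult` and `Verdict` shapes and the `computeVerdict` function are hypothetical names invented for this example; they are not taken from the tool's source, and the real evaluator may compute scores differently.

```typescript
// Hypothetical shapes: illustrative only, not the tool's internal types.
interface ModelResult {
  id: string;
  passed: number;  // tasks the model passed
  total: number;   // tasks the model attempted
  weight?: number; // falls back to 1.0, mirroring benchmark.models[].weight
}

interface Verdict {
  verdict: "PASS" | "FAIL";
  weightedAverage: number;
  belowFloor: string[]; // ids of models under perModelFloor
}

function computeVerdict(
  results: ModelResult[],
  perModelFloor = 0.6,
  targetWeightedAverage = 0.7,
): Verdict {
  let weightedSum = 0;
  let weightTotal = 0;
  const belowFloor: string[] = [];

  for (const r of results) {
    const score = r.total === 0 ? 0 : r.passed / r.total;
    const weight = r.weight ?? 1.0;
    weightedSum += score * weight;
    weightTotal += weight;
    // Gate 1: any single model under the floor fails the run.
    if (score < perModelFloor) belowFloor.push(r.id);
  }

  // Gate 2: the weighted average must reach the target.
  const weightedAverage = weightTotal === 0 ? 0 : weightedSum / weightTotal;
  const pass = belowFloor.length === 0 && weightedAverage >= targetWeightedAverage;
  return { verdict: pass ? "PASS" : "FAIL", weightedAverage, belowFloor };
}

// The weighted average here clears 0.7, but "small" sits under the 0.6 floor.
const example = computeVerdict([
  { id: "flagship", passed: 10, total: 10, weight: 2 },
  { id: "small", passed: 5, total: 10, weight: 1 },
]);
console.log(example.verdict, example.belowFloor);
```

In this example a heavily weighted flagship model cannot rescue a weaker model: the per-model floor is checked independently of the average.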
The optimizer only accepts a mutation when:
- the weighted average improves by at least `minImprovement`, AND
- no model that was above the floor drops below it.

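
The acceptance rule above can be sketched as a small predicate. `RunSnapshot` and `shouldAcceptMutation` are hypothetical names for this example, and treating a model sitting exactly at the floor as "above" it is an assumption, since the text does not say how that boundary is handled.

```typescript
// Hypothetical snapshot of one benchmark run; field names are illustrative.
interface RunSnapshot {
  weightedAverage: number;
  perModel: Record<string, number>; // model id -> pass fraction
}

function shouldAcceptMutation(
  before: RunSnapshot,
  after: RunSnapshot,
  minImprovement = 0.02,
  perModelFloor = 0.6,
): boolean {
  // Condition 1: the weighted average must gain at least minImprovement.
  const gain = after.weightedAverage - before.weightedAverage;
  if (gain < minImprovement) return false;

  // Condition 2: no model that met the floor before may fall under it.
  // (Assumption: "above the floor" includes a score exactly at the floor.)
  for (const [id, prev] of Object.entries(before.perModel)) {
    const next = after.perModel[id] ?? 0;
    if (prev >= perModelFloor && next < perModelFloor) return false;
  }
  return true;
}

const baseline: RunSnapshot = { weightedAverage: 0.7, perModel: { a: 0.8, b: 0.65 } };
const regressed: RunSnapshot = { weightedAverage: 0.75, perModel: { a: 0.9, b: 0.55 } };
const improved: RunSnapshot = { weightedAverage: 0.75, perModel: { a: 0.85, b: 0.65 } };

console.log(shouldAcceptMutation(baseline, regressed)); // average rose, but b fell under the floor
console.log(shouldAcceptMutation(baseline, improved));
```

A rejected mutation would then be rolled back, leaving the previous docs in place.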
Exit codes: 0 = PASS, 1 = FAIL — usable directly in CI pipelines.
Control which actions are benchmarked with target.scope:
- `target.scope.include` (default `["*"]`): glob patterns for actions to include.
- `target.scope.exclude` (default `[]`): glob patterns for actions to exclude.

The * wildcard matches any sequence of characters including dots and slashes — it is not limited to a single path segment.
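
One way to picture the wildcard rule is a glob-to-regex translation in which `*` becomes `.*`. This is only a sketch: `globToRegExp` and `inScope` are hypothetical helpers, every character other than `*` is assumed to be literal, and exclusion is assumed to take precedence over inclusion. The tool's actual matcher may differ on any of these points.

```typescript
// Escape regex metacharacters (everything except "*"), then expand each "*"
// to ".*" so it spans dots and slashes as well as ordinary characters.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Assumption: an action is in scope when it matches some include pattern
// and no exclude pattern.
function inScope(
  action: string,
  include: string[] = ["*"],
  exclude: string[] = [],
): boolean {
  const matches = (patterns: string[]) =>
    patterns.some((p) => globToRegExp(p).test(action));
  return matches(include) && !matches(exclude);
}

const actions = ["Wallet.send", "Wallet.internalSign", "get_balance", "tools/get_quote"];
const selected = actions.filter((a) =>
  inScope(a, ["Wallet.*", "get_*"], ["*.internal*"]),
);
console.log(selected); // ["Wallet.send", "get_balance"]
```

Because patterns are anchored at both ends, `get_*` does not pick up `tools/get_quote`; a pattern such as `*get_*` would.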
Examples:

- `"Wallet.*"`: includes all Wallet methods
- `"*.internal*"`: excludes anything with "internal" anywhere in the name
- `"get_*"`: includes only getter actions

Task generation is coverage-guaranteed: every in-scope action gets at least one task. If the first generation pass misses any, a targeted retry runs (max 2 iterations). If coverage still fails, an error names the uncovered actions and suggests either fixing SKILL.md guidance or adding them to scope.exclude.
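
The coverage guarantee can be pictured as a generate, check, retry loop. The sketch below is an illustration under stated assumptions: `Task`, `Generator`, and `generateWithCoverage` are hypothetical names, the generator is synchronous for brevity (a real LLM-backed one would be async), and "max 2 iterations" is read here as at most two targeted retries.

```typescript
// Hypothetical task shape: each generated task targets exactly one action.
interface Task {
  prompt: string;
  action: string;
}

// Stand-in for the LLM-backed generator.
type Generator = (actions: string[]) => Task[];

function generateWithCoverage(
  actions: string[],
  generate: Generator,
  maxRetries = 2,
): Task[] {
  const tasks = generate(actions);

  // Actions that no task targets yet.
  const uncovered = (): string[] => {
    const covered = new Set(tasks.map((t) => t.action));
    return actions.filter((a) => !covered.has(a));
  };

  // Targeted retries: ask only for the actions still missing.
  for (let attempt = 0; attempt < maxRetries && uncovered().length > 0; attempt++) {
    tasks.push(...generate(uncovered()));
  }

  const missing = uncovered();
  if (missing.length > 0) {
    throw new Error(
      `Uncovered actions: ${missing.join(", ")}. ` +
        "Improve SKILL.md guidance or add them to scope.exclude.",
    );
  }
  return tasks;
}

// A generator that skips action "b" on its first call only.
let calls = 0;
const flaky: Generator = (acts) => {
  calls++;
  return acts
    .filter((a) => !(calls === 1 && a === "b"))
    .map((a) => ({ prompt: `Use ${a}`, action: a }));
};

const generated = generateWithCoverage(["a", "b"], flaky);
console.log(generated.map((t) => t.action), calls);
```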
Rough LLM spend per run:
- Baseline benchmark: N models × M tasks LLM calls.
- Optimizer iteration: 1 mutation call + N models × M tasks re-benchmark per iteration.
- Recommendations: 1 critic call, only on FAIL verdict.
No per-failure LLM calls — feedback is deterministic (structured failure details + patterns + passing/failing diffs).
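
To make the arithmetic concrete, here is a small estimator for the three items listed above. `estimateLlmCalls` is a hypothetical helper written for this example; it counts only the calls named in the list and ignores anything else a run might spend, such as task generation.

```typescript
interface CallEstimate {
  baseline: number;
  optimizer: number;
  critic: number;
  total: number;
}

function estimateLlmCalls(
  models: number,
  tasks: number,
  iterations: number,
  endsInFail: boolean,
): CallEstimate {
  const baseline = models * tasks;                    // N models x M tasks
  const optimizer = iterations * (1 + models * tasks); // 1 mutation + re-benchmark, per iteration
  const critic = endsInFail ? 1 : 0;                  // one critic call, only on FAIL
  return { baseline, optimizer, critic, total: baseline + optimizer + critic };
}

// 2 models, 20 tasks, 5 optimizer iterations, ending in FAIL:
// 40 + 5 * (1 + 40) + 1 = 246 calls.
const estimate = estimateLlmCalls(2, 20, 5, true);
console.log(estimate.total); // 246
```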
The optimizer's coding agent is powered by @mariozechner/pi-coding-agent — a small OSS wrapper around OpenRouter that handles agent sessions and tool loops. Models are accessed through OpenRouter — you need one API key for everything.
Missing OPENROUTER_API_KEY: Set it in your shell before running:

```bash
export OPENROUTER_API_KEY=sk-or-...
```

Dirty git: The optimizer requires a clean git state in the target repo (`requireCleanGit: true` by default). Commit or stash uncommitted changes before running.
maxTasks < scope_size: benchmark.taskGeneration.maxTasks must be >= the number of in-scope actions. Run npx tsx src/cli.ts --dry-run --config ./skill-optimizer.json to see the count without making LLM calls.
Empty scope: target.scope.include matched nothing. Check your glob patterns — remember * matches everything including dots.
Legacy skill-benchmark.json: Rename it to skill-optimizer.json. The loader will tell you if it finds the old name.
See CONTRIBUTING.md.