feat(skill): add eval framework to measure SKILL.md effectiveness by BYK · Pull Request #602 · getsentry/cli

BYK · 2026-03-30T11:12:39Z

Summary

Adds an evaluation framework that measures how effectively SKILL.md guides an LLM agent to use the Sentry CLI efficiently. Inspired by the skill-creator plugin approach of prompt → plan → grade.

- Two-phase eval: sends SKILL.md + user prompt to an LLM, then grades the planned commands with deterministic checks (string matching) and an LLM judge (coherence)
- 8 test cases covering the failure modes from #598 ("Improve skill: Avoid auth on every request, and improve knowledge about auto project detection"): no pre-auth, no org/project lookup, correct fields, minimal calls, trusts auto-detection
- Anthropic API with claude-sonnet-4-6 + claude-opus-4-6 as agents, claude-haiku-4-5 as judge
- CI job runs on skill-related file changes, protected by the skill-eval environment (requires reviewer approval to use the API key)
- Blocking: added to CI Status, fails below 75% threshold
- Baseline: 8/8 cases passed (100%) on both models
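
The 75% gate described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `CaseResult` shape, the helper names, and the choice to apply the threshold per agent model (rather than over all results combined) are assumptions.

```typescript
// Hypothetical result shape; the PR's actual types are not shown in this thread.
interface CaseResult {
  caseId: string;
  model: string;
  passed: boolean;
}

// 75% comes from the PR summary; applying it per model is an assumption.
const PASS_THRESHOLD = 0.75;

// Fraction of passed cases for each agent model.
function passRateByModel(results: CaseResult[]): Map<string, number> {
  const counts = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const c = counts.get(r.model) ?? { passed: 0, total: 0 };
    c.total += 1;
    if (r.passed) c.passed += 1;
    counts.set(r.model, c);
  }
  const rates = new Map<string, number>();
  counts.forEach((c, model) => rates.set(model, c.passed / c.total));
  return rates;
}

// True only if every model reaches the threshold ("fails below 75%").
function meetsThreshold(
  results: CaseResult[],
  threshold: number = PASS_THRESHOLD
): boolean {
  let ok = results.length > 0; // an empty run should not count as a pass
  passRateByModel(results).forEach((rate) => {
    if (rate < threshold) ok = false;
  });
  return ok;
}

// Illustration helper: build `total` results of which the first `passedCount` pass.
function makeResults(model: string, passedCount: number, total: number): CaseResult[] {
  const out: CaseResult[] = [];
  for (let i = 0; i < total; i++) {
    out.push({ caseId: `case-${i + 1}`, model, passed: i < passedCount });
  }
  return out;
}
```

Under these assumptions, a run with 8/8 on one model and 6/8 on the other still passes (6/8 is exactly 75%), while 5/8 on either model fails the job.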

Running locally

With an Anthropic API key:

ANTHROPIC_API_KEY=sk-ant-... bun run eval:skill

Test a single model:

EVAL_AGENT_MODELS=claude-sonnet-4-6 ANTHROPIC_API_KEY=... bun run eval:skill

Ref #598

github-actions · 2026-03-30T11:12:54Z

Semver Impact of This PR

🟡 Minor (new features)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).

New Features ✨

(skill) Add eval framework to measure SKILL.md effectiveness by BYK in #602

(telemetry) Add seer.outcome span tag for Seer command metrics by BYK in #609
(upgrade) Show changelog summary during CLI upgrade by BYK in #594

Bug Fixes 🐛

Upgrade

Prevent spinner freeze during delta patch application by BYK in #608
Indent changelog, add emoji to heading, hide empty sections by BYK in #604

Other

(dashboard) Reject MRI queries with actionable tracemetrics guidance by BYK in #601
(skill) Avoid unnecessary auth, reinforce auto-detection, fix field examples by BYK in #599
2 bug fixes — subcommand crash, negative span depth, pagination JSON parse by cursor in #607

Documentation 📚

(skill) Document dashboard widget constraints and deprecated datasets by BYK in #605
Fix documentation gaps and embed skill files at build time by cursor in #606

Internal Changes 🔧

Regenerate skill files and command docs by github-actions[bot] in 664362ca

🤖 This preview updates automatically when you update the PR.

github-actions · 2026-03-30T11:14:29Z

Codecov Results 📊

✅ 129 passed | Total: 129 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric	Change
Total Tests	—
Passed Tests	—
Failed Tests	—
Skipped Tests	—

✨ No test changes detected

All tests are passing successfully.

❌ Patch coverage is 66.67%. Project has 1303 uncovered lines.
❌ Project coverage is 95.62%. Comparing base (base) to head (head).

Files with missing lines (1)

File	Patch %	Lines
src/lib/formatters/human.ts	6.25%	⚠️ 15 Missing

Coverage diff

@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.73%    95.62%    -0.11%
==========================================
  Files          204       204         —
  Lines        29877     29739      -138
  Branches         0         0         —
==========================================
+ Hits         28601     28436      -165
- Misses        1276      1303       +27
- Partials         0         0         —

Generated by Codecov Action

test/skill-eval/helpers/judge.ts

BYK · 2026-03-30T11:19:35Z

Addressed Cursor Bugbot feedback: the expected-patterns check in `judge.ts` now uses `allCommands.some(cmd => cmd.includes(...))` per command instead of matching against the concatenated, joined string. This prevents false positives from patterns matching across command boundaries, consistent with how the anti-patterns check already works.
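
To illustrate the difference, here is a small sketch of both strategies. The function names, the single-space joiner, and the planned commands are made up for illustration; only the `allCommands.some(cmd => cmd.includes(...))` form comes from the comment above.

```typescript
// Old strategy (as described): search the pattern in all commands joined together.
// The single-space joiner is an assumption.
function matchesJoined(allCommands: string[], pattern: string): boolean {
  return allCommands.join(" ").includes(pattern);
}

// New strategy: the pattern must appear inside at least one individual command.
function matchesPerCommand(allCommands: string[], pattern: string): boolean {
  return allCommands.some((cmd) => cmd.includes(pattern));
}

// Artificial plan: no single command contains "auth status",
// but the joined string "sentry auth status" does.
const planned = ["sentry auth", "status"];

const joinedHit = matchesJoined(planned, "auth status"); // true (false positive)
const perCommandHit = matchesPerCommand(planned, "auth status"); // false
```

The per-command version never matches a pattern that only exists because two unrelated commands happen to sit next to each other.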

test/skill-eval/helpers/planner.ts

betegon

couple of comments and maybe we should consider using something like https://github.com/getsentry/vitest-evals (although we're on bun test)

Two-phase eval: sends test prompts to an LLM with SKILL.md as context, then grades the planned commands on efficiency criteria (no pre-auth, no org lookup, correct fields, minimal calls, trusts auto-detection).

- 8 test cases covering the failure modes from issue #598
- Deterministic checks (string matching) + LLM judge (coherence)
- Uses Anthropic API (claude-sonnet-4-6, claude-opus-4-6) via repo secret
- CI job runs on skill-related file changes, fails below 75% threshold
- Fork PRs: blocked until maintainer adds eval-skill label, eval runs via pull_request_target, results posted as commit status
- Label removed on synchronize (new push forces re-review)
- Uses SENTRY_RELEASE_BOT app token to re-trigger main CI after fork eval

github-actions · 2026-03-31T10:52:33Z

PR Preview Action v1.8.1
🚀 View preview at https://cli.sentry.dev/pr-preview/pr-602/
Built to branch `gh-pages` at 2026-03-31 10:52 UTC. Preview will be ready when the GitHub Pages deployment is complete.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Bun's identifier minification creates name collisions in the `marked` library's token walker: `renderInline` (a function) gets the same minified name as an unrelated object, causing `auth status` and other markdown-rendering paths to crash with:

TypeError: _4 is not a function. (In '_4(_4.tokens)', '_4' is an instance of Object)

The collision was triggered by PR #602 removing ~380 lines of code, which shifted the minifier's naming sequence. Any future code change could re-trigger it since it depends on exact identifier ordering.

Fix: use `minify: { whitespace: true, syntax: true, identifiers: false }` instead of `minify: true`. This keeps whitespace removal and syntax transforms (most of the size savings) while avoiding the fragile identifier renaming.

Bundle grows from 2.87 MB to 3.64 MB raw, but gzip compression absorbs most of the difference.

Made-with: Cursor

## Summary

Fixes the compiled binary crash that affected all commands rendering markdown output (auth status, issue explain, etc.):

```
TypeError: _4 is not a function. (In '_4(_4.tokens)', '_4' is an instance of Object)
```

## Root Cause

Bun's identifier minification assigns short names (`_4`, `_5`, etc.) to all functions/variables. A name collision caused `renderInline` (a function in `markdown.ts`) to get the same minified name as an unrelated object. When `renderOneInline` calls `renderInline(token.tokens)`, the minified code calls `_4(_4.tokens)`, but `_4` is the object, not the function.

Triggered by PR #602 (716e2ba) which removed ~380 lines of code, shifting the minifier's naming sequence. The bug is in Bun's bundler, not our source code, so any future code change could re-trigger it.

## Fix

Change `minify: true` to `minify: { whitespace: true, syntax: true, identifiers: false }` in `script/build.ts`. This keeps whitespace removal and syntax transforms while avoiding identifier renaming.

**Size impact:** Bundle grows from 2.87 MB to 3.64 MB raw (~27%). Gzip compression absorbs most of the difference since original identifier names compress well.

## Bisect

- `27a9f0f8` (PR #610): works
- `716e2bad` (PR #602): crashes
- Specifically: the change to `src/commands/issue/explain.ts` triggers the collision by shifting import ordering

## Test plan

- `SENTRY_CLI_BINARY=./dist-bin/sentry-darwin-arm64 bun test --timeout 15000 test/e2e`: 122 pass, 0 fail
- `SENTRY_AUTH_TOKEN=test ./dist-bin/sentry-darwin-arm64 auth status`: renders markdown without crash

Made with [Cursor](https://cursor.com)
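
The failure mode and the fix can be sketched in isolation. This is a hand-written simulation, not Bun's actual minifier output: the `simulateCollision` and `minifyOptions` names and the local `MinifyOptions` type are illustrative, and the `./src/bin.ts` entrypoint in the trailing comment is hypothetical.

```typescript
// Simulates the collision: a function and an unrelated object end up sharing
// one short name, so the call site invokes the object instead of the function.
function simulateCollision(): string {
  // After the faulty rename, `_4` refers to the object, not to renderInline.
  const _4: any = { tokens: ["*bold*"] };
  try {
    return String(_4(_4.tokens)); // mirrors the minified call `_4(_4.tokens)`
  } catch (err) {
    return err instanceof TypeError ? "TypeError" : "other error";
  }
}

// Shape of the minify option described in the fix, declared locally so the
// sketch does not depend on Bun's type definitions.
type MinifyOptions =
  | boolean
  | { whitespace: boolean; syntax: boolean; identifiers: boolean };

// Keep whitespace and syntax minification, skip identifier renaming.
const minifyOptions: MinifyOptions = {
  whitespace: true,
  syntax: true,
  identifiers: false,
};

// In a build script this would be passed roughly as follows (not executed here,
// entrypoint path is hypothetical):
//   await Bun.build({ entrypoints: ["./src/bin.ts"], minify: minifyOptions });
```

With `identifiers: false` the original names survive in the bundle, so this class of collision cannot occur, at the cost of the larger raw bundle noted above.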

BYK marked this pull request as ready for review March 30, 2026 11:13

BYK requested review from MathurAditya724 and betegon March 30, 2026 11:14

cursor bot reviewed Mar 30, 2026

test/skill-eval/helpers/judge.ts (resolved)

sentry bot reviewed Mar 30, 2026

test/skill-eval/helpers/planner.ts (resolved)

betegon reviewed Mar 30, 2026

test/skill-eval/cases.json (resolved)

test/skill-eval/cases.json (resolved)

test/skill-eval/cases.json (resolved)

sentry bot reviewed Mar 30, 2026

test/skill-eval/helpers/judge.ts (resolved)

cursor bot reviewed Mar 30, 2026

test/skill-eval/helpers/planner.ts (outdated, resolved)

BYK temporarily deployed to skill-eval March 30, 2026 18:28 — with GitHub Actions Inactive

cursor bot reviewed Mar 30, 2026

.github/workflows/ci.yml (outdated, resolved)

test/skill-eval/helpers/planner.ts (resolved)

BYK temporarily deployed to skill-eval March 30, 2026 18:42 — with GitHub Actions Inactive

BYK temporarily deployed to skill-eval March 30, 2026 21:18 — with GitHub Actions Inactive

sentry bot reviewed Mar 31, 2026

.github/workflows/eval-skill-fork.yml (resolved)

cursor bot reviewed Mar 31, 2026

.github/workflows/ci.yml (outdated, resolved)

BYK closed this Mar 31, 2026

BYK reopened this Mar 31, 2026

BYK force-pushed the feat/skill-eval-framework branch from d409adf to 353155f Compare March 31, 2026 10:50

cursor bot reviewed Mar 31, 2026

src/lib/arg-parsing.ts (resolved)

src/lib/formatters/human.ts (resolved)

BYK force-pushed the feat/skill-eval-framework branch from 353155f to 6644186 Compare March 31, 2026 10:52

cursor bot reviewed Mar 31, 2026

src/lib/db/pagination.ts (resolved)

sentry bot reviewed Mar 31, 2026

src/commands/dashboard/resolve.ts (resolved)

BYK merged commit 716e2ba into main Mar 31, 2026
34 of 36 checks passed

BYK deleted the feat/skill-eval-framework branch March 31, 2026 13:23

sentry-release-bot bot mentioned this pull request Mar 31, 2026

publish: getsentry/cli@0.23.0 getsentry/publish#7654

Closed

5 tasks

betegon mentioned this pull request Mar 31, 2026

fix(build): disable identifier minification to fix marked crash #617

Merged

sentry-release-bot bot mentioned this pull request Mar 31, 2026

publish: getsentry/cli@0.23.0 getsentry/publish#7655

Closed

5 tasks
