feat(skill): add eval framework to measure SKILL.md effectiveness#602
Semver Impact of This PR: 🟡 Minor (new features)
Codecov Results 📊
✅ 129 passed | Total: 129 | Pass Rate: 100% | Execution Time: 0ms
📊 Comparison with Base Branch
✨ No test changes detected. All tests are passing successfully.
❌ Patch coverage is 66.67%. Project has 1303 uncovered lines.
Files with missing lines (1)
Coverage diff
@@ Coverage Diff @@
## main #PR +/-##
==========================================
- Coverage 95.73% 95.62% -0.11%
==========================================
Files 204 204 —
Lines 29877 29739 -138
Branches 0 0 —
==========================================
+ Hits 28601 28436 -165
- Misses 1276 1303 +27
- Partials 0 0 —
Generated by Codecov Action
Addressed Cursor Bugbot feedback:
betegon left a comment:
couple of comments, and maybe we should consider using something like https://github.com/getsentry/vitest-evals (although we're on bun test)
d409adf to 353155f
Two-phase eval: sends test prompts to an LLM with SKILL.md as context, then grades the planned commands on efficiency criteria (no pre-auth, no org lookup, correct fields, minimal calls, trusts auto-detection).

- 8 test cases covering the failure modes from issue #598
- Deterministic checks (string matching) + LLM judge (coherence)
- Uses Anthropic API (claude-sonnet-4-6, claude-opus-4-6) via repo secret
- CI job runs on skill-related file changes, fails below 75% threshold
- Fork PRs: blocked until maintainer adds eval-skill label, eval runs via pull_request_target, results posted as commit status
- Label removed on synchronize (new push forces re-review)
- Uses SENTRY_RELEASE_BOT app token to re-trigger main CI after fork eval
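The deterministic half of the grading described above can be illustrated with a minimal sketch. Everything named here (`Check`, `checks`, `scorePlan`, `meetsThreshold`, the example commands, and the way scores are averaged) is hypothetical and not taken from the PR's code; only the string-matching idea and the 75% threshold come from the commit message.

```typescript
// Hypothetical shapes: none of these names come from the PR itself.
type Check = {
  name: string;
  // Returns true when the planned commands satisfy the criterion.
  passes: (commands: string[]) => boolean;
};

// Illustrative string-matching checks modelled on two of the listed criteria.
const checks: Check[] = [
  {
    // "no pre-auth": the plan should not probe auth state before acting.
    name: "no-pre-auth",
    passes: (cmds) => !cmds.some((c) => c.includes("auth status")),
  },
  {
    // "minimal calls": the limit of two commands is an assumption.
    name: "minimal-calls",
    passes: (cmds) => cmds.length <= 2,
  },
];

// Fraction of checks that pass for one planned command list.
function scorePlan(commands: string[]): number {
  const passed = checks.filter((check) => check.passes(commands)).length;
  return passed / checks.length;
}

// Mean score across all test cases, compared against the CI threshold.
// Averaging per-case scores is an assumption about how the 75% is computed.
function meetsThreshold(plans: string[][], threshold = 0.75): boolean {
  if (plans.length === 0) return false;
  const total = plans.reduce((sum, plan) => sum + scorePlan(plan), 0);
  return total / plans.length >= threshold;
}
```

A plan that runs `auth status` before the real command would lose the `no-pre-auth` point in this sketch, which is the kind of inefficiency the eval is meant to surface. The LLM-judge phase is not modelled here.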
353155f to 6644186
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Bun's identifier minification creates name collisions in the `marked` library's token walker: `renderInline` (a function) gets the same minified name as an unrelated object, causing `auth status` and other markdown-rendering paths to crash with:

TypeError: _4 is not a function. (In '_4(_4.tokens)', '_4' is an instance of Object)

The collision was triggered by PR #602 removing ~380 lines of code, which shifted the minifier's naming sequence. Any future code change could re-trigger it since it depends on exact identifier ordering.

Fix: use `minify: { whitespace: true, syntax: true, identifiers: false }` instead of `minify: true`. This keeps whitespace removal and syntax transforms (most of the size savings) while avoiding the fragile identifier renaming. Bundle grows from 2.87 MB to 3.64 MB raw, but gzip compression absorbs most of the difference.

Made-with: Cursor
## Summary

Fixes the compiled binary crash that affected all commands rendering markdown output (auth status, issue explain, etc.):

```
TypeError: _4 is not a function. (In '_4(_4.tokens)', '_4' is an instance of Object)
```

## Root Cause

Bun's identifier minification assigns short names (`_4`, `_5`, etc.) to all functions/variables. A name collision caused `renderInline` (a function in `markdown.ts`) to get the same minified name as an unrelated object. When `renderOneInline` calls `renderInline(token.tokens)`, the minified code calls `_4(_4.tokens)`, but `_4` is the object, not the function.

Triggered by PR #602 (716e2ba), which removed ~380 lines of code, shifting the minifier's naming sequence. The bug is in Bun's bundler, not our source code; any future code change could re-trigger it.

## Fix

Change `minify: true` to `minify: { whitespace: true, syntax: true, identifiers: false }` in `script/build.ts`. This keeps whitespace removal and syntax transforms while avoiding identifier renaming.

**Size impact:** Bundle grows from 2.87 MB to 3.64 MB raw (~27%). Gzip compression absorbs most of the difference since original identifier names compress well.

## Bisect

- `27a9f0f8` (PR #610): works
- `716e2bad` (PR #602): crashes
- Specifically: the change to `src/commands/issue/explain.ts` triggers the collision by shifting import ordering

## Test plan

- `SENTRY_CLI_BINARY=./dist-bin/sentry-darwin-arm64 bun test --timeout 15000 test/e2e`: 122 pass, 0 fail
- `SENTRY_AUTH_TOKEN=test ./dist-bin/sentry-darwin-arm64 auth status`: renders markdown without crash

Made with [Cursor](https://cursor.com)
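As a rough illustration of the fix and the size figure quoted above, the sketch below builds the `minify` object from the description and recomputes the raw size increase. The `MinifyOptions` type, `safeMinifyOptions`, and `growthPercent` are hypothetical names introduced for this example; the actual contents of `script/build.ts` are not shown in this PR.

```typescript
// Local type mirroring the object form of the `minify` option quoted above.
type MinifyOptions = {
  whitespace: boolean;
  syntax: boolean;
  identifiers: boolean;
};

// Hypothetical helper: returns the minify settings named in the fix.
// Identifier renaming is the only transform that is switched off.
function safeMinifyOptions(): MinifyOptions {
  return { whitespace: true, syntax: true, identifiers: false };
}

// Relative size increase in percent, rounded to the nearest integer.
function growthPercent(beforeMb: number, afterMb: number): number {
  return Math.round(((afterMb - beforeMb) / beforeMb) * 100);
}

// In a build script, this object would be passed as the `minify` field of the
// Bun.build options in place of `minify: true` (other options omitted here).
const minify = safeMinifyOptions();
const rawGrowth = growthPercent(2.87, 3.64);
```

With the sizes from the description, `rawGrowth` matches the ~27% figure stated under "Size impact".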

Summary
Adds an evaluation framework that measures how effectively SKILL.md guides an LLM agent to use the Sentry CLI efficiently. Inspired by the skill-creator plugin approach of prompt → plan → grade.
- claude-sonnet-4-6 + claude-opus-4-6 as agents, claude-haiku-4-5 as judge
- skill-eval environment (requires reviewer approval to use the API key)

Running locally
With an Anthropic API key:
Test a single model:
Ref #598