feat(skill): add eval framework to measure SKILL.md effectiveness #602

Merged
BYK merged 1 commit into main from feat/skill-eval-framework
Mar 31, 2026

BYK (Member) commented Mar 30, 2026

Summary

Adds an evaluation framework that measures how effectively SKILL.md guides an LLM agent to use the Sentry CLI efficiently. Inspired by the skill-creator plugin approach of prompt → plan → grade.

  • Two-phase eval: sends SKILL.md + user prompt to an LLM, then grades the planned commands with deterministic checks (string matching) and an LLM judge (coherence)
  • 8 test cases covering the failure modes from #598 ("Improve skill: Avoid auth on every request, and improve knowledge about auto project detection"): no pre-auth, no org/project lookup, correct fields, minimal calls, trusts auto-detection
  • Anthropic API with claude-sonnet-4-6 + claude-opus-4-6 as agents, claude-haiku-4-5 as judge
  • CI job runs on skill-related file changes, protected by the skill-eval environment (requires reviewer approval to use the API key)
  • Blocking — added to CI Status, fails below 75% threshold
  • Baseline: 8/8 cases passed (100%) on both models
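
The deterministic half of the grading and the 75% gate described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the `EvalCase` shape, the function names, and the example command strings are all hypothetical, and the LLM-judge step is left out entirely.

```typescript
// Hypothetical shape of one test case; the real types are not shown on this page.
type EvalCase = {
  name: string;
  expectedPatterns: string[]; // each must appear in at least one planned command
  antiPatterns: string[];     // none may appear in any planned command
};

// Deterministic check: plain substring matching over the planned commands.
function passesDeterministic(testCase: EvalCase, commands: string[]): boolean {
  const hasAllExpected = testCase.expectedPatterns.every((pattern) =>
    commands.some((cmd) => cmd.includes(pattern))
  );
  const hasForbidden = testCase.antiPatterns.some((pattern) =>
    commands.some((cmd) => cmd.includes(pattern))
  );
  return hasAllExpected && !hasForbidden;
}

// Gate: the run fails when the share of passing cases is below the threshold.
function meetsThreshold(passed: number, total: number, threshold = 0.75): boolean {
  return total > 0 && passed / total >= threshold;
}

// Illustrative case: the agent should not check auth before doing the real work.
const noPreAuth: EvalCase = {
  name: "no-pre-auth",
  expectedPatterns: ["sentry issue explain"],
  antiPatterns: ["sentry auth status"],
};

const directPlan = ["sentry issue explain"];
const wastefulPlan = ["sentry auth status", "sentry issue explain"];

console.log(passesDeterministic(noPreAuth, directPlan));   // true
console.log(passesDeterministic(noPreAuth, wastefulPlan)); // false
console.log(meetsThreshold(8, 8));                         // true
```

With 8 cases, a 75% threshold means at least 6 must pass for the job to stay green.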

Running locally

With an Anthropic API key:

ANTHROPIC_API_KEY=sk-ant-... bun run eval:skill

Test a single model:

EVAL_AGENT_MODELS=claude-sonnet-4-6 ANTHROPIC_API_KEY=... bun run eval:skill

Ref #598

github-actions bot (Contributor) commented Mar 30, 2026

Semver Impact of This PR

🟡 Minor (new features)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

  • (skill) Add eval framework to measure SKILL.md effectiveness by BYK in #602
  • (telemetry) Add seer.outcome span tag for Seer command metrics by BYK in #609
  • (upgrade) Show changelog summary during CLI upgrade by BYK in #594

Bug Fixes 🐛

Upgrade

  • Prevent spinner freeze during delta patch application by BYK in #608
  • Indent changelog, add emoji to heading, hide empty sections by BYK in #604

Other

  • (dashboard) Reject MRI queries with actionable tracemetrics guidance by BYK in #601
  • (skill) Avoid unnecessary auth, reinforce auto-detection, fix field examples by BYK in #599
  • 2 bug fixes — subcommand crash, negative span depth, pagination JSON parse by cursor in #607

Documentation 📚

  • (skill) Document dashboard widget constraints and deprecated datasets by BYK in #605
  • Fix documentation gaps and embed skill files at build time by cursor in #606

Internal Changes 🔧

  • Regenerate skill files and command docs by github-actions[bot] in 664362ca

🤖 This preview updates automatically when you update the PR.

BYK marked this pull request as ready for review March 30, 2026 11:13
BYK requested review from MathurAditya724 and betegon March 30, 2026 11:14
github-actions bot (Contributor) commented Mar 30, 2026

Codecov Results 📊

129 passed | Total: 129 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric Change
Total Tests
Passed Tests
Failed Tests
Skipped Tests

✨ No test changes detected

All tests are passing successfully.

❌ Patch coverage is 66.67%. Project has 1303 uncovered lines.
❌ Project coverage is 95.62%. Comparing base (base) to head (head).

Files with missing lines (1)
| File | Patch % | Lines |
|---|---|---|
| src/lib/formatters/human.ts | 6.25% | ⚠️ 15 Missing |
Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.73%    95.62%    -0.11%
==========================================
  Files          204       204         —
  Lines        29877     29739      -138
  Branches         0         0         —
==========================================
+ Hits         28601     28436      -165
- Misses        1276      1303       +27
- Partials         0         0         —

Generated by Codecov Action

BYK (Member, Author) commented Mar 30, 2026

Addressed Cursor Bugbot feedback: expected-patterns check in judge.ts now uses allCommands.some(cmd => cmd.includes(...)) per-command instead of matching against the concatenated joined string. This prevents false positives from patterns matching across command boundaries, consistent with how anti-patterns already works.
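
The difference between the two matching strategies can be shown with a small sketch. The helper names, the space used as the join separator, and the command strings below are made up for illustration; only the `allCommands.some(cmd => cmd.includes(...))` shape comes from the comment above.

```typescript
// Old behaviour (assumed here to join with a space): match against one big string.
function matchesJoined(allCommands: string[], pattern: string): boolean {
  return allCommands.join(" ").includes(pattern);
}

// New behaviour: the pattern must occur inside a single command.
function matchesPerCommand(allCommands: string[], pattern: string): boolean {
  return allCommands.some((cmd) => cmd.includes(pattern));
}

// Two unrelated, made-up commands whose boundary happens to spell the pattern.
const allCommands = ["cmd-a --tag issue", "explain cmd-b"];
const pattern = "issue explain";

console.log(matchesJoined(allCommands, pattern));     // true  (false positive)
console.log(matchesPerCommand(allCommands, pattern)); // false
```

Neither command contains the pattern on its own, so only the per-command check reports the correct result.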

betegon (Member) left a comment

Couple of comments, and maybe we should consider using something like https://github.com/getsentry/vitest-evals (although we're on bun test)

BYK closed this Mar 31, 2026
BYK reopened this Mar 31, 2026
BYK force-pushed the feat/skill-eval-framework branch from d409adf to 353155f on March 31, 2026 10:50
Two-phase eval: sends test prompts to an LLM with SKILL.md as context,
then grades the planned commands on efficiency criteria (no pre-auth,
no org lookup, correct fields, minimal calls, trusts auto-detection).

- 8 test cases covering the failure modes from issue #598
- Deterministic checks (string matching) + LLM judge (coherence)
- Uses Anthropic API (claude-sonnet-4-6, claude-opus-4-6) via repo secret
- CI job runs on skill-related file changes, fails below 75% threshold
- Fork PRs: blocked until maintainer adds eval-skill label, eval runs
  via pull_request_target, results posted as commit status
- Label removed on synchronize (new push forces re-review)
- Uses SENTRY_RELEASE_BOT app token to re-trigger main CI after fork eval
BYK force-pushed the feat/skill-eval-framework branch from 353155f to 6644186 on March 31, 2026 10:52
github-actions (Contributor)

PR Preview Action v1.8.1

🚀 View preview at
https://cli.sentry.dev/pr-preview/pr-602/

Built to branch gh-pages at 2026-03-31 10:52 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

cursor bot (Contributor) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

BYK merged commit 716e2ba into main Mar 31, 2026
34 of 36 checks passed
BYK deleted the feat/skill-eval-framework branch March 31, 2026 13:23
betegon added a commit that referenced this pull request Mar 31, 2026
Bun's identifier minification creates name collisions in the `marked`
library's token walker — `renderInline` (a function) gets the same
minified name as an unrelated object, causing `auth status` and other
markdown-rendering paths to crash with:

  TypeError: _4 is not a function. (In '_4(_4.tokens)', '_4' is an instance of Object)

The collision was triggered by PR #602 removing ~380 lines of code,
which shifted the minifier's naming sequence. Any future code change
could re-trigger it since it depends on exact identifier ordering.

Fix: use `minify: { whitespace: true, syntax: true, identifiers: false }`
instead of `minify: true`. This keeps whitespace removal and syntax
transforms (most of the size savings) while avoiding the fragile
identifier renaming. Bundle grows from 2.87 MB to 3.64 MB raw, but
gzip compression absorbs most of the difference.

Made-with: Cursor
betegon added a commit that referenced this pull request Mar 31, 2026
## Summary

Fixes the compiled binary crash that affected all commands rendering
markdown output (auth status, issue explain, etc.):

```
TypeError: _4 is not a function. (In '_4(_4.tokens)', '_4' is an instance of Object)
```

## Root Cause

Bun's identifier minification assigns short names (`_4`, `_5`, etc.) to
all functions/variables. A name collision caused `renderInline` (a
function in `markdown.ts`) to get the same minified name as an unrelated
object. When `renderOneInline` calls `renderInline(token.tokens)`, the
minified code calls `_4(_4.tokens)` — but `_4` is the object, not the
function.

Triggered by PR #602 (716e2ba) which removed ~380 lines of code,
shifting the minifier's naming sequence. The bug is in Bun's bundler,
not our source code — any future code change could re-trigger it.
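
To make the failure mode concrete, here is a hand-written imitation of what the collision does. It is a sketch only: the function bodies are invented stand-ins, not the code in `markdown.ts`, and the second function mimics minified output rather than reproducing Bun's actual emit.

```typescript
type InlineToken = { tokens: string[] };

// Unminified shape: two distinct names, so the call resolves to the function.
function renderInline(tokens: string[]): string {
  return tokens.join("");
}
function renderOneInline(token: InlineToken): string {
  return renderInline(token.tokens);
}

// Imitation of the collided output: the function and an unrelated object
// both ended up named `_4`, and the object is what the name refers to.
function renderOneInlineCollided(token: InlineToken): string {
  const _4: any = token; // `any` lets this compile; the failure is at runtime
  return _4(_4.tokens);  // throws a TypeError (exact message varies by engine)
}

console.log(renderOneInline({ tokens: ["a", "b"] })); // "ab"
```

Calling `renderOneInlineCollided` with any token throws, which matches the shape of the `_4(_4.tokens)` crash quoted above.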

## Fix

Change `minify: true` to `minify: { whitespace: true, syntax: true,
identifiers: false }` in `script/build.ts`. This keeps whitespace
removal and syntax transforms while avoiding identifier renaming.
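
As a rough sketch of the change, the granular `minify` setting looks like the object below. The surrounding `buildConfig`, its entrypoint, and its output directory are placeholders, not the contents of `script/build.ts`; the assumption is that an object of this kind is what gets handed to Bun's bundler there.

```typescript
// The three switches named in the fix above.
type MinifyFlags = {
  whitespace: boolean;
  syntax: boolean;
  identifiers: boolean;
};

// Keep whitespace removal and syntax transforms, turn off identifier renaming.
const minify: MinifyFlags = {
  whitespace: true,
  syntax: true,
  identifiers: false,
};

// Placeholder build config; the paths are hypothetical.
const buildConfig = {
  entrypoints: ["./src/index.ts"], // hypothetical entrypoint
  outdir: "./dist",                // hypothetical output directory
  minify,                          // previously `minify: true`
};

console.log(JSON.stringify(buildConfig.minify));
```

Leaving `identifiers` off means function and variable names survive bundling, which is why the raw bundle grows while the compressed size changes much less.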

**Size impact:** Bundle grows from 2.87 MB to 3.64 MB raw (~27%). Gzip
compression absorbs most of the difference since original identifier
names compress well.

## Bisect

- `27a9f0f8` (PR #610) — works
- `716e2bad` (PR #602) — crashes
- Specifically: the change to `src/commands/issue/explain.ts` triggers
the collision by shifting import ordering

## Test plan

- `SENTRY_CLI_BINARY=./dist-bin/sentry-darwin-arm64 bun test --timeout
15000 test/e2e` — 122 pass, 0 fail
- `SENTRY_AUTH_TOKEN=test ./dist-bin/sentry-darwin-arm64 auth status` —
renders markdown without crash

Made with [Cursor](https://cursor.com)