fix(eval): ground LLM judge with command reference to prevent false negatives #712

Merged
BYK merged 6 commits into main from fix/eval-skill-judge-context on Apr 10, 2026

@BYK BYK commented Apr 10, 2026

Summary

  • The skill eval LLM judge (Haiku 4.5) had no context about the sentry CLI, so it hallucinated that valid commands do not exist, confusing the tool with the legacy sentry-cli
  • As a result, Opus 4.6 failed 3 of 8 eval cases (a 62.5% pass rate, below the 75% threshold); every failure came from the overall-quality LLM judge criterion, not from deterministic checks
  • Fix: extract the Command Reference section from SKILL.md and inject it into the judge prompt as grounding context

Failing CI run: https://github.com/getsentry/cli/actions/runs/24207303049/job/70666509005

…egatives

The skill eval judge (Haiku 4.5) had no context about the sentry CLI and
was hallucinating that valid commands don't exist, confusing it with the
legacy sentry-cli. This caused Opus 4.6 to fail 3/8 eval cases (62.5%,
below the 75% threshold) on the overall-quality criterion.

Extract the Command Reference section from SKILL.md and inject it into the
judge prompt so it can verify planned commands against actual CLI capabilities.
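
The extraction step described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the function names, the prompt wording, and the assumption that the section is introduced by a Markdown heading titled "Command Reference" are all illustrative.

```typescript
// Hypothetical helpers; names and prompt wording are illustrative, not taken from the PR.

interface Heading {
  level: number;
  text: string;
}

// Parses an ATX heading such as "## Command Reference"; returns null for other lines.
function parseHeading(line: string): Heading | null {
  const match = /^(#{1,6})\s+(.*?)\s*$/.exec(line);
  if (match === null) return null;
  return { level: (match[1] ?? "").length, text: match[2] ?? "" };
}

// Returns the body of the named section, or null when the heading is absent.
// The section ends at the next heading of the same or a higher level.
// Limitation: "#" lines inside fenced code blocks are treated as headings.
function extractSection(markdown: string, title: string): string | null {
  const lines = markdown.split("\n");
  const start = lines.findIndex(
    (line) => parseHeading(line)?.text.toLowerCase() === title.toLowerCase(),
  );
  if (start === -1) return null;
  const level = parseHeading(lines[start] ?? "")?.level ?? 6;
  const rest = lines.slice(start + 1);
  const stop = rest.findIndex((line) => {
    const heading = parseHeading(line);
    return heading !== null && heading.level <= level;
  });
  return (stop === -1 ? rest : rest.slice(0, stop)).join("\n").trim();
}

// Prepends the reference to the judge prompt as grounding context (assumed layout).
function buildJudgePrompt(plan: string, skillMd: string): string {
  const reference = extractSection(skillMd, "Command Reference");
  const grounding =
    reference === null ? "" : `Command reference for the sentry CLI:\n${reference}\n\n`;
  return `${grounding}Plan to evaluate:\n${plan}`;
}
```

Keeping the extraction separate from prompt assembly means the judge prompt degrades to its ungrounded form when the section is missing instead of failing outright.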

github-actions bot commented Apr 10, 2026

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

Docs

  • Deploy main branch preview alongside PR previews by BYK in #707
  • Enable sourcemap upload, releases, and environment tracking by BYK in #705

Other

  • (cli) Hoist global flags from any argv position and add -v alias by BYK in #709
  • (commands) Add buildRouteMap wrapper with standard subcommand aliases by BYK in #690
  • (config) Support .sentryclirc config file for per-directory defaults by BYK in #693
  • (init) Add fuzzy edit replacers and edits-based apply-patchset by betegon in #698
  • (install) Add SENTRY_INIT env var to run wizard after install by betegon in #685
  • (release) Surface adoption and health metrics in list and view (Add release command group with adoption/health subcommand #463) by BYK in #680
  • (telemetry) Add agent detection tag for AI coding tools by betegon in #687

Bug Fixes 🐛

Dashboard

  • Add --layout flag to widget add for predictable placement by BYK in #700
  • Render tracemetrics widgets in dashboard view by BYK in #695

Other

  • (build) Enable sourcemap resolution for compiled binaries by BYK in #701
  • (cache) --fresh flag now updates cache with fresh response by BYK in #708
  • (eval) Ground LLM judge with command reference to prevent false negatives by BYK in #712
  • (init) Narrow command validation to actual shell injection vectors by betegon in #697
  • (init,feedback) Default to tracing only in feature select and attach user email to feedback by MathurAditya724 in #688
  • (setup) Handle read-only .claude directory in sandboxed environments by BYK in #702
  • Inject auth token into generated .env.sentry-build-plugin files by MathurAditya724 in #706

Internal Changes 🔧

  • (docs) Gitignore generated command docs, extract fragments by BYK in #696
  • (eval) Replace OpenAI with Anthropic SDK in init-eval judge by betegon in #683
  • (init) Use markdown pipeline for spinner messages by betegon in #686
  • Regenerate skill files and command docs by github-actions[bot] in 584ec0e0

🤖 This preview updates automatically when you update the PR.

@BYK BYK marked this pull request as ready for review April 10, 2026 10:23

github-actions bot commented Apr 10, 2026

Codecov Results 📊

134 passed | Total: 134 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 1573 uncovered lines.
✅ Project coverage is 95.32%. Comparing base (base) to head (head).

Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    95.32%    95.32%        —%
==========================================
  Files          232       232         —
  Lines        33632     33632         —
  Branches         0         0         —
==========================================
+ Hits         32059     32059         —
- Misses        1573      1573         —
- Partials         0         0         —

Generated by Codecov Action

BYK added 4 commits April 10, 2026 10:27
The judge was too strict — it read the compact command reference literally
and rejected plans for omitting <org/project> args (which are optional via
auto-detection) and using standard flags like --json, --query, --limit that
aren't listed in the compact reference.

Also add a warning when the Command Reference section is missing from
SKILL.md, per Bugbot feedback.
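
One way to relax the judge and surface a missing section is sketched below. The note wording, the function name `buildGrounding`, and the injectable `warn` parameter are assumptions made for illustration; the commit itself may phrase or structure this differently.

```typescript
// Hypothetical sketch; the note wording and function name are assumptions.
type Warn = (message: string) => void;

// Reading instructions that tell the judge not to treat the compact reference as exhaustive.
const LENIENCY_NOTES: string[] = [
  "The reference is compact and not exhaustive.",
  "<org/project> arguments are optional because the CLI can auto-detect them.",
  "Standard flags such as --json, --query and --limit are valid even when not listed.",
];

// Builds the grounding block for the judge prompt.
// Emits a warning and returns an empty string when the reference is missing or blank.
function buildGrounding(
  reference: string | null,
  warn: Warn = (message) => console.warn(message),
): string {
  if (reference === null || reference.trim() === "") {
    warn("Command Reference section not found in SKILL.md; the judge will run ungrounded");
    return "";
  }
  const notes = LENIENCY_NOTES.map((note) => `- ${note}`).join("\n");
  return `Command reference:\n${reference.trim()}\n\nHow to read it:\n${notes}`;
}
```

Passing `warn` as a parameter keeps the function easy to test without capturing console output.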
…nce injection

Replace the command reference injection approach with empirical verification:
commands are now run with `-h` against the real CLI binary to check they
exist. The judge receives verification results instead of a pre-built
allowed-commands list, keeping it independent and honest.

Move the eval into the e2e test suite where the pre-built binary is
available, eliminating the standalone CI job. The e2e test auto-skips
when ANTHROPIC_API_KEY is absent (non-skill PRs, fork PRs). The fork
workflow continues using dev mode (bun run src/bin.ts) for verification.
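
The empirical check could be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the names `verifyCommands` and `formatForJudge` are hypothetical, a zero exit code from `-h` is assumed to mean the command exists, and the process runner is injected so the sketch stays independent of how the binary is launched.

```typescript
// Hypothetical sketch; names and the exit-code convention are assumptions.

// Runs the CLI with the given arguments and returns its exit code.
type Runner = (argv: string[]) => number;

interface Verification {
  command: string;
  exists: boolean;
}

// Keeps the leading subcommand tokens and drops everything from the first flag on.
// Limitation: positional arguments before the first flag are kept as-is.
function commandPath(command: string): string[] {
  const tokens = command.trim().split(/\s+/);
  const firstFlag = tokens.findIndex((token) => token.startsWith("-"));
  return firstFlag === -1 ? tokens : tokens.slice(0, firstFlag);
}

// Probes each planned command with "-h"; exit code 0 is treated as "exists" (assumption).
function verifyCommands(commands: string[], run: Runner): Verification[] {
  return commands.map((command) => ({
    command,
    exists: run([...commandPath(command), "-h"]) === 0,
  }));
}

// Renders the results as plain text for inclusion in the judge prompt.
function formatForJudge(results: Verification[]): string {
  return results
    .map((result) => `sentry ${result.command}: ${result.exists ? "exists" : "not found"}`)
    .join("\n");
}
```

In Node or Bun, the runner could wrap `spawnSync` from `node:child_process` and return its `status`, pointing either at the pre-built binary or at `bun run src/bin.ts` in dev mode. Handing the judge observed results instead of an allow-list is what keeps it independent of the skill file under evaluation.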

The test preload mocks globalThis.fetch to block external network calls.
The skill eval test needs the real fetch to reach the Anthropic API, so
restore __originalFetch for the duration of the describe block.
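
A minimal sketch of that swap is shown below. It assumes the preload stored the real implementation on `globalThis.__originalFetch` before installing its mock; the helper name `useRealFetch` is hypothetical.

```typescript
// Hypothetical sketch: assumes the preload saved the real implementation as
// globalThis.__originalFetch before installing its mock (the property name comes
// from the commit message; everything else here is an assumption).
interface FetchGlobals {
  fetch?: unknown;
  __originalFetch?: unknown;
}

const globals = globalThis as unknown as FetchGlobals;

// Swaps the real fetch back in and returns a function that reinstalls the mock.
// Does nothing if no original was saved.
function useRealFetch(): () => void {
  const mocked = globals.fetch;
  if (globals.__originalFetch !== undefined) {
    globals.fetch = globals.__originalFetch;
  }
  return () => {
    globals.fetch = mocked;
  };
}
```

Inside the describe block, `useRealFetch()` would be called from a `beforeAll` hook and the returned function from `afterAll`, so other test files still see the mocked fetch.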

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5897259.

@BYK BYK merged commit 430083d into main Apr 10, 2026
25 checks passed
@BYK BYK deleted the fix/eval-skill-judge-context branch April 10, 2026 11:39