fix(eval): ground LLM judge with command reference to prevent false negatives #712
…egatives The skill eval judge (Haiku 4.5) had no context about the sentry CLI and was hallucinating that valid commands don't exist, confusing it with the legacy sentry-cli. This caused Opus 4.6 to fail 3/8 eval cases (62.5%, below the 75% threshold) on the overall-quality criterion. Extract the Command Reference section from SKILL.md and inject it into the judge prompt so it can verify planned commands against actual CLI capabilities.
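The extraction step described in this commit could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `extractSection` and `buildJudgePrompt` are hypothetical names, and the sketch assumes the section is introduced by a markdown heading titled "Command Reference" and runs until the next heading of the same or higher level.

```typescript
// Hypothetical helper: return the body under the first heading titled `title`,
// up to the next heading of the same or higher level; null if not found.
// Simplification: "#" lines inside fenced code blocks are treated as headings.
function extractSection(markdown: string, title: string): string | null {
  const lines = markdown.split("\n");
  const headingRe = /^(#{1,6})\s+(.*?)\s*$/;
  let start = -1;
  let level = 0;
  for (let i = 0; i < lines.length; i++) {
    const m = headingRe.exec(lines[i] ?? "");
    if (!m) continue;
    const [, hashes = "", text = ""] = m;
    if (start === -1) {
      if (text === title) {
        start = i + 1;
        level = hashes.length;
      }
    } else if (hashes.length <= level) {
      return lines.slice(start, i).join("\n").trim();
    }
  }
  return start === -1 ? null : lines.slice(start).join("\n").trim();
}

// Hypothetical helper: append the reference to the judge prompt when present.
function buildJudgePrompt(basePrompt: string, skillMd: string): string {
  const reference = extractSection(skillMd, "Command Reference");
  if (reference === null) return basePrompt; // nothing to inject
  return `${basePrompt}\n\n## Command Reference\n${reference}`;
}
```

Returning `null` for a missing section keeps the caller free to decide whether to warn, fail, or fall back to the ungrounded prompt.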
Semver Impact of This PR: 🟢 Patch (bug fixes)
Codecov Results 📊
✅ 134 passed | Total: 134 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch
✨ No test changes detected. All tests are passing successfully.
✅ Patch coverage is 100.00%. Project has 1573 uncovered lines.

Coverage diff
@@            Coverage Diff            @@
##            main     #PR     +/-   ##
==========================================
+ Coverage    95.32%   95.32%   —%
==========================================
  Files       232      232      —
  Lines       33632    33632    —
  Branches    0        0        —
==========================================
+ Hits        32059    32059    —
- Misses      1573     1573     —
- Partials    0        0        —

Generated by Codecov Action
The judge was too strict — it read the compact command reference literally and rejected plans for omitting <org/project> args (which are optional via auto-detection) and using standard flags like --json, --query, --limit that aren't listed in the compact reference. Also add a warning when the Command Reference section is missing from SKILL.md, per Bugbot feedback.
…nce injection Replace the command reference injection approach with empirical verification: commands are now run with `-h` against the real CLI binary to check they exist. The judge receives verification results instead of a pre-built allowed-commands list, keeping it independent and honest. Move the eval into the e2e test suite where the pre-built binary is available, eliminating the standalone CI job. The e2e test auto-skips when ANTHROPIC_API_KEY is absent (non-skill PRs, fork PRs). The fork workflow continues using dev mode (bun run src/bin.ts) for verification.
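The empirical check described in this commit could look something like the sketch below. All names here (`Verification`, `Runner`, `subcommandPath`, `verifyCommands`, `formatForJudge`) are hypothetical, and two things are assumptions rather than facts from the PR: that a zero exit code from `<subcommand> -h` means the command exists, and that everything from the first flag onward can be dropped before probing. The process runner is injected so the logic can be exercised without a real binary.

```typescript
// Hypothetical shape of one verification result handed to the judge.
type Verification = { command: string; exists: boolean };

// A runner executes the CLI with the given argv and returns its exit code.
// Injected so this sketch stays testable without the real binary.
type Runner = (argv: string[]) => number;

// Keep only the subcommand path: drop the binary name and everything from
// the first flag onward. (Assumption: positional args are not stripped.)
function subcommandPath(command: string): string[] {
  const tokens = command.trim().split(/\s+/).slice(1);
  const firstFlag = tokens.findIndex((t) => t.startsWith("-"));
  return firstFlag === -1 ? tokens : tokens.slice(0, firstFlag);
}

// Probe each planned command by running its subcommand path with `-h`.
function verifyCommands(commands: string[], run: Runner): Verification[] {
  return commands.map((command) => ({
    command,
    // Assumption: exit code 0 from `-h` means the subcommand exists.
    exists: run([...subcommandPath(command), "-h"]) === 0,
  }));
}

// Render results as plain lines that can be placed in the judge prompt.
function formatForJudge(results: Verification[]): string {
  return results
    .map((r) => `${r.exists ? "OK" : "NOT FOUND"}: ${r.command}`)
    .join("\n");
}
```

In a real harness the `Runner` would presumably wrap a process spawn of the pre-built binary (or `bun run src/bin.ts` in dev mode); handing the judge observed results rather than an allow-list is what keeps it independent.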
…context # Conflicts: # AGENTS.md
The test preload mocks globalThis.fetch to block external network calls. The skill eval test needs real fetch for Anthropic API, so restore __originalFetch during the describe block.
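The swap described here might be sketched as below. `useRealFetch` and the `Globals` type are hypothetical; the sketch assumes the preload stashes the real implementation on a `__originalFetch` property of the global object before installing its mock, which the commit message implies but does not spell out.

```typescript
// Minimal view of the globals involved. Assumption: the test preload saves
// the real implementation as `__originalFetch` before mocking `fetch`.
type FetchLike = (...args: any[]) => unknown;
type Globals = { fetch: FetchLike; __originalFetch?: FetchLike };

// Swap the real fetch in and return a function that puts the mock back.
function useRealFetch(g: Globals): () => void {
  const mocked = g.fetch;
  if (g.__originalFetch) g.fetch = g.__originalFetch;
  return () => {
    g.fetch = mocked;
  };
}

// Possible wiring inside a describe block (test-runner hooks assumed):
//   let restore: () => void;
//   beforeAll(() => { restore = useRealFetch(globalThis as unknown as Globals); });
//   afterAll(() => restore());
```

Restoring the mock afterwards keeps the network block in place for every other test in the suite.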
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5897259.

Summary
- The skill eval judge (Haiku 4.5) had no context about the `sentry` CLI, causing it to hallucinate that valid commands don't exist (confusing it with the legacy `sentry-cli`)
- Opus 4.6 failed 3/8 eval cases on the `overall-quality` LLM judge criterion, not deterministic checks
- Failing CI run: https://github.com/getsentry/cli/actions/runs/24207303049/job/70666509005