fix(eval): ground LLM judge with command reference to prevent false negatives by BYK · Pull Request #712 · getsentry/cli

BYK · 2026-04-10T10:21:13Z

Summary

The skill eval LLM judge (Haiku 4.5) had zero context about the sentry CLI, causing it to hallucinate that valid commands don't exist (confusing it with the legacy sentry-cli)
This caused Opus 4.6 to fail 3 of 8 eval cases (a 62.5% pass rate, below the 75% threshold); every failure was on the overall-quality LLM judge criterion, not the deterministic checks
Fix: extract the Command Reference section from SKILL.md and inject it into the judge prompt as grounding context
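
A minimal sketch of the extract-and-inject step described above. The PR does not show the actual implementation in script/eval-skill.ts, so the function names (`extractSection`, `buildJudgePrompt`) and the prompt wording are assumptions made for illustration only.

```typescript
// Return the body of the first markdown section whose heading matches `title`,
// stopping at the next heading of the same or a higher level.
// Returns undefined when no such heading exists.
function extractSection(markdown: string, title: string): string | undefined {
  let level = 0; // 0 means "section not found yet"
  let inFence = false;
  const body: string[] = [];
  for (const line of markdown.split("\n")) {
    // Track fenced code blocks so "# comment" lines in shell examples
    // are not mistaken for headings.
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    const m = inFence ? null : /^(#{1,6})\s+(.*?)\s*$/.exec(line);
    if (level === 0) {
      if (m && m[2].toLowerCase() === title.toLowerCase()) level = m[1].length;
      continue;
    }
    if (m && m[1].length <= level) break;
    body.push(line);
  }
  return level === 0 ? undefined : body.join("\n").trim();
}

// Hypothetical prompt builder: the wording below is illustrative,
// not the prompt used by the real eval.
function buildJudgePrompt(plan: string, commandReference: string): string {
  return [
    "You are grading a plan that uses the `sentry` CLI (not the legacy `sentry-cli`).",
    "Use this command reference as ground truth for which commands exist:",
    commandReference,
    "Plan to grade:",
    plan,
  ].join("\n\n");
}
```

Scanning line by line, rather than using a single regular expression, keeps the heading-level rule explicit and makes the code-fence handling easy to follow.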

Failing CI run: https://github.com/getsentry/cli/actions/runs/24207303049/job/70666509005

…egatives The skill eval judge (Haiku 4.5) had no context about the sentry CLI and was hallucinating that valid commands don't exist, confusing it with the legacy sentry-cli. This caused Opus 4.6 to fail 3/8 eval cases (62.5%, below the 75% threshold) on the overall-quality criterion. Extract the Command Reference section from SKILL.md and inject it into the judge prompt so it can verify planned commands against actual CLI capabilities.
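
The pass-rate arithmetic behind those numbers can be sketched as follows. The names `passRate` and `meetsThreshold` are hypothetical, and treating the 75% threshold as inclusive is an assumption; the source only says that 62.5% is below it.

```typescript
// Threshold quoted in the PR description.
const PASS_THRESHOLD = 0.75;

// Fraction of eval cases that passed; an empty suite counts as 0.
function passRate(passed: number, total: number): number {
  return total === 0 ? 0 : passed / total;
}

// Assumption: a rate exactly equal to the threshold counts as passing.
function meetsThreshold(passed: number, total: number): boolean {
  return passRate(passed, total) >= PASS_THRESHOLD;
}

// 3 of 8 cases failed, so 5 of 8 passed: 5 / 8 = 0.625, which is below 0.75.
const opusRun = { passed: 8 - 3, total: 8 };
console.log(passRate(opusRun.passed, opusRun.total), meetsThreshold(opusRun.passed, opusRun.total));
```

Under these assumptions the run would have needed one more passing case (6 of 8, exactly 75%) to clear the gate.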

github-actions · 2026-04-10T10:21:29Z

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).

New Features ✨

Docs

Deploy main branch preview alongside PR previews by BYK in #707
Enable sourcemap upload, releases, and environment tracking by BYK in #705

Other

(cli) Hoist global flags from any argv position and add -v alias by BYK in #709
(commands) Add buildRouteMap wrapper with standard subcommand aliases by BYK in #690
(config) Support .sentryclirc config file for per-directory defaults by BYK in #693
(init) Add fuzzy edit replacers and edits-based apply-patchset by betegon in #698
(install) Add SENTRY_INIT env var to run wizard after install by betegon in #685
(release) Surface adoption and health metrics in list and view (Add release command group with adoption/health subcommand #463) by BYK in #680
(telemetry) Add agent detection tag for AI coding tools by betegon in #687

Bug Fixes 🐛

Dashboard

Add --layout flag to widget add for predictable placement by BYK in #700
Render tracemetrics widgets in dashboard view by BYK in #695

Other

(build) Enable sourcemap resolution for compiled binaries by BYK in #701
(cache) --fresh flag now updates cache with fresh response by BYK in #708

(eval) Ground LLM judge with command reference to prevent false negatives by BYK in #712

(init) Narrow command validation to actual shell injection vectors by betegon in #697
(init,feedback) Default to tracing only in feature select and attach user email to feedback by MathurAditya724 in #688
(setup) Handle read-only .claude directory in sandboxed environments by BYK in #702
Inject auth token into generated .env.sentry-build-plugin files by MathurAditya724 in #706

Internal Changes 🔧

(docs) Gitignore generated command docs, extract fragments by BYK in #696
(eval) Replace OpenAI with Anthropic SDK in init-eval judge by betegon in #683
(init) Use markdown pipeline for spinner messages by betegon in #686
Regenerate skill files and command docs by github-actions[bot] in 584ec0e0

🤖 This preview updates automatically when you update the PR.

github-actions · 2026-04-10T10:24:37Z

Codecov Results 📊

✅ 134 passed | Total: 134 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric	Change
Total Tests	—
Passed Tests	—
Failed Tests	—
Skipped Tests	—

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 1573 uncovered lines.
✅ Project coverage is 95.32%. Comparing base (base) to head (head).

Coverage diff

@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    95.32%    95.32%        —%
==========================================
  Files          232       232         —
  Lines        33632     33632         —
  Branches         0         0         —
==========================================
+ Hits         32059     32059         —
- Misses        1573      1573         —
- Partials         0         0         —

Generated by Codecov Action

script/eval-skill.ts

The judge was too strict — it read the compact command reference literally and rejected plans for omitting <org/project> args (which are optional via auto-detection) and using standard flags like --json, --query, --limit that aren't listed in the compact reference. Also add a warning when the Command Reference section is missing from SKILL.md, per Bugbot feedback.
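
One way the relaxed grounding and the missing-section warning could fit together is sketched below. The function name `buildGrounding`, the warning text, and the note wording are all assumptions; only the facts they encode (optional `<org/project>` args, standard flags such as `--json`, `--query`, `--limit`) come from the commit message above.

```typescript
// Notes telling the judge not to read the compact reference literally.
const LENIENCY_NOTES = [
  "The reference is compact and not exhaustive.",
  "<org/project> arguments are optional because the CLI can auto-detect them.",
  "Standard flags such as --json, --query, and --limit are valid even when not listed.",
];

// Build the grounding text for the judge. When the Command Reference section
// could not be found, warn and fall back to an empty grounding block.
function buildGrounding(
  commandReference: string | undefined,
  warn: (message: string) => void = console.warn,
): string {
  if (!commandReference || commandReference.trim() === "") {
    warn("Command Reference section not found in SKILL.md; judge will run without grounding");
    return "";
  }
  return [commandReference.trim(), ...LENIENCY_NOTES.map((n) => `Note: ${n}`)].join("\n");
}
```

Passing the `warn` callback as a parameter keeps the function easy to exercise in tests without capturing console output.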

…nce injection Replace the command reference injection approach with empirical verification: commands are now run with `-h` against the real CLI binary to check they exist. The judge receives verification results instead of a pre-built allowed-commands list, keeping it independent and honest. Move the eval into the e2e test suite where the pre-built binary is available, eliminating the standalone CI job. The e2e test auto-skips when ANTHROPIC_API_KEY is absent (non-skill PRs, fork PRs). The fork workflow continues using dev mode (bun run src/bin.ts) for verification.
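
The empirical check described above can be sketched roughly like this. It is not the code in test/skill-eval/helpers/verify.ts, which the page does not show; `verifyCommand` and `VerificationResult` are hypothetical names, the binary path in the comment is illustrative, and the exit-code convention is an assumption about how the CLI responds to `-h`.

```typescript
import { spawnSync } from "node:child_process";

interface VerificationResult {
  command: string;
  exists: boolean;
}

// `cliArgv` is how to launch the CLI: for example a pre-built binary path
// (illustrative: ["./dist/sentry"]) or dev mode (["bun", "run", "src/bin.ts"]).
function verifyCommand(cliArgv: string[], subcommand: string[]): VerificationResult {
  const [bin, ...prefix] = cliArgv;
  if (!bin) throw new Error("cliArgv must contain at least the executable");
  const result = spawnSync(bin, [...prefix, ...subcommand, "-h"], {
    encoding: "utf8",
    timeout: 10_000,
  });
  // Assumption: the CLI exits 0 when asked for help on a known command and
  // non-zero for an unknown one. A failed spawn sets `error` and leaves
  // `status` null, so it is reported as "does not exist".
  return {
    command: [...subcommand, "-h"].join(" "),
    exists: result.error === undefined && result.status === 0,
  };
}
```

The judge would then receive the list of `VerificationResult` values rather than a pre-built list of allowed commands, which matches the stated goal of keeping the judge independent of the skill file it is grading.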


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5897259.

AGENTS.md

BYK marked this pull request as ready for review April 10, 2026 10:23

cursor bot reviewed Apr 10, 2026

script/eval-skill.ts (outdated, resolved)

BYK added 4 commits April 10, 2026 10:27

ci: re-trigger CI pipeline

fb4cfab

Merge remote-tracking branch 'origin/main' into fix/eval-skill-judge-…

5977323

…context # Conflicts: # AGENTS.md

sentry bot reviewed Apr 10, 2026

test/skill-eval/helpers/verify.ts (resolved)

.github/workflows/ci.yml (resolved)

fix(eval): restore real fetch for Anthropic API calls in e2e test

5897259

The test preload mocks globalThis.fetch to block external network calls. The skill eval test needs real fetch for Anthropic API, so restore __originalFetch during the describe block.
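
A rough sketch of that swap, assuming the preload stores the real implementation on `globalThis.__originalFetch` (the source names `__originalFetch` but does not say where it lives). The helper name `useRealFetch` is hypothetical.

```typescript
type FetchFn = (...args: any[]) => Promise<unknown>;

// Assumption: the test preload replaced globalThis.fetch with a blocking mock
// and kept the real implementation on globalThis.__originalFetch.
type PatchedGlobal = { fetch?: FetchFn; __originalFetch?: FetchFn };

// Switch to the real fetch and return a function that puts the mock back.
function useRealFetch(): () => void {
  const g = globalThis as unknown as PatchedGlobal;
  const mocked = g.fetch;
  if (g.__originalFetch) {
    g.fetch = g.__originalFetch;
  }
  return () => {
    g.fetch = mocked;
  };
}

// Possible wiring inside the describe block (bun:test style hooks):
//   let restore: () => void;
//   beforeAll(() => { restore = useRealFetch(); });
//   afterAll(() => restore());
```

Returning the restore function keeps the swap scoped to the describe block, so other tests keep the network-blocking mock.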

cursor bot reviewed Apr 10, 2026

AGENTS.md (resolved)

BYK merged commit 430083d into main Apr 10, 2026
25 checks passed

BYK deleted the fix/eval-skill-judge-context branch April 10, 2026 11:39

sentry-release-bot bot mentioned this pull request Apr 10, 2026

publish: getsentry/cli@0.26.0 getsentry/publish#7770

Closed

5 tasks

BYK merged 6 commits into main from fix/eval-skill-judge-context
