Skip to content

Investigation of inconsistent review results on docker/pinata PR #39588#62

Merged
derekmisler merged 1 commit intodocker:mainfrom
derekmisler:ensure-the-agent-has-the-diff-and-dont-try-to-re
Mar 3, 2026
Merged

Investigation of inconsistent review results on docker/pinata PR #39588#62
derekmisler merged 1 commit intodocker:mainfrom
derekmisler:ensure-the-agent-has-the-diff-and-dont-try-to-re

Conversation

@derekmisler
Copy link
Copy Markdown
Contributor

@derekmisler derekmisler commented Mar 2, 2026

Summary

  • Fix false approvals when verifier fails: ANTI-LOOP RULE now posts a COMMENT with unverified findings instead of silently approving
  • Add deterministic git diff merge-base fallback when gh pr diff fails, with stderr capture for diagnostics
  • Upgrade drafter model from Haiku to Sonnet for better bug detection recall (verifier stays on Haiku)
  • Add file-based chunking for large diffs (>1500 lines) — splits at diff --git boundaries into ~1000-line chunks
  • Log which diff method succeeded (pr.diff / gh_pr_diff / git_diff) for observability
  • Simplify metadata instructions — use prompt-provided context instead of reading pr_metadata.json from disk
  • Fix memory cache concurrency by adding github.job to cache keys

Context

Investigation of inconsistent review results on docker/pinata PR #39588 revealed:

  1. When gh pr diff failed (bare number in detached HEAD), the LLM improvised differently each run
  2. The ANTI-LOOP RULE auto-approved when the verifier returned garbage
  3. Haiku (drafter model) had inconsistent bug detection recall

These changes make the pipeline more deterministic — bash-level fallbacks handle diff failures before the LLM is involved, verifier failures surface findings instead of hiding them, and Sonnet improves draft quality at the cost-sensitive step.

Test plan

  • Local test: cagent run review-pr/agents/pr-review.yaml "Review my changes" from a repo with a local branch
  • CI test: bump a consuming repo to this version and trigger /review on a known-buggy PR
  • Diff pre-fetch: check workflow logs for stderr capture on gh pr diff failure
  • Chunking: test on a large PR (20+ files) and verify findings are merged correctly
  • Memory: trigger two concurrent reviews and verify no cache collisions

@derekmisler derekmisler force-pushed the ensure-the-agent-has-the-diff-and-dont-try-to-re branch 4 times, most recently from d35e233 to 75f86cf Compare March 2, 2026 16:46
@derekmisler derekmisler marked this pull request as ready for review March 2, 2026 16:52
@derekmisler derekmisler requested a review from a team as a code owner March 2, 2026 16:52
Copy link
Copy Markdown
Contributor

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

⚠️ REQUEST_CHANGES: Found 1 critical issue in the diff fallback logic that could lead to false approvals.

Critical Issue

The new git diff fallback at line 170 lacks error handling — if the command fails, pr.diff may be created but empty/corrupt, causing the agent to approve an empty diff (the exact false approval bug this PR aims to fix).

Recommendation

Add error checking after the git diff command:

if ! git diff "$MERGE_BASE"...HEAD > pr.diff; then
  echo "Error: git diff failed" >&2
  exit 1
fi

This ensures the pipeline fails fast with diagnostics instead of silently proceeding with a bad diff.


✅ The other changes (ANTI-LOOP RULE, model upgrade, chunking, metadata simplification, cache fix) look good and address the determinism issues described in the PR.

Signed-off-by: Derek Misler <derek.misler@docker.com>
@derekmisler derekmisler force-pushed the ensure-the-agent-has-the-diff-and-dont-try-to-re branch from 75f86cf to 3bca9c5 Compare March 2, 2026 16:56
@derekmisler derekmisler merged commit 90321c9 into docker:main Mar 3, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants