Skip to content

ai-code-review v2: subagent split, embedded Python CLI, deterministic post-steps#2

Merged
Andrey9kin merged 17 commits into
mainfrom
2026-07-03-ai-code-review-v2
Jul 3, 2026
Merged

ai-code-review v2: subagent split, embedded Python CLI, deterministic post-steps#2
Andrey9kin merged 17 commits into
mainfrom
2026-07-03-ai-code-review-v2

Conversation

@Andrey9kin

@Andrey9kin Andrey9kin commented Jul 3, 2026

Copy link
Copy Markdown
Member

Summary

Rewrite of the reusable ai-code-review workflow (OpenCode + Amazon Bedrock), plus hardening fixes from a follow-up code review of the rewrite.

Architecture (v2)

  • Subagent split. The orchestrator (main model) reads one pre-assembled context file, picks 2–3 review dimensions from the diff (canonical: business_logic, security, performance; repo config can force extras), and spawns review subagents in parallel. One mechanical subagent runs on a cheap model (subagent_model, default nvidia.nemotron-super-3-120b): comment-categorizer, which classifies existing bot comments.
  • Judgment up, execution down. Comment management is code, not an agent: the orchestrator writes one comment-plan.json (what to post/reply/resolve, honoring human "false positive"/"won't fix" feedback), and the CLI's apply-plan command executes it deterministically — validates every id against the categorizer output, posts with a reply → new-inline → summary-observations fallback so findings are never dropped, resolve-replies stale/fixed threads (skipping already-marked ones), deletes no_issues leftovers, and renders the summary comment from a fixed template (including a note pointing AI coding agents at the inline threads). This replaced a comment-lifecycle agent that was the slowest and least reliable phase.
  • Timing report. Every CLI invocation logs UTC time + duration to .ai-review-work/timing.log; a final workflow step prints it with work-file mtimes so per-phase time inside the OpenCode step is visible.
  • One embedded Python CLI (stdlib-only, shells out to gh/yq) replaces all bash pre-compute logic: config sanitization, per-file diff splitting, comment/thread pre-fetch, rules-file discovery (AGENTS.md/CLAUDE.md incl. per-dir), generated-only skip check, and the agent-facing comment operations.
  • Deterministic post-steps. Thread resolution and summary dedup are plain Python steps, not agent decisions. The agent only posts a resolve-reply; the CLI tags it with a hidden <!-- ai-review-resolve --> marker and the post-step resolves marker-tagged threads via GraphQL — skipping any thread where a human commented after the marker, so disputes are never buried.
  • Findings pipeline. Structured findings with dedup across dimensions, a strict severity filter, consolidation of repeated patterns, and a max_comments cap (default 10; overflow goes to the summary).
  • Optional BORIS infra context. Setting boris_mcp_url (org-controlled caller input, never repo config) installs the bmcp CLI from a pinned, checksum-verified release (v0.3.0, which adds OpenCode harness support); review subagents get read-only live AWS context for IaC changes. Auth reuses the existing Bedrock OIDC role. bmcp install opencode --scope project writes ./BORIS.md and a managed block in ./AGENTS.md into the runner's worktree, so the orchestrator and every subagent pick up the live tool catalog through OpenCode project rules instead of a hand-written prompt blurb. It runs after precompute, so the managed block never leaks into the assembled review rules.

Behavior changes

  • The auto-approve step ("approve PR when no unresolved bot threads") is removed — humans approve PRs.
  • Dropped vestigial inputs: review_plugin, plugin_marketplace_repo, max_turns (was dead plumbing).
  • Human feedback like "false positive" / "won't fix" on a thread is honored; the issue is not re-raised.
  • Only bot comments that are genuine "no issues"/LGTM-style leftovers are cleaned up; other automation on the same GitHub App identity is never touched.

Hardening (from review of the rewrite)

  • Diffs computed from the merge base to match pulls/files API patch semantics when the base branch has advanced.
  • Null comment user fields (deleted/ghost accounts) tolerated everywhere.
  • Config edge cases: scalar additional_dimensions, max_comments: 0, injection-shaped values.
  • Comment-mutation failures exit nonzero; apply-plan falls back (reply → new inline comment → summary observations) instead of silently dropping a finding.
  • review-context.md caps inlined diffs (20k/file, 250k total) with pointers to on-disk per-file diffs.
  • Pinned, SHA256-verified release installs for both OpenCode and bmcp.

Validation

  • Workflow YAML parses (reusable workflow, repo caller, adoption template).
  • Embedded Python extracted and byte-compiled; 29 functional checks pass against the extracted CLI with mocked gh/yq/git (config sanitization, resolve-thread marker/dispute/null-author matrix, null-user pre-compute, context truncation, exit codes).
  • apply-plan functionally tested against the extracted CLI with a mocked gh: id validation, fallback demotion, marker skipping, no_issues cleanup, template summary, exit codes.
  • Exercised on this PR's own review runs; live findings so far: GitHub requires the App to hold Contents R/W for resolveReviewThread (documented in README), and the old lifecycle agent posted a literal-\n summary once (now impossible — the summary is code-rendered).

🤖 Generated with Claude Code

Andrey9kin and others added 6 commits July 3, 2026 16:09
…istic post-steps

Adopt patterns from the hellohippo reference implementation while keeping
OpenCode + Bedrock:

- Move all deterministic logic (config sanitization, pre-compute, comment
  operations, post-steps) from bash into one embedded stdlib-only Python CLI
- Split the monolithic prompt into an orchestrator plus OpenCode subagents:
  comment-categorizer and comment-lifecycle run on a cheap configurable model
  (default nvidia.nemotron-super-3-120b); review dimensions are capped at 2-3
- Structure findings as {file, line, message, dimension, severity} with a
  severity filter, cross-file consolidation, and a max_comments cap (overflow
  goes to the summary comment)
- Track ALL thread replies (bot + human) so human "false positive" / "won't
  fix" feedback is honored instead of re-litigated
- Resolve review threads for real: deterministic post-step matches bot
  "Resolved ..." replies and resolves threads via GraphQL; a second post-step
  dedupes summary comments
- Split diffs per file and assemble one review-context.md so agents read a
  single pre-built file instead of hunting via API calls
- Optional BORIS infra context: boris_mcp_url input installs the bmcp CLI
  (pinned release + SHA256, SigV4 auth reuses the Bedrock OIDC role)
- Expand .github/ai-review.yml repo config: model, max_turns, max_comments,
  additional_dimensions (all sanitized against PR-branch injection)
- Drop the auto-approve step and the vestigial review_plugin inputs

Design and decision log in HANDOFF.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Ten ranked findings from a multi-agent review of the v2 rewrite (top five
marked ship-blockers: NO_ISSUES over-deletion, unanchored RESOLVED_RE,
human-unresolve fighting, null-user crash, scalar additional_dimensions),
plus verified lower-priority items and one refuted suggestion. Fixes not
yet applied.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Narrow NO_ISSUES so unrelated bot comments are never deleted (finding 1)
- Replace phrase-matching with a hidden <!-- ai-review-resolve --> marker
  prepended by resolve-reply; resolve-threads matches only the marker and
  yields when a human commented after it (findings 2, 3)
- Tolerate null comment users from deleted/ghost accounts (finding 4)
- Accept scalar additional_dimensions instead of iterating chars (finding 5)
- Exit nonzero on mutation failure and restore the lifecycle fallback
  chain reply -> post-inline -> summary observations (finding 6)
- Fall back to literal text when an @-prefixed body is not a file (finding 7)
- Honor configured max_comments: 0 (finding 8)
- Diff from the merge base to match pulls/files API semantics (finding 9)
- Cap per-file (20k) and total (250k) diff bytes in review-context.md with
  pointers to the on-disk diffs (finding 10)
- Cleanups: drop dead max_turns plumbing, bash-heredoc note for the
  write-disabled lifecycle agent, 400-char reply truncation for dismissal
  detection, generated-only check moved before expensive precompute work

Validated: YAML parses, embedded Python byte-compiles, 29 functional
checks pass against the extracted CLI with mocked gh/yq/git.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Working notes for the v2 redesign; superseded by the final implementation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Verified against the new release: run flags used here (--title, -m,
--print-logs, --log-level) unchanged; permission defaults still allow
bash in-worktree without --auto; agent definitions move to the now
documented .opencode/agents/ directory (the singular path still loads
but is legacy).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…llers

The org-wide secret holding the Bedrock OIDC role is named
DEVELOPMENT_ACCOUNT_ROLE_ARN; the reusable workflow keeps its
BEDROCK_ROLE_ARN secret interface and callers map the name.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review-caller.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread workflow-templates/ai-review-caller.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
@fivexl-reviewer

fivexl-reviewer Bot commented Jul 3, 2026

Copy link
Copy Markdown

AI Code Review Summary

  • Findings posted: 2 inline, 0 thread replies
  • By dimension: business_logic: 2
  • Threads marked resolved: 0

Additional observations

  • Additional observations (lower severity): (1) Line 548: malformed post items are silently skipped without incrementing 'failed' count, causing silent data loss if agent generates invalid plan. (2) Line 370: git merge-base failure exits without graceful error handling for edge cases like unrelated histories or shallow checkouts. (3) Security: path validation in read_body() and line number validation could be strengthened for defense-in-depth, though the author correctly noted the CLI is not a security boundary.

Note for AI coding agents: detailed findings are posted as inline review comments on the diff — read the unresolved review threads (e.g. gh api repos/<owner>/<repo>/pulls/<pr>/comments) before making changes; this summary is only an overview.

This summary was automatically generated by the AI code review workflow.

Andrey9kin and others added 2 commits July 3, 2026 20:03
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Validate inputs.subagent_model against the model regex before it is
  substituted into agent frontmatter (defense in depth; callers are
  org-controlled)
- Substitute ${SUBAGENT_MODEL} with envsubst restricted to that one
  variable, at heredoc write time — replaces the perl -pi pass; the
  value is inserted as literal data either way
- read_body: an argument that is a bare path to an existing file under
  the work dir is read as a file with a warning — the first live run
  showed the lifecycle agent passing the summary file path without '@',
  which posted the literal path as the summary comment
- Lifecycle prompt: state explicitly that file contents require the '@'
  prefix

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml Outdated
Andrey9kin and others added 3 commits July 3, 2026 20:28
bmcp install all --scope project appends BORIS usage guidance (with the
synced tool list) to AGENTS.md and friends inside markers; OpenCode reads
AGENTS.md at session start, so agents discover bmcp without relying on
the orchestrator prompt alone. Runs after bmcp sync, soft-fails like the
other bmcp steps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
bmcp v0.3.0 adds OpenCode harness support: `bmcp install opencode
--scope project` writes ./BORIS.md (usage + live tool catalog) and a
managed block in ./AGENTS.md, which OpenCode loads as project rules for
the orchestrator and every subagent. This replaces the hand-written
usage blurb in the orchestrator prompt with the real synced catalog.
Install runs after precompute, so the managed block never leaks into
the assembled review rules.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`workflows` -> `.github/workflows` is a convenience symlink; AGENTS.md
documents it so agents don't mistake it for a duplicate directory.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
Comment thread .github/workflows/ai-code-review.yml
Andrey9kin and others added 2 commits July 3, 2026 22:31
Thread resolution: all resolveReviewThread mutations failed because
GitHub requires the App to hold Contents: Read and write for that
mutation ("Resource not accessible by integration") — pull_requests
write is not enough. Document the required permission in the README
and include the GraphQL error in the resolve-threads warning so the
next failure is diagnosable from the run log.

Summary bodies: the lifecycle agent passed a whole multi-line summary
as one line of literal \n escapes, which bash does not expand, so the
comment rendered raw. read_body now repairs the unambiguous case
(3+ literal \n, zero real newlines) with a warning, and the agent
rules forbid literal \n escapes explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Gemma 4 is not available on Bedrock (checked via BORIS in us-east-1,
us-west-2, eu-west-1); google.gemma-3-27b-it is the largest Gemma.
The review summary now ends with a fixed note pointing AI coding
agents at the unresolved inline review threads, so agents acting on
the PR don't treat the summary as the full findings list.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Andrey9kin and others added 2 commits July 3, 2026 23:05
Gemma 4 (31B / 26B-A4B / E2B) launched on Bedrock in June 2026 but is
served only through the bedrock-mantle endpoint (openai/v1/responses
path), which the OpenCode amazon-bedrock provider does not reach; it
does not appear in the classic list_foundation_models catalog.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…mand

The lifecycle agent was the slowest phase (cheap model, one tool call
at a time) and the least reliable. Judgment moves up, execution moves
down: the orchestrator — which already has findings, existing comments,
and human replies in context — writes one comment-plan.json; the new
apply-plan command executes it in code: validates every id against the
categorizer output, posts/replies with the reply -> inline ->
observations fallback, resolve-replies stale + fixed threads (skipping
already-marked ones), deletes no_issues comments, and renders the
summary from a fixed template (agent note included, no more literal \n
risk). Partial failures are demoted to observations and don't fail the
command; only a failed summary upsert does.

Also adds timing instrumentation: every CLI invocation logs UTC time +
duration to stdout and .ai-review-work/timing.log, and a final "Report
timings" step prints the log plus work-file mtimes so the time spent in
each phase inside the OpenCode step is visible per run.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
body = SUMMARY_MARKER + "\n" + body
summary_file = WORK_DIR / "summary-comment-id.txt"
summary_id = summary_file.read_text().strip() if summary_file.is_file() else ""
if summary_id:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmd_upsert_summary has a logic bug: if summary_id exists and the PATCH request fails, the code falls through to CREATE at line 479, creating a duplicate summary comment instead of returning False. The CREATE should only execute when summary_id was empty (first-time summary), not on PATCH failure. Recommend: move the CREATE block inside an 'else' branch after the 'if summary_id:' block, or add 'return False' after logging a PATCH failure that had a non-empty summary_id.

Addresses the reviewer finding on upsert-summary: a failed PATCH fell
through to creating a new summary comment, duplicating it on transient
errors (dedupe-summaries would mop up, but avoid the churn). Creation
now only happens when the PATCH failed with HTTP 404 — the existing
comment was deleted — otherwise the command fails cleanly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
query($owner: String!, $name: String!, $pr: Int!) {
repository(owner: $owner, name: $name) {
pullRequest(number: $pr) {
reviewThreads(first: 100) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THREADS_QUERY fetches only first 100 threads and 50 comments per thread. PRs with >100 threads or threads with >50 comments will have unresolved threads silently left behind. The warning at line 664 only triggers when exactly 100 threads are returned, but pagination could return 100, 200, etc. Consider implementing proper pagination or documenting this limitation.

body = SUMMARY_MARKER + "\n" + body
summary_file = WORK_DIR / "summary-comment-id.txt"
summary_id = summary_file.read_text().strip() if summary_file.is_file() else ""
if summary_id:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmd_upsert_summary has a logic bug: when summary_id exists and the PATCH request fails with a non-404 error (e.g., rate limit, network issue), the code falls through to line 488 creating a new comment, resulting in duplicate summaries. The 404 check at line 485 only prevents fallthrough for deleted comments. Should return False on any non-404 failure.

…eads

Observed live: the orchestrator's plan posted a new root comment at a
file+line that already had an active bot thread, opening a duplicate
thread. The same-line dedup is now a code guarantee, not a prompt rule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Andrey9kin Andrey9kin merged commit 57f0cdf into main Jul 3, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant