ai-code-review v2: subagent split, embedded Python CLI, deterministic post-steps by Andrey9kin · Pull Request #2 · fivexl/shared-workflows

Andrey9kin · 2026-07-03T15:57:01Z

Summary

Rewrite of the reusable ai-code-review workflow (OpenCode + Amazon Bedrock), plus hardening fixes from a follow-up code review of the rewrite.

Architecture (v2)

Subagent split. The orchestrator (main model) reads one pre-assembled context file, picks 2–3 review dimensions from the diff (canonical: business_logic, security, performance; repo config can force extras), and spawns review subagents in parallel. One mechanical subagent runs on a cheap model (subagent_model, default nvidia.nemotron-super-3-120b): comment-categorizer, which classifies existing bot comments.
Judgment up, execution down. Comment management is code, not an agent: the orchestrator writes one comment-plan.json (what to post/reply/resolve, honoring human "false positive"/"won't fix" feedback), and the CLI's apply-plan command executes it deterministically — validates every id against the categorizer output, posts with a reply → new-inline → summary-observations fallback so findings are never dropped, resolve-replies stale/fixed threads (skipping already-marked ones), deletes no_issues leftovers, and renders the summary comment from a fixed template (including a note pointing AI coding agents at the inline threads). This replaced a comment-lifecycle agent that was the slowest and least reliable phase.
Timing report. Every CLI invocation logs UTC time + duration to .ai-review-work/timing.log; a final workflow step prints it with work-file mtimes so per-phase time inside the OpenCode step is visible.
One embedded Python CLI (stdlib-only, shells out to gh/yq) replaces all bash pre-compute logic: config sanitization, per-file diff splitting, comment/thread pre-fetch, rules-file discovery (AGENTS.md/CLAUDE.md incl. per-dir), generated-only skip check, and the agent-facing comment operations.
Deterministic post-steps. Thread resolution and summary dedup are plain Python steps, not agent decisions. The agent only posts a resolve-reply; the CLI tags it with a hidden  marker and the post-step resolves marker-tagged threads via GraphQL — skipping any thread where a human commented after the marker, so disputes are never buried.
Findings pipeline. Structured findings with dedup across dimensions, a strict severity filter, consolidation of repeated patterns, and a max_comments cap (default 10; overflow goes to the summary).
Optional BORIS infra context. Setting boris_mcp_url (org-controlled caller input, never repo config) installs the bmcp CLI from a pinned, checksum-verified release (v0.3.0, which adds OpenCode harness support); review subagents get read-only live AWS context for IaC changes. Auth reuses the existing Bedrock OIDC role. bmcp install opencode --scope project writes ./BORIS.md and a managed block in ./AGENTS.md into the runner's worktree, so the orchestrator and every subagent pick up the live tool catalog through OpenCode project rules instead of a hand-written prompt blurb. It runs after precompute, so the managed block never leaks into the assembled review rules.

Behavior changes

The auto-approve step ("approve PR when no unresolved bot threads") is removed — humans approve PRs.
Dropped vestigial inputs: review_plugin, plugin_marketplace_repo, max_turns (was dead plumbing).
Human feedback like "false positive" / "won't fix" on a thread is honored; the issue is not re-raised.
Only bot comments that are genuine "no issues"/LGTM-style leftovers are cleaned up; other automation on the same GitHub App identity is never touched.

Hardening (from review of the rewrite)

Diffs computed from the merge base to match pulls/files API patch semantics when the base branch has advanced.
Null comment user fields (deleted/ghost accounts) tolerated everywhere.
Config edge cases: scalar additional_dimensions, max_comments: 0, injection-shaped values.
Comment-mutation failures exit nonzero; apply-plan falls back (reply → new inline comment → summary observations) instead of silently dropping a finding.
review-context.md caps inlined diffs (20k/file, 250k total) with pointers to on-disk per-file diffs.
Pinned, SHA256-verified release installs for both OpenCode and bmcp.

Validation

Workflow YAML parses (reusable workflow, repo caller, adoption template).
Embedded Python extracted and byte-compiled; 29 functional checks pass against the extracted CLI with mocked gh/yq/git (config sanitization, resolve-thread marker/dispute/null-author matrix, null-user pre-compute, context truncation, exit codes).
apply-plan functionally tested against the extracted CLI with a mocked gh: id validation, fallback demotion, marker skipping, no_issues cleanup, template summary, exit codes.
Exercised on this PR's own review runs; live findings so far: GitHub requires the App to hold Contents R/W for resolveReviewThread (documented in README), and the old lifecycle agent posted a literal-\n summary once (now impossible — the summary is code-rendered).

🤖 Generated with Claude Code

…istic post-steps Adopt patterns from the hellohippo reference implementation while keeping OpenCode + Bedrock: - Move all deterministic logic (config sanitization, pre-compute, comment operations, post-steps) from bash into one embedded stdlib-only Python CLI - Split the monolithic prompt into an orchestrator plus OpenCode subagents: comment-categorizer and comment-lifecycle run on a cheap configurable model (default nvidia.nemotron-super-3-120b); review dimensions are capped at 2-3 - Structure findings as {file, line, message, dimension, severity} with a severity filter, cross-file consolidation, and a max_comments cap (overflow goes to the summary comment) - Track ALL thread replies (bot + human) so human "false positive" / "won't fix" feedback is honored instead of re-litigated - Resolve review threads for real: deterministic post-step matches bot "Resolved ..." replies and resolves threads via GraphQL; a second post-step dedupes summary comments - Split diffs per file and assemble one review-context.md so agents read a single pre-built file instead of hunting via API calls - Optional BORIS infra context: boris_mcp_url input installs the bmcp CLI (pinned release + SHA256, SigV4 auth reuses the Bedrock OIDC role) - Expand .github/ai-review.yml repo config: model, max_turns, max_comments, additional_dimensions (all sanitized against PR-branch injection) - Drop the auto-approve step and the vestigial review_plugin inputs Design and decision log in HANDOFF.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Ten ranked findings from a multi-agent review of the v2 rewrite (top five marked ship-blockers: NO_ISSUES over-deletion, unanchored RESOLVED_RE, human-unresolve fighting, null-user crash, scalar additional_dimensions), plus verified lower-priority items and one refuted suggestion. Fixes not yet applied. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- Narrow NO_ISSUES so unrelated bot comments are never deleted (finding 1) - Replace phrase-matching with a hidden  marker prepended by resolve-reply; resolve-threads matches only the marker and yields when a human commented after it (findings 2, 3) - Tolerate null comment users from deleted/ghost accounts (finding 4) - Accept scalar additional_dimensions instead of iterating chars (finding 5) - Exit nonzero on mutation failure and restore the lifecycle fallback chain reply -> post-inline -> summary observations (finding 6) - Fall back to literal text when an @-prefixed body is not a file (finding 7) - Honor configured max_comments: 0 (finding 8) - Diff from the merge base to match pulls/files API semantics (finding 9) - Cap per-file (20k) and total (250k) diff bytes in review-context.md with pointers to the on-disk diffs (finding 10) - Cleanups: drop dead max_turns plumbing, bash-heredoc note for the write-disabled lifecycle agent, 400-char reply truncation for dismissal detection, generated-only check moved before expensive precompute work Validated: YAML parses, embedded Python byte-compiles, 29 functional checks pass against the extracted CLI with mocked gh/yq/git. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Working notes for the v2 redesign; superseded by the final implementation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Verified against the new release: run flags used here (--title, -m, --print-logs, --log-level) unchanged; permission defaults still allow bash in-worktree without --auto; agent definitions move to the now documented .opencode/agents/ directory (the singular path still loads but is legacy). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…llers The org-wide secret holding the Bedrock OIDC role is named DEVELOPMENT_ACCOUNT_ROLE_ARN; the reusable workflow keeps its BEDROCK_ROLE_ARN secret interface and callers map the name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fivexl-reviewer · 2026-07-03T18:41:23Z

AI Code Review Summary

Findings posted: 2 inline, 0 thread replies
By dimension: business_logic: 2
Threads marked resolved: 0

Additional observations

Additional observations (lower severity): (1) Line 548: malformed post items are silently skipped without incrementing 'failed' count, causing silent data loss if agent generates invalid plan. (2) Line 370: git merge-base failure exits without graceful error handling for edge cases like unrelated histories or shallow checkouts. (3) Security: path validation in read_body() and line number validation could be strengthened for defense-in-depth, though the author correctly noted the CLI is not a security boundary.

Note for AI coding agents: detailed findings are posted as inline review comments on the diff — read the unresolved review threads (e.g. gh api repos/<owner>/<repo>/pulls/<pr>/comments) before making changes; this summary is only an overview.

This summary was automatically generated by the AI code review workflow.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- Validate inputs.subagent_model against the model regex before it is substituted into agent frontmatter (defense in depth; callers are org-controlled) - Substitute ${SUBAGENT_MODEL} with envsubst restricted to that one variable, at heredoc write time — replaces the perl -pi pass; the value is inserted as literal data either way - read_body: an argument that is a bare path to an existing file under the work dir is read as a file with a warning — the first live run showed the lifecycle agent passing the summary file path without '@', which posted the literal path as the summary comment - Lifecycle prompt: state explicitly that file contents require the '@' prefix Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

bmcp install all --scope project appends BORIS usage guidance (with the synced tool list) to AGENTS.md and friends inside markers; OpenCode reads AGENTS.md at session start, so agents discover bmcp without relying on the orchestrator prompt alone. Runs after bmcp sync, soft-fails like the other bmcp steps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

bmcp v0.3.0 adds OpenCode harness support: `bmcp install opencode --scope project` writes ./BORIS.md (usage + live tool catalog) and a managed block in ./AGENTS.md, which OpenCode loads as project rules for the orchestrator and every subagent. This replaces the hand-written usage blurb in the orchestrator prompt with the real synced catalog. Install runs after precompute, so the managed block never leaks into the assembled review rules. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

`workflows` -> `.github/workflows` is a convenience symlink; AGENTS.md documents it so agents don't mistake it for a duplicate directory. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Thread resolution: all resolveReviewThread mutations failed because GitHub requires the App to hold Contents: Read and write for that mutation ("Resource not accessible by integration") — pull_requests write is not enough. Document the required permission in the README and include the GraphQL error in the resolve-threads warning so the next failure is diagnosable from the run log. Summary bodies: the lifecycle agent passed a whole multi-line summary as one line of literal \n escapes, which bash does not expand, so the comment rendered raw. read_body now repairs the unambiguous case (3+ literal \n, zero real newlines) with a warning, and the agent rules forbid literal \n escapes explicitly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Gemma 4 is not available on Bedrock (checked via BORIS in us-east-1, us-west-2, eu-west-1); google.gemma-3-27b-it is the largest Gemma. The review summary now ends with a fixed note pointing AI coding agents at the unresolved inline review threads, so agents acting on the PR don't treat the summary as the full findings list. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Gemma 4 (31B / 26B-A4B / E2B) launched on Bedrock in June 2026 but is served only through the bedrock-mantle endpoint (openai/v1/responses path), which the OpenCode amazon-bedrock provider does not reach; it does not appear in the classic list_foundation_models catalog. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…mand The lifecycle agent was the slowest phase (cheap model, one tool call at a time) and the least reliable. Judgment moves up, execution moves down: the orchestrator — which already has findings, existing comments, and human replies in context — writes one comment-plan.json; the new apply-plan command executes it in code: validates every id against the categorizer output, posts/replies with the reply -> inline -> observations fallback, resolve-replies stale + fixed threads (skipping already-marked ones), deletes no_issues comments, and renders the summary from a fixed template (agent note included, no more literal \n risk). Partial failures are demoted to observations and don't fail the command; only a failed summary upsert does. Also adds timing instrumentation: every CLI invocation logs UTC time + duration to stdout and .ai-review-work/timing.log, and a final "Report timings" step prints the log plus work-file mtimes so the time spent in each phase inside the OpenCode step is visible per run. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fivexl-reviewer · 2026-07-03T22:26:04Z

+                  body = SUMMARY_MARKER + "\n" + body
+              summary_file = WORK_DIR / "summary-comment-id.txt"
+              summary_id = summary_file.read_text().strip() if summary_file.is_file() else ""
+              if summary_id:


cmd_upsert_summary has a logic bug: if summary_id exists and the PATCH request fails, the code falls through to CREATE at line 479, creating a duplicate summary comment instead of returning False. The CREATE should only execute when summary_id was empty (first-time summary), not on PATCH failure. Recommend: move the CREATE block inside an 'else' branch after the 'if summary_id:' block, or add 'return False' after logging a PATCH failure that had a non-empty summary_id.

Addresses the reviewer finding on upsert-summary: a failed PATCH fell through to creating a new summary comment, duplicating it on transient errors (dedupe-summaries would mop up, but avoid the churn). Creation now only happens when the PATCH failed with HTTP 404 — the existing comment was deleted — otherwise the command fails cleanly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fivexl-reviewer · 2026-07-03T22:35:57Z

+          query($owner: String!, $name: String!, $pr: Int!) {
+            repository(owner: $owner, name: $name) {
+              pullRequest(number: $pr) {
+                reviewThreads(first: 100) {


THREADS_QUERY fetches only first 100 threads and 50 comments per thread. PRs with >100 threads or threads with >50 comments will have unresolved threads silently left behind. The warning at line 664 only triggers when exactly 100 threads are returned, but pagination could return 100, 200, etc. Consider implementing proper pagination or documenting this limitation.

fivexl-reviewer · 2026-07-03T22:35:57Z

+                  body = SUMMARY_MARKER + "\n" + body
+              summary_file = WORK_DIR / "summary-comment-id.txt"
+              summary_id = summary_file.read_text().strip() if summary_file.is_file() else ""
+              if summary_id:


cmd_upsert_summary has a logic bug: when summary_id exists and the PATCH request fails with a non-404 error (e.g., rate limit, network issue), the code falls through to line 488 creating a new comment, resulting in duplicate summaries. The 404 check at line 485 only prevents fallthrough for deleted comments. Should return False on any non-404 failure.

…eads Observed live: the orchestrator's plan posted a new root comment at a file+line that already had an active bot thread, opening a duplicate thread. The same-line dedup is now a code guarantee, not a prompt rule. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Andrey9kin and others added 6 commits July 3, 2026 16:09

Remove handoff notes

baf80a6

Working notes for the v2 redesign; superseded by the final implementation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>