feat: PR babysit + verify/TDD loop — judge → fix → verify → push → quiet by crypticpy · Pull Request #6 · crypticpy/chorus

crypticpy · 2026-05-18T02:37:39Z

Summary

Rebased successor to closed PR #5 (whose base branch was deleted when PR #1 was squash-merged). This branch carries 17 feature commits on top of main after merging in the round-1/2 CodeRabbit/Sourcery/Greptile fixes that landed with PR #1.

Two big chunks of new feature work:

PR-babysit autonomous loop

End-to-end loop that watches a single PR through bot review, judges each new comment, applies a fix / posts a reply / escalates as appropriate, then auto-merges when CI is green and review has been quiet for two consecutive ticks.

New SQLite tables: babysit_jobs + babysit_decisions with idempotent migrations.
GH client with App-auth (RS256 JWT + installation token cache) and gh CLI fallback.
Per-PR worktree manager with branch-name validation + idempotent ensure/pull/remove.
Comment fetcher with sha256 dedup and author classification (CodeRabbit, Sourcery, Greptile, ChatGPT Codex).
Judge that classifies each comment into apply-trivial / apply-targeted / apply-architectural / reply-disagree / reply-ack / defer-to-human, with a pure action router.
Bounded-concurrency scheduler with per-job mutex.
Full state machine: idle → judging → fixing/replying → verifying → pushing → quiet_check → merged | escalated.
Webhook HMAC verify helper.
Daemon route POST /babysit/jobs (idempotent register) + pause/resume PATCH /babysit/jobs/:id.
MCP tool babysit_pr (11th tool) for orchestrators.
CLI: chorus babysit list/show/register/pause/resume.

Verify phase + TDD loop

New verify phase: no LLM doer — runs the project's package.json#chorus.verify command in repoPath, captures stdout/stderr/exit, fences it into a synthetic doer answer, and routes through the reviewer flow for a pass/fail verdict.
feedbackPhase knob: when a verify phase fails, re-prompts a named doer (typically implement) with the verify output as feedback, capped by maxIterations.
Dedicated VerifyPhaseSchema with commandTimeoutMs / feedbackPhase / maxIterations.

Supporting changes

Doer + reviewer prompts now include AGENTS.md + CLAUDE.md from repoPath as a fenced <project_guidelines> block (truncated, symlink-guarded).
pr-babysit audit preset.

Test plan

npx tsc --noEmit — clean
npx vitest run — 1085 / 1085 tests pass across 80 files (matches main's baseline)
Smoke: register a real PR via chorus babysit register <url> and watch it through one full judge → fix → verify → push tick
Smoke: run a template with a verify phase + feedbackPhase: implement and confirm the TDD re-prompt loop fires on a deliberately failing test

🤖 Generated with Claude Code

Summary by Sourcery

Introduce an autonomous PR babysitting loop with a scheduler-driven state machine, GitHub App/CLI integration, and persistent job/decision tracking, plus a new verify/TDD phase and project guideline injection into prompts.

New Features:

Add PR babysit daemon routes, scheduler, state machine, MCP tool, CLI commands, and per-PR worktree management for autonomous bot-comment handling.
Introduce a verify phase and TDD loop that runs project-defined verify commands, captures output, and feeds it into the reviewer and implement phases.
Include AGENTS.md/CLAUDE.md project guidelines into doer and reviewer prompts as structured context blocks.
Add a pr-babysit template and GitHub App secret support for babysit configuration.

Enhancements:

Extend DB schema and access layer with babysit jobs/decisions tables and indices for the babysit loop.
Add GitHub App-based client with JWT/installation token handling and a unified App/CLI GitHub request shim.
Enhance daemon startup to wire babysit routes and scheduler and update MCP server/tool counts.

Tests:

Add extensive babysit-related unit tests covering state machine, DB, routes, GitHub client/App auth, scheduler, comment fetcher, judge, worktree manager, git push, verifier, and prompts.
Extend template schema tests to cover the new verify phase and its constraints.

Summary by CodeRabbit

New Features
- Autonomous "PR babysit" workflow: judge bot comments, apply fixes, run verifies, push changes, escalate when needed.
- Daemon scheduler + CLI to register, list, pause, resume, and inspect babysit jobs.
- GitHub App support (auth + webhook HMAC verification), per-PR git worktrees, commit-and-push helper, verify-phase runner, and an MCP tool to register PRs.
- Prompt-builder now prepends per-repo project guidelines.
Documentation
- Added comprehensive PR babysit design document.
Chores
- DB schema and persistence added for babysit jobs and decision audit trail.
Tests
- Extensive test coverage for babysit components and helpers.

Three fixes from chorus-issues.md that prevent a freshly-installed chorus from finding the user's existing CLI credentials, so the daemon starts up cleanly on machines that already have Claude / Kimi / moonshot configured. #1: register Claude MCP at user scope. The chorus MCP entry now writes to the top-level `mcpServers` block in `~/.claude.json` (idempotent), and any stale chorus entry under the project-scoped `projects[homedir].mcpServers` is cleaned up. Previously the project-scoped registration was invisible to Claude Code launched outside that exact cwd. #2: cred-path fallbacks. When the anthropic file check misses (e.g. user authed via Claude Desktop, no `~/.claude/...` JSON), fall back to the macOS Keychain via `security find-generic-password -s "Claude Code-credentials"`. Added `~/.kimi/credentials/kimi-code.json` to the moonshot CRED_PATHS so users who authed through `kimi-code` aren't told to log in again. #3: kimi config-missing precheck. New layer-3 check parses `~/.kimi/config.toml` and surfaces a `config_missing` reason when there's no top-level `default_model` set — the CLI will silently pick whatever backend it likes, which is rarely what the user wants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Seven fixes from chorus-issues.md covering the rest of the runner + MCP-surface issues found while reviewing PR chorus-codes#26 of foresight-app. #4: thread `repoPath` through reviewer subprocesses. `runReviewers` → `runReviewer` → `runReviewerHeadless` now accept the chat's repoPath and the reviewer's cwd switches to it when set, so `gh`, file reads, and sandboxed CLIs (Gemini) see the actual code instead of running in an empty per-reviewer scratch dir. #5: surface reviewer answer.md in MCP responses. New `readReviewerArtifacts` helper walks `~/.chorus/chats/<id>/round-N/reviewer-*/answer.md`, caps each at 16 KiB, sorts by (round desc, agent asc), and merges the result into `wait_for_chat` and `get_chat_status` payloads under `reviews`. Both the doer and reviewer `participant_done` events now carry `outputPath` so MCP clients can read the on-disk source of truth when they need more than the streamed tail. #6: bump phase_progress output tail from 500 B to 8 KiB. The 500-byte slice clipped reviewer summaries mid-word; full text remains on disk and is pointed to by `outputPath`. Affects both reviewer.ts and doer.ts. #7: tri-review verdict on `max_rounds_exhausted`. When the doer succeeded every round but reviewers kept saying request_changes through the round cap, chat_done now emits `status: completed, verdict: request_changes, reason: max_rounds_exhausted` with the last round's reviewer summary — previously misclassified as a generic doer failure. #8: refactor `CreateChatSchema` and `InvokePersonaSchema` to plain `z.object()` with per-field `.describe()`. The prior `.transform()` wrapped them in `ZodEffects` which strips the `properties` map from MCP introspection — clients saw an empty schema. Legacy `template` alias and the `code-review` default moved into a new `resolveTemplateId()` helper. chorus-codes#9: dedup `participant_done` at the multiplex layer. Same-slot fallbacks or parsers that emit `message_done` twice (the opencode parser historically does this) used to fan duplicate terminal events out to every subscriber; now keyed by `(phaseIdx, round, role, agent)` and later duplicates drop silently. chorus-codes#10: per-instance reviewer prompt isolation. Same-lineage instances (claude-code-2/4/5, etc.) share the chat dir tree at `~/.chorus/chats/<id>/round-N/reviewer-*/`; tool-using CLIs were wandering into a sibling's answer.md mid-flight and short-circuiting ("the review is complete" referring to a different agent's work). `buildReviewerAsk` now stamps an Independence directive when more than one reviewer slot exists, naming the slot tag and forbidding cross-slot reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The synthetic chat_done emitted when a terminal chat is re-attached derived `verdict` from `chat.status`, ignoring the `chat.verdict` column. Since the previous commit shipped the `max_rounds_exhausted` branch (chorus-issues.md #7), a chat can finish with `status='approved' verdict='request_changes'` — replay was clobbering that to `approved` on every page reload, hiding reviewer disagreement from the user. Use the persisted column when set; fall back to the old status-derived value only for pre-v0.8.27 rows where verdict is null. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three audit follow-ups on the daemon side, all surfacing previously silent failures. attached_files: parseAttachedFiles in runner-multiplex.ts used to swallow JSON parse errors and run the chat with no attachments. Refactor to a tagged result (`empty` / `ok` / `invalid`); on `invalid` the runner logs and emits a `cli_warning` SSE so the cockpit + MCP clients see which chat lost its file list. SSE backpressure: when a subscriber's queue exceeds the 1000-line cap the multiplex used to silently drop the connection. Now writes one `error` frame with code `sse_backpressure` before close, and logs the queue length to daemon.log so an operator tailing logs can see when clients fall behind. gh pr create URL validation: ship.ts captured stdout's last line as the PR URL with no shape check; an empty/malformed stdout produced `{ok: true, prUrl: ''}` and the chat row recorded "shipped" with an unclickable link. Now matches against `^https://github.com/<owner>/<repo>/pull/<n>` before declaring success. detectGitContext parallelization: the five spawnSync probes (is-repo, remote, gh --version, gh auth, HEAD) ran sequentially at 60s each — worst case 360s before runner saw a result. Converted to async with a new `runAsync` helper, batched via Promise.all with a 15s per-probe cap; detectDefaultBranch's symref + three branch-existence checks likewise parallelized. detectGitContext is now async; the lone caller in runner.ts awaits it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

participant-card.tsx: parseFailureSummary ran the multi-step regex chain over the full participant.answer string. Reviewer answers can be up to 256 KB; on every render that's a UI-thread block. Slice to the first 16 KiB before scanning — the failure-header block is always written at the top of answer.md by reviewer.ts/doer.ts, so the cap never loses signal. live-run-real/index.tsx: the SSE onmessage handler already had a try/catch around JSON.parse, but the catch was silent — a wire-format mismatch dropped events with no trace. Add a console.warn with a preview so devs notice schema drift in DevTools. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds src/daemon/github-pr.ts: parsePrUrl + fetchPrArtifact run gh pr view/diff plus review and issue comments in parallel, synthesize a Markdown artifact (description, comments capped at 50 newest each, diff capped at 200 KB UTF-8 safe), and classify gh failures into typed reasons. Exports runAsync from ship.ts so the new module can reuse the existing spawn+timeout helper instead of duplicating it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pulls the template lookup, artifact validation, chat row + opening phase event creation, and runner kickoff out of the POST /chats handler into a reusable helper. POST /chats now only handles its route-specific concerns (body shape, repoPath canonicalization, error response shaping). Sets up reuse from the upcoming POST /chats/from-pr endpoint without duplicating ~150 lines of validation logic. No behavior change — same template checks, same artifact rules, same kickoff path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Accepts { url, templateId, repoPath?, yolo? }, parses the PR URL, fetches PR meta + diff + existing comments via gh CLI, synthesizes a Markdown artifact, and creates the chat through the shared createChatFromValidatedInputs helper. gh failures map to typed reasons (invalid_url, gh_not_installed, gh_not_authed, pr_not_found, network_failure, unknown) so the cockpit can render actionable errors instead of generic 500s. Adds tests/github-pr.test.ts covering parsePrUrl edge cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a Free-form / GitHub PR mode toggle on the new-chat page. PR mode swaps the prompt textarea for a URL input and routes through the new POST /chats/from-pr endpoint. Validates client-side that the chosen template is review-only before letting the user submit. createChatFromPr API client surfaces the daemon's typed PR meta (owner/repo/number/title/branches) on the response so callers can display PR context after the chat is created. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Exposes POST /chats/from-pr through MCP. Orchestrators (Claude Code, Codex, Cursor) can now hand chorus a PR URL and get reviewers running against it without going through the cockpit. Defaults templateId to review-only so a caller can pass just a URL. ReviewPrSchema is a plain z.object (not ZodEffects) so MCP clients can introspect required fields — same hazard documented on CreateChatSchema and InvokePersonaSchema. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Idea note for running chorus against multiple paid accounts on the same CLI binary (work + personal Claude Code Max, etc.). Filed as follow-up after audit-presets + quota tiers ship — captures the env-override mechanism, proposed Identity primitive, and open questions on keychain CLIs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the foundation for repo-pointed audit-and-orchestrate runs and the orchestrator's task↔voice tier matching. Template schema: - AuditPhase (kind: 'audit') — single reviewer voice + one of five preset lenses (de-slopify, monolith-breakdown, code-review, engineering-review, architecture-review). Output schema (AuditItemSchema, AuditOutputSchema) lives next to the phase shape so the structured-output adapter, scheduler, and cockpit checklist agree on the contract. - OrchestratePhase (kind: 'orchestrate') — array of worker voices, default branchPrefix `chorus/{chatId}/worker-{idx}` so each worker gets isolated git state. - templateRequiresRepo() helper for the cockpit's repo-picker gate. Voices: - Adds tier ('high' | 'medium' | 'low', default 'medium') and monthly_budget_usd (nullable) to the row schema, upsert input, and update input. Idempotent migrations on existing DBs. Chats: - bypass_quota INTEGER NOT NULL DEFAULT 0 — set on PR-review chats so the orchestrate scheduler runs every enabled voice at full capacity instead of tier-gating. Runner is stubbed for the new kinds: phase_done emit + continue, so templates that declare an audit/orchestrate phase before the runner logic lands don't crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wraps an AgentShim's runHeadless with JSON-formatting prompt scaffold and a one-shot repair loop, returning typed data validated against a caller-supplied zod schema. Used by the upcoming audit phase (which needs typed AuditItem[] instead of free-form prose) and the orchestrate phase (worker results). Keeps each CLI lineage's existing headless transport — the adapter just owns the prompt-shape + parse-and-validate dance. Extraction strategy: prefer direct JSON.parse of finalText; fall back through fenced-block regex variants to a brace-to-brace slice. On parse or schema-violation, retry once with a repair prompt that quotes the validation error. Spawn errors short-circuit (the model never saw the prompt — repair would just retry the same failure). Tests cover happy path, fenced-block extraction, repair-loop success, repair-loop exhaustion, schema violation, and spawn error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

/new gets a third tab beside Free-form and GitHub PR. In audit mode the user picks one of five preset lenses (de-slopify, monolith-breakdown, code-review, engineering-review, architecture-review) and supplies an absolute repo path. Submit fires createChat with templateId=`audit-<preset>` — those built-in templates land with the audit-phase implementation. RunChecklist component lives at src/components/run-checklist/. It takes the AuditItem[] surfaced by the audit phase's blocking event and renders one row per item with a checkbox, complexity badge, rationale, and file list. Default state has every item selected; the user trims, then submits via the parent's onSubmit which JSON-encodes the selected ids into the existing /chats/:id/resume `answer` field. Wiring into the live-run UI lands with the audit phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR-review chats automatically set bypass_quota=true so the orchestrate scheduler ignores voice.tier and runs the full fleet at maximum capacity — reviews are short, parallel, and the user wants the strongest opinion possible regardless of model tier. PUT /voices/:id now accepts tier ('high' | 'medium' | 'low') and monthly_budget_usd (non-negative or null), so the cockpit fleet page can label voices by capability for the orchestrate scheduler to route work against. Tests cover both new fields plus a chat round-trip asserting bypass_quota defaults false and persists when set.

Wires the audit phase end-to-end: - src/daemon/phases/audit.ts runs the structured-output adapter against the chosen preset, persists the parsed AuditItem[] to <chatDir>/audit-output.json plus raw model output to round-1/audit/output.md, and emits phase_progress with the items. - src/daemon/runner.ts replaces the audit/orchestrate stub: audit invokes runAuditPhase, flips chat status to blocked so the cockpit renders the checklist UI, and exits cleanly. Orchestrate keeps the no-op stub until step 5 lands. - 5 preset prompts (de-slopify, monolith-breakdown, code-review, engineering-review, architecture-review) frame what each lens looks for. The structured-output adapter handles JSON formatting; presets describe the audit lens only. - 5 audit-* templates (one per preset), each a 2-phase audit -> orchestrate shape with three default workers. Auto-loaded by seedBuiltinTemplates. - tests/audit-phase.test.ts covers preset-file presence and the audit-* template parse + shape contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wires the audit→orchestrate handoff: the cockpit POSTs the user's trimmed audit checklist to /chats/:id/resume, the resume handler cross-checks ids against audit-output.json, persists the selection, flips chat to drafting on the orchestrate phase, and re-fires the runner. The runner now starts at chat.current_phase_idx so a resumed chat lands directly on orchestrate. The new orchestrate phase walks the approved AuditItem[] sequentially (parallelism is an explicit non-goal for v1), picks a worker per item via the pure tier-aware scheduler, cuts a per-item branch, dispatches the worker via shim.runHeadless, captures git diff --stat, and persists orchestrate-manifest.json for the diff-apply UI to consume. The scheduler is a pure function with 9 unit tests covering tier matching, bypass override, disabled-voice skipping, empty pool, and unknown voice ids. Resume route has 10 tests exercising body validation, id cross-check, status gating, and the happy path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Run page reads audit-output.json + orchestrate-manifest.json on render - LiveRunReal renders RunChecklist while blocked w/ audit items, then swaps to OrchestrateManifest panel once orchestrate completes - New OrchestrateManifest component shows one row per worker w/ Checkout / Open PR buttons (per-row inline feedback, no global toast) - Daemon: GET /chats/:id/audit-items, GET /chats/:id/orchestrate-manifest, POST /chats/:id/workers/:idx/checkout (refuses on dirty tree), POST /chats/:id/workers/:idx/open-pr (gh pr create, bucketed failures) - OrchestrateManifestSchema added to template-schema.ts; route + UI parse via the same shape Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tJson Address /freview findings on the audit + orchestrate flow: - Resume race (BLOCKER): two concurrent POSTs to /chats/:id/resume could both pass the `status=='blocked'` check and double-fire the runner. Guard with `getActiveRun` (catches the audit-finishing window before `.finally` clears the registry) and replace the status flip with an atomic `tryResumeFromBlocked` CAS conditional on `WHERE status = 'blocked'`. - Branch-name argument injection (BLOCKER): tighten zod regexes on `OrchestratePhase.branchPrefix` and `OrchestrateManifestEntry.branch` so values starting with `-` (or containing shell metachars) cannot flow into `git checkout` / `gh pr create` as flags. - Symlink TOCTOU on checkout + open-pr (NON-BLOCKER): re-realpath `existing.repo_path` before passing to execFile cwd, mirroring the rerun-path pattern. Returns a structured validation error if the path no longer resolves. - extractJson Path 4 (NON-BLOCKER): try `{...}` and `[...]` slices independently and prefer the longer parse, so prose like "mentions [stuff] before {object}" extracts the object instead of the bracket. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two issues blocked `pnpm build:server`: - `audit.ts` used `import.meta.url` for module-relative path resolution, but the server tsconfig compiles to CJS where `import.meta` is a syntax error. Replaced with `__dirname`, which works in both the compiled dist (native CJS) and tsx-driven dev (tsx ≥4 shims it in ESM mode). - The `build:server` script copied `schema.sql` to dist/ but missed the preset markdown files in `src/daemon/presets/`. The audit phase's `loadPresetPrompt` resolves relative to `__dirname`, so a published install was hitting ENOENT on every audit run. Extended the copy step to mirror the preset directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cli): add diagnose command + crash-hook Bundles two upstream changes that ship a self-service triage path for chorus users hitting opaque failures: - `chorus diagnose` walks the install, daemon, recent failed chats, voice health, and produces a sharable bug report. - Crash hook captures uncaught exceptions in the CLI and writes them to a crash log alongside instructions to attach during a bug report. Folded back from upstream chorus-codes/chorus: 7ea712b feat: chorus diagnose command + crash hook for bug reports (#1) 4a5ea20 fix(diagnose): realpath bin path + filter Next.js SSE noise (#4) Co-Authored-By: chorus-codes <info@chorus.codes> * feat(cli): add quickstart self-test command `chorus quickstart` runs a 30-second activation flow that verifies the daemon comes up, the SQLite DB initializes, and a minimal chat round-trips end-to-end. Aimed at first-run users who want to know "is this thing actually working" before authoring a template. Folded back from upstream chorus-codes/chorus: 56610cf feat(cli): chorus quickstart — 30-second activation self-test (chorus-codes#30) Co-Authored-By: chorus-codes <info@chorus.codes> * fix(cli): use dynamic import for open package (Node 22 ERR_REQUIRE_ESM) The `open` package and `chokidar` are both ESM-only as of recent versions. On Node 22 (the daily-driver target) static `require()` calls into them throw ERR_REQUIRE_ESM and crash the CLI at boot. Switch to dynamic import in: - src/cli/commands/start.ts (open browser after boot) - src/cli/open-browser.ts (new helper) - src/cli/index.ts (route open import) - src/daemon/output-watcher.ts (chokidar file watch) Includes upstream's post-merge hardening: the setTimeout that triggers the browser-open no longer wraps an async callback bare, so a missing default browser doesn't surface as an unhandled rejection. Folded back from upstream chorus-codes/chorus: e8ca2ee fix(cli): dynamic import for open package (chorus-codes#14) dcd1837 fix: post-merge hardening for chorus-codes#14 (start.ts portion only; cli-precheck.test.ts portion ships with the Keychain fix) Co-Authored-By: Julien Deudon <deudon.j@gmail.com> Co-Authored-By: chorus-codes <info@chorus.codes> * feat(cockpit): seed empty round-1 so QUEUED renders from t=0 Before: when a chat starts but no reviewer has produced an event yet, enrichRounds returned an empty rounds array and the live-run page showed nothing for several seconds — the user couldn't tell whether their chat had launched. After: seed a synthetic round-1 with QUEUED placeholders for every expected participant so the page renders the per-reviewer cards immediately. Real events overwrite placeholders as they arrive. Folded back from upstream chorus-codes/chorus: 53e8fb6 feat(cockpit): seed empty round-1 so QUEUED placeholders render from t=0 (#2) Co-Authored-By: chorus-codes <info@chorus.codes> * feat(daemon): runtime fallback-collision dedup across reviewer slots When two reviewer slots both fall through their per-slot chains to the same template-level fallback target (common case: every slot ends in anthropic/claude-sonnet-4-6), both used to dispatch the same (lineage, model) in parallel — wasted cost and the lineage diversity that's the point of multi-LLM peer review collapsed. Build-time dedup (template-fallback.ts) couldn't catch it because each slot only knows about other slots' PRIMARIES, not their fallback chains. Fix: new per-chat/per-round (lineage, model) registry. reviewer-driver tryClaim's before each chain attempt and releases in a finally. On collision, return null + emit cli_warning(reason='fallback_collision') so runWithChainFallback advances to the next entry and the cockpit can show why the slot skipped. Ported into fork's reviewer-driver.ts surgically so the verdict-isolation refactor (2a2cde2) and per-slot repoPath threading stay intact. Folded back from upstream chorus-codes/chorus: c4751fe feat(daemon): runtime fallback-collision dedup (#3) Co-Authored-By: chorus-codes <info@chorus.codes> * fix(daemon): write REVIEWER FAILED summary on pre-spawn failure Before: when a reviewer's precheck fails (e.g. underlying CLI not installed) or the chat is cancelled while the slot is queued for a CLI semaphore slot, runReviewer used to return null silently — leaving NO on-disk participant directory. The cockpit's enrich-rounds loop then couldn't reconcile the synthesised template slot against any real participant, so the card sat at "Queued — waiting for an open slot." forever and the actual error was invisible. Reproduction: install chorus on a host with only one CLI on PATH (e.g. just claude-code), open a template that includes lineages requiring codex/gemini/kimi, fire it. Every reviewer card stayed "Queued" — chat never visibly progressed even though it was already done failing. Fix: - Create the reviewer dir BEFORE the precheck runs. - Add a writePreSpawnFailure helper that writes a `## REVIEWER FAILED` summary in the canonical format (Kind / Lineage / Model / message) that the cockpit's `parseFailureSummary` already understands. - Wire it into the precheck-failed and cancelled-while-queued paths. Card now transitions out of pending and shows the actual error (cli_missing, cancelled, ...). Folded back from upstream chorus-codes/chorus: afc59cc fix(daemon): REVIEWER FAILED summary on pre-spawn failure (chorus-codes#26) Co-Authored-By: chorus-codes <info@chorus.codes> * feat(voices): auto-disable on persistent quota_exhausted + lsof timeout Real pain (upstream chorus-codes#11): a Pro Gemini model on a Flash-only account fails every chorus run with "exhausted your capacity on this model" — but Gemini doesn't return a resetAt because the model isn't going to become available for that account. Without auto-disable, the runner keeps picking the dead voice on every chat and the user keeps seeing the same opaque error. Voice auto-disable: - New src/lib/voice-failure-tracker.ts records per-voice consecutive quota_exhausted strikes in a settings counter. - Trigger: 2 consecutive strikes WITH no resetAt → set voices.enabled=false + disabled_reason='auto_quota'. - Counter resets on participant_done success; rate-limit strikes (hasResetAt=true) bypass the counter entirely so a transient 429 + a later permanent failure can't trip the threshold on the first permanent strike. - Wired into reviewer-driver alongside recordHealth; emits a cli_warning(reason='voice_auto_disabled') so the cockpit can show a one-line explanation. - VoiceDisabledReason union gains 'auto_quota' (schema column was already TEXT — no migration). Lsof timeout (upstream chorus-codes#12): - findPidsOnPort and findPidsOnPortWithSudo now bound execSync / execFileSync to 3s, so a slow-but-functional lsof on a loaded macOS box doesn't hang chorus boot. 3s leaves headroom while still bounding the hang case. Ported into fork's reviewer-driver.ts tmux pollHandle + success path. voices.ts disabled_reason union extended alongside fork's voice-tier column. Folded back from upstream chorus-codes/chorus: 4f6becc v0.8.30 — voice auto-disable (chorus-codes#11) + lsof timeout (chorus-codes#12) (chorus-codes#17) Co-Authored-By: chorus-codes <info@chorus.codes> Co-Authored-By: Lumina Mao <luminamao@mac.lan> * fix(daemon, schema): codex isolation + template-schema validation Two issues caused chats to fail opaquely at run-start: CODEX ISOLATION (chorus-codes#10, chorus-codes#16) The user's ~/.codex/config.toml may declare MCP servers, plugins, or notification hooks. In headless `codex exec` those integrations have caused codex to hang or cancel mid-call — two independent reproductions: codex as our reviewer (chorus-codes#10) and codex as MCP client of chorus (chorus-codes#16). Add --ignore-user-config to every headless codex argv. Extracted to a pure `buildHeadlessArgs(opts)` so the argv shape is unit-testable. TEMPLATE VALIDATION (chorus-codes#15) `reviewer.require > candidates.length` used to surface as "Job moves immediately to failure upon Start press" — the runner queued, failed to grant enough slots, and emitted an opaque chat-failure. Same for `require > distinct lineages` when crossLineage:true. Both now caught at TemplateSchema.parse() time with a clear error message the user can fix before the run starts. ReviewerSchema.superRefine() additions slot in cleanly alongside the fork's audit/orchestrate phase schema work — both are additive constraints on the same ReviewerSchema object. Folded back from upstream chorus-codes/chorus: 8ed970b fix(daemon, schema): codex isolation + template validation Co-Authored-By: chorus-codes <info@chorus.codes> * fix(runner): honour iterate.onDisagreement accept-doer/escalate The template schema, cockpit dialog, and SPEC-D-templates have always exposed three values for iterate.onDisagreement — 'continue', 'escalate', 'accept-doer' — but the runner only honoured 'continue'. Picking the other two from the cockpit form was a silent no-op: chats fell through to phase_failed with 'doer_failed_all_rounds' regardless. This wires both new branches into the round loop and the terminal chat_done emission: - 'accept-doer': after maxRounds without consensus, mark doerSucceeded and continue. The chat carries on (subsequent phases, ship, approval) as if reviewers had agreed on the doer's last answer. - 'escalate': halt with status='failed' but verdict='request_changes' and error='escalated_on_disagreement', so cockpits can render "reviewers disagreed, needs human" distinctly from "doer broke." Policy table extracted into a pure decidePhaseOutcome() helper so the 3 × 2 input matrix (policy × disagreement-in-last-round) is unit-tested without standing up the full runChat scaffold. Gated on disagreementInLastRound (reset at top of every round + on doer-crash path) so a partial / empty doer answer can never be silently "accept-doer"'d as final. Preserves the fork's existing standardPhaseRoundsExhausted #7 surfacing for the 'continue' path; the 'escalate' path takes precedence with its own distinct chat_done. Upstream PRs chorus-codes#49, chorus-codes#50 (commit 67572e9). Co-Authored-By: chorus-codes <280607145+chorus-codes@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cli-precheck): cover macOS Keychain fallback for Claude Code v2 The fork already implements the Keychain fallback in cli-precheck (hasDarwinKeychainEntry). This adds the missing test coverage: - passes when no cred file but keychain entry exists - blocks when no cred file and no keychain entry - skips keychain check when cred file exists (fast-path preserved) - does not consult keychain for non-anthropic lineages vi.mock('node:child_process') uses the importOriginal spread pattern so spawn / exec / etc. keep their real implementations — a bare module replacement would silently break any sibling test that imports from child_process. Upstream PRs #7, #8, plus the dcd1837 test-mock hardening. Co-Authored-By: Yura <yurahalych@gmail.com> Co-Authored-By: chorus-codes <280607145+chorus-codes@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cockpit): derive candidatesWithModels from snapshot's candidates field Daemon-side TemplateSchema only carries `candidates` on each ReviewerRule. The cockpit Template type expects `candidatesWithModels` populated — enrich-rounds iterates that field to build slot→model mappings for run-page cards. When fromRow parsed template_snapshot and cast it to Template, the cast was a TypeScript lie: at runtime the parsed object lacked candidatesWithModels, enrichRounds iterated zero reviewer slots, and no model name reached the cards (badge appeared empty). Derive candidatesWithModels at the parse seam (chats.fromRow) so the cockpit's Template contract is honoured regardless of which path produced the data. Idempotent — if a future daemon ever serialises the field directly, that wins. Persona forwarded if present. Audit- phase single-voice reviewers (no candidates array) are skipped via a runtime narrow. Upstream PR #6 (chorus-codes/chorus@ac0c7fd). Co-Authored-By: chorus-codes <280607145+chorus-codes@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(diagnose): capture failure context — CLI smoke, voice health, recent failed chats Extends `chorus diagnose` with three signals that triage the most common breakage modes: - **CLI smoke**: spawn `<bin> --version` per detected CLI with a hard 2s SIGKILL timeout (wrapper scripts may trap SIGTERM). Distinguishes `timedOut` from non-zero exit so the report can tell hangs apart from crashes. - **Voice health**: counts `enabled=0` voices grouped by `disabled_reason` ('user' vs 'auto_missing' vs 'quota_exhausted'). Added `idx_voices_enabled` so the `WHERE enabled = 0` scan stays cheap as the table grows. - **Recent failed chats**: last 5 chats with `status='blocked'` plus the errored participants pulled from `~/.chorus/chats/<id>/round-*/<part>/_attempts.jsonl`. Only `errorMessageBytes` is exposed — raw error text never leaves the user's machine. `$HOME` is redacted from any embedded path strings via `redactHomePaths`. Adapted from upstream chorus-codes#19 (0666dca). Preserves the fork's existing diagnose shape and adds tests for smokeOneCli / readLatestAttempt / formatReport rendering of the three new sections. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(diagnose): include no_review in recent failed chats query The recent-failed-chats section was meant to surface per-participant failure context from `_attempts.jsonl`, but the WHERE clause only covered 'failed', 'blocked', 'cancelled'. The most common failure shape — every reviewer down for missing CLI / auth / quota — ends the chat in 'no_review', which was being silently filtered out. So the exact case the section exists to diagnose returned an empty list, forcing users back into manual log collection. Adds 'no_review' to the IN-list and a regression test that asserts both the status and a quota_exhausted errorKind render in the report. Addresses chatgpt-codex review P2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: chorus-codes <info@chorus.codes> Co-authored-by: Julien Deudon <deudon.j@gmail.com> Co-authored-by: Lumina Mao <luminamao@mac.lan> Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Yura <yurahalych@gmail.com>

#3) * feat(grok): detect Grok Build (xAI) + Level 1 orchestrator Adds Grok Build CLI to detection, onboarding picker, /connect card, diagnose smoke, init listing, and doctor labels. Grok auto-picks chorus MCP from ~/.claude.json (verified empirically via `grok inspect`) — no separate MCP wire needed. The grok orchestrator reports connected=true when both the binary is detected AND chorus is wired in ~/.claude.json (either top-level mcpServers or any project-scoped mcpServers entry). connect() is a no-op that points users at `chorus connect claude` if claude hasn't been wired yet. Quickstart filters CLIs to those with shims, so grok-cli being detected first no longer breaks the doer-pick flow. The cliToLineage map remains the source of truth for reviewer-capable CLIs. `docs/integrating-a-new-cli.md` captures the full Level 1/2/3 integration playbook for future CLIs — written while doing this so the steps are tested. Adapted from upstream chorus-codes#44 (6a00b00). No conflicts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(local): add Local LLM HTTP shim for OpenAI-compatible endpoints Adds a `local` lineage that dispatches chat completions to any OpenAI-compatible HTTP endpoint (Ollama, llama-swap, LM Studio, vLLM, or anything that speaks `/v1/chat/completions`). No external subscription or CLI binary required — only a running local inference server. Configuration: save a JSON secret under key `local` via Settings → Local LLM: {"base_url": "http://127.0.0.1:11434/v1", "api_key": ""} Model ids may use a `local:` prefix (e.g. `local:llama3`) which the shim strips before dispatch, or bare model names directly. When no secret is saved, falls back to Ollama's default port. Wiring sweep (extends every exhaustive enum / Record so templates can declare local voices without Zod errors): - src/daemon/agents/local.ts — new HTTP shim with JSON.parse guard on the secret (yields a typed `config_parse` error event for malformed secrets instead of throwing inside the generator) - src/daemon/agents/index.ts — register localShim, `local:` prefix routing in pickShimForVoice, add to isHttpDispatchedShim - src/daemon/agents/types.ts — `local` in Lineage - src/lib/template-schema.ts — `local` in both lineageEnum + reviewerLineageEnum - src/lib/cli-health.ts — `local` in CliLineage + ALL_LINEAGES - src/lib/cli-precheck.ts — empty CRED_PATHS, LOGIN_HINT, skip the file probe (same pattern as openrouter — auth lives in secrets table) - src/lib/cockpit-types.ts — `local` in ReviewerLineage - src/lib/lineage-maps.ts — `local` in DaemonLineage, UILineage, every label/dot/brand map; UI_LINEAGE_DEFAULT_MODEL[local] = "" (model IDs are endpoint-specific). Teal dot distinguishes local from openrouter's cyan - src/components/phase-editor/constants.ts — LINEAGES list, DAEMON_TO_COCKPIT_LINEAGE - src/components/template-dialog/constants.ts — COCKPIT_TO_DAEMON, DAEMON_TO_COCKPIT, DAEMON_DEFAULT_MODEL, FALLBACK_LINEAGES Adapted from upstream chorus-codes#41 (716fa3a). The bundled upstream commit also included Keychain dual-probe (chorus-codes#38) and fallback-registry hold-on-success (chorus-codes#42) — those land in follow-up commits in this PR so each concern is reviewable independently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Greg <7xshadowx7@gmail.com> Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com> * feat(grok): Level 3 shim — full reviewer dispatch (happy-path unverified) Promotes Grok Build from Level 2 (consumer-only) to Level 3 (full reviewer shim). Chorus can now dispatch to grok-build as a doer or reviewer in any template. What's verified (empirically): - Detection, headless-mode invocation pattern (`grok -p ... --output-format streaming-json --yolo --max-turns 1`), error event shape, exit-code semantics - Failure path: free-tier auth produces clean quota_exhausted (SuperGrok Heavy subscription required) → voice auto-disables after N strikes - All UI surfaces (model boxes, template-editor lineage picker, run-page participant card, cli-status-panel, onboarding picker, connect orchestrator) What's specced but not run live (needs SuperGrok Heavy): - Happy-path streaming-json text/end event parsing (followed `~/.grok/docs/user-guide/13-headless-mode.md` spec) - Token/cost accounting — Grok doesn't surface usage in end event; estimateCostUsd returns 0 New files: - src/daemon/agents/grok.ts — shim with `--max-turns 1` headless args - src/daemon/agents/parsers/grok.ts — streaming-json + stderr parser - tests/grok-parser.test.ts — 18 cases covering happy / error / robustness Lineage sweep (xai daemon lineage was already a legacy alias to opencode — uses fresh `grok` daemon lineage to avoid colliding with that mapping; old YAML with `lineage:xai` still routes to opencode): - Lineage / CliLineage / ReviewerLineage / DaemonLineage / UILineage - LINEAGE_LABEL / LINEAGE_DOT / UI_LINEAGE_* / UI_LINEAGE_BRAND - UI_LINEAGE_AVAILABLE_MODELS.grok = ['grok-build'] - UI_LINEAGE_DEFAULT_MODEL.grok = 'grok-build' - template-schema lineageEnum + reviewerLineageEnum - DB voices row schema (additive — old rows still validate) - phase-editor LINEAGES + DAEMON_TO_COCKPIT_LINEAGE - template-dialog COCKPIT_TO_DAEMON + DAEMON_TO_COCKPIT + DAEMON_DEFAULT_MODEL + FALLBACK_LINEAGES - cli-status-panel + live-run-real helpers - error-detector auth-prompt regex (SuperGrok signature on its own branch ABOVE the generic auth regex — classifies to quota_exhausted, not auth_invalid) Voice seeding: grok-cli registered in SINGLE_MODEL_CLIS — auto- creates the grok-cli voice (id=grok-cli, lineage=grok, model_id=grok-build) on first daemon boot when the binary is detected. Auth flow: ~/.grok/auth.json file probe OR GROK_CODE_XAI_API_KEY env short-circuit. Both verified in tests/cli-precheck.test.ts. Daemon won't spawn grok without one or the other present — prevents the browser-OAuth flow from hanging headless dispatch. Total tests: 821 → 842 (+21). Adapted from upstream chorus-codes#46 (f9dfba5). Conflicts resolved by taking the union of fork's `local`-extended enums and upstream's `grok`-extended enums (every Record / z.enum had to be extended in both dimensions). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com> * fix(cli-precheck): macOS Keychain dual-probe — also check "Claude Code" service Claude Code v2.x stores OAuth credentials under two service names depending on the auth flow: - `Claude Code-credentials` — Pro/Max OAuth via `claude login` - `Claude Code` (no suffix) — API-key auth + some Console-account flows The previous single-service probe regressed to auth_missing for API-key users on darwin. Refactor hasDarwinKeychainEntry to accept string | string[], iterate candidates, short-circuit on first match. Each probe stays bounded to 1.5s so a misconfigured keychain can't stall every spawn. Refs upstream issue chorus-codes#38 / commit 716fa3a. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: PR review — local in voices enum, AGENT_TO_LINEAGE for grok/local, separate cred-precheck vs semaphore bypass Addresses bot review on PR #3: - Sourcery P2 (src/lib/db/voices.ts): VoiceRowSchema and VoiceUpsertInput only allowed `grok` in the new-lineage slot; `local` voices upserted via the (future) Local LLM connect flow would have failed Zod validation at runtime. Add `local` to both the enum and the union. - Codex P2 (src/app/api/run-artifacts/[chatId]/route.ts + src/app/runs/[runId]/page.tsx): AGENT_TO_LINEAGE did not map `grok-cli` → `grok` nor `local` → `local`, so a real Grok or Local participant directory (`reviewer-grok-cli-N`, `reviewer-local-N`) resolved to a bogus lineage and rendered as an unbranded extra card while the placeholder slot stayed pending. - Codex P2 (src/daemon/agents/index.ts + src/daemon/runner/{doer,reviewer}-driver.ts + src/lib/settings/concurrency.ts): the daemon used a single predicate `isHttpDispatchedShim` for two unrelated decisions — bypassing the CLI-credential precheck AND bypassing the local-CLI semaphore. That was safe for OpenRouter (truly remote) but wrong for the Local LLM shim, whose default endpoint is Ollama on 127.0.0.1: N concurrent reviewers + a doer can thrash VRAM/RAM on consumer hardware. Split into `isHttpDispatchedShim` (kept for cred-precheck bypass) and `bypassesLocalCliSemaphore` (only openrouter). Add `grok-cli` and `local` to CLI_LINEAGES with conservative per-CLI defaults (grok-cli matches gemini at 2; local defaults to 1, bump in /settings if your endpoint multiplexes). Tests: 845 pass (unchanged), typecheck clean. * fix: PR review — CodeRabbit pass (docs/Grok level, init+quickstart+local edges, regex, tests) Addresses CodeRabbit's first batch of review comments on PR #3: - docs/integrating-a-new-cli.md: contradictory level for Grok — line 3 said "detection-only", line 15 said level 2, line 302 said level 3. Normalize to level-3 (the shim ships in this PR) and note that the level-2 orchestrator coexists for the consumer-side wiring. - src/cli/commands/init.ts: `--connect grok` was rejected because the local Name union, ALL_NAMES list, and the `--connect` option help text omitted 'grok' even though detection labels and OrchestratorName already accepted it. Adding 'grok' to all three. - src/cli/commands/quickstart.ts: the "install one of …" guidance printed when no CLIs are detected still listed only 5 — extend to Grok CLI to match the dispatchable set. - src/daemon/agents/local.ts: * Empty `base_url` (e.g. user saved settings with an empty box) was passed through `??` as the URL and surfaced as an opaque fetch error; treat empty / whitespace-only as unset and fall back to DEFAULT_BASE. Strip trailing slashes while at it. * Trailing SSE payload was dropped when the server closed without a final blank-line delimiter (older Ollama, some vLLM configs) — the last text_delta could silently disappear, truncating answers. Extract event-dispatch + payload-extract into local helpers and flush the residual buffer after the read loop exits. - src/lib/cli-detect.ts: grok regex documented "name OR bare-version" but only matched the name. Add the bare-version alternative; the basename guard already prevents cross-vendor matches. - tests/grok-parser.test.ts: 4 cases narrowed event[0] under `if (events[0].type === 'error')` without a prior `expect(...).toBe` on type — a non-error event silently skipped the inner assertions. Add explicit type expectations before the narrowing. Tests: 845 pass (unchanged), typecheck clean. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Greg <7xshadowx7@gmail.com> Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>

…na parser (#4) Folds the cross-platform pieces of upstream commit 781bc42 ("Contributor stack: claude orchestrator + repoPath + Windows spawn (chorus-codes#39)") into the fork, intentionally omitting Windows-specific hunks. Included: - src/mcp/tools.ts: add safeCwd() helper + default `repoPath` on create_chat to safeCwd() when caller omits it. Previously the daemon fell back to its own cwd (packageRoot), which caused relative file paths in `files: [...]` to silently resolve to the chorus install dir and miss. MCP servers spawned by Claude Code / Codex / Gemini inherit the host's cwd (= the user's project), so safeCwd() lands at the right path automatically. safeCwd() also catches ENOENT from process.cwd() and falls back to homedir. - src/lib/personas.ts: normalize CRLF → LF in the frontmatter parser so persona .md files checked out with Windows line endings don't fail `missing YAML frontmatter`. Cross-platform safe. - src/daemon/orchestrators/index.ts: drop stale comment block about Claude having a project-config side-effect (the fork's orchestrator long since moved to user-scope). - tests/mcp-create-chat-repo-path.test.ts (+4 tests): cover explicit repoPath, cwd default, full-body forwarding, and ENOENT fallback to homedir. Omitted (Windows-only hunks): - src/cli/commands/update.ts (shell: win32 for npm self-update) - src/daemon/routes/system.ts (shell: win32 for opencode probe) - src/daemon/orchestrators/{codex,gemini,kimi}.ts (shell: win32 tweaks) - src/lib/cli-detect.ts (SAFE_WIN_PATH regex + buildVersionSpawn) - src/lib/voices.ts (discoverNpmPrefixes Windows shell) - tests/cli-detect.test.ts (Windows-specific cmd.exe escape tests) Also omitted: - src/daemon/orchestrators/claude.ts: upstream shells out to `claude mcp add --scope user`. Fork already implements user-scope registration via direct ~/.claude.json patch (more robust — no dependency on `claude` binary in PATH at registration time, plus sweeps stale project-scoped entries). Keeping fork's version. - tests/claude-orchestrator.test.ts: tests the upstream shell-out approach the fork doesn't use. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three-phase delivery plan for moving the PR babysitter loop out of Claude Code and into the chorus daemon. Covers GH App + webhook architecture, the judge phase (validity/category/confidence + shadow judge pattern), fix routing rules (trivial/targeted/architectural → Kimi/Sonnet/Opus), circuit breakers, merge gate, multi-PR coordination, and proposed DB schema. Design only — no code in this commit. Five open questions left for team decisions in §"Open questions for the team". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a chat carries a repoPath, read AGENTS.md / CLAUDE.md from the repo and prepend them inside a <project_guidelines> fence (between the persona block and the phase header). Same TOCTOU + fence-breakout defences as the persona/attached-file readers: lstat-rejects symlinks, strips </project_guidelines> from contents, truncates each file at 16 KB with a visible marker. Lets users carry project conventions into every doer + reviewer turn by editing a file the rest of their AI tooling already reads, without adding a new chorus-specific storage layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ewer Splits verify out of the StandardPhase shape into its own VerifyPhase (no doer, reviewer required). Reads `chorus.verify` from package.json, runs it via execFile in repoPath with a configurable command timeout (default 5 min, max 30 min), captures stdout/stderr/exit, and feeds the fenced artifact through the existing runReviewers flow. Env is scrubbed to PATH/HOME/LANG/LC_ALL/NODE_ENV so a `chorus.verify` script can't leak inherited credentials into the artifact. Output streams cap at 64 KB each with a visible truncation marker. Timeout detection matches both ETIMEDOUT and (killed && SIGTERM) shapes — node sometimes only sets the signal. The artifact lands at round-1/doer-verify-runner/answer.md so the cockpit renders it identically to a doer answer. A phase_progress event with kind="verify_command" surfaces the command-level outcome (exitCode, timedOut, duration) without needing a brand-new event type through the SSE multiplex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verify phase gains optional `feedbackPhase` + `maxIterations` (default 5, max 20). On verify failure, the runner re-fires the named phase's doer through `runDoer` with the verify output threaded in via `priorRoundFeedback` — same hook a normal disagree-iterate loop uses, so the doer sees the failure in the slot it already knows how to act on. Loops until verify passes or the cap is hit. Reviewers only run on the FINAL iteration (success or final failure); intermediate iterations skip the reviewer pass because exit code is the loop signal and asking the reviewer N times to judge the same class of failure would just burn tokens. Iterations write to round-1001, round-1002, … (TDD_ROUND_OFFSET=1000) so the synthetic TDD-loop round dirs can't collide with the original feedback phase's rounds in the same chat dir. Misconfigured templates (feedbackPhase points at a non-existent or non-standard phase) fail loudly at the top of the verify phase, before the first command run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Foundation for the PR-babysit autonomous review loop (Phase A of docs/pr-babysit-design.md). Two tables: - babysit_jobs: one row per (repo, pr_number) under review, state-machine tracked (idle → judging → fixing → verifying → pushing → quiet_check → merged | escalated). UNIQUE (repo, pr_number) prevents double-registration. ended_at auto-stamps on first terminal transition and is sticky. - babysit_decisions: append-only audit trail of every judge call. Two-stage insert — judge writes validity/category/confidence/outcome=NULL, the fix runner stamps outcome (+ commit) when it resolves. getAttemptCount drives the per-comment circuit breaker (same comment hash flagged N+ times → stop trying, escalate). Schema lives in schema.sql for fresh-DB init AND as idempotent CREATE TABLE IF NOT EXISTS in connection.ts so DBs that pre-date this version pick the tables up on next boot (matches the personas/voices migration pattern). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…6 hash Pulls PR review (line-anchored) + issue (conversation) comments via `gh api`, normalizes them into the shape the babysit judge consumes: - author classification: recognises CodeRabbit / Sourcery / Greptile / Codex by login regex; falls back to GitHub user.type=Bot + [bot] suffix for unknown bots. Humans always come through as isBot=false / bot=null. - sha256(body) keyed so the per-comment circuit breaker can recognise "the same bot re-flagged the same exact text" across polling ticks. - partial-data tolerance: if one of review/issue endpoints fails we still return what we got from the other (a 500 on one shouldn't blank the whole tick). Only when BOTH fail do we surface a typed reason. - `since=` parameter so the polling loop doesn't re-hash every comment on every tick. 16 tests covering author classify, sha256 stability, gh shellout via a fake `gh` on PATH, partial-failure, auth/404 classification, since arg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reads one PR comment + diff context, asks the model to classify it as valid/invalid/partially_valid/unsure with a category from a fixed menu (apply-trivial | apply-targeted | apply-architectural | reply-disagree | reply-ack | defer-to-human) and a confidence score in [0,1]. Three pieces: - buildJudgePrompt(comment, ctx): pure prompt construction. Includes PR metadata, comment body, anchored code snippet, and (crucially) prior decisions on the same comment hash — so re-judgements after a failed fix tilt toward reply-disagree rather than re-trying the same fix. - judgeComment(opts): drives requestStructured against the JudgeOutputSchema, flags judgements below the 0.7 confidence threshold as belowThreshold. - decideAction(judgement, args): PURE routing function. Maps (judgement, attemptCount, belowThreshold) → fix/reply/escalate/skip. State machine in babysit/runner.ts (next session) stays a thin dispatcher. Routing rules in priority order: per-comment cap → confidence threshold → defer-to-human → reply-* → apply-* (with invalid/unsure self-correction to escalate, since acting on a comment we judged invalid is incoherent). 20 tests: prompt composition (bot vs human, snippet, prior decisions, multi-line bodies, threshold mention, full category menu) + routing table (every category × every priority rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase A MCP entry point for the PR-babysit loop. - `mcp__chorus__babysit_pr`: registers a PR for autonomous bot-comment judging. Idempotent — re-calling with the same URL returns the existing job without resetting state mid-flight. - Daemon routes: POST /babysit/jobs — upsert idle job GET /babysit/jobs — list (filters: ?active=true, ?state=…) GET /babysit/jobs/:id — single job + recent decisions - `templates/pr-babysit.yaml`: declares the judge roster (Haiku primary, Sonnet fallback). Validates against TemplateSchema as a `review_only` phase so seedBuiltinTemplates loads it cleanly; the babysit runner (next release) reads `phase.reviewer.candidates` for model selection but doesn't drive this phase through runner.ts. 13 route tests covering happy path, idempotent re-register, missing/ malformed URL, state filter, job-with-decisions detail view. MCP wrapper schema added to tools.ts. Note: src/daemon/index.ts diff is mostly Prettier rewriting single→double quotes after my import addition; the real semantic change is the two lines wiring registerBabysitRoutes into registerAll(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GitHub App auth bundle for the PR-babysit loop. Two-tier model: mint a 9-min RS256 JWT from the App private key (Node built-in crypto, no jsonwebtoken dep), then exchange it for a 1-hour installation token cached in-memory with a 5-min refresh buffer so we never present a token about to expire. Config persisted as a single global row in secrets (provider= github_app, kind=gh_app, value=JSON of appId/privateKey/ webhookSecret) — chorus is single-tenant, the App is owned by the daemon operator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure-crypto helper for verifying GitHub's X-Hub-Signature-256 against the raw request body. Constant-time comparison via crypto.timingSafeEqual + a typed discriminated-union failure mode (missing/malformed/mismatch/secret_not_configured) so a caller can log the precise reason without leaking it back to the sender. Not wired into a route this session — the daemon only polls — but the verifier ships with full coverage now since shipping the route later without it is a sharp footgun. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Unified GitHub-API surface for the babysit loop with two routes: - App auth when installationId is set AND App config persisted: mint/reuse a cached installation token, retry once on 401 (key rotation), retry once on 5xx with backoff. - gh CLI fallback otherwise. Inherits the developer's local gh auth. Bodies on this path return a typed error pointing the operator at the App-auth on-ramp — postponing the stdin plumbing until the runner actually needs to write through the CLI. Routing is transparent to the caller; they always get back a normalized {status, body|errorText, authMode} response. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Idempotent worktree lifecycle for the fix loop: - ensureWorktree() — create or reuse ~/.chorus/worktrees/ <owner>__<name>/pr-<n>/, fetching + checking out the PR head branch. Wipes a stale directory if one exists from a half- failed previous run. - pullLatest() — fetch + reset --hard origin/<branch>. Hard reset is safe only because the runner pushes every commit it makes; documented inline so it doesn't get cargo-culted. - removeWorktree() — git worktree remove --force + rm -rf as belt-and-suspenders for older git versions. Branch names from webhook payloads are validated against the same shell/path-traversal rules used elsewhere in the daemon before being passed to git. Tests use real git against a bare-remote fixture per case; mocking runAsync would leave 90% of the surface untested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tick driver for the babysit loop with three invariants the production daemon needs: 1. Per-job serialization. A Set keyed by job id, checked-and-set atomically inside dispatch(), prevents two ticks on the same PR from racing over the worktree, decisions table, or reply comment. 2. Bounded global concurrency. maxConcurrent (default 3) caps simultaneous jobs so judge-model quotas + gh-API pressure stay predictable as the backlog grows. 3. Clean drain on SIGTERM. stop() clears the interval AND awaits in-flight jobs so we never leave a worktree mid-commit. Errors thrown from runJob are caught + logged so a single broken PR can't poison the whole loop. The mutex is always released in finally so the next tick can re-dispatch. Not yet wired into daemon startup — the state-machine runner that becomes runJob ships in the next commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end driver tying the existing pieces together. One entry point (runJob) the scheduler calls per tick; per-state handlers dispatch the work and return a transition descriptor; the driver owns all babysit_jobs writes so handlers stay pure-ish. State transitions: idle -> judging (provisions worktree) judging -> fixing (any apply-* decision) judging -> quiet_check (replies only, or empty) judging -> escalated (defer-to-human, low-confidence, cap-hit, judge spawn/parse failure) fixing -> verifying (doer produced file edits) fixing -> escalated (doer failure; mark decision escalated) verifying -> pushing (verify passed) verifying -> escalated (verify failed; no auto-retry — the per-comment cap path catches genuine stuck) pushing -> quiet_check (pushed; record commit sha + fix_commits++) pushing -> escalated (git failure) quiet_check -> merged (PR merged on GitHub) quiet_check -> judging (new bot comments arrived) quiet_check -> quiet_check (no change) Supporting modules added in the same commit since they only exist to serve this state machine: - pr-metadata.ts: tiny shim over gh client for title/head/base/ default branch + PR state projection. Uses CLI fallback when no App config. - verifier.ts: resolves npm-test → npm-typecheck → tsc --noEmit from package.json/tsconfig; truncates output at 16 KiB for DB-safe escalation reasons. - fix-executor.ts: doer invocation via structured-output adapter returning {path, new_contents}[]. Full-file rewrites — LLMs are unreliable at diff coordinates and babysit fixes are small. Symlink-aware path safety refuses worktree escape. - git-push.ts: stage → diff-check → commit → push helper. No --force. Default chorus-babysit identity, overridable. Tests: 45 new tests across 5 files cover each handler's happy path + every failure-mode transition. State-machine tests use real DB + mocked external IO; helpers use real shellouts against fixture repos where the value is in the actual git/fs behaviour. Not yet wired: scheduler.start() at daemon boot — that's the next commit, separate from this so the integration is reviewable on its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Start a BabysitScheduler post-listen with the state-machine runner as its job handler. Tick interval defaults to 60s; sourceRepoPath defaults to the daemon's CWD (per-repo overrides will land when the registrar gains a sourceRepoPath field on the babysit job row). CHORUS_DISABLE_BABYSIT_SCHEDULER=1 skips the start for integration tests that drive ticks manually. SIGTERM / SIGINT trigger scheduler.stop(), which clears the interval AND awaits in-flight jobs so we never leave a worktree mid-commit on shutdown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds operator-driven pause/resume so a registered PR can be taken off the scheduler's tick without losing its decision history. PATCH /babysit/jobs/:id { action: 'pause' | 'resume' } Pause refuses terminal states (merged, escalated) with 409 — there is nothing for the scheduler to skip once a job has ended. Resume refuses non-paused jobs with 409 to make the intent explicit; both verbs are idempotent within their valid state. Resume re-opens ended_at so the job reappears in listActive() / cockpit lists. The scheduler already treats 'paused' as non-dispatchable (NON_DISPATCHABLE includes paused alongside merged + escalated), so this commit is just the controller — no scheduler change needed. 8 new tests on top of the existing 13 cover: pause happy path, pause idempotency, resume happy path + ended_at clear, conflict on pause-merged + pause-escalated, conflict on resume-when-not-paused, validation on unknown action, 404 on missing job. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

User-facing subcommand group that fronts the existing daemon routes so operators can drive the babysit scheduler without hitting the API directly. chorus babysit register <pr-url> [--installation-id <n>] chorus babysit list [--active] [--state <s>] chorus babysit show <id> chorus babysit pause <id> chorus babysit resume <id> All commands talk to the local daemon over /api/v1; a connection-failed envelope surfaces the standard "start with \`chorus start\`" hint so the failure mode is consistent with the rest of the CLI. Job ids are "<owner>/<repo>#<number>" — show/ pause/resume URL-encode the segment so shells that treat # as a comment don't strip it. show prints the job header + decision log (comment id, author, validity, category, outcome) so 'why did this PR get escalated' is one command away. State labels are color-coded (terminal-red escalated, green merged, yellow paused). src/cli/index.ts also picks up unrelated single→double-quote normalization from the project prettier hook — the only logical change there is the new registerBabysitCommand wire-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… round-1/2 bot fixes PR #1 (`feat/pr-review-and-hardening` → main) was just squash-merged into origin/main as commit 187d389. As part of that PR's bot-review round, the PR-babysitter pushed two follow-up fix commits — 7ec8df3 (round-1 batch) and 9464740 (round-2 fixes) — addressing CodeRabbit / Sourcery / Greptile feedback. Those fixes landed in main but were never on feat/runner-execution-loop, which carries 17 babysit-feature commits on top of the pre-fix hardening state. This merge brings the round-1/2 fixes in and resolves overlaps with the babysit-feature additions. Most files take main's hardened version (sendError migration, abort+cancel wiring, CHECK constraints, Windows-path support, fence escape, URL validation in the manifest UI, audit/orchestrate failure paths, ENOENT detection in github-pr, quota-counter reset on transient streaks). Babysit-specific additions preserved: - src/mcp/tools.ts: BabysitPrSchema + BabysitJobRefSchema + babysitPr() layered on top of main's createChat dead-spread cleanup. - src/mcp/index.ts: 11th tool registration (babysit_pr). - src/cli/index.ts + src/cli/commands/babysit.ts: chorus babysit subcommand. - src/lib/db/connection.ts + schema.sql: babysit_jobs + babysit_decisions tables, indexes, and idempotent migration block layered on top of main's CHECK constraints. - src/lib/template-schema.ts: dedicated VerifyPhaseSchema (with commandTimeoutMs / feedbackPhase / maxIterations) kept over main's looser StandardPhase-kind addition — required by src/daemon/phases/verify.ts. - src/daemon/runner.ts: verify phase routing block added after orchestrate. - src/daemon/runner/prompt-builder.ts + doer-driver + reviewer-driver: AGENTS.md/CLAUDE.md project-guides plumbing kept (repoPath threaded through buildAsk + buildReviewerAsk). tsc --noEmit: clean. vitest run: 1085 tests, 80 files, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

crypticpy has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

sourcery-ai · 2026-05-18T02:37:48Z

Reviewer's Guide

Adds an autonomous PR babysitting loop (daemon scheduler, state machine, DB tables, CLI/MCP tooling) and introduces a new verify phase with TDD feedback into the runChat pipeline, while wiring project guidelines into prompts and expanding GitHub/App auth support.

Sequence diagram for verify phase with TDD feedback loop

sequenceDiagram
  participant Runner as runChat
  participant VerifyPhase as runVerifyPhase
  participant Repo as runVerifyCommand
  participant ImplementDoer as runDoer
  participant Reviewer as runReviewers

  Runner->>VerifyPhase: runVerifyPhase(phase, template, repoPath,...)
  VerifyPhase->>VerifyPhase: readVerifyCommand(repoPath)
  alt no_chorus_verify_command
    VerifyPhase-->>Runner: {completed:false, reason:no_verify_command}
  else has_verify_command
    loop iteration <= maxIterations
      VerifyPhase->>Repo: runVerifyCommand(exec, args, cwd)
      Repo-->>VerifyPhase: VerifyCommandResult
      alt command_passed
        Note over VerifyPhase: break
      else command_failed_and_feedbackPhase_configured
        VerifyPhase->>ImplementDoer: runDoer(feedbackPhase, priorRoundFeedback)
        ImplementDoer-->>VerifyPhase: doer_round_done
      else command_failed_without_feedbackPhase
        Note over VerifyPhase: break
      end
    end
    VerifyPhase->>Reviewer: runReviewers(phase, round, verifyArtifact)
    Reviewer-->>VerifyPhase: {summary, agreed, allFailed}
    VerifyPhase-->>Runner: VerifyPhaseOutcome(completed, passed, iterations)
  end

ER diagram for babysit_jobs and babysit_decisions tables

erDiagram
  babysit_jobs ||--o{ babysit_decisions : has

  babysit_jobs {
    TEXT id PK
    TEXT repo
    INTEGER pr_number
    TEXT state
    INTEGER started_at
    INTEGER ended_at
  }

  babysit_decisions {
    INTEGER id PK
    TEXT job_id FK
    INTEGER comment_id
    TEXT comment_hash
    TEXT validity
    TEXT category
    TEXT outcome
  }

State diagram for PR babysit job lifecycle

stateDiagram-v2
  [*] --> idle

  idle --> judging: first_tick_after_registration
  idle --> escalated: invalid_repo_or_worktree_failure

  judging --> fixing: unjudged_comment_routes_to_fix
  judging --> quiet_check: all_comments_routed_to_reply_or_skip
  judging --> escalated: judge_failure_or_decision_escalated

  fixing --> verifying: fix_applied
  fixing --> escalated: doer_failure_or_comment_missing

  verifying --> pushing: verify_passed
  verifying --> escalated: verify_failed

  pushing --> quiet_check: pushed_or_no_changes
  pushing --> escalated: git_failure

  quiet_check --> judging: new_comments_arrived
  quiet_check --> merged: pr_merged_on_github
  quiet_check --> escalated: pr_closed_without_merge

  waiting --> judging: resume_wait

  merged --> [*]
  escalated --> [*]
  paused --> [*]

File-Level Changes

Change	Details	Files
Wire PR babysit scheduler and REST routes into the daemon, including per-job state machine execution.	Register new /babysit REST routes in the daemon HTTP API and initialize the BabysitScheduler at startup, guarded by an env flag. Implement a BabysitScheduler that periodically loads active babysit jobs, enforces per-job mutexes and a global concurrency cap, and dispatches each job into the state machine driver. Add shutdown hooks so the scheduler is stopped cleanly on SIGINT/SIGTERM along with telemetry and other background tasks.	`src/daemon/index.ts` `src/daemon/babysit/scheduler.ts`
Introduce PR babysit persistence and query layer for jobs and decisions with idempotent migrations.	Add babysit_jobs and babysit_decisions tables to the SQLite schema and ensure they are created idempotently on DB init, including supporting indexes. Implement typed data-access modules for babysitJobs and babysitDecisions (create, get, list, state transitions, counter increments, and per-comment attempt counting). Extend the DB index module to export babysitJobs and babysitDecisions so higher-level components can read/write babysit metadata.	`src/lib/db/connection.ts` `src/lib/db/schema.sql` `src/lib/db/babysit-jobs.ts` `src/lib/db/babysit-decisions.ts` `src/lib/db/index.ts` `tests/babysit-db.test.ts`
Implement the PR babysit state machine, comment pipeline, and associated helpers for GH, worktrees, judge, fix, verify, and push.	Add a state-machine driver that routes babysit jobs through idle, judging, fixing, verifying, pushing, quiet_check, merged, escalated, and paused states, including per-state handlers and transition logic. Implement helpers for comment fetching and classification, judge prompt and action routing, fix execution via a doer model, verify execution, git commit/push, PR metadata lookup, per-PR worktree management, GH App vs gh CLI client, and webhook HMAC verification. Add comprehensive tests covering the state machine transitions, GH client routing and retries, App auth and JWT minting, worktree lifecycle, judge routing logic, verifier behaviour, scheduler concurrency, and route responses.	`src/daemon/babysit/state-machine.ts` `src/daemon/babysit/comment-fetcher.ts` `src/daemon/babysit/judge.ts` `src/daemon/babysit/fix-executor.ts` `src/daemon/babysit/verifier.ts` `src/daemon/babysit/git-push.ts` `src/daemon/babysit/worktree-manager.ts` `src/daemon/babysit/gh-client.ts` `src/daemon/babysit/gh-app.ts` `src/daemon/babysit/pr-metadata.ts` `src/daemon/babysit/webhook-verify.ts` `tests/babysit-state-machine.test.ts` `tests/babysit-comment-fetcher.test.ts` `tests/babysit-judge.test.ts` `tests/babysit-worktree-manager.test.ts` `tests/babysit-gh-client.test.ts` `tests/babysit-gh-app.test.ts` `tests/babysit-scheduler.test.ts` `tests/babysit-git-push.test.ts` `tests/babysit-verifier.test.ts` `tests/babysit-webhook-verify.test.ts` `tests/babysit-pr-metadata.test.ts`
Expose babysit registration and control via daemon REST routes, CLI commands, and MCP tool.	Add Fastify routes to register babysit jobs from GitHub PR URLs, list and filter jobs, fetch job details with decisions, and pause/resume jobs with conflict handling. Introduce a chorus babysit CLI subcommand with register/list/show/pause/resume verbs that call the daemon API and render human-friendly summaries. Add an MCP babysit_pr tool with a zod schema, wired into the MCP server and backed by the daemon /babysit/jobs endpoint for idempotent registration.	`src/daemon/routes/babysit.ts` `src/cli/commands/babysit.ts` `src/cli/index.ts` `src/mcp/tools.ts` `src/mcp/index.ts` `tests/babysit-routes.test.ts`
Add a verify phase type to the template schema and integrate a verify/TDD loop into the runner pipeline.	Define a dedicated VerifyPhaseSchema with commandTimeoutMs, feedbackPhase, and maxIterations fields, distinguish verify from standard phases, and export a VerifyPhase type. Extend runChat to handle verify phases by running a verify-specific runner that shells out to a project-defined chorus.verify command, captures output, and routes it through reviewers, including TDD-style re-prompting of a feedback phase doer on failure. Add verify-phase unit tests to validate schema constraints, default values, TDD-loop caps, and verify runner behaviour for passing, failing, and timeout cases.	`src/lib/template-schema.ts` `src/daemon/runner.ts` `src/daemon/phases/verify.ts` `tests/template-schema.test.ts` `tests/verify-phase.test.ts`
Include project guidelines (AGENTS.md / CLAUDE.md) into doer and reviewer prompts and thread repoPath through drivers.	Add readProjectGuides helper that reads AGENTS.md and CLAUDE.md from the repo, enforces size caps and symlink guards, and wraps content in a <project_guidelines> fenced block. Update buildAsk and buildReviewerAsk to prepend project_guidelines blocks when repoPath is available, and thread repoPath through doer and reviewer drivers so prompts can locate the project root. Add tests to ensure project guideline files are loaded, truncated, sanitised, and injected at the correct position in prompts.	`src/daemon/runner/prompt-builder.ts` `src/daemon/runner/doer-driver.ts` `src/daemon/runner/reviewer-driver.ts` `tests/prompt-builder-project-guides.test.ts`
Extend secrets storage to support GitHub App configuration and small ancillary changes.	Add 'gh_app' as an allowed secret kind, document its semantics, and ensure secrets.set/get/list/delete continue to work via zod validation. Adjust various daemon/CLI imports and logs to standardise double-quote usage and minor formatting while wiring new components. Add a pr-babysit template preset file describing the judge-only babysit phase for use in templates and orchestrators.	`src/lib/db/secrets.ts` `src/daemon/index.ts` `src/cli/index.ts` `src/mcp/index.ts` `templates/pr-babysit.yaml`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

coderabbitai · 2026-05-18T02:37:51Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4ba02e31-2c59-464f-8666-d1a50c24b0b0

📥 Commits

Reviewing files that changed from the base of the PR and between c72bfa5 and 5b62a94.

📒 Files selected for processing (6)

src/daemon/babysit/comment-fetcher.ts
src/daemon/babysit/state-machine.ts
src/daemon/phases/verify.ts
src/daemon/runner.ts
src/daemon/runner/prompt-builder.ts
tests/babysit-comment-fetcher.test.ts

📝 Walkthrough

Walkthrough

Introduces a full PR babysitting subsystem: design doc, DB schema and accessors, GitHub App auth/client, comment fetcher and judge, fix executor and safety checks, per-PR worktree manager, git commit-and-push helper, verify step and verify-phase, state-machine and scheduler, Fastify routes, CLI/MCP tooling, runner/daemon wiring, template, and extensive tests.

Changes

PR Babysitter end-to-end feature

Layer / File(s)	Summary
Complete implementation, wiring, tooling, and tests `docs/pr-babysit-design.md`, `templates/pr-babysit.yaml`, `src/daemon/babysit/`, `src/lib/db/`, `src/daemon/phases/verify.ts`, `src/daemon/index.ts`, `src/cli/commands/babysit.ts`, `src/mcp/`, `src/daemon/runner/`, `src/daemon/ship.ts`, `src/daemon/runner/prompt-builder.ts`, `tests/**`	Adds the Phase A PR babysit feature: design and template, DB tables and DAOs (`babysit_jobs`, `babysit_decisions`), GitHub App auth and ghRequest client, comment fetcher and judge logic, fix executor and safety checks, per-PR worktree manager, git commit-and-push helper, verify step and verify-phase runner, state-machine driver and bounded scheduler, Fastify routes and CLI/MCP tools to register/list/pause/resume jobs, runner/daemon wiring including scheduler startup, prompt-builder project-guides support, runAsync stdin support, and comprehensive unit/integration tests covering the modules.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

A rabbit with a checklist, ears all aglow,
Hops through PRs where the wild comments grow.
Judges with patience, patches with care,
Verifies, pushes—then naps in its chair.
If humans are needed, it thumps once: "I’ll share." 🐇✨

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

sourcery-ai

Hey - I've found 5 issues, and left some high level feedback:

The babysit scheduler and judge currently hard-code lineage/model (anthropic / claude-haiku-4-5) in multiple places; consider threading these from a central config (e.g. settings table or env) so operators can swap models or vendors without code changes.
The verify phase introduces a second verify runner under daemon/babysit/verifier.ts with its own truncation and mode logic; it may be worth consolidating this with the new phases/verify.ts helper to avoid diverging behaviour between the generic verify phase and the babysit-specific verify step.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The babysit scheduler and judge currently hard-code lineage/model (`anthropic` / `claude-haiku-4-5`) in multiple places; consider threading these from a central config (e.g. settings table or env) so operators can swap models or vendors without code changes.
- The verify phase introduces a second verify runner under `daemon/babysit/verifier.ts` with its own truncation and mode logic; it may be worth consolidating this with the new `phases/verify.ts` helper to avoid diverging behaviour between the generic verify phase and the babysit-specific verify step.

## Individual Comments

### Comment 1
<location path="src/daemon/babysit/gh-client.ts" line_range="177-186" />
<code_context>
+    "User-Agent": "chorus-babysit",
+  };
+  let init: { method: string; headers: Record<string, string>; body?: string };
+  if (args.body !== undefined) {
+    headers["Content-Type"] = "application/json";
+    init = {
+      method: args.method,
+      headers,
+      body: JSON.stringify(args.body),
+    };
+  } else {
+    init = { method: args.method, headers };
+  }
+  const res = await fetcher(url, init);
+  if (!res.ok) {
+    const text = await res.text().catch(() => "<unreadable>");
+    return { ok: false, authMode: "app", status: res.status, errorText: text };
+  }
+  // 204 No Content — return null body, ok: true.
+  if (res.status === 204) {
+    return { ok: true, authMode: "app", status: 204, body: null };
+  }
+  const text = await res.text();
+  return {
+    ok: true,
+    authMode: "app",
</code_context>
<issue_to_address>
**issue (bug_risk):** CLI fallback silently can't handle requests with bodies, which will break reply/comment POSTs when no GitHub App is configured.

Because `cliRequest` returns early when `args.body !== undefined`, any POST/PATCH/PUT via the `gh` path always fails with a synthetic `errorText`. Callers like `dispatchAction` interpret this as a soft failure (`"skip"`), so in `gh`-only environments reply actions never post but the loop keeps retrying until the per-comment cap. Please either support piping the body to `gh api --input -` (e.g., extend `runAsync` to accept stdin), or handle this specific failure upstream and escalate/log clearly so users know App auth is required for write operations.
</issue_to_address>

### Comment 2
<location path="src/daemon/babysit/comment-fetcher.ts" line_range="134-143" />
<code_context>
+  const reviewPath = `repos/${owner}/${repo}/pulls/${prNumber}/comments?per_page=100${sinceQuery}`;
+  const issuePath = `repos/${owner}/${repo}/issues/${prNumber}/comments?per_page=100${sinceQuery}`;
+
+  const [reviewRes, issueRes] = await Promise.all([
+    runAsync("gh", ["api", reviewPath], { cwd, timeoutMs: 20_000 }),
+    runAsync("gh", ["api", issuePath], { cwd, timeoutMs: 20_000 }),
+  ]);
+
+  // If both calls failed with the same reason, surface it. If one fails
+  // and the other succeeds, prefer the success — partial comment data
+  // is more useful than nothing.
+  if (!reviewRes.ok && !issueRes.ok) {
+    const reason =
+      classifyGhFailure(reviewRes.stderr) ?? classifyGhFailure(issueRes.stderr);
+    return {
+      ok: false,
+      reason: reason ?? "unknown",
+      detail: (reviewRes.stderr || issueRes.stderr || "").trim(),
+    };
+  }
+
+  const reviewComments = reviewRes.ok
+    ? safeParseArray<GhReviewCommentJson>(reviewRes.stdout)
+    : [];
</code_context>
<issue_to_address>
**issue (bug_risk):** Comment fetching always shells out to `gh` and ignores App auth, which will fail on headless/App-only deployments.

This code path still shells out to `gh` via `runAsync` instead of using the existing App-aware client (`ghRequest` / `GhClientDeps`) used in `pr-metadata.ts` and `gh-client.ts`. On hosts without the `gh` CLI but with a configured GitHub App/installation, these calls will always fail with `gh_not_installed`, causing comment fetches (and thus jobs) to fail. Please route these requests through the shared `ghRequest` helper, accepting an optional `installationId`, so App auth is used when available with CLI as a fallback.
</issue_to_address>

### Comment 3
<location path="src/daemon/index.ts" line_range="335" />
<code_context>
+  // Skipped under CHORUS_DISABLE_BABYSIT_SCHEDULER=1 (set during
+  // integration tests where the scheduler would race with explicit
+  // tickOnce() calls).
+  let babysitScheduler: BabysitScheduler | null = null;
+  if (process.env.CHORUS_DISABLE_BABYSIT_SCHEDULER !== "1") {
+    babysitScheduler = new BabysitScheduler({
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the babysit scheduler setup into a helper and routing all SIGINT/SIGTERM cleanup through a single handler to keep `main()` focused on high-level sequencing.

You can keep all behavior but reduce `main()`’s surface area by extracting the babysit wiring and unifying signal handling.

### 1. Extract babysit scheduler bootstrap

Move the configuration-heavy `BabysitScheduler` construction out of `index.ts` into a small helper. That keeps `main()` focused on sequencing and makes the babysit subsystem a bounded module.

**babysit/bootstrap.ts**

```ts
// src/daemon/babysit/bootstrap.ts
import { BabysitScheduler } from "./scheduler.js";
import { runJob as runBabysitJob } from "./state-machine.js";
import { logger } from "../../lib/logger.js";

export function startBabysitScheduler(): BabysitScheduler | null {
  if (process.env.CHORUS_DISABLE_BABYSIT_SCHEDULER === "1") {
    return null;
  }

  const scheduler = new BabysitScheduler({
    runJob: (job) =>
      runBabysitJob(job, {
        sourceRepoPath: process.cwd(),
        doerLineage: "anthropic",
        doerModel: "claude-haiku-4-5",
        log: (line) => logger.debug({ scope: "babysit" }, line),
      }),
    logger: {
      tickStart: (info) =>
        logger.debug({ scope: "babysit-scheduler", ...info }, "tick"),
      jobStart: (id) =>
        logger.debug({ scope: "babysit-scheduler", id }, "job start"),
      jobEnd: (id, ms) =>
        logger.debug({ scope: "babysit-scheduler", id, ms }, "job end"),
      jobError: (id, err) =>
        logger.warn(
          { scope: "babysit-scheduler", id, err: String(err) },
          "job error",
        ),
    },
  });

  scheduler.start();
  return scheduler;
}
```

**daemon/index.ts (inside `main`)**

```ts
import { startBabysitScheduler } from "./babysit/bootstrap.js";

// ...

const telemetryHandle = startTelemetryHeartbeat({
  version: VERSION,
  daemonStartedAt: startTime,
});

const babysitScheduler = startBabysitScheduler();
```

Now `index.ts` just “turns on” the babysit subsystem instead of inlining its details.

### 2. Centralize signal handling

Right now `SIGTERM`/`SIGINT` handlers are scattered: shutdown, telemetry, and babysit each call `process.on(...)`. You can keep identical behavior but make it easier to scan by registering one handler per signal that fans out to cleanup callbacks.

**daemon/index.ts (inside `main`)**

```ts
const cleanupFns: Array<() => void | Promise<void>> = [];

const shutdown = async (signal: string) => {
  if (activeRunsCount() > 0) {
    const runs = activeRunsSnapshot();
    for (const entry of runs) entry.abortController.abort();
    try {
      await Promise.race([
        Promise.allSettled(runs.map((e) => e.promise)),
        new Promise((_, reject) =>
          setTimeout(
            () => reject(new Error("Timeout waiting for active runs")),
            10000,
          ),
        ),
      ]);
      console.log(`[chorus] aborted ${runs.length} active runs (${signal})`);
    } catch {
      console.warn(
        "[chorus] timeout or error waiting for active runs to abort",
      );
    }
  }
  if (stopReaper) stopReaper();
  await fastify.close();
  process.exit(0);
};

cleanupFns.push(() => telemetryHandle.stop());
if (babysitScheduler) {
  cleanupFns.push(() => babysitScheduler.stop());
}

const handleSignal = (signal: "SIGTERM" | "SIGINT") => {
  void (async () => {
    for (const fn of cleanupFns) {
      try {
        await fn();
      } catch {
        // best-effort cleanup; ignore
      }
    }
    await shutdown(signal);
  })();
};

process.on("SIGTERM", () => handleSignal("SIGTERM"));
process.on("SIGINT", () => handleSignal("SIGINT"));
```

Behavior is preserved (all same pieces still stop on SIGTERM/SIGINT), but:

- Babysit-related details are out of the entrypoint.
- Signal handling is a single, easily-auditable block instead of multiple `process.on` calls scattered among features.
</issue_to_address>

### Comment 4
<location path="src/daemon/phases/verify.ts" line_range="348" />
<code_context>
+ * is the loop signal, and asking the reviewer N times to judge the
+ * same kind of failure would just burn tokens for nothing.
+ */
+export async function runVerifyPhase(
+  args: RunVerifyPhaseArgs,
+): Promise<VerifyPhaseOutcome> {
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the TDD iteration logic from `runVerifyPhase` into a dedicated helper that encapsulates loop state and mechanics.

You can make `runVerifyPhase` easier to follow by factoring the TDD loop into a small helper with an explicit loop-state struct. This keeps all behavior but pulls out the most complex branching.

For example, introduce a helper like:

```ts
interface VerifyLoopState {
  lastResult: VerifyCommandResult;
  lastArtifact: string;
  iterations: number;
}

async function runVerifyWithTddLoop(opts: {
  command: string;
  split: { exec: string; args: string[] };
  repoPath: string;
  maxIterations: number;
  commandTimeoutMs: number;
  chatDir: string;
  chatId: string;
  phase: VerifyPhase;
  phaseIdx: number;
  feedbackStdPhase: StandardPhase | null;
  work: string;
  filesBlock: string;
  tmuxMgr: TmuxManager;
  errorDetector: ErrorDetector;
  onEvent: (e: RunnerEvent) => void;
  abortSignal: AbortSignal;
  templateFallbackDoer?: RunVerifyPhaseArgs["templateFallbackDoer"];
}): Promise<VerifyLoopState | null> {
  const {
    command,
    split,
    repoPath,
    maxIterations,
    commandTimeoutMs,
    chatDir,
    chatId,
    phase,
    phaseIdx,
    feedbackStdPhase,
    work,
    filesBlock,
    tmuxMgr,
    errorDetector,
    onEvent,
    abortSignal,
    templateFallbackDoer,
  } = opts;

  let lastResult: VerifyCommandResult | null = null;
  let lastArtifact = "";
  let iter = 0;

  while (iter < maxIterations) {
    if (abortSignal.aborted) return null;
    iter++;
    const round = TDD_ROUND_OFFSET + iter;

    lastResult = await runVerifyCommand({
      exec: split.exec,
      args: split.args,
      cwd: repoPath,
      timeoutMs: commandTimeoutMs,
    });

    lastArtifact = formatVerifyArtifact(command, lastResult);

    const roundDir = path.join(chatDir, `round-${round}`);
    const verifyDir = path.join(roundDir, "doer-verify-runner");
    fs.mkdirSync(verifyDir, { recursive: true });
    fs.writeFileSync(
      path.join(verifyDir, "answer.md"),
      lastArtifact + "\n## DONE\n",
      "utf-8",
    );

    onEvent({
      chatId,
      type: "phase_progress",
      payload: {
        phaseId: phase.id,
        phaseIdx,
        kind: "verify_command",
        iteration: iter,
        round,
        command,
        exitCode: lastResult.exitCode,
        timedOut: lastResult.timedOut,
        durationMs: lastResult.durationMs,
        stdoutTruncated: lastResult.stdoutTruncated,
        stderrTruncated: lastResult.stderrTruncated,
      },
      ts: Date.now(),
    });

    const commandPassed = !lastResult.timedOut && lastResult.exitCode === 0;
    if (commandPassed) break;

    if (!feedbackStdPhase || iter >= maxIterations) break;

    const feedback = formatVerifyFailureFeedback(command, lastResult, iter);
    await runDoer(
      chatDir,
      chatId,
      feedbackStdPhase,
      phaseIdx,
      round,
      work,
      filesBlock,
      tmuxMgr,
      errorDetector,
      onEvent,
      abortSignal,
      repoPath,
      templateFallbackDoer,
      feedback,
    );
  }

  if (!lastResult) return null;

  return { lastResult, lastArtifact, iterations: iter };
}
```

Then `runVerifyPhase` becomes mostly wiring:

```ts
export async function runVerifyPhase(
  args: RunVerifyPhaseArgs,
): Promise<VerifyPhaseOutcome> {
  const { /* ...same destructuring... */ } = args;

  // early abort + phase_start + command resolution + feedbackStdPhase resolution stay here...

  const maxIterations = phase.maxIterations ?? 5;
  const commandTimeoutMs = phase.commandTimeoutMs ?? DEFAULT_COMMAND_TIMEOUT_MS;

  const loopState = await runVerifyWithTddLoop({
    command,
    split,
    repoPath,
    maxIterations,
    commandTimeoutMs,
    chatDir,
    chatId,
    phase,
    phaseIdx,
    feedbackStdPhase,
    work,
    filesBlock,
    tmuxMgr,
    errorDetector,
    onEvent,
    abortSignal,
    templateFallbackDoer,
  });

  if (!loopState) {
    return {
      completed: false,
      passed: false,
      allReviewersFailed: false,
      summary: "verify aborted before first run",
      iterations: 0,
    };
  }

  const { lastResult, lastArtifact, iterations } = loopState;
  const finalRound = TDD_ROUND_OFFSET + iterations;

  const reviewOutcome = await runReviewers(
    chatDir,
    chatId,
    phase as unknown as Parameters<typeof runReviewers>[2],
    phaseIdx,
    finalRound,
    lastArtifact,
    work,
    filesBlock,
    tmuxMgr,
    errorDetector,
    onEvent,
    abortSignal,
    templateFallbackReviewer,
    repoPath,
  );

  const commandPassed = !lastResult.timedOut && lastResult.exitCode === 0;
  const passed = commandPassed && reviewOutcome.agreed;

  return {
    completed: true,
    passed,
    allReviewersFailed: reviewOutcome.allFailed,
    summary: passed
      ? reviewOutcome.summary
      : `${reviewOutcome.summary} (verify failed after ${iterations} iteration${iterations === 1 ? "" : "s"})`,
    command: lastResult,
    iterations,
  };
}
```

This preserves all behavior (including artifacts, events, TDD feedback, and exit semantics) but splits:

- config/validation and reviewer orchestration in `runVerifyPhase`
- loop mechanics (iterations, rounds, early breaks, doer re-fire) in `runVerifyWithTddLoop`

It should address the “highly branched” concern while keeping the features you’ve added.
</issue_to_address>

### Comment 5
<location path="src/daemon/babysit/state-machine.ts" line_range="206" />
<code_context>
+  return { nextState: "judging", worktreePath: ensured.worktreePath };
+}
+
+async function handleJudging(
+  job: BabysitJob,
+  deps: StateMachineDeps,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared PR context/metadata handling into a helper and moving judge/dispatch helpers to their own module to make the state handlers read as simpler, higher‑level transitions.

You can shave off a lot of the “plumbing noise” in the handlers by extracting the repeated GH/PR context and metadata logic into a small helper, and by moving the policy-ish helpers (`runJudgeForComment`, `dispatchAction`) out of this module.

Two concrete, low‑impact refactors that keep all behavior intact:

---

### 1. Factor out repeated `owner/repo` + metadata (+ merged/closed) logic

`handleJudging`, `handleFixing`, `handlePushing`, and `handleQuietCheck` all:

- split `job.repo`,
- call `fetchPrMetadata`,
- sometimes check for merged/closed,
- build nearly identical escalation messages.

You can centralize this into a small helper and make handlers read more like pure state transitions.

Example:

```ts
interface PrContext {
  owner: string;
  repo: string;
  meta: PrMetadata;
}

type PrStateCheck =
  | "none"                // just fetch metadata
  | "escalate-if-closed"  // escalate on closed, allow merged/other
  | "terminate-on-merged" // return merged, escalate on closed;

async function getPrContext(
  job: BabysitJob,
  deps: StateMachineDeps,
  opts: { cwd: string; check: PrStateCheck; phase: string },
): Promise<
  | { ok: true; ctx: PrContext; transition?: Transition }
  | { ok: false; transition: Transition }
> {
  const [owner, repo] = job.repo.split("/");
  if (!owner || !repo) {
    return {
      ok: false,
      transition: {
        nextState: "escalated",
        escalationReason: `repo id malformed: ${job.repo}`,
      },
    };
  }

  const metaRes = await fetchPrMetadata(
    {
      owner,
      repo,
      prNumber: job.pr_number,
      cwd: opts.cwd,
      installationId: job.installation_id,
    },
    deps.ghDeps,
  );
  if (!metaRes.ok) {
    return {
      ok: false,
      transition: {
        nextState: "escalated",
        escalationReason: `metadata fetch failed during ${opts.phase}: ${metaRes.detail}`,
      },
    };
  }

  if (opts.check === "terminate-on-merged") {
    if (metaRes.meta.state === "merged") {
      return { ok: true, ctx: { owner, repo, meta: metaRes.meta }, transition: { nextState: "merged" } };
    }
    if (metaRes.meta.state === "closed") {
      return {
        ok: false,
        transition: {
          nextState: "escalated",
          escalationReason: "PR closed without merge",
        },
      };
    }
  }

  if (opts.check === "escalate-if-closed" && metaRes.meta.state === "closed") {
    return {
      ok: false,
      transition: {
        nextState: "escalated",
        escalationReason: "PR closed without merge",
      },
    };
  }

  return { ok: true, ctx: { owner, repo, meta: metaRes.meta } };
}
```

Then `handleJudging` (as an example) becomes simpler and the state logic stands out:

```ts
async function handleJudging(
  job: BabysitJob,
  deps: StateMachineDeps,
): Promise<Transition> {
  if (!job.worktree_path) {
    return { nextState: "idle" };
  }

  const metaRes = await getPrContext(job, deps, {
    cwd: job.worktree_path,
    check: "terminate-on-merged",
    phase: "judging",
  });
  if (!metaRes.ok) return metaRes.transition;
  if (metaRes.transition) return metaRes.transition;
  const { owner, repo, meta } = metaRes.ctx;

  const fetched = await fetchPrComments({
    owner,
    repo,
    prNumber: job.pr_number,
    cwd: job.worktree_path,
  });
  // ... rest of handler unchanged ...
}
```

You can apply the same `getPrContext` helper in `handleFixing`, `handlePushing`, and `handleQuietCheck` (with `check: "none"` or `"escalate-if-closed"` as appropriate), which removes a lot of duplicated branching and error strings.

---

### 2. Move `runJudgeForComment` and `dispatchAction` into a dedicated module

Both `runJudgeForComment` and `dispatchAction` are “policy helpers” on top of `judgeComment` and `ghRequest`, and they don’t need access to the state-machine internals. Moving them into `judge.ts` (or a new `routing.ts`/`comment-routing.ts`) reduces the cognitive load in this file and makes the per‑state handlers call higher‑level primitives.

For example, extract to `judge-routing.ts`:

```ts
// judge-routing.ts
export async function runJudgeForComment(
  comment: RawPrComment,
  meta: PrMetadata,
  job: BabysitJob,
  deps: StateMachineDeps,
): Promise<JudgeCommentResult> {
  if (!job.worktree_path) {
    throw new Error("runJudgeForComment requires worktree_path");
  }
  const allDecisions = await babysitDecisions.listForJob(job.id);
  const priors = allDecisions
    .filter((d) => d.comment_hash === comment.bodyHash)
    .map((d) => ({
      decided_at: d.decided_at,
      validity: d.validity as Validity,
      category: d.category,
      outcome: d.outcome ?? null,
    }));
  return judgeComment({
    comment,
    ctx: {
      owner: meta.owner,
      repo: meta.repo,
      prNumber: meta.prNumber,
      title: meta.title,
      baseBranch: meta.baseBranch,
      priorDecisions: priors,
    },
    lineage: "anthropic",
    model: "claude-haiku-4-5",
    cwd: job.worktree_path,
    timeoutMs: deps.judgeTimeoutMs ?? DEFAULT_JUDGE_TIMEOUT_MS,
    abortSignal: new AbortController().signal,
  });
}

export async function dispatchAction(
  action: ActionDecision,
  comment: RawPrComment,
  job: BabysitJob,
  deps: StateMachineDeps,
): Promise<"fix" | "reply" | "escalate" | "skip"> {
  // ... existing logic, unchanged ...
}
```

Then in the state machine file you only import and use them:

```ts
import { runJudgeForComment, dispatchAction } from "./judge-routing.js";
```

`handleJudging` then reads as:

```ts
const judged = await runJudgeForComment(comment, meta, job, deps);
// ...
const followup = await dispatchAction(action, comment, job, deps);
```

This keeps `runJob` + handlers focused on state transitions and orchestration, while routing / policy concerns live in a dedicated module.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99fa9c2f03

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-18T02:41:42Z

+        onEvent({
+          chatId,
+          type: "phase_done",


Stop the run when verification fails

When a verify phase exhausts its TDD retries and runVerifyPhase returns { completed: true, passed: false }, this block still emits phase_done and continues without setting any failure latch. The final chat_done logic only checks anyPhaseDoerFailed/anyPhaseAllReviewersFailed, so templates that add a verify phase can still finish as approved and even run the ship step after tests/typecheck failed. Please treat !verifyOutcome.passed as a terminal failure/request-changes condition before continuing.

Useful? React with 👍 / 👎.

Fixed in c72bfa5 — runner.ts now treats !verifyOutcome.passed as a terminal failure: sets anyPhaseDoerFailed = true, sets doerFailureReason = 'max_rounds_exhausted', emits a phase_failed event with role: 'verify' + reason: 'verify_not_passed', and breaks the phase loop so no subsequent ship phase runs and the chat resolves as failed rather than approved.

chatgpt-codex-connector · 2026-05-18T02:41:42Z

+    const fetch = await runAsync("git", ["fetch", "origin", args.branch], {
+      cwd: args.sourceRepoPath,
+      timeoutMs: 60_000,
+    });


Fetch PR heads from forks, not only origin

For PRs whose head branch lives in a fork, args.branch is only the branch name from the contributor's repo; origin is the base repository and usually does not contain that ref. In that common forked-PR case this fetch fails (or, worse, resolves a same-named branch in the base repo), so the babysit job escalates before judging/fixing. The metadata/worktree path needs to carry the head repo/clone URL or fetch the PR ref rather than assuming origin/<headBranch>.

Useful? React with 👍 / 👎.

Acknowledged but deferring for this PR. You're right that forked-PR heads aren't in origin — fetching/checkout will fail and the job escalates with git_failure. Fixing this properly requires plumbing the head repo's clone URL through pr-metadata + adding a per-PR remote (or fetching the refs/pull/N/head ref). That's a meaningful surface-area change (new remote management, fork-permission auth) and orthogonal to this PR's verify/judge/fix loop. Tracking as a follow-up; for now babysit explicitly only supports same-repo PRs and the escalation path surfaces the failure cleanly.

chatgpt-codex-connector · 2026-05-18T02:41:42Z

+  const reply = await ghRequest(
+    {
+      method: "POST",
+      path: `repos/${owner}/${repo}/issues/${job.pr_number}/comments`,
+      body: { body: action.text },


Do not silently drop replies in CLI-auth mode

When a job was registered without an installation id, this POST goes through ghRequest's CLI fallback, but that path explicitly returns an error for any request with a body because runAsync cannot provide stdin. The failure is then converted to "skip" after the decision row has already been inserted, so reply-ack/reply-disagree comments are never posted and are not retried. This needs either a real CLI POST implementation or an escalation/retry path when App auth is unavailable.

Useful? React with 👍 / 👎.

Fixed in c72bfa5 — gh-client.ts cliRequest now pipes JSON.stringify(args.body) to gh api --input - via the new input?: string option on runAsync (extended in tick 1, dd80178). runAsync opens stdin (stdio: ["pipe", ...]), writes the body, swallows EPIPE on close. Reply posts now succeed in CLI-auth mode and the decision row's outcome reflects the post result.

Address Sourcery bug_risk findings on PR #6: 1. `cliRequest` in gh-client.ts no longer rejects POST/PATCH/PUT with bodies. We now pipe the JSON body to `gh api --input -` via a new optional `input` parameter on `runAsync`. Without this, reply posts silently no-op in CLI-only deployments (no GitHub App) — the babysit loop counts the failure as a soft skip and burns retries until the per-comment cap fires. 2. `fetchPrComments` now routes through `ghRequest` so App auth is used when an installation id is available (matching `pr-metadata.ts`). On headless hosts with a configured App but no `gh` CLI, the old path failed every comment fetch with `gh_not_installed`. State machine threads `job.installation_id` + `deps.ghDeps` to all three call sites (judging, fixing re-fetch, quiet_check). `runAsync` gains an optional `input?: string`; when set, stdin is piped and closed before reading stdout/stderr. EPIPE on the stdin writer is swallowed so the close handler still sees the real exit code. All other callers keep the existing "ignore stdin" behaviour. Verification: - npx tsc --noEmit: clean - npx vitest run: 1085/1085 pass (matches baseline) - Updated 1 gh-client test that asserted the now-removed "CLI cannot send bodies" typed error → asserts a successful POST with the body piped via `--input -` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

crypticpy has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

coderabbitai

Actionable comments posted: 25

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/daemon/index.ts (1)
280-306: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Await babysit scheduler shutdown inside shutdown().

Right now the scheduler gets separate signal handlers that call stop() fire-and-forget, while shutdown() continues to fastify.close() and process.exit(0). A SIGTERM can therefore terminate the process before in-flight babysit jobs drain.

Also applies to: 335-361
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/daemon/index.ts` around lines 280 - 306, The shutdown() handler currently
fires stop() on the babysit/scheduler via separate signal handlers but does not
await it, allowing process.exit(0) to kill in-flight babysit jobs; update
shutdown() to await the scheduler/babysit stop completion (e.g. await
babysit.stop() or await scheduler.stop()) before calling fastify.close() and
process.exit(0), and similarly await stopReaper if it returns a promise (replace
the fire-and-forget stopReaper() call with await stopReaper() when applicable)
so that babysit/scheduler and reaper drains complete before shutdown finalizes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cli/commands/babysit.ts`:
- Around line 91-92: The callDaemon function currently casts arbitrary JSON to
ApiResult<T> and can cause crashes later; update callDaemon to validate the
parsed JSON envelope (ensure it's a non-null object and has an 'ok' boolean
property) before returning: if validation fails, return a safe ApiResult<T> with
ok: false and a descriptive error message (or throw a controlled CLI error) so
callers (e.g., the code that reads res.error around the previous usage of res at
Line 165) never dereference missing fields; adjust any direct uses of
callDaemon's return to assume the validated ApiResult<T> shape (or add an
explicit check for res.ok before accessing res.error).

In `@src/daemon/babysit/comment-fetcher.ts`:
- Around line 131-137: The code currently builds reviewPath and issuePath with
per_page=100 and calls runAsync("gh", ["api", ...]) once, which truncates
results; update the fetch to paginate so all comments are retrieved: either call
runAsync with the GH CLI --paginate flag (e.g. runAsync("gh", ["api",
"--paginate", reviewPath], ...) and same for issuePath) or implement a loop that
calls runAsync("gh", ["api", reviewPath, "--page", N, "--per_page", 100], ...)
until no more pages, then concatenate all pages into the final reviewRes and
issueRes arrays; keep existing timeout handling and ensure code that processes
reviewRes/issueRes still works with arrays of comments.

In `@src/daemon/babysit/gh-client.ts`:
- Around line 147-150: The current retry on any 5xx in the issueAppCall flow can
double-execute non-idempotent writes; change the retry logic to only retry for
idempotent HTTP methods. In the block handling res.status >= 500 && < 600, check
the request method from args (e.g., args.method or args.options.method) and only
perform the sleep + retry+res = await issueAppCall(...) when method is one of
idempotent methods (GET, HEAD, PUT, DELETE, possibly OPTIONS); do not retry for
POST or PATCH (non-idempotent) — instead surface or return the error. Keep
references to issueAppCall, args, config, installationId, and fetcher so the
updated conditional is local to that call site.
- Around line 214-227: The fallback branch in gh-client.ts currently rejects
requests with args.body (the block checking args.body and returning {ok:false,
authMode:"cli",...}), contradicting the "always valid" CLI fallback; update the
logic so the CLI path accepts bodies by wiring the request body into the
existing runAsync helper (or extend runAsync to accept an stdin payload) and
remove the hard-fail return; specifically, modify the code around the args.body
check to pass args.body to runAsync (or add a new parameter) so gh-client's CLI
fallback can send the body via stdin instead of returning an error, keeping the
authMode="cli" flow intact for functions that post comments or updates.

In `@src/daemon/babysit/git-push.ts`:
- Around line 121-126: The code currently treats a failed git rev-parse as a
successful push by allowing commitSha = "" and later returning outcome:
"pushed"; update the logic in git-push.ts (around the runAsync call and the
variable commitSha) to treat sha.ok === false as a push failure: detect when
runAsync("git", ["rev-parse", "HEAD"], ...) returns ok === false and
return/throw a failed outcome (e.g., outcome: "failed" with an error message)
instead of proceeding with an empty commitSha; ensure any downstream return that
sets outcome: "pushed" uses the validated commitSha and only runs when sha.ok is
true.

In `@src/daemon/babysit/scheduler.ts`:
- Around line 109-118: stop() can resolve while tickOnce() is mid-await and may
still dispatch new jobs; ensure stop waits for any in-progress tick to finish
and prevent tickOnce from dispatching after stopped. Modify tickOnce (and the
same logic around lines 130-148) to re-check this.stopped immediately after any
awaits (e.g., after awaiting babysitJobs.listActive()) and return early if
stopped, and have stop() await the currently running tick promise
(introduce/await a runningTick or currentTickPromise tracked by tickOnce) in
addition to awaiting inFlightPromises so no new jobs are dispatched after stop()
resolves.
- Around line 97-99: Replace the unhandled void call in the setInterval callback
so tickOnce rejections are caught: instead of "void this.tickOnce();" call
"this.tickOnce().catch(err => { /* log skipped tick here */ })". Use the
scheduler's existing logging/skip-tick helper (or this.logger/this.log) inside
the catch to record the transient failure; the change should be applied where
intervalHandle is assigned with setInterval(..., this.intervalMs) so any
rejection from tickOnce is handled and logged.

In `@src/daemon/babysit/state-machine.ts`:
- Around line 526-537: The no_changes branch records the decision but then
returns nextState "quiet_check", so the job is never reprocessed; update the
handler so that when pushed.outcome !== "pushed" you both mark the decision
escalated via babysitDecisions.setOutcome(target.id, "escalated", null) (already
present) and also surface the job for human review by either updating the job
record (e.g., babysitJobs.setStatus(job.id, "escalated") or
babysitJobs.markEscalated(job.id)) or by returning a non-quiet state (e.g.,
return { nextState: "escalate" }) instead of { nextState: "quiet_check" } so the
loop will revisit it; ensure you reference job.id and the pushed.outcome branch
when making this change.

In `@src/daemon/babysit/worktree-manager.ts`:
- Around line 62-70: The current worktree flow fails for fork-based PRs because
only a branch name is passed and git operations always target origin; update
fetchPrMetadata to capture the PR head repository/remote (head.repo or full repo
owner/name) and add a field (e.g. headRemote or headRepo) to the PrMetadata
type, then extend EnsureWorktreeArgs to include that head remote identifier, and
update ensureWorktree and pullLatest to use the provided head remote (instead of
hard-coding "origin") when fetching, resetting, and pulling refs (e.g. use
<headRemote>/<branch> or the remote name mapped from headRepo). Ensure all call
sites that construct EnsureWorktreeArgs or consume PrMetadata are updated to
pass the new head remote field.

In `@src/daemon/phases/verify.ts`:
- Around line 567-572: The call to runDoer is using the verify phase's phaseIdx
but is running feedbackStdPhase, which mislabels emitted events; update the
argument so runDoer(chatDir, chatId, feedbackStdPhase, phaseIdx, round, ...)
passes the correct index (use feedbackStdPhase.phaseIdx or the feedback-specific
phaseIdx variable) instead of the verify phase's phaseIdx so the feedback run is
emitted with the feedback phase index.
- Around line 629-631: The fallback message always says "verify failed..." even
when the verify command exitCode === 0 (only reviewer consensus failed); update
the ternary that builds the summary so it uses exitCode to choose the verb: use
reviewOutcome.summary if present, otherwise append either ` (verify passed after
${iter} iteration${iter === 1 ? "" : "s"})` when exitCode === 0 or ` (verify
failed after ${iter} iteration${iter === 1 ? "" : "s"})` when exitCode !== 0;
locate the string construction around summary/reviewOutcome.summary in verify.ts
and replace the current fallback with this conditional form so the message
accurately reflects pass vs fail.

In `@src/daemon/routes/babysit.ts`:
- Line 27: Remove the empty interface declaration "interface BabysitJobView
extends BabysitJob {}" to satisfy the linter; delete that line and update any
references to BabysitJobView to use BabysitJob directly (or, if a distinct alias
is truly required, replace it with a type alias like "type BabysitJobView =
BabysitJob"). Ensure all usages (e.g., in function signatures or return types)
reference the corrected type name.
- Around line 66-88: The upsert is racy: two requests can both miss
babysitJobs.getById then one babysitJobs.create throws a unique-constraint DB
error and returns db_error; instead catch unique-constraint failures around
babysitJobs.create (detect DB unique error code/marker) and in that catch
re-query babysitJobs.getById (or a lookup by repo/pr_number/installation_id) and
return successResponse({ job: existing, created: false }); only rethrow other
errors. Target the create call in babysit.ts and the babysitJobs.create/getById
helpers when implementing this fix.

In `@src/daemon/runner.ts`:
- Around line 427-443: The code currently treats a completed verify phase the
same regardless of verifyOutcome.passed; after calling runVerifyPhase (look for
runVerifyPhase and the verifyOutcome object in runner.ts) add a conditional so
that if verifyOutcome.completed is true and verifyOutcome.passed is false you
emit a failure event (use onEvent with type "phase_failed" or similar and
include phaseId/phaseIdx/kind/summary) and then stop further execution of the
pipeline (return the terminal result or break out to produce a rejected/failed
terminal state instead of continuing to subsequent phases), otherwise proceed
with the existing phase_done emission when passed is true.

In `@src/daemon/runner/prompt-builder.ts`:
- Around line 228-230: The truncation uses string length/code units but
PROJECT_GUIDE_MAX_BYTES is bytes; replace the `body.length` check and
`body.slice` logic with a UTF-8 byte-aware truncation: encode body with
TextEncoder to a Uint8Array, if its byteLength > PROJECT_GUIDE_MAX_BYTES take a
Uint8Array slice of PROJECT_GUIDE_MAX_BYTES and decode back to a string with
TextDecoder (or build the string incrementally encoding codepoints until bytes
<= PROJECT_GUIDE_MAX_BYTES) to avoid splitting multibyte characters; keep the
subsequent `sanitized = ...replace(/<\/project_guidelines>/gi, "")` step
unchanged. Ensure you update the variables `truncated`, `slice` (or rename
appropriately) and use PROJECT_GUIDE_MAX_BYTES in the byte-level logic.

In `@src/lib/template-schema.ts`:
- Around line 326-336: The feedbackPhase field currently uses
z.string().optional() and permits any string; change its validation to ensure
the value is one of the declared phase names at template-parse time by
validating against the template's phases declaration (e.g., the phases property
or phases keys) — either build an allowed set from the declared phases and
replace feedbackPhase's schema with a membership check
(z.string().optional().refine(...) or z.enum(...) generated from the phases) or
use a superRefine/preprocess step during parseTemplate/validation to error if
feedbackPhase is not present in the phases map; keep maxIterations as-is and
ensure the error message names feedbackPhase so typos are caught early.

In `@src/mcp/tools.ts`:
- Around line 314-317: The babysit_pr URL schema currently uses
z.string().min(1) which allows invalid strings; update the validator in
src/mcp/tools.ts (the z.string() chain shown) to enforce a real URL
format—replace .min(1, "url is required") with z.string().url({ message: "url is
required" }) or z.string().url({ message: "invalid url" }) so the babysit_pr URL
is validated as a proper URL.

In `@tests/babysit-comment-fetcher.test.ts`:
- Around line 267-283: The test currently only checks that the encoded since
string appears once in observedArgs; update it to assert it appears twice so
both gh API calls are verified. After calling fetchPrComments (the tested
function), count occurrences of the exact encoded token
"since=2026-05-10T00%3A00%3A00Z" in observedArgs (the file written by the fake
gh in tmpBin) and assert the count equals 2, keeping the rest of the setup
intact.

In `@tests/babysit-git-push.test.ts`:
- Around line 23-29: The helper function git currently shells out via
execSync(`git ${args}`) which is fragile and unsafe; change it to use
execFileSync from child_process and accept a string[] argv (e.g., git(args:
string[], cwd: string): string) and call execFileSync('git', args, { cwd,
encoding: 'utf-8', stdio: ['ignore','pipe','pipe'] }); then update all call
sites that currently pass interpolated strings (e.g., places building
paths/flags) to pass an array of arguments instead so quoting/escaping is
handled by the API rather than the shell.

In `@tests/babysit-verifier.test.ts`:
- Around line 135-143: The hasBinary helper currently uses dynamic
require("child_process") and shell interpolation causing module-coupling and
injection risk; change it to a static import (e.g., import { spawnSync } from
"child_process") and replace execSync(`which ${name}`) with spawnSync calls that
pass the binary name as an argument (no shell). In hasBinary, call spawnSync
with args ['which', name] on POSIX and ['where', name] on Windows (use
process.platform to choose), ensure shell is false and check the returned
status/code === 0 (or stdout length) to return true/false.

In `@tests/babysit-worktree-manager.test.ts`:
- Around line 32-38: The helper function git currently builds a shell command
via template interpolation and calls execSync, which is vulnerable to shell
injection; change the signature of git (and any callers at the other occurrence
around the previous diff) to accept args as a string[] and call
execFileSync("git", args, { cwd, encoding: "utf-8", stdio:
["ignore","pipe","pipe"] }) instead of execSync; update all call sites that pass
a single interpolated string (including the calls referenced near lines 239–241)
to pass the argv array (e.g., ["commit","-m","msg"] or ["status","--porcelain"])
so argument escaping is handled safely.

In `@tests/verify-phase.test.ts`:
- Around line 240-265: The test's final assertion is too broad — replace the
weak expect(out).toContain("do") with a precise invariant that guarantees the
feedback shape is correct; update the assertion in the spec using
formatVerifyFailureFeedback to check for the exact token the consumer expects
(e.g., expect(out).toContain("doer") or another exact marker your prior-round
injection relies on) so the test fails on regressions rather than incidental
text.
- Around line 81-101: The tests only cover whitespace splitting and miss quoted
arguments; add a test in tests/verify-phase.test.ts that asserts
splitCommand('-e "console.log(\'a b\')"') returns { exec: '-e'? No—use a
realistic command like 'node -e "console.log(\'a b\')"' and expect exec: 'node',
args: ['-e', "console.log('a b')"] (also add a case with single quotes), and
then update the splitCommand implementation to parse and collapse whitespace but
treat quoted substrings as single arguments (strip surrounding single or double
quotes, support escaped quotes inside) so quoted arguments containing spaces are
preserved; ensure splitCommand still returns null for empty input and handles
single-token commands as before.

---

Outside diff comments:
In `@src/daemon/index.ts`:
- Around line 280-306: The shutdown() handler currently fires stop() on the
babysit/scheduler via separate signal handlers but does not await it, allowing
process.exit(0) to kill in-flight babysit jobs; update shutdown() to await the
scheduler/babysit stop completion (e.g. await babysit.stop() or await
scheduler.stop()) before calling fastify.close() and process.exit(0), and
similarly await stopReaper if it returns a promise (replace the fire-and-forget
stopReaper() call with await stopReaper() when applicable) so that
babysit/scheduler and reaper drains complete before shutdown finalizes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8848e778-1e52-47ca-9309-7c477a15fcc3

📥 Commits

Reviewing files that changed from the base of the PR and between 187d389 and 99fa9c2.

📒 Files selected for processing (49)

docs/pr-babysit-design.md
src/cli/commands/babysit.ts
src/cli/index.ts
src/daemon/babysit/comment-fetcher.ts
src/daemon/babysit/fix-executor.ts
src/daemon/babysit/gh-app.ts
src/daemon/babysit/gh-client.ts
src/daemon/babysit/git-push.ts
src/daemon/babysit/judge.ts
src/daemon/babysit/pr-metadata.ts
src/daemon/babysit/scheduler.ts
src/daemon/babysit/state-machine.ts
src/daemon/babysit/verifier.ts
src/daemon/babysit/webhook-verify.ts
src/daemon/babysit/worktree-manager.ts
src/daemon/index.ts
src/daemon/phases/verify.ts
src/daemon/routes/babysit.ts
src/daemon/runner.ts
src/daemon/runner/doer-driver.ts
src/daemon/runner/prompt-builder.ts
src/daemon/runner/reviewer-driver.ts
src/lib/db/babysit-decisions.ts
src/lib/db/babysit-jobs.ts
src/lib/db/connection.ts
src/lib/db/index.ts
src/lib/db/schema.sql
src/lib/db/secrets.ts
src/lib/template-schema.ts
src/mcp/index.ts
src/mcp/tools.ts
templates/pr-babysit.yaml
tests/babysit-comment-fetcher.test.ts
tests/babysit-db.test.ts
tests/babysit-fix-executor.test.ts
tests/babysit-gh-app.test.ts
tests/babysit-gh-client.test.ts
tests/babysit-git-push.test.ts
tests/babysit-judge.test.ts
tests/babysit-pr-metadata.test.ts
tests/babysit-routes.test.ts
tests/babysit-scheduler.test.ts
tests/babysit-state-machine.test.ts
tests/babysit-verifier.test.ts
tests/babysit-webhook-verify.test.ts
tests/babysit-worktree-manager.test.ts
tests/prompt-builder-project-guides.test.ts
tests/template-schema.test.ts
tests/verify-phase.test.ts

coderabbitai · 2026-05-18T02:49:17Z

+describe("splitCommand", () => {
+  it("splits a simple command on whitespace", () => {
+    expect(splitCommand("npm test")).toEqual({ exec: "npm", args: ["test"] });
+  });
+
+  it("collapses runs of whitespace", () => {
+    expect(splitCommand("  pnpm   run   verify  ")).toEqual({
+      exec: "pnpm",
+      args: ["run", "verify"],
+    });
+  });
+
+  it("returns null for an empty string", () => {
+    expect(splitCommand("")).toBeNull();
+    expect(splitCommand("   ")).toBeNull();
+  });
+
+  it("handles single-token commands", () => {
+    expect(splitCommand("vitest")).toEqual({ exec: "vitest", args: [] });
+  });
+});


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add quoted-argument coverage for splitCommand before locking in parser behavior.

These tests currently enforce whitespace tokenization only, which can silently break valid commands containing quoted args (e.g., -e "console.log('a b')"). This is a high-impact gap for chorus.verify command execution.

Proposed test addition

describe("splitCommand", () => { @@ it("handles single-token commands", () => { expect(splitCommand("vitest")).toEqual({ exec: "vitest", args: [] }); }); + + it("preserves quoted arguments containing spaces", () => { + expect(splitCommand(`node -e "console.log('a b')"`)).toEqual({ + exec: "node", + args: ["-e", "console.log('a b')"], + }); + }); });

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

describe("splitCommand", () => {

it("splits a simple command on whitespace", () => {

expect(splitCommand("npm test")).toEqual({ exec: "npm", args: ["test"] });

});

it("collapses runs of whitespace", () => {

expect(splitCommand(" pnpm run verify ")).toEqual({

exec: "pnpm",

args: ["run", "verify"],

});

});

it("returns null for an empty string", () => {

expect(splitCommand("")).toBeNull();

expect(splitCommand(" ")).toBeNull();

});

it("handles single-token commands", () => {

expect(splitCommand("vitest")).toEqual({ exec: "vitest", args: [] });

});

});

describe("splitCommand", () => {

it("splits a simple command on whitespace", () => {

expect(splitCommand("npm test")).toEqual({ exec: "npm", args: ["test"] });

});

it("collapses runs of whitespace", () => {

expect(splitCommand(" pnpm run verify ")).toEqual({

exec: "pnpm",

args: ["run", "verify"],

});

});

it("returns null for an empty string", () => {

expect(splitCommand("")).toBeNull();

expect(splitCommand(" ")).toBeNull();

});

it("handles single-token commands", () => {

expect(splitCommand("vitest")).toEqual({ exec: "vitest", args: [] });

});

it("preserves quoted arguments containing spaces", () => {

expect(splitCommand(`node -e "console.log('a b')"`)).toEqual({

exec: "node",

args: ["-e", "console.log('a b')"],

});

});

});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/verify-phase.test.ts` around lines 81 - 101, The tests only cover whitespace splitting and miss quoted arguments; add a test in tests/verify-phase.test.ts that asserts splitCommand('-e "console.log(\'a b\')"') returns { exec: '-e'? No—use a realistic command like 'node -e "console.log(\'a b\')"' and expect exec: 'node', args: ['-e', "console.log('a b')"] (also add a case with single quotes), and then update the splitCommand implementation to parse and collapse whitespace but treat quoted substrings as single arguments (strip surrounding single or double quotes, support escaped quotes inside) so quoted arguments containing spaces are preserved; ensure splitCommand still returns null for empty input and handles single-token commands as before.

Acknowledged but deferring. splitCommand deliberately uses plain whitespace tokenization — chorus.verify is a single project-controlled string from package.json (not user-supplied at runtime), so the threat model is mis-typed configuration, not adversarial shell parsing. Supporting quoted args properly means either a full POSIX shell lexer (regex-impossible, see shellwords/shlex) or shelling out via sh -c (re-introduces injection risk). The current contract is documented: 'whitespace-split command, no quoting' — if a project needs complex args, they can use pnpm run verify and define the verify script in their package.json. Test could be added asserting the documented behavior (quoted tokens split on the space inside the quotes), but that locks in the limitation rather than fixing it; happy to add it if you'd prefer.

@crypticpy, that's a well-reasoned trade-off. The pnpm run verify escape hatch keeps the surface area minimal and avoids the shlex rabbit hole entirely — agreed, no changes needed here.

✏️ Learnings added

Learnt from: crypticpy Repo: crypticpy/chorus PR: 6 File: tests/verify-phase.test.ts:81-101 Timestamp: 2026-05-18T03:10:39.699Z Learning: In `src/daemon/phases/verify.ts` (TypeScript), `splitCommand` deliberately uses plain whitespace tokenization and does NOT support quoted arguments containing spaces. This is an intentional design decision: `chorus.verify` is a project-controlled string from `package.json` (not user-supplied at runtime), so the threat model is mis-typed config rather than shell injection. A proper POSIX shell lexer or `sh -c` was intentionally avoided to reduce injection risk. Projects needing complex args should define a named script in `package.json` (e.g., `pnpm run verify`) and set `chorus.verify` to that script invocation. The documented contract is: whitespace-split only, no quoting.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

…ries, scheduler race, schema cross-refs P1/Critical: - runner: gate phase progression on verify outcome (no terminal 'approved' when tests failed). - gh-client: only retry 5xx on idempotent methods (GET/DELETE/PUT); POST/PATCH no longer at risk of duplicated comments. - state-machine: judging loop now breaks after first fix so a single tick can't apply N stale decisions; no_changes now escalates instead of looping in quiet_check. Hardening: - scheduler: stop() awaits the in-flight tick before tearing down; setInterval errors no longer swallowed. - git-push: rev-parse HEAD failure surfaces as commit_failure rather than a blank commitSha. - comment-fetcher: paginate review + issue comments (cap at 10 pages/endpoint). - template-schema: superRefine that verify.feedbackPhase references a real standard phase (not review_only/audit/orchestrate/verify). - prompt-builder: byte-cap project guide via Buffer slice so multi-byte UTF-8 boundaries stay valid. - cli/babysit: validate daemon envelope shape; non-conforming JSON surfaces invalid_envelope instead of crashing on .error access. - mcp/tools: BabysitPrSchema.url is now a real URL check. - routes/babysit: tighten the typedef + handle upsert race on POST. Tests: - babysit-git-push + babysit-worktree-manager: shell-out conversion from execSync + template-literal interpolation to execFileSync + argv arrays (eliminates injection surface; paths with spaces). - babysit-verifier: hasBinary helper uses spawnSync instead of shelled `which` (portable + injection-free). - babysit-comment-fetcher: since-forwarding test asserts the query appears on BOTH endpoints (was single toContain — one-sided regression would have slipped through). - verify-phase: feedback assertion now matches the actual directive substring ('diagnose the failure'), not the previous brittle 'do'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

crypticpy has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/template-schema.ts (1)

683-686: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include verify in templateRequiresRepo to enforce create-time contract.

verify now hard-requires repoPath at runtime, but this helper still returns false for verify-only templates. That allows invalid chat creation paths and defers failure to execution time.

Suggested fix

 export function templateRequiresRepo(template: Template): boolean {
   return template.phases.some(
-    (p) => p.kind === "audit" || p.kind === "orchestrate",
+    (p) =>
+      p.kind === "audit" || p.kind === "orchestrate" || p.kind === "verify",
   );
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/template-schema.ts` around lines 683 - 686, templateRequiresRepo
currently only considers "audit" and "orchestrate" phases but misses "verify",
letting verify-only templates bypass create-time repo validation; update the
templateRequiresRepo(template: Template) implementation to also return true when
any phase has p.kind === "verify" so that templates that hard-require repoPath
(used by verify) are enforced at creation time (refer to function
templateRequiresRepo and the verify behavior that requires repoPath).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/daemon/babysit/comment-fetcher.ts`:
- Around line 175-193: The current logic only returns an overall failure when
both reviewPaged and issuePaged are bad; change it to fail if either call failed
by switching the conditional to check if (!reviewPaged.ok || !issuePaged.ok) and
construct the returned reason/detail from the failing call(s) (e.g., prefer
reviewPaged.errorText if reviewPaged.ok is false else use issuePaged.errorText),
using classifyGhFailureResult on the failing result(s) to populate reason and
detail so any single endpoint failure surfaces a fetch failure for retry.

In `@src/daemon/babysit/state-machine.ts`:
- Around line 430-434: The call to applyFixForComment is passing target.validity
(a classification like "valid"/"invalid") into the judgementRationale parameter,
which expects explanatory text; update the code so applyFixForComment receives
the actual judge rationale (e.g., use a stored rationale field from the
babysit_decisions record or from matched if available) or omit
judgementRationale entirely if not required. Locate the invocation of
applyFixForComment and replace the judgementRationale: target.validity argument
with the real rationale string (e.g., decision.rationale or matched.rationale)
or remove that parameter, and ensure the babysit_decisions persistence includes
a rationale field so the rationale can be retrieved and passed here when
creating fixed via applyFixForComment.
- Around line 677-681: The judge config is using hardcoded lineage/model and
creates an orphaned AbortController.signal; update the code that builds the
judge options to use configurable values from StateMachineDeps (reuse
deps.doerLineage/deps.doerModel or add new deps.judgeLineage/deps.judgeModel and
use them in place of the literal "anthropic"/"claude-haiku-4-5"), and replace
the new AbortController().signal with a caller-controlled signal (e.g., a signal
exposed on deps such as deps.shutdownSignal or a scheduler-provided signal) or
remove the signal entirely and rely on timeoutMs (deps.judgeTimeoutMs ??
DEFAULT_JUDGE_TIMEOUT_MS); ensure you patch the construction that references
job.worktree_path, timeoutMs, abortSignal so they use the chosen deps fields.

In `@src/daemon/phases/verify.ts`:
- Around line 512-513: The verify loop checks abortSignal.aborted and breaks,
but code after the loop still calls runReviewers (lines referencing runReviewers
and the verify function), so on cancellation the final reviewer run still
executes; modify verify so that when abortSignal.aborted is true you return
immediately (or throw/cancel) instead of breaking to the later code path—i.e.,
replace the break in the while (iter < maxIterations) { if (abortSignal.aborted)
... } with an early return (or guard after the loop that checks
abortSignal.aborted and returns) to ensure runReviewers is not invoked after
cancellation.

In `@src/daemon/runner.ts`:
- Around line 391-407: The repoPath check in runner.ts currently emits a
phase_failed event and breaks but doesn't mark the overall run as terminal;
update the branch that handles missing repoPath (the if (!repoPath) block that
calls onEvent with type "phase_failed") to also set the run's terminal failure
latch so the run cannot later resolve to chat_done approved/completed—either set
the existing run-level flag (e.g., runFailed or terminalFailure) or invoke the
run-termination helper (e.g., finishRunWithFailure/terminateRun) with the same
context (chatId, phase.id, phaseIdx, reason: "missing_repo_path") so the runner
treats this as a final failure. Ensure the added latch/termination uses the same
identifiers (chatId, phase.id, phaseIdx, phase.kind) so downstream logic
recognizes the terminal failure.

---

Outside diff comments:
In `@src/lib/template-schema.ts`:
- Around line 683-686: templateRequiresRepo currently only considers "audit" and
"orchestrate" phases but misses "verify", letting verify-only templates bypass
create-time repo validation; update the templateRequiresRepo(template: Template)
implementation to also return true when any phase has p.kind === "verify" so
that templates that hard-require repoPath (used by verify) are enforced at
creation time (refer to function templateRequiresRepo and the verify behavior
that requires repoPath).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5fb3d210-04ec-4930-b2e0-2ee28e3821f4

📥 Commits

Reviewing files that changed from the base of the PR and between dd80178 and c72bfa5.

📒 Files selected for processing (17)

src/cli/commands/babysit.ts
src/daemon/babysit/comment-fetcher.ts
src/daemon/babysit/gh-client.ts
src/daemon/babysit/git-push.ts
src/daemon/babysit/scheduler.ts
src/daemon/babysit/state-machine.ts
src/daemon/phases/verify.ts
src/daemon/routes/babysit.ts
src/daemon/runner.ts
src/daemon/runner/prompt-builder.ts
src/lib/template-schema.ts
src/mcp/tools.ts
tests/babysit-comment-fetcher.test.ts
tests/babysit-git-push.test.ts
tests/babysit-verifier.test.ts
tests/babysit-worktree-manager.test.ts
tests/verify-phase.test.ts

…circuit, runner verify-gate parity, prompt-builder TOCTOU P0/Critical: - runner: missing-repoPath verify branch now sets anyPhaseDoerFailed + doerFailureReason BEFORE phase_failed/break, matching the post- runVerifyPhase failure gate. Was leaking to chat_done=approved. - comment-fetcher: fail closed when either review or issue endpoint fails. Autonomous judging on partial input could merge a PR with unaddressed line-level review feedback; state-machine retries failed fetches next tick, so transient failures recover safely. Hardening: - verify: when abortSignal trips mid-loop, return immediately instead of falling through to the final reviewer pass (each runReviewers call is 30+ seconds of LLM work). - state-machine: judge call now honors deps.judgeLineage / judgeModel / abortSignal (was hardcoded anthropic/haiku + orphan AbortController). Daemon-shutdown / job-pause can actually interrupt in-flight judging. - state-machine: judgementRationale string no longer mis-routes the one-word validity enum into the doer prompt's rationale slot — synthesizes a structured descriptor (validity / category / confidence) from what we DO persist. - prompt-builder readProjectGuides: TOCTOU hardening via openSync + O_NOFOLLOW + fstatSync on POSIX, lstat+read on Windows (no O_NOFOLLOW available there). Eliminates the lstat/read race where a swap could leak unintended file content into the project guide injection block. Tests: - comment-fetcher: partial-data test inverted to assert the new fail-closed contract (one-endpoint failure surfaces, doesn't pretend the other endpoint's data is the full picture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

crypticpy has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

…babysit) + fork retarget (#7) * docs: README + ROADMAP for v0.8 shipped (audit, orchestrate, verify, babysit) + fork retarget Updates the docs to reflect what shipped in PRs #1 + #6 and decouples the URLs from the upstream chorus-codes/chorus slot now that this fork lives on its own at crypticpy/chorus. README: - Templates section: new rows for audit-* and pr-babysit + three explainer subsections (audit + orchestrate, verify + TDD loop, PR-babysit) - Commands section: chorus quickstart + chorus babysit subcommand group - Roadmap mini-summary: v0.8 marked done - Repo/issue/clone/discussion URLs: chorus-codes/chorus → crypticpy/chorus (badges + npm install left alone — those wait for the rebrand PR) ROADMAP: - Status tracker: 4 new shipped rows (audit phase, orchestrate phase, verify phase + TDD loop, PR-babysit) - chorus-codes/chorus migration items marked MOOT — fork is its own project, no contribution-back planned - Section 7 (migration) collapsed to a short MOOT note - "Where we are" footnote updated to 2026-05-17 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify babysit daemon does not auto-merge The in-app `chorus babysit` daemon (src/daemon/babysit/state-machine.ts) sits in `quiet_check` after two quiet ticks and waits for an external merge — it does not call `gh pr merge` itself and does not gate on CI. The README and ROADMAP both overstated this, implying the daemon squash-merges on its own. Update both to point to the `/babysit-pr` Claude Code skill (this loop) as the surface that actually performs the merge. Caught by ChatGPT Codex review on PR #7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: fix empty badge links + stale v0.7 status + ROADMAP SSE-decoupling contradiction - README: Status/Node badges had empty `()` link targets (MD042) and the Status badge still said v0.7. Point Status -> ROADMAP.md (now v0.8) and Node -> nodejs.org so both have real targets. - ROADMAP: "Runner decoupling from SSE" row was still marked TODO for v0.8 even though lines 48 and 267 record it as shipped in PR chorus-codes#21 (2026-05-02). Align row to the v0.7 DONE status used elsewhere. CI URL badge and `npm i -g chorus-codes` install line still intentionally untouched — those wait for the rebrand PR. Caught by CodeRabbit on PR #7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

crypticpy and others added 30 commits May 7, 2026 20:59

crypticpy and others added 11 commits May 17, 2026 19:06

greptile-apps Bot reviewed May 18, 2026

View reviewed changes

sourcery-ai Bot reviewed May 18, 2026

View reviewed changes

Comment thread src/daemon/babysit/gh-client.ts

Comment thread src/daemon/babysit/comment-fetcher.ts Outdated

Comment thread src/daemon/index.ts

Comment thread src/daemon/phases/verify.ts

Comment thread src/daemon/babysit/state-machine.ts

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

greptile-apps Bot reviewed May 18, 2026

View reviewed changes

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

greptile-apps Bot reviewed May 18, 2026

View reviewed changes

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

greptile-apps Bot reviewed May 18, 2026

View reviewed changes

crypticpy merged commit 3f2f2a5 into main May 18, 2026
1 check passed

crypticpy deleted the feat/runner-execution-loop branch May 18, 2026 03:31

crypticpy mentioned this pull request May 18, 2026

docs: README + ROADMAP for v0.8 shipped (audit, orchestrate, verify, babysit) + fork retarget #7

Merged

3 tasks

Conversation

crypticpy commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

PR-babysit autonomous loop

Verify phase + TDD loop

Supporting changes

Test plan

Summary by Sourcery

Summary by CodeRabbit

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for verify phase with TDD feedback loop

ER diagram for babysit_jobs and babysit_decisions tables

State diagram for PR babysit job lifecycle

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

crypticpy May 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

crypticpy May 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

crypticpy May 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

crypticpy commented May 18, 2026 •

edited by coderabbitai Bot

Loading

sourcery-ai Bot commented May 18, 2026 •

edited

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading

coderabbitai Bot May 18, 2026 •

edited

Loading