Harden orchestration approval and manifest state#382
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
9e250d3 to
888598e
Compare
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
There was a problem hiding this comment.
PR Review
Scope: 11 file(s), +742 / −32
Verdict: Minor issues
This PR tightens orchestration correctness: explicit plan approval (no regex or planning-phase shortcuts), lead self-approval blocked in patch policy, manifest suspend/conflict handling on external bundle changes, bundle-path repointing when lane placement/worktree moves, and preload project-binding restore on cancelled openRepo. The changes are well-tested; one edge case in cold-session repointing is worth fixing.
🐛 Functionality
[Medium] Cold orchestration sessions can keep a stale bundle path after lane placement changes
File: apps/desktop/src/main/services/chat/agentChatService.ts:23807-23836
Issue: When repointing persisted (non-managed) orchestration sessions, the handler scans sessionService.list({ limit: 500 }) ordered by started_at desc. Any chat session with an orchestrationRunId that is not in the newest 500 rows is skipped, so its orchestrationBundlePath stays on the pre-move path after VM detach or worktree relocation.
Repro: Create 501+ chat sessions in a project; start an orchestration lead run, dispose the session (cold), change the lane worktree or detach from Mac VM, then reopen the cold session and invoke orchestration tools — metadata still points at the old bundle directory.
Fix: Query only sessions that carry orchestration metadata (e.g. filter persisted rows / metadata files with orchestrationRunId), or pass laneId into sessionService.list and drop the fixed 500 cap for this path.
Notes
- Good hardening overall:
requestPlanApprovalrequiringdecision === "accept" | "accept_for_session",isOrchestrationPlanApprovedno longer treating planning phasedoneas approval,persistManifestTOCTOU guard + suspend onrunIdmismatch, and watcher callbacks serialized under the run mutex. - Bundle relocation assumes orchestration files already exist at the destination worktree (mirror sync / manual copy); that matches VM mirror behavior (
.ade/orchestrationis not rsync-excluded). - VM detach still depends on mirror→lane flush before the share directory is removed;
stopMirrorSyncForLaneis optional on the Mac VM service — worth verifying separately if detach-related manifest loss is reported.
Sent by Cursor Automation: BUGBOT in Versic
888598e to
8d0384b
Compare
8d0384b to
a1763c5
Compare
9c69b32 to
ce54f9e
Compare
Fold the validated orchestration fixes into one lane: require explicit plan approval, block lead self-approval patches, suspend on foreign bundle swaps, detect on-disk manifest conflicts, relocate run bundles after lane placement changes, and restore preload project bindings after cancelled openRepo.
ce54f9e to
e919c82
Compare
| } catch (err) { | ||
| if (err instanceof OrchestrationRunSuspendedError) { | ||
| return { | ||
| ok: false, | ||
| error: "validation_failed", | ||
| message: RUN_SUSPENDED_MESSAGE, | ||
| }; | ||
| } | ||
| throw err; | ||
| } |
There was a problem hiding this comment.
Missing
OrchestrationPersistConflictError handler in recordValidation
directPatch → persistManifest now throws OrchestrationPersistConflictError (new in this PR) when the second rejectIfDiskAdvanced check inside atomicWrite.beforeCommit detects a concurrent write. The catch block here only handles OrchestrationRunSuspendedError and rethrows everything else. Any concurrent manifest mutation during a recordValidation call will therefore surface as an uncaught exception to the IPC caller instead of returning the structured { ok: false, error: "etag_conflict" } response that the function's return type implies. The sibling call-sites externalManifestPatch and agentHeartbeat both handle OrchestrationPersistConflictError explicitly — this one was missed.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/desktop/src/main/services/orchestration/orchestrationService.ts
Line: 1399-1408
Comment:
**Missing `OrchestrationPersistConflictError` handler in `recordValidation`**
`directPatch` → `persistManifest` now throws `OrchestrationPersistConflictError` (new in this PR) when the second `rejectIfDiskAdvanced` check inside `atomicWrite.beforeCommit` detects a concurrent write. The catch block here only handles `OrchestrationRunSuspendedError` and rethrows everything else. Any concurrent manifest mutation during a `recordValidation` call will therefore surface as an uncaught exception to the IPC caller instead of returning the structured `{ ok: false, error: "etag_conflict" }` response that the function's return type implies. The sibling call-sites `externalManifestPatch` and `agentHeartbeat` both handle `OrchestrationPersistConflictError` explicitly — this one was missed.
How can I resolve this? If you propose a fix, please make it concise.

Summary
Validation
Related
Greptile Summary
This PR hardens the orchestration service against self-approval exploits, concurrent disk conflicts, and stale bundle paths after lane relocation. It also restores the project binding on a cancelled
openRepo.persistManifestnow double-checks the on-disk manifest before and insideatomicWriteviarejectIfDiskAdvanced; foreign-runId or advanced-generation states suspend the run rather than overwriting, and the temp file is cleaned up on any pre-commit failure./leadState,/phases/{id:planning}/status,/phases/{id:planning}/completedAt, and/leadState/planApprovalSummaryare added toLEAD_DENY_PATTERNS; the regex-based free-text approval check is replaced with an explicitdecisionfield; andisOrchestrationPlanApprovedno longer accepts a planning-phase status shortcut.relocateRunBundleresets the runtime and restarts the watcher when a lane's worktree moves;handleLanePlacementChangedis made async and propagates relocations for both managed and cold (persisted) sessions with an unboundedlimit: nullquery instead of the previous 500-row cap.Confidence Score: 3/5
Safe to merge after the missing conflict-error handler in
recordValidationis addressed; the rest of the changes are well-structured.The new
OrchestrationPersistConflictErrortype introduced in this PR is handled inexternalManifestPatchandagentHeartbeat, but therecordValidationfunction's catch block only coversOrchestrationRunSuspendedErrorand rethrows everything else. A concurrent disk write during a validation record operation will propagate as an unhandled exception instead of the structured error response the function contract implies. Multiple agents recording validation results simultaneously — a normal multi-agent scenario — can trigger this path.apps/desktop/src/main/services/orchestration/orchestrationService.ts — specifically the
recordValidationtry-catch arounddirectPatch(lines ~1392–1408)Important Files Changed
rejectIfDiskAdvanced), suspension on foreign-runId,relocateRunBundle, andassertRunWritableguards. MissingOrchestrationPersistConflictErrorcatch inrecordValidationmeans concurrent writes surface as unhandled exceptions there./leadState,/leadState/planApprovalSummary,/phases/{id:planning}/status,/phases/{id:planning}/completedAttoLEAD_DENY_PATTERNSto close self-approval bypass. Pattern matching is exact (not prefix), so sub-path mutability is preserved correctly.isOrchestrationPlanApproved; approval is now exclusively gated onplanApprovedAtbeing set via the controlledapprovePlanpath.limit: null(no cap) and synchronous file I/O to rewrite the metadata JSON directly. Previously flaggedlimit: 500cap is resolved.limit: nullto remove the SQLite LIMIT clause; default (undefined) still falls through to 200. Logic is correct.decision: "accept"or"accept_for_session"field to prevent false-positives on negation phrases.openReporeturns falsy or throws, preventing a null project-binding state after a cancelled or failed open.onPlacementChangedcallback async-aware so that the relocation await chain completes beforeemitPlacementChangedreturns, avoiding fire-and-forget relocation races.handleLanePlacementChangedin theonPlacementChangedcallback to correctly propagate the now-async relocation chain.limitinListSessionsArgsfrom `numberSequence Diagram
sequenceDiagram participant Caller participant persistManifest participant atomicWrite participant disk as Disk (manifest.json) participant runtime as RunRuntime Caller->>persistManifest: write next manifest persistManifest->>disk: rejectIfDiskAdvanced() [pre-check] alt runId mismatch on disk disk-->>persistManifest: foreign runId persistManifest->>runtime: "suspended=true, manifest=null" persistManifest-->>Caller: throw OrchestrationRunSuspendedError else disk generation advanced disk-->>persistManifest: newer serverGeneration persistManifest->>runtime: "manifest = onDisk" persistManifest-->>Caller: throw OrchestrationPersistConflictError else no conflict disk-->>persistManifest: ok / ENOENT persistManifest->>runtime: markSelfWrite() persistManifest->>atomicWrite: "write tmp, beforeCommit=rejectIfDiskAdvanced" atomicWrite->>disk: rejectIfDiskAdvanced() [post-write check] alt disk advanced between pre-check and write disk-->>atomicWrite: conflict atomicWrite->>disk: unlink(tmp) atomicWrite-->>persistManifest: throw persistManifest->>runtime: "recentSelfWriteUntil=0" persistManifest-->>Caller: throw else still ok atomicWrite->>disk: rename(tmp to manifest.json) persistManifest->>disk: writeServerGeneration(.gen) persistManifest->>runtime: "manifest = next" persistManifest-->>Caller: ok end endComments Outside Diff (1)
apps/desktop/src/main/services/orchestration/orchestrationService.ts, line 486-522 (link)loadIntoRuntimedoes not clearsuspendedon successful reloadWhen a branch is restored after a foreign-runId swap,
loadIntoRuntimereads the correct manifest and setsruntime.manifest, but never resetsruntime.suspended = false. Every caller then checksif (runtime.suspended)or callsassertRunWritablebefore inspectingruntime.manifest, so all API operations (bundleRead,manifestPatch, etc.) return a false "suspended" error even though the correct bundle is already loaded.The watcher path (
handleExternalChange, lines 628–636) does reset the flag when it processes the same file-change event, but there is a race: if any API call enters the mutex before the debounced watcher task does, it will seesuspended = trueand return an error even though the correct manifest is sitting inruntime.manifest. Fix: addruntime.suspended = false;at the end of the successful-load path inloadIntoRuntime.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (5): Last reviewed commit: "Harden orchestration approval and manife..." | Re-trigger Greptile