Harden orchestration approval and manifest state #382

Merged
arul28 merged 1 commit into main from cursor/critical-correctness-bugs-5028
May 29, 2026

Contributor

@cursor cursor Bot commented May 28, 2026

Summary

  • require explicit plan approval decisions and block lead manifest patches that can self-approve the planning phase
  • suspend runs instead of writing through foreign/stale bundles, and reject on-disk manifest conflicts before atomic commit
  • relocate orchestration bundle paths for managed and persisted sessions after lane placement changes, including active watcher restart
  • restore the previous project binding when openRepo is cancelled or fails
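
The last item above (restoring the project binding) can be illustrated with a minimal sketch. Everything here is hypothetical: the `ProjectBinding` shape, the module-level `currentBinding`, and the `openRepoWithRestore` wrapper are stand-ins for illustration, and the assumption that the binding is cleared while the open is in flight is inferred from the PR description, not taken from the preload code.

```typescript
// Hypothetical binding shape; the real preload state is not shown in this PR page.
type ProjectBinding = { rootPath: string } | null;

let currentBinding: ProjectBinding = null;

// Wraps an openRepo-style call and puts the old binding back when the
// open is cancelled (falsy result) or fails (throws).
async function openRepoWithRestore(
  openRepo: () => Promise<ProjectBinding>,
): Promise<ProjectBinding> {
  const previous = currentBinding;
  currentBinding = null; // assumption: the binding is cleared during the open
  try {
    const opened = await openRepo();
    if (!opened) {
      currentBinding = previous; // cancelled: restore
      return null;
    }
    currentBinding = opened;
    return opened;
  } catch (err) {
    currentBinding = previous; // failed: restore, then surface the error
    throw err;
  }
}
```

The point of the sketch is only that both exit paths (falsy return and exception) restore the saved value, so the binding is never left as null after an aborted open.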

Validation

  • npx vitest run src/main/services/orchestration/patchPolicy.test.ts
  • npx vitest run src/main/services/ai/tools/orchestrationTools.test.ts
  • npx vitest run src/main/services/orchestration/orchestrationService.test.ts
  • npx vitest run src/preload/preload.test.ts
  • npx vitest run src/main/services/chat/agentChatService.test.ts -t "orchestration|lane launch directives|openRepo"
  • npm --prefix apps/desktop run typecheck
  • npm --prefix apps/desktop run lint (warnings only, existing repo-wide warnings)
  • npm --prefix apps/desktop run build

Greptile Summary

This PR hardens the orchestration service against self-approval exploits, concurrent disk conflicts, and stale bundle paths after lane relocation. It also restores the project binding on a cancelled openRepo.

  • Conflict detection: persistManifest now double-checks the on-disk manifest before and inside atomicWrite via rejectIfDiskAdvanced; foreign-runId or advanced-generation states suspend the run rather than overwriting, and the temp file is cleaned up on any pre-commit failure.
  • Self-approval hardening: /leadState, /phases/{id:planning}/status, /phases/{id:planning}/completedAt, and /leadState/planApprovalSummary are added to LEAD_DENY_PATTERNS; the regex-based free-text approval check is replaced with an explicit decision field; and isOrchestrationPlanApproved no longer accepts a planning-phase status shortcut.
  • Bundle relocation: relocateRunBundle resets the runtime and restarts the watcher when a lane's worktree moves; handleLanePlacementChanged is made async and propagates relocations for both managed and cold (persisted) sessions with an unbounded limit: null query instead of the previous 500-row cap.
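
To make the self-approval bullet concrete, here is a small sketch contrasting a free-text check with an explicit decision field. Only the values `"accept"` and `"accept_for_session"` come from this PR; the `"reject"` value, the `PlanApprovalReply` shape, the regex, and both function names are assumptions made for illustration.

```typescript
// "reject" is an assumed third value; the PR only names the two accepting values.
type PlanDecision = "accept" | "accept_for_session" | "reject";

interface PlanApprovalReply {
  decision?: PlanDecision; // explicit, machine-readable decision
  comment?: string;        // free text, informational only
}

// Illustrative free-text check: it matches negated phrases such as
// "I do not approve this plan", which is the false-positive class being fixed.
function approvedByRegex(reply: PlanApprovalReply): boolean {
  return /\b(approve|accept|lgtm)/i.test(reply.comment ?? "");
}

// Approval depends only on the explicit decision field.
function approvedByDecision(reply: PlanApprovalReply): boolean {
  return reply.decision === "accept" || reply.decision === "accept_for_session";
}
```

With a reply of `{ decision: "reject", comment: "I do not approve this plan" }`, the regex variant reports approval while the decision-based variant does not.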

Confidence Score: 3/5

Safe to merge after the missing conflict-error handler in recordValidation is addressed; the rest of the changes are well-structured.

The new OrchestrationPersistConflictError type introduced in this PR is handled in externalManifestPatch and agentHeartbeat, but the recordValidation function's catch block only covers OrchestrationRunSuspendedError and rethrows everything else. A concurrent disk write during a validation record operation will propagate as an unhandled exception instead of the structured error response the function contract implies. Multiple agents recording validation results simultaneously — a normal multi-agent scenario — can trigger this path.

apps/desktop/src/main/services/orchestration/orchestrationService.ts — specifically the recordValidation try-catch around directPatch (lines ~1392–1408)

Important Files Changed

Filename Overview
apps/desktop/src/main/services/orchestration/orchestrationService.ts Core change: adds disk-conflict detection (rejectIfDiskAdvanced), suspension on foreign-runId, relocateRunBundle, and assertRunWritable guards. Missing OrchestrationPersistConflictError catch in recordValidation means concurrent writes surface as unhandled exceptions there.
apps/desktop/src/main/services/orchestration/patchPolicy.ts Adds /leadState, /leadState/planApprovalSummary, /phases/{id:planning}/status, /phases/{id:planning}/completedAt to LEAD_DENY_PATTERNS to close self-approval bypass. Pattern matching is exact (not prefix), so sub-path mutability is preserved correctly.
apps/desktop/src/main/services/orchestration/runtimeProfile.ts Removes the planning-phase-status shortcut from isOrchestrationPlanApproved; approval is now exclusively gated on planApprovedAt being set via the controlled approvePlan path.
apps/desktop/src/main/services/chat/agentChatService.ts Adds bundle-path relocation for managed and cold sessions when a lane's worktree moves. Cold-session path now uses limit: null (no cap) and synchronous file I/O to rewrite the metadata JSON directly. Previously flagged limit: 500 cap is resolved.
apps/desktop/src/main/services/sessions/sessionService.ts Accepts limit: null to remove the SQLite LIMIT clause; default (undefined) still falls through to 200. Logic is correct.
apps/desktop/src/main/services/ai/tools/orchestrationTools.ts Removes free-text regex approval check; plan approval now requires an explicit decision: "accept" or "accept_for_session" field to prevent false-positives on negation phrases.
apps/desktop/src/preload/preload.ts Restores the previous project binding when openRepo returns falsy or throws, preventing a null project-binding state after a cancelled or failed open.
apps/desktop/src/main/services/lanes/laneService.ts Makes onPlacementChanged callback async-aware so that the relocation await chain completes before emitPlacementChanged returns, avoiding fire-and-forget relocation races.
apps/desktop/src/main/main.ts Awaits handleLanePlacementChanged in the onPlacementChanged callback to correctly propagate the now-async relocation chain.
apps/desktop/src/shared/types/sessions.ts Widens limit in ListSessionsArgs from `number

Sequence Diagram

sequenceDiagram
    participant Caller
    participant persistManifest
    participant atomicWrite
    participant disk as Disk (manifest.json)
    participant runtime as RunRuntime

    Caller->>persistManifest: write next manifest
    persistManifest->>disk: rejectIfDiskAdvanced() [pre-check]
    alt runId mismatch on disk
        disk-->>persistManifest: foreign runId
        persistManifest->>runtime: "suspended=true, manifest=null"
        persistManifest-->>Caller: throw OrchestrationRunSuspendedError
    else disk generation advanced
        disk-->>persistManifest: newer serverGeneration
        persistManifest->>runtime: "manifest = onDisk"
        persistManifest-->>Caller: throw OrchestrationPersistConflictError
    else no conflict
        disk-->>persistManifest: ok / ENOENT
        persistManifest->>runtime: markSelfWrite()
        persistManifest->>atomicWrite: "write tmp, beforeCommit=rejectIfDiskAdvanced"
        atomicWrite->>disk: rejectIfDiskAdvanced() [post-write check]
        alt disk advanced between pre-check and write
            disk-->>atomicWrite: conflict
            atomicWrite->>disk: unlink(tmp)
            atomicWrite-->>persistManifest: throw
            persistManifest->>runtime: "recentSelfWriteUntil=0"
            persistManifest-->>Caller: throw
        else still ok
            atomicWrite->>disk: rename(tmp to manifest.json)
            persistManifest->>disk: writeServerGeneration(.gen)
            persistManifest->>runtime: "manifest = next"
            persistManifest-->>Caller: ok
        end
    end
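
The flow in the diagram can be approximated with the following sketch. It is not the service's code: the error class names are shortened, the `Manifest` shape is reduced to the two fields the diagram mentions, runtime bookkeeping (`markSelfWrite`, the `.gen` file) is omitted, and the temp-file naming is an assumption.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface Manifest { runId: string; serverGeneration: number }

class PersistConflictError extends Error {}
class RunSuspendedError extends Error {}

// Returns the manifest currently on disk, or null when the file does not exist.
function readOnDisk(file: string): Manifest | null {
  try {
    return JSON.parse(fs.readFileSync(file, "utf8")) as Manifest;
  } catch (err) {
    if ((err as { code?: string }).code === "ENOENT") return null;
    throw err;
  }
}

// Throws when the disk holds another run's manifest or a newer generation.
function rejectIfDiskAdvanced(file: string, current: Manifest): void {
  const onDisk = readOnDisk(file);
  if (onDisk === null) return;
  if (onDisk.runId !== current.runId) throw new RunSuspendedError("foreign runId on disk");
  if (onDisk.serverGeneration > current.serverGeneration) {
    throw new PersistConflictError("disk generation advanced");
  }
}

function persistManifest(file: string, current: Manifest, next: Manifest): void {
  rejectIfDiskAdvanced(file, current); // pre-check
  const tmp = `${file}.${process.pid}.tmp`; // assumed temp-file naming
  fs.writeFileSync(tmp, JSON.stringify(next));
  try {
    rejectIfDiskAdvanced(file, current); // re-check just before commit
    fs.renameSync(tmp, file);            // commit
  } catch (err) {
    fs.rmSync(tmp, { force: true });     // never leave the temp file behind
    throw err;
  }
}
```

The second check narrows the window between reading and renaming but does not close it entirely; a writer that lands between the re-check and the rename would still be overwritten.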

Comments Outside Diff (1)

  1. apps/desktop/src/main/services/orchestration/orchestrationService.ts, line 486-522 (link)

    P1 loadIntoRuntime does not clear suspended on successful reload

    When a branch is restored after a foreign-runId swap, loadIntoRuntime reads the correct manifest and sets runtime.manifest, but never resets runtime.suspended = false. Every caller then checks if (runtime.suspended) or calls assertRunWritable before inspecting runtime.manifest, so all API operations (bundleRead, manifestPatch, etc.) return a false "suspended" error even though the correct bundle is already loaded.

    The watcher path (handleExternalChange, lines 628–636) does reset the flag when it processes the same file-change event, but there is a race: if any API call enters the mutex before the debounced watcher task does, it will see suspended = true and return an error even though the correct manifest is sitting in runtime.manifest. Fix: add runtime.suspended = false; at the end of the successful-load path in loadIntoRuntime.
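
A reduced sketch of the proposed fix follows. The `RunRuntime` and `Manifest` shapes, the `readManifest` callback, the boolean return value, and the message text are all simplifications invented for illustration; only the names `loadIntoRuntime`, `assertRunWritable`, and `runtime.suspended` come from the comment.

```typescript
interface Manifest { runId: string }
interface RunRuntime { manifest: Manifest | null; suspended: boolean }

const RUN_SUSPENDED_MESSAGE = "run is suspended"; // placeholder text, not the real message

function assertRunWritable(runtime: RunRuntime): void {
  if (runtime.suspended) throw new Error(RUN_SUSPENDED_MESSAGE);
}

// Loads a manifest into the runtime; on success it also clears the
// suspended flag so callers do not depend on the watcher running first.
function loadIntoRuntime(runtime: RunRuntime, readManifest: () => Manifest | null): boolean {
  const loaded = readManifest();
  if (loaded === null) return false; // failed load: leave the flag untouched
  runtime.manifest = loaded;
  runtime.suspended = false;         // the one-line fix proposed in the comment
  return true;
}
```

With the flag cleared in the load path itself, an API call that wins the mutex before the debounced watcher task sees a consistent pair: the correct manifest and `suspended === false`.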


Reviews (5): Last reviewed commit: "Harden orchestration approval and manife..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.


vercel Bot commented May 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
ade Ignored Ignored Preview May 29, 2026 12:30am

@arul28 arul28 changed the title Fix orchestration plan approval false positives from rejection text Harden orchestration approval and manifest state May 28, 2026
@arul28 arul28 force-pushed the cursor/critical-correctness-bugs-5028 branch from 9e250d3 to 888598e Compare May 28, 2026 23:00
@arul28 arul28 marked this pull request as ready for review May 28, 2026 23:00
@arul28 arul28 self-requested a review as a code owner May 28, 2026 23:00

capy-ai Bot commented May 28, 2026

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

Contributor Author

@cursor cursor Bot left a comment


PR Review

Scope: 11 file(s), +742 / −32
Verdict: Minor issues

This PR tightens orchestration correctness: explicit plan approval (no regex or planning-phase shortcuts), lead self-approval blocked in patch policy, manifest suspend/conflict handling on external bundle changes, bundle-path repointing when lane placement/worktree moves, and preload project-binding restore on cancelled openRepo. The changes are well-tested; one edge case in cold-session repointing is worth fixing.


🐛 Functionality

[Medium] Cold orchestration sessions can keep a stale bundle path after lane placement changes

File: apps/desktop/src/main/services/chat/agentChatService.ts:23807-23836
Issue: When repointing persisted (non-managed) orchestration sessions, the handler scans sessionService.list({ limit: 500 }) ordered by started_at desc. Any chat session with an orchestrationRunId that is not in the newest 500 rows is skipped, so its orchestrationBundlePath stays on the pre-move path after VM detach or worktree relocation.
Repro: Create 501+ chat sessions in a project; start an orchestration lead run, dispose the session (cold), change the lane worktree or detach from Mac VM, then reopen the cold session and invoke orchestration tools — metadata still points at the old bundle directory.
Fix: Query only sessions that carry orchestration metadata (e.g. filter persisted rows / metadata files with orchestrationRunId), or pass laneId into sessionService.list and drop the fixed 500 cap for this path.
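
The Greptile summary earlier on this page describes the fix that eventually landed as `limit: null` removing the SQLite LIMIT clause while an undefined limit still defaults to 200. A minimal sketch of that three-way behaviour, assuming a hypothetical `buildListQuery` helper and a `sessions` table name (only the `started_at` ordering and the default of 200 are taken from this page):

```typescript
interface ListQuery { sql: string; params: number[] }

// limit === null      -> no LIMIT clause (scan every row)
// limit === undefined -> default cap of 200
// limit is a number   -> that cap
function buildListQuery(limit?: number | null): ListQuery {
  const base = "SELECT * FROM sessions ORDER BY started_at DESC"; // table name is assumed
  if (limit === null) return { sql: base, params: [] };
  return { sql: `${base} LIMIT ?`, params: [limit ?? 200] };
}
```

Distinguishing `null` from `undefined` keeps existing callers on the default cap while letting the relocation path opt out of it explicitly.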


Notes

  • Good hardening overall: requestPlanApproval requiring decision === "accept" | "accept_for_session", isOrchestrationPlanApproved no longer treating planning phase done as approval, persistManifest TOCTOU guard + suspend on runId mismatch, and watcher callbacks serialized under the run mutex.
  • Bundle relocation assumes orchestration files already exist at the destination worktree (mirror sync / manual copy); that matches VM mirror behavior (.ade/orchestration is not rsync-excluded).
  • VM detach still depends on mirror→lane flush before the share directory is removed; stopMirrorSyncForLane is optional on the Mac VM service — worth verifying separately if detach-related manifest loss is reported.

Sent by Cursor Automation: BUGBOT in Versic

Comment thread apps/desktop/src/main/services/orchestration/orchestrationService.ts Outdated
Comment thread apps/desktop/src/main/services/orchestration/orchestrationService.ts Outdated
Comment thread apps/desktop/src/main/services/chat/agentChatService.ts Outdated
@arul28 arul28 force-pushed the cursor/critical-correctness-bugs-5028 branch from 888598e to 8d0384b Compare May 28, 2026 23:28
Comment thread apps/desktop/src/main/services/orchestration/orchestrationService.ts Outdated
@arul28 arul28 force-pushed the cursor/critical-correctness-bugs-5028 branch from 8d0384b to a1763c5 Compare May 28, 2026 23:52
@arul28 arul28 force-pushed the cursor/critical-correctness-bugs-5028 branch 2 times, most recently from 9c69b32 to ce54f9e Compare May 29, 2026 00:13
Fold the validated orchestration fixes into one lane: require explicit plan approval, block lead self-approval patches, suspend on foreign bundle swaps, detect on-disk manifest conflicts, relocate run bundles after lane placement changes, and restore preload project bindings after cancelled openRepo.
@arul28 arul28 force-pushed the cursor/critical-correctness-bugs-5028 branch from ce54f9e to e919c82 Compare May 29, 2026 00:30
Comment on lines +1399 to +1408
    } catch (err) {
      if (err instanceof OrchestrationRunSuspendedError) {
        return {
          ok: false,
          error: "validation_failed",
          message: RUN_SUSPENDED_MESSAGE,
        };
      }
      throw err;
    }

P1 Missing OrchestrationPersistConflictError handler in recordValidation

directPatch → persistManifest now throws OrchestrationPersistConflictError (new in this PR) when the second rejectIfDiskAdvanced check inside atomicWrite.beforeCommit detects a concurrent write. The catch block here only handles OrchestrationRunSuspendedError and rethrows everything else. Any concurrent manifest mutation during a recordValidation call will therefore surface as an uncaught exception to the IPC caller instead of returning the structured { ok: false, error: "etag_conflict" } response that the function's return type implies. The sibling call-sites externalManifestPatch and agentHeartbeat both handle OrchestrationPersistConflictError explicitly; this one was missed.
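
One way the missing branch might look, as a sketch only. The two error class names and the `"validation_failed"` and `"etag_conflict"` codes are taken from this thread; the result type, the `directPatch` callback parameter, the success shape, and the message text are assumptions, and the real function does more than this.

```typescript
class OrchestrationRunSuspendedError extends Error {}
class OrchestrationPersistConflictError extends Error {}

const RUN_SUSPENDED_MESSAGE = "run is suspended"; // placeholder text, not the real message

type RecordValidationResult =
  | { ok: true }
  | { ok: false; error: "validation_failed" | "etag_conflict"; message: string };

// directPatch stands in for the manifest write that may throw either error.
function recordValidation(directPatch: () => void): RecordValidationResult {
  try {
    directPatch();
    return { ok: true };
  } catch (err) {
    if (err instanceof OrchestrationRunSuspendedError) {
      return { ok: false, error: "validation_failed", message: RUN_SUSPENDED_MESSAGE };
    }
    if (err instanceof OrchestrationPersistConflictError) {
      // New branch: a concurrent write becomes a structured, retryable result.
      return { ok: false, error: "etag_conflict", message: err.message };
    }
    throw err; // anything else is still unexpected
  }
}
```

Returning a structured conflict lets the caller reload and retry, which matches how the comment describes the sibling call-sites.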


@arul28 arul28 merged commit 4a44e4a into main May 29, 2026
27 checks passed
@arul28 arul28 deleted the cursor/critical-correctness-bugs-5028 branch May 29, 2026 01:25