feat(agent): apply --resume — resumable Claude SDK sessions across crashes #259

Merged
kelsonpw merged 2 commits into kelsonpw/wizard-mcp-plan-verify-list from kelsonpw/agent-apply-resume
Apr 26, 2026

Conversation

Collaborator

@kelsonpw kelsonpw commented Apr 25, 2026

Summary

Closes the design-doc gap on mid-apply work loss. When the inner Claude SDK agent dies partway through (SIGINT, network drop, crashed terminal), today the user re-runs apply and the agent has to redo all the cold-start work — SDK install, package detection, file reads, framework analysis. This PR captures the SDK's session id and persists it on the plan, so apply --resume rehydrates the conversation instead of starting fresh.

Stacked on #258. Gap #1 from the work-loss analysis — the only remaining case (event-plan approval mid-agent) is a future plan/apply boundary refactor, not a session-resumption thing.

Surface

WizardPlan now optionally carries:
  agentSessionId         UUID of the prior SDK session
  agentSessionUpdatedAt  When it was captured (ISO-8601)

apply --resume           Pulls agentSessionId from the persisted plan and
                         forwards to the spawned child via
                         AMPLITUDE_WIZARD_RESUME_SESSION_ID. When the
                         plan has no captured session yet, logs a
                         structured warning and falls through to a fresh
                         run (right default for the very first apply).
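
A minimal sketch of the `--resume` decision described above, assuming a plan object already loaded from disk. The `decideResume` name and the `ResumeDecision` shape are hypothetical and not taken from the PR; only the two plan fields, the env var names, and the warning text come from this description.

```typescript
// Only agentSessionId / agentSessionUpdatedAt are added by this PR;
// the rest of the plan shape is abbreviated here.
interface WizardPlan {
  planId: string;
  framework: string;
  agentSessionId?: string;
  agentSessionUpdatedAt?: string; // ISO-8601
}

interface ResumeWarning {
  event: "resume_unavailable";
  planId: string;
  message: string;
}

// Hypothetical result type: the env to hand to the spawned child, plus an
// optional structured warning when --resume cannot be honored.
type ResumeDecision =
  | { mode: "resume"; env: Record<string, string> }
  | { mode: "fresh"; env: Record<string, string>; warning?: ResumeWarning };

function decideResume(plan: WizardPlan, resumeRequested: boolean): ResumeDecision {
  const env: Record<string, string> = { AMPLITUDE_WIZARD_PLAN_ID: plan.planId };
  if (!resumeRequested) {
    return { mode: "fresh", env };
  }
  if (!plan.agentSessionId) {
    // First apply against a plan: nothing captured yet, so warn and run fresh.
    return {
      mode: "fresh",
      env,
      warning: {
        event: "resume_unavailable",
        planId: plan.planId,
        message: `--resume requested but plan ${plan.planId} has no captured agent session yet. Running fresh.`,
      },
    };
  }
  return {
    mode: "resume",
    env: { ...env, AMPLITUDE_WIZARD_RESUME_SESSION_ID: plan.agentSessionId },
  };
}
```

Returning the warning as data rather than logging inside the function keeps the fallback easy to test; the real command may well log directly.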

How it works

1. wizard plan                 → plan.json { planId, framework, …, agentSessionId: undefined }
2. wizard apply --plan-id X --yes
     ↓ spawns: wizard --agent --yes --install-dir … (env: AMPLITUDE_WIZARD_PLAN_ID=X)
     ↓ agent-runner reads env, calls runAgent({ onSessionStart })
     ↓ SDK emits system/init { session_id: "sdk-sess-abc" }
     ↓ onSessionStart → applyPlanPatch(X, { agentSessionId: "sdk-sess-abc" })
3. (user kills the run with ⌃C)
4. wizard apply --plan-id X --yes --resume
     ↓ apply reads plan, finds agentSessionId, sets AMPLITUDE_WIZARD_RESUME_SESSION_ID
     ↓ agent-runner forwards as runAgent({ resumeSessionId: "sdk-sess-abc" })
     ↓ SDK rehydrates conversation, emits a new session id (fork)
     ↓ onSessionStart → applyPlanPatch updates agentSessionId to the new fork

The fork-and-update on resume means the chain works across multiple interruptions — each --resume picks up from the most recent attempt.
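
To illustrate the chain, here is a small simulation with a fake agent standing in for the SDK. Everything in it is an assumption made for illustration (`createFakeAgent`, the `sdk-sess-N` ids, the in-memory `PlanState`); it only models the behavior described above, where each run reports a new session id and the plan is patched with it.

```typescript
interface RunAgentConfig {
  resumeSessionId?: string;
  onSessionStart?: (sessionId: string) => void;
}

interface PlanState {
  planId: string;
  agentSessionId?: string;
}

// Fake SDK (hypothetical): records which session it was asked to resume and
// always reports a brand-new session id, mimicking the fork on resume.
function createFakeAgent() {
  const resumedFrom: Array<string | null> = [];
  let runs = 0;
  const runAgent = (config: RunAgentConfig): void => {
    resumedFrom.push(config.resumeSessionId ?? null);
    runs += 1;
    config.onSessionStart?.(`sdk-sess-${runs}`);
  };
  return { runAgent, resumedFrom };
}

// One `apply` attempt: resume from the latest captured id when asked, and
// overwrite the plan's id with whatever the agent reports for this run.
function apply(
  plan: PlanState,
  resume: boolean,
  runAgent: (config: RunAgentConfig) => void,
): void {
  const config: RunAgentConfig = {
    onSessionStart: (sessionId) => {
      plan.agentSessionId = sessionId;
    },
  };
  if (resume && plan.agentSessionId) {
    config.resumeSessionId = plan.agentSessionId;
  }
  runAgent(config);
}
```

Running `apply` once fresh and then twice with `resume` set makes the second and third attempts resume from the ids captured by the first and second, which is the "most recent attempt" property.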

Smoke tests

$ wizard plan --json
{ planId: "X", agentSessionId: undefined, … }

$ wizard apply --plan-id X --resume --yes --json
{ type: "log", level: "warn",
  message: "--resume requested but plan X has no captured agent session yet. Running fresh.",
  data: { event: "resume_unavailable", planId: "X" } }
{ type: "lifecycle", message: "applying plan X", … }     # falls through cleanly

$ wizard apply --plan-id Y --yes  # Y has agentSessionId from earlier run
$ wizard apply --plan-id Y --resume --yes --json
{ type: "lifecycle",
  message: "applying plan Y (resuming session sdk-sess-abc)",
  data: { event: "apply_started", planId: "Y", resumeSessionId: "sdk-sess-abc", … } }

What changed

  • src/lib/agent-plans.ts: extends WizardPlanSchema with optional agentSessionId + agentSessionUpdatedAt. Adds applyPlanPatch(planId, partial) for atomic on-disk updates and getApplyContextFromEnv() for the env-var bridge.
  • src/lib/agent-interface.ts: surgical addition of 2 optional runAgent config fields (onSessionStart, resumeSessionId), 6 lines in the SDK message loop to capture session_id from system/init, and 1 line in the query options for resume:.
  • src/lib/agent-runner.ts: reads env via getApplyContextFromEnv() and passes resumeSessionId + onSessionStart into runAgent. The handler patches the plan best-effort (failures are logged and never break the run).
  • bin.ts: adds the --resume flag on the apply command. When set, it looks up agentSessionId from the persisted plan and forwards it via env. Surfaces a structured resume_unavailable warning when the plan has no captured session.
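
For reviewers who want a picture of the env-var bridge, a rough sketch follows. It is an assumption, not the code in src/lib/agent-plans.ts: the PR's `getApplyContextFromEnv()` takes no arguments, while this version accepts the environment as a parameter so it can run without Node's `process` global. The trimming of blank values is also a guess.

```typescript
// Assumed return shape, based on the "{ planId, resumeSessionId }" wording
// in the commit message.
interface ApplyContext {
  planId?: string;
  resumeSessionId?: string;
}

type Env = Record<string, string | undefined>;

// Treat missing or blank values as "not set" so fresh runs need no env at all.
function readVar(env: Env, name: string): string | undefined {
  const value = env[name]?.trim();
  return value ? value : undefined;
}

function getApplyContextFromEnv(env: Env): ApplyContext {
  const context: ApplyContext = {};
  const planId = readVar(env, "AMPLITUDE_WIZARD_PLAN_ID");
  const resumeSessionId = readVar(env, "AMPLITUDE_WIZARD_RESUME_SESSION_ID");
  if (planId) context.planId = planId;
  if (resumeSessionId) context.resumeSessionId = resumeSessionId;
  return context;
}
```

In agent-runner this would be called with `process.env`; both fields stay absent on a plain `wizard --agent` run.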

Test plan

  • pnpm test: 1319 passed, 17 skipped (8 new tests in agent-plans.test.ts)
  • pnpm tsc --noEmit: clean
  • pnpm lint: clean
  • Smoke: apply --resume without a captured session warns + runs fresh
  • Smoke: apply --resume payload includes resumeSessionId when plan has one
  • Manual: real apply against a small project, ⌃C halfway, --resume, watch SDK skip the cold-start

Out of scope

  • Detecting stale session ids server-side: the SDK either rehydrates the session or returns a 404. Today we assume rehydration succeeds; if it fails, the SDK falls back to a fresh session and our onSessionStart handler updates the plan with the new id. Explicitly surfacing "session expired" to the user is a future improvement.
  • Pruning agentSessionId on plan TTL expiry: the plan itself expires at 24h alongside the session TTL, so this is automatic.
  • Persisting agentSessionId outside apply: wizard --agent --yes (without a plan) doesn't persist a session. Adding that needs a separate "ad-hoc resume" path; not requested.

cc @amplitude/growth

🤖 Generated with Claude Code


Note

Medium Risk
Touches the apply execution path and Claude SDK invocation by persisting and replaying session IDs via env vars; failures should fall back to fresh runs but regressions could break non-interactive apply flows.

Overview
Adds an apply --resume flag that, when a plan has a captured Claude SDK agentSessionId, re-runs apply by resuming the prior SDK conversation to skip cold-start work; if no session is available it warns and runs fresh.

Persists Claude SDK session IDs onto WizardPlan (with agentSessionUpdatedAt) via a new best-effort applyPlanPatch, and bridges resume context across the apply spawn boundary using AMPLITUDE_WIZARD_PLAN_ID/AMPLITUDE_WIZARD_RESUME_SESSION_ID.

Extends runAgent to accept resumeSessionId and an onSessionStart hook, forwarding resume into the SDK query options and capturing session_id from the SDK system/init message; adds tests covering patching, env parsing, and schema back-compat.

Reviewed by Cursor Bugbot for commit 8c54f5e. Bugbot is set up for automated code reviews on this repo. Configure here.

…ashes

When `apply` runs the inner Claude SDK agent, capture the session id from
the first system/init message and persist it on the plan. A subsequent
`apply --plan-id <id> --yes --resume` passes that id to the SDK as
`resume:`, rehydrating the conversation instead of starting a fresh
agent that has to redo SDK install + package detection + file reads.

Surface area:

  WizardPlan now optionally carries:
    agentSessionId         — UUID of the prior SDK session
    agentSessionUpdatedAt  — when it was captured

  apply --resume           — opt-in flag; pulls agentSessionId from the
                             persisted plan and forwards to the spawned
                             child via AMPLITUDE_WIZARD_RESUME_SESSION_ID.
                             When the plan has no captured session yet,
                             logs a structured warning and falls through
                             to a fresh run (right default for the very
                             first apply against a plan).

  applyPlanPatch(planId, p) — partial-update helper for plans on disk;
                              best-effort, returns null on miss.

  getApplyContextFromEnv()  — agent-runner reads { planId, resumeSessionId }
                              from env vars set by `apply` so the spawn
                              boundary stays decoupled from the SDK call
                              site. Both vars optional — fresh runs work.
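
A hedged sketch of the partial-update contract, using an in-memory `Map` as a stand-in for the on-disk plan store. The `PlanStore` type, the extra `store` parameter, and the `recordSession` wrapper are hypothetical; the real helper is `applyPlanPatch(planId, p)` and writes to disk.

```typescript
interface WizardPlan {
  planId: string;
  framework: string;
  agentSessionId?: string;
  agentSessionUpdatedAt?: string;
}

// In-memory stand-in for the plan files on disk (hypothetical).
type PlanStore = Map<string, WizardPlan>;

// Merge a partial update into an existing plan; return null when the plan
// does not exist, mirroring the "returns null on miss" contract.
function applyPlanPatch(
  store: PlanStore,
  planId: string,
  patch: Partial<Omit<WizardPlan, "planId">>,
): WizardPlan | null {
  const existing = store.get(planId);
  if (!existing) return null;
  const updated: WizardPlan = { ...existing, ...patch };
  store.set(planId, updated);
  return updated;
}

// Best-effort wrapper in the spirit of the onSessionStart handler: a failed
// or missed patch is reported as `false` and never throws into the run.
function recordSession(
  store: PlanStore,
  planId: string,
  sessionId: string,
  now: Date,
): boolean {
  try {
    const result = applyPlanPatch(store, planId, {
      agentSessionId: sessionId,
      agentSessionUpdatedAt: now.toISOString(),
    });
    return result !== null;
  } catch {
    return false;
  }
}
```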

Wiring in agent-runner: pass `resumeSessionId` and `onSessionStart` into
`runAgent`. The latter fires once on system/init and patches the plan
with the SDK-assigned session id (handles both fresh runs AND forks from
a resumed session, which the SDK gives a new id).
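
The capture step can be pictured roughly as follows. This is a simplified sketch under stated assumptions: the `SdkMessage` union is a made-up subset (the PR only says that system/init carries `session_id`), the real loop presumably consumes an async stream rather than an array, and `consumeMessages` is a hypothetical name.

```typescript
// Assumed message shapes, reduced to what this sketch needs.
type SdkMessage =
  | { type: "system"; subtype: "init"; session_id: string }
  | { type: "assistant"; text: string };

// Walk the messages, fire onSessionStart once for the first system/init,
// and return how many messages were handled.
function consumeMessages(
  messages: readonly SdkMessage[],
  onSessionStart?: (sessionId: string) => void,
): number {
  let sessionReported = false;
  let handled = 0;
  for (const message of messages) {
    if (message.type === "system" && message.subtype === "init" && !sessionReported) {
      sessionReported = true;
      try {
        onSessionStart?.(message.session_id);
      } catch {
        // Best-effort: a failing handler must not break the agent run.
      }
    }
    handled += 1;
  }
  return handled;
}
```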

Wiring in agent-interface: two new optional `runAgent` config fields
(`onSessionStart`, `resumeSessionId`). Surgical change — adds 6 lines to
the SDK message loop and 1 line to the query options.

Tests: +8 in agent-plans.test.ts (1319 total). Suite green.

Smoke tests:
  $ wizard plan --json                    → planId X (no agentSessionId)
  $ wizard apply --plan-id X --yes        → first run; captures session
  $ wizard apply --plan-id X --resume --yes → second run; resumes
  $ wizard apply --plan-id Y --resume --yes → warns "no captured session", runs fresh

Stacked on #258. Closes the design-doc gap on mid-`apply` work loss
(SIGINT, network drop, terminal crash).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kelsonpw kelsonpw requested a review from a team April 25, 2026 21:29
@github-actions
Contributor

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

  • /wizard-ci all

Test all apps in a directory:

  • /wizard-ci django
  • /wizard-ci fastapi
  • /wizard-ci flask
  • /wizard-ci javascript-node
  • /wizard-ci javascript-web
  • /wizard-ci next-js
  • /wizard-ci python
  • /wizard-ci react-router
  • /wizard-ci vue

Test an individual app:

  • /wizard-ci django/django3-saas
  • /wizard-ci fastapi/fastapi3-ai-saas
  • /wizard-ci flask/flask3-social-media
  • /wizard-ci javascript-node/express-todo
  • /wizard-ci javascript-node/fastify-blog
  • /wizard-ci javascript-node/hono-links
  • /wizard-ci javascript-node/koa-notes
  • /wizard-ci javascript-node/native-http-contacts
  • /wizard-ci javascript-web/saas-dashboard
  • /wizard-ci next-js/15-app-router-saas
  • /wizard-ci next-js/15-app-router-todo
  • /wizard-ci next-js/15-pages-router-saas
  • /wizard-ci next-js/15-pages-router-todo
  • /wizard-ci python/meeting-summarizer
  • /wizard-ci react-router/react-router-v7-project
  • /wizard-ci react-router/rrv7-starter
  • /wizard-ci react-router/saas-template
  • /wizard-ci react-router/shopper
  • /wizard-ci vue/movies

Results will be posted here when complete.

Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Vendor name "Claude SDK" in CLI help text
    • Replaced "Claude SDK session" with "agent session" in the user-facing --resume yargs describe string.

Or push these changes by commenting:

@cursor push a5cfb236c3
Preview (a5cfb236c3)
diff --git a/bin.ts b/bin.ts
--- a/bin.ts
+++ b/bin.ts
@@ -2204,7 +2204,7 @@
         },
         resume: {
           describe:
-            'resume the previous Claude SDK session captured against this plan (skip cold-start work after a SIGINT or crash)',
+            'resume the previous agent session captured against this plan (skip cold-start work after a SIGINT or crash)',
           type: 'boolean',
           default: false,
         },

Comment thread bin.ts Outdated
@kelsonpw
Collaborator Author

@cursor push a5cfb23

@kelsonpw kelsonpw merged commit a90692a into kelsonpw/wizard-mcp-plan-verify-list Apr 26, 2026
6 checks passed
@kelsonpw kelsonpw deleted the kelsonpw/agent-apply-resume branch April 26, 2026 01:02
Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

// invoked against a plan with a captured agentSessionId. The SDK
// either rehydrates the conversation or, on a stale id, falls
// back to a fresh run — agent-runner clears the id in that case.
...(config?.resumeSessionId && { resume: config.resumeSessionId }),
Contributor

resume option missing from SDKQueryOptions type

Medium Severity

The new resume property is spread into the SDK query() options via ...(config?.resumeSessionId && { resume: config.resumeSessionId }), but the local SDKQueryOptions type (the single source of truth for what gets passed to the SDK) doesn't declare a resume field. The spread operator bypasses TypeScript's excess-property checking, so the property reaches the SDK at runtime with zero compile-time validation — a misspelling (e.g. Resume) or wrong value type would be silently accepted. Every other SDK option has a declared field in SDKQueryOptions; resume?: string belongs there too.
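
One possible shape for the fix, sketched under assumptions: `SDKQueryOptions` below is a made-up two-field subset (`model` is purely illustrative), and `buildQueryOptions` is a hypothetical helper. Declaring `resume?: string` gets the value type checked; assigning the property explicitly, rather than through a conditional spread, additionally lets the compiler reject a misspelled key.

```typescript
// Hypothetical subset of the local options type; only `resume` is taken
// from the review comment above.
interface SDKQueryOptions {
  model?: string;
  resume?: string;
}

interface RunAgentConfig {
  resumeSessionId?: string;
}

function buildQueryOptions(config?: RunAgentConfig): SDKQueryOptions {
  const options: SDKQueryOptions = { model: "example-model" };
  if (config?.resumeSessionId) {
    // Explicit assignment is checked against SDKQueryOptions:
    // `options.Resume = ...` or a non-string value would not compile.
    options.resume = config.resumeSessionId;
  }
  return options;
}
```

The conditional spread in the PR keeps working once the field is declared; the explicit assignment is just a stricter alternative.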

Additional Locations (1)

2 participants