feat(agent): apply --resume — resumable Claude SDK sessions across crashes by kelsonpw · Pull Request #259 · amplitude/wizard

kelsonpw · 2026-04-25T21:29:42Z

Summary

Closes the design-doc gap on mid-apply work loss. When the inner Claude SDK agent dies partway through (SIGINT, network drop, crashed terminal), today the user re-runs apply and the agent has to redo all the cold-start work — SDK install, package detection, file reads, framework analysis. This PR captures the SDK's session id and persists it on the plan, so apply --resume rehydrates the conversation instead of starting fresh.

Stacked on #258. This is gap #1 from the work-loss analysis. The only remaining case (event-plan approval mid-agent) calls for a future plan/apply boundary refactor rather than session resumption.

Surface

WizardPlan now optionally carries:
  agentSessionId         UUID of the prior SDK session
  agentSessionUpdatedAt  When it was captured (ISO-8601)

apply --resume           Pulls agentSessionId from the persisted plan and
                         forwards to the spawned child via
                         AMPLITUDE_WIZARD_RESUME_SESSION_ID. When the
                         plan has no captured session yet, logs a
                         structured warning and falls through to a fresh
                         run (right default for the very first apply).
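The resume decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in bin.ts: `buildApplyEnv` and the trimmed `WizardPlan` shape are hypothetical, and only the field names, environment variable names, and warning payload are taken from this description.

```typescript
// Trimmed plan shape. Only the two optional session fields come from this PR;
// every other field of the real WizardPlan is omitted.
interface WizardPlan {
  planId: string;
  agentSessionId?: string;
  agentSessionUpdatedAt?: string; // ISO-8601
}

interface ResumeWarning {
  type: "log";
  level: "warn";
  message: string;
  data: { event: "resume_unavailable"; planId: string };
}

// Hypothetical helper: decides which env vars the spawned child receives and
// whether a structured warning should be logged before the run starts.
function buildApplyEnv(
  plan: WizardPlan,
  resume: boolean,
): { env: Record<string, string>; warning?: ResumeWarning } {
  const env: Record<string, string> = { AMPLITUDE_WIZARD_PLAN_ID: plan.planId };
  if (!resume) return { env };

  if (plan.agentSessionId) {
    // A session was captured on an earlier run: forward it to the child.
    env["AMPLITUDE_WIZARD_RESUME_SESSION_ID"] = plan.agentSessionId;
    return { env };
  }

  // No captured session yet: warn and fall through to a fresh run.
  return {
    env,
    warning: {
      type: "log",
      level: "warn",
      message: `--resume requested but plan ${plan.planId} has no captured agent session yet. Running fresh.`,
      data: { event: "resume_unavailable", planId: plan.planId },
    },
  };
}

const fresh = buildApplyEnv({ planId: "X" }, true);
const resumed = buildApplyEnv({ planId: "Y", agentSessionId: "sdk-sess-abc" }, true);
```

Returning the warning as data instead of logging inside the helper keeps the decision easy to test without spawning a child process.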

How it works

1. wizard plan                 → plan.json { planId, framework, …, agentSessionId: undefined }
2. wizard apply --plan-id X --yes
     ↓ spawns: wizard --agent --yes --install-dir … (env: AMPLITUDE_WIZARD_PLAN_ID=X)
     ↓ agent-runner reads env, calls runAgent({ onSessionStart })
     ↓ SDK emits system/init { session_id: "sdk-sess-abc" }
     ↓ onSessionStart → applyPlanPatch(X, { agentSessionId: "sdk-sess-abc" })
3. (user kills the run with ⌃C)
4. wizard apply --plan-id X --yes --resume
     ↓ apply reads plan, finds agentSessionId, sets AMPLITUDE_WIZARD_RESUME_SESSION_ID
     ↓ agent-runner forwards as runAgent({ resumeSessionId: "sdk-sess-abc" })
     ↓ SDK rehydrates conversation, emits a new session id (fork)
     ↓ onSessionStart → applyPlanPatch updates agentSessionId to the new fork

The fork-and-update on resume means the chain works across multiple interruptions — each --resume picks up from the most recent attempt.
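To make the fork-and-update chain concrete, here is a small simulation. It assumes the message shape implied by the flow above (a system/init message carrying `session_id`); `fakeQuery` is a stand-in for the real SDK call, and the forked id format is invented for the demo.

```typescript
// Message shape assumed from the flow above. All non-init messages are
// collapsed into a single "assistant" variant for brevity.
type SdkMessage =
  | { type: "system"; subtype: "init"; session_id: string }
  | { type: "assistant"; text: string };

interface RunAgentConfig {
  resumeSessionId?: string;
  onSessionStart?: (sessionId: string) => void | Promise<void>;
}

// Stand-in for the SDK: a resumed run is given a new (forked) session id.
let runCounter = 0;
async function* fakeQuery(options: { resume?: string }): AsyncGenerator<SdkMessage> {
  runCounter += 1;
  const id = options.resume ? `${options.resume}-fork${runCounter}` : `sdk-sess-${runCounter}`;
  yield { type: "system", subtype: "init", session_id: id };
  yield { type: "assistant", text: options.resume ? "continuing" : "starting cold" };
}

async function runAgent(config: RunAgentConfig = {}): Promise<void> {
  const options = config.resumeSessionId ? { resume: config.resumeSessionId } : {};
  let notified = false;
  for await (const message of fakeQuery(options)) {
    if (message.type === "system" && message.subtype === "init" && !notified) {
      notified = true;
      try {
        await config.onSessionStart?.(message.session_id);
      } catch (err) {
        // Best-effort: a failed capture is logged and never breaks the run.
        console.warn("onSessionStart failed", err);
      }
    }
  }
}

// In-memory plan standing in for plan.json.
const plan: { planId: string; agentSessionId?: string } = { planId: "X" };
const seen: string[] = [];
const capture = (id: string): void => {
  plan.agentSessionId = id;
  seen.push(id);
};

async function demo(): Promise<void> {
  await runAgent({ onSessionStart: capture }); // first apply
  await runAgent({ resumeSessionId: plan.agentSessionId, onSessionStart: capture }); // --resume
  await runAgent({ resumeSessionId: plan.agentSessionId, onSessionStart: capture }); // --resume again
}
```

Each call to `runAgent` overwrites `plan.agentSessionId`, so the next resume always starts from the most recent attempt rather than the original session.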

Smoke tests

$ wizard plan --json
{ planId: "X", agentSessionId: undefined, … }

$ wizard apply --plan-id X --resume --yes --json
{ type: "log", level: "warn",
  message: "--resume requested but plan X has no captured agent session yet. Running fresh.",
  data: { event: "resume_unavailable", planId: "X" } }
{ type: "lifecycle", message: "applying plan X", … }     # falls through cleanly

$ wizard apply --plan-id Y --yes  # Y has agentSessionId from earlier run
$ wizard apply --plan-id Y --resume --yes --json
{ type: "lifecycle",
  message: "applying plan Y (resuming session sdk-sess-abc)",
  data: { event: "apply_started", planId: "Y", resumeSessionId: "sdk-sess-abc", … } }

What changed

src/lib/agent-plans.ts — extends WizardPlanSchema with optional agentSessionId + agentSessionUpdatedAt. Adds applyPlanPatch(planId, partial) for atomic on-disk updates and getApplyContextFromEnv() for the env-var bridge.
src/lib/agent-interface.ts — surgical addition: 2 optional runAgent config fields (onSessionStart, resumeSessionId). 6 lines in the SDK message loop to capture session_id from system/init; 1 line in the query options for resume:.
src/lib/agent-runner.ts — reads env via getApplyContextFromEnv(), passes resumeSessionId + onSessionStart into runAgent. The handler patches the plan best-effort (failures logged, never break the run).
bin.ts — --resume flag on the apply command. When set, looks up agentSessionId from the persisted plan and forwards via env. Surfaces a structured resume_unavailable warning when the plan has no captured session.
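As an illustration of the two helpers named above, the sketch below shows one way an atomic on-disk patch and the env-var bridge could look. It is written under assumptions: the one-JSON-file-per-plan layout, the extra `dir` parameter, and the temp-file-plus-rename strategy are not described in this PR, and the real `applyPlanPatch(planId, partial)` takes no directory argument.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface WizardPlan {
  planId: string;
  agentSessionId?: string;
  agentSessionUpdatedAt?: string;
}

// Assumption: one JSON file per plan inside a caller-supplied directory.
function planPath(dir: string, planId: string): string {
  return path.join(dir, `${planId}.json`);
}

// Read, merge, write to a temp file, then rename over the original so a crash
// mid-write cannot leave a truncated plan. Returns null when the plan is
// missing or unreadable.
function applyPlanPatch(dir: string, planId: string, patch: Partial<WizardPlan>): WizardPlan | null {
  const target = planPath(dir, planId);
  let current: WizardPlan;
  try {
    current = JSON.parse(fs.readFileSync(target, "utf8")) as WizardPlan;
  } catch {
    return null;
  }
  const next: WizardPlan = { ...current, ...patch, planId: current.planId };
  const tmp = `${target}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(next, null, 2));
  fs.renameSync(tmp, target); // atomic on POSIX when both paths share a filesystem
  return next;
}

// Env-var bridge: both variables are optional, so a fresh run yields {}.
function getApplyContextFromEnv(
  env: Record<string, string | undefined> = process.env,
): { planId?: string; resumeSessionId?: string } {
  const ctx: { planId?: string; resumeSessionId?: string } = {};
  const planId = env["AMPLITUDE_WIZARD_PLAN_ID"];
  const resumeSessionId = env["AMPLITUDE_WIZARD_RESUME_SESSION_ID"];
  if (planId) ctx.planId = planId;
  if (resumeSessionId) ctx.resumeSessionId = resumeSessionId;
  return ctx;
}

// Usage against a throwaway directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "wizard-plans-"));
fs.writeFileSync(planPath(dir, "X"), JSON.stringify({ planId: "X" }));
const patched = applyPlanPatch(dir, "X", {
  agentSessionId: "sdk-sess-abc",
  agentSessionUpdatedAt: new Date().toISOString(),
});
const missing = applyPlanPatch(dir, "does-not-exist", { agentSessionId: "ignored" });
const ctx = getApplyContextFromEnv({ AMPLITUDE_WIZARD_PLAN_ID: "X" });
```

Pinning `planId` after the merge means a stray `planId` in the patch cannot retarget the file.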

Test plan

pnpm test — 1319 passed, 17 skipped (8 new tests in agent-plans.test.ts)
pnpm tsc --noEmit clean
pnpm lint clean
Smoke: apply --resume without a captured session warns + runs fresh
Smoke: apply --resume payload includes resumeSessionId when plan has one
Manual: real apply against a small project, ⌃C halfway, --resume, watch SDK skip the cold-start

Out of scope

Detecting stale session ids server-side: the SDK either rehydrates or 404s. Today we assume rehydration succeeds; if it fails, the SDK falls back to a fresh run and our onSessionStart handler updates the plan with the new id. Surfacing an explicit "session expired" error is a future improvement.
Pruning agentSessionId on plan TTL expiry: the plan itself expires at 24h alongside the session TTL, so this is automatic.
Persisting agentSessionId outside apply: wizard --agent --yes (without a plan) doesn't persist a session. Adding that needs a separate "ad-hoc resume" path, which was not requested.

cc @amplitude/growth

🤖 Generated with Claude Code

Note

Medium Risk
Touches the apply execution path and Claude SDK invocation by persisting and replaying session IDs via env vars; failures should fall back to fresh runs but regressions could break non-interactive apply flows.

Overview
Adds an apply --resume flag that, when a plan has a captured Claude SDK agentSessionId, re-runs apply by resuming the prior SDK conversation to skip cold-start work; if no session is available it warns and runs fresh.

Persists Claude SDK session IDs onto WizardPlan (with agentSessionUpdatedAt) via a new best-effort applyPlanPatch, and bridges resume context across the apply spawn boundary using AMPLITUDE_WIZARD_PLAN_ID/AMPLITUDE_WIZARD_RESUME_SESSION_ID.

Extends runAgent to accept resumeSessionId and an onSessionStart hook, forwarding resume into the SDK query options and capturing session_id from the SDK system/init message; adds tests covering patching, env parsing, and schema back-compat.

Reviewed by Cursor Bugbot for commit 8c54f5e. Bugbot is set up for automated code reviews on this repo.

…ashes

When `apply` runs the inner Claude SDK agent, capture the session id from the first system/init message and persist it on the plan. A subsequent `apply --plan-id <id> --yes --resume` passes that id to the SDK as `resume:`, rehydrating the conversation instead of starting a fresh agent that has to redo SDK install + package detection + file reads.

Surface area:

WizardPlan now optionally carries:
  agentSessionId         UUID of the prior SDK session
  agentSessionUpdatedAt  when it was captured

apply --resume: opt-in flag; pulls agentSessionId from the persisted plan and forwards to the spawned child via AMPLITUDE_WIZARD_RESUME_SESSION_ID. When the plan has no captured session yet, logs a structured warning and falls through to a fresh run (right default for the very first apply against a plan).

applyPlanPatch(planId, p): partial-update helper for plans on disk; best-effort, returns null on miss.

getApplyContextFromEnv(): agent-runner reads { planId, resumeSessionId } from env vars set by `apply` so the spawn boundary stays decoupled from the SDK call site. Both vars optional; fresh runs work.

Wiring in agent-runner: pass `resumeSessionId` and `onSessionStart` into `runAgent`. The latter fires once on system/init and patches the plan with the SDK-assigned session id (handles both fresh runs AND forks from a resumed session, which the SDK gives a new id).

Wiring in agent-interface: two new optional `runAgent` config fields (`onSessionStart`, `resumeSessionId`). Surgical change: adds 6 lines to the SDK message loop and 1 line to the query options.

Tests: +8 in agent-plans.test.ts (1319 total). Suite green.

Smoke tests:
  $ wizard plan --json                        → planId X (no agentSessionId)
  $ wizard apply --plan-id X --yes            → first run; captures session
  $ wizard apply --plan-id X --resume --yes   → second run; resumes
  $ wizard apply --plan-id Y --resume --yes   → warns "no captured session", runs fresh

Stacked on #258.
Closes the design-doc gap on mid-`apply` work loss (SIGINT, network drop, terminal crash).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T21:29:53Z

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

/wizard-ci all

Test all apps in a directory:

/wizard-ci django
/wizard-ci fastapi
/wizard-ci flask
/wizard-ci javascript-node
/wizard-ci javascript-web
/wizard-ci next-js
/wizard-ci python
/wizard-ci react-router
/wizard-ci vue

Test an individual app:

/wizard-ci django/django3-saas
/wizard-ci fastapi/fastapi3-ai-saas
/wizard-ci flask/flask3-social-media


/wizard-ci javascript-node/express-todo
/wizard-ci javascript-node/fastify-blog
/wizard-ci javascript-node/hono-links
/wizard-ci javascript-node/koa-notes
/wizard-ci javascript-node/native-http-contacts
/wizard-ci javascript-web/saas-dashboard
/wizard-ci next-js/15-app-router-saas
/wizard-ci next-js/15-app-router-todo
/wizard-ci next-js/15-pages-router-saas
/wizard-ci next-js/15-pages-router-todo
/wizard-ci python/meeting-summarizer
/wizard-ci react-router/react-router-v7-project
/wizard-ci react-router/rrv7-starter
/wizard-ci react-router/saas-template
/wizard-ci react-router/shopper
/wizard-ci vue/movies

Results will be posted here when complete.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Vendor name "Claude SDK" in CLI help text
- Replaced "Claude SDK session" with "agent session" in the user-facing --resume yargs describe string.

Or push these changes by commenting:

@cursor push a5cfb236c3

Preview (a5cfb236c3)

diff --git a/bin.ts b/bin.ts
--- a/bin.ts
+++ b/bin.ts
@@ -2204,7 +2204,7 @@
         },
         resume: {
           describe:
-            'resume the previous Claude SDK session captured against this plan (skip cold-start work after a SIGINT or crash)',
+            'resume the previous agent session captured against this plan (skip cold-start work after a SIGINT or crash)',
           type: 'boolean',
           default: false,
         },


kelsonpw · 2026-04-26T01:02:20Z

@cursor push a5cfb23

Applied via @cursor push command

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.


cursor · 2026-04-26T01:08:17Z

+            // invoked against a plan with a captured agentSessionId. The SDK
+            // either rehydrates the conversation or, on a stale id, falls
+            // back to a fresh run — agent-runner clears the id in that case.
+            ...(config?.resumeSessionId && { resume: config.resumeSessionId }),


resume option missing from SDKQueryOptions type

Medium Severity

The new resume property is spread into the SDK query() options via ...(config?.resumeSessionId && { resume: config.resumeSessionId }), but the local SDKQueryOptions type (the single source of truth for what gets passed to the SDK) doesn't declare a resume field. The spread operator bypasses TypeScript's excess-property checking, so the property reaches the SDK at runtime with zero compile-time validation — a misspelling (e.g. Resume) or wrong value type would be silently accepted. Every other SDK option has a declared field in SDKQueryOptions; resume?: string belongs there too.
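The gap described in this comment can be reproduced in a few lines. The sketch below uses a trimmed, hypothetical stand-in for `SDKQueryOptions` (the `model` field and its value are placeholders, and the real type has other fields). The direct assignment in `buildOptions` is one possible way to get the typo check; it goes beyond what the review itself proposes, which is only to declare `resume?: string`.

```typescript
// Hypothetical stand-in for the local options type before the fix.
interface SDKQueryOptionsBefore {
  model?: string;
}

// With the field declared, as the review suggests.
interface SDKQueryOptionsAfter extends SDKQueryOptionsBefore {
  resume?: string;
}

const config: { resumeSessionId?: string } = { resumeSessionId: "sdk-sess-abc" };

// This compiles even though `Resume` is misspelled and undeclared: properties
// that arrive through a spread are not subject to excess-property checking.
const unchecked: SDKQueryOptionsBefore = {
  model: "placeholder-model",
  ...(config.resumeSessionId && { Resume: config.resumeSessionId }),
};

// Assigning the declared field directly restores compile-time validation:
// writing `options.Resume = ...` here would be rejected by the compiler.
function buildOptions(cfg: { resumeSessionId?: string }): SDKQueryOptionsAfter {
  const options: SDKQueryOptionsAfter = { model: "placeholder-model" };
  if (cfg.resumeSessionId) {
    options.resume = cfg.resumeSessionId;
  }
  return options;
}

const checked = buildOptions(config);
const checkedFresh = buildOptions({});
```

Note that declaring `resume?: string` does not by itself make the spread form type-safe; the spread would still accept a misspelled key.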

Additional Locations (1)

src/lib/agent-interface.ts#L64-L82


kelsonpw requested a review from a team April 25, 2026 21:29

cursor Bot reviewed Apr 25, 2026


Comment thread bin.ts Outdated

fix: replace vendor name 'Claude SDK' with 'agent' in CLI help text

8c54f5e

Applied via @cursor push command

kelsonpw merged commit a90692a into kelsonpw/wizard-mcp-plan-verify-list Apr 26, 2026
6 checks passed

kelsonpw deleted the kelsonpw/agent-apply-resume branch April 26, 2026 01:02

cursor Bot reviewed Apr 26, 2026


kelsonpw mentioned this pull request Apr 26, 2026

refactor(cli): split bin.ts into per-command CommandModule files #246

Closed

10 tasks
