
feat(core): add workflow restart and crash recovery APIs#1098

Merged
omeraplak merged 3 commits into main from
feat/workflow-restart-crash-recovery
Feb 21, 2026

Conversation

@omeraplak
Member

@omeraplak omeraplak commented Feb 21, 2026

PR Checklist

Please check if your PR fulfills the following requirements:

Bugs / Features

What is the current behavior?

Workflows support suspend/resume, but there is no workflow-level restart API for interrupted running executions after crashes/restarts. There is also no bulk restart helper for active runs.

What is the new behavior?

Adds restart/crash-recovery APIs and checkpoint persistence:

New workflow APIs

  • workflow.restart(executionId, options?)
  • workflow.restartAllActive(options?)
  • workflowChain.restart(executionId, options?)
  • workflowChain.restartAllActive(options?)

New registry helpers

  • WorkflowRegistry.restartWorkflowExecution(workflowId, executionId, options?)
  • WorkflowRegistry.restartAllActiveWorkflowRuns(options?)

Runtime changes

  • Persist running checkpoints during execution (step progress, stepData snapshots, workflowState, context, usage, event sequence) in workflow metadata.
  • Restore checkpoint data during restart to continue deterministically.
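The checkpoint-then-resume idea can be pictured with a minimal, self-contained sketch. Everything here is hypothetical: `RunningCheckpoint`, `completedStepIndex`, and `runWithCheckpoints` are illustrative names rather than the `@voltagent/core` internals, and real checkpoints also carry context and usage.

```typescript
// Hypothetical checkpoint shape; the real WorkflowRestartCheckpoint type may differ.
type StepRecord = { output: unknown; status: "success" | "error" };

interface RunningCheckpoint<S> {
  completedStepIndex: number; // index of the last step that finished
  workflowState: S; // shared state after that step
  stepData: Record<string, StepRecord>; // per-step snapshots
  eventSequence: number; // next event sequence number to continue from
}

type Step<S> = { id: string; run: (state: S) => S };

// Runs steps in order and persists a checkpoint after each completed step.
// When `resumeFrom` is given, completed steps are skipped and the persisted
// state, step snapshots, and event sequence are restored instead.
function runWithCheckpoints<S>(
  steps: Step<S>[],
  initialState: S,
  persist: (checkpoint: RunningCheckpoint<S>) => void,
  resumeFrom?: RunningCheckpoint<S>,
): S {
  let state = resumeFrom ? resumeFrom.workflowState : initialState;
  let eventSequence = resumeFrom ? resumeFrom.eventSequence : 0;
  const stepData: Record<string, StepRecord> = { ...(resumeFrom?.stepData ?? {}) };
  const start = resumeFrom ? resumeFrom.completedStepIndex + 1 : 0;

  for (let i = start; i < steps.length; i++) {
    state = steps[i].run(state);
    stepData[steps[i].id] = { output: state, status: "success" };
    eventSequence += 1;
    persist({ completedStepIndex: i, workflowState: state, stepData: { ...stepData }, eventSequence });
  }
  return state;
}
```

Resuming from the checkpoint written after the first step reruns only the remaining steps. A step that finished after the last persisted checkpoint but before the crash would run again, which is why idempotent steps are recommended.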

Docs

  • website/docs/workflows/overview.md
  • website/docs/workflows/suspend-resume.md

Validation

  • pnpm --filter @voltagent/core test:single src/workflow/core.spec.ts src/workflow/chain.spec.ts src/workflow/suspend-resume.spec.ts
  • pnpm --filter @voltagent/core typecheck
  • Smoke test via examples/with-workflow (programmatic):
    • suspended execution
    • checkpoint present
    • workflow.restart(executionId) => completed

fixes (issue)

N/A

Notes for reviewers

  • This PR includes only the plan 02 implementation scope.
  • docs/workflow-parity-plans/ is intentionally not included.

Summary by cubic

Adds workflow restart and crash-recovery with periodic checkpoints. Checkpoints capture step outputs/status, stepData, usage, workflow state, context, and event sequence for deterministic restarts and bulk recovery.

  • New Features

    • Restart one or all active executions via workflow.restart(...) and workflow.restartAllActive(...); also available on WorkflowChain.
    • Registry helpers for cross-workflow recovery: restartWorkflowExecution(...) and restartAllActiveWorkflowRuns({ workflowId? }); the latter returns restarted and failed summaries.
    • Persist running checkpoints every N completed steps (checkpointInterval, default 1) or disable via disableCheckpointing; checkpoints include stepData snapshots, output/status, usage, and serialize/rehydrate step errors.
    • Exported types: WorkflowRestartAllResult and WorkflowRestartCheckpoint.
  • Migration

    • Restart is supported only for runs in the "running" state.
    • Prefer idempotent steps; external side effects may have already occurred before a crash.
    • Persisted context is restored when tuple-serialized (Map-like entries); invalid shapes are ignored.
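The "running only" rule can be sketched as a guard that runs before any restart work. This is a minimal illustration, assuming a simplified persisted-state shape; `PersistedExecution` and `assertRestartable` are hypothetical names, and only the "Workflow state not found" message is taken from the review discussion. The check order (state exists, belongs to this workflow, is still running) follows the validation chain described in the review.

```typescript
// Hypothetical persisted-state shape used only for this sketch.
type ExecutionStatus = "running" | "suspended" | "completed" | "error";

interface PersistedExecution {
  executionId: string;
  workflowId: string;
  status: ExecutionStatus;
}

// Rejects anything that is not a running execution owned by `workflowId`.
function assertRestartable(
  state: PersistedExecution | null | undefined,
  workflowId: string,
): PersistedExecution {
  if (!state) {
    throw new Error("Workflow state not found");
  }
  if (state.workflowId !== workflowId) {
    // Hypothetical message: the execution belongs to a different workflow.
    throw new Error(`Execution ${state.executionId} belongs to workflow ${state.workflowId}`);
  }
  if (state.status !== "running") {
    // Suspended runs go through resume; finished runs have nothing to recover.
    throw new Error(`Only running executions can be restarted (got "${state.status}")`);
  }
  return state;
}
```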

Written for commit 925c089.

Summary by CodeRabbit

  • New Features

    • Added workflow restart APIs to recover interrupted runs from persisted running checkpoints (single-execution restart, restart-all-active per workflow, and registry-driven cross-workflow restarts). Checkpoints persist step progress, shared state, context and usage for deterministic recovery. Added runtime options to control checkpointing.
  • Documentation

    • Added guides and examples for restarting interrupted runs and crash-recovery workflows.
  • Tests

    • New tests covering restart flows, checkpoint restoration, and registry-driven restarts.

@changeset-bot

changeset-bot bot commented Feb 21, 2026

🦋 Changeset detected

Latest commit: 925c089

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @voltagent/core | Minor |


@coderabbitai
Contributor

coderabbitai bot commented Feb 21, 2026

📝 Walkthrough

Walkthrough

Adds deterministic restart and crash-recovery: new workflow.restart / restartAllActive APIs on Workflow, WorkflowChain, and WorkflowRegistry; persists running checkpoints (stepData, usage, step progress, event sequence, workflow state) and restores them to resume interrupted runs; exports new restart-related types and docs.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Public Types & Exports | packages/core/src/index.ts, packages/core/src/workflow/index.ts, packages/core/src/workflow/types.ts | Adds WorkflowRestartAllResult, WorkflowRestartCheckpoint, WorkflowStepData, WorkflowSerializedStepError; exposes restart-related types and extends public Workflow interface with restart and restartAllActive. |
| Checkpoint Persistence Types | packages/core/src/memory/types.ts, packages/core/src/workflow/internal/state.ts | Augments persisted suspension/checkpoint with stepData and usage; ensures usage is preserved in MutableWorkflowState transformations. |
| Core Restart Logic | packages/core/src/workflow/core.ts | Implements checkpoint serialization/deserialization, persistRunningCheckpoint, resume restoration of usage/stepData/workflowState, and public restart / restartAllActive implementations; augments suspension/completion metadata and exports restart helpers. |
| Facade & Registry | packages/core/src/workflow/chain.ts, packages/core/src/workflow/registry.ts | Adds restart / restartAllActive on WorkflowChain; adds restartWorkflowExecution and restartAllActiveWorkflowRuns on WorkflowRegistry with per-workflow aggregation and error reporting. |
| Internal Types Narrowing | packages/core/src/workflow/steps/types.ts | InternalWorkflow now omits restart and restartAllActive from the overridable Workflow surface. |
| Tests | packages/core/src/workflow/core.spec.ts, packages/core/src/workflow/chain.spec.ts, packages/core/src/workflow/suspend-resume.spec.ts | Adds suites validating restart from checkpoint, state/context/usage preservation, step re-execution, error rehydration, non-running restart failure, registry bulk restart behavior, and chain-level restart APIs. |
| Docs & Changelog | website/docs/workflows/overview.md, website/docs/workflows/suspend-resume.md, .changeset/old-geese-smell.md | Documents restart/crash-recovery usage and examples; adds changelog entry and usage guidance. |

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant REG as WorkflowRegistry
    participant WF as Workflow
    participant MEM as MemoryStore

    Client->>REG: restartAllActiveWorkflowRuns()
    REG->>MEM: list active executions / metadata
    MEM-->>REG: checkpoints + workflow states
    REG->>WF: restart(executionId, options)
    WF->>MEM: get persisted checkpoint & state
    MEM-->>WF: return checkpoint (stepData, usage, indices)
    WF->>WF: restore state, apply usage & stepData
    WF->>WF: re-execute remaining steps
    WF->>MEM: persistRunningCheckpoint() (progress/update)
    WF-->>Client: WorkflowExecutionResult

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • lzj960515

Poem

🐇 I stored each hop in a silver thread,
When runs fell flat, I nudged them ahead,
Step by step, I stitch and mend,
Bounce back to finish, race to the end,
Hooray — checkpoints saved, and off we tread!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(core): add workflow restart and crash recovery APIs' is specific, concise, and directly reflects the main change: adding new workflow restart and crash-recovery APIs to the core package. |
| Description check | ✅ Passed | The PR description comprehensively covers all template sections: checklist completed, new behavior clearly documented with API examples, testing and documentation additions confirmed, and changesets added. The description is well-structured and informative. |


Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 15 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/src/workflow/core.ts">

<violation number="1" location="packages/core/src/workflow/core.ts:1155">
P2: `Error` objects do not survive JSON serialization — `JSON.stringify(new Error('x'))` produces `'{}'`. The checkpoint's `error` field should be serialized to a plain object (e.g., `{ message, stack }`) or `null` so that error context is preserved across restarts and the type contract is maintained after deserialization.</violation>
</file>
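The violation is easy to confirm: `message` and `stack` are non-enumerable own properties of an `Error`, so `JSON.stringify` skips them. A minimal sketch of the suggested fix follows, assuming a plain `{ name, message, stack? }` shape; `SerializedStepError`, `serializeStepError`, and `rehydrateStepError` are hypothetical names, and the PR's actual `WorkflowSerializedStepError` type may look different.

```typescript
// Hypothetical JSON-safe error shape stored in checkpoints.
interface SerializedStepError {
  name: string;
  message: string;
  stack?: string;
}

// Copies the non-enumerable Error fields into a plain object before persisting.
function serializeStepError(error: unknown): SerializedStepError | null {
  if (error == null) return null;
  if (error instanceof Error) {
    return { name: error.name, message: error.message, stack: error.stack };
  }
  // Non-Error throwables (strings, objects) are converted with String().
  return { name: "Error", message: String(error) };
}

// Rebuilds an Error instance from the persisted shape on restart.
function rehydrateStepError(serialized: SerializedStepError | null): Error | null {
  if (!serialized) return null;
  const error = new Error(serialized.message);
  error.name = serialized.name;
  if (serialized.stack) error.stack = serialized.stack;
  return error;
}
```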

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


cloudflare-workers-and-pages bot commented Feb 21, 2026

Deploying voltagent with Cloudflare Pages

Latest commit: 26bda80
Status: ✅  Deploy successful!
Preview URL: https://9ccdbb05.voltagent.pages.dev
Branch Preview URL: https://feat-workflow-restart-crash.voltagent.pages.dev

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
.changeset/old-geese-smell.md (1)

5-18: Consider surfacing newly exported public types in the changelog prose.

The AI summary notes that WorkflowRestartAllResult and WorkflowRestartCheckpoint are newly exported from the core package surface. Changelog consumers (library users) typically benefit from knowing about new types so they can adopt them in their own type annotations.

✏️ Suggested addition
 The workflow runtime now persists running checkpoints during execution, including step progress, shared workflow state, context, and usage snapshots, so interrupted runs in `running` state can be recovered deterministically.
+
+New public types exported: `WorkflowRestartAllResult`, `WorkflowRestartCheckpoint`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.changeset/old-geese-smell.md around lines 5 - 18, The changelog currently
lists new restart/crash-recovery APIs but omits mentioning newly exported public
types; update the changelog prose to explicitly surface the new exported types
WorkflowRestartAllResult and WorkflowRestartCheckpoint (and any other new public
exports from the core package) so users can adopt them in their type
annotations—add a short sentence in the changelog paragraph that names these
types and notes they are exported from the core package and available for
consumer use.
website/docs/workflows/overview.md (1)

561-576: Consider documenting that restart() is also available directly on WorkflowChain.

The snippet shows workflow.toWorkflow().restart(...), but WorkflowChain also exposes .restart() directly (as exercised in chain.spec.ts line 112). Mentioning both avoids confusion for users who work with chains.

📝 Suggested documentation note
+> **Note:** `restart()` and `restartAllActive()` are also available directly on a `WorkflowChain` returned by `createWorkflowChain`, not only on the result of `.toWorkflow()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@website/docs/workflows/overview.md` around lines 561 - 576, Add a short note
to this section explaining that restart() can be called directly on a
WorkflowChain as well as on a Workflow instance (so users don't think they must
call workflow.toWorkflow().restart()); mention the equivalent methods
(WorkflowChain.restart and workflow.toWorkflow().restart) and optionally include
a one-line example or pointer to
WorkflowRegistry.getInstance().restartAllActiveWorkflowRuns() for cross-workflow
recovery to make the available options explicit.
packages/core/src/memory/types.ts (1)

151-159: error?: unknown | null is redundant — null is already assignable to unknown.

unknown | null is semantically identical to unknown. The explicit | null doesn't add a type constraint and may mislead readers into thinking null and undefined are treated differently here.

🛠️ Suggested fix
-      error?: unknown | null;
+      error?: unknown;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/memory/types.ts` around lines 151 - 159, The error field in
the stepData type is declared as "error?: unknown | null", which is redundant
because null is already assignable to unknown; change the declaration on the
stepData entry so the error property is typed simply as "error?: unknown" (or
remove the explicit "| null") in the Record value type to avoid misleading
readers; update the definition around the stepData property and any related
types that reference this exact shape to keep consistency.
packages/core/src/workflow/core.spec.ts (1)

514-536: Extract __voltagent_restart_checkpoint into a named constant.

The magic string "__voltagent_restart_checkpoint" is spread across multiple test fixtures (lines 515 and 677). If the key name ever changes in the implementation, these fixtures will silently diverge. Extract it to a shared constant so both the implementation and the tests stay in sync.

#!/bin/bash
# Check if this key is already exported as a constant anywhere in the codebase
rg -n '__voltagent_restart_checkpoint' --type ts

Also applies to: 677-699

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/workflow/core.spec.ts` around lines 514 - 536, Replace the
hard-coded magic string "__voltagent_restart_checkpoint" in the test fixtures
with a single exported constant (e.g., VOLTAGENT_RESTART_CHECKPOINT_KEY) defined
in the implementation module that owns the checkpoint logic; export that
constant and import it into packages/core/src/workflow/core.spec.ts, then update
all occurrences in the test (including the metadata keys at the shown blocks) to
use the imported constant so tests and implementation remain in sync if the key
changes.
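A sketch of the suggested refactor, assuming the key lives next to the checkpoint logic. `VOLTAGENT_RESTART_CHECKPOINT_KEY` is the name proposed in the comment; `readRestartCheckpoint` and `withRestartCheckpoint` are hypothetical helpers added only to show how fixtures would use a computed key.

```typescript
// In the implementation module this constant would be exported and then
// imported by core.spec.ts, so fixtures and runtime share one key.
const VOLTAGENT_RESTART_CHECKPOINT_KEY = "__voltagent_restart_checkpoint" as const;

// Reads the checkpoint payload from execution metadata, if present.
function readRestartCheckpoint(metadata: Record<string, unknown> | undefined): unknown {
  return metadata?.[VOLTAGENT_RESTART_CHECKPOINT_KEY];
}

// Builds fixture metadata with a computed key instead of a repeated string literal.
function withRestartCheckpoint(
  metadata: Record<string, unknown>,
  checkpoint: unknown,
): Record<string, unknown> {
  return { ...metadata, [VOLTAGENT_RESTART_CHECKPOINT_KEY]: checkpoint };
}
```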
packages/core/src/workflow/chain.spec.ts (1)

78-116: Consider adding a chain.restartAllActive() test.

The new describe.sequential("workflow.restart") only tests chain.restart(executionId). Per the PR description, workflowChain.restartAllActive(options?) is a new public API, but its direct invocation on a WorkflowChain instance isn't covered here — core.spec.ts tests restartAllActive() on a Workflow from createWorkflow, not on a WorkflowChain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/workflow/chain.spec.ts` around lines 78 - 116, Add a test
in this describe.sequential block that exercises the new WorkflowChain API by
calling restartAllActive on the chain instance (e.g.,
workflow.restartAllActive(options?)) instead of only testing
restart(executionId); set up the same Memory/InMemoryStorageAdapter, register
the chain via WorkflowRegistry.registerWorkflow(workflow.toWorkflow()), seed a
running execution state (workflow id "chain-restart" like the existing test),
call workflow.restartAllActive() and assert the returned execution(s) have
status "completed" and the expected result { value: 12 } to mirror the existing
restart test but via restartAllActive on the WorkflowChain.
packages/core/src/workflow/registry.ts (1)

261-265: Return type allows null but workflow.restart() never returns null.

restartWorkflowExecution is typed as Promise<WorkflowExecutionResult<any, any> | null>, but it delegates to workflow.restart(...) which returns Promise<WorkflowExecutionResult<...>> (non-nullable) or throws. The | null is dead code and may mislead callers into adding unnecessary null checks.

Proposed fix
   public async restartWorkflowExecution(
     workflowId: string,
     executionId: string,
     options?: WorkflowRunOptions,
-  ): Promise<WorkflowExecutionResult<any, any> | null> {
+  ): Promise<WorkflowExecutionResult<any, any>> {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/workflow/registry.ts` around lines 261 - 265, The return
type of restartWorkflowExecution incorrectly includes "| null" even though it
simply awaits and returns workflow.restart(...) which never returns null; update
the signature of restartWorkflowExecution to return
Promise<WorkflowExecutionResult<any, any>> (remove the nullable union) and
ensure any callers relying on a possible null are updated accordingly; locate
the method named restartWorkflowExecution and the delegation to
workflow.restart(...) to make this change.
packages/core/src/workflow/chain.ts (1)

974-1002: restart / restartAllActive recreate the workflow on each call — same caveat as run.

Both methods follow the existing pattern of calling createWorkflow() per invocation. This means each call gets a fresh in-memory store when no persistent memory is configured — so restart() won't find any prior state and will throw. This is the same limitation as run()/stream(), but it's worth noting that restart is inherently stateful and more sensitive to this.

Users who call chain.restart() instead of chain.toWorkflow().restart() with no persistent memory will always get "Workflow state not found". Consider adding a brief JSDoc note (like the toWorkflow() guidance for suspend/resume) warning that restart requires persistent memory or a registered workflow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/workflow/chain.ts` around lines 974 - 1002, Add a JSDoc
note to the restart and restartAllActive methods on the Chain class (the async
restart(...) and async restartAllActive(...) functions) warning that these
methods recreate a workflow via createWorkflow(...) on each call and therefore
require a persistent memory or a registered workflow to find prior state
(otherwise they will get "Workflow state not found"); mirror the existing
guidance used for toWorkflow() (suggest calling chain.toWorkflow().restart(...)
when using ephemeral in-memory stores) so users know to configure persistent
memory or register the workflow before invoking restart/restartAllActive.
packages/core/src/workflow/core.ts (2)

2222-2294: restartExecution implementation is thorough and well-guarded.

Good validation chain: checks state existence, workflow ID ownership, and "running" status before proceeding. The input fallback to workflow-start event is a nice compatibility touch for adapters that don't store input directly.

One subtlety: the persistedState.context cast on line 2264 assumes it was stored as Array<[string | symbol, unknown]> (matching the serialization at line 989). This works for the current in-memory path, but external storage adapters that flatten or transform entries could break the new Map(...) reconstruction. Worth a defensive check.

Proposed defensive guard
     const persistedContext = persistedState.context
-      ? new Map(persistedState.context as Array<[string | symbol, unknown]>)
+      ? Array.isArray(persistedState.context)
+        ? new Map(persistedState.context as Array<[string | symbol, unknown]>)
+        : undefined
       : undefined;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/workflow/core.ts` around lines 2222 - 2294, The
persistedState.context reconstruction in restartExecution assumes
persistedState.context is an Array<[string|symbol, unknown]> and does new
Map(persistedState.context), which can throw or produce wrong maps for external
adapters that serialize/flatten context; guard this by validating
persistedState.context is an array of 2-length tuples (e.g., Array.isArray and
each item is an array of length 2 with a string/symbol key) before calling new
Map, otherwise set persistedContext to undefined (or reconstruct safely by
iterating and selectively adding valid entries); update the code around
persistedContext/new Map in restartExecution and ensure downstream use of
restartOptions.context tolerates undefined.
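A minimal sketch of the stricter per-entry validation described in the prompt, assuming context is persisted as `Array.from(map)` tuples. `ContextEntries`, `isContextEntries`, and `restoreContext` are hypothetical names. Note that symbol keys cannot survive a JSON round-trip in any case, so in practice only string keys come back from external storage.

```typescript
type ContextEntries = Array<[string | symbol, unknown]>;

// True only for an array of [key, value] pairs with string or symbol keys,
// i.e. the shape produced by Array.from(map).
function isContextEntries(value: unknown): value is ContextEntries {
  return (
    Array.isArray(value) &&
    value.every(
      (entry) =>
        Array.isArray(entry) &&
        entry.length === 2 &&
        (typeof entry[0] === "string" || typeof entry[0] === "symbol"),
    )
  );
}

// Rebuilds the context Map, or returns undefined for unrecognized shapes
// (for example, a plain object written by an adapter that flattened entries).
function restoreContext(persisted: unknown): Map<string | symbol, unknown> | undefined {
  return isContextEntries(persisted) ? new Map(persisted) : undefined;
}
```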

1160-1192: Checkpoint I/O on every step completion — consider the cost for long workflows.

persistRunningCheckpoint performs a read (mergeExecutionMetadata, which calls getWorkflowState) plus a write (updateWorkflowState) after every step. For workflows with many small/fast steps or latency-sensitive storage backends, this doubles the memory-layer I/O per step.

This is a reasonable durability-vs-performance tradeoff for crash recovery, but you might want to offer a way to opt out or throttle (e.g., a checkpointInterval config on the workflow or per-execution options) for users who prefer speed over granular recovery.

Also applies to: 1875-1883

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/workflow/core.ts` around lines 1160 - 1192,
persistRunningCheckpoint currently does a metadata read via
mergeExecutionMetadata (which calls getWorkflowState) plus
executionMemory.updateWorkflowState on every step, causing double I/O per step;
add a configurable checkpointInterval or disableCheckpointing flag on the
workflow/execution options and use it in persistRunningCheckpoint (and the
similar checkpoint call elsewhere) to skip the merge+update on steps that are
not multiples of the interval (or when disabled), and when skipping avoid
calling mergeExecutionMetadata entirely; ensure you reference and read the new
option where persistRunningCheckpoint is invoked and before calling
mergeExecutionMetadata/updateWorkflowState so the behavior is gated by the new
config.
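The gating could look roughly like this. The option names `checkpointInterval` (default 1) and `disableCheckpointing` follow the PR summary; `CheckpointOptions` and `shouldPersistCheckpoint` are hypothetical, and the fallback for invalid intervals is an assumption of this sketch.

```typescript
// Option names follow the PR summary; this interface is illustrative only.
interface CheckpointOptions {
  checkpointInterval?: number; // persist after every N completed steps (PR default: 1)
  disableCheckpointing?: boolean; // skip running checkpoints entirely
}

// Returns true when a checkpoint should be written after `completedSteps`
// steps have finished. Checking this before mergeExecutionMetadata means
// skipped steps pay for neither the metadata read nor the write.
function shouldPersistCheckpoint(completedSteps: number, options: CheckpointOptions = {}): boolean {
  if (options.disableCheckpointing) return false;
  const raw = options.checkpointInterval ?? 1;
  // Assumption: non-finite or sub-1 intervals fall back to the default of 1.
  const interval = Number.isFinite(raw) && raw >= 1 ? Math.floor(raw) : 1;
  return completedSteps > 0 && completedSteps % interval === 0;
}
```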
packages/core/src/workflow/types.ts (1)

668-683: restartAllActive workflowId option is redundant on a single Workflow instance.

The Workflow interface already carries its own id. Passing { workflowId?: string } to restartAllActive at this level is only needed for the registry aggregation path. On the Workflow object itself the parameter is misleading — callers might pass a different workflow's ID and get an empty result without any error. Consider narrowing the public signature here to (options?: Omit<…, 'workflowId'>) and only expose the workflowId filter on the registry API, or at least document the behaviour when a mismatched ID is supplied.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/workflow/types.ts` around lines 668 - 683, The
restartAllActive signature on the Workflow interface exposes a redundant
workflowId filter; change the method on Workflow to remove the workflowId option
so it becomes restartAllActive(options?: {}) =>
Promise<WorkflowRestartAllResult> (or simply restartAllActive() =>
Promise<WorkflowRestartAllResult>), update the Workflow interface declaration in
types.ts accordingly, update any implementations of Workflow.restartAllActive to
stop expecting options.workflowId, and keep the original { workflowId?: string }
filter only on the registry-level API; also update the JSDoc comment to document
that this per-Workflow call operates on that specific Workflow's executions and
that cross-workflow filtering lives on the registry API.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/core/src/workflow/core.ts`:
- Around line 1147-1158: serializeStepDataSnapshot currently copies
stepData.error (Error | null) directly, which will be lost when persisted via
JSON; change serializeStepDataSnapshot to convert stepData.error into a
serializable shape (e.g., null or { message: string; stack?: string; name?:
string }) before returning, update the WorkflowStepData error type (or introduce
a separate serialized checkpoint type) to accept this shape, and ensure you
read/rehydrate that shape on restore from executionContext.stepData so the
message/stack are preserved across JSON round-trips.

In `@packages/core/src/workflow/registry.ts`:
- Around line 298-309: The catch block in the loop over this.workflows
incorrectly populates failed entries with executionId: workflowId; update the
failure shape to distinguish workflow-level failures by adding an optional
workflowId (or a boolean flag like isWorkflowFailure) to
WorkflowRestartAllResult.failed and push { workflowId, error: ...,
isWorkflowFailure: true } (instead of using executionId), or alternatively push
a distinct structure for workflow-level errors; update usages that consume
aggregate.failed to handle the new optional field/flag; touch the loop around
registeredWorkflow.workflow.restartAllActive and the
WorkflowRestartAllResult.failed type to implement this change.
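A sketch of the aggregation with a distinct workflow-level failure field, assuming simplified result shapes. `RestartFailure`, `RestartAllResult`, `RestartableWorkflow`, and `restartAllActiveAcross` are hypothetical; in this sketch `restarted` is just a list of execution IDs, whereas the real `WorkflowRestartAllResult` may carry more detail.

```typescript
// Workflow-level failures carry `workflowId` instead of reusing `executionId`.
interface RestartFailure {
  executionId?: string;
  workflowId?: string;
  error: string;
}

interface RestartAllResult {
  restarted: string[]; // execution IDs in this sketch
  failed: RestartFailure[];
}

interface RestartableWorkflow {
  id: string;
  restartAllActive(): Promise<RestartAllResult>;
}

// Restarts active runs for every registered workflow (optionally filtered),
// tagging each failure with its workflow and isolating per-workflow errors.
async function restartAllActiveAcross(
  workflows: Map<string, RestartableWorkflow>,
  filterWorkflowId?: string,
): Promise<RestartAllResult> {
  const aggregate: RestartAllResult = { restarted: [], failed: [] };
  for (const [workflowId, workflow] of Array.from(workflows)) {
    if (filterWorkflowId && workflowId !== filterWorkflowId) continue;
    try {
      const result = await workflow.restartAllActive();
      aggregate.restarted.push(...result.restarted);
      aggregate.failed.push(...result.failed.map((failure) => ({ ...failure, workflowId })));
    } catch (error) {
      // The whole workflow failed: record it without a fake executionId.
      aggregate.failed.push({
        workflowId,
        error: error instanceof Error ? error.message : String(error),
      });
    }
  }
  return aggregate;
}
```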

---

Nitpick comments:
In @.changeset/old-geese-smell.md:
- Around line 5-18: The changelog currently lists new restart/crash-recovery
APIs but omits mentioning newly exported public types; update the changelog
prose to explicitly surface the new exported types WorkflowRestartAllResult and
WorkflowRestartCheckpoint (and any other new public exports from the core
package) so users can adopt them in their type annotations—add a short sentence
in the changelog paragraph that names these types and notes they are exported
from the core package and available for consumer use.

In `@packages/core/src/memory/types.ts`:
- Around line 151-159: The error field in the stepData type is declared as
"error?: unknown | null", which is redundant because null is already assignable
to unknown; change the declaration on the stepData entry so the error property
is typed simply as "error?: unknown" (or remove the explicit "| null") in the
Record value type to avoid misleading readers; update the definition around the
stepData property and any related types that reference this exact shape to keep
consistency.

In `@packages/core/src/workflow/chain.spec.ts`:
- Around line 78-116: Add a test in this describe.sequential block that
exercises the new WorkflowChain API by calling restartAllActive on the chain
instance (e.g., workflow.restartAllActive(options?)) instead of only testing
restart(executionId); set up the same Memory/InMemoryStorageAdapter, register
the chain via WorkflowRegistry.registerWorkflow(workflow.toWorkflow()), seed a
running execution state (workflow id "chain-restart" like the existing test),
call workflow.restartAllActive() and assert the returned execution(s) have
status "completed" and the expected result { value: 12 } to mirror the existing
restart test but via restartAllActive on the WorkflowChain.

In `@packages/core/src/workflow/chain.ts`:
- Around line 974-1002: Add a JSDoc note to the restart and restartAllActive
methods on the Chain class (the async restart(...) and async
restartAllActive(...) functions) warning that these methods recreate a workflow
via createWorkflow(...) on each call and therefore require a persistent memory
or a registered workflow to find prior state (otherwise they will get "Workflow
state not found"); mirror the existing guidance used for toWorkflow() (suggest
calling chain.toWorkflow().restart(...) when using ephemeral in-memory stores)
so users know to configure persistent memory or register the workflow before
invoking restart/restartAllActive.

In `@packages/core/src/workflow/core.spec.ts`:
- Around line 514-536: Replace the hard-coded magic string
"__voltagent_restart_checkpoint" in the test fixtures with a single exported
constant (e.g., VOLTAGENT_RESTART_CHECKPOINT_KEY) defined in the implementation
module that owns the checkpoint logic; export that constant and import it into
packages/core/src/workflow/core.spec.ts, then update all occurrences in the test
(including the metadata keys at the shown blocks) to use the imported constant
so tests and implementation remain in sync if the key changes.

In `@packages/core/src/workflow/core.ts`:
- Around line 2222-2294: The persistedState.context reconstruction in
restartExecution assumes persistedState.context is an Array<[string|symbol,
unknown]> and does new Map(persistedState.context), which can throw or produce
wrong maps for external adapters that serialize/flatten context; guard this by
validating persistedState.context is an array of 2-length tuples (e.g.,
Array.isArray and each item is an array of length 2 with a string/symbol key)
before calling new Map, otherwise set persistedContext to undefined (or
reconstruct safely by iterating and selectively adding valid entries); update
the code around persistedContext/new Map in restartExecution and ensure
downstream use of restartOptions.context tolerates undefined.
- Around line 1160-1192: persistRunningCheckpoint currently does a metadata read
via mergeExecutionMetadata (which calls getWorkflowState) plus
executionMemory.updateWorkflowState on every step, causing double I/O per step;
add a configurable checkpointInterval or disableCheckpointing flag on the
workflow/execution options and use it in persistRunningCheckpoint (and the
similar checkpoint call elsewhere) to skip the merge+update on steps that are
not multiples of the interval (or when disabled), and when skipping avoid
calling mergeExecutionMetadata entirely; ensure you reference and read the new
option where persistRunningCheckpoint is invoked and before calling
mergeExecutionMetadata/updateWorkflowState so the behavior is gated by the new
config.

In `@packages/core/src/workflow/registry.ts`:
- Around line 261-265: The return type of restartWorkflowExecution incorrectly
includes "| null" even though it simply awaits and returns workflow.restart(...)
which never returns null; update the signature of restartWorkflowExecution to
return Promise<WorkflowExecutionResult<any, any>> (remove the nullable union)
and ensure any callers relying on a possible null are updated accordingly;
locate the method named restartWorkflowExecution and the delegation to
workflow.restart(...) to make this change.

In `@packages/core/src/workflow/types.ts`:
- Around lines 668-683: restartAllActive on the Workflow interface exposes a
redundant workflowId filter, since the call is already scoped to one workflow.
Remove that option so the signature becomes restartAllActive() =>
Promise<WorkflowRestartAllResult>, update the Workflow interface declaration in
types.ts and every implementation of Workflow.restartAllActive so they no
longer expect options.workflowId, and keep the { workflowId?: string } filter
only on the registry-level API. Also update the JSDoc to state that the
per-Workflow call operates only on that Workflow's executions and that
cross-workflow filtering lives on the registry API.
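A sketch of the proposed split, assuming a hypothetical WorkflowRestartAllResult shape (restarted/failed lists; the real type's fields are not shown in this thread). The per-workflow method takes no filter, while a toy InMemoryRegistry (also hypothetical) keeps the optional workflowId filter and aggregates results across workflows.

```typescript
// Hypothetical result shape for illustration only.
interface WorkflowRestartAllResult {
  restarted: string[];
  failed: Array<{ executionId: string; error: string }>;
}

interface WorkflowRestartApi {
  /** Restarts active executions of this workflow only; no workflowId filter needed. */
  restartAllActive(): Promise<WorkflowRestartAllResult>;
}

interface RegistryRestartApi {
  /** Cross-workflow entry point; the optional workflowId filter lives only here. */
  restartAllActiveWorkflowRuns(options?: { workflowId?: string }): Promise<WorkflowRestartAllResult>;
}

class InMemoryRegistry implements RegistryRestartApi {
  private readonly workflows: Map<string, WorkflowRestartApi>;

  constructor(workflows: Map<string, WorkflowRestartApi>) {
    this.workflows = workflows;
  }

  async restartAllActiveWorkflowRuns(options?: { workflowId?: string }): Promise<WorkflowRestartAllResult> {
    const combined: WorkflowRestartAllResult = { restarted: [], failed: [] };
    for (const [id, workflow] of this.workflows) {
      // Apply the registry-level filter, then delegate to the unfiltered per-workflow call.
      if (options?.workflowId && id !== options.workflowId) continue;
      const result = await workflow.restartAllActive();
      combined.restarted.push(...result.restarted);
      combined.failed.push(...result.failed);
    }
    return combined;
  }
}
```

Workflows are restarted sequentially here to keep the sketch simple; a real registry might restart them concurrently.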

In `@website/docs/workflows/overview.md`:
- Around lines 561-576: Add a short note explaining that restart() can be
called directly on a WorkflowChain as well as on a Workflow instance, so users
don't assume they must call workflow.toWorkflow().restart(). Name the
equivalent methods (WorkflowChain.restart and workflow.toWorkflow().restart),
and optionally add a one-line example or a pointer to
WorkflowRegistry.getInstance().restartAllActiveWorkflowRuns() for
cross-workflow recovery.

Comment thread packages/core/src/workflow/core.ts Outdated
Comment thread packages/core/src/workflow/registry.ts

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/core/src/workflow/core.spec.ts`:
- Around lines 819-826: The test passes a plain object for
WorkflowStateEntry.context (currently `context: { role: "admin" } as any`),
which violates the expected type Array<[string | symbol, unknown]>. Update the
memory.setWorkflowState call in the spec to pass the array-of-tuples form
(e.g., [["role", "admin"]]) or a small typed helper that returns
Array<[string | symbol, unknown]>, so the WorkflowStateEntry type-checks
without `as any`.

In `@packages/core/src/workflow/core.ts`:
- Around lines 2392-2426: restartOptions takes userId/conversationId from
persistedState but drops persistedState.metadata when there is no checkpoint.
Include merged metadata in restartOptions by shallow-merging options?.metadata
over persistedState.metadata (so caller overrides take precedence), e.g.
mergedMetadata = { ...(persistedState.metadata ?? {}), ...(options?.metadata ??
{}) }, then assign metadata: Object.keys(mergedMetadata).length ?
mergedMetadata : undefined. Before assigning, remove the restart checkpoint
marker key from mergedMetadata so restart-specific data is not carried forward.
Adjust the code around persistedState, metadataUserId, metadataConversationId,
and restartOptions accordingly.

Comment on lines +819 to +826
await memory.setWorkflowState(executionId, {
id: executionId,
workflowId: workflow.id,
workflowName: workflow.name,
status: "running",
input: { value: 9 },
context: { role: "admin" } as any,
createdAt: now,

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's look at the specific code in question
sed -n '815,830p' packages/core/src/workflow/core.spec.ts

Repository: VoltAgent/voltagent

Length of output: 655


🏁 Script executed:

#!/bin/bash
# Inspect persisted workflow context type
rg -n "context" packages/core/src/memory/types.ts -C3

Repository: VoltAgent/voltagent

Length of output: 2034


🏁 Script executed:

#!/bin/bash
# Let's also look at the WorkflowState type definition
rg -n "WorkflowState\|interface.*State\|type.*State" packages/core/src/memory/types.ts -A5

Repository: VoltAgent/voltagent

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Get exact lines 819-826
sed -n '819,826p' packages/core/src/workflow/core.spec.ts

Repository: VoltAgent/voltagent

Length of output: 455


🏁 Script executed:

#!/bin/bash
# Search for WorkflowState type definition
rg -n "interface WorkflowState\|type WorkflowState" packages/core/src -A10

Repository: VoltAgent/voltagent

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Look for test utilities or helpers for creating workflow state
rg -n "setWorkflowState|WorkflowState" packages/core/src/workflow/core.spec.ts -B2 -A2 | head -40

Repository: VoltAgent/voltagent

Length of output: 1531


🏁 Script executed:

#!/bin/bash
# Get exact lines around 819-826 using cat
cat -n packages/core/src/workflow/core.spec.ts | sed -n '815,830p'

Repository: VoltAgent/voltagent

Length of output: 767


🏁 Script executed:

#!/bin/bash
# Search for the complete WorkflowState definition
rg -n "WorkflowState" packages/core/src/memory/types.ts

Repository: VoltAgent/voltagent

Length of output: 584


🏁 Script executed:

#!/bin/bash
# Look at the full test function starting around line 810
cat -n packages/core/src/workflow/core.spec.ts | sed -n '808,835p'

Repository: VoltAgent/voltagent

Length of output: 1232


🏁 Script executed:

#!/bin/bash
# Search for the actual code with context and as any
rg -n "context.*as any\|role.*admin" packages/core/src/workflow/core.spec.ts -B5 -A5

Repository: VoltAgent/voltagent

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Also search more broadly for the pattern
rg -n "as any" packages/core/src/workflow/core.spec.ts

Repository: VoltAgent/voltagent

Length of output: 310


🏁 Script executed:

#!/bin/bash
# Get context around the actual line 1026
cat -n packages/core/src/workflow/core.spec.ts | sed -n '1020,1035p'

Repository: VoltAgent/voltagent

Length of output: 644


🏁 Script executed:

#!/bin/bash
# Get the complete WorkflowStateEntry interface definition
rg -n "export interface WorkflowStateEntry" packages/core/src/memory/types.ts -A30

Repository: VoltAgent/voltagent

Length of output: 1212


The line range above is off: the cast actually sits at line 1026, where `context: { role: "admin" } as any` is passed even though WorkflowStateEntry expects context to be Array<[string | symbol, unknown]>. The `as any` cast masks this type mismatch; replace it with the correct array-of-tuples form or a typed helper to keep the test type-safe, per the coding guidelines.
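A possible typed helper for the spec, assuming only the persisted shape stated above (Array<[string | symbol, unknown]>). toPersistedContext and fromPersistedContext are hypothetical; the latter is a rough stand-in for the restore path (toValidContextMap in core.ts is not shown here) that keeps only well-formed [key, value] pairs.

```typescript
// Persisted shape named in the review.
type PersistedContext = Array<[string | symbol, unknown]>;

// Build the tuple form from a plain object so specs can drop `as any`.
function toPersistedContext(values: Record<string, unknown>): PersistedContext {
  // Object.entries yields [string, unknown] pairs, which fit the tuple shape directly.
  return Object.entries(values);
}

// Rebuild a Map from persisted data, ignoring anything that is not a [key, value] pair.
function fromPersistedContext(entries: unknown): Map<string | symbol, unknown> | undefined {
  if (!Array.isArray(entries)) return undefined;
  const map = new Map<string | symbol, unknown>();
  for (const entry of entries) {
    const isPair =
      Array.isArray(entry) &&
      entry.length === 2 &&
      (typeof entry[0] === "string" || typeof entry[0] === "symbol");
    if (isPair) map.set(entry[0], entry[1]);
  }
  return map;
}
```

In the spec, `context: toPersistedContext({ role: "admin" })` would then replace `context: { role: "admin" } as any`.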

Comment on lines +2392 to +2426
const metadataUserId =
typeof persistedState.metadata?.userId === "string"
? (persistedState.metadata.userId as string)
: undefined;
const metadataConversationId =
typeof persistedState.metadata?.conversationId === "string"
? (persistedState.metadata.conversationId as string)
: undefined;
const persistedContext = toValidContextMap(persistedState.context);
const effectiveWorkflowState =
options?.workflowState ?? checkpoint?.workflowState ?? persistedState.workflowState ?? {};

const restartOptions: WorkflowRunOptions = {
...options,
executionId,
userId: options?.userId ?? persistedState.userId ?? metadataUserId,
conversationId:
options?.conversationId ?? persistedState.conversationId ?? metadataConversationId,
context: options?.context ?? persistedContext,
workflowState: effectiveWorkflowState,
resumeFrom: checkpoint
? {
executionId,
resumeStepIndex: checkpoint.resumeStepIndex,
lastEventSequence: checkpoint.eventSequence,
checkpoint: {
stepExecutionState: checkpoint.stepExecutionState,
completedStepsData: checkpoint.completedStepsData,
workflowState: checkpoint.workflowState ?? effectiveWorkflowState,
stepData: checkpoint.stepData,
usage: checkpoint.usage,
},
}
: undefined,
};

⚠️ Potential issue | 🟡 Minor

Preserve persisted metadata when restarting without a checkpoint.
Lines 2404-2412 build restart options without carrying stored metadata if no checkpoint exists, which can drop tenant/custom metadata for runs that crash before the first checkpoint (or when checkpointing is disabled). Consider carrying forward persisted metadata (minus the restart checkpoint key) unless the caller overrides it.

Suggested fix
+    const persistedMetadata = isObjectRecord(persistedState.metadata)
+      ? (persistedState.metadata as Record<string, unknown>)
+      : undefined;
+    const restartMetadata =
+      options?.metadata ??
+      (persistedMetadata
+        ? Object.fromEntries(
+            Object.entries(persistedMetadata).filter(
+              ([key]) => key !== VOLTAGENT_RESTART_CHECKPOINT_KEY,
+            ),
+          )
+        : undefined);
+
     const restartOptions: WorkflowRunOptions = {
       ...options,
       executionId,
       userId: options?.userId ?? persistedState.userId ?? metadataUserId,
       conversationId:
         options?.conversationId ?? persistedState.conversationId ?? metadataConversationId,
       context: options?.context ?? persistedContext,
       workflowState: effectiveWorkflowState,
+      metadata: restartMetadata,
       resumeFrom: checkpoint
         ? {
             executionId,
             resumeStepIndex: checkpoint.resumeStepIndex,

@omeraplak omeraplak merged commit b610ec6 into main Feb 21, 2026
20 of 22 checks passed
@omeraplak omeraplak deleted the feat/workflow-restart-crash-recovery branch February 21, 2026 20:22

@coderabbitai coderabbitai bot left a comment

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@packages/core/src/workflow/core.spec.ts`:
- Around lines 1020-1028: The test still passes a loose object cast with
`as any` as the persisted context in memory.setWorkflowState(executionId, {...,
context: { role: "admin" } as any, ...}), which bypasses the expected
tuple-array format and weakens type safety. Replace the cast with a properly
typed context value in the persisted tuple-array format so the test exercises
the real types for executionId/context handling.

In `@packages/core/src/workflow/core.ts`:
- Around lines 2420-2464: The restart flow still drops persisted metadata. When
building restartOptions (WorkflowRunOptions), use options?.metadata if the
caller provides it, otherwise persistedState.metadata with the checkpoint key
removed, so tenant/custom metadata is preserved across restarts while callers
can still override it.
