Skip to content

fix(parallelism): salt sub-app IDs with sequence_id to prevent stale replay (#761)#778

Open
elijahbenizzy wants to merge 3 commits into
mainfrom
fix/761-mapstates-stale-replay-on-resume
Open

fix(parallelism): salt sub-app IDs with sequence_id to prevent stale replay (#761)#778
elijahbenizzy wants to merge 3 commits into
mainfrom
fix/761-mapstates-stale-replay-on-resume

Conversation

@elijahbenizzy
Copy link
Copy Markdown
Contributor

Fixes #761

Root cause

TaskBasedParallelAction (and subclasses MapStates, MapActions, MapActionsAndStates) generated sub-application IDs that were deterministic in (parent_app_id, i, j) only -- no per-invocation discriminator. When the parent application was built with initialize_from(...), the cascaded state initializer would, on the second invocation of the same parallel action, see a sub-app ID matching the first invocation's persisted state and silently hydrate it instead of running the action.

Result: every invocation after the first replayed the first invocation's outputs. Silent wrong results.

Fresh runs (no initialize_from) were unaffected because the cascaded initializer was None, so sub-apps never tried to load existing state even though their IDs still collided.

Fix

In TaskBasedParallelAction.run_and_update, salt each task's application_id with the parent context's sequence_id. The parent's sequence_id increments on every action step, so it uniquely identifies which invocation of the parent action we are in. Applied via a small _salt_task_app_id helper in both sync and async generator paths, so every subclass inherits the fix automatically without overriding tasks().

This matches the workaround pattern the issue reporter (hippalectryon-0) verified.

Regression test

tests/core/test_parallelism.py::test_map_states_reexecutes_on_repeated_invocations_with_initializer -- builds an app with initialize_from(...) and a MapStates action, invokes it three times in a row using a counter inside the sub-action, and asserts the outputs differ across invocations and the counter advanced as expected. Confirmed it fails on main and passes with the fix.

Test plan

  • New regression test passes with the fix
  • New regression test fails on main (without the fix), confirming it actually catches the bug
  • All 19 existing tests in tests/core/test_parallelism.py still pass

Backwards-compat note for release notes

This changes the sub-application IDs produced by every TaskBasedParallelAction subclass. Anyone who has persisted sub-app state under the old (deterministic-without-salt) IDs and depended on resuming from those exact IDs will see new IDs after upgrading. Worth flagging in the release notes for the next release.

…replay (#761)

Sub-application IDs in `TaskBasedParallelAction` (and its subclasses
`MapStates`, `MapActions`, `MapActionsAndStates`) were deterministic in
`(parent_app_id, i, j)` only, with no per-invocation discriminator. When
the parent application was built with `initialize_from(...)`, the
cascaded state initializer would, on the second invocation of the same
parallel action, see a sub-app ID matching the first invocation's
persisted state and silently hydrate it instead of re-running the
action. The result: every invocation after the first replayed the first
invocation's outputs.

Fresh runs (no `initialize_from`) were unaffected because the cascaded
initializer was None, so sub-apps never tried to load existing state
even though their IDs still collided.

Fix: in `TaskBasedParallelAction.run_and_update`, salt each task's
`application_id` with the parent context's `sequence_id` (which
increments per parent action step, so it uniquely identifies a given
invocation). Applied via a small `_salt_task_app_id` helper in both the
sync and async generator paths so all subclasses inherit the fix
without needing to override `tasks()`.

Also adds a regression test that exercises the full failure mode:
builds an app with `initialize_from(...)` and a `MapStates`, invokes it
three times, and asserts the per-invocation outputs differ.

Note: this changes sub-app IDs across versions for any
`TaskBasedParallelAction` subclass; persisted sub-app state from the
old scheme will not be matched by the new IDs.
@github-actions github-actions Bot added area/core Application, State, Graph, Actions area/streaming Streaming actions, parallel streams labels May 14, 2026
@elijahbenizzy
Copy link
Copy Markdown
Contributor Author

For 0.43.0 release notes — breaking change to surface:

The sub-application ID scheme for TaskBasedParallelAction (MapStates, MapActions, MapActionsAndStates) now incorporates the parent's sequence_id. Persisted sub-app state from earlier versions will be orphaned on resume — the sub-actions will re-execute fresh rather than load stale results. This is the fix for #761; the old scheme would have silently returned stale data instead.

Also documented inline in the _salt_task_app_id docstring (commit b234025) so anyone reading the code finds the rationale + version-impact note in one place.

Practical impact summary:

  • Fresh runs on the new version: no change in behavior
  • New runs on the new version that get persisted, then resumed: works normally with new IDs
  • Resumes of mid-flight pre-upgrade state: parallel sub-actions re-execute on first resume (one-time cost), then proceed normally
  • No data loss, no silent wrong answers (which is what the bug was)

Comment thread burr/core/parallelism.py
# Salt sub-app IDs with the parent sequence_id so repeated invocations
# don't collide and silently replay prior persisted state (#761).
task_generator = (
_salt_task_app_id(task, context.sequence_id) for task in task_generator
Copy link
Copy Markdown
Contributor

@skrawcz skrawcz May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to enable configuring this? Is there a use case where you want the prior behavior?

@skrawcz
Copy link
Copy Markdown
Contributor

skrawcz commented May 15, 2026

Looks good. Why test failures though?

@elijahbenizzy
Copy link
Copy Markdown
Contributor Author

Closing — fix is wrong as designed

Investigating the failing `test_end_to_end_parallel_collatz_many_unreliable_tasks` revealed that this fix is fundamentally at odds with another required behavior.

The conflict

Use case What sub-app IDs need
#761 (two successful invocations of the parallel action within one parent app's lifetime) IDs must differ so each invocation runs fresh
`test_end_to_end_parallel_collatz_many_unreliable_tasks` (parent app fails mid-parallel-action, gets rebuilt + resumed via `initialize_from`, sub-tasks resume from their checkpoints) IDs must stay stable across parent rebuilds so the cascading initializer can find prior sub-app state

What I confirmed via diagnostics

  • Original (unsalted) sub-app IDs are stable across parent retries: e.g. `91e33a489b3c3b430d6d66c4...` is the same in iteration 1, 2, 3 of a parent rebuild loop.
  • `context.sequence_id` increments on every parent attempt, including resume rebuilds: 1 → 2 → 3 across retries.
  • `state.__SEQUENCE_ID` does the same: 1 → 2 → 3 across retries (and notably `__PRIOR_STEP` is dropped from state on the resumed iterations, suggesting `initialize_from` does some bookkeeping that bumps the counter).

So salting with sequence_id fixes #761 but breaks the retry-on-failure resume path: each rebuild gets fresh sub-app IDs, the cascading initializer can't find prior sub-app state, sub-tasks restart from scratch — and because new failures keep occurring, the test loops forever.

What would actually work

A salt that:

  1. Stays stable when the parallel action is being retried after a sub-action failure (so checkpoints work)
  2. Differs between distinct successful invocations of the same parallel action within one parent app's lifetime (so MapStates replays stale sub-state on repeated invocations when parent has a cascading state_initializer #761 is fixed)

No value in the current `run_and_update` context satisfies both. Candidates that would need new plumbing:

  • A counter of "successful prior completions of this parallel action" — would need to be persisted in parent state, incremented by the action's reduce step. Cleanest semantically; requires new field in state and modification to `TaskBasedParallelAction` execution flow.
  • User-provided per-invocation salt (opt-in via override): preserves default behavior, lets users hitting MapStates replays stale sub-state on repeated invocations when parent has a cascading state_initializer #761 fix their case explicitly. Punts the problem.
  • Default cascading initializer to `None` for sub-apps: large breaking change, would update `test_end_to_end_parallel_collatz_many_unreliable_tasks`. Possibly the "right" long-term answer but needs design discussion.

This needs a design call before any fix lands. Closing the PR. Issue #761 stays open with the diagnostic findings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/core Application, State, Graph, Actions area/streaming Streaming actions, parallel streams

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MapStates replays stale sub-state on repeated invocations when parent has a cascading state_initializer

2 participants