feat: RuntimeState event bus integration with checkpoint/resume #5241

Merged
greysonlalonde merged 74 commits into main from chore/runtime-state-event-bus on Apr 6, 2026

Conversation

Contributor

@greysonlalonde commented on Apr 2, 2026

Summary

  • Pass RuntimeState as optional third arg to event bus handlers
  • RuntimeState.checkpoint(dir) writes timestamped JSON snapshots
  • Crew.from_checkpoint(path) restores and resumes via kickoff()
  • _get_execution_start_index skips tasks with existing output
  • Convert CrewStructuredTool, StandardPromptResult, SystemPromptResult, TokenCalcHandler to BaseModel
  • CrewAgentExecutorMixin uses Field(exclude=True) for back-references
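
The second bullet (timestamped JSON snapshots) can be pictured with a stdlib-only sketch. This is not the PR's implementation: `write_checkpoint`, `load_latest_checkpoint`, the file naming scheme, and the shape of the state dict are all hypothetical stand-ins chosen for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_checkpoint(state: dict, directory: str) -> Path:
    """Write `state` as JSON to a new timestamped file inside `directory`."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a UTC timestamp with microseconds, so that
    # successive snapshots sort chronologically by file name.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S_%f")
    path = target_dir / f"{stamp}.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def load_latest_checkpoint(directory: str) -> dict:
    """Load the most recent snapshot, relying on the sortable file names."""
    snapshots = sorted(Path(directory).glob("*.json"))
    if not snapshots:
        raise FileNotFoundError(f"no checkpoints in {directory}")
    return json.loads(snapshots[-1].read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    original = {"tasks": [{"name": "research", "output": "done"}]}
    saved = write_checkpoint(original, tmp)
    restored = load_latest_checkpoint(tmp)
```

Writing a new file per snapshot, rather than overwriting one file, keeps earlier checkpoints available to resume from.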

Test plan

  • Real LLM execution: checkpoint after task 1, restore, resume skips task 1 and runs task 2
  • 371 core tests pass
  • Backwards compatible: 2-arg event handlers still work
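
The resume behaviour in the first test-plan item (skip task 1, run task 2) comes down to finding the first task without output. A minimal sketch, assuming a simplified `Task` dataclass that stands in for the real task model, and a free function in place of the `_get_execution_start_index` method:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical stand-in for a crew task; not the crewAI class."""
    name: str
    output: Optional[str] = None  # None means the task has not completed


def get_execution_start_index(tasks: list[Task]) -> int:
    """Return the index of the first task that has no output yet."""
    for index, task in enumerate(tasks):
        if task.output is None:
            return index
    return len(tasks)  # everything already completed


def run(tasks: list[Task]) -> list[str]:
    """Execute only the tasks from the start index onward."""
    executed = []
    for task in tasks[get_execution_start_index(tasks):]:
        task.output = f"result of {task.name}"
        executed.append(task.name)
    return executed


# Task 1 carries output restored from a checkpoint, task 2 does not.
tasks = [Task("research", output="restored from checkpoint"), Task("write")]
executed = run(tasks)
```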

Note

High Risk
High risk because it changes core execution flow (agent executors, task skipping/resume) and event emission semantics by introducing shared RuntimeState recording and passing it into handlers.

Overview
Adds first-class checkpoint/resume by serializing a unified RuntimeState (entities + event record) to timestamped JSON and restoring Crew, Flow, and Agent via new from_checkpoint() APIs that rehydrate runtime links, rebuild event scope, and resume execution from the first incomplete task.

Integrates RuntimeState into the event system: the event bus now records emitted events, auto-registers emitting entities, and optionally passes the current runtime state as a third argument to sync/async handlers while remaining compatible with existing 2-arg handlers.
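
One way to stay compatible with 2-arg handlers is to inspect each handler's signature at registration time. The sketch below assumes that approach; the PR does not say how arity is detected, and `ToyEventBus` and `accepts_state` are hypothetical names. It covers only synchronous handlers.

```python
import inspect
from typing import Any, Callable


def accepts_state(handler: Callable[..., Any]) -> bool:
    """Return True if the handler can take a third positional argument."""
    params = list(inspect.signature(handler).parameters.values())
    if any(p.kind is inspect.Parameter.VAR_POSITIONAL for p in params):
        return True
    positional = [
        p for p in params
        if p.kind in (inspect.Parameter.POSITIONAL_ONLY,
                      inspect.Parameter.POSITIONAL_OR_KEYWORD)
    ]
    return len(positional) >= 3


class ToyEventBus:
    """Records emitted events and passes state only to handlers that want it."""

    def __init__(self, state: dict) -> None:
        self.state = state
        self.record: list[Any] = []
        self._handlers: list[tuple[Callable[..., Any], bool]] = []

    def on(self, handler: Callable[..., Any]) -> Callable[..., Any]:
        # Decide once, at registration, whether this handler takes state.
        self._handlers.append((handler, accepts_state(handler)))
        return handler

    def emit(self, source: Any, event: Any) -> None:
        self.record.append(event)
        for handler, wants_state in self._handlers:
            if wants_state:
                handler(source, event, self.state)
            else:
                handler(source, event)


bus = ToyEventBus(state={"run_id": "demo"})
calls = []


@bus.on
def legacy_handler(source, event):
    calls.append(("legacy", event))


@bus.on
def stateful_handler(source, event, state):
    calls.append(("stateful", event, state["run_id"]))


bus.emit("crew", "task_started")
```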

Refactors executor/state serialization to support checkpointing: introduces BaseAgentExecutor (Pydantic model) as a shared base for CrewAgentExecutor and AgentExecutor, adds resumable message handling, and updates LLM/executor fields to round-trip as structured dicts (with llm_type/executor_type discriminators) across agents/crews.

Reviewed by Cursor Bugbot for commit 3be42f9.

@github-actions bot added the size/L label on Apr 2, 2026
@greysonlalonde changed the title from "feat: runtime state event bus" to "feat: RuntimeState event bus integration with checkpoint/resume" on Apr 3, 2026
…ider pattern

- Move runtime_state.py to state/runtime.py
- Add acheckpoint async method using aiofiles
- Introduce BaseProvider protocol and JsonProvider for pluggable storage
- Add aiofiles dependency to crewai package
- Use PrivateAttr for provider on RootModel
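
The pluggable-storage idea in this commit can be sketched with `typing.Protocol`. The names and method signatures below (`StorageProvider`, `save`, `asave`, `JsonFileProvider`, `InMemoryProvider`) are assumptions, not the PR's `BaseProvider`/`JsonProvider` API, and `asyncio.to_thread` is swapped in for aiofiles to keep the example stdlib-only.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Protocol


class StorageProvider(Protocol):
    """Hypothetical storage interface: one sync and one async entry point."""

    def save(self, name: str, payload: str) -> str: ...
    async def asave(self, name: str, payload: str) -> str: ...


class JsonFileProvider:
    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)

    def save(self, name: str, payload: str) -> str:
        self.directory.mkdir(parents=True, exist_ok=True)
        path = self.directory / f"{name}.json"
        path.write_text(payload, encoding="utf-8")
        return str(path)

    async def asave(self, name: str, payload: str) -> str:
        # Stand-in for aiofiles: run the blocking write in a worker thread.
        return await asyncio.to_thread(self.save, name, payload)


class InMemoryProvider:
    def __init__(self) -> None:
        self.blobs: dict[str, str] = {}

    def save(self, name: str, payload: str) -> str:
        self.blobs[name] = payload
        return name

    async def asave(self, name: str, payload: str) -> str:
        return self.save(name, payload)


def checkpoint(state: dict, provider: StorageProvider, name: str) -> str:
    return provider.save(name, json.dumps(state))


async def acheckpoint(state: dict, provider: StorageProvider, name: str) -> str:
    return await provider.asave(name, json.dumps(state))


memory = InMemoryProvider()
checkpoint({"step": 1}, memory, "first")
asyncio.run(acheckpoint({"step": 2}, memory, "second"))

with tempfile.TemporaryDirectory() as tmp:
    file_path = asyncio.run(acheckpoint({"step": 3}, JsonFileProvider(tmp), "third"))
    on_disk = json.loads(Path(file_path).read_text(encoding="utf-8"))
```

Because the checkpoint functions depend only on the protocol, a test can pass the in-memory provider and never touch disk.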
Set task_id and task_name in _set_task_fingerprint so events carry
task identity through serialization. Use task_id to find the correct
task_started event when restoring the scope stack on checkpoint resume.
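
A rough picture of why `task_id` matters on resume: once events carry it, the restore step can pick out the `task_started` event that belongs to the task being resumed. The event dicts, the `event_id` key, and both helper functions below are illustrative assumptions, not the PR's data model.

```python
from typing import Optional

# Hypothetical recorded events, as they might look after deserialization.
events = [
    {"event_id": "e1", "type": "crew_kickoff_started"},
    {"event_id": "e2", "type": "task_started", "task_id": "t1", "task_name": "research"},
    {"event_id": "e3", "type": "task_completed", "task_id": "t1", "task_name": "research"},
    {"event_id": "e4", "type": "task_started", "task_id": "t2", "task_name": "write"},
]


def find_task_started(record: list[dict], task_id: str) -> Optional[dict]:
    """Return the most recent task_started event for the given task."""
    for event in reversed(record):
        if event["type"] == "task_started" and event.get("task_id") == task_id:
            return event
    return None


def rebuild_scope_stack(record: list[dict], resumed_task_id: str) -> list[str]:
    """Rebuild a simple scope stack: the kickoff event, then the resumed task."""
    stack = [e["event_id"] for e in record if e["type"] == "crew_kickoff_started"][:1]
    started = find_task_started(record, resumed_task_id)
    if started is not None:
        stack.append(started["event_id"])
    return stack


scope = rebuild_scope_stack(events, "t2")
```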
- Add Literal type discriminators to all 119 event subclasses
- Add BeforeValidator + PlainSerializer on EventNode.event to
  deserialize events into the correct subclass using a type registry
- Falls back to BaseEvent for unrecognized or incomplete event dicts
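
The discriminator-plus-registry approach can be shown without Pydantic. The sketch below swaps in plain dataclasses and an explicit lookup function for the `BeforeValidator`; the event classes and `event_from_dict` are hypothetical simplifications.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class BaseEvent:
    type: str


@dataclass
class TaskStartedEvent(BaseEvent):
    task_id: str


@dataclass
class TaskCompletedEvent(BaseEvent):
    task_id: str
    output: str


# Registry mapping the discriminator value to the concrete event class.
EVENT_TYPES: dict[str, type] = {
    "task_started": TaskStartedEvent,
    "task_completed": TaskCompletedEvent,
}


def event_from_dict(data: dict[str, Any]) -> BaseEvent:
    """Build the registered subclass, or fall back to BaseEvent."""
    cls = EVENT_TYPES.get(data.get("type"))
    if cls is not None:
        try:
            return cls(**data)
        except TypeError:
            pass  # missing or unexpected fields: fall through to the base class
    return BaseEvent(type=str(data.get("type", "unknown")))


started = event_from_dict({"type": "task_started", "task_id": "t1"})
incomplete = event_from_dict({"type": "task_completed", "task_id": "t1"})
unknown = event_from_dict({"type": "mystery_event"})
round_trip = event_from_dict(asdict(started))
```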
Use BeforeValidator/PlainSerializer with create_model_from_schema to
serialize type[BaseModel] as its JSON schema dict and reconstruct a
dynamic model on deserialization.

# Conflicts:
#	lib/crewai/src/crewai/llms/providers/openai/completion.py
#	lib/devtools/pyproject.toml
#	uv.lock

BaseTool could not serialize to JSON because args_schema (a class
reference) and cache_function (a lambda) are not JSON-serializable.
This caused checkpointing to crash for any crew with tools.

- Add PlainSerializer to args_schema so it round-trips via JSON schema
- Replace default cache_function lambda with named _default_cache_function
  and type it as SerializableCallable so it serializes to a dotted path
- Add computed_field tool_type that stores the fully qualified class name
- Add restore_tool_from_dict to reconstruct the concrete subclass from
  checkpoint dicts, pre-resolving callback strings to callables
- Update BaseAgent.validate_tools and Task._restore_tools_from_checkpoint
  to handle dict inputs from checkpoint deserialization
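
Serializing a callable to a dotted path works only for functions importable by name, which is presumably why the default lambda was replaced with a named function. A minimal sketch of the idea, assuming module-level functions only; `callable_to_path` and `path_to_callable` are hypothetical helpers, not the PR's `SerializableCallable`:

```python
import importlib
import json
from typing import Any, Callable


def callable_to_path(func: Callable[..., Any]) -> str:
    """Serialize a module-level function as 'module.name'."""
    qualname = func.__qualname__
    if "<lambda>" in qualname or "<locals>" in qualname or "." in qualname:
        raise ValueError(f"{qualname!r} cannot be imported by name")
    return f"{func.__module__}.{qualname}"


def path_to_callable(path: str) -> Callable[..., Any]:
    """Resolve a 'module.name' string back to the function object."""
    module_name, _, attr = path.rpartition(".")
    obj = getattr(importlib.import_module(module_name), attr)
    if not callable(obj):
        raise TypeError(f"{path!r} does not refer to a callable")
    return obj


path = callable_to_path(json.dumps)
restored = path_to_callable(path)

# A lambda has no importable name, so it cannot round-trip this way.
try:
    callable_to_path(lambda args, result: True)
    lambda_rejected = False
except ValueError:
    lambda_rejected = True
```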
greysonlalonde and others added 3 commits April 7, 2026 02:27
BaseTool could not serialize to JSON because args_schema (a class
reference) and cache_function (a lambda) are not JSON-serializable.
This caused checkpointing to crash for any crew with tools.

- Add PlainSerializer to args_schema so it round-trips via JSON schema
- Replace default cache_function lambda with named _default_cache_function
  and type it as SerializableCallable
- Add computed_field tool_type storing the fully qualified class name
- Add __init_subclass__ registry and __get_pydantic_core_schema__ on
  BaseTool so any list[BaseTool] field automatically dispatches to the
  concrete subclass during deserialization via tool_type lookup
- No changes needed to BaseAgent.validate_tools or Task — Pydantic
  handles it natively through the custom core schema
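
The registry half of this commit can be illustrated in plain Python: `__init_subclass__` records every subclass under its fully qualified name, and deserialization looks the class up by `tool_type`. The sketch swaps an explicit `from_dict` in for the custom Pydantic core schema, and `ToyTool` and `SearchTool` are hypothetical classes.

```python
from typing import Any, ClassVar


class ToyTool:
    """Stand-in for a tool base class; subclasses self-register by name."""

    _registry: ClassVar[dict[str, type]] = {}

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        ToyTool._registry[cls.tool_type_name()] = cls

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def tool_type_name(cls) -> str:
        # Fully qualified class name, used as the discriminator value.
        return f"{cls.__module__}.{cls.__qualname__}"

    def to_dict(self) -> dict:
        return {"tool_type": self.tool_type_name(), "name": self.name}

    @classmethod
    def from_dict(cls, data: dict) -> "ToyTool":
        # Dispatch to the registered subclass; fall back to the calling class.
        concrete = cls._registry.get(data.get("tool_type", ""), cls)
        return concrete(name=data["name"])


class SearchTool(ToyTool):
    def run(self, query: str) -> str:
        return f"searching for {query}"


payload = [SearchTool("search").to_dict()]
tools = [ToyTool.from_dict(item) for item in payload]
```

Callers deserialize through the base class and still get the concrete subclass back, which is what removes the need for per-field restore hooks.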
tool_type computed field is legitimately required in the schema.

Flow.from_checkpoint deserialized the checkpoint_* fields but never
copied them back into the private execution attrs (_completed_methods,
_method_outputs, _method_execution_counts, _state). Calling kickoff()
after from_checkpoint would restart from scratch.

Add _restore_from_checkpoint that copies the checkpoint fields into the
private attrs, using the existing _restore_state method for state
reconstruction.

Flow.from_checkpoint deserializes as base Flow (entity_type
discriminator), losing subclass methods and state type. When called on
a subclass like MyFlow.from_checkpoint(), create a cls instance and
transfer the checkpoint fields so @start methods, listeners, and
structured state are available.
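
The two Flow fixes above can be seen together in a toy model: `from_checkpoint` builds an instance of the calling subclass, then copies the checkpoint fields into the private execution attributes so that `kickoff()` continues rather than restarting. `ToyFlow`, `ReportFlow`, `steps()`, and the checkpoint key names are hypothetical; the real Flow uses `@start` methods and listeners instead of an explicit step list.

```python
class ToyFlow:
    def __init__(self) -> None:
        self._completed_methods: set[str] = set()
        self._method_outputs: list[str] = []
        self._state: dict = {}

    def steps(self) -> list[str]:
        return []

    def to_checkpoint(self) -> dict:
        return {
            "checkpoint_completed_methods": sorted(self._completed_methods),
            "checkpoint_method_outputs": list(self._method_outputs),
            "checkpoint_state": dict(self._state),
        }

    @classmethod
    def from_checkpoint(cls, data: dict) -> "ToyFlow":
        flow = cls()  # instantiate the calling subclass, not the base class
        flow._restore_from_checkpoint(data)
        return flow

    def _restore_from_checkpoint(self, data: dict) -> None:
        # Copy checkpoint fields back into the private execution attributes.
        self._completed_methods = set(data["checkpoint_completed_methods"])
        self._method_outputs = list(data["checkpoint_method_outputs"])
        self._state = dict(data["checkpoint_state"])

    def kickoff(self) -> list[str]:
        ran = []
        for name in self.steps():
            if name in self._completed_methods:
                continue  # already done before the checkpoint
            self._method_outputs.append(getattr(self, name)())
            self._completed_methods.add(name)
            ran.append(name)
        return ran


class ReportFlow(ToyFlow):
    def steps(self) -> list[str]:
        return ["collect", "summarize"]

    def collect(self) -> str:
        self._state["rows"] = 3
        return "collected"

    def summarize(self) -> str:
        return f"summary of {self._state['rows']} rows"


# Simulate a run that was checkpointed after the first step.
first = ReportFlow()
first._method_outputs.append(first.collect())
first._completed_methods.add("collect")
snapshot = first.to_checkpoint()

resumed = ReportFlow.from_checkpoint(snapshot)
ran = resumed.kickoff()
```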

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Contributor

@iris-clawd iris-clawd left a comment

Approved. RuntimeState checkpoint/resume with event record, serializable executors/tools/LLMs, and Flow subclass restoration all look solid after multiple review rounds. Minor follow-ups (EventRecord memory growth for long runs, lazy event type map) can be addressed separately.

@greysonlalonde greysonlalonde merged commit 86ce54f into main Apr 6, 2026
49 checks passed
@greysonlalonde greysonlalonde deleted the chore/runtime-state-event-bus branch April 6, 2026 19:22