
fix(scheduler): restart workflow runners when an agent is restarted#1248

Merged: geoffjay merged 4 commits into main from issue-1103 on May 25, 2026

Conversation

@geoffjay
Owner

When an agent fails and its workflow runners time out waiting for it to reconnect (the 60-second startup window in resume_workflows), no runner is ever started for those workflows. If the agent is later restarted via the API, UI, or reconciliation, the workflows remain dead -- no polling, no dispatches.

Changes

  • scheduler/events.rs: Add AgentRestarted { agent_id } to SystemEvent. Published by restart_agent after the new process is launched and the DB record is updated.
  • manager.rs: Publish SystemEvent::AgentRestarted at the end of restart_agent (via self.registry.event_bus()). This covers all restart paths: API, reconcile, bootstrap, and clear_context.
  • scheduler/mod.rs: Add Scheduler::restart_workflows_for_agent which lists all enabled workflows for the agent, skips any with active runners (idempotent), and calls start_workflow for the dead ones.
  • main.rs: Subscribe to AgentRestarted events in a background task that calls restart_workflows_for_agent reactively.
  • config.rs: Fix two OrchestratorConfig test initializers that were missing the backend field, causing the test binary to fail to compile.
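
The publish/subscribe flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `std::sync::mpsc` stands in for the real event bus reached through `self.registry.event_bus()`, and `AgentId`, `publish_restart`, `drain_restarts`, and the `agent_id` field on `AgentDisconnected` are assumptions made for the example.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical stand-in for the project's agent identifier type.
type AgentId = u64;

// Simplified SystemEvent; only the two variants mentioned in this PR appear here.
// The shape of AgentDisconnected is assumed.
#[derive(Debug, Clone, PartialEq)]
enum SystemEvent {
    AgentDisconnected { agent_id: AgentId },
    AgentRestarted { agent_id: AgentId },
}

// Stand-in for the publish at the end of restart_agent.
fn publish_restart(bus: &Sender<SystemEvent>, agent_id: AgentId) {
    // A send error only means nobody is subscribed; it is ignored in this sketch.
    let _ = bus.send(SystemEvent::AgentRestarted { agent_id });
}

// Stand-in for the subscriber task in main.rs: returns the agents whose
// workflows should be handed to restart_workflows_for_agent.
fn drain_restarts(rx: &Receiver<SystemEvent>) -> Vec<AgentId> {
    rx.try_iter()
        .filter_map(|event| match event {
            SystemEvent::AgentRestarted { agent_id } => Some(agent_id),
            _ => None,
        })
        .collect()
}
```

Keeping a single publish site means every restart path produces the same event, so the subscriber does not need to know which path triggered the restart.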

How it works

  1. Agent fails → 60s timeout → runner exits → runners map has no entry for the workflow
  2. Agent is restarted via POST /agents/{id}/restart
  3. restart_agent succeeds → publishes AgentRestarted { agent_id }
  4. Event subscriber calls restart_workflows_for_agent(agent_id)
  5. Method finds the enabled workflow with no active runner → calls start_workflow
  6. Runner starts and begins polling its task source; dispatches once the agent connects
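
Steps 4 and 5 can be sketched as below. This is a simplified, synchronous illustration under stated assumptions: the `Workflow` and `Scheduler` fields, the `String` runner handle, and the `&mut self` receivers are hypothetical, and the real method is async and backed by storage rather than an in-memory `Vec`.

```rust
use std::collections::HashMap;

// Hypothetical, simplified types; the real Scheduler and Workflow carry more state.
type AgentId = u64;
type WorkflowId = u64;

struct Workflow {
    id: WorkflowId,
    agent_id: AgentId,
    enabled: bool,
}

struct Scheduler {
    workflows: Vec<Workflow>,
    // Stand-in for the runners map: workflow id -> runner handle (a label here).
    runners: HashMap<WorkflowId, String>,
}

impl Scheduler {
    // Start a runner unless one is already active for this workflow.
    fn start_workflow(&mut self, id: WorkflowId) -> Result<(), String> {
        if self.runners.contains_key(&id) {
            return Err(format!("Workflow {id} is already running"));
        }
        self.runners.insert(id, format!("runner-{id}"));
        Ok(())
    }

    // Re-launch runners for the agent's enabled workflows that have none.
    // Returns the ids of the workflows that were actually started.
    fn restart_workflows_for_agent(&mut self, agent_id: AgentId) -> Vec<WorkflowId> {
        let dead: Vec<WorkflowId> = self
            .workflows
            .iter()
            .filter(|w| w.agent_id == agent_id && w.enabled)
            .filter(|w| !self.runners.contains_key(&w.id))
            .map(|w| w.id)
            .collect();
        dead.into_iter()
            .filter(|id| self.start_workflow(*id).is_ok())
            .collect()
    }
}
```

Calling the method a second time for the same agent starts nothing, which is the idempotency property the description relies on.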

Closes #1103

When an agent fails and its workflow runners time out waiting for it to
reconnect (60s startup window in resume_workflows), no runner is ever
started for those workflows. If the agent is later restarted via the API,
UI, or reconciliation, the workflows remain dead with no polling or
dispatches.

Fix by:
- Adding AgentRestarted to SystemEvent, published from restart_agent after
  the new process is launched and the DB record is updated.
- Adding Scheduler::restart_workflows_for_agent which lists enabled
  workflows for the agent, skips any with active runners, and starts new
  runners for the dead ones.
- Subscribing to AgentRestarted in main.rs to call
  restart_workflows_for_agent reactively.
- Fixing two OrchestratorConfig test initializers that were missing the
  backend field, causing test compilation to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@geoffjay added the review-agent label (used to invoke a review by an agent tracking this label) on May 25, 2026
@codecov

codecov Bot commented May 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.77%. Comparing base (e6239dd) to head (757099a).

Additional details and impacted files
@@                     Coverage Diff                      @@
##           feature/autonomous-pipeline    #1248   +/-   ##
============================================================
  Coverage                        63.77%   63.77%           
============================================================
  Files                              173      173           
  Lines                             7733     7733           
  Branches                          2566     2566           
============================================================
  Hits                              4932     4932           
  Misses                            2780     2780           
  Partials                            21       21           
Flag Coverage Δ
frontend 63.77% <ø> (ø)

Owner Author

@geoffjay left a comment

Review: fix(scheduler): restart workflow runners when an agent is restarted

The core approach is correct and well-designed. The event-bus pattern is consistent with the existing AgentDisconnected handler in main.rs, idempotency is properly handled, and the fix targets the root cause. Two blocking issues need to be resolved before merge.

Blocking

1. CI Format check is failing -- The Format CI job (cargo fmt --check) completed with FAILURE. Run cargo fmt, commit, and push before merge.

2. Wrong base branch -- This PR targets main, but CLAUDE.md requires all feature work to target feature/autonomous-pipeline. Please retarget: gh pr edit 1248 --repo geoffjay/agentd --base feature/autonomous-pipeline

Non-blocking suggestions

3. restart_workflows_for_agent receiver (scheduler/mod.rs) -- The method takes self: &Arc<Self> but never clones or stores the Arc. Unlike resume_workflows, it spawns no background tasks, so &self is sufficient and more consistent with start_workflow, stop_workflow, and the other methods in this impl block.

4. list_workflows(None) scans all workflows on every restart (scheduler/mod.rs line 417) -- Storage only supports filtering by project_id, not agent_id. Every agent restart loads all workflows and filters in Rust. Fine for current scale since restarts are rare, but a follow-up to add agent_id filtering to SchedulerStorage::list_workflows would prevent it becoming a hotspot.

5. Misleading error-level log for a benign race (scheduler/mod.rs ~line 449) -- A lingering resume_workflows background task (still inside its 60-second wait) could race with restart_workflows_for_agent and call start_workflow for the same workflow. The loser emits error!(...Failed to re-launch workflow runner...) when the root cause is just "Workflow X is already running", which is benign. Consider using warn!, or checking the error variant to distinguish this case from genuine failures.
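
The variant-checking option in suggestion 5 could look roughly like the sketch below. `StartError`, `Level`, and `log_level_for` are hypothetical names; the actual error type returned by start_workflow is not shown in this PR and may be shaped differently.

```rust
// Hypothetical error type for start_workflow; the real one may differ.
#[derive(Debug, PartialEq)]
enum StartError {
    // A concurrent caller already started a runner for this workflow id.
    AlreadyRunning(u64),
    // Any other failure, carried as a message in this sketch.
    Other(String),
}

// Log severity to use when re-launching a runner fails.
#[derive(Debug, PartialEq)]
enum Level {
    Warn,
    Error,
}

// Treat the benign race as a warning and everything else as an error.
fn log_level_for(err: &StartError) -> Level {
    match err {
        StartError::AlreadyRunning(_) => Level::Warn,
        StartError::Other(_) => Level::Error,
    }
}
```

Matching on a variant, rather than on the error message text, keeps the distinction stable if the wording of the message changes.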

What is well done

  • Complete coverage: all restart paths flow through the private restart_agent method, so the single publish site in manager.rs covers API, reconcile, bootstrap, and clear_context.
  • Correct idempotency: the runners.contains_key check plus start_workflow's own guard prevent double-starting runners.
  • Consistent pattern: the event subscriber in main.rs is structurally identical to the existing AgentDisconnected handler.
  • Test coverage: test_agent_restarted_event and the updated test_publish_all_event_variants cleanly cover the new variant; the bus capacity bump from 16 to 32 in the all-variants test is correct.
  • config.rs fix: adding the missing backend field to the two test initializers is a genuine compilation fix that belongs in this PR.

Please fix the format failure and retarget the base branch, then re-submit for review.

@geoffjay added the needs-rework label (PR has review feedback that must be addressed before merging) and removed the review-agent label (used to invoke a review by an agent tracking this label) on May 25, 2026
- Change restart_workflows_for_agent receiver from self: &Arc<Self> to
  &self — no background spawns are done inside the method, so &self is
  sufficient and consistent with the other methods in this impl block.
- Downgrade the start_workflow error inside restart_workflows_for_agent
  from error! to warn! with an explanatory comment: the "Workflow X is
  already running" result is a benign TOCTOU race between the
  contains_key pre-check and start_workflow when a concurrent
  resume_workflows background waiter wins the same slot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
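
The receiver distinction in this commit can be illustrated with a small sketch. The `Scheduler` struct and both methods here are hypothetical and use `std::thread` in place of the project's async runtime; the point is only when `self: &Arc<Self>` earns its keep.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical minimal scheduler used only to contrast the two receivers.
struct Scheduler {
    name: String,
}

impl Scheduler {
    // &self suffices: nothing borrowed here outlives the call.
    fn describe(&self) -> String {
        format!("scheduler {}", self.name)
    }

    // self: &Arc<Self> is useful when the method clones the Arc so that a
    // background task can keep the scheduler alive after the call returns.
    fn describe_in_background(self: &Arc<Self>) -> thread::JoinHandle<String> {
        let this = Arc::clone(self);
        thread::spawn(move || this.describe())
    }
}
```

A method with a `&self` receiver can still be called through an `Arc<Scheduler>` via auto-deref, so narrowing the receiver does not affect existing callers.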
@geoffjay changed the base branch from main to feature/autonomous-pipeline on May 25, 2026 04:34
@geoffjay added the review-agent label (used to invoke a review by an agent tracking this label) and removed the needs-rework label (PR has review feedback that must be addressed before merging) on May 25, 2026
@geoffjay
Owner Author

Follow-up issue created for review suggestion 4 (agent_id filtering in list_workflows): #1249

geoffjay and others added 2 commits May 25, 2026 11:52
- cargo fmt --workspace: reformat crates/cli, crates/mcp, crates/tui,
  crates/xtask (pure whitespace/style changes, no semantic changes)
- crates/mcp/src/tools/orchestrator_debug.rs: replace sort_by closure
  with sort_by_key(|b| Reverse(b.1)) to satisfy clippy::unnecessary_sort_by

These failures existed on the base branch before this PR's changes and
were surfaced by CI when feature/autonomous-pipeline was created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
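
The sort_by_key change named in this commit follows a common pattern, sketched below. The `(String, u32)` element type and the `sort_by_count_desc` helper are assumptions for illustration; the commit only shows that the sort key is field `.1` of each element.

```rust
use std::cmp::Reverse;

// Sort (name, count) pairs by count, largest first.
fn sort_by_count_desc(mut items: Vec<(String, u32)>) -> Vec<(String, u32)> {
    // Equivalent to items.sort_by(|a, b| b.1.cmp(&a.1)), expressed as a key.
    items.sort_by_key(|b| Reverse(b.1));
    items
}
```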
- tui/app.rs: use is_none_or instead of map_or(true, ...)
- tui/agent_detail.rs: remove unnecessary u16 cast
- tui/input.rs: use enumerate().take() instead of needless range loop
- xtask/platform/mod.rs: allow dead_code on port_env field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@geoffjay changed the base branch from feature/autonomous-pipeline to main on May 25, 2026 20:11
@geoffjay merged commit e4953ae into main on May 25, 2026
@geoffjay deleted the issue-1103 branch on May 25, 2026 20:11

Labels

review-agent: used to invoke a review by an agent tracking this label

Development

Successfully merging this pull request may close these issues.

Restarting a failed agent should re-launch its associated workflow runners

1 participant