fix(scheduler): restart workflow runners when an agent is restarted#1248
Conversation
When an agent fails and its workflow runners time out waiting for it to reconnect (60s startup window in resume_workflows), no runner is ever started for those workflows. If the agent is later restarted via the API, UI, or reconciliation, the workflows remain dead with no polling or dispatches. Fix by: - Adding AgentRestarted to SystemEvent, published from restart_agent after the new process is launched and the DB record is updated. - Adding Scheduler::restart_workflows_for_agent which lists enabled workflows for the agent, skips any with active runners, and starts new runners for the dead ones. - Subscribing to AgentRestarted in main.rs to call restart_workflows_for_agent reactively. - Fixing two OrchestratorConfig test initializers that were missing the backend field, causing test compilation to fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## feature/autonomous-pipeline #1248 +/- ##
============================================================
Coverage 63.77% 63.77%
============================================================
Files 173 173
Lines 7733 7733
Branches 2566 2566
============================================================
Hits 4932 4932
Misses 2780 2780
Partials 21 21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
geoffjay
left a comment
There was a problem hiding this comment.
Review: fix(scheduler): restart workflow runners when an agent is restarted
The core approach is correct and well-designed. The event-bus pattern is consistent with the existing AgentDisconnected handler in main.rs, idempotency is properly handled, and the fix targets the root cause. Two blocking issues need to be resolved before merge.
Blocking
1. CI Format check is failing -- The Format CI job (cargo fmt --check) completed with FAILURE. Run cargo fmt, commit, and push before merge.
2. Wrong base branch -- This PR targets main, but CLAUDE.md requires all feature work to target feature/autonomous-pipeline. Please retarget: gh pr edit 1248 --repo geoffjay/agentd --base feature/autonomous-pipeline
Non-blocking suggestions
3. restart_workflows_for_agent receiver (scheduler/mod.rs) -- The method takes self: &Arc but never clones or stores the Arc. Unlike resume_workflows, no background spawns are done here, so &self is sufficient and more consistent with start_workflow, stop_workflow, and other methods in this impl block.
4. list_workflows(None) scans all workflows on every restart (scheduler/mod.rs line 417) -- Storage only supports filtering by project_id, not agent_id. Every agent restart loads all workflows and filters in Rust. Fine for current scale since restarts are rare, but a follow-up to add agent_id filtering to SchedulerStorage::list_workflows would prevent it becoming a hotspot.
5. Misleading error-level log for a benign race (scheduler/mod.rs ~line 449) -- A lingering resume_workflows background task (still inside its 60-second wait) could race with restart_workflows_for_agent and call start_workflow for the same workflow. The loser emits error!(...Failed to re-launch workflow runner...) when the root cause is just Workflow X is already running, which is benign. Consider warn! or checking the error variant to distinguish this case from genuine failures.
What is well done
- Complete coverage: all restart paths flow through the private restart_agent method, so the single publish site in manager.rs covers API, reconcile, bootstrap, and clear_context.
- Correct idempotency: the runners.contains_key check plus start_workflow own guard prevent double-starting runners.
- Consistent pattern: the event subscriber in main.rs is structurally identical to the existing AgentDisconnected handler.
- Test coverage: test_agent_restarted_event and the updated test_publish_all_event_variants cleanly cover the new variant; the bus capacity bump from 16 to 32 in the all-variants test is correct.
- config.rs fix: adding the missing backend field to the two test initializers is a genuine compilation fix that belongs in this PR.
Please fix the format failure and retarget the base branch, then re-submit for review.
- Change restart_workflows_for_agent receiver from self: &Arc<Self> to &self — no background spawns are done inside the method, so &self is sufficient and consistent with the other methods in this impl block. - Downgrade the start_workflow error inside restart_workflows_for_agent from error! to warn! with an explanatory comment: the "Workflow X is already running" result is a benign TOCTOU race between the contains_key pre-check and start_workflow when a concurrent resume_workflows background waiter wins the same slot. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Follow-up issue created for review suggestion 4 (agent_id filtering in |
- cargo fmt --workspace: reformat crates/cli, crates/mcp, crates/tui, crates/xtask (pure whitespace/style changes, no semantic changes) - crates/mcp/src/tools/orchestrator_debug.rs: replace sort_by closure with sort_by_key(|b| Reverse(b.1)) to satisfy clippy::unnecessary_sort_by These failures existed on the base branch before this PR's changes and were surfaced by CI when feature/autonomous-pipeline was created. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tui/app.rs: use is_none_or instead of map_or(true, ...) - tui/agent_detail.rs: remove unnecessary u16 cast - tui/input.rs: use enumerate().take() instead of needless range loop - xtask/platform/mod.rs: allow dead_code on port_env field Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When an agent fails and its workflow runners time out waiting for it to reconnect (the 60-second startup window in
resume_workflows), no runner is ever started for those workflows. If the agent is later restarted via the API, UI, or reconciliation, the workflows remain dead -- no polling, no dispatches.Changes
scheduler/events.rs: AddAgentRestarted { agent_id }toSystemEvent. Published byrestart_agentafter the new process is launched and the DB record is updated.manager.rs: PublishSystemEvent::AgentRestartedat the end ofrestart_agent(viaself.registry.event_bus()). This covers all restart paths: API, reconcile, bootstrap, andclear_context.scheduler/mod.rs: AddScheduler::restart_workflows_for_agentwhich lists all enabled workflows for the agent, skips any with active runners (idempotent), and callsstart_workflowfor the dead ones.main.rs: Subscribe toAgentRestartedevents in a background task that callsrestart_workflows_for_agentreactively.config.rs: Fix twoOrchestratorConfigtest initializers that were missing thebackendfield, causing the test binary to fail to compile.How it works
runnersmap has no entry for the workflowPOST /agents/{id}/restartrestart_agentsucceeds → publishesAgentRestarted { agent_id }restart_workflows_for_agent(agent_id)start_workflowCloses #1103