fix(scheduler): restart workflow runners when an agent is restarted by geoffjay · Pull Request #1248 · geoffjay/agentd

geoffjay · 2026-05-25T04:21:59Z

When an agent fails and its workflow runners time out waiting for it to reconnect (the 60-second startup window in resume_workflows), no runner is ever started for those workflows. If the agent is later restarted via the API, UI, or reconciliation, the workflows remain dead -- no polling, no dispatches.

Changes

scheduler/events.rs: Add AgentRestarted { agent_id } to SystemEvent. Published by restart_agent after the new process is launched and the DB record is updated.
manager.rs: Publish SystemEvent::AgentRestarted at the end of restart_agent (via self.registry.event_bus()). This covers all restart paths: API, reconcile, bootstrap, and clear_context.
scheduler/mod.rs: Add Scheduler::restart_workflows_for_agent which lists all enabled workflows for the agent, skips any with active runners (idempotent), and calls start_workflow for the dead ones.
main.rs: Subscribe to AgentRestarted events in a background task that calls restart_workflows_for_agent reactively.
config.rs: Fix two OrchestratorConfig test initializers that were missing the backend field, causing the test binary to fail to compile.
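
The publish side of these changes can be sketched as follows. This is a minimal, std-only illustration and not the code from the PR: the `Manager` struct, the `u64` agent id, the `std::sync::mpsc` channel standing in for the event bus, and the `drain_restarted` helper are all assumptions made for the example. Only the `SystemEvent::AgentRestarted { agent_id }` variant and the "publish at the end of `restart_agent`" ordering come from the description above.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Stand-in for the scheduler's event enum; only the new variant is shown.
#[derive(Debug, Clone, PartialEq)]
pub enum SystemEvent {
    /// The real agent id type is not shown in the PR; u64 is an assumption.
    AgentRestarted { agent_id: u64 },
}

/// Hypothetical manager that holds the sending half of the event bus.
pub struct Manager {
    bus: Sender<SystemEvent>,
}

impl Manager {
    pub fn new(bus: Sender<SystemEvent>) -> Self {
        Self { bus }
    }

    /// Every restart path funnels through this method, so a single publish
    /// site covers all of them.
    pub fn restart_agent(&self, agent_id: u64) -> Result<(), String> {
        // Launching the new process and updating the DB record are elided.
        // Publishing last means subscribers only see agents that restarted.
        self.bus
            .send(SystemEvent::AgentRestarted { agent_id })
            .map_err(|e| e.to_string())
    }
}

/// Drains pending events and returns the ids of the restarted agents.
/// The PR runs the subscriber as a background task; this synchronous drain
/// is a simplification for the sketch.
pub fn drain_restarted(rx: &Receiver<SystemEvent>) -> Vec<u64> {
    rx.try_iter()
        .map(|event| match event {
            SystemEvent::AgentRestarted { agent_id } => agent_id,
        })
        .collect()
}
```

Because the event is sent only after the restart work succeeds, a failed restart never triggers a workflow relaunch.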

How it works

Agent fails → 60s timeout → runner exits → runners map has no entry for the workflow
Agent is restarted via POST /agents/{id}/restart
restart_agent succeeds → publishes AgentRestarted { agent_id }
Event subscriber calls restart_workflows_for_agent(agent_id)
Method finds the enabled workflow with no active runner → calls start_workflow
Runner starts and begins polling its task source; dispatches once the agent connects
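
The relaunch step in the flow above might look roughly like the sketch below. It is an illustration under stated assumptions, not the method from scheduler/mod.rs: the `Workflow` and `Runner` types, the integer ids, the in-memory workflow list, the `Mutex<HashMap>` runners map, and the `Vec<u32>` return value are all hypothetical. The names `restart_workflows_for_agent` and `start_workflow`, the `contains_key` pre-check, and the "already running" error come from this PR.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical workflow record; field names are assumptions.
#[derive(Debug, Clone)]
pub struct Workflow {
    pub id: u32,
    pub agent_id: u64,
    pub enabled: bool,
}

/// Placeholder for a live runner handle.
pub struct Runner;

pub struct Scheduler {
    /// In-memory stand-in for workflow storage.
    pub workflows: Vec<Workflow>,
    /// Workflow id -> active runner.
    pub runners: Mutex<HashMap<u32, Runner>>,
}

impl Scheduler {
    pub fn new(workflows: Vec<Workflow>) -> Self {
        Self { workflows, runners: Mutex::new(HashMap::new()) }
    }

    /// Starts a runner, refusing to double-start (the method's own guard).
    pub fn start_workflow(&self, workflow_id: u32) -> Result<(), String> {
        let mut runners = self.runners.lock().unwrap();
        if runners.contains_key(&workflow_id) {
            return Err(format!("Workflow {workflow_id} is already running"));
        }
        runners.insert(workflow_id, Runner);
        Ok(())
    }

    /// Relaunches runners for the agent's enabled workflows that have none.
    /// Returns the ids that were started; calling it again is a no-op.
    pub fn restart_workflows_for_agent(&self, agent_id: u64) -> Vec<u32> {
        let mut started = Vec::new();
        let candidates = self
            .workflows
            .iter()
            .filter(|w| w.enabled && w.agent_id == agent_id);
        for wf in candidates {
            // Pre-check: skip workflows that still have an active runner.
            let already_running = self.runners.lock().unwrap().contains_key(&wf.id);
            if already_running {
                continue;
            }
            if self.start_workflow(wf.id).is_ok() {
                started.push(wf.id);
            }
        }
        started
    }
}
```

The pre-check and the guard inside `start_workflow` are separate lock acquisitions, so another caller can win the slot in between; the guard is what keeps the operation safe in that case.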

Closes #1103

When an agent fails and its workflow runners time out waiting for it to reconnect (60s startup window in resume_workflows), no runner is ever started for those workflows. If the agent is later restarted via the API, UI, or reconciliation, the workflows remain dead with no polling or dispatches.

Fix by:
- Adding AgentRestarted to SystemEvent, published from restart_agent after the new process is launched and the DB record is updated.
- Adding Scheduler::restart_workflows_for_agent which lists enabled workflows for the agent, skips any with active runners, and starts new runners for the dead ones.
- Subscribing to AgentRestarted in main.rs to call restart_workflows_for_agent reactively.
- Fixing two OrchestratorConfig test initializers that were missing the backend field, causing test compilation to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov · 2026-05-25T04:24:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.77%. Comparing base (e6239dd) to head (757099a).

Additional details and impacted files

@@                     Coverage Diff                      @@
##           feature/autonomous-pipeline    #1248   +/-   ##
============================================================
  Coverage                        63.77%   63.77%           
============================================================
  Files                              173      173           
  Lines                             7733     7733           
  Branches                          2566     2566           
============================================================
  Hits                              4932     4932           
  Misses                            2780     2780           
  Partials                            21       21

Flag	Coverage Δ
frontend	`63.77% <ø> (ø)`


geoffjay

Review: fix(scheduler): restart workflow runners when an agent is restarted

The core approach is correct and well-designed. The event-bus pattern is consistent with the existing AgentDisconnected handler in main.rs, idempotency is properly handled, and the fix targets the root cause. Two blocking issues need to be resolved before merge.

Blocking

1. CI Format check is failing -- The Format CI job (cargo fmt --check) completed with FAILURE. Run cargo fmt, commit, and push before merge.

2. Wrong base branch -- This PR targets main, but CLAUDE.md requires all feature work to target feature/autonomous-pipeline. Please retarget: gh pr edit 1248 --repo geoffjay/agentd --base feature/autonomous-pipeline

Non-blocking suggestions

3. restart_workflows_for_agent receiver (scheduler/mod.rs) -- The method takes self: &Arc<Self> but never clones or stores the Arc. Unlike resume_workflows, it spawns no background tasks, so &self is sufficient and more consistent with start_workflow, stop_workflow, and the other methods in this impl block.

4. list_workflows(None) scans all workflows on every restart (scheduler/mod.rs line 417) -- Storage only supports filtering by project_id, not agent_id, so every agent restart loads all workflows and filters them in Rust. This is fine at the current scale since restarts are rare, but a follow-up to add agent_id filtering to SchedulerStorage::list_workflows would prevent it from becoming a hotspot.

5. Misleading error-level log for a benign race (scheduler/mod.rs ~line 449) -- A lingering resume_workflows background task (still inside its 60-second wait) could race with restart_workflows_for_agent and call start_workflow for the same workflow. The loser emits error!(...Failed to re-launch workflow runner...) even though the underlying error is just "Workflow X is already running", which is benign. Consider using warn! or checking the error variant to distinguish this case from genuine failures.
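
Checking the error variant, as suggestion 5 proposes, could be sketched like this. The `StartError` and `LogLevel` enums and the `relaunch_log_level` function are hypothetical names for illustration; the PR does not show the actual error type returned by start_workflow, only its "Workflow X is already running" message.

```rust
/// Hypothetical error type for start_workflow; the real type is not shown.
#[derive(Debug, PartialEq)]
pub enum StartError {
    /// Another caller already started a runner for this workflow id.
    AlreadyRunning(u32),
    /// Any genuine failure, e.g. a storage problem.
    Storage(String),
}

/// Severity to log at when a relaunch attempt fails.
#[derive(Debug, PartialEq)]
pub enum LogLevel {
    Warn,
    Error,
}

/// Maps a relaunch failure to a log level: losing the race is benign,
/// everything else is a real error.
pub fn relaunch_log_level(err: &StartError) -> LogLevel {
    match err {
        StartError::AlreadyRunning(_) => LogLevel::Warn,
        StartError::Storage(_) => LogLevel::Error,
    }
}
```

Matching on a variant avoids string comparison on the error message, at the cost of requiring start_workflow to return a structured error.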

What is well done

Complete coverage: all restart paths flow through the private restart_agent method, so the single publish site in manager.rs covers API, reconcile, bootstrap, and clear_context.
Correct idempotency: the runners.contains_key check plus start_workflow's own guard prevent double-starting runners.
Consistent pattern: the event subscriber in main.rs is structurally identical to the existing AgentDisconnected handler.
Test coverage: test_agent_restarted_event and the updated test_publish_all_event_variants cleanly cover the new variant; the bus capacity bump from 16 to 32 in the all-variants test is correct.
config.rs fix: adding the missing backend field to the two test initializers is a genuine compilation fix that belongs in this PR.

Please fix the format failure and retarget the base branch, then re-submit for review.

- Change restart_workflows_for_agent receiver from self: &Arc<Self> to &self -- no background spawns are done inside the method, so &self is sufficient and consistent with the other methods in this impl block.
- Downgrade the start_workflow error inside restart_workflows_for_agent from error! to warn! with an explanatory comment: the "Workflow X is already running" result is a benign TOCTOU race between the contains_key pre-check and start_workflow when a concurrent resume_workflows background waiter wins the same slot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
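
The receiver change described above can be illustrated with a small sketch. The struct fields and the two method names here are hypothetical, and `std::thread` stands in for whatever task spawning the real scheduler uses; the point is only when `self: &Arc<Self>` is needed and when `&self` is enough.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical scheduler with a single field, for illustration only.
pub struct Scheduler {
    pub name: String,
}

impl Scheduler {
    /// Needs the Arc: it clones the handle into a background thread that
    /// may outlive the caller's borrow.
    pub fn resume_in_background(self: &Arc<Self>) -> thread::JoinHandle<String> {
        let this = Arc::clone(self);
        thread::spawn(move || format!("{} resumed", this.name))
    }

    /// Does all of its work inline, so a plain borrow is sufficient. It is
    /// still callable on an Arc<Scheduler> through auto-deref.
    pub fn restart_inline(&self) -> String {
        format!("{} restarted", self.name)
    }
}
```

A `&self` method is also callable on a scheduler that is not wrapped in an Arc, which makes it easier to exercise in unit tests.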

geoffjay · 2026-05-25T18:47:58Z

Follow-up issue created for review suggestion 4 (agent_id filtering in list_workflows): #1249
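
For context, the behavior the follow-up targets can be sketched as below. This is an in-memory illustration with hypothetical types (`Workflow`, `Storage`, integer ids); the real SchedulerStorage backend and signatures are not shown in this PR. It only demonstrates that with a project_id-only filter, finding one agent's workflows means loading every row first.

```rust
/// Hypothetical workflow row; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Workflow {
    pub id: u32,
    pub project_id: u32,
    pub agent_id: u64,
    pub enabled: bool,
}

/// In-memory stand-in for the storage layer.
pub struct Storage {
    pub rows: Vec<Workflow>,
}

impl Storage {
    /// Mirrors the limitation from the review: only project_id can be
    /// used as a filter, and None returns everything.
    pub fn list_workflows(&self, project_id: Option<u32>) -> Vec<Workflow> {
        self.rows
            .iter()
            .filter(|w| match project_id {
                Some(p) => w.project_id == p,
                None => true,
            })
            .cloned()
            .collect()
    }
}

/// Current approach: load all workflows, then filter by agent in Rust.
/// Returns (rows loaded, ids of the agent's enabled workflows).
pub fn enabled_workflows_for_agent(storage: &Storage, agent_id: u64) -> (usize, Vec<u32>) {
    let all = storage.list_workflows(None);
    let loaded = all.len();
    let ids = all
        .iter()
        .filter(|w| w.enabled && w.agent_id == agent_id)
        .map(|w| w.id)
        .collect();
    (loaded, ids)
}
```

An agent_id filter at the storage level would shrink the first number to the size of the second list's candidate set instead of the whole table.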

- cargo fmt --workspace: reformat crates/cli, crates/mcp, crates/tui, crates/xtask (pure whitespace/style changes, no semantic changes)
- crates/mcp/src/tools/orchestrator_debug.rs: replace sort_by closure with sort_by_key(|b| Reverse(b.1)) to satisfy clippy::unnecessary_sort_by

These failures existed on the base branch before this PR's changes and were surfaced by CI when feature/autonomous-pipeline was created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
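
The sort_by_key rewrite mentioned in this commit follows a common pattern, sketched here on made-up data. The function name and the `(String, u32)` element type are assumptions; the actual data in orchestrator_debug.rs is not shown.

```rust
use std::cmp::Reverse;

/// Sorts (label, count) pairs by count, largest first.
pub fn sort_by_count_desc(mut entries: Vec<(String, u32)>) -> Vec<(String, u32)> {
    // Form flagged by clippy::unnecessary_sort_by:
    //     entries.sort_by(|a, b| b.1.cmp(&a.1));
    // Equivalent key-based form; Reverse flips the ordering of the key.
    entries.sort_by_key(|b| Reverse(b.1));
    entries
}
```

Both forms use a stable sort, so entries with equal counts keep their original relative order.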

- tui/app.rs: use is_none_or instead of map_or(true, ...)
- tui/agent_detail.rs: remove unnecessary u16 cast
- tui/input.rs: use enumerate().take() instead of needless range loop
- xtask/platform/mod.rs: allow dead_code on port_env field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
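
As an illustration of the enumerate().take() change, here is the general shape of that rewrite. The function and its data are hypothetical; the code in tui/input.rs is not shown in this PR.

```rust
/// Returns the first `limit` lines, each prefixed with its index.
pub fn numbered_prefix(lines: &[&str], limit: usize) -> Vec<String> {
    // Form flagged by clippy::needless_range_loop:
    //     for i in 0..limit { out.push(format!("{i}: {}", lines[i])); }
    // which also panics when `limit` exceeds `lines.len()`.
    lines
        .iter()
        .enumerate()
        .take(limit)
        .map(|(i, line)| format!("{i}: {line}"))
        .collect()
}
```

Besides satisfying the lint, the iterator form stops at the end of the slice, so an oversized limit is harmless.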

geoffjay added the review-agent label (used to invoke a review by an agent tracking this label) May 25, 2026

geoffjay mentioned this pull request May 25, 2026

Restarting a failed agent should re-launch its associated workflow runners #1103

Closed


geoffjay added the needs-rework label (PR has review feedback that must be addressed before merging) and removed the review-agent label May 25, 2026

geoffjay changed the base branch from main to feature/autonomous-pipeline May 25, 2026 04:34

geoffjay added the review-agent label and removed the needs-rework label May 25, 2026

geoffjay mentioned this pull request May 25, 2026

perf(scheduler): add agent_id filter to SchedulerStorage::list_workflows #1249

Open

geoffjay and others added 2 commits May 25, 2026 11:52

geoffjay changed the base branch from feature/autonomous-pipeline to main May 25, 2026 20:11

geoffjay merged commit e4953ae into main May 25, 2026

geoffjay deleted the issue-1103 branch May 25, 2026 20:11

geoffjay merged 4 commits into main from issue-1103
