Summary
During a live multi-agent session (a Lead coordinating a Director + 8 parallel review workers + synthesis), we hit a series of observability, state-consistency, and process issues. The core orchestration flow (declare → plan-review → approve → spawn → notify) ultimately worked, but several silent-failure modes made agent liveness and message delivery effectively unobservable, and one tooling gap repeatedly blocked the Lead from acting directly.
All issues below are reported from the perspective of a Lead agent using the FlightDeck tools.
Issues
1. flightdeck_send reports success even for never-spawned / dead agents (silent dead-letter)
flightdeck_send returns {"status":"sent","messageId":"..."} regardless of whether the target is actually running. A Director assigned an unavailable runtime registered but never spawned; messages still returned sent but were never consumed. The sender cannot distinguish "delivered & will be processed" from "queued into a dead inbox."
Fix: Check target liveness at send time; return a warning or delivered:false (or a distinct status) for never-spawned/dead targets.
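A minimal sketch of what a liveness-aware send result could look like. This is not FlightDeck's actual implementation: the state names, the `SendResult` shape, the `queued_undeliverable` status, and the `send_with_liveness_check` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical agent states; the real FlightDeck state model may differ.
RUNNING = "running"
NEVER_SPAWNED = "never_spawned"
DEAD = "dead"


@dataclass
class SendResult:
    status: str                    # "sent" or "queued_undeliverable" (hypothetical)
    message_id: str
    delivered: bool
    warning: Optional[str] = None


def send_with_liveness_check(
    agent_states: Dict[str, str], target: str, message_id: str
) -> SendResult:
    """Return a distinct result when the target cannot consume the message."""
    state = agent_states.get(target)
    if state == RUNNING:
        return SendResult("sent", message_id, delivered=True)
    # Unknown, never-spawned, and dead targets all surface a warning
    # instead of a plain "sent".
    reason = "unknown target" if state is None else f"target is {state}"
    return SendResult("queued_undeliverable", message_id, delivered=False, warning=reason)
```

With a result like this, the sender can branch on `delivered` instead of assuming that `sent` means the message will be processed.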
2. lastHeartbeat is always null — even for busy agents
flightdeck_agent_list returns lastHeartbeat: null for every agent, including ones in status: busy (observed for the Lead itself and a busy Director). If this is a liveness signal, it is non-functional; the only liveness proxy is polling whether tokensIn/Out increase, which is indirect.
Fix: Populate lastHeartbeat on each tick so liveness can be checked directly.
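Once `lastHeartbeat` is populated, a consumer could check liveness directly. The sketch below assumes the heartbeat is exposed as a Unix timestamp in seconds and uses a hypothetical 30-second staleness threshold; neither detail is confirmed by FlightDeck.

```python
from typing import Optional


def is_alive(last_heartbeat: Optional[float], now: float, max_age_s: float = 30.0) -> bool:
    """Treat an agent as alive only if it reported a heartbeat recently.

    A missing heartbeat (None) is treated as "not alive" rather than "unknown",
    so a never-started agent cannot pass the check.
    """
    if last_heartbeat is None:
        return False
    return (now - last_heartbeat) <= max_age_s
```

A caller would pass the current time, for example `is_alive(agent["lastHeartbeat"], time.time())`, instead of polling token counters.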
3. Spawn failures are not surfaced
When an agent is assigned a runtime unavailable on the host (e.g. codex not installed), it registers as idle but never starts, with no error bubbled to Lead/user.
Fix: Validate runtime at spawn time and fail loudly. Distinguish "idle (ran before, waiting)" from "never started."
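A sketch of spawn-time validation, assuming a runtime maps to an executable that must be on `PATH`. That mapping is an assumption, and `SpawnError` and `validate_runtime` are hypothetical names. The lookup function is injectable so the check can be tested without the binary being installed.

```python
import shutil
from typing import Callable, Optional


class SpawnError(RuntimeError):
    """Raised when an agent cannot be started (hypothetical error type)."""


def validate_runtime(
    runtime: str, which: Callable[[str], Optional[str]] = shutil.which
) -> str:
    """Return the resolved executable path, or fail loudly before registering."""
    path = which(runtime)
    if path is None:
        raise SpawnError(f"runtime '{runtime}' not found on PATH; agent not spawned")
    return path
```

Raising before the agent is registered avoids the state seen in the session, where the agent appeared as `idle` although it never started.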
4. Task dependsOn accepts dangling references → silent permanent stall
A synthesis task declared dependsOn: ["olive-review-arch", ...] (logical names) while the real task IDs were task-872342, etc. The dependency never resolved, so the task sat in pending forever. Because it was the only task with notifyLead:true, no one was ever notified — the whole chain silently stalled even though all 8 upstream tasks were done.
Fix: Validate dependsOn references at declare_tasks time; error/warn on dangling dependencies instead of allowing a permanent stall.
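The validation could be as small as the sketch below. It assumes each declared task is a dict with an `id` and an optional `dependsOn` list, and that dependencies may only point at tasks in the same declaration. Both are assumptions about the `declare_tasks` payload, and `find_dangling_dependencies` is a hypothetical helper.

```python
from typing import Dict, List


def find_dangling_dependencies(tasks: List[dict]) -> Dict[str, List[str]]:
    """Map each task ID to the dependsOn entries that match no declared task."""
    known_ids = {task["id"] for task in tasks}
    dangling: Dict[str, List[str]] = {}
    for task in tasks:
        missing = [dep for dep in task.get("dependsOn", []) if dep not in known_ids]
        if missing:
            dangling[task["id"]] = missing
    return dangling
```

For the case above, a synthesis task depending on `olive-review-arch` while the real ID is `task-872342` would be reported at declaration time, and the declaration could be rejected or flagged with a warning.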
5. Task running but assigned worker idle with zero activity → state inconsistency
A synthesis task showed state: running assigned to a worker that showed status: idle, tokensIn/Out: 0, cost: 0. From outside it was indistinguishable from a stalled/dead worker — yet it had actually completed. There is no external way to distinguish a stalled worker from a finished one; both look like idle + 0 tokens + lastHeartbeat:null.
Fix: Reconcile task state with real worker activity; mark stalled when a worker shows no heartbeat/activity, and ensure completion transitions are observable.
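One possible shape for the reconciliation rule, sketched under assumptions: heartbeats are Unix timestamps in seconds, `stalled` is a new task state, and the 120-second threshold is arbitrary. It only makes sense once `lastHeartbeat` is actually populated; with the current always-null value, every running task would be flagged.

```python
from typing import Optional


def reconcile_task_state(
    task_state: str,
    last_heartbeat: Optional[float],
    now: float,
    stall_after_s: float = 120.0,
) -> str:
    """Downgrade a 'running' task to 'stalled' when its worker has gone quiet."""
    if task_state != "running":
        # Only running tasks are reconciled; done/pending are left untouched.
        return task_state
    if last_heartbeat is None or (now - last_heartbeat) > stall_after_s:
        return "stalled"
    return "running"
```

A finished worker would still need to emit an explicit completion transition; this rule only prevents a quiet worker from being reported as `running` indefinitely.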
6. copilot-sdk runtime does not report token/cost metrics
A worker on copilot-sdk + claude-opus-4.7 completed real work but reported tokensIn/Out: 0 and cost: 0, compounding issue #5 (the zeroed metrics made a finished worker look dead).
Fix: Ensure token/cost accounting is wired for the copilot-sdk runtime.
7. tasks_declared_notify reported the wrong task count
The Director declared 9 tasks (8 investigation + 1 synthesis), but the system notification said "declared 1 task(s)" and named only the synthesis task.
Fix: Report the accurate count and ideally list (or summarize) all declared tasks.
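As an illustration only, the notification text could be derived from the full list of declared task IDs rather than from a single task. The message format and the `declared_notification` helper are hypothetical.

```python
from typing import List


def declared_notification(task_ids: List[str], max_listed: int = 10) -> str:
    """Build a notification that reports the true count and lists the task IDs."""
    listed = ", ".join(task_ids[:max_listed])
    remaining = len(task_ids) - max_listed
    suffix = f" (+{remaining} more)" if remaining > 0 else ""
    return f"declared {len(task_ids)} task(s): {listed}{suffix}"
```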
8. reviewer steer failed; reviewers had to be cleared to complete a task
On one task the plan event reported "reviewer steer failed"; the Director had to clear reviewers to complete it.
Fix: Investigate the reviewer-steer path and make it robust, so that a failed steer does not force reviewers to be cleared.
9. (Tooling) The Lead has no tool to execute shell/gh commands directly
The Lead's bash-related tools are limited to read_bash / stop_bash / list_bash (read/stop/list existing sessions) — there is no tool to start/write a new bash command. As a result the Lead cannot run gh itself and must delegate every shell action. Worse, generic task subagents repeatedly refused or got confused, claiming no bash access, which blocked GitHub-issue creation for several rounds until work was routed through Director → worker.
Suggested fixes:
- Clarify in the Lead's system prompt that it cannot execute shell commands directly (only read/stop/list bash sessions), so it routes shell work to workers immediately instead of stalling.
- Ensure task subagents reliably know whether they can execute shell commands, and don't refuse spuriously.
- Consider a dedicated "report-bug-to-flightdeck" skill that standardizes "collect observations → format → gh issue create on flightdeck-dev/flightdeck-2", so this path doesn't depend on ad-hoc prompts or on the Lead having shell access.
What worked well (context)
- Reliable message send/receive with returned message IDs.
- Clean declare-tasks → plan-review → approve → spawn flow.
- Async completion notifications (notifyLead) enabling fire-and-forget delegation (once the dependency/notify wiring was correct, the completion notifications fired reliably).
- Parallel fan-out of 8 read-only workers + opus synthesis produced a high-quality consolidated report with zero repo modifications.
Priority
High: #1–#6 (liveness/delivery/state observability — these caused real blocked/stalled chains with no surfaced error). Medium: #9 tooling/UX (caused repeated stalls). Low: #7, #8 (notification accuracy, reviewer-steer robustness).