Observability & process gaps surfaced during a live multi-agent session (delivery, heartbeat, spawn, dependency validation, runtime metrics, notifications, tooling)

## Summary

During a live multi-agent session (a Lead coordinating a Director + 8 parallel review workers + synthesis), we hit a series of observability, state-consistency, and process issues. The core orchestration flow (declare → plan-review → approve → spawn → notify) ultimately worked, but several silent-failure modes made agent liveness and message delivery effectively unobservable, and one tooling gap repeatedly blocked the Lead from acting directly.

All issues below are reported from the perspective of a Lead agent using the FlightDeck tools.

## Issues

### 1. `flightdeck_send` reports success even for never-spawned / dead agents (silent dead-letter)
`flightdeck_send` returns `{"status":"sent","messageId":"..."}` regardless of whether the target is actually running. In this session, a Director that had been assigned an unavailable runtime was registered but never spawned; messages sent to it still returned `sent` but were never consumed. The sender cannot distinguish "delivered & will be processed" from "queued into a dead inbox."
**Fix:** Check target liveness at send time; return a warning or `delivered:false` (or a distinct status) for never-spawned/dead targets.
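
One possible shape for a liveness-aware send result is sketched below. This is only an illustration: `AgentState`, `SendResult`, `send_with_liveness`, and the `queued_undeliverable` status are hypothetical names, not part of the FlightDeck API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AgentState(Enum):
    # Hypothetical lifecycle states; FlightDeck's real state model may differ.
    NEVER_SPAWNED = "never_spawned"
    RUNNING = "running"
    DEAD = "dead"


@dataclass
class SendResult:
    status: str                    # "sent" or "queued_undeliverable" (assumed names)
    message_id: str
    delivered: bool
    warning: Optional[str] = None  # human-readable reason when not delivered


def send_with_liveness(message_id: str, target_state: AgentState) -> SendResult:
    """Return a result that separates live targets from dead inboxes."""
    if target_state is AgentState.RUNNING:
        return SendResult("sent", message_id, delivered=True)
    # Never-spawned and dead targets get a distinct status instead of "sent".
    return SendResult(
        "queued_undeliverable",
        message_id,
        delivered=False,
        warning=f"target is {target_state.value}; message will not be consumed",
    )
```

With a result like this, the sender can branch on `delivered` instead of treating every returned `messageId` as proof of delivery.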

### 2. `lastHeartbeat` is always `null` — even for `busy` agents
`flightdeck_agent_list` returns `lastHeartbeat: null` for every agent, including ones in `status: busy` (observed for the Lead itself and a busy Director). If this field is meant to be a liveness signal, it is non-functional. The only available liveness proxy is polling whether `tokensIn`/`tokensOut` increase, which is indirect and fails outright for runtimes that report no token metrics (see issue 6).
**Fix:** Populate `lastHeartbeat` on each tick so liveness can be checked directly.
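
A minimal sketch of what a populated heartbeat would enable, assuming each agent tick records a timestamp. `HeartbeatRegistry` and the 30-second threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


class HeartbeatRegistry:
    """Tracks the last heartbeat per agent (hypothetical helper, not a FlightDeck class)."""

    def __init__(self, max_age_s: float = 30.0) -> None:
        self.max_age_s = max_age_s
        self._last: Dict[str, float] = {}

    def tick(self, agent_id: str, now: float) -> None:
        # Called on every agent tick so lastHeartbeat is never left null.
        self._last[agent_id] = now

    def last_heartbeat(self, agent_id: str) -> Optional[float]:
        return self._last.get(agent_id)

    def is_alive(self, agent_id: str, now: float) -> bool:
        # An agent with no recorded heartbeat is treated as not alive.
        last = self._last.get(agent_id)
        return last is not None and (now - last) <= self.max_age_s
```

Callers pass `now` explicitly (for example `time.time()`), which keeps the check deterministic and easy to test.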

### 3. Spawn failures are not surfaced
When an agent is assigned a runtime unavailable on the host (e.g. `codex` not installed), it registers as `idle` but never starts, with no error bubbled to Lead/user.
**Fix:** Validate runtime at spawn time and fail loudly. Distinguish "idle (ran before, waiting)" from "never started."
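
As a rough sketch, a spawn-time check could look like the following. It assumes a runtime name such as `codex` maps to an executable on `PATH`, which may not match how FlightDeck resolves runtimes; `SpawnError` and `validate_runtime` are hypothetical names.

```python
import shutil


class SpawnError(RuntimeError):
    """Raised when an agent cannot be started (hypothetical error type)."""


def validate_runtime(runtime_cmd: str) -> str:
    """Return the resolved executable path, or fail loudly before registering the agent."""
    path = shutil.which(runtime_cmd)
    if path is None:
        # Surface the failure instead of registering an agent that never starts.
        raise SpawnError(
            f"runtime '{runtime_cmd}' not found on PATH; agent was not spawned"
        )
    return path
```

Raising before registration means the Lead sees an error at spawn time rather than an `idle` agent that never ran.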

### 4. Task `dependsOn` accepts dangling references → silent permanent stall
A synthesis task declared `dependsOn: ["olive-review-arch", ...]` (logical names) while the real task IDs were `task-872342`, etc. The dependency never resolved, so the task sat in `pending` forever. Because it was the only task with `notifyLead:true`, no one was ever notified — the whole chain silently stalled even though all 8 upstream tasks were `done`.
**Fix:** Validate `dependsOn` references at `declare_tasks` time; error/warn on dangling dependencies instead of allowing a permanent stall.
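
The validation could be as simple as the sketch below, which checks every `dependsOn` entry against the set of declared task IDs. The task dictionary shape (`id`, `dependsOn`) and the function name are assumptions for illustration, and `task-synthesis` is a made-up ID.

```python
def find_dangling_dependencies(tasks: list[dict]) -> dict[str, list[str]]:
    """Map each task ID to the dependsOn entries that match no declared task."""
    known_ids = {task["id"] for task in tasks}
    dangling: dict[str, list[str]] = {}
    for task in tasks:
        missing = [dep for dep in task.get("dependsOn", []) if dep not in known_ids]
        if missing:
            dangling[task["id"]] = missing
    return dangling


# Example mirroring the report: a logical name used where a real task ID was needed.
declared = [
    {"id": "task-872342"},
    {"id": "task-synthesis", "dependsOn": ["olive-review-arch", "task-872342"]},
]
problems = find_dangling_dependencies(declared)
```

A non-empty result at declare time would be grounds to reject or warn, instead of leaving the task in `pending` indefinitely.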

### 5. Task `running` but assigned worker `idle` with zero activity → state inconsistency
A synthesis task showed `state: running` assigned to a worker that showed `status: idle`, `tokensIn/Out: 0`, `cost: 0`. From outside it was indistinguishable from a stalled/dead worker — yet it had actually completed. There is **no external way to distinguish a stalled worker from a finished one**; both look like `idle + 0 tokens + lastHeartbeat:null`.
**Fix:** Reconcile task state with real worker activity; mark `stalled` when a worker shows no heartbeat/activity, and ensure completion transitions are observable.
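
The reconciliation rule could be sketched as follows. It assumes a working heartbeat and an explicit completion timestamp on the task; the `completed_at` parameter, the `stalled` state, and the 120-second threshold are hypothetical.

```python
from typing import Optional


def reconcile_task_state(
    task_state: str,
    completed_at: Optional[float],
    last_heartbeat: Optional[float],
    now: float,
    stall_after_s: float = 120.0,
) -> str:
    """Derive an externally observable task state from worker signals."""
    if completed_at is not None:
        # An explicit completion record wins over any stale "running" flag.
        return "done"
    if task_state != "running":
        return task_state
    if last_heartbeat is None or (now - last_heartbeat) > stall_after_s:
        # Running on paper, but the worker shows no recent activity.
        return "stalled"
    return "running"
```

The key point is that "finished" and "stalled" are decided by different signals, so the two cases no longer collapse into the same `idle` plus zero-activity picture.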

### 6. `copilot-sdk` runtime does not report token/cost metrics
A worker on `copilot-sdk` + `claude-opus-4.7` completed real work but reported `tokensIn/Out: 0` and `cost: 0`, compounding issue #5 (the zeroed metrics made a finished worker look dead).
**Fix:** Ensure token/cost accounting is wired for the `copilot-sdk` runtime.

### 7. `tasks_declared_notify` reported the wrong task count
The Director declared 9 tasks (8 investigation + 1 synthesis), but the system notification said "declared **1** task(s)" and named only the synthesis task.
**Fix:** Report the accurate count and ideally list (or summarize) all declared tasks.
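
For illustration, a notification formatter that reports the full count and summarizes the task list might look like this. The function name, message wording, and `max_listed` cutoff are assumptions, not the existing notification format.

```python
def format_declared_notification(task_titles: list[str], max_listed: int = 5) -> str:
    """Build a notification that states the true count and names the first few tasks."""
    count = len(task_titles)
    listed = ", ".join(task_titles[:max_listed])
    extra = count - max_listed
    suffix = f", and {extra} more" if extra > 0 else ""
    return f"declared {count} task(s): {listed}{suffix}"
```

The count is always taken from the full list, so truncating the displayed names cannot change the reported number.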

### 8. `reviewer steer` failed; reviewers had to be cleared to complete a task
On one task, the plan event reported "reviewer steer failed", and the Director had to clear `reviewers` before the task could complete.
**Fix:** Investigate the reviewer-steer path for robustness.

### 9. (Tooling) The Lead has no tool to execute shell/`gh` commands directly
The Lead's bash-related tools are limited to `read_bash` / `stop_bash` / `list_bash` (read/stop/list existing sessions) — there is **no tool to start/write a new bash command**. As a result the Lead cannot run `gh` itself and must delegate every shell action. Worse, generic `task` subagents repeatedly **refused or got confused**, claiming no bash access, which blocked GitHub-issue creation for several rounds until work was routed through Director → worker.
**Suggested fixes:**
- Clarify in the Lead's system prompt that it cannot execute shell commands directly (only read/stop/list bash sessions), so it routes shell work to workers immediately instead of stalling.
- Ensure `task` subagents reliably know whether they can execute shell commands, and don't refuse spuriously.
- Consider a dedicated **"report-bug-to-flightdeck" skill** that standardizes "collect observations → format → `gh issue create` on flightdeck-dev/flightdeck-2", so this path doesn't depend on ad-hoc prompts or on the Lead having shell access.
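
The "collect observations → format → `gh issue create`" path from the last bullet could be sketched as below. The helper only builds the argument list and leaves execution to an agent that has shell access; `build_issue_command` and the bullet-list body layout are hypothetical, while `--repo`, `--title`, and `--body` are standard `gh issue create` flags.

```python
def build_issue_command(
    title: str,
    observations: list[str],
    repo: str = "flightdeck-dev/flightdeck-2",
) -> list[str]:
    """Format observations into an issue body and return the gh argv list."""
    body = "\n".join(f"- {item}" for item in observations)
    return [
        "gh", "issue", "create",
        "--repo", repo,
        "--title", title,
        "--body", body,
    ]


# Example: the resulting list can be handed to subprocess.run by a worker with shell access.
cmd = build_issue_command(
    "lastHeartbeat is always null",
    ["observed for the Lead", "observed for a busy Director"],
)
```

Passing an argument list rather than a single shell string avoids quoting problems when observations contain backticks or quotes.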

## What worked well (context)
- Reliable message send/receive with returned message IDs.
- Clean declare-tasks → plan-review → approve → spawn flow.
- Async completion notifications (`notifyLead`) enabling fire-and-forget delegation (once the dependency/notify wiring was correct, the completion notifications fired reliably).
- Parallel fan-out of 8 read-only workers + opus synthesis produced a high-quality consolidated report with zero repo modifications.

## Priority
High: #1–#6 (liveness/delivery/state observability — these caused real blocked/stalled chains with no surfaced error). Medium: #9 tooling/UX (caused repeated stalls). Low: #7, #8 (notification accuracy, reviewer-steer robustness).

