Description
Source
MAST: Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657, UC Berkeley, March 2025)
https://arxiv.org/abs/2503.13657
Core Finding
Analysis of 1,642 execution traces across 7 multi-agent frameworks found that coordination breakdowns account for 36.9% of all failures -- the single largest failure category. The taxonomy identifies 14 failure modes across three clusters:
- System design failures -- agent boundaries too coarse, missing capability overlap
- Inter-agent misalignment -- ambiguous task handoff, missing context propagation, incompatible role assumptions
- Task verification failures -- no confirmation that subtask output meets specification before continuing
The inter-agent misalignment cluster (most relevant to Zeph) covers:
- Ambiguous handoff: the receiving agent doesn't have enough context to continue correctly
- Missing context: key facts known to the orchestrator are not passed to the sub-agent
- Role assumption mismatch: orchestrator assumes capabilities the sub-agent doesn't have
Current Zeph Gap
Zeph's sub-agent handoff uses YAML files with ad-hoc context fields (id, parent, agent, timestamp, status, context, output, next). The context and output fields are free-form, and there is no schema validation that required context is present before a handoff is issued.
From MAST's taxonomy, the most likely failure modes in Zeph are:
- Missing context: sub-agent receives task description but not the supporting facts (e.g., which crate, which function, what the constraint is)
- Ambiguous task: handoff says "fix the bug" without specifying the exact error or reproduction path
- No verification gate: the orchestrator proceeds to the next phase without confirming the sub-agent's output is valid
Implementation Sketch
- Define a typed HandoffContext struct with required fields per agent type (architect needs: spec files, constraints; developer needs: spec + architect output; tester needs: test plan + implementation)
- Add a validation step in SubAgentManager before dispatching: verify required context fields are present
- Add a verification step after sub-agent completion: check that output fields match expected structure before triggering the next phase
- Log handoff quality metrics: context completeness score, verification pass/fail rate
This does not require architectural changes to the sub-agent system -- it's a validation layer on top of the existing YAML handoff protocol.
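As a rough illustration of the first, second, and fourth steps, a typed context with a pre-dispatch check could look like the sketch below. It assumes a `HandoffContext` enum with one variant per role; the field names, the `missing_fields`, `completeness`, and `validate` methods, and the treatment of empty strings as missing are all hypothetical and not taken from the zeph-subagent crate.

```rust
/// Hypothetical typed handoff context: one variant per agent role.
/// `Option` fields model values that a free-form YAML `context` may omit.
#[derive(Debug, Clone)]
pub enum HandoffContext {
    Architect { spec_files: Vec<String>, constraints: Vec<String> },
    Developer { spec: Option<String>, architect_output: Option<String> },
    Tester { test_plan: Option<String>, implementation: Option<String> },
}

impl HandoffContext {
    /// Lists each required field for this role and whether it is present.
    fn required(&self) -> Vec<(&'static str, bool)> {
        // Assumption: a blank string counts as missing.
        fn has(s: &Option<String>) -> bool {
            s.as_deref().map_or(false, |v| !v.trim().is_empty())
        }
        match self {
            HandoffContext::Architect { spec_files, constraints } => vec![
                ("spec_files", !spec_files.is_empty()),
                ("constraints", !constraints.is_empty()),
            ],
            HandoffContext::Developer { spec, architect_output } => vec![
                ("spec", has(spec)),
                ("architect_output", has(architect_output)),
            ],
            HandoffContext::Tester { test_plan, implementation } => vec![
                ("test_plan", has(test_plan)),
                ("implementation", has(implementation)),
            ],
        }
    }

    /// Names of required fields that are absent or empty.
    pub fn missing_fields(&self) -> Vec<&'static str> {
        self.required()
            .into_iter()
            .filter(|(_, present)| !*present)
            .map(|(name, _)| name)
            .collect()
    }

    /// Context completeness score: fraction of required fields present (0.0 to 1.0).
    pub fn completeness(&self) -> f64 {
        let req = self.required();
        let present = req.iter().filter(|(_, ok)| *ok).count();
        present as f64 / req.len() as f64
    }

    /// Pre-dispatch gate: Ok if every required field is present,
    /// otherwise the list of missing field names.
    pub fn validate(&self) -> Result<(), Vec<&'static str>> {
        let missing = self.missing_fields();
        if missing.is_empty() { Ok(()) } else { Err(missing) }
    }
}
```

Under these assumptions, a Developer handoff carrying a spec but no architect output would fail `validate()` with `["architect_output"]` and report a completeness of 0.5, which is the kind of value the handoff quality log could record.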
Complexity
Medium. Requires: a typed HandoffContext per agent role (for example, an enum with one variant per role, each carrying that role's required fields), validation logic in SubAgentManager, and either LLM-based or schema-based output verification. No changes to transport or agent execution.
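For the schema-based option, the post-completion check can stay small. The sketch below assumes the sub-agent's `output` has already been parsed into a string map; the `Output` alias, `Verdict`, `verify_output`, and `VerificationStats` are hypothetical names, and the expected keys would be defined per agent role.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the parsed YAML `output` field.
pub type Output = BTreeMap<String, String>;

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    /// Keys that were missing or blank in the sub-agent output.
    Fail(Vec<String>),
}

/// Schema-based check: every expected key must be present and non-blank.
pub fn verify_output(output: &Output, expected_keys: &[&str]) -> Verdict {
    let problems: Vec<String> = expected_keys
        .iter()
        .filter(|k| output.get(**k).map_or(true, |v| v.trim().is_empty()))
        .map(|k| (*k).to_string())
        .collect();
    if problems.is_empty() { Verdict::Pass } else { Verdict::Fail(problems) }
}

/// Running counters behind a verification pass/fail rate metric.
#[derive(Debug, Default)]
pub struct VerificationStats {
    pub passed: u32,
    pub failed: u32,
}

impl VerificationStats {
    pub fn record(&mut self, verdict: &Verdict) {
        match verdict {
            Verdict::Pass => self.passed += 1,
            Verdict::Fail(_) => self.failed += 1,
        }
    }

    /// Pass rate in [0, 1], or None before any verification has run.
    pub fn pass_rate(&self) -> Option<f64> {
        let total = self.passed + self.failed;
        if total == 0 { None } else { Some(self.passed as f64 / total as f64) }
    }
}
```

A structural check like this only confirms that the expected fields exist; whether their content is correct would still need the LLM-based option or a role-specific check.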
Expected Benefit
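- Targets the coordination failures that make up 36.9% of failures in the MAST traces (a share measured on the frameworks in that study, not on Zeph), which likely makes this one of the more impactful single changes for multi-agent reliability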
- Makes handoff failures explicit and debuggable instead of silent
- Enables better error recovery: if context is missing, the orchestrator can request it before dispatching
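The recovery path in the last point might be sketched as follows, assuming the orchestrator keeps its known facts in a string map and tries to fill gaps from it before dispatching. `Facts`, `Dispatch`, and `prepare_handoff` are hypothetical names, not part of SubAgentManager.

```rust
use std::collections::BTreeMap;

/// Hypothetical key/value view of a handoff context or of orchestrator knowledge.
pub type Facts = BTreeMap<String, String>;

#[derive(Debug, PartialEq)]
pub enum Dispatch {
    /// Context is complete; safe to hand off to the sub-agent.
    Ready(Facts),
    /// Required fields the orchestrator could not supply; reported explicitly
    /// instead of dispatching with a silent gap.
    Blocked(Vec<String>),
}

/// Fills missing required fields from orchestrator knowledge before dispatch.
pub fn prepare_handoff(mut context: Facts, required: &[&str], orchestrator_facts: &Facts) -> Dispatch {
    let mut unresolved = Vec::new();
    for key in required {
        if context.contains_key(*key) {
            continue;
        }
        match orchestrator_facts.get(*key) {
            // The orchestrator knows the fact: propagate it into the handoff.
            Some(value) => {
                context.insert((*key).to_string(), value.clone());
            }
            // Nobody knows it yet: block and report the field by name.
            None => unresolved.push((*key).to_string()),
        }
    }
    if unresolved.is_empty() {
        Dispatch::Ready(context)
    } else {
        Dispatch::Blocked(unresolved)
    }
}
```

With a context that only says "fix the bug", a required `crate` field known to the orchestrator would be filled in automatically, while an unknown `function` field would yield `Blocked(["function"])` so the orchestrator can ask for it before any sub-agent runs.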
See Also
- SubAgentManager: zeph-subagent crate
- Handoff protocol: .local/handoff/ + rust-agent-handoff skill
- Decentralized task allocation: https://www.nature.com/articles/s41598-025-21709-9
- Multi-agent coordination survey: https://arxiv.org/html/2501.06322v1