research(orchestration): MAST-informed handoff hardening to reduce multi-agent coordination failures #2023

@bug-ops

Source

MAST: Why Do Multi-Agent LLM Systems Fail? (arXiv 2503.13657, UC Berkeley, March 2025)
https://arxiv.org/abs/2503.13657

Core Finding

Analysis of 1,642 execution traces across 7 multi-agent frameworks found that coordination breakdowns account for 36.9% of all failures -- the single largest failure category. The taxonomy identifies 14 failure modes across three clusters:

  1. System design failures -- agent boundaries too coarse, missing capability overlap
  2. Inter-agent misalignment -- ambiguous task handoff, missing context propagation, incompatible role assumptions
  3. Task verification failures -- no confirmation that subtask output meets specification before continuing

The inter-agent misalignment cluster (most relevant to Zeph) covers:

  • Ambiguous handoff: the receiving agent doesn't have enough context to continue correctly
  • Missing context: key facts known to the orchestrator are not passed to the sub-agent
  • Role assumption mismatch: orchestrator assumes capabilities the sub-agent doesn't have

Current Zeph Gap

Zeph's sub-agent handoff uses YAML files with ad-hoc context fields (id, parent, agent, timestamp, status, context, output, next). The context and output fields are free-form, and there is no schema validation that required context is present before a handoff is issued.

From MAST's taxonomy, the most likely failure modes in Zeph are:

  • Missing context: sub-agent receives task description but not the supporting facts (e.g., which crate, which function, what the constraint is)
  • Ambiguous task: handoff says "fix the bug" without specifying the exact error or reproduction path
  • No verification gate: the orchestrator proceeds to the next phase without confirming the sub-agent's output is valid

Implementation Sketch

  1. Define a typed HandoffContext struct with required fields per agent type (architect needs: spec files, constraints; developer needs: spec + architect output; tester needs: test plan + implementation)
  2. Add a validation step in SubAgentManager before dispatching: verify required context fields are present
  3. Add a verification step after sub-agent completion: check that output fields match expected structure before triggering the next phase
  4. Log handoff quality metrics: context completeness score, verification pass/fail rate
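Steps 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not Zeph's actual API: the variant names, field names, and the `MissingContext` error type are all assumptions, and "present" is interpreted here as "non-empty", since the type system already guarantees the fields exist.

```rust
use std::path::PathBuf;

/// Hypothetical role-specific handoff context. Each variant carries the
/// fields that role needs, following the list in step 1.
#[derive(Debug, Clone)]
pub enum HandoffContext {
    Architect { spec_files: Vec<PathBuf>, constraints: Vec<String> },
    Developer { spec_files: Vec<PathBuf>, architect_output: String },
    Tester { test_plan: String, implementation: String },
}

/// Hypothetical error: which required fields were empty, and for which role.
#[derive(Debug, PartialEq, Eq)]
pub struct MissingContext {
    pub role: &'static str,
    pub fields: Vec<&'static str>,
}

impl HandoffContext {
    pub fn role(&self) -> &'static str {
        match self {
            HandoffContext::Architect { .. } => "architect",
            HandoffContext::Developer { .. } => "developer",
            HandoffContext::Tester { .. } => "tester",
        }
    }

    /// Pre-dispatch check: every required field must be non-empty.
    pub fn validate(&self) -> Result<(), MissingContext> {
        let mut missing = Vec::new();
        match self {
            HandoffContext::Architect { spec_files, constraints } => {
                if spec_files.is_empty() { missing.push("spec_files"); }
                if constraints.is_empty() { missing.push("constraints"); }
            }
            HandoffContext::Developer { spec_files, architect_output } => {
                if spec_files.is_empty() { missing.push("spec_files"); }
                if architect_output.trim().is_empty() { missing.push("architect_output"); }
            }
            HandoffContext::Tester { test_plan, implementation } => {
                if test_plan.trim().is_empty() { missing.push("test_plan"); }
                if implementation.trim().is_empty() { missing.push("implementation"); }
            }
        }
        if missing.is_empty() {
            Ok(())
        } else {
            Err(MissingContext { role: self.role(), fields: missing })
        }
    }
}
```

In this sketch, SubAgentManager would call `validate()` before dispatching and refuse the handoff on `Err`, which names the exact fields that were empty.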

This does not require architectural changes to the sub-agent system -- it's a validation layer on top of the existing YAML handoff protocol.

Complexity

Medium. Requires: a typed HandoffContext definition per agent role (a struct per role, or one enum with a variant per role), validation logic in SubAgentManager, and either LLM-based or schema-based output verification. No changes to transport or agent execution.
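A schema-based variant of the output check, together with the two metrics from the implementation sketch, might look like this. Everything here is an assumption for illustration: the `SubAgentOutput` shape, the specific checks, and the metric definitions are not taken from Zeph.

```rust
/// Hypothetical structured view of a developer sub-agent's `output` field.
#[derive(Debug, Default)]
pub struct SubAgentOutput {
    pub summary: String,
    pub changed_files: Vec<String>,
    /// None means the sub-agent did not report a test run.
    pub tests_passed: Option<bool>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum VerifyError {
    EmptySummary,
    NoChangedFiles,
    TestsNotRun,
    TestsFailed,
}

/// Post-completion gate: collect every structural problem instead of
/// stopping at the first, so the log shows the full picture.
pub fn verify_developer_output(out: &SubAgentOutput) -> Result<(), Vec<VerifyError>> {
    let mut errors = Vec::new();
    if out.summary.trim().is_empty() { errors.push(VerifyError::EmptySummary); }
    if out.changed_files.is_empty() { errors.push(VerifyError::NoChangedFiles); }
    match out.tests_passed {
        None => errors.push(VerifyError::TestsNotRun),
        Some(false) => errors.push(VerifyError::TestsFailed),
        Some(true) => {}
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}

/// Fraction of required context fields that were present (1.0 if none required).
pub fn completeness_score(present: usize, required: usize) -> f64 {
    if required == 0 { 1.0 } else { present as f64 / required as f64 }
}

/// Running verification pass/fail counters.
#[derive(Debug, Default)]
pub struct HandoffMetrics {
    passed: u32,
    failed: u32,
}

impl HandoffMetrics {
    pub fn record(&mut self, passed: bool) {
        if passed { self.passed += 1 } else { self.failed += 1 }
    }

    /// Pass rate in [0, 1], or None before any verification has run.
    pub fn pass_rate(&self) -> Option<f64> {
        let total = self.passed + self.failed;
        if total == 0 { None } else { Some(self.passed as f64 / total as f64) }
    }
}
```

An LLM-based verifier could sit behind the same `Result<(), Vec<VerifyError>>` signature, so the orchestrator's gating logic would not need to know which kind of check ran.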

Expected Benefit

  • Targets the coordination failures that the MAST analysis puts at 36.9% of all failures (likely the most impactful single change for multi-agent reliability)
  • Makes handoff failures explicit and debuggable instead of silent
  • Enables better error recovery: if context is missing, the orchestrator can request it before dispatching
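The recovery path in the last bullet could be sketched as below. It models the existing free-form context as a string map and assumes the orchestrator keeps its own map of known facts; the function names and the key names used in the usage note are hypothetical.

```rust
use std::collections::HashMap;

/// Names of required keys that are absent or blank in `context`.
pub fn missing_keys(required: &[&str], context: &HashMap<String, String>) -> Vec<String> {
    required
        .iter()
        .filter(|k| context.get(**k).map_or(true, |v| v.trim().is_empty()))
        .map(|k| (*k).to_string())
        .collect()
}

/// Try to fill gaps from facts the orchestrator already knows, then re-check.
/// Returns the completed context, or the keys that are still missing.
pub fn prepare_handoff(
    required: &[&str],
    mut context: HashMap<String, String>,
    known_facts: &HashMap<String, String>,
) -> Result<HashMap<String, String>, Vec<String>> {
    for key in missing_keys(required, &context) {
        if let Some(value) = known_facts.get(&key) {
            context.insert(key, value.clone());
        }
    }
    let still_missing = missing_keys(required, &context);
    if still_missing.is_empty() {
        Ok(context)
    } else {
        Err(still_missing)
    }
}
```

For a "fix the bug" handoff, `required` might be something like `["crate", "error"]`; an `Err` then tells the orchestrator exactly what to ask for before dispatching, instead of letting the sub-agent guess.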

Labels: P3 (Research, medium-high complexity); orchestration (Task orchestration / DAG scheduling); research (Research-driven improvement)