research(orchestration): MAST-informed handoff hardening to reduce multi-agent coordination failures #2023

## Source

MAST: Why Do Multi-Agent LLM Systems Fail? (arXiv 2503.13657, UC Berkeley, March 2025)
https://arxiv.org/abs/2503.13657

## Core Finding

Analysis of 1,642 execution traces across 7 multi-agent frameworks found that **coordination breakdowns (the inter-agent misalignment cluster) account for 36.9% of all observed failures** -- one of the largest failure categories. The taxonomy identifies 14 failure modes across three clusters:

1. **System design failures** -- agent boundaries too coarse, missing capability overlap
2. **Inter-agent misalignment** -- ambiguous task handoff, missing context propagation, incompatible role assumptions
3. **Task verification failures** -- no confirmation that subtask output meets specification before continuing

The inter-agent misalignment cluster (most relevant to Zeph) covers:
- Ambiguous handoff: the receiving agent doesn't have enough context to continue correctly
- Missing context: key facts known to the orchestrator are not passed to the sub-agent
- Role assumption mismatch: orchestrator assumes capabilities the sub-agent doesn't have

## Current Zeph Gap

Zeph's sub-agent handoff uses YAML files with an ad-hoc set of fields (`id`, `parent`, `agent`, `timestamp`, `status`, `context`, `output`, `next`). The `context` and `output` fields are free-form, and no schema validation checks that the required context is present before a handoff is issued.

From MAST's taxonomy, the most likely failure modes in Zeph are:
- Missing context: sub-agent receives task description but not the supporting facts (e.g., which crate, which function, what the constraint is)
- Ambiguous task: handoff says "fix the bug" without specifying the exact error or reproduction path
- No verification gate: the orchestrator proceeds to the next phase without confirming the sub-agent's output is valid
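
To make the gap concrete, here is a minimal sketch of the current record shape as a Rust struct. The field names come from the list above; the types, the `HandoffRecord` name, and the `is_dispatchable` helper are assumptions for illustration, not the actual zeph-subagent code.

```rust
/// Hypothetical in-memory mirror of the current YAML handoff record.
/// Field names follow the issue text; all types are assumptions.
#[derive(Debug, Clone, Default)]
pub struct HandoffRecord {
    pub id: String,
    pub parent: Option<String>,
    pub agent: String,
    pub timestamp: String,
    pub status: String,
    /// Free-form text: nothing says which facts must be in here.
    pub context: String,
    /// Free-form text: nothing says what a valid result looks like.
    pub output: String,
    pub next: Option<String>,
}

/// About the strongest check a free-form record supports:
/// an agent is named and the context is not blank.
pub fn is_dispatchable(record: &HandoffRecord) -> bool {
    !record.agent.is_empty() && !record.context.trim().is_empty()
}
```

With this shape, a record whose context is just "fix the bug" passes the check, which is exactly the ambiguous-task case listed above.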

## Implementation Sketch

1. Define a typed `HandoffContext` enum with one variant per agent type, each carrying that role's required fields (architect needs: spec files, constraints; developer needs: spec + architect output; tester needs: test plan + implementation)
2. Add a validation step in SubAgentManager before dispatching: verify required context fields are present
3. Add a verification step after sub-agent completion: check that output fields match expected structure before triggering the next phase
4. Log handoff quality metrics: context completeness score, verification pass/fail rate
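
Steps 1, 2, and the completeness metric from step 4 could be sketched roughly as follows. The variant and field names are taken from the role list in step 1; the concrete types, the `validate` and `completeness` method names, and the "non-empty means present" rule are assumptions.

```rust
/// Sketch of a typed handoff context, one variant per agent role.
#[derive(Debug, Clone, PartialEq)]
pub enum HandoffContext {
    Architect { spec_files: Vec<String>, constraints: Vec<String> },
    Developer { spec: String, architect_output: String },
    Tester { test_plan: String, implementation: String },
}

impl HandoffContext {
    /// Returns (field name, is_present) for every required field of this role.
    /// "Present" here simply means non-empty (an assumption).
    fn required_fields(&self) -> Vec<(&'static str, bool)> {
        match self {
            HandoffContext::Architect { spec_files, constraints } => vec![
                ("spec_files", !spec_files.is_empty()),
                ("constraints", !constraints.is_empty()),
            ],
            HandoffContext::Developer { spec, architect_output } => vec![
                ("spec", !spec.trim().is_empty()),
                ("architect_output", !architect_output.trim().is_empty()),
            ],
            HandoffContext::Tester { test_plan, implementation } => vec![
                ("test_plan", !test_plan.trim().is_empty()),
                ("implementation", !implementation.trim().is_empty()),
            ],
        }
    }

    /// Pre-dispatch gate: Ok if nothing is missing, else the missing field names.
    pub fn validate(&self) -> Result<(), Vec<&'static str>> {
        let missing: Vec<&'static str> = self
            .required_fields()
            .into_iter()
            .filter(|(_, present)| !*present)
            .map(|(name, _)| name)
            .collect();
        if missing.is_empty() { Ok(()) } else { Err(missing) }
    }

    /// Context completeness score in [0.0, 1.0]: present fields / required fields.
    pub fn completeness(&self) -> f64 {
        let fields = self.required_fields();
        let present = fields.iter().filter(|(_, present)| *present).count();
        present as f64 / fields.len() as f64
    }
}
```

Returning the list of missing field names, rather than a bare boolean, is what lets the orchestrator log or repair a bad handoff instead of failing silently.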

This does not require architectural changes to the sub-agent system -- it's a validation layer on top of the existing YAML handoff protocol.
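
One way to read "validation layer on top" is a wrapper that runs a check before and after an unchanged dispatch call. The sketch below passes the existing dispatch in as a closure; `checked_handoff`, `HandoffError`, and the closure signatures are hypothetical names chosen for illustration, not SubAgentManager's API.

```rust
/// Why a handoff was rejected by the validation layer.
#[derive(Debug, PartialEq)]
pub enum HandoffError {
    /// Required context fields that were absent before dispatch.
    MissingContext(Vec<String>),
    /// Reason the sub-agent's output failed the post-completion check.
    InvalidOutput(String),
}

/// Wraps an existing dispatch function with a pre-dispatch context check
/// and a post-completion output check. The dispatch itself is untouched.
pub fn checked_handoff<C, O>(
    ctx: &C,
    validate: impl Fn(&C) -> Vec<String>,
    dispatch: impl FnOnce(&C) -> O,
    verify: impl Fn(&O) -> Result<(), String>,
) -> Result<O, HandoffError> {
    // Gate 1: refuse to dispatch when required context is missing.
    let missing = validate(ctx);
    if !missing.is_empty() {
        return Err(HandoffError::MissingContext(missing));
    }

    // Existing behavior: hand the task to the sub-agent.
    let output = dispatch(ctx);

    // Gate 2: do not trigger the next phase on invalid output.
    verify(&output).map_err(HandoffError::InvalidOutput)?;
    Ok(output)
}
```

Because the dispatch closure is only called after the first gate passes, a handoff with missing context never reaches the sub-agent.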

## Complexity

Medium. Requires: a typed `HandoffContext` enum with one variant per agent role, validation logic in `SubAgentManager`, and either LLM-based or schema-based output verification. No changes to transport or agent execution.
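
The choice between schema-based and LLM-based verification can stay open if both sit behind one trait. Below is a minimal sketch of the schema-based variant only. The `OutputVerifier` trait, the `SchemaVerifier` type, and the representation of sub-agent output as a map of named sections are all assumptions; an LLM-based verifier would be a second implementation of the same trait and is not shown.

```rust
use std::collections::BTreeMap;

/// Common interface for post-completion output checks (hypothetical).
pub trait OutputVerifier {
    /// Ok if the output is acceptable, else the list of problems found.
    fn verify(&self, output: &BTreeMap<String, String>) -> Result<(), Vec<String>>;
}

/// Schema-based check: every required key must exist and be non-blank.
pub struct SchemaVerifier {
    pub required_keys: Vec<String>,
}

impl OutputVerifier for SchemaVerifier {
    fn verify(&self, output: &BTreeMap<String, String>) -> Result<(), Vec<String>> {
        let missing: Vec<String> = self
            .required_keys
            .iter()
            // A key counts as missing if absent or whitespace-only.
            .filter(|key| output.get(*key).map_or(true, |v| v.trim().is_empty()))
            .cloned()
            .collect();
        if missing.is_empty() { Ok(()) } else { Err(missing) }
    }
}
```

A structural check like this is cheap and deterministic, but it only confirms that the expected sections exist, not that their content is correct.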

## Expected Benefit

- Targets the coordination failures that account for 36.9% of failures in the MAST traces (potentially the most impactful single change for multi-agent reliability)
- Makes handoff failures explicit and debuggable instead of silent
- Enables better error recovery: if context is missing, the orchestrator can request it before dispatching
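
The recovery path in the last bullet might look like the sketch below: before dispatching, the orchestrator tries to fill each missing field from facts it already holds, and blocks with an explicit list only when it cannot. `Preflight`, `preflight`, and the string-keyed maps are hypothetical stand-ins for whatever the typed context ends up being.

```rust
use std::collections::BTreeMap;

/// Outcome of the pre-dispatch check.
#[derive(Debug, PartialEq)]
pub enum Preflight {
    /// All required fields are present; carries the (possibly completed) context.
    Ready(BTreeMap<String, String>),
    /// Some fields could not be filled; dispatch must wait.
    Blocked { missing: Vec<String> },
}

/// Fills missing required fields from facts the orchestrator already knows.
pub fn preflight(
    required: &[&str],
    mut context: BTreeMap<String, String>,
    orchestrator_facts: &BTreeMap<String, String>,
) -> Preflight {
    let mut missing = Vec::new();
    for key in required {
        let present = context.get(*key).map_or(false, |v| !v.trim().is_empty());
        if present {
            continue;
        }
        match orchestrator_facts.get(*key) {
            // The orchestrator knows the fact: propagate it into the handoff.
            Some(value) if !value.trim().is_empty() => {
                context.insert((*key).to_string(), value.clone());
            }
            // Nobody knows it: report it explicitly instead of dispatching.
            _ => missing.push((*key).to_string()),
        }
    }
    if missing.is_empty() {
        Preflight::Ready(context)
    } else {
        Preflight::Blocked { missing }
    }
}
```

The `Blocked` variant is what turns a silent missing-context failure into something the orchestrator can log or act on.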

## See Also

- SubAgentManager: zeph-subagent crate
- Handoff protocol: .local/handoff/ + rust-agent-handoff skill
- Decentralized task allocation: https://www.nature.com/articles/s41598-025-21709-9
- Multi-agent coordination survey: https://arxiv.org/html/2501.06322v1

