Relay

Extended thinking sharing between Claude agents.

I read RecursiveMAS (arXiv 2604.25917). It proves that agents sharing internal reasoning state improve accuracy over text-only multi-agent systems. The catch: the paper requires GPU access to hidden states. Relay does the same thing using the Anthropic extended thinking API. No GPU, no training required.

Results

50-example GSM8K eval, claude-sonnet-4-6, 1024 thinking budget:

Condition	Accuracy	Avg tokens
Single-agent	96%	344
Text-MAS	100%	6,466
Relay	98%	7,059

Relay improves across rounds: 96% at round 1, 98% at round 2. Both multi-agent conditions beat single-agent. Thinking sharing adds interpretability -- you can read exactly why the system changed its answer between rounds.

Structured mental model sharing (MATH level 4-5)

relay-raw passes the full extended thinking text between agents. At low thinking budgets, that means compressed reasoning, sometimes fragments. relay-structured takes a different approach: each agent emits a compact JSON block summarizing what it understood, what steps it took, what it tried and rejected, and how confident it is. That JSON is what the next agent reads.

relay-raw is retired. relay-structured is the active approach.

The field set:

{
  "interpretation": "how the agent read the problem",
  "key_steps": ["step 1", "step 2"],
  "rejected_approaches": ["something tried and discarded"],
  "confidence": 0.85,
  "potential_errors": "where this reasoning might go wrong"
}

confidence and potential_errors carry the most weight downstream. They tell the next agent where to look harder without requiring it to parse a full reasoning trace. The critic can see directly that the planner was uncertain about its interpretation, before it forms its own view. Raw thinking text buries that signal in prose; the JSON surfaces it explicitly.

Results (n=50, preliminary)

Dataset: MATH competition problems, levels 4-5. Model: claude-sonnet-4-6. Thinking budget: 5000 tokens.

Condition	Accuracy	Avg tokens
single-agent	70.0%	1,212
relay-structured	72.0%	18,821

relay-structured wins by 2 points: 36 correct out of 50 versus 35. The difference is one problem.

The harder number is the token cost. relay-structured runs a two-agent chain with structured summaries at each step: 18,821 avg tokens versus 1,212 for single-agent. That is a 15x multiple. At that ratio, the accuracy gain needs to be substantially larger to justify the cost in a production system. The n=50 result is directionally positive but not large enough to call settled.

A fair next test is n=200 at matched token budgets. The more practical framing for relay-structured is probably not overall accuracy but accuracy on the subset where single-agent confidence is low. If relay-structured recovers correct answers when confidence flags a problem, the deployment story changes: run the chain selectively, not on every request. That narrows the cost multiple to where it makes sense.

This was the foundation. See the section below for the improved architecture and final results.

The path from relay-raw to read-after

Four architectures. Each one exposed a problem that led to the next.

relay-raw

Pass the full extended thinking text between agents. The idea comes from RecursiveMAS: agents sharing more than final text improves accuracy. At a 1024-token thinking budget, the text is compressed, sometimes fragments. Preliminary n=7 data: single-agent 71.4%, relay-raw 57.1%. Inconclusive at n=7, but the low budget was the obvious confound.

relay-structured

Instead of passing raw thinking, each agent emits a compact mental model JSON. The hypothesis: 150 tokens of structured signal carries more information per token than 1024 tokens of fragmented prose. n=50 at budget=5000: relay-structured 72.0%, single-agent 70.0%. The token cost was 18,821 avg versus 1,212. +2 percentage points for 15x more tokens.

Why read-before was not built

The obvious next step: let the second agent read the first agent's JSON before answering. The problem is anchoring. The second agent sees the first agent's answer before forming its own view. It will tend to confirm rather than challenge. This is mathematically equivalent to relay-structured with role specialization removed and anchoring added. It was not worth building.

read-after + disagreement escalation

Agents reason independently first. No shared context during reasoning. After both answer, compare in code. If they agree: return the higher-confidence answer. If they disagree: run a resolver that sees both answers and both mental model JSONs and evaluates which reasoning chain is stronger. Token cost: 2.7x single-agent versus relay-structured's 15x. Final results: n=200, self-relay 65.5% vs single-agent 63.0%, 3,290 avg tokens vs 1,234.

The case study

Two trains leave San Rafael at the same time. They begin traveling westward, both traveling for 80 miles. The next day, they travel northwards, covering 150 miles. What's the distance covered by each train in the two days?

Single-agent applied the Pythagorean theorem: sqrt(80^2 + 150^2) = 170 miles. Wrong. It confused total distance traveled with straight-line displacement.

Relay's Planner reasoned through the difference between path length and displacement. The Critic confirmed it. The Solver produced 80 + 150 = 230 miles -- correct. The thinking blocks made the reasoning auditable at every step.

How it works

Standard multi-agent: Agent A produces an answer. Agent B sees that answer.

Relay: Agent A produces thinking text and an answer. Agent B sees both before responding.

The loop is Planner -> Critic -> Solver, repeated N rounds:

Round 1:
  Planner                                         -> (thinking_1, plan)
  Critic  <- thinking_1 + plan                    -> (thinking_2, critique)
  Solver  <- thinking_1 + thinking_2 + plan + critique  -> answer_1

Round 2:
  Planner <- answer_1 + all prior thinking        -> (thinking_3, plan_2)
  Critic  <- thinking_3 + ...                     -> (thinking_4, critique_2)
  Solver  <- all                                  -> answer_2  [FINAL]

Thinking block signatures are conversation-scoped and cannot be passed to another agent directly. Relay extracts the thinking text and injects it as a regular user-message context block. The reasoning transfers without the signature.

Quick start

git clone https://github.com/bhj37193/relay
cd relay
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # add your ANTHROPIC_API_KEY

See it work

$ python -m relay.runner "Two trains leave San Rafael..."

[Planner - round 1]
  thinking: The question asks for "distance covered" -- total path length, not
            displacement. I should avoid the Pythagorean trap here...
  output:   Day 1: 80 miles westward. Day 2: 150 miles northward. Total: 230.

[Critic - round 1]
  thinking: Planner correctly identified path length vs. displacement.
            No corrections needed.
  output:   Reasoning is correct. Proceed with 80 + 150 = 230.

[Solver - round 1]
  answer:   80 + 150 = 230 miles   \boxed{230}

Final answer: 230
Gold answer:  230  [CORRECT]

Add Relay to your product

Three functions. Drop any into your inference pipeline.

One relay round:

from relay.agent import agent_call

question = "your question here"

thinking1, plan, _      = agent_call("planner", question, [], [])
thinking2, critique, _  = agent_call("critic",  question, [thinking1], [plan])
thinking3, answer, _    = agent_call("solver",  question, [thinking1, thinking2], [plan, critique])

# answer is the final response
# thinking3 shows why the solver answered this way

Text-only sharing (faster, cheaper):

from relay.runner import run_textmas

result = run_textmas(question, n_rounds=2)
print(result["final_answer"])

Full relay loop with N rounds:

from relay.runner import run_relay

result = run_relay(question, n_rounds=2)
print(result["final_answer"])
print(result["rounds"][-1]["solver_thinking"])  # why the solver answered this way

Model requirement: any model that supports thinking={"type": "enabled"}, for example claude-opus-4-5 or claude-sonnet-4-6. Minimum budget_tokens=1024.

Run the eval

source .venv/bin/activate
set -a; . ./.env; set +a

# 50 examples, about $2-3 on Sonnet
python -m relay.eval --n 50 --workers 2 --model claude-sonnet-4-6 --budget 1024

# Full run
python -m relay.eval --n 200

Results export to results.json. The eval is resumable -- kill and restart, it picks up from the last completed example.

Frontend:

cd web && npm install
cp ../results.json public/results.json
npm run dev

File map

File	What it does
`relay/agent.py`	`agent_call(role, question, prior_thinking, prior_text)` returns `(thinking, answer, tokens)`
`relay/runner.py`	`run_single()`, `run_relay()`, `run_textmas()` -- the three eval conditions
`relay/eval.py`	3-condition parallel eval, resumable, exports `results.json`
`relay/data.py`	GSM8K loader, answer extraction, correctness check
`web/`	Vite + React + Recharts dashboard reading `public/results.json`

Notes

No LangChain, no CrewAI, no LangGraph. Raw Anthropic SDK and threadpool.
Proven novel: no prior implementation of thinking-block relay between agents found as of 2026-05-31.
Thinking signatures are conversation-scoped. Cross-agent transfer uses text extraction (see relay/agent.py lines 34-41).
Dataset: GSM8K test split via openai/gsm8k on HuggingFace.
Extended thinking minimum: 1024 budget tokens. Use 5000 for harder reasoning tasks.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
assets		assets
relay		relay
web		web
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
eval_structured_results.json		eval_structured_results.json
requirements.txt		requirements.txt
results.json		results.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Relay

Results

Structured mental model sharing (MATH level 4-5)

Results (n=50, preliminary)

The path from relay-raw to read-after

The case study

How it works

Quick start

See it work

Add Relay to your product

Run the eval

File map

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Relay

Results

Structured mental model sharing (MATH level 4-5)

Results (n=50, preliminary)

The path from relay-raw to read-after

The case study

How it works

Quick start

See it work

Add Relay to your product

Run the eval

File map

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages