Skip to content

bhj37193/relay

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Relay

Extended thinking sharing between Claude agents.

I read RecursiveMAS (arXiv 2604.25917). It proves that agents sharing internal reasoning state improve accuracy over text-only multi-agent systems. The catch: the paper requires GPU access to hidden states. Relay does the same thing using the Anthropic extended thinking API. No GPU, no training required.

Results

50-example GSM8K eval, claude-sonnet-4-6, 1024 thinking budget:

Condition Accuracy Avg tokens
Single-agent 96% 344
Text-MAS 100% 6,466
Relay 98% 7,059

Relay improves across rounds: 96% at round 1, 98% at round 2. Both multi-agent conditions beat single-agent. Thinking sharing adds interpretability -- you can read exactly why the system changed its answer between rounds.

Structured mental model sharing (MATH level 4-5)

relay-raw passes the full extended thinking text between agents. At low thinking budgets, that means compressed reasoning, sometimes fragments. relay-structured takes a different approach: each agent emits a compact JSON block summarizing what it understood, what steps it took, what it tried and rejected, and how confident it is. That JSON is what the next agent reads.

relay-raw is retired. relay-structured is the active approach.

The field set:

{
  "interpretation": "how the agent read the problem",
  "key_steps": ["step 1", "step 2"],
  "rejected_approaches": ["something tried and discarded"],
  "confidence": 0.85,
  "potential_errors": "where this reasoning might go wrong"
}

confidence and potential_errors carry the most weight downstream. They tell the next agent where to look harder without requiring it to parse a full reasoning trace. The critic can see directly that the planner was uncertain about its interpretation, before it forms its own view. Raw thinking text buries that signal in prose; the JSON surfaces it explicitly.

Results (n=50, preliminary)

Dataset: MATH competition problems, levels 4-5. Model: claude-sonnet-4-6. Thinking budget: 5000 tokens.

Condition Accuracy Avg tokens
single-agent 70.0% 1,212
relay-structured 72.0% 18,821

relay-structured wins by 2 points: 36 correct out of 50 versus 35. The difference is one problem.

The harder number is the token cost. relay-structured runs a two-agent chain with structured summaries at each step: 18,821 avg tokens versus 1,212 for single-agent. That is a 15x multiple. At that ratio, the accuracy gain needs to be substantially larger to justify the cost in a production system. The n=50 result is directionally positive but not large enough to call settled.

A fair next test is n=200 at matched token budgets. The more practical framing for relay-structured is probably not overall accuracy but accuracy on the subset where single-agent confidence is low. If relay-structured recovers correct answers when confidence flags a problem, the deployment story changes: run the chain selectively, not on every request. That narrows the cost multiple to where it makes sense.

This was the foundation. See the section below for the improved architecture and final results.

The path from relay-raw to read-after

Four architectures. Each one exposed a problem that led to the next.

relay-raw

Pass the full extended thinking text between agents. The idea comes from RecursiveMAS: agents sharing more than final text improves accuracy. At a 1024-token thinking budget, the text is compressed, sometimes fragments. Preliminary n=7 data: single-agent 71.4%, relay-raw 57.1%. Inconclusive at n=7, but the low budget was the obvious confound.

relay-structured

Instead of passing raw thinking, each agent emits a compact mental model JSON. The hypothesis: 150 tokens of structured signal carries more information per token than 1024 tokens of fragmented prose. n=50 at budget=5000: relay-structured 72.0%, single-agent 70.0%. The token cost was 18,821 avg versus 1,212. +2 percentage points for 15x more tokens.

Why read-before was not built

The obvious next step: let the second agent read the first agent's JSON before answering. The problem is anchoring. The second agent sees the first agent's answer before forming its own view. It will tend to confirm rather than challenge. This is mathematically equivalent to relay-structured with role specialization removed and anchoring added. It was not worth building.

read-after + disagreement escalation

Agents reason independently first. No shared context during reasoning. After both answer, compare in code. If they agree: return the higher-confidence answer. If they disagree: run a resolver that sees both answers and both mental model JSONs and evaluates which reasoning chain is stronger. Token cost: 2.7x single-agent versus relay-structured's 15x. Final results: n=200, self-relay 65.5% vs single-agent 63.0%, 3,290 avg tokens vs 1,234.

The case study

Two trains leave San Rafael at the same time. They begin traveling westward, both traveling for 80 miles. The next day, they travel northwards, covering 150 miles. What's the distance covered by each train in the two days?

Single-agent applied the Pythagorean theorem: sqrt(80^2 + 150^2) = 170 miles. Wrong. It confused total distance traveled with straight-line displacement.

Relay's Planner reasoned through the difference between path length and displacement. The Critic confirmed it. The Solver produced 80 + 150 = 230 miles -- correct. The thinking blocks made the reasoning auditable at every step.

How it works

Standard multi-agent: Agent A produces an answer. Agent B sees that answer.

Relay: Agent A produces thinking text and an answer. Agent B sees both before responding.

The loop is Planner -> Critic -> Solver, repeated N rounds:

Round 1:
  Planner                                         -> (thinking_1, plan)
  Critic  <- thinking_1 + plan                    -> (thinking_2, critique)
  Solver  <- thinking_1 + thinking_2 + plan + critique  -> answer_1

Round 2:
  Planner <- answer_1 + all prior thinking        -> (thinking_3, plan_2)
  Critic  <- thinking_3 + ...                     -> (thinking_4, critique_2)
  Solver  <- all                                  -> answer_2  [FINAL]

Thinking block signatures are conversation-scoped and cannot be passed to another agent directly. Relay extracts the thinking text and injects it as a regular user-message context block. The reasoning transfers without the signature.

Architecture

Architecture v2 — self-relay

Quick start

git clone https://github.com/bhj37193/relay
cd relay
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # add your ANTHROPIC_API_KEY

See it work

$ python -m relay.runner "Two trains leave San Rafael..."

[Planner - round 1]
  thinking: The question asks for "distance covered" -- total path length, not
            displacement. I should avoid the Pythagorean trap here...
  output:   Day 1: 80 miles westward. Day 2: 150 miles northward. Total: 230.

[Critic - round 1]
  thinking: Planner correctly identified path length vs. displacement.
            No corrections needed.
  output:   Reasoning is correct. Proceed with 80 + 150 = 230.

[Solver - round 1]
  answer:   80 + 150 = 230 miles   \boxed{230}

Final answer: 230
Gold answer:  230  [CORRECT]

Add Relay to your product

Three functions. Drop any into your inference pipeline.

One relay round:

from relay.agent import agent_call

question = "your question here"

thinking1, plan, _      = agent_call("planner", question, [], [])
thinking2, critique, _  = agent_call("critic",  question, [thinking1], [plan])
thinking3, answer, _    = agent_call("solver",  question, [thinking1, thinking2], [plan, critique])

# answer is the final response
# thinking3 shows why the solver answered this way

Text-only sharing (faster, cheaper):

from relay.runner import run_textmas

result = run_textmas(question, n_rounds=2)
print(result["final_answer"])

Full relay loop with N rounds:

from relay.runner import run_relay

result = run_relay(question, n_rounds=2)
print(result["final_answer"])
print(result["rounds"][-1]["solver_thinking"])  # why the solver answered this way

Model requirement: any model that supports thinking={"type": "enabled"}, for example claude-opus-4-5 or claude-sonnet-4-6. Minimum budget_tokens=1024.

Run the eval

source .venv/bin/activate
set -a; . ./.env; set +a

# 50 examples, about $2-3 on Sonnet
python -m relay.eval --n 50 --workers 2 --model claude-sonnet-4-6 --budget 1024

# Full run
python -m relay.eval --n 200

Results export to results.json. The eval is resumable -- kill and restart, it picks up from the last completed example.

Frontend:

cd web && npm install
cp ../results.json public/results.json
npm run dev

File map

File What it does
relay/agent.py agent_call(role, question, prior_thinking, prior_text) returns (thinking, answer, tokens)
relay/runner.py run_single(), run_relay(), run_textmas() -- the three eval conditions
relay/eval.py 3-condition parallel eval, resumable, exports results.json
relay/data.py GSM8K loader, answer extraction, correctness check
web/ Vite + React + Recharts dashboard reading public/results.json

Notes

  • No LangChain, no CrewAI, no LangGraph. Raw Anthropic SDK and threadpool.
  • Proven novel: no prior implementation of thinking-block relay between agents found as of 2026-05-31.
  • Thinking signatures are conversation-scoped. Cross-agent transfer uses text extraction (see relay/agent.py lines 34-41).
  • Dataset: GSM8K test split via openai/gsm8k on HuggingFace.
  • Extended thinking minimum: 1024 budget tokens. Use 5000 for harder reasoning tasks.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors