This is the codebase for AgentCollabBench. It contains all four behavioral metrics reported in the paper and the task JSON files needed to run them:
| Metric | Key | Direction |
|---|---|---|
| Instruction Decay Rate | idr | lower is better |
| Radioactive Tracer Durability | rtd | higher is better |
| Consensus Pollution Rate | cpr | lower is better |
| Cross-task Leakage Containment | clc | lower is better |
AgentCollabBench evaluates multi-agent LLM systems on four behavioral failure modes that are central to the paper:
- idr: whether an agent drops a hard constraint under peer pressure
- rtd: whether critical tagged information survives multi-hop communication
- cpr: whether a seeded false belief spreads through the team
- clc: whether private context from one task leaks into a later unrelated task
This repository provides the full 900-task JSON corpus for these four paper metrics, the runner that executes the multi-agent conversations, and the scoring code used to evaluate the resulting traces.
Each task JSON defines a domain scenario, agent roles, communication topology, speaking order, metric applicability, and metric-specific intervention. At evaluation time, the harness:
- Loads the task definition and validates the schema.
- Instantiates the agent team and communication topology.
- Applies the metric-specific intervention: constraint injection for idr, tracer injection for rtd, false-fact seeding for cpr, or two-task context carryover for clc.
- Runs the agents turn by turn and records the execution trace.
- Scores the trace with the requested metric or metrics.
rtd and clc are programmatic metrics. idr and cpr use a separate judge
model by default, matching the paper protocol. If a task declares more than one
metric, the evaluator runs one trace per requested metric so each score receives
the correct metric-specific intervention.
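For example, a single call through the Python API described at the end of this README can request two metrics for one task. A minimal sketch, assuming the chosen task declares both metrics in its metric_applicability (the task path below is a placeholder, not a released file):

```python
from agentcollabbench import evaluate_task
from agentcollabbench.harness.provider import OpenAIProvider

# Placeholder path: substitute a file from tasks/ whose metric_applicability
# lists both requested metrics. One trace is run and scored per metric.
provider = OpenAIProvider(model="gpt-4.1-mini")
result = evaluate_task(
    "tasks/<task-declaring-rtd-and-clc>.json",
    provider=provider,
    metrics=["rtd", "clc"],
)
print(result.to_dict()["scores"])  # expected: one primary score per requested metric
```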
The main outputs are:
- a primary score per metric
- per-metric errors, if any
Internally, the evaluator also keeps metric-specific detailed results and the
underlying run trace in memory on the EvalResult object, but the default CLI
JSON output currently includes only task_id, scores, and errors.
Requires Python 3.12 or newer.
```
cd reviewer_release
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```
Optional providers:
```
pip install -e ".[gemini,anthropic,dev]"
```
Live runs require a backbone provider. IDR and CPR also require a judge provider. The default judge is DeepSeek V4 Flash, matching the paper protocol.
In the paper protocol, proprietary backbones (gpt-4.1-mini, gemini-2.5-flash-lite) are called through their native APIs (--provider openai, --provider gemini). Open-weight backbones are called only through OpenRouter (--provider openrouter with OPENROUTER_API_KEY): use qwen/qwen3.5-35b-a3b for Qwen3.5-35B-A3B and meta-llama/llama-3.1-8b-instruct for Llama 3.1 8B Instruct (pass the id to --model).
Common environment variables:
```
export OPENAI_API_KEY=...
export DEEPSEEK_API_KEY=...
export GEMINI_API_KEY=...
export OPENROUTER_API_KEY=...
# Optional: only if you use --provider groq instead of OpenRouter for Llama
export GROQ_API_KEY=...
```
Override the judge globally:
```
export AGENTCOLLABBENCH_JUDGE_PROVIDER=openai
export AGENTCOLLABBENCH_JUDGE_MODEL=gpt-4.1-mini
```
The full corpus contains 900 tasks. Under the current runner, one complete backbone sweep over all applicable metrics makes approximately 22,790 backbone API calls. IDR and CPR also use an LLM judge; the full four-backbone paper protocol makes approximately 18,088 judge calls before retries.
The 900 JSON tasks are stored directly under tasks/. Some tasks support more
than one metric, so metric-applicable counts sum to more than 900:
| Metric | Tasks | Backbone calls | Judge calls |
|---|---|---|---|
| clc | 228 | approx. 9,094 | 0 |
| cpr | 226 | approx. 4,485 | approx. 3,375 |
| idr | 230 | approx. 4,578 | approx. 1,147 |
| rtd | 232 | approx. 4,633 | 0 |
| Total, one backbone model | 916 metric-task runs across 900 unique tasks | 22,790 | approx. 4,522 |
| Total, four paper backbones | 3,664 metric-task runs across 3,600 unique task-model pairs | 91,160 | approx. 18,088 |
Approximate public API prices as of 2026-05-05 (per 1M tokens; taxes and your tier may differ):
| Role | Model used in paper protocol | Provider price, USD per 1M tokens |
|---|---|---|
| Backbone | gpt-4.1-mini (OpenAI API) | input $0.40, output $1.60 (OpenAI) |
| Backbone | gemini-2.5-flash-lite (Google AI API) | input $0.10, output $0.40 (Google AI) |
| Backbone | qwen/qwen3.5-35b-a3b (OpenRouter only) | from about input $0.14 / output $1.00 on the cheapest listed routes, up to ~$1.80 output on other providers OpenRouter may route to (OpenRouter) |
| Backbone | meta-llama/llama-3.1-8b-instruct (OpenRouter only; e.g. Groq-backed tier) | input $0.05, output $0.08 (OpenRouter; other backends on the same model page can quote lower rates, e.g. $0.02 / $0.05) |
| Judge | deepseek-v4-flash | input cache miss $0.14, input cache hit $0.0028, output $0.28 (DeepSeek) |
Actual cost depends on response length, retries, OpenRouter routing for OSS models, and prompt caching. The call counts above are computed from the current task JSONs and runner behavior.
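As a rough back-of-envelope illustration of how the call counts and prices combine, the sketch below multiplies the one-backbone call counts by assumed average token sizes per call. The 1,500 input / 400 output token averages are placeholders, not measured values from the paper, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-envelope cost estimate for one gpt-4.1-mini backbone sweep with the
# deepseek-v4-flash judge. Per-call token averages are assumptions.
BACKBONE_CALLS = 22_790      # one-backbone backbone calls (table above)
JUDGE_CALLS = 4_522          # one-backbone judge calls (idr + cpr)

AVG_INPUT_TOKENS = 1_500     # assumed average prompt size per call
AVG_OUTPUT_TOKENS = 400      # assumed average completion size per call

# Prices in USD per 1M tokens (see the pricing table above; judge priced at cache-miss rate).
BACKBONE_IN, BACKBONE_OUT = 0.40, 1.60   # gpt-4.1-mini
JUDGE_IN, JUDGE_OUT = 0.14, 0.28         # deepseek-v4-flash

def cost(calls: int, price_in: float, price_out: float) -> float:
    """USD cost of `calls` API calls at the assumed average token counts."""
    return calls * (AVG_INPUT_TOKENS * price_in + AVG_OUTPUT_TOKENS * price_out) / 1_000_000

total = cost(BACKBONE_CALLS, BACKBONE_IN, BACKBONE_OUT) + cost(JUDGE_CALLS, JUDGE_IN, JUDGE_OUT)
print(f"Estimated one-backbone sweep cost: ${total:,.2f}")
```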
Evaluate a single task with a proprietary backbone:
```
agentcollabbench evaluate \
  --task tasks/TASK-DATAENG-RTD-001.json \
  --metrics rtd \
  --provider openai \
  --model gpt-4.1-mini
```
Same task with an open-weight backbone (paper protocol: OpenRouter only for Qwen and Llama):
```
agentcollabbench evaluate \
  --task tasks/TASK-DATAENG-RTD-001.json \
  --metrics rtd \
  --provider openrouter \
  --model qwen/qwen3.5-35b-a3b
```
IDR/CPR with an explicit judge:
```
agentcollabbench evaluate \
  --task tasks/TASK-SWE-CPR-001.json \
  --metrics cpr \
  --provider openai \
  --model gpt-4.1-mini \
  --judge-provider deepseek \
  --judge-model deepseek-v4-flash
```
Run one metric over the whole task set and write per-task results to a file:
```
agentcollabbench evaluate \
  --dataset tasks \
  --metrics clc \
  --provider openai \
  --model gpt-4.1-mini \
  --output results/clc_results.json
```
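A short sketch of consuming such an output file, assuming it holds a list of per-task entries with the task_id, scores, and errors fields described earlier; the exact top-level layout is an assumption, so adjust the parsing to your output:

```python
import json
from statistics import mean

# Assumption: results/clc_results.json is a list of per-task entries, each
# carrying task_id, scores, and errors as described earlier in this README.
with open("results/clc_results.json") as f:
    results = json.load(f)

clc_scores = [r["scores"]["clc"] for r in results if not r.get("errors")]
print(f"{len(clc_scores)} scored tasks, mean clc = {mean(clc_scores):.3f}")
```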
Validate task files before running them:
```
agentcollabbench validate --dataset tasks
agentcollabbench validate --task path/to/custom_task.json
```
Custom JSON files should follow the same task schema used in tasks/: a task-level description, topology, speaking order, agent definitions, metric_applicability, and a metric-specific block under injections.
Anyone can use AgentCollabBench with their own multi-agent system setup by
writing a valid task JSON and running agentcollabbench evaluate --task ....
Run agentcollabbench validate on the file first to check it against the task schema.
A valid task JSON should include these top-level fields:
- task_id
- domain
- description
- topology
- metric_applicability
- expected_turns
- ground_truth
- injections
Within topology, include:
- type
- agents
- edges
- speaking_order
Each entry in topology.agents should include:
- agent_id
- role
- system_prompt
- receives_from
Within ground_truth, include:
- correct_outcome
- quality_rubric
Within injections, include the metric-specific block for the metrics you want
to evaluate, such as rtd, idr, cpr, or clc.
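To make these field lists concrete, here is a minimal, hypothetical skeleton built as a Python dict and written out as JSON. Every value (domain, topology type, agent ids, injection payload) is a placeholder invented for illustration, not taken from the released tasks; copy the exact value formats from the files under tasks/ and check the result with agentcollabbench validate.

```python
import json

# Hypothetical skeleton: field names follow the lists above, but all values
# below are placeholders, not taken from the released task files.
task = {
    "task_id": "TASK-CUSTOM-RTD-001",
    "domain": "data_engineering",
    "description": "Two agents plan a schema migration.",
    "topology": {
        "type": "chain",  # placeholder topology type
        "agents": [
            {
                "agent_id": "planner",
                "role": "lead",
                "system_prompt": "You coordinate the migration plan.",
                "receives_from": [],
            },
            {
                "agent_id": "reviewer",
                "role": "critic",
                "system_prompt": "You review the plan for risks.",
                "receives_from": ["planner"],
            },
        ],
        "edges": [["planner", "reviewer"]],
        "speaking_order": ["planner", "reviewer"],
    },
    "metric_applicability": ["rtd"],
    "expected_turns": 6,
    "ground_truth": {
        "correct_outcome": "The final plan still states the tagged retention rule.",
        "quality_rubric": ["repeats the tracer string verbatim in the final answer"],
    },
    "injections": {
        "rtd": {}  # metric-specific payload; see the files under tasks/ for the real fields
    },
}

with open("my_custom_task.json", "w") as f:
    json.dump(task, f, indent=2)
```

After writing the file, run agentcollabbench validate --task my_custom_task.json before evaluating it.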
For real benchmark-style examples, use the task JSON files under tasks/ as
templates. Those files show the expected structure, field names, and
metric-specific payloads in their full form.
```python
from agentcollabbench import evaluate_task
from agentcollabbench.harness.provider import OpenAIProvider

provider = OpenAIProvider(model="gpt-4.1-mini")
result = evaluate_task(
    "tasks/TASK-DATAENG-RTD-001.json",
    provider=provider,
    metrics=["rtd"],
)
print(result.to_dict())
```
This example is for local use inside this repository after `pip install -e .`.
You can also work directly through the task files, metric docs, and the
agentcollabbench evaluate / agentcollabbench validate CLI commands.
- agentcollabbench/: installable Python package.
- tasks/: cleaned task JSON files for the four paper metrics.
- docs/: metric documentation for IDR, RTD, CPR, and CLC.
- examples/: minimal Python usage example.
- tests/: focused unit tests for the four metrics.