This is the codebase for AgentCollabBench. It contains all four behavioral metrics reported in the paper and the task JSON files needed to run them:
| Metric | Key | Direction |
|---|---|---|
| Instruction Decay Rate | idr | lower is better |
| Radioactive Tracer Durability | rtd | higher is better |
| Consensus Pollution Rate | cpr | lower is better |
| Cross-task Leakage Containment | clc | lower is better |
AgentCollabBench evaluates multi-agent LLM systems on four behavioral failure modes that are central to the paper:
- idr: whether an agent drops a hard constraint under peer pressure
- rtd: whether critical tagged information survives multi-hop communication
- cpr: whether a seeded false belief spreads through the team
- clc: whether private context from one task leaks into a later unrelated task
This repository provides the full 900-task JSON corpus for these four paper metrics, the runner that executes the multi-agent conversations, and the scoring code used to evaluate the resulting traces.
Each task JSON defines a domain scenario, agent roles, communication topology, speaking order, metric applicability, and metric-specific intervention. At evaluation time, the harness:
- Loads the task definition and validates the schema.
- Instantiates the agent team and communication topology.
- Applies the metric-specific intervention: constraint injection for idr, tracer injection for rtd, false-fact seeding for cpr, or two-task context carryover for clc.
- Runs the agents turn by turn and records the execution trace.
- Scores the trace with the requested metric or metrics.
rtd and clc are programmatic metrics. idr and cpr use a separate judge
model by default, matching the paper protocol. If a task declares more than one
metric, the evaluator runs one trace per requested metric so each score receives
the correct metric-specific intervention.
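For example, a single call through the Python API described at the end of this README can request two metrics for one task. A minimal sketch, assuming the chosen task declares both metrics in its metric_applicability (the task path below is a placeholder, not a released file):

```python
from agentcollabbench import evaluate_task
from agentcollabbench.harness.provider import OpenAIProvider

# Placeholder path: substitute a file from tasks/ whose metric_applicability
# lists both requested metrics. One trace is run and scored per metric.
provider = OpenAIProvider(model="gpt-4.1-mini")
result = evaluate_task(
    "tasks/<task-declaring-rtd-and-clc>.json",
    provider=provider,
    metrics=["rtd", "clc"],
)
print(result.to_dict()["scores"])  # expected: one primary score per requested metric
```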
The main outputs are:
- a primary score per metric
- per-metric errors, if any
Internally, the evaluator also keeps metric-specific detailed results and the
underlying run trace in memory on the EvalResult object, but the default CLI
JSON output currently includes only task_id, scores, and errors.
Requires Python 3.12 or newer.
```
cd reviewer_release
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```
Optional providers:
```
pip install -e ".[gemini,anthropic,dev]"
```
Live runs require a backbone provider. IDR and CPR also require a judge provider. The default judge is DeepSeek V4 Flash, matching the paper protocol.
In the paper protocol, proprietary backbones (gpt-4.1-mini, gemini-2.5-flash-lite) are called through their native APIs (--provider openai, --provider gemini). Open-weight backbones are called only through OpenRouter (--provider openrouter with OPENROUTER_API_KEY): use qwen/qwen3.5-35b-a3b for Qwen3.5-35B-A3B and meta-llama/llama-3.1-8b-instruct for Llama 3.1 8B Instruct (pass the id to --model).
Common environment variables:
```
export OPENAI_API_KEY=...
export DEEPSEEK_API_KEY=...
export GEMINI_API_KEY=...
export OPENROUTER_API_KEY=...
# Optional: only if you use --provider groq instead of OpenRouter for Llama
export GROQ_API_KEY=...
```
Override the judge globally:
```
export AGENTCOLLABBENCH_JUDGE_PROVIDER=openai
export AGENTCOLLABBENCH_JUDGE_MODEL=gpt-4.1-mini
```
The full corpus contains 900 tasks. Under the current runner, one complete backbone sweep over all applicable metrics makes approximately 22,790 backbone API calls. IDR and CPR also use an LLM judge; the full four-backbone paper protocol makes approximately 18,088 judge calls before retries.
The 900 JSON tasks are stored directly under tasks/. Some tasks support more
than one metric, so metric-applicable counts sum to more than 900:
| Metric | Tasks | Backbone calls | Judge calls |
|---|---|---|---|
| clc | 228 | approx. 9,094 | 0 |
| cpr | 226 | approx. 4,485 | approx. 3,375 |
| idr | 230 | approx. 4,578 | approx. 1,147 |
| rtd | 232 | approx. 4,633 | 0 |
| Total, one backbone model | 916 metric-task runs across 900 unique tasks | 22,790 | approx. 4,522 |
| Total, four paper backbones | 3,664 metric-task runs across 3,600 unique task-model pairs | 91,160 | approx. 18,088 |
Approximate public API prices as of 2026-05-05 (per 1M tokens; taxes and your tier may differ):
| Role | Model used in paper protocol | Provider price, USD per 1M tokens |
|---|---|---|
| Backbone | gpt-4.1-mini (OpenAI API) | input $0.40, output $1.60 (OpenAI) |
| Backbone | gemini-2.5-flash-lite (Google AI API) | input $0.10, output $0.40 (Google AI) |
| Backbone | qwen/qwen3.5-35b-a3b (OpenRouter only) | from about input $0.14 / output $1.00 on the cheapest listed routes, up to ~$1.80 output on other providers OpenRouter may route to (OpenRouter) |
| Backbone | meta-llama/llama-3.1-8b-instruct (OpenRouter only; e.g. Groq-backed tier) | input $0.05, output $0.08 (OpenRouter; other backends on the same model page can quote lower rates, e.g. $0.02 / $0.05) |
| Judge | deepseek-v4-flash | input cache miss $0.14, input cache hit $0.0028, output $0.28 (DeepSeek) |
Actual cost depends on response length, retries, OpenRouter routing for OSS models, and prompt caching. The call counts above are computed from the current task JSONs and runner behavior.
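As a rough back-of-envelope illustration of how the call counts and prices combine, the sketch below multiplies the one-backbone call counts by assumed average token sizes per call. The 1,500 input / 400 output token averages are placeholders, not measured values from the paper, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-envelope cost estimate for one gpt-4.1-mini backbone sweep with the
# deepseek-v4-flash judge. Per-call token averages are assumptions.
BACKBONE_CALLS = 22_790      # one-backbone backbone calls (table above)
JUDGE_CALLS = 4_522          # one-backbone judge calls (idr + cpr)

AVG_INPUT_TOKENS = 1_500     # assumed average prompt size per call
AVG_OUTPUT_TOKENS = 400      # assumed average completion size per call

# Prices in USD per 1M tokens (see the pricing table above; judge priced at cache-miss rate).
BACKBONE_IN, BACKBONE_OUT = 0.40, 1.60   # gpt-4.1-mini
JUDGE_IN, JUDGE_OUT = 0.14, 0.28         # deepseek-v4-flash

def cost(calls: int, price_in: float, price_out: float) -> float:
    """USD cost of `calls` API calls at the assumed average token counts."""
    return calls * (AVG_INPUT_TOKENS * price_in + AVG_OUTPUT_TOKENS * price_out) / 1_000_000

total = cost(BACKBONE_CALLS, BACKBONE_IN, BACKBONE_OUT) + cost(JUDGE_CALLS, JUDGE_IN, JUDGE_OUT)
print(f"Estimated one-backbone sweep cost: ${total:,.2f}")
```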
Evaluate a single task with a proprietary backbone:
```
agentcollabbench evaluate \
  --task tasks/TASK-DATAENG-RTD-001.json \
  --metrics rtd \
  --provider openai \
  --model gpt-4.1-mini
```
Same task with an open-weight backbone (paper protocol: OpenRouter only for Qwen and Llama):
```
agentcollabbench evaluate \
  --task tasks/TASK-DATAENG-RTD-001.json \
  --metrics rtd \
  --provider openrouter \
  --model qwen/qwen3.5-35b-a3b
```
IDR/CPR with an explicit judge:
```
agentcollabbench evaluate \
  --task tasks/TASK-SWE-CPR-001.json \
  --metrics cpr \
  --provider openai \
  --model gpt-4.1-mini \
  --judge-provider deepseek \
  --judge-model deepseek-v4-flash
```
Run one metric over the whole task set and write per-task results to a file:
```
agentcollabbench evaluate \
  --dataset tasks \
  --metrics clc \
  --provider openai \
  --model gpt-4.1-mini \
  --output results/clc_results.json
```
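A short sketch of consuming such an output file, assuming it holds a list of per-task entries with the task_id, scores, and errors fields described earlier; the exact top-level layout is an assumption, so adjust the parsing to your output:

```python
import json
from statistics import mean

# Assumption: results/clc_results.json is a list of per-task entries, each
# carrying task_id, scores, and errors as described earlier in this README.
with open("results/clc_results.json") as f:
    results = json.load(f)

clc_scores = [r["scores"]["clc"] for r in results if not r.get("errors")]
print(f"{len(clc_scores)} scored tasks, mean clc = {mean(clc_scores):.3f}")
```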
Validate task files before running them:
```
agentcollabbench validate --dataset tasks
agentcollabbench validate --task path/to/custom_task.json
```
Custom JSON files should follow the same task schema used in tasks/: a task-level description, topology, speaking order, agent definitions, metric_applicability, and a metric-specific block under injections.
Anyone can use AgentCollabBench with their own multi-agent system setup by
writing a valid task JSON and running agentcollabbench evaluate --task ....
Run agentcollabbench validate on the file first to check it against the task schema.
A valid task JSON should include these top-level fields:
- task_id
- domain
- description
- topology
- metric_applicability
- expected_turns
- ground_truth
- injections
Within topology, include:
- type
- agents
- edges
- speaking_order
Each entry in topology.agents should include:
- agent_id
- role
- system_prompt
- receives_from
Within ground_truth, include:
- correct_outcome
- quality_rubric
Within injections, include the metric-specific block for the metrics you want
to evaluate, such as rtd, idr, cpr, or clc.
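To make these field lists concrete, here is a minimal, hypothetical skeleton built as a Python dict and written out as JSON. Every value (domain, topology type, agent ids, injection payload) is a placeholder invented for illustration, not taken from the released tasks; copy the exact value formats from the files under tasks/ and check the result with agentcollabbench validate.

```python
import json

# Hypothetical skeleton: field names follow the lists above, but all values
# below are placeholders, not taken from the released task files.
task = {
    "task_id": "TASK-CUSTOM-RTD-001",
    "domain": "data_engineering",
    "description": "Two agents plan a schema migration.",
    "topology": {
        "type": "chain",  # placeholder topology type
        "agents": [
            {
                "agent_id": "planner",
                "role": "lead",
                "system_prompt": "You coordinate the migration plan.",
                "receives_from": [],
            },
            {
                "agent_id": "reviewer",
                "role": "critic",
                "system_prompt": "You review the plan for risks.",
                "receives_from": ["planner"],
            },
        ],
        "edges": [["planner", "reviewer"]],
        "speaking_order": ["planner", "reviewer"],
    },
    "metric_applicability": ["rtd"],
    "expected_turns": 6,
    "ground_truth": {
        "correct_outcome": "The final plan still states the tagged retention rule.",
        "quality_rubric": ["repeats the tracer string verbatim in the final answer"],
    },
    "injections": {
        "rtd": {}  # metric-specific payload; see the files under tasks/ for the real fields
    },
}

with open("my_custom_task.json", "w") as f:
    json.dump(task, f, indent=2)
```

After writing the file, run agentcollabbench validate --task my_custom_task.json before evaluating it.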
For real benchmark-style examples, use the task JSON files under tasks/ as
templates. Those files show the expected structure, field names, and
metric-specific payloads in their full form.
```python
from agentcollabbench import evaluate_task
from agentcollabbench.harness.provider import OpenAIProvider

provider = OpenAIProvider(model="gpt-4.1-mini")
result = evaluate_task(
    "tasks/TASK-DATAENG-RTD-001.json",
    provider=provider,
    metrics=["rtd"],
)
print(result.to_dict())
```
This example is for local use inside this repository after `pip install -e .`.
You can also work directly through the task files, metric docs, and the
agentcollabbench evaluate / agentcollabbench validate CLI commands.
- agentcollabbench/: installable Python package.
- tasks/: cleaned task JSON files for the four paper metrics.
- docs/: metric documentation for IDR, RTD, CPR, and CLC.
- examples/: minimal Python usage example.
- tests/: focused unit tests for the four metrics.