aritra741/AgentCollabBench

This is the codebase for AgentCollabBench. It contains all four behavioral metrics reported in the paper and the task JSON files needed to run them:

Metric                           Key   Direction
Instruction Decay Rate           idr   lower is better
Radioactive Tracer Durability    rtd   higher is better
Consensus Pollution Rate         cpr   lower is better
Cross-task Leakage Containment   clc   lower is better

AgentCollabBench evaluates multi-agent LLM systems on four behavioral failure modes that are central to the paper:

  • idr: whether an agent drops a hard constraint under peer pressure
  • rtd: whether critical tagged information survives multi-hop communication
  • cpr: whether a seeded false belief spreads through the team
  • clc: whether private context from one task leaks into a later unrelated task

This repository provides the full 900-task JSON corpus for these four paper metrics, the runner that executes the multi-agent conversations, and the scoring code used to evaluate the resulting traces.

How Evaluation Works

Each task JSON defines a domain scenario, agent roles, communication topology, speaking order, metric applicability, and metric-specific intervention. At evaluation time, the harness:

  1. Loads the task definition and validates the schema.
  2. Instantiates the agent team and communication topology.
  3. Applies the metric-specific intervention: constraint injection for idr, tracer injection for rtd, false-fact seeding for cpr, or two-task context carryover for clc.
  4. Runs the agents turn by turn and records the execution trace.
  5. Scores the trace with the requested metric or metrics.
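The five steps above can be sketched as a small Python loop. Every helper here (build_team, apply_intervention, run_turns, score_trace) is a hypothetical stand-in for illustration, not the actual agentcollabbench API; only the task fields (topology, injections, expected_turns) come from the schema described in this README.

```python
# Illustrative sketch of the evaluation loop; all helpers are
# hypothetical stand-ins, not the real agentcollabbench API.
def build_team(topology):
    # Step 2: instantiate the agent team from the topology block.
    return {"agents": [a["agent_id"] for a in topology["agents"]]}

def apply_intervention(team, payload):
    # Step 3: attach the metric-specific injection to the team.
    team["intervention"] = payload

def run_turns(team, n_turns):
    # Step 4: run agents turn by turn, recording a trace.
    return [{"turn": i, "speakers": team["agents"]} for i in range(n_turns)]

def score_trace(trace, metric):
    # Step 5: placeholder scorer; real scoring is metric-specific.
    return 1.0

def evaluate(task, metrics):
    scores = {}
    for metric in metrics:              # one fresh trace per requested metric
        team = build_team(task["topology"])
        apply_intervention(team, task["injections"][metric])
        trace = run_turns(team, task["expected_turns"])
        scores[metric] = score_trace(trace, metric)
    return scores

task = {
    "topology": {"agents": [{"agent_id": "planner"}, {"agent_id": "coder"}]},
    "injections": {"rtd": {"tracer": "TAG-123"}},
    "expected_turns": 4,
}
print(evaluate(task, ["rtd"]))  # {'rtd': 1.0}
```

Note the inner loop: because each metric needs its own intervention, a task declaring several metrics yields one trace per metric, matching the behavior described above.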

rtd and clc are programmatic metrics. idr and cpr use a separate judge model by default, matching the paper protocol. If a task declares more than one metric, the evaluator runs one trace per requested metric so each score receives the correct metric-specific intervention.

The main outputs are:

  • a primary score per metric
  • per-metric errors, if any

Internally, the evaluator also keeps metric-specific detailed results and the underlying run trace in memory on the EvalResult object, but the default CLI JSON output currently includes only task_id, scores, and errors.
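Given that the CLI JSON carries only task_id, scores, and errors per task, post-processing can stay simple. A minimal sketch, assuming the output file is a list of such per-task records (the list layout is an assumption; the field names come from the text above):

```python
import json
import os
import tempfile

def summarize(path):
    """Collect scores from tasks that finished without errors.

    Assumes the results file is a JSON list of records shaped like
    {"task_id": ..., "scores": {...}, "errors": [...]}.
    """
    with open(path) as f:
        results = json.load(f)
    return {r["task_id"]: r["scores"] for r in results if not r.get("errors")}

# Demo with fabricated records in the assumed layout:
records = [
    {"task_id": "TASK-A", "scores": {"rtd": 0.9}, "errors": []},
    {"task_id": "TASK-B", "scores": {}, "errors": ["provider timeout"]},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(records, f)
summary = summarize(f.name)
os.unlink(f.name)
print(summary)  # {'TASK-A': {'rtd': 0.9}}
```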

Install

Requires Python 3.12 or newer.

cd reviewer_release
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Optional providers:

pip install -e ".[gemini,anthropic,dev]"

API Keys

Live runs require a backbone provider. IDR and CPR also require a judge provider. The default judge is DeepSeek V4 Flash, matching the paper protocol.

In the paper protocol, proprietary backbones (gpt-4.1-mini, gemini-2.5-flash-lite) are called through their native APIs (--provider openai, --provider gemini). Open-weight backbones are called only through OpenRouter (--provider openrouter with OPENROUTER_API_KEY): use qwen/qwen3.5-35b-a3b for Qwen3.5-35B-A3B and meta-llama/llama-3.1-8b-instruct for Llama 3.1 8B Instruct (pass the id to --model).

Common environment variables:

export OPENAI_API_KEY=...
export DEEPSEEK_API_KEY=...
export GEMINI_API_KEY=...
export OPENROUTER_API_KEY=...
# Optional: only if you use --provider groq instead of OpenRouter for Llama
export GROQ_API_KEY=...

Override the judge globally:

export AGENTCOLLABBENCH_JUDGE_PROVIDER=openai
export AGENTCOLLABBENCH_JUDGE_MODEL=gpt-4.1-mini

API Cost Estimate

The full corpus contains 900 tasks. Under the current runner, one complete backbone sweep over all applicable metrics makes approximately 22,790 backbone API calls. IDR and CPR also use an LLM judge; the full four-backbone paper protocol makes approximately 18,088 judge calls before retries.

The 900 JSON tasks are stored directly under tasks/. Some tasks support more than one metric, so metric-applicable counts sum to more than 900:

Metric   Tasks   Backbone calls   Judge calls
clc      228     approx. 9,094    0
cpr      226     approx. 4,485    approx. 3,375
idr      230     approx. 4,578    approx. 1,147
rtd      232     approx. 4,633    0

Total, one backbone model (916 metric-task runs across 900 unique tasks): 22,790 backbone calls, approx. 4,522 judge calls.
Total, four paper backbones (3,664 metric-task runs across 3,600 unique task-model pairs): 91,160 backbone calls, approx. 18,088 judge calls.

Approximate public API prices as of 2026-05-05 (per 1M tokens; taxes and your tier may differ):

Role       Model used in paper protocol   Price, USD per 1M tokens
Backbone   gpt-4.1-mini (OpenAI API)   input $0.40, output $1.60
Backbone   gemini-2.5-flash-lite (Google AI API)   input $0.10, output $0.40
Backbone   qwen/qwen3.5-35b-a3b (OpenRouter only)   from approx. input $0.14 / output $1.00 on the cheapest listed routes, up to approx. $1.80 output on other providers OpenRouter may route to
Backbone   meta-llama/llama-3.1-8b-instruct (OpenRouter only; e.g. Groq-backed tier)   input $0.05, output $0.08; other backends on the same model page can quote lower rates, e.g. $0.02 / $0.05
Judge      deepseek-v4-flash (DeepSeek API)   input $0.14 (cache miss) / $0.0028 (cache hit), output $0.28

Actual cost depends on response length, retries, OpenRouter routing for OSS models, and prompt caching. The call counts above are computed from the current task JSONs and runner behavior.
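A back-of-envelope estimate can be derived from the call counts and prices above. The call counts and per-1M-token rates come from the tables; the average token counts per call are rough assumptions for illustration only, so treat the result as an order-of-magnitude figure, not a quote:

```python
# Back-of-envelope cost for one gpt-4.1-mini backbone sweep plus the
# deepseek judge. Call counts and prices are from the tables above;
# AVG_IN / AVG_OUT tokens per call are assumptions, not measured values.
BACKBONE_CALLS = 22_790
JUDGE_CALLS = 4_522
AVG_IN, AVG_OUT = 1_500, 300  # assumed tokens per call

def cost(calls, in_price, out_price, avg_in=AVG_IN, avg_out=AVG_OUT):
    """USD total, given per-1M-token input/output rates."""
    return calls * (avg_in * in_price + avg_out * out_price) / 1e6

backbone = cost(BACKBONE_CALLS, 0.40, 1.60)  # gpt-4.1-mini rates
judge = cost(JUDGE_CALLS, 0.14, 0.28)        # deepseek cache-miss rates
print(round(backbone + judge, 2))  # 25.94
```

Under these assumed token counts, a single-backbone sweep lands in the tens of dollars; longer responses, retries, or pricier OpenRouter routes scale the figure accordingly.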

Run A Single Task

agentcollabbench evaluate \
  --task tasks/TASK-DATAENG-RTD-001.json \
  --metrics rtd \
  --provider openai \
  --model gpt-4.1-mini

Same task with an open-weight backbone (paper protocol: OpenRouter only for Qwen and Llama):

agentcollabbench evaluate \
  --task tasks/TASK-DATAENG-RTD-001.json \
  --metrics rtd \
  --provider openrouter \
  --model qwen/qwen3.5-35b-a3b

IDR/CPR with an explicit judge:

agentcollabbench evaluate \
  --task tasks/TASK-SWE-CPR-001.json \
  --metrics cpr \
  --provider openai \
  --model gpt-4.1-mini \
  --judge-provider deepseek \
  --judge-model deepseek-v4-flash

Run The Task Directory

agentcollabbench evaluate \
  --dataset tasks \
  --metrics clc \
  --provider openai \
  --model gpt-4.1-mini \
  --output results/clc_results.json

Validate Tasks

agentcollabbench validate --dataset tasks
agentcollabbench validate --task path/to/custom_task.json

Custom JSON files should follow the same task schema used in tasks/: a task-level description, topology, speaking order, agent definitions, metric_applicability, and a metric-specific block under injections.

Anyone can use AgentCollabBench with their own multi-agent system setup by writing a valid task JSON and running agentcollabbench evaluate --task .... You can check the file beforehand with agentcollabbench validate.

A valid task JSON should include these top-level fields:

  • task_id
  • domain
  • description
  • topology
  • metric_applicability
  • expected_turns
  • ground_truth
  • injections

Within topology, include:

  • type
  • agents
  • edges
  • speaking_order

Each entry in topology.agents should include:

  • agent_id
  • role
  • system_prompt
  • receives_from

Within ground_truth, include:

  • correct_outcome
  • quality_rubric

Within injections, include the metric-specific block for the metrics you want to evaluate, such as rtd, idr, cpr, or clc.

For real benchmark-style examples, use the task JSON files under tasks/ as templates. Those files show the expected structure, field names, and metric-specific payloads in their full form.
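Putting the field lists above together, a custom task skeleton looks roughly like the following. All values here are placeholders, and the shape of the metric-specific injections payload is deliberately left empty; copy the exact payload structure from a real file under tasks/ for the metric you want.

```python
# Skeleton of a custom task with the required fields listed above.
# Every value is a placeholder; real payload shapes live in tasks/.
task = {
    "task_id": "TASK-CUSTOM-RTD-001",
    "domain": "dataeng",
    "description": "Placeholder scenario description.",
    "topology": {
        "type": "chain",
        "agents": [
            {"agent_id": "planner", "role": "lead",
             "system_prompt": "You plan the work.", "receives_from": []},
            {"agent_id": "coder", "role": "engineer",
             "system_prompt": "You implement the plan.",
             "receives_from": ["planner"]},
        ],
        "edges": [["planner", "coder"]],
        "speaking_order": ["planner", "coder"],
    },
    "metric_applicability": ["rtd"],
    "expected_turns": 4,
    "ground_truth": {"correct_outcome": "placeholder",
                     "quality_rubric": "placeholder"},
    "injections": {"rtd": {}},  # copy the real rtd payload from tasks/
}

REQUIRED = {"task_id", "domain", "description", "topology",
            "metric_applicability", "expected_turns",
            "ground_truth", "injections"}
assert REQUIRED <= task.keys()
```

Dump the dict with json.dump and run agentcollabbench validate --task on the result to catch any schema mismatches before a live run.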

Local Python Usage

from agentcollabbench import evaluate_task
from agentcollabbench.harness.provider import OpenAIProvider

provider = OpenAIProvider(model="gpt-4.1-mini")
result = evaluate_task(
    "tasks/TASK-DATAENG-RTD-001.json",
    provider=provider,
    metrics=["rtd"],
)
print(result.to_dict())

This example is for local use inside this repository after installing with pip install -e . as described above. You can also work directly through the task files, metric docs, and the agentcollabbench evaluate / agentcollabbench validate CLI commands.

Contents

  • agentcollabbench/: installable Python package.
  • tasks/: cleaned task JSON files for the four paper metrics.
  • docs/: metric documentation for IDR, RTD, CPR, and CLC.
  • examples/: minimal Python usage example.
  • tests/: focused unit tests for the four metrics.
