CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

CANTANTE converts system-level rewards into per-agent optimization signals via contrastive in-group attribution, enabling principled local optimization of multi-agent systems.

Setup

Prerequisites

Python 3.12
uv — curl -LsSf https://astral.sh/uv/install.sh | sh

Installation

git clone [REPO URL PLACEHOLDER]
cd cantante
uv sync

API key

All LLM calls go through an OpenAI-compatible API. Put your key in token.txt in the repository root; run_experiment.py reads it with open("token.txt").read().strip(). To use your own endpoint, set task_llm_kwargs.base_url and meta_llm_kwargs.base_url in your config.
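As a minimal sketch of the key-loading step, the helper below reads the key and fails early when the file is missing or empty. The function name load_api_key and its error messages are illustrative assumptions, not part of the repository; only the open("token.txt").read().strip() pattern comes from this README.

```python
from pathlib import Path


def load_api_key(path: str = "token.txt") -> str:
    """Read an API key from a text file and strip surrounding whitespace.

    Hypothetical helper: the README only states that run_experiment.py
    uses open("token.txt").read().strip().
    """
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(f"API key file not found: {token_file}")
    key = token_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"API key file is empty: {token_file}")
    return key
```

Failing with a clear message at startup is usually cheaper than discovering an empty key through a rejected API call several minutes into a run.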

Interactive inference notebook

notebooks/inference.ipynb loads the configurations produced by a run and runs inference on single examples, which is useful for manual inspection and experimentation.

Quick start — single experiment

1. Run the optimization

uv run scripts/run_experiment.py --config configs/ablations/group_size.yaml

2. Evaluate on the test set

uv run scripts/run_eval.py --run_dir results/ablations/<run_dir>

Results land in <run_dir>/eval/. Add --eval_all to score every step, or --step N for a specific step.

3. Evaluate the initial prompts as a baseline

uv run scripts/eval_initial_prompts.py --config configs/main_experiments/main_exp.yaml

Resuming an interrupted run

uv run scripts/restart_experiment.py --run_dir results/main_experiments/<run_dir>

Reproducing paper experiments

run_batch.py expands the grid: block of a config into individual runs, executes each one, and then triggers evaluation automatically. Several instances can run at once: each worker acquires a file-system lock before starting a job, so no job is executed twice. To parallelise across machines, point every worker at the same results/ directory (e.g. on a shared filesystem).

uv run scripts/run_batch.py --config configs/main_experiments/
uv run scripts/run_batch.py --config configs/ablations/
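The file-system lock described above can be sketched with exclusive file creation. This is an illustration of the general technique, not the code in run_batch.py: the lock file name .lock and the function names try_acquire_lock and release_lock are assumptions.

```python
import os
from pathlib import Path


def try_acquire_lock(run_dir: Path, lock_name: str = ".lock") -> bool:
    """Atomically create a lock file; return True if this worker owns the job.

    O_CREAT | O_EXCL makes creation fail when the file already exists, so
    at most one worker succeeds. The lock file name is an assumption.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    lock_path = run_dir / lock_name
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    # Record the owner so a stale lock can be diagnosed by hand.
    with os.fdopen(fd, "w") as handle:
        handle.write(f"pid={os.getpid()}\n")
    return True


def release_lock(run_dir: Path, lock_name: str = ".lock") -> None:
    """Remove the lock file if it exists."""
    (run_dir / lock_name).unlink(missing_ok=True)
```

Whether exclusive creation is atomic on a shared filesystem depends on the filesystem and its configuration, so it is worth confirming this before launching workers on several machines.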

Flags

Flag	Effect
`--dry-run`	Print the job list without running anything
`--skip-completed`	Skip runs whose output directory already has `.finished` (default: on)
`--eval-only`	Skip optimization; only evaluate already-finished runs
`--ignore-locks`	Start even if a lock file exists
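To illustrate what --skip-completed implies, the sketch below filters a job list by the presence of the .finished flag file. The function name pending_runs and its signature are hypothetical; only the .finished convention comes from this README.

```python
from pathlib import Path


def pending_runs(
    results_root: Path, run_names: list[str], skip_completed: bool = True
) -> list[str]:
    """Return the run names that still need to be executed.

    A run counts as completed when <results_root>/<run_name>/.finished exists.
    """
    pending = []
    for name in run_names:
        finished = (results_root / name / ".finished").exists()
        if skip_completed and finished:
            continue
        pending.append(name)
    return pending
```

Printing the result of such a filter without executing anything is roughly what a dry run would show.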

Generating paper artefacts

To reproduce the paper's figures and tables:

uv run scripts/create_plots.py
uv run scripts/create_tables.py

Both scripts read from results/main_experiments/ and results/ablations/.

Code execution (MBPP only)

The MBPP task executes LLM-generated code in a subprocess. Before running any MBPP experiment you must explicitly opt in:

export ALLOW_CODE_EXECUTION=1

We strongly recommend running MBPP experiments inside a sandbox (e.g. Docker, gVisor, or nsjail). See src/agent_tools/mbpp.py for details.
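The opt-in pattern can be sketched as follows: check the environment variable first, then run the snippet in a child interpreter with a timeout. This is a hedged illustration rather than the code in src/agent_tools/mbpp.py; the function name run_untrusted_code and the default timeout are assumptions.

```python
import os
import subprocess
import sys


def run_untrusted_code(
    code: str, timeout_s: float = 5.0
) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child interpreter, only if the user opted in."""
    if os.environ.get("ALLOW_CODE_EXECUTION") != "1":
        raise PermissionError(
            "Set ALLOW_CODE_EXECUTION=1 to enable code execution."
        )
    # -I starts the child in isolated mode: it ignores PYTHON* environment
    # variables and the user site-packages directory.
    return subprocess.run(
        [sys.executable, "-I", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
    )
```

A subprocess with a timeout limits runaway loops but is not a security boundary: the child can still read and write files and open network connections, which is why an outer sandbox remains advisable.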

Config reference

Field	Description	Example
`experiment.dataset`	benchmark to run	`gsm8k`, `hotpotqa`, `mbpp`
`experiment.optimizer`	optimization algorithm	`cantante`, `mipro`, `gepa`
`experiment.seed`	global random seed	`42`
`experiment.max_token_budget`	total token cap	`10000000`
`experiment.output_dir`	where run artefacts are written	`./results/ablations/my_run`
`task_llm_kwargs.model`	model used by the MAS agents	any OpenAI-compatible model ID
`meta_llm_kwargs.model`	model used for attribution and optimization	any OpenAI-compatible model ID
`grid`	key–value lists expanded by `run_batch.py`	`experiment.seed: [7, 42, 47]`
`setup_dict_folder`	folder containing per-dataset setup YAMLs	`configs/setups`
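The grid field can be pictured as a Cartesian product over dotted keys. The sketch below assumes that every combination of listed values becomes one run and that a dotted key such as experiment.seed addresses a nested field; run_batch.py may differ in detail, and the names set_dotted and expand_grid are hypothetical.

```python
import copy
import itertools
from typing import Any


def set_dotted(config: dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"] = value for dotted_key "a.b", creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def expand_grid(
    base: dict[str, Any], grid: dict[str, list[Any]]
) -> list[dict[str, Any]]:
    """Return one config per combination in the Cartesian product of grid values."""
    keys = list(grid)
    runs = []
    for combo in itertools.product(*(grid[key] for key in keys)):
        cfg = copy.deepcopy(base)  # never mutate the shared base config
        for key, value in zip(keys, combo):
            set_dotted(cfg, key, value)
        runs.append(cfg)
    return runs


base = {"experiment": {"dataset": "gsm8k", "optimizer": "cantante", "seed": 42}}
grid = {
    "experiment.seed": [7, 42, 47],
    "experiment.optimizer": ["cantante", "gepa"],
}
runs = expand_grid(base, grid)
print(len(runs))  # 3 seeds x 2 optimizers = 6 configs
```

Deep-copying the base config for each combination keeps the runs independent, so a change to one run's nested fields cannot leak into another.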

Adding a custom task or graph

Dataset loader — add a case in src/experiment/load_datasets.py returning a BaseMASTask subclass (src/tasks/gsm8k.py is the simplest reference).

Agent setup YAML — create configs/setups/<your_task>.yaml with:

edges:
  - from: "__start__"
    to: "agent_a"
  - from: "agent_a"
    to: "__end__"

agents:
  - name: "agent_a"
    task_description: "..."
    input_vars: ["query"]
    output_vars: ["prediction"]
    tools: []
    max_tool_calls: 0

init_agent_prompt_pool:
  agent_a:
    - "Prompt variant 1 ..."
    - "Prompt variant 2 ..."
    # at least 6 variants recommended

Tools (optional) — add a BaseToolsAdapter subclass in src/agent_tools/ and register it in src/agent_tools/base.py::get_tools().
Experiment config — copy any existing config, set experiment.dataset to your task name, and point setup_dict_folder at configs/setups.
Run with run_experiment.py or run_batch.py as normal.
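Before launching a run with a new setup file, a quick consistency check can catch typos in node names. The sketch below works on the parsed setup as a plain dict (for example the result of loading the YAML with a YAML parser). The function validate_setup and the specific checks are assumptions for illustration; the repository may validate setups differently or not at all.

```python
START, END = "__start__", "__end__"


def validate_setup(setup: dict) -> list[str]:
    """Return human-readable problems found in a setup dict (empty if none)."""
    problems = []
    edges = setup.get("edges", [])
    agent_names = {agent["name"] for agent in setup.get("agents", [])}
    valid_nodes = agent_names | {START, END}

    # Every edge endpoint must be a declared agent or a sentinel node.
    for edge in edges:
        for endpoint in (edge["from"], edge["to"]):
            if endpoint not in valid_nodes:
                problems.append(f"edge references unknown node: {endpoint}")

    if START not in {edge["from"] for edge in edges}:
        problems.append("no edge leaves __start__")
    if END not in {edge["to"] for edge in edges}:
        problems.append("no edge reaches __end__")

    # Every agent needs at least one initial prompt to seed optimization.
    pool = setup.get("init_agent_prompt_pool", {})
    for name in sorted(agent_names):
        if not pool.get(name):
            problems.append(f"agent has no initial prompts: {name}")
    return problems


setup = {
    "edges": [
        {"from": "__start__", "to": "agent_a"},
        {"from": "agent_a", "to": "__end__"},
    ],
    "agents": [
        {"name": "agent_a", "input_vars": ["query"], "output_vars": ["prediction"]}
    ],
    "init_agent_prompt_pool": {
        "agent_a": ["Prompt variant 1 ...", "Prompt variant 2 ..."]
    },
}
print(validate_setup(setup))  # []
```

Returning a list of problems instead of raising on the first one lets you fix all naming mistakes in a single pass.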

Repository structure

Cantante/
├── configs/
│   ├── main_experiments/          grid configs for main paper runs
│   ├── ablations/                 one config per ablation study
│   └── setups/                    per-dataset MAS graph + initial prompt pools
├── scripts/
│   ├── run_experiment.py          run a single experiment from a YAML config
│   ├── run_batch.py               expand a grid config and run all jobs (+ eval)
│   ├── run_eval.py                evaluate a finished run on the test split
│   ├── eval_initial_prompts.py    baseline: evaluate seed prompts only
│   ├── restart_experiment.py      resume an interrupted run from checkpoint
│   ├── create_tables.py           generate LaTeX tables
│   └── create_plots.py            generate paper figures
├── src/
│   ├── mas.py                     MASPredictor — LangGraph-based MAS engine
│   ├── meta_prompts.py            system prompts for mutation & crossover & attribution
│   ├── prompt_structures.py       AgentPromptPool / Set / Batch data structures
│   ├── prompt_utils.py            prompt utilities
│   ├── callbacks.py               CheckpointCallback, OptimizationCallback
│   ├── candidate_selector.py      prompt candidate selection strategies
│   ├── agent_tools/
│   │   ├── base.py                tool registry (get_tools) and BaseToolsAdapter
│   │   ├── gsm8k.py               tools for GSM8K (none)
│   │   ├── hotpotqa.py            QA retrieval tools
│   │   └── mbpp.py                code execution in a subprocess (⚠ see Code execution)
│   ├── tasks/
│   │   ├── base.py                BaseMASTask — evaluation loop and scoring
│   │   ├── gsm8k.py               GSM8K maths task
│   │   ├── hotpotqa.py            HotpotQA multi-hop QA task
│   │   └── mbpp.py                MBPP code generation task
│   ├── optimization/
│   │   ├── cantante.py            CANTANTE — attribution-guided genetic optimiser
│   │   ├── local_optimizer.py     CAPO and EvoPrompt node-level optimisers
│   │   ├── dspy_mas_optimizer.py  wrappers for GEPA and MIPROv2 (DSPy)
│   │   ├── dspy_mas_wrapper.py    DSPy integration adapters
│   │   ├── base_mas_optimizer.py  base optimizer interface
│   │   ├── broker_agent_optimizers.py  multi-agent optimization coordination
│   │   └── proxy_task.py          attribution proxy task wrapper
│   ├── attribution/
│   │   ├── base.py                BaseAttributer
│   │   ├── absolute.py            absolute attributer (CANTANTE), as presented in the paper
│   │   └── naive.py               Identity attribution (ablation study)
│   ├── analysis/
│   │   ├── utils.py               DataFrame loading, load_main_results_df, compute_ranks
│   │   ├── tables.py              LatexTable, render_table, get_agg_table
│   │   └── style.py               matplotlib style configuration
│   └── experiment/
│       ├── load_datasets.py       get_tasks() — dataset loader factory
│       ├── configs.py             dataset configuration registry
│       └── utils.py               seed_everything, get_logger, inject_tool_descriptions
├── notebooks/
│   ├── generate_seed_prompts.ipynb   create initial prompt pools
│   └── inference.ipynb               single-example inference testing
├── results/
│   ├── main_experiments/          one sub-directory per completed run
│   │   └── <run_name>/
│   │       ├── prompts_per_step.parquet   full optimization trajectory
│   │       ├── .finished                  flag written on clean exit
│   │       ├── checkpoints/               per-step checkpoint JSONs
│   │       └── eval/
│   │           ├── scores_per_step.parquet   test-set scores per prompt set
│   │           └── token_usage.yaml
│   └── ablations/                 same structure, one dir per ablation run
├── figures/                       generated PDF figures
├── tables/                        generated LaTeX table files
├── token.txt                      API key — not committed, see Setup
└── pyproject.toml
