CANTANTE converts system-level rewards into per-agent optimization signals via contrastive in-group attribution, enabling principled local optimization of multi-agent systems.
- Python 3.12
- uv, which can be installed with:
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone [REPO URL PLACEHOLDER]
cd cantante
uv sync

All LLM calls use an OpenAI-compatible API. Place your key in token.txt; run_experiment.py reads it with open("token.txt").read().strip(). Update task_llm_kwargs.base_url and meta_llm_kwargs.base_url in your config to point to your endpoint.
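As a rough sketch of the key handling described above, the snippet below reads the key with the same read-and-strip approach and bundles it with a base_url. The helper names `read_api_key` and `build_llm_kwargs`, the `api_key` field, and the example model ID and endpoint are illustrative assumptions, not the repository's actual code.

```python
from pathlib import Path


def read_api_key(path: str = "token.txt") -> str:
    """Read the API key, dropping surrounding whitespace and the trailing newline."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty; put your API key in it")
    return key


def build_llm_kwargs(model: str, base_url: str, api_key: str) -> dict:
    """Bundle the settings an OpenAI-compatible client typically needs (hypothetical layout)."""
    return {"model": model, "base_url": base_url, "api_key": api_key}


if __name__ == "__main__":
    import tempfile

    # Demo with a throwaway file so no real key is needed.
    with tempfile.TemporaryDirectory() as tmp:
        token_file = Path(tmp) / "token.txt"
        token_file.write_text("sk-example\n", encoding="utf-8")
        kwargs = build_llm_kwargs(
            model="my-model",  # placeholder model ID
            base_url="http://localhost:8000/v1",  # placeholder endpoint
            api_key=read_api_key(str(token_file)),
        )
        print(kwargs["base_url"])
```

Stripping matters because most editors append a newline to token.txt, and a key with a trailing newline is usually rejected by the endpoint.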
For manual inspection and experimentation, notebooks/inference.ipynb provides an interactive way to load resulting configurations and run single-example inference.
1. Run the optimization
uv run scripts/run_experiment.py --config configs/ablations/group_size.yaml
2. Evaluate on the test set
uv run scripts/run_eval.py --run_dir results/ablations/<run_dir>
Results land in <run_dir>/eval/. Add --eval_all to score every step, or --step N for a specific step.
3. Evaluate the initial prompts as a baseline
uv run scripts/eval_initial_prompts.py --config configs/main_experiments/main_exp.yaml

Resuming an interrupted run
uv run scripts/restart_experiment.py --run_dir results/main_experiments/<run_dir>

run_batch.py expands the grid: block of a config into individual runs, executes each one, and automatically triggers evaluation afterwards. Multiple instances can be launched simultaneously: each worker acquires a file-system lock before starting a job, so no job is executed twice, and you can parallelise across machines by pointing them at the same results/ directory (e.g. on a shared filesystem).
uv run scripts/run_batch.py --config configs/main_experiments/
uv run scripts/run_batch.py --config configs/ablations/

Flags

| Flag | Effect |
|---|---|
| --dry-run | Print the job list without running anything |
| --skip-completed | Skip runs whose output directory already has .finished (default: on) |
| --eval-only | Skip optimization; only evaluate already-finished runs |
| --ignore-locks | Start even if a lock file exists |
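
The per-job locking and the --skip-completed / --ignore-locks behaviour can be pictured with the sketch below. It is only one plausible way to do this with the standard library: the `.lock` file name and the functions `try_acquire_lock` and `should_run` are hypothetical, and the repository's real mechanism may differ.

```python
import os
import tempfile
from pathlib import Path


def try_acquire_lock(run_dir: Path) -> bool:
    """Atomically create <run_dir>/.lock; return False if another worker holds it."""
    run_dir.mkdir(parents=True, exist_ok=True)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one creator wins.
        fd = os.open(run_dir / ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def should_run(run_dir: Path, skip_completed: bool = True, ignore_locks: bool = False) -> bool:
    """Decide whether this worker should execute the job stored in run_dir."""
    if skip_completed and (run_dir / ".finished").exists():
        return False
    if ignore_locks:
        return True
    return try_acquire_lock(run_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        job = Path(tmp) / "run_a"
        print(should_run(job))  # first worker claims the job
        print(should_run(job))  # second worker finds the lock and backs off
```

Creating the lock file with an exclusive flag makes the check and the claim a single step, so two workers cannot both observe "no lock" and start the same job.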
To reproduce the paper's figures and tables:
uv run scripts/create_plots.py
uv run scripts/create_tables.py
Both scripts read from results/main_experiments/ and results/ablations/.
The MBPP task executes LLM-generated code in a subprocess. Before running any MBPP experiment you must explicitly opt in:
export ALLOW_CODE_EXECUTION=1

We strongly recommend running inside a sandbox (Docker, gVisor, nsjail). See src/agent_tools/mbpp.py for details.
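
To illustrate the opt-in described above, here is a minimal sketch of an environment-variable gate in front of a subprocess call. The function name `run_untrusted` and the timeout value are assumptions made for this example; the actual logic lives in src/agent_tools/mbpp.py and may look different.

```python
import os
import subprocess
import sys


def run_untrusted(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run code in a separate Python process, but only after an explicit opt-in."""
    if os.environ.get("ALLOW_CODE_EXECUTION") != "1":
        raise PermissionError("Set ALLOW_CODE_EXECUTION=1 to execute generated code")
    # A child process with a timeout limits hangs; it is NOT a sandbox.
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


if __name__ == "__main__":
    os.environ["ALLOW_CODE_EXECUTION"] = "1"  # opt in for this demo only
    result = run_untrusted("print(1 + 1)")
    print(result.stdout.strip(), result.returncode)
```

The gate only prevents accidental execution. The child process still has the same file-system and network access as the parent, which is why an outer sandbox is recommended.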
| Field | Description | Example |
|---|---|---|
| experiment.dataset | benchmark to run | gsm8k, hotpotqa, mbpp |
| experiment.optimizer | optimization algorithm | cantante, mipro, gepa |
| experiment.seed | global random seed | 42 |
| experiment.max_token_budget | total token cap | 10000000 |
| experiment.output_dir | where run artefacts are written | ./results/ablations/my_run |
| task_llm_kwargs.model | model used by the MAS agents | any OpenAI-compatible model ID |
| meta_llm_kwargs.model | model used for attribution and optimization | any OpenAI-compatible model ID |
| grid | key–value lists expanded by run_batch.py | experiment.seed: [7, 42, 47] |
| setup_dict_folder | folder containing per-dataset setup YAMLs | configs/setups |
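
The grid field can be read as a set of value lists that are combined into individual run configs. The sketch below shows one way such an expansion might work, assuming dotted keys (as in the experiment.seed example) and a full Cartesian product over all lists. Both assumptions, and the names `set_dotted` and `expand_grid`, are illustrative; run_batch.py may combine values differently.

```python
import copy
import itertools


def set_dotted(config: dict, dotted_key: str, value) -> None:
    """Assign value at a dotted path such as 'experiment.seed', creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def expand_grid(config: dict) -> list[dict]:
    """Return one config per combination of the value lists under 'grid'."""
    base = {k: v for k, v in config.items() if k != "grid"}
    grid = config.get("grid", {})
    keys = list(grid)
    runs = []
    for combo in itertools.product(*(grid[k] for k in keys)):
        run = copy.deepcopy(base)  # keep runs independent of each other
        for key, value in zip(keys, combo):
            set_dotted(run, key, value)
        runs.append(run)
    return runs


if __name__ == "__main__":
    config = {
        "experiment": {"dataset": "gsm8k", "optimizer": "cantante", "seed": 42},
        "grid": {
            "experiment.seed": [7, 42, 47],
            "experiment.dataset": ["gsm8k", "mbpp"],
        },
    }
    for run in expand_grid(config):
        print(run["experiment"])
```

With three seeds and two datasets this yields six run configs, each a deep copy of the base config with the grid values written over the defaults.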
- Dataset loader: add a case in src/experiment/load_datasets.py returning a BaseMASTask subclass (src/tasks/gsm8k.py is the simplest reference).
- Agent setup YAML: create configs/setups/<your_task>.yaml with:

      edges:
        - from: "__start__"
          to: "agent_a"
        - from: "agent_a"
          to: "__end__"
      agents:
        - name: "agent_a"
          task_description: "..."
          input_vars: ["query"]
          output_vars: ["prediction"]
          tools: []
          max_tool_calls: 0
      init_agent_prompt_pool:
        agent_a:
          - "Prompt variant 1 ..."
          - "Prompt variant 2 ..."
          # at least 6 variants recommended

- Tools (optional): add a BaseToolsAdapter subclass in src/agent_tools/ and register it in src/agent_tools/base.py::get_tools().
- Experiment config: copy any existing config, set experiment.dataset to your task name, and point setup_dict_folder at configs/setups.
- Run with run_experiment.py or run_batch.py as normal.
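
Before launching a run with a new setup, it can help to sanity-check the graph. The sketch below works on a plain dict mirroring the YAML keys listed above (edges, agents, init_agent_prompt_pool). The function `check_setup` and the specific checks are hypothetical and not part of the repository.

```python
from collections import deque

START, END = "__start__", "__end__"


def check_setup(setup: dict) -> list[str]:
    """Return a list of problems found in a setup dict (empty means it looks consistent)."""
    problems = []
    agents = {agent["name"] for agent in setup.get("agents", [])}
    successors: dict[str, list[str]] = {}
    for edge in setup.get("edges", []):
        for node in (edge["from"], edge["to"]):
            if node not in agents | {START, END}:
                problems.append(f"edge references unknown node {node!r}")
        successors.setdefault(edge["from"], []).append(edge["to"])

    # Breadth-first search from __start__ to find every reachable node.
    seen, queue = {START}, deque([START])
    while queue:
        for nxt in successors.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    if END not in seen:
        problems.append("__end__ is not reachable from __start__")
    problems += [f"agent {name!r} is unreachable" for name in sorted(agents - seen)]

    pools = setup.get("init_agent_prompt_pool", {})
    problems += [
        f"agent {name!r} has no initial prompts"
        for name in sorted(agents)
        if not pools.get(name)
    ]
    return problems


if __name__ == "__main__":
    setup = {
        "edges": [
            {"from": "__start__", "to": "agent_a"},
            {"from": "agent_a", "to": "__end__"},
        ],
        "agents": [{"name": "agent_a"}],
        "init_agent_prompt_pool": {"agent_a": ["Prompt variant 1 ...", "Prompt variant 2 ..."]},
    }
    print(check_setup(setup))
```

Loading the YAML file into such a dict (for example with a YAML parser of your choice) is left out so the sketch stays standard-library only.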
Cantante/
├── configs/
│   ├── main_experiments/                     grid configs for main paper runs
│   ├── ablations/                            one config per ablation study
│   └── setups/                               per-dataset MAS graph + initial prompt pools
├── scripts/
│   ├── run_experiment.py                     run a single experiment from a YAML config
│   ├── run_batch.py                          expand a grid config and run all jobs (+ eval)
│   ├── run_eval.py                           evaluate a finished run on the test split
│   ├── eval_initial_prompts.py               baseline: evaluate seed prompts only
│   ├── restart_experiment.py                 resume an interrupted run from checkpoint
│   ├── create_tables.py                      generate LaTeX tables
│   └── create_plots.py                       generate paper figures
├── src/
│   ├── mas.py                                MASPredictor — LangGraph-based MAS engine
│   ├── meta_prompts.py                       system prompts for mutation & crossover & attribution
│   ├── prompt_structures.py                  AgentPromptPool / Set / Batch data structures
│   ├── prompt_utils.py                       Prompt Utilities
│   ├── callbacks.py                          CheckpointCallback, OptimizationCallback
│   ├── candidate_selector.py                 prompt candidate selection strategies
│   ├── agent_tools/
│   │   ├── base.py                           tool registry (get_tools) and BaseToolsAdapter
│   │   ├── gsm8k.py                          Tools for GSM8K (None)
│   │   ├── hotpotqa.py                       QA retrieval tools
│   │   └── mbpp.py                           sandboxed code execution (⚠ see Setup)
│   ├── tasks/
│   │   ├── base.py                           BaseMASTask — evaluation loop and scoring
│   │   ├── gsm8k.py                          GSM8K maths task
│   │   ├── hotpotqa.py                       HotpotQA multi-hop QA task
│   │   └── mbpp.py                           MBPP code generation task
│   ├── optimization/
│   │   ├── cantante.py                       CANTANTE — attribution-guided genetic optimiser
│   │   ├── local_optimizer.py                CAPO and EvoPrompt node-level optimisers
│   │   ├── dspy_mas_optimizer.py             wrappers for GEPA and MIPROv2 (DSPy)
│   │   ├── dspy_mas_wrapper.py               DSPy integration adapters
│   │   ├── base_mas_optimizer.py             base optimizer interface
│   │   ├── broker_agent_optimizers.py        multi-agent optimization coordination
│   │   └── proxy_task.py                     attribution proxy task wrapper
│   ├── attribution/
│   │   ├── base.py                           BaseAttributer
│   │   ├── absolute.py                       Absolute attributer (Cantante), as presented in the paper
│   │   └── naive.py                          Identity attribution (ablation study)
│   ├── analysis/
│   │   ├── utils.py                          DataFrame loading, load_main_results_df, compute_ranks
│   │   ├── tables.py                         LatexTable, render_table, get_agg_table
│   │   └── style.py                          matplotlib style configuration
│   └── experiment/
│       ├── load_datasets.py                  get_tasks() — dataset loader factory
│       ├── configs.py                        dataset configuration registry
│       └── utils.py                          seed_everything, get_logger, inject_tool_descriptions
├── notebooks/
│   ├── generate_seed_prompts.ipynb           create initial prompt pools
│   └── inference.ipynb                       single-example inference testing
├── results/
│   ├── main_experiments/                     one sub-directory per completed run
│   │   └── <run_name>/
│   │       ├── prompts_per_step.parquet      full optimization trajectory
│   │       ├── .finished                     flag written on clean exit
│   │       ├── checkpoints/                  per-step checkpoint JSONs
│   │       └── eval/
│   │           ├── scores_per_step.parquet   test-set scores per prompt set
│   │           └── token_usage.yaml
│   └── ablations/                            same structure, one dir per ablation run
├── figures/                                  generated PDF figures
├── tables/                                   generated LaTeX table files
├── token.txt                                 API key — not committed, see Setup
└── pyproject.toml
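
Given the results layout above, a short script can summarise which runs have finished and which have been evaluated. This is a sketch that relies only on the file names shown in the tree (.finished and eval/scores_per_step.parquet); the function `run_status` and its status labels are made up for illustration.

```python
from pathlib import Path


def run_status(results_root: Path) -> dict[str, str]:
    """Map each run directory to 'incomplete', 'finished', or 'evaluated'."""
    status = {}
    for run_dir in sorted(p for p in results_root.iterdir() if p.is_dir()):
        if not (run_dir / ".finished").exists():
            status[run_dir.name] = "incomplete"  # no clean-exit flag yet
        elif (run_dir / "eval" / "scores_per_step.parquet").exists():
            status[run_dir.name] = "evaluated"  # test-set scores are present
        else:
            status[run_dir.name] = "finished"  # optimized but not yet scored
    return status


if __name__ == "__main__":
    import tempfile

    # Build a tiny fake results directory to demonstrate the three states.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "run_a" / "eval").mkdir(parents=True)
        (root / "run_a" / ".finished").touch()
        (root / "run_a" / "eval" / "scores_per_step.parquet").touch()
        (root / "run_b").mkdir()
        (root / "run_b" / ".finished").touch()
        (root / "run_c").mkdir()
        print(run_status(root))
```

Pointing `run_status` at results/main_experiments/ or results/ablations/ gives a quick overview before generating plots or tables.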