⚠️ Code, dataset and model checkpoints are currently under internal company review and will be released soon. 🙏
TMAS is a framework for scaling test-time compute via multi-agent synergy for mathematical reasoning. It organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations.
If you find our work useful, please give us a star ⭐ on GitHub for the latest updates.
- [2026.05.11] 🚀 We release TMAS, a scalable multi-agent framework for test-time compute scaling on mathematical reasoning tasks.
Existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation.
TMAS organizes inference as a collaborative process among five specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memory: the Experience Bank reuses reliable low-level intermediate conclusions and local feedback, while the Guideline Bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns.
We further design a hybrid reward RL scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies.
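As a rough, hypothetical illustration of how such a hybrid reward could combine the three objectives (the weights and the signals `correct`, `used_experience`, and `novel_strategy` are our assumptions for exposition, not the released training code):

```python
def hybrid_reward(correct: bool, used_experience: bool, novel_strategy: bool,
                  w_exp: float = 0.2, w_nov: float = 0.2) -> float:
    """Hypothetical sketch of a hybrid reward with three terms:
    a correctness term (preserves basic reasoning capability), an
    experience-utilization bonus, and a novelty bonus that encourages
    strategies beyond those already attempted."""
    reward = 1.0 if correct else 0.0   # base correctness term
    if correct and used_experience:
        reward += w_exp                # bonus for productive use of the Experience Bank
    if novel_strategy:
        reward += w_nov                # bonus for exploring beyond prior strategies
    return reward
```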
TMAS decomposes inference-time reasoning into five specialized roles:
| Agent | Role |
|---|---|
| Solution Agent | Generates N candidate solutions per iteration; uses an ε-greedy policy to mix exploitation and exploration (see the mode-selection sketch below) |
| Verification Agent | Independently verifies each candidate M times to produce calibrated correctness scores (see the scoring sketch below) |
| Summary Agent | Aggregates M verification results into a concise natural-language summary per candidate |
| Experience Agent | Extracts reusable failure analyses from current rollouts; evolves the Experience Bank |
| Guideline Agent | Derives high-level solving strategies from current rollouts; evolves the Guideline Bank |
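One natural (assumed) reading of the verification step is that each candidate receives M independent binary verdicts whose mean serves as a calibrated correctness score; the actual aggregation used by the Verification and Summary Agents may differ:

```python
def correctness_score(verdicts: list[bool]) -> float:
    """Average M independent verification verdicts into a score in [0, 1].
    Mean-of-verdicts is an illustrative choice, not the confirmed rule."""
    assert verdicts, "need at least one verification verdict"
    return sum(verdicts) / len(verdicts)
```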
All roles default to the same model endpoint and can be independently assigned to different models via separate API flags.
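A minimal sketch of the Solution Agent's ε-greedy behavior, consistent with the `normal` / `experience` / `guideline` generation modes recorded in the RL data below (the function name, fallback order, and bank handling are assumptions):

```python
import random

def choose_generation_mode(epsilon: float, experience_bank: list,
                           guideline_bank: list) -> str:
    """Illustrative ε-greedy choice for one Solution Agent rollout:
    explore via the Guideline Bank with probability epsilon, otherwise
    exploit the Experience Bank; fall back to plain generation."""
    if guideline_bank and random.random() < epsilon:
        return "guideline"   # exploration: steer away from tried strategies
    if experience_bank:
        return "experience"  # exploitation: reuse reliable conclusions/feedback
    return "normal"          # no memory augmentation available
```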
Two persistent stores carry information across iterations for each problem:
- Experience Bank (`E_t`): structured records of past mistakes and correction insights, deduplicated by Jaccard similarity (sketched below).
- Guideline Bank (`G_t`): high-level solving strategies that direct the Solution Agent's exploration policy.
Both banks are persisted to disk and can be resumed across interrupted runs.
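Deduplication by Jaccard similarity could look like the following sketch; the 0.7 threshold and whitespace tokenization are assumptions, and only the Jaccard criterion itself comes from the description above:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-tokenized strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def add_record(bank: list[str], record: str, threshold: float = 0.7) -> None:
    """Append a record unless it near-duplicates an existing bank entry."""
    if all(jaccard(record, existing) < threshold for existing in bank):
        bank.append(record)
```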
The RL training data used in our experiments is released as `RL-training-data.parquet` (4,400 samples).

| Column | Description |
|---|---|
| `prompt` | Input prompt (system + user messages) |
| `data_source` | Problem source (`math` / `math_grm`) |
| `reward_model` | Ground truth and scoring style |
| `extra_info` | Experience Bank and Guideline Bank injected at training time |
| `_source` | Generation mode (`normal` / `experience` / `guideline`) |

The `_source` field reflects the ε-greedy sampling strategy used during trajectory collection: `normal` trajectories use no memory augmentation, while `experience` and `guideline` trajectories are conditioned on the respective memory banks.
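Once the file is released, the schema can be inspected with pandas (assuming pandas plus a parquet engine such as pyarrow is installed):

```python
import pandas as pd

df = pd.read_parquet("RL-training-data.parquet")
print(df.columns.tolist())           # prompt, data_source, reward_model, extra_info, _source
print(df["_source"].value_counts())  # split across normal / experience / guideline
print(len(df))                       # expected: 4400 samples
```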
We evaluate TMAS on two challenging mathematical reasoning benchmarks provided in ./problems/:
| Benchmark | # Problems |
|---|---|
| HLE-Math-100 | 100 |
| IMO-AnswerBench-50 | 50 |
Each file is in JSONL format with the fields `problem_id`, `question`, and `answer`.
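For example, one line of a benchmark file (values are illustrative placeholders):

```json
{"problem_id": "hle_math_001", "question": "Let ...", "answer": "42"}
```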
Performance comparison (Pass@1 %) on IMO-AnswerBench-50 and HLE-Math-100 across representative refinement iterations.
TMAS achieves stronger iterative scaling than existing TTS baselines, continuing to improve as iterations increase rather than plateauing. Hybrid reward RL further amplifies scaling ability and stability across refinement rounds — notably narrowing the gap between the 4B and 30B models by ~59% at iteration 19.
Install the required dependencies:

```bash
pip install httpx openai rich aiofiles filelock transformers
```

The system communicates with any OpenAI-compatible model API. Set the API key if required:

```bash
export API_KEY="your_api_key"
```

Input files should be JSONL, one problem per line:

```json
{"problem_id": "p1", "question": "Let ..."}
```

Edit `run.sh` to set your model endpoint and paths, then:

```bash
bash run.sh
```

Or invoke directly:
```bash
python code/main.py \
    --input_file ./problems/HLE_MATH_100.jsonl \
    --tokenizer_path ./models/tokenizer \
    --base_url http://YOUR_MODEL_ENDPOINT/v1 \
    --model_name your-model-name \
    --n_candidates 8 \
    --n_verifications 8 \
    --max_iterations 20 \
    --max_tokens 130000 \
    --concurrency 800 \
    --epsilon 0.2 \
    --no_proof \
    --output_file ./results/eval_run1
```

| Argument | Description |
|---|---|
| `--input_file` | Path to input JSONL |
| `--output_file` | Output directory (auto-created) |
| `--tokenizer_path` | Path to tokenizer for accurate token counting |
| `--base_url` | OpenAI-compatible API endpoint |
| `--model_name` | Model name served at the endpoint |
| `--n_candidates` | Solutions generated per iteration (N) |
| `--n_verifications` | Independent verification calls per candidate (M) |
| `--max_iterations` | Maximum search iterations |
| `--max_tokens` | Model context window size |
| `--epsilon` | Exploration probability for guideline-directed generation |
| `--no_proof` | Set for numerical-answer problems (outputs `\boxed{}`); omit for proof problems |
By default all agents share one endpoint. To assign different models per role:
- `--base_url` / `--model_name` (Solution Agent, default for all roles)
- `--exp_base_url` / `--exp_model_name` (Experience Agent)
- `--guide_base_url` / `--guide_model_name` (Guideline Agent)
- `--summary_base_url` / `--summary_model_name` (Summary Agent)
- `--verify_base_url` / `--verify_model_name` (Verification Agent)
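For example, to serve verification from a separate endpoint while the remaining roles use the default (the endpoint and model names below are placeholders):

```bash
python code/main.py \
    --base_url http://SOLVER_ENDPOINT/v1 \
    --model_name solver-model \
    --verify_base_url http://VERIFIER_ENDPOINT/v1 \
    --verify_model_name verifier-model \
    ... (remaining arguments as in the invocation above)
```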
```bash
python code/main.py \
    --resume_path ./results/eval_run1 \
    --max_iterations 20 \
    ... (same other arguments)
```

`--max_iterations` is interpreted as the total target iteration count when resuming: a run interrupted at iteration 12 and resumed with `--max_iterations 20` performs 8 more iterations, not 20 additional ones.
```bibtex
@article{wu2026tmas,
  title={TMAS: Scaling Test-Time Compute via Multi-Agent Synergy},
  author={Wu, George and Jing, Nan and Yi, Qing and Hao, Chuan and Yang, Ming and Chang, Feng and Wei, Yuan and Yang, Jian and Tao, Ran and Dai, Bryan},
  journal={arXiv preprint arXiv:2605.10344},
  year={2026}
}
```

