CRONOS is a benchmark designed to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in complex clinical decision-making scenarios. It focuses on two core challenges in oncology: multimodal integration (e.g., pathology, genomics, radiology) and longitudinal reasoning across patient timelines. The benchmark includes agentic tasks that require interaction with external foundation-model-based tools and datasets.
To install all required dependencies, simply run:
```shell
bash setup.sh
```

Note: If you want to evaluate the agent on IHC data, you will additionally need to clone and install TRIDENT from source.
Before running the benchmark, configure your paths and credentials in `neurips25/configs/base.yaml`.
The config file specifies paths to datasets, tool credentials, and output directories. Below, we provide guidance for acquiring the necessary external datasets.
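A minimal sketch of what `base.yaml` might look like (all key names and paths below are illustrative assumptions, not the repo's actual schema — check the shipped config for the real keys):

```yaml
# Illustrative sketch only -- key names and paths are assumptions.
datasets:
  hancock: /data/hancock
  msk_chord: /data/msk_chord
tools:
  drugbank:
    api_url: http://localhost:8080
    api_key: YOUR_API_KEY
output_dir: ./agent_logs/
```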
The HANCOCK dataset contains SVS-format tissue microarrays (TMAs). To prepare it:
- Follow the original HANCOCK GitHub repository to extract tiles and compute cell densities using QuPath.
- Reproduce our ABMIL training by extracting tumor centers and cell density measurements for Blocks 1 and 2.
- Download the dataset files from the HANCOCK project page to replicate the question curation used in the benchmark.
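The ABMIL training mentioned above pools tile-level embeddings into a single slide-level representation via learned attention. A minimal NumPy sketch of that pooling step (random arrays stand in for the learned parameters and tile embeddings; this is not the repo's implementation):

```python
# Attention-based MIL (ABMIL) pooling sketch: each tile embedding receives
# a learned attention score, and the slide embedding is the attention-
# weighted sum of tile embeddings. All weights here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
tiles = rng.normal(size=(50, 128))   # 50 tile embeddings of dimension 128
V = rng.normal(size=(128, 64))       # stand-in attention projection
w = rng.normal(size=(64,))           # stand-in attention vector

scores = np.tanh(tiles @ V) @ w                 # one score per tile
attn = np.exp(scores - scores.max())
attn /= attn.sum()                              # softmax over tiles
slide_embedding = attn @ tiles                  # pooled (128,) representation
print(slide_embedding.shape)                    # -> (128,)
```

In practice the projection and attention vector are trained end-to-end with the downstream classifier; the sketch only shows the pooling arithmetic.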
The MSK-CHORD dataset is available on cBioPortal. To use it:
- Download the ZIP archive from the cBioPortal page.
- Extract it and update the dataset path in `base.yaml`.
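Once extracted, cBioPortal studies ship as tab-separated files whose leading `#` lines carry column metadata. A small sketch of reading such a file (the inline sample and helper function are illustrative, not actual MSK-CHORD contents):

```python
# Sketch: parse a cBioPortal-style clinical TSV (e.g. a patient table from
# the extracted archive). Lines starting with '#' are metadata and are
# skipped; the first remaining line is the header.
import csv
import io

sample = (
    "#Patient Identifier\tSex\n"
    "#PATIENT_ID\tSEX\n"
    "PATIENT_ID\tSEX\n"
    "P-0000001\tFemale\n"
)

def read_cbioportal_tsv(text):
    rows = [line for line in text.splitlines() if not line.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(rows)), delimiter="\t"))

records = read_cbioportal_tsv(sample)
print(records[0]["PATIENT_ID"])  # -> P-0000001
```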
To enable the DrugBank tool for longitudinal drug lookups:
- Register for a DrugBank account.
- Apply for a license to access their API.
- Download and locally host the dataset following their documentation.
- Update your API path and credentials in the config file.
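If you host the downloaded DrugBank release locally, drug lookups can be served from a parsed index. A simplified sketch of indexing drug names from an XML dump (the two-entry sample and flat element layout are assumptions; the real DrugBank schema is namespaced and far richer, so follow their documentation):

```python
# Sketch: build a name index from a DrugBank-style XML dump.
# The structure below is a simplified assumption for illustration.
import xml.etree.ElementTree as ET

sample = """<drugbank>
  <drug><name>Lepirudin</name></drug>
  <drug><name>Cetuximab</name></drug>
</drugbank>"""

root = ET.fromstring(sample)
names = [drug.findtext("name") for drug in root.findall("drug")]
print(names)  # -> ['Lepirudin', 'Cetuximab']
```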
We provide full logs of agent interactions for all models evaluated in the paper:
- `agent_logs_hancock/`: Multimodal evaluation logs (HANCOCK)
- `agent_logs_msk/`: Longitudinal evaluation logs (MSK-CHORD)
Each log includes all agent–LLM conversations, intermediate reasoning steps, and generated answers.
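The logs can be inspected programmatically. The JSON schema below is hypothetical (field names such as `question_id`, `conversation`, and `answer` are assumptions), so check the actual files for the real layout:

```python
# Sketch: summarize one agent log, assuming a hypothetical JSON schema.
import json

sample_log = json.dumps({
    "question_id": "hancock_0001",
    "conversation": [
        {"role": "user", "content": "What is the tumor grade?"},
        {"role": "assistant", "content": "Grade 2"},
    ],
    "answer": "Grade 2",
})

log = json.loads(sample_log)
print(f"{log['question_id']}: {log['answer']} "
      f"({len(log['conversation'])} turns)")
```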
Make sure you have:
- Installed dependencies
- Configured Hugging Face access tokens (for model download)
- Set paths in `base.yaml`
To run an evaluation with Qwen/Qwen2.5-VL-7B-Instruct on the HANCOCK dataset:
```shell
python -m neurips25.benchmarks.run_agent_benchmark \
    --doctor_model "Qwen/Qwen2.5-VL-7B-Instruct" \
    --output_dir "./agent_logs_hancock/" \
    --dataset "hancock"
```