CRONOS is a benchmark designed to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in complex clinical decision-making scenarios. It focuses on two core challenges in oncology: multimodal integration (e.g., pathology, genomics, radiology) and longitudinal reasoning across patient timelines. The benchmark includes agentic tasks that require interaction with external foundation-model-based tools and datasets.
To install all required dependencies, simply run:
```shell
bash setup.sh
```

Note: If you want to evaluate the agent on IHC data, you will additionally need to clone and install TRIDENT from source.
Before running the benchmark, configure your paths and credentials in `neurips25/configs/base.yaml`.
The config file specifies paths to datasets, tool credentials, and output directories. Below, we provide guidance for acquiring the necessary external datasets.
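A minimal sketch of what `base.yaml` might look like (all key names and paths below are illustrative assumptions, not the repo's actual schema — check the shipped config for the real keys):

```yaml
# Illustrative sketch only -- key names and paths are assumptions.
datasets:
  hancock: /data/hancock
  msk_chord: /data/msk_chord
tools:
  drugbank:
    api_url: http://localhost:8080
    api_key: YOUR_API_KEY
output_dir: ./agent_logs/
```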
The HANCOCK dataset contains SVS-format tissue microarrays (TMAs). To prepare it:
- Follow the original HANCOCK GitHub repository to extract tiles and compute cell densities using QuPath.
- Reproduce our ABMIL training by extracting tumor centers and cell density measurements for Blocks 1 and 2.
- Download the dataset files from the HANCOCK project page to replicate the question curation used in the benchmark.
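The ABMIL training mentioned above pools tile-level embeddings into a single slide-level representation via learned attention. A minimal NumPy sketch of that pooling step (random arrays stand in for the learned parameters and tile embeddings; this is not the repo's implementation):

```python
# Attention-based MIL (ABMIL) pooling sketch: each tile embedding receives
# a learned attention score, and the slide embedding is the attention-
# weighted sum of tile embeddings. All weights here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
tiles = rng.normal(size=(50, 128))   # 50 tile embeddings of dimension 128
V = rng.normal(size=(128, 64))       # stand-in attention projection
w = rng.normal(size=(64,))           # stand-in attention vector

scores = np.tanh(tiles @ V) @ w                 # one score per tile
attn = np.exp(scores - scores.max())
attn /= attn.sum()                              # softmax over tiles
slide_embedding = attn @ tiles                  # pooled (128,) representation
print(slide_embedding.shape)                    # -> (128,)
```

In practice the projection and attention vector are trained end-to-end with the downstream classifier; the sketch only shows the pooling arithmetic.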
The MSK-CHORD dataset is available on cBioPortal. To use it:
- Download the ZIP archive from the cBioPortal page.
- Extract it and update the dataset path in `base.yaml`.
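Once extracted, cBioPortal studies ship as tab-separated files whose leading `#` lines carry column metadata. A small sketch of reading such a file (the inline sample and helper function are illustrative, not actual MSK-CHORD contents):

```python
# Sketch: parse a cBioPortal-style clinical TSV (e.g. a patient table from
# the extracted archive). Lines starting with '#' are metadata and are
# skipped; the first remaining line is the header.
import csv
import io

sample = (
    "#Patient Identifier\tSex\n"
    "#PATIENT_ID\tSEX\n"
    "PATIENT_ID\tSEX\n"
    "P-0000001\tFemale\n"
)

def read_cbioportal_tsv(text):
    rows = [line for line in text.splitlines() if not line.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(rows)), delimiter="\t"))

records = read_cbioportal_tsv(sample)
print(records[0]["PATIENT_ID"])  # -> P-0000001
```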
To enable the DrugBank tool for longitudinal drug lookups:
- Register for a DrugBank account.
- Apply for a license to access their API.
- Download and locally host the dataset following their documentation.
- Update your API path and credentials in the config file.
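If you host the downloaded DrugBank release locally, drug lookups can be served from a parsed index. A simplified sketch of indexing drug names from an XML dump (the two-entry sample and flat element layout are assumptions; the real DrugBank schema is namespaced and far richer, so follow their documentation):

```python
# Sketch: build a name index from a DrugBank-style XML dump.
# The structure below is a simplified assumption for illustration.
import xml.etree.ElementTree as ET

sample = """<drugbank>
  <drug><name>Lepirudin</name></drug>
  <drug><name>Cetuximab</name></drug>
</drugbank>"""

root = ET.fromstring(sample)
names = [drug.findtext("name") for drug in root.findall("drug")]
print(names)  # -> ['Lepirudin', 'Cetuximab']
```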
We provide full logs of agent interactions for all models evaluated in the paper:
- `agent_logs_hancock/`: Multimodal evaluation logs (HANCOCK)
- `agent_logs_msk/`: Longitudinal evaluation logs (MSK-CHORD)
Each log includes all agent–LLM conversations, intermediate reasoning steps, and generated answers.
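The logs can be inspected programmatically. The JSON schema below is hypothetical (field names such as `question_id`, `conversation`, and `answer` are assumptions), so check the actual files for the real layout:

```python
# Sketch: summarize one agent log, assuming a hypothetical JSON schema.
import json

sample_log = json.dumps({
    "question_id": "hancock_0001",
    "conversation": [
        {"role": "user", "content": "What is the tumor grade?"},
        {"role": "assistant", "content": "Grade 2"},
    ],
    "answer": "Grade 2",
})

log = json.loads(sample_log)
print(f"{log['question_id']}: {log['answer']} "
      f"({len(log['conversation'])} turns)")
```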
Make sure you have:
- Installed dependencies
- Configured Hugging Face access tokens (for model download)
- Set paths in `base.yaml`
To run an evaluation with Qwen/Qwen2.5-VL-7B-Instruct on the HANCOCK dataset:
```shell
python -m neurips25.benchmarks.run_agent_benchmark \
    --doctor_model "Qwen/Qwen2.5-VL-7B-Instruct" \
    --output_dir "./agent_logs_hancock/" \
    --dataset "hancock"
```