DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making
A self-reflective agentic system for dermatological image analysis, built on LangChain/LangGraph.
DermAgent orchestrates seven specialist vision and language tools (PanDerm, MAKE, DermoGPT, Qwen3-VL, Case RAG, Guideline RAG, Ontology) within a Plan-Execute-Reflect framework, using GPT-4o as the reasoning backbone. A deterministic Critic module performs post-hoc auditing via confidence, coverage, and conflict gates to trigger targeted self-correction, delivering stepwise, traceable diagnostic reasoning.
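The Critic's three audit gates can be sketched as follows. This is a minimal illustration under assumed data shapes and thresholds (`ToolResult`, its field names, and `min_confidence` are all assumptions, not the actual implementation):

```python
from dataclasses import dataclass

# Hypothetical shape of one tool's output; field names are assumptions.
@dataclass
class ToolResult:
    name: str
    prediction: str
    confidence: float  # model-reported probability in [0, 1]

def critic_audit(results: list[ToolResult], required_tools: set[str],
                 min_confidence: float = 0.5) -> list[str]:
    """Deterministic post-hoc audit: return the list of gates that fired.

    An empty list means the answer passes; otherwise the agent
    re-plans, targeted at the failing gates.
    """
    failures = []
    ran = {r.name for r in results}

    # Confidence gate: even the best supporting evidence is too weak.
    if results and max(r.confidence for r in results) < min_confidence:
        failures.append("confidence")

    # Coverage gate: a required specialist tool was never invoked.
    if required_tools - ran:
        failures.append("coverage")

    # Conflict gate: tools disagree on the final prediction.
    if len({r.prediction for r in results}) > 1:
        failures.append("conflict")

    return failures
```

For example, two confident but disagreeing tools trip only the conflict gate, prompting a targeted re-run rather than a full restart.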
```
DermAgent/
├── skin_agent/                    # Core agent framework
│   ├── benchmark_agent.py         # Benchmark agent + Critic + AnswerParser
│   ├── configs.py                 # Dataset task configurations
│   ├── tracing.py                 # TraceLogger, TracingCallback
│   ├── profiler.py                # Performance profiling
│   ├── resume.py                  # Checkpoint/resume for long runs
│   ├── prompts.md                 # System prompts
│   ├── tools/
│   │   ├── base.py                # BaseSkinTool, input schemas
│   │   ├── skin_tools.py          # All 7 tool implementations
│   │   ├── executor.py            # Tool execution orchestration
│   │   └── derm_knowledge_tree/   # Disease ontology JSONs
│   └── utils/
│       ├── retry.py               # Rate-limit retry logic
│       └── image_utils.py         # Image path handling
├── benchmark/                     # Unified evaluation framework
│   ├── run.py                     # CLI runner for single-model baselines
│   ├── metrics.py                 # Shared metrics (classification, multilabel, captioning, VQA)
│   ├── models/                    # Model wrappers (GPT-4o, LLaVA-Med, HuatuoGPT, etc.)
│   └── datasets/                  # Dataset configs with prompts and class lists
├── scripts/                       # All runnable scripts
│   ├── build_qdrant_db.py         # Build image RAG vector database
│   ├── build_qdrant_rag.py        # Build text RAG (guidelines)
│   ├── run_task1_ham10000_500_agent_dermogpt_full_critic.sh
│   ├── run_task1_snu_500_critic.sh
│   ├── run_task2_task3_agent_critic.sh
│   ├── run_task3_loo_ablation.sh
│   └── *.py                       # Python runner scripts
├── baselines/                     # Agent-based baseline reproductions
│   ├── MDAgents/                  # MDAgents agent baseline (NeurIPS 2024)
│   ├── MedAgent-Pro/              # MedAgent-Pro agent baseline
│   └── SkinVL/                    # SkinVL-PubMM baseline
├── data/                          # Benchmark CSV metadata
├── requirements.txt
└── .env.example
```
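As one example of what lives in the utilities, the rate-limit handling in skin_agent/utils/retry.py can be pictured as an exponential-backoff decorator. This is a sketch, not the actual implementation; in practice `exceptions` would name the provider's rate-limit error (e.g. `openai.RateLimitError`), and `RuntimeError` here is a stand-in:

```python
import functools
import time

def retry_on_rate_limit(max_attempts: int = 5, base_delay: float = 1.0,
                        exceptions: tuple = (RuntimeError,)):
    """Retry a callable with exponential backoff on rate-limit errors."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise
                    # Sleep 1s, 2s, 4s, ... before the next attempt
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```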
```bash
conda create -n dermagent python=3.10
conda activate dermagent

# Install PyTorch (match your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt

# Download NLTK data (needed for BLEU/ROUGE metrics)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

cp .env.example .env
# Edit .env: set OPENAI_API_KEY (required) and other variables as needed
```

The following external code/data directories are required but not included in this repository due to size or licensing. Place them at the project root:
| Directory | Purpose | How to Obtain |
|---|---|---|
| Derm1M/src/ | Custom OpenCLIP fork for PanDerm & RAG encoders | Clone from the Derm1M repository |
| MAKE/src/ | Custom OpenCLIP fork for MAKE concept annotation | Clone from the MAKE repository |
| MAKE/concept_annotation/term_lists/ConceptTerms.json | Concept term definitions for MAKE | Included in the MAKE repository above |
| model-weights/DermoGPT-RL | DermoGPT-RL fine-tuned model weights | Download from the DermoGPT repository |
| MM-Skin/ | LLaVA package used by the SkinVL-PubMM baseline | Clone from the MM-Skin repository |
| model-weights/SkinVL-PubMM | SkinVL-PubMM model weights for the baseline | Download from HuggingFace zwq803/SkinVL-PubMM |
| RAG/dermnet_chunks_cleaned.json | DermNet guideline chunks for Text RAG | See Text RAG build instructions below |
| RAG/mayo_chunks_cleaned.json | Mayo Clinic guideline chunks for Text RAG | See Text RAG build instructions below |
| datasets/Derm1M/ | Derm1M dataset for building image RAG index | Download from Derm1M |
For the Text RAG tool, pre-download the embedding and reranker models into model-weights/:

```bash
# Pre-download Qwen3 Embedding and Reranker for Text RAG
huggingface-cli download Qwen/Qwen3-Embedding-8B --local-dir model-weights/Qwen3-Embedding-8B
huggingface-cli download Qwen/Qwen3-Reranker-0.6B --local-dir model-weights/Qwen3-Reranker-0.6B
```

Download the following datasets and place the images in the expected directories:
| Dataset | Task | Download | Image Directory |
|---|---|---|---|
| HAM10000 | Diagnosis (7 classes, 642 imgs) | ISIC Archive | datasets/ham10000/images/ |
| SNU | Diagnosis (134 classes, 500 imgs) | SNU Quiz | datasets/SNU/images/ |
| Derm7pt | Concept Annotation (7 concepts) | Derm7pt | datasets/derm7pt/final_images/ |
| SkinCon | Concept Annotation (32 concepts) | SkinCon | datasets/skincon/final_images/ |
| SkinCAP | Captioning (100 imgs) | SkinCAP | datasets/skin_cap/images/ |
Benchmark CSV metadata (split definitions) are included in data/.
Install and start Qdrant (vector database server):

```bash
# Option A: Docker (recommended)
docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant

# Option B: Local binary — see https://qdrant.tech/documentation/guides/installation/
```

Build the Qdrant vector databases:

```bash
# Image-based case retrieval (requires Derm1M dataset + Derm1M/src/)
python scripts/build_qdrant_db.py

# Guideline-grounded text retrieval (requires RAG/ JSON files)
python scripts/build_qdrant_rag.py
```

Most tool models are auto-downloaded from HuggingFace on first use:
- PanDerm: DermLIP_PanDerm-base-w-PubMed-256 (~2GB VRAM)
- MAKE: MAKE (~2GB VRAM)
- Qwen3-VL: Qwen3-VL-8B-Instruct (~16GB VRAM, bfloat16)
Models that must be manually placed in model-weights/:
- DermoGPT-RL: See Step 3 above (~16GB VRAM, bfloat16)
- Qwen3-Embedding-8B / Qwen3-Reranker-0.6B: See Step 3 above (Text RAG)
Total GPU memory requirement: ~20-22 GB with all tools loaded simultaneously.
| Model | Type | HAM10000 (Acc.) | SNU (Acc.) | Derm7pt (F1-Macro) | SkinCon (F1-Macro) | SkinCAP (ROUGE-L) |
|---|---|---|---|---|---|---|
| LLaVA-Med-v1.5 | Medical MLLM | 0.4424 | 0.0120 | 0.5170 | 0.1310 | 0.1532 |
| HuatuoGPT | Medical MLLM | 0.5140 | 0.0400 | 0.5343 | 0.0949 | 0.1432 |
| DermoGPT-RL | Dermatology MLLM | 0.5000 | 0.0920 | 0.5686 | 0.2072 | 0.1541 |
| SkinVL-PubMM | Dermatology MLLM | 0.4517 | 0.0340 | 0.5314 | 0.1320 | 0.1444 |
| Qwen3-VL-8B | General MLLM | 0.5109 | 0.0780 | 0.5370 | 0.2282 | 0.1247 |
| GPT-4o | General MLLM | 0.4891 | 0.1500 | 0.5414 | 0.2956 | 0.1633 |
| GPT-5.2 | General MLLM | 0.3598 | 0.1480 | 0.5386 | 0.2662 | 0.1235 |
| MDAgents | Medical Agent | 0.1682 | 0.1140 | 0.3614 | 0.2393 | 0.1199 |
| MedAgent-Pro | Medical Agent | 0.5763 | 0.1160 | 0.6482 | 0.1834 | 0.1148 |
| DermAgent (Ours) | Medical Agent | 0.6183 | 0.3260 | 0.6506 | 0.3295 | 0.1948 |
Commands to reproduce DermAgent results:

```bash
# HAM10000 Diagnosis
bash scripts/run_task1_ham10000_500_agent_dermogpt_full_critic.sh

# SNU Diagnosis
bash scripts/run_task1_snu_500_critic.sh

# Derm7pt + SkinCon + SkinCAP
bash scripts/run_task2_task3_agent_critic.sh
```

Commands to reproduce baseline results (see baselines/README.md for agent baselines):

```bash
# Single-model MLLM baselines (e.g., GPT-4o on HAM10000)
cd benchmark && python run.py --model gpt4o --dataset HAM10000_500

# MDAgents agent baseline
cd baselines/MDAgents && python run_derm_benchmark.py --dataset HAM10000 --difficulty basic --model gpt-4o

# MedAgent-Pro agent baseline
cd baselines/MedAgent-Pro && python Derm_Case_level.py --task 1 \
    --csv-path ../../datasets/ham10000/HAM10000_benchmark_500.csv \
    --image-dir ../../datasets/ham10000 --max-samples 500
```

| Configuration | ROUGE-L | Delta (%) |
|---|---|---|
| Full Agent (w/ Critic) | 0.1948 | +12.8 |
| Full Agent (w/o Critic) | 0.1727 | --- |
| w/o Case RAG | 0.1580 | -8.5 |
| w/o Guideline RAG | 0.1628 | -5.7 |
| w/o DermoGPT | 0.1672 | -3.2 |
| w/o PanDerm | 0.1676 | -3.0 |
| w/o MAKE | 0.1679 | -2.8 |
| w/o Ontology | 0.1712 | -0.9 |
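The Delta column is the relative ROUGE-L change against the full agent without the Critic (0.1727); a quick check in Python reproduces the table:

```python
# Relative ROUGE-L change vs. the full agent without the Critic
baseline = 0.1727

def delta_pct(score: float) -> float:
    return round(100 * (score - baseline) / baseline, 1)

print(delta_pct(0.1948))  # Full Agent (w/ Critic)
print(delta_pct(0.1580))  # w/o Case RAG
```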
Command to reproduce ablation results:

```bash
# Runs 6 leave-one-out experiments sequentially (removes one tool at a time)
bash scripts/run_task3_loo_ablation.sh
```

The full agent result (w/ Critic, ROUGE-L: 0.1948) is produced by the Task 3 portion of run_task2_task3_agent_critic.sh. The "w/o Critic" baseline (0.1727) is produced by the LOO script's full-tool run without the Critic module.
| Tool | Model | Purpose |
|---|---|---|
| PanDerm Classifier | DermLIP | Zero-shot disease classification via CLIP similarity |
| MAKE Annotator | MAKE (OpenCLIP) | Dermoscopic concept extraction |
| DermoGPT VQA | DermoGPT-RL | Dermatology-specialized visual QA |
| Qwen3-VL VQA | Qwen3-VL-8B | General visual question answering |
| Image RAG | DermLIP + Qdrant | Case retrieval from 413,210 diagnosed cases |
| Text RAG | Qwen3-Embedding + Qdrant | Guideline retrieval from 3,199 document chunks |
| Ontology | Knowledge Graph | Disease hierarchy and taxonomy queries |
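The Ontology tool answers hierarchy queries over the JSON files in skin_agent/tools/derm_knowledge_tree/. The traversal can be pictured as below; the tree shape, field names, and disease entries here are assumptions for illustration, not the repository's actual schema:

```python
# Assumed node shape: a "name" plus an optional "children" list.
DERM_TREE = {
    "name": "skin disease",
    "children": [
        {"name": "melanocytic lesion", "children": [
            {"name": "melanoma"},
            {"name": "melanocytic nevus"},
        ]},
        {"name": "keratinocytic lesion", "children": [
            {"name": "basal cell carcinoma"},
        ]},
    ],
}

def ancestry(node: dict, target: str, path=()) -> tuple:
    """Return the root-to-target path in the taxonomy, or () if absent."""
    path = path + (node["name"],)
    if node["name"] == target:
        return path
    for child in node.get("children", []):
        found = ancestry(child, target, path)
        if found:
            return found
    return ()

print(ancestry(DERM_TREE, "melanoma"))
```

A query like this lets the agent place a candidate diagnosis within the disease taxonomy when reasoning about differentials.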
Paper: arXiv:2605.14403 (MICCAI 2026, early accept).
```bibtex
@article{liu2026dermagent,
  title={DermAgent: A Self-Reflective Agentic System for Dermatological Image
         Analysis with Multi-Tool Reasoning and Traceable Decision-Making},
  author={Liu, Yize and Yan, Siyuan and Hu, Ming and Ju, Lie and Li, Xieji and
          Tang, Feilong and Feng, Wei and Ge, Zongyuan},
  journal={arXiv preprint arXiv:2605.14403},
  year={2026}
}
```

This project is licensed under the Apache License 2.0 — see the LICENSE file for the full text.

DermAgent redistributes adapted code from third-party projects under the baselines/ directory. See NOTICE and the per-baseline ATTRIBUTION.md files for upstream sources, citations, and license status.
