# DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

arXiv · License: Apache 2.0 · Python 3.10+

A self-reflective agentic system for dermatological image analysis, built on LangChain/LangGraph.

DermAgent orchestrates seven specialist vision and language tools (PanDerm, MAKE, DermoGPT, Qwen3-VL, Case RAG, Guideline RAG, Ontology) within a Plan-Execute-Reflect framework, using GPT-4o as the reasoning backbone. A deterministic Critic module performs post-hoc auditing via confidence, coverage, and conflict gates to trigger targeted self-correction, delivering stepwise, traceable diagnostic reasoning.
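As a concrete illustration, the Critic's three gates can be sketched in a few lines of Python (hypothetical thresholds and trace fields; the actual Critic lives in `skin_agent/benchmark_agent.py`):

```python
# Minimal sketch of a deterministic post-hoc Critic (hypothetical
# thresholds and field names; the real logic is in benchmark_agent.py).

def critic_audit(trace, min_confidence=0.6, required_tools=("PanDerm", "Case RAG")):
    """Audit a finished diagnostic trace and list the gates that failed."""
    failures = []

    # Confidence gate: the final answer must be sufficiently confident.
    if trace["confidence"] < min_confidence:
        failures.append("confidence")

    # Coverage gate: key evidence tools must have been consulted.
    used = {step["tool"] for step in trace["steps"]}
    if not set(required_tools) <= used:
        failures.append("coverage")

    # Conflict gate: tool-level diagnoses must agree with the final answer.
    votes = [s["diagnosis"] for s in trace["steps"] if "diagnosis" in s]
    if votes and any(v != trace["final_diagnosis"] for v in votes):
        failures.append("conflict")

    # Any failed gate triggers targeted self-correction (re-planning).
    return failures

trace = {
    "final_diagnosis": "melanoma",
    "confidence": 0.55,
    "steps": [
        {"tool": "PanDerm", "diagnosis": "melanoma"},
        {"tool": "Case RAG", "diagnosis": "nevus"},
    ],
}
print(critic_audit(trace))  # low confidence + conflicting tool votes
```

In this toy trace the confidence and conflict gates fire while coverage passes, so the agent would be sent back to re-examine the disagreeing evidence.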

## Architecture

*(Figure: DermAgent architecture overview)*

## Project Structure

```
DermAgent/
├── skin_agent/                  # Core agent framework
│   ├── benchmark_agent.py       # Benchmark agent + Critic + AnswerParser
│   ├── configs.py               # Dataset task configurations
│   ├── tracing.py               # TraceLogger, TracingCallback
│   ├── profiler.py              # Performance profiling
│   ├── resume.py                # Checkpoint/resume for long runs
│   ├── prompts.md               # System prompts
│   ├── tools/
│   │   ├── base.py              # BaseSkinTool, input schemas
│   │   ├── skin_tools.py        # All 7 tool implementations
│   │   ├── executor.py          # Tool execution orchestration
│   │   └── derm_knowledge_tree/ # Disease ontology JSONs
│   └── utils/
│       ├── retry.py             # Rate-limit retry logic
│       └── image_utils.py       # Image path handling
├── benchmark/                   # Unified evaluation framework
│   ├── run.py                   # CLI runner for single-model baselines
│   ├── metrics.py               # Shared metrics (classification, multilabel, captioning, VQA)
│   ├── models/                  # Model wrappers (GPT-4o, LLaVA-Med, HuatuoGPT, etc.)
│   └── datasets/                # Dataset configs with prompts and class lists
├── scripts/                     # All runnable scripts
│   ├── build_qdrant_db.py       # Build image RAG vector database
│   ├── build_qdrant_rag.py      # Build text RAG (guidelines)
│   ├── run_task1_ham10000_500_agent_dermogpt_full_critic.sh
│   ├── run_task1_snu_500_critic.sh
│   ├── run_task2_task3_agent_critic.sh
│   ├── run_task3_loo_ablation.sh
│   └── *.py                     # Python runner scripts
├── baselines/                   # Agent-based baseline reproductions
│   ├── MDAgents/                # MDAgents agent baseline (NeurIPS 2024)
│   ├── MedAgent-Pro/            # MedAgent-Pro agent baseline
│   └── SkinVL/                  # SkinVL-PubMM baseline
├── data/                        # Benchmark CSV metadata
├── requirements.txt
└── .env.example
```

## Setup

### 1. Environment

```bash
conda create -n dermagent python=3.10
conda activate dermagent

# Install PyTorch (match your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

pip install -r requirements.txt

# Download NLTK data (needed for BLEU/ROUGE metrics)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```
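For reference, ROUGE-L (used for the SkinCAP captioning task) is an F-score over the longest common subsequence of tokens. The benchmark's real implementation lives in `benchmark/metrics.py`; a self-contained sketch of the metric looks like:

```python
# Illustrative ROUGE-L (longest-common-subsequence F-score).
# The benchmark's actual metrics live in benchmark/metrics.py.

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    """F-score of LCS precision (vs. candidate) and recall (vs. reference)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

score = rouge_l("erythematous scaly plaque on the elbow",
                "scaly plaque on elbow")
```

Here the candidate's four tokens all appear in order in the six-token reference (precision 1.0, recall 2/3), giving an F-score of 0.8.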

### 2. API Keys

```bash
cp .env.example .env
# Edit .env: set OPENAI_API_KEY (required) and other variables as needed
```
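After editing `.env`, a quick sanity check catches missing keys before a long run (illustrative helper only; `check_env` is not part of the repository):

```python
# Fail fast if required API keys are absent after editing .env
# (illustrative helper; not part of the repository).
import os

def check_env(required=("OPENAI_API_KEY",)):
    """Raise if any required environment variable is unset or empty."""
    missing = [k for k in required if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return True

os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")  # simulate a loaded .env
check_env()
```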

### 3. External Dependencies

The following external code/data directories are required but not included in this repository due to size or licensing. Place them at the project root:

| Directory | Purpose | How to Obtain |
|---|---|---|
| `Derm1M/src/` | Custom OpenCLIP fork for PanDerm & RAG encoders | Clone from the Derm1M repository |
| `MAKE/src/` | Custom OpenCLIP fork for MAKE concept annotation | Clone from the MAKE repository |
| `MAKE/concept_annotation/term_lists/ConceptTerms.json` | Concept term definitions for MAKE | Included in the MAKE repository above |
| `model-weights/DermoGPT-RL` | DermoGPT-RL fine-tuned model weights | Download from the DermoGPT repository |
| `MM-Skin/` | LLaVA package used by the SkinVL-PubMM baseline | Clone from the MM-Skin repository |
| `model-weights/SkinVL-PubMM` | SkinVL-PubMM model weights for the baseline | Download from HuggingFace zwq803/SkinVL-PubMM |
| `RAG/dermnet_chunks_cleaned.json` | DermNet guideline chunks for Text RAG | See Text RAG build instructions below |
| `RAG/mayo_chunks_cleaned.json` | Mayo Clinic guideline chunks for Text RAG | See Text RAG build instructions below |
| `datasets/Derm1M/` | Derm1M dataset for building image RAG index | Download from Derm1M |

For Text RAG models, pre-download the embedding and reranker models into model-weights/:

```bash
# Pre-download Qwen3 Embedding and Reranker for Text RAG
huggingface-cli download Qwen/Qwen3-Embedding-8B --local-dir model-weights/Qwen3-Embedding-8B
huggingface-cli download Qwen/Qwen3-Reranker-0.6B --local-dir model-weights/Qwen3-Reranker-0.6B
```

### 4. Datasets

Download the following datasets and place images in the expected directories:

| Dataset | Task | Download | Image Directory |
|---|---|---|---|
| HAM10000 | Diagnosis (7 classes, 642 imgs) | ISIC Archive | `datasets/ham10000/images/` |
| SNU | Diagnosis (134 classes, 500 imgs) | SNU Quiz | `datasets/SNU/images/` |
| Derm7pt | Concept Annotation (7 concepts) | Derm7pt | `datasets/derm7pt/final_images/` |
| SkinCon | Concept Annotation (32 concepts) | SkinCon | `datasets/skincon/final_images/` |
| SkinCAP | Captioning (100 imgs) | SkinCAP | `datasets/skin_cap/images/` |

Benchmark CSV metadata (split definitions) is included in `data/`.

### 5. RAG Vector Database

Install and start Qdrant (vector database server):

```bash
# Option A: Docker (recommended)
docker run -d -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant

# Option B: Local binary — see https://qdrant.tech/documentation/guides/installation/
```

Build the Qdrant vector databases:

```bash
# Image-based case retrieval (requires Derm1M dataset + Derm1M/src/)
python scripts/build_qdrant_db.py

# Guideline-grounded text retrieval (requires RAG/ JSON files)
python scripts/build_qdrant_rag.py
```
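Conceptually, the image index ranks stored case embeddings by cosine similarity to a query embedding; Qdrant does this at scale over the full case collection. A toy stand-in with 3-D vectors (real embeddings come from the DermLIP encoder):

```python
# What the case-retrieval index does conceptually: rank stored case
# embeddings by cosine similarity to a query embedding. Qdrant performs
# this at scale; the 3-D vectors here are toy stand-ins for DermLIP outputs.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, cases, k=2):
    """Return the k most similar (diagnosis, score) pairs, best first."""
    scored = [(dx, cosine(query, vec)) for dx, vec in cases]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

cases = [
    ("melanoma",  [0.9, 0.1, 0.0]),
    ("nevus",     [0.1, 0.9, 0.0]),
    ("keratosis", [0.0, 0.2, 0.9]),
]
hits = top_k([0.8, 0.2, 0.1], cases)
```

The query vector sits closest to the melanoma case, so that case (and its diagnosis) would be returned as supporting evidence to the agent.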

### 6. Model Weights

Most tool models are auto-downloaded from HuggingFace on first use. The following must be manually placed in `model-weights/`:

- **DermoGPT-RL**: see Step 3 above (~16GB VRAM, bfloat16)
- **Qwen3-Embedding-8B / Qwen3-Reranker-0.6B**: see Step 3 above (Text RAG)

Total GPU requirement: ~20-22GB with all tools loaded simultaneously.

## Table 1: Main Results

| Model | Type | HAM10000 (Acc.) | SNU (Acc.) | Derm7pt (F1-Macro) | SkinCon (F1-Macro) | SkinCAP (ROUGE-L) |
|---|---|---|---|---|---|---|
| LLaVA-Med-v1.5 | Medical MLLM | 0.4424 | 0.0120 | 0.5170 | 0.1310 | 0.1532 |
| HuatuoGPT | Medical MLLM | 0.5140 | 0.0400 | 0.5343 | 0.0949 | 0.1432 |
| DermoGPT-RL | Dermatology MLLM | 0.5000 | 0.0920 | 0.5686 | 0.2072 | 0.1541 |
| SkinVL-PubMM | Dermatology MLLM | 0.4517 | 0.0340 | 0.5314 | 0.1320 | 0.1444 |
| Qwen3-VL-8B | General MLLM | 0.5109 | 0.0780 | 0.5370 | 0.2282 | 0.1247 |
| GPT-4o | General MLLM | 0.4891 | 0.1500 | 0.5414 | 0.2956 | 0.1633 |
| GPT-5.2 | General MLLM | 0.3598 | 0.1480 | 0.5386 | 0.2662 | 0.1235 |
| MDAgents | Medical Agent | 0.1682 | 0.1140 | 0.3614 | 0.2393 | 0.1199 |
| MedAgent-Pro | Medical Agent | 0.5763 | 0.1160 | 0.6482 | 0.1834 | 0.1148 |
| **DermAgent (Ours)** | Medical Agent | 0.6183 | 0.3260 | 0.6506 | 0.3295 | 0.1948 |

Commands to reproduce DermAgent results:

```bash
# HAM10000 Diagnosis
bash scripts/run_task1_ham10000_500_agent_dermogpt_full_critic.sh

# SNU Diagnosis
bash scripts/run_task1_snu_500_critic.sh

# Derm7pt + SkinCon + SkinCAP
bash scripts/run_task2_task3_agent_critic.sh
```

Commands to reproduce baseline results (see baselines/README.md for agent baselines):

```bash
# Single-model MLLM baselines (e.g., GPT-4o on HAM10000)
cd benchmark && python run.py --model gpt4o --dataset HAM10000_500

# MDAgents agent baseline
cd baselines/MDAgents && python run_derm_benchmark.py --dataset HAM10000 --difficulty basic --model gpt-4o

# MedAgent-Pro agent baseline
cd baselines/MedAgent-Pro && python Derm_Case_level.py --task 1 \
    --csv-path ../../datasets/ham10000/HAM10000_benchmark_500.csv \
    --image-dir ../../datasets/ham10000 --max-samples 500
```

## Table 2: Ablation Study (Leave-One-Out on SkinCAP)

| Configuration | ROUGE-L | Δ vs. w/o Critic (%) |
|---|---|---|
| Full Agent (w/ Critic) | 0.1948 | +12.8 |
| Full Agent (w/o Critic) | 0.1727 | — |
| w/o Case RAG | 0.1580 | -8.5 |
| w/o Guideline RAG | 0.1628 | -5.7 |
| w/o DermoGPT | 0.1672 | -3.2 |
| w/o PanDerm | 0.1676 | -3.0 |
| w/o MAKE | 0.1679 | -2.8 |
| w/o Ontology | 0.1712 | -0.9 |

Command to reproduce ablation results:

```bash
# Runs 6 leave-one-out experiments sequentially (removes one tool at a time)
bash scripts/run_task3_loo_ablation.sh
```

The full agent result (w/ Critic, ROUGE-L: 0.1948) is produced by the Task 3 portion of run_task2_task3_agent_critic.sh. The "w/o Critic" baseline (0.1727) is produced by the LOO script's full-tool run without the Critic module.
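The Delta column is the relative change in ROUGE-L against the w/o Critic baseline (0.1727), which can be checked directly from the table's numbers:

```python
# Reproduce Table 2's Delta (%) column: each configuration's ROUGE-L is
# compared against the Full Agent (w/o Critic) baseline of 0.1727.
baseline = 0.1727
rouge = {
    "w/ Critic": 0.1948,
    "w/o Case RAG": 0.1580,
    "w/o Guideline RAG": 0.1628,
    "w/o Ontology": 0.1712,
}

# Relative change, rounded to one decimal as in the table.
delta = {name: round(100 * (v - baseline) / baseline, 1)
         for name, v in rouge.items()}
print(delta)  # matches the +12.8 / -8.5 / -5.7 / -0.9 entries in Table 2
```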

## Tool Descriptions

| Tool | Model | Purpose |
|---|---|---|
| PanDerm Classifier | DermLIP | Zero-shot disease classification via CLIP similarity |
| MAKE Annotator | MAKE (OpenCLIP) | Dermoscopic concept extraction |
| DermoGPT VQA | DermoGPT-RL | Dermatology-specialized visual QA |
| Qwen3-VL VQA | Qwen3-VL-8B | General visual question answering |
| Image RAG | DermLIP + Qdrant | Case retrieval from 413,210 diagnosed cases |
| Text RAG | Qwen3-Embedding + Qdrant | Guideline retrieval from 3,199 document chunks |
| Ontology | Knowledge Graph | Disease hierarchy and taxonomy queries |
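The tools share a common interface defined by `BaseSkinTool` in `skin_agent/tools/base.py`. A minimal stand-in (hypothetical fields and ontology entries; the real classes wrap LangChain tools) might look like:

```python
# Minimal stand-in for the shared tool interface (hypothetical fields and
# ontology entries; the real BaseSkinTool wraps LangChain in tools/base.py).
from dataclasses import dataclass, field

@dataclass
class SkinTool:
    name: str
    description: str  # surfaced to the planner when it selects tools

    def run(self, image_path: str, query: str) -> dict:
        raise NotImplementedError

@dataclass
class OntologyTool(SkinTool):
    name: str = "Ontology"
    description: str = "Disease hierarchy and taxonomy queries"
    tree: dict = field(default_factory=lambda: {
        "melanocytic": ["melanoma", "nevus"],
        "keratinocytic": ["actinic keratosis", "basal cell carcinoma"],
    })

    def run(self, image_path: str, query: str) -> dict:
        # Return the parent categories containing the queried diagnosis.
        parents = [p for p, kids in self.tree.items() if query in kids]
        return {"tool": self.name, "parents": parents}

result = OntologyTool().run("", "melanoma")
```

Structured outputs like this let the Plan-Execute-Reflect loop (and the Critic) treat every tool's evidence uniformly in the trace.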

## Citation

Paper: arXiv:2605.14403 (MICCAI 2026, early accept).

```bibtex
@article{liu2026dermagent,
  title={DermAgent: A Self-Reflective Agentic System for Dermatological Image
         Analysis with Multi-Tool Reasoning and Traceable Decision-Making},
  author={Liu, Yize and Yan, Siyuan and Hu, Ming and Ju, Lie and Li, Xieji and
          Tang, Feilong and Feng, Wei and Ge, Zongyuan},
  journal={arXiv preprint arXiv:2605.14403},
  year={2026}
}
```

## License

This project is licensed under the Apache License 2.0 — see the LICENSE file for the full text.

DermAgent redistributes adapted code from third-party projects under the baselines/ directory. See NOTICE and the per-baseline ATTRIBUTION.md files for upstream sources, citations, and license status.
