Official implementation of "DecomposeRL: Traceable Claim Verification via RL-Trained Decomposition"
- Overview
- Installation
- Dataset
- Models
- Usage
- Reward Functions
- Project Structure
- Citation
- Acknowledgments
DecomposeRL trains language models to verify factual claims by decomposing them into sub-questions, answering each from evidence, and aggregating into a verdict. The model learns this decomposition strategy through GRPO (Group Relative Policy Optimization) with a multi-component reward system that evaluates question quality, answer correctness, and coverage.
Key contributions:
- Iterative decomposition: the model generates
<question>/<answer>pairs in a thinking loop, then outputs a<verification>verdict - Multi-reward GRPO training: 7 reward signals (format, verification, atomicity, diversity, answerability, correctness, coverage) guide the policy
- Tiny judges: distilled ModernBERT classifiers replace the LLM judge at training time for 100x faster reward computation
- 10 evaluation datasets: FEVER, HoVer, WiCE, ClaimDecomp, ExFEVER, PubHealthFact, FoolMeTwice, PubMedClaim, CoverBench, LLMAggreFactt
- Python 3.12
- CUDA 12.8+
- GPU with at least 24GB VRAM (for training)
git clone https://github.com/dipta007/DecomposeRL.git
cd DecomposeRL
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
make setupThe dataset is hosted on HuggingFace:
| Name | Description |
|---|---|
| dipta007/DecomposeRL | Train + 11 test splits across 10 verification datasets |
from datasets import load_dataset
# Load training data
train = load_dataset("dipta007/DecomposeRL", "5000", split="train")
# Load a test split
test = load_dataset("dipta007/DecomposeRL", "5000", split="test_pubmedclaim")Each example contains:
claim: the factual claim to verifyevidence: the evidence documentlabel:SupportedorRefuteddecomposed_questions: ground-truth sub-questions (for training)
| Split | Dataset | Samples |
|---|---|---|
test_fever |
FEVER | - |
test_claimdecomp |
ClaimDecomp | - |
test_hover |
HoVer | - |
test_feverous |
FEVEROUS | - |
test_wice |
WiCE | - |
test_ex_fever |
ExFEVER | - |
test_pubhealthfact |
PubHealthFact | - |
test_fool_me_twice |
FoolMeTwice | - |
test_pubmedclaim |
PubMedClaim | - |
test_coverbench |
CoverBench | - |
test_llmaggrefact |
LLMAggreFactt | - |
| Model | Base | HuggingFace |
|---|---|---|
| DecomposeRL-7B | Qwen2.5-7B-Instruct | dipta007/decomposeRL-7b |
| Tiny Judge (8 classifiers) | ModernBERT-large | Collection |
All models and resources are collected in the HuggingFace Collection.
Run prompted baselines (6 methods: Self-Ask, Decomposed Prompting, HiSS, FOLK, ProgramFC, Chen-Complex):
# Single method + dataset
PYTHONPATH=. uv run python src/baselines/run.py \
--method self_ask --dataset pubmedclaim \
--model Qwen/Qwen2.5-7B-Instruct \
--output_dir outputs/baseline_7b_prompted
# Simple / Chain-of-Thought baselines
PYTHONPATH=. uv run python src/baselines/direct.py \
-d pubmedclaim --mode cot \
-o outputs/baseline_7b
# MiniCheck (NLI-based)
PYTHONPATH=. uv run python src/baselines/nli.py \
-d pubmedclaim -o outputs/baseline_minicheck
# Run all baselines across all datasets
bash src/baselines/baseline.shAPI-backed baselines (OpenAI / Anthropic):
OPENAI_API_KEY=... PYTHONPATH=. uv run python src/baselines/run.py \
--method hiss --backend api --provider openai \
--dataset pubmedclaim --output_dir outputs/baseline_apiuvx vllm serve Qwen/Qwen3-32B \
--port 8000 \
--tensor-parallel-size 4 \
--max-model-len 16384 \
--reasoning-parser qwen3 \
--dtype bfloat16 \
--gpu-memory-utilization 0.92
uvx vllm serve Qwen/Qwen3-Embedding-8B \
--port 8004 \
--max-model-len 8192 \
--trust-remote-codePYTHONPATH=. uv run python src/train/decomposerl/train.py \
--model_name unsloth/Qwen2.5-7B-Instruct \
--run_name my_runEnvironment variables for reward ablations:
| Variable | Default | Description |
|---|---|---|
REWARD_BACKEND |
llm-judge |
llm-judge or tiny-judge |
DIVERSITY_REWARD |
mmr |
mmr or vendi |
NECESSITY_AGGREGATION |
mean |
mean or min |
SUPERVISION_RATE |
1.0 |
fraction of claims using GT labels (0.0-1.0) |
COVERAGE_REWARD |
1 |
0 to disable |
GOOD_NUM_Q_REWARD |
1 |
0 to disable |
The LLM judge (Qwen3-32B) is accurate but slow. We distill it into small ModernBERT-large classifiers ("tiny judges") — one per reward criterion — that run locally on a single GPU and are ~100x faster.
Pre-trained tiny judges are available on HuggingFace:
| Task | Model |
|---|---|
| Atomicity (5 criteria) | dipta007/atomicity-*-judge-balanced |
| Question Answerable | dipta007/question-judge-balanced |
| Answer Correctness | dipta007/answer-judge-balanced |
| Coverage | dipta007/coverage-judge-balanced |
Use pre-trained tiny judges for GRPO training (no judge server needed):
REWARD_BACKEND=tiny-judge bash scripts/train_grpo.sh --run_name my_runTrain your own from LLM judge cache:
# Train all tasks
bash scripts/train_tiny_judge.sh --task all
# Train a single task and push to HF Hub
bash scripts/train_tiny_judge.sh --task coverage --pushEvaluate a trained checkpoint:
# Single dataset
PYTHONPATH=. uv run python src/test/test.py \
-c outputs/my_run/checkpoint-100 \
-d pubmedclaim
# All datasets
bash src/test/eval.sh outputs/my_run/checkpoint-100| Reward | Description |
|---|---|
| Format | Validates <think>, <question>, <answer>, <verification> tag structure |
| Verification | Binary match of predicted vs ground-truth label |
| Atomicity | 5-criterion checklist per question (is_question, single_focus, no_conjunctions, verifiable, grounded) |
| Diversity | MMR or Vendi Score over question embeddings |
| Answerability | Whether each question can be answered from the evidence |
| Answer Correctness | Whether each answer is factually correct given the evidence |
| Coverage | Whether the decomposed answers lead to the correct verdict |
| Necessity | Leave-one-out saliency: removing a question should change the verdict |
DecomposeRL/
├── src/
│ ├── baselines/ # 8 baseline methods (direct, CoT, MiniCheck, 6 prompted)
│ ├── train/
│ │ ├── decomposerl/ # GRPO training + reward functions
│ │ └── tiny_judge/ # Distilled judge classifier training
│ ├── test/ # Checkpoint evaluation
│ └── data_process/ # Dataset curation pipeline
├── scripts/ # Shell scripts (train, eval)
├── Makefile
└── pyproject.toml
If you find this work useful, please cite our paper:
@article{dipta2025decomposerl,
title={DecomposeRL: Traceable Claim Verification via RL-Trained Decomposition},
author={Shubhashis Roy Dipta and Ankur Padia and Francis Ferraro},
year={2025},
eprint={0000.00000},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/0000.00000},
}- Unsloth for efficient LoRA fine-tuning
- vLLM for fast inference
- TRL for GRPO training
- Qwen for base models
For questions or issues, please open a GitHub issue.