Conformal False-Discovery Control for Faithful Retrieval-Augmented Generation
Reference implementation accompanying the ICML 2026 paper. CiteGuard turns RAG citation verification into a multiple-testing problem: each claim gets a conformal p-value against an unsupported-claim null, and Benjamini-Hochberg (BH) or Benjamini-Yekutieli (BY) selection accepts only claims whose evidence support is statistically significant at a user-chosen FDR level.
- Finite-sample, distribution-free FDR control under exchangeable calibration.
- BH for PRDS regimes and BY as a worst-case safety net under arbitrary dependence.
- Adaptive variant that switches between BH and BY using lightweight diagnostics.
- Drop-in baselines (vanilla RAG, heuristic filter, calibrated threshold, selective prediction, RCPS-like) for iso-abstention comparisons.
- Synthetic Monte Carlo validation, alpha sweeps, and a per-seed FDR audit tool.
Requires Python 3.9 or newer.
git clone https://github.com/XiangyuJiang01/citeguard.git
cd citeguard
pip install -e .
# Optional dev extras:
pip install -e ".[dev]"

The default install pulls torch, transformers, and datasets. The
synthetic and smoke pipelines only need numpy, pandas, and matplotlib,
so a minimal install (e.g. pip install numpy pandas matplotlib pyyaml tqdm scikit-learn) is sufficient for run_synthetic_validation.py and
run_citeguard.py --config configs/smoke.yaml.
python scripts/run_citeguard.py --config configs/smoke.yaml
python scripts/run_baselines.py --config configs/smoke.yaml
python scripts/run_alpha_sweep.py --config configs/alpha_sweep.yaml
pytest tests -q

The smoke configuration runs on the 12-claim toy set bundled in
data/smoke/claims.jsonl. It exercises the full conformal + BH/BY pipeline
end to end in well under one minute on a laptop CPU.
The full reproduction has three independent tracks. None of the released scripts ship with API credentials or pre-computed results; users prepare data and run scoring locally.
python scripts/prepare_fever.py --output data/processed/fever_claims.jsonl
python scripts/score_claims.py \
--input data/processed/fever_claims.jsonl \
--output data/processed/fever_scored.jsonl \
--model MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli
python scripts/run_fever_eval.py \
--scored data/processed/fever_scored.jsonl \
--out-dir results/fever \
--seeds 42 123 456 789 1024
python scripts/plot_fever_alpha_sweep.py \
--summary results/fever/metrics_summary.json \
--out results/fever/fever_alpha_sweep.png

prepare_fever.py defaults to the https://hf-mirror.com Hugging Face mirror.
Override with export HF_ENDPOINT=https://huggingface.co outside mainland China.
run_nq_eval.py operates on a per-question JSONL with the schema described
inside the script (one object per question, claims[] array with score,
supported, claim_index, claim). The released code does not include the
generation/extraction/verification/judge LLM pipeline or the private
nq_pipeline.jsonl used for the paper, because the available implementations
relied on private API credentials and human annotation files.
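As a rough illustration of the record shape described above, the sketch below builds one per-question record and checks that every claim carries the four named fields. The authoritative schema is the one documented inside run_nq_eval.py; the `question` key, the value types, the example claims, and the helper `validate_record` are assumptions made here for illustration.

```python
import json

# The four per-claim fields named in the schema description.
REQUIRED_CLAIM_KEYS = {"score", "supported", "claim_index", "claim"}


def validate_record(record):
    """Return a list of problems found in one per-question record."""
    claims = record.get("claims")
    if not isinstance(claims, list):
        return ["record has no 'claims' array"]
    problems = []
    for pos, claim in enumerate(claims):
        if not isinstance(claim, dict):
            problems.append(f"claim {pos} is not an object")
            continue
        missing = REQUIRED_CLAIM_KEYS - set(claim)
        if missing:
            problems.append(f"claim {pos} is missing {sorted(missing)}")
    return problems


# Hypothetical record: the "question" key and all values are illustrative.
example = {
    "question": "What is the capital of France?",
    "claims": [
        {"claim_index": 0, "claim": "Paris is the capital of France.",
         "score": 0.97, "supported": True},
        {"claim_index": 1, "claim": "Paris has 40 million residents.",
         "score": 0.08, "supported": False},
    ],
}

# One JSON object per line, as expected of a JSONL file.
line = json.dumps(example)
problems = validate_record(json.loads(line))
```

Running a check like this over a pipeline file before calling run_nq_eval.py can catch missing fields early.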
Two reproduction routes:
- From scratch. Implement the four-step pipeline (draft answer, claim
extraction, evidence verification, support judgement) on any LLM provider,
emit one record per question with the schema above, then run
scripts/run_nq_eval.py --pipeline your_pipeline.jsonl --out-dir results/nq.
Section 4 of the paper documents the prompts and decision rules in full.
- Using a published Self-RAG/CoVe pipeline. Adapt any open-source Self-RAG
or Chain-of-Verification implementation to produce the same JSONL schema
and reuse scripts/eval_external_baselines.py for baseline scoring.
python scripts/prepare_nq.py --output data/processed/nq_questions.jsonl
# ... user-provided generation pipeline writes nq_pipeline.jsonl ...
python scripts/run_nq_eval.py \
--pipeline data/processed/nq_pipeline.jsonl \
--out-dir results/nq
python scripts/plot_nq_alpha_sweep.py \
--summary results/nq/metrics_summary.json \
--out results/nq/nq_alpha_sweep.png

Human annotation data is not released; the annotator-agreement protocol is documented in Appendix D of the paper.
python scripts/run_synthetic_validation.py --out-dir results/synthetic --n-trials 2000
python scripts/run_avalanche_sweep.py --out-dir results/synthetic_sweep --n-trials 2000
python scripts/audit_fdr_violations.py results/synthetic/synthetic_validation.json

These runs use only NumPy and matplotlib and finish in a few minutes on a laptop. They reproduce Section 5 of the paper end to end.
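For intuition about what a synthetic FDR check measures, here is a small self-contained Monte Carlo sketch. It is not run_synthetic_validation.py: the Gaussian score model, the parameter names, and the default values are all assumptions chosen for illustration. Each trial draws fresh calibration scores from the unsupported-claim null, mixes null and supported test claims, applies conformal p-values with BH, and records the false discovery proportion.

```python
import numpy as np


def conformal_p_values(null_scores, test_scores):
    """p_j = (1 + #{calibration scores >= test score j}) / (n + 1)."""
    null_sorted = np.sort(null_scores)
    n = null_sorted.size
    n_below = np.searchsorted(null_sorted, test_scores, side="left")
    return (1.0 + n - n_below) / (n + 1.0)


def bh_select(p, alpha):
    """Benjamini-Hochberg step-up rule; returns a boolean acceptance mask."""
    m = p.size
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    accept = np.zeros(m, dtype=bool)
    if passed.size:
        accept[order[: passed[-1] + 1]] = True
    return accept


def simulate_fdr(alpha=0.1, n_trials=300, n_cal=200, m=50,
                 null_frac=0.5, shift=3.0, seed=0):
    """Average false discovery proportion over independent trials.

    Assumed score model: unsupported claims score N(0, 1) and supported
    claims score N(shift, 1).
    """
    rng = np.random.default_rng(seed)
    n_null = int(round(null_frac * m))
    is_null = np.arange(m) < n_null  # first n_null test claims are unsupported
    fdp = np.empty(n_trials)
    for t in range(n_trials):
        cal = rng.normal(0.0, 1.0, size=n_cal)
        test = rng.normal(0.0, 1.0, size=m) + np.where(is_null, 0.0, shift)
        accept = bh_select(conformal_p_values(cal, test), alpha)
        # FDP is defined as 0 when nothing is accepted.
        fdp[t] = (accept & is_null).sum() / max(accept.sum(), 1)
    return float(fdp.mean())


empirical_fdr = simulate_fdr()
```

Under this model the empirical FDR should sit at or below the target alpha, up to Monte Carlo noise; a per-seed audit looks at how often individual seeds exceed it.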
citeguard/          Core library
  conformal.py      Conformal p-values for the unsupported-claim null
  fdr.py            BH / BY / adaptive selection
  pipeline.py       End-to-end CiteGuard selector
  scoring.py        Mock and NLI cross-encoder scorers
  baselines.py      Vanilla / heuristic / selective / threshold / RCPS-like
  metrics.py        FDR, coverage, EM@Acc, bootstrap CIs
  coherence.py      Post-hoc coherence flags
  claims.py         Deterministic claim extractor
  schema.py         Unified ClaimRecord dataclass
  io.py             JSON / JSONL / YAML config helpers
scripts/            Command-line entry points (21 scripts)
configs/            YAML/JSON configs for smoke, FEVER, NQ, and alpha sweeps
data/smoke/         12-claim toy dataset
tests/              pytest suite (conformal, FDR, metrics, coherence, smoke)
pyproject.toml      Package metadata
LICENSE             MIT
CITATION.cff        Machine-readable citation metadata
pytest tests -q

Three modules cover the conformal core (test_conformal_fdr.py), claim-level
metrics and coherence flags (test_metrics_coherence.py), and an
end-to-end smoke run of the full pipeline (test_pipeline_smoke.py).
This release is intentionally scoped to the algorithmic core. It includes everything needed to reproduce the FEVER and synthetic tracks end to end and to plug CiteGuard into your own RAG pipeline. Exact NQ paper numbers require a pipeline JSONL matching the documented schema. The following items are intentionally not packaged:
- LLM-based generation/judgement scripts for the NQ pipeline. They were written against private third-party APIs and are kept locally only; the paper documents the prompts and decision rules so they can be reproduced on any LLM provider.
- Human annotation data. The protocol and inter-annotator agreement are reported in Appendix D of the paper; raw annotations are kept private.
- Pre-computed experimental results. FEVER and synthetic outputs are regenerated by running the released scripts; NQ outputs are regenerated from a user-provided pipeline JSONL matching the documented schema.
MIT. See LICENSE.
@inproceedings{citeguard2026,
title = {CiteGuard: Conformal False-Discovery Control for Faithful Retrieval-Augmented Generation},
author = {Jiang, Xiangyu},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026},
publisher = {PMLR}
}

Issues and pull requests are welcome on GitHub. Author email: xj70@sussex.ac.uk