XiangyuJiang01/citeguard

CiteGuard

Conformal False-Discovery Control for Faithful Retrieval-Augmented Generation

Reference implementation accompanying the ICML 2026 paper. CiteGuard turns RAG citation verification into a multiple-testing problem: each claim gets a conformal p-value against an unsupported-claim null, and Benjamini-Hochberg (BH) or Benjamini-Yekutieli (BY) selection accepts only claims whose evidence support is statistically significant at a user-chosen FDR level.
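
The two steps named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in citeguard/conformal.py or citeguard/fdr.py: the function names are hypothetical, and it assumes that higher scores mean stronger evidence support and that the calibration scores come from claims known to be unsupported.

```python
import numpy as np


def conformal_p_values(calib_null_scores, test_scores):
    """Conformal p-value per test claim: (1 + #{calib >= test}) / (n + 1).

    Assumption: higher score = stronger support, and the calibration
    scores were computed on claims known to be unsupported.
    """
    calib = np.sort(np.asarray(calib_null_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    # Count calibration scores >= each test score (ties count against the claim).
    n_ge = calib.size - np.searchsorted(calib, test, side="left")
    return (1.0 + n_ge) / (calib.size + 1.0)


def benjamini_hochberg(p_values, alpha):
    """Boolean mask of claims accepted by the BH step-up rule at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    accept = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: accept every claim up to the largest rank that passes.
        k = np.nonzero(below)[0].max()
        accept[order[: k + 1]] = True
    return accept


p = conformal_p_values([0.1, 0.2, 0.3, 0.4], [0.5, 0.25])
accepted = benjamini_hochberg([0.03, 0.03, 0.03, 0.5], alpha=0.05)
```

A claim is accepted only when its p-value falls under the data-dependent BH threshold, so a small calibration set limits how small a p-value can get: with n calibration scores the minimum is 1 / (n + 1).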

Highlights

  • Finite-sample, distribution-free FDR control under exchangeable calibration.
  • BH for PRDS regimes and BY as a worst-case safety net under arbitrary dependence.
  • Adaptive variant that switches between BH and BY using lightweight diagnostics.
  • Drop-in baselines (vanilla RAG, heuristic filter, calibrated threshold, selective prediction, RCPS-like) for iso-abstention comparisons.
  • Synthetic Monte Carlo validation, alpha sweeps, and a per-seed FDR audit tool.
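
To make the BH versus BY distinction concrete, here is a hedged sketch of a step-up selector with a switch for the dependence assumption. The function name and the `dependence` argument are hypothetical and are not taken from citeguard/fdr.py; the adaptive diagnostics are not reproduced here. BY divides the BH thresholds by the harmonic number H_m = 1 + 1/2 + ... + 1/m, which is what buys validity under arbitrary dependence.

```python
import numpy as np


def step_up_select(p_values, alpha, dependence="prds"):
    """Step-up selection: BH for dependence="prds", BY for "arbitrary".

    Hypothetical helper for illustration only.
    """
    if dependence not in ("prds", "arbitrary"):
        raise ValueError("dependence must be 'prds' or 'arbitrary'")
    p = np.asarray(p_values, dtype=float)
    m = p.size
    # BY correction factor: the m-th harmonic number; BH uses 1.
    correction = np.sum(1.0 / np.arange(1, m + 1)) if dependence == "arbitrary" else 1.0
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / (m * correction)
    below = p[order] <= thresholds
    accept = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        accept[order[: k + 1]] = True
    return accept


p = [0.001, 0.02, 0.03, 0.5]
bh = step_up_select(p, alpha=0.05, dependence="prds")
by = step_up_select(p, alpha=0.05, dependence="arbitrary")
```

On this toy input BH accepts the three smallest p-values while BY accepts only the smallest, which shows the power cost of the worst-case guarantee.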

Installation

Requires Python 3.9 or newer.

git clone https://github.com/XiangyuJiang01/citeguard.git
cd citeguard
pip install -e .
# Optional dev extras:
pip install -e ".[dev]"

The default install pulls torch, transformers, and datasets. The synthetic and smoke pipelines do not need these heavy dependencies, so a minimal install (e.g. pip install numpy pandas matplotlib pyyaml tqdm scikit-learn) is sufficient for run_synthetic_validation.py and run_citeguard.py --config configs/smoke.yaml.

Quick start: smoke test

python scripts/run_citeguard.py --config configs/smoke.yaml
python scripts/run_baselines.py --config configs/smoke.yaml
python scripts/run_alpha_sweep.py --config configs/alpha_sweep.yaml
pytest tests -q

The smoke configuration runs on the 12-claim toy set bundled in data/smoke/claims.jsonl. It exercises the full conformal + BH/BY pipeline end to end in well under one minute on a laptop CPU.

Reproducing the paper experiments

The full reproduction has three independent tracks. None of the released scripts ship with API credentials or pre-computed results; users prepare data and run scoring locally.

1. FEVER (algorithmic validation, no LLM API required)

python scripts/prepare_fever.py --output data/processed/fever_claims.jsonl
python scripts/score_claims.py \
    --input data/processed/fever_claims.jsonl \
    --output data/processed/fever_scored.jsonl \
    --model MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli
python scripts/run_fever_eval.py \
    --scored data/processed/fever_scored.jsonl \
    --out-dir results/fever \
    --seeds 42 123 456 789 1024
python scripts/plot_fever_alpha_sweep.py \
    --summary results/fever/metrics_summary.json \
    --out results/fever/fever_alpha_sweep.png

prepare_fever.py defaults to the https://hf-mirror.com HuggingFace mirror. Outside mainland China, override it with export HF_ENDPOINT=https://huggingface.co before running the script.

2. NQ (claim-level reproduction)

run_nq_eval.py operates on a per-question JSONL with the schema described inside the script (one object per question, claims[] array with score, supported, claim_index, claim). The released code does not include the generation/extraction/verification/judge LLM pipeline or the private nq_pipeline.jsonl used for the paper, because the available implementations relied on private API credentials and human annotation files.
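
As a rough guide for anyone building their own pipeline file, the sketch below builds and checks one per-question record. Only the claims array and its four per-claim keys come from the description above. The top-level "question" key, the value types, and the helper name are assumptions made for illustration; the schema documented inside run_nq_eval.py is authoritative.

```python
import json

# Per-claim keys named in the schema description above.
REQUIRED_CLAIM_KEYS = {"score", "supported", "claim_index", "claim"}


def validate_record(record):
    """Check that a per-question record has a claims list with the required keys."""
    claims = record.get("claims")
    if not isinstance(claims, list):
        raise ValueError("record must contain a 'claims' list")
    for claim in claims:
        missing = REQUIRED_CLAIM_KEYS - set(claim)
        if missing:
            raise ValueError(f"claim is missing keys: {sorted(missing)}")
    return record


example = {
    "question": "Who wrote Hamlet?",  # hypothetical top-level key
    "claims": [
        # Value types (float score, boolean supported) are assumptions.
        {"claim_index": 0, "claim": "Hamlet was written by William Shakespeare.",
         "score": 0.97, "supported": True},
        {"claim_index": 1, "claim": "Hamlet was first performed in 1550.",
         "score": 0.08, "supported": False},
    ],
}

# One JSON object per line, as expected for a JSONL file.
line = json.dumps(validate_record(example))
```

Appending one such line per question to a file gives a JSONL that can be passed through --pipeline, provided the remaining fields match what the script expects.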

Two reproduction routes:

  • From scratch. Implement the four-step pipeline (draft answer, claim extraction, evidence verification, support judgement) on any LLM provider, emit one record per question with the schema above, then run scripts/run_nq_eval.py --pipeline your_pipeline.jsonl --out-dir results/nq. Section 4 of the paper documents the prompts and decision rules in full.
  • Using a published Self-RAG/CoVe pipeline. Adapt any open-source Self-RAG or Chain-of-Verification implementation to produce the same JSONL schema and reuse scripts/eval_external_baselines.py for baseline scoring.

python scripts/prepare_nq.py --output data/processed/nq_questions.jsonl
# ... user-provided generation pipeline writes nq_pipeline.jsonl ...
python scripts/run_nq_eval.py \
    --pipeline data/processed/nq_pipeline.jsonl \
    --out-dir results/nq
python scripts/plot_nq_alpha_sweep.py \
    --summary results/nq/metrics_summary.json \
    --out results/nq/nq_alpha_sweep.png

Human annotation data is not released; the annotator-agreement protocol is documented in Appendix D of the paper.

3. Synthetic Monte Carlo validation

python scripts/run_synthetic_validation.py --out-dir results/synthetic --n-trials 2000
python scripts/run_avalanche_sweep.py --out-dir results/synthetic_sweep --n-trials 2000
python scripts/audit_fdr_violations.py results/synthetic/synthetic_validation.json

These runs use only NumPy and matplotlib and finish in a few minutes on a laptop. They reproduce Section 5 of the paper end to end.
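
For intuition about what such a validation measures, the following is a self-contained toy Monte Carlo, not the released run_synthetic_validation.py. It assumes a simple Gaussian score model (unsupported claims score N(0, 1), supported claims score N(3, 1)) and hypothetical function names, then averages the false discovery proportion of conformal BH over repeated trials.

```python
import numpy as np


def conformal_p(calib, test):
    """Conformal p-values against the unsupported-claim (null) calibration scores."""
    calib = np.sort(calib)
    n_ge = calib.size - np.searchsorted(calib, test, side="left")
    return (1.0 + n_ge) / (calib.size + 1.0)


def bh_mask(p, alpha):
    """BH step-up selection mask."""
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    mask = np.zeros(m, dtype=bool)
    if below.any():
        mask[order[: np.nonzero(below)[0].max() + 1]] = True
    return mask


def estimate_fdr(n_trials=500, n_calib=200, n_test=50,
                 null_frac=0.5, alpha=0.1, seed=0):
    """Average false discovery proportion over independent trials."""
    rng = np.random.default_rng(seed)
    fdps = np.empty(n_trials)
    for t in range(n_trials):
        calib = rng.normal(0.0, 1.0, n_calib)        # unsupported calibration claims
        is_null = rng.random(n_test) < null_frac     # True = unsupported test claim
        test = rng.normal(np.where(is_null, 0.0, 3.0), 1.0)
        accepted = bh_mask(conformal_p(calib, test), alpha)
        # FDP = accepted unsupported claims / accepted claims (0 if none accepted).
        fdps[t] = (accepted & is_null).sum() / max(accepted.sum(), 1)
    return float(fdps.mean())


fdr_hat = estimate_fdr(n_trials=300)
```

Under this toy model the estimate should sit at or below alpha up to Monte Carlo noise; a per-seed audit would repeat the call with different seeds and flag any run whose estimate exceeds alpha by more than its standard error.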

Repository layout

citeguard/                 Core library
  conformal.py             Conformal p-values for the unsupported-claim null
  fdr.py                   BH / BY / adaptive selection
  pipeline.py              End-to-end CiteGuard selector
  scoring.py               Mock and NLI cross-encoder scorers
  baselines.py             Vanilla / heuristic / selective / threshold / RCPS-like
  metrics.py               FDR, coverage, EM@Acc, bootstrap CIs
  coherence.py             Post-hoc coherence flags
  claims.py                Deterministic claim extractor
  schema.py                Unified ClaimRecord dataclass
  io.py                    JSON / JSONL / YAML config helpers
scripts/                   Command-line entry points (21 scripts)
configs/                   YAML/JSON configs for smoke, FEVER, NQ, and alpha sweeps
data/smoke/                12-claim toy dataset
tests/                     pytest suite (conformal, FDR, metrics, coherence, smoke)
pyproject.toml             Package metadata
LICENSE                    MIT
CITATION.cff               Machine-readable citation metadata

Tests

pytest tests -q

Three test modules cover the conformal core (test_conformal_fdr.py), claim-level metrics and coherence flags (test_metrics_coherence.py), and an end-to-end smoke run of the full pipeline (test_pipeline_smoke.py).

What is and is not in this release

This release is intentionally scoped to the algorithmic core. It includes everything needed to reproduce the FEVER and synthetic tracks end to end and to plug CiteGuard into your own RAG pipeline. Exact NQ paper numbers require a pipeline JSONL matching the documented schema. The following items are intentionally not packaged:

  • LLM-based generation/judgement scripts for the NQ pipeline. They were written against private third-party APIs and are kept locally only; the paper documents the prompts and decision rules so they can be reproduced on any LLM provider.
  • Human annotation data. The protocol and inter-annotator agreement are reported in Appendix D of the paper; raw annotations are kept private.
  • Pre-computed experimental results. FEVER and synthetic outputs are regenerated by running the released scripts; NQ outputs are regenerated from a user-provided pipeline JSONL matching the documented schema.

License

MIT. See LICENSE.

Citation

@inproceedings{citeguard2026,
  title     = {CiteGuard: Conformal False-Discovery Control for Faithful Retrieval-Augmented Generation},
  author    = {Jiang, Xiangyu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  publisher = {PMLR}
}

Contact

Issues and pull requests are welcome on GitHub. Author email: xj70@sussex.ac.uk
