LLM hallucination detection pipeline for verifying bibliographic references. Built with LangGraph and FastAPI.
- Python 3.11+
- uv — fast Python package manager
Citeguard uses Langfuse for tracing and observability of agent runs. You have two options:
Option A: Langfuse Cloud (recommended for quick setup)
- Create a free account at cloud.langfuse.com
- Go to Settings → API Keys and create a new key pair
- Copy your `Secret Key` and `Public Key`, and note the base URL (https://cloud.langfuse.com)
Option B: Self-hosted Langfuse (via Docker)
If you prefer to run Langfuse locally:
```shell
# Clone and start Langfuse
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
```

Langfuse will be available at http://localhost:3000. Create a project and generate API keys from the UI.
For more details, see the Langfuse self-hosting docs.
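Whichever option you choose, the resulting credentials go in your `.env`. The variable names below are the Langfuse SDK defaults (check `.env.example` for the names this project actually reads); the `pk-lf-`/`sk-lf-` prefixes are Langfuse's standard key formats:

```shell
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or http://localhost:3000 if self-hosted
```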
- Clone and enter the repo directory:

  ```shell
  git clone https://github.com/your-org/citeguard.git
  cd citeguard
  ```

- Create and activate a virtual environment:

  ```shell
  uv sync
  source .venv/bin/activate
  ```

- Set up your environment variables:

  ```shell
  cp .env.example .env
  ```

  Then open `.env` and fill in your keys.
Optional — DBLP local database:
For better CS conference paper coverage, build a local DBLP index (~4.6GB, runs once):
```shell
uv run python scripts/build_dblp_index.py
```

This downloads the full DBLP dataset and indexes it locally. After building, set:

```shell
DBLP_DB_PATH=./data/dblp/dblp.db
```
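Once built, the index is just a local SQLite file, so title lookups become cheap exact-match queries instead of API calls. A minimal sketch of that idea — the table layout and the `normalize` helper here are illustrative assumptions, not the project's actual schema:

```python
import re
import sqlite3

def normalize(title: str) -> str:
    """Lowercase and strip punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

# Hypothetical schema for illustration; the real dblp.db layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (norm_title TEXT PRIMARY KEY, year INTEGER, venue TEXT)")
conn.execute(
    "INSERT INTO papers VALUES (?, ?, ?)",
    (normalize("Attention Is All You Need"), 2017, "NeurIPS"),
)

def lookup(title: str):
    """Return (year, venue) for an exact normalized-title match, else None."""
    return conn.execute(
        "SELECT year, venue FROM papers WHERE norm_title = ?",
        (normalize(title),),
    ).fetchone()
```

Normalizing on both the write and read path is what makes `lookup("Attention is all you need!")` still hit the stored row despite casing and punctuation differences.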
Optional — Web search fallback:
For references that no academic database can find, you can enable a last-resort web search stage. Set one of the following in your .env:
| Option | Setup | Cost | Notes |
|---|---|---|---|
| SearXNG | Docker (self-hosted) | Free | Best academic coverage — targets Google Scholar, Semantic Scholar, arXiv |
| Tavily | API key | Free tier (1k req/month) | No infrastructure needed — sign up at tavily.com |
If neither is set, the pipeline skips this stage silently — no breakage.
SearXNG setup:
See docker/searxng/README.md for full instructions.
Then add to .env:
```shell
SEARXNG_URL=http://localhost:8080
```
Tavily setup:
```shell
TAVILY_API_KEY=tvly-xxxxxxxxxx
```
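Since both backends are optional, selecting one reduces to checking which variable is present. A framework-free sketch of that decision (the preference order shown is an assumption for illustration, not necessarily what Citeguard does):

```python
import os

def pick_web_search_backend(env=os.environ):
    """Choose the web-search fallback from configuration, if any.

    Assumed preference order: self-hosted SearXNG first, then Tavily;
    if neither variable is set, the stage is skipped entirely.
    """
    if env.get("SEARXNG_URL"):
        return "searxng"
    if env.get("TAVILY_API_KEY"):
        return "tavily"
    return None  # no backend configured: skip the stage silently
```

Returning `None` rather than raising is what lets the pipeline degrade gracefully when neither key is set.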
Web search results are intentionally scored as `LIKELY_REAL` at best, never `VERIFIED`, reflecting the weaker signal from a general web match compared to a structured academic database lookup.
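One way to enforce such a cap is to order the verdicts and clamp anything a web match produces. A minimal sketch — the enum values and names besides `LIKELY_REAL` and `VERIFIED` are assumptions, not the project's actual types:

```python
from enum import IntEnum

class Verdict(IntEnum):
    """Ordered confidence levels (illustrative; the project's own enum may differ)."""
    HALLUCINATED = 0
    UNVERIFIABLE = 1
    LIKELY_REAL = 2
    VERIFIED = 3

def cap_web_search_verdict(verdict: Verdict) -> Verdict:
    """A web-only match can never promote a reference past LIKELY_REAL."""
    return min(verdict, Verdict.LIKELY_REAL)
```

Using an ordered `IntEnum` makes the cap a one-line `min`, and lower verdicts pass through unchanged.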
```shell
# Development
uv run uvicorn app.main:app --reload
```

The API will be available at http://localhost:8000. Visit `/docs` for the interactive Swagger UI.
```shell
docker build -t citeguard --platform linux/amd64 .
docker run --env-file .env -p 8000:8000 citeguard
```

Citeguard runs a multi-agent pipeline (via LangGraph) that:
```mermaid
---
config:
  flowchart:
    curve: linear
---
graph TD;
        __start__([<p>__start__</p>]):::first
        router_input_node(router_input_node)
        gather_all_info_node(gather_all_info_node)
        verify_dblp_node(verify_dblp_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        verify_search_node(verify_search_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        extract_references_node(extract_references_node<br/><small><font color="#ca5e9b">nodes/extraction_nodes.py</font></small>)
        verify_doi_node(verify_doi_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        verify_arxiv_node(verify_arxiv_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        merge_results_node(merge_results_node<br/><small><font color="#ca5e9b">nodes/merge_and_score_nodes.py</font></small>)
        verify_openlibrary_node(verify_openlibrary_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        verify_web_search_node(verify_web_search_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        score_node(score_node<br/><small><font color="#ca5e9b">nodes/merge_and_score_nodes.py</font></small>)
        classify_references_node(classify_references_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        needs_search_node(needs_search_node<br/><small><font color="#ca5e9b">nodes/verification_nodes.py</font></small>)
        parse_content_from_file_node(parse_content_from_file_node<br/><small><font color="#ca5e9b">nodes/extraction_nodes.py</font></small>)
        __end__([<p>__end__</p>]):::last
        __start__ --> router_input_node;
        classify_references_node --> verify_arxiv_node;
        classify_references_node --> verify_doi_node;
        extract_references_node --> classify_references_node;
        gather_all_info_node --> extract_references_node;
        merge_results_node --> score_node;
        needs_search_node -. merge .-> merge_results_node;
        needs_search_node -. search .-> verify_search_node;
        parse_content_from_file_node --> gather_all_info_node;
        router_input_node -. text .-> gather_all_info_node;
        router_input_node -. file .-> parse_content_from_file_node;
        verify_arxiv_node --> needs_search_node;
        verify_dblp_node -. merge .-> merge_results_node;
        verify_dblp_node -. openlibrary .-> verify_openlibrary_node;
        verify_doi_node --> needs_search_node;
        verify_openlibrary_node -. merge .-> merge_results_node;
        verify_openlibrary_node -. web_search .-> verify_web_search_node;
        verify_search_node -. merge .-> merge_results_node;
        verify_search_node -. dblp .-> verify_dblp_node;
        verify_web_search_node --> merge_results_node;
        score_node --> __end__;
```
- Extracts structured references from LLM-generated text
- Verifies each reference against scholarly databases (Crossref, Semantic Scholar, OpenAlex)
- Cross-validates metadata fields (authors, year, journal, DOI)
- Scores each reference as verified, suspicious, or hallucinated
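The dotted edges in the graph above are conditional routes on pipeline state, e.g. `needs_search_node` choosing between `merge` and `search`. A framework-free sketch of that kind of routing decision — the state shape and keys here are illustrative assumptions, not the project's actual LangGraph state:

```python
def route_after_needs_search(state: dict) -> str:
    """Mirror of a conditional edge like needs_search_node's:
    fall through to merging when every reference already has a verdict,
    otherwise fan out to the search-based verifier.
    (State keys are hypothetical, for illustration only.)
    """
    unresolved = [r for r in state["references"] if r.get("verdict") is None]
    return "verify_search_node" if unresolved else "merge_results_node"
```

In LangGraph terms, a function like this would be registered as the condition on the edge, with its return value naming the next node to run.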
All runs are traced in Langfuse for full observability.
Submit text for verification:
```shell
curl -X POST http://localhost:8000/verify \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{"text": "your text with references here", "content_type": "text"}'
```

Upload a file:
```shell
curl -X POST http://localhost:8000/verify \
  -H "Authorization: Bearer your-token" \
  -F "file=@paper.pdf"
```

Example response:
```json
{
  "total": 16,
  "verified": 10,
  "needs_review": 1,
  "likely_hallucinated": 2,
  "unverifiable": 3,
  "references": [
    {
      "title": "Attention is all you need",
      "verdict": "VERIFIED",
      "matched_url": "https://arxiv.org/abs/1706.03762",
      "sources_checked": ["arxiv"]
    }
  ]
}
```

Citeguard is licensed under the MIT License.