From 76ddf1a62b6f04deb0aa93bb7da35e96a65e9c67 Mon Sep 17 00:00:00 2001
From: Dragon-AI Agent <dragon-ai-agent[bot]@users.noreply.github.com>
Date: Mon, 17 Nov 2025 19:44:42 +0000
Subject: [PATCH] Add LinkML validator documentation for hallucination
 guardrails
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Extended the hallucination guardrails documentation to include:
- Two key validation strategies: ontology terms and text excerpts
- Conceptual explanation of dual validation (ID + label) approach
- Deterministic matching approach for text excerpt validation
- Links to linkml-term-validator and linkml-reference-validator implementations

This addresses issue #51 by documenting the concepts behind these
validation approaches while keeping implementation details in the
linked tools' documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
---
 .../make-ids-hallucination-resistant.md       | 61 +++++++++++++++++--
 1 file changed, 57 insertions(+), 4 deletions(-)
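
Illustration only, not part of the patch: a minimal Python sketch of the dual validation (ID + label) check described in the "Validating Ontology Terms" section added below. The `CANONICAL_LABELS` table and the `validate_term` helper are hypothetical stand-ins. A real check would look labels up through OLS or OAK rather than a hard-coded dict, and the case-insensitive label comparison is an assumption.

```python
from typing import Optional

# Hypothetical stand-in for an ontology source (assumption for illustration);
# a real validator would query OLS or OAK instead of a hard-coded table.
CANONICAL_LABELS = {
    "GO:0006915": "apoptotic process",
    "GO:0008150": "biological_process",
}


def validate_term(term_id: str, label: str) -> str:
    """Return 'ok', 'unknown_id', or 'label_mismatch' for an (ID, label) pair."""
    canonical: Optional[str] = CANONICAL_LABELS.get(term_id)
    if canonical is None:
        # The ID does not exist in the (stand-in) ontology.
        return "unknown_id"
    if canonical.casefold() != label.strip().casefold():
        # The ID exists, but the supplied label is not its canonical label.
        return "label_mismatch"
    return "ok"
```

Returning a status string rather than a boolean keeps the two failure modes apart, so a workflow could reject unknown IDs outright while flagging label mismatches for review.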

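Illustration only, not part of the patch: a minimal Python sketch of the deterministic excerpt matching described in the "Validating Text Excerpts" section added below. The `excerpt_in_source` helper is hypothetical and is not the linkml-reference-validator API. It assumes that bracketed insertions and ellipses both mark gaps, and that the remaining fragments must appear in the source in order after whitespace and case normalization.

```python
import re

# Bracketed editorial text ("[p53]", "[...]") and ellipses are treated as gaps.
GAP_PATTERN = re.compile(r"\[[^\]]*\]|\.{3}|…")


def normalize(text: str) -> str:
    """Collapse whitespace and case so line wrapping does not affect matching."""
    return " ".join(text.split()).casefold()


def excerpt_in_source(excerpt: str, source: str) -> bool:
    """Check that every fragment of the excerpt occurs in the source, in order."""
    fragments = [normalize(part) for part in GAP_PATTERN.split(excerpt)]
    fragments = [fragment for fragment in fragments if fragment]
    if not fragments:
        # An excerpt with no literal text cannot be verified.
        return False
    haystack = normalize(source)
    position = 0
    for fragment in fragments:
        found = haystack.find(fragment, position)
        if found == -1:
            return False
        # Continue searching after this fragment to enforce ordering.
        position = found + len(fragment)
    return True
```

Because the check is plain substring search, the same excerpt and source always give the same answer, which is the reproducibility property the section argues for.
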
diff --git a/docs/how-tos/make-ids-hallucination-resistant.md b/docs/how-tos/make-ids-hallucination-resistant.md
index 630d2f9..a428f4e 100644
--- a/docs/how-tos/make-ids-hallucination-resistant.md
+++ b/docs/how-tos/make-ids-hallucination-resistant.md
@@ -89,19 +89,58 @@ This pattern only works if you have validation tools that can actually check the
 3. **Label matching**: Compare provided labels against canonical ones
 4. **Consistency checking**: Make sure everything matches up
 
+## Two Key Validation Strategies
+
+There are two complementary approaches to preventing hallucinations in AI-assisted curation:
+
+### Validating Ontology Terms
+
+When AI systems generate ontology terms, you need to verify that:
+
+- The term IDs actually exist in the ontology
+- The labels match the canonical labels in the source ontology
+- The ID and label are consistent with each other
+
+This **dual validation approach** (ID + label) makes it much harder for language models to fabricate plausible-looking but fake terms. A fabricated ID will usually fail the existence check, and even when a made-up ID happens to collide with a real term, the label the model supplied is unlikely to match that term's canonical label.
+
+The validation process typically involves:
+
+1. **Schema validation**: Checking that data structures properly reference ontology terms
+2. **Dynamic lookup**: Querying ontology sources in real time to verify terms exist
+3. **Multi-level caching**: Using in-memory and file-based caches to optimize performance
+4. **Binding validation**: Ensuring nested object fields maintain structural integrity
+
+### Validating Text Excerpts
+
+When AI extracts or generates text excerpts that are supposed to come from published sources, you need to verify that the text actually appears in the cited publication. This prevents:
+
+- Paraphrasing that changes the meaning
+- Fabricated quotes attributed to real papers
+- Misattributions where real text is attributed to the wrong paper
+
+This **deterministic matching approach** emphasizes exact textual correspondence rather than fuzzy or AI-based approximation. The key principles are:
+
+1. **Substring matching**: Look for exact matches between the excerpt and source material
+2. **Editorial conventions**: Handle legitimate variations like bracketed clarifications `[...]` and ellipses `...`
+3. **Source fetching**: Retrieve publication content from authoritative APIs (PubMed/PMC)
+4. **Local caching**: Store retrieved publications to minimize repeated API requests
+
+This approach prioritizes accuracy and reproducibility over convenience, making it suitable for rigorous curation where precision matters.
+
 ### Useful APIs for Validation
 
 - **OLS (Ontology Lookup Service)**: EBI's comprehensive API for biomedical ontologies
 - **OAK (Ontology Access Kit)**: Python library that can work with multiple ontology sources
-- **PubMed APIs**: For validating PMIDs and retrieving titles
+- **PubMed/PMC APIs**: For fetching publication content and validating excerpts
 - **Individual ontology APIs**: Many ontologies have their own REST APIs
 
 ### Implementation Notes
 
-- **Cache responses** to avoid hitting APIs repeatedly for the same IDs
+- **Cache responses** to avoid hitting APIs repeatedly for the same IDs or publications
 - **Handle network failures** gracefully - you don't want validation failures to break your workflow
 - **Consider performance** - real-time validation can slow things down, so you might need to batch or background the checks
 - **Plan for errors** - decide how to handle cases where validation fails (reject, flag for review, etc.)
+- **Use deterministic methods** - prefer exact matching over probabilistic approaches for reproducibility
 
 ## Beyond Basic Ontologies
 
@@ -145,9 +184,23 @@ But for most scientific curation workflows involving ontologies, genes, and publ
 ## Getting Started
 
 1. **Pick one identifier type** that's important for your workflow
-2. **Find the authoritative API** for that type  
+2. **Find the authoritative API** for that type
 3. **Modify your prompts** to require both ID and label
 4. **Build simple validation** that checks both pieces
 5. **Expand gradually** to other identifier types
 
-The key is to start simple and build up. You don't need a comprehensive system from day one - even basic validation on your most critical identifier types can make a big difference.
\ No newline at end of file
+The key is to start simple and build up. You don't need a comprehensive system from day one - even basic validation on your most critical identifier types can make a big difference.
+
+## Implementation Tools
+
+If you're working with LinkML schemas and data, there are ready-to-use validation plugins that implement these concepts:
+
+### For Ontology Term Validation
+
+The [**linkml-term-validator**](https://linkml.io/linkml-term-validator/) plugin validates that LinkML schemas and datasets properly reference external ontology terms. It implements the dual validation (ID + label) approach described above and works with multiple ontology sources through the Ontology Access Kit (OAK). It's particularly useful for preventing AI-generated hallucinations in automated curation workflows.
+
+### For Text Excerpt Validation
+
+The [**linkml-reference-validator**](https://linkml.io/linkml-reference-validator/) plugin validates that text excerpts in datasets accurately match their cited source publications. It fetches references from PubMed/PMC and performs deterministic substring matching to verify quotes. The validator also includes [guidance for validating OBO format files](https://linkml.io/linkml-reference-validator/how-to/validate-obo-files/), making it useful for ontology curation workflows.
+
+Both tools integrate as plugins within LinkML's validation framework, support multi-level caching for performance, and can be used in CI/CD pipelines or programmatically in Python.
\ No newline at end of file