From 19041a61f4025205a211c0dcc73a2df545f5f30d Mon Sep 17 00:00:00 2001 From: Dragon-AI Agent Date: Mon, 17 Nov 2025 19:44:55 +0000 Subject: [PATCH] Add LinkML validator documentation for hallucination guardrails MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extends existing hallucination prevention documentation to cover: - Distinction between term validation (ID + label) and reference validation (quote + citation) - Core concepts and principles for both validation approaches - When to use each type of validation - Practical examples of text excerpt validation - Implementation details for linkml-term-validator and linkml-reference-validator - Integration guidance for using both tools together Focuses on concepts with links to implementation-specific documentation per feedback from @cmungall in #41. Addresses #51 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .../make-ids-hallucination-resistant.md | 149 +++++++++++++++++- 1 file changed, 145 insertions(+), 4 deletions(-) diff --git a/docs/how-tos/make-ids-hallucination-resistant.md b/docs/how-tos/make-ids-hallucination-resistant.md index 630d2f9..471d37d 100644 --- a/docs/how-tos/make-ids-hallucination-resistant.md +++ b/docs/how-tos/make-ids-hallucination-resistant.md @@ -76,10 +76,38 @@ publication: # Would catch mismatches like: publication: - pmid: 10802651 + pmid: 10802651 title: "Some other paper title" # Wrong title for this PMID ``` +### Text Excerpts from Publications + +When your curation includes quoted text or supporting evidence from papers, you can validate that the text actually appears in the cited source: + +```yaml +# This would pass validation +annotation: + term_id: GO:0005634 + supporting_text: "The protein localizes to the nucleus during cell division" + reference: PMID:12345678 + # Validation checks that this exact text appears in PMID:12345678 + +# This would fail - text doesn't appear in the paper 
+annotation: + term_id: GO:0005634 + supporting_text: "Made-up description that sounds plausible" + reference: PMID:12345678 +``` + +The validator supports standard editorial conventions: + +```yaml +# These are valid - bracketed clarifications and ellipses are allowed +annotation: + supporting_text: "The protein [localizes] to the nucleus...during cell division" + # Matches: "The protein to the nucleus early during cell division" +``` + ## You Need Tooling for This This pattern only works if you have validation tools that can actually check the identifiers against authoritative sources. You need: @@ -89,19 +117,66 @@ This pattern only works if you have validation tools that can actually check the 3. **Label matching**: Compare provided labels against canonical ones 4. **Consistency checking**: Make sure everything matches up +## Validation Concepts + +There are two complementary approaches to preventing hallucinations in AI-assisted curation: + +### 1. Term Validation (ID + Label Checking) + +This is the approach we've been discussing: validating that identifiers and their labels are consistent with authoritative ontology sources. The key concept is **dual verification** - requiring both the ID and its canonical label makes it exponentially harder for an AI to accidentally fabricate a valid combination. + +**Core principles:** +- Validate term IDs against ontology sources to ensure they exist +- Verify labels match the canonical labels from those sources +- Check consistency between related terms in your data +- Support dynamic enum validation for flexible controlled vocabularies + +**When to use this:** +- You're working with ontology terms (GO, HP, MONDO, etc.) +- You're handling gene identifiers, chemical compounds, or other standardized entities +- You need to validate that AI-generated annotations use real, correctly-labeled terms + +### 2. 
Reference Validation (Quote + Citation Checking) + +A complementary approach validates that text excerpts or quotes in your data actually appear in their cited sources. This prevents AI from inventing supporting text or misattributing quotes to publications. + +**Core principles:** +- Fetch the actual publication content from authoritative sources +- Perform deterministic substring matching (not fuzzy matching) +- Support legitimate editorial conventions (bracketed clarifications, ellipses) +- Reject any text that can't be verified in the source + +**When to use this:** +- Your curation workflow includes extracting text from publications +- You're building datasets with quoted material and citations +- AI systems are summarizing or extracting information from papers +- You need to verify that supporting text for annotations comes from real sources + +### Why Both Matter + +These validation approaches protect against different types of hallucinations: +- **Term validation** prevents fabricated identifiers and misapplied terms +- **Reference validation** prevents fabricated quotes and misattributed text + +For comprehensive AI guardrails, you often need both. For example, when curating gene-disease associations, you might validate: +1. That the gene IDs and disease term IDs are real and correctly labeled +2. 
That the supporting text cited from a paper actually appears in that paper + ### Useful APIs for Validation - **OLS (Ontology Lookup Service)**: EBI's comprehensive API for biomedical ontologies - **OAK (Ontology Access Kit)**: Python library that can work with multiple ontology sources - **PubMed APIs**: For validating PMIDs and retrieving titles +- **PMC (PubMed Central)**: For accessing full-text content to validate excerpts - **Individual ontology APIs**: Many ontologies have their own REST APIs ### Implementation Notes -- **Cache responses** to avoid hitting APIs repeatedly for the same IDs +- **Cache responses** to avoid hitting APIs repeatedly for the same IDs or references - **Handle network failures** gracefully - you don't want validation failures to break your workflow - **Consider performance** - real-time validation can slow things down, so you might need to batch or background the checks - **Plan for errors** - decide how to handle cases where validation fails (reject, flag for review, etc.) +- **Use deterministic validation** - avoid fuzzy matching that might accept AI-generated approximations ## Beyond Basic Ontologies @@ -145,9 +220,75 @@ But for most scientific curation workflows involving ontologies, genes, and publ ## Getting Started 1. **Pick one identifier type** that's important for your workflow -2. **Find the authoritative API** for that type +2. **Find the authoritative API** for that type 3. **Modify your prompts** to require both ID and label 4. **Build simple validation** that checks both pieces 5. **Expand gradually** to other identifier types -The key is to start simple and build up. You don't need a comprehensive system from day one - even basic validation on your most critical identifier types can make a big difference. \ No newline at end of file +The key is to start simple and build up. You don't need a comprehensive system from day one - even basic validation on your most critical identifier types can make a big difference. 
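The dual-verification idea in steps 3 and 4 can be sketched in a few lines. The lookup table below is a hypothetical stand-in for cached responses from OLS or OAK; a real implementation would query one of those APIs and cache the results:

```python
# Sketch of "dual verification": require both the ID and its label,
# then check the pair against an authoritative lookup. The dict here is
# a stand-in for real API responses (OLS, OAK, etc.).
CANONICAL_LABELS = {  # hypothetical cached ontology lookups
    "GO:0005634": "nucleus",
    "GO:0005739": "mitochondrion",
}

def validate_term(term_id: str, label: str) -> list:
    """Return a list of problems; an empty list means the pair passed."""
    problems = []
    canonical = CANONICAL_LABELS.get(term_id)
    if canonical is None:
        problems.append(f"{term_id}: unknown identifier")
    elif canonical.lower() != label.lower():
        problems.append(f"{term_id}: label {label!r} != canonical {canonical!r}")
    return problems

# A fabricated-but-plausible pairing is caught:
assert validate_term("GO:0005634", "nucleus") == []       # passes
assert validate_term("GO:0005634", "mitochondrion") != [] # label mismatch
assert validate_term("GO:9999999", "nucleus") != []       # nonexistent ID
```

The point of requiring both pieces is that an AI must get the ID *and* the canonical label right simultaneously, which a fabricated combination almost never does.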
+ +## Implementation Tools + +The concepts described in this guide are implemented in several practical tools: + +### LinkML Validator Plugins + +The [LinkML](https://linkml.io) ecosystem provides validator plugins specifically designed for these hallucination prevention patterns: + +#### linkml-term-validator + +Validates ontology terms in LinkML schemas and datasets using the dual verification approach (ID + label checking): + +- **Schema validation**: Verifies `meaning` fields in enum definitions reference real ontology terms +- **Dynamic enum validation**: Checks data against constraints like `reachable_from`, `matches`, and `concepts` +- **Binding validation**: Enforces constraints on nested object fields +- **Multi-level caching**: Speeds up validation with in-memory and file-based caching +- **Ontology Access Kit integration**: Works with multiple ontology sources through OAK adapters + +**Learn more:** [linkml-term-validator documentation](https://linkml.io/linkml-term-validator/) + +#### linkml-reference-validator + +Validates that text excerpts match their source publications using deterministic verification: + +- **Deterministic substring matching**: No fuzzy matching or AI approximations +- **Editorial convention support**: Handles bracketed clarifications and ellipses +- **PubMed/PMC integration**: Fetches actual publication content for verification +- **Smart caching**: Minimizes API requests with local caching +- **Multiple interfaces**: Command-line tool, Python API, and LinkML schema integration +- **OBO format support**: Can validate supporting text annotations in OBO ontology files + +**Learn more:** [linkml-reference-validator documentation](https://linkml.io/linkml-reference-validator/) + +### Using These Tools Together + +For comprehensive AI guardrails, you can combine both validators in your workflow: + +1. Use **linkml-term-validator** to ensure all ontology terms, gene IDs, and other identifiers are real and correctly labeled +2. 
Use **linkml-reference-validator** to verify that supporting text and quotes actually appear in their cited sources +3. Integrate both into your CI/CD pipeline to catch hallucinations before they enter your knowledge base + +### Example: Validating OBO Ontology Files + +If you're working with OBO format ontologies that include supporting text annotations, you can use regex-based validation: + +```bash +linkml-reference-validator validate text-file my-ontology.obo \ + --regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \ + --cache-dir ./cache +``` + +This validates that supporting text annotations actually appear in their referenced publications. + +**Learn more:** [Validating OBO files guide](https://linkml.io/linkml-reference-validator/how-to/validate-obo-files/) + +### Getting Started with LinkML Validators + +Both tools are available as Python packages and can be installed via pip: + +```bash +pip install linkml-term-validator +pip install linkml-reference-validator +``` + +They work as both command-line tools and Python libraries, so you can integrate them into your existing workflows however makes sense for your use case. \ No newline at end of file
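To make the deterministic-matching idea concrete, the editorial conventions described earlier (bracketed clarifications, ellipses) can be approximated in a few lines. This is a conceptual sketch, not the linkml-reference-validator implementation; the function name and exact convention handling are illustrative:

```python
import re

def _norm(s: str) -> str:
    """Collapse whitespace so line breaks in the source don't block a match."""
    return " ".join(s.split())

def quote_appears_in(source: str, quote: str) -> bool:
    """Deterministic check that a quoted excerpt appears in the source text.
    [Bracketed] text is treated as a curator clarification and ignored;
    '...' splits the quote into fragments that must appear in order.
    No fuzzy matching, no similarity scores."""
    cleaned = re.sub(r"\[[^\]]*\]", " ", quote)   # drop clarifications
    fragments = [_norm(f) for f in cleaned.split("...") if _norm(f)]
    haystack = _norm(source)
    pos = 0
    for frag in fragments:
        idx = haystack.find(frag, pos)
        if idx < 0:
            return False                          # fragment not verifiable
        pos = idx + len(frag)
    return True

paper = "The protein to the nucleus early during cell division"
assert quote_appears_in(paper, "The protein [localizes] to the nucleus...during cell division")
assert not quote_appears_in(paper, "Made-up description that sounds plausible")
```

Because the check is plain substring matching over normalized text, it either verifies a quote or rejects it; there is no threshold an AI-generated approximation could sneak past.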