A verification layer for scientific work. Paste any scientific text — or a whole PDF — and every PMID, DOI, arXiv ID and GitHub repo is checked against the real record: does it exist, does its metadata match, does the cited paper support the claim? Fabricated and mis-cited references are flagged. Deterministic, no API key.
▶ Live demo & interactive verifier — paste a paragraph, watch it catch a fabricated or mis-cited reference.
DeltaScience checks the citations in scientific writing against the real record — whether you drafted the text yourself or with an AI. Its core is a citation verifier: paste a paragraph, a hypothesis, or a whole paper, and every identifier is checked against PubMed / Crossref / OpenAlex / arXiv / DataCite / GitHub — does it exist, does its metadata match, and does the cited paper actually support the claim? It is deterministic (real lookups + string matching, no LLM in the trust path), so it runs with no API key.
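As a rough illustration of the first step of such a verifier, the sketch below pulls PMID, DOI, arXiv and GitHub identifiers out of free text with regular expressions. The patterns and the `extract_identifiers` name are assumptions made for this example, not DeltaScience's actual extraction code, and real identifiers have more variants than these patterns cover. The DOI and arXiv ID in the sample are made-up placeholders.

```python
import re

# Hypothetical patterns for illustration only; not taken from DeltaScience.
PATTERNS = {
    "pmid": re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE),
    "doi": re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE),
    "arxiv": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE),
    "github": re.compile(r"github\.com/([\w.-]+/[\w.-]+)", re.IGNORECASE),
}


def extract_identifiers(text):
    """Return (kind, value) pairs for every identifier found in text."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Drop sentence punctuation that the pattern may have swallowed.
            value = match.group(1).rstrip(".,;)")
            found.append((kind, value))
    return found


sample = (
    "X drives Y (PMID 35562209). See doi:10.1234/example.doi, "
    "arXiv:2101.00001 and https://github.com/boheling/deltasci."
)
for kind, value in extract_identifiers(sample):
    print(kind, value)
```

Each extracted identifier would then be looked up in the matching registry; that network step is deliberately left out of the sketch.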
Around that trust path it adds the discovery layer: scan (retrieve the closest real prior work) and gap (a coverage-honest read of how crowded an area is) — plus a workflow layer that runs the right ones for your goal (grant, paper, review, ideate). The principle is no LLM in the trust path, not "no LLM anywhere": verify must be deterministic, but discovery is better with intelligence. So scan/gap run keyless out of the box (deterministic term-overlap retrieval), and get sharper when an agent drives them — pass your own queries with scan --query "…", or let the deltasci-ground skill write the queries and reason over the results. A weak discovery pass can only make you miss prior art; it can never corrupt a verdict.
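To make "deterministic term-overlap retrieval" concrete, here is a minimal sketch that ranks candidate abstracts by Jaccard overlap of their word sets with the query. Jaccard is an assumption chosen for the example; the source only says that the keyless mode uses term overlap, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens, ignoring tokens of one or two characters."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 2}


def overlap_score(query, document):
    """Jaccard overlap between the token sets of query and document."""
    q, d = tokenize(query), tokenize(document)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def rank_by_overlap(query, documents, top_k=3):
    """Return the top_k (score, document) pairs, highest score first."""
    scored = [(overlap_score(query, doc), doc) for doc in documents]
    # sort() is stable, so ties keep their input order and the result is reproducible.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


docs = [
    "graph neural network for cathode voltage prediction",
    "precipitation downscaling with neural emulators",
    "spinel cathode synthesis",
]
for score, doc in rank_by_overlap("graph neural network cathode", docs):
    print(round(score, 3), doc)
```

A scorer like this has no notion of synonyms, which is consistent with the point above: a weak discovery pass can miss prior art, and better queries from an agent improve recall without touching the verdicts.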
It also includes the two-perspective co-reasoning mode it grew out of — a domain scientist and an ML engineer talk an idea through in structured rounds (deltasci run, needs an LLM), documented below. The verifier is the part that stands on its own.
It ships in two form factors:
- pip install deltasci: a Python CLI + library.
- Claude Code skills: install skill/ for the two-perspective hypothesis mode (deltasci), and/or skill-ground/ for the grounding layer (deltasci-ground), where the agent drives scan → gap and calls the deterministic engine to verify.
Both share the same domain packs (biomed, materials, climate, plus your own).
Free-form chatbot brainstorming gives plausible-sounding research ideas with no grounding. Single-prompt "be a scientist" approaches collapse two distinct expertises (domain mechanism, ML feasibility) into one voice and produce mush. DeltaScience keeps them separate:
Round 1 Domain Scientist → mechanism, unmet need, prior evidence, constraints
Round 2 ML Engineer → data representation, method, precedents, risks
Round 3 Domain Scientist → refinement, evaluation realism, falsifiable prediction
Round 4 ML Engineer → revised plan, formulas, implementation, expected outcomes
↓
Synthesis: a grounded, falsifiable hypothesis with citation trail
A complete deltasci run produces six artifacts: hypothesis, experiment protocol, risk register, adversarial challenger findings, citation audit, and a transcript of the dialogue that produced them — all auditable, all in one navigable run directory.
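The alternating round structure can be sketched as a small loop. This is only an illustration of the dialogue shape under stated assumptions: `ask(role, prompt)` is a hypothetical callable standing in for an LLM call, the prompt format is invented, and none of this is DeltaScience's real orchestration code. The roles and focus areas are copied from the round listing above.

```python
from dataclasses import dataclass

# Round order and focus areas as listed in the four-round diagram.
ROUNDS = [
    ("Domain Scientist", "mechanism, unmet need, prior evidence, constraints"),
    ("ML Engineer", "data representation, method, precedents, risks"),
    ("Domain Scientist", "refinement, evaluation realism, falsifiable prediction"),
    ("ML Engineer", "revised plan, formulas, implementation, expected outcomes"),
]


@dataclass
class Turn:
    round_number: int
    role: str
    focus: str
    text: str


def run_dialogue(idea, ask):
    """Run the four rounds in order; each round sees all earlier turns."""
    transcript = []
    for number, (role, focus) in enumerate(ROUNDS, start=1):
        history = "\n".join(f"[{t.role}] {t.text}" for t in transcript)
        prompt = f"Idea: {idea}\nFocus: {focus}\nDialogue so far:\n{history}"
        transcript.append(Turn(number, role, focus, ask(role, prompt)))
    return transcript


def echo_ask(role, prompt):
    """Stand-in for an LLM: returns a canned reply so the sketch runs offline."""
    return f"{role} responds to {len(prompt)} characters of context."


for turn in run_dialogue("example idea", echo_ask):
    print(turn.round_number, turn.role)
```

Keeping the two roles as separate calls, each with its own focus, is what prevents the single-voice collapse described above.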
Eight things make it different from "just prompting an LLM":
| | DeltaScience | Free-form LLM |
|---|---|---|
| Grounding | Every claim tagged with type, source, AND AI's training-coverage self-assessment (well-covered / sparse) | Untagged; mixes facts and vibes |
| Honest about AI's edges | Material outside training distribution is emitted as [KNOWLEDGE_GAP] for the researcher, not fabricated | Confabulates citations to fill gaps |
| Falsifiability gate | Synthesis refuses to emit a hypothesis without a measurable threshold | "It might work!" |
| Epistemic humility gate | Synthesis refuses if zero KNOWLEDGE_GAP and zero NOVEL_SYNTHESIS across the dialogue (a complete-certainty transcript is itself a hallucination signal) | No such check |
| Citation audit pillar | Every PMID/DOI/arXiv ID/GitHub repo/GEO accession in a CLAIM is verified against the real PubMed/Crossref/OpenAlex/GitHub record; metadata mismatches surface as a prominent "FAILED AUDIT" section (this catches the BioIntel-style failure where a fabricated PMID got a green checkmark) | "I checked it" with no actual fetch |
| Adversarial challenger | A second-opinion model (optionally a different provider via --challenger-llm) tries to break the hypothesis; its findings are first-class output and its own citations get audited too | None |
| Structured protocol + risks | Hypothesis ships with a concrete 6-step experiment plan and a ranked risk register; both flow through the same audit pillar | Hypothesis text only |
| Domain awareness | Pluggable domain packs parameterize the expert lens | Generic "be a scientist" |
[CLAIM type=<TYPE> coverage=<COVERAGE> source="<CITATION>"]<text>[/CLAIM]
[KNOWLEDGE_GAP category=<CATEGORY>]<question for the researcher>[/KNOWLEDGE_GAP]
[NOVEL_SYNTHESIS rationale="<one-line>"]<the connection you're proposing>[/NOVEL_SYNTHESIS]
coverage ∈ {well-covered, sparse}. Material the AI thinks is uncovered (lab-tribal, paywalled, niche, non-English, the researcher's pilot data, etc.) is never allowed as a CLAIM — it must be a KNOWLEDGE_GAP. Connections the AI is making (not citing) are NOVEL_SYNTHESIS. This is exactly the line between "AI knows what's well-discussed online" and "the researcher knows what's at the frontier of their field" — DeltaScience surfaces it instead of papering over it.
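A minimal parser for this tag grammar, plus a check in the spirit of the epistemic humility gate, might look like the sketch below. It is an illustration written against the three tag templates above, not the project's parser. The regular expressions assume attribute values contain no `]` character, and the `type` and `category` values in the sample are hypothetical.

```python
import re
from collections import Counter

TAG_RE = re.compile(
    r"\[(CLAIM|KNOWLEDGE_GAP|NOVEL_SYNTHESIS)([^\]]*)\](.*?)\[/\1\]",
    re.DOTALL,
)
# Attribute values are either double-quoted or a bare run of non-space characters.
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')
ALLOWED_COVERAGE = {"well-covered", "sparse"}


def parse_tags(text):
    """Return one dict per tagged span: kind, attrs, and inner text."""
    tags = []
    for kind, raw_attrs, body in TAG_RE.findall(text):
        attrs = {name: (quoted or bare) for name, quoted, bare in ATTR_RE.findall(raw_attrs)}
        tags.append({"kind": kind, "attrs": attrs, "text": body.strip()})
    return tags


def invalid_claims(tags):
    """CLAIM tags whose coverage is not one of the two allowed values."""
    return [
        t for t in tags
        if t["kind"] == "CLAIM" and t["attrs"].get("coverage") not in ALLOWED_COVERAGE
    ]


def passes_humility_gate(tags):
    """False when the dialogue has neither a KNOWLEDGE_GAP nor a NOVEL_SYNTHESIS."""
    counts = Counter(t["kind"] for t in tags)
    return counts["KNOWLEDGE_GAP"] + counts["NOVEL_SYNTHESIS"] > 0


sample = (
    '[CLAIM type=empirical coverage=well-covered source="PMID 35562209"]'
    "X drives Y.[/CLAIM] "
    "[KNOWLEDGE_GAP category=pilot-data]Do you have pilot data?[/KNOWLEDGE_GAP]"
)
tags = parse_tags(sample)
print([t["kind"] for t in tags], passes_humility_gate(tags))
```

A transcript made only of CLAIM tags would fail `passes_humility_gate`, which mirrors the idea that complete certainty is itself a warning sign.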
pip install deltasci # core (CLI + library)
pip install "deltasci[anthropic]" # + Anthropic adapter
pip install "deltasci[openai]" # + OpenAI adapter
pip install "deltasci[all]" # both
Provider keys are read from the environment (ANTHROPIC_API_KEY or OPENAI_API_KEY).
export ANTHROPIC_API_KEY=...
deltasci run \
--pack materials \
--idea "Train a graph neural network on the Materials Project to identify Li-ion cathode candidates in the spinel family with predicted voltage > 4.3V."
Outputs go to ./deltasci-output/<timestamp>_<slug>/ in a numbered staged layout:
00_idea.md
01_framing/ 02_engineering/ 03_refinement/ 04_plan/ (per-round transcripts)
05_synthesis/ hypothesis.md + summary.json (three-section evidence trail + falsifiability)
06_protocol/ protocol.md + experiment_plan.json (concrete, execution-ready plan)
07_risks/ risks.md + risk_register.json (5–10 ranked failure modes + mitigations)
08_audits/ citations.json + codex.md (citation audit + adversarial challenger)
manifest.json (run-level metadata)
The web UI at deltasci view <run-dir> (auto-launched at the end of deltasci run unless you pass --no-view) renders all of this with the audit results, including any FAILED AUDIT section showing both what the AI claimed and what was actually at the cited identifier.
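For scripting around a finished run, a small completeness check over the staged layout can be handy. The sketch below assumes the file names in the listing above sit directly inside their stage directories, with 00_idea.md and manifest.json at the run root; that placement is inferred from the listing rather than confirmed, and `missing_artifacts` is a hypothetical helper, not part of the deltasci CLI.

```python
from pathlib import Path

# Stage directories and file names taken from the output layout listing.
EXPECTED = {
    "05_synthesis": ["hypothesis.md", "summary.json"],
    "06_protocol": ["protocol.md", "experiment_plan.json"],
    "07_risks": ["risks.md", "risk_register.json"],
    "08_audits": ["citations.json", "codex.md"],
}
TOP_LEVEL = ["00_idea.md", "manifest.json"]


def missing_artifacts(run_dir):
    """Return the relative paths of expected artifacts that are absent."""
    run_dir = Path(run_dir)
    missing = [
        f"{stage}/{name}"
        for stage, names in EXPECTED.items()
        for name in names
        if not (run_dir / stage / name).is_file()
    ]
    missing.extend(name for name in TOP_LEVEL if not (run_dir / name).is_file())
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_run.py ./deltasci-output/<timestamp>_<slug>
    if len(sys.argv) > 1:
        for path in missing_artifacts(sys.argv[1]):
            print("missing:", path)
```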
deltasci demo --pack biomed --llm mock
This runs a deterministic mock LLM end-to-end so you can see the output shape.
git clone https://github.com/boheling/deltasci
cd deltasci
bash skill/install.sh # deltasci: two-perspective hypothesis mode
bash skill-ground/install.sh # deltasci-ground: the scan → gap → verify grounding layer
Then in Claude Code, for the grounding layer (the agent writes the queries and reasons; verify stays deterministic, so no key is needed):
"Ground this idea: an experience-learning framework with RL for LLM-agent skill evolution." "Verify the citations in paper.pdf."
…or for the hypothesis mode:
"Use deltasci with the climate pack to generate a hypothesis for: train a neural emulator on ERA5 to downscale Sahel precipitation."
DeltaScience's citation-audit pillar also ships as a standalone verifier you can point at any LLM-generated scientific text — a pasted related-work section, a JSON list of claims, or a .bib file. It checks that each cited PMID / DOI / arXiv / GitHub identifier exists, that its metadata matches, and (by default) that the cited paper actually supports the claim — catching the "real paper, wrong citation" failure that plagues autonomous AI-scientist pipelines. No provider API key required.
deltasci verify --file related_work.md # untagged prose
echo "X drives Y (PMID 35562209)." | deltasci verify --file -
deltasci verify --text '…' --json # machine output; exit code 2 on any failed audit
Each claim gets a verdict: PASS / FABRICATED / METADATA-MISMATCH / UNSUPPORTED / UNVERIFIABLE / SKIPPED.
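One plausible way to read the verdicts is as an ordered series of checks: existence first, then metadata, then claim support. The sketch below encodes that reading with plain string matching. The mapping of conditions to verdicts, the 0.85 similarity threshold, and the `decide_verdict` signature are all assumptions made for illustration; the source does not specify them, and SKIPPED is left out because the source does not say when it applies.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Case-insensitive similarity ratio between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def decide_verdict(lookup_ok, record, claimed_title=None, supported=None, threshold=0.85):
    """
    lookup_ok:     False if the registry could not be reached at all (assumed meaning).
    record:        dict with a "title" key, or None if the identifier does not exist.
    claimed_title: the title the text attributes to the identifier, if any.
    supported:     True/False from a claim-support check, or None if not run.
    """
    if not lookup_ok:
        return "UNVERIFIABLE"
    if record is None:
        return "FABRICATED"
    if claimed_title is not None and title_similarity(claimed_title, record["title"]) < threshold:
        return "METADATA-MISMATCH"
    if supported is False:
        return "UNSUPPORTED"
    return "PASS"


# An identifier that resolves to nothing is treated as fabricated in this sketch.
print(decide_verdict(True, None))
```

Ordering the checks this way means a citation is only tested for claim support once its identifier and metadata have already matched a real record.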
Verify generated citations from inside any MCP client (Claude Code/Desktop, Cursor) or AI-scientist pipeline — without forking anything:
pip install "deltasci[mcp]"
claude mcp add deltasci-verify -- deltasci-mcp
It exposes one tool, verify_scientific_claims(text, format, check_support), returning the same per-claim verdicts.
Real papers cite by number, with the references in a bibliography at the bottom — so a pasted paragraph only has [12], nothing to resolve. Paper mode ingests the whole document: it parses the bibliography, resolves every reference to a real record (embedded DOI/PMID/arXiv, or a Crossref title lookup), links each in-text marker to its reference, and checks each citation in the context of the sentence that cites it.
pip install "deltasci[pdf]"
deltasci verify --pdf paper.pdf # verify every numbered citation in context
deltasci verify --pdf paper.pdf --max-references 30 # fast first pass on a big bibliography
deltasci verify --paper --file paper.txt # pasted full text (body + references)
deltasci verify --pdf paper.pdf --llm anthropic # LLM fallback for author-year / messy bibliographies
The web UI (/verify) also accepts a PDF upload and shows one card per citation: its verdict, the in-text sentence it was cited in, and a link to the real record. Verification is deterministic by default (no API key). The --llm fallback only structures messy bibliographies; every citation is still verified against the real record deterministically.
Note: arXiv references are verified via their DataCite DOI (reliable, no rate-limit issues), so arXiv-heavy CS papers work too. The claim-to-abstract support check is PubMed-only: non-PubMed references get existence + metadata verification rather than claim-context. Author-year bibliographies (no [n] numbers) are handled by extracting and verifying every cited identifier; use --llm for full per-claim context on those.
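The marker-linking step of paper mode can be sketched in a few lines: parse a numbered bibliography, then attach each in-text [n] marker to its reference and to the sentence that cites it. This is a simplified illustration with naive sentence splitting and one reference per line, both assumptions; it is not the project's parser, and the reference strings in the sample are made-up placeholders.

```python
import re

REF_LINE = re.compile(r"^\[(\d+)\]\s+(.*)$")
# Matches [12] as well as grouped markers such as [2, 3].
MARKER = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def parse_bibliography(reference_text):
    """Map reference number -> raw reference string (one reference per line)."""
    refs = {}
    for line in reference_text.splitlines():
        match = REF_LINE.match(line.strip())
        if match:
            refs[int(match.group(1))] = match.group(2)
    return refs


def link_citations(body, refs):
    """Return one entry per in-text marker with its citing sentence and reference."""
    links = []
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
        for marker in MARKER.finditer(sentence):
            for number in (int(n) for n in marker.group(1).split(",")):
                links.append(
                    {"number": number, "sentence": sentence, "reference": refs.get(number)}
                )
    return links


body = "X drives Y [1]. Z inhibits W [2, 3]."
references = "[1] Author A. Placeholder title one.\n[2] Author B. Placeholder title two."
for link in link_citations(body, parse_bibliography(references)):
    print(link["number"], "|", link["sentence"], "|", link["reference"])
```

A marker with no matching bibliography entry comes back with `reference` set to `None`, which is the natural place to report a dangling citation.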
| Pack | Display name | What it lenses |
|---|---|---|
| biomed | Biomedical Sciences | mechanism, patient framing, evidence base, IRB / regulatory pathway, translational realism |
| materials | Materials Science | first principles, composition/structure space, DFT bias, synthesizability, validation pathway |
| climate | Climate & Earth Sciences | physical conservation, observational data ecosystem, statistical regime, decision relevance |
List them: deltasci list-packs. Inspect one: deltasci show-pack biomed.
A domain pack is two files in a directory:
my_pack/
├── pack.toml # metadata + evidence rules + scoring rubric
└── lens.md # the domain expert's reasoning lens (markdown)
Scaffold one:
deltasci init-pack neuroscience
# ... edit pack.toml and lens.md ...
deltasci validate-pack ./packs/neuroscience
deltasci run --pack ./packs/neuroscience --idea "..."
See docs/AUTHORING_DOMAIN_PACKS.md for the full guide.
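Before running the real validator, a quick structural pre-check of the two-file layout can catch obvious mistakes. The sketch below only verifies that pack.toml and lens.md exist, that pack.toml parses as TOML, and that lens.md is not empty. It is not a substitute for deltasci validate-pack, it knows nothing about the pack schema, and `check_pack_layout` is a hypothetical helper name.

```python
from pathlib import Path

try:
    import tomllib  # in the standard library from Python 3.11
except ModuleNotFoundError:
    tomllib = None  # on older Pythons the TOML syntax check is skipped


def check_pack_layout(pack_dir):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    pack_dir = Path(pack_dir)
    problems = []
    for name in ("pack.toml", "lens.md"):
        if not (pack_dir / name).is_file():
            problems.append(f"missing {name}")

    toml_path = pack_dir / "pack.toml"
    if tomllib is not None and toml_path.is_file():
        try:
            with toml_path.open("rb") as handle:
                tomllib.load(handle)
        except tomllib.TOMLDecodeError as exc:
            problems.append(f"pack.toml is not valid TOML: {exc}")

    lens_path = pack_dir / "lens.md"
    if lens_path.is_file() and not lens_path.read_text(encoding="utf-8").strip():
        problems.append("lens.md is empty")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_pack.py ./packs/neuroscience
    if len(sys.argv) > 1:
        print(check_pack_layout(sys.argv[1]) or "layout looks fine")
```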
| Tool | Scope | Open source | License | Domain-pluggable | Falsifiability gate |
|---|---|---|---|---|---|
| DeltaScience | Hypothesis ideation | ✅ | MIT | ✅ packs | ✅ hard requirement |
| ChatGPT / Claude direct | Generic chat | n/a | n/a | manual prompts | ❌ |
| AI Scientist (Sakana) | Full paper generation | ✅ | Apache 2.0 | partial | ❌ |
| Coscientist (CMU) | Chemistry experiment design | partial | research | locked | partial |
| Galactica / scite / Elicit | Literature retrieval | ✅/❌ | mixed | n/a | n/a |
DeltaScience deliberately occupies a small niche: get to a defensible hypothesis. It hands off to your favourite paper-writing or experiment-design tool downstream.
from deltasci import CoReasoner, Config, load_pack
from deltasci.llm import get_adapter
pack = load_pack("biomed")
llm = get_adapter("anthropic")
reasoner = CoReasoner(pack=pack, llm=llm, config=Config(num_rounds=4))
result = reasoner.run(idea="Predict checkpoint-immunotherapy non-response in TFE3-fusion osteosarcoma from spatial transcriptomics.")
print(result.hypothesis.title)
print(result.hypothesis.falsifiability.threshold)
print(result.hypothesis.feasibility_scores.overall)
The full hypothesis schema is documented in docs/ARCHITECTURE.md.
Environment variables:
| Variable | Default | Notes |
|---|---|---|
| DELTASCI_LLM_PROVIDER | auto | anthropic, openai, mock, or auto |
| DELTASCI_MODEL | provider default | model id override |
| DELTASCI_OUTPUT_DIR | ./deltasci-output | where outputs are written |
| ANTHROPIC_API_KEY | (none) | required for --llm anthropic |
| OPENAI_API_KEY | (none) | required for --llm openai |
CLI flags override env vars (--llm, --model, --out).
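The precedence rule (CLI flag, then environment variable, then default) can be sketched as a single lookup function. The defaults come from the table above; treating an empty environment variable as unset is an assumption of this sketch, and `resolve_setting` is a hypothetical helper rather than part of the deltasci API.

```python
import os

# Defaults from the environment-variable table; DELTASCI_MODEL has no fixed
# default because it falls back to whatever the chosen provider uses.
DEFAULTS = {
    "DELTASCI_LLM_PROVIDER": "auto",
    "DELTASCI_OUTPUT_DIR": "./deltasci-output",
}


def resolve_setting(name, cli_value=None, env=None):
    """Return the CLI value if given, else the env var, else the documented default."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value
    if env.get(name):  # an empty string counts as unset in this sketch
        return env[name]
    return DEFAULTS.get(name)


# A --llm flag wins over DELTASCI_LLM_PROVIDER, which wins over the default.
fake_env = {"DELTASCI_LLM_PROVIDER": "openai"}
print(resolve_setting("DELTASCI_LLM_PROVIDER", cli_value="mock", env=fake_env))  # mock
print(resolve_setting("DELTASCI_LLM_PROVIDER", env=fake_env))                    # openai
print(resolve_setting("DELTASCI_LLM_PROVIDER", env={}))                          # auto
```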
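DeltaScience runs locally and does not phone home. The only outbound traffic is the verification lookups against the public registries it checks (PubMed, Crossref, OpenAlex, arXiv, DataCite, GitHub) and requests to whatever LLM provider you choose. No telemetry, no user accounts, no server.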
If you use it for clinical, regulatory, or high-stakes research work, the falsifiability gate is not a substitute for IRB review, regulatory pathway analysis, or clinical validation. The tool is an ideation aid, not a decision-maker.
If DeltaScience helps your research, please cite it as:
@software{deltascience2026,
title = {DeltaScience: Two-Perspective Co-Reasoning for AI4Science Hypothesis Generation},
author = {{DeltaScience contributors}},
year = {2026},
url = {https://github.com/boheling/deltasci},
note = {Version 0.1.0}
}
Contributions are welcome, especially new domain packs. See CONTRIBUTING.md. The fastest path:
- Open a domain pack proposal issue.
- Author the pack (≈50 LOC of TOML + 1 markdown file).
- Open a PR with validate-pack output and one example transcript.
MIT — see LICENSE.