# ncbijs

TypeScript clients for NCBI APIs — PubMed, PMC, BLAST, SNP, ClinVar, PubChem, Datasets, and more.



> **Disclaimer:** This is an unofficial, community-maintained SDK. It is not affiliated with, endorsed by, or related to the National Center for Biotechnology Information (NCBI) or the NCBI GitHub organization. For official NCBI tools and resources, visit ncbi.nlm.nih.gov/home/develop.

## What is NCBI?

The National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine (NLM), maintains the world's largest collection of biomedical databases. These include PubMed (37M+ article citations), PubMed Central (PMC, 9M+ full-text articles), MeSH (controlled medical vocabulary), BLAST (sequence alignment), dbSNP (genetic variation), ClinVar (clinical variants), PubChem (chemical compounds), and many more. Researchers, clinicians, and developers rely on NCBI's public APIs to search, retrieve, and analyze biomedical data programmatically.

ncbijs provides typed, zero-dependency TypeScript clients for these APIs. This entire project is built and maintained by AI using Claude Code — no human-written code is accepted. See CONTRIBUTING.md for details.

It is designed for two audiences:

- Developers and researchers building biomedical applications, literature review tools, or clinical decision support systems.
- LLM and AI agents that need structured, programmatic access to biomedical literature for retrieval-augmented generation (RAG), entity extraction, and citation management.

**Built for LLM consumption.** Every package follows consistent naming, consistent interfaces, and has a self-documenting API with full JSDoc. The MCP server exposes 27 tools that any LLM agent can call directly. The workflow table below and the "Which package do I need?" decision tree make it easy for agents to discover the right package without reading source code. 40 of 43 packages run in the browser — ideal for agentic web apps that query NCBI without a backend.

## What can you do with ncbijs?

| Workflow | Packages |
| --- | --- |
| Search PubMed and retrieve article metadata | `@ncbijs/pubmed` + `@ncbijs/eutils` |
| Fetch full-text articles from PMC | `@ncbijs/pmc` + `@ncbijs/jats` |
| Extract genes, diseases, chemicals from articles | `@ncbijs/pubtator` |
| Generate formatted citations (RIS, MEDLINE, CSL-JSON) | `@ncbijs/cite` |
| Convert between PMID, PMCID, and DOI | `@ncbijs/id-converter` |
| Expand MeSH terms for comprehensive searches | `@ncbijs/mesh` |
| Chunk full-text articles for RAG pipelines | `@ncbijs/jats` (`toChunks`) |
| Look up genes, genomes, and taxonomy | `@ncbijs/datasets` |
| Parse FASTA nucleotide/protein sequences | `@ncbijs/fasta` |
| Run BLAST sequence alignments | `@ncbijs/blast` |
| Look up SNP/variant data from dbSNP | `@ncbijs/snp` |
| Query clinical variant significance from ClinVar | `@ncbijs/clinvar` |
| Retrieve compound, substance, and assay data | `@ncbijs/pubchem` |
| Fetch protein sequences in FASTA or GenBank format | `@ncbijs/protein` |
| Fetch nucleotide sequences in FASTA or GenBank format | `@ncbijs/nucleotide` |
| Parse GenBank flat file records locally | `@ncbijs/genbank` |
| Look up genetic disorders from OMIM | `@ncbijs/omim` |
| Query medical genetics concepts from MedGen | `@ncbijs/medgen` |
| Search genetic tests from GTR | `@ncbijs/gtr` |
| Search gene expression datasets from GEO | `@ncbijs/geo` |
| Query structural variants from dbVar | `@ncbijs/dbvar` |
| Search sequencing experiment metadata from SRA | `@ncbijs/sra` |
| Look up 3D molecular structures from MMDB/PDB | `@ncbijs/structure` |
| Search conserved protein domains from CDD | `@ncbijs/cdd` |
| Search NCBI Bookshelf entries | `@ncbijs/books` |
| Look up journal/serial records from NLM Catalog | `@ncbijs/nlm-catalog` |
| Convert variant notations (HGVS, SPDI, VCF) | `@ncbijs/snp` |
| Get full compound annotations (GHS, patents) | `@ncbijs/pubchem` |
| Chain search-fetch pipelines via History Server | `@ncbijs/eutils` |
| Search clinical trials by condition/intervention | `@ncbijs/clinical-trials` |
| Get citation metrics and impact scores | `@ncbijs/icite` |
| Normalize drug names and find drug classes | `@ncbijs/rxnorm` |
| Look up drug labels, SPLs, and NDC packaging | `@ncbijs/dailymed` |
| Find literature linked to genetic variants | `@ncbijs/litvar` |
| Get annotated text with entity recognition | `@ncbijs/bioc` |
| Autocomplete ICD-10, LOINC, SNOMED codes | `@ncbijs/clinical-tables` |
| Store NCBI data locally in DuckDB | `@ncbijs/store` |
| Build data pipelines (Source → Parse → Sink) | `@ncbijs/pipeline` |
| Load any NCBI dataset with one function call | `@ncbijs/etl` |
| Watch NCBI sources for updates and re-sync | `@ncbijs/sync` |
| Expose all tools to LLM agents via MCP | `@ncbijs/http-mcp` |
| Query local NCBI data via MCP | `@ncbijs/store-mcp` |
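Several of these workflows are pure, local parsing with no network involved. As a flavor of what the format-level packages handle, here is a minimal FASTA split. This is an illustrative sketch only, not the `@ncbijs/fasta` implementation:

```typescript
interface FastaRecord {
  header: string;   // text after '>' up to the first newline
  sequence: string; // concatenated sequence lines
}

// Split a FASTA string into records at '>' headers (illustrative sketch).
function parseFasta(text: string): FastaRecord[] {
  return text
    .split('>')
    .filter((block) => block.trim().length > 0)
    .map((block) => {
      const [header, ...lines] = block.split('\n');
      return { header: header.trim(), sequence: lines.join('').trim() };
    });
}
```

The real package also handles comment lines, validation, and streaming input; this sketch shows only the shape of the problem.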

## Packages

| Package | Description |
| --- | --- |
| `@ncbijs/pubmed` | High-level PubMed search and retrieval with fluent query builder |
| `@ncbijs/pmc` | PMC full-text retrieval via E-utilities, OA Service, and OAI-PMH |
| `@ncbijs/eutils` | Spec-compliant client for all 9 NCBI E-utilities |
| `@ncbijs/cite` | Citation formatting in 4 styles (RIS, MEDLINE, CSL-JSON, Citation) |
| `@ncbijs/id-converter` | Batch conversion between PMID, PMCID, DOI, and Manuscript ID |
| `@ncbijs/mesh` | MeSH vocabulary tree traversal and query expansion |
| `@ncbijs/pubtator` | PubTator3 text mining — entity search and BioC annotation export |
| `@ncbijs/pubmed-xml` | PubMed/MEDLINE XML and plain-text parser |
| `@ncbijs/jats` | JATS XML parser with markdown, plain-text, and RAG chunking |
| `@ncbijs/blast` | BLAST sequence alignment with async submit/poll/retrieve workflow |
| `@ncbijs/snp` | dbSNP variation data — placements, allele annotations, frequencies |
| `@ncbijs/clinvar` | ClinVar clinical variant significance, genes, traits, locations |
| `@ncbijs/pubchem` | PubChem compound data — properties, synonyms, descriptions |
| `@ncbijs/datasets` | NCBI Datasets API v2 client for genes, genomes, and taxonomy |
| `@ncbijs/protein` | Protein sequence retrieval in FASTA and GenBank formats |
| `@ncbijs/nucleotide` | Nucleotide sequence retrieval in FASTA and GenBank formats |
| `@ncbijs/genbank` | Zero-dependency GenBank flat file format parser |
| `@ncbijs/omim` | OMIM genetic disorders — Mendelian inheritance catalog |
| `@ncbijs/medgen` | MedGen medical genetics concepts and disease-gene links |
| `@ncbijs/gtr` | Genetic Testing Registry — test catalog and clinical validity |
| `@ncbijs/geo` | GEO gene expression datasets — microarray and RNA-seq metadata |
| `@ncbijs/dbvar` | dbVar structural variants — copy number, inversions, translocations |
| `@ncbijs/sra` | SRA sequencing experiment metadata with embedded XML parsing |
| `@ncbijs/structure` | 3D molecular structure records from MMDB/PDB |
| `@ncbijs/cdd` | Conserved Domain Database — protein domain annotations |
| `@ncbijs/books` | NCBI Bookshelf entries — textbooks, reports, chapters |
| `@ncbijs/nlm-catalog` | NLM Catalog journal and serial records with ISSN data |
| `@ncbijs/clinical-trials` | ClinicalTrials.gov v2 — study search, stats, and field values |
| `@ncbijs/icite` | NIH iCite citation metrics — RCR, percentiles, clinical citations |
| `@ncbijs/rxnorm` | RxNorm drug normalization — concepts, classes, NDC codes |
| `@ncbijs/dailymed` | DailyMed drug labels — SPLs, NDC packaging, drug classes |
| `@ncbijs/litvar` | LitVar2 variant-literature linking — publications by rsID |
| `@ncbijs/bioc` | BioC annotated text — PubMed/PMC articles with named entities |
| `@ncbijs/clinical-tables` | Clinical Table Search — ICD-10, LOINC, SNOMED autocomplete |
| `@ncbijs/fasta` | Zero-dependency FASTA format parser for sequences |
| `@ncbijs/xml` | Zero-dependency regex-based XML reader for NCBI formats |
| `@ncbijs/store` | Storage interfaces and DuckDB implementation for local NCBI data |
| `@ncbijs/pipeline` | Composable data pipelines: Source → Parse → Sink |
| `@ncbijs/etl` | Pre-wired NCBI data loaders: `load('mesh', mySink)` |
| `@ncbijs/sync` | NCBI update detection and scheduled re-sync |
| `@ncbijs/http-mcp` | MCP server exposing all ncbijs tools for LLM agents |
| `@ncbijs/store-mcp` | MCP server for querying locally stored NCBI data via DuckDB |
| `@ncbijs/rate-limiter` | Token bucket rate limiter for browser and Node.js |

## RAG integration

ncbijs is built to power biomedical RAG (Retrieval-Augmented Generation) pipelines. Use it to enrich document chunks with named entities, normalize terminology via MeSH, validate claims against PubMed, and inject formatted citations into generated answers. The MCP server (@ncbijs/http-mcp) lets LLM agents call any ncbijs tool directly during generation with zero glue code.

See RAG Integration Guide for a full architecture walkthrough covering ingestion enrichment, query-time augmentation, generation-time citation, and priority assessment.
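Ingestion-time enrichment boils down to attaching entity annotations to the chunks whose character ranges contain them. A minimal sketch of that join, assuming annotations have already been fetched (for example via `@ncbijs/pubtator`); the `Entity` and `Chunk` shapes here are illustrative, not the library's actual types:

```typescript
// Illustrative shapes, not the actual ncbijs interfaces.
interface Entity {
  text: string;   // surface form, e.g. "BRCA1"
  type: string;   // e.g. "Gene", "Disease"
  offset: number; // character offset in the source document
}

interface Chunk {
  text: string;
  start: number;      // character range this chunk covers
  end: number;
  entities: Entity[]; // filled in by tagChunks
}

// Assign each entity to every chunk whose range contains its offset.
function tagChunks(chunks: Chunk[], entities: Entity[]): Chunk[] {
  return chunks.map((chunk) => ({
    ...chunk,
    entities: entities.filter(
      (e) => e.offset >= chunk.start && e.offset < chunk.end,
    ),
  }));
}
```

Chunks enriched this way can carry their entities into a vector store as metadata, so retrieval can filter by gene, disease, or chemical.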

## Data pipelines

ncbijs includes a composable pipeline system for processing bulk NCBI data. Wire any source, parser, and sink together with a single pipeline() call. The pipeline package is 100% browser-compatible — every export uses standard Web APIs (fetch, DecompressionStream).

```ts
import { pipeline, createHttpSource, createSink } from '@ncbijs/pipeline';
import { parseMeshDescriptorXml } from '@ncbijs/mesh';

// Download from NCBI HTTP → parse → write to any destination
await pipeline(
  createHttpSource('https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2026.xml'),
  (xml) => parseMeshDescriptorXml(xml).descriptors,
  createSink(async (records) => {
    console.log(`Received ${records.length} MeSH descriptors`);
  }),
);
```

Or skip the wiring entirely with @ncbijs/etl — one function call to download, parse, and sink any dataset:

```ts
import { load, loadAll } from '@ncbijs/etl';
import { createSink } from '@ncbijs/pipeline';

// Load a single dataset
await load(
  'mesh',
  createSink(async (records) => {
    console.log(`${records.length} MeSH descriptors`);
  }),
);

// Load all 6 datasets into any sink
await loadAll((dataset) =>
  createSink(async (records) => {
    console.log(`${dataset}: ${records.length} records`);
  }),
);
```

The pipeline has three phases: load, sync, and query:

```
Phase 1: Initial Load        Phase 2: Watch & Sync       Phase 3: Query via MCP
  NCBI FTP ──→ DuckDB          Poll NCBI → re-load         store-mcp ──→ Claude
  (one-time bulk download)      (long-running process)      (zero rate limits)
```

### Phase 1: Load data with `@ncbijs/etl`

```ts
import { load, loadAll } from '@ncbijs/etl';
import { DuckDbFileStorage } from '@ncbijs/store';

const storage = await DuckDbFileStorage.open('ncbi.duckdb');

// Load a single dataset
await load('clinvar', storage.createSink('clinvar'));

// Or load all 6 datasets at once
await loadAll((dataset) => storage.createSink(dataset));
```

### Phase 2: Keep data fresh with `@ncbijs/sync`

Once loaded, start a watcher to poll for upstream changes and re-load only what changed. `createCheckers()` picks the best detection strategy per dataset: MD5 checksums for ClinVar, Taxonomy, and PubChem; HTTP `Last-Modified` for all others.

```ts
import { createCheckers, load } from '@ncbijs/etl';
import { SyncScheduler, InMemorySyncState } from '@ncbijs/sync';

// `storage` is the DuckDbFileStorage opened in Phase 1
const scheduler = new SyncScheduler(new InMemorySyncState(), createCheckers(), {
  checkIntervalMs: 3600_000,
  datasets: ['clinvar', 'genes'],
  onUpdate: async (dataset) => {
    await load(dataset, storage.createSink(dataset));
  },
});

await scheduler.start(); // checks immediately, then every hour
```

### Phase 3: Query via MCP

Once data is loaded, expose it to Claude (or any MCP-compatible agent) with @ncbijs/store-mcp:

```json
{
  "mcpServers": {
    "ncbijs-store": {
      "command": "npx",
      "args": ["-y", "@ncbijs/store-mcp"],
      "env": {
        "NCBIJS_DB_PATH": "/absolute/path/to/ncbi.duckdb"
      }
    }
  }
}
```

Now your agent can query the local data directly:

- "Search for pathogenic BRCA1 variants in ClinVar"
- "Look up the MeSH descriptor for Alzheimer's disease"
- "What genes are on chromosome 17 in the local store?"
- "Convert PMID 33024307 to a DOI"
No network, no rate limits, no API keys. See @ncbijs/store-mcp for the full list of 13 query tools.

See examples/data-pipeline/ for complete scripts covering all three phases.

### Pipeline packages

- `@ncbijs/pipeline` — Composable Source/Sink primitives built on `AsyncIterable`. HTTP and composite sources, streaming, backpressure, abort signals. Browser + Node.js.
- `@ncbijs/etl` — Pre-wired loaders for 6 NCBI bulk datasets. `load('mesh', mySink)` is all you need. Also exports `createCheckers()` for sync.
- `@ncbijs/store` — Storage interfaces with a DuckDB reference implementation. Node.js only.
- `@ncbijs/sync` — Watches NCBI FTP for updates via MD5 checksums or HTTP `Last-Modified`. Pluggable checkers, configurable interval, abort signal.

See Data Pipeline Guide for the full API walkthrough, streaming parsers, error handling, and sync scheduling.
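The Source → Parse → Sink contract is easy to picture with plain AsyncIterables. The following is an illustrative re-implementation of the pattern, not the `@ncbijs/pipeline` API (which layers streaming, backpressure, and abort signals on top):

```typescript
// Illustrative pipeline contract: a source yields items, parse transforms
// them one at a time, and the sink persists each result.
type Source<T> = AsyncIterable<T>;
type Parse<T, U> = (item: T) => U;
type Sink<U> = (item: U) => Promise<void>;

// Pull from the source, transform each item, push into the sink.
async function runPipeline<T, U>(
  source: Source<T>,
  parse: Parse<T, U>,
  sink: Sink<U>,
): Promise<number> {
  let count = 0;
  for await (const item of source) {
    await sink(parse(item));
    count++;
  }
  return count;
}

// A source can be any async generator -- here, a stub standing in
// for an HTTP download of XML records.
async function* stubSource(): AsyncGenerator<string> {
  yield '<rec>1</rec>';
  yield '<rec>2</rec>';
}
```

Because the sink is awaited before the next item is pulled, a slow sink naturally throttles the source, which is the essence of backpressure in this model.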

## MCP servers

ncbijs ships two MCP servers that give AI agents direct access to NCBI data. Pick the one that fits your use case — or use both:

| | Live API (`http-mcp`) | Local data (`store-mcp`) |
| --- | --- | --- |
| Setup | Zero — just add the config | Load data first (Phases 1-2) |
| Network | Required (queries NCBI APIs in real time) | Offline after initial load |
| Rate limits | NCBI limits apply (3-10 req/s) | None |
| Data freshness | Always current | As fresh as last sync |
| Tools | 27 | 13 |

### Live API access (`@ncbijs/http-mcp`)

Query NCBI APIs in real time — PubMed, PMC full text, BLAST, ClinVar, PubChem, MeSH, and more. No data loading required.

```json
{
  "mcpServers": {
    "ncbijs": {
      "command": "npx",
      "args": ["-y", "@ncbijs/http-mcp"],
      "env": {
        "NCBI_API_KEY": ""
      }
    }
  }
}
```

27 tools covering: PubMed search, PMC full text, PubTator entity recognition, gene/genome/taxonomy lookup, BLAST alignment, SNP/ClinVar variant queries, PubChem compounds, citation formatting, ID conversion, MeSH vocabulary, iCite metrics, RxNorm drug data, and LitVar variant-literature linking.

Example prompts:

- "Search PubMed for recent CRISPR gene therapy reviews"
- "Get the full text of PMC7886120 and summarize the methods"
- "What genes and diseases are mentioned in PMID 33024307?"
- "Run a BLAST search for the sequence ATCGATCGATCG"

See @ncbijs/http-mcp for details. Get a free API key at ncbi.nlm.nih.gov/account/settings.

### Local data queries (`@ncbijs/store-mcp`)

Query your local DuckDB database — MeSH, ClinVar, genes, taxonomy, PubChem, and ID mappings. No network needed after loading.

```
Phase 1: load data ──→ Phase 2: sync ──→ Phase 3: query via store-mcp
(see Data pipelines)    (optional)        (this section)
```

```json
{
  "mcpServers": {
    "ncbijs-store": {
      "command": "npx",
      "args": ["-y", "@ncbijs/store-mcp"],
      "env": {
        "NCBIJS_DB_PATH": "/absolute/path/to/ncbi.duckdb"
      }
    }
  }
}
```

13 tools available: `store-lookup-mesh`, `store-search-mesh`, `store-lookup-variant`, `store-search-variants`, `store-lookup-gene`, `store-search-genes`, `store-lookup-taxonomy`, `store-search-taxonomy`, `store-lookup-compound`, `store-search-compounds`, `store-convert-ids`, `store-search-ids`, `store-stats`.

Example prompts:

- "Search for pathogenic BRCA1 variants in the local ClinVar data"
- "What compounds have an InChI key starting with BSYNRYMUT?"
- "How many records are loaded in each dataset?"

See @ncbijs/store-mcp for details. See Data pipelines above to load the data.

## Browser compatibility

40 of 43 packages work in both browsers and Node.js. Only 3 infrastructure packages require Node.js:

| Runtime | Packages | Why |
| --- | --- | --- |
| Browser + Node.js | All HTTP clients, parsers, `rate-limiter`, `xml`, `fasta`, `genbank`, `pipeline`, `etl`, `sync` (40 packages) | Uses only `fetch`, `DecompressionStream`, and pure computation |
| Node.js only | `@ncbijs/store` | Requires `@duckdb/node-api` (native binding) |
| Node.js only | `@ncbijs/store-mcp`, `@ncbijs/http-mcp` | MCP server CLIs (stdio transport) |

Use ncbijs directly in frontend apps — search PubMed, look up genes, query MeSH, and more with zero server-side code:

```ts
import { PubMed } from '@ncbijs/pubmed';
import { Datasets } from '@ncbijs/datasets';

const pubmed = new PubMed();
const articles = await pubmed.search({ term: 'CRISPR therapy', retmax: 10 });

const datasets = new Datasets();
const gene = await datasets.geneBySymbol('BRCA1');
```

## Quick start

```sh
npm install @ncbijs/pubmed
```

```ts
import { PubMed } from '@ncbijs/pubmed';

const pubmed = new PubMed({
  tool: 'my-research-app',
  email: 'you@university.edu',
});

const articles = await pubmed
  .search('CRISPR gene therapy')
  .dateRange('2023/01/01', '2024/12/31')
  .freeFullText()
  .limit(10)
  .fetchAll();

for (const article of articles) {
  console.log(`${article.pmid}: ${article.title}`);
}
```
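For large result sets, several packages also expose async iterators (see Package capabilities), so results stream batch by batch instead of loading all at once. The consumption pattern is the same everywhere; it is sketched here with a stand-in generator, since exact method signatures vary by package:

```typescript
// Stand-in for a paging API such as eutils' efetchBatches -- illustrative only.
async function* fetchBatches(total: number, size: number): AsyncGenerator<number[]> {
  for (let i = 0; i < total; i += size) {
    // In the real clients, each yield is one HTTP page of records.
    yield Array.from({ length: Math.min(size, total - i) }, (_, j) => i + j);
  }
}

// Consume batch by batch so memory stays flat regardless of result count.
async function countRecords(): Promise<number> {
  let count = 0;
  for await (const batch of fetchBatches(25, 10)) {
    count += batch.length;
  }
  return count;
}
```

The same `for await` loop works whether the iterator is backed by E-utilities paging, ClinicalTrials.gov pages, or a pipeline `Source`.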

## Which package do I need?

```
I want to...
│
├── Search biomedical literature
│   ├── High-level PubMed search ──────────→ @ncbijs/pubmed
│   ├── Low-level Entrez queries ──────────→ @ncbijs/eutils
│   └── Find literature by genetic variant ─→ @ncbijs/litvar
│
├── Retrieve full-text articles
│   ├── PMC open-access articles ──────────→ @ncbijs/pmc
│   └── Annotated text with NER ───────────→ @ncbijs/bioc
│
├── Extract entities from text
│   ├── Genes, diseases, chemicals ────────→ @ncbijs/pubtator
│   └── Annotated passages (BioC format) ──→ @ncbijs/bioc
│
├── Work with citations
│   ├── Format citations (RIS, CSL, etc.) ─→ @ncbijs/cite
│   ├── Convert PMID/PMCID/DOI ────────────→ @ncbijs/id-converter
│   └── Citation impact metrics (RCR) ─────→ @ncbijs/icite
│
├── Work with genes and sequences
│   ├── Gene/genome metadata ──────────────→ @ncbijs/datasets
│   ├── Protein sequences ─────────────────→ @ncbijs/protein
│   ├── Nucleotide sequences ──────────────→ @ncbijs/nucleotide
│   ├── Sequence alignment (BLAST) ────────→ @ncbijs/blast
│   ├── Parse FASTA format ────────────────→ @ncbijs/fasta
│   └── Parse GenBank format ──────────────→ @ncbijs/genbank
│
├── Work with variants and clinical data
│   ├── SNP/variant lookup (dbSNP) ────────→ @ncbijs/snp
│   ├── HGVS/SPDI/VCF conversion ──────────→ @ncbijs/snp
│   ├── Clinical significance (ClinVar) ───→ @ncbijs/clinvar
│   ├── Genetic disorders (OMIM) ──────────→ @ncbijs/omim
│   └── Medical genetics (MedGen) ─────────→ @ncbijs/medgen
│
├── Work with drugs and chemicals
│   ├── Compound properties ───────────────→ @ncbijs/pubchem
│   ├── Compound annotations (GHS, etc.) ──→ @ncbijs/pubchem
│   ├── Drug normalization (RxCUI) ────────→ @ncbijs/rxnorm
│   ├── Drug classes (ATC, VA, MEDRT) ─────→ @ncbijs/rxnorm
│   ├── NDC code lookup ───────────────────→ @ncbijs/rxnorm
│   └── Drug labels and SPLs ──────────────→ @ncbijs/dailymed
│
├── Autocomplete medical codes
│   ├── ICD-10, LOINC, SNOMED ─────────────→ @ncbijs/clinical-tables
│   └── RxTerms drug names ────────────────→ @ncbijs/clinical-tables
│
├── Search clinical trials ────────────────→ @ncbijs/clinical-trials
│
├── Work with vocabularies
│   └── MeSH term expansion ───────────────→ @ncbijs/mesh
│
├── Search other NCBI databases
│   ├── Gene expression (GEO) ─────────────→ @ncbijs/geo
│   ├── Structural variants (dbVar) ───────→ @ncbijs/dbvar
│   ├── Sequencing data (SRA) ─────────────→ @ncbijs/sra
│   ├── 3D structures (MMDB/PDB) ──────────→ @ncbijs/structure
│   ├── Protein domains (CDD) ─────────────→ @ncbijs/cdd
│   ├── Genetic tests (GTR) ───────────────→ @ncbijs/gtr
│   ├── Books/textbooks ───────────────────→ @ncbijs/books
│   └── Journal records (NLM Catalog) ─────→ @ncbijs/nlm-catalog
│
├── Store NCBI data locally ───────────────→ @ncbijs/store
├── Data pipeline (Source → Parse → Sink) ─→ @ncbijs/pipeline
├── Load any NCBI dataset in one call ─────→ @ncbijs/etl
├── Watch NCBI sources for updates ────────→ @ncbijs/sync
├── Expose tools to LLM agents (live API) ─→ @ncbijs/http-mcp
└── Query local data via MCP ──────────────→ @ncbijs/store-mcp
```

## Package capabilities

| Capability | Packages |
| --- | --- |
| Supports API key | `eutils`, `pubmed`, `pmc`, `clinvar`, `snp`, `datasets`, `omim`, `medgen`, `gtr`, `geo`, `dbvar`, `sra`, `structure`, `cdd`, `books`, `nlm-catalog`, `protein`, `nucleotide` (optional, for higher rate limits) |
| No API key needed | All others (non-NCBI APIs) |
| Rate-limited | `eutils`, `datasets`, `blast`, `snp`, `clinvar`, `pubchem`, `clinical-trials`, `icite`, `rxnorm`, `dailymed`, plus all that depend on `rate-limiter` |
| Zero dependencies | `pipeline`, `sync`, `cite`, `id-converter`, `mesh`, `fasta`, `genbank`, `litvar`, `bioc`, `clinical-tables` |
| Async iterators | `eutils` (`efetchBatches`, `searchAndFetch`, `searchAndSummarize`), `pubmed` (batch), `clinical-trials` (`searchStudies`), `cite` (`citeMany`), `pipeline` (`Source`, `streamParser`) |
| XML parsing | `eutils`, `pubmed-xml`, `jats`, `pubtator`, `xml` |
| Bulk parsers | `mesh`, `cite`, `id-converter`, `clinvar`, `datasets`, `pubchem`, `snp`, `icite`, `clinical-trials`, `litvar`, `medgen`, `cdd`, `pmc` |
| Data pipelines | `pipeline` (Source → Parse → Sink), `store` (`DuckDbSink`), `sync` (update detection) |
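The rate limiting above follows the token-bucket model that `@ncbijs/rate-limiter` is named for: each request spends a token, and tokens refill at a fixed rate (NCBI allows roughly 3 req/s without an API key and 10 req/s with one). An illustrative sketch of the mechanism, not the package's actual API:

```typescript
// Minimal token bucket (illustrative): `capacity` tokens, refilled at `ratePerSec`.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(private capacity: number, private ratePerSec: number) {
    this.tokens = capacity;
    this.last = Date.now();
  }

  // Try to spend one token; returns false when the caller must wait.
  tryAcquire(now = Date.now()): boolean {
    const elapsed = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.ratePerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A bucket of capacity 3 refilling at 3/s matches the keyless NCBI limit: three immediate requests, then roughly one every 333 ms.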

## Development

```sh
pnpm install
pnpm build        # Build all packages
pnpm test         # Run all tests
pnpm lint         # Lint all packages
pnpm typecheck    # Type-check all packages
```

### Single package

```sh
pnpm nx run @ncbijs/pubmed:build
pnpm nx run @ncbijs/pubmed:test
```

### E2E tests

E2E tests hit real NCBI APIs and require an API key:

```sh
cp .env.example .env
# Add your NCBI API key to .env
pnpm nx run ncbijs-e2e:e2e
```

Get an API key at ncbi.nlm.nih.gov/account/settings.
