Skip to content

dmoose/doctrove

doctrove

CI Release Go Reference Go Report Card License: MIT

A local documentation store for AI coding agents.

Mirrors LLM-targeted documentation (llms.txt and companion files) from websites to a local store with full-text search, git change tracking, and an MCP interface for agent access.

Install

make install                   # builds and installs to $GOBIN
make init-workspace            # creates ~/.config/doctrove with default config
doctrove mcp-config            # shows config to add to your agent

Workspace defaults to ~/.config/doctrove. Override with --dir or DOCTROVE_DIR.

Quick Start

# Discover what a site has
doctrove discover https://stripe.com

# Grab it (init + sync in one step)
doctrove grab https://supabase.com

# Search across all mirrored content
doctrove search "authentication"

# Search only API docs
doctrove search --category api-reference "webhooks"

# Refresh to pick up changes (uses ETag caching)
doctrove refresh supabase.com

# See what you have
doctrove catalog
doctrove stats

Commands

Command Description
discover <url> Probe a URL for LLM content without tracking
grab <url> Discover, track, and sync in one step
init <url> Add a site to track
sync [site|--all] Download/update content
refresh [site|--all] Re-sync tracked sites, skipping unchanged files via ETag caching
search <query> Full-text search with --site, --type, --category, --full
tag <site> <path> <cat> Override the category for a mirrored file
catalog Show site summaries with topics (from llms.txt structure)
stats Disk usage, file counts, sync freshness per site
stale Show sites not synced within --threshold (default 7d)
list List all tracked sites
status [site] Show sync status and file counts
check <site> Dry-run: show available content without downloading
history [site] Git-based change history with --since
diff [from] [to] Show content changes between syncs
remove <site> Stop tracking (with --keep-files option)
mcp Start MCP server (stdio transport)

All commands support --json for machine-readable output.

MCP Server

Generate your config snippet:

doctrove mcp-config

Add the mcpServers entry to the appropriate config file:

Agent Config File
Claude Code (user scope) ~/.claude.json
Claude Code (project scope) .mcp.json (project root)
Cursor .cursor/mcp.json (project root)

Example config:

{
  "mcpServers": {
    "doctrove": {
      "command": "/usr/local/bin/doctrove",
      "args": ["mcp", "--dir", "/Users/you/.config/doctrove"]
    }
  }
}

Tools (20)

Tool Description
trove_discover Probe a URL for LLM content
trove_scan Add and sync a site (content_types param to filter; persisted for refresh; re-scannable)
trove_refresh Re-sync a tracked site, using ETag caching (honours content_types filter)
trove_check Dry-run: show available content with sizes and content types
trove_search Full-text search with category, path filters; path-boosted ranking; summaries included
trove_search_full Search and return full content of best match (large; prefer outline+section read)
trove_outline Get heading structure with max_depth (default 3) and max_sections (default 100) caps
trove_read Read a file or specific section by heading match (section param)
trove_summarize Store an agent-written summary for a file (visible in search results and outlines)
trove_tag Override category for a file (validated, persists across re-syncs)
trove_list List tracked sites
trove_list_files Enumerate files with path, size, content type, and category (paginated, category filter)
trove_catalog Site summaries with topics
trove_stats Workspace statistics
trove_status Sync status, category breakdown, and staleness for a site
trove_history Git change history
trove_diff Content changes between refs (stat mode for compact summary)
trove_stale List sites not synced within a threshold (default 7d)
trove_find Find files by path pattern (faster than search for path lookups)
trove_remove Stop tracking a site

Context-Efficient Workflow

The tools are designed for hierarchical drill-down to minimize context usage:

trove_catalog          → which site has docs on my topic?
trove_search           → which files are relevant? (check summaries first)
trove_outline          → what sections does this file have? (+ summary if cached)
trove_read section=X   → read just the section I need
trove_summarize        → cache a summary so the next agent doesn't re-read

trove_tag and trove_summarize persist across re-syncs. If you read a large file, summarize it. If a category is wrong, fix it.

Content Discovery

doctrove probes multiple sources for LLM-targeted content:

  • Well-known paths: /llms.txt, /llms-full.txt, /llms-ctx.txt, /llms-ctx-full.txt, /ai.txt
  • Companion files: URLs referenced in llms.txt (markdown links followed permissively)
  • Sitemap: Checks sitemap.xml for paths containing /llms/ or ending in .md/.txt
  • .well-known: tdmrep.json, agent.json, agents.json
  • Context7: Bare library names (e.g. react, stripe-node) resolved via Context7 API when context7_api_key is configured
  • HTML conversion: Sites serving HTML at content URLs (Next.js, SPAs) are converted to markdown
  • MDX cleanup: Framework artifacts (JSX components, export statements, boilerplate banners) are stripped from mirrored content

Page Categories

Every indexed file is assigned a semantic category for task-appropriate filtering:

Category Examples
api-reference /api/, /reference/, code-heavy pages
tutorial /tutorials/, /getting-started/, /quickstart
guide /guides/, /learn/, /how-to/
spec /specification/, /schema, /seps/
changelog /changelog, /release-notes
marketing /pricing, /use-cases/, /customers, link-heavy pages
legal /privacy, /legal/, /terms
community /community/, /contributing
context7 Content fetched via Context7 API
index llms.txt, llms-full.txt, ai.txt (site index files)
other Unclassified companions, well-known metadata

Assigned by path patterns with body analysis as fallback. Override with trove_tag / doctrove tag.

# Search only API docs
doctrove search --category api-reference "hooks"

# Fix a misclassified page
doctrove tag stripe.com /payments marketing

Context7 Integration

With a Context7 API key, you can resolve bare library names (e.g. react, stripe-node) to documentation maintained by the Context7 community, in addition to site-sourced llms.txt content.

settings:
  context7_api_key: ctx7sk-...   # get a key at https://context7.com
# Discover and sync Context7 docs for a library
doctrove scan react
doctrove scan stripe-node

Content fetched via Context7 is categorized as context7 and stored under synthetic domains (e.g. context7.com~facebook_react), keeping it separate from site-sourced content. Context7 content is subject to Upstash Terms of Service.

ETag Caching

Re-syncs use HTTP conditional requests (If-None-Match, If-Modified-Since) to skip unchanged files. Cache headers are stored per-file in the index. Use refresh to take advantage of this:

doctrove refresh modelcontextprotocol.io   # only downloads changed files

Configuration

doctrove.yaml in the workspace root:

settings:
  rate_limit: 2            # req/sec per host
  rate_burst: 5            # burst capacity
  timeout: 30s             # HTTP timeout
  max_probes: 100          # companion probes per llms.txt
  user_agent: "doctrove/1.0"
  events_url: http://localhost:6060/events    # optional eventrelay integration
  context7_api_key: ctx7sk-...                # optional Context7 API key

sites:
  stripe.com:
    url: https://stripe.com
    include:
      - "/llms*"
      - "/docs/**/*.md"
    exclude:
      - "/internal/**"

Global Flags

--dir string         workspace directory (default ~/.config/doctrove)
--json               output as JSON
--respect-robots     respect robots.txt AI crawler directives (off by default)

Storage

Content is stored as plain files under sites/<domain>/, tracked by git for change history, with a SQLite FTS5 index for search. The workspace is self-contained; share it by cloning.

When a URL path conflicts with a child path (e.g. /deploy exists as a file but /deploy/getting_started needs to be stored), the parent file is promoted to a directory with its content at _index. ReadContent handles this automatically.

Event Relay Integration

When events_url is configured, doctrove emits structured events to an eventrelay server for real-time observability. Events follow the full eventrelay schema:

{
  "source": "doctrove",
  "channel": "mcp",
  "action": "trove_search",
  "level": "info",
  "agent_id": "myproject:00a3f1",
  "duration_ms": 42,
  "data": {"query": "authentication", "site": "stripe.com"},
  "ts": "2026-03-18T12:00:00Z"
}
Field Description
source Always doctrove
channel mcp for MCP tool calls, sync for engine operations (init, sync, discover, remove)
action Tool or operation name (e.g. trove_search, sync, init)
level info normally, error on failure, warn on partial errors
agent_id Auto-derived from working directory + PID (e.g. myproject:00a3f1)
duration_ms Operation wall time (top-level, displayed inline in the dashboard)
data Tool arguments (MCP) or operation details (engine)

About

A library card for your AI agent.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors