doctrove

A local documentation store for AI coding agents.

Mirrors LLM-targeted documentation (llms.txt and companion files) from websites to a local store with full-text search, git change tracking, and an MCP interface for agent access.

Install

make install                   # builds and installs to $GOBIN
make init-workspace            # creates ~/.config/doctrove with default config
doctrove mcp-config            # shows config to add to your agent

The workspace defaults to ~/.config/doctrove. Override it with the --dir flag or the DOCTROVE_DIR environment variable.

Quick Start

# Discover what a site has
doctrove discover https://stripe.com

# Grab it (init + sync in one step)
doctrove grab https://supabase.com

# Search across all mirrored content
doctrove search "authentication"

# Search only API docs
doctrove search --category api-reference "webhooks"

# Refresh to pick up changes (uses ETag caching)
doctrove refresh supabase.com

# See what you have
doctrove catalog
doctrove stats

Commands

Command	Description
`discover <url>`	Probe a URL for LLM content without tracking
`grab <url>`	Discover, track, and sync in one step
`init <url>`	Add a site to track
`sync [site|--all]`	Download/update content
`refresh [site|--all]`	Re-sync tracked sites, skipping unchanged files via ETag caching
`search <query>`	Full-text search with `--site`, `--type`, `--category`, `--full`
`tag <site> <path> <cat>`	Override the category for a mirrored file
`catalog`	Show site summaries with topics (from llms.txt structure)
`stats`	Disk usage, file counts, sync freshness per site
`stale`	Show sites not synced within `--threshold` (default 7d)
`list`	List all tracked sites
`status [site]`	Show sync status and file counts
`check <site>`	Dry-run: show available content without downloading
`history [site]`	Git-based change history with `--since`
`diff [from] [to]`	Show content changes between syncs
`remove <site>`	Stop tracking (with `--keep-files` option)
`mcp`	Start MCP server (stdio transport)

All commands support --json for machine-readable output.

MCP Server

Generate your config snippet:

doctrove mcp-config

Add the mcpServers entry to the appropriate config file:

Agent	Config File
Claude Code (user scope)	`~/.claude.json`
Claude Code (project scope)	`.mcp.json` (project root)
Cursor	`.cursor/mcp.json` (project root)

Example config:

{
  "mcpServers": {
    "doctrove": {
      "command": "/usr/local/bin/doctrove",
      "args": ["mcp", "--dir", "/Users/you/.config/doctrove"]
    }
  }
}

Tools (20)

Tool	Description
`trove_discover`	Probe a URL for LLM content
`trove_scan`	Add and sync a site (`content_types` param to filter; persisted for refresh; re-scannable)
`trove_refresh`	Re-sync a tracked site, using ETag caching (honours content_types filter)
`trove_check`	Dry-run: show available content with sizes and content types
`trove_search`	Full-text search with `category`, `path` filters; path-boosted ranking; summaries included
`trove_search_full`	Search and return full content of best match (large; prefer outline+section read)
`trove_outline`	Get heading structure with `max_depth` (default 3) and `max_sections` (default 100) caps
`trove_read`	Read a file or specific section by heading match (`section` param)
`trove_summarize`	Store an agent-written summary for a file (visible in search results and outlines)
`trove_tag`	Override category for a file (validated, persists across re-syncs)
`trove_list`	List tracked sites
`trove_list_files`	Enumerate files with path, size, content type, and category (paginated, `category` filter)
`trove_catalog`	Site summaries with topics
`trove_stats`	Workspace statistics
`trove_status`	Sync status, category breakdown, and staleness for a site
`trove_history`	Git change history
`trove_diff`	Content changes between refs (`stat` mode for compact summary)
`trove_stale`	List sites not synced within a threshold (default 7d)
`trove_find`	Find files by path pattern (faster than search for path lookups)
`trove_remove`	Stop tracking a site

Context-Efficient Workflow

The tools are designed for hierarchical drill-down to minimize context usage:

trove_catalog          → which site has docs on my topic?
trove_search           → which files are relevant? (check summaries first)
trove_outline          → what sections does this file have? (+ summary if cached)
trove_read section=X   → read just the section I need
trove_summarize        → cache a summary so the next agent doesn't re-read

trove_tag and trove_summarize persist across re-syncs. If you read a large file, summarize it. If a category is wrong, fix it.

Content Discovery

doctrove probes multiple sources for LLM-targeted content:

Well-known paths: /llms.txt, /llms-full.txt, /llms-ctx.txt, /llms-ctx-full.txt, /ai.txt
Companion files: URLs referenced in llms.txt (markdown links followed permissively)
Sitemap: Checks sitemap.xml for paths containing /llms/ or ending in .md/.txt
.well-known: tdmrep.json, agent.json, agents.json
Context7: Bare library names (e.g. react, stripe-node) resolved via Context7 API when context7_api_key is configured
HTML conversion: Sites serving HTML at content URLs (Next.js, SPAs) are converted to markdown
MDX cleanup: Framework artifacts (JSX components, export statements, boilerplate banners) are stripped from mirrored content
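
As a minimal sketch of the first and third sources above, the following Go program builds the well-known probe URLs for a site and applies the sitemap path filter. The function names `probeURLs` and `sitemapCandidate` are hypothetical and the real discovery package may be organized differently; only the path list and the filter rule come from this README.

```go
package main

import (
	"fmt"
	"strings"
)

// wellKnownPaths mirrors the well-known paths listed in this section.
var wellKnownPaths = []string{
	"/llms.txt", "/llms-full.txt", "/llms-ctx.txt", "/llms-ctx-full.txt", "/ai.txt",
}

// probeURLs joins a base URL with each well-known path.
func probeURLs(base string) []string {
	base = strings.TrimRight(base, "/")
	urls := make([]string, 0, len(wellKnownPaths))
	for _, p := range wellKnownPaths {
		urls = append(urls, base+p)
	}
	return urls
}

// sitemapCandidate reports whether a sitemap path looks like LLM content:
// it contains /llms/ or ends in .md or .txt.
func sitemapCandidate(path string) bool {
	return strings.Contains(path, "/llms/") ||
		strings.HasSuffix(path, ".md") ||
		strings.HasSuffix(path, ".txt")
}

func main() {
	for _, u := range probeURLs("https://example.com/") {
		fmt.Println(u)
	}
	fmt.Println(sitemapCandidate("/docs/llms/auth"), sitemapCandidate("/pricing"))
}
```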

Page Categories

Every indexed file is assigned a semantic category for task-appropriate filtering:

Category	Examples
`api-reference`	`/api/`, `/reference/`, code-heavy pages
`tutorial`	`/tutorials/`, `/getting-started/`, `/quickstart`
`guide`	`/guides/`, `/learn/`, `/how-to/`
`spec`	`/specification/`, `/schema`, `/seps/`
`changelog`	`/changelog`, `/release-notes`
`marketing`	`/pricing`, `/use-cases/`, `/customers`, link-heavy pages
`legal`	`/privacy`, `/legal/`, `/terms`
`community`	`/community/`, `/contributing`
`context7`	Content fetched via Context7 API
`index`	llms.txt, llms-full.txt, ai.txt (site index files)
`other`	Unclassified companions, well-known metadata

Categories are assigned by path pattern, with body analysis as a fallback. Override a category with trove_tag (MCP) or doctrove tag (CLI).

# Search only API docs
doctrove search --category api-reference "hooks"

# Fix a misclassified page
doctrove tag stripe.com /payments marketing
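
The path-pattern stage can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's classifier: the rule list is a small subset of the table above, matching is a plain substring test, the body-analysis fallback is omitted (unmatched paths simply become `other`), and the `overrides` map stands in for whatever storage backs doctrove tag.

```go
package main

import (
	"fmt"
	"strings"
)

// rule maps a path fragment to a category. Order matters: first match wins.
type rule struct {
	fragment string
	category string
}

// rules is an illustrative subset of the category table.
var rules = []rule{
	{"/api/", "api-reference"},
	{"/reference/", "api-reference"},
	{"/tutorials/", "tutorial"},
	{"/getting-started/", "tutorial"},
	{"/quickstart", "tutorial"},
	{"/guides/", "guide"},
	{"/how-to/", "guide"},
	{"/changelog", "changelog"},
	{"/release-notes", "changelog"},
	{"/pricing", "marketing"},
	{"/privacy", "legal"},
	{"/terms", "legal"},
}

// classify returns the override for path if one exists, otherwise the first
// matching rule's category, otherwise "other".
func classify(path string, overrides map[string]string) string {
	if cat, ok := overrides[path]; ok {
		return cat
	}
	for _, r := range rules {
		if strings.Contains(path, r.fragment) {
			return r.category
		}
	}
	return "other"
}

func main() {
	overrides := map[string]string{"/payments": "marketing"}
	fmt.Println(classify("/docs/api/charges", nil))
	fmt.Println(classify("/payments", nil))
	fmt.Println(classify("/payments", overrides))
}
```

Checking overrides before the rules is what lets a manual tag survive re-classification on later syncs.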

Context7 Integration

With a Context7 API key, you can resolve bare library names (e.g. react, stripe-node) to documentation maintained by the Context7 community, in addition to site-sourced llms.txt content.

settings:
  context7_api_key: ctx7sk-...   # get a key at https://context7.com

# Discover and sync Context7 docs for a library
doctrove scan react
doctrove scan stripe-node

Content fetched via Context7 is categorized as context7 and stored under synthetic domains (e.g. context7.com~facebook_react), keeping it separate from site-sourced content. Context7 content is subject to the Upstash Terms of Service.

ETag Caching

Re-syncs use HTTP conditional requests (If-None-Match, If-Modified-Since) to skip unchanged files. Cache headers are stored per-file in the index. Use refresh to take advantage of this:

doctrove refresh modelcontextprotocol.io   # only downloads changed files
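
For readers unfamiliar with conditional requests, here is a minimal sketch using Go's standard net/http package. The `cacheEntry` type and the `fetchIfChanged` and `newDemoServer` functions are hypothetical names for illustration; doctrove's fetcher and index schema may differ. The demo server is a local stand-in that answers 304 Not Modified when the client presents the current ETag.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// cacheEntry holds the validators saved from a previous response.
type cacheEntry struct {
	ETag         string
	LastModified string
}

// fetchIfChanged performs a conditional GET. It reports changed=false when
// the server answers 304 Not Modified, in which case no body is downloaded.
func fetchIfChanged(client *http.Client, url string, prev cacheEntry) (body []byte, next cacheEntry, changed bool, err error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, prev, false, err
	}
	if prev.ETag != "" {
		req.Header.Set("If-None-Match", prev.ETag)
	}
	if prev.LastModified != "" {
		req.Header.Set("If-Modified-Since", prev.LastModified)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, prev, false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotModified {
		return nil, prev, false, nil
	}
	if resp.StatusCode != http.StatusOK {
		return nil, prev, false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return nil, prev, false, err
	}
	next = cacheEntry{ETag: resp.Header.Get("ETag"), LastModified: resp.Header.Get("Last-Modified")}
	return body, next, true, nil
}

// newDemoServer serves one document with a fixed ETag and honours If-None-Match.
func newDemoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		const etag = `"v1"`
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("ETag", etag)
		io.WriteString(w, "# llms.txt\n")
	}))
}

func main() {
	srv := newDemoServer()
	defer srv.Close()
	_, entry, changed, err := fetchIfChanged(srv.Client(), srv.URL, cacheEntry{})
	fmt.Println("first fetch changed:", changed, "err:", err)
	_, _, changed, err = fetchIfChanged(srv.Client(), srv.URL, entry)
	fmt.Println("second fetch changed:", changed, "err:", err)
}
```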

Configuration

doctrove.yaml in the workspace root:

settings:
  rate_limit: 2            # req/sec per host
  rate_burst: 5            # burst capacity
  timeout: 30s             # HTTP timeout
  max_probes: 100          # companion probes per llms.txt
  user_agent: "doctrove/1.0"
  events_url: http://localhost:6060/events    # optional eventrelay integration
  context7_api_key: ctx7sk-...                # optional Context7 API key

sites:
  stripe.com:
    url: https://stripe.com
    include:
      - "/llms*"
      - "/docs/**/*.md"
    exclude:
      - "/internal/**"

Global Flags

--dir string         workspace directory (default ~/.config/doctrove)
--json               output as JSON
--respect-robots     respect robots.txt AI crawler directives (off by default)

Storage

Content is stored as plain files under sites/<domain>/, tracked by git for change history, with a SQLite FTS5 index for search. The workspace is self-contained; share it by cloning.

When a URL path conflicts with a child path (e.g. /deploy exists as a file but /deploy/getting_started needs to be stored), the parent file is promoted to a directory with its content at _index. ReadContent handles this automatically.

Event Relay Integration

When events_url is configured, doctrove emits structured events to an eventrelay server for real-time observability. Events follow the full eventrelay schema:

{
  "source": "doctrove",
  "channel": "mcp",
  "action": "trove_search",
  "level": "info",
  "agent_id": "myproject:00a3f1",
  "duration_ms": 42,
  "data": {"query": "authentication", "site": "stripe.com"},
  "ts": "2026-03-18T12:00:00Z"
}

Field	Description
`source`	Always `doctrove`
`channel`	`mcp` for MCP tool calls, `sync` for engine operations (init, sync, discover, remove)
`action`	Tool or operation name (e.g. `trove_search`, `sync`, `init`)
`level`	`info` normally, `error` on failure, `warn` on partial errors
`agent_id`	Auto-derived from working directory + PID (e.g. `myproject:00a3f1`)
`duration_ms`	Operation wall time (top-level, displayed inline in the dashboard)
`data`	Tool arguments (MCP) or operation details (engine)
