Your private document brain.
PDFs in, RAG out. Self-hosted. Plug everywhere.
Quickstart · What you get · Plug everywhere · Docs · OpenAPI · Changelog
v0.5.4: any S3-compatible storage (AWS, R2, Backblaze, MinIO, Garage, Ceph, SeaweedFS, ...) with hardened SSRF policy and rate limits, Typesense vector store, history GFM rendering with stopped-job state, and a curl|sh / iwr|iex one-liner installer pinned to the release tag. See the changelog.
Most document-to-AI tools are SaaS. They cost per page, they see your documents, and they lock you into one provider. Extracto is the opposite: one Docker container, your machine, any vision model (local or hosted), output goes wherever you want it. Browser, code, agent, vector store. You pick.
A complete pipeline from raw document to retrievable knowledge, in one container:
- Ingest any PDF, image, or watched folder.
- Extract with the vision model of your choice (Ollama, Mistral OCR, OpenRouter, any OpenAI-compatible endpoint).
- Post-process with a second LLM pass (clean to markdown or strict JSON, with your own instruction).
- Chunk + embed + store into Chroma, Qdrant, Weaviate, or Typesense. Five chunking strategies including semantic (sentence-embed + topic-shift split) and hierarchical (preserves heading breadcrumbs).
- Retrieve through a stable v1 REST API, an OpenAI-Chat-Completions adapter, an MCP server (Claude/Cursor/Codex/OpenClaw/Hermes), a typed CLI, or the browser UI.
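
A rough sketch of that flow from the shell, using the CLI commands introduced further down in this README; exact flags and defaults live in the CLI reference, so treat the invocations as illustrative:

```bash
# Same pipeline driven from the typed CLI; arguments are illustrative.
extracto ocr ./invoice.pdf   # submit a PDF for vision extraction
extracto jobs list           # watch page-by-page progress
extracto kb export           # chunk + embed + push into the vector store
```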
Other things you don't need to bolt on:
- Per-user accounts, scoped API keys with rate limits, signed webhooks.
- Resumable jobs, page-by-page progress, searchable history.
- Optional S3/MinIO blob offload, Prometheus metrics, healthcheck.
- Five UI languages (English, Italian, French, Spanish, German).
- 1200+ tests, MIT-licensed, semver on /api/v1.
You need Docker. That's it.
Linux/macOS:

```bash
curl --proto '=https' --tlsv1.2 -fsSL https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.sh | bash
```

Windows (PowerShell):

```powershell
iwr -UseBasicParsing https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.ps1 | iex
```

The installer clones the repo at the pinned tag to ~/.local/share/extracto (or %LOCALAPPDATA%\Extracto), sets up Docker + Ollama if missing, drops an extracto launcher on PATH, and starts the stack. Open http://localhost:3000 and sign up.
The installer wraps `install-extracto.sh`, which on a fresh machine runs vendor scripts from https://get.docker.com and https://ollama.com/install.sh as root to provision Docker + Ollama. Skip those steps with `EXTRACTO_INSTALL_DOCKER=0 EXTRACTO_INSTALL_OLLAMA=0` if Docker is already installed and you don't want Ollama. Set `EXTRACTO_REPO_REF=main` to track the bleeding edge instead of the pinned tag.
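
For example, a variant of the one-liner above for a box that already has Docker and does not want a local Ollama; only the documented switches are added, and `EXTRACTO_REPO_REF=main` could be passed the same way:

```bash
# Skip the Docker and Ollama provisioning steps; everything else is unchanged.
curl --proto '=https' --tlsv1.2 -fsSL https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.sh \
  | EXTRACTO_INSTALL_DOCKER=0 EXTRACTO_INSTALL_OLLAMA=0 bash
```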
Settings → S3 takes any S3-compatible endpoint: AWS S3, Cloudflare R2, Backblaze B2, DigitalOcean Spaces, Wasabi, Linode Object Storage, GCS, MinIO, Garage, Ceph RGW, SeaweedFS, on-prem appliances, etc. The endpoint URL is checked against an SSRF policy (cloud-metadata IPs and link-local addresses are always blocked); private/loopback hosts (RFC 1918, 127.0.0.1, host.docker.internal) require either `S3_ALLOW_LOOPBACK=1` for a global opt-in or `S3_ALLOWED_HOSTS=minio.internal.corp,*.objects.internal` for granular access.
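
As a sketch, the allowlist opt-in for a private MinIO endpoint could be passed to the container like this; only the `S3_ALLOWED_HOSTS` variable is documented above, the rest mirrors the single-container run under Manual paths:

```bash
# Let a private MinIO host through the SSRF policy without the global loopback opt-in.
docker run -d --name extracto -p 3000:3000 \
  -v extracto-data:/app/data \
  -e AUTH_SECRET="$(openssl rand -hex 32)" \
  -e S3_ALLOWED_HOSTS="minio.internal.corp,*.objects.internal" \
  ghcr.io/codelined-ag/extracto:v0.5.4
```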
The launcher wraps the full API: `extracto ocr ./invoice.pdf`, `extracto jobs list`, `extracto kb export`, `extracto api-key create ...`. Full reference at extracto.help/cli/overview.
Manual paths
From source:

```bash
git clone https://github.com/codelined-ag/Extracto.git
cd Extracto
./install-extracto.sh                  # Linux/macOS
# or: .\scripts\extracto.ps1 install   # Windows (Docker Desktop + WSL2)
extracto on
```

Single docker run (no launcher):

```bash
docker run -d --name extracto -p 3000:3000 \
  -v extracto-data:/app/data \
  -e AUTH_SECRET="$(openssl rand -hex 32)" \
  ghcr.io/codelined-ag/extracto:latest
```

Multi-arch (linux/amd64 + linux/arm64); pin a release with `:v0.5.4` instead of `:latest`.
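
Either way, a quick sanity check before opening the browser (plain Docker and curl, nothing Extracto-specific; port 3000 as in the run command above):

```bash
docker logs extracto --tail 20                                    # recent startup logs
curl -fsS -o /dev/null http://localhost:3000 && echo "UI is up"   # UI/API answering on :3000
```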
Same backend, five surfaces. Pick what fits.
| Surface | Use it when | Read |
|---|---|---|
| Browser UI | You're a human with a stack of PDFs | How it works |
| REST API (`/api/v1/*`) | You're building a document-intake pipeline | API reference |
| MCP server | Your agent speaks MCP (Claude Desktop, Cursor, Codex, OpenClaw, Hermes) | Agents |
| CLI + SKILL.md | Your agent only has a shell tool (Claude Code, shell-based runners) | Skill file |
| OpenAI-Chat adapter | You already have OpenAI-SDK code; just point it at Extracto | OpenAI compat |
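
The adapter row is the quickest to try from a terminal. The route, header, and model name below follow the usual OpenAI-compatible convention and are assumptions, not the documented values; the OpenAI compat page has the real ones:

```bash
# Illustrative only: path, Bearer header, and model field assume the common
# OpenAI-compatible layout; check the OpenAI compat docs for Extracto's actual values.
curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer $EXTRACTO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "extracto", "messages": [{"role": "user", "content": "What do my invoices say about payment terms?"}]}'
```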
Agents get two first-class paths. The MCP server exposes seven tools (`ocr_submit`, `ocr_get`, `jobs_list`, `job_stop`, `kb_search`, `kb_export`, `presets_list`). The SKILL.md + typed CLI path is for agents that don't speak MCP: drop the skill file into the agent's context and it knows when to call `extracto ocr`, `extracto kb search`, `extracto jobs ...` from a shell.
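
What the shell-only path looks like in practice, as a sketch (the commands are the ones named above; the arguments are invented for the example):

```bash
# A shell-only agent's turn after reading SKILL.md; arguments are illustrative.
extracto ocr ./inbox/lease.pdf                    # submit the document
extracto jobs list                                # check whether extraction finished
extracto kb search "termination notice period"    # pull the relevant chunks back
```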
Everything beyond a five-minute install lives at extracto.help in five languages:
- Configuration reference (every env var)
- Full v1 API guide (auth, OCR, jobs, presets, webhooks, KB export, search, metrics)
- CLI reference
- MCP setup for every supported client
- Knowledge-base export (chunking strategies, embedding providers, vector stores)
- Production checklist (auth secret, HTTPS, signup gate, rate limits, allowlists)
- Troubleshooting + ops (logs, metrics, retention, S3 offload, watched folders)
- Architecture tour
OpenAPI 3.1 spec at openapi.yaml. Import into Bruno, Postman, Insomnia, or any client generator.
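
One way to turn the spec into a typed client, assuming openapi-generator-cli is available (any of the GUI clients above can import the same file directly):

```bash
# Generate a TypeScript client from the published spec; the output directory is arbitrary.
npx @openapitools/openapi-generator-cli generate \
  -i openapi.yaml -g typescript-fetch -o ./extracto-client
```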
MIT © codelined
