
Extracto

Your private document brain.
PDFs in, RAG out. Self-hosted. Plug everywhere.

Quickstart · What you get · Plug everywhere · Docs · OpenAPI · Changelog


[Screenshot: Extracto workspace]

v0.5.4: any S3-compatible storage (AWS, R2, Backblaze, MinIO, Garage, Ceph, SeaweedFS, ...) with hardened SSRF policy and rate limits, Typesense vector store, history GFM rendering with stopped-job state, and a curl|sh / iwr|iex one-liner installer pinned to the release tag. See the changelog.


Why

Most document-to-AI tools are SaaS. They cost per page, they see your documents, and they lock you into one provider. Extracto is the opposite: one Docker container, your machine, any vision model (local or hosted), output goes wherever you want it. Browser, code, agent, vector store. You pick.


What you get

A complete pipeline from raw document to retrievable knowledge, in one container:

  1. Ingest any PDF, image, or watched folder.
  2. Extract with the vision model of your choice (Ollama, Mistral OCR, OpenRouter, any OpenAI-compatible endpoint).
  3. Post-process with a second LLM pass (clean to markdown or strict JSON, with your own instruction).
  4. Chunk + embed + store into Chroma, Qdrant, or Weaviate. Five chunking strategies including semantic (sentence-embed + topic-shift split) and hierarchical (preserves heading breadcrumbs).
  5. Retrieve through a stable v1 REST API, an OpenAI-Chat-Completions adapter, an MCP server (Claude/Cursor/Codex/OpenClaw/Hermes), a typed CLI, or the browser UI.
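The five steps above can be sketched as one curl session against a local instance. This is an illustrative shape only: the endpoint paths, field names, and response keys below are assumptions, not taken from the API reference, which is the source of truth.

```shell
# Hypothetical end-to-end run; real paths and parameters are in the
# v1 API guide and the OpenAPI spec.
API=http://localhost:3000/api/v1
KEY="sk-your-api-key"   # a scoped API key created in the UI or CLI

# Steps 1-3: submit a PDF for extraction + post-processing (shape assumed)
JOB=$(curl -s -H "Authorization: Bearer $KEY" \
  -F "file=@invoice.pdf" "$API/ocr" | jq -r '.job_id')

# Poll the resumable job for page-by-page progress
curl -s -H "Authorization: Bearer $KEY" "$API/jobs/$JOB"

# Step 5: search the knowledge base once chunks are embedded and stored
curl -s -H "Authorization: Bearer $KEY" \
  --get --data-urlencode "q=total amount due" "$API/kb/search"
```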

Other things you don't need to bolt on:

  • Per-user accounts, scoped API keys with rate limits, signed webhooks.
  • Resumable jobs, page-by-page progress, searchable history.
  • Optional S3/MinIO blob offload, Prometheus metrics, healthcheck.
  • Five UI languages (English, Italian, French, Spanish, German).
  • 1200+ tests, MIT-licensed, semver on /api/v1.

Quickstart

You need Docker. That's it.

One-liner (Linux / macOS)

curl --proto '=https' --tlsv1.2 -fsSL https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.sh | bash

One-liner (Windows)

iwr -UseBasicParsing https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.ps1 | iex

The installer clones the repo at the pinned tag to ~/.local/share/extracto (or %LOCALAPPDATA%\Extracto), sets up Docker + Ollama if missing, drops an extracto launcher on PATH, and starts the stack. Open http://localhost:3000 and sign up.

The installer wraps install-extracto.sh, which on a fresh machine runs vendor scripts from https://get.docker.com and https://ollama.com/install.sh as root to provision Docker + Ollama. Skip those steps with EXTRACTO_INSTALL_DOCKER=0 EXTRACTO_INSTALL_OLLAMA=0 if Docker is already installed and you don't want Ollama. Set EXTRACTO_REPO_REF=main to track the bleeding edge instead of the pinned tag.
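Put together, a guarded run on a machine that already has Docker and doesn't want Ollama looks like this (same installer URL as above; the two environment flags are the ones documented in this section):

```shell
# Skip the root-level vendor scripts; the rest of the install proceeds
# as normal (clone at the pinned tag, launcher on PATH, stack start).
curl --proto '=https' --tlsv1.2 -fsSL \
  https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.sh \
  | EXTRACTO_INSTALL_DOCKER=0 EXTRACTO_INSTALL_OLLAMA=0 bash
```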

S3-compatible storage (any provider)

Settings → S3 takes any S3-compatible endpoint: AWS S3, Cloudflare R2, Backblaze B2, DigitalOcean Spaces, Wasabi, Linode Object Storage, GCS, MinIO, Garage, Ceph RGW, SeaweedFS, on-prem appliances, etc. The endpoint URL is validated against SSRF (cloud-metadata IPs and link-local addresses are always blocked). Private and loopback hosts (RFC 1918 ranges, 127.0.0.1, host.docker.internal) require either S3_ALLOW_LOOPBACK=1 as a global opt-in or S3_ALLOWED_HOSTS=minio.internal.corp,*.objects.internal for per-host access.
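For example, to allow a self-hosted MinIO on an internal network, pass the allowlist flag when starting the container (this extends the single docker run shown under Quickstart; the endpoint itself is still configured in Settings → S3):

```shell
# S3_ALLOWED_HOSTS grants SSRF-exempt access to specific private hosts;
# wildcard entries match subdomains.
docker run -d --name extracto -p 3000:3000 \
  -v extracto-data:/app/data \
  -e AUTH_SECRET="$(openssl rand -hex 32)" \
  -e S3_ALLOWED_HOSTS="minio.internal.corp,*.objects.internal" \
  ghcr.io/codelined-ag/extracto:latest
```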

The launcher wraps the full API: extracto ocr ./invoice.pdf, extracto jobs list, extracto kb export, extracto api-key create .... Full reference at extracto.help/cli/overview.
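A typical session with the launcher might look like the following. Only the subcommands named above are taken from the source; any flags beyond them are illustrative, and the full flag set lives in the CLI reference.

```shell
# Submit a document, inspect jobs, then export the knowledge base.
extracto ocr ./invoice.pdf   # submit a PDF for extraction
extracto jobs list           # resumable jobs with per-page progress
extracto kb export           # chunk + embed into the configured vector store
extracto api-key create      # mint a scoped key for another client
```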

Manual paths

From source:

git clone https://github.com/codelined-ag/Extracto.git
cd Extracto
./install-extracto.sh   # Linux/macOS
# or: .\scripts\extracto.ps1 install   # Windows (Docker Desktop + WSL2)
extracto on

Single docker run (no launcher):

docker run -d --name extracto -p 3000:3000 -v extracto-data:/app/data -e AUTH_SECRET="$(openssl rand -hex 32)" ghcr.io/codelined-ag/extracto:latest

Multi-arch (linux/amd64 + linux/arm64); pin a release with :v0.5.4 instead of :latest.


Plug everywhere

Same backend, four surfaces. Pick what fits.

| Surface | Use it when | Read |
| --- | --- | --- |
| Browser UI | You're a human with a stack of PDFs | How it works |
| REST API (/api/v1/*) | You're building a document-intake pipeline | API reference |
| MCP server | Your agent speaks MCP (Claude Desktop, Cursor, Codex, OpenClaw, Hermes) | Agents |
| CLI + SKILL.md | Your agent only has a shell tool (Claude Code, shell-based runners) | Skill file |
| OpenAI-Chat adapter | You already have OpenAI-SDK code; just point it at Extracto | OpenAI compat |

Agents get two first-class paths. The MCP server exposes seven tools (ocr_submit, ocr_get, jobs_list, job_stop, kb_search, kb_export, presets_list). The SKILL.md + typed CLI path is for agents that don't speak MCP: drop the skill file into the agent's context and it knows when to call extracto ocr, extracto kb search, extracto jobs ... from a shell.
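For the OpenAI-Chat adapter, existing SDK or curl code only needs a base-URL swap. The request path and model name below are assumptions for illustration; the OpenAI compat docs define the real values.

```shell
# Hypothetical Chat-Completions-shaped request to a local instance.
# Path and model are illustrative, not confirmed endpoints.
curl -s http://localhost:3000/api/v1/chat/completions \
  -H "Authorization: Bearer $EXTRACTO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "extracto",
        "messages": [
          {"role": "user", "content": "Summarize the Q3 invoice batch."}
        ]
      }'
```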


Documentation

Everything beyond a five-minute install lives at extracto.help in five languages:

  • Configuration reference (every env var)
  • Full v1 API guide (auth, OCR, jobs, presets, webhooks, KB export, search, metrics)
  • CLI reference
  • MCP setup for every supported client
  • Knowledge-base export (chunking strategies, embedding providers, vector stores)
  • Production checklist (auth secret, HTTPS, signup gate, rate limits, allowlists)
  • Troubleshooting + ops (logs, metrics, retention, S3 offload, watched folders)
  • Architecture tour

OpenAPI 3.1 spec at openapi.yaml. Import into Bruno, Postman, Insomnia, or any client generator.



License

MIT © codelined