
Extracto

Your private document brain.
PDFs in, RAG out. Self-hosted. Plug everywhere.

Quickstart · What you get · Plug everywhere · Docs · OpenAPI · Changelog


[Screenshot: Extracto workspace]

v0.5.4: any S3-compatible storage (AWS, R2, Backblaze, MinIO, Garage, Ceph, SeaweedFS, ...) with hardened SSRF policy and rate limits, Typesense vector store, history GFM rendering with stopped-job state, and a curl|sh / iwr|iex one-liner installer pinned to the release tag. See the changelog.


Why

Most document-to-AI tools are SaaS. They cost per page, they see your documents, and they lock you into one provider. Extracto is the opposite: one Docker container, your machine, any vision model (local or hosted), output goes wherever you want it. Browser, code, agent, vector store. You pick.


What you get

A complete pipeline from raw document to retrievable knowledge, in one container:

  1. Ingest any PDF, image, or watched folder.
  2. Extract with the vision model of your choice (Ollama, Mistral OCR, OpenRouter, any OpenAI-compatible endpoint).
  3. Post-process with a second LLM pass (clean to markdown or strict JSON, with your own instruction).
  4. Chunk + embed + store into Chroma, Qdrant, or Weaviate. Five chunking strategies including semantic (sentence-embed + topic-shift split) and hierarchical (preserves heading breadcrumbs).
  5. Retrieve through a stable v1 REST API, an OpenAI-Chat-Completions adapter, an MCP server (Claude/Cursor/Codex/OpenClaw/Hermes), a typed CLI, or the browser UI.
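The five steps above can be sketched as one curl session against a local instance. This is an illustrative shape only: the endpoint paths, field names, and response keys below are assumptions, not taken from the API reference, which is the source of truth.

```shell
# Hypothetical end-to-end run; real paths and parameters are in the
# v1 API guide and the OpenAPI spec.
API=http://localhost:3000/api/v1
KEY="sk-your-api-key"   # a scoped API key created in the UI or CLI

# Steps 1-3: submit a PDF for extraction + post-processing (shape assumed)
JOB=$(curl -s -H "Authorization: Bearer $KEY" \
  -F "file=@invoice.pdf" "$API/ocr" | jq -r '.job_id')

# Poll the resumable job for page-by-page progress
curl -s -H "Authorization: Bearer $KEY" "$API/jobs/$JOB"

# Step 5: search the knowledge base once chunks are embedded and stored
curl -s -H "Authorization: Bearer $KEY" \
  --get --data-urlencode "q=total amount due" "$API/kb/search"
```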

Other things you don't need to bolt on:

  • Per-user accounts, scoped API keys with rate limits, signed webhooks.
  • Resumable jobs, page-by-page progress, searchable history.
  • Optional S3/MinIO blob offload, Prometheus metrics, healthcheck.
  • Five UI languages (English, Italian, French, Spanish, German).
  • 1200+ tests, MIT-licensed, semver on /api/v1.

Quickstart

You need Docker. That's it.

One-liner (Linux / macOS)

curl --proto '=https' --tlsv1.2 -fsSL https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.sh | bash

One-liner (Windows)

iwr -UseBasicParsing https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.ps1 | iex

The installer clones the repo at the pinned tag to ~/.local/share/extracto (or %LOCALAPPDATA%\Extracto), sets up Docker + Ollama if missing, drops an extracto launcher on PATH, and starts the stack. Open http://localhost:3000 and sign up.

The installer wraps install-extracto.sh, which on a fresh machine runs vendor scripts from https://get.docker.com and https://ollama.com/install.sh as root to provision Docker + Ollama. Skip those steps with EXTRACTO_INSTALL_DOCKER=0 EXTRACTO_INSTALL_OLLAMA=0 if Docker is already installed and you don't want Ollama. Set EXTRACTO_REPO_REF=main to track the bleeding edge instead of the pinned tag.
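Put together, a guarded run on a machine that already has Docker and doesn't want Ollama looks like this (same installer URL as above; the two environment flags are the ones documented in this section):

```shell
# Skip the root-level vendor scripts; the rest of the install proceeds
# as normal (clone at the pinned tag, launcher on PATH, stack start).
curl --proto '=https' --tlsv1.2 -fsSL \
  https://raw.githubusercontent.com/codelined-ag/Extracto/v0.5.4/scripts/install.sh \
  | EXTRACTO_INSTALL_DOCKER=0 EXTRACTO_INSTALL_OLLAMA=0 bash
```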

S3-compatible storage (any provider)

Settings → S3 takes any S3-compatible endpoint: AWS S3, Cloudflare R2, Backblaze B2, DigitalOcean Spaces, Wasabi, Linode Object Storage, GCS, MinIO, Garage, Ceph RGW, SeaweedFS, on-prem appliances, etc. The endpoint URL is validated against SSRF (cloud-metadata IPs and link-local addresses are always blocked). Private and loopback hosts (RFC 1918 ranges, 127.0.0.1, host.docker.internal) require either S3_ALLOW_LOOPBACK=1 as a global opt-in or S3_ALLOWED_HOSTS=minio.internal.corp,*.objects.internal for per-host access.
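For example, to allow a self-hosted MinIO on an internal network, pass the allowlist flag when starting the container (this extends the single docker run shown under Quickstart; the endpoint itself is still configured in Settings → S3):

```shell
# S3_ALLOWED_HOSTS grants SSRF-exempt access to specific private hosts;
# wildcard entries match subdomains.
docker run -d --name extracto -p 3000:3000 \
  -v extracto-data:/app/data \
  -e AUTH_SECRET="$(openssl rand -hex 32)" \
  -e S3_ALLOWED_HOSTS="minio.internal.corp,*.objects.internal" \
  ghcr.io/codelined-ag/extracto:latest
```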

The launcher wraps the full API: extracto ocr ./invoice.pdf, extracto jobs list, extracto kb export, extracto api-key create .... Full reference at extracto.help/cli/overview.
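A typical session with the launcher might look like the following. Only the subcommands named above are taken from the source; any flags beyond them are illustrative, and the full flag set lives in the CLI reference.

```shell
# Submit a document, inspect jobs, then export the knowledge base.
extracto ocr ./invoice.pdf   # submit a PDF for extraction
extracto jobs list           # resumable jobs with per-page progress
extracto kb export           # chunk + embed into the configured vector store
extracto api-key create      # mint a scoped key for another client
```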

Manual paths

From source:

git clone https://github.com/codelined-ag/Extracto.git
cd Extracto
./install-extracto.sh   # Linux/macOS
# or: .\scripts\extracto.ps1 install   # Windows (Docker Desktop + WSL2)
extracto on

Single docker run (no launcher):

docker run -d --name extracto -p 3000:3000 -v extracto-data:/app/data -e AUTH_SECRET="$(openssl rand -hex 32)" ghcr.io/codelined-ag/extracto:latest

Multi-arch (linux/amd64 + linux/arm64); pin a release with :v0.5.4 instead of :latest.


Plug everywhere

Same backend, four surfaces. Pick what fits.

| Surface | Use it when | Read |
| --- | --- | --- |
| Browser UI | You're a human with a stack of PDFs | How it works |
| REST API (/api/v1/*) | You're building a document-intake pipeline | API reference |
| MCP server | Your agent speaks MCP (Claude Desktop, Cursor, Codex, OpenClaw, Hermes) | Agents |
| CLI + SKILL.md | Your agent only has a shell tool (Claude Code, shell-based runners) | Skill file |
| OpenAI-Chat adapter | You already have OpenAI-SDK code; just point it at Extracto | OpenAI compat |

Agents get two first-class paths. The MCP server exposes seven tools (ocr_submit, ocr_get, jobs_list, job_stop, kb_search, kb_export, presets_list). The SKILL.md + typed CLI path is for agents that don't speak MCP: drop the skill file into the agent's context and it knows when to call extracto ocr, extracto kb search, extracto jobs ... from a shell.
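For the OpenAI-Chat adapter, existing SDK or curl code only needs a base-URL swap. The request path and model name below are assumptions for illustration; the OpenAI compat docs define the real values.

```shell
# Hypothetical Chat-Completions-shaped request to a local instance.
# Path and model are illustrative, not confirmed endpoints.
curl -s http://localhost:3000/api/v1/chat/completions \
  -H "Authorization: Bearer $EXTRACTO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "extracto",
        "messages": [
          {"role": "user", "content": "Summarize the Q3 invoice batch."}
        ]
      }'
```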


Documentation

Everything beyond a five-minute install lives at extracto.help in five languages:

  • Configuration reference (every env var)
  • Full v1 API guide (auth, OCR, jobs, presets, webhooks, KB export, search, metrics)
  • CLI reference
  • MCP setup for every supported client
  • Knowledge-base export (chunking strategies, embedding providers, vector stores)
  • Production checklist (auth secret, HTTPS, signup gate, rate limits, allowlists)
  • Troubleshooting + ops (logs, metrics, retention, S3 offload, watched folders)
  • Architecture tour

OpenAPI 3.1 spec at openapi.yaml. Import into Bruno, Postman, Insomnia, or any client generator.



License

MIT © codelined