DocRunr gives you two ways to run document processing: a CLI for local and batch work, and a Docker container with a UI for your RAG stack development and production deployments.
- Binary-based file type detection.
- Clean Markdown and stable chunk JSON output.
- Automatic parser fallback when extraction quality is weak.
- Worker setup with queue processing, uploads, health, stats, and artifact inspection.
- UI for uploads, jobs, and output review.
DocRunr does one job: it turns messy documents into clean Markdown and structured chunks. It handles PDFs, Office files, email, HTML, and images with text.
DocRunr is built for general document handling, not for every possible edge case. The goal is to make the common 80% of real-world documents usable with a predictable pipeline, not to promise perfect conversion for every domain-specific layout, template, or special use case. There will always be documents and use cases that need custom handling outside DocRunr.
Chunks are simple by design. We lean on the structure already in the document and use one chunking approach only: recursive, structure-based splitting with no overlap. No strategy matrix, no tuning exercise, no guessing which splitter to use. The behavior is stable, documented in SPEC.md, and easy to rely on.
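As a rough illustration (not DocRunr's actual implementation), a recursive, structure-based splitter with no overlap can be sketched like this; the `MAX_CHARS` budget and the separator list are assumed values for the example:

```python
# Illustrative sketch of recursive, structure-based splitting with no overlap.
# MAX_CHARS and SEPARATORS are assumed values, not DocRunr's actual settings.
MAX_CHARS = 1200
SEPARATORS = ["\n## ", "\n### ", "\n\n", "\n", " "]  # coarse to fine structure

def split_text(text: str, separators: list[str] = SEPARATORS) -> list[str]:
    if len(text) <= MAX_CHARS:
        return [text]
    if not separators:
        # No structure left: hard-cut at the budget, still with no overlap.
        return [text[i:i + MAX_CHARS] for i in range(0, len(text), MAX_CHARS)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:  # separator absent, try the next finer one
        return split_text(text, finer)
    chunks: list[str] = []
    for i, part in enumerate(parts):
        piece = (sep + part) if i > 0 else part  # keep the structural marker
        chunks.extend(split_text(piece, finer))
    return chunks
```

Because every split lands on existing document structure and chunks never overlap, the same input produces the same chunks on every run.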
DocRunr fits into one small part of your stack. Locally, you can run the CLI on files directly. In Docker or production, you push jobs to RabbitMQ and let the DocRunr worker do the extraction and chunking.
```mermaid
flowchart LR
    A[Documents] --> B[RabbitMQ]
    B --> C[DocRunr]
    C --> B
    C --> D["📝 Clean Markdown (.md)"]
    C --> E["🧩 Structured chunks (.json)"]
```
The bundled UI sits on top of that same flow. It gives you an easy way to upload documents, inspect jobs, and review artifacts without building your own operator tooling first.
The default Docker stack is RabbitMQ, the TXT worker, the LLM worker (LiteLLM + in-Docker Ollama), and local storage under ./.data:
```sh
docker compose up -d --build
```

- Open http://localhost:8080 for the text extraction (TXT) dashboard.
- Open http://localhost:8081 for the LLM dashboard.
Object storage: Use the SeaweedFS overlay so both workers use S3-compatible storage (list it last so it overrides STORAGE_TYPE):
```sh
docker compose -f docker-compose.base.yml -f docker-compose.llm.yml -f docker-compose.ollama.yml -f docker-compose.seaweedfs.yml up -d --build
```

LLM embeddings: Pass `llm_profile` on extraction jobs to trigger a follow-up embedding step. See SPEC.md (section 20) for the full protocol.
Queue payloads
Extraction — job and result fields, priority queues, and `llm_profile`: SPEC.md, section 19. Job payload:

```json
{
"job_id": "…",
"filename": "report.pdf",
"source_path": "input/…/….pdf",
"options": {},
"priority": 0,
"llm_profile": "nomic-embed-text-137m"
}
```

Result payload:

```json
{
"job_id": "…",
"status": "ok",
"filename": "report.pdf",
"source_path": "input/…/….pdf",
"markdown_path": "output/…/….md",
"chunks_path": "output/…/….json",
"total_tokens": 0,
"total_chars": 0,
"chunk_count": 0,
"duration_seconds": 0,
"error": null,
"priority": 0,
"llm_profile": "nomic-embed-text-137m"
}
```
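For illustration, publishing an extraction job from Python with the pika client might look like the sketch below. The `EXTRACT_QUEUE` name, connection settings, and source path are placeholders, not DocRunr's actual values; see SPEC.md, section 19, for the real queue names and priority handling.

```python
import json
import uuid

import pika  # RabbitMQ client

# NOTE: the queue name and connection settings are placeholders for this
# sketch; the actual extraction queues are defined in SPEC.md, section 19.
EXTRACT_QUEUE = "docrunr.extract.jobs"

job = {
    "job_id": str(uuid.uuid4()),
    "filename": "report.pdf",
    "source_path": "input/2026/04/15/00/report.pdf",  # illustrative path
    "options": {},
    "priority": 0,
    "llm_profile": "nomic-embed-text-137m",  # optional: triggers embeddings
}

conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()
ch.queue_declare(queue=EXTRACT_QUEUE, durable=True)
ch.basic_publish(exchange="", routing_key=EXTRACT_QUEUE, body=json.dumps(job))
conn.close()
```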
LLM (optional worker-llm) — queues, job and result fields, and retries: SPEC.md, section 20. Job payload:

```json
{
"job_id": "new-uuid",
"extract_job_id": "original-extraction-uuid",
"filename": "report.pdf",
"source_path": "input/2026/04/15/00/original-uuid.pdf",
"chunks_path": "output/2026/04/15/00/original-uuid.json",
"llm_profile": "nomic-embed-text-137m",
"priority": 0,
"metadata": {}
}
```

LLM result (`docrunr.llm.results`): status `ok` or `error`; on success, `artifact_path` points at the embeddings JSON; `provider`, `chunk_count`, `vector_count`, and `duration_seconds` describe the run.

```json
{
{
"job_id": "new-uuid",
"extract_job_id": "original-extraction-uuid",
"status": "ok",
"filename": "report.pdf",
"source_path": "input/…/….pdf",
"chunks_path": "output/…/….json",
"llm_profile": "nomic-embed-text-137m",
"provider": "ollama",
"chunk_count": 12,
"vector_count": 12,
"duration_seconds": 3.41,
"artifact_path": "output/…/….embeddings.json",
"error": null
}
```
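To consume those results from Python, a minimal pika consumer on `docrunr.llm.results` could look like the sketch below; the queue name comes from the spec above, while the connection settings and the print handling are placeholders:

```python
import json

import pika

# Minimal sketch: drain LLM results from the docrunr.llm.results queue.
conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
ch = conn.channel()

def on_result(channel, method, properties, body):
    result = json.loads(body)
    if result["status"] == "ok":
        # artifact_path points at the embeddings JSON in storage.
        print(result["job_id"], "->", result["artifact_path"])
    else:
        print(result["job_id"], "failed:", result["error"])
    channel.basic_ack(delivery_tag=method.delivery_tag)

ch.basic_consume(queue="docrunr.llm.results", on_message_callback=on_result)
ch.start_consuming()
```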
Environment variables: Text extraction and LLM workers are configured only via env vars; tables and defaults are in SPEC.md (section 22, Configuration, and section 20 for the LLM worker).

- Core runtime: Python
- Queue: RabbitMQ
- UI: React, Vite, Mantine
- Storage: local disk or S3-compatible object storage
- Packaging: Docker
To work on DocRunr locally, you need Python 3.11+, uv, Node.js 20+ with corepack for pnpm, and Docker for the local stack and integration tests.
```sh
git clone https://github.com/docrunr/docrunr.git
cd docrunr
cp .env.example .env
uv sync
pnpm -C ui install
```

Workspace layout
```
docrunr/
├── core/        # docrunr on PyPI (CLI + library)
├── worker/      # docrunr-worker (RabbitMQ, HTTP, bundled UI assets)
├── worker-llm/  # docrunr-worker-llm (optional LLM post-processing)
├── ui/          # React + Mantine; Vite in dev, static bundle in the image
├── tests/       # core, worker, worker_llm, integration, samples
└── scripts/     # release and dev helpers
```
After the clone and .env copy above, the commands below install dependencies and run DocRunr Worker in dev mode. For Docker, tests, lint, release, and other workflows, use the tasks in .vscode/tasks.json.
| Command | Description |
|---|---|
| `uv sync` | Install the Python workspace and dev dependencies. |
| `pnpm -C ui install` | Install UI dependencies. |
| `node ./scripts/dev.mjs` | Start dev. |
| `node ./scripts/dev.mjs --llm` | Start dev with LLM worker + LiteLLM. |
The `docrunr` command processes a single file or walks a directory of supported documents, writing cleaned Markdown (.md) and chunk metadata (.json) next to each input unless you set `--out`. It uses the same pipeline as `convert()` in Python: no config files, and the same predictable output for the same input. The options table below covers output location, verbose extraction logs, batch summary JSON, parallel workers, and filename filters. Full behavior, exit codes, and JSON shapes are documented in SPEC.md.
Install from PyPI:
```sh
uv pip install docrunr
```

Run it on a file or a directory:

```sh
docrunr document.pdf
docrunr ./documents/ --out ./output -v -r
```

Or use it as a library:

```python
from docrunr import convert
result = convert("report.pdf")
result.markdown  # cleaned Markdown output
result.chunks    # structured chunks
```

| Option | Short | Description |
|---|---|---|
| `--out` | `-o` | Output directory (default: beside input) |
| `--verbose` | `-v` | Extraction details and timing |
| `--report` | `-r` | Batch report JSON |
| `--workers` | `-w` | Parallel workers for batch (0 = auto) |
| `--include` | `-i` | Filter by name, extension, or glob |
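For batch runs from Python, mirroring the CLI's `--workers` mode, a sketch along these lines could work; the `ProcessPoolExecutor` wiring is an assumption for the example, and it assumes `result.chunks` is a sequence:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from docrunr import convert

def run(path: Path) -> tuple[str, int]:
    result = convert(str(path))
    return path.name, len(result.chunks)  # assumes chunks is a sequence

if __name__ == "__main__":
    files = sorted(Path("./documents").rglob("*.pdf"))
    with ProcessPoolExecutor() as pool:  # parallel workers, like the CLI's -w
        for name, n_chunks in pool.map(run, files):
            print(f"{name}: {n_chunks} chunks")
```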
DocRunr picks a parser from the file's detected MIME type (binary-based, via Magika), not from the filename alone. Today the built-in registry handles the MIME types listed below, which correspond to the extensions in the table; a small detection sketch follows the table.
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT |
| Spreadsheets | XLSX, XLS, ODS, CSV |
| Presentations | PPTX, PPT, ODP |
| Email | EML, MSG |
| Web & markup | HTML, HTM, XML, MD, JSON, TXT |
| Images | JPG, JPEG, PNG, TIFF, BMP |
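A minimal sketch of that dispatch, assuming the `magika` package's `identify_path` API; the `REGISTRY` mapping and `pick_parser` helper are illustrative, not DocRunr's internal registry:

```python
from pathlib import Path

from magika import Magika  # binary-based file type detection

# Illustrative MIME -> parser mapping; not DocRunr's internal registry.
REGISTRY = {
    "application/pdf": "pdf_parser",
    "text/html": "html_parser",
    # ... one entry per supported MIME type
}

def pick_parser(path: Path) -> str:
    detected = Magika().identify_path(path)
    mime = detected.output.mime_type  # detected from file bytes, not the name
    try:
        return REGISTRY[mime]
    except KeyError:
        raise ValueError(f"unsupported MIME type: {mime}")

print(pick_parser(Path("report.pdf")))
```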
DocRunr is licensed under the Apache License 2.0.
