
DocRunr

Document to clean Markdown and chunks. That's it.

License: Apache-2.0 Contributions welcome Python 3.11+ RabbitMQ

DocRunr dashboard: metrics, activity heatmap, and charts

DocRunr gives you two ways to run document processing: a CLI for local and batch work, and a Docker container with a UI for development and production deployments of your RAG stack.

Highlights

  • Binary file detection.
  • Clean Markdown and stable chunk JSON output.
  • Automatic parser fallback when extraction quality is weak.
  • Worker setup with queue processing, uploads, health, stats, and artifact inspection.
  • UI for uploads, jobs, and output review.

🎯 Simple by design

DocRunr does one job: it turns messy documents into clean Markdown and structured chunks. PDFs, Office files, email, HTML, images with text.

DocRunr is built for general document handling, not for every possible document edge case. The goal is to make the common 80% of real-world documents usable with a predictable pipeline, not to promise perfect conversion for every domain-specific layout, template, or special use case. There will always be documents and use cases that need custom handling outside DocRunr.

Chunks are simple by design. We lean on the structure already in the document and use one chunking approach only: recursive, structure-based splitting with no overlap. No strategy matrix, no tuning exercise, no guessing which splitter to use. The behavior is stable, documented in SPEC.md, and easy to rely on.
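As an illustration of what recursive, structure-based splitting means, here is a hypothetical stdlib-only sketch (not DocRunr's implementation; the actual separators and limits are defined in SPEC.md): split on the coarsest structural boundary present, recurse into oversized pieces with finer separators, then pack siblings back together without overlap.

```python
def split_markdown(text: str, max_chars: int = 800) -> list[str]:
    """Recursive, structure-based splitting with no overlap (sketch only)."""
    separators = ["\n## ", "\n\n", "\n", " "]  # headings -> paragraphs -> lines -> words

    def split(chunk: str, seps: list[str]) -> list[str]:
        if len(chunk) <= max_chars or not seps:
            return [chunk]
        parts = [p for p in chunk.split(seps[0]) if p.strip()]
        if len(parts) <= 1:                 # separator absent: try a finer one
            return split(chunk, seps[1:])
        out: list[str] = []
        for part in parts:                  # recurse into oversized pieces
            out.extend(split(part, seps[1:]))
        return out

    # Pack adjacent pieces back together greedily, never exceeding
    # max_chars and never duplicating text across chunks (no overlap).
    chunks: list[str] = []
    for piece in split(text, separators):
        if chunks and len(chunks[-1]) + 1 + len(piece) <= max_chars:
            chunks[-1] = chunks[-1] + "\n" + piece
        else:
            chunks.append(piece)
    return chunks
```

The point of the sketch is the ordering: document structure decides the split points first, and the size limit only forces recursion when a structural unit is too large on its own.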

🔄 How it works

DocRunr fits into one small part of your stack. Locally, you can run the CLI on files directly. In Docker or production, you push jobs to RabbitMQ and let the DocRunr worker do the extraction and chunking.

flowchart LR
    A[Documents] --> B[RabbitMQ]
    B --> C[DocRunr]
    C --> B
    C --> D["📝 Clean Markdown (.md)"]
    C --> E["🧩 Structured chunks (.json)"]

The bundled UI sits on top of that same flow. It gives you an easy way to upload documents, inspect jobs, and review artifacts without building your own operator tooling first.

🐳 Docker

The default Docker stack is RabbitMQ, the TXT worker, the LLM worker (LiteLLM + in-Docker Ollama), and local storage under ./.data:

docker compose up -d --build

Object storage: Use the SeaweedFS overlay so both workers use S3-compatible storage (list it last so it overrides STORAGE_TYPE):

docker compose -f docker-compose.base.yml -f docker-compose.llm.yml -f docker-compose.ollama.yml -f docker-compose.seaweedfs.yml up -d --build

LLM embeddings: Pass llm_profile on extraction jobs to trigger a follow-up embedding step. See SPEC.md (section 20) for the full protocol.

Queue payloads

Extraction — job and result fields, priority queues, and llm_profile: SPEC.md, section 19.

{
  "job_id": "",
  "filename": "report.pdf",
  "source_path": "input/…/….pdf",
  "options": {},
  "priority": 0,
  "llm_profile": "nomic-embed-text-137m"
}

{
  "job_id": "",
  "status": "ok",
  "filename": "report.pdf",
  "source_path": "input/…/….pdf",
  "markdown_path": "output/…/….md",
  "chunks_path": "output/…/….json",
  "total_tokens": 0,
  "total_chars": 0,
  "chunk_count": 0,
  "duration_seconds": 0,
  "error": null,
  "priority": 0,
  "llm_profile": "nomic-embed-text-137m"
}

LLM (optional worker-llm) — queues, job and result fields, and retries: SPEC.md, section 20.

{
  "job_id": "new-uuid",
  "extract_job_id": "original-extraction-uuid",
  "filename": "report.pdf",
  "source_path": "input/2026/04/15/00/original-uuid.pdf",
  "chunks_path": "output/2026/04/15/00/original-uuid.json",
  "llm_profile": "nomic-embed-text-137m",
  "priority": 0,
  "metadata": {}
}

LLM result (docrunr.llm.results): status ok or error; on success, artifact_path points at the embeddings JSON; provider, chunk_count, vector_count, and duration_seconds describe the run.

{
  "job_id": "new-uuid",
  "extract_job_id": "original-extraction-uuid",
  "status": "ok",
  "filename": "report.pdf",
  "source_path": "input/…/….pdf",
  "chunks_path": "output/…/….json",
  "llm_profile": "nomic-embed-text-137m",
  "provider": "ollama",
  "chunk_count": 12,
  "vector_count": 12,
  "duration_seconds": 3.41,
  "artifact_path": "output/…/….embeddings.json",
  "error": null
}

Environment variables: Text extraction and LLM workers are configured only via env vars; tables and defaults are in SPEC.md (section 22, Configuration, and section 20 for the LLM worker).

🛠 Tech stack

  • Core runtime: Python
  • Queue: RabbitMQ
  • UI: React, Vite, Mantine
  • Storage: local disk or S3-compatible object storage
  • Packaging: Docker

💻 Development

To work on DocRunr locally, you need Python 3.11+, uv, Node.js 20+ with corepack for pnpm, and Docker for the local stack and integration tests.

git clone https://github.com/docrunr/docrunr.git
cd docrunr
cp .env.example .env
uv sync
pnpm -C ui install

Workspace layout

docrunr/
├── core/           # docrunr on PyPI (CLI + library)
├── worker/         # docrunr-worker (RabbitMQ, HTTP, bundled UI assets)
├── worker-llm/     # docrunr-worker-llm (optional LLM post-processing)
├── ui/             # React + Mantine; Vite in dev, static bundle in the image
├── tests/          # core, worker, worker_llm, integration, samples
└── scripts/        # release and dev helpers

Commands

After the clone and .env copy above, the commands below install dependencies and run DocRunr Worker in dev mode. For Docker, tests, lint, release, and other workflows, use the tasks in .vscode/tasks.json.

| Command | Description |
| --- | --- |
| `uv sync` | Install the Python workspace and dev dependencies. |
| `pnpm -C ui install` | Install UI dependencies. |
| `node ./scripts/dev.mjs` | Start dev mode. |
| `node ./scripts/dev.mjs --llm` | Start dev mode with the LLM worker + LiteLLM. |

⌨️ CLI

The docrunr command processes a single file or walks a directory of supported documents, writing cleaned Markdown (.md) and chunk metadata (.json) next to each input unless you set --out. It uses the same pipeline as convert() in Python: no config files, and the same predictable output for the same input. The options table covers output location, verbose extraction logs, batch summary JSON, parallel workers, and filename filters. Full behavior, exit codes, and JSON shapes are documented in SPEC.md.

Install from PyPI:

uv pip install docrunr
docrunr document.pdf
docrunr ./documents/ --out ./output -v -r

The same pipeline is available from Python:

from docrunr import convert

result = convert("report.pdf")
result.markdown
result.chunks
| Option | Short | Description |
| --- | --- | --- |
| `--out` | `-o` | Output directory (default: beside input) |
| `--verbose` | `-v` | Extraction details and timing |
| `--report` | `-r` | Batch report JSON |
| `--workers` | `-w` | Parallel workers for batch (0 = auto) |
| `--include` | `-i` | Filter by name, extension, or glob |

📋 Supported formats

DocRunr picks a parser from the file’s detected MIME type (binary based, via Magika), not from the filename alone. Today the built-in registry handles the MIME types listed below, which correspond to the extensions in the table.

| Category | Formats |
| --- | --- |
| Documents | PDF, DOCX, DOC, ODT |
| Spreadsheets | XLSX, XLS, ODS, CSV |
| Presentations | PPTX, PPT, ODP |
| Email | EML, MSG |
| Web & markup | HTML, HTM, XML, MD, JSON, TXT |
| Images | JPG, JPEG, PNG, TIFF, BMP |
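DocRunr's actual detection comes from Magika; the stdlib-only sketch below just illustrates the principle of trusting leading bytes over the filename. The signature table is a tiny, incomplete stand-in for a real detector:

```python
def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from magic bytes, ignoring the filename (sketch)."""
    signatures = {
        b"%PDF-": "application/pdf",
        b"PK\x03\x04": "application/zip",       # DOCX/XLSX/PPTX are ZIP containers
        b"\x89PNG\r\n\x1a\n": "image/png",
        b"\xff\xd8\xff": "image/jpeg",
    }
    for magic, mime in signatures.items():
        if data.startswith(magic):
            return mime
    # Fall back on a crude text-vs-binary heuristic.
    return "text/plain" if b"\x00" not in data[:1024] else "application/octet-stream"
```

This is why a .txt file that is really a PDF still routes to the PDF parser: the bytes, not the extension, pick the parser.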

📄 License

DocRunr is licensed under the Apache License 2.0.
