Multi-engine OCR pipeline with LLM reconciliation, packaged as a Docker Compose stack with a single REST API.
polycr fans out every image to multiple OCR engines in parallel, then uses a vision-capable LLM to reconcile the outputs into one high-confidence transcription and extract structured fields (dates, names, prices, contact info, etc.).
Related: ocr-mcp — an MCP server that wraps this stack and exposes OCR tools to OpenClaw. It calls the `/ocr/raw` endpoint and handles engine selection, fallback, and image preprocessing.
```shell
# 1. Clone and configure
git clone https://github.com/davidgut1982/polycr.git && cd polycr
cp .env.example .env
# Edit .env — set LLM_API_KEY at minimum

# 2. Start the default stack (router + tesseract + easyocr + doctr)
make up

# 3. Wait for services to be healthy (~60 s on first run, models download)
docker compose logs -f

# 4. Send an image
curl -X POST http://localhost:8000/process \
  -F "file=@/path/to/document.jpg" | jq .

# Reconciled text plus structured fields
curl -X POST http://localhost:8000/process \
  -F "file=@invoice.jpg" | jq '{text, structured, engines_used}'

# Raw per-engine results, no LLM reconciliation
curl -X POST http://localhost:8000/ocr/raw \
  -F "file=@invoice.jpg" | jq '.results[] | {engine, confidence, text}'

# Health check
curl http://localhost:8000/health
# {"status":"ok","engines":["tesseract","easyocr","doctr"]}
```

| Engine | Technology | Notes |
|---|---|---|
| `tesseract` | Tesseract 4 (LSTM) | Fastest; best for clean printed text |
| `easyocr` | CRNN deep learning | Good on varied fonts and orientations |
| `doctr` | Transformer (docTR) | Strong on document layouts |
| `paddleocr` | PaddlePaddle CRNN | High accuracy; large download |
| `surya` | Transformer (Surya) | Multilingual; best layout understanding |
The default stack (`ENABLED_ENGINES=tesseract,easyocr,doctr`) balances speed and accuracy without requiring the full profile.

Enable all engines:

```shell
docker compose --profile full up -d
```

| Provider | `LLM_PROVIDER` | Example model |
|---|---|---|
| Anthropic | `anthropic` | `claude-haiku-4-5-20251001` |
| OpenAI | `openai` | `gpt-4o` |
| OpenRouter | `openrouter` | `qwen/qwen2.5-vl-72b-instruct` |
| Groq | `groq` | `llama-3.2-11b-vision-preview` |
If `LLM_API_KEY` is not set, the `/process` endpoint falls back to the highest-confidence engine result with no structured extraction.
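Conceptually, that fallback just picks the best successful raw result. A minimal sketch (function name illustrative, not from the codebase):

```python
# Sketch of the no-LLM fallback: pick the highest-confidence successful
# engine result from an /ocr/raw-style response. Illustrative only.
def best_engine_result(results: list[dict]) -> dict:
    """Return the successful result with the highest confidence score."""
    ok = [r for r in results if not r.get("error")]
    if not ok:
        raise ValueError("all OCR engines failed")
    return max(ok, key=lambda r: r["confidence"])

results = [
    {"engine": "tesseract", "text": "Total: $142.50", "confidence": 87.3, "error": ""},
    {"engine": "easyocr", "text": "Tota1: $142.5O", "confidence": 62.1, "error": ""},
]
# best_engine_result(results)["engine"] → "tesseract"
```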
| Mode | Command | Notes |
|---|---|---|
| CPU (default) | `make up` | Works everywhere; slower inference |
| GPU (NVIDIA) | `make up-gpu` | Requires nvidia-container-toolkit |
| Minimal (tesseract only) | `make up-minimal` | Fastest start; no model downloads |
polycr exposes two complementary services:
| Service | Port | Purpose |
|---|---|---|
| `router` (polycr) | 8000 | Multi-engine text extraction — fans out to OCR engines, reconciles with an LLM, returns structured text. Used for document classification and filename generation. |
| `ocrmypdf` | 8001 | Archival PDF generation — wraps ocrmypdf to produce a searchable PDF with an embedded text layer. Used to create the final stored document. |
These are intentionally separate: the router is optimised for high-confidence text extraction (queried first), and ocrmypdf is optimised for producing a compact, searchable PDF suitable for long-term storage. A typical scan pipeline calls both in sequence:
```
scan_document → JPEG temp file
        ↓
:8000/ocr/raw — multi-engine text extraction (classification + filename)
        ↓
:8001/pdf — searchable PDF with embedded text layer (archival copy)
        ↓
Upload to Nextcloud
```
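The sequence above can be sketched as a small stdlib-only client, assuming both services are running on their default local ports. `encode_multipart`, `post_file`, and `scan` are illustrative helper names, not part of polycr:

```python
# Sketch of the scan pipeline: text extraction first, then archival PDF.
# Endpoints and the "file" field name are taken from the API described
# in this README; everything else is illustrative.
import urllib.request
import uuid

def encode_multipart(filename: str, payload: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body with a single 'file' field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

def post_file(url: str, filename: str, payload: bytes) -> bytes:
    body, ctype = encode_multipart(filename, payload)
    req = urllib.request.Request(url, data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def scan(jpeg_path: str) -> bytes:
    """Text extraction for classification, then a searchable archival PDF."""
    with open(jpeg_path, "rb") as f:
        image = f.read()
    raw = post_file("http://localhost:8000/ocr/raw", "scan.jpg", image)
    # ... classify / derive a filename from `raw` here ...
    pdf = post_file("http://localhost:8001/pdf", "scan.jpg", image)
    return pdf  # searchable PDF, ready to upload to Nextcloud
```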
`POST /process` — full pipeline: preprocess → OCR fan-out → LLM reconciliation.

Request: `multipart/form-data` with a `file` field (JPEG/PNG/TIFF/BMP).

Response:

```json
{
  "text": "reconciled full text",
  "structured": {
    "date": "2026-04-10",
    "total": "$142.50"
  },
  "ocr_raw": [
    {"engine": "tesseract", "text": "...", "confidence": 87.3, "error": ""}
  ],
  "engines_used": ["tesseract", "easyocr", "doctr"],
  "engines_failed": []
}
```

`POST /ocr/raw` — OCR fan-out only; no LLM call.
Response:

```json
{
  "results": [
    {"engine": "tesseract", "text": "...", "confidence": 87.3, "error": ""}
  ]
}
```

`GET /health` — readiness probe.

Response: `{"status": "ok", "engines": ["tesseract", "easyocr", "doctr"]}`
`POST /pdf` — run ocrmypdf on an uploaded image or PDF and return a searchable PDF.

Request: `multipart/form-data` with a `file` field (JPEG/PNG/PDF/TIFF).

Query params:

| Param | Default | Description |
|---|---|---|
| `deskew` | `true` | Deskew the input image before OCR |
| `optimize` | `1` | PDF optimization level (0 = none, 3 = maximum) |

Response (success): PDF binary with `Content-Type: application/pdf`.

Response (error): `{"error": "...", "detail": "..."}` with a 5xx status code.
`GET /health` — readiness probe for the ocrmypdf service (port 8001).

Response: `{"status": "ok", "service": "ocrmypdf"}`
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `anthropic` | LLM provider: `anthropic`, `openai`, `openrouter`, `groq` |
| `LLM_API_KEY` | (none) | API key for the chosen provider |
| `LLM_MODEL` | `claude-haiku-4-5-20251001` | Model identifier |
| `ENABLED_ENGINES` | `tesseract,easyocr,doctr` | Comma-separated engine list |
| `PORT` | `8000` | Host port for the router |
- Fork the repository and create a feature branch.
- Run `make up-minimal` to start a dev stack.
- Run `make test` to validate your changes.
- Open a pull request — the CI workflow will run integration tests automatically.
Engine additions follow the same contract: `POST /ocr` + `GET /health`, multipart file upload, JSON response `{engine, text, confidence}`.
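As a rough illustration of that contract, here is a stdlib-only adapter skeleton. The engine name, port, and `run_engine` stub are all hypothetical; a real adapter would parse the multipart payload and invoke the actual OCR engine:

```python
# Hypothetical skeleton of a new engine adapter implementing the contract
# above: POST /ocr (multipart file upload) and GET /health, returning
# JSON {engine, text, confidence}. Names and port are illustrative.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ENGINE_NAME = "myengine"  # hypothetical engine identifier

def run_engine(image_bytes: bytes) -> dict:
    """Placeholder for the real OCR call; returns the contract's JSON shape."""
    text = ""  # a real adapter would decode the multipart body and OCR it
    return {"engine": ENGINE_NAME, "text": text, "confidence": 0.0}

class EngineHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"status": "ok", "engine": ENGINE_NAME})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path == "/ocr":
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)  # raw multipart payload
            self._reply(200, run_engine(body))
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, status: int, payload: dict):
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), EngineHandler).serve_forever()
```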