High-performance OCR pipeline for Turkish, archival, and complex-layout documents — turning PDFs into HuggingFace-ready training datasets.
OpenCR is an end-to-end open-source pipeline that converts PDFs (especially Turkish text, archival material, and pages with complex layout) into clean Parquet datasets ready for LLM training and retrieval.
For Turkish documents, see: README.tr.md
- Turkish-first accuracy. Built around DeepSeek-OCR, it handles Turkish characters and difficult page layouts better than off-the-shelf OCR.
- Dataset factory. Outputs are packaged directly as
pages.parquet+documents.parquetwith deterministic train/validation/test splits and a HuggingFace dataset card. - Operator console. A single-page web UI to monitor runs, page-by-page validate quality, retry, and publish to HuggingFace.
- Pluggable backends. Production-grade NVIDIA + vLLM by default; runs in-process on Apple Silicon / CPU for development; or talk to any OpenAI-compatible model server.
Requires Docker, an NVIDIA GPU, and the NVIDIA Container Toolkit.
docker compose up -dOpen http://localhost:39672. Drop PDFs in ./input/, hit Start OCR run.
For local development, demos, and small jobs on a Mac or Linux box with no GPU.
git clone https://github.com/cdliai/opencr.git
cd opencr
python3 -m venv .venv && source .venv/bin/activate
pip install -r ocr_pipeline/requirements.txt -r requirements-local.txt
MODEL_BACKEND=local ./scripts/start.shOpen http://localhost:39672. The DeepSeek-OCR model (~6 GB) downloads
on first request and runs in-process via transformers on MPS (Apple Silicon)
or CPU. Expect 5–30 seconds per page on M-series, much slower on CPU —
fine for development, not for production batch jobs.
If you already run vLLM somewhere, or use OpenRouter, or another endpoint serving DeepSeek-OCR:
pip install -r ocr_pipeline/requirements.txt
MODEL_BACKEND=remote MODEL_SERVER_URL=https://your-endpoint MODEL_API_KEY=sk-... ./scripts/start.shConfigurable via environment variables (or a .env file):
| Variable | Default | Description |
|---|---|---|
MODEL_BACKEND |
vllm |
vllm (NVIDIA, OpenAI-compatible server), local (in-process transformers), remote (alias). |
MODEL_SERVER_URL |
http://ocr-model:39671 |
Base URL for vllm / remote backends. |
MODEL_NAME |
deepseek-ai/DeepSeek-OCR |
Model identifier. |
MODEL_API_KEY |
EMPTY |
API key for remote endpoints. |
LOCAL_DEVICE |
auto | auto, mps, cuda, or cpu for the local backend. |
INPUT_DIR |
./input (or /data/input) |
Where to read PDFs from. |
OUTPUT_DIR |
./output (or /data/output) |
Where artifacts and the SQLite DB land. |
HOST / PORT |
0.0.0.0 / 39672 |
Where the web console serves. |
HF_OAUTH_CLIENT_ID |
unset | Enables "Sign in with HuggingFace" for the publish flow. See HF OAuth setup. |
APP_SESSION_SECRET |
random per process | Cookie-signing secret. Set to a stable value in production. |
Completed runs can be pushed to a HuggingFace dataset repo. Two modes:
- Paste-token (default). In the operator console, click Publish to HuggingFace and paste a HF write token. Or set
HF_TOKENin the server's environment to skip pasting. - Sign in with HuggingFace (recommended for shared deployments). Configure OAuth (below) and users sign in with their HF account. The publish flow then uses their personal token automatically. This is also how the operator console gets gated — without a session, the publish action is hidden.
- Create an OAuth app at https://huggingface.co/settings/connected-applications/new with redirect URI
https://your-host/api/auth/callbackand scopesopenid profile write-repos. - Set on the server:
export HF_OAUTH_CLIENT_ID=...
export HF_OAUTH_CLIENT_SECRET=...
export HF_OAUTH_REDIRECT_URI=https://your-host/api/auth/callback
export APP_SESSION_SECRET=$(python3 -c 'import secrets; print(secrets.token_hex(32))')- Restart. The console gains a Sign in with HuggingFace button in the topbar.
Published datasets are tagged opencr so they're discoverable via HuggingFace's tag search.
┌───────────────────────────────┐
│ OCR pipeline (FastAPI) │
PDFs ─────►. │ ingest → render → OCR → │ ──► pages.parquet
│ clean → validate → export │ documents.parquet
│ + operator console (Alpine) │ manifest.json
└──────────────┬────────────────┘
│ OpenAI-compatible
▼
┌───────────────────────────────┐
│ Model backend │
│ ┌─────────────────────────┐ │
│ │ vllm (NVIDIA, prod) │ │
│ │ local (MPS/CPU, dev) │ │
│ │ remote (any OpenAI URL) │ │
│ └─────────────────────────┘ │
└───────────────────────────────┘
State lives in SQLite + the filesystem. No external queue/broker is required for single-node operation. See docs/architectural-overhaul-v2.md for the long-form design.
make install # create venv, install deps
make run # start dev server with sensible defaults
make test # run pytest suite
make lint # ruff checkTests live under tests/. UI is plain HTML + Alpine.js — no build step.
Contributions are welcome — bug reports, Turkish-language test fixtures, benchmarks against other OCR engines, model-backend ports (MLX, llama.cpp), and documentation translations are especially useful.
See CONTRIBUTING.md.
Apache 2.0 — see LICENSE.
OpenCR is built and maintained by cdli.ai to support Turkish-language LLM research and dataset curation.