Skip to content

doublewordai/dwocr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

dwocr

dwocr batches PDF OCR requests against an OpenAI-compatible API, with defaults tuned for Doubleword-hosted Qwen OCR models. It renders each PDF page locally, submits page-level requests through autobatcher, and writes one markdown file per source PDF.

By default it uses:

  • base URL: https://api.doubleword.ai/v1
  • model: Qwen/Qwen3.5-397B-A17B-FP8
  • prompt: the built-in Qwen 397 OCR benchmark prompt in src/dwocr/prompts.py

Install

pip install -e .

This exposes two commands:

  • dwocr
  • dwocr-web

CLI Usage

Basic usage:

dwocr INPUT_PATH

INPUT_PATH can be either:

  • a single PDF file
  • a directory, in which case dwocr recursively processes *.pdf files below it

The CLI looks for an API key in this order:

  1. --api-key
  2. DOUBLEWORD_API_KEY
  3. OPENAI_API_KEY

If you do not pass --output-dir, output is written to a sibling dwocr_output/ directory next to the input root. Relative paths are preserved, so nested PDFs produce nested markdown files.

Example: Doubleword + Qwen 3.5 397B

export DOUBLEWORD_API_KEY=...

dwocr ./papers \
  --base-url https://api.doubleword.ai/v1 \
  --model Qwen/Qwen3.5-397B-A17B-FP8 \
  --output-dir ./ocr_output \
  --render-images \
  --batch-size 512 \
  --batch-window-seconds 5 \
  --poll-interval-seconds 5 \
  --completion-window 24h \
  --target-longest-image-dim 1024 \
  --render-concurrency 8 \
  --overwrite

Important Options

dwocr INPUT_PATH [options]

--api-key TEXT
--base-url TEXT                    OpenAI-compatible API base URL
--model TEXT                       OCR model name
--output-dir TEXT                  Output directory for markdown files
--prompt-file TEXT                 Replace the built-in OCR prompt
--temperature FLOAT                Default: 0.0
--max-tokens INT                   Default: 4096
--batch-size INT                   Default: 512
--batch-window-seconds FLOAT       Default: 5.0
--poll-interval-seconds FLOAT      Default: 5.0
--completion-window {1h,24h}       Default: 24h
--target-longest-image-dim INT     Default: 1024
--render-concurrency INT           Default: min(16, max(4, cpu_count))
--render-images                    Save cropped image regions and rewrite markdown image tags
--overwrite                        Allow writing into a non-empty output directory

Output

Each source PDF becomes one markdown file containing page outputs in order:

<!-- source: some/file.pdf -->
<!-- model: Qwen/Qwen3.5-397B-A17B-FP8 -->
<!-- generated_at: 2026-03-19T12:34:56+00:00 -->

<!-- page 1 -->
...page markdown...

<!-- page 2 -->
...page markdown...

When --render-images is enabled and the model emits markers such as:

image[[120,300,520,700]] Figure caption

dwocr will:

  • crop that region from the rendered source page
  • write the crop into a sibling asset directory such as document_images/
  • rewrite the OCR output to a normal markdown image link

If any pages fail, dwocr still finishes the remaining work, exits non-zero, and prints the failed page list to stderr.

Web UI

Run:

dwocr-web

Then open http://127.0.0.1:8123.

The web UI lets you:

  • submit OCR jobs for a PDF or directory
  • set model, base URL, batching, and rendering options
  • monitor active jobs and logs
  • inspect recent job details and the exact generated CLI command

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors