Rust implementation of the DeepSeek-OCR inference stack with a fast CLI and an OpenAI-compatible HTTP server. The workspace packages the vision-language model, prompt tooling, and serving layer so you can build document understanding pipelines that run locally on CPU, Apple Metal, or NVIDIA CUDA GPUs.
中文文档请看 README_CN.md。
Want ready-made binaries? Latest macOS (Metal-enabled) and Windows bundles live in the build-binaries workflow artifacts. Grab them from the newest green run.
- Vision preprocessing –
prepare_vision_input_from_imagebuilds a square global canvas with letterboxing (build_global_view) and, when crop mode is enabled, appliesdynamic_preprocesstiling to produce high-resolution local crops plus optional thumbnails. - SAM + CLIP fusion – each view is normalised via
image_to_tensor, pushed through the Candle ports of SAM (SamBackbone) and CLIP-L (ClipVisionModel), then flattened withbuild_clip_sam_tokensso the features stay spatially aligned. - Projector & layout tokens – the custom
ImageProjectorlinearly maps concatenated SAM/CLIP channels into the language hidden size while injecting learnedimage_newline/view_separatortokens to preserve grid structure, yielding the multimodal embeddings used during decoding. - Tokenizer alignment –
build_prompt_tokenssynthesises<image>spans whose length exactly matches the projected token count (global + local grids), ensuring OpenAI-style prompts remain consistent even after chat history pruning. - Decoder & caching – the text stack is a Candle reimplementation of DeepSeek-V2 (
DeepseekLanguageModel) with optional FlashAttention, rotary position embeddings, andDynamicCacheguards so both the CLI and server can stream tokens efficiently. - Observability & parity – debug builds expose CLIP/SAM traces (
VisionDebugFeatures) so we can diff intermediate tensors against the PyTorch reference; most stages are already numerically aligned, and the few remaining deltas (mainly projector normalisation + vision tiling) are tracked on the roadmap for upcoming releases.
The original DeepSeek-OCR ships as a Python + Transformers stack—powerful, but hefty to deploy and awkward to embed. Rewriting the pipeline in Rust gives us:
- Smaller deployable artifacts with zero Python runtime or conda baggage.
- Memory-safe, thread-friendly infrastructure that blends into native Rust backends.
- Unified tooling (CLI + server) running on Candle + Rocket without the Python GIL overhead.
- Drop-in compatibility with OpenAI-style clients while tuned for single-turn OCR prompts.
- Candle for tensor compute, with Metal and CUDA backends and FlashAttention support.
- Rocket + async streaming for OpenAI-compatible
/v1/responsesand/v1/chat/completions. - tokenizers (Hugging Face) wrapped by
crates/assetsfor deterministic caching. - Pure Rust vision/prompt pipeline shared by CLI and server to avoid duplicated logic.
- Faster cold-start on Apple Silicon, lower RSS, and native binary distribution.
- Deterministic Hugging Face asset download + verification built into the workspace.
- Automatic single-turn chat compaction so OCR outputs stay stable even when clients send history.
- Ready-to-use OpenAI compatibility for tools like Open WebUI without adapters.
- One repo, two entrypoints – a batteries-included CLI for batch jobs and a Rocket-based server that speaks
/v1/responsesand/v1/chat/completions. - Works out of the box – pulls model weights, configs, and tokenizer from Hugging Face on first run.
- Optimised for Apple Silicon – optional Metal backend with FP16 execution for real-time OCR on laptops.
- CUDA-ready – build with
--features cudaand run with--device cuda --dtype f16to leverage NVIDIA GPUs on Linux/Windows. - OpenAI client compatibility – drop-in replacement for popular SDKs; the server automatically collapses chat history to the latest user turn for OCR-friendly prompts.
- Rust 1.78+ (edition 2024 support)
- Git
- Optional: Apple Silicon running macOS 13+ for Metal acceleration
- Optional: CUDA 12.2+ toolkit + driver for NVIDIA GPU acceleration on Linux/Windows
- (Recommended) Hugging Face account with
HF_TOKENwhen pulling from thedeepseek-ai/DeepSeek-OCRrepo
git clone https://github.com/TimmyOVO/deepseek-ocr.rs.git
cd deepseek-ocr.rs
cargo fetchThe first invocation of the CLI or server downloads the config, tokenizer, and model-00001-of-000001.safetensors (~6.3GB) into DeepSeek-OCR/. To prefetch manually:
cargo run -p deepseek-ocr-cli -- --help # triggers asset downloadSet HF_HOME or HF_TOKEN if you store Hugging Face caches elsewhere. The full model package is ~6.3GB on disk and typically requires ~13GB of RAM headroom during inference (model + activations).
Build and run directly from the workspace:
cargo run -p deepseek-ocr-cli -- \
--prompt "<image>\n<|grounding|>Convert this receipt to markdown." \
--image baselines/sample/images/test.png \
--device cpu --max-new-tokens 512macOS tip: append
--features metalto thecargo run/cargo buildcommands to compile with Accelerate + Metal backends.CUDA tip (Linux/Windows): append
--features cudaand run with--device cuda --dtype f16to target NVIDIA GPUs.
Install the CLI as a binary:
cargo install --path crates/cli
deepseek-ocr-cli --helpKey flags:
--prompt/--prompt-file: text with<image>slots--image: path(s) matching<image>placeholders--deviceand--dtype: choosemetal+f16on Apple Silicon orcuda+f16on NVIDIA GPUs--max-new-tokens: decoding budget
Launch an OpenAI-compatible endpoint:
cargo run -p deepseek-ocr-server -- \
--host 0.0.0.0 --port 8000 \
--device cpu --max-new-tokens 512macOS tip: add
--features metalto thecargo run -p deepseek-ocr-servercommand when you want the server binary to link against Accelerate + Metal (and pair it with--device metalat runtime).CUDA tip: add
--features cudaand start the server with--device cuda --dtype f16to offload inference to NVIDIA GPUs.
Notes:
- Use
data:URLs or remotehttp(s)links; local paths are rejected. - The server collapses multi-turn chat inputs to the latest user message to keep prompts OCR-friendly.
- Works out of the box with tools such as Open WebUI or any OpenAI-compatible client—just point the base URL to your server (
http://localhost:8000/v1) and select thedeepseek-ocrmodel. - Adjust the request body limit with Rocket config if you routinely send large images.
- Metal (macOS 13+ Apple Silicon) – pass
--device metal --dtype f16and build binaries with--features metalso Candle links against Accelerate + Metal. - CUDA (Linux/Windows, NVIDIA GPUs) – install CUDA 12.2+ toolkits, build with
--features cuda, and launch the CLI/server with--device cuda --dtype f16. - For either backend, prefer release builds (e.g.
cargo build --release -p deepseek-ocr-cli --features metal|cuda) to maximise throughput. - Combine GPU runs with
--max-new-tokensand crop tuning flags to balance latency vs. quality.
crates/core– shared inference pipeline, model loaders, conversation templates.crates/cli– command-line frontend (deepseek-ocr-cli).crates/server– Rocket server exposing OpenAI-compatible endpoints.crates/assets– asset management (configuration, tokenizer, Hugging Face download helpers).baselines/– reference inputs and outputs for regression testing.
- Weights download fails – export
HF_TOKEN=<your-token>and retry. Assets land in~/.cache/huggingfaceby default. - Slow first response – model load and GPU warm-up (Metal/CUDA) happen on the initial request; later runs are faster.
- Large image rejection – increase Rocket JSON limits in
crates/server/src/main.rsor downscale the input.
- ✅ Apple Metal backend with FP16 support and CLI/server parity on macOS.
- ✅ NVIDIA CUDA backend (build with
--features cuda, run with--device cuda --dtype f16) for GPU acceleration on Linux/Windows. - 🔄 Parity polish – finish projector normalisation + crop tiling alignment; extend intermediate-tensor diff suite beyond the current sample baseline.
- 🔄 Grounding & streaming – port the Python post-processing helpers (box extraction, markdown polish) and refine SSE streaming ergonomics.
- 🔄 Cross-platform acceleration – continue tuning CUDA kernels, add automatic device detection across CPU/Metal/CUDA, and publish opt-in GPU benchmarks.
- 🔄 Packaging & Ops – ship binary releases with deterministic asset checksums, richer logging/metrics, and Helm/docker references for server deploys.
- 🔜 Structured outputs – optional JSON schema tools for downstream automation once parity gaps close.
This repository inherits the licenses of its dependencies and the upstream DeepSeek-OCR model. Refer to DeepSeek-OCR/LICENSE for model terms and apply the same restrictions to downstream use.
