ghuntley/dpdf

Note: this repository does not build standalone yet. The following forward dependencies are not yet published:


claims:

  • type: entrypoint name: dpdf-cli code_path: crates/dpdf-client/src/main.rs
  • type: package name: dpdf-core code_path: crates/dpdf-core/
  • type: package name: dpdf-pipeline code_path: crates/dpdf-pipeline/
  • type: package name: dpdf-server code_path: crates/dpdf-server/
  • type: package name: dpdf-wasm code_path: crates/dpdf-wasm/
  • type: directory name: docs-dir code_path: docs/

dpdf

Pure Rust PDF extraction and document understanding for local, server, and browser workflows.

  • Author: Andrew Yates (andrewyates.name@gmail.com)
  • Version: 0.1.0
  • License: Apache 2.0
  • Copyright: 2026 Andrew Yates

What Is dpdf?

dpdf extracts structured text, tables, and figures from PDF files, and produces rendered pages and thumbnails. The workspace includes:

  • dpdf CLI for inspection, extraction, rendering, thumbnails, benchmarking, and pipeline debugging.
  • dpdf-server for HTTP extraction services.
  • dpdf-wasm for Tier 0 browser extraction with no server round-trip.
  • Tiered ML document-understanding pipelines built around local mly-* crates.

Dependency Model

The native extraction workspace (dpdf-core, dpdf-types, dpdf-pipeline, dpdf-client, and dpdf-server) is implemented in Rust and avoids crates.io dependencies. ML tiers depend on local path crates such as mly-core and mly-metal, and the optional browser wrapper crates/dpdf-wasm uses wasm-bindgen.

Installation

dpdf is currently built from source:

cargo build --workspace
cargo test --workspace
cargo build -p dpdf-client --release

The CLI binary is written to target/release/dpdf. If you build with CARGO_TARGET_DIR=target/user, the binary lands at target/user/release/dpdf.
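
The two paths above follow one rule: the release binary sits under `release/` inside the Cargo target directory, which is `target` unless `CARGO_TARGET_DIR` overrides it. A minimal sketch of resolving that path in a script; the helper name `dpdf_bin_path` is hypothetical and not part of dpdf, and the sketch assumes you run it from the workspace root:

```shell
# Hypothetical helper (not part of dpdf): print where the release build
# of the CLI is expected, honoring CARGO_TARGET_DIR when it is set.
dpdf_bin_path() {
  printf '%s\n' "${CARGO_TARGET_DIR:-target}/release/dpdf"
}
```

A script could then call the freshly built binary with `"$(dpdf_bin_path)" info paper.pdf` without hard-coding either location.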

Usage

# Inspect metadata
dpdf info paper.pdf

# Extract page 1 as Markdown
dpdf extract paper.pdf --pages 1

# Render page 1 to PNG
dpdf render paper.pdf --pages 1 -o ./output/

# Generate a first-page thumbnail
dpdf thumbnail paper.pdf

# Inspect pipeline routing and intermediate stages
dpdf pipeline route paper.pdf

For ML-backed extraction, point the CLI at a local model directory:

export DPDF_MODEL_DIR=~/dpdf-models
dpdf extract paper.pdf --tier 1
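
How the CLI reacts when `DPDF_MODEL_DIR` is missing or wrong is not documented here, so a script may want to check the variable before starting an ML-backed run. A minimal sketch of such a check; `require_model_dir` is a hypothetical helper, not a dpdf command, and it only verifies that the directory exists, not that it contains usable models:

```shell
# Hypothetical pre-flight check (not part of dpdf): succeed only when
# DPDF_MODEL_DIR is set and names an existing directory.
require_model_dir() {
  if [ -z "${DPDF_MODEL_DIR:-}" ]; then
    echo "DPDF_MODEL_DIR is not set" >&2
    return 1
  fi
  if [ ! -d "$DPDF_MODEL_DIR" ]; then
    echo "DPDF_MODEL_DIR is not a directory: $DPDF_MODEL_DIR" >&2
    return 1
  fi
}
```

It can guard the command shown above, for example `require_model_dir && dpdf extract paper.pdf --tier 1`.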

Reference Docs

Workspace Layout

crates/
├── dpdf-core/       PDF parsing, text extraction, rendering, and image codecs
├── dpdf-types/      Shared document data model and serializers
├── dpdf-pipeline/   Tier routing, ML orchestration, and evaluation harnesses
├── dpdf-client/     `dpdf` CLI
├── dpdf-server/     HTTP server surface
└── dpdf-wasm/       Browser-facing Tier 0 wrapper

Additional top-level directories:

  • docs/ for user-facing documentation that ships with the public repo.
  • tests/ for CLI and integration coverage.

License

Apache License 2.0. See LICENSE.