High-performance document extraction library for Rust.
Extracts text and renders pages from PDF documents using the Pdfium engine — the same C++ library used by Google Chrome.
- Native text extraction — extracts text directly from PDF text layers, no OCR required for digitally generated documents
- Page rendering — renders PDF pages to images at configurable DPI with optional annotation rendering and dimension clamping
- Automatic page classification — detects whether a page has a native text layer or is scanned
- PDF validation — magic byte checking, size guards, existence checks
- Document metadata — title, author, page count, file size, PDF version
- Zero intermediate files — everything in memory, no temp file I/O
- Pdfium singleton — initialised once, borrowed across your application
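
The validation checks listed above (existence, size guard, magic bytes) can be sketched with the standard library alone. This is an illustration, not docuparse's actual implementation: `ValidationError`, `MAX_PDF_BYTES`, `has_pdf_magic`, and `validate_pdf` are hypothetical names, and the 512 MiB limit is an arbitrary assumption. The only fact taken from the PDF format itself is that a file begins with the marker `%PDF-`.

```rust
use std::fs;
use std::io::{ErrorKind, Read};
use std::path::Path;

/// Hypothetical size guard for this sketch; docuparse's real limit may differ.
const MAX_PDF_BYTES: u64 = 512 * 1024 * 1024;

/// Hypothetical error type for this sketch; it is not the crate's `PdfError`.
#[derive(Debug, PartialEq)]
enum ValidationError {
    NotFound,
    TooLarge(u64),
    BadMagic,
    Io(String),
}

/// Returns true when the buffer starts with the PDF header marker `%PDF-`.
fn has_pdf_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Runs the three checks in order and returns the file size in bytes on success.
fn validate_pdf(path: &Path) -> Result<u64, ValidationError> {
    // Existence check: a missing path is reported separately from other I/O errors.
    let meta = fs::metadata(path).map_err(|e| match e.kind() {
        ErrorKind::NotFound => ValidationError::NotFound,
        _ => ValidationError::Io(e.to_string()),
    })?;

    // Size guard: reject oversized files before reading any content.
    let size = meta.len();
    if size > MAX_PDF_BYTES {
        return Err(ValidationError::TooLarge(size));
    }

    // Magic byte check: read only the first five bytes of the file.
    let mut file = fs::File::open(path).map_err(|e| ValidationError::Io(e.to_string()))?;
    let mut header = [0u8; 5];
    // A file shorter than five bytes cannot carry the marker, so treat it as bad magic.
    file.read_exact(&mut header)
        .map_err(|_| ValidationError::BadMagic)?;

    if has_pdf_magic(&header) {
        Ok(size)
    } else {
        Err(ValidationError::BadMagic)
    }
}
```

Running the cheap checks first means a missing or oversized file is rejected without reading any of its content.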

Add the crate to Cargo.toml:

    [dependencies]
    docuparse = "0.0.1"

Docuparse links against Pdfium at runtime, so you must ship the Pdfium shared library alongside your binary. Pre-built binaries are available from the pdfium-render releases.
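
Because the shared library has to sit next to the binary, a pre-flight check at startup can make a missing library easier to diagnose. The sketch below uses only the standard library. The file names are the conventional Pdfium library names on each platform. The function names are hypothetical, and whether docuparse actually searches the executable's directory is an assumption to verify against the crate's documentation.

```rust
use std::env;
use std::path::PathBuf;

/// Conventional file name of the Pdfium shared library on the current platform.
fn pdfium_library_filename() -> &'static str {
    if cfg!(target_os = "windows") {
        "pdfium.dll"
    } else if cfg!(target_os = "macos") {
        "libpdfium.dylib"
    } else {
        "libpdfium.so"
    }
}

/// Path where the library would sit if it were shipped next to the running executable.
fn expected_pdfium_path() -> std::io::Result<PathBuf> {
    let exe = env::current_exe()?;
    // Fall back to the current directory if the executable path has no parent.
    let dir = exe
        .parent()
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("."));
    Ok(dir.join(pdfium_library_filename()))
}

/// Hypothetical pre-flight check: true only if a regular file exists at that path.
fn pdfium_is_alongside_binary() -> bool {
    expected_pdfium_path()
        .map(|path| path.is_file())
        .unwrap_or(false)
}
```

An application could call `pdfium_is_alongside_binary()` before initialising Pdfium and print the expected path when it returns false.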

To extract the text layer of every page:

    use docuparse::{init_pdfium, PdfDocument};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let pdfium = init_pdfium()?;
        let doc = PdfDocument::open(pdfium, "document.pdf")?;

        println!("pages: {}", doc.page_count());
        println!("file: {}", doc.metadata.file_size_display());

        for result in doc.extract_all_text_layers() {
            match result {
                Ok((page, Some(text))) => println!("page {}: {} chars", page + 1, text.len()),
                Ok((page, None)) => println!("page {}: scanned, no text layer", page + 1),
                Err(e) => eprintln!("page error: {e}"),
            }
        }
        Ok(())
    }

To render a single page to an image:

    use docuparse::{init_pdfium, PdfDocument, RenderConfig};

    // Inside a function that returns Result, as in the example above.
    let pdfium = init_pdfium()?;
    let doc = PdfDocument::open(pdfium, "document.pdf")?;

    let config = RenderConfig::builder()
        .dpi(150)
        .with_annotations(true)
        .build();

    // Page indices are zero-based.
    let image = doc.render_page(0, &config)?;
    image.save("page_1.png")?;

To render every page:

    // Continues from the previous snippet: `doc` and `config` are already in scope.
    let images = doc.render_all_pages(&config)?;
    for (i, image) in images.iter().enumerate() {
        image.save(format!("page_{:03}.png", i + 1))?;
    }

Benchmarked on a representative mixed PDF document (native text layers and scanned pages) on Apple Silicon (M4):
| Metric | Value |
|---|---|
| Text extraction (12 pages) | ~8 ms |
| Per page | ~680 µs |
| Memory RSS | ~25 MB |
| Child processes spawned | 0 |
| Temp file I/O | none |
Performance is dominated by the Pdfium C++ engine. The Rust wrapper contributes negligible overhead: dev and release builds measure the same, which indicates that the Rust layer is not a bottleneck.
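
To reproduce per-page figures like those in the table, a small timing harness around the extraction loop is enough. The sketch below is generic over a closure, so it does not depend on any docuparse API. `mean_per_page` and `time_per_page` are hypothetical helper names, and the benchmark behind the table may have been measured differently.

```rust
use std::time::{Duration, Instant};

/// Mean time per page; returns zero for an empty document instead of dividing by zero.
fn mean_per_page(total: Duration, pages: u32) -> Duration {
    if pages == 0 {
        Duration::ZERO
    } else {
        total / pages
    }
}

/// Calls `work` once per page index and returns (total elapsed, mean per page).
fn time_per_page<F: FnMut(usize)>(page_count: usize, mut work: F) -> (Duration, Duration) {
    let start = Instant::now();
    for page in 0..page_count {
        work(page);
    }
    let total = start.elapsed();
    (total, mean_per_page(total, page_count as u32))
}
```

As a sanity check on the table, 12 pages at about 680 µs each come to roughly 8.2 ms, which agrees with the reported total of about 8 ms. In a real measurement the closure would perform the per-page extraction for the given index.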
All public functions return `Result<T, PdfError>`. Errors are fully typed via thiserror and compose cleanly with anyhow or eyre in application code:

    use docuparse::{PdfDocument, PdfError};

    // `pdfium` is the handle returned by `init_pdfium()`.
    match PdfDocument::open(pdfium, "file.pdf") {
        Err(PdfError::Validation(_)) => {
            eprintln!("not a valid PDF");
        }
        Err(e) => eprintln!("error: {e}"),
        Ok(doc) => { /* ... */ }
    }

Planned features:

- OCR support via ONNX Runtime (feature flag)
- Image input (JPEG, PNG) alongside PDF
- Async API

License: MIT