Extract text from PDFs with position data.

    // Simple extraction
    let output = pdf_strings::from_path("file.pdf")?;
    println!("{}", output); // Plain text

    // With password
    let output = pdf_strings::PdfExtractor::builder()
        .password("secret")
        .build()
        .from_path("encrypted.pdf")?;

    // Preserve spatial layout
    println!("{}", output.to_string_pretty());

    // Access structured data with bounding boxes
    for line in output.lines() {
        for span in line {
            println!("{} at {:?}", span.text, span.bbox);
        }
    }

- Plain text extraction
- Spatial layout preservation
- Bounding box coordinates for every text span
- Font encoding resolution (ToUnicode, Type1, TrueType, CID, Type3)
- Password-protected PDF support
- Handles complex fonts, rotated text, and multi-column layouts
Three output formats:
- to_string() - Plain text
- to_string_pretty() - Character grid rendering that preserves spatial layout
- lines() - Structured data with TextSpan objects containing text, bounding boxes, and font sizes
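
To give a feel for what a character grid rendering involves, here is a minimal, self-contained sketch. It is not the crate's implementation: the `Span` struct, the `render_grid` function, and the fixed `char_width` / `line_height` cell sizes are all hypothetical and exist only for illustration. The idea is to snap each span's position to a row and column in a grid of characters, so that horizontal gaps in the PDF become runs of spaces in the output.

```rust
/// Hypothetical stand-in for a positioned text span (NOT the crate's TextSpan).
#[derive(Debug, Clone)]
struct Span {
    text: String,
    x: f32, // left edge, in PDF points
    y: f32, // distance from the top of the page, in PDF points (assumed)
}

/// Place each span on a character grid by snapping its position to a cell.
/// `char_width` and `line_height` are assumed, fixed cell sizes in points.
fn render_grid(spans: &[Span], char_width: f32, line_height: f32) -> String {
    let mut rows: Vec<Vec<char>> = Vec::new();

    for span in spans {
        // Snap the span's position to the nearest grid cell.
        let row = (span.y / line_height).round().max(0.0) as usize;
        let col = (span.x / char_width).round().max(0.0) as usize;

        // Grow the grid downwards if this row does not exist yet.
        if rows.len() <= row {
            rows.resize(row + 1, Vec::new());
        }

        let line = &mut rows[row];
        for (i, ch) in span.text.chars().enumerate() {
            let idx = col + i;
            // Pad with spaces up to the target column.
            if line.len() <= idx {
                line.resize(idx + 1, ' ');
            }
            // Later spans overwrite earlier ones if they collide.
            line[idx] = ch;
        }
    }

    rows.iter()
        .map(|r| r.iter().collect::<String>().trim_end().to_string())
        .collect::<Vec<_>>()
        .join("\n")
}
```

With 6-point-wide cells and 12-point-high lines, a span at x = 60 lands in column 10, so two columns of a table stay visually aligned in the text output. A real renderer would also need to handle varying font sizes, overlapping spans, and rotated text, which this sketch ignores.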
This is a fork of pdf-extract. Thanks for laying the groundwork. PDFs are ... something else.