
bbqsrc/pdf-strings


pdf-strings


Extract text from PDFs with position data.

Usage

// Simple extraction
let output = pdf_strings::from_path("file.pdf")?;
println!("{}", output);  // Plain text

// With password
let output = pdf_strings::PdfExtractor::builder()
    .password("secret")
    .build()
    .from_path("encrypted.pdf")?;

// Preserve spatial layout
println!("{}", output.to_string_pretty());

// Access structured data with bounding boxes
for line in output.lines() {
    for span in line {
        println!("{} at {:?}", span.text, span.bbox);
    }
}
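
Once the spans are available, a common next step is to pull out only the text that falls inside a region of the page, such as a header band. The sketch below shows one way that could look. It does not use the crate's own types: `BBox`, `Span`, and `text_in_region` are hypothetical stand-ins defined here for illustration, and the field names and the top-left coordinate origin are assumptions, not the crate's documented layout.

```rust
/// Hypothetical axis-aligned box in page points.
/// Assumes y grows downward, so `top < bottom`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub left: f32,
    pub top: f32,
    pub right: f32,
    pub bottom: f32,
}

impl BBox {
    /// True when the two boxes share any interior area.
    pub fn intersects(&self, other: &BBox) -> bool {
        self.left < other.right
            && other.left < self.right
            && self.top < other.bottom
            && other.top < self.bottom
    }
}

/// Hypothetical stand-in for a positioned text span.
#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    pub bbox: BBox,
}

/// Collect the text of every span whose box overlaps `region`,
/// keeping the original line-by-line reading order.
pub fn text_in_region(lines: &[Vec<Span>], region: BBox) -> Vec<String> {
    lines
        .iter()
        .flatten()
        .filter(|span| span.bbox.intersects(&region))
        .map(|span| span.text.clone())
        .collect()
}
```

With real output, the same filter would be applied to whatever box type `span.bbox` actually is; only the overlap test needs adapting.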

Features

  • Plain text extraction
  • Spatial layout preservation
  • Bounding box coordinates for every text span
  • Font encoding resolution (ToUnicode, Type1, TrueType, CID, Type3)
  • Password-protected PDF support
  • Handles complex fonts, rotated text, and multi-column layouts

API

Three output formats:

  • to_string() - Plain text
  • to_string_pretty() - Character grid rendering that preserves spatial layout
  • lines() - Structured data with TextSpan objects containing text, bounding boxes, and font sizes
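
To give a feel for what "character grid rendering" means, here is a minimal sketch of the general idea: each span's position is divided by an assumed character cell size to get a row and column, and the text is written into that cell. This is not the crate's implementation. `Span`, `render_grid`, `cell_w`, and `cell_h` are hypothetical names introduced for illustration, and the top-left origin is an assumption.

```rust
/// Hypothetical stand-in for a positioned text span (not the crate's `TextSpan`).
#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    /// Left edge in points.
    pub x: f32,
    /// Distance from the top of the page in points (assumed origin: top-left).
    pub y: f32,
}

/// Place each span on a grid of fixed-size character cells.
/// `cell_w` and `cell_h` are the assumed width and height of one cell in points.
/// Later spans overwrite earlier ones if they land on the same cells.
pub fn render_grid(spans: &[Span], cell_w: f32, cell_h: f32) -> String {
    let mut rows: Vec<Vec<char>> = Vec::new();
    for span in spans {
        // Map page coordinates to a grid row and starting column.
        let row = (span.y / cell_h).max(0.0).floor() as usize;
        let col = (span.x / cell_w).max(0.0).floor() as usize;
        if rows.len() <= row {
            rows.resize(row + 1, Vec::new());
        }
        let line = &mut rows[row];
        for (i, ch) in span.text.chars().enumerate() {
            let idx = col + i;
            // Pad with spaces up to the target column.
            if line.len() <= idx {
                line.resize(idx + 1, ' ');
            }
            line[idx] = ch;
        }
    }
    rows.iter()
        .map(|r| r.iter().collect::<String>().trim_end().to_string())
        .collect::<Vec<_>>()
        .join("\n")
}
```

A fixed cell size keeps the sketch short; a real renderer would likely derive the cell size from font metrics and handle overlapping or rotated spans, which this one does not.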

Acknowledgements

This is a fork of pdf-extract. Thanks to its authors for laying the groundwork; PDFs are ... something else.
