pdf2alto

pdf2alto is a tool for extracting word-level bounding boxes from PDFs and presenting them in ALTO.

The ALTO is a little crazy: it does not provide bounding boxes for the Page, PrintSpace, TextBlock, or TextLine, and in fact provides only one of each per page, no matter how the individual Strings on the Page are arranged. For my use case, search hit highlighting of individual words or groups of words, this is Good Enough.
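To make the shape of that output concrete, here is a rough sketch of a per-page structure with a single PrintSpace, TextBlock, and TextLine wrapping every String. It is not taken from the pdf2alto source: the class `AltoSkeletonSketch`, its `Word` holder, and the exact indentation are hypothetical, and attributes a complete ALTO file would carry on the container elements are left out for brevity. The `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes on `String` are standard ALTO.

```java
import java.util.Arrays;
import java.util.List;

class AltoSkeletonSketch {
    // Hypothetical holder for one word and its box, already in ALTO units.
    static final class Word {
        final String content;
        final long hpos, vpos, width, height;

        Word(String content, long hpos, long vpos, long width, long height) {
            this.content = content;
            this.hpos = hpos;
            this.vpos = vpos;
            this.width = width;
            this.height = height;
        }
    }

    // Minimal escaping for text placed inside a double-quoted XML attribute.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // One Page, one PrintSpace, one TextBlock, one TextLine, many Strings.
    // Only the Strings carry position attributes.
    static String pageXml(List<Word> words) {
        StringBuilder sb = new StringBuilder();
        sb.append("<Page>\n  <PrintSpace>\n    <TextBlock>\n      <TextLine>\n");
        for (Word w : words) {
            sb.append(String.format(
                "        <String CONTENT=\"%s\" HPOS=\"%d\" VPOS=\"%d\" WIDTH=\"%d\" HEIGHT=\"%d\"/>\n",
                escape(w.content), w.hpos, w.vpos, w.width, w.height));
        }
        sb.append("      </TextLine>\n    </TextBlock>\n  </PrintSpace>\n</Page>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates, purely for illustration.
        List<Word> words = Arrays.asList(
            new Word("Hello", 1200, 1200, 900, 240),
            new Word("world", 2200, 1200, 950, 240));
        System.out.print(pageXml(words));
    }
}
```

A search front end only needs to match on `CONTENT` and draw the rectangle given by the other four attributes, which is why the missing container boxes do not matter for hit highlighting.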

The word segmentation code is geared toward separating ordinary English words, not recognizing email addresses, domain names, or telephone numbers. For example, it will split a domain name with internal periods into a sequence of words. Words broken across lines will yield two bounding boxes, one for each half of the word, but both Strings will have the full word as their CONTENT.
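The domain-name behaviour is easy to picture with a simplified splitter. The sketch below is an assumption about the general approach, not the code in PrintWordLocations: it treats any run of characters that are not letters or digits as a word boundary. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class WordSplitSketch {
    // Split on runs of characters that are neither letters nor digits.
    // This is a stand-in for the real segmentation, which may differ.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [Visit, www, example, com, today]
        System.out.println(splitWords("Visit www.example.com today"));
    }
}
```

Under this kind of rule, each piece of the domain name would get its own String and its own bounding box.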

pdf2alto assumes that the PDF measures distances in points. It produces an ALTO file with measurements in 1200ths of an inch.
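Since a point is 1/72 of an inch, the conversion amounts to multiplying by 1200/72. The sketch below shows the arithmetic only. The class name is hypothetical, and rounding to the nearest whole unit is an assumption, since the source does not say how pdf2alto handles fractional values.

```java
class PointsToAltoSketch {
    static final double POINTS_PER_INCH = 72.0;
    static final double ALTO_UNITS_PER_INCH = 1200.0;

    // Convert a distance in PDF points to 1200ths of an inch.
    // Rounding to the nearest unit is an assumption made for this sketch.
    static long toAltoUnits(double points) {
        return Math.round(points * ALTO_UNITS_PER_INCH / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points (8.5 x 11 inches).
        System.out.println(toAltoUnits(612.0)); // 10200
        System.out.println(toAltoUnits(792.0)); // 13200
    }
}
```

A PDF whose user unit is not the default point would come out scaled incorrectly under this assumption.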

This package provides the class PrintWordLocations, which is a minor modification of Ben Litchfield's example class PrintTextLocations, included in the source distribution of Apache PDFBox. You will need PDFBox to compile and use this class. A sample Bash script to drive the class is also provided.

-- Michael Slone
