flppgg/pdf2text2pdf

pdf2text2pdf

Scripts to extract the text layer from PDFs and rebuild lighter PDFs

 

These scripts were developed to:

  1. extract Unicode text from a PDF file (or DjVu file), with the position and size of every word on every page, and
  2. rebuild a lighter version of the PDF file that includes ONLY the text layer.
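
As a rough illustration of what step 1 produces, the sketch below parses one word record per line. The tab-separated layout (page, x, y, width, height, word) and the names `Word` and `parse_words` are assumptions made for this example; the actual format written by the pdf2text scripts may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Word:
    """One word of the text layer (hypothetical record layout)."""
    page: int      # 1-based page number
    x: float       # left edge of the word box
    y: float       # top edge of the word box
    width: float
    height: float
    text: str      # the Unicode text of the word


def parse_words(lines: Iterable[str]) -> List[Word]:
    """Parse assumed tab-separated records: page, x, y, width, height, word."""
    words = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        page, x, y, w, h, text = line.split("\t", 5)
        words.append(Word(int(page), float(x), float(y), float(w), float(h), text))
    return words


sample = [
    "1\t72.0\t90.5\t38.2\t11.0\tHello\n",
    "1\t114.0\t90.5\t41.7\t11.0\twörld\n",
]
words = parse_words(sample)
```

Keeping the word text as the last field means a word can itself contain any character except a tab or newline without breaking the parse.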

 

This can substantially reduce the size of the PDF file, and it makes it possible to implement full-text search.
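
One way full-text search could be built on top of the extracted words is a simple inverted index, sketched below. This is not part of the repository; the record layout and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical word record: (page, x, y, width, height, text)
WordRecord = Tuple[int, float, float, float, float, str]
# A search hit keeps the page and the box of the matching word
Hit = Tuple[int, float, float, float, float]


def build_index(records: List[WordRecord]) -> Dict[str, List[Hit]]:
    """Map each case-folded word to the pages and boxes where it occurs."""
    index = defaultdict(list)
    for page, x, y, w, h, text in records:
        index[text.casefold()].append((page, x, y, w, h))
    return dict(index)


def search(index: Dict[str, List[Hit]], term: str) -> List[Hit]:
    """Return every occurrence of a single word, ignoring case."""
    return index.get(term.casefold(), [])


records = [
    (1, 72.0, 90.5, 38.2, 11.0, "Text"),
    (1, 114.0, 90.5, 30.1, 11.0, "layer"),
    (2, 72.0, 120.0, 38.2, 11.0, "text"),
]
index = build_index(records)
hits = search(index, "TEXT")
```

Because each hit carries the word's box, a viewer could highlight the match on the page as well as report the page number.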

 

The first folder, pdf2text, contains several Perl and shell scripts that extract the text layer from a PDF or DjVu file and write a text file with the position and size of every word in the document.

The second folder, text2pdf, contains a Python script that builds a PDF file from such a text file.
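
To give an idea of what rebuilding a text-only PDF involves, here is a minimal sketch that uses only the Python standard library. It is not the text2pdf script. It assumes each word comes as `(x, top, height, text)` with the origin at the top-left corner of the page, uses the word height as an approximate font size, and relies on the built-in Helvetica font, so characters outside Windows-1252 are replaced with `?`. Full Unicode support would need embedded fonts, which this sketch does not attempt. The name `build_pdf` is hypothetical.

```python
from typing import List, Tuple

# Hypothetical word record: (x, top, height, text), origin at the top-left corner
PageWords = List[Tuple[float, float, float, str]]


def _escape(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_pdf(pages: List[PageWords], width: float = 595.0, height: float = 842.0) -> bytes:
    """Build a text-only PDF with one page per entry in `pages`."""
    n = len(pages)
    # Object 1: catalog, 2: page tree, 3: font, then one (page, content) pair per page
    kids = " ".join(f"{4 + 2 * i} 0 R" for i in range(n))
    objects: List[bytes] = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        f"<< /Type /Pages /Kids [{kids}] /Count {n} >>".encode("ascii"),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica"
        b" /Encoding /WinAnsiEncoding >>",
    ]
    for i, words in enumerate(pages):
        ops = []
        for x, top, h, text in words:
            # PDF coordinates start at the bottom-left, so flip the y axis;
            # the bottom of the word box is used as an approximate baseline
            baseline = height - top - h
            ops.append(f"BT /F1 {h:.2f} Tf {x:.2f} {baseline:.2f} Td ({_escape(text)}) Tj ET")
        stream = "\n".join(ops).encode("cp1252", errors="replace")
        objects.append(
            f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
            f"/Resources << /Font << /F1 3 0 R >> >> /Contents {5 + 2 * i} 0 R >>".encode("ascii")
        )
        objects.append(b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream")

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_pos,
    )
    return bytes(out)


pdf = build_pdf([[(72.0, 90.0, 11.0, "Hello (world)")]])
```

The returned bytes can be written to disk with `open("out.pdf", "wb")`. The result contains no images, only positioned text, which is why a PDF rebuilt this way can be much smaller than a scanned original.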

 

The Python script is still a beta version that needs testing and fixing. In theory it should support every language that can be encoded in UTF-8, although we are still far from that.

If you would like to contribute, please send your comments. Thanks!
