pdftotext: option to preserve layout #93
The default PDF-to-text parser (pdftotext) does not preserve the layout. I added an extra argument (-layout) to do so. How to use:
from CLI utility:
Note: the actual value of the layout variable does not matter. It can be anything.
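For readers who want to see the idea in isolation, here is a minimal sketch of how an optional layout keyword could be mapped onto pdftotext's -layout flag. It is not the code from this pull request: the function names build_pdftotext_command and extract_text are hypothetical, and the sketch assumes the pdftotext binary is installed and on the PATH. It mirrors the note above in that only the presence of the layout keyword is checked, not its value.

```python
import subprocess


def build_pdftotext_command(pdf_path, **kwargs):
    """Build the argument list for pdftotext (hypothetical helper).

    The -layout flag is added whenever a `layout` keyword is passed,
    regardless of its value.
    """
    command = ["pdftotext"]
    if "layout" in kwargs:
        command.append("-layout")
    # "-" as the output file tells pdftotext to write the text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, **kwargs):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **kwargs),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf", layout=True) would run pdftotext with -layout, while extract_text("report.pdf") would run it with the default reading-order output. The file name report.pdf is only a placeholder.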
Thanks for the contribution, @ankushshah89! This looks great. I added some tests and documentation around the new option and this will be in the next release.
For anyone else who is wondering, I looked into pdfminer and it doesn't appear to have a layout preservation option when the output is text-based. I've used pdfminer to extract HTML to figure out the placement of different text chunks, but from my previous work it isn't trivial to have pdfminer correctly order the text blocks from the HTML output (it places the