pdftotext: option to preserve layout #93

merged 1 commit into from Oct 9, 2015



ankushshah89 commented Oct 8, 2015

The default PDF-to-text parser (pdftotext) doesn't preserve the layout. I added an extra argument (-layout) to do so. How to use:

From Python:
import textract
textract.process("file.pdf")  # without layout
textract.process("file.pdf", layout=True)  # with layout

From the CLI utility:
textract "file.pdf"  # without layout
textract -O layout=True "file.pdf"  # with layout

Note: the actual value of the layout variable does not matter. It can be anything.
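
To illustrate the note above, here is a minimal sketch of how an option like this could be turned into a pdftotext command line. The helper name `build_pdftotext_args` is hypothetical and is not taken from the textract source. The sketch assumes that only the presence of the `layout` keyword is checked, which would explain why its value does not matter.

```python
def build_pdftotext_args(filename, **kwargs):
    """Hypothetical helper: build the argument list for a pdftotext call."""
    args = ["pdftotext"]
    # Assumption: only the presence of the key is tested, never its value,
    # so layout=True, layout=False and layout="x" all add the flag.
    if "layout" in kwargs:
        args.append("-layout")
    # A text-file argument of "-" makes pdftotext write to stdout.
    args += [filename, "-"]
    return args


print(build_pdftotext_args("file.pdf"))
print(build_pdftotext_args("file.pdf", layout=True))
```

Under that assumption, passing `layout=False` would still enable `-layout`, so the safest way to turn the option off is to leave the keyword out entirely.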

@deanmalmgren deanmalmgren merged commit 95cdd5e into deanmalmgren:master Oct 9, 2015

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed



deanmalmgren commented Oct 9, 2015

Thanks for the contribution, @ankushshah89! This looks great. I added some tests and documentation around the new option and this will be in the next release.

For anyone else that is wondering, I looked into pdfminer and it doesn't appear to have a layout preservation option when the output is text-based. I've used pdfminer to extract html to figure out the placement of different text chunks, but from my previous work it isn't trivial to have pdfminer correctly order the text blocks from the html output. It places the <p> tags using absolute positioning, and it is difficult to correctly infer the order of the text, which is a really important priority for textract.
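
As a rough illustration of why ordering absolutely positioned blocks is hard, the sketch below sorts a few made-up blocks by their coordinates. The `(top, left, text)` tuples and the function name `naive_reading_order` are hypothetical and do not reflect pdfminer's actual output format; they only stand in for positioned text chunks on a two-column page.

```python
# Hypothetical positioned blocks: (top, left, text) on a two-column page.
blocks = [
    (100, 50, "Left column, line 1"),
    (100, 300, "Right column, line 1"),
    (120, 50, "Left column, line 2"),
    (120, 300, "Right column, line 2"),
]


def naive_reading_order(blocks):
    """Order blocks top-to-bottom, then left-to-right."""
    ordered = sorted(blocks, key=lambda block: (block[0], block[1]))
    return [text for _top, _left, text in ordered]


for line in naive_reading_order(blocks):
    print(line)
```

Sorting by position alone interleaves the two columns instead of reading the left column to the end before starting the right one, so recovering the intended order needs some extra notion of columns or regions.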

@ankushshah89 ankushshah89 deleted the ankushshah89:pdftotext-layout branch Oct 9, 2015
