add support for two columns #75
Comments
This module uses pdf-text-extract, which in turn uses the extractor. If you use pdf-text-extract directly, does it work? If so, toss me an example file and I'll see what I can do.
I have installed 'pdf-text-extract' and get the same result. I have attached the file I'm trying. Thanks for the effort and the quick reply :)
Verified. That is very odd; I will need to dig into pdf-text-extract. Not sure why it would break something like this.
The problem is this, in the options for pdf-text-extract:
See https://www.npmjs.com/package/pdftotextjs. Another alternative is http://git.macropus.org/2011/11/pdftotext/example/. It generates the text perfectly, including the reference numbers on the last page, but I'm not sure how it works. It needs more analysis. I'll keep following your investigation.
Yeah, I saw the layout option. It doesn't work for my use case.
For documentation purposes (going to point to this ticket in the README and in the code)... If you run
I'm going to override the pdf-text-extract default, which is This can be overridden to use
I have noticed that when I parse a 2-column PDF file, the resulting text is out of order: it follows each visual line across the page, as if it were a 1-column document.
textract@1.2.1 is based on "pdftotext". When I run the command directly,
pdftotext filename.pdf
the generated txt file has the lines in the correct order. Why doesn't this library give the same result? Thanks.
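For anyone reproducing this outside the library, here is a minimal Node sketch. It assumes the `pdftotext` binary is on your PATH and that the difference in ordering comes from the `-layout` flag: with `-layout`, pdftotext keeps the physical layout of the page, and without it the text is emitted in reading order. The helper names `buildPdftotextArgs` and `extractText` are hypothetical and are not part of textract or pdf-text-extract.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: build the argument list for the pdftotext CLI.
// layout: true adds -layout (maintain the physical page layout);
// without it, pdftotext outputs the text in reading order.
function buildPdftotextArgs(pdfPath, { layout = false } = {}) {
  const args = [];
  if (layout) args.push('-layout');
  // '-' as the output file tells pdftotext to write to stdout.
  args.push(pdfPath, '-');
  return args;
}

// Hypothetical helper: run pdftotext and pass the text to a Node-style callback.
function extractText(pdfPath, options, callback) {
  execFile('pdftotext', buildPdftotextArgs(pdfPath, options), (err, stdout) => {
    if (err) return callback(err);
    callback(null, stdout);
  });
}
```

Calling `extractText('filename.pdf', { layout: false }, cb)` should correspond to the plain `pdftotext filename.pdf` command above, while `{ layout: true }` lets you compare against the layout-preserving output on the same two-column file.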