add support for two columns #75

Closed
raulromanp opened this issue Feb 25, 2016 · 8 comments

@raulromanp

I have noticed that when I parse a 2-column PDF file, the resulting text is out of order. The output follows each visual line across the full page width, as if it were a 1-column document.

textract@1.2.1 is based on "pdftotext". When I run the command pdftotext filename.pdf directly, the generated txt file has the lines in the correct order. Why doesn't this library give the same result?

Thanks

@raulromanp raulromanp changed the title add support to two columns add support for two columns Feb 25, 2016
@dbashford
Owner

This module uses pdf-text-extract which then uses the extractor. If you use pdf-text-extract directly does it work? If so, toss me an example file and I'll see what I can do.

@raulromanp
Author

I have installed 'pdf-text-extract' and get the same result. I have attached the file I'm testing with.

Thanks for the effort and the quick reply :)

file.pdf

@dbashford
Owner

Verified, that is very odd, will need to dig into pdf-text-extract. Not sure why it would break something like this.

@dbashford
Owner

The problem is this, in the options for pdf-text-extract:

layout: Should be either layout, raw or htmlmeta. Default: layout
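
For reference, a minimal sketch of how a layout value like this could be turned into pdftotext command-line arguments. buildPdftotextArgs is a hypothetical helper written for illustration, not the pdf-text-extract source. The -layout, -raw and -htmlmeta flags are real pdftotext flags, and a trailing "-" tells pdftotext to write the text to stdout.

```javascript
// Hypothetical helper: maps a layout option to pdftotext arguments.
// Illustration only; this is not how pdf-text-extract is necessarily written.
const KNOWN_LAYOUTS = ['layout', 'raw', 'htmlmeta'];

function buildPdftotextArgs(pdfPath, layout) {
  const args = [];
  // Only pass a flag for values pdftotext understands. Anything else
  // adds no flag, so pdftotext falls back to its own default mode.
  if (KNOWN_LAYOUTS.includes(layout)) {
    args.push('-' + layout);
  }
  // Input file first, then "-" so the extracted text goes to stdout.
  args.push(pdfPath, '-');
  return args;
}

console.log(buildPdftotextArgs('file.pdf', 'layout'));
console.log(buildPdftotextArgs('file.pdf', 'raw'));
console.log(buildPdftotextArgs('file.pdf', 'something_else'));
```

The resulting array could be handed to child_process.execFile('pdftotext', args, callback) on a machine where pdftotext is installed.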

@dbashford
Owner

So layout preserves the layout, so what comes back looks like this:

image

That may be great for readability, but it's not great for programmatic extraction of the text.
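
To make the difference concrete, here is a toy model of a two-column page. It is not real pdftotext output; the column contents and the layoutOrder and readingOrder helpers are made up for illustration. It shows how layout-preserving output puts a left-column line and a right-column line on the same output row, while reading-order output emits the whole left column before the right one.

```javascript
// Toy model of a two-column page: each column is an array of lines.
const left = ['Intro line 1', 'Intro line 2', 'Intro line 3'];
const right = ['Methods line 1', 'Methods line 2', 'Methods line 3'];

// Layout-preserving output: each row holds the left line, padded to a
// fixed width, followed by the right line at the same height on the page.
function layoutOrder(leftCol, rightCol, width) {
  const rows = Math.max(leftCol.length, rightCol.length);
  const out = [];
  for (let i = 0; i < rows; i++) {
    const l = (leftCol[i] || '').padEnd(width);
    out.push((l + (rightCol[i] || '')).trimEnd());
  }
  return out;
}

// Reading-order output: the whole left column, then the whole right column.
function readingOrder(leftCol, rightCol) {
  return leftCol.concat(rightCol);
}

console.log(layoutOrder(left, right, 20).join('\n'));
console.log('---');
console.log(readingOrder(left, right).join('\n'));
```

Splitting the first result on newlines gives rows that mix two unrelated sentences, which is the problem reported in this issue.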

@raulromanp
Author

See https://www.npmjs.com/package/pdftotextjs
Using that package on the PDF file, I get correctly ordered text. It uses pdftotext internally.

Another alternative: http://git.macropus.org/2011/11/pdftotext/example/ . It generates the text perfectly, including the reference numbers on the last page, but I'm not sure how it works. It needs more analysis.

I'll keep following your investigation

@raulromanp
Author

Yeah, I saw the layout option. It's not good for my use case.

@dbashford
Owner

For documentation purposes (going to point to this ticket in the README and in the code)...

If you run pdftotext nameOfFile at the command line, it doesn't use a layout. For some reason pdf-text-extract provides a default layout of layout, which preserves columns. That is not in the best interest of a text extractor. The primary concern of this library (textract) is to provide the ability to programmatically process the text it is extracting, which sometimes takes precedence over maintaining readability. Keeping, for instance, two columns in the output does not lend itself to programmatically processing the text.

I'm going to override the pdf-text-extract default, which is layout, so that no layout is used at all, which sets the textract defaults to be in line with the pdftotext defaults.

I'm going to override the pdf-text-extract default, which is layout, so that raw is used. No layout does not preserve line breaks. raw, despite being discouraged in the pdftotext docs, preserves line breaks while also unravelling columns.

This can be overridden to use layout, which preserves columns. It can also be overridden to use something that pdf-text-extract does not recognize, like no_thanks, which will use the default layout native to pdftotext.
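
A minimal sketch of what such an override might look like from calling code. It assumes the option object is named pdftotextOptions and is passed through to pdf-text-extract, as the textract README describes; check this against the version you have installed. pdfConfig and extractPdf are hypothetical helper names, not part of textract.

```javascript
// Builds a textract config that selects the pdftotext layout mode.
// Assumption: the option object is named `pdftotextOptions`, as in the
// textract README; verify against your installed version.
function pdfConfig(layout) {
  return { pdftotextOptions: { layout: layout } };
}

// Extracts text from a PDF using the chosen layout mode.
// textract is required lazily so pdfConfig can be used without it.
function extractPdf(filePath, layout, callback) {
  const textract = require('textract');
  textract.fromFileWithPath(filePath, pdfConfig(layout), callback);
}

// Usage (needs textract and pdftotext installed):
// extractPdf('file.pdf', 'layout', function (error, text) {
//   if (error) { return console.error(error); }
//   console.log(text);
// });
```

Passing 'layout' here would bring back the column-preserving output, and passing an unrecognized value such as 'no_thanks' would fall through to pdftotext's own default, as described above.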

@dbashford dbashford modified the milestone: 1.3.0 Feb 25, 2016