add support for two columns #75

raulromanp · 2016-02-25T11:32:16Z

I have noticed that when I parse a 2-column PDF file, the resulting text is out of order. The output follows each visual line across the page, as if it were a 1-column document.

textract@1.2.1 is based on "pdftotext". When I run the command pdftotext filename.pdf directly, the generated txt file has the lines in the correct order. Why doesn't this library give the same result?

Thanks

dbashford · 2016-02-25T12:05:25Z

This module uses pdf-text-extract which then uses the extractor. If you use pdf-text-extract directly does it work? If so, toss me an example file and I'll see what I can do.

raulromanp · 2016-02-25T12:16:36Z

I have installed 'pdf-text-extract' and get the same result. I have attached the file I'm trying.

Thanks for the effort and the quick reply :)

file.pdf

dbashford · 2016-02-25T13:48:52Z

Verified, that is very odd, will need to dig into pdf-text-extract. Not sure why it would break something like this.

dbashford · 2016-02-25T13:56:59Z

The problem is this, in the options for pdf-text-extract:

layout: Should be either layout, raw or htmlmeta. Default: layout

dbashford · 2016-02-25T13:57:50Z

So layout preserves the layout, so what comes back looks like this:

That may be great for readability, but it's not great for programmatic extraction of the text.

raulromanp · 2016-02-25T13:58:19Z

See https://www.npmjs.com/package/pdftotextjs
Using that option on the PDF file, it returns well-ordered text. It uses pdftotext internally.

Another alternative: http://git.macropus.org/2011/11/pdftotext/example/ . It generates the text perfectly, including the reference numbers on the last page, but I'm not sure how it works. It needs more analysis.

I'll keep following your investigation.

raulromanp · 2016-02-25T13:59:45Z

Yeah.. I saw the layout option. Not good for my purposes.

dbashford · 2016-02-25T14:20:32Z

For documentation purposes (going to point to this ticket in the README and in the code)...

If you run pdftotext nameOfFile at the command line, it doesn't use a layout. For some reason pdf-text-extract provides a default layout of layout, which preserves columns. That is not in the best interest of a text extractor. The primary concern of this library (textract) is to provide the ability to programmatically process the text it is extracting, which sometimes takes precedence over maintaining readability. Keeping, for instance, two columns in the output does not lend itself to programmatically processing the text.

~~I'm going to override the pdf-text-extract default, which is layout so that no layout is used at all, which sets the textract defaults to be in line with the pdftotext defaults.~~

I'm going to override the pdf-text-extract default, which is layout, so that raw is used. No layout does not preserve line breaks. raw, despite being discouraged in the pdftotext docs, preserves line breaks while also unravelling columns.

This can be overridden to use layout, which preserves columns. It can also be overridden to use something that pdf-text-extract does not recognize, like no_thanks, which will use the default layout native to pdftotext.
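As a rough illustration of the behavior described above, here is a minimal sketch of how a layout option could map onto pdftotext command-line arguments. The function buildPdftotextArgs and the constant KNOWN_LAYOUTS are hypothetical names made up for this example; this is not the actual source of pdf-text-extract or textract. It only assumes that pdftotext accepts the -layout, -raw and -htmlmeta flags and that "-" as the output file means stdout.

```javascript
// Hypothetical helper for illustration only; NOT the actual
// pdf-text-extract or textract implementation.

// Layout values that, per this thread, pdf-text-extract recognizes.
const KNOWN_LAYOUTS = new Set(['layout', 'raw', 'htmlmeta']);

function buildPdftotextArgs(filePath, options = {}) {
  // Assumed default: raw, matching the override described in this ticket.
  const layout = options.layout === undefined ? 'raw' : options.layout;
  const args = [];

  // A recognized value becomes the matching pdftotext flag.
  // An unrecognized value (e.g. 'no_thanks') adds no flag at all,
  // so pdftotext falls back to its own native default.
  if (KNOWN_LAYOUTS.has(layout)) {
    args.push('-' + layout);
  }

  // '-' as the output file asks pdftotext to write to stdout.
  args.push(filePath, '-');
  return args;
}

console.log(buildPdftotextArgs('file.pdf'));
console.log(buildPdftotextArgs('file.pdf', { layout: 'layout' }));
console.log(buildPdftotextArgs('file.pdf', { layout: 'no_thanks' }));
```

In textract itself, the override is expected to be passed through the extraction config; the README names a pdftotextOptions object for this (for example { pdftotextOptions: { layout: 'layout' } }), but treat that key as an assumption here since this thread does not spell it out.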

raulromanp changed the title ~~add support to two columns~~ add support for two columns Feb 25, 2016

dbashford modified the milestone: 1.3.0 Feb 25, 2016

dbashford closed this as completed in 8d21240 Feb 26, 2016
