Duplicated Text #115

Closed
zagraves opened this Issue Apr 6, 2013 · 8 comments

@zagraves

I've attached links to a sample PDF and the output HTML, where I am seeing text that has been duplicated and is usually overlapping.

Note the words, "needle", "floating", "so", "e will", etc. I also notice when using the browser zoom that these duplicate text elements shift around, especially the word "floating".

From what I can tell in Acrobat, there isn't any hidden text in the PDF that would appear in the HTML like this.

HTML: http://zachgraves.com/output.html
PDF: http://db.tt/JjQhFQXR

@coolwanglu
Owner

Yes confirmed. Thanks for reporting.

The PDF does indeed contain separate text fragments such as "dle" and "so", but they are supposed to overlap the preceding letters perfectly.
The offset in the HTML seems too large to me; I'll try to figure it out.

@coolwanglu
Owner

I'm using Evince (a poppler-based PDF viewer). If I copy & paste the two lines of text which start with "IMPORTANT", I actually get

IMPORTANT: Always rub the needle
dle
in the same direction.

which shows that the extra text really is in the PDF.

Can you please verify this in Acrobat? If you don't see the extra "dle", then Acrobat must be detecting and merging the duplicates.

Unfortunately, so far the only thing I can suggest is to remove the extra div elements manually.
Or you can use the --fallback 1 parameter.

This is actually an example of why the HTML has to be pixel-accurate: there are lots of PDFs where overlapping text is used to simulate bold fonts.
Please allow a few days before I can get back to this issue.

Thanks for reporting!

@zagraves

Interesting. I just checked pdftotext and see similar results.

STEP 2: Repeat step 1 about
forty times! Rub, rub, rub!
IMPORTANT: Always rub the needle
dle
in the same direction.

STEP 3: Place your leaf (or
or floating
thing) on top of the water so
o
that it floats in the middle.

STEP 4: Carefully put your needle
on top of the leaf. The needle
e will
slowly turn and point NORTH.
NOW YOU’VE MADE
A COMPASS!

HOW IT WORKS: Earth
is like a giant magnet. The
needle of your compass is
attracted to Earth’s
NORTH POLE.

74
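For anyone who wants to flag these stray fragments automatically, here is a minimal Python sketch. It is only a heuristic and not anything pdftotext or pdf2htmlEX does: it assumes a line is suspect when the line directly above it ends with that line's first word. The function name `find_stray_fragments` is made up for illustration, and the heuristic will give false positives on ordinary text (for example a line starting with "a" below a line ending in "a").

```python
def find_stray_fragments(lines):
    """Return (index, text) pairs that look like duplicated fragments.

    Heuristic (an assumption, not pdftotext behaviour): a line is suspect
    when the line directly above it, in the same paragraph, ends with
    this line's first word.
    """
    suspects = []
    previous = ""
    for index, line in enumerate(lines):
        text = line.strip()
        if not text:
            previous = ""  # a blank line ends the paragraph
            continue
        first_word = text.split()[0]
        if previous.endswith(first_word):
            suspects.append((index, text))
        previous = text
    return suspects


# Lines copied from the pdftotext output above.
sample = [
    "IMPORTANT: Always rub the needle",
    "dle",
    "in the same direction.",
    "",
    "STEP 3: Place your leaf (or",
    "or floating",
    "thing) on top of the water so",
    "o",
    "that it floats in the middle.",
]
print(find_stray_fragments(sample))
```

On the sample lines this flags "dle", "or floating" and "o", which are the fragments that look duplicated in the HTML.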

Looking in Acrobat, the PDF is bloody complicated, very hard to tell what's going on, but I can't see any obvious duplicated text.

Regarding fallback, is there any merit in having a per-page fallback option? Otherwise we can remove problem divs manually... but an automated option is ideal even if it's simply a flat image for problem pages.
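To make the per-page idea concrete, here is a rough Python sketch of what I have in mind as a workaround: run pdf2htmlEX once per page using the page-range options (-f / -l), and add --fallback 1 only for the pages known to be a problem. The helper `build_commands` and the `page-N.html` output names are hypothetical, and this only builds the command lines; whether the per-page output can be stitched back together cleanly is something I have not tested.

```python
def build_commands(pdf_path, page_count, problem_pages):
    """Build one pdf2htmlEX command line per page.

    Pages listed in problem_pages get "--fallback 1"; the rest use the
    default mode. Output file names are hypothetical.
    """
    commands = []
    for page in range(1, page_count + 1):
        # -f / -l restrict the conversion to a single page.
        command = ["pdf2htmlEX", "-f", str(page), "-l", str(page)]
        if page in problem_pages:
            command += ["--fallback", "1"]
        command += [pdf_path, f"page-{page}.html"]
        commands.append(command)
    return commands


# Example: a three-page PDF where only page 2 shows duplicated text.
for command in build_commands("input.pdf", 3, {2}):
    print(" ".join(command))
```

Each list could be passed to `subprocess.run` as-is.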

@coolwanglu
Owner

So far there is no such flexible option available.

@coolwanglu
Owner

@zachgraves I don't have Acrobat at hand. Can you try removing the text "Always rub the needle" and see if there are still some letters left?

I just found that using a larger value of --font-size-multiplier helps, which suggests that it's the browser that is rounding the font size. Might be worth a try.
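A rough illustration of why a larger multiplier could help, as a Python sketch. It assumes, as a simplification, that the browser rounds the CSS font-size to whole pixels and that the multiplied size is scaled back down by a transform of 1/multiplier; real browsers may round differently. The function names and the 9.4 px example size are made up for illustration.

```python
def effective_font_size(size_px, multiplier):
    """Font size after rounding and scaling back by 1/multiplier.

    Assumes, as a simplification, that the browser rounds font-size to
    whole pixels before the transform is applied.
    """
    rounded = round(size_px * multiplier)
    return rounded / multiplier


def relative_error(size_px, multiplier):
    """Relative difference between the intended and the effective size."""
    return abs(effective_font_size(size_px, multiplier) - size_px) / size_px


for multiplier in (1, 4, 16):
    print(multiplier, effective_font_size(9.4, multiplier),
          f"{relative_error(9.4, multiplier):.2%}")
```

Under these assumptions a 9.4 px font ends up as 9 px with a multiplier of 1 but as 9.5 px with a multiplier of 4, so the rounding error shrinks as the multiplier grows, and overlapping fragments drift apart less.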

@zagraves

Sorry, I haven't had a chance to test in Acrobat yet. I will look at it tonight.

Regarding --font-size-multiplier, is there any expectation that a value of 1 can result in the extracted text containing unexpected spaces?

I have a case where a value of 1 may result in L orem ip sum dol or sit amet... and the default 4 results in the expected Lorem ipsum dolor sit amet... in the HTML fragment that is output by pdf2htmlEX. In my mind the multiplier would only affect the font-size and CSS transform, but I admit I'm not well versed on the internals of the PDF format.

I can provide a sample (or open another issue).

@coolwanglu
Owner

Please reopen with more info

@coolwanglu coolwanglu closed this Jul 2, 2013