Duplicated Text #115

zagraves opened this Issue Apr 6, 2013 · 8 comments

2 participants


I've linked a sample PDF and the output HTML, in which some text is duplicated and usually overlaps the original.

Note the words, "needle", "floating", "so", "e will", etc. I also notice when using the browser zoom that these duplicate text elements shift around, especially the word "floating".

From what I can tell in Acrobat, there isn't any hidden text in the PDF that would appear in the HTML like this.

HTML: http://zachgraves.com/output.html
PDF: http://db.tt/JjQhFQXR


Yes confirmed. Thanks for reporting.

The PDF does indeed contain separate text fragments such as "dle", "so", etc., but they are supposed to overlap the preceding letters perfectly.
The offset seems too large to me. I'll try to figure it out.


I'm using Evince (a poppler-based PDF viewer). If I copy & paste the two lines of text which start with "IMPORTANT", I actually get

IMPORTANT: Always rub the needle
in the same direction.

which proves that there is extra text in the PDF.

Can you please verify this in Acrobat? If you don't see the extra 'dle', it is probably Acrobat that detected and merged the duplicates.

Unfortunately, so far the only thing I can suggest is to remove the extra div elements manually.
Or you can use the --fallback 1 parameter.

This is actually an example of why the HTML has to be pixel-accurate: there are lots of PDFs where overlapping text is used to simulate bold fonts.
Please allow a few days before I can get back to this issue.

Thanks for reporting!


Interesting. I just checked pdftotext and see similar results.

STEP 2: Repeat step 1 about
forty times! Rub, rub, rub!
IMPORTANT: Always rub the needle
in the same direction.

STEP 3: Place your leaf (or
or floating
thing) on top of the water so
that it floats in the middle.

STEP 4: Carefully put your needle
on top of the leaf. The needle
e will
slowly turn and point NORTH.

is like a giant magnet. The
needle of your compass is
attracted to Earth’s


Looking in Acrobat, the PDF is bloody complicated, very hard to tell what's going on, but I can't see any obvious duplicated text.

Regarding fallback, is there any merit in having a per-page fallback option? Otherwise we can remove problem divs manually... but an automated option is ideal even if it's simply a flat image for problem pages.


So far there is no such flexible option available.
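One possible workaround is to convert the document one page at a time and add --fallback 1 only for the pages that show the problem. The sketch below just builds the command lines; it assumes that the -f / -l (first page / last page) options are available in the installed pdf2htmlEX version, and the function name, the output file names, and the list of problem pages are all hypothetical.

```python
def build_commands(pdf_path, num_pages, problem_pages):
    """Return one pdf2htmlEX argument list per page.

    Pages listed in `problem_pages` get `--fallback 1`; all others use
    the default mode. Assumes -f/-l select the page range.
    """
    commands = []
    for page in range(1, num_pages + 1):
        argv = ["pdf2htmlEX", "-f", str(page), "-l", str(page)]
        if page in problem_pages:
            argv += ["--fallback", "1"]
        argv += [pdf_path, f"page-{page}.html"]
        commands.append(argv)
    return commands


# Example: a 3-page document where only page 2 shows duplicated text.
commands = build_commands("sample.pdf", 3, {2})
```

Each argument list could then be passed to `subprocess.run`. The per-page HTML files would still have to be stitched together by hand, so this is a stopgap rather than a real per-page option.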


@zachgraves I don't have Acrobat at hand. Can you try removing the text "Always rub the needle" and see if there are still some letters left?

I just found that using a larger value of "--font-size-multiplier" helps, which suggests that it's the browser that is rounding the font size. It might be worth a try.
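To illustrate why a larger multiplier would help, here is a small arithmetic sketch. It assumes, purely for illustration, that the browser rounds the font size to a whole number of pixels and that the multiplied size is scaled back down by the same factor through a CSS transform. The function name is hypothetical.

```python
def effective_font_size(size_px, multiplier):
    """Size actually rendered if the browser rounds the multiplied
    font size to whole pixels and a transform scales it back down."""
    rounded = round(size_px * multiplier)
    return rounded / multiplier


# Compare the rounding error for a 7.3px font at several multipliers.
errors = {m: abs(effective_font_size(7.3, m) - 7.3) for m in (1, 4, 10)}
```

Under these assumptions a 7.3px font renders as 7px with a multiplier of 1 but as 7.25px with a multiplier of 4. In general the rounding error is at most 0.5 / multiplier pixels, so fragments that are meant to overlap drift apart less as the multiplier grows.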


Sorry, I haven't had a chance to test in Acrobat yet. I will look at it tonight.

Regarding --font-size-multiplier, is there any expectation that a value of 1 can result in the extracted text containing unexpected spaces?

I have a case where a value of 1 may result in "L orem ip sum dol or sit amet..." while the default of 4 results in the expected "Lorem ipsum dolor sit amet..." in the HTML fragment that is output by pdf2htmlEX. In my mind the multiplier would only affect the font-size and CSS transform, but I admit I'm not well versed in the internals of the PDF format.

I can provide a sample (or open another issue).


Please reopen with more info

coolwanglu closed this Jul 2, 2013