Visibility test for text #64

razamobin opened this Issue · 28 comments

7 participants


Text that is covered by later images or other objects should not be visible in the final HTML.
For each piece of text, we should test its visibility and display only the visible or partially visible parts.

Relevant topics:

  • Arrangement clipping (set operations between polygons, Bézier curves, etc.)
  • Transparent bits (text covered by transparent pixels is still visible)
  • #39

[Update 2013.10.05]
Possible solutions:

  • Trace all the images, estimate each image's area by its bounding box, and find all text behind that box. Such text is marked as hidden and rendered as part of the image, since it may be only partially covered. (A rough sketch of this idea follows the list.)
    • Currently text and images are processed separately; they need to be merged.
    • If most of the bbox is transparent (e.g. a long diagonal line), lots of text that is actually completely visible will be rendered as image.
    • We need to store the locations and sizes of the text, and make it efficient to query and mark them (as hidden).
  • Employ an SVG background image.
    • In some browsers (e.g. Firefox) text in SVG is not selectable; a fallback mode may be needed.
    • Currently the SVG background is still very large and needs further optimization.

// Original report
I have a sample PDF which appears to be scanned pages. The HTML produced has both images and text for each page - it should be just one of the two (text or images), not both.


HTML output: (screenshot attached)

I checked the FAQ and looked through the command line options but didn't discover anything. I'm not sure if there's something I missed. Thanks for reading.



Confirmed and working on it.


This PDF first draws the text, then the scanned image on top of it, so that the real text is covered and invisible, but still selectable.

pdf2htmlEX currently cannot detect this; it always tries to grab all the text and put it on the top text layer.

I'll try to find a workaround for this.


Is it true (or very common) that, for scanned PDF files, all text is hidden or covered by the scanned image?
If so I may add an option like 'hide-text' as a workaround.

EDIT: I mean you could actually add more text there, above the images, with any PDF manipulation tool, so --hide-text might break such a PDF again. But it's OK if this is not common.


I'm not sure how common it is. I believe this kind of PDF is created when you OCR a scanned document: in a PDF viewer you can then search for text and it highlights as expected, because the text sits almost exactly behind the scanned version of the same text.


It's relatively common, but it was a poor decision on the part of the OCR program - it should have used hidden text rather than just placing an image over the text. Obviously it's too late to do anything about that now.

Poppler's pdftohtml has some code to handle this specific problem, starting at line 522 below, which is probably a good starting point.

522   //----- discard duplicated text (fake boldface, drop shadows)
523   if( !complexMode )
524   { /* if not in complex mode get rid of duplicate strings */
525     HtmlString *str3;
526     GBool found;
527     while (str1)
528     {
529         double size = str1->yMax - str1->yMin;
530         double xLimit = str1->xMin + size * 0.2;
531         found = gFalse;
532         for (str2 = str1, str3 = str1->yxNext;
533             str3 && str3->xMin < xLimit;
534             str2 = str3, str3 = str2->yxNext)
535         {
536             if (str3->len == str1->len &&
537                 !memcmp(str3->text, str1->text, str1->len * sizeof(Unicode)) &&
538                 fabs(str3->yMin - str1->yMin) < size * 0.2 &&
539                 fabs(str3->yMax - str1->yMax) < size * 0.2 &&
540                 fabs(str3->xMax - str1->xMax) < size * 0.2)
541             {
542                 found = gTrue;
543                 //printf("found duplicate!\n");
544                 break;
545             }
546         }
547         if (found)
548         {
549             str2->xyNext = str3->xyNext;
550             str2->yxNext = str3->yxNext;
551             delete str3;
552         }
553         else
554         {
555             str1 = str1->yxNext;
556         }
557     }       
558   } /*- !complexMode */

Oh I see... that's annoying. I thought the image was a Type 3 font, but it's not - it really is an image.


A PDF file can have multiple layers, and layers containing images and text can be intermixed. I've attached a screenshot of a PDF alongside the output pdf2htmlEX currently generates, showing a more general example of the problem (look at the stack of receipts).


Other than that, it did a remarkably good job of replicating that page.


@jmbowman, thanks for the info. Yes, the current design of pdf2htmlEX may be too naive; maybe these will fix it somewhat:

  • detect and hide all text completely covered by images
  • detect and hide all text partially covered, and add a hidden text layer

but it might be slow and ugly..

btw, can I have that PDF for debugging?

@coolwanglu coolwanglu closed this
@coolwanglu coolwanglu reopened this

Here's that one page of the PDF for testing:

I think that collapsing all of the image layers into a single image is usually a good optimization, except when it breaks like this. I guess one solution would be options for always collapsing (smallest output), always preserving layers (most correct), or preserving layers only for specific pages where you know you need it (best results with extra effort). Automatically figuring out which pages those are would be nice, but that could be a separate improvement.


@razamobin A new option --fallback is now available, which makes PDF files render as images plus hidden text. Usually this increases the output size, but not for scanned PDFs, so please give it a try.


I have an initial implementation of "covered text handling".
Characters covered by images are detected and (1) made transparent in the text layer, (2) drawn in the background layer.

There are still things to do:

  • Fix SplashBackgroundRenderer. Covered text handling requires that the text layer be processed before the background, but SplashBackgroundRenderer currently doesn't support this mode, so only the SVG background works for now.
  • Detect characters covered by paths.
  • Handle clip areas.
  • Speed. The hit-testing algorithm is naive and may be slow for complicated pages.
  • Merge neighboring transparent chars; right now each char gets its own <span> (a sketch follows this list).
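A rough sketch of that last item: merging runs of covered characters so each run shares one <span>. The class names and emitter shape are assumptions, not the actual pdf2htmlEX code.

#include <cstddef>
#include <ostream>
#include <string>
#include <vector>

// chars holds one UTF-8 character per entry; covered is the per-character
// visibility vector (same length). Runs with equal coverage share a span;
// ".tc" would carry e.g. "color: transparent;" in CSS so covered text
// stays selectable but invisible.
void emit_text(std::ostream & out,
               const std::vector<std::string> & chars,
               const std::vector<bool> & covered)
{
    std::size_t i = 0;
    while (i < chars.size())
    {
        bool c = covered[i];
        std::size_t j = i;
        while (j < chars.size() && covered[j] == c) ++j;  // extend the run
        out << (c ? "<span class=\"tc\">" : "<span class=\"t\">");
        for (; i < j; ++i) out << chars[i];
        out << "</span>";
    }
}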

@duanyao Cool!
I'm a little bit worried about the performance; you might want to take a look at the rtree in Boost. Or you can leave the interface flexible so that I can fix it later.
Also, it might not be a good idea to create a separate chars_covered array; that would make it more difficult to optimize in the future. Currently it's similar to PDF, where we record text and state changes. But that's not elegant either. Probably I need to make it an array storing the state of each character.

  • The only public interface is std::vector<bool> & HTMLRenderer::get_chars_covered(). Optimization could be done in HTMLRenderer::add_non_char_bbox(double * bbox, int index), which performs the hit test (a sketch follows below).

  • We have to do char drawing in at least two passes: one for hit testing, one for selectively drawing on the background. The problem is how to tell the second pass which chars should be drawn; I don't know a better way than a chars_covered array. I'm afraid the data structures used internally in the first pass are irrelevant here.

  • The big-O complexity of the current algorithm is roughly O(m*n), where m is the char count and n is the non-char graphics count, assuming very few chars are covered. If each hit test takes 100 CPU cycles, then for a page with m = n = 2000 the total is 0.4G cycles, or 0.2s on a 2GHz core. That's not very fast, but it satisfies me for now, as most of my PDF pages are not that complex.

Thanks for recommending rtree, I'll take a look at it later.
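A minimal sketch of the naive O(m*n) hit test described above, shaped after the two interface names mentioned in this thread (get_chars_covered and add_non_char_bbox). Everything else - the class itself, the bbox layout, and the meaning of index (taken here as the number of characters recorded so far) - is an assumption.

#include <vector>

class CoveredTextDetector
{
public:
    // record a character's bbox {x0, y0, x1, y1} as it is drawn
    void add_char_bbox(double * bbox)
    {
        char_bboxes.insert(char_bboxes.end(), bbox, bbox + 4);
        chars_covered.push_back(false);
    }

    // a non-character graphic (image/path) was drawn; every character
    // drawn before it (i < index) whose bbox intersects it becomes covered
    void add_non_char_bbox(double * bbox, int index)
    {
        for (int i = 0; i < index; ++i)
        {
            double * cb = &char_bboxes[i * 4];
            if (cb[0] < bbox[2] && bbox[0] < cb[2] &&
                cb[1] < bbox[3] && bbox[1] < cb[3])
                chars_covered[i] = true;
        }
    }

    std::vector<bool> & get_chars_covered() { return chars_covered; }

private:
    std::vector<double> char_bboxes;  // 4 doubles per character
    std::vector<bool> chars_covered;  // one flag per character
};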


I think you can keep chars_covered for now, as I probably don't have time to rework the data structure. I wonder if std::vector<char> would be better: I remember that std::vector<bool> uses a bit-packed representation to save memory, but it is rather slow.

Can you create a separate class for the hit test? I don't want everything inside HTMLRenderer. Besides, that would make it easier to adapt to other data structures (see the sketch below).

An rtree should give an average time of O(n log m), and at worst roughly O(n sqrt(m)). 0.2s is not fast, as I've seen sites using pdf2htmlEX to convert thousands of PDFs, or a single file containing thousands of pages. But on the other hand, font & image processing may be even slower, so this may not be the bottleneck.

Can you put this behind an option? It is experimental right now, and somebody might prefer performance.
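Following up on the rtree suggestion, here is a sketch of a separate hit-test class backed by Boost.Geometry's rtree; the class and method names are assumptions, only the overall shape follows this discussion. Because characters are inserted as they are drawn, each query only sees characters drawn before the graphic, preserving drawing order. It also uses std::vector<char> instead of std::vector<bool>, as discussed above.

#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>
#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>

namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

class RTreeHitTest
{
    using Point = bg::model::point<double, 2, bg::cs::cartesian>;
    using Box   = bg::model::box<Point>;
    using Value = std::pair<Box, std::size_t>;  // char bbox + char index

public:
    void add_char_bbox(double * b)  // {x0, y0, x1, y1}
    {
        Box box(Point(b[0], b[1]), Point(b[2], b[3]));
        tree.insert(std::make_pair(box, chars_covered.size()));
        chars_covered.push_back(0);
    }

    // mark every already-indexed character that intersects this graphic
    void add_non_char_bbox(double * b)
    {
        Box query(Point(b[0], b[1]), Point(b[2], b[3]));
        std::vector<Value> hits;
        tree.query(bgi::intersects(query), std::back_inserter(hits));
        for (const Value & v : hits)
            chars_covered[v.second] = 1;
    }

    const std::vector<char> & get_chars_covered() const { return chars_covered; }

private:
    bgi::rtree<Value, bgi::quadratic<16>> tree;
    std::vector<char> chars_covered;
};

Each intersection query is O(log m) on average, so n graphics cost about O(n log m) instead of O(m*n).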


@duanyao Probably you could create a PR when you think it's ready, and it'll be a better place to discuss. Thanks!


Sure. First I want to fix the broken SplashBackgroundRenderer, and that may introduce conflicts with the pending PR #360, so I want to do it after #360 is merged.


I don't know if this is a related problem: text which is completely transparent in the PDF (but selectable) appears in the HTML result (most telling on page 15):


I understand it better now; in fact pdf.js works like fallback mode, rendering the page as an image and making all the text transparent.

Is the non-rendered text hidden behind images in the actual PDF?


I tested the covered text handling, and it didn't work.


@zogwarg, how did you test the covered text handling? Did you build the covered_text_handling branch with cmake -DENABLE_SVG=ON, and pass --correct-text-visibility 1 at runtime? It is off by default.

I tested page 15 of your PDF, and --correct-text-visibility 1 worked as expected.
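For reference, a build-and-run sequence consistent with the flags mentioned in this thread; the branch name and flags come from the discussion above, everything else (paths, file names) is a placeholder.

# build the covered_text_handling branch with SVG support, then convert
git checkout covered_text_handling
cmake -DENABLE_SVG=ON . && make
./pdf2htmlEX --correct-text-visibility 1 --bg-format svg sample.pdf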


I didn't have -DENABLE_SVG=ON; thanks, I'll test it again.


@duanyao Did you try converting this document?
All the text becomes invisible when converted with --correct-text-visibility 1 and --bg-format svg.

The output HTML can be downloaded from here


I'm closing this issue as there's already an implementation for this.
Please create a new issue with sample files if it's not working well.

@coolwanglu coolwanglu closed this

@bilalmughal I can reproduce your issue. However it is not related to --correct-text-visibility; it seems to be a problem in poppler's (or cairo's) SVG renderer. Using poppler's pdftocairo -svg command to convert your file, the output SVG also looks blank in Chrome, Firefox, and Inkscape, though it shows some text in GNOME Image Viewer. I suggest you report the pdftocairo -svg bug to poppler if you can.


@duanyao Thanks for looking into it, I have reported it to poppler.


@bilalmughal Could you post the link to the poppler bug so that we can track the progress?
