Visibility test for text #64

Closed
razamobin opened this Issue Jan 15, 2013 · 28 comments


Text that is covered by images or other objects drawn afterwards should not be visible in the final HTML.
For each piece of text, we should test its visibility and display only the visible or partially visible parts.

Relevant topics:

  • Arrangement clipping (set operations between polygons, Bézier curves, etc.)
  • Transparent bits (text covered by transparent pixels is still visible)
  • #39

[Update 2013.10.05]
Possible solution:

  • Trace all the images, estimate the area of each image with its bounding box, and find all text behind that box. Such text is marked as hidden and rendered as part of the image -- even when it is only partially covered.
    • Currently text and images are processed separately; they need to be merged.
    • If most of the bbox is transparent (e.g. a long diagonal line), lots of text that is actually completely visible will be rendered as image.
    • Need to store the locations and sizes of the text, and make them efficient to query and to mark (as hidden).
  • Employ an SVG background image.
    • In some browsers (e.g. Firefox), text in SVG is not selectable; a fallback mode may be used.
    • Currently the SVG background is still very large and needs further optimization.
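A minimal sketch of the bounding-box idea in the first bullet above (BBox and mark_hidden are hypothetical names for illustration, not pdf2htmlEX's actual API): each text box is tested against every image bounding box, and fully contained text is marked hidden.

```cpp
#include <vector>

// Hypothetical axis-aligned bounding box; pdf2htmlEX's real types differ.
struct BBox {
    double x0, y0, x1, y1;
    // True if `o` lies entirely inside this box.
    bool contains(const BBox &o) const {
        return x0 <= o.x0 && y0 <= o.y0 && x1 >= o.x1 && y1 >= o.y1;
    }
};

// Mark every text box fully covered by an image box as hidden, so it can be
// rendered as part of the image instead. Naive O(m*n) over m text boxes and
// n image boxes.
std::vector<bool> mark_hidden(const std::vector<BBox> &text,
                              const std::vector<BBox> &images) {
    std::vector<bool> hidden(text.size(), false);
    for (std::size_t i = 0; i < text.size(); ++i)
        for (const BBox &img : images)
            if (img.contains(text[i])) { hidden[i] = true; break; }
    return hidden;
}
```

As the bullets note, a bbox is only an estimate of the image's area: a mostly transparent image (e.g. a long diagonal line) would wrongly hide text that is actually visible.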

///////////////////////////////
// Original report
I have a sample PDF which appears to be scanned pages. The HTML produced has both images and text for each page - it should be just text xor images, not both.

pdf:
https://dl.dropbox.com/u/31309918/dd/F3Znx0Qodh.pdf

html output:
https://dl.dropbox.com/u/31309918/dd/F3Znx0Qodh.html

I checked the FAQ and looked through the command line options but didn't discover anything. I'm not sure if there's something I missed. Thanks for reading.

-Raza

Owner

coolwanglu commented Jan 19, 2013

Confirmed and working on it.

Owner

coolwanglu commented Jan 19, 2013

This PDF first draws the text, then the scanned image on top of it, such that the real text is covered and invisible, but still selectable.

pdf2htmlEX currently cannot detect this; it always grabs all the text and puts it on the top text layer.

I'll try to find a workaround for this.

Owner

coolwanglu commented Jan 31, 2013

Is it true (or very common) that, for scanned PDF files, all text is hidden or covered by the scanned image?
If so I may add an option like 'hide-text' as a workaround.

EDIT: I mean you can actually add more text there, above the images, with any PDF manipulation tool, so --hide-text may break such a PDF again. But it's OK if this is not common.

I'm not sure how common it is. I believe this kind of PDF is created when you OCR a scanned document: when using a PDF viewer, you can search the text and it highlights as expected, because the text sits almost exactly behind the scanned version of the same text.

Contributor

jahewson commented Feb 2, 2013

It's relatively common, but was a poor decision on the part of the OCR program - it should have used hidden text rather than just placing an image over the text. Obviously it's too late to do anything about that.

Poppler's pdftohtml has some code to handle this specific problem at line 522 of HtmlOutputDev.cc (http://fossies.org/dox/poppler-0.22.0/HtmlOutputDev_8cc_source.html#l00522), which is probably a good starting point.

522   //----- discard duplicated text (fake boldface, drop shadows)
523   if( !complexMode )
524   { /* if not in complex mode get rid of duplicate strings */
525     HtmlString *str3;
526     GBool found;
527     while (str1)
528     {
529         double size = str1->yMax - str1->yMin;
530         double xLimit = str1->xMin + size * 0.2;
531         found = gFalse;
532         for (str2 = str1, str3 = str1->yxNext;
533             str3 && str3->xMin < xLimit;
534             str2 = str3, str3 = str2->yxNext)
535         {
536             if (str3->len == str1->len &&
537                 !memcmp(str3->text, str1->text, str1->len * sizeof(Unicode)) &&
538                 fabs(str3->yMin - str1->yMin) < size * 0.2 &&
539                 fabs(str3->yMax - str1->yMax) < size * 0.2 &&
540                 fabs(str3->xMax - str1->xMax) < size * 0.2)
541             {
542                 found = gTrue;
543                 //printf("found duplicate!\n");
544                 break;
545             }
546         }
547         if (found)
548         {
549             str2->xyNext = str3->xyNext;
550             str2->yxNext = str3->yxNext;
551             delete str3;
552         }
553         else
554         {
555             str1 = str1->yxNext;
556         }
557     }       
558   } /*- !complexMode */
Owner

coolwanglu commented Feb 2, 2013

No, in our case one is text and the other is an image, so they are not duplicates.


Contributor

jahewson commented Feb 2, 2013

Oh I see... that's annoying. I thought the image was a Type 3 font, but it's not - it really is an image.

jmbowman commented Feb 4, 2013

A PDF file can have multiple layers, and layers containing images and text can be intermixed. I've attached a screenshot of a PDF and the output pdf2htmlEX currently generates that shows a more general example of the problem (look at the stack of receipts).

PDF: https://dl.dropbox.com/u/4804331/Layers_PDF.png
HTML: https://dl.dropbox.com/u/4804331/Layers_HTML.png

Other than that, it did a remarkably good job of replicating that page.

Owner

coolwanglu commented Feb 5, 2013

@jmbowman Thanks for the info. Yes, the current design of pdf2htmlEX may be too naive; maybe these will fix it somewhat:

  • detect and hide all text completely covered by images
  • detect and hide all text partially covered, and add a hidden text layer

but it might be slow and ugly..

btw, can I have that PDF for debugging?

coolwanglu closed this Feb 5, 2013

coolwanglu reopened this Feb 5, 2013

jmbowman commented Feb 5, 2013

Here's that one page of the PDF for testing: https://dl.dropbox.com/u/4804331/layers_bug.pdf

I think that collapsing all of the image layers into a single image is usually a good optimization, except when it breaks like this. I guess one solution would be to have options for always collapsing (smallest), always preserving layers (most correct), or only preserving layers for specific pages which you know you'll need it for (best results with extra effort). Automatically figuring out which pages those are would be nice, but could be a separate improvement.

Owner

coolwanglu commented Feb 5, 2013

Actually pdf2htmlEX is PDF-to-image with text extracted.

Besides layers, the clipping path is the biggest problem: an image in a rectangle may be displayed as a circle due to the clipping path, which cannot be done easily in HTML.

Still looking for a solution.


Owner

coolwanglu commented Mar 9, 2013

@razamobin A new option --fallback is now available, which renders PDF pages as images plus hidden text. Usually this would increase the output size, but not for scanned PDFs, so please give it a try.

Collaborator

duanyao commented Jun 13, 2014

@coolwanglu
I have an initial implementation of "covered text handling".
Characters covered by images are detected and (1) made transparent in the text layer, (2) drawn in the background layer.

There are still things to do:

  • Fix SplashBackgroundRenderer. Covered text handling requires that the text layer is drawn before the background, but SplashBackgroundRenderer currently doesn't support this mode, so only the SVG background works for now.
  • Detect characters covered by paths.
  • Handle the clip area.
  • Speed. The hit-testing algorithm is naive and may be slow for complicated pages.
  • Merge neighboring transparent chars. Currently each char gets its own <span>.
Owner

coolwanglu commented Jun 13, 2014

@duanyao Cool!
I'm a little bit worried about the performance; you might want to take a look at rtree in Boost. Or you can leave the interface flexible so that I can fix it later.
And it might not be a good idea to create a separate chars_covered array; that would make it more difficult to optimize in the future. Currently it's similar to PDF, where we record text and state changes. But that's not elegant either. Probably I need to make it an array and store the state for each character.

Collaborator

duanyao commented Jun 14, 2014

  • The only public interface is std::vector<bool> & HTMLRenderer::get_chars_covered(). Optimization could be done in HTMLRenderer::add_non_char_bbox(double * bbox, int index), which performs the hit testing.
  • We have to do char drawing in at least two passes: one for hit testing, one for selective drawing on the background. The problem is how to tell the second pass which chars should be drawn; I don't know a better way than a chars_covered array. I'm afraid the data structures used internally in the first pass are irrelevant here.
  • The Big-O complexity of the current algorithm is estimated as O(m*n), m for char count and n for non-char graphics count, if very few chars are covered. Assume each hit test takes 100 CPU cycles; then for a page where m=n=2000, the total time is 0.4G cycles, or 0.2s on a 2GHz core. This is not very fast, but it satisfies me for now, as most of my PDF pages are not that complex.

Thanks for recommending rtree, I'll take a look at it later.
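The two-pass scheme described above can be sketched roughly as follows; Rect, intersects, and the simplified signature are illustrative stand-ins, not the actual pdf2htmlEX code:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for pdf2htmlEX's internal bbox type (illustration only).
struct Rect { double x0, y0, x1, y1; };

static bool intersects(const Rect &a, const Rect &b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Pass 1 (hit testing): a non-char graphic with bounding box `bbox` is drawn
// after the first `index` chars; any earlier char it overlaps is covered.
// This is the naive O(m*n) loop whose cost is estimated above.
void add_non_char_bbox(const Rect &bbox, std::size_t index,
                       const std::vector<Rect> &char_boxes,
                       std::vector<bool> &chars_covered) {
    for (std::size_t i = 0; i < index && i < char_boxes.size(); ++i)
        if (intersects(char_boxes[i], bbox))
            chars_covered[i] = true;
}

// Pass 2 (selective drawing) then consults chars_covered: covered chars are
// drawn into the background layer and made transparent in the text layer.
```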

Owner

coolwanglu commented Jun 14, 2014

I think you can keep chars_covered for now, as I probably won't have time to rework the data structure. I wonder if std::vector<char> would be better, because I remember that std::vector<bool> uses a bitset to save memory, but is rather slow.
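To illustrate the suggested trade-off: std::vector<bool> is a bit-packed specialization (8 flags per byte, every access going through a proxy object), while a byte-per-flag wrapper over std::vector<char> trades roughly 8x the memory for plain, typically faster element access. A hypothetical sketch (CoveredFlags is not a real pdf2htmlEX type):

```cpp
#include <cstddef>
#include <vector>

// Byte-per-flag alternative to std::vector<bool> for the covered-chars set.
struct CoveredFlags {
    std::vector<char> flags;  // one addressable byte per char, no bit proxies
    explicit CoveredFlags(std::size_t n) : flags(n, 0) {}
    void mark(std::size_t i)          { flags[i] = 1; }
    bool covered(std::size_t i) const { return flags[i] != 0; }
};
```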

Can you create a separate class for the hit testing? I don't want everything inside HTMLRenderer. Besides, it will be easier to adapt to other data structures.

An R-tree should give an average time of O(n log m), and O(n sqrt(m)) at worst. 0.2s is not fast, as I've seen sites using pdf2htmlEX to convert thousands of PDFs, or a single file containing thousands of pages. But on the other hand, font & image processing may be even slower, so this may not be the bottleneck.
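Boost.Geometry's bgi::rtree would be the ready-made choice here. As a self-contained illustration of the same idea (a spatial index so each graphic only tests nearby chars instead of all m of them), a uniform grid can be sketched in plain C++; all names here are hypothetical:

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Box { double x0, y0, x1, y1; };

static bool overlaps(const Box &a, const Box &b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Uniform-grid spatial index: each box is bucketed into every cell it
// touches, so a query only inspects boxes in the cells it overlaps.
class GridIndex {
public:
    GridIndex(double cell, const std::vector<Box> &boxes)
        : cell_(cell), boxes_(boxes) {
        for (std::size_t i = 0; i < boxes_.size(); ++i)
            visit_cells(boxes_[i], [&](long cx, long cy) {
                grid_[{cx, cy}].push_back(i);
            });
    }

    // Indices of indexed boxes overlapping `q`.
    std::vector<std::size_t> query(const Box &q) const {
        std::vector<std::size_t> out;
        visit_cells(q, [&](long cx, long cy) {
            auto it = grid_.find({cx, cy});
            if (it == grid_.end()) return;
            for (std::size_t i : it->second)
                if (overlaps(boxes_[i], q))
                    out.push_back(i);
        });
        return out;
    }

private:
    template <typename F>
    void visit_cells(const Box &b, F f) const {
        for (long cx = (long)std::floor(b.x0 / cell_);
             cx <= (long)std::floor(b.x1 / cell_); ++cx)
            for (long cy = (long)std::floor(b.y0 / cell_);
                 cy <= (long)std::floor(b.y1 / cell_); ++cy)
                f(cx, cy);
    }

    double cell_;
    std::vector<Box> boxes_;
    std::map<std::pair<long, long>, std::vector<std::size_t>> grid_;
};
```

With char boxes indexed once per page, each non-char graphic's hit test drops from scanning all m chars to only the chars sharing its grid cells; an R-tree offers the same speedup with better worst-case behavior on skewed layouts.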

Can you add an option for this? It is experimental right now, and somebody might prefer performance.

Owner

coolwanglu commented Jun 14, 2014

@duanyao Probably you could create a PR when you think it's ready, and it'll be a better place to discuss. Thanks!

Collaborator

duanyao commented Jun 14, 2014

Sure. First I want to fix the broken SplashBackgroundRenderer, and this may introduce conflicts with pending PR #360, so I want to do it after #360 is merged.

zogwarg commented Jul 9, 2014

I don't know if this is a related problem: text which is completely transparent in the PDF (but selectable) appears in the HTML result (most telling on page 15):
http://zogwarg.free.fr/pdftohtml/1_NOR.pdf
http://zogwarg.free.fr/pdftohtml/1_NOR.html

EDIT:

I get it better now; in fact pdf.js works like fallback mode, rendering the page as an image and making all the text transparent.

Is the non-rendered text hidden behind images in the actual PDF?

zogwarg commented Jul 9, 2014

I tested the covered text handling, and it didn't work.

Collaborator

duanyao commented Jul 10, 2014

@zogwarg, how did you test the "covered text"? Did you build covered_text_handling branch with cmake -DENABLE_SVG=ON, and pass --correct-text-visibility 1 at runtime? It is off by default.

I tested your PDF's p15, --correct-text-visibility 1 worked as expected.

zogwarg commented Jul 10, 2014

I didn't have -DENABLE_SVG=ON, thanks, I'll test it again.

@duanyao Did you try converting this document
All the text becomes invisible when used with --correct-text-visibility 1 and --bg-format svg

Output html can be downloaded from here

Owner

coolwanglu commented Oct 31, 2014

I'm closing this issue as there's already an implementation for this.
Please create a new issue with sample files if it's not working well.

coolwanglu closed this Oct 31, 2014

Collaborator

duanyao commented Nov 9, 2014

@bilalmughal I can reproduce your issue. However it is not related to --correct-text-visibility; it seems to be a problem in poppler's (or cairo's) SVG renderer. Using poppler's pdftocairo -svg command to convert your file, the output SVG file also looks blank in Chrome, Firefox, and Inkscape, though it shows some text in GNOME Image Viewer. I suggest you report the pdftocairo -svg bug to poppler (https://bugs.freedesktop.org/buglist.cgi?quicksearch=poppler&list_id=457322) if you can.

@duanyao Thanks for looking into it, I have reported it to poppler.

Collaborator

duanyao commented Nov 10, 2014

@bilalmughal could you post the link to the poppler's bug so that we can track the progress?
