Google drive pdf viewer #248

Toneti777 · 2013-11-25T11:11:44Z

Over the last few days I have opened some PDF files attached to e-mails in my Gmail account. The online viewer surprised me.
They have a very good HTML viewer!

I have inspected some files and I'd like to suggest some improvements for this library.
They use

for one or more text lines without additional elements, resulting in more efficient and lighter output.

I don't know how it works, or whether you could study their process and aim for the same goals.

coolwanglu · 2013-11-25T12:13:37Z

Actually, the text is rendered into images, while a hidden text layer is provided for selection.

Because of this, the hidden text layer may not be very accurate, and that's why lots of styles can be removed.

Toneti777 · 2013-11-25T12:24:59Z

OK, I see now...

It isn't as good as it first appeared...

Thanks.

This library is better... I only miss lighter HTML code and a fix for the duplicated fonts problem ;-)
Very good job.

Toneti777 · 2013-11-27T18:27:03Z

High optimization without losing accuracy.

I've tried to reduce the number of HTML elements in the HTML output...

I've found a pattern that could be useful... I can't get an improvement like this by changing the library parameters.

1 - Find neighboring divs (corresponding to lines) that have the same "m x h fs fc sc ls ws" classes.
2 - Remove the span elements and join all the text of these divs into a single div.
3 - Remove the excess divs.
4 - Set a width on the div. Sum all the div heights and update the div height. Subtract the height of the removed divs from the "bottom" attribute. Change "white-space" to "inherit". Add "line-height" and "letter-spacing".
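
A minimal sketch of steps 1 to 4, only to make the idea concrete. It works on plain records rather than real pdf2htmlEX output: the `Line` class, its fields, and the `merge_lines` helper are hypothetical, and joining the text with a single space is an assumption that will not hold for every PDF. Width, line-height and letter-spacing are left out because the thread does not say how to compute them.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Line:
    """One text line; a hypothetical stand-in for a single line div."""
    style: tuple    # shared class names, e.g. ("m0", "x1", "h2", "fs0")
    text: str
    bottom: float   # distance from the page bottom, in px
    height: float   # height of the line box, in px


def merge_lines(lines):
    """Merge runs of neighboring lines that share the same style classes."""
    merged = []
    # groupby only groups consecutive items, which matches "neighboring divs".
    for style, run in groupby(lines, key=lambda line: line.style):
        run = list(run)
        first = run[0]
        # Step 4: the heights of the removed divs are added to the height
        # of the surviving div and subtracted from its "bottom".
        removed_height = sum(line.height for line in run[1:])
        merged.append(Line(
            style=style,
            # Step 2: join the text of the run (assumes a space between lines).
            text=" ".join(line.text for line in run),
            bottom=first.bottom - removed_height,
            height=first.height + removed_height,
        ))
    return merged


page = [
    Line(("m0", "x1", "h2", "fs0"), "first line", 700.0, 12.0),
    Line(("m0", "x1", "h2", "fs0"), "second line", 688.0, 12.0),
    Line(("m0", "x3", "h2", "fs0"), "indented", 676.0, 12.0),
]
for line in merge_lines(page):
    print(line)
```

Here the first two lines collapse into one record with height 24 and bottom 688, while the third stays separate because its `x` class differs.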

Now the text flows through the div space and you can obtain a similar result... I'll test it in the Chrome debugger...

It may be difficult to obtain the line-height or letter-spacing... but I think the improvement will be tremendous...

What do you think?

coolwanglu · 2013-11-29T08:21:40Z

(1) and part of (2) should be done by --optimize-text, although it's still faulty for some tricky PDFs.

I've actually been planning to do 3 and 4, at the cost of some inaccuracy. Not finished yet.

coolwanglu · 2013-12-15T15:15:42Z

@Toneti777 There are other concerns with item 2: <span> elements are usually added to adjust spacing inside words, or because of a change of font or other styles. If you remove the span elements and merge the text, accuracy cannot be preserved.

coolwanglu · 2013-12-15T15:16:59Z

@Toneti777 But if you managed to optimize the output a lot with items 1 and 2, it sounds like a bug in pdf2htmlEX: it is producing unnecessary span elements. In that case, can you please file a new bug with sample files?

Toneti777 · 2013-12-16T08:01:00Z

In 1) I have one problem: how to calculate the width of the line when merging two or more lines. The library builds one div per line and I think that could be improved. I tried it manually and it works perfectly, but I'm stuck on finding the width for an automatic process; maybe it's easier inside your library.

In 2) I tried deleting the span elements with a very low value in the margin-left attribute. For my PDFs, the library puts a lot of span elements between the letters of each word on each line. Most of them have a low value, and this might be fixed with letter-spacing or word-spacing on the div element. I obtain a much less complex HTML output.
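
A rough sketch of the idea in 2), under simplifying assumptions: a line is modelled as a list where strings are text runs and floats are the horizontal offsets (the margin-left values of the spans), in px. The `drop_small_offsets` helper and the threshold value are hypothetical, not part of pdf2htmlEX, and dropping offsets this way trades positional accuracy for fewer elements.

```python
def drop_small_offsets(segments, threshold):
    """Remove offsets smaller than `threshold` and re-join the text around them.

    `segments` mixes str (text runs) and float (horizontal offsets in px).
    """
    result = []
    for segment in segments:
        if isinstance(segment, float) and abs(segment) < threshold:
            continue  # too small to matter visually: drop the span
        if isinstance(segment, str) and result and isinstance(result[-1], str):
            result[-1] += segment  # merge with the preceding text run
        else:
            result.append(segment)
    return result


line = ["H", 0.1, "el", -0.05, "lo", 3.2, "world"]
print(drop_small_offsets(line, threshold=0.5))
# prints ['Hello', 3.2, 'world']
```

Only the large offset survives, so the line needs one span instead of three.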

coolwanglu · 2013-12-16T08:10:02Z

Maybe you can try a larger value for --heps and --optimize-text 1
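
For example, the command line could be assembled like this. The two flags are the ones mentioned above; the file name `sample.pdf` and the `--heps` value of 3 are placeholders for illustration, and the right threshold depends on the document. The snippet only builds and prints the command, it does not run it.

```python
import shlex


def build_command(pdf_path, heps, optimize_text=1):
    """Build a pdf2htmlEX argument list with the two suggested options."""
    return [
        "pdf2htmlEX",
        "--heps", str(heps),                    # larger value, as suggested above
        "--optimize-text", str(optimize_text),
        pdf_path,
    ]


command = build_command("sample.pdf", heps=3)
print(shlex.join(command))
# prints: pdf2htmlEX --heps 3 --optimize-text 1 sample.pdf
```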

coolwanglu · 2013-12-20T16:42:02Z

I think this is a duplicate of #56, so I'm closing this one.

coolwanglu closed this as completed Dec 20, 2013
