This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Google drive pdf viewer #248

Closed
Toneti777 opened this issue Nov 25, 2013 · 9 comments

Comments

@Toneti777

Over the last few days I opened some PDF files attached to e-mails in my Gmail account. The online viewer surprised me.
They have a very good HTML viewer!!

I inspected some files and would like to suggest some improvements for this library.
They use

for one or more text lines, without additional elements, which results in lighter and more efficient output.

I don't know how it works. Could you study their process, or see what they are aiming for?

@coolwanglu
Owner

Actually, the text is rendered into images, and a hidden text layer is provided for selection.

With this approach the hidden text layer does not have to be very accurate, which is why many styles can be removed.

@Toneti777
Author

OK, I see now...

It isn't as good as it first appeared...

Thanks.

This library is better... I only miss lighter HTML code and a fix for the duplicated fonts problem ;-)
Very good job.

@Toneti777
Author

High optimization without losing accuracy.

I've tried to reduce the number of HTML elements in the output...

I've found a pattern that could be useful... I can't get this kind of improvement just by changing the library's parameters.

1 - Find neighboring divs (corresponding to lines) that have the same "m x h fs fc sc ls ws" classes...
2 - Remove the span elements and join all the text of these divs into a single div.
3 - Remove the leftover divs.
4 - Set a width on the div. Sum the heights of all the merged divs and use that as the div's height. Subtract the height of the removed divs from the "bottom" attribute. Change "white-space" to "inherit". Add "line-height" and "letter-spacing".
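
The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not pdf2htmlEX code: the `Line` record and `merge_neighbors` function are hypothetical names, each line is assumed to be already stripped of its span elements, the merged width is simply taken as the widest line, and the line pitch is assumed to equal the line height (which is where accuracy can be lost).

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one line div; not a pdf2htmlEX structure."""
    classes: frozenset  # shared style classes ("m x h fs fc sc ls ws")
    text: str           # text of the line with span elements already removed
    bottom: float       # CSS "bottom" of the div
    height: float       # CSS height of the div
    width: float        # measured width of the line


def merge_neighbors(lines):
    """Merge consecutive lines (ordered top to bottom) that share the same classes."""
    merged = []
    for line in lines:
        prev = merged[-1] if merged else None
        if prev is not None and prev.classes == line.classes:
            # Steps 2 and 3: join the text into one div and drop the extra div.
            prev.text = f"{prev.text} {line.text}"
            # Step 4: sum the heights and lower "bottom" by the removed height.
            prev.height += line.height
            prev.bottom -= line.height
            # Assumption: use the widest merged line as the div width.
            prev.width = max(prev.width, line.width)
        else:
            # Copy the record so the caller's input is left untouched.
            merged.append(Line(line.classes, line.text, line.bottom,
                               line.height, line.width))
    return merged


style = frozenset({"m0", "x1", "h2", "fs0", "fc0", "sc0", "ls0", "ws0"})
title = frozenset({"m0", "x1", "h3", "fs1", "fc0", "sc0", "ls0", "ws0"})
page = [
    Line(style, "Hello", 700.0, 12.0, 80.0),
    Line(style, "world", 688.0, 12.0, 60.0),
    Line(title, "Title", 650.0, 20.0, 100.0),
]
result = merge_neighbors(page)
```

With the sample data, the first two lines collapse into one block and the differently styled third line stays separate. The merged div would still need "white-space", "line-height" and "letter-spacing" set in CSS, which this sketch does not attempt to compute.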

Now the text flows through the div's space and you get a similar result... I'll try it in the Chrome debugger...

It may be difficult to obtain the line-height or letter-spacing... but I think the improvement will be tremendous...

What do you think??

@coolwanglu
Owner

(1) and part of (2) should be done by --optimize-text, although it's still faulty for some tricky PDFs.

I'd actually been planning to do 3 and 4, at the cost of some inaccuracy. Not finished yet.

@coolwanglu
Owner

@Toneti777 There are other concerns with item 2: <span> elements are usually added to adjust inner-word spacing, or because of a change of font or other styles. If you remove the span elements and merge the text, accuracy cannot be preserved.

@coolwanglu
Owner

@Toneti777 But if you managed to optimize the output a lot with items 1 and 2, it sounds like a bug in pdf2htmlEX: it is producing unnecessary span elements. In that case, can you please file a new bug with sample files?

@Toneti777
Author

In 1) I have one problem: how to calculate the width of the line when merging two or more lines. The library builds one div per line and I think that could be improved. I tried it manually and the result is perfect, but I'm stuck on finding the width in an automatic process; maybe it is easier inside your library.

In 2) I tried deleting the span elements that have a very low margin-left value. For my PDFs, the library puts a lot of span elements between the letters of each word on every line. Most of them have a low value, and that could be handled with letter-spacing or word-spacing on the div element. I get a much less complex HTML output.
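
A rough sketch of the idea in 2), under assumptions: a line is modelled here as a list where strings are text runs and floats are the horizontal offsets carried by the spacer spans. The `drop_small_gaps` name, the token model and the threshold value are all hypothetical; nothing here is taken from pdf2htmlEX itself.

```python
def drop_small_gaps(tokens, threshold):
    """Remove offsets smaller than `threshold` and rejoin the text around them.

    Returns the simplified token list and the total offset that was dropped,
    which a caller could try to compensate with letter-spacing or word-spacing.
    """
    simplified = []
    dropped = 0.0
    for tok in tokens:
        if isinstance(tok, str):
            # Glue text onto the previous run when no gap survived in between.
            if simplified and isinstance(simplified[-1], str):
                simplified[-1] += tok
            else:
                simplified.append(tok)
        elif abs(tok) < threshold:
            # Small offset: drop the spacer and remember the accumulated error.
            dropped += tok
        else:
            # Large offset: keep it, it probably carries real layout.
            simplified.append(tok)
    return simplified, dropped


line = ["Hel", 0.2, "lo", 5.0, "wor", -0.1, "ld"]
simplified, dropped = drop_small_gaps(line, threshold=1.0)
```

For the sample line, the two small offsets disappear and the text runs are rejoined, while the large offset between the words is kept. The dropped total is the positional error the simplification introduces, so it gives a measure of how much accuracy is being traded for lighter markup.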

@coolwanglu
Owner

Maybe you can try a larger value for --heps and --optimize-text 1
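
For example, a small helper that assembles such a command line might look like the sketch below. The `build_command` name, the file name and the `--heps` value of 2.0 are placeholders chosen for illustration, not recommended settings; only the two flags themselves come from the suggestion above.

```python
def build_command(pdf_path, heps):
    """Assemble a pdf2htmlEX invocation with text optimization enabled."""
    return [
        "pdf2htmlEX",
        "--optimize-text", "1",  # try to reduce the number of text elements
        "--heps", str(heps),     # larger horizontal threshold, as suggested
        pdf_path,
    ]


cmd = build_command("sample.pdf", 2.0)
# To actually run it (requires pdf2htmlEX to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Trying a few increasing `--heps` values and comparing the output size against the visual result is one way to find an acceptable trade-off for a given PDF.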

@coolwanglu
Owner

I think this is a duplicate of #56, so I'm closing this one.
