This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

HTML optimization #104

Closed
2 of 3 tasks
coolwanglu opened this issue Mar 12, 2013 · 31 comments

Comments

@coolwanglu
Owner

Crocdoc is (once again) a good one to learn from

  • Group a few text lines, and place them with display:block and proper margin-top values
    • Groups are still absolute positioned
    • Check whether this is faster than all-absolute positioning; even if it is not, there will be far fewer margin-top classes than y-position classes
  • For sub/superscripts, use top and relative positioning
    • vertical-align seems to be better
    • This can also prevent overlapping when the font size is not correct (rounded by browsers)
  • Use average letter/word-space when possible
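
To make the class-count argument in the first item concrete, here is a minimal sketch. It assumes each line's top is known in pixels and ignores line height, so the "gap" is only a stand-in for the real margin-top value; the function name is hypothetical and not part of pdf2htmlEX.

```python
def margin_top_classes(line_tops):
    """Return (gaps, distinct gap count, distinct top count) for one group."""
    tops = sorted(line_tops)
    # The first line sits at the group's origin; every later line is placed
    # relative to the line before it, which is what margin-top would encode.
    gaps = [round(b - a, 2) for a, b in zip(tops, tops[1:])]
    return gaps, len(set(gaps)), len(set(tops))


# Five evenly spaced lines: five absolute y values, but a single gap value.
tops = [100.0, 114.5, 129.0, 143.5, 158.0]
gaps, n_gap_classes, n_top_classes = margin_top_classes(tops)
print(gaps, n_gap_classes, n_top_classes)
```

For regularly spaced text the relative gaps collapse to very few distinct values, while every absolute position is unique.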
@jahewson
Contributor

+1, reducing the number of <div>s and <span>s will be a huge boost to performance.

@jahewson
Contributor

Could sub/superscripts use CSS vertical-align with a length ?

@coolwanglu
Owner Author

Oh I didn't know it can take a length as the value. I've just checked the CSS standard, seems to be better than relative positioning.
Issue updated. Thanks!
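
A rough sketch of what the generated markup could look like with a length-valued vertical-align. The helper name and the inline-style output are assumptions for illustration only; the converter may well emit CSS classes instead.

```python
def script_style(baseline_shift_px, font_size_px):
    """Build an inline style for a sub/superscript span.

    A positive shift raises the text above the baseline,
    a negative shift lowers it.
    """
    return f"font-size:{font_size_px:g}px;vertical-align:{baseline_shift_px:g}px"


# A superscript "2" raised by 4.2px, with no relative positioning involved.
print(f'x<span style="{script_style(4.2, 7)}">2</span>')
```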

@Hengjie

Hengjie commented Mar 28, 2013

Yes, I agree: reducing the number of divs will make browser reflow faster.

@iclems

iclems commented Mar 29, 2013

I just discovered the project and would love to get involved, as I worked on similar stuff a year ago. Just a quick hint (maybe you already know this): WebKit does not take decimals into account for values such as letter-spacing. To work around it, multiply all your values by X, then use a CSS transform to scale down by a factor of X, and the decimals do work.
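
A minimal sketch of that trick, assuming a scale factor of 10 and inline styles. The function name is made up for illustration, and this is not how pdf2htmlEX necessarily writes its CSS.

```python
def scaled_text_style(font_size_px, letter_spacing_px, factor=10):
    """Blow every length up by `factor`, then shrink the element back."""
    # Multiplying first keeps fractional spacing large enough to survive
    # the browser's rounding; the transform restores the visual size.
    return (
        f"font-size:{font_size_px * factor:g}px;"
        f"letter-spacing:{letter_spacing_px * factor:g}px;"
        f"transform:scale({1 / factor:g});"
        "transform-origin:0 0"
    )


print(scaled_text_style(12, 0.25))
```

With a 12px font and 0.25px letter-spacing this yields 120px and 2.5px, scaled by 0.1.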

@coolwanglu
Owner Author

@iclems Thanks for the message.

Actually, the scaling trick has been in there since a very early version.

There are still some issues marked as 'need solution' that I have not been able to solve. Maybe you could share some of your thoughts?

@iclems

iclems commented Mar 29, 2013

Thanks! I've been having a look at the project today and I'm now getting familiar with the way things are done. Meeting my old friend Poppler again... I remember having thought about how to properly optimize the background image, how to get a fast enough conversion, etc. Here is a good example of a small PDF (just 1.7 MB) that is very slow to convert and very big once converted: http://clement.wehrung.free.fr/scaling.pdf

I'll probably be able to start focusing on some specific issues next week. By the way, do you have a priority list?

@coolwanglu
Owner Author

OK, Thanks for the PDF. I'll take a look tomorrow.

I'm now trying to reduce the number of <div>s used for positional shifts, by filling in space characters or adjusting word-spacing.
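
Roughly, the idea can be sketched as follows. It assumes the gaps between consecutive words on a line and the font's natural space width are known in pixels; the function name and the tolerance are hypothetical, not taken from the codebase.

```python
def word_spacing_for_line(gaps_px, space_width_px, tolerance_px=0.05):
    """Return one word-spacing value for the whole line, or None.

    Each gap is modelled as a space character plus some extra spacing.
    If the extra is (nearly) the same for every gap, a single CSS
    word-spacing value can replace all the per-word shift elements.
    """
    if not gaps_px:
        return 0.0
    extras = [gap - space_width_px for gap in gaps_px]
    mean = sum(extras) / len(extras)
    if all(abs(extra - mean) <= tolerance_px for extra in extras):
        return round(mean, 3)
    return None  # Gaps differ too much; fall back to explicit shifts.
```

Lines where this returns None still need positional shifts, so the saving depends on how regular the source PDF's spacing is.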

I think you can just pick up any issue you find interesting. I'd recommend #39, which is serious and doable for now. I'm not sure whether you are familiar with clipping paths; I have no experience with them at all.

I'd be glad to explain the codebase and discuss possible solutions with you. Thanks!

@jahewson
Contributor

jahewson commented Apr 2, 2013

Hi @iclems, I have a similar background - familiarity with Poppler, and now starting to make some small contributions to pdf2htmlEX. I'm actually working on #39 at the moment, rather slowly.

This issue - reducing the number of divs - is in my opinion one of the most important because of the impact on performance. I'd recommend trying out some of your typical PDFs and seeing if any features you care about are missing - that's how I ended up adding stroked text.

@coolwanglu
Owner Author

I've finished the word-spacing optimization, and letter-spacing will follow soon.
In my tests in Chrome, this optimization brings about a 10% performance gain.

@jahewson
Contributor

jahewson commented Apr 2, 2013

10% performance gain

Would that be DOM memory, HTML file size, or frame rate?

@coolwanglu
Owner Author

Oh, it was the time for parsing and rendering the entire document (with lazy rendering disabled).

@jahewson
Contributor

jahewson commented Apr 2, 2013

Ok. Btw - I think you should keep the un-optimized text generation mode, and have a flag --optimize-text which is 1 by default, for debugging.

@jahewson
Contributor

jahewson commented Apr 2, 2013

Have you tried looking at the DOM memory in Chrome's Task Manager?

@coolwanglu
Owner Author

Ok. Btw - I think you should keep the un-optimized text generation mode, and have a flag --optimize-text which is 1 by default, for debugging.

Right, I'll add it.

@coolwanglu
Owner Author

Have you tried looking at the DOM memory in Chrome's Task Manager?

No, let me do a comparison of the optimized and non-optimized versions.

@iclems

iclems commented Apr 2, 2013

Hi @jahewson

I have a few concerns for now, and will start thinking today about how I could contribute:

  • Poppler speed: converting a PDF like the one I pointed to above can be very slow, even though the file is small
  • "z-index" issue: the eternal problem with placing everything that is not text in the background shows up once you have "elements" hiding text (a lot of designers do this in InDesign: they hide some text elements with a white square, or put an image on top of them, and never remove the text behind). It's quite complicated to find a solution, as it would involve checking the "visibility" of each object
  • reducing the background image: part of the Poppler speed issue is due, IMO, to rasterizing the big background image. In some cases, I have noticed that a non-transparent background color can lead to generating one big image for each page. Do you know anything about this? Do you consider it an issue? A year ago I also worked on a custom approach that cut the background image into smaller non-empty images, which were then just positioned absolutely. What do you think?
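
One way the tile-cutting idea in the last item could look, as a sketch only. It assumes a fixed tile grid and a single background value, with the page given as rows of pixel values; the custom approach mentioned above may have worked quite differently.

```python
def non_empty_tiles(pixels, tile, background=0):
    """Return (x, y, width, height) for every tile that has content.

    `pixels` is a list of rows. Tiles containing only the background
    value are skipped, so they never need to be stored or positioned.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    tiles = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            rows = pixels[y:y + tile]
            if any(p != background for row in rows for p in row[x:x + tile]):
                # Clamp the size for tiles on the right and bottom edges.
                tiles.append((x, y, min(tile, width - x), min(tile, height - y)))
    return tiles
```

Each returned rectangle would become one small image with absolute left/top offsets, and a uniform background color could then go into CSS.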

@coolwanglu
Owner Author

Comparison with demo.pdf. It is a scientific paper, which should benefit most from the optimization.

_yes is with the optimization and _no is without.

[Screenshot: Selection_003]

loading time:
about 2s for _yes and about 2.7s for _no

@coolwanglu
Owner Author

@jahewson what does proportional memory (the last column) mean?

@coolwanglu
Owner Author

@iclems

Indeed, pdf2htmlEX is very slow at converting your sample PDF. There are too many pages for it. I've just checked pdftohtml from Poppler, which is able to process the same file very fast. I'll try to find the cause.

One possible solution is to use multiple threads, since rendering the background image of each page is independent of the other pages. Fortunately, Poppler has become thread-safe in a recent version.
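
The structure would be something like the sketch below. `render_background` is a hypothetical stand-in for rasterizing one page and does not call Poppler; the point is only that pages can be handed to a pool and collected in order.

```python
from concurrent.futures import ThreadPoolExecutor


def render_background(page_number):
    # Hypothetical placeholder: a real version would rasterize the page
    # and write the image file, returning its name.
    return f"bg{page_number}.png"


def render_all(page_count, workers=4):
    """Render every page's background on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in page order even when pages finish
        # out of order, so the output list lines up with page numbers.
        return list(pool.map(render_background, range(1, page_count + 1)))


print(render_all(3))
```

Whether this helps in practice depends on how much of the per-page work can really run in parallel.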

The visibility test is indeed even harder than #39, where we may simply estimate the clipping path as a rectangle. I've been thinking about this, but have no good idea so far. Maybe we could approximate each object by its bounding box and test the visibility in the preprocessor.
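
A sketch of the bounding-box estimate, under strong assumptions: every object is reduced to an axis-aligned box, every shape is treated as fully opaque, and text only counts as hidden when a later shape covers it completely. All names here are hypothetical.

```python
def covers(outer, inner):
    """Boxes are (x0, y0, x1, y1). True if `outer` fully contains `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def visible_text(objects):
    """Return the boxes of text objects that are not hidden.

    `objects` is a list of (kind, box) in drawing order, where kind is
    "text" or "shape". Only shapes drawn after a text can hide it.
    """
    result = []
    for i, (kind, box) in enumerate(objects):
        if kind != "text":
            continue
        hidden = any(k == "shape" and covers(b, box) for k, b in objects[i + 1:])
        if not hidden:
            result.append(box)
    return result
```

Partial overlaps, transparency, and non-rectangular shapes are all ignored, so this would only catch the simple "white square over text" case.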

About cutting the background image: that sounds intuitive and useful. How did you do that?

Actually, I've tried dumping every image object in the PDF and putting them directly into the HTML. But it did not work because of clipping paths, and there may also be other drawing objects. I also tried to at least detect whether there is anything on the background (there is a bg_integrate branch, which has not been maintained for a while). That did not work well either, since a simple header/footer makes the background non-empty.

In the bg_integrated path, I also attempted to employ SVG for the visibility issue, but it turned out to be too complicated for me. Crocdoc seems to support rendering in SVG now; I never succeeded in viewing the results though, as they always froze my browsers.

@iclems

iclems commented Apr 2, 2013

@coolwanglu

Thanks for the long reply :) Could I have your email address so I can send you a link to some source code?

I think the visibility test is not the top priority. The priorities are most probably:

  1. fixing the background issue, which both increases the generation time and makes the page much heavier than required (best would be to put the background color in CSS and have "per image" absolute positioning; otherwise, a quick compare would help reduce the file size, as most probably a lot of background images will be identical if they only contain the background color)
  2. improving generation speed (which may improve a lot with item 1)
  3. testing fonts (at that time, I had a lot of pain with specific font issues)
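
The "quick compare" from the first item could be as simple as hashing each page's background bytes. This is a sketch assuming the images are available in memory and that identical backgrounds are byte-identical; the function name is made up.

```python
import hashlib


def dedupe_backgrounds(page_images):
    """Return (unique_images, page_to_index).

    `page_images` holds one bytes object per page. Pages whose
    background bytes are identical share a single stored image.
    """
    index_by_digest = {}
    unique = []
    mapping = []
    for data in page_images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in index_by_digest:
            index_by_digest[digest] = len(unique)
            unique.append(data)
        mapping.append(index_by_digest[digest])
    return unique, mapping
```

Pages that only differ by encoder noise would not match this way, so it mainly helps for plain color backgrounds.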

@jahewson
Contributor

jahewson commented Apr 2, 2013

@jahewson what does proportional memory (the last column) mean?

@coolwanglu the columns should be:

  • Resident: Amount of memory that is present in physical RAM.
  • Shared: Amount of memory that is present in physical RAM and can be shared with another process.
  • Private: Amount of memory that is present in physical RAM and can not be shared with another process.
  • Virtual: Amount of address space allocated in virtual memory.

The most important value is Resident, which is the first column. So you're seeing a 23% reduction in RAM with your optimizations - great! (93MB -> 72MB)

@coolwanglu
Owner Author

@iclems My email is available in README

@jahewson
Contributor

jahewson commented Apr 2, 2013

@iclems, yep these are tricky issues:

  • "z-index" issue: the eternal problem with placing everything that is not text in the background shows up once you have "elements" hiding text (a lot of designers do this in InDesign: they hide some text elements with a white square, or put an image on top of them, and never remove the text behind). It's quite complicated to find a solution, as it would involve checking the "visibility" of each object

It could be done by sending all the drawing commands to a polygon clipper, and pruning any text which gets drawn over (where the text rectangle intersects the drawing polygon). It's a very big job.

Alternatively, if each drawing command was rendered to a separate transparent PNG image, then the problem goes away, as does the problem below.

  • reducing the background image: part of the Poppler speed issue is due, IMO, to rasterizing the big background image. In some cases, I have noticed that a non-transparent background color can lead to generating one big image for each page. Do you know anything about this? Do you consider it an issue? A year ago I also worked on a custom approach that cut the background image into smaller non-empty images, which were then just positioned absolutely. What do you think?

[...] best would be to be able to put the background color in CSS and have a "per image" absolute positioning

"per image" absolute positioning, for image objects that's fine, but what about paths? These would need to be rendered into separate images, it could be done.

The simplest approach might be to keep track of the min/max x and y values used for drawing, and crop the background to that size.
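
As a sketch of that min/max tracking, assuming each drawing operation reports an axis-aligned box in page coordinates (the function and parameter names are illustrative only):

```python
def crop_box(drawn_boxes, page_box):
    """Union of all drawn boxes, clamped to the page.

    Boxes are (x0, y0, x1, y1). Returns None when nothing was drawn,
    in which case no background image is needed at all.
    """
    if not drawn_boxes:
        return None
    x0 = max(page_box[0], min(b[0] for b in drawn_boxes))
    y0 = max(page_box[1], min(b[1] for b in drawn_boxes))
    x1 = min(page_box[2], max(b[2] for b in drawn_boxes))
    y1 = min(page_box[3], max(b[3] for b in drawn_boxes))
    return (x0, y0, x1, y1)
```

A header plus a footer would still yield nearly the whole page, so this helps most when the drawing is confined to one region.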

@coolwanglu
Owner Author

@jahewson I wonder if per-path images would introduce too much overhead. For example, why do people use CSS sprites? I think maybe we need some clustering algorithm.

About the polygon clipper: do you know any lightweight geometry libraries, for example CGAL?

Also, image objects can be clipped as well, and thus cannot be directly dumped and inserted into HTML.

@jahewson
Contributor

jahewson commented Apr 2, 2013

I wonder if per-path images would introduce too much overhead.

There's only one way to find out...

For example, why do people use CSS sprites?

Because they look good on retina displays, and scale well with zoom. I don't think that size or overhead are the reasons people choose CSS sprites.

@jahewson
Contributor

jahewson commented Apr 2, 2013

About the polygon clipper: do you know any lightweight geometry libraries, for example CGAL?

http://www.angusj.com/delphi/clipper.php

@coolwanglu
Owner Author

Looks great, but it seems that Bezier curves are not supported.

Bezier curves might be used in clipping paths and drawing objects.

hmm..

@coolwanglu
Owner Author

@iclems
Further tests suggest that pdftohtml is not so fast after all.

Previously I was not using the -c parameter, so images were not processed carefully.
With the -c parameter, the speed of pdftohtml is similar to that of pdf2htmlEX (with the same scaling).

I guess this is the best Poppler can do (with the current parameters).

@coolwanglu
Owner Author

Just realized that #64 is about visibility test

@coolwanglu
Owner Author

The first item does not seem to bring any performance improvement. Probably the only good thing about it is that it might prevent vertical overlapping caused by browsers rounding font sizes, which has never happened to me.

I've created HTMLTextPage, which allows future optimizations, but the remaining work seems dull to me.

The last 2 items have been implemented and do improve performance.


4 participants