
LaTeXML speedups and profiling #480

Closed
dginev opened this issue Apr 26, 2014 · 13 comments

@dginev (Collaborator) commented Apr 26, 2014

There is a new TeX-to-XML formula conversion paper whose primary argument is that math conversion via LaTeXML is decidedly too slow, compared to their MathJax-based approach.

While there is a good reason why LaTeXML shouldn't be as fast (it does more!), I think we have not yet hit the best of our game in terms of speed; indeed, our optimization efforts so far have only hit the obvious nails. So I am making a meta ticket to plan ahead some strategies for making things faster, while still remaining within the bounds of reason.

Here is a list that I wrote up while profiling a daemonized math conversion job today:

  • Think about moving core components into native C code, via XS wrappers, for a real performance impact. The philosophy that is starting to form in my mind is:

    high performance core with highly expressive periphery

    In other words, think of porting the most important core classes to C, but keep the less frequently invoked pieces in Perl (such as bindings, KeyVals, and other specialized code).

  • Perl's LibXML wrapper has some inefficiencies:

    • We should use _getAttribute() (with the underscore) when requesting attributes, since over 50% of the runtime of the user-facing getAttribute() is spent figuring out whether we are asking for an actual attribute or a namespace declaration (values that start with xmlns:). That overhead piles up, since the method is called extremely often (see the micro-benchmark sketch after this list).
    • It is somewhat heart-breaking to see that the DESTROY method for XML::LibXML::Element takes a factor of ten longer to delete the Perl tie-cache used for the wrapper than it takes to free the underlying C data structure. I have no suggestion for avoiding that, but it is worth remarking on (it is among the top subroutines reported by NYTProf).
      • In truth, it makes me wonder whether it wouldn't be sensible to rewrite LaTeXML::Core::Document, and possibly LaTeXML::Common::XML, in C and avoid using XML::LibXML altogether.
    • This is a good place to start getting scared of my suggestions :-)
  • The math profile can be slimmed down, e.g. by turning off the scan and crossref post-processors (they're currently enabled by default).

  • The vast majority of the time for processing a single formula is naturally spent in the grammar, so we could consider switching to an XS-wrapped interface to a Marpa grammar of the same coverage. That grammar would need to be properly evaluated too, of course, so that we know the switch is sensible.
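
As a quick sanity check of the getAttribute point above, a micro-benchmark along the following lines should reproduce the gap. This is only a sketch: _getAttribute is an internal XML::LibXML method, so its exact signature may vary between versions.

use strict;
use warnings;
use Benchmark qw(cmpthese);
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<root><el class="x"/></root>' );
my ($node) = $doc->documentElement->childNodes;

cmpthese( -3, {
  # public path: pays for the xmlns-prefix check on every call
  public   => sub { $node->getAttribute('class') },
  # internal XS path: skips the namespace dispatch (private API!)
  internal => sub { $node->_getAttribute('class') },
} );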

So, with the particular focus of making single-formula conversion fast, we need to polish the performance of the main actors involved. Profiling currently shows that two of them are the main bottlenecks: the grammar, and document construction and manipulation. I will look into a TikZ conversion next, to contrast what would be vital there, and will possibly also profile an arXiv document. When I approach individual optimization tasks, I intend to open individual tickets (but not sooner).
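
For reproducibility, this is roughly how such reports get generated: run the converter under Devel::NYTProf and render the HTML report (the file names here are only placeholders).

$ perl -d:NYTProf $(which latexml) --destination=sample.xml sample.tex
$ nytprofhtml --open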

And as always - discussion is more than welcome!

@dginev added this to the LaTeXML-0.8.1 milestone Apr 26, 2014
@dginev self-assigned this Apr 26, 2014

@dginev commented Apr 27, 2014

The promised TikZ profiling (with a single sample graphic from TeXample). Note that nothing is preloaded in this run, which converts a single file once; in the math profiling above, I ran 20 formula conversions with the math profile preloaded. Here are the details:

  • As expected, 0 overlap in terms of bottlenecks with the math conversion use case.
  • The summary is that "native interpretation of raw TeX code is slow". Details:
    • 5 million calls to LaTeXML::Core::State::lookupDefinition in 29.7s exclusive time.
    • 1.3 million calls to LaTeXML::Core::Gullet::readXToken in 19.4s exclusive time. Similarly, 5 million calls to readToken took 10s.
    • 75 thousand calls to LaTeXML::Core::Definition::Conditional::skipConditionalBody took 13.9s.
    • The remaining top subs in exclusive time each take less than 10 seconds, and visibly all relate to raw TeX munging (Expandable::doInvocation, Object::Equals, Parameter::read, Gullet::readMatch, Gullet::readBalanced, Token::getExecutableName).
  • Also interestingly, there were 72 calls to kpsewhich (my guess: to load the TikZ libraries), which took 6 seconds. I am hoping that preloading TikZ and all its libraries may resolve that issue (and some of the others?); an invocation is sketched below.
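
For reference, the preloading experiment would be invoked along these lines (the file names are placeholders, and whether --preload=tikz.sty actually pulls in all the needed TikZ libraries up front is exactly the open question):

$ latexml --preload=tikz.sty --destination=graphic.xml graphic.tex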

Given that we have already looked into and talked about these exact routines in State and Gullet, the scary conclusion may be that the only real way forward in terms of optimization is to rewrite State and Gullet in an efficient C equivalent.

An alternative, higher-level approach would be to ask ourselves: is there a way to avoid making 5 million calls to readToken and lookupMeaning in the first place? What does that tell us about the flow of the conversion in a TikZ example, and why can the classic TeX interpreter deal with it quickly? A simple answer is that a subroutine call in C is extremely cheap, while a subroutine call in Perl is quite expensive, and at the scale of millions of calls the difference becomes quite noticeable. But my understanding is still very partial.


@dginev commented Apr 27, 2014

Oh, for the love of runtime! I was left gawking at the sheer performance hit one suffers from doing anything 5 billion times in Perl. And that is a plausible number of subroutine calls for, say, a 100-page TeX book.

Here is an example:

$ time perl -e 'sub f {}; for (1..5_000_000_000) {f();}'

real    9m49.537s
user    9m49.594s
sys 0m0.396s

To dispel any doubts, here is the empty loop of 5 billion iterations:

$ time perl -e 'sub f {}; for (1..5_000_000_000) {}'

real    2m11.297s
user    2m11.220s
sys 0m0.135s

So, 5 billion subroutine calls in this case took over 7 minutes of pure call overhead (9m49s − 2m11s ≈ 7m38s, i.e. roughly 90ns per call)?? Ouch.

Compare that to the following C program:

#include <stdio.h>

void f() {
  return; }

int main() {
  unsigned long i;
  unsigned long j = 0;
  for (i = 0; i < 5000000000; i++) {
    j += 2;
    f();
  }
  /* j is 10^10 here; the cast to int overflows, hence the 1410065408 printed below */
  printf("%d\n", (int)j);
  return 0; }

compiled and executed as:

$ gcc -Wall -O3 csub.c -o csub.o
$ time ./csub.o 
1410065408
real    0m0.002s
user    0m0.001s
sys 0m0.001s

I added the j increment just to convince myself it was actually doing something in that instant of time it took to finish!!

I mean, almost 10 minutes compared to 0.002 seconds!

I ended up making a mini benchmark with some other programming languages:
(ordered best to worst)

  • C with -O3 or -O2 - 0.002s

Edit: @cprodescu pointed out that the true runtime with -O3 can be seen by putting the empty function in a separate file, so that the compiler cannot optimize the call away. He reported an 8.4s runtime in that case. That may imply that the Java and NodeJS figures need to be adjusted accordingly as well.

  • C with -O1 - 1.6s
  • Java (long) - 1.7s, with almost no extra overhead for the subroutine calls
  • NodeJS - 8.5s, with almost no extra overhead for the subroutine calls
  • C with -O0 - 12.8s
  • Java (Long) - 16s, due to casts between the Long class and the native long.
  • Ruby - 4m and 48s, 3m of which for the subroutine calls
  • PHP - 6m and 16s, 4m and 17s of which for the subroutine calls
  • Python - 7m and 30s, 5m of which for the subroutine calls
  • Perl - almost 10m, as explained above
  • Perl 6 - didn't terminate in 16 min, sigh

@dginev commented Apr 27, 2014

My main point by now is that simply rewriting the State and Gullet modules more or less verbatim (with the needed adaptations) in C should provide a massive speedup when compiled with -O3 (even if not as massive as in the mock benchmark, where the compiler clearly inlined the entire loop, hence the instant execution). I will try to play around with the idea at some point soonish.


@brucemiller (Owner) commented Apr 27, 2014

Could be, though I worry about call-through costs. I know that the architecture sort of pushes me to look up definitions more often than I'd like. Potential optimizations: saving definitions for later use? Faster is-defined queries? All of the above?

Hard to type, but: take a look at NTS, the New Typesetting System, an attempt to rewrite TeX as structured, modular, extensible Java. By the time it passed the stress tests it wasn't structured anymore! Sound familiar?


@dginev commented May 2, 2014

I see your point about NTS. That's why I would like to keep any rewrites sensibly limited. I think LaTeXML is somewhat close to actually winning the fight for a structured TeX engine, since your design keeps the processing compartmentalized and sensible. I am in no way thinking of changing the model, just of turning the appropriate pieces into faster gears.

And looking up definitions seems like an excellent candidate for trying this out. The issue there is that Perl's dispatch is slow, because subroutine calls bring along a big overhead; native C dispatch, on the other hand, is quite efficient, especially if we inline the routines.
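
To make your "saving definitions" idea concrete, a Perl-level cache could look roughly like the sketch below. The names (cached_lookup_definition, %definition_cache, $generation) are all hypothetical, not LaTeXML API, and the sketch deliberately punts on the hard part: invalidating correctly as group scopes open and close.

# Hypothetical sketch of memoized definition lookups; not actual LaTeXML code.
my %definition_cache;    # control sequence name => [generation, definition]
my $generation = 0;      # bumped whenever any binding changes

sub cached_lookup_definition {
  my ($state, $token) = @_;
  my $name  = $token->getCSName;
  my $entry = $definition_cache{$name};
  return $entry->[1] if $entry && $entry->[0] == $generation;
  my $defn = $state->lookupDefinition($token);    # the expensive call
  $definition_cache{$name} = [$generation, $defn];
  return $defn; }

# Every \def, \let, grouping change, etc. would have to invalidate,
# coarsely, by bumping $generation.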

The challenge then is to first evaluate the overhead of calling XS methods (I am trying to quantify that now). If it turns out to be too noticeable, we would have to "submerge" a larger chunk of the processing into pure C land, to avoid frequent alternation between the levels. I don't want to commit to particulars on that front just yet. (And of course, the alternative is to give up on C-based optimization.)

I am hoping the XS invocation overhead isn't that significant, since it is seemingly not visible in the profiling -- all top reported routines are busy in pure-Perl land. And there were already millions of calls into the C-level libxml2 in the tests I made.

I'll come back with some XS figures, once I properly get the hang of this low-level dance.


@dginev commented May 5, 2014

As promised, here is a mini benchmark comparing pure-Perl with Perl+XS calls.

I am rerunning on just my battery at the airport, so I had to regenerate all the Perl-specific tests in order to get reliable ratios for pure Perl vs. XS. Time is in seconds.

(For clarity: the code blocks are executed from a Perl test file, and there is a real XSBench.pm module that points to XSBench.xs.)

Description              | Time | Perl Code
-------------------------|------|-----------------------------------------------
Empty iterations         | 130s | for (1..5_000_000_000) {}
Pure Perl, function call | 530s | sub my_f {} for (1..5_000_000_000) { my_f(); }
Perl+XS, function call   | 474s | for (1..5_000_000_000) { XSBench::f(); }
XS, loop + calls         | 13s  | XSBench::loop_f();

This goes to disprove the idea that switching between XS and Perl is costly; on the contrary, a trivial XS call is still cheaper than invoking a Perl function. Here we save close to a minute over 5 billion iterations. That is probably not, in itself, compelling motivation for an XS rewrite, but that was not the point of the comparison. What I think I have established is that trivial XS function calls are slightly cheaper than native Perl calls. And note that I measure the time for the full invocation chain: through the XSBench.pm module, to XSBench.xs, and then to the call of the function f in bench.c.

Of course, this holds for transitions that have no packaging overhead. Converting between Perl and C data structures brings its own cost, especially when it is non-trivial. But I suspect we can dance around that nicely, as long as we keep passing pointers around.

For full reference, here is my XS file:

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include "ppport.h"

#include "bench.h"

MODULE = XSBench               PACKAGE = XSBench              

PROTOTYPES: ENABLE

int
f()
  CODE:
    f();
    RETVAL = 1;
  OUTPUT:
    RETVAL

int
loop_f()
  CODE:
    /* same loop as in csub.c; note that j overflows the int RETVAL, as in the C example */
    unsigned long i;
    unsigned long j=0;
    for (i=0;i<5000000000;i++) {
      j+=2;
      f();
    }
    RETVAL = j;
  OUTPUT:
    RETVAL

Note the calls to f(), which is declared in a trivial bench.h header and is the trivial empty C function: void f() {}. I used -O3 optimizations in this benchmark, but the loop was not optimized away and was fully executed (that is the main purpose of keeping f behind the separate bench.h header: the compiler cannot rewrite the function call, since its definition is not available before linking).

Two conclusions:

  • [Obvious] The more logic we can submerge into C, the faster we will get (contrast the 13 seconds needed above, when the full loop+call logic is in C).
  • XS function calls are (a little) cheaper than pure-Perl function calls, as long as we are not doing advanced repackaging of data structures. This gives hope that partial XS rewrites of LaTeXML's logic may result in profitable speedups.

@dginev commented Jul 11, 2014

In case anyone is following this issue: I am getting somewhere (though not there yet) with a C and XS rewrite of State.pm. You can find the current state at:

https://github.com/dginev/LaTeXML/tree/XSive_optimization

@dginev changed the title: "LaTeXML speedups and profiling, 0.8.1 edition" → "LaTeXML speedups and profiling" Aug 13, 2014
@dginev modified the milestones: LaTeXML-0.8.1, Future Aug 13, 2014

@dginev commented Aug 13, 2014

I am renaming the issue and moving the milestone to Future, since my XS rewrite branch ended up jammed and I ran out of time to devote to it. I will be happy to return to investigating here, but most likely not before the next release.

@dginev modified the milestones: LaTeXML-0.8.2, Future Nov 14, 2014

@dginev commented Nov 14, 2014

I am moving this to the 0.8.2 milestone. Since I started working at Authorea, I have felt very motivated to pitch LaTeXML as our default TeX-to-HTML converter, but I already know there is a big dead end down that street at the moment: our team considers LaTeXML too slow for large-scale use, compared to Pandoc (the powers of GHC compilation!).

So, if I had one Christmas wish, it would be to make a big dent on the speedups front. Coincidentally, I will likely have time exactly around Christmas to look at this. Hence the 0.8.2 milestone to keep me motivated.


@dginev commented Jun 14, 2015

I'll bump this to 0.8.3; it can wait for the next release.

@dginev modified the milestones: LaTeXML-0.8.3, LaTeXML-0.8.2 Jun 14, 2015

@dginev commented Jul 6, 2015

Since we got to discuss this a little in a recent email thread: one idea to keep in mind is trying out a different submersion path where, instead of XS+C, we use FFI+Rust, as described in this nice blog post:
http://paul.woolcock.us/posts/rust-perl-julia-ffi.html

I expect that this path may be a lot cleaner, but I am unsure whether FFI isn't even more expensive to call through. There are a lot of advantages to using Rust over C, and I would like to investigate that route. The trickiest bit with either approach is how we handle complex datatypes, as they are one of the big sources of overhead between the layers.

But TeX processing keeps bouncing back and forth between Mouth, Gullet and Stomach, which guarantees call-through costs even if we entirely submerge one of the three into Rust. That is another difficulty in designing the optimizations.


@dginev commented Jul 6, 2015

Speaking of FFI vs XS, I found an interesting benchmark, followed by an interesting comment thread, pointing out the benefits of the cutely-named FFI::Platypus:
https://gist.github.com/calid/17df5bcfb81c83786d6f
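
For reference, the Platypus route would look roughly like this on the Perl side. This is a minimal sketch, assuming a Rust cdylib that exports an empty function via #[no_mangle] pub extern "C" fn f() {}; the library path is a placeholder.

use strict;
use warnings;
use FFI::Platypus;

# Attach the no-op f() from the Rust cdylib (path is a placeholder).
my $ffi = FFI::Platypus->new;
$ffi->lib('./target/release/libbench.so');
$ffi->attach( f => [] => 'void' );

# The same kind of call-through loop as in the XS benchmark above:
f() for 1 .. 5_000_000;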


@dginev commented Sep 5, 2017

I think this issue has lived too long at this point and has fulfilled its initial purpose. There are some underwater efforts related to it, but it is safe to say they may be larger in scope. Better to close here than to keep a long-term "be faster" reminder around, which is not terribly constructive.

@dginev closed this Sep 5, 2017