How come it's so much faster than lodash? #137

Closed
rpominov opened this issue May 13, 2015 · 25 comments

Comments

@rpominov (Contributor)

Hi!

I'm really impressed with the performance. Test results are just incredible! And I can fully accept that it's much faster than other RP libs. Being the author of one, I know how much happens on each event dispatch, and that theoretically it can all be boiled down to a single function call.

Still, I don't understand how it can be faster, and so much faster, than lodash, which doesn't have to care about asynchrony, multiple subscribers, etc. Does Most do the same job as lodash, i.e. can we use Most instead of lodash?

Thanks for any insights!

@briancavalier (Member)

Hey @pozadi! Great questions :)

Test results are just incredible!

Thanks. They are micro-benchmarks, so the intent is only to provide a feel for how much overhead each lib is likely to introduce. The difference in real applications will vary. The tests try to give each lib the best chance to run as fast as possible. (If there's a better way to set up the tests for kefir, please let me know!)

Does Most do the same job as lodash, i.e. can we use Most instead of lodash?

I definitely don't recommend most as a replacement for lodash. Lodash is the way to go for synchronous collection processing. Most is intended for event-based reactive programming.

Thanks for any insights!

Most is designed to be extremely optimization- and inlining-friendly. That's probably the biggest factor, since optimized/inlined code can easily be 100x faster than non-optimized code. So, another way to think about this question is: what are other libs doing that is preventing them from being as fast as they could be?

For example, we've tried really hard to ensure that the internal combinator objects (e.g. Map, Filter, etc.) have properties that can always be accessed via fast offsets, and never get turned into hashtables, and never change shape after construction to avoid polymorphism penalties. We've kept methods and functions tiny for inlining, and tried to ensure they won't be deoptimized for type check failures, etc, etc.
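
The pattern looks roughly like this (an illustrative sketch, not most's actual source; the real FilterMapSink differs):

// Illustrative sketch, not most's actual source: a combinator sink
// whose instances always have the same shape. All properties are
// assigned in the constructor, in the same order, and never added,
// removed, or changed in type afterwards, so the hidden class stays
// stable and property access compiles to fixed offsets.
function FilterMapSink(p, f, sink) {
    this.p = p;       // filter predicate
    this.f = f;       // map function
    this.sink = sink; // downstream sink
}

// Kept tiny so the JIT can inline it into the producer's loop.
FilterMapSink.prototype.event = function (x) {
    if (this.p(x)) {
        this.sink.event(this.f(x));
    }
};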

If you run code like the following in iojs 2.0 and turn on --trace_opt, --trace_inlining, and/or --trace_deopt:

var most = require('most');

// Helpers referenced below (and in the trace output); definitions
// assumed from context.
function even(x) { return x % 2 === 0; }
function add1(x) { return x + 1; }
function sum(x, y) { return x + y; }

var n = 1000000;
var a = new Array(n);
for (var i = 0; i < a.length; ++i) {
    a[i] = i;
}

function log(x) {
    console.log(x);
}

most.from(a).filter(even).map(add1).reduce(sum, 0)
    .then(log);

you'll see that nearly everything gets optimized and/or inlined, and then never gets deoptimized (there's probably an initial deopt, as v8 figures out some stuff). For example, if you just turn on --trace_inlining, at the end of the output, you'll see something like:

Inlined sum called from AccumulateSink.event.
Inlined noop called from Observer.event.
Inlined Observer.event called from AccumulateSink.event.
Inlined even called from FilterMapSink.event.
Inlined add1 called from FilterMapSink.event.
Inlined sum called from AccumulateSink.event.
Inlined noop called from Observer.event.
Inlined Observer.event called from AccumulateSink.event.
Inlined AccumulateSink.event called from FilterMapSink.event.
Inlined even called from FilterMapSink.event.
Inlined add1 called from FilterMapSink.event.
Inlined sum called from AccumulateSink.event.
Inlined noop called from Observer.event.
Inlined Observer.event called from AccumulateSink.event.
Inlined AccumulateSink.event called from FilterMapSink.event.
Inlined FilterMapSink.event called from produce.
250000000000

Nearly everything got inlined and never had to be deoptimized before the final result was printed. If you try something similar with other libs, including lodash, you'll see that the VM either isn't able to inline as much, or ends up having to deopt a lot.

There are other factors as well. There is almost no scope chain capturing, which can be surprisingly expensive. Try/catch has been isolated to very tiny helper functions (many (all?) current generation VMs can't optimize try/catch, but future VMs will be able to, e.g. v8 Turbofan). The map and filter fusion helps somewhat (interestingly, it can also hurt performance in certain cases, but seems to be a win on average). We also optimize for the single observer case, since that's by far the most common.
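
For example, the try/catch isolation pattern looks roughly like this (a sketch with hypothetical names, not most's actual source):

// Hypothetical helper, not most's actual source: the try/catch is
// confined to a tiny function, so the larger hot functions that call
// it contain no try/catch themselves and stay optimizable on VMs that
// refuse to optimize functions containing try/catch.
function tryEvent(x, sink) {
    try {
        sink.event(x);
    } catch (e) {
        sink.error(e);
    }
}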

Some of this stuff has proven to be completely unintuitive, and we've spent a good amount of time profiling code to find and fix performance issues, and to compare various approaches.

@rpominov (Contributor, Author)

Thanks, this is extremely helpful! I've never tried to use tools to detect deopts/inlining; I thought they were too complex for me (e.g. would require knowledge of assembly or something), but calling iojs with flags looks like something I can handle :) Going to try it very soon. Are there any other profiling tools you've used and can recommend?

if there's a better way to set up the tests for kefir, please let me know!

Doesn't look like it :)

I definitely don't recommend most as a replacement for lodash.

Yes, I didn't mean to actually use most in place of lodash. I wanted to ask whether the tests show that there are opportunities for optimization in lodash, or whether the results are expected because, for example, lodash builds an array as the final result while most just calls the subscriber with each value, or something like that.

The map and filter fusion helps somewhat (interestingly, it can also hurt performance in certain cases, but seems to be a win on average).

I think I tried fusion at some point, but it didn't gain much, so I decided not to add the complication to the code, if I remember right.

We also optimize for the single observer case, since that's by far the most common.

That's interesting. Could you tell me more about that, or give a link to the relevant code? I've tried something like:

if (subscribers.length === 1) {
  subscribers[0]();
} else {
  // the loop...
}

But it turned out that a loop over a single-element array works just as fast (which is pretty intuitive, anyway).

@jdalton commented May 13, 2015

Hiya!

The --trace_inlining of Most is most impressive and I agree with what @briancavalier wrote.

For a bit more on the inlining topic from the lodash side, see this thread on jsblocks. The TL;DR is that the lodash chaining implementation of shortcut fusion relies on a loop which calls a list of callbacks. This prevents inlining unless the engine supports polymorphic function inlining. Chakra, the JS engine in Microsoft Edge, supports this kind of inlining up to something like 4 functions. The win here, though, isn't really the function inlining, as that gets busted pretty easily when you go beyond trivial add & sum examples; it's the use of shortcut fusion to reduce overall iteration:

_(large100kArray).map(add).filter(even).take(50).value();
// => 100 iterations performed instead of 200,000 iterations

The non-chained form of lodash will benefit from inlining but won't get the bigger win of shortcut fusion.

lodash's chaining syntax has some baggage because it tracks normal chaining and lazy chaining, to be able to bail out to normal chaining for unsupported value types/callback signatures or mixed (lazy + regular) method sequences. Because lodash's chained execution is deferred until an implicit or explicit value() is requested, you can store a reference to the sequence to call later:

var seq = _([]).map(add).filter(even).take(50);
// then later, maybe in another function
return seq.plant(newArray).value();

This is the gist behind lodash's composed performance being a bit better.

For a different take there's jsblocks which uses method compilation to transform sequences in a way to benefit from inlining as well.

As a side note, engines are starting to optimize the try block of try/catch. I know FF and Chakra (in MS Edge) have optimizations to avoid deopts until an error is thrown.

@unscriptable (Member)

Hey @jdalton, very happy to see the King of Optimization is on the case! :)

@briancavalier (Member)

Wow, thanks for all the info, @jdalton. That's the first time I've seen jsblocks' compilation. Very interesting stuff; I need to take a closer look.

Yeah, I agree that the inlining won't always kick in to the degree it does in some of the perf tests. When it does, though, the benefit seems to be pretty good. When it doesn't, most is still getting highly optimized and almost never deopts.

it's the use of shortcut fusion to reduce overall iteration:

For sure. Pipelining provides a huge win over non-pipelined processing (e.g. Array). I think all of the libs are pipelining in all of the perf tests. The reactive libs basically have to work that way, since they deal with infinite asynchronous event sequences and can't wait for the last event before moving on to the next step, and they don't need to build large intermediate data structures (I don't know much about the internals of highland, but I'm guessing it pipelines, too). All the libs should be able to take advantage of, for example, the filter in the filter-map-reduce test, which cuts the number of iterations in half.
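
To make that concrete, here's a hand-fused sketch (illustrative only, reusing the even/add1/sum helpers from the earlier example):

// Non-pipelined (plain Array): each step materializes and re-walks a
// full intermediate array.
var nonPipelined = a.filter(even).map(add1).reduce(sum, 0);

// Pipelined (hand-fused): each element flows through the whole chain
// in one pass, with no intermediate arrays.
var acc = 0;
for (var i = 0; i < a.length; ++i) {
    if (even(a[i])) {
        acc = sum(acc, add1(a[i]));
    }
}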

Since they're all pipelining, it's likely a confluence of other things that are contributing to most's perf, and optimization is probably a big one of those.

As a side note, engines are starting to optimize the try block of try/catch. I know FF and Chakra (in MS Edge) have optimizations to avoid deopts until an error is thrown.

Oh cool, I didn't know about FF and Chakra Edge having those optimizations already. TIL :)

@briancavalier (Member)

@pozadi

lodash builds an array as the final result while most just calls the subscriber with each value

I haven't looked closely at lodash's pipelining, and specifically at how reduce fits into it. My guess is that lodash never has to materialize any intermediate arrays in order to compute the final reduced value. Maybe @jdalton can shed more light.

I think I tried fusion at some point, but it didn't gain much, so I decided not to add the complication to the code

I'm also not completely convinced it's a worthwhile tradeoff, but the code isn't too bad, so I'm willing to leave it in for a while and see how it performs over time.

That's interesting. Could you tell me more about that, or give a link to the relevant code?

Basically, by default, most doesn't keep subscription lists at each node. Multicast subscriptions are opt-in (some stream types enable shared subscriptions by default, like most.create) and are implemented via composition. I'm honestly not sure yet whether that will be the way we go in the long term.
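
The general shape is something like this (an illustrative sketch, not most's actual source):

// Illustrative sketch, not most's actual source. By default, dispatch
// is a direct call into exactly one sink; there is no subscriber list
// to allocate or loop over.
function directDispatch(sink, x) {
    sink.event(x);
}

// Multicast is opt-in via composition: a wrapper that looks like a
// single sink to the source but fans each event out to many sinks.
// Only code that needs sharing pays for the list and the loop.
function MulticastSink() {
    this.sinks = [];
}

MulticastSink.prototype.add = function (sink) {
    this.sinks.push(sink);
};

MulticastSink.prototype.event = function (x) {
    for (var i = 0; i < this.sinks.length; ++i) {
        this.sinks[i].event(x);
    }
};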

@jdalton commented May 15, 2015

I haven't looked closely at lodash's pipelining, and specifically at how reduce fits into it

lodash doesn't currently support pipelining for reduce or reduceRight. A list of the methods which support lazy evaluation optimizations can be found here.

I'm also not completely convinced it's a worthwhile tradeoff,

Same here. It's the shortcutting that it enables that's the big win.

@rpominov (Contributor, Author)

Thanks @briancavalier and @jdalton for all the interesting stuff you are posting, this will certainly help me and others to improve performance in whatever we are working on :)

To me it looks like the major reason for Most's great performance is that it was developed with performance as a target from the beginning. I was trying to do something similar with Kefir, but at a lower level: only by looking at the benchmarks and applying my intuition about how v8 should work. But with Most you looked at how it actually works internally, which opened up more optimization possibilities.

@petkaantonov

most.from(a).filter(even).map(add1).reduce(sum, 0)
    .then(log);

This is misleading and just another reason microbenchmarks don't tell the whole story. In reality you would not have a case where those methods always get passed the same function, or even the same 4 functions. What gets inlined would be very different if you had fed realistic type feedback before running the benchmark (say, 5 different functions to each of these methods, although just 1 additional function in v8 is enough to ruin it; the 5 is in case you run Chakra). The same mistake was made here, where optional properties put on the prototype look good in a microbenchmark because the type feedback is unrealistic.
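
Concretely, "feeding realistic type feedback" might look like this hypothetical warm-up (not code from the actual benchmark suite; it reuses a/even/add1/sum/log from the earlier example):

// Hypothetical warm-up: run the pipeline with several distinct
// callbacks first, so the internal callback call sites see
// polymorphic type feedback before anything is measured.
var predicates = [
    function (x) { return x > 0; },
    function (x) { return x < 500000; },
    function (x) { return x !== 42; },
    function (x) { return (x & 1) === 1; },
    function (x) { return x % 3 === 0; }
];

predicates.forEach(function (p) {
    most.from(a).filter(p).reduce(sum, 0);
});

// Benchmarking now would tell a different story: the call site inside
// filter is no longer monomorphic, so `even` can't simply be inlined.
most.from(a).filter(even).map(add1).reduce(sum, 0).then(log);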

@briancavalier (Member)

microbenchmarks don't tell the whole story

Yep, I agree. That's why I said exactly that at the very beginning of the thread, and both @jdalton and I have pointed out up-thread that the extreme inlining won't always happen.

A while back, I ran most through Bluebird's doxbee test, since that test now includes other reactive libs. It does fairly well. It's faster than many promise implementations, and considerably faster and less memory intensive than the other reactive libs. The test isn't really a good use case for event streams, so I feel it's quite an encouraging result. I figure we can probably do better, though.

@petkaantonov

Yep, I agree. That's why I said exactly that at the very beginning of the thread, and both @jdalton and I have pointed out up-thread that the extreme inlining won't always happen.

But my point is that it never happens, even with polymorphic inlining support.

@jdalton commented May 15, 2015

In reality you would not have a case where those methods always get passed the same function, or even the same 4 functions.

Sure you could. Lodash at least punts to composition of its lower-level methods to create others, so I could easily see something like

function thing(a) {
  return _(a).map(iteratee).filter(predicate).foo(yada).bar(zada);
}

being used.

@petkaantonov

So across an entire application you are always using the exact same predicate function and exact same mapper function? Seems absurdly unlikely.

@jdalton commented May 15, 2015

Seems absurdly unlikely.

That's method composition for ya and it happens a lot in JS land.

@petkaantonov

Ok composition happens a lot but what does that have to do with anything?

Let's say I have an app that never uses filtering for anything other than getting the even numbers out of a stream or something (this alone is already a ridiculous proposition):

function evens(stream) {
    return stream.filter(function(value) {
        return value % 2 === 0;
    });
}

So because this always passes a different function to .filter, filter cannot inline the callback after evens has been called more than once. On the other hand, it will never get optimized if I just call it once either (and optimization is a prerequisite for inlining).

The only case where the callback gets inlined is if it's hoisted out and no other code anywhere else in the application ever uses .filter, i.e. never.

@jdalton commented May 15, 2015

Ok composition happens a lot but what does that have to do with anything?

It provides an example of when such an optimization would kick in.
It's a common pattern for Lodash at least.

@petkaantonov

It only kicks in if thing is the only place in the application where you use .map and .filter, and the functions iteratee and predicate are hoisted out. And my point is that this "if" just never happens.

@jdalton commented May 15, 2015

function evens(stream) {
  return stream.filter(function(value) {
      return value % 2 === 0;
  });
}

I can't speak to v8 off the top of my head, though judging by their scores on certain benchmarks I think they tackle it in some way, but Chakra should inline the iteratee of filter just fine (confirmed with the JIT devs across my desk).

Speaking of hoisting, you may dig https://github.com/codemix/babel-plugin-closure-elimination.

@petkaantonov

Thanks, didn't know about that plugin, I dig :)

V8 only considers function identities, so different identities of the same syntactic function are treated as entirely different functions in cases like this. But that isn't really the point; applications are going to use far more kinds of predicates than just one that determines whether a number is even.
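
Concretely (an illustrative sketch): every evaluation of a function expression creates a new closure object, i.e. a new identity, even though the source text is identical:

// Illustrative sketch: the same syntactic function yields a distinct
// function identity each time the expression is evaluated.
function makeEven() {
    return function (value) {
        return value % 2 === 0;
    };
}

console.log(makeEven() === makeEven()); // false: two distinct identities

// So stream.filter(makeEven()) hands filter a "different" callback on
// every call, polluting its type feedback. Hoisting a single shared
// instance keeps the call site monomorphic:
var even = makeEven();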

@jdalton commented May 15, 2015

But that isn't really the point; applications are going to use far more kinds of predicates than just one that determines whether a number is even.

Yap, @briancavalier and I have already covered that.

@petkaantonov

That's just confusing then, you just argued otherwise for 2 hours :P

@jdalton commented May 15, 2015

That's just confusing then, you just argued otherwise for 2 hours :P

Naw. You were trying to take the wind out of the sails of the inlining efforts of Most, and I pointed to popular scenarios and engines where it does apply.

@petkaantonov

Inlining of the user callbacks does not apply in any engine because applications need more than 1 (or 4) different predicate functions. And it also has nothing to do with inlining effort since it's the user who provides the callback to be inlined to begin with. So there is still a big misunderstanding :)

@jdalton commented May 15, 2015

Inlining of the user callbacks does not apply in any engine because applications need more than 1 (or 4) different predicate functions

I've already given examples to the contrary.

So there is still a big misunderstanding :)

I get where you're coming from, I've just tried to expand your view a bit.

@briancavalier (Member)

@jdalton and @petkaantonov Thank you for the lively discussion :) There's a ton of great information in this thread. I learned a lot and I think anyone who reads it will, too.

It seems we've covered the aspects of why most.js does well in the microbenchmarks, as well as why that may or may not translate directly into application performance on various engines.

@pozadi Closing this, but if you feel there's a need to reopen, please do.
