
Use RRB-Tree in Vector #38

Open
nathansobo opened this issue Aug 5, 2014 · 42 comments

Comments

@nathansobo

I'm attempting to use vectors to represent the contents of 5-line tiles in Atom, and I'm keeping the groups of lines up to date with repeated applications of splice. After not very long, however, performance of iteration seems to degrade. Is this a known issue?

[screenshot attached: 2014-08-05 12:39:59]

@leebyron
Collaborator

leebyron commented Aug 5, 2014

Not yet known, you're definitely giving it a good stress test.

Is there a more contrived version of this that you can share so I can figure out what's up?

@leebyron
Collaborator

leebyron commented Aug 5, 2014

Oh, I just re-read this and realized you said splice(), not slice(). Yes, splice is quite inefficient.

@leebyron
Collaborator

leebyron commented Aug 5, 2014

If you're calling it repeatedly, you might try replacing .splice(...) with .splice(...).toVector()

@nathansobo
Author

For now I'm just building a new vector as they're quite small.

Is this inefficiency inherent in the nature of the trie data structure you're using or something implementation-specific? Do the ClojureScript data structures exhibit this issue?

@leebyron
Collaborator

leebyron commented Aug 5, 2014

splice() is definitely an expensive operation regardless: for regular arrays, everything after the splice point needs to be shifted (moved in memory). The same is true for a HAMT.

I was thinking about this a bit and wonder if you might benefit from a different kind of data structure that's designed around making splice() fast. Linked lists are really fast at this if you're willing to give up O(1) access and immutability. Perhaps some kind of B-tree would adapt well to an immutable (persistent) form and deal well with splicing.

What's the specific use case here?


@nathansobo
Author

Yeah, great point. I'll just replace the entire Vector for now because it has so few elements. Thanks for your thoughts. Feel free to close this unless you'd like to look into improving splice so far as it's possible.

@leebyron
Collaborator

leebyron commented Aug 6, 2014

I'll definitely keep it in mind, so I'll keep this open. At the very least, a real TODO is clarifying splice()'s use with withMutations, and even finding incremental perf improvements.

@swannodette

FWIW efficient splice (and other ops) is supported if you implement RRB-Trees - a variant of bitmapped vector tries (i.e. Vectors) http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf
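
To illustrate the "relaxed" part in a few lines: a balanced node selects a child by bit masking, exactly like the current bitmapped vector trie, while a relaxed node carries a table of cumulative child sizes and corrects the radix guess with a short scan. This is a hypothetical sketch (the `childFor` name and node shape are made up, not Immutable.js or the paper's code), assuming 32-way branching:

```javascript
// Hypothetical sketch of RRB-tree child selection (not Immutable.js code).
const BITS = 5;                // 32-way branching
const MASK = (1 << BITS) - 1;  // 0b11111

function childFor(node, index, shift) {
  if (node.sizes === null) {
    // Balanced node: pure radix search, O(1), same as a plain vector trie.
    return { slot: (index >>> shift) & MASK, subIndex: index & ((1 << shift) - 1) };
  }
  // Relaxed node: start from the radix guess, then advance to the first
  // slot whose cumulative size exceeds `index`.
  let slot = (index >>> shift) & MASK;
  while (node.sizes[slot] <= index) slot++;
  const prior = slot === 0 ? 0 : node.sizes[slot - 1];
  return { slot, subIndex: index - prior };
}
```

The `sizes` table is what lets concatenated or sliced trees tolerate non-full children at a small constant cost per level.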

@Tvaroh

Tvaroh commented Aug 6, 2014

BTW, the official paper on RRB trees is lacking many details. I would recommend looking at this thesis.

@leebyron
Collaborator

leebyron commented Aug 6, 2014

Interesting. I'll read up! Thanks guys.

@graue

graue commented Nov 1, 2014

@hypirion (author of the thesis above): didn't you say you found RRB-trees slower in practice (like, by a constant factor) than the Clojure-style vector for some operations?

I bring this up because if so, that may be an argument for presenting RRB-trees as an alternative implementation rather than replacing the current one.

@leebyron
Collaborator

leebyron commented Nov 1, 2014

My understanding was the constant perf hit only came if the RRB nodes are in the path, which means the result of slice/concat. 

I'm interested to hear more though. 



@hypirion

hypirion commented Nov 1, 2014

There is a lot of additional code required for the RRB-tree cases, but it's designed to switch over to the persistent vector code whenever possible: RRB-tree code can generally detect when the substructure of a node is persistent vector-ish, and will jump to pvec performance at that point in the tree. I did some benchmarks on that, and found no notable performance difference if no slices/concatenations were done on the RRB-tree.

That being said, those were implemented with a lot of twiddling in C with Boehm-GC, and I'd hardly call that equivalent to a JavaScript engine. I'm no JS expert, but I'd guess an initial implementation of the RRB-tree will be slower than the immutable list for non-slice/concat use, until it's tweaked for the JS engines. I suspect it would be harder to inline and optimise for the optimising compiler. Making it optimisable may take some time (at least based on my own experience), and you probably don't want to degrade immutable list performance while you do it.

So if I were you, I would probably start out having the RRB-tree be a different class, let slice() and concat() convert immutable lists to RRB-trees, and delegate the work to the RRB-tree implementation. The only additional change to the immutable list implementation would be that VNodes should have an additional field for RRB-tree size tables, which would be null by default.

Again, I don't have much knowledge of or experience with how the JS engines optimise and how one would tweak JS code. Perhaps the points I bring up are of no concern at all, and for some reason the performance is equivalent from the get-go.

@leebyron
Collaborator

leebyron commented Nov 2, 2014

Thanks @hypirion this is really helpful.

@dahjelle

@leebyron You hinted about something using splice and withMutations together—is there some speedup to be gained in general, or is that only with repeated calls to splice? While I realize you won't get better than O(n), it'd be nice to get a bit closer to a native array's performance.

@dahjelle

Nevermind. #37 explains. Thanks!

@Pauan

Pauan commented Jan 14, 2015

Another alternative implementation is to use AVL trees (any balanced binary tree will do, really, but AVL trees are fast and easy to implement) where each AVL node contains a JavaScript array. Once the JavaScript array reaches a certain size, it is split into two AVL nodes. I found through benchmarking that 125 is a good limit for the JavaScript arrays.
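
The idea above might look roughly like this (a hypothetical sketch, not the library's actual code): leaves hold plain JS arrays, a leaf splits once it outgrows the limit, and inserts copy rather than mutate. AVL rebalancing is omitted for brevity:

```javascript
// Hypothetical sketch of a persistent "binary tree of array chunks".
const LIMIT = 125;

function leaf(items) { return { items, size: items.length }; }
function branch(left, right) { return { left, right, size: left.size + right.size }; }

// Persistent insert at an arbitrary index: copies the touched array/nodes.
function insert(node, index, value) {
  if (node.items) {
    const items = node.items.slice();
    items.splice(index, 0, value);
    if (items.length <= LIMIT) return leaf(items);
    const mid = items.length >> 1;          // split an oversized leaf in two
    return branch(leaf(items.slice(0, mid)), leaf(items.slice(mid)));
  }
  return index < node.left.size
    ? branch(insert(node.left, index, value), node.right)
    : branch(node.left, insert(node.right, index - node.left.size, value));
}

function get(node, index) {
  if (node.items) return node.items[index];
  return index < node.left.size
    ? get(node.left, index)
    : get(node.right, index - node.left.size);
}
```

Lookup descends O(log n) tree levels and then indexes directly into a small array, which is why the chunk size trades a little read speed for much cheaper structural edits.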

I wrote a library that implements immutable lists in the above manner, you can find the benchmarks here:

https://github.com/Pauan/Immutable/blob/javascript/benchmarks/List/2015-01-13

As you can see, they are quite a bit faster than Immutable-js, for most operations. I'm willing to go into more detail on their implementation, if you want.

@swannodette

@Pauan in the case of Mori you're not measuring what you think you're measuring; there's a pending update to Mori which will permit performance comparisons of the actual data structures from plain JavaScript. Also, your AVL tree implementation seems to come in consistently behind for random access/update in these benchmarks, which is one of the value propositions of bitmapped array tries. Also, you're not benchmarking against transients in either the Immutable.js case or the Mori case.

@Pauan

Pauan commented Jan 14, 2015

@swannodette Those are all excellent points. Once the new version of Mori comes out, I would love to re-run the benchmarks. I was actually quite disappointed when I first benchmarked Mori: I was expecting it to be faster, so an improvement there would make me happy.

Do you have any recommendations for how I can run the benchmarks better? I think my usage is pretty straightforward: https://github.com/Pauan/Immutable/blob/javascript/src/Benchmark/List.js

I simply load up Mori as a Node.js module and use it in the typical way that I would expect a developer to do. I did the same with the other libraries.

The random access issue is a bit complicated. You are right that it is consistently slower than Immutable-js, but not by that much. In exchange for that, it has vastly better performance for lots of other things, especially random inserts (which is a big part of the reason I wrote the library in the first place, I needed decent-speed random inserts). This could change if Immutable-js gets the aforementioned RRB-tree optimizations.

It's also important to note that the performance of "100 values" is misleading: as an optimization, I use cons cells for O(1) push at the end, but that means that lookup degrades to O(n) for lists with less than 125 values in them. Once it hits the 125 limit, performance jumps up significantly. You can see that with "1000 values".
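
The cons-cell tail described above can be sketched as follows (hypothetical code, not the library's real implementation): push prepends a cell in O(1), and lookup walks the chain newest-first, hence O(n) until the cells are flushed into the tree at the 125 limit:

```javascript
// Hypothetical sketch of an O(1)-push cons-cell tail.
function push(list, value) {
  // Prepend a cell; `list` may be null for the empty list.
  return { value, prev: list, size: (list ? list.size : 0) + 1 };
}

function nth(list, index) {
  // Cells are linked newest-first, so walk back (size - 1 - index) steps.
  let steps = list.size - 1 - index;
  while (steps-- > 0) list = list.prev;
  return list.value;
}
```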

In the past I have benchmarked transients (in both Immutable-js and Mori), and I found that in some cases they were slower, for bulk inserts they were generally faster, but in general they don't make a huge difference. It wasn't worth the extra time it took to benchmark them, so I took them out. But I can add them back in if you would like to see the results.

P.S. O(1) concat and slice in Mori is insane. I'm curious how that's implemented, and whether it involves some other tradeoffs or not.

@graue

graue commented Jan 15, 2015

@Pauan: Slice in Mori is O(1) time because it creates a view over the original vector. In essence, it's like

function slice(vec, start, length) {
  // If vec is itself a slice, compose the offsets so views never nest.
  if (vec.originalVector !== undefined) {
    start += vec.offset;
    vec = vec.originalVector;
  }
  return {
    originalVector: vec, // retained for the lifetime of the slice
    offset: start,
    length: length
  };
}

Totally made-up pseudocode, but conceptually, that's what is going on. So, as long as the slice lives, nothing in the original vector can be garbage collected.

I copied this behavior in my immutable-vector implementation, but warned about it in the docs, since it's a pretty big caveat, and can lead to memory leaks if not understood.

@graue

graue commented Jan 15, 2015

@swannodette: Correct me if I'm remembering wrong or the above behavior has since changed.

As for concat, I don't remember concat in Mori being O(1), and I'm confused by the notion that it is.

@swannodette

@graue your assessment of subvec aka slice in Clojure is correct. concat is not O(1) in Mori.

@Pauan

Pauan commented Jan 16, 2015

@graue I see, I was afraid it might be something like that. Thanks for the explanation.

@graue @swannodette I mentioned Mori's concat being O(1) because in the benchmarks I showed, you can clearly see that it has O(1) behavior. I have checked, and my benchmark code appears to be correct.

Since Mori vectors don't seem to have a specialized concat function, I'm using mori.concat, which converts its arguments to sequences. So, when you say that Mori doesn't have O(1) concat, do you mean that the concat itself is O(1), but it costs time if you want to convert back to a vector? Or do you mean that the concat itself is O(log32(n))?

@swannodette

@Pauan concat in Mori just produces a lazy sequence so the cost is amortized.

@Pauan

Pauan commented Jan 16, 2015

@swannodette Okay, cool, so it's like a concat over ES6 Iterables. Thanks for answering my many questions.
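
In ES6 terms, the amortized concat described above can be sketched as a generator: creating the result is O(1), and the inputs are only walked as the sequence is consumed (hypothetical code, not Mori's implementation):

```javascript
// A lazy concat over any iterables: O(1) to create, cost paid on iteration.
function* lazyConcat(...iterables) {
  for (const it of iterables) {
    yield* it;
  }
}
```

For example, `[...lazyConcat([1, 2], [3, 4])]` forces the sequence and pays the full O(n) cost at that point.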

@Pauan

Pauan commented Jan 16, 2015

With @swannodette's help, I updated the benchmarks to use the latest versions of Immutable-js and Mori. You can find it here:

https://github.com/Pauan/Immutable/blob/javascript/benchmarks/List/2015-01-16

Mori 0.3.2 is significantly faster than 0.2.9. Most of the speed gain comes from using fixed-arity functions in the benchmark code.

My library (without transients) holds up well compared to Mori and Immutable-js (with transients).

My library has extremely fast concat and slice. It does not "cheat": concat and slice are not lazy, they return new Lists, and the old Lists can be garbage collected.

The price for this is a very minor loss of lookup speed: for a list with 1,000,000 elements, my library takes 13 operations to look up a value. Mori and Immutable-js take 4 operations.
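
Those counts presumably correspond to tree depths: a 32-way trie over 10^6 elements is about log32(10^6) ≈ 4 levels deep, while a binary tree over 10^6 / 125 = 8000 array leaves is about log2(8000) ≈ 13 levels deep. A quick check:

```javascript
// Expected depths for 1e6 elements: 32-way trie vs. binary tree of
// 125-element array chunks.
const trieDepth = Math.ceil(Math.log(1e6) / Math.log(32)); // 4
const avlDepth = Math.ceil(Math.log2(1e6 / 125));          // 13
```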

So, if you only need to insert / remove at the end of a list, then either Mori or Immutable-js are excellent choices.

But if you want good concat / slice performance, or you need to insert at arbitrary indexes, my library offers a much faster alternative. I think the performance of AVL trees + arrays is a lot better than what most people think.

I'm curious to see if RRB-trees improve the performance of splice / concat / slice.

@leebyron
Collaborator

Nice work @Pauan!

I haven't spent much time benchmarking or optimizing Immutable.js yet, so there are likely some wins to be had. A huge thanks to you and @swannodette for illustrating some baselines to strive for. It's always awesome and inspiring to see the headroom we have to close :).

However, one thing I want to be cautious of, and part of why I've under-prioritized benchmarking, is comparing benchmarks to everyday use. I don't think benchmarks are enough to prove or disprove an idea unless they closely mirror common use scenarios. AVL trees are really interesting, and I'm looking forward to seeing how they may play a role in this lib, or at least in the new landscape of these structures as a whole.

May I also suggest a composite test where reads and writes are interleaved at different frequencies? Most common use tends to read a lot more than write, which is part of why Clojure/Scala/Immutable chose to optimize for reads instead of writes.

@Pauan

Pauan commented Jan 16, 2015

@leebyron Indeed, real-world examples always trump synthetic benchmarks. But it's also much harder to find and test real-world examples.

Adding in more benchmarks for different scenarios (like interleaving read/write) is a good idea, and I might add them in at some point.

I agree, reads tend to be more frequent than writes, so optimizing for read is a good idea. But when you're trading minor read performance for multiple orders of magnitude of write performance, I think that's a good tradeoff. It will of course depend upon the particular programmer's use-cases, though.

My library also supports Tuples, for the situations where read performance is needed much more so than write performance. I think this is a good idea: a general-purpose List that's good at everything (reading, inserting, concatenating, etc.), and a more specialized data structure that can be faster for a particular operation (like reading).

@leebyron
Collaborator

Totally. The beauty of data structures is that there are so many variants and each one has its ideal use.

@swannodette

@Pauan the speedups do not come primarily from fixed-arity invokes. Two things changed in ClojureScript: we no longer leak arguments, and the standard library is itself compiled with static invokes enabled, thus removing any internal overheads. Even without JS static invokes you will see Mori win out in many cases.

@leebyron while I appreciate the conservatism with regard to micro-benchmarking, let's not be disingenuous: Clojure's, and thus ClojureScript's, many optimizations are the direct result of 7 years of observing common usage patterns in production applications.

@Pauan

Pauan commented Jan 16, 2015

@swannodette I'm sure there were plenty of other optimizations, but I ran the Mori 0.3.2 benchmarks both with the fN functions and without. There was a speed boost even without the fN functions, but it was a lot less than the speed boost from using the fN functions.

The results I posted were with the fN functions.

@leebyron
Collaborator

@swannodette Exactly my point. Clojure & ClojureScript's performance is the result of multiple smart people working diligently for quite a long time with production performance in mind. And it's seriously impressive and inspiring (like, literally inspiring: it literally inspired this project).

My hesitancy towards benchmarks is simply to ensure they don't become the driving force for optimization work, which in my experience on other projects has often led to worse performance for production apps, or distraction from more important work. Always focus on production performance first.

Immutable.js has a long road to follow, and many great ideas to borrow to get there (including this task). Putting together a benchmarking rig is on my todo list, with benchmarks designed to mimic common patterns, and I'll want to do that before going deep on performance tweaking. But production use and perf is still going to be my end goal :). Not there yet!

@swannodette

@Pauan btw all your Mori slice benches are wrong. subvec is the same as Immutable.js slice. You should correct this.

Amortization via laziness is a well-known strategy; it's why Clojure(Script) users have got along fine for so long without needing AVL or RRB trees. Still, both are useful and thus have been available to Clojure(Script) as separate contrib libs.

@Pauan

Pauan commented Jan 16, 2015

@swannodette That is intentional: my benchmark code is correct. Mori's subvec is lazy, while Immutable-js's slice is not lazy: it actually constructs a new List. Comparing a lazy O(1) slice with a non-lazy slice is disingenuous and useless.

I have no problem with laziness, and lazy views are a great idea. Perhaps Immutable-js and Immutable could benefit from a lazy slice. I would probably call it something like view rather than slice.

But, currently, JavaScript's slice is not lazy, Immutable-js's slice is not lazy, and Immutable's slice is not lazy. So to make the comparison fair, I had to make Mori's subvec non-lazy. The same goes for concat.

There's also the issue that Mori's subvec leaks memory if you throw away the old vector. To prevent that, you must force the slice into an actual concrete collection, just like Immutable-js and Immutable. So it's useful to know how Mori compares to the other libraries, in that situation.

You might ask: why not have a second benchmark, for lazy slice? Because if all the libraries implemented lazy slice, they would all be O(1), and there is no point in benchmarking if all the libraries are O(1).

You might ask: why not put in the benchmark anyways, even though not all the libraries have lazy slice? Because that's not fair. Not only is it not fair, it's heavily misleading to anybody who looks at the benchmarks. They would think that Mori's subvec is amazingly fast, but in an actual program where they force the slice, it won't be nearly as fast as on the benchmark.

Lazy slice and non-lazy slice do different things, have different behavior, and have different caveats and tradeoffs. They should be compared separately.

@swannodette

@Pauan subvec is not lazy. Immutable.js uses our same windowing approach. Not sure where you're getting your information from. Your benchmark is wrong.

@swannodette

Ah apologies it looks like the slice implementation has changed since I last looked. But I still think this is the typical useless benchmark that doesn't matter in practice.

@Pauan

Pauan commented Jan 16, 2015

Perhaps benchmarking non-lazy slice is useless, perhaps not.

But benchmarking lazy slice is definitely useless, because it's always O(1).

If you prefer, I can simply remove the subvec benchmark for Mori.

@leebyron
Collaborator

I had to make slice non-lazy to fix some behavioral issues, my intent is to make it lazy again in the future (and by lazy, I mean just a view on a backing List, same as subvec).

@swannodette

@Pauan my point is that your benchmarking approach doesn't make the least bit of sense. You won't allow lazy views based on principle, and then you turn around and benchmark a non-equivalent O(n) operation. This is the kind of thing that @leebyron was suggesting against :)

@leebyron
Collaborator

Yeah, guys I'm gunna have to call this convo over (at least in this context) because we've gotten pretty off-topic from RRB trees :). You've both inspired me to a) investigate some alternative options for speeding things up and b) add a benchmarking harness of some kind to start tracking this stuff better.

When I get a benchmark harness up, I'll be psyched to get your feedback.

@swannodette

Sounds good

@ghost

ghost commented Aug 4, 2015

Thank you for reporting this issue; we appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days, or the issue may be closed.
