
[Proposal] Use B+-trees instead of AVL trees for Immutable Collections #14477

Open · sharwell opened this issue Apr 24, 2015 · 47 comments

Labels: area-System.Collections, enhancement, help wanted [up-for-grabs], tenet-performance

@sharwell (Member)

Currently the Node class used for ImmutableList<T> carries 40 bytes of overhead per list element on X64, and 24 bytes on X86. For an ImmutableList<int>, whose elements each hold only 4 bytes of data, that puts the memory cost of the structure at 1000% on X64 and 600% on X86 (these figures represent the memory used for storing non-data elements associated with the structure, relative to the data itself).

I propose migrating to a B+-tree instead, using a fixed value for b and declaring data values in the index and data pages directly (to avoid unnecessary memory overhead for arrays). The choice of b primarily affects the following things:

  1. The height of the tree. Shorter tree heights improve performance for random access lookups by improving memory locality for a majority of traversal operations.
  2. The size of the pages. Mutations of immutable data structures require reconstructing pages leading back to the root. If the pages are large, mutations could become expensive.

In addition to the above, B+-trees provide a substantial memory savings by removing the child pointers (the Left and Right pointers for AVL nodes) from the leaves.

My initial calculations (which could be completely wrong) indicate that the average memory allocation required for a single mutation operation on a list of 100000 elements using the AVL structure is 775 bytes on X64 and 493 bytes on X86. The optimal value of b for X64 in terms of mutation allocation overhead is b=8, which costs 476 bytes per operation on a list of the same size (a 39% reduction). On X86 the same value of b costs 388 bytes per mutation (a 21% reduction; b=6 would be optimal there, at 382 bytes). The overall storage overhead for b=8 drops dramatically, to 169% on X64 and 119% on X86 (counting the average page at 75% capacity, based on a merge at 50% and a split at 100%).
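For concreteness, here is a minimal sketch of what a fixed-b layout with inline fields might look like, assuming b=8; the type and field names are illustrative only, not part of the proposal:

```csharp
// Sketch only: an 8-ary B+-tree with counts and values declared inline,
// avoiding per-node array allocations as the proposal suggests.
sealed class IndexNode<T>
{
    // Cumulative child counts; the indexer compares against these to pick a child.
    internal int Count0, Count1, Count2, Count3, Count4, Count5, Count6, Count7;
    // Children are IndexNode<T> or LeafNode<T>.
    internal object Child0, Child1, Child2, Child3, Child4, Child5, Child6, Child7;
}

sealed class LeafNode<T>
{
    // No Left/Right child pointers: values are stored directly, which is where
    // most of the per-element savings over the AVL node comes from.
    internal T Value0, Value1, Value2, Value3, Value4, Value5, Value6, Value7;
}
```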

@stephentoub (Member)

cc: @AArnott, @pharring

@AArnott (Contributor) commented Apr 24, 2015

I'm not very familiar with B+-trees, but my first concern is whether they make it necessary to discard more of the tree than AVL trees do on mutations. If I'm reading it correctly, @sharwell addresses that concern and suggests that this actually gets better (because the trees are shallower overall?). It sounds better for memory, and we know that n-ary trees in general have better lookup performance than AVL trees. So this sounds like a promising idea.

@sharwell (Member, Author)

@AArnott You are reading that correctly. For "standard" collections, the value for b is generally chosen for best interaction with the garbage collector and processor caches (e.g. relatively large pages for great locality, but intentionally small enough to keep them out of the large object heap for most expected types). I had excellent results with b set to 128 or larger for this type of data structure.

In the case of Immutable Collections, I approached the analysis from the other side of minimizing the cost of mutations.

@AArnott (Contributor) commented Apr 24, 2015

Great. Another concern is construction performance. When we had a 4-ary tree, construction was much slower.

@pharring (Contributor)

Seems like a nice improvement to me. We've avoided ImmutableList in Roslyn because of the measured allocation overhead. (But we lean heavily on ImmutableArray to keep us happy.)

@ghost commented May 14, 2015

+1 👍

@AArnott (Contributor) commented May 15, 2015

One other thing to keep tabs on is what it means for Builders.

@ghost commented May 17, 2015

The Rust language uses B-trees for its maps and sets :D

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Mar 25, 2020
@adamsitnik adamsitnik added the tenet-performance Performance related issue label Nov 2, 2020
@evilguest

This is likely to make the mutators' execution times worse (Add, Insert, etc.) because of the increased footprint of the nodes to clone/alter. Would that bother you less than the overall memory consumption does?

@sharwell (Member, Author) commented Nov 27, 2020

... because of the increased footprint of the nodes to clone/alter ...

@evilguest The proposal is explicitly based on the idea that this is not a problem. While each index node in the B+-tree is larger than a single node in the AVL tree, many fewer nodes need to be modified during a mutation operation. For the sample collection size of 100,000 elements, the single-mutation memory allocation reduction is 39% on X64 and 21% on X86. There may be a few collection sizes that do not benefit from this, but I'm not sure what those are right now.

@evilguest

Thanks. I am experimenting in the same direction as https://github.com/tunnelvisionlabs/dotnet-trees, and the factory implementation of System.Collections.Immutable.ImmutableList is surprisingly hard to beat from a performance standpoint. E.g. ImmutableList.this[i] in tunnelvision's version is 2-10 times worse than in the MS one (depending on the list size).
I am still benchmarking the modifications, but expect similar issues (as almost 50% of the work is finding the leaf to clone).

@arkalyanms (Member)

I am still benchmarking the modifications, but expect similar issues (as almost 50% of the work is finding the leaf to clone).

@evilguest Did you get to benchmarking your modifications?

@sharwell (Member, Author) commented Feb 8, 2021

@arkalyanms the biggest help would be a bunch of benchmark scenarios to establish baselines. I can give you an overview of the in-flight work tomorrow. 😃

@evilguest

My benchmarking suite can be found at https://github.com/evilguest/atropos/tree/main/Atropos.Benchmarks

@evilguest

I am not yet happy with the List implementation, because of both its memory overhead and its underperformance on the modification operations.

@madelson (Contributor)

I'm potentially interested in working on this.

@danmoseley (Member)

@madelson probably the best first step is to ensure that the scenarios we care about are fully covered in dotnet/performance, before making any changes here.

Certainly for elapsed time - @kunalspathak does the lab detect allocation regressions as well?

@madelson (Contributor)

@madelson probably the best first step is to ensure that the scenarios we care about are fully covered in dotnet/performance, before making any changes here.

@danmoseley are there particular benchmarks to be added? Looking at this folder in the benchmarks repo it seems like some of the benchmarks incorporate the immutable collections but many do not. Is the thought to flesh that out or to create specific targeted benchmarks for the immutable collections?

@danmoseley (Member)

@madelson right, there is not much coverage of the immutable collections there today. We have not changed the implementations much. If we're contemplating more radical changes, we first need to ensure that any interesting performance regression (one that would impact Roslyn would certainly qualify) would be detected by our set of benchmarks. So I'm suggesting perhaps some investment into those.

@jaredpar (Member)

We maybe should amp those up there. @jaredpar do you have benchmarks that would focus on the immutable collections?

No, we do not have any that we could migrate. But @sharwell generally handles performance for Roslyn overall, so I expect he'd be testing our key scenarios as part of evaluating this change.

@sharwell (Member, Author) commented Feb 15, 2022

Note that Roslyn is becoming a bit desensitized to this issue. Since many of our performance-sensitive areas construct a collection that is subsequently read but not mutated, we've been migrating to segmented collections (essentially a 2-level B+ tree with a variable-sized index level and constant-count leaves). CPS and MSBuild are two other candidates for this performance testing, though.

@AlgorithmsAreCool (Contributor)

I'm going to leave this comment here to remind myself of a future API proposal if this issue makes it in.

Unsolicited Suggestion 1

The current proposal is for a fixed b, but there is value in a configurable b. Immutable collections are used in a broad range of applications, and allowing the user to choose the arity gives developers more flexibility.

Need faster read performance? Great, increase the arity to make scans and seeks faster.
Need to update the collection a lot? Lower the arity to reduce copying and make updates more efficient.

Unsolicited Suggestion 2

As a side note, I would also love a scan API that takes advantage of the internal structure and allows block-based iteration. This shape might need some work to allow the action to stop the scan early.

```csharp
public static class CollectionsMarshal
{
    public static void Scan<TArg, T>(ImmutableList<T> list, TArg arg, ReadOnlySpanAction<T, TArg> action);
    public static void ScanRange<TArg, T>(ImmutableList<T> list, TArg arg, Range range, ReadOnlySpanAction<T, TArg> action);
}
```
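For illustration, here is a hypothetical use of that shape (this API does not exist today): summing an ImmutableList<int> one leaf-sized span at a time rather than element by element.

```csharp
// Hypothetical: sums the list leaf by leaf via the proposed Scan API above.
var total = new int[1]; // one-element box for the running sum, passed as TArg
CollectionsMarshal.Scan(list, total, static (ReadOnlySpan<int> span, int[] acc) =>
{
    foreach (int value in span)
        acc[0] += value;
});
Console.WriteLine(total[0]);
```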

@madelson (Contributor) commented Mar 13, 2022

To try it out, I made a prototype of a B+ tree ImmutableList<T> with a subset of operators and benchmarked it against the .NET 6 ImmutableList<T>. There's still a lot to be improved, but the results seem promising (for the ratios, lower is better and the current ImmutableList<T> is 1.0):

| Benchmark | Time Ratio | Bytes Allocated Ratio |
|---|---|---|
| Indexer for loop 5 ints | 0.56 | n/a |
| Indexer for loop 512 ints | 0.77 | n/a |
| Indexer for loop 10K ints | 0.59 | n/a |
| Indexer for loop 5 strings | 0.85 | n/a |
| Indexer for loop 512 strings | 0.95 | n/a |
| Indexer for loop 10K strings | 0.70 | n/a |
| Add 5 ints | 0.35 | 0.47 |
| Add 512 ints | 0.28 | 0.62 |
| Add 10K ints | 0.22 | 0.65 |
| Add 5 strings | 0.58 | 0.54 |
| Add 512 strings | 0.41 | 0.60 |
| Add 10K strings | 0.31 | 0.61 |
| Insert 5 ints in random locations | 0.49 | 0.54 |
| Insert 512 ints in random locations | 0.40 | 0.76 |
| Insert 10K ints in random locations | 0.33 | 0.91 |
| Insert 5 strings in random locations | 0.84 | 0.62 |
| Insert 512 strings in random locations | 0.55 | 0.80 |
| Insert 10K strings in random locations | 0.55 | 0.86 |
| CreateRange 5 ints | 0.99 | 1.28 |
| CreateRange 512 ints | 0.49 | 0.14 |
| CreateRange 10K ints | 0.39 | 0.11 |
| CreateRange 5 strings | 0.91 | 1.00 |
| CreateRange 512 strings | 0.59 | 0.25 |
| CreateRange 10K strings | 0.57 | 0.23 |
| AddRange 50 ints to 50 ints | 0.66 | 0.33 |
| AddRange 10K ints to 10K ints | 0.42 | 0.11 |
| AddRange 50 strings to 50 strings | 0.62 | 0.33 |
| AddRange 10K strings to 10K strings | 0.59 | 0.23 |

Benchmark info:

```
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1526 (21H1/May2021Update)
Intel Core i7-3632QM CPU 2.20GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.200
  [Host]     : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
  DefaultJob : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
```

@AArnott (Contributor) commented Mar 17, 2022

Those numbers look great; indexing and adding both look good. I'm interested in moving forward with this.

@madelson (Contributor)

@AArnott if this is something I can contribute, how would you recommend going about the development?

For example, would it make sense for me to do a writeup of the proposed design to get feedback/alignment on that as a first step?

I'm also curious what the benchmarking/validation process should be. I did add some additional benchmarks to dotnet/performance (one set got merged, the other is still in PR), but I could imagine that for a change like this it could be desirable to have additional, more detailed benchmarks for a side-by-side comparison (as I've posted above). There was also discussion of trying out the new approach in MSBuild. In addition, some way to run the benchmark suite on different operating systems seems important.

@AArnott (Contributor) commented Mar 17, 2022

Good questions, @madelson. While I wrote the current implementation of the immutable collections, I don't typically work in this repo so I think we should wait for some other folks who accept PRs to this repo to provide the guidance you're looking for.

I don't see the Builder pattern included in your prototype. I'll be interested in seeing how you can incorporate that. The memory/perf characteristics of our current builder implementation are that creating a builder based on an immutable collection is O(1) in time and memory. Switching back from builder to immutable collection is O(n) in time and O(1) in memory, where n is the number of nodes actually changed while in Builder mode. Can we achieve that with your B+ tree?

@madelson (Contributor)

I think we should wait for some other folks who accept PRs to this repo to provide the guidance you're looking for.

Makes sense.

I don't see the Builder pattern included in your prototype. I'll be interested in seeing how you can incorporate that.

I've thought about this a bit but haven't built anything. Obviously the fact that I'm using arrays as nodes prevents us from following the current pattern of having a "frozen" bool on each node.

The first step is representing mutability. For internal nodes, I believe we can use the highest-order bit of the CumulativeChildCount field (currently unused because all counts are non-negative) to indicate that that child is mutable. The builder object itself would additionally store a flag to indicate whether or not the root node is mutable. Therefore, as we traverse down the tree we always know whether or not the current node we're looking at is mutable based on context passed from the parent caller.

Using this marking system, we can create a new builder in O(1) time/memory and we can convert back to an immutable structure in-place by traversing the tree and freezing all mutable nodes we encounter (short-circuiting on already-frozen nodes) just like we do today.
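A minimal sketch of that tagging scheme, assuming the CumulativeChildCount layout described above (the helper names are hypothetical):

```csharp
// Sign bit of a child's cumulative count marks that child as builder-mutable;
// real counts are always non-negative, so the bit is otherwise unused.
const int MutableBit = unchecked((int)0x80000000);

static bool IsChildMutable(int cumulativeChildCount)
    => (cumulativeChildCount & MutableBit) != 0;

static int ChildCount(int cumulativeChildCount)
    => cumulativeChildCount & ~MutableBit;

// ToImmutable() walks the tree clearing MutableBit, short-circuiting at any
// child whose bit is already clear, so the cost is proportional to the
// number of nodes actually mutated in builder mode.
```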

One design question with this approach is how much we should try to unify the builder and immutable methods. For example, the builder methods would always need to mask out the highest-order bit when reading CumulativeChildCount. If we share all of the code, we'd be taking on that extra overhead in the immutable case as well.

Another design question is whether the builder should attempt to use partially-full arrays to eliminate allocations. For example, in the immutable world, if we have a leaf of size 4, we represent it with just a 4-element array; the same goes for internal nodes. In the builder world I see two options:
(1) We could continue this practice of using arrays that are exactly the right size, which means arrays are replaced whenever they "grow".
(2) We could instead adopt an approach where we double the size of the array any time it runs out of space (à la List<T>), thus leaving room to "grow" the array in place.

With approach (1), we end up allocating a new leaf node for every builder add/insert, and much less frequently we'll need to allocate new internal nodes as well. In contrast, approach (2) can sometimes add/insert without creating any new nodes.

On the other hand, when using approach (1) converting back to the immutable structure is trivial whereas for approach (2) we either have to "compact" many nodes or we have to make the immutable structure able to handle partially-full nodes and accept the loss of space efficiency. Approach (1) also leads to simpler code because we can use the array's length as the node count whereas with approach (2) we have to flow the true count down from the parent node.

All things considered, I would probably lean towards approach (1) because it seems simpler, and because the existing algorithm's builder has to create a new leaf node for every add/insert anyway, so we're not losing a ton there. But I'm not sure. I'm kind of assuming that builders are a niche use-case, since you're not using ImmutableList<T> unless you want the persistent/functional data structure aspects.
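To make the two options concrete, here are minimal sketches of the growth policies (hypothetical helpers; real code would also cap leaves at b and handle splits):

```csharp
// Approach (1): exact-size leaves; every add allocates a fresh array.
static T[] AddExact<T>(T[] leaf, T item)
{
    var grown = new T[leaf.Length + 1];
    leaf.AsSpan().CopyTo(grown);
    grown[^1] = item;
    return grown;
}

// Approach (2): List<T>-style slack; the leaf grows in place until capacity
// runs out, so many adds allocate nothing, at the cost of trailing slack
// that ToImmutable() must later compact or tolerate.
static void AddWithSlack<T>(ref T[] leaf, ref int count, T item)
{
    if (count == leaf.Length)
        Array.Resize(ref leaf, Math.Max(4, leaf.Length * 2));
    leaf[count++] = item;
}
```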

@sharwell (Member, Author)

@madelson I'd be interested to see those benchmarks run against the implementation in tunnelvisionlabs/dotnet-trees. I have a builder implementation there which adheres to the complexity bounds mentioned by @AArnott.

@madelson (Contributor)

@sharwell here's what I got when I added TunnelVision's ImmutableTreeList to the benchmarks (using the NuGet package):

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| Index<Int32> (5) | 0.56 | NaN | 1.79 | NaN |
| Index<Int32> (512) | 0.80 | NaN | 6.84 | NaN |
| Index<Int32> (10K) | 0.59 | NaN | 5.57 | |
| Index<String> (5) | 0.99 | NaN | 1.14 | NaN |
| Index<String> (512) | 0.83 | NaN | 2.34 | NaN |
| Index<String> (10K) | 0.81 | 1.00 | 2.89 | 2.00 |
| Add<Int32> (5) | 0.34 | 0.47 | 0.92 | 0.54 |
| Add<Int32> (512) | 0.28 | 0.62 | 2.41 | 0.70 |
| Add<Int32> (10K) | 0.21 | 0.65 | 1.76 | 0.80 |
| Add<String> (5) | 0.56 | 0.54 | 1.00 | 0.75 |
| Add<String> (512) | 0.43 | 0.60 | 2.52 | 0.77 |
| Add<String> (10K) | 0.34 | 0.61 | 1.77 | 0.85 |
| Insert<Int32> (5) | 0.49 | 0.54 | 1.07 | 0.62 |
| Insert<Int32> (512) | 0.38 | 0.75 | 3.12 | 0.88 |
| Insert<Int32> (10K) | 0.34 | 0.91 | 2.55 | 0.89 |
| Insert<String> (5) | 0.79 | 0.62 | 1.28 | 0.86 |
| Insert<String> (512) | 0.60 | 0.80 | 3.42 | 0.97 |
| Insert<String> (10K) | 0.46 | 0.85 | 2.17 | 0.94 |
| CreateRange<Int32> (5) | 0.90 | 1.28 | 1.31 | 0.69 |
| CreateRange<Int32> (512) | 0.48 | 0.14 | 2.35 | 0.23 |
| CreateRange<Int32> (10K) | 0.40 | 0.11 | 3.30 | 0.24 |
| CreateRange<String> (5) | 1.04 | 1.33 | 1.23 | 0.81 |
| CreateRange<String> (512) | 0.61 | 0.25 | 2.06 | 0.32 |
| CreateRange<String> (10K) | 0.52 | 0.23 | 2.49 | 0.32 |
| AddRange<Int32> (50) | 0.58 | 0.53 | 1.64 | 0.35 |
| AddRange<Int32> (10K) | 0.43 | 0.11 | 3.30 | 0.24 |
| AddRange<String> (50) | 0.63 | 0.33 | 1.43 | 0.33 |
| AddRange<String> (10K) | 0.58 | 0.23 | 2.77 | 0.32 |

@madelson (Contributor)

The memory/perf characteristics of our current builder implementation is that creating a builder based on an immutable collection is O(1) in time and memory. Switching back from builder to immutable collection is O(n) in time and O(1) in memory, where n is the number of nodes actually changed while in Builder mode. Can we achieve that with your B+ tree?

I added a basic builder implementation to the prototype. For now the builder just supports indexer get/set and ToBuilder/ToImmutable. However, this should illustrate how the builder can function in general and it meets the performance criteria above.

Benchmarks:

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| BuilderSetItem<Int32> (5) | 0.38 | 0.56 | 0.88 | 0.56 |
| BuilderSetItem<Int32> (512) | 0.23 | 0.16 | 1.89 | 0.21 |
| BuilderSetItem<Int32> (10K) | 0.27 | 0.12 | 1.73 | 0.20 |
| BuilderSetItem<String> (5) | 0.61 | 0.63 | 0.99 | 0.70 |
| BuilderSetItem<String> (512) | 0.36 | 0.25 | 1.97 | 0.29 |
| BuilderSetItem<String> (10K) | 0.35 | 0.23 | 1.79 | 0.28 |

The benchmark I'm using does the following:

  • Given an immutable list with N items
  • Call ToBuilder()
  • Set the first 25% of the builder's indices to different values using the indexer
  • Call ToImmutable()

@madelson (Contributor)

A few more benchmarks to share:

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| RemoveAt<Int32> (5) | 0.56 | 0.71 | 1.96 | 0.90 |
| RemoveAt<Int32> (512) | 0.78 | 1.01 | 4.80 | 0.96 |
| RemoveAt<Int32> (10K) | 0.75 | 1.15 | 3.25 | 1.06 |
| RemoveAt<String> (5) | 0.93 | 0.79 | 1.91 | 1.23 |
| RemoveAt<String> (512) | 1.03 | 1.01 | 4.67 | 1.05 |
| RemoveAt<String> (10K) | 0.89 | 1.07 | 3.53 | 1.12 |
| RemoveRange<Int32> (5) | 0.27 | 0.43 | 1.71 | 0.48 |
| RemoveRange<Int32> (512) | 0.01 | 0.02 | 2.40 | 0.20 |
| RemoveRange<Int32> (10K) | 0.00 | 0.00 | 2.25 | 0.20 |
| RemoveRange<String> (5) | 0.57 | 0.48 | 1.87 | 0.67 |
| RemoveRange<String> (512) | 0.01 | 0.03 | 2.62 | 0.28 |
| RemoveRange<String> (10K) | 0.00 | 0.00 | 2.63 | 0.28 |
| IndexOf<Int32> (5) | 0.10 | NaN | 0.20 | NaN |
| IndexOf<Int32> (512) | 0.17 | NaN | 0.87 | NaN |
| IndexOf<Int32> (10K) | 0.17 | NaN | 1.01 | NaN |
| IndexOf<String> (5) | 0.08 | NaN | 0.19 | NaN |
| IndexOf<String> (512) | 0.11 | NaN | 0.45 | NaN |
| IndexOf<String> (10K) | 0.11 | 0.00 | 0.48 | 0.00 |
| CopyTo<Int32> (5) | 0.07 | NaN | 0.34 | NaN |
| CopyTo<Int32> (512) | 0.04 | NaN | 8.58 | NaN |
| CopyTo<Int32> (10K) | 0.03 | NaN | 13.42 | |
| CopyTo<String> (5) | 0.10 | NaN | 0.23 | NaN |
| CopyTo<String> (512) | 0.05 | NaN | 2.19 | NaN |
| CopyTo<String> (10K) | 0.05 | NaN | 4.41 | |

Some notes:

  • The RemoveAt benchmark starts with a list of the given size and removes indices at random until the list is empty.
  • The RemoveRange benchmark starts with a list of the given size and removes a single range consisting of the middle third of the list's elements.
  • The IndexOf benchmark searches for an element that isn't in the list.
  • RemoveRange(index, count) can be implemented very efficiently for a B+ tree. Our RemoveAt(index) just calls RemoveRange(index, 1).
  • IndexOf and CopyTo leverage a Span-based scanning helper like what @AlgorithmsAreCool suggested (see the sketch after this list).
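To illustrate why the span-based helper pays off, here is a sketch of the idea (not the prototype's actual code); the in-order leaf enumeration is an assumed helper:

```csharp
// Sketch: IndexOf over leaf spans. Each inner search is a contiguous span
// scan (vector-accelerated for common element types) instead of a
// per-element tree traversal.
static int IndexOfViaLeaves<T>(IEnumerable<T[]> leavesInOrder, T item)
    where T : IEquatable<T>
{
    int offset = 0;
    foreach (T[] leaf in leavesInOrder) // assumed: yields leaf arrays in list order
    {
        int i = leaf.AsSpan().IndexOf(item);
        if (i >= 0)
            return offset + i;
        offset += leaf.Length;
    }
    return -1;
}
```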

@madelson (Contributor)

Added a stack-based enumerator like the one ImmutableList<T> has. Enumeration is also significantly faster, likely because most MoveNext() calls are just advancing through a leaf node.

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| Enumeration<Int32> (5) | 0.24 | NaN | 0.51 | NaN |
| Enumeration<Int32> (512) | 0.39 | NaN | 1.87 | NaN |
| Enumeration<Int32> (10K) | 0.35 | NaN | 2.60 | NaN |
| Enumeration<String> (5) | 0.24 | NaN | 0.37 | NaN |
| Enumeration<String> (512) | 0.34 | NaN | 0.76 | NaN |
| Enumeration<String> (10K) | 0.31 | 0.00 | 0.95 | 1.00 |

@AArnott (Contributor) commented May 6, 2022

I haven't studied the perf numbers, but I wanted to respond to one of the questions about memory allocations: people use the builders to reduce allocations and improve the perf of creating a structure that will be frozen later. Allocating and copying an array on every Add would defeat that. For that reason, I support having 'slack' in the array, at least during the builder's use.
At that point, then, we need to decide whether it's better to have a very fast, alloc-free ToImmutable or to remove the slack from the arrays. I'm not sure on this. I'm leaning slightly towards cutting the slack.

@madelson (Contributor) commented May 6, 2022

For that reason I support having 'slack' in the array at least during the builder's use.
At that point then, we need to decide whether it's better to have very fast and alloc-free ToImmutable or to remove the slack from the arrays. I'm not sure on this. I'm leaning slightly towards cutting the slack.

@AArnott can you clarify what you mean by "cutting the slack" here? Are you coming down in favor of fast, alloc-free ToImmutable (no slack in the arrays) or in favor of more expensive ToImmutable() in exchange for fewer allocations on the individual builder operations?

The topic of how the builder should work is one I've given a bit of thought to and I think there are some interesting considerations / options / tradeoffs worth mentioning.

First off, consider the approach where there is no slack (arrays are always full). While this might seem like it defeats the point of having a builder, it still offers real benefits:

  • SetItem() calls will not allocate once the nodes along the path have been "mutablized" since no expansion/contraction is required
  • Add/Remove/Insert will allocate every time at the leaf level, but don't necessarily allocate at the internal level (internal node allocations only happen when enough changes have been made to trigger a merge/split). For the AVL tree, Add/Insert operations always allocate at the leaf level anyway even in the builder.

Next, consider the approach where we do have slack to avoid recreating leaf arrays on every change:

  • This is a win when the user is inserting/adding/removing multiple elements in the same leaf node. If the user is touching random elements throughout the list, it is less impactful and may even be harmful, since it will force us to do more reallocations when ToImmutable() is called. However, if the user is touching a bunch of elements that are close together, they could simply use the batch operations (e.g. RemoveRange, AddRange), which are very efficient for the B+ tree. In general, I'm curious what the target use-case is for the builders.
  • While I don't think it would be unbearable, this approach would increase the complexity of the builder code and may make it more difficult to share code between the builder and the immutable list. Not sure how much of a factor this should be.

Finally, I'll note a couple of "in-between" options that might be worth thinking about:

  • We could have "slack" only at the leaf level on the premise that it is more impactful there.
  • We could have "slack" only at the leading edge of the tree to optimize for Add() specifically (I already have an optimization like this for the immutable version which is why Add() is faster and incurs fewer allocations than Insert()).

Sorry for the big wall of text, but I'm curious to hear your thoughts, especially which specific factors/benchmarks should influence the decision.

@AArnott (Contributor) commented May 7, 2022

can you clarify what you mean by "cutting the slack" here? Are you coming down in favor of fast, alloc-free ToImmutable (no slack in the arrays) or in favor of more expensive ToImmutable() in exchange for fewer allocations on the individual builder operations?

Fewer allocations in the Builder demand slack. ToImmutable's lean result demands no slack. So the direction I'm leaning is to create arrays with slack in the Builder, and then trim the slack away during ToImmutable. It wins here and loses there, but as Builders are used to initialize large collections, I think these are the right trade-offs.

@AArnott (Contributor) commented May 7, 2022

I just finished reading your last comment. Great points.
I wonder how hard it would be to code it to allow for either possibility (slack or none during mutation). Maybe the policy could allow for a few mutations with no slack; within one Builder's lifetime, if mutations keep coming, we change the optimizing assumption and start leaving slack in arrays... just an idea. It's not worth it if it adds so much complexity that bugs or inadequate testing result.

I guess, considering the arrays at each node wouldn't exceed some (small) length (right?), allocating another array and copying elements over might not be too bad. But it would be interesting to measure filling 15000 items in a collection both ways to see how perf is impacted (and how much GC pressure each produces). Maybe you've already collected that data above... I haven't studied the tables.

this approach would increase the complexity of the builder code and may make it more difficult to share code between the builder and the immutable list. Not sure how much of a factor this should be.

If there is a significant perf win and we have adequate tests for all the code, then the code not being fully shared feels OK to me.

We could have "slack" only at the leading edge of the tree to optimize for Add() specifically (I already have an optimization like this for the immutable version which is why Add() is faster and incurs fewer allocations than Insert()).

Such optimizations sound great. And then there's AddRange(IEnumerable<T>), which presumably should allow you to perform optimally (particularly if a Count property is available via conditional casting).

@sharwell (Member, Author) commented May 9, 2022

Note that the implementation I worked on has different slack characteristics. While the implementation uses fixed-size inline arrays, it separately guarantees minimum slack exists in cases where a list is constructed from beginning to end (e.g. starting from empty and calling Add/AddRange).

@madelson (Contributor) commented May 9, 2022

separately guarantees minimum slack exists in cases where a list is constructed from beginning to end

By this do you mean that the resulting B+ tree is "compact", in the sense that a naive implementation of Add would result in lots of half-full nodes, but instead you are careful to construct it such that as many nodes as possible are full?

If so, that's also what I'm doing in the array-based implementation, although I haven't built this out for the builder yet.

It would be interesting to measure filling 15000 items in a collection both ways to see how perf is impacted (and how much GC pressure each produces)

@AArnott are you imagining randomized insertion or Add? As mentioned above, Add will be able to take advantage of some other optimizations and shouldn't be as impacted by how we handle slack.

@madelson (Contributor)

@AArnott OK, I had some time to code up Builder.Add() with allowance for slack in the leaves to reduce allocations. Here are the numbers:

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| BuilderAdd<Int32> (15K) | 0.15 | 0.21 | 1.45 | 0.19 |
| BuilderAdd<String> (15K) | 0.25 | 0.34 | 1.60 | 0.28 |

Interestingly, without the slack optimization the timing numbers are comparable, but the allocation numbers are worse (a ratio of ~2.0 instead of the ~0.25 we see here). Therefore, I think having slack in the leaves is a good fit for the builder. There's a good amount of room to tune as well: right now we always expand the node to maximum size, but we could instead do something more conservative, like going to max/2 on the first expansion, which might be better for insertions. We could also switch from a no-slack mode to a slack mode based on the number of expansions performed, like you mentioned. I'm not sure either of those is worth exploring now, but it could make sense in the future.

@AArnott (Contributor) commented Jun 8, 2022

Those numbers look good. I prefer the "prototype" columns to the "TunnelVision" columns, as a few percentage points more memory seems totally worth being several times faster.

@sharwell (Member, Author) commented Jun 9, 2022

@AArnott note that the time ratio for my branch is primarily caused by the fact that the inline 8-element array search has not been vectorized, and it uses a coding pattern in the indexer which the JIT doesn't know how to optimize. It is designed to support branchless lookups at each level of the index on machines with 256-bit registers once that work is complete.

@madelson (Contributor)

@sharwell are you referring to something that would optimize these switch statements?

If so, could Unsafe be used to avoid branching in the meantime? Something like this:

```csharp
// Unsafe here is System.Runtime.CompilerServices.Unsafe.
static ref T Item<T>(ref FixedArray8<T> array, int index)
{
    if ((uint)index >= 8) { throw new IndexOutOfRangeException(); }

    return ref Unsafe.Add(ref Unsafe.As<FixedArray8<T>, T>(ref array), index);
}
```

Not sure if this is making some invalid assumptions about layout.

@sharwell (Member, Author) commented Jun 10, 2022

@sharwell are you referring to something that would optimize these switch statements?

That's part of the issue, but it can go much further. For example, an 8-element binary search could be implemented in a completely vectored form based on code similar to https://fioraaeterna.tumblr.com/post/26644202378/fast-binary-searching.
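For a flavor of what a fully vectored form might look like, here is a sketch under stated assumptions (AVX2 available, the node's eight cumulative counts packed into a Vector256<int>, and index less than the total count so at least one lane compares greater); this is not the dotnet-trees implementation:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Branchless 8-way child selection: find the first slot whose cumulative
// count exceeds the target index.
static int FindSlot(Vector256<int> cumulativeCounts, int index)
{
    Vector256<int> gt = Avx2.CompareGreaterThan(cumulativeCounts, Vector256.Create(index));
    int mask = Avx2.MoveMask(gt.AsByte());             // 4 mask bits per int lane
    return BitOperations.TrailingZeroCount(mask) >> 2; // lane index of first match
}
```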

If so, could Unsafe be used to avoid branching in the meantime? Something like this:

```csharp
static ref T Item<T>(ref FixedArray8<T> array, int index)
{
    if ((uint)index >= 8) { throw new IndexOutOfRangeException(); }

    return ref Unsafe.Add(ref Unsafe.As<FixedArray8<T>, T>(ref array), index);
}
```

Not sure if this is making some invalid assumptions about layout.

This was implemented in tunnelvisionlabs/dotnet-trees#102, but there are concerns that this may violate layout assumptions made by the JIT. Switching to Vector256<int> would resolve this in a different way.
