
[Proposal] Use B+-trees instead of AVL trees for Immutable Collections #14477

Open · sharwell opened this issue Apr 24, 2015 · 47 comments

Labels: area-System.Collections, enhancement, help wanted [up-for-grabs], tenet-performance

@sharwell (Member)

Currently the Node class used for ImmutableList<T> carries 40 bytes of overhead per list element on X64, and 24 bytes on X86. For an ImmutableList<int>, whose elements each hold only 4 bytes of data, that puts the memory cost of the structure at 1000% on X64 and 600% on X86 (these figures represent the memory used for storing non-data elements associated with the structure, relative to the data itself).

I propose migrating to a B+-tree instead, using a fixed value for b and declaring data values in the index and data pages directly (to avoid unnecessary memory overhead for arrays). The choice of b primarily affects the following things:

  1. The height of the tree. Shorter tree heights improve performance for random access lookups by improving memory locality for a majority of traversal operations.
  2. The size of the pages. Mutations of immutable data structures require reconstructing pages leading back to the root. If the pages are large, mutations could become expensive.

In addition to the above, B+-trees provide a substantial memory savings by removing the child pointers (the Left and Right pointers for AVL nodes) from the leaves.

My initial calculations (which could be completely wrong) indicate that the average memory allocation required for a single mutation operation on a list of 100000 elements using the AVL structure is 775 bytes on X64 and 493 bytes on X86. The optimal value of b for X64 in terms of mutation allocation overhead is b=8, which costs 476 bytes per operation on a list of the same size (a 39% reduction). On X86 the same value of b costs 388 bytes per mutation (a 21% reduction; b=6 would be optimal there, at 382 bytes). The overall storage overhead for b=8 drops dramatically, to 169% on X64 and 119% on X86 (counting the average page at 75% capacity, based on a merge at 50% and a split at 100%).
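For concreteness, here is a minimal sketch of what a fixed-b layout with inline fields might look like, assuming b=8; the type and field names are illustrative only, not part of the proposal:

```csharp
// Sketch only: an 8-ary B+-tree with counts and values declared inline,
// avoiding per-node array allocations as the proposal suggests.
sealed class IndexNode<T>
{
    // Cumulative child counts; the indexer compares against these to pick a child.
    internal int Count0, Count1, Count2, Count3, Count4, Count5, Count6, Count7;
    // Children are IndexNode<T> or LeafNode<T>.
    internal object Child0, Child1, Child2, Child3, Child4, Child5, Child6, Child7;
}

sealed class LeafNode<T>
{
    // No Left/Right child pointers: values are stored directly, which is where
    // most of the per-element savings over the AVL node comes from.
    internal T Value0, Value1, Value2, Value3, Value4, Value5, Value6, Value7;
}
```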

@stephentoub (Member)

cc: @AArnott, @pharring

@AArnott (Contributor) commented Apr 24, 2015

I'm not very familiar with B+-trees, but my first concern is whether they make it necessary to discard more of the tree than AVL trees do on mutations. If I'm reading it correctly, @sharwell addresses that concern and suggests that this actually gets better (because the trees are shallower overall?). It sounds better for memory, and we know that n-ary trees in general have better lookup performance than AVL trees. So this sounds like a promising idea.

@sharwell (Member, Author)

@AArnott You are reading that correctly. For "standard" collections, the value for b is generally chosen for best interaction with the garbage collector and processor caches (e.g. relatively large pages for great locality, but intentionally small enough to keep them out of the large object heap for most expected types). I had excellent results with b set to 128 or larger for this type of data structure.

In the case of Immutable Collections, I approached the analysis from the other side of minimizing the cost of mutations.

@AArnott (Contributor) commented Apr 24, 2015

Great. Another concern is construction performance. When we had a 4-ary tree, construction was much slower.

@pharring (Contributor)

Seems like a nice improvement to me. We've avoided ImmutableList in Roslyn because of the measured allocation overhead. (But we lean heavily on ImmutableArray to keep us happy.)

@ghost commented May 14, 2015

+1 👍

@AArnott (Contributor) commented May 15, 2015

One other thing to keep tabs on is what it means for Builders.

@ghost commented May 17, 2015

The Rust language uses B-trees for its maps and sets :D

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Mar 25, 2020
@adamsitnik adamsitnik added the tenet-performance Performance related issue label Nov 2, 2020
@evilguest

This is likely to make the mutators' execution times worse (Add, Insert, etc.) because of the increased footprint of the nodes to clone/alter. Would that bother you less than the overall memory consumption does?

@sharwell (Member, Author) commented Nov 27, 2020

... because of the increased footprint of the nodes to clone/alter ...

@evilguest The proposal is explicitly based on the idea that this is not a problem. While each index node in the B+-tree is larger than a single node in the AVL tree, many fewer nodes need to be modified during a mutation operation. For the sample collection size of 100,000 elements, the single-mutation memory allocation reduction is 39% on X64 and 21% on X86. There may be a few collection sizes that do not benefit from this, but I'm not sure what those are right now.

@evilguest

Thanks. I am experimenting in the same direction as https://github.com/tunnelvisionlabs/dotnet-trees, and the factory implementation of System.Collections.Immutable.ImmutableList is surprisingly hard to beat from a performance standpoint. E.g. ImmutableList.this[i] in tunnelvision's version is 2-10 times worse than in the MS one (depending on the list size).
I am still benchmarking the modifications, but expect similar issues (as almost 50% of the work is finding the leaf to clone).

@arkalyanms (Member)

I am still benchmarking the modifications, but expect similar issues (as almost 50% of the work is finding the leaf to clone).

@evilguest Did you get to benchmarking your modifications?

@sharwell (Member, Author) commented Feb 8, 2021

@arkalyanms the biggest help would be a bunch of benchmark scenarios to establish baselines. I can give you an overview of the in-flight work tomorrow. 😃

@evilguest

My benchmarking suite can be found at https://github.com/evilguest/atropos/tree/main/Atropos.Benchmarks

@evilguest

I am not yet happy with the List implementation, because of both its memory overhead and its underperformance on the modification operations.

@madelson (Contributor)

I'm potentially interested in working on this.

@danmoseley (Member)

@madelson probably the best first step is to ensure that the scenarios we care about are fully covered in dotnet/performance, before making any changes here.

Certainly for elapsed time - @kunalspathak does the lab detect allocation regressions as well?

@madelson (Contributor)

@madelson probably the best first step is to ensure that the scenarios we care about are fully covered in dotnet/performance, before making any changes here.

@danmoseley are there particular benchmarks to be added? Looking at this folder in the benchmarks repo it seems like some of the benchmarks incorporate the immutable collections but many do not. Is the thought to flesh that out or to create specific targeted benchmarks for the immutable collections?

@danmoseley (Member)

@madelson right, there is not much coverage of the immutable collections there today. We have not changed the implementations much. If we're contemplating more radical changes, we first need to ensure that any interesting performance regression (one that would impact Roslyn would certainly qualify) would be detected by our set of benchmarks. So I'm suggesting perhaps some investment into those.

@jaredpar (Member)

We maybe should amp those up there. @jaredpar do you have benchmarks that would focus on the immutable collections?

No, we do not have any that we could migrate. But @sharwell generally handles performance for Roslyn overall, so I expect he'd be testing our key scenarios as part of evaluating this change.

@sharwell (Member, Author) commented Feb 15, 2022

Note that Roslyn is becoming a bit desensitized to this issue. Since many of our performance-sensitive areas construct a collection that is subsequently read but not mutated, we've been migrating to segmented collections (essentially a 2-level B+ tree with a variable-sized index level and constant-count leaves). CPS and MSBuild are two other candidates for this performance testing, though.

@AlgorithmsAreCool (Contributor)

I'm going to leave this comment here to remind myself of a future API proposal if this issue makes it in.

Unsolicited Suggestion 1

The current proposal is for a fixed b, but there is value in a configurable b. Immutable collections are used in a broad range of applications, and allowing the user to choose the arity gives developers more flexibility.

Need faster read performance? Great, increase the arity to make scans and seeks faster.
Need to update the collection a lot? Lower the arity to reduce copying and make updates more efficient.

Unsolicited Suggestion 2

As a side note, I would also love a scan API that takes advantage of the internal structure and allows block-based iteration. This shape might need some work to allow the action to stop the scan early.

```csharp
public static class CollectionsMarshal
{
    public static void Scan<TArg, T>(ImmutableList<T> list, TArg arg, ReadOnlySpanAction<T, TArg> action);
    public static void ScanRange<TArg, T>(ImmutableList<T> list, TArg arg, Range range, ReadOnlySpanAction<T, TArg> action);
}
```
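For illustration, here is a hypothetical use of that shape (this API does not exist today): summing an ImmutableList<int> one leaf-sized span at a time rather than element by element.

```csharp
// Hypothetical: sums the list leaf by leaf via the proposed Scan API above.
var total = new int[1]; // one-element box for the running sum, passed as TArg
CollectionsMarshal.Scan(list, total, static (ReadOnlySpan<int> span, int[] acc) =>
{
    foreach (int value in span)
        acc[0] += value;
});
Console.WriteLine(total[0]);
```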

@madelson (Contributor) commented Mar 13, 2022

To try it out, I made a prototype of a B+ tree ImmutableList<T> with a subset of operators and benchmarked it against the .NET 6 ImmutableList<T>. There's still a lot to be improved, but the results seem promising (for the ratios, lower is better and the current ImmutableList<T> is 1.0):

| Benchmark | Time Ratio | Bytes Allocated Ratio |
|---|---|---|
| Indexer for loop 5 ints | 0.56 | n/a |
| Indexer for loop 512 ints | 0.77 | n/a |
| Indexer for loop 10K ints | 0.59 | n/a |
| Indexer for loop 5 strings | 0.85 | n/a |
| Indexer for loop 512 strings | 0.95 | n/a |
| Indexer for loop 10K strings | 0.70 | n/a |
| Add 5 ints | 0.35 | 0.47 |
| Add 512 ints | 0.28 | 0.62 |
| Add 10K ints | 0.22 | 0.65 |
| Add 5 strings | 0.58 | 0.54 |
| Add 512 strings | 0.41 | 0.60 |
| Add 10K strings | 0.31 | 0.61 |
| Insert 5 ints in random locations | 0.49 | 0.54 |
| Insert 512 ints in random locations | 0.40 | 0.76 |
| Insert 10K ints in random locations | 0.33 | 0.91 |
| Insert 5 strings in random locations | 0.84 | 0.62 |
| Insert 512 strings in random locations | 0.55 | 0.80 |
| Insert 10K strings in random locations | 0.55 | 0.86 |
| CreateRange 5 ints | 0.99 | 1.28 |
| CreateRange 512 ints | 0.49 | 0.14 |
| CreateRange 10K ints | 0.39 | 0.11 |
| CreateRange 5 strings | 0.91 | 1.00 |
| CreateRange 512 strings | 0.59 | 0.25 |
| CreateRange 10K strings | 0.57 | 0.23 |
| AddRange 50 ints to 50 ints | 0.66 | 0.33 |
| AddRange 10K ints to 10K ints | 0.42 | 0.11 |
| AddRange 50 strings to 50 strings | 0.62 | 0.33 |
| AddRange 10K strings to 10K strings | 0.59 | 0.23 |

Benchmark info:

```
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1526 (21H1/May2021Update)
Intel Core i7-3632QM CPU 2.20GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.200
  [Host]     : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
  DefaultJob : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
```

@AArnott (Contributor) commented Mar 17, 2022

Those numbers look great; indexing and adding both look good. I'm interested in moving forward with this.

@madelson (Contributor)

@AArnott if this is something I can contribute, how would you recommend going about the development?

For example, would it make sense for me to do a writeup of the proposed design to get feedback/alignment on that as a first step?

I'm also curious what the benchmarking/validation process should be. I did add some additional benchmarks to dotnet/performance (one set got merged, the other is still in PR), but I could imagine that for a change like this it could be desirable to have additional, more detailed benchmarks for a side-by-side comparison (as I've posted above). There was also discussion of trying out the new approach in MSBuild. In addition, some way to run the benchmark suite on different operating systems seems important.

@AArnott (Contributor) commented Mar 17, 2022

Good questions, @madelson. While I wrote the current implementation of the immutable collections, I don't typically work in this repo so I think we should wait for some other folks who accept PRs to this repo to provide the guidance you're looking for.

I don't see the Builder pattern included in your prototype. I'll be interested in seeing how you can incorporate that. The memory/perf characteristics of our current builder implementation are that creating a builder based on an immutable collection is O(1) in time and memory. Switching back from builder to immutable collection is O(n) in time and O(1) in memory, where n is the number of nodes actually changed while in Builder mode. Can we achieve that with your B+ tree?

@madelson (Contributor)

I think we should wait for some other folks who accept PRs to this repo to provide the guidance you're looking for.

Makes sense.

I don't see the Builder pattern included in your prototype. I'll be interested in seeing how you can incorporate that.

I've thought about this a bit but haven't built anything. Obviously the fact that I'm using arrays as nodes prevents us from following the current pattern of having a "frozen" bool on each node.

The first step is representing mutability. For internal nodes, I believe we can use the highest-order bit of the CumulativeChildCount field (currently unused because all counts are non-negative) to indicate that that child is mutable. The builder object itself would additionally store a flag to indicate whether or not the root node is mutable. Therefore, as we traverse down the tree we always know whether or not the current node we're looking at is mutable based on context passed from the parent caller.

Using this marking system, we can create a new builder in O(1) time/memory and we can convert back to an immutable structure in-place by traversing the tree and freezing all mutable nodes we encounter (short-circuiting on already-frozen nodes) just like we do today.
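A minimal sketch of that tagging scheme, assuming the CumulativeChildCount layout described above (the helper names are hypothetical):

```csharp
// Sign bit of a child's cumulative count marks that child as builder-mutable;
// real counts are always non-negative, so the bit is otherwise unused.
const int MutableBit = unchecked((int)0x80000000);

static bool IsChildMutable(int cumulativeChildCount)
    => (cumulativeChildCount & MutableBit) != 0;

static int ChildCount(int cumulativeChildCount)
    => cumulativeChildCount & ~MutableBit;

// ToImmutable() walks the tree clearing MutableBit, short-circuiting at any
// child whose bit is already clear, so the cost is proportional to the
// number of nodes actually mutated in builder mode.
```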

One design question with this approach is how much we should try to unify the builder and immutable methods. For example, the builder methods would always need to mask out the highest-order bit when reading CumulativeChildCount. If we share all of the code, we'd be taking on that extra overhead in the immutable case as well.

Another design question is whether the builder should attempt to use partially-full arrays to eliminate allocations. For example, in the immutable world, if we have a leaf of size 4, we represent it with just a 4-element array; the same goes for internal nodes. In the builder world I see two options:
(1) We could continue this practice of using arrays that are exactly the right size, which means arrays are replaced whenever they "grow".
(2) We could instead adopt an approach where we double the size of the array any time it runs out of space (à la List<T>), thus leaving room to "grow" the array in place.

With approach (1), we end up allocating a new leaf node for every builder add/insert, and much less frequently we'll need to allocate new internal nodes as well. In contrast, approach (2) can sometimes add/insert without creating any new nodes.

On the other hand, when using approach (1) converting back to the immutable structure is trivial whereas for approach (2) we either have to "compact" many nodes or we have to make the immutable structure able to handle partially-full nodes and accept the loss of space efficiency. Approach (1) also leads to simpler code because we can use the array's length as the node count whereas with approach (2) we have to flow the true count down from the parent node.

All things considered, I would probably lean towards approach (1) because it seems simpler, and because the existing algorithm's builder has to create a new leaf node for every add/insert anyway, so we're not losing a ton there. But I'm not sure. I'm kind of assuming that builders are a niche use-case, since you're not using ImmutableList<T> unless you want the persistent/functional data structure aspects.
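To make the two options concrete, here are minimal sketches of the growth policies (hypothetical helpers; real code would also cap leaves at b and handle splits):

```csharp
// Approach (1): exact-size leaves; every add allocates a fresh array.
static T[] AddExact<T>(T[] leaf, T item)
{
    var grown = new T[leaf.Length + 1];
    leaf.AsSpan().CopyTo(grown);
    grown[^1] = item;
    return grown;
}

// Approach (2): List<T>-style slack; the leaf grows in place until capacity
// runs out, so many adds allocate nothing, at the cost of trailing slack
// that ToImmutable() must later compact or tolerate.
static void AddWithSlack<T>(ref T[] leaf, ref int count, T item)
{
    if (count == leaf.Length)
        Array.Resize(ref leaf, Math.Max(4, leaf.Length * 2));
    leaf[count++] = item;
}
```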

@sharwell (Member, Author)

@madelson I'd be interested to see those benchmarks run against the implementation in tunnelvisionlabs/dotnet-trees. I have a builder implementation there which adheres to the complexity bounds mentioned by @AArnott.

@madelson (Contributor)

@sharwell here's what I got when I added TunnelVision's ImmutableTreeList to the benchmarks (using the NuGet package):

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| Index<Int32> (5) | 0.56 | NaN | 1.79 | NaN |
| Index<Int32> (512) | 0.80 | NaN | 6.84 | NaN |
| Index<Int32> (10K) | 0.59 | NaN | 5.57 | |
| Index<String> (5) | 0.99 | NaN | 1.14 | NaN |
| Index<String> (512) | 0.83 | NaN | 2.34 | NaN |
| Index<String> (10K) | 0.81 | 1.00 | 2.89 | 2.00 |
| Add<Int32> (5) | 0.34 | 0.47 | 0.92 | 0.54 |
| Add<Int32> (512) | 0.28 | 0.62 | 2.41 | 0.70 |
| Add<Int32> (10K) | 0.21 | 0.65 | 1.76 | 0.80 |
| Add<String> (5) | 0.56 | 0.54 | 1.00 | 0.75 |
| Add<String> (512) | 0.43 | 0.60 | 2.52 | 0.77 |
| Add<String> (10K) | 0.34 | 0.61 | 1.77 | 0.85 |
| Insert<Int32> (5) | 0.49 | 0.54 | 1.07 | 0.62 |
| Insert<Int32> (512) | 0.38 | 0.75 | 3.12 | 0.88 |
| Insert<Int32> (10K) | 0.34 | 0.91 | 2.55 | 0.89 |
| Insert<String> (5) | 0.79 | 0.62 | 1.28 | 0.86 |
| Insert<String> (512) | 0.60 | 0.80 | 3.42 | 0.97 |
| Insert<String> (10K) | 0.46 | 0.85 | 2.17 | 0.94 |
| CreateRange<Int32> (5) | 0.90 | 1.28 | 1.31 | 0.69 |
| CreateRange<Int32> (512) | 0.48 | 0.14 | 2.35 | 0.23 |
| CreateRange<Int32> (10K) | 0.40 | 0.11 | 3.30 | 0.24 |
| CreateRange<String> (5) | 1.04 | 1.33 | 1.23 | 0.81 |
| CreateRange<String> (512) | 0.61 | 0.25 | 2.06 | 0.32 |
| CreateRange<String> (10K) | 0.52 | 0.23 | 2.49 | 0.32 |
| AddRange<Int32> (50) | 0.58 | 0.53 | 1.64 | 0.35 |
| AddRange<Int32> (10K) | 0.43 | 0.11 | 3.30 | 0.24 |
| AddRange<String> (50) | 0.63 | 0.33 | 1.43 | 0.33 |
| AddRange<String> (10K) | 0.58 | 0.23 | 2.77 | 0.32 |

@madelson (Contributor)

The memory/perf characteristics of our current builder implementation is that creating a builder based on an immutable collection is O(1) in time and memory. Switching back from builder to immutable collection is O(n) in time and O(1) in memory, where n is the number of nodes actually changed while in Builder mode. Can we achieve that with your B+ tree?

I added a basic builder implementation to the prototype. For now the builder just supports indexer get/set and ToBuilder/ToImmutable. However, this should illustrate how the builder can function in general and it meets the performance criteria above.

Benchmarks:

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| BuilderSetItem<Int32> (5) | 0.38 | 0.56 | 0.88 | 0.56 |
| BuilderSetItem<Int32> (512) | 0.23 | 0.16 | 1.89 | 0.21 |
| BuilderSetItem<Int32> (10K) | 0.27 | 0.12 | 1.73 | 0.20 |
| BuilderSetItem<String> (5) | 0.61 | 0.63 | 0.99 | 0.70 |
| BuilderSetItem<String> (512) | 0.36 | 0.25 | 1.97 | 0.29 |
| BuilderSetItem<String> (10K) | 0.35 | 0.23 | 1.79 | 0.28 |

The benchmark I'm using does the following:

  • Given an immutable list with N items
  • Call ToBuilder()
  • Set the first 25% of the builder's indices to different values using the indexer
  • Call ToImmutable()

@madelson (Contributor)

A few more benchmarks to share:

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| RemoveAt<Int32> (5) | 0.56 | 0.71 | 1.96 | 0.90 |
| RemoveAt<Int32> (512) | 0.78 | 1.01 | 4.80 | 0.96 |
| RemoveAt<Int32> (10K) | 0.75 | 1.15 | 3.25 | 1.06 |
| RemoveAt<String> (5) | 0.93 | 0.79 | 1.91 | 1.23 |
| RemoveAt<String> (512) | 1.03 | 1.01 | 4.67 | 1.05 |
| RemoveAt<String> (10K) | 0.89 | 1.07 | 3.53 | 1.12 |
| RemoveRange<Int32> (5) | 0.27 | 0.43 | 1.71 | 0.48 |
| RemoveRange<Int32> (512) | 0.01 | 0.02 | 2.40 | 0.20 |
| RemoveRange<Int32> (10K) | 0.00 | 0.00 | 2.25 | 0.20 |
| RemoveRange<String> (5) | 0.57 | 0.48 | 1.87 | 0.67 |
| RemoveRange<String> (512) | 0.01 | 0.03 | 2.62 | 0.28 |
| RemoveRange<String> (10K) | 0.00 | 0.00 | 2.63 | 0.28 |
| IndexOf<Int32> (5) | 0.10 | NaN | 0.20 | NaN |
| IndexOf<Int32> (512) | 0.17 | NaN | 0.87 | NaN |
| IndexOf<Int32> (10K) | 0.17 | NaN | 1.01 | NaN |
| IndexOf<String> (5) | 0.08 | NaN | 0.19 | NaN |
| IndexOf<String> (512) | 0.11 | NaN | 0.45 | NaN |
| IndexOf<String> (10K) | 0.11 | 0.00 | 0.48 | 0.00 |
| CopyTo<Int32> (5) | 0.07 | NaN | 0.34 | NaN |
| CopyTo<Int32> (512) | 0.04 | NaN | 8.58 | NaN |
| CopyTo<Int32> (10K) | 0.03 | NaN | 13.42 | |
| CopyTo<String> (5) | 0.10 | NaN | 0.23 | NaN |
| CopyTo<String> (512) | 0.05 | NaN | 2.19 | NaN |
| CopyTo<String> (10K) | 0.05 | NaN | 4.41 | |

Some notes:

  • The RemoveAt benchmark starts with a list of the given size and removes indices at random until the list is empty.
  • The RemoveRange benchmark starts with a list of the given size and removes a single range consisting of the middle third of the list's elements.
  • The IndexOf benchmark searches for an element that isn't in the list.
  • RemoveRange(index, count) can be implemented very efficiently for a B+ tree. Our RemoveAt(index) just calls RemoveRange(index, 1).
  • IndexOf and CopyTo leverage a Span-based scanning helper like what @AlgorithmsAreCool suggested (see the sketch after this list).
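To illustrate why the span-based helper pays off, here is a sketch of the idea (not the prototype's actual code); the in-order leaf enumeration is an assumed helper:

```csharp
// Sketch: IndexOf over leaf spans. Each inner search is a contiguous span
// scan (vector-accelerated for common element types) instead of a
// per-element tree traversal.
static int IndexOfViaLeaves<T>(IEnumerable<T[]> leavesInOrder, T item)
    where T : IEquatable<T>
{
    int offset = 0;
    foreach (T[] leaf in leavesInOrder) // assumed: yields leaf arrays in list order
    {
        int i = leaf.AsSpan().IndexOf(item);
        if (i >= 0)
            return offset + i;
        offset += leaf.Length;
    }
    return -1;
}
```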

@madelson (Contributor)

Added a stack-based enumerator like the one ImmutableList<T> has. Enumeration is also significantly faster, likely because most MoveNext() calls are just advancing through a leaf node.

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| Enumeration<Int32> (5) | 0.24 | NaN | 0.51 | NaN |
| Enumeration<Int32> (512) | 0.39 | NaN | 1.87 | NaN |
| Enumeration<Int32> (10K) | 0.35 | NaN | 2.60 | NaN |
| Enumeration<String> (5) | 0.24 | NaN | 0.37 | NaN |
| Enumeration<String> (512) | 0.34 | NaN | 0.76 | NaN |
| Enumeration<String> (10K) | 0.31 | 0.00 | 0.95 | 1.00 |

@AArnott (Contributor) commented May 6, 2022

I haven't studied the perf numbers, but I wanted to respond to one of the questions about memory allocations: people use the builders to reduce allocations and improve the perf of creating a structure that will be frozen later. Allocating and copying an array on every Add would defeat that. For that reason, I support having 'slack' in the array, at least during the builder's use.
At that point, then, we need to decide whether it's better to have a very fast, alloc-free ToImmutable or to remove the slack from the arrays. I'm not sure on this. I'm leaning slightly towards cutting the slack.

@madelson (Contributor) commented May 6, 2022

For that reason I support having 'slack' in the array at least during the builder's use.
At that point then, we need to decide whether it's better to have very fast and alloc-free ToImmutable or to remove the slack from the arrays. I'm not sure on this. I'm leaning slightly towards cutting the slack.

@AArnott can you clarify what you mean by "cutting the slack" here? Are you coming down in favor of fast, alloc-free ToImmutable (no slack in the arrays) or in favor of more expensive ToImmutable() in exchange for fewer allocations on the individual builder operations?

The topic of how the builder should work is one I've given a bit of thought to and I think there are some interesting considerations / options / tradeoffs worth mentioning.

First off, consider the approach where there is no slack (arrays are always full). While this might seem like it defeats the point of having a builder, it still offers real benefits:

  • SetItem() calls will not allocate once the nodes along the path have been "mutablized" since no expansion/contraction is required
  • Add/Remove/Insert will allocate every time at the leaf level, but don't necessarily allocate at the internal level (internal node allocations only happen when enough changes have been made to trigger a merge/split). For the AVL tree, Add/Insert operations always allocate at the leaf level anyway even in the builder.

Next, consider the approach where we do have slack to avoid recreating leaf arrays on every change:

  • This is a win when the user is inserting/adding/removing multiple elements in the same leaf node. If the user is touching random elements throughout the list, it is less impactful and may even be harmful, since it will force us to do more reallocations when ToImmutable() is called. However, if the user is touching a bunch of elements that are close together, they could simply use the batch operations (e.g. RemoveRange, AddRange), which are very efficient for the B+ tree. In general, I'm curious what the target use-case is for the builders.
  • While I don't think it would be unbearable, this approach would increase the complexity of the builder code and may make it more difficult to share code between the builder and the immutable list. Not sure how much of a factor this should be.

Finally, I'll note a couple of "in-between" options that might be worth thinking about:

  • We could have "slack" only at the leaf level on the premise that it is more impactful there.
  • We could have "slack" only at the leading edge of the tree to optimize for Add() specifically (I already have an optimization like this for the immutable version which is why Add() is faster and incurs fewer allocations than Insert()).

Sorry for the big wall of text, but I'm curious to hear your thoughts, especially which specific factors/benchmarks should influence the decision.

@AArnott (Contributor) commented May 7, 2022

can you clarify what you mean by "cutting the slack" here? Are you coming down in favor of fast, alloc-free ToImmutable (no slack in the arrays) or in favor of more expensive ToImmutable() in exchange for fewer allocations on the individual builder operations?

Fewer allocations in the Builder demand slack. ToImmutable's lean result demands no slack. So the direction I'm leaning is to create arrays with slack in the Builder, and then trim the slack away during ToImmutable. It wins here and loses there, but as Builders are used to initialize large collections, I think these are the right trade-offs.

@AArnott (Contributor) commented May 7, 2022

I just finished reading your last comment. Great points.
I wonder how hard it would be to code it to allow for either possibility (slack or none during mutation). Maybe the policy could allow for a few mutations with no slack; within one Builder's lifetime, if mutations keep coming, we change the optimizing assumption and start leaving slack in arrays... just an idea. It's not worth it if it adds so much complexity that bugs or inadequate testing result.

I guess, considering the arrays at each node wouldn't exceed some (small) length (right?), allocating another array and copying elements over might not be too bad. But it would be interesting to measure filling 15000 items in a collection both ways to see how perf is impacted (and how much GC pressure each produces). Maybe you've already collected that data above... I haven't studied the tables.

this approach would increase the complexity of the builder code and may make it more difficult to share code between the builder and the immutable list. Not sure how much of a factor this should be.

If there is a significant perf win and we have adequate tests for all the code, then the code not being fully shared feels OK to me.

We could have "slack" only at the leading edge of the tree to optimize for Add() specifically (I already have an optimization like this for the immutable version which is why Add() is faster and incurs fewer allocations than Insert()).

Such optimizations sound great. And then there's AddRange(IEnumerable<T>), which presumably should allow you to perform optimally (particularly if a Count property is available via conditional casting).

@sharwell (Member, Author) commented May 9, 2022

Note that the implementation I worked on has different slack characteristics. While the implementation uses fixed-size inline arrays, it separately guarantees minimum slack exists in cases where a list is constructed from beginning to end (e.g. starting from empty and calling Add/AddRange).

@madelson (Contributor) commented May 9, 2022

separately guarantees minimum slack exists in cases where a list is constructed from beginning to end

By this do you mean that the resulting B+ tree is "compact", in the sense that a naive implementation of Add would result in lots of half-full nodes, but instead you are careful to construct it such that as many nodes as possible are full?

If so, that's also what I'm doing in the array-based implementation, although I haven't built this out for the builder yet.

It would be interesting to measure filling 15000 items in a collection both ways to see how perf is impacted (and how much GC pressure each produces)

@AArnott are you imagining randomized insertion or Add? As mentioned above, Add will be able to take advantage of some other optimizations and shouldn't be as impacted by how we handle slack.

@madelson (Contributor)

@AArnott OK, I had some time to code up Builder.Add() with allowance for slack in the leaves to reduce allocations. Here are the numbers:

| Benchmark | Time Ratio (Prototype) | Bytes Allocated Ratio (Prototype) | Time Ratio (TunnelVision) | Bytes Allocated Ratio (TunnelVision) |
|---|---|---|---|---|
| BuilderAdd<Int32> (15K) | 0.15 | 0.21 | 1.45 | 0.19 |
| BuilderAdd<String> (15K) | 0.25 | 0.34 | 1.60 | 0.28 |

Interestingly, without the slack optimization the timing numbers are comparable, but the allocation numbers are worse (a ratio of ~2.0 instead of the ~0.25 we see here). Therefore, I think having slack in the leaves is a good fit for the builder. There's a good amount of room to tune as well: right now we always expand the node to maximum size, but we could instead do something more conservative, like going to max/2 on the first expansion, which might be better for insertions. We could also switch from a no-slack mode to a slack mode based on the number of expansions performed, like you mentioned. I'm not sure either of those is worth exploring now, but it could make sense in the future.

@AArnott (Contributor) commented Jun 8, 2022

Those numbers look good. I prefer the "prototype" columns to the "TunnelVision" columns, as a few percentage points more memory seems totally worth being several times faster.

@sharwell (Member, Author) commented Jun 9, 2022

@AArnott note that the time ratio for my branch is primarily caused by the fact that the inline 8-element array search has not been vectorized, and it uses a coding pattern in the indexer which the JIT doesn't know how to optimize. It is designed to support branchless lookups at each level of the index on machines with 256-bit registers once that work is complete.

@madelson (Contributor)

@sharwell are you referring to something that would optimize these switch statements?

If so, could Unsafe be used to avoid branching in the meantime? Something like this:

```csharp
// Unsafe here is System.Runtime.CompilerServices.Unsafe.
static ref T Item<T>(ref FixedArray8<T> array, int index)
{
    if ((uint)index >= 8) { throw new IndexOutOfRangeException(); }

    return ref Unsafe.Add(ref Unsafe.As<FixedArray8<T>, T>(ref array), index);
}
```

Not sure if this is making some invalid assumptions about layout.

@sharwell (Member, Author) commented Jun 10, 2022

@sharwell are you referring to something that would optimize these switch statements?

That's part of the issue, but it can go much further. For example, an 8-element binary search could be implemented in a completely vectored form based on code similar to https://fioraaeterna.tumblr.com/post/26644202378/fast-binary-searching.
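For a flavor of what a fully vectored form might look like, here is a sketch under stated assumptions (AVX2 available, the node's eight cumulative counts packed into a Vector256<int>, and index less than the total count so at least one lane compares greater); this is not the dotnet-trees implementation:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Branchless 8-way child selection: find the first slot whose cumulative
// count exceeds the target index.
static int FindSlot(Vector256<int> cumulativeCounts, int index)
{
    Vector256<int> gt = Avx2.CompareGreaterThan(cumulativeCounts, Vector256.Create(index));
    int mask = Avx2.MoveMask(gt.AsByte());             // 4 mask bits per int lane
    return BitOperations.TrailingZeroCount(mask) >> 2; // lane index of first match
}
```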

If so, could Unsafe be used to avoid branching in the meantime? Something like this:

```csharp
static ref T Item<T>(ref FixedArray8<T> array, int index)
{
    if ((uint)index >= 8) { throw new IndexOutOfRangeException(); }

    return ref Unsafe.Add(ref Unsafe.As<FixedArray8<T>, T>(ref array), index);
}
```

Not sure if this is making some invalid assumptions about layout.

This was implemented in tunnelvisionlabs/dotnet-trees#102, but there are concerns that this may violate layout assumptions made by the JIT. Switching to Vector256<int> would resolve this in a different way.
