[Proposal] Use hash array mapped trie for ImmutableDictionary #26001
Comments
Do you have information about the retained memory for this type? I'd like to compare it to the numbers in #14477. Retained memory (storage density) is one of the primary concerns we've encountered when using the immutable collections.
I am not familiar with the term retained memory. Can you clarify by any chance? As far as storage density goes, it really depends on the hash of the objects. You can think of each non-leaf node as containing an array of up to size 32. However, it normally uses an exactly-sized array for the node until the array reaches size 16, at which point it resizes to the full 32 slots, until the number of non-null entries drops back to 8. As a result, the collection wastes very little on empty buckets, and most get operations should be at most 6 node hops away in the worst case. If you are talking about the overhead of the node objects themselves, I would also like to think it's minimal, but for the exact numbers for each type of node I will need to spend some time calculating them.
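For illustration, here is a minimal sketch of the two node shapes being described, using the BitmapIndexedNode / HashArrayMapNode naming that comes up later in this thread; the field layout is an assumption for the sake of the example, not the actual code:

```csharp
internal abstract class Node
{
}

// "Compressed" node: the entries array holds only occupied buckets, and a
// 32-bit bitmap records which of the 32 logical slots are present. The
// array is exactly sized, so empty buckets cost nothing.
internal sealed class BitmapIndexedNode : Node
{
    internal uint Bitmap;       // one bit per logical slot
    internal object[] Entries;  // length == PopCount(Bitmap), at most 16
}

// "Full" node: a plain 32-slot array indexed directly by the 5-bit hash
// fragment. Used once a node reaches 16 entries; converted back to the
// compressed form when the non-null entry count drops to 8.
internal sealed class HashArrayMapNode : Node
{
    internal int Count;         // number of non-null slots
    internal Node[] Slots;      // always length 32
}
```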
I did calculate memory usage for a few scenarios as well:
10,000 count and snapshots every 1,000 adds
10,000 count and snapshots every add
I was a bit surprised to see ImmutableDictionary using more here, since HAMT can copy several arrays that can be up to length 32 for each modification... but I guess the much smaller depth of the tree makes up for that. Keeping a reference to every single modified collection ends up with about the same memory usage overall. In my opinion, a lot more testing with various types for key/value would be needed to be sure that replacing the current ImmutableDictionary implementation would be a good idea - I only benchmarked <int, int> as I wanted to measure as much as possible the overhead of the structure only. One of the projects I am working on does use immutable structures extensively, so I may try dropping a replacement in and see what performance/memory impact it has.
@dbolin If you want to compare allocation performance with B+-trees, you can try using https://github.com/tunnelvisionlabs/dotnet-trees. The use of wrappers was a means to an end so the API could be functional given the limited time available to work on the library. I'd like to focus on getting a correct and high-performing implementation of
@dbolin Your benchmark might not be comprehensive enough (I am not saying mine is). The problem I see is that you used Int32 as the key, so your hash codes will only range from 0 to the benchmark parameter. This might not be representative of real-world use, where people use objects as keys. GetHashCode of Int32:
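For context, Int32's GetHashCode in the .NET reference source simply returns the value itself, which is why keys 0..N-1 hash to exactly 0..N-1 (a perfectly dense distribution):

```csharp
// System.Int32, .NET reference source: the hash code of an int is the
// value itself, so sequential int keys produce sequential hash codes.
public override int GetHashCode()
{
    return m_value;
}
```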
Updated benchmarks to include ImmutableTreeDictionary from TunnelVisionLabs. I also changed the integer keys to be randomized, and as expected this didn't make much difference, since hash code distributions should be uniform anyway. Also added some benchmarks using String keys. Memory usage of ImmutableTreeDictionary was 310,760 bytes with 10,000 integer keys; 1,040,128 bytes with snapshots every 1,000 keys; and 8,103,744 bytes with snapshots at every key added. I didn't try anything with ImmutableTreeList yet since I assume the API will be a bit different from the dictionaries.
It's pretty cool to see others taking an interest in HAMT and benchmarking other implementations. I think corefx should really look into HAMT to improve the performance of their ImmutableDictionary, or at least provide some guidelines/requirements for a replacement implementation.
@dbolin Your implementation would not work on non-x86 platforms because of the direct use of System.Runtime.Intrinsics.X86, but that should be an easy fix. I am more curious to know the improvement it made. Btw, I didn't know they were adding intrinsics in .NET. That's pretty cool :)
@stevenguh Yeah, it just needs a fallback for when the intrinsic is not supported. Without the intrinsic, lookups (which are the most affected) are between 10% and 25% slower depending on the collection size.
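A sketch of what such a fallback could look like: `BitHelper` is a hypothetical name, while `Popcnt.IsSupported` and `Popcnt.PopCount` are the real .NET Core hardware intrinsics APIs:

```csharp
using System.Runtime.Intrinsics.X86;

internal static class BitHelper
{
    // Population count with a software fallback, so the lookup path keeps
    // working on non-x86 platforms (just slower, per the numbers above).
    internal static int PopCount(uint value)
    {
        if (Popcnt.IsSupported)
        {
            return (int)Popcnt.PopCount(value);
        }

        // Classic SWAR bit count, used when the POPCNT instruction is absent.
        value -= (value >> 1) & 0x55555555u;
        value = (value & 0x33333333u) + ((value >> 2) & 0x33333333u);
        return (int)((((value + (value >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
    }
}
```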
This sounds good. I haven't studied your implementation, but at a high level I wonder why you would ever allocate an array that is larger than necessary. Having slack in your array doesn't permit you to add elements later, does it? Considering the collection is immutable, adding an element to an existing array would potentially violate immutability.

Another concern I have is whether your implementation suffers from retaining strong references to its elements. If you are sharing arrays across many versions of the collection, how do you know when to scrub a reference from an array slot? Suppose I remove or replace an element from the collection and get a new collection instance, and I release my reference to the old collection. Does the new collection reference an array that still references the element I removed (even if it's inaccessible outside)? If so, that would be a problem, as it would be a memory leak.

I'd also like to see benchmarks for the Builder case. IIRC the current corefx implementation is completely garbage-free when mutating builders (a bunch of adds or a bunch of removes). Is the same true with this proposed implementation? It may not need to be exactly true, but I'd like to know how much garbage is produced so we can weigh its significance against the perf wins here.
This is the main highlight of the collections I was working on. I'm not especially surprised to see it running slower or allocating more total data during mutations; much of that is a side effect of using a wrapper implementation strategy to prototype the collection and show that total retained memory is lower than alternatives.
On 64-bit, the wrappers would be directly responsible for 480,000 bytes of this, so once the wrappers are removed it would come down to 7,623,744 bytes.
This seems to be true for adding but not removing: adding 10k <int,int> pairs allocates 560,080 bytes, and the resulting object is 560,200 bytes, whereas removing those allocates 386,048 bytes and the result is 152 bytes. And here are benchmarks for the various builder implementations adding/removing respectively:
For the case of adding, the total allocated memory is about double that of corefx, and 75% of it is garbage.
@dbolin To clarify, we're interested in the following sequences:

Scenario 1: Create a builder, and add a bunch of items to it

Scenario 2:
Expected outcome:
@sharwell Assuming you mean a builder starting from empty, I'm very confident that it works like you expect. With a fixed branching factor there's never a need to reallocate mutable nodes. With HAMTs, the branching factor is anywhere from 2 to 32, so the array of branches must be reallocated whenever that factor changes. I'm sure the builder can be optimized to do better than mine, but I'd expect it to always produce a significant amount of garbage relative to the final size of the set. |
There are a few reasons, depending on what is being optimized. If we have a full-size node (HashArrayMapNode) with a full-size array (length 32 -- a branching factor of 32) in each node, then we don't have to use PopCount to find the index of an item; however, for each mutation we need to create another full array. On the other hand, if all the nodes are 'compressed' nodes (BitmapIndexedNode) with variable-length arrays, then we need to use PopCount to locate the bucket on get/set/remove (see the sketch below). To answer your question: yes, having slack in the array does not allow you to add elements later, except in the builder case; therefore, it doesn't violate immutability. For an item to be updated, all the nodes on the path to the leaf (value node) need to be copied. In the builder case, we ensure the node is "owned" by the builder, so we know that node has been copied and can be mutated safely. (Continued below)
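A sketch of how a compressed node could locate a bucket (hypothetical helper, not the author's code, reusing the `BitHelper.PopCount` fallback sketched earlier): the 5-bit hash fragment selects one of 32 logical slots, and counting the set bits *below* that slot's bit converts it into an index into the packed entries array.

```csharp
internal static int LookupIndex(uint bitmap, int hash, int shift)
{
    int fragment = (hash >> shift) & 0x1F;  // 5 bits per level => 32-way fan-out
    uint bit = 1u << fragment;

    if ((bitmap & bit) == 0)
    {
        return -1;  // logical slot is empty: the key is not in this node
    }

    // Physical index = number of occupied slots before this one. A full
    // HashArrayMapNode skips this step and indexes its 32-slot array directly.
    return BitHelper.PopCount(bitmap & (bit - 1));
}
```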
The implementation does not suffer from retaining strong references. Whenever you update the collection, all the nodes traversed from the root node of the trie to the value node are copied, and the rest do not need to change. Suppose you remove your element and release the old collection: part of the trie is retained because it is used by the new collection, while the part that was copied due to the mutation is released and garbage collected. Let's say, in the following diagram:
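A minimal sketch of this path-copy rule (hypothetical names, and a simplified uncompressed node rather than the real compressed/full pair): an update clones only the nodes on the root-to-leaf path, and the clones keep referencing the original, untouched subtrees.

```csharp
internal sealed class TrieNode
{
    internal TrieNode?[] Children = new TrieNode?[32];
    internal object? Value;  // non-null at leaf (value) nodes

    internal TrieNode WithValue(int hash, int shift, object value)
    {
        // Copy this node (one array allocation per level on the path)...
        var copy = new TrieNode
        {
            Children = (TrieNode?[])Children.Clone(),
            Value = Value,
        };

        if (shift >= 32)
        {
            copy.Value = value;  // reached the leaf: store the new value
            return copy;
        }

        // ...then recurse into exactly one child; the other 31 slots in the
        // cloned array still reference the shared, unmodified subtrees.
        int fragment = (hash >> shift) & 0x1F;
        copy.Children[fragment] =
            (Children[fragment] ?? new TrieNode()).WithValue(hash, shift + 5, value);
        return copy;
    }
}
```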
It wouldn't be garbage free, as you can see in the benchmark. To understand how the builder works, I will use the example below. First, we create a trie containing the key-value pair (k, v1).

At step (1), we created
At step (2), we updated the value of the key
At step (3), we created another builder from
Therefore, a mutation requires copying all the nodes along the path if those nodes are not "owned" by the builder; a sketch of the ownership check follows below. Addition is a bit more complicated. As I mentioned before, there are roughly two types of nodes -- 'compressed' (BitmapIndexedNode) and 'full' (HashArrayMapNode). Deletion on a 'compressed' or 'full' node that is "owned" by the builder will not create a new array. However, when the number of items in a 'full' node drops to 8, we copy the array into a 'compressed' node. In general, we try to balance the number of empty buckets in a 'full' node against the speed of locating entries via pop count in a 'compressed' node. There are also some additional facts about the implementation of the builder which I think are consistent with the current API, and which, going forward, I think are also the requirements for a replacement implementation?
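A minimal sketch of how the "owned by the builder" check could work (all names hypothetical; this is not the author's code): each builder holds a unique token, and any node it has already copied carries that token, so the builder can mutate it in place instead of path-copying it again on every operation.

```csharp
internal sealed class Builder
{
    private readonly object _owner = new object();  // unique per builder

    internal BuilderNode EnsureEditable(BuilderNode node)
    {
        if (ReferenceEquals(node.Owner, _owner))
        {
            return node;  // already copied by this builder: safe to mutate in place
        }

        // First touch by this builder: copy once and tag with our token.
        return new BuilderNode(node) { Owner = _owner };
    }
}

internal sealed class BuilderNode
{
    internal object? Owner;      // builder token; null once the trie is frozen
    internal object?[] Entries;

    internal BuilderNode(BuilderNode source)  // copy constructor
    {
        Entries = (object?[])source.Entries.Clone();
    }

    internal BuilderNode(int size)
    {
        Entries = new object?[size];
    }
}
```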
In closing, I think we shouldn't benchmark with intrinsics enabled, because .NET Core is designed to be cross-platform. The baseline should be measured without any intrinsics: if the user happens to be on a platform with the right instructions, great, they get an extra boost, but we shouldn't use that as our benchmark since intrinsics aren't available on all platforms.

While I was looking into ways to optimize, I found a great paper on optimizing HAMT for the JVM. I think we should evaluate that as well. I will try to implement it and see if there are significant improvements in memory usage and iteration speed. I haven't read the whole paper yet, but it seems very promising.

Paper: Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections:
The author's PhD thesis on efficient immutable collections:
Implementation from the author in Java:
Kotlin uses CHAMP in their implementation of immutable data structures:
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process. This process is part of the experimental issue cleanup initiative we are currently trialing in a limited number of areas. Please share any feedback you might have in the linked issue.
This issue will now be closed since it had been marked
I've implemented a variant of the hash array mapped trie and found that it performs around 2x faster than the current ImmutableDictionary with comparable memory usage. I am wondering if we can use it to replace the current implementation.
Here are some of the benchmarks I ran on my computer:
DictionaryAdd
DictionaryRemove
DictionarySetExisting