Add a more scalable ETS ordered_set implementation #1952
The current ETS ordered_set implementation can quickly become a scalability bottleneck when several processes operate on the table in parallel.

The new implementation is expected to scale better than the default implementation when the following operations are used in parallel: delete/2, delete_object/2, first/1, insert/2 (single object), insert_new/2 (single object), lookup/2, lookup_element/2, member/2, next/2, take/2 and update_element/3 (single object).

Currently, the new implementation does not have scalable support for the remaining operations (for example, the select operations); when one of these is used, the table falls back to coarse-grained, single-lock synchronization.
Description of the New Implementation
The new implementation is based on a data structure called the contention adapting search tree (CA tree), which is described in the JPDC paper "A Contention Adapting Approach to Concurrent Ordered Sets".
A discussion of how the CA tree can be used as an ETS back-end can be
The ETS tests in
Sep 5, 2018
@vans163 Thanks for your encouraging comment and your questions.
I guess you are wondering why the new ordered_set implementation scales so badly in the scenarios with select operations. The short answer is that the new implementation has not been optimized for these scenarios yet. The long answer follows below.
The new implementation is based on the contention adapting search tree (CA tree). A CA tree automatically changes its synchronization granularity based on how much contention has been detected. Such a change is either a split of a subtree, which increases the lock count by one (making the synchronization more fine-grained), or a join of two subtrees, which decreases the lock count by one (making the synchronization more coarse-grained); see the animation in this presentation or Figure 2 in the JPDC paper referred to in the pull request for an illustration.

Currently, the only operations that can operate on the table while it is in a state with fine-grained synchronization (i.e., more than one lock exists in the data structure) are delete/2, delete_object/2, first/1, insert/2 (single object), insert_new/2 (single object), lookup/2, lookup_element/2, member/2, next/2, take/2 and update_element/3 (single object). The other operations merge all elements into a single sequential AVL tree that is protected by a single lock. This way, the old code for, e.g., the ets:select operation can be reused without modification.

The CA tree algorithm for doing the select operations in a more scalable way is already implemented and is used for the ets:next operation. However, using this algorithm for the select operations would require quite a lot of changes to the current select code. Therefore, I suggest that this pull request is accepted first and that someone adds scalable support for more operations later (I don't have much time to work on this right now because my day job is not related to Erlang).
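The contention-statistics mechanism described above can be sketched as follows. This is an illustrative Python sketch, not the actual C implementation inside the Erlang runtime: the constants and the statistics weights are made up for the example, and a real CA tree would attach such a lock to each base node and perform the actual split/join of subtrees.

```python
import threading

# Constants are assumptions for this sketch, not values from the OTP code.
CONTENTION_LIMIT = 1000       # split a subtree when statistics exceed this
NO_CONTENTION_LIMIT = -1000   # join two subtrees when statistics drop below this

class ContentionAdaptingLock:
    """Lock that records contention the way a CA tree base node does."""

    def __init__(self):
        self._lock = threading.Lock()
        self.statistics = 0  # positive: contended, negative: uncontended

    def lock(self):
        if self._lock.acquire(blocking=False):
            self.statistics -= 1    # got the lock immediately: no contention
        else:
            self._lock.acquire()    # had to wait: contention was detected
            self.statistics += 250  # contention is weighted more heavily

    def unlock(self):
        self._lock.release()

    def should_split(self):
        # High contention: split the subtree for finer-grained locking.
        return self.statistics > CONTENTION_LIMIT

    def should_join(self):
        # Low contention: join with a neighbor for coarser-grained locking.
        return self.statistics < NO_CONTENTION_LIMIT
```

The asymmetric weighting (a large penalty for a blocked acquire, a small credit for an uncontended one) makes the tree split quickly under contention but revert to coarse locks only after a long quiet period.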
@kjellwinblad Very, very interesting. The throughput increase is crazy. I hope this gets merged. ordered_sets do better when there are deletes in the mix at 64 processes, which is a little strange to me. Why is there such a drastic decline for the 50% insert / 50% delete workload after 16 processes?
I would expect the curve to just level out. Could it be NUMA memory / context switch / processor contention issues? The system has only 64 logical cores across 4 physical processors, so it's a scheduling nightmare. It's also a Sandy Bridge system, which is rock-solid bang for the buck but a bit behind.
I would be interested in how the benchmarks would look on, say, a single-processor EPYC with 64 physical cores (128 logical, single NUMA node), if someone would fund it.
First of all, the benchmark was started with the Erlang option "+sbt nnts", which means that scheduler threads are pinned to logical cores and up to 16 processes run on a single NUMA node. More than one NUMA node is used when there are more than 16 processes. The scalability is expected to get worse when more than one NUMA node is used, as it is much cheaper to transfer data within a node than between nodes.

Secondly, the table has counter variables that keep track of how much memory is allocated for the table and how many items are in the table. These counters are changed in the insert and delete operations with atomic instructions, which causes a lot of expensive traffic between the NUMA nodes when more than one node is used. The scalability of the ETS tables could probably be substantially improved with a more scalable implementation of these counters.
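One common way to make such counters more scalable is to stripe them across several slots so that concurrent updates rarely touch the same memory word. Below is a minimal Python sketch of that idea; it is not the OTP implementation, and NUM_STRIPES and the thread-id stripe selection are assumptions made for the example:

```python
import threading

NUM_STRIPES = 8  # assumption: number of counter slots for this sketch

class StripedCounter:
    """Striped counter: updates go to per-stripe slots, reads sum them."""

    def __init__(self):
        self._stripes = [0] * NUM_STRIPES
        self._locks = [threading.Lock() for _ in range(NUM_STRIPES)]

    def add(self, delta):
        # Threads hash to different stripes, so parallel updates usually
        # touch different slots instead of one shared counter word.
        i = threading.get_ident() % NUM_STRIPES
        with self._locks[i]:
            self._stripes[i] += delta

    def read(self):
        # Reading sums every stripe: updates become cheap at the cost of
        # a more expensive (and, under concurrency, approximate) read.
        return sum(self._stripes)
```

The trade-off is the same one made by, for example, Java's LongAdder: frequent concurrent increments scale well, while exact reads become more costly, which fits counters that are updated on every insert/delete but read rarely.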
I would also be interested in seeing such an experiment...
I don't understand what you mean by the above.