Add a more scalable ETS ordered_set implementation #1952

Merged
merged 3 commits into from Sep 21, 2018

kjellwinblad commented Sep 5, 2018

The current ETS ordered_set implementation can quickly become a
scalability bottleneck on multicore machines when an application updates
an ordered_set table from concurrent processes [1][2]. The current
implementation is based on an AVL tree protected from concurrent writes
by a single readers-writer lock. Furthermore, the current implementation
has an optimization, called the stack optimization [3], that can improve
the performance when only a single process accesses a table but can
cause bad scalability even in read-only scenarios. It is possible to
pass the option {write_concurrency, true} to ets:new/2 when creating an
ETS table of type ordered_set, but before this commit the option had no
effect for tables of type ordered_set. The new ETS ordered_set
implementation, added by this commit, is only activated when one passes
the options ordered_set and {write_concurrency, true} to the ets:new/2
function. Thus, the previous ordered_set implementation (from here on
called the default implementation) can still be used in applications
that do not benefit from the new implementation. The benchmark results
on the following web page show that the new implementation is many times
faster than the old implementation in some scenarios and that the old
implementation is still better than the new implementation in some
scenarios.

http://winsh.me/ets_catree_benchmark/ets_ca_tree_benchmark_results.html
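The activation condition described above can be illustrated with a minimal sketch (the table names below are hypothetical, chosen only for illustration):

```erlang
%% The CA tree based implementation is only used when BOTH the
%% ordered_set and {write_concurrency, true} options are given.
New = ets:new(new_impl_table, [ordered_set, public, {write_concurrency, true}]),

%% Without {write_concurrency, true}, the default AVL tree
%% implementation is used.
Old = ets:new(default_impl_table, [ordered_set, public]).
```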

The new implementation is expected to scale better than the default
implementation when concurrent processes use the following ETS
operations to operate on a table:

delete/2, delete_object/2, first/1, insert/2 (single object),
insert_new/2 (single object), lookup/2, lookup_element/2, member/2,
next/2, take/2 and update_element/3 (single object).

Currently, the new implementation does not have scalable support for the
other operations (e.g., select/2). However, when these operations are
used infrequently, the new implementation may still scale better than the
default implementation, as the benchmark results at the URL above show.
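A sketch of the kind of workload these operations enable (the process and iteration counts below are arbitrary, chosen only for illustration):

```erlang
%% Several processes updating one ordered_set table, using only
%% operations that the new implementation supports with fine-grained
%% synchronization (single-object insert, lookup and delete).
T = ets:new(bench_table, [ordered_set, public, {write_concurrency, true}]),
Parent = self(),
Pids = [spawn(fun() ->
                  [begin
                       K = rand:uniform(100000),
                       true = ets:insert(T, {K, N}),   %% single object
                       _ = ets:lookup(T, K),
                       true = ets:delete(T, K)
                   end || _ <- lists:seq(1, 1000)],
                  Parent ! {done, self()}
              end) || N <- lists:seq(1, 8)],
[receive {done, P} -> ok end || P <- Pids].
```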

Description of the New Implementation

The new implementation is based on a data structure which is called the
contention adapting search tree (CA tree for short). The following
publication contains a detailed description of the CA tree:

A Contention Adapting Approach to Concurrent Ordered Sets
Journal of Parallel and Distributed Computing, 2018
Kjell Winblad and Konstantinos Sagonas
https://doi.org/10.1016/j.jpdc.2017.11.007
http://www.it.uu.se/research/group/languages/software/ca_tree/catree_proofs.pdf

A discussion of how the CA tree can be used as an ETS back-end can be
found in another publication [1]. The CA tree is a data structure that
dynamically changes its synchronization granularity based on detected
contention. Internally, the CA tree uses instances of a sequential data
structure to store items. The CA tree implementation contained in this
commit uses the same AVL tree implementation as is used for the default
ordered set implementation. This AVL tree implementation is reused so
that much of the existing code to implement the ETS operations can be
reused.

Tests

The ETS tests in `lib/stdlib/test/ets_SUITE.erl` have been extended to
also test the new ordered_set implementation. The function
`ets_SUITE:throughput_benchmark/0` has also been added to this file. This
function can be used to measure and compare the performance of the
different ETS table types and options. This function writes benchmark
data to standard output that can be visualized by the HTML page
`lib/stdlib/test/ets_SUITE_data/visualize_throughput.html`.
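For reference, a hypothetical way to invoke the benchmark from an Erlang shell (assuming the compiled ets_SUITE module is on the code path):

```erlang
%% Run the benchmark; it prints its measurement data to standard output.
ets_SUITE:throughput_benchmark().
%% Paste the printed data into
%% lib/stdlib/test/ets_SUITE_data/visualize_throughput.html to plot it.
```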

[1]
More Scalable Ordered Set for ETS Using Adaptation.
In Thirteenth ACM SIGPLAN workshop on Erlang (2014).
Kjell Winblad and Konstantinos Sagonas.
https://doi.org/10.1145/2633448.2633455
http://www.it.uu.se/research/group/languages/software/ca_tree/erlang_paper.pdf

[2]
On the Scalability of the Erlang Term Storage
In Twelfth ACM SIGPLAN workshop on Erlang (2013)
Kjell Winblad, David Klaftenegger and Konstantinos Sagonas
https://doi.org/10.1145/2505305.2505308
http://winsh.me/papers/erlang_workshop_2013.pdf

[3]
The stack optimization works by keeping one preallocated stack instance
in every ordered_set table. This stack is updated so that it contains
the search path in some read operations (e.g., ets:next/2). This makes
it possible for a subsequent ets:next/2 to avoid traversing some nodes
in some cases. Unfortunately, the preallocated stack needs to be flagged
so that it is not updated concurrently by several threads, and this
flagging causes bad scalability.

CLAassistant commented Sep 5, 2018

CLA assistant check
All committers have signed the CLA.

kjellwinblad and others added some commits Sep 5, 2018

Add a more scalable ETS ordered_set implementation
stdlib: Suppress test log spam in ets_SUITE
of repeated table opts
and waiting for workers

@kjellwinblad kjellwinblad force-pushed the kjellwinblad:ca_tree_pull_request branch from 510c42f to fa28bfb Sep 5, 2018

@sverker sverker self-assigned this Sep 6, 2018

vans163 commented Sep 7, 2018

Very nice, but why are select and selectAll so slow? Is selectAll == ets:tab2list/1 or ets:match_object/2 with '_'? Or does it mean running an ets:select operation?

kjellwinblad commented Sep 8, 2018

Very nice, but why are select and selectAll so slow? Is selectAll == ets:tab2list/1 or ets:match_object/2 with '_'? Or does it mean running an ets:select operation?

@vans163 Thanks for your encouraging comment and your questions.

  • selectAll = an ets:select_count call that counts all the items in the table
  • partial_selectX = an ets:select_count call that counts all the items within a random range of size X

You can find the above descriptions here (under Benchmark Description). The actual code for the benchmark is here.
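For reference, the two operations can be expressed with ets:select_count/2 match specifications roughly like this. The match specs below are my reconstruction, not necessarily the exact ones the benchmark uses; T is an existing ordered_set table with {Key, Value} objects, and Lo/Hi are illustrative range bounds:

```erlang
%% selectAll: count every object in the table.
All = ets:select_count(T, [{'_', [], [true]}]),

%% partial_selectX: count objects whose key falls in a random range of
%% size X (here X = 1000).
Lo = rand:uniform(100000),
Hi = Lo + 1000,
InRange = ets:select_count(T, [{{'$1', '_'},
                                [{'>=', '$1', Lo}, {'=<', '$1', Hi}],
                                [true]}]).
```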

I guess you are wondering why the new ordered_set implementation scales so badly in the scenarios with select operations. The short answer is that the new implementation has not been optimized for these scenarios yet. The long answer follows below.

The new implementation is based on the contention adapting search tree (CA tree). A CA tree automatically changes its synchronization granularity based on how much contention has been detected. Such a change either splits a subtree, which increases the lock count by one (to make the synchronization more fine-grained), or joins two subtrees, which decreases the lock count by one (to make the synchronization more coarse-grained); see the animation in this presentation or Figure 2 in the JPDC paper referred to in the pull request for an illustration.

Currently, the only operations that can operate on the table when it is in a state with fine-grained synchronization (i.e., more than one lock exists in the data structure) are delete/2, delete_object/2, first/1, insert/2 (single object), insert_new/2 (single object), lookup/2, lookup_element/2, member/2, next/2, take/2 and update_element/3 (single object). The other operations will merge all elements into a single sequential AVL tree which gets protected by a single lock. This way, the old code for doing, for example, the ets:select operation can be reused without modification.

The CA tree algorithm for doing the select operations in a more scalable way is already implemented and is used for the ets:next operation. However, making use of this algorithm for the select operations would require quite a lot of changes in the current code for doing select operations. Therefore, I suggest that this pull request is accepted first and that someone can add code to support more operations in a scalable way later (I don't have much time to work on this right now because I have a day job which is not related to Erlang).
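Until that happens, a whole-table read that must scale can be built on top of the fine-grained first/1 and next/2 operations instead of select/2. A minimal traversal sketch (the function name walk/1 is my own, not part of ETS):

```erlang
%% Collect all keys of an ordered_set table using only first/1 and
%% next/2, which do not force the CA tree back to a single lock.
walk(T) -> walk(T, ets:first(T), []).

walk(_T, '$end_of_table', Acc) -> lists:reverse(Acc);
walk(T, K, Acc) -> walk(T, ets:next(T, K), [K | Acc]).
```

Note that unlike a single select/2 call, such a traversal is not atomic with respect to concurrent updates: objects inserted or deleted while the walk is in progress may or may not be observed.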

vans163 commented Sep 8, 2018

@kjellwinblad Very, very interesting. The throughput increase is crazy. I hope this gets merged. ordered_sets do better when there are deletes in the mix at 64 processes, which is a little strange to me. Why is there such a drastic decline for the 50% insert / 50% delete case after 16 processes?

I would expect the curve to just level out. Could it be NUMA memory / context switch / processor contention issues? The system only has 64 logical cores across 4 physical processors, so it's a scheduling nightmare. Also, it's a Sandy Bridge system, which is rock-solid bang for the buck but a bit behind.

I would be interested in how the benchmarks would look on, say, a 64 physical core (128 logical, single NUMA node) single-processor EPYC, if someone would fund it.

kjellwinblad commented Sep 8, 2018

Why is there such a drastic decline for the 50% insert / 50% delete case after 16 processes?

First of all, the benchmark was started with the Erlang option "+sbt nnts", which means that scheduler threads are pinned to logical cores and up to 16 processes run on only one NUMA node. More than one NUMA node is used when there are more than 16 processes. It is expected that scalability becomes worse when more than one NUMA node is used, as it is much cheaper to transfer data within a node than between nodes.

Secondly, the table has counter variables to keep track of how much memory is allocated for the table and of the number of items in the table. These counters are changed in the insert and delete operations with atomic instructions. These changes cause a lot of expensive traffic between the NUMA nodes when more than one NUMA node is used. The scalability of the ETS tables could probably be substantially improved with a more scalable implementation of these counters.

I would be interested how the benchmarks would look like on say a 64 physical core (128 logic, single numa node) single processor EPYC if someone would fund it.

I would also be interested in seeing such an experiment...

ordered_sets do better when there are deletes in the mix at 64 processes, which is a little strange to me.

I don't understand what you mean by the above.

sverker commented Sep 13, 2018

I added a commit with a documentation update for write_concurrency and ordered_set.

@sverker sverker merged commit f26d11a into erlang:master Sep 21, 2018

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
license/cla: Contributor License Agreement is signed