Optimize to better handle large rings/nodes #319

Merged 17 commits into master from jdb-large-master on May 30, 2013


3 participants

jtuple commented May 15, 2013

This pull-request consists of numerous changes that are all part of an effort to improve riak_core with respect to large rings and large clusters.

As a data point, consider this mailing list post:

That email shows that adding a node to a Riak cluster with a 2048-size ring takes about an hour. I have seen similar results in my own testing. With the changes in this pull-request, adding a node to a cluster with a 16384-size ring takes about 5 minutes.

In general, this was a continual round of whack-a-mole. Profile to find bottleneck A, fix it; profile to find bottleneck B. Rinse/repeat. I've tried to organize the pull-request into logical, self-contained commits. For code review, the best approach is to go through commit-by-commit, being sure to read the detailed commit messages provided.

This pull-request requires basho/riak_kv#554

As a whole, the primary changes are as follows.

First, it's been long known that using mochiglobal to read/write the ring has been a blessing and a curse. mochiglobal reads are very, very fast. They take advantage of the Erlang constant pool and allow multiple processes to read the same data without locks and without copying data into individual process heaps. Unfortunately, mochiglobal writes are very, very slow. And they get slower the larger the ring is. The majority of this time is not spent in the code loader but inside code that converts the bare term into the abstract syntax tree that is passed to the Erlang compiler.

These slow writes hurt Riak during periods of high gossip traffic (such as cluster transitions), when hundreds of ring events may be triggered per minute. The riak_core_ring_manager stalls, its message queue backs up, etc. Bad news. Riak becomes CPU bound, spinning its wheels generating code at runtime.

Solution? A hybrid approach. When a ring is first written, it's written to a shared ETS table, and the atom ets is written to mochiglobal. If the ring doesn't change for 90 seconds, it is then promoted to mochiglobal. Thus, we get fast writes during high ring churn, and fast reads once things stabilize.
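
The hybrid approach above can be sketched as follows. This is a Python illustration, not the actual Erlang implementation: the dict-like attribute stands in for the shared ETS table, the promoted copy stands in for mochiglobal, and all names are mine.

```python
import time

PROMOTION_DELAY = 90  # seconds of ring stability before promotion

class HybridRingStore:
    """Sketch of the hybrid ring store: writes always go to a cheap
    store (standing in for ETS); after a quiet period the ring is
    promoted to a store that is expensive to write but very cheap to
    read (standing in for mochiglobal)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._ets = None          # fast-write store
        self._mochiglobal = None  # fast-read store, slow to update
        self._last_write = None

    def set_ring(self, ring):
        # Writes stay cheap even during gossip storms.
        self._ets = ring
        self._mochiglobal = None  # invalidate any promoted copy
        self._last_write = self._clock()

    def maybe_promote(self):
        # Called periodically; promote only once the ring has been
        # stable for PROMOTION_DELAY seconds.
        if (self._ets is not None and self._mochiglobal is None
                and self._clock() - self._last_write >= PROMOTION_DELAY):
            self._mochiglobal = self._ets  # the expensive write happens here
            return True
        return False

    def get_ring(self):
        # Readers prefer the promoted copy when it exists.
        return self._mochiglobal if self._mochiglobal is not None else self._ets
```

The key property is that the expensive write is deferred until the ring has stopped churning, so it is paid once per stabilization rather than once per gossip event.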

Of course, reading from ETS is slower than reading from mochiglobal. This is predominantly due to the need to copy the data into the process heap, which for large rings means megabytes of data.

To combat this, so that critical paths in Riak are not heavily affected during cluster transitions, the second change extracts certain elements out of the ring and provides an alternate means to query them. Bucket properties are stored directly in the ETS table, and a binary version of the chash structure, called chashbin, is generated. Since chashbin uses a binary for the bulk of its data, the structure is quick to copy between processes thanks to off-heap binary sharing. The chashbin structure also provides nearly constant-time versions of a variety of operations that the traditional chash structure performed in O(N) time. So, it's a win/win.
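
To illustrate the shape of a chashbin-style lookup, here is a Python sketch (not the riak_core code): ring entries are packed as fixed-width records into one flat binary, and the owner of a key is found by binary search over those records. The field widths and node-id encoding are illustrative only; the real structure uses 160-bit ring indexes.

```python
import struct

# Fixed-width (partition index, owner node id) record; widths illustrative.
ENTRY = struct.Struct(">QH")

def make_chashbin(owners):
    """Pack a [(index, node_id)] ring, sorted by index, into one flat binary."""
    return b"".join(ENTRY.pack(idx, node) for idx, node in sorted(owners))

def index_owner(chashbin, key):
    """Return the node owning `key`: the first partition index >= key,
    wrapping around to the first entry. Binary search over the packed
    records, so no per-lookup copy of the whole ring is needed."""
    n = len(chashbin) // ENTRY.size
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        idx, _ = ENTRY.unpack_from(chashbin, mid * ENTRY.size)
        if idx < key:
            lo = mid + 1
        else:
            hi = mid
    pos = lo % n  # wrap past the last partition back to the first
    return ENTRY.unpack_from(chashbin, pos * ENTRY.size)[1]
```

Because the whole ring lives in one immutable binary, passing it between processes shares the bytes rather than copying them, which is the property the paragraph above relies on.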

Various other commits modify different aspects of Riak to use the chashbin structure: the preflist calculation in riak_core_apl, the coverage calculations, various parts of the vnode manager, etc. See the commits for more details.

The next set of changes optimizes various other things. This consists mostly of replacing algorithms that are O(N^3), O(N^2), or O(N) with large constant factors, where N is either the ring size or the number of nodes in the cluster, with different algorithms and/or data structures of reduced complexity. Again, see the commits for details.

There is also a change to how ring events are triggered in a few manager processes. The traditional ring event system triggers an event for every ring received, but most things in Riak only care whether the ring has actually changed. Thus, a new approach was built that reduces the number of events generated and coalesces multiple events together. This change, along with some of the algorithmic changes, makes the riak_core_ring_manager, riak_core_vnode_manager, and riak_core_capability manager much more lightweight, whereas before they were the three dominant processes by reduction count.
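
The coalescing idea, combined with the "ring id" described in the commit messages below, can be sketched like this (a Python illustration with made-up names, not the Erlang code): the manager polls a monotonic ring id and only pays for an update when the id has moved, so any number of intermediate rings seen between polls collapse into a single update.

```python
class RingManager:
    """Holds the current ring plus a monotonic ring id."""
    def __init__(self):
        self._ring = None
        self._ring_id = 0

    def set_ring(self, ring):
        self._ring = ring
        self._ring_id += 1  # bump the sequence number on every change

    def get_ring_id(self):
        return self._ring_id

class PollingManager:
    """Sketch: instead of handling every gossip event, periodically
    compare the ring id against the last one seen and do the expensive
    ring-dependent work only when the ring actually changed."""
    def __init__(self, ring_manager):
        self._rm = ring_manager
        self._seen_id = ring_manager.get_ring_id()
        self.updates = 0  # counts expensive updates performed

    def tick(self):
        current = self._rm.get_ring_id()
        if current != self._seen_id:  # cheap check in the common case
            self._seen_id = current
            self.updates += 1         # expensive state update goes here
```

In the common case of a stable ring, each poll is just an integer comparison, which is what makes these managers lightweight.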

Finally, there are some other minor changes.

I've done a lot of benchmarking/profiling during this work, but most of it became outdated as I made further changes, or I didn't keep good notes. I'll add more details in the coming days, but wanted to get the code out for review.

One set of numbers I do have on hand comes from a micro-benchmark that compares the cost of computing preference lists in Riak on master, on this branch using the non-chashbin approach, and on this branch using the chashbin approach. The results are as follows (times in microseconds):

                 master         chash       chashbin
Ring  Nodes    Min    Max    Min    Max    Min    Max
   64     1      2      2      1      2      1      1
   64    10      6      7      1      2      1      1
   64   100      6      8      1      2      1      1
   64  1000      6      8      1      2      1      1
  128     1      2      4      1      3      1      1
  128    10     11     13      1      3      1      1
  128   100     14     15      1      3      1      1
  128  1000     12     13      1      4      1      1
  256     1      4      7      1      6      1      1
  256    10     20     23      1      6      1      1
  256   100     23     26      1      6      1      1
  256  1000     25     26      1      7      1      2
  512     1      6     12      3     11      1      1
  512    10     39     45      3     11      1      1
  512   100     48     50      3     10      1      1
  512  1000     44     57      3     14      1      5
 1024     1     15     27      8     19      1      1
 1024    10     81     95      6     19      1      1
 1024   100     87    103      6     20      1      1
 1024  1000    116    121      6     18      1      1
 2048     1     32     53      9     39      1      1
 2048    10    153    181     10     39      1      1
 2048   100    174    203      8     36      1      1
 2048  1000    243    292      8     35      1      1
 4096     1     57     96     20     68      1      1
 4096    10    322    363     16     72      1      1
 4096   100    360    419     17     75      1      2
 4096  1000    477    548     18     71      1      1
 8192     1     98    208     35    162      1      1
 8192    10    596    701     45    133      1      1
 8192   100    711    803     41    138      1      1
 8192  1000    783    818     34    151      1      2
16384     1    196    380     81    295      1      1
16384    10   1242   1365     79    296      1      1
16384   100   1415   1512     97    296      1      1
16384  1000   1576   2003     83    276      1      2

jtuple added some commits May 10, 2013

@jtuple jtuple Change riak_core_ring_manager to use gen_server 2ce32c3
@jtuple jtuple Hybrid ring/mochiglobal approach
Numerous processes concurrently read and access the ring in a
variety of time sensitive code paths. To make this efficient,
riak_core uses mochiglobal which exploits the Erlang constant
pool to provide constant-time access to the ring without needing
to copy data into individual process heaps.

However, updating a mochiglobal value is very slow, and becomes slower
the larger the item being stored. In Riak clusters with large rings and
many nodes, storing a ring in mochiglobal can take seconds. This delay
is too long during periods of high gossip, where hundreds of ring events
are being triggered per second. The riak_core_ring_manager will stall and
start to exhibit message queue growth.

This commit uses a hybrid approach to solve this problem. When a ring is
first written, it is written to a shared ETS table. If no ring events
have occurred for 90 seconds, the ring is then promoted to mochiglobal.
This provides fast updates during periods of ring churn, while eventually
providing very fast reads after the ring stabilizes. The downside is that
reading from the ETS table before promotion is slower than mochiglobal,
and requires copying the ring into individual process heaps.
@jtuple jtuple Add chashbin and bucket metadata
The hybrid ETS/mochiglobal approach solves the fast update problem,
but hurts ring read performance during ring transitions. This issue
can be avoided by exploiting the fact that the majority of Riak only
reads the ring in order to determine index owners, preference lists,
and bucket properties. This commit provides an alternate approach to
getting access to these values without reading the entire ring.

First, this commit creates a binary version of the chash structure
which is stored in the ring manager's ETS table. This chashbin
structure is fast to copy between processes due to off-heap binary
sharing. Furthermore, this structure provides a secondary benefit
of being much faster than the traditional chash structure for normal
operations.

Second, this commit extracts bucket metadata from the ring, storing
the values in the ring manager's ETS table. This allows processes
to directly query a given bucket's metadata without reading the ring.
@jtuple jtuple Optimize riak_core_apl using chashbin
This commit changes riak_core_apl to use the new chashbin structure,
avoiding the need to read the ring in order to compute preflists. This
change also makes preflist calculation significantly faster given the
better performance of the chashbin structure over the traditional chash
structure. Specifically, traditional preflist calculation time scales
with the ring size whereas the new approach is essentially constant
regardless of ring size.
@jtuple jtuple Optimize coverage plan using chashbin b39bd4c
@jtuple jtuple Update vnode manager to use chashbin 3ed0a56
@jtuple jtuple Change vnode + capability managers to use intelligent ring polling
The standard ring event handler sends an event for every ring received.
The vnode manager and capability manager do not care about every ring
received, they only care if the local ring actually changes. In other
words, if a ring is received that provides no new information, and
therefore the local ring does not change, there is no reason to trigger
an event.

This commit changes these managers to no longer listen to ring events,
but instead poll the ring manager at a fixed frequency. This not only
solves the above problem, but also allows the managers to ignore high
frequency ring events.

Given that the common case is for the ring to remain stable, this change
also introduces the notion of the "ring id" which is a monotonic sequence
number for the ring maintained by the ring manager. The periodic polling
then simplifies to checking if the ring id has changed, and only
performing the more expensive state update if so.
@jtuple jtuple Optimize the "up nodes" logic of riak_core_apl
Change riak_core_apl to use an unordered list and lists:member rather
than ordsets and ordsets:is_element for determining up nodes. This
approach is much faster in practice. Both ordsets:is_element and
lists:member are O(N); however, lists:member is a BIF and has a
significantly lower constant factor. Likewise, this approach avoids
the lists:usort performed by ordsets:from_list.
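
The gist of this commit, in a Python sketch (the function name and shapes are mine, not riak_core's): keep the up-nodes as a plain unordered list and test membership with a linear scan, rather than building a sorted set on every call just to answer a handful of membership queries.

```python
def filter_up(preflist, up_nodes):
    """Filter a preference list down to entries on reachable nodes.

    `up_nodes` is a plain list; `node in up_nodes` is a linear scan,
    analogous to lists:member/2. For the small node counts seen per
    preflist this beats constructing an ordered set per call, since it
    skips the sort that ordsets:from_list would perform."""
    return [(idx, node) for idx, node in preflist if node in up_nodes]
```

The point is not asymptotic complexity (both approaches are O(N) per membership test) but the constant factor and the avoided per-call sort.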
@jtuple jtuple Enable warning_as_errors build option and verbose eunit
Several minor changes made to fix pre-existing warnings that would now
prevent the build given the warning_as_errors option.
@jtuple jtuple Make vnode manager responsible for ensuring vnodes started fa932e6
@jtuple jtuple Change handoff manager to use sets instead of ordsets 2bf01a8
@jtuple jtuple Reduce file server bottleneck when writing ring to disk 29500c1
@jtuple jtuple Optimize vnode manager
Change certain uses of orddict to dict where the size of the structure
depends on either the ring size or number of nodes in a cluster.

Change update_never_started to only query ring indices when there
are actually unknown modules.
@jtuple jtuple Optimize riak_core_vnode_manager:compute_forwarding
Change from O(n*m) to O(m) + O(n*log32(m))
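
The shape of that complexity change can be sketched in Python (illustrative names and a simplified forwarding rule, not the actual riak_core_vnode_manager code): instead of scanning the owners list once per vnode, build a map from the owners once and then do one lookup per vnode. Erlang's dict gives roughly O(log32 m) lookups, hence the O(m) + O(n*log32(m)) in the commit message; Python's dict is amortized O(1).

```python
def compute_forwarding_naive(vnodes, owners):
    # O(n*m): for each of n vnodes, linearly scan the m-entry owners list.
    result = {}
    for idx, node in vnodes:
        for oidx, owner in owners:
            if oidx == idx:
                # Forward to the new owner if ownership moved, else None.
                result[(idx, node)] = owner if owner != node else None
                break
    return result

def compute_forwarding_fast(vnodes, owners):
    # O(m) to build the index map once, then one lookup per vnode.
    owner_of = dict(owners)
    return {(idx, node): (owner_of[idx] if owner_of[idx] != node else None)
            for idx, node in vnodes}
```

Both functions compute the same answer; only the work done per vnode changes.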
@jtuple jtuple Change how vnode manager uses ETS to avoid table scans 8320cfd
@jtuple jtuple Randomize vnode inactivity timeout
The goal is to prevent multiple vnodes timing out simultaneously
and overloading the vnode manager with tons of messages.
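
Jittering a timeout to avoid a thundering herd is a standard trick; a minimal Python sketch (the base value and range are illustrative, not riak_core's actual defaults):

```python
import random

def inactivity_timeout(base_ms, rng=random):
    # Spread timeouts over [base, 2*base) so idle vnodes started at the
    # same time don't all fire their inactivity checks at the same
    # instant and flood the vnode manager with messages.
    return base_ms + rng.randrange(base_ms)
```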

jtuple referenced this pull request in basho/riak_kv May 16, 2013


Optimize to better handle large rings/nodes #554


jrwest commented May 16, 2013

Discussed some surface-level changes (e.g. comments/test updates) with @jtuple offline. I'm currently unable to line-comment on this PR due to a GitHub problem, however. Besides making those small changes, +1.

This PR clearly improves performance of large clusters. I ran a simple test: build a 2-node cluster with ring size 2048 before and after this change (AAE off, forced_ownership_handoff raised, process limit raised). Before, I cancelled the test after ~10 minutes; during this time the vnode manager was causing all sorts of system monitors to fire. After the change, the test completed in ~30 seconds and Riak operated as normal.

EDIT: also ran a few riak tests including verify_reset_bucket_properties against these changes

jrwest referenced this pull request May 23, 2013


Ring Resizing #301

@jtuple jtuple Address code review comments
-- Add module edoc to riak_core_ring_manager discussing hybrid approach.
-- Comment the approach used to update the bucket props in ETS.
-- Ensure we do not silently ignore any errors in ensure_vnodes_started.
-- Change chashbin:index_owner/2 error condition.
-- Fix variable naming for chashbin:iterator/2.

@jtuple jtuple added a commit that referenced this pull request May 28, 2013

@jtuple jtuple Merge #319 from jdb-large-master into staging-1.4
Also, fix several build warnings since #319 includes a change to treat
warnings as errors.


Will it crash or simply loop around the ring forever? I don't think that iterator loop would make the stack grow. Anyway...


engelsanchez replied May 29, 2013

Nevermind, saw how the iterator remembers the initial position and stops there. In any case, why am I even commenting and distracting you? :) Just looking around the new changes.

@jrwest jrwest merged commit 7a3b9f4 into master May 30, 2013

1 check failed

default The Travis CI build failed

seancribbs deleted the jdb-large-master branch Apr 1, 2015
