AZ447: Improve partition claim algorithm to reduce unneeded partition transfers #91

jtuple · 2011-09-20T21:32:48Z

This pull-request adds a new partition claim algorithm in riak_core_new_claim.erl that aims to reduce the number of unneeded partition transfers. However, the previous algorithm (default_claim) remains the default for 1.0.

The default algorithm falls back to re-diagonalization far too often, making it so that adding a new node to the more often than not requires more than half of the partitions to be handed off, rather than the expected 1/N (where N is the number of nodes, including the new node).

The new algorithm tries more closely hit the 1/N target while still maintaining the target_n spread constraint.

In addition to the new algorithm, this pull-request also merges in a claim simulation harness originally contributed by Greg Nelson at Dropcam (https://gist.github.com/992317). This can be used to compare the efficiency of the new and old algorithms.

Add riak_core_new_claim:new_wants_claim/2 and new_claim/2. Merge in claim simulation code provided by Greg Nelson (grourk@dropcam.com). Add pretty_print function to riak_core_ring. The new claim function is designed to reduce the number of partition transfers that occur when rebalancing the ring, aiming as close to possible for minimal consistent hashing.

kellymclaughlin · 2011-09-20T21:53:04Z

test/claim_simulation.erl

+            {node_count, 8},
+            {node_capacity, 24}
+           ],
+    run(Opts).


How about adding two calls to run, one with the default claim and one with the new?

jtuple · 2011-09-20T22:00:45Z

The claim simulation code is setup as an eunit test, and can be invoked through rebar:
./rebar app=riak_core eunit suite=claim_simulation

The code writes out a file in /tmp of the form /tmp/rings_RINGSIZE_NODES.txt. For example, the default in the test is to test a 32 element ring with 8 nodes, so you'll see /tmp/rings_32_8.txt.

It should be easy to modify that file test/claim_simulation.erl in order to change ring, node, and claim algorithm parameters. Current, the new algorithm is the default in the code, comment/uncomment the obvious lines will enable the default algorithm.

To test a real cluster, be sure to set lager debugging to the debug level. Then, on the claimant (typically dev1) you'll see messages of the form Pending ownership transfers: X printed the the console/console.log. It makes it easy to add/remove nodes and see the outcome of the rebalance. Setting the app.config as appropriate will enable you to test out a old/new claim cluster and compare the pending transfers number:

{riak_core, [
%%  {choose_claim_fun, {riak_core_claim, default_choose_claim}},
%%  {wants_claim_fun, {riak_core_claim, default_wants_claim}},
    {choose_claim_fun, {riak_core_new_claim, new_claim}},
    {wants_claim_fun, {riak_core_new_claim, new_wants_claim}},

There's also new pretty-printer for riak_core_ring. To demonstrate:

%% Single-line pretty-print of ring, grouped into target_n segments
> riak_core_ring:pretty_print(Ring, []).            
abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abca|abba|abba|abba|abba|abba|

%% Include legend in pretty-print
> riak_core_ring:pretty_print(Ring, [legend]). 
==================================== Nodes ====================================
Node a:  34.4% dev1@127.0.0.1
Node b:  32.8% dev2@127.0.0.1
Node c:  32.8% dev3@127.0.0.1
==================================== Ring =====================================
abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abca|abba|abba|abba|abba|abba|

%% Print using numeric ids instead of letters (useful for cluster > 26 nodes)
riak_core_ring:pretty_print(Ring, [numeric]).
["1","2","3","3","1","2","3","3","1","2","3","3","1","2","3","3","1","2","3",
 "3","1","2","3","3","1","2","3","3","1","2","3","3","1","2","3","3","1","2",
 "3","3","1","2","3","1","1","2","2","1","1","2","2","1","1","2","2","1","1",
 "2","2","1","1","2","2","1"]

jtuple · 2011-09-20T22:18:57Z

Another interesting thing to point out is that the default target_n_val is 4 while the default N is 3. This leads to what may appear as strange behavior with the new claim algorithm, versus the old.

For example, the old algorithm's ring progression going from 1 to 4 nodes clearly shows that with 3 nodes most N=3 preflists are on distinct nodes:

1: aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|
2: abab|abab|abab|abab|abab|abab|abab|abab|
3: abca|bcab|cabc|abca|bcab|cabc|abca|bcab|
4: abcd|abcd|abcd|abcd|abcd|abcd|abcd|abcd|

For the new algorithm, the default settings result in the 3 node ring having repeated nodes in most N=3 preflists, because the new algorithm doesn't maximize spread until after the number of nodes is greater than target_n_val. In other words, both algorithms are safe with N=3 when there are 4 nodes, but the old is better when there are only 3 nodes:

1: aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|
2: abba|abba|abba|abba|abba|abba|abba|abba|
3: abcc|abcc|abcc|abcc|abca|abca|abba|abba|
4: abcd|abcd|abcd|abcd|abcd|abcd|abcd|abcd|

If, however, target_n_val was 3 to begin with, then the new algorithm also has 3 distinct nodes per N=3 preflist with only 3 nodes (except the same wrap around preflist that also plagues the old algorithm due to 3 not evenly diving the ring size):

aaa|aaa|aaa|aaa|aaa|aaa|aaa|aaa|aaa|aaa|aa
abb|abb|abb|abb|abb|aba|aba|aba|aba|aba|ab
abc|abc|abc|abc|abc|abc|abc|abc|abc|abc|ab
adc|adc|adc|abd|abd|abc|dbc|dbc|abc|abc|db

kellymclaughlin · 2011-09-20T23:03:55Z

Code review, test runs, and local cluster testing look good. +1 to merge.

Change claim_simulation.erl eunit test to run a simulation with both the new and old claim algorithm as suggested. Rename riak_core_new_claim:new_claim/2 to new_choose_claim/2 to match default_choose_claim/2. Fix two bugs in riak_core_new_claim.erl that are on code paths that cannot occur in 1.0 due to existing invariants, but should be fixed nevertheless: - Match error in prefilter_violations: change CNth to {CNth, _}. - Handle case where new_choose_claim fails to claim partitions by falling back to claim_rebalance_n.

bz://1067: remove unused try_cast and fallback functions from riak_kv_ut

…epl still needs it. This reverts commit 4ae468b, reversing changes made to a17ca57.

kellymclaughlin reviewed Sep 20, 2011
View reviewed changes

jtuple added a commit that referenced this pull request Sep 21, 2011

Merge pull request #91 from basho/az447-choose-claim-1.0

827a5a5

jtuple merged commit 827a5a5 into 1.0 Sep 21, 2011

argv0 added a commit that referenced this pull request Apr 21, 2013

Merge pull request #91 from basho/bz1067-remove-try-cast

4ae468b

bz://1067: remove unused try_cast and fallback functions from riak_kv_ut

jonmeredith pushed a commit that referenced this pull request Apr 21, 2013

Revert "Merge pull request #91 from basho/bz1067-remove-try-cast" - r…

b9e7606

…epl still needs it. This reverts commit 4ae468b, reversing changes made to a17ca57.

seancribbs deleted the az447-choose-claim-1.0 branch April 1, 2015 22:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AZ447: Improve partition claim algorithm to reduce unneeded partition transfers #91

AZ447: Improve partition claim algorithm to reduce unneeded partition transfers #91

jtuple commented Sep 20, 2011

kellymclaughlin Sep 20, 2011

jtuple commented Sep 20, 2011

jtuple commented Sep 20, 2011

kellymclaughlin commented Sep 20, 2011

AZ447: Improve partition claim algorithm to reduce unneeded partition transfers #91

AZ447: Improve partition claim algorithm to reduce unneeded partition transfers #91

Conversation

jtuple commented Sep 20, 2011

kellymclaughlin Sep 20, 2011

Choose a reason for hiding this comment

jtuple commented Sep 20, 2011

jtuple commented Sep 20, 2011

kellymclaughlin commented Sep 20, 2011