Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AZ447: Improve partition claim algorithm to reduce unneeded partition transfers #91

Merged
merged 2 commits into from Sep 21, 2011

Conversation

jtuple
Copy link
Contributor

@jtuple jtuple commented Sep 20, 2011

This pull-request adds a new partition claim algorithm in riak_core_new_claim.erl that aims to reduce the number of unneeded partition transfers. However, the previous algorithm (default_claim) remains the default for 1.0.

The default algorithm falls back to re-diagonalization far too often, making it so that adding a new node to the more often than not requires more than half of the partitions to be handed off, rather than the expected 1/N (where N is the number of nodes, including the new node).

The new algorithm tries more closely hit the 1/N target while still maintaining the target_n spread constraint.

In addition to the new algorithm, this pull-request also merges in a claim simulation harness originally contributed by Greg Nelson at Dropcam (https://gist.github.com/992317). This can be used to compare the efficiency of the new and old algorithms.

Add riak_core_new_claim:new_wants_claim/2 and new_claim/2.
Merge in claim simulation code provided by Greg Nelson (grourk@dropcam.com).
Add pretty_print function to riak_core_ring.

The new claim function is designed to reduce the number of partition transfers
that occur when rebalancing the ring, aiming as close to possible for minimal
consistent hashing.
{node_count, 8},
{node_capacity, 24}
],
run(Opts).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding two calls to run, one with the default claim and one with the new?

@jtuple
Copy link
Contributor Author

jtuple commented Sep 20, 2011

The claim simulation code is setup as an eunit test, and can be invoked through rebar:
./rebar app=riak_core eunit suite=claim_simulation

The code writes out a file in /tmp of the form /tmp/rings_RINGSIZE_NODES.txt. For example, the default in the test is to test a 32 element ring with 8 nodes, so you'll see /tmp/rings_32_8.txt.

It should be easy to modify that file test/claim_simulation.erl in order to change ring, node, and claim algorithm parameters. Current, the new algorithm is the default in the code, comment/uncomment the obvious lines will enable the default algorithm.

To test a real cluster, be sure to set lager debugging to the debug level. Then, on the claimant (typically dev1) you'll see messages of the form Pending ownership transfers: X printed the the console/console.log. It makes it easy to add/remove nodes and see the outcome of the rebalance. Setting the app.config as appropriate will enable you to test out a old/new claim cluster and compare the pending transfers number:

{riak_core, [
%%  {choose_claim_fun, {riak_core_claim, default_choose_claim}},
%%  {wants_claim_fun, {riak_core_claim, default_wants_claim}},
    {choose_claim_fun, {riak_core_new_claim, new_claim}},
    {wants_claim_fun, {riak_core_new_claim, new_wants_claim}},

There's also new pretty-printer for riak_core_ring. To demonstrate:

%% Single-line pretty-print of ring, grouped into target_n segments
> riak_core_ring:pretty_print(Ring, []).            
abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abca|abba|abba|abba|abba|abba|

%% Include legend in pretty-print
> riak_core_ring:pretty_print(Ring, [legend]). 
==================================== Nodes ====================================
Node a:  34.4% dev1@127.0.0.1
Node b:  32.8% dev2@127.0.0.1
Node c:  32.8% dev3@127.0.0.1
==================================== Ring =====================================
abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abcc|abca|abba|abba|abba|abba|abba|

%% Print using numeric ids instead of letters (useful for cluster > 26 nodes)
riak_core_ring:pretty_print(Ring, [numeric]).
["1","2","3","3","1","2","3","3","1","2","3","3","1","2","3","3","1","2","3",
 "3","1","2","3","3","1","2","3","3","1","2","3","3","1","2","3","3","1","2",
 "3","3","1","2","3","1","1","2","2","1","1","2","2","1","1","2","2","1","1",
 "2","2","1","1","2","2","1"]

@jtuple
Copy link
Contributor Author

jtuple commented Sep 20, 2011

Another interesting thing to point out is that the default target_n_val is 4 while the default N is 3. This leads to what may appear as strange behavior with the new claim algorithm, versus the old.

For example, the old algorithm's ring progression going from 1 to 4 nodes clearly shows that with 3 nodes most N=3 preflists are on distinct nodes:

1: aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|
2: abab|abab|abab|abab|abab|abab|abab|abab|
3: abca|bcab|cabc|abca|bcab|cabc|abca|bcab|
4: abcd|abcd|abcd|abcd|abcd|abcd|abcd|abcd|

For the new algorithm, the default settings result in the 3 node ring having repeated nodes in most N=3 preflists, because the new algorithm doesn't maximize spread until after the number of nodes is greater than target_n_val. In other words, both algorithms are safe with N=3 when there are 4 nodes, but the old is better when there are only 3 nodes:

1: aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|aaaa|
2: abba|abba|abba|abba|abba|abba|abba|abba|
3: abcc|abcc|abcc|abcc|abca|abca|abba|abba|
4: abcd|abcd|abcd|abcd|abcd|abcd|abcd|abcd|

If, however, target_n_val was 3 to begin with, then the new algorithm also has 3 distinct nodes per N=3 preflist with only 3 nodes (except the same wrap around preflist that also plagues the old algorithm due to 3 not evenly diving the ring size):

aaa|aaa|aaa|aaa|aaa|aaa|aaa|aaa|aaa|aaa|aa
abb|abb|abb|abb|abb|aba|aba|aba|aba|aba|ab
abc|abc|abc|abc|abc|abc|abc|abc|abc|abc|ab
adc|adc|adc|abd|abd|abc|dbc|dbc|abc|abc|db

@kellymclaughlin
Copy link
Contributor

Code review, test runs, and local cluster testing look good. +1 to merge.

Change claim_simulation.erl eunit test to run a simulation with both the
new and old claim algorithm as suggested.

Rename riak_core_new_claim:new_claim/2 to new_choose_claim/2 to match
default_choose_claim/2.

Fix two bugs in riak_core_new_claim.erl that are on code paths that cannot
occur in 1.0 due to existing invariants, but should be fixed nevertheless:
- Match error in prefilter_violations: change CNth to {CNth, _}.
- Handle case where new_choose_claim fails to claim partitions by falling
  back to claim_rebalance_n.
@jtuple jtuple merged commit 827a5a5 into 1.0 Sep 21, 2011
argv0 added a commit that referenced this pull request Apr 21, 2013
bz://1067:  remove unused try_cast and fallback functions from riak_kv_ut
jonmeredith pushed a commit that referenced this pull request Apr 21, 2013
…epl still needs it.

This reverts commit 4ae468b, reversing
changes made to a17ca57.
@seancribbs seancribbs deleted the az447-choose-claim-1.0 branch April 1, 2015 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants