Leave and attempt_simple_transfer #970

Closed
martinsumner opened this issue Sep 27, 2021 · 14 comments

@martinsumner
Contributor

martinsumner commented Sep 27, 2021

In the recent refactoring of cluster changes (#913 and #967), the focus has been on the behaviour of join operations, not leave operations.

Recently a customer achieved a balanced ring through a join (as expected with the new cluster claim algorithm), but then leave plans kept creating unbalanced rings, i.e. rings where partitions were unevenly distributed, leading to the potential for "one slow node" problems.

...

[EDIT] Various incorrect statements were made here initially about how leave works. See later comments for a more accurate description of the problem.
...

There are perhaps some simple things that can be done:

  1. Perhaps a configuration option to force a rebalance on leave. This could then be enabled for users of location, and perhaps also when repeated re-planning is not leading to balanced outcomes.
  2. attempt_simple_transfer/4 could check how many partitions are already owned by a candidate node, and prefer candidate nodes with lower levels of existing ownership. This should be more likely to give balanced results (although it will do nothing in terms of location awareness); a sketch of this preference follows below.
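
A minimal sketch of option (2), assuming the safe candidate list and an [{Index, Node}] owners list (in the shape returned by riak_core_ring:all_owners/1) are available at the point of selection; the function name is hypothetical:

%% Prefer the candidate node that currently owns the fewest partitions,
%% rather than choosing among the safe candidates at random.
-spec pick_candidate([node()], [{non_neg_integer(), node()}]) -> node().
pick_candidate(Candidates, Owners) ->
    Counts =
        lists:foldl(
            fun({_Idx, N}, Acc) -> maps:update_with(N, fun(C) -> C + 1 end, 1, Acc) end,
            #{},
            Owners),
    hd(lists:sort(
         fun(A, B) -> maps:get(A, Counts, 0) =< maps:get(B, Counts, 0) end,
         Candidates)).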

There be demons here. I don't think there's much existing test coverage of leave scenarios (riak_test/src/rt.erl has a staged_join/1 function but no staged_leave/1). There is also potential for confusing situations if the configuration setting from (1) differs between nodes, or changes between staging and committing the changes.
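
For reference, a sketch of what a staged_leave/1 helper in the style of rt:staged_join/1 might look like, assuming riak_core_claimant:leave_member/1 is the staged-leave entry point; this is illustrative, not the existing rt API:

%% Hypothetical riak_test helper: stage a leave on Node without committing,
%% so that leave plans can be exercised in tests alongside rt:staged_join/1.
-spec staged_leave(node()) -> ok | {error, term()}.
staged_leave(Node) ->
    case rpc:call(Node, riak_core_claimant, leave_member, [Node]) of
        ok -> ok;
        Other -> {error, Other}
    end.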

@martinsumner
Contributor Author

@systream

@martinsumner
Contributor Author

The above isn't based on a proper understanding of how leave works.

Take an 8-node cluster, and attempt a leave:

=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
leave          'dev7@127.0.0.1'
-------------------------------------------------------------------------------


NOTE: Applying these changes will result in 2 cluster transitions

###############################################################################
                         After cluster transition 1/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving    12.5%      0.0%    dev7@127.0.0.1
valid      12.5%     12.5%    dev1@127.0.0.1
valid      12.5%     12.5%    dev2@127.0.0.1
valid      12.5%     25.0%    dev3@127.0.0.1
valid      12.5%     12.5%    dev4@127.0.0.1
valid      12.5%     12.5%    dev5@127.0.0.1
valid      12.5%     12.5%    dev6@127.0.0.1
valid      12.5%     12.5%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 8
  8 transfers from 'dev7@127.0.0.1' to 'dev3@127.0.0.1'

###############################################################################
                         After cluster transition 2/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      12.5%     15.6%    dev1@127.0.0.1
valid      12.5%     14.1%    dev2@127.0.0.1
valid      25.0%     14.1%    dev3@127.0.0.1
valid      12.5%     14.1%    dev4@127.0.0.1
valid      12.5%     14.1%    dev5@127.0.0.1
valid      12.5%     14.1%    dev6@127.0.0.1
valid      12.5%     14.1%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 53
  1 transfers from 'dev6@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev8@127.0.0.1'
  3 transfers from 'dev3@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev2@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev2@127.0.0.1'
  3 transfers from 'dev3@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev8@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev5@127.0.0.1'
  3 transfers from 'dev3@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev3@127.0.0.1' to 'dev8@127.0.0.1'

Two things occur here. Firstly, it looks like we do a simple transfer, and only node 3 is a valid target for taking vnodes from node 7 (with respect to target_n_val).

This is really bad, as it requires double the capacity on node 3 to complete the operation.

However, clearly a second phase to rebalance does occur.
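
To make the "only node 3 is valid" observation concrete, here is a self-contained sketch of the target_n_val constraint that narrows the candidate set during a simple transfer (illustrative, not the riak_core_gossip implementation): a node is only a safe new owner for a partition if it does not already own any of the target_n_val - 1 neighbouring partitions on either side of it.

%% Owners is the [{Index, Node}] ownership list in ring order; Pos is the
%% zero-based position in that list of the partition being given away.
-spec safe_owner(non_neg_integer(), node(), [{non_neg_integer(), node()}], pos_integer()) -> boolean().
safe_owner(Pos, Candidate, Owners, TargetN) ->
    RingSize = length(Owners),
    OwnerAt = fun(P) -> element(2, lists:nth((P rem RingSize) + 1, Owners)) end,
    Neighbours =
        [Pos + Offset || Offset <- lists:seq(1, TargetN - 1)] ++
        [Pos + RingSize - Offset || Offset <- lists:seq(1, TargetN - 1)],
    not lists:any(fun(P) -> OwnerAt(P) =:= Candidate end, Neighbours).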

If we put in code to prevent the simple transfer from being attempted, we get:

=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
leave          'dev7@127.0.0.1'
-------------------------------------------------------------------------------


NOTE: Applying these changes will result in 2 cluster transitions

###############################################################################
                         After cluster transition 1/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving    12.5%      0.0%    dev7@127.0.0.1
valid      12.5%     15.6%    dev1@127.0.0.1
valid      12.5%     14.1%    dev2@127.0.0.1
valid      12.5%     14.1%    dev3@127.0.0.1
valid      12.5%     14.1%    dev4@127.0.0.1
valid      12.5%     14.1%    dev5@127.0.0.1
valid      12.5%     14.1%    dev6@127.0.0.1
valid      12.5%     14.1%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

WARNING: Not all replicas will be on distinct nodes

Transfers resulting from cluster changes: 51
  1 transfers from 'dev7@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev7@127.0.0.1' to 'dev8@127.0.0.1'

###############################################################################
                         After cluster transition 2/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      15.6%      --      dev1@127.0.0.1
valid      14.1%      --      dev2@127.0.0.1
valid      14.1%      --      dev3@127.0.0.1
valid      14.1%      --      dev4@127.0.0.1
valid      14.1%      --      dev5@127.0.0.1
valid      14.1%      --      dev6@127.0.0.1
valid      14.1%      --      dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

WARNING: Not all replicas will be on distinct nodes

However, note the warning: the outcome is balanced, but target_n_val is no longer met.

@martinsumner
Contributor Author

In the case above, this also appears to be a preflist failure at n_val 3:

 riak_core_ring_util:check_ring(FR2).    
[[{1415829711164312202009819681693899175291684651008,
   'dev8@127.0.0.1'},
  {1438665674247607560106752257205091097473808596992,
   'dev2@127.0.0.1'},
  {0,'dev2@127.0.0.1'}],
 [{1438665674247607560106752257205091097473808596992, 
   'dev2@127.0.0.1'},
  {0,'dev2@127.0.0.1'},
  {22835963083295358096932575511191922182123945984,
   'dev3@127.0.0.1'}]]
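
For illustration, the check being reported above can be sketched as a sliding n_val-sized window over the owners list (self-contained, not the riak_core_ring_util implementation); note that the window wraps from the tail of the ring back to the head, which is exactly where the failing preflists above sit:

%% Return every preflist of NVal consecutive partitions whose owners are not
%% all distinct nodes.
-spec unsafe_preflists([{non_neg_integer(), node()}], pos_integer()) ->
          [[{non_neg_integer(), node()}]].
unsafe_preflists(Owners, NVal) ->
    RingSize = length(Owners),
    Wrapped = Owners ++ lists:sublist(Owners, NVal - 1),
    [PL || Pos <- lists:seq(1, RingSize),
           PL <- [lists:sublist(Wrapped, Pos, NVal)],
           length(lists:usort([N || {_Idx, N} <- PL])) < NVal].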

@systream

I haven't dug into the code myself yet, but if I understand correctly it would be nice to claim a new ring when a node leaves (at least when location is set), and maybe the claim algorithm should be optimised for transfers.

@martinsumner
Contributor Author

The initial change I made (db9c3f0) isn't effective, as the fallback position on leave is to use riak_core_claim:claim_rebalance_n/2, the old diagonalise algorithm, which doesn't resolve tail violations.

Experimenting with calling riak_core_claim:sequential_claim/2 instead.
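
A sketch of that experiment, assuming both functions take the ring and the claiming node (as the /2 arities above suggest); the wrapper itself is illustrative:

%% When the simple transfer cannot complete safely, fall back to
%% sequential_claim (which resolves tail violations) rather than the
%% deprecated claim_rebalance_n.
rediagonalise(Ring, Node, UseSequentialClaim) ->
    case UseSequentialClaim of
        true  -> riak_core_claim:sequential_claim(Ring, Node);
        false -> riak_core_claim:claim_rebalance_n(Ring, Node)  %% deprecated path
    end.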

Also, after the call to riak_core_gossip:attempt_simple_transfer, there is still a call to riak_core_claim:claim/1 in https://github.com/basho/riak_core/blob/develop-3.0/src/riak_core_claimant.erl#L1454-L1468. This is what is currently triggering the second phase of the plan. However, it does not appear to rebalance as expected.

The location concerns may be a false alarm because of this.

@martinsumner
Contributor Author

After this commit:

=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
leave          'dev7@127.0.0.1'
-------------------------------------------------------------------------------


NOTE: Applying these changes will result in 2 cluster transitions

###############################################################################
                         After cluster transition 1/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving    12.5%      0.0%    dev7@127.0.0.1
valid      12.5%     15.6%    dev1@127.0.0.1
valid      12.5%     14.1%    dev2@127.0.0.1
valid      12.5%     14.1%    dev3@127.0.0.1
valid      12.5%     14.1%    dev4@127.0.0.1
valid      12.5%     14.1%    dev5@127.0.0.1
valid      12.5%     14.1%    dev6@127.0.0.1
valid      12.5%     14.1%    dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 55
  1 transfers from 'dev7@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev7@127.0.0.1' to 'dev3@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev8@127.0.0.1'
  2 transfers from 'dev3@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev2@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  2 transfers from 'dev8@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev6@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev5@127.0.0.1' to 'dev8@127.0.0.1'
  1 transfers from 'dev4@127.0.0.1' to 'dev6@127.0.0.1'
  1 transfers from 'dev3@127.0.0.1' to 'dev5@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev2@127.0.0.1'
  2 transfers from 'dev7@127.0.0.1' to 'dev1@127.0.0.1'
  2 transfers from 'dev6@127.0.0.1' to 'dev8@127.0.0.1'
  2 transfers from 'dev5@127.0.0.1' to 'dev6@127.0.0.1'
  2 transfers from 'dev4@127.0.0.1' to 'dev5@127.0.0.1'
  2 transfers from 'dev3@127.0.0.1' to 'dev4@127.0.0.1'
  1 transfers from 'dev2@127.0.0.1' to 'dev3@127.0.0.1'
  1 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  1 transfers from 'dev8@127.0.0.1' to 'dev1@127.0.0.1'
  1 transfers from 'dev7@127.0.0.1' to 'dev8@127.0.0.1'

###############################################################################
                         After cluster transition 2/2
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      15.6%      --      dev1@127.0.0.1
valid      14.1%      --      dev2@127.0.0.1
valid      14.1%      --      dev3@127.0.0.1
valid      14.1%      --      dev4@127.0.0.1
valid      14.1%      --      dev5@127.0.0.1
valid      14.1%      --      dev6@127.0.0.1
valid      14.1%      --      dev8@127.0.0.1
-------------------------------------------------------------------------------
Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

This is with always_rebalance_onleave set to on. It is an improvement in that, if simple_transfer is going to lead to a significant temporary imbalance, this can be avoided by rebalancing directly from scratch, using the new algorithm which handles tail violations.
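
A minimal sketch of how such a switch could be read; the application-env key, application name and default shown here are assumptions for illustration, matching the setting name above:

%% When true, skip the simple transfer on leave and go straight to a full
%% rebalance with the new claim algorithm.
force_rebalance_on_leave() ->
    application:get_env(riak_core, always_rebalance_onleave, false).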

@martinsumner
Contributor Author

martinsumner commented Sep 28, 2021

I think I now have a handle on the original problem that prompted this. The problem was a cluster plan like this:

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
leaving     4.7%      0.0%    'riak@<x>.<y>.<z>.30'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.11'
valid       5.1%      5.1%    'riak@<x>.<y>.<z>.12'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.129'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.13'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.14'
valid       4.7%      5.5%    'riak@<x>.<y>.<z>.15'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.16'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.17'
valid       4.7%      5.1%    'riak@<x>.<y>.<z>.18'
valid       4.7%      5.5%    'riak@<x>.<y>.<z>.19'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.20'
valid       5.1%      5.1%    'riak@<x>.<y>.<z>.21'
valid       5.1%      5.1%    'riak@<x>.<y>.<z>.22'
valid       5.1%      6.3%    'riak@<x>.<y>.<z>.23'
valid       4.7%      5.5%    'riak@<x>.<y>.<z>.24'
valid       4.7%      5.1%    'riak@<x>.<y>.<z>.25'
valid       4.7%      5.1%    'riak@<x>.<y>.<z>.26'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.27'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.28'
valid       4.7%      4.7%    'riak@<x>.<y>.<z>.29'
-------------------------------------------------------------------------------
Valid:20 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

The issue with this plan is that one node takes two extra vnodes (valid 5.1% 6.3% 'riak@<x>.<y>.<z>.23'). This is after a simple_transfer, so why did the call to riak_core_claim:claim/1 do nothing and fail to rebalance it?

The problem is that claim_until_balanced looks for nodes with Wants:

https://github.com/basho/riak_core/blob/riak_kv-3.0.5/src/riak_core_claim.erl#L101-L114

The issue here is that, after the simple_transfer, every node already has at least RingSize div NodeCount vnodes. A node has an excess of claims, but no node has a deficit of wants, so the ring is considered balanced prematurely.

This isn't going to be triggered on small clusters. It is only certain to be a problem when:

RingSize div NodeCount == RingSize div (NodeCount - LeaveCount)

e.g. when leaving just one node from a large cluster (in this case going from 21 nodes to 20 with a ring size of 256). However, there could be other cases where rebalancing will commence but end prematurely (whilst nodes still have an excess of claims).
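
A self-contained illustration of the arithmetic (the function is illustrative, not riak_core_claim): with a ring size of 256, 256 div 21 = 12 and 256 div 20 = 12, so after the leave every remaining node still meets the per-node minimum and reports a want of 0, even though one node holds an excess.

%% Wants per node: how far each node falls short of RingSize div NodeCount.
%% OwnedByNode is a [{Node, OwnedCount}] list for the remaining nodes.
wants(RingSize, NodeCount, OwnedByNode) ->
    [{Node, max(0, RingSize div NodeCount - Owned)} || {Node, Owned} <- OwnedByNode].

%% With illustrative counts for the plan above (one node on 16 partitions,
%% the rest on at least 12), every entry comes back as {Node, 0} - so
%% claim_until_balanced generates no further transfers.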

@martinsumner
Contributor Author

@systream - apologies, false alarm: I don't think there is an issue with location awareness here. This is just an issue with balancing the cluster correctly.

@martinsumner
Contributor Author

Clarification as to the issue:

On leave, the following process is used (a pseudocode sketch follows the list):

1a. Attempt a simple_transfer: try to transfer the vnodes from the leaving node to safe places, choosing at random if there are multiple safe places.
1b. If the simple_transfer is not possible, i.e. a state is reached where a vnode has no safe place (compatible with target_n_val), then a full re-diagonalisation is called (an entirely fresh plan is created, with the set of transfers needed to get there).
2a. After step 1, claim is re-run so that claim_until_balanced will rebalance the ring (if, in the case of 1a, the simple_transfer has created an unbalanced ring).
2b. The re-run of claim may also hit an unsafe outcome, and then trigger a full re-diagonalisation (as in 1b).
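
In pseudocode (the funs stand in for the behaviours described in the list; they are not real riak_core exports), the flow is roughly:

%% Sketch of steps 1a-2b.
plan_leave(Ring, SimpleTransferFun, ClaimUntilBalancedFun, RediagonaliseFun) ->
    Ring1 =
        case SimpleTransferFun(Ring) of                  %% 1a
            {ok, TransferredRing} -> TransferredRing;
            target_n_fail -> RediagonaliseFun(Ring)      %% 1b
        end,
    case ClaimUntilBalancedFun(Ring1) of                 %% 2a
        {ok, BalancedRing} -> BalancedRing;
        target_n_fail -> RediagonaliseFun(Ring1)         %% 2b
    end.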

The issues we have are:

A: In Step 1b the deprecated re-diagonalisation function claim_rebalance_n is used, not sequential_claim. The deprecated rebalance function does not avoid tail violations, and so may unnecessarily return an unsafe cluster plan (with the warning "Not all replicas will be on distinct nodes").

B: In Step 1a an extremely unbalanced cluster may be created (i.e. one node may take all the transferred vnodes from the leaving node). This may be unsupportable from a capacity perspective. Commonly in this case, Step 2b will be invoked, and so re-diagonalisation will occur anyway. In this case it would be preferable to skip 1a and force the use of 1b.

C: Sometimes Step 2b will not rebalance correctly, especially when leaving small numbers of nodes from large clusters, where RS div NodeCount == RS div (NodeCount - LeaveCount). Fixing this would be a significant change to the claim function. It could be mitigated by a simpler change to the simple_transfer function to bias the original distribution, making imbalances less likely in Step 1a.

The branch https://github.com/basho/riak_core/tree/mas-i970-simpleleave currently provides:

A fix for (A).
A workaround for (B): force Step 1b via configuration.

@systream

Are there any use cases where a customer prefers a probably-faster leave over a potentially imbalanced or improper ring?
Maybe a ring rebalance should always be forced on node leave.

@martinsumner
Contributor Author

PR #971 provides an improvement to (B). This also partially mitigates the issues with (C), as the input to (C) is less likely to be imbalanced.

@martinsumner
Contributor Author

@systream with very big databases a full re-diagonalisation to rebalance can take a huge amount of time (potentially many days), so I think there will always be cases where a simple_transfer followed by a rebalance via claim is the preferred way.

I think the fixes we now have for this issue provide sufficient mitigation and improvement. We just need to figure out the best way to test this.

@martinsumner
Contributor Author

There appears to be an additional bug, as attempt_simple_transfer does not check for tail violations - that is to say, when it reaches the tail of the ring, it does not check forward and wrap back around to the front of the ring (i.e. with a ring size of 64, Idx 63 should not overlap with Idx 0, 1 and 2).
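
A small sketch of the wrap-around lookup the check needs (illustrative): forward neighbours have to be taken modulo the ring size, so the tail of the ring is compared against the head.

%% Positions within target_n_val - 1 steps forward of Pos, wrapping past the
%% end of the ring back to the front.
forward_neighbours(Pos, RingSize, TargetN) ->
    [(Pos + Offset) rem RingSize || Offset <- lists:seq(1, TargetN - 1)].

%% forward_neighbours(63, 64, 4) -> [0, 1, 2]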

@martinsumner
Contributor Author

martinsumner commented Sep 30, 2021

The attempt_simple_transfer logic has now been refactored in #971. It now uses the same method for looking forwards and backwards in the ring, with that method supporting checking for tail violations, as it loops around the ring when required.
