Leave and attempt_simple_transfer #970
The above isn't based on a proper understanding of how leave works. Take an 8-node cluster, and attempt a leave:

...

Two things occur here. Firstly, it looks like we do a simple transfer, and only node 3 is valid for taking vnodes from node 7 (with respect to target_n_val) - see the sketch at the end of this comment. This is really bad, as it requires double the capacity on node 3 to complete the operation. However, a second phase to rebalance clearly does occur. If we put in code to prevent simple transfer from being attempted we get:

...
However - note the warning. The outcome is balanced, but the target_n_val is now not met.
In the case above - this seems to be a failure for a preflist at n_val 3 as well:

...
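As a minimal sketch of the target_n_val constraint at work (illustrative Erlang only - the module, function, and argument conventions are assumed, not riak_core APIs):

```erlang
-module(safe_target).
-export([safe/4]).

%% Owners is the ring as a list of owner nodes ordered by index; Pos is
%% the 0-based position of the index being transferred. A candidate may
%% only take the index if it does not already own any of the TargetN - 1
%% indices on either side (wrapping around the ring).
safe(Owners, Pos, Candidate, TargetN) ->
    Len = length(Owners),
    Window = [(Pos + Offset) rem Len
              || Offset <- lists:seq(1, TargetN - 1) ++
                           lists:seq(Len - TargetN + 1, Len - 1)],
    not lists:any(fun(P) -> lists:nth(P + 1, Owners) =:= Candidate end,
                  Window).
```

With a diagonal 8-node ring [n1,n2,...,n8] and target_n_val 3, most candidates for a given index fail this check because they already own a nearby index - which is how a single node can end up as the only valid recipient for everything a leaving node gives up.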
I haven't dug into the code myself yet, but if I understand correctly it would be nice to claim a new ring when a node leaves (at least when location is set), and maybe the claiming algorithm should be optimised for transfers.
The initial change I made db9c3f0 isn't effective, as the fallback position on leave is to use ...

Experimenting with calling ...

Also, after the call to ...

The location concerns may be false because of this.
After this commit:

...

This is when setting ...
I think I now have a handle on the original problem that prompted this. The problem was a cluster plan like this:

...

The issue with this plan is that one node takes two extra vnodes.

The problem is that https://github.com/basho/riak_core/blob/riak_kv-3.0.5/src/riak_core_claim.erl#L101-L114

The issue here is that after the simple_transfer, every node already has at least RingSize div NodeCount vnodes. A node has an excess of Claims, but no node has a deficit of Wants - so the ring is considered balanced prematurely.

This isn't going to be triggered on small clusters. It is certain to be a problem when:

RingSize div NodeCount == RingSize div (NodeCount - LeaveCount)

e.g. when leaving just one node from a large cluster (in this case going from 21 nodes to 20 with a ring size of 256) - see the worked example below. However, there could be other cases where rebalancing will commence, but end prematurely (whilst nodes still have an excess of claims).
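As a worked illustration of that condition (plain Erlang shell arithmetic, nothing riak_core-specific):

```erlang
%% Leaving one node from a 21-node cluster with ring size 256.
1> 256 div 21.
12
2> 256 div 20.
12
3> 256 rem 20.
16
```

Both quotients are 12, so after the simple_transfer every remaining node already holds at least RingSize div NodeCount = 12 vnodes and the balance check passes - even though a perfectly balanced 20-node ring needs 16 nodes holding 13 vnodes and only 4 holding 12.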
@systream - apologies, false alarm, I don't think there is an issue with location awareness here. This is just an issue with balancing the cluster correctly.
Clarification as to the issue. On leave, the following process is used:

1a. Attempt a simple_transfer: try to transfer the vnodes from the leaving nodes to safe places - and if there are multiple safe places, choose at random.

...

The issues we have are:

A: In Step 1b the deprecated re-diagonalisation claim_rebalance_n is used, not sequential_claim. The deprecated rebalance function does not avoid tail violations - and so may unnecessarily return an unsafe cluster plan (with the warning that "Not all replicas will be on distinct nodes").

B: In Step 1a an extremely unbalanced cluster may be created (i.e. one node may take all the transferred vnodes from the leaving node). This may be unsupportable from a capacity perspective. Commonly in this case, Step 2b will be invoked, and so re-diagonalisation will occur anyway. In this case it would be preferable to skip 1a, and force the use of 1b - see the flow sketch below.

C: Sometimes Step 2b will not rebalance correctly, especially when leaving small numbers of nodes from large clusters - where ...

The branch https://github.com/basho/riak_core/tree/mas-i970-simpleleave currently provides a fix to (A).
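A minimal sketch of that flow with the proposed switch to skip Step 1a. attempt_simple_transfer is a real riak_core_claim function name, but everything else here (the module, full_rebalance, and the return conventions) is assumed for illustration:

```erlang
-module(leave_flow).
-export([rebalance_on_leave/3]).

%% When ForceRebalance is true, skip Step 1a entirely and go straight to
%% re-diagonalisation (Step 1b), avoiding the unbalanced outcomes of (B).
rebalance_on_leave(Ring, Exiting, ForceRebalance) ->
    case ForceRebalance of
        true ->
            full_rebalance(Ring, Exiting);
        false ->
            case attempt_simple_transfer(Ring, Exiting) of
                {ok, Ring1}   -> Ring1;                         %% Step 1a
                target_n_fail -> full_rebalance(Ring, Exiting)  %% Step 1b
            end
    end.

%% Placeholder stubs so the sketch compiles.
attempt_simple_transfer(Ring, _Exiting) -> {ok, Ring}.
full_rebalance(Ring, _Exiting) -> Ring.
```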
Are there any use cases where a customer prefers a probably-faster leave over a potentially imbalanced or improper ring?
The PR #971 provides an improvement to B. This also partially mitigates issues with C, as the input to C is less likely to be imbalanced.
@systream with very big databases a full re-diagonalisation to rebalance can take a huge amount of time (potentially many days) - so I think there will always be cases where a simple_transfer followed by a rebalance by claim is the preferred way. I think the fixes we now have for this issue provide sufficient mitigation and improvement. Just need to try and figure out the best way to test this.
There appears to be an additional bug: attempt_simple_transfer does not check for tail violations - that is to say, when it reaches the tail of the ring, it does not check wrap-around to the front of the ring (i.e. with a ring size of 64, Idx 63 should not overlap with Idx 0, 1 and 2).
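A sketch of the missing check (illustrative Erlang, not the riak_core implementation; Owners is the ring's owner list ordered by index):

```erlang
-module(tail_check).
-export([tail_violation/2]).

%% True if an owner near the tail of the ring reappears near the head
%% within the same preflist window, i.e. the wrap-around distance
%% between the two positions is less than TargetN.
tail_violation(Owners, TargetN) ->
    Len = length(Owners),
    Pairs = [{I, J} || I <- lists:seq(Len - TargetN + 1, Len - 1),
                       J <- lists:seq(0, TargetN - 2),
                       (Len - I) + J < TargetN],
    lists:any(fun({I, J}) ->
                  lists:nth(I + 1, Owners) =:= lists:nth(J + 1, Owners)
              end,
              Pairs).
```

For a ring size of 64 and TargetN of 4, the pairs generated are exactly {63,0}, {63,1}, {63,2}, {62,0}, {62,1} and {61,0} - the Idx 63 vs Idx 0, 1, 2 overlaps described above.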
In recent refactoring of cluster changes - #913 and #967 - the focus has been on behaviour of join operations, and not leave.

Recently a customer achieved a balanced ring through a join (as expected with the new cluster claim algorithm), but then leave plans kept creating unbalanced rings, i.e. rings whereby partitions were unevenly distributed, leading to the potential for "one slow node" problems - see the tally sketch after the plan output below.
...
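To make the uneven distribution concrete, here is a quick shell tally (illustrative only) of how many partitions each node owns in a toy 8-partition ring; the node holding half the ring is the candidate "one slow node":

```erlang
1> Owners = [n1,n2,n3,n1,n2,n3,n1,n1].
[n1,n2,n3,n1,n2,n3,n1,n1]
2> lists:foldl(fun(N, Acc) -> maps:update_with(N, fun(C) -> C + 1 end, 1, Acc) end,
2>             #{}, Owners).
#{n1 => 4,n2 => 2,n3 => 2}
```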
[EDIT] There were various incorrect statements made initially here about how leave works. See later comments for a more accurate representation of the problem.
...
There are perhaps some simple things that can be done:

attempt_simple_transfer/4 could check how many partitions are already owned by a candidate node, and prefer candidate nodes with lower levels of existing ownership. This should be more likely to give balanced results (although it will do nothing in terms of location awareness) - see the sketch below.

There be demons here. I don't think there's a lot of existing test coverage of leave scenarios (riak_test/src/rt.erl has a staged_join/1 function but no staged_leave/1). There could be the potential for confusing situations when the configuration setting (1) changes between nodes and between staging and committing changes.
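A minimal sketch of that preference (hypothetical helper, not the attempt_simple_transfer internals): given the current owner list and the already-safe candidates for an index, pick the candidate with the fewest existing partitions instead of choosing at random.

```erlang
-module(prefer_light).
-export([pick_candidate/2]).

%% Owners is the ring as a list of owner nodes; Candidates are the nodes
%% already deemed safe (with respect to target_n_val) for this index.
pick_candidate(Owners, Candidates) ->
    %% Tally current ownership per node.
    Counts = lists:foldl(
               fun(N, Acc) -> maps:update_with(N, fun(C) -> C + 1 end, 1, Acc) end,
               #{},
               Owners),
    %% Prefer the least-loaded safe candidate.
    [Least | _] = lists:sort(
                    fun(A, B) -> maps:get(A, Counts, 0) =< maps:get(B, Counts, 0) end,
                    Candidates),
    Least.
```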