stability: Rebalances must be atomic #12768

Open
bdarnell opened this Issue Jan 8, 2017 · 7 comments

Comments

@bdarnell (Member) commented Jan 8, 2017

Consider a deployment with three datacenters (and a replication configuration that places one replica of each range in each DC). During a rebalance, a range that normally requires a quorum of 2/3 will briefly move to a configuration with a 3/4 quorum and two replicas in one DC. If that DC (or the network) fails before the fourth replica is removed, the range becomes unavailable. (If rebalancing downreplicated before upreplicating instead, the intermediate state would have a quorum of 2/2, and a failure of either remaining datacenter would lead to unavailability and a much harder recovery.) While the window is fairly small, there is a risk that an ill-timed DC failure could make one or more ranges unavailable for the duration.
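
To make the availability math concrete, here's a small sketch (illustrative Go only, not CockroachDB code; the helper name and DC labels are made up) that checks whether a given replica placement can still reach a raft majority after losing any single datacenter:

```go
package main

import "fmt"

// survivesAnyOneDCFailure reports whether a range with the given replica
// placement (replicas per datacenter) can still reach a raft majority of
// the total replica count after losing any single datacenter.
func survivesAnyOneDCFailure(replicasPerDC map[string]int) bool {
	total := 0
	for _, n := range replicasPerDC {
		total += n
	}
	quorum := total/2 + 1
	for _, lost := range replicasPerDC {
		if total-lost < quorum {
			return false
		}
	}
	return true
}

func main() {
	// Steady state: one replica per DC, quorum 2/3; any single DC can fail.
	fmt.Println(survivesAnyOneDCFailure(map[string]int{"a": 1, "b": 1, "c": 1})) // true

	// Mid-rebalance: two replicas in DC "c", quorum 3/4; losing "c" strands the range.
	fmt.Println(survivesAnyOneDCFailure(map[string]int{"a": 1, "b": 1, "c": 2})) // false

	// Downreplicate-first: quorum 2/2; losing either remaining DC strands the range.
	fmt.Println(survivesAnyOneDCFailure(map[string]int{"a": 1, "b": 1})) // false
}
```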

The most general solution to this problem is to implement "joint consensus", the original raft membership change proposal (described in the original raft paper and in section 4.3 of Diego's dissertation). Etcd implements a much simpler membership change protocol with the limitation that only one change can be made at a time (leading to the brief exposure of the intermediate 3/4 state). Joint consensus would let us make these rebalancing changes atomically.
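
As a rough sketch of the joint-consensus commit rule from section 4.3 of the dissertation (not etcd/raft's actual API; all names here are invented): while the group is in the joint configuration, an entry commits only once a majority of the old configuration and a majority of the new configuration have both acknowledged it, so neither side can decide unilaterally.

```go
package main

import "fmt"

// majorityAcked reports whether a majority of the given configuration
// (a set of replica IDs) appears in the set of acknowledging replicas.
func majorityAcked(config []int, acked map[int]bool) bool {
	n := 0
	for _, id := range config {
		if acked[id] {
			n++
		}
	}
	return n >= len(config)/2+1
}

// jointCommitted applies the joint-consensus rule: an entry commits only if
// it is acknowledged by a majority of BOTH the old and the new configuration.
func jointCommitted(oldConfig, newConfig []int, acked map[int]bool) bool {
	return majorityAcked(oldConfig, acked) && majorityAcked(newConfig, acked)
}

func main() {
	oldConfig := []int{1, 2, 3} // replicas A, B, C
	newConfig := []int{1, 2, 4} // replicas A, B, D (C swapped for D atomically)
	acked := map[int]bool{1: true, 2: true}

	// A and B form a majority of both configurations, so the entry commits
	// even though neither C nor D has acknowledged it.
	fmt.Println(jointCommitted(oldConfig, newConfig, acked)) // true
}
```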

We might also be able to mitigate the problem by using ideas from Flexible Paxos. Flexible Paxos shows that the quorum used to commit entries can be smaller than a majority as long as it intersects every leader-election quorum, so as long as the leader is not in the failed datacenter, the removal of the fourth replica can be completed while the DC is down and the range is restored to its 2/3 state. However, if the leader were in the failed DC, the two surviving replicas would be unable to make progress, since they would have to assume that the former leader is still making progress on the other side of the partition. I'm not sure if there's a full solution here or not.
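
A hedged sketch of the Flexible Paxos constraint being leaned on here (illustrative only; the function names are invented): leader-election quorums and replication quorums just have to intersect, which is what permits a two-node commit quorum in the intermediate four-replica state as long as the election quorum is enlarged to compensate.

```go
package main

import "fmt"

// fpaxosSafe reports whether a pair of quorum sizes satisfies the Flexible
// Paxos requirement that every leader-election quorum intersects every
// replication (commit) quorum: qElect + qCommit > n.
func fpaxosSafe(n, qElect, qCommit int) bool {
	return qElect + qCommit > n
}

// canMakeProgress sketches the scenario discussed above: with `alive`
// reachable replicas, progress requires either a surviving leader plus a
// commit quorum, or enough live replicas to form an election quorum.
func canMakeProgress(alive, qElect, qCommit int, leaderAlive bool) bool {
	if leaderAlive {
		return alive >= qCommit
	}
	return alive >= qElect
}

func main() {
	// Intermediate 4-replica config: election quorum 3, commit quorum 2.
	n, qElect, qCommit := 4, 3, 2
	fmt.Println(fpaxosSafe(n, qElect, qCommit)) // true: 3+2 > 4

	// The doubled-up DC fails (2 replicas lost). If the leader survived,
	// the remaining 2 replicas can still commit the removal of the fourth.
	fmt.Println(canMakeProgress(2, qElect, qCommit, true)) // true

	// If the leader was in the failed DC, 2 survivors cannot reach the
	// election quorum of 3, so the range is stuck.
	fmt.Println(canMakeProgress(2, qElect, qCommit, false)) // false
}
```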

Running with a higher replication factor (5 replicas in 3 DCs) could also mitigate the problem if the rebalancer were aware of it (so that when the range goes from a 3/5 quorum to a 4/6 one, the six replicas are guaranteed to be placed two per DC). This might be a quick fix to prevent this problem from striking a critical system range. Running with more DCs than replicas (3 replicas in 4 DCs, so you never need to place two replicas in the same DC) also avoids the problem without increasing storage costs, but it has the significant downside that all rebalancing must happen over the more expensive inter-DC links.
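
A quick self-contained check of the 4/6 claim above (DC labels invented): six replicas spread two per DC keep a quorum of 4 through the loss of any single datacenter.

```go
package main

import "fmt"

// With 6 replicas placed two per DC (the DC-aware 3/5 -> 4/6 intermediate),
// the raft quorum is 4 and losing any one DC removes only 2 replicas, so
// the range stays available through a single-DC outage.
func main() {
	perDC := map[string]int{"a": 2, "b": 2, "c": 2}
	total := 0
	for _, n := range perDC {
		total += n
	}
	quorum := total/2 + 1
	for dc, lost := range perDC {
		fmt.Printf("lose %s: %d of %d live, quorum %d, available=%t\n",
			dc, total-lost, total, quorum, total-lost >= quorum)
	}
}
```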

@rjnn (Collaborator) commented Jan 9, 2017

If we're using Flexible Paxos, we can make this work as long as we first ensure that the leader is not in the datacenter where the rebalance is taking place. So if we denote datacenters by numbers and replicas by letters, and we are going from {1A, 2B, 3C} to {1A, 2B, 3D}, we just need to ensure that A or B is the leader before upreplicating to {1A, 2B, 3C, 3D}. Now if datacenter 1 or 2 fails, we have enough replicas (three) to elect a leader. If datacenter 3 fails, we still have our leader, so we can upreplicate again to a 5-replica range.
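
A tiny sketch of the precondition described here (names and types are made up; this is not allocator code): before adding the new replica, check that the current leader is not in the datacenter whose replica is being swapped.

```go
package main

import "fmt"

// replica identifies a replica and the datacenter it lives in.
type replica struct {
	id string // e.g. "A"
	dc int    // e.g. 1
}

// safeToAdd reports whether, under the scheme described above, it is safe to
// add a new replica in the given DC: the current leader must not live there,
// so a failure of that DC never takes out the leader while the range is in
// the temporary four-replica state.
func safeToAdd(leader replica, targetDC int) bool {
	return leader.dc != targetDC
}

func main() {
	a := replica{"A", 1}
	c := replica{"C", 3}

	// Going from {1A, 2B, 3C} to {1A, 2B, 3D}: adding 3D is fine if A (or B)
	// leads, but not if C leads, since DC 3 is where the swap happens.
	fmt.Println(safeToAdd(a, 3)) // true
	fmt.Println(safeToAdd(c, 3)) // false
}
```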

@bdarnell (Member, Author) commented Jan 9, 2017

Yeah, we'd need to ensure both that A or B is the leader and that C cannot become leader while the upreplication is in flight.

There are also performance considerations here (which we do not currently take into account): the new replica gets its snapshot from the leader, so for performance we'd want 3C to be the leader when 3D is added (or we'd need to introduce some way to send preemptive snapshots from a follower).

@petermattis petermattis added this to the Later milestone Feb 23, 2017

@bdarnell bdarnell modified the milestones: Later, 2.1 Apr 26, 2018

@bdarnell bdarnell self-assigned this Apr 26, 2018

@tbg tbg added this to Backlog in Core Stability May 14, 2018

@tbg tbg modified the milestones: 2.1, 2.2 Jul 19, 2018

@bdarnell bdarnell changed the title stability: Datacenter failures during rebalances stability: Rebalances must be atomic Jul 23, 2018

@bdarnell (Member, Author) commented Jul 23, 2018

We have a real-world instance of this: a node crashed (persistently, so it crashes again each time it's restarted) while the range was configured for only two replicas. This required extreme efforts to revive the downed node instead of simply letting the cluster recover normally.

@a-robinson (Collaborator) commented Jul 23, 2018

While it was only configured for two replicas? So this was when up-replicating from 1 to 3? I'm drawing a blank on when else a range would be configured to have two replicas during rebalancing (other than when removing a replica from a dead store and adding a new one to replace it, which this doesn't sound like).

@bdarnell (Member, Author) commented Jul 23, 2018

I think this was while removing a replica from a dead store. So technically this was a two-simultaneous-failure event, but the first failure was transient while the second was persistent. If the replica change had been atomic, we'd have been able to recover when the first transient failure resolved.

@a-robinson (Collaborator) commented Jul 23, 2018

Yeah, we need to fix that, although the fix will be different from (and simpler than) the fix for what this issue was originally describing. We really just need to up-replicate before removing the dead replica, not make anything atomic.

For the record, we hit this issue of removing a dead replica before adding its replacement a couple of months ago as well - #25392.
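
To make the ordering argument concrete, here's a small sketch (illustrative store names, not allocator code) contrasting the two intermediate configurations when the replica on a transiently-down store is replaced and a second, persistent failure then hits: with add-first the transiently-down replica is still a voter, so the range recovers once that store returns; with remove-first it no longer counts toward quorum.

```go
package main

import "fmt"

// hasQuorum reports whether the live members form a raft majority of the
// full membership.
func hasQuorum(members []string, live map[string]bool) bool {
	n := 0
	for _, m := range members {
		if live[m] {
			n++
		}
	}
	return n >= len(members)/2+1
}

func main() {
	// s3 is transiently down, s2 then fails persistently, s4 is the replacement.
	liveAfterSecondFailure := map[string]bool{"s1": true, "s4": true}
	liveAfterS3Returns := map[string]bool{"s1": true, "s3": true, "s4": true}

	// Add-first intermediate: {s1, s2, s3, s4}. Stuck while s3 is down,
	// but regains quorum as soon as s3 comes back (3 of 4 live).
	addFirst := []string{"s1", "s2", "s3", "s4"}
	fmt.Println(hasQuorum(addFirst, liveAfterSecondFailure)) // false
	fmt.Println(hasQuorum(addFirst, liveAfterS3Returns))     // true

	// Remove-first intermediate: {s1, s2}. s3 is no longer a member, so its
	// return doesn't help; the range stays below quorum.
	removeFirst := []string{"s1", "s2"}
	fmt.Println(hasQuorum(removeFirst, liveAfterSecondFailure)) // false
	fmt.Println(hasQuorum(removeFirst, liveAfterS3Returns))     // false
}
```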

@tbg (Member) commented Oct 11, 2018

I folded #2067 (can't change the membership with three replicas in a three-node cluster) into this issue.
