release-2.1: storage: fix various problems with dynamic replication factor #34199
Conversation
`filterUnremovableReplicas` was allowing replicas that were a necessary part of quorum to be removed. This occurred because `filterBehindReplicas` accepts a "brand new replica" that is considered up-to-date even when we don't have evidence of it being up-to-date. `filterBehindReplicas` needs to return an accurate picture of the up-to-date replicas. Rather than pushing this work into `filterBehindReplicas`, `filterUnremovableReplicas` has been changed to filter the "brand new replica" out of the removable candidates. See cockroachdb#34122 Release note: None
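To make the quorum reasoning concrete, here is a minimal, self-contained Go sketch of the fixed behavior. It is an illustration, not the real implementation: `replicaDesc`, the `upToDate` map, and the signature are simplified stand-ins (the actual function works off Raft progress state), but the key idea matches the commit: the brand new replica is treated as behind until proven caught up.

```go
package main

import "fmt"

// replicaDesc is a simplified stand-in for roachpb.ReplicaDescriptor.
type replicaDesc struct {
	ReplicaID int
}

// filterUnremovableReplicas sketches the fixed logic. upToDate records
// which replicas we have evidence are caught up; brandNewReplicaID is the
// most recently added replica, which may look caught up without evidence.
func filterUnremovableReplicas(
	replicas []replicaDesc, upToDate map[int]bool, brandNewReplicaID int,
) []replicaDesc {
	quorum := len(replicas)/2 + 1
	// Count only replicas with real evidence of being up to date. The fix:
	// the brand new replica never counts, even if it reports as caught up.
	caughtUp := 0
	for _, r := range replicas {
		if upToDate[r.ReplicaID] && r.ReplicaID != brandNewReplicaID {
			caughtUp++
		}
	}
	var candidates []replicaDesc
	for _, r := range replicas {
		behind := !upToDate[r.ReplicaID] || r.ReplicaID == brandNewReplicaID
		if behind || caughtUp > quorum {
			// A behind replica is always removable; an up-to-date replica
			// is removable only if quorum survives without it.
			candidates = append(candidates, r)
		}
	}
	return candidates
}

func main() {
	replicas := []replicaDesc{{1}, {2}, {3}, {4}}
	upToDate := map[int]bool{1: true, 2: true, 3: true, 4: true}
	// Replica 4 is brand new: with quorum 3 and only 3 proven replicas, it
	// is the sole removal candidate. Before the fix it counted as up to
	// date, making quorum members appear removable.
	fmt.Println(filterUnremovableReplicas(replicas, upToDate, 4)) // [{4}]
}
```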
Previously, `Replica.mu.lastReplicaAdded` was being set to the maximum replica ID in the descriptor whenever the descriptor changed. This was invalid when a replica was removed from the range. For example, consider a range with 9 replicas, IDs 1 through 9. If replica ID 5 is removed from the range, `lastReplicaAdded` was being set to 9. Coupled with the bug in the previous commit, this was causing replica ID 9 to appear to be up-to-date when it wasn't. The fix here isn't strictly necessary, but is done to bring sanity: `lastReplicaAdded` should accurately reflect the last replica which was added, not the maximum replica ID in the range. See cockroachdb#34122 Release note: None
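A minimal sketch of the corrected bookkeeping, with hypothetical names (`replicaState`, `setDesc`); the real fields live on `Replica.mu`. The point is that `lastReplicaAdded` (and its timestamp) only advance when the descriptor's maximum replica ID grows, i.e. when a replica was actually added:

```go
package main

import (
	"fmt"
	"time"
)

// replicaState stands in for the relevant fields on Replica.mu.
type replicaState struct {
	lastReplicaAdded     int
	lastReplicaAddedTime time.Time
}

// maxReplicaID returns the largest replica ID in a descriptor, modeled
// here as a plain slice of IDs.
func maxReplicaID(ids []int) int {
	m := 0
	for _, id := range ids {
		if id > m {
			m = id
		}
	}
	return m
}

// setDesc sketches the fix: the state advances only when the maximum
// replica ID grows. Removing replica 5 from a range with IDs 1-9 leaves
// the maximum at 9 and must not refresh the "recently added" state.
func (s *replicaState) setDesc(ids []int) {
	if m := maxReplicaID(ids); m > s.lastReplicaAdded {
		s.lastReplicaAdded = m
		s.lastReplicaAddedTime = time.Now()
	}
}

func main() {
	s := &replicaState{}
	s.setDesc([]int{1, 2, 3, 4, 5, 6, 7, 8, 9}) // replica 9 added: advances
	t1 := s.lastReplicaAddedTime
	s.setDesc([]int{1, 2, 3, 4, 6, 7, 8, 9}) // replica 5 removed: no change
	fmt.Println(s.lastReplicaAdded == 9, s.lastReplicaAddedTime.Equal(t1))
}
```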
Reimplement `StorePool.AvailableNodeCount` in terms of `NodeLiveness.GetNodeCount`. The latter returns a count of the user's intended number of nodes in the cluster: nodes that have been added, minus nodes that are decommissioning or decommissioned. This fixes misbehavior of the dynamic replication factor heuristic. The previous `StorePool.AvailableNodeCount` implementation would fluctuate depending on the number of node descriptors that had been received from gossip and the number of dead nodes. So if you had a 5-node cluster and 2 nodes were down for more than 5 min (and were thus marked as dead), the cluster would suddenly start down-replicating ranges. Similarly, if you had a 5-node cluster, took down all 5 nodes, and only restarted 3, the cluster would start down-replicating ranges. The new behavior is to consider a node part of the cluster until it is decommissioned. This better matches user expectations. Fixes cockroachdb#34122 Release note (bug fix): Avoid down-replicating widely replicated ranges when nodes in the cluster are temporarily down.
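The counting rule is simple enough to show directly. Below is a hedged Go sketch of the idea behind `NodeLiveness.GetNodeCount` (the type and signature are illustrative, not the real API): dead nodes still count toward the intended cluster size; only decommissioning removes them.

```go
package main

import "fmt"

// livenessRecord is an illustrative stand-in for a node liveness record;
// the real record also carries epochs, expirations, and a draining flag.
type livenessRecord struct {
	NodeID          int
	Decommissioning bool
}

// getNodeCount counts the user's intended cluster size: every node ever
// added, minus decommissioning (or decommissioned) nodes. Dead nodes that
// were not decommissioned still count, so a temporary outage does not
// shrink the computed size and trigger down-replication.
func getNodeCount(records []livenessRecord) int {
	n := 0
	for _, rec := range records {
		if !rec.Decommissioning {
			n++
		}
	}
	return n
}

func main() {
	// A 5-node cluster with 2 nodes dead but not decommissioned still
	// counts 5 nodes, so a replication factor of 5 is left alone.
	records := []livenessRecord{
		{1, false}, {2, false}, {3, false}, {4, false}, {5, false},
	}
	fmt.Println(getNodeCount(records)) // 5
	// Decommissioning node 5 reduces the intended cluster size to 4.
	records[4].Decommissioning = true
	fmt.Println(getNodeCount(records)) // 4
}
```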
Release note: None
Add the `replicate/wide` roachtest which starts up a 9-node cluster, sets the replication factor for all zones to 9, waits for full replication, and then restarts the cluster, bringing up only nodes 1-6. Previously, this would cause down-replication, and that down-replication could cause unavailable ranges. Further, test decommissioning one of the nodes and verify that the replication factor of the ranges falls to 7. Lastly, decrease the replication factor to 5 and verify that the number of replicas per range again falls. See cockroachdb#34122 Release note: None
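For orientation, here is an illustrative outline of the test's phases. None of the helper names below are the real roachtest API; they are stand-ins for cluster operations, and the replica counts follow the description above (8 intended nodes after one decommission settle at 7 replicas, assuming the heuristic avoids even replication factors).

```go
package main

import "fmt"

// Illustrative stand-ins for the operations the test performs; the real
// test drives a 9-node roachprod cluster through the roachtest harness.
func startNodes(from, to int)       { fmt.Printf("start nodes %d-%d\n", from, to) }
func stopAllNodes()                 { fmt.Println("stop all nodes") }
func setReplicationFactor(n int)    { fmt.Printf("set zone replication factor to %d\n", n) }
func decommissionNode(n int)        { fmt.Printf("decommission node %d\n", n) }
func waitForReplicasPerRange(n int) { fmt.Printf("wait until ranges have %d replicas\n", n) }

// runWideReplication outlines the phases of the replicate/wide test.
func runWideReplication() {
	startNodes(1, 9)
	setReplicationFactor(9)
	waitForReplicasPerRange(9) // full replication across all 9 nodes

	stopAllNodes()
	startNodes(1, 6) // only nodes 1-6 return
	// No down-replication: the 3 down nodes still count toward the
	// intended cluster size because they were not decommissioned.
	waitForReplicasPerRange(9)

	decommissionNode(9) // intended cluster size drops from 9 to 8
	// With 8 intended nodes and even factors avoided, ranges settle at 7.
	waitForReplicasPerRange(7)

	setReplicationFactor(5)
	waitForReplicasPerRange(5)
}

func main() { runWideReplication() }
```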
Change `filterUnremovableReplicas` to return zero candidates if the range is currently below quorum. This likely has little effect in practice, as a below-quorum range would be unable to process a replica removal. Release note: None
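The guard itself is a one-liner; a hedged sketch with illustrative names (not the real signature): if fewer replicas are up to date than quorum requires, no removal is safe, so nothing is offered.

```go
package main

import "fmt"

// removalCandidates sketches the below-quorum guard: when the number of
// up-to-date replicas is below quorum, return no candidates at all, since
// the range could not commit a removal anyway.
func removalCandidates(total, upToDate int, candidates []int) []int {
	quorum := total/2 + 1
	if upToDate < quorum {
		// Below quorum: nothing is removable.
		return nil
	}
	return candidates
}

func main() {
	// 5 replicas, only 2 up to date: below quorum (3), so no candidates.
	fmt.Println(removalCandidates(5, 2, []int{4, 5})) // []
	// 5 replicas, 3 up to date: quorum holds, candidates pass through.
	fmt.Println(removalCandidates(5, 3, []int{4, 5})) // [4 5]
}
```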
Force-pushed from ba55698 to 4e1e0f8
@bdarnell TestRemoveDeadReplicas is failing with this change. It is stuck waiting for up-replication to occur after removing the dead replicas. I see that you're fixing some other problem with TestRemoveDeadReplicas on master. Is there something that needs to be backported there?
TestRemoveDeadReplicas is skipped on master because @andreimatei's change to pre-create all ranges at bootstrap made it flaky. I've fixed it for that change, but I'm still looking at how to fix it for this one. I've gotten past the hang while waiting for replication, but now it's failing for other reasons I don't quite understand.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)
Ack. Should I wait for that fix, or skip the test on 2.1?
Probably skip the test on 2.1; this test is for a not-fully-supported edge case anyway.
Release note: None
Added a commit to skip the test.
Backport 6/6 commits from #34126.
/cc @cockroachdb/release
Fixes #34122
Release note (bug fix): Avoid down-replicating widely replicated ranges
when nodes in the cluster are temporarily down.