
allocator: increase rebalance threshold #44247

Merged 1 commit · Jan 24, 2020

Conversation

darinpp
Contributor

@darinpp darinpp commented Jan 23, 2020

Previously, rebalancing when a store's number of replicas was higher or
lower than the mean was subject to a 5% threshold. This works well when
the mean number of replicas per node is large. When each node has only a
few replicas, the +/-5% translates to just +/-1 replica and causes
frequent attempts to rebalance. The learner snapshot throttling takes
care of the case in which a single node gets requests to add multiple
replicas. The second case, however, in which multiple nodes remove
replicas from an overweight node, is still a problem because we don't
throttle the removal of replicas, and the distributed nature of that
decision makes it difficult to control. To minimize the possibility of
such rebalances, this PR sets the minimum threshold to +/-2 replicas.
For a low number of replicas per store, this effectively increases the
range rebalance threshold. For a large number of replicas per store
(40+) it has no effect, as the default 5% percentage is larger.

Fixes #43749

Release note (allocator): The kv.allocator.range_rebalance_threshold setting,
which controls how far away from the mean a store's range count has to be
before the store is considered for rebalancing, is now subject to a minimum of
2 replicas. If, for example, the mean number of replicas per store is 5.6 and
the setting is 5%, the store won't be considered for rebalancing unless its
replica count is 8 or more, or 3 or fewer; previously the corresponding
cut-offs would have been 6 and 5.
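
To make the arithmetic concrete, here is a minimal, self-contained Go sketch of the threshold computation described above. It mirrors the overfullThreshold/underfullThreshold helpers this PR touches, but it is an illustration, not the production code:

```go
package main

import (
	"fmt"
	"math"
)

// minRangeRebalanceThreshold is the absolute floor introduced by this PR:
// a store must be at least this many replicas away from the mean before it
// is considered overfull or underfull.
const minRangeRebalanceThreshold = 2

func overfullThreshold(mean, thresholdFraction float64) float64 {
	return mean + math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
}

func underfullThreshold(mean, thresholdFraction float64) float64 {
	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
}

func main() {
	// The release-note example: mean of 5.6 replicas per store, 5% setting.
	fmt.Println(overfullThreshold(5.6, 0.05))  // 7.6  -> overfull at 8 or more replicas
	fmt.Println(underfullThreshold(5.6, 0.05)) // ~3.6 -> underfull at 3 or fewer replicas
	// At 40+ replicas per store the 5% term is at least 2, so the floor is a no-op.
	fmt.Println(overfullThreshold(40, 0.05)) // 42
}
```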

@cockroach-teamcity
Member

This change is Reviewable

@andy-kimball
Contributor

  1. This should really be 2 separate commits, since there are two distinct issues here.
  2. I'm surprised that the fix to reject learner snapshots is so simple. I thought there was some reason we couldn't use the same method as we did with preemptive snapshots.
  3. Isn't there a reasonable way to test rejecting learner snapshots?

@tbg
Member

tbg commented Jan 23, 2020

Thanks for the PR, @darinpp. I'm cautious about "returning" to the previous behavior regarding snapshot rejections (maybe you missed my comment about that); my current opinion is below.

A declined snapshot leads the sender to not consider the node for further rebalancing for ~1s:

timeout := DeclinedReservationsTimeout.Get(&sp.st.SV)
detail.throttledUntil = sp.clock.PhysicalTime().Add(timeout)

This wasn't great before with preemptive snapshots either, because it generates lots of noise in the logs, but it was merely annoying. Now, each rejection also has to roll back a learner, which is additional friction (a replication change is a fairly heavyweight operation which also writes to the meta ranges). We expect to hit rejections all of the time in the common case of adding capacity to a cluster (say 9 nodes are all trying to get replicas onto a fresh 10th node), where we would churn through tons of replica IDs and log messages adding and then removing these learners. (Only one snapshot at a time will be allowed, so on average we'll see roughly num-nodes * snapshot-duration failed attempts per successful snapshot, and thus also that many replication-change txns written on the cluster.)
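
A back-of-the-envelope sketch of that estimate (the ~1s decline throttle comes from the snippet above; the 9 senders and the 10-second snapshot duration are assumed purely for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// expectedDeclines roughly estimates the failed snapshot attempts per
// successful snapshot: while one snapshot streams, each sender retries
// about once per throttle period and is declined, so the count is on the
// order of senders * snapshotDuration / declineThrottle.
func expectedDeclines(senders int, snapshotDuration, declineThrottle time.Duration) float64 {
	return float64(senders) * snapshotDuration.Seconds() / declineThrottle.Seconds()
}

func main() {
	// 9 existing nodes all trying to move replicas onto a fresh 10th node,
	// assuming ~10s per snapshot and the ~1s decline throttle.
	fmt.Println(expectedDeclines(9, 10*time.Second, time.Second)) // ~90 declines (and learner rollbacks) per successful snapshot
}
```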

This isn't to say that the alternative (serializing all snaps at the receiver) is perfect - we might hold up a node's replicate queue even though it would be able to send snapshots elsewhere. And, of course, it's more likely to "overshoot" in limited scenarios such as #43749 (not sure how important that really is, though).

Both fail to address the real problem - that the stats used for decision-making can be out of date by the time we act on them. The rejection is just a shady way of introducing a time.Sleep(metricsInterval) to make it less likely that we're fast enough to run into that issue.

I second @andy-kimball's comment about having two commits, in particular to explain which issue is fixed by which change. Above all, we need to make sure we're not pessimizing "real world balancing" to improve what is possibly a corner-ish case.

@darinpp changed the title from "allocator: throttle learner snapshots and increase rebalance threshold" to "allocator: increase rebalance threshold" on Jan 23, 2020
@darinpp
Contributor Author

darinpp commented Jan 23, 2020

I removed the change to the learner snapshots and left only the change that affects the rebalancing threshold. This reduces unnecessary rebalances in most of the cases I tested and leads to quick convergence to a steady state.

@@ -60,6 +60,22 @@ const (
// hypothetically ping pong back and forth between two nodes, making one full
// and then the other.
rebalanceToMaxFractionUsedThreshold = 0.925

// The rangeRebalanceThreshold in the options specifies a percentage based
Member


I was a bit confused reading the comment, and it seems to allude to some things that aren't true in the current state. Maybe this is clearer? I might be losing some of the fidelity.

// minRangeRebalanceThreshold is the number of replicas by which a store
// must deviate from the mean number of replicas to be considered overfull
// or underfull. This absolute bound exists to account for deployments with
// a small number of replicas, to avoid premature replica movement. With few
// enough replicas per node (<<30), a rangeRebalanceThreshold of 5% (the
// default at time of writing, see below) would otherwise result in
// rebalancing at one replica above/below the mean, which could lead to a
// condition that would always fire. Instead, we only consider a store
// full/empty if it's at least minRangeRebalanceThreshold away from the mean.

Contributor Author


updated

}

 func underfullThreshold(mean float64, thresholdFraction float64) float64 {
-	return mean * (1 - thresholdFraction)
+	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
Member


Does anything interesting happen here if we end up <0? Say the mean is 1, we end up with -1. I guess an empty node may never receive a replica? Which is honestly fine in that example.

Contributor Author


No, nothing interesting happens if it is negative. The threshold is compared with the actual number of replicas, and action is taken only if the actual number is below it. When the threshold is negative, no action is taken, so we won't try to rebalance (at least not for the specific reason of being too far from the mean).
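
To illustrate that point with the same helper as in the sketch above (illustrative only, not the production code): a negative underfull threshold can never be crossed by a non-negative range count, so the check simply never fires.

```go
package main

import (
	"fmt"
	"math"
)

const minRangeRebalanceThreshold = 2 // same illustrative floor as in the earlier sketch

func underfullThreshold(mean, thresholdFraction float64) float64 {
	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
}

func main() {
	// With a mean of 1 replica per store the threshold is 1 - max(0.05, 2) = -1.
	threshold := underfullThreshold(1, 0.05)
	for _, rangeCount := range []int{0, 1, 2} {
		// Range counts are never negative, so they can never fall below -1;
		// no store is ever flagged as underfull on this criterion.
		fmt.Printf("count=%d underfull=%t\n", rangeCount, float64(rangeCount) < threshold)
	}
}
```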

@darinpp
Contributor Author

darinpp commented Jan 24, 2020

bors r+

craig bot pushed a commit that referenced this pull request Jan 24, 2020
44247: allocator: increase rebalance threshold  r=darinpp a=darinpp

Co-authored-by: Darin <darinp@gmail.com>
@craig
Contributor

craig bot commented Jan 24, 2020

Build succeeded

Development

Successfully merging this pull request may close these issues.

Unnecessary rebalances in large clusters with small number of replicas
4 participants