allocator: increase rebalance threshold #44247
Conversation
Thanks for the PR, @darinpp. I'm cautious about "returning" to the previous behavior regarding snapshot rejections; maybe you missed my comment about that, though my current opinion is below. A declined snapshot leads the sender to not consider the node for further rebalancing for ~1s:

cockroach/pkg/storage/store_pool.go, lines 740 to 741 at 0e7f2f3

This wasn't great before with preemptive snapshots either, because it generates lots of noise in the logs, but it was merely annoying. Now each rejection also has to roll back a learner, which is additional friction (a replication change is a fairly heavyweight operation, which also writes to the meta ranges). We expect to hit rejections all of the time in the common case of adding capacity to a node (say, 9 nodes all trying to get replicas onto a fresh 10th node), where we would churn through tons of replica IDs and log messages adding and then removing these learners. Only one snapshot at a time will be allowed, so on average we'll see ~(num-nodes * snapshot-duration) failed attempts per successful snapshot (and thus also writing txns on the cluster).

This isn't to say that the alternative (serializing all snapshots at the receiver) is perfect - we might hold up a node's replicate queue even though it would be able to send snapshots elsewhere. And, of course, it's more likely to "overshoot" in limited scenarios such as #43749 (not sure how important that really is, though). Both fail to address the real problem - that the stats used for decision-making can be out of date by the time we act on them. The rejection is just a shady way of introducing a

I second @andy-kimball's comment about having two commits, in particular to explain which issue is fixed by which change. We need to make sure we're not pessimizing "real world balancing" to improve what is possibly a corner-ish case.
c8e6b13 to 782949f
I removed the change to the learner snapshots and kept only the change that affects the rebalancing threshold. This reduces unnecessary rebalancing in most cases I tested and leads to quick convergence to a steady state.
pkg/storage/allocator_scorer.go
Outdated
@@ -60,6 +60,22 @@ const (
	// hypothetically ping pong back and forth between two nodes, making one full
	// and then the other.
	rebalanceToMaxFractionUsedThreshold = 0.925

	// The rangeRebalanceThreshold in the options specifies a percentage based
I was a bit confused reading the comment and it seems to allude to some things that aren't true in the current state. Maybe this is clearer? I might be losing some of the fidelity.
// minRangeRebalanceThreshold is the number of replicas by which a store
// must deviate from the mean number of replicas to be considered overfull
// or underfull. This absolute bound exists to account for deployments
// with a small number of replicas, to avoid premature replica movement.
// With few enough replicas per node (<<30), a rangeRebalanceThreshold of
// 5% (the default at time of writing, see below) would otherwise result
// in rebalancing at one replica above/below the mean, which could lead
// to a condition that would always fire. Instead, we only consider a
// store full/empty if it's at least minRebalanceThreshold away from the
// mean.
updated
}

func underfullThreshold(mean float64, thresholdFraction float64) float64 {
-	return mean * (1 - thresholdFraction)
+	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
Does anything interesting happen here if we end up <0? Say the mean is 1, we end up with -1. I guess an empty node may never receive a replica? Which is honestly fine in that example.
No, nothing bad happens if it is negative. The threshold is compared with the actual number of replicas, and action is taken only if the actual number is below the threshold. When the threshold is negative, no action is taken, so we won't try to rebalance (at least not for the specific reason of being far from the mean).
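A standalone sketch of why a negative underfull threshold is harmless, assuming a `minRangeRebalanceThreshold` of 2 and mirroring the diff above (this is illustrative, not the actual allocator code):

```go
package main

import (
	"fmt"
	"math"
)

// Assumed value for illustration; the PR introduces this constant.
const minRangeRebalanceThreshold = 2

// underfullThreshold mirrors the new version in the diff above.
func underfullThreshold(mean, thresholdFraction float64) float64 {
	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
}

func main() {
	// Mean of 1 replica per store, 5% fraction:
	// max(1*0.05, 2) = 2, so the threshold is 1 - 2 = -1.
	threshold := underfullThreshold(1, 0.05)
	fmt.Println(threshold) // -1

	// A store's replica count is always >= 0, so the "count below
	// threshold" check can never fire; the empty node simply won't
	// attract replicas via this particular signal.
	count := 0.0
	fmt.Println(count < threshold) // false
}
```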
Previously, rebalancing when the number of replicas was higher or lower than the mean was subject to a 5% threshold. This works well when the mean number of replicas per node is large. When each node has only a few replicas, the +/-5% translates to just +/-1 replica and causes frequent attempts to rebalance. The learner snapshot throttling takes care of the case where a single node gets requests to add multiple replicas. The second case, where multiple nodes remove replicas from an overweight node, is still a problem, as we don't throttle the removal of replicas, and the distributed nature of that decision makes it difficult to control. To minimize the possibility of such rebalances, this PR sets the minimum threshold to +/-2 replicas. For a low number of replicas per store, this effectively increases the range rebalance threshold. For a large number of replicas per store (40+), it has no effect, since the default 5% percentage is larger.

Fixes cockroachdb#43749

Release note (allocator): The kv.allocator.range_rebalance_threshold setting, which controls how far from the mean a store's range count has to be before it is considered for rebalancing, is now subject to a 2-replica minimum. For example, if the mean number of replicas per store is 5.6 and the setting is 5%, the store won't be considered for rebalancing unless its replica count is 8 or more, or 3 or fewer. Previously the bounds would have been 5 and 6.
782949f to 684834b
bors r+
44247: allocator: increase rebalance threshold r=darinpp a=darinpp

Co-authored-by: Darin <darinp@gmail.com>
Build succeeded