
allocator: increase rebalance threshold #44247

Merged 1 commit · Jan 24, 2020

Conversation

darinpp
Contributor

@darinpp darinpp commented Jan 23, 2020

Previously, rebalancing when a store's number of replicas was higher or
lower than the mean was subject to a 5% threshold. This works well when
the mean number of replicas per node is large. When each node has only a
few replicas, the +/-5% translates to just +/-1 replica and causes
frequent attempts to rebalance. The learner snapshot throttling takes
care of the case in which a single node gets requests to add multiple
replicas. The second case, however, in which multiple nodes remove
replicas from an overweight node, is still a problem because we don't
throttle the removal of replicas, and the distributed nature of that
decision makes it difficult to control. To minimize the possibility of
such rebalances, this PR sets the minimum threshold to +/-2 replicas.
For a low number of replicas per store, this effectively increases the
range rebalance threshold. For a large number of replicas per store
(40+) it has no effect, as the default 5% percentage is larger.

Fixes #43749

Release note (allocator): The kv.allocator.range_rebalance_threshold setting,
which controls how far away from the mean a store's range count has to be
before the store is considered for rebalancing, is now subject to a minimum of
2 replicas. If, for example, the mean number of replicas per store is 5.6 and
the setting is 5%, the store won't be considered for rebalancing unless its
replica count is 8 or more, or 3 or fewer; previously the corresponding
cut-offs would have been 6 and 5.
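
To make the arithmetic concrete, here is a minimal, self-contained Go sketch of the threshold computation described above. It mirrors the overfullThreshold/underfullThreshold helpers this PR touches, but it is an illustration, not the production code:

```go
package main

import (
	"fmt"
	"math"
)

// minRangeRebalanceThreshold is the absolute floor introduced by this PR:
// a store must be at least this many replicas away from the mean before it
// is considered overfull or underfull.
const minRangeRebalanceThreshold = 2

func overfullThreshold(mean, thresholdFraction float64) float64 {
	return mean + math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
}

func underfullThreshold(mean, thresholdFraction float64) float64 {
	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
}

func main() {
	// The release-note example: mean of 5.6 replicas per store, 5% setting.
	fmt.Println(overfullThreshold(5.6, 0.05))  // 7.6  -> overfull at 8 or more replicas
	fmt.Println(underfullThreshold(5.6, 0.05)) // ~3.6 -> underfull at 3 or fewer replicas
	// At 40+ replicas per store the 5% term is at least 2, so the floor is a no-op.
	fmt.Println(overfullThreshold(40, 0.05)) // 42
}
```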

@cockroach-teamcity
Member

This change is Reviewable

@andy-kimball
Contributor

  1. This should really be 2 separate commits, since there are two distinct issues here.
  2. I'm surprised that the fix to reject learner snapshots is so simple. I thought there was some reason we couldn't use the same method as we did with preemptive snapshots.
  3. Isn't there a reasonable way to test rejecting learner snapshots?

@tbg
Member

tbg commented Jan 23, 2020

Thanks for the PR, @darinpp. I'm cautious about "returning" to the previous behavior regarding snapshot rejections (maybe you missed my comment about that); my current opinion is below.

A declined snapshot leads the sender to not consider the node for further rebalancing for ~1s:

timeout := DeclinedReservationsTimeout.Get(&sp.st.SV)
detail.throttledUntil = sp.clock.PhysicalTime().Add(timeout)

This wasn't great before with preemptive snapshots either, because it generates lots of noise in the logs, but it was merely annoying. Now, each rejection also has to roll back a learner, which is additional friction (a replication change is a fairly heavyweight operation which also writes to the meta ranges). We expect to hit rejections all of the time in the common case of adding capacity to a cluster (say 9 nodes are all trying to get replicas onto a fresh 10th node), where we would churn through tons of replica IDs and log messages adding and then removing these learners. (Only one snapshot at a time will be allowed, so on average we'll see roughly num-nodes * snapshot-duration failed attempts per successful snapshot, and thus also that many replication-change txns written on the cluster.)
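
A back-of-the-envelope sketch of that estimate (the ~1s decline throttle comes from the snippet above; the 9 senders and the 10-second snapshot duration are assumed purely for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// expectedDeclines roughly estimates the failed snapshot attempts per
// successful snapshot: while one snapshot streams, each sender retries
// about once per throttle period and is declined, so the count is on the
// order of senders * snapshotDuration / declineThrottle.
func expectedDeclines(senders int, snapshotDuration, declineThrottle time.Duration) float64 {
	return float64(senders) * snapshotDuration.Seconds() / declineThrottle.Seconds()
}

func main() {
	// 9 existing nodes all trying to move replicas onto a fresh 10th node,
	// assuming ~10s per snapshot and the ~1s decline throttle.
	fmt.Println(expectedDeclines(9, 10*time.Second, time.Second)) // ~90 declines (and learner rollbacks) per successful snapshot
}
```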

This isn't to say that the alternative (serializing all snaps at the receiver) is perfect - we might hold up a node's replicate queue even though it would be able to send snapshots elsewhere. And, of course, it's more likely to "overshoot" in limited scenarios such as #43749 (not sure how important that really is, though).

Both fail to address the real problem - that the stats used for decision-making can be out of date by the time we act on them. The rejection is just a shady way of introducing a time.Sleep(metricsInterval) to make it less likely that we're fast enough to run into that issue.

I second @andy-kimball's comment about having two commits, in particular to explain which issue is fixed by which change. Above all, we need to make sure we're not pessimizing "real world balancing" to improve what is possibly a corner-ish case.

@darinpp changed the title from "allocator: throttle learner snapshots and increase rebalance threshold" to "allocator: increase rebalance threshold" on Jan 23, 2020
@darinpp
Contributor Author

darinpp commented Jan 23, 2020

I removed the change to the learner snapshots and left only the change that affects the rebalancing threshold. This reduces unnecessary rebalances in most of the cases I tested and leads to quick convergence to a steady state.

@@ -60,6 +60,22 @@ const (
// hypothetically ping pong back and forth between two nodes, making one full
// and then the other.
rebalanceToMaxFractionUsedThreshold = 0.925

// The rangeRebalanceThreshold in the options specifies a percentage based
Member


I was a bit confused reading the comment, and it seems to allude to some things that aren't true in the current state. Maybe this is clearer? I might be losing some of the fidelity.

// minRangeRebalanceThreshold is the number of replicas by which a store
// must deviate from the mean number of replicas to be considered overfull
// or underfull. This absolute bound exists to account for deployments with
// a small number of replicas, to avoid premature replica movement. With few
// enough replicas per node (<<30), a rangeRebalanceThreshold of 5% (the
// default at time of writing, see below) would otherwise result in
// rebalancing at one replica above/below the mean, which could lead to a
// condition that would always fire. Instead, we only consider a store
// full/empty if it's at least minRangeRebalanceThreshold away from the mean.

Contributor Author


updated

}

 func underfullThreshold(mean float64, thresholdFraction float64) float64 {
-	return mean * (1 - thresholdFraction)
+	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
Member


Does anything interesting happen here if we end up <0? Say the mean is 1, we end up with -1. I guess an empty node may never receive a replica? Which is honestly fine in that example.

Contributor Author


No, nothing interesting happens if it is negative. The threshold is compared with the actual number of replicas, and action is taken only if the actual number is below it. When the threshold is negative, no action is taken, so we won't try to rebalance (at least not for the specific reason of being too far from the mean).
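
To illustrate that point with the same helper as in the sketch above (illustrative only, not the production code): a negative underfull threshold can never be crossed by a non-negative range count, so the check simply never fires.

```go
package main

import (
	"fmt"
	"math"
)

const minRangeRebalanceThreshold = 2 // same illustrative floor as in the earlier sketch

func underfullThreshold(mean, thresholdFraction float64) float64 {
	return mean - math.Max(mean*thresholdFraction, minRangeRebalanceThreshold)
}

func main() {
	// With a mean of 1 replica per store the threshold is 1 - max(0.05, 2) = -1.
	threshold := underfullThreshold(1, 0.05)
	for _, rangeCount := range []int{0, 1, 2} {
		// Range counts are never negative, so they can never fall below -1;
		// no store is ever flagged as underfull on this criterion.
		fmt.Printf("count=%d underfull=%t\n", rangeCount, float64(rangeCount) < threshold)
	}
}
```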

@darinpp
Contributor Author

darinpp commented Jan 24, 2020

bors r+

craig bot pushed a commit that referenced this pull request Jan 24, 2020
44247: allocator: increase rebalance threshold  r=darinpp a=darinpp

Co-authored-by: Darin <darinp@gmail.com>
@craig
Contributor

craig bot commented Jan 24, 2020

Build succeeded

Development

Successfully merging this pull request may close these issues.

Unnecessary rebalances in large clusters with small number of replicas
4 participants