allocator: fix incorrect rebalance signal and target #43041
Conversation
pkg/storage/allocator_scorer.go
Outdated
@@ -162,7 +164,7 @@ func (c candidate) compare(o candidate) float64 {
 		}
 		return -4
 	}
-	if c.diversityScore != o.diversityScore {
+	if math.Abs(c.diversityScore-o.diversityScore) > epsilon {
I think you should create a helper function like compareScores
to encapsulate this and provide more clarity to those reading the code.
This isn't the only case where we compare doubles. It isn't feasible to have a helper function for each double field. I prefer a system-wide helper function (as I've done on multiple occasions in the past) but wasn't sure where to put it.
I'm not sure where it'd go either, but regardless, I'd just add a tiny function at bottom of this file:
func scoresAlmostEqual(score1, score2 float64) bool {
	return math.Abs(score1-score2) <= epsilon
}
Trying to create some general, system-wide function is overkill IMO, especially since you want this function to use a particular epsilon appropriate for this particular case.
OK, done.
Force-pushed from b8c2f5b to 5ab2774
Reviewed 3 of 4 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)
pkg/storage/allocator.go, line 704 at r2 (raw file):
existingPlusOneNew := append([]roachpb.ReplicaDescriptor(nil), existingReplicas...)
existingPlusOneNew = append(existingPlusOneNew, newReplica)
replicaCandidates := append([]roachpb.ReplicaDescriptor(nil), existingPlusOneNew...)
Do we need to make a copy here? simulateFilterUnremovableReplicas
seems to copy-on-write already.
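For readers unfamiliar with the pitfall behind this question, here is a minimal, self-contained sketch (not the allocator code; `withoutCopy` and `withCopy` are hypothetical helpers) of why a defensive copy before `append` can matter in Go:

```go
package main

import "fmt"

// withoutCopy appends x to s directly; the result may alias s's backing
// array whenever s has spare capacity.
func withoutCopy(s []int, x int) []int {
	return append(s, x)
}

// withCopy makes an independent copy first, the append([]T(nil), s...)
// pattern used in the snippet above.
func withCopy(s []int, x int) []int {
	return append(append([]int(nil), s...), x)
}

func main() {
	base := make([]int, 3, 4) // len 3, cap 4: room for one more element
	copy(base, []int{1, 2, 3})

	a := withoutCopy(base, 4) // writes into base's backing array
	b := withoutCopy(base, 5) // overwrites the same slot: a[3] becomes 5
	fmt.Println(a[3], b[3])   // 5 5

	c := withCopy(base, 4) // each result owns its backing array
	d := withCopy(base, 5)
	fmt.Println(c[3], d[3]) // 4 5
}
```

Whether the copy is needed here thus depends on whether `simulateFilterUnremovableReplicas` really never writes through the slice it receives, which is exactly what the comment is asking.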
pkg/storage/allocator_scorer.go, line 167 at r1 (raw file):
Previously, darinpp wrote…
OK, done.
Do you think it's worth being a little more aggressive and fully preventing misguided comparison of these float vals? Something like:
type impreciseFloat64 struct {
	f float64
}

func (a impreciseFloat64) Equal(b impreciseFloat64) bool {
	...
}

func (a impreciseFloat64) Less(b impreciseFloat64) bool {
	...
}
I worry that we could be missing a use of scoresAlmostEqual now, and even if we're not, I'm concerned that we might in the future.
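The method bodies above are elided; a minimal sketch of one possible completion (the epsilon value and the `Less` tie-break below are assumptions for illustration, not code from the PR):

```go
package main

import (
	"fmt"
	"math"
)

// epsilon matches the tolerance discussed in this review; the exact value
// used here is an assumption.
const epsilon = 1e-10

// impreciseFloat64 wraps a float64 so callers must compare through
// tolerance-aware methods instead of == and <.
type impreciseFloat64 struct {
	f float64
}

func (a impreciseFloat64) Equal(b impreciseFloat64) bool {
	return math.Abs(a.f-b.f) <= epsilon
}

func (a impreciseFloat64) Less(b impreciseFloat64) bool {
	// Only strictly less when outside the tolerance band, so Equal and
	// Less never both hold for the same pair.
	return a.f < b.f && !a.Equal(b)
}

func main() {
	x := impreciseFloat64{f: 1.0}
	y := impreciseFloat64{f: 1.0 + 1e-12} // within epsilon of x
	fmt.Println(x.Equal(y), x.Less(y))    // true false
}
```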
pkg/storage/allocator_scorer.go, line 29 at r2 (raw file):
const (
	epsilon = 1e-10
Give this a comment.
Also, how did you pick this?
pkg/storage/allocator_scorer_test.go, line 248 at r2 (raw file):
{
	valid:          true,
	diversityScore: math.Nextafter(1, 0),
Could you give this a comment?
pkg/storage/allocator_test.go, line 884 at r2 (raw file):
)
expTo := stores[1].StoreID
expFrom := map[roachpb.StoreID]bool{stores[0].StoreID: true}
We can do a direct comparison now instead of this "valid set" business.
This change fixes two bugs:

- The first bug is related to the comparison of floating point numbers when deciding whether to rebalance a range. The diversity score of two identical ranges (even the same range) may differ by a small amount. The fix is to consider candidates equal if their diversity scores have very close values (within 1e-10).
- The second bug was introduced with cockroachdb#39157 and resulted in ignoring the newly added replica when determining what existing replica it should replace in Allocator.simulateRemoveTarget. As a result, when the first bug was triggering and unnecessary rebalancing was happening, the replica to be dropped was chosen at random (in the case where all existing replicas are necessary). For example, say a range is constrained to have a replica in each of the regions A, B, C, and we are trying to add a new replica in C. Evaluating what should be removed from A, B, C may choose any one of them; if the choice is A or B, the constraint will be violated. If we instead ask the same question for A, B, C(old), C(new), the result will be C(old), as this preserves the constraint (one replica in each region).

Fixes cockroachdb#42715
Fixes https://github.com/cockroachlabs/support/issues/348

Release note (bug fix): the allocator now considers stores with very close diversity scores equal (all other things being the same) and doesn't attempt to rebalance.
Release note (bug fix): the allocator considers the new store being added when looking for a target in case of rebalance.
Force-pushed from 5ab2774 to 4475b29
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @nvanbenschoten)
pkg/storage/allocator.go, line 704 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need to make a copy here? simulateFilterUnremovableReplicas seems to copy-on-write already.
Fixed.
pkg/storage/allocator_scorer.go, line 167 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you think it's worth being a little more aggressive and fully preventing misguided comparison of these float vals? Something like:
type impreciseFloat64 struct {
	f float64
}
func (a impreciseFloat64) Equal(b impreciseFloat64) bool { ... }
func (a impreciseFloat64) Less(b impreciseFloat64) bool { ... }
I worry that we could be missing a use of scoresAlmostEqual now, and even if we're not, I'm concerned that we might in the future.
I don't think we should do this. impreciseFloat64 is a bit redundant. The float64 by its very nature has limited precision, so some numbers can't be represented exactly. Every single place in the code where we have float64 == float64 could have this problem. I feel it would make more sense to discuss the floating point formats and operations so people are more aware of what they should expect as a result.
pkg/storage/allocator_scorer.go, line 29 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this a comment.
Also, how did you pick this?
I added a comment. There isn't anything special about the exact value here. It could be lower or higher, but it has to be high enough to allow for the loss of accuracy from whatever operations are done on the floating point numbers (which carry limited precision), and at the same time low enough to make sense when comparing two numbers for equality.
pkg/storage/allocator_scorer_test.go, line 248 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you give this a comment?
I changed this one to a const but put a larger comment on one of the other ones.
pkg/storage/allocator_test.go, line 884 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We can do a direct comparison now instead of this "valid set" business.
Done.
Reviewed 4 of 4 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)
pkg/storage/allocator_scorer.go, line 167 at r1 (raw file):
Previously, darinpp wrote…
I don't think we should do this. impreciseFloat64 is a bit redundant. The float64 by its very nature has limited precision, so some numbers can't be represented exactly. Every single place in the code where we have float64 == float64 could have this problem. I feel it would make more sense to discuss the floating point formats and operations so people are more aware of what they should expect as a result.
Ack.
pkg/storage/allocator_scorer_test.go, line 119 at r3 (raw file):
	StoreID: roachpb.StoreID(i + idShift),
},
// We want to test here that everything works when the deversity score
s/deversity/diversity/
bors r+
43041: allocator: fix incorrect rebalance signal and target r=darinpp a=darinpp

This change fixes two bugs:

- The first bug is related to the comparison of floating point numbers when deciding whether to rebalance a range. The diversity score of two identical ranges (even the same range) may differ by a small amount. The fix is to consider candidates equal if their diversity scores have very close values (within 1e-10).
- The second bug was introduced with #39157 and resulted in ignoring the newly added replica when determining what existing replica it should replace in Allocator.simulateRemoveTarget. As a result, when the first bug was triggering and unnecessary rebalancing was happening, the replica to be dropped was chosen at random (in the case where all existing replicas are necessary). For example, say a range is constrained to have a replica in each of the regions A, B, C, and we are trying to add a new replica in C. Evaluating what should be removed from A, B, C may choose any one of them; if the choice is A or B, the constraint will be violated. If we instead ask the same question for A, B, C(old), C(new), the result will be C(old), as this preserves the constraint (one replica in each region).

Fixes #42715
Fixes https://github.com/cockroachlabs/support/issues/348

Release note (bug fix): the allocator now considers stores with very close diversity scores equal (all other things being the same) and doesn't attempt to rebalance.
Release note (bug fix): the allocator considers the new store being added when looking for a target in case of rebalance.

Co-authored-by: Darin <darinp@gmail.com>
Build succeeded
// operations on the results and the possible loss of accuracy.
// More info https://en.wikipedia.org/wiki/Machine_epsilon
// https://en.wikipedia.org/wiki/Floating-point_arithmetic
epsilon = 1e-10
We already have kv.allocator.range_rebalance_threshold, which is supposed to prevent rebalancing unless there is a 5% difference. Why are we introducing a new magic number instead of reusing that threshold here?

How safe is this number of 1e-10? The numbers we work with here are often 5 or 6 digits (e.g. number of ranges per node), so I'd be concerned about 1e-10 being smaller than the kinds of rounding errors we could see. I think it would be better to have a much larger cutoff (or, better, something proportional to the actual values instead of a constant).
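A sketch of the proportional alternative suggested here (`almostEqualRel` and the `relTol` knob are hypothetical names for illustration, not from the PR):

```go
package main

import (
	"fmt"
	"math"
)

// almostEqualRel compares two values using a tolerance that scales with
// their magnitude, rather than a fixed absolute epsilon.
func almostEqualRel(a, b, relTol float64) bool {
	diff := math.Abs(a - b)
	largest := math.Max(math.Abs(a), math.Abs(b))
	return diff <= relTol*largest
}

func main() {
	// With 5-6 digit quantities, an absolute 1e-10 cutoff sits far below
	// plausible rounding error; a relative tolerance adapts to the scale.
	fmt.Println(almostEqualRel(100000.0, 100000.0000001, 1e-9)) // true
	fmt.Println(almostEqualRel(100000.0, 100001.0, 1e-9))       // false
}
```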
@@ -238,9 +249,9 @@ var _ sort.Interface = byScoreAndID(nil)

func (c byScoreAndID) Len() int { return len(c) }
func (c byScoreAndID) Less(i, j int) bool {
-	if c[i].diversityScore == c[j].diversityScore &&
+	if scoresAlmostEqual(c[i].diversityScore, c[j].diversityScore) &&
An inconsistent Less function could cause all kinds of chaos. Is there any risk of floating-point math causing problems here? I think it might be better to sort things according to the raw float values, then apply the filter at the end so we won't swap to a node that is not a sufficient improvement.
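The hazard being raised is that epsilon-based equality is not transitive, which is what can make a Less built on it inconsistent; a minimal sketch (`scoresAlmostEqual` mirrors the helper added in this PR, with epsilon assumed to be 1e-10):

```go
package main

import (
	"fmt"
	"math"
)

const epsilon = 1e-10

func scoresAlmostEqual(a, b float64) bool {
	return math.Abs(a-b) <= epsilon
}

func main() {
	// a~b and b~c hold, yet a~c does not: approximate equality is not
	// transitive, so a sort.Interface Less built on it can disagree with
	// itself about the relative order of a, b, and c.
	a, b, c := 0.0, 0.6e-10, 1.2e-10
	fmt.Println(scoresAlmostEqual(a, b)) // true
	fmt.Println(scoresAlmostEqual(b, c)) // true
	fmt.Println(scoresAlmostEqual(a, c)) // false
}
```

Sorting by the raw float values and filtering afterwards, as suggested, sidesteps this because the comparator stays a strict total order.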
This change fixes two bugs:
- The first bug is related to the comparison of floating point
numbers when deciding whether to rebalance a range. The diversity score
of two identical ranges (even the same range) may differ by a small
amount. The fix is to consider candidates equal if their diversity scores
have very close values (within 1e-10).
- The second bug was introduced with #39157 ("*: remove the remaining
uses of ReplicaDescriptors.Unwrap") and resulted in
ignoring the newly added replica when determining what existing replica
it should replace in Allocator.simulateRemoveTarget.
As a result, when the first bug was triggering and unnecessary
rebalancing was happening, the replica to be dropped was chosen
at random (in the case where all existing replicas are necessary).
For example, say a range is constrained to have a replica in each
of the regions A, B, C, and we are trying to add a new replica in C.
Evaluating what should be removed from A, B, C may choose any one of them;
if the choice is A or B, the constraint will be violated.
If we instead ask the same question for A, B, C(old), C(new), the result
will be C(old), as this preserves the constraint (one replica in each
region).
Fixes #42715
Fixes https://github.com/cockroachlabs/support/issues/348
Release note (bug fix): the allocator now considers stores with very close
diversity scores equal (all other things being the same) and doesn't attempt
to rebalance.
Release note (bug fix): the allocator considers the new store being added
when looking for a target in case of rebalance.