Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocator: fix incorrect rebalance signal and target #43041

Merged
merged 1 commit into from
Dec 9, 2019

Conversation

darinpp
Copy link
Contributor

@darinpp darinpp commented Dec 7, 2019

This change fixes two bugs:

  • The first bug is related to the comparison of floating point
    numbers when considering to rebalance a range. The diversity score
    of two identical ranges (even the same range) may differ by a small
    amount. The fix is to consider candidates equal if their diversity score
    have very close values (within 1e-10).
  • The second bug has been introduced with
    *: remove the remaining uses of ReplicaDescriptors.Unwrap #39157 and resulted in
    ignoring the newly added replica when determining what existing replica
    it should replace in Allocator.simulateRemoveTarget.
    As a result - when the first bug was triggering and unnecessary
    rebalancing was happening - the replica to be dropped was chosen
    at random (in the case where all existing replica are necessary).
    For example - let's say a range is constrained to have a replica in each
    of the regions A,B,C. We are trying to add a new replica in C.
    Evaluating what should be removed from the A,B,C may choose any one of them.
    If the choice is A or B, the constraint will be violated.
    If we however ask the same question for A,B,C(old),C(new) - the result
    will be C(old) as this will preserve the constraint (to have one
    replica in each region).

Fixes #42715
Fixes https://github.com/cockroachlabs/support/issues/348

Release note (bug fix): the allocator now considers stores with very close
diversity scores equal (all other things being the same) and doesn't attempt
to rebalance.
Release note (bug fix): the allocator considers the new store being added
when looking for target in case of rebalance.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -162,7 +164,7 @@ func (c candidate) compare(o candidate) float64 {
}
return -4
}
if c.diversityScore != o.diversityScore {
if math.Abs(c.diversityScore-o.diversityScore) > epsilon {
Copy link
Contributor

@andy-kimball andy-kimball Dec 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should create a helper function like compareScores to encapsulate this and provide more clarity to those reading the code.

Copy link
Contributor Author

@darinpp darinpp Dec 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the only case where we compare doubles. It isn't feasible to have helper function for each double field. I prefer a system wide helper function (as I've done on multiple occasions in the past) but wasn't sure where to put it.

Copy link
Contributor

@andy-kimball andy-kimball Dec 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where it'd go either, but regardless, I'd just add a tiny function at bottom of this file:

func scoresAlmostEqual(score1, score2 double) bool {
  return math.Abs(score1-score2) <= epsilon
}

Trying to create some general, system-wide function is over-kill IMO, especially since you want this function to use a particular epsilon appropriate for this particular case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)


pkg/storage/allocator.go, line 704 at r2 (raw file):

		existingPlusOneNew := append([]roachpb.ReplicaDescriptor(nil), existingReplicas...)
		existingPlusOneNew = append(existingPlusOneNew, newReplica)
		replicaCandidates := append([]roachpb.ReplicaDescriptor(nil), existingPlusOneNew...)

Do we need to make a copy here? simulateFilterUnremovableReplicas seems to copy-on-write already.


pkg/storage/allocator_scorer.go, line 167 at r1 (raw file):

Previously, darinpp wrote…

OK, done.

Do you think it's worth being a little more aggressive and fully preventing misguided comparison of these float vals? Something like:

type impreciseFloat64 struct {
    f float64
}

func (a impreciseFloat64) Equal(b impreciseFloat64) bool {
    ...
}

func (a impreciseFloat64) Less(b impreciseFloat64) bool {
    ...
}

I worry that we could be missing a use of scoresAlmostEqual now, and even if we're not, I'm concerned that we might in the future.


pkg/storage/allocator_scorer.go, line 29 at r2 (raw file):

const (
	epsilon = 1e-10

Give this a comment.

Also, how did you pick this?


pkg/storage/allocator_scorer_test.go, line 248 at r2 (raw file):

		{
			valid:          true,
			diversityScore: math.Nextafter(1, 0),

Could you give this a comment?


pkg/storage/allocator_test.go, line 884 at r2 (raw file):

		)
		expTo := stores[1].StoreID
		expFrom := map[roachpb.StoreID]bool{stores[0].StoreID: true}

We can do a direct comparison now instead of this "valid set" business.

This change fixes two bugs:
- The first bug is related to the comparison of floating point
numbers when considering to rebalance a range. The diversity score
of two identical ranges (even the same range) may differ by a small
amount. The fix is to consider candidates equal if their diversity score
have very close values (within 1e-10).
- The second bug has been introduced with
cockroachdb#39157 and resulted in
ignoring the newly added replica when determining what existing replica
it should replace in Allocator.simulateRemoveTarget.
As a result - when the first bug was triggering and unnecessary
rebalancing was happening - the replica to be dropped was chosen
at random (in the case where all existing replica are necessary).
For example - let's say a range is constrained to have a replica in each
of the regions A,B,C. We are trying to add a new replica in C.
Evaluating what should be removed from the A,B,C may choose any one of them.
Regardless of the choise, the constraint will be violated.
If we however ask the same question for A,B,C(old),C(new) - the result
will be C(old) as this will preserve the constraint (to have one
replica in each region).

Fixes cockroachdb#42715
Fixes https://github.com/cockroachlabs/support/issues/348

Release note (bug fix): the allocator now considers stores with very close
diversity scores equal (all other things being the same) and doesn't attempt
to rebalance.
Release note (bug fix): the allocator considers the new store being added
when looking for target in case of rebalance.
Copy link
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @nvanbenschoten)


pkg/storage/allocator.go, line 704 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need to make a copy here? simulateFilterUnremovableReplicas seems to copy-on-write already.

Fixed.


pkg/storage/allocator_scorer.go, line 167 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think it's worth being a little more aggressive and fully preventing misguided comparison of these float vals? Something like:

type impreciseFloat64 struct {
    f float64
}

func (a impreciseFloat64) Equal(b impreciseFloat64) bool {
    ...
}

func (a impreciseFloat64) Less(b impreciseFloat64) bool {
    ...
}

I worry that we could be missing a use of scoresAlmostEqual now, and even if we're not, I'm concerned that we might in the future.

I don't think we should do this. impreciseFloat64 is a bit redundant. The float64 by its very nature has a limited precision so some numbers can't be represented exactly. Every single place in the code where we have float64 == float64 could have this problem. I feel that it would make more sense to discuss perhaps the floating point formats and the operations so people are more aware of what should they expect as a result.


pkg/storage/allocator_scorer.go, line 29 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this a comment.

Also, how did you pick this?

I added a comment. There isn't anything special about the exact value that is here. It could be lower or higher but it has to be hight enough to allow for loss of accuracy related to whatever operations are done on the floating point numbers (which are with a certain precision) and at the same time low enough to make sense when comparing two numbers for equality.


pkg/storage/allocator_scorer_test.go, line 248 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you give this a comment?

I changed this one to a const but put a larger comment on one of the other ones.


pkg/storage/allocator_test.go, line 884 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We can do a direct comparison now instead of this "valid set" business.

Done.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)


pkg/storage/allocator_scorer.go, line 167 at r1 (raw file):

Previously, darinpp wrote…

I don't think we should do this. impreciseFloat64 is a bit redundant. The float64 by its very nature has a limited precision so some numbers can't be represented exactly. Every single place in the code where we have float64 == float64 could have this problem. I feel that it would make more sense to discuss perhaps the floating point formats and the operations so people are more aware of what should they expect as a result.

Ack.


pkg/storage/allocator_scorer_test.go, line 119 at r3 (raw file):

					StoreID: roachpb.StoreID(i + idShift),
				},
				// We want to test here that everything works when the deversity score

s/deversity/diversity/

@darinpp
Copy link
Contributor Author

darinpp commented Dec 9, 2019

bors r+

craig bot pushed a commit that referenced this pull request Dec 9, 2019
43041: allocator: fix incorrect rebalance signal and target r=darinpp a=darinpp

This change fixes two bugs:
- The first bug is related to the comparison of floating point
numbers when considering to rebalance a range. The diversity score
of two identical ranges (even the same range) may differ by a small
amount. The fix is to consider candidates equal if their diversity score
have very close values (within 1e-10).
- The second bug has been introduced with
#39157 and resulted in
ignoring the newly added replica when determining what existing replica
it should replace in Allocator.simulateRemoveTarget.
As a result - when the first bug was triggering and unnecessary
rebalancing was happening - the replica to be dropped was chosen
at random (in the case where all existing replica are necessary).
For example - let's say a range is constrained to have a replica in each
of the regions A,B,C. We are trying to add a new replica in C.
Evaluating what should be removed from the A,B,C may choose any one of them.
If the choice is A or B, the constraint will be violated.
If we however ask the same question for A,B,C(old),C(new) - the result
will be C(old) as this will preserve the constraint (to have one
replica in each region).

Fixes #42715
Fixes https://github.com/cockroachlabs/support/issues/348

Release note (bug fix): the allocator now considers stores with very close
diversity scores equal (all other things being the same) and doesn't attempt
to rebalance.
Release note (bug fix): the allocator considers the new store being added
when looking for target in case of rebalance.

Co-authored-by: Darin <darinp@gmail.com>
@craig
Copy link
Contributor

craig bot commented Dec 9, 2019

Build succeeded

// operations on the results and the possible loss of accuracy.
// More info https://en.wikipedia.org/wiki/Machine_epsilon
// https://en.wikipedia.org/wiki/Floating-point_arithmetic
epsilon = 1e-10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have kv.allocator.range_rebalance_threshold which is supposed to prevent rebalancing unless there is a 5% difference. Why are we introducing a new magic number instead of reusing that threshold here?

How safe is this number of 1e-10? The numbers we work with here are often 5 or 6 digits (e.g. number of ranges per node), so I'd be concerned about 1e-10 being smaller than the kinds of rounding errors we could see. I think it would be better to have a much larger cutoff (or better, something proportional to the actual values instead of a constant).

@@ -238,9 +249,9 @@ var _ sort.Interface = byScoreAndID(nil)

func (c byScoreAndID) Len() int { return len(c) }
func (c byScoreAndID) Less(i, j int) bool {
if c[i].diversityScore == c[j].diversityScore &&
if scoresAlmostEqual(c[i].diversityScore, c[j].diversityScore) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An inconsistent Less function could cause all kinds of chaos. Is there any risk of floating-point math causing problems here? I think it might be better to sort things according to the raw float values, then apply the filter at the end so we won't swap to a node that is not a sufficient improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: localities can't be changed seamlessly if there are replica constraints
5 participants