Skip to content

storage: add Allocator.TransferLease{Source,Target}#10464

Merged
petermattis merged 1 commit intocockroachdb:masterfrom
petermattis:pmattis/rebalance-leases
Nov 14, 2016
Merged

storage: add Allocator.TransferLease{Source,Target}#10464
petermattis merged 1 commit intocockroachdb:masterfrom
petermattis:pmattis/rebalance-leases

Conversation

@petermattis
Copy link
Copy Markdown
Collaborator

@petermattis petermattis commented Nov 4, 2016

Before (allocsim -n 4 -w 4):

_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1448.8        0     1686   559/554     383/0     331/1     413/2

After:

_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1413.7        0     1658   406/132   422/141   410/136   420/138

I do see a little bit of thrashiness with the lease-transfer heuristics,
though it settles down relatively quickly so I'm not sure if it is worth
addressing yet.

Depends on #10420


This change is Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from dd03d64 to 9650a58 Compare November 7, 2016 01:10
@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 7, 2016

Could you look at #10193 and #10497 before moving ahead with this? There's something funky going on with lease transfers in the latter (and the former has a similar issue).


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 379 at r1 (raw file):

}

// TransferLeaseSource returns true if the specified store is overfull with

can you clarify that overfull refers to leases here? also this is a strange name for a method that returns a bool


pkg/storage/allocator_test.go, line 703 at r1 (raw file):

  sg.GossipStores(stores, t)

  makeReplicaDescs := func(storeIDs ...roachpb.StoreID) []roachpb.ReplicaDescriptor {

how come each test case needs a new slice?


pkg/storage/allocator_test.go, line 734 at r1 (raw file):

      target := a.TransferLeaseTarget(config.Constraints{}, c.existing, c.leaseholder, 0, c.check)
      if c.expected != target.StoreID {
          t.Fatalf("%d: expected %d, but found %d", i, c.expected, target.StoreID)

optional: could be nice to introduce sub-tests here, which would give each test a name


pkg/storage/allocator_test.go, line 768 at r1 (raw file):

      result := a.TransferLeaseSource(config.Constraints{}, c.leaseholder, 0)
      if c.expected != result {
          t.Fatalf("%d: expected %v, but found %v", i, c.expected, result)

optional: could be nice to introduce sub-tests here, which would give each test a name


pkg/storage/replicate_queue.go, line 188 at r1 (raw file):

  case AllocatorRemove:
      log.Event(ctx, "removing a replica")
      removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas, 0)

what does zero mean here? this is the only "real" caller to RemoveTarget, so this change in semantics should probably be achieved by removing the second argument completely.


pkg/storage/replicate_queue.go, line 192 at r1 (raw file):

          return err
      }
      // We require the lease in order to process replicas, so

what is this comment now explaining?


pkg/storage/replicate_queue.go, line 207 at r1 (raw file):

              zone.Constraints, desc.Replicas, repl.store.StoreID(), desc.RangeID,
              false /* check-transfer-lease-source */)
          if target.StoreID != 0 {

nit: if (target != roachpb.ReplicaDescriptor{}) {


pkg/storage/replicate_queue.go, line 210 at r1 (raw file):

              log.VEventf(ctx, 1, "transferring lease to s%d", target.StoreID)
              if err := repl.AdminTransferLease(target.StoreID); err != nil {
                  return errors.Wrapf(err, "%s: unable to transfer lease", repl)

should this include the store to which the lease transfer was sent?


pkg/storage/replicate_queue.go, line 247 at r1 (raw file):

          log.VEventf(ctx, 1, "transferring lease to s%d", target.StoreID)
          if err := repl.AdminTransferLease(target.StoreID); err != nil {
              return errors.Wrapf(err, "%s: unable to transfer lease", repl)

ditto


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch 2 times, most recently from 7558355 to 316ed65 Compare November 7, 2016 15:36
@petermattis
Copy link
Copy Markdown
Collaborator Author

I can't reproduce the failures in #10193 and #10497 due to another error occurring first. See #10501.


Review status: 0 of 4 files reviewed at latest revision, 9 unresolved discussions.


pkg/storage/allocator.go, line 379 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can you clarify that overfull refers to leases here? also this is a strange name for a method that returns a bool

Clarified comment. Renamed to `ShouldTransferLease`.

pkg/storage/allocator_test.go, line 703 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

how come each test case needs a new slice?

They don't. Changed.

pkg/storage/allocator_test.go, line 734 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

optional: could be nice to introduce sub-tests here, which would give each test a name

Done.

pkg/storage/allocator_test.go, line 768 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

optional: could be nice to introduce sub-tests here, which would give each test a name

Done.

pkg/storage/replicate_queue.go, line 188 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what does zero mean here? this is the only "real" caller to RemoveTarget, so this change in semantics should probably be achieved by removing the second argument completely.

Removed the second argument.

pkg/storage/replicate_queue.go, line 192 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what is this comment now explaining?

Added a local which makes it more obvious that `repl.store.StoreID()` is the `leaseholderStoreID`.

pkg/storage/replicate_queue.go, line 207 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: if (target != roachpb.ReplicaDescriptor{}) {

Done.

pkg/storage/replicate_queue.go, line 210 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

should this include the store to which the lease transfer was sent?

Done.

pkg/storage/replicate_queue.go, line 247 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

ditto

Done.

Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 7, 2016

Reviewed 4 of 4 files at r2.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 239 at r2 (raw file):

// RemoveTarget returns a suitable replica to remove from the provided replica
// set. It first attempts to randomly select a target from the set nodes that

set of nodes?


pkg/storage/allocator_test.go, line 703 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

They don't. Changed.

so why keep this function?

pkg/storage/allocator_test.go, line 737 at r2 (raw file):

              c.existing, c.leaseholder, 0, c.check)
          if c.expected != target.StoreID {
              t.Fatalf("%d: expected %d, but found %d", i, c.expected, target.StoreID)

the integer here is now duplicating the test name. also, i was imagining that the names would take the place of the comments you have on each test case rather than be integers.


pkg/storage/allocator_test.go, line 773 at r2 (raw file):

          result := a.ShouldTransferLease(config.Constraints{}, c.leaseholder, 0)
          if c.expected != result {
              t.Fatalf("%d: expected %v, but found %v", i, c.expected, result)

the integer here is now duplicating the test name. also, i was imagining that the names would take the place of the comments you have on each test case rather than be integers.


pkg/storage/allocator_test.go, line 840 at r2 (raw file):

  for i := 0; i < 100; i++ {
      // Stores 2 and 3 are overfull, so we should see them randomly selected as
      // the removal target..

double period


Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 7, 2016

:lgtm:


Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from 316ed65 to 853b3c3 Compare November 7, 2016 16:00
@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 239 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

set of nodes?

`set of stores`

pkg/storage/allocator_test.go, line 703 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

so why keep this function?

Removed.

pkg/storage/allocator_test.go, line 737 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

the integer here is now duplicating the test name. also, i was imagining that the names would take the place of the comments you have on each test case rather than be integers.

I don't have names for all of the test cases. Feels burdensome to require that. The integer at least allows the running of a particular test case easily.

pkg/storage/allocator_test.go, line 773 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

the integer here is now duplicating the test name. also, i was imagining that the names would take the place of the comments you have on each test case rather than be integers.

See other comment.

pkg/storage/allocator_test.go, line 840 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

double period

Done.

Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 7, 2016

Reviewed 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/allocator_test.go, line 737 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't have names for all of the test cases. Feels burdensome to require that. The integer at least allows the running of a particular test case easily.

OK. Still duplicating it in the error message.

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from 853b3c3 to 778e56b Compare November 7, 2016 16:25
@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: 3 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/allocator_test.go, line 737 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

OK. Still duplicating it in the error message.

Done.

Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 7, 2016

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from 778e56b to f1cbdff Compare November 7, 2016 21:12
@bdarnell
Copy link
Copy Markdown
Contributor

bdarnell commented Nov 8, 2016

:lgtm:


Reviewed 2 of 4 files at r2, 1 of 2 files at r3, 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 334 at r4 (raw file):

// to from the provided list. It excludes the current lease holder replica.
func (a *Allocator) TransferLeaseTarget(
  constraints config.Constraints,

Why does this take constraints but RemoveTarget doesn't? Both are acting on the current replicas of the range and comparing to an average, so it seems like either both should use the constraints to compute that average or neither should.


pkg/storage/allocator_test.go, line 729 at r4 (raw file):

  for i, c := range testCases {
      t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
          target := a.TransferLeaseTarget(config.Constraints{},

We should have some tests with non-empty constraints too (doesn't have to be in this PR)


pkg/storage/replicate_queue.go, line 207 at r4 (raw file):

          target := rq.allocator.TransferLeaseTarget(
              zone.Constraints, desc.Replicas, repl.store.StoreID(), desc.RangeID,
              false /* check-transfer-lease-source */)

Nit: either remove the hyphens in this comment (and the one below) or change it to the go-style variable name.


pkg/storage/replicate_queue.go, line 242 at r4 (raw file):

      // We require the lease in order to process replicas, so
      // repl.store.StoreID() corresponds to the lease-holder's store ID.
      target := rq.allocator.TransferLeaseTarget(

There are two behavioral changes in this PR: the fact that AllocatorRemove will now transfer leases away from overfull stores, and this one to balance leases in AllocatorNoop. What is the result of allocsim with this part commented out (to see the first part in isolation)?


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 334 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why does this take constraints but RemoveTarget doesn't? Both are acting on the current replicas of the range and comparing to an average, so it seems like either both should use the constraints to compute that average or neither should.

`RemoveTarget` will be taking constraints soon too. I didn't want to add that here because it is being added in Bram's open PR to re-add the rule-based allocator.

pkg/storage/allocator_test.go, line 729 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We should have some tests with non-empty constraints too (doesn't have to be in this PR)

TODO added.

pkg/storage/replicate_queue.go, line 188 at r4 (raw file):

  case AllocatorRemove:
      log.Event(ctx, "removing a replica")
      removeReplica, err := rq.allocator.RemoveTarget(desc.Replicas)

Btw, I removed the leaseStoreID parameter, but I think I should add it back and only transfer the lease on removal if the leaseHolder is "overfull".


pkg/storage/replicate_queue.go, line 207 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Nit: either remove the hyphens in this comment (and the one below) or change it to the go-style variable name.

Camel-cased the name in the comment.

pkg/storage/replicate_queue.go, line 242 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There are two behavioral changes in this PR: the fact that AllocatorRemove will now transfer leases away from overfull stores, and this one to balance leases in AllocatorNoop. What is the result of allocsim with this part commented out (to see the first part in isolation)?

_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1690.3        0     2653   663/250   647/222   675/175   668/227

I ran this on a different machine, so ignore the total number of replicas and ops/sec numbers. Without active lease rebalancing we have nodes that are +13%, +0.5%, -20% and +3% from the mean. That isn't horrible (certainly better than what we have now), but I think the complication of active lease rebalancing is worthwhile as it keeps all nodes within 5% of the mean.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch 2 times, most recently from a7736e0 to af4be7d Compare November 8, 2016 15:37
@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 8, 2016

Reviewed 4 of 4 files at r5.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/allocator.go, line 243 at r5 (raw file):

// falls back to selecting a random target from any of the existing
// replicas. It also will exclude any replica that belongs to the range lease
// holder's store ID if leaseStoreID is non-zero.

nit: this function doesn't actually check if leaseStoreID is non-zero


pkg/storage/allocator.go, line 257 at r5 (raw file):

  // Retrieve store descriptors for the provided replicas from the StorePool.
  var descriptors []roachpb.StoreDescriptor

NYC but could pre-allocate here


pkg/storage/allocator_test.go, line 729 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

TODO added.

Let's file an issue when this merges.

pkg/storage/replicate_queue.go, line 188 at r5 (raw file):

  case AllocatorRemove:
      log.Event(ctx, "removing a replica")
      // If the lease holder (our local store) is on an overfull store (in terms

"our local store is on an overfull store" - can you improve this phrasing?


pkg/storage/replicate_queue.go, line 192 at r5 (raw file):

      leaseHolderStoreID := repl.store.StoreID()
      if rq.allocator.ShouldTransferLease(zone.Constraints, leaseHolderStoreID, desc.RangeID) {
          leaseHolderStoreID = 0

i'm confused about this logic. as I understand it, this is doing:

considerLeaseHolder := ShouldTransferLease(leaseHolder)
RemoveTarget(..., considerLeaseHolder)

But doesn't RemoveTarget already consider all the replicas and only suggest removal of replicas which are on overfull stores? ISTM that the previous logic (in r4) was exactly equivalent, and AFAICT no tests changed to support this change except for a unit test of RemoveTarget which doesn't demonstrate a meaningful change in the allocator's behaviour.


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/allocator.go, line 243 at r5 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: this function doesn't actually check if leaseStoreID is non-zero

Done.

pkg/storage/allocator.go, line 257 at r5 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

NYC but could pre-allocate here

Done.

pkg/storage/replicate_queue.go, line 188 at r5 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"our local store is on an overfull store" - can you improve this phrasing?

`s/on an/an/g`

pkg/storage/replicate_queue.go, line 192 at r5 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

i'm confused about this logic. as I understand it, this is doing:

considerLeaseHolder := ShouldTransferLease(leaseHolder)
RemoveTarget(..., considerLeaseHolder)

But doesn't RemoveTarget already consider all the replicas and only suggest removal of replicas which are on overfull stores? ISTM that the previous logic (in r4) was exactly equivalent, and AFAICT no tests changed to support this change except for a unit test of RemoveTarget which doesn't demonstrate a meaningful change in the allocator's behaviour.

`RemoveTarget` tries to remove from stores with too many replicas. This is allowing the leaseholder to only be considered a candidate for removal if it has too many leases.

We don't have any direct tests of the replicateQueue. After #10515 goes in, I'll add one for basic lease balancing. But even then it will be difficult to adequately exercise this logic in a unit test environment.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from af4be7d to d2842e4 Compare November 8, 2016 16:01
@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 8, 2016

Reviewed 2 of 2 files at r6.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/storage/replicate_queue.go, line 192 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

RemoveTarget tries to remove from stores with too many replicas. This is allowing the leaseholder to only be considered a candidate for removal if it has too many leases.

We don't have any direct tests of the replicateQueue. After #10515 goes in, I'll add one for basic lease balancing. But even then it will be difficult to adequately exercise this logic in a unit test environment.

Ah, I see. Thanks.

The logic surrounding these allocator decisions feels like it's getting more complex to reason about with the addition of lease rebalancing as a result of the implementation strategy where we have a family of methods for each heuristic, and one is forced to page-in both families and their interactions in order to understand what's going on.

Perhaps this is just because I haven't been in the code (only reviews), but my intuition is that this stuff would be much easier to reason about if there was just a single family of methods that internally dealt with however many heuristics were involved (helpers that calculate scores, etc. are fine, obviously).


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/storage/replicate_queue.go, line 192 at r5 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Ah, I see. Thanks.

The logic surrounding these allocator decisions feels like it's getting more complex to reason about with the addition of lease rebalancing as a result of the implementation strategy where we have a family of methods for each heuristic, and one is forced to page-in both families and their interactions in order to understand what's going on.

Perhaps this is just because I haven't been in the code (only reviews), but my intuition is that this stuff would be much easier to reason about if there was just a single family of methods that internally dealt with however many heuristics were involved (helpers that calculate scores, etc. are fine, obviously).

Yeah, it's unfortunate how much logic is contained in `replicateQueue.process`. I'm reluctant to do anything in the short term with the rule-based allocator PR in flight. Let me think about this for a future PR.

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from d2842e4 to e109d2e Compare November 8, 2016 18:01
@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: 2 of 5 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 98 at r7 (raw file):

      counts := countReplicas()
      for _, c := range counts {
          if c < minReplicas {

@tamird Notice the improvement that was possible here.


Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 8, 2016

Reviewed 3 of 3 files at r7.
Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


pkg/storage/replicate_queue_test.go, line 67 at r7 (raw file):

      for {
          if _, _, err := tc.SplitRange(splitKey); err != nil {
              if testutils.IsError(err, "split at key .* failed: conflict updating range descriptors") ||

can able be written as IsError("range is already split at key|plit at key .* failed: conflict updating range descriptors"). feel free to ignore


pkg/storage/replicate_queue_test.go, line 91 at r7 (raw file):

  }

  const numRanges = 15

isn't this susceptible to rot? It'd be nice to count them.


pkg/storage/replicate_queue_test.go, line 98 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

@tamird Notice the improvement that was possible here.

Nice, thanks.

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from e109d2e to 30f6a02 Compare November 8, 2016 18:21
@petermattis
Copy link
Copy Markdown
Collaborator Author

Added COCKROACH_ENABLE_LEASE_REBALANCING env var which defaults to true.

@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 67 at r7 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can able be written as IsError("range is already split at key|plit at key .* failed: conflict updating range descriptors"). feel free to ignore

Yeah, but that makes the line really long.

pkg/storage/replicate_queue_test.go, line 91 at r7 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

isn't this susceptible to rot? It'd be nice to count them.

Sure, though if you change numbers above the test will start to fail.

Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 8, 2016

Reviewed 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 91 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Sure, though if you change numbers above the test will start to fail.

What do you mean? What if a new system table is added? The test will start to fail then, too. Surely that'd be good to avoid?

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch 2 times, most recently from fc4ee63 to adafe59 Compare November 8, 2016 19:23
@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 91 at r7 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

What do you mean? What if a new system table is added? The test will start to fail then, too. Surely that'd be good to avoid?

Done, though stress testing revealed this is now flaky (it was flaky with the hardcoded constants too). One of the flakes should be fixed by #10530. Another is #10501.

Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 8, 2016

Reviewed 1 of 1 files at r9.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue_test.go, line 91 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Done, though stress testing revealed this is now flaky (it was flaky with the hardcoded constants too). One of the flakes should be fixed by #10530. Another is #10501.

Was it found by the stress testing I added?! 😻

pkg/storage/replicate_queue_test.go, line 54 at r9 (raw file):

  //
  //   replicate_queue_test.go:71: /Table/53/0: lookup range unexpected error:
  //   replica corruption (processed=true): range does not match splits:

this looks like...the range just did not split? scary indeed.


Comments from Reviewable

@tamird tamird mentioned this pull request Nov 8, 2016
@petermattis petermattis force-pushed the pmattis/rebalance-leases branch 2 times, most recently from 5bbe59a to 051a174 Compare November 9, 2016 22:42
@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/storage/replicate_queue_test.go, line 91 at r7 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Was it found by the stress testing I added?! 😻

No, just good old fashion manual stress testing.

pkg/storage/replicate_queue_test.go, line 54 at r9 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this looks like...the range just did not split? scary indeed.

Yes indeed. Unfortunately I haven't seen this error again.

Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 10, 2016

Reviewed 1 of 1 files at r10.
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/storage/replicate_queue_test.go, line 54 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yes indeed. Unfortunately I haven't seen this error again.

did you mean to set this back to 10, or why remove the comment?

Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/storage/replicate_queue_test.go, line 54 at r9 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

did you mean to set this back to 10, or why remove the comment?

I ran into a variety of flakiness problems with 10 nodes. Worth tracking down, but not holding up this PR. I'll file an issue.

Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator Author

Tracked down the TestReplicateQueueRebalance flakiness which was necessitated a small tweak to replicateQueue.shouldQueue. PTAL.

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 14, 2016

Reviewed 4 of 4 files at r11.
Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 556 at r11 (raw file):

var enableLeaseRebalancing = envutil.EnvOrDefaultBool("COCKROACH_ENABLE_LEASE_REBALANCING", true)

func shouldTransferLease(sl StoreList, candidate roachpb.StoreDescriptor) bool {

s/candidate/source/g for consistency with the caller. perhaps this should be a method on the store list?


pkg/storage/replicate_queue.go, line 117 at r11 (raw file):

      return true, priority
  }
  // See if there is a rebalancing opportunity present. Note that we don't

"see if there's a rebalancing opportunity present" is not a useful comment.

The rest of this comment seems pretty complicated - it talks about the negative case (no current lease) rather than the positive case which the code that immediately follows it operates on. It seems to me that if there is a current lease, then the lease holder is immediately considered for lease rebalancing, and that's all that this comment really needs to say.

Also, is it possible to get here without holding the lease?


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from bc2f9d1 to eef3cde Compare November 14, 2016 16:43
@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 556 at r11 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/candidate/source/g for consistency with the caller. perhaps this should be a method on the store list?

Done.

I don't see a reason to prefer making this a method on StoreList. I don't see an argument against doing so either. I'm going to leave as-is for now.


pkg/storage/replicate_queue.go, line 117 at r11 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"see if there's a rebalancing opportunity present" is not a useful comment.

The rest of this comment seems pretty complicated - it talks about the negative case (no current lease) rather than the positive case which the code that immediately follows it operates on. It seems to me that if there is a current lease, then the lease holder is immediately considered for lease rebalancing, and that's all that this comment really needs to say.

Also, is it possible to get here without holding the lease?

Yes, we can get here without holding the lease. We only grab the lease for calling `process()`. But we only get here if we're the leaseholder or there is no leaseholder.

I've updated the comment and also tweaked the code so we only call ShouldTransferLease if we're currently holding the lease.


Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 14, 2016

Reviewed 2 of 2 files at r12.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/storage/replicate_queue.go, line 118 at r12 (raw file):

  }

  // If we hold the lease check to see if we should transfer for it.

s/for// ? also add a comma for the subordinate clause.


Comments from Reviewable

Before (`allocsim -n 4 -w 4`):

_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1472.5        0     2124   704/687     492/0     522/0     406/0

After:

_elapsed__ops/sec___errors_replicas_________1_________2_________3_________4
    5m0s   1506.7        0     2157   520/183   548/180   547/166   542/180

I do see a little bit of thrashiness with the lease-transfer heuristics,
though it settles down relatively quickly so I'm not sure if it is worth
addressing yet.

Depends on cockroachdb#10420
@petermattis petermattis force-pushed the pmattis/rebalance-leases branch from eef3cde to 1844549 Compare November 14, 2016 18:05
@petermattis
Copy link
Copy Markdown
Collaborator Author

Review status: 4 of 5 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


pkg/storage/replicate_queue.go, line 118 at r12 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/for// ? also add a comma for the subordinate clause.

Done.

Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Nov 14, 2016

Reviewed 1 of 1 files at r13.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


Comments from Reviewable

@petermattis petermattis merged commit f8315dd into cockroachdb:master Nov 14, 2016
@petermattis petermattis deleted the pmattis/rebalance-leases branch November 14, 2016 18:49
@dianasaur323 dianasaur323 mentioned this pull request Dec 5, 2016
37 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants