
storage: transfer away leases when draining #13792

Merged
merged 1 commit into cockroachdb:master from drain-transfer-leases on Mar 7, 2017
Conversation

@asubiotto (Contributor) commented Feb 24, 2017

When a Store is put into draining mode, it now attempts to transfer away
any leases owned by its Replicas. This avoids the need to wait for
leases to expire while maintaining availability.



@asubiotto (Contributor, Author):

Currently working on some code reuse in pkg/storage/client_replica_test.go, so don't pay too much attention to the test files yet.

@tamird (Contributor) commented Feb 24, 2017

Reviewed 4 of 6 files at r1.
Review status: 4 of 6 files reviewed at latest revision, 6 unresolved discussions.


pkg/storage/store.go, line 969 at r1 (raw file):

}

// SetDraining (when called with 'true') prevents all of the Store's

"SetDraining causes incoming lease transfers to be rejected and prevents ..." instead of the second sentence, which is confusing since it's not clear who is attempting the transfer.


pkg/storage/store.go, line 991 at r1 (raw file):

			// If a lease request is in progress for us, we wait on it.
			if nextLease != nil && nextLease.OwnedBy(s.StoreID()) {
				r.mu.Lock()

this doesn't seem ok - i think you need all this jazz to be done under the same lock as the stuff going on in getLease. At least, that's what all existing code looks like.


pkg/storage/store.go, line 1002 at r1 (raw file):

		if drainingLease.OwnedBy(s.StoreID()) && r.IsLeaseValid(drainingLease, s.Clock().Now()) {
			var zone config.ZoneConfig

zone := config.DefaultZoneConfig() and then overwrite it below if necessary?


pkg/storage/store.go, line 1008 at r1 (raw file):

				zone, err = sysCfg.GetZoneConfigForKey(desc.StartKey)
				if log.V(1) && err != nil {
					log.Errorf(context.TODO(), "could not get zone config for key %s when draining: %s", desc.StartKey, err)

extract a single context and annotate it.

also, this needs to stop executing, i think.


pkg/storage/store.go, line 1016 at r1 (raw file):

				context.TODO(),
				r,
				r.Desc(),

why doesn't this use desc?


pkg/storage/store.go, line 1021 at r1 (raw file):

				false, /* checkCandidateFullness */
			); log.V(1) && err != nil {
				log.Errorf(context.TODO(), "error transferring lease when draining: %s", err)

also needs to stop execution?


Comments from Reviewable

@bdarnell (Contributor):

Reviewed 5 of 6 files at r1.
Review status: 5 of 6 files reviewed at latest revision, 8 unresolved discussions.


pkg/storage/store.go, line 975 at r1 (raw file):

// When called with 'false', returns to the normal mode of operation.
//
// TODO(asubiotto): Return an error when there may be active leases held by some

What is the goal of this TODO? What will we do with this error if we don't abort the draining process? (in general aborting the draining process seems like a bad idea - in most cases you'd rather exit gracelessly than have the process stay alive when you're trying to take it down)


pkg/storage/store.go, line 985 at r1 (raw file):

	sysCfg, sysCfgSet := s.cfg.Gossip.GetSystemConfig()
	newStoreReplicaVisitor(s).Visit(func(r *Replica) bool {

Transferring all leases serially will take a long time; we need to be able to have multiple transfers in flight at once.



@asubiotto (Contributor, Author):

Addressed comments and moved the tests into TestRangeTransferLease. PTAL.


Review status: 5 of 6 files reviewed at latest revision, 8 unresolved discussions.


pkg/storage/store.go, line 969 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"SetDraining causes incoming lease transfers to be rejected and prevents ..." instead of the second sentence, which is confusing since it's not clear who is attempting the transfer.

Done.


pkg/storage/store.go, line 975 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What is the goal of this TODO? What will we do with this error if we don't abort the draining process? (in general aborting the draining process seems like a bad idea - in most cases you'd rather exit gracelessly than have the process stay alive when you're trying to take it down)

The goal is to provide the caller with more information as to the state that SetDraining left the store in. I agree, the draining process shouldn't be aborted on error, which is why I suggest changing the upper-level draining logic. I don't have strong feelings about this though and it might be better to not return errors since this is a best-effort attempt.


pkg/storage/store.go, line 985 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Transferring all leases serially will take a long time; we need to be able to have multiple transfers in flight at once.

Done.


pkg/storage/store.go, line 991 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this doesn't seem ok - i think you need all this jazz to be done under the same lock as the stuff going on in getLease. At least, that's what all existing code looks like.

Done.


pkg/storage/store.go, line 1002 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

zone := config.DefaultZoneConfig() and then overwrite it below if necessary?

Done.


pkg/storage/store.go, line 1008 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

extract a single context and annotate it.

also, this needs to stop executing, i think.

Why do you think it needs to stop executing? I think that if a zone config does not exist, it's better to transfer the lease with no constraints than not to.


pkg/storage/store.go, line 1016 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why doesn't this use desc?

Done. Was a mistake.


pkg/storage/store.go, line 1021 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

also needs to stop execution?

If a replica fails to transfer away a lease, I don't think we should stop other replicas from attempting to transfer away their leases.



@tamird (Contributor) commented Feb 28, 2017

Reviewed 3 of 4 files at r2.
Review status: 6 of 7 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/storage/store.go, line 993 at r2 (raw file):

			var drainingLease *roachpb.Lease
			for {
				if requestInProgress := func() <-chan *roachpb.Error {

why is this function necessary? this defer doesn't seem to buy much:

var leaseCh <-chan *roachpb.Error
r.mu.Lock()
lease, nextLease := r.getLeaseRLocked()
if nextLease != nil && nextLease.OwnedBy(s.StoreID()) {
  leaseCh = r.mu.pendingLeaseRequest.JoinRequest()
}
r.mu.Unlock()

if leaseCh != nil {
  <-leaseCh
  continue
}
drainingLease = lease
break


@tamird (Contributor) commented Feb 28, 2017

Reviewed 1 of 4 files at r2.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/client_replica_test.go, line 531 at r2 (raw file):

	checkHasLease := func(t *testing.T, sender *storage.Stores) {
		testutils.SucceedsSoon(t, func() error {
			if pErr := sendRead(sender); pErr != nil {

return sendRead(sender).GoError() - this does the right thing for nil.


pkg/storage/client_replica_test.go, line 574 at r2 (raw file):

		shouldRenewTS := lease.Expiration.Add(-1, 0)
		mtc.manualClock.Set(shouldRenewTS.WallTime + 1)
		if pErr := sendRead(sender); pErr != nil {

ditto


pkg/storage/client_replica_test.go, line 685 at r2 (raw file):

		desc, err := r.GetReplicaDescriptor()
		if err != nil {
			return errors.New("could not get replica descriptor")

why not use the error returned?


pkg/storage/client_replica_test.go, line 688 at r2 (raw file):

		}
		if desc != replica0Desc {
			return errors.New("replica0 not raft leader")

please consider what someone would do with this message. which replica is the raft leader? in general, error messages should be useful, and this one is not.


pkg/storage/client_replica_test.go, line 697 at r2 (raw file):

		progress, ok := status.Progress[uint64(replica1Desc.ReplicaID)]
		if !ok {
			return errors.New("could not get replica1 progress")

ditto. include the map?


pkg/storage/client_replica_test.go, line 700 at r2 (raw file):

		}
		if progress.Match < status.Commit {
			return errors.New("replica1 failed to catch up")

ditto, include something relevant?



@bdarnell (Contributor) commented Mar 1, 2017

:lgtm:


Reviewed 4 of 4 files at r2.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/client_replica_test.go, line 697 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

ditto. include the map?

What map? It's nil/undefined if !ok. There's not much useful we could add here.


pkg/storage/client_replica_test.go, line 700 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

ditto, include something relevant?

This seems fine to me; the exact values are unlikely to be helpful.


pkg/storage/store.go, line 975 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

The goal is to provide the caller with more information as to the state that SetDraining left the store in. I agree, the draining process shouldn't be aborted on error, which is why I suggest changing the upper-level draining logic. I don't have strong feelings about this though and it might be better to not return errors since this is a best-effort attempt.

It's probably better to do that in some other way than returning an error. Look at the way the FreezeCluster stuff works, by returning status information as the freeze progresses.


pkg/storage/store.go, line 985 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Done.

Transferring every lease at once is probably also a bad idea. Use Stopper.RunLimitedAsyncTask instead.



@tamird (Contributor) commented Mar 4, 2017

Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/client_replica_test.go, line 697 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What map? It's nil/undefined if !ok. There's not much useful we could add here.

"the map" I was referring to is status.Progress



@asubiotto (Contributor, Author):

Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/client_replica_test.go, line 531 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

return sendRead(sender).GoError() - this does the right thing for nil.

Done.


pkg/storage/client_replica_test.go, line 574 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

ditto

Done.


pkg/storage/client_replica_test.go, line 685 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why not use the error returned?

Done.


pkg/storage/client_replica_test.go, line 688 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

please consider what someone would do with this message. which replica is the raft leader? in general, error messages should be useful, and this one is not.

Done.


pkg/storage/client_replica_test.go, line 697 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"the map" I was referring to is status.Progress

Added the map.


pkg/storage/store.go, line 975 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's probably better to do that in some other way than returning an error. Look at the way the FreezeCluster stuff works, by returning status information as the freeze progresses.

Thanks for that, that will probably be a good thing to add to the whole draining process. Removed the error and TODO.


pkg/storage/store.go, line 985 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Transferring every lease at once is probably also a bad idea. Use Stopper.RunLimitedAsyncTask instead.

Done.


pkg/storage/store.go, line 993 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why is this function necessary? this defer doesn't seem to buy much:

var leaseCh <-chan *roachpb.Error
r.mu.Lock()
lease, nextLease := r.getLeaseRLocked()
if nextLease != nil && nextLease.OwnedBy(s.StoreID()) {
  leaseCh = r.mu.pendingLeaseRequest.JoinRequest()
}
r.mu.Unlock()

if leaseCh != nil {
  <-leaseCh
  continue
}
drainingLease = lease
break

Done. Thanks.



@tamird (Contributor) commented Mar 6, 2017

Reviewed 6 of 6 files at r3.
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 1035 at r3 (raw file):

				go func(i int) {
					mtc.stores[i].SetDraining(true)
					wg.Done(nil)

seems that you can make this a plain WaitGroup now rather than a WaitGroupWithErr


pkg/storage/client_replica_test.go, line 685 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Done.

use errors.Wrap instead, or just err.



@asubiotto (Contributor, Author):

Fixed a failing interactive tcl test.


Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 1035 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

seems that you can make this a plain WaitGroup now rather than a WaitGroupWithErr

Done.


pkg/storage/client_replica_test.go, line 685 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

use errors.Wrap instead, or just err.

Done.



@tamird (Contributor) commented Mar 7, 2017

Maybe @knz wants to look


Reviewed 8 of 8 files at r4.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.



@knz (Contributor) commented Mar 7, 2017

Reviewed 1 of 8 files at r4.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/cli/interactive_tests/test_server_sig.tcl, line 48 at r4 (raw file):

    "shutdown completed" {}
    timeout {exit 1}
}

👍



@asubiotto asubiotto merged commit 1d1afaa into cockroachdb:master Mar 7, 2017
@asubiotto asubiotto deleted the drain-transfer-leases branch March 7, 2017 21:41
4 participants