
Fix a race between splits and snapshots. #2944

Merged
merged 6 commits from fix-split-snapshot-race into cockroachdb:master on Nov 4, 2015

Conversation

bdarnell
Contributor

When a range is split, followers of that range may receive a snapshot
from the right-hand side of the split before they have caught up and
processed the left-hand side where the split originated. This results in
a "range already exists" panic.

The solution is to silently drop any snapshots which would cause a
conflict. They will be retried and will succeed once the left-hand range
has performed its split.

Fixes #1644.
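
For orientation, here is a minimal sketch of the check this change introduces. The names replicasByKey, rangeBTreeKey, and RangeDescriptor come from the diff discussed below; the surrounding method signature is illustrative, not the exact code.

```go
// canApplySnapshot reports whether an incoming Raft snapshot can be applied.
// If a replica is already registered at the snapshot's end key, this store
// has not yet executed the split that created the right-hand range, so the
// snapshot is dropped; Raft will retry it once the split has been applied.
func (s *Store) canApplySnapshot(desc *roachpb.RangeDescriptor) bool {
	if s.replicasByKey.Has(rangeBTreeKey(desc.EndKey)) {
		// Applying the snapshot now would trigger the
		// "range already exists" panic.
		return false
	}
	return true
}
```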


@bdarnell
Contributor Author

Cc @es-chow

bdarnell assigned bdarnell and tbg and unassigned bdarnell on Oct 29, 2015
@tbg
Member

tbg commented Oct 29, 2015

Ah, great that you were able to put this one together so quickly. I'll take a look in the morning.

mtc.restartStore(i)
}

leftKey := roachpb.Key("a")
Member

nit: just return {left,right}Key from setupSplitSnapshotRace.

Contributor Author

OK

@tbg
Member

tbg commented Oct 29, 2015

LGTM mod minor comments

if m.stores[nodeIndex] == nil {
	pErr = &roachpb.Error{}
	pErr.SetGoError(rpc.NewSendError("store is stopped", true))
	continue
Contributor

is this right? continueing here seems to ignore this error entirely.

Contributor Author

Yes, this is intended. pErr exists outside the loop. We return permanent errors, but for retryable errors we continue through the loop until we've exhausted all the stores (at which point we return the last such error).
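
To make that loop shape concrete, here is a self-contained sketch with hypothetical names (store, errStoreStopped, isRetryable) standing in for the real multiTestContext code:

```go
package demo

import "errors"

// Hypothetical stand-ins for the real multiTestContext types.
type store struct {
	stopped bool
	send    func() error
}

var errStoreStopped = errors.New("store is stopped") // treated as retryable

func isRetryable(err error) bool { return errors.Is(err, errStoreStopped) }

// sendToStores mirrors the pattern described above: a permanent error returns
// immediately, a retryable error (including a stopped store) falls through to
// the next store, and the last retryable error is returned once every store
// has been tried.
func sendToStores(stores []*store) error {
	var lastErr error
	for _, s := range stores {
		if s == nil || s.stopped {
			lastErr = errStoreStopped
			continue
		}
		if err := s.send(); err != nil {
			if !isRetryable(err) {
				return err
			}
			lastErr = err
			continue
		}
		return nil
	}
	return lastErr
}
```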

@bdarnell
Contributor Author

I think the first test failure in this branch was a fluke (probably related to the recurrence of #2880), but the second one is real and reproducible. The newly-added tests fail about a third of the time, probably due to the way errors are handled in the testing stack. I'm still debugging.

@mrtracy
Contributor

mrtracy commented Oct 29, 2015

LGTM. Although they are currently failing, the tests you created look very good.

return false
}

if s.replicasByKey.Has(rangeBTreeKey(parsedSnap.RangeDescriptor.EndKey)) {
Member

this guards only against an exact match - agreed that that's enough to avoid the split race, but shouldn't we generally discard snapshots that overlap with existing ranges in any way?

Contributor Author

Yes. This is simply following the example of Store.addReplicaInternal; both should be updated to check for any overlap instead of an exact match on one key.
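
For illustration, the broader check would look roughly like the plain-Go sketch below; the span type and helper names are hypothetical, and the real code would consult replicasByKey.

```go
package demo

import "bytes"

// span is a half-open key interval [start, end).
type span struct{ start, end []byte }

// overlaps reports whether two half-open key spans intersect.
func overlaps(a, b span) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// conflictsWithExisting rejects a snapshot whose descriptor span overlaps any
// existing replica's span, instead of only checking for an exact match on the
// end key.
func conflictsWithExisting(snap span, existing []span) bool {
	for _, e := range existing {
		if overlaps(snap, e) {
			return true
		}
	}
	return false
}
```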

Member

Is that tracked somewhere? Feels like a TODO can't hurt.

@bdarnell
Contributor Author

bdarnell commented Nov 2, 2015

Updated with new commits. The newly-added tests are passing reliably for me now, although some of the existing tests have become flaky in race mode. multiTestContext appears to be full of races, so I'm going to try fixing that next.

The last two commits are not strictly necessary: they are things I added while debugging that seem like good ideas to keep, but they weren't the actual problem.

@tamird
Contributor

tamird commented Nov 2, 2015

LGTM so far (left 1 comment on a commit).

// lifetime. If the command takes that long to commit we are probably
// partitioned away and we can conclude that another node has taken
// the lease.
deadline := time.After(duration)
Member

could you use a context.WithDeadline(context.Background(), duration) here? We should use context.Context more for that purpose and this is a nice precedent to set.

Contributor Author

Using context.WithDeadline is only better than time.After if we give it a non-Background context (i.e. one that may already have a request deadline). And while we want to do that in many places, I don't think this is one of them: Thanks to the lock in redirectOnOrAcquireLeaderLease, one call to requestLeaderLease may serve many clients. If we bail out early because the first client expired or cancelled, other clients blocking on the lock could create redundant lease requests. We should wait here as long as the lease may be useful, regardless of the context.

Member

I agree that we shouldn't be using the client's context here, but using a freshly deadlined context.Context is strictly better in that it would be passed into proposeRaftCommand above, which allows all moving parts up the callstack to cancel the request early.

Contributor Author

Ah, that makes sense. It would be nice to be able to handle a cancelled Context at the lower level so a partitioned node doesn't try to replay all its queued lease requests once it gets reconnected. I'll add a deadline to r.context() before the call to r.proposeRaftCommand.
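
As a standalone illustration of that plan (requestLease, propose, and duration here are hypothetical stand-ins, not the actual Replica methods):

```go
package demo

import (
	"context"
	"time"
)

// requestLease bounds the lease request by a deadline derived from the
// lease's useful lifetime, independent of any client context, and passes the
// deadlined context down so lower layers can abandon the proposal early.
func requestLease(duration time.Duration, propose func(context.Context) <-chan error) error {
	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(duration))
	defer cancel()

	select {
	case err := <-propose(ctx):
		return err
	case <-ctx.Done():
		// We are probably partitioned away; assume another node has
		// taken the lease.
		return ctx.Err()
	}
}
```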

@tbg
Member

tbg commented Nov 2, 2015

LGTM with minor nits. Can you rebase into commits which pass tests individually before you merge?

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

The current test failures are due to the replica scanner, which is a source of background cross-store operations. Previously, these operations would succeed even during shutdown (I think because in most of our tests prior to this one leadership was almost always on node 0 so shutting down from 0 to N was safe), but now they fail because they can't start a new task on the destination store. And thanks to #2500, we have (multiple) unbounded retry loops that will prevent the queue from ever shutting down cleanly (including a hidden unbounded loop in DistSender: even if you limit MaxRetries, we call retry.Reset on certain errors). With all of these changed to use a finite number of retries, these tests look fine but others have problems if the number is too low.

So it looks like to merge this PR we either have to solve the retry option quagmire, or hack in a special-case DrainQueues method to ensure that all this background processing has stopped before we try to shut down. I'm going to explore the latter option first, since even though #2500 has been climbing in my priority queue I'd rather not tie it to this PR.
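
To make the "hidden unbounded loop" concrete, here is a self-contained sketch of the pattern; the retrier type is a toy stand-in, not the real util/retry package.

```go
package demo

// retrier is a toy stand-in for a bounded retry helper.
type retrier struct{ attempts, maxRetries int }

func (r *retrier) next() bool { r.attempts++; return r.attempts <= r.maxRetries }
func (r *retrier) reset()     { r.attempts = 0 }

// retryLoop looks bounded, but if attempt keeps asking for a reset (as
// happens for certain errors in the DistSender path), the counter never
// reaches maxRetries and the loop never exits.
func retryLoop(attempt func() (again, resetCounter bool)) {
	r := retrier{maxRetries: 3}
	for r.next() {
		again, resetCounter := attempt()
		if !again {
			return
		}
		if resetCounter {
			r.reset()
		}
	}
}
```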

@tbg
Member

tbg commented Nov 3, 2015

sounds good to me.


@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

Up next: TestRangeDescriptorSnapshotRace. Looks like asynchronous intent resolutions again, racing with splits this time. The client puts together a batch of ResolveIntentRequests, but by now they're on different ranges. This comes back as an OpRequiresTransactionError, which causes TxnCoordSender to retry the request in a transaction. This subjects it to another one of those infinite retry loops.

I think we just need a flag for non-transactional requests (ResolveIntent and PushTxn), so that DistSender will never return OpRequiresTransactionError if it has to truncate a batch of them.
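
A rough sketch of that flag, with hypothetical names (nonTransactional, sendChunks); the real change would live in DistSender's batch-truncation path.

```go
package demo

import "errors"

var errOpRequiresTransaction = errors.New("multi-range batch requires a transaction")

// request is a minimal stand-in for a KV request.
type request struct {
	nonTransactional bool // e.g. ResolveIntent, PushTxn
}

// maybeSend splits a multi-range batch into per-range chunks. A batch made up
// entirely of non-transactional requests is sent chunk by chunk rather than
// bounced back with OpRequiresTransactionError (which would cause
// TxnCoordSender to wrap and retry it in a transaction).
func maybeSend(batch []request, spansMultipleRanges bool, sendChunks func([]request) error) error {
	if !spansMultipleRanges {
		return sendChunks(batch)
	}
	for _, r := range batch {
		if !r.nonTransactional {
			return errOpRequiresTransaction
		}
	}
	return sendChunks(batch)
}
```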

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

OK, finally got a green build. PTAL (and I'm rebuilding now to make sure, just in case it has turned red by the time you see this).

@tbg
Member

tbg commented Nov 3, 2015

Could you briefly outline the infinite loop the ResolveIntent transaction gets stuck in? Your last change looks useful either way, but if such a transaction can get stuck then I see no reason why that wouldn't happen with more "honest" transactions in practice.

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

There are about 5 different retry loops in play here, each with its own idea of which errors are retryable. The transaction layer appears to be the most persistent in its retries, but I was having trouble mapping out exactly what was happening.

goroutine 6150 [select]:
github.com/cockroachdb/cockroach/util/retry.(*Retry).Next(0xc82037bfc0, 0xc820356260)
    /go/src/github.com/cockroachdb/cockroach/util/retry/retry.go:109 +0x183
github.com/cockroachdb/cockroach/kv.(*DistSender).sendChunk(0xc820194310, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:496 +0x3f2
github.com/cockroachdb/cockroach/kv.(*DistSender).(github.com/cockroachdb/cockroach/kv.sendChunk)-fm(0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, 0xc820302460, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:469 +0x54
github.com/cockroachdb/cockroach/kv.(*chunkingSender).Send(0xc820028280, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/batch.go:206 +0x2aa
github.com/cockroachdb/cockroach/kv.(*DistSender).Send(0xc820194310, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:469 +0x21d
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).Send(0xc820558280, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0x0, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:397 +0xe2f
github.com/cockroachdb/cockroach/client.(*txnSender).Send(0xc820478240, 0x7f2d4ebe9e00, 0xc8200136c0, 0x0, 0x0, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:50 +0x9b
github.com/cockroachdb/cockroach/client.(*DB).send(0xc820478240, 0xc8204b6100, 0x4, 0x8, 0xfb262, 0x165951a)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:482 +0x1b6
github.com/cockroachdb/cockroach/client.(*Txn).send(0xc820478240, 0xc8204b6100, 0x4, 0x8, 0x7e6f69, 0x80)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:454 +0xa32
github.com/cockroachdb/cockroach/client.(*Txn).(github.com/cockroachdb/cockroach/client.send)-fm(0xc8204b6100, 0x5, 0x8, 0x7d9465, 0x7e3371)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:295 +0x3e
github.com/cockroachdb/cockroach/client.sendAndFill(0xc82037cff8, 0xc82055c2c0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:418 +0x59
github.com/cockroachdb/cockroach/client.(*Txn).RunWithResponse(0xc820478240, 0xc82055c2c0, 0xc8207a10e0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:295 +0x8e
github.com/cockroachdb/cockroach/client.(*Txn).CommitInBatchWithResponse(0xc820478240, 0xc82055c2c0, 0x1, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:332 +0x392
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).resendWithTxn.func1(0xc820478240, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:803 +0x25e
github.com/cockroachdb/cockroach/client.(*Txn).exec(0xc820478240, 0xc82037d6e8, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:372 +0x2c1
github.com/cockroachdb/cockroach/client.(*DB).Txn(0xc82037d700, 0xc82037d6e8, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:465 +0x16d
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).resendWithTxn(0xc820558280, 0x0, 0x0, 0x14131634c38f4e83, 0x632f31cc7f09518c, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:795 +0x27d
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).Send(0xc820558280, 0x7f2d4ebe9e40, 0xc8204d1b90, 0x0, 0x0, 0x14131634c38f4e83, 0x632f31cc7f09518c, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:400 +0xeb6
github.com/cockroachdb/cockroach/client.(*DB).send(0xc82040a140, 0xc820573ec0, 0x4, 0x4, 0x7f2d4ebe7c38, 0xc8200ff320)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:482 +0x1b6
github.com/cockroachdb/cockroach/client.(*DB).(github.com/cockroachdb/cockroach/client.send)-fm(0xc820573ec0, 0x4, 0x4, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:452 +0x3e
github.com/cockroachdb/cockroach/client.sendAndFill(0xc82037de70, 0xc82055cb00, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:418 +0x59
github.com/cockroachdb/cockroach/client.(*DB).RunWithResponse(0xc82040a140, 0xc82055cb00, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:452 +0x8e
github.com/cockroachdb/cockroach/client.(*DB).Run(0xc82040a140, 0xc82055cb00, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:443 +0x37
github.com/cockroachdb/cockroach/storage.(*Replica).resolveIntents.func2()
    /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1527 +0x35
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask.func1(0xc82056a8a0, 0xc82020ae40, 0x17, 0xc82020ae20)
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:130 +0x58
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:131 +0x221

@tbg
Member

tbg commented Nov 3, 2015

I can take a look if you'd like that. Just let me know what test invocation you've experienced the issues with. We should jump at any chance to understand and remove deadlocks like that.

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

Sure. My test invocation was make testrace PKG=./storage TESTS=TestRangeDescriptorSnapshotRace TESTFLAGS='-count 10'. I can't be sure that the transaction retry loop was causing the issues; it's possible that the resendWithTxn step just added enough of a delay that the problem was more likely to occur (it's already timing-related, and it's much more common in race mode than not).

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

This is all done except for the rebase now.

@tbg
Member

tbg commented Nov 3, 2015

LGTM. I didn't re-review the code but saw that you changed the SendError to non-retryable.

// Ensure that any remaining commands are not left hanging.
for _, g := range s.groups {
	for _, p := range g.pending {
		p.ch <- util.Errorf("shutting down")
Member

some other places have a nil check on p.ch here. Not this one?

Contributor Author

Added.
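
For completeness, the guarded version in sketch form, mirroring the snippet above with the nil check added:

```go
// Ensure that any remaining commands are not left hanging, skipping
// entries that have no reply channel.
for _, g := range s.groups {
	for _, p := range g.pending {
		if p.ch != nil {
			p.ch <- util.Errorf("shutting down")
		}
	}
}
```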

ResolveIntent requests are often sent in batches even when they span
ranges, and currently the DistSender will try to wrap those in
transactions, even though these requests do not benefit from being
in a transaction.

The immediate motivation for this change is that transactions are
subject to infinite retries (unlike non-transactional requests),
so the transaction-wrapped version would time out at shutdown.

When a range is split, followers of that range may receive a snapshot
from the right-hand side of the split before they have caught up and
processed the left-hand side where the split originated. This results in
a "range already exists" panic.

The solution is to silently drop any snapshots which would cause a
conflict. They will be retried and will succeed once the left-hand range
has performed its split.

Fixes cockroachdb#1644.

Also check destination stopper in multiTestContext.rpcSend
bdarnell added a commit that referenced this pull request Nov 4, 2015
Fix a race between splits and snapshots.
bdarnell merged commit fbccb36 into cockroachdb:master on Nov 4, 2015
bdarnell deleted the fix-split-snapshot-race branch on November 4, 2015 03:29
pav-kv pushed a commit to pav-kv/cockroach that referenced this pull request Mar 5, 2024
pkg/testutil: ForceGosched -> WaitSchedule