
Fix a race between splits and snapshots. #2944

Merged
merged 6 commits from fix-split-snapshot-race into cockroachdb:master on Nov 4, 2015

Conversation

bdarnell
Contributor

When a range is split, followers of that range may receive a snapshot
from the right-hand side of the split before they have caught up and
processed the left-hand side where the split originated. This results in
a "range already exists" panic.

The solution is to silently drop any snapshots which would cause a
conflict. They will be retried and will succeed once the left-hand range
has performed its split.

Fixes #1644.
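
For orientation, here is a minimal sketch of the check this change introduces. The names replicasByKey, rangeBTreeKey, and RangeDescriptor come from the diff discussed below; the surrounding method signature is illustrative, not the exact code.

```go
// canApplySnapshot reports whether an incoming Raft snapshot can be applied.
// If a replica is already registered at the snapshot's end key, this store
// has not yet executed the split that created the right-hand range, so the
// snapshot is dropped; Raft will retry it once the split has been applied.
func (s *Store) canApplySnapshot(desc *roachpb.RangeDescriptor) bool {
	if s.replicasByKey.Has(rangeBTreeKey(desc.EndKey)) {
		// Applying the snapshot now would trigger the
		// "range already exists" panic.
		return false
	}
	return true
}
```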


@bdarnell
Contributor Author

Cc @es-chow

bdarnell assigned bdarnell and tbg and unassigned bdarnell on Oct 29, 2015
@tbg
Member

tbg commented Oct 29, 2015

Ah, great that you were able to put this one together so quickly. I'll take a look in the morning.

mtc.restartStore(i)
}

leftKey := roachpb.Key("a")
Member

nit: just return {left,right}Key from setupSplitSnapshotRace.

Contributor Author

OK

@tbg
Member

tbg commented Oct 29, 2015

LGTM mod minor comments

if m.stores[nodeIndex] == nil {
	pErr = &roachpb.Error{}
	pErr.SetGoError(rpc.NewSendError("store is stopped", true))
	continue
Contributor

is this right? continueing here seems to ignore this error entirely.

Contributor Author

Yes, this is intended. pErr exists outside the loop. We return permanent errors, but for retryable errors we continue through the loop until we've exhausted all the stores (at which point we return the last such error).
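
To make that loop shape concrete, here is a self-contained sketch with hypothetical names (store, errStoreStopped, isRetryable) standing in for the real multiTestContext code:

```go
package demo

import "errors"

// Hypothetical stand-ins for the real multiTestContext types.
type store struct {
	stopped bool
	send    func() error
}

var errStoreStopped = errors.New("store is stopped") // treated as retryable

func isRetryable(err error) bool { return errors.Is(err, errStoreStopped) }

// sendToStores mirrors the pattern described above: a permanent error returns
// immediately, a retryable error (including a stopped store) falls through to
// the next store, and the last retryable error is returned once every store
// has been tried.
func sendToStores(stores []*store) error {
	var lastErr error
	for _, s := range stores {
		if s == nil || s.stopped {
			lastErr = errStoreStopped
			continue
		}
		if err := s.send(); err != nil {
			if !isRetryable(err) {
				return err
			}
			lastErr = err
			continue
		}
		return nil
	}
	return lastErr
}
```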

@bdarnell
Contributor Author

I think the first test failure in this branch was a fluke (probably related to the recurrence of #2880), but the second one is real and reproducible. The newly-added tests fail about a third of the time, probably due to the way errors are handled in the testing stack. I'm still debugging.

@mrtracy
Contributor

mrtracy commented Oct 29, 2015

LGTM. Although they are currently failing, the tests you created look very good.

return false
}

if s.replicasByKey.Has(rangeBTreeKey(parsedSnap.RangeDescriptor.EndKey)) {
Member

this guards only against an exact match - agreed that that's enough to avoid the split race, but shouldn't we generally discard snapshots that overlap with existing ranges in any way?

Contributor Author

Yes. This is simply following the example of Store.addReplicaInternal; both should be updated to check for any overlap instead of an exact match on one key.
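
For illustration, the broader check would look roughly like the plain-Go sketch below; the span type and helper names are hypothetical, and the real code would consult replicasByKey.

```go
package demo

import "bytes"

// span is a half-open key interval [start, end).
type span struct{ start, end []byte }

// overlaps reports whether two half-open key spans intersect.
func overlaps(a, b span) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// conflictsWithExisting rejects a snapshot whose descriptor span overlaps any
// existing replica's span, instead of only checking for an exact match on the
// end key.
func conflictsWithExisting(snap span, existing []span) bool {
	for _, e := range existing {
		if overlaps(snap, e) {
			return true
		}
	}
	return false
}
```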

Member

Is that tracked somewhere? Feels like a TODO can't hurt.

@bdarnell
Contributor Author

bdarnell commented Nov 2, 2015

Updated with new commits. The newly-added tests are passing reliably for me now, although some of the existing tests have become flaky in race mode. multiTestContext appears to be full of races, so I'm going to try fixing that next.

The last two commits are not strictly necessary: they are things I added while debugging that seem like good ideas to keep, but they weren't the actual problem.

@tamird
Contributor

tamird commented Nov 2, 2015

LGTM so far (left 1 comment on a commit).

// lifetime. If the command takes that long to commit we are probably
// partitioned away and we can conclude that another node has taken
// the lease.
deadline := time.After(duration)
Member

could you use a context.WithDeadline(context.Background(), duration) here? We should use context.Context more for that purpose and this is a nice precedent to set.

Contributor Author

Using context.WithDeadline is only better than time.After if we give it a non-Background context (i.e. one that may already have a request deadline). And while we want to do that in many places, I don't think this is one of them: Thanks to the lock in redirectOnOrAcquireLeaderLease, one call to requestLeaderLease may serve many clients. If we bail out early because the first client expired or cancelled, other clients blocking on the lock could create redundant lease requests. We should wait here as long as the lease may be useful, regardless of the context.

Member

I agree that we shouldn't be using the client's context here, but using a freshly deadlined context.Context is strictly better in that it would be passed into proposeRaftCommand above, which allows all moving parts up the callstack to cancel the request early.

Contributor Author

Ah, that makes sense. It would be nice to be able to handle a cancelled Context at the lower level so a partitioned node doesn't try to replay all its queued lease requests once it gets reconnected. I'll add a deadline to r.context() before the call to r.proposeRaftCommand.
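
As a standalone illustration of that plan (requestLease, propose, and duration here are hypothetical stand-ins, not the actual Replica methods):

```go
package demo

import (
	"context"
	"time"
)

// requestLease bounds the lease request by a deadline derived from the
// lease's useful lifetime, independent of any client context, and passes the
// deadlined context down so lower layers can abandon the proposal early.
func requestLease(duration time.Duration, propose func(context.Context) <-chan error) error {
	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(duration))
	defer cancel()

	select {
	case err := <-propose(ctx):
		return err
	case <-ctx.Done():
		// We are probably partitioned away; assume another node has
		// taken the lease.
		return ctx.Err()
	}
}
```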

@tbg
Member

tbg commented Nov 2, 2015

LGTM with minor nits. Can you rebase into commits which pass tests individually before you merge?

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

The current test failures are due to the replica scanner, which is a source of background cross-store operations. Previously, these operations would succeed even during shutdown (I think because in most of our tests prior to this one leadership was almost always on node 0 so shutting down from 0 to N was safe), but now they fail because they can't start a new task on the destination store. And thanks to #2500, we have (multiple) unbounded retry loops that will prevent the queue from ever shutting down cleanly (including a hidden unbounded loop in DistSender: even if you limit MaxRetries, we call retry.Reset on certain errors). With all of these changed to use a finite number of retries, these tests look fine but others have problems if the number is too low.

So it looks like to merge this PR we either have to solve the retry option quagmire, or hack in a special-case DrainQueues method to ensure that all this background processing has stopped before we try to shut down. I'm going to explore the latter option first, since even though #2500 has been climbing in my priority queue I'd rather not tie it to this PR.
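
To make the "hidden unbounded loop" concrete, here is a self-contained sketch of the pattern; the retrier type is a toy stand-in, not the real util/retry package.

```go
package demo

// retrier is a toy stand-in for a bounded retry helper.
type retrier struct{ attempts, maxRetries int }

func (r *retrier) next() bool { r.attempts++; return r.attempts <= r.maxRetries }
func (r *retrier) reset()     { r.attempts = 0 }

// retryLoop looks bounded, but if attempt keeps asking for a reset (as
// happens for certain errors in the DistSender path), the counter never
// reaches maxRetries and the loop never exits.
func retryLoop(attempt func() (again, resetCounter bool)) {
	r := retrier{maxRetries: 3}
	for r.next() {
		again, resetCounter := attempt()
		if !again {
			return
		}
		if resetCounter {
			r.reset()
		}
	}
}
```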

@tbg
Member

tbg commented Nov 3, 2015

sounds good to me.


@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

Up next: TestRangeDescriptorSnapshotRace. Looks like asynchronous intent resolutions again, racing with splits this time. The client puts together a batch of ResolveIntentRequests, but by now they're on different ranges. This comes back as an OpRequiresTransactionError, which causes TxnCoordSender to retry the request in a transaction. This subjects it to another one of those infinite retry loops.

I think we just need a flag for non-transactional requests (ResolveIntent and PushTxn), so that DistSender will never return OpRequiresTransactionError if it has to truncate a batch of them.
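
A rough sketch of that flag, with hypothetical names (nonTransactional, sendChunks); the real change would live in DistSender's batch-truncation path.

```go
package demo

import "errors"

var errOpRequiresTransaction = errors.New("multi-range batch requires a transaction")

// request is a minimal stand-in for a KV request.
type request struct {
	nonTransactional bool // e.g. ResolveIntent, PushTxn
}

// maybeSend splits a multi-range batch into per-range chunks. A batch made up
// entirely of non-transactional requests is sent chunk by chunk rather than
// bounced back with OpRequiresTransactionError (which would cause
// TxnCoordSender to wrap and retry it in a transaction).
func maybeSend(batch []request, spansMultipleRanges bool, sendChunks func([]request) error) error {
	if !spansMultipleRanges {
		return sendChunks(batch)
	}
	for _, r := range batch {
		if !r.nonTransactional {
			return errOpRequiresTransaction
		}
	}
	return sendChunks(batch)
}
```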

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

OK, finally got a green build. PTAL (and I'm rebuilding now to make sure, just in case it has turned red by the time you see this).

@tbg
Member

tbg commented Nov 3, 2015

Could you briefly outline the infinite loop the ResolveIntent transaction gets stuck in? Your last change looks useful either way, but if such a transaction can get stuck then I see no reason why that wouldn't happen with more "honest" transactions in practice.

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

There are about 5 different retry loops in play here, each with its own idea of which errors are retryable. The transaction layer appears to be the most persistent in its retries, but I was having trouble mapping out exactly what was happening.

goroutine 6150 [select]:
github.com/cockroachdb/cockroach/util/retry.(*Retry).Next(0xc82037bfc0, 0xc820356260)
    /go/src/github.com/cockroachdb/cockroach/util/retry/retry.go:109 +0x183
github.com/cockroachdb/cockroach/kv.(*DistSender).sendChunk(0xc820194310, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:496 +0x3f2
github.com/cockroachdb/cockroach/kv.(*DistSender).(github.com/cockroachdb/cockroach/kv.sendChunk)-fm(0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, 0xc820302460, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:469 +0x54
github.com/cockroachdb/cockroach/kv.(*chunkingSender).Send(0xc820028280, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/batch.go:206 +0x2aa
github.com/cockroachdb/cockroach/kv.(*DistSender).Send(0xc820194310, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0xe6f, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:469 +0x21d
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).Send(0xc820558280, 0x7f2d4ebe9e40, 0xc820211b30, 0x0, 0x0, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:397 +0xe2f
github.com/cockroachdb/cockroach/client.(*txnSender).Send(0xc820478240, 0x7f2d4ebe9e00, 0xc8200136c0, 0x0, 0x0, 0x14131634c419c1a5, 0x4b14d9a117fa3476, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:50 +0x9b
github.com/cockroachdb/cockroach/client.(*DB).send(0xc820478240, 0xc8204b6100, 0x4, 0x8, 0xfb262, 0x165951a)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:482 +0x1b6
github.com/cockroachdb/cockroach/client.(*Txn).send(0xc820478240, 0xc8204b6100, 0x4, 0x8, 0x7e6f69, 0x80)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:454 +0xa32
github.com/cockroachdb/cockroach/client.(*Txn).(github.com/cockroachdb/cockroach/client.send)-fm(0xc8204b6100, 0x5, 0x8, 0x7d9465, 0x7e3371)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:295 +0x3e
github.com/cockroachdb/cockroach/client.sendAndFill(0xc82037cff8, 0xc82055c2c0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:418 +0x59
github.com/cockroachdb/cockroach/client.(*Txn).RunWithResponse(0xc820478240, 0xc82055c2c0, 0xc8207a10e0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:295 +0x8e
github.com/cockroachdb/cockroach/client.(*Txn).CommitInBatchWithResponse(0xc820478240, 0xc82055c2c0, 0x1, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:332 +0x392
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).resendWithTxn.func1(0xc820478240, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:803 +0x25e
github.com/cockroachdb/cockroach/client.(*Txn).exec(0xc820478240, 0xc82037d6e8, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/txn.go:372 +0x2c1
github.com/cockroachdb/cockroach/client.(*DB).Txn(0xc82037d700, 0xc82037d6e8, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:465 +0x16d
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).resendWithTxn(0xc820558280, 0x0, 0x0, 0x14131634c38f4e83, 0x632f31cc7f09518c, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:795 +0x27d
github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).Send(0xc820558280, 0x7f2d4ebe9e40, 0xc8204d1b90, 0x0, 0x0, 0x14131634c38f4e83, 0x632f31cc7f09518c, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:400 +0xeb6
github.com/cockroachdb/cockroach/client.(*DB).send(0xc82040a140, 0xc820573ec0, 0x4, 0x4, 0x7f2d4ebe7c38, 0xc8200ff320)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:482 +0x1b6
github.com/cockroachdb/cockroach/client.(*DB).(github.com/cockroachdb/cockroach/client.send)-fm(0xc820573ec0, 0x4, 0x4, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:452 +0x3e
github.com/cockroachdb/cockroach/client.sendAndFill(0xc82037de70, 0xc82055cb00, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:418 +0x59
github.com/cockroachdb/cockroach/client.(*DB).RunWithResponse(0xc82040a140, 0xc82055cb00, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:452 +0x8e
github.com/cockroachdb/cockroach/client.(*DB).Run(0xc82040a140, 0xc82055cb00, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/client/db.go:443 +0x37
github.com/cockroachdb/cockroach/storage.(*Replica).resolveIntents.func2()
    /go/src/github.com/cockroachdb/cockroach/storage/replica.go:1527 +0x35
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask.func1(0xc82056a8a0, 0xc82020ae40, 0x17, 0xc82020ae20)
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:130 +0x58
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunAsyncTask
    /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:131 +0x221

@tbg
Member

tbg commented Nov 3, 2015

I can take a look if you'd like that. Just let me know what test invocation you've experienced the issues with. We should jump at any chance to understand and remove deadlocks like that.

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

Sure. My test invocation was make testrace PKG=./storage TESTS=TestRangeDescriptorSnapshotRace TESTFLAGS='-count 10'. I can't be sure that the transaction retry loop was causing the issues; it's possible that the resendWithTxn step just added enough of a delay that the problem was more likely to occur (it's already timing-related, and it's much more common in race mode than not).

@bdarnell
Contributor Author

bdarnell commented Nov 3, 2015

This is all done except for the rebase now.

@tbg
Member

tbg commented Nov 3, 2015

LGTM. I didn't re-review the code but saw that you changed the SendError to non-retryable.

// Ensure that any remaining commands are not left hanging.
for _, g := range s.groups {
	for _, p := range g.pending {
		p.ch <- util.Errorf("shutting down")
Member

some other places have a nil check on p.ch here. Not this one?

Contributor Author

Added.
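
For completeness, the guarded version in sketch form, mirroring the snippet above with the nil check added:

```go
// Ensure that any remaining commands are not left hanging, skipping
// entries that have no reply channel.
for _, g := range s.groups {
	for _, p := range g.pending {
		if p.ch != nil {
			p.ch <- util.Errorf("shutting down")
		}
	}
}
```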

ResolveIntent requests are often sent in batches even when they span
ranges, and currently the DistSender will try to wrap those in
transactions, even though these requests do not benefit from being
in a transaction.

The immediate motivation for this change is that transactions are
subject to infinite retries (unlike non-transactional requests),
so the transaction-wrapped version would time out at shutdown.

When a range is split, followers of that range may receive a snapshot
from the right-hand side of the split before they have caught up and
processed the left-hand side where the split originated. This results in
a "range already exists" panic.

The solution is to silently drop any snapshots which would cause a
conflict. They will be retried and will succeed once the left-hand range
has performed its split.

Fixes cockroachdb#1644.

Also check destination stopper in multiTestContext.rpcSend
bdarnell added a commit that referenced this pull request Nov 4, 2015
Fix a race between splits and snapshots.
bdarnell merged commit fbccb36 into cockroachdb:master on Nov 4, 2015
bdarnell deleted the fix-split-snapshot-race branch on November 4, 2015 03:29
pav-kv pushed a commit to pav-kv/cockroach that referenced this pull request Mar 5, 2024
pkg/testutil: ForceGosched -> WaitSchedule