Fix a race between splits and snapshots. #2944
Conversation
Cc @es-chow
ah, great that you were able to put this one together that quickly. I'll take a look in the morning.
mtc.restartStore(i)
}

leftKey := roachpb.Key("a")
nit: just return {left,right}Key from setupSplitSnapshotRace.
OK
LGTM mod minor comments
if m.stores[nodeIndex] == nil {
	pErr = &roachpb.Error{}
	pErr.SetGoError(rpc.NewSendError("store is stopped", true))
	continue
is this right? continue-ing here seems to ignore this error entirely.
Yes, this is intended. pErr exists outside the loop. We return permanent errors, but for retryable errors we continue through the loop until we've exhausted all the stores (at which point we return the last such error).
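To make that flow concrete, here is a minimal, self-contained sketch of the pattern; the types below are illustrative stand-ins, not the actual multiTestContext or roachpb code:

package main

import "errors"

// store stands in for a node's store; stopped stores produce a
// retryable "store is stopped" error in this sketch.
type store struct{ stopped bool }

var errStopped = errors.New("store is stopped") // treated as retryable

func (s *store) send() error {
	if s == nil || s.stopped {
		return errStopped
	}
	return nil
}

// sendToAnyStore mirrors the control flow described above: a retryable
// error sets pErr and continues to the next store, a permanent error
// returns immediately, and once all stores are exhausted the last
// retryable error is returned.
func sendToAnyStore(stores []*store) error {
	var pErr error
	for _, s := range stores {
		err := s.send()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errStopped) {
			return err // permanent: give up immediately
		}
		pErr = err // retryable: remember it and try the next store
	}
	return pErr
}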
Force-pushed from 474e7de to 752e6e1.
I think the first test failure in this branch was a fluke (probably related to the recurrence of #2880), but the second one is real and reproducible. The newly-added tests fail about a third of the time, probably due to the way errors are handled in the testing stack. I'm still debugging.
LGTM. Although they currently fail, the tests you created look very good.
return false
}

if s.replicasByKey.Has(rangeBTreeKey(parsedSnap.RangeDescriptor.EndKey)) {
this guards only against an exact match - agreed that that's enough to avoid the split race, but shouldn't we generally discard snapshots that overlap with existing ranges in any way?
Yes. This is simply following the example of Store.addReplicaInternal; both should be updated to check for any overlap instead of an exact match on one key.
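To illustrate the difference under discussion, a rough self-contained sketch of a full overlap check, assuming half-open [start, end) key ranges (the helper names are illustrative, not the actual Store code):

package main

import "bytes"

type keyRange struct{ start, end []byte } // half-open [start, end)

// overlaps reports whether two half-open key ranges intersect.
func overlaps(a, b keyRange) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// conflictsWithExisting rejects a snapshot whose range intersects any
// existing replica's range, rather than only matching one key exactly.
func conflictsWithExisting(existing []keyRange, snap keyRange) bool {
	for _, r := range existing {
		if overlaps(r, snap) {
			return true
		}
	}
	return false
}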
Is that tracked somewhere? Feels like a TODO can't hurt.
Updated with new commits. The newly-added tests are passing reliably for me now, although some of the existing tests have become flaky in race mode. The last two commits are strictly unnecessary: they are things I added while debugging that seem like good ideas to keep, but they weren't the actual problem.
LGTM so far (left 1 comment on a commit).
// lifetime. If the command takes that long to commit we are probably
// partitioned away and we can conclude that another node has taken
// the lease.
deadline := time.After(duration)
could you use a context.WithDeadline(context.Background(), duration) here? We should use context.Context more for that purpose and this is a nice precedent to set.
Using context.WithDeadline is only better than time.After if we give it a non-Background context (i.e. one that may already have a request deadline). And while we want to do that in many places, I don't think this is one of them: thanks to the lock in redirectOnOrAcquireLeaderLease, one call to requestLeaderLease may serve many clients. If we bail out early because the first client expired or cancelled, other clients blocking on the lock could create redundant lease requests. We should wait here as long as the lease may be useful, regardless of the context.
I agree that we shouldn't be using the client's context here, but using a freshly deadlined context.Context is strictly better in that it would be passed into proposeRaftCommand above, which allows all moving parts up the callstack to cancel the request early.
Ah, that makes sense. It would be nice to be able to handle a cancelled Context at the lower level so a partitioned node doesn't try to replay all its queued lease requests once it gets reconnected. I'll add a deadline to r.context() before the call to r.proposeRaftCommand.
LGTM with minor nits. Can you rebase into commits which pass tests individually before you merge?
The current test failures are due to the replica scanner, which is a source of background cross-store operations. Previously, these operations would succeed even during shutdown (I think because in most of our tests prior to this one, leadership was almost always on node 0, so shutting down from 0 to N was safe), but now they fail because they can't start a new task on the destination store. And thanks to #2500, we have (multiple) unbounded retry loops that will prevent the queue from ever shutting down cleanly (including a hidden unbounded loop in DistSender: even if you limit MaxRetries, we call retry.Reset on certain errors). With all of these changed to use a finite number of retries, these tests look fine, but others have problems if the number is too low. So it looks like, to merge this PR, we either have to solve the retry option quagmire, or hack in a special-case DrainQueues method to ensure that all this background processing has stopped before we try to shut down. I'm going to explore the latter option first, since even though #2500 has been climbing in my priority queue, I'd rather not tie it to this PR.
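The bounded alternative mentioned above looks roughly like this (a sketch with a simplified quiesce channel, not CockroachDB's actual retry or stopper packages):

package main

import (
	"errors"
	"time"
)

// retryBounded attempts op at most maxRetries times, giving up early if
// the quiesce channel closes, so background queues can shut down cleanly
// instead of spinning forever on retryable errors.
func retryBounded(op func() error, maxRetries int, quiesce <-chan struct{}) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = op(); err == nil {
			return nil
		}
		select {
		case <-quiesce:
			return errors.New("shutting down")
		case <-time.After(10 * time.Millisecond): // fixed backoff for brevity
		}
	}
	return err
}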
sounds good to me.
Up next: TestRangeDescriptorSnapshotRace. Looks like asynchronous intent resolutions again, racing with splits this time. The client puts together a batch of ResolveIntentRequests, but by now they're on different ranges. This comes back as an OpRequiresTransactionError, which causes TxnCoordSender to retry the request in a transaction. This subjects it to another one of those infinite retry loops. I think we just need a flag for non-transactional requests (ResolveIntent and PushTxn), so that DistSender will never return OpRequiresTransactionError if it has to truncate a batch of them.
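A sketch of the proposed flag; the names here are hypothetical stand-ins for the real roachpb request types:

package main

type method int

const (
	resolveIntent method = iota
	pushTxn
	put
)

// isNonTransactional marks requests that gain nothing from running in a
// transaction.
func isNonTransactional(m method) bool {
	return m == resolveIntent || m == pushTxn
}

// requiresTxn sketches the sender-side decision: a batch that must be
// truncated at a range boundary needs a transaction only if it contains
// some transactional request; a batch of ResolveIntents can simply be
// split.
func requiresTxn(batch []method, spansRanges bool) bool {
	if !spansRanges {
		return false
	}
	for _, m := range batch {
		if !isNonTransactional(m) {
			return true
		}
	}
	return false
}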
OK, finally got a green build. PTAL (and I'm rebuilding now to make sure, just in case it has turned red by the time you see this). |
Could you briefly outline the infinite loop the …
There are about 5 different retry loops in play here, each with their own ideas about which errors are retryable. The transaction layer appears to be most persistent in its retries, but I was having trouble mapping out exactly what was happening.
I can take a look if you'd like that. Just let me know what test invocation you've experienced the issues with. We should jump at any chance to understand and remove deadlocks like that.
Sure. My test invocation was …
This is all done except for the rebase now.
Force-pushed from 15c4e99 to 37cb585.
LGTM. I didn't re-review the code but saw that you changed the …
// Ensure that any remaining commands are not left hanging.
for _, g := range s.groups {
	for _, p := range g.pending {
		p.ch <- util.Errorf("shutting down")
some other places have a nil check on p.ch here. Not this one?
Added.
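For reference, the guarded shutdown path looks roughly like this self-contained sketch (reply channels are assumed buffered here, so the send cannot block):

package main

import "errors"

type pending struct{ ch chan error } // ch may be nil if no caller waits

type group struct{ pending []*pending }

// drain ensures remaining commands are not left hanging at shutdown,
// skipping pending entries whose reply channel was never set.
func drain(groups []*group) {
	for _, g := range groups {
		for _, p := range g.pending {
			if p.ch != nil {
				p.ch <- errors.New("shutting down")
			}
		}
	}
}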
Force-pushed from 37cb585 to 3988b18.
ResolveIntent requests are often sent in batches even when they span ranges, and currently the DistSender will try to wrap those in transactions, even though these requests do not benefit from being in a transaction. The immediate motivation for this change is that transactions are subject to infinite retries (unlike non-transactional requests), so the transaction-wrapped version would time out at shutdown.
When a range is split, followers of that range may receive a snapshot from the right-hand side of the split before they have caught up and processed the left-hand side where the split originated. This results in a "range already exists" panic. The solution is to silently drop any snapshots which would cause a conflict. They will be retried and will succeed once the left-hand range has performed its split. Fixes cockroachdb#1644.
Also check destination stopper in multiTestContext.rpcSend.
Force-pushed from 3988b18 to 0031538.
Fix a race between splits and snapshots.
pkg/testutil: ForceGosched -> WaitSchedule
When a range is split, followers of that range may receive a snapshot
from the right-hand side of the split before they have caught up and
processed the left-hand side where the split originated. This results in
a "range already exists" panic.
The solution is to silently drop any snapshots which would cause a
conflict. They will be retried and will succeed once the left-hand range
has performed its split.
Fixes #1644.
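In spirit, the guard this change adds looks like the following self-contained sketch (simplified: the real check consults the store's replicasByKey btree):

package main

import "bytes"

type store struct{ replicaEndKeys [][]byte }

func (s *store) hasReplicaEndingAt(end []byte) bool {
	for _, k := range s.replicaEndKeys {
		if bytes.Equal(k, end) {
			return true
		}
	}
	return false
}

// canApplySnapshot drops an incoming snapshot that conflicts with a range
// that already exists; raft will retry, and the retry succeeds once the
// left-hand range has performed its split.
func (s *store) canApplySnapshot(snapEndKey []byte) bool {
	return !s.hasReplicaEndingAt(snapEndKey)
}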