
storage: Improve handling of lease index retries #35261

Merged
merged 10 commits into cockroachdb:master from the lease-span-test branch on Mar 19, 2019

Conversation

bdarnell
Contributor

@bdarnell bdarnell commented Feb 28, 2019

Pipelined writes wait for commands to be evaluated, then detach before
they apply. Retryable errors generated at evaluation time are handled
correctly (by DistSender or above) even for pipelined writes, but
pipelined writes have lost the ability to handle apply-time retry
conditions in the executeWriteBatch loop (there used to be more of
these, but an illegal lease index is now the only condition retried
at this level). To remedy this, this commit moves reproposals due to
illegal lease indexes below raft (but only on the proposing node).

In practice, this can happen to writes that race with a raft
leadership transfer. This is observable immediately after table
creation, since table creation performs a split, and then may perform
a lease transfer to balance load (and if the lease is transferred, raft
leadership is transferred afterward).

Specifically,

  1. Lease is transferred from store s1 to s2. s1 is still raft leader.
  2. Write w1 evaluates on store s2 and is assigned lease index i1. s2
    forwards the proposal to s1.
  3. s1 initiates raft leader transfer to s2. This puts it into a
    temporarily leaderless state so it drops the forwarded proposal.
  4. s2 is elected raft leader, completing the transfer.
  5. A second write w2 evaluates on s2, is assigned lease index i2, and
    goes right in the raft log since s2 is both leaseholder and leader.
  6. s2 refreshes proposals as a side effect of becoming leader, and
    writes w1 to the log with lease index i1.
  7. s2 applies w2, then w1. w1 fails because of the out of order lease
    index.

If w1 was pipelined, the client stops listening after step 2, and
won't learn of the failure until it tries to commit. At this point the
commit returns an ASYNC_WRITE_FAILURE retry to the client.

Note that in some cases, NotLeaseHolderError can be generated at apply
time. These errors cannot be retried (since the proposer has lost its
lease) so they will unavoidably result in an ASYNC_WRITE_FAILURE error
to the client. However, this is uncommon - most NotLeaseHolderErrors
are generated at evaluation time, which is compatible with pipelined
writes.
Fixes #28876

The fourth commit is the important part of this change. The fifth is optional; it's a simplification but it changes an under-tested error path. Assuming we like the fifth commit, I'd like to add a sixth that rips out proposalReevaluationReason and the executeWriteBatch/tryExecuteWriteBatch loop altogether.
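
To make the failure mode and the fix concrete, here is a minimal, self-contained toy model of the idea (the `proposal` and `replica` types and all field names below are invented for illustration; they are not the real storage package code):

```go
// Toy model of the change in this PR: an apply-time "illegal lease index"
// failure is retried on the proposing node, below raft, instead of being
// surfaced to an executeWriteBatch-style retry loop that a pipelined
// (already detached) client would never hear from.
package main

import "fmt"

type proposal struct {
	id            string
	maxLeaseIndex uint64 // lease index assigned when the command was proposed
}

type replica struct {
	leaseAppliedIndex uint64 // highest lease index that has applied so far
	nextLeaseIndex    uint64 // counter used when (re)assigning lease indexes
}

// apply returns false if the command's lease index is stale, e.g. because
// the command was reordered behind a later proposal during a raft
// leadership transfer.
func (r *replica) apply(p *proposal) bool {
	if p.maxLeaseIndex <= r.leaseAppliedIndex {
		return false // illegal lease applied index
	}
	r.leaseAppliedIndex = p.maxLeaseIndex
	return true
}

// processCommand mimics the proposer-side handling: if we proposed this
// command locally and it failed only because of the lease index, repropose
// it with a freshly assigned index.
func (r *replica) processCommand(p *proposal, proposedLocally bool) {
	if r.apply(p) {
		fmt.Printf("%s applied at lease index %d\n", p.id, p.maxLeaseIndex)
		return
	}
	if proposedLocally {
		r.nextLeaseIndex++
		p.maxLeaseIndex = r.nextLeaseIndex
		fmt.Printf("%s hit an illegal lease index; reproposing at %d\n", p.id, p.maxLeaseIndex)
		r.processCommand(p, true) // in the real code this re-enters the raft pipeline
		return
	}
	fmt.Printf("%s rejected\n", p.id)
}

func main() {
	r := &replica{nextLeaseIndex: 2}
	w1 := &proposal{id: "w1", maxLeaseIndex: 1} // assigned i1 before the leadership transfer
	w2 := &proposal{id: "w2", maxLeaseIndex: 2} // assigned i2 after it
	r.processCommand(w2, true)                  // w2 applies first...
	r.processCommand(w1, true)                  // ...so w1's index is stale and gets reproposed
}
```

Running it shows w2 applying first and w1 being reproposed with a fresh lease index instead of returning an error nobody is waiting for.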

@bdarnell bdarnell requested review from tbg, nvanbenschoten and a team February 28, 2019 04:21
@cockroach-teamcity
Member

This change is Reviewable

Member

@tbg tbg left a comment

Could you tell the real story in the commit and PR description? I think it's something like this: #28876 (comment), but I wasn't able to piece it together from what's discussed there. But it sounds like there are both NotLeaseholderErrors and illegal lease indexes in the mix, so there are probably two separate stories to tell, both connected to lease transfers. I think that once we initiate a lease transfer, we're not accepting new writes (they bounce immediately with a not-leaseholder error?), so I think you're saying that a lease transfer is initiated while a write is pending. What I think this means is that the lease transfer should enter the log after the pending write, and should not be able to interfere with it per se. But somehow it does due to the Raft leadership transfer, but how exactly? Are they getting reordered in the logs? Isn't that still counter-intuitive?

I like the fifth commit except that I don't understand why the fourth commit is correct at all. If the leaseholder does change, I don't see that we can just pretend that a command eval'ed by the old leaseholder can simply be reproposed under a new lease index.

Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 3 of 3 files at r4, 1 of 1 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)


pkg/storage/replica_raft.go, line 2067 at r4 (raw file):

			// If we failed to apply at the right lease index, try again with a new one.
			r.mu.Lock()
			lease := *r.mu.state.Lease

Is this really correct? If this node was the lease holder at the time of the proposal but isn't any more, is there any chance we'd give the command a different lease now even though it didn't eval on that leaseholder, and just so happen to assign a LAI that actually lets the reproposal go through? That wouldn't be correct since the new leaseholder's ts cache isn't protecting us. ISTM that we can only repropose if the lease didn't change (or rather, we can only repropose with the same lease, and have to make sure that there won't be an infinite loop of that, i.e. NotLeaseHolder takes precedence over illegal lease index).
If this still does work, I'd like to see a lot more commentary than there is now.

By the way, if this is about lease index errors only (i.e. you're not expecting the lease to change) there's probably a bug in how they are assigned, but the code here suggests that it's


pkg/storage/replica_raft.go, line 2073 at r4 (raw file):

				log.VEventf(ctx, 2, "retry: proposalIllegalLeaseIndex")
				// Use a goroutine so this doesn't count as "marshaling a proto downstream of raft".
				go proposeAfterIllegalLeaseIndex(ctx, r, proposal, *r.mu.state.Lease)

Were you planning on fixing this up? Morally you'd want to just do this synchronously, right? I could settle for running this as a task if there's no good way to work around the linter.


pkg/storage/replica_test.go, line 7420 at r4 (raw file):

			ba,
		)
		if retry != proposalNoReevaluation {

The test should verify the internal reproposal.

Contributor Author

@bdarnell bdarnell left a comment

Pipelined writes wait for commands to be evaluated, then detach before they apply. Retryable errors generated at evaluation time are handled correctly (by DistSender or above) even for pipelined writes, but pipelined writes have lost the ability to handle apply-time retry conditions in the executeWriteBatch loop (there used to be more of these, but an illegal lease index is now the only condition retried at this level). NotLeaseHolderError is usually generated at evaluation time so it's not a problem for pipelined writes (in some cases NotLeaseHolderError can be generated at apply time; when this happens we still get an async write failure and I don't see a way around that).

After a table creation, we do a split, and then we (maybe) transfer the lease and then transfer raft leadership. The problem here is when a write occurs between the lease transfer and the end of the raft leader transfer. If it evaluates on the old leaseholder, it gets that error at evaluation time and is redirected by the DistSender. When it evaluates on the new leaseholder, it must be forwarded to the raft leader (almost certainly the old leaseholder). If it arrives at the old leaseholder at the same time that raft leadership is transferred to the new leaseholder, it's likely to get dropped or reordered. This will produce a lease index error when applied, but the new lease is still valid and it can get retried successfully (lease index errors in other situations may result in NotLeaseHolderErrors when retried and ultimately async write failures).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
Yeah, this part still needs a little work. It should validate not just that we still own the lease but that it's the same lease (check the sequence number). NotLeaseHolder does take precedence over illegal lease index, so there's not an infinite loop.
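
Roughly, the check described above would take the shape below (a sketch only; `proposeAgainWithNewLeaseIndex` is a hypothetical helper, and the field names mirror the r12 snippet quoted later in this review):

```go
lease := *r.mu.state.Lease
if lease.OwnedBy(r.StoreID()) && lease.Sequence == proposal.command.ProposerLeaseSequence {
	// Same lease (or an extension of it) as the one the command evaluated
	// under, so the proposer's timestamp cache still protects us and it is
	// safe to retry with a new MaxLeaseIndex.
	proposeAgainWithNewLeaseIndex(ctx, proposal) // hypothetical helper
}
// Otherwise the lease changed; fall through and let NotLeaseHolderError
// take precedence over the illegal lease index error.
```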

By the way, if this is about lease index errors only (i.e. you're not expecting the lease to change) there's probably a bug in how they are assigned, but the code here suggests that it's

Finish that sentence?


pkg/storage/replica_raft.go, line 2073 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Were you planning on fixing this up? Morally you'd want to just do this synchronously, right? I could settle for running this as a task if there's no good way to work around the linter.

I can't decide whether the async goroutine is the right thing to do (we don't want to block the raft-processing goroutine longer than necessary) or just a hack to placate the linter. Maybe we should just add a t.Helper() style marker to tell the linter to ignore this.


pkg/storage/replica_raft.go, line 2064 at r5 (raw file):

	_, pErr := r.proposeLocked(ctx, proposal, lease)
	if pErr != nil {
		panic(pErr)

This part will also change (to call proposal.finishApplication instead of panicking).
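
Sketch of that follow-up change (the finishApplication/proposalResult signatures here are assumed for illustration, not quoted from the diff):

```go
if _, pErr := r.proposeLocked(ctx, proposal, lease); pErr != nil {
	// Report the error to the waiting proposer instead of crashing the node.
	proposal.finishApplication(proposalResult{Err: pErr}) // signatures assumed
}
```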


pkg/storage/replica_test.go, line 7420 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The test should verify the internal reproposal.

That's what it does, with the "expected %d proposals" check below. Formerly, you'd get retries from executeWriteBatch but not from tryExecuteWriteBatch; now you get the retry either way.

Member

@tbg tbg left a comment

Ah, so the sequence of events is:

  1. lease transfer from s1 to s2 initiated
  2. command evals under s2, under the new lease
  3. command gets proposed (forwarded to s1, which is still Raft leader), in particular it probably has a bad MaxLeaseIndex that doesn't take into account inflight commands at s1
  4. leadership transfers to s2 (why is this important?)
  5. command catches illegal lease applied index (but has the right lease)

I think you're saying that the leadership transfer can drop the command, but wouldn't that just lead to a reproposal that would succeed? I don't know how it would reorder commands; after all, the inflight commands are either already in the new leader's log (in the right order) or they get reproposed (and we sort things so that the order matches). I'm probably still missing some detail.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)


pkg/storage/replica_raft.go, line 2067 at r4 (raw file):

Finish that sentence?

Sorry. I think I was asking the same question as in the PR-wide comment thread about why we see a reordering.

Yeah, this part still needs a little work. It should validate not just that we still own the lease but that it's the same lease (check the sequence number). NotLeaseHolder does take precedence over illegal lease index, so there's not an infinite loop.

Why doesn't it blindly repropose (btw we need a word for that - reproposing means not touching MaxLeaseIndex, but here we do -- reinsert)? If the lease doesn't match, it won't apply. This is less efficient, but we're not expecting to hit this case frequently anyway (?)
All we seem to need to do is check the sequence number, though, so maybe this is easy enough.


pkg/storage/replica_raft.go, line 2073 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I can't decide whether the async goroutine is the right thing to do (we don't want to block the raft-processing goroutine longer than necessary) or just a hack to placate the linter. Maybe we should just add a t.Helper() style marker to tell the linter to ignore this.

calling proposeLocked outside of the loop opens the door to further reorderings, though. Not saying that these would happen, but if we expect to never hit this path in the steady state maybe it's fine to do it synchronously. Raft leadership transfers are much more expensive than doing this on the Raft goroutine.


pkg/storage/replica_test.go, line 7420 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

That's what it does, with the "expected %d proposals" check below. Formerly, you'd get retries from executeWriteBatch but not from tryExecuteWriteBatch; now you get the retry either way.

Thanks, read that the wrong way.

Contributor Author

@bdarnell bdarnell left a comment

The leadership transfer is important because during a transfer (a rigged raft election) proposals get dropped rather than forwarded. The two writes in our transaction could get retried as follows:

  1. w1 evaluates on s2, gets lease index i1, forwards to raft leader s1
  2. s1 initiates leader transfer to s2. this puts it into a temporarily leaderless state so it drops the forwarded proposal
  3. s2 is elected raft leader
  4. w2 evaluates on s2, gets lease index i2, goes right in the log
  5. s2 refreshes proposals as a side effect of becoming leader, writes w1 to log with lease index i1
  6. s2 applies w2, then w1. w1 fails because of the out of order lease index

(note that I haven't actually verified that this is what is going on, but it's consistent with what I've observed).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)

@tbg
Member

tbg commented Feb 28, 2019

Makes sense. Thanks for the example and please work it into the commit messages once you're wrapping up.

Member

@nvanbenschoten nvanbenschoten left a comment

How close do you think this gets us to removing proposalReevaluationReason and its handling in Replica.executeWriteBatch entirely?

Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 3 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/storage/replica_raft.go, line 94 at r1 (raw file):

}

// evalAndPropose prepares the necessary pending command struct and initializes a

I like the rename. Let's mention that this method includes evaluation in the comment as well.


pkg/storage/replica_raft.go, line 2067 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Finish that sentence?

Sorry. I think I was asking the same question as in the PR-wide comment thread about why we see a reordering.

Yeah, this part still needs a little work. It should validate not just that we still own the lease but that it's the same lease (check the sequence number). NotLeaseHolder does take precedence over illegal lease index, so there's not an infinite loop.

Why doesn't it blindly repropose (btw we need a word for that - reproposing means not touching MaxLeaseIndex, but here we do -- reinsert)? If the lease doesn't match, it won't apply. This is less efficient, but we're not expecting to hit this case frequently anyway (?)
All we seem to need to do is check the sequence number, though, so maybe this is easy enough.

Yeah, this should be comparing lease sequence numbers so that we know that the new lease is either identical to the proposer's lease or an extension of the proposer's lease.

Do we also need to short-circuit this logic if raftCmd.DeprecatedProposerLease != nil like we do in checkForcedErrLocked? We don't have any guarantee that this entry was proposed with a lease sequence number.


pkg/storage/replica_raft.go, line 2073 at r4 (raw file):

				log.VEventf(ctx, 2, "retry: proposalIllegalLeaseIndex")
				// Use a goroutine so this doesn't count as "marshaling a proto downstream of raft".
				go proposeAfterIllegalLeaseIndex(ctx, r, proposal, *r.mu.state.Lease)

The "marshaling a proto downstream of raft" lint is interesting. It's worth noting here that the entry will never be reproposed with a different proto version because only the leaseholder who originally proposed the entry will ever re-marshall it and repropose it again. That invariant helps alleviate some of the compatibility/migration concerns I have.

Contributor Author

@bdarnell bdarnell left a comment

How close do you think this gets us to removing proposalReevaluationReason and its handling in Replica.executeWriteBatch entirely?

Added two more commits. tryExecuteWriteBatch is gone. The type and constants are still there, but the handling is localized in processRaftCommand.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/replica_raft.go, line 2067 at r4 (raw file):

Why doesn't it blindly repropose

r.mu.state.Lease is the current lease, which might be owned by someone else. If it is, we might hit upon a lease index that they assigned by chance, so it could apply. I'm pretty sure we currently rely on the proposer to guarantee that it only proposes under leases it owns.

I think if the code were factored a little differently, though, we could blindly repropose with the same lease - exactly the same raft command but a new max lease index. But with the way the max lease index is currently assigned, we need a lease proto, so it seems easier to check if the lease still matches.

Do we also need to short-circuit this logic if raftCmd.DeprecatedProposerLease != nil like we do in checkForcedErrLocked? We don't have any guarantee that this entry was proposed with a lease sequence number.

Not necessary here - we don't reach this point for commands read from the on-disk raft log, it's only for commands that were proposed by the current process, so there won't be any old deprecated fields in use.


pkg/storage/replica_raft.go, line 2073 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

calling proposeLocked outside of the loop opens the door to further reorderings, though. Not saying that these would happen, but if we expect to never hit this path in the steady state maybe it's fine to do it synchronously. Raft leadership transfers are much more expensive than doing this on the Raft goroutine.

Yeah, I'll inline it and whitelist it for the linter.


pkg/storage/replica_raft.go, line 2073 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The "marshaling a proto downstream of raft" lint is interesting. It's worth noting here that the entry will never be reproposed with a different proto version because only the leaseholder who originally proposed the entry will ever re-marshall it and repropose it again. That invariant helps alleviate some of the compatibility/migration concerns I have.

The "downstream of raft" linter isn't really a concern here. It's there because we don't want to marshal the same proto differently across the different replicas, but here we're already limiting ourselves to the node that proposed the command.


pkg/storage/replica_test.go, line 7420 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Thanks, read that the wrong way.

Done.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 1 of 3 files at r9, 1 of 2 files at r10, 3 of 5 files at r11, 3 of 3 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/storage/replica_raft.go, line 2073 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The "downstream of raft" linter isn't really a concern here. It's there because we don't want to marshal the same proto differently across the different replicas, but here we're already limiting ourselves to the node that proposed the command.

Right, that's what I'm saying. It will never be reproposed using a different proto definition so we don't have to worry about losing information. It's worth noting that somewhere.


pkg/storage/replica_raft.go, line 2060 at r12 (raw file):

	defer r.mu.Unlock()
	lease := *r.mu.state.Lease
	if lease.OwnedBy(r.StoreID()) && lease.Sequence == proposal.command.ProposerLeaseSequence {

Is it possible for !lease.OwnedBy(r.StoreID())?

Member

@tbg tbg left a comment

Reviewed 3 of 3 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, 3 of 3 files at r9, 2 of 2 files at r10, 5 of 5 files at r11, 3 of 3 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)


pkg/storage/replica_raft.go, line 2067 at r4 (raw file):

Why doesn't it blindly repropose

Yeah, I wasn't saying "blindly stuff a new lease in" but "why change anything about the proposer lease". Your second sentence explains why: the code wants a lease (I assume for not much else than to pick the seqno). But I must still be missing something because it seems that the lease passed in is only ever accessed in insertProposalLocked, which cares about nothing but the seqno, so we could just change the signatures to take only that.


pkg/storage/replica_raft.go, line 2060 at r12 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it possible for !lease.OwnedBy(r.StoreID())?

You're saying that this wouldn't happen because in this case the proposal would've caught a different error from the below-Raft check (not leaseholder)? That's probably right, though I'd be hesitant to rely on that for correctness. For example, if we moved this out of the Raft goroutine, it might stop being true because the lease can be any future lease as opposed to the one active when the cmd bounced.


pkg/storage/replica_write.go, line 197 at r11 (raw file):

			}
			if propResult.ProposalRetry != proposalNoReevaluation {
				log.Fatalf(ctx, "expected no reevaluation")

Do you want to print the actual value and log.Safe it?

Contributor Author

@bdarnell bdarnell left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)


pkg/storage/replica_raft.go, line 2067 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why doesn't it blindly repropose

Yeah, I wasn't saying "blindly stuff a new lease in" but "why change anything about the proposer lease". Your second sentence explains why: the code wants a lease (I assume for not much else than to pick the seqno). But I must still be missing something because it seems that the lease passed in is only ever accessed in insertProposalLocked, which cares about nothing but the seqno, so we could just change the signatures to take only that.

Good idea. I changed things around so that filling in the proposer fields is a prerequisite for insertProposalLocked instead of something it does by passing through arguments you give it. This means we don't have to think about it on the retry path.


pkg/storage/replica_write.go, line 197 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Do you want to print the actual value and log.Safe it?

This fatal is removed by a subsequent commit.

@bdarnell bdarnell changed the title [wip] storage: Improve handling of lease index retries storage: Improve handling of lease index retries Mar 1, 2019
Member

@tbg tbg left a comment

:lgtm:

Please add the explicit examples and verbose explanation to the commit message of

storage: Move illegal lease index retries down into raft

and the PR description.

Also I wouldn't mind another close pass by @nvanbenschoten to make sure we're not introducing something undesirable here.

Reviewed 2 of 2 files at r13.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell)


pkg/storage/replica_raft.go, line 2043 at r13 (raw file):

	if proposedLocally {
		// If we failed to apply at the right lease index, try again with a new one.

Mention that this is important for pipelined writes.


pkg/storage/replica_raft.go, line 2057 at r13 (raw file):

// tryReproposeWithNewLeaseIndex is used by processRaftCommand to
// repropose commands that have gotten an illegal lease index error.

Clarify that this is only called when we know not only that there has been an illegal lease index error, but also that the command has never succeeded (that is, we observed the first time this command was applied, not a reproposal).
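
Something along these lines, say (a sketch of the requested clarification, not the wording that actually landed):

```go
// tryReproposeWithNewLeaseIndex is used by processRaftCommand to
// repropose commands that have received an illegal lease index error.
// It is only called on the proposing node, and only when we are
// observing the first application attempt of the command (not a
// reproposal), i.e. when we know the command has never applied
// successfully under an earlier lease index.
```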

@bdarnell
Contributor Author

bdarnell commented Mar 8, 2019

After rebasing, the test is no longer flaky under stress as described in #28876 (comment)

Assuming it's also not flaky on teamcity I'm going to go ahead and merge this and handle the timestamp cache issue in a new PR. I've updated the commit message with more detail.

@bdarnell bdarnell force-pushed the lease-span-test branch 2 times, most recently from 4a8bde8 to ac7f14c Compare March 12, 2019 16:39
@bdarnell
Contributor Author

Still flaky, but with a different failure:

[16:49:16][Maybe stressrace pull request] I190312 16:49:15.755320 2667 storage/replica_command.go:246  [n1,split,s1,r32/1:/{Table/63-Max}] initiating a split of this range at key /Table/64 [r33] (zone config)
[16:49:16][Maybe stressrace pull request] F190312 16:49:16.240441 821 storage/replica_raft.go:2365  [n3,s3,r1/3:/{Min-System/NodeL…}] TruncatedState regressed:
[16:49:16][Maybe stressrace pull request] old: index:130 term:6 
[16:49:16][Maybe stressrace pull request] new: index:130 term:6 
[16:49:16][Maybe stressrace pull request] goroutine 821 [running]:
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x6bb6601, 0xc0000c3f20, 0x0, 0x17)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xe2
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x61ef600, 0xc000000004, 0x5c959f7, 0x17, 0x93d, 0xc000dfebd0, 0x6b)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:876 +0xb91
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x456cae0, 0xc003a74570, 0x4, 0x2, 0x0, 0x0, 0xc004115778, 0x1, 0x1)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x287
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x456cae0, 0xc003a74570, 0x1, 0xc000000004, 0x0, 0x0, 0xc004115778, 0x1, 0x1)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x9a
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x456cae0, 0xc003a74570, 0xc004115778, 0x1, 0x1)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:191 +0x7a
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applyRaftCommand(0xc00094b900, 0x456cae0, 0xc003a74570, 0xc0030363a8, 0x8, 0x0, 0xc0033d5b80, 0x0, 0x0, 0x0, ...)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:2365 +0x12ff
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc00094b900, 0x456cae0, 0xc003a74570, 0xc0030363a8, 0x8, 0x7, 0x8d, 0x300000003, 0x3, 0x2, ...)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1926 +0x5e9
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc00094b900, 0x456cae0, 0xc003a74570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:812 +0x1ac0
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x456cae0, 0xc003a74570, 0xc00094b900, 0x456cae0)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3585 +0x1fb
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc00060c600, 0x456cae0, 0xc003a74570, 0xc003071180, 0xc001b7be30, 0x0)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3232 +0x181
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc00060c600, 0x456cae0, 0xc0014a98f0, 0x1)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3573 +0x27e
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc00144c750, 0x456cae0, 0xc0014a98f0)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x326
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x456cae0, 0xc0014a98f0)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x56
[16:49:16][Maybe stressrace pull request] github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc00229ecc0, 0xc001157860, 0xc00229ecb0)
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0x146
[16:49:16][Maybe stressrace pull request] created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
[16:49:16][Maybe stressrace pull request] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xc4

I don't see any other reports of this failure, so it was probably introduced by these changes.

@bdarnell bdarnell force-pushed the lease-span-test branch 7 times, most recently from 6cceb7a to a77f0f5 Compare March 13, 2019 13:31
@bdarnell
Contributor Author

bdarnell commented Mar 18, 2019

Since that "ready for one last look" I had to make a subtle tweak (re-inserting to the proposals map) to prevent another stress flake, and also give it some pretty generous timeouts. It's still passing (at least locally) with those more generous timeouts, so it's not deadlocking or anything, but I'm surprised it's needing so long. I'm not sure whether that's indicating a problem or if that's just the nature of our stress (and especially stressrace) tests these days.

All in all, I don't feel great about merging this while we have other open release blockers being investigated. It might introduce instability or at least complicate those investigations.

@bdarnell
Contributor Author

OK, the flakiness appears to be resolved now (and this time it was all on the test side, not the code side).

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 9 of 9 files at r14, 2 of 2 files at r15, 1 of 1 files at r16, 3 of 3 files at r17, 2 of 2 files at r18, 5 of 5 files at r19, 3 of 3 files at r20, 2 of 2 files at r21, 1 of 1 files at r22, 3 of 3 files at r23.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)


pkg/storage/replica_raft.go, line 2060 at r12 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You're saying that this wouldn't happen because in this case the proposal would've caught a different error from the below-Raft check (not leaseholder)? That's probably right, though I'd be hesitant to rely on that for correctness. For example, if we moved this out of the Raft goroutine, it might stop being true because the lease can be any future lease as opposed to the one active when the cmd bounced.

I think I'm saying that for us to hit this code path we know that proposedLocally == true, so lease.Sequence == proposal.command.ProposerLeaseSequence is the only other condition we should need to ensure that we own the current lease.

EDIT: it looks like Ben went ahead and removed the entire check, even the sequence part. That wasn't what I was suggesting, but I have to figure it's safe because we would get a different error if the lease had changed between proposing and attempting to repropose. @bdarnell mind confirming this and adding a comment?
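
The comment being requested would presumably capture the reasoning from this thread, along these lines (a sketch, not the committed text):

```go
// Note: we don't check lease ownership here. This path is only reached
// for commands proposed locally, and if the lease had changed between
// the original proposal and this reproposal, the command would have been
// rejected with a NotLeaseHolderError rather than an illegal lease index,
// so the reproposal cannot sneak in under someone else's lease.
```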


pkg/storage/batcheval/cmd_lease_test.go, line 47 at r23 (raw file):

	db := tc.ServerConn(0)

	// More than 30 iterations is flaky under stressrace on teamcity.

What kind of flaky? Is it timing out?

Moving lease index reproposals below raft introduced a bug in which
the same proposal could be applied more than once at different lease
indexes. (This was originally caught via an assertion in the handling
of TruncateLog below raft). This commit adds a test and fixes the bug.

Release note: None
Contributor Author

@bdarnell bdarnell left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell and @nvanbenschoten)


pkg/storage/replica_raft.go, line 2060 at r12 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think I'm saying that for us to hit this code path we know that proposedLocally == true, so lease.Sequence == proposal.command.ProposerLeaseSequence is the only other condition we should need to ensure that we own the current lease.

EDIT: it looks like Ben went ahead and removed the entire check, even the sequence part. That wasn't what I was suggesting, but I have to figure it's safe because we would get a different error if the lease had changed between proposing and attempting to repropose. @bdarnell mind confirming this and adding a comment?

Done


pkg/storage/batcheval/cmd_lease_test.go, line 47 at r23 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What kind of flaky? Is it timing out?

I understand the mystery now. The on-every-PR stress script runs with a timeout that is a function of the number of packages in which tests were added (!). When I added the second test, the first one got a shorter timeout and became flaky again. Also the iterations here turn out to be extremely expensive with the race detector on - more than a second each.

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 1 of 1 files at r24.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)


pkg/storage/replica_raft.go, line 2060 at r12 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Done

Thanks.

Member

@tbg tbg left a comment

Reviewed 9 of 9 files at r14, 2 of 2 files at r15, 1 of 1 files at r16, 3 of 3 files at r17, 2 of 2 files at r18, 5 of 5 files at r19, 3 of 3 files at r20, 2 of 2 files at r21, 1 of 1 files at r22, 2 of 3 files at r23, 1 of 1 files at r24.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@bdarnell
Contributor Author

bors r=tbg,nvanbenschoten

craig bot pushed a commit that referenced this pull request Mar 19, 2019
35261: storage: Improve handling of lease index retries r=tbg,nvanbenschoten a=bdarnell

Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig
Contributor

craig bot commented Mar 19, 2019

Build succeeded

@craig craig bot merged commit f7dc847 into cockroachdb:master Mar 19, 2019
@bdarnell bdarnell deleted the lease-span-test branch March 19, 2019 16:56
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 18, 2019
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals`
were immutable over the lifetime of a Raft proposal. This invariant was violated
by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal.
This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`.
Combined with aliasing between multiple Raft proposals due to reproposals from
ticks, this resulted in cases where a leaseholder's Raft logs could diverge from
its followers and cause Raft groups to become inconsistent.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 19, 2019
…ading SSTs

craig bot pushed a commit that referenced this pull request Apr 19, 2019
36939: storage: remove optimization to use in-memory RaftCommand when sideloading SSTs r=nvanbenschoten a=nvanbenschoten

Fixes #36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.

We can justify the new cost of unmarshalling `RaftCommands` in `maybeInlineSideloadedRaftCommand` on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 19, 2019
…ading SSTs

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 18, 2020
This commit pulls the replica leaseholder check on the read and write
evaluation paths above concurrency manager sequencing. This change makes
semantic sense because the contents of the concurrency manager and its
lockTable are a reflection of the state of the range, so they should
only be exposed to requests if the request is evaluating on a
leaseholder replica or has otherwise. Up to this point, this hasn't been
a real issue because the lockTable was purely a best-effort representation
of the state of a range.

There are a few strong reasons to make this change:
- it makes it easier to fix cockroachdb#46148, a current release blocker
- it allows the MaxTimestamp of transactions to be bound before sequencing
  in the lockTable. This has the effect of ensuring that all transaction
  pushes are to a timestamp less than or equal to the current node's HLC
  clock.
- it seems like the order we would want things in if/when we begin to
  address consistent read replicas (cockroachdb#39758), where the concurrency manager
  sequencing step becomes a distributed operation that depends on the state
  of the current lease.

One concern here is that the concurrency manager sequencing can take an
arbitrarily long amount of time because it blocks on conflicting locks.
It would be unfortunate if this led to an increased number of rejected
proposals due to lease sequence mismatches. To address this, the commit
makes two small changes that interact with each other:
1. it performs a best-effort check before evaluating and proposing a
   write that the lease it will be proposing under is still active. This
   check is similar to the best-effort context cancellation check we
   perform before proposing.
2. it adds a retry loop for NotLeaseHolderErrors hit by requests into
   executeBatchWithConcurrencyRetries. We've had a retry loop like this
   a few times in the past (most recently removed in cockroachdb#35261). Now that
   we're proactively checking the lease before proposing, it seems like
   a good idea to retry on the replica itself when possible.

I could be convinced out of one or both of these changes.
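
A rough sketch of the retry loop described in point 2 above (self-contained toy code; the error type, its field, and the function name are simplified assumptions, not the real executeBatchWithConcurrencyRetries):

```go
package sketch

import (
	"context"
	"errors"
	"fmt"
)

// NotLeaseHolderError is a stand-in for the real error type; the field
// below is invented for this sketch.
type NotLeaseHolderError struct {
	LeaseHolderIsLocal bool
}

func (e *NotLeaseHolderError) Error() string {
	return fmt.Sprintf("not lease holder (lease holder is local: %t)", e.LeaseHolderIsLocal)
}

// executeWithRetries retries evaluation on this replica when it bounces
// with a NotLeaseHolderError that still points at the local replica,
// instead of returning the error to DistSender.
func executeWithRetries(ctx context.Context, eval func(context.Context) error) error {
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		err := eval(ctx)
		var nlhe *NotLeaseHolderError
		if err != nil && errors.As(err, &nlhe) && nlhe.LeaseHolderIsLocal {
			// The lease changed (or was renewed) under us, but this replica
			// still holds it; retry locally.
			continue
		}
		return err
	}
}
```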
craig bot pushed a commit that referenced this pull request Jul 12, 2023
106502: batcheval: delete flaky integration test r=aliher1911 a=aliher1911

TestLeaseCommandLearnerReplica is fundamentally flaky, as the commit can't retry all underlying replication failures and acceptable failure causes are not exposed at the higher levels where the test operates.

What the test is supposed to verify is that failures to propose with an invalid lease do not cause pipelined writes to fail but are instead retried. More details are described in the original PR #35261, where this test was introduced.
In the absence of direct control over that, the test tries to move leases between replicas, but this action can cause different failures which can't be filtered out. For example:
`failed to repropose 0b9a11f4bef0e3d3 at idx 16 with new lease index: [NotLeaseHolderError] ‹reproposal failed due to closed timestamp›`
and 
`stale proposal: command was proposed under lease #2 but is being applied under lease: ...`
While the closed timestamp issue could be mitigated by changing settings, the second one is opaque to the caller and hard to filter out.

Release note: None
Fixes #104716

Co-authored-by: Oleg Afanasyev <oleg@cockroachlabs.com>