roachtest: jepsen/3/register/split failed #30792

Closed
cockroach-teamcity opened this Issue Sep 29, 2018 · 59 comments

@cockroach-teamcity
Member

cockroach-teamcity commented Sep 29, 2018

SHA: https://github.com/cockroachdb/cockroach/commits/031748232a601aab191fd3fb00dc298ca20e4506

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=jepsen/3/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=932135&tab=buildLog

The test failed on master:
	test.go:498,jepsen.go:232,jepsen.go:285: /home/agent/work/.go/bin/roachprod run teamcity-932135-jepsen-3:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.128.0.29 -n 10.128.0.26 -n 10.128.0.28 -n 10.128.0.27 -n 10.128.0.25 \
		   --test register --nemesis split \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 1
		: exit status 1

@bdarnell

Member

bdarnell commented Oct 1, 2018

This looks like a real one! I'm still working on making sense of the output but jepsen ran to completion and found a non-linearizable history.

@bdarnell

Member

bdarnell commented Oct 1, 2018

Could be a red herring, but #30778 merged in the day before this failure.

@petermattis

Contributor

petermattis commented Oct 1, 2018

@bdarnell What are you looking at that makes you suspect #30778? (I'm not fluent in looking through jepsen failures).

@bdarnell

Member

bdarnell commented Oct 1, 2018

Just that this was the split nemesis, and #30778 touched the split code. Nothing else in the last day or so (prior to commit 0317482) caught my attention, but since this failure is non-deterministic it wasn't necessarily introduced within the previous day.

@petermattis

Contributor

petermattis commented Oct 1, 2018

Ok. #30778 seems pretty innocuous. It no longer removes a replica from the replicas map temporarily while committing a split. I'll poke around in the logs for this failure to see if anything jumps out at me.

@bdarnell

Member

bdarnell commented Oct 1, 2018

OK, I think these are the relevant portions of the logs. We see from the reads here (starting at line 15) that key 167 has value 1. Several writes (with values 2 and 4) are attempted, but they all fail (unambiguously, I think, although I don't see any explicit handling of ambiguity here), just after the nemesis causes a split at key 165.

INFO [2018-09-29 11:54:47,520] jepsen worker 16 - jepsen.util 16	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 14 - jepsen.util 14	:invoke	:write	[167 2]
INFO [2018-09-29 11:54:47,520] jepsen worker 15 - jepsen.util 15	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 10 - jepsen.util 10	:invoke	:write	[167 2]
INFO [2018-09-29 11:54:47,520] jepsen worker 17 - jepsen.util 17	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 19 - jepsen.util 19	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 13 - jepsen.util 13	:invoke	:write	[167 2]
INFO [2018-09-29 11:54:47,520] jepsen worker 11 - jepsen.util 11	:invoke	:cas	[167 [2 3]]
INFO [2018-09-29 11:54:47,520] jepsen worker 18 - jepsen.util 18	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 12 - jepsen.util 12	:invoke	:write	[167 4]
INFO [2018-09-29 11:54:47,522] jepsen worker 6 - jepsen.util 6	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,522] jepsen worker 8 - jepsen.util 8	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,523] jepsen worker 7 - jepsen.util 7	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,523] jepsen worker 9 - jepsen.util 9	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,523] jepsen worker 15 - jepsen.util 15	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,523] jepsen worker 16 - jepsen.util 16	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,523] jepsen worker 17 - jepsen.util 17	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,523] jepsen worker 11 - jepsen.util 11	:fail	:cas	[167 [2 3]]
INFO [2018-09-29 11:54:47,524] jepsen worker 3 - jepsen.util 3	:fail	:cas	[166 [3 2]]
INFO [2018-09-29 11:54:47,524] jepsen worker 19 - jepsen.util 19	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,524] jepsen worker 18 - jepsen.util 18	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,528] jepsen worker 4 - jepsen.util 4	:fail	:cas	[166 [2 1]]
INFO [2018-09-29 11:54:47,530] jepsen worker 20 - jepsen.util 20	:ok	:write	[165 0]
INFO [2018-09-29 11:54:47,531] jepsen worker 1 - jepsen.util 1	:ok	:write	[166 4]
INFO [2018-09-29 11:54:47,541] jepsen nemesis - jepsen.util :nemesis	:info	["splits" :split]	[:split :test 165]
INFO [2018-09-29 11:54:47,562] jepsen worker 2 - jepsen.util 2	:fail	:write	[166 2]
INFO [2018-09-29 11:54:47,564] jepsen worker 12 - jepsen.util 12	:fail	:write	[167 4]
INFO [2018-09-29 11:54:47,566] jepsen worker 5 - jepsen.util 5	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,571] jepsen worker 14 - jepsen.util 14	:fail	:write	[167 2]
INFO [2018-09-29 11:54:47,574] jepsen worker 0 - jepsen.util 0	:ok	:write	[166 1]
INFO [2018-09-29 11:54:47,581] jepsen worker 10 - jepsen.util 10	:fail	:write	[167 2]
INFO [2018-09-29 11:54:47,589] jepsen worker 13 - jepsen.util 13	:fail	:write	[167 2]

Next time we touch key 167, it has value 4:

INFO [2018-09-29 11:54:48,020] jepsen worker 16 - jepsen.util 16	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 17 - jepsen.util 17	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 10 - jepsen.util 10	:invoke	:cas	[167 [4 4]]
INFO [2018-09-29 11:54:48,020] jepsen worker 15 - jepsen.util 15	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 13 - jepsen.util 13	:invoke	:cas	[167 [2 3]]
INFO [2018-09-29 11:54:48,020] jepsen worker 19 - jepsen.util 19	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 18 - jepsen.util 18	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 12 - jepsen.util 12	:invoke	:cas	[167 [4 3]]
INFO [2018-09-29 11:54:48,020] jepsen worker 14 - jepsen.util 14	:invoke	:write	[167 1]
INFO [2018-09-29 11:54:48,020] jepsen worker 11 - jepsen.util 11	:invoke	:cas	[167 [2 1]]
INFO [2018-09-29 11:54:48,021] jepsen worker 6 - jepsen.util 6	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,022] jepsen worker 8 - jepsen.util 8	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,022] jepsen worker 1 - jepsen.util 1	:fail	:cas	[166 [3 3]]
INFO [2018-09-29 11:54:48,022] jepsen worker 9 - jepsen.util 9	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,022] jepsen worker 7 - jepsen.util 7	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,023] jepsen worker 5 - jepsen.util 5	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,023] jepsen worker 19 - jepsen.util 19	:ok	:read	[167 4]
INFO [2018-09-29 11:54:48,023] jepsen worker 17 - jepsen.util 17	:ok	:read	[167 4]
INFO [2018-09-29 11:54:48,023] jepsen worker 15 - jepsen.util 15	:ok	:read	[167 4]

So either A) pending writes on the RHS of a split are reporting unambiguous failures when they are in fact being applied, or B) jepsen is not correctly accounting for ambiguous failures in this test.
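
The anomaly can also be stated mechanically. Jepsen's real checker (Knossos) does full linearizability checking; the toy Python sketch below is an illustration only, capturing just this one invariant: the value of a write that reported an unambiguous failure must never be observed by a later read.

```python
# Toy check of the anomaly above, NOT Jepsen's checker. Simplified: it
# assumes the flagged value wasn't also written successfully elsewhere,
# which holds in this condensed excerpt.

def find_failed_write_reads(history):
    """history: (op, outcome, key, value) tuples in real-time order."""
    failed_writes = set()
    violations = []
    for op, outcome, key, value in history:
        if op == "write" and outcome == "fail":
            failed_writes.add((key, value))
        elif op == "read" and outcome == "ok" and (key, value) in failed_writes:
            violations.append((key, value))
    return violations

# Condensed from the log excerpts: write [167 4] fails, then a read sees 4.
history = [
    ("read", "ok", 167, 1),
    ("write", "fail", 167, 4),
    ("write", "fail", 167, 2),
    ("read", "ok", 167, 4),
]
print(find_failed_write_reads(history))  # [(167, 4)]
```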

@bdarnell

Member

bdarnell commented Oct 1, 2018

What kind of errors are happening here? Fortunately we have packet captures.

$ for i in */*.pcap; do tshark -d tcp.port==26257,pgsql -O pgsql -Y 'pgsql.message' -r $i; done|grep Message|cut -d: -f1-4|sort |uniq -c |sort -n
tshark: The file "10.128.0.26/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.27/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.28/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.29/trace.pcap" appears to have been cut short in the middle of a packet.
   3     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY)
   5     Message: query execution canceled
  50     Message: restart transaction: HandledRetryableTxnError: WriteTooOldError
1031     Message [truncated]: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError
1576     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError

The test considers any "restart transaction" error to be an unambiguous failure, while "query execution canceled" should be getting treated as ambiguous. So maybe the (infrequent) ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY error is in fact ambiguous (at least when it is caused by a split, which advances the timestamp cache for the RHS) but looks unambiguous. Unfortunately, that's not it: I checked all three of these errors and none of them apply to a query involving the key 167 (although one of them involves the key 169). None of the "query execution canceled" errors involve this key either.

Let's approach this from the other direction and find queries involving key 167. This is a little tricky because of the use of prepared statements: we can't actually see the query and the parameters together. But the filter pgsql.val.data==00:00:00:00:00:00:00:a7 && pgsql.val.data==00:00:00:04 gives us queries involving both 167 (as an int64, the type of our keys) and 4 (as an int32, the type of our values). We can further narrow it down by the arity of the statements: blind writes (which is what we're looking for) have two arguments, while reads have one and CAS queries have three.
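
For reference, those filter bytes are just the big-endian binary-format parameter encodings on the Postgres wire protocol; a quick Python sanity check (the helper name is mine):

```python
# Postgres binary-format parameters are big-endian. Key 167 travels as an
# 8-byte int64 and value 4 as a 4-byte int32, which yields the filter bytes.

def pg_param_bytes(n, width):
    # Render an integer the way it appears in tshark's pgsql.val.data field.
    return ":".join(f"{b:02x}" for b in n.to_bytes(width, "big"))

print(pg_param_bytes(167, 8))  # 00:00:00:00:00:00:00:a7
print(pg_param_bytes(4, 4))    # 00:00:00:04
```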

That narrows it down to a single blind write of value 4 to key 167. Here's the full dump: https://gist.github.com/bdarnell/f07d1cd944d7dc7c1c4a0ec15fd4d235

Prepared statement decoder ring:

  • S_2: BEGIN
  • S_3: select val from test where id = $1
  • S_5: UPDATE test SET val = $1 WHERE id = $2
  • S_1: COMMIT

So the transaction does a read followed by an update, then tries to commit and gets a RETRY_ASYNC_WRITE_FAILURE.

  Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id=55f33e25 key=/Table/53/1/167/0 rw=true pri=0.03783833 iso=SERIALIZABLE stat=PENDING epo=0 ts=15

Repeating the bashreduce above with more detail on the TransactionRetryErrors shows that this variant is rare:

$ for i in */*.pcap; do tshark -d tcp.port==26257,pgsql -O pgsql -Y 'pgsql.message' -r $i; done|grep Message|grep TransactionRetryError |cut -d: -f1-5|sort |uniq -c |sort -n
tshark: The file "10.128.0.26/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.27/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.28/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.29/trace.pcap" appears to have been cut short in the middle of a packet.
   5     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE)
  44     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE)
1527     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD)

So it looks like async writes (possibly combined with splits) are failing in an ambiguous way that is getting reported to the client as unambiguous. cc @nvanbenschoten

@bdarnell

Member

bdarnell commented Oct 1, 2018

Some timing details that aren't in the gist:

  • The BEGIN and SELECT were sent in a single packet, and their response was also one packet, 1ms later
  • The UPDATE took 1.3ms (really just to return, not to execute, since it's being queued up as an async write)
  • The COMMIT took 39ms to fail. The split took 36ms.

Here's how the operations were interleaved (the split timestamps come from a different clock, so they might be a bit off):

07:54:47.477 split begins
07:54:47.492 begin+select begins
07:54:47.493 begin+select ends
07:54:47.493 update begins
07:54:47.494 update ends
07:54:47.495 commit begins
07:54:47.513 split ends
07:54:47.534 commit ends

@nvanbenschoten

Member

nvanbenschoten commented Oct 1, 2018

If we see a RETRY_ASYNC_WRITE_FAILURE error, then the transaction should have been retried without committing, just like any other retry error that doesn't hit one of our optimizations.

There are two interesting things to note here that aren't immediately clear:

  • the Put request will be in the same batch as the BeginTxn request since it's the first write in the txn, so if the async write failed, there shouldn't be a txn record.
  • the txn only wrote a single key, so the QueryIntent and the EndTxn requests will also be in the same batch.

Together, this means that if the QueryIntent missed its intent, then the transaction record shouldn't even exist for the EndTxn to commit. It also means that if the QueryIntent returned an error during evaluation, the EndTxn should never have been evaluated.

It is however possible that the txn record exists from a previous epoch, but I don't see how that would cause issues because intents at the old epoch would not be committed.
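
The two bullet points can be condensed into a toy model (a deliberate simplification with made-up names, not CockroachDB's actual API): because BeginTxn rides with the first write and QueryIntent is evaluated before EndTxn in the commit batch, a lost async write should leave nothing behind for a commit to act on.

```python
# Toy model of the invariants above (made-up names, not cockroach's real API).
# Invariant 1: {BeginTxn, Put} is one batch, so if it never applied there is
# no txn record and no intent. Invariant 2: {QueryIntent, EndTxn} is one
# batch and QueryIntent errors out first, so a missing intent stops EndTxn
# from ever being evaluated.

class ToyRange:
    def __init__(self):
        self.txn_records = set()
        self.intents = set()

    def apply_begin_put(self, txn, key):
        # The async {BeginTxn, Put} batch: record and intent appear together.
        self.txn_records.add(txn)
        self.intents.add((txn, key))

    def commit(self, txn, key):
        # The {QueryIntent, EndTxn} batch: QueryIntent is checked first.
        if (txn, key) not in self.intents:
            return "RETRY_ASYNC_WRITE_FAILURE"  # EndTxn never evaluates
        self.intents.discard((txn, key))
        self.txn_records.discard(txn)
        return "COMMITTED"

r = ToyRange()
print(r.commit("d370e84f", 167))  # async write lost: retry error, no commit

r = ToyRange()
r.apply_begin_put("d370e84f", 167)
print(r.commit("d370e84f", 167))  # COMMITTED
```

The puzzle in this issue is precisely that the observed history violates this model: the commit batch reported the retry error, yet the value ended up committed anyway.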

@tbg

Member

tbg commented Oct 2, 2018

Is the following or some variation of it possible?

  1. {Begin,Put} gets proposed but is "somehow delayed or replayed in a way that doesn't block a subsequent QueryIntent" (magic)
  2. QueryIntent comes in (in the same batch as the commit) and produces an IntentMissingError:
    return result.Result{}, roachpb.NewIntentMissingError(curIntent)
  3. which means that this request doesn't mutate the timestamp cache (?) due to the error
  4. {Begin,Put} gets "replayed" and writes the txn record and intent
  5. for some reason the QueryIntent,Commit also gets replayed and actually commits the txn

Independently of whether this is even at fault, shouldn't QueryIntent use the updatesTSCacheOnError flag?

I'm also not sure how the split would've made the above possible. There's seemingly no reason the Begin,Put would ever get replayed as it doesn't even get into Raft in the first place.

@bdarnell do we have gRPC packet captures too? It'll be quite nasty to decode but now might be the time to put together a tool that does it for us.

@tbg tbg modified the milestones: 2.2, 2.1 Oct 2, 2018

@nvanbenschoten

Member

nvanbenschoten commented Oct 2, 2018

The part of that which doesn't seem possible is the "magic". If {Begin,Put} returned successfully to the client then it would have already been proposed to Raft. If the leaseholder stays the same then the command will block the subsequent QueryIntent in the command queue until it has applied. If the leaseholder changed hands before the {Begin,Put} applied then the proposal would have gotten rejected beneath Raft. Either way, the {Begin,Put} shouldn't apply successfully after the {QueryIntent,Commit} executes.

@tbg

Member

tbg commented Oct 2, 2018

Well, the "magic" replay could be that the RPC arrives twice and some tryExecute stuff goes wrong. But yeah, it doesn't seem likely (but then again, this is probably a rare bug).
I also don't see how step 5) would be possible. I guess the same reasoning could do it, namely that the request pops out of gRPC twice? That seems... just unlikely.

@bdarnell

Member

bdarnell commented Oct 2, 2018

My working theory is that the Begin+Put batch applies normally, and all the weirdness is on the QueryIntent+Commit batch. This batch fails with an error that is flagged as retriable and unambiguous, but it is in fact ambiguous and the batch goes on to apply successfully. Then the retry happens and QueryIntent fails because the transaction and its local intent are already gone.

The question is, what is that error on the first attempt to commit? We can't see that here because we don't have grpc packet captures or verbose-enough logging. We could get that, but so far the error has proven difficult to reproduce.

@tbg

Member

tbg commented Oct 3, 2018

Looking at this again from @bdarnell's perspective, the RETRY_ASYNC_WRITE_FAILURE doesn't mean anything. It's what would be expected if we accidentally committed the value and then went to check it via a QueryTxn (which would return an IntentMissingError that gets turned into the async write failure). So the "real" problem is unknown. It must've happened before the retry that caught the async write failure.

@tbg

Member

tbg commented Oct 3, 2018

I see one at least remotely plausible way for the write to be applied twice.

If this line fires, we might return ambiguous results when refreshing commands:

refreshReason = reasonSnapshotApplied

but we could unconditionally overwrite that variable later

refreshReason = reasonNewLeaderOrConfigChange

The justification is given in the comment above as saying "it's the most aggressive mode anyway", but it's not true:

if reason == reasonSnapshotApplied {
	p.finishApplication(proposalResult{ProposalRetry: proposalAmbiguousShouldBeReevaluated})
} else {
	p.finishApplication(proposalResult{ProposalRetry: proposalIllegalLeaseIndex})
}

We will get an unambiguous retry instead of an ambiguous one. The retry will then fail checking the intent and the result would be what we're seeing.

Oh, it actually becomes easier because the snapshot reason assignment doesn't do it if there's also a new leader:

if !r.store.TestingKnobs().DisableRefreshReasonSnapshotApplied &&
refreshReason == noReason {
refreshReason = reasonSnapshotApplied
}

So it seems reasonable that this would happen when the replica has the batch pending, but commits it via a snapshot that gets applied simultaneously with a leadership change (not too unlikely: we were leader before, and put the command in the log but then lose leadership and fall behind, getting caught up by the new leader).

I'm not confident that this is really the bug we're seeing here. Seems like a bug though. I hope we can get a repro.
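
A toy rendering of the suspected flaw (paraphrased from the quoted Go fragments; names kept, control flow simplified and not the real cockroach code): when a snapshot and a leadership change coincide, the snapshot reason is either never recorded or overwritten, so the proposal is retried unambiguously instead of ambiguously.

```python
# Paraphrase of the quoted Go logic above (simplified toy model).

NO_REASON = "noReason"
SNAPSHOT = "reasonSnapshotApplied"
NEW_LEADER = "reasonNewLeaderOrConfigChange"

def refresh_reason(snapshot_applied, leader_changed):
    reason = NO_REASON
    if snapshot_applied and reason == NO_REASON:
        reason = SNAPSHOT
    if leader_changed:
        reason = NEW_LEADER  # unconditionally clobbers SNAPSHOT
    return reason

def proposal_retry(reason):
    # Only the snapshot reason yields the ambiguous retry.
    if reason == SNAPSHOT:
        return "proposalAmbiguousShouldBeReevaluated"
    return "proposalIllegalLeaseIndex"  # unambiguous

print(proposal_retry(refresh_reason(True, False)))  # ambiguous, as desired
print(proposal_retry(refresh_reason(True, True)))   # ambiguity lost
```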

@andreimatei

Member

andreimatei commented Oct 3, 2018

I'm trying to get a repro, but have been unsuccessful so far. Might have to also write a unit test trying to replicate the spirit of the jepsen test+nemesis.

Incidentally to your findings, I've discussed recently with @bdarnell about getting rid of reproposals on "snapshot applied" conditions - because they're hard to test (i.e. we're not testing them) and supposed to be rare and of questionable value. Instead, we'd treat the respective proposals as ambiguous without trying to resolve the ambiguity in any direction. I was going to do that when I have a chance, and then presumably the funky code that you're pointing to would be forced to change.

@tbg

Member

tbg commented Oct 3, 2018

That sounds good to me. If what I wrote above is actually what caused the bug, you're going to have a terrible time reproducing this.

It's a good idea to return a few more ambiguous errors here and to remove the incorrect code.

@andreimatei

Member

andreimatei commented Oct 3, 2018

Well, what kind of chaos do you reckon will make the bug you found more likely to surface?

@tbg

Member

tbg commented Oct 3, 2018

Snapshots and leaseholder changes are associated with ... poorly functioning clusters. So you need one of those, and then you're not getting good qps, which means a low frequency of repro.

I still "hope" (well, not exactly hope) that this is another bug that actually has to do with the split.

@tbg

Member

tbg commented Oct 3, 2018

But, we should at least check the logs to see whether there was a snapshot for this range.

@bdarnell

Member

bdarnell commented Nov 1, 2018

For tonight's run, I'll add logging to the processing of all Raft commands.

Patching verboseRaftLoggingEnabled to return true might also give us something useful. That will produce a ton of output from raft, but this test is low-throughput enough that it should be manageable. This could tell us what's going on with those reasonNewLeader reproposals (why there's a new election, and whether the initial proposals made it into the log or not).

@andreimatei

Member

andreimatei commented Nov 2, 2018

Conspicuously, nothing seems to be failing if I enable all the Raft logging, and I'm warming up the planet. I might have to go back to more selective logging...

@andreimatei

Member

andreimatei commented Nov 2, 2018

Got a failure (2018-11-02-start-stop-2), but without verboseRaftLoggingEnabled.
Logs in https://drive.google.com/file/d/1ficfh0GC4X4-HGc-PMoHcrpFauLd_dXl/view?usp=sharing
Grepping notes: the wrongly committed txn is d370e84f and the command being reproposed is 25545d90c7648592.

Pretty quick analysis for now, as I have to go: the new information we have in these logs is that we see the processing of the commands, in particular of reproposed commands. As usual, we see the command for our EndTransaction being reproposed twice (reasonNewLeader and reasonNewLeaderOrConfigChange). And we now see that these reproposed commands are both processed. They're both rejected by the lease index check (command observed at lease index 3876, but required < 3875), as expected. The processing of the first one calls finishApplication() (with the err); the 2nd one doesn't, because proposedLocally: false - which is expected, since the previous finishApplication() took the command out of the pending commands map. Before the reproposals, there are no logs for processing of the command in question (25545d90c7648592).
And these logs are copacetic on the topic across all 3 nodes - they all show the same two rejections of the command, and no other processing.

And then, as usual, executeWriteBatch() re-evaluates the batch and gets the intent missing error.
What the hell is going on? When is this damn transaction being committed? Yet unclear.

Other random observations: a lot of things are being reproposed all the time, including the command for our BeginTxn + Put batch.
Also, we haven't given much thought to the fact that, every time, the first attempt of the EndTransaction gets a RETRY_SERIALIZABLE. That's an evaluation error; it'd definitely be surprising if the respective batch resulted in any proposals... But who knows...

Any ideas for more medieval things to log are welcome. In particular, I'm thinking maybe we should try to work our way backwards from an intent being resolved, instead of trying to go forwards from the EndTransaction being evaluated and processed. If anyone has an idea about how to log low-level committing of intents (below Raft), please let me know.

@andreimatei

Member

andreimatei commented Nov 2, 2018

Random thought: the new logs take out any suspicion about snapshots being involved in any way, I'd think: we don't see the command (identified by the cmdKey) ever being successfully processed, by any node. So its effects cannot possibly be in any snapshots... Right?

A thing that we didn't have logs for is the original proposal of the command. We know what cmdKey to look for because of the reproposals, where I log the cmdKey and the BatchRequest. But technically I don't know for sure if that's the cmdKey of the original proposal... I'm just assuming that the reproposals use the same cmdKey. Perhaps I should log all proposals and their associated batches, to make sure...

@bdarnell

Member

bdarnell commented Nov 2, 2018

Yes, reproposals should be using the same cmdKey as the original. So this appears to tell us that the original proposal is getting lost in the void (not too surprising given the proximity to a new leader - proposals get dropped on the floor if they arrive in the middle of an election).

Fine-tuning the logging configurations to try and observe the bug without perturbing it out of existence probably isn't a great idea, but I think the raftLogger Info level (V(2)) might have some useful information (specifically about the election) without being so verbose that it overwhelms everything.

Here's another possible clue: After the commit fails, we send a rollback. But this only shows up on gateway-side logs; we never see the replica mention it:

56737:I181102 20:08:08.471019 931 kv/txn_coord_sender.go:641  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] TCS.Send: [txn: d370e84f], EndTransaction(commit:false) [/Min] ...
56738:I181102 20:08:08.471050 931 kv/dist_sender.go:1052  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX sendPartialBatch: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]
56739:I181102 20:08:08.471064 931 kv/dist_sender.go:1091  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX sendPartialBatch attempt 0: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]
56740:I181102 20:08:08.471078 931 kv/dist_sender.go:1314  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX sendToReplicas: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]
56741:I181102 20:08:08.471099 931 kv/transport.go:169  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX gprcTransport sending req: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]...
56748:I181102 20:08:08.478533 931 kv/transport.go:171  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX gprcTransport sending req: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]...  done. got resp: (err: <nil>), *roachpb.EndTransactionResponse, err: <nil>
56750:I181102 20:08:08.478583 931 kv/dist_sender.go:1345  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX : sendToReplicas loop (0). Req: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]. br: (err: <nil>), *roachpb.EndTransactionResponse, err: <nil>
56751:I181102 20:08:08.478616 931 kv/dist_sender.go:1318  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX sendToReplicas returning. ba: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]. br: (err: <nil>), *roachpb.EndTransactionResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
56753:I181102 20:08:08.478640 931 kv/dist_sender.go:1111  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.EndTransactionResponse err: <nil> (ba: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0])
56754:I181102 20:08:08.478663 931 kv/dist_sender.go:1054  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] XXX sendPartialBatch returning. ba: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0]. br: (err: <nil>), *roachpb.EndTransactionResponse err: <nil>
56756:I181102 20:08:08.478696 931 kv/txn_coord_sender.go:643  [n2,client=10.142.0.99:46482,user=root,txn=d370e84f] TCS.Send: [txn: d370e84f], EndTransaction(commit:false) [/Table/53/1/77/0] ... done. br: (err: <nil>), *roachpb.EndTransactionResponse, err: <nil>

Did TxnCoordSender process this rollback locally because it thought the transaction was read-only? We know it wrote intents; how did TCS get confused? Could this confusion have other consequences? (for example, the timing doesn't line up, but there's the change to treat read-only txns as committed rather than aborted in the adriatic investigation).

@andreimatei

Member

andreimatei commented Nov 2, 2018

But this only shows up on gateway-side logs; we never see the replica mention it

(Texting from the tunnel) the rollback was sent to a replica - we see the grpctransport involved. Beyond that, is it surprising that we don't get more logs? If the evaluation of the EndTxn doesn't find the txn record, do we still propose anything? And even if we do, if there's no reproposal of the respective command, would any of the log lines I've added show the txn/batch? I think the only log line that would show up would be about processing the command, but we'd have no way of tying the cmdKey to the rollback batch.

@bdarnell

Member

bdarnell commented Nov 3, 2018

Ah, right. It did go to a replica, but if the transaction has already committed and resolved, evaluating the rollback results in an empty write batch and slips through a hole in our logging. So this doesn't really tell us anything new (it comes after the async write failure, and we already know that the damage is done at that point).

OK, next line of inquiry: Is there anything other than EndTransaction that could have resolved this intent? There's a failed push in the middle of this run (we've seen that in some but not all of the occurrences of this bug, I believe, so this may not prove anything).

123732:I181102 20:08:08.455585 16981 storage/replica_proposal.go:112  XXX proposal.finishApplication. cmd: 52797ada9d4e9bde. endCmds: 0xc425869a40, doneCh: 0xc4268240c0, err: failed to push "sql txn" id=d370e84f key=/Table/53/1/77/0 rw=true pri=0.00604256 iso=SERIALIZABLE stat=PENDING epo=0 ts=1541189283.465060250,1 orig=1541189283.463512172,0 max=1541189283.963512172,0 wto=false rop=false seq=3, retry: 0

Here's the gateway-side logs for all push and resolve traffic targeting this key during the relevant second:

123718:I181102 20:08:08.454999 16981 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: PushTxn [/Table/53/1/77/0,/Min)
123719:I181102 20:08:08.455023 16981 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: PushTxn [/Table/53/1/77/0,/Min)
123720:I181102 20:08:08.455046 16981 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: PushTxn [/Table/53/1/77/0,/Min)
123721:I181102 20:08:08.455078 16981 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...
123826:I181102 20:08:08.467231 16981 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
123828:I181102 20:08:08.467264 16981 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
123842:I181102 20:08:08.467317 16981 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
123845:I181102 20:08:08.467677 16981 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.PushTxnResponse err: <nil> (ba: PushTxn [/Table/53/1/77/0,/Min))
123847:I181102 20:08:08.467694 16981 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>
123849:I181102 20:08:08.467733 16981 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: ResolveIntent [/Table/53/1/77/0,/Min)
123850:I181102 20:08:08.467747 16981 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: ResolveIntent [/Table/53/1/77/0,/Min)
123852:I181102 20:08:08.467765 16981 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: ResolveIntent [/Table/53/1/77/0,/Min)
123853:I181102 20:08:08.467797 16981 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...
123900:I181102 20:08:08.472985 16981 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
123901:I181102 20:08:08.473037 16981 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
123902:I181102 20:08:08.473052 16981 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
123903:I181102 20:08:08.473076 16981 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil> (ba: ResolveIntent [/Table/53/1/77/0,/Min))
123904:I181102 20:08:08.473089 16981 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>
123929:I181102 20:08:08.474028 589 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: PushTxn [/Table/53/1/77/0,/Min)
123930:I181102 20:08:08.474055 589 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: PushTxn [/Table/53/1/77/0,/Min)
123931:I181102 20:08:08.474074 589 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: PushTxn [/Table/53/1/77/0,/Min)
123932:I181102 20:08:08.474099 589 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...
124002:I181102 20:08:08.488006 589 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
124003:I181102 20:08:08.488033 589 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
124004:I181102 20:08:08.488084 589 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
124005:I181102 20:08:08.488107 589 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.PushTxnResponse err: <nil> (ba: PushTxn [/Table/53/1/77/0,/Min))
124008:I181102 20:08:08.488137 589 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>
124011:I181102 20:08:08.488247 589 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: ResolveIntent [/Table/53/1/77/0,/Min)
124012:I181102 20:08:08.488269 589 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: ResolveIntent [/Table/53/1/77/0,/Min)
124013:I181102 20:08:08.488290 589 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: ResolveIntent [/Table/53/1/77/0,/Min)
124014:I181102 20:08:08.488315 589 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...
124030:I181102 20:08:08.490684 589 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
124031:I181102 20:08:08.490699 589 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
124032:I181102 20:08:08.490713 589 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
124033:I181102 20:08:08.490728 589 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil> (ba: ResolveIntent [/Table/53/1/77/0,/Min))
124034:I181102 20:08:08.490741 589 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>

This appears to show a successful push and resolve. The failed push isn't visible here (are we not logging the right error? Also, I'm only seeing the gateway stuff coming from a single node, which is suspicious).

Of course, we don't see here which txn is being pushed (the test is deliberately racing multiple transactions against each other), or what kind of pushes these are. Maybe we need more logging of successful commands and first-time proposals, not just errors and reproposals.

@bdarnell


bdarnell commented Nov 3, 2018

OK, time to look at this from the other side. The EndTransaction(commit=True) is behaving normally as far as we can tell; is there any other way we could end up reading the value it was supposed to write? I can think of a few ways which seem far-fetched but would be consistent with the symptoms we're seeing.

  • If the Put were processed as a non-transactional write. However, we see in the replica logs that the BeginTransaction and Put are processed in the same batch and it seems very unlikely that this link would be severed.
  • If the intent were still pending but the MVCCMetadata were somehow skipped over at read time (due to a misused time-bounded iterator, an off-by-one seek, or something like that).
  • If the intent resolved with a status of ABORTED but the value was not correctly deleted. My specific concern here is that this is possible if MVCCMetadata.Timestamp were out of sync with the timestamp encoded in the MVCC key.

I think the last of these is most plausible, but I haven't found any evidence that any of them are possible.
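As a toy illustration of the third scenario (hypothetical code, not CockroachDB's MVCC implementation): if the metadata timestamp drifted from the timestamp encoded in the versioned key, resolving an ABORTED intent at the metadata's timestamp would miss the actual value, leaving it readable as if committed.

```go
package main

import "fmt"

// versionedKey is a (key, timestamp) pair, standing in for an MVCC key.
type versionedKey struct {
	key string
	ts  int64
}

// store keeps intent metadata separate from versioned values, mirroring
// how MVCCMetadata sits apart from the provisional value it points at.
type store struct {
	meta   map[string]int64        // key -> timestamp the metadata claims
	values map[versionedKey]string // versioned values
}

func (s *store) writeIntent(key string, ts int64, val string) {
	s.meta[key] = ts
	s.values[versionedKey{key, ts}] = val
}

// resolveAborted deletes the provisional value at the timestamp the
// metadata claims; a stale metadata timestamp makes the delete miss.
func (s *store) resolveAborted(key string) {
	delete(s.values, versionedKey{key, s.meta[key]})
	delete(s.meta, key)
}

func main() {
	s := &store{meta: map[string]int64{}, values: map[versionedKey]string{}}
	s.writeIntent("/Table/53/1/77/0", 100, "v1")
	s.meta["/Table/53/1/77/0"] = 99 // simulate the hypothesized drift
	s.resolveAborted("/Table/53/1/77/0")
	_, survived := s.values[versionedKey{"/Table/53/1/77/0", 100}]
	fmt.Println("value survived the abort:", survived) // true in this toy
}
```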

@andreimatei


andreimatei commented Nov 3, 2018

@tbg


tbg commented Nov 3, 2018

@tbg


tbg commented Nov 5, 2018

@andreimatei got another failure with full Raft logging enabled:

2018-11-04 23:23:40,710{GMT} INFO [jepsen worker 9] jepsen.util: 9 :ok :read [34 3]
2018-11-04 23:23:40,710{GMT} INFO [jepsen worker 8] jepsen.util: 8 :ok :read [34 4]
2018-11-04 23:23:40,716{GMT} INFO [jepsen worker 3] jepsen.util: 3 :fail :cas [34 [3 4]] [:restart-transaction "ERROR: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id=8c066bb5 key=/Table/53/1/34/0 rw=true pri=0.01295777 iso=SERIALIZABLE stat=PENDING epo=0 ts=1541373815.494010172,2 orig=1541373813.645676518,0 max=1541373814.145676518,0 wto=false rop=false seq=5"]

We're going to dig in later.

@tbg


tbg commented Nov 5, 2018

I might be getting over-excited here, but I think I have something (it's not based on any of the new logging). @andreimatei mentioned yesterday that in a previous repro he saw a transaction that was waiting on the intent get unblocked right around the time the committing proposal was reproposed (though apparently both proposals bounced with an invalid lease index).

This smelled to me like a side-effect leak: we're signaling the txnwait queue with the proto resulting from a commit operation. What if we accidentally did that even when the commit didn't happen?

Looking into it, this seems true. Step by step, we're seeing the command pop out of Raft for the first time (so we populate proposal) and it catches the forced error.

proposal, proposedLocally := r.mu.proposals[idKey]
// TODO(tschottdorf): consider the Trace situation here.
if proposedLocally {
	// We initiated this command, so use the caller-supplied context.
	ctx = proposal.ctx
	proposal.ctx = nil // avoid confusion
	delete(r.mu.proposals, idKey)
}

Now note how we try to apply a no-op proposal, but we're not resetting the local eval result (only the replicated one):

if forcedErr != nil {
	// Apply an empty entry.
	raftCmd.ReplicatedEvalResult = storagepb.ReplicatedEvalResult{}
	raftCmd.WriteBatch = nil
	raftCmd.LogicalOpLog = nil
}

We then continue to populate lResult (since proposedLocally is true), despite the error, and run the handler:

	lResult = proposal.Local
}
// Handle the Result, executing any side effects of the last
// state machine transition.
//
// Note that this must happen after committing (the engine.Batch), but
// before notifying a potentially waiting client.
r.handleEvalResultRaftMuLocked(ctx, lResult,
	raftCmd.ReplicatedEvalResult, raftIndex, leaseIndex)

This eventually ends up in

if lResult.UpdatedTxns != nil {
	for _, txn := range *lResult.UpdatedTxns {
		r.txnWaitQueue.UpdateTxn(ctx, txn)
		lResult.UpdatedTxns = nil
	}
}

and so what is likely happening is this: we try to commit a txn, the commit never applies, but we hand the committed proto to some waiting pushers, who happily go and resolve (commit) the intent they had been blocked on:

case txn := <-push.pending:
	log.VEventf(ctx, 2, "result of pending push: %v", txn)
	// If txn is nil, the queue was cleared, presumably because the
	// replica lost the range lease. Return not pushed so request
	// proceeds and is redirected to the new range lease holder.
	if txn == nil {
		return nil, nil
	}
	// Transaction was committed, aborted or had its timestamp
	// pushed. If this PushTxn request is satisfied, return
	// successful PushTxn response.
	if isPushed(req, txn) {
		log.VEvent(ctx, 2, "push request is satisfied")
		return createPushTxnResponse(txn), nil
	}
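The leak above can be boiled down to a toy model (the types and the applyCommand function below are illustrative stand-ins, not the actual storage package code): on a forced error the replicated effects are cleared, but the local result escapes untouched and a waiter sees a COMMITTED proto.

```go
package main

import "fmt"

// localResult is a stand-in for the storage package's LocalResult; the
// only field modeled is the txn disposition handed to the wait queue.
type localResult struct {
	updatedTxnCommitted bool
}

// applyCommand mimics the buggy processRaftCommand flow: a forced error
// turns the command into a no-op (no replicated state change), but the
// caller-supplied local result is returned untouched.
func applyCommand(forcedErr bool, proposed *localResult) (applied bool, lr *localResult) {
	if forcedErr {
		// Replicated effects are cleared here (an empty entry is applied)...
		// ...but, as in the bug, `proposed` is not discarded.
		return false, proposed
	}
	return true, proposed
}

func main() {
	lr := &localResult{updatedTxnCommitted: true}
	// The command bounces, e.g. on the lease-index check.
	applied, got := applyCommand(true, lr)
	if !applied && got != nil && got.updatedTxnCommitted {
		// A waiting pusher is told the txn COMMITTED and goes on to
		// resolve (commit) the intent, though nothing actually applied.
		fmt.Println("wait queue signalled COMMITTED despite a no-op application")
	}
}
```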

@bdarnell


bdarnell commented Nov 5, 2018

I think you've found it! I thought I had looked through processRaftCommand for a bug like this, but I misread the fact that we use lResult even on error.

This bug lay dormant for a long while - most of the stuff in LocalResult is harmless (e.g. Intents will trigger a push which will verify the state of the world before doing anything; Metrics will cause minor discrepancies in exported metrics, etc). But EndTxns and UpdatedTxns are special, since they will cause intent resolution to act on the contained Transaction proto even if it didn't get applied. The use of EndTxns has a check on response.Err, but UpdatedTxns does not.

On a "forced" error (pErr != nil), we should just be throwing away the LocalResult entirely. I don't think there's any reason to do any of the stuff in it for errors arising from lease/proposal ordering. For response errors (response.Err != nil), it looks like there are sometimes reasons to process parts of the LocalResult (since EndTxns is specifically aware of this case), but it seems very fragile to address this for each field. I'd rather have a whitelist of fields that are meaningful on error instead of processing all of them by default.

This should be a very simple patch (at least to test the hypothesis): just guard the line lResult = proposal.Local so it only runs when pErr == nil.
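A toy sketch of that guard, applying the local result only when the command processed without error (illustrative names, not the actual processRaftCommand code):

```go
package main

import "fmt"

// localResult models the txn disposition the wait queue would be told.
type localResult struct {
	updatedTxnCommitted bool
}

// handleEvalResult stands in for handleEvalResultRaftMuLocked, the only
// place side effects (like pinging the txn wait queue) happen.
func handleEvalResult(lr *localResult) string {
	if lr != nil && lr.updatedTxnCommitted {
		return "signal wait queue: COMMITTED"
	}
	return "no side effects"
}

// process applies a command; with the guard in place, the local result
// is discarded whenever the command hit a forced error.
func process(forcedErr bool, proposed *localResult) string {
	var lr *localResult
	if !forcedErr { // the proposed one-line guard
		lr = proposed
	}
	return handleEvalResult(lr)
}

func main() {
	lr := &localResult{updatedTxnCommitted: true}
	fmt.Println(process(true, lr))  // rejected command: local result dropped
	fmt.Println(process(false, lr)) // applied command: side effects run
}
```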

@andreimatei


andreimatei commented Nov 5, 2018

Yeah, the theory seems very likely.

The analysis is more interesting: the bug (as it relates to EndTxn specifically) was probably introduced with the txn wait queue. But the introduction of the QueryIntent mechanism in 2.1 exposes the bug a lot more: before, the re-evaluation of the EndTxn would generally succeed, so the full txn would end up committed. If the node doing the re-evaluation were to crash, we'd end up with non-atomic txn writes. If the server did not crash, there was a window where "dirty reads" were possible - a reader would see data that was not exactly committed, but since the txn is known to commit afterwards, maybe these were benign.
But in 2.1 we introduced this QueryIntent request, which makes the re-evaluation fail. And so now it's easy to lose txn atomicity.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 6, 2018

storage: don't apply local results if cmd processing failed
This patch fixes a bug with the application of LocalResults: they were
applied even in cases when the processing of the respective command
failed. So, we were evaluating a batch, proposing a command, failing to
apply the command (in particular because of the lease index check), but
then still applying the LocalResult even though the replicated state was
not updated.
This bug had catastrophic consequences: part of the LocalResult is a
list of UpdatedTxn - this is used to ping the txn wait queue and unblock
waiters, communicating a final txn disposition to them. Now imagine an
EndTxn batch that evaluates to a command that fails the lease check. The
respective LocalResult is populated with our txn, updated to the
COMMITTED disposition. The command fails to process, but we still ping
the wait queue telling all the waiters that the transaction committed.
The waiters, if any, now go and send ResolveIntent requests, which
actually commit intents (thinking that the intent's txn committed). And
so we end up with "partially committed transactions" - some of the
writes from a txn are committed (the ones that happened to have waiters
on their intents), others don't.

In order for this bug to manifest itself, we need:
- read/write contention on some keys
- a command to fail processing. Generally this happens either because of
the lease index check or because of the lease check; so either a
Raft leadership change or a lease change can potentially cause the
badness. In the case of a Raft leadership change, proposals can get
dropped, which leads to reproposals, which seem to frequently(?) end up
processing at the wrong lease index and thus be rejected (and it's the
reproposal processing that triggers the bug). This kind of situation is
triggered by the Jepsen register test, with various nemeses. The lease
check failure can probably be triggered simply with lease transfers
(e.g. lease rebalancing).

Interestingly, the rate of occurrence of this bug probably changed
between 2.0 and 2.1 because of the introduction, in 2.1, of the
QueryIntent request (and also, separately, because of increased usage of
lease transfers; this claim was not verified though). Speaking of the
Raft leadership change scenario, once the reproposal fails to be applied
(but the wait queue is erroneously signalled), we also re-evaluate the
EndTransaction batch. Unless something else goes wrong, in 2.0 this
re-evaluation was likely succeeding. In 2.1, however, it tends to fail
if there was a waiter on our txn because the EndTxn requests are usually
bundled with QueryIntent requests - and these QueryIntents fail during
re-evaluation because they don't find the intent - it was, by then,
erroneously committed by a waiter through a ResolveIntent. Except for
transactions that wrote to multiple ranges: I think for those the
QueryIntent requests are split off from the EndTransaction request by
the DistSender, and so we're back to the situation in 2.0.

Fixes cockroachdb#30792

Release note (bug fix): Fix a bug causing transactions to appear to be
partially committed. CRDB was sometimes claiming to have failed to
commit a transaction but some (or all) of its writes were actually
persisted.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 6, 2018

storage: don't apply local results if cmd processing failed

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018

storage: don't apply local results if cmd processing failed

craig bot pushed a commit that referenced this issue Nov 12, 2018

Merge #32166
32166: storage: don't apply local results if cmd processing failed r=andreimatei a=andreimatei


Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018

storage: don't apply local results if cmd processing failed

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018

storage: don't apply local results if cmd processing failed

@craig craig bot closed this in #32166 Nov 12, 2018

Core Stability automation moved this from User investigations and roachtest to Closed Nov 12, 2018

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018

storage: add a test for interactions between rejected commands and readers

Another test for cockroachdb#30792, this time testing the specific interaction
between readers and rejected Raft commands.

Release note: None

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 13, 2018

storage: add a test for interactions between rejected commands and readers

craig bot pushed a commit that referenced this issue Nov 13, 2018

Merge #32236
32236: storage: add a test for interactions between rejected commands and readers r=andreimatei a=andreimatei

Another test for #30792, this time testing the specific interaction
between readers and rejected Raft commands.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>