release-20.1: kvclient/kvcoord: inhibit parallel commit when retrying EndTxn request #46848

andreimatei · 2020-04-01T14:24:54Z

Backport 6/6 commits from #46596.

/cc @cockroachdb/release

The scenario that this patch addresses is the following (from #46431):

txn1 sends Put(a) + Put(b) + EndTxn
DistSender splits the Put(a) from the rest.
Put(a) succeeds, but the rest catches some retriable error.
TxnCoordSender gets the retriable error. The fact that a sub-batch
succeeded is lost. We used to care about that fact, but we've
successively gotten rid of that tracking across storage: employ transactional idempotency to refresh mixed-success batches #35140 and kv: refresh less and retry more #44661.
we refresh everything that came before this batch. The refresh
succeeds.
we re-send the batch. It gets split again. The part with the EndTxn
executes first. The transaction is now STAGING. More than that, the txn
is in fact implicitly committed - the intent on a is already there since
the previous attempt and, because it's at a lower timestamp than the txn
record, it counts as golden for the purposes of verifying the implicit
commit condition.
some other transaction wonders in, sees that txn1 is in its way, and
transitions it to explicitly committed.
the Put(a) now tries to evaluate. It gets really confused. I guess
that different things can happen; none of them good. One thing that I
believe we've observed in roachtest: kv/contention/nodes=4 failed #46299 is that, if there's another txn's
intent there already, the Put will try to push it, enter the
txnWaitQueue, eventually observe that its own txn is committed and
return an error. The client thus gets an error (and a non-ambiguous one
to boot) although the txn is committed. Even worse perhaps, I think it's
possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes #46341

Release note: None Release justification: Comment only.

Release note: None Release justification: Minor new log messages.

This function was used in sql, but I want to start using it elsewhere when dealing with traces. Release note: None Release justification: Minor code move, in support of the next commit.

The scenario that this patch addresses is the following (from cockroachdb#46431): 1. txn1 sends Put(a) + Put(b) + EndTxn 2. DistSender splits the Put(a) from the rest. 3. Put(a) succeeds, but the rest catches some retriable error. 4. TxnCoordSender gets the retriable error. The fact that a sub-batch succeeded is lost. We used to care about that fact, but we've successively gotten rid of that tracking across cockroachdb#35140 and cockroachdb#44661. 5. we refresh everything that came before this batch. The refresh succeeds. 6. we re-send the batch. It gets split again. The part with the EndTxn executes first. The transaction is now STAGING. More than that, the txn is in fact implicitly committed - the intent on a is already there since the previous attempt and, because it's at a lower timestamp than the txn record, it counts as golden for the purposes of verifying the implicit commit condition. 7. some other transaction wonders in, sees that txn1 is in its way, and transitions it to explicitly committed. 8. the Put(a) now tries to evaluate. It gets really confused. I guess that different things can happen; none of them good. One thing that I believe we've observed in cockroachdb#46299 is that, if there's another txn's intent there already, the Put will try to push it, enter the txnWaitQueue, eventually observe that its own txn is committed and return an error. The client thus gets an error (and a non-ambiguous one to boot) although the txn is committed. Even worse perhaps, I think it's possible for a request to return wrong results instead of an error. This patch fixes it by inhibiting the parallel commit when the EndTxn batch is retried. This way, there's never a STAGING record. Release note (bug fix): A rare bug causing errors to be returned for successfully committed transactions was fixed. The most common error message was "TransactionStatusError: already committed". Release justification: serious bug fix Fixes cockroachdb#46341

Before this patch, a trace (recording) was String()ed by sorting log messages by timestamp, across spans. This proved to be a terrible idea. This patch moves to keeping requests from the same span together, but still interspersing children spans in the parent. So messages from siblings are no longer mixed together. Release note: None Release justification: testing only

Release note: None Release justification: low risk, just logging

cockroach-teamcity · 2020-04-01T14:25:08Z

This change is

andreimatei · 2020-04-01T18:47:33Z

bors r+

andreimatei added 6 commits April 1, 2020 10:24

kvserver/batcheval: add a comment

34a1391

Release note: None Release justification: Comment only.

kv, server: minor tracing improvements

af92537

Release note: None Release justification: Minor new log messages.

util/tracing: move extractMsgFromRecord to the tracking pkg

d3a2605

This function was used in sql, but I want to start using it elsewhere when dealing with traces. Release note: None Release justification: Minor code move, in support of the next commit.

kvcoord: add a log msg

3beb292

Release note: None Release justification: low risk, just logging

andreimatei requested a review from nvanbenschoten April 1, 2020 14:24

nvanbenschoten approved these changes Apr 1, 2020

View reviewed changes

andreimatei merged commit ea91416 into cockroachdb:release-20.1 Apr 1, 2020

andreimatei deleted the backport20.1-46596 branch April 1, 2020 21:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-20.1: kvclient/kvcoord: inhibit parallel commit when retrying EndTxn request #46848

release-20.1: kvclient/kvcoord: inhibit parallel commit when retrying EndTxn request #46848

andreimatei commented Apr 1, 2020

cockroach-teamcity commented Apr 1, 2020

andreimatei commented Apr 1, 2020

release-20.1: kvclient/kvcoord: inhibit parallel commit when retrying EndTxn request #46848

release-20.1: kvclient/kvcoord: inhibit parallel commit when retrying EndTxn request #46848

Conversation

andreimatei commented Apr 1, 2020

cockroach-teamcity commented Apr 1, 2020

andreimatei commented Apr 1, 2020