Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.1: kvclient/kvcoord: inhibit parallel commit when retrying EndTxn request #46848

Merged

Conversation

andreimatei
Copy link
Contributor

Backport 6/6 commits from #46596.

/cc @cockroachdb/release


The scenario that this patch addresses is the following (from #46431):

  1. txn1 sends Put(a) + Put(b) + EndTxn
  2. DistSender splits the Put(a) from the rest.
  3. Put(a) succeeds, but the rest catches some retriable error.
  4. TxnCoordSender gets the retriable error. The fact that a sub-batch
    succeeded is lost. We used to care about that fact, but we've
    successively gotten rid of that tracking across storage: employ transactional idempotency to refresh mixed-success batches #35140 and kv: refresh less and retry more #44661.
  5. we refresh everything that came before this batch. The refresh
    succeeds.
  6. we re-send the batch. It gets split again. The part with the EndTxn
    executes first. The transaction is now STAGING. More than that, the txn
    is in fact implicitly committed - the intent on a is already there since
    the previous attempt and, because it's at a lower timestamp than the txn
    record, it counts as golden for the purposes of verifying the implicit
    commit condition.
  7. some other transaction wonders in, sees that txn1 is in its way, and
    transitions it to explicitly committed.
  8. the Put(a) now tries to evaluate. It gets really confused. I guess
    that different things can happen; none of them good. One thing that I
    believe we've observed in roachtest: kv/contention/nodes=4 failed #46299 is that, if there's another txn's
    intent there already, the Put will try to push it, enter the
    txnWaitQueue, eventually observe that its own txn is committed and
    return an error. The client thus gets an error (and a non-ambiguous one
    to boot) although the txn is committed. Even worse perhaps, I think it's
    possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes #46341

Release note: None
Release justification: Comment only.
Release note: None
Release justification: Minor new log messages.
This function was used in sql, but I want to start using it elsewhere
when dealing with traces.

Release note: None
Release justification: Minor code move, in support of the next commit.
The scenario that this patch addresses is the following (from cockroachdb#46431):
1. txn1 sends Put(a) + Put(b) + EndTxn
2. DistSender splits the Put(a) from the rest.
3. Put(a) succeeds, but the rest catches some retriable error.
4. TxnCoordSender gets the retriable error. The fact that a sub-batch
  succeeded is lost. We used to care about that fact, but we've
  successively gotten rid of that tracking across cockroachdb#35140 and cockroachdb#44661.
5. we refresh everything that came before this batch. The refresh
  succeeds.
6. we re-send the batch. It gets split again. The part with the EndTxn
  executes first. The transaction is now STAGING. More than that, the txn
  is in fact implicitly committed - the intent on a is already there since
  the previous attempt and, because it's at a lower timestamp than the txn
  record, it counts as golden for the purposes of verifying the implicit
  commit condition.
7. some other transaction wonders in, sees that txn1 is in its way, and
  transitions it to explicitly committed.
8. the Put(a) now tries to evaluate. It gets really confused. I guess
  that different things can happen; none of them good. One thing that I
  believe we've observed in cockroachdb#46299 is that, if there's another txn's
  intent there already, the Put will try to push it, enter the
  txnWaitQueue, eventually observe that its own txn is committed and
  return an error. The client thus gets an error (and a non-ambiguous one
  to boot) although the txn is committed. Even worse perhaps, I think it's
  possible for a request to return wrong results instead of an error.

This patch fixes it by inhibiting the parallel commit when the EndTxn
batch is retried. This way, there's never a STAGING record.

Release note (bug fix): A rare bug causing errors to be returned for
successfully committed transactions was fixed. The most common error
message was "TransactionStatusError: already committed".

Release justification: serious bug fix

Fixes cockroachdb#46341
Before this patch, a trace (recording) was String()ed by sorting
log messages by timestamp, across spans. This proved to be a terrible
idea. This patch moves to keeping requests from the same span together,
but still interspersing children spans in the parent. So messages from
siblings are no longer mixed together.

Release note: None
Release justification: testing only
Release note: None
Release justification: low risk, just logging
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

bors r+

@andreimatei andreimatei merged commit ea91416 into cockroachdb:release-20.1 Apr 1, 2020
@andreimatei andreimatei deleted the backport20.1-46596 branch April 1, 2020 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants