-
Notifications
You must be signed in to change notification settings - Fork 4k
batcheval: return unmodified staging txn in IndeterminateCommitError #156722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Looks reasonable to me but I'll defer the approval to @arulajmani. We see this |
|
@miraradeva I think this particular bug can happen at any isolation level. I don't think that this will fix the weak isolation level issues we've seen since in those cases we've written the transaction record already. However, this fix is required for our proposed fix of running span-refresher retries at a higher timestamp. |
arulajmani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arulajmani reviewed 3 of 3 files at r1, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @miraradeva)
pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 506 at r1 (raw file):
// actually be in a state that _would have_ been implicitly committed IF // it had been able to write a transaction record with this new state. // However, here we are aborting the transaction and don't want to end up
Should we drop the "here we are aborting the transaction" bit? It's slightly misleading, because we've tried to abort the transaction and realized we need to perform transaction recovery. The final status of the transaction is contingent on the result of this.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 513 at r1 (raw file):
txnForError = existingTxn } else { txnForError = *reply.Txn
Is it possible for us to hit this branch? My read is that reply.Txn is either set to the existingTxn (if recordAlreadyExisted) or h.Txn.Clone(). h.Txn.Clone().Status should never be STAGING, so we should never be in this branch -- if this checks out to you, should we return an assertion failed error here instead?
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4621 at r1 (raw file):
} // TestRollbackAfterRefreshAndFailedCommit is a regression test for #156698
nit: missing period.
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4625 at r1 (raw file):
// This test forces the self-recovery of a transaction after a failed parallel // commit. It doe this via the following. //
nit: Thanks for writing this comment to explain the test setup. One thing it's missing is how the ranges are split during setup to ensure Del key=8 is split into its own batch. Should we weave that in?
|
@stevendanna great debugging here by the way! 🚀 |
0c6289b to
ab1971b
Compare
stevendanna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @miraradeva)
pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 506 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we drop the "here we are aborting the transaction" bit? It's slightly misleading, because we've tried to abort the transaction and realized we need to perform transaction recovery. The final status of the transaction is contingent on the result of this.
Changed the text a bit
pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 513 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Is it possible for us to hit this branch? My read is that reply.Txn is either set to the existingTxn (if recordAlreadyExisted) or
h.Txn.Clone().h.Txn.Clone().Statusshould never beSTAGING, so we should never be in this branch -- if this checks out to you, should we return an assertion failed error here instead?
Yup, I think so, just added a small assertion.
pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4625 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: Thanks for writing this comment to explain the test setup. One thing it's missing is how the ranges are split during setup to ensure Del key=8 is split into its own batch. Should we weave that in?
Done.
Potential Bug(s) DetectedThe three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
|
When a EndTxn(abort) enouncouters a transaction record in STAGING, it returns
an IndeterminateCommitError so that the transaction recovery process can be
started. When constructing this error, we previously returned a copy of the
staging transaction that had its write timestamp forwarded the txn found on
the request header.
According to this comment, this is rather important.
// We're using existingTxn on the reply, although it can be stale
// compared to the Transaction in the request (e.g. the Sequence,
// and various timestamps). We must be careful to update it with the
// supplied ba.Txn if we return it with an error which might be
// retried, as for example to avoid client-side serializable restart.
However, a side-effect is that the txnrecovery performed in response to an
IndeterminateCommitError could erroneously find that the the transaction was
commited because it was checking for the implicit commit criteria using a higher
timestamp. The actual recovery of the transaction would then hit the following
assertion:
programming error: timestamp change by implicitly committed transaction
Here, we solve this by returning the unmodified txn in an IndeterminateCommitError.
While the comment warns us against this, I think in the case of here, this transaction
has nothing (other than the EndTxn) to retry since we are rolling back. Further, the
error is handled server-side immediately and then the user receives either the error
from the recovery process or the error from the retried request.
Fixes cockroachdb#156698
Release note (bug fix): Fix a bug that could result in a transaction encountering
the following assertion failure:
failed indeterminate commit recovery: programming error: timestamp change by
implicitly committed transaction
ab1971b to
3d1337f
Compare
|
bors r=arulajmani |
When a EndTxn(abort) enouncouters a transaction record in STAGING, it returns an IndeterminateCommitError so that the transaction recovery process can be started. When constructing this error, we previously returned a copy of the staging transaction that had its write timestamp forwarded the txn found on the request header.
According to this comment, this is rather important.
However, a side-effect is that the txnrecovery performed in response to an IndeterminateCommitError could erroneously find that the the transaction was commited because it was checking for the implicit commit criteria using a higher timestamp. The actual recovery of the transaction would then hit the following assertion:
Here, we solve this by returning the unmodified txn in an IndeterminateCommitError. While the comment warns us against this, I think in the case of here, this transaction has nothing (other than the EndTxn) to retry since we are rolling back. Further, the error is handled server-side immediately and then the user receives either the error from the recovery process or the error from the retried request.
Fixes #156698
Release note (bug fix): Fix a bug that could result in a transaction encountering the following assertion failure:
failed indeterminate commit recovery: programming error: timestamp change by
implicitly committed transaction