Skip to content

Conversation

@stevendanna
Copy link
Collaborator

@stevendanna stevendanna commented Aug 18, 2025

The span refresher may retry a batch with an EndTxn request. When
retrying such a request, it splits the EndTxn out into its own batch to
avoid the transaction from becoming implicitly committed in the middle
of the retry.

For serializable transactions this is sufficient because serializable
transactions are not allowed to write a transaction record if
WriteTimestamp != ReadTimestamp. All of the errors we retry in the span
refresher imply that the write timestamp is moving forward, and thus any
old staging record is at a timestamp that is less than the timestamp
that we will be writing at during the retry.

For weak isolation transactions, however, this is not sufficient.

Consider a weak isolation transaction issuing the following batch, with
ReadTimestamp == WriteTimestamp == t1.

Put(a)
Put(b)
EndTxn

Assume the Puts and EndTxn go to different ranges in parallel. The
following can happen:

Put(a) -> Encounters WriteTooOld that requires a refresh to t2
Put(b) -> Writes intent@t1
EndTxn() -> WriteTimetamp pushed via timestamp cache to t2, writes staging record@t2

In a SSI transaction the EndTxn would fail because WriteTimestamp !=
ReadTimestamp, but in an weak isolation transaction we write a staging
record at t2.

If we successfully refresh to t2 and start our retry, then that existing
STAGING transaction record meets the implicit commit criteria as soon as
Put(a) succeeds on retry.

We would like to avoid this because it can result in a number of
different errors.

Here, we avoid this hazard by refreshing weak isolation transactions
that are in STAGING to a timestamp just past the largest timestamp at
which the staging record could have been written. As a result, any
subsequent writes do not satisfy the implicit commit criteria of the
existing record.

This required a small change to the transaction record returned from
EndTxn(abort) to avoid situations where we mistake this future
transaction record for the existing transaction record.

Fixes #156698
Fixes #154510

Release note (bug fix): Fix a bug in which a Read Committed or Snapshot
isolation transaction may be committed despite returning a non-ambiguous
error.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@stevendanna stevendanna requested a review from miraradeva August 18, 2025 12:36
@stevendanna stevendanna force-pushed the ssd/commit-after-retry branch from 5bcb954 to b157314 Compare November 1, 2025 01:00
@stevendanna stevendanna changed the title WIP: kvcoord: return AmbiguousResultError if transaction is STAGING kvcoord: prevent unexpected parallel commit of weak isolation transactions Nov 1, 2025
@stevendanna stevendanna marked this pull request as ready for review November 1, 2025 01:02
@stevendanna stevendanna requested a review from a team as a code owner November 1, 2025 01:02
@stevendanna stevendanna requested review from arulajmani and removed request for a team November 1, 2025 01:02
@stevendanna
Copy link
Collaborator Author

I pulled out the fix for self-recovery after EndTxn(abort) to its own PR here; #156722

@stevendanna stevendanna force-pushed the ssd/commit-after-retry branch from b157314 to c8ee395 Compare November 12, 2025 16:04
@stevendanna
Copy link
Collaborator Author

@miraradeva @arulajmani Rebased this on master. It should be good for a look. Definitely let me know if you can athink of some tests we should add here in addition to the targeted one I added.

Copy link
Contributor

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@miraradeva reviewed 1 of 3 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)


-- commits line 38 at r3:
Should this be Put(b) succeeding? Or maybe I'm misunderstanding the example.


-- commits line 41 at r3:
Future me would love to not have to remember what errors we hit. Maybe list something that's easy to explain?


pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4796 at r3 (raw file):

//
// After a successful refresh, our testing filters have arranged for the
// second EndTxn to fail.

So no recovery involved here. I like that; it makes it simpler to understand.

But should we mention somewhere that the second EndTxn can race with a txn recovery? That's how we get the "already committed" variant of the failure, right?


pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4869 at r3 (raw file):

		// for this bug because it both needs to be in an isolation level that
		// allows committing when wrt != rts and the read needs to produce a read
		// span that requires a fresh such that we can't do a server-side refresh.

I assume "fresh" is supposed to be "refresh". So essentially we need the stepping in order to push the txn's read timestamp, which will make it ineligible for server-side refreshes?


pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4915 at r3 (raw file):

		require.Error(t, txn1Err)
	} else {
		// if val8 doesn't exist, then txn1 was committed so we shouldn't have

nit: comment punctuation

…tions

The span refresher may retry a batch with an EndTxn request. When
retrying such a request, it splits the EndTxn out into its own batch to
avoid the transaction from becoming implicitly committed in the middle
of the retry.

For serializable transactions this is sufficient because serializable
transactions are not allowed to write a transaction record if
WriteTimestamp != ReadTimestamp. All of the errors we retry in the span
refresher imply that the write timestamp is moving forward, and thus any
old staging record is at a timestamp that is less than the timestamp
that we will be writing at during the retry.

For weak isolation transactions, however, this is not sufficient.

Consider a weak isolation transaction issuing the following batch, with
ReadTimestamp == WriteTimestamp == t1.

    Put(a)
    Put(b)
    EndTxn

Assume the Puts and EndTxn go to different ranges in parallel. The
following can happen:

   Put(a) -> Encounters WriteTooOld that requires a refresh to t2
   Put(b) -> Writes intent@t1
   EndTxn() -> WriteTimetamp pushed via timestamp cache to t2, writes staging record@t2

In a SSI transaction the EndTxn would fail because WriteTimestamp !=
ReadTimestamp, but in an weak isolation transaction we write a staging
record at t2.

If we successfully refresh to t2 and start our retry, then that existing
STAGING transaction record meets the implicit commit criteria as soon as
Put(a) succeeds on retry.

We would like to avoid this because it can result in the transaction being
comitted by transaction recovery.  This might manifest as:

- a `TransactionStatusError: already committed (REASON_TXN_COMMITTED)` during a
  commit; or,

- an unrelated error being delivered to the client (if, say, some portion of the
  retry failed) despite the transaction committing.

Here, we avoid this hazard by refreshing weak isolation transactions
that are in STAGING to a timestamp just past the largest timestamp at
which the staging record could have been written. As a result, any
subsequent writes do not satisfy the implicit commit criteria of the
existing record.

Fixes cockroachdb#156698
Fixes cockroachdb#154510

Release note (bug fix): Fix a bug in which a Read Committed or Snapshot
isolation transaction may be committed despite returning a non-ambiguous
error.
@stevendanna stevendanna force-pushed the ssd/commit-after-retry branch from c8ee395 to f28a1db Compare November 14, 2025 00:12
Copy link
Collaborator Author

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @miraradeva)


-- commits line 38 at r3:

Previously, miraradeva (Mira Radeva) wrote…

Should this be Put(b) succeeding? Or maybe I'm misunderstanding the example.

Put(b) succeeded the first time and was written at ts1 which is less than the staging timestamp of ts2. So, once Put(a) succeeds on retry, then the combination of (new) Put(a) and (old) Put(b) technically satisfying the old staging record.


-- commits line 41 at r3:

Previously, miraradeva (Mira Radeva) wrote…

Future me would love to not have to remember what errors we hit. Maybe list something that's easy to explain?

Added the two conditions we have observed, but I honestly doubt it is a complete list of ways this could manifest.


pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4796 at r3 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

So no recovery involved here. I like that; it makes it simpler to understand.

But should we mention somewhere that the second EndTxn can race with a txn recovery? That's how we get the "already committed" variant of the failure, right?

I've added some words here to describe what I think happens in this case in when the bug was in place still.


pkg/kv/kvclient/kvcoord/dist_sender_server_test.go line 4869 at r3 (raw file):

So essentially we need the stepping in order to push the txn's read timestamp, which will make it ineligible for server-side refreshes?

Yeah, IIRC, what happens here is that without SteppingEnabled, then the transaction is in "auto step" mode in which we step the transaction between each batch. Further, auto-step (SteppingDisabled) mode consults the transaction level, when doing its step and steps the read timestamp every time. Stepping the read timestamp clears the read footprint in the span refresher, which then makes the batch eligble for server-side refreshes.

@stevendanna
Copy link
Collaborator Author

TFTR!

bors r=miraradeva

@craig
Copy link
Contributor

craig bot commented Nov 17, 2025

@craig craig bot merged commit a97e76d into cockroachdb:master Nov 17, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

3 participants