
kv,kvcoord,sql: poison txnCoordSender after a retryable error #74563

Merged
merged 3 commits on Feb 15, 2022

Conversation

lidorcarmel
Contributor

@lidorcarmel lidorcarmel commented Jan 7, 2022

Previously, kv users could lose parts of a transaction without getting an
error. After Send() returned a retryable error, the txn state was reset,
which made the handle usable again. If the caller ignored the error, they
could continue applying more operations without realizing the first part of
the transaction had been discarded. See the issue (#22615) for more details.

The simplest example is when the retryable closure of DB.Txn() returns nil
instead of returning the retryable error back to the retry loop; in that
case the retry loop declares success without realizing the first part of the
transaction (all the operations before the retryable error) was lost.

This PR leaves the txn in a "poisoned" state after it encounters a retryable
error, so that all future operations fail fast. The caller is then expected
to intentionally reset the txn handle back to a usable state by calling
Txn.PrepareForRetry(). In the simple DB.Txn() case, the retry loop resets
the handle and runs the retry even if the callback returned nil.
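
To make the failure mode concrete, here is a minimal sketch of the buggy
pattern, with hypothetical keys and values and imports elided (DB.Txn,
Txn.Put, and Txn.PrepareForRetry are the kv APIs named above):

	err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		if err := txn.Put(ctx, "k1", "v1"); err != nil {
			return err
		}
		// Suppose this second Put hits a retryable error and the closure
		// swallows it instead of returning it to the retry loop.
		if err := txn.Put(ctx, "k2", "v2"); err != nil {
			return nil // bug: the retryable error is dropped
		}
		return nil
	})
	// Before this change, the retry loop saw nil and declared success even
	// though the write to "k1" was discarded along with the first attempt.
	// With poisoning, subsequent operations and the commit fail fast until
	// PrepareForRetry() resets the handle.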

Closes #22615

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@lidorcarmel lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 8 times, most recently from 4e3d064 to e23a68a on January 7, 2022 23:01
@lidorcarmel lidorcarmel changed the title Lidor txn retry fix v2 kv,kvcoord: poison txnCoordSender after a retryable error Jan 7, 2022
@lidorcarmel lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 10 times, most recently from b7c0300 to 5c7dfef on January 14, 2022 22:36
@lidorcarmel lidorcarmel marked this pull request as ready for review January 14, 2022 22:38
@lidorcarmel lidorcarmel requested a review from a team as a code owner January 14, 2022 22:38
@lidorcarmel lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 2 times, most recently from 6ded62d to 05b749d on January 15, 2022 00:10
@lidorcarmel lidorcarmel changed the title kv,kvcoord: poison txnCoordSender after a retryable error kv,kvcoord,sql: poison txnCoordSender after a retryable error Jan 15, 2022
Member

@nvanbenschoten nvanbenschoten left a comment


Nice job! It will be a big step forward to finally resolve this issue.

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @lidorcarmel)


pkg/kv/txn.go, line 978 at r1 (raw file):

		if err == nil {
			txn.mu.Lock()
			txnErr := txn.mu.sender.GetTxnRetryableErr(ctx)

Why do we need to explicitly check this here? Shouldn't the call to txn.Commit() below and the logic in maybeRejectClientLocked be enough to detect a poisoned txn and return an error?


pkg/kv/txn.go, line 1421 at r1 (raw file):

	}
	// Verify we have an initialized Transaction before creating a new root sender.
	if newTxn.ID == (uuid.UUID{}) {

Is this a valid case to hit?


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 115 at r1 (raw file):

		txnState txnState

		// storedRetryableErr is set when tsnState == txnRetryableError. This

s/tsnState/txnState/


pkg/sql/conn_fsm.go, line 377 at r1 (raw file):

			Description: "ROLLBACK TO SAVEPOINT (not cockroach_restart) failed because txn needs restart",
			Next:        stateAborted{},
			Action:      prepareTxnForRetry,

Are you sure this is where we wanted the call to PrepareForRetry, as opposed to the next event which handles the cockroach_restart savepoint and advances the txnState with a txnRestart event? I'm a little surprised to not see a 1:1 mapping between uses of txnRestart and calls to PrepareForRetry.

Is this related to the need to provide an error to PrepareForRetry?


pkg/kv/db_test.go, line 724 at r1 (raw file):

			// Only write if we have not written before, because otherwise we will keep aborting
			// the other txn forever.
			r, e := hpTxn.Get(ctx, "aa")

nit: s/e/err/ throughout.


pkg/kv/db_test.go, line 755 at r1 (raw file):

		return e
	})
	require.NoError(t, err)

Want to add require.Equal(t, 1, runNumber) here to assert that we performed exactly one retry?


pkg/kv/db_test.go, line 758 at r1 (raw file):

	err1 := db.Txn(context.Background(), func(ctx context.Context, txn *kv.Txn) error {
		// The high priority txn was overridden by the successful  retry.

nit: extra space in "successful retry". Also, I think s/overridden/overwritten/ is slightly more accurate.


pkg/kv/db_test.go, line 822 at r1 (raw file):

	// db.Txn should return the retryable error that poisoned txn.
	expectedErr := (*roachpb.TransactionRetryWithProtoRefreshError)(nil)
	require.True(t, errors.As(err, &expectedErr))

Same thing here. Consider a require.Equal(t, 0, runNumber).


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 705 at r1 (raw file):

		{
			// On abort, nothing changes but we get a new priority to use for
			// the next attempt.

It looks like we broke what the test was trying to demonstrate. Can we revive it by adding a call to PrepareForRetry?
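
Presumably something like the following between attempts would do it (a
sketch; PrepareForRetry(ctx) matches the call shape used elsewhere in this
PR):

	// Reset the poisoned handle so the next attempt can proceed.
	txn.PrepareForRetry(ctx)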

Contributor Author

@lidorcarmel lidorcarmel left a comment


hi @nvanbenschoten, looks like that test failure in rttanalysis was real. The issue was that when things are slow (for some reason only on bazel), the txn may get aborted by a heartbeat failure, BUT when we then try to PrepareForRetry we fail because we cannot find the error - a retryable error created by the heartbeat failure did not poison the handle, so it's a real bug! thanks bazel :)
I tried adding a test; please feel free to tell me there is a much better/easier way to do this, but this test does the job.
BTW, I think that regardless of this change we should call handleRetryableErrLocked() when we see that heartbeat failure.
Lastly, the lock/unlock stuff is not related to the test failure.
PTAL, thanks!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 2 of 5 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, and @yuzefovich)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 731 at r6 (raw file):

		abortedErr := roachpb.NewErrorWithTxn(
			roachpb.NewTransactionAbortedError(roachpb.ABORT_REASON_CLIENT_REJECT), &tc.mu.txn)
		if tc.typ == kv.LeafTxn {

LeafTxns don't use txnHeartbeaters, so I don't see how this case could be possible. I think I just missed that when this was last touched in 1a84480. Do you mind instrumenting it with a fatal and seeing whether we ever hit it in unit tests? If not, I think we should remove it.
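
For reference, the suggested instrumentation could be as simple as this
sketch (log.Fatalf is the standard util/log call; the exact placement in
the heartbeat-failure branch is assumed):

	if tc.typ == kv.LeafTxn {
		log.Fatalf(ctx, "unexpected heartbeat failure on a LeafTxn")
	}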


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 737 at r6 (raw file):

		}
		// Root txns handle retriable errors.
		return roachpb.NewError(tc.handleRetryableErrLocked(ctx, abortedErr))

This is a good catch. I agree with you that we should make this change. Can we do so in a separate commit though? That way, we can better pinpoint any fallout from it.


pkg/sql/conn_fsm.go, line 506 at r6 (raw file):

	ts := args.Extended.(*txnState)
	ts.mu.Lock()
	ts.mu.txn.PrepareForRetry(ts.Ctx)

I don't think it needs to hold up this PR, but one thing in here that I'd be interested to get @andreimatei's take on is whether it is more appropriate for these calls to PrepareForRetry to live in these fsm Action functions or whether they should be moved into txnStateTransitionsApplyWrapper and handled on the txnRestart event. I don't think I ever understood what belongs where and why all connExecutor state updates weren't handled directly in an fsm Action.


pkg/kv/db_test.go, line 17 at r6 (raw file):

	"context"
	"fmt"
	"github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord"

nit: move this into the grouping below. crlfmt should do this automatically, so I wonder whether you have that set up: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher. If not, I'd recommend it.


pkg/kv/db_test.go, line 802 at r6 (raw file):

}

// Verify the txn sees a retryable error without using the handle: Normally the

This might be a more appropriate test for pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go.


pkg/kv/db_test.go, line 810 at r6 (raw file):

// handle stays in txnPending then this means we do not have the error. This
// error is needed for resetting the handle in PrepareForRetry.
func TestDB_PrepareForRetryAfterHeartbeatFailure(t *testing.T) {

How long does this test take to run? At least 1 second, right? If you replace setup with a call to serverutils.StartServer directly, you should be able to configure a shorter txn heartbeat interval?


pkg/kv/db_test.go, line 833 at r6 (raw file):

		// This is here because we want to call maybeRejectClientLocked, which will
		// poison the handle if there is a heartbeat failure.
		tc.GetLeafTxnFinalState(ctx, kv.OnlyPending)

nit: _, _ = tc.GetLeafTxnFinalState(ctx, kv.OnlyPending) to show that you're intentionally ignoring the result. CI might actually be complaining about this already.


pkg/kv/db_test.go, line 841 at r6 (raw file):

			return errors.New("the handle is not poisoned yet")
		}
		return nil

Should we assert on the contents of the error after we've checked that it's non-nil?

@lidorcarmel lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 7 times, most recently from 35d5423 to cdf18bf on February 10, 2022 06:06
Contributor Author

@lidorcarmel lidorcarmel left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 731 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

LeafTxns don't use txnHeartbeaters, so I don't see how this case could be possible. I think I just missed that when this was last touched in 1a84480. Do you mind instrumenting it with a fatal and seeing whether we ever hit it in unit tests? If not, I think we should remove it.

Tested locally with some unit tests, and also pushed to test in CI.
Removing this in a separate commit (I'm getting used to thinking about commits!).
CI is still running, if it fails I'll revert.


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 737 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a good catch. I agree with you that we should make this change. Can we do so in a separate commit though? That way, we can better pinpoint any fallout from it.

Done.


pkg/sql/conn_fsm.go, line 377 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

These changes look very good. Thank you for making them.

Thank you!


pkg/sql/conn_fsm.go, line 506 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think it needs to hold up this PR, but one thing in here that I'd be interested to get @andreimatei's take on is whether it is more appropriate for these calls to PrepareForRetry to live in these fsm Action functions or whether they should be moved into txnStateTransitionsApplyWrapper and handled on the txnRestart event. I don't think I ever understood what belongs where and why all connExecutor state updates weren't handled directly in an fsm Action.

Looks like we're good here, verified with Andrei, conn_fsm should manage the txn object.


pkg/kv/db_test.go, line 17 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this into the grouping below. crlfmt should do this automatically, so I wonder whether you have that set up: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher. If not, I'd recommend it.

Done. Strange! I had crlfmt, and I'm sure it was working until recently, but now it was disabled... hmm.


pkg/kv/db_test.go, line 802 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This might be a more appropriate test for pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go.

Done.


pkg/kv/db_test.go, line 810 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How long does this test to run? At least 1 second, right? If you replace setup with a call to serverutils.StartServer directly, you should be able to configure a shorter txn heartbeat interval?

Done.


pkg/kv/db_test.go, line 833 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: _, _ = tc.GetLeafTxnFinalState(ctx, kv.OnlyPending) to show that you're intentionally ignoring the result. CI might actually be complaining about this already.

Done.
No idea why it didn't complain :(

Code quote (from pkg/kv/kvclient/kvcoord/txn_coord_sender.go):

		newTxn := roachpb.PrepareTransactionForRetry(
			ctx, abortedErr, roachpb.NormalUserPriority, tc.clock)
		return roachpb.NewError(roachpb.NewTransactionRetryWithProtoRefreshError(
			abortedErr.String(), tc.mu.txn.ID, newTxn))

pkg/kv/db_test.go, line 841 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we assert on the contents of the error after we've checked that it's non-nil?

I added
require.Equal(t, txn.ID(), pErr.TxnID)
I'm happy to add more.


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):

		// happens due to a heartbeat failure. On the other hand
		// GetLeafTxnFinalState() calls maybeRejectClientLocked() without trying a
		// Send().

Please take a look at this comment, Nathan - I forgot this when we talked; this is why Get() doesn't work here. We could do a sleep and then Get(), but this is cleaner.

Member

@nvanbenschoten nvanbenschoten left a comment


This :lgtm:, although I think we can clean up the test just a bit further!

Reviewed 13 of 13 files at r7, 12 of 13 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 272 at r8 (raw file):

	// Create a DB with a short heartbeat interval.
	s := createTestDBWithKnobs(t, &kvserver.StoreTestingKnobs{
		DisableScanner:    true,

Do we need these three knobs?


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

Please take a look at this comment Nathan, I forgot this when we talked - this is why Get() doesn't work here. We can do a sleep and then Get() but this is cleaner.

Interesting. I wasn't expecting the Get to notice that the txn was aborted and return an error. This must be because we check the abort span during the read, which is populated on the target range when the high-priority txn aborts our intent. See WriteAbortSpanOnResolve and the pkg/kv/kvserver/abortspan package for a bit of an explanation. The high-level role of the "abort span" is to prevent zombie transactions from failing to read their own writes after those writes have been aborted.

One way to get around this is to send the put and the get to different ranges. For instance, you could perform a split on key "b" (keyB := roachpb.Key("b")) using s.DB.AdminSplit and then perform Get("b") requests in this SucceedsSoon loop. What do you think about trying that out?
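
A rough sketch of that suggestion (API shapes approximate to the code of
this era; keyB and the split come from the comment above):

	keyB := roachpb.Key("b")
	// Split at "b" so the Get below lands on a range with no abort span
	// entry for this txn; only the heartbeat failure can poison the handle.
	require.NoError(t, s.DB.AdminSplit(ctx, keyB, hlc.MaxTimestamp))
	testutils.SucceedsSoon(t, func() error {
		// Ignore the Get result; we only care whether the handle got poisoned.
		_, _ = txn.Get(ctx, keyB)
		if tc.GetTxnRetryableErr(ctx) == nil {
			return errors.New("the handle is not poisoned yet")
		}
		return nil
	})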


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

	pErr := tc.GetTxnRetryableErr(ctx)
	require.NotNil(t, pErr)
	require.Equal(t, txn.ID(), pErr.TxnID)

Should we assert the error type here as well?

Contributor Author

@lidorcarmel lidorcarmel left a comment


yep, thanks, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @nvanbenschoten, and @yuzefovich)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 272 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need these three knobs?

No, removed.


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Interesting. I wasn't expecting the Get to notice that the txn was aborted and return an error. This must be because we check the abort span during the read, which is populated on the target range when the high-priority txn aborts our intent. See WriteAbortSpanOnResolve and the pkg/kv/kvserver/abortspan package for a bit of an explanation. The high-level role of the "abort span" is to prevent zombie transactions from failing to read their own writes after those writes have been aborted.

One way to get around this is to send the put and the get to different ranges. For instance, you could perform a split on key "b" (keyB := roachpb.Key("b")) using s.DB.AdminSplit and then perform Get("b") requests in this SucceedsSoon loop. What do you think about trying that out?

Done, works like a charm.
I did like my "creative" approach but this is probably better, thanks for the details!


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we assert the error type here as well?

So, GetTxnRetryableErr() returns *roachpb.TransactionRetryWithProtoRefreshError, not error, so the type will always be fine, right? Or maybe I'm missing something.

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm_strong: feel free to merge if/when you feel good about the last two comments. No need for another round of reviews.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

so, GetTxnRetryableErr() returns *roachpb.TransactionRetryWithProtoRefreshError, not "error", so the type will always be fine, right? or maybe I'm missing something..

Ah, ok, then maybe just add an assertion that pErr.Transaction.ID is not equal to pErr.TxnID, which indicates an aborted txn.
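
Concretely, that assertion could look like this (field names as stated in
the comment above):

	// An aborted txn gets a fresh proto, so the refreshed transaction's ID
	// differs from the ID of the poisoned txn.
	require.NotEqual(t, pErr.TxnID, pErr.Transaction.ID)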


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 300 at r10 (raw file):

	// Note that if we read from the same range then we will cause the txn to be
	// poisoned immediately (Send() will fail), even before the heartbeat failure,
	// which is not the case we want to test here.

Let's mention the abort span here to explain why we need to send the Get to a different range.

Contributor Author

@lidorcarmel lidorcarmel left a comment


Thanks! done.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @yuzefovich)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Ah, ok, then maybe just add an assertion that pErr.Transaction.ID is not equal to pErr.TxnID, which indicates an aborted txn.

Done.


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 300 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's mention the abort span here to explain why we need to send the Get to a different range.

Done.

Previously, when the transaction saw a retryable error from a heartbeat
failure, we duplicated work similar to handleRetryableErrLocked() instead
of calling it. It's better to call the existing handleRetryableErrLocked()
so those errors are handled in one place.

Release note: None
LeafTxns don't use heartbeats, so we should never hit this condition.

Release note: None
@lidorcarmel
Contributor Author

TFTR Nathan!
bors r+

@craig
Contributor

craig bot commented Feb 15, 2022

Build succeeded:


Successfully merging this pull request may close these issues.

client: retryable Txn errors should require acknowledgment from clients