Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: prioritize TransactionAbortedError above all others #44460

Merged
merged 1 commit into from
Jan 28, 2020

Conversation

andreimatei
Copy link
Contributor

The DistSender and the DistSQLReceiver need to choose one error among
possibly multiple to hand out to a client. Before this patch,
non-retriable errors would take priority over a TransactionAbortedError.
This was a problem because the TxnCoordSender assumes that the only way
it'll find out about an aborted transaction is through a
TransactionAbortedError. When that assumption fails, the TxnCoordSender
doesn't stop the heartbeat loop, but the heartbeat loop reasonably
doesn't like to run on an aborted txn.

This patch makes the TransactionAbortedError have the highest priority,
thus hopefully making the TxnCoordSender's assumption valid.
Besides fixing this bug, prioritizing TransactionAbortedError generally
seems like the right thing to do, as it puts us in a better position to
allow upcoming savepoints to rollback past errors: one cannot rollback
past a TransactionAbortedError and so it seems like a good idea to
prefer to return this error. From the perspective of a particular
transaction, a TransactionAbortedError is the most unrecoverable error
of them all (even though it is retriable at a higher level).

Fixes #43879

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

cc @knz

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/txn_coord_sender.go, line 767 at r1 (raw file):

		// error, the transaction has been rolled back and the state was updated in
		// handleRetryableErrLocked().
		if err.Transaction.ID == ba.Txn.ID {

Drive by comment: should we push these if err.Transaction.ID == ba.Txn.ID { checks into handleRetryableErrLocked? It's strange to me that we handle the inverse case (the txn is aborted) in there but this case at each caller.


pkg/roachpb/errors.go, line 79 at r1 (raw file):

	// ignore them). Also, the TxnCoordSender likes to assume that a
	// TransactionAbortedError is the only way it finds about an aborted
	// transaction, and so it benefits about all other errors being merged into a

s/it benefits about/it benefits from/

The DistSender and the DistSQLReceiver need to choose one error among
possibly multiple to hand out to a client. Before this patch,
non-retriable errors would take priority over a TransactionAbortedError.
This was a problem because the TxnCoordSender assumes that the only way
it'll find out about an aborted transaction is through a
TransactionAbortedError. When that assumption fails, the TxnCoordSender
doesn't stop the heartbeat loop, but the heartbeat loop reasonably
doesn't like to run on an aborted txn.

This patch makes the TransactionAbortedError have the highest priority,
thus hopefully making the TxnCoordSender's assumption valid.
Besides fixing this bug, prioritizing TransactionAbortedError generally
seems like the right thing to do, as it puts us in a better position to
allow upcoming savepoints to rollback past errors: one cannot rollback
past a TransactionAbortedError and so it seems like a good idea to
prefer to return this error. From the perspective of a particular
transaction, a TransactionAbortedError is the most unrecoverable error
of them all (even though it is retriable at a higher level).

Fixes cockroachdb#43879

Release note: None
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/txn_coord_sender.go, line 767 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Drive by comment: should we push these if err.Transaction.ID == ba.Txn.ID { checks into handleRetryableErrLocked? It's strange to me that we handle the inverse case (the txn is aborted) in there but this case at each caller.

sounds good. I'll do it separately.


pkg/roachpb/errors.go, line 79 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/it benefits about/it benefits from/

done

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/kv/txn_coord_sender.go, line 767 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

sounds good. I'll do it separately.

#44463

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

@andreimatei andreimatei mentioned this pull request Jan 28, 2020
19 tasks
craig bot pushed a commit that referenced this pull request Jan 28, 2020
44457: colexec: reduce test runtime r=yuzefovich a=asubiotto

**colexec: reduce maximum number of outputs in TestHashRouterRandom**

    We would previously create up to coldata.BatchSize() possible outputs for
    a HashRouter. This ends up creating one goroutine per output, which would
    slow down the runtime of this test dramatically. Since in a production
    deployment, this is the number of nodes, the number of outputs has been limited
    to 128.

    Release note: None (testing change)

**colexec: iterate over less possible batch sizes during tests**

    For any call to runTests, the test itself would be run under a couple varying
    input batch sizes from 1 up to coldata.BatchSize(). Since we added
    randomization of the batch size to tests, this loop results in tests taking
    longer than necessary since the cases with differing input batch sizes would
    be equivalent to running the tests with a different coldata.BatchSize(). The
    exception to this is testing with a smaller input batch size than
    coldata.BatchSize() which is tested by keeping the test of using batchSize=1.
    runTests therefore now runs with two input batch sizes only: 1 and
    coldata.BatchSize(). It is difficult to compare the runtime of different
    `make test` invocations due to the amount of randomization in the package, but
    this should provide a nice improvement in the time taken to run tests.

    Release note: None (testing change)

44460: kv: prioritize TransactionAbortedError above all others r=andreimatei a=andreimatei

The DistSender and the DistSQLReceiver need to choose one error among
possibly multiple to hand out to a client. Before this patch,
non-retriable errors would take priority over a TransactionAbortedError.
This was a problem because the TxnCoordSender assumes that the only way
it'll find out about an aborted transaction is through a
TransactionAbortedError. When that assumption fails, the TxnCoordSender
doesn't stop the heartbeat loop, but the heartbeat loop reasonably
doesn't like to run on an aborted txn.

This patch makes the TransactionAbortedError have the highest priority,
thus hopefully making the TxnCoordSender's assumption valid.
Besides fixing this bug, prioritizing TransactionAbortedError generally
seems like the right thing to do, as it puts us in a better position to
allow upcoming savepoints to rollback past errors: one cannot rollback
past a TransactionAbortedError and so it seems like a good idea to
prefer to return this error. From the perspective of a particular
transaction, a TransactionAbortedError is the most unrecoverable error
of them all (even though it is retriable at a higher level).

Fixes #43879

Release note: None

Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Jan 28, 2020

Build succeeded

@craig craig bot merged commit a3538bc into cockroachdb:master Jan 28, 2020
@andreimatei andreimatei deleted the kv.txn-aborted-err-priority branch January 29, 2020 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: tpccbench/nodes=12/cpu=16 failed
3 participants