New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: add a retry loop for stmts in READ COMMITTED txns #107044
Conversation
1d1d2d9
to
0b0b287
Compare
0b0b287
to
633add0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @nvanbenschoten)
633add0
to
9c30984
Compare
pkg/sql/vars.go
Outdated
// CockroachDB extension. Configures the maximum number of automatic retries | ||
// to perform for statements in explicit READ COMMITTED transactions that | ||
// see a transaction retry error. | ||
`max_retries_for_read_committed_transactions`: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about a cluster setting for this purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i lean towards session variable, since cluster settings are generally intended for usage by operators, and require higher permissions on the cluster. also, in this case, it is reasonable that different workloads may want different values here, since it depends on how much contention they see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Reviewed 1 of 1 files at r1, 3 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @chengxiong-ruan, @fqazi, @nvanbenschoten, and @rafiss)
pkg/sql/conn_executor_exec.go
line 963 at r2 (raw file):
if ex.state.mu.txn.IsOpen() && ex.state.mu.txn.IsoLevel() == isolation.ReadCommitted && !ex.implicitTxn() &&
Do we also want to defer to the state machine retry for the first statement of explicit transactions?
pkg/sql/conn_executor_exec.go
line 980 at r2 (raw file):
} maxExecCount := 1
Rather than going through this loop codepath in every case, I think it would be cleaner to put the retry loop in a separate conn executor function that wraps dispatchToExecutionEngine
and is only called for read committed transactions. That will do two things: (a) get rid of some of the conditional logic, and (b) add a function to the stack that will be visible when debugging which could help us know we're in a read committed transaction.
pkg/sql/conn_executor_exec.go
line 987 at r2 (raw file):
} for attemptNum := 0; attemptNum < maxExecCount; attemptNum++ {
Does this loop need some kind of (randomized) backoff? For example, I think two update statements could both conflict with each other and both need to retry repeatedly.
pkg/sql/conn_executor_exec.go
line 1005 at r2 (raw file):
break } if !errIsRetriable(maybeRetriableErr) || attemptNum == maxRetries {
Does errIsRetriable
return true for any errors that need a full transaction retry (rather than just a statement retry)? If so, we might be performing extra useless statement retries in those cases, when we should go straight to a transaction-level retry.
pkg/sql/vars.go
line 2034 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i lean towards session variable, since cluster settings are generally intended for usage by operators, and require higher permissions on the cluster. also, in this case, it is reasonable that different workloads may want different values here, since it depends on how much contention they see.
bikeshedding: I understand that "_transactions" refers to the fact that the whole transaction is at read-committed isolation, but find it a little confusing for this setting. I think it would be clearer to leave it off or change it to "_statements".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @chengxiong-ruan, @fqazi, and @nvanbenschoten)
pkg/sql/conn_executor_exec.go
line 1019 at r4 (raw file):
} res.SetError(nil) ex.state.mu.txn.PrepareForRetry(ctx)
we should wait for kv: remove TxnCoordSender.PrepareRetryableError, rationalize ManualRestart to complete.
9c30984
to
0c74392
Compare
For some reason I can't type into Reviewable, but replying here:
I don't think so - the state machine doesn't do retries if the
Very great idea. Done, and hopefully it's simpler to read now.
Great catch. I've now rebased on top of #105161 so that I can check with
I just removed "_transactions" from the end now. |
0c74392
to
f4f17c6
Compare
f4f17c6
to
2e3268d
Compare
f81d19e
to
7d1895f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to rebase this on master now.
Reviewed 10 of 13 files at r17, 5 of 5 files at r18, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @chengxiong-ruan, @michae2, @rafiss, and @rhu713)
pkg/sql/conn_executor_exec.go
line 1523 at r14 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
one thought i had is if it would make sense to reuse the
autoRetryCounter
that we already use the for the conn_executor state machine retries. it would be a slight abuse of the variable, but it actually doesn't seem incorrect to me, and it would be the quickest way to plug into any existing observability that shows retries.
This makes sense to me. I don't think we want to consider this a "transaction retry" and so we will also want to introduce some "statement retry" specific observability, but in existing cases where we are willing to broaden the definitions (e.g. "auto-retry"), it feels reasonable to group this in.
pkg/sql/conn_executor_exec.go
line 1560 at r18 (raw file):
// flushed and sent back to the client already. In that case, we can't // retry the statement. res.SetError(errors.Wrapf(
Do we have a test that exercises this case?
pkg/sql/tests/read_committed_test.go
line 58 at r18 (raw file):
// the second read committed write begins, the read committed scans // will have finished. if readCommittedWriteCount.Load() == 2 {
Instead of re-loading, do we want to use the result of the Add
? Right now, this is racy.
108728: build: add basic infrastructure for remote execution with EngFlow r=rail a=rickystewart Add the `--config engflow` which sets some appropriate configurations for building against our engflow cluster, and set some other metadata. Also bump some test sizes and shard counts to get everything working. Epic: [CRDB-8308](https://cockroachlabs.atlassian.net/browse/CRDB-8308) Release note: None 108817: sql: simplifiy tracking of injected txn retry errors r=rafiss a=rafiss Rather than using the txn epoch, we can just track how many errors were injected. This lets us have a bit more control over how many errors to inject, without having to rely on how the KV layer handles different types of transaction retries. informs: #100145 split off from: #107044 Release note: None Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
This new loop retries individual statements inside an explicit READ COMMITTED transaction. This is possible because each statement in a READ COMMITTED transaction has a different read timestamp. The conn executor already has retry logic for all implicit transactions, and we continue to use those when possible. A session setting controls how many retries will be performed for a statement inside of an explicit READ COMMITTED transaction. Release note (sql change): Added the max_retries_for_read_committed session variable. It defaults to 10, and determines the number of times an individual statement in an explicit READ COMMITTED transaction will be retried if it encounters a retriable transaction error.
Release note: None
6c37b4c
to
4e7f87e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @chengxiong-ruan, @michae2, @nvanbenschoten, and @rhu713)
pkg/sql/conn_executor_exec.go
line 1560 at r18 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we have a test that exercises this case?
added a test
pkg/sql/tests/read_committed_test.go
line 58 at r18 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Instead of re-loading, do we want to use the result of the
Add
? Right now, this is racy.
good point; fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also want one person from SQL to sign off on the changes as well.
Reviewed 21 of 25 files at r20, 2 of 2 files at r21, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @chengxiong-ruan, @michae2, @rafiss, and @rhu713)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments about tests that you can ignore.
Reviewed 10 of 13 files at r17, 25 of 25 files at r19, 2 of 25 files at r20, 1 of 2 files at r21, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @chengxiong-ruan, @rafiss, and @rhu713)
pkg/sql/conn_executor_test.go
line 1442 at r20 (raw file):
params := base.TestServerArgs{} var readCommittedStmtRetries int
Does this also need to be an atomic?
pkg/sql/tests/read_committed_test.go
line 49 at r20 (raw file):
filterFunc := func(ctx context.Context, ba *kvpb.BatchRequest) *kvpb.Error { if ba.Txn == nil || ba.Txn.IsoLevel != isolation.ReadCommitted {
I'm worried that if we ever add internal queries running at ReadCommitted isolation they will cause this test to flake because we're looking for any Put
at RC isolation. I don't think we need to make it bulletproof now, but maybe a comment would help someone fix the hypothetical flake one day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @chengxiong-ruan, @michae2, and @rhu713)
pkg/sql/conn_executor_test.go
line 1442 at r20 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Does this also need to be an atomic?
hmm it might be safer. will do
pkg/sql/tests/read_committed_test.go
line 49 at r20 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I'm worried that if we ever add internal queries running at ReadCommitted isolation they will cause this test to flake because we're looking for any
Put
at RC isolation. I don't think we need to make it bulletproof now, but maybe a comment would help someone fix the hypothetical flake one day.
i think i can easily change this so it only counts writes to the table in this test
4e7f87e
to
6be2709
Compare
tftrs! bors r+ |
Build failed: |
This test ensures that we do not do per-statement retries for READ COMMITTED transactions if results were already sent to the client. Release note: None
6be2709
to
0e57591
Compare
bors r+ |
Build failed: |
flake was:
bors r+ |
Build failed: |
bors r+ |
Build succeeded: |
fixes #100145
This new loop retries individual statements inside an explicit READ
COMMITTED transaction. This is possible because each statement in a READ
COMMITTED transaction has a different read timestamp. The conn executor
already has retry logic for all implicit transactions, and we continue to
use those when possible.
A session setting controls how many retries will be performed for a statement
inside of an explicit READ COMMITTED transaction.
Release note (sql change): Added the
max_retries_for_read_committed session variable. It
defaults to 10, and determines the number of times an individual
statement in an explicit READ COMMITTED transaction will be retried
if it encounters a retriable transaction error.