Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: ensure auto-retrying transactions respect the statement_timeout #53968

Merged
merged 1 commit into from
Sep 14, 2020

Conversation

arulajmani
Copy link
Collaborator

Previously, if a statement exceeded the statement_timnout in the
conn_executor, but failed with a retryable error, we would still retry
it. This was because even though we overrode the error, we didn't
override the fms.Event and fsm.EventPayload which actually dictates
the next transition. This patch overrides them, thereby ensuring that
even if a timed out query/cancelled query encountered a retryable
error, we do not transition into retrying it.

Fixes #52845

Release justification: low risk, high benefit changes to existing functionality

Release note (bug fix): queries that can be automatically retried did
not respect the statement_timeout earlier, which is now fixed.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@arulajmani
Copy link
Collaborator Author

@andreimatei I know we talked about moving making makeErrEvent responsible for the overriding, but I couldn't make that work because of

if !os.ImplicitTxn.Get() && txn.IsSerializablePushAndRefreshNotPossible() {
, which doesn't use that function to generate a retryable event. Left it in the queryDone function for now.

@rafiss
Copy link
Collaborator

rafiss commented Sep 8, 2020

Thanks for your work @arulajmani! Heads up I've added this PR to the list of requested backports to the 20.2 branch. #53662

t.Fatal(err)
}

timer := time.AfterFunc(1*time.Second, func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you add a comment here mentioning that this is here to verify that the timeout error is returned before the 2 second interval passed to force_retry elapses?

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @arulajmani, and @rafiss)


pkg/sql/conn_executor_exec.go, line 394 at r1 (raw file):

		// There's no need to proceed with execution if the timer has already expired.
		if timerDuration < 0 {
			ex.cancelQuery(stmt.queryID)

Is this ex.cancelQuery() call necessary? If it isn't, I wouldn't do it since it just leads to questions about what async work are we canceling.


pkg/sql/run_control_test.go, line 869 at r1 (raw file):

	_, err = conn.QueryContext(ctx,
		`SET statement_timeout = '0.1s'`)
	if err != nil {

nit: require.NoError(t, err)


pkg/sql/run_control_test.go, line 873 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: could you add a comment here mentioning that this is here to verify that the timeout error is returned before the 2 second interval passed to force_retry elapses?

better yet, get rid of this since it's broken anyway (can't t.Fatal() on any goroutine but the main one) and just measure the elapsed time until you get the error. No need for asynchrony.


pkg/sql/run_control_test.go, line 878 at r1 (raw file):

	_, err = conn.QueryContext(ctx, `SELECT crdb_internal.force_retry('2s')`)
	if !testutils.IsError(err, "pq: query execution canceled due to statement timeout") {

just fyi, consider require.Regexp(t, "pq: ...", err)


pkg/sql/run_control_test.go, line 885 at r1 (raw file):

	timer.Stop()

	// Same test as above, except in an explicit transaction.

instead of repeating, use testutils.RunTrueAndFalse() and construct the query dynamically.

Copy link
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/sql/conn_executor_exec.go, line 394 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Is this ex.cancelQuery() call necessary? If it isn't, I wouldn't do it since it just leads to questions about what async work are we canceling.

It was in the earlier structure, where the assumption was query timeouts were a special case of cancelled queries. I've untangled that bit by checking and overwriting for timed out queries outside the cancelled queries check in the queryDone function, which allows me to get rid of this ex.cancelQuery call here.


pkg/sql/run_control_test.go, line 878 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

just fyi, consider require.Regexp(t, "pq: ...", err)

Done.


pkg/sql/run_control_test.go, line 885 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of repeating, use testutils.RunTrueAndFalse() and construct the query dynamically.

Much cleaner, done!

@arulajmani
Copy link
Collaborator Author

bors r=andreimatei

@craig
Copy link
Contributor

craig bot commented Sep 14, 2020

Build failed (retrying...):

@otan
Copy link
Contributor

otan commented Sep 14, 2020

bors r-

looks like this is failing CI

@craig
Copy link
Contributor

craig bot commented Sep 14, 2020

Canceled.

Previously, if a statement exceeded the statement_timnout in the
conn_executor, but failed with a retryable error, we would still retry
it. This was because even though we overrode the error, we didn't
override the `fms.Event` and `fsm.EventPayload` which actually dictates
the next transition. This patch overrides them, thereby ensuring that
even if a timed out query/cancelled query encountered a retryable
error, we do not transition into retrying it.

Fixes cockroachdb#52845

Release justification: low risk, high benefit changes to existing functionality

Release note (bug fix): queries that can be automatically retried did
not respect the `statement_timeout` earlier, which is now fixed.
@arulajmani
Copy link
Collaborator Author

Oops looks like I was branched off quite an old master, the test interface I was using was apparently changed.

bors r=andreimatei

@craig
Copy link
Contributor

craig bot commented Sep 14, 2020

Build succeeded:

@craig craig bot merged commit a62a0f8 into cockroachdb:master Sep 14, 2020
@otan
Copy link
Contributor

otan commented Sep 14, 2020

do you think the test failure in https://teamcity.cockroachdb.com/viewLog.html?buildId=2269713&buildTypeId=Cockroach_UnitTests is related to this PR?

@arulajmani
Copy link
Collaborator Author

This PR might have made that test flaky, as 1ms may not be enough time to execute SET statement_timout = 0 anymore. This PR no longer tries executing the statement if the timer has already elapse, wheras it earlier it would have been a race between the timer stopping and the statement being executed. I think we should just increase the statement timeout to something > 1ms so that this doesn't flake. I'll send out a PR.

@rafiss
Copy link
Collaborator

rafiss commented Sep 14, 2020

Let's also backport this PR and the test change as well (to the 20.2 branch)

arulajmani added a commit to arulajmani/cockroach that referenced this pull request Sep 15, 2020
Previously, if a user set a really low statement_timeout value, there
would be no way to reset it/remove the statement_timeout entirely. To
get around this, all `SET` statements are now exempt from the statement
timeout.This should also fix some of the flakes we were seeing in
checks were decoupled in cockroachdb#53968. `SET` statements aren't canceled as no
one checks for a canceled context, which meant this exemption existed
implicitly before cockroachdb#53968.

Closes cockroachdb#54372

Release note: None
arulajmani added a commit to arulajmani/cockroach that referenced this pull request Sep 15, 2020
Previously, if a user set a really low statement_timeout value, there
would be no way to reset it/remove the statement_timeout entirely. To
get around this, all `SET` statements are now exempt from the statement
timeout.This should also fix some of the flakes we were seeing in
checks were decoupled in cockroachdb#53968. `SET` statements aren't canceled as no
one checks for a canceled context, which meant this exemption existed
implicitly before cockroachdb#53968.

Closes cockroachdb#54372

Release note: None
craig bot pushed a commit that referenced this pull request Sep 15, 2020
54415: sql: exempt `SET` statements from the statement_timeout r=andreimatei a=arulajmani

Previously, if a user set a really low statement_timeout value, there
would be no way to reset it/remove the statement_timeout entirely. To
get around this, all `SET` statements are now exempt from the statement
timeout.This should also fix some of the flakes we were seeing in
checks were decoupled in #53968. `SET` statements aren't canceled as no
one checks for a canceled context, which meant this exemption existed
implicitly before #53968.

Closes #54372

Release note: None

Co-authored-by: arulajmani <arulajmani@gmail.com>
arulajmani added a commit to arulajmani/cockroach that referenced this pull request Sep 16, 2020
Previously, if a user set a really low statement_timeout value, there
would be no way to reset it/remove the statement_timeout entirely. To
get around this, all `SET` statements are now exempt from the statement
timeout.This should also fix some of the flakes we were seeing in
checks were decoupled in cockroachdb#53968. `SET` statements aren't canceled as no
one checks for a canceled context, which meant this exemption existed
implicitly before cockroachdb#53968.

Closes cockroachdb#54372

Release note: None
arulajmani added a commit to arulajmani/cockroach that referenced this pull request Sep 16, 2020
Previously, if a user set a really low statement_timeout value, there
would be no way to reset it/remove the statement_timeout entirely. To
get around this, all `SET` statements are now exempt from the statement
timeout.This should also fix some of the flakes we were seeing in
checks were decoupled in cockroachdb#53968. `SET` statements aren't canceled as no
one checks for a canceled context, which meant this exemption existed
implicitly before cockroachdb#53968.

Closes cockroachdb#54372

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: statement_timeout should not be reset after implicit transaction retries
5 participants