sql: re-execute distributed query as local for some errors #105451

Merged: 2 commits into cockroachdb:master on Jun 28, 2023

Conversation

yuzefovich (Member)

This commit teaches the main query code path (i.e. ignoring sub- and
post-queries) to retry distributed plans as local in some cases. In
particular, we use this retry mechanism (sketched after this list) if:

  • the error is SQL retryable (i.e. it'll have a high chance of success
    during the local execution)
  • no data has been pushed to the result writer by the distributed query
    (this shouldn't be a frequent scenario since most SQL retryable errors
    are likely to occur during the plan setup / before any data can be
    produced by the query).
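
A minimal standalone Go sketch of that gate (hypothetical names; in the
actual change the logic lives in pkg/sql/distsql_running.go and uses the
executor's own error classification):

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient stands in for a SQL-retryable error, e.g. a connection to a
// remote node breaking during distributed plan setup.
var errTransient = errors.New("rpc: connection refused")

// isSQLRetryableError is a stand-in for the executor's real error
// classification; here anything wrapping errTransient counts.
func isSQLRetryableError(err error) bool {
	return errors.Is(err, errTransient)
}

// shouldRerunLocally encodes the two conditions from the list above.
func shouldRerunLocally(err error, dataPushed bool) bool {
	// 1. The error must be SQL-retryable (high chance of success locally).
	if !isSQLRetryableError(err) {
		return false
	}
	// 2. Nothing may have reached the result writer yet; otherwise the
	// client could observe duplicated output.
	return !dataPushed
}

func main() {
	fmt.Println(shouldRerunLocally(errTransient, false)) // true: rerun locally
	fmt.Println(shouldRerunLocally(errTransient, true))  // false: rows already sent
}
```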

This retry mechanism allows us to hide transient network problems,
and - more importantly - in the multi-tenant model it allows us to work
around the problem of a "not ready" SQL instance being used for
DistSQL planning (e.g. the instance might have been brought down, but
the cache on top of the system table hasn't been updated accordingly).
I believe that no matter what improvements we make to the instance
cache, there will always be a window (which should hopefully keep
getting smaller - according to David T it's currently 45s, but he hopes
to bring it down to 7s or so) during which the instance cache is stale,
so the DistSQL planner could use "not ready" instances.

The rationale for why it is ok to do this retry is that we create
brand-new processors that aren't affiliated to the distributed plan
that was just cleaned up. It's worth mentioning that the planNode
tree couldn't have been reused in this way, but if we needed to
execute any planNodes directly, then we would have to run such a plan
in a local fashion. In other words, the fact that we had a
distributed plan initially guarantees that we don't have any
planNodes to be concerned about.

A possible downside of this approach is that it increases overall query
latency, so ideally we wouldn't plan on "not ready" instances in the
first place (and we have issues tracking improvements there), but given
that we now fully parallelize the setup of distributed plans, the
latency increase should be bounded, assuming that most retryable errors
occur during the distributed plan setup.

Addresses: #100578.
Epic: None

Release note: None

@yuzefovich yuzefovich requested review from msirek, cucaroach and a team June 23, 2023 18:20
@yuzefovich yuzefovich requested review from a team as code owners June 23, 2023 18:20

msirek (Contributor) left a comment

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 4590 at r2 (raw file):

) *PlanningCtx {
	distribute := distributionType == DistributionTypeAlways || (distributionType == DistributionTypeSystemTenantOnly && evalCtx.Codec.ForSystemTenant())
	planCtx := dsp.newPlanningCtxForLocal(evalCtx, planner, localityFiler)

Why does this always build a local planning context, even when distribute is true?


pkg/sql/distsql_running.go line 1238 at r2 (raw file):

	r.status = execinfra.NeedMoreRows
	r.closed = false
	r.stats = stats

Please reset r.progressAtomic. Maybe something like:
atomic.StoreUint64(r.progressAtomic, 0)

Also, if r.discardRows is true, we could be retrying with a non-initialized egress counter, so maybe add something like this:

		if r.egressCounter != nil {
			r.egressCounter = NewTenantNetworkEgressCounter()
		}

It appears discardRows is only used for testing, but we'd still want the counter to be correct for tests.
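
For illustration, a standalone sketch of such a reset (stand-in types, not
the actual `DistSQLReceiver` code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// receiver is a stand-in for DistSQLReceiver with the fields discussed above.
type receiver struct {
	closed        bool
	progress      atomic.Uint64 // stand-in for progressAtomic
	egressCounter *int          // stand-in for the tenant egress counter
}

// resetForRerun clears per-run state before the local retry: reopen the
// receiver, zero the progress gauge, and rebuild the egress counter if set.
func (r *receiver) resetForRerun() {
	r.closed = false
	r.progress.Store(0)
	if r.egressCounter != nil {
		fresh := 0 // stand-in for NewTenantNetworkEgressCounter()
		r.egressCounter = &fresh
	}
}

func main() {
	stale := 42
	r := &receiver{closed: true, egressCounter: &stale}
	r.progress.Store(7)
	r.resetForRerun()
	fmt.Println(r.closed, r.progress.Load(), *r.egressCounter) // false 0 0
}
```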


pkg/sql/distsql_running.go line 1963 at r2 (raw file):

		// brand-new processors that aren't affiliated to the distributed plan
		// that was just cleaned up. It's worth mentioning that the planNode
		// tree couldn't haven been reused in this way, but if we needed to

nit: "couldn't haven" --> "couldn't have"


pkg/sql/distsql_running.go line 1996 at r2 (raw file):

			return
		}
		finalizePlanWithRowCount(ctx, localPlanCtx, localPhysPlan, localPlanCtx.planner.curPlan.mainRowCount)

Would it be useful to check again for context cancellation here before calling dsp.Run, in case it happened while generating the new physical plan?
Also, please add the line:
recv.expectedRowsRead = int64(localPhysPlan.TotalEstimatedScannedRows)

This commit fixes the `json_populate_record` builtin in an edge case. In
particular, this generator builtin calls `eval.PopulateRecordWithJSON`,
which modifies the passed-in tuple in-place, and right now the builtin
passes the input tuple directly. This leads to modification of the
Datum, which is not allowed. However, this is mostly a philosophical bug
that doesn't lead to any actual issues, since from a single input tuple
the builtin only generates a single output tuple. I noticed this problem
when I tried to re-execute the distributed query as local, but the tuple
was corrupted for that second local execution.

Release note: None
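
A standalone sketch of the bug pattern (stand-in code, not the builtin
itself): a callee that writes into the tuple it is handed corrupts the
caller's value, which only becomes visible once that value is evaluated a
second time, as on the local re-run.

```go
package main

import "fmt"

// populateInPlace stands in for eval.PopulateRecordWithJSON, which writes
// into the tuple it is given.
func populateInPlace(tuple []string, val string) {
	tuple[0] = val
}

func main() {
	input := []string{"original"}

	// Buggy pattern: passing the input tuple lets the callee corrupt it.
	populateInPlace(input, "mutated")
	fmt.Println(input[0]) // "mutated": the shared input was modified

	// Fixed pattern: hand the callee a copy so the input stays intact for
	// any later evaluation.
	input = []string{"original"}
	scratch := make([]string, len(input))
	copy(scratch, input)
	populateInPlace(scratch, "mutated")
	fmt.Println(input[0]) // "original": the input is preserved
}
```
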
yuzefovich (Member, Author) left a comment

Thanks for taking a look!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)


pkg/sql/distsql_running.go line 1238 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…

Please reset r.progressAtomic. Maybe something like:
atomic.StoreUint64(r.progressAtomic, 0)

Also, if r.discardRows is true, we could be retrying with a non-initialized egress counter, so maybe add something like this:

		if r.egressCounter != nil {
			r.egressCounter = NewTenantNetworkEgressCounter()
		}

It appears discardRows is only used for testing, but we'd still want the counter to be correct for tests.

Nice catch, done. For progressAtomic I was somewhat deliberately being lazy (this progress tracking is broken in the vectorized engine due to #55758), but it's better to be thorough here.


pkg/sql/distsql_running.go line 1996 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…

Would it be useful to check again for context cancellation here before calling dsp.Run, in case it happened while generating the new physical plan?
Also, please add the line:
recv.expectedRowsRead = int64(localPhysPlan.TotalEstimatedScannedRows)

Updated expectedRowsRead (I think it shouldn't matter since this field remains unchanged from the distributed run and I'd guess the local estimate would be the same, but in any case it's better to be explicit).

Checking context cancellation doesn't seem that useful - we don't do that in the distributed run, and this local execution will detect it sooner if the context has just been canceled during the plan generation.


pkg/sql/distsql_physical_planner.go line 4590 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…

Why does this always build a local planning context, even when distribute is true?

I thought it would be clearer to have this helper method that sets up the planning context for local execution, and then the caller could extend it for distributed execution if necessary. It seems that this was confusing, so I removed the new helper in favor of using the existing constructor method.

msirek (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)

cucaroach (Contributor)

Will this fix #102839 I wonder?

yuzefovich (Member, Author)

yuzefovich commented Jun 27, 2023

I think it'll fix the select-job failures (where the internal executor uses DistSQL to run a query and it fails), but this failure is just wrong (perhaps it was another manifestation of the bug in the IE already fixed by #101477, although that seems unlikely).

Tommy, I'll wait for your review here too since this change deserves extra scrutiny.

@@ -1486,6 +1504,7 @@ func (r *DistSQLReceiver) Push(
if commErr := r.resultWriterMu.row.AddRow(r.ctx, r.row); commErr != nil {
r.handleCommErr(commErr)
}
r.dataPushed = true
cucaroach (Contributor):

What about the pushMeta early return path above?

// TestDistributedQueryErrorIsRetriedLocally verifies that if a query with a
// distributed plan results in a SQL retryable error, then it is rerun as local
// transparently.
func TestDistributedQueryErrorIsRetriedLocally(t *testing.T) {
cucaroach (Contributor):

Can we add a test that makes sure the cluster setting works to disable this?


blathers-crl bot commented Jun 27, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review comments" should be squashed into one commit and pushed with --force
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Jun 27, 2023
@yuzefovich yuzefovich removed the O-community Originated from the community label Jun 27, 2023
yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)


pkg/sql/distsql_running.go line 1507 at r4 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

What about the pushMeta early return path above?

I audited the code, and I think it'd be ok to ignore all metadata types for the purposes of the retry since it doesn't make it to the result writer. The only exception is MetadataResultWriter, and users of that don't call PlanAndRun, so they aren't getting this automatic retry. That said, it seems like a good idea to consider any piece of non-error metadata as "data pushed" as long as it makes it to the result writer, so I adjusted the dataPushed boolean in that case. We need to exclude the error metadata because we know it'll be communicated when we want this retry to kick in, and we'll override it with the result of the local execution.
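
A standalone sketch of that rule (stand-in types, not the actual receiver):
rows and non-error metadata that reach the result writer mark data as
pushed, while error metadata never does.

```go
package main

import "fmt"

// meta is a stand-in for execinfrapb.ProducerMetadata.
type meta struct {
	err error // non-nil for error metadata
}

// receiver is a stand-in for DistSQLReceiver.
type receiver struct {
	dataPushed bool
}

// pushMeta mirrors the rule above: error metadata is excluded because the
// retry will override it with the local execution's result.
func (r *receiver) pushMeta(m meta) {
	if m.err != nil {
		return // error metadata never counts as "data pushed"
	}
	// ...forward m to the result writer...
	r.dataPushed = true
}

func main() {
	r := &receiver{}
	r.pushMeta(meta{err: fmt.Errorf("boom")})
	fmt.Println(r.dataPushed) // false: errors don't block the local retry
	r.pushMeta(meta{})
	fmt.Println(r.dataPushed) // true: non-error metadata reached the writer
}
```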


pkg/sql/distsql_running_test.go line 980 at r4 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

Can we add a test that makes sure the cluster setting works to disable this?

Done.

// successful local execution.
return
}
log.VEventf(ctx, 1, "encountered an error when running the distributed plan, re-running it as local: %v", distributedErr)
cucaroach (Contributor):

Is this enough observability to see how frequently this is happening or should we have some telemetry too?

yuzefovich (Member, Author) left a comment

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)


pkg/sql/distsql_running.go line 1993 at r5 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

Is this enough observability to see how frequently this is happening or should we have some telemetry too?

Hm, I'm not sure how we're going to use it, but why not - added a couple of counters.
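
A standalone sketch of such counters (hypothetical names; CockroachDB's
real telemetry plumbing differs):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical counters tracking how often the local re-run fires and how
// often it succeeds.
var (
	rerunAttempts  atomic.Int64
	rerunSuccesses atomic.Int64
)

func recordRerun(succeeded bool) {
	rerunAttempts.Add(1)
	if succeeded {
		rerunSuccesses.Add(1)
	}
}

func main() {
	recordRerun(true)
	recordRerun(false)
	fmt.Printf("attempts=%d successes=%d\n",
		rerunAttempts.Load(), rerunSuccesses.Load())
}
```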


craig bot commented Jun 28, 2023

Build succeeded:

@craig craig bot merged commit 06a051f into cockroachdb:master Jun 28, 2023
6 of 7 checks passed
cucaroach (Contributor)

Should we backport this to 23.1?

yuzefovich (Member, Author)

Seems a bit risky, what's the rationale?

cucaroach (Contributor)

> Seems a bit risky, what's the rationale?

I was suffering under the delusion that 23.1 had DistSQL turned on for the internal executor and that this was the cause of some 23.1 flakes, but I've recovered ;-)

DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Oct 4, 2023
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Oct 18, 2023
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Oct 19, 2023
In cockroachdb#105451, we added logic to locally retry a distributed query
after an error. However, the retry logic unconditionally updated a
field of `DistSQLReceiver` that may be nil, which could cause a
nil-pointer error in some code paths (e.g. apply-join). This patch
adds a check that the field is non-nil, as is done for other places
where it is updated.

There is no release note because the change has not yet made it into
a release.

Fixes cockroachdb#111327

Release note: None
craig bot pushed a commit that referenced this pull request Oct 20, 2023
111713: sql: fix nil-pointer error in local retry r=DrewKimball a=DrewKimball

#### tree: return correct parse error for pg_lsn

This patch changes the error returned upon failing to parse a PG_LSN
value to match postgres. Previously, the error was an internal error.

Informs #111327

Release note: None

#### sql: fix nil-pointer error in local retry

In #105451, we added logic to locally retry a distributed query
after an error. However, the retry logic unconditionally updated a
field of `DistSQLReceiver` that may be nil, which could cause a
nil-pointer error in some code paths (e.g. apply-join). This patch
adds a check that the field is non-nil, as is done for other places
where it is updated.

There is no release note because the change has not yet made it into
a release.

Fixes #111327

Release note: None

112654: opt: fix inverted index constrained scans for equality filters r=mgartner a=mgartner

#### opt: fix inverted index constrained scans for equality filters

This commit fixes a bug introduced in #101178 that allowed the optimizer
to generate inverted index scans on columns that are not filtered by
the query. For example, an inverted index over the column `j1` could be
scanned for a filter involving a different column, like `j2 = '5'`. The
bug is caused by a simple omission of the code that must check that the
column in the filter is an indexed column.

Fixes #111963

There is no release note because this bug is not present in any
releases.

Release note: None
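
A standalone sketch of the omitted check (stand-in types, not the
optimizer's code):

```go
package main

import "fmt"

// invertedIndex is a stand-in for an inverted index over a single column.
type invertedIndex struct{ indexedCol string }

// canConstrainScan shows the missing check: the filtered column must be
// the column the inverted index actually indexes.
func canConstrainScan(idx invertedIndex, filterCol string) bool {
	return idx.indexedCol == filterCol
}

func main() {
	idx := invertedIndex{indexedCol: "j1"}
	fmt.Println(canConstrainScan(idx, "j1")) // true: filter is on the indexed column
	fmt.Println(canConstrainScan(idx, "j2")) // false: the bug allowed this scan
}
```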

#### randgen: generate single-column indexes more often

This commit makes `randgen` more likely to generate single-column
indexes. It is motivated by the bug #111963, which surprisingly lived on
the master branch for six months without being detected. It's not
entirely clear why TLP or other randomized tests did not catch the bug,
which has such a simple reproduction.

One theory is that indexes tend to be multi-column and constrained scans
on multi-column inverted indexes are not commonly planned for randomly
generated queries because the set of requirements to generate the scan
are very specific: the query must hold each prefix column constant, e.g.
`a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating
such an expression may be so low that the bug was not caught.

By making 10% of indexes single-column, this bug may have been more
likely to be caught because only the inverted index column needs to be
constrained by an equality filter.

Release note: None


112690: sql: disallow invocation of procedures outside of CALL r=mgartner a=mgartner

#### sql: disallow invocation of procedures outside of CALL

This commit adds some missing checks to ensure that procedures cannot be
invoked in any context other than as the root expression of a `CALL`
statement.

Epic: CRDB-25388

Release note: None

#### sql: add tests with function invocation in procedure argument

This commit adds a couple of tests that show that functions can be used
in procedure argument expressions.

Release note: None


112698: sql: clarify comments/naming of descriptorChanged flag r=rafiss a=rafiss

fixes #110727
Release note: None

112701: sql/logictest: fix flakes in select_for_update_read_committed r=mgartner a=mgartner

The `select_for_update_read_committed` tests were flaking because not
all statements were being run under READ COMMITTED isolation. The logic
test infrastructure does not allow fine-grained control of sessions, and
setting the isolation level in one statement would only apply to a
single session. Subsequent statements are not guaranteed to run in the
same session because they could run in any session in the connection
pool. This commit wraps each statement in an explicit transaction with
an explicit isolation level to ensure READ COMMITTED is used.

In the future, we should investigate allowing fine-grained and explicit
control of sessions in logic tests.

Fixes #112677

Release note: None
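
The same idea sketched with Go's database/sql (illustrative only; the
logic tests use their own harness, not this API):

```go
package example

import (
	"context"
	"database/sql"
)

// runReadCommitted wraps a single statement in its own explicit transaction
// with an explicit isolation level, so the setting cannot leak to, or depend
// on, whichever pooled session executes the statement.
func runReadCommitted(ctx context.Context, db *sql.DB, stmt string) error {
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelReadCommitted})
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, stmt); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}
```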


112726: sql: make tests error if a leaf txn is not created when expected r=rharding6373 a=rharding6373

This adds a test-only error if a leaf transaction is expected to be used by a plan but a root transaction is used instead.

Epic: none
Informs: #111097

Release note: None

112767: log: fix stacktrace test goroutine counts r=rickystewart a=dhartunian

Previously, we would use the count of the string `goroutine ` as a proxy for the number of goroutines in the stacktrace. This stopped working in Go 1.21 due to this change:
golang/go@51225f6

We should consider using a stacktrace parser in the future.

Supports #112088

Epic: None
Release note: None

Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Oct 20, 2023