
sql: PartitionSpan should only use healthy nodes in mixed-process mode #111337

Merged: 1 commit into cockroachdb:master from exclude-unhealthy-nodes, Oct 4, 2023

Conversation

stevendanna (Collaborator) commented:

Previously, when running in mixed-process mode, the DistSQLPlanner's PartitionSpans method assumed that it could directly assign a given span to the SQLInstanceID matching the NodeID of whichever replica the replica oracle returned, without regard to whether that SQL instance was available.

This differs from the system tenant code paths, which proactively check node health, and from the non-mixed-process MT code paths, which use an eventually consistent view of healthy nodes.

As a result, operations that use PartitionSpans, such as BACKUP, could fail when a node was down.

Here, we make the mixed-process case work more like the separate-process case, in which we only use nodes returned by the instance reader. That list should eventually exclude any down nodes.
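
To illustrate the shape of this change: the planner builds a set of healthy SQL instances from the instance reader and maps a replica's NodeID to its co-located SQLInstanceID only when that instance appears in the set, otherwise falling back to the gateway (one plausible fallback, used here for illustration). The following is a minimal standalone sketch, not the actual patch; NodeID, SQLInstanceID, and instanceInfo are stand-ins for the real roachpb/base types, and the real logic lives in pkg/sql/distsql_physical_planner.go.

    package main

    import "fmt"

    // Stand-ins for roachpb.NodeID and base.SQLInstanceID.
    type NodeID int
    type SQLInstanceID int

    // instanceInfo is a stand-in for the instance metadata the instance
    // reader returns.
    type instanceInfo struct{ ID SQLInstanceID }

    // makeInstanceResolver returns a closure that maps a replica's NodeID to
    // a SQL instance, using only instances the reader currently reports. In
    // mixed-process mode the instance co-located with node N has ID N, so
    // the mapping is the identity, but only when that instance is in the
    // healthy set.
    func makeInstanceResolver(gateway SQLInstanceID, healthy []instanceInfo) func(NodeID) SQLInstanceID {
    	healthySet := make(map[SQLInstanceID]struct{}, len(healthy))
    	for _, inst := range healthy {
    		healthySet[inst.ID] = struct{}{}
    	}
    	return func(nodeID NodeID) SQLInstanceID {
    		if _, ok := healthySet[SQLInstanceID(nodeID)]; ok {
    			return SQLInstanceID(nodeID)
    		}
    		// The co-located instance is not in the healthy list; plan
    		// this span on the gateway instead.
    		return gateway
    	}
    }

    func main() {
    	resolve := makeInstanceResolver(1, []instanceInfo{{1}, {2}, {4}})
    	fmt.Println(resolve(2)) // 2: instance healthy, planned locally
    	fmt.Println(resolve(3)) // 1: node 3 is absent, falls back to the gateway
    }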

An alternative (or perhaps an addition) would be to allow MT planning to do direct status checks, similar to how they are done for the system tenant.

While reading this code, I also noted that we don't do DistSQL version compatibility checks as we do in the system tenant case. I am not sure of the impact of that.

Finally, this also adds another error to our list of non-permanent errors: if we fail to find a SQL instance, we don't treat that as permanent.
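
The retry-classification hunk isn't shown in this conversation, but its shape is a sentinel check inside the helper that decides whether a job error is permanent. A hypothetical sketch under that assumption (the sentinel name and helper below are invented for illustration; the real ones live in the CockroachDB jobs/sql packages):

    package main

    import (
    	"errors"
    	"fmt"
    )

    // errNoSQLInstance stands in for the "could not find a healthy SQL
    // instance" error the planner can now return.
    var errNoSQLInstance = errors.New("no healthy SQL instance for node")

    // isPermanentErr reports whether a retrying caller (e.g. a job) should
    // give up. Failing to find a SQL instance is transient, since the
    // instance list is eventually consistent, so it is excluded from the
    // permanent set.
    func isPermanentErr(err error) bool {
    	if errors.Is(err, errNoSQLInstance) {
    		return false // retry: the instance list may catch up
    	}
    	// ... other classification rules elided ...
    	return true
    }

    func main() {
    	fmt.Println(isPermanentErr(errNoSQLInstance))          // false: retried
    	fmt.Println(isPermanentErr(errors.New("other error"))) // true: permanent
    }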

Fixes #111319

Release note (bug fix): When using a private preview of physical cluster replication, in some circumstances the source cluster would be unable to take backups when a source cluster node was unavailable.

@stevendanna stevendanna requested review from a team as code owners September 27, 2023 12:28
@stevendanna stevendanna requested review from rhu713 and yuzefovich and removed request for a team September 27, 2023 12:28

@yuzefovich (Member) left a comment:

Nice! :lgtm:

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @stevendanna)


-- commits line 26 at r1:
It's not a concern because we don't plan to bump DistSQL version anymore (#98550).


pkg/sql/distsql_physical_planner.go line 1402 at r1 (raw file):

	ctx context.Context, planCtx *PlanningCtx,
) func(nodeID roachpb.NodeID) base.SQLInstanceID {
	allhealthy, err := dsp.sqlAddressResolver.GetAllInstances(ctx)

nit: s/allhealthy/allHealthy/.


pkg/sql/distsql_physical_planner.go line 1405 at r1 (raw file):

	if err != nil {
		log.Warningf(ctx, "could not get all instances: %v", err)
		return func(_ roachpb.NodeID) base.SQLInstanceID {

nit: we already have dsp.alwaysUseGateway defined for this.


pkg/sql/distsql_physical_planner.go line 1409 at r1 (raw file):

		}
	}
	healthyNodes := make(map[base.SQLInstanceID]struct{})

nit: we could pre-size the map with len(allhealthy).
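
Putting the three nits together: rename allhealthy to allHealthy, reuse the predefined dsp.alwaysUseGateway resolver on the error path, and pre-size the map. A standalone sketch of how the error path and healthy-set construction might compose (stand-in types and a stubbed getAllInstances; not the final diff):

    package main

    import (
    	"errors"
    	"fmt"
    	"log"
    )

    type NodeID int
    type SQLInstanceID int

    type planner struct {
    	gateway SQLInstanceID
    	// alwaysUseGateway mirrors the predefined resolver the reviewer
    	// points at: every span is planned on the gateway.
    	alwaysUseGateway func(NodeID) SQLInstanceID
    }

    // getAllInstances stands in for dsp.sqlAddressResolver.GetAllInstances
    // and fails here so the example exercises the error path.
    func getAllInstances() ([]SQLInstanceID, error) {
    	return nil, errors.New("instance reader unavailable")
    }

    func (p *planner) makeResolver() func(NodeID) SQLInstanceID {
    	allHealthy, err := getAllInstances()
    	if err != nil {
    		log.Printf("could not get all instances: %v", err)
    		// Reuse the existing resolver instead of allocating a new closure.
    		return p.alwaysUseGateway
    	}
    	// Pre-size the healthy set, per the review nit.
    	healthySet := make(map[SQLInstanceID]struct{}, len(allHealthy))
    	for _, id := range allHealthy {
    		healthySet[id] = struct{}{}
    	}
    	return func(n NodeID) SQLInstanceID {
    		if _, ok := healthySet[SQLInstanceID(n)]; ok {
    			return SQLInstanceID(n)
    		}
    		return p.gateway
    	}
    }

    func main() {
    	p := &planner{gateway: 1}
    	p.alwaysUseGateway = func(NodeID) SQLInstanceID { return p.gateway }
    	fmt.Println(p.makeResolver()(3)) // 1: reader failed, plan on the gateway
    }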

@yuzefovich (Member) left a comment:

Also, do we want to backport this to 23.1?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @stevendanna)

@stevendanna added the backport-23.1.x label (Flags PRs that need to be backported to 23.1) on Sep 28, 2023
@stevendanna force-pushed the exclude-unhealthy-nodes branch 3 times, most recently from df55b39 to 5a503e4, on October 4, 2023 08:29
@stevendanna (Collaborator, Author) commented:

bors r=yuzefovich

craig bot commented Oct 4, 2023

Build succeeded:

craig bot merged commit 72dad91 into cockroachdb:master on Oct 4, 2023
7 of 8 checks passed
blathers-crl bot commented Oct 4, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 44fac37 to blathers/backport-release-23.1-111337: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels: backport-23.1.x (Flags PRs that need to be backported to 23.1)

Linked issue: backupccl,multi-tenant: BACKUP fails when node is down when using shared process multi-tenancy (#111319)