
sql: PartitionSpan should only use healthy nodes in mixed-process mode #111337

Merged: 1 commit into cockroachdb:master from exclude-unhealthy-nodes, Oct 4, 2023

Conversation

stevendanna (Collaborator) commented:

Previously, when running in mixed-process mode, the DistSQLPlanner's PartitionSpans method assumed that it could directly assign a given span to the SQLInstanceID matching the NodeID of whichever replica the replica oracle returned, without regard to whether that SQL instance was available.

This differs from the system tenant code paths, which proactively check node health, and from the non-mixed-process MT code paths, which use an eventually consistent view of healthy nodes.

As a result, operations that use PartitionSpans, such as BACKUP, could fail when a node was down.

Here, we make the mixed-process case work more like the separate-process case, in which we only use nodes returned by the instance reader. That list should eventually exclude any down nodes.
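
To illustrate the shape of this change: the planner builds a set of healthy SQL instances from the instance reader and maps a replica's NodeID to its co-located SQLInstanceID only when that instance appears in the set, otherwise falling back to the gateway (one plausible fallback, used here for illustration). The following is a minimal standalone sketch, not the actual patch; NodeID, SQLInstanceID, and instanceInfo are stand-ins for the real roachpb/base types, and the real logic lives in pkg/sql/distsql_physical_planner.go.

    package main

    import "fmt"

    // Stand-ins for roachpb.NodeID and base.SQLInstanceID.
    type NodeID int
    type SQLInstanceID int

    // instanceInfo is a stand-in for the instance metadata the instance
    // reader returns.
    type instanceInfo struct{ ID SQLInstanceID }

    // makeInstanceResolver returns a closure that maps a replica's NodeID to
    // a SQL instance, using only instances the reader currently reports. In
    // mixed-process mode the instance co-located with node N has ID N, so
    // the mapping is the identity, but only when that instance is in the
    // healthy set.
    func makeInstanceResolver(gateway SQLInstanceID, healthy []instanceInfo) func(NodeID) SQLInstanceID {
    	healthySet := make(map[SQLInstanceID]struct{}, len(healthy))
    	for _, inst := range healthy {
    		healthySet[inst.ID] = struct{}{}
    	}
    	return func(nodeID NodeID) SQLInstanceID {
    		if _, ok := healthySet[SQLInstanceID(nodeID)]; ok {
    			return SQLInstanceID(nodeID)
    		}
    		// The co-located instance is not in the healthy list; plan
    		// this span on the gateway instead.
    		return gateway
    	}
    }

    func main() {
    	resolve := makeInstanceResolver(1, []instanceInfo{{1}, {2}, {4}})
    	fmt.Println(resolve(2)) // 2: instance healthy, planned locally
    	fmt.Println(resolve(3)) // 1: node 3 is absent, falls back to the gateway
    }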

An alternative (or perhaps an addition) would be to allow MT planning to do direct status checks, similar to how they are done for the system tenant.

While reading this code, I also noted that we don't do DistSQL version compatibility checks as we do in the system tenant case. I am not sure of the impact of that.

Finally, this also adds another error to our list of non-permanent errors: if we fail to find a SQL instance, we don't treat that as permanent.
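
The retry-classification hunk isn't shown in this conversation, but its shape is a sentinel check inside the helper that decides whether a job error is permanent. A hypothetical sketch under that assumption (the sentinel name and helper below are invented for illustration; the real ones live in the CockroachDB jobs/sql packages):

    package main

    import (
    	"errors"
    	"fmt"
    )

    // errNoSQLInstance stands in for the "could not find a healthy SQL
    // instance" error the planner can now return.
    var errNoSQLInstance = errors.New("no healthy SQL instance for node")

    // isPermanentErr reports whether a retrying caller (e.g. a job) should
    // give up. Failing to find a SQL instance is transient, since the
    // instance list is eventually consistent, so it is excluded from the
    // permanent set.
    func isPermanentErr(err error) bool {
    	if errors.Is(err, errNoSQLInstance) {
    		return false // retry: the instance list may catch up
    	}
    	// ... other classification rules elided ...
    	return true
    }

    func main() {
    	fmt.Println(isPermanentErr(errNoSQLInstance))          // false: retried
    	fmt.Println(isPermanentErr(errors.New("other error"))) // true: permanent
    }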

Fixes #111319

Release note (bug fix): When using a private preview of physical cluster replication, in some circumstances the source cluster would be unable to take backups when a source cluster node was unavailable.

@stevendanna stevendanna requested review from a team as code owners September 27, 2023 12:28
@stevendanna stevendanna requested review from rhu713 and yuzefovich and removed request for a team September 27, 2023 12:28

@yuzefovich (Member) left a comment:

Nice! :lgtm:

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @stevendanna)


-- commits line 26 at r1:
It's not a concern because we don't plan to bump DistSQL version anymore (#98550).


pkg/sql/distsql_physical_planner.go line 1402 at r1 (raw file):

	ctx context.Context, planCtx *PlanningCtx,
) func(nodeID roachpb.NodeID) base.SQLInstanceID {
	allhealthy, err := dsp.sqlAddressResolver.GetAllInstances(ctx)

nit: s/allhealthy/allHealthy/.


pkg/sql/distsql_physical_planner.go line 1405 at r1 (raw file):

	if err != nil {
		log.Warningf(ctx, "could not get all instances: %v", err)
		return func(_ roachpb.NodeID) base.SQLInstanceID {

nit: we already have dsp.alwaysUseGateway defined for this.


pkg/sql/distsql_physical_planner.go line 1409 at r1 (raw file):

		}
	}
	healthyNodes := make(map[base.SQLInstanceID]struct{})

nit: we could pre-size the map with len(allhealthy).
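
Putting the three nits together: rename allhealthy to allHealthy, reuse the predefined dsp.alwaysUseGateway resolver on the error path, and pre-size the map. A standalone sketch of how the error path and healthy-set construction might compose (stand-in types and a stubbed getAllInstances; not the final diff):

    package main

    import (
    	"errors"
    	"fmt"
    	"log"
    )

    type NodeID int
    type SQLInstanceID int

    type planner struct {
    	gateway SQLInstanceID
    	// alwaysUseGateway mirrors the predefined resolver the reviewer
    	// points at: every span is planned on the gateway.
    	alwaysUseGateway func(NodeID) SQLInstanceID
    }

    // getAllInstances stands in for dsp.sqlAddressResolver.GetAllInstances
    // and fails here so the example exercises the error path.
    func getAllInstances() ([]SQLInstanceID, error) {
    	return nil, errors.New("instance reader unavailable")
    }

    func (p *planner) makeResolver() func(NodeID) SQLInstanceID {
    	allHealthy, err := getAllInstances()
    	if err != nil {
    		log.Printf("could not get all instances: %v", err)
    		// Reuse the existing resolver instead of allocating a new closure.
    		return p.alwaysUseGateway
    	}
    	// Pre-size the healthy set, per the review nit.
    	healthySet := make(map[SQLInstanceID]struct{}, len(allHealthy))
    	for _, id := range allHealthy {
    		healthySet[id] = struct{}{}
    	}
    	return func(n NodeID) SQLInstanceID {
    		if _, ok := healthySet[SQLInstanceID(n)]; ok {
    			return SQLInstanceID(n)
    		}
    		return p.gateway
    	}
    }

    func main() {
    	p := &planner{gateway: 1}
    	p.alwaysUseGateway = func(NodeID) SQLInstanceID { return p.gateway }
    	fmt.Println(p.makeResolver()(3)) // 1: reader failed, plan on the gateway
    }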

@yuzefovich (Member) left a comment:

Also, do we want to backport this to 23.1?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @stevendanna)

@stevendanna added the backport-23.1.x label (Flags PRs that need to be backported to 23.1) on Sep 28, 2023
@stevendanna force-pushed the exclude-unhealthy-nodes branch 3 times, most recently from df55b39 to 5a503e4, on October 4, 2023 08:29
@stevendanna (Collaborator, Author) commented:

bors r=yuzefovich

craig bot commented Oct 4, 2023

Build succeeded:

craig bot merged commit 72dad91 into cockroachdb:master on Oct 4, 2023
7 of 8 checks passed
blathers-crl bot commented Oct 4, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 44fac37 to blathers/backport-release-23.1-111337: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels: backport-23.1.x (Flags PRs that need to be backported to 23.1)

Linked issue: backupccl,multi-tenant: BACKUP fails when node is down when using shared process multi-tenancy (#111319)