Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/server: support tenant auto-upgrade #102427

Merged
merged 1 commit into from Oct 10, 2023
Merged

Conversation

healthy-pod
Copy link
Contributor

@healthy-pod healthy-pod commented Apr 27, 2023

Previously, tenant upgrades in UA required a user to issue a SET CLUSTER SETTING version =
statement to finalize an upgrade. This UX is different from what we have in single-tenant
SH/Dedicated deployments in that we have auto upgrade in the later that starts an attempt
to finalize cluster version after every node startup incase the node was started with a new
binary version that all nodes now support upgrading to.

In UA, we have two differences:

  1. What to upgrade?
    • In a multi-tenant deployment, the storage and sql layers are upgraded separately.
    • The storage layer upgrade finalization is still handled by the existing auto upgrade logic.
    • In this change, we ensure that the sql layer is also auto-upgraded when possible.
  2. When to upgrade?
    • In a single-tenant deployment, all layers share the same binary version and cluster version.
      Hence, an upgrade attempt is only needed when a new node starts to ensure that the cluster is
      auto-upgraded if the new binary version supports an upgrade.
    • In a multi-tenant deployment, in addition to the condition above, the sql server upgrade is
      also constrained by the storage cluster version. It is possible for all SQL instances to have
      binary versions that support an upgrade but the upgrade will still be blocked by the storage
      cluster version if it’s equal to the current tenant cluster version.

This code change does the following:

  1. Adds logic to run a SQL server upgrade attempt (mostly adopted from the original auto upgrade code)
    within the following ordered constraints (previously we merged sqlinstance: add binary_version column to instances table #98830 to make getting the binary
    versions of instances easier):

    • Ensure that the upgrade is not blocked by the secondary tenant's setting of preserve_downgrade_option
      or an all-tenant override of that value.
    • Exit if tenant cluster version is equal to the minimum instance binary version [upgrade already completed].
    • Upgrade to storage cluster version if the binary version of all SQL instances supports that.
    • Exit if storage cluster version is less than the minimum instance binary version [upgrade blocked
      due to low storage cluster version].
    • Upgrade to the minimum instance binary version.
  2. Runs the logic above when a SQL server is started.

    • This covers the case where a SQL server binary upgrade allows for an upgrade to the
      tenant cluster version.
  3. Checks for change in storage cluster version every 10 seconds and starts an upgrade attempt if
    it was changed.

    • This covers the case where the binary versions of all SQL instances allow for an upgrade
      but it’s blocked due to the storage cluster version.

Release note: Support for automatic finalization of virtual clusters version
upgrade was added. A new setting cluster.auto_upgrade.enabled was added to
enable and disable automatic cluster version upgrade (finalization) and it
will be used with automatic upgrade at both storage-level and virtual cluster
level (each has its own setting value though).
Epic: CRDB-20860

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@healthy-pod healthy-pod marked this pull request as ready for review May 1, 2023 11:53
@healthy-pod healthy-pod requested review from a team as code owners May 1, 2023 11:53
@healthy-pod healthy-pod requested review from knz and ajstorm May 1, 2023 11:53
@abarganier abarganier removed the request for review from a team May 2, 2023 17:30
Copy link
Collaborator

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I left a few comments. We'll also want to have someone else look at this before it merges because I'm trying to step away from this work, and you'll need someone else to take over reviews going forward.

Reviewed 4 of 8 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @knz)


pkg/server/tenant_auto_upgrade.go line 1 at r1 (raw file):

// Copyright 2018 The Cockroach Authors.

Nit: 2023


pkg/server/tenant_auto_upgrade.go line 41 at r1 (raw file):

			// on every change to the internal version of storage cluster version
			// within a short time period.
			case <-time.After(time.Second * 30):

nit: should we make this time configurable using a cluster setting? Doesn't have to be on this PR, but it may be something worth considering should the 30 seconds be too slow or too fast.


pkg/server/tenant_auto_upgrade.go line 46 at r1 (raw file):

					storageClusterVersion = latestStorageClusterVersion
					if err := s.startAttemptUpgrade(ctx); err != nil {
						log.Errorf(ctx, "failed to start an upgrade attempt")

Is there a reason we're not logging the returned error here? I'd think it might be better to wrap the error with "failed to start an upgrade attempt", and then log AND return the wrapped error.


pkg/server/tenant_auto_upgrade.go line 61 at r1 (raw file):

		retryOpts := retry.Options{
			InitialBackoff: time.Second,

Is there a reason why we're starting here at 1 second? Is there a downside to starting at 30 seconds?


pkg/server/tenant_auto_upgrade.go line 109 at r1 (raw file):

				continue
			case upgradeAlreadyCompleted:
				log.Info(ctx, "no need to upgrade, instance already at the newest version")

I'm wondering if this error message could be misleading. Isn't it possible to get into this state in cases where the storage cluster version matches the current tenant cluster version? In that case, the SQL instance isn't at the newest version, but the upgrade will be blocked due to the storage cluster not being upgraded yet.


pkg/server/tenant_auto_upgrade.go line 116 at r1 (raw file):

				// Fall out of the select below.
			default:
				panic(errors.AssertionFailedf("unhandled case: %d", status))

Nit: Will the panic here bring down the SQL server? If so, I'm wondering if we should return here and log something instead of panicking?


pkg/server/tenant_auto_upgrade.go line 119 at r1 (raw file):

			}

			upgradeRetryOpts := retry.Options{

I'm probably missing something obvious, but why are there two retries here? What are the implications of us dropping the retry here and relying on the retry above to handle everything?


pkg/server/tenant_auto_upgrade.go line 136 at r1 (raw file):

					log.Errorf(ctx, "error when finalizing cluster version upgrade: %v", err)
				} else {
					log.Info(ctx, "successfully upgraded cluster version")

Nit: Should we say "tenant cluster version" here? Also, should we output the new cluster version?


pkg/server/tenant_auto_upgrade.go line 193 at r1 (raw file):

	}

	minInstanceBinaryVersion := findMinBinaryVersion(instances)

This block could do with a comment which describes in detail the conditions under which we permit the upgrade.

Also, is it the case that this block is not entirely required? Could we not just proceed with the upgrade here and rely on the existing blocks in place to prevent the upgrade if it's not allowed?


pkg/ccl/kvccl/kvtenantccl/upgradeccl/tenant_upgrade_test.go line 118 at r1 (raw file):

	tenantRunner.CheckQueryResults(t, "SELECT * FROM t", [][]string{{"1"}, {"2"}})
	// Wait for the tenant to auto-finalize.
	tenantRunner.SucceedsSoonDuration = time.Second * 90

Can we add a test in here where we have preserve_downgrade_option set and ensure that the auto upgrade doesn't run? Also, does it make sense to add a test for all of the permutations (e.g. multiple SQL pods, one of which is at the wrong binary version) to ensure that auto upgrade won't run when it shouldn't?

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the new setting, it currently only applies to tenant upgrades but the name/help text doesn't make that very clear. I could see applying it to the non-tenant code path as well or renaming it.

@healthy-pod
Copy link
Contributor Author

For the new setting, it currently only applies to tenant upgrades but the name/help text doesn't make that very clear. I could see applying it to the non-tenant code path as well or renaming it.

Makes sense, updated to virtual_cluster.auto_upgrade.enabled.

@healthy-pod
Copy link
Contributor Author

Thanks for the review!

bors r=stevendanna

@craig
Copy link
Contributor

craig bot commented Oct 9, 2023

👎 Rejected by code reviews

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 10 files at r4, 4 of 8 files at r6, 4 of 6 files at r8, 3 of 3 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @stevendanna)


-- commits line 47 at r9:
Please include a release note that mentions the new cluster setting.


pkg/clusterversion/setting.go line 313 at r9 (raw file):

var VirtualClusterAutoUpgradeEnabled = settings.RegisterBoolSetting(
	settings.ApplicationLevel,
	"virtual_cluster.auto_upgrade.enabled",

Please name this cluster.auto_upgrade.enabled.
This is not just virtual clusters -- we will use the setting also in system tenant for the storage-level auto-upgrade.

This means the name of the variable should just be AutoUpgradeEnabled and the setting description should not contain (for virtual clusters).

Also mention in release note.


pkg/server/tenant_auto_upgrade.go line 59 at r9 (raw file):

				// is passed.
				storageClusterVersionChanged := storageClusterVersion != latestStorageClusterVersion
				if firstAttempt ||

May I suggest checking the cluster setting "auto upgrade enabled" here, so that we don't produce a log.Errorf if the auto upgrade is disabled.
(Errors in logs cause operators to investigate. We don't want this if the auto upgrade was simply disabled.)

Better yet; you could have a helper function "isAutoUpgradeEnabled" that checks both the cluster setting and the testing knob, and skip the call to startAttemptUpgrade in case the helper returns false.


pkg/server/settingswatcher/settings_watcher.go line 654 at r9 (raw file):

}

func (s *SettingsWatcher) GetVirtualClusterAutoUpgradeEnabledSettingValue() bool {

you can rename this according to the point discussed above.


pkg/ccl/kvccl/kvtenantccl/upgradeccl/tenant_upgrade_test.go line 118 at r9 (raw file):

	// Disable virtual_cluster.auto_upgrade.enabled setting for the tenant to prevent auto upgrade.
	tenantRunner.Exec(t, "SET CLUSTER SETTING virtual_cluster.auto_upgrade.enabled = false")

tip: use the setting variable's .Name() to generate the SQL to use in the test.

@healthy-pod
Copy link
Contributor Author

May I suggest checking the cluster setting "auto upgrade enabled" here, so that we don't produce a log.Errorf if the auto upgrade is disabled.
(Errors in logs cause operators to investigate. We don't want this if the auto upgrade was simply disabled.)

We do not return an error if auto upgrade is disabled, just nil.

@healthy-pod healthy-pod requested a review from knz October 9, 2023 18:35
Previously, tenant upgrades in UA required a user to issue a `SET CLUSTER SETTING version =`
statement to finalize an upgrade. This UX is different from what we have in single-tenant
SH/Dedicated deployments in that we have auto upgrade in the later that starts an attempt
to finalize cluster version after every node startup incase the node was started with a new
binary version that all nodes now support upgrading to.

In UA, we have two differences:

1. What to upgrade?
    - In a multi-tenant deployment, the storage and sql layers are upgraded separately.
    - The storage layer upgrade finalization is still handled by the existing auto upgrade logic.
    - In this change, we ensure that the sql layer is also auto-upgraded when possible.
2. When to upgrade?
    - In a single-tenant deployment, all layers share the same binary version and cluster version.
      Hence, an upgrade attempt is only needed when a new node starts to ensure that the cluster is
      auto-upgraded if the new binary version supports an upgrade.
    - In a multi-tenant deployment, in addition to the condition above, the sql server upgrade is
      also constrained by the storage cluster version. It is possible for all SQL instances to have
      binary versions that support an upgrade but the upgrade will still be blocked by the storage
      cluster version if it’s equal to the current tenant cluster version.

This code change does the following:

1. Adds logic to run a SQL server upgrade attempt (mostly adopted from the original auto upgrade code)
   within the following ordered constraints (previously we merged cockroachdb#98830 to make getting the binary
   versions of instances easier):
    - Ensure that the upgrade is not blocked by the secondary tenant's setting of preserve_downgrade_option
      or an all-tenant override of that value.
    - Exit if tenant cluster version is equal to the minimum instance binary version [upgrade already completed].
    - Upgrade to storage cluster version if the binary version of all SQL instances supports that.
    - Exit if storage cluster version is less than the minimum instance binary version [upgrade blocked
      due to low storage cluster version].
    - Upgrade to the minimum instance binary version.

2. Runs the logic above when a SQL server is started.
    - This covers the case where a SQL server binary upgrade allows for an upgrade to the
      tenant cluster version.

3. Checks for change in storage cluster version every 10 seconds and starts an upgrade attempt if
   it was changed.
    - This covers the case where the binary versions of all SQL instances allow for an upgrade
      but it’s blocked due to the storage cluster version.

Release note: Support for automatic finalization of virtual clusters version
upgrade was added. A new setting `cluster.auto_upgrade.enabled` was added to
enable and disable automatic cluster version upgrade (finalization) and it
will be used with automatic upgrade at both storage-level and virtual cluster
level (each has its own setting value though).
Epic: CRDB-20860
@healthy-pod
Copy link
Contributor Author

ready for another look

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 8 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @stevendanna)

@knz
Copy link
Contributor

knz commented Oct 10, 2023

bors r=stevendanna,knz

@craig
Copy link
Contributor

craig bot commented Oct 10, 2023

Build succeeded:

@craig craig bot merged commit bee7cfe into cockroachdb:master Oct 10, 2023
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants