pkg/server: support tenant auto-upgrade #102427

healthy-pod · 2023-04-27T16:52:14Z

Previously, tenant upgrades in UA required a user to issue a SET CLUSTER SETTING version =
statement to finalize an upgrade. This UX is different from what we have in single-tenant
SH/Dedicated deployments in that we have auto upgrade in the later that starts an attempt
to finalize cluster version after every node startup incase the node was started with a new
binary version that all nodes now support upgrading to.

In UA, we have two differences:

What to upgrade?
- In a multi-tenant deployment, the storage and sql layers are upgraded separately.
- The storage layer upgrade finalization is still handled by the existing auto upgrade logic.
- In this change, we ensure that the sql layer is also auto-upgraded when possible.
When to upgrade?
- In a single-tenant deployment, all layers share the same binary version and cluster version.
  Hence, an upgrade attempt is only needed when a new node starts to ensure that the cluster is
  auto-upgraded if the new binary version supports an upgrade.
- In a multi-tenant deployment, in addition to the condition above, the sql server upgrade is
  also constrained by the storage cluster version. It is possible for all SQL instances to have
  binary versions that support an upgrade but the upgrade will still be blocked by the storage
  cluster version if it’s equal to the current tenant cluster version.

This code change does the following:

Adds logic to run a SQL server upgrade attempt (mostly adopted from the original auto upgrade code)
within the following ordered constraints (previously we merged sqlinstance: add binary_version column to instances table #98830 to make getting the binary
versions of instances easier):
- Ensure that the upgrade is not blocked by the secondary tenant's setting of preserve_downgrade_option
  or an all-tenant override of that value.
- Exit if tenant cluster version is equal to the minimum instance binary version [upgrade already completed].
- Upgrade to storage cluster version if the binary version of all SQL instances supports that.
- Exit if storage cluster version is less than the minimum instance binary version [upgrade blocked
  due to low storage cluster version].
- Upgrade to the minimum instance binary version.
Runs the logic above when a SQL server is started.
- This covers the case where a SQL server binary upgrade allows for an upgrade to the
  tenant cluster version.
Checks for change in storage cluster version every 10 seconds and starts an upgrade attempt if
it was changed.
- This covers the case where the binary versions of all SQL instances allow for an upgrade
  but it’s blocked due to the storage cluster version.

Release note: Support for automatic finalization of virtual clusters version
upgrade was added. A new setting cluster.auto_upgrade.enabled was added to
enable and disable automatic cluster version upgrade (finalization) and it
will be used with automatic upgrade at both storage-level and virtual cluster
level (each has its own setting value though).
Epic: CRDB-20860

cockroach-teamcity · 2023-04-27T16:52:24Z

This change is

ajstorm

Nice work! I left a few comments. We'll also want to have someone else look at this before it merges because I'm trying to step away from this work, and you'll need someone else to take over reviews going forward.

Reviewed 4 of 8 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @knz)

pkg/server/tenant_auto_upgrade.go line 1 at r1 (raw file):

// Copyright 2018 The Cockroach Authors.

Nit: 2023

pkg/server/tenant_auto_upgrade.go line 41 at r1 (raw file):

			// on every change to the internal version of storage cluster version
			// within a short time period.
			case <-time.After(time.Second * 30):

nit: should we make this time configurable using a cluster setting? Doesn't have to be on this PR, but it may be something worth considering should the 30 seconds be too slow or too fast.

pkg/server/tenant_auto_upgrade.go line 46 at r1 (raw file):

					storageClusterVersion = latestStorageClusterVersion
					if err := s.startAttemptUpgrade(ctx); err != nil {
						log.Errorf(ctx, "failed to start an upgrade attempt")

Is there a reason we're not logging the returned error here? I'd think it might be better to wrap the error with "failed to start an upgrade attempt", and then log AND return the wrapped error.

pkg/server/tenant_auto_upgrade.go line 61 at r1 (raw file):

		retryOpts := retry.Options{
			InitialBackoff: time.Second,

Is there a reason why we're starting here at 1 second? Is there a downside to starting at 30 seconds?

pkg/server/tenant_auto_upgrade.go line 109 at r1 (raw file):

				continue
			case upgradeAlreadyCompleted:
				log.Info(ctx, "no need to upgrade, instance already at the newest version")

I'm wondering if this error message could be misleading. Isn't it possible to get into this state in cases where the storage cluster version matches the current tenant cluster version? In that case, the SQL instance isn't at the newest version, but the upgrade will be blocked due to the storage cluster not being upgraded yet.

pkg/server/tenant_auto_upgrade.go line 116 at r1 (raw file):

				// Fall out of the select below.
			default:
				panic(errors.AssertionFailedf("unhandled case: %d", status))

Nit: Will the panic here bring down the SQL server? If so, I'm wondering if we should return here and log something instead of panicking?

pkg/server/tenant_auto_upgrade.go line 119 at r1 (raw file):

			}

			upgradeRetryOpts := retry.Options{

I'm probably missing something obvious, but why are there two retries here? What are the implications of us dropping the retry here and relying on the retry above to handle everything?

pkg/server/tenant_auto_upgrade.go line 136 at r1 (raw file):

					log.Errorf(ctx, "error when finalizing cluster version upgrade: %v", err)
				} else {
					log.Info(ctx, "successfully upgraded cluster version")

Nit: Should we say "tenant cluster version" here? Also, should we output the new cluster version?

pkg/server/tenant_auto_upgrade.go line 193 at r1 (raw file):

	}

	minInstanceBinaryVersion := findMinBinaryVersion(instances)

This block could do with a comment which describes in detail the conditions under which we permit the upgrade.

Also, is it the case that this block is not entirely required? Could we not just proceed with the upgrade here and rely on the existing blocks in place to prevent the upgrade if it's not allowed?

pkg/ccl/kvccl/kvtenantccl/upgradeccl/tenant_upgrade_test.go line 118 at r1 (raw file):

	tenantRunner.CheckQueryResults(t, "SELECT * FROM t", [][]string{{"1"}, {"2"}})
	// Wait for the tenant to auto-finalize.
	tenantRunner.SucceedsSoonDuration = time.Second * 90

Can we add a test in here where we have preserve_downgrade_option set and ensure that the auto upgrade doesn't run? Also, does it make sense to add a test for all of the permutations (e.g. multiple SQL pods, one of which is at the wrong binary version) to ensure that auto upgrade won't run when it shouldn't?

stevendanna

For the new setting, it currently only applies to tenant upgrades but the name/help text doesn't make that very clear. I could see applying it to the non-tenant code path as well or renaming it.

healthy-pod · 2023-10-09T06:01:08Z

For the new setting, it currently only applies to tenant upgrades but the name/help text doesn't make that very clear. I could see applying it to the non-tenant code path as well or renaming it.

Makes sense, updated to virtual_cluster.auto_upgrade.enabled.

healthy-pod · 2023-10-09T08:14:05Z

Thanks for the review!

bors r=stevendanna

craig · 2023-10-09T08:14:09Z

👎 Rejected by code reviews

knz

Reviewed 1 of 10 files at r4, 4 of 8 files at r6, 4 of 6 files at r8, 3 of 3 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @stevendanna)

-- commits line 47 at r9:
Please include a release note that mentions the new cluster setting.

pkg/clusterversion/setting.go line 313 at r9 (raw file):

var VirtualClusterAutoUpgradeEnabled = settings.RegisterBoolSetting(
	settings.ApplicationLevel,
	"virtual_cluster.auto_upgrade.enabled",

Please name this cluster.auto_upgrade.enabled.
This is not just virtual clusters -- we will use the setting also in system tenant for the storage-level auto-upgrade.

This means the name of the variable should just be AutoUpgradeEnabled and the setting description should not contain (for virtual clusters).

Also mention in release note.

pkg/server/tenant_auto_upgrade.go line 59 at r9 (raw file):

				// is passed.
				storageClusterVersionChanged := storageClusterVersion != latestStorageClusterVersion
				if firstAttempt ||

May I suggest checking the cluster setting "auto upgrade enabled" here, so that we don't produce a log.Errorf if the auto upgrade is disabled.
(Errors in logs cause operators to investigate. We don't want this if the auto upgrade was simply disabled.)

Better yet; you could have a helper function "isAutoUpgradeEnabled" that checks both the cluster setting and the testing knob, and skip the call to startAttemptUpgrade in case the helper returns false.

pkg/server/settingswatcher/settings_watcher.go line 654 at r9 (raw file):

}

func (s *SettingsWatcher) GetVirtualClusterAutoUpgradeEnabledSettingValue() bool {

you can rename this according to the point discussed above.

pkg/ccl/kvccl/kvtenantccl/upgradeccl/tenant_upgrade_test.go line 118 at r9 (raw file):

	// Disable virtual_cluster.auto_upgrade.enabled setting for the tenant to prevent auto upgrade.
	tenantRunner.Exec(t, "SET CLUSTER SETTING virtual_cluster.auto_upgrade.enabled = false")

tip: use the setting variable's .Name() to generate the SQL to use in the test.

review is stale

healthy-pod · 2023-10-09T18:34:52Z

May I suggest checking the cluster setting "auto upgrade enabled" here, so that we don't produce a log.Errorf if the auto upgrade is disabled.
(Errors in logs cause operators to investigate. We don't want this if the auto upgrade was simply disabled.)

We do not return an error if auto upgrade is disabled, just nil.

Previously, tenant upgrades in UA required a user to issue a `SET CLUSTER SETTING version =` statement to finalize an upgrade. This UX is different from what we have in single-tenant SH/Dedicated deployments in that we have auto upgrade in the later that starts an attempt to finalize cluster version after every node startup incase the node was started with a new binary version that all nodes now support upgrading to. In UA, we have two differences: 1. What to upgrade? - In a multi-tenant deployment, the storage and sql layers are upgraded separately. - The storage layer upgrade finalization is still handled by the existing auto upgrade logic. - In this change, we ensure that the sql layer is also auto-upgraded when possible. 2. When to upgrade? - In a single-tenant deployment, all layers share the same binary version and cluster version. Hence, an upgrade attempt is only needed when a new node starts to ensure that the cluster is auto-upgraded if the new binary version supports an upgrade. - In a multi-tenant deployment, in addition to the condition above, the sql server upgrade is also constrained by the storage cluster version. It is possible for all SQL instances to have binary versions that support an upgrade but the upgrade will still be blocked by the storage cluster version if it’s equal to the current tenant cluster version. This code change does the following: 1. Adds logic to run a SQL server upgrade attempt (mostly adopted from the original auto upgrade code) within the following ordered constraints (previously we merged cockroachdb#98830 to make getting the binary versions of instances easier): - Ensure that the upgrade is not blocked by the secondary tenant's setting of preserve_downgrade_option or an all-tenant override of that value. - Exit if tenant cluster version is equal to the minimum instance binary version [upgrade already completed]. - Upgrade to storage cluster version if the binary version of all SQL instances supports that. - Exit if storage cluster version is less than the minimum instance binary version [upgrade blocked due to low storage cluster version]. - Upgrade to the minimum instance binary version. 2. Runs the logic above when a SQL server is started. - This covers the case where a SQL server binary upgrade allows for an upgrade to the tenant cluster version. 3. Checks for change in storage cluster version every 10 seconds and starts an upgrade attempt if it was changed. - This covers the case where the binary versions of all SQL instances allow for an upgrade but it’s blocked due to the storage cluster version. Release note: Support for automatic finalization of virtual clusters version upgrade was added. A new setting `cluster.auto_upgrade.enabled` was added to enable and disable automatic cluster version upgrade (finalization) and it will be used with automatic upgrade at both storage-level and virtual cluster level (each has its own setting value though). Epic: CRDB-20860

healthy-pod · 2023-10-09T18:43:05Z

ready for another look

knz

Reviewed 8 of 8 files at r10, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @stevendanna)

knz · 2023-10-10T08:47:29Z

bors r=stevendanna,knz

craig · 2023-10-10T09:28:53Z

Build succeeded:

Bazel Essential CI (Cockroach)

healthy-pod force-pushed the tau branch 9 times, most recently from 8b99a24 to be11008 Compare May 1, 2023 11:51

healthy-pod marked this pull request as ready for review May 1, 2023 11:53

healthy-pod requested review from a team as code owners May 1, 2023 11:53

healthy-pod requested review from knz and ajstorm May 1, 2023 11:53

abarganier removed the request for review from a team May 2, 2023 17:30

ajstorm requested changes May 10, 2023

View reviewed changes

healthy-pod force-pushed the tau branch 7 times, most recently from c9e86c2 to 23471bb Compare May 30, 2023 01:08

healthy-pod force-pushed the tau branch 3 times, most recently from 15c2d11 to c79f94b Compare June 15, 2023 23:23

healthy-pod force-pushed the tau branch 2 times, most recently from b1aa7a8 to 9eaff5c Compare June 20, 2023 23:02

healthy-pod force-pushed the tau branch 3 times, most recently from bd1d480 to 7b8cb86 Compare October 6, 2023 04:05

healthy-pod requested a review from stevendanna October 6, 2023 04:06

stevendanna approved these changes Oct 6, 2023

View reviewed changes

healthy-pod force-pushed the tau branch 4 times, most recently from 8411240 to d1fee77 Compare October 6, 2023 19:57

stevendanna reviewed Oct 9, 2023

View reviewed changes

healthy-pod force-pushed the tau branch 2 times, most recently from 4b4a36d to e28c9fc Compare October 9, 2023 05:58

healthy-pod force-pushed the tau branch from e28c9fc to 05c8ba8 Compare October 9, 2023 06:04

healthy-pod requested a review from ajstorm October 9, 2023 08:15

knz suggested changes Oct 9, 2023

View reviewed changes

healthy-pod force-pushed the tau branch 2 times, most recently from 2fc68fe to dcecfc9 Compare October 9, 2023 18:31

healthy-pod requested a review from knz October 9, 2023 18:35

healthy-pod force-pushed the tau branch from dcecfc9 to a0b6f5c Compare October 9, 2023 18:40

knz approved these changes Oct 10, 2023

View reviewed changes

craig bot merged commit bee7cfe into cockroachdb:master Oct 10, 2023
7 of 8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pkg/server: support tenant auto-upgrade #102427

pkg/server: support tenant auto-upgrade #102427

healthy-pod commented Apr 27, 2023 •

edited

cockroach-teamcity commented Apr 27, 2023

ajstorm left a comment

stevendanna left a comment

healthy-pod commented Oct 9, 2023

healthy-pod commented Oct 9, 2023

craig bot commented Oct 9, 2023

knz left a comment

healthy-pod commented Oct 9, 2023

healthy-pod commented Oct 9, 2023

knz left a comment

knz commented Oct 10, 2023

craig bot commented Oct 10, 2023

pkg/server: support tenant auto-upgrade #102427

pkg/server: support tenant auto-upgrade #102427

Conversation

healthy-pod commented Apr 27, 2023 • edited

cockroach-teamcity commented Apr 27, 2023

ajstorm left a comment

Choose a reason for hiding this comment

stevendanna left a comment

Choose a reason for hiding this comment

healthy-pod commented Oct 9, 2023

healthy-pod commented Oct 9, 2023

craig bot commented Oct 9, 2023

knz left a comment

Choose a reason for hiding this comment

healthy-pod commented Oct 9, 2023

healthy-pod commented Oct 9, 2023

knz left a comment

Choose a reason for hiding this comment

knz commented Oct 10, 2023

craig bot commented Oct 10, 2023

healthy-pod commented Apr 27, 2023 •

edited