Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi Tenant Servers Can Skip Upgrades #104884

Closed
JeffSwenson opened this issue Jun 14, 2023 · 1 comment · Fixed by #104821
Closed

Multi Tenant Servers Can Skip Upgrades #104884

JeffSwenson opened this issue Jun 14, 2023 · 1 comment · Fixed by #104821
Assignees
Labels
A-cluster-upgrades A-multitenancy Related to multi-tenancy branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-multitenant Issues owned by the multi-tenant virtual team

Comments

@JeffSwenson
Copy link
Collaborator

JeffSwenson commented Jun 14, 2023

There is a rare bug impacting the Serverless upgrade from 22.2 to 23.1. It was only triggered once when we upgraded our staging clusters and the staging clusters contain a few thousand tenants for scale testing purposes. Out of the thousands of staging Serverless clusters, only one skipped an upgrade job, which caused the cluster to enter a crash loop.

This is the implementation of algorithm for stepping a version gate. Each version v has a fence version v' that is 1 internal version less than v. The fence version is used to implement a barrier when upgrading a cluster.

A high level overview of the algorithm is:

  1. Send an RPC to each sql server activating version v'
  2. Run the upgrade job attached to v
  3. Send an RPC to each sql server activating version v
  4. Set the cluster setting version to v

There is a bug in the implementation that only impacts multi-tenant clusters. On the first upgrade applied by the loop, the mustPersistFenceVersion branch updates the system setting.

The purpose of the branch is to ensure that no servers running the old binary can start up. But it is setting the cluster setting version to v. If the binary crashes before running the upgrade for v, then when the server restarts, it reads the setting version v and ends up skipping the upgrade when the upgrade resumes.

Judging by the name of the mustPersistFenceVersion variable, I think the bug is a simple typo and updateSystemVersionSetting was intended to be called with the fence version v'.

Jira issue: CRDB-28765

@JeffSwenson JeffSwenson added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-multitenancy Related to multi-tenancy T-multitenant Issues owned by the multi-tenant virtual team A-cluster-upgrades labels Jun 14, 2023
@JeffSwenson JeffSwenson self-assigned this Jun 14, 2023
@michae2 michae2 added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 labels Jun 14, 2023
@michae2
Copy link
Collaborator

michae2 commented Jun 14, 2023

It looks like this is close to going in, so I propose we hold the v23.1.4 release to make sure we pick it up.

craig bot pushed a commit that referenced this issue Jun 14, 2023
104821: upgrade: checkpoint fence version for multi-tenant clusters r=JeffSwenson a=JeffSwenson

Previously, the tenant upgrade interlock checkpointed the version attached to the upgrade, when it should have checkpointed the fence version. This made it possible for an upgrade to skip versions. The conditions needed for this are rare, but could be triggered by crashes in the upgrade system or by the autoscaler downscaling a serverless cluster with multiple nodes.

Release Note: None

Fixes: #104884

104876: logic: keep cockroach-go logs in case of failure r=rafiss a=renatolabs

This commit makes use of the new `CockroachLogsDirOpt` option introduced in [1] to pass a custom directory where cockroach logs will be stored. This directory is relative to TMPDIR, which is part of the artifacts path when the test runs in CI. With this change, when a logic test using the cockroach-go config fails, we'll be able to see both the test logs as well as the cockroach logs themselves. The latter will be in the `cockroach.stderr` file.

[1] cockroachdb/cockroach-go#170

Epic: none

Release note: None

104879: upgrades: deflake TestBackfillJobsInfoTable r=stevendanna a=adityamaru

Previously, this test created jobs that could be adopted during test execution causing the number of payload/progress updates to be variable. With this change we create startable jobs so that they are not adopted and the number of payload/progress remain constant.

Fixes: #103046
Release note: None

Co-authored-by: Jeff <swenson@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
@craig craig bot closed this as completed in 4f2eca6 Jun 14, 2023
blathers-crl bot pushed a commit that referenced this issue Jun 14, 2023
Previously, the tenant upgrade interlock checkpointed the version
attached to the upgrade, when it should have checkpointed the fence
version. This made it possible for an upgrade to skip versions. The
conditions needed for this are rare, but could be triggered by crashes
in the upgrade system or by the autoscaler downscaling a serverless
cluster with multiple nodes.

Release Note: None

Fixes: #104884
JeffSwenson added a commit to JeffSwenson/cockroach that referenced this issue Jun 14, 2023
Previously, the tenant upgrade interlock checkpointed the version
attached to the upgrade, when it should have checkpointed the fence
version. This made it possible for an upgrade to skip versions. The
conditions needed for this are rare, but could be triggered by crashes
in the upgrade system or by the autoscaler downscaling a serverless
cluster with multiple nodes.

Release Note: None

Fixes: cockroachdb#104884
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cluster-upgrades A-multitenancy Related to multi-tenancy branch-release-23.1 Used to mark GA and release blockers and technical advisories for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-multitenant Issues owned by the multi-tenant virtual team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants