Multi Tenant Servers Can Skip Upgrades #104884
Labels
A-cluster-upgrades
A-multitenancy
Related to multi-tenancy
branch-release-23.1
Used to mark GA and release blockers and technical advisories for 23.1
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-multitenant
Issues owned by the multi-tenant virtual team
There is a rare bug impacting the Serverless upgrade from 22.2 to 23.1. It was triggered only once, when we upgraded our staging clusters, which contain a few thousand tenants for scale testing purposes. Out of the thousands of staging Serverless clusters, only one skipped an upgrade job, which caused the cluster to enter a crash loop.
This is the implementation of the algorithm for stepping a version gate. Each version v has a fence version v' that is one internal version lower than v. The fence version is used to implement a barrier when upgrading a cluster.
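To make the relationship between v and v' concrete, here is a minimal sketch. The `Version` type and `fenceVersionFor` helper are hypothetical simplifications written for this illustration, not the actual CockroachDB definitions, and the version numbers are made up.

```go
package main

import "fmt"

// Version is a simplified, hypothetical stand-in for a cluster version.
type Version struct {
	Major, Minor, Internal int32
}

func (v Version) String() string {
	return fmt.Sprintf("%d.%d-%d", v.Major, v.Minor, v.Internal)
}

// fenceVersionFor returns the fence version v' for v, assumed here to be
// v with its internal component decremented by one.
func fenceVersionFor(v Version) Version {
	v.Internal--
	return v
}

func main() {
	v := Version{Major: 22, Minor: 2, Internal: 8} // illustrative value only
	fmt.Println("version:", v, "fence:", fenceVersionFor(v))
}
```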
A high level overview of the algorithm is:
There is a bug in the implementation that only impacts multi-tenant clusters. On the first upgrade applied by the loop, the mustPersistFenceVersion branch updates the system setting.
The purpose of the branch is to ensure that no servers running the old binary can start up. But it is setting the cluster setting version to v. If the binary crashes before running the upgrade for v, then when the server restarts, it reads the setting version v and ends up skipping the upgrade when the upgrade resumes.
Judging by the name of the mustPersistFenceVersion variable, I think the bug is a simple typo and updateSystemVersionSetting was intended to be called with the fence version v'.
Jira issue: CRDB-28765
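The failure mode described in this issue can be reproduced with a small simulation. This is a simplified model under stated assumptions, not the real upgrade code: the `tenant` type, `runUpgrades`, `simulate`, and the version numbers are all hypothetical, and only the name mustPersistFenceVersion is borrowed from the issue text.

```go
package main

import "fmt"

// Version is a simplified, hypothetical stand-in for a cluster version.
type Version struct{ Major, Minor, Internal int32 }

// Less reports whether v orders strictly before o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Internal < o.Internal
}

// fenceVersionFor returns v', assumed to be one internal version below v.
func fenceVersionFor(v Version) Version {
	v.Internal--
	return v
}

// tenant models the state that survives a crash.
type tenant struct {
	setting Version   // persisted version setting
	applied []Version // upgrades that actually ran
}

// runUpgrades steps through targets. With buggy=true the first step
// persists v instead of the fence v'. With crashOnFirst=true the process
// "crashes" right after that write, before the upgrade for v runs.
func (t *tenant) runUpgrades(targets []Version, buggy, crashOnFirst bool) (crashed bool) {
	mustPersistFenceVersion := true
	for _, v := range targets {
		if !t.setting.Less(v) {
			continue // setting is already at or past v: upgrade is skipped
		}
		if mustPersistFenceVersion {
			toPersist := fenceVersionFor(v)
			if buggy {
				toPersist = v // the suspected typo: v instead of v'
			}
			t.setting = toPersist
			mustPersistFenceVersion = false
			if crashOnFirst {
				return true
			}
		}
		t.applied = append(t.applied, v)
		t.setting = v
	}
	return false
}

// simulate crashes once during the first step, restarts, and returns the
// upgrades that ended up being applied.
func simulate(buggy bool) []Version {
	targets := []Version{{22, 2, 2}, {22, 2, 4}} // illustrative values only
	t := &tenant{setting: Version{22, 2, 0}}
	t.runUpgrades(targets, buggy, true)  // first attempt crashes
	t.runUpgrades(targets, buggy, false) // restart and resume
	return t.applied
}

func main() {
	fmt.Println("buggy:", simulate(true))
	fmt.Println("fixed:", simulate(false))
}
```

In this model the buggy run applies only the second upgrade after the restart, because the persisted setting already equals the first target. The run that persists the fence version applies both.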