CBG-5266: Prevent downgrades across cluster compatibility versions#8235

Open
bbrks wants to merge 6 commits into main from CBG-5266

bbrks (Member) commented May 6, 2026

CBG-5266

Prevent downgrades across cluster compatibility versions

  • Records the computed cluster compatibility version high-water mark (HWM) in the registry.
  • The HWM guards against loading databases on SG nodes whose cluster compatibility version is older than the HWM.
    • Affected databases are put into an error state and are listed, with the reason, in the /_all_dbs?verbose=true response.
  • Removed the existing registry version downgrade code, which is no longer needed. sg_version is retained in the registry for backwards compatibility and potential sgcollect_info diagnostics (it records the last SG version that wrote an update to the registry doc).

Manually tested w/ a fake 4.2 version of SG in addition to the included unit tests.

Integration Tests

bbrks and others added 3 commits May 6, 2026 13:42
Drops the per-config SGVersion comparison and registry.SGVersion-based
downgrade check in favour of a per-bucket ClusterCompatVersionHWM that
tracks the highest cluster compat version (min across registered nodes)
ever observed. The HWM ratchets up only and is maintained by
RegisterNodeVersion, which is now the single point of enforcement: it
refuses to register a node whose cluster compat version is below the
bucket's HWM. _applyConfig calls RegisterBucket early so the gate fires
before db load. registry.SGVersion is retained as a diagnostic and is
stamped on every registry write via setGatewayRegistry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When RegisterBucket refuses a load because the bucket's
ClusterCompatVersionHWM is higher than this node's compat version, the
db was silently dropped from _all_dbs. Record an invalid-config entry
with a new DatabaseClusterCompatVersionError code so admins can see why
the load failed via /_all_dbs?verbose=true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend TestClusterCompatDowngradeBlockedByPersistentHWM to assert the
rejected db appears in allDatabaseSummaries with
DatabaseClusterCompatVersionError, locking in the behaviour added in the
previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 13:54
Lint (golangci-lint goimports) flagged config_manager.go and
config_manager_test.go after the SGVersion-related functions / tests
were removed in 31cdc3f, leaving the files with a trailing blank line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR introduces a persistent “cluster compatibility version” high-water mark (HWM) in the bucket registry and uses it to prevent older Sync Gateway nodes from loading databases once the cluster has advanced past a major.minor compatibility boundary. Rejected databases are placed into an error state and surfaced via /_all_dbs?verbose=true.

Changes:

  • Persist cluster_compat_version_hwm in the bucket _sync:registry and enforce a downgrade gate in node registration (RegisterNodeVersion).
  • Gate database loading on successful bucket registration, surfacing cluster-compat rejections through invalid-db tracking and a new database startup error code.
  • Update cluster compat manager behavior/tests to handle registration errors and allow tests to override the node’s reported compat version.
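
To illustrate what a "major.minor compatibility boundary" means in the overview above, here is a small sketch that reduces a full version string to its major.minor pair and checks whether a move between two versions lowers it. The helper names and the assumption that versions are plain dotted strings are illustrative only; the real parsing lives in base/version_cluster_compat.go and may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compatVersion extracts major.minor from a dotted version such as "4.1.2".
// Hypothetical helper: it ignores patch and any further components.
func compatVersion(sgVersion string) (major, minor int, err error) {
	parts := strings.Split(sgVersion, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q has no major.minor prefix", sgVersion)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("invalid major in %q: %w", sgVersion, err)
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("invalid minor in %q: %w", sgVersion, err)
	}
	return major, minor, nil
}

// isCompatDowngrade reports whether moving from one version to another
// lowers the major.minor pair. Patch-level changes never count.
func isCompatDowngrade(from, to string) (bool, error) {
	fMaj, fMin, err := compatVersion(from)
	if err != nil {
		return false, err
	}
	tMaj, tMin, err := compatVersion(to)
	if err != nil {
		return false, err
	}
	if tMaj != fMaj {
		return tMaj < fMaj, nil
	}
	return tMin < fMin, nil
}

func main() {
	// Version strings below are made up for the example.
	for _, pair := range [][2]string{{"4.1.2", "4.1.0"}, {"4.1.0", "4.0.3"}, {"4.0.0", "4.1.0"}} {
		down, err := isCompatDowngrade(pair[0], pair[1])
		fmt.Println(pair[0], "->", pair[1], "downgrade:", down, "err:", err)
	}
}
```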

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| rest/utilities_testing.go | Replaces test override of SG build version with an override of node cluster compat version. |
| rest/server_context.go | Adds per-node clusterCompatVersion to bootstrap context with a default from base.NodeClusterCompatVersion. |
| rest/config.go | Moves bucket registration (and downgrade gating) into the DB apply path and surfaces failures via invalid DB tracking. |
| rest/config_registry.go | Extends the registry JSON with cluster_compat_version_hwm. |
| rest/config_manager.go | Implements downgrade gate + HWM ratcheting in RegisterNodeVersion; stamps sg_version on registry writes. |
| rest/config_manager_test.go | Removes obsolete version-downgrade test coverage tied to sg_version gating. |
| rest/cluster_compat.go | Makes RegisterBucket return errors and uses the bootstrap context’s cluster compat version for registration/refresh. |
| rest/cluster_compat_test.go | Adds/updates tests for downgrade blocking and persistent HWM behavior. |
| rest/admin_api.go | Updates DB creation path to match the updated _applyConfig signature. |
| db/database_error.go | Adds a new database startup error code/message for cluster compat version rejection. |
| base/version_cluster_compat.go | Adds GreaterThan helper on ClusterCompatVersion used by downgrade/HWM logic. |

Comment thread db/database_error.go
Comment thread rest/config_registry.go
Comment thread rest/config_manager.go
- docs/api/components/schemas.yaml: enum DatabaseError.error_code adds
  11 (cluster compat downgrade), and GatewayRegistry gains the
  cluster_compat_version_hwm property. Also updates the sg_version
  description to reflect its diagnostic-only role now that downgrade
  decisions key off the HWM.
- rest/config_manager.go: tighten the RegisterNodeVersion doc comment
  so it scopes the "single point of enforcement" claim to cluster
  compat / Nodes / HWM rather than implying *all* registry mutations
  pass through this function (config groups have their own paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

Redocly previews

This test exercised the per-config SGVersion downgrade gate that was
removed in 31cdc3f. The test writes a config with SGVersion="5.4.3"
and asserts the running 3.1.0 node refuses to apply it; with the gate
gone the new config is now correctly applied, the assertRevsLimit poll
loop never converges, and the test hangs until the 20m package timeout
(seen on PR-8235 #3 EE unit tests).

The new gate is bucket-scoped (cluster-compat HWM in the bucket
registry) rather than per-config, and is covered by
TestClusterCompatDowngrade* in rest/cluster_compat_test.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bbrks bbrks requested a review from gregns1 May 6, 2026 14:49