allocatorimpl: retire kv.allocator.load_based_lease_rebalancing.enabled#169669
Merged
Conversation
Contributor
|
😎 Merged successfully - details. |
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Member
wenyihu6
approved these changes
May 4, 2026
Contributor
|
Fixed the CI. /trunk merge |
|
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link) |
Contributor
|
/trunk merge |
Follow-the-workload (FTW) lease rebalancing has been gated behind `kv.allocator.load_based_lease_rebalancing.enabled` (default true) since 2017. Internal consensus is that the feature is largely unused, that serious users express placement intent via lease preferences, and that FTW does not work particularly well in practice. It is also implicitly disabled when the multi-metric allocator is active, so its surface area is shrinking on its own. Retire the setting: flip the default to false and mark it `settings.Retired`. The registration is kept so that the setting can still be SET as an escape hatch (and so existing automation does not error), but it is hidden from SHOW CLUSTER SETTINGS and the code path that consults it now defaults to off. Removing FTW from the default behavior shrinks the allocator's feature set and makes it easier to ultimately replace larger parts of it with the multi-metric allocator. `TestAllocatorTransferLeaseTargetLoadBased` was also updated to override the cluster setting load_based_lease_rebalancing to true since the test expects targets assuming the lease follows request locality (e.g. when most requests arrive from `l=2`, FTW moves the lease there even if that node already has more leases). Closes cockroachdb#153866. Epic: CRDB-54644 Release note (general change): follow-the-workload rebalancing (see Follow-the-Workload Topology in our docs) is now disabled by default as part of its deprecation. The \`kv.allocator.load_based_lease_rebalancing.enabled\` cluster setting is retired and hidden from SHOW CLUSTER SETTINGS, but can still be set to re-enable the feature if needed. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg
added a commit
to 5hubh4m/cockroach
that referenced
this pull request
May 8, 2026
7 already backported via separate release PRs (cockroachdb#169344, cockroachdb#169590, cockroachdb#169711, cockroachdb#169734, cockroachdb#169742, cockroachdb#169761, cockroachdb#169876) but invisible to script because the cherry-picked commits don't reference the original PR number. 7 are unrelated to MMA (admission/AC, sql, kvserver/storage, ci, roachtest/perturbation). 2 are master-and-onward only by intent: cockroachdb#169430 (enable MMA by default in v26.3) and cockroachdb#169669 (retire load_based_lease_rebalancing setting). Net: 0 PRs need backporting to release-26.2. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
angeladietz
added a commit
to angeladietz/cockroach
that referenced
this pull request
May 8, 2026
Previously, TestBoundedStalenessDataDriven assumed that the leaseholder for the test table was on node 1 without enforcing it. The lease could land on any node depending on Raft leadership of the parent range at split time. This assumption was usually true thanks to follow-the- workload (FTW) biasing leases toward the SQL gateway node, but the recent FTW retirement (cockroachdb#169669) made the failure much more frequent. Now, the test explicitly relocates the lease to node 1 after table creation using ALTER RANGE RELOCATE LEASE. As well, the store rebalancer (both legacy and MMA) is disabled alongside the existing queue toggles to prevent it from moving the lease back during the test. Resolves: cockroachdb#169200 Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
angeladietz
added a commit
to angeladietz/cockroach
that referenced
this pull request
May 8, 2026
Previously, TestBoundedStalenessDataDriven assumed that the leaseholder for table t was on node 1 without enforcing it. This started failing frequently after the FTW retirement (cockroachdb#169669). The root cause is a race during StartTestCluster. When a new tenant is created, a range split occurs at the tenant boundaries. The lease queue then processes this new range to consider a lease transfer. Previously, when FTW was enabled, shouldTransferLeaseForAccessLocality would return [shouldNotTransfer](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L2993) since the range was younger than MinLeaseTransferStatsDuration, blocking the lease from being transferred. With FTW disabled, the function returns decideWithoutStats instead, allowing lease count convergence to proceed immediately, which results in the lease being transferred away to the node with the lowest lease count. This all happens in StartTestCluster before the lease queue is disabled. This fixes the flake by explicitly relocating the lease to n1 after table creation and disabling the store rebalancer to prevent it from moving the lease back. Resolves: cockroachdb#169200 Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-the-workload (FTW) lease rebalancing has been gated behind
kv.allocator.load_based_lease_rebalancing.enabled(defaulttrue) since2017. Internally, the consensus is that:
It is also implicitly disabled when the multi-metric allocator (MMA) is
active, so its surface area is already shrinking on its own.
This PR retires the setting:
false, so FTW is off out of the box.settings.Retired, which hides it fromSHOW CLUSTER SETTINGSand prepends
do not use -to its description.SETasan emergency escape hatch (and so existing automation that touches it
doesn't error). The code path in
allocator.gothat consults thesetting is unchanged.
Removing FTW from default behavior shrinks the allocator's feature set
and makes it easier to ultimately replace larger parts of it with the
multi-metric allocator.
A prior attempt at this (which only flipped the default, without
retiring) is #153865; it was approved but never merged due to competing
priorities at the time.
Closes #153866.
Epic: CRDB-54644
Release note (general change): follow-the-workload rebalancing (see
Follow-the-Workload Topology in our docs) is now disabled by default as
part of its deprecation. The
kv.allocator.load_based_lease_rebalancing.enabledcluster setting isretired and hidden from
SHOW CLUSTER SETTINGS, but can still beSETto re-enable the feature if needed.