
allocatorimpl: retire kv.allocator.load_based_lease_rebalancing.enabled#169669

Merged: trunk-io[bot] merged 1 commit into cockroachdb:master from tbg:ftw-off-default on May 5, 2026

Conversation

@tbg
Member

@tbg tbg commented May 4, 2026

Follow-the-workload (FTW) lease rebalancing has been gated behind
kv.allocator.load_based_lease_rebalancing.enabled (default true) since
2017. Internally, the consensus is that:

  • The feature is likely unused, or at least not relied upon explicitly.
  • Serious users express placement intent via lease preferences.
  • It does not work particularly well in practice.

It is also implicitly disabled when the multi-metric allocator (MMA) is
active, so its surface area is already shrinking on its own.

This PR retires the setting:

  • Default flipped to false, so FTW is off out of the box.
  • Marked settings.Retired, which hides it from SHOW CLUSTER SETTINGS
    and prepends "do not use - " to its description.
  • Registration is kept on purpose, so the setting can still be SET as
    an emergency escape hatch (and so existing automation that touches it
    doesn't error). The code path in allocator.go that consults the
    setting is unchanged.
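
As a rough illustration of the bullets above, here is a self-contained toy model of a retired-but-still-registered setting. It does not use the real CockroachDB settings package: the `registry` and `boolSetting` types, their methods, and the `example.visible.setting` name are all hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// boolSetting is a hypothetical, simplified stand-in for a registered
// cluster setting. It is not the real CockroachDB settings type.
type boolSetting struct {
	description string
	value       bool
	retired     bool
}

// registry keeps every registered setting, retired or not.
type registry map[string]*boolSetting

// register adds a setting. A retired setting stays registered, but its
// description gets the "do not use - " prefix mentioned in the PR.
func (r registry) register(name, desc string, def, retired bool) {
	if retired {
		desc = "do not use - " + desc
	}
	r[name] = &boolSetting{description: desc, value: def, retired: retired}
}

// set succeeds for any registered name, including retired ones, which
// is what keeps the escape hatch (and existing automation) working.
func (r registry) set(name string, v bool) error {
	s, ok := r[name]
	if !ok {
		return fmt.Errorf("unknown cluster setting %q", name)
	}
	s.value = v
	return nil
}

// visible returns the names a SHOW-style listing would include:
// retired settings are filtered out.
func (r registry) visible() []string {
	var names []string
	for name, s := range r {
		if !s.retired {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

const ftwSetting = "kv.allocator.load_based_lease_rebalancing.enabled"

// newRegistry registers the FTW setting as retired with default false,
// plus one hypothetical visible setting for contrast.
func newRegistry() registry {
	r := registry{}
	r.register(ftwSetting, "toy description", false, true)
	r.register("example.visible.setting", "hypothetical setting", true, false)
	return r
}

func main() {
	r := newRegistry()
	fmt.Println("visible:", r.visible())
	err := r.set(ftwSetting, true)
	fmt.Println("set error:", err)
	fmt.Println("value now:", r[ftwSetting].value)
}
```

The point of the sketch is only that hiding a setting and unregistering it are different things: the retired entry disappears from the listing but can still be set.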

Removing FTW from default behavior shrinks the allocator's feature set
and makes it easier to ultimately replace larger parts of it with the
multi-metric allocator.

A prior attempt at this (which only flipped the default, without
retiring) is #153865; it was approved but never merged due to competing
priorities at the time.

Closes #153866.
Epic: CRDB-54644

Release note (general change): follow-the-workload rebalancing (see
Follow-the-Workload Topology in our docs) is now disabled by default as
part of its deprecation. The
kv.allocator.load_based_lease_rebalancing.enabled cluster setting is
retired and hidden from SHOW CLUSTER SETTINGS, but can still be SET
to re-enable the feature if needed.

@trunk-io
Contributor

trunk-io Bot commented May 4, 2026

😎 Merged successfully - details.

@blathers-crl

blathers-crl Bot commented May 4, 2026

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg requested a review from wenyihu6 May 4, 2026 14:38
@tbg tbg marked this pull request as ready for review May 4, 2026 14:38
@tbg tbg requested review from a team as code owners May 4, 2026 14:38
@wenyihu6 wenyihu6 force-pushed the ftw-off-default branch from 792e49d to 8bfb14b Compare May 4, 2026 16:31
@wenyihu6
Contributor

wenyihu6 commented May 4, 2026

Fixed the CI.

/trunk merge

@blathers-crl

blathers-crl Bot commented May 4, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

@wenyihu6
Contributor

wenyihu6 commented May 5, 2026

/trunk merge

Follow-the-workload (FTW) lease rebalancing has been gated behind
`kv.allocator.load_based_lease_rebalancing.enabled` (default true) since
2017. Internal consensus is that the feature is largely unused, that
serious users express placement intent via lease preferences, and that
FTW does not work particularly well in practice. It is also implicitly
disabled when the multi-metric allocator is active, so its surface area
is shrinking on its own.

Retire the setting: flip the default to false and mark it
`settings.Retired`. The registration is kept so that the setting can
still be SET as an escape hatch (and so existing automation does not
error), but it is hidden from SHOW CLUSTER SETTINGS and the code path
that consults it now defaults to off. Removing FTW from the default
behavior shrinks the allocator's feature set and makes it easier to
ultimately replace larger parts of it with the multi-metric allocator.

`TestAllocatorTransferLeaseTargetLoadBased` was also updated to
override `kv.allocator.load_based_lease_rebalancing.enabled` to true,
since the test's expected targets assume that the lease follows request
locality (e.g. when most requests arrive from `l=2`, FTW moves the
lease there even if that node already has more leases).

Closes cockroachdb#153866.
Epic: CRDB-54644

Release note (general change): follow-the-workload rebalancing (see
Follow-the-Workload Topology in our docs) is now disabled by default as
part of its deprecation. The
`kv.allocator.load_based_lease_rebalancing.enabled` cluster setting is
retired and hidden from SHOW CLUSTER SETTINGS, but can still be set to
re-enable the feature if needed.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@wenyihu6 wenyihu6 force-pushed the ftw-off-default branch from 8bfb14b to 3bfafd1 Compare May 5, 2026 02:59
@trunk-io trunk-io Bot merged commit 516a17d into cockroachdb:master May 5, 2026
36 of 37 checks passed
tbg added a commit to 5hubh4m/cockroach that referenced this pull request May 8, 2026
7 already backported via separate release PRs (cockroachdb#169344, cockroachdb#169590, cockroachdb#169711,
cockroachdb#169734, cockroachdb#169742, cockroachdb#169761, cockroachdb#169876) but invisible to script because the
cherry-picked commits don't reference the original PR number.

7 are unrelated to MMA (admission/AC, sql, kvserver/storage, ci,
roachtest/perturbation).

2 are master-and-onward only by intent: cockroachdb#169430 (enable MMA by default in
v26.3) and cockroachdb#169669 (retire load_based_lease_rebalancing setting).

Net: 0 PRs need backporting to release-26.2.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
angeladietz added a commit to angeladietz/cockroach that referenced this pull request May 8, 2026
Previously, TestBoundedStalenessDataDriven assumed that the leaseholder
for the test table was on node 1 without enforcing it. The lease could
land on any node depending on Raft leadership of the parent range at
split time. This assumption was usually true thanks to follow-the-
workload (FTW) biasing leases toward the SQL gateway node, but the
recent FTW retirement (cockroachdb#169669) made the failure much more frequent.

Now, the test explicitly relocates the lease to node 1 after table
creation using ALTER RANGE RELOCATE LEASE. In addition, the store
rebalancer (both legacy and MMA) is disabled alongside the existing
queue toggles to prevent it from moving the lease back during the test.

Resolves: cockroachdb#169200
Epic: none

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
angeladietz added a commit to angeladietz/cockroach that referenced this pull request May 8, 2026
Previously, TestBoundedStalenessDataDriven assumed that the leaseholder
for table t was on node 1 without enforcing it. This started failing
frequently after the FTW retirement (cockroachdb#169669).

The root cause is a race during StartTestCluster. When a new tenant is
created, a range split occurs at the tenant boundaries. The lease queue
then processes this new range to consider a lease transfer.

Previously, when FTW was enabled,
shouldTransferLeaseForAccessLocality would return
[shouldNotTransfer](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L2993)
since the range was younger than MinLeaseTransferStatsDuration,
blocking the lease from being transferred.

With FTW disabled, the function returns decideWithoutStats instead,
allowing lease count convergence to proceed immediately, which results
in the lease being transferred away to the node with the lowest lease
count.

This all happens in StartTestCluster before the lease queue is
disabled.

This fixes the flake by explicitly relocating the lease to n1 after
table creation and disabling the store rebalancer to prevent it from
moving the lease back.

Resolves: cockroachdb#169200
Epic: none

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

Development

Successfully merging this pull request may close these issues.

allocator: disable follow-the-workload lease transfers by default

3 participants