Skip to content

release-26.2: mmaprototype: dynamically verbose logs for overloaded stores#169344

Merged
trunk-io[bot] merged 12 commits into
cockroachdb:release-26.2from
wenyihu6:blathers/backport-release-26.2-165719
Apr 29, 2026
Merged

release-26.2: mmaprototype: dynamically verbose logs for overloaded stores#169344
trunk-io[bot] merged 12 commits into
cockroachdb:release-26.2from
wenyihu6:blathers/backport-release-26.2-165719

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Apr 29, 2026

Backport 12/12 commits from #165719 on behalf of @wenyihu6.


Resolves: #162254
Epic: CRDB-55052
Release note: none


This PR makes MMA's rebalancing logs useful in production without
relying on operators flipping vmodule.

What changes for the operator:

  • A periodic store-level summary (cluster means, sorted shedding
    stores, per-pass results) is emitted at Infof once per
    rebalance pass, instead of being hidden behind V(2).
  • When a store has been continuously overloaded longer than 30
    minutes and hasn't had a detailed burst recently, the
    per-shedding-store log path (top-K ranges, candidate selection,
    failure reasons) is promoted to Infof for that store on that
    iteration. A burst-suppression timer prevents spamming.
  • Per-range and per-candidate logging stays gated by V(3), so
    enabling vmodule on a single MMA file shows the detailed
    decision trail without drowning the rest of the log.
  • The bridge that promotes verbose logs to Infof checks
    vmodule at the call site (not at construction), so
    --vmodule=load=3 enables verbose output for load.go's
    emissions even when the logger was constructed elsewhere.

No behavior change for clusters that don't hit persistent
overload; the detailed-burst path is dormant.


Release justification: low risk obs improvements for opt-in functionality.

wenyihu6 and others added 12 commits March 13, 2026 18:06
This change routes all `VEventf` calls in the MMA rebalancing code
through an `mmaLogf` function type so that a follow-up commit can
selectively promote calls to `Infof` for persistently overloaded
stores.

`re.logf()` on `rebalanceEnv` delegates to `VEventf`. Functions
that called `VEventf` directly now receive a `logf mmaLogf`
parameter instead. `computeLoadSummary` and
`loadSummaryForDimension` accept a `detailedLog bool` that
replaces the `log.ExpensiveLogEnabled(ctx, 3)` guard that
previously gated per-dimension log emission inside
`loadSummaryForDimension`. All callers pass
`log.ExpensiveLogEnabled(ctx, 3)`, preserving the existing
behavior.

No behavior change.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
This change teaches `re.logf()` to emit at `Infof` level when
`re.detailedLog` is true. The flag is set per-shedding-store when
`log.ExpensiveLogEnabled(ctx, 3)` is true or when the store has
been continuously overloaded for longer than
`persistentOverloadThreshold` (30min) and has not had a detailed
log burst within `detailedLogInterval` (30min).

`lastDetailedLogTime` is updated only for the persistent-overload
case, not for verbose logging. This ensures that disabling verbose
logging does not suppress future persistent-overload log bursts.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
This change switches each `rebalance-stores` call in
`dynamic_verbose_logs.txt` to `vmodule=0`, so the captured output
reflects what an operator at the default vmodule would actually see.
The per-shedding-store dynamic-verbose-logging promotion is now
observable across four passes as `detailedLog` toggles between false
(t=10m, t=45m) and true (t=40m, t=56m).

Release note: None
This change removes the `detailedLog` field from `rebalanceEnv` and
threads it as an explicit parameter through `rebalanceStore`,
`rebalanceReplicas`, and `rebalanceLeasesFromLocalStoreID`. A new
`re.detailedLogf` method emits at Infof, and `re.makeLogf(detailedLog)`
selects between it and `re.logf` for a single shedding store.

The field was mutated on each iteration of the shedding-stores loop
and read implicitly by `re.logf` from anywhere in the call chain. The
implicit coupling required a manual reset after the loop to avoid
leaking the per-store flag into subsequent calls. Threading the value
explicitly removes both concerns. No behavior change.

The per-shedding-store burst this control gates (top-K, load
summaries per dim, candidates, results) is throttled per (store,
localStoreID) pair to one burst per `detailedLogInterval` (10 min)
once `persistentOverloadThreshold` (30 min) is crossed.

Volume estimate, with assumptions:
  - 20 nodes × 2 stores/node = 40 store rebalancers, each calling
    rebalanceStores every LoadBasedRebalanceInterval (1 min default).
  - 10 stores sustained overloaded beyond persistentOverloadThreshold
    (30 min); all 40 rebalancers see them as shedding stores.
  - The 1-min pass cadence is faster than detailedLogInterval (10 min),
    so the throttle is the binding constraint.
  - ~35 lines per burst (top-K, ~4-5 load summaries per dim,
    candidates, results, etc.; varies with range count).

  Bursts per hour per (store, localStoreID) pair
    = (60 min/hr) / detailedLogInterval (10 min) = 6.
  Pairs = 10 overloaded × 40 rebalancers = 400.
  Cluster-wide rate = 400 × 6 × 35 = 84,000 lines/hour ≈ 23 lines/sec.

Healthy clusters emit zero of these (no store crosses the 30-minute
persistent-overload threshold).

Release note: None
This change adds an optional `vmodule=N` arg to the `rebalance-stores`
datadriven command. Without it, the command emits the recording-span
trace as before. With it, the pass runs outside any tracing span -- so
`log.ExpensiveLogEnabled` does not artificially force `detailedLog`
true -- `N` is applied as `*=N` for the pass via `log.SetVModule`,
and INFO+ entries from the `mmaprototype` package are intercepted and
printed.

The new `dynamic_verbose_logs.txt` walks four rebalance passes
against a single overloaded store at t=10m, t=40m, t=45m, and t=56m,
printing the recording-span trace for each. A follow-up commit
switches the file to `vmodule=0` to demonstrate the operator-visible
effect.

Release note: None
This change promotes the `rebalanceStores` outer-loop narrative
(rebalanceStores begins, cluster means, evaluating sN,
overload-start/continued/end, store added to shedding list) from
VEventf to Infof at most once per `outerLogInterval` (10 minutes),
via `util.EveryN` over the `clusterState`'s time source. A pure
timer (no overload condition) is used because the content is small
and useful even on healthy clusters.

This is independent from the per-store persistent-overload promotion,
which fires deep per-shedding-store detail after a 30-minute
threshold. The two have separate throttles and serve different
purposes: this one is a low-cadence pass-level summary; the per-store
control is a deep dive triggered by sustained problems.

Volume estimate, with assumptions:
  - 20 nodes × 2 stores/node = 40 store rebalancers, each calling
    rebalanceStores every LoadBasedRebalanceInterval (1 min default).
  - The 1-min pass cadence is faster than outerLogInterval (10 min),
    so the throttle is the binding constraint.
  - Lines per emission = 12 fixed (begin, cluster means, overload-*
    transitions, sorted shedding) + 40 `evaluating sN` (one per
    cluster store) = 52 lines.

  Emissions per hour per rebalancer
    = (60 min/hr) / outerLogInterval (10 min) = 6.
  Cluster-wide rate = 40 × 6 × 52 = 12,480 lines/hour ≈ 3.5 lines/sec.

The rate is the same for healthy and overloaded clusters since the
timer has no overload condition.

Release note: None
This change gates both Infof-promotion controls in `rebalanceStores`
on `re.passObs != nil`. The production caller in
`mma_store_rebalancer.go` calls rebalance in a loop, with the first
call setting `PeriodicCall=true` and subsequent calls (continuing the
same logical pass) setting it false. `passObs` is initialized only
for periodic calls, so `re.passObs != nil` already identifies the
first iteration of the rebalance loop.

The outer-loop control was already implicitly suppressed via
`EveryN`'s update-on-fire behavior; the gate makes the intent
explicit. The per-store burst was scoped per-store via
`lastDetailedLogTimes`, which allowed iteration N to fire fresh
bursts for stores not processed in iteration 1, even though those
belong to the same operator-visible time window. The gate scopes the
burst strictly to the first iteration; a store first encountered in
iteration 2+ defers its burst to the next periodic call (~1 minute
later), which is operationally invisible.

Note that `log.ExpensiveLogEnabled(ctx, 3)` continues to bypass the
gate, so vmodule=3 always emits.

Release note: None
This change increases `detailedLogInterval` from 10 minutes to 30
minutes, reducing per-store burst cadence by 3x. The throttle is the
binding constraint on burst volume in production, so each minute of
throttle materially affects log output.

Volume estimate, with the same scenario as the earlier commits
(20 nodes × 2 stores/node = 40 rebalancers, 10 sustained overloaded
stores):

  Bursts per hour per (store, localStoreID) pair
    = (60 min/hr) / detailedLogInterval (30 min) = 2.
  Pairs = 10 overloaded × 40 rebalancers = 400.
  Cluster-wide rate = 400 × 2 × 35 = 28,000 lines/hour ≈ 7.8 lines/sec.

Down from ~23 lines/sec at the previous 10-minute interval.

The dynamic_verbose_logs.txt test scenarios are updated: the
"burst fires after interval expires" pass moves from t=56m (16 min
after t=40m) to t=75m (35 min after t=40m).

Release note: None
Makes 0 (vmodule unset, log capture only) the default for the `vmodule`
arg on the `rebalance-stores` datadriven command. The previous default
ran inside a verbose recording span, which forced
`log.ExpensiveLogEnabled(ctx, 3)` true and masked the dynamic-verbose-
logging behavior. The new default reflects what an operator at the
default vmodule actually sees.

Existing `vmodule=0` invocations in dynamic_verbose_logs.txt are dropped
as redundant. All other testdata files that relied on the recording-span
trace gain `vmodule=3` to retain their previous output.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Splits the conjunctive condition into the specific reason that fired
(pending decrease over threshold, or pending increase over epsilon) and
appends a compact list of the unenacted pending changes contributing
to the skip:

  skipping overloaded store s4 (worst dim: WriteBandwidth): pending
  increase 0.09 >= epsilon; pending: r1+s4[cpu:1ns/s, write-bandwidth:
  880 kB/s, byte-size:0 B]

Each pending change is rendered as r{rangeID}{op}s{storeID}{loadDelta}
where op is + (add), - (remove), or ~ (other), in changeID order.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Replaces the (logf mmaLogf, detailedLog bool) param pair threaded
through MMA rebalancing with a single mmaLogf bridge struct that
internally tracks whether logging is enabled. The bridge has a noop
field that callers can inspect to skip arg construction at hot sites.

Before this change, dropping upstream's `if log.ExpensiveLogEnabled(ctx,
3)` wrappers (during the rebase conflict resolution) caused per-call
variadic arg slices to escape to the heap even when logging was off,
adding ~500 allocs/op to BenchmarkRebalanceStores (162 -> 666).

The mmaLogf type is now:

  type mmaLogf struct {
      noop         bool
      internalLogf func(ctx, level, format, args ...interface{})
  }

makeLogf returns a noop bridge when neither detailedLog nor V(3) is on.
Cold-path callers continue to use lb.logf(ctx, 3, ...) unconditionally;
hot-path callers (sortTargetCandidateSetAndPick, the result(failed)
sites, etc.) wrap the call in `if !lb.noop`.

The change also drops the now-redundant detailedLog bool parameter from
sortTargetCandidateSetAndPick, computeCandidatesForReplicaTransfer,
canShedAndAddLoad, rebalanceStore, rebalanceReplicas, and
rebalanceLeasesFromLocalStoreID. computeLoadSummary and
loadSummaryForDimension still take detailedLog (they have a separate
Infof-vs-skip semantic).

BenchmarkRebalanceStores goes from 666 -> 175 allocs/op (vs 162 baseline);
time/op back within 1% of baseline.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The mmaLogger bridge previously decided at construction time whether
it would emit. The vmodule check ran in the file containing the
constructor, so a logf call in a different file inherited a stale
verbosity decision keyed off the constructor's source location.
Setting --vmodule=cluster_state_rebalance_stores=3 did not enable
verbose output for callees in load.go (and vice versa), which is
surprising.

Replace the construction-time noop flag with a per-call vmodule check
via log.ExpensiveLogEnabledVDepth, so attribution lands on the file
that actually emits the message. The bridge keeps a verboseToInfof
bool for the two production paths (outer-loop heartbeat and per-
shedding-store burst) that promote to Infof; in that mode logf
short-circuits the V check and always emits.

Hot-path call sites that previously guarded expensive arg construction
with `if !ml.noop` now use `if ml.V(ctx, level)`, passing the same
level they will pass to logf.

The rebalanceEnv-specific constructor is gone; makeMMALogger(bool) is
the single entry point.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@wenyihu6 wenyihu6 requested review from a team as code owners April 29, 2026 15:09
@blathers-crl blathers-crl Bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Apr 29, 2026
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented Apr 29, 2026

😎 Merged successfully - details.

@blathers-crl blathers-crl Bot requested a review from tbg April 29, 2026 15:09
@blathers-crl
Copy link
Copy Markdown

blathers-crl Bot commented Apr 29, 2026

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl Bot added backport Label PR's that are backports to older release branches T-kv KV Team labels Apr 29, 2026
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@wenyihu6
Copy link
Copy Markdown
Contributor Author

/trunk merge

@trunk-io trunk-io Bot merged commit 3a5a123 into cockroachdb:release-26.2 Apr 29, 2026
23 checks passed
tbg added a commit to 5hubh4m/cockroach that referenced this pull request May 8, 2026
7 already backported via separate release PRs (cockroachdb#169344, cockroachdb#169590, cockroachdb#169711,
cockroachdb#169734, cockroachdb#169742, cockroachdb#169761, cockroachdb#169876) but invisible to script because the
cherry-picked commits don't reference the original PR number.

7 are unrelated to MMA (admission/AC, sql, kvserver/storage, ci,
roachtest/perturbation).

2 are master-and-onward only by intent: cockroachdb#169430 (enable MMA by default in
v26.3) and cockroachdb#169669 (retire load_based_lease_rebalancing setting).

Net: 0 PRs need backporting to release-26.2.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. T-kv KV Team target-release-26.2.1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants