mmaprototype: fix nil-pointer panic when StorePool sees stores MMA does not #170796

Merged
trunk-io[bot] merged 4 commits into cockroachdb:master from tbg:mma-sload-nil
May 26, 2026

Conversation

@tbg
Member

@tbg tbg commented May 22, 2026

Fixes #170703.

The MMA prototype crashed with a nil-pointer dereference in computeMeansForStoreSet whenever the legacy allocator handed it a candidate slice (or an existing storeID) that MMA's clusterState had not yet seen. This is routine during startup: StorePool and MMA's clusterState are populated by separate gossip-driven paths, so the two views diverge until MMA's SetStore callbacks catch up. The asymmetry is already acknowledged for updateStoreStatuses, which logs and skips unknown stores; BuildMMARebalanceAdvisor was missing the same guard, so it segfaulted instead.

The fix filters unknown stores at the right architectural layer — the integration boundary in BuildMMARebalanceAdvisor — rather than papering over the symptom inside the internal helper. The internal computeMeansForStoreSet precondition (every storeID must be known to the load provider) is also now documented and defended by an assertTruef that panics in test builds and logs+skips in production, so future internal misuse fails loudly without re-introducing the production segfault.

This issue became more common after MMA was enabled by default in v26.3 (#169411).

Epic: none

Release note: None

tbg added 2 commits May 22, 2026 11:48
A subsequent commit needs to log and assert from inside
(*allocatorState).BuildMMARebalanceAdvisor with proper call-site log
tags (e.g. mmaid). Today the function does not take a ctx, so the
existing fallback assertion has to use context.Background(), which
loses those tags.

Add ctx as the first parameter of (*allocatorState).BuildMMARebalanceAdvisor
and the corresponding interface declarations on the mmaprototype.Allocator
and mmaintegration.mmaState interfaces. Plumb the ctx that AllocatorSync
already has on hand into the underlying call. Replace the
context.Background() literal in the existing fallback assertion.

No behavior change.

Release note: None
Add a test that pins the current, broken behavior described in cockroachdb#170703:
(*allocatorState).BuildMMARebalanceAdvisor panics with a nil pointer
dereference if any storeID it is given (either `existing` or anything
in `cands`) is not yet known to MMA's clusterState.

This happens during startup, when the legacy allocator builds candidate
lists from StorePool's gossip-driven view before MMA has been notified
of the corresponding stores via SetStore — see the call path in cockroachdb#170703
from bestRebalanceTarget through AllocatorSync.BuildMMARebalanceAdvisor
into computeMeansForStoreSet, which dereferences the nil *storeLoad
returned by clusterState.getStoreReportedLoad for unknown stores.

The assertions use require.Panics so the test passes on this commit.
The next commit fixes the panic and flips them to require.NotPanics
plus a behavior check, so the red/green pair is visible in a single
diff.

Informs cockroachdb#170703.

Release note: None
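
The failure mode and the panic-pinning assertion from the commit message above can be sketched as follows. This is an illustration under assumptions, not the PR's test: the types are reduced to a map of pointers, `unsafeMean` is a hypothetical stand-in for computeMeansForStoreSet, and the stdlib-only `panics` helper stands in for `require.Panics`.

```go
package main

import "fmt"

// storeLoad and clusterState are simplified, hypothetical stand-ins.
type storeLoad struct{ cpu float64 }

type clusterState struct {
	stores map[int]*storeLoad
}

// getStoreReportedLoad returns nil for a store the cluster state has not seen,
// because a missing map key yields the zero value of the pointer type.
func (cs *clusterState) getStoreReportedLoad(id int) *storeLoad {
	return cs.stores[id]
}

// unsafeMean dereferences every lookup result without a nil check.
func unsafeMean(cs *clusterState, ids []int) float64 {
	var sum float64
	for _, id := range ids {
		sum += cs.getStoreReportedLoad(id).cpu // nil dereference for unknown stores
	}
	return sum / float64(len(ids))
}

// panics reports whether f panicked. It plays the role of require.Panics.
func panics(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	cs := &clusterState{stores: map[int]*storeLoad{1: {cpu: 10}, 2: {cpu: 30}}}
	fmt.Println(unsafeMean(cs, []int{1, 2}))                       // all stores known
	fmt.Println(panics(func() { unsafeMean(cs, []int{1, 2, 3}) })) // store 3 is unknown
}
```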
@trunk-io
Contributor

trunk-io Bot commented May 22, 2026

😎 Merged successfully - details.

@cockroach-teamcity
Member

This change is Reviewable

Fix the nil-pointer panic from cockroachdb#170703 at the right architectural
layer: the integration boundary where MMA's clusterState meets the
legacy allocator's StorePool view.

The two views are kept in sync independently — StorePool from gossip
callbacks, MMA's clusterState from explicit SetStore / ProcessStoreLoadMsg
calls. During startup (and any time gossip races ahead of MMA), the
candidate slice handed to BuildMMARebalanceAdvisor can include storeIDs
MMA has never heard of. computeMeansForStoreSet then nil-derefs the
*storeLoad returned by getStoreReportedLoad for those unknown stores.

Filter at the entry point instead:

  - If `existing` is unknown to MMA, return NoopMMARebalanceAdvisor and
    log at V(2). MMA has no load history for the source store, so it
    cannot judge whether candidates are more overloaded than existing.
  - Drop unknown storeIDs from `cands` before computing means.

This matches the asymmetry already acknowledged in updateStoreStatuses
(cluster_state.go), which logs and skips unknown stores rather than
panicking.
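
The two filtering rules above can be sketched as a small entry-point guard. This is a simplified illustration, not the code from the PR: `advisor`, `buildAdvisor`, and the `known` map are hypothetical, and the `disabled` field stands in for returning NoopMMARebalanceAdvisor. A plain loop is used for clarity.

```go
package main

import "fmt"

// clusterState is a hypothetical stand-in that only tracks known store IDs.
type clusterState struct{ known map[int]bool }

func (cs *clusterState) hasStore(id int) bool { return cs.known[id] }

// advisor is a hypothetical result type; disabled marks the no-op advisor.
type advisor struct {
	disabled bool
	cands    []int
}

// buildAdvisor applies both rules at the boundary, before any means are
// computed: an unknown existing store disables the advisor, and unknown
// candidates are dropped.
func buildAdvisor(cs *clusterState, existing int, cands []int) advisor {
	if !cs.hasStore(existing) {
		// No load history for the source store, so no comparison is possible.
		return advisor{disabled: true}
	}
	filtered := make([]int, 0, len(cands))
	for _, id := range cands {
		if cs.hasStore(id) {
			filtered = append(filtered, id)
		}
	}
	return advisor{cands: filtered}
}

func main() {
	cs := &clusterState{known: map[int]bool{1: true, 2: true, 3: true}}
	fmt.Printf("%+v\n", buildAdvisor(cs, 1, []int{2, 9, 3})) // store 9 is dropped
	fmt.Printf("%+v\n", buildAdvisor(cs, 9, []int{2, 3}))    // unknown existing store
}
```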

In addition, tighten the contract of the internal helper
computeMeansForStoreSet: its precondition that every storeID be known
to the loadProvider is now documented, and a defensive assertTruef
inside the loop catches future violations — panicking in test builds
so the bug surfaces immediately, logging and skipping in production
so we never reintroduce the segfault. A divide-by-zero guard handles
the (now unreachable from BuildMMARebalanceAdvisor) case where every
store was filtered.
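
A rough sketch of the defensive pattern described above follows. It rests on several assumptions: the real assertTruef signature is not shown in this PR text, so the version here (returning the condition, switched by a hypothetical `testBuild` variable) is illustrative only, and `meanCPU` is a reduced stand-in for computeMeansForStoreSet.

```go
package main

import (
	"fmt"
	"log"
)

// testBuild is a hypothetical switch standing in for a test-build flag.
var testBuild = false

// assertTruef returns cond. When cond is false it panics in test builds and
// logs in production, so callers can skip the offending item.
func assertTruef(cond bool, format string, args ...any) bool {
	if cond {
		return true
	}
	msg := fmt.Sprintf(format, args...)
	if testBuild {
		panic(msg)
	}
	log.Print("assertion failed: " + msg)
	return false
}

type storeLoad struct{ cpu float64 }

// meanCPU averages the CPU load of the given stores. Precondition: every ID
// is present in loads. Violations are asserted and skipped.
func meanCPU(loads map[int]*storeLoad, ids []int) float64 {
	var sum float64
	var n int
	for _, id := range ids {
		sl := loads[id]
		if !assertTruef(sl != nil, "store s%d unknown to load provider", id) {
			continue
		}
		sum += sl.cpu
		n++
	}
	if n == 0 {
		return 0 // divide-by-zero guard: every store was skipped
	}
	return sum / float64(n)
}

func main() {
	loads := map[int]*storeLoad{1: {cpu: 10}, 2: {cpu: 30}}
	fmt.Println(meanCPU(loads, []int{1, 2, 99})) // s99 is logged and skipped
	fmt.Println(meanCPU(loads, []int{99}))       // nothing left to average
}
```

Setting `testBuild = true` in this sketch turns the same violation into a panic, which is the loud failure the commit message asks for in tests.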

Flip the red test added in the previous commit to assert the new
behavior: unknown cand → silently dropped, unknown existing →
disabled (no-op) advisor.

Fixes cockroachdb#170703.

Release note: None
@tbg tbg force-pushed the mma-sload-nil branch from 360eeb0 to e5d81ba Compare May 22, 2026 10:30
@tbg tbg marked this pull request as ready for review May 22, 2026 10:31
@tbg tbg requested review from a team as code owners May 22, 2026 10:31
@tbg tbg requested a review from pav-kv May 22, 2026 10:31
@tbg tbg added the backport-26.2.x label (flags PRs that need to be backported to 26.2) May 22, 2026
Comment thread pkg/kv/kvserver/allocator/mmaprototype/rebalance_advisor.go Outdated
Replace the copy-on-write loop in BuildMMARebalanceAdvisor with
slices.IndexFunc + slices.DeleteFunc, per review on cockroachdb#170796. The
previous loop was hard to read: control flow after `cands = filtered`
was unclear, the `+1` capacity hint was opaque, and mutating a slice
mid-`range` is an uncommon pattern.

The new shape preserves the no-allocation property in the steady state
(IndexFunc walks once, returns -1). In the filtered path it makes one
copy and lets DeleteFunc compact; DeleteFunc leaves cap > len, so the
subsequent append(cands, existing) reuses the residual capacity
without a re-alloc and the +1 hint is no longer needed.

Add cluster_state.notHasStore as a one-line negation of hasStore so
the slices helpers can take a method value instead of an inline
closure.

Release note: None
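
The slices-based shape described in the commit message above might look roughly like this. It is a sketch under assumptions: `filterKnown` is a hypothetical helper name, store IDs are plain ints, and `slices.Clone` is assumed for the single copy, since the PR text does not show how the copy is made. `slices.IndexFunc` and `slices.DeleteFunc` are the standard library functions named in the message (Go 1.21 or later).

```go
package main

import (
	"fmt"
	"slices"
)

// clusterState is a hypothetical stand-in that only tracks known store IDs.
type clusterState struct{ known map[int]bool }

func (cs *clusterState) hasStore(id int) bool { return cs.known[id] }

// notHasStore negates hasStore so it can be passed as a method value.
func (cs *clusterState) notHasStore(id int) bool { return !cs.hasStore(id) }

// filterKnown returns cands without unknown stores. In the steady state, when
// every candidate is known, it returns the input slice and does not allocate.
func filterKnown(cs *clusterState, cands []int) []int {
	if slices.IndexFunc(cands, cs.notHasStore) < 0 {
		return cands
	}
	// Copy once so the caller's slice is not mutated, then compact in place.
	return slices.DeleteFunc(slices.Clone(cands), cs.notHasStore)
}

func main() {
	cs := &clusterState{known: map[int]bool{1: true, 2: true, 3: true}}
	in := []int{2, 9, 3}
	out := filterKnown(cs, in)
	fmt.Println(in, out) // the caller's slice is left untouched

	// DeleteFunc shortened the copy, so there is spare capacity for one more
	// element without another allocation.
	out = append(out, 1)
	fmt.Println(out, cap(out) >= len(in))
}
```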
@trunk-io trunk-io Bot merged commit 22e1230 into cockroachdb:master May 26, 2026
27 checks passed
@blathers-crl

blathers-crl Bot commented May 26, 2026

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #170703: branch-release-26.2.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

tbg added a commit to tbg/cockroach that referenced this pull request May 26, 2026
tbg added a commit to tbg/cockroach that referenced this pull request May 27, 2026
Labels

backport-26.2.x (flags PRs that need to be backported to 26.2)
target-release-26.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pkg/ccl/serverccl/diagnosticsccl/diagnosticsccl_test: TestServerReport failed

3 participants