mmaprototype: fix nil-pointer panic when StorePool sees stores MMA does not#170796
Merged
Conversation
A subsequent commit needs to log and assert from inside (*allocatorState).BuildMMARebalanceAdvisor with proper call-site log tags (e.g. mmaid). Today the function does not take a ctx, so the existing fallback assertion has to use context.Background(), which loses those tags. Add ctx as the first parameter of (*allocatorState).BuildMMARebalanceAdvisor and the corresponding interface declarations on the mmaprototype.Allocator and mmaintegration.mmaState interfaces. Plumb the ctx that AllocatorSync already has on hand into the underlying call. Replace the context.Background() literal in the existing fallback assertion. No behavior change. Release note: None
Add a test that pins the current, broken behavior described in cockroachdb#170703: (*allocatorState).BuildMMARebalanceAdvisor panics with a nil pointer dereference if any storeID it is given (either `existing` or anything in `cands`) is not yet known to MMA's clusterState. This happens during startup, when the legacy allocator builds candidate lists from StorePool's gossip-driven view before MMA has been notified of the corresponding stores via SetStore — see the call path in cockroachdb#170703 from bestRebalanceTarget through AllocatorSync.BuildMMARebalanceAdvisor into computeMeansForStoreSet, which dereferences the nil *storeLoad returned by clusterState.getStoreReportedLoad for unknown stores. The assertions use require.Panics so the test passes on this commit. The next commit fixes the panic and flips them to require.NotPanics plus a behavior check, so the red/green pair is visible in a single diff. Informs cockroachdb#170703. Release note: None
Contributor
|
😎 Merged successfully - details. |
Member
Fix the nil-pointer panic from cockroachdb#170703 at the right architectural layer: the integration boundary where MMA's clusterState meets the legacy allocator's StorePool view. The two views are kept in sync independently — StorePool from gossip callbacks, MMA's clusterState from explicit SetStore / ProcessStoreLoadMsg calls. During startup (and any time gossip races ahead of MMA), the candidate slice handed to BuildMMARebalanceAdvisor can include storeIDs MMA has never heard of. computeMeansForStoreSet then nil-derefs the *storeLoad returned by getStoreReportedLoad for those unknown stores. Filter at the entry point instead: - If `existing` is unknown to MMA, return NoopMMARebalanceAdvisor and log at V(2). MMA has no load history for the source store, so it cannot judge whether candidates are more overloaded than existing. - Drop unknown storeIDs from `cands` before computing means. This matches the asymmetry already acknowledged in updateStoreStatuses (cluster_state.go), which logs and skips unknown stores rather than panicking. In addition, tighten the contract of the internal helper computeMeansForStoreSet: its precondition that every storeID be known to the loadProvider is now documented, and a defensive assertTruef inside the loop catches future violations — panicking in test builds so the bug surfaces immediately, logging and skipping in production so we never reintroduce the segfault. A divide-by-zero guard handles the (now unreachable from BuildMMARebalanceAdvisor) case where every store was filtered. Flip the red test added in the previous commit to assert the new behavior: unknown cand → silently dropped, unknown existing → disabled (no-op) advisor. Fixes cockroachdb#170703. Release note: None
pav-kv
approved these changes
May 22, 2026
Replace the copy-on-write loop in BuildMMARebalanceAdvisor with slices.IndexFunc + slices.DeleteFunc, per review on cockroachdb#170796. The previous loop was hard to read: control flow after `cands = filtered` was unclear, the `+1` capacity hint was opaque, and mutating a slice mid-`range` is an uncommon pattern. The new shape preserves the no-allocation property in the steady state (IndexFunc walks once, returns -1). In the filtered path it makes one copy and lets DeleteFunc compact; DeleteFunc leaves cap > len, so the subsequent append(cands, existing) reuses the residual capacity without a re-alloc and the +1 hint is no longer needed. Add cluster_state.notHasStore as a one-line negation of hasStore so the slices helpers can take a method value instead of an inline closure. Release note: None
|
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #170703: branch-release-26.2. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
May 26, 2026
Replace the copy-on-write loop in BuildMMARebalanceAdvisor with slices.IndexFunc + slices.DeleteFunc, per review on cockroachdb#170796. The previous loop was hard to read: control flow after `cands = filtered` was unclear, the `+1` capacity hint was opaque, and mutating a slice mid-`range` is an uncommon pattern. The new shape preserves the no-allocation property in the steady state (IndexFunc walks once, returns -1). In the filtered path it makes one copy and lets DeleteFunc compact; DeleteFunc leaves cap > len, so the subsequent append(cands, existing) reuses the residual capacity without a re-alloc and the +1 hint is no longer needed. Add cluster_state.notHasStore as a one-line negation of hasStore so the slices helpers can take a method value instead of an inline closure. Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
May 27, 2026
Replace the copy-on-write loop in BuildMMARebalanceAdvisor with slices.IndexFunc + slices.DeleteFunc, per review on cockroachdb#170796. The previous loop was hard to read: control flow after `cands = filtered` was unclear, the `+1` capacity hint was opaque, and mutating a slice mid-`range` is an uncommon pattern. The new shape preserves the no-allocation property in the steady state (IndexFunc walks once, returns -1). In the filtered path it makes one copy and lets DeleteFunc compact; DeleteFunc leaves cap > len, so the subsequent append(cands, existing) reuses the residual capacity without a re-alloc and the +1 hint is no longer needed. Add cluster_state.notHasStore as a one-line negation of hasStore so the slices helpers can take a method value instead of an inline closure. Release note: None
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #170703.
The MMA prototype crashed with a nil-pointer dereference in
computeMeansForStoreSetwhenever the legacy allocator handed it a candidate slice (or anexistingstoreID) that MMA'sclusterStatehad not yet seen. This is routine during startup:StorePooland MMA'sclusterStateare populated by separate gossip-driven paths, so the two views diverge until MMA'sSetStorecallbacks catch up. The asymmetry is already acknowledged forupdateStoreStatuses, which logs and skips unknown stores;BuildMMARebalanceAdvisorwas missing the same guard, so it segfaulted instead.The fix filters unknown stores at the right architectural layer — the integration boundary in
BuildMMARebalanceAdvisor— rather than papering over the symptom inside the internal helper. The internalcomputeMeansForStoreSetprecondition (every storeID must be known to the load provider) is also now documented and defended by anassertTruefthat panics in test builds and logs+skips in production, so future internal misuse fails loudly without re-introducing the production segfault.This issue became more common after MMA was enabled by default in v26.3 (#169411).
Epic: none
Release note: None