Skip to content

Conversation

@tbg
Copy link
Member

@tbg tbg commented Nov 12, 2025

This PR refactors rebalanceStores by extracting the store-shedding logic from into a standalone rebalanceStore method on a new rebalanceState struct. The struct encapsulates rebalancing state (cluster state references, counters, configuration, and scratch variables), reducing the method's cyclomatic complexity and improving organization. More follow-up refactors are planned to straighten this up further and decompose it more, but this seemed like a good checkpoint.

There should be zero behavior changes in this PR.

Informs #157757.

Epic: CRDB-55052

@tbg tbg added the do-not-merge bors won't merge a PR with this label. label Nov 12, 2025
@blathers-crl
Copy link

blathers-crl bot commented Nov 12, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the mma-rebalance-stores-refactor branch from e7e27ff to 782ad5f Compare November 13, 2025 07:04
@blathers-crl
Copy link

blathers-crl bot commented Nov 13, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg requested review from sumeerbhola and wenyihu6 November 13, 2025 07:04
@tbg
Copy link
Member Author

tbg commented Nov 13, 2025

Reviewer notes: this PR was generated with the help of AI. Concretely, I scoped out a step-by-step plan for the refactor, and had Cursor work through it step by step. The commit messages are Cursor-generated as well, and I reviewed everything as it went in (occasionally telling Cursor to backtrack and solve things differently). Each individual commit should be easy to review, if that is not the case, please flag so that I can figure out what went wrong.

@tbg tbg marked this pull request as ready for review November 13, 2025 07:08
@tbg tbg requested review from a team as code owners November 13, 2025 07:08
tbg added 11 commits November 13, 2025 14:48
This is step 1 in extracting the shedStore logic into a standalone function.
Convert return and continue statements to boolean flags (shouldReturnEarly,
shouldContinue) that are checked after the block. This prepares the code
for extraction into a function by moving all loop control structure outside
the block that will become shedStore.
Convert the shedStore block to an IIFE that returns (bool, bool) for
shouldReturnEarly and shouldContinue. Replace all flag assignments with
return statements. This prepares the code for adding more return values
and parameters.
Introduce rebalanceState struct that contains *clusterState and tracks
counters and outcomes of rebalanceStores invocation. Refactor the shedStore
function to write into *rebalanceState instead of returning values. This
prepares for passing *rebalanceState as a parameter in a future change.
Change the shedStore IIFE to accept *rebalanceState as a parameter instead
of capturing it from the outer scope. This prepares the function for
extraction to a standalone method on *rebalanceState.
Add store sheddingStore as a parameter to the shedStore function instead
of capturing it from the loop. Remove idx parameter as it was only used
for logging and is not needed.
Move rng and dsm fields into the rebalanceState struct instead of passing
them as parameters. This reduces the parameter list and keeps related
state together.
…vars to rebalanceState

Add ctx, localStoreID, and now as parameters to the shedStore function.
Move scratch variables (disj, storesToExclude, storesToExcludeForRange,
scratchNodes, scratchStores) into a nested scratch struct within
rebalanceState for better organization.
Extract the shedStore immediately invoked function expression (IIFE) into
a standalone rebalanceStore method on *rebalanceState. Move sheddingStore
type definition to top level. The method now has a clean signature with
rs as receiver and takes store, ctx, localStoreID, and now as parameters.
Remove the shouldReturnEarly flag and instead check the condition
directly in the loop. The loop now checks if rangeMoveCount or
leaseTransferCount has reached their limits before processing each
store, and breaks if either condition is met.
Remove the shouldContinue flag since the function returns normally when
it wants to skip to the next store, and the loop naturally continues to
the next iteration.
@tbg tbg force-pushed the mma-rebalance-stores-refactor branch from 782ad5f to db7c2de Compare November 13, 2025 13:48
@tbg tbg removed the do-not-merge bors won't merge a PR with this label. label Nov 13, 2025
Copy link
Contributor

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 344 at r3 (raw file):

					if leaseTransferCount >= maxLeaseTransferCount {
						log.KvDistribution.VInfof(ctx, 2, "reached max lease transfer count %d, returning", maxLeaseTransferCount)
						return true, false

nit: not needed now - lets squeeze this in in the future can we add comments like return true /shouldReturnEarly/, false /* shouldContinue */

update: nvm later commits changed it


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 380 at r2 (raw file):

				log.KvDistribution.VInfof(ctx, 2, "skipping remote store s%d: in lease shedding grace period", store.StoreID)
				shouldContinue = true
				break

I found this intermediate state confusing - am i seeing this right that we break out of the loop here for sheddingStores do we want that


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 582 at r2 (raw file):

			return changes
		}
		if shouldContinue {

when we added the shouldContinue flag here, is it correct that we don't actually need this since the loop would naturally go to the next iteration like you did in the last commit?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 167 at r4 (raw file):

	// See the long comment where rangeState.lastFailedChange is declared.
	const lastFailedChangeDelayDuration time.Duration = 60 * time.Second
	rs := &rebalanceState{

nit: not needed now but can we rename this rs - it confuses me with rangeState


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 606 at r4 (raw file):

			}
		}()
		if rs.shouldReturnEarly {

nit: can we add an assertion that rs.shouldReturnEarly != rs.shouldContinue

update: nvm your later commits got rid of it

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@sumeerbhola reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 346 at r2 (raw file):

					if leaseTransferCount >= maxLeaseTransferCount {
						log.KvDistribution.VInfof(ctx, 2, "reached max lease transfer count %d, returning", maxLeaseTransferCount)
						shouldReturnEarly = true

I can't correctly code review this piece-by-piece refactor without having the code open in my IDE and looking at the nested loops and what loop is being exited with a break.
One didn't need to think through the nesting with a return statement. And looks like the final state as the returns again, so I am just looking at the full thing as a single diff.

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!
bors r+

@tbg dismissed @sumeerbhola from a discussion.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 346 at r2 (raw file):

Previously, sumeerbhola wrote…

I can't correctly code review this piece-by-piece refactor without having the code open in my IDE and looking at the nested loops and what loop is being exited with a break.
One didn't need to think through the nesting with a return statement. And looks like the final state as the returns again, so I am just looking at the full thing as a single diff.

Yeah, single-diff is the way to go here, in hindsight. This will be different in follow-up PRs, where large chunks of code will actually have to move. I'll need to think of a better way to "ease out" of a nested for loop - I agree that the review of this is a real pain (not to mention I introduced a bug that lasted a few commits until it was patched up).


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 380 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I found this intermediate state confusing - am i seeing this right that we break out of the loop here for sheddingStores do we want that

Yeah, this looks wrong - good catch, I missed this during self-review. It's hard to track whether one is in the inner or outer loop, and this is in the outer one. This is fixed by the end, fortunately, and all tests passed (lol).
I'll think of a better strategy for easing out of such loops in a more reviewable way.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 582 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

when we added the shouldContinue flag here, is it correct that we don't actually need this since the loop would naturally go to the next iteration like you did in the last commit?

Yeah, the if shouldContinue { continue } isn't needed. I think the LLM was being unnecessarily verbose and I didn't catch it.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 167 at r4 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

nit: not needed now but can we rename this rs - it confuses me with rangeState

Good point. Queued up for a follow-up.

@craig
Copy link
Contributor

craig bot commented Nov 13, 2025

@craig craig bot merged commit d2bb68c into cockroachdb:master Nov 13, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants