
Conversation

@wenyihu6
Contributor

@wenyihu6 wenyihu6 commented Nov 6, 2025

Epic: CRDB-55052
Release note: None

@wenyihu6 wenyihu6 requested review from a team as code owners November 6, 2025 22:28
@blathers-crl

blathers-crl bot commented Nov 6, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@wenyihu6 wenyihu6 requested review from sumeerbhola and tbg November 6, 2025 22:32
@wenyihu6
Contributor Author

wenyihu6 commented Nov 6, 2025

I have a remaining question that arose while writing this PR. I’ll continue the discussion here so that it doesn’t block the other PR from merging.

From our conversation this morning (please correct me if I misinterpreted), my mental model is now that when there’s a partial intermediate state and we observe something we’d like to undo, the idea is that the changes have already been subsumed in some sense. We’re not really dropping a pending change, just ignoring the remaining portion that has already been added to the state.

What I’m unsure about is whether we should undo the load change here, i.e. the cs.undoChangeLoadDelta(change.ReplicaChange) call. When the leaseholder of a range changes, we don’t undo the load change; instead, we mark it as enacted via cs.pendingChangeEnacted(changeID, now, true) and wait for the load adjustment to be dropped later in processStoreLoadMsg. Intuitively, it seems like the same principle should apply here since we’re treating both as things that have been done. Should we similarly avoid undoing the load in this case?
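
To make the contrast concrete, here is a minimal sketch (hypothetical types and fields, not the actual mmaprototype code) of the two paths being compared: undoing the load delta immediately versus marking the change enacted and keeping the adjustment around.

```go
// Toy model for illustration only; field and method names are made up.
package main

import (
	"fmt"
	"time"
)

type loadDelta struct{ cpu int64 }

type pendingChange struct {
	delta     loadDelta
	enactedAt time.Time // zero while the change is still considered pending
}

type clusterState struct {
	adjustedCPU    int64
	pendingChanges map[int]*pendingChange
}

// Path 1, roughly the undoChangeLoadDelta idea: drop the change and subtract
// its load delta from the adjusted state right away.
func (cs *clusterState) undoChange(id int) {
	c := cs.pendingChanges[id]
	cs.adjustedCPU -= c.delta.cpu
	delete(cs.pendingChanges, id)
}

// Path 2, roughly the pendingChangeEnacted idea: stop treating the change as
// pending, but keep the load adjustment; it only stops being applied later,
// once the store's reported load is assumed to reflect it.
func (cs *clusterState) markEnacted(id int, now time.Time) {
	cs.pendingChanges[id].enactedAt = now
}

func main() {
	cs := &clusterState{
		adjustedCPU:    100,
		pendingChanges: map[int]*pendingChange{1: {delta: loadDelta{cpu: 100}}},
	}
	// The two calls below are alternatives for the same change; shown back to
	// back only to demonstrate their effect on the adjusted load.
	cs.markEnacted(1, time.Now())
	fmt.Println(cs.adjustedCPU) // 100: the adjustment is kept after enactment
	cs.undoChange(1)
	fmt.Println(cs.adjustedCPU) // 0: the undo path subtracts it immediately
}
```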

@wenyihu6 wenyihu6 force-pushed the sumeerpr branch 5 times, most recently from 6d37831 to b8e08f7 on November 7, 2025 03:43
Member

@tbg tbg left a comment


Looks good, thanks!

@tbg reviewed 4 of 11 files at r1, 1 of 3 files at r5, 5 of 6 files at r6, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 942 at r9 (raw file):

	//
	// A pending change is removed from tracking in one of three ways:
	// 1. Marked as enacted successfully: remove the pending changes, but the

I'm still a little shaky on this. When the change is "enacted", does it get dropped from pendingChanges? And what keeps track of the fact that there's still adjusted load? Do we just leave the load adjustment around (but there's no more pending change to tie it to) and we also track when the last change was made (for the timeout mechanism)?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 953 at r9 (raw file):

	// - The leaseholder of the range has changed. This is a special case where
	// the leaseholder of the range has moved to a different store, and the
	// rangeMag no longer contains the range. We assume that the pending change

rangeMag typo


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 956 at r9 (raw file):

	// has been enacted in this case.
	//
	// 2. Undone as failed: corresponding replica and load change is rolled back.

Sumeer's example from yesterday was interesting:

  • initially {s1*,s2,s3}
  • decision: add s4 and transfer the lease to it
  • this produces one pending change (new replica s4 with lease) - iiuc
  • StoreLeaseholderMsg arrives that shows {s1*,s2,s3,s4}
  • pending change is still pending (b/c s4 doesn't have the lease yet)
  • if nothing else happens (store leaseholder msgs stay the same) eventually the pending change gets GC'ed, and range state reverts to {s1*,s2,s3} (even though it should be {s1*,s2,s3,s4} as seen in recent leaseholder msgs)

Now of course the next leaseholder msg would roll the state forward again to the correct `{s1*,s2,s3,s4}`. But this behavior is weird.

Sumeer seems to want to fix it, so hopefully it won't remain for long. I'll defer to @sumeerbhola what should be documented now. Personally I'd add the example and then Sumeer can update it when he updates the rest of this comment as part of the second round of re-work.
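
As a toy illustration of that sequence (purely illustrative; the real bookkeeping in cluster_state.go models pending changes differently):

```go
// Toy model of the add-s4-with-lease scenario; not the real allocator code.
package main

import "fmt"

type rangeState struct {
	replicas    map[string]bool
	leaseholder string
}

func (r rangeState) clone() rangeState {
	c := rangeState{replicas: map[string]bool{}, leaseholder: r.leaseholder}
	for s := range r.replicas {
		c.replicas[s] = true
	}
	return c
}

// One composite pending change: add s4 and move the lease there.
type pendingChange struct {
	addReplica string
	newLease   string
	prev       rangeState // state restored if the change is GC'ed while pending
}

func main() {
	// initially {s1*, s2, s3}
	rs := rangeState{
		replicas:    map[string]bool{"s1": true, "s2": true, "s3": true},
		leaseholder: "s1",
	}
	pc := pendingChange{addReplica: "s4", newLease: "s4", prev: rs.clone()}

	// Apply the pending change to the adjusted (in-memory) view.
	rs.replicas[pc.addReplica] = true
	rs.leaseholder = pc.newLease

	// StoreLeaseholderMsg shows {s1*, s2, s3, s4}: s4 exists but the lease is
	// still on s1, so the composite change does not count as enacted.
	observedLease := "s1"
	if observedLease != pc.newLease {
		// If nothing else is observed before GC, the whole change is undone,
		// reverting to {s1*, s2, s3} and forgetting that s4 was added.
		rs = pc.prev
	}
	fmt.Println(len(rs.replicas), rs.leaseholder) // 3 s1
}
```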


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 979 at r9 (raw file):

	// conflicting state. See preCheckOnApplyReplicaChanges for details on how
	// compatibility between the pending change and the new range state is
	// determined. When incompatibility is detected, the pending change is

And here again all pending changes for one logical change are rolled back, right? Might be worth saying explicitly.

@blathers-crl

blathers-crl bot commented Nov 7, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6
Contributor Author

wenyihu6 commented Nov 7, 2025

Rebased on top of master to pick up the PR. FYI - planning to force-push on this one since it’s short.

Member

@tbg tbg left a comment


@tbg reviewed 11 of 11 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)

Contributor Author

@wenyihu6 wenyihu6 left a comment


Dragging the matrix from r10 to r11 should show the recent change. I still want to discuss #157032 (comment), so I will hold off on merging for now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 942 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm still a little shaky on this. When the change is "enacted", does it get dropped from pendingChanges? And what keeps track of the fact that there's still adjusted load? Do we just leave the load adjustment around (but there's no more pending change to tie it to) and we also track when the last change was made (for the timeout mechanism)?

Added more details for clarity: when a change is enacted, it is removed from the range and cluster state’s pending changes, and enactedAtTime is set at that point. The load adjustment remains at the store level in ss.adjusted.loadPendingChanges. When processStoreLoadMsg sees that lagForChangeReflectedInLoad has elapsed since enactedAtTime, the pending load is no longer applied. It isn’t rolled back - similar to how we treat enacted replica changes, we just treat it as already reflected in the load message and avoid applying it again.
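
A rough sketch of that lag-based handoff, with made-up names standing in for the real enactedAtTime / lagForChangeReflectedInLoad / processStoreLoadMsg pieces:

```go
// Rough sketch; names and types are illustrative, not the real cluster_state.go.
package main

import (
	"fmt"
	"time"
)

type loadPendingChange struct {
	cpuDelta  int64
	enactedAt time.Time // set when the change is marked enacted
}

// adjustedLoad mimics the idea: start from the reported load and re-apply only
// those pending adjustments that the report is not yet assumed to include.
func adjustedLoad(reportedCPU int64, changes map[int]*loadPendingChange,
	now time.Time, lag time.Duration) int64 {
	adjusted := reportedCPU
	for id, c := range changes {
		if !c.enactedAt.IsZero() && now.Sub(c.enactedAt) >= lag {
			// Enacted long enough ago that the reported load should already
			// reflect it: stop applying the delta and drop the bookkeeping.
			delete(changes, id)
			continue
		}
		adjusted += c.cpuDelta
	}
	return adjusted
}

func main() {
	changes := map[int]*loadPendingChange{
		1: {cpuDelta: 50, enactedAt: time.Now().Add(-30 * time.Second)},
		2: {cpuDelta: 20}, // still pending, always re-applied
	}
	// With a 15s lag, change 1 is treated as reflected in the report.
	fmt.Println(adjustedLoad(500, changes, time.Now(), 15*time.Second)) // 520
}
```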


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 953 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

rangeMag typo

Fixed.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 956 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sumeer's example from yesterday was interesting:

  • initially {s1*,s2,s3}
  • decision: add s4 and transfer the lease to it
  • this produces one pending change (new replica s4 with lease) - iiuc
  • StoreLeaseholderMsg arrives that shows {s1*,s2,s3,s4}
  • pending change is still pending (b/c s4 doesn't have the lease yet)
  • if nothing else happens (store leaseholder msgs stay the same) eventually the pending change gets GC'ed, and range state reverts to {s1*,s2,s3} (even though it should be {s1*,s2,s3,s4} as seen in recent leaseholder msgs)

Now of course the next leaseholder msg would roll the state forward again to the correct `{s1*,s2,s3,s4}. But this behavior is weird.

Sumeer seems to want to fix it, so hopefully it won't remain for long. I'll defer to @sumeerbhola what should be documented now. Personally I'd add the example and then Sumeer can update it when he updates the rest of this comment as part of the second round of re-work.

This example looks the same as the one described in point 3:

// and leaseholder to s4. This will be modeled with two pending changes, one
// that removes the replica and leaseholder from s3, and another that adds
// the replica and leaseholder to s4. An intermediate state that can be
// observed is {s1, s2, s3, s4} with the lease still at s3. But the pending
// change for adding s4 includes both that it has a replica, and it has the
// lease, so we will not mark it done, and keep pretending that the whole
// change is pending. Since lease transfers are fast, we accept this
// imperfect modeling fidelity. One consequence of this imperfect modeling
// is that if in this example there are no further changes observed until
// GC, the allocator will undo both changes and go back to the state {s1,
// s2, s3} with s3 as the leaseholder. That is, it has forgotten that s4 was
// added. This is unavoidable and will be fixed by the first
Added a sentence directing readers to point 3 for this imperfection.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 979 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

And here again all pending changes for one logical change are rolled back, right? Might be worth saying explicitly.

This made me realize that my comment was off - I don’t think we roll back the replica pending changes, just the load adjustment. When incompatibility is detected, we just discard the remaining changes:
rs.removePendingChangeTracking(change.ChangeID)
delete(cs.stores[change.target.StoreID].adjusted.loadPendingChanges, change.ChangeID)
delete(cs.pendingChanges, change.ChangeID)
cs.undoChangeLoadDelta(change.ReplicaChange)
}
if n := len(rs.pendingChanges); n > 0 {
It is possible that some of the logical action may have already been enacted, so this does not necessarily include all pending changes for a single logical change. But when we revert, it should always include all pending changes for one logical change (which only happens when we GC things and when AdjustPendingChangesDisposition(failed)).
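
To make that distinction concrete, a toy sketch (hypothetical structure, not the real code) of one logical action tracked as two changes: the incompatibility path only discards what is still pending, while the explicit revert path rolls back everything belonging to the logical action.

```go
// Hypothetical structure for illustration; not the real pending-change code.
package main

import "fmt"

type state struct {
	adjustedCPU int64         // base load plus deltas of changes still being modeled
	pending     map[int]int64 // changeID -> cpu delta of changes not yet enacted
}

// discardRemaining mirrors the incompatibility path: whatever is still pending
// gets dropped and its load delta undone. If part of the logical action was
// already enacted, it is no longer in `pending` and is left alone.
func (s *state) discardRemaining() {
	for id, delta := range s.pending {
		s.adjustedCPU -= delta
		delete(s.pending, id)
	}
}

// revertLogicalChange mirrors GC / AdjustPendingChangesDisposition(failed):
// it is invoked with every change ID of the logical action, and since nothing
// has been enacted yet in that case, all of them are still pending and all of
// them are rolled back together.
func (s *state) revertLogicalChange(ids ...int) {
	for _, id := range ids {
		if delta, ok := s.pending[id]; ok {
			s.adjustedCPU -= delta
			delete(s.pending, id)
		}
	}
}

func main() {
	// Logical action "move replica+lease s3 -> s4" as two changes:
	// change 1 (remove from s3, delta -30) already enacted, change 2 (add to
	// s4, delta +100) still pending.
	s := &state{adjustedCPU: 70, pending: map[int]int64{2: 100}}
	s.discardRemaining()
	fmt.Println(s.adjustedCPU, len(s.pending)) // -30 0: only the pending part was undone
}
```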

Contributor Author

@wenyihu6 wenyihu6 left a comment


r10 to r13 now

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)

Contributor Author

@wenyihu6 wenyihu6 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 979 at r9 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

This made me realize that my comment was off - I don’t think we roll back the replica pending changes, just the load adjustment. When incompatibility is detected, we just discard the remaining changes:
rs.removePendingChangeTracking(change.ChangeID)
delete(cs.stores[change.target.StoreID].adjusted.loadPendingChanges, change.ChangeID)
delete(cs.pendingChanges, change.ChangeID)
cs.undoChangeLoadDelta(change.ReplicaChange)
}
if n := len(rs.pendingChanges); n > 0 {
It is possible that some of the logical action may have already been enacted, so this does not necessarily include all pending changes for a single logical change. But when we revert, it should always include all pending changes for one logical change (which only happens when we GC things and when AdjustPendingChangesDisposition(failed)).

Hm, perhaps we are saying the same thing - it is rolled back in the sense that we just don't apply the pending changes again to the new leaseholder range state.

@wenyihu6
Contributor Author

wenyihu6 commented Nov 7, 2025

Pulled the discussion over to Slack. Will bors this in.

bors r+

craig bot pushed a commit that referenced this pull request Nov 7, 2025
157032: mmaprototype: comment on the lifecycle of pending changes r=wenyihu6 a=wenyihu6

Epic: CRDB-55052
Release note: None

Co-authored-by: wenyihu6 <wenyi@cockroachlabs.com>
@craig
Contributor

craig bot commented Nov 7, 2025

Build failed:

@wenyihu6
Contributor Author

wenyihu6 commented Nov 7, 2025

Rebased.

bors r+

@craig
Contributor

craig bot commented Nov 7, 2025

@craig craig bot merged commit c24bbca into cockroachdb:master Nov 7, 2025
24 checks passed