
Conversation

@wenyihu6
Contributor

@wenyihu6 wenyihu6 commented Nov 6, 2025

Epic: CRDB-55052
Release note: None

@wenyihu6 wenyihu6 requested review from a team as code owners November 6, 2025 22:28
@blathers-crl

blathers-crl bot commented Nov 6, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@wenyihu6 wenyihu6 requested review from sumeerbhola and tbg November 6, 2025 22:32
@wenyihu6
Contributor Author

wenyihu6 commented Nov 6, 2025

I have a remaining question that arose while writing this PR. I’ll continue the discussion here so that it doesn’t block the other PR from merging.

From our conversation this morning (please correct me if I misinterpreted), my mental model is now that when there’s a partial intermediate state and we observe something we’d like to undo, the idea is that the changes have already been subsumed in some sense. We’re not really dropping a pending change, just ignoring the remaining portion that has already been added to the state.

What I’m unsure about is whether we should undo the load change here, i.e. the cs.undoChangeLoadDelta(change.ReplicaChange) call. When the leaseholder of a range changes, we don’t undo the load change; instead, we mark it as enacted via cs.pendingChangeEnacted(changeID, now, true) and wait for the load adjustment to be dropped later in processStoreLoadMsg. Intuitively, it seems like the same principle should apply here since we’re treating both as things that have been done. Should we similarly avoid undoing the load in this case?
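
To make the contrast concrete, here is a minimal sketch (hypothetical types and fields, not the actual mmaprototype code) of the two paths being compared: undoing the load delta immediately versus marking the change enacted and keeping the adjustment around.

```go
// Toy model for illustration only; field and method names are made up.
package main

import (
	"fmt"
	"time"
)

type loadDelta struct{ cpu int64 }

type pendingChange struct {
	delta     loadDelta
	enactedAt time.Time // zero while the change is still considered pending
}

type clusterState struct {
	adjustedCPU    int64
	pendingChanges map[int]*pendingChange
}

// Path 1, roughly the undoChangeLoadDelta idea: drop the change and subtract
// its load delta from the adjusted state right away.
func (cs *clusterState) undoChange(id int) {
	c := cs.pendingChanges[id]
	cs.adjustedCPU -= c.delta.cpu
	delete(cs.pendingChanges, id)
}

// Path 2, roughly the pendingChangeEnacted idea: stop treating the change as
// pending, but keep the load adjustment; it only stops being applied later,
// once the store's reported load is assumed to reflect it.
func (cs *clusterState) markEnacted(id int, now time.Time) {
	cs.pendingChanges[id].enactedAt = now
}

func main() {
	cs := &clusterState{
		adjustedCPU:    100,
		pendingChanges: map[int]*pendingChange{1: {delta: loadDelta{cpu: 100}}},
	}
	// The two calls below are alternatives for the same change; shown back to
	// back only to demonstrate their effect on the adjusted load.
	cs.markEnacted(1, time.Now())
	fmt.Println(cs.adjustedCPU) // 100: the adjustment is kept after enactment
	cs.undoChange(1)
	fmt.Println(cs.adjustedCPU) // 0: the undo path subtracts it immediately
}
```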

@wenyihu6 wenyihu6 force-pushed the sumeerpr branch 5 times, most recently from 6d37831 to b8e08f7 on November 7, 2025 03:43
Member

@tbg tbg left a comment


Looks good, thanks!

@tbg reviewed 4 of 11 files at r1, 1 of 3 files at r5, 5 of 6 files at r6, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 942 at r9 (raw file):

	//
	// A pending change is removed from tracking in one of three ways:
	// 1. Marked as enacted successfully: remove the pending changes, but the

I'm still a little shaky on this. When the change is "enacted", does it get dropped from pendingChanges? And what keeps track of the fact that there's still adjusted load? Do we just leave the load adjustment around (but there's no more pending change to tie it to) and we also track when the last change was made (for the timeout mechanism)?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 953 at r9 (raw file):

	// - The leaseholder of the range has changed. This is a special case where
	// the leaseholder of the range has moved to a different store, and the
	// rangeMag no longer contains the range. We assume that the pending change

rangeMag typo


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 956 at r9 (raw file):

	// has been enacted in this case.
	//
	// 2. Undone as failed: corresponding replica and load change is rolled back.

Sumeer's example from yesterday was interesting:

  • initially {s1*,s2,s3}
  • decision: add s4 and transfer the lease to it
  • this produces one pending change (new replica s4 with lease) - iiuc
  • StoreLeaseholderMsg arrives that shows {s1*,s2,s3,s4}
  • pending change is still pending (b/c s4 doesn't have the lease yet)
  • if nothing else happens (store leaseholder msgs stay the same) eventually the pending change gets GC'ed, and range state reverts to {s1*,s2,s3} (even though it should be {s1*,s2,s3,s4} as seen in recent leaseholder msgs)

Now of course the next leaseholder msg would roll the state forward again to the correct `{s1*,s2,s3,s4}`. But this behavior is weird.

Sumeer seems to want to fix it, so hopefully it won't remain for long. I'll defer to @sumeerbhola what should be documented now. Personally I'd add the example and then Sumeer can update it when he updates the rest of this comment as part of the second round of re-work.
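
As a toy illustration of that sequence (purely illustrative; the real bookkeeping in cluster_state.go models pending changes differently):

```go
// Toy model of the add-s4-with-lease scenario; not the real allocator code.
package main

import "fmt"

type rangeState struct {
	replicas    map[string]bool
	leaseholder string
}

func (r rangeState) clone() rangeState {
	c := rangeState{replicas: map[string]bool{}, leaseholder: r.leaseholder}
	for s := range r.replicas {
		c.replicas[s] = true
	}
	return c
}

// One composite pending change: add s4 and move the lease there.
type pendingChange struct {
	addReplica string
	newLease   string
	prev       rangeState // state restored if the change is GC'ed while pending
}

func main() {
	// initially {s1*, s2, s3}
	rs := rangeState{
		replicas:    map[string]bool{"s1": true, "s2": true, "s3": true},
		leaseholder: "s1",
	}
	pc := pendingChange{addReplica: "s4", newLease: "s4", prev: rs.clone()}

	// Apply the pending change to the adjusted (in-memory) view.
	rs.replicas[pc.addReplica] = true
	rs.leaseholder = pc.newLease

	// StoreLeaseholderMsg shows {s1*, s2, s3, s4}: s4 exists but the lease is
	// still on s1, so the composite change does not count as enacted.
	observedLease := "s1"
	if observedLease != pc.newLease {
		// If nothing else is observed before GC, the whole change is undone,
		// reverting to {s1*, s2, s3} and forgetting that s4 was added.
		rs = pc.prev
	}
	fmt.Println(len(rs.replicas), rs.leaseholder) // 3 s1
}
```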


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 979 at r9 (raw file):

	// conflicting state. See preCheckOnApplyReplicaChanges for details on how
	// compatibility between the pending change and the new range state is
	// determined. When incompatibility is detected, the pending change is

And here again all pending changes for one logical change are rolled back, right? Might be worth saying explicitly.

@blathers-crl

blathers-crl bot commented Nov 7, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6
Contributor Author

wenyihu6 commented Nov 7, 2025

Rebased on top of master to pick up the PR. FYI - planning to force-push on this one since it’s short.

Member

@tbg tbg left a comment


@tbg reviewed 11 of 11 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)

Contributor Author

@wenyihu6 wenyihu6 left a comment


Dragging the matrix from r10 to r11 should show the recent change. I still want to discuss #157032 (comment), so I will hold off on merging for now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 942 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm still a little shaky on this. When the change is "enacted", does it get dropped from pendingChanges? And what keeps track of the fact that there's still adjusted load? Do we just leave the load adjustment around (but there's no more pending change to tie it to) and we also track when the last change was made (for the timeout mechanism)?

Added more details for clarity: when a change is enacted, it is removed from the range and cluster state’s pending changes, and enactedAtTime is set at that point. The load adjustment remains at the store level in ss.adjusted.loadPendingChanges. When processStoreLoadMsg sees that lagForChangeReflectedInLoad has elapsed since enactedAtTime, the pending load is no longer applied. It isn’t rolled back - similar to how we treat enacted replica changes, we just treat it as already reflected in the load message and avoid applying it again.
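
A rough sketch of that lag-based handoff, with made-up names standing in for the real enactedAtTime / lagForChangeReflectedInLoad / processStoreLoadMsg pieces:

```go
// Rough sketch; names and types are illustrative, not the real cluster_state.go.
package main

import (
	"fmt"
	"time"
)

type loadPendingChange struct {
	cpuDelta  int64
	enactedAt time.Time // set when the change is marked enacted
}

// adjustedLoad mimics the idea: start from the reported load and re-apply only
// those pending adjustments that the report is not yet assumed to include.
func adjustedLoad(reportedCPU int64, changes map[int]*loadPendingChange,
	now time.Time, lag time.Duration) int64 {
	adjusted := reportedCPU
	for id, c := range changes {
		if !c.enactedAt.IsZero() && now.Sub(c.enactedAt) >= lag {
			// Enacted long enough ago that the reported load should already
			// reflect it: stop applying the delta and drop the bookkeeping.
			delete(changes, id)
			continue
		}
		adjusted += c.cpuDelta
	}
	return adjusted
}

func main() {
	changes := map[int]*loadPendingChange{
		1: {cpuDelta: 50, enactedAt: time.Now().Add(-30 * time.Second)},
		2: {cpuDelta: 20}, // still pending, always re-applied
	}
	// With a 15s lag, change 1 is treated as reflected in the report.
	fmt.Println(adjustedLoad(500, changes, time.Now(), 15*time.Second)) // 520
}
```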


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 953 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

rangeMag typo

Fixed.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 956 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sumeer's example from yesterday was interesting:

  • initially {s1*,s2,s3}
  • decision: add s4 and transfer the lease to it
  • this produces one pending change (new replica s4 with lease) - iiuc
  • StoreLeaseholderMsg arrives that shows {s1*,s2,s3,s4}
  • pending change is still pending (b/c s4 doesn't have the lease yet)
  • if nothing else happens (store leaseholder msgs stay the same) eventually the pending change gets GC'ed, and range state reverts to {s1*,s2,s3} (even though it should be {s1*,s2,s3,s4} as seen in recent leaseholder msgs)

Now of course the next leaseholder msg would roll the state forward again to the correct `{s1*,s2,s3,s4}. But this behavior is weird.

Sumeer seems to want to fix it, so hopefully it won't remain for long. I'll defer to @sumeerbhola what should be documented now. Personally I'd add the example and then Sumeer can update it when he updates the rest of this comment as part of the second round of re-work.

This example looks the same as the one described in point 3:

// and leaseholder to s4. This will be modeled with two pending changes, one
// that removes the replica and leaseholder from s3, and another that adds
// the replica and leaseholder to s4. An intermediate state that can be
// observed is {s1, s2, s3, s4} with the lease still at s3. But the pending
// change for adding s4 includes both that it has a replica, and it has the
// lease, so we will not mark it done, and keep pretending that the whole
// change is pending. Since lease transfers are fast, we accept this
// imperfect modeling fidelity. One consequence of this imperfect modeling
// is that if in this example there are no further changes observed until
// GC, the allocator will undo both changes and go back to the state {s1,
// s2, s3} with s3 as the leaseholder. That is, it has forgotten that s4 was
// added. This is unavoidable and will be fixed by the first
Added a sentence directing readers to point 3 for this imperfection.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 979 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

And here again all pending changes for one logical change are rolled back, right? Might be worth saying explicitly.

This made me realize that my comment was off - I don’t think we roll back the replica pending changes, just the load adjustment. When incompatibility is detected, we just discard the remaining changes:
rs.removePendingChangeTracking(change.ChangeID)
delete(cs.stores[change.target.StoreID].adjusted.loadPendingChanges, change.ChangeID)
delete(cs.pendingChanges, change.ChangeID)
cs.undoChangeLoadDelta(change.ReplicaChange)
}
if n := len(rs.pendingChanges); n > 0 {
It is possible that some of the logical action may have already been enacted, so this does not necessarily include all pending changes for a single logical change. But when we revert, it should always include all pending changes for one logical change (which only happens when we GC things and when AdjustPendingChangesDisposition(failed)).
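
To make that distinction concrete, a toy sketch (hypothetical structure, not the real code) of one logical action tracked as two changes: the incompatibility path only discards what is still pending, while the explicit revert path rolls back everything belonging to the logical action.

```go
// Hypothetical structure for illustration; not the real pending-change code.
package main

import "fmt"

type state struct {
	adjustedCPU int64         // base load plus deltas of changes still being modeled
	pending     map[int]int64 // changeID -> cpu delta of changes not yet enacted
}

// discardRemaining mirrors the incompatibility path: whatever is still pending
// gets dropped and its load delta undone. If part of the logical action was
// already enacted, it is no longer in `pending` and is left alone.
func (s *state) discardRemaining() {
	for id, delta := range s.pending {
		s.adjustedCPU -= delta
		delete(s.pending, id)
	}
}

// revertLogicalChange mirrors GC / AdjustPendingChangesDisposition(failed):
// it is invoked with every change ID of the logical action, and since nothing
// has been enacted yet in that case, all of them are still pending and all of
// them are rolled back together.
func (s *state) revertLogicalChange(ids ...int) {
	for _, id := range ids {
		if delta, ok := s.pending[id]; ok {
			s.adjustedCPU -= delta
			delete(s.pending, id)
		}
	}
}

func main() {
	// Logical action "move replica+lease s3 -> s4" as two changes:
	// change 1 (remove from s3, delta -30) already enacted, change 2 (add to
	// s4, delta +100) still pending.
	s := &state{adjustedCPU: 70, pending: map[int]int64{2: 100}}
	s.discardRemaining()
	fmt.Println(s.adjustedCPU, len(s.pending)) // -30 0: only the pending part was undone
}
```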

Contributor Author

@wenyihu6 wenyihu6 left a comment


r10 to r13 now

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)

Contributor Author

@wenyihu6 wenyihu6 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 979 at r9 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

This made me realize that my comment was off - I don’t think we roll back the replica pending changes, just the load adjustment. When incompatibility is detected, we just discard the remaining changes:
rs.removePendingChangeTracking(change.ChangeID)
delete(cs.stores[change.target.StoreID].adjusted.loadPendingChanges, change.ChangeID)
delete(cs.pendingChanges, change.ChangeID)
cs.undoChangeLoadDelta(change.ReplicaChange)
}
if n := len(rs.pendingChanges); n > 0 {
It is possible that some of the logical action may have already been enacted, so this does not necessarily include all pending changes for a single logical change. But when we revert, it should always include all pending changes for one logical change (which only happens when we GC things and when AdjustPendingChangesDisposition(failed)).

Hm, perhaps we are saying the same thing - it is rolled back in the sense that we just don't apply the pending changes again to the new leaseholder range state.

@wenyihu6
Contributor Author

wenyihu6 commented Nov 7, 2025

Pulled the discussion over to Slack. Will bors this in.

bors r+

craig bot pushed a commit that referenced this pull request Nov 7, 2025
157032: mmaprototype: comment on the lifecycle of pending changes r=wenyihu6 a=wenyihu6

Epic: CRDB-55052
Release note: None

Co-authored-by: wenyihu6 <wenyi@cockroachlabs.com>
@craig
Contributor

craig bot commented Nov 7, 2025

Build failed:

@wenyihu6
Contributor Author

wenyihu6 commented Nov 7, 2025

Rebased.

bors r+

@craig
Contributor

craig bot commented Nov 7, 2025

@craig craig bot merged commit c24bbca into cockroachdb:master Nov 7, 2025
24 checks passed