allocatorimpl,asim: fix replace constraints check fn #117900
Force-pushed from 5a3f25f to 7e5f08e.
Introduce a `decommission_conformance` simulator test, which reproduces a simplified scenario of cockroachdb#117886.

Informs: cockroachdb#117886
Release note: None
Force-pushed from 25173e4 to 7d1c9b3.
once #117941 merges and this is rebased on it.
Reviewed 1 of 1 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)
-- commits
line 28 at r2:
Can we change this release note to talk more about the user-visible behavior instead of the implementation? For instance, instead of saying "replacement targets returned", we can talk about the decommissioning not getting stuck on a rebalance operation that was falsely determined to be unsafe.
Force-pushed from 7d1c9b3 to b28aa6a.
TYFTR
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we change this release note to talk more about the user-visible behavior instead of the implementation? For instance, instead of saying "replacement targets returned", we can talk about the decommissioning not getting stuck on a rebalance operation that was falsely determined to be unsafe.
Fair point, I've updated to be closer to this language.
TYFTR bors r=nvanbenschoten.
bors r-
Canceled.
Previously, it was possible for replica replacement to not return a valid target when some constraint was initially undersatisfied in addition to the replica being replaced satisfying a separate constraint. When this occurred, it could stall decommissioning a node, as replicas become stuck with the allocator returning no valid replacement target.

Update the `replaceConstraintsCheck` function to correctly consider whether the replacement store satisfies the constraint when computing if the replacement is necessary.

Fixes: cockroachdb#117886
Part of: cockroachdb#117891

Release note (bug fix): Decommissioning replicas which are part of a mis-replicated range will no longer get stuck on a rebalance operation that was falsely determined to be unsafe. This bug was introduced in 23.1.0.
Force-pushed from b28aa6a to 3d731c5.
Updated the unit test to only replace existing stores.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)
TYFTR bors r=nvanbenschoten
Build failed:
Canceled with comment: Agent removed
Retrying bors r=nvanbenschoten
Build failed:
bors r=nvanbenschoten
Build succeeded:
Previously, it was possible for replica replacement to not return a
valid target when some constraint was initially undersatisfied in
addition to the replica being replaced satisfying a separate constraint.
When this occurred, it could stall decommissioning a node, as replicas
become stuck with the allocator returning no valid replacement target.
Update the `replaceConstraintsCheck` function to correctly consider
whether the replacement store satisfies the constraint when computing if
the replacement is necessary.
Fixes: #117886
Part of: #117891
Release note (bug fix): Decommissioning replicas which are part of a
mis-replicated range will no longer get stuck on a rebalance operation
that was falsely determined to be unsafe. This bug was introduced in
23.1.0.
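
For readers unfamiliar with the check, here is a minimal Go sketch of the "is this replacement necessary?" condition described above. It is not the CockroachDB implementation: `Store`, `Constraint`, `satisfies`, and `replacementNecessary` are hypothetical names, constraints are reduced to a single required attribute, and the logic is only one reading of the description (a constraint counts toward necessity only when the replacement store itself satisfies it).

```go
package main

// Constraint is a hypothetical, simplified constraint: NumReplicas replicas
// must live on stores carrying the attribute Attr (for example "region=a").
type Constraint struct {
	Attr        string
	NumReplicas int
}

// Store is a hypothetical store identified by ID with a set of attributes.
type Store struct {
	ID    int
	Attrs map[string]bool
}

// satisfies reports whether store s carries the attribute required by c.
func satisfies(s Store, c Constraint) bool {
	return s.Attrs[c.Attr]
}

// replacementNecessary reports whether swapping `replaced` for `candidate`
// is needed to reach or keep conformance with at least one constraint.
// `existing` lists the stores currently holding replicas, including `replaced`.
func replacementNecessary(candidate, replaced Store, existing []Store, constraints []Constraint) bool {
	for _, c := range constraints {
		matching := 0
		replacedMatches := false
		for _, s := range existing {
			if !satisfies(s, c) {
				continue
			}
			matching++
			if s.ID == replaced.ID {
				replacedMatches = true
			}
		}
		// The constraint needs another replica if it is undersatisfied today,
		// or if it is exactly satisfied and the outgoing store is one of the
		// stores satisfying it.
		needsReplica := matching < c.NumReplicas ||
			(matching == c.NumReplicas && replacedMatches)
		// Assumption: only count the constraint when the candidate itself
		// satisfies it. Without this condition, an unrelated undersatisfied
		// constraint would mark every candidate as necessary.
		if needsReplica && satisfies(candidate, c) {
			return true
		}
	}
	return false
}
```

In this model, take a range constrained to one replica in `region=a` and one in `region=b`, with replicas currently on a `region=b` store and a `region=c` store. When the `region=b` store is decommissioned, a `region=b` candidate is reported as necessary, while a `region=c` candidate is not, even though the `region=a` constraint is undersatisfied throughout.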