Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocatorimpl,asim: fix replace constraints check fn #117900

Merged
merged 2 commits into from Jan 26, 2024

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Jan 17, 2024

Previously, it was possible for replica replacement to not return a
valid target when some constraint was initially undersatisfied in
addition to the replica being replaced satisfying a separate constraint.

When this occurred, it could stall decommissioning a node, as replicas
become stuck with the allocator returning no valid replacement target.

Update the replaceConstraintsCheck function to correctly consider
whether the replacement store satisfies the constraint when computing if
the replacement is necessary.

Fixes: #117886
Part of: #117891
Release note (bug fix): Decommissioning replicas which are part of a
mis-replicated range will no longer get stuck on a rebalance operation
that was falsely determined to be unsafe. This bug was introduced in
23.1.0.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli force-pushed the 240117.replace-constraint-fix branch 2 times, most recently from 5a3f25f to 7e5f08e Compare January 18, 2024 15:27
@kvoli kvoli self-assigned this Jan 18, 2024
@kvoli kvoli added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Jan 18, 2024
Introduce a `decommission_conformance` simulator test, which reproduces
a simplified scenario of cockroachdb#117886.

Informs: cockroachdb#117886
Release note: None
@kvoli kvoli force-pushed the 240117.replace-constraint-fix branch 2 times, most recently from 25173e4 to 7d1c9b3 Compare January 18, 2024 16:38
@kvoli kvoli marked this pull request as ready for review January 18, 2024 17:56
@kvoli kvoli requested a review from a team as a code owner January 18, 2024 17:56
@kvoli kvoli changed the title allocatorimpl,asim: [dnm] fix replace constraints check fn allocatorimpl,asim: fix replace constraints check fn Jan 18, 2024
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: once #117941 merges and this is rebased on it.

Reviewed 1 of 1 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)


-- commits line 28 at r2:
Can we change this release note to talk more about the user-visible behavior instead of the implementation? For instance, instead of saying "replacement targets returned", we can talk about the decommissioning not getting stuck on a rebalance operation that was falsely determined to be unsafe.

@kvoli kvoli force-pushed the 240117.replace-constraint-fix branch from 7d1c9b3 to b28aa6a Compare January 23, 2024 21:39
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYFTR

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


-- commits line 28 at r2:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we change this release note to talk more about the user-visible behavior instead of the implementation? For instance, instead of saying "replacement targets returned", we can talk about the decommissioning not getting stuck on a rebalance operation that was falsely determined to be unsafe.

Fair point, I've updated to be closer to this language.

@kvoli
Copy link
Collaborator Author

kvoli commented Jan 24, 2024

TestStoreRangeMergeWithData timed out, unrelated to this PR.

TYFTR

bors r=nvanbenschoten.

@kvoli
Copy link
Collaborator Author

kvoli commented Jan 24, 2024

bors r-

@craig
Copy link
Contributor

craig bot commented Jan 24, 2024

Canceled.

Previously, it was possible for replica replacement to not return a
valid target when some constraint was initially undersatisfied in
addition to the replica being replaced satisfying a separate constraint.

When this occurred, it could stall decommissioning a node, as replicas
become stuck with the allocator returning no valid replacement target.

Update the `replaceConstraintsCheck` function to correctly consider
whether the replacement store satisfies the constraint when computing if
the replacement is necessary.

Fixes: cockroachdb#117886
Part of: cockroachdb#117891
Release note (bug fix): Decommissioning replicas which are part of a
mis-replicated range will no longer get stuck on a rebalance operation
that was falsely determined to be unsafe. This bug was introduced in
23.1.0.
@kvoli kvoli force-pushed the 240117.replace-constraint-fix branch from b28aa6a to 3d731c5 Compare January 24, 2024 19:58
@kvoli
Copy link
Collaborator Author

kvoli commented Jan 24, 2024

Updated the unit test to only replace existing stores.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)

@kvoli
Copy link
Collaborator Author

kvoli commented Jan 25, 2024

TYFTR

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Jan 25, 2024

Build failed:

@kvoli
Copy link
Collaborator Author

kvoli commented Jan 25, 2024

Canceled with comment: Agent removed

Retrying

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Jan 25, 2024

Build failed:

@kvoli
Copy link
Collaborator Author

kvoli commented Jan 25, 2024

bors r=nvanbenschoten

@craig
Copy link
Contributor

craig bot commented Jan 26, 2024

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: replaceConstraintsCheck considers necessary replacement to be invalid, can stall decommission
3 participants