HDDS-9257. LegacyReplicationManager: Unhealthy replicas could block under replication handling #5261
Merged
errose28 merged 9 commits into apache:master on Sep 12, 2023
I am checking for at least 3 replicas in the latest commit, not 4. This is consistent with #5255.
siddhantsangwan commented Sep 8, 2023
errose28 reviewed Sep 9, 2023
Thanks for quickly fixing this @siddhantsangwan. We will need a unit test similar to the one you added but for the case where SCM state is quasi-closed, 1 or 2 datanodes have a quasi-closed replica, and the rest have an unhealthy replica.
…alling placement policy. delete only UNHEALTHY replica for quasi_closed container.
I've addressed review comments and added a unit test for quasi-closed containers.
sodonnel approved these changes Sep 12, 2023
LGTM - Thanks for changing this to use the shared method as it will make any changes to that shared logic easier to make in the future.
errose28 approved these changes Sep 12, 2023
Thanks for working on this @siddhantsangwan. LGTM.
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Oct 24, 2023
…ould block under replication handling (apache#5261) (cherry picked from commit 872401c) Change-Id: I22d0dcf6c55e1c2a4e777a67612a0eb170de6be5
What changes were proposed in this pull request?
Problem:
The legacy replication manager (LRM) currently resolves mismatched replicas (those whose state does not match SCM's container state) by
This approach does not work when LRM is presented with the following small cluster situation:
SCM state: CLOSED.
5 datanodes in the cluster.
Replica states: CLOSED, CLOSED, QUASI_CLOSED, QUASI_CLOSED, QUASI_CLOSED.
LRM will not make progress because there is no datanode to add a closed replica to that does not already have a replica.
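To illustrate why the situation above is stuck, here is a minimal sketch in plain Java. It is not the Ozone code: `BlockedReplicationSketch`, `eligibleTargets`, and the datanode names `dn1` to `dn5` are all hypothetical. It only models the rule stated above, that a new replica cannot be placed on a datanode that already holds a replica of the container.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; these names do not exist in Ozone.
final class BlockedReplicationSketch {

  // Datanodes that could receive a new replica: every node in the cluster
  // except the ones that already hold a replica of this container.
  static Set<String> eligibleTargets(List<String> allDatanodes,
      List<String> nodesWithReplica) {
    Set<String> targets = new LinkedHashSet<>(allDatanodes);
    targets.removeAll(nodesWithReplica);
    return targets;
  }

  public static void main(String[] args) {
    List<String> cluster = Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5");
    // CLOSED on dn1 and dn2, QUASI_CLOSED on dn3, dn4 and dn5.
    List<String> holders = Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5");
    // With all five nodes occupied there is no target left.
    System.out.println(eligibleTargets(cluster, holders).isEmpty());
  }
}
```

Once one mismatched replica is deleted (say the one on `dn4`), `eligibleTargets` returns that node again and a CLOSED replica can be copied there.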
Changes proposed:
Try to delete an unhealthy replica (UNHEALTHY or QUASI_CLOSED) to free up a datanode for a healthy replica. We prefer deleting a replica whose sequence ID is lower than the container's. If the container is QUASI_CLOSED, the replica to be deleted must not have a unique origin node ID. The replica must also be on a healthy, in-service node.
We do this only if there isn't a pending delete, if there are at least 3 (EDITED: was 4) replicas, and if there is at least one replica which matches the container's lifecycle state.
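The selection rules above can be sketched roughly as follows. This is a simplified model, not the code in `LegacyReplicationManager`: the `Replica` class, the `State` enum, and `selectReplicaToDelete` are hypothetical names, and the assumption that only replicas whose state differs from the container's are candidates is mine, based on the "mismatched replicas" wording in the problem statement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model. None of these types are the real Ozone classes.
final class ReplicaDeletionSketch {
  enum State { CLOSED, QUASI_CLOSED, UNHEALTHY }

  static final class Replica {
    final String datanode;
    final State state;
    final long sequenceId;
    final String originNodeId;
    final boolean nodeHealthyInService;

    Replica(String datanode, State state, long sequenceId,
        String originNodeId, boolean nodeHealthyInService) {
      this.datanode = datanode;
      this.state = state;
      this.sequenceId = sequenceId;
      this.originNodeId = originNodeId;
      this.nodeHealthyInService = nodeHealthyInService;
    }
  }

  static Optional<Replica> selectReplicaToDelete(State containerState,
      long containerSequenceId, List<Replica> replicas, boolean deletePending) {
    // Preconditions: no pending delete, at least 3 replicas, and at least
    // one replica matching the container's lifecycle state.
    boolean anyMatching =
        replicas.stream().anyMatch(r -> r.state == containerState);
    if (deletePending || replicas.size() < 3 || !anyMatching) {
      return Optional.empty();
    }
    // Count replicas per origin node so unique origins can be protected.
    Map<String, Integer> originCounts = new HashMap<>();
    for (Replica r : replicas) {
      originCounts.merge(r.originNodeId, 1, Integer::sum);
    }
    List<Replica> candidates = new ArrayList<>();
    for (Replica r : replicas) {
      boolean unhealthy =
          r.state == State.UNHEALTHY || r.state == State.QUASI_CLOSED;
      // Assumption: a replica that matches the container state is never deleted.
      boolean mismatched = r.state != containerState;
      boolean uniqueOrigin = originCounts.get(r.originNodeId) == 1;
      boolean originOk = containerState != State.QUASI_CLOSED || !uniqueOrigin;
      if (unhealthy && mismatched && r.nodeHealthyInService && originOk) {
        candidates.add(r);
      }
    }
    // Prefer a replica whose sequence ID is lower than the container's
    // (false sorts before true, so "lower" candidates come first).
    return candidates.stream().min(Comparator.comparing(
        (Replica r) -> r.sequenceId >= containerSequenceId));
  }

  public static void main(String[] args) {
    List<Replica> replicas = Arrays.asList(
        new Replica("dn1", State.CLOSED, 10, "dn1", true),
        new Replica("dn2", State.CLOSED, 10, "dn2", true),
        new Replica("dn3", State.QUASI_CLOSED, 10, "dn3", true),
        new Replica("dn4", State.QUASI_CLOSED, 8, "dn4", true),
        new Replica("dn5", State.QUASI_CLOSED, 10, "dn5", true));
    Optional<Replica> chosen =
        selectReplicaToDelete(State.CLOSED, 10, replicas, false);
    System.out.println(chosen.map(r -> r.datanode).orElse("none"));
  }
}
```

In the five-datanode example from the problem statement, this sketch picks the QUASI_CLOSED replica with the lower sequence ID, which frees its datanode for a CLOSED replica. For a QUASI_CLOSED container, the mismatch filter leaves only UNHEALTHY replicas as candidates.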
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9257
How was this patch tested?
Added a unit test.