Skip to content

HDDS-9257. LegacyReplicationManager: Unhealthy replicas could block under replication handling#5261

Merged
errose28 merged 9 commits intoapache:masterfrom
siddhantsangwan:HDDS-9257
Sep 12, 2023
Merged

HDDS-9257. LegacyReplicationManager: Unhealthy replicas could block under replication handling#5261
errose28 merged 9 commits intoapache:masterfrom
siddhantsangwan:HDDS-9257

Conversation

@siddhantsangwan
Copy link
Copy Markdown
Contributor

@siddhantsangwan siddhantsangwan commented Sep 8, 2023

What changes were proposed in this pull request?

Problem:
The legacy replication manager currently resolves mismatched replicas (those whose replica state do not match SCM's container state) by

  1. Replicating the matching replicas until they are fully replicated.
  2. Deleting the mismatched replicas.

This approach does not work when LRM is presented with the following small cluster situation:
SCM state: CLOSED.
5 datanodes in the cluster.
Replica states: CLOSED CLOSED QUASI QUASI QUASI.
LRM will not make progress because there is no datanode to add a closed replica to that does not already have a replica.

Changes proposed:
Try to delete an unhealthy replica (UNHEALTHY or QUASI_CLOSED) to free up a datanode for a healthy replica. We prefer deleting a replica with less sequence id than the container's. If the container is QUASI_CLOSED, then the replica to be deleted should not have a unique origin node id. Also, this replica should be on a healthy, in-service node.

We do this only if there isn't a pending delete, if there are at least 3 (EDITED: was 4) replicas, and if there is at least one replica which matches the container's lifecycle state.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9257

How was this patch tested?

Added a unit test.

@siddhantsangwan siddhantsangwan marked this pull request as ready for review September 8, 2023 18:59
@siddhantsangwan
Copy link
Copy Markdown
Contributor Author

I am checking for at least 3 replicas in the latest commit, not 4. This is consistent with #5255.

Copy link
Copy Markdown
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for quickly fixing this @siddhantsangwan. We will need a unit test similar to the one you added but for the case where SCM state is quasi-closed, 1 or 2 datanodes have a quasi-closed replica, and the rest have an unhealthy replica.

@siddhantsangwan
Copy link
Copy Markdown
Contributor Author

I've addressed review comments and added a unit test for quasi-closed containers.

@siddhantsangwan
Copy link
Copy Markdown
Contributor Author

@sodonnel @errose28 Using the common logic from ReplicationManagerUtil in this PR now. Please review.

Copy link
Copy Markdown
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Thanks for changing this to use the shared method as it will make any changes to that shared logic easier to make in the future.

Copy link
Copy Markdown
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @siddhantsangwan LGTM.

@errose28 errose28 merged commit 872401c into apache:master Sep 12, 2023
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Oct 24, 2023
…ould block under replication handling (apache#5261)

(cherry picked from commit 872401c)
Change-Id: I22d0dcf6c55e1c2a4e777a67612a0eb170de6be5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants