HDDS-9257. LegacyReplicationManager: Unhealthy replicas could block under replication handling by siddhantsangwan · Pull Request #5261 · apache/ozone

siddhantsangwan · 2023-09-08T15:59:52Z

What changes were proposed in this pull request?

Problem:
The legacy replication manager (LRM) currently resolves mismatched replicas (those whose replica state does not match SCM's container state) by:

1. Replicating the matching replicas until they are fully replicated.
2. Deleting the mismatched replicas.

This approach does not work when LRM is presented with the following small cluster situation:
SCM state: CLOSED.
5 datanodes in the cluster.
Replica states: CLOSED CLOSED QUASI QUASI QUASI.
LRM cannot make progress: every datanode already holds a replica of the container, so there is no datanode left on which to place a new CLOSED replica.
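
The stuck state can be illustrated in a few lines of Java. This is a minimal sketch, not Ozone code: the class `StuckReplicationSketch`, the method `candidateTargets`, and the datanode names `dn1` to `dn5` are hypothetical. It assumes a new replica may only be placed on a datanode that holds no replica of the container.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these names come from Ozone. */
public class StuckReplicationSketch {

  /** Returns the datanodes that do not yet hold a replica of the container. */
  public static List<String> candidateTargets(List<String> allDatanodes,
      Map<String, String> replicaStateByDatanode) {
    List<String> targets = new ArrayList<>();
    for (String dn : allDatanodes) {
      if (!replicaStateByDatanode.containsKey(dn)) {
        targets.add(dn);
      }
    }
    return targets;
  }

  public static void main(String[] args) {
    List<String> datanodes = Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5");

    // SCM container state is CLOSED; replica states as in the scenario above.
    Map<String, String> replicas = new LinkedHashMap<>();
    replicas.put("dn1", "CLOSED");
    replicas.put("dn2", "CLOSED");
    replicas.put("dn3", "QUASI_CLOSED");
    replicas.put("dn4", "QUASI_CLOSED");
    replicas.put("dn5", "QUASI_CLOSED");

    // Every datanode holds a replica, so no replication target is available.
    System.out.println("Targets before delete: "
        + candidateTargets(datanodes, replicas));

    // Deleting one mismatched replica frees a datanode for a CLOSED copy.
    replicas.remove("dn5");
    System.out.println("Targets after delete: "
        + candidateTargets(datanodes, replicas));
  }
}
```

With all five datanodes occupied, the first call returns an empty list; once one mismatched replica is removed, its datanode becomes a valid target.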

Changes proposed:
Try to delete an unhealthy replica (UNHEALTHY or QUASI_CLOSED) to free up a datanode for a healthy replica. We prefer deleting a replica whose sequence ID is lower than the container's. If the container is QUASI_CLOSED, the replica to be deleted should not have a unique origin node ID. The replica should also be on a healthy, in-service node.

We do this only if there isn't a pending delete, if there are at least 3 (EDITED: was 4) replicas, and if there is at least one replica which matches the container's lifecycle state.
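
The preconditions and selection rules described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `UnhealthyReplicaDeletionSketch`, `Replica`, `State`, `mayDeleteUnhealthyReplica`, and `selectReplicaToDelete` are all hypothetical names. The sketch interprets "prefer a lower sequence ID" as "pick the candidate with the lowest sequence ID", and it treats only replicas that do not match the container state as candidates. The actual selection logic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; these types are not part of Ozone. */
public class UnhealthyReplicaDeletionSketch {

  public enum State { CLOSED, QUASI_CLOSED, UNHEALTHY }

  public static final class Replica {
    public final String datanode;
    public final State state;
    public final long sequenceId;
    public final String originNodeId;
    public final boolean nodeHealthyAndInService;

    public Replica(String datanode, State state, long sequenceId,
        String originNodeId, boolean nodeHealthyAndInService) {
      this.datanode = datanode;
      this.state = state;
      this.sequenceId = sequenceId;
      this.originNodeId = originNodeId;
      this.nodeHealthyAndInService = nodeHealthyAndInService;
    }
  }

  /** Preconditions: no pending delete, >= 3 replicas, one matching replica. */
  public static boolean mayDeleteUnhealthyReplica(State containerState,
      List<Replica> replicas, int pendingDeletes) {
    if (pendingDeletes > 0 || replicas.size() < 3) {
      return false;
    }
    for (Replica r : replicas) {
      if (r.state == containerState) {
        return true;
      }
    }
    return false;
  }

  /** Picks the eligible mismatched replica with the lowest sequence ID. */
  public static Optional<Replica> selectReplicaToDelete(State containerState,
      List<Replica> replicas, int pendingDeletes) {
    if (!mayDeleteUnhealthyReplica(containerState, replicas, pendingDeletes)) {
      return Optional.empty();
    }
    // Count replicas per origin so unique origins can be protected.
    Map<String, Integer> originCounts = new HashMap<>();
    for (Replica r : replicas) {
      originCounts.merge(r.originNodeId, 1, Integer::sum);
    }
    List<Replica> candidates = new ArrayList<>();
    for (Replica r : replicas) {
      boolean unhealthy =
          r.state == State.UNHEALTHY || r.state == State.QUASI_CLOSED;
      boolean mismatched = r.state != containerState;
      boolean uniqueOrigin = originCounts.get(r.originNodeId) == 1;
      if (!unhealthy || !mismatched || !r.nodeHealthyAndInService) {
        continue;
      }
      if (containerState == State.QUASI_CLOSED && uniqueOrigin) {
        continue; // keep the only replica from this origin
      }
      candidates.add(r);
    }
    return candidates.stream()
        .min(Comparator.comparingLong((Replica r) -> r.sequenceId));
  }

  public static void main(String[] args) {
    // CLOSED container on 5 datanodes; sequence IDs are made up.
    List<Replica> replicas = Arrays.asList(
        new Replica("dn1", State.CLOSED, 10L, "o1", true),
        new Replica("dn2", State.CLOSED, 10L, "o2", true),
        new Replica("dn3", State.QUASI_CLOSED, 10L, "o3", true),
        new Replica("dn4", State.QUASI_CLOSED, 8L, "o4", true),
        new Replica("dn5", State.QUASI_CLOSED, 10L, "o5", true));
    selectReplicaToDelete(State.CLOSED, replicas, 0)
        .ifPresent(r -> System.out.println("Delete replica on " + r.datanode));
  }
}
```

In the example, the replica on `dn4` is selected because it is the eligible candidate with the lowest sequence ID. With a pending delete, or with fewer than 3 replicas, the method returns an empty `Optional` and nothing is deleted.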

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9257

How was this patch tested?

Added a unit test.

siddhantsangwan · 2023-09-08T19:03:12Z

I am checking for at least 3 replicas in the latest commit, not 4. This is consistent with #5255.

...src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java

errose28

Thanks for quickly fixing this @siddhantsangwan. We will need a unit test similar to the one you added but for the case where SCM state is quasi-closed, 1 or 2 datanodes have a quasi-closed replica, and the rest have an unhealthy replica.

...src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java

siddhantsangwan · 2023-09-11T11:47:12Z

I've addressed review comments and added a unit test for quasi-closed containers.

siddhantsangwan · 2023-09-12T07:56:38Z

@sodonnel @errose28 Using the common logic from ReplicationManagerUtil in this PR now. Please review.

sodonnel

LGTM - Thanks for changing this to use the shared method as it will make any changes to that shared logic easier to make in the future.

errose28

Thanks for working on this, @siddhantsangwan. LGTM.

…ould block under replication handling (apache#5261) (cherry picked from commit 872401c) Change-Id: I22d0dcf6c55e1c2a4e777a67612a0eb170de6be5

HDDS-9257. LegacyReplicationManager: Unhealthy replicas could block under replication handling

e4b6219

siddhantsangwan requested review from errose28 and sodonnel September 8, 2023 15:59

check for at least 3 nodes, not 4. fix sorting of bcsid. add a unit test.

32caffc

siddhantsangwan marked this pull request as ready for review September 8, 2023 18:59

fix checkstyle

fbd6a21

errose28 reviewed Sep 9, 2023

siddhantsangwan added 3 commits September 11, 2023 13:56

add ut for quasi_closed container

1530e4e

add unit test for quasi_closed container. exclude all replicas when calling placement policy. delete only UNHEALTHY replica for quasi_closed container.

d7a21f7

fix checkstyle

7567dcc

sodonnel mentioned this pull request Sep 11, 2023

HDDS-8536. ReplicationManager: Unhealthy replicas could block Ratis containers being replicated #5255

Merged

siddhantsangwan added 3 commits September 11, 2023 22:01

fix small bugs in test code

77c7820

Merge branch 'master' into HDDS-9257

1ca9da0

use logic from ReplicationManagerUtil to select a replica to delete

e6c0cc0

sodonnel approved these changes Sep 12, 2023

errose28 approved these changes Sep 12, 2023

errose28 merged commit 872401c into apache:master Sep 12, 2023

errose28 merged 9 commits into apache:master from siddhantsangwan:HDDS-9257
