HDDS-7101. EC: ReplicationManager - handle UNHEALTHY replicas #3962

siddhantsangwan · 2022-11-15T10:49:34Z

What changes were proposed in this pull request?

If the DN scrubber marks a container as unhealthy, the EC Replication Manager logic needs to be able to deal with it. It needs to remove the unhealthy one, and create a new copy using EC Reconstruction.

This PR proposes considering unhealthy replicas when checking for under and over replication in ECReplicationCheckHandler.
EDIT: While iterating through reviews, it was decided that we should take account of unhealthy replicas in the underlying ContainerReplicaCount interface that's used for checking sufficient replication (ECContainerReplicaCount#isSufficientlyReplicated). Instead of having to remove unhealthy replicas and then passing the replica set to ECContainerReplicaCount, now unhealthy replicas will be removed in one place in ECContainerReplicaCount. This change marks a shift from the logic in Legacy RM, where under/over replication checks didn't take into account unhealthy replicas. The following example should help in understanding the logic proposed in this change.

Consider a closed EC 3-2 container with the following replicas:
Replica Index 1: Closed
Replica Index 2: Closed
Replica Index 3: Closed, Unhealthy (2 replicas for this index)
Replica Index 4: Unhealthy
Replica Index 5: Closed

This is a case of under replication because index 4 is unavailable. Index 3 is not considered over replicated because its second copy is unhealthy. In the current iteration of replication manager,
ECReplicationCheckHandler will find this container is under replicated and will add it to the under replication queue.
When the container is no longer under (or over, mis) replicated, the newly introduced handler ClosedWithUnhealthyReplicasHandler will handle it by deleting the unhealthy replicas in a future iteration.

ClosedWithUnhealthyReplicasHandler is meant to handle closed EC containers with unhealthy replicas but no under, over or mis replication.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7101

How was this patch tested?

TestClosedWithUnhealthyReplicasHandler and TestECReplicationCheckHandler

sodonnel · 2022-11-15T11:58:36Z

.../java/org/apache/hadoop/hdds/scm/container/replication/health/ECReplicationCheckHandler.java

+    This is a case of under replication because index 4 is unavailable. Index
+    3 is not considered over replicated because its second copy is unhealthy.
+     */
+    replicas.removeIf(containerReplica -> containerReplica.getState() ==


I think there is a very subtle bug here. The replicas set comes from the ContainerCheckRequest object, and from it, you are removing the unhealthy replicas. Later, that same request will end up in in the unhealthy handler and it will find there are no unhealthy replicas, as we removed them here.

I guess we need to create a copy of the set to modify it, but its too easy to fall into this trap and the way our tests are currently structured, testing each handler in isolation, we would never catch it.

I think we need to also update the ContainerCheckRequest object to wrap all the sets in Collections.UnmodifableSet(...) and then this operation would get an error.

Great catch. I've removed this code and instead added logic to consider unhealthy replicas in ECContainerReplicaCount. Now it only exists in that one place.

I think we need to also update the ContainerCheckRequest object to wrap all the sets in Collections.UnmodifableSet(...) and then this operation would get an error.

Done

sodonnel · 2022-11-15T12:21:19Z

.../apache/hadoop/hdds/scm/container/replication/health/ClosedWithUnhealthyReplicasHandler.java

+    boolean foundUnhealthy = false;
+    // send delete commands for unhealthy replicas
+    for (ContainerReplica replica : replicas) {
+      if (replica.getState() == ContainerReplicaProto.State.UNHEALTHY) {


We know that an under or over replicated container should never get here, but if there was some bug that meant it did, we could delete unhealthy replicas that don't have another copy.

Do you think we should have one sanity check here - when we find an unhealthy replica with index N, check there is also another replica which is CLOSED and has index X too? That way, we don't need to do a full under / over replicated check here, but we have some protection against another bug causing under-replicated containers coming here somehow.

Yes, I was also thinking about having a check. I've added this check to the handler in the latest commit.

… address comments

siddhantsangwan · 2022-11-15T20:48:15Z

@sodonnel Thanks for the review. I've updated the changes with a sanity check and also added this handler to the chain in RM. Instead of having to remove unhealthy replicas every time we need to check for sufficient replication, I've added this logic to ECContainerReplicaCount (as discussed offline).

sodonnel · 2022-11-16T09:55:18Z

.../apache/hadoop/hdds/scm/container/replication/health/ClosedWithUnhealthyReplicasHandler.java

+
+    // some unhealthy replicas were found so mark UNHEALTHY (note that
+    // UNHEALTHY is currently not a LifeCycleState for a Container)
+    if (foundUnhealthy) {


Lets leave this as it is for now, but I am not sure about this. To get here, the container is neither over or under replicated, but it has some unhealthy replicas to be removed. Its not really degraded in anyway from a users perspective, except that it uses more space than it should.

With that in mind, we have no distinction between an under-replicated container due to missing replicas vs under-replicated due to unhealthy replicas. From a users perspective they are much the same thing. The only way to debug this, would be to take a missing container, and list its replicas, and from there you should be able to see some are unhealthy.

We will go ahead with this for now, but we may have to have a think about these states, as unhealthy sounds bad, where it isn't really in this case.

sodonnel

LGTM - thanks for working on this and adding in the suggested changes.

HDDS-7101. EC: ReplicationManager - handle UNHEALTHY replicas

67ff84c

siddhantsangwan requested review from sodonnel and umamaheswararao November 15, 2022 10:49

sodonnel reviewed Nov 15, 2022

View reviewed changes

add logic in ECContainerReplicaCount to consider unhealthy replicas +…

4b56267

… address comments

sodonnel reviewed Nov 16, 2022

View reviewed changes

sodonnel approved these changes Nov 16, 2022

View reviewed changes

sodonnel merged commit 006df10 into apache:master Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-7101. EC: ReplicationManager - handle UNHEALTHY replicas #3962

HDDS-7101. EC: ReplicationManager - handle UNHEALTHY replicas #3962

Uh oh!

siddhantsangwan commented Nov 15, 2022 •

edited

Loading

Uh oh!

sodonnel Nov 15, 2022

Uh oh!

siddhantsangwan Nov 15, 2022

Uh oh!

sodonnel Nov 15, 2022

Uh oh!

siddhantsangwan Nov 15, 2022

Uh oh!

siddhantsangwan commented Nov 15, 2022

Uh oh!

sodonnel Nov 16, 2022

Uh oh!

sodonnel left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HDDS-7101. EC: ReplicationManager - handle UNHEALTHY replicas #3962

HDDS-7101. EC: ReplicationManager - handle UNHEALTHY replicas #3962

Uh oh!

Conversation

siddhantsangwan commented Nov 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

sodonnel Nov 15, 2022

Choose a reason for hiding this comment

Uh oh!

siddhantsangwan Nov 15, 2022

Choose a reason for hiding this comment

Uh oh!

sodonnel Nov 15, 2022

Choose a reason for hiding this comment

Uh oh!

siddhantsangwan Nov 15, 2022

Choose a reason for hiding this comment

Uh oh!

siddhantsangwan commented Nov 15, 2022

Uh oh!

sodonnel Nov 16, 2022

Choose a reason for hiding this comment

Uh oh!

sodonnel left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

siddhantsangwan commented Nov 15, 2022 •

edited

Loading