-
Notifications
You must be signed in to change notification settings - Fork 594
HDDS-7101. EC: ReplicationManager - handle UNHEALTHY replicas #3962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| This is a case of under replication because index 4 is unavailable. Index | ||
| 3 is not considered over replicated because its second copy is unhealthy. | ||
| */ | ||
| replicas.removeIf(containerReplica -> containerReplica.getState() == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a very subtle bug here. The replicas set comes from the ContainerCheckRequest object, and from it, you are removing the unhealthy replicas. Later, that same request will end up in in the unhealthy handler and it will find there are no unhealthy replicas, as we removed them here.
I guess we need to create a copy of the set to modify it, but its too easy to fall into this trap and the way our tests are currently structured, testing each handler in isolation, we would never catch it.
I think we need to also update the ContainerCheckRequest object to wrap all the sets in Collections.UnmodifableSet(...) and then this operation would get an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. I've removed this code and instead added logic to consider unhealthy replicas in ECContainerReplicaCount. Now it only exists in that one place.
I think we need to also update the ContainerCheckRequest object to wrap all the sets in Collections.UnmodifableSet(...) and then this operation would get an error.
Done
| boolean foundUnhealthy = false; | ||
| // send delete commands for unhealthy replicas | ||
| for (ContainerReplica replica : replicas) { | ||
| if (replica.getState() == ContainerReplicaProto.State.UNHEALTHY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We know that an under or over replicated container should never get here, but if there was some bug that meant it did, we could delete unhealthy replicas that don't have another copy.
Do you think we should have one sanity check here - when we find an unhealthy replica with index N, check there is also another replica which is CLOSED and has index X too? That way, we don't need to do a full under / over replicated check here, but we have some protection against another bug causing under-replicated containers coming here somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was also thinking about having a check. I've added this check to the handler in the latest commit.
… address comments
|
@sodonnel Thanks for the review. I've updated the changes with a sanity check and also added this handler to the chain in RM. Instead of having to remove unhealthy replicas every time we need to check for sufficient replication, I've added this logic to |
|
|
||
| // some unhealthy replicas were found so mark UNHEALTHY (note that | ||
| // UNHEALTHY is currently not a LifeCycleState for a Container) | ||
| if (foundUnhealthy) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets leave this as it is for now, but I am not sure about this. To get here, the container is neither over or under replicated, but it has some unhealthy replicas to be removed. Its not really degraded in anyway from a users perspective, except that it uses more space than it should.
With that in mind, we have no distinction between an under-replicated container due to missing replicas vs under-replicated due to unhealthy replicas. From a users perspective they are much the same thing. The only way to debug this, would be to take a missing container, and list its replicas, and from there you should be able to see some are unhealthy.
We will go ahead with this for now, but we may have to have a think about these states, as unhealthy sounds bad, where it isn't really in this case.
sodonnel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for working on this and adding in the suggested changes.
What changes were proposed in this pull request?
If the DN scrubber marks a container as unhealthy, the EC Replication Manager logic needs to be able to deal with it. It needs to remove the unhealthy one, and create a new copy using EC Reconstruction.
This PR proposes considering unhealthy replicas when checking for under and over replication in
ECReplicationCheckHandler.EDIT: While iterating through reviews, it was decided that we should take account of unhealthy replicas in the underlying
ContainerReplicaCountinterface that's used for checking sufficient replication (ECContainerReplicaCount#isSufficientlyReplicated). Instead of having to remove unhealthy replicas and then passing the replica set toECContainerReplicaCount, now unhealthy replicas will be removed in one place inECContainerReplicaCount. This change marks a shift from the logic in Legacy RM, where under/over replication checks didn't take into account unhealthy replicas. The following example should help in understanding the logic proposed in this change.Consider a closed EC 3-2 container with the following replicas:
Replica Index 1: Closed
Replica Index 2: Closed
Replica Index 3: Closed, Unhealthy (2 replicas for this index)
Replica Index 4: Unhealthy
Replica Index 5: Closed
This is a case of under replication because index 4 is unavailable. Index 3 is not considered over replicated because its second copy is unhealthy. In the current iteration of replication manager,
ECReplicationCheckHandlerwill find this container is under replicated and will add it to the under replication queue.When the container is no longer under (or over, mis) replicated, the newly introduced handler
ClosedWithUnhealthyReplicasHandlerwill handle it by deleting the unhealthy replicas in a future iteration.ClosedWithUnhealthyReplicasHandleris meant to handle closed EC containers with unhealthy replicas but no under, over or mis replication.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7101
How was this patch tested?
TestClosedWithUnhealthyReplicasHandlerandTestECReplicationCheckHandler