Skip to content

Conversation

@siddhantsangwan
Copy link
Contributor

@siddhantsangwan siddhantsangwan commented Nov 15, 2022

What changes were proposed in this pull request?

If the DN scrubber marks a container as unhealthy, the EC Replication Manager logic needs to be able to deal with it. It needs to remove the unhealthy one, and create a new copy using EC Reconstruction.

This PR proposes considering unhealthy replicas when checking for under and over replication in ECReplicationCheckHandler.
EDIT: While iterating through reviews, it was decided that we should take account of unhealthy replicas in the underlying ContainerReplicaCount interface that's used for checking sufficient replication (ECContainerReplicaCount#isSufficientlyReplicated). Instead of having to remove unhealthy replicas and then passing the replica set to ECContainerReplicaCount, now unhealthy replicas will be removed in one place in ECContainerReplicaCount. This change marks a shift from the logic in Legacy RM, where under/over replication checks didn't take into account unhealthy replicas. The following example should help in understanding the logic proposed in this change.

Consider a closed EC 3-2 container with the following replicas:
Replica Index 1: Closed
Replica Index 2: Closed
Replica Index 3: Closed, Unhealthy (2 replicas for this index)
Replica Index 4: Unhealthy
Replica Index 5: Closed

This is a case of under replication because index 4 is unavailable. Index 3 is not considered over replicated because its second copy is unhealthy. In the current iteration of replication manager,
ECReplicationCheckHandler will find this container is under replicated and will add it to the under replication queue.
When the container is no longer under (or over, mis) replicated, the newly introduced handler ClosedWithUnhealthyReplicasHandler will handle it by deleting the unhealthy replicas in a future iteration.

ClosedWithUnhealthyReplicasHandler is meant to handle closed EC containers with unhealthy replicas but no under, over or mis replication.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7101

How was this patch tested?

TestClosedWithUnhealthyReplicasHandler and TestECReplicationCheckHandler

This is a case of under replication because index 4 is unavailable. Index
3 is not considered over replicated because its second copy is unhealthy.
*/
replicas.removeIf(containerReplica -> containerReplica.getState() ==
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a very subtle bug here. The replicas set comes from the ContainerCheckRequest object, and from it, you are removing the unhealthy replicas. Later, that same request will end up in in the unhealthy handler and it will find there are no unhealthy replicas, as we removed them here.

I guess we need to create a copy of the set to modify it, but its too easy to fall into this trap and the way our tests are currently structured, testing each handler in isolation, we would never catch it.

I think we need to also update the ContainerCheckRequest object to wrap all the sets in Collections.UnmodifableSet(...) and then this operation would get an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I've removed this code and instead added logic to consider unhealthy replicas in ECContainerReplicaCount. Now it only exists in that one place.

I think we need to also update the ContainerCheckRequest object to wrap all the sets in Collections.UnmodifableSet(...) and then this operation would get an error.

Done

boolean foundUnhealthy = false;
// send delete commands for unhealthy replicas
for (ContainerReplica replica : replicas) {
if (replica.getState() == ContainerReplicaProto.State.UNHEALTHY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know that an under or over replicated container should never get here, but if there was some bug that meant it did, we could delete unhealthy replicas that don't have another copy.

Do you think we should have one sanity check here - when we find an unhealthy replica with index N, check there is also another replica which is CLOSED and has index X too? That way, we don't need to do a full under / over replicated check here, but we have some protection against another bug causing under-replicated containers coming here somehow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was also thinking about having a check. I've added this check to the handler in the latest commit.

@siddhantsangwan
Copy link
Contributor Author

@sodonnel Thanks for the review. I've updated the changes with a sanity check and also added this handler to the chain in RM. Instead of having to remove unhealthy replicas every time we need to check for sufficient replication, I've added this logic to ECContainerReplicaCount (as discussed offline).


// some unhealthy replicas were found so mark UNHEALTHY (note that
// UNHEALTHY is currently not a LifeCycleState for a Container)
if (foundUnhealthy) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets leave this as it is for now, but I am not sure about this. To get here, the container is neither over or under replicated, but it has some unhealthy replicas to be removed. Its not really degraded in anyway from a users perspective, except that it uses more space than it should.

With that in mind, we have no distinction between an under-replicated container due to missing replicas vs under-replicated due to unhealthy replicas. From a users perspective they are much the same thing. The only way to debug this, would be to take a missing container, and list its replicas, and from there you should be able to see some are unhealthy.

We will go ahead with this for now, but we may have to have a think about these states, as unhealthy sounds bad, where it isn't really in this case.

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for working on this and adding in the suggested changes.

@sodonnel sodonnel merged commit 006df10 into apache:master Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants