New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-6699. EC: ReplicationManager - collect under and over replicated containers #3545
Conversation
…ather than a list of add and delete
…licated EC containers
continue; | ||
} | ||
try { | ||
processContainer(c, underReplicated, overReplicated, report); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by below TODO? I thought we will just populate the under and over replicated lists above and later we process them with queues at the line 254: TODO.
continue; | ||
} | ||
try { | ||
processContainer(c, underReplicated, overReplicated, report); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by below TODO? I thought we will just populate the under and over replicated lists above and later we process them with queues at the line 254: TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment the health check commands do not return any commands, but they might. Eg if we put in some logic about unhealthy containers, eg not all replicas in the same state - there is a check like this in the Legacy RM for ratis containers. Then the command may return some commands such as "closeContainer". We should fire those commands onto the event queue if they are returned.
protected ContainerHealthResult processContainer(ContainerInfo containerInfo, | ||
List<ContainerHealthResult.UnderReplicatedHealthResult> underRep, | ||
List<ContainerHealthResult.OverReplicatedHealthResult> overRep, | ||
ReplicationManagerReport report) throws ContainerNotFoundException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variable and assignment can be in same statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I think I had a try...catch block here before to handle containerNotFound, but changed it. I will fix this.
!underHealth.isUnrecoverable()) { | ||
underRep.add(underHealth); | ||
} | ||
} else if (health.getHealthState() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering in EC case, a container can have both situations. Some indexes over-replicated while other replicas missing. In that situation what is the HealthCheck result? We may give preference to missing? meaning Underreplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I thought it would be easier to have only a single state. The worse case is under-replication, so ideally we fix that first. When that is fixed, the container will get processed again and fix the over-replication. So yes, the if the container is both over and under replicated, under-replicated will be the result and over will be ignored until it is fixed.
I guess there is an edge case, where the container is both missing (unrecoverable) and over-replicated. The missing will never get fixed and it will be stuck like that. I am not sure what the answer is here - probably the container needs to be removed from the system as it cannot be read anyway.
Overall patch looks good to me, I have just dropped few comments/questions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaOp.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending CI
…plicated containers (apache#3545)" This reverts commit 5cf5c97.
HDDS-6699. EC: ReplicationManager - collect under and over replicated containers (apache#3545) (cherry picked from commit 5cf5c97) Change-Id: I3d7a53596fcaa0bfd4252f3d174c6f402ad8c06b
What changes were proposed in this pull request?
Scan all containers in Replication Manager. For the EC containers, pass them to the EcContainerHealthCheck class to allow their health to be check. For any under or over replicated containers add them a list which later will be sorted by priority and turned into a queue for subsequent stages of the RM processing.
What is the link to the Apache JIRA
ec-HDDS-6699
How was this patch tested?
New unit tests