HDDS-6699. EC: ReplicationManager - collect under and over replicated containers #3545

Merged
merged 11 commits into from Jun 27, 2022

sodonnel
Contributor

What changes were proposed in this pull request?

Scan all containers in Replication Manager. Pass the EC containers to the EcContainerHealthCheck class so their health can be checked. Add any under- or over-replicated containers to a list, which will later be sorted by priority and turned into a queue for subsequent stages of the RM processing.
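
The collection step described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this patch: the `Health` class, the `State` enum, and the `remainingRedundancy` field used as the sort key are all hypothetical stand-ins, and the real priority ordering may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ReplicationScanSketch {

  /** Hypothetical health states; the real checker may expose more. */
  enum State { HEALTHY, UNDER_REPLICATED, OVER_REPLICATED }

  /** Hypothetical, simplified stand-in for a container health result. */
  static final class Health {
    final long containerId;
    final State state;
    // Assumption: fewer remaining redundant replicas means more urgent.
    final int remainingRedundancy;

    Health(long containerId, State state, int remainingRedundancy) {
      this.containerId = containerId;
      this.state = state;
      this.remainingRedundancy = remainingRedundancy;
    }
  }

  /** Split scan results into under- and over-replicated lists. */
  static void collect(List<Health> scanned,
      List<Health> underRep, List<Health> overRep) {
    for (Health h : scanned) {
      if (h.state == State.UNDER_REPLICATED) {
        underRep.add(h);
      } else if (h.state == State.OVER_REPLICATED) {
        overRep.add(h);
      }
      // Healthy containers are not queued.
    }
  }

  /** Sort a copy of the list by urgency and expose it as a FIFO queue. */
  static Queue<Health> toQueue(List<Health> results) {
    List<Health> sorted = new ArrayList<>(results);
    sorted.sort(Comparator.comparingInt((Health h) -> h.remainingRedundancy));
    return new LinkedList<>(sorted);
  }
}
```

Sorting a copy and then building the queue keeps the scan itself cheap and leaves the original lists untouched, so a later stage can take the most urgent container from the head of the queue.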

What is the link to the Apache JIRA?

ec-HDDS-6699

How was this patch tested?

New unit tests

continue;
}
try {
processContainer(c, underReplicated, overReplicated, report);
Contributor

What do you mean by the TODO below? I thought we would just populate the under- and over-replicated lists above and process them later with queues, at the TODO on line 254.

Contributor Author

At the moment the health checks do not return any commands, but they might. For example, we could add some logic for unhealthy containers, such as not all replicas being in the same state (there is a check like this in the Legacy RM for Ratis containers). The health check may then return commands such as "closeContainer". We should fire those commands onto the event queue if they are returned.
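
To illustrate the idea, here is a rough sketch under stated assumptions, not the actual RM API: `CheckResult` is a hypothetical result type that carries a possibly empty list of command names, and a plain `Consumer<String>` stands in for the event queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class HealthCommandSketch {

  /** Hypothetical health check result carrying optional commands. */
  static final class CheckResult {
    final long containerId;
    final List<String> commands;

    CheckResult(long containerId, List<String> commands) {
      this.containerId = containerId;
      this.commands = new ArrayList<>(commands);
    }
  }

  /**
   * Fire every command returned by the health check onto the event queue.
   * Returns the number of commands fired; zero when the check returned none.
   */
  static int fireCommands(CheckResult result, Consumer<String> eventQueue) {
    int fired = 0;
    for (String command : result.commands) {
      eventQueue.accept(command);
      fired++;
    }
    return fired;
  }
}
```

With an empty command list the loop is a no-op, which matches the current behaviour where the checks return nothing.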

protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
    List<ContainerHealthResult.UnderReplicatedHealthResult> underRep,
    List<ContainerHealthResult.OverReplicatedHealthResult> overRep,
    ReplicationManagerReport report) throws ContainerNotFoundException {
Contributor

The variable declaration and assignment can be in the same statement.

Contributor Author

Ah yes, I think I had a try...catch block here before to handle containerNotFound, but changed it. I will fix this.

!underHealth.isUnrecoverable()) {
underRep.add(underHealth);
}
} else if (health.getHealthState()
Contributor

I am wondering about the EC case, where a container can be in both situations at once: some indexes over-replicated while other replicas are missing. In that situation, what is the HealthCheck result? Should we give preference to the missing replicas, meaning under-replication?

Contributor Author

For now I thought it would be easier to have only a single state. The worst case is under-replication, so ideally we fix that first. When that is fixed, the container will get processed again and the over-replication will be fixed. So yes, if the container is both over- and under-replicated, under-replicated will be the result and the over-replication will be ignored until the under-replication is fixed.

I guess there is an edge case, where the container is both missing (unrecoverable) and over-replicated. The missing will never get fixed and it will be stuck like that. I am not sure what the answer is here - probably the container needs to be removed from the system as it cannot be read anyway.
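
The precedence described above could look roughly like the sketch below. It is an illustration only and not the EcContainerHealthCheck implementation: it assumes each EC replica index should have exactly one replica, and the `State` enum, `classify`, and `isUnrecoverable` names are hypothetical.

```java
public class EcHealthPrecedenceSketch {

  /** Hypothetical single-valued health state. */
  enum State { HEALTHY, UNDER_REPLICATED, OVER_REPLICATED }

  /**
   * replicasPerIndex[i] is the number of replicas seen for EC index i.
   * Assumption: exactly one replica per index is the healthy target.
   * A missing index wins over any surplus, so only one state is reported.
   */
  static State classify(int[] replicasPerIndex) {
    boolean over = false;
    for (int count : replicasPerIndex) {
      if (count == 0) {
        return State.UNDER_REPLICATED; // under-replication takes precedence
      }
      if (count > 1) {
        over = true;
      }
    }
    return over ? State.OVER_REPLICATED : State.HEALTHY;
  }

  /** More indexes missing than there are parity blocks cannot be rebuilt. */
  static boolean isUnrecoverable(int[] replicasPerIndex, int parityCount) {
    int missing = 0;
    for (int count : replicasPerIndex) {
      if (count == 0) {
        missing++;
      }
    }
    return missing > parityCount;
  }
}
```

In this sketch the edge case mentioned above is a container for which `isUnrecoverable` is true while some index also has surplus replicas: `classify` keeps reporting under-replication, so the surplus is never addressed.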

@umamaheswararao
Contributor

Overall the patch looks good to me. I have just dropped a few comments/questions.

Contributor

adoroszlai left a comment

LGTM

Contributor

umamaheswararao left a comment

LGTM, pending CI

umamaheswararao merged commit 5cf5c97 into apache:master Jun 27, 2022
guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Jun 28, 2022
duongkame pushed a commit to duongkame/ozone that referenced this pull request Aug 16, 2022
HDDS-6699. EC: ReplicationManager - collect under and over replicated containers (apache#3545)

(cherry picked from commit 5cf5c97)
Change-Id: I3d7a53596fcaa0bfd4252f3d174c6f402ad8c06b