HDDS-6699. EC: ReplicationManager - collect under and over replicated containers #3545

sodonnel · 2022-06-24T11:00:56Z

What changes were proposed in this pull request?

Scan all containers in the Replication Manager. For EC containers, pass them to the EcContainerHealthCheck class so that their health can be checked. Add any under- or over-replicated containers to a list, which will later be sorted by priority and turned into a queue for subsequent stages of RM processing.
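As a rough illustration of the flow described above (not the code in this patch), the sketch below collects health results into two lists and then orders the under-replicated ones by priority. `CollectSketch`, `Result`, `State` and the `remainingRedundancy` field are hypothetical names, and treating lower remaining redundancy as more urgent is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these names come from the Ozone code base.
class CollectSketch {
  enum State { HEALTHY, UNDER_REPLICATED, OVER_REPLICATED }

  /** Stand-in for a per-container health check result. */
  static final class Result {
    final long containerId;
    final State state;
    // Assumption: fewer remaining redundant replicas means more urgent.
    final int remainingRedundancy;

    Result(long containerId, State state, int remainingRedundancy) {
      this.containerId = containerId;
      this.state = state;
      this.remainingRedundancy = remainingRedundancy;
    }
  }

  /** Sorts each unhealthy result into the matching list; healthy ones are skipped. */
  static void collect(List<Result> scanned, List<Result> underRep,
      List<Result> overRep) {
    for (Result r : scanned) {
      if (r.state == State.UNDER_REPLICATED) {
        underRep.add(r);
      } else if (r.state == State.OVER_REPLICATED) {
        overRep.add(r);
      }
    }
  }

  /** Turns the collected list into a queue, most urgent container first. */
  static PriorityQueue<Result> toQueue(List<Result> underRep) {
    PriorityQueue<Result> queue = new PriorityQueue<>(
        Comparator.comparingInt((Result r) -> r.remainingRedundancy));
    queue.addAll(underRep);
    return queue;
  }
}
```

Polling the queue returned by `toQueue` yields the container with the least remaining redundancy first, which is one plausible reading of "sorted by priority".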

What is the link to the Apache JIRA

ec-HDDS-6699

How was this patch tested?

New unit tests


umamaheswararao · 2022-06-24T18:15:09Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java

+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);


What do you mean by the TODO below? I thought we would just populate the under- and over-replicated lists above and later process them with queues at the TODO on line 254.

sodonnel · 2022-06-27

At the moment the health checks do not return any commands, but they might. For example, we may add some logic about unhealthy containers, such as not all replicas being in the same state; there is a check like this in the Legacy RM for Ratis containers. The health check may then return commands such as "closeContainer". We should fire those commands onto the event queue if they are returned.
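A minimal sketch of what that TODO could look like, assuming the health result carries a (possibly empty) list of commands. `CommandDispatchSketch`, `Result` and the `Consumer<String>` standing in for the event queue are all hypothetical; the real event publishing API is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; the real result and event queue types differ.
class CommandDispatchSketch {
  /** Stand-in for a health result that may carry follow-up commands. */
  static final class Result {
    final long containerId;
    final List<String> commands;

    Result(long containerId, List<String> commands) {
      this.containerId = containerId;
      this.commands = Collections.unmodifiableList(new ArrayList<>(commands));
    }
  }

  /**
   * Fires every command attached to the result onto the event queue and
   * returns how many were fired. An empty list is a no-op, which matches
   * the current behaviour where the checks return no commands.
   */
  static int fireCommands(Result result, Consumer<String> eventQueue) {
    int fired = 0;
    for (String command : result.commands) {
      eventQueue.accept(command);
      fired++;
    }
    return fired;
  }
}
```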

umamaheswararao · 2022-06-24T18:26:47Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java

+  protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
+      List<ContainerHealthResult.UnderReplicatedHealthResult> underRep,
+      List<ContainerHealthResult.OverReplicatedHealthResult> overRep,
+      ReplicationManagerReport report) throws ContainerNotFoundException {


The variable declaration and assignment can be in the same statement.

sodonnel · 2022-06-27

Ah yes, I think I had a try...catch block here before to handle ContainerNotFoundException, but changed it. I will fix this.

umamaheswararao · 2022-06-24T18:35:37Z

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java

+          !underHealth.isUnrecoverable()) {
+        underRep.add(underHealth);
+      }
+    } else if (health.getHealthState()


I am wondering: in the EC case, a container can be in both situations at once, with some indexes over-replicated while other replicas are missing. In that situation, what is the HealthCheck result? Should we give preference to missing, meaning under-replication?

sodonnel · 2022-06-27

For now I thought it would be easier to have only a single state. The worse case is under-replication, so ideally we fix that first. Once that is fixed, the container will be processed again and the over-replication will be fixed. So yes, if the container is both over- and under-replicated, under-replicated will be the result and the over-replication will be ignored until the under-replication is fixed.

I guess there is an edge case, where the container is both missing (unrecoverable) and over-replicated. The missing will never get fixed and it will be stuck like that. I am not sure what the answer is here - probably the container needs to be removed from the system as it cannot be read anyway.
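To make the precedence concrete, here is a hedged sketch of a single-state check for an EC container. It is not the EcContainerHealthCheck implementation: `EcHealthSketch`, `Health` and `check` are hypothetical, pending operations are ignored, and replica indexes are assumed to run from 1 to data + parity.

```java
// Hypothetical sketch of "under-replication wins"; not the patch's code.
class EcHealthSketch {
  enum State { HEALTHY, UNDER_REPLICATED, OVER_REPLICATED }

  static final class Health {
    final State state;
    final boolean unrecoverable;

    Health(State state, boolean unrecoverable) {
      this.state = state;
      this.unrecoverable = unrecoverable;
    }
  }

  /**
   * Classifies a container from the replica indexes currently present.
   * Any missing index yields UNDER_REPLICATED, even if another index is
   * duplicated; over-replication is only reported once nothing is missing.
   */
  static Health check(int data, int parity, int[] replicaIndexes) {
    int total = data + parity;
    int[] counts = new int[total + 1];
    for (int index : replicaIndexes) {
      if (index >= 1 && index <= total) {
        counts[index]++;
      }
    }
    int missing = 0;
    boolean duplicated = false;
    for (int i = 1; i <= total; i++) {
      if (counts[i] == 0) {
        missing++;
      } else if (counts[i] > 1) {
        duplicated = true;
      }
    }
    if (missing > 0) {
      // More missing indexes than parity means the data cannot be rebuilt.
      return new Health(State.UNDER_REPLICATED, missing > parity);
    }
    if (duplicated) {
      return new Health(State.OVER_REPLICATED, false);
    }
    return new Health(State.HEALTHY, false);
  }
}
```

With a 3+2 layout, replicas at indexes 1, 1, 2 hit the edge case above: the result is under-replicated and unrecoverable, and the duplicate of index 1 is never reported.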

umamaheswararao · 2022-06-24T18:36:24Z

Overall the patch looks good to me; I have just dropped a few comments/questions.

adoroszlai

LGTM

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaOp.java

umamaheswararao

LGTM, pending CI


HDDS-6699. EC: ReplicationManager - collect under and over replicated containers (apache#3545) (cherry picked from commit 5cf5c97) Change-Id: I3d7a53596fcaa0bfd4252f3d174c6f402ad8c06b

S O'Donnell added 10 commits June 24, 2022 11:32

Change interface of ContainerHealthCheck to take list of pendingOps rather than a list of add and delete

fb0b611

Add basic flow to the ReplicationManager to detect under and over replicated EC containers

4c1bd5f

Rename TestReplicationManager to TestLegacyReplicationManager

4cfea18

Remove serviceManager dependency from ReplicationManager

56daa92

Inject LegacyRM into RM rather than constructing it internally

16d891e

Extracted common methods from TestECContainerHealthCheck into a util class

1a26601

Add tests and fixes to the flow in RM

23684d5

Fix find bugs

c9803a8

Fixing failing legacyRM tests

7f930c3

Fix checkstyle

3d43a25

umamaheswararao reviewed Jun 24, 2022


Fix variable assignment

ffeda25

adoroszlai reviewed Jun 27, 2022


...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaOp.java

umamaheswararao approved these changes Jun 27, 2022


umamaheswararao merged commit 5cf5c97 into apache:master Jun 27, 2022

guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Jun 28, 2022

Revert "HDDS-6699. EC: ReplicationManager - collect under and over replicated containers (apache#3545)" This reverts commit 5cf5c97.

3d6ac1f


