HDDS-7058. EC: ReplicationManager - Implement ratis container replication check handler #3802

sodonnel · 2022-10-06T11:10:23Z

What changes were proposed in this pull request?

Create a handler for the new replication manager to process Ratis containers and detect under-, over-, and mis-replication issues.

The logic is largely unchanged from the Legacy Replication Manager; it is simply packaged into the new "handler" structure.

At the moment, this code will not be executed by the new replication manager, as all non-EC containers are directed to the Legacy Replication Manager for processing.

This Jira is part of the work to remove the Legacy Replication Manager.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7058

How was this patch tested?

New unit tests

kerneltime · 2022-10-10T16:04:00Z

@aswinshakil

swamirishi

I left some comments inline. I am unsure about certain flows and a possible race condition, so it would be great if you could clear up my doubts there. I haven't checked the unit tests yet.

swamirishi · 2022-10-10T18:02:20Z

...c/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java

+   *
+   * @return True if the container is over replicated, false otherwise.
+   */
+  public boolean isOverReplicated(boolean includePendingDelete) {


This could be changed to getExcessRedundancyCanBeCalled(includePending) > 0 to avoid duplicating the logic.

Good point. I have changed this.
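
A minimal sketch of the delegation discussed above, assuming a simplified replica counter. The class name ReplicaCountSketch, its fields, and the arithmetic are hypothetical simplifications, not the actual RatisContainerReplicaCount code; only the idea that isOverReplicated can be expressed through getExcessRedundancy comes from the review.

```java
// Hypothetical, simplified stand-in for a replica counter (not the Ozone class).
class ReplicaCountSketch {
  private final int healthyReplicas;    // replicas currently available
  private final int replicationFactor;  // replicas required
  private final int inFlightDel;        // deletes scheduled but not yet confirmed

  ReplicaCountSketch(int healthyReplicas, int replicationFactor,
      int inFlightDel) {
    this.healthyReplicas = healthyReplicas;
    this.replicationFactor = replicationFactor;
    this.inFlightDel = inFlightDel;
  }

  // Number of replicas beyond the replication factor, never negative.
  int getExcessRedundancy(boolean includePendingDelete) {
    int excess = healthyReplicas - replicationFactor;
    if (includePendingDelete) {
      // Treat replicas with a pending delete as already removed.
      excess -= inFlightDel;
    }
    return Math.max(excess, 0);
  }

  // Delegates to getExcessRedundancy so the logic lives in one place.
  boolean isOverReplicated(boolean includePendingDelete) {
    return getExcessRedundancy(includePendingDelete) > 0;
  }
}
```

With 4 healthy replicas, a factor of 3, and 1 pending delete, this sketch reports over-replication only when pending deletes are ignored.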

swamirishi · 2022-10-11T06:45:20Z

...cm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerHealthResult.java

@@ -106,6 +106,9 @@ public static class UnHealthyResult extends ContainerHealthResult {
    private final int remainingRedundancy;
    private final boolean dueToDecommission;
    private final boolean sufficientlyReplicatedAfterPending;
+    private boolean dueToMisReplication = false;
+    private boolean isMisReplicated = false;
+    private boolean isMisReplicatedAfterPending = false;


Should we have another constructor which initializes the following arguments?
public UnderReplicatedHealthResult(ContainerInfo containerInfo,
    int remainingRedundancy, boolean dueToDecommission,
    boolean replicatedOkWithPending, boolean unrecoverable,
    boolean dueToMisReplication, boolean isMisReplicated,
    boolean isMisReplicatedAfterPending)

I don't want to change the existing constructor, because then we would need to change every place it is used. Adding a new constructor starts a bad pattern where each new parameter needs another constructor; what we really need is a builder.

At the moment I think these 3 parameters have good defaults for the common case, so setting them explicitly only when needed is a good compromise.
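
For illustration, the builder mentioned above might look roughly like the following. This is a sketch under assumptions: HealthResultSketch and its Builder are hypothetical names, and only the three boolean fields and their false defaults are taken from the diff. It is not a proposal for the actual ContainerHealthResult API.

```java
// Illustrative only: a hypothetical result type, not the Ozone class.
final class HealthResultSketch {
  private final int remainingRedundancy;
  private final boolean dueToMisReplication;
  private final boolean misReplicated;
  private final boolean misReplicatedAfterPending;

  private HealthResultSketch(Builder b) {
    this.remainingRedundancy = b.remainingRedundancy;
    this.dueToMisReplication = b.dueToMisReplication;
    this.misReplicated = b.misReplicated;
    this.misReplicatedAfterPending = b.misReplicatedAfterPending;
  }

  static Builder builder(int remainingRedundancy) {
    return new Builder(remainingRedundancy);
  }

  int getRemainingRedundancy() { return remainingRedundancy; }
  boolean isDueToMisReplication() { return dueToMisReplication; }
  boolean isMisReplicated() { return misReplicated; }
  boolean isMisReplicatedAfterPending() { return misReplicatedAfterPending; }

  static final class Builder {
    private final int remainingRedundancy;
    // Defaults for the common case, matching the false defaults in the diff.
    private boolean dueToMisReplication = false;
    private boolean misReplicated = false;
    private boolean misReplicatedAfterPending = false;

    private Builder(int remainingRedundancy) {
      this.remainingRedundancy = remainingRedundancy;
    }

    Builder setDueToMisReplication(boolean value) {
      this.dueToMisReplication = value;
      return this;
    }

    Builder setMisReplicated(boolean value) {
      this.misReplicated = value;
      return this;
    }

    Builder setMisReplicatedAfterPending(boolean value) {
      this.misReplicatedAfterPending = value;
      return this;
    }

    HealthResultSketch build() {
      return new HealthResultSketch(this);
    }
  }
}
```

A caller would set only what differs from the defaults, for example HealthResultSketch.builder(1).setMisReplicated(true).build(), so adding a new optional field does not require touching existing call sites.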

swamirishi · 2022-10-11T06:50:38Z

...c/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java

+  /**
+   * @return Return Excess Redundancy replica nums.
+   */
+  public int getExcessRedundancy(boolean includePendingDelete) {


We could add an includePendingAdd boolean as well. I see duplicated logic in sufficientlyReplicated & isOverReplicated.

Yes, the logic is very similar. I have added a new private method that both can call.

...r-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java

swamirishi · 2022-10-11T07:15:32Z

...va/org/apache/hadoop/hdds/scm/container/replication/health/RatisReplicationCheckHandler.java

+    return false;
+  }
+
+  public ContainerHealthResult checkHealth(ContainerCheckRequest request) {


Do we need to make this method public?

You could add @VisibleForTesting if it is only needed for unit tests, or even change the access specifier in the tests using reflection.

It will need to be public when we implement the under / over replication handler to process the under / over replicated queue. This is the same as in the ECHandler, where the same method is public for that reason.

swamirishi · 2022-10-11T07:19:01Z

...va/org/apache/hadoop/hdds/scm/container/replication/health/RatisReplicationCheckHandler.java

+            pendingDelete, requiredNodes, minReplicasForMaintenance);
+
+    ContainerPlacementStatus placementStatus =
+        getPlacementStatus(replicas, requiredNodes, Collections.EMPTY_LIST);


Suggested change

-        getPlacementStatus(replicas, requiredNodes, Collections.EMPTY_LIST);
+        getPlacementStatus(replicas, requiredNodes, Collections.emptyList());

I fixed this.
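
As background for the suggestion above, here is a small sketch of the difference between the two forms. Collections.EMPTY_LIST is a raw List, so passing it where a typed list is expected compiles only with an unchecked conversion warning, while Collections.emptyList() lets the compiler infer the element type. The class EmptyListSketch and the method countExcluded are hypothetical and exist only for this example.

```java
import java.util.Collections;
import java.util.List;

class EmptyListSketch {
  // Hypothetical method that takes a typed list, standing in for a call
  // that accepts a list of excluded nodes.
  static int countExcluded(List<String> excludedNodes) {
    return excludedNodes.size();
  }

  static int typed() {
    // The element type is inferred as String from the parameter type.
    return countExcluded(Collections.emptyList());
  }

  @SuppressWarnings("unchecked")
  static int raw() {
    // Raw List: without the annotation the compiler reports an
    // unchecked conversion warning here.
    return countExcluded(Collections.EMPTY_LIST);
  }
}
```

Both calls behave the same at runtime; the typed form simply keeps the generic type information and avoids the warning.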

swamirishi · 2022-10-11T07:22:14Z

...va/org/apache/hadoop/hdds/scm/container/replication/health/RatisReplicationCheckHandler.java

+        replicaDns.add(op.getTarget());
+      } else if (op.getOpType() == ContainerReplicaOp.PendingOpType.DELETE) {
+        replicaDns.remove(op.getTarget());
+      }


Can there be a case where there is pending ADD & pending DELETE to the same node? Some kind of race condition.

Better to use a Map<DataNodeDetails,Integer> in that case.

This should not be able to happen: a node that already has a replica cannot receive another copy of it, and a delete can only be scheduled for a node that has a copy, which in turn prevents an add to that node.

However, I will change this to two separate if statements rather than if .. else if.
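
The loop under discussion can be sketched as follows, assuming simplified types. PendingOpsSketch, PendingOp, OpType, and the use of plain strings for datanodes are hypothetical stand-ins for ContainerReplicaOp and the datanode details type; the sketch only shows pending ops being applied to a set of nodes with two independent if statements.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PendingOpsSketch {
  enum OpType { ADD, DELETE }

  // Hypothetical pending operation: an op type plus a target node name.
  static final class PendingOp {
    final OpType type;
    final String target;

    PendingOp(OpType type, String target) {
      this.type = type;
      this.target = target;
    }
  }

  // Returns the nodes expected to hold a replica once all pending ops finish.
  static Set<String> expectedNodes(Set<String> current, List<PendingOp> ops) {
    Set<String> nodes = new HashSet<>(current);
    for (PendingOp op : ops) {
      if (op.type == OpType.ADD) {
        nodes.add(op.target);
      }
      if (op.type == OpType.DELETE) {
        nodes.remove(op.target);
      }
    }
    return nodes;
  }
}
```

Because each op has exactly one type, the two if statements behave the same as if .. else if here. Note that with a set, the result for a node that had both a pending ADD and a pending DELETE would depend on the order of the ops, which is the situation the reply argues cannot occur.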

siddhantsangwan

@sodonnel The handling logic looks good. I haven't checked the tests yet.

siddhantsangwan · 2022-10-11T06:52:14Z

...c/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java

+  public boolean isSufficientlyReplicated(boolean includePendingAdd) {
+    // Positive for under-rep, negative for over-rep
+    int delta = missingReplicas();
+    if (includePendingAdd) {
+      delta -= inFlightAdd;
+    }
+    return delta <= 0;
+  }


Are we deliberately not considering pending deletes here?

I think you are correct; I missed this. We should be removing the inflight deletes, as in the original method defined just above this one. I will fix this and modify a test to validate it.
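
A sketch of the check with pending deletes taken into account, under stated assumptions: SufficientReplicationSketch and its fields are hypothetical, missingReplicas is reduced to a plain difference (maintenance and decommission handling are ignored), and pending deletes are assumed to always count as lost replicas. The committed fix may differ in detail.

```java
// Hypothetical, simplified counter used only to illustrate the arithmetic.
class SufficientReplicationSketch {
  private final int healthyReplicas;
  private final int replicationFactor;
  private final int inFlightAdd;
  private final int inFlightDel;

  SufficientReplicationSketch(int healthyReplicas, int replicationFactor,
      int inFlightAdd, int inFlightDel) {
    this.healthyReplicas = healthyReplicas;
    this.replicationFactor = replicationFactor;
    this.inFlightAdd = inFlightAdd;
    this.inFlightDel = inFlightDel;
  }

  // Positive for under-replication, negative for over-replication.
  int missingReplicas() {
    return replicationFactor - healthyReplicas;
  }

  boolean isSufficientlyReplicated(boolean includePendingAdd) {
    // Assume replicas with a pending delete are already gone.
    int delta = missingReplicas() + inFlightDel;
    if (includePendingAdd) {
      // Optionally assume pending adds will complete.
      delta -= inFlightAdd;
    }
    return delta <= 0;
  }
}
```

With 3 healthy replicas, a factor of 3, and 1 pending delete, this sketch reports insufficient replication unless a pending add is also in flight and includePendingAdd is true.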

siddhantsangwan · 2022-10-11T07:31:52Z

...va/org/apache/hadoop/hdds/scm/container/replication/health/RatisReplicationCheckHandler.java

+        request.getReplicationQueue().enqueue(underHealth);
+      }
+      return true;
+    }


NIT: Let's add a new line after this brace for better readability.

ok - I added that in.

siddhantsangwan

Changes look good! I just have 2 minor comments for the tests.

siddhantsangwan · 2022-10-18T11:42:33Z

...rg/apache/hadoop/hdds/scm/container/replication/health/TestRatisReplicationCheckHandler.java

+  }
+
+  @Test
+  public void testOverReplicatedContainerDueToMaintenance() {


Since we're testing the HEALTHY case, let's change the test's name?

OK - I added IsHealthy to the end of it.

siddhantsangwan · 2022-10-18T11:44:38Z

...rg/apache/hadoop/hdds/scm/container/replication/health/TestRatisReplicationCheckHandler.java

+    Set<ContainerReplica> replicas = createReplicas(container.containerID(),
+        Pair.of(IN_SERVICE, 0), Pair.of(IN_SERVICE, 0),
+        Pair.of(IN_SERVICE, 0), Pair.of(IN_MAINTENANCE, 0),
+        Pair.of(IN_MAINTENANCE, 2));


The replica index should be 0 instead of 2 I think. (IN_MAINTENANCE, 0)

yes, well spotted.

siddhantsangwan

LGTM, pending green CI.

swamirishi

LGTM

sodonnel mentioned this pull request Oct 6, 2022

HDDS-7058. EC: ReplicationManager - Implement Ratis Container healthCheck #3634

Closed

siddhantsangwan self-requested a review October 10, 2022 17:16

swamirishi requested changes Oct 11, 2022

siddhantsangwan reviewed Oct 11, 2022

Jackson Yao and others added 5 commits October 12, 2022 11:18

HDDS-7058. EC: ReplicationManager - Implement ratis container health check

049302b

Refactor to fit into the new handler format

03e14d2

Handle mis-replication

a2adb27

Fix review comments

389614b

Fix failing test

520c6b9

sodonnel force-pushed the ec-HDDS-7058-ratis-handler branch from 19657b8 to 520c6b9 Compare October 12, 2022 10:19

siddhantsangwan reviewed Oct 18, 2022

Address further review comments

2975ce4

siddhantsangwan approved these changes Oct 18, 2022

swamirishi approved these changes Oct 18, 2022

sodonnel merged commit 237a9a1 into apache:master Oct 19, 2022
