HDDS-9011. EC: Unhealthy replica not replaced with a healthy replica in a rack-scatter environment#5097

Merged
sumitagrawl merged 8 commits into apache:master from siddhantsangwan:HDDS-9011
Jul 28, 2023

Conversation

@siddhantsangwan
Contributor

@siddhantsangwan siddhantsangwan commented Jul 20, 2023

What changes were proposed in this pull request?

Simply put, the problem here is that EC under replication handling is blocked in cases where the Rack Scatter policy cannot return new Datanodes because doing so would violate the policy.

Please check the jira for one such scenario, and this document for a detailed (but roughly documented) reproduction of some cases.

Changes proposed: Allow the rack scatter policy to fall back and return the required number of datanodes even if the policy can't be satisfied, while still following the policy as closely as possible. Because fallback is true by default, the policy will now try to return nodes in situations where it would previously have thrown an SCMException. The invariant that this policy tries to return the requested number of nodes and fails if it can't still holds.

The normal definition of "maximum replicas a rack can host" (called maxReplicasPerRack in the code) in the rack scatter policy is:

    return numReplicas / numberOfRacks
            + Math.min(numReplicas % numberOfRacks, 1);
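
For example (illustrative numbers only, not from the PR): with numReplicas = 5 and numberOfRacks = 3, this evaluates to 5 / 3 + min(5 % 3, 1) = 1 + 1 = 2 replicas per rack.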

When falling back, the definition of maxReplicasPerRack in chooseDatanodesInternal is changed to the replication factor, so that one rack can host all replicas of this container if needed. The policy will still try to spread the replicas across as many racks as possible; this is done by the existing algorithm in chooseNodesFromRacks, which iterates over each rack and chooses at most one node per rack in a single iteration, so the chosen nodes end up spread across as many racks as possible:

      for (Node rack : toChooseRacks) {
        if (rack == null) {
          // TODO: need to recheck why null coming here.
          continue;
        }
        Node node = chooseNode(rack.getNetworkFullPath(), unavailableNodes,
                metadataSizeRequired, dataSizeRequired);
        if (node != null) {
          chosenNodes.add((DatanodeDetails) node);
          rackCntMap.merge(rack, 1, Math::addExact);
          mutableFavoredNodes.remove(node);
          unavailableNodes.add(node);
          nodesRequired--;
          if (nodesRequired == 0) {
            break;
          }
        } else {
          // Store the skipped racks to check them first in next outer loop
          skippedRacks.add(rack);
        }
      }

The number of times this algorithm runs, i.e. the number of times we will consider a rack for choosing a node, is controlled by the argument maxOuterLoopIterations. Previously, this argument was 1 in one case and Integer.MAX_VALUE in another; now it's maxReplicasPerRack for the first call and still Integer.MAX_VALUE for the second call (see the discussion under this comment).
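
For illustration, a minimal sketch of the per-rack cap described above, with and without fallback. This is not the verbatim patch; it assumes numReplicas, numberOfRacks, requiredReplicationFactor, and the fallback flag are already in scope in chooseDatanodesInternal:

    // Hedged sketch: per-rack replica cap, normal vs. fallback behaviour.
    int maxReplicasPerRack = numReplicas / numberOfRacks
        + Math.min(numReplicas % numberOfRacks, 1);
    if (fallback) {
      // When falling back, allow a single rack to host all replicas of this
      // container if no policy-compliant placement is possible.
      maxReplicasPerRack = requiredReplicationFactor;
    }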

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9011

How was this patch tested?

Changed some unit tests and added some new ones.
Draft while waiting for CI and doing some manual testing in a cluster.

@siddhantsangwan siddhantsangwan changed the title EC: Unhealthy replica not replaced with a healthy replica in a rack-scatter environment HDDS-9011. EC: Unhealthy replica not replaced with a healthy replica in a rack-scatter environment Jul 20, 2023
@siddhantsangwan
Contributor Author

@swamirishi can you also please review?

Contributor

@ashishkumar50 ashishkumar50 left a comment


@siddhantsangwan, Thanks for the patch. I have a question related to maxOuterLoopIterations.

mutableFavoredNodes, nodesRequired - chosenNodes.size(),
metadataSizeRequired, dataSizeRequired,
Integer.MAX_VALUE, usedRacksCntMap, maxReplicasPerRack));
nodesRequired - chosenNodes.size(),
Contributor


Changing maxOuterLoopIterations to the exact number of nodes required will decrease the probability of getting the required number of nodes.
chooseRandom will return a random node, and there is a chance that the random node does not pass isValidNode.
After a certain number of retries (currently 5) it will return a null node back to chooseNodesFromRacks.
Assuming this happens for all the racks, we then decrement maxOuterLoopIterations at the end of the loop without choosing any node, so the probability of getting the required nodes keeps decreasing.
I think maxOuterLoopIterations should not be bound to the number of nodes required.

Contributor Author


Good point. We can set maxOuterLoopIterations to a higher value. Looks like the number of iterations is bounded by both maxOuterLoopIterations and OUTER_LOOP_MAX_RETRY. The latter is a constant whose current value is 3, so we would need to increase that too. Any suggestions for what their values should be?

Contributor


OUTER_LOOP_MAX_RETRY applies when we don't find any node in any rack even after going through all the racks OUTER_LOOP_MAX_RETRY times. I believe we could make it configurable, but the default value seems fine. Even if we find just one node in an iteration, the retry count is reset.
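
For context, a rough sketch of how the two bounds described above interact; this is an assumed simplification for illustration, not the actual method body, and it presumes nodesRequired, chosenNodes, maxOuterLoopIterations, and OUTER_LOOP_MAX_RETRY are in scope:

    // Hedged sketch: the outer loop is bounded by both maxOuterLoopIterations
    // and OUTER_LOOP_MAX_RETRY; the retry count resets whenever progress is made.
    int retryCount = 0;
    while (nodesRequired > 0
        && maxOuterLoopIterations > 0
        && retryCount < OUTER_LOOP_MAX_RETRY) {
      int chosenBefore = chosenNodes.size();
      // ... one pass over toChooseRacks, picking at most one node per rack ...
      if (chosenNodes.size() == chosenBefore) {
        retryCount++;       // no node found in this pass
      } else {
        retryCount = 0;     // found at least one node; reset the retry counter
      }
      maxOuterLoopIterations--;
    }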

Contributor


Keeping INT_MAX doesn't harm anything here, as OUTER_LOOP_MAX_RETRY already controls the exit condition in case we really cannot find enough nodes.

Contributor Author


OK, the consensus is that we should keep maxOuterLoopIterations as Integer.MAX_VALUE. I wonder if we should then change this argument to Integer.MAX_VALUE when we call chooseNodesFromRacks at line 352, too?

Contributor


At line 352 it should be maxReplicasPerRack. If we increase this iteration count beyond maxReplicasPerRack, it can use the same rack for multiple replicas, because each new outer loop iteration starts reusing all the racks; a rack is only excluded from reuse within the same iteration.
If we want to optimize, we can remove racks which have already reached the max replicas per rack and then run the outer loop any number of times, such as INT_MAX.

Contributor Author


If we want to optimize, we can remove racks which have already reached the max replicas per rack

I agree, and I think we're already doing that.
At line 70, we update the count of used nodes per rack:

rackCntMap.merge(rack, 1, Math::addExact);

And at line 126, we filter racks according to this map:

      for (Node rack : racks) {
        if (rackCntMap.getOrDefault(rack, 0) < maxReplicasPerRack) {
          toChooseRacks.add(rack);
        }
      }

Contributor Author


I've decided to make it maxReplicasPerRack at line 352 for the first call, and Integer.MAX_VALUE for the second call. If we want to increase it for the first call, we can do that in this separate jira - https://issues.apache.org/jira/browse/HDDS-9054.

Contributor


Looks fine now. Earlier it was 1 for the first call and is now changed to maxReplicasPerRack, which is better than 1 and doesn't have any negative impact. Yes, later we can optimize more in the new Jira you created, HDDS-9054.
You can update the PR description for maxOuterLoopIterations as well.

Contributor Author


Thanks @ashishkumar50 for your review. I've now updated the PR description as well.

Contributor

@swamirishi swamirishi left a comment


Thanks @siddhantsangwan for the patch. I have taken a look at the code changes but haven't looked at the tests yet. I have some queries around certain changes in the algorithm, since this is a major change in behaviour.

}

LOG.debug("Available racks excluding racks with used nodes: {}.", racks);
if (racks.size() < additionalRacksRequired) {
Contributor


Curious to know when this condition would be satisfied.

Contributor Author


I haven't seen it happen yet.

additional number of racks that replicas need to be on
("additionalRacksRequired").
*/
if (!fallback) {
Contributor


If we have fallback enabled, shouldn't we just change the list of available racks to the total number of racks required and change the maxReplicasPerRack value instead? It would be simpler, right?

Contributor Author

@siddhantsangwan siddhantsangwan Jul 21, 2023


I don't really understand what you're suggesting - can you please rephrase that?

The general idea here is that it's possible to fail some of the conditions above this, such as nodesRequired < additionalRacksRequired, and still get a new DN without having to increase maxReplicasPerRack.

Contributor


I meant we should recalculate maxReplicasPerRack based on the number of available racks and not based on the topology configuration, since there could be cases where some racks are excluded completely.

Contributor


This should happen in the case where racks.size() < additionalRacksRequired.

Contributor Author


If you think this PR works as expected and what you're suggesting is an improvement that can be made separately, let's do that in https://issues.apache.org/jira/browse/HDDS-9054?

metadataSizeRequired, dataSizeRequired, 1,
usedRacksCntMap, maxReplicasPerRack));
mutableFavoredNodes, nodesRequired, metadataSizeRequired,
dataSizeRequired, maxOuterLoopIterations, usedRacksCntMap,
Contributor


We shouldn't change the number of nodes required from additionalRacksRequired to nodesRequired, since this function call is for spreading placement across as many racks as possible and chooses racks only from the unused racks.

Contributor Author


I changed this because it's possible to have a situation where nodesRequired is less than additionalRacksRequired (check the third case in my linked document). If we set it to nodesRequired, the policy still prefers finding a new rack rather than a rack that already has a used node.

Contributor


I believe in that case we should just check whether the misreplication count is going down or remaining the same when fallback is enabled. We shouldn't change this parameter.

Contributor Author

@siddhantsangwan siddhantsangwan Jul 25, 2023


check whether the misreplication count is going down or remaining the same when fallback is enabled

I believe we're already doing that here:

    List<DatanodeDetails> newPlacement =
        new ArrayList<>(usedNodes.size() + result.size());
    newPlacement.addAll(usedNodes);
    newPlacement.addAll(chosenNodes);
    ContainerPlacementStatus placementStatus =
        validateContainerPlacement(newPlacement, requiredReplicationFactor);
    if (!placementStatus.isPolicySatisfied()) {
      ContainerPlacementStatus initialPlacementStatus =
          validateContainerPlacement(usedNodes, requiredReplicationFactor);
      if (initialPlacementStatus.misReplicationCount()
              < placementStatus.misReplicationCount()) {
        String errorMsg = "ContainerPlacementPolicy not met. Misreplication" +
                " Reason: " + placementStatus.misReplicatedReason() +
                " Initial Used nodes mis-replication Count: " +
                initialPlacementStatus.misReplicationCount() +
                " Used nodes + Chosen nodes mis-replication Count: " +
                placementStatus.misReplicationCount();
        throw new SCMException(errorMsg, FAILED_TO_FIND_SUITABLE_NODE);
      }
    }

We shouldn't change this parameter.

My understanding is that we should look for additionalRacksRequired if fallback is false, and for nodesRequired if fallback is true. Now, if fallback is false, then nodesRequired must be at least additionalRacksRequired, or we'll throw an exception earlier in the method:

if (nodesRequired < additionalRacksRequired) {
      String reason = "Required nodes size: " + nodesRequired
              + " is less than required number of racks to choose: "
              + additionalRacksRequired + ".";
      LOG.warn("Placement policy cannot choose the enough racks. {}"
                      + "Total number of Required Racks: {} Used Racks Count:" +
                      " {}, Required Nodes count: {}, fallback: {}.",
              reason, numberOfRacksRequired, usedRacksCntMap.size(),
              nodesRequired, fallback);

      /*
      The number of additional Datanodes ("nodesRequired") that were asked for
      are less than the additional number of racks ("additionalRacksRequired")
      that container replicas need to be on. This means that returning
      nodesRequired number of nodes will not satisfy the rack scatter policy.
      If fallback is false, throw an exception. Else, continue to find
      nodesRequired number of nodes.
       */
      if (!fallback) {
        throw new SCMException(reason,
            FAILED_TO_FIND_SUITABLE_NODE);
      }
    }

So this means that for both values of fallback, the argument can just be nodesRequired. Anyway, chooseNodesFromRacks should still try to spread it across as many racks as possible, with excluded and used racks sorted at the end of the list. What do you think?

@siddhantsangwan siddhantsangwan marked this pull request as ready for review July 24, 2023 06:21
@siddhantsangwan
Contributor Author

This PR is ready for review. I've updated maxOuterLoopIterations according to the comments.

@siddhantsangwan
Contributor Author

Note: While this PR is meant to improve EC under replication handling, I'm trying to determine its impact on mis replication handling.

@siddhantsangwan
Contributor Author

I'm trying to determine its impact on mis replication handling.

Mis replication handling works in two stages. The first stage is MisReplicationHandler using placement policy's replicasToCopyToFixMisreplication to ask for replicas that need to be copied to other racks. These target racks are selected using SCMContainerPlacementRackScatter. Once they're copied, the container becomes over replicated. Then ECOverReplicationHandler will delete replicas using the policy's replicasToRemoveToFixOverreplication API.

As far as I understand, there should be no negative impact on mis replication handling because:

  1. SCMContainerPlacementRackScatter will not return targets that make mis replication worse.
  2. replicasToCopyToFixMisreplication will only select replicas to copy if it's possible to improve mis replication.

@swamirishi I'd appreciate your feedback on this since you're very familiar with this area.

@swamirishi
Contributor

I'm trying to determine its impact on mis replication handling. [...]

The first priority for the replication manager is to fix unhealthy replicas, which is done by the under replication handler. The next priority in line is fixing over replication, and mis-replication has the least priority. So in the case of unhealthy replicas, mis-replication shouldn't really matter. But the chooseDatanode function is used by both the mis-replication handler and the under replication handler.

@umamaheswararao
Contributor

@swamirishi could you please take a look at this PR ?

@siddhantsangwan
Contributor Author

The first priority for the replication manager is to fix unhealthy replicas, which is done by the under replication handler. The next priority in line is fixing over replication, and mis-replication has the least priority.

Yes, I agree.

But the chooseDatanode function is used by both misreplication handler & underreplication handler.

Yes, in my view the changes introduced in this PR should not have a negative impact on mis replication handling, as I said before. But I might be missing some edge cases that you're aware of.

@@ -268,20 +276,36 @@ protected List<DatanodeDetails> chooseDatanodesInternal(
int numberOfRacksRequired = getRequiredRackCount(requiredReplicationFactor);
int additionalRacksRequired =
Contributor


From what I understand, all we have to change is to make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether. We wouldn't need a fallback flag then. This particular issue can only happen when the under replication handler calls this function asking for fewer nodes than the misreplication count. The rack scatter placement policy ensures that the misreplication count goes down if an additional node is chosen.

Contributor


In the misreplication handler, in that case we should only copy as many replicas as the change in the misreplication count. Does this make sense?

Contributor Author


From what I understand, all we have to change is to make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether.

That will work for this particular case, yes. However, simply doing that will not work in the first case, which is:

  /**
   * Scenario:
   * rack0 -> node0
   * rack1 -> node1
   * rack2 -> node2
   * rack3 -> node3
   * rack4 -> node4, node5
   * <p>
   * node0, node1, node2, node4 are used nodes. node3 is excluded node.
   * Expectation is that node5 should be chosen.

So, we'll still need to "fall back and relax the policy" somehow, which is why I decided to use the existing fallback flag.

Contributor


Yeah, this does make sense when the rack itself is not available. Thanks for pointing out such a scenario, @siddhantsangwan.

Contributor


I had another question. Ideally this can happen only in a small cluster. Shouldn't we just wait for the unhealthy replica to get deleted and then create the replica? That should be fine, right?

Contributor Author


Ideally this can happen only in a small cluster.

Yes, all of these are edge cases in small cluster situations.

Shouldn't we just wait for the unhealthy replica to get deleted and then create the replica? That should be fine, right?

Yes, that idea already exists in ECUnderReplicationHandler. This change as it stands should not interfere with that.

@siddhantsangwan
Contributor Author

I retested after applying this PR in the ozone-topology docker environment and have updated the testing document.

Conclusion:
Without patch HDDS-9011, under replication + mis replication is fixed in Situation 1. It fails in Situation 2 and Situation 4. It will probably work in Situation 3 (not tested).

With patch HDDS-9011, under replication is fixed in ALL the above scenarios. However, it performs worse in Situation 1: it's able to fix under replication, but mis replication is blocked.

So there's an impact to mis replication in at least one case, after all. I've explained the impact in Situation 1 in the doc. Otherwise, this patch works well.

@siddhantsangwan
Contributor Author

Update: In an offline discussion, we decided to keep the scope of this PR minimal. We've decided to go ahead with this suggestion:

From what I understand, all we have to change is to make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether. We wouldn't need a fallback flag then. This particular issue can only happen when the under replication handler calls this function asking for fewer nodes than the misreplication count. The rack scatter placement policy ensures that the misreplication count goes down if an additional node is chosen.

This means Situation 2 (from the linked doc) gets solved, and Situation 4 remains unsolved (it can be tackled in a future Jira). Since Situation 2 seems more critical, we restricted this PR to solving that problem. There will also be no regression in Situation 1.
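
For reference, a minimal sketch of the adopted calculation based on the quoted suggestion; this is an assumed shape, not the verbatim diff:

    // Hedged sketch: cap the number of additional racks required by the number
    // of nodes actually requested, so the earlier
    // "nodesRequired < additionalRacksRequired" check is no longer needed.
    int additionalRacksRequired = Math.min(
        nodesRequired, numberOfRacksRequired - usedRacksCntMap.size());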

@swamirishi and @ashishkumar50 please take another look!

Contributor

@swamirishi swamirishi left a comment


LGTM. Thanks @siddhantsangwan for the patch

@siddhantsangwan
Contributor Author

Thanks @swamirishi for the review!

Contributor

@sumitagrawl sumitagrawl left a comment


@siddhantsangwan Thanks for working over this, LGTM +1

@sumitagrawl sumitagrawl merged commit affd163 into apache:master Jul 28, 2023
@siddhantsangwan
Contributor Author

Thanks everyone for the reviews and discussions.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Sep 20, 2023
…thy replica in a rack-scatter environment (apache#5097)

(cherry picked from commit affd163)
Change-Id: I9397314bc24d89aced66a82046f25225e222972e