HDDS-9011. EC: Unhealthy replica not replaced with a healthy replica in a rack-scatter environment#5097

Merged
sumitagrawl merged 8 commits into apache:master from siddhantsangwan:HDDS-9011
Jul 28, 2023

Conversation

@siddhantsangwan
Contributor

@siddhantsangwan siddhantsangwan commented Jul 20, 2023

What changes were proposed in this pull request?

Simply put, the problem here is that EC under replication handling is blocked in cases where the Rack Scatter policy cannot return new Datanodes because doing so would violate the policy.

Please check the jira for one such scenario, and this document for a detailed (but roughly documented) reproduction of some cases.

Changes proposed: Allow the rack scatter policy to fall back and return the required number of datanodes even if the policy can't be satisfied, while still following the policy as closely as possible. Because fallback is true by default, the policy will now try to return nodes in situations where it would previously have thrown an SCMException. The invariant that this policy tries to return the requested number of nodes and fails if it can't still holds.

The normal definition of "maximum replicas a rack can host" (called maxReplicasPerRack in the code) in the rack scatter policy is:

    return numReplicas / numberOfRacks
            + Math.min(numReplicas % numberOfRacks, 1);
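
For example (illustrative numbers only, not from the PR): with numReplicas = 5 and numberOfRacks = 3, this evaluates to 5 / 3 + min(5 % 3, 1) = 1 + 1 = 2 replicas per rack.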

When falling back, the definition of maxReplicasPerRack in chooseDatanodesInternal is changed to the replication factor, so that one rack can host all replicas of this container if needed. The policy will still try to spread the replicas across as many racks as possible; this is done by the existing algorithm in chooseNodesFromRacks, which iterates over each rack and chooses at most one node per rack in a single iteration, so the chosen nodes end up spread across as many racks as possible:

      for (Node rack : toChooseRacks) {
        if (rack == null) {
          // TODO: need to recheck why null coming here.
          continue;
        }
        Node node = chooseNode(rack.getNetworkFullPath(), unavailableNodes,
                metadataSizeRequired, dataSizeRequired);
        if (node != null) {
          chosenNodes.add((DatanodeDetails) node);
          rackCntMap.merge(rack, 1, Math::addExact);
          mutableFavoredNodes.remove(node);
          unavailableNodes.add(node);
          nodesRequired--;
          if (nodesRequired == 0) {
            break;
          }
        } else {
          // Store the skipped racks to check them first in next outer loop
          skippedRacks.add(rack);
        }
      }

The number of times this algorithm runs, i.e. the number of times we will consider a rack for choosing a node, is controlled by the argument maxOuterLoopIterations. Previously, this argument was 1 in one case and Integer.MAX_VALUE in another; now it's maxReplicasPerRack for the first call and still Integer.MAX_VALUE for the second call (see the discussion under this comment).
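
For illustration, a minimal sketch of the per-rack cap described above, with and without fallback. This is not the verbatim patch; it assumes numReplicas, numberOfRacks, requiredReplicationFactor, and the fallback flag are already in scope in chooseDatanodesInternal:

    // Hedged sketch: per-rack replica cap, normal vs. fallback behaviour.
    int maxReplicasPerRack = numReplicas / numberOfRacks
        + Math.min(numReplicas % numberOfRacks, 1);
    if (fallback) {
      // When falling back, allow a single rack to host all replicas of this
      // container if no policy-compliant placement is possible.
      maxReplicasPerRack = requiredReplicationFactor;
    }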

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9011

How was this patch tested?

Changed some unit tests and added some new ones.
Draft while waiting for CI and doing some manual testing in a cluster.

@siddhantsangwan siddhantsangwan changed the title EC: Unhealthy replica not replaced with a healthy replica in a rack-scatter environment HDDS-9011. EC: Unhealthy replica not replaced with a healthy replica in a rack-scatter environment Jul 20, 2023
@siddhantsangwan
Contributor Author

@swamirishi can you also please review?

Contributor

@ashishkumar50 ashishkumar50 left a comment


@siddhantsangwan, Thanks for the patch. I have a question related to maxOuterLoopIterations.

mutableFavoredNodes, nodesRequired - chosenNodes.size(),
metadataSizeRequired, dataSizeRequired,
Integer.MAX_VALUE, usedRacksCntMap, maxReplicasPerRack));
nodesRequired - chosenNodes.size(),
Contributor


Changing maxOuterLoopIterations to the exact number of nodes required will decrease the probability of getting the required number of nodes.
chooseRandom will return a random node, and there is a chance that the random node does not pass isValidNode.
After a certain number of retries (currently 5) it will return a null node back to chooseNodesFromRacks.
Assuming this happens for all the racks, we then decrement maxOuterLoopIterations at the end of the loop without choosing any node, so the probability of getting the required nodes keeps decreasing.
I think maxOuterLoopIterations should not be bound to the number of nodes required.

Contributor Author


Good point. We can set maxOuterLoopIterations to a higher value. Looks like the number of iterations is bounded by both maxOuterLoopIterations and OUTER_LOOP_MAX_RETRY. The latter is a constant whose current value is 3, so we would need to increase that too. Any suggestions for what their values should be?

Contributor


OUTER_LOOP_MAX_RETRY applies when we don't find any node in any rack even after going through all the racks OUTER_LOOP_MAX_RETRY times. I believe we could make it configurable, but the default value seems fine. Even if we find just one node in an iteration, the retry count is reset.
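
For context, a rough sketch of how the two bounds described above interact; this is an assumed simplification for illustration, not the actual method body, and it presumes nodesRequired, chosenNodes, maxOuterLoopIterations, and OUTER_LOOP_MAX_RETRY are in scope:

    // Hedged sketch: the outer loop is bounded by both maxOuterLoopIterations
    // and OUTER_LOOP_MAX_RETRY; the retry count resets whenever progress is made.
    int retryCount = 0;
    while (nodesRequired > 0
        && maxOuterLoopIterations > 0
        && retryCount < OUTER_LOOP_MAX_RETRY) {
      int chosenBefore = chosenNodes.size();
      // ... one pass over toChooseRacks, picking at most one node per rack ...
      if (chosenNodes.size() == chosenBefore) {
        retryCount++;       // no node found in this pass
      } else {
        retryCount = 0;     // found at least one node; reset the retry counter
      }
      maxOuterLoopIterations--;
    }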

Contributor


Keeping INT_MAX doesn't harm anything here, as OUTER_LOOP_MAX_RETRY already controls the exit condition in case we really cannot find enough nodes.

Contributor Author


OK, the consensus is that we should keep maxOuterLoopIterations as Integer.MAX_VALUE. I wonder if we should then change this argument to Integer.MAX_VALUE when we call chooseNodesFromRacks at line 352, too?

Contributor


At line 352 it should be maxReplicasPerRack. If we increase this iteration count beyond maxReplicasPerRack, it can use the same rack for multiple replicas, because each new outer loop iteration starts reusing all the racks; a rack is only excluded from reuse within the same iteration.
If we want to optimize, we can remove racks which have already reached the max replicas per rack and then run the outer loop any number of times, such as INT_MAX.

Contributor Author


If we want to optimize, we can remove racks which have already reached the max replicas per rack

I agree, and I think we're already doing that.
At line 70, we update the count of used nodes per rack:

rackCntMap.merge(rack, 1, Math::addExact);

And at line 126, we filter racks according to this map:

      for (Node rack : racks) {
        if (rackCntMap.getOrDefault(rack, 0) < maxReplicasPerRack) {
          toChooseRacks.add(rack);
        }
      }

Contributor Author


I've decided to make it maxReplicasPerRack at line 352 for the first call, and Integer.MAX_VALUE for the second call. If we want to increase it for the first call, we can do that in this separate jira - https://issues.apache.org/jira/browse/HDDS-9054.

Contributor


Looks fine now. Earlier it was 1 for the first call and is now changed to maxReplicasPerRack, which is better than 1 and doesn't have any negative impact. Yes, later we can optimize more in the new Jira you created, HDDS-9054.
You can update the PR description for maxOuterLoopIterations as well.

Contributor Author


Thanks @ashishkumar50 for your review. I've now updated the PR description as well.

Contributor

@swamirishi swamirishi left a comment


Thanks @siddhantsangwan for the patch. I have taken a look at the code changes but haven't looked at the tests yet. I have some queries around certain changes in the algorithm, since this is a major change in behaviour.

}

LOG.debug("Available racks excluding racks with used nodes: {}.", racks);
if (racks.size() < additionalRacksRequired) {
Contributor


Curious to know when this condition would be satisfied.

Contributor Author


I haven't seen it happen yet.

additional number of racks that replicas need to be on
("additionalRacksRequired").
*/
if (!fallback) {
Contributor


If we have fallback enabled, shouldn't we just change the list of available racks to the total number of racks required and change the maxReplicasPerRack value instead? It would be simpler, right?

Contributor Author

@siddhantsangwan siddhantsangwan Jul 21, 2023


I don't really understand what you're suggesting - can you please rephrase that?

The general idea here is that it's possible to fail some of the conditions above this, such as nodesRequired < additionalRacksRequired, and still get a new DN without having to increase maxReplicasPerRack.

Contributor


I meant we should recalculate maxReplicasPerRack based on the number of available racks and not based on the topology configuration, since there could be cases where some racks are excluded completely.

Contributor


This should happen in the case where racks.size() < additionalRacksRequired.

Contributor Author


If you think this PR works as expected and what you're suggesting is an improvement that can be made separately, let's do that in https://issues.apache.org/jira/browse/HDDS-9054?

metadataSizeRequired, dataSizeRequired, 1,
usedRacksCntMap, maxReplicasPerRack));
mutableFavoredNodes, nodesRequired, metadataSizeRequired,
dataSizeRequired, maxOuterLoopIterations, usedRacksCntMap,
Contributor


We shouldn't change the number of nodes required from additionalRacksRequired to nodesRequired, since this function call is for spreading placement across as many racks as possible and chooses racks only from the unused racks.

Contributor Author


I changed this because it's possible to have a situation where nodesRequired is less than additionalRacksRequired (check the third case in my linked document). If we set it to nodesRequired, the policy still prefers finding a new rack rather than a rack that already has a used node.

Contributor


I believe in that case we should just check whether the misreplication count is going down or remaining the same when fallback is enabled. We shouldn't change this parameter.

Contributor Author

@siddhantsangwan siddhantsangwan Jul 25, 2023


check whether the misreplication count is going down or remaining the same when fallback is enabled

I believe we're already doing that here:

    List<DatanodeDetails> newPlacement =
        new ArrayList<>(usedNodes.size() + result.size());
    newPlacement.addAll(usedNodes);
    newPlacement.addAll(chosenNodes);
    ContainerPlacementStatus placementStatus =
        validateContainerPlacement(newPlacement, requiredReplicationFactor);
    if (!placementStatus.isPolicySatisfied()) {
      ContainerPlacementStatus initialPlacementStatus =
          validateContainerPlacement(usedNodes, requiredReplicationFactor);
      if (initialPlacementStatus.misReplicationCount()
              < placementStatus.misReplicationCount()) {
        String errorMsg = "ContainerPlacementPolicy not met. Misreplication" +
                " Reason: " + placementStatus.misReplicatedReason() +
                " Initial Used nodes mis-replication Count: " +
                initialPlacementStatus.misReplicationCount() +
                " Used nodes + Chosen nodes mis-replication Count: " +
                placementStatus.misReplicationCount();
        throw new SCMException(errorMsg, FAILED_TO_FIND_SUITABLE_NODE);
      }
    }

We shouldn't change this parameter.

My understanding is that we should look for additionalRacksRequired if fallback is false, and for nodesRequired if fallback is true. Now, if fallback is false, then nodesRequired must be at least additionalRacksRequired, or we'll throw an exception earlier in the method:

if (nodesRequired < additionalRacksRequired) {
      String reason = "Required nodes size: " + nodesRequired
              + " is less than required number of racks to choose: "
              + additionalRacksRequired + ".";
      LOG.warn("Placement policy cannot choose the enough racks. {}"
                      + "Total number of Required Racks: {} Used Racks Count:" +
                      " {}, Required Nodes count: {}, fallback: {}.",
              reason, numberOfRacksRequired, usedRacksCntMap.size(),
              nodesRequired, fallback);

      /*
      The number of additional Datanodes ("nodesRequired") that were asked for
      are less than the additional number of racks ("additionalRacksRequired")
      that container replicas need to be on. This means that returning
      nodesRequired number of nodes will not satisfy the rack scatter policy.
      If fallback is false, throw an exception. Else, continue to find
      nodesRequired number of nodes.
       */
      if (!fallback) {
        throw new SCMException(reason,
            FAILED_TO_FIND_SUITABLE_NODE);
      }
    }

So this means that for both values of fallback, the argument can just be nodesRequired. Anyway, chooseNodesFromRacks should still try to spread it across as many racks as possible, with excluded and used racks sorted at the end of the list. What do you think?

@siddhantsangwan siddhantsangwan marked this pull request as ready for review July 24, 2023 06:21
@siddhantsangwan
Contributor Author

This PR is ready for review. I've updated maxOuterLoopIterations according to the comments.

@siddhantsangwan
Contributor Author

Note: While this PR is meant to improve EC under replication handling, I'm trying to determine its impact on mis replication handling.

@siddhantsangwan
Contributor Author

I'm trying to determine its impact on mis replication handling.

Mis replication handling works in two stages. The first stage is MisReplicationHandler using placement policy's replicasToCopyToFixMisreplication to ask for replicas that need to be copied to other racks. These target racks are selected using SCMContainerPlacementRackScatter. Once they're copied, the container becomes over replicated. Then ECOverReplicationHandler will delete replicas using the policy's replicasToRemoveToFixOverreplication API.

As far as I understand, there should be no negative impact on mis replication handling because:

  1. SCMContainerPlacementRackScatter will not return targets that make mis replication worse.
  2. replicasToCopyToFixMisreplication will only select replicas to copy if it's possible to improve mis replication.

@swamirishi I'd appreciate your feedback on this since you're very familiar with this area.

@swamirishi
Contributor

I'm trying to determine its impact on mis replication handling. [...]

The first priority for the replication manager is to fix unhealthy replicas, which is done by the under replication handler. The next priority in line is fixing over replication, and mis-replication has the least priority. So in the case of unhealthy replicas, mis-replication shouldn't really matter. But the chooseDatanode function is used by both the mis-replication handler and the under replication handler.

@umamaheswararao
Contributor

@swamirishi could you please take a look at this PR ?

@siddhantsangwan
Contributor Author

The first priority for the replication manager is to fix unhealthy replicas, which is done by the under replication handler. The next priority in line is fixing over replication, and mis-replication has the least priority.

Yes, I agree.

But the chooseDatanode function is used by both misreplication handler & underreplication handler.

Yes, in my view the changes introduced in this PR should not have a negative impact on mis replication handling, as I said before. But I might be missing some edge cases that you're aware of.

@@ -268,20 +276,36 @@ protected List<DatanodeDetails> chooseDatanodesInternal(
int numberOfRacksRequired = getRequiredRackCount(requiredReplicationFactor);
int additionalRacksRequired =
Contributor


From what I understand, all we have to change is to make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether. We wouldn't need a fallback flag then. This particular issue can only happen when the under replication handler calls this function asking for fewer nodes than the misreplication count. The rack scatter placement policy ensures that the misreplication count goes down if an additional node is chosen.

Contributor


In the misreplication handler, in that case we should only copy as many replicas as the change in the misreplication count. Does this make sense?

Contributor Author


From what I understand, all we have to change is to make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether.

That will work for this particular case, yes. However, simply doing that will not work in the first case, which is:

  /**
   * Scenario:
   * rack0 -> node0
   * rack1 -> node1
   * rack2 -> node2
   * rack3 -> node3
   * rack4 -> node4, node5
   * <p>
   * node0, node1, node2, node4 are used nodes. node3 is excluded node.
   * Expectation is that node5 should be chosen.

So, we'll still need to "fall back and relax the policy" somehow, which is why I decided to use the existing fallback flag.

Contributor


Yeah, this does make sense when the rack itself is not available. Thanks for pointing out such a scenario, @siddhantsangwan.

Contributor


I had another question. Ideally this can happen only in a small cluster. Shouldn't we just wait for the unhealthy replica to get deleted and then create the replica? That should be fine, right?

Contributor Author


Ideally this can happen only in a small cluster.

Yes, all of these are edge cases in small cluster situations.

Shouldn't we just wait for the unhealthy replica to get deleted and then create the replica? That should be fine, right?

Yes, that idea already exists in ECUnderReplicationHandler. This change as it stands should not interfere with that.

@siddhantsangwan
Contributor Author

I retested after applying this PR in the ozone-topology docker environment and have updated the testing document.

Conclusion:
Without patch HDDS-9011, under replication + mis replication is fixed in Situation 1. It fails in Situation 2 and Situation 4. It will probably work in Situation 3 (not tested).

With patch HDDS-9011, under replication is fixed in ALL the above scenarios. However, it performs worse in Situation 1: it's able to fix under replication, but mis replication is blocked.

So there's an impact to mis replication in at least one case, after all. I've explained the impact in Situation 1 in the doc. Otherwise, this patch works well.

@siddhantsangwan
Contributor Author

Update: In an offline discussion, we decided to keep the scope of this PR minimal. We've decided to go ahead with this suggestion:

From what I understand, all we have to change is to make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether. We wouldn't need a fallback flag then. This particular issue can only happen when the under replication handler calls this function asking for fewer nodes than the misreplication count. The rack scatter placement policy ensures that the misreplication count goes down if an additional node is chosen.

This means Situation 2 (from the linked doc) gets solved, and Situation 4 remains unsolved (it can be tackled in a future Jira). Since Situation 2 seems more critical, we restricted this PR to solving that problem. There will also be no regression in Situation 1.
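
For reference, a minimal sketch of the adopted calculation based on the quoted suggestion; this is an assumed shape, not the verbatim diff:

    // Hedged sketch: cap the number of additional racks required by the number
    // of nodes actually requested, so the earlier
    // "nodesRequired < additionalRacksRequired" check is no longer needed.
    int additionalRacksRequired = Math.min(
        nodesRequired, numberOfRacksRequired - usedRacksCntMap.size());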

@swamirishi and @ashishkumar50 please take another look!

Contributor

@swamirishi swamirishi left a comment


LGTM. Thanks @siddhantsangwan for the patch

@siddhantsangwan
Contributor Author

Thanks @swamirishi for the review!

Contributor

@sumitagrawl sumitagrawl left a comment


@siddhantsangwan Thanks for working over this, LGTM +1

@sumitagrawl sumitagrawl merged commit affd163 into apache:master Jul 28, 2023
@siddhantsangwan
Contributor Author

Thanks everyone for the reviews and discussions.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Sep 20, 2023
…thy replica in a rack-scatter environment (apache#5097)

(cherry picked from commit affd163)
Change-Id: I9397314bc24d89aced66a82046f25225e222972e