HDDS-9011. EC: Unhealthy replica not replaced with a healthy replica in a rack-scatter environment#5097
Conversation
@swamirishi can you also please review?
ashishkumar50
left a comment
@siddhantsangwan, Thanks for the patch. I have a question related to maxOuterLoopIterations.
```java
mutableFavoredNodes, nodesRequired - chosenNodes.size(),
metadataSizeRequired, dataSizeRequired,
Integer.MAX_VALUE, usedRacksCntMap, maxReplicasPerRack));
nodesRequired - chosenNodes.size(),
```
Changing maxOuterLoopIterations to the exact number of nodes required will decrease the probability of getting the exact number of nodes required.
chooseRandom returns any random node, and there is a chance the random node fails the isValidNode check.
After a certain number of retries (currently 5), it returns null back to chooseNodesFromRacks.
If this happens for all the racks, then at the end of the loop we decrement maxOuterLoopIterations without choosing any node, so the probability of getting the required nodes keeps decreasing.
I think maxOuterLoopIterations should not be bound to the number of nodes required.
Good point. We can set maxOuterLoopIterations to a higher value. It looks like the number of iterations is bounded by both maxOuterLoopIterations and OUTER_LOOP_MAX_RETRY. The latter is a constant whose current value is 3, so we need to increase that too. Any suggestions for what their values should be?
OUTER_LOOP_MAX_RETRY exists for the case where we don't find any node in any rack even after going through all the racks OUTER_LOOP_MAX_RETRY times. I believe we can make it configurable, but the default value seems fine. Even if we find just one node in an iteration, the retry count is reset.
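The retry-reset behaviour described above can be sketched roughly as follows. This is my own simplification, not the actual chooseNodesFromRacks code; the class and method names are hypothetical:

```java
// Minimal sketch (not the actual Ozone code) of how OUTER_LOOP_MAX_RETRY
// bounds the outer loop: the retry budget only shrinks when a full pass
// over the racks yields no node, and resets whenever a node is found.
public class OuterLoopSketch {
  static final int OUTER_LOOP_MAX_RETRY = 3;

  // nodesPerRack[i] is the number of still-usable nodes on rack i; one node
  // per rack is taken per pass, mimicking the outer loop's spreading.
  public static int chooseNodes(int[] nodesPerRack, int nodesRequired) {
    int chosen = 0;
    int retries = OUTER_LOOP_MAX_RETRY;
    while (chosen < nodesRequired && retries > 0) {
      boolean foundInThisPass = false;
      for (int i = 0; i < nodesPerRack.length && chosen < nodesRequired; i++) {
        if (nodesPerRack[i] > 0) {
          nodesPerRack[i]--;
          chosen++;
          foundInThisPass = true;
        }
      }
      // The retry budget resets as soon as any pass makes progress.
      retries = foundInThisPass ? OUTER_LOOP_MAX_RETRY : retries - 1;
    }
    return chosen;
  }
}
```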
Keeping Integer.MAX_VALUE doesn't harm anything here, as OUTER_LOOP_MAX_RETRY already controls the exit condition in case we really cannot find enough nodes.
Ok, the consensus is that we should keep maxOuterLoopIterations as Integer.MAX_VALUE. I wonder if we should then change this argument to Integer.MAX_VALUE when we call chooseNodesFromRacks at line 352, too?
At line 352 it should be maxReplicasPerRack. If we increase this iteration count beyond maxReplicasPerRack, it can use the same rack for multiple replicas, because every outer loop pass starts reusing all the racks; a rack is only removed from reuse within the same iteration.
If we want to optimize, we can remove racks that have already reached the max replicas per rack, and then run the outer loop any number of times, such as Integer.MAX_VALUE.
If we want to optimize, we can remove racks that have already reached the max replicas per rack
I agree, and I think we're already doing that.
At line 70, we update the count of used nodes per rack:
```java
rackCntMap.merge(rack, 1, Math::addExact);
```
And at line 126, we filter nodes according to this map:
```java
for (Node rack : racks) {
  if (rackCntMap.getOrDefault(rack, 0) < maxReplicasPerRack) {
    toChooseRacks.add(rack);
  }
}
```
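As a runnable restatement of that filter (my own sketch with hypothetical names, using String rack labels instead of the Node type):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the rack filter quoted above: a rack stays eligible for
// another replica only while its used-node count is below
// maxReplicasPerRack.
public class RackFilterSketch {
  public static List<String> eligibleRacks(List<String> racks,
      Map<String, Integer> rackCntMap, int maxReplicasPerRack) {
    List<String> toChooseRacks = new ArrayList<>();
    for (String rack : racks) {
      if (rackCntMap.getOrDefault(rack, 0) < maxReplicasPerRack) {
        toChooseRacks.add(rack);
      }
    }
    return toChooseRacks;
  }
}
```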
I've decided to make it maxReplicasPerRack at line 352 for the first call, and Integer.MAX_VALUE for the second call. If we want to increase it for the first call, we can do that in this separate jira - https://issues.apache.org/jira/browse/HDDS-9054.
Looks fine now. Earlier it was 1 for the first call; it's now changed to maxReplicasPerRack, which is better than 1 and doesn't have any negative impact. Yes, we can optimize more later in the new Jira you created, HDDS-9054.
You can update the PR description as well for maxOuterLoopIterations.
Thanks @ashishkumar50 for your review. I've now updated the PR description as well.
Thanks @siddhantsangwan for the patch. I have taken a look at the code changes but not at the tests yet. I have some queries around certain changes in the algorithm, since this is a major change in behaviour.
```java
}

LOG.debug("Available racks excluding racks with used nodes: {}.", racks);
if (racks.size() < additionalRacksRequired) {
```
Curious to know when this condition is satisfied.
I haven't seen it happen yet.
```java
additional number of racks that replicas need to be on
("additionalRacksRequired").
*/
if (!fallback) {
```
If fallback is enabled, shouldn't we just change the list of available racks to the total number of racks required and change the maxReplicasPerRack value instead? That would be simpler, right?
I don't really understand what you're suggesting - can you please rephrase that?
The general idea here is that it's possible to fail some of the conditions above this, such as nodesRequired < additionalRacksRequired, and still get a new DN without having to increase maxReplicasPerRack.
I meant we should recalculate maxReplicasPerRack based on the number of available racks and not based on the topology configuration, since there could be cases where some racks are excluded completely.
This should happen in the case where racks.size() < additionalRacksRequired.
If you think this PR works as expected and what you're suggesting is an improvement that can be made separately, let's do that in https://issues.apache.org/jira/browse/HDDS-9054?
```java
metadataSizeRequired, dataSizeRequired, 1,
usedRacksCntMap, maxReplicasPerRack));
mutableFavoredNodes, nodesRequired, metadataSizeRequired,
dataSizeRequired, maxOuterLoopIterations, usedRacksCntMap,
```
We shouldn't change the numberOfNodesRequired from additionalRacksRequired to nodesRequired, since this function call is for spreading placement across as many racks as possible and chooses racks only from the unused racks.
I changed this because it's possible to have a situation where nodesRequired is less than additionalRacksRequired (check the third case in my linked document). If we set it to nodesRequired, the policy still prefers finding a new rack rather than a rack that already has a used node.
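To make the nodesRequired < additionalRacksRequired situation concrete, here is a small illustration with hypothetical numbers (my own sketch; only the arithmetic additionalRacksRequired = numberOfRacksRequired - usedRacksCntMap.size() and the fallback condition are taken from the discussion, and the class/method names are invented):

```java
// Sketch of the rack arithmetic discussed in this thread.
public class RackMathSketch {
  // Racks the policy still needs: total racks required minus racks that
  // already hold replicas.
  public static int additionalRacksRequired(int numberOfRacksRequired,
      int usedRacksCount) {
    return numberOfRacksRequired - usedRacksCount;
  }

  // Without fallback, the policy throws when fewer nodes are requested
  // than the number of new racks still needed.
  public static boolean wouldThrow(int nodesRequired,
      int additionalRacksRequired, boolean fallback) {
    return nodesRequired < additionalRacksRequired && !fallback;
  }
}
```

For example, a container whose replicas sit on only two of five required racks still needs three more racks, yet the under replication handler may ask for just one node; with fallback enabled the call proceeds instead of throwing.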
I believe in that case we should just check whether the misreplication count is going down or remaining the same when fallback is enabled. We shouldn't change this parameter.
check if the misreplication count is going down or remaining same in the case fallback is enabled
I believe we're already doing that here:
```java
List<DatanodeDetails> newPlacement =
    new ArrayList<>(usedNodes.size() + result.size());
newPlacement.addAll(usedNodes);
newPlacement.addAll(chosenNodes);
ContainerPlacementStatus placementStatus =
    validateContainerPlacement(newPlacement, requiredReplicationFactor);
if (!placementStatus.isPolicySatisfied()) {
  ContainerPlacementStatus initialPlacementStatus =
      validateContainerPlacement(usedNodes, requiredReplicationFactor);
  if (initialPlacementStatus.misReplicationCount()
      < placementStatus.misReplicationCount()) {
    String errorMsg = "ContainerPlacementPolicy not met. Misreplication" +
        " Reason: " + placementStatus.misReplicatedReason() +
        " Initial Used nodes mis-replication Count: " +
        initialPlacementStatus.misReplicationCount() +
        " Used nodes + Chosen nodes mis-replication Count: " +
        placementStatus.misReplicationCount();
    throw new SCMException(errorMsg, FAILED_TO_FIND_SUITABLE_NODE);
  }
}
```
We shouldn't change this parameter.
My understanding is that we should look for additionalRacksRequired if fallback is false, and for nodesRequired if fallback is true. Now, if fallback is false, then nodesRequired must be greater than additionalRacksRequired or we'll throw an exception earlier in the method:
```java
if (nodesRequired < additionalRacksRequired) {
  String reason = "Required nodes size: " + nodesRequired
      + " is less than required number of racks to choose: "
      + additionalRacksRequired + ".";
  LOG.warn("Placement policy cannot choose the enough racks. {}"
      + "Total number of Required Racks: {} Used Racks Count:" +
      " {}, Required Nodes count: {}, fallback: {}.",
      reason, numberOfRacksRequired, usedRacksCntMap.size(),
      nodesRequired, fallback);
  /*
  The number of additional Datanodes ("nodesRequired") that were asked for
  are less than the additional number of racks ("additionalRacksRequired")
  that container replicas need to be on. This means that returning
  nodesRequired number of nodes will not satisfy the rack scatter policy.
  If fallback is false, throw an exception. Else, continue to find
  nodesRequired number of nodes.
  */
  if (!fallback) {
    throw new SCMException(reason,
        FAILED_TO_FIND_SUITABLE_NODE);
  }
}
```
So this means that for both values of fallback, the argument can just be nodesRequired. Anyway, chooseNodesFromRacks should still try to spread it across as many racks as possible, with excluded and used racks sorted at the end of the list. What do you think?
…r the first and second calls, respectively
This PR is ready for review. I've updated
Note: While this PR is meant to improve EC under replication handling, I'm trying to determine its impact on mis replication handling.
Mis replication handling works in two stages. As far as I understand, there should be no negative impact on mis replication handling.
@swamirishi I'd appreciate your feedback on this since you're very familiar with this area.

@swamirishi could you please take a look at this PR?
Yes, I agree.
Yes, in my view the changes introduced in this PR should not have a negative impact on mis replication handling, as I said before. But I might be missing some edge cases that you're aware of.
```
@@ -268,20 +276,36 @@ protected List<DatanodeDetails> chooseDatanodesInternal(
int numberOfRacksRequired = getRequiredRackCount(requiredReplicationFactor);
int additionalRacksRequired =
```
From what I understand, all we have to change is make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether. We wouldn't need a fallback flag then. This particular issue can only happen when the under replication handler calls this function asking for fewer nodes than the misreplication count. The rack scatter placement policy ensures that the misreplication count goes down if an additional node is chosen.
On the misreplication handler side, in that case we should only copy as many replicas as the change in the misreplication count. Does this make sense?
From what I understand, all we have to change is make additionalRacksRequired = Math.min(nodesRequired, numberOfRacksRequired - usedRacksCntMap.size()); and remove the condition check nodesRequired < additionalRacksRequired altogether.
That will work for this particular case, yes. However, simply doing that will not work in the first case, which is:
```java
/**
 * Scenario:
 * rack0 -> node0
 * rack1 -> node1
 * rack2 -> node2
 * rack3 -> node3
 * rack4 -> node4, node5
 * <p>
 * node0, node1, node2, node4 are used nodes. node3 is excluded node.
 * Expectation is that node5 should be chosen.
 */
```
So, we'll still need to "fall back and relax the policy" somehow, which is why I decided to use the existing fallback flag.
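The scenario reduces to a one-line check (my own hypothetical encoding, not the PR's code): rack4 already hosts one replica, so node5 is rejected while the per-rack cap is 1, and accepted once fallback raises the cap.

```java
// Sketch: a node on a rack is choosable only while the rack's replica
// count is below the current maxReplicasPerRack cap.
public class FallbackSketch {
  public static boolean canChoose(int replicasAlreadyOnRack,
      int maxReplicasPerRack) {
    return replicasAlreadyOnRack < maxReplicasPerRack;
  }
}
```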
Yeah, this does make sense when the rack itself is not available. Thanks for pointing out such a scenario, @siddhantsangwan.
I had another question. Ideally this can happen only in a small cluster. Shouldn't we ideally wait for the unhealthy replica to get deleted and then create the replica? That should be fine, right?
Ideally this can happen only in a small cluster.
Yes, all of these are edge cases in small cluster situations.
Shouldn't we ideally wait for the unhealthy replica to get deleted and then create the replica? That should be fine, right?
Yes, that idea already exists in ECUnderReplicationHandler. This change, as it stands, should not interfere with that.
I retested after applying this PR in the ozone-topology docker environment and have updated the testing document.
So there's an impact to mis replication in at least one case, after all. I've explained the impact in Situation 1 in the doc. Otherwise, this patch works well.
…tions for first call is maxReplicasPerRack.
Update: In an offline discussion, we decided to keep the scope of this PR minimal. We've decided to go ahead with this suggestion:
This means situation 2 (from the linked doc) gets solved, and situation 4 remains unsolved (it can be tackled in a future jira). Since situation 2 seems more critical, we restricted this PR to solving that problem. There will also be no regression in situation 1. @swamirishi and @ashishkumar50 please take another look!

Thanks @swamirishi for the review!
sumitagrawl
left a comment
@siddhantsangwan Thanks for working on this, LGTM +1.

Thanks everyone for the reviews and discussions.
…thy replica in a rack-scatter environment (apache#5097) (cherry picked from commit affd163) Change-Id: I9397314bc24d89aced66a82046f25225e222972e
What changes were proposed in this pull request?
Simply put, the problem here is that EC under replication handling is blocked in cases where the Rack Scatter policy cannot return new Datanodes because they don't satisfy the policy.
Please check the jira for one such scenario, and this document for a detailed (but roughly documented) reproduction of some cases.
Changes proposed: allow the rack scatter policy to fall back and return the required number of datanodes even if the policy can't be satisfied, while still trying to follow the policy as well as possible. Because `fallback` is true by default, the policy will now try to return nodes in situations where it would previously have thrown an `SCMException`. The invariant that this policy tries to return the asked number of nodes, and fails if it can't, still holds.
The normal definition of "maximum replicas a rack can host" (called `maxReplicasPerRack` in the code) in the rack scatter policy is:
When falling back, the definition of `maxReplicasPerRack` in `chooseDatanodesInternal` is changed to "replication factor number of replicas", so that one rack can host all replicas of this container if needed. The policy will still try to spread the replicas across as many racks as possible - this is done by the existing algorithm in `chooseNodesFromRacks`. The algorithm iterates over each rack and chooses one node per rack in each iteration, so the chosen nodes are spread across as many racks as possible.
maxOuterLoopIterations. Previously, this argument was1in one case andInteger.MAX_VALUEin another case - now it'smaxReplicasPerRackfor the first call and stillInteger.MAX_VALUEfor the second call (see discussion under this comment).What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9011
How was this patch tested?
Changed some unit tests and added some new ones.
Draft while waiting for CI and doing some manual testing in a cluster.