HDDS-8369. Decommissioning with rack aware placement policy does not replicate to correct rack. #4556
Conversation
```java
// are on different racks.
for (int i = 0; i < excludedNodesCount; i++) {
  for (int j = i + 1; j < excludedNodesCount; j++) {
    if (excludedNodes.get(j).isDecomissioned()) {
```
excludedNodes should not contain decommissioned nodes. Shouldn't the ReplicationManager remove decommissioned nodes from excludedNodes before invoking the rack-aware policy?
@krishnaasawa1 thanks for the review. I have removed decommissioned nodes from excludedNodes in the ReplicationManager as well. Should we also keep the check in the topology so decommissioned nodes are not considered by the rack-aware logic?
When picking new nodes, you have to pass in all the nodes that are not allowed to be used for a new replica, and that should include all the nodes that already have a replica, as they cannot host a new one. The issue here is that SCMContainerPlacementRackAware uses "excluded" nodes to mean something like "existing nodes I want to replace", but nodes could be excluded for other reasons too, such as being overloaded.
We changed the interface to pass "usedNodes" and "excludedNodes" separately, but until https://issues.apache.org/jira/browse/HDDS-7226 is implemented, the usedNodes are not used in SCMContainerPlacementRackAware.
Looking around the code, it is not strictly necessary to exclude a decommissioning node, as the placement policies check whether each selected node is valid, and one of those checks is to ensure the node is IN_SERVICE.
Therefore I don't think we need the "isDecommissioned" check here. Either we should fix the placement policy to use usedNodes correctly, or we should just not pass the decommissioned node in the exclude list to begin with.
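To make the intended semantics concrete, here is a minimal sketch of the two-list idea. The signature is assumed and simplified (it only fully takes effect once HDDS-7226 lands), and the helper names are hypothetical:

```java
import java.util.List;
import org.apache.hadoop.hdds.protocol.DatanodeDetails;

// Illustrative only: "used" nodes already hold replicas that will remain,
// so new targets must be placed relative to their racks; "excluded" nodes
// are unusable for a new replica for any other reason (overloaded, etc.).
List<DatanodeDetails> usedNodes = replicasThatWillRemain();     // hypothetical helper
List<DatanodeDetails> excludedNodes = nodesUnusableAsTargets(); // hypothetical helper

// Assumed/simplified signature based on the interface change described above:
List<DatanodeDetails> targets = placementPolicy.chooseDatanodes(
    usedNodes, excludedNodes, /* favoredNodes */ null,
    /* nodesRequired */ 1, /* metadataSize */ 0, /* dataSize */ containerSize);
```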
@ChenSammi, @siddhantsangwan, can you please help review?
```java
// maintenance nodes, as the replicas will remain present in the
// container manager, even when they go dead.
.filter(r -> getNodeStatus(r.getDatanodeDetails()).isHealthy()
    && !r.getDatanodeDetails().isDecomissioned()
```
I don't think this is the correct place to filter out decommissioned nodes. This is forming a list of replication source nodes - it is valid to replicate from a decommissioning host, and in some cases it is necessary.
The correct place to filter out the decommissioning / maintenance nodes would be in the method "replicateAnyWithTopology", as that is where the exclude list is formed before being passed into the placement policy. But we would also need to take care of this in the new replication manager.
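As a rough illustration of that idea (variable and helper names are mine, not the actual patch), the exclude list built in replicateAnyWithTopology could keep only IN_SERVICE replica hosts:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.hdds.protocol.DatanodeDetails;
import org.apache.hadoop.hdds.protocol.proto.HddsProtos.NodeOperationalState;
import org.apache.hadoop.hdds.scm.container.ContainerReplica;

// Sketch: drop DECOMMISSIONING / IN_MAINTENANCE hosts from the exclude list
// so they do not skew the rack choice made by the placement policy.
// getNodeStatus(...) is assumed to resolve a node's operational state.
List<DatanodeDetails> excludedNodes = replicas.stream()
    .map(ContainerReplica::getDatanodeDetails)
    .filter(dn -> getNodeStatus(dn).getOperationalState()
        == NodeOperationalState.IN_SERVICE)
    .collect(Collectors.toList());
```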
Hi @sodonnel, thanks for the review. I have handled this for both the legacy and the new replication manager.
I think we should look at the difficulty of fixing https://issues.apache.org/jira/browse/HDDS-7226 before proceeding with this fix. It would be easy for other code areas to fall into a similar trap to this one, and ultimately we need to fix https://issues.apache.org/jira/browse/HDDS-7226 anyway.
Considering this is a two-line fix, is there any harm in letting it go in and then taking on HDDS-7226?
It needs a unit test, at least in the non-legacy RM. I also think it just papers over the problem, and we should fix it correctly by addressing HDDS-7226 and then adjusting the code to use usedNodes and excludedNodes correctly. If something changes in the topology, it could end up returning the decommissioning node, as we are no longer excluding it. Fixing HDDS-7226 should not be difficult, and it will give a more robust solution going forward.
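Something along these lines could serve as the unit test. This is an illustrative JUnit skeleton only, with assumed class/method names, not the actual test added in this PR:

```java
import org.junit.jupiter.api.Test;

// Illustrative skeleton, assuming a test harness that can register
// datanodes with racks and operational states.
@Test
public void testDecommissioningReplicaDoesNotPinRackChoice() {
  // given: a container with replicas on rack0 (IN_SERVICE) and
  //        rack1 (DECOMMISSIONING), in a cluster with racks 0..2
  // when:  ReplicationManager requests one replication target
  // then:  the target is not the decommissioning node, and the final
  //        placement still spans at least 2 racks
}
```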
Agreed, we need a UT at least. I am not sure how much work HDDS-7226 is going to be, and Ashish mentioned we will need a follow-up fix after HDDS-7226 to address this specific issue. So we need to balance the short-term problem with the long-term fix. Ashish, can you discuss with Stephen and decide a path forward?
It's debatable whether this is actually a problem at all - the placement policy says the container should be on at least 2 racks, not exactly 2 racks. The "problem" is that it ends up on more than 2 racks, which isn't really a problem.
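For reference, the satisfaction check is of the "at least" form, roughly like the sketch below (simplified, with a hypothetical helper; the real logic lives in the policy's placement-status validation):

```java
// Simplified sketch of the rack-count check: using MORE racks than
// required still satisfies the policy, so spilling onto a 3rd rack
// is not a violation in itself.
int racksUsed = countDistinctRacks(replicaNodes); // hypothetical helper
int racksRequired = Math.min(requiredRackCount, totalRackCount);
boolean placementSatisfied = racksUsed >= racksRequired;
```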
I also wonder about other scenarios. E.g., let's say we have 3 replicas and one is unhealthy (the scrubber found a problem with it and marked it bad). Right now, when we make the call to the placement policy, we pass the 2 good nodes and the unhealthy replica's node as excluded and ask for 1 new node - which is effectively the same as passing 2 good nodes and a decommissioning node, as it will still confuse the placement algorithm. What we really need to do is pass nodes 1 and 2 as used nodes, as they are going to stay, and pass the unhealthy replica's node as excluded so we don't select it for the new copy. But we need that other Jira implemented to have usedNodes working in the RackAwarePlacementPolicy. So the fix here is a partial solution for a specific scenario, but leaves other cases unfixed.
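Concretely, for the unhealthy-replica example the call we would want to make (using the assumed/simplified two-list signature sketched earlier) looks like:

```java
import java.util.List;
import org.apache.hadoop.hdds.protocol.DatanodeDetails;

// Sketch: nodes 1 and 2 keep their replicas -> usedNodes; the node with
// the unhealthy replica must not receive the new copy -> excludedNodes.
List<DatanodeDetails> usedNodes = List.of(node1, node2);
List<DatanodeDetails> excludedNodes = List.of(unhealthyNode);
List<DatanodeDetails> targets = placementPolicy.chooseDatanodes(
    usedNodes, excludedNodes, /* favoredNodes */ null,
    /* nodesRequired */ 1, /* metadataSize */ 0, /* dataSize */ containerSize);
```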
@sodonnel, I have updated the PR for just the legacy RM, which will continue to use the old interface without usedNodes.
Thanks @ashishkumar50 for the patch, @krishnaasawa1, @sodonnel for the review.
What changes were proposed in this pull request?
Currently, decommissioned nodes are also considered when determining racks, which leads to undesired results and wrong rack selection. With this change, decommissioned nodes are no longer considered by the rack-aware placement policy algorithm.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8369
How was this patch tested?
Verified in a real test environment.