
HDDS-8369. Decommissioning with rack aware placement policy does not replicate to correct rack. #4556

Merged
adoroszlai merged 6 commits into apache:master from ashishkumar50:HDDS-8369 on Apr 28, 2023

Conversation

ashishkumar50 (Contributor) commented Apr 11, 2023

What changes were proposed in this pull request?

Currently, decommissioned nodes are also considered when determining racks, which leads to undesired results and wrong rack selection. With this change, decommissioned nodes are no longer considered by the rack-aware placement policy algorithm.
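
For context, a minimal sketch of the idea (not the exact patch; the class and method names below are made up for illustration, while DatanodeDetails and isDecomissioned() appear in the review diff further down):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.hdds.protocol.DatanodeDetails;

/**
 * Illustrative sketch only: strip decommissioned nodes out of the exclude
 * list before it is handed to the rack-aware placement policy, so they no
 * longer skew rack selection.
 */
final class ExcludeListFilter {

  private ExcludeListFilter() { }

  static List<DatanodeDetails> withoutDecommissioned(
      List<DatanodeDetails> excludedNodes) {
    return excludedNodes.stream()
        // isDecomissioned() is the accessor used elsewhere in this PR.
        .filter(dn -> !dn.isDecomissioned())
        .collect(Collectors.toList());
  }
}
```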

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8369

How was this patch tested?

Verified in a real test environment.

// are on different racks.
for (int i = 0; i < excludedNodesCount; i++) {
  for (int j = i + 1; j < excludedNodesCount; j++) {
    if (excludedNodes.get(j).isDecomissioned()) {

krishnaasawa1 (Contributor) commented:

excludedNodes should not contain decommissioned nodes. Shouldn't ReplicationManager remove decommissioned nodes from excludedNodes before invoking rack awareness?

ashishkumar50 (Contributor, Author) replied:

@krishnaasawa1 thanks for the review. Removed decommissioned nodes from excludedNodes in ReplicationManager as well. Should we also keep the check in the topology code so decommissioned nodes are not considered as part of rack awareness?

sodonnel (Contributor) commented:

When picking new nodes, you have to pass in all the nodes that are not allowed to be used for a new replica, and that should include all the nodes that already have a replica, as they cannot be used for a new replica. The issue here is that SCMContainerPlacementRackAware is using "excluded" nodes to mean something like "existing nodes I want to replace", but nodes could be excluded for other reasons too, such as being overloaded, etc.

We changed the interface to pass "usedNodes" and "ExcludedNodes" separately, but until https://issues.apache.org/jira/browse/HDDS-7226 is implemented, the usedNodes are not used in SCMContainerPlacementRackAware.

Looking around the code, it is not strictly needed to exclude a decommissioning node, as the placement policies check whether any selected node is valid, and one of those checks is to ensure the node is IN_SERVICE.

Therefore I don't think we need the check for "isDecommissioned" here. Either we should fix the placement policy to use usedNodes correctly, or we should just not pass the decommissioned node in the exclude list to begin with.
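
To make the distinction concrete, here is a simplified illustration of the usedNodes / excludedNodes split (this is not the real Ozone PlacementPolicy interface; the signature and parameter names are assumptions for the sketch, and HDDS-7226 tracks the actual work):

```java
import java.util.List;

import org.apache.hadoop.hdds.protocol.DatanodeDetails;

/** Simplified, hypothetical two-list placement interface for illustration. */
interface TwoListPlacementPolicy {

  /**
   * @param usedNodes     nodes that already hold a replica and will keep it;
   *                      they constrain rack placement but are not candidates.
   * @param excludedNodes nodes that must not receive a new replica for other
   *                      reasons (unhealthy, overloaded, being replaced, ...).
   * @param nodesRequired number of additional replica targets to choose.
   * @return the chosen target datanodes.
   */
  List<DatanodeDetails> chooseDatanodes(
      List<DatanodeDetails> usedNodes,
      List<DatanodeDetails> excludedNodes,
      int nodesRequired);
}
```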

ashishkumar50 (Contributor, Author) commented:

@ChenSammi, @siddhantsangwan could you please help review?

adoroszlai requested a review from sodonnel on April 11, 2023 at 11:45
// maintenance nodes, as the replicas will remain present in the
// container manager, even when they go dead.
.filter(r -> getNodeStatus(r.getDatanodeDetails()).isHealthy()
    && !r.getDatanodeDetails().isDecomissioned()
sodonnel (Contributor) commented:

I don't think this is the correct place to filter out decommissioned nodes. This is forming a list of replication source nodes - it is valid to replicate from a decommissioning host, and in some cases it is necessary.

sodonnel (Contributor) commented:

The correct place to filter out the decommission / maintenance nodes would be in the method "replicateAnyWithTopology", as that is where the exclude list is formed to pass into the placement policy. But we would also need to take care of this in the new replication manager.
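
A minimal sketch of what that could look like at the point where the exclude list is assembled (hypothetical helper; the IN_SERVICE check mirrors the node-validity check mentioned above, and getPersistedOpState() is assumed to be available on DatanodeDetails):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.hdds.protocol.DatanodeDetails;
import org.apache.hadoop.hdds.protocol.proto.HddsProtos.NodeOperationalState;

/** Hypothetical helper for filtering the exclude list before placement. */
final class PlacementExcludeList {

  private PlacementExcludeList() { }

  /** Keep only IN_SERVICE nodes; drop decommission/maintenance nodes. */
  static List<DatanodeDetails> inServiceOnly(List<DatanodeDetails> excluded) {
    return excluded.stream()
        .filter(dn ->
            dn.getPersistedOpState() == NodeOperationalState.IN_SERVICE)
        .collect(Collectors.toList());
  }
}
```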

ashishkumar50 (Contributor, Author) replied:

Hi @sodonnel, thanks for the review. Handled this for both the legacy and the new replication manager.

sodonnel (Contributor) commented:

I think we should look at the difficulty of fixing https://issues.apache.org/jira/browse/HDDS-7226 before proceeding with this fix. It would be easy for other code areas to fall into a similar trap to this one, and ultimately we need to fix https://issues.apache.org/jira/browse/HDDS-7226 anyway.

arp7 (Contributor) commented Apr 17, 2023

Considering this is a two-line fix, is there any harm in letting this go in and then taking on HDDS-7226?

sodonnel (Contributor) commented:

It needs a unit test, at least in the non-legacy RM. I also think it just papers over the problem, and we should fix it correctly by addressing HDDS-7226 and then adjusting the code to use usedNodes and excludedNodes correctly. If something changes in the topology, it could end up returning the decommissioning node, as we are no longer excluding it.

Fixing HDDS-7226 should not be difficult and it will give a more robust solution going forward.

arp7 (Contributor) commented Apr 17, 2023

Agreed, we need a UT at least. I am not sure how much work HDDS-7226 is going to be, and Ashish also mentioned we will need a follow-up fix after HDDS-7226 to address this specific issue. So we need to balance the short-term problem against the long-term fix.

Ashish can you discuss with Stephen and decide a path forward?

sodonnel (Contributor) commented:

It's debatable whether this is actually a problem at all - the placement policy says the container should be on at least 2 racks, not exactly 2 racks. The "problem" is that it is ending up on more than 2 racks, which isn't really a problem.

sodonnel (Contributor) commented:

I also wonder about other scenarios. E.g., let's say we have 3 replicas, and one is unhealthy (the scrubber found a problem with it and marked it bad). Right now, when we make the call to the placement policy, we will pass the 2 good nodes and the unhealthy replica node as excluded and ask for 1 new node - which is effectively the same as passing 2 good nodes and a decommissioning node, as it will still confuse the placement algorithm.

What we really need to do is pass used nodes as nodes 1 and 2, as they are going to stay, and pass the unhealthy replica node as excluded so we don't select that node for a new copy. But we need that other Jira implemented to have usedNodes working in the RackAwarePlacementPolicy.

So I think the fix here addresses a partial solution for a specific scenario, but leaves other parts unfixed.
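
Under that split, the call for this scenario could look roughly like the fragment below (using the hypothetical two-list interface sketched earlier; goodNode1, goodNode2 and unhealthyNode are placeholders):

```java
// The two healthy replicas stay (usedNodes), the unhealthy one must not be
// chosen again (excludedNodes), and one new target is requested.
List<DatanodeDetails> targets = policy.chooseDatanodes(
    List.of(goodNode1, goodNode2),   // usedNodes
    List.of(unhealthyNode),          // excludedNodes
    1);                              // nodesRequired
```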

ashishkumar50 (Contributor, Author) commented:

@sodonnel, I have updated the PR to cover just the legacy RM, which will continue to use the old interface without usedNodes.
The new RM will use the new interface, which takes both used and excluded nodes.
Handling both scenarios in the rack-aware policy is covered in the PR below:
#4614
As long as the legacy RM code exists, the decommissioning scenario will continue to work with this fix.

adoroszlai added the scm label on Apr 28, 2023
adoroszlai merged commit 6eebe45 into apache:master on Apr 28, 2023
adoroszlai (Contributor) commented:

Thanks @ashishkumar50 for the patch, @krishnaasawa1, @sodonnel for the review.
