New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-5277. Recon shows operational status as "DECOMMISSIONING" for "DECOMMISSIONED" DNs #2286
Conversation
I think there is another scenario this change won't catch.
In this case the dead node handler would fire once on Recon, and populate the dead nodes array, then the state would be checked and the array cleared. The later transition to IN_SERVICE + DEAD would never get seen by Recon. I wonder if it would be simpler to forget about the dead node handler on Recon, and just sync up all nodes on SCM with the nodes on Recon on each run (one every 5 minutes I think). Even with a few 1000 nodes, this should not have a big overhead. In saying that, using the dead node handler is a nice optimisation. We could ask SCM for a list of nodes which are only in the DEAD state and sync up that group. There is an existing API for that in SCMNodeManager:
I think that would cover both scenarios. |
Thanks for the review @sodonnel. The current patch does a general sync through the PipelineSyncTask as well. Doesn't that handle that case? To be fair, the DeadNodeHandler logic that I have added is redundant here. |
I think you are correct and I miss-understood how the change works. I will check again in the morning. |
I understand this better now. The DeadNodeHandler will catch the case where a node is stopped before it reports its own status to Recon. The sync in PipelineSyncTask would catch the maintenance timing out scenario too. This change LGTM, +1. |
Thanks for the review @sodonnel. |
What changes were proposed in this pull request?
Recon relies on DN heartbeats to understand Node operational state. If the node goes down before it reports itself as DECOMMISSIONED, then there is a loss of information on Recon side. It is the SCM that moves a node form DECOMMISSIONING to DECOMMISSIONED state first, then the Datanode persists the state change locally, and then heartbeats with the new state to SCM & Recon subsequently. If the DN is shutdown before it can heartbeat the state change to Recon, then Recon lives with the stale information.
This patch adds a DeadNodeHandler for Recon that updates the node operational status information from SCM if needed. It also adds a step in the Recon SCM Sync Task that updates the node operational status from SCM for every dead node as seen from Recon.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5277
How was this patch tested?
Manually tested using ozone decommissioning CLI.