HDDS-5090. Make Decommission work under SCM HA. #2148
GlenGeng-awx merged 5 commits into apache:master
Conversation
> * This method should only be called when processing the
> * heartbeat, and for a registered node, the information stored in SCM is the
> * source of truth.
> * This method should only be called when processing the heartbeat.
Could we add a test for this new logic - Command fired if leader, but not fired if follower?
I checked in TestSCMNodeManager, and there is a test "testSetNodeOpStateAndCommandFired" which I earlier set to ignore, as it became invalid while I was developing decommission. I then forgot to go back and fix it.
You could use that test as a starting point, and change it to call processHeartBeat with the correct DatanodeDetails to trigger a command.
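The leader-gated command firing that such a test would exercise can be sketched in miniature. This is a hedged illustration with hypothetical stand-in classes (the ScmContext interface, NodeManager class, and field names here are simplifications, not the real Ozone types): on a heartbeat whose reported opState disagrees with SCM's stored state, a command is fired only when this SCM is the leader.

```java
// Minimal sketch (hypothetical names, not the real Ozone classes) of the
// leader check that gates SetNodeOperationalStateCommand on a heartbeat.
import java.util.ArrayList;
import java.util.List;

public class LeaderGatedCommandSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING }

  // Stand-in for SCMContext.isLeader().
  interface ScmContext { boolean isLeader(); }

  static class NodeManager {
    private final ScmContext context;
    final List<String> firedCommands = new ArrayList<>();
    OpState storedOpState = OpState.DECOMMISSIONING;  // SCM's source of truth

    NodeManager(ScmContext context) { this.context = context; }

    // Called while processing a heartbeat: if the DN's persisted opState
    // disagrees with SCM's, fire a command -- but only on the leader.
    void processHeartbeat(OpState reportedOpState) {
      if (reportedOpState != storedOpState && context.isLeader()) {
        firedCommands.add("SetNodeOperationalStateCommand:" + storedOpState);
      }
    }
  }

  public static void main(String[] args) {
    NodeManager leader = new NodeManager(() -> true);
    leader.processHeartbeat(OpState.IN_SERVICE);

    NodeManager follower = new NodeManager(() -> false);
    follower.processHeartbeat(OpState.IN_SERVICE);

    System.out.println(leader.firedCommands.size());    // 1
    System.out.println(follower.firedCommands.size());  // 0
  }
}
```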
Thanks for the hint. testSetNodeOpStateAndCommandFired is a good place to test the logic.
> return nodeManager.getNodeStatus(dn);
> }
>
> /**
We should probably add a test for this - ie ensure that nodes are added to the decommission workflow when the onBecomeLeader() event is fired.
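The behavior such a test would pin down can be sketched roughly as follows. This is a hedged toy model with illustrative names (DecommissionMonitor, onBecomeLeader's map argument, and the node list are all hypothetical simplifications): on the onBecomeLeader() event, every registered node whose persisted opState is not IN_SERVICE is re-added to the decommission workflow.

```java
// Toy model (hypothetical names) of resuming the admin workflow when an SCM
// wins a leader election.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OnBecomeLeaderSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING, IN_MAINTENANCE }

  static class DecommissionMonitor {
    final List<String> trackedNodes = new ArrayList<>();

    // Fired when this SCM becomes leader: resume admin for every node that
    // is mid-decommission or mid-maintenance according to its opState.
    void onBecomeLeader(Map<String, OpState> registeredNodes) {
      registeredNodes.forEach((dn, opState) -> {
        if (opState != OpState.IN_SERVICE) {
          trackedNodes.add(dn);
        }
      });
    }
  }

  public static void main(String[] args) {
    DecommissionMonitor monitor = new DecommissionMonitor();
    monitor.onBecomeLeader(Map.of(
        "dn1", OpState.IN_SERVICE,
        "dn2", OpState.DECOMMISSIONING,
        "dn3", OpState.IN_MAINTENANCE));
    System.out.println(monitor.trackedNodes.size());  // 2
  }
}
```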
sodonnel left a comment
Thanks for working on this @GlenGeng. The change looks mostly good, but I think we should add a couple of unit tests where I have commented inline.
Also, it might be a good idea to add an Integration test or two. There are quite a few integration tests in the TestDecommissionAndMaintenance class.
Finally, when node registers for the first time on a restart SCM, the NewNodeHandler event is fired. Inside it, there is:
```java
if (datanodeDetails.getPersistedOpState()
    != HddsProtos.NodeOperationalState.IN_SERVICE) {
  decommissionManager.continueAdminForNode(datanodeDetails);
}
```
So inside the above event handler, or inside NodeDecommissionManager.continueAdminForNode, I think we will need the isLeader() check, to skip adding the DN to the Admin workflow when the SCM is not the leader.
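A rough sketch of the suggested guard, with simplified hypothetical names (this NodeDecommissionManager, its ScmContext parameter, and the tracked-node list are illustrative stand-ins, not the real Ozone implementation):

```java
// Hedged sketch of the re-registration path: a node with a non-IN_SERVICE
// persisted opState is added back to the admin workflow, but only on the
// leader SCM. Names are illustrative, not the real Ozone ones.
import java.util.ArrayList;
import java.util.List;

public class ContinueAdminSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING }

  interface ScmContext { boolean isLeader(); }

  static class NodeDecommissionManager {
    private final ScmContext context;
    final List<String> trackedNodes = new ArrayList<>();

    NodeDecommissionManager(ScmContext context) { this.context = context; }

    // Called from the NewNodeHandler event when a node re-registers.
    void continueAdminForNode(String dn, OpState persisted) {
      if (persisted == OpState.IN_SERVICE) {
        return;  // nothing to resume
      }
      if (!context.isLeader()) {
        return;  // follower SCM: skip re-adding to the workflow
      }
      trackedNodes.add(dn);
    }
  }

  public static void main(String[] args) {
    NodeDecommissionManager onLeader = new NodeDecommissionManager(() -> true);
    onLeader.continueAdminForNode("dn1", OpState.DECOMMISSIONING);

    NodeDecommissionManager onFollower = new NodeDecommissionManager(() -> false);
    onFollower.continueAdminForNode("dn1", OpState.DECOMMISSIONING);

    System.out.println(onLeader.trackedNodes.size());    // 1
    System.out.println(onFollower.trackedNodes.size());  // 0
  }
}
```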
Thanks @sodonnel for the review. It's helpful!
Tests have been added. Please take one more look.
> I think we will need the isLeader() check, and skip adding the DN to the Admin workflow if the SCM is not the leader.
It is safe to add a DN to the workflow on a follower SCM. By design of SCM HA, all the event handlers keep running on a follower SCM, because 1) any SCMCommand sent out from a follower SCM will be ignored by the Datanode, 2) scm.db is protected by Ratis, so any update to scm.db on a follower will fail with NotLeaderException, and 3) ReplicationManager and BackgroundPipelineCreator are stopped on a follower SCM.
But I added the isLeader check anyway, to make the behavior on a follower SCM more straightforward.
> Also, it might be a good idea to add an Integration test or two.
I propose to replace MiniOzoneCluster with MiniOzoneHAClusterImpl in TestDecommissionAndMaintenance and verify that the decommission info is correctly replicated across multiple SCMs. I prefer to handle that work in a follow-up jira (https://issues.apache.org/jira/browse/HDDS-5100 has been created to track it).
@GlenGeng I have updated the branch to avoid compile error in case of merge to
Thanks @adoroszlai! This is fine by me, since it can pass checkstyle.
sodonnel left a comment
Updated tests look good to me. Feel free to commit the changes.
Thanks @sodonnel for the review. I will merge it.
What changes were proposed in this pull request?
The problem
The decommission/maintenance info is saved in the memory of SCM; if SCM is restarted, it relearns this info when the Datanodes re-register.
Only the leader SCM handles the decommissionNodes(), recommissionNodes() and startMaintenanceNodes() requests, and this info is not replicated to the follower SCMs. Thus when a failover happens, the new leader SCM loses the info, since it was saved only in the memory of the previous leader SCM.
Current status
If a SCM is restarted, then upon re-registration the datanode will already be in DECOMMISSIONING or ENTERING_MAINTENANCE or IN_MAINTENANCE state. In that case, it needs to be added back into the monitor to track its progress.
For a registered node, the information stored in SCM is the source of truth. If SCM finds that the opState or opStateExpiryEpoch reported by the Datanode differs from what it has in memory, it sends a SetNodeOperationalStateCommand to update the Datanode.
The solution
leader SCM --heartbeat--> Datanode --heartbeat--> follower SCM
Disadvantage
The same as now: if the leader SCM records the info and notifies the Datanode via heartbeat, but steps down before the Datanode reports the state to the follower SCMs via heartbeat, that info will be lost on the new leader SCM.
As discussed with Stephen O'Donnell, we can live with the rare event of a decommission starting and SCM failing over before the state has made it to the DNs.
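The propagation path above (leader SCM to Datanode to follower SCM, both hops via heartbeats) can be sketched as a toy simulation. The classes below are illustrative stand-ins under stated assumptions, not the real Ozone ones: the leader pushes the desired opState to the DN in a heartbeat response, the DN persists it, and any SCM (including followers) learns it from the DN's next heartbeat report, so no Ratis replication of this state is needed.

```java
// Toy simulation of decommission-state propagation:
// leader SCM --heartbeat--> Datanode --heartbeat--> follower SCM.
// All names here are hypothetical simplifications of the Ozone flow.
public class HeartbeatPropagationSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING }

  static class Datanode {
    OpState persistedOpState = OpState.IN_SERVICE;
  }

  static class Scm {
    final boolean leader;
    OpState view = OpState.IN_SERVICE;  // in-memory view of the DN's state

    Scm(boolean leader) { this.leader = leader; }

    // Leader pushes the desired state to the DN via a heartbeat response
    // (modeling SetNodeOperationalStateCommand); followers send nothing.
    void heartbeatResponse(Datanode dn, OpState desired) {
      if (leader) {
        dn.persistedOpState = desired;
      }
    }

    // Any SCM learns the DN's persisted state from the DN's heartbeat report.
    void receiveHeartbeat(Datanode dn) {
      view = dn.persistedOpState;
    }
  }

  public static void main(String[] args) {
    Datanode dn = new Datanode();
    Scm leader = new Scm(true);
    Scm follower = new Scm(false);

    // Admin starts a decommission on the leader; one heartbeat round later
    // the follower's in-memory view converges, so it would survive failover.
    leader.heartbeatResponse(dn, OpState.DECOMMISSIONING);
    follower.receiveHeartbeat(dn);
    System.out.println(follower.view);  // DECOMMISSIONING
  }
}
```

If the leader steps down between the two heartbeats, the follower's view is still IN_SERVICE, which is exactly the rare lost-decommission window described above.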
For details: https://docs.google.com/document/d/1N5PsUuLBGgvkYFQgDumvRZujc-9RcDwoE0SubZcLUzY/edit?usp=sharing
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5090
How was this patch tested?
Tested in Tencent's internal integration environment: the operationalState is replicated between multiple SCMs and survives SCM failover.