HDDS-5090. Make Decommission work under SCM HA. #2148
GlenGeng-awx merged 5 commits into apache:master
Conversation
> * This method should only be called when processing the
> * heartbeat, and for a registered node, the information stored in SCM is the
> * source of truth.
> * This method should only be called when processing the heartbeat.
Could we add a test for this new logic - Command fired if leader, but not fired if follower?
I checked in TestSCMNodeManager, and there is a test "testSetNodeOpStateAndCommandFired" which I earlier set to ignore, as it became invalid while I was developing decommission. I then forgot to go back and fix it.
You could use that test as a starting point, and change it to call processHeartBeat with the correct DatanodeDetails to trigger a command.
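The leader-gated command firing that such a test would exercise can be sketched in miniature. This is a hedged illustration with hypothetical stand-in classes (the ScmContext interface, NodeManager class, and field names here are simplifications, not the real Ozone types): on a heartbeat whose reported opState disagrees with SCM's stored state, a command is fired only when this SCM is the leader.

```java
// Minimal sketch (hypothetical names, not the real Ozone classes) of the
// leader check that gates SetNodeOperationalStateCommand on a heartbeat.
import java.util.ArrayList;
import java.util.List;

public class LeaderGatedCommandSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING }

  // Stand-in for SCMContext.isLeader().
  interface ScmContext { boolean isLeader(); }

  static class NodeManager {
    private final ScmContext context;
    final List<String> firedCommands = new ArrayList<>();
    OpState storedOpState = OpState.DECOMMISSIONING;  // SCM's source of truth

    NodeManager(ScmContext context) { this.context = context; }

    // Called while processing a heartbeat: if the DN's persisted opState
    // disagrees with SCM's, fire a command -- but only on the leader.
    void processHeartbeat(OpState reportedOpState) {
      if (reportedOpState != storedOpState && context.isLeader()) {
        firedCommands.add("SetNodeOperationalStateCommand:" + storedOpState);
      }
    }
  }

  public static void main(String[] args) {
    NodeManager leader = new NodeManager(() -> true);
    leader.processHeartbeat(OpState.IN_SERVICE);

    NodeManager follower = new NodeManager(() -> false);
    follower.processHeartbeat(OpState.IN_SERVICE);

    System.out.println(leader.firedCommands.size());    // 1
    System.out.println(follower.firedCommands.size());  // 0
  }
}
```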
Thanks for the hint. testSetNodeOpStateAndCommandFired is a good place to test the logic.
> return nodeManager.getNodeStatus(dn);
> }
>
> /**
We should probably add a test for this - ie ensure that nodes are added to the decommission workflow when the onBecomeLeader() event is fired.
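The behavior such a test would pin down can be sketched roughly as follows. This is a hedged toy model with illustrative names (DecommissionMonitor, onBecomeLeader's map argument, and the node list are all hypothetical simplifications): on the onBecomeLeader() event, every registered node whose persisted opState is not IN_SERVICE is re-added to the decommission workflow.

```java
// Toy model (hypothetical names) of resuming the admin workflow when an SCM
// wins a leader election.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OnBecomeLeaderSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING, IN_MAINTENANCE }

  static class DecommissionMonitor {
    final List<String> trackedNodes = new ArrayList<>();

    // Fired when this SCM becomes leader: resume admin for every node that
    // is mid-decommission or mid-maintenance according to its opState.
    void onBecomeLeader(Map<String, OpState> registeredNodes) {
      registeredNodes.forEach((dn, opState) -> {
        if (opState != OpState.IN_SERVICE) {
          trackedNodes.add(dn);
        }
      });
    }
  }

  public static void main(String[] args) {
    DecommissionMonitor monitor = new DecommissionMonitor();
    monitor.onBecomeLeader(Map.of(
        "dn1", OpState.IN_SERVICE,
        "dn2", OpState.DECOMMISSIONING,
        "dn3", OpState.IN_MAINTENANCE));
    System.out.println(monitor.trackedNodes.size());  // 2
  }
}
```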
sodonnel left a comment
Thanks for working on this @GlenGeng. The change looks mostly good, but I think we should add a couple of unit tests where I have commented inline.
Also, it might be a good idea to add an Integration test or two. There are quite a few integration tests in the TestDecommissionAndMaintenance class.
Finally, when node registers for the first time on a restart SCM, the NewNodeHandler event is fired. Inside it, there is:
```java
if (datanodeDetails.getPersistedOpState()
    != HddsProtos.NodeOperationalState.IN_SERVICE) {
  decommissionManager.continueAdminForNode(datanodeDetails);
}
```
So inside the above event handler, or inside NodeDecommissionManager.continueAdminForNode, I think we will need the isLeader() check, to skip adding the DN to the Admin workflow when the SCM is not the leader.
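A rough sketch of the suggested guard, with simplified hypothetical names (this NodeDecommissionManager, its ScmContext parameter, and the tracked-node list are illustrative stand-ins, not the real Ozone implementation):

```java
// Hedged sketch of the re-registration path: a node with a non-IN_SERVICE
// persisted opState is added back to the admin workflow, but only on the
// leader SCM. Names are illustrative, not the real Ozone ones.
import java.util.ArrayList;
import java.util.List;

public class ContinueAdminSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING }

  interface ScmContext { boolean isLeader(); }

  static class NodeDecommissionManager {
    private final ScmContext context;
    final List<String> trackedNodes = new ArrayList<>();

    NodeDecommissionManager(ScmContext context) { this.context = context; }

    // Called from the NewNodeHandler event when a node re-registers.
    void continueAdminForNode(String dn, OpState persisted) {
      if (persisted == OpState.IN_SERVICE) {
        return;  // nothing to resume
      }
      if (!context.isLeader()) {
        return;  // follower SCM: skip re-adding to the workflow
      }
      trackedNodes.add(dn);
    }
  }

  public static void main(String[] args) {
    NodeDecommissionManager onLeader = new NodeDecommissionManager(() -> true);
    onLeader.continueAdminForNode("dn1", OpState.DECOMMISSIONING);

    NodeDecommissionManager onFollower = new NodeDecommissionManager(() -> false);
    onFollower.continueAdminForNode("dn1", OpState.DECOMMISSIONING);

    System.out.println(onLeader.trackedNodes.size());    // 1
    System.out.println(onFollower.trackedNodes.size());  // 0
  }
}
```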
Thanks @sodonnel for the review. It's helpful!
Tests have been added. Please take one more look.
> I think we will need the isLeader() check, and skip adding the DN to the Admin workflow if the SCM is not the leader.
It is safe to add a DN to the workflow on a follower SCM. By design of SCM HA, all the event handlers keep running on a follower SCM, because 1) any SCMCommand sent out from a follower SCM will be ignored by the Datanode, 2) scm.db is protected by Ratis, so any update to scm.db on a follower will fail with NotLeaderException, and 3) ReplicationManager and BackgroundPipelineCreator are stopped on a follower SCM.
But I added the isLeader check anyway, to make the behavior on a follower SCM more straightforward.
> Also, it might be a good idea to add an Integration test or two.
I propose to replace MiniOzoneCluster with MiniOzoneHAClusterImpl in TestDecommissionAndMaintenance and verify that the decommission info is correctly replicated across multiple SCMs. I prefer to handle that work in a follow-up jira (https://issues.apache.org/jira/browse/HDDS-5100 has been created to track it).
@GlenGeng I have updated the branch to avoid compile error in case of merge to
Thanks @adoroszlai! This is fine by me, since it can pass checkstyle.
sodonnel left a comment
Updated tests look good to me. Feel free to commit the changes.
Thanks @sodonnel for the review. I will merge it.
What changes were proposed in this pull request?
The problem
The decommission/maintenance info is saved in the memory of SCM; if SCM is restarted, it relearns this info when the Datanodes re-register.
Only the leader SCM handles the decommissionNodes(), recommissionNodes() and startMaintenanceNodes() requests, and this info is not replicated to the follower SCMs. Thus when a failover happens, the new leader SCM loses the info, since it was saved only in the memory of the previous leader SCM.
Current status
If a SCM is restarted, then upon re-registration the datanode will already be in DECOMMISSIONING or ENTERING_MAINTENANCE or IN_MAINTENANCE state. In that case, it needs to be added back into the monitor to track its progress.
For a registered node, the information stored in SCM is the source of truth. If SCM finds that the opState or opStateExpiryEpoch reported by the Datanode differs from what it has in memory, it sends a SetNodeOperationalStateCommand to update the Datanode.
The solution
leader SCM --heartbeat--> Datanode --heartbeat--> follower SCM
Disadvantage
The same as now: if the leader SCM records the info and notifies the Datanode via heartbeat, but steps down before the Datanode reports the state to the follower SCMs via heartbeat, that info will be lost on the new leader SCM.
As discussed with Stephen O'Donnell, we can live with the rare event of a decommission starting and SCM failing over before the state has made it to the DNs.
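The propagation path above (leader SCM to Datanode to follower SCM, both hops via heartbeats) can be sketched as a toy simulation. The classes below are illustrative stand-ins under stated assumptions, not the real Ozone ones: the leader pushes the desired opState to the DN in a heartbeat response, the DN persists it, and any SCM (including followers) learns it from the DN's next heartbeat report, so no Ratis replication of this state is needed.

```java
// Toy simulation of decommission-state propagation:
// leader SCM --heartbeat--> Datanode --heartbeat--> follower SCM.
// All names here are hypothetical simplifications of the Ozone flow.
public class HeartbeatPropagationSketch {
  enum OpState { IN_SERVICE, DECOMMISSIONING }

  static class Datanode {
    OpState persistedOpState = OpState.IN_SERVICE;
  }

  static class Scm {
    final boolean leader;
    OpState view = OpState.IN_SERVICE;  // in-memory view of the DN's state

    Scm(boolean leader) { this.leader = leader; }

    // Leader pushes the desired state to the DN via a heartbeat response
    // (modeling SetNodeOperationalStateCommand); followers send nothing.
    void heartbeatResponse(Datanode dn, OpState desired) {
      if (leader) {
        dn.persistedOpState = desired;
      }
    }

    // Any SCM learns the DN's persisted state from the DN's heartbeat report.
    void receiveHeartbeat(Datanode dn) {
      view = dn.persistedOpState;
    }
  }

  public static void main(String[] args) {
    Datanode dn = new Datanode();
    Scm leader = new Scm(true);
    Scm follower = new Scm(false);

    // Admin starts a decommission on the leader; one heartbeat round later
    // the follower's in-memory view converges, so it would survive failover.
    leader.heartbeatResponse(dn, OpState.DECOMMISSIONING);
    follower.receiveHeartbeat(dn);
    System.out.println(follower.view);  // DECOMMISSIONING
  }
}
```

If the leader steps down between the two heartbeats, the follower's view is still IN_SERVICE, which is exactly the rare lost-decommission window described above.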
For details: https://docs.google.com/document/d/1N5PsUuLBGgvkYFQgDumvRZujc-9RcDwoE0SubZcLUzY/edit?usp=sharing
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5090
How was this patch tested?
Tested in Tencent's internal integration environment: the operationalState is replicated between multiple SCMs and survives SCM failover.