
HDDS-2329 Destroy pipelines on any decommission or maintenance nodes #86

Closed
Wants to merge 5 commits

Conversation

sodonnel
Contributor

What changes were proposed in this pull request?

When a node is marked for decommission or maintenance, the first step in taking the node out of service is to destroy any pipelines the node is involved in and confirm they have been destroyed before getting the container list for the node.

This commit adds a new class, DatanodeAdminMonitor, which is responsible for tracking nodes as they go through the decommission workflow.

When a node is marked for decommission, it is added to a queue in this monitor. The monitor runs periodically (every 30 seconds by default) and processes any queued nodes. After processing, they are tracked inside the monitor as the decommission workflow progresses (closing pipelines, getting the container list, replicating the containers, etc).

With this commit, a node can be added to the monitor for decommission or maintenance and it will have its pipelines closed.

It will not make any further progress after the pipelines have been closed; further commits will address the next states.
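The queue-and-track flow described above could be sketched roughly as follows. This is a simplified, hypothetical illustration, not the actual patch: the class shape, method names (startMonitoring, closePipelines, getTrackedNodeCount) and the use of plain strings for datanodes are assumptions; the real monitor works with richer datanode detail objects and fires SCM events.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical, simplified sketch of the monitor described above; the real
// class in the patch tracks datanode detail objects and fires SCM events.
public class DatanodeAdminMonitorSketch implements Runnable {

  private final Queue<String> pendingNodes = new ArrayDeque<>();
  private final Set<String> trackedNodes = new HashSet<>();

  // Called when an admin marks a node for decommission or maintenance.
  public synchronized void startMonitoring(String datanode) {
    pendingNodes.add(datanode);
  }

  // Invoked periodically (every 30 seconds by default, per the description).
  @Override
  public synchronized void run() {
    // Move newly queued nodes into the tracked set.
    while (!pendingNodes.isEmpty()) {
      trackedNodes.add(pendingNodes.poll());
    }
    for (String dn : trackedNodes) {
      // First workflow step implemented by this PR: close any pipelines
      // the node participates in. Later steps (getting the container list,
      // replicating containers) come in follow-up commits.
      closePipelines(dn);
    }
  }

  private void closePipelines(String dn) {
    // Placeholder for the event that closes the node's pipelines.
  }

  public synchronized int getTrackedNodeCount() {
    return trackedNodes.size();
  }
}
```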

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-2329

How was this patch tested?

Some manual tests and new unit tests have been added.

S O'Donnell added 2 commits October 25, 2019 19:06
… moved through the decommission workflow.

With this commit, a node can be added to the monitor for decommission or maintenance and it will have its pipelines closed.

It will not make any further progress after the pipelines have been closed and further commits will address the next states.
Contributor

@anuengineer anuengineer left a comment


I am +1 on this patch. @nandakumar131 , @xiaoyuyao, @elek Just flagging since this is an important patch. I will commit this on Monday evening PST if there are no further comments. Thanks

* the following happens:
*
* 1. First an event is fired to close any pipelines on the node, which will
* also close any contaners.
Contributor


typo: contaners -> containers

Contributor Author


Fixed.

* into maintenance.
*/
public enum States {
CLOSE_PIPELINES, GET_CONTAINERS, REPLICATE_CONTAINERS,
Contributor


Do you want to add an integer to this enum, so you can actually annotate which enum value is first, second, etc.?
At this point it is implicit, I suppose?

Contributor


Also we might want to add a detailed ASCII-based state machine diagram for future readers.

Contributor Author


I added sequenceNumber to indicate the ordering. ASCII diagram is a good idea, but I will leave it out until we are sure there are no more states or this flow may change before decommission is finished.
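The sequenceNumber addition might look something like the sketch below. The field name follows the comment above, but the constructor shape and the accessor name are assumptions, not the actual patch code.

```java
// Sketch of the States enum with an explicit ordering field; the
// sequenceNumber field name comes from the discussion above, the rest
// is an assumption.
public enum States {
  CLOSE_PIPELINES(1),
  GET_CONTAINERS(2),
  REPLICATE_CONTAINERS(3);

  private final int sequenceNumber;

  States(int sequenceNumber) {
    this.sequenceNumber = sequenceNumber;
  }

  // Makes the workflow ordering explicit rather than relying on the
  // implicit declaration order of the constants.
  public int getSequenceNumber() {
    return sequenceNumber;
  }
}
```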

while(!cancelledNodes.isEmpty()) {
DatanodeAdminNodeDetails dn = cancelledNodes.poll();
trackedNodes.remove(dn);
// TODO - fire event to bring node back into service?
Contributor


I don't think this is needed, since if the DN starts HB-ing we should be ok, is what I think. @nandakumar131, any comments?

Contributor Author


I will leave this TODO in place for now, and we can review the "bring back into service" in more detail later as we progress with future patches.
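The cancellation path under discussion could be sketched as below. This is a hypothetical simplification: the real code uses DatanodeAdminNodeDetails objects, while plain strings stand in here, and the helper names (track, cancel, isTracked) are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Simplified sketch of the cancellation handling quoted in the diff above.
public class CancelSketch {
  private final Queue<String> cancelledNodes = new ArrayDeque<>();
  private final Set<String> trackedNodes = new HashSet<>();

  public void track(String dn) {
    trackedNodes.add(dn);
  }

  public void cancel(String dn) {
    cancelledNodes.add(dn);
  }

  // Mirrors the loop in the diff: drain the cancelled queue and stop
  // tracking each node. Whether an explicit "back into service" event is
  // needed is left as a TODO in the patch; the reviewers suggest the
  // node's own heartbeats may be enough to bring it back.
  public void processCancelledNodes() {
    while (!cancelledNodes.isEmpty()) {
      String dn = cancelledNodes.poll();
      trackedNodes.remove(dn);
      // TODO - fire event to bring node back into service?
    }
  }

  public boolean isTracked(String dn) {
    return trackedNodes.contains(dn);
  }
}
```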

@@ -2501,4 +2501,15 @@
The number of Recon Tasks that are waiting on updates from OM.
</description>
</property>
<property>
Contributor


I am going to commit this, but we have a different way of writing configs. I missed it earlier. I will file a follow-up JIRA to clean this up.
