HDDS-6567. Store datanode command queue counts from heartbeat in DatanodeInfo in SCM by sodonnel · Pull Request #3329 · apache/ozone

sodonnel · 2022-04-21T13:48:30Z

What changes were proposed in this pull request?

HDDS-6554 added the current command counts for all commands queued on a Datanode to the datanode heartbeat. This Jira will process that information on SCM and store it inside the DatanodeInfo object so other parts of SCM can reference it.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6567

How was this patch tested?

New Unit Tests

umamaheswararao

@sodonnel The patch looks almost good to me. I have a few questions for clarity, though. Thanks.

umamaheswararao · 2022-04-29T00:11:30Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeInfo.java

@@ -49,6 +57,7 @@ public class DatanodeInfo extends DatanodeDetails {
  private List<StorageReportProto> storageReports;
  private List<MetadataStorageReportProto> metadataStorageReports;
  private LayoutVersionProto lastKnownLayoutVersion;


@sodonnel I am wondering what happens to this DatanodeInfo when it expires due to a lack of heartbeats from that node. Is the object kept around or destroyed? I am trying to find that part of the code but have not found it so far. Please point me to where we remove this DatanodeInfo object when a node expires. Thanks.

NodeStateManager.checkNodesHealth is what notices the lost heartbeats and triggers events based on that.

The DeadNodeHandler is triggered when the node goes dead (there is also a StaleNodeHandler), and clears out its pipelines etc. Perhaps we should reset the command counts when this happens, or perhaps it is valid to leave them as the last known value. The DatanodeInfo object is not removed AFAIK, as it holds the DN service state (in_service, decommissioning, healthy, stale, dead etc). If the DN comes back, it will be reset by the heartbeat processing. If it never comes back, the DatanodeDetails and DatanodeInfo stick around in SCM until SCM is restarted.

I am not sure that leaving the old command counts in place is a big issue, as we should avoid scheduling commands on dead (and maybe stale) nodes anyway. E.g., before scheduling a command for a node, we need to check that it is HEALTHY, as otherwise the commands will be queued in SCM and never picked up by a DN. If something in SCM keeps scheduling commands for dead nodes, the SCM command queue will slowly fill up SCM memory.

Yeah, on thinking about it a bit, I think it's OK to leave the counts as they are (we will not assign tasks to a dead DN anyway). When the node rejoins, we should receive a new HB and the counts should get refreshed.
The question above was just for my own clarity about what happens to the DN object. Thanks for the details.
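
To make the point about checking health before scheduling concrete, here is a minimal sketch. It is not code from this patch or from SCM: `NodeHealth`, `GuardedCommandQueue` and its methods are hypothetical names, and nodes and commands are plain strings for brevity. It only shows the idea that a command for a node that is not HEALTHY is rejected instead of being queued indefinitely.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical health states, loosely mirroring the ones named in the thread.
enum NodeHealth { HEALTHY, STALE, DEAD }

// Hypothetical per-node command queue that refuses work for non-healthy nodes.
class GuardedCommandQueue {
  private final Map<String, NodeHealth> health = new HashMap<>();
  private final Map<String, Queue<String>> queued = new HashMap<>();

  void setHealth(String nodeId, NodeHealth state) {
    health.put(nodeId, state);
  }

  // Queue the command only if the node is known and HEALTHY.
  // Returns false otherwise, so the caller can pick another node.
  boolean offer(String nodeId, String command) {
    if (health.get(nodeId) != NodeHealth.HEALTHY) {
      return false;
    }
    queued.computeIfAbsent(nodeId, k -> new ArrayDeque<>()).add(command);
    return true;
  }

  // Number of commands currently waiting for the given node.
  int pending(String nodeId) {
    Queue<String> q = queued.get(nodeId);
    return q == null ? 0 : q.size();
  }
}
```

With a guard like this, nothing accumulates for a dead node, so the last known counts left on its DatanodeInfo are never acted on.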

umamaheswararao · 2022-04-29T00:12:56Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeInfo.java

+  /**
+   * Retrieve the number of queued commands of the given type, as reported by
+   * the datanode at the last heartbeat.
+   * @param cmd The command for which to receive the queued command count


If it's -1, should we wait before assigning any tasks to this node, since we don't know the actual state?

-1 means we have not received any data yet. In the case of an upgrade adding a new command (eg SCM upgraded with a new command, but some DNs not upgraded) those DNs will always show a -1 for the new command.

I am not sure how we should handle this - possibly we need a fallback position in any code that uses these counts. If it is "-1" then we need to use some other way of limiting the commands sent. The upgrade scenario should be short lived, and then DNs should only have -1 until their first heartbeat.

I just thought it was a good idea to include -1 as a different state than zero, so we can tell the difference between the two.
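
As an illustration of the -1 convention and the fallback idea, here is a minimal sketch under stated assumptions. `CommandType`, `CommandCounts`, `allowedToSend` and the limit parameters are hypothetical and are not the API added by this patch. Clearing the map on every update is also an assumption, made so that a type missing from the latest report reads as unknown again.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical command types; the real enum comes from the heartbeat protocol.
enum CommandType { REPLICATE_CONTAINER, DELETE_BLOCKS, CLOSE_CONTAINER }

// Hypothetical holder for the counts a datanode reported at its last heartbeat.
class CommandCounts {
  static final int UNKNOWN = -1;

  private final Map<CommandType, Integer> counts =
      new EnumMap<>(CommandType.class);

  // Replace the stored counts with the ones from the latest heartbeat.
  synchronized void update(Map<CommandType, Integer> reported) {
    counts.clear();
    counts.putAll(reported);
  }

  // Returns the last reported count, or -1 if nothing has been reported
  // for this command type (distinct from a reported count of zero).
  synchronized int get(CommandType type) {
    return counts.getOrDefault(type, UNKNOWN);
  }

  // Fallback policy: with an unknown count, send at most a small fixed
  // number of commands; otherwise fill the queue up to the limit.
  static int allowedToSend(int queued, int limit, int fallbackWhenUnknown) {
    if (queued == UNKNOWN) {
      return fallbackWhenUnknown;
    }
    return Math.max(0, limit - queued);
  }
}
```

A caller could combine the two, for example `CommandCounts.allowedToSend(counts.get(type), 10, 2)`, so that a datanode that has not yet reported a command type is throttled conservatively rather than skipped entirely.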

umamaheswararao · 2022-04-29T00:13:32Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeManager.java

+  /**
+   * Get the number of commands of the given type queued on the datanode at the
+   * last heartbeat. If the Datanode has not reported information for the given
+   * command type, -1 wil be returned.


nit: wil -> will

umamaheswararao · 2022-04-29T00:16:36Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java

+        datanodeInfo.setCommandCounts(commandQueueReportProto);
+        metrics.incNumNodeCommandQueueReportProcessed();
+      }
+    } catch (NodeNotFoundException e) {


Is this metric a "report failed" metric, or just an unknown-node report? I am not sure about the definition of this metric here.

These metrics copy what is already there for other commands, e.g. see processNodeReport() and processHeartbeat(). I basically copied this method's structure from there to keep it consistent. In both those cases, the metric is "failedProcessing", but the only failure handled is nodeNotFound, so the name is a bit misleading.

OK. Yeah, the name is a bit misleading.
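
The structure under discussion can be sketched roughly as follows. This is a simplified stand-in, not the SCMNodeManager code: `ReportProcessor`, `UnknownNodeException` and the two counters are hypothetical names. It shows why a "failed" counter that is only incremented in the node-not-found branch really counts reports from unknown nodes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the node-not-found exception.
class UnknownNodeException extends Exception {
  UnknownNodeException(String nodeId) {
    super("Unknown node: " + nodeId);
  }
}

// Hypothetical processor that counts processed and failed reports.
class ReportProcessor {
  final AtomicLong processed = new AtomicLong();
  final AtomicLong failed = new AtomicLong();
  private final Map<String, Integer> lastCount = new HashMap<>();

  // Make a node known to the processor; -1 means nothing reported yet.
  void register(String nodeId) {
    lastCount.put(nodeId, -1);
  }

  private void store(String nodeId, int count) throws UnknownNodeException {
    if (!lastCount.containsKey(nodeId)) {
      throw new UnknownNodeException(nodeId);
    }
    lastCount.put(nodeId, count);
  }

  // Same shape as the reviewed hunk: increment "processed" on success and
  // "failed" only when the node is not found.
  void process(String nodeId, int count) {
    try {
      store(nodeId, count);
      processed.incrementAndGet();
    } catch (UnknownNodeException e) {
      failed.incrementAndGet();
    }
  }
}
```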

umamaheswararao

LGTM

HDDS-6567. Store datanode command queue counts from heartbeat in Data…

68fe26b

…nodeInfo in SCM

umamaheswararao requested review from adoroszlai and umamaheswararao April 21, 2022 16:38

umamaheswararao reviewed Apr 29, 2022

Fix typo

f4e7d21

umamaheswararao approved these changes Apr 29, 2022

sodonnel merged commit d2ac336 into apache:master Apr 29, 2022
