Skip to content

HDDS-15262. [Recon] Fix incorrect datanode insight values when a DN goes down#10259

Merged
ChenSammi merged 1 commit into
apache:masterfrom
priyeshkaratha:HDDS-15262
May 14, 2026
Merged

HDDS-15262. [Recon] Fix incorrect datanode insight values when a DN goes down#10259
ChenSammi merged 1 commit into
apache:masterfrom
priyeshkaratha:HDDS-15262

Conversation

@priyeshkaratha
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This pull request updates the metrics collection logic by migrating from DatanodeDetails to DatanodeInfo so that it can track dead nodes. So that the number of dn showing in storageDistribution and pendingDeletion will match and it will avoid confusions.

What is the link to the Apache JIRA

HDDS-15262

How was this patch tested?

Tested using existing unit testcases.

@priyeshkaratha priyeshkaratha marked this pull request as ready for review May 13, 2026 13:31
@ChenSammi ChenSammi requested a review from Copilot May 14, 2026 02:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Recon’s DataNode pending-deletion metrics collection to use DatanodeInfo / ReconNodeManager#getAllNodes() so metrics can include dead nodes and align DataNode counts between pendingDeletion and storage distribution views.

Changes:

  • Switched the metrics-collection task input type from DatanodeDetails to DatanodeInfo.
  • Changed node enumeration from reconNodeManager.getNodeStats().keySet() to reconNodeManager.getAllNodes() to include dead nodes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/DataNodeMetricsCollectionTask.java Accepts DatanodeInfo and builds JMX URL using DataNode web ports.
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/DataNodeMetricsService.java Uses getAllNodes() and submits metrics collection tasks for all DataNodes (including dead).
Comments suppressed due to low confidence (2)

hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/DataNodeMetricsCollectionTask.java:86

  • getJmxMetricsUrl() dereferences nodeDetails.getPort(portName) without checking for null. DatanodeDetails#getPort(...) can return null when a port is not present (eg older layout versions / incomplete details), and this PR intentionally starts including dead nodes where missing web ports is more likely. This can cause an NPE during task construction and fail the whole metrics collection. Consider handling missing host/port explicitly (eg fall back to IP, skip the node / return a -1 result with a warning) instead of calling .getValue() unconditionally.
  private String getJmxMetricsUrl() {
    String protocol = httpsEnabled ? "https" : "http";
    Name portName = httpsEnabled ?  Name.HTTPS : Name.HTTP;
    return String.format("%s://%s:%d/jmx",
        protocol,
        nodeDetails.getHostName(),
        nodeDetails.getPort(portName).getValue());
  }

hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/DataNodeMetricsService.java:182

  • DataNodeMetricsCollectionTask builds its JMX URL in the constructor; if a node is missing HTTP/HTTPS port details, task creation will throw (eg NPE from getPort(...).getValue()), aborting submission for all nodes and causing the whole collection to fail. Since this PR expands the node set to include dead nodes, it would be safer to validate required fields per-node here (or catch exceptions around task construction), record the node as failed, and continue submitting tasks for the remaining nodes.
    long submissionTime = System.currentTimeMillis();
    for (DatanodeInfo node : nodes) {
      DataNodeMetricsCollectionTask task = new DataNodeMetricsCollectionTask(
          node, httpsEnabled, metricsServiceProviderFactory);
      DatanodePendingDeletionMetrics key = new DatanodePendingDeletionMetrics(
          node.getHostName(), node.getUuidString(), -1L); // -1 is used as placeholder/failed status
      futures.put(key, executorService.submit(task));
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ChenSammi ChenSammi merged commit f1b1b53 into apache:master May 14, 2026
116 of 119 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants