HDDS-15262. [Recon] Fix incorrect datanode insight values when a DN goes down#10259
Merged
Conversation
95eb3d3 to
a25e12a
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates Recon’s DataNode pending-deletion metrics collection to use DatanodeInfo / ReconNodeManager#getAllNodes() so metrics can include dead nodes and align DataNode counts between pendingDeletion and storage distribution views.
Changes:
- Switched the metrics-collection task input type from
DatanodeDetailstoDatanodeInfo. - Changed node enumeration from
reconNodeManager.getNodeStats().keySet()toreconNodeManager.getAllNodes()to include dead nodes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/DataNodeMetricsCollectionTask.java | Accepts DatanodeInfo and builds JMX URL using DataNode web ports. |
| hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/DataNodeMetricsService.java | Uses getAllNodes() and submits metrics collection tasks for all DataNodes (including dead). |
Comments suppressed due to low confidence (2)
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/DataNodeMetricsCollectionTask.java:86
getJmxMetricsUrl()dereferencesnodeDetails.getPort(portName)without checking for null.DatanodeDetails#getPort(...)can return null when a port is not present (eg older layout versions / incomplete details), and this PR intentionally starts including dead nodes where missing web ports is more likely. This can cause an NPE during task construction and fail the whole metrics collection. Consider handling missing host/port explicitly (eg fall back to IP, skip the node / return a -1 result with a warning) instead of calling.getValue()unconditionally.
private String getJmxMetricsUrl() {
String protocol = httpsEnabled ? "https" : "http";
Name portName = httpsEnabled ? Name.HTTPS : Name.HTTP;
return String.format("%s://%s:%d/jmx",
protocol,
nodeDetails.getHostName(),
nodeDetails.getPort(portName).getValue());
}
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/DataNodeMetricsService.java:182
DataNodeMetricsCollectionTaskbuilds its JMX URL in the constructor; if a node is missing HTTP/HTTPS port details, task creation will throw (eg NPE fromgetPort(...).getValue()), aborting submission for all nodes and causing the whole collection to fail. Since this PR expands the node set to include dead nodes, it would be safer to validate required fields per-node here (or catch exceptions around task construction), record the node as failed, and continue submitting tasks for the remaining nodes.
long submissionTime = System.currentTimeMillis();
for (DatanodeInfo node : nodes) {
DataNodeMetricsCollectionTask task = new DataNodeMetricsCollectionTask(
node, httpsEnabled, metricsServiceProviderFactory);
DatanodePendingDeletionMetrics key = new DatanodePendingDeletionMetrics(
node.getHostName(), node.getUuidString(), -1L); // -1 is used as placeholder/failed status
futures.put(key, executorService.submit(task));
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ChenSammi
approved these changes
May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This pull request updates the metrics collection logic by migrating from DatanodeDetails to DatanodeInfo so that it can track dead nodes. So that the number of dn showing in storageDistribution and pendingDeletion will match and it will avoid confusions.
What is the link to the Apache JIRA
HDDS-15262
How was this patch tested?
Tested using existing unit testcases.