HDDS-15034. Query SCM status for ozone admin upgrade status command#10084

Open
dombizita wants to merge 8 commits into apache:HDDS-14496-zdu from dombizita:HDDS-15034

Conversation

@dombizita
Contributor

What changes were proposed in this pull request?

After #10011 is merged, the hardcoded placeholder responses can be removed and the command connected to SCM for real values. Based on @errose28's suggestion, I used HDDSLayoutVersionManager to check the finalization status of SCM, and added a new counter to SCMNodeManager to track the number of finalized DNs, which is used for the ozone admin upgrade status output.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15034

How was this patch tested?

Added tests, green CI on my fork: https://github.com/dombizita/ozone/actions/runs/24517013218

@dombizita dombizita requested review from errose28 and sodonnel April 17, 2026 09:13
@github-actions github-actions Bot added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Apr 17, 2026
@dombizita
Contributor Author

Thank you for the review @sodonnel, addressed your comments in the latest commit.

@dombizita
Contributor Author

Thanks @sodonnel, based on your comments and our offline discussion I agree that it's safer and easier to just get the count each time while iterating through the nodes rather than storing it as a counter, which could go out of sync in corner cases.


static boolean shouldFinalize(UpgradeFinalization.Status scmUpgradeStatus,
    int finalizedDatanodes, int healthyDatanodes) {
  return UpgradeFinalization.Status.FINALIZATION_REQUIRED.equals(scmUpgradeStatus)
Contributor

shouldFinalize should be true when SCM is finalized and all DNs are finalized, as it is the trigger for other components to finalize. I think the existing enum states are quite confusing:

  public enum Status {
    ALREADY_FINALIZED,          // => I think it's this value if it starts up already finalized?
    STARTING_FINALIZATION,
    FINALIZATION_IN_PROGRESS,
    FINALIZATION_DONE,          // => It gets here if it was REQUIRED and then gets finalized.
    FINALIZATION_REQUIRED,      // => Starts up unfinalized.
  }

So I think here we need to check:

if ((UpgradeFinalization.Status.FINALIZATION_DONE.equals(scmUpgradeStatus)
    || UpgradeFinalization.Status.ALREADY_FINALIZED.equals(scmUpgradeStatus))
    && finalizedDatanodes == healthyDatanodes) {
  // Should finalize == true
} else {
  // should finalize == false
}
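The proposed check can be sketched as a self-contained snippet. The Status values are copied from the enum quoted above; the class and method names here are stand-ins for illustration, not the PR's actual code:

```java
// Runnable sketch of the check proposed above; ScmFinalizedCheck is a
// hypothetical container class, not part of the Ozone codebase.
public class ScmFinalizedCheck {
  enum Status {
    ALREADY_FINALIZED, STARTING_FINALIZATION, FINALIZATION_IN_PROGRESS,
    FINALIZATION_DONE, FINALIZATION_REQUIRED
  }

  // True only when SCM itself has finalized (by either path) and every
  // healthy datanode has finalized too.
  static boolean shouldFinalize(Status scmStatus, int finalizedDatanodes,
      int healthyDatanodes) {
    return (Status.FINALIZATION_DONE == scmStatus
        || Status.ALREADY_FINALIZED == scmStatus)
        && finalizedDatanodes == healthyDatanodes;
  }

  public static void main(String[] args) {
    System.out.println(shouldFinalize(Status.FINALIZATION_DONE, 3, 3));     // true
    System.out.println(shouldFinalize(Status.FINALIZATION_REQUIRED, 3, 3)); // false
    System.out.println(shouldFinalize(Status.ALREADY_FINALIZED, 2, 3));     // false
  }
}
```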

Contributor

We already have isScmFinalized above, so we can just reuse it.

So it's just scmFinalized && numDNs == finalizedDNs, and then we can drop this method.

Contributor Author

Ohh, thanks. I only added the numDNs == finalizedDNs check in the previous round of review and forgot to fix the SCM finalization check, as OM will poll this status.
Shouldn't the CLI output be more specific about this? Like "OM should finalize"?

out().println("Update status:");
out().println(" SCM Finalized: " + status.getScmFinalized());
out().println(" Datanodes finalized: " + status.getNumDatanodesFinalized());
out().println(" Total Datanodes: " + status.getNumDatanodesTotal());
out().println(" Should Finalize: " + status.getShouldFinalize());

Contributor

Yea we can tweak the CLI output, as what is there is just kind of a placeholder and we still need to add the JSON part. Possibly the Should Finalize part doesn't even need to be in the human readable part and only in the JSON response. We can consider it more in another PR.

Contributor

@sodonnel left a comment

This version looks good if we get green CI.

Contributor

@errose28 left a comment

Thanks for working on this @dombizita. I don't think we are quite ready to merge yet.

try {
  // Only check HEALTHY nodes. STALE/DEAD nodes will be told to
  // finalize when they recover.
  if (!nodeManager.getNodeStatus(dn).isHealthy()) {
Contributor

This doesn't address the previous comment because it is only checking the health state returned by getNodeStatus, not the operational state. We can probably just loop over nodeManager.getNodes(NodeStatus.inServiceHealthy()) then not need to skip any entries in the loop.
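That filtering can be modeled with a self-contained sketch, assuming hypothetical Health/OpState enums in place of Ozone's NodeStatus and node manager (a sketch of the idea, not the real API):

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for looping over
// nodeManager.getNodes(NodeStatus.inServiceHealthy()): the node manager hands
// back only IN_SERVICE + HEALTHY nodes, so the loop needs no skip logic.
public class FinalizedNodeCount {
  enum Health { HEALTHY, STALE, DEAD }
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

  record Node(Health health, OpState opState, boolean finalized) { }

  // Models getNodes(NodeStatus.inServiceHealthy()): both the health state
  // and the operational state are checked.
  static List<Node> inServiceHealthy(List<Node> all) {
    Predicate<Node> eligible =
        n -> n.health() == Health.HEALTHY && n.opState() == OpState.IN_SERVICE;
    return all.stream().filter(eligible).toList();
  }

  static int countFinalized(List<Node> all) {
    return (int) inServiceHealthy(all).stream().filter(Node::finalized).count();
  }

  public static void main(String[] args) {
    List<Node> nodes = List.of(
        new Node(Health.HEALTHY, OpState.IN_SERVICE, true),
        new Node(Health.HEALTHY, OpState.DECOMMISSIONED, true), // excluded
        new Node(Health.STALE, OpState.IN_SERVICE, false));     // excluded
    System.out.println(inServiceHealthy(nodes).size()); // 1
    System.out.println(countFinalized(nodes));          // 1
  }
}
```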

Contributor

I did wonder if we should let nodes in any of the decommission(ed / ing) and maintenance states finalize if they are healthy (meaning they are still heartbeating). If they are heartbeating, they should either finalize or stop heartbeating and go dead. They can be forced to finalize if they ever transition back to in_service, but I don't think that transition requires a re-register, so it would be another code path to worry about later.

Contributor

I do think we should heartbeat to all nodes we can reach that they should finalize. We don't need to count all those nodes towards the HDDS finalization exit criteria, but maybe we should. It is simpler to reason that all live nodes have finalized, and we don't need to worry about decom/maintenance at all. I was considering a scenario where a decom node has just been shut down and might block finalization for the full heartbeat timeout, but that is only 10 minutes.

Along the same lines, I'm wondering if we should also count stale nodes towards the total count and wait for them to either finalize or go dead. If we exclude them from the count, we cannot say that all registered nodes have finalized because stale nodes will not have to re-register when they heartbeat again. If we include them, there could be one or a few slow nodes that periodically go stale and are also not processing the commands which would hold up OM.

Mostly I think we just need to make a decision on what states we count and why.

Contributor

I was considering a scenario where a decom node has just been shut down and might block finalization for the full heartbeat timeout, but that is only 10 minutes.

Yea, I think we can live with that. People should not be upgrading clusters with nodes in strange states IMO, as it just adds to potential problems.

On stale nodes, there are:

stale : unfinalized
stale : finalized

The second we can just count as a finalized node and not worry about it. The first is a problem.

It can either go dead and we can ignore it, or it can go back to healthy and may or may not finalize and go stale again. If it is able to go stale -> healthy, it must have heartbeated and hence picked up its finalize command, but it may not process it if the DN is in bad shape, and it may not heartbeat again.

I think our upgrade write path design could handle a node that slips through un-finalized? Or do we put some extra logic in to fence it out until it finalizes? For a stale node, all its Ratis pipelines get closed on transition to stale anyway, so it is effectively out of the write path at that point.

Comment on lines +197 to +199
private final int numFinalizedDatanodes;
private final int totalHealthyDatanodes;
private final int numUnfinalizedDatanodes;
Contributor

Do we actually need three fields here? I think we can just track total healthy in-service nodes and number of finalized nodes. The number of unfinalized nodes can be derived from that. These are the only two fields the upgrade status API uses.

The finalization wait code can be changed to check that the healthy node count equals the finalized count. We could also wrap that in an allNodesFinalized method inside DatanodeFinalizationCount.

getNumUnfinalizedDatanodes has no test coverage, but I think it should be removed. getTotalHealthyDatanodes also has no test coverage, which should be added.
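A minimal sketch of the suggested two-field shape, with the unfinalized count derived rather than stored (the class and method names follow this review's suggestion and are hypothetical, not merged code):

```java
// Sketch of the two-field counts holder suggested above.
public class DatanodeFinalizationCounts {
  private final int totalHealthyDatanodes;
  private final int numFinalizedDatanodes;

  public DatanodeFinalizationCounts(int totalHealthy, int numFinalized) {
    this.totalHealthyDatanodes = totalHealthy;
    this.numFinalizedDatanodes = numFinalized;
  }

  public int getTotalHealthyDatanodes() { return totalHealthyDatanodes; }
  public int getNumFinalizedDatanodes() { return numFinalizedDatanodes; }

  // Derived, so it can never drift out of sync with the other two fields.
  public int getNumUnfinalizedDatanodes() {
    return totalHealthyDatanodes - numFinalizedDatanodes;
  }

  // The finalization wait code can poll this instead of comparing counts.
  public boolean allNodesFinalized() {
    return numFinalizedDatanodes == totalHealthyDatanodes;
  }

  public static void main(String[] args) {
    DatanodeFinalizationCounts c = new DatanodeFinalizationCounts(3, 2);
    System.out.println(c.getNumUnfinalizedDatanodes()); // 1
    System.out.println(c.allNodesFinalized());          // false
  }
}
```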

}
}

public static DatanodeFinalizationCounts getNumFinalizedDatanodes(
Contributor

This should be an instance method in NodeManager, not a static method that takes an instance as its first/only argument. DatanodeFinalizationCounts can be moved there as well, or moved to a standalone class.

.setNumDatanodesTotal(10)
.setShouldFinalize(true)
.build();
UpgradeFinalization.Status scmUpgradeStatus = scm.getLayoutVersionManager().getUpgradeState();
Contributor

This upgrade status enum is going to be removed in the new version manager. We will only use the proto facing version for compatibility with the older APIs. This call can be changed to scm.getLayoutVersionManager().needsFinalization() which easily maps to the same call in the new version manager when we switch. From here we can:

  • Remove the isScmFinalized helper method
  • Add a DatanodeFinalizationCounts#allNodesFinalized method as suggested here
  • Inline the shouldFinalize method to just a variable: shouldFinalize = scmFinalized && datanodeFinalizationCounts.allNodesFinalized()
  • Inline buildUpgradeStatus to invoke the builder directly in this method since the labelled builder setters are clearer than the unlabeled method parameters and it is now just a passthrough.

This minimizes the amount of upgrade specific logic in the server side translator.
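A minimal runnable sketch of the suggested inlining; needsFinalization() and allNodesFinalized() are the names proposed in this review, stubbed here as functional interfaces rather than the real Ozone types:

```java
// Hypothetical sketch: shouldFinalize collapses to a single expression once
// the helper methods above are removed.
public class InlinedShouldFinalize {
  interface LayoutVersionManager { boolean needsFinalization(); }
  interface DatanodeFinalizationCounts { boolean allNodesFinalized(); }

  static boolean shouldFinalize(LayoutVersionManager scmLvm,
      DatanodeFinalizationCounts counts) {
    boolean scmFinalized = !scmLvm.needsFinalization();
    return scmFinalized && counts.allNodesFinalized();
  }

  public static void main(String[] args) {
    // SCM finalized, all DNs finalized.
    System.out.println(shouldFinalize(() -> false, () -> true)); // true
    // SCM still needs finalization.
    System.out.println(shouldFinalize(() -> true, () -> true));  // false
  }
}
```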

Comment on lines +791 to +792
nonFinalizedNode.setPersistedOpState(
HddsProtos.NodeOperationalState.DECOMMISSIONED);
Contributor

We should have test coverage for all ineligible health and op states to make the intent clear. It should be fast to enumerate all of them.
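One quick way to enumerate every combination, assuming a hypothetical eligible predicate where only HEALTHY + IN_SERVICE counts toward finalization (a sketch of the test idea, not the PR's test code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enumeration of every (health, op-state) pair; a test can loop
// over these and assert each ineligible pair is excluded from the count.
public class EligibilityMatrix {
  enum Health { HEALTHY, STALE, DEAD }
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
                 ENTERING_MAINTENANCE, IN_MAINTENANCE }

  static boolean eligible(Health h, OpState o) {
    return h == Health.HEALTHY && o == OpState.IN_SERVICE;
  }

  static List<String> ineligiblePairs() {
    List<String> pairs = new ArrayList<>();
    for (Health h : Health.values()) {
      for (OpState o : OpState.values()) {
        if (!eligible(h, o)) {
          pairs.add(h + "/" + o);
        }
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    // 3 health states * 5 op states = 15 pairs, exactly one eligible.
    System.out.println(ineligiblePairs().size()); // 14
  }
}
```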

HddsProtos.ReplicationFactor.THREE).getContainerInfoList().size());
}

@Test
Contributor

With the simplifications mentioned above I don't think we need all these tests, it looks like the AI went overboard here. These aren't actually testing the queryUpgradeStatus method either. Looking at the other methods in this class it looks like we usually only cover them with integration tests. So in this PR I would switch all test usage of queryUpgradeFinalizationProgress to queryUpgradeStatus for test coverage.
