
Conversation

@NickyYe (Contributor) commented Dec 18, 2020

https://issues.apache.org/jira/browse/HDFS-15737

NOTICE

Please create an issue in ASF JIRA before opening a pull request,
and you need to set the title of the pull request which starts with
the corresponding JIRA issue number. (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute

@hadoop-yetus commented:

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 11m 7s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ branch-2.10 Compile Tests _
+1 💚 mvninstall 16m 8s branch-2.10 passed
+1 💚 compile 0m 58s branch-2.10 passed
+1 💚 checkstyle 0m 38s branch-2.10 passed
+1 💚 mvnsite 1m 12s branch-2.10 passed
+1 💚 javadoc 1m 15s branch-2.10 passed
+0 🆗 spotbugs 2m 59s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 2m 56s branch-2.10 passed
_ Patch Compile Tests _
+1 💚 mvninstall 1m 0s the patch passed
+1 💚 compile 0m 54s the patch passed
+1 💚 javac 0m 54s the patch passed
+1 💚 checkstyle 0m 29s the patch passed
+1 💚 mvnsite 1m 2s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 1m 10s the patch passed
+1 💚 findbugs 3m 0s the patch passed
_ Other Tests _
-1 ❌ unit 62m 34s hadoop-hdfs in the patch failed.
+1 💚 asflicense 0m 41s The patch does not generate ASF License warnings.
108m 52s (total)
Reason Tests
Failed junit tests hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys
hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain
hadoop.hdfs.TestRollingUpgrade
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/artifact/out/Dockerfile
GITHUB PR #2562
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
uname Linux 8cd430b18ff7 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-2.10 / 2c45e93
Default Java Oracle Corporation-1.7.0_95-b00
unit https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/testReport/
Max. process+thread count 2255 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/console
versions git=2.7.4 maven=3.3.9 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@sodonnel (Contributor) commented:

It looks like this same logic also exists in trunk - could you submit a trunk PR / patch and then we can backport the change across all active branches?

I am also a little confused about this problem. The map outOfServiceNodeBlocks is modified in a few places in the middle of the cyclic iteration. If it threw a ConcurrentModificationException on modification, I would expect us to be seeing this a lot, probably any time more than one node is added to decommission / maintenance.

E.g., from trunk's DatanodeAdminDefaultMonitor.java, where it is a cyclic iterator over outOfServiceNodeBlocks:

    while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
        .isRunning()) {
      numNodesChecked++;
      final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
          entry = it.next();
      final DatanodeDescriptor dn = entry.getKey();
      try {
        AbstractList<BlockInfo> blocks = entry.getValue();
        boolean fullScan = false;
        if (dn.isMaintenance() && dn.maintenanceExpired()) {
          // If maintenance expires, stop tracking it.
          dnAdmin.stopMaintenance(dn);
          toRemove.add(dn);
          continue;
        }
        if (dn.isInMaintenance()) {
          // The dn is IN_MAINTENANCE and the maintenance hasn't expired yet.
          continue;
        }
        if (blocks == null) {
          // This is a newly added datanode, run through its list to schedule
          // under-replicated blocks for replication and collect the blocks
          // that are insufficiently replicated for further tracking
          LOG.debug("Newly-added node {}, doing full scan to find " +
              "insufficiently-replicated blocks.", dn);
          blocks = handleInsufficientlyStored(dn);
          outOfServiceNodeBlocks.put(dn, blocks);  // **** Modifies outOfServiceNodeBlocks
         ...

Note that outOfServiceNodeBlocks is modified on the first pass, and so it.next() should throw an exception on the next iteration.

Have you seen the ConcurrentModificationException logged due to this problem?

@NickyYe (Contributor, Author) commented Dec 18, 2020

> ConcurrentModificationException

Thanks for looking. If there are only 2 datanodes in outOfServiceNodeBlocks and the first one is removed, the iteration becomes an infinite loop on the second datanode. If there are more than 2 datanodes and the first one is removed, a ConcurrentModificationException is thrown. I see both cases in our prod very often.

This issue only happens on removal (dnAdmin.stopMaintenance(dn);). outOfServiceNodeBlocks.put(dn, blocks) only updates the value for an existing key, so the cyclic iteration is not affected.
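
As a standalone illustration of that distinction (a plain-JDK sketch, not the Hadoop code itself; FailFastDemo and the String keys are made-up stand-ins): the cyclic iteration walks the map's entry set with the JDK's fail-fast iterators, so put() on a key that is already present only replaces the value, while remove() is a structural modification and makes the next it.next() throw ConcurrentModificationException. The dead loop with exactly two nodes comes from the CyclicIteration wrap-around handling and is not reproduced here.

    import java.util.ConcurrentModificationException;
    import java.util.Map;
    import java.util.TreeMap;

    public class FailFastDemo {
      public static void main(String[] args) {
        // Stand-in for outOfServiceNodeBlocks (DatanodeDescriptor -> block list).
        TreeMap<String, String> nodes = new TreeMap<>();
        nodes.put("dn1", null);
        nodes.put("dn2", null);
        nodes.put("dn3", null);

        // put() on a key that is already present only replaces the value;
        // it is not a structural modification, so the loop finishes normally.
        for (Map.Entry<String, String> e : nodes.entrySet()) {
          nodes.put(e.getKey(), "scanned");
        }
        System.out.println("put on existing key: no exception");

        // remove() is a structural modification; the fail-fast iterator
        // throws on its next call to next().
        try {
          for (Map.Entry<String, String> e : nodes.entrySet()) {
            if (e.getKey().equals("dn1")) {
              nodes.remove("dn1");   // analogous to stopMaintenance() removing the node
            }
          }
        } catch (ConcurrentModificationException ex) {
          System.out.println("remove during iteration: " + ex);
        }
      }
    }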

@sodonnel (Contributor) commented:

Thanks for the information. This may explain why HDFS-12703 was needed: some exceptions, which were not logged at that time, caused the decommission thread to stop running until the NN was restarted. The change there was to catch the exception.

The change here looks correct to me, but as the issue exists on the trunk branch, we should fix it there first, and then backport to 3.3, 3.2, 3.1 and 2.10 so the fix is in place across all branches.

@NickyYe (Contributor, Author) commented Dec 18, 2020

> Thanks for the information - this may explain why HDFS-12703 was needed, as some exceptions which were not logged at that time, caused the decommission thread to stop running until the NN was restarted. The change there was to catch the exception.
>
> The change here looks correct to me, but as the issue exists on the trunk branch, we should fix it there first, and then backport to 3.3, 3.2, 3.1 and 2.10 so the fix is in place across all branches.

Due to HDFS-14854, the fix on trunk could look quite different, since it doesn't make sense to change the new stopTrackingNode interface by adding a boolean parameter that DatanodeAdminBackoffMonitor doesn't need.

It looks like a better fix would be to introduce cancelledNodes in DatanodeAdminDefaultMonitor, just like in DatanodeAdminBackoffMonitor. Then, in stopTrackingNode, don't remove the dn from outOfServiceNodeBlocks directly; add it to cancelledNodes for later processing instead (see the sketch below).

However, the change would be a little bit bigger.
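
A minimal sketch of that suggestion, assuming placeholder types and illustrative names (String stands in for DatanodeDescriptor, and DefaultMonitorSketch / processCancelledNodes are made up here, not the actual trunk code): stopTrackingNode only records the cancellation, and the map is modified only at a safe point in the monitor's run loop, outside any live iteration over outOfServiceNodeBlocks.

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.TreeMap;

    // Hypothetical, simplified sketch of the suggested cancelledNodes approach.
    class DefaultMonitorSketch {
      private final TreeMap<String, Object> outOfServiceNodeBlocks = new TreeMap<>();
      private final Queue<String> cancelledNodes = new ArrayDeque<>();

      // Called by the admin manager. It no longer touches the map directly,
      // so a scan that is in progress never sees a structural modification.
      // (Real code would need the monitor's existing locking around this.)
      void stopTrackingNode(String dn) {
        cancelledNodes.add(dn);
      }

      // Called from the monitor's run() before or after the cyclic scan,
      // i.e. while no iterator over outOfServiceNodeBlocks is open.
      void processCancelledNodes() {
        while (!cancelledNodes.isEmpty()) {
          outOfServiceNodeBlocks.remove(cancelledNodes.poll());
        }
      }
    }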

@github-actions bot commented Dec 7, 2025

We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open it and ask for a committer to remove the stale tag and review again.
Thanks all for your contribution.

github-actions bot added the Stale label on Dec 7, 2025
github-actions bot closed this on Dec 8, 2025