
Conversation

@NickyYe (Contributor) commented Dec 18, 2020

https://issues.apache.org/jira/browse/HDFS-15737

NOTICE

Please create an issue in ASF JIRA before opening a pull request,
and you need to set the title of the pull request which starts with
the corresponding JIRA issue number. (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute

@hadoop-yetus commented:

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 11m 7s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ branch-2.10 Compile Tests _
+1 💚 mvninstall 16m 8s branch-2.10 passed
+1 💚 compile 0m 58s branch-2.10 passed
+1 💚 checkstyle 0m 38s branch-2.10 passed
+1 💚 mvnsite 1m 12s branch-2.10 passed
+1 💚 javadoc 1m 15s branch-2.10 passed
+0 🆗 spotbugs 2m 59s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 2m 56s branch-2.10 passed
_ Patch Compile Tests _
+1 💚 mvninstall 1m 0s the patch passed
+1 💚 compile 0m 54s the patch passed
+1 💚 javac 0m 54s the patch passed
+1 💚 checkstyle 0m 29s the patch passed
+1 💚 mvnsite 1m 2s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 1m 10s the patch passed
+1 💚 findbugs 3m 0s the patch passed
_ Other Tests _
-1 ❌ unit 62m 34s hadoop-hdfs in the patch failed.
+1 💚 asflicense 0m 41s The patch does not generate ASF License warnings.
108m 52s (total)
Reason Tests
Failed junit tests hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys
hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain
hadoop.hdfs.TestRollingUpgrade
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/artifact/out/Dockerfile
GITHUB PR #2562
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
uname Linux 8cd430b18ff7 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-2.10 / 2c45e93
Default Java Oracle Corporation-1.7.0_95-b00
unit https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/testReport/
Max. process+thread count 2255 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2562/1/console
versions git=2.7.4 maven=3.3.9 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@sodonnel (Contributor) commented:

It looks like this same logic also exists in trunk - could you submit a trunk PR / patch and then we can backport the change across all active branches?

I am also a little confused about this problem. The map outOfServiceNodeBlocks is modified in a few places in the middle of the cyclic iteration. If it threw a ConcurrentModificationException on modification, I would expect us to be seeing this a lot, probably any time more than one node is added to decommission / maintenance.

E.g., from trunk's DatanodeAdminDefaultMonitor.java, where it is a cyclic iterator over outOfServiceNodeBlocks:

    while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
        .isRunning()) {
      numNodesChecked++;
      final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
          entry = it.next();
      final DatanodeDescriptor dn = entry.getKey();
      try {
        AbstractList<BlockInfo> blocks = entry.getValue();
        boolean fullScan = false;
        if (dn.isMaintenance() && dn.maintenanceExpired()) {
          // If maintenance expires, stop tracking it.
          dnAdmin.stopMaintenance(dn);
          toRemove.add(dn);
          continue;
        }
        if (dn.isInMaintenance()) {
          // The dn is IN_MAINTENANCE and the maintenance hasn't expired yet.
          continue;
        }
        if (blocks == null) {
          // This is a newly added datanode, run through its list to schedule
          // under-replicated blocks for replication and collect the blocks
          // that are insufficiently replicated for further tracking
          LOG.debug("Newly-added node {}, doing full scan to find " +
              "insufficiently-replicated blocks.", dn);
          blocks = handleInsufficientlyStored(dn);
          outOfServiceNodeBlocks.put(dn, blocks);  // **** Modifies outOfServiceNodeBlocks
         ...

Note that outOfServiceNodeBlocks is modified on the first pass, and so it.next() should throw an exception on the next iteration.

Have you seen the ConcurrentModificationException logged due to this problem?

@NickyYe (Contributor, Author) commented Dec 18, 2020

> ConcurrentModificationException

Thanks for looking. If there are only 2 datanodes in outOfServiceNodeBlocks and the first one is removed, the iteration becomes an infinite loop on the second datanode. If there are more than 2 datanodes and the first one is removed, a ConcurrentModificationException is thrown. I see both cases in our prod very often.

This issue only happens on removal (dnAdmin.stopMaintenance(dn);). outOfServiceNodeBlocks.put(dn, blocks) only updates the value for an existing key, so the cyclic iteration is not affected.
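
As a standalone illustration of that distinction (a plain-JDK sketch, not the Hadoop code itself; FailFastDemo and the String keys are made-up stand-ins): the cyclic iteration walks the map's entry set with the JDK's fail-fast iterators, so put() on a key that is already present only replaces the value, while remove() is a structural modification and makes the next it.next() throw ConcurrentModificationException. The dead loop with exactly two nodes comes from the CyclicIteration wrap-around handling and is not reproduced here.

    import java.util.ConcurrentModificationException;
    import java.util.Map;
    import java.util.TreeMap;

    public class FailFastDemo {
      public static void main(String[] args) {
        // Stand-in for outOfServiceNodeBlocks (DatanodeDescriptor -> block list).
        TreeMap<String, String> nodes = new TreeMap<>();
        nodes.put("dn1", null);
        nodes.put("dn2", null);
        nodes.put("dn3", null);

        // put() on a key that is already present only replaces the value;
        // it is not a structural modification, so the loop finishes normally.
        for (Map.Entry<String, String> e : nodes.entrySet()) {
          nodes.put(e.getKey(), "scanned");
        }
        System.out.println("put on existing key: no exception");

        // remove() is a structural modification; the fail-fast iterator
        // throws on its next call to next().
        try {
          for (Map.Entry<String, String> e : nodes.entrySet()) {
            if (e.getKey().equals("dn1")) {
              nodes.remove("dn1");   // analogous to stopMaintenance() removing the node
            }
          }
        } catch (ConcurrentModificationException ex) {
          System.out.println("remove during iteration: " + ex);
        }
      }
    }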

@sodonnel (Contributor) commented:

Thanks for the information. This may explain why HDFS-12703 was needed: some exceptions, which were not logged at that time, caused the decommission thread to stop running until the NN was restarted. The change there was to catch the exception.

The change here looks correct to me, but as the issue exists on the trunk branch, we should fix it there first, and then backport to 3.3, 3.2, 3.1 and 2.10 so the fix is in place across all branches.

@NickyYe (Contributor, Author) commented Dec 18, 2020

> Thanks for the information - this may explain why HDFS-12703 was needed, as some exceptions which were not logged at that time, caused the decommission thread to stop running until the NN was restarted. The change there was to catch the exception.
>
> The change here looks correct to me, but as the issue exists on the trunk branch, we should fix it there first, and then backport to 3.3, 3.2, 3.1 and 2.10 so the fix is in place across all branches.

Due to HDFS-14854, the fix on trunk could look quite different, since it doesn't make sense to change the new stopTrackingNode interface by adding a boolean parameter that DatanodeAdminBackoffMonitor doesn't need.

It looks like a better fix would be to introduce cancelledNodes in DatanodeAdminDefaultMonitor, just like in DatanodeAdminBackoffMonitor. Then, in stopTrackingNode, don't remove the dn from outOfServiceNodeBlocks directly; add it to cancelledNodes for later processing instead (see the sketch below).

However, the change would be a little bit bigger.
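
A minimal sketch of that suggestion, assuming placeholder types and illustrative names (String stands in for DatanodeDescriptor, and DefaultMonitorSketch / processCancelledNodes are made up here, not the actual trunk code): stopTrackingNode only records the cancellation, and the map is modified only at a safe point in the monitor's run loop, outside any live iteration over outOfServiceNodeBlocks.

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.TreeMap;

    // Hypothetical, simplified sketch of the suggested cancelledNodes approach.
    class DefaultMonitorSketch {
      private final TreeMap<String, Object> outOfServiceNodeBlocks = new TreeMap<>();
      private final Queue<String> cancelledNodes = new ArrayDeque<>();

      // Called by the admin manager. It no longer touches the map directly,
      // so a scan that is in progress never sees a structural modification.
      // (Real code would need the monitor's existing locking around this.)
      void stopTrackingNode(String dn) {
        cancelledNodes.add(dn);
      }

      // Called from the monitor's run() before or after the cyclic scan,
      // i.e. while no iterator over outOfServiceNodeBlocks is open.
      void processCancelledNodes() {
        while (!cancelledNodes.isEmpty()) {
          outOfServiceNodeBlocks.remove(cancelledNodes.poll());
        }
      }
    }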

@github-actions bot commented Dec 7, 2025

We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open it and ask for a committer to remove the stale tag and review again.
Thanks all for your contribution.

github-actions bot added the Stale label on Dec 7, 2025
github-actions bot closed this on Dec 8, 2025