Conversation

@tomscut
Contributor

@tomscut tomscut commented Jan 25, 2022

JIRA: HDFS-16438.

At decommission time, if DatanodeAdminBackoffMonitor is used, there is a heavy operation: scanDatanodeStorage. If the number of blocks on a storage is large (more than 500,000) and GC performance is also poor, it may hold the read lock for a long time; we should optimize it.


2021-12-22 07:49:01,279 INFO  namenode.FSNamesystem (FSNamesystemLock.java:readUnlock(220)) - FSNamesystem scanDatanodeStorage read lock held for 5491 ms via
java.lang.Thread.getStackTrace(Thread.java:1552)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:222)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1641)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.scanDatanodeStorage(DatanodeAdminBackoffMonitor.java:646)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.checkForCompletedNodes(DatanodeAdminBackoffMonitor.java:417)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.check(DatanodeAdminBackoffMonitor.java:300)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.run(DatanodeAdminBackoffMonitor.java:201)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
    Number of suppressed read-lock reports: 0
    Longest read-lock held interval: 5491 

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 58s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 35m 40s trunk passed
+1 💚 compile 1m 29s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 compile 1m 23s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 0m 58s trunk passed
+1 💚 mvnsite 1m 30s trunk passed
+1 💚 javadoc 1m 1s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 31s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 21s trunk passed
+1 💚 shadedclient 25m 49s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 20s the patch passed
+1 💚 compile 1m 22s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javac 1m 22s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 15s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 52s the patch passed
+1 💚 mvnsite 1m 21s the patch passed
+1 💚 javadoc 0m 51s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 25s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 26s the patch passed
+1 💚 shadedclient 25m 40s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 339m 49s hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 41s The patch does not generate ASF License warnings.
449m 3s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3928/1/artifact/out/Dockerfile
GITHUB PR #3928
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 3a6640569fde 4.15.0-163-generic #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / d8acbfe
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3928/1/testReport/
Max. process+thread count 2217 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3928/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@tomscut
Contributor Author

tomscut commented Jan 26, 2022

Next: fix the transitive dependency of the hbase2 profile; update reload4j to 1.2.18.3 (latest)

Hi @jojochuang, could you confirm whether this comment is related to this issue? Thanks.

@jojochuang
Contributor

Sorry. My bad. Posted to the wrong PR.

@sodonnel
Contributor

I am not sure it is safe to drop the read lock while iterating over the blocks in a StorageInfo, as the blocks could change while the lock is dropped.

Looking at the code in DatanodeStorageInfo, we can add a block to the list, which always adds at the head of the list. The worst case here is that we miss some block, but it would be caught by the rescan at the end, so that is not a big problem.

However, we can also remove a block. Let's say we stop iterating at blockX and drop the lock.

Then an IBR comes in and gets processed, removing that block from the storage.

The iterator will be holding a reference to that blockInfo object as its current element, and it is expected to call next on it to get the next blockInfo.

The remove block code in BlockInfo.java looks like:

BlockInfo listRemove(BlockInfo head, DatanodeStorageInfo storage) {
    if (head == null) {
      return null;
    }
    int dnIndex = this.findStorageInfo(storage);
    if (dnIndex < 0) { // this block is not on the data-node list
      return head;
    }

    BlockInfo next = this.getNext(dnIndex);
    BlockInfo prev = this.getPrevious(dnIndex);
    this.setNext(dnIndex, null);
    this.setPrevious(dnIndex, null);
    if (prev != null) {
      prev.setNext(prev.findStorageInfo(storage), next);
    }
    if (next != null) {
      next.setPrevious(next.findStorageInfo(storage), prev);
    }
    if (this == head) { // removing the head
      head = next;
    }
    return head;
  }

It basically unlinks itself from the linked list, setting its own prev and next to null. Then the iterator calls getNext on it:

  BlockInfo getNext(int index) {
    assert this.triplets != null : "BlockInfo is not initialized";
    assert index >= 0 && index * 3 + 2 < triplets.length : "Index is out of bound";
    BlockInfo info = (BlockInfo)triplets[index * 3 + 2];
    assert info == null || info.getClass().getName().startsWith(
        BlockInfo.class.getName()) :
        "BlockInfo is expected at " + (index * 3 + 2);
    return info;
  }

In the case of the current blockInfo having a null next, the iterator will get null, think it has reached the end of the list, and exit early.

This would be a rare bug to hit, as you would need to drop the lock and then have the very next block to consume removed from the list, but it is one of those things that will happen from time to time and be very hard to explain.
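As a toy illustration of the scenario above (a hypothetical Node class standing in for BlockInfo's triplets-based per-storage list, not the real code), the early exit looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class StaleIteratorDemo {
    // Hypothetical stand-in for a BlockInfo entry in a storage's linked list.
    static class Node {
        final int id;
        Node next;
        Node(int id) { this.id = id; }
    }

    // Walks the list, but simulates an IBR unlinking the node the iterator
    // currently holds while the read lock is dropped.
    static List<Integer> scanWithConcurrentRemoval() {
        // Build the list 1 -> 2 -> 3.
        Node n1 = new Node(1), n2 = new Node(2), n3 = new Node(3);
        n1.next = n2;
        n2.next = n3;

        List<Integer> seen = new ArrayList<>();
        Node current = n1;
        seen.add(current.id);
        current = current.next; // iterator now holds n2

        // "Lock dropped" here: n2 is removed, unlinking itself the way
        // listRemove() does -- neighbours are relinked and n2's next is nulled.
        n1.next = n3;
        n2.next = null;

        // Resume from the stale reference: after visiting n2, its next is
        // null, so the walk believes the list has ended and never sees n3.
        while (current != null) {
            seen.add(current.id);
            current = current.next;
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(scanWithConcurrentRemoval()); // [1, 2]
    }
}
```

The scan visits blocks 1 and 2 but silently skips block 3, which is exactly the kind of miss that would be very hard to explain after the fact.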

5M blocks on a single storage is a lot - dropping and re-locking at the storage level was supposed to handle DNs with millions of blocks, with each storage holding only a small share of them.

Does this pause occur once per storage at the start and end of decommission for each node?

@tomscut
Contributor Author

tomscut commented Jan 27, 2022

Thanks @sodonnel for your comments and detailed explanations.

I'm sorry, I gave the wrong number earlier; it is not 5M. We have more than 10 million blocks per node and 20 disks, so more than 500,000 blocks per storage.

Your explanation makes sense to me. While scanDatanodeStorage is running, listRemove can happen concurrently, so there can indeed be an inconsistency.

Can we copy the blocks of each storage into a list first? This is a lightweight operation. Then scan the list, releasing the lock periodically during the scan to reduce the lock holding time, and clear the list when the scan is done. Do you think this is feasible?

@sodonnel
Contributor

This is the scan logic:

    for (DatanodeStorageInfo s : storage) {
      namesystem.readLock();
      try {
        // As the lock is dropped and re-taken between each storage, we need
        // to check the storage is still present before processing it, as it
        // may have been removed.
        if (dn.getStorageInfo(s.getStorageID()) == null) {
          continue;
        }
        Iterator<BlockInfo> it = s.getBlockIterator();
        while (it.hasNext()) {
          BlockInfo b = it.next();
          if (!initialScan || dn.isEnteringMaintenance()) {
            // this is a rescan, so most blocks should be replicated now,
            // or this node is going into maintenance. On a healthy
            // cluster using racks or upgrade domain, a node should be
            // able to go into maintenance without replicating many blocks
            // so we will check them immediately.
            if (!isBlockReplicatedOk(dn, b, false, null)) {
              blockList.put(b, null);
            }
          } else {
            blockList.put(b, null);
          }
          numBlocksChecked++;
        }
      } finally {
        namesystem.readUnlock();
      }
    }

The rescan would be more expensive, as it needs to check whether each block is correctly replicated, and the same applies when you are putting a node into maintenance. The rescan happens at the end. The initial scan basically puts all the entries into a list, so it should already be lightweight.

Do you know if the pause is happening at the end of decommission during the re-scan? If so, we should probably refactor it to not do the isBlockReplicatedOk() check under the lock.

@tomscut
Contributor Author

tomscut commented Jan 27, 2022

Hi @sodonnel , thanks for your comments.

This is the stack:

2021-12-21 16:56:21,203 INFO  namenode.FSNamesystem (FSNamesystemLock.java:readUnlock(220)) - FSNamesystem scanDatanodeStorage read lock held for 16013 ms via
java.lang.Thread.getStackTrace(Thread.java:1552)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:222)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1641)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.scanDatanodeStorage(DatanodeAdminBackoffMonitor.java:646)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.checkForCompletedNodes(DatanodeAdminBackoffMonitor.java:417)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.check(DatanodeAdminBackoffMonitor.java:300)
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor.run(DatanodeAdminBackoffMonitor.java:201)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
	Number of suppressed read-lock reports: 0
	Longest read-lock held interval: 16013

I think this is at the end of decommission, because we can see checkForCompletedNodes, initialScan must be false.

@sodonnel
Contributor

Yeah, it looks like it is all the isBlockReplicatedOk calls that are making it slow. What about a change like this, to move the replicated check outside of the lock:

--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java
@@ -652,24 +652,27 @@ private void scanDatanodeStorage(DatanodeDescriptor dn,
         Iterator<BlockInfo> it = s.getBlockIterator();
         while (it.hasNext()) {
           BlockInfo b = it.next();
-          if (!initialScan || dn.isEnteringMaintenance()) {
-            // this is a rescan, so most blocks should be replicated now,
-            // or this node is going into maintenance. On a healthy
-            // cluster using racks or upgrade domain, a node should be
-            // able to go into maintenance without replicating many blocks
-            // so we will check them immediately.
-            if (!isBlockReplicatedOk(dn, b, false, null)) {
-              blockList.put(b, null);
-            }
-          } else {
-            blockList.put(b, null);
-          }
+          blockList.put(b, null);
           numBlocksChecked++;
         }
       } finally {
         namesystem.readUnlock();
       }
     }
+    if (!initialScan || dn.isEnteringMaintenance()) {
+      // this is a rescan, so most blocks should be replicated now,
+      // or this node is going into maintenance. On a healthy
+      // cluster using racks or upgrade domain, a node should be
+      // able to go into maintenance without replicating many blocks
+      // so we will check them immediately.
+      Iterator<BlockInfo> iterator = blockList.keySet().iterator();
+      while(iterator.hasNext()) {
+        BlockInfo b = iterator.next();
+        if (!isBlockReplicatedOk(dn, b, false, null)) {
+          iterator.remove();
+        }
+      }
+    }
   }

if (Time.monotonicNow() - beginTime > scanDatanodeStorageLockTimeMs) {
namesystem.readUnlock();
try {
Thread.sleep(1);
Contributor


Is 1ms sleep time enough to remove lock starvation or we should add bit more?

Contributor Author


Is 1ms sleep time enough to remove lock starvation or we should add bit more?

Thanks @virajjasani for your review, I am trying a new implementation now, this code may be replaced.
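For reference, the unlock/sleep/relock pattern in the snippet above can be sketched stand-alone. This is a minimal, hypothetical sketch using a plain ReentrantReadWriteLock in place of FSNamesystemLock; the method name and parameters are illustrative, not the real API:

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ChunkedScan {
    // Scans items under the read lock, briefly yielding the lock whenever it
    // has been held longer than maxHeldMs, so queued writers can get in.
    public static int scan(List<Integer> items, ReentrantReadWriteLock lock,
                           long maxHeldMs) {
        int checked = 0;
        long lockedSince = System.currentTimeMillis();
        lock.readLock().lock();
        try {
            for (Integer item : items) {
                checked++; // stand-in for the per-block work
                if (System.currentTimeMillis() - lockedSince > maxHeldMs) {
                    lock.readLock().unlock();
                    try {
                        Thread.sleep(1); // give a waiting writer a chance
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        lock.readLock().lock();
                        lockedSince = System.currentTimeMillis();
                    }
                }
            }
        } finally {
            lock.readLock().unlock();
        }
        return checked;
    }

    public static void main(String[] args) {
        List<Integer> items = IntStream.range(0, 1000)
            .boxed().collect(Collectors.toList());
        System.out.println(scan(items, new ReentrantReadWriteLock(), 5)); // 1000
    }
}
```

Whether 1 ms is enough to avoid writer starvation depends on the lock's fairness policy and the writer queue depth; and as discussed earlier in this thread, the pattern is only correct if the structure being scanned cannot change while the lock is released (here, a snapshot list).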

@tomscut
Contributor Author

tomscut commented Jan 28, 2022

+        if (!isBlockReplicatedOk(dn, b, false, null)) {

Hi @sodonnel, I just want to make sure: don't we have to invert the condition here? Please correct me if I misunderstand.

while (iterator.hasNext()) {
  BlockInfo b = iterator.next();
  if (isBlockReplicatedOk(dn, b, false, null)) {
    iterator.remove();
  }
}

@tomscut
Contributor Author

tomscut commented Jan 28, 2022

Thanks @sodonnel for your comments and suggestions, which make sense to me.

Moving the replicated check outside of the lock is a good idea, I'll update the code and do some testing.

@sodonnel
Contributor

+        if (!isBlockReplicatedOk(dn, b, false, null)) {

Hi @sodonnel, I just want to make sure: don't we have to invert the condition here? Please correct me if I misunderstand.

while (iterator.hasNext()) {
  BlockInfo b = iterator.next();
  if (isBlockReplicatedOk(dn, b, false, null)) {
    iterator.remove();
  }
}

Yes, you are correct; my suggested code has it wrong. We need to remove a block from the list if it is replicated OK, whereas the original code only added it if it was not replicated OK.
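To make the equivalence concrete, the two directions select the same pending set. This is a toy sketch with a stand-in predicate, not the real isBlockReplicatedOk:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class FilterEquivalence {
    // Original behaviour (under the lock): add only the blocks that are NOT ok.
    static Set<Integer> addIfNotOk(List<Integer> blocks, Predicate<Integer> ok) {
        Set<Integer> pending = new LinkedHashSet<>();
        for (Integer b : blocks) {
            if (!ok.test(b)) {
                pending.add(b);
            }
        }
        return pending;
    }

    // Refactored behaviour (outside the lock): add everything first, then
    // remove the blocks that ARE ok -- note the non-inverted test.
    static Set<Integer> removeIfOk(List<Integer> blocks, Predicate<Integer> ok) {
        Set<Integer> pending = new LinkedHashSet<>(blocks);
        pending.removeIf(ok);
        return pending;
    }

    public static void main(String[] args) {
        List<Integer> blocks = Arrays.asList(1, 2, 3, 4, 5);
        Predicate<Integer> ok = b -> b % 2 == 0; // pretend even blocks are fine
        System.out.println(addIfNotOk(blocks, ok)); // [1, 3, 5]
        System.out.println(removeIfOk(blocks, ok)); // [1, 3, 5]
    }
}
```

Both passes end with the same pending set, which is why flipping the condition in the refactored loop restores the original semantics.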

@tomscut
Contributor Author

tomscut commented Jan 28, 2022

+        if (!isBlockReplicatedOk(dn, b, false, null)) {

Hi @sodonnel, I just want to make sure: don't we have to invert the condition here? Please correct me if I misunderstand.

while (iterator.hasNext()) {
  BlockInfo b = iterator.next();
  if (isBlockReplicatedOk(dn, b, false, null)) {
    iterator.remove();
  }
}

Yes, you are correct; my suggested code has it wrong. We need to remove a block from the list if it is replicated OK, whereas the original code only added it if it was not replicated OK.

Thanks @sodonnel for your quick reply. I'll update the code and do some related tests later.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 1s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 42m 46s trunk passed
+1 💚 compile 2m 8s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 compile 1m 45s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 15s trunk passed
+1 💚 mvnsite 1m 47s trunk passed
+1 💚 javadoc 1m 11s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 45s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 31s trunk passed
+1 💚 shadedclient 30m 50s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 45s the patch passed
+1 💚 compile 1m 55s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javac 1m 55s the patch passed
+1 💚 compile 1m 40s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 40s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 10s the patch passed
+1 💚 mvnsite 1m 49s the patch passed
+1 💚 javadoc 1m 12s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 55s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 46s the patch passed
+1 💚 shadedclient 31m 32s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 412m 22s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 40s The patch does not generate ASF License warnings.
546m 32s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3928/2/artifact/out/Dockerfile
GITHUB PR #3928
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 49ce7415c3a6 4.15.0-163-generic #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 95535ac
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3928/2/testReport/
Max. process+thread count 2019 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3928/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@github-actions
Contributor

We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open it and ask for a committer to remove the stale tag and review again.
Thanks all for your contribution.

@github-actions github-actions bot added the Stale label Nov 18, 2025
@github-actions github-actions bot closed this Nov 19, 2025