HDFS-17906. Fix issue that DataNodes get stuck in infinite loop when meet InvalidBlockReportLeaseException. #8416

Merged: Hexiaoqiao merged 3 commits into apache:trunk from cxzl25:HDFS-17906 on Apr 15, 2026
@cxzl25
Contributor

@cxzl25 cxzl25 commented Apr 10, 2026

Description of PR

#5460 (comment)

HDFS-16942 introduced InvalidBlockReportLeaseException, which the NameNode now throws back to the DataNode via RPC when a block report is rejected due to an invalid lease. On a DataNode that also includes HDFS-16942, the exception is caught and fullBlockReportLeaseId is reset to 0, allowing the DN to request a new lease on the next heartbeat and retry.

However, during a rolling upgrade where the NameNode has been upgraded (with HDFS-16942) but DataNodes are still running an older version (without HDFS-16942), the old DataNode code does not have the InvalidBlockReportLeaseException handling branch in BPServiceActor.offerService(). This causes the DN to enter an infinite failure loop where it can never successfully send a full block report.

Root Cause

In BPServiceActor.offerService(), the logic works as follows:

  1. The DN requests a block report lease during heartbeat only when fullBlockReportLeaseId == 0:

         boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
                 scheduler.isBlockReportDue(startTime);

  2. After sending the block report, fullBlockReportLeaseId is reset to 0:

         if ((fullBlockReportLeaseId != 0) || forceFullBr) {
             cmds = blockReport(fullBlockReportLeaseId);
             fullBlockReportLeaseId = 0;  // not reached if blockReport() throws
         }

  3. When the upgraded NN throws InvalidBlockReportLeaseException, blockReport() propagates the exception. The fullBlockReportLeaseId = 0 line after the call is never executed.

  4. The exception is caught by the generic RemoteException catch block. The old DN code does not recognize InvalidBlockReportLeaseException, so fullBlockReportLeaseId remains set to the stale invalid value.

  5. On the next heartbeat iteration, because fullBlockReportLeaseId != 0, requestBlockReportLease is false, so the DN does not request a new lease. It then attempts blockReport() again with the same stale lease, which the NN rejects again. This repeats indefinitely.
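
The steps above can be reproduced with a small stand-alone model. This is only a sketch, not the real BPServiceActor: ToyNameNode, InvalidLeaseException, rejectedReports and the resetLeaseOnInvalid switch are hypothetical names introduced for illustration, and the scheduler.isBlockReportDue() check is left out so that a report is attempted on every iteration.

```java
/** Toy model of the DataNode full-block-report loop (illustrative only). */
final class LeaseLoopSketch {

  /** Hypothetical stand-in for the exception the NameNode sends back over RPC. */
  static final class InvalidLeaseException extends Exception {
    InvalidLeaseException(long leaseId) {
      super("invalid block report lease 0x" + Long.toHexString(leaseId));
    }
  }

  /** Hypothetical NameNode that accepts only the lease it issued most recently. */
  static final class ToyNameNode {
    private long currentLease = 0;
    private long nextLease = 100;

    long issueLease() {
      currentLease = nextLease++;
      return currentLease;
    }

    void expireLease() {
      currentLease = 0;
    }

    void blockReport(long leaseId) throws InvalidLeaseException {
      if (leaseId == 0 || leaseId != currentLease) {
        throw new InvalidLeaseException(leaseId);
      }
      currentLease = 0; // a lease is consumed by one successful report
    }
  }

  /**
   * Runs the given number of heartbeat iterations, starting with a lease that
   * the NameNode has already expired, and returns how many reports were rejected.
   * resetLeaseOnInvalid=false models a DataNode without the HDFS-16942 handling.
   */
  static int rejectedReports(boolean resetLeaseOnInvalid, int iterations) {
    ToyNameNode nn = new ToyNameNode();
    long fullBlockReportLeaseId = nn.issueLease();
    nn.expireLease(); // the lease goes stale before the DataNode uses it
    int rejected = 0;

    for (int i = 0; i < iterations; i++) {
      // Step 1: a new lease is requested only when none is held.
      boolean requestBlockReportLease = (fullBlockReportLeaseId == 0);
      if (requestBlockReportLease) {
        fullBlockReportLeaseId = nn.issueLease();
      }
      if (fullBlockReportLeaseId != 0) {
        try {
          nn.blockReport(fullBlockReportLeaseId);
          fullBlockReportLeaseId = 0; // Step 2: skipped when the call throws
        } catch (InvalidLeaseException e) {
          rejected++;
          if (resetLeaseOnInvalid) {
            fullBlockReportLeaseId = 0; // newer DataNode: drop the stale lease
          }
          // older DataNode: the stale lease is kept and retried (steps 4 and 5)
        }
      }
    }
    return rejected;
  }

  public static void main(String[] args) {
    System.out.println("old DN rejected: " + rejectedReports(false, 5));
    System.out.println("new DN rejected: " + rejectedReports(true, 5));
  }
}
```

In this model the older DataNode is rejected on every iteration, while the newer one is rejected once and then recovers by requesting a fresh lease.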

How was this patch tested?

Added a unit test (TestBlockReportLease).

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 19m 22s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 1s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 53m 27s trunk passed
+1 💚 compile 1m 44s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 1m 48s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 1m 57s trunk passed
+1 💚 mvnsite 1m 55s trunk passed
+1 💚 javadoc 1m 32s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 28s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 4m 29s trunk passed
+1 💚 shadedclient 36m 46s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 23s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 1m 15s the patch passed
+1 💚 compile 1m 20s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 1m 20s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 18s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 311 unchanged - 0 fixed = 312 total (was 311)
+1 💚 mvnsite 1m 27s the patch passed
+1 💚 javadoc 0m 58s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 2s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 4m 2s the patch passed
+1 💚 shadedclient 36m 20s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 255m 37s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 53s The patch does not generate ASF License warnings.
427m 51s
Reason Tests
Failed junit tests hadoop.tools.TestHdfsConfigFields
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8416/1/artifact/out/Dockerfile
GITHUB PR #8416
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 48da9b3f7a3c 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / bf7b676
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8416/1/testReport/
Max. process+thread count 2380 (vs. ulimit of 10000)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8416/1/console
versions git=2.43.0 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.


Copilot AI left a comment

Pull request overview

This PR addresses a rolling-upgrade compatibility issue where older DataNodes can get stuck endlessly retrying full block reports after an upgraded NameNode starts rejecting invalid block report leases.

Changes:

  • Add a NameNode-side configuration (dfs.blockreport.reject.invalid.lease) to optionally avoid throwing InvalidBlockReportLeaseException on invalid/expired leases.
  • Wire the configuration through BlockManager and gate the exception throw in NameNodeRpcServer.blockReport.
  • Add a unit test covering the “do not throw when disabled” behavior.
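
As a rough illustration of the gating described in this list, the sketch below models a NameNode-side switch in plain Java. It does not use the Hadoop Configuration API; the class name, the Map-based configuration, the unchecked exception type and the assumedDefault parameter are all hypothetical. The default value of dfs.blockreport.reject.invalid.lease is not stated in this conversation, so the sketch takes it as an argument instead of guessing. It also assumes that, when rejection is disabled, a report with an invalid lease is skipped and the call returns normally.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative model of gating the invalid-lease exception on a config flag. */
final class InvalidLeaseGateSketch {

  /** Config key named in the review summary. */
  static final String REJECT_INVALID_LEASE_KEY = "dfs.blockreport.reject.invalid.lease";

  /** Hypothetical stand-in; unchecked here only to keep the sketch short. */
  static final class InvalidLeaseException extends RuntimeException {
    InvalidLeaseException() {
      super("block report lease is invalid or expired");
    }
  }

  private final boolean rejectInvalidLease;

  /** assumedDefault is a placeholder: the real default is not given in the source. */
  InvalidLeaseGateSketch(Map<String, String> conf, boolean assumedDefault) {
    String raw = conf.get(REJECT_INVALID_LEASE_KEY);
    this.rejectInvalidLease = (raw == null) ? assumedDefault : Boolean.parseBoolean(raw);
  }

  boolean shouldRejectInvalidBlockReportLease() {
    return rejectInvalidLease;
  }

  /**
   * Returns true if the report would be processed, false if it is skipped.
   * Throws only when the lease is invalid and rejection is enabled.
   */
  boolean blockReport(boolean leaseValid) {
    if (leaseValid) {
      return true;
    }
    if (shouldRejectInvalidBlockReportLease()) {
      throw new InvalidLeaseException();
    }
    return false; // assumed behaviour: skip the report without an error
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(REJECT_INVALID_LEASE_KEY, "false");
    InvalidLeaseGateSketch gate = new InvalidLeaseGateSketch(conf, true);
    System.out.println("processed: " + gate.blockReport(false));
  }
}
```

With rejection disabled, an older DataNode sees a normal return from blockReport() and resets its lease id as before, which is what breaks the retry loop in the model.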

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java | Gates InvalidBlockReportLeaseException on a new BlockManager flag. |
| hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java | Loads/stores the new config and exposes shouldRejectInvalidBlockReportLease(). |
| hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java | Introduces the new config key and default. |
| hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockReportLease.java | Adds coverage for disabling invalid-lease rejection. |

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 33s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 32s trunk passed
+1 💚 compile 1m 45s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 1m 49s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 1m 55s trunk passed
+1 💚 mvnsite 1m 56s trunk passed
+1 💚 javadoc 1m 32s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 30s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 4m 11s trunk passed
+1 💚 shadedclient 31m 5s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 20s the patch passed
+1 💚 compile 1m 15s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 1m 15s the patch passed
+1 💚 compile 1m 17s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 1m 17s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 17s the patch passed
+1 💚 mvnsite 1m 28s the patch passed
+1 💚 javadoc 1m 0s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 0s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 3m 47s the patch passed
+1 💚 shadedclient 30m 11s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 216m 28s hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 46s The patch does not generate ASF License warnings.
345m 45s
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8416/4/artifact/out/Dockerfile
GITHUB PR #8416
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint
uname Linux 7dd5fe072fb7 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / ba9a7a7
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8416/4/testReport/
Max. process+thread count 3475 (vs. ulimit of 10000)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8416/4/console
versions git=2.43.0 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@Hexiaoqiao Hexiaoqiao changed the title from "HDFS-17906. Rolling upgrade: old DataNodes get stuck in infinite invalid block report lease loop" to "HDFS-17906. Fix issue that DataNodes get stuck in infinite loop when meet InvalidBlockReportLeaseException." on Apr 15, 2026
Contributor

@Hexiaoqiao Hexiaoqiao left a comment

LGTM. +1.

@Hexiaoqiao Hexiaoqiao merged commit 0672a00 into apache:trunk Apr 15, 2026
4 checks passed
@Hexiaoqiao
Contributor

Committed to trunk. Thanks @cxzl25 for your work.

@pan3793
Member

pan3793 commented Apr 15, 2026

@Hexiaoqiao, this is a bug fix. Do we need to backport it to branch-3.5 and branch-3.4? HDFS rolling upgrade is quite an important case.

@Hexiaoqiao
Contributor

> @Hexiaoqiao, this is a bug fix, do we need to backport it to branch-3.5 and branch-3.4? HDFS rolling upgrade is quite an important case

Yes. @cxzl25, would you mind submitting separate PRs for branch-3.5 and branch-3.4?

cxzl25 added a commit to cxzl25/hadoop that referenced this pull request Apr 17, 2026
…meet InvalidBlockReportLeaseException. (apache#8416). Contributed by dzcxzl

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
cxzl25 added a commit to cxzl25/hadoop that referenced this pull request Apr 17, 2026
…meet InvalidBlockReportLeaseException. (apache#8416). Contributed by dzcxzl

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
# Conflicts:
#	hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockReportLease.java