HDFS-17906. Fix issue that DataNodes get stuck in infinite loop when meet InvalidBlockReportLeaseException. #8416
💔 -1 overall
This message was automatically generated.
Pull request overview
This PR addresses a rolling-upgrade compatibility issue where older DataNodes can get stuck endlessly retrying full block reports after an upgraded NameNode starts rejecting invalid block report leases.
Changes:
- Add a NameNode-side configuration (`dfs.blockreport.reject.invalid.lease`) to optionally avoid throwing `InvalidBlockReportLeaseException` on invalid/expired leases.
- Wire the configuration through `BlockManager` and gate the exception throw in `NameNodeRpcServer.blockReport`.
- Add a unit test covering the “do not throw when disabled” behavior.
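The gating described in the list above can be sketched roughly as follows. This is a standalone illustration, not the patch itself: `InvalidLeaseGateSketch`, `NameNodeStub`, `InvalidLeaseException`, and `outcome` are hypothetical stand-ins for the Hadoop classes, and the behavior when rejection is disabled (the report is skipped and the RPC returns normally) is an assumption.

```java
import java.io.IOException;

/** Standalone sketch with simplified stand-ins; these are not the real Hadoop classes. */
public class InvalidLeaseGateSketch {

  /** Hypothetical stand-in for InvalidBlockReportLeaseException. */
  static class InvalidLeaseException extends IOException {
    InvalidLeaseException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for the NameNode side of the blockReport RPC. */
  static class NameNodeStub {
    // Mirrors the value of dfs.blockreport.reject.invalid.lease.
    private final boolean rejectInvalidLease;
    private final long validLeaseId;

    NameNodeStub(boolean rejectInvalidLease, long validLeaseId) {
      this.rejectInvalidLease = rejectInvalidLease;
      this.validLeaseId = validLeaseId;
    }

    /** Returns true if the report was processed, false if it was skipped. */
    boolean blockReport(long leaseId) throws InvalidLeaseException {
      if (leaseId == validLeaseId) {
        return true;
      }
      if (rejectInvalidLease) {
        // Upgraded behavior: tell the DataNode explicitly that its lease is invalid.
        throw new InvalidLeaseException("invalid lease 0x" + Long.toHexString(leaseId));
      }
      // Assumption: with rejection disabled, the report is skipped and the RPC returns normally.
      return false;
    }
  }

  /** Classifies what a caller observes for a given flag value and lease ID. */
  public static String outcome(boolean rejectInvalidLease, long leaseId) {
    NameNodeStub nn = new NameNodeStub(rejectInvalidLease, 7L);
    try {
      return nn.blockReport(leaseId) ? "processed" : "skipped";
    } catch (InvalidLeaseException e) {
      return "rejected";
    }
  }

  public static void main(String[] args) {
    System.out.println(outcome(true, 42L));  // rejected
    System.out.println(outcome(false, 42L)); // skipped
    System.out.println(outcome(true, 7L));   // processed
  }
}
```

With the flag disabled, a DataNode that lacks the exception-handling branch sees a normal return, so its existing reset of the lease ID after the call still runs.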
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java` | Gates `InvalidBlockReportLeaseException` on a new `BlockManager` flag. |
| `hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java` | Loads/stores the new config and exposes `shouldRejectInvalidBlockReportLease()`. |
| `hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java` | Introduces the new config key and default. |
| `hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockReportLease.java` | Adds coverage for disabling invalid-lease rejection. |
🎊 +1 overall
This message was automatically generated.
Committed to trunk. Thanks @cxzl25 for your work.
@Hexiaoqiao, this is a bug fix. Do we need to backport it to branch-3.5 and branch-3.4? HDFS rolling upgrade is quite an important case.
Yes. @cxzl25, would you mind submitting additional PRs for branch-3.5 and branch-3.4?
…meet InvalidBlockReportLeaseException. (apache#8416). Contributed by dzcxzl Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
…meet InvalidBlockReportLeaseException. (apache#8416). Contributed by dzcxzl Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org> # Conflicts: # hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockReportLease.java
Description of PR
#5460 (comment)
HDFS-16942 introduced `InvalidBlockReportLeaseException`, which the NameNode now throws back to the DataNode via RPC when a block report is rejected due to an invalid lease. On a DataNode that also includes HDFS-16942, the exception is caught and `fullBlockReportLeaseId` is reset to 0, allowing the DN to request a new lease on the next heartbeat and retry.
However, during a rolling upgrade where the NameNode has been upgraded (with HDFS-16942) but DataNodes are still running an older version (without HDFS-16942), the old DataNode code does not have the `InvalidBlockReportLeaseException` handling branch in `BPServiceActor.offerService()`. This causes the DN to enter an infinite failure loop where it can never successfully send a full block report.
Root Cause
In `BPServiceActor.offerService()`, the logic works as follows:
1. A new block report lease is requested only when `fullBlockReportLeaseId == 0`.
2. After `blockReport()` returns normally, `fullBlockReportLeaseId` is reset to 0.
3. When the upgraded NN throws `InvalidBlockReportLeaseException`, `blockReport()` propagates the exception. The `fullBlockReportLeaseId = 0` line after the call is never executed.
4. The exception is caught by the generic `RemoteException` catch block. The old DN code does not recognize `InvalidBlockReportLeaseException`, so `fullBlockReportLeaseId` remains set to the stale invalid value.
5. On the next heartbeat iteration, because `fullBlockReportLeaseId != 0`, `requestBlockReportLease` is `false`, so the DN does not request a new lease. It then attempts `blockReport()` again with the same stale lease, which the NN rejects again. This repeats indefinitely.
How was this patch tested?
Added a unit test in `TestBlockReportLease` covering the case where invalid-lease rejection is disabled.
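The failure loop described under Root Cause can also be reproduced in a small standalone simulation, which may help when reasoning about what a test has to cover. This is a minimal sketch, not code from the patch or from `BPServiceActor`: `StaleLeaseLoopSketch`, `NameNodeStub`, `InvalidLeaseException`, `heartbeatsUntilReport`, and the lease values are all hypothetical, and the NameNode stub always rejects invalid leases by throwing.

```java
import java.io.IOException;

/** Standalone simulation of the heartbeat loop; these are not the real Hadoop classes. */
public class StaleLeaseLoopSketch {

  /** Hypothetical stand-in for InvalidBlockReportLeaseException. */
  static class InvalidLeaseException extends IOException {
    InvalidLeaseException(String message) {
      super(message);
    }
  }

  /** Hypothetical upgraded NameNode that rejects any lease it did not most recently issue. */
  static class NameNodeStub {
    private long validLeaseId = 0L;
    private long nextLeaseId = 100L;

    long issueLease() {
      validLeaseId = nextLeaseId++;
      return validLeaseId;
    }

    void blockReport(long leaseId) throws InvalidLeaseException {
      if (leaseId == 0L || leaseId != validLeaseId) {
        throw new InvalidLeaseException("invalid lease 0x" + Long.toHexString(leaseId));
      }
    }
  }

  /**
   * Runs up to maxHeartbeats iterations starting from a stale lease.
   * Returns the heartbeat on which the full block report succeeds, or -1 if it never does.
   */
  public static int heartbeatsUntilReport(boolean handlesInvalidLease, int maxHeartbeats) {
    NameNodeStub nn = new NameNodeStub();
    long fullBlockReportLeaseId = 42L; // stale lease the NameNode no longer accepts
    for (int heartbeat = 1; heartbeat <= maxHeartbeats; heartbeat++) {
      // A new lease is requested only when no lease is currently held.
      boolean requestBlockReportLease = (fullBlockReportLeaseId == 0L);
      if (requestBlockReportLease) {
        fullBlockReportLeaseId = nn.issueLease();
      }
      if (fullBlockReportLeaseId != 0L) {
        try {
          nn.blockReport(fullBlockReportLeaseId);
          fullBlockReportLeaseId = 0L; // reached only when the call returns normally
          return heartbeat;
        } catch (InvalidLeaseException e) {
          if (handlesInvalidLease) {
            // DataNode with HDFS-16942: drop the stale lease so the next heartbeat asks again.
            fullBlockReportLeaseId = 0L;
          }
          // DataNode without HDFS-16942: generic handling, stale lease is kept.
        }
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(heartbeatsUntilReport(true, 10));  // 2
    System.out.println(heartbeatsUntilReport(false, 10)); // -1
  }
}
```

In this model the DataNode that resets its lease recovers on the second heartbeat, while the one that keeps the stale lease never reports, no matter how large `maxHeartbeats` is.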