HDFS-17906. (3.5) Fix issue that DataNodes get stuck in infinite loop when meet InvalidBlockReportLeaseException. #8440
Open
cxzl25 wants to merge 1 commit into apache:branch-3.5 from
…meet InvalidBlockReportLeaseException. (apache#8416). Contributed by dzcxzl Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Backport #8416 to branch-3.5
Description of PR
#5460 (comment)
HDFS-16942 introduced InvalidBlockReportLeaseException, which the NameNode now throws back to the DataNode via RPC when a block report is rejected due to an invalid lease. On a DataNode that also includes HDFS-16942, the exception is caught and fullBlockReportLeaseId is reset to 0, allowing the DN to request a new lease on the next heartbeat and retry.

However, during a rolling upgrade where the NameNode has been upgraded (with HDFS-16942) but DataNodes are still running an older version (without HDFS-16942), the old DataNode code does not have the InvalidBlockReportLeaseException handling branch in BPServiceActor.offerService(). This causes the DN to enter an infinite failure loop in which it can never successfully send a full block report.

Root Cause
In BPServiceActor.offerService(), the logic works as follows:
1. A new lease is requested only when fullBlockReportLeaseId == 0.
2. After blockReport() returns successfully, fullBlockReportLeaseId is reset to 0.
3. When the upgraded NN throws InvalidBlockReportLeaseException, blockReport() propagates the exception. The fullBlockReportLeaseId = 0 line after the call is never executed.
4. The exception is caught by the generic RemoteException catch block. The old DN code does not recognize InvalidBlockReportLeaseException, so fullBlockReportLeaseId remains set to the stale invalid value.
5. On the next heartbeat iteration, because fullBlockReportLeaseId != 0, requestBlockReportLease is false, so the DN does not request a new lease. It then attempts blockReport() again with the same stale lease, which the NN rejects again. This repeats indefinitely.

How was this patch tested?
Add test
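For readers who want to see the stale-lease loop from the Root Cause section in isolation, here is a minimal self-contained sketch. It is not the test added by this patch and does not use any Hadoop classes: ToyNameNode, ToyDataNode, InvalidLeaseException, heartbeatIteration and simulate are hypothetical names invented for illustration, and the toy DataNode sends a report on every iteration, which the real scheduler does not. Only the field name fullBlockReportLeaseId and the flag requestBlockReportLease mirror the description above.

```java
import java.io.IOException;

class StaleLeaseLoopSketch {

    /** Hypothetical stand-in for InvalidBlockReportLeaseException. */
    static class InvalidLeaseException extends IOException {
        InvalidLeaseException(String msg) { super(msg); }
    }

    /** Toy NameNode: issues increasing lease ids and accepts only the current one. */
    static class ToyNameNode {
        private long currentLease = 0;
        private long nextLease = 1;

        long requestLease() { currentLease = nextLease++; return currentLease; }

        void expireLease() { currentLease = 0; }

        void blockReport(long leaseId) throws InvalidLeaseException {
            if (leaseId == 0 || leaseId != currentLease) {
                throw new InvalidLeaseException("invalid lease " + leaseId);
            }
            currentLease = 0; // the lease is consumed by a successful report
        }
    }

    /** Toy DataNode modelling steps 1 to 5 of the Root Cause section. */
    static class ToyDataNode {
        final boolean resetOnInvalidLease; // true models a DN that handles the exception
        long fullBlockReportLeaseId = 0;
        int successfulReports = 0;
        int rejectedReports = 0;

        ToyDataNode(boolean resetOnInvalidLease) {
            this.resetOnInvalidLease = resetOnInvalidLease;
        }

        void heartbeatIteration(ToyNameNode nn) {
            // Step 1: a new lease is requested only when the id is 0.
            boolean requestBlockReportLease = (fullBlockReportLeaseId == 0);
            if (requestBlockReportLease) {
                fullBlockReportLeaseId = nn.requestLease();
            }
            try {
                nn.blockReport(fullBlockReportLeaseId);
                // Step 2: reset after success; skipped when blockReport throws.
                fullBlockReportLeaseId = 0;
                successfulReports++;
            } catch (InvalidLeaseException e) {
                rejectedReports++;
                if (resetOnInvalidLease) {
                    fullBlockReportLeaseId = 0; // lets the next iteration ask for a new lease
                }
                // Otherwise the stale id is kept (steps 4 and 5).
            }
        }
    }

    /** Returns {successfulReports, rejectedReports} after the given number of iterations. */
    static int[] simulate(boolean resetOnInvalidLease, int iterations) {
        ToyNameNode nn = new ToyNameNode();
        ToyDataNode dn = new ToyDataNode(resetOnInvalidLease);
        // The DN holds a lease that the NN invalidates before the report arrives.
        dn.fullBlockReportLeaseId = nn.requestLease();
        nn.expireLease();
        for (int i = 0; i < iterations; i++) {
            dn.heartbeatIteration(nn);
        }
        return new int[] {dn.successfulReports, dn.rejectedReports};
    }

    public static void main(String[] args) {
        int[] oldDn = simulate(false, 5);
        int[] newDn = simulate(true, 5);
        System.out.println("without handling: ok=" + oldDn[0] + " rejected=" + oldDn[1]);
        System.out.println("with handling:    ok=" + newDn[0] + " rejected=" + newDn[1]);
    }
}
```

In this toy model the DataNode without the handling branch is rejected on every iteration and never completes a report, while the one that resets fullBlockReportLeaseId on the exception is rejected once and then recovers on the next iteration.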
For code changes:
LICENSE, LICENSE-binary, NOTICE-binary files?
AI Tooling
If an AI tool was used:
where is the name of the AI tool used.
https://www.apache.org/legal/generative-tooling.html