HDFS-16869 Fail to start NameNode when replay editlog owing to 0 size of clientId or callId recorded in editlog #5235
base: trunk
Thank you very much for your contribution and for reporting this issue, but can you explain why this modification solves it?
💔 -1 overall
This message was automatically generated.
The root cause of the 0-size clientId is still under investigation.
So here we add a protective check that excludes a 0-size clientId from being added to the cache.
You need to find that out first: a proper explanation of why this happens only during RollingUpgrade, whether it can happen in other cases as well, and a unit test reproducing the behaviour.
Then we can think about what a proper fix is. We can't add extra validation checks in critical NameNode paths.
Thank you for contributing, @Daniel-009497, but I agree with @slfan1989 and @ayushtkn. The comment directly below the patch even states that the ops loaded from the edit log must be trustworthy.
I haven't seen anything like what you're describing myself. This would imply that somehow version 3.1.1 emitted edit log ops with bad retry cache data. If you still have the files around, you might investigate this more with Offline Edits Viewer. Allowing ops like this to proceed might violate the at-most-once guarantees that the retry cache is trying to provide.
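One way to start the investigation suggested above is to convert the edit log to XML with Offline Edits Viewer (for example `hdfs oev -i <edits file> -o <output>.xml`) and look for ops whose client ID is empty. The following is only a minimal sketch: the class and method names are hypothetical, the tag name `RPC_CLIENTID` is an assumption about the XML layout that should be checked against your own output, and the sample XML in `main` is hand-written, not real edit log data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper for inspecting Offline Edits Viewer XML output. */
class EmptyClientIdScanSketch {

    /** Counts elements named tagName whose text content is empty or whitespace. */
    static int countEmptyElements(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            int empty = 0;
            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getTextContent().trim().isEmpty()) {
                    empty++;
                }
            }
            return empty;
        } catch (Exception e) {
            // Malformed XML or parser setup failure.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample; the element names are assumptions, not real output.
        String sample =
            "<EDITS>"
            + "<RECORD><DATA><RPC_CLIENTID>abc</RPC_CLIENTID>"
            + "<RPC_CALLID>1</RPC_CALLID></DATA></RECORD>"
            + "<RECORD><DATA><RPC_CLIENTID></RPC_CLIENTID>"
            + "<RPC_CALLID>7</RPC_CALLID></DATA></RECORD>"
            + "</EDITS>";
        System.out.println(countEmptyElements(sample, "RPC_CLIENTID"));
    }
}
```

An empty client ID may also be legitimate for some ops, so a hit is a pointer for closer inspection of that record, not proof of corruption.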
We first encountered this issue in Hadoop 3.3.1 during a rolling upgrade from 3.1.1 to 3.3.1. It can cause the NameNode to fail to start, but only occasionally, not every time.
The root cause of the 0-size clientId is still under investigation.
So here we add a protective check that excludes a 0-size clientId from being added to the cache.
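The protective check described above can be illustrated roughly as follows. This is a self-contained sketch, not the actual patch and not Hadoop's retry cache: the class `RetryCacheGuardSketch`, its map-based storage, and the method names are all hypothetical stand-ins chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a retry cache; not the Hadoop implementation. */
class RetryCacheGuardSketch {

    // Key: clientId bytes plus callId, rendered as a string for simplicity.
    private final Map<String, Object> entries = new HashMap<>();

    /** Returns true if the clientId is missing or has 0 size. */
    static boolean isEmptyClientId(byte[] clientId) {
        return clientId == null || clientId.length == 0;
    }

    /**
     * Adds an entry unless the clientId has 0 size.
     * Returns true if the entry was cached, false if it was skipped.
     */
    boolean addIfClientIdPresent(byte[] clientId, int callId, Object payload) {
        if (isEmptyClientId(clientId)) {
            return false; // skip instead of failing the edit log replay
        }
        entries.put(Arrays.toString(clientId) + ":" + callId, payload);
        return true;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        RetryCacheGuardSketch cache = new RetryCacheGuardSketch();
        System.out.println(cache.addIfClientIdPresent(new byte[0], 1, "op"));
        System.out.println(cache.addIfClientIdPresent(new byte[16], 1, "op"));
        System.out.println(cache.size());
    }
}
```

Skipping an op this way keeps replay going, but as the reviewers point out, it could weaken the at-most-once guarantee the retry cache is meant to provide, which is why the root cause matters before adopting such a check.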