
HDFS-16869. NameNode fails to start when replaying the editlog owing to a 0-size clientId or callId recorded in the editlog #5235

Open
wants to merge 1 commit into base: trunk

Conversation

Daniel-009497
Contributor

We first encountered this issue in Hadoop 3.3.1 during a rolling upgrade from 3.1.1 to 3.3.1. It can cause the NameNode to fail to start, but only occasionally, not every time.

The root cause of the zero-size clientId is still under investigation.
For now, this patch adds a protective check that excludes entries with a zero-size clientId from being added to the retry cache.
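For illustration, here is a minimal sketch of the kind of protective check described above. The class and method names (RetryCacheGuard, isValidClientId, addCacheEntry) are hypothetical and this is not the actual patch; it only shows a guard that skips a zero-length clientId instead of letting it into the cache. Hadoop RPC client IDs are normally 16-byte UUIDs.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only -- not the actual patch. The real change would sit
// in the NameNode's retry-cache population path during edit log replay.
public final class RetryCacheGuard {

  // A well-formed Hadoop RPC client ID is a 16-byte UUID; length 0 is the
  // degenerate value reported in this issue.
  static final int EXPECTED_CLIENT_ID_LENGTH = 16;

  private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

  /** Returns true only if the recorded clientId looks usable as a cache key. */
  static boolean isValidClientId(byte[] clientId) {
    return clientId != null && clientId.length == EXPECTED_CLIENT_ID_LENGTH;
  }

  /**
   * Adds an entry keyed by (clientId, callId). A zero-length clientId is
   * skipped so that a corrupt edit log record cannot abort NameNode startup.
   */
  void addCacheEntry(byte[] clientId, int callId) {
    if (!isValidClientId(clientId)) {
      return; // protective check: drop the malformed entry
    }
    cache.put(Arrays.toString(clientId) + ":" + callId, Boolean.TRUE);
  }
}
```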

@slfan1989
Contributor

We first encountered this issue in Hadoop 3.3.1 during a rolling upgrade from 3.1.1 to 3.3.1. It can cause the NameNode to fail to start, but only occasionally, not every time.

Thank you very much for contributing and reporting this issue, but can you explain why this modification solves it?

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|------|-----------|---------|---------|---------|
| +0 🆗 | reexec | 1m 10s | | Docker mode activated. |
| _ Prechecks _ | | | | |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| _ trunk Compile Tests _ | | | | |
| +1 💚 | mvninstall | 41m 11s | | trunk passed |
| +1 💚 | compile | 25m 44s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 💚 | compile | 22m 13s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | checkstyle | 1m 6s | | trunk passed |
| +1 💚 | mvnsite | 1m 39s | | trunk passed |
| -1 ❌ | javadoc | 1m 9s | /branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt | hadoop-common in trunk failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04. |
| +1 💚 | javadoc | 0m 43s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | spotbugs | 2m 46s | | trunk passed |
| +1 💚 | shadedclient | 25m 22s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ | | | | |
| +1 💚 | mvninstall | 1m 2s | | the patch passed |
| +1 💚 | compile | 25m 12s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 💚 | javac | 25m 12s | | the patch passed |
| +1 💚 | compile | 22m 13s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | javac | 22m 13s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 0m 59s | /results-checkstyle-hadoop-common-project_hadoop-common.txt | hadoop-common-project/hadoop-common: The patch generated 1 new + 7 unchanged - 0 fixed = 8 total (was 7) |
| +1 💚 | mvnsite | 1m 35s | | the patch passed |
| -1 ❌ | javadoc | 0m 59s | /patch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt | hadoop-common in the patch failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04. |
| +1 💚 | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | spotbugs | 2m 43s | | the patch passed |
| +1 💚 | shadedclient | 25m 10s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ | | | | |
| +1 💚 | unit | 18m 25s | | hadoop-common in the patch passed. |
| +1 💚 | asflicense | 0m 56s | | The patch does not generate ASF License warnings. |
| | | 222m 49s | | |
| Subsystem | Report/Notes |
|-----------|--------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5235/1/artifact/out/Dockerfile |
| GITHUB PR | #5235 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 604a53f2d73d 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 8ce1ef6 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5235/1/testReport/ |
| Max. process+thread count | 2840 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5235/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

Member

@ayushtkn ayushtkn left a comment


The root cause of the zero-size clientId is still under investigation.
For now, this patch adds a protective check that excludes entries with a zero-size clientId from being added to the retry cache.

You need to find that out first: give a proper reason why this happens only during a rolling upgrade (or whether it can happen in other cases as well), and add a unit test reproducing the described behaviour.

Then we can think about what a proper fix would be; we can't add extra validation checks to critical NameNode paths.
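As a reference point for the test being asked for here, a unit test might look roughly like the following. It exercises only the hypothetical RetryCacheGuard sketched earlier in this thread, not the real edit log replay path, so it shows the shape of such a test rather than a reproduction of the failure.

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

// Shape of a test against the hypothetical guard above; a real test would
// need to replay an edit log op carrying a zero-length clientId.
public class TestRetryCacheGuard {

  @Test
  public void testZeroLengthClientIdIsRejected() {
    // The degenerate value reported in HDFS-16869: an empty clientId.
    assertFalse(RetryCacheGuard.isValidClientId(new byte[0]));
    assertFalse(RetryCacheGuard.isValidClientId(null));
  }

  @Test
  public void testWellFormedClientIdIsAccepted() {
    // A normal Hadoop RPC client ID is 16 bytes (a UUID).
    assertTrue(RetryCacheGuard.isValidClientId(new byte[16]));
  }
}
```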

Contributor

@cnauroth cnauroth left a comment


Thank you for contributing, @Daniel-009497, but I agree with @slfan1989 and @ayushtkn. The comment directly below the patch even states that the ops loaded from the edit log must be trustworthy.

I haven't seen anything like what you're describing myself. This would imply that somehow version 3.1.1 emitted edit log ops with bad retry cache data. If you still have the files around, you might investigate this more with Offline Edits Viewer. Allowing ops like this to proceed might violate the at-most-once guarantees that the retry cache is trying to provide.
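To make the at-most-once point concrete, here is a rough, self-contained sketch of how caching results by (clientId, callId) lets a retried request return the earlier answer instead of being re-executed. The names are hypothetical, not Hadoop's actual RetryCache API; the point is that an entry built from a bogus clientId undermines exactly this bookkeeping.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Conceptual sketch of at-most-once semantics; not Hadoop's RetryCache API. */
final class AtMostOnceCache {

  /** Cache key: the (clientId, callId) pair recorded with a non-idempotent op. */
  private static final class Key {
    private final byte[] clientId;
    private final int callId;

    Key(byte[] clientId, int callId) {
      this.clientId = clientId.clone();
      this.callId = callId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return callId == other.callId && Arrays.equals(clientId, other.clientId);
    }

    @Override
    public int hashCode() {
      return 31 * Arrays.hashCode(clientId) + callId;
    }
  }

  private final Map<Key, Object> results = new ConcurrentHashMap<>();

  /** Runs op at most once per (clientId, callId); a retry gets the cached result. */
  Object execute(byte[] clientId, int callId, Supplier<Object> op) {
    return results.computeIfAbsent(new Key(clientId, callId), k -> op.get());
  }
}
```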
