
HDFS-16869. NameNode fails to start when replaying the editlog owing to a 0-size clientId or callId recorded in the editlog #5235

Open
wants to merge 1 commit into base: trunk

Conversation

Daniel-009497
Contributor

We first encountered this issue in Hadoop 3.3.1 during a rolling upgrade from 3.1.1 to 3.3.1. It can cause the NameNode to fail to start, but only occasionally, not every time.

The root cause of the zero-size clientId is still under investigation.
For now, this patch adds a protective check that excludes entries with a zero-size clientId from being added to the retry cache.
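For illustration, here is a minimal sketch of the kind of protective check described above. The class and method names (RetryCacheGuard, isValidClientId, addCacheEntry) are hypothetical and this is not the actual patch; it only shows a guard that skips a zero-length clientId instead of letting it into the cache. Hadoop RPC client IDs are normally 16-byte UUIDs.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only -- not the actual patch. The real change would sit
// in the NameNode's retry-cache population path during edit log replay.
public final class RetryCacheGuard {

  // A well-formed Hadoop RPC client ID is a 16-byte UUID; length 0 is the
  // degenerate value reported in this issue.
  static final int EXPECTED_CLIENT_ID_LENGTH = 16;

  private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

  /** Returns true only if the recorded clientId looks usable as a cache key. */
  static boolean isValidClientId(byte[] clientId) {
    return clientId != null && clientId.length == EXPECTED_CLIENT_ID_LENGTH;
  }

  /**
   * Adds an entry keyed by (clientId, callId). A zero-length clientId is
   * skipped so that a corrupt edit log record cannot abort NameNode startup.
   */
  void addCacheEntry(byte[] clientId, int callId) {
    if (!isValidClientId(clientId)) {
      return; // protective check: drop the malformed entry
    }
    cache.put(Arrays.toString(clientId) + ":" + callId, Boolean.TRUE);
  }
}
```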

@slfan1989
Contributor

We first encountered this issue in Hadoop 3.3.1 during a rolling upgrade from 3.1.1 to 3.3.1. It can cause the NameNode to fail to start, but only occasionally, not every time.

Thank you very much for contributing and reporting this issue, but can you explain why this modification solves it?

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|------|-----------|---------|---------|---------|
| +0 🆗 | reexec | 1m 10s | | Docker mode activated. |
| _ Prechecks _ | | | | |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| _ trunk Compile Tests _ | | | | |
| +1 💚 | mvninstall | 41m 11s | | trunk passed |
| +1 💚 | compile | 25m 44s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 💚 | compile | 22m 13s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | checkstyle | 1m 6s | | trunk passed |
| +1 💚 | mvnsite | 1m 39s | | trunk passed |
| -1 ❌ | javadoc | 1m 9s | /branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt | hadoop-common in trunk failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04. |
| +1 💚 | javadoc | 0m 43s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | spotbugs | 2m 46s | | trunk passed |
| +1 💚 | shadedclient | 25m 22s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ | | | | |
| +1 💚 | mvninstall | 1m 2s | | the patch passed |
| +1 💚 | compile | 25m 12s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 💚 | javac | 25m 12s | | the patch passed |
| +1 💚 | compile | 22m 13s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | javac | 22m 13s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 0m 59s | /results-checkstyle-hadoop-common-project_hadoop-common.txt | hadoop-common-project/hadoop-common: The patch generated 1 new + 7 unchanged - 0 fixed = 8 total (was 7) |
| +1 💚 | mvnsite | 1m 35s | | the patch passed |
| -1 ❌ | javadoc | 0m 59s | /patch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt | hadoop-common in the patch failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04. |
| +1 💚 | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 💚 | spotbugs | 2m 43s | | the patch passed |
| +1 💚 | shadedclient | 25m 10s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ | | | | |
| +1 💚 | unit | 18m 25s | | hadoop-common in the patch passed. |
| +1 💚 | asflicense | 0m 56s | | The patch does not generate ASF License warnings. |
| | | 222m 49s | | |
| Subsystem | Report/Notes |
|-----------|--------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5235/1/artifact/out/Dockerfile |
| GITHUB PR | #5235 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 604a53f2d73d 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 8ce1ef6 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5235/1/testReport/ |
| Max. process+thread count | 2840 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5235/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

Member

@ayushtkn ayushtkn left a comment


The root cause of the zero-size clientId is still under investigation.
For now, this patch adds a protective check that excludes entries with a zero-size clientId from being added to the retry cache.

You need to find that out first: give a proper reason why this happens only during a rolling upgrade (or whether it can happen in other cases as well), and add a unit test reproducing the described behaviour.

Then we can think about what a proper fix would be; we can't add extra validation checks to critical NameNode paths.
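As a reference point for the test being asked for here, a unit test might look roughly like the following. It exercises only the hypothetical RetryCacheGuard sketched earlier in this thread, not the real edit log replay path, so it shows the shape of such a test rather than a reproduction of the failure.

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

// Shape of a test against the hypothetical guard above; a real test would
// need to replay an edit log op carrying a zero-length clientId.
public class TestRetryCacheGuard {

  @Test
  public void testZeroLengthClientIdIsRejected() {
    // The degenerate value reported in HDFS-16869: an empty clientId.
    assertFalse(RetryCacheGuard.isValidClientId(new byte[0]));
    assertFalse(RetryCacheGuard.isValidClientId(null));
  }

  @Test
  public void testWellFormedClientIdIsAccepted() {
    // A normal Hadoop RPC client ID is 16 bytes (a UUID).
    assertTrue(RetryCacheGuard.isValidClientId(new byte[16]));
  }
}
```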

Contributor

@cnauroth cnauroth left a comment


Thank you for contributing, @Daniel-009497, but I agree with @slfan1989 and @ayushtkn. The comment directly below the patch even states that the ops loaded from the edit log must be trustworthy.

I haven't seen anything like what you're describing myself. This would imply that somehow version 3.1.1 emitted edit log ops with bad retry cache data. If you still have the files around, you might investigate this more with Offline Edits Viewer. Allowing ops like this to proceed might violate the at-most-once guarantees that the retry cache is trying to provide.
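To make the at-most-once point concrete, here is a rough, self-contained sketch of how caching results by (clientId, callId) lets a retried request return the earlier answer instead of being re-executed. The names are hypothetical, not Hadoop's actual RetryCache API; the point is that an entry built from a bogus clientId undermines exactly this bookkeeping.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Conceptual sketch of at-most-once semantics; not Hadoop's RetryCache API. */
final class AtMostOnceCache {

  /** Cache key: the (clientId, callId) pair recorded with a non-idempotent op. */
  private static final class Key {
    private final byte[] clientId;
    private final int callId;

    Key(byte[] clientId, int callId) {
      this.clientId = clientId.clone();
      this.callId = callId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return callId == other.callId && Arrays.equals(clientId, other.clientId);
    }

    @Override
    public int hashCode() {
      return 31 * Arrays.hashCode(clientId) + callId;
    }
  }

  private final Map<Key, Object> results = new ConcurrentHashMap<>();

  /** Runs op at most once per (clientId, callId); a retry gets the cached result. */
  Object execute(byte[] clientId, int callId, Supplier<Object> op) {
    return results.computeIfAbsent(new Key(clientId, callId), k -> op.get());
  }
}
```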
