HDFS-17116. RBF: Update invoke millisecond time as monotonicNow() in RouterSafemodeService #5876

Merged
merged 2 commits into from
Jul 28, 2023

@haiyang1987 haiyang1987 commented Jul 23, 2023

Description of PR

https://issues.apache.org/jira/browse/HDFS-17116

We hit the following problem in our production environment:

  1. After the machine restarted, the system time was abnormal: it was set to a time in the future.
  2. After the router started, it logged "safemode exit for 24981702 milliseconds..." and stayed in safe mode. The cause is that startupTime was recorded from the future system time when the router started, and the system time returned to normal shortly afterwards, which produced a negative delta. At that point the service could only be recovered by restarting the router.

The relevant logs are:

2023-07-15 03:15:49,276 INFO  ipc.Server xxx
2023-07-15 11:21:03,785 INFO  router.DFSRouter (LogAdapter.java:info(51)) [main] - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting Router
...
2023-07-15 11:21:51,325 INFO xxx
2023-07-15 03:22:00,257 INFO xxx
2023-07-15 03:22:29,829 INFO router.RouterSafemodeService (RouterSafemodeService.java:periodicInvoke(167)) [RouterSafemodeService-0] - Delaying safemode exit for 28761777 milliseconds...

Maybe we can handle this case at the code level by invoking monotonicNow() to calculate the delta.
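To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the RouterSafemodeService code: `StartupTimer`, `monotonicNowMs()`, the 30-second interval, and the eight-hour clock offset are hypothetical stand-ins chosen for the example. It assumes only that a monotonic millisecond value can be derived from `System.nanoTime()`.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

class MonotonicDeltaSketch {

  /** Hypothetical stand-in for the startup wait; not the Hadoop class. */
  static final class StartupTimer {
    private final LongSupplier clock;      // millisecond time source
    private final long startupTime;        // captured once at construction
    private final long startupIntervalMs;  // how long to keep waiting

    StartupTimer(LongSupplier clock, long startupIntervalMs) {
      this.clock = clock;
      this.startupTime = clock.getAsLong();
      this.startupIntervalMs = startupIntervalMs;
    }

    /** Milliseconds left to wait; zero or less means the wait is over. */
    long remainingMs() {
      long delta = clock.getAsLong() - startupTime;
      return startupIntervalMs - delta;
    }
  }

  /** Monotonic milliseconds, derived from System.nanoTime(). */
  static long monotonicNowMs() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
  }

  public static void main(String[] args) {
    long eightHoursMs = TimeUnit.HOURS.toMillis(8);

    // Simulated wall clock that starts eight hours in the future.
    long[] wall = {System.currentTimeMillis() + eightHoursMs};
    StartupTimer wallTimer = new StartupTimer(() -> wall[0], 30_000L);

    // The clock is corrected and jumps back by eight hours, so the
    // delta becomes negative and the remaining wait grows by hours.
    wall[0] -= eightHoursMs;
    System.out.println("wall clock, ms left: " + wallTimer.remainingMs());

    // A monotonic source is unaffected by wall-clock corrections.
    StartupTimer monoTimer =
        new StartupTimer(MonotonicDeltaSketch::monotonicNowMs, 30_000L);
    System.out.println("monotonic, ms left: " + monoTimer.remainingMs());
  }
}
```

With the simulated wall clock, the remaining wait ends up larger than the configured interval by the size of the backward jump, which matches the shape of the "Delaying safemode exit for ... milliseconds" log above.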

…erval is negative during router safe mode exit check
@@ -128,7 +129,6 @@ public void testRouterExitSafemode()

assertTrue(router.getSafemodeService().isInSafeMode());
verifyRouter(RouterServiceState.SAFEMODE);

Contributor

avoid this change

Contributor Author

Thanks @slfan1989 for helping me review it. I will update it later.

TimeUnit.SECONDS.toMillis(2), TimeUnit.MILLISECONDS) +
conf.getTimeDuration(DFS_ROUTER_CACHE_TIME_TO_LIVE_MS,
TimeUnit.SECONDS.toMillis(1), TimeUnit.MILLISECONDS) * 2;
Thread.sleep(interval);
Contributor

Use GenericTestUtils.waitFor
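For readers unfamiliar with the suggestion: the idea is to poll a condition with a timeout instead of sleeping for a fixed interval. The sketch below shows that pattern with a hypothetical local `waitFor` helper so that it runs without Hadoop on the classpath. It is only an approximation of the polling idea, not the GenericTestUtils implementation, and the condition in `main` is a stand-in for a check such as "the router has left safe mode".

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class WaitForSketch {

  /**
   * Hypothetical local helper: polls check every checkEveryMs until it
   * returns true, or throws once waitForMs has elapsed.
   */
  static void waitFor(Supplier<Boolean> check, long checkEveryMs, long waitForMs)
      throws TimeoutException, InterruptedException {
    // Measure the deadline with nanoTime so clock changes cannot distort it.
    long deadline = System.nanoTime() + waitForMs * 1_000_000L;
    while (!check.get()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "Condition not met within " + waitForMs + " ms");
      }
      Thread.sleep(checkEveryMs);
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.nanoTime();
    // Stand-in condition: becomes true roughly 200 ms after start.
    waitFor(() -> System.nanoTime() - start > 200_000_000L, 50, 5_000);
    System.out.println("condition met");
  }
}
```

Compared with Thread.sleep(interval), a polling wait returns as soon as the condition holds and fails with a clear timeout when it never does.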

Contributor Author

I will update it later.

verifyRouter(RouterServiceState.SAFEMODE);

// Wait for initial time in milliseconds
long interval =
Contributor

This part of the code is hard to read; can it be expanded?

Contributor Author

I will update it later.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 49m 19s trunk passed
+1 💚 compile 0m 44s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 0m 36s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 29s trunk passed
+1 💚 mvnsite 0m 44s trunk passed
+1 💚 javadoc 0m 42s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 30s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 30s trunk passed
+1 💚 shadedclient 39m 34s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 33s the patch passed
+1 💚 compile 0m 34s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 0m 34s the patch passed
+1 💚 compile 0m 30s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 30s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 18s the patch passed
+1 💚 mvnsite 0m 33s the patch passed
+1 💚 javadoc 0m 28s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 23s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 24s the patch passed
+1 💚 shadedclient 39m 25s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 22m 7s hadoop-hdfs-rbf in the patch passed.
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
165m 51s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5876/1/artifact/out/Dockerfile
GITHUB PR #5876
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux d255c8193617 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 4c67183
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5876/1/testReport/
Max. process+thread count 2434 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5876/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@@ -161,11 +161,17 @@ protected void serviceInit(Configuration conf) throws Exception {

@Override
public void periodicInvoke() {
- long now = Time.now();
+ long now = now();
long delta = now - startupTime;
Contributor

How about invoking monotonicNow() to calculate the delta?

Contributor Author

Thanks @Hexiaoqiao for helping me review it.
Yes, your suggestion is right. Because monotonicNow() is not affected by settimeofday or similar system clock changes, calculating the delta with monotonicNow() here avoids the exception case.

If startupTime uses monotonicNow(), then cacheLastUpdateTime and enterSafeModeTime should probably use monotonicNow() as well, to stay consistent.

What do you think?

Contributor

Absolutely yes.
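The consistency point can be sketched as follows. A monotonic value has an arbitrary origin, so it is only meaningful when subtracted from another value taken from the same source; subtracting a wall-clock timestamp from it gives a number with no meaning. The class and method names below are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.TimeUnit;

class ClockConsistencySketch {

  /** Monotonic milliseconds, derived from System.nanoTime(). */
  static long monotonicNowMs() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
  }

  /** Elapsed ms since a timestamp taken from the same monotonic source. */
  static long elapsedSince(long monotonicStartMs) {
    return monotonicNowMs() - monotonicStartMs;
  }

  public static void main(String[] args) throws InterruptedException {
    long wallStart = System.currentTimeMillis();
    long monoStart = monotonicNowMs();
    Thread.sleep(100);

    // Both timestamps come from the monotonic source: a real duration.
    System.out.println("consistent elapsed ms: " + elapsedSince(monoStart));

    // Mixed sources: the two values have different origins, so the
    // difference is not a duration at all.
    System.out.println("mixed (meaningless) ms: " + (monotonicNowMs() - wallStart));
  }
}
```

This is why every timestamp that takes part in the same comparison needs to move to the monotonic source together.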

assertTrue(router.getSafemodeService().isInSafeMode());
verifyRouter(RouterServiceState.SAFEMODE);

// Wait for initial time in milliseconds
Contributor

One tiny issue: please make the comment style consistent. Add a period at the end of each sentence.

Contributor Author

Thanks @hfutatzhanghb for helping me review it. I will update it later.

@haiyang1987 haiyang1987 changed the title HDFS-17116. Reset startupTime and enterSafeModeTime if check time interval is negative during router safe mode exit check HDFS-17116. RBF: Update invoke millisecond time as monotonicNow() in RouterSafemodeService Jul 27, 2023
@haiyang1987 haiyang1987 changed the title HDFS-17116. RBF: Update invoke millisecond time as monotonicNow() in RouterSafemodeService HDFS-17116. RBF: Update invoke millisecond time as monotonicNow() in RouterSafemodeService Jul 27, 2023
@haiyang1987
Contributor Author

Hi @Hexiaoqiao @slfan1989 @hfutatzhanghb, I have updated the PR. Because the implementation changed, the title of the issue had to be changed as well.

Please help me review this PR again when you have free time. Thanks a lot!

Contributor

@Hexiaoqiao Hexiaoqiao left a comment

LGTM. +1 from my side. Let's wait and see what Yetus says.

Contributor

@hfutatzhanghb hfutatzhanghb left a comment

LGTM. +1

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 49m 10s trunk passed
+1 💚 compile 0m 41s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 0m 37s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 28s trunk passed
+1 💚 mvnsite 0m 41s trunk passed
+1 💚 javadoc 0m 42s trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 31s trunk passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 27s trunk passed
+1 💚 shadedclient 39m 17s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 34s the patch passed
+1 💚 compile 0m 37s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 0m 36s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 31s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 18s the patch passed
+1 💚 mvnsite 0m 32s the patch passed
+1 💚 javadoc 0m 29s the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 23s the patch passed with JDK Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 24s the patch passed
+1 💚 shadedclient 38m 44s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 22m 34s /patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-rbf in the patch passed.
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
164m 58s
Reason Tests
Failed junit tests hadoop.hdfs.server.federation.router.TestRouterRPCMultipleDestinationMountTableResolver
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5876/2/artifact/out/Dockerfile
GITHUB PR #5876
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 903479638afd 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2376747
Default Java Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5876/2/testReport/
Max. process+thread count 2428 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5876/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@haiyang1987
Contributor Author

The failed unit test seems unrelated to this change. I will follow up on the failure and create a new issue to fix it.

@Hexiaoqiao Hexiaoqiao merged commit 87c036e into apache:trunk Jul 28, 2023
1 of 4 checks passed
@Hexiaoqiao
Contributor

Committed to trunk. Thanks @haiyang1987 for your contribution and @hfutatzhanghb @slfan1989 for your reviews!

@haiyang1987
Contributor Author

Thanks @Hexiaoqiao @slfan1989 @hfutatzhanghb for helping to review and merge it.

jiajunmao pushed a commit to jiajunmao/hadoop-MLEC that referenced this pull request Feb 6, 2024
…RouterSafemodeService (apache#5876). Contributed by Haiyang Hu.

Reviewed-by: hfutatzhanghb <1036798979@qq.com>
Reviewed-by: Shilun Fan <slfan1989@apache.org>
Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>