Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JVM paused for a long time when trying to reach safepoint #455

Open
k3np opened this issue Mar 29, 2023 · 2 comments
Open

JVM paused for a long time when trying to reach safepoint #455

k3np opened this issue Mar 29, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@k3np
Copy link

k3np commented Mar 29, 2023

Describe the bug

We are having issues within core Java when the JVM tries to reach a safepoint. We experience a delay of 104-108 seconds, where the JVM is paused when trying to reach the safepoint. Our logs show certain threads were in a monitor blocked state during synchronization of the sun.net.www.http.KeepAliveCache.get(URL, Object) method.
Lately we have experienced the issue to happen multiple times per week.

To Reproduce

We have not been able to reproduce the safepoint pause in a test environment.

Expected behavior

Time to reach safepoint should be in milliseconds, not seconds.

Logs

Safepoint log:

Mar 28 08:05:16 server: vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
Mar 28 08:05:16 server: 96773.234: ForceAsyncSafepoint [ 377 0 2 ] [ 0 0108196 0 0 ] 0

Flight Recorder Monitor Blocked state:

With Flight Recorder activated the following threads is Monitor Blocked state during the incident:

Start Time Duration End Time Event Thread Monitor Address Monitor Class Previous Monitor Owner
2023-03-28 08:03:28.981 1 min 47 s 2023-03-28 08:05:16.391 xxx-thread-9 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache xxx-thread-10
2023-03-28 08:03:29.983 1 min 46 s 2023-03-28 08:05:16.391 xxx-thread-10 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache hystrix-2
2023-03-28 08:03:30.388 1 min 46 s 2023-03-28 08:05:16.391 hystrix-2 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache http-nio-8080-exec-93
2023-03-28 08:03:30.423 1 min 46 s 2023-03-28 08:05:16.391 http-nio-8080-exec-93 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache yyy-thread-1
2023-03-28 08:03:30.464 1 min 46 s 2023-03-28 08:05:16.391 yyy-thread-1 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache http-nio-8080-exec-33
2023-03-28 08:03:30.772 1 min 46 s 2023-03-28 08:05:16.383 http-nio-8080-exec-33 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache xxx-thread-3
2023-03-28 08:03:30.986 1 min 45 s 2023-03-28 08:05:16.330 xxx-thread-3 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache http-nio-8080-exec-21
2023-03-28 08:03:31.712 1 min 45 s 2023-03-28 08:05:16.330 http-nio-8080-exec-21 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache http-nio-8080-exec-3
2023-03-28 08:03:31.794 1 min 45 s 2023-03-28 08:05:16.330 http-nio-8080-exec-3 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache http-nio-8080-exec-92
2023-03-28 08:03:31.794 1 min 44 s 2023-03-28 08:05:16.293 http-nio-8080-exec-92 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache zzz-1
2023-03-28 08:03:34.222 1 min 42 s 2023-03-28 08:05:16.293 zzz-1 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache LoggingThread 461
2023-03-28 08:03:41.913 1 min 34 s 2023-03-28 08:05:16.293 LoggingThread 461 0x7F9EEC01F770 sun.net.www.http.KeepAliveCache zzz-2

Thread Stack Trace

Based on the stack traces it looks like an issue with HttpClient at the syncronized sun.net.www.http.KeepAliveCache.get(URL, Object).

Example 1:

HttpClient sun.net.www.http.KeepAliveCache.get(URL, Object) HttpClient sun.net.www.protocol.https.HttpsClient.New(SSLSocketFactory, URL, HostnameVerifier, Proxy, boolean, int, HttpURLConnection) HttpClient sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(URL, Proxy, int) void sun.net.www.protocol.http.HttpURLConnection.plainConnect0() void sun.net.www.protocol.http.HttpURLConnection.plainConnect() void sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect() InputStream sun.net.www.protocol.http.HttpURLConnection.getInputStream0() InputStream sun.net.www.protocol.http.HttpURLConnection.getInputStream() int java.net.HttpURLConnection.getResponseCode() int sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode() ...

Example 2:

HttpClient sun.net.www.http.KeepAliveCache.get(URL, Object) HttpClient sun.net.www.http.HttpClient.New(URL, Proxy, int, boolean, HttpURLConnection) HttpClient sun.net.www.http.HttpClient.New(URL, Proxy, int, HttpURLConnection) HttpClient sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(URL, Proxy, int) void sun.net.www.protocol.http.HttpURLConnection.plainConnect0() void sun.net.www.protocol.http.HttpURLConnection.plainConnect() void sun.net.www.protocol.http.HttpURLConnection.connect() OutputStream sun.net.www.protocol.http.HttpURLConnection.getOutputStream0() OutputStream sun.net.www.protocol.http.HttpURLConnection.getOutputStream() ....

Platform information

AWS Corretto: 8.352.08.1
OS: Amazon Linux 2
Kernel: Linux 4.14.294-220.533.amzn2.x86_64

@k3np k3np added the bug Something isn't working label Mar 29, 2023
@olivergillespie
Copy link
Contributor

olivergillespie commented Mar 29, 2023

Hi!

I can see you have an incredibly long time-to-safepoint (sync phase), 108196 milliseconds. However, I don't see the connection to the blocked threads you've shared. Blocked threads are waiting on a monitor, which does not stop the thread coming to a safepoint. In fact, blocked threads are already at a safepoint, so they do not contribute to time-to-safepoint delays.

Only a running thread will delay reaching a safepoint, in normal circumstances. It's also possible that something outside the JVM is causing a pause, for example something at the OS level, and it just seems like it's related to safepoints because that's where the metrics are.

If it is truly a safepoint issue, I suggest two techniques for diagnosis.

  1. Use these debug flags to print the thread(s) which are slow to reach the safepoint:
-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000
  1. Use https://github.com/async-profiler/async-profiler with --ttsp (time-to-safepoint) profiling mode enabled. This will show you what activity is happening in your process while waiting for all threads to reach a safepoint. If async-profiler doesn't show any/many samples during that time, it's an indication that this is something happening external to the java process.

Useful references:

https://krzysztofslusarski.github.io/2020/11/14/tts.html
http://psy-lob-saw.blogspot.com/2015/12/safepoints.html

I don't see any reason to believe this is a bug in Corretto, but I'll leave this open for now in case you have any further questions.

@navyxliu
Copy link
Contributor

In jdk8, you still can use -XX:+PrintSafepointStatistics to dump TTSP. Like @olivergillespie pointed out, if you do encounter SafePoint Timeout, hotspot will dump whereabout. that will help you find the culprit.

Besides external reasons, jdk8u has a defect in counted loops. If your program has a counted loop and it takes very long time, the thread may hinder other java threads from reaching SafePoint and trigger timeout. In this case, you need to use -XX:+UseCountedLoopSafepoints or refactor your code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants