JVM paused for a long time when trying to reach safepoint #455

k3np · 2023-03-29T13:11:23Z

Describe the bug

We are having issues within core Java when the JVM tries to reach a safepoint. We experience a delay of 104-108 seconds, where the JVM is paused when trying to reach the safepoint. Our logs show certain threads were in a monitor blocked state during synchronization of the sun.net.www.http.KeepAliveCache.get(URL, Object) method.
Lately we have experienced the issue to happen multiple times per week.

To Reproduce

We have not been able to reproduce the safepoint pause in a test environment.

Expected behavior

Time to reach safepoint should be in milliseconds, not seconds.

Logs

Safepoint log:

Mar 28 08:05:16 server: vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
Mar 28 08:05:16 server: 96773.234: ForceAsyncSafepoint [ 377 0 2 ] [ 0 0108196 0 0 ] 0

Flight Recorder Monitor Blocked state:

With Flight Recorder activated the following threads is Monitor Blocked state during the incident:

Start Time	Duration	End Time	Event Thread	Monitor Address	Monitor Class	Previous Monitor Owner
2023-03-28 08:03:28.981	1 min 47 s	2023-03-28 08:05:16.391	xxx-thread-9	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	xxx-thread-10
2023-03-28 08:03:29.983	1 min 46 s	2023-03-28 08:05:16.391	xxx-thread-10	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	hystrix-2
2023-03-28 08:03:30.388	1 min 46 s	2023-03-28 08:05:16.391	hystrix-2	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	http-nio-8080-exec-93
2023-03-28 08:03:30.423	1 min 46 s	2023-03-28 08:05:16.391	http-nio-8080-exec-93	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	yyy-thread-1
2023-03-28 08:03:30.464	1 min 46 s	2023-03-28 08:05:16.391	yyy-thread-1	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	http-nio-8080-exec-33
2023-03-28 08:03:30.772	1 min 46 s	2023-03-28 08:05:16.383	http-nio-8080-exec-33	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	xxx-thread-3
2023-03-28 08:03:30.986	1 min 45 s	2023-03-28 08:05:16.330	xxx-thread-3	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	http-nio-8080-exec-21
2023-03-28 08:03:31.712	1 min 45 s	2023-03-28 08:05:16.330	http-nio-8080-exec-21	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	http-nio-8080-exec-3
2023-03-28 08:03:31.794	1 min 45 s	2023-03-28 08:05:16.330	http-nio-8080-exec-3	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	http-nio-8080-exec-92
2023-03-28 08:03:31.794	1 min 44 s	2023-03-28 08:05:16.293	http-nio-8080-exec-92	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	zzz-1
2023-03-28 08:03:34.222	1 min 42 s	2023-03-28 08:05:16.293	zzz-1	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	LoggingThread 461
2023-03-28 08:03:41.913	1 min 34 s	2023-03-28 08:05:16.293	LoggingThread 461	0x7F9EEC01F770	sun.net.www.http.KeepAliveCache	zzz-2

Thread Stack Trace

Based on the stack traces it looks like an issue with HttpClient at the syncronized sun.net.www.http.KeepAliveCache.get(URL, Object).

Example 1:

HttpClient sun.net.www.http.KeepAliveCache.get(URL, Object) HttpClient sun.net.www.protocol.https.HttpsClient.New(SSLSocketFactory, URL, HostnameVerifier, Proxy, boolean, int, HttpURLConnection) HttpClient sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(URL, Proxy, int) void sun.net.www.protocol.http.HttpURLConnection.plainConnect0() void sun.net.www.protocol.http.HttpURLConnection.plainConnect() void sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect() InputStream sun.net.www.protocol.http.HttpURLConnection.getInputStream0() InputStream sun.net.www.protocol.http.HttpURLConnection.getInputStream() int java.net.HttpURLConnection.getResponseCode() int sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode() ...

Example 2:

HttpClient sun.net.www.http.KeepAliveCache.get(URL, Object) HttpClient sun.net.www.http.HttpClient.New(URL, Proxy, int, boolean, HttpURLConnection) HttpClient sun.net.www.http.HttpClient.New(URL, Proxy, int, HttpURLConnection) HttpClient sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(URL, Proxy, int) void sun.net.www.protocol.http.HttpURLConnection.plainConnect0() void sun.net.www.protocol.http.HttpURLConnection.plainConnect() void sun.net.www.protocol.http.HttpURLConnection.connect() OutputStream sun.net.www.protocol.http.HttpURLConnection.getOutputStream0() OutputStream sun.net.www.protocol.http.HttpURLConnection.getOutputStream() ....

Platform information

AWS Corretto: 8.352.08.1
OS: Amazon Linux 2
Kernel: Linux 4.14.294-220.533.amzn2.x86_64

The text was updated successfully, but these errors were encountered:

olivergillespie · 2023-03-29T13:39:04Z

Hi!

I can see you have an incredibly long time-to-safepoint (sync phase), 108196 milliseconds. However, I don't see the connection to the blocked threads you've shared. Blocked threads are waiting on a monitor, which does not stop the thread coming to a safepoint. In fact, blocked threads are already at a safepoint, so they do not contribute to time-to-safepoint delays.

Only a running thread will delay reaching a safepoint, in normal circumstances. It's also possible that something outside the JVM is causing a pause, for example something at the OS level, and it just seems like it's related to safepoints because that's where the metrics are.

If it is truly a safepoint issue, I suggest two techniques for diagnosis.

Use these debug flags to print the thread(s) which are slow to reach the safepoint:

-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000

Use https://github.com/async-profiler/async-profiler with --ttsp (time-to-safepoint) profiling mode enabled. This will show you what activity is happening in your process while waiting for all threads to reach a safepoint. If async-profiler doesn't show any/many samples during that time, it's an indication that this is something happening external to the java process.

Useful references:

https://krzysztofslusarski.github.io/2020/11/14/tts.html
http://psy-lob-saw.blogspot.com/2015/12/safepoints.html

I don't see any reason to believe this is a bug in Corretto, but I'll leave this open for now in case you have any further questions.

navyxliu · 2023-04-13T21:36:28Z

In jdk8, you still can use -XX:+PrintSafepointStatistics to dump TTSP. Like @olivergillespie pointed out, if you do encounter SafePoint Timeout, hotspot will dump whereabout. that will help you find the culprit.

Besides external reasons, jdk8u has a defect in counted loops. If your program has a counted loop and it takes very long time, the thread may hinder other java threads from reaching SafePoint and trigger timeout. In this case, you need to use -XX:+UseCountedLoopSafepoints or refactor your code.

k3np added the bug Something isn't working label Mar 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JVM paused for a long time when trying to reach safepoint #455

JVM paused for a long time when trying to reach safepoint #455

k3np commented Mar 29, 2023

olivergillespie commented Mar 29, 2023 •

edited

Loading

navyxliu commented Apr 13, 2023

JVM paused for a long time when trying to reach safepoint #455

JVM paused for a long time when trying to reach safepoint #455

Comments

k3np commented Mar 29, 2023

Describe the bug

To Reproduce

Expected behavior

Logs

Safepoint log:

Flight Recorder Monitor Blocked state:

Thread Stack Trace

Example 1:

Example 2:

Platform information

olivergillespie commented Mar 29, 2023 • edited Loading

navyxliu commented Apr 13, 2023

olivergillespie commented Mar 29, 2023 •

edited

Loading