
Crash in WallClock::getThreadState #862

Closed
vans239 opened this issue Dec 8, 2023 · 15 comments

vans239 commented Dec 8, 2023

Async-profiler caused a JVM crash.
Async-profiler is provided via ap-loader: me.bechberger:ap-loader:2.9-2-all
Java: 21.0.1+12

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f3af0fe14b4, pid=26416, tid=26737
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.1+12 (21.0.1+12) (build 21.0.1+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (21.0.1+12-LTS, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# C  [libasyncProfiler.so+0xd4b4]  WallClock::getThreadState(void*)+0x14
#
# Core dump will be written. Default location: /data/coredumps/core.pid%P.uid%u.sig%s.%e.%t.26416
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid26416.log
[106500.162s][warning][os] Loading hsdis library failed

Thread stack:

---------------  T H R E A D  ---------------

Current thread (0x00007f3b1bc4d1a0):  JavaThread "foreground"        [_thread_in_native, id=26737, stack(0x00007f388e0f8000,0x00007f388e1f9000) (1028K)]

Stack: [0x00007f388e0f8000,0x00007f388e1f9000],  sp=0x00007f388e1f67b0,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libasyncProfiler.so+0xd4b4]  WallClock::getThreadState(void*)+0x14
C  [libasyncProfiler.so+0x35889]  WallClock::signalHandler(int, siginfo*, void*)+0x29
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 16870  ...NativeUtils.recvfrom(IJII)J (0 bytes) @ 0x00007f3b0874d902 [0x00007f3b0874d8a0+0x0000000000000062]
...

I can provide additional debug information from the hs_err_pid file if needed.

vans239 changed the title from "Jvm crash" to "Crash in WallClock::getThreadState" on Dec 8, 2023
apangin (Collaborator) commented Dec 8, 2023

Yes, please provide the registers section and the memory map (Dynamic libraries) from hs_err.
You may send the file privately if you wish.

vans239 (Author) commented Dec 8, 2023

Registers:
RAX=0x0000000000000001, RBX=0x0000000000000423, RCX=0x00000000000001e4, RDX=0x00007f388e1f5780
RSP=0x00007f388e1f67b0, RBP=0x00007f388e1f67d0, RSI=0x00007f388e1f68f0, RDI=0x00007f388e1f5780
R8 =0x0000000000000423, R9 =0x00007f388e1f6a64, R10=0x00007f3818016e60, R11=0x00007f3818019bb0
R12=0x00007f388e1f68f0, R13=0x00007f388e1f5780, R14=0x0000000010000000, R15=0x00007f388e1f7b24
RIP=0x00007f3af0fe14b4, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e


Register to memory mapping:

RAX=0x0000000000000001 is an unknown value
RBX=0x0000000000000423 is an unknown value
RCX=0x00000000000001e4 is an unknown value
RDX=0x00007f388e1f5780 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RSP=0x00007f388e1f67b0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RBP=0x00007f388e1f67d0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RSI=0x00007f388e1f68f0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RDI=0x00007f388e1f5780 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R8 =0x0000000000000423 is an unknown value
R9 =0x00007f388e1f6a64 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R10=0x00007f3818016e60 points into unknown readable memory: 0x00007f38d8016e60 | 60 6e 01 d8 38 7f 00 00
R11=0x00007f3818019bb0 points into unknown readable memory: 0x021420fe092d534c | 4c 53 2d 09 fe 20 14 02
R12=0x00007f388e1f68f0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R13=0x00007f388e1f5780 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R14=0x0000000010000000 is an unknown value
R15=0x00007f388e1f7b24 is pointing into the stack for thread: 0x00007f3b1bc4d1a0

vans239 (Author) commented Dec 8, 2023

Happy to share the hs_err file privately. What email should I use?

apangin (Collaborator) commented Dec 8, 2023

support at profiler . tools
thanks

Has the error happened just once or multiple times?

vans239 (Author) commented Dec 8, 2023

Sent the file to that email.

I have seen 2 failures due to async-profiler. However, I am not sure the second one was caused by the same issue, and I no longer have any details about it.

apangin (Collaborator) commented Dec 8, 2023

The crash looks weird to me: in the signal handler, the pc value from ucontext is some random value.
Async-profiler only reads what it receives from the kernel; I don't see how ucontext could be corrupted - might be a kernel or glibc bug.
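
For context, here is a minimal sketch (not async-profiler's actual code) of how a profiling signal handler on Linux/x86-64 reads the interrupted pc from the ucontext argument the kernel passes in. The handler has no way to validate this value before using it:

#include <csignal>
#include <cstdint>
#include <ucontext.h>

// Minimal sketch (not async-profiler's actual code): a profiling signal
// handler that extracts the interrupted program counter from the ucontext
// supplied by the kernel on Linux/x86-64.
static void profilingHandler(int signo, siginfo_t* info, void* ucontext) {
    ucontext_t* uc = static_cast<ucontext_t*>(ucontext);
    // REG_RIP holds the saved instruction pointer of the interrupted thread.
    // The handler must trust this value; if a buggy kernel, libc, or an
    // ill-behaved earlier handler delivers garbage such as 0x423, reading
    // the instruction bytes at pc faults exactly as in the crash above.
    uintptr_t pc = static_cast<uintptr_t>(uc->uc_mcontext.gregs[REG_RIP]);
    (void)signo; (void)info; (void)pc;
}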

vans239 (Author) commented Dec 11, 2023

Thanks for looking into this.

I'll look into upgrading dependencies, but it will take a long time.
For visibility:

uname: Linux 5.3.7-301
libc: glibc 2.30 NPTL 2.30

Feel free to close the issue if you believe it's not caused by async-profiler.

apangin (Collaborator) commented Dec 11, 2023

Yeah, the crash happened when async-profiler tried to access an instruction at the address obtained here:

uintptr_t pc = frame.pc();

The pc address is passed to the signal handler from outside. Normally it points to a valid instruction, but here pc = 0x423, which suggests some external bug.

I just thought of another (although unlikely) hypothesis: for instance, some third-party library uses the same signal as async-profiler (SIGVTALRM), but does not do signal chaining properly. If this is the case, you may configure async-profiler to use a different signal, see #759.
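
For illustration, here is a minimal sketch (hypothetical; not taken from OpenOnload or async-profiler) of what proper signal chaining looks like: save the previous sigaction when installing your handler, and forward the signal to it afterwards.

#include <csignal>
#include <cstring>

static struct sigaction g_prev;  // previous owner's action, saved at install time

static void chainedHandler(int signo, siginfo_t* info, void* ucontext) {
    // ... this library's own signal work goes here ...

    // Forward the signal to whoever owned it before us, preserving the
    // original siginfo and ucontext, instead of swallowing it.
    if (g_prev.sa_flags & SA_SIGINFO) {
        if (g_prev.sa_sigaction != nullptr)
            g_prev.sa_sigaction(signo, info, ucontext);
    } else if (g_prev.sa_handler != SIG_DFL && g_prev.sa_handler != SIG_IGN) {
        g_prev.sa_handler(signo);
    }
}

static void installChained(int signo) {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = chainedHandler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(signo, &sa, &g_prev);  // the old action lands in g_prev
}

If a library skips this step, or forwards with a stale or synthetic ucontext instead of the original one, the next handler in the chain sees a bogus pc - which would match the 0x423 observed above.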

I'll close the issue for now. Please update the issue if the bug happens again. We'll see if there is a common pattern.

vans239 (Author) commented Dec 11, 2023

Hmm, your guess about the signal handler seems plausible.
We use OpenOnload, which does something with signal handlers and listens for SIGVTALRM (I haven't looked into the details yet).

Thanks for the help!!!

vans239 (Author) commented Dec 20, 2023

Update:
I am able to reliably reproduce the issue. It indeed happens only when OpenOnload is enabled.

Unexpectedly, it affects only the wall-clock profiler; I am not able to reproduce the issue with the cpu profiler.

apangin (Collaborator) commented Dec 20, 2023

Thank you for the update. So it looks like OpenOnload indeed uses SIGVTALRM for its own purposes but does not take care of signal chaining. CPU profiling works with a different signal - SIGPROF. Did you try configuring an alternative profiling signal? E.g., to use SIGPROF for wall-clock profiling, run asprof -e wall --signal 27
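
For example, a full invocation might look like this (illustrative only: <pid> stands for the target JVM's process id, and the duration and output path are arbitrary):

asprof -e wall --signal 27 -d 30 -f /tmp/wall.jfr <pid>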

vans239 (Author) commented Dec 20, 2023

Would it work with multi-event profiling? So far it doesn't seem to fail, so the workaround works.

I am trying to collect wall and alloc together: asprof -d 15 -a -f /tmp/flame.jfr --cstack vm -F vtable -t -d 10 -e cpu,alloc --wall 300us -i 300us --signal 27 jps
Does this make sense?

apangin (Collaborator) commented Dec 20, 2023

If you want to profile cpu+wall together, you'll need two different signals. Something like --signal 27/37 should work (meaning: use signal 27 = SIGPROF for cpu profiling and signal 37 for wall-clock profiling).

As a side note, there are multiple issues with your asprof command:

  • -d is specified twice;
  • -a and -t are relevant for text formats, but do nothing with jfr;
  • -F vtable is not yet supported with --cstack vm (will fix this later);
  • -e cpu,alloc --wall works, but it mixes two styles and is discouraged.
    The preferred syntax for multi-event profiling is -e cpu -i 300us --alloc 512m --wall 1ms (see the full example after this list).
  • --wall 300us: the interval for wall-clock profiling is too low; it may result in high overhead. Values below 1ms are not recommended unless you limit the number of wall-profiled threads with the --filter option.
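
Putting these points together, a cleaned-up version of your command might look like this (illustrative; it keeps your output path and target, drops the flags that do nothing here, and adds the dual-signal workaround from above):

asprof -d 15 -f /tmp/flame.jfr --cstack vm -e cpu -i 300us --alloc 512m --wall 1ms --signal 27/37 jps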

vans239 (Author) commented Dec 20, 2023

Thanks! I appreciate the response.
Some of the issues above are intentional (I reuse the same default args for different use cases and hope the irrelevant ones are ignored).

We pin threads to cores, so I'd expect the number of threads to be not much larger than the number of cores (300us was used while trying to reproduce the issue above).
Based on the response above, -e cpu -i 300us --alloc 512m --wall 1ms --cstack vm --signal 27/37 should be correct (I didn't know that multiple signals are supported).

apangin (Collaborator) commented Dec 20, 2023

-e cpu -i 300us --alloc 512m --wall 1ms --cstack vm --signal 27/37 should be correct

Yes, looks good.
