
Crash in WallClock::getThreadState #862

Closed
vans239 opened this issue Dec 8, 2023 · 15 comments

vans239 commented Dec 8, 2023

Async-profiler caused a JVM crash.
Async-profiler is provided via ap-loader: me.bechberger:ap-loader:2.9-2-all
Java: 21.0.1+12

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f3af0fe14b4, pid=26416, tid=26737
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.1+12 (21.0.1+12) (build 21.0.1+12-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (21.0.1+12-LTS, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# C  [libasyncProfiler.so+0xd4b4]  WallClock::getThreadState(void*)+0x14
#
# Core dump will be written. Default location: /data/coredumps/core.pid%P.uid%u.sig%s.%e.%t.26416
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid26416.log
[106500.162s][warning][os] Loading hsdis library failed

Thread stack:

---------------  T H R E A D  ---------------

Current thread (0x00007f3b1bc4d1a0):  JavaThread "foreground"        [_thread_in_native, id=26737, stack(0x00007f388e0f8000,0x00007f388e1f9000) (1028K)]

Stack: [0x00007f388e0f8000,0x00007f388e1f9000],  sp=0x00007f388e1f67b0,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libasyncProfiler.so+0xd4b4]  WallClock::getThreadState(void*)+0x14
C  [libasyncProfiler.so+0x35889]  WallClock::signalHandler(int, siginfo*, void*)+0x29
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 16870  ...NativeUtils.recvfrom(IJII)J (0 bytes) @ 0x00007f3b0874d902 [0x00007f3b0874d8a0+0x0000000000000062]
...

I can provide additional debug information from the hs_err_pid file if needed.

vans239 changed the title from "Jvm crash" to "Crash in WallClock::getThreadState" on Dec 8, 2023
apangin (Collaborator) commented Dec 8, 2023

Yes, please provide the registers section and the memory map (Dynamic libraries) from hs_err.
You may send the file privately if you wish.

vans239 (Author) commented Dec 8, 2023

Registers:
RAX=0x0000000000000001, RBX=0x0000000000000423, RCX=0x00000000000001e4, RDX=0x00007f388e1f5780
RSP=0x00007f388e1f67b0, RBP=0x00007f388e1f67d0, RSI=0x00007f388e1f68f0, RDI=0x00007f388e1f5780
R8 =0x0000000000000423, R9 =0x00007f388e1f6a64, R10=0x00007f3818016e60, R11=0x00007f3818019bb0
R12=0x00007f388e1f68f0, R13=0x00007f388e1f5780, R14=0x0000000010000000, R15=0x00007f388e1f7b24
RIP=0x00007f3af0fe14b4, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e


Register to memory mapping:

RAX=0x0000000000000001 is an unknown value
RBX=0x0000000000000423 is an unknown value
RCX=0x00000000000001e4 is an unknown value
RDX=0x00007f388e1f5780 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RSP=0x00007f388e1f67b0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RBP=0x00007f388e1f67d0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RSI=0x00007f388e1f68f0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
RDI=0x00007f388e1f5780 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R8 =0x0000000000000423 is an unknown value
R9 =0x00007f388e1f6a64 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R10=0x00007f3818016e60 points into unknown readable memory: 0x00007f38d8016e60 | 60 6e 01 d8 38 7f 00 00
R11=0x00007f3818019bb0 points into unknown readable memory: 0x021420fe092d534c | 4c 53 2d 09 fe 20 14 02
R12=0x00007f388e1f68f0 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R13=0x00007f388e1f5780 is pointing into the stack for thread: 0x00007f3b1bc4d1a0
R14=0x0000000010000000 is an unknown value
R15=0x00007f388e1f7b24 is pointing into the stack for thread: 0x00007f3b1bc4d1a0

vans239 (Author) commented Dec 8, 2023

Happy to share the hs_err file privately. What email should I use?

apangin (Collaborator) commented Dec 8, 2023

support at profiler . tools
thanks

Has the error happened just once or multiple times?

vans239 (Author) commented Dec 8, 2023

Sent the file to that email.

I have seen 2 failures due to async-profiler. However, I am not sure the second one was caused by the same issue, and I no longer have any details about it.

apangin (Collaborator) commented Dec 8, 2023

The crash looks weird to me: in the signal handler, the pc value from ucontext is some random value.
Async-profiler only reads what it receives from the kernel; I don't see how ucontext could be corrupted - might be a kernel or glibc bug.
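
For context, here is a minimal sketch (not async-profiler's actual code) of how a profiling signal handler on Linux/x86-64 reads the interrupted pc from the ucontext argument the kernel passes in. The handler has no way to validate this value before using it:

#include <csignal>
#include <cstdint>
#include <ucontext.h>

// Minimal sketch (not async-profiler's actual code): a profiling signal
// handler that extracts the interrupted program counter from the ucontext
// supplied by the kernel on Linux/x86-64.
static void profilingHandler(int signo, siginfo_t* info, void* ucontext) {
    ucontext_t* uc = static_cast<ucontext_t*>(ucontext);
    // REG_RIP holds the saved instruction pointer of the interrupted thread.
    // The handler must trust this value; if a buggy kernel, libc, or an
    // ill-behaved earlier handler delivers garbage such as 0x423, reading
    // the instruction bytes at pc faults exactly as in the crash above.
    uintptr_t pc = static_cast<uintptr_t>(uc->uc_mcontext.gregs[REG_RIP]);
    (void)signo; (void)info; (void)pc;
}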

vans239 (Author) commented Dec 11, 2023

Thanks for looking into this.

I'll look into upgrading dependencies, but it will take a long time.
For visibility:

uname: Linux 5.3.7-301
libc: glibc 2.30 NPTL 2.30

Feel free to close the issue if you believe it's not caused by async-profiler.

apangin (Collaborator) commented Dec 11, 2023

Yeah, the crash happened when async-profiler tried to access an instruction at the address obtained here:

uintptr_t pc = frame.pc();

The pc address is passed to the signal handler from outside. Normally it points to a valid instruction, but here pc = 0x423, which suggests some external bug.

I just thought of another (although unlikely) hypothesis: for instance, some third-party library uses the same signal as async-profiler (SIGVTALRM), but does not do signal chaining properly. If this is the case, you may configure async-profiler to use a different signal, see #759.
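
For illustration, here is a minimal sketch (hypothetical; not taken from OpenOnload or async-profiler) of what proper signal chaining looks like: save the previous sigaction when installing your handler, and forward the signal to it afterwards.

#include <csignal>
#include <cstring>

static struct sigaction g_prev;  // previous owner's action, saved at install time

static void chainedHandler(int signo, siginfo_t* info, void* ucontext) {
    // ... this library's own signal work goes here ...

    // Forward the signal to whoever owned it before us, preserving the
    // original siginfo and ucontext, instead of swallowing it.
    if (g_prev.sa_flags & SA_SIGINFO) {
        if (g_prev.sa_sigaction != nullptr)
            g_prev.sa_sigaction(signo, info, ucontext);
    } else if (g_prev.sa_handler != SIG_DFL && g_prev.sa_handler != SIG_IGN) {
        g_prev.sa_handler(signo);
    }
}

static void installChained(int signo) {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = chainedHandler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(signo, &sa, &g_prev);  // the old action lands in g_prev
}

If a library skips this step, or forwards with a stale or synthetic ucontext instead of the original one, the next handler in the chain sees a bogus pc - which would match the 0x423 observed above.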

I'll close the issue for now. Please update the issue if the bug happens again. We'll see if there is a common pattern.

vans239 (Author) commented Dec 11, 2023

Hmm, your guess about the signal handler seems plausible.
We use OpenOnload, which does something with signal handlers and listens for SIGVTALRM (I haven't looked into the details yet).

Thanks for the help!!!

vans239 (Author) commented Dec 20, 2023

Update:
I am able to reliably reproduce the issue. It indeed happens only when OpenOnload is enabled.

Unexpectedly, it affects only the wall-clock profiler; I am not able to reproduce the issue with the cpu profiler.

apangin (Collaborator) commented Dec 20, 2023

Thank you for the update. So it looks like OpenOnload indeed uses SIGVTALRM for its own purposes but does not take care of signal chaining. CPU profiling works with a different signal - SIGPROF. Did you try configuring an alternative profiling signal? E.g., to use SIGPROF for wall-clock profiling, run asprof -e wall --signal 27
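
For example, a full invocation might look like this (illustrative only: <pid> stands for the target JVM's process id, and the duration and output path are arbitrary):

asprof -e wall --signal 27 -d 30 -f /tmp/wall.jfr <pid>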

vans239 (Author) commented Dec 20, 2023

Would it work with multi-event profiling? So far it doesn't seem to fail, so the workaround works.

I am trying to collect wall and alloc together: asprof -d 15 -a -f /tmp/flame.jfr --cstack vm -F vtable -t -d 10 -e cpu,alloc --wall 300us -i 300us --signal 27 jps
Does this make sense?

apangin (Collaborator) commented Dec 20, 2023

If you want to profile cpu+wall together, you'll need two different signals. Something like --signal 27/37 should work (meaning: use signal 27 = SIGPROF for cpu profiling and signal 37 for wall-clock profiling).

As a side note, there are multiple issues with your asprof command:

  • -d is specified twice;
  • -a and -t are relevant for text formats, but do nothing with jfr;
  • -F vtable is not yet supported with --cstack vm (will fix this later);
  • -e cpu,alloc --wall works, but it mixes two styles and is discouraged.
    The preferred syntax for multi-event profiling is -e cpu -i 300us --alloc 512m --wall 1ms (see the full example after this list).
  • --wall 300us: the interval for wall-clock profiling is too low; it may result in high overhead. Values below 1ms are not recommended unless you limit the number of wall-profiled threads with the --filter option.
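
Putting these points together, a cleaned-up version of your command might look like this (illustrative; it keeps your output path and target, drops the flags that do nothing here, and adds the dual-signal workaround from above):

asprof -d 15 -f /tmp/flame.jfr --cstack vm -e cpu -i 300us --alloc 512m --wall 1ms --signal 27/37 jps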

vans239 (Author) commented Dec 20, 2023

Thanks! I appreciate the response.
Some of the issues above are intentional (I reuse the same default args for different use cases and hope the irrelevant ones are ignored).

We pin threads to cores, so I'd expect the number of threads to be not much larger than the number of cores (300us was used while trying to reproduce the issue above).
Based on the response above, -e cpu -i 300us --alloc 512m --wall 1ms --cstack vm --signal 27/37 should be correct (I didn't know that multiple signals are supported).

apangin (Collaborator) commented Dec 20, 2023

-e cpu -i 300us --alloc 512m --wall 1ms --cstack vm --signal 27/37 should be correct

Yes, looks good.
