Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable native thread collection in javacores on OSX #14201

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

EricYangIBM
Copy link
Contributor

@EricYangIBM EricYangIBM commented Dec 23, 2021

Enable native thread/stack iteration on x86 OSX, i.e. the omrintrospect
functions used to collect backtraces in the javacore thread section.
Omrintrospect library for OSX is implemented in eclipse/omr#6267

Signed-off-by: Eric Yang eric.yang@ibm.com

Copy link
Contributor

@babsingh babsingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR depends on eclipse/omr#6267. The code changes LGTM.

@EricYangIBM EricYangIBM deleted the osx_nativestack branch January 24, 2022 16:14
@babsingh
Copy link
Contributor

@EricYangIBM This PR was closed due to the deletion of osx_nativestack. Did you intend to close this PR? Can you post a comment describing the future for this work?

@EricYangIBM EricYangIBM restored the osx_nativestack branch January 26, 2022 18:39
@EricYangIBM EricYangIBM reopened this Jan 26, 2022
@EricYangIBM
Copy link
Contributor Author

Looks like I accidentally deleted my branches when trying to push a renamed branch

@EricYangIBM EricYangIBM force-pushed the osx_nativestack branch 2 times, most recently from cf29435 to ff56a38 Compare February 24, 2022 17:26
@EricYangIBM
Copy link
Contributor Author

Is J9AARCH64 the correct flag for AArch64?

@babsingh
Copy link
Contributor

As per cmake/modules/platform/arch/aarch64.cmake#L23-L24, both AARCH64 and J9AARCH64 should work.

@babsingh
Copy link
Contributor

@babsingh
Copy link
Contributor

@tajila Can you help with merging this PR?

@tajila
Copy link
Contributor

tajila commented Apr 20, 2022

@EricYangIBM Is this change to enable native thread collection on x86 mac osx?

@EricYangIBM
Copy link
Contributor Author

Yes, it enables the omrintrospect setup for the thread collection. The omrintrospect_backtrace_thread_raw and omrintrospect_backtrace_symbols_raw are additionally enabled elsewhere already (e.g. printBacktrace in gphandle.c).
Also I'm fairly certain that the x86 mac tests passed with this PR but probably best to double check

@tajila
Copy link
Contributor

tajila commented Apr 20, 2022

Okay, can you please add more details to the commit message + description describing this. I wasn't sure why this change was being made

@tajila
Copy link
Contributor

tajila commented Apr 20, 2022

And just to be clear, support on aarch64 mac osx is still pending?

@EricYangIBM
Copy link
Contributor Author

aarch64 mac osx is still pending

Yes

Enable native thread/stack iteration on x86 OSX, i.e. the omrintrospect
functions used to collect backtraces in the javacore thread section.
Omrintrospect library for OSX is implemented in eclipse/omr#6267

Signed-off-by: Eric Yang <eric.yang@ibm.com>
@tajila
Copy link
Contributor

tajila commented Apr 20, 2022

Jenkins test sanity osx jdk8

@babsingh
Copy link
Contributor

babsingh commented Apr 20, 2022

https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_x86-64_mac_Personal/115/

Five different timeouts occurred while invoking abort:

Running command: "/Users/jenkins/workspace/Test_openjdk8_j9_sanity.functional_x86-64_mac_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java"   -Xmx20m -Xdump:system+heap+snap+jit:none -cp "/Users/jenkins/workspace/Test_openjdk8_j9_sanity.functional_x86-64_mac_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdline_options_testresources/cmdlinetestresources.jar" VMBench.GPTests.GPTest abort
Time spent starting: 29 milliseconds
***[TEST INFO 2022/04/20 11:55:44] ProcessKiller detected a timeout after 2400000 milliseconds!***
INFO: The current OS is 'Mac OS X'. 'Debug on timeout' is currently only supported on Linux.
***[TEST INFO 2022/04/20 11:55:44] executing kill -ABRT 43329***
***[TEST INFO 2022/04/20 11:55:44] kill -ABRT signal sent***
Time spent executing: 2400075 milliseconds
***[TEST INFO 2022/04/20 11:55:44] ABRT completed***
Test result: FAILED
Output from test:
Output from test:
 [OUT] Invoking abort!
 [ERR] JVMDUMP039I Processing dump event "abort", detail "" at 2022/04/20 11:15:44 - please wait.
 [ERR] JVMDUMP032I JVM requested Java dump using '/Users/jenkins/workspace/Test_openjdk8_j9_sanity.functional_x86-64_mac_Personal_testList_0/aqa-tests/TKG/output_1650466071469/cmdLineTest_sigabrtHandlingTest_0/javacore.20220420.111544.43329.0001.txt' in response to an event

@tajila
Copy link
Contributor

tajila commented Apr 21, 2022

@babsingh are these failures expected?

@babsingh
Copy link
Contributor

@babsingh are these failures expected?

No, they are not expected. These timeouts were not seen before merging the OMR changes: eclipse/omr#6267 (comment) and eclipse/omr#6467 (comment). These failures are new, and they happen during javacore generation.

@babsingh
Copy link
Contributor

babsingh commented Apr 21, 2022

Maybe it's specific to the machines used in the PR tests?

Next step: request access to the machine and verify. ++@AdamBrousseau for accessing the machines involved in https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_x86-64_mac_Personal/115/.

@pshipton
Copy link
Member

pshipton commented Apr 21, 2022

@EricYangIBM
Copy link
Contributor Author

I found that it is hanging at J9JavaVM->walkStackFrames on the second last thread

if (j9sig_protect(protectedWalkJavaStack, &closure, handlerJavaThreadWalk, this, J9PORT_SIG_FLAG_SIGALLSYNC | J9PORT_SIG_FLAG_MAY_RETURN, &sink) == 0) {
return (UDATA)closure->jcw->_VirtualMachine->walkStackFrames((J9VMThread*)closure->jcw->_Context->onThread, (J9StackWalkState*)closure->state);

@babsingh
Copy link
Contributor

I found that it is hanging at J9JavaVM->walkStackFrames on the second last thread

  • Can we get a full native stack trace for the thread that is hung?
  • Is there anything different about the second last thread which causes the hang?

@EricYangIBM
Copy link
Contributor Author

I can't get any useful frames

bash-5.0$ lldb ../../../build/macosx-x86_64-normal-server-release/images/j2sdk-image/bin/java -c output_16509110006126/cmdLineTest_sigabrtHandlingTest_0/core.20220425.142501.52045.0001.dmp
(lldb) target create "../../../build/macosx-x86_64-normal-server-release/images/j2sdk-image/bin/java" --core "output_16509110006126/cmdLineTest_sigabrtHandlingTest_0/core.20220425.142501.52045.0001.dmp"
Core file '/Users/jenkins/eric/openj9-openjdk-jdk8/openj9/test/TKG/output_16509110006126/cmdLineTest_sigabrtHandlingTest_0/core.20220425.142501.52045.0001.dmp' (x86_64) was loaded.
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff700e222a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007fffe7464dbc

I'm using --with-native-debug-symbols=internal configure arg as well

@babsingh
Copy link
Contributor

I can't get any useful frames

This work can paused for the time being. Cgroup v2 takes priority and needs to be completed by the end of May. @EricYangIBM Focus on cgroup v2. I will take a look at the outlier failure/hang by the end of this week.

@babsingh
Copy link
Contributor

babsingh commented May 18, 2022

@gacholio Would you have insights about this failure (happens during walkStackFrames)?

Finally got to look at the hang.

  • It is a segfault (EXC_BAD_ACCESS). On OSX, the process just hangs due to the segfault.
  • The segfault is intermittent and only happens on osx1014-x86-2 while walking the Java stack of the Attach API wait loop thread.
  • @EricYangIBM's changes only walk the native thread stacks. Not sure if this failure is related to Eric's changes since it happens while walking the Java stack.
  • Sometimes, deriving the ROM method from walkstate->method fails. Sometimes, just accessing the walkstate itself fails.
  • The test consistently passes once the Attach API wait loop thread is disabled via -Dcom.ibm.tools.attach.enable=no.
  • During reruns, the segfault happens at different locations. Another segfault scenario: getOwnedObjectMonitors ownedmonitors.c:93 -> walkStackFramesVerbose swalk.c:313 -> walkStackFramesVerbose swalk.c:969.

Last output in javacore at segfault:

    3XMTHREADINFO      "Attach API wait loop" J9VMThread:0x000000000E304D00, omrthread_t:0x00007FA4D48AE850, java/lang/Thread:0x00000000FFF68148, state:R, prio=10
    3XMJAVALTHREAD            (java/lang/Thread getId:0x1A, isDaemon:true)
    3XMJAVALTHRCCL            sun/misc/Launcher$AppClassLoader(0x00000000FFF3E7F0)
    3XMTHREADINFO1            (native thread ID:0x45846A8C, native priority:0xA, native policy:UNKNOWN, vmstate:R, vm thread flags:0x00000020)
    3XMTHREADINFO2            (native stack address range from:0x00007000031C9000, to:0x0000700003209000, size:0x40000)
    3XMCPUTIME               CPU usage total: 0.000415000 secs, current category="System-JVM"
    3XMHEAPALLOC             Heap bytes allocated since last GC cycle=4608 (0x1200)
    <SEG_FAULT>

lldb output at segfault:

(lldb) t 5
* thread #5, name = 'main', stop reason = EXC_BAD_ACCESS (code=1, address=0x24)
    frame #0: 0x000000000451bb1e libj9vrb29.dylib`printFrameType at swalk.c:1138:29 [opt]
   1135     J9Method * method = walkState->method;
   1136
   1137     if (method) {
-> 1138         PORT_ACCESS_FROM_WALKSTATE(walkState);
   1139         J9UTF8 * className = J9ROMCLASS_CLASSNAME(UNTAGGED_METHOD_CP(method)->ramClass->romClass);
   1140         J9ROMMethod * romMethod = J9_ROM_METHOD_FROM_RAM_METHOD(method);
   1141         J9UTF8 * name = J9ROMMETHOD_NAME(romMethod);

(lldb) bt
* thread #5, name = 'main', stop reason = EXC_BAD_ACCESS (code=1, address=0x24)
  * frame #0: 0x000000000451bb1e libj9vrb29.dylib`printFrameType at swalk.c:1138:29 [opt]
    frame #1: 0x000000000451bb15 libj9vrb29.dylib`printFrameType(walkState=0x0000700002555f50, frameType="JNI native method") at swalk.c:1121 [opt]
    frame #2: 0x000000000451a809 libj9vrb29.dylib`walkStackFramesVerbose [inlined] walkMethodFrame(walkState=<unavailable>) at swalk.c:0:30 [opt]
    frame #3: 0x000000000451a7a8 libj9vrb29.dylib`walkStackFramesVerbose(currentThread=0x000000000e1dc400, walkState=<unavailable>) at swalk.c:295 [opt]
    frame #4: 0x00000000026bb41a libj9dmp29.dylib`::protectedWalkJavaStack(portLibrary=<unavailable>, args=0x0000700002555bb8) at javadump.cpp:5667:22 [opt]
    frame #5: 0x000000000250af2c libj9prt29.dylib`omrsig_protect(portLibrary=0x00000000009ba0f8, fn=(libj9dmp29.dylib`::protectedWalkJavaStack(J9PortLibrary *, void *) at javadump.cpp:5664), fn_arg=0x0000700002555bb8, handler=<unavailable>, handler_arg=0x0000700002558d80, flags=<unavailable>, result=0x0000700002555bf0) at omrsignal.c:425:12 [opt]
    frame #6: 0x00000000026b91e8 libj9dmp29.dylib`JavaCoreDumpWriter::writeThread(this=0x0000700002558d80, vmThread=<unavailable>, nativeThread=0x0000700002556a00, vmstate=<unavailable>, javaState=<unavailable>, javaPriority=10, lockObject=0x0000000000000000, lockOwnerThread=0x0000000000000000) at javadump.cpp:4445:8 [opt]0000000026bb41a libj9dmp29.dylib`::protectedWalkJavaStack(portLibrary=<unavailable>, args=0x0000700002555bb8) at javadump.cpp:5667:22 [opt]
    ... 40 more frames ...

-verbose:stackwalk=2 output at segfault:

<000000000E304D00>
<000000000E304D00> *** BEGIN STACK WALK, flags = 00000000042C0000, walkThread = 000000000E304D00, walkState = 0000700002555770 ***
<000000000E304D00>  INCLUDE_NATIVES
<000000000E304D00>  ITERATE_FRAMES
<000000000E304D00>  SKIP_INLINES
<000000000E304D00>  VISIBLE_ONLY
<000000000E304D00> Initial values: walkSP = 000000000E1DC200, PC = 0000000000000007, literals = 0000000000000000, A0 = 000000000E1DC238, j2iFrame = 0000000000000000, ELS = 0000700003206CA8, decomp = 0000000000000000
<000000000E304D00> JNI native method frame: bp = 000000000E1DC220, sp = 000000000E1DC200, pc = 0000000000000007, cp = 0000000000000000, arg0EA = 000000000E1DC238, flags = 0000000000000000
<000000000E304D00>  Method: openj9/internal/tools/attach/target/FileLock.lockFileImpl(Ljava/lang/String;IZ)J (000000000E30BB70)
<000000000E304D00> Bytecode frame: bp = 000000000E1DC258, sp = 000000000E1DC240, pc = 0000000005ADE6AC, cp = 000000000E30B320, arg0EA = 000000000E1DC2A0, flags = 0000000000000000
<000000000E304D00>  Method: openj9/internal/tools/attach/target/FileLock.lockFile(ZLjava/lang/String;)Z (000000000E30BB10)
<000000000E304D00> Bytecode frame: bp = 000000000E1DC2B8, sp = 000000000E1DC2A8, pc = 0000000005AD8857, cp = 000000000E301D70, arg0EA = 000000000E1DC2D8, flags = 0000000000000000
<000000000E304D00>  Method: openj9/internal/tools/attach/target/CommonDirectory.obtainControllerLock(Ljava/lang/String;)V (000000000E302810)
<000000000E304D00> Bytecode frame: bp = 000000000E1DC2F0, sp = 000000000E1DC2E0, pc = 0000000005B1A05B, cp = 000000000E33F7C0, arg0EA = 000000000E1DC328, flags = 0000000000000000
<000000000E304D00>  Method: openj9/internal/tools/attach/target/WaitLoop.waitForNotification(Z)Lopenj9/internal/tools/attach/target/Attachment; (000000000E33BC08)
<000000000E304D00> Bytecode frame: bp = 000000000E1DC340, sp = 000000000E1DC330, pc = 0000000005B1A2A4, cp = 000000000E33F7C0, arg0EA = 000000000E1DC358, flags = 0000000000000000
<000000000E304D00>  Method: openj9/internal/tools/attach/target/WaitLoop.run()V (000000000E33BC48)
<000000000E304D00> JNI call-in frame: bp = 000000000E1DC380, sp = 000000000E1DC360, pc = 000000000246AA30, cp = 0000000000000000, arg0EA = 000000000E1DC380, flags = 0000000000000000
<000000000E304D00>  New ELS = 0000000000000000
<000000000E304D00> JNI native method frame: bp = 000000000E1DC3A8, sp = 000000000E1DC388, pc = 0000000000000007, cp = 0000000000000000, arg0EA = 000000000E1DC3A8, flags = 0000000000000000
<000000000E304D00> <end of stack>
<000000000E304D00> *** END STACK WALK (rc = 0) ***
<000000000E304D00>
<000000000E304D00> *** BEGIN STACK WALK, flags = 00000000002C0021, walkThread = 000000000E304D00, walkState = 0000700002555F50 ***
<000000000E304D00>  NO_ABORT
<000000000E304D00>  INCLUDE_NATIVES
<000000000E304D00>  ITERATE_FRAMES
<000000000E304D00>  VISIBLE_ONLY
<000000000E304D00>  RECORD_BYTECODE_PC_OFFSET
<000000000E304D00> Initial values: walkSP = 000000000E1DC280, PC = 0000000000000007, literals = 0000000000000000, A0 = 000000000E1DC2B8, j2iFrame = 0000000000000000, ELS = 0000700003206CA8, decomp = 0000000000000000
<000000000E304D00> JNI native method frame: bp = 000000000E1DC2A0, sp = 000000000E1DC280, pc = 0000000000000007, cp = 0000000000000000, arg0EA = 000000000E1DC2B8, flags = 0000000000000010

@gacholio
Copy link
Contributor

No real insight - what you're seeing would seem to be impossible. Can you dump the walkState at the crashing point? I verified that the code uses walkThread (must always be valid) rather than currentThread (sometimes NULL during javacode production) to fetch the port library.

This smacks of corruption, though why it would only show up on OSX I can't say. Looking into the walkState details will tell us more.

@babsingh
Copy link
Contributor

Can you dump the walkState at the crashing point?

@gacholio In the below walkState, pc and method are primarily bad and cause the segfault. For this failure, I have seen segfaults due to:

WalkState during segfault

(lldb) p *(J9StackWalkState*)0x700001b32730
(J9StackWalkState) $0 = {
  previous = 0x0000000000000000
  walkThread = 0x000000000f64cd00
  javaVM = 0x00007fb5c6804020
  flags = 69992448
  bp = 0x000000002cfc99d0
  unwindSP = 0x0000000000000000
  pc = 0xb80055b813123b03 <no value available>  <--- BAD
  nextPC = 0xb80055b813123b03 <no value available> <--- BAD
  sp = 0x000000002cfc99d8
  arg0EA = 0x002cb6002bb80054
  literals = 0x0001000000190003
  walkSP = 0x000000002cfc99d8
  argCount = 0
  constantPool = 0x0000000000000000
  method = 0x0000000000000000 <--- BAD
  jitInfo = 0x0000000000000000
  frameFlags = 0
  resolveFrameFlags = 0
  skipCount = 0
  maxFrames = 0
  userData1 = 0x0000700001b32c10
  userData2 = 0x0000700001b32ef8
  userData3 = 0x00007fb5c600de70
  userData4 = 0x0000000000000002
  framesWalked = 1
  frameWalkFunction = 0x000000000c663570 (libj9vm29.dylib`getOwnedObjectMonitorsIterator at ownedmonitors.c:120)
  objectSlotWalkFunction = 0x0000000000000006 (0x0000000000000006)
  returnAddressWalkFunction = 0x0000000000000000
  cache = 0x0000000000000000
  restartPoint = 0x000000000000000a
  restartException = 0x0000700001b32a80
  inlinerMap = 0x0000000000000000
  inlineDepth = 0
  cacheCursor = 0x000000000c8fe38e
  decompilationRecord = 0x0000000000000000
  registerEAs = {
    jit_rax = 0x000000000cad67f0
    jit_rbx = 0x0000000000000002
    jit_rcx = 0x0000000000000000
    jit_rdx = 0x000000000ada70f8
    jit_rdi = 0x000000000000000a
    jit_rsi = 0x0000700001b32ad0
    jit_rbp = 0x000000000c8fff87
    jit_rsp = 0x0000000000000000
    jit_r8 = 0x0000000000000000
    jit_r9 = 0x0000000000000000
    jit_r10 = 0x0000000000000000
    jit_r11 = 0x0505000000000000
    jit_r12 = 0x000000002e2f0020
    jit_r13 = 0x000000000cad6108
    jit_r14 = 0x0000000000000000
    jit_r15 = 0x0000000000000000
  }
  walkedEntryLocalStorage = 0x00007000027e3ca8
  i2jState = 0x00007000027e3cb8
  decompilationStack = 0x0000000000000000
  pcAddress = 0x000000002cfc99c8
  outgoingArgCount = 0
  objectSlotBitVector = 0x0000000000000000 <no value available>
  elsBitVector = 0
  savedObjectSlotWalkFunction = 0x0000000000000000
  bytecodePCOffset = -1
  dropToCurrentFrame = 0x000000000c670e00 (libj9vm29.dylib`dropToCurrentFrame at swalk.c:1572)
  j2iFrame = 0x0000000000000000
  previousFrameFlags = 0
  slotIndex = 0
  slotType = 0
  currentThread = 0x000000000f524400
  linearSlotWalker = 0x0000000000000000
  inlinedCallSite = 0x0000000000000000
  stackMap = 0x0000000000000000
  inlineMap = 0x0000000000000000
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants