Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestJlmRemoteThreadAuth crash LT1 vmState=0x00000000 #18149

Closed
pshipton opened this issue Sep 15, 2023 · 12 comments
Closed

TestJlmRemoteThreadAuth crash LT1 vmState=0x00000000 #18149

pshipton opened this issue Sep 15, 2023 · 12 comments
Labels
arch:aarch64 comp:jit segfault Issues that describe segfaults / JVM crashes test failure

Comments

@pshipton
Copy link
Member

pshipton commented Sep 15, 2023

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/531
TestJlmRemoteThreadAuth_0

STF 20:26:31.687 -   o Process LT1 has crashed unexpectedly

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/531/system_test_output.tar.gz

0SECTION       TITLE subcomponent dump routine
NULL           ===============================
1TICHARSET     UTF-8
1TISIGINFO     Dump Event "gpf" (00002000) received
1TIDATETIMEUTC Date: 2023/09/15 at 00:26:29:384 (UTC)
1TIDATETIME    Date: 2023/09/14 at 20:26:29:384
1TITIMEZONE    Timezone: UTC-4 (EDT)
1TINANOTIME    System nanotime: 1703831458358335
1TIFILENAME    Javacore filename:    /home/jenkins/workspace/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/aqa-tests/TKG/output_16947355463817/TestJlmRemoteThreadAuth_0/20230914-202346-TestJlmRemoteThreadAuth/results/javacore.20230914.202628.2165240.0002.txt
1TIREQFLAGS    Request Flags: 0x41 (exclusive+preempt)
1TIPREPSTATE   Prep State: 0x80 (trace_disabled)
1TIPREPINFO    Exclusive VM access not taken: data may not be consistent across javacore sections
NULL           ------------------------------------------------------------------------
0SECTION       GPINFO subcomponent dump routine
NULL           ================================
2XHOSLEVEL     OS Level         : Linux 4.18.0-348.7.1.el8_5.aarch64
2XHCPUS        Processors -
3XHCPUARCH       Architecture   : aarch64
3XHNUMCPUS       How Many       : 8
3XHNUMASUP       NUMA is either not supported or has been disabled by user
NULL
1XHEXCPCODE    J9Generic_Signal_Number: 00000018
1XHEXCPCODE    Signal_Number: 0000000B
1XHEXCPCODE    Error_Value: 00000000
1XHEXCPCODE    Signal_Code: 00000002
1XHEXCPCODE    Handler1: 0000FFFFAD5B2DA4
1XHEXCPCODE    Handler2: 0000FFFFAD517AF0
1XHEXCPCODE    InaccessibleAddress: 00000000009772B0
NULL
NULL
1XHREGISTERS   Registers:
2XHREGISTER      R0: 0000000000000016
2XHREGISTER      R1: 0000FFFFAD608DFC
2XHREGISTER      R2: 0000000000000000
2XHREGISTER      R3: 0000000000977D00
2XHREGISTER      R4: 0000000000971A00
2XHREGISTER      R5: 0000000000977CA0
2XHREGISTER      R6: 0000000000000000
2XHREGISTER      R7: 0000000000000001
2XHREGISTER      R8: 0000000000000008
2XHREGISTER      R9: 0000FFFF8E79311C
2XHREGISTER      R10: 0000FFFF8D0DE100
2XHREGISTER      R11: 00000000F028CDB0
2XHREGISTER      R12: 00000000F028CDB0
2XHREGISTER      R13: 00000000F028CD90
2XHREGISTER      R14: 00000000000062A0
2XHREGISTER      R15: 0000000000000000
2XHREGISTER      R16: 0000FFFFAD7A02C8
2XHREGISTER      R17: 0000FFFFADC04940
2XHREGISTER      R18: 00000000F0106D50
2XHREGISTER      R19: 0000000000971900
2XHREGISTER      R20: 0000FFFFAD79F000
2XHREGISTER      R21: 00000000FF0FC1C8
2XHREGISTER      R22: 0000000000443750
2XHREGISTER      R23: 0000000000971900
2XHREGISTER      R24: 0000000000000000
2XHREGISTER      R25: 0000FFFF8D0DE590
2XHREGISTER      R26: 0000FFFFADD60000
2XHREGISTER      R27: 0000FFFF8D0DF1E0
2XHREGISTER      R28: 0000FFFF8D0DF1E0
2XHREGISTER      R29: 0000000000971900
2XHREGISTER      R30: 00000000009772B0
2XHREGISTER      R31: 0000FFFF8D0DE540
2XHREGISTER      V0: 0000000000977CB8
2XHREGISTER      V1: FFFFFFFFFFFFFFFF
2XHREGISTER      V2: E3BEA6451A118789
2XHREGISTER      V3: 0000000000000098
2XHREGISTER      V4: 8A77D375292B0A98
2XHREGISTER      V5: 068960FE10A91930
2XHREGISTER      V6: 852471A5F420DD40
2XHREGISTER      V7: 0000000000000098
2XHREGISTER      V8: 0000FFFE54048D70
2XHREGISTER      V9: 0000000000977298
2XHREGISTER      V10: 0000000000000020
2XHREGISTER      V11: 0000000000977288
2XHREGISTER      V12: 000000000000000C
2XHREGISTER      V13: 0000000000977278
2XHREGISTER      V14: 0000FFFF8D0DE150
2XHREGISTER      V15: 0000000000000001
2XHREGISTER      V16: 0000000000000028
2XHREGISTER      V17: 00000000000000B0
2XHREGISTER      V18: 00BFD04B41499B65
2XHREGISTER      V19: C200000000000000
2XHREGISTER      V20: C79ED91518C6E4C1
2XHREGISTER      V21: C2EAF0E46BB84949
2XHREGISTER      V22: F6B7F9EF07E50CE1
2XHREGISTER      V23: 172CDE009048FBFF
2XHREGISTER      V24: F0B59F6E338E1372
2XHREGISTER      V25: A8E8601E3B7F6C62
2XHREGISTER      V26: C7A3A2C877735869
2XHREGISTER      V27: 138D24252925B968
2XHREGISTER      V28: 3A1B90B1E3789AFD
2XHREGISTER      V29: B6F206C0B3BC0443
2XHREGISTER      V30: 4C32ECDF3D33029A
2XHREGISTER      V31: 8B6966E65C612DE2
2XHREGISTER      PC: 00000000009772B0
2XHREGISTER      SP: 0000FFFF8D0DE540
2XHREGISTER      PSTATE: 0000000060001000
NULL
1XHFLAGS       VM flags:0000000000000000
@pshipton pshipton added test failure segfault Issues that describe segfaults / JVM crashes labels Sep 15, 2023
@pshipton
Copy link
Member Author

@knn-k fyi

@knn-k
Copy link
Contributor

knn-k commented Sep 15, 2023

PC = R30 (LR) = InaccessibleAddress = 0x009772B0
It seems it loaded a wrong address into LR, and returned to that address.

@knn-k
Copy link
Contributor

knn-k commented Sep 19, 2023

I ran the following Grinder jobs (30x2).
https://openj9-jenkins.osuosl.org/job/Grinder/2902/
https://openj9-jenkins.osuosl.org/job/Grinder/2904/

I don't see the SEGV above, and there are different exceptions instead.

CL2 stderr java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is: 
CL2 stderr 	java.io.EOFException
CL2 stderr 	at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:255)
CL2 stderr 	at java.rmi/sun.rmi.server.UnicastRef.invoke(UnicastRef.java:165)
CL2 stderr java.rmi.ConnectException: Connection refused to host: 147.28.142.201; nested exception is: 
CL2 stderr 	java.net.ConnectException: Connection refused
CL2 stderr 	at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:626)
CL2 stderr 	at java.rmi/sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:217)

@Akira1Saitoh
Copy link
Contributor

Looks like SEGV was actually happening in the Grinder jobs.

[2023-09-16T10:24:24.166Z] STF 06:24:23.260 - Found dump at: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/javacore.20230916.062422.2223525.0002.txt
[2023-09-16T10:24:24.166Z] STF 06:24:23.261 - Found dump at: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/core.20230916.062422.2223525.0001.dmp
[2023-09-16T10:24:24.166Z] CL2 j> 2023/09/16 06:24:23.487 Error unmarshaling return header; nested exception is:
[2023-09-16T10:24:24.166Z] CL2 java.io.EOFException
[2023-09-16T10:24:24.166Z] CL2 stderr javacore file generated - /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/javacore.20230916.062422.2223525.0002.txt
[2023-09-16T10:24:24.166Z] CL2 stderr core file generated - /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/core.20230916.062422.2223525.0001.dmp

@Akira1Saitoh
Copy link
Contributor

Akira1Saitoh commented Sep 27, 2023

It seems that the link register value saved into the native stack in c_cInterpreter gets overwritten by jitReleaseVMAccess helper called from JNI invocation sequence. When returning from c_cInterpreter, an incorrect value is loaded into the link register. Thus, it crashes on ret instruction.

(gdb) disassemble c_cInterpreter 
Dump of assembler code for function c_cInterpreter:
   0x0000fffff793144c <+0>:     sub     sp, sp, #0x3a0
   0x0000fffff7931450 <+4>:     stp     x19, x20, [sp]
   0x0000fffff7931454 <+8>:     stp     x21, x22, [sp, #16]
   0x0000fffff7931458 <+12>:    stp     x23, x24, [sp, #32]
   0x0000fffff793145c <+16>:    stp     x25, x26, [sp, #48]
   0x0000fffff7931460 <+20>:    stp     x27, x28, [sp, #64]
   0x0000fffff7931464 <+24>:    stp     x29, x30, [sp, #80]    <--- LR (x30) saved into 0xffffd1dbf208
   0x0000fffff7931468 <+28>:    stp     d8, d9, [sp, #96]
   0x0000fffff793146c <+32>:    stp     d10, d11, [sp, #112]
   0x0000fffff7931470 <+36>:    stp     d12, d13, [sp, #128]
   0x0000fffff7931474 <+40>:    stp     d14, d15, [sp, #144]
   0x0000fffff7931478 <+44>:    mov     x19, x0
   0x0000fffff793147c <+48>:    ldr     x27, [x19, #600]
   0x0000fffff7931480 <+52>:    add     x28, sp, #0xa0
   0x0000fffff7931484 <+56>:    str     x28, [x27, #8]
   0x0000fffff7931488 <+60>:    add     x28, sp, #0x1a0
   0x0000fffff793148c <+64>:    str     x28, [x27, #48]
   0x0000fffff7931490 <+68>:    mov     x0, x19
   0x0000fffff7931494 <+72>:    ldr     x27, [x19, #8]
   0x0000fffff7931498 <+76>:    ldr     x28, [x27, #168]
   0x0000fffff793149c <+80>:    blr     x28
(gdb) disassemble jitReleaseVMAccess
Dump of assembler code for function jitReleaseVMAccess:
   0x0000fffff66fba3c <+0>:     stp     x29, x30, [sp, #392]
   0x0000fffff66fba40 <+4>:     stp     x0, x1, [sp, #160]          <--- this instruction overwrites 0xffffd1dbf208
   0x0000fffff66fba44 <+8>:     stp     x2, x3, [sp, #176]

The location where the link register is saved in c_cInterpreter is J9CInterpreterStackFrame.preservedGPRs. The location where the GPR is saved in jitReleaseVMAccess is J9CInterpreterStackFrame.jitGPRs.

UDATA preservedGPRs[12]; /* x19-x30 */
U_8 preservedFPRs[8 * 8]; /* v8-15 */
J9JITGPRSpillArea jitGPRs;
U_8 jitFPRs[32 * 16]; /* v0-v31 */

The reason why jitReleaseVMAccess overwrites preservedGPRs is that the native stack pointer is modified in JNI invocation sequence in order to pass the JNI arguments by the stack. The native stack pointer is modified before calling out to jitReleaseVMAccess. In this case, sp is subtracted by 80 and [sp, #160] points the location where c_cInterpreter saved the link register.

size_t spSize = buildJNIArgs(callNode, deps, passThread, passReceiver, killNonVolatileGPRs);
TR::RealRegister *sp = machine()->getRealRegister(_systemLinkage->getProperties().getStackPointerRegister());
if (spSize > 0)
{
if (constantIsUnsignedImm12(spSize))
{
generateTrg1Src1ImmInstruction(cg(), TR::InstOpCode::subimmx, callNode, sp, sp, spSize);
}
else
{
TR_ASSERT_FATAL(false, "Too many arguments.");
}
}
TR::Register *returnRegister = getReturnRegisterFromDeps(callNode, deps);
#ifdef J9VM_INTERP_ATOMIC_FREE_JNI
TR::Register *zeroReg = cg()->allocateRegister();
deps->addPostCondition(zeroReg, TR::RealRegister::xzr);
#endif
auto postLabelDeps = deps->clonePost(cg());
TR::RealRegister *vmThreadReg = machine()->getRealRegister(getProperties().getMethodMetaDataRegister()); // x19
TR::RealRegister *javaStackReg = machine()->getRealRegister(getProperties().getStackPointerRegister()); // x20
TR::Register *x9Reg = deps->searchPreConditionRegister(TR::RealRegister::x9);
TR::Register *x10Reg = deps->searchPreConditionRegister(TR::RealRegister::x10);
TR::Register *x11Reg = deps->searchPreConditionRegister(TR::RealRegister::x11);
TR::Register *x12Reg = deps->searchPreConditionRegister(TR::RealRegister::x12);
if (createJNIFrame)
{
buildJNICallOutFrame(callNode, (resolvedMethod == comp()->getCurrentMethod()), returnLabel, vmThreadReg, javaStackReg, x9Reg, x10Reg);
}
if (passThread)
{
TR::RealRegister *vmThread = machine()->getRealRegister(getProperties().getMethodMetaDataRegister());
TR::Register *x0Reg = deps->searchPreConditionRegister(TR::RealRegister::x0);
generateMovInstruction(cg(), callNode, x0Reg, vmThread);
}
if (dropVMAccess)
{
#ifdef J9VM_INTERP_ATOMIC_FREE_JNI
releaseVMAccessAtomicFree(callNode, vmThreadReg, x9Reg, x10Reg);
#else
releaseVMAccess(callNode, vmThreadReg, x9Reg, x10Reg, x11Reg, x12Reg);
#endif
}
TR::Instruction *callInstr = generateMethodDispatch(callNode, isJNIGCPoint, deps, targetAddress, x9Reg);

@Akira1Saitoh
Copy link
Contributor

On power, J9CInterpreterStackFrame has a space for JNI outgoing arguments, thus I think power does not have this problem.

/* Linux PPC 64 LE
*
* Stack must be 64-byte aligned.
*/
UDATA backChain; /* caller SP */
UDATA preservedCR; /* callee saves in caller frame */
UDATA preservedLR; /* callee saves in caller frame */
UDATA currentTOC; /* callee saves own TOC in own frame */
UDATA outgoingArguments[J9_INLINE_JNI_MAX_ARG_COUNT];
J9JITGPRSpillArea jitGPRs;
U_8 jitFPRs[32 * 8]; /* fp0-fp31 */
U_8 jitVRs[52 * 16]; /* vsr0-vsr51 */
UDATA align[6];
UDATA preservedGPRs[18]; /* r14-r31 */
U_8 preservedFPRs[18 * 8]; /* fp14-31 */
U_8 preservedVRs[12 * 16]; /* vsr52-vsr63 */

On AArch64, J9CInterpreterStackFrame does not have such space.

UDATA preservedGPRs[12]; /* x19-x30 */
U_8 preservedFPRs[8 * 8]; /* v8-15 */
J9JITGPRSpillArea jitGPRs;
U_8 jitFPRs[32 * 16]; /* v0-v31 */

@Akira1Saitoh
Copy link
Contributor

x avoids this problem by implementing jitReleaseVMAccess helper differently than other helpers.

https://github.com/eclipse-openj9/openj9/blob/32a61988c5d34c32992429391be95b50256426a8/runtime/compiler/x/runtime/X86Codert.nasm#L1006C1-L1082

@Akira1Saitoh
Copy link
Contributor

Opened #18227.

Akira1Saitoh pushed a commit to Akira1Saitoh/openj9 that referenced this issue Oct 5, 2023
This commit adds outgoingArguments array to J9CInterpreterStackFrame
for AArch64 VM to avoid overwriting preservedGPRs and preservedFPRs when
calling out to jitReleaseVMAccess helper from JNI invocation sequence.

This commit also updates ARM64JNILinkage to store JNI method arguments
that are passed by stack to the area where the outgoingArguments array resides.

Adding outgoingArguments makes the size of J9CInterpreterStackFrame large
and some slots are not accessible with ldp/stp instructions
using sp as a base register. Thus, this commit also updates m4 macros
in arm64helpers.m4. They use the start of slots for saving JIT GPRs
as the base address and x16 as the base register.

Issue: eclipse-openj9#18149

Signed-off-by: Akira Saitoh <saiaki@jp.ibm.com>
@knn-k
Copy link
Contributor

knn-k commented Oct 6, 2023

I merged #18227 yesterday.

@knn-k
Copy link
Contributor

knn-k commented Oct 13, 2023

Nightly sanity.system failed earlier this week in TestJlmRemoteThreadAuth_1.
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly/546/
I think it should be handled separately.

CL2 stderr java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is: 
CL2 stderr 	java.io.EOFException
CL2 stderr 	at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:255)
CL2 stderr 	at java.rmi/sun.rmi.server.UnicastRef.invoke(UnicastRef.java:165)
CL2 stderr 	at jdk.remoteref/jdk.jmx.remote.internal.rmi.PRef.invoke(Unknown Source)
CL2 stderr 	at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(RMIConnectionImpl_Stub.java:419)
CL2 stderr 	at java.management.rmi/javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1021)
CL2 stderr 	at net.adoptopenjdk.test.jlm.resources.ThreadData.writeData(ThreadData.java:549)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.ThreadProfiler.getStatsViaServer(ThreadProfiler.java:199)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.ThreadProfiler.main(ThreadProfiler.java:99)
CL2 stderr Caused by: java.io.EOFException
CL2 stderr 	at java.base/java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
CL2 stderr 	at java.base/java.io.DataInputStream.readByte(DataInputStream.java:268)
CL2 stderr 	at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:241)
CL2 stderr 	... 7 more

@knn-k
Copy link
Contributor

knn-k commented Oct 13, 2023

I opened Issue #18280 for the UnmarshalException above.

@knn-k
Copy link
Contributor

knn-k commented Oct 16, 2023

I think this issue can be closed now.
Recent sanity.system jobs are not failing with TestJlmRemoteThreadAuth since #18227 was merged, except for the #18280 case above.

Akira1Saitoh pushed a commit to Akira1Saitoh/openj9 that referenced this issue Nov 20, 2023
…StackFrame

This commit adds outgoingArguments array to J9CInterpreterStackFrame
for AArch64 VM to avoid overwriting preservedGPRs and preservedFPRs when
calling out to jitReleaseVMAccess helper from JNI invocation sequence.

This commit also updates ARM64JNILinkage to store JNI method arguments
that are passed by stack to the area where the outgoingArguments array resides.

Adding outgoingArguments makes the size of J9CInterpreterStackFrame large
and some slots are not accessible with ldp/stp instructions
using sp as a base register. Thus, this commit also updates m4 macros
in arm64helpers.m4. They use the start of slots for saving JIT GPRs
as the base address and x16 as the base register.

Issue: eclipse-openj9#18149

Signed-off-by: Akira Saitoh <saiaki@jp.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch:aarch64 comp:jit segfault Issues that describe segfaults / JVM crashes test failure
Projects
None yet
Development

No branches or pull requests

4 participants