jdk19 openjdk java/foreign/SafeFunctionAccessTest.java crash vmState=0x00000000 #16606

Closed
pshipton opened this issue Jan 25, 2023 · 30 comments · Fixed by #16663
Labels: comp:vm, jdk19, project:panama (Used to track Project Panama related work), test failure

@pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96
jdk_foreign_0
java/foreign/SafeFunctionAccessTest.java

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96/openjdk_test_output.tar.gz

23:23:04  STDOUT:
23:23:04  test SafeFunctionAccessTest.testClosedStruct(): success
23:23:04  test SafeFunctionAccessTest.testClosedStructAddr_6(): success
23:23:04  STDERR:
23:23:04  Unhandled exception
23:23:04  Type=Segmentation error vmState=0x00000000
23:23:04  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
23:23:04  Handler1=000079261F33FCC0 Handler2=000079261F268AA0
23:23:04  R0=000079261F39CB90 R1=00007925C52AAE90 R2=000079261F576D00 R3=0000000000296700
23:23:04  R4=0000000000000060 R5=FFFFFFFFFFFFFFA0 R6=0000000000000008 R7=0000000000000001
23:23:04  R8=0000000000000010 R9=000079261802DC30 R10=0000000000000000 R11=0000000000000000
23:23:04  R12=000079261F1FFE60 R13=00007925C52B68E0 R14=0000000000000003 R15=000000004A39564D
23:23:04  R16=0000000000000000 R17=000079256800E0F0 R18=00007925C52AB500 R19=0000792568179B60
23:23:04  R20=0000000080000002 R21=00007925C5AAEDBC R22=000000000029BE60 R23=FFFFFFFFFFFFFFFD
23:23:04  R24=00007925C52AB060 R25=0000000000000002 R26=0000000040000000 R27=0000000000000000
23:23:04  R28=000079261802DC30 R29=0000792568137ED0 R30=0000000000296700 R31=0000000000296700
23:23:04  NIP=000079261F39CC20 MSR=800000010280F033 ORIG_GPR3=00000000000081C8 CTR=000079261F1FFE60
23:23:04  LINK=000079261F39CB90 XER=0000000020000000 CCR=0000000040888802 SOFTE=0000000000000001
23:23:04  TRAP=0000000000000300 DAR=0000000000000014 dsisr=0000000040000000 RESULT=0000000000000000
23:23:04  FPR0 000079261f9e124c (f: 530453056.000000, d: 6.581184e-310)
23:23:04  FPR1 4054d68e00000000 (f: 0.000000, d: 8.335242e+01)
23:23:04  FPR2 3ff017b09c78239b (f: 2625119232.000000, d: 1.005784e+00)
23:23:04  FPR3 3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
23:23:04  FPR4 3fce840b4ac4e4d2 (f: 1254417664.000000, d: 2.384047e-01)
23:23:04  FPR5 bfe7154748bef6c8 (f: 1220474624.000000, d: -7.213475e-01)
23:23:04  FPR6 3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
23:23:04  FPR7 bfb1973c5a611ccc (f: 1516313856.000000, d: -6.871392e-02)
23:23:04  FPR8 bfdffffef20a4123 (f: 4060758272.000000, d: -4.999997e-01)
23:23:04  FPR9 bfd00ea348b88334 (f: 1220051712.000000, d: -2.508934e-01)
23:23:04  FPR10 bfdfa58a805efc44 (f: 2153708544.000000, d: -4.944788e-01)
23:23:04  FPR11 41cdcd6500000000 (f: 0.000000, d: 1.000000e+09)
23:23:04  FPR12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR13 4000000000000000 (f: 0.000000, d: 2.000000e+00)
23:23:04  FPR14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR16 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR17 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR18 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR19 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR20 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR21 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR22 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR23 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:23:04  Module=/home/jenkins/workspace/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/openjdkbinary/j2sdk-image/lib/default/libj9vm29.so
23:23:04  Module_base_address=000079261F300000
23:23:04  Target=2_90_20230124_165 (Linux 4.15.0-175-generic)
23:23:04  CPU=ppc64le (16 logical CPUs) (0x1fdc30000 RAM)
23:23:04  ----------- Stack Backtrace -----------
23:23:04  native2InterpJavaUpcallImpl+0x180 (0x000079261F39CC20 [libj9vm29.so+0x9cc20])
23:23:04   (0x00007925FC000048 [<unknown>+0x0])
23:23:04  addr_func_cb+0x24 (0x00007925C4D00804 [libSafeAccess.so+0x804])
23:23:04   (0x000079261F4FA018 [libj9vm29.so+0x1fa018])
23:23:04  ffi_call_int+0xd4 (0x000079261F4F9CE4 [libj9vm29.so+0x1f9ce4])
23:23:04  bytecodeLoopCompressed+0xfba4 (0x000079261F3C51C4 [libj9vm29.so+0xc51c4])
23:23:04   (0x000079261F4668E0 [libj9vm29.so+0x1668e0])
23:23:04  runJavaThread+0x250 (0x000079261F316D80 [libj9vm29.so+0x16d80])
23:23:04  javaProtectedThreadProc+0x138 (0x000079261F3B3AF8 [libj9vm29.so+0xb3af8])
23:23:04  omrsig_protect+0x3f4 (0x000079261F269F74 [libj9prt29.so+0x39f74])
23:23:04  javaThreadProc+0x60 (0x000079261F3AF260 [libj9vm29.so+0xaf260])
23:23:04  thread_wrapper+0x190 (0x000079261F1FCBC0 [libj9thr29.so+0xcbc0])
23:23:04  start_thread+0x10c (0x000079261FD1885C [libpthread.so.0+0x885c])
23:23:04  clone+0x98 (0x000079261FC08C98 [libc.so.6+0x158c98])
23:23:04  ---------------------------------------
@pshipton

@tajila @ChengJin01 fyi

ChengJin01 added the project:panama (Used to track Project Panama related work) label on Jan 25, 2023
ChengJin01 commented Jan 25, 2023

The javacore & core dump indicate the test suite failed in testClosedStructCallback() while performing the upcall during the downcall.

3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at openj9/internal/foreign/abi/InternalDowncallHandler.invokeNative(Native Method)
4XESTACKTRACE                at openj9/internal/foreign/abi/InternalDowncallHandler.runNativeMethod(InternalDowncallHandler.java:538)
4XESTACKTRACE                at java/lang/invoke/LambdaForm$DMH/0x00000000680d7350.invokeSpecial(LambdaForm$DMH)
4XESTACKTRACE                at java/lang/invoke/LambdaForm$MH/0x0000000068178fc0.invoke(LambdaForm$MH)
4XESTACKTRACE                at java/lang/invoke/LambdaForm$MH/0x000000006817c0a0.invoke_MT(LambdaForm$MH)
4XESTACKTRACE                at SafeFunctionAccessTest.testClosedStructCallback(SafeFunctionAccessTest.java:165) <----

against the test code at https://github.com/ibmruntimes/openj9-openjdk-jdk19/blob/be0b953d126ad1302eb49c4467f7017ef4853b9c/test/jdk/java/foreign/SafeFunctionAccessTest.java#L165

    @Test
    public void testClosedStructCallback() throws Throwable {
        MethodHandle handle = Linker.nativeLinker().downcallHandle(
                findNativeOrThrow("addr_func_cb"),
                FunctionDescriptor.ofVoid(C_POINTER, C_POINTER));

        try (MemorySession session = MemorySession.openConfined()) {
            MemorySegment segment = MemorySegment.allocateNative(POINT, session);
            handle.invoke(segment, sessionChecker(session)); <--------
        }
    }

along with the stacktrace in gdb:

#14 <signal handler called>
#15 0x000079261f39cc20 in getReturnTypeFromMetaData (data=0x792568137ed0) 
at /home/jenkins/workspace/Build_JDK19_ppc64le_linux_Nightly/openj9/runtime/vm/UpcallVMHelpers.cpp:159 <-------
#16 native2InterpJavaUpcallImpl (data=0x792568137ed0, argsListPointer=0x7925c52ab060) 
at /home/jenkins/workspace/Build_JDK19_ppc64le_linux_Nightly/openj9/runtime/vm/UpcallVMHelpers.cpp:241
#17 0x00007925fc000048 in ?? ()
#18 0x00007925c4d00804 in ?? ()
#19 0x000079261f4fa018 in ffi_call_LINUX64 () at /home/jenkins/workspace/Build_JDK19_ppc64le_linux_Nightly/openj9/runtime/libffi/powerpc/linux64.S:178
#20 0x000079261f4f9ce4 in ffi_call_int (cif=0x7925680d5100, fn=<optimized out>, rvalue=<optimized out>, avalue=<optimized out>, closure=<optimized out>) at /home/jenkins/workspace/Build_JDK19_ppc64le_linux_Nightly/openj9/runtime/libffi/powerpc/ffi.c:105
#21 0x000079261f3c51c4 in VM_BytecodeInterpreterCompressed::inlInternalDowncallHandlerInvokeNative (_pc=<synthetic pointer>: <optimized out>, _sp=<synthetic pointer>: <optimized out>, this=0x7925c52ab580) at ....

(gdb) p currentThread
$2 = (J9VMThread *) 0x296700 <--- Valid
(gdb) p data->mhMetaData
$4 = (jobject) 0x7926180ec920 <--- might be corrupted/outdated

against the native code in getReturnTypeFromMetaData():

getReturnTypeFromMetaData(J9UpcallMetaData *data)
{
	J9JavaVM *vm = data->vm;
	J9VMThread *currentThread = currentVMThread(vm);
	j9object_t methodType = J9VMOPENJ9INTERNALFOREIGNABIUPCALLMHMETADATA_CALLEETYPE(currentThread,
			J9_JNI_UNWRAP_REFERENCE(data->mhMetaData));

So the only plausible cause of the crash is that the address in data->mhMetaData was corrupted or stale, in which case the macro failed to locate the MethodType field stored in UpcallMHMetaData.

ChengJin01 commented Jan 25, 2023

According to our existing implementation in upcall at

upcallMetaData->mhMetaData = j9jni_createGlobalRef((JNIEnv*)currentThread, mhMetaData, false);

mhMetaData is stored as a JNI global reference, which means it cannot be discarded before the upcall completes (the crash occurred in the dispatcher, which means the upcall method had not yet been invoked at that point).

Meanwhile, the intention of the test is to validate the MemorySession by closing it inside the upcall method, and we already have code in place to ensure the current session is never terminated until the whole upcall invocation is done.

So I can't imagine how this could still happen in the upcall.
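
For reference, here is a minimal sketch of what such a session-checking upcall looks like; it is illustrative only and not the verbatim test body, which merely exposes the checkSession(MemorySession) signature through the MethodHandle lookup shown later in this thread:

    // Illustrative sketch, not the verbatim test code: the upcall tries to close
    // the confined session that is still in use by the in-flight downcall, and the
    // linker runtime is expected to reject the close because it keeps the session
    // alive for the duration of the invocation.
    static void checkSession(MemorySession session) {
        try {
            session.close();
            throw new AssertionError("the confined session was closed during the upcall");
        } catch (IllegalStateException expected) {
            // expected: the session is acquired by the linker while the downcall is running
        }
    }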

ChengJin01 commented Jan 25, 2023

Launched a Grinder (x200) at https://openj9-jenkins.osuosl.org/job/Grinder/1887/ to see how frequently the crash can be reproduced.

@ChengJin01

The issue can be reproduced in the Grinder above (1/100). So I'd like to know whether this issue reproduces on other platforms or only happens on Linux/ppc64le.

Launched Grinders (x300) on AIX, zLinux, and Linux/x86_64:
https://openj9-jenkins.osuosl.org/job/Grinder/1892/ (AIX)
https://openj9-jenkins.osuosl.org/job/Grinder/1893/ (zLinux)
https://openj9-jenkins.osuosl.org/job/Grinder/1894/ (Linux/x86_64)

ChengJin01 commented Jan 27, 2023

[1] The crash in this failing test didn't occur on AIX, zLinux, or Linux/x86_64 based on the results of the Grinders launched above, so this is a pLinux-specific issue.

[2] As discussed with @gac offline, the only thing we have confirmed so far is that data->mhMetaData was corrupted, as compared with the shape of the object on Linux/x86_64.
pLinux

(gdb) p data->mhMetaData
$6 = (jobject) 0x7926180ec920
(gdb) x/10x data->mhMetaData
0x7926180ec920: 0x00000008      0x00000000      0x00000008      0x00000000 <-----
0x7926180ec930: 0x00000008      0x00000000      0x00000008      0x00000000
0x7926180ec940: 0x00000008      0x00000000

Linux/x86_64

(gdb) p data->mhMetaData
$3 = (jobject) 0x7ffff0091908
(gdb) x/10x data->mhMetaData
0x7ffff0091908: 0xfffdc370      0x00000007      0x00000008      0x00000000 <-------
0x7ffff0091918: 0x00000008      0x00000000      0x00000008      0x00000000
0x7ffff0091928: 0x00000008      0x00000000

[3] We also ruled out most of the possibilities from the VM perspective, which include:

  1. The global reference was deleted earlier: impossible, as the deletion only happens when the VM exits in freeJavaVM via freeUpcallMetaDataDoFn, at which point the test is already completed.
  2. The failing test is not specific to a multithreading environment, and we already have VM_VMAccess::inlineEnterVMFromJNI() to handle the memory barriers before the macro in getReturnTypeFromMetaData(), so no additional read/write barrier is required for the macro:

static U_64 JNICALL
native2InterpJavaUpcallImpl(J9UpcallMetaData *data, void *argsListPointer)
{
...
	VM_VMAccess::inlineEnterVMFromJNI(currentThread); <-------
	returnType = getReturnTypeFromMetaData(data);

  3. The global reference is created under the mutex when preparing the upcall stub, so it never gets corrupted there.

@ChengJin01

The whole data structure was also unexpectedly corrupted in the dumps from the Grinders above at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder/1887/openjdk_test_output.tar.gz

(gdb) p *data
$2 = {
  vm = 0x2744554d50, <---- corrupted
  downCallThread = 0x0, <--- wrong 
  mhMetaData = 0x30504d55444d564a, <--- corrupted
  upCallCommonDispatcher = 0x636f725020493933,  <--- corrupted
  thunkAddress = 0x6420676e69737365, <--- wrong
  thunkSize = 7954894493527862645,
  nativeFuncSignature = 0x2273243125222074,
  functionPtr = {7811882181195997228, 3614752383930016288, 2675266139157852964},
  thunkHeapWrapper = 0x6c70202d20732434
}
(gdb) p data->vm
$3 = (J9JavaVM *) 0x2744554d50
(gdb) p *data->vm
Cannot access memory at address 0x2744554d50
(gdb) x/10x   data->mhMetaData
0x30504d55444d564a:     Cannot access memory at address 0x30504d55444d564a

So we are wondering whether there is any chance the data was corrupted in the thunk code after being passed to the thunk generation code when preparing the upcall stub.

@ChengJin01

@zl-wang, could you help take a look at this issue from the thunk perspective to see why the data was corrupted given the upcall dispatcher received it from the thunk code?

zl-wang commented Jan 27, 2023

@ChengJin01 it doesn't make sense that thunk-gen could corrupt fields it never tries to write:

[zlwang@pjavanfs xl64]$ grep metaData UpcallThunkGen.cpp
 * @param metaData[in/out] a pointer to the given J9UpcallMetaData
createUpcallThunk(J9UpcallMetaData *metaData)
	J9JavaVM *vm = metaData->vm;
	J9UpcallSigType *sigArray = metaData->nativeFuncSignature->sigArray;
	I_32 lastSigIdx = (I_32)(metaData->nativeFuncSignature->numSigs - 1);
	/* To call the dispatcher: load metaData, load targetAddress, set-up argListPtr, mtctr, bctr */
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcall0;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcall1;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcallJ;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcallF;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcallD;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcallStruct;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcallStruct;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcallStruct;
			metaData->upCallCommonDispatcher = (void *)vmFuncs->native2InterpJavaUpcallStruct;
	 * another 8-byte to store metaData pointer itself
	metaData->thunkSize = roundedCodeSize + 8;
	thunkMem = (I_32 *)vmFuncs->allocateUpcallThunkMemory(metaData);
	metaData->thunkAddress = (void *)thunkMem;
	 * in which case we can load the metaData by a fixed offset
	/* Store the metaData pointer */
	*(J9UpcallMetaData **)((char *)thunkMem + roundedCodeSize) = metaData;
	vmFuncs->doneUpcallThunkGeneration(metaData, (void *)thunkMem);

All reads/writes are in C code. The only fields touched are: read -- vm and nativeFuncSignature; write -- upCallCommonDispatcher, thunkSize, and thunkAddress. And the same accesses happen on AIX/zLinux/xLinux, by the way.

zl-wang commented Jan 27, 2023

No bulk memcpy was ever done to the metaData pointer during thunk-gen. It is just beyond my imagination that the gcc compiler did something horrific to corrupt the metaData fields.

I think it is more likely that memory management of the related area is at fault.

zl-wang commented Jan 27, 2023

Can you recover/see the correct thunkAddress? The metaData pointer is stored at the end of the thunk. When the thunk is executed, it retrieves the metaData pointer to pass along to the common dispatcher. If you can see it, we can verify whether the existing "data" matches the metaData.

zl-wang commented Jan 27, 2023

By the way, what is the intended upcall signature? I am wondering whether the common dispatcher stores back into the caller frame (when the thunk didn't create a frame for a suitable signature) and corrupts something. I doubt it very much, since that would happen every time, instead of 1 out of 100 runs. [PS: never mind ... I recalled that the common dispatcher has only 2 arguments. gcc cannot/won't store back into the caller frame.]

@ChengJin01

I think it is more likely that memory management of the related area is at fault.

That's something we suspected but there is no strong evidence to confirm that for now.

can you recover/see the correct thunkAddress?

There is no way to do that for the dumps at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder/1887/openjdk_test_output.tar.gz because the only place the thunk address (data->thunkAddress) can be seen is in the dispatcher, after it has been set by the thunk generation code via createUpcallThunk when preparing the upcall stub.

For the dumps in the description at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96/openjdk_test_output.tar.gz, only data->mhMetaData was corrupted while the thunkAddress was valid.

zl-wang commented Jan 27, 2023

Ah ... as long as the thunkAddress is valid, I will be able to see the correct metaData address embedded in the thunk. Post an instruction listing (disassembly) at that thunkAddress for, say, 32 or 50 instructions (more than the thunk length, because the embedded data is at the end of the thunk), and I will be able to tell you what the embedded metaData pointer is.

@ChengJin01

Here's the assembly code starting from the thunkAddress according to the core dump in https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96/openjdk_test_output.tar.gz

(gdb) frame  15
#15 0x000079261f39cc20 in getReturnTypeFromMetaData (data=0x792568137ed0) at /home/jenkins/workspace/Build_JDK19_ppc64le_linux_Nightly/openj9/runtime/vm/UpcallVMHelpers.cpp:159
159     /home/jenkins/workspace/Build_JDK19_ppc64le_linux_Nightly/openj9/runtime/vm/UpcallVMHelpers.cpp: No such file or directory.
(gdb) p *data
$1 = {
  vm = 0x79261802dc30,
  downCallThread = 0x296700,
  mhMetaData = 0x7926180ec920,
  upCallCommonDispatcher = 0x79261f39e580 <native2InterpJavaUpcall0(J9UpcallMetaData*, void*)>,
  thunkAddress = 0x7925fc000028, <----------
  thunkSize = 56,
  nativeFuncSignature = 0x792568179b60,
  functionPtr = {133201567083744, 133201567083840, 15987178197214948768},
  thunkHeapWrapper = 0x7925680bc6d0
}
(gdb) disas 0x7925fc000028, 0x7925fc000028 + 60
Dump of assembler code from 0x7925fc000028 to 0x7925fc000064:
   0x00007925fc000028:  mflr    r0
   0x00007925fc00002c:  std     r0,16(r1)
   0x00007925fc000030:  stdu    r1,-64(r1)
   0x00007925fc000034:  ld      r3,48(r12)
   0x00007925fc000038:  ld      r12,24(r3)
   0x00007925fc00003c:  addi    r4,r1,48
   0x00007925fc000040:  mtctr   r12
   0x00007925fc000044:  bctrl
   0x00007925fc000048:  addi    r1,r1,64
   0x00007925fc00004c:  ld      r0,16(r1)
   0x00007925fc000050:  mtlr    r0
   0x00007925fc000054:  blr
   0x00007925fc000058:  xori    r19,r0,32464
   0x00007925fc00005c:  .long 0x7925
   0x00007925fc000060:  .long 0x7
End of assembler dump.

zl-wang commented Jan 27, 2023

So, the metaData pointer is embedded at 0x00007925fc000058: the xori encoding should be 0x6A607ED0, such that the metaData pointer should be 0x000079256A607ED0. You can verify whether "data" is that value.

Also, I noticed there is no register stored back into the thunk-created frame, i.e. the upcall has no arguments.

ChengJin01 commented Jan 27, 2023

so, the metaData pointer is embedded at 0x00007925fc000058: the xori encoding should be 0x6A607ED0. such that, the metaData pointer should be 0x000079256A607ED0. you can verify if "data" is that value.

Unfortunately, none of your results matches the values in the core dump above at #16606 (comment).

(gdb) p data <---- this is the J9UpcallMetaData pointer passed to the thunk gen code.
$2 = (J9UpcallMetaData *) 0x792568137ed0 <-------
(gdb) p *data
$3 = {
  vm = 0x79261802dc30,
  downCallThread = 0x296700,
  mhMetaData = 0x7926180ec920, <-------- this is the metadata we'd like to use to access the java field
  upCallCommonDispatcher = 0x79261f39e580 <native2InterpJavaUpcall0(J9UpcallMetaData*, void*)>,
  thunkAddress = 0x7925fc000028,
  thunkSize = 56,
  nativeFuncSignature = 0x792568179b60,
  functionPtr = {133201567083744, 133201567083840, 15987178197214948768},
  thunkHeapWrapper = 0x7925680bc6d0
}

zl-wang commented Jan 28, 2023

My bad: when I encoded that xori, r19 and r0 were put in the wrong register positions. r0 should come first, but I put it in the 2nd position. Flipping them around (0 first, 19 (0x13) later), it should indeed be 0x792568137ed0. My "metaData" is the J9UpcallMetaData, not mhMetaData, so it matches up. Now you need to follow that correct J9UpcallMetaData and find out where you got the corrupted mhMetaData. You can see the thunk passed in the correct J9UpcallMetaData (and the stack arg pointer -- it was calculated by the addi r4,r1,48 instruction).

zl-wang commented Jan 28, 2023

Instead of disassembly, you can list the thunk content in binary ... you will see the embedded J9UpcallMetaData if you list the 8 bytes at address 0x00007925fc000058.

@ChengJin01

There were two different cases in the dumps (data->mhMetaData was invalid in one dump while the whole data was corrupted in the other), which makes it less likely that the memory was intentionally overwritten. I suspect the memory of data was unexpectedly released by the time the code accessed the Java field via the macro; otherwise, both dumps should end up with the same result (e.g. the same corrupted data->mhMetaData). I will need to double-check the test case against the related code in OpenJDK & OpenJ9 to see whether there is anything else I missed previously.

zl-wang commented Jan 30, 2023

I guess you meant thread-safety regarding scoped memory accesses? Or a pure scoped memory management bug (released too early ... which can trigger even for a single thread)? I am also wondering: when multiple threads are doing the same downcall, are the metadata and thunk-gen shared by those threads? Are these data (thunk) generated in a certain exclusive manner?

ChengJin01 commented Jan 30, 2023

i guessed you meant thread-safety re scoped memory accesses? or, pure scoped memory management bug (released too early ... can trigger even for single thread)? ...

There is something I suspect, given this is not a multi-threading based test. The test shows that the upcall thunk is allocated under an implicit GC-managed scope/session instead of the confined scope/session intended for the test, as follows:

    public void testClosedStructCallback() throws Throwable {
        MethodHandle handle = Linker.nativeLinker().downcallHandle(
                findNativeOrThrow("addr_func_cb"),
                FunctionDescriptor.ofVoid(C_POINTER, C_POINTER));

        try (MemorySession session = MemorySession.openConfined()) { <---- this is a confined session for the downcall & upcall 
            MemorySegment segment = MemorySegment.allocateNative(POINT, session);
            handle.invoke(segment, sessionChecker(session));
        }
    }

    MemorySegment sessionChecker(MemorySession session) {
        try {
            MethodHandle handle = MethodHandles.lookup().findStatic(SafeFunctionAccessTest.class, "checkSession",
                    MethodType.methodType(void.class, MemorySession.class));
            handle = handle.bindTo(session);
            return Linker.nativeLinker().upcallStub(handle, FunctionDescriptor.ofVoid(), 
              MemorySession.openImplicit()); <------ the upcall stub is allocated under an implicit session
        } catch (Throwable ex) {
            throw new AssertionError(ex);
        }
    }

Normally, the upcall stub is allocated under the same session within the test, in which case all memory resources for the upcall (thunk, metadata) are released only when the session is terminated; this failing test is different.

An implicit session is not controlled by the test case but by GC, and it is never explicitly closed there. So the upcall memory resources will be cleaned up if the implicit session is terminated before the upcall runs.
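
For comparison, here is a minimal sketch of the conventional pattern described above, where the upcall stub is allocated under the same confined session as the other downcall arguments (illustrative only; handle stands for the bound checkSession MethodHandle and downcallHandle for the addr_func_cb downcall handle shown above):

    try (MemorySession session = MemorySession.openConfined()) {
        MemorySegment segment = MemorySegment.allocateNative(POINT, session);
        // Allocating the stub in the same confined session ties the thunk/metadata
        // lifetime to the session: nothing is released before session.close() runs.
        MemorySegment stub = Linker.nativeLinker().upcallStub(
                handle, FunctionDescriptor.ofVoid(), session);
        downcallHandle.invoke(segment, stub);
    }

With this pattern, the stub (and the metadata behind it) cannot be reclaimed while the session, and therefore the downcall, is still active.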

zl-wang commented Jan 30, 2023

Hopefully it is relatively straightforward to track where "data" became bad within the common dispatcher (single-thread test case), since you can safely assume the passed-in "data" is correct (it was retrieved from the thunk itself -- embedded there).

ChengJin01 commented Jan 30, 2023

instead of disassembly, you can list the thunk content in binary ... you will see the embedded J9UpcallMetaData if you list the 8-byte at address 0x00007925fc000058.

According to the data in the debugger, the value at 0x00007925fc000058 is the correct J9UpcallMetaData pointer:

(gdb) p datea
No symbol "datea" in current context.
(gdb) p data
$1 = (J9UpcallMetaData *) 0x792568137ed0
(gdb) x/10x  0x00007925fc000058
0x7925fc000058: 0x68137ed0      0x00007925 <-------  0x7925 68137ed0 which was correct for J9UpcallMetaData
0x00000007      0x00000000

which means the value of the J9UpcallMetaData pointer was not modified from the thunk perspective. That is to say, no code in the thunk touches the value after the J9UpcallMetaData is passed to the thunk.

@ChengJin01

To verify whether the issue was triggered by the implicit session in the test, I changed the test by replacing it with the global session at ChengJin01/openj9-openjdk-jdk19@f526a82 (a sketch of the change follows below), in which case the crash should disappear given the metadata will never be released until the VM exits.

Grinders (x300 on pLinux):
https://openj9-jenkins.osuosl.org/job/Grinder/1944
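
A minimal sketch of that verification change (illustrative only; the actual diff is in the linked commit): the upcall stub is allocated with the global session, which is never closed, instead of an implicit one:

    // Illustrative only; see the linked commit for the real change in sessionChecker().
    return Linker.nativeLinker().upcallStub(handle, FunctionDescriptor.ofVoid(),
            MemorySession.global());   // was: MemorySession.openImplicit()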

@ChengJin01

No crash/issue was detected with the global session (which never releases the upcall stub until the VM exits) in the Grinders above, which means the problem came from the misuse of the implicit session.

Given this is not specific to OpenJ9 but a generic problem for the upcall stub (the upcall stub must be kept alive till the upcall ends in any case), I will bring this up to Panama/OpenJDK via the mailing list to challenge this test, as it seems inappropriate to allocate the upcall stub with an implicit session controlled by GC (the memory allocated under the implicit session could be released by GC before/during the upcall, which leads to unexpected behavior).

@ChengJin01

Based on the explanation from Oracle at https://mail.openjdk.org/pipermail/panama-dev/2023-January/018486.html against the Spec at https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/foreign/Linker.html#safety:

On downcall handle invocation, the linker runtime guarantees the following for any argument 
that is a memory resource R (of type MemorySegment or VaList):
- The memory session of R is kept alive (and cannot be closed) during the invocation.

it means the memory segment of the upcall stub should be kept alive, regardless of the session's type, during the downcall.

So we might need to modify our code in the downcall to remove the restriction for the implicit session, as the restriction in the existing implementation is only intended for the confined session in the downcall.
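
For illustration only (this is not the OpenJ9 implementation), one way a linker runtime can honor that guarantee is to pin the session of every memory-resource argument for the duration of the native invocation, e.g. with MemorySession.whileAlive from the JDK 19 API:

    // Hypothetical sketch of the Linker safety guarantee quoted above: the session of
    // a memory-resource argument (here, the upcall stub's implicit session) is kept
    // alive for the whole duration of the downcall, whatever the session's type is.
    static void invokeKeepingAlive(MethodHandle downcall, MemorySegment arg, MemorySegment stub) throws Throwable {
        stub.session().whileAlive(() -> {       // the session cannot be closed while the action runs
            try {
                downcall.invoke(arg, stub);     // the native call, including any upcall back into Java
            } catch (Throwable t) {
                throw new RuntimeException(t);
            }
        });
    }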

@ChengJin01

I created a fix at ChengJin01@7046851 by removing the check on the owner thread, which should resolve the issue with the implicit session (effectively a kind of shared session). I will need to verify in personal builds & Grinders to see how it goes.
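
As a rough illustration of the idea (an assumption-laden sketch, not the actual OpenJ9 InternalDowncallHandler code): in the JDK 19 API only confined sessions have an owner thread, so an owner-thread check has to be skipped for shared and implicit sessions:

    // Illustrative only: MemorySession.ownerThread() returns null for shared and
    // implicit sessions, so the owner-thread validation applies to confined sessions only.
    static void validateSessionOwner(MemorySession session) {
        Thread owner = session.ownerThread();   // null unless the session is confined
        if ((owner != null) && (owner != Thread.currentThread())) {
            throw new IllegalStateException("session is accessed outside its owner thread");
        }
    }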

ChengJin01 commented Feb 3, 2023

Already verified in OpenJ9 tests & Grinders (x300) at https://openj9-jenkins.osuosl.org/job/Grinder/1963/ with no issue detected in this test suite.

ChengJin01 added a commit to ChengJin01/openj9 that referenced this issue Feb 3, 2023
The changes remove the restriction for any session regardless
of the session's type to ensure the memory of upcall stub
is kept alive during the downcall.

Fixes: eclipse-openj9#16606

Signed-off-by: ChengJin01 <jincheng@ca.ibm.com>