jdk19 openjdk java/foreign/SafeFunctionAccessTest.java crash vmState=0x00000000 #16606
@tajila @ChengJin01 fyi
The javacore & core dump indicate where the test suite failed, against the test code at https://github.com/ibmruntimes/openj9-openjdk-jdk19/blob/be0b953d126ad1302eb49c4467f7017ef4853b9c/test/jdk/java/foreign/SafeFunctionAccessTest.java#L165, along with the stacktrace in gdb against the native code at openj9/runtime/vm/UpcallVMHelpers.cpp line 159 in 63dc82f.
So the only apparent reason for the crash is that the address of the metadata passed to the dispatcher was invalid.
According to our existing upcall implementation at openj9/runtime/vm/OutOfLineINL_openj9_internal_foreign_abi_InternalUpcallHandler.cpp line 115 in 63dc82f, mhMetaData is already stored as a global reference, which means it should be impossible to discard it before the upcall completes (the crash occurred in the dispatcher, meaning the upcall method had not yet been invoked at that point).
Meanwhile, the intention of the test suite is to validate the code at openj9/jcl/src/java.base/share/classes/openj9/internal/foreign/abi/InternalDowncallHandler.java line 742 in b975e82, which ensures the current session is never terminated until the whole upcall invocation is done. So I can't imagine how this still happened in upcall.
Launched a Grinder (x200) at https://openj9-jenkins.osuosl.org/job/Grinder/1887/ to see how frequently the crash can be reproduced.
The issue can be reproduced in the Grinder above (roughly 1 in 100 runs). To determine whether this is a platform-dependent issue, I launched Grinders (x300) on AIX, zLinux, and Linux/x86_64:
[1] The crash in this failing test suite didn't occur on AIX, zLinux, or Linux/x86_64 in the Grinders launched above, so this is a pLinux-specific issue.
[2] As discussed with @gac offline, the only thing we confirmed so far involves a difference from Linux/x86_64.
[3] We also ruled out most of the possibilities from the VM perspective.
So we are wondering whether there is any chance the generated thunk is involved.
@zl-wang, could you help take a look at this issue from the thunk perspective to see why the metadata ends up corrupted?
@ChengJin01 it doesn't make sense that thunk-gen could corrupt fields it never tries to write: all reads/writes are in C code, and only a few fields are touched, mostly reads.
No bulk memcpy was ever done to the metaData pointer during thunk-gen. It is beyond my imagination that the gcc compiler did something horrific to corrupt the metaData fields; I think it is more likely that memory management of the related area is at fault.
Can you recover/see the correct thunkAddress? The metaData pointer is stored at the end of the thunk. When the thunk is executed, it retrieves the metaData pointer and passes it along to the common dispatcher. If you can see it, we can verify whether the existing "data" matches the metaData.
By the way, what is the intended upcall signature? I was wondering whether the common dispatcher stores back into the caller frame (when the thunk didn't create a frame for a suitable signature) and corrupts something. I doubt it very much, since that would happen every time instead of 1 out of 100 runs. [PS: never mind ... I recalled that the common dispatcher has only 2 arguments; gcc cannot/won't store back into the caller frame.]
That's something we suspected, but there is no strong evidence to confirm it for now.
There is no way to do that for the dumps at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder/1887/openjdk_test_output.tar.gz because the thunk address isn't recoverable there. For the dumps in the description at https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96/openjdk_test_output.tar.gz, only limited data is available.
Ah ... as long as the thunkAddress is valid, I will be able to see the correct metaData address embedded in the thunk. Post an instruction listing (disassembly) at that thunkAddress for, say, 32 or 50 instructions (more than the thunk length, because the embedded data is at the end of the thunk), and I will be able to tell you the embedded metaData pointer.
Here's the assembly code starting from the thunkAddress, according to the core dump in https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96/openjdk_test_output.tar.gz:
So, the metaData pointer is embedded at 0x00007925fc000058: the xori encoding should be 0x6A607ED0, such that the metaData pointer should be 0x000079256A607ED0. You can verify whether "data" is that value. Also, I noticed no register is stored back into the thunk-created frame, i.e. the upcall has no arguments.
Unfortunately, none of your results match the values in the core dump above at #16606 (comment).
My bad: when I encoded that xori, r19 and r0 were put into the wrong register positions; r0 should come first, but I put it in the 2nd position. Flipping them around (0 first, 19 (0x13) second), it should indeed be 0x792568137ed0. My "metaData" is the J9UpcallMetaData, not mhMetaData, so it matched up. Now you need to follow that correct J9UpcallMetaData and find out where you got the corrupted mhMetaData. You can see the thunk passed in the correct J9UpcallMetaData (and the stack arg pointer it calculated).
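The register swap can be checked directly against the D-form bit layout of `xori` (primary opcode 26, then RS in bits 6-10, RA in bits 11-15, and the 16-bit immediate UI in bits 16-31). A small decoding sketch over the two encodings quoted above (nothing else is assumed):

```java
// Sketch: decode the D-form fields of the two xori encodings discussed above.
// Power ISA layout (big-endian bit numbering): opcode in bits 0-5,
// RS in bits 6-10, RA in bits 11-15, UI in bits 16-31.
public class XoriDecode {
    static int opcode(long insn) { return (int) ((insn >>> 26) & 0x3F); }
    static int rs(long insn)     { return (int) ((insn >>> 21) & 0x1F); }
    static int ra(long insn)     { return (int) ((insn >>> 16) & 0x1F); }
    static int ui(long insn)     { return (int) (insn & 0xFFFF); }

    public static void main(String[] args) {
        long swapped = 0x6A607ED0L; // the first hand-encoding (registers swapped)
        long fixed   = 0x68137ED0L; // corrected: r0 first, r19 (0x13) second
        System.out.printf("swapped: op=%d rs=%d ra=%d ui=0x%04X%n",
                opcode(swapped), rs(swapped), ra(swapped), ui(swapped));
        System.out.printf("fixed:   op=%d rs=%d ra=%d ui=0x%04X%n",
                opcode(fixed), rs(fixed), ra(fixed), ui(fixed));
    }
}
```

Both encodings carry the same immediate 0x7ED0 and opcode 26; only the RS/RA fields are exchanged, which is why the reconstructed pointer changes from 0x...6A607ED0 to 0x...68137ED0.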
Instead of disassembly, you can list the thunk content in binary; you will see the embedded J9UpcallMetaData if you list the 8 bytes at that address.
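For example, on little-endian ppc64le the 8 raw bytes of the embedded-data slot reassemble into the pointer as below (a sketch; the byte values are illustrative, taken from the corrected value 0x792568137ED0 discussed above):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ThunkDataRead {
    public static void main(String[] args) {
        // Raw bytes as they would appear in a binary listing of the thunk's
        // embedded-data slot on a little-endian target (illustrative values).
        byte[] raw = { (byte) 0xD0, (byte) 0x7E, 0x13, 0x68,
                       0x25, 0x79, 0x00, 0x00 };
        long metaData = ByteBuffer.wrap(raw)
                                  .order(ByteOrder.LITTLE_ENDIAN)
                                  .getLong();
        System.out.printf("0x%016X%n", metaData); // prints 0x0000792568137ED0
    }
}
```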
There were two different cases in the dumps.
I guess you meant thread-safety with respect to scoped memory accesses? Or a pure scoped-memory-management bug (released too early, which can trigger even for a single thread)? I am also wondering: when multiple threads are doing the same downcall, are the metadata and thunk-gen shared by those threads? Are these data (thunk) generated in some exclusive manner?
There is something I suspect, given this is not a multi-threaded test: the test allocates the upcall thunk under an implicit GC-based scope/session instead of the confined scope/session intended for the test.
Normally, the upcall stub is allocated under the same session within the test, in which case all memory resources for the upcall (thunk, metadata) are released only when the session is terminated; this failing test is different. An implicit session is not controlled by the test case but by GC, and it is never explicitly closed there. So the upcall memory resources will be cleaned up if the implicit session is terminated before the upcall runs.
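The lifetime hazard can be modeled with a small toy (this is NOT the java.lang.foreign API; `Session`, `Metadata`, and `dispatch` are illustrative stand-ins): upcall resources live in a session, a confined session is closed only under the test's control, while an implicit session is released whenever GC decides, possibly before the upcall has run.

```java
// Toy model of the hazard: the dispatcher assumes the upcall metadata is
// still alive, but an implicit (GC-controlled) session may free it first.
public class ImplicitSessionHazard {
    static class Metadata { volatile boolean freed = false; }

    static class Session {
        final Metadata meta = new Metadata();
        void close() { meta.freed = true; } // confined: the test controls this
    }

    // The dispatcher's implicit precondition: metadata must still be alive.
    static String dispatch(Metadata m) {
        if (m.freed) throw new IllegalStateException("metadata already freed");
        return "upcall ok";
    }

    public static void main(String[] args) {
        Session confined = new Session();
        System.out.println(dispatch(confined.meta)); // prints "upcall ok"
        confined.close();                            // released after the call

        Session implicit = new Session();
        implicit.close(); // stands in for GC releasing the implicit session early
        try {
            dispatch(implicit.meta);
        } catch (IllegalStateException e) {
            System.out.println("crash analogue: " + e.getMessage());
        }
    }
}
```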
Hopefully it is relatively straightforward to track where "data" went bad within the common dispatcher (single-threaded test case), since you can safely assume the passed-in "data" is correct (it was retrieved from the thunk itself, where it is embedded).
According to the data in the debugger, the value no longer matches the expected metadata, which means it was corrupted before the dispatcher used it.
To verify whether the issue was triggered by the implicit session in the test, I changed the test to use a global session instead at ChengJin01/openj9-openjdk-jdk19@f526a82, in which case the crash should disappear given the metadata will never be released until the VM exits. Grinders (x300 on pLinux):
No crash/issue was detected with the global session (the upcall stub is never released until the VM exits) in the Grinders above, which means the problem comes from the misuse of the implicit session. Given this is not specific to OpenJ9 but a generic problem for upcall stubs (the upcall stub must be kept alive until the upcall ends in any case), I will bring this up with Panama/OpenJDK via the mailing list to challenge this test: it is inappropriate to allocate an upcall stub with an implicit session controlled by GC, since the memory allocated by the implicit session may be released by GC before/during the upcall, leading to unexpected behavior.
The issue is raised at https://mail.openjdk.org/pipermail/panama-dev/2023-January/018483.html.
Based on the explanation from Oracle at https://mail.openjdk.org/pipermail/panama-dev/2023-January/018486.html against the spec at https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/foreign/Linker.html#safety:
the memory segment of the upcall stub should be kept alive regardless of the session type used in the downcall. So we need to modify our downcall code to remove the restriction for the implicit session, as the restriction in the existing implementation was only intended for the confined session in downcall.
I created a fix at ChengJin01@7046851 by removing the check on the owner thread, which should resolve the issue with the implicit session (effectively a kind of shared session). I will verify with personal builds & Grinders to see how it goes.
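The shape of that fix can be sketched as follows (the `Session` interface and method names here are illustrative, not OpenJ9's actual code): before the change, the downcall path only pinned a session owned by the current thread, which skips implicit/shared sessions since they have no owner thread; after the change, every session is pinned for the duration of the downcall.

```java
public class SessionPinSketch {
    interface Session {
        Thread ownerThread(); // null for shared/implicit sessions
    }

    // Before the fix: only a confined session owned by the caller was pinned,
    // so an implicit session (ownerThread() == null) was never kept alive.
    static boolean pinnedBefore(Session s) {
        return s.ownerThread() == Thread.currentThread();
    }

    // After the fix: pin every session regardless of its kind, so the upcall
    // stub's memory stays alive until the downcall completes.
    static boolean pinnedAfter(Session s) {
        return true;
    }

    public static void main(String[] args) {
        Session implicit = () -> null; // implicit/shared: no owner thread
        System.out.println("before: " + pinnedBefore(implicit)); // prints false
        System.out.println("after:  " + pinnedAfter(implicit));  // prints true
    }
}
```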
Already verified via OpenJ9 tests & Grinders (x300) at https://openj9-jenkins.osuosl.org/job/Grinder/1963/ with no issue detected in this test suite.
The changes remove the restriction for any session, regardless of the session's type, to ensure the memory of the upcall stub is kept alive during the downcall. Fixes: eclipse-openj9#16606 Signed-off-by: ChengJin01 <jincheng@ca.ibm.com>
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96
jdk_foreign_0
java/foreign/SafeFunctionAccessTest.java
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk19_j9_sanity.openjdk_ppc64le_linux_Nightly/96/openjdk_test_output.tar.gz