jdk15+36: java/foreign/TestMismatch.java.TestMismatch J9 Crash #10588
Comments
This appears to be #10580 with different error output. |
fyi @andrewcraik |
We need a stop-ship determination on this issue. Given the crash is in the java.foreign package, which I believe is still in preview, I don't think it will be stop-ship, but we should still look at the crash to be sure it isn't a broader issue |
@liqunl sorry to interrupt your dev work but could you take a look as this is lined up against the release that will be going out imminently. |
@andrew-m-leonard @llxia What should be the target if I want to rerun the failed test? |
@andrew-m-leonard You mentioned the test also failed on xlinux, could you provide the link to the x86 failure? |
#10580 is a failure in the same test and it has a link to rerun the test with target But it failed. @rpshukla Could you help with the grinder? I don't know why my grinder failed, maybe something's wrong in my configuration? |
@liqunl I've just kicked off a Grinder on xLinux for just the TestMismatch test here: https://ci.adoptopenjdk.net/job/Grinder/3859/console |
@liqunl fyi, the "Rerun Grinder" links rarely work as-is, as they typically point at the "upstream" job artifact, which has disappeared by the time you run the grinder! |
I started another plinux grinder using CUSTOMIZED_SDK_URL instead of UPSTREAM_JOB_NAME |
@AlenBadel could you kindly take a look? |
Failure rate on my side is between 1/10 and 1/20.
java.foreign was bundled with OpenJDK and Oracle builds starting with Java 14. I understand J9 may not be in total sync with package support (e.g. JavaFX). Is foreign bundled with J9 Java 14 builds? |
Foreign is still incubator status for Java 15, but it is included. |
It looks like the foreign package is also included in J9 Java 14 builds. I ran it with the latest Java 14 release, and was not able to reproduce the failure. This suggests that this is a recent regression. |
I'm afraid that the core files are not readable on my end. I've even tried to add Continuing my investigation by attempting to increase the reproducibility, so that I can reliably get this test to fail locally. |
I'm guessing the core file(s) aren't readable because you need the jextract data from the machine that produced the core file. If it's reproducible on an OpenJ9 or internal jenkins machine we can get that. If it's only reproducible at Adopt, someone with access to those machines could run jextract. |
This is considered stop-ship until we have a core file and can evaluate the cause and frequency |
It's reproducible on internal jenkins machine, failure rate seems to be 4/12 on nightly build. Also saw crash in compiled method |
I was able to reproduce this locally on one of our internal farm machines. Failing at a rate of 100% on this particular machine. jextract can't find the libraries unfortunately. GDB, as well as our internal debugging tools are not able to load the libraries even when given the prefix, and absolute path.
I've also tried to generate a trace file, without any luck. I'll attempt to run this workload with gdb to get a snapshot of where we're crashing. |
The size of the segment inside the java stack is correct. This issue stems from an instruction called srawi, which only shifts the lower 32 bits, i.e. it sounds like somewhere we're assuming the generated size to be an integer.
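To illustrate the width problem described above, here is a minimal Java sketch of what a 32-bit algebraic shift does to a 64-bit length. The class name, method names, and the shift amount of 3 are hypothetical choices for illustration and are not taken from the generated code. `shiftAsWord` mimics srawi by keeping only the low 32 bits before shifting, while `shiftAsDoubleword` shifts the full 64-bit value.

```java
final class ShiftWidthSketch {
    // Mimics a 32-bit algebraic shift such as srawi: only the low 32 bits
    // of the input take part, and the 32-bit result is sign-extended.
    static long shiftAsWord(long value, int amount) {
        return (long) (((int) value) >> amount);
    }

    // Shifts the full 64-bit value, which is what a long length needs.
    static long shiftAsDoubleword(long value, int amount) {
        return value >> amount;
    }

    public static void main(String[] args) {
        // Hypothetical size just above Integer.MAX_VALUE; it does not fit in an int.
        long size = Integer.MAX_VALUE + 10L;
        System.out.println("64-bit shift: " + shiftAsDoubleword(size, 3));
        System.out.println("32-bit shift: " + shiftAsWord(size, 3));
    }
}
```

For sizes that fit in an int the two methods agree, which would be consistent with the failure only showing up for segments larger than Integer.MAX_VALUE.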
|
Changing I'll be taking a look on xLinuxXL to see if it's a similar story. |
I think we need to find out which opcode (or some recognized method?) this instruction is coming from and check/document overall assumptions about the incoming length for that code. |
From what Alen showed me, it comes from |
Running the same test #10588 (comment) on xLinuxXL does not produce a crash, nor does it hang. I will continue attempting to reproduce this on xLinuxXL, but I believe the reproduction rate was established to be less than 1 in 300. #10580 (comment) |
@IBMJimmyk is this a tree problem or a codegen problem, just to be clear on the width issue at play here? I'm guessing codegen? |
At least the problem on Power was a codegen issue. The child for the size of the data to write was an |
ok great thanks - just confirming there wasn't a trees/common issue - looks to be codegen - thanks! |
In regards to the crash on power, the following are the trees generated.
I'll be taking a second look at the generated sequence to see if there were any similar assumptions, otherwise we should be ready for a PR. On the xLinuxXL front, I'm checking if there are similar assumptions taken (i.e. size of data treated as an int), otherwise there's very little I can do without being able to reproduce the failure. |
So the trees do look correct, which means the issue would just be incorrect widths in the codegen. Thanks for the great analysis @AlenBadel |
I think we still need to understand the whole flow: which opcodes and evaluators are involved. For example, we might be reusing some code that normally accepts int length. |
I agree. I will have an update soon with the summary of the issue from a top-down approach. |
It occurs to me that this would have been avoided if we had some kind of 'compiler' to check uses like that, along the lines of the error you get in Java passing a long to a function that expects an int (e.g. |
Yeah, it's something to consider in the future. Basically, Power codegen's assumption is that a register containing an Int can have an undefined value in its upper half. So Int and Long types need to be treated differently. However we only have Register class and not Register32 and Register64. |
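The Register32/Register64 idea mentioned above can be sketched in Java, where the type checker plays the role of the 'compiler' check. This is only an illustration under stated assumptions: `Register32`, `Register64`, `signExtend`, and `lengthInBytes` are hypothetical names and do not correspond to any existing OMR or OpenJ9 class.

```java
final class RegisterWidthSketch {
    // Hypothetical wrapper for a register holding an Int: only the low
    // 32 bits are meaningful, the upper half may contain anything.
    static final class Register32 {
        private final long rawBits;
        Register32(long rawBits) { this.rawBits = rawBits; }

        // Explicit widening: discard the upper half and sign-extend.
        Register64 signExtend() { return new Register64((long) (int) rawBits); }
    }

    // Hypothetical wrapper for a register holding a full 64-bit Long.
    static final class Register64 {
        final long value;
        Register64(long value) { this.value = value; }
    }

    // A consumer that needs a full-width length accepts only Register64,
    // so passing a Register32 directly is rejected at compile time.
    static long lengthInBytes(Register64 length) {
        return length.value;
    }

    public static void main(String[] args) {
        // Upper half deliberately filled with garbage to model "undefined".
        Register32 intLength = new Register32(0xDEADBEEF00000005L);
        // lengthInBytes(intLength); // would not compile: wrong register type
        System.out.println(lengthInBytes(intLength.signExtend()));
    }
}
```

With distinct types, the only way to feed an Int register to a 64-bit consumer is through an explicit conversion, which makes the width assumption visible at the call site.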
To summarize the issue on Power:
We’ve established that the crash occurs when MemorySegment.allocateNative is JIT compiled and invoked with a requested memory segment size greater than Integer.MAX_VALUE. E.g.
The method creates and allocates a new memory segment backed by a block of off-heap memory. The size argument represents the size of the memory segment requested.
The crash
A segmentation fault occurs because the system was attempting to clear read-only memory[1]. This happens because the generated instructions that compute how much memory should be cleared were not correct[2].
Where was this code generated?
The only point where The Fix As discussed within the [1] #10588 (comment) |
Thanks for the very detailed investigation and the summary. However, I am not sure I agree with the proposed fix.
Note that the length information is conveyed via creating an I am not sure we can change the OMR method to adjust it to a particular consumer, OpenJ9, especially since even this consumer passes information in a somewhat inconsistent way. |
BTW : I think we need to remove the Intel label and assume that this issue will only take care of the Power failure. |
Removed the intel label, this platform should be covered by #10580 |
This description can certainly be added.
This is true, an ArraySet does normally allocate an array length of type int due to indexing limitations. The limitation is historical, and based on the class ArraySet inherits from, ArrayList, which itself is an implementation of List. Looking through the remainder of the Unsafe methods, we can see that we tend to use the underlying array evaluators. From what I've seen they all support a length which could be long or int. For example, such case as I believe as long as the evaluators we use support the initialization of a contiguous memory segment which has a length argument of type long, then we could easily justify its use to compile and evaluate Unsafe methods.
The evaluator should support both long, or int length values. This is consistent with many of our other evaluators. Java is not the only language that has the capability to allocate, and initialize a very-large contiguous region of memory. [1] https://github.com/eclipse/omr/blob/0b9653c23ba22247f66c11ad6df8a3b46c2c4738/compiler/p/codegen/OMRTreeEvaluator.cpp#L4276-L4277 |
Do you mean we should be able to pass nodes with the length child being both Int and Long type? |
They can be either Long, or Int. We currently don't call However, it's better illustrated with the example of |
I had a look at how |
Exactly. We can use |
OMR PR: eclipse/omr#5590 |
The above PR has been merged. Fix will be merged into R0.23 as well. |
@andrew-m-leonard can this be closed? |
I'll go ahead and close it since the fix has been merged, we can open a new issue if another problem is found. |
Failure link
https://ci.adoptopenjdk.net/job/Test_openjdk15_j9_sanity.openjdk_ppc64le_linux/43/consoleFull
java/foreign/TestMismatch.java
Fails on platforms: pLinux, xLinuxXL
Optional info
Failure output (captured from console output)