New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segfault at checkCast on Java 17 when running multiple RedefineClasses
threads
#14014
Comments
I've tentatively assigned this to the 0.29.1 release as we need to determine if this is a stop-ship / blocker issue for that release as info so far indicates it's present in the release branch. |
@EricYangIBM can you try this test again with Nazims work around in eclipse/omr#6255 |
It still crashes with the latest openj9 and omr changes (including that commit) |
Thanks confirming |
Looks like this is similar to #13504 |
I put the truncated test code here: https://github.com/EricYangIBM/openj9/tree/crash |
The test is run with Ran a 100x grinder:
Referring to the native stack from gdb below, the segfault happens while resolving Also, I was able to produce segfault at different locations when I tried Native stack during crash
|
There is a potential workaround to prevent this crash by avoiding |
That's a pretty heavy hammer to use to avoid the issue as that will make all compilations synchronous, not just those needed to address guard failures. Expect to see much higher latency and worse startup with that option. I'm not comfortable with that being considered a sufficient workaround for this issue |
The workaround suggested by @babsingh was to avoid passing
Running without specifying Separately, I wonder whether this is related to #13162 |
However: While running without I think it probably still makes sense to take into account that this is an unusual mode when evaluating the severity though:
|
The workaround for that was promoted in the head stream last night. |
This issue was opened yesterday in the morning |
re #14014 (comment): Yesterday, @EricYangIBM tried eclipse/omr#6255 and stated in #14014 (comment) that the crash still happens. Just now, I rebased both openj9 and omr repos with the latest changes. But, I can no longer reproduce the segfault. Can another person confirm this point? |
Oh I missed that, sorry |
Has this been confirmed now? I see a 👍 on the post but I'm too old to know if that means "confirmed" or "agree we should confirm" |
That's my 👍, which I meant mainly to acknowledge the information I had missed above from #14014 (comment) I'm not aware of any confirmation, and certainly not on my part. For the record, if I had confirmed something like that, I would have said so explicitly. But sorry nonetheless for the ambiguity 😕 |
It still happens with the latest OpenJ9 and OMR changes. I observed the following three scenarios:
Different SCC optionsFrom the above scenarios, I feel that the JIT is storing some data in the SCC which is either corrupted or not being read correctly from the SCC. The culprit data is only generated when I tried the below -Xshareclasses options to figure out where the corruption is happening in the SCC:
None of the above options prevented the segfault. @hangshao0 for insights. |
|
With |
This failure will have to be triaged to get to the root cause. I suspect including What is the failure rate in 100 runs? Does it fail every time? I don't think that statistic was mentioned above. Given that |
BTW, I took a look at string peepholes, and it seems it will already refuse to transform under involuntary OSR, though not as intentionally as we might like. The patterns it's looking for start like so:
(though With involuntary OSR, the trees for this bytecode contain a store to a pending push temp because the
When string peepholes sees the So we shouldn't currently have this same bug for string peepholes, though IMO it would be an improvement to explicitly skip the attempt to find patterns and transform, e.g. here: openj9/runtime/compiler/optimizer/StringPeepholes.cpp Lines 217 to 223 in 0f92184
FYI @vijaysun-omr, @0xdaryl |
Similarly, |
Thanks @jdmpapin I agree it would be better to more explicitly avoid those kinds of transformations that change the program in ways that involuntary OSR may not be able to handle. I wonder if these transformations in the IL generator also fall in a similar category : openj9/runtime/compiler/ilgen/Walker.cpp Line 4374 in 5afc033
|
Now that we understand the circumstances in which this problem occurs I don't believe this a blocker for 0.29.1 and can wait for the 0.30 release. I recommend the fix (#14071) be merged into master and then subsequently merged into 0.30 when it is ready. This is a long-standing problem that is not JDK17-specific. It manifests itself when Exploration of other OSR problems with existing JIT optimizations will also be conducted in the background and contributed to the code base when ready. We are not aware of any known OSR problems from any of these other JIT optimizations. |
I agree with @0xdaryl 's analysis and suggested course of action, which suggests this issue can be removed from 0.29.1's milestone blockers https://github.com/eclipse-openj9/openj9/milestone/32 . Thank you to everyone who contributed to understanding this issue, regardless of how we proceed here. Everyone worked hard to figure out the underlying problem and it showed great collaboration across teams as the investigation progressed! |
The original problem was only seen in JDK17, Ill take a look at these ones to determine the cause |
Similar failure observed at an internal JDK8 build
|
To confirm I understand the latest updates - the issue is with any call to clone() where the jit inlines the sequence directly that we later OSR out of? It applies to all current releases with OSR-enabled and could occur in any release in the field. 17 just happens to be where we found it. Assuming that's accurate, then the plan to move this out of the 0.29.1 release's list of blockers makes sense to me. It's not specific to the 17 release and is already present in the field.
A big +1 to this as well! |
Yes, that's accurate. |
Crashes in new test specific to these platforms. See eclipse-openj9#14014 (comment) Signed-off-by: Eric Yang <eric.yang@ibm.com>
I don't think we should be excluding. We've already merged a fix #14071. We should be removing the exclude for jdk17 if the problem has been resolved. Also note that excluding the entire test rather than just rc021 could hide other issues. |
This is still failing although the build contains #14071 |
@jdmpapin : can you take a look at the crash, at least to confirm whether it is or isn't related to the original problem? If AArch64-specific I can farm it out for investigation. |
Comparing the output against the only prior failure I've found linked from this issue, which is from the initial comment #14014 (comment) Within rc021, the original failure occurred in
The crash was in libj9vm29.so:
And it indicated that there was a bad reference encountered in
On the other hand, in this more recent failed test,
With these differences and the fact that the more recent failure happened despite #14071, I'm pretty sure it's a distinct bug |
Crash happens during MemberName.clone().
Sample failure link: https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_Personal/11/tapResults/
To reproduce, run
RedefineClasses
threads (e.g. in rc021.java) in a loop, adding and removing a static field from a class. The threads must not bejoin
ed one by one otherwise the crash will not happen. See #13836 (comment).Also note that this happens without the non-test changes in #13836
The text was updated successfully, but these errors were encountered: