Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault at checkCast on Java 17 when running multiple RedefineClasses threads #14014

Closed
EricYangIBM opened this issue Nov 25, 2021 · 81 comments
Closed

Comments

@EricYangIBM
Copy link
Contributor

EricYangIBM commented Nov 25, 2021

Crash happens during MemberName.clone().
Sample failure link: https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.functional_ppc64le_linux_Personal/11/tapResults/

===============================================
Running test cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 ...
===============================================
cmdLineTester_jvmtitests_hcr_OSRG_nongold_0 Start Time: Mon Nov 22 20:22:38 2021 Epoch Time (ms): 1637626958276
variation: Mode107-OSRG
JVM_OPTIONS:  -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation

Output from test:
 [OUT] *** Testing [1/2]:	testAnnotatedTypes
 [ERR] Unhandled exception
 [ERR] Type=Segmentation error vmState=0x00000000
 [ERR] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
 [ERR] Handler1=00003FFFAE285BA0 Handler2=00003FFFAE1BC300
 [ERR] R0=0000000000000000 R1=00003FFEBD9C9F80 R2=00003FFFAE496D00 R3=FFFFFFFFFFFFFC28
 [ERR] R4=00003FFFAE3420C4 R5=00000000000B12E8 R6=00000000005304F0 R7=0000000000000001
 [ERR] R8=0000000000101350 R9=0000000000102400 R10=000000000000000E R11=0000000000000000
 [ERR] R12=0000000028222422 R13=00003FFEBD9D6900 R14=0000000000530480 R15=0000000000367400
 [ERR] R16=00003FFF84B30038 R17=FFFFFFFFFFFFFFFF R18=00003FFEBD9CF900 R19=0000000000090000
 [ERR] R20=00003FFEBD9CA8A0 R21=0000000000000000 R22=0000000088ADA180 R23=0000000000000000
 [ERR] R24=00003FFF4D4E561C R25=00003FFEBD9CA9D8 R26=00000000005304C8 R27=00003FFF00002200
 [ERR] R28=FFFFFFFFFFFFFFFD R29=0000000000101008 R30=0000000000367400 R31=0000000000000001
 [ERR] NIP=00003FFFAE377640 MSR=800000010280F033 ORIG_GPR3=00000000000081C8 CTR=00003FFFAE3460F0
 [ERR] LINK=00003FFFAE3AE960 XER=0000000020000000 CCR=0000000028222424 SOFTE=0000000000000001
 [ERR] TRAP=0000000000000300 DAR=000000000000000E dsisr=0000000040000000 RESULT=0000000000000000
 [ERR] FPR0 0000000000000001 (f: 1.000000, d: 4.940656e-324)
 [ERR] FPR1 40469ae0a0000000 (f: 2684354560.000000, d: 4.520998e+01)
 [ERR] FPR2 74736e6f435f7463 (f: 1130329216.000000, d: 8.903897e+252)
 [ERR] FPR3 5f6176616a4c335f (f: 1783378816.000000, d: 2.858067e+151)
 [ERR] FPR4 3fe3333340000000 (f: 1073741824.000000, d: 6.000000e-01)
 [ERR] FPR5 bfdec82840000000 (f: 1073741824.000000, d: -4.809666e-01)
 [ERR] FPR6 3fe5555560000000 (f: 1610612736.000000, d: 6.666667e-01)
 [ERR] FPR7 3f1c7e7740000000 (f: 1073741824.000000, d: 1.086960e-04)
 [ERR] FPR8 4028000000000000 (f: 0.000000, d: 1.200000e+01)
 [ERR] FPR9 3fd9999b80000000 (f: 2147483648.000000, d: 4.000005e-01)
 [ERR] FPR10 3f65970b40000000 (f: 1073741824.000000, d: 2.635500e-03)
 [ERR] FPR11 bfb2bd7340000000 (f: 1073741824.000000, d: -7.320328e-02)
 [ERR] FPR12 3fc7565060000000 (f: 1610612736.000000, d: 1.823216e-01)
 [ERR] FPR13 3fdb9b2820000000 (f: 536870912.000000, d: 4.313450e-01)
 [ERR] FPR14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR16 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR17 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR18 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR19 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR20 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR21 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR22 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR23 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
 [ERR] Module=/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_Personal_testList_0/openjdkbinary/j2sdk-image/lib/default/libj9vm29.so
 [ERR] Module_base_address=00003FFFAE240000
 [ERR] Target=2_90_20211122_74 (Linux 3.10.0-1160.25.1.el7.ppc64le)
 [ERR] CPU=ppc64le (8 logical CPUs) (0x1de290000 RAM)
 [ERR] ----------- Stack Backtrace -----------
 [ERR] (0x00003FFFAE1B69E4 [libj9prt29.so+0x369e4])
 [ERR] (0x00003FFFAE1BD708 [libj9prt29.so+0x3d708])
 [ERR] (0x00003FFFAE1B6AA8 [libj9prt29.so+0x36aa8])
 [ERR] (0x00003FFFAE1B6C10 [libj9prt29.so+0x36c10])
 [ERR] (0x00003FFFAE1B66E4 [libj9prt29.so+0x366e4])
 [ERR] (0x00003FFFAE1BD708 [libj9prt29.so+0x3d708])
 [ERR] (0x00003FFFAE1B67D8 [libj9prt29.so+0x367d8])
 [ERR] (0x00003FFFAE285A20 [libj9vm29.so+0x45a20])
 [ERR] (0x00003FFFAE1BD708 [libj9prt29.so+0x3d708])
 [ERR] (0x00003FFFAE285D54 [libj9vm29.so+0x45d54])
 [ERR] (0x00003FFFAE1BC560 [libj9prt29.so+0x3c560])
 [ERR] __kernel_sigtramp_rt64+0x0 (0x00003FFFAF340478)
 [ERR] (0x00003FFFAE34F1EC [libj9vm29.so+0x10f1ec])
 [ERR] (0x00003FFFAE3AE960 [libj9vm29.so+0x16e960])
 [ERR] (0x00003FFFAE25DD64 [libj9vm29.so+0x1dd64])
 [ERR] (0x00003FFFAE2D38A4 [libj9vm29.so+0x938a4])
 [ERR] (0x00003FFFAE36B004 [libj9vm29.so+0x12b004])
 [ERR] (0x00003FFFAE3AE960 [libj9vm29.so+0x16e960])
 [ERR] (0x00003FFFAE259E30 [libj9vm29.so+0x19e30])
 [ERR] (0x00003FFFAE2F6C70 [libj9vm29.so+0xb6c70])
 [ERR] (0x00003FFFAE1BD708 [libj9prt29.so+0x3d708])
 [ERR] (0x00003FFFAE2F1E04 [libj9vm29.so+0xb1e04])
 [ERR] (0x00003FFFAE1513E8 [libj9thr29.so+0x113e8])
 [ERR] (0x00003FFFAF278CD4 [libpthread.so.0+0x8cd4])
 [ERR] clone+0xe4 (0x00003FFFAF177F14 [libc.so.6+0x127f14])
 [ERR] ---------------------------------------
 [ERR] JVMDUMP039I Processing dump event "gpf", detail "" at 2021/11/22 20:23:55 - please wait.
 [ERR] 0000000000367400: Object neither in heap nor stack-allocated in thread pool-1-thread-92
 [ERR] 0000000000367400:	O-Slot=00000000005304C8
 [ERR] 0000000000367400:	O-Slot value=000000000000000E
 [ERR] 0000000000367400:	PC=00003FFF4D4E561C
 [ERR] 0000000000367400:	framesWalked=0
 [ERR] 0000000000367400:	arg0EA=00000000005304F0
 [ERR] 0000000000367400:	walkSP=00000000005304C8
 [ERR] 0000000000367400:	literals=0000000000101008
 [ERR] 0000000000367400:	jitInfo=0000000000000000
 [ERR] 0000000000367400:	method=0000000000101008 (java/lang/invoke/MemberName.clone()Ljava/lang/invoke/MemberName;) (Interpreted)
 [ERR] 0000000000367400:	stack=000000000052B480-0000000000530D20

To reproduce, run RedefineClasses threads (e.g. in rc021.java) in a loop, adding and removing a static field from a class. The threads must not be joined one by one otherwise the crash will not happen. See #13836 (comment).
Also note that this happens without the non-test changes in #13836

@DanHeidinga
Copy link
Member

I've tentatively assigned this to the 0.29.1 release as we need to determine if this is a stop-ship / blocker issue for that release as info so far indicates it's present in the release branch.

@tajila
Copy link
Contributor

tajila commented Nov 25, 2021

@babsingh @fengxue-IS

@tajila
Copy link
Contributor

tajila commented Nov 25, 2021

@EricYangIBM can you try this test again with Nazims work around in eclipse/omr#6255

@EricYangIBM
Copy link
Contributor Author

It still crashes with the latest openj9 and omr changes (including that commit)

@tajila
Copy link
Contributor

tajila commented Nov 25, 2021

It still crashes with the latest openj9 and omr changes (including that commit)

Thanks confirming

@tajila
Copy link
Contributor

tajila commented Nov 25, 2021

Looks like this is similar to #13504

@EricYangIBM
Copy link
Contributor Author

I put the truncated test code here: https://github.com/EricYangIBM/openj9/tree/crash
The crash occurs even if I redefine a class to itself as long as I start ~20+ redefine threads.

@tajila tajila removed the comp:jit label Nov 25, 2021
@babsingh
Copy link
Contributor

babsingh commented Nov 26, 2021

The test is run with -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation.

Ran a 100x grinder:

  • Segfault is only seen when -Xjit:disableAsyncCompilation is specified.
  • With -Xint, the test passes 100 times.
  • With JIT enabled but without -Xjit:disableAsyncCompilation, the test passes 100 times.

Referring to the native stack from gdb below, the segfault happens while resolving invokehandle in the JVM, which invokes MethodHandleResolver.linkCallerMethod -> MethodHandleNatives.linkMethod (Java methods). The crash stops happening once -Xjit:disableAsyncCompilation is removed. ++@0xdaryl @nbhuiyan @jdmpapin for JIT insights.

Also, I was able to produce segfault at different locations when I tried -Xjit:disableOSR and -Xjit:disableNextGenHCR.

Native stack during crash

#0  VM_DebugBytecodeInterpreterCompressed::run (this=this@entry=0x7fffc8fa4060, vmThread=<optimized out>) at /root/openj9-openjdk-jdk17/openj9/runtime/vm/BytecodeInterpreter.hpp:11066
#1  0x00007ffff6004b6d in debugBytecodeLoopCompressed (currentThread=<optimized out>) at /root/openj9-openjdk-jdk17/openj9/runtime/vm/BytecodeInterpreter.inc:112
#2  0x00007ffff60685a2 in c_cInterpreter () at /root/openj9-openjdk-jdk17/build/linux-x86_64-server-release/vm/runtime/vm/xcinterp.s:158
#3  0x00007ffff5f52b78 in sendResolveOpenJDKInvokeHandle (currentThread=0x0, currentThread@entry=0x2a5d00, ramCP=ramCP@entry=0x21a490, cpIndex=cpIndex@entry=29, refKind=refKind@entry=5,
    resolvedClass=0x1be600, nameAndSig=0x30, nameAndSig@entry=0x7fff99809bfc) at /root/openj9-openjdk-jdk17/openj9/runtime/vm/callin.cpp:974
#4  0x00007ffff5faa311 in resolveOpenJDKInvokeHandle (vmThread=0x2a5d00, ramCP=0x21a490, cpIndex=29, resolveFlags=resolveFlags@entry=0)
    at /root/openj9-openjdk-jdk17/openj9/runtime/vm/resolvesupport.cpp:2042
#5  0x00007ffff6034a6b in VM_DebugBytecodeInterpreterCompressed::invokehandle (_pc=<optimized out>, _sp=<optimized out>, this=<optimized out>)
    at /root/openj9-openjdk-jdk17/openj9/runtime/vm/BytecodeInterpreter.hpp:8465
#6  VM_DebugBytecodeInterpreterCompressed::run (this=this@entry=0x7fffc8fa4900, vmThread=<optimized out>) at /root/openj9-openjdk-jdk17/openj9/runtime/vm/BytecodeInterpreter.hpp:11124
#7  0x00007ffff6004b6d in debugBytecodeLoopCompressed (currentThread=<optimized out>) at /root/openj9-openjdk-jdk17/openj9/runtime/vm/BytecodeInterpreter.inc:112
#8  0x00007ffff60685a2 in c_cInterpreter () at /root/openj9-openjdk-jdk17/build/linux-x86_64-server-release/vm/runtime/vm/xcinterp.s:158
#9  0x00007ffff5f4faea in runJavaThread (currentThread=0x7058debb8, currentThread@entry=0x2a5d00) at /root/openj9-openjdk-jdk17/openj9/runtime/vm/callin.cpp:648
#10 0x00007ffff5fc4b9d in javaProtectedThreadProc (portLibrary=portLibrary@entry=0x7ffff677a340 <j9portLibrary>, entryarg=entryarg@entry=0x2a5d00)
    at /root/openj9-openjdk-jdk17/openj9/runtime/vm/vmthread.c:2070
#11 0x00007ffff5528f73 in omrsig_protect (portLibrary=0x7ffff677a340 <j9portLibrary>, fn=0x7ffff5fc4ae0 <javaProtectedThreadProc>, fn_arg=0x2a5d00,
    handler=0x7ffff5f730f0 <structuredSignalHandler>, handler_arg=0x2a5d00, flags=506, result=0x7fffc8fa4dc8) at /root/openj9-openjdk-jdk17/omr/port/unix/omrsignal.c:425
#12 0x00007ffff5fc0cfa in javaThreadProc (entryarg=0x7ffff0010070) at /root/openj9-openjdk-jdk17/openj9/runtime/vm/vmthread.c:349
#13 0x00007ffff5d26302 in thread_wrapper (arg=0x7ffff03670e0) at /root/openj9-openjdk-jdk17/omr/thread/common/omrthread.c:1724
#14 0x00007ffff71986db in start_thread (arg=0x7fffc8fa7700) at pthread_create.c:463
#15 0x00007ffff78f271f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@babsingh
Copy link
Contributor

we need to determine if this is a stop-ship / blocker issue for 0.29.1

There is a potential workaround to prevent this crash by avoiding -Xjit:disableAsyncCompilation. JIT team may have a better workaround. This issue should not be labelled as a stop-ship/blocker issue for 0.29.1 due to the presence of a workaround.

@DanHeidinga
Copy link
Member

we need to determine if this is a stop-ship / blocker issue for 0.29.1

There is a potential workaround to prevent this crash by avoiding -Xjit:disableAsyncCompilation. JIT team may have a better workaround. This issue should not be labelled as a stop-ship/blocker issue for 0.29.1 due to the presence of a workaround.

That's a pretty heavy hammer to use to avoid the issue as that will make all compilations synchronous, not just those needed to address guard failures. Expect to see much higher latency and worse startup with that option.

I'm not comfortable with that being considered a sufficient workaround for this issue

@jdmpapin
Copy link
Contributor

jdmpapin commented Nov 26, 2021

That's a pretty heavy hammer to use to avoid the issue as that will make all compilations synchronous, not just those needed to address guard failures. Expect to see much higher latency and worse startup with that option.

The workaround suggested by @babsingh was to avoid passing disableAsyncCompilation rather than to start passing it:

Segfault is only seen when -Xjit:disableAsyncCompilation is specified.

Running without specifying disableAsyncCompilation just makes all compilations asynchronous, which is the default

Separately, I wonder whether this is related to #13162

@jdmpapin
Copy link
Contributor

jdmpapin commented Nov 26, 2021

However: While running without disableAsyncCompilation seems to prevent this test from failing, it's not necessarily a generally reliable workaround for the actual bug. I would even go further and guess that it doesn't prevent the bug in general, since I don't think I've ever personally encountered a bug for which it would

I think it probably still makes sense to take into account that this is an unusual mode when evaluating the severity though:

-Xdebug -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation

@DanHeidinga
Copy link
Member

Thanks for the clarification @jdmpapin and apologies @babsingh - I misread the original comment.

@pshipton
Copy link
Member

pshipton commented Nov 26, 2021

Separately, I wonder whether this is related to #13162

The workaround for that was promoted in the head stream last night.

@jdmpapin
Copy link
Contributor

This issue was opened yesterday in the morning

@babsingh
Copy link
Contributor

re #14014 (comment):

Yesterday, @EricYangIBM tried eclipse/omr#6255 and stated in #14014 (comment) that the crash still happens.

Just now, I rebased both openj9 and omr repos with the latest changes. But, I can no longer reproduce the segfault. Can another person confirm this point?

@jdmpapin
Copy link
Contributor

Oh I missed that, sorry

@DanHeidinga
Copy link
Member

Just now, I rebased both openj9 and omr repos with the latest changes. But, I can no longer reproduce the segfault. Can another person confirm this point?

Has this been confirmed now? I see a 👍 on the post but I'm too old to know if that means "confirmed" or "agree we should confirm"

@jdmpapin
Copy link
Contributor

That's my 👍, which I meant mainly to acknowledge the information I had missed above from #14014 (comment)

I'm not aware of any confirmation, and certainly not on my part. For the record, if I had confirmed something like that, I would have said so explicitly. But sorry nonetheless for the ambiguity 😕

@babsingh
Copy link
Contributor

babsingh commented Nov 30, 2021

Has this been confirmed now?

It still happens with the latest OpenJ9 and OMR changes. I observed the following three scenarios:

  • After the SCC is cleared, the first run with -Xjit:disableAsyncCompilation will always pass. But, subsequent runs will fail. Probably, the data after generation is good that's why the first run passes. After the data is stored in the SCC, it either gets corrupted or is not read correctly from the SCC which leads to segfaults in the subsequent runs.
  • After the SCC is cleared and if the first run does not use -Xjit:disableAsyncCompilation, then subsequent runs with -Xjit:disableAsyncCompilation will always pass. In this scenario, some data is being stored in the SCC during the first run which allows subsequent runs to always pass. The segfault re-occurs after the SCC is destroyed and -Xjit:disableAsyncCompilation is used in the first run.
  • The segfault does not happen with SCC disabled (-Xshareclasses:none).

Different SCC options

From the above scenarios, I feel that the JIT is storing some data in the SCC which is either corrupted or not being read correctly from the SCC. The culprit data is only generated when -Xjit:disableAsyncCompilation is specified.

I tried the below -Xshareclasses options to figure out where the corruption is happening in the SCC:

  1. -Xshareclasses:noaot
  2. -XX:-ShareUnsafeClasses -XX:-ShareAnonymousClasses
  3. -Xshareclasses:nojitdata
  4. -Xshareclasses:cacheRetransformed
  5. -Xshareclasses:disableBCI

None of the above options prevented the segfault. @hangshao0 for insights.

@pshipton
Copy link
Member

pshipton commented Nov 30, 2021

-Xshareclasses:noaot (and similarly -Xshareclasses:nojitdata) prevents storing new aot but doesn't stop loading of existing aot already in the cache. Try -Xnoaot.

@babsingh
Copy link
Contributor

With -Xnoaot, the segfault does not happen. @dsouzai for AOT insights.

@0xdaryl
Copy link
Contributor

0xdaryl commented Nov 30, 2021

This failure will have to be triaged to get to the root cause. I suspect including -Xjit:disableAsyncCompilation is really just affecting the timing or ordering of compilations in such a way that it is making the problem more reproducible. It is likely the problem exists without async compilation but it is harder to trigger.

What is the failure rate in 100 runs? Does it fail every time? I don't think that statistic was mentioned above.

Given that -Xjit:disableAsyncCompilation,count=1 is not the default way of invoking the JIT suggests this is not a blocker but one that should be investigated for 0.30.

@jdmpapin
Copy link
Contributor

jdmpapin commented Dec 3, 2021

BTW, I took a look at string peepholes, and it seems it will already refuse to transform under involuntary OSR, though not as intentionally as we might like. The patterns it's looking for start like so:

new java/lang/StringBuilder
dup
invokespecial java/lang/StringBuilder.<init>()V

(though StringBuffer and a few other constructor signatures are also accepted)

With involuntary OSR, the trees for this bytecode contain a store to a pending push temp because the new result (in its original stack slot) is live across the constructor call:

treetop
  new
    loadaddr java/lang/StringBuilder
astore <pending push temp ...>
  ==>new
NULLCHK on n4n
  call  java/lang/StringBuilder.<init>()V
    ==>new

When string peepholes sees the new and starts looking for the pattern, it searches for the constructor call using searchForInitCall() starting at the tree just after the new. In the above trees, that's the store to the pending push temp. It does not satisfy skipNodeUnderOSR() (which only skips potential OSR point helper calls and some other nodes in post-execution OSR, which is all for voluntary OSR). Then because the new node occurs within the astore tree, searchForInitCall() stops searching there, and it stops the search without setting initTree, preventing the transformation

So we shouldn't currently have this same bug for string peepholes, though IMO it would be an improvement to explicitly skip the attempt to find patterns and transform, e.g. here:

int32_t TR_StringPeepholes::perform()
{
static char *skipitAtWarm = feGetEnv("TR_noPeepholeAtWarm");
if (comp()->getOption(TR_DisableStringPeepholes)
|| (!comp()->fej9()->doStringPeepholing() && !comp()->getOption(TR_UseSymbolValidationManager))
|| (skipitAtWarm && comp()->getMethodHotness()==warm))
return 1;

FYI @vijaysun-omr, @0xdaryl

@jdmpapin
Copy link
Contributor

jdmpapin commented Dec 3, 2021

Similarly, StringBuilder transformer expects to find the constructor call in the next tree after the new (possibly skipping an allocationFence, and possibly skipping some other nodes in post-execution OSR, which again is voluntary)

@vijaysun-omr
Copy link
Contributor

vijaysun-omr commented Dec 3, 2021

Thanks @jdmpapin I agree it would be better to more explicitly avoid those kinds of transformations that change the program in ways that involuntary OSR may not be able to handle. I wonder if these transformations in the IL generator also fall in a similar category :

// fast pathing for ORB readObject optimization

@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 6, 2021

Now that we understand the circumstances in which this problem occurs I don't believe this a blocker for 0.29.1 and can wait for the 0.30 release. I recommend the fix (#14071) be merged into master and then subsequently merged into 0.30 when it is ready.

This is a long-standing problem that is not JDK17-specific. It manifests itself when clone() calls are transformed to allocations by the JIT and an OSR is subsequently provoked during these "artificial" allocations. The problem can be triggered more readily through a carefully designed test case and JIT stress options. Even with this test it has not appeared with default JIT options.

Exploration of other OSR problems with existing JIT optimizations will also be conducted in the background and contributed to the code base when ready. We are not aware of any known OSR problems from any of these other JIT optimizations.

@mstoodle @pshipton @DanHeidinga

@mstoodle
Copy link
Contributor

mstoodle commented Dec 6, 2021

I agree with @0xdaryl 's analysis and suggested course of action, which suggests this issue can be removed from 0.29.1's milestone blockers https://github.com/eclipse-openj9/openj9/milestone/32 .

Thank you to everyone who contributed to understanding this issue, regardless of how we proceed here. Everyone worked hard to figure out the underlying problem and it showed great collaboration across teams as the investigation progressed!

@tajila
Copy link
Contributor

tajila commented Dec 6, 2021

The original problem was only seen in JDK17, Ill take a look at these ones to determine the cause

@JasonFengJ9
Copy link
Member

Similar failure observed at an internal JDK8 build job/Test_openjdk8_j9_extended.functional_aarch64_linux_testList_2/71/(ub18-aarch64-10)

openjdk version "1.8.0_322"
IBM Semeru Runtime Open Edition (build 1.8.0_322-b04)
Eclipse OpenJ9 VM (build master-5afc033b2, JRE 1.8.0 Linux aarch64-64-Bit Compressed References 20211204_243 (JIT enabled, AOT enabled)
OpenJ9   - 5afc033b2
OMR      - 5e9418dcf
JCL      - 854c9aec based on jdk8u322-b04)

[2021-12-04T05:42:54.566Z] variation: Mode107-OSRG -Xnocompressedrefs
[2021-12-04T05:42:54.566Z] JVM_OPTIONS:  -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xnocompressedrefs 

[2021-12-04T05:48:17.301Z] Testing: rc021
[2021-12-04T05:48:17.301Z] Test start time: 2021/12/04 00:48:14 Eastern Standard Time
[2021-12-04T05:48:17.301Z] Running command: "/home/jenkins/workspace/Test_openjdk8_j9_extended.functional_aarch64_linux_testList_2/openjdkbinary/j2sdk-image/bin/java"  -Xgcpolicy:optthruput -Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,onthrow=no.pkg.foo,launch=echo -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xnocompressedrefs  -Xdump    -agentlib:jvmtitest=test:rc021 -cp "/home/jenkins/workspace/Test_openjdk8_j9_extended.functional_aarch64_linux_testList_2/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar:/home/jenkins/workspace/Test_openjdk8_j9_extended.functional_aarch64_linux_testList_2/aqa-tests/TKG/../TKG/lib/asm-all.jar" com.ibm.jvmti.tests.util.TestRunner
[2021-12-04T05:48:17.301Z] Time spent starting: 1 milliseconds
[2021-12-04T05:50:26.174Z] Time spent executing: 116588 milliseconds
[2021-12-04T05:50:26.174Z] Test result: FAILED
[2021-12-04T05:50:26.174Z] Output from test:
[2021-12-04T05:50:26.174Z]  [OUT] *** Testing [1/2]:	testAnnotatedTypes

[2021-12-04T05:50:26.178Z]  [OUT] OK
[2021-12-04T05:50:26.178Z]  [OUT] 
[2021-12-04T05:50:26.178Z]  [OUT] *** Testing [2/2]:	testAnnotationCacheAfterRedefine

[2021-12-04T05:50:26.181Z]  [OUT] [@java.lang.Deprecated()]
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.util.Objects.requireNonNull(Objects.java:203)
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.util.Objects.requireNonNull(Objects.java:203)	at java.lang.reflect.Proxy.<init>(Proxy.java:265)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.lang.reflect.Proxy.<init>(Proxy.java:265)
[2021-12-04T05:50:26.181Z]  [OUT] 	at com.ibm.jvmti.tests.redefineClasses.$Proxy6.<init>(Unknown Source)	at com.ibm.jvmti.tests.redefineClasses.$Proxy6.<init>(Unknown Source)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at <unknown class>.<unknown method>(Unknown Source)	at <unknown class>.<unknown method>(Unknown Source)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:739)	at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:739)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at sun.reflect.annotation.AnnotationParser$1.run(AnnotationParser.java:305)	at sun.reflect.annotation.AnnotationParser$1.run(AnnotationParser.java:305)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at sun.reflect.annotation.AnnotationParser$1.run(AnnotationParser.java:303)	at sun.reflect.annotation.AnnotationParser$1.run(AnnotationParser.java:303)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.security.AccessController.doPrivileged(AccessController.java:682)	at java.security.AccessController.doPrivileged(AccessController.java:682)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at sun.reflect.annotation.AnnotationParser.annotationForMap(AnnotationParser.java:303)	at sun.reflect.annotation.AnnotationParser.annotationForMap(AnnotationParser.java:303)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at sun.reflect.annotation.AnnotationParser.parseAnnotation2(AnnotationParser.java:293)	at sun.reflect.annotation.AnnotationParser.parseAnnotation2(AnnotationParser.java:293)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at sun.reflect.annotation.AnnotationParser.parseAnnotations2(AnnotationParser.java:120)	at sun.reflect.annotation.AnnotationParser.parseAnnotations2(AnnotationParser.java:120)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at sun.reflect.annotation.AnnotationParser.parseAnnotations(AnnotationParser.java:72)	at sun.reflect.annotation.AnnotationParser.parseAnnotations(AnnotationParser.java:72)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.lang.Class.getAnnotationCache(Class.java:2866)	at java.lang.Class.getAnnotationCache(Class.java:2866)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.lang.Class.getDeclaredAnnotations(Class.java:2534)	at java.lang.Class.getDeclaredAnnotations(Class.java:2534)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at com.ibm.jvmti.tests.redefineClasses.rc021$1AnnotationFetch.run(rc021.java:73)	at com.ibm.jvmti.tests.redefineClasses.rc021$1AnnotationFetch.run(rc021.java:73)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] 	at java.lang.Thread.run(Thread.java:827)	at java.lang.Thread.run(Thread.java:827)
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z]  [OUT] Annotations:
[2021-12-04T05:50:26.181Z]  [OUT] Annotations:
[2021-12-04T05:50:26.181Z]  [OUT] [@com.ibm.jvmti.tests.redefineClasses.Original(), @com.ibm.jvmti.tests.redefineClasses.Original2(), @com.ibm.jvmti.tests.redefineClasses.Original3()]
[2021-12-04T05:50:26.181Z]  [OUT] Annotations:
[2021-12-04T05:50:26.181Z]  [OUT] Annotations:
[2021-12-04T05:50:26.181Z]  [OUT] Annotations:
[2021-12-04T05:50:26.181Z]  [OUT] [@com.ibm.jvmti.tests.redefineClasses.Original(), @com.ibm.jvmti.tests.redefineClasses.Original2(), @com.ibm.jvmti.tests.redefineClasses.Original3()]
[2021-12-04T05:50:26.181Z]  [OUT] [@com.ibm.jvmti.tests.redefineClasses.Original(), @com.ibm.jvmti.tests.redefineClasses.Original2(), @com.ibm.jvmti.tests.redefineClasses.Original3()]
[2021-12-04T05:50:26.181Z]  [OUT] [@com.ibm.jvmti.tests.redefineClasses.Original(), @com.ibm.jvmti.tests.redefineClasses.Original2(), @com.ibm.jvmti.tests.redefineClasses.Original3()]
[2021-12-04T05:50:26.181Z]  [OUT] [@com.ibm.jvmti.tests.redefineClasses.Original(), @com.ibm.jvmti.tests.redefineClasses.Original2(), @com.ibm.jvmti.tests.redefineClasses.Original3()]
[2021-12-04T05:50:26.181Z]  [OUT] *** Test took 99723 milliseconds
[2021-12-04T05:50:26.181Z]  [OUT] FAILED
[2021-12-04T05:50:26.181Z]  [OUT] 
[2021-12-04T05:50:26.181Z] >> Success condition was not found: [Return code: 0]

[2021-12-04T05:51:33.865Z] ---TEST RESULTS---
[2021-12-04T05:51:33.865Z] Number of PASSED tests: 26 out of 27
[2021-12-04T05:51:33.865Z] Number of FAILED tests: 1 out of 27
[2021-12-04T05:51:33.865Z] 
[2021-12-04T05:51:33.865Z] ---SUMMARY OF FAILED TESTS---
[2021-12-04T05:51:33.865Z] rc021
[2021-12-04T05:51:33.865Z] -----------------------------
[2021-12-04T05:51:33.865Z] 
[2021-12-04T05:51:33.865Z] 
[2021-12-04T05:51:33.865Z] cmdLineTester_jvmtitests_hcr_OSRG_nongold_SE80_0_FAILED

@DanHeidinga
Copy link
Member

I agree with @0xdaryl 's analysis and suggested course of action, which suggests this issue can be removed from 0.29.1's milestone blockers https://github.com/eclipse-openj9/openj9/milestone/32 .

To confirm I understand the latest updates - the issue is with any call to clone() where the jit inlines the sequence directly that we later OSR out of? It applies to all current releases with OSR-enabled and could occur in any release in the field. 17 just happens to be where we found it.

Assuming that's accurate, then the plan to move this out of the 0.29.1 release's list of blockers makes sense to me. It's not specific to the 17 release and is already present in the field.

Thank you to everyone who contributed to understanding this issue, regardless of how we proceed here. Everyone worked hard to figure out the underlying problem and it showed great collaboration across teams as the investigation progressed!

A big +1 to this as well!

@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 6, 2021

Assuming that's accurate

Yes, that's accurate.

EricYangIBM added a commit to EricYangIBM/openj9 that referenced this issue Dec 7, 2021
Crashes in new test specific to these platforms.

See eclipse-openj9#14014 (comment)
Signed-off-by: Eric Yang <eric.yang@ibm.com>
@pshipton
Copy link
Member

pshipton commented Dec 7, 2021

We will exclude the test on PPC and arch64.

I don't think we should be excluding. We've already merged a fix #14071. We should be removing the exclude for jdk17 if the problem has been resolved.

Also note that excluding the entire test rather than just rc021 could hide other issues.

@pshipton
Copy link
Member

pshipton commented Dec 8, 2021

@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 8, 2021

This is still failing although the build contains #14071
https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_extended.functional_aarch64_linux_Nightly_testList_0/47/

@jdmpapin : can you take a look at the crash, at least to confirm whether it is or isn't related to the original problem? If AArch64-specific I can farm it out for investigation.

@jdmpapin
Copy link
Contributor

jdmpapin commented Dec 8, 2021

Comparing the output against the only prior failure I've found linked from this issue, which is from the initial comment #14014 (comment)

Within rc021, the original failure occurred in testAnnotatedTypes:

[OUT] *** Testing [1/2]:	testAnnotatedTypes
[ERR] Unhandled exception
[ERR] Type=Segmentation error vmState=0x00000000
...

The crash was in libj9vm29.so:

[ERR] Module=/home/jenkins/workspace/Test_openjdk17_j9_extended.functional_ppc64le_linux_Personal_testList_0/openjdkbinary/j2sdk-image/lib/default/libj9vm29.so

And it indicated that there was a bad reference encountered in MemberName.clone():

[ERR] 0000000000367400: Object neither in heap nor stack-allocated in thread pool-1-thread-92
[ERR] 0000000000367400:	O-Slot=00000000005304C8
[ERR] 0000000000367400:	O-Slot value=000000000000000E
[ERR] 0000000000367400:	PC=00003FFF4D4E561C
[ERR] 0000000000367400:	framesWalked=0
[ERR] 0000000000367400:	arg0EA=00000000005304F0
[ERR] 0000000000367400:	walkSP=00000000005304C8
[ERR] 0000000000367400:	literals=0000000000101008
[ERR] 0000000000367400:	jitInfo=0000000000000000
[ERR] 0000000000367400:	method=0000000000101008 (java/lang/invoke/MemberName.clone()Ljava/lang/invoke/MemberName;) (Interpreted)
[ERR] 0000000000367400:	stack=000000000052B480-0000000000530D20

On the other hand, in this more recent failed test, testAnnotatedTypes ran to completion, and the crash occurred during testAnnotationCacheAfterRedefine, in libjclse29.so, and without any output complaining of a bad reference or mentioning a clone() method

[OUT] *** Testing [1/2]:	testAnnotatedTypes
...
[OUT] *** Test took 5429 milliseconds
[OUT] OK
[OUT] 
[OUT] *** Testing [2/2]:	testAnnotationCacheAfterRedefine
...
[ERR] Unhandled exception
[ERR] Type=Segmentation error vmState=0x00040000
...
[ERR] Module=/home/jenkins/workspace/Test_openjdk8_j9_extended.functional_aarch64_linux_Nightly_testList_0/openjdkbinary/j2sdk-image/jre/lib/aarch64/default/libjclse29.so
...

With these differences and the fact that the more recent failure happened despite #14071, I'm pretty sure it's a distinct bug

@pshipton
Copy link
Member

pshipton commented Dec 8, 2021

I've created #14115 for the separate problem. We can close this one once #14105 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests