Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support continuation stack in JIT code cache release #16374

Merged
merged 4 commits into from
Jan 28, 2023

Conversation

fengxue-IS
Copy link
Contributor

Related: #15939

Signed-off-by: Jack Lu Jack.S.Lu@ibm.com

@fengxue-IS
Copy link
Contributor Author

I haven't been able to figure out the reason for failure in Balanced GC, will try adding debug output locally.
@0xdaryl FYI

@fengxue-IS
Copy link
Contributor Author

Tests used to verify this changeset: https://github.com/ibmruntimes/openj9-openjdk-jdk19/blob/openj9/test/jdk/java/lang/Thread/virtual/stress/Skynet.java

compile with javac --enable-preview -source 19 Skynet.java
run with java --enable-preview --Xnocompressedrefs -Xgcpolicy:[gencon/balanced/metronome] Skynet

@fengxue-IS fengxue-IS added backport:candidate Possible candidate for a backport to a `0.x.y+1` release jdk19 labels Dec 1, 2022
@fengxue-IS fengxue-IS marked this pull request as ready for review December 1, 2022 22:38
@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 6, 2022

Balanced GC crash backtrace that I'm observing with this PR running Skynet:

getOMRVMThreadName+0x4 (0x00007F396E08F9D4 [libj9gc29.so+0x1fc9d4])
_ZN21MM_StackSlotValidator15reportStackSlotEP18MM_EnvironmentBasePKc+0x7b (0x00007F396E083CBB [libj9gc29.so+0x1f0cbb])
_ZN22MM_GlobalMarkingScheme11doStackSlotEP19MM_EnvironmentVLHGCP8J9ObjectPS3_P16J9StackWalkStatePKv+0x1a7 (0x00007F396DF8A657 [libj9gc29.so+0xf7657])
walkJITFrameSlots.constprop.0+0xb7 (0x00007F396EB344E7 [libj9jit29.so+0x9dd4e7])
jitWalkFrame+0x2af (0x00007F396EB3491F [libj9jit29.so+0x9dd91f])
jitWalkStackFrames+0x10b4 (0x00007F396EB35EC4 [libj9jit29.so+0x9deec4])
walkStackFrames+0xa6 (0x00007F39741F3406 [libj9vm29.so+0x80406])
walkContinuationStackFrames.part.0+0x56 (0x00007F3974208EB6 [libj9vm29.so+0x95eb6])
_ZN28GC_VMThreadStackSlotIterator9scanSlotsEP10J9VMThreadP8J9ObjectPvPFvP8J9JavaVMPS3_S4_P16J9StackWalkStatePKvEbbb+0x5f (0x00007F396DED77BF [libj9gc29.so+0x447bf])
_ZN22MM_GlobalMarkingScheme27scanContinuationNativeSlotsEP19MM_EnvironmentVLHGCP8J9ObjectNS_10ScanReasonE+0xa9 (0x00007F396DF8AD49 [libj9gc29.so+0xf7d49])
_ZN22MM_GlobalMarkingScheme22scanContinuationObjectEP19MM_EnvironmentVLHGCP8J9ObjectNS_10ScanReasonE+0x1c (0x00007F396DF8AD6C [libj9gc29.so+0xf7d6c])
_ZN22MM_GlobalMarkingScheme19markLiveObjectsScanEP19MM_EnvironmentVLHGC+0x90 (0x00007F396DF8E500 [libj9gc29.so+0xfb500])
_ZN25MM_ParallelGlobalMarkTask3runEP18MM_EnvironmentBase+0x29c (0x00007F396DF8F93C [libj9gc29.so+0xfc93c])
_ZN21MM_ParallelDispatcher16workerEntryPointEP18MM_EnvironmentBase+0x228 (0x00007F396DFC2238 [libj9gc29.so+0x12f238])
_Z23dispatcher_thread_proc2P14OMRPortLibraryPv+0x109 (0x00007F396DFC1A29 [libj9gc29.so+0x12ea29])
omrsig_protect+0x2b1 (0x00007F397576F851 [libj9prt29.so+0x2a851])
dispatcher_thread_proc+0x3f (0x00007F396DFC146F [libj9gc29.so+0x12e46f])
thread_wrapper+0x163 (0x00007F3975737333 [libj9thr29.so+0xb333])
start_thread+0xea (0x00007F397527E17A [libpthread.so.0+0x817a])
clone+0x43 (0x00007F3974DA9DF3 [libc.so.6+0xfcdf3])

@fengxue-IS fengxue-IS changed the title [WIP] Add new internalVMFunctions walkAllStackFrames Add new internalVMFunctions walkAllStackFrames Dec 6, 2022
@fengxue-IS
Copy link
Contributor Author

@0xdaryl Can you please do a first round review on this. I will continue to investigate the crash seen with Balanced GC.
@tajila can you take a look at the walkAllStackFrames API added in VM (the commented out block of code is pseudo code for using GC's continuation list)

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 17, 2023

The JIT changes themselves look OK here.

I'm trying to swap this issue back into my head. What was the solution to the balanced GC crash, or is it still an outstanding problem?

@fengxue-IS
Copy link
Contributor Author

fengxue-IS commented Jan 17, 2023

I'm trying to swap this issue back into my head. What was the solution to the balanced GC crash, or is it still an outstanding problem?

The Balanced GC issue have been addressed. (iirc by #16360 and #16485)

@tajila can you take a look at the VM side changes please

@fengxue-IS fengxue-IS force-pushed the walkAllStack branch 2 times, most recently from 964f102 to ab47db3 Compare January 17, 2023 23:35
- Use walkAllStackFrames API for standard GC
- Custom code handling for realtimeGC path

Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
@fengxue-IS fengxue-IS changed the title Add new internalVMFunctions walkAllStackFrames Support continuation stack in JIT code cache release Jan 19, 2023
@tajila
Copy link
Contributor

tajila commented Jan 19, 2023

jenkins test sanity xlinux jdk19

@tajila
Copy link
Contributor

tajila commented Jan 19, 2023

jenkins test sanity zlinux jdk19

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 23, 2023

With the latest OpenJ9 plus this PR cherrypicked Skynet is much healthier. However, it still eventually fails with:

*** Invalid JIT return address 00007F2255CD51B0 in 00007F2236EFC7B0

The VMThread address suggests it may be a variant of this problem and there is still some virtual thread stack somewhere that isn't being scanned for reclaimed methods.

@fengxue-IS
Copy link
Contributor Author

My limited local run either passed or failure due to excessive memory and killed by OS, will try to reproduce the Invalid JIT return address locally

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 23, 2023

I'm running on a monster Intel Cascade Lake box with 112 hardware threads. Not sure if that might be exposing some race conditions not seen on a VM.

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 23, 2023

@fengxue-IS : If you want access to that box please contact me directly.

@fengxue-IS
Copy link
Contributor Author

I was able to reproduce this once locally this afternoon, corefile isn't really useful as the cleaning already happened. Based on code inspection, there seem to be a small timing hole between where the continuation is mounted/unmounted and when it is added/removed in the global list (I'm locally testing a change to walkAllStackFrames API to address this issue.)

This is to ensure virtual/carrier threads in mount/unmount transition
will be found during the walk process.

Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
@fengxue-IS
Copy link
Contributor Author

As vthreads are added/removed in the global list during first/last transition process, there is a chance that a carrier thread have virtual mounted yet not be in the vthread list which means the carrier will not be scanned if a GC occurred during the transition period.

@0xdaryl I've updated the logic in walkAllStackFrames to include vmthread->currentContinuation as part of vmthread list to avoid missing those threads. Can you please try with the latest changes from this PR. My local testing passed for Skynet 20/20

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 25, 2023

I cherry-picked your last commit onto my previous build. Skynet blows up almost immediately during a stack walk with:

----------- Stack Backtrace -----------
old_fast_jitLookupInterfaceMethod+0x50 (0x00007F8D3D9C61E0 [libj9jit29.so+0x9da1e0])
 (0x00007F8D3D9D38E7 [libj9jit29.so+0x9e78e7])
---------------------------------------
<SNIP SNIP>
---------------------------------------
_ZN26MM_MarkingSchemeRootMarker11doStackSlotEPP8J9ObjectPvPKv+0x38b (0x00007F8D3CEC975B [libj9gc_full29.so+0x1a075b])
walkStackFrames+0x5bb (0x00007F8D3F08EF3B [libj9vm29.so+0x80f3b])
_ZN28GC_VMThreadStackSlotIterator9scanSlotsEP10J9VMThreadS1_PvPFvP8J9JavaVMPP8J9ObjectS2_P16J9StackWalkStatePKvEbb+0x3c (0x00007F8D3CD6CDBC [libj9gc_full29.so+0x43dbc])
_ZN14MM_RootScanner13scanOneThreadEP18MM_EnvironmentBaseP10J9VMThreadPv+0x10d (0x00007F8D3CD6412D [libj9gc_full29.so+0x3b12d])
_ZN14MM_RootScanner11scanThreadsEP18MM_EnvironmentBase+0xbf (0x00007F8D3CD6332F [libj9gc_full29.so+0x3a32f])
_ZN14MM_RootScanner9scanRootsEP18MM_EnvironmentBase+0x43 (0x00007F8D3CD660F3 [libj9gc_full29.so+0x3d0f3])
_ZN18MM_MarkingDelegate9scanRootsEP18MM_EnvironmentBaseb+0x125 (0x00007F8D3CEC4385 [libj9gc_full29.so+0x19b385])
_ZN19MM_ParallelMarkTask3runEP18MM_EnvironmentBase+0x79 (0x00007F8D3CEEAD39 [libj9gc_full29.so+0x1c1d39])
_ZN21MM_ParallelDispatcher16workerEntryPointEP18MM_EnvironmentBase+0x228 (0x00007F8D3CE56A38 [libj9gc_full29.so+0x12da38])
_Z23dispatcher_thread_proc2P14OMRPortLibraryPv+0x109 (0x00007F8D3CE56249 [libj9gc_full29.so+0x12d249])
omrsig_protect+0x2b1 (0x00007F8D4061D871 [libj9prt29.so+0x2a871])
dispatcher_thread_proc+0x3f (0x00007F8D3CE5537F [libj9gc_full29.so+0x12c37f])
thread_wrapper+0x163 (0x00007F8D405E5333 [libj9thr29.so+0xb333])
start_thread+0xea (0x00007F8D4012C17A [libpthread.so.0+0x817a])
clone+0x43 (0x00007F8D3FC57DF3 [libc.so.6+0xfcdf3])

I will rebuild a JDK19 from scratch and reapply your commits just to be sure something is not out of sync in my build.

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 25, 2023

The failure is intermittent, and I seem to have been unlucky (or lucky?) on the first two runs which failed immediately. Seems to fail about 1/8 runs.

I've also only seen it fail on the 112 thread machine. Not reproduced yet with the same build on a 96 thread machine.

@fengxue-IS
Copy link
Contributor Author

Looking at the stacktrace, it doesn't seem to be related to the code changes in this PR, the failing function old_fast_jitLookupInterfaceMethod is called during GC root scan by a GC thread I think.
@LinHu2016 do you have any insight to this particular issue?

In this case, could this be a separate issue exposed after the JIT return address issue has been fix?

As @0xdaryl have mentioned, this case seem particular to machine with very high thread count, which is not something that we have not been testing on regularly in the past (if at all).

If this is new/different issue, I suggest we merge this PR once review is done and track the failure separately as this will allow us to enable/resolve the testing that is blocked due to invalid JIT return address problem.

@fengxue-IS
Copy link
Contributor Author

I went through the code shown in the stack trace but I don't fully understand how did the call stack jump from MM_MarkingSchemeRootMarker::doStackSlot to old_fast_jitLookupInterfaceMethod. Those two don't seem linked in any way.

@0xdaryl what is the <SNIP SNIP> block?

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 26, 2023

Jenkins test sanity all jdk19

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 26, 2023

Jenkins test sanity aix jdk19

AIX failure looks infrastructural. Re-running...

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 28, 2023

Jenkins test sanity aix jdk19

@0xdaryl 0xdaryl self-assigned this Jan 28, 2023
@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 28, 2023

This should be delivered to 0.37 as well.

@0xdaryl 0xdaryl merged commit 5395447 into eclipse-openj9:master Jan 28, 2023
Comment on lines +6666 to +6668
#if JAVA_SPEC_VERSION >= 19
if (NULL != thr->currentContinuation) {
thr->currentContinuation->dropFlags &=0x0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why &= 0 instead of just = 0 (lines 6665, 6668, 6683)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was written like this originally as the inverse of ->dropFlags |= 0x1;

Will update this to = 0 as part of the upcoming refactor PR (to avoid duplicate walk for mounted continuation)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the intent was just to clear the low-order bit, then the code is wrong (it clears all bits), it should be

            thr->currentContinuation->dropFlags &= ~1;

fengxue-IS added a commit to fengxue-IS/aqa-tests that referenced this pull request Feb 14, 2023
Tests excluded due eclipse-openj9/openj9#15939 are fixed by eclipse-openj9/openj9#16374

Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
smlambert pushed a commit to adoptium/aqa-tests that referenced this pull request Feb 14, 2023
Tests excluded due eclipse-openj9/openj9#15939 are fixed by eclipse-openj9/openj9#16374

Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
@pshipton pshipton removed the backport:candidate Possible candidate for a backport to a `0.x.y+1` release label Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants