Support continuation stack in JIT code cache release #16374
I haven't been able to figure out the reason for the failure in Balanced GC; I will try adding debug output locally.
Tests used to verify this changeset: https://github.com/ibmruntimes/openj9-openjdk-jdk19/blob/openj9/test/jdk/java/lang/Thread/virtual/stress/Skynet.java compile with |
Balanced GC crash backtrace that I'm observing with this PR running Skynet:
The JIT changes themselves look OK here. I'm trying to swap this issue back into my head. What was the solution to the balanced GC crash, or is it still an outstanding problem?
- Use walkAllStackFrames API for standard GC
- Custom code handling for realtimeGC path

Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
jenkins test sanity xlinux jdk19
jenkins test sanity zlinux jdk19
With the latest OpenJ9 plus this PR cherry-picked, Skynet is much healthier. However, it still eventually fails with:
The VMThread address suggests it may be a variant of this problem, and that there is still some virtual thread stack somewhere that isn't being scanned for reclaimed methods.
My limited local runs either passed or failed because of excessive memory use and were killed by the OS; I will try to reproduce the
I'm running on a monster Intel Cascade Lake box with 112 hardware threads. Not sure if that might be exposing some race conditions not seen on a VM.
@fengxue-IS : If you want access to that box please contact me directly.
I was able to reproduce this once locally this afternoon, but the core file isn't really useful as the cleaning had already happened. Based on code inspection, there seems to be a small timing hole between when the continuation is mounted/unmounted and when it is added to/removed from the global list (I'm locally testing a change to
This is to ensure virtual/carrier threads in mount/unmount transition will be found during the walk process. Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
As vthreads are added to/removed from the global list during the first/last transition, there is a chance that a carrier thread has a virtual thread mounted that is not yet in the vthread list, which means the carrier will not be scanned if a GC occurs during the transition period. @0xdaryl I've updated the logic in
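The timing hole described above can be illustrated with a minimal C sketch. It is not the OpenJ9 code: the `Continuation` and `Carrier` types and the walker functions are hypothetical stand-ins, and only the field name `currentContinuation` is borrowed from the diff in this PR. The sketch shows that a walk over the global list alone misses a continuation that is mounted on a carrier but not yet listed, while also visiting each carrier's mounted continuation finds it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the OpenJ9 structures. */
typedef struct Continuation {
    bool scanned;                /* set once the walker has visited this stack */
    struct Continuation *next;   /* link in the (hypothetical) global vthread list */
} Continuation;

typedef struct Carrier {
    Continuation *currentContinuation; /* NULL when nothing is mounted */
} Carrier;

/* Stand-in for scanning one continuation's stack frames. */
static void scan(Continuation *c)
{
    c->scanned = true;
}

/* Walks only the global list: a continuation that is mounted but not yet
 * added to the list is never visited. */
static void walk_list_only(Continuation *listHead)
{
    for (Continuation *c = listHead; NULL != c; c = c->next) {
        scan(c);
    }
}

/* Walks the global list and then whatever each carrier has mounted, so a
 * continuation caught mid-transition is still visited. */
static void walk_list_and_carriers(Continuation *listHead, Carrier *carriers, size_t count)
{
    walk_list_only(listHead);
    for (size_t i = 0; i < count; i++) {
        if (NULL != carriers[i].currentContinuation) {
            scan(carriers[i].currentContinuation);
        }
    }
}
```

In this sketch a continuation that is both listed and mounted is scanned twice, which is harmless for a boolean mark but would be redundant work in a real stack walk.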
I cherry-picked your last commit onto my previous build. Skynet blows up almost immediately during a stack walk with:
I will rebuild a JDK19 from scratch and reapply your commits just to be sure something is not out of sync in my build.
The failure is intermittent, and I seem to have been unlucky (or lucky?) on the first two runs, which failed immediately. It seems to fail in about 1 of 8 runs. I've also only seen it fail on the 112 thread machine. It has not yet been reproduced with the same build on a 96 thread machine.
Looking at the stack trace, it doesn't seem to be related to the code changes in this PR, the failing function In this case, could this be a separate issue exposed after the JIT return address issue has been fixed? As @0xdaryl has mentioned, this case seems particular to machines with very high thread counts, which is not something that we have been testing on regularly in the past (if at all). If this is a new/different issue, I suggest we merge this PR once review is done and track the failure separately, as this will allow us to enable/resolve the testing that is blocked due to the invalid JIT return address problem.
I went through the code shown in the stack trace but I don't fully understand how the call stack jumped from @0xdaryl what is the
Jenkins test sanity all jdk19
Jenkins test sanity aix jdk19
AIX failure looks infrastructural. Re-running...
Jenkins test sanity aix jdk19
This should be delivered to 0.37 as well. |
#if JAVA_SPEC_VERSION >= 19
    if (NULL != thr->currentContinuation) {
        thr->currentContinuation->dropFlags &=0x0;
Why &= 0 instead of just = 0 (lines 6665, 6668, 6683)?
I think it was written like this originally as the inverse of ->dropFlags |= 0x1;
Will update this to = 0 as part of the upcoming refactor PR (to avoid duplicate walk for mounted continuation)
If the intent was just to clear the low-order bit, then the code is wrong (it clears all bits), it should be
thr->currentContinuation->dropFlags &= ~1;
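The difference between the forms discussed in this review thread can be checked with a small C sketch. The helper names, the `DROP_FLAG_LOW_BIT` macro, and the `uint32_t` type are assumptions made for illustration; the actual type of `dropFlags` is not shown in this thread.

```c
#include <stdint.h>

/* Hypothetical name for the bit set by `dropFlags |= 0x1`. */
#define DROP_FLAG_LOW_BIT ((uint32_t)0x1)

/* What `flags &= 0x0` does: every bit is cleared, exactly like `flags = 0`. */
static uint32_t clear_all_bits(uint32_t flags)
{
    flags &= 0x0;
    return flags;
}

/* The true inverse of `flags |= 0x1`: only the low-order bit is cleared,
 * and any other bits that were set are preserved. */
static uint32_t clear_low_bit(uint32_t flags)
{
    flags &= ~DROP_FLAG_LOW_BIT;
    return flags;
}
```

With an input of 0x5, `clear_all_bits` returns 0 while `clear_low_bit` returns 0x4. The two only agree when no bit other than the low-order one is ever set, so which form is right depends on whether the field is meant to carry more than one flag.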
Tests excluded due to eclipse-openj9/openj9#15939 are fixed by eclipse-openj9/openj9#16374 Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
Related: #15939
Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>