-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synchronization between continuation mounting and concurrent scanning #16290
Conversation
runtime/vm/ContinuationHelpers.cpp
Outdated
@@ -88,19 +89,32 @@ createContinuation(J9VMThread *currentThread, j9object_t continuationObject) | |||
return result; | |||
} | |||
|
|||
void | |||
handleRaceConditionWithContinuationScan(J9VMThread *currentThread, J9VMContinuation *continuation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest something like:
synchronizeWithConcurrentGCScan
@LinHu2016 does this fix the JIT invalid return address issue discussed within #15939? I wanted to test this out myself, however, I am not able to get a build with your commit due to a linker error. |
@nbhuiyan here is my earlier personal build for trying, if you want to confirm if JIT invalid return address issue also is triggered by race condition. |
@LinHu2016 the JIT invalid return address issue still happens with your build, so it is not caused by this particular race condition. |
c9af7e1
to
44ca3f2
Compare
runtime/vm/ContinuationHelpers.cpp
Outdated
omrthread_monitor_exit(currentThread->publicFlagsMutex); | ||
|
||
/* set J9_PUBLIC_FLAGS_HALT_THREAD_FOR_CONCURRENT_GC in currentThread's publicFlags */ | ||
setHaltFlag(currentThread, J9_PUBLIC_FLAGS_HALT_THREAD_FOR_CONCURRENT_GC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used/needed anymore
Please do not use names like |
runtime/vm/ContinuationHelpers.cpp
Outdated
/* atomically 'or' (not 'set') continuation->carrierThread with currentThread */ | ||
uintptr_t oldValue = VM_AtomicSupport::bitOr((volatile uintptr_t*)&continuation->carrierThread, (uintptr_t)currentThread); | ||
|
||
if (oldValue&J9_GC_CONTINUATION_CONCURRENTSCANNING) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting, and should be ANY_BITS_SET anyway.
and while renaming it, let's use a bit more descriptive name, like: syncWithConcurrentGCScan EDIT: I was referring to usage as an argument in walkContinuationStackFramesWrapper. Other spots may or may not need a different name. |
runtime/gc_base/GCExtensions.hpp
Outdated
@@ -337,6 +339,8 @@ class MM_GCExtensions : public MM_GCExtensionsBase { | |||
, minimumFreeSizeForSurvivor(DEFAULT_SURVIVOR_MINIMUM_FREESIZE) | |||
, freeSizeThresholdForSurvivor(DEFAULT_SURVIVOR_THRESHOLD) | |||
, recycleRemainders(true) | |||
, disableScanMountedContinuationObject(true) | |||
, enableContinuationMountingMutex(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these are just temporary debug flags that you used during testing. I think you can remove them now, so that affected code is easier to read.
@@ -272,7 +272,8 @@ MM_MarkingDelegate::scanContinuationNativeSlots(MM_EnvironmentBase *env, omrobje | |||
bStackFrameClassWalkNeeded = isDynamicClassUnloadingEnabled(); | |||
#endif /* J9VM_GC_DYNAMIC_CLASS_UNLOADING */ | |||
|
|||
GC_VMThreadStackSlotIterator::scanSlots(currentThread, objectPtr, (void *)&localData, stackSlotIteratorForMarkingDelegate, bStackFrameClassWalkNeeded, false); | |||
bool bNeedMutex = _extensions->enableContinuationMountingMutex && _extensions->shouldScavengeNotifyGlobalGCOfOldToOldReference(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately we will either have to introduce a new API that is not Scav specific (what is OMR change) or find another existing way to check this (via ConcurrentGC mode?)
runtime/vm/ContinuationHelpers.cpp
Outdated
|
||
volatile uintptr_t *localAddr = (volatile uintptr_t *) &continuation->carrierThread; | ||
omrthread_monitor_enter(currentThread->publicFlagsMutex); | ||
while ((((uintptr_t)*localAddr) & J9_GC_CONTINUATION_CONCURRENTSCANNING)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ANY_BITS here too. The cast also seems unnecessary given the pointer type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably meant 'cast'
I see you edited meanwhile
By using an existing field to contain the tag, do we need to change general uses of that field to mask the tag? |
indeed we thought about it (there is comment there in isCountinationMounted, that is probably the only other user) |
I think that would be a good idea, as well as renaming the field to make it (more) obvious not to use it directly. |
5d24adc
to
3594b76
Compare
runtime/oti/VMHelpers.hpp
Outdated
} else { | ||
/* low tagging failed due to another GC thread winning low tagging, we don't do anything - winning thread will do everything instead */ | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd structure the code like this:
if (0 == oldValue) {
if (atomic succeeds)
return true;
}
return false;
And I'd put just one comment in front of the method (since it applies to both not attempting atomic and attempting, but losing) like:
If low tagging failed (or not even attempted) due to either
- a carrier thread winning to mount, we don't need to do anything, since it will be compensated by pre/post mount actions/scans
- another GC thread winning to scan, again don't do anything, and let the winning thread do the work, instead
runtime/oti/VMHelpers.hpp
Outdated
uintptr_t oldValue = *localAddr; | ||
while ((VM_AtomicSupport::lockCompareExchange(localAddr, oldValue, oldValue & (~(uintptr_t)J9_GC_CONTINUATION_CONCURRENTSCANNING))) != oldValue) { | ||
oldValue = *localAddr; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is VM_AtomicSupport::bitAnd which also returns what you need - oldValue
37a44a3
to
29e7909
Compare
runtime/vm/ContinuationHelpers.cpp
Outdated
|
||
/* We need a full fence here to preserve happens-before relationship on PPC and other weakly | ||
* ordered architectures since learning/reservation is turned on by default. Since we have the | ||
* global pin lock counters we only need to need to address yield points, as thats the | ||
* only time a different virtualThread can run on the underlying j9vmthread. | ||
*/ | ||
VM_AtomicSupport::readWriteBarrier(); | ||
Assert_VM_true(currentThread == VM_VMHelpers::getCarrierThreadFromContinuationState(continuation->state)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also add a comment that we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted, nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race)
After that comment you can add an assert that it indeed concurrentlyScanned is not set (and have the other assert about carrier ID). Or that could just really be one assert that the state is exactly currentThread
runtime/vm/ContinuationHelpers.cpp
Outdated
/* we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted, | ||
* nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race). | ||
*/ | ||
Assert_VM_false(VM_VMHelpers::isConcurrentlyScannedFromContinuationState(continuation->state)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your intention to put this comment early, thinking if there is a need for synchronization, it should be done early.
But comment really applies to putting state back to initial, while someone may assume we are talking about currentContinuation being set to NULL. So either we should state it more explicitly rather than saying 'here" or (what was my initial intention) just keep the comment next to state reset.
Same applies to the assert, it's the best to be the closest to state reset as possible (in some malicious scenario if anotehr thread mutates state between the assert and the reset, keeping them close increases chances to catch the intruder)
Then we can merge the 2 asserts into one: currentThread == state. Such an assert would not be completely clean (in a sense it knows something about how state struct is organized), but I'm still ok with it, for the sake of simpler code (asserts). Besides, it would not be the only spot: for example, atomic in tryWinningConcurrentGCScan also assumes this knowledge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll need a comment about the reset being the last step and write fence preceding it, but only after/if GAC confirms 'borrowing' the existing fence was ok, first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no issue with the placement of the fence.
jenkins test sanity aix jdk19 |
2587e7b
to
d5c0842
Compare
jenkins test sanity win jdk19 |
@amicic There's a question above related to barriers - can you please reiterate what I'm supposed to be considering? |
It's about readWriteBarrier. It used to be the very last thing in unmount. Now, it's not. We needed a write barrier just before resetting the state (setting it to INITIAL, what basically clears carrier ID). The reset effectively presents the continuation structure to GC, letting it to be concurrently scanned. But GC potentially running on another CPU must see the up-to-date continuation structure that just have been extensively mutated by swapFieldsWithContinuation a step earlier. So, we ended up borrowing the write part of the barrier by swapping the existing full barrier with the reset line (the line used to be just setting carrierThread to NULL) I don't fully understand the original intent of the barrier (but I guess it has something to do with lock-free inNative?), so the question is if we compromised the original intent. |
The placement/use of the fence seems fine. |
|
||
/* We need a full fence here to preserve happens-before relationship on PPC and other weakly | ||
* ordered architectures since learning/reservation is turned on by default. Since we have the | ||
* global pin lock counters we only need to need to address yield points, as thats the | ||
* only time a different virtualThread can run on the underlying j9vmthread. | ||
*/ | ||
VM_AtomicSupport::readWriteBarrier(); | ||
/* we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted, | ||
* nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race). | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LinHu2016, you can expand this comment with:
Order
swap-stacks
writeBarrier
state initial
must be maintained for weakly ordered CPUs, to unsure that once the continuation is again available for GC scan (on potentially remote CPUs), all CPUs see up-to-date stack .
{ | ||
bool needScan = false; | ||
#if JAVA_SPEC_VERSION >= 19 | ||
jboolean started = J9VMJDKINTERNALVMCONTINUATION_STARTED(vmThread, continuationObject); | ||
J9VMContinuation *continuation = J9VMJDKINTERNALVMCONTINUATION_VMREF(vmThread, continuationObject); | ||
needScan = started && (NULL != continuation) && (!scanOnlyUnmounted || !isContinuationMounted(continuation)); | ||
needScan = started && (NULL != continuation) && (!isContinuationMountedOrConcurrentlyScanned(continuation)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a comment
We don't scan mounted continuations:
- for concurrent GCs, since stack is actively changing. Instead, we scan them during preMount or during root scanning if already mounted at cycle start or during postUnmount (might be indirectly via card cleaning) or during final STW (via root re-scan) if still mounted at cycle end
- for sliding compacts to avoid double slot fixups
For fully STW GCs, there is no harm to scan them, but it's a waste of time since they are scanned during root scanning already.
We don't scan currently scanned either - one scan is enough.
#endif /* JAVA_SPEC_VERSION >= 19 */ | ||
return rc; | ||
} | ||
|
||
/** | ||
* Check if we need to scan the java stack for the Continuation Object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment:
Used during main scan phase of GC (object graph traversal) or heap object iteration (in sliding compact). Not meant to be used during root scanning (neither strong roots nor weak roots)!
runtime/vm/ContinuationHelpers.cpp
Outdated
* | ||
* swap-stacks | ||
* writeBarrier | ||
* state initial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align these 3
runtime/vm/ContinuationHelpers.cpp
Outdated
@@ -33,6 +33,14 @@ | |||
|
|||
extern "C" { | |||
|
|||
void randomSleep() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove debug code
There is a race condition between continuation mounting(swapping java stacks with carrier thread) and concurrent continuation scanning the related java stacks). use atomic operations with continuation->state to synchronizer to avoid to introduce a new Mutex. 1, GC concurrent scanning use low bit of continuation->state for enter and exit ConcurrentGCScan(J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN, tryWinningConcurrentGCScan(), exitConcurrentGCScan()). 2, mounting use atomic "or" set continuation->state (synchronizeWithConcurrentGCScan()). 3, if mounting enter earlier than scanning, then drop scanning 4, if scanning enter earlier than mounting, then mounting is going to wait scanning complete 5, after scanning complete, need to clear the low bit (J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN) and if mounting is in waiting, notify the mounting. Signed-off-by: Lin Hu <linhu@ca.ibm.com>
jenkins test sanity.functional all jdk19 |
jenkins compile win,aix jdk8 |
eclipse-openj9/openj9#16212 was fixed by 1. eclipse-openj9/openj9#16290 2. eclipse-openj9/openj9#16293 eclipse-openj9/openj9#16275 is a duplicate of eclipse-openj9/openj9#16212. eclipse-openj9/openj9#16229 was fixed by eclipse-openj9/openj9#16323. FramePop/framepop02 fails with another issue, which is reported in eclipse-openj9/openj9#16346. Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
eclipse-openj9/openj9#16212 was fixed by 1. eclipse-openj9/openj9#16290; and 2. eclipse-openj9/openj9#16293. eclipse-openj9/openj9#16275 is a duplicate of eclipse-openj9/openj9#16212. eclipse-openj9/openj9#16229 was fixed by eclipse-openj9/openj9#16323. FramePop/framepop02 fails with another issue, which is reported in eclipse-openj9/openj9#16346. Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
eclipse-openj9/openj9#16212 was fixed by 1. eclipse-openj9/openj9#16290; and 2. eclipse-openj9/openj9#16293. eclipse-openj9/openj9#16275 is a duplicate of eclipse-openj9/openj9#16212. eclipse-openj9/openj9#16229 was fixed by eclipse-openj9/openj9#16323. FramePop/framepop02 fails with another issue, which is reported in eclipse-openj9/openj9#16346. Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com> Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
There is a race condition between continuation mounting(swapping java
stacks with carrier thread) and concurrent continuation scanning the
related java stacks).
use atomic operations with continuation->state to synchronizer
to avoid to introduce a new Mutex.
1, GC concurrent scanning use low bit of continuation->state for enter
and exit ConcurrentGCScan(J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN,
tryWinningConcurrentGCScan(), exitConcurrentGCScan()).
2, mounting use atomic "or" set continuation->state
(synchronizeWithConcurrentGCScan()).
3, if mounting enter earlier than scanning, then drop scanning
4, if scanning enter earlier than mounting, then mounting is going to
wait scanning complete
5, after scanning complete, need to clear the low bit
(J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN) and if mounting is in
waiting, notify the mounting.
fix: #16212, #15939
Signed-off-by: Lin Hu linhu@ca.ibm.com