Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronization between continuation mounting and concurrent scanning #16290

Merged
merged 1 commit into from
Nov 14, 2022

Conversation

LinHu2016
Copy link
Contributor

@LinHu2016 LinHu2016 commented Nov 8, 2022

There is a race condition between continuation mounting(swapping java
stacks with carrier thread) and concurrent continuation scanning the
related java stacks).

use atomic operations with continuation->state to synchronizer
to avoid to introduce a new Mutex.
1, GC concurrent scanning use low bit of continuation->state for enter
and exit ConcurrentGCScan(J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN,
tryWinningConcurrentGCScan(), exitConcurrentGCScan()).
2, mounting use atomic "or" set continuation->state
(synchronizeWithConcurrentGCScan()).
3, if mounting enter earlier than scanning, then drop scanning
4, if scanning enter earlier than mounting, then mounting is going to
wait scanning complete
5, after scanning complete, need to clear the low bit
(J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN) and if mounting is in
waiting, notify the mounting.

fix: #16212, #15939
Signed-off-by: Lin Hu linhu@ca.ibm.com

@amicic amicic added project:loom Used to track Project Loom related work comp:gc labels Nov 8, 2022
@@ -88,19 +89,32 @@ createContinuation(J9VMThread *currentThread, j9object_t continuationObject)
return result;
}

void
handleRaceConditionWithContinuationScan(J9VMThread *currentThread, J9VMContinuation *continuation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest something like:
synchronizeWithConcurrentGCScan

@nbhuiyan
Copy link
Member

nbhuiyan commented Nov 8, 2022

@LinHu2016 does this fix the JIT invalid return address issue discussed within #15939? I wanted to test this out myself, however, I am not able to get a build with your commit due to a linker error.

@LinHu2016
Copy link
Contributor Author

@nbhuiyan here is my earlier personal build for trying, if you want to confirm if JIT invalid return address issue also is triggered by race condition.
https://hyc-runtimes-jenkins.swg-devops.com/view/OpenJ9%20-%20Personal/job/Pipeline-Build-Test-Personal/14726/

@nbhuiyan
Copy link
Member

nbhuiyan commented Nov 8, 2022

@LinHu2016 the JIT invalid return address issue still happens with your build, so it is not caused by this particular race condition.

omrthread_monitor_exit(currentThread->publicFlagsMutex);

/* set J9_PUBLIC_FLAGS_HALT_THREAD_FOR_CONCURRENT_GC in currentThread's publicFlags */
setHaltFlag(currentThread, J9_PUBLIC_FLAGS_HALT_THREAD_FOR_CONCURRENT_GC);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not used/needed anymore

@gacholio
Copy link
Contributor

gacholio commented Nov 9, 2022

Please do not use names like bNeedsMutex in the VM code. It's pretty obviously a boolean.

/* atomically 'or' (not 'set') continuation->carrierThread with currentThread */
uintptr_t oldValue = VM_AtomicSupport::bitOr((volatile uintptr_t*)&continuation->carrierThread, (uintptr_t)currentThread);

if (oldValue&J9_GC_CONTINUATION_CONCURRENTSCANNING) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting, and should be ANY_BITS_SET anyway.

@amicic
Copy link
Contributor

amicic commented Nov 9, 2022

Please do not use names like bNeedsMutex in the VM code. It's pretty obviously a boolean.

and while renaming it, let's use a bit more descriptive name, like: syncWithConcurrentGCScan

EDIT: I was referring to usage as an argument in walkContinuationStackFramesWrapper. Other spots may or may not need a different name.

@@ -337,6 +339,8 @@ class MM_GCExtensions : public MM_GCExtensionsBase {
, minimumFreeSizeForSurvivor(DEFAULT_SURVIVOR_MINIMUM_FREESIZE)
, freeSizeThresholdForSurvivor(DEFAULT_SURVIVOR_THRESHOLD)
, recycleRemainders(true)
, disableScanMountedContinuationObject(true)
, enableContinuationMountingMutex(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these are just temporary debug flags that you used during testing. I think you can remove them now, so that affected code is easier to read.

@@ -272,7 +272,8 @@ MM_MarkingDelegate::scanContinuationNativeSlots(MM_EnvironmentBase *env, omrobje
bStackFrameClassWalkNeeded = isDynamicClassUnloadingEnabled();
#endif /* J9VM_GC_DYNAMIC_CLASS_UNLOADING */

GC_VMThreadStackSlotIterator::scanSlots(currentThread, objectPtr, (void *)&localData, stackSlotIteratorForMarkingDelegate, bStackFrameClassWalkNeeded, false);
bool bNeedMutex = _extensions->enableContinuationMountingMutex && _extensions->shouldScavengeNotifyGlobalGCOfOldToOldReference();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately we will either have to introduce a new API that is not Scav specific (what is OMR change) or find another existing way to check this (via ConcurrentGC mode?)


volatile uintptr_t *localAddr = (volatile uintptr_t *) &continuation->carrierThread;
omrthread_monitor_enter(currentThread->publicFlagsMutex);
while ((((uintptr_t)*localAddr) & J9_GC_CONTINUATION_CONCURRENTSCANNING)) {
Copy link
Contributor

@gacholio gacholio Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ANY_BITS here too. The cast also seems unnecessary given the pointer type.

Copy link
Contributor

@amicic amicic Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you probably meant 'cast'
I see you edited meanwhile

@gacholio
Copy link
Contributor

gacholio commented Nov 9, 2022

By using an existing field to contain the tag, do we need to change general uses of that field to mask the tag?

@amicic
Copy link
Contributor

amicic commented Nov 9, 2022

By using an existing field to contain the tag, do we need to change general uses of that field to mask the tag?

indeed we thought about it (there is comment there in isCountinationMounted, that is probably the only other user)
Perhaps we should have a cuple of macros to extract (pure) carrierThread and/or tagged value, and even rename the field to be more descriptive (for example carrierThreadAnd(Or)ConcurrentlyScanned?)

@gacholio
Copy link
Contributor

gacholio commented Nov 9, 2022

Perhaps we should have a couple of macros

I think that would be a good idea, as well as renaming the field to make it (more) obvious not to use it directly.

@LinHu2016 LinHu2016 force-pushed the GC_Loom_4 branch 3 times, most recently from 5d24adc to 3594b76 Compare November 9, 2022 20:35
} else {
/* low tagging failed due to another GC thread winning low tagging, we don't do anything - winning thread will do everything instead */

}
Copy link
Contributor

@amicic amicic Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd structure the code like this:

if (0 == oldValue) {
	if (atomic succeeds) 
            return true;
}

return false;

And I'd put just one comment in front of the method (since it applies to both not attempting atomic and attempting, but losing) like:

If low tagging failed (or not even attempted) due to either

  • a carrier thread winning to mount, we don't need to do anything, since it will be compensated by pre/post mount actions/scans
  • another GC thread winning to scan, again don't do anything, and let the winning thread do the work, instead

uintptr_t oldValue = *localAddr;
while ((VM_AtomicSupport::lockCompareExchange(localAddr, oldValue, oldValue & (~(uintptr_t)J9_GC_CONTINUATION_CONCURRENTSCANNING))) != oldValue) {
oldValue = *localAddr;
}
Copy link
Contributor

@amicic amicic Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is VM_AtomicSupport::bitAnd which also returns what you need - oldValue

@LinHu2016 LinHu2016 force-pushed the GC_Loom_4 branch 2 times, most recently from 37a44a3 to 29e7909 Compare November 9, 2022 21:40

/* We need a full fence here to preserve happens-before relationship on PPC and other weakly
* ordered architectures since learning/reservation is turned on by default. Since we have the
* global pin lock counters we only need to need to address yield points, as thats the
* only time a different virtualThread can run on the underlying j9vmthread.
*/
VM_AtomicSupport::readWriteBarrier();
Assert_VM_true(currentThread == VM_VMHelpers::getCarrierThreadFromContinuationState(continuation->state));
Copy link
Contributor

@amicic amicic Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also add a comment that we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted, nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race)

After that comment you can add an assert that it indeed concurrentlyScanned is not set (and have the other assert about carrier ID). Or that could just really be one assert that the state is exactly currentThread

/* we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted,
* nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race).
*/
Assert_VM_false(VM_VMHelpers::isConcurrentlyScannedFromContinuationState(continuation->state));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your intention to put this comment early, thinking if there is a need for synchronization, it should be done early.

But comment really applies to putting state back to initial, while someone may assume we are talking about currentContinuation being set to NULL. So either we should state it more explicitly rather than saying 'here" or (what was my initial intention) just keep the comment next to state reset.

Same applies to the assert, it's the best to be the closest to state reset as possible (in some malicious scenario if anotehr thread mutates state between the assert and the reset, keeping them close increases chances to catch the intruder)
Then we can merge the 2 asserts into one: currentThread == state. Such an assert would not be completely clean (in a sense it knows something about how state struct is organized), but I'm still ok with it, for the sake of simpler code (asserts). Besides, it would not be the only spot: for example, atomic in tryWinningConcurrentGCScan also assumes this knowledge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll need a comment about the reset being the last step and write fence preceding it, but only after/if GAC confirms 'borrowing' the existing fence was ok, first

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no issue with the placement of the fence.

@amicic
Copy link
Contributor

amicic commented Nov 11, 2022

jenkins test sanity aix jdk19

runtime/oti/VMHelpers.hpp Outdated Show resolved Hide resolved
@LinHu2016 LinHu2016 force-pushed the GC_Loom_4 branch 2 times, most recently from 2587e7b to d5c0842 Compare November 11, 2022 21:13
@amicic
Copy link
Contributor

amicic commented Nov 11, 2022

jenkins test sanity win jdk19
jenkins test sanity.functional aix jdk19

@gacholio
Copy link
Contributor

@amicic There's a question above related to barriers - can you please reiterate what I'm supposed to be considering?

@amicic
Copy link
Contributor

amicic commented Nov 11, 2022

@amicic There's a question above related to barriers - can you please reiterate what I'm supposed to be considering?

It's about readWriteBarrier. It used to be the very last thing in unmount. Now, it's not.

We needed a write barrier just before resetting the state (setting it to INITIAL, what basically clears carrier ID). The reset effectively presents the continuation structure to GC, letting it to be concurrently scanned. But GC potentially running on another CPU must see the up-to-date continuation structure that just have been extensively mutated by swapFieldsWithContinuation a step earlier.

So, we ended up borrowing the write part of the barrier by swapping the existing full barrier with the reset line (the line used to be just setting carrierThread to NULL)

I don't fully understand the original intent of the barrier (but I guess it has something to do with lock-free inNative?), so the question is if we compromised the original intent.

@gacholio
Copy link
Contributor

The placement/use of the fence seems fine.


/* We need a full fence here to preserve happens-before relationship on PPC and other weakly
* ordered architectures since learning/reservation is turned on by default. Since we have the
* global pin lock counters we only need to need to address yield points, as thats the
* only time a different virtualThread can run on the underlying j9vmthread.
*/
VM_AtomicSupport::readWriteBarrier();
/* we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted,
* nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race).
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LinHu2016, you can expand this comment with:

Order 

swap-stacks
writeBarrier
state initial

must be maintained for weakly ordered CPUs, to unsure that once the continuation is again available for GC scan (on potentially remote CPUs), all CPUs see up-to-date stack .

{
bool needScan = false;
#if JAVA_SPEC_VERSION >= 19
jboolean started = J9VMJDKINTERNALVMCONTINUATION_STARTED(vmThread, continuationObject);
J9VMContinuation *continuation = J9VMJDKINTERNALVMCONTINUATION_VMREF(vmThread, continuationObject);
needScan = started && (NULL != continuation) && (!scanOnlyUnmounted || !isContinuationMounted(continuation));
needScan = started && (NULL != continuation) && (!isContinuationMountedOrConcurrentlyScanned(continuation));
Copy link
Contributor

@amicic amicic Nov 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a comment

We don't scan mounted continuations:

  • for concurrent GCs, since stack is actively changing. Instead, we scan them during preMount or during root scanning if already mounted at cycle start or during postUnmount (might be indirectly via card cleaning) or during final STW (via root re-scan) if still mounted at cycle end
  • for sliding compacts to avoid double slot fixups

For fully STW GCs, there is no harm to scan them, but it's a waste of time since they are scanned during root scanning already.

We don't scan currently scanned either - one scan is enough.

#endif /* JAVA_SPEC_VERSION >= 19 */
return rc;
}

/**
* Check if we need to scan the java stack for the Continuation Object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment:
Used during main scan phase of GC (object graph traversal) or heap object iteration (in sliding compact). Not meant to be used during root scanning (neither strong roots nor weak roots)!

*
* swap-stacks
* writeBarrier
* state initial
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align these 3

@@ -33,6 +33,14 @@

extern "C" {

void randomSleep()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove debug code

There is a race condition between continuation mounting(swapping java
 stacks with carrier thread) and concurrent continuation scanning the
 related java stacks).

 use atomic operations with continuation->state to synchronizer
 to avoid to introduce a new Mutex.
 1, GC concurrent scanning use low bit of continuation->state for enter
  and exit ConcurrentGCScan(J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN,
  tryWinningConcurrentGCScan(), exitConcurrentGCScan()).
 2, mounting use atomic "or" set continuation->state
  (synchronizeWithConcurrentGCScan()).
 3, if mounting enter earlier than scanning, then drop scanning
 4, if scanning enter earlier than mounting, then mounting is going to
  wait scanning complete
 5, after scanning complete, need to clear the low bit
  (J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN) and if mounting is in
  waiting, notify the mounting.

Signed-off-by: Lin Hu <linhu@ca.ibm.com>
@amicic
Copy link
Contributor

amicic commented Nov 14, 2022

jenkins test sanity.functional all jdk19

@amicic
Copy link
Contributor

amicic commented Nov 14, 2022

jenkins compile win,aix jdk8

@amicic amicic merged commit e0259a6 into eclipse-openj9:master Nov 14, 2022
babsingh added a commit to babsingh/aqa-tests that referenced this pull request Nov 21, 2022
eclipse-openj9/openj9#16212 was fixed by
1. eclipse-openj9/openj9#16290
2. eclipse-openj9/openj9#16293

eclipse-openj9/openj9#16275 is a duplicate of
eclipse-openj9/openj9#16212.

eclipse-openj9/openj9#16229 was fixed by eclipse-openj9/openj9#16323.

FramePop/framepop02 fails with another issue, which is reported in
eclipse-openj9/openj9#16346.

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
babsingh added a commit to babsingh/aqa-tests that referenced this pull request Nov 21, 2022
eclipse-openj9/openj9#16212 was fixed by
1. eclipse-openj9/openj9#16290; and
2. eclipse-openj9/openj9#16293.

eclipse-openj9/openj9#16275 is a duplicate of
eclipse-openj9/openj9#16212.

eclipse-openj9/openj9#16229 was fixed by eclipse-openj9/openj9#16323.

FramePop/framepop02 fails with another issue, which is reported in
eclipse-openj9/openj9#16346.

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
Mesbah-Alam pushed a commit to adoptium/aqa-tests that referenced this pull request Nov 22, 2022
eclipse-openj9/openj9#16212 was fixed by
1. eclipse-openj9/openj9#16290; and
2. eclipse-openj9/openj9#16293.

eclipse-openj9/openj9#16275 is a duplicate of
eclipse-openj9/openj9#16212.

eclipse-openj9/openj9#16229 was fixed by eclipse-openj9/openj9#16323.

FramePop/framepop02 fails with another issue, which is reported in
eclipse-openj9/openj9#16346.

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>

Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:gc comp:vm project:loom Used to track Project Loom related work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[JDK19] JVMTI RedefineRunningMethods Segfaults
5 participants