Synchronization between continuation mounting and concurrent scanning #16290

LinHu2016 · 2022-11-08T16:49:50Z

There is a race condition between continuation mounting (swapping Java
stacks with the carrier thread) and concurrent continuation scanning (GC
scanning of the related Java stacks).

Use atomic operations on continuation->state to synchronize the two,
avoiding the introduction of a new mutex.
1, GC concurrent scanning uses the low bit of continuation->state to enter
and exit the concurrent GC scan (J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN,
tryWinningConcurrentGCScan(), exitConcurrentGCScan()).
2, mounting uses an atomic "or" to set continuation->state
(synchronizeWithConcurrentGCScan()).
3, if mounting enters before scanning, the scan is dropped.
4, if scanning enters before mounting, mounting waits for the scan to
complete.
5, after the scan completes, the low bit
(J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN) must be cleared and, if mounting
is waiting, the mounting thread is notified.

fix: #16212, #15939
Signed-off-by: Lin Hu <linhu@ca.ibm.com>
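
The five steps above can be sketched with standard C++ atomics. This is only a model under stated assumptions, not the code in this PR: `ContinuationModel`, `kConcurrentScanBit`, and the use of `std::mutex`/`std::condition_variable` for the waiting path are hypothetical stand-ins (the PR itself uses `VM_AtomicSupport` and an existing VM monitor). The carrier value is assumed to be an aligned pointer, so its low bit is always clear.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for J9_GC_CONTINUATION_STATE_CONCURRENT_SCAN.
constexpr std::uintptr_t kConcurrentScanBit = 0x1;

// Hypothetical model of the continuation; not the real J9VMContinuation.
struct ContinuationModel {
    std::atomic<std::uintptr_t> state{0}; // 0 == not mounted, not being scanned
    std::mutex mutex;                     // used only on the slow (waiting) path
    std::condition_variable scanDone;
};

// GC side (step 1): claim the scan only if nobody has mounted or is scanning.
bool tryWinningConcurrentGCScan(ContinuationModel &c) {
    std::uintptr_t expected = 0;
    return c.state.compare_exchange_strong(expected, kConcurrentScanBit);
}

// GC side (step 5): clear the low bit; notify if a carrier arrived meanwhile.
void exitConcurrentGCScan(ContinuationModel &c) {
    std::uintptr_t old = c.state.fetch_and(~kConcurrentScanBit);
    if ((old & ~kConcurrentScanBit) != 0) {
        std::lock_guard<std::mutex> lock(c.mutex);
        c.scanDone.notify_all();
    }
}

// Mount side (steps 2 and 4): publish the carrier with an atomic OR, then
// wait for the scan to finish if the scan bit was already set.
void synchronizeWithConcurrentGCScan(ContinuationModel &c, std::uintptr_t carrier) {
    std::uintptr_t old = c.state.fetch_or(carrier);
    if ((old & kConcurrentScanBit) != 0) {
        std::unique_lock<std::mutex> lock(c.mutex);
        c.scanDone.wait(lock, [&c] {
            return (c.state.load() & kConcurrentScanBit) == 0;
        });
    }
}
```

Because the OR and the AND hit the same atomic word, they are totally ordered: either the scanner sees the carrier bits and notifies, or the mounter sees the scan bit already cleared and never waits. Step 3 falls out of the compare-exchange: once carrier bits are set, `tryWinningConcurrentGCScan` fails and the scan is dropped.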

amicic · 2022-11-08T17:43:04Z

runtime/vm/ContinuationHelpers.cpp

@@ -88,19 +89,32 @@ createContinuation(J9VMThread *currentThread, j9object_t continuationObject)
 	return result;
 }

+void
+handleRaceConditionWithContinuationScan(J9VMThread *currentThread, J9VMContinuation *continuation)


I'd suggest something like:
synchronizeWithConcurrentGCScan

nbhuiyan · 2022-11-08T21:35:53Z

@LinHu2016 does this fix the JIT invalid return address issue discussed within #15939? I wanted to test this out myself, however, I am not able to get a build with your commit due to a linker error.

LinHu2016 · 2022-11-08T21:47:20Z

@nbhuiyan here is my earlier personal build to try, if you want to confirm whether the JIT invalid return address issue is also triggered by this race condition.
https://hyc-runtimes-jenkins.swg-devops.com/view/OpenJ9%20-%20Personal/job/Pipeline-Build-Test-Personal/14726/

nbhuiyan · 2022-11-08T21:59:59Z

@LinHu2016 the JIT invalid return address issue still happens with your build, so it is not caused by this particular race condition.

amicic · 2022-11-09T17:30:19Z

runtime/vm/ContinuationHelpers.cpp

+		omrthread_monitor_exit(currentThread->publicFlagsMutex);
+
+		/* set J9_PUBLIC_FLAGS_HALT_THREAD_FOR_CONCURRENT_GC in currentThread's publicFlags */
+		setHaltFlag(currentThread, J9_PUBLIC_FLAGS_HALT_THREAD_FOR_CONCURRENT_GC);


not used/needed anymore

gacholio · 2022-11-09T19:11:24Z

Please do not use names like bNeedsMutex in the VM code. It's pretty obviously a boolean.

gacholio · 2022-11-09T19:13:19Z

runtime/vm/ContinuationHelpers.cpp

+	/* atomically 'or' (not 'set') continuation->carrierThread  with currentThread */
+	uintptr_t oldValue = VM_AtomicSupport::bitOr((volatile uintptr_t*)&continuation->carrierThread, (uintptr_t)currentThread);
+
+	if (oldValue&J9_GC_CONTINUATION_CONCURRENTSCANNING) {


Formatting, and should be ANY_BITS_SET anyway.

amicic · 2022-11-09T19:14:16Z

> Please do not use names like bNeedsMutex in the VM code. It's pretty obviously a boolean.

And while renaming it, let's use a more descriptive name, like: syncWithConcurrentGCScan

EDIT: I was referring to usage as an argument in walkContinuationStackFramesWrapper. Other spots may or may not need a different name.

amicic · 2022-11-09T19:21:12Z

runtime/gc_base/GCExtensions.hpp

@@ -337,6 +339,8 @@ class MM_GCExtensions : public MM_GCExtensionsBase {
 		, minimumFreeSizeForSurvivor(DEFAULT_SURVIVOR_MINIMUM_FREESIZE)
 		, freeSizeThresholdForSurvivor(DEFAULT_SURVIVOR_THRESHOLD)
 		, recycleRemainders(true)
+		, disableScanMountedContinuationObject(true)
+		, enableContinuationMountingMutex(true)


I believe these are just temporary debug flags that you used during testing. I think you can remove them now, so that affected code is easier to read.

amicic · 2022-11-09T19:28:21Z

runtime/gc_glue_java/MarkingDelegate.cpp

@@ -272,7 +272,8 @@ MM_MarkingDelegate::scanContinuationNativeSlots(MM_EnvironmentBase *env, omrobje
 		bStackFrameClassWalkNeeded = isDynamicClassUnloadingEnabled();
 #endif /* J9VM_GC_DYNAMIC_CLASS_UNLOADING */

-		GC_VMThreadStackSlotIterator::scanSlots(currentThread, objectPtr, (void *)&localData, stackSlotIteratorForMarkingDelegate, bStackFrameClassWalkNeeded, false);
+		bool bNeedMutex = _extensions->enableContinuationMountingMutex && _extensions->shouldScavengeNotifyGlobalGCOfOldToOldReference();


Unfortunately we will either have to introduce a new API that is not Scav specific (which is an OMR change) or find another existing way to check this (via ConcurrentGC mode?)

gacholio · 2022-11-09T19:30:14Z

runtime/vm/ContinuationHelpers.cpp

+
+		volatile uintptr_t *localAddr = (volatile uintptr_t *) &continuation->carrierThread;
+		omrthread_monitor_enter(currentThread->publicFlagsMutex);
+		while ((((uintptr_t)*localAddr) & J9_GC_CONTINUATION_CONCURRENTSCANNING)) {


ANY_BITS here too. The cast also seems unnecessary given the pointer type.

~~you probably meant 'cast'~~
I see you edited meanwhile

gacholio · 2022-11-09T19:37:22Z

By using an existing field to contain the tag, do we need to change general uses of that field to mask the tag?

amicic · 2022-11-09T19:47:50Z

> By using an existing field to contain the tag, do we need to change general uses of that field to mask the tag?

Indeed, we thought about it (there is a comment in isContinuationMounted, which is probably the only other user).
Perhaps we should have a couple of macros to extract the (pure) carrierThread and/or the tagged value, and even rename the field to be more descriptive (for example carrierThreadAnd(Or)ConcurrentlyScanned?)

gacholio · 2022-11-09T19:50:27Z

> Perhaps we should have a couple of macros

I think that would be a good idea, as well as renaming the field to make it (more) obvious not to use it directly.
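
The accessor idea discussed here could look roughly like the sketch below. The names `CarrierThread`, `kScanTag`, and the three helper functions are hypothetical and only illustrate the masking; they are not the helpers that ended up in the VM. The sketch assumes the carrier pointer is at least 2-byte aligned, so the low bit is free to carry the tag.

```cpp
#include <cstdint>

// Hypothetical tag value occupying the low bit of the combined word.
constexpr std::uintptr_t kScanTag = 0x1;

// Hypothetical stand-in for the carrier thread structure.
struct alignas(8) CarrierThread {
    int id;
};

// Strip the tag and return the pure carrier pointer (nullptr if unmounted).
inline CarrierThread *carrierThreadFromState(std::uintptr_t state) {
    return reinterpret_cast<CarrierThread *>(state & ~kScanTag);
}

// Report whether the concurrent-scan tag is set, regardless of the carrier.
inline bool isConcurrentlyScannedFromState(std::uintptr_t state) {
    return (state & kScanTag) != 0;
}

// A continuation counts as mounted only if carrier bits remain after masking.
inline bool isMountedFromState(std::uintptr_t state) {
    return carrierThreadFromState(state) != nullptr;
}
```

Routing every read through accessors like these means a caller can never mistake a tagged word for a valid thread pointer, which is the concern raised about reusing an existing field.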

amicic · 2022-11-09T20:47:59Z

runtime/oti/VMHelpers.hpp

+			} else {
+				/* low tagging failed due to another GC thread winning low tagging, we don't do anything - winning thread will do everything instead */
+
+			}


I'd structure the code like this:

if (0 == oldValue) { if (atomic succeeds) return true; } return false;

And I'd put just one comment in front of the method (since it applies both to not attempting the atomic and to attempting but losing), like:

If low tagging failed (or was not even attempted) due to either
- a carrier thread winning to mount, we don't need to do anything, since it will be compensated by pre/post mount actions/scans
- another GC thread winning to scan, again don't do anything, and let the winning thread do the work instead

amicic · 2022-11-09T21:07:07Z

runtime/oti/VMHelpers.hpp

+		uintptr_t oldValue = *localAddr;
+		while ((VM_AtomicSupport::lockCompareExchange(localAddr, oldValue, oldValue & (~(uintptr_t)J9_GC_CONTINUATION_CONCURRENTSCANNING))) != oldValue) {
+			oldValue = *localAddr;
+		}


there is VM_AtomicSupport::bitAnd which also returns what you need - oldValue
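
To illustrate why a single atomic AND can replace the compare-exchange loop, here is a small sketch using `std::atomic` as a stand-in for `VM_AtomicSupport` (whose exact signatures are not shown in this thread). The constant `kScanBit` and both function names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the concurrent-scan tag bit.
constexpr std::uintptr_t kScanBit = 0x1;

// Loop form: retry the compare-exchange until the bit is cleared.
// On failure, compare_exchange_weak reloads `old` with the current value.
std::uintptr_t clearBitWithCasLoop(std::atomic<std::uintptr_t> &word) {
    std::uintptr_t old = word.load();
    while (!word.compare_exchange_weak(old, old & ~kScanBit)) {
    }
    return old;
}

// Single-operation form: fetch_and clears the bit and returns the old value.
std::uintptr_t clearBitWithFetchAnd(std::atomic<std::uintptr_t> &word) {
    return word.fetch_and(~kScanBit);
}
```

Both functions leave the word with the bit cleared and return the value observed just before the update, which is what the caller needs to decide whether a mounting thread has to be notified.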

amicic · 2022-11-11T17:13:07Z

runtime/vm/ContinuationHelpers.cpp


 	/* We need a full fence here to preserve happens-before relationship on PPC and other weakly
 	 * ordered architectures since learning/reservation is turned on by default. Since we have the
 	 * global pin lock counters we only need to need to address yield points, as thats the
 	 * only time a different virtualThread can run on the underlying j9vmthread.
 	 */
 	VM_AtomicSupport::readWriteBarrier();
+	Assert_VM_true(currentThread == VM_VMHelpers::getCarrierThreadFromContinuationState(continuation->state));


You can also add a comment that we don't need an atomic here, since no GC thread should be able to start scanning while the continuation is mounted, nor should another carrier thread be able to mount before we complete the unmount (hence no risk of overwriting anything in a race).

After that comment you can add an assert that concurrentlyScanned is indeed not set (and keep the other assert about the carrier ID). Or that could really be just one assert that the state is exactly currentThread.

amicic · 2022-11-11T18:49:59Z

runtime/vm/ContinuationHelpers.cpp

+	/* we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted,
+	 * nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race).
+	 */
+	Assert_VM_false(VM_VMHelpers::isConcurrentlyScannedFromContinuationState(continuation->state));


I understand your intention to put this comment early, thinking that if there is a need for synchronization, it should be done early.

But the comment really applies to putting the state back to initial, while someone may assume we are talking about currentContinuation being set to NULL. So either we should state it more explicitly rather than saying 'here', or (which was my initial intention) just keep the comment next to the state reset.

The same applies to the assert: it's best to keep it as close to the state reset as possible (in some malicious scenario, if another thread mutates the state between the assert and the reset, keeping them close increases the chances of catching the intruder).
Then we can merge the 2 asserts into one: currentThread == state. Such an assert would not be completely clean (in the sense that it knows something about how the state struct is organized), but I'm still ok with it, for the sake of simpler code (asserts). Besides, it would not be the only spot: for example, the atomic in tryWinningConcurrentGCScan also assumes this knowledge.

we'll need a comment about the reset being the last step and write fence preceding it, but only after/if GAC confirms 'borrowing' the existing fence was ok, first

I see no issue with the placement of the fence.

amicic · 2022-11-11T19:17:06Z

jenkins test sanity aix jdk19


amicic · 2022-11-11T21:25:20Z

jenkins test sanity win jdk19
jenkins test sanity.functional aix jdk19

gacholio · 2022-11-11T23:11:13Z

@amicic There's a question above related to barriers - can you please reiterate what I'm supposed to be considering?

amicic · 2022-11-11T23:42:36Z

> @amicic There's a question above related to barriers - can you please reiterate what I'm supposed to be considering?

It's about readWriteBarrier. It used to be the very last thing in unmount. Now, it's not.

We needed a write barrier just before resetting the state (setting it to INITIAL, which basically clears the carrier ID). The reset effectively presents the continuation structure to GC, allowing it to be concurrently scanned. But GC, potentially running on another CPU, must see the up-to-date continuation structure that has just been extensively mutated by swapFieldsWithContinuation a step earlier.

So, we ended up borrowing the write part of the barrier by swapping the existing full barrier with the reset line (the line used to just set carrierThread to NULL).

I don't fully understand the original intent of the barrier (but I guess it has something to do with lock-free inNative?), so the question is whether we compromised the original intent.

gacholio · 2022-11-14T13:15:48Z

The placement/use of the fence seems fine.

amicic · 2022-11-14T14:24:22Z

runtime/vm/ContinuationHelpers.cpp


 	/* We need a full fence here to preserve happens-before relationship on PPC and other weakly
 	 * ordered architectures since learning/reservation is turned on by default. Since we have the
 	 * global pin lock counters we only need to need to address yield points, as thats the
 	 * only time a different virtualThread can run on the underlying j9vmthread.
 	 */
 	VM_AtomicSupport::readWriteBarrier();
+	/* we don't need atomic here, since no GC thread should be able to start scanning while continuation is mounted,
+	 * nor should another carrier thread be able to mount before we complete the unmount (hence no risk to overwrite anything in a race).
+	 */


@LinHu2016, you can expand this comment with:

The order swap-stacks, writeBarrier, state initial must be maintained for weakly ordered CPUs, to ensure that once the continuation is again available for GC scan (on potentially remote CPUs), all CPUs see an up-to-date stack.
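
The ordering requirement can be modelled with standard C++ fences. This is a sketch under assumptions, not the VM code: `UnmountModel`, `stackSlot`, `unmount`, and `tryScan` are hypothetical, a `std::atomic_thread_fence` stands in for `VM_AtomicSupport::readWriteBarrier()`, and a single `int` stands in for the stack fields that are swapped during unmount.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical model: one plain field for the swapped stack data, one state word.
struct UnmountModel {
    int stackSlot = 0;                    // stand-in for the swapped stack fields
    std::atomic<std::uintptr_t> state{0}; // 0 == initial (unmounted, not scanned)
};

// Unmount side: the three steps must stay in this order.
void unmount(UnmountModel &c, int savedSlot) {
    c.stackSlot = savedSlot;                             // 1. swap-stacks
    std::atomic_thread_fence(std::memory_order_seq_cst); // 2. full fence
    c.state.store(0, std::memory_order_relaxed);         // 3. state initial
}

// GC side: only after winning the state word may the stack data be read.
bool tryScan(UnmountModel &c, int &seen) {
    std::uintptr_t expected = 0;
    if (!c.state.compare_exchange_strong(expected, 1, std::memory_order_acquire)) {
        return false; // mounted or already being scanned
    }
    seen = c.stackSlot; // guaranteed to observe the value written in step 1
    c.state.fetch_and(~std::uintptr_t{1}, std::memory_order_release);
    return true;
}
```

If step 3 were moved ahead of the fence, a scanner on another CPU could win the state word and still read stale stack data, which is the scenario the comment is meant to warn about.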

amicic · 2022-11-14T14:47:38Z

runtime/oti/VMHelpers.hpp

 	{
 		bool needScan = false;
 #if JAVA_SPEC_VERSION >= 19
 		jboolean started = J9VMJDKINTERNALVMCONTINUATION_STARTED(vmThread, continuationObject);
 		J9VMContinuation *continuation = J9VMJDKINTERNALVMCONTINUATION_VMREF(vmThread, continuationObject);
-		needScan = started && (NULL != continuation) && (!scanOnlyUnmounted || !isContinuationMounted(continuation));
+		needScan = started && (NULL != continuation) && (!isContinuationMountedOrConcurrentlyScanned(continuation));


let's add a comment

We don't scan mounted continuations:
- for concurrent GCs, since the stack is actively changing. Instead, we scan them during preMount, or during root scanning if already mounted at cycle start, or during postUnmount (possibly indirectly via card cleaning), or during final STW (via root re-scan) if still mounted at cycle end
- for sliding compacts, to avoid double slot fixups

For fully STW GCs, there is no harm in scanning them, but it's a waste of time since they are already scanned during root scanning.

We don't scan continuations that are currently being scanned either - one scan is enough.

amicic · 2022-11-14T14:52:55Z

runtime/oti/VMHelpers.hpp

 #endif /* JAVA_SPEC_VERSION >= 19 */
+		return rc;
+	}

 	/**
 	 * Check if we need to scan the java stack for the Continuation Object


Add a comment:
Used during main scan phase of GC (object graph traversal) or heap object iteration (in sliding compact). Not meant to be used during root scanning (neither strong roots nor weak roots)!

amicic · 2022-11-14T14:56:58Z

runtime/vm/ContinuationHelpers.cpp

+	 *
+	 *	swap-stacks
+	 *	writeBarrier
+	 * state initial


align these 3

amicic · 2022-11-14T14:58:06Z

runtime/vm/ContinuationHelpers.cpp

@@ -33,6 +33,14 @@

 extern "C" {

+void randomSleep()


remove debug code


amicic · 2022-11-14T16:37:07Z

jenkins test sanity.functional all jdk19

amicic · 2022-11-14T16:38:04Z

jenkins compile win,aix jdk8

eclipse-openj9/openj9#16212 was fixed by 1. eclipse-openj9/openj9#16290 2. eclipse-openj9/openj9#16293 eclipse-openj9/openj9#16275 is a duplicate of eclipse-openj9/openj9#16212. eclipse-openj9/openj9#16229 was fixed by eclipse-openj9/openj9#16323. FramePop/framepop02 fails with another issue, which is reported in eclipse-openj9/openj9#16346. Signed-off-by: Babneet Singh <sbabneet@ca.ibm.com>


amicic added the project:loom (Used to track Project Loom related work) and comp:gc labels Nov 8, 2022

amicic reviewed Nov 8, 2022

LinHu2016 force-pushed the GC_Loom_4 branch from 8787d85 to 1072df8, November 8, 2022 18:01

babsingh mentioned this pull request Nov 8, 2022: Delay appending VirtualThread objects to liveVirtualThreadList #16293 (Merged)

nbhuiyan mentioned this pull request Nov 8, 2022: Loom: jdk19 OpenJDK Invalid JIT return address ASSERTION FAILED swalk.c:1601 #15939 (Closed)

LinHu2016 force-pushed the GC_Loom_4 branch 2 times, most recently from c9af7e1 to 44ca3f2, November 9, 2022 17:03

amicic reviewed Nov 9, 2022

LinHu2016 force-pushed the GC_Loom_4 branch from 44ca3f2 to 95c8708, November 9, 2022 17:46

gacholio reviewed Nov 9, 2022

amicic reviewed Nov 9, 2022

gacholio reviewed Nov 9, 2022

LinHu2016 force-pushed the GC_Loom_4 branch from 95c8708 to ecba955, November 9, 2022 19:41

LinHu2016 force-pushed the GC_Loom_4 branch 3 times, most recently from 5d24adc to 3594b76, November 9, 2022 20:35

amicic reviewed Nov 9, 2022

LinHu2016 force-pushed the GC_Loom_4 branch 2 times, most recently from 37a44a3 to 29e7909, November 9, 2022 21:40

LinHu2016 force-pushed the GC_Loom_4 branch from e926c80 to 73f03ef, November 11, 2022 17:00

amicic reviewed Nov 11, 2022

LinHu2016 force-pushed the GC_Loom_4 branch from 73f03ef to ce80b56, November 11, 2022 18:11

amicic reviewed Nov 11, 2022

LinHu2016 force-pushed the GC_Loom_4 branch from ce80b56 to 6a5814d, November 11, 2022 19:08

gacholio reviewed Nov 11, 2022

LinHu2016 force-pushed the GC_Loom_4 branch 2 times, most recently from 2587e7b to d5c0842, November 11, 2022 21:13

gacholio approved these changes Nov 11, 2022

amicic reviewed Nov 14, 2022

LinHu2016 force-pushed the GC_Loom_4 branch from d5c0842 to b1c0051, November 14, 2022 14:33

amicic reviewed Nov 14, 2022

LinHu2016 force-pushed the GC_Loom_4 branch from b1c0051 to 1dba3f6, November 14, 2022 16:33

amicic approved these changes Nov 14, 2022

amicic merged commit e0259a6 into eclipse-openj9:master Nov 14, 2022

dmitripivkine mentioned this pull request Nov 18, 2022: jdk19 OpenJDK timeout with call to VirtualThread.notifyJvmtiUnmountBegin #16340 (Closed)

babsingh mentioned this pull request Nov 21, 2022: Re-enable tests adoptium/aqa-tests#4162 (Merged)
