Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay lock-related operations in single thread mode #14583

Merged
merged 1 commit into from
Mar 21, 2022

Conversation

tajila
Copy link
Contributor

@tajila tajila commented Feb 23, 2022

Delay lock-related operations in single thread mode

This PR delays notify/notifyAll operations by intercepting the
Object.notify* INLs and saving the instances and operations to a queue.
These instances must be saved as global JNI refs as we cannot save them
in localref storage and disallowing a GC is not feasible. At the end of
the single thread phase all the operations will be executed by the
checkpointing thread.

See #14584

Handles case 4) Notify

Signed-off-by: Tobi Ajila atobia@ca.ibm.com

@tajila tajila marked this pull request as ready for review February 23, 2022 21:35
@tajila
Copy link
Contributor Author

tajila commented Feb 23, 2022

@DanHeidinga Please review these changes

@tajila tajila force-pushed the criu branch 5 times, most recently from 91599b5 to 9adeb6b Compare March 3, 2022 16:29
runtime/criusupport/criusupport.cpp Show resolved Hide resolved
@@ -515,3 +515,11 @@ J9NLS_JCL_CRIU_FAILED_TO_RUN_INTERNAL_RESTORE_HOOKS.explanation=An error occured
J9NLS_JCL_CRIU_FAILED_TO_RUN_INTERNAL_RESTORE_HOOKS.system_action=The JVM will throw a RestoreException.
J9NLS_JCL_CRIU_FAILED_TO_RUN_INTERNAL_RESTORE_HOOKS.user_response=View CRIU documentation to determine how to resolve the error.
# END NON-TRANSLATABLE

J9NLS_JCL_CRIU_FAILED_DELAY_INDENTIY_OPS=Could not run delayed identity operations succesfully, errno=%li
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
J9NLS_JCL_CRIU_FAILED_DELAY_INDENTIY_OPS=Could not run delayed identity operations succesfully, errno=%li
J9NLS_JCL_CRIU_FAILED_DELAY_INDENTIY_OPS=Could not run delayed lock-related operations succesfully, errno=%li

Identity makes sense in the context of Valhalla but less so here. Should we just call it lock-related? Or are there other operations as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there would be more, but now I think it will just be wait/notify* that we handle in this manner

J9NLS_JCL_CRIU_FAILED_DELAY_INDENTIY_OPS=Could not run delayed identity operations succesfully, errno=%li
# START NON-TRANSLATABLE
J9NLS_JCL_CRIU_FAILED_DELAY_INDENTIY_OPS.sample_input_1=1
J9NLS_JCL_CRIU_FAILED_DELAY_INDENTIY_OPS.explanation=An error occured when the JVM attempted to run delayed identity operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern with 'identity' here

Comment on lines 2695 to 2700
} else {
#endif /* defined(J9VM_OPT_CRIU_SUPPORT) */
if (VM_ObjectMonitor::getMonitorForNotify(_currentThread, receiver, &monitorPtr, true)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {
#endif /* defined(J9VM_OPT_CRIU_SUPPORT) */
if (VM_ObjectMonitor::getMonitorForNotify(_currentThread, receiver, &monitorPtr, true)) {
} else
#endif /* defined(J9VM_OPT_CRIU_SUPPORT) */
{
if (VM_ObjectMonitor::getMonitorForNotify(_currentThread, receiver, &monitorPtr, true)) {

If you put the opening { outside the ifdef, you can avoid the extra ifdef to close it later. Basically, treat it like an extra (otherwise unnecessary) scope that is only an else block for the ifdef case

#define JAVA_UTIL_RANDOM "java/util/Random"
J9Class *juRandomClass = hashClassTableAt(vm->systemClassLoader, (U_8 *)JAVA_UTIL_RANDOM, LITERAL_STRLEN(JAVA_UTIL_RANDOM));
juRandomClass = hashClassTableAt(vm->systemClassLoader, (U_8 *)JAVA_UTIL_RANDOM, LITERAL_STRLEN(JAVA_UTIL_RANDOM));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be protected by the classTableMutex or use the peekClassHashTable function instead

runtime/vm/CRIUHelpers.cpp Show resolved Hide resolved
}
if (!VM_ObjectMonitor::getMonitorForNotify(currentThread, instance, &monitorPtr, true)) {
if (NULL != monitorPtr) {
/* another thread owns the lock, shouldn't be possible */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracepoint

J9Pool *delayedSingleThreadModeOpRecords = vm->checkpointState.delayedSingleThreadModeOpRecord;
J9InternalVMFunctions* vmFuncs = vm->internalVMFunctions;
J9DelayedSingleThreadModeOpRecord *delayedSingleThreadModeOp = (J9DelayedSingleThreadModeOpRecord*) pool_startDo(delayedSingleThreadModeOpRecords, &walkState);
omrthread_monitor_t monitorPtr = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you declare this in the loop to shorten it's lifetime?

}

monitorPtr = NULL;
vmFuncs->j9jni_deleteGlobalRef((JNIEnv*) currentThread, delayedSingleThreadModeOp->globalObjectRef, JNI_FALSE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each item should be removed from the pool after the its been notified - use pool_removeElement which says it is safe to use during an iteration

@@ -4070,11 +4070,20 @@ typedef struct J9InternalHookRecord {
struct J9Pool *instanceObjects;
} J9InternalHookRecord;

typedef struct J9DelayedSingleThreadModeOpRecord {
jobject globalObjectRef;
UDATA operation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does pool iteration guarantee the same order FIFO iteration order? i.e. the first notify we delay should be the first we do after exiting single threaded mode. Similarly the 2nd should be the 2nd, etc.

If the pool iteration order isn't guaranteed (and documented!) I think these should have a J9DelayedSingleThreadModeOpRecord *next field as well and they should be stored in a linked list. The pool is fine for backing the allocation but we should iterate through the list to do the notifies.

And likely need two pointers - one for the head of the list and one for the tail so we can insert quickly at the tail and iterate from the head

@tajila
Copy link
Contributor Author

tajila commented Mar 8, 2022

@DanHeidinga Changes are ready for another look

@tajila tajila force-pushed the criu branch 2 times, most recently from 2e72dac to d375fd8 Compare March 9, 2022 15:16
@tajila tajila force-pushed the criu branch 2 times, most recently from 66b4528 to 8eaadf6 Compare March 9, 2022 21:00
@tajila
Copy link
Contributor Author

tajila commented Mar 9, 2022

@DanHeidinga

I fixed a small issue with not releasing locks when there are no waiters and added more tracepoints. Changes are ready for another look


vmFuncs->j9jni_deleteGlobalRef((JNIEnv*) currentThread, delayedLockingOperation->globalObjectRef, JNI_FALSE);
pool_removeElement(vm->checkpointState.delayedLockingOperationsRecords, delayedLockingOperation);
delayedLockingOperation = J9_LINKED_LIST_NEXT_DO(vm->checkpointState.delayedLockingOperationsRoot, delayedLockingOperation);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a use after free issue as you've removed the element from the pool (free) and then dereferenced it.

Copy link
Member

@DanHeidinga DanHeidinga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor formatting issues to shrink variable lifetimes and use c++ style casts, otherwise fine

}

vmFuncs->j9jni_deleteGlobalRef((JNIEnv*) currentThread, delayedLockingOperation->globalObjectRef, JNI_FALSE);
lastOperation = delayedLockingOperation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is a cpp file, can we declare lastOperation here rather than at the top of the scope? It makes it easer to reason about the lifetime of the variable

Suggested change
lastOperation = delayedLockingOperation;
J9DelayedLockingOpertionsRecord *lastOperation = delayedLockingOperation;

goto throwOOM;
}

newRecord = (J9DelayedLockingOpertionsRecord*) pool_newElement(vm->checkpointState.delayedLockingOperationsRecords);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
newRecord = (J9DelayedLockingOpertionsRecord*) pool_newElement(vm->checkpointState.delayedLockingOperationsRecords);
J9DelayedLockingOpertionsRecord *newRecord = static_cast<J9DelayedLockingOpertionsRecord*>( pool_newElement(vm->checkpointState.delayedLockingOperationsRecords));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't move this one as well because of the goto's

{
J9JavaVM *vm = currentThread->javaVM;
J9InternalVMFunctions* vmFuncs = vm->internalVMFunctions;
J9DelayedLockingOpertionsRecord *delayedLockingOperation = (J9DelayedLockingOpertionsRecord*) J9_LINKED_LIST_START_DO(vm->checkpointState.delayedLockingOperationsRoot);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
J9DelayedLockingOpertionsRecord *delayedLockingOperation = (J9DelayedLockingOpertionsRecord*) J9_LINKED_LIST_START_DO(vm->checkpointState.delayedLockingOperationsRoot);
J9DelayedLockingOpertionsRecord *delayedLockingOperation = static_cast<J9DelayedLockingOpertionsRecord*>(J9_LINKED_LIST_START_DO(vm->checkpointState.delayedLockingOperationsRoot));

#define JAVA_UTIL_RANDOM "java/util/Random"
J9Class *juRandomClass = hashClassTableAt(vm->systemClassLoader, (U_8 *)JAVA_UTIL_RANDOM, LITERAL_STRLEN(JAVA_UTIL_RANDOM));
juRandomClass = peekClassHashTable(currentThread, vm->systemClassLoader, (U_8 *)JAVA_UTIL_RANDOM, LITERAL_STRLEN(JAVA_UTIL_RANDOM));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
juRandomClass = peekClassHashTable(currentThread, vm->systemClassLoader, (U_8 *)JAVA_UTIL_RANDOM, LITERAL_STRLEN(JAVA_UTIL_RANDOM));
J9Class *juRandomClass = peekClassHashTable(currentThread, vm->systemClassLoader, (U_8 *)JAVA_UTIL_RANDOM, LITERAL_STRLEN(JAVA_UTIL_RANDOM));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cant move this one because of the jump to done:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might require an additional scope so the change becomes:

{
	J9Class *juRandomClass = peekClassHashTable(currentThread, vm->systemClassLoader, (U_8 *)JAVA_UTIL_RANDOM, LITERAL_STRLEN(JAVA_UTIL_RANDOM));
....
	if (NULL != juRandomClass) {
		addInternalJVMCheckpointHook(currentThread, TRUE, juRandomClass, FALSE, juRandomReseed);
	}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I'm OK if you want to leave it as is. The method's clear enough that this isn't critical

@tajila
Copy link
Contributor Author

tajila commented Mar 17, 2022

@DanHeidinga The PR is ready for another look

@DanHeidinga
Copy link
Member

DanHeidinga commented Mar 17, 2022

@tajila do we have CRIU-enabled PR builds? If so, do they need a special trigger phrase?

@tajila
Copy link
Contributor Author

tajila commented Mar 17, 2022

@DanHeidinga

I've made the changes

@tajila do we have CRIU-enabled PR builds? If so, do they need a special trigger phrase?

No, unfortunately we dont have machine support yet

@tajila
Copy link
Contributor Author

tajila commented Mar 17, 2022

Jenkins test sanity win jdk11

@DanHeidinga
Copy link
Member

The windows PR build is failing due to code reaching for vm->checkpointState which is only defined under #if defined(J9VM_OPT_CRIU_SUPPORT).

@tajila
Copy link
Contributor Author

tajila commented Mar 18, 2022

Jenkins test sanity win jdk11

This PR delays notify/notifyAll operations by intercepting the
Object.notify* INLs and saving the instances and operations to a queue.
These instances must be saved as global JNI refs as we cannot save them
in localref storage and disallowing a GC is not feasible. At the end of
the single thread phase all the operations will be executed by the
checkpointing thread.

See eclipse-openj9#14584

Handles case 4) Notify

Signed-off-by: Tobi Ajila <atobia@ca.ibm.com>
@tajila
Copy link
Contributor Author

tajila commented Mar 18, 2022

Jenkins test sanity win jdk11

1 similar comment
@tajila
Copy link
Contributor Author

tajila commented Mar 19, 2022

Jenkins test sanity win jdk11

@DanHeidinga
Copy link
Member

Jenkins test sanity aix jdk8

@DanHeidinga
Copy link
Member

Jenkins test sanity xlinux jdk17

@DanHeidinga DanHeidinga merged commit cc586f0 into eclipse-openj9:master Mar 21, 2022
@JasonFengJ9 JasonFengJ9 added comp:vm criu Used to track CRIU snapshot related work labels Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:vm criu Used to track CRIU snapshot related work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants