Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent hangs when preparing the JIT for checkpoint #15202

Merged
merged 2 commits into from
Jun 11, 2022

Conversation

dsouzai
Copy link
Contributor

@dsouzai dsouzai commented Jun 2, 2022

Prevent hangs when preparing the JIT for checkpoint; there are two different circumstances that can lead to a hang:

  1. Active comp threads waiting for work will wait on the comp monitor; if the hook thread does not notify them, then it will wait indefinitely for those comp threads.
  2. Comp threads actively compiling will try to acquire Exclusive VMAccess; if the hook thread does not release VMAccess, then it will wait indefinitely for those comp threads.

Fixes #15191

When the compilation queue is empty, active compilation threads will
wait on the comp monitor for work. The thread that hooks into the jit to
prepare for checkpoint signals that comp threads should be suspended but
it does not notify threads waiting on the comp monitor. This can lead to
a hang where the hook thread waits to be notified by the "active" comp
thread that it is en route to suspend itself, while the "active" comp
thread waits to be notified for when there is work to do be done.

Signed-off-by: Irwin D'Souza <dsouzai.gh@gmail.com>
@dsouzai dsouzai added bug comp:jit criu Used to track CRIU snapshot related work labels Jun 2, 2022
@dsouzai
Copy link
Contributor Author

dsouzai commented Jun 2, 2022

@mpirvu could you please review?

@dsouzai dsouzai marked this pull request as ready for review June 2, 2022 18:27
@dsouzai dsouzai marked this pull request as draft June 2, 2022 19:02
@dsouzai dsouzai changed the title Notify comp threads waiting on comp monitor when preparing to checkpoint Prevent hangs when preparing the JIT for checkpoint Jun 6, 2022
@dsouzai dsouzai marked this pull request as ready for review June 6, 2022 14:17
@dsouzai
Copy link
Contributor Author

dsouzai commented Jun 6, 2022

@mpirvu this PR is now ready for review.

@mpirvu mpirvu self-assigned this Jun 9, 2022
Copy link
Contributor

@mpirvu mpirvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a small comment, otherwise it looks good

The thread that hooks into the JIT to start preparing for
checkpoint has VMAccess. However, any compilation threads currently
compiling will request Exclusive VMAccess at some point during the
compilation to add the code cache to the artifact manager. Because the
hook thread waits for all compilations in flight to complete, a deadlock
condition arises as the compilation thread waits for Exclusive VMAccess
that it can't acquire because the hook thread has VMAccess, while the
hook thread waits for the compilation thread to notify it that it will
suspend itself.

This commit fixes this by releasing VMAccess before preparing to
checkpoint, and requires it before returning to the VM. This is ok
because all other java threads have been halted at this point.

Signed-off-by: Irwin D'Souza <dsouzai.gh@gmail.com>
@dsouzai
Copy link
Contributor Author

dsouzai commented Jun 10, 2022

@mpirvu I created a new RAII class to release VM Access and acquire the monitor because given how I missed the return, it's entirely possible for it to happen again in the future, so it just felt better to make the acquiring/releasing automatic.

Good for another review now.

Copy link
Contributor

@mpirvu mpirvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpirvu
Copy link
Contributor

mpirvu commented Jun 10, 2022

jenkins compile all jdk17

@mpirvu
Copy link
Contributor

mpirvu commented Jun 10, 2022

jenkins test sanity xlinux jdk17

@mpirvu
Copy link
Contributor

mpirvu commented Jun 10, 2022

Jenkins test sanity xlinuxcriu jdk11

@dsouzai
Copy link
Contributor Author

dsouzai commented Jun 10, 2022

mac failure may be infra related (similar to #14994 (comment)) :

[2022-06-10T16:01:06.915Z] === Output from failing command(s) repeated here ===
[2022-06-10T16:01:06.915Z] * For target support_demos_classes_jfc_TableExample__the.BUILD_DEMO_TableExample_batch:
[2022-06-10T16:01:06.915Z] IOException caught during compilation: Resource deadlock avoided

@dsouzai
Copy link
Contributor Author

dsouzai commented Jun 10, 2022

The criu build failed in cmdLineTester_criu_1 but the variation is -Xint:

[2022-06-10T17:07:19.101Z] ===============================================
[2022-06-10T17:07:19.101Z] Running test cmdLineTester_criu_1 ...
[2022-06-10T17:07:19.101Z] ===============================================
[2022-06-10T17:07:19.101Z] cmdLineTester_criu_1 Start Time: Fri Jun 10 14:07:18 2022 Epoch Time (ms): 1654880838316
[2022-06-10T17:07:19.101Z] "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
[2022-06-10T17:07:19.101Z] JVMSHRC005I No shared class caches available
[2022-06-10T17:07:19.101Z] JVMSHRC005I No shared class caches available
[2022-06-10T17:07:19.101Z] cache cleanup done
[2022-06-10T17:07:19.101Z] variation: -Xint
[2022-06-10T17:07:19.101Z] JVM_OPTIONS:  -Xint 
[2022-06-10T17:07:19.101Z] { itercnt=1; \
[2022-06-10T17:07:19.101Z] mkdir -p "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../TKG/output_16548796323937/cmdLineTester_criu_1"; \
[2022-06-10T17:07:19.101Z] cd "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../TKG/output_16548796323937/cmdLineTester_criu_1"; \
[2022-06-10T17:07:19.101Z] "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java"  -Xint  -Xdump \
[2022-06-10T17:07:19.101Z] 		-DSCRIPPATH=/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh -DTEST_RESROOT=/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu \
[2022-06-10T17:07:19.101Z] 		-DJAVA_COMMAND="/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java" -DJVM_OPTIONS=' -Xint ' \
[2022-06-10T17:07:19.101Z] 		-jar "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdline_options_tester/cmdlinetester.jar" -config "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.xml" \
[2022-06-10T17:07:19.101Z] 		-explainExcludes -xids all,linux_x86-64, -nonZeroExitWhenError; \
[2022-06-10T17:07:19.101Z] 		if [ $? -eq 0 ] ; then echo ""; echo "cmdLineTester_criu_1""_PASSED"; echo ""; cd /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/..; rm -f -r "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../TKG/output_16548796323937/cmdLineTester_criu_1"; else echo ""; echo "cmdLineTester_criu_1""_FAILED"; echo ""; fi; } 2>&1 | tee -a "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../TKG/output_16548796323937/TestTargetResult";
[2022-06-10T17:07:20.111Z] *** Starting test suite: J9 Criu Command-Line Option Tests ***
[2022-06-10T17:07:20.111Z] Testing: Create and Restore Criu Checkpoint Image once
[2022-06-10T17:07:20.111Z] Test start time: 2022/06/10 14:07:19 Atlantic Standard Time
[2022-06-10T17:07:20.111Z] Running command: bash /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java  -Xint  org.openj9.criu.CRIUSimpleTest SingleCheckpoint
[2022-06-10T17:07:20.111Z] Time spent starting: 109 milliseconds
[2022-06-10T17:07:22.124Z] Time spent executing: 2386 milliseconds
[2022-06-10T17:07:22.124Z] Test result: FAILED
[2022-06-10T17:07:22.124Z] Output from test:
[2022-06-10T17:07:22.124Z]  [OUT] start running script
[2022-06-10T17:07:22.124Z]  [OUT] Single checkpoint:
[2022-06-10T17:07:22.124Z]  [OUT] Pre-checkpoint
[2022-06-10T17:07:22.124Z]  [OUT] finished script
[2022-06-10T17:07:22.124Z]  [ERR] /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh: line 26: 18350 Killed                  $2 -XX:+EnableCRIUSupport $3 -cp "$1/criu.jar" $4 $5 > testOutput 2>&1
[2022-06-10T17:07:22.124Z]  [ERR] pie: 18350: Error (criu/pie/restorer.c:1790): Unable to create a thread: 18352
[2022-06-10T17:07:22.124Z]  [ERR] pie: 18350: Error (criu/pie/restorer.c:1925): Restorer fail 18350
[2022-06-10T17:07:22.124Z]  [ERR] Error (criu/cr-restore.c:2447): Restoring FAILED.
[2022-06-10T17:07:22.124Z] >> Success condition was found: [Output match: Killed]
[2022-06-10T17:07:22.124Z] >> Required condition was found: [Output match: Pre-checkpoint]
[2022-06-10T17:07:22.124Z] >> Required condition was not found: [Output match: Post-checkpoint]
[2022-06-10T17:07:22.124Z] >> Failure condition was not found: [Output match: CRIU is not enabled]
[2022-06-10T17:07:22.124Z] >> Failure condition was not found: [Output match: Operation not permitted]
[2022-06-10T17:07:22.124Z] >> Failure condition was not found: [Output match: ERR]
[2022-06-10T17:07:22.124Z] 
[2022-06-10T17:07:22.124Z] Testing: Create and Restore Criu Checkpoint Image twice
[2022-06-10T17:07:22.124Z] Test start time: 2022/06/10 14:07:21 Atlantic Standard Time
[2022-06-10T17:07:22.124Z] Running command: bash /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java  -Xint  org.openj9.criu.CRIUSimpleTest TwoCheckpoints
[2022-06-10T17:07:22.124Z] Time spent starting: 9 milliseconds
[2022-06-10T17:07:28.230Z] Time spent executing: 4947 milliseconds
[2022-06-10T17:07:28.230Z] Test result: PASSED
[2022-06-10T17:07:28.230Z] 
[2022-06-10T17:07:28.230Z] Testing: Create and Restore Criu Checkpoint Image three times
[2022-06-10T17:07:28.230Z] Test start time: 2022/06/10 14:07:26 Atlantic Standard Time
[2022-06-10T17:07:28.230Z] Running command: bash /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_linux_criu_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java  -Xint  org.openj9.criu.CRIUSimpleTest ThreeCheckpoints
[2022-06-10T17:07:28.230Z] Time spent starting: 7 milliseconds
[2022-06-10T17:07:35.286Z] Time spent executing: 7159 milliseconds
[2022-06-10T17:07:35.286Z] Test result: PASSED
[2022-06-10T17:07:35.286Z] 
[2022-06-10T17:07:35.286Z] 
[2022-06-10T17:07:35.286Z] ---TEST RESULTS---
[2022-06-10T17:07:35.286Z] Number of PASSED tests: 2 out of 3
[2022-06-10T17:07:35.286Z] Number of FAILED tests: 1 out of 3
[2022-06-10T17:07:35.286Z] 
[2022-06-10T17:07:35.286Z] ---SUMMARY OF FAILED TESTS---
[2022-06-10T17:07:35.286Z] Create and Restore Criu Checkpoint Image once
[2022-06-10T17:07:35.286Z] -----------------------------
[2022-06-10T17:07:35.286Z] 
[2022-06-10T17:07:35.286Z] 
[2022-06-10T17:07:35.286Z] cmdLineTester_criu_1_FAILED

caused by #14974

@mpirvu
Copy link
Contributor

mpirvu commented Jun 11, 2022

I agree that the failures seen in testing are not related to this PR, hence merging.

@mpirvu mpirvu merged commit 20c2e83 into eclipse-openj9:master Jun 11, 2022
@dsouzai dsouzai deleted the criuHang branch February 3, 2023 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug comp:jit criu Used to track CRIU snapshot related work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CRIU TR::CompilationInfo::prepareForCheckpoint causes hang intermittently at single thread mode
2 participants