Enable JITServer post-restore only if explicitly specified #17205

Merged (3 commits) on Apr 28, 2023

@dsouzai dsouzai commented Apr 17, 2023

This PR updates how a JVM client connects to a JITServer in the context of CRIU, as outlined in the following table; ✅ means the JVM will connect to a JITServer instance and ❌ means it won't.

Non-Portable CRIU Pre-Checkpoint Non-Portable CRIU Post-Restore Portable CRIU Pre-Checkpoint Portable CRIU Post-Restore
No Options Pre-Checkpoint; No Options Post-Restore
No Options Pre-checkpoint; -XX:+UseJITServer Post-Restore
-XX:+UseJITServer Pre-Checkpoint; No Options Post-Restore
-XX:+UseJITServer Pre-Checkpoint; -XX:-UseJITServer Post-Restore
-XX:-UseJITServer Pre-Checkpoint; -XX:+UseJITServer Post-Restore

This PR also adds cmdLineTester tests for jitserver, both by itself and in the context of CRIU.

@dsouzai dsouzai added comp:test comp:jitserver Artifacts related to JIT-as-a-Service project criu Used to track CRIU snapshot related work labels Apr 17, 2023

dsouzai commented Apr 17, 2023

@mpirvu could you please review?

@llxia could you please review the tests?

Comment on lines +62 to +64
kill -9 $JITSERVER_PID
# Running pkill seems to cause a hang...
#pkill -9 -xf "$TEST_JDK_BIN/jitserver $JITSERVER_OPTIONS"

Something about how the cmdLineTester tokenizes the args passed to this script causes it to hang at this point if I use pkill -xf (when I run this script manually there's no issue with pkill -xf). As such, I just stuck with kill.
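A minimal sketch of the PID-based approach described above, assuming the server is launched in the background by the same script. `sleep 300` is only a stand-in for the real jitserver command line, and `SERVER_PID` is a hypothetical variable name.

```shell
# Stand-in for launching jitserver in the background (assumption: the real
# command would be something like "$TEST_JDK_BIN/jitserver $JITSERVER_OPTIONS &").
sleep 300 &
SERVER_PID=$!

# ... the client would run here ...

# Terminate exactly the process started above, then reap it so that no
# zombie is left behind. "wait" reports the signal exit status, so ignore it.
kill -9 "$SERVER_PID"
wait "$SERVER_PID" 2>/dev/null || true
```

Killing by PID avoids any pattern matching against the full command line, which is the part that `pkill -xf` depends on.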


random_port () {
RANDOM_PORT=$(($(($RANDOM%$DIFF))+$START_PORT))
out=$(lsof -i -P -n | grep LISTEN | grep $RANDOM_PORT)

When I was looking for this online, all the examples used sudo, I suppose in order to list every single port. However, I'm pretty sure that's not needed here, since the ports we're searching should only be used by processes that don't have elevated privileges.
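For illustration, here is one way the random-port search could be written with a retry loop. This is a sketch, not the script from the PR: the `START_PORT` and `DIFF` values and the function names `port_in_use` and `random_free_port` are assumptions. It asks `lsof` for an exact port in the LISTEN state instead of grepping the full listing, since a plain `grep` on the port number can also match longer numbers that contain it.

```shell
# Hypothetical range; the values used by the real script are not shown in the thread.
START_PORT=38400
DIFF=1000

# Returns 0 if lsof reports a TCP listener on port $1.
# If lsof is missing or sees nothing, the port is treated as free.
port_in_use () {
    lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Prints a random port in [START_PORT, START_PORT + DIFF) that is not in use.
# Gives up and returns 1 after $1 attempts (default 20).
random_free_port () {
    local attempts=${1:-20} port
    while [ "$attempts" -gt 0 ]; do
        port=$(( RANDOM % DIFF + START_PORT ))
        if ! port_in_use "$port"; then
            echo "$port"
            return 0
        fi
        attempts=$(( attempts - 1 ))
    done
    return 1
}
```

Without elevated privileges `lsof` only reports sockets owned by the current user, so a port held by another user's process can still look free; the check narrows the race but does not eliminate it.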

@dsouzai dsouzai mentioned this pull request Apr 17, 2023
30 tasks
@mpirvu mpirvu self-requested a review April 17, 2023 18:13

mpirvu commented Apr 18, 2023

I would like to better understand the goal of these changes. Here's my take:

Non-portable restore mode (the default)
There is only one checkpoint/restore operation, and options are parsed once at JVM bootstrap and a second time immediately after restore. Before checkpoint: unless explicitly disabled, the JVM will be set up as a client, but no remote compilations will take place. After a restore: remote compilations will take place only if the user specifies -XX:+UseJITServer.
It's debatable whether we want remote compilations post restore, if the user already specified -XX:+UseJITServer at JVM bootstrap.

Portable restore mode
There could be several checkpoint/restore operations and options are only processed at JVM bootstrap.
If the user has specified -XX:+UseJITServer, the JVM will work in client mode all the time, performing remote compilations. There is no clear notion of "before checkpoint" or "after restore" since there could be several such events.
If the user did not specify -XX:+UseJITServer (or if they used -XX:-UseJITServer) at JVM bootstrap, the JVM should not work in client mode at any given time. This is consistent with the fact that the user has to opt in to use JITServer tech.

dsouzai commented Apr 18, 2023

> I would like to better understand the goal of these changes. Here's my take:
>
> Non-portable restore mode (the default)
> There is only one checkpoint/restore operation, and options are parsed once at JVM bootstrap and a second time immediately after restore. Before checkpoint: unless explicitly disabled, the JVM will be set up as a client, but no remote compilations will take place. After a restore: remote compilations will take place only if the user specifies -XX:+UseJITServer.

Yeah this is all accurate.

> It's debatable whether we want remote compilations post restore, if the user already specified -XX:+UseJITServer at JVM bootstrap.

Ah yeah this is something that's not currently handled. I think if a user specified -XX:+UseJITServer at bootstrap, then we should enable it post-restore, because the user has anticipated at build time itself that a jitserver instance will be available at deployment. I will need to add this functionality (and a test for it as well).

> Portable restore mode
> There could be several checkpoint/restore operations and options are only processed at JVM bootstrap.

You're right in that there could be several checkpoint/restore operations, but there's still going to be the post-restore hook that's called at each restore, and so options will still be processed post-restore.

> If the user has specified -XX:+UseJITServer, the JVM will work in client mode all the time performing remote compilations.

Yes, this is correct; this would be the EXPLICIT_CLIENT mode.

> There is no clear notion of "before checkpoint" or "after restore" since there could be several such events.

At the moment I don't think we have a good story for multiple checkpoint/restore points in that we're still going to call the post-restore options processing each time we restore. I think dealing with that is something we're gonna have to think about for options in general.

> If the user did not specify -XX:+UseJITServer (or if they used -XX:-UseJITServer) at JVM bootstrap, the JVM should not work in client mode at any given time. This is consistent with the fact that the user has to opt in to use JITServer tech.

Yes this is correct.
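The non-portable rules discussed above can be sketched as a small decision function. This is only an illustration of the discussion, not the logic in the PR: the function name `would_use_jitserver` and its argument encoding are hypothetical, and it assumes that an explicit post-restore option overrides whatever was given at bootstrap.

```shell
# Prints "yes" if a non-portable-mode client would attempt remote
# compilations, "no" otherwise (assumed rules, see the lead-in).
#   $1  phase:               pre-checkpoint | post-restore
#   $2  bootstrap option:    plus | minus | none
#   $3  post-restore option: plus | minus | none
# "plus" stands for -XX:+UseJITServer, "minus" for -XX:-UseJITServer.
would_use_jitserver () {
    local phase=$1 bootstrap=$2 restore=$3
    # No remote compilations take place before the checkpoint.
    if [ "$phase" = "pre-checkpoint" ]; then
        echo no
        return
    fi
    case "$restore" in
        plus)  echo yes ;;
        minus) echo no ;;
        *)
            # Nothing given post-restore: fall back to the bootstrap option.
            if [ "$bootstrap" = "plus" ]; then echo yes; else echo no; fi
            ;;
    esac
}
```

For example, `would_use_jitserver post-restore none plus` models "No Options Pre-Checkpoint; -XX:+UseJITServer Post-Restore".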

mpirvu commented Apr 18, 2023

If options are processed after each restore in portable mode, then some users might provide -XX:+UseJITServer as post restore options and expect that to take effect. However, I don't know what will happen if we generate a new client UID after each restore. Maybe it's better to just disable JITServer in this mode (and document the behavior).
In general we should document all the possible combinations and decisions taken. It's hard to keep track of it all.

dsouzai commented Apr 18, 2023

> If options are processed after each restore in portable mode, then some users might provide -XX:+UseJITServer as post restore options and expect that to take effect. However, I don't know what will happen if we generate a new client UID after each restore.

There's a test for this scenario:
https://github.com/eclipse-openj9/openj9/pull/17205/files#diff-8cca2294b80f1e8bfdae61f3efffc5a9e2676e05d8d6987d3a5ee3ce1457ccc8R116-R139
This portable criu test doesn't explicitly add -XX:+UseJITServer post restore, but because it was specified pre-checkpoint it basically does the same thing. I never ran into any issues even though pre-checkpoint and post-restore there were two different client UIDs generated.

> In general we should document all the possible combinations and decisions taken. It's hard to keep track of it all.

Yeah I'll add documentation to this PR, and open another issue to keep track of all the things we need to document on the openj9 docs.

llxia commented Apr 19, 2023

Should this test run in the JITAAS test build? In the JITAAS test build, the test framework starts jitserver and sets -XX:+UseJITServer for all tests.
https://github.com/adoptium/TKG/blob/79db2ffe07e64a03150a4ec1960e50770be58dab/testEnv.mk#L23-L31

So the above test will have -XX:+UseJITServer set via JVM_OPTIONS when TEST_FLAG=JITAAS.

See criu test in Test_openjdk11_j9_sanity.functional_x86-64_linux_jit as example:

[2023-04-19T08:12:37.743Z] ===============================================
[2023-04-19T08:12:37.743Z] Running test cmdLineTester_criu_nonPortableRestoreJDK11Up_0 ...
[2023-04-19T08:12:37.743Z] ===============================================
[2023-04-19T08:12:37.743Z] cmdLineTester_criu_nonPortableRestoreJDK11Up_0 Start Time: Wed Apr 19 01:12:37 2023 Epoch Time (ms): 1681891957710
[2023-04-19T08:12:38.178Z] variation: -Xjit -XX:+CRIURestoreNonPortableMode
[2023-04-19T08:12:38.178Z] JVM_OPTIONS: -XX:+UseJITServer -Xjit -XX:+CRIURestoreNonPortableMode 
...

dsouzai commented Apr 19, 2023

> Should this test run in the JITAAS test build?

I don't think so; that's the reason I added the code for the random port. I was essentially following the same principle as

<test>
	<testCaseName>testJITServer</testCaseName>
	<!-- Variations are passed to the client via the CLIENT_PROGRAM property from $JVM_OPTIONS; neither the test harness nor the server care about these. -->
	<variations>
		<variation>Mode610</variation>
		<variation>Mode610 -Xshareclasses:none -Xjit:optLevel=hot</variation>
		<variation>Mode610 -Xshareclasses:name=test_jitscc -XX:+JITServerUseAOTCache</variation>
	</variations>
	<!-- Check if the JITServer launcher exists and if so start the test and
		- specify the executables for the client and server via the CLIENT_EXE and SERVER_EXE properties respectively,
		- specify what the client will run via the CLIENT_PROGRAM property.
		If the launcher doesn't exist we assume that the build doesn't support JITServer and trivially pass the test. -->
	<command>if [ -x $(Q)$(TEST_JDK_BIN)$(D)jitserver$(Q) ]; \
	then \
		$(JAVA_COMMAND) \
		-cp $(Q)$(RESOURCES_DIR)$(P)$(TESTNG)$(P)$(TEST_RESROOT)$(D)jitt.jar$(Q) \
		-DSERVER_EXE=$(Q)$(TEST_JDK_BIN)$(D)jitserver$(Q) \
		-DCLIENT_EXE=$(JAVA_COMMAND) \
		-DCLIENT_PROGRAM=$(SQ)$(JVM_OPTIONS) -cp $(RESOURCES_DIR)$(P)$(TESTNG)$(P)$(TEST_RESROOT)$(D)jitt.jar -DjarTesterArgs=$(Q)-loopforever $(TEST_RESROOT)$(D)jitt.jar$(Q) org.testng.TestNG -d $(REPORTDIR)$(D)client $(TEST_RESROOT)$(D)testng.xml -testnames JarTesterTest -groups $(TEST_GROUP) -excludegroups $(DEFAULT_EXCLUDE)$(SQ) \
		org.testng.TestNG \
		-d $(REPORTDIR) \
		$(Q)$(TEST_RESROOT)$(D)testng.xml$(Q) \
		-testnames JITServerTest \
		-groups $(TEST_GROUP) \
		-excludegroups $(DEFAULT_EXCLUDE); \
	else \
		echo; \
		echo $(Q)$(TEST_JDK_BIN)$(D)jitserver doesn't exist; assuming this JDK does not support JITServer and trivially passing the test.$(Q); \
	fi; \
	$(TEST_STATUS)</command>
	<platformRequirements>os.linux,^arch.arm,^arch.aarch64,bits.64</platformRequirements>
	<levels>
		<level>sanity</level>
	</levels>
	<groups>
		<group>functional</group>
	</groups>
	<impls>
		<impl>openj9</impl>
	</impls>
</test>

which runs even in a non JITAAS test build. It deals with the JITAAS build by using a random port so that it doesn't clash with the port used by the jitserver instance started by the infrastructure.

llxia commented Apr 20, 2023

With the current setup, it will run in JITAAS test build. If we want to disable it in JITAAS build, we need to use JITAAS:nonapplicable.

		<features>
			<feature>CRIU:required</feature>
			<feature>JITAAS:nonapplicable</feature>
		</features>

FYI @renfeiw

dsouzai commented Apr 20, 2023

I'm ok with it running in a JITAAS build, since the testJITServer test above also runs in a JITAAS build. Unless you think both the new tests added in this PR and testJITServer should be disabled in a JITAAS build?

llxia commented Apr 21, 2023

If it is ok to run in JITAAS build, then #17205 (comment) is not needed.

@llxia llxia left a comment

Other than the typo in copyright, the test change lgtm.

dsouzai commented Apr 24, 2023

@llxia updated the copyright.

@mpirvu good for review again. I removed the ClientMode enum and opted instead for two bools in the compInfo that are used to determine whether -XX:+UseJITServer was specified at bootstrap, and whether the JVM was allowed to connect to a server pre-checkpoint.

Signed-off-by: Irwin D'Souza <dsouzai.gh@gmail.com>
Signed-off-by: Irwin D'Souza <dsouzai.gh@gmail.com>

@mpirvu mpirvu left a comment

LGTM

@mpirvu mpirvu self-assigned this Apr 26, 2023

mpirvu commented Apr 26, 2023

jenkins test sanity plinux,xlinux,zlinux jdk17

mpirvu commented Apr 27, 2023

zlinux failed cmdLineTester_criu_jitserverPostRestore_2

Testing: Check Verbose Log
Test start time: 2023/04/27 00:38:32 Coordinated Universal Time
Running command: cat vlog
Time spent starting: 12 milliseconds
Time spent executing: 13 milliseconds
Test result: FAILED
Output from test:
 [OUT] #CHECKPOINT RESTORE: Ready for restore
>> Required condition was found: [Output match: CHECKPOINT RESTORE: Ready for restore]
>> Success condition was not found: [Output match: Connected to a server]

dsouzai commented Apr 27, 2023

Given that cmdLineTester_criu_jitserverPostRestore_0 and cmdLineTester_criu_jitserverPostRestore_1 passed, I think this failed because the jitserver instance failed to launch, perhaps because the port it got was already in use (maybe by a process that the infra couldn't see via lsof). However, I'll see if I can reproduce it manually.

The x86 test failed because I guess the error given by curl is different from what the test expects; I'll have to think of a better success condition since the output of curl can change...

Signed-off-by: Irwin D'Souza <dsouzai.gh@gmail.com>

dsouzai commented Apr 27, 2023

I manually ran the test on two separate zlinux machines and it passed. I think because the restore did succeed but the test didn't see the client connect, it's more than likely that something prevented the jitserver instance from starting up.

The force push should also fix the x86 test failure caused by the change in the curl output.
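One way a test script could rule out the suspected "server never started" case is to wait until something is listening on the chosen port before launching the client. The sketch below is an assumption, not part of the PR: `wait_for_port` is a hypothetical helper, it relies on bash's `/dev/tcp` redirection, and it assumes a throwaway probe connection to the server port is acceptable.

```shell
# Returns 0 once something accepts TCP connections on 127.0.0.1:$1, or 1 if
# nothing does within $2 attempts (default 10), trying once per second.
# Requires bash, because /dev/tcp is a bash redirection feature.
wait_for_port () {
    local port=$1 attempts=${2:-10}
    while [ "$attempts" -gt 0 ]; do
        # The subshell opens and immediately closes a probe connection.
        if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
            return 0
        fi
        attempts=$(( attempts - 1 ))
        sleep 1
    done
    return 1
}
```

A script could then fail early with a clear message, for example `wait_for_port "$RANDOM_PORT" 10 || echo "server did not start"`, instead of surfacing the problem later as a missing "Connected to a server" line.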

dsouzai commented Apr 27, 2023

jenkins test sanity plinux,xlinux,zlinux jdk17

dsouzai commented Apr 27, 2023

@mpirvu looks like all jobs passed.

@Sreekala-Gopakumar

@dsouzai - What do the columns in the table indicate exactly (Non-Portable CRIU Pre-Checkpoint, Non-Portable CRIU Post-Restore, etc.)?

dsouzai commented Jun 30, 2023

The table indicates how JITServer behaves pre-checkpoint and post-restore in Portable CRIU Mode and Non-Portable CRIU mode.

So Non-Portable CRIU Pre-Checkpoint and Non-Portable CRIU Post-Restore go together conceptually, and Portable CRIU Pre-Checkpoint and Portable CRIU Post-Restore go together.

@dsouzai dsouzai deleted the jitserverCRIU branch April 3, 2024 13:56