
Implement JITServer AOT cache thunk handling #16650

Merged
merged 4 commits into master from thunk-serialization on Feb 15, 2023

Conversation

cjjdespres (Contributor)

These commits add J2I thunk handling to the JITServer AOT cache. The server now tracks any thunks created during the compilation of methods that will be stored in the JITServer AOT cache. These thunks are stored in the cache and also serialized into the SerializedAOTMethod records sent to the client. The client deserializes these thunk records and stores the thunks in its local shared class cache as necessary.
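In outline, the client-side step described above amounts to the following simplified sketch. The record accessors (signature(), signatureSize(), thunkAddress()) and the getJ2IThunk()/setJ2IThunk() calls appear in the review excerpts further down; the surrounding control flow, locking, and error handling here are illustrative rather than the actual implementation.

bool
JITServerAOTDeserializer::cacheRecord(const ThunkSerializationRecord *record,
                                      TR::Compilation *comp, bool &isNew, bool &wasReset)
   {
   // Sketch only: reset handling via wasReset and all locking are omitted.
   TR_J9VMBase *fej9 = comp->fej9();
   // If the local SCC already has a thunk for this signature, there is nothing to do.
   if (fej9->getJ2IThunk((char *)record->signature(), record->signatureSize(), comp))
      return true;
   isNew = true;
   // Otherwise install the server-provided thunk body into the local shared class cache.
   fej9->setJ2IThunk((char *)record->signature(), record->signatureSize(),
                     record->thunkAddress(), comp);
   return true;
   }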


cjjdespres commented Feb 2, 2023

Attn @mpirvu. You can see in the record deserialization that I simply ignore records of type Thunk when updating the SCC offsets. I could instead add another field in the _varSizedData of the serialized AOT method that would hold the thunk records. I would then fill that all at once while constructing the final SerializedAOTMethod record at the end of compilation, and deserialize it all at once in the deserializer.
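A minimal self-contained sketch of the first option (skipping Thunk records while updating SCC offsets); every type and field name here is a hypothetical stand-in, not one of the actual JITServer serialization classes:

#include <cstdint>
#include <vector>

enum class RecordKind { ClassLoader, Class, Method, ClassChain, Thunk };

struct OffsetEntry
   {
   RecordKind kind;
   uintptr_t sccOffset; // unused placeholder for Thunk entries
   };

void
updateSCCOffsets(std::vector<OffsetEntry> &entries)
   {
   for (auto &entry : entries)
      {
      if (entry.kind == RecordKind::Thunk)
         continue; // the thunk was already installed in the local SCC; no offset to patch
      // ... translate entry.sccOffset to the offset valid in this client's SCC ...
      }
   }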

Some results from acmeair testing:

  • Before these changes, with an acmeair load test I got 102 compilationRelocationFailure (thunkRelocationFailure) errors. I also got 469 compilationRelocationFailure (j2iThunkFromMethodValidationFailure) errors.
  • After these changes, under the same load test I got 0 thunkRelocationFailure errors, but 573 j2iThunkFromMethodValidationFailure errors.

I will run the test again and look at the source of that failure, but it's possible that the thunks being received from the server are simply causing relocation failures in a different way.

@mpirvu self-assigned this on Feb 2, 2023
@mpirvu added the comp:jitserver (Artifacts related to JIT-as-a-Service project) label on Feb 2, 2023
@mpirvu added this to In progress in JIT as a Service via automation on Feb 2, 2023

mpirvu commented Feb 2, 2023

After these changes, under the same load test I got 0 thunkRelocationFailure errors, but 573 j2iThunkFromMethodValidationFailure errors.

Please rebase your code to get this change from Irwin: #16625

@mpirvu left a comment

I only reviewed the first 2 commits so far. Will continue...

runtime/compiler/runtime/JITServerAOTDeserializer.cpp (outdated; resolved)
bool
JITServerAOTDeserializer::cacheRecord(const ThunkSerializationRecord *record,
TR::Compilation *comp, bool &isNew, bool &wasReset)
{
Review comment:

I'll have to think deeper if this method needs a lock like all the others. Maybe the SCC can handle the concurrency aspect

@AlexeyKhrabrov:

isResetInProgress() technically needs a preceding read barrier if we're not acquiring a lock here.
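For context, the suggested pattern is roughly the following standalone sketch; the flag and method names are illustrative stand-ins, and VM_AtomicSupport is the helper from the AtomicSupport.hpp header that ends up being included later in this PR.

#include "AtomicSupport.hpp"

struct DeserializerSketch
   {
   volatile bool _resetInProgress;

   bool cacheRecordWithoutLock()
      {
      // Load fence: pairs with the write barrier issued by the resetting thread so
      // that the read of _resetInProgress below is not a stale, reordered load.
      VM_AtomicSupport::readBarrier();
      if (_resetInProgress)
         return false; // a concurrent reset is clearing the caches; give up
      // ... cache the thunk record ...
      return true;
      }
   };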

runtime/compiler/runtime/JITServerAOTDeserializer.cpp (outdated; resolved)
return true;
isNew = true;

fej9->setJ2IThunk((char *)record->signature(), record->signatureSize(), record->thunkAddress(), comp);
Review comment:

I was wondering why we don't check the return value of setJ2IThunk(), but apparently we fail the compilation by throwing an exception. I hope that the deserialization code can handle sudden interruptions through exceptions.

Reply:

The only consequence of an exception thrown here (that I can think of) in the deserialization code itself is that some records will be cached without being added to the "known record IDs" sent back to the server.

However, further up the call stack in remoteCompile(), I think we would handle it as:

catch (...)
   {
   if (!details.isJitDumpMethod())
      {
      // For any other type of exception disconnect the socket
      client->~ClientStream();
      TR_Memory::jitPersistentFree(client);
      compInfoPT->setClientStream(NULL);
      }
   throw; // rethrow the exception
   }

unnecessarily (?) disconnecting from the server. That said, this already applies to, e.g., exceptions due to persistent allocation failures when adding to the deserializer caches.

@cjjdespres (Contributor, Author)

After rebasing and testing (including changing that comp->fe() call), I get:

  • before changes: 4 j2iThunkFromMethodValidationFailure errors, 116 thunkRelocationFailure errors
  • after changes: 609 j2iThunkFromMethodValidationFailure errors, 0 thunkRelocationFailure errors.

This is all with TR_DontDisableSVMDuringStartup=1 set, incidentally, since I happened to have that set for the segfault testing. Without that set, I get:

  • before changes: 0 j2iThunkFromMethodValidationFailure errors, 6 thunkRelocationFailure errors
  • after changes: 0 of both errors

I'm assuming I've done something wrong somewhere.

@mpirvu left a comment

LGTM. I only have a few small suggestions. We do need to understand/debug the strange behavior when SVM is enabled during start-up

runtime/compiler/compile/J9Compilation.cpp (resolved)
runtime/compiler/compile/J9Compilation.hpp (outdated; resolved)
runtime/compiler/env/VMJ9Server.cpp (outdated; resolved)
runtime/compiler/runtime/JITServerAOTCache.cpp (outdated; resolved)
@cjjdespres (Contributor, Author)

Does the JITServer protocol version need to be bumped as well?


mpirvu commented Feb 7, 2023

Does the JITServer protocol version need to be bumped as well?

I believe so. In case of a mismatch, the server could send thunk records that the client does not expect and fail.
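As a purely illustrative sketch of why the bump matters (the constants and names below are hypothetical, not the real JITServer handshake code): an old client that does not know the new Thunk record kind should be rejected at connection time rather than fail mid-deserialization.

#include <cstdint>

constexpr uint32_t CLIENT_PROTOCOL_VERSION = 41; // client built before thunk records existed
constexpr uint32_t SERVER_PROTOCOL_VERSION = 42; // server with this PR's new Thunk record kind

// With a version bump the mismatch is caught during the handshake, and the client
// can fall back to local compilation instead of later receiving records it cannot
// deserialize.
bool
protocolCompatible(uint32_t clientVersion, uint32_t serverVersion)
   {
   return clientVersion == serverVersion;
   }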

The new field names in the JITServer AOT cache thunk records are more
consistent with names already in use in the non-AOT JITServer thunk
serialization code. They should also be more reflective of what is being
stored in the thunk records.

Signed-off-by: Christian Despres <despresc@ibm.com>
@cjjdespres force-pushed the thunk-serialization branch 2 times, most recently from 3c46529 to 506c645 on February 9, 2023 19:48

cjjdespres commented Feb 9, 2023

I've rebased onto master to incorporate the changes from #16684. I've also changed getJ2IThunk and setJ2IThunk so that the client thunk pointer is always returned, as #16684 does. I've tested this PR with those changes incorporated, and it appears that all of the j2iThunkFromMethodValidationFailure relocation errors are gone. With JITServer AOT caching enabled, I consistently get acmeair results roughly like the following:

  • first run: 68 compilationRelocationFailure, 0 of them from AOT loads
  • second run, connecting to same JITServer and named AOT cache: 33 compilationRelocationFailure, 21 of them from AOT loads, 0 of them from j2iThunkFromMethodValidationFailure or thunkRelocationFailure

and without the changes from this PR:

  • first run: 67 compilationRelocationFailure, 0 of them from AOT loads
  • second run, connecting to the same JITServer and named AOT cache: 155 compilationRelocationFailure, 142 from AOT loads, 114 of those from thunkRelocationFailure and 0 from j2iThunkFromMethodValidationFailure.

I thought that I had already tested this PR with that particular modification (always using the client thunk pointer), but I might be misremembering exactly what I did.

To record this for the future: Irwin pointed out that the cause of the excess j2iThunkFromMethodValidationFailure errors in previous testing seemed to be exactly what is described and addressed in #16625; in particular, the VM's definition of thunk equivalence is weaker than exact signature equality, so it shares thunks more than the local shared class cache does (or rather did, before Irwin's PR was merged).

Irwin and I had a chat, and we believe that this is what was happening:

  • The JITServer AOT cache thunk map is keyed based on signature, not the encoded signature that implements the notion of thunk equivalence that the VM/local SCC uses. This means that it creates and stores more thunks than it really should.
  • During code generation, the server might generate multiple records for equivalent thunks because they had different signatures. These differing server pointers would be returned from the get/setJ2IThunk functions, which would mislead the relocation records into thinking that these thunks were not equivalent (I think).
  • During relocation, the SVM would look up the thunks referred to in these distinct records, find that they were equivalent, get confused, and fail validation.

By consistently using the client thunk pointer, we make sure that equivalent thunks in a particular method always get equal pointers, because we go through the client's thunk lookup mechanism. It doesn't matter that these pointers are nonsense because they get relocated at the client, but it does matter that they're the same (apparently).

This PR still has the problem that more thunk records are created than necessary, because the JITServer AOT cache thunk map is still keyed by signature directly. I haven't investigated how big of a problem this is. Irwin did mention that it might be a bit of a pain to refactor the existing signature encoding mechanism so that it can be called from the JITServer AOT cache code, but I haven't looked at the existing code yet.
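To make the keying difference concrete, here is a self-contained sketch. The collapse-all-references rule is an assumption used only for illustration; the VM's real thunk-equivalence encoding is more involved.

#include <cstddef>
#include <string>

// Reduce a JVM method signature to a coarser key in which every reference-typed
// argument (object or array) looks the same, while primitive shapes are kept.
std::string
encodeForThunkEquivalence(const std::string &signature)
   {
   std::string encoded;
   size_t i = 0;
   while (i < signature.size())
      {
      char c = signature[i];
      if (c == 'L' || c == '[')
         {
         encoded += 'L'; // every reference collapses to one bucket
         while (i < signature.size() && signature[i] == '[')
            ++i; // skip array dimensions
         if (i < signature.size() && signature[i] == 'L')
            while (i < signature.size() && signature[i] != ';')
               ++i; // skip the class name
         ++i; // step past ';' (or a primitive array element type)
         }
      else
         {
         encoded += c; // '(', ')', primitives, and 'V' are kept as-is
         ++i;
         }
      }
   return encoded;
   }

// "(Ljava/lang/String;I)V" and "(Ljava/util/List;I)V" both encode to "(LI)V", so a
// map keyed on the encoded form reuses one thunk where a map keyed on the raw
// signature would create two separate thunk records.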


mpirvu commented Feb 10, 2023

jenkins test sanity plinuxjit,xlinuxjit,zlinuxjit jdk17


mpirvu commented Feb 10, 2023

On plinuxjit there is a failure for cmdLineTest_J9test_common_0

Testing: -Xlockword minimizeFootprint mode
Test start time: 2023/02/10 11:52:45 Atlantic Standard Time
Running command: "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_ppc64le_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/java" -XX:+UseJITServer  -Xdump -cp "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_ppc64le_linux_jit_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/utils/utils.jar" -Xlockword:mode=minimizeFootprint VMBench.FibBench
Time spent starting: 2 milliseconds
Time spent executing: 11512 milliseconds
Test result: FAILED
Output from test:
 [OUT] Fibonacci: iterations = 10000
 [OUT] fibonacci(12) = 144
 [ERR] Unhandled exception
 [ERR] Type=Segmentation error vmState=0x00000000


mpirvu commented Feb 10, 2023

jenkins test sanity plinuxjit jdk17


mpirvu commented Feb 11, 2023

xlinux had a timeout for cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3, not likely due to this PR:

11:47:56  ===============================================
11:47:56  cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3 Start Time: Fri Feb 10 11:47:56 2023 Epoch Time (ms): 1676047676004
11:47:56  variation: Mode112
11:47:56  JVM_OPTIONS: -XX:+UseJITServer -Xgcpolicy:gencon -Xjit:count=0 -Xnocompressedrefs 
11:47:56  { \
11:47:56  echo "";	echo "TEST SETUP:"; \
11:47:56  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done"; \
11:47:56  mkdir -p "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../TKG/output_16760439168430/cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3"; \
11:47:56  cd "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../TKG/output_16760439168430/cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3"; \
11:47:56  echo "";	echo "TESTING:"; \
11:47:56  export LD_LIBRARY_PATH="/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/openjdk-test-image/openj9:/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/../lib/default:/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/../lib/j9vm:"; \
11:47:56  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/java" -XX:+UseJITServer -Xgcpolicy:gencon -Xjit:count=0 -Xnocompressedrefs  -Xshareclasses:none \
11:47:56  -DTEST_ROOT="/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests" \
11:47:56  -DJAR="/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar:/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../TKG/lib/asm-all.jar" \
11:47:56  -DEXE='"/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/java" -XX:+UseJITServer -Xgcpolicy:gencon -Xjit:count=0 -Xnocompressedrefs  -Xdump ' \
11:47:56  -DEXTRA_Add_OPEN_OPTION='--add-opens=java.base/java.lang.reflect=ALL-UNNAMED' \
11:47:56  -jar "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdline_options_tester/cmdlinetester.jar" \
11:47:56  -config "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitests_hcr.xml" \
11:47:56  -explainExcludes -xids all,linux_x86-64 -xlist "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitests_excludes_17.xml" -nonZeroExitWhenError; \
11:47:56  if [ $? -eq 0 ]; then echo "-----------------------------------"; echo "cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3""_PASSED"; echo "-----------------------------------"; cd /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/..; rm -f -r "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../TKG/output_16760439168430/cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3"; else echo "-----------------------------------"; echo "cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3""_FAILED"; echo "-----------------------------------"; fi; \
11:47:56  echo "";	echo "TEST TEARDOWN:"; \
11:47:56  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done"; \
11:47:56   } 2>&1 | tee -a "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_1/aqa-tests/TKG/../TKG/output_16760439168430/TestTargetResult";
11:47:56  
11:47:56  TEST SETUP:
11:47:56  JVMSHRC005I No shared class caches available
11:47:56  JVMSHRC005I No shared class caches available
11:47:56  cache cleanup done
11:47:56  
11:47:56  TESTING:
20:31:28  Cancelling nested steps due to timeout


mpirvu commented Feb 11, 2023

jenkins test sanity xlinuxjit jdk17


mpirvu commented Feb 13, 2023

Tests on xlinux were aborted again, though it's unclear to me why; I think the test run as a whole took longer than expected.

20:11:55  ===============================================
20:11:55  Running test cmdLineTester_GCRegressionTests_0 ...
20:11:55  ===============================================
20:11:55  cmdLineTester_GCRegressionTests_0 Start Time: Sat Feb 11 20:11:53 2023 Epoch Time (ms): 1676164313418
20:11:55  variation: Mode110
20:11:55  JVM_OPTIONS: -XX:+UseJITServer -Xjit -Xgcpolicy:gencon -Xnocompressedrefs 
.........
20:12:49  Testing: -Xmaxf1.0a
20:12:49  Test start time: 2023/02/11 20:12:49 Eastern Standard Time
20:12:49  Running command: "/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/openjdkbinary/j2sdk-image/bin/java" -XX:+UseJITServer -Xjit -Xgcpolicy:gencon -Xnocompressedrefs  -Xmaxf1.0a -version
20:12:55  Terminated
20:12:55  Terminated
20:12:55  make[6]: *** [autoGen.mk:31: cmdLineTester_GCRegressionTests_0] Error 143
20:12:55  make[6]: Leaving directory '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/functional/cmdLineTests/gcRegressionTests'
20:12:55  make[5]: *** [/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/TKG/../TKG/settings.mk:356: testList-gcRegressionTests] Error 2
20:12:55  make[5]: Leaving directory '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/functional/cmdLineTests'
20:12:55  make[4]: *** [/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/TKG/../TKG/settings.mk:356: testList-cmdLineTests] Error 2
20:12:55  make[4]: Leaving directory '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/functional'
20:12:55  make[3]: *** [/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/TKG/../TKG/settings.mk:356: testList-functional] Error 2
20:12:55  make[3]: Leaving directory '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests'
20:12:55  make[2]: *** [settings.mk:356: testList-..] Error 2
20:12:55  make[2]: Leaving directory '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/TKG'
20:12:55  make[1]: *** [makefile:54: _testList] Error 2
20:12:55  make[1]: Leaving directory '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_jit_Personal_testList_0/aqa-tests/TKG'
20:12:55  make: *** [parallelList.mk:8: testList_0] Error 2
20:12:55  script returned exit code 2


mpirvu commented Feb 13, 2023

FYI @AlexeyKhrabrov just in case you want to review this PR as well.


mpirvu commented Feb 13, 2023

I compared the performance of (1) OpenJ9 0.35.0, (2) this PR, and (3) dev code without this PR, running AcmeAir in 1P mode with the JITServer AOT cache enabled. I used only "cold" runs that start with an empty SCC (no containers, therefore no embedded SCC).

Results for JDK=/home/mpirvu/sdks/ibm-semeru-open-jdk_x64_linux_17.0.5_8_openj9-0.35.0 jvmOpts=-XX:+UseJITServer -XX:+JITServerUseAOTCache -Xmx256m
Throughput      avg=4950.91     min=4885.70     max=4982.20     stdDev=29.7     maxVar=1.98%    confInt=0.35%   samples=10
Intermediate results:
Run 0   4732.4  4948.1  Avg=4948        CPU=9190 ms  Footprint=265960 KB
Run 1   4778.5  4965.4  Avg=4965        CPU=6051 ms  Footprint=272032 KB
Run 2   4816.7  4973.6  Avg=4974        CPU=6007 ms  Footprint=270608 KB
Run 3   4770.7  4964.1  Avg=4964        CPU=5994 ms  Footprint=269824 KB
Run 4   4774.6  4965.0  Avg=4965        CPU=6092 ms  Footprint=270340 KB
Run 5   4764.6  4943.2  Avg=4943        CPU=6015 ms  Footprint=271428 KB
Run 6   4785.1  4966.6  Avg=4967        CPU=5875 ms  Footprint=268216 KB
Run 7   4719.8  4915.2  Avg=4915        CPU=5880 ms  Footprint=272108 KB
Run 8   4710.8  4885.7  Avg=4886        CPU=5960 ms  Footprint=267212 KB
Run 9   4782.8  4982.2  Avg=4982        CPU=6084 ms  Footprint=270216 KB
CompTime        avg=6314.80     min=5875.00     max=9190.00     stdDev=1013.0   maxVar=56.43%   confInt=9.30%   samples=10
Footprint       avg=269794.40   min=265960.00   max=272108.00   stdDev=2055.6   maxVar=2.31%    confInt=0.44%   samples=10

Results for JDK=/home/mpirvu/sdks/jdk.chris.thunk jvmOpts=-XX:+UseJITServer -XX:+JITServerUseAOTCache -Xmx256m
Throughput      avg=4935.42     min=4877.40     max=4993.30     stdDev=36.1     maxVar=2.38%    confInt=0.42%   samples=10
Intermediate results:
Run 0   4700.3  4899.5  Avg=4900        CPU=9091 ms  Footprint=265184 KB   ==> larger CPU because the cache at server was empty
Run 1   4844.6  4993.3  Avg=4993        CPU=5775 ms  Footprint=270968 KB
Run 2   4714.5  4928.1  Avg=4928        CPU=5760 ms  Footprint=264556 KB
Run 3   4782.2  4950.0  Avg=4950        CPU=5612 ms  Footprint=268152 KB
Run 4   4765.9  4924.1  Avg=4924        CPU=5740 ms  Footprint=263692 KB
Run 5   4699.4  4877.4  Avg=4877        CPU=5793 ms  Footprint=269112 KB
Run 6   4813.8  4979.7  Avg=4980        CPU=5721 ms  Footprint=268224 KB
Run 7   4756.5  4925.2  Avg=4925        CPU=5820 ms  Footprint=268272 KB
Run 8   4736.9  4913.5  Avg=4914        CPU=5696 ms  Footprint=267804 KB
Run 9   4765.8  4963.4  Avg=4963        CPU=5726 ms  Footprint=262976 KB
CompTime        avg=6073.40     min=5612.00     max=9091.00     stdDev=1061.8   maxVar=61.99%   confInt=10.13%  samples=10
Footprint       avg=266894.00   min=262976.00   max=270968.00   stdDev=2616.0   maxVar=3.04%    confInt=0.57%   samples=10

Results for JDK=/home/mpirvu/sdks/jdk.origBeforeChris jvmOpts=-XX:+UseJITServer -XX:+JITServerUseAOTCache -Xmx256m
Throughput      avg=4932.40     min=4882.50     max=4973.90     stdDev=30.8     maxVar=1.87%    confInt=0.36%   samples=10
Intermediate results:
Run 0   4707.9  4882.5  Avg=4882        CPU=8967 ms  Footprint=268572 KB
Run 1   4780.2  4973.9  Avg=4974        CPU=5717 ms  Footprint=268568 KB
Run 2   4747.9  4910.4  Avg=4910        CPU=5812 ms  Footprint=266796 KB
Run 3   4726.2  4907.4  Avg=4907        CPU=5815 ms  Footprint=266464 KB
Run 4   4729.0  4953.1  Avg=4953        CPU=5772 ms  Footprint=269428 KB
Run 5   4803.5  4965.5  Avg=4966        CPU=5670 ms  Footprint=267040 KB
Run 6   4773.3  4936.5  Avg=4936        CPU=5720 ms  Footprint=265780 KB
Run 7   4807.9  4964.8  Avg=4965        CPU=5752 ms  Footprint=267648 KB
Run 8   4740.3  4911.6  Avg=4912        CPU=5707 ms  Footprint=265872 KB
Run 9   4768.3  4918.3  Avg=4918        CPU=5857 ms  Footprint=266904 KB
CompTime        avg=6078.90     min=5670.00     max=8967.00     stdDev=1016.4   maxVar=58.15%   confInt=9.69%   samples=10
Footprint       avg=267307.20   min=265780.00   max=269428.00   stdDev=1220.3   maxVar=1.37%    confInt=0.26%   samples=10

I would say there is a small 3-4% improvement in CPU spent on compilation, and maybe a small ~1% footprint improvement, though both improvements also include other changes from the dev build, probably Irwin's change to how thunks are keyed in the SCC to avoid duplication and AOT load failures.

Compilation failure stats:

*** ibm-semeru-open-jdk_x64_linux_17.0.5_8_openj9-0.35.0 ***
compilationAotHasInvokehandle               68
compilationSymbolValidationManagerFailure  316
compilationRelocationFailure               807
aotCacheDeserializationFailure             251


*** jdk.chris.thunk ***
compilationAotHasInvokehandle               74
compilationSymbolValidationManagerFailure  303
compilationRelocationFailure               210
aotCacheDeserializationFailure             241


*** jdk.origBeforeChris ***
compilationAotHasInvokehandle               70
compilationSymbolValidationManagerFailure  303
compilationRelocationFailure               365
aotCacheDeserializationFailure             241

So, there is a clear reduction in compilationRelocationFailure (807-->210), though most of this reduction came from Irwin's thunk encoding change (807-->365).


mpirvu commented Feb 13, 2023

@cjjdespres I would be interested to know if this latest version of the code eliminates the crash we saw when enabling SVM during start-up.


mpirvu commented Feb 13, 2023

jenkins test sanity xlinuxjit jdk17

@cjjdespres (Contributor, Author)

I tested that a bit on Friday, and yes, I was unable to reproduce the crash with SVM at startup. I'll test with a few different configurations today as well.

@AlexeyKhrabrov left a comment

LGTM overall except for a few minor issues (see inline comments).

I could instead add another field in the _varSizedData of the serialized AOT method that would hold the thunk records. I would then fill that all at once while constructing the final SerializedAOTMethod record at the end of compilation, and deserialize it all at once in the deserializer.

I don't think that's worth the extra complexity. Assuming there aren't too many thunk records per method on average, the space overhead of storing unused placeholder offsets should be negligible.

runtime/compiler/compile/J9Compilation.cpp (outdated; resolved)
runtime/compiler/env/VMJ9Server.cpp (outdated; resolved)
runtime/compiler/env/VMJ9Server.cpp (outdated; resolved)

runtime/compiler/runtime/JITServerAOTDeserializer.cpp (outdated; resolved)

mpirvu commented Feb 14, 2023

xlinux timed out again for

12:21:32  cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3 Start Time: Mon Feb 13 13:21:30 2023 Epoch Time (ms): 1676308890794
12:21:32  variation: Mode112
12:21:32  JVM_OPTIONS: -XX:+UseJITServer -Xgcpolicy:gencon -Xjit:count=0 -Xnocompressedrefs 

@cjjdespres (Contributor, Author)

Force push to address Alexey's comments.


mpirvu commented Feb 14, 2023

jenkins test sanity plinuxjit,xlinuxjit,zlinuxjit jdk17

JITServer AOT cache method records may now depend on thunk records.
These records must be deserialized and stored in the local shared class
cache.

Signed-off-by: Christian Despres <despresc@ibm.com>
Compilations that are to be stored in the JITServer AOT cache now handle
thunks differently from non-AOT compilations; instead of maintaining the
per-client thunk pointer map directly, we now generate thunks at the
server for those compilations and store them in the JITServer AOT cache.
The compilation object also tracks any thunks a compilation might
require, so the relevant records can be added to the main
CachedAOTMethod record and be sent to clients as needed.

This separate handling is required to ensure that clients have in their
local shared class cache any thunks that a JITServer AOT cache method
might require. Not having these thunks available will cause an AOT
load failure because of the resulting thunk relocation error.

Fixes: eclipse-openj9#16482
Signed-off-by: Christian Despres <despresc@ibm.com>
Signed-off-by: Christian Despres <despresc@ibm.com>
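A self-contained sketch of the flow described in the commit message above (the compilation collects the thunks it generates and attaches them to the serialized method at the end); every name below is an illustrative stand-in, not an actual OpenJ9 class.

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct ThunkRecordSketch
   {
   std::string signature;          // method signature the thunk was generated for
   std::vector<uint8_t> thunkBody; // the J2I thunk code produced at the server
   };

struct CompilationSketch
   {
   std::vector<ThunkRecordSketch> requiredThunks;

   // Called from the server's codegen whenever it creates a thunk for this compilation.
   void noteThunk(const std::string &sig, const uint8_t *code, size_t size)
      {
      requiredThunks.push_back({sig, std::vector<uint8_t>(code, code + size)});
      }
   };

struct SerializedMethodSketch
   {
   std::vector<uint8_t> methodBody;
   std::vector<ThunkRecordSketch> thunkRecords; // shipped to clients along with the method
   };

// At the end of a successful compilation, the tracked thunks are attached to the
// record that goes into the JITServer AOT cache and out to clients.
SerializedMethodSketch
finalizeMethod(const CompilationSketch &comp, std::vector<uint8_t> body)
   {
   return SerializedMethodSketch{std::move(body), comp.requiredThunks};
   }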
@cjjdespres (Contributor, Author)

I forgot an #include of AtomicSupport.hpp in the deserializer. It should build now.


mpirvu commented Feb 14, 2023

jenkins test sanity plinuxjit,xlinuxjit,zlinuxjit jdk17


mpirvu commented Feb 15, 2023

cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3 timed out on plinux

16:36:36  Running test cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3 ...
16:36:36  ===============================================
16:36:36  cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3 Start Time: Tue Feb 14 17:04:53 2023 Epoch Time (ms): 1676408693650
16:36:36  variation: Mode112
16:36:36  JVM_OPTIONS: -XX:+UseJITServer -Xgcpolicy:gencon -Xjit:count=0 -Xnocompressedrefs
....
16:36:36  TESTING:
01:14:28  Cancelling nested steps due to timeout


mpirvu commented Feb 15, 2023

The failure on plinux is a known one and has nothing to do with this PR, hence merging.

@mpirvu merged commit c164228 into eclipse-openj9:master on Feb 15, 2023
JIT as a Service automation moved this from In progress to Done Feb 15, 2023
Labels
comp:jitserver Artifacts related to JIT-as-a-Service project