New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement JITServer AOT cache thunk handling #16650
Conversation
72c676a
to
f6f11e9
Compare
Attn @mpirvu. You can see in the record deserialization that I simply ignore records of type Some results from acmeair testing:
I will run the test again and look at the source of that failure, but it's possible that the thunks being received from the server are simply causing relocation failures in a different way. |
f6f11e9
to
e0c3dcd
Compare
Please rebase your code to get this change from Irwin: #16625 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only reviewed the first 2 commits so far. Will continue...
bool | ||
JITServerAOTDeserializer::cacheRecord(const ThunkSerializationRecord *record, | ||
TR::Compilation *comp, bool &isNew, bool &wasReset) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to think deeper if this method needs a lock like all the others. Maybe the SCC can handle the concurrency aspect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isResetInProgress()
technically needs a preceding read barrier if we're not acquiring a lock here.
return true; | ||
isNew = true; | ||
|
||
fej9->setJ2IThunk((char *)record->signature(), record->signatureSize(), record->thunkAddress(), comp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering why we don't check the return value of setJ2IThunk()
, but apparently we fail the compilation by throwing an exception. I hope that the deserialization code can handle sudden interruptions through exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only consequence of an exception thrown here (that I can think of) in the deserialization code itself is that some records will be cached without being added to the "known record IDs" sent back to the server.
However, further up the call stack in remoteCompile()
, I think we would handle it as:
openj9/runtime/compiler/control/JITClientCompilationThread.cpp
Lines 3452 to 3462 in 5a52b0d
catch (...) | |
{ | |
if (!details.isJitDumpMethod()) | |
{ | |
// For any other type of exception disconnect the socket | |
client->~ClientStream(); | |
TR_Memory::jitPersistentFree(client); | |
compInfoPT->setClientStream(NULL); | |
} | |
throw; // rethrow the exception | |
} |
unnecessarily (?) disconnecting from the server. Although this already also applies to e.g. exceptions due to persistent allocation failures when adding to deserializer caches.
After rebasing and testing (including changing that
This is all with
I'm assuming I've done something wrong somewhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I only have a few small suggestions. We do need to understand/debug the strange behavior when SVM is enabled during start-up
e0c3dcd
to
5aa218b
Compare
Does the JITServer protocol version need to be bumped as well? |
I believe so. In case of a mismatch, the server could send thunk records that the client does not expect and fail. |
6457bb4
to
2268fd3
Compare
The new field names in the JITServer AOT cache thunk records are more consistent with names already in use in the non-AOT JITServer thunk serialization code. They should also be more reflective of what is being stored in the thunk records. Signed-off-by: Christian Despres <despresc@ibm.com>
3c46529
to
506c645
Compare
I've rebased onto master, to incorporate the changes from #16684. I've also changed
and without the changes from this PR:
I thought that I had already tested this PR with that particular modification (always using the client thunk pointer), but I might be misremembering exactly what I did. To record this for the future: Irwin pointed out that the cause of the excess Irwin and I had a chat, and we believe that this is what was happening:
By consistently using the client thunk pointer, we make sure that equivalent thunks in a particular method always get equal pointers, because we go through the client's thunk lookup mechanism. It doesn't matter that these pointers are nonsense because they get relocated at the client, but it does matter that they're the same (apparently). This PR still has the problem that more thunk records are created than necessary, because the JITServer AOT cache thunk map is still keyed by signature directly. I haven't investigated how big of a problem this is. Irwin did mention that it might be a bit of a pain to refactor the existing signature encoding mechanism so that it can be called from the JITServer AOT cache code, but I haven't looked at the existing code yet. |
jenkins test sanity plinuxjit,xlinuxjit,zlinuxjit jdk17 |
On plinuxjit there is a failure for
|
jenkins test sanity plinuxjit jdk17 |
xlinux had a timeout for cmdLineTester_jvmtitests_hcr_openi9_none_SCC_3, not likely due to this PR:
|
jenkins test sanity xlinuxjit jdk17 |
Tests of xlinux were aborted again, though it's unclear to me why. I think the test as a whole took longer than expected.
|
FYI @AlexeyKhrabrov just in case you want to review this PR as well. |
I compared the performance of (1) OpenJ9 0.35.0, (2) this PR, (3) dev code without this PR, when running AcmeAir in 1P mode and having JITServer AOT cache enabled. I used only "cold" runs that start with an empty SCC (no containers, therefore no embedded SCC).
I would say that there is a small 3-4% improvement in CPU spent for compilation, and maybe a small ~1% footprint improvement, though both these improvements come with changes from the dev build, probably the changes from Irwin related to how thunks are keyed in the SCC to avoid duplication and AOT load failures. Compilation failure stats:
So, there is a clear reduction in compilationRelocationFailure (807-->210), though most of this reduction came from Irwin's thunk encoding change (807-->365) |
@cjjdespres I would be interested to know if this latest version of the code eliminates the crash we saw when enabling SVM during start-up. |
jenkins test sanity xlinuxjit jdk17 |
I tested that a bit on Friday, and yes, I was unable to reproduce the crash with SVM at startup. I'll test with a few different configurations today as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall except for a few minor issues (see inline comments).
I could instead add another field in the _varSizedData of the serialized AOT method that would hold the thunk records. I would then fill that all at once while constructing the final SerializedAOTMethod record at the end of compilation, and deserialize it all at once in the deserializer.
I don't think that's worth the extra complexity. Assuming there aren't too many thunk records per method on average, the space overhead of storing unused placeholder offsets should be negligible.
bool | ||
JITServerAOTDeserializer::cacheRecord(const ThunkSerializationRecord *record, | ||
TR::Compilation *comp, bool &isNew, bool &wasReset) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isResetInProgress()
technically needs a preceding read barrier if we're not acquiring a lock here.
xlinux timed-out again for
|
506c645
to
e4e2fb1
Compare
Force push to address Alexey's comments. |
e4e2fb1
to
3787a08
Compare
jenkins test sanity plinuxjit,xlinuxjit,zlinuxjit jdk17 |
JITServer AOT cache method records may now depend on thunk records. These records must be deserialized and stored in the local shared class cache. Signed-off-by: Christian Despres <despresc@ibm.com>
Compilations that are to be stored in the JITServer AOT cache now handle thunks differently from non-AOT compilations; instead of maintaining the per-client thunk pointer map directly, we now generate thunks at the server for those compilations and store them in the JITServer AOT cache. The compilation object also tracks any thunks a compilation might require, so the relevant records can be added to the main CachedAOTMethod record and be sent to clients as needed. This separate handling is required to ensure that clients have in their local shared class cache any thunks that a JITServer AOT cache method might require. Not having these thunks available will cause an AOT load failure because of the resulting thunk relocation error. Fixes: eclipse-openj9#16482 Signed-off-by: Christian Despres <despresc@ibm.com>
Signed-off-by: Christian Despres <despresc@ibm.com>
3787a08
to
aaf26c9
Compare
I forgot an |
jenkins test sanity plinuxjit,xlinuxjit,zlinuxjit jdk17 |
|
The failure on plinux is a known one and has nothing to do with this PR, hence merging. |
These commits add J2I thunk handling to the JITServer AOT cache. The server now tracks any thunks created during the compilation of methods that will be stored in the JITServer AOT cache. These thunks are stored in the cache, and also serialized and sent in the
SerializedAOTMethod
records that are sent to the client. The client will deserialize these thunk records and store the thunks in their local shared class cache as necessary.