Support for reading/writing a JITServer AOT cache in the background #16075
Conversation
Attn. @AlexeyKhrabrov @cjjdespres
std::string cacheFileName = buildCacheFileName(compInfo->getPersistentInfo()->getJITServerAOTCacheDir(), name);
OMRPORT_ACCESS_FROM_OMRPORT(TR::Compiler->omrPortLib);
J9FileStat buf;
int32_t rc = omrfile_stat(cacheFileName.c_str(), 0, &buf);
I would consider not `stat`-ing the file separately and instead dealing with the missing file when we try to open it. The separate `stat` introduces a TOCTTOU issue (a minor one, but still) and adds syscall (and possibly I/O) latency to the current compilation request, which bypasses the cache being loaded.
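A minimal sketch of this suggestion, using plain `fopen()` in place of the OMR port-library calls (the function names here are illustrative, not actual OpenJ9 symbols). A single open attempt both checks existence/permissions and yields the handle, so there is no window between check and use:

```cpp
#include <cstdio>
#include <string>

// Instead of stat()-ing the file and opening it later (a TOCTTOU window),
// try to open it directly and treat failure as "no snapshot".
FILE *tryOpenSnapshot(const std::string &cacheFileName)
{
    return std::fopen(cacheFileName.c_str(), "rb");
}

bool snapshotAvailable(const std::string &cacheFileName)
{
    if (FILE *f = tryOpenSnapshot(cacheFileName))
    {
        std::fclose(f);
        return true;
    }
    return false; // missing file or no read permission: create a new cache
}
```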
When I wrote the code I thought about the scenario where a client asks for a cache name the server has not seen before, and therefore there is no cache. If we proceed without verifying the existence of the cache file, the in-memory cache will be created later (when we attempt to read the nonexistent file) and all the methods compiled until then will not be cached.
I guess only the second client will be affected because the compilations from this one will be cached and the third client will have access to them.
I see. Then it does make sense to check if the file exists. Maybe a better solution is to open the file here and pass the file descriptor/stream to the thread that will do the snapshot load.
I still think that a better solution is to `fopen()` the file here. If it succeeds, add the `FILE *` along with the name to `_cachesToLoadQueue`. Otherwise the file either doesn't exist or we don't have permission to read it, and in either case we should create a new cache and not try to read it from the file.
This way we avoid the scenario where cache directory permissions are misconfigured and each JITServer instance tries to read the cache instead of creating a new one, so the first few AOT methods would never be cached. I think that would also simplify the implementation a bit. But it's up to you whether you think it's worth it.
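As a rough sketch of this open-then-queue idea (the queue type and helper name are simplified stand-ins, not the actual OpenJ9 code):

```cpp
#include <cstdio>
#include <list>
#include <string>
#include <utility>

// Queue of (cache name, already-open snapshot file) pairs for the
// background loader thread; a simplified stand-in for _cachesToLoadQueue.
static std::list<std::pair<std::string, FILE *>> _cachesToLoadQueue;

// Returns true if the snapshot was opened and queued for background loading.
// false means the file is missing or unreadable, so the caller should create
// a fresh in-memory cache instead of retrying the load.
bool queueCacheLoad(const std::string &name, const std::string &fileName)
{
    FILE *f = std::fopen(fileName.c_str(), "rb");
    if (!f)
        return false;
    _cachesToLoadQueue.emplace_back(name, f);
    return true;
}
```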
Thinking some more about this, I am leaning towards eliminating any system call at this point. In Kubernetes the snapshot can be far away on NFS or COS (cloud object storage) and may take many milliseconds just to check whether the file exists. This could delay the current compilation (and others, because of the cache map lock) a great deal.
The downside of proceeding without a snapshot existence check is that the first N compilations from the first client that attaches to a JITServer lacking the requested cache will not be put in the AOT cache. They may be cached later on, when another client requests the missing methods.
Agreed. I wasn't even thinking of remote storage in my original suggestion.
One more thought. Once we support merging caches/snapshots, the issue of not caching the first N compilations could be fixed by merging the cache being loaded from the file into the in-memory cache (which we can create and start populating immediately). Assuming that merging doesn't measurably slow down the load.
I have eliminated the `stat`-ing of the snapshot. In my experiments I only see one method being compiled before we choose to create an empty in-memory cache.
std::string cacheFileName;
// If a directory is given, add it in the front of the cache name
if (!cacheDir.empty())
    cacheFileName = cacheDir + "/"; // TODO: is there something more portable?
C++17 has `std::filesystem::path`. Otherwise we can take inspiration from what the SCC code does.
Unfortunately we cannot use C++17.
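Since C++17 is off the table, a small pre-C++17 helper can at least avoid doubled separators. This is only an illustrative sketch (`DIR_SEPARATOR` is a stand-in for whatever the port layer defines per platform), reusing the `buildCacheFileName` name from the diff above:

```cpp
#include <string>

// Stand-in for a platform-specific separator from the port layer.
static const char DIR_SEPARATOR = '/';

// Join a directory and a cache name without producing "dir//name".
std::string buildCacheFileName(const std::string &cacheDir, const std::string &name)
{
    if (cacheDir.empty())
        return name;
    if (cacheDir[cacheDir.size() - 1] == DIR_SEPARATOR)
        return cacheDir + name;
    return cacheDir + DIR_SEPARATOR + name;
}
```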
Force-pushed from ac6d45f to 4602e46
Force-pushed from 4602e46 to 2bcdb38
Force-pushed from 2bcdb38 to 7d6e09a
Some early feedback on the WIP implementation of periodic snapshots. The PR name and description also need to be updated accordingly.
@@ -133,6 +133,8 @@ int32_t J9::Options::_sharedROMClassCacheNumPartitions = 16;
int32_t J9::Options::_reconnectWaitTimeMs = 1000;
int32_t J9::Options::_highActiveThreadThreshold = -1;
int32_t J9::Options::_veryHighActiveThreadThreshold = -1;
int32_t J9::Options::_numExtraAotMethodsToTriggerSaveOperation = 200;
How are these default values chosen?
I think for the number of methods ideally we need to ensure that the overhead (user+kernel CPU time) of persisting the cache is not bigger than the savings in JIT CPU time from reusing that number of cached methods for one client. Note that we always write out the whole cache, not just the delta of newly stored methods.
It's a bit more complicated for the time period because the overhead (e.g. percentage of total CPU time taken by writing snapshots) depends on the number of caches. 10 seconds sounds reasonable in general, but ideally this should be supported by at least some performance numbers and some reasoning about the trade-offs.
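A back-of-envelope sketch of this break-even reasoning. All constants below are made-up placeholders, not measurements; the point is only the shape of the comparison (full-cache write cost vs. per-client reuse savings):

```cpp
// Persisting cost: we always write the whole cache, not just the delta.
double persistCostUs(double writeCostPerMethodUs, long totalMethods)
{
    return writeCostPerMethodUs * totalMethods;
}

// Savings: one client reusing the newly added methods avoids that much
// JIT CPU time.
double reuseSavingsUs(double compileCpuSavingsUs, long newMethods)
{
    return compileCpuSavingsUs * newMethods;
}

// A snapshot is "worth it" when expected savings exceed the write cost.
bool snapshotWorthwhile(double writeCostPerMethodUs, long totalMethods,
                        double compileCpuSavingsUs, long newMethods)
{
    return reuseSavingsUs(compileCpuSavingsUs, newMethods) >
           persistCostUs(writeCostPerMethodUs, totalMethods);
}
```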
The values were somewhat arbitrary. They might change once experiments are performed. Currently waiting for the prerequisites to be delivered.
Force-pushed from 5a01592 to 58f9d36
Force-pushed from 5eabcb9 to 7358544
Force-pushed from 7358544 to 600b8d9
Some output from running in Kubernetes with 2 JITServers: First JITServer:
Second JITServer:
It's debatable whether the second JITServer's save operation is that useful: it saved 208 methods versus the 200 that already existed.
Force-pushed from 600b8d9 to 28764cb
Some traces after load is applied to AcmeAir:
JITServer2:
Some observations based on the traces above:
Output from new code. Also applied load with more JMeter threads:
JITServer2:
JITServer3 which starts later, so this one will do a read of a snapshot first:
In the logs above I see some inefficiency: a JITServer overwrites an existing snapshot just to add a few methods. Maybe we should impose the condition that a server can overwrite the snapshot only if it adds 200 more methods.
Force-pushed from 28764cb to bb26c3a
Review comments have been addressed. PR is ready for another review.
Some traces from the last version of the code. Starting 2 JITServers and 20 AcmeAir containers and then applying load: JITServer1:
JITServer2:
Started a third JITServer and re-started all the Liberty pods. The third JITServer will read the AOT cache snapshot:
See inline comments (mostly minor). Also, the commit messages are outdated:
- "stream==0" in *Load JITServer AOT cache file in the background*.
- `-Xjit:` option names in *Save the JITServer AOT cache to file periodically*.
Another experiment: scaling back to 0 (deleting all JITServer instances), then starting 2 new instances and 20 AcmeAir instances.
JITServer2:
Will wait until @AlexeyKhrabrov approves before running tests.
// _cachesToSaveQueue is used to remember the caches that need to be saved to file
// A compilation thread dequeues the first name and tries to save the named cache to a file,
// possibly overwriting an existent file
PersistentList<std::string> _cachesToSaveQueue;
Why do we store cache names here rather than pointers to cache instances?
I wasn't sure whether caches can be unloaded and I didn't want to hold onto a pointer that could become invalid.
It would be nice to allow cache unloading in the future. If there is no client wanting a particular cache anymore, why hold onto that memory for nothing.
Makes sense. It should be relatively easy to implement periodic purging of caches that are unused for a certain time period from memory (possibly writing the final version to the snapshot file if it adds any methods). We should open an issue for this.
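A sketch of why queuing names rather than pointers stays safe if unloading is added later: the saver re-resolves the name under the map lock, and a cache purged in the meantime is simply skipped. Types and names here are simplified stand-ins, not the actual OpenJ9 code:

```cpp
#include <list>
#include <map>
#include <string>

struct AOTCache { /* stand-in for the real in-memory cache */ };

static std::map<std::string, AOTCache *> _cacheMap;   // guarded by the map monitor
static std::list<std::string> _cachesToSaveQueue;     // names, not pointers

// Returns the next cache to save, or nullptr if every queued cache was
// unloaded after being enqueued.
AOTCache *dequeueNextCacheToSave()
{
    while (!_cachesToSaveQueue.empty())
    {
        std::string name = _cachesToSaveQueue.front();
        _cachesToSaveQueue.pop_front();
        auto it = _cacheMap.find(name);   // would hold the map monitor here
        if (it != _cacheMap.end())
            return it->second;            // still loaded; safe to save
        // else: cache was purged; drop the stale queue entry and continue
    }
    return nullptr;
}
```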
New options that were introduced:
- `-XX:[+|-]JITServerAOTCachePersistence` to enable/disable the AOTCache persistence feature
- `-XX:JITServerAOTCacheDir=<name>` to specify the directory where caches are located (default is the current directory)

Signed-off-by: Marius Pirvu <mpirvu@ca.ibm.com>
Signed-off-by: Marius Pirvu <mpirvu@ca.ibm.com>
Force-pushed from ceb7d6f to 7958281
JITServer offers support for loading an AOT cache from a file. This is done by a compilation thread that works concurrently with other compilation threads that compile methods. The sequence of actions is the following:
1. A compilation thread at the server reads the compilation request from the client and determines whether the client wants to use the AOT cache.
2. The server checks its hashmap to see if the named cache already exists. If it exists, it will be used.
3. If the desired cache does not exist, the server checks if the persistence feature is enabled and whether a file for the desired cache exists.
4. If the AOT cache file exists, the server memorizes the name of the cache in a set, queues a special compilation request and wakes a compilation thread to process it. It then proceeds to compile the requested method, without waiting for the AOT cache to be loaded.
5. If the AOT cache file does not exist (or persistence is disabled), the server creates an empty AOT cache which will be populated as methods get compiled.
6. When a compilation thread extracts the special request from the queue, it proceeds to load the AOT cache from disk. The name of the AOT cache is retrieved from the set where it was memorized earlier. If there is a problem during loading, the name of the cache is added to a set of caches that are excluded from being loaded in the future.

Signed-off-by: Marius Pirvu <mpirvu@ca.ibm.com>
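The decision made in steps 2-5 of that sequence can be condensed into a sketch like this (all symbols are illustrative stand-ins, not actual OpenJ9 names):

```cpp
#include <map>
#include <set>
#include <string>

enum class CacheAction { UseExisting, QueueLoadFromFile, CreateEmpty };

static std::map<std::string, int> _namedCaches;     // name -> (stand-in for) in-memory cache
static std::set<std::string> _excludedCacheNames;   // caches that failed to load before

CacheAction chooseCacheAction(const std::string &name,
                              bool persistenceEnabled,
                              bool fileExists)
{
    if (_namedCaches.count(name))
        return CacheAction::UseExisting;        // step 2: already in memory
    if (persistenceEnabled && fileExists && !_excludedCacheNames.count(name))
        return CacheAction::QueueLoadFromFile;  // steps 3-4: load in background
    return CacheAction::CreateEmpty;            // step 5: populate from scratch
}
```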
Force-pushed from 7958281 to 145c84a
This commit implements the code that attempts to save a JITServer AOT cache periodically to file. The save operation is done only under the following conditions, which must all be met:
1. The in-memory AOT cache has added at least 200 new methods since the last successful save attempt (controllable by `-Xjit:extraAotMethodsToTriggerSaveOperation=<N>`).
2. At least 10 seconds have passed since the last save attempt (controllable by `-Xjit:aotCachePersistenceMinPeriodMs=<ms>`).
3. The in-memory cache contains at least 200 more AOT methods than the cache on disk (controllable with `-Xjit:aotCachePersistenceMinDeltaMethods=<N>`).

The save operation is done asynchronously by a compilation thread. While a save operation is in progress, a flag is set to prevent other threads from saving the same cache concurrently.

Signed-off-by: Marius Pirvu <mpirvu@ca.ibm.com>
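The three conditions above can be sketched as a single predicate; the struct and function below are illustrative stand-ins, with the default thresholds taken from the commit message:

```cpp
#include <cstdint>

struct CacheSaveState
{
    uint64_t methodsInMemory;
    uint64_t methodsAtLastSave;      // methods in memory at the last successful save
    uint64_t methodsOnDisk;          // methods in the snapshot file
    int64_t  msSinceLastSaveAttempt;
    bool     saveInProgress;         // stand-in for _saveOperationInProgress
};

bool shouldTriggerSave(const CacheSaveState &s,
                       uint64_t extraMethodsToTrigger = 200, // extraAotMethodsToTriggerSaveOperation
                       int64_t  minPeriodMs = 10000,         // aotCachePersistenceMinPeriodMs
                       uint64_t minDeltaMethods = 200)       // aotCachePersistenceMinDeltaMethods
{
    return !s.saveInProgress
        && s.methodsInMemory >= s.methodsAtLastSave + extraMethodsToTrigger
        && s.msSinceLastSaveAttempt >= minPeriodMs
        && s.methodsInMemory >= s.methodsOnDisk + minDeltaMethods;
}
```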
When a first request for a named AOT cache is made, the code would check for the existence of an appropriately named cache file and then place a special compilation request to load the cache asynchronously. However, since this file can be on network-attached storage, the check could take many milliseconds, blocking the current compilation request and possibly others (due to holding the cache map monitor). This commit removes the mentioned file check. The only downside is that we may delay the creation of an empty in-memory cache and thus may not store the first few AOT compilations.

Signed-off-by: Marius Pirvu <mpirvu@ca.ibm.com>
Force-pushed from 145c84a to cabd3ec
jenkins test sanity plinuxjit,xlinuxjit,zlinuxjit jdk17
pLinux timeouts in this test:
jenkins test sanity plinuxjit jdk17
All tests have passed
This PR implements support for loading an AOT cache from a file and saving an AOT cache to a file. This is done by a compilation thread that works concurrently with other compilation threads that compile methods.
For reading, the sequence of actions is the following:
For writing, the sequence of actions is the following:
1. In `outOfProcessCompilationEnd()`, when a compilation finishes, we check if the conditions for saving the AOT cache are met (in method `triggerAOTCacheStoreToFileIfNeeded()`). These conditions are: (A) "enough" methods were added to the in-memory AOT cache since the last snapshot and (B) "enough" time has passed since the last snapshot.
2. If the conditions are met: (A) we set `_saveOperationInProgress` to prevent other threads doing a concurrent save; (B) add the name of the cache to a queue of caches to be saved; (C) queue a special compilation request with `stream==SAVE_AOTCACHE_REQUEST`; (D) notify a sleeping compilation thread.
3. The compilation thread that picks up the special request detects it in `processEntry()` and we call `aotCacheMap->saveNextQueuedAOTCacheToFile();`
4. When the save operation finishes, we reset the `_saveOperationInProgress` flag.

Signed-off-by: Marius Pirvu <mpirvu@ca.ibm.com>