Add option to track sandbox stashes in memory #22559
Conversation
Closes bazelbuild#22044. PiperOrigin-RevId: 625934206 Change-Id: Iee7e8fa9435f6b027755890e3a8be3ac2025e066
In memory reuse sandbox directory stashes Closes bazelbuild#22559. PiperOrigin-RevId: 638617294 Change-Id: I021343a1e5411734f7b09d941092620191619ade
/** Used to store sandbox stashes in-memory. */
@AutoValue
public abstract static class StashContents {
This would be more concise as a record.
Done.
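For readers unfamiliar with the suggestion, the sketch below shows the shape of the change: a Java `record` generates the constructor, accessors, `equals`/`hashCode`, and `toString` that `@AutoValue` would otherwise produce. The field names here mirror those discussed in this review but are illustrative; the real class in Bazel's sandbox code may differ.

```java
import java.util.Map;

public class RecordSketch {
  // A record version of the @AutoValue class discussed above (illustrative
  // fields only): one accessor per component, no builder boilerplate.
  record StashContents(
      Map<String, String> filesToPath, Map<String, StashContents> dirEntries) {}

  public static void main(String[] args) {
    StashContents leaf =
        new StashContents(Map.of("a.txt", "/stash/a.txt"), Map.of());
    StashContents root = new StashContents(Map.of(), Map.of("dir", leaf));
    // Accessors are generated automatically from the record components.
    System.out.println(root.dirEntries().containsKey("dir"));
  }
}
```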
public SandboxStash(String workspaceName, Path sandboxBase) {
private static final int POOL_SIZE = Runtime.getRuntime().availableProcessors();
private final ExecutorService stashFileListingPool =
    Executors.newFixedThreadPool(
Have you tried/considered using virtual threads here? This looks mostly IO-bound.
I will wait on that until we check in proper common code for I/O operations using virtual threads. As far as I understand, this requires enough careful consideration to be outside the scope of this change, and the number of new Java threads I'm introducing here is not very significant compared to how many we already use anyway.
Let me copy part of what @coeuvre has written in the past (there isn't any public document) about this:
The vast majority of blocking operations in the JDK will unmount the virtual thread, freeing its carrier and the underlying OS thread to take on new work. However, some blocking operations in the JDK do not unmount the virtual thread, and thus block both its carrier and the underlying OS thread due to OS level limitations (e.g. many filesystem operations). The implementation of these blocking operations will compensate for the capture of the OS thread by temporarily expanding the parallelism of the scheduler. Consequently, the number of platform threads in the scheduler's thread pool may temporarily exceed the number of available cores.
For Bazel, we use VFS and do filesystem calls with JNI, bypassing file APIs from JDK. This means we need to implement similar compensation in VFS as JDK did for its file API for better performance.
Makes sense to do this in a follow-up change (although I think that the compensation mechanism has been implemented for VFS).
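As background for the suggestion being deferred here: the standard-library alternative to a fixed pool is `Executors.newVirtualThreadPerTaskExecutor()` (Java 21+), a minimal sketch of which is below. Note the caveat quoted above: JDK filesystem calls pin the carrier thread, and the scheduler compensates by temporarily growing its platform-thread pool, which is exactly the compensation Bazel's JNI-based VFS would need to replicate.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class VirtualThreadSketch {
  public static void main(String[] args) throws Exception {
    // One cheap virtual thread per task instead of a fixed pool sized to the
    // core count. ExecutorService is AutoCloseable as of Java 21, so
    // try-with-resources waits for submitted tasks to finish.
    try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
      Future<String> future = pool.submit(() -> "stash listing done");
      System.out.println(future.get());
    }
  }
}
```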
@AutoValue
public abstract static class StashContents {
  @SuppressWarnings("AutoValueImmutableFields")
  public abstract Map<String, Path> filesToPath();
Could you add a comment to explain what the keys are (individual path segments)?
Done.
// `sandboxExecRoot`. This will use what we computed above, delete anything unnecessary, and
// update `inputsToCreate`/`dirsToCreate` if something can be left without changes (e.g., a
// symlink that already points to the right destination). We're traversing from
// sandboxExecRoot's parent directory because external repositories can now be symlinked as
This part applies to the other branch as well.
Done.
    stashContents.get());
} else {
  // No in-memory stashes enabled but there is a stash.
  // When reusing an old sandbox, we do a full traversal of the parent directory of
Would we avoid code complexity if we just didn't create in-memory stashes for existing stashes?
I don't know exactly what you mean, so I will answer two different questions that might not be what you meant.

1. Can we create only the stash in the background thread and compare there, as opposed to creating a stash here in this function and then doing it again in the background thread with the actual content? That would be less efficient, since in this function we are not readdirring the directories, only adding the declared inputs and outputs to the stash. Then in the background thread we only change the dirs that have been modified.
2. Can we stop creating in-memory stashes for existing disk stashes that don't have a `stashContents` yet? We may have a disk stash without a corresponding `stashContents` if, during the lifetime of the server, we only enable the flag in a later command.

Please let me know if the question wasn't either of those.
The latter. I was wondering whether dropping all stashes when flipping the flag instead of scanning them would make the code simpler. But it also looks fine as is.
}

if (SandboxStash.useInMemoryStashes()) {
  Map<PathFragment, StashContents> stashContentsMap = new HashMap<>();
If peak memory is a concern, we could potentially look into `CompactHashMap`.
Done. I made the StashContents members use CompactHashMap too.
}

for (var outputDir :
    Iterables.concat(
If you use `Stream.concat` instead, you could also add `distinct()` to the chain. I don't yet know whether this would result in duplicate work, will keep reading.
Done.
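The difference the reviewer is pointing at: `Iterables.concat` simply chains the collections, yielding shared elements twice, while `Stream.concat` followed by `distinct()` deduplicates in one pass. A minimal sketch with made-up directory names:

```java
import java.util.List;
import java.util.stream.Stream;

public class ConcatDistinct {
  public static void main(String[] args) {
    // Hypothetical sample data: the same directory appears in both lists.
    List<String> treeDirs = List.of("bin/pkg", "bin/pkg/sub");
    List<String> outputDirs = List.of("bin/pkg", "genfiles/pkg");
    // Stream.concat chains the two streams; distinct() drops the duplicate
    // "bin/pkg" that Iterables.concat would have yielded twice.
    List<String> merged =
        Stream.concat(treeDirs.stream(), outputDirs.stream()).distinct().toList();
    System.out.println(merged);
  }
}
```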
stashContents
    .symlinksToPathFragment()
    .keySet()
    .removeAll(
Could this be replaced with `retainAll`?
Done.
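The key property both calls rely on: `Map.keySet()` is a live view, so mutating it mutates the map. `retainAll` keeps exactly the wanted keys in one call, instead of `removeAll` with the complement. A minimal sketch with hypothetical entries:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class RetainAllSketch {
  public static void main(String[] args) {
    Map<String, String> symlinks = new HashMap<>();
    symlinks.put("a", "target-a");
    symlinks.put("b", "target-b");
    symlinks.put("c", "target-c");
    // keySet() is backed by the map: retainAll drops every entry whose key
    // is not in the given set, so "b" is removed from the map itself.
    symlinks.keySet().retainAll(Set.of("a", "c"));
    System.out.println(symlinks.keySet().equals(Set.of("a", "c")));
  }
}
```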
    Comparator.comparingInt(countMap::get).reversed(), sortedStashes);
}

private int getMatchingSegmentsCount(String[] target, String[] oldTarget) {
I think that this could be replaced by `Arrays.mismatch`.
Done.
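For reference, `Arrays.mismatch(a, b)` (Java 9+) returns the index of the first differing element, or -1 if the arrays are equal, which is exactly the count of matching leading segments. A hypothetical reimplementation of the helper being replaced:

```java
import java.util.Arrays;

public class MismatchSketch {
  // Hypothetical stand-in for getMatchingSegmentsCount: the first mismatch
  // index equals the number of matching leading segments; -1 means the
  // arrays are identical, so every segment matches.
  static int matchingSegments(String[] target, String[] oldTarget) {
    int i = Arrays.mismatch(target, oldTarget);
    return i == -1 ? target.length : i;
  }

  public static void main(String[] args) {
    System.out.println(
        matchingSegments(
            new String[] {"foo", "bar", "baz"},
            new String[] {"foo", "bar", "qux"}));
  }
}
```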
for (Path stash : stashes) {
  Label stashTarget = sandboxToTarget.getOrDefault(stash, /* defaultValue= */ null);
  if (stashTarget == null) {
    sortedStashes.remove(stash);
If these stashes are never reused, could we just drop them right away?
With the current logic they might be reused. If the current target is also null, we just return the unsorted list which will contain other stashes without an associated target. If the first one available happens to have no associated target, it gets reused.
But I changed the logic now a bit so that if the current action has a null target it will try to use any others with a null target first.
if (target == null) {
  return stashes;
}
List<Path> sortedStashes = new ArrayList<>(stashes);
I don't know whether this sorting is a bottleneck, but if profiling shows it is, we could keep the stashes in a `TreeMap` keyed by `PackageIdentifier` and walk up and down from the insertion position of the new entry to discover the closest stashes. Since `PackageIdentifier` already implements `Comparable`, maybe that's even easier than the current logic?
I don't think this is important. There will never be more than 2*n_cores stashes, if there are then there is a different problem somewhere. At the same time, I don't think the resulting code would be significantly simpler.
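For readers curious about the declined suggestion: `TreeMap` exposes `floorEntry`/`ceilingEntry`, which give the nearest keys on either side of a probe key without sorting the whole collection. A minimal sketch using plain `String` package paths in place of `PackageIdentifier` (which is also `Comparable`):

```java
import java.util.TreeMap;

public class NearestStashSketch {
  public static void main(String[] args) {
    // Hypothetical stashes keyed by the package of the target they were
    // created for (Strings stand in for Bazel's PackageIdentifier).
    TreeMap<String, String> stashesByPackage = new TreeMap<>();
    stashesByPackage.put("foo/bar", "stash-1");
    stashesByPackage.put("foo/qux", "stash-2");
    stashesByPackage.put("zoo", "stash-3");
    String pkg = "foo/baz"; // package of the new action
    // floorEntry/ceilingEntry return the closest keys below and above pkg;
    // picking between them approximates "closest stash" without a full sort.
    System.out.println(stashesByPackage.floorEntry(pkg).getValue());
    System.out.println(stashesByPackage.ceilingEntry(pkg).getValue());
  }
}
```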
String segment = stashedRunfiles.get(i);
Preconditions.checkState(stashContents.dirEntries().containsKey(segment));
if (i < stashedRunfiles.size() - 1) {
  return getStashedRunfilesStashContents(
I didn't quite get what this function is doing. Is it calling itself recursively in what's essentially a loop? If so, why can't this just be a loop?
Done.
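The general transformation the reviewer asked for: a tail-recursive descent over path segments becomes a loop that advances one map level per segment. A simplified sketch with a hypothetical `Node` standing in for the nested `StashContents` structure:

```java
import java.util.List;
import java.util.Map;

public class LoopSketch {
  // Simplified stand-in for the nested stash structure.
  record Node(Map<String, Node> dirEntries) {}

  // The recursive call per segment is replaced by updating `current` in a
  // plain loop; behavior is identical for this tail-recursive shape.
  static Node descend(Node root, List<String> segments) {
    Node current = root;
    for (String segment : segments) {
      current = current.dirEntries().get(segment);
      if (current == null) {
        throw new IllegalStateException("missing segment: " + segment);
      }
    }
    return current;
  }

  public static void main(String[] args) {
    Node leaf = new Node(Map.of());
    Node mid = new Node(Map.of("runfiles", leaf));
    Node root = new Node(Map.of("pkg", mid));
    System.out.println(descend(root, List.of("pkg", "runfiles")) == leaf);
  }
}
```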
Thank you very much for the great review!
This change introduces the flag `--experimental_inmemory_sandbox_stashes` to track in memory the contents of the stashes stored with `--reuse_sandbox_directories=true`.

With the old behavior Bazel has to perform a lot of I/O to read the contents of each stash before reusing it in a new action. Essentially, it checks every directory and subdirectory in the stashed sandbox to find out which files are different from the inputs of the new action about to be executed.

With in-memory stashes we associate each stash with the symlinks that needed to be created for that action. We also store the time at which these symlinks were created. In a background thread after the action has finished executing, we stat every directory, and for the ones that have changed (this should be rare) we update the contents. Because we only read the contents of the directories that have changed, we do much less I/O than before.

If an action purposefully changes a sandbox symlink in-place, this won't be detected by statting the directory. I don't know of any use case for this, since the symlink itself is an implementation detail to achieve sandboxing. For this reason, manipulation of sandbox symlinks in-place is not supported.

Depending on the build, this change might have a significant effect on memory. It should generally improve wall time further in builds where `--reuse_sandbox_directories` already improved wall time.

This change also introduces a minor optimization, which is to associate each stash with the target that it was originally created for. When a new action wants to reuse a stash and there is more than one available, it will take the stash whose target is closest to its own. This is done on the assumption that targets that are closer together in the graph will have more inputs in common.

Fixes #22309. Closes #22559.
PiperOrigin-RevId: 640142481
Change-Id: Iece2d718df47f403e2fe91c1bd887604eceee8ee
Commit 1c0135c
Co-authored-by: Pedro Liberal Fernández <plf@google.com>
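The stat-based invalidation described above can be illustrated with a small, hypothetical sketch (`needsRescan` is not the real Bazel method): a directory is only re-listed when its modification time is newer than the time recorded when the stash's symlinks were created, which is also why in-place symlink edits go undetected.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class MtimeCheckSketch {
  // Hypothetical helper: re-list the directory only if it changed after the
  // recorded stash-creation time.
  static boolean needsRescan(Path dir, FileTime recorded) throws IOException {
    return Files.getLastModifiedTime(dir).compareTo(recorded) > 0;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("stash");
    FileTime recorded = Files.getLastModifiedTime(dir);
    System.out.println(needsRescan(dir, recorded)); // unchanged directory
    Files.createFile(dir.resolve("new-output"));
    // Bump the mtime explicitly so the test is deterministic regardless of
    // filesystem timestamp granularity.
    Files.setLastModifiedTime(dir, FileTime.fromMillis(recorded.toMillis() + 1000));
    System.out.println(needsRescan(dir, recorded)); // directory changed
  }
}
```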