
Sandbox slowness on OSX #8230

Open
fenghaolw opened this issue May 3, 2019 · 103 comments
Assignees: none
Labels: P3 (We're not considering working on this, but happy to review a PR), team-Local-Exec (Issues and PRs for the Execution (Local) team), type: bug

Comments

@fenghaolw

fenghaolw commented May 3, 2019

Description of the problem / feature request:

Building has been extremely slow with the default darwin-sandbox strategy.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

A minimal repro can be found at https://github.com/alexeagle/rules_sass_repro. It contains 40 empty Sass files; running the Sass compiler on them should be fast.

bazel build :all takes ~60s on my Mac

bazel build --strategy=SassCompiler=local :all takes ~4s

What operating system are you running Bazel on?

Mac OS 10.14.4

What's the output of bazel info release?

release 0.25.0

Have you found anything relevant by searching the web?

I found issues #902 and #1836, but both seem obsolete.

JSON profile

Following https://docs.bazel.build/versions/master/skylark/performance.html#json-profile, I grabbed profiles for the different strategies:

profiles.zip
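(For anyone reproducing this, invocations along these lines produce per-strategy profiles; the exact profiling flags vary by Bazel version, so treat this as a sketch.)

$ bazel build --profile=/tmp/sandboxed.profile :all
$ bazel build --profile=/tmp/local.profile --strategy=SassCompiler=local :all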

@jin added the z-team-Apple and untriaged labels May 3, 2019
@fenghaolw
Author

Note that on Linux I also see sandbox slowness, but it is much less pronounced than on Mac.

It takes ~15s with linux-sandbox and ~3s with --strategy=SassCompiler=local on my Lenovo P920.

@meisterT added the team-Local-Exec label and removed the z-team-Apple label May 8, 2019
@meisterT
Member

meisterT commented May 8, 2019

cc @jmmv

@alexeagle
Contributor

Discussed with @dslomov this morning; I think we should disable the sandbox on Mac by default. It has dubious benefits for keeping your build hermetic while causing lots of broken (non-hermetic) and slow builds.

@dslomov
Contributor

dslomov commented May 10, 2019

ping @jmmv @philwo for opinions.

@philwo
Member

philwo commented May 10, 2019

I think we should disable the sandbox on Mac by default

That's a bold strategy. Without sandboxing, all actions run in the same shared execroot, which contains all files from your workspace and all output files, so missing dependency declarations will no longer be detected. This can cause Bazel to assume that targets that should be rebuilt are still up to date, which can result in incorrect build results.

If you think that this doesn't matter for your particular use case or if you have other mitigations for this, then feel free to go ahead with the plan, but in general this is not recommended.

If your Sass build steps are that slow, it can only mean that they have a huge number of input files. Is that the case?

while causing lots of broken (non-hermetic)

Please give concrete examples of builds broken by sandboxing. When builds break under sandboxing, it's almost always because the rules or the BUILD files don't declare all of their dependencies.
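To make that concrete, here is a minimal illustration of the kind of breakage the sandbox catches (hypothetical workspace-root BUILD file; b.txt exists in the workspace but is deliberately not declared):

genrule(
    name = "concat",
    srcs = ["a.txt"],  # b.txt is read by cmd below but not declared as an input
    outs = ["out.txt"],
    cmd = "cat $(location a.txt) b.txt > $@",
)

Under darwin-sandbox this action fails loudly because b.txt is never staged into the sandbox; with --spawn_strategy=local it quietly succeeds by reading b.txt from the execroot, and later edits to b.txt will not trigger a rebuild of out.txt.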

@Globegitter

I have also noticed that sandboxing is generally slower on macOS than on Linux. Does anyone have an explanation for why that is? Is it inherent to macOS, or is the macOS sandbox simply less optimized?

Also, @fenghaolw, have you tested the same build with https://github.com/bazelbuild/sandboxfs? Apparently there is a chance it can make sandboxed builds on macOS quite a bit faster. I would be curious to see some comparisons with that.

@keith
Member

keith commented May 16, 2019

We tested sandboxfs with our iOS build without much luck; see bazelbuild/sandboxfs#76.

@jmmv
Contributor

jmmv commented May 16, 2019

There are two problems regarding sandboxing slowness:

  1. We have to create possibly thousands of symlinks per action, and that turns out to degrade badly on macOS. Microbenchmarks don't seem to expose this behavior easily though. sandboxfs is supposed to get rid of this problem but hasn't been successful for all builds yet (there is room for optimizations there that I haven't had a chance to tackle).

  2. There is the theory that sandbox-exec is slow. Same as above, microbenchmarks don't seem to confirm this, but I haven't looked into it in great detail. If it is inherently slow, then there isn't much we can do (especially considering that sandbox-exec is deprecated).

There is this other idea floating around of changing the way sandboxing works: instead of preventing operations, we just log them all and then verify, after the action completes, that whatever the action accessed was allowed. This would avoid the symlinks completely and also bypass sandbox-exec. I think this is how Windows sandboxing will work. But... I don't think there is a way to do this kind of tracing on macOS out of the box that doesn't require either root or disabling SIP. It's worth investigating though.
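(For what it's worth, macOS does ship a coarse tracing tool in this space, but it matches the constraint above in that it needs root. Roughly:)

$ sudo fs_usage -w -f filesys <pid>   # logs the process's file-system calls; requires root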

@Globegitter

Globegitter commented May 17, 2019

@keith Yeah, I saw that; very interesting. I have just done some quick tests on my six-year-old MacBook Pro, and for a tiny JS build (already with a fair number of inputs due to npm modules) my build time halved from roughly 6s to 3s. Without the sandbox it is 2s. That looks pretty good to me, but I do wonder how it varies with the type of build and the setup/configuration.

Edit: Another, slightly larger JS build of only one action went down from "Elapsed time: 50.031s, Critical Path: 46.10s" to "43.213s, Critical Path: 40.52s" (I did multiple rounds, each with similar results). That's about a 15% improvement, which also seems pretty good to me.

@jmmv added the P1 and type: bug labels and removed the untriaged label May 22, 2019
@jmmv
Contributor

jmmv commented May 22, 2019

Some tests I ran with this reproduction:

$ bazel build :all
INFO: Elapsed time: 36.556s, Critical Path: 7.71s
$ bazel build --spawn_strategy=local :all
INFO: Elapsed time: 5.569s, Critical Path: 1.23s
$ bazel build --experimental_use_sandboxfs :all
INFO: Elapsed time: 9.479s, Critical Path: 3.16s
$ bazel build --sandbox_debug :all
INFO: Elapsed time: 18.091s, Critical Path: 3.90s
$ bazel build --experimental_sandbox_async_tree_delete_idle_threads=auto :all
INFO: Elapsed time: 23.156s, Critical Path: 4.75s

And corresponding observations:

  1. Each action in this build has 11k files.
  2. sandboxfs does seem to help (as expected based on the previous findings).
  3. --sandbox_debug makes quite a bit of a difference. Deleting all the symlink trees is expensive, and this flag has the side-effect of not deleting them. But creating them is also quite expensive.
  4. The new --experimental_sandbox_async_tree_delete_idle_threads=auto helps approximate the behavior of --sandbox_debug and seems like a significant improvement over the current behavior. We should enable this new feature by default, but I remember seeing a crash recently that needs investigation...
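For reference, the mitigations tried above can be combined in a .bazelrc along these lines (both flags are experimental, and availability depends on the Bazel version):

build --experimental_use_sandboxfs
build --experimental_sandbox_async_tree_delete_idle_threads=auto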

@Globegitter

@jmmv I have also seen improvements with the async delete and I reported the crash here: #7527 (comment)

@jmmv
Contributor

jmmv commented May 30, 2019

Thanks. I have a fix for that crash when enabling asynchronous deletions but I'm still having trouble with one test... plus in benchmarking the change, I'm consistently getting worse results in large builds. Need to look into that too.

Separately, I've been looking into the old claim that sandbox-exec is inherently slow. I don't think it is. Some microbenchmarks showed no slowdowns a long time ago. I have now patched the "local" strategy to run everything through sandbox-exec using a configuration file that matches what the "sandboxed" strategy generates and got the following for a large iOS build:

local-head: mean 2074.50, median 2075.00, stddev 6.69
local-new: mean 2109.00, median 2110.50, stddev 5.15

where local-head is the behavior of the "local" strategy with head Bazel and local-new is the same strategy patched as described above. And yes, there seems to be a little bit of overhead, but the differences between the two are minimal.

Therefore I conclude that the major cost of sandboxing on macOS continues to be creating the symlink tree and handling files.
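For anyone who wants to measure the raw sandbox-exec overhead outside of Bazel, an experiment roughly like this works (minimal allow-everything profile, illustrative only):

$ printf '(version 1)\n(allow default)\n' > /tmp/allow-all.sb
$ time sandbox-exec -f /tmp/allow-all.sb /usr/bin/true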

@meteorcloudy
Member

@alexeagle Since we already have a P1 bug for the Mac sandbox performance issue, I think we can continue our previous discussion here.

The current problem is that the sandbox on Mac is very slow, especially for Angular projects, where there are a large number of input files. So we wonder what the way forward is.

@jmmv Is there any ongoing work on improving Mac sandbox performance? From the discussion above, sandboxfs might be a way forward; how big could the improvement be? @alexeagle Did you ever try it?

If it's hard to improve the performance, should we disable the sandbox on Mac? Disabling the sandbox makes builds less hermetic, but @alexeagle also mentioned that it was causing broken builds that may not be the fault of the build rules. I'm also wondering how this would affect remote execution: without the sandbox, some issues may only be spotted once remote execution is enabled.

If we should not disable the sandbox on Mac, we should provide a nicer way to disable it for a specific platform; that is, we want to apply flags like --spawn_strategy=standalone only on Mac. Currently you can group those flags under a config in a bazelrc file, but you still have to enable the config on the Bazel command line. To make things easier, we could have a platform-default config, like:

build:windows --spawn_strategy=standalone
build:linux  --spawn_strategy=sandboxed
build:macos  --spawn_strategy=standalone

Then Bazel loads different flags according to the platform it's running on and users never have to pass --config=macos explicitly.
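For reference, Bazel later gained --enable_platform_specific_config for exactly this (see #5055); with it, a bazelrc along these lines is applied automatically based on the host OS:

build --enable_platform_specific_config
build:macos --spawn_strategy=standalone
build:linux --spawn_strategy=sandboxed
build:windows --spawn_strategy=standalone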

@Globegitter

Regardless of the performance improvements to the sandbox, I have needed platform-specific flags before, and it would indeed be very nice if they could be defined centrally so that the user would not have to pass a --config flag.

@meteorcloudy
Member

I agree; #5055 tracks this feature request. It has been silent for a while, so I'll follow up on it.

@DavidGoldman
Contributor

For what it's worth, Swift users will have issues debugging with sandboxing enabled (see bazelbuild/tulsi#15).

alexeagle pushed a commit to alexeagle/rules_nodejs that referenced this issue Jul 24, 2020
This trades off performance for correctness, and might not be wise because missing dependencies give non-hermetic behavior. See bazelbuild/bazel#8230 (comment)

Also fixes the default npm publish tag, which cost me time in tonight's release
@acecilia

Could the slowness be due to notarization? https://sigpipe.macromates.com/2020/macos-catalina-slow-by-design/

It seems that since notarization was introduced, all scripts run for the first time fire a network request to Apple. Since Bazel uses the sandbox, potentially all scripts there are considered new by the system.

I could not verify this theory, though.

@larsrc-google
Contributor

It might be worth parallelizing some of these things, especially once Loom makes it cheaper to use threads. Having a small thread pool for creating symlinks ought to help. But fundamentally, OS X is just much slower than Linux at handling lots of creation and deletion of small files.

@meisterT
Member

@tony-scio is your 2x overhead measured with the flag --experimental_reuse_sandbox_directories?

@oquenchil
Contributor

These flags didn't help significantly:
build --spawn_strategy=processwrapper-sandbox
build --experimental_reuse_sandbox_directories
build --experimental_sandbox_async_tree_delete_idle_threads=1

@tony-scio in most cases --experimental_reuse_sandbox_directories should help significantly. Would it be possible for you to share a small example workspace where enabling sandboxing gives you the 2x overhead? I see your core and core_tests targets, but with just that it will be hard to reproduce.

@DavidZbarsky-at

DavidZbarsky-at commented Dec 3, 2023

@oquenchil I ran into this issue as well, posting in here to continue the discussion, but let me know if you'd prefer a separate issue.

We are using rules_js. We depend on quite a few npm packages, which the JS rules expose as TreeArtifacts (so I believe each package is staged as a single symlink). We have unit tests that take 1-2 seconds to run outside Bazel, but significantly longer under Bazel. Here's a stripped-down repo with a single sh_test depending on node modules: https://github.com/DavidZbarsky-at/nodejs-repro/blob/stuff/BUILD.bazel. There are around 3.8K entries in the runfiles manifest for this test action.

On my machine (an M1 Mac), the sh_test spends around 6-7 seconds in RepoMappingManifestAction (@fmeum fixed this already) and 6 seconds in sandbox.createFileSystem. Running the same test in a Docker image on the same machine shows only 1 second in sandbox.createFileSystem.

I have reuse_sandbox_directories enabled, but running the test several times in a row shows the same timing each time.

Even stranger: I tried the --noenable_runfiles flag as well, and confirmed that the runfiles were not populated, but I still see the same performance numbers. I would have expected it to be faster if we don't need to create the symlinks. Perhaps I'm confused about how this is supposed to work?

@oquenchil
Contributor

oquenchil commented Dec 12, 2023

Hi David,

Just to make sure the slowness is coming from sandboxing: if you run that same target with --spawn_strategy=local and with --spawn_strategy=sandboxed, do you see the massive difference?

Does your repro build with bazel at head? If not, what's the latest version that it works with?

Thanks for the repro, I will take a look in the following days!

@DavidZbarsky-at

DavidZbarsky-at commented Dec 12, 2023

Hi David,

Just to make sure the slowness is coming from sandboxing: if you run that same target with --spawn_strategy=local and with --spawn_strategy=sandboxed, do you see the massive difference?

Does your repro build with bazel at head? If not, what's the latest version that it works with?

Yes, I can repro with last_green as of this morning.

Thanks for the repro, I will take a look in the following days!

Hi @oquenchil!

I pushed up some more "minimal" examples that don't use the JS rules and manually create TreeArtifacts with a simple rule. Here's what I see when running CC=false bazel test //:med_tree_runfiles_artifacts_test //:med_tree_artifacts_test --cache_test_results=no --profile /tmp/profile.gz:

OSX with darwin-sandbox: [profile screenshot]

OSX with local (not sure what the unattributed time is; it looks like it may be a different OSX performance issue): [profile screenshot]

Ubuntu Docker image with linux-sandbox on the same machine (this is on 7.0.0-pre.20231011.2, not last_green, because there are no arm64 binaries): [profile screenshot]

For completeness, Ubuntu Docker image with local: [profile screenshot]

It's interesting that the Linux sandboxed run is even faster than the unsandboxed OSX run.

@larsrc-google
Contributor

The Linux file system is much more optimized for the kinds of operations that happen in builds.

@coeuvre
Member

coeuvre commented Dec 14, 2023

@DavidZbarsky-at Hi David, I see context.prefetchInputs spans in the first screenshot that take a couple of seconds. I am wondering whether you used any remote or disk cache in the examples (maybe accidentally via ~/.bazelrc)?

@DavidZbarsky-at

@coeuvre Yes, there was a remote cache (though this multi-second gap is maybe another performance issue?). I see similar timings without it on the sandboxed tests, though removing it appears to fix the unattributed time in the local strategy:

darwin-sandbox: [profile screenshot]

local: [profile screenshot]

@coeuvre
Member

coeuvre commented Dec 15, 2023

@DavidZbarsky-at Thanks for confirming. Yes, it's another performance issue. Created #20555 to track.

@oquenchil
Contributor

There is another performance issue: #20584

I will take care of this one. After it's checked in I will profile again and see what else we can do.

@oquenchil
Contributor

@DavidZbarsky-at there are several things we can do to improve performance. I wanted to ask you a couple of questions though:

  1. Would you mind if the overhead went away only on incremental reruns of the test, while the first run still pays the createFileSystem cost?
  2. You have thousands of files as inputs. Are they spread out over dozens of directories, or are they concentrated in a few? Some real-world statistics for your JS builds would be helpful, e.g. 10,000 files in total across 700 directories, no single directory with more than 1,000 files, etc. (A quick way to gather this is sketched below.)
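A sketch for gathering such per-directory stats (assuming a node_modules layout; adjust the path as needed):

$ find node_modules -type f | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head
# counts files per directory, largest first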

I will be able to take a look again on the 2nd of January but any additional information you can provide would be very helpful.

Thanks!

@DavidZbarsky-at

DavidZbarsky-at commented Dec 20, 2023

@oquenchil Awesome, that sandbox reuse one seems like it will be a big help for us. Thanks!

Re: your questions:

  1. That's better than the current situation but not ideal. In particular, I'd like to be able to run a set of 20-30 tests locally before going to CI, and that becomes tricky with this overhead (i.e. 1-2s per test * 30 / 4 cores is 15-20 seconds, but the sandboxing overhead here is going to be 3x+).
  2. Looks like a power law to me. For our repo, I counted the number of files in each NPM package in node_modules (raw data attached in case you are curious). Not every action has every NPM package as an input, but hopefully this gives you a sense of the kinds of things the NPM ecosystem does. Summary stats:
>>> sum(counts) # Total number of files
192197
>>> len(counts) # Number of directories
4197
>>> sum(counts)/len(counts)
45.7939004050512
>>> statistics.quantiles(counts, n=20)
[4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0, 7.0, 8.0, 9.0, 11.0, 12.0, 15.0, 18.0, 24.0, 34.0, 56.0, 137.0]
(quantiles p5 through p95)
>>> sorted(counts)[-50:] # Top 50
[650, 653, 653, 669, 671, 694, 698, 704, 704, 706, 710, 720, 723, 762, 765, 765, 767, 808, 812, 821, 824, 828, 868, 868, 918, 935, 942, 966, 1054, 1114, 1120, 1176, 1188, 1225, 1285, 1340, 1484, 1489, 2033, 2277, 2277, 2947, 3165, 3410, 3410, 3616, 4510, 5660, 5722, 10024]

npm_packages.txt

Happy to poke around and provide whatever else would be helpful, just let me know!

@larsrc-google
Contributor

Reducing unnecessary dependencies is always good. But here's a thing I've been wanting to try but haven't been able to: with Loom virtual threads, could we do more parallelism in creating/destroying the sandbox?

@meisterT
Member

We are unfortunately not there yet with using JDK 21 features (such as Loom), but @oquenchil looked at speeding up sandbox creation with a good plain old executor service yesterday (which is what triggered his questions above).

@DavidZbarsky-at

I came across this doc/idea. Not sure if it's feasible, but it seems like it would solve this case (folders with hundreds or thousands of files) quite nicely.

@larsrc-google
Contributor

I had been considering io_uring, but apparently it's not very secure yet. However, I recently learned that kqueue, the Mac (BSD) equivalent, is over a decade old and thus probably in a better state. Maybe that's why default IO on Mac is relatively slow: serious IO users use kqueue? Anyway, a good proof of concept would be implementing deleteTreesBelow using kqueue; if that is promising, we could add other APIs amenable to kqueue (and preferably also to Loom, and maybe eventually io_uring), such as createDirsInDir and createSymlinksInDir (strawman names).

@oquenchil
Contributor

oquenchil commented Jan 2, 2024

I came across this doc/idea, not sure if it's feasible but it seems like it would solve this case (folders with hundreds + or thousands of files) quite nicely

The idea as I originally described it in that document doesn't quite work: when you symlink directories, the path foo/bar/.. doesn't take you to foo but to the parent of the directory pointed to by foo/bar. After considering several alternatives, bind mounting seems to have potential for applying the idea as described in that document, and we are still looking into it, but that's only available on Linux.
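A quick shell demonstration of that .. behavior (throwaway paths):

$ mkdir -p /tmp/demo/real/inner /tmp/demo/sandbox
$ touch /tmp/demo/real/outside.txt
$ ln -s /tmp/demo/real/inner /tmp/demo/sandbox/link
$ ls /tmp/demo/sandbox/link/..
inner  outside.txt
# .. resolved to /tmp/demo/real (the target's parent), not /tmp/demo/sandbox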

That leaves macOS, where we are at the point of simply optimizing the existing strategies instead of coming up with an entirely new one. As meisterT pointed out, my adding multi-threading is what prompted those questions.

What I found was that with multi-threading, the bottleneck is the directory lock held by the kernel when writing a file. If most files are concentrated in a small handful of directories, then multi-threading won't help much, but looking at the raw data you provided it seems that's not the case, so it should cut down the time a bit.

Because of where I think the bottleneck is, I don't think it will matter whether we use Loom.

For cleaning up an existing stash, I tried moving a tree out of the way and then deleting it asynchronously. This also gives a significant speed bump. I will check in those two optimizations, and then we can see where we stand. Without symlinking directories or bind mounting we might hit a limit on how much more we can do on macOS for cold builds, but we aren't there yet.
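In shell terms, the move-aside trick is essentially (illustrative):

$ mv sandbox/42 sandbox/42.trash    # rename within a file system is cheap and atomic
$ rm -rf sandbox/42.trash &         # the expensive deletion happens off the critical path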

@rickeylev
Contributor

Python builds also suffer from this performance hit because so many files need to be materialized. The basic issue is that every Python program needs a runtime, and a Python runtime installation is a lot of files (2,000 to 4,000). This IO cost adds up. I think it was TensorFlow that contacted me asking for advice because their CI builds went from 45 minutes using the system Python to 2+ hours using the downloaded runtimes, and all evidence pointed to the extra IO work of creating thousands of symlinks for every test. There are similar reports to rules_python from other users, too, so it's not just them.

While we think we can partially avoid this by zipping up most of a runtime installation (we can't zip up all of it), that feels like a lateral move overall. There is still an O(2000-4000) cost to create the zip file, plus the cost of copying a decent-sized file around. That's still a lot more than creating a single symlink to a directory. Zipping things correctly can also be somewhat subtle (e.g., making sure you avoid nondeterminism bugs).

Relatedly, Bazel CI currently avoids this issue for Python because it uses the system runtime. However, my 2nd or 3rd todo item right now is to upgrade Bazel CI's rules_python version, which will make it use the downloaded runtimes; i.e., every Python build action or test it runs will incur a separate 2,000+ file copy.

Relatedly, PyPI Python dependencies also suffer from this problem, as they work mostly the same way as the runtime. A repo rule downloads a zip file and extracts it. The program needs those files at runtime, so they all get copied over into runfiles. But all it really needs is a symlink'd directory pointing to the files. The footprint of third-party dependencies is easily the size of the runtime itself or orders of magnitude larger. As an example, I have a program with 5 direct dependencies, and it turns into about 15,000 runtime files (technically ~7k, but they get duplicated because of the legacy-externals-runfiles behavior). In comparison, if directories were symlinked, it would be about 32 symlinks. Unfortunately, zipping those up isn't as easy, straightforward, or feasible.

@larsrc-google
Contributor

What I found was that with multi-threading, the bottleneck is the directory lock held by the kernel when writing a file. If most files are concentrated in a small handful of directories, then multi-threading won't help much, but looking at the raw data you provided it seems that's not the case, so it should cut down the time a bit.

That actually suggests an API for building an entire tree of symlinks. Then for macOS, the implementation could spread the writes across multiple directories to get around the directory-lock bottleneck.

@oquenchil
Contributor

That actually suggests an API for building an entire tree of symlinks. Then for macOS, the implementation could spread the writes across multiple directories to get around the directory-lock bottleneck.

I talked about symlinking directories a while ago here. I sent that to bazel-discuss.

The reason symlinking directories doesn't work is that as soon as you have a path like symlink_dir1/symlink_dir2/../symlink_dir3/, the .. won't behave as you expect: it goes to the parent directory of the target of the link.

There are alternatives using bind mounts on Linux, which we are currently looking at, but that won't solve the problem for macOS, which is what this issue is about.

Regarding the directory lock bottleneck, that wouldn't solve it, since each directory would still contain as many files as before. In any case, in practice it doesn't look like it's much of a bottleneck, at least for JS, since files are not grouped in a single directory. Not sure about the Python case.

But all it really needs is a symlink'd directory pointing to the files.

For the reason I mention above, we wouldn't want to indiscriminately use the strategy of symlinking directories. However, if the rule author could guarantee that there would never be a .. after the symlinked directory, then this could be special-cased. The sandboxing code in Bazel couldn't take these shortcuts and make these decisions on its own, but we could add a mechanism for the cases where the rule author knows what they are doing. If they aren't careful, they might expose themselves to the maintenance burden of user builds breaking in unclear ways when something does .. from within an input file passed to the action. I saw this type of breakage in at least two different language rulesets when I tried the symlinking-directories idea.

Richard, please tell me how you envision such an API as passed from the rule or toolchain implementation and I can write a prototype.
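To seed that discussion, one possible shape for the opt-in, purely as a hypothetical strawman (the sandbox_symlink_directories attribute does not exist in Bazel today):

py_runtime(
    name = "rt",
    files = [":runtime_files"],
    # Hypothetical attribute: the rule author promises that no action will
    # escape these inputs via a trailing `..`, so the sandbox may stage them
    # as a single directory symlink instead of thousands of per-file symlinks.
    sandbox_symlink_directories = True,
)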

@rbtcollins

rbtcollins commented Jan 24, 2024

This hasn't been linked yet, and is probably worth a read for folks designing in this IO-intensive space: https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/

Also, it might be worth testing concurrent symlink creation from separate processes, based on some of the notes in that post.
