
Bazel remote cache is not a clear win #7664

Open
Tracked by #19904
njlr opened this issue Mar 7, 2019 · 26 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@njlr

njlr commented Mar 7, 2019

Description of the problem / feature request:

Applying a Bazel HTTP remote cache can make the build slower, depending on the project and artefacts being built.

I tried a few projects:

  • Cartographer ~2.8x faster with cache
  • Abseil ~1.5x faster with cache
  • cppitertools ~2x slower with cache
  • OpenTracing ~2x slower with cache

For the cache server, I tried both bazel-remote and my own Node.js server that I cobbled together. Both yielded similar results.

The cache was hosted on a reasonable Digital Ocean box in the same city:

$ less /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 2294.608
...

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3944         119         153           0        3672        3544
Swap:             0           0           0

The client's internet connection was around 10 Mbps with ~20 ms latency. Not the fastest, but Bazel should be able to adapt to this.

The Bazel client was running on a fairly high-end laptop:

$ less /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
stepping        : 10
microcode       : 0x9a
cpu MHz         : 700.060
cache size      : 8192 KB
...

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15806        4246        7541         755        4018       10582
Swap:         32767           0       32767

Suggestion

Perhaps the HTTP cache should record the time it took to build an artefact (according to the client). This would give Bazel enough information to decide if it is better to build or fetch.

Relevant variables:

  • Connection speed
  • Size of artefact
  • Estimated time to build artefact
  • Current build workload

Currently the server receives minimal metadata from Bazel.
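
To make the idea concrete, here is a rough sketch of the kind of heuristic this would enable, assuming the cache also returned the artefact size and the build time recorded by the uploading client (these inputs and names are hypothetical; no such API exists in Bazel today, and the sketch ignores current build workload):

final class FetchOrBuildHeuristic {
  private final double bytesPerSecond;    // measured connection throughput
  private final double roundTripSeconds;  // measured connection latency

  FetchOrBuildHeuristic(double bytesPerSecond, double roundTripSeconds) {
    this.bytesPerSecond = bytesPerSecond;
    this.roundTripSeconds = roundTripSeconds;
  }

  // Fetch only when the estimated download time beats the build time that the
  // uploading client recorded for this artefact; otherwise rebuild locally.
  boolean shouldFetch(long artefactBytes, double recordedBuildSeconds) {
    double estimatedFetchSeconds = roundTripSeconds + artefactBytes / bytesPerSecond;
    return estimatedFetchSeconds < recordedBuildSeconds;
  }
}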

What operating system are you running Bazel on?

Ubuntu 18.10

What's the output of bazel info release?

release 0.23.1

Have you found anything relevant by searching the web?

Related discussion: #6091

@dslomov dslomov added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged type: feature request labels Mar 8, 2019
@Globegitter

Also, 76370d5 has been implemented, which could help with such cases.

@artem-zinnatullin
Contributor

Ideally, that would use the same (or similar) logic as the dynamic scheduling used in remote execution: https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html

@Globegitter

Globegitter commented Mar 11, 2019

What is nice about the combined disk/HTTP cache is that it only uses CPU/memory for compiling when actually necessary. But yes, I was also thinking it would be great to get the dynamic scheduler functionality for the remote caching use case. Is that feasible, @jin @jmmv?

@buchgr
Contributor

buchgr commented Mar 11, 2019

@Globegitter absolutely. In Bazel we'll probably limit dynamic scheduling to remote caching only (and disallow remote execution for safety reasons).

@njlr thanks a lot for doing these benchmarks. We'll probably be landing #6862 in Bazel master this week, and I'd love to run these benchmarks again with that change in.

@jmmv jmmv added this to the Dynamic execution milestone Mar 14, 2019
@buchgr buchgr self-assigned this Mar 28, 2019
@buchgr buchgr added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Mar 28, 2019
@buchgr
Contributor

buchgr commented Mar 28, 2019

Will pick this up soon!

@njlr
Author

njlr commented Mar 29, 2019

Fantastic! I would be keen to re-run some benchmarks when you are ready 👍

@buchgr buchgr removed their assignment Jan 9, 2020
@nkoroste
Contributor

nkoroste commented Apr 9, 2020

Any updates on this?

@jongerrish
Contributor

@buchgr @jin - is anyone planning to work on this soon? Otherwise I'm happy to contribute if you give me pointers and a basic outline. Do you want a design doc, etc.?

@jin
Member

jin commented Apr 16, 2020

@philwo would know, but he's unavailable currently. Escalating to @jhfield / @dslomov.

@ulfjack
Contributor

ulfjack commented Apr 16, 2020

I'm not sure that we have enough information here to decide on a course of action. What's the reason for the cached case to be slower? Is that something that can be fixed?

If it's "just" the network round-trip time, then it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.

However, I'm not sure how to handle cache writes. It is technically possible to make them async, but that'll make it difficult to report errors.
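
A minimal sketch of that race, using plain java.util.concurrent rather than Bazel's actual dynamic-strategy infrastructure (the suppliers and the generic result type are stand-ins, not real Bazel classes):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class RacingCacheLookup {
  private final ExecutorService pool = Executors.newCachedThreadPool();

  // Runs the remote-cache lookup and the local build concurrently and keeps
  // whichever finishes first; cancellation of the loser is best-effort here.
  // A real implementation would also have to treat a cache miss or lookup
  // error as "local wins" rather than failing the whole race.
  @SuppressWarnings("unchecked")
  <T> T raceLookupAgainstLocalBuild(Supplier<T> remoteLookup, Supplier<T> localBuild) {
    CompletableFuture<T> remote = CompletableFuture.supplyAsync(remoteLookup, pool);
    CompletableFuture<T> local = CompletableFuture.supplyAsync(localBuild, pool);
    T winner = (T) CompletableFuture.anyOf(remote, local).join();
    remote.cancel(true);
    local.cancel(true);
    return winner;
  }
}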

@edbaunton
Contributor

@jmmv has some nice blog posts on this topic.

@kastiglione
Contributor

it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.

Big +1, we would love to see this. I asked @jmmv about this very thing on Twitter and he replied:

No plans on that unfortunately, at least from my team at this point... I think this would take a very different implementation than the current dynamic scheduler though.

@brentleyjones
Contributor

brentleyjones commented Apr 16, 2020

Same here. Huge +1 if we could get dynamic strategy for cache lookup/download. Right now we have to have developers flip on and off their remote cache based on their download speed. For some it changes based on the time of day because of shared internet resources.

@nkoroste
Contributor

The motivation behind @jongerrish's request stems from the fact that we download ~2 GB of data from the cache for our build, which depends heavily on your download speed. We also have a per-action breakdown comparing builds with and without the cache, and we can see that on machines with lower network speed it's clearly faster to build locally than to download from the cache. Another interesting data point is that around 3000 actions have outputs smaller than 1 KB, so when taking latency into account it's probably not even worth checking whether they are present in the cache.

@jongerrish
Contributor

jongerrish commented Apr 17, 2020

Looking for some implementation guidance for this feature... would it be reasonable to have a new mode where we register a RemoteSpawnStrategy() that takes a new class RemoteCacheSpawnRunner that is more or less just an adapter to a RemoteCache? @ulfjack @philwo @buchgr @jin - similar to how the existing remote execution strategy is built here: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteActionContextProvider.java#L110
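
For illustration, a schematic of that adapter idea with simplified stand-in interfaces; these are deliberately not the real classes under com.google.devtools.build.lib.remote, whose signatures differ, and the cache-write handling is exactly the part a real design would need to pin down:

// Simplified, hypothetical interfaces; not the real Bazel SpawnRunner/RemoteCache API.
interface Spawn {}
interface ActionResult {}

interface SimpleRemoteCache {
  ActionResult lookup(Spawn spawn);               // returns null on a cache miss
  void upload(Spawn spawn, ActionResult result);
}

interface SimpleSpawnRunner {
  ActionResult exec(Spawn spawn);
}

// Cache-only runner: check the remote cache, fall back to a local runner on a
// miss, then upload the locally built result so the cache stays warm.
final class RemoteCacheSpawnRunner implements SimpleSpawnRunner {
  private final SimpleRemoteCache cache;
  private final SimpleSpawnRunner localFallback;

  RemoteCacheSpawnRunner(SimpleRemoteCache cache, SimpleSpawnRunner localFallback) {
    this.cache = cache;
    this.localFallback = localFallback;
  }

  @Override
  public ActionResult exec(Spawn spawn) {
    ActionResult cached = cache.lookup(spawn);
    if (cached != null) {
      return cached;
    }
    ActionResult built = localFallback.exec(spawn);
    cache.upload(spawn, built);
    return built;
  }
}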

@ulfjack
Contributor

ulfjack commented Apr 20, 2020

I was thinking about that, but I'm not sure how to get the results written back to the cache. Right now, the interface requires both lookup and write to be done in the same, err, context.

As much as I like the technical challenge here, can we first confirm that it's due to the lookup overhead? Did you try to increase the number of jobs to see if that helps to hide the latency?

@nkoroste
Contributor

nkoroste commented Apr 20, 2020

CPU, RAM, and network are all maxed out during a Bazel run. On my machine it sometimes even dips into swap, so increasing the number of jobs causes an OOM. Alternatively, we can experiment with increasing --remote_max_connections, as the default is 100. However, I doubt that will change anything, since I can see my network download speed is already at its maximum - actually, it would probably help with checking cache hits for the many smaller actions.

Regarding cache writes: first of all, we don't seed the remote cache from local builds. Instead we have a stable machine on CI that does that. So for v1 it's probably acceptable not to support uploads to the cache. Unless you're talking about local workspace writes? I didn't check the code, but in general I'd assume only the "winning" SpawnRunner should be responsible for writing to the cache at the very end.

@ulfjack
Contributor

ulfjack commented Apr 21, 2020

@nkoroste I'm afraid in that case this won't be much of a win. The proposal here is to trade CPU for latency while using additional threads - that's only going to be an improvement if you have extra CPU, and if you're almost running OOM, your overall build latency might be dominated by gc rather than network round-trip latency.

AFAICT, --remote_max_connections won't do anything without also changing --jobs. The latter is the primary mechanism to control the number of threads, and they perform blocking calls to the cache.

@nkoroste
Contributor

Sorry for the delay on this. To add more context and visibility from some of the offline conversations:

I'm not suggesting that increasing the number of jobs and max connections will improve anything. In fact, we benchmarked various combinations of those two flags, and performance was generally worse when we increased them.

All I'm saying is that Bazel produces GBs of data, especially for Android builds, that have to be downloaded from the remote cache. This is obviously directly correlated with your overall network speed and latency. During a build with a high cache hit rate (85%+), for a big app, the majority of the time is spent downloading bytes from the cache while most of the machine's CPU/RAM sits idle.

With a dynamic spawn strategy we could utilize some of the machine's resources and reduce the number of bytes downloaded from the cache, hopefully improving the overall build time for developers with a bad network connection.

In the meantime, we are trying to improve the Android rules themselves to produce less unnecessary data, which will help with this as well. See #11253 for example.

@ashi009
Contributor

ashi009 commented Jun 22, 2021

Our profiling shows that remote cache latency plays a huge part in build performance. It seems that the cache check uses the same thread as the action runner; hence, a slow remote cache slows down all build actions.

A simple approach to this issue is to introduce a dedicated thread pool for all remote cache interaction.
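
A minimal illustration of that decoupling with plain java.util.concurrent (the class name and pool size are examples only, not Bazel's actual internals):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class RemoteCacheIoPool {
  // Sized independently of --jobs, so slow cache round-trips occupy these
  // threads instead of the threads that execute actions.
  private final ExecutorService cacheIo = Executors.newFixedThreadPool(64);

  <T> CompletableFuture<T> lookupAsync(Supplier<T> cacheLookup) {
    return CompletableFuture.supplyAsync(cacheLookup, cacheIo);
  }
}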

@philwo
Member

philwo commented Jun 22, 2021

FYI @coeuvre

@coeuvre
Member

coeuvre commented Jun 22, 2021

@ashi009 We have a thread pool for gRPC calls, but we block waiting on the result inside the spawn runner. One thing we can certainly improve is to make the remote spawn runner non-blocking, but I doubt that will improve the overall performance - it depends. If your build doesn't have actions waiting for an available action execution thread (defined by --jobs), changing to non-blocking won't help, since Bazel still has to wait for the running actions to finish before starting other actions that depend on their results.

That said, can you share your profiling setup and more profiling data?

@ashi009
Contributor

ashi009 commented Jul 3, 2021

@coeuvre

Totally missed the notification. Sure thing, but I need to do it privately.

Our build target is a huge iOS app with over 10k source files. The majority of actions compile ObjC files, which depend on no preceding actions. The critical path converges at the linking action.

We have increased --jobs=HOST_CPUS*N to address the latency issue in our setup. However, this breaks the resource management model of Bazel in an ugly way, so we have to trick it by raising the --local_{cpu,ram}_resources accordingly.
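
For reference, that workaround looks roughly like this in a .bazelrc; the multipliers are example values only and have to be tuned per machine and network:

# Over-subscribe the action threads to hide remote cache latency (example values only).
build --jobs=HOST_CPUS*2
# Raise the declared local resources so the extra jobs are actually scheduled.
build --local_cpu_resources=HOST_CPUS*2
build --local_ram_resources=HOST_RAM*.8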

@coeuvre
Member

coeuvre commented Jul 5, 2021

Thanks for sharing your build shape. In this case, I think the performance will be improved once the features described by #13632 and #13632 (comment) are implemented.

@github-actions

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label May 10, 2023
@crt-31

crt-31 commented May 11, 2023

@bazelbuild/triage, I think this is still relevant... can we keep it open?

@github-actions github-actions bot removed the stale Issues or PRs that are stale (no activity for 30 days) label May 12, 2023
@tjgq tjgq mentioned this issue Oct 20, 2023