
Bazel remote cache is not a clear win #7664

Open
Tracked by #19904
njlr opened this issue Mar 7, 2019 · 26 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@njlr

njlr commented Mar 7, 2019

Description of the problem / feature request:

Applying a Bazel HTTP remote cache can make the build slower, depending on the project and artefacts being built.

I tried a few projects:

  • Cartographer ~2.8x faster with cache
  • Abseil ~1.5x faster with cache
  • cppitertools ~2x slower with cache
  • OpenTracing ~2x slower with cache

For the cache server, I tried both bazel-remote and my own Node.js server that I cobbled together. Both yielded similar results.

The cache was hosted on a reasonable Digital Ocean box in the same city:

$ less /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 2294.608
...

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3944         119         153           0        3672        3544
Swap:             0           0           0

The client's internet connection was around 10 Mbps with ~20 ms latency. Not the fastest, but Bazel should be able to adapt to this.

The Bazel client was running on a fairly high-end laptop:

$ less /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
stepping        : 10
microcode       : 0x9a
cpu MHz         : 700.060
cache size      : 8192 KB
...

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15806        4246        7541         755        4018       10582
Swap:         32767           0       32767

Suggestion

Perhaps the HTTP cache should record the time it took to build an artefact (according to the client). This would give Bazel enough information to decide if it is better to build or fetch.

Relevant variables:

  • Connection speed
  • Size of artefact
  • Estimated time to build artefact
  • Current build workload

Currently the server receives minimal metadata from Bazel.
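
To make the idea concrete, here is a rough sketch of the kind of heuristic this would enable, assuming the cache also returned the artefact size and the build time recorded by the uploading client (these inputs and names are hypothetical; no such API exists in Bazel today, and the sketch ignores current build workload):

final class FetchOrBuildHeuristic {
  private final double bytesPerSecond;    // measured connection throughput
  private final double roundTripSeconds;  // measured connection latency

  FetchOrBuildHeuristic(double bytesPerSecond, double roundTripSeconds) {
    this.bytesPerSecond = bytesPerSecond;
    this.roundTripSeconds = roundTripSeconds;
  }

  // Fetch only when the estimated download time beats the build time that the
  // uploading client recorded for this artefact; otherwise rebuild locally.
  boolean shouldFetch(long artefactBytes, double recordedBuildSeconds) {
    double estimatedFetchSeconds = roundTripSeconds + artefactBytes / bytesPerSecond;
    return estimatedFetchSeconds < recordedBuildSeconds;
  }
}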

What operating system are you running Bazel on?

Ubuntu 18.10

What's the output of bazel info release?

release 0.23.1

Have you found anything relevant by searching the web?

Related discussion: #6091

@dslomov dslomov added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged type: feature request labels Mar 8, 2019
@Globegitter

Also, 76370d5 has been implemented, which could help with such cases.

@artem-zinnatullin
Contributor

Ideally, that would use the same (or similar) logic as the dynamic scheduling used in remote execution: https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html

@Globegitter

Globegitter commented Mar 11, 2019

What is nice about the combined disk/HTTP cache is that it only uses CPU/memory for compiling when actually necessary. But yes, I was also thinking it would be great to get the dynamic scheduler functionality for the remote caching use case. Is that feasible, @jin @jmmv?

@buchgr
Contributor

buchgr commented Mar 11, 2019

@Globegitter absolutely. In Bazel we'll probably limit dynamic scheduling to remote caching only (and disallow remote execution for safety reasons).

@njlr thanks a lot for doing these benchmarks. We'll probably be landing #6862 in Bazel master this week, and I'd love to run these benchmarks again with that change in.

@jmmv jmmv added this to the Dynamic execution milestone Mar 14, 2019
@buchgr buchgr self-assigned this Mar 28, 2019
@buchgr buchgr added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Mar 28, 2019
@buchgr
Contributor

buchgr commented Mar 28, 2019

Will pick this up soon!

@njlr
Author

njlr commented Mar 29, 2019

Fantastic! I would be keen to re-run some benchmarks when you are ready 👍

@buchgr buchgr removed their assignment Jan 9, 2020
@nkoroste
Contributor

nkoroste commented Apr 9, 2020

Any updates on this?

@jongerrish
Contributor

@buchgr @jin - is anyone planning to work on this soon? Otherwise I'm happy to contribute if you give me pointers and a basic outline. Do you want a design doc, etc.?

@jin
Member

jin commented Apr 16, 2020

@philwo would know, but he's unavailable currently. Escalating to @jhfield / @dslomov.

@ulfjack
Contributor

ulfjack commented Apr 16, 2020

I'm not sure that we have enough information here to decide on a course of action. What's the reason for the cached case to be slower? Is that something that can be fixed?

If it's "just" the network round-trip time, then it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.

However, I'm not sure how to handle cache writes. It is technically possible to make them async, but that'll make it difficult to report errors.
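
A minimal sketch of that race, using plain java.util.concurrent rather than Bazel's actual dynamic-strategy infrastructure (the suppliers and the generic result type are stand-ins, not real Bazel classes):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class RacingCacheLookup {
  private final ExecutorService pool = Executors.newCachedThreadPool();

  // Runs the remote-cache lookup and the local build concurrently and keeps
  // whichever finishes first; cancellation of the loser is best-effort here.
  // A real implementation would also have to treat a cache miss or lookup
  // error as "local wins" rather than failing the whole race.
  @SuppressWarnings("unchecked")
  <T> T raceLookupAgainstLocalBuild(Supplier<T> remoteLookup, Supplier<T> localBuild) {
    CompletableFuture<T> remote = CompletableFuture.supplyAsync(remoteLookup, pool);
    CompletableFuture<T> local = CompletableFuture.supplyAsync(localBuild, pool);
    T winner = (T) CompletableFuture.anyOf(remote, local).join();
    remote.cancel(true);
    local.cancel(true);
    return winner;
  }
}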

@edbaunton
Contributor

@jmmv has some nice blog posts on this topic.

@kastiglione
Contributor

it should be possible to make the lookup async and cancel the action if the lookup is faster (or cancel the lookup if the action is faster). This is similar to how the dynamic strategy works, and would hopefully reuse as much of the infrastructure as possible.

Big +1, we would love to see this. I asked @jmmv about this very thing on Twitter and he replied:

No plans on that unfortunately, at least from my team at this point... I think this would take a very different implementation than the current dynamic scheduler though.

@brentleyjones
Contributor

brentleyjones commented Apr 16, 2020

Same here. Huge +1 if we could get dynamic strategy for cache lookup/download. Right now we have to have developers flip on and off their remote cache based on their download speed. For some it changes based on the time of day because of shared internet resources.

@nkoroste
Contributor

The motivation behind @jongerrish's request stems from the fact that we download ~2 GB of data from the cache for our build, which depends heavily on your download speed. We also have a per-action breakdown comparing builds with and without the cache, and we can see that on machines with lower network speed it's clearly faster to build locally than to download from the cache. Another interesting data point is that around 3000 actions have outputs smaller than 1 KB, so when taking latency into account it's probably not even worth checking whether they are present in the cache.

@jongerrish
Contributor

jongerrish commented Apr 17, 2020

Looking for some implementation guidance for this feature... would it be reasonable to have a new mode where we register a RemoteSpawnStrategy() that takes a new class RemoteCacheSpawnRunner that is more or less just an adapter to a RemoteCache? @ulfjack @philwo @buchgr @jin - similar to how the existing remote execution strategy is built here: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/RemoteActionContextProvider.java#L110
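
For illustration, a schematic of that adapter idea with simplified stand-in interfaces; these are deliberately not the real classes under com.google.devtools.build.lib.remote, whose signatures differ, and the cache-write handling is exactly the part a real design would need to pin down:

// Simplified, hypothetical interfaces; not the real Bazel SpawnRunner/RemoteCache API.
interface Spawn {}
interface ActionResult {}

interface SimpleRemoteCache {
  ActionResult lookup(Spawn spawn);               // returns null on a cache miss
  void upload(Spawn spawn, ActionResult result);
}

interface SimpleSpawnRunner {
  ActionResult exec(Spawn spawn);
}

// Cache-only runner: check the remote cache, fall back to a local runner on a
// miss, then upload the locally built result so the cache stays warm.
final class RemoteCacheSpawnRunner implements SimpleSpawnRunner {
  private final SimpleRemoteCache cache;
  private final SimpleSpawnRunner localFallback;

  RemoteCacheSpawnRunner(SimpleRemoteCache cache, SimpleSpawnRunner localFallback) {
    this.cache = cache;
    this.localFallback = localFallback;
  }

  @Override
  public ActionResult exec(Spawn spawn) {
    ActionResult cached = cache.lookup(spawn);
    if (cached != null) {
      return cached;
    }
    ActionResult built = localFallback.exec(spawn);
    cache.upload(spawn, built);
    return built;
  }
}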

@ulfjack
Contributor

ulfjack commented Apr 20, 2020

I was thinking about that, but I'm not sure how to get the results written back to the cache. Right now, the interface requires both lookup and write to be done in the same, err, context.

As much as I like the technical challenge here, can we first confirm that it's due to the lookup overhead? Did you try to increase the number of jobs to see if that helps to hide the latency?

@nkoroste
Contributor

nkoroste commented Apr 20, 2020

CPU, RAM, and network are all maxed out during a Bazel run. On my machine it sometimes even dips into swap, so increasing the number of jobs causes an OOM. Alternatively, we can experiment with increasing --remote_max_connections, as the default is 100. However, I doubt that will change anything, since I can see my network download speed is already at its maximum - actually, it would probably help with checking cache hits for the many smaller actions.

Regarding cache writes: first of all, we don't seed the remote cache from local builds. Instead we have a stable machine on CI that does that. So for v1 it's probably acceptable not to support uploads to the cache. Unless you're talking about local workspace writes? I didn't check the code, but in general I'd assume only the "winning" SpawnRunner should be responsible for writing to the cache at the very end.

@ulfjack
Contributor

ulfjack commented Apr 21, 2020

@nkoroste I'm afraid in that case this won't be much of a win. The proposal here is to trade CPU for latency while using additional threads - that's only going to be an improvement if you have extra CPU, and if you're almost running OOM, your overall build latency might be dominated by gc rather than network round-trip latency.

AFAICT, --remote_max_connections won't do anything without also changing --jobs. The latter is the primary mechanism to control the number of threads, and they perform blocking calls to the cache.

@nkoroste
Contributor

Sorry for the delay on this. To add more context and visibility from some of the offline conversations:

I'm not suggesting that increasing the number of jobs and max connections will improve anything. In fact, we benchmarked various combinations of those two flags, and performance was generally worse when we increased them.

All I'm saying is that Bazel produces GBs of data, especially for Android builds, that have to be downloaded from the remote cache. This is obviously directly correlated with your overall network speed and latency. During a build with a high cache hit rate (85%+), for a big app, the majority of the time is spent downloading bytes from the cache while most of the machine's CPU/RAM sits idle.

With a dynamic spawn strategy we could utilize some of the machine's resources and reduce the number of bytes downloaded from the cache, hopefully improving the overall build time for developers with a bad network connection.

In the meantime, we are trying to improve the Android rules themselves to produce less unnecessary data, which will help with this as well. See #11253 for example.

@ashi009
Contributor

ashi009 commented Jun 22, 2021

Our profiling shows that remote cache latency plays a huge part in build performance. It seems that the cache check uses the same thread as the action runner; hence, a slow remote cache slows down all build actions.

A simple approach to this issue is to introduce a dedicated thread pool for all remote cache interaction.
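
A minimal illustration of that decoupling with plain java.util.concurrent (the class name and pool size are examples only, not Bazel's actual internals):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class RemoteCacheIoPool {
  // Sized independently of --jobs, so slow cache round-trips occupy these
  // threads instead of the threads that execute actions.
  private final ExecutorService cacheIo = Executors.newFixedThreadPool(64);

  <T> CompletableFuture<T> lookupAsync(Supplier<T> cacheLookup) {
    return CompletableFuture.supplyAsync(cacheLookup, cacheIo);
  }
}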

@philwo
Member

philwo commented Jun 22, 2021

FYI @coeuvre

@coeuvre
Member

coeuvre commented Jun 22, 2021

@ashi009 We have a thread pool for gRPC calls, but we block waiting on the result inside the spawn runner. One thing we can certainly improve is to make the remote spawn runner non-blocking, but I doubt that will improve the overall performance - it depends. If your build doesn't have actions waiting for an available action execution thread (defined by --jobs), changing to non-blocking won't help, since Bazel still has to wait for the running actions to finish before starting other actions that depend on their results.

That said, can you share your profiling setup and more profiling data?

@ashi009
Contributor

ashi009 commented Jul 3, 2021

@coeuvre

Totally missed the notification. Sure thing, but I need to do it privately.

Our build target is a huge iOS app with over 10k source files. The majority of actions compile ObjC files, which depend on no preceding actions. The critical path converges at the linking action.

We have increased --jobs=HOST_CPUS*N to address the latency issue in our setup. However, this breaks the resource management model of Bazel in an ugly way, so we have to trick it by raising the --local_{cpu,ram}_resources accordingly.
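
For reference, that workaround looks roughly like this in a .bazelrc; the multipliers are example values only and have to be tuned per machine and network:

# Over-subscribe the action threads to hide remote cache latency (example values only).
build --jobs=HOST_CPUS*2
# Raise the declared local resources so the extra jobs are actually scheduled.
build --local_cpu_resources=HOST_CPUS*2
build --local_ram_resources=HOST_RAM*.8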

@coeuvre
Member

coeuvre commented Jul 5, 2021

Thanks for sharing your build shape. In this case, I think the performance will be improved once the features described by #13632 and #13632 (comment) are implemented.

@github-actions

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label May 10, 2023
@crt-31

crt-31 commented May 11, 2023

@bazelbuild/triage, I think this is still relevant... can we keep it open?

@github-actions github-actions bot removed the stale Issues or PRs that are stale (no activity for 30 days) label May 12, 2023
@tjgq tjgq mentioned this issue Oct 20, 2023