
Actions fetched from remote cache / executed remotely should not count towards parallelism limit #6394

Open
ob opened this issue Oct 15, 2018 · 26 comments
Labels: not stale, P2 (We'll consider working on this in future), team-Performance, type: feature request

Comments

ob commented Oct 15, 2018

When a fairly large application is built using the remote cache and all the actions are fully cached, Bazel still keeps the parallelism set to the number of cores in the machine, even though most actions are just waiting on network I/O.

With the default settings on an 8-core machine, I get:

$ time bazel build //...
## output elided
INFO: Elapsed time: 97.567s, Critical Path: 26.44s
INFO: 2000 processes: 2000 remote cache hit.
INFO: Build completed successfully, 2240 total actions

real	1m38.518s
user	0m0.051s
sys	0m0.061s

But if I bump up the number of jobs to a crazy number:

$ time bazel build --jobs 2000 //...
## output elided
INFO: Elapsed time: 39.535s, Critical Path: 31.33s
INFO: 2000 processes: 2000 remote cache hit.
INFO: Build completed successfully, 2240 total actions

real	0m40.483s
user	0m0.048s
sys	0m0.058s

I think the --jobs option should only apply to local actions.

@ob ob changed the title Actions fetched from remote cache should not count towards parallelism Actions fetched from remote cache should not count towards parallelism limit Oct 15, 2018
@bergsieker bergsieker added this to the Remote Execution milestone Oct 17, 2018
buchgr commented Oct 30, 2018

Thanks @ob! A colleague of mine has a prototype of this; however, I still expect it to be at least 3-6 months before it lands in a released Bazel version.

philwo commented Nov 6, 2018

FWIW, we "fix" this at Google by actually running with --jobs=200 (or higher). Local actions are still limited by local resources, so this is fine in general.
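
For reference, a minimal .bazelrc sketch of that workaround, assuming a Bazel version that has --local_cpu_resources (older releases used --local_resources instead); the value 200 simply mirrors the number quoted above:

# .bazelrc (illustrative only)
build --jobs=200                       # allow many in-flight remote cache hits / remote actions
build --local_cpu_resources=HOST_CPUS  # keep genuinely local actions bounded by the machine (this is already the default)

Local test concurrency can be capped separately with --local_test_jobs if needed.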

@philwo philwo added P2 We'll consider working on this in future. (Assignee optional) category: remote execution / caching and removed untriaged labels Nov 6, 2018
buchgr commented Nov 6, 2018

@philwo @ulfjack has a prototype patch to no longer be limited by --jobs for remote execution.

ittaiz commented Jan 6, 2019

@philwo

Local actions are still limited by local resources, so this is fine in general.

Does that mean you run with --jobs together with --local_resources?

@jin jin added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed team-Execution labels Jan 14, 2019
buchgr commented Jan 16, 2019

@ulfjack what's the status of your work in this area? Is there a tracking bug on GitHub?

ulfjack commented Jan 16, 2019

Parts of the prototype have been submitted, but I haven't even sent out some critical parts. Making it work also requires rewriting the RemoteSpawnRunner to be async and use ListenableFuture, for which I do not currently have plans. I am not aware of a tracking bug on GitHub apart from this one.

buchgr commented Jan 16, 2019

Making it work also requires rewriting the RemoteSpawnRunner to be async and use ListenableFuture

That I'd be happy to take over :-)

ulfjack commented Jan 16, 2019

Happy for you to start working on the RSR (RemoteSpawnRunner) in parallel with me landing the Skyframe changes that are also required.

buchgr commented Jan 25, 2019

Broke this out into #7182.

@buchgr buchgr removed the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jan 25, 2019
@buchgr buchgr removed this from the Remote Execution milestone Jan 25, 2019
@buchgr buchgr removed their assignment Jan 25, 2019
ulfjack commented Feb 14, 2019

Commit 9beabe0 is related.

ulfjack commented Feb 14, 2019

Also related:
47d29eb
57f2f58
14f8b10 rolled back as 68fc46b due to increased memory consumption

@ianoc-stripe

It does for us; I'm not sure whether others see it, but for tests + binaries across a repo, if you pull and get a lot of cache hits with --jobs set to, say, 1000, the machine more or less locks up unpacking runfiles. So having the runfiles step counted towards local CPU usage would be great as a way to avoid that. (For now we tell folks to Ctrl+C and re-run with --jobs 10 when they see it, which isn't ideal.)

So I would love it if you could change this, thank you.
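
A minimal sketch of that stopgap, using only the value quoted above (a team could also pin a lower default in a shared .bazelrc):

$ bazel build --jobs 10 //...   # re-run with a low job count after interrupting the thrashing build

# or make it the default for everyone, in .bazelrc:
build --jobs=10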

@benjaminp

45a9bc2 was a change to the resource requirements of runfiles trees, which allowed more parallelism. Probably this should be discussed in another issue, though.

@meisterT meisterT added the team-Performance Issues for Performance teams label May 12, 2020
@meisterT meisterT added P3 We're not considering working on this, but happy to review a PR. (No assignee) P2 We'll consider working on this in future. (Assignee optional) and removed P2 We'll consider working on this in future. (Assignee optional) P3 We're not considering working on this, but happy to review a PR. (No assignee) labels May 4, 2021
@meisterT meisterT changed the title Actions fetched from remote cache should not count towards parallelism limit Actions fetched from remote cache / executed remotely should not count towards parallelism limit May 4, 2021
@github-actions

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label May 10, 2023
@meisterT meisterT added not stale Issues or PRs that are inactive but not considered stale and removed stale Issues or PRs that are stale (no activity for 30 days) labels May 10, 2023
@meisterT

We are working on this again, in a different way. The next step is to upgrade to a modern JDK, which will allow us to use Loom/virtual threads.

Ryang20718 commented May 16, 2023

Reading through this thread.

If I'm using local execution with a remote cache and almost all tests are cached, then with --jobs set to 4 I see the following (currently on Bazel 6.0.0):

21s remote-cache, linux-sandbox ... (8 actions running)
29s remote-cache, linux-sandbox ... (8 actions running)

With --jobs set to 8, I see a max of 16 actions. This would mean actions fetching from the remote cache are not counting towards the parallelism limit. Is there a way we can configure it to do so?
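
Not a definitive answer to the question above, but for illustration, two knobs that do exist in this area (assuming they are available in Bazel 6.x; neither makes cache hits count against --jobs): --local_test_jobs caps how many tests run locally at once, and --remote_max_connections caps concurrent connections to the remote cache.

$ bazel test --jobs 4 --local_test_jobs 4 --remote_max_connections=4 //...   # values are arbitrary, illustrative only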

chandlerc added a commit to chandlerc/carbon-lang that referenced this issue Sep 4, 2023
The expensive local actions will be separately gated to not overwhelm
the machine, and currently highly asynchronous actions are a dominant
part of our builds due to downloading cached artifacts. Without a high
concurrency, these are downloaded roughly 2-at-a-time currently.

I've verified that on a small machine without a good cache this doesn't
seem to generate huge amounts of work and local build and test actions
are successfully gated on the local flags.

There is already a Bazel issue tracking this limitation:
bazelbuild/bazel#6394
chandlerc added a commit to chandlerc/carbon-lang that referenced this issue Sep 4, 2023
chandlerc added a commit to carbon-language/carbon-lang that referenced this issue Sep 4, 2023
github-merge-queue bot pushed a commit to carbon-language/carbon-lang that referenced this issue Sep 5, 2023
chandlerc added a commit to chandlerc/carbon-lang that referenced this issue Sep 6, 2023
fhanau added a commit to cloudflare/workerd that referenced this issue Jun 25, 2024
bazelbuild/bazel#22785 fixes a remote caching bug that prevented us from
fetching only top-level targets from cache. Enabling that option should speed
up CI runs.
Also increase the number of concurrent Bazel jobs, see bazelbuild/bazel#6394
fhanau added a commit to cloudflare/workerd that referenced this issue Jun 25, 2024
fhanau added a commit to cloudflare/workerd that referenced this issue Jun 26, 2024