Insufficient remote Merkle tree size causes slow builds #18686

Open
lukaciko opened this issue Jun 15, 2023 · 2 comments
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Documentation Documentation improvements that cannot be directly linked to other team labels team-Remote-Exec Issues and PRs for the Execution (Remote) team type: documentation (cleanup)

Comments

@lukaciko

Description of the bug:

After updating our version of Bazel, we saw a significant regression in some of our builds. A couple of actions towards the end of the build that have a lot of inputs took significantly longer. In the trace report, CPU usage is low while those actions execute, and the memory usage of the main Bazel process keeps going up and down:

[Image: screenshot of the trace report]

Using Git bisect, we found #18015 to be the change that led to the biggest regression.

We figured out that the regressed builds use --experimental_remote_merkle_tree_cache, and that we could fix the regression by increasing --experimental_remote_merkle_tree_cache_size. With an insufficient size, Bazel keeps allocating and deallocating the Merkle trees. Presumably we saw a regression after that change because it keeps Merkle trees around for longer.
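To illustrate why an undersized cache can be worse than a slightly larger one, here is a minimal Python sketch of a bounded cache being scanned repeatedly. It assumes plain LRU eviction and a cyclic access pattern, neither of which is confirmed to match Bazel's actual Merkle tree cache; the `simulate` function and its parameters are hypothetical and exist only for this illustration.

```python
from collections import OrderedDict


def simulate(capacity, distinct_keys, passes):
    """Replay `passes` cyclic scans over `distinct_keys` keys through an LRU cache.

    Returns (hits, misses). This is a toy model, not Bazel's cache.
    """
    cache = OrderedDict()
    hits = misses = 0
    for _ in range(passes):
        for key in range(distinct_keys):
            if key in cache:
                cache.move_to_end(key)  # mark as most recently used
                hits += 1
            else:
                misses += 1
                cache[key] = True  # stand-in for a freshly built Merkle tree
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits, misses


# Cache smaller than the working set: every lookup misses and rebuilds.
print(simulate(capacity=1000, distinct_keys=1500, passes=5))  # (0, 7500)
# Cache larger than the working set: only the first pass misses.
print(simulate(capacity=2000, distinct_keys=1500, passes=5))  # (6000, 1500)
```

In this model, a cache that is too small for the working set never produces a hit, so each entry is allocated and then evicted before it can be reused, which is consistent with the allocate/deallocate pattern described above.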

While we can work around the issue by increasing the size, a warning or error when this starts happening would be appreciated.
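As a rough sketch of what such a warning could be based on, the Python snippet below flags a cache as thrashing when it has seen enough lookups, is evicting entries, and still has a very low hit rate. The function name, the counters, and both thresholds are hypothetical choices for illustration; this is not an existing Bazel feature or API.

```python
def should_warn(hits, misses, evictions, min_lookups=1000, max_hit_rate=0.1):
    """Return True if the cache counters suggest the cache is too small.

    All parameters are hypothetical counters and thresholds for illustration.
    """
    lookups = hits + misses
    if lookups < min_lookups:
        return False  # too little data to judge
    hit_rate = hits / lookups
    # Evictions combined with almost no hits suggest entries are being
    # discarded before they can be reused.
    return evictions > 0 and hit_rate < max_hit_rate


print(should_warn(hits=0, misses=7500, evictions=6500))
print(should_warn(hits=6000, misses=1500, evictions=0))
```

The thresholds would need tuning against real builds; the point is only that eviction and hit counters together are enough to detect the situation cheaply.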

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Haven't tried this myself, but I'm assuming this can be replicated on a decently sized Java project by enabling --experimental_remote_merkle_tree_cache and setting --experimental_remote_merkle_tree_cache_size to a low value.

Which operating system are you running Bazel on?

macOS 13.4 & Ubuntu Focal Fossa

What is the output of bazel info release?

/

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

We built on top of commit 286306e from the 6.x branch with some additional patches.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

git@github.com:bazelbuild/bazel.git
a7b96f45df00b4024de4e70b90989956904ca4fb
286306e8358542ce272f7442075bf157a2a62ec7

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

1641fa8

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@Pavank1992 Pavank1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jun 15, 2023
@coeuvre
Member

coeuvre commented Jun 15, 2023

cc @tjgq

@zhengwei143 zhengwei143 added team-Documentation Documentation improvements that cannot be directly linked to other team labels P3 We're not considering working on this, but happy to review a PR. (No assignee) type: documentation (cleanup) and removed untriaged type: bug labels Jun 27, 2023
DavidANeil added a commit to DavidANeil/repro-bazel-nested-set that referenced this issue Jan 16, 2024
@tjgq
Contributor

tjgq commented Jan 18, 2024

I didn't have access to a good repro at the time this issue was filed, but in light of recent developments in #20862 (thanks again for the repro, @DavidANeil) I now have a theory as to why this occurs, which is explained in #20862 (comment).

The wider discussion in #20862 is also making us consider scrapping the Merkle tree cache optimization entirely (or, more likely, scaling it back to cover only the cases where we're able to consistently see a benefit - large tree artifacts and runfiles trees). We don't have a lot of evidence that it provides a performance benefit for most builds, and we have quite a bit of evidence that it makes things worse in some cases (sometimes dramatically, as #20862 demonstrates), which doesn't look like a good tradeoff.

If anyone in the community could point us to a reproducible use case (ideally a real one, but synthetic would work as long as it's representative of a real one) where enabling the Merkle tree cache provides a significant benefit, that would be very helpful to guide future work in this area.
