"Upload missing inputs" performance regression in Bazel 5.0 and 5.3 #16054

Closed
clint-stripe opened this issue Aug 5, 2022 · 5 comments
Labels: P3 (We're not considering working on this, but happy to review a PR. No assignee), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug

Comments


clint-stripe commented Aug 5, 2022

Description of the bug:

cc @coeuvre -- this bug is very similar to #15872, but the regression actually occurred in the 5.0.0 release. I did a git bisect between 41feb61 and 2ac6581, and it appears db15e47 is the source of this slower behavior.

In the actual repo affected by this, we have ~600k inputs to some actions, and this step takes nearly 7 seconds with a recent release-5.3.0 commit. (We thought the fix to #15872 might have helped here, but the timing is consistent with 5.0's performance and still slower than 4.2's.)


Unfortunately this overhead was much smaller (<1s) with Bazel 4.2.2, and this regression, while not the same magnitude as that of #15872, is still significant for us.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Using the same test repo, https://github.com/clint-stripe/action_many_inputs/, but with 300,000 inputs to the action (edit the number in WORKSPACE to change this):

I have a script that runs the same set of commands, which reproduces this very clearly (in the action_many_inputs repo, with Bazel checked out and built in a sibling directory):

$ ../bazel/bazel-bin/src/bazel-dev test ... --config=remote --remote_download_toplevel --profile=/tmp/bazel-bisect.profile

Run this once, to ensure that the inputs have all been uploaded -- we want to measure the time when no action inputs change.

Then, run this a few times (it's ok if the test doesn't actually run; I usually hit ctrl-c after ~10 seconds just to ensure the "upload missing inputs" step is complete), and check the timing in the profile.

If it's helpful, you can look at just the event we care about:

cat "$PROFILE_PATH" | jq '.traceEvents[] | select(.name == "upload missing inputs") | .dur = (.dur/1e3)'
commit    run   duration (ms)
db15e47   1     2665
db15e47   2     2513
db15e47   3     2425
db15e47   4     2484
3ada002   1     967
3ada002   2     679
3ada002   3     660
3ada002   4     560

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

~/.bazel_binaries/bazel-5.0.0/bin/bazel build src:bazel-dev

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

I still have the full profiles from most of these Bazel invocations and am happy to share them if anything else in there would help. (Unfortunately these changes predate the more granular profiling that distinguishes between 'collect digests' and 'find missing digests'.)

brentleyjones (Contributor) commented:

Is this fixed by f440f8e?

clint-stripe (Author) commented Aug 5, 2022

@brentleyjones it's not -- that regression took these tests from ~2s to ~45s, but even with that fix the step is still several times slower than in 4.2. (I ran a few of the same tests on the release-5.3.0 branch at 9d57003, with about the same results.)

sgowroji added the type: bug, untriaged, and team-Remote-Exec labels on Aug 7, 2022
coeuvre (Member) commented Aug 8, 2022

I will take a look, but I am not entirely sure whether this can be fixed.

Part of db15e47 implements #13166, which by design needs a lock (or similar) for every input -- I guess that's where the regression comes from.
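
A minimal sketch of the kind of synchronization this implies (hypothetical names, not Bazel's actual classes): deduplication means every input's digest has to be checked against a shared registry of in-flight uploads under a lock, so each input pays at least one lock acquisition even when nothing needs uploading.

import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.SettableFuture;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a shared registry of in-flight uploads keyed by digest.
final class UploadRegistry {
  private final Map<String, ListenableFuture<Void>> inFlight = new HashMap<>();

  // Returns an existing upload for this digest, or registers the caller's own.
  synchronized ListenableFuture<Void> getOrRegister(
      String digestHash, SettableFuture<Void> myUpload) {
    ListenableFuture<Void> existing = inFlight.get(digestHash);
    if (existing != null) {
      return existing; // another action is already uploading this blob; share it
    }
    inFlight.put(digestHash, myUpload);
    return myUpload;
  }

  // Called once an upload finishes, so later actions fall back to the remote cache.
  synchronized void complete(String digestHash) {
    inFlight.remove(digestHash);
  }
}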

coeuvre self-assigned this on Aug 8, 2022
coeuvre added the P1 (I'll work on this now. Assignee required) label and removed the untriaged label on Aug 8, 2022
coeuvre (Member) commented Aug 17, 2022

Made a small improvement via #16118, but the regression cannot be fully fixed, since by design we need a lock for every input to deduplicate findMissingBlobs calls and share uploads.

I used your test repo with 600,000 inputs for the benchmarks. Each setup was benchmarked several times, and the numbers were similar across runs.

For the baseline (4.2.2), all the time in the uploading missing inputs step goes to findMissingDigests, which took around 2900ms.

For HEAD, uploading missing inputs took around 4500ms and can be broken down into:

  • merkleTree.getAllDigests(): around 380ms. The implementation was changed by #13879 (Cache merkle trees); see the sketch after this list. There are two differences:
    • Previously, the digests map was built while constructing the merkle tree. Now it is built lazily the first time we call getAllDigests(), i.e. the work moves from MerkleTree.build(ActionInput) to upload missing inputs.
    • It might be more expensive to build the digest map even when the cache is disabled. cc @moroten.
  • collect digests: around 730ms. This is where we iterate over the inputs and use a lock to check for an existing upload of each input.
  • findMissingDigests: around 3200ms.
  • uploads: around 170ms. This is where we wait for the new and existing uploads. Since uploads are shared, we need a lock to update the internal state.
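
A rough sketch of the getAllDigests() difference noted in the first bullet (hypothetical class, not the real MerkleTree implementation): the digest map used to be filled in while the tree was built, and is now memoized on the first call, which moves that cost into the "upload missing inputs" span of the profile.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of lazy digest-map construction.
final class LazyDigestTree {
  private final List<String> leafDigests; // stand-in for the tree's file digests
  private Map<String, String> allDigests; // built on first use, then reused

  LazyDigestTree(List<String> leafDigests) {
    this.leafDigests = new ArrayList<>(leafDigests);
  }

  // Previously this map was filled in while the tree was constructed; here the
  // walk happens on the first getAllDigests() call, i.e. inside the
  // "upload missing inputs" step rather than during tree construction.
  synchronized Map<String, String> getAllDigests() {
    if (allDigests == null) {
      allDigests = new LinkedHashMap<>();
      for (String digest : leafDigests) {
        allDigests.put(digest, digest); // placeholder for digest -> path/blob data
      }
    }
    return allDigests;
  }
}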

Essentially, for an action with 600,000 inputs, we pay around 380ms for the cache merkle trees feature and around 900ms for the deduplicate findMissingBlobs calls feature in this step.

I made a prototype that replaces RxJava with vanilla ListenableFuture while keeping the same functionality, but it only reduces the cost from around 900ms to around 600ms. RxJava does introduce some runtime overhead, but I won't move ahead with the prototype since it would still require lots of other changes in the codebase.
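
For illustration only (assumed names, not the actual prototype): per-digest upload sharing with vanilla ListenableFuture could look roughly like this, with a ConcurrentHashMap replacing the RxJava machinery while still guaranteeing one upload per digest.

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;

// Illustrative only: sharing one upload per digest with ListenableFuture
// instead of RxJava. computeIfAbsent starts the upload exactly once; later
// callers for the same digest receive the same future.
final class FutureUploadSharing {
  private final ConcurrentMap<String, ListenableFuture<Void>> uploads =
      new ConcurrentHashMap<>();
  private final ListeningExecutorService executor =
      MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(8));

  ListenableFuture<Void> uploadOnce(String digestHash, Runnable doUpload) {
    return uploads.computeIfAbsent(
        digestHash,
        d ->
            // Keep one cancelled caller from cancelling the upload shared by others.
            Futures.nonCancellationPropagating(executor.submit(doUpload, (Void) null)));
  }
}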

What's your build wall time difference between 4.2.2 and 5.3 (or HEAD)? And what happens if you enable --experimental_remote_merkle_tree_cache?

moroten (Contributor) commented Aug 17, 2022

  • It might be more expensive to build the digest map even when the cache is disabled.

The execution time increase I would expect is due to the recursive visitor pattern used when building the Merkle tree. However, that pattern is already part of the old MerkleTree build, so in this case I assume the Merkle tree caching should not add any processing time. Looking at a real measurement would of course be interesting.

coeuvre added the P3 (We're not considering working on this, but happy to review a PR. No assignee) label and removed the P1 (I'll work on this now. Assignee required) label on Aug 22, 2022
coeuvre closed this as completed on Oct 20, 2023