Broken AC entries are written to disk cache with --remote_download_minimal #17080

Closed
exoson opened this issue Dec 27, 2022 · 8 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@exoson
Contributor

exoson commented Dec 27, 2022

Description of the bug:

Building with Build without the Bytes and remote execution, running bazel clean, and then rebuilding fails with missing digests for at least some targets. It appears that the AC entry is written to the disk cache while, in general, the referenced CAS entries are not, because of Build without the Bytes. This seems to have been the case for some time, but it wasn't a problem until an integrity check for AC entries was added in #16731.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Set up the workspace:

touch WORKSPACE
echo 'int main() { return 0; }' > main.cc
echo 'cc_binary(name = "main", srcs = ["main.cc"])' > BUILD

Also set up a config called remote_exec_with_disk that enables remote execution and the disk cache when building this project.
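For reference, a hypothetical .bazelrc fragment for that config might look like this (the executor endpoint and cache path are placeholders, not values from the report):

```
build:remote_exec_with_disk --remote_executor=grpc://remote.example.com:8980
build:remote_exec_with_disk --disk_cache=/tmp/bazel-disk-cache
```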

Reproduce the issue:

bazel build --config=remote_exec_with_disk --remote_download_minimal //:main && bazel clean && bazel build --config=remote_exec_with_disk --remote_download_minimal //:main

The second build command should give you an error like

ERROR: /repro_project/BUILD:1:10: Linking main failed: (Exit 34): Missing digest: 377e9d2c78fac1172750ae43806f3d3e1ec7dff71c15cdadfbbad10019929b8c/7936

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

release 6.0.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

-

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

@sgowroji sgowroji added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged type: bug labels Dec 27, 2022
@exoson
Contributor Author

exoson commented Dec 27, 2022

Forgot to add the place in the code that seems to be writing the broken entry to the disk cache

@brentleyjones
Contributor

@coeuvre @tjgq

@brentleyjones
Contributor

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Dec 27, 2022
@coeuvre coeuvre self-assigned this Jan 3, 2023
@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed untriaged labels Jan 3, 2023
@coeuvre
Member

coeuvre commented Jan 12, 2023

Thanks for the report!

The issue is tricky. When building without the bytes with the disk cache alone, we want to check AC entry integrity for the disk cache. Otherwise the build could fail if any blob is missing when we read it later.

However, when combined with a remote cache, on a remote cache hit we only download the AC and skip downloading the blobs. Hence the AC entry in the disk cache is always incomplete until all the referenced blobs are downloaded later. The current code didn't check integrity before deciding to use the disk part of a combined cache.

We could fix that by checking the integrity when deciding which part of a combined cache to use. The problem is that we would then never hit the disk cache until all the referenced blobs have been downloaded, so the fix could reduce the cache-hit rate for the disk cache. Consider two consecutive builds:

bazel build --config=remote_exec_with_disk --remote_download_minimal ...
bazel build --config=remote_exec_with_disk --remote_download_minimal ...

The second build will miss the disk cache because the blobs referenced by the AC were not downloaded (but it can still hit the remote cache). On the other hand, this is the correct behaviour, since the remote cache could evict blobs. We shouldn't accept an AC entry from the disk cache if any referenced blob is missing, hoping we could get it from the remote cache -- it could be missing there too.
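The lookup order being discussed can be sketched as follows. This is a toy illustration, not Bazel's actual classes or API; FakeCache, lookup_action_result, and the dict-based AC entries are all hypothetical names invented for the sketch:

```python
class FakeCache:
    """Toy stand-in for a cache holding an action cache (AC) and a CAS."""

    def __init__(self, ac=None, blobs=()):
        self.ac = dict(ac or {})      # action key -> AC entry
        self.blobs = set(blobs)       # digests present in the CAS

    def get_ac(self, key):
        return self.ac.get(key)

    def has_blob(self, digest):
        return digest in self.blobs


def lookup_action_result(action_key, disk_cache, remote_cache,
                         check_integrity=True):
    """Prefer the disk cache, but only trust its AC entry when every
    referenced blob is present locally (the integrity check)."""
    result = disk_cache.get_ac(action_key)
    if result is not None:
        if not check_integrity or all(
                disk_cache.has_blob(d) for d in result["blobs"]):
            return result, "disk"
        # The AC entry is incomplete: under --remote_download_minimal the
        # referenced blobs were never downloaded, so fall through to remote.
    result = remote_cache.get_ac(action_key)
    if result is not None:
        return result, "remote"
    return None, "miss"
```

With the integrity check on, a disk AC entry whose blobs were never downloaded is skipped in favour of the remote cache, which is exactly the hit-rate cost described above.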

Another fix is to drop the integrity check for AC entries from the disk cache when using a combined cache. This will increase the disk-cache hit rate, but the chance of triggering a "missing blobs" error is higher if the TTL of blobs on the remote cache is low.

I will go with the second fix for now (for performance reasons) and see how we can improve it with the lease service.

@brentleyjones
Contributor

Another fix is to drop the integrity check for AC entries from the disk cache when using a combined cache. This will increase the disk-cache hit rate, but the chance of triggering a "missing blobs" error is higher if the TTL of blobs on the remote cache is low.

Can I vote that we don't do this? At least until the lease service is up and running. The change to have the AC check whether its entries are valid is a very good one, in particular with remote caching. If you need to disable the check, could you do it only for disk cache + remote exec, but not for disk cache + remote cache?

@coeuvre
Member

coeuvre commented Jan 12, 2023

I agree the AC check is a good one. But if we enable it for a combined cache (either disk cache + remote exec, or disk cache + remote cache), we will probably never hit the disk cache when building without the bytes.

Taking the repro as an example: since we are using minimal mode, the outputs of //:main will never be downloaded, which means we will never hit the disk cache if we check integrity.

In real-world builds, I believe there are actions whose outputs will never be downloaded. If we do the AC check, these actions will never hit the disk cache.

@brentleyjones
Contributor

brentleyjones commented Jan 12, 2023

I'm fine with the AC result being used in the combined cache, as long as Bazel calls FindMissingBlobs on the remote cache to verify that the result is valid. Though I think it would be cheaper if Bazel instead just asked the remote cache for the AC result in that case.

The main benefit of the combined disk cache is the storing of the CAS items. There might be some local-only actions that do end up storing themselves, even with BwtB. I feel that BwtB + disk cache only is going to be very rare. The most common case will be the combined cache. And if we want the BwtB experience to be as nice as possible, the combined cache shouldn't degrade the experience. By that, I mean that if one was using just the remote cache, and bazel clean --expunge was run (until the lease service lands), Bazel wouldn't run into missing-blob errors, because the remote cache wouldn't return AC results if they had missing CAS blobs. This change would degrade that, because the disk cache could incorrectly return an invalid AC result.
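The suggestion above can be sketched as a small validation step. This is a hypothetical illustration of the idea, not Bazel's real code: the find_missing_blobs callable plays the role of the REAPI FindMissingBlobs RPC, taking an iterable of digests and returning those the remote CAS does not hold:

```python
def validate_against_remote(result, find_missing_blobs):
    """Treat a disk-cache AC result as a hit only if the remote CAS still
    holds every blob it references; otherwise treat it as a cache miss.

    `result` is assumed to be a dict with a "blobs" list of digests, and
    `find_missing_blobs` stands in for the FindMissingBlobs call to the
    remote cache (both names are illustrative).
    """
    missing = find_missing_blobs(result["blobs"])
    # Any missing blob invalidates the cached result, since the build
    # would later fail trying to fetch it.
    return result if not missing else None
```

This keeps the disk-cache AC usable while preserving the property that an accepted result's blobs are still fetchable from the remote cache.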

@coeuvre
Member

coeuvre commented Jan 12, 2023

I get you. From this perspective, not throwing an error is definitely a better experience than slowing down the build. And yes, just checking the remote AC is probably not slower. Reusing the CAS is the main benefit.

@meteorcloudy meteorcloudy removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Jan 13, 2023
ShreeM01 added a commit that referenced this issue Jan 25, 2023
This will decrease the cache-hit rate for disk cache when building without the bytes. See #17080 for reasoning.

With the lease service, this decrease will be fixed.

Fixes #17080.

Closes #17201.

PiperOrigin-RevId: 502818641
Change-Id: I1f73dd38d7e52e2f39b367e6114aab714de22d3c

Co-authored-by: Chi Wang <chiwang@google.com>
hvadehra pushed a commit that referenced this issue Feb 14, 2023