Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote cache warning when using --experimental_remote_cache_async #22501

Closed
keith opened this issue May 22, 2024 · 10 comments
Closed

Remote cache warning when using --experimental_remote_cache_async #22501

keith opened this issue May 22, 2024 · 10 comments
Assignees
Labels
more data needed P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@keith
Copy link
Member

keith commented May 22, 2024

Description of the bug:

When running tests with the remote cache enabled and passing --experimental_remote_cache_async I often see this warning:

WARNING: Remote Cache: .../1/test.err (No such file or directory)
WARNING: Remote Cache: .../2/test.err (No such file or directory)
WARNING: Remote Cache: .../3/test.err (No such file or directory)
WARNING: Remote Cache: .../4/test.err (No such file or directory)
WARNING: Remote Cache: .../5/test.err (No such file or directory)

Where the granularity is per test target. I don't think it shows for every test target but I can see dozens in a single build testing ~600 targets.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

7.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@brentleyjones brentleyjones added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label May 22, 2024
@keith
Copy link
Member Author

keith commented May 22, 2024

there's actually a stack trace in this case too, i'll try to copy it next time

@coeuvre
Copy link
Member

coeuvre commented May 28, 2024

Please share the stack trace!

@keith
Copy link
Member Author

keith commented May 28, 2024

I think on 7.1.1 it doesn't actually show a trace in all cases

@fmeum
Copy link
Collaborator

fmeum commented May 28, 2024

@keith Does it do so with --verbose_failures?

@keith
Copy link
Member Author

keith commented May 29, 2024

Yep, nice call!

        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(Unknown Source)
        at java.base/java.io.FileInputStream.<init>(Unknown Source)
        at com.google.devtools.build.lib.unix.UnixFileSystem.createFileInputStream(UnixFileSystem.java:497)
        at com.google.devtools.build.lib.vfs.AbstractFileSystem.createMaybeProfiledInputStream(AbstractFileSystem.java:96)
        at com.google.devtools.build.lib.vfs.AbstractFileSystem.getInputStream(AbstractFileSystem.java:59)
        at com.google.devtools.build.lib.vfs.FileSystem$1.openStream(FileSystem.java:355)
        at com.google.common.io.ByteSource.copyTo(ByteSource.java:256)
        at com.google.common.io.ByteSource.hash(ByteSource.java:340)
        at com.google.devtools.build.lib.vfs.FileSystem.getDigest(FileSystem.java:357)
        at com.google.devtools.build.lib.unix.UnixFileSystem.getDigest(UnixFileSystem.java:453)
        at com.google.devtools.build.lib.vfs.Path.getDigest(Path.java:690)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.getDigest(RemoteActionFileSystem.java:453)
        at com.google.devtools.build.lib.vfs.Path.getDigest(Path.java:690)
        at com.google.devtools.build.lib.vfs.DigestUtils.manuallyComputeDigest(DigestUtils.java:220)
        at com.google.devtools.build.lib.vfs.DigestUtils.getDigestWithManualFallback(DigestUtils.java:183)
        at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:91)
        at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:80)
        at com.google.devtools.build.lib.remote.UploadManifest.setStdoutStderr(UploadManifest.java:202)
        at com.google.devtools.build.lib.remote.UploadManifest.create(UploadManifest.java:143)
        at com.google.devtools.build.lib.remote.RemoteExecutionService.lambda$buildUploadManifestAsync$6(RemoteExecutionService.java:1344)
        at io.reactivex.rxjava3.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:43)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap.subscribeActual(SingleFlatMap.java:37)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleUsing.subscribeActual(SingleUsing.java:83)
        at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
        at io.reactivex.rxjava3.internal.operators.single.SingleSubscribeOn$SubscribeOnObserver.run(SingleSubscribeOn.java:89)
        at io.reactivex.rxjava3.internal.schedulers.ScheduledDirectTask.call(ScheduledDirectTask.java:38)
        at io.reactivex.rxjava3.internal.schedulers.ScheduledDirectTask.call(ScheduledDirectTask.java:25)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

@joeleba joeleba added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Jun 4, 2024
@tjgq
Copy link
Contributor

tjgq commented Jun 4, 2024

This is trivial to repro. The issue is this logic in StandaloneTestStrategy that appends the test stderr file into the stdout file and then deletes the former. With --experimental_remote_cache_async, this logic runs concurrently with the uploading of the test spawn, and might find that the stderr file has already been deleted.

This is an instance of a more general problem that, in order for async uploading to work correctly, we must uphold the invariant that the outputs of a spawn must never be deleted for the remainder of the execution (or at least until uploading has taken place).

@tjgq
Copy link
Contributor

tjgq commented Jun 4, 2024

In addition, the test.err files are guaranteed to always be empty, because test-setup.sh redirects stderr to stdout. I believe the merge logic is just a last resort in case the test was set up incorrectly.

@tjgq
Copy link
Contributor

tjgq commented Jun 4, 2024

...except on Windows, where the test wrapper doesn't redirect. Sigh.

@criemen
Copy link
Contributor

criemen commented Jul 21, 2024

We're seeing this problem as well, and running tests on Windows becomes very flaky with this - due to Windows advanced (ahem) open-files-are-locked-from-deletion semantics, this can even cause build errors (one code path tries to delete the stderr files, the other tries to read them to upload, so the deletion fails).

Funnily, this even happens when using the disk cache, not just a "proper" remote cache.

I guess the workaround for this is

common --experimental_remote_cache_async
test --noexperimental_remote_cache_async

in .bazelrc?

I'm also not sure why this issue is marked as "more data needed" - anything we can help with?

fmeum added a commit to fmeum/bazel that referenced this issue Sep 19, 2024
When an action may modify a spawn's outputs after execution, the upload of outputs to the cache and reuse for deduplicated actions need to happen synchronously directly after spawn execution to avoid a race.

This commit implements this for cache uploads by marking all actions with this property and simply disabling async upload for all spawns executed by such actions.

For output reuse, all executions deduplicated against the first one register atomically upon deduplication and cause the cache upload to wait for all of them to complete reuse.

Fixes bazelbuild#22501
Fixes bazelbuild#23288
Work towards bazelbuild#21578
Closes bazelbuild#23307 (no longer needed)

Closes bazelbuild#23382.

PiperOrigin-RevId: 668101364
Change-Id: Ice75dbe14a7dd46e02ecb096d2b2a30940216356
@iancha1992
Copy link
Member

A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
more data needed P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.