Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7.0.0] Flaky builds with remote cache #20559

Closed
coryan opened this issue Dec 15, 2023 · 14 comments
Closed

[7.0.0] Flaky builds with remote cache #20559

coryan opened this issue Dec 15, 2023 · 14 comments
Assignees
Labels
awaiting-user-response Awaiting a response from the author team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@coryan
Copy link

coryan commented Dec 15, 2023

Description of the bug:

Since we moved our builds to Bazel 7.0.0 our builds are randomly failing with I/O exception during sandboxed execution messages.

We started to collect the failures on googleapis/google-cloud-cpp#13315 . A good example may be:

WARNING: Remote Cache: Connection reset
com.google.devtools.build.lib.remote.common.BulkTransferException: Connection reset
	at com.google.devtools.build.lib.remote.util.Utils.waitForBulkTransfer(Utils.java:596)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.downloadOutputs(RemoteExecutionService.java:1212)
	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:118)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:152)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.rules.cpp.CppLinkAction.execute(CppLinkAction.java:346)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1148)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1065)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:562)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:859)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:333)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:171)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
	Suppressed: java.net.SocketException: Connection reset
		at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(Unknown Source)
		at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
		at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:256)
		at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
		at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
		at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
		at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
		at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
		at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
		at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
		at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
		at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
		at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
		at java.base/java.lang.Thread.run(Unknown Source)
... ... ...
[27,834 / 29,514] 343 / 664 tests; Linking google/cloud/tasks/quickstart/quickstart; Downloading google/cloud/tasks/quickstart/quickstart, 230.7 MiB / 297.4 MiB; 18s remote-cache ... (32 actions, 0 running)
ERROR: /workspace/google/cloud/storageinsights/quickstart/BUILD.bazel:17:10: Linking google/cloud/storageinsights/quickstart/quickstart failed: I/O exception during sandboxed execution: 6 errors during bulk transfer:
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
ERROR: /workspace/google/cloud/opentelemetry/quickstart/BUILD.bazel:19:10: Linking google/cloud/opentelemetry/quickstart/quickstart failed: I/O exception during sandboxed execution: 4 errors during bulk transfer:
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
java.net.SocketException: Connection reset

I speculate this may be the remote cache, but of course I could be wrong.

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No simple repro at the moment, sorry.

If there is a way to enable more detailed logging I could do that on our builds and report any results on the next failure.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

Starting local Bazel server and connecting to it... Build label: 7.0.0 Build target: @@//src/main/java/com/google/devtools/build/lib/bazel:BazelServer Build time: Mon Dec 11 16:51:49 2023 (1702313509) Build timestamp: 1702313509 Build timestamp as int: 1702313509

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@meteorcloudy
Copy link
Member

/cc @coeuvre @tjgq

@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Dec 15, 2023
@coeuvre
Copy link
Member

coeuvre commented Dec 18, 2023

It looks like the connection was closed when Bazel was downloading the action outputs. Some clarifications:

  1. How flaky is it? Does the error also appear with Bazel 6.4?
  2. Which remote cache server is used for the builds?
  3. What flags were set for the builds?

@coryan
Copy link
Author

coryan commented Dec 18, 2023

It looks like the connection was closed when Bazel was downloading the action outputs. Some clarifications:

  1. How flaky is it?

The build with --config msan (see below for the .bazelrc file) had a high failure rate, higher than 10%. But if you include all the other builds, then it was much lower failure rate, probably less than 2%. I said "had" because we had to revert our builds to Bazel 6.4.0 for unrelated reasons.

Does the error also appear with Bazel 6.4?

No.

  1. Which remote cache server is used for the builds?

Google Cloud Storage.

  1. What flags were set for the builds?
2023-12-15T02:57:00Z (+94s): + bazel test --test_output=errors --verbose_failures=true --keep_going --experimental_convenience_symlinks=ignore --cache_test_results=no --remote_cache=https://storage.googleapis.com/cloud-cpp-testing-resources_cloudbuild/bazel-cache/fedora-msan-msan --google_default_credentials --experimental_guard_against_concurrent_changes --config=msan --test_tag_filters=-integration-test\,-no-msan ...

and the .bazelrc file:

https://github.com/googleapis/google-cloud-cpp/blob/bb6dc521d51e2e27d47827e04a607361e8ee6fd1/.bazelrc

@davido
Copy link
Contributor

davido commented Dec 19, 2023

Not sure if it's related, but also Gerrit Code Review project is affected by some changed behaviour in remote-cache code area, after upgrading to Bazel 7.0.0, see this issue.

@tjgq
Copy link
Contributor

tjgq commented Jan 3, 2024

(FWIW, I don't think #20161 is related; we've narrowed down that one and it has a very different nature.)

The exception occurs fairly deep in our networking dependencies, so there isn't a whole lot I can suggest to augment the logging. (I suppose we could try to plumb through some information to narrow down which particular blobs failed to download, but this looks like a transport-level issue rather than application-level, so I'm unsure how much that would help.)

Since this issue isn't present in 6.x, I wonder if it was introduced by the netty upgrade in ff1dc3b. Would you be able to give a Bazel built at the parent commit (2e34965) a try? Bazelisk can help with this, i.e. USE_BAZEL_VERSION=2e34965242280d861f688628ed2bbc5209350d9f bazelisk build ...

But do note that the commit is 7 months old at this point, so it might be incompatible with your project / take some work to make compatible.

@meisterT meisterT added awaiting-user-response Awaiting a response from the author and removed untriaged labels Jan 9, 2024
@dbolduc
Copy link

dbolduc commented Jan 12, 2024

Would you be able to give a Bazel built at the parent commit (2e34965) a try? Bazelisk can help with this, i.e. USE_BAZEL_VERSION=2e34965242280d861f688628ed2bbc5209350d9f bazelisk build ...

But do note that the commit is 7 months old at this point, so it might be incompatible with your project / take some work to make compatible.

I tried this commit. It did not work out of the box. I am not going to try very hard to make it work. We are going to revert to 6.4.0 in more builds, because the flakes are too frequent with 7.0.0.

@derekmauro
Copy link

I also saw what I think is the same issue with Bazel 6.2.1 when we tried to use --remote_download_minimal. We rolled back the use of this flag with abseil/abseil-cpp@3bc08b8. It still seems present in 7.0.0 but maybe a bit less frequent.

@tjgq
Copy link
Contributor

tjgq commented Jun 21, 2024

Assuming the "connection reset" error is transient and not permanent, this change to the retry logic should fix this issue, and will be in the 7.2.1 release.

@tjgq
Copy link
Contributor

tjgq commented Jun 25, 2024

7.2.1 has been released and should contain a fix for this bug. Please comment on this issue if the problem persists.

@tjgq tjgq closed this as completed Jun 25, 2024
@coryan
Copy link
Author

coryan commented Jul 2, 2024

I just got this in a build with Bazel 7.2.1:

ERROR: /workspace/ci/verify_current_targets/BUILD.bazel:31:9: Testing //:verify__bigquery failed: I/O exception during sandboxed execution: Download of '/cloud-cpp-testing-resources_cloudbuild/bazel-cache/fedora-latest-bazel-bazel-targets/cas/87c5c8fc54cd429a8f5d1c673e720f4bf4c28fe1462e5c9be5094847356fab37' timed out. Received 0 bytes.
INFO: Analysis succeeded for only 8 of 9 top-level targets
INFO: Found 9 test targets...
INFO: Elapsed time: 88.587s, Critical Path: 67.55s
INFO: 6775 processes: 4451 remote cache hit, 2314 internal, 8 linux-sandbox, 2 local.
ERROR: Build did NOT complete successfully

https://github.com/googleapis/google-cloud-cpp/runs/26956528060

copybara-service bot pushed a commit that referenced this issue Jul 4, 2024
Address #20559 (comment).

PiperOrigin-RevId: 649360592
Change-Id: I50ac2ed3b54d6cffb5f96d3f8315279786c4f30f
@coeuvre
Copy link
Member

coeuvre commented Jul 4, 2024

6e2e3a1 should address this.

coeuvre added a commit to coeuvre/bazel that referenced this issue Jul 4, 2024
Address bazelbuild#20559 (comment).

PiperOrigin-RevId: 649360592
Change-Id: I50ac2ed3b54d6cffb5f96d3f8315279786c4f30f
github-merge-queue bot pushed a commit that referenced this issue Jul 4, 2024
Address
#20559 (comment).

PiperOrigin-RevId: 649360592
Change-Id: I50ac2ed3b54d6cffb5f96d3f8315279786c4f30f
@coryan
Copy link
Author

coryan commented Jul 4, 2024

6e2e3a1 should address this.

That seems to make the timeout retryable and should fix the specific issue. Thanks.

However, I wonder why the build stops on any caching error. Shouldn't that be treated as a cache miss, and the build continue? Is there a separate bug to fix that problem?

fmeum pushed a commit to fmeum/bazel that referenced this issue Jul 5, 2024
Address bazelbuild#20559 (comment).

PiperOrigin-RevId: 649360592
Change-Id: I50ac2ed3b54d6cffb5f96d3f8315279786c4f30f
@coeuvre
Copy link
Member

coeuvre commented Jul 8, 2024

What you described applies to the case where Bazel downloads outputs of a remote cached action.

However, in this specific case, the download happened when Bazel is fetching remote inputs to local action. If the download times out, Bazel has no choice but to stop the build.

@coryan
Copy link
Author

coryan commented Jul 8, 2024

However, in this specific case, the download happened when Bazel is fetching remote inputs to local action. If the download times out, Bazel has no choice but to stop the build.

Something is going over my head. The error I had in mind was:

ERROR: /workspace/ci/verify_current_targets/BUILD.bazel:31:9: Testing //:verify__bigquery failed: I/O exception during sandboxed execution: Download of '/cloud-cpp-testing-resources_cloudbuild/bazel-cache/fedora-latest-bazel-bazel-targets/cas/87c5c8fc54cd429a8f5d1c673e720f4bf4c28fe1462e5c9be5094847356fab37' timed out. Received 0 bytes.

The thing Bazel is trying to download is an artifact in the remote cache. Granted, it is the input for some action X, but it seems to me that it was created as the result of some action Y. Is there a reason Bazel cannot fallback to executing the action that created the (cached) artifact?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests