Bazel is crashing while uploading BES events to BuildBuddy #992

Closed
amit-mittal opened this issue Sep 10, 2021 · 9 comments

Comments

@amit-mittal

amit-mittal commented Sep 10, 2021

We host the on-prem version of BuildBuddy on Kubernetes. When running some Bazel targets, Bazel crashes even after the build and tests have succeeded.

Bazel: v4.1.0
BuildBuddy: v2.5.3 (and v2.3.3)

Stack trace:

INFO: Build completed successfully, 98 total actions
WARNING: BES was not properly closed
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.util.concurrent.RejectedExecutionException: Task com.google.common.util.concurrent.TrustedListenableFutureTask@34c86160[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda$755/0x000000080073c440@1a8c36e8]]] rejected from java.util.concurrent.ThreadPoolExecutor@66d323f6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 648]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
	at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:586)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
	at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:66)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.upload(ByteStreamBuildEventArtifactUploader.java:220)
	at com.google.devtools.build.lib.buildeventstream.BuildEventArtifactUploader.uploadReferencedLocalFiles(BuildEventArtifactUploader.java:100)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.enqueueEvent(BuildEventServiceUploader.java:196)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceTransport.sendBuildEvent(BuildEventServiceTransport.java:95)
	at com.google.devtools.build.lib.runtime.BuildEventStreamer.post(BuildEventStreamer.java:268)
	at com.google.devtools.build.lib.runtime.BuildEventStreamer.buildEvent(BuildEventStreamer.java:472)
	at com.google.devtools.build.lib.runtime.BuildEventStreamer.buildEvent(BuildEventStreamer.java:481)
	at com.google.devtools.build.lib.runtime.BuildEventStreamer.clearPendingEvents(BuildEventStreamer.java:307)
	at com.google.devtools.build.lib.runtime.BuildEventStreamer.clearEventsAndPostFinalProgress(BuildEventStreamer.java:634)
	at com.google.devtools.build.lib.runtime.BuildEventStreamer.close(BuildEventStreamer.java:354)
	at com.google.devtools.build.lib.runtime.BuildEventStreamer.closeOnAbort(BuildEventStreamer.java:336)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.forceShutdownBuildEventStreamer(BuildEventServiceModule.java:409)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:578)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:626)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:603)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:231)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:543)
	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:606)
	at io.grpc.Context$1.run(Context.java:579)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Bazel exited with status code 37
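
For readers unfamiliar with the exception at the top of the trace: a java.util.concurrent.ThreadPoolExecutor using the default AbortPolicy throws RejectedExecutionException when a task is submitted after the pool has been shut down. The sketch below reproduces only that generic JDK behavior. The class and method names are hypothetical, and this is not Bazel's uploader code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdown {

    /**
     * Submits a task to a pool that has already been shut down.
     * Returns true if the submission was rejected.
     */
    static boolean submitAfterShutdown() {
        // newFixedThreadPool creates a ThreadPoolExecutor with the default AbortPolicy.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.shutdown(); // no new tasks are accepted from this point on

        Runnable upload = () -> { /* stand-in for an artifact upload (hypothetical) */ };
        try {
            pool.submit(upload);
            return false;
        } catch (RejectedExecutionException e) {
            // Same exception type as in the crash report above.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected after shutdown: " + submitAfterShutdown());
    }
}
```

In the trace above the pool is reported as Terminated with 648 completed tasks, which is consistent with the uploader's executor having been shut down before BuildEventStreamer.close posted its final events, though nothing in this thread confirms the exact sequence.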

On the next bazel command we see the message below, which indicates that the Bazel server had crashed and is being restarted.

$ bazel info
Starting local Bazel server and connecting to it...

Related error I am seeing in the BuildBuddy server logs.

WRN Error receiving build event stream build_id:"e39b864b-e734-4791-b11f-dc01fdd1256e" invocation_id:"556bdbef-692a-450a-bc2a-ab77292cb5e5" component:TOOL: rpc error: code = Canceled desc = context canceled
WRN Marking invocation "556bdbef-692a-450a-bc2a-ab77292cb5e5" as disconnected: rpc error: code = Canceled desc = context canceled

Relevant settings in our .bazelrc

build --remote_cache=grpcs://...
build --remote_upload_local_results=false # Developers don't populate the remote cache
build --bes_results_url=https://.../invocation/
build --bes_backend=grpcs://....
build --bes_timeout=60s
build --bes_upload_mode=nowait_for_upload_complete

Please let me know if you need any other details to debug the issue.

@siggisim
Member

Hey @amit-mittal - thanks for reporting!

This type of error message most commonly occurs when your BuildBuddy instance restarts for whatever reason. Are you able to see if your BuildBuddy app has any restarts? Could it be running out of memory?

@amit-mittal
Author

Hey @siggisim - thanks for looking into it!

No, I don't see any crashes or restarts on the BuildBuddy server side. The node on which BuildBuddy is running has 60 GB memory, while only 16 GB is being used.

So far, we are able to repro this issue only while running tests of one of our Go services (if that is relevant).

@siggisim
Member

siggisim commented Sep 10, 2021

Interesting, is there any chance this is a CI run that runs multiple Bazel invocations back-to-back?

By default, Bazel will try to re-use the build event stream connection across invocations - which could be leading to the issues we're seeing here. You can disable this behavior with the bazel flag --keep_backend_build_event_connections_alive=false. I suspect that might solve this issue.

@amit-mittal
Author

I added --keep_backend_build_event_connections_alive=false to our .bazelrc, but now I see the errors below when running bazel build .... one after another, which never used to happen for us. The errors go away if I remove the new setting, so I don't think the overhead of creating a new connection on every run would work for us.

WARNING: The background upload of the Build Event Protocol for the previous invocation failed with the following exception: 'com.google.devtools.build.lib.util.AbruptExitException: The Build Event Protocol upload failed: All retry attempts failed. UNAVAILABLE: UNAVAILABLE: Channel shutdown invoked UNAVAILABLE: UNAVAILABLE: Channel shutdown invoked'. Ignoring the failure and starting a new invocation..
WARNING: The background upload of the Build Event Protocol for the previous invocation failed to complete in 5.003 seconds. Cancelling and starting a new invocation...

I don't think it should matter, but the run is happening on a developer machine (macOS).

Regarding the multiple invocations, we are NOT running bazel commands in parallel, but as part of the usual developer workflow we do run bazel commands one after another. That is one of the reasons we have --bes_upload_mode=nowait_for_upload_complete set, so that developers are not blocked while the events are being uploaded.

@siggisim
Member

There are (unfortunately) lots of Bazel bugs with the --bes_upload_mode= flag:

There is some work being done to improve the BES artifact uploader:

Do you see the same error without that --bes_upload_mode= flag?

Do you see the same error if you add the flags --remote_timeout=3600 and --bes_timeout=3600s (wondering if a timeout is being hit and not handled gracefully)?
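
As a generic illustration of what handling a timeout gracefully can look like in Java, the sketch below bounds the wait with Future.get, cancels the task when the timeout is hit, and returns a status instead of letting the exception propagate. It rests on assumptions throughout: the class name, method name, and status strings are hypothetical, and it is not Bazel's BES code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class UploadTimeoutSketch {

    /**
     * Runs a simulated upload that takes taskMillis and waits at most
     * timeoutMillis for it. Always returns a status string.
     */
    static String awaitUpload(long taskMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> upload = pool.submit(() -> {
            Thread.sleep(taskMillis); // stand-in for the real upload work
            return "uploaded";
        });
        try {
            return upload.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            upload.cancel(true); // interrupt the still-running task
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause();
        } finally {
            pool.shutdownNow(); // release the worker thread in every case
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitUpload(10, 2000)); // finishes well within the limit
        System.out.println(awaitUpload(2000, 50)); // exceeds the limit
    }
}
```

The point of the design is that the timeout becomes an ordinary return value the caller can log or retry on, rather than an uncaught exception that takes the process down.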

@amit-mittal
Author

That's true! 😞

I don't think we would be able to change the --bes_upload_mode to a blocking call, but we can try it out. As we upload the events in async mode, we'll also try increasing the timeout and share the findings.

We will also prioritize upgrading bazel to v4.2.1, to pick up the fixes in the BES uploader, if there were any. Thanks for helping to investigate the issue!

@BalestraPatrick

BalestraPatrick commented Sep 22, 2021

We also see this crash only on our Linux CI (macOS CI is fine). We're on Bazel 4.2.0 and we don't even set --bes_upload_mode, but we still see the exact same crash. Setting --keep_backend_build_event_connections_alive=false did not make a difference for us.

@siggisim
Member

This is likely fixed by bazelbuild/bazel#13959 which hasn't made it into any Bazel releases yet.

Are either of you able to share your grpc log for one of these invocations captured with Bazel's --experimental_remote_grpc_log=? You can send it to siggi@buildbuddy.io

@siggisim
Member

siggisim commented Jan 5, 2022

Going to close this issue now that bazelbuild/bazel@e855a26 seems to have made it into Bazel 5.0 release candidates bazelbuild/bazel#14013

Please re-open this issue if you're able to reproduce this with Bazel 5.0 (in which case either a grpc log or a BuildBuddy Cloud invocation would be super helpful).
