Bazel JVM does not get killed when OOM #15959
Comments
You can consider tweaking this flag:
Hmm let me try this flag. Is there some guidance in general about how to go about measuring and setting this value? For us, as we are setting
losing 10 percent of this is actually |
Thanks, @meisterT. Yeah,
Look at the sequence of "Memory usage after full GC" INFO log lines for a representative
From the log snippets in your screenshots, it sounds like you need to do (b). If you're interested in (b.ii), a separate user support interaction would probably make sense.
It is worth noting that these don't happen that often for us; it only happens when our Bazel JVM instance has been kept alive for many days, if not weeks, and has been reused for multiple builds. I have no doubt that we have a memory leak somewhere in our internal build. 🤔
I think the current approach is to ensure that the JVM is properly shut down when an OOM happens so that we are not stuck in a deadlock. Either that or implement a flag of
Once we have exhausted these workarounds, we would then pick (b), as it requires changes to custom internal rules and migrating the existing code base to newer rules. 😞
With that said, is it reasonable to get the deadlock condition addressed somehow? What do you guys think about a flag combination such as
Internally we have to set
I think we first need a stronger understanding of the issue. I read what you wrote in your original post, but that doesn't match my understanding of how the code works. The java-land lock isn't being physically held by a specific client-land pid; it's being held by a java-land thread (servicing a gRPC-server-side rpc from a client, yes). From your symptom it sounds like that control flow exited from that
Do you have a full INFO log rather than just those screenshots of snippets? For example, I'm curious if there are log lines that will tell us whether the execution of the gRPC command thread definitely did or did not proceed outside that synchronized block.
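To make the point above about lock ownership concrete, here is a minimal Java sketch. It is not Bazel's code: the class and method names are hypothetical, and a ReentrantLock with a timed tryLock stands in for the synchronized block purely for illustration. It shows that a lock belongs to the server-side thread servicing a command, so it stays held for as long as that thread is stuck, and that a later command could detect this instead of blocking indefinitely.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a command dispatcher; not Bazel's actual implementation.
class CommandLockSketch {
    private final ReentrantLock commandLock = new ReentrantLock();

    // Tries to take the command lock; returns false instead of blocking forever.
    boolean tryRunCommand(Runnable command, long timeoutMillis) throws InterruptedException {
        if (!commandLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false; // another server-side thread still owns the lock
        }
        try {
            command.run();
            return true;
        } finally {
            commandLock.unlock();
        }
    }

    // Simulates a first command whose thread never finishes, then attempts a second command.
    static boolean secondCommandGetsLock() {
        CommandLockSketch dispatcher = new CommandLockSketch();
        CountDownLatch firstHoldsLock = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            try {
                dispatcher.tryRunCommand(() -> {
                    firstHoldsLock.countDown();
                    try {
                        neverReleased.await(); // stands in for a phase that never completes
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, 0);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        stuck.setDaemon(true);
        stuck.start();
        try {
            firstHoldsLock.await();
            // The lock is owned by the stuck thread, so this times out and returns false.
            return dispatcher.tryRunCommand(() -> {}, 100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            stuck.interrupt(); // let the simulated stuck thread finish and release the lock
        }
    }

    public static void main(String[] args) {
        System.out.println("second command acquired lock: " + secondCommandGetsLock());
    }
}
```

In this sketch the owner is the thread, not any client process, which is why the full INFO log matters: it would show whether the servicing thread actually left the locked region.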
Hello @sluongng, did you get a chance to check the reply comments above? Thanks!
@sgowroji hey, thanks for the ping. @haxorz It's very hard for our customer to set up Bazel JVM monitoring in their infrastructure, so I could not share how the JVM memory increased over time. I do realize that there are metrics that I could export from both memory_profile and buildMetrics events, but our customer does not have the capability to consume these metrics right now. It is worth noting that most commercial BES implementations today do not come with good telemetry support, so collecting these metrics comes with a higher cost than usual. A Bazel CI log of a stuck run is very typical:
More nodes were being added to the analysis cache over time, and this caused an OOM somewhere in Skyframe. As shown in a previous screenshot, the stack trace was thrown at.
This happened during the analysis phase, which suggests that it is independent of action execution. Since I made this post, we have implemented a mechanism so that, on a scheduled basis, we would just run a
What I would suggest is that we leave this issue open as a reference for other folks to troubleshoot. If anyone else is running into the same issue, please share here.
Setting this to
And in this comment @sluongng requested the issue be left open for awareness purposes.
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs. If you think this issue is still relevant and should stay open, please post any comment here and the issue will no longer be marked as stale. |
Description of the bug:
We ran into this issue on our side (Bazel 5.1.0): java.log prints several OOM messages, but the JVM process is kept alive and continues to take in subsequent commands from the client.
When this happens, Bazel actually runs into a deadlock right here https://cs.opensource.google/bazel/bazel/+/release-5.1.0:src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java;l=172-183 as the lock was held by the previous client PID. The old client PID is obviously long gone, but the server still kept the lock because it never finished the loading-and-analysis phase.
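The symptom described above, where OOM messages are logged but the process stays alive, is consistent with general JVM behaviour: an error that escapes one thread ends that thread only, not the process. The following is a minimal Java sketch of that behaviour, not a reproduction of this bug. The class and method names are hypothetical, and the OutOfMemoryError is thrown by hand rather than caused by a real failed allocation.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only; names are hypothetical and unrelated to Bazel's code.
class OomSurvivalSketch {
    // Runs a worker that dies with a simulated OutOfMemoryError and returns what killed it.
    static Throwable runWorkerThatFails() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            // Simulated: a real OOM would come from a failed allocation.
            throw new OutOfMemoryError("simulated");
        });
        // Only records the error. A handler that wanted the process to die
        // could call Runtime.getRuntime().halt(status) here instead.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Throwable error = runWorkerThatFails();
        System.out.println("worker died with: " + error);
        System.out.println("JVM still running on thread: " + Thread.currentThread().getName());
    }
}
```

HotSpot also has a -XX:+ExitOnOutOfMemoryError option that terminates the JVM on a real OutOfMemoryError. Whether passing it through --host_jvm_args interacts well with Bazel's own OOM handling is not something this thread confirms.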
So I think there are 2 issues here:
Make sure the java process exits when an OOM happens
Make sure that we can bypass / detect the potential deadlock to issue a bazel shutdown command
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Not easy to reproduce. But after running many builds in CI, we would randomly see a case like this once every 2-3 weeks on one of our CI workers.
We use --host_jvm_args=-Xmx12G in our CI setup.
Which operating system are you running Bazel on?
Linux Ubuntu 20.04
What is the output of bazel info release?
release 5.1.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
N/A
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
Have you found anything relevant by searching the web?
Initially I thought this was related to #14093, but it's not. Our build hit OOM during the loading-and-analysis phase, after the JVM had been reused for a lot of builds. We primarily build C/C++, Go, Python, and Docker, so Java Builder should not be a part of this.
Any other information, logs, or outputs that you want to share?