
Bazel JVM does not get killed when OOM #15959

Open
sluongng opened this issue Jul 23, 2022 · 9 comments
Labels
P4 (this is either out of scope or we don't have bandwidth to review a PR; no assignee), stale (Issues or PRs with no activity for 30 days), team-Core (Skyframe, bazel query, BEP, options parsing, bazelrc), type: support / not a bug (process)

Comments

@sluongng
Contributor

Description of the bug:

We ran into this issue on our side (Bazel 5.1.0).

java.log prints several OOM messages, but the JVM process stays alive and continues to accept subsequent commands from the client.

When this happens, Bazel actually runs into a deadlock right here https://cs.opensource.google/bazel/bazel/+/release-5.1.0:src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java;l=172-183 because the lock is still held on behalf of the previous client PID. That client is obviously long gone, but the server keeps the lock because it never finished the loading-and-analysis phase.

So I think there are two issues here:

  1. Make sure the Java process exits when an OOM happens.

  2. Make sure we can detect / bypass the potential deadlock so that a bazel shutdown command can be issued.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Not easy to reproduce. But after running many builds in CI, we randomly see a case like this about once every 2-3 weeks on one of our CI workers.

We use --host_jvm_args=-Xmx12G in our CI setup.

Which operating system are you running Bazel on?

Linux Ubuntu 20.04

What is the output of bazel info release?

release 5.1.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

N/A

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

N/A

Have you found anything relevant by searching the web?

Initially I thought this was related to #14093, but it's not. Our build hit the OOM during the loading-and-analysis phase, after the JVM had been reused for many builds. We primarily build C/C++, Go, Python, and Docker, so JavaBuilder should not be involved here.

Any other information, logs, or outputs that you want to share?

[Screenshots: java.log OOM messages and the OOM stack trace]

@comius added the team-Local-Exec label and removed the team-Rules-Java label Jul 25, 2022
@meisterT
Member

You can consider tweaking this flag: --experimental_oom_more_eagerly_threshold. Internally, we set it to 90.
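
For reference, a minimal sketch of how that could look in a CI bazelrc, assuming the threshold is set as an ordinary command option and using the 12G heap mentioned in this issue (with those numbers the server would deliberately OOM once roughly 10.8G of heap remains occupied after a full GC):

# Startup option: give the Bazel server JVM a 12G heap (matches the CI setup in this issue).
startup --host_jvm_args=-Xmx12G
# Roughly: deliberately OOM (crashing the server) when post-full-GC heap usage stays above 90%,
# instead of letting the server limp along and wedge subsequent commands.
build --experimental_oom_more_eagerly_threshold=90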

@meisterT removed their assignment Jul 27, 2022
@meisterT added the team-Core label and removed the team-Local-Exec label Jul 27, 2022
@sluongng
Contributor Author

You can consider tweaking this flag: --experimental_oom_more_eagerly_threshold. Internally, we set it to 90.

Hmm, let me try this flag.

Is there any general guidance on how to measure and set this value? For us, since we are setting

--host_jvm_args=-Xmx12G

losing 10 percent of that is actually 1.2G, which is significant. 🤔

@haxorz
Contributor

haxorz commented Jul 28, 2022

Thanks, @meisterT. Yeah, --experimental_oom_more_eagerly_threshold is the existing mechanism to address issue (1).

Is there any general guidance on how to measure and set this value?

Look at the sequence of "Memory usage after full GC" INFO log lines for a representative Bazel invocation [1]. This will give you a good sense of the size of Bazel's working set. If your working set is inherently above the threshold implied by --experimental_oom_more_eagerly_threshold=X, then you either need to (a) raise X, or (b) do some combination of (b.i) raising your Xmx and (b.ii) optimizing your build graph so that Bazel uses less memory.

From the log snippets in your screenshots, it sounds like you need to do (b). If you're interested in (b.ii), a separate user-support interaction would probably make sense.

Footnotes

  1. For more data, you could enable JVM logging and look at the GC logs of Bazel's JVM; a sketch of one way to do that follows below.
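
For example, a sketch of one way to do that; the unified-logging flag is standard JDK 9+ syntax, and the java.log path assumes Bazel's default output base layout, so adjust both for your setup:

# In the bazelrc: have the server JVM write its GC logs to a file (JDK unified logging syntax).
startup --host_jvm_args=-Xlog:gc*:file=/tmp/bazel-server-gc.log:time,uptime,level,tags

# After a representative invocation, pull the post-full-GC lines out of the server's INFO log.
grep "Memory usage after full GC" "$(bazel info output_base)/java.log"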

@sluongng
Contributor Author

Worth noting that this doesn't happen that often for us; it only happens when our Bazel JVM instance has been kept alive for many days, if not weeks, and reused for many builds. I have no doubt that we have a memory leak somewhere in our own build. 🤔

I think the right approach for now is to ensure the JVM is properly shut down when an OOM happens, so that we don't get stuck in a deadlock. Either that, or add a flag for the maximum time to wait for the lock and kill the JVM somewhere in this loop https://cs.opensource.google/bazel/bazel/+/release-5.1.0:src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java;l=172-183, perhaps with a special exit code. That way we can reliably ask users to retry (or even retry automatically) when this happens.

Only once we have exhausted these workarounds would we pick (b), as it requires changes to custom internal rules and migrating the existing codebase to newer rules. 😞

With that said, is it reasonable to get the deadlock condition addressed somehow? What do you guys think about a flag combination such as --max_wait_for_command_lock and --command_lock_max_wait_restart?

Internally we have to set --noblock_for_lock to avoid bazel runs approaching the CI job timeout limit (which we set generously). A combination like the above would let us unset --noblock_for_lock and run Bazel in CI with a lot more confidence 😆 (a rough sketch of the behavior we're after is below).
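
To make that concrete, here is a rough sketch of the behavior those (so far hypothetical) flags would replace, hand-rolled from existing flags in a CI wrapper; the retry policy and the unconditional shutdown on failure are illustrative, not something Bazel provides:

# --noblock_for_lock is a startup option, so it goes before the command: fail fast
# instead of queueing behind a possibly wedged server.
if ! bazel --noblock_for_lock test --config=ci //... ; then
  # NOTE: the failure might be a genuine build/test failure rather than lock contention;
  # a real wrapper should inspect the client's exit code before recycling the server.
  bazel shutdown
  bazel test --config=ci //...
fi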

@haxorz
Contributor

haxorz commented Jul 28, 2022

With that said, is it reasonable to get the deadlock condition addressed somehow?

I think we first need a stronger understanding of the issue.

I read what you wrote in your original post, but that doesn't match my understanding of how the code works. The java-land lock isn't being physically held by a specific client-land pid; it's being held by a java-land thread (one that is servicing a gRPC-server-side RPC from a client, yes). From your symptoms, it sounds like that control flow exited the synchronized (commandLock) block without releasing the underlying object monitor lock for commandLock. That seems like unexpected/impossible JVM behavior.

Do you have a full INFO log, rather than just those screenshots of snippets? For example, I'm curious whether there are log lines that would tell us whether the gRPC command thread definitely did or did not proceed past that synchronized block.

@sgowroji
Member

sgowroji commented Sep 5, 2022

Hello @sluongng, did you get a chance to check the reply comments above? Thanks!

@sluongng
Contributor Author

sluongng commented Sep 5, 2022

@sgowroji hey, thanks for the ping.

@haxorz It's very hard for our customer to set up Bazel JVM monitoring in their infrastructure, so I could not share how the JVM's memory usage increased over time. I do realize there are metrics I could export from both the memory_profile and buildMetrics events, but our customer does not have the capability to consume those metrics right now. Worth noting that most commercial BES implementations today do not come with good telemetry support, so collecting these metrics comes at a higher cost than usual.

A typical Bazel CI log of a stuck run looks like this:

# bazel test --config=ci //...
(08:43:01) Loading: 
(08:43:01) Loading: 0 packages loaded
(08:43:02) Loading: 1629 packages loaded
    currently loading: <some/pkg> ... (17 packages)
(08:43:03) Analyzing: 12088 targets (5246 packages loaded, 0 targets configured)
(08:43:04) Analyzing: 12088 targets (5933 packages loaded, 28442 targets configured)
(08:43:05) Analyzing: 12088 targets (5950 packages loaded, 40668 targets configured)
(08:43:06) Analyzing: 12088 targets (6008 packages loaded, 43181 targets configured)
(08:43:07) Analyzing: 12088 targets (6086 packages loaded, 67904 targets configured)
(08:43:08) Analyzing: 12088 targets (6281 packages loaded, 70304 targets configured)
(08:43:09) Analyzing: 12088 targets (7022 packages loaded, 99814 targets configured)
(08:43:11) Analyzing: 12088 targets (7779 packages loaded, 117662 targets configured)
(08:43:12) Analyzing: 12088 targets (14200 packages loaded, 155814 targets configured)

The analysis cache kept accumulating more nodes over time, and this eventually caused an OOM somewhere in Skyframe. As shown in a previous screenshot, the stack trace was thrown at

com.google.devtools.build.skyframe.InMemoryGraphImpl.getBatch

This happened during the analysis phase, which suggests it is independent of action execution.


Since I made this post, we have implemented a mechanism that, on a scheduled basis, simply runs bazel shutdown && bazel test //... to recycle the JVM on a random set of CI workers (sketch below). This prevents a "long-running" (weeks / months) Bazel JVM and thus reduces the impact of the memory leaking over time. We have not seen this issue again since implementing it.
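
For anyone who wants to copy this, the recycling step is essentially just the following, scheduled (e.g. nightly) on a random subset of workers; the workspace path is illustrative:

#!/bin/bash
set -euo pipefail
cd /path/to/workspace   # illustrative; run from the repo that owns the long-lived server
# Kill the long-lived Bazel server, then immediately re-warm the analysis cache.
bazel shutdown
bazel test //...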

What I would suggest is that we leave this issue open as a reference for other folks troubleshooting the same thing.

If anyone else runs into the same issue, please share here.

@haxorz added the P4 label Dec 2, 2022
@haxorz
Contributor

haxorz commented Dec 2, 2022

Setting this to P4 because --experimental_oom_more_eagerly_threshold already exists (as mentioned in this comment).

And in this comment @sluongng requested the issue be left open for awareness purposes.

@haxorz removed the untriaged label Feb 6, 2023

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs. If you think this issue is still relevant and should stay open, please post any comment here and the issue will no longer be marked as stale.

@github-actions bot added the stale label Apr 12, 2024