
Explain java.lang.OutOfMemoryError during bazel bootstrapping #2177

Closed
i3v opened this issue Dec 4, 2016 · 8 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Local-Exec Issues and PRs for the Execution (Local) team type: feature request

Comments


i3v commented Dec 4, 2016

I was trying to bootstrap bazel 0.4.1 (and 0.4.0) and hit multiple errors, such as:

Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread

# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create GC thread. Out of system resources.

and

bazel ran out of memory and crashed.
In the end, I was able to build it successfully, and I'd like to share what I tried and what actually worked:

  1. I'm not the only one who ran into this issue (see e.g. "Blaze ran out of memory and crashed." #885 and "compiling from source gives java.lang.OutOfMemoryError: unable to create new native thread" #1341). Some users even gave up trying to make this work. Reading those threads definitely gave me some clues about what's going on, but they still don't provide all the necessary information, and the official user manual is of no help here.

  2. Although all these error messages may look like "there's not enough free RAM", the actual issue is not related to the amount of free RAM at all. At least in my case: I had ~190GB of free system RAM and no user-specific memory limit (see the ulimit output below).

  3. Nor is it related to the Java heap size. At least in my case, creating the following "MaxMemory.java" file:

class MaxMemory {
    public static void main(String[] args) {
        System.out.println(Runtime.getRuntime().maxMemory());
    }
}

and running it gives me a rather large value.

$ javac MaxMemory.java
$ java MaxMemory
28631367680
  4. Playing with other Java options like "-Xss" or "-XX:ParallelGCThreads=2" also had no effect.
  5. Java options can only be specified in "compile.sh" - environment variables such as export _JAVA_OPTIONS="-XX:ParallelGCThreads=2" are ignored.
  6. I'm not sure whether it's possible to use normal bazel options like --jobs=10; at least, the "bazelrc" file is not picked up.
  7. The real issue is the number of threads bazel starts, and the fact that the user is unable to decrease that number. The number of threads a user is allowed to run can be obtained via ulimit -u (threads and processes count against the same limit here). My limit on this system is 1024 (see the ulimit output below). A small probe illustrating this is sketched right after this list.
  8. To monitor the number of currently running threads, I use watch -n 0.5 "ps -f -T -u myusername |grep -v grep|wc -l".
  9. During my observations, bazel was starting ~800 processes (this post mentions a "200 jobs" default).
  10. With ~900 "free process slots", bazel bootstrapping runs flawlessly.
  11. With ~750 "slots", there's only a ~30% chance of compiling it. Yes - according to my experiments, this is a non-deterministic process. So, if you've already killed all unnecessary processes but ulimit -u minus your current number of processes is still only ~750, just try ten more times; there's a good chance you'll make it :)
  12. My system has 20 CPU cores, and one of my ideas was to set the affinity like taskset -c 0 ./compile.sh to confine all processes to a single CPU. Assuming that child processes inherit the affinity of the parent, one might expect that any dynamic parallelism inside would then use significantly fewer processes. However, this seemed to have no effect.
  13. @po0ya found a solution that might work in many cases: increase the number of processes allowed via ulimit -u 100000. It's a pity that I noticed his post only after I had solved the issue myself. Surprisingly, a user may be able to significantly increase the allowed number of processes without root privileges.
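Not from the original post - a minimal sketch, written here purely for illustration of item 7: a plain Java program that does nothing but spawn idle threads hits the very same OutOfMemoryError once the per-user process/thread limit (ulimit -u) is exhausted, no matter how large the heap is.

import java.util.concurrent.CountDownLatch;

// Hypothetical probe, not part of bazel: spawn parked daemon threads until the JVM
// can no longer create a native thread. With "ulimit -u 1024" this fails long
// before any heap limit is reached.
class ThreadProbe {
    public static void main(String[] args) {
        CountDownLatch parkForever = new CountDownLatch(1);
        int created = 0;
        try {
            while (true) {
                Thread t = new Thread(() -> {
                    try {
                        parkForever.await();  // keep the thread alive but idle
                    } catch (InterruptedException ignored) {
                    }
                });
                t.setDaemon(true);
                t.start();
                created++;
            }
        } catch (OutOfMemoryError e) {
            // Typically "unable to create new native thread" - the same error bazel hits.
            System.out.println("Thread creation failed after " + created + " threads: " + e.getMessage());
            parkForever.countDown();
        }
    }
}

On a machine with a generous ulimit -u this loop can create thousands of threads, so it is best run only in a shell where ulimit -u has been deliberately lowered.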

ulimit:

-bash-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 774201
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

So, as a feature request:

  1. Maybe it would be nice to add a sentence like "To bootstrap bazel, you must be allowed to run ~800 threads" to the documentation, or to emit a warning when this condition is not satisfied at "./compile.sh" startup (a possible check is sketched below). Is there a chance that a pull request like that would be approved?
  2. Maybe it would be nice to decrease this number somehow?
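Not something from the original report - the following is a rough, Linux-only sketch of what such a startup warning could look like; the ~800 threshold and the idea of parsing /proc/self/limits are my assumptions, not anything bazel does today.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical pre-flight check: warn if the soft "Max processes" limit looks too
// low for bootstrapping. The 800 threshold is taken from the observations above.
class ProcessLimitCheck {
    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/self/limits"))) {
            if (line.startsWith("Max processes")) {
                // Line format: "Max processes   <soft>   <hard>   processes"
                String[] fields = line.trim().split("\\s+");
                if (fields[2].equals("unlimited")) {
                    return;  // no per-user limit, nothing to warn about
                }
                long soft = Long.parseLong(fields[2]);
                if (soft < 800) {
                    System.err.println("WARNING: max user processes (ulimit -u) is " + soft
                            + "; bootstrapping bazel may fail with"
                            + " 'unable to create new native thread'.");
                }
                return;
            }
        }
    }
}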
@damienmg damienmg added category: sandboxing P2 We'll consider working on this in future. (Assignee optional) type: feature request category: performance and removed category: sandboxing labels Dec 5, 2016

damienmg commented Dec 5, 2016

IMO we should just use a lower number of threads if we cannot use that many threads.


ulfjack commented Dec 8, 2016

We can't automatically detect this situation. However, for non-network file systems and no remote execution, we don't need nearly this many threads, so maybe we should lower the default for Bazel (while keeping it higher internally).


philwo commented Dec 9, 2016

@ulfjack I agree that we should lower the default value of --jobs. Maybe to the number of CPU threads available on the machine, WDYT?
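Not from the thread itself - a tiny sketch of the default being proposed here, assuming it simply means sizing the pool by Runtime.availableProcessors() instead of a fixed 200:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration: derive the default job/thread count from the number
// of CPU threads on the machine rather than from a large hardcoded constant.
class CpuBoundDefault {
    public static void main(String[] args) {
        int defaultJobs = Runtime.getRuntime().availableProcessors();
        System.out.println("default --jobs would be " + defaultJobs);

        // A worker pool sized this way stays well under typical ulimit -u values.
        ExecutorService pool = Executors.newFixedThreadPool(defaultJobs);
        pool.shutdown();
    }
}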

@philwo philwo added this to the 0.5 milestone Dec 9, 2016

ulfjack commented Dec 9, 2016

I think that's a reasonable default, but we need to make sure that our internal default is higher (using the invocation policy?). Also, note that we have a separate flag for loading phase threads.

edbaunton commented

On a host with limited resources, the number of threads can sometimes overwhelm the JVM. I suffered symptoms similar to those reported here.

I believe the root cause of this is the hardcoding of the number of threads in the SkyframeExecution phase:

System.getenv("TEST_TMPDIR") == null ? 200 : 5;

and seemingly also in the BuildView phase:

return System.getenv("TEST_TMPDIR") == null ? 200 : 5;

The --jobs flag limits the number of concurrent jobs to run during bazel’s execution; however, it does not adjust the number of threads that bazel creates internally (and so doesn't help with the problem detailed earlier). This number is currently hardcoded at 200 in the two places mentioned above.

Before going to the effort of making any speculative changes, I wanted to explore the possible approaches to reducing/controlling the number of threads that bazel creates:

  1. Introduce a new environment variable that would allow the user to modify the number of threads created internally. This could then be added to the test framework, and the TEST_TMPDIR check removed in favour of this more specific and explicit option. However, it has the downside that it might cause confusion with the --jobs option and cannot be specified on the command line.
  2. Adjust the resolution of the number of threads to take the --jobs option into account - perhaps take the lower of jobs or threads (see the sketch after this list).
  3. Add a new command line option to control the number of threads (--threads?).
  4. Make the thread calculation logic "smart", perhaps as proposed by @philwo earlier in this thread.
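Not from the thread - a rough sketch of option 2, using made-up names (threadPoolSize, jobsFlagValue, which are not bazel internals) purely to illustrate "take the lower of jobs or threads":

// Hypothetical illustration of option 2: keep the existing 200/5 defaults, but never
// spawn more internal threads than the --jobs value allows.
class ThreadCountResolution {
    static int threadPoolSize(int jobsFlagValue) {
        boolean underTest = System.getenv("TEST_TMPDIR") != null;
        int hardcodedDefault = underTest ? 5 : 200;  // current behaviour
        return Math.max(1, Math.min(hardcodedDefault, jobsFlagValue));
    }

    public static void main(String[] args) {
        System.out.println(threadPoolSize(10));   // -> 10 instead of 200
        System.out.println(threadPoolSize(500));  // -> 200, the existing cap
    }
}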

edbaunton commented

Would I be better off taking this to bazel-dev rather than trying to revive this old issue?

@meisterT meisterT added team-Performance Issues for Performance teams team-Execution and removed category: performance labels Nov 29, 2018
@jin jin added team-Local-Exec Issues and PRs for the Execution (Local) team and removed team-Execution labels Jan 14, 2019
@dslomov dslomov removed the team-Performance Issues for Performance teams label Feb 15, 2019
@meisterT meisterT removed this from the 0.7 milestone May 12, 2020

jmmv commented May 13, 2020

We fixed --jobs a while ago to default to the number of CPU threads. We've also parameterized many of the thread pools that had large hardcoded values and changed some to use more reasonable values. Thus, given the lack of activity here and in similar reports, I suspect the usability problems are resolved. (We are continuing to work on this area, as we still spawn many more threads than we should, but I don't think they cause major problems.)

@jmmv jmmv closed this as completed May 13, 2020
brentleyjones commented

I've started to run into issues when running multiple bazel builds (for different workspaces) against remote execution at the same time. I've had to lower --jobs to a number that makes the remote execution not worth it. What other settings can I change to reduce the number of threads spawned by Bazel?
