Java crashes due to hsperfdata file conflicts across sandboxes #3236
Comments
Can you please send the output of "cat /proc/mounts", "free -m", "df -h" and "df -i" from inside the Docker container where the SIGBUS happened? I think this happens when /tmp does not have enough space or is a tmpfs and there's not enough free RAM to back all files that the JVM wants to create via mmap there, but I had a hard time tracking it down exactly. |
@philwo - The machine should be crazy strong... Maybe even too strong? :-)
|
Whoah, yes, resources are not an issue on that machine. :) Still, the only thing I could ever find about JVMs crashing with SIGBUS is this: http://bugs.java.com/view_bug.do?bug_id=6563308 I wonder if it's the same issue. Do you use any --sandbox_* flags for your build? Maybe --sandbox_tmpfs_path=...? |
No... I have some test targets where I set the |
@philwo any ideas? The problem is that network isolation on our machine (a Docker container) doesn't work without enabling user namespaces, fails for a different reason on 0.5.1 when they are enabled, and doesn't build on head when they are enabled. |
I'm trying to find out why this is happening now. |
Can you please try using the flag "--sandbox_tmpfs_path=/tmp" in your "bazel build" (or "bazel test") command and see if the error still happens? This will mount an empty tmpfs on /tmp for each running action. It's generally not a bad idea, because it increases hermeticity (otherwise /tmp is a writable directory shared between all actions of a build, so they could create conflicting files or accidentally keep state there.) An alternate idea might be to mount a tmpfs on /tmp inside the Docker container before running the bazel command, but I'd like to try the first one first, because I remember that it helped a different user. It would be interesting to see if this makes the problem disappear. |
A process gets SIGBUS in one of two conditions:
The first thing might happen when multiple JVM processes try to use the same file via mmap and one of them truncates it, I guess. Is the failing action always the same kind, e.g. a Scala compilation? The second possibility is more likely, but you have 60G free on / and /tmp is not a mount point, so it's hard to see how you can run out of disk space during a build... OTOH you have almost four times as much RAM as free disk space on that machine - maybe some process uses a heuristic like "let's allocate a temp file with the size of 1/4 the RAM, this should always be a reasonable number"? I also don't understand why this only happens when you use the linux-sandbox and not when you use the standalone strategy. Do I understand correctly that the build works fine then? |
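To make the first scenario concrete (one process truncating a file that another process still has mapped), here is a minimal Python sketch for Linux. It is only an illustration, not code from Bazel or the JVM: the function name, the temporary file, and the 32 KiB size are assumptions. It sets up the state in which a read through the mapping would be expected to fault, but deliberately never touches the mapped pages, so it does not crash.

```python
import mmap
import os
import tempfile

MAPPING_SIZE = 32768  # 32 KiB; illustrative size, not taken from any JVM source


def truncated_mapping_state():
    """Map a file, truncate it via its path, and return (mapping_length, file_size)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a shared perf-data file.
        path = os.path.join(tmp, "2")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.ftruncate(fd, MAPPING_SIZE)
            # On Unix, mmap.mmap defaults to a shared, read/write mapping.
            mapping = mmap.mmap(fd, MAPPING_SIZE)
            try:
                # A second process would do this through its own descriptor;
                # truncating by path has the same effect on the file.
                os.truncate(path, 0)
                # The mapping keeps its length, but no file data backs it now.
                return len(mapping), os.fstat(fd).st_size
            finally:
                mapping.close()
        finally:
            os.close(fd)
```

After the call, the mapping still reports its full length while the file is empty. Reading a byte of the mapping at that point is the access that would be expected to raise SIGBUS, which is why the sketch stops short of it.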
@philwo using |
Also, this should be solved right? Shouldn't it be a release blocker? |
Nice! :)
It shouldn't.. actually it should make things a bit faster, because /tmp is now backed by RAM instead of disk. On the other hand, it might be possible that mounting the tmpfs incurs some overhead, too. If you measure a noticeable difference, I'd be quite interested in it.
Absolutely! The problem is, I cannot reproduce this on my machine and the cause is completely unknown. :( It also doesn't seem to affect many people. If we had a clear repro case or someone more familiar with JVM internals, it might be easier to get to the bottom of this issue. |
What I'd be really interested in is if simply mounting a tmpfs on /tmp and not using the --sandbox_tmpfs_path flag also helps, or if the issue then happens again. What I mean is:
If the issue reoccurs, I believe that we're seeing a race condition here, maybe related to this: https://stackoverflow.com/questions/76327/how-can-i-prevent-java-from-creating-hsperfdata-files Maybe the JVMs in a highly parallel build accidentally create hsperfdata files with the same name and when one truncates the file of another running JVM, that JVM gets a SIGBUS because the file underlying its mmap went away. But this is really just a guess. More info: http://www.evanjones.ca/jvm-mmap-pause.html |
OMG, wait, I got it |
root@ubuntu:~# strace -f -- java HelloWorld
Every JVM creates a temporary performance instrumentation file in /tmp/hsperfdata_$USERNAME/$PID. When we use sandboxing, we use PID namespaces, which means that the PIDs are virtualized and all running JVMs believe they are PID 2.
This means that they all open/ftruncate/mmap the same file, and that eventually gives you SIGBUS, due to case 1 I mentioned above: "It tries to read an address that no longer exists from an mmap'd file".
When you use --sandbox_tmpfs_path=/tmp, each running sandbox gets its own /tmp, so the files don't conflict.
This means the solution is quite simple and I'll come up with something on Monday. For now, I'd recommend using the --sandbox_tmpfs_path=/tmp flag. |
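The collision described above can be modeled in a few lines of Python. This is a toy model under stated assumptions, not Bazel or JVM code: `hsperfdata_path` and `find_collisions` are hypothetical helpers, and the `tmp_mount` label is just a stand-in for "which filesystem backs this sandbox's /tmp".

```python
import posixpath
from collections import defaultdict


def hsperfdata_path(username, pid, tmp_root="/tmp"):
    """Path layout described in the thread: /tmp/hsperfdata_$USERNAME/$PID."""
    return posixpath.join(tmp_root, f"hsperfdata_{username}", str(pid))


def find_collisions(jvms):
    """Return perf-data files that more than one JVM would open.

    jvms is an iterable of (action_name, tmp_mount, username, pid) tuples.
    tmp_mount is a hypothetical label for whatever backs /tmp in that sandbox:
    the same label means the same underlying directory.
    """
    users = defaultdict(list)
    for action_name, tmp_mount, username, pid in jvms:
        users[(tmp_mount, hsperfdata_path(username, pid))].append(action_name)
    return {key: names for key, names in users.items() if len(names) > 1}


# Shared /tmp: both sandboxed JVMs think they are PID 2, so they share a file.
shared_tmp = [("action-a", "host", "root", 2), ("action-b", "host", "root", 2)]

# Per-sandbox /tmp: same path string, but a different filesystem behind it.
private_tmp = [("action-a", "tmpfs-a", "root", 2), ("action-b", "tmpfs-b", "root", 2)]
```

With `shared_tmp`, `find_collisions` reports `/tmp/hsperfdata_root/2` as used by both actions; with `private_tmp` it returns an empty dict, which mirrors why giving each sandbox its own /tmp avoids the conflict.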
Well done!
…On Sat, Jun 24, 2017 at 12:21 AM Philipp Wollermann wrote:
@aehlig @ulfjack FYI.
root@ubuntu:~# strace -f -- java HelloWorld
[pid 1432] open("/tmp/hsperfdata_root", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 1432] fchdir(4) = 0
[pid 1432] open("1431", O_RDWR|O_CREAT|O_NOFOLLOW, 0600) = 5
[pid 1432] ftruncate(5, 0) = 0
[pid 1432] mmap(NULL, 32768, PROT_READ|PROT_WRITE, MAP_SHARED, 5, 0) = 0x7f54cad93000
|
Does it help if we set TMPDIR to a unique path for each action? |
Is that targeted to me? If so how can I do that?
|
I don't think there is any way to change this. "/tmp" seems to be the hard-coded location.
philwo@ubuntu:~$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
philwo@ubuntu:~$ strace -f \
-E TMP=/home/philwo/tmp \
-E TMPDIR=/home/philwo/tmp -- \
java -Djava.io.tmpdir=/home/philwo/tmp HelloWorld 2>&1 | \
fgrep hsperfdata
[pid 1589] open("/tmp/hsperfdata_philwo", O_RDONLY|O_NOFOLLOW) = 3
[pid 1589] open("/tmp/hsperfdata_philwo", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 1589] mkdir("/tmp/hsperfdata_philwo", 0755) = -1 EEXIST (File exists)
[pid 1589] lstat("/tmp/hsperfdata_philwo", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 1589] open("/tmp/hsperfdata_philwo", O_RDONLY|O_NOFOLLOW) = 3
[pid 1589] open("/tmp/hsperfdata_philwo", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 1589] unlink("/tmp/hsperfdata_philwo/1588") = 0
This is consistent with what I found on the Java bugtracker:
There has been no further action on, or intent to change, this behavior since then. |
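The strace check above can also be done mechanically. The following Python sketch is only an illustration: `hsperfdata_paths` and `all_under` are hypothetical helpers, and `SAMPLE` is a subset of the strace lines quoted above. It extracts the hsperfdata paths the JVM touched and checks which directory they live under.

```python
import re

# Matches the first quoted string argument of a syscall line in strace output.
_PATH_ARG = re.compile(r'\("([^"]*)"')

# A few of the strace lines quoted in the comment above.
SAMPLE = '''\
[pid 1589] open("/tmp/hsperfdata_philwo", O_RDONLY|O_NOFOLLOW) = 3
[pid 1589] mkdir("/tmp/hsperfdata_philwo", 0755) = -1 EEXIST (File exists)
[pid 1589] unlink("/tmp/hsperfdata_philwo/1588") = 0
'''


def hsperfdata_paths(strace_output):
    """Return the distinct hsperfdata paths, in order of first appearance."""
    paths = []
    for line in strace_output.splitlines():
        match = _PATH_ARG.search(line)
        if match and "hsperfdata" in match.group(1) and match.group(1) not in paths:
            paths.append(match.group(1))
    return paths


def all_under(paths, root):
    """True if every path is root itself or lies below it."""
    prefix = root.rstrip("/") + "/"
    return all(p == root or p.startswith(prefix) for p in paths)
```

For the sample, every path is under /tmp and none is under /home/philwo/tmp, even though TMP, TMPDIR and java.io.tmpdir all pointed there in the traced command.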
Some ideas:
Any other ideas? I think I like 2) the best and it would be simple to implement. |
I don't know what the constraints are here, but instead of mounting a tmpfs instance, could you mount --bind a fresh temporary directory onto /tmp? Should avoid the memory issues, though you'd still be paying the cost to mount the extra fs. (Given that this is a problem caused by PID namespaces, which are Linux-specific, I'm assuming this unportable solution would be acceptable.) |
@jmmv In theory yes, but unfortunately this will again break the people who put their workspace or output base inside /tmp, because the directory mounted on top of /tmp will then hide their input or output files and directories. I still have no clue why anyone would do that, but it comes up every single time I accidentally break it. |
We could detect that case and construct a sequence of bind mounts that make the workspace / output base visible even though we're bind mounting an empty dir to /tmp. Alternatively, we could bind mount an empty directory to /tmp/hsperfdata_/. |
I'll give the "empty dir on /tmp" idea a try today and, if that doesn't work out, go for the "make /tmp/hsperfdata_$USERNAME read-only in sandboxes" version. |
Any update?
|
@ittaiz Unfortunately I got sick just after writing that comment and was out of office the entire week :| I'm fine again now and will be back on Monday. Current plan is to mkdir that directory and then make it read-only unless a tmpfs is mounted on /tmp (because that also solves the problem in a different way). It should be a rather simple fix that I can get easily done on Monday. |
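The plan in the comment above (create the directory, then make it read-only, unless a tmpfs is already mounted on /tmp) could be sketched roughly like this in Python. This is not the actual Bazel change: the function names are hypothetical, the mount table is passed in as text in /proc/mounts format, and `chmod` is only a stand-in for however the sandbox really enforces read-only access (it does not restrict root, for example).

```python
import os
import stat


def tmp_is_tmpfs(proc_mounts_text, mount_point="/tmp"):
    """Return True if the most recent mount on mount_point is a tmpfs.

    proc_mounts_text uses the /proc/mounts layout:
    device, mount point, filesystem type, options, ...
    """
    fs_type = None
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fs_type = fields[2]  # later entries shadow earlier ones
    return fs_type == "tmpfs"


def protect_hsperfdata_dir(tmp_root, username, proc_mounts_text):
    """Create tmp_root/hsperfdata_<username> and drop its write bits.

    Returns the directory path, or None when tmp_root is already a private
    tmpfs and nothing needs to be done.
    """
    if tmp_is_tmpfs(proc_mounts_text, tmp_root):
        return None
    path = os.path.join(tmp_root, f"hsperfdata_{username}")
    os.makedirs(path, exist_ok=True)
    # Owner may list and enter the directory but not create files in it.
    os.chmod(path, stat.S_IRUSR | stat.S_IXUSR)
    return path
```

The idea is that a JVM that cannot create its perf-data file in that directory has nothing to share with the other sandboxed JVMs, so there is no file left for them to truncate underneath each other.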
Thanks! Glad to hear you're better.
|
Unfortunately fixing the problem was accompanied by a warning message printed to STDOUT, which is breaking some of our build actions that write to STDOUT (https://github.com/google/google-java-format), and filling our build logs with hundreds of those warnings for all other JVM-tool actions. I'm not sure how we were not affected by the crash, but are now affected by the logging, but hopefully |
Here's a thing to try for the warning message: Add the following to the JVM options for the actions:
|
Quick chime in, looking at This is fundamentally not a fix; it's a workaround for not using secure temp file creation. Enabling this workaround globally, as done in #18892, smells like a "brute-force" attempt that could actually mask real issues in tools (such as the one in the JDK) that are not really prepared for parallelization, regardless of whether you're running them in an isolated environment. And of course the flag is marked as incompatible, since users surely already rely on the assumption that /tmp is shared. So I think that the solution is definitely going in the right direction, but it has a couple of issues:
|
It is both a fix and a workaround:
I think that 2) by itself is a reason for having and flipping |
Agreed on point 2. I would still like to see the flexibility there, but I now better understand that this issue goes beyond the hsperfdata file alone. I also understand that, from the Bazel maintainers' POV, out-of-the-box support for common JVMs is a must-have. With that, the note on flexibility is made; consider it a possible minor improvement for some more esoteric tools / setups. Appreciate the swift reply 😄 |
The switch to Java-11 has brought instability in Bazel execution and random crashes (see [1]). Bazel doesn't have a fix at the moment, the only workaround is to add the --sandbox_tmpfs_path=/tmp option after the build/test commands. [1] bazelbuild/bazel#3236 Change-Id: I3aaabe756808a7da4ae51042922f9094c3759e22
Fixes bazelbuild#3236 Closes bazelbuild#19915 RELNOTES[INC]: `--incompatible_sandbox_hermetic_tmp` is enabled by default. See bazelbuild#19915 for migration advice. Closes bazelbuild#19943. PiperOrigin-RevId: 581165770 Change-Id: I0d98102f10b1e47c1d8fcf32fb1f7dee5ae0788c
Fixes #3236 Closes #19915 RELNOTES[INC]: `--incompatible_sandbox_hermetic_tmp` is enabled by default. See #19915 for migration advice. Closes #19943. Commit e2c0276 PiperOrigin-RevId: 581165770 Change-Id: I0d98102f10b1e47c1d8fcf32fb1f7dee5ae0788c Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
A fix for this issue has been included in Bazel 7.0.0 RC5. Please test out the release candidate and report any issues as soon as possible. Thanks! |
The Bazel flag `--incompatible_sandbox_hermetic_tmp` is added to fix some issues with `mockito-core` and `byte-buddy`, e.g. see https://github.com/EngFlow/bazel_invocation_analyzer/actions/runs/7067252322/job/19240379595?pr=151 See mockito/mockito#1879 for a similar issue reported, and bazelbuild/bazel#3236 for the "fix" by specifying `--incompatible_sandbox_hermetic_tmp`. Note that this flag will be set to true by default with Bazel 7, which is expected to be released next week. --------- Signed-off-by: Sara Adams <sara.e.adams@gmail.com>
…chainCompileBootClasspath Copybara Import from #149 BEGIN_PUBLIC Disable perfdata when running JavaToolchainCompileClasses or JavaToolchainCompileBootClasspath (#149) A [recent JDK update](openjdk/jdk@84f2314#diff-7313eb3d328797a7720fa1b2b73cd159934506593443e45534baad80cb1382b7R924-R927) started printing the following warning message from the JVM ``` [warning][perf,memops] Cannot use file /tmp/hsperfdata_username/2 because it is locked by another process (errno = 11) ``` on linux hosts. Also referenced from bazelbuild/bazel#3236 This PR disables the perfdata generation when running JavaToolchainCompileClasses or JavaToolchainCompileBootClasspath so this warning message won't be printed. Closes #149 END_PUBLIC COPYBARA_INTEGRATE_REVIEW=#149 from cheister:no-perfdata 601c3e5 PiperOrigin-RevId: 614679249 Change-Id: I7e90c0ec0b93ac57763d3a5af546c4af9f7c9dc0
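The commit message above does not say which JVM option it uses to disable perfdata. As an assumption for illustration, the sketch below uses `-XX:-UsePerfData`, a standard HotSpot option that turns the perf-data feature off; the helper names are hypothetical and this is not the code from that PR.

```python
# Standard HotSpot option that disables the perf-data (hsperfdata) feature.
# Assumption: the commit above may or may not use this exact option.
PERFDATA_OFF = "-XX:-UsePerfData"


def with_perfdata_disabled(jvm_flags):
    """Return a copy of jvm_flags that ends up with perf data disabled."""
    # Drop an explicit enable so the two settings do not contradict each other.
    flags = [flag for flag in jvm_flags if flag != "-XX:+UsePerfData"]
    if PERFDATA_OFF not in flags:
        flags.append(PERFDATA_OFF)
    return flags


def java_command(main_class, jvm_flags=()):
    """Build an argv list for launching main_class with perf data disabled."""
    return ["java", *with_perfdata_disabled(jvm_flags), main_class]
```

With perf data disabled, the JVM has no hsperfdata file to create, so there is also nothing for it to find locked by another process, which is the condition the quoted warning reports.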
Running `bazel build` on a fat Java/Scala project (several thousand targets) fails on Debian Linux with user namespaces enabled.
Issue
Trying to run `bazel build` with user namespaces enabled:
The build runs fine at first, but at some point it crashes with a weird memory issue:
Environment info
The machine is a Docker container based on a Debian image
specs2 versions and test runner env preparation)
additional information
unprivileged_userns_clone=0 (but clearly, that's not a solution)