Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows, Java workers: worker sometimes gets stuck in a bad state #2538

Closed
laszlocsomor opened this issue Feb 16, 2017 · 11 comments
Closed

Windows, Java workers: worker sometimes gets stuck in a bad state #2538

laszlocsomor opened this issue Feb 16, 2017 · 11 comments
Assignees
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) platform: windows type: bug
Milestone

Comments

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Feb 16, 2017

Description of the problem / feature request / question:

Sometimes the persistent Java worker gets into a bad state where I can't build. I haven't found a reliable way to trigger this, and it doesn't always happen, but once it does, no amount of retrying the build helps and I need to kill the running worker in Task Manager.

$ bazel --output_user_root=/c/tmp build src:bazel.exe --cpu=x64_windows_msvc
...
INFO: Found 1 target...
ERROR: C:/work/bazel/src/main/java/com/google/devtools/build/lib/BUILD:229:1: Extracting interface //src/main/java/com/google/devtools/build/lib:os_util failed: ijar.exe failed: error executing command external/bazel_tools/tools/jdk/ijar/ijar.exe bazel-out/vc_14_0_x64-fastbuild/bin/src/main/java/com/google/devtools/build/lib/libos_util.jar ... (remaining 1 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 35584.
CreateFile(): Access is denied.

      0 [main] ijar 10772 cygwin_exception::open_stackdumpfile: Dumping stack trace to ijar.exe.stackdump
Target //src:bazel.exe failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 3.141s, Critical Path: 2.07s

Environment info

  • Operating System: Windows 10, MSYS2 20161025

  • Bazel version (output of bazel info release): dev version, but I've seen this weeks, maybe months ago already

@laszlocsomor laszlocsomor added platform: windows type: bug P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed platform: windows labels Feb 16, 2017
@laszlocsomor laszlocsomor added this to the 0.6 milestone Feb 16, 2017
@dslomov
Copy link
Contributor

dslomov commented Feb 16, 2017

what is the right category @laszlocsomor? This seems to be Windows only?

@laszlocsomor
Copy link
Contributor Author

I don't know. I'm only using Windows these days, so I haven't had the chance to see it elsewhere.

@philwo
Copy link
Member

philwo commented Feb 22, 2017

The 90s called, they want their general protection faults back:

INFO: Found 1 target...
ERROR: C:/msys64/home/philwo/bazel/src/main/java/com/google/devtools/build/lib/BUILD:292:1: Extracting interface //src/main/java/com/google/devtools/build/lib:events failed: ijar.exe failed: error executing command external/bazel_tools/tools/jdk/ijar/ijar.exe bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/libevents.jar ... (remaining 1 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 35584.
CreateFile(): Access is denied.

Exception: STATUS_ACCESS_VIOLATION at rip=0010041B670
rax=0000000000000000 rbx=000000060005BED8 rcx=0000000000000001
rdx=0000000006054B50 rsi=000000018024F6A0 rdi=0000000000008000
r8 =00000000FFFFB6CC r9 =0000000180141D70 r10=0000000100000000
r11=0000000100405AE6 r12=0000000000000062 r13=00000000FFFFCC55
r14=00000000FFFFCC55 r15=0000000000000002
rbp=00000000FFFFB890 rsp=00000000FFFFB890
program=C:\msys64\tmp\_bazel_philwo\MAvyClY-\execroot\bazel\external\bazel_tools\tools\jdk\ijar\ijar.exe, pid 7892, thread main
cs=0033 ds=002B es=002B fs=0053 gs=002B ss=002B
Stack trace:
Frame        Function    Args
000FFFFB890  0010041B670 (00006054B50, 0018024F6A0, 00000008000, 000FFFFB930)
000FFFFB8F0  001004052F2 (0060005BED8, 0018024F6A0, 00000008000, 000FFFFB950)
000FFFFB920  001004056AF (000FFFFCC55, 00000000002, 00180185E96, 000FFFFB980)
000FFFFB950  00100416833 (000FFFFB9A8, 0060005BE80, 000FFFFB9A0, 000FFFFBA10)
000FFFFB980  001004167F5 (000FFFFCBF9, 0060005BED8, 00000100000, 000001C5D74)
000FFFFBA10  00100405B04 (0000000B3AA, 001004010D0, 001800464C0, 000FFFFBA30)
000FFFFBA60  001004030DD (000FFFFCBA1, 000006C3420, 000FFFFBB80, 00000000000)
000FFFFBB00  00100403631 (000FFFFCB10, 000FFFFCC55, 00000000030, FF0700010302FF00)
000FFFFCCB0  00180047C37 (00000000000, 00000000000, 00000000000, 00000000000)
00000000000  00180045873 (00000000000, 00000000000, 00000000000, 00000000000)
000FFFFFFF0  00180045924 (00000000000, 00000000000, 00000000000, 00000000000)
End of stack trace
ERROR: C:/msys64/home/philwo/bazel/src/tools/android/java/com/google/devtools/build/android/desugar/BUILD:28:1: output 'src/tools/android/java/com/google/devtools/build/android/desugar/desugar_bin_deploy.jar' was not created.
ERROR: C:/msys64/home/philwo/bazel/src/main/java/com/google/devtools/build/lib/BUILD:240:1: output 'src/main/java/com/google/devtools/build/lib/libutil-ijar.jar' was not created.

Considering that ijar is crashing here, but ijar is not a worker, ... how is this related to workers?

@philwo
Copy link
Member

philwo commented Feb 22, 2017

Dumping some notes, so that I don't forget:

<philwo> what's the workaround? 
<laszlocsomor> https://github.com/bazelbuild/bazel/issues/2538
<laszlocsomor> kill the java processes in Task Manager
<philwo> but.. Is "ijar" a worker?
<laszlocsomor> I think it is, because when I kill all java processes except the bazel server, upon next build it says "Worker n/m exited or was killed"
<philwo> Yes, that is a side-effect of killing the workers (Bazel only notices when it tries to use them the next time and then reports it)
<philwo> But if that fixes the problem...
<philwo> OH WAIT
<philwo> The worker probably keeps a file handle open
<philwo> That prevents ijar from writing to the same file
<laszlocsomor> why does the worker keep an fd open? is that intentional?
<philwo> I guess because "hey, Java has garbage collection, why should we care about closing files?" :(

One idea to "fix" this would be to force a GC run again after each unit of work in the JavaBuilder. This would also "solve" the recent memory leak / file handle leak issues reported by users, but comes at a performance cost.

The correct solution would be to find where those file handles are being leaked and close them correctly, e.g. by always using try-with-resources.

@philwo
Copy link
Member

philwo commented Feb 22, 2017

@cushon FYI.

@ulfjack
Copy link
Contributor

ulfjack commented Feb 24, 2017

I'm not sure the file handles are leaked. It may be that they're intentionally kept open?

@philwo
Copy link
Member

philwo commented Feb 24, 2017

@ulfjack The problem is, even if it is "intentional" from a Javac perspective (e.g. for caching purposes or something), it is very undesirable from our perspective, because worker processes should not keep visible state between units of work. It is fine if they cache things and still return correct results, but if they keep file handles open, blocking the deletion of input artifacts, we have a problem.

Another issue with the file handle leaking is that we had to disable persistent workers in our Google-internal config on macOS, because JavaBuilder would crash all the time due to running out of file handles.

@cushon
Copy link
Contributor

cushon commented Feb 24, 2017

The file ijar can't open is the output jar of a library. JavaBuilder only opens those files to write them, they should never be on the classpath (that's what ijars are for). It should be easy to make sure the output files get closed.

I sent @philwo a CL to refactor things a bit, but I don't see anything that code is obviously doing wrong:

bazel-io pushed a commit that referenced this issue Feb 28, 2017
To make it more obvious that the JarOutputStream is closed (see #2538).

--
PiperOrigin-RevId: 148791125
MOS_MIGRATED_REVID=148791125
@philwo
Copy link
Member

philwo commented Mar 2, 2017

I think we have a fix for this now (dc857b3), but I haven't had the time to verify on Windows.

I've checked an idle JavaBuilder using http://file-leak-detector.kohsuke.org/ and couldn't see leaked file handles anymore after the fix, so let's hope that this helped.

@laszlocsomor
Copy link
Contributor Author

@philwo : can we close this?

@laszlocsomor
Copy link
Contributor Author

AFAIK this was fixed and I haven't seen this occurring in a long time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) platform: windows type: bug
Projects
None yet
Development

No branches or pull requests

5 participants