Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openjdk\bin keeps being emptied on VM testing-windows #252

Closed
katre opened this issue Mar 21, 2018 · 27 comments
Closed

openjdk\bin keeps being emptied on VM testing-windows #252

katre opened this issue Mar 21, 2018 · 27 comments
Assignees
Labels

Comments

@katre
Copy link
Member

katre commented Mar 21, 2018

Earlier today, I noticed that on the testing-windows VM, the C:\openjdk\bin directory (and several others) was empty. @philwo then re-imaged the VM, and everything was fine.

Just now, I logged into the VM again, and c:\openjdk\bin is again empty, with no Java binaries to be found.

@philwo
Copy link
Member

philwo commented Mar 21, 2018

@katre Do you remember if you ran "bazel clean" or "bazel clean --expunge" at some point in time?

@laszlocsomor I wonder if Bazel is following symlinks for the local JDK, ends up in C:\openjdk and thoroughly "cleans it up" ... any idea?

FYI, we don't see this on the Buildkite VMs, because there the buildkite user does not have write-permissions to C:, so it can't delete the folder.

@katre
Copy link
Member Author

katre commented Mar 21, 2018

No, I didn't run "bazel clean", although I might have tried to delete my entire bazel checkout in a fit of cleanliness.

@philwo
Copy link
Member

philwo commented Mar 21, 2018

I might have tried to delete my entire bazel checkout in a fit of cleanliness.

How did you delete the checkout? :D

Some tools in Windows don't understand symbolic links and then follow them. I think this could cause the problem.

@katre
Copy link
Member Author

katre commented Mar 21, 2018

I used "rmdir /s /q". Let me guess, that followed the junction and clobbered openjdk? Lovely.

@katre
Copy link
Member Author

katre commented Mar 21, 2018

Given the number of tests that create bazel workspaces and then clean them up after, this would probably be triggered no matter what I had done.

@laszlocsomor
Copy link
Contributor

@philwo : do you know if anything runs git clean on CI ? I heard that it can delete the JDK, though I haven't tried it myself yet.

@laszlocsomor
Copy link
Contributor

@katre : IIRC rmdir /s /q doesn't follow junctions so that can't be the culprit. (I have to verify this when I get to the office today.)

@philwo
Copy link
Member

philwo commented Mar 22, 2018

@laszlocsomor CI does run „git clean“, but the testing-* machines are not used by CI.

@philwo
Copy link
Member

philwo commented Mar 22, 2018

Thinking about it, IIRC I initially used „rmdir /s /q“ to clean up after a CI run in the infra scripts and that indeed followed Bazel‘s bazel-out junction and deleted the entire JDK.

@philwo
Copy link
Member

philwo commented Mar 22, 2018

After I switched to PowerShell‘s „Remove-Item“ function, this worked fine.

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Mar 22, 2018

I tried it just now: neither del /s /q nor rd /s /q follow junctions.

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Mar 22, 2018

Ha! del /s /q bazel-bin will follow the junction. Ouch. rd /s /q bazel-bin won't though.

@laszlocsomor
Copy link
Contributor

So, moment of truth (at least on my machine... who knows, maybe there's some registry setting that controls rd's behavior):

I created two directories (foo-del and foo-sym), each with a file in it, and two junctions pointing to the directories:

C:\tmp\junc>dir
 Volume in drive C has no label.
 Volume Serial Number is A23E-7E0A

 Directory of C:\tmp\junc

2018-03-22  10:51    <DIR>          .
2018-03-22  10:51    <DIR>          ..
2018-03-22  10:51    <DIR>          foo-del
2018-03-22  10:51    <JUNCTION>     foo-del.sym [C:\tmp\junc\foo-del]
2018-03-22  10:39    <DIR>          foo-rd
2018-03-22  10:51    <JUNCTION>     foo-rd.sym [C:\tmp\junc\foo-rd]
               0 File(s)              0 bytes
               6 Dir(s)  838,317,199,360 bytes free

Deleting with del /s /q follows the junction, but only deletes files, not directories (see below):

C:\tmp\junc>del /s /q foo-del.sym
Deleted file - C:\tmp\junc\foo-del.sym\hello.txt

Deleting with rmdir /s /q does not follow, and as you see, foo-del is still there but empty:

C:\tmp\junc>rmdir /s /q foo-rd.sym

C:\tmp\junc>dir
 Volume in drive C has no label.
 Volume Serial Number is A23E-7E0A

 Directory of C:\tmp\junc

2018-03-22  10:51    <DIR>          .
2018-03-22  10:51    <DIR>          ..
2018-03-22  10:51    <DIR>          foo-del
2018-03-22  10:51    <JUNCTION>     foo-del.sym [C:\tmp\junc\foo-del]
2018-03-22  10:39    <DIR>          foo-rd
               0 File(s)              0 bytes
               5 Dir(s)  838,317,133,824 bytes free

C:\tmp\junc>dir foo-del
 Volume in drive C has no label.
 Volume Serial Number is A23E-7E0A

 Directory of C:\tmp\junc\foo-del

2018-03-22  10:51    <DIR>          .
2018-03-22  10:51    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)  838,317,133,824 bytes free

@katre
Copy link
Member Author

katre commented Mar 22, 2018

Huh, that's interesting. So the takeaway seems to be that I (and, I guess, Bazel) should use "del" on Windows, not "rmdir".

This seems to be pointing back towards setting up a shell_tools toolchain, with items for "shell", "rmdir", and whatever other system-level utilities are needed, so we can properly switch them based on the platform in use.

@laszlocsomor
Copy link
Contributor

So the takeaway seems to be that I (and, I guess, Bazel) should use "del" on Windows, not "rmdir".

I don't believe Bazel does that. Not with bazel clean anyway.

Do you have a repro? I'd like to investigate.

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Mar 22, 2018

I don't believe Bazel does that. Not with bazel clean anyway.

I mean, I don't believe Bazel uses either rmdir or del. It uses FileSystemUtils.deleteTree which uses WindowsFileSystem's delete(Path) method which is declared in JavaIoFileSystem.delete(Path), ultimately using File.delete.

@katre
Copy link
Member Author

katre commented Mar 22, 2018

I might try today and see if I can reproduce this. My theory is as follows:

  1. We run one of Bazel's tests under //src/test/shell/integration
  2. The test creates a temporary directory, creates a WORKSPACE in there, and runs the test.
  3. After the test finishes, the temporary directory is deleted, triggering a delete that crosses the junction and wipes out the JVM directory.

I think that the last step there is using whatever msys is providing, presumably some variant of "rm -r", and I don't know which of the "rmdir"/"del" modes that maps to.

@laszlocsomor
Copy link
Contributor

@katre , did you get to try this out?

@katre
Copy link
Member Author

katre commented Mar 27, 2018

Haven't had a chance to. I'll see if I can today.

@katre
Copy link
Member Author

katre commented Mar 27, 2018

Okay, I was unable to reproduce this by running tests. My assumption then is that I was doing something strange while testing manually and triggered the wrong delete command.

Thanks for the help, I'll close the issue now.

@katre katre closed this as completed Mar 27, 2018
@laszlocsomor
Copy link
Contributor

laszlocsomor commented Apr 17, 2018

@jasharpe found the culprit:

  • When you RDP into Windows, you get a different TMP/TEMP than usual and Windows (or the RDP client?) cleans these directories (seemingly upon restart, but maybe also sooner)
  • Bazel puts the output_base to the TMP directory by default (!!!)
  • The execroot is under the output_base and contains junctions pointing to... all your external repos. And your workspace. And to your JDK.
  • The cleanup logic follows junctions. (sigh)

@laszlocsomor laszlocsomor reopened this Apr 17, 2018
@laszlocsomor laszlocsomor added P1 and removed P2 labels Apr 17, 2018
@laszlocsomor laszlocsomor assigned laszlocsomor and unassigned philwo Apr 17, 2018
@laszlocsomor
Copy link
Contributor

laszlocsomor commented Apr 17, 2018

Here's the culprit: https://github.com/bazelbuild/bazel/blob/9dd14499876a4ca8f51a890bccc84971d16c0332/src/main/cpp/blaze_util_windows.cc#L205

By checking all possible TMP envvars, we try very hard to do the wrong thing.

@jasharpe
Copy link

To add slightly more color:

  • I'm pretty sure Windows is responsible for this. Whether this deletion occurs is controlled by the "HKLM:\System\CurrentControlSet\Control\Terminal Server\DeleteTempDirsOnExit" registry entry (it is 1 by default, and this problem goes away if you set it to 0).

  • This occurs at session end, which is sign out, not restart. It's much faster to reproduce by just signing out (from Powershell you can just do logoff to make this happen).

@laszlocsomor
Copy link
Contributor

That means junctions are inherently dangerous: by default they look like directories, and if a directory-tree deleter routine is unaware of junctions and doesn't check the FILE_ATTRIBUTE_REPARSE_POINT bit (see GetFileAttributes()), they'll blindly follow the junction.

@laszlocsomor
Copy link
Contributor

A better approach would be to do the same as on Linux: create directories with file symlinks in them. That way deleting the directory tree is safe because deleting a file symlink just deletes the symlink.

The downside is of course that creating symlinks requires admin privileges.

@laszlocsomor
Copy link
Contributor

laszlocsomor commented Apr 17, 2018

@philwo pointed out that directory symlinks would be risky on Linux too. And AFAIK we don't even create directory symlinks on any platform other than Windows.

@dslomov : FYI.

The implications:

  • we have to abstain from creating junctions
  • we need to create symlink forests (directory trees with file symlinks)
  • we need admin privileges, or at least some specific permissions to use CreateSymbolicLink
  • therefore we may need to give up the "xcopy install" concept, add an installer, and require admin approval

@laszlocsomor
Copy link
Contributor

bazelbuild/bazel#5038 is closed and released in 0.13.0 so this bug should no longer occur.

The implications:

On second thought, we cannot fix the Remote Desktop server not to follow junctions while deleting the temp directory, and abstaining from using junctions would limit the usability of Bazel, particularly on older Windows versions that don't support unprivileged symlink creation. Therefore I think Bazel is working as intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants