-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support non-ASCII source artifact paths on UNIX platforms. #10111
Support non-ASCII source artifact paths on UNIX platforms. #10111
Conversation
Never mind, CI wasn't satisfied with the |
18be614
to
2c2684b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I'm not qualified to comment on encoding issues, or at least Laszlo and Alan are way more qualified than I am, so I'll defer to them.
// On some platforms, whether a path can be read as a file won't be checked until | ||
// the first read(). | ||
// | ||
// http://mail.openjdk.java.net/pipermail/nio-dev/2014-December/002877.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wat
src/main/java/com/google/devtools/build/lib/vfs/AbstractFileSystem.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/vfs/AbstractFileSystem.java
Outdated
Show resolved
Hide resolved
|
||
// Paths returned from NativePosixFiles are Strings containing raw bytes from the filesystem, | ||
// but Java's IO subsystem expects paths to be encoded in the current locale. We can avoid this | ||
// assumption by converting the path to a URI, which permits percent-encoding of any octet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clever :)
src/main/java/com/google/devtools/build/lib/unix/UnixFileSystem.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/vfs/AbstractFileSystem.java
Outdated
Show resolved
Hide resolved
Update: I tried importing this change but couldn't, because some Google-internal tests failed. Sorry about that, I'll try to look at those as soon as I can, hopefully tomorrow. |
final java.nio.file.Path nioPath = createJavaNioPath(path); | ||
InputStream stream = null; | ||
try { | ||
stream = Files.newInputStream(nioPath, java.nio.file.StandardOpenOption.READ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After much debugging, I found two culprits of the breaking internal tests.
One is this line. I don't know why, but returning new FileInputStream(nioPath.toFile())
works, but Files.newInputStream
doesn't. Maybe something somewhere checks for instanceof FileInputStream
-- which of course they shouldn't, but do so anyway -- and while I'd love to fix that, I'm not sure I can. So, would you be open to using FileInputStream here?
The other error is the logic in lines 490..511. Again, frustratingly, I have no idea why it's wrong. But considering that this logic -- no matter how reasonable -- doesn't serve the goal of this PR (i.e. to support non-ASCII characters), nor is it essential to it, would you be OK with removing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, would you be open to using FileInputStream here?
Well, using java.nio.Files
was needed to make Bazel's CI pass. Otherwise the CentOS instances failed to locate the UTF-8 filename.
The other error is the logic in lines 490..511. Again, frustratingly, I have no idea why it's wrong. But considering that this logic -- no matter how reasonable -- doesn't serve the goal of this PR (i.e. to support non-ASCII characters), nor is it essential to it, would you be OK with removing it?
I didn't add this logic for fun. The CI tests failed unless I added the seek-once (since tests assert trying to read a directory fails), and using the normal buffering stream caused a different CI failure.
There doesn't seem to be an easy way around using NIO. I've been digging into the JDK source, and Bazel's current behavior of forcing LANG=en_US.ISO-8859-1
has a side effect on Linux of setting sun.jnu.encoding
to ASCII-only mode, which prevents java.io.File
from being able to represent bytes > 0x7F at all.
There are ... a lot of things to unpack here, and it's possible that I'll find something that can be made to work. But it would be much easier if the dependencies on FileInputStream
could be identified and characterized.
673dd22
to
97a79a1
Compare
Implementation rebased to I had an idea this morning about how the JVM encoding interacts with Bazel and it seems to be working (though with limitations):
The only case from |
Internal tests pass. \o/ Thank you for reworking the PR. Let me get it reviewed again.
But this is not a regression, since such systems simply still won't support non-ascii characters, right? |
Correct, no change in behavior there. Doesn't work before, still won't work. |
Thanks for confirming that. |
Fixes #7255, and mitigates some of the most annoying limitations of #374.
Summary: On UNIX platforms, Bazel uses
readdir()
viaNativePosixFiles
but opens paths withjava.io.File
. These two libraries use different representations of non-ASCII filesystem paths, which prevents Bazel from reading source artifacts.There is a workaround, because
java.io.File
can also accept the path as aURI
with percent-encoded octets. Using this mechanism for paths containing characters outside the ASCII range allows Bazel to happily consume source artifacts with Unicode filenames.cc @davispuh for #7255
cc @aehlig for #4555
cc @alandonovan who, per #374 (comment), was working on a fix but ran into unknown difficulties.