Support non-ASCII source artifact paths on UNIX platforms. #10111

Closed

Conversation

jmillikin
Contributor

@jmillikin jmillikin commented Oct 27, 2019

Fixes #7255, and mitigates some of the most annoying limitations of #374.

Summary: On UNIX platforms, Bazel uses readdir() via NativePosixFiles but opens paths with java.io.File. These two libraries use different representations of non-ASCII filesystem paths, which prevents Bazel from reading source artifacts.

There is a workaround, because java.io.File can also accept the path as a URI with percent-encoded octets. Using this mechanism for paths containing characters outside the ASCII range allows Bazel to happily consume source artifacts with Unicode filenames.
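
A minimal sketch of the percent-encoding idea, not the PR's actual code. It assumes the path arrives as a String in which every char holds one raw filesystem byte (0x00 to 0xFF) and that the path is absolute; the class and helper names are hypothetical.

```java
import java.io.File;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PercentEncodedPath {
    // Hypothetical helper: treats each char of rawBytePath as one filesystem byte
    // and percent-encodes every octet outside a conservative "safe" set.
    static String percentEncode(String rawBytePath) {
        StringBuilder sb = new StringBuilder();
        for (byte b : rawBytePath.getBytes(StandardCharsets.ISO_8859_1)) {
            int octet = b & 0xFF;
            boolean safe = (octet >= 'a' && octet <= 'z')
                    || (octet >= 'A' && octet <= 'Z')
                    || (octet >= '0' && octet <= '9')
                    || "/-._~".indexOf(octet) >= 0;
            if (safe) {
                sb.append((char) octet);
            } else {
                sb.append('%').append(String.format("%02X", octet));
            }
        }
        return sb.toString();
    }

    // Builds a file: URI from an absolute raw-byte path (assumed to start with '/').
    static URI toFileUri(String absoluteRawBytePath) {
        return URI.create("file:" + percentEncode(absoluteRawBytePath));
    }

    public static void main(String[] args) {
        // "caf" followed by the two UTF-8 bytes of U+00E9, one byte per char.
        URI uri = toFileUri("/tmp/caf\u00C3\u00A9.txt");
        System.out.println(uri.getRawPath()); // prints "/tmp/caf%C3%A9.txt"
        File file = new File(uri);            // java.io.File accepts a file: URI
        System.out.println(file.exists());
    }
}
```

Note that java.net.URI decodes the escaped octets as UTF-8 when the path is read back, so this route only helps when the raw bytes are valid UTF-8.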

cc @davispuh for #7255
cc @aehlig for #4555
cc @alandonovan who, per #374 (comment), was working on a fix but ran into unknown difficulties.

@jmillikin
Contributor Author

jmillikin commented Oct 27, 2019

Also, small note: due to limitations within java.io this only supports Unicode paths, not the "arbitrary blob of bytes" semantics found on Linux or BSD. According to https://stackoverflow.com/questions/14171565/java-read-write-unicode-utf-8-filenames-not-contents we'd have to switch Bazel to opening files with java.nio, which is a larger change than I wanted to make.

Never mind, CI wasn't satisfied with the java.io.File behavior, so I've expanded this to use java.nio.file.Files for opening the InputStream.

@jmillikin jmillikin force-pushed the unicode-source-artifact-paths branch from 18be614 to 2c2684b Compare October 27, 2019 08:40
@irengrig irengrig added the team-Rules-Server Issues for serverside rules included with Bazel label Oct 28, 2019
Contributor

@lberki lberki left a comment

I'm afraid I'm not qualified to comment on encoding issues, or at least Laszlo and Alan are way more qualified than I am, so I'll defer to them.

// On some platforms, whether a path can be read as a file won't be checked until
// the first read().
//
// http://mail.openjdk.java.net/pipermail/nio-dev/2014-December/002877.html

wat


// Paths returned from NativePosixFiles are Strings containing raw bytes from the filesystem,
// but Java's IO subsystem expects paths to be encoded in the current locale. We can avoid this
// assumption by converting the path to a URI, which permits percent-encoding of any octet.

clever :)

@laszlocsomor
Contributor

Update: I tried importing this change but couldn't, because some Google-internal tests failed. Sorry about that, I'll try to look at those as soon as I can, hopefully tomorrow.

final java.nio.file.Path nioPath = createJavaNioPath(path);
InputStream stream = null;
try {
stream = Files.newInputStream(nioPath, java.nio.file.StandardOpenOption.READ);

Contributor

@laszlocsomor laszlocsomor Nov 5, 2019

After much debugging, I found two culprits of the breaking internal tests.

One is this line. I don't know why, but returning new FileInputStream(nioPath.toFile()) works, but Files.newInputStream doesn't. Maybe something somewhere checks for instanceof FileInputStream -- which of course they shouldn't, but do so anyway -- and while I'd love to fix that, I'm not sure I can. So, would you be open to using FileInputStream here?

The other error is the logic in lines 490..511. Again, frustratingly, I have no idea why it's wrong. But considering that this logic -- no matter how reasonable -- doesn't serve the goal of this PR (i.e. to support non-ASCII characters), nor is it essential to it, would you be OK with removing it?

Contributor Author

@jmillikin jmillikin Nov 6, 2019

> So, would you be open to using FileInputStream here?

Well, using java.nio.file.Files was needed to make Bazel's CI pass. Otherwise the CentOS instances failed to locate the UTF-8 filename.

> The other error is the logic in lines 490..511. Again, frustratingly, I have no idea why it's wrong. But considering that this logic -- no matter how reasonable -- doesn't serve the goal of this PR (i.e. to support non-ASCII characters), nor is it essential to it, would you be OK with removing it?

I didn't add this logic for fun. The CI tests failed unless I added the seek-once (since tests assert trying to read a directory fails), and using the normal buffering stream caused a different CI failure.
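
For readers following along, here is a rough sketch of why an eager check is needed at all. It is not the PR's seek-once logic (which isn't shown in this thread); it swaps in a one-byte probe through PushbackInputStream, and the class and method names are hypothetical. On Linux, Files.newInputStream can succeed on a directory and only fail at the first read(), so the probe surfaces the error at open time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class EagerOpen {
    // Hypothetical helper: opens the path, then reads one byte and pushes it back
    // so that "not a readable file" errors are raised here instead of later.
    static InputStream openAndProbe(Path path) throws IOException {
        InputStream raw = Files.newInputStream(path, StandardOpenOption.READ);
        try {
            PushbackInputStream in = new PushbackInputStream(raw, 1);
            int first = in.read();   // a directory fails here on Linux
            if (first != -1) {
                in.unread(first);    // keep the stream contents intact for the caller
            }
            return in;
        } catch (IOException e) {
            raw.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("probe");
        try {
            openAndProbe(dir).close();
            System.out.println("unexpected: directory opened as a file");
        } catch (IOException expected) {
            System.out.println("directory rejected: " + expected);
        } finally {
            Files.delete(dir);
        }
    }
}
```

The returned stream is a PushbackInputStream rather than a FileInputStream, so any caller that checks the concrete stream type would still be affected.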

There doesn't seem to be an easy way around using NIO. I've been digging into the JDK source, and Bazel's current behavior of forcing LANG=en_US.ISO-8859-1 has a side effect on Linux of setting sun.jnu.encoding to ASCII-only mode, which prevents java.io.File from being able to represent bytes > 0x7F at all.
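
A quick way to see this on a given machine is to print the relevant properties and ask the charset whether it can encode a non-ASCII name. This is only a diagnostic sketch: sun.jnu.encoding is an internal OpenJDK property that may be absent on other JVMs, and the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JnuEncodingCheck {
    // Hypothetical helper: true if every character of fileName can be encoded in charset.
    static boolean representable(String fileName, Charset charset) {
        return charset.newEncoder().canEncode(fileName);
    }

    public static void main(String[] args) {
        // Internal OpenJDK property; may be null on other JVMs.
        String jnu = System.getProperty("sun.jnu.encoding");
        System.out.println("sun.jnu.encoding = " + jnu);
        System.out.println("default charset  = " + Charset.defaultCharset());
        if (jnu != null && Charset.isSupported(jnu)) {
            // With an ASCII-only encoding this reports false for "café.txt".
            System.out.println("can encode caf\u00E9.txt: "
                    + representable("caf\u00E9.txt", Charset.forName(jnu)));
        }
        System.out.println("US-ASCII can encode it: "
                + representable("caf\u00E9.txt", StandardCharsets.US_ASCII));
    }
}
```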

There are ... a lot of things to unpack here, and it's possible that I'll find something that can be made to work. But it would be much easier if the dependencies on FileInputStream could be identified and characterized.

@jmillikin jmillikin force-pushed the unicode-source-artifact-paths branch from 673dd22 to 97a79a1 Compare November 7, 2019 07:58
@jmillikin
Contributor Author

jmillikin commented Nov 7, 2019

Implementation rebased to java.io.File, by changing the locale logic in the Bazel launcher to pass through LANG, LC_ALL in some circumstances. @laszlocsomor PTAL.

I had an idea this morning about how the JVM encoding interacts with Bazel and it seems to be working (though with limitations):

  • On machines with an ISO-8859-1 locale, the JVM can accept arbitrary paths.
  • On machines without ISO-8859-1, we can still support the most common case (UTF-8) by re-encoding the file paths.
    • The same code path works for Darwin because the JVM hardcodes the file encoding to UTF-8 regardless of LC_ALL.
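
The re-encoding step in the second bullet can be sketched roughly as follows. This assumes paths are held as Strings with one raw filesystem byte per char, as described in the review comments above; the class and method names are hypothetical and the PR's actual code may differ.

```java
import java.nio.charset.StandardCharsets;

class PathReencoding {
    // Raw-byte String (one byte per char) -> ordinary Java String, assuming the bytes are UTF-8.
    static String rawBytesToUnicode(String rawBytePath) {
        byte[] bytes = rawBytePath.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Ordinary Java String -> raw-byte String holding its UTF-8 encoding.
    static String unicodeToRawBytes(String unicodePath) {
        byte[] bytes = unicodePath.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String raw = "/src/caf\u00C3\u00A9.txt";   // UTF-8 bytes of U+00E9, one per char
        String unicode = rawBytesToUnicode(raw);
        System.out.println(unicode.equals("/src/caf\u00E9.txt"));   // prints "true"
        System.out.println(unicodeToRawBytes(unicode).equals(raw)); // prints "true"
    }
}
```

Bytes that are not valid UTF-8 decode to U+FFFD and do not round-trip, which matches the unsupported non-UTF-8 case.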

The only case from java.nio that can't be supported now is a Linux/BSD machine, without ISO-8859-1 locales, that has non-UTF-8 file paths. I think this case is uncommon enough to not worry about. If there's anyone out there who needs this, then they'll have to either use java.nio or go direct to the POSIX native C API.

@laszlocsomor
Contributor

Internal tests pass. \o/

Thank you for reworking the PR. Let me get it reviewed again.

> The only case from java.nio that can't be supported now is a Linux/BSD machine, without ISO-8859-1 locales, that has non-UTF-8 file paths. I think this case is uncommon enough to not worry about. If there's anyone out there who needs this, then they'll have to either use java.nio or go direct to the POSIX native C API.

But this is not a regression, since such systems simply still won't support non-ASCII characters, right?

@jmillikin
Contributor Author

> But this is not a regression, since such systems simply still won't support non-ASCII characters, right?

Correct, no change in behavior there. It didn't work before, and it still won't work.

@laszlocsomor
Contributor

Thanks for confirming that.

Labels
cla: yes team-Rules-Server Issues for serverside rules included with Bazel
Development

Successfully merging this pull request may close these issues.

bazel doesn't work when $HOME path contains non-ASCII characters
5 participants