Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert paths, arguments, and env vars to Unicode for remote execution. #15333

Closed
wants to merge 14 commits into from

Conversation

jmillikin
Copy link
Contributor

Fixes #14381

I didn't touch the Platform construction because it seems to be used in parts of Bazel that aren't specific to remote execution. Maybe it'd be fine to change too? Comments welcomed.

@coeuvre
Copy link
Member

coeuvre commented Apr 25, 2022

cc @tjgq

@sgowroji sgowroji added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Apr 25, 2022
@tjgq tjgq self-assigned this Apr 25, 2022
Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not entirely obvious to me that Platform has the same issue (its contents originate from command line arguments, not Starlark). Adding a test would be the way to be sure, but feel free to punt on it, since non-ASCII values are probably not as common there.

FYI, while this PR is most definitely welcome, I suspect it doesn't completely fix #14381, since that bug report also mentions an encoding issue in bazel aquery. So feel free to mention it, but I'd suggest leaving it open for now.

@jmillikin
Copy link
Contributor Author

FYI, while this PR is most definitely welcome, I suspect it doesn't completely fix #14381, since that bug report also mentions an encoding issue in bazel aquery. So feel free to mention it, but I'd suggest leaving it open for now.

I think the bad bazel aquery output might be the same bug, or one nearly identical. It's assembling a Unicode string for the output, and doesn't convert Starlark strings to Unicode before formatting them.

I've tweaked this PR so it also fixes aquery, using the new helper functions.

It's not entirely obvious to me that Platform has the same issue (its contents originate from command line arguments, not Starlark).

While platform properties can be obtained from CLI flags, they also come from Starlark (the platform rule or exec_properties common attribute). I agree that non-ASCII values in platform properties are probably rare, so not going to worry about it for now, but it might be worth fixing in the future.

For reference, the following property configurations are equivalent:

test_unicode(
    name = "test_unicode",
    exec_properties = {"PROPERTY_あ": "あ"},
)

$ bazel build --remote_default_platform_properties='properties {name:"PROPERTY_あ" value:"あ"}'

$ bazel build --remote_default_exec_properties=PROPERTY_あ=あ

And result in the same (incorrect) Platform message being sent to the remote executor.

platform: {
  properties: {
    name: "PROPERTY_ã\u0081\u0082"
    value: "ã\u0081\u0082"
  }
}

Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tweaked this PR so it also fixes aquery, using the new helper functions.

Thanks! Do you mind also adding a test for bazel aquery, probably in src/test/shell/integration/aquery_test.sh?

While platform properties can be obtained from CLI flags, they also come from Starlark (the platform rule or exec_properties common attribute). I agree that non-ASCII values in platform properties are probably rare, so not going to worry about it for now, but it might be worth fixing in the future.

Thanks, I wasn't aware of the exec_properties attribute. I agree that we don't have to fix this right now.

@jmillikin
Copy link
Contributor Author

jmillikin commented Apr 27, 2022

Thanks! Do you mind also adding a test for bazel aquery, probably in src/test/shell/integration/aquery_test.sh?

Done -- I added a test for bazel aquery --output=text.

Fixing the other aquery output formats (for example --output=textproto) are a bit beyond my depth in terms of Bazel knowledge. They seem to live mostly in skyframe/actiongraph/ and I'm having some trouble figuring out what downstream consumers of their output might depend on exactly matching Bazel internal state.

For those, I've added a DISABLED_ test that would pass if --output=textproto passed along Unicode. Hopefully it's OK to fix that in a separate change.

@jmillikin
Copy link
Contributor Author

Note on the bazel aquery output: on Windows, reliably writing non-ASCII to stdout requires a specially configured PrintWriter. I got lost trying to figure out how the OutputStream stream gets built, so instead it now escapes non-ASCII characters rather than try to print them. It's a little less pretty in the output, but is still usable for debugging and works on all tested platforms.

Also, apparently, on Windows strings for file paths are Unicode and strings from Starlark are UTF-8. I extended the logic in decodeBytestringUtf8() to try to handle this as best it can.

Here's how the new aquery output looks:

$ cat pkg/BUILD 
genrule(
    name = "bar",
    srcs = glob(["*.txt"]),
    outs = ["bar_out.txt"],
    cmd = "echo unused > $(OUTS)",
)
$ ls pkg
BUILD            ünïcödë.txt
$ bazel aquery --output=text //pkg:bar | grep Inputs
  Inputs: [external/bazel_tools/tools/genrule/genrule-setup.sh, pkg/{U+00FC}n{U+00EF}c{U+00F6}d{U+00EB}.txt]

* to be durable against unusual encoding settings, and does not guarantee
* that the escaping process is reverseable.
*
* Non-printable ASCII characters are formatted as `{U+00XX}`. Characters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with the standard-ish \uXXXX notation to make the output copy-pastable into a script.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm going to import this now. Apologies for my confusion earlier and the review delay (I was out on Friday).

@bazel-io bazel-io closed this in c55b01e May 2, 2022
ckolli5 added a commit that referenced this pull request May 10, 2022
…utput and remote execution protocol. (#15447)

For historical reasons, Bazel internally encodes non-ASCII in BUILD/bzl files by taking individual input bytes (assumed to be UTF-8) and storing them in a String as the corresponding Latin1 characters. This encoding must be undone whenever these strings escape to the outside world.

Fixes #14381.

Closes #15333.

PiperOrigin-RevId: 445941013

Co-authored-by: John Millikin <john@john-millikin.com>
@ShreeM01 ShreeM01 removed the awaiting-review PR is awaiting review from an assigned reviewer label Sep 15, 2022
@jmillikin jmillikin deleted the rexec-unicode-strings branch September 18, 2022 03:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong encoding of filename in bazel action inputs
5 participants