Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to emit *.runfiles_manifest as JSON #14676

Conversation

EdSchouten
Copy link
Contributor

The *.runfiles_manifest files currently use a pretty simple format: two
pathnames separated by a space character. This can be problematic at
times. On Windows, usernames may contain spaces, meaning home
directories with spaces are not uncommon. If you look at the macOS
ecosystem, it's also not uncommon to have applications or packages
containing files with spaces in their names. The goal of this change is
to start addressing this issue, by moving *.runfiles_manifest to newline
delimited JSON.

Runfiles libraries for each programming language will need to be
adjusted to support this new format. Because not all runfiles libraries
are bundled/versioned together with Bazel, those libraries will need to
seamlessly support both formats. Tuples are encoded as lists, JSON
entries will always start with an opening square bracket. Because paths
are generally prefixed with a workspace name, it's extremely unlikely
that this causes any conflict. Runfiles libraries can thus distinguish
both formats as follows:

if (line[0] == '[') {
  // Parse as JSON.
} else {
  // Parse as legacy format.
}

To prevent any breakage, we place this feature behind a flag named
--incompatible_json_source_manifests. Even if this flag is disabled, we
emit JSON entries for paths that are not representable in the legacy
format. This should at least make it possible to build targets and
instantiate their runfiles directories with build-runfiles, regardless
of whether the manifest can be parsed at runtime.

Before considering JSON, I experimented with CSV instead. Though easier
to parse, RFC 4180 requires the use of carriage return newlines. Because
both Bazel and runfiles libraries tend to open runfiles manifests in
text mode, this turned out to be hard to get right.

Given the fact that Bazel does not permit non-printable characters and
backslashes in pathnames, parsing the resulting JSON turns out to be
remarkably easy. If it's not realistic to let a given runfiles library
depend on a third-party library to do the JSON parsing, it's most likely
acceptable to handroll a simplified parser.

This PR already contains updates to build-runfiles, the Bash runfiles
library and the Python runfiles library to support the new format.
Updates to the C++ and Java runfiles libraries remain to be done.

Issue: #4327

@gregestren
Copy link
Contributor

@EdSchouten do you have a recommended reviewer for this PR? Or at least team (https://bazel.build/maintaining/maintainers-guide.html#team-labels)?

I can route as make sense to me, but if we can route to anyone we already know is paying attention that offers a better chance of timely response.

@EdSchouten
Copy link
Contributor Author

Hey Greg! Thanks for responding!

@EdSchouten do you have a recommended reviewer for this PR? Or at least team (https://bazel.build/maintaining/maintainers-guide.html#team-labels)?

I can route as make sense to me, but if we can route to anyone we already know is paying attention that offers a better chance of timely response.

I suspect team-Core would be the best fit for this PR, considering that runfiles handling is pretty core to Bazel. Happy to bring other teams in the loop when it comes to reviewing some of the runfiles libraries changes, but we can always do that later.

@gregestren gregestren added the team-Rules-Server Issues for serverside rules included with Bazel label Feb 1, 2022
@gregestren
Copy link
Contributor

I believe Core is more focused on the core execution engine, i.e. Skyframe.

I assigned team-Rules-Server for lack of a better label. @lberki should actually be a good contact but he's on a temporary tech leave at the moment.

Generic runfiles I think suffers from weak ownership.

@gregestren
Copy link
Contributor

So basically, I'm trying to get this proper attention in accordance with https://bazel.build/maintaining/maintainers-guide.html#lifecycle-of-a-pull-request.

Ping me specifically if you worry about responsiveness, but as a heads up I'd expect this one to take some time.

@EdSchouten EdSchouten force-pushed the eschouten/20220131-runfiles-manifest-json branch from 2ec1149 to 0222c9c Compare February 9, 2022 11:56
The *.runfiles_manifest files currently use a pretty simple format: two
pathnames separated by a space character. This can be problematic at
times. On Windows, usernames may contain spaces, meaning home
directories with spaces are not uncommon. If you look at the macOS
ecosystem, it's also not uncommon to have applications or packages
containing files with spaces in their names. The goal of this change is
to start addressing this issue, by moving *.runfiles_manifest to newline
delimited JSON.

Runfiles libraries for each programming language will need to be
adjusted to support this new format. Because not all runfiles libraries
are bundled/versioned together with Bazel, those libraries will need to
seamlessly support both formats. Tuples are encoded as lists, JSON
entries will always start with an opening square bracket. Because paths
are generally prefixed with a workspace name, it's extremely unlikely
that this causes any conflict. Runfiles libraries can thus distinguish
both formats as follows:

    if (line[0] == '[') {
      // Parse as JSON.
    } else {
      // Parse as legacy format.
    }

To prevent any breakage, we place this feature behind a flag named
--incompatible_json_source_manifests. Even if this flag is disabled, we
emit JSON entries for paths that are not representable in the legacy
format. This should at least make it possible to build targets and
instantiate their runfiles directories with build-runfiles, regardless
of whether the manifest can be parsed at runtime.

Before considering JSON, I experimented with CSV instead. Though easier
to parse, RFC 4180 requires the use of carriage return newlines. Because
both Bazel and runfiles libraries tend to open runfiles manifests in
text mode, this turned out to be hard to get right.

Given the fact that Bazel does not permit non-printable characters and
backslashes in pathnames, parsing the resulting JSON turns out to be
remarkably easy. If it's not realistic to let a given runfiles library
depend on a third-party library to do the JSON parsing, it's most likely
acceptable to handroll a simplified parser.

This PR already contains updates to build-runfiles, the Bash runfiles
library and the Python runfiles library to support the new format.
Updates to the C++ and Java runfiles libraries remain to be done.

Issue: bazelbuild#4327
@gregestren
Copy link
Contributor

I was pinged. So I'll give this a closer look.

@gregestren
Copy link
Contributor

Please note I'm not super-familiar with the runfiles implementation logic, so I have some pretty basic questions.

Is build-runfiles the logic that Bazel uses to construct a target's foo.runfiles directory, which it reads the foo.runfiles_manifest file as input?

How much of #4327 does this PR address? The issue is titled as runfiles-related, but I think it's actually a broader issue with spaced path names in general, of which runfiles is one major but not the only example? i.e. the failure of genrule $(SRCS) at #4327 (comment), and maybe the --output_user_root issue?

Do you know what's going on at #4327 (comment) - which is setting --nobuild_runfile_links but still failing with a runfiles creation error?

Aside from these specifics I'm curious how many logic holes exist throughout the code base w.r.t. spaces in paths, and how much harder of a problem that is vs. what I think is a subset of the problem being addressed here.

@gregestren gregestren added the awaiting-user-response Awaiting a response from the author label Apr 27, 2022
@EdSchouten
Copy link
Contributor Author

Hey Greg,

Sorry for the late response; I missed the notification.

Please note I'm not super-familiar with the runfiles implementation logic, so I have some pretty basic questions.

Is build-runfiles the logic that Bazel uses to construct a target's foo.runfiles directory, which it reads the foo.runfiles_manifest file as input?

Yes. But there are also cases in which Bazel constructs 'foo.runfiles' directories without going through 'build-runfiles'. It may construct the runfiles directory directly, based on its in-memory bookkeeping.

How much of #4327 does this PR address? The issue is titled as runfiles-related, but I think it's actually a broader issue with spaced path names in general, of which runfiles is one major but not the only example? i.e. the failure of genrule $(SRCS) at #4327 (comment), and maybe the --output_user_root issue?

It by no means addresses all issues related to whitespace in filenames. For my specific use case I want to drag in a large directory hierarchy with pre-generated contents into the runfiles directory of a tool of ours. My intent is not to call genrule() on any of those files. That said, those issues are also interesting and we should also work towards fixing those in the long run.

Do you know what's going on at #4327 (comment) - which is setting --nobuild_runfile_links but still failing with a runfiles creation error?

So if I understand it correctly, bazel run is broken, while bazel test does work.

  • bazel test works, because it needs to instantiate a runfiles directory in the sandbox context. Bazel is capable of doing that without using runfiles_manifest at all. It constructs the runfiles directory from its own in-memory bookkeeping.
  • bazel run doesn't work, because it does call into build-runfiles to construct a runfiles directory in bazel-out/. The --nobuild_runfile_links flag prevents this from happening during bazel build, but if you do a bazel run it must be done to satisfy the execution requirements of the tool.

Aside from these specifics I'm curious how many logic holes exist throughout the code base w.r.t. spaces in paths, and how much harder of a problem that is vs. what I think is a subset of the problem being addressed here.

Yeah, there are likely many other things that are broken. That said, we don't necessarily need to support all these use cases. My hope is that we can at least support being able to filegroup() a directory structure containing spaces in names and adding those to data = of tools.

I hope that answers all your questions!

@gregestren
Copy link
Contributor

gregestren commented Jun 6, 2022

Yes, these are great answers. Thank you.

I'm caught up again and I'll summarize where I'm at. And my thoughts on trying to move this effectively forward:

  • I was curious about the larger issue (general support for paths with spaces) to clarify this change actually helps real users and doesn't just defer the error to a later point in the build. If the latter it'd probably make sense to consider the whole problem at once. But I think you and others are arguing this alone is helpful.
    • If anyone else wants to speak to that effect please do!
  • So it looks like this is intended as a good incremental step forward. Not complete but strictly better than the status quo
  • The complete problem remains unprioritized
  • I appreciate your diligence in evaluating different formats before settling on JSON
  • I consider this a user-visible API change, particularly if we're introducing a new output format
  • That's not a criticism, but we're trying to be extra cautious that Bazel's API remains globally coherent (i.e. we don't arbitrarily use JSON here while some other use case uses another format)
  • We're building out a team review for these kinds of things, which involves me, @comius , and @brandjon all being looped in

So in summary I support this PR but want to run it through the team review. I'll get that scheduled in and ensure all three of us have discussed it at our next regular sync. It's a little bit of bureaucracy but we hope it ultimately helps move these changes forward more effectively.

@gregestren
Copy link
Contributor

gregestren commented Jun 10, 2022

Hi @EdSchouten . @brandjon and I discussed this yesterday.

Would you be willing to output both a runfiles_manifest and runfiles_manifest.json file with the corresponding formats? That would offer format consistency: a file's format is fixed, period. It doesn't depend on conditions that the user may or may not understand.

If runfiles libraries aren't released with Bazel itself, we may also need them to look for the .json version and fall back to the default version if not found?

The ultimate end game could be to remove the runfiles_manifest version, but that's of course a few more feature PRs down the road.

@EdSchouten
Copy link
Contributor Author

EdSchouten commented Jun 11, 2022

Hey Greg,

Would you be willing to output both a runfiles_manifest and runfiles_manifest.json file with the corresponding formats? That would offer format consistency: a file's format is fixed, period. It doesn't depend on conditions that the user may or may not understand.

Just so I get this right: Does that mean that we should also add a foo.runfiles/MANIFEST.ndjson?

I suppose we should also introduce a RUNFILES_MANIFEST_NDJSON_FILE environment variable for this, right? Or are you suggesting that code should always try to open ${RUNFILES_MANIFEST_FILE}.ndjson as well?

@gregestren
Copy link
Contributor

Hey Greg,

Just so I get this right: Does that mean that we should also add a foo.runfiles/MANIFEST.ndjson?

Does your current patch adjust both files?

Are you preferring ndjson vs. full json because it's easier to implement, or you think it's cleaner, or both?

It's probably most consistent to update both foo.runfiles_manifest and foo.runfiles/MANIFEST atomically in the same PR.

I suppose we should also introduce a RUNFILES_MANIFEST_NDJSON_FILE environment variable for this, right? Or are you suggesting that code should always try to open ${RUNFILES_MANIFEST_FILE}.ndjson as well?

I'd suggest the latter. I'm wary of reliance of environment variables for reproducibility reasons, but maybe I'm not appreciating this use case for them.

@EdSchouten
Copy link
Contributor Author

Hey Greg,

Just so I get this right: Does that mean that we should also add a foo.runfiles/MANIFEST.ndjson?

Does your current patch adjust both files?

Yes!

Are you preferring ndjson vs. full json because it's easier to implement, or you think it's cleaner, or both?

NDJSON has the advantage that it can be parsed line by line. Especially for stuff like runfiles.bash it's easier if we allow that. Allowing arbitrary JSON in there (which may contain newlines at arbitrary spots) can be tricky to deal with in those languages.

It's probably most consistent to update both foo.runfiles_manifest and foo.runfiles/MANIFEST atomically in the same PR.

Yeah, I think they also have to be in sync. I seem to remember there are some places in the tree where runfiles_manifest and runfiles/MANIFEST are compared for equality. If they don't match, the runfiles directory is rebuilt. The runfiles/MANIFEST file is thus used as a marker to indicate whether the runfiles directory is up to date.

I suppose we should also introduce a RUNFILES_MANIFEST_NDJSON_FILE environment variable for this, right? Or are you suggesting that code should always try to open ${RUNFILES_MANIFEST_FILE}.ndjson as well?

I'd suggest the latter. I'm wary of reliance of environment variables for reproducibility reasons, but maybe I'm not appreciating this use case for them.

Gotcha. Will try to get this PR updated next week. Thanks!

@gregestren gregestren added the awaiting-user-response Awaiting a response from the author label Aug 8, 2022
@matthewjh
Copy link

matthewjh commented Aug 28, 2022

Any update, @EdSchouten? Really appreciate the work you have put in here; it's going to help a lot of people!

@keertk
Copy link
Member

keertk commented Dec 7, 2022

Hi there! Thank you for contributing to the Bazel repository. We appreciate your time and effort. We're doing a clean up of old PRs and will be closing this one since it seems to have stalled. Please feel free to reopen if you’re still interested in pursuing this or if you'd like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

@keertk keertk closed this Dec 7, 2022
@gregestren
Copy link
Contributor

FYI still happy to review when there's the energy to move forward.

@matthewjh
Copy link

@keertk @gregestren how can we reopen this?

@gregestren
Copy link
Contributor

@matthewjh we just need ownership. Would you be willing to check around to see who might want to take this on, or if @EdSchouten wants to re-focus? Ask around on Slack, https://github.com/bazelbuild/bazel/discussions, etc?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author team-Rules-Server Issues for serverside rules included with Bazel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants