Bazel 7 broke nested action directory outputs #21782

Closed
jmillikin opened this issue Mar 25, 2024 · 13 comments
Labels: team-Remote-Exec, type: support / not a bug (process), untriaged

Comments

@jmillikin
Contributor

Description of the bug:

Output directories declared with ctx.actions.declare_directory are allowed to be nested; for example, an action can declare that it outputs "{rule_name}/topdir" and "{rule_name}/topdir/subdir". This functionality broke in Bazel 7, and such output graphs now cause an analysis error.

$ bazel-7.0.0 build //:output_dirs
[...]
ERROR: One of the output paths 'bazel-out/darwin_arm64-fastbuild/bin/output_dirs/a' (belonging to //:output_dirs) and 'bazel-out/darwin_arm64-fastbuild/bin/output_dirs/a/b' (belonging to //:output_dirs) is a prefix of the other. These actions cannot be simultaneously present; please rename one of the output files or build just one of them

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

# rule.bzl
def _output_dirs(ctx):
    name = ctx.attr.name
    # Declare two directory outputs, one nested inside the other:
    # "{name}/a/b" lies within "{name}/a".
    a = ctx.actions.declare_directory(name + "/a")
    b = ctx.actions.declare_directory(name + "/a/b")
    outputs = [a, b]

    ctx.actions.run(
        inputs = [],
        outputs = outputs,
        executable = "/bin/echo",
        arguments = [],
        mnemonic = "OutputDirs",
    )
    return [
        DefaultInfo(files = depset(direct = outputs)),
    ]

output_dirs = rule(_output_dirs)
# BUILD.bazel
load(":rule.bzl", "output_dirs")
output_dirs(name = "output_dirs")

The build succeeds with Bazel versions <= 6.5.0:

bazel-6.5.0 build //:output_dirs
INFO: Analyzed target //:output_dirs (4 packages loaded, 6 targets configured).
INFO: Found 1 target...
INFO: From OutputDirs output_dirs/a:

Target //:output_dirs up-to-date:
  bazel-bin/output_dirs/a
  bazel-bin/output_dirs/a/b

The build fails with Bazel versions >= 7.0.0 (tested v7.0.0 and v7.1.1):

bazel-7.0.0 build //:output_dirs
ERROR: One of the output paths 'bazel-out/darwin_arm64-fastbuild/bin/output_dirs/a' (belonging to //:output_dirs) and 'bazel-out/darwin_arm64-fastbuild/bin/output_dirs/a/b' (belonging to //:output_dirs) is a prefix of the other. These actions cannot be simultaneously present; please rename one of the output files or build just one of them
Target //:output_dirs failed to build

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

release 7.0.0 / release 7.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

I suspect this is related to the directory output changes in v7.0 (--incompatible_disallow_unsound_directory_outputs, etc), so this is more of a "forgot to test that case in the rewritten impl" than a specific bad commit.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@github-actions bot added the team-Core label Mar 25, 2024
@jmillikin
Contributor Author

cc @tjgq as this may be related to #18646

@tjgq
Contributor

tjgq commented Mar 25, 2024

This is an intentional change; the associated incompatible flag is actually --incompatible_strict_conflict_checks, which was flipped in Bazel 7 and will be deleted in Bazel 8 (see #16729).

@jmillikin
Contributor Author

Hmm. Is that intentional change documented anywhere? I might just be missing it, but I don't see it mentioned anywhere in the v7.0 release notes.

And also, is it possible to undo that change? I don't understand the motivation behind it, and nested output directories are extremely useful for working with certain language ecosystems.

For example, JavaScript package managers produce a node_modules/ output directory with per-package subdirectories, and those packages might themselves contain sub-packages (for TypeScript declarations, etc). Representing those various directory pseudo-roots as File values is convenient and natural, but given a File(path = "node_modules"), the only way I know of to obtain a File(path = "node_modules/somepkg") is to have the same rule generate them both as nested output artifacts.

@tjgq
Contributor

tjgq commented Mar 25, 2024

Hmm. Is that intentional change documented anywhere? I might just be missing it, but I don't see it mentioned anywhere in the v7.0 release notes.

Actually, it was flipped in Bazel 6.0.0 (commit fe16965) but seems to be missing from the 6.0.0 release notes. We've had this issue before with changes that are cherry-picked into a release branch after it has already been cut from the main branch; I'll see if I can get it fixed.

And also, is it possible to undo that change? I don't understand the motivation behind it, and nested output directories are extremely useful for working with certain language ecosystems.

No, sorry. The value of this feature doesn't justify its implementation cost; it broke an extremely useful invariant (that every output file is owned by exactly one declared artifact) whose absence was a constant source of bugs, and whose existence simplifies several subsystems (notably conflict checking and input prefetching).

For example, JavaScript package managers produce a node_modules/ output directory with per-package subdirectories, and those packages might themselves contain sub-packages (for TypeScript declarations, etc). Representing those various directory pseudo-roots as File values is convenient and natural, but given a File(path = "node_modules"), the only way I know of to obtain a File(path = "node_modules/somepkg") is to have the same rule generate them both as nested output artifacts.

Do the consumers of this rule receive the entire node_modules directory as an input, or individual modules?

If consumers always receive the entire node_modules, then there's no need to create nested artifacts. You can pass around the File for the entire node_modules, and construct paths to subdirectories (if needed for action arguments, etc) through string concatenation instead of by instantiating a File object.
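
For instance, a minimal sketch of that approach (the rule, attribute, and tool names here are all hypothetical):

# consumer.bzl (sketch): receive the entire node_modules tree as a single
# File and address a subdirectory by string concatenation.
def _js_consumer(ctx):
    node_modules = ctx.file.node_modules  # File for the whole tree
    out = ctx.actions.declare_file(ctx.attr.name + ".out")
    ctx.actions.run(
        inputs = [node_modules],
        outputs = [out],
        executable = ctx.executable._tool,
        # Subdirectory path built as a string, not declared as a File.
        arguments = [node_modules.path + "/somepkg", out.path],
    )
    return [DefaultInfo(files = depset([out]))]

js_consumer = rule(
    implementation = _js_consumer,
    attrs = {
        "node_modules": attr.label(allow_single_file = True),
        "_tool": attr.label(
            executable = True,
            cfg = "exec",
            default = "//tools:js_tool",  # hypothetical
        ),
    },
)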

If consumers receive only certain modules as inputs but not the rest, I'm afraid you'll have to split each module into its own action. (You might also want to look into how rules_js does it, as I'm not aware of them having been broken by this flag flip.)

@tjgq
Contributor

tjgq commented Mar 25, 2024

Actually, it was flipped in Bazel 6.0.0 (commit fe16965) but seems to be missing from the 6.0.0 release notes. We've had this issue before with changes that are cherry-picked into a release branch after it has already been cut from the main branch; I'll see if I can get it fixed.

Correction: a flip was initially attempted for 6.0.0, but rolled back. A second attempt took place for 7.0.0 (commit: 7bd0ab6) and that one has stuck. tl;dr: 7.0.0 is the first release where --incompatible_strict_conflict_checks defaults to true.

@jmillikin
Contributor Author

No, sorry. The value of this feature doesn't justify its implementation cost; it broke an extremely useful invariant (that every output file is owned by exactly one declared artifact) whose absence was a constant source of bugs, and whose existence simplifies several subsystems (notably conflict checking and input prefetching).

Do you think there would be any way to provide an equivalent Starlark API to rules authors, while maintaining that new invariant? It's frustrating that a feature present since Bazel v1.0 has been silently broken without providing a replacement.

I'm somewhat rusty on the Bazel internals at this point, but given this example rule fragment:

a = ctx.actions.declare_directory(name + "/a")
b = ctx.actions.declare_directory(name + "/a/b")
ctx.actions.run(outputs = [a, b], ...)

It seems that there is a path forward in which a is a File representing the full output directory artifact, and b is a File representing a scoped view into it. There would be a potential conflict if these were outputs of different actions, but they're from the same action. The Merkle tree of ./a will always and by necessity include ./a/b as a node.

Alternatively, maybe a modified API like this would be acceptable?

a = ctx.actions.declare_directory(name + "/a")
b = ctx.actions.declare_directory("b", parent = a)
ctx.actions.run(outputs = [a], ...)
# b is a File that identifies output directory "{name}/a/b"

I'd prefer to maintain the old API if possible, but even a new API would be better than just being broken.

Note that the desire to identify subsets of a directory wouldn't be as necessary if Bazel supported marking files as non-symlinkable (#10299), because that would allow synthetic node_modules directories to be assembled as needed. But without that ability, the node_modules directory has to be assembled all at once in a single action, otherwise the Node runtime's habit of symlink peeking will bypass the Bazel output layout.

Do the consumers of this rule receive the entire node_modules directory as an input, or individual modules?

If consumers always receive the entire node_modules, then there's no need to create nested artifacts. You can pass around the File for the entire node_modules, and construct paths to subdirectories (if needed for action arguments, etc) through string concatenation instead of by instantiating a File object.

If consumers receive only certain modules as inputs but not the rest, I'm afraid you'll have to split each module into its own action. (You might also want to look into how rules_js does it, as I'm not aware of them having been broken by this flag flip.)

It depends on the consumer. Given a tree of js_library targets and a few js_binary targets, the libraries will receive only the trees of the packages they directly declare dependencies on. When it comes time to bundle the transitive sources into a "binary", the entire node_modules directory needs to be depended on.

This strategy is because the full node_modules directory might contain tens or hundreds of thousands of files, but the direct packages might only be 10-20 files. Having to assemble the entire node_modules tree is terribly slow, and minimizing the scope of depended-on files is important for build performance.

Public JS rulesets such as rules_js tend to be aimed at JS programmers trying to minimally integrate with Bazel, so their approach to sandboxing and repository rule hermeticity tends to be very different from the Bazel norm. I had to write my own rulesets for JS to avoid having a bunch of package.json and node_modules and other weird JS cruft in my repository.

@iancha1992 added the team-Remote-Exec label and removed the team-Core label Mar 25, 2024
@tjgq
Contributor

tjgq commented Mar 26, 2024

There would be a potential conflict if these were outputs of different actions, but they're from the same action. The Merkle tree of ./a will always and by necessity include ./a/b as a node.

Yes, that's strictly better than the nested artifacts being potentially created by different actions, but the fact that an action can consume the inner artifact but not the outer one still creates (created) a lot of implementation complexity and potential for bugs in various places, because the outer artifact can be in a state of having been "half produced" (if e.g. only the inner artifact is fetched from a disk/remote cache). A different API doesn't do away with this issue.

Note that the desire to identify subsets of a directory wouldn't be as necessary if Bazel supported marking files as non-symlinkable (#10299), because that would allow synthetic node_modules directories to be assembled as needed. But without that ability, the node_modules directory has to be assembled all at once in a single action, otherwise the Node runtime's habit of symlink peeking will bypass the Bazel output layout.

I think the phrase "Node runtime's habit of symlink peeking" is doing a lot of work here, but you have to connect the dots for me because I'm not that familiar with Node.js :) Are you saying that, given a node_modules layout with no symlinks, and another where one of the subdirectories has been replaced with a symlink (but the contents behind the symlink are identical to the subdirectory in the former layout), Node.js behaves differently? So you always have to assemble the node_modules tree with the dependencies for a particular action by copying instead of symlinking?

@jmillikin
Contributor Author

Ah well, it sounds like there's no going back to the old capability, so I'll just ask that the change be made clearer in the release notes. And any similarly breaking changes in the future, since otherwise they come as a bit of a shock.

Regarding Node:

When a .js file contains import "some-module", the Node runtime will compute a set of directories to search for a file named some-module/index.js (etc). The set of directories is computed relative to the fully-resolved path of the importing file -- in Java terms, Node calls java.nio.file.Path.toRealPath() for every input file.
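
A rough sketch of that search order (Python-style pseudocode, assuming POSIX paths; the real resolution algorithm has more fallbacks):

# For import "some-module" from /real/app/src/main.js, Node searches
# /real/app/src/node_modules, /real/app/node_modules, /real/node_modules,
# and /node_modules, in that order. Crucially, the starting point is the
# realpath()-resolved location of the importing file, not the symlink
# Bazel placed in the sandbox.
def node_module_dirs(real_importer_path):
    dirs = []
    d = real_importer_path.rsplit("/", 1)[0]
    for _ in range(d.count("/") + 1):
        dirs.append(d + "/node_modules")
        d = d.rsplit("/", 1)[0]
    return dirs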

For Bazel, this has two consequences:

  • Source files (symlinked from the source tree) get their imports resolved relative to the source tree, not the sandbox. Importing generated files generally doesn't work, and it's very easy to accidentally depend on undeclared source files.
  • A node_modules directory containing multiple packages generally cannot be assembled from the outputs of multiple actions. For actions that resolve cross-package imports (such as bundlers, the JS equivalent of C++ linkers) that node_modules directory must contain all the transitive dependencies.
    • This means that if you want to be able to subset the (often very large) node_modules directory, that needs to happen in the same action using overlapping declare_directory artifacts.

For my own use I built a Remote Execution executor that can apply Node-specific fixups (replacing symlinks with their targets), but the resulting projects can't be built by Bazel's local executor. My hope is that Bazel will eventually provide enough control over artifact layout and sandbox setup that such projects can be built with plain vanilla Bazel.

@tjgq
Contributor

tjgq commented Apr 2, 2024

I will get the release notes for 7.0.0 fixed (and, more importantly, see if the process can also be fixed so we don't lose release notes again in the future).

Thanks for giving some more context on the Node.js limitations around symlinks. It's still a little unclear to me why the ability to nest declare_directory artifacts is required to work around this. Using a declare_directory implies that you're copying node_modules from the source tree into the output tree. Why isn't it viable to keep the node_modules in the source tree and, say, declare a filegroup rule (or a custom rule similar in spirit) for each individual package? Then the filegroup targets could be as granular as desired. (This could all be packaged up in a repository rule that encapsulates both the fetching of external dependencies and the setting up of the source tree.)
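
For instance, the generated BUILD file could contain one target per package, something like (sketch; the label names are hypothetical):

# BUILD file generated next to the source-tree node_modules (sketch):
filegroup(
    name = "lit",
    srcs = glob(["node_modules/lit/**"]),
    visibility = ["//visibility:public"],
)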

It also seems to me that Bazel's reliance on symlinks to implement sandboxing is the greater of the two issues, because it will work against Node regardless of whether the node_modules live in the source or the output tree. We do have an alternative sandboxing implementation that relies on hardlinks instead of symlinks (--experimental_use_hermetic_linux_sandbox). I'm curious if you've given it a try?

@jmillikin
Contributor Author

Thanks for giving some more context on the Node.js limitations around symlinks. It's still a little unclear to me why the ability to nest declare_directory artifacts is required to work around this. Using a declare_directory implies that you're copying node_modules from the source tree into the output tree. Why isn't it viable to keep the node_modules in the source tree and, say, declare a filegroup rule (or a custom rule similar in spirit) for each individual package? Then the filegroup targets could be as granular as desired. (This could all be packaged up in a repository rule that encapsulates both the fetching of external dependencies and the setting up of the source tree.)

I'm not sure what you mean by this. The node_modules directory doesn't exist in the source tree, instead there are repository and build rules like this:

# third_party/github.com/lit/lit/repository.bzl
load("//build/rules_yarn/yarn:yarn.bzl", "yarn_archives")

def com_github_lit_lit():
    yarn_archives(
        name = "com_github_lit_lit",
        lockfiles = ["//third_party/github.com/lit/lit:v2.2.1/yarn.lock"],
    )

# third_party/github.com/lit/lit/BUILD
load("//build/rules_yarn/yarn:yarn.bzl", "yarn_install")

licenses(["notice"])  # BSD-3-Clause

yarn_install(
    name = "yarn_installed",
    archives = ["@com_github_lit_lit"],
    package_json = ":v2.2.1/package.json",
    packages = [
        "lit",
        "lit-element",
        "lit-html",
        "@lit/reactive-element",
        "@types/trusted-types",
    ],
    visibility = ["//visibility:public"],
    yarn_lock = ":v2.2.1/yarn.lock",
)

When the yarn_install build rule runs, it uses the tarballs downloaded by yarn_archives to generate a node_modules directory. The packages= attribute lists specific packages (= subdirectories) that need to be made available for dependency resolution, and those packages each generate a call to declare_directory().

The reason the "fetch archives" and "generate node_modules" steps are separate is to avoid running third-party code in the repository rule. The yarn install is run in the build rule so that it can be sandboxed away from the filesystem and prevented from accessing the network.

It also seems to me that Bazel's reliance on symlinks to implement sandboxing is the greater of the two issues, because it will work against Node regardless of whether the node_modules live in the source or the output tree. We do have an alternative sandboxing implementation that relies on hardlinks instead of symlinks (--experimental_use_hermetic_linux_sandbox). I'm curious if you've given it a try?

I haven't used that feature yet, but would be interested in trying it out once it works for macOS. For Linux builds I mostly use a remote execution setup rather than Bazel's built-in sandboxing.

@tjgq
Contributor

tjgq commented Apr 3, 2024

When the yarn_install build rule runs, it uses the tarballs downloaded by yarn_archives to generate a node_modules directory. The packages= attribute lists specific packages (= subdirectories) that need to be made available for dependency resolution, and those packages each generate a call to declare_directory().

The reason the "fetch archives" and "generate node_modules" steps are separate is to avoid running third-party code in the repository rule. The yarn install is run in the build rule so that it can be sandboxed away from the filesystem and prevented from accessing the network.

Thanks, that's the bit I was missing; I assumed you could just run the entire process that produces the node_modules in the repository rule, but you prefer to move it to the execution phase because of isolation concerns.

I haven't used that feature yet, but would be interested in trying it out once it works for macOS. For Linux builds I mostly use a remote execution setup rather than Bazel's built-in sandboxing.

To be clear, I don't believe we have plans at the moment to extend --experimental_use_hermetic_linux_sandbox support to macOS. cc @oquenchil since your use case illustrates why it might be worth having a non-symlink-based sandbox solution that works on macOS.

@tjgq
Contributor

tjgq commented Apr 3, 2024

Here's an additional thought (apologies if this makes no sense - I haven't worked with Node.js in a long time, and my experience predates yarn :))

My understanding is that yarn produces (or can be convinced to produce?) a flat node_modules hierarchy. If that's an accurate statement, there should be no need for nested artifacts, as long as you're able to declare a separate directory for every child of node_modules. So if you can arrange for the repository rule to produce a complete list of module names, a non-repository rule could use that information (which is available at analysis time) to declare N non-overlapping directories, one per module, and have a single action produce all of them, while making it possible to consume each one independently.
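
A minimal sketch of that shape (assuming the repository rule writes out a modules.bzl with the package list; all names here are hypothetical):

# rules.bzl (sketch)
# @npm_deps//:modules.bzl is assumed to be generated by the repository
# rule, e.g. MODULES = ["lit", "lit-html", "@lit/reactive-element"].
load("@npm_deps//:modules.bzl", "MODULES")

def _yarn_install(ctx):
    # One non-overlapping sibling directory per module; no nesting needed.
    outputs = [
        ctx.actions.declare_directory(ctx.attr.name + "/node_modules/" + m)
        for m in MODULES
    ]
    ctx.actions.run(
        inputs = ctx.files.archives,
        outputs = outputs,
        executable = ctx.executable._yarn,
        arguments = [d.path for d in outputs],
    )
    return [DefaultInfo(files = depset(outputs))]

yarn_install = rule(
    implementation = _yarn_install,
    attrs = {
        "archives": attr.label_list(allow_files = True),
        "_yarn": attr.label(
            executable = True,
            cfg = "exec",
            default = "//tools:yarn",  # hypothetical
        ),
    },
)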

Does this approximate a viable solution?

@jmillikin
Contributor Author

The nested artifacts are direct children of node_modules/ -- I haven't tried getting into the wild world of nested dependencies-within-dependencies. In general, my interactions with Node.js stuff are just because a lot of web tools are written in JS nowadays, and I use Bazel to insulate me from it as much as possible.

So if you can arrange for the repository rule to produce a complete list of module names, a non-repository rule could use that information (which is available at analysis time) to declare N non-overlapping directories, one per module, and have a single action produce all of them, while making it possible to consume each one independently.

I think that would be difficult without running Yarn itself during the repository rule phase. The repository rule downloads archives listed in a lock file, and the build rule uses those archives as inputs to build node_modules, but the build rule's "entry point" can be any package in the lockfile. In other words, the action could declare all directories that might get created, but it won't know which ones actually get created.


Since the original point of this issue has been addressed (the change being intentional), feel free to close. I'll try to find another way to wrestle Node into cooperation with the new Bazel v7 requirements.

@tjgq closed this as not planned on Apr 19, 2024