Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symlinks inside of TreeArtifacts fail with RBE and BwoB #22504

Open
thesayyn opened this issue May 22, 2024 · 6 comments
Open

Symlinks inside of TreeArtifacts fail with RBE and BwoB #22504

thesayyn opened this issue May 22, 2024 · 6 comments
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: support / not a bug (process)

Comments

@thesayyn
Copy link
Contributor

Description of the bug:

With rules_oci 2.x we implemented a feature where we create symlinks inside of a tree-artifact that points out of the tree-artifact (#20891) to files created by other actions and this works fine with RBE and Local execution.

Note that the files that symlinks point to are not in the inputs of the action that creates the symlink, and i don't think it matters much.

However, when the same action run on RBE with --remote_download_minimal flag, it fails with the message below.

Here's the Bazel slack channel threads which might be helpful: https://aspect-build.slack.com/archives/C04281DTLH0/p1716397067812139

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

1- clone https://github.com/bazel-contrib/rules_oci/tree/2.x
2- run bazel test //examples/... --remote_download_minimal on RBE

Which operating system are you running Bazel on?

Darwin

What is the output of bazel info release?

7.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

I don't think is a regression.

Have you found anything relevant by searching the web?

Probably related #15454

Any other information, logs, or outputs that you want to share?

ERROR: /workspaces/infrastructure/services/bazel/buildbarn/asset/BUILD.bazel:10:6: OCI Image //services/bazel/buildbarn/asset:asset.image failed: not all outputs were created or valid```
@thesayyn
Copy link
Contributor Author

cc @tjgq

@tjgq
Copy link
Contributor

tjgq commented May 22, 2024

Note that the files that symlinks point to are not in the inputs of the action that creates the symlink, and i don't think it matters much.

I think that might be the problem here. Building without the bytes uses a virtual filesystem to compute the metadata for action outputs, and that virtual filesystem only has access to the action's declared inputs, for both hermeticity and efficiency reasons. (If, on the other hand, you can repro this even when the symlink can be resolved given only access to the action's inputs, I'd consider it a bug.)

@tjgq
Copy link
Contributor

tjgq commented May 22, 2024

(To further clarify: if I understand the scenario correctly, your build is already non-hermetic; without the input dependency, nothing guarantees that the output pointed to by the symlink is produced first, or even that it's produced at all. Building without the bytes only makes the problem easier to detect.)

@thesayyn
Copy link
Contributor Author

Sorry, i was not arguing if this is bug or not, i am simply trying to understand it and see if it's me who's doing something really horrible or it's a Bazel issue.

I think that might be the problem here.

I agree, that's indeed the culprit, and i understand why one might consider it not a bug. It's just odd given the action succeeds with other spawn strategies, even with RBE + --remote_download_toplevel, With --remote_download_toplevel[all] it's not using a VFS i assume?

your build is already non-hermetic; without the input dependency, nothing guarantees that the output pointed to by the symlink is produced first

It is produced because the files which symlinks point to are in the DefaultInfo of the target, and the action doesn't care if the target file symlink points to is produced first because it eventually is. I would think that what being done here is not a lot different than using declare_symlink + symlink(target_path = ...).

Also since we are talking about hermeticity, do you think it's more appropriate to put these unused inputs into unused_inputs_list attribute? (Okay i don't think it's quite what i thought it was)

Am i doing something really weird here?, because i don't fully understand what's fundamentally wrong here.

FWIW, we have already tested this feature with a few large codebases and it seems to work.

@tjgq
Copy link
Contributor

tjgq commented May 22, 2024

It's just odd given the action succeeds with other spawn strategies

I agree that it's unfortunate that the behavior depends on the contents of the output tree. We could maybe do a better job of flagging these symlinks as non-hermetic (assuming it can be done with a reasonable amount of overhead).

With --remote_download_toplevel[all] it's not using a VFS i assume?

It's an overlay filesystem that falls back to the actual filesystem; in a purely "without the bytes" build, there's nothing to fall back to. (It should probably only fall back for files that have been determined to be in the input set; this would make the behavior consistent, but is tricky to implement in an efficient manner.)

the action doesn't care if the target file symlink points to is produced first because it eventually is

This is simply not how Bazel works; the outputs of an action must be available when the action finishes, not later. Among other reasons why, Bazel needs to hash the tree artifact contents to be able to determine whether and how to run followup actions that depend on it.

I would think that what being done here is not a lot different than using declare_symlink + symlink(target_path = ...).

It's different because declare_symlink doesn't care about file contents; declare_file and declare_directory do (and in order to determine the contents, they must follow symlinks).

Also since we are talking about hermeticity, do you think it's more appropriate to put these unused inputs into unused_inputs_list attribute? (Okay i don't think it's quite what i thought it was)

It won't help;unused_inputs_list is unrelated to the issue you're having.

Am i doing something really weird here?, because i don't fully understand what's fundamentally wrong here.

I worry that you might be trying too hard to optimize things by creating symlinks (and running into non-intuitive problems as a result). Would it perhaps be acceptable to copy files around instead, and rely on the various caching mechanisms built into Bazel to keep your builds fast?

@iancha1992 iancha1992 added the team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts label May 22, 2024
@comius comius added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts labels May 23, 2024
@alexeagle
Copy link
Contributor

Thanks @tjgq

copy files around instead

Note that they are often huge. PyTorch with cuda is multiple GB by itself and some clients need to build that into multiple OCI images.

But I think it's simply unavoidable given your analysis here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: support / not a bug (process)
Projects
None yet
Development

No branches or pull requests

8 participants