Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for symbolic links #18

Merged
merged 1 commit into from Jul 17, 2018
Merged

Conversation

juergbi
Copy link
Contributor

@juergbi juergbi commented Jul 2, 2018

We want to make the Remote Execution API useful for building arbitrary packages using their existing build systems. Symbolic links are commonly used on POSIX systems for various purposes. Package build systems may rely on them and packages may create symlinks in their build output directory.

See also https://docs.google.com/document/d/1gnOYszitgrLVet3sQk-TKGqIcpkkDsc6aw-izoo-d64/edit#

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

1 similar comment
@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

@googlebot googlebot added the cla: no Pull requests whose authors are not covered by a CLA with Google. label Jul 2, 2018
// it can be an absolute path starting with `/`. Absolute paths are
// interpreted relative to the input root directory, i.e., they are not
// allowed to escape the input space. The canonical form forbids the
// substrings `/./` and `//` in the target path. `..` components are allowed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we restrict .. to only be allowed to be leading components, and only for relative symlinks?

I'm looking to avoid two things:

  1. foo/../bar which is redundant and non-canonical - this feels trivial for callers to avoid, so should be forbidden
  2. /../foo which is not allowed because it would escape the input root directory - explicitly saying both "they are not allowed to escape the input space" and ".. components are allowed anywhere" seems to open up an ambiguity that we can avoid.

The text I would suggest is "The canonical form forbids the substrings /./ and // in the target path. .. components are allowed only in relative paths, and only as leading components in the target path, i.e. any substring ../ may only be preceded by 0 or more ../ literals."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. /../foo which is not allowed because it would escape the input root directory - explicitly saying both "they are not allowed to escape the input space" and ".. components are allowed anywhere" seems to open up an ambiguity that we can avoid.

Following POSIX semantics, /../foo would be identical to /foo and thus also not escape the input root directory. However, as this is trivial to canonicalize, I agree that we should not allow symlink targets starting with /../.

  1. foo/../bar which is redundant and non-canonical - this feels trivial for callers to avoid, so should be forbidden

This is not as simple as it may seem. If foo is a symlink to a directory, foo/../bar is not guaranteed to be equivalent to bar (.. must be resolved after following foo). And foo might even be a dangling symlink that becomes valid as part of the build process. To be on the safe side with regards to compatibility I prefer allowing .. components in symlink targets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I re-read the whole discussion in the document comments about absolute paths now, and recall the problems. I'm still not sure whether to allow absolute paths in general, and only block them on a per-platform or per-server basis. I'm not a big fan of treating absolute paths as relative to the input root, because that is very confusing, and I'm not sure what problem will it solve -- packages that create absolute symlinks would still need to be heavily amended, right?
Suggestion: how about we simply disallow absolute paths for now, and will add them later (it is a non-breaking change) when particular use-cases arise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packages that create absolute symlinks would still need to be heavily amended, right?

BuildStream will use workers that build in sandboxes where the input root is the filesystem root and thus, packages can use absolute symlinks without any modifications. I understand that this doesn't apply to other platforms, which is why we propose to leave this decision to each platform.

If this is not currently acceptable, we could definitely follow your suggestion and disallow absolute paths for now and add them later after more discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bleh. Okay, how about the server returning something like:

// Describes how the server treats absolute symlink targets.
enum SymlinkAbsolutePathStrategy {
  UNKNOWN = 0;

  // Server will return an INVALID_ARGUMENT on symlinks with absolute targets.
  DISALLOWED = 1;

  // Server will allow symlink targets to escape the input root tree, possibly resulting in
  // non-hermetic builds.
  ALLOWED = 2;

  // Server will treat absolute symlink targets as relative to the input root, i.e. for "/a/foo"
  // an absolute symlink "/worker123/.../input_rootXYZ/a/foo" will be created.
  INPUT_ROOT = 3;
}
message CacheCapabilities {
   ...
   SymlinkAbsolutePathStrategy symlink_absolute_path_strategy = 5;
}

And then you can just refer to that enum in the comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, wait, that affects the client behavior as well, for symlinks returned from the server. In the ALLOWED case, the client should just symlink the exact same absolute target non-hermetically; for INPUT_ROOT, the client should do the same translation? But bleh, that is just so non-intuitive...

Wait, we already allow setting the working directory in the API. What if we restrict the enum to just allow/disallow for now, and BuildStream just sets the working directory to "/" for its actions? In that case, allowing absolute paths is the same as treating them as input root, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to chime in and I think that ALLOWED and DISALLOWED should be the only behaviours allowed; INPUT_ROOT I think has too much ambiguity to it. If we require the server to reinterpret symlinks in any way, the behaviour could end up depending on whether the server creates the symlink as an absolute or relative path (because the symlink /input_root/foo and ../foo are semantically different, even if they both appear in /input_root/bar, and if /input_root is a symlink to /real_root, is the server expected to make a link to /real_root/foo?). I worry that trying to specify exactly how they should work is going to get confusing especially on Windows where path handling is very complicated.

I see concerns about symlinks escaping the input root, but I can't say that I really understand them. There's plenty of other ways that commands can escape the input root, and I don't think a command that runs /usr/bin/gcc is any less dependent on the worker environment than one that runs ./gcc which is a symlink to /usr/bin/gcc, say.

As for canonicalization, realistically, it is more strongly a set of best practices than hard-and-fast rules. The advantage to encoding them into the API, rather than letting the client do what is best, is merely to let the server detect and reject non-canonical actions. If the server does not do this, then the rule makes no difference (and by Hyrum's law, we really shouldn't have some but not all servers rejecting non-canonical paths); it just becomes something that the client can manage. We could even imagine a "canonical enforcement proxy" which proxies requests, after verifying that they meet the canonicalization requirements.

I personally do not feel at all comfortable trying to specify canonicalization requirements for Windows. For POSIX, I think that we could get away with the rules about // and /./. It may be easier to explicitly frame the requirements here as guidelines, rather than hard rules for the API. And honestly, that should be considered for all canonicalization requirements unless all the implementers are enforcing them or prepared to take that on as a priority.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The working directory does not influence the resolution of absolute symlinks. However, as the workers for BuildStream will set the filesystem root to the input root, allowing absolute paths is the same as treating them as relative to the input root. I.e., restricting the enum to just allow/disallow works for us. Let me know if I shall add this enum to the commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if I shall add this enum to the commit.

Yes, please! Even though we are effectively making it be a boolean, an enum allows us to revisit this decision later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the branch based on the comments above. I've prefixed the enum values with ABSOLUTE_SYMLINK_ to avoid conflict with UNKNOWN in DigestFunction as reported by protoc. Let me know if I should use a different name as prefix or if the conflict should be resolved in a different way.

@juergbi juergbi force-pushed the symlinks branch 2 times, most recently from 7a70a77 to 9b73828 Compare July 4, 2018 17:21
@googlebot
Copy link

CLAs look good, thanks!

1 similar comment
@googlebot
Copy link

CLAs look good, thanks!

@googlebot googlebot added cla: yes Pull requests whose authors are covered by a CLA with Google. and removed cla: no Pull requests whose authors are not covered by a CLA with Google. labels Jul 4, 2018
@ola-rozenfeld
Copy link
Contributor

Ping!

@@ -1231,6 +1249,18 @@ message PriorityCapabilities {
repeated PriorityRange priorities = 1;
}

// Describes how the server treats absolute symlink targets.
enum SymlinkAbsolutePathStrategy {
ABSOLUTE_SYMLINK_UNKNOWN = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grrr, C++ and its stupid scoping rules! How about we move the enum under CacheCapabilities message instead? Our current enums ignore the C++ use case in their naming, and I'd like to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me. Update pushed.

enum SymlinkAbsolutePathStrategy {
ABSOLUTE_SYMLINK_UNKNOWN = 0;

// Server will return an INVALID_ARGUMENT on symlinks with absolute targets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, how about:

// Server will return an INVALID_ARGUMENT on input symlinks with absolute targets.
// If an action tries to create an output symlink with an absolute target, a
// FAILED_PRECONDITION will be returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clarification makes sense. Applied.

@ola-rozenfeld
Copy link
Contributor

@alercah @illicitonion PTAL, thank you!
After this goes in, I will make the official 2.0 release.

Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great :) Thanks!

@ola-rozenfeld ola-rozenfeld merged commit c1c1ad2 into bazelbuild:master Jul 17, 2018
@ola-rozenfeld
Copy link
Contributor

A couple of questions (better late than never):

In the design doc you wrote "Due to this purely symbolic nature of symlinks, it doesn't make sense to include a digest of the target file or directory in the SymlinkNode" But doesn't it? What if the symlink pointed outside of the working directory, and the contents of the file changed, for example symlinks were used to point to some heavyweight tools installed in some system location, and those tools were updated? Isn't it really important to invalidate our cache key as a result? I guess what I'm suggesting is to include the Digest, and use it a precondition check -- error out (with PRECONDITION_FAILURE) if the actual Digest doesn't match the specified one, both on server and client. WDYT?

Secondly, I just discovered that apparently some people generate symlink outputs in their genrules. And yes, I don't mean directories, I mean actual files that are symlinks. We probably want to add an OutputSymlink in addition to OutputFile and OutputDirectory -- either that or add a field is_symlink in both these messages. WDYT?

@juergbi
Copy link
Contributor Author

juergbi commented Sep 18, 2018

What if the symlink pointed outside of the working directory, and the contents of the file changed, for example symlinks were used to point to some heavyweight tools installed in some system location, and those tools were updated?

That's equivalent to specifying the path to a tool outside the input root as a command or in a script. If the exact version of a tool is significant, it should either be part of the input space or covered by platform constraints, in my opinion. Using tools outside the input root is possible with and without symlinks (if the platform allows escaping the input root at all). I don't see why we should treat a path in a symlink target differently from a path in other places.

Am I overlooking a use case where this would be required?

I guess what I'm suggesting is to include the Digest, and use it a precondition check -- error out (with PRECONDITION_FAILURE) if the actual Digest doesn't match the specified one, both on server and client.

I think we should avoid significantly deviating from the common symlink semantics. Otherwise we introduce incompatibilities. E.g., symlinks may point to a parent directory, which would make it impossible to calculate a digest for that directory (cyclic dependency). Symlinks are also allowed to be dangling. And this new requirement could slow down some things as the server would be forced to check/update digests of symlinks in the output tree even in subtrees where there were no real changes.

Secondly, I just discovered that apparently some people generate symlink outputs in their genrules. And yes, I don't mean directories, I mean actual files that are symlinks. We probably want to add an OutputSymlink in addition to OutputFile and OutputDirectory -- either that or add a field is_symlink in both these messages. WDYT?

Yes, OutputSymlink sounds useful. In BuildStream we always use whole directories as output, however, other clients might indeed want that.

@ola-rozenfeld
Copy link
Contributor

Good point on symlinks being misused in this way, let me find out some more about that particular use case, and if it can be addressed differently.

Created #28 for the OutputSymlink. I can't add you as a reviewer for some reason, but please take a look! Thank you!

santigl pushed a commit to santigl/remote-apis that referenced this pull request Aug 26, 2020
…ld#18)

* Adding DownloadActionResult function to intermediate layer.

* Addressing comments.

* Fix breakage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Pull requests whose authors are covered by a CLA with Google.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants