Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asset API: Use of Directory instead of Tree for PushDirectory #165

Open
Qinusty opened this issue Aug 27, 2020 · 2 comments
Open

Asset API: Use of Directory instead of Tree for PushDirectory #165

Qinusty opened this issue Aug 27, 2020 · 2 comments
Assignees

Comments

@Qinusty
Copy link

Qinusty commented Aug 27, 2020

The use of the Directory instead of Tree for Push/Fetch Directory leads to additional overhead on each Fetch which could instead be performed once by the client when pushing the Directories to the CAS, or once by the Fetch Service when performing an optional remote fetch.

This overhead is made clear when attempting to ensure that the full directory Tree under the root_directory_digest is present with the CAS before responding to a FetchDirectory request.

Whilst verifying the existence of these files is not a requirement within the spec, I have encountered issues with clients falling over following a Fetch request which returned a digest to an item which isn't in the CAS. Is this something clients should handle, or servers should ensure the digests within ActionResults exist within the CAS.

It appears that a similar change occurred during the early days of the RE spec. f85ddf0#diff-23d388eca6738e626128af49c6c8f0e5R745-R750

@sstriker
Copy link
Collaborator

Thanks for engaging @Qinusty.

The use of the Directory instead of Tree for Push/Fetch Directory leads to additional overhead on each Fetch which could instead be performed once by the client when pushing the Directories to the CAS, or once by the Fetch Service when performing an optional remote fetch.

The Remote Asset API Push/FetchDirectory calls are deliberately operating on directory digests. The reasoning here is that the API is very likely used for inputs (read: source trees). Source trees tend to leave large portions of a tree untouched when observed through time, which could mean quite a bit of duplication if all were stored as a Tree, rather than a hierarchy of Directories. This is in line with the input_root_digest in Action.

This overhead is made clear when attempting to ensure that the full directory Tree under the root_directory_digest is present with the CAS before responding to a FetchDirectory request.

Implementations can choose to optimize the verification path under the hood. Going from directory to tree is but a CAS.GetTree() call away, and an implementation may choose to cache that.
One could argue that a FindMissingBlobsInTree(root_directory_digest) call would be a useful for implementations as an alternative.

Whilst verifying the existence of these files is not a requirement within the spec, I have encountered issues with clients falling over following a Fetch request which returned a digest to an item which isn't in the CAS. Is this something clients should handle, or servers should ensure the digests within ActionResults exist within the CAS.

Clients need to be able to deal with cases where blobs disappear from CAS. I would say that not doing so leads to a poor user experience. That said, Servers SHOULD ensure that referenced files are present in the CAS at the time of the response.

For what it's worth, I would welcome a suggestion to inline the Blob or Tree in the FetchBlob / FetchDirectory responses based on a to-be-defined request flag. The downside to that is that it would require a higher complexity in service implementations, rather than that being optional.

I think that #144 and #159 are tangentially related to this issue.

@EdSchouten
Copy link
Collaborator

EdSchouten commented Sep 4, 2020

The Remote Asset API Push/FetchDirectory calls are deliberately operating on directory digests. The reasoning here is that the API is very likely used for inputs (read: source trees). Source trees tend to leave large portions of a tree untouched when observed through time, which could mean quite a bit of duplication if all were stored as a Tree, rather than a hierarchy of Directories.

How is that different for directory outputs that use Trees? To make a bit more clear what I mean, let me take the sentence above and change 'Source trees' with 'Directory outputs of build actions'.

Directory outputs of build actions tend to leave large portions of a tree untouched when observed through time, which could mean quite a bit of duplication if all were stored as a Tree, rather than a hierarchy of Directories.

This is in line with the input_root_digest in Action.

But it is not in line with tree_digest in OutputDirectory. In the case of REv2:

  • all data that travels from client -> storage, Directory is used.
  • all data that travels from storage -> client, Tree is used.

The reasoning behind this is somewhat obvious:

  • If ActionResult were to refer to a hierarchy of Directory objects, GetActionResult() would require a crazy amount of work to check for completeness. Or you would require a data store that has a garbage collector that potentially needs to scan very deep graphs.
  • Similarly, a client would need lots of roundtrips to download the data.

The Remote Asset API is not in line with REv2 in that sense. And that's a pity, because it means there can be less reuse of code. If there were concerns about performance/handling of large subtrees, it would have been better to discuss that separately. For example, this was already discussed to a certain extent in #140. Note how various people in that discussion were in agreement that GetTree should likely be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants