Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track empty directories with BAZEL_TRACK_SOURCE_DIRECTORIES #15774

Closed

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Jun 30, 2022

RecursiveFilesystemTraversalFunction did not emit a ResolvedFile for
empty directories and thus didn't track their existence at all since
they have no children that would track it implicitly.

This commit adds the tracking of empty directories for
DirectoryArtifactTraversalRequest only, which is used to implement the
BAZEL_TRACK_SOURCE_DIRECTORIES flag.

Also adds a basic test for the source directory tracking functionality.

@fmeum fmeum force-pushed the bazel-track-source-directories branch from a657740 to 389682a Compare June 30, 2022 11:16
@fmeum
Copy link
Collaborator Author

fmeum commented Jun 30, 2022

@lberki Is this also something you feel comfortable reviewing?

I'm looking into BAZEL_TRACK_SOURCE_DIRECTORIES because I would personally like to see it flipped, maybe even for Bazel 6. It has seen some real-world usage with rules_nodes (see Features under https://github.com/bazelbuild/rules_nodejs/releases/tag/3.6.0) and has made it onto @alexeagle's list of recommended Bazel flags. What else would you say is necessary for the flip?

@fmeum fmeum marked this pull request as ready for review June 30, 2022 12:48
@sgowroji sgowroji added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc awaiting-review PR is awaiting review from an assigned reviewer labels Jun 30, 2022
@lberki
Copy link
Contributor

lberki commented Jun 30, 2022

@fmeum I would be, but I'm on vacation starting from about 5 minutes from today so I'll bounce this to @haxorz for further routing.

In the abstract, we should totally track source directories correctly because it's an ancient and truly embarrassing bug.

@lberki lberki requested a review from haxorz June 30, 2022 15:56
@haxorz
Copy link
Contributor

haxorz commented Jun 30, 2022

[@lberki]
@fmeum I would be, but I'm on vacation starting from about 5 minutes from today so I'll bounce this to @haxorz for further routing.

In the abstract, we should totally track source directories correctly because it's an ancient and truly embarrassing bug.

Our current plan is to completely remove support for source directories 1, so I'd prefer we didn't make any changes to BAZEL_TRACK_SOURCE_DIRECTORIES esp. not ones that suggest we intend to keep janky support for source directories.

We are going to do this at relatively low priority. If you (@fmeum) or any Bazel stakeholders don't like that, let us know and we'll reconsider.

Otherwise I'd prefer rejecting this PR or at least putting it on hold until @lberki and I discuss further.

Footnotes

  1. Specifically, we'll have Bazel error-out when it encounters a source directory. Users will instead need to use glob. @alexjski currently owns this.

@fmeum
Copy link
Collaborator Author

fmeum commented Jun 30, 2022

@haxorz This is definitely unexpected. Could you elaborate a bit more on why you are planning to do this?

As far as I understand the situation, having proper source directory tracking has at least two distinct advantages compared to globs:

  1. It moves work from the loading to the execution phase, which can (noticeably?) improve performance for large directories. I don't have personal experience with directories of a size that would matter here though, but it may make sense to ask the community.
  2. It makes it possible to deal with file paths that are not valid labels. This point seems quite important to me: Such files are relatively common (popular NPM and Python packages have file names containing spaces) and, short of hand-crafting smart glob exclude patterns, very difficult to deal with within Bazel. As general support for such file paths within Bazel is probably very far out in the future (if it ever happens), source directory tracking always seemed like the best bet at solving these issues in a neat way.

If it's just an issue of priorities/staffing, I'm happy to contribute more extensive tests for the feature.

@haxorz
Copy link
Contributor

haxorz commented Jun 30, 2022

Could you elaborate a bit more on why you are planning to do this?

The decision to research removing support for source directories was made in ~Dec 2021. The outcome of this research may be to remove the feature. It may also be to go in the other direction and make BAZEL_TRACK_SOURCE_DIRECTORIES better and the default.

I didn't want to prematurely commit to the latter outcome right now by approving this PR today.

... having proper source directory tracking has at least two distinct advantages compared to globs...

Yes, great points. Both of those were brought up as theoretical concerns in the discussion in Dec 2021 (comments 9-12 & 32-35 in Google internal issue 7075837). Researching whether they are concerns in practice, both for Google's internal codebase, and for Bazel users everywhere, is still an open task.

... As general support for such file paths within Bazel is probably very far out in the future (if it ever happens), source directory tracking always seemed like the best bet at solving these issues in a neat way.

In particular, I personally agree with you. But (a) there was not full agreement on this in Dec 2021 and (b) I want to leave this decision up to the assignee of the research task (currently @alexjski, who is OOO for the next few weeks).

If it's just an issue of priorities/staffing, I'm happy to contribute more extensive tests for the feature.

Thanks for the offer. @alexjski , wdyt? Should we just go all-in on BAZEL_TRACK_SOURCE_DIRECTORIES and scrap the glob approach?

@fmeum
Copy link
Collaborator Author

fmeum commented Jun 30, 2022

@haxorz Thanks for the clarifications, I now have a much better understanding of the situation. Any chance you could make the information on that Google-internal ticket accessible to me or even public?

@lberki
Copy link
Contributor

lberki commented Jul 2, 2022

I'l be out of office for a while so I probably shouldn't express opinions, so I'll just have an observation: approving this pull request would in no way mean committing to the BAZEL_TRACK_SOURCE_DIRECTORIES approach so in theory @haxorz could do that without any damage to future plans, be it globbing of BAZEL_TRACK_SOURCE_DIRECTORIES.

@alexjski
Copy link
Contributor

My current thinking is that I will try as hard as I can to remove that mostly to remove complexity. That is not to say that this is a done deal, but my approach is that unless I see strong evidence that it is necessary to support this feature and it is the best way to support given usage cases, I will try to cut it. I think of this as an opportunity to reevaluate this feature.

@haxorz This is definitely unexpected. Could you elaborate a bit more on why you are planning to do this?

As far as I understand the situation, having proper source directory tracking has at least two distinct advantages compared to globs:

  1. It moves work from the loading to the execution phase, which can (noticeably?) improve performance for large directories. I don't have personal experience with directories of a size that would matter here though, but it may make sense to ask the community.

This is a great point, definitely something to keep in mind. My hope is that a push to remove that internally would show us whether the problems really exist.

I would like to stress out that there is some complexity to handle here, namely when detecting changes locally. If we are to support the directories, we would need to implement that for source directories properly and maybe make that work with things like --watchfs.

Now, let's imagine we implement support for source directories, if it uses O(files) structure in Skyframe, then we could probably consider that not better than globs. Example is if we depended on DirectoryListingState(source_dir) for incrementality. Now, if we don't do that, then we must rely on some form of composite hash for the source directory. Imagine that we get a watchfs notification that large_source_dir/file1 has changed, in order to compute the new composite hash, naively, we need to do O(files in large_source_dir) reads. If that was a glob, we would only need to read that single file. What I am trying to say is that making that efficient is difficult and we could end up needing some O(files) storage anyway.

  1. It makes it possible to deal with file paths that are not valid labels. This point seems quite important to me: Such files are relatively common (popular NPM and Python packages have file names containing spaces) and, short of hand-crafting smart glob exclude patterns, very difficult to deal with within Bazel. As general support for such file paths within Bazel is probably very far out in the future (if it ever happens), source directory tracking always seemed like the best bet at solving these issues in a neat way.

For this one, I wonder how much this is the case of "this use case is best solved by source directories" vs "this use case happens to work with them".

One thing which comes to my mind is tree artifacts which do allow for e.g. : for files in them. That would share a lot of the characteristics of the source directory (the file list does stay in Skyframe cache, in ActionExecutionValue, hence will take memory). A hack to create such a tree artifact (those are outputs, not inputs) which comes to my mind is to have a tar/zip file as a source and an untar/unzip action. Anyway, that is just a silly example which is quite generic--if you have more specific examples where you rely on that, I would love to hear more about that so I can see how bad not having source directories would be.

I can see couple categories of problems here (not sure which ones do you run into):

  • Make Bazel work with files which do have prohibited characters in them as real inputs, used in actions--is that today served by taking a directory source and looking at just a specific file in it?
  • Make Bazel work with files which have allowed names, but have irrelevant files which are not in the same directory (sounds like what you mentioned)--for that, I would ask if maybe we could just not have such files in the repository?
  • Use Bazel to construct output directories which have Bazel outputs mixed with sources which have disallowed characters in names.

If it's just an issue of priorities/staffing, I'm happy to contribute more extensive tests for the feature.

There is obviously a problem of staffing, but to me, we also have an issue of complexity. One more problem to mention is remote execution which actually needs enumerated file inputs, leaving the need to readdir and digest the files (again?) at execution time. This is something that globs do not suffer from.

We could make source directories internally equivalent to globs, in which case the feature becomes more of a convenience shorthand. In order to recommend that, I would need to understand the use cases which really need that better.

@ulfjack
Copy link
Contributor

ulfjack commented Jul 12, 2022

Well, that might be easy to answer. The reason I added BAZEL_TRACK_SOURCE_DIRECTORIES back then was because glob converts filenames into labels, and there are file names that aren't valid labels, and those filenames cannot be changed because there are existing tools that rely on them. Most critically, this applies to pretty much all iOS builds, and also affects ~all iOS developers at Google. Either they're using the new flag, or they have incorrect incremental builds. Given that this also applies to rules_nodejs and friends, it is critical to find a solution for this ASAP, as it affects a lot of people outside of Google (as far as we know, there are more people outside of Google using Bazel than inside Google using Bazel or Blaze combined).

As for simplification, I don't recall BAZEL_TRACK_SOURCE_DIRECTORIES adding a lot of code given that RecursiveFilesystemTraversalFunction was pre-existing at the time. Unless you see a way towards removing that, I would be surprised if this would significantly affect Bazel complexity. Is Fileset still around?

Changing the label syntax is unlikely to be a viable option, IMO, but - of course - if there's a way to do that, that would be good.

@alexjski
Copy link
Contributor

Well, that might be easy to answer. The reason I added BAZEL_TRACK_SOURCE_DIRECTORIES back then was because glob converts filenames into labels, and there are file names that aren't valid labels, and those filenames cannot be changed because there are existing tools that rely on them. Most critically, this applies to pretty much all iOS builds, and also affects ~all iOS developers at Google. Either they're using the new flag, or they have incorrect incremental builds.

Nice, so that means that this was added with that particular goal in mind. Would you mind expanding to why we didn't use tree artifacts for that instead? I understand that would be less convenient to use, but wouldn't require a feature.

Given that this also applies to rules_nodejs and friends, it is critical to find a solution for this ASAP, as it affects a lot of people outside of Google (as far as we know, there are more people outside of Google using Bazel than inside Google using Bazel or Blaze combined).

As for simplification, I don't recall BAZEL_TRACK_SOURCE_DIRECTORIES adding a lot of code given that RecursiveFilesystemTraversalFunction was pre-existing at the time. Unless you see a way towards removing that, I would be surprised if this would significantly affect Bazel complexity. Is Fileset still around?

Changing the label syntax is unlikely to be a viable option, IMO, but - of course - if there's a way to do that, that would be good.

I gave that a brief thought, on the surface, that seems relatively simple--we could e.g. have escaping like SQL does for '. The issue is all the external tools around which today correctly assume things about the label syntax which would be broken by that.

@fmeum fmeum force-pushed the bazel-track-source-directories branch 5 times, most recently from 57a7f27 to 8903f0d Compare July 29, 2022 09:16
@alexjski
Copy link
Contributor

Well, that might be easy to answer. The reason I added BAZEL_TRACK_SOURCE_DIRECTORIES back then was because glob converts filenames into labels, and there are file names that aren't valid labels, and those filenames cannot be changed because there are existing tools that rely on them. Most critically, this applies to pretty much all iOS builds, and also affects ~all iOS developers at Google.

Update on this--I did some more research which leads me to believe that this statement about iOS is no longer true. It is however true that internally there are files which have : in their names, so that means that the path to glob migration would be much more painful than I initially thought, making it more questionable.

Anyway, I believe that adding full support for source directories will be quite some work, which is why tree artifacts hack is so tempting (those do work already). At this point, I would refrain from making any promises going either way.

@fmeum
Copy link
Collaborator Author

fmeum commented Jul 29, 2022

Do you have any expectations regarding the next steps? Are there any obvious issues with source directory tracking that would need to be resolved or is it more of a hunt for edge cases?

@alexjski
Copy link
Contributor

Do you have any expectations regarding the next steps? Are there any obvious issues with source directory tracking that would need to be resolved or is it more of a hunt for edge cases?

I think that this discussion is orthogonal to this PR. This PR purely improves the behavior of BAZEL_TRACK_SOURCE_DIRECTORIES, so there is no problem accepting it. The reasons why you are using the feature can be used as arguments for keeping it around. At this point, all I wanted to avoid is creating an impression that by accepting this PR we are making a commitment to support this feature in the long term.

@fmeum fmeum force-pushed the bazel-track-source-directories branch from 8903f0d to d90f167 Compare August 5, 2022 15:53
@fmeum fmeum force-pushed the bazel-track-source-directories branch from d90f167 to 5d25731 Compare August 5, 2022 15:56
@fmeum fmeum force-pushed the bazel-track-source-directories branch 2 times, most recently from 8eebf75 to ced7e8a Compare August 8, 2022 13:17
@fmeum fmeum force-pushed the bazel-track-source-directories branch from ced7e8a to d1894c9 Compare August 18, 2022 15:44
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 18, 2022

@alexjski Friendly ping. I resolved the merge conflict.

@alexjski
Copy link
Contributor

@alexjski Friendly ping. I resolved the merge conflict.

Sorry, it fell off my radar. Looking pretty good, added only some minor comments.

RecursiveFilesystemTraversalFunction did not emit a ResolvedFile for
empty directories and thus didn't track their existence at all since
they have no children that would track it implicitly.

This commit adds the tracking of empty directories for
DirectoryArtifactTraversalRequest only, which is used to implement the
BAZEL_TRACK_SOURCE_DIRECTORIES flag.

Also adds a basic test for the source directory tracking functionality.
@fmeum fmeum force-pushed the bazel-track-source-directories branch from d1894c9 to bd52176 Compare August 19, 2022 09:05
Copy link
Contributor

@alexjski alexjski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing that!

@ShreeM01 ShreeM01 removed the awaiting-review PR is awaiting review from an assigned reviewer label Sep 15, 2022
@fmeum fmeum deleted the bazel-track-source-directories branch October 6, 2022 13:58
aiuto pushed a commit to aiuto/bazel that referenced this pull request Oct 12, 2022
RecursiveFilesystemTraversalFunction did not emit a ResolvedFile for
empty directories and thus didn't track their existence at all since
they have no children that would track it implicitly.

This commit adds the tracking of empty directories for
DirectoryArtifactTraversalRequest only, which is used to implement the
BAZEL_TRACK_SOURCE_DIRECTORIES flag.

Also adds a basic test for the source directory tracking functionality.

Closes bazelbuild#15774.

PiperOrigin-RevId: 468804794
Change-Id: I2c06c05b335781234d4fa90f3d8c1e7fe6dd184a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants