Bazel hashing of external directory output
We allow repository rules to follow freely floating targets, provided they return a modified set of arguments that yields a reproducible version of themselves (design, blog post). As writing rules that correctly provide those arguments is a hard and error-prone task, Bazel will support this by computing, and supporting verification of, hashes of the directory generated by an external repository. This is similar to the way sandboxing supports writing rules by detecting missing dependencies in actions.
Data included in the repository hash
It is important to understand that this hash is not a security feature. Certain data of the directory structure, in particular the owner of the files, are deliberately not included. While a reasonable build will not depend on the ownership of its source files, it is technically possible to have a build where the behavior of the resulting binary depends on the owner of a source file, even in a malicious way.
Owners and timestamps ignored
We expect users to build as ordinary users (i.e., not as the privileged user). Files can therefore only be created with the current user as owner, so the ownership of files in external repositories differs from developer to developer. This is fine as long as the external repository is built in a reasonable way.
For the same reason, we do not include the time stamp in the hash of the
directory. Files generated by
ctx.file, as well as files checked out by
git, will have the current time as time stamp, which is not reproducible,
but a sensible build process will not depend on the time stamps.
Executable bit stored
We do include the information whether the file is executable for the owner. Wrong permissions can be the cause of annoying build failures, and most ways of providing an external repository either track this information explicitly or, at least, set it to a reproducible value (not executable).
Symlinks partially expanded
Symlinks provide two kinds of information. Primarily they are just a
string that can be read via
readlink(2); also, if accessed as a
file or directory, e.g., via
fopen(3), the string is interpreted as
a filename and the respective file or directory, if it exists, is
accessed instead.
Examples of symlinks in external repositories
http_archive and other Bazel-provided repository rules symlink the
BUILD file into the generated repository. The symlink is absolute, so in particular it depends on the location of the workspace on the local machine. Hence we cannot just
readlink(2) all links and expect a reproducible hash.
Some external projects come with cyclic symlinks. E.g., the alsa library (alsa-lib-1.1.2.tar.bz2 with sha256 d38dacd9892b06b8bff04923c380b38fb2e379ee5538935ff37e45b395d861d6) has, in the
include subdirectory, a symlink
.. So, replacing all symlinks by (the hash of) what they point to does not work, as symlink cycles exist in the real world.
Proposal on the hash
For absolute paths pointing to files, the file being pointed to is hashed, including the information whether it is executable.
For all other symlinks (relative paths, absolute links to directories, dangling symlinks), the link itself is hashed.
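The rules from the preceding sections can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm or encoding Bazel uses: the function names `tree_hash` and `_file_digest`, the choice of SHA-256, and the textual record format are all hypothetical. The sketch ignores owners and time stamps, records the owner-executable bit, hashes the target of absolute links to files, and hashes the link text for every other symlink (so cyclic links are never followed).

```python
import hashlib
import os
import stat


def _file_digest(path):
    """Hash file contents plus the owner-executable bit; owner and mtime are ignored."""
    h = hashlib.sha256()
    # os.stat follows symlinks, so for a link this inspects the file pointed to.
    executable = bool(os.stat(path).st_mode & stat.S_IXUSR)
    h.update(b"x" if executable else b"-")
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_hash(root):
    """Illustrative directory hash; the record format is an assumption."""
    h = hashlib.sha256()

    def record(entry):
        h.update(entry.encode("utf-8", "surrogateescape") + b"\0")

    def visit(directory, rel):
        # Sorted traversal makes the result independent of readdir order.
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            rel_name = f"{rel}/{name}" if rel else name
            if os.path.islink(path):
                target = os.readlink(path)
                if os.path.isabs(target) and os.path.isfile(path):
                    # Absolute link to a file: hash the file being pointed to.
                    record(f"file {rel_name} {_file_digest(path)}")
                else:
                    # Relative, directory, or dangling link: hash the link text.
                    record(f"link {rel_name} {target}")
            elif os.path.isdir(path):
                record(f"dir {rel_name}")
                visit(path, rel_name)
            else:
                record(f"file {rel_name} {_file_digest(path)}")

    visit(root, "")
    return h.hexdigest()
```

Two checkouts that differ only in ownership, time stamps, or the workspace-dependent target path of an absolute link to a file produce the same value here, while flipping the executable bit of any file changes it.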
For external repository rules like
git_repository, additional directories
are created besides the actual source code, e.g., the
And while the actual source is determined by the specified commit id,
the contents of those subdirectories are not. Which additional files and
directories are created is knowledge specific to the individual rule.
Proposal: rules clean up themselves
We propose that the rules are in charge of removing all unrelated files and directories; at the very least they must remove all parts that are not byte-for-byte reproducible.
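As a rough illustration of this cleanup step, the Python sketch below removes a rule-supplied list of byproducts from a generated directory. In an actual repository rule this logic would be written in Skylark; the helper name `remove_unreproducible` and its parameters are hypothetical, and which paths to pass is knowledge that only the individual rule has.

```python
import os
import shutil


def remove_unreproducible(repo_root, relative_paths):
    """Delete rule-specific byproducts (hypothetical helper); return what was removed."""
    removed = []
    for rel in relative_paths:
        path = os.path.join(repo_root, rel)
        if os.path.islink(path) or os.path.isfile(path):
            # Remove files and the links themselves, never a link's target.
            os.remove(path)
        elif os.path.isdir(path):
            shutil.rmtree(path)
        else:
            continue  # path does not exist, nothing to clean up
        removed.append(rel)
    return removed
```

Returning the list of removed paths keeps the helper easy to check: a rule author can compare it against the byproducts the rule is known to create.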
Alternative considered: rules tell bazel to ignore certain parts
An alternative considered was that the rules would declare which parts of the created directory are not part of the code and should be ignored by Bazel. This would, however, imply an even more complicated interface, as rules would then have to return two kinds of information: the actual resolved information (i.e., the new dict of keyword arguments), as well as the set of objects to ignore.
This seems a considerable extension of the interface for unclear benefit:
as the directory of an external repository is completely removed
before another call to the rule, we cannot save bandwidth by keeping
the .git repository around.
hash included in the
resolved value in the
contain an additional key
output_tree_hash for every entry in the
repositories field of every entry indicating the call to a Skylark
repository rule. As only a new field is added (and the value is experimental
anyway), this change does not break any legitimate use cases: users are free
to ignore the additional value.
In the long run: all source-like repositories will be taken from the
WORKSPACE.resolved file, where hashes are provided separately, and can be
checked. This check will only be done for source-like rules, and we will add
an option to ignore it for individual repositories.
To allow for a clean transition period, the output-directory verification will
be opt-in initially, even before we fully switch to the
WORKSPACE.resolved distinction, and also before the source-like/configure-like
distinction is done. We add a new option. This option will
specify a Skylark file, where the value
resolved is taken from and expected
to have the same structure as a resolved value in a file written by the
--experimental_repository_resolved_file option. Repositories are associated by their
name attribute. As rules will only start to become reproducible
one by one, in the transition period an option will be used to specify the
repository rules for which verification should happen (defaulting to the
empty list, so this feature is opt-in as well).
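The opt-in check described above could look roughly like the following Python sketch. It assumes the resolved value is a list of dicts whose `repositories` entries carry `output_tree_hash`, and that the repository name sits under an `attributes` dict; that last detail, like the function names `expected_hashes` and `verify`, is an assumption made for illustration.

```python
def expected_hashes(resolved):
    """Map repository name to the recorded output_tree_hash of a resolved value."""
    hashes = {}
    for entry in resolved:
        for repo in entry.get("repositories", []):
            # Assumption: the name attribute is stored under "attributes".
            name = repo.get("attributes", {}).get("name")
            if name is not None and "output_tree_hash" in repo:
                hashes[name] = repo["output_tree_hash"]
    return hashes


def verify(resolved, actual_hashes, repos_to_verify=()):
    """Return the opted-in repositories whose hash differs from the recorded one."""
    expected = expected_hashes(resolved)
    return [
        name
        for name in repos_to_verify
        if name in expected and expected[name] != actual_hashes.get(name)
    ]
```

With the default empty `repos_to_verify`, nothing is checked, which mirrors the opt-in behavior during the transition period.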