CODEC-335: Add DigestUtils.gitBlob and DigestUtils.gitTree methods#427
This change adds two methods to `DigestUtils` that compute generalized Git object identifiers using an arbitrary `MessageDigest`, rather than being restricted to SHA-1:

- `gitBlob(digest, input)`: computes a generalized [Git blob object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given file or byte content.
- `gitTree(digest, file)`: computes a generalized [Git tree object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given directory.

### Motivation

The standard Git object identifiers use SHA-1, which is [in the process of being replaced by SHA-256](https://git-scm.com/docs/hash-function-transition) in Git itself. These methods generalize the identifier computation to support any `MessageDigest`, enabling both forward compatibility and use with external standards.

In particular, the `swh:1:cnt:` (content) and `swh:1:dir:` (directory) identifier types defined by [SWHID (ISO/IEC 18670)](https://www.swhid.org/specification/v1.2/5.Core_identifiers/) are currently compatible with Git blob and tree identifiers respectively (using SHA-1), and can be used to generate canonical, persistent identifiers for unpacked source and binary distributions.
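To make the blob case concrete, here is a minimal sketch of what a generalized blob identifier amounts to: the digest of the header `blob <length>\0` followed by the raw content. The class and method names (`GitBlobSketch`, `blobId`, `hex`) are hypothetical and only for illustration; this is not the code proposed in the PR.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; names are hypothetical, not the PR's API.
final class GitBlobSketch {

    /** Digests "blob <length>\0" followed by the content, as Git does for blob objects. */
    static byte[] blobId(MessageDigest digest, byte[] content) {
        digest.reset();
        digest.update(("blob " + content.length + "\0").getBytes(StandardCharsets.US_ASCII));
        return digest.digest(content);
    }

    /** Lower-case hexadecimal rendering of a raw identifier. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] content = "hello world\n".getBytes(StandardCharsets.UTF_8);
        // Same content, two digests: the classic SHA-1 id and a SHA-256 variant.
        System.out.println(hex(blobId(MessageDigest.getInstance("SHA-1"), content)));
        System.out.println(hex(blobId(MessageDigest.getInstance("SHA-256"), content)));
    }
}
```

With SHA-1 and empty content, this yields the familiar `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391` that Git reports for an empty file.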
Hi @ppkarwasz
Should all this git related code be in a new GitDigest class instead?
Curious: isn't all this in jgit?
You'll need to run 'mvn' by itself and fix build issues before you push.
JGit does provide the building blocks via
For reference, here is the equivalent JGit code for a two-file tree:

```java
final byte[] aBytes = ...; // a.txt
final byte[] bBytes = ...; // nested/b.txt
try (ObjectInserter inserter = new ObjectInserter.Formatter()) {
    final ObjectId aBlob = inserter.idFor(OBJ_BLOB, aBytes);
    final ObjectId bBlob = inserter.idFor(OBJ_BLOB, bBytes);
    final TreeFormatter nestedTreeFormatter = new TreeFormatter();
    nestedTreeFormatter.append("b.txt", FileMode.REGULAR_FILE, bBlob);
    final ObjectId nestedTree = inserter.idFor(nestedTreeFormatter);
    final TreeFormatter rootTreeFormatter = new TreeFormatter();
    rootTreeFormatter.append("a.txt", FileMode.REGULAR_FILE, aBlob);
    rootTreeFormatter.append("nested", FileMode.TREE, nestedTree);
    return inserter.idFor(rootTreeFormatter).name();
}
```
What would you say about an API like the one below? It would have the advantage of being reusable in other contexts. For example, Commons Compress could use it to compute a SWHID without extracting an archive.

```java
public final class GitId {
    public enum FileMode {
        /** Regular, non-executable file ({@code 100644}). */
        REGULAR_FILE("100644"),
        /** Executable file ({@code 100755}). */
        EXECUTABLE_FILE("100755"),
        /** Symbolic link ({@code 120000}). */
        SYMBOLIC_LINK("120000"),
        /** Directory / subtree ({@code 40000}). */
        DIRECTORY("40000");
    }
    public static byte[] blobId(MessageDigest digest, byte[] content);
    public static byte[] blobId(MessageDigest digest, InputStream input) throws IOException;
    public static byte[] blobId(MessageDigest digest, Path path) throws IOException;
    public static TreeBuilder treeBuilder(MessageDigest digest);
    public static final class TreeBuilder {
        public TreeBuilder addFile(String name, FileMode mode, byte[] content);
        public TreeBuilder addFile(String name, FileMode mode, InputStream input) throws IOException;
        public TreeBuilder addFile(String name, FileMode mode, Path path) throws IOException;
        public TreeBuilder addDirectory(String name, TreeBuilder subtree);
        public byte[] build();
    }
}
```
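To illustrate what a builder like this would have to do internally, here is a rough sketch of how a tree identifier is derived from a list of entries: each entry is serialized as `<mode> <name>\0<raw id>`, entries are sorted by name (subtrees compared as if their name ended in `/`), and the result is digested behind a `tree <length>\0` header. `GitTreeSketch` and its `Entry` type are hypothetical names for this example, not part of the proposal above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical.
final class GitTreeSketch {

    /** Hypothetical entry type: octal mode string, entry name, raw (binary) object id. */
    static final class Entry {
        final String mode;
        final String name;
        final byte[] id;

        Entry(String mode, String name, byte[] id) {
            this.mode = mode;
            this.name = name;
            this.id = id;
        }

        /** Git orders subtrees as if their name ended with '/'. */
        byte[] sortKey() {
            String key = "40000".equals(mode) ? name + "/" : name;
            return key.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Unsigned byte-wise comparison; a strict prefix sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Serializes the sorted entries and digests them behind a "tree <length>\0" header. */
    static byte[] treeId(MessageDigest digest, List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort((x, y) -> compareUnsigned(x.sortKey(), y.sortKey()));
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (Entry e : sorted) {
            byte[] head = (e.mode + " " + e.name + "\0").getBytes(StandardCharsets.UTF_8);
            body.write(head, 0, head.length);
            body.write(e.id, 0, e.id.length);
        }
        digest.reset();
        digest.update(("tree " + body.size() + "\0").getBytes(StandardCharsets.US_ASCII));
        return digest.digest(body.toByteArray());
    }
}
```

Because the entries are sorted before hashing, the caller's insertion order does not affect the result, which is one reason a builder-style API can accept files in any order.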
Hi @ppkarwasz

I'm not sure which Commons component the above should belong to. I think you mean it to belong in Codec, but I can't tell what's supposed to be an interface vs. an implementation. Would this PR be reimplemented in terms of the above? Or would this PR provide the implementation for the above?

The name `TreeBuilder` is confusing to me without Javadoc. It's not building a tree, it's building a byte array. Do you mean it processes a directory tree? I can't tell.

In the PR description, you write:

Since Git has been migrating to SHA-256, does this still matter? You only mention SHA-1 in the above.

From an API design perspective, the API inflation is already present with `byte[]`, `InputStream`, and `Path`, and hints that `File`, `Channel`, `Buffer`, and `URI` should also be available, which is the problem Commons IO's builder package attempts to solve.

Aside from that, the current PR seems focused on narrow functionality without introducing framework code, so it fits in nicely. Let me review it again in the morning.
I am not sure which component this belongs in either. To add more context: I am trying to create SLSA Provenance attestations for Java builds. For such attestations to have some value, they need to record some invariants of the build toolchain. When you build on your local machine, the most important build data is what you usually add to the vote e-mail: the Maven and JDK versions. Maven and the JDK are already unpacked on your build machine, so it's not possible to get a classical hash of their distribution, but it is possible to make a “gitTree” hash, which is also among the digests allowed in SLSA. That's why I am looking to introduce some support for

I am trying to introduce support for “gitTree” in two steps:

### Step 1

Initially I would need to just compute

### Step 2

Once we compute the

Devising the best API is complex, so I would leave it for now, but I would take it into consideration to decide where to put

TL;DR: What would you say about refactoring this PR to create some helper methods in a new class:

```java
public final class GitIdentifiers {
    public static byte[] blobId(MessageDigest digest, byte[] content);
    public static byte[] blobId(MessageDigest digest, InputStream input) throws IOException;
    public static byte[] blobId(MessageDigest digest, Path path) throws IOException;
    public static byte[] treeId(MessageDigest digest, Path path) throws IOException;
}
```

Later on, we could extend that class to allow computing a
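For discussion, a `treeId(MessageDigest, Path)` helper along these lines could be sketched as a recursive directory walk. Everything here is an assumption for illustration: the class name `GitTreeWalkSketch`, the mapping of `Files.isExecutable` to mode `100755` (which is unreliable on Windows), and the decision to hash empty directories and any `.git` directory like everything else, which real Git does not do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Illustrative sketch only; names and file-mode heuristics are assumptions.
final class GitTreeWalkSketch {

    /** Digests "<type> <length>\0" followed by the object body. */
    static byte[] objectId(MessageDigest digest, String type, byte[] body) {
        digest.reset();
        digest.update((type + " " + body.length + "\0").getBytes(StandardCharsets.US_ASCII));
        return digest.digest(body);
    }

    /** Unsigned byte-wise comparison; a strict prefix sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    static byte[] treeId(MessageDigest digest, Path dir) throws IOException {
        // Sort key (subtrees get a trailing '/') -> serialized "<mode> <name>\0<raw id>" entry.
        Map<byte[], byte[]> entries = new TreeMap<>(GitTreeWalkSketch::compareUnsigned);
        try (Stream<Path> children = Files.list(dir)) {
            for (Path child : (Iterable<Path>) children::iterator) {
                String name = child.getFileName().toString();
                String sortName = name;
                String mode;
                byte[] id;
                if (Files.isSymbolicLink(child)) {
                    // A symlink is stored as a blob whose content is the link target.
                    mode = "120000";
                    String target = Files.readSymbolicLink(child).toString();
                    id = objectId(digest, "blob", target.getBytes(StandardCharsets.UTF_8));
                } else if (Files.isDirectory(child)) {
                    mode = "40000";
                    sortName = name + "/";
                    id = treeId(digest, child); // finishes before the digest is reused below
                } else {
                    mode = Files.isExecutable(child) ? "100755" : "100644";
                    id = objectId(digest, "blob", Files.readAllBytes(child));
                }
                ByteArrayOutputStream entry = new ByteArrayOutputStream();
                byte[] head = (mode + " " + name + "\0").getBytes(StandardCharsets.UTF_8);
                entry.write(head, 0, head.length);
                entry.write(id, 0, id.length);
                entries.put(sortName.getBytes(StandardCharsets.UTF_8), entry.toByteArray());
            }
        }
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (byte[] e : entries.values()) {
            body.write(e, 0, e.length);
        }
        return objectId(digest, "tree", body.toByteArray());
    }
}
```

Reading each file fully into memory keeps the sketch short; a real implementation would presumably stream file content into the digest, since the blob header only needs the file size up front.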