For discussion: full-file hashes in hyperdrive metadata #12

Open
bnewbold wants to merge 2 commits into master

Conversation

bnewbold (Contributor)

Rendered pre-merge

"Full-file hashes are optionally included in hyperdrive metadata to complement the existing cryptographic-strength hashing of sub-file chunks. Multiple popular hash algorithms can be included at the same time."

Previous discussion:

Before seriously reviewing this to merge as a Draft, I would want to demonstrate working code and have a better idea of what the API would look like, but it is otherwise pretty well fleshed out.

cc: @martinheidegger

pfrazee (Contributor) commented Mar 18, 2018

Other advantages:

  • Hashes can be used as pointers on the network that imply happens-before causality. That is: if a record can point to the hash of the file, and the record has not been changed since it was created, then we know that the pointed-to file existed before the record did (see the sketch after this list).
  • The discovery & wire network can be expanded in the future to fetch individual files using these hashes.
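
To make the first point concrete, here is a toy sketch (mine, not from the thread; the names and record shape are hypothetical) of a record embedding a full-file hash:

```ts
// Toy illustration of hash pointers implying happens-before causality.
// Because a correct digest cannot be produced before the file's content exists,
// any record that embeds that digest (and has not changed since creation) must
// have been created after the file.
import { createHash } from "node:crypto";

interface PointerRecord {
  createdAt: string; // when the record was written
  fileHash: string;  // hex SHA-256 of the referenced file's content
}

const fileContent = Buffer.from("example file contents");
const record: PointerRecord = {
  createdAt: "2018-03-18",
  fileHash: createHash("sha256").update(fileContent).digest("hex"),
};

// Verifying the pointer later: recompute the digest and compare.
console.log(record.fileHash === createHash("sha256").update(fileContent).digest("hex"));
```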


martinheidegger (Contributor) left a comment


This is a nice addition to Dat. I added a few questions and conversation starters, but it all boils down to one question for me: how strict should we be with hashes? Making them optional for interoperability reasons is good, imo.
I feel like there are various recommendations missing from the dat project: what expectations should Dat clients/users have about hashes? Should we encourage writers to add hashes? Should we encourage writers to add multiple hashes?
Since data integrity in Dat already exists at the chunk level, I feel like we should have a blake2b hash (at how many bits?) as a recommendation.


Full-file hashes are optionally included in hyperdrive metadata to complement
the existing cryptographic-strength hashing of sub-file chunks. Multiple
popular hash algorithms can be included at the same time.
martinheidegger (Contributor):

Does it matter if they are popular?

bnewbold (Author):

Yes! It matters if we are trying to be interoperable with existing databases and large users, and in particular to bridge to older "legacy" systems which might not support newer ("better") hash functions. To be transparent, I work at the Internet Archive, and we have dozens of petabytes of files hashed and cataloged with MD5 and SHA1 hashes (because they are popular, not because they are "strong" or the "best" in any sense). We'll probably re-hash with new algorithms some day, but would like to do so only rarely.

martinheidegger (Contributor):

This was a little nitpicking around my impression that the sentence would have the same meaning and impact without "popular"; it got blown out of proportion.


The design decisions to adopt hash variants are usually well-founded, motivated
by security concerns (such as pre-image attacks), efficiency, and
implementation concerns.
martinheidegger (Contributor):

... but this is not the case in Dat. The data is both validated and secured at the chunk level. Isn't it, in our case, merely for deduplication?

bnewbold (Author):

I'm not sure I understand. The dat implementation does use something other than popular full-file hash algorithms internally, for good reasons.

martinheidegger (Contributor):

It uses chunk-based hashes, yes. But all data received through the dat protocol is signed. I am not sure how you would smuggle unsigned content into a dat.


Multiple hashes would be calculated in parallel with the existing
chunking/hashing process, in a streaming fashion. Final hashes would be
calculated when the chunking is complete, and included in the `Stat` metadata.
martinheidegger (Contributor):

But multiple hashes are not required, right?

bnewbold (Author):

Correct; this is making the point that even with multiple hashes, the file only needs to be scanned (read from disk) once.
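
A minimal sketch of that one-pass approach (not from the proposal; it assumes Node's built-in crypto module, and fullFileHashes is a hypothetical helper name; sha256 stands in for blake2b-256, which is not in Node's crypto and would need an external library such as sodium-native):

```ts
// Sketch: compute several full-file digests during a single streaming read,
// so the file is only scanned from disk once even when multiple hashes are
// requested. In hyperdrive, the read loop below would be the existing
// chunking/hashing pipeline, with each chunk also fed to the extra hashers.
import { createHash, Hash } from "node:crypto";
import { createReadStream } from "node:fs";

async function fullFileHashes(path: string): Promise<Record<string, Buffer>> {
  const hashers: Record<string, Hash> = {
    sha1: createHash("sha1"),
    sha256: createHash("sha256"), // stand-in for blake2b-256
  };
  for await (const chunk of createReadStream(path)) {
    for (const h of Object.values(hashers)) h.update(chunk);
  }
  // Finalize all digests once the stream (i.e. the chunking) is complete.
  return Object.fromEntries(
    Object.entries(hashers).map(([name, h]) => [name, h.digest()])
  );
}
```

This mirrors the quoted text above: final hashes are computed when the chunking finishes and could then be written into the `Stat` metadata.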

What does the user-facing API look like, specifically?

Should we allow non-standard hashes, like the git "hash", or higher-level
references like (single-file) bittorrent magnet links or IPFS file references?
martinheidegger (Contributor):

We shouldn't allow arbitrary values for the hash field. There should be one clear specification of which type defines which hashing algorithm.

bnewbold (Author):

I'm not sure I follow. What I was getting at here was "what if there are additional hash or Merkle tree references a user would want to include that are not in the multihash table"?

martinheidegger (Contributor):

Let me rephrase: no, we should not allow non-standard hashes, though "standard" in this context means the standard we set. I am okay with a "githash" being added, for example.

Modifying a small part of a large file would require re-hashing the entire
file, which is slow. Should we skip including the updated hashes in this case?
Currently mitigated by the fact that we duplicate the entire file when recording
changes or additions.
martinheidegger (Contributor):

Hashes are optional, so it is up to the implementor/user whether to add a hash or not, including for updates of a file.

`type` is a number representing the hash algorithm, and `value` is the
bytestring of the hash output itself. The length of the hash digest (in bytes)
is available from protobuf metadata for the value. This scheme, and the `type`
value table, is intended to be interoperable with the [multihash][multihash]
martinheidegger (Contributor):

We should reference the multihash table to specify what the types are?!

bnewbold (Author):

The table is linked from the multihash homepage, which is already linked here... I'm wary of deep-linking directly into a GitHub blob (the repo could move to a new platform, or the file could be renamed), but maybe that's an overblown concern.

martinheidegger (Contributor):

Having a link seems better than not having it. To remove/reduce that concern you could link to a commit number or make a mirror of it.
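
To illustrate the `type`/`value` pairing described in the quoted text above, here is a small sketch (not a finalized schema; the interface name and fields just mirror the prose), using algorithm codes from the multihash table:

```ts
// Sketch of a full-file hash entry: a numeric algorithm code plus the raw
// digest bytes. Codes follow the multihash table, e.g. 0x11 = sha1,
// 0x12 = sha2-256, 0xb220 = blake2b-256. The digest length is not stored
// separately; it is implied by the length of the value field.
import { createHash } from "node:crypto";

interface FileHash {
  type: number;      // multihash algorithm code
  value: Uint8Array; // raw digest bytes
}

const content = Buffer.from("example file contents");
const sha1Entry: FileHash = {
  type: 0x11, // sha1
  value: createHash("sha1").update(content).digest(),
};
```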

bnewbold (Author):

I feel like there are various recommendations missing from the dat project

This was intentional. I don't think we should be over-prescriptive; folks should be able to adapt this feature to their own needs (including ones we aren't even thinking of at this time).

For the sake of simplicity, if users or implementors don't want to be bothered choosing or coordinating algorithms to use by default, this draft does say:

For 2018, recommended default full-file hash functions to include are SHA1 (for popularity and interoperability) and blake2b-256 (already used in other parts of the Dat protocol stack).
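
Purely as an illustration of that default (the constant name and shape are my own, not from the draft), the recommended 2018 set could be expressed as:

```ts
// Hypothetical default set matching the draft's 2018 recommendation:
// SHA1 for popularity/interoperability, blake2b-256 because it is already
// used elsewhere in the Dat protocol stack. Codes are multihash table values.
const DEFAULT_FULL_FILE_HASHES = [
  { name: "sha1", multihashCode: 0x11 },
  { name: "blake2b-256", multihashCode: 0xb220 },
] as const;
```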

martinheidegger (Contributor) commented Mar 20, 2018

I seem to have overlooked that section 😊

For 2018, recommended default full-file hash functions to include are SHA1 (for popularity and interoperability) and blake2b-256 (already used in other parts of the Dat protocol stack).

Now I am all good, though @mafintosh might have something to say about additionally having to compute SHA1.

holepunchto/hyperdrive#203 (comment)

bnewbold (Author) commented Mar 20, 2018 via email
