For discussion: full-file hashes in hyperdrive metadata #12

Open
bnewbold wants to merge 2 commits into master

Conversation

bnewbold (Contributor)

Rendered pre-merge

"Full-file hashes are optionally included in hyperdrive metadata to complement the existing cryptographic-strength hashing of sub-file chunks. Multiple popular hash algorithms can be included at the same time."

Previous discussion:

Before seriously reviewing this to merge as a Draft, I would want to demonstrate working code and have a better idea of what the API would look like, but it is otherwise pretty well fleshed out.

cc: @martinheidegger

pfrazee (Contributor) commented Mar 18, 2018

Other advantages:

  • Hashes can be used as pointers on the network that imply happens-before causality. That is: if a record can point to the hash of the file, and the record has not been changed since it was created, then we know that the pointed-to file existed before the record did (see the sketch after this list).
  • The discovery & wire network can be expanded in the future to fetch individual files using these hashes.
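
To make the first point concrete, here is a toy sketch (mine, not from the thread; the names and record shape are hypothetical) of a record embedding a full-file hash:

```ts
// Toy illustration of hash pointers implying happens-before causality.
// Because a correct digest cannot be produced before the file's content exists,
// any record that embeds that digest (and has not changed since creation) must
// have been created after the file.
import { createHash } from "node:crypto";

interface PointerRecord {
  createdAt: string; // when the record was written
  fileHash: string;  // hex SHA-256 of the referenced file's content
}

const fileContent = Buffer.from("example file contents");
const record: PointerRecord = {
  createdAt: "2018-03-18",
  fileHash: createHash("sha256").update(fileContent).digest("hex"),
};

// Verifying the pointer later: recompute the digest and compare.
console.log(record.fileHash === createHash("sha256").update(fileContent).digest("hex"));
```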


martinheidegger (Contributor) left a comment


This is a nice addition to Dat. I added a few questions and conversation starters, but it all boils down to one question for me: how strict should we be with hashes? Making them optional for interoperability reasons is good, imo.
I feel like there are various recommendations missing from the dat project: what expectations should Dat clients/users have about hashes? Should we encourage writers to add hashes? Should we encourage writers to add multiple hashes?
Since data integrity in Dat already exists at the chunk level, I feel like we should have a blake2b hash (at how many bits?) as a recommendation.


Full-file hashes are optionally included in hyperdrive metadata to complement
the existing cryptographic-strength hashing of sub-file chunks. Multiple
popular hash algorithms can be included at the same time.
martinheidegger (Contributor):

Does it matter if they are popular?

bnewbold (Author):

Yes! It matters if we are trying to be interoperable with existing databases and large users, and in particular to bridge to older "legacy" systems which might not support newer ("better") hash functions. To be transparent, I work at the Internet Archive, and we have dozens of petabytes of files hashed and cataloged with MD5 and SHA1 hashes (because they are popular, not because they are "strong" or the "best" in any sense). We'll probably re-hash with new algorithms some day, but would like to do so only rarely.

martinheidegger (Contributor):

This was a little nitpicking around my impression that the sentence would have the same meaning and impact without "popular"; it got blown out of proportion.


The design decisions to adopt hash variants are usually well-founded, motivated
by security concerns (such as pre-image attacks), efficiency, and
implementation concerns.
martinheidegger (Contributor):

... but this is not the case in Dat. The data is both validated and secured at the chunk level. Isn't it, in our case, merely for deduplication?

bnewbold (Author):

I'm not sure I understand. The dat implementation does use something other than popular full-file hash algorithms internally, for good reasons.

martinheidegger (Contributor):

It uses chunk-based hashes, yes. But all data received through the dat protocol is signed. I am not sure how you would smuggle unsigned content into a dat.


Multiple hashes would be calculated in parallel with the existing
chunking/hashing process, in a streaming fashion. Final hashes would be
calculated when the chunking is complete, and included in the `Stat` metadata.
martinheidegger (Contributor):

But multiple hashes are not required, right?

bnewbold (Author):

Correct; this is making the point that even with multiple hashes, the file only needs to be scanned (read from disk) once.
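
A minimal sketch of that one-pass approach (not from the proposal; it assumes Node's built-in crypto module, and fullFileHashes is a hypothetical helper name; sha256 stands in for blake2b-256, which is not in Node's crypto and would need an external library such as sodium-native):

```ts
// Sketch: compute several full-file digests during a single streaming read,
// so the file is only scanned from disk once even when multiple hashes are
// requested. In hyperdrive, the read loop below would be the existing
// chunking/hashing pipeline, with each chunk also fed to the extra hashers.
import { createHash, Hash } from "node:crypto";
import { createReadStream } from "node:fs";

async function fullFileHashes(path: string): Promise<Record<string, Buffer>> {
  const hashers: Record<string, Hash> = {
    sha1: createHash("sha1"),
    sha256: createHash("sha256"), // stand-in for blake2b-256
  };
  for await (const chunk of createReadStream(path)) {
    for (const h of Object.values(hashers)) h.update(chunk);
  }
  // Finalize all digests once the stream (i.e. the chunking) is complete.
  return Object.fromEntries(
    Object.entries(hashers).map(([name, h]) => [name, h.digest()])
  );
}
```

This mirrors the quoted text above: final hashes are computed when the chunking finishes and could then be written into the `Stat` metadata.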

What does the user-facing API look like, specifically?

Should we allow non-standard hashes, like the git "hash", or higher-level
references like (single-file) bittorrent magnet links or IPFS file references?
martinheidegger (Contributor):

We shouldn't allow arbitrary values for the hash field. There should be one clear specification of which type defines which hashing algorithm.

bnewbold (Author):

I'm not sure I follow. What I was getting at here was "what if there are additional hash or Merkle tree references a user would want to include that are not in the multihash table"?

martinheidegger (Contributor):

Let me rephrase: no, we should not allow non-standard hashes, though "standard" in this context means the standard we set. I am okay with a "githash" being added, for example.

Modifying a small part of a large file would require re-hashing the entire
file, which is slow. Should we skip including the updated hashes in this case?
Currently mitigated by the fact that we duplicate the entire file when recording
changes or additions.
martinheidegger (Contributor):

Hashes are optional, so it is up to the implementor/user whether to add a hash or not, including for updates of a file.

`type` is a number representing the hash algorithm, and `value` is the
bytestring of the hash output itself. The length of the hash digest (in bytes)
is available from protobuf metadata for the value. This scheme, and the `type`
value table, is intended to be interoperable with the [multihash][multihash]
martinheidegger (Contributor):

We should reference the multihash table to specify what the types are?!

bnewbold (Author):

The table is linked from the multihash homepage, which is already linked here... I'm wary of deep-linking directly into a GitHub blob (the repo could move to a new platform, or the file could be renamed), but maybe that's an overblown concern.

martinheidegger (Contributor):

Having a link seems better than not having it. To remove/reduce that concern you could link to a commit number or make a mirror of it.
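
To illustrate the `type`/`value` pairing described in the quoted text above, here is a small sketch (not a finalized schema; the interface name and fields just mirror the prose), using algorithm codes from the multihash table:

```ts
// Sketch of a full-file hash entry: a numeric algorithm code plus the raw
// digest bytes. Codes follow the multihash table, e.g. 0x11 = sha1,
// 0x12 = sha2-256, 0xb220 = blake2b-256. The digest length is not stored
// separately; it is implied by the length of the value field.
import { createHash } from "node:crypto";

interface FileHash {
  type: number;      // multihash algorithm code
  value: Uint8Array; // raw digest bytes
}

const content = Buffer.from("example file contents");
const sha1Entry: FileHash = {
  type: 0x11, // sha1
  value: createHash("sha1").update(content).digest(),
};
```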

bnewbold (Author):

I feel like there are various recommendations missing from the dat project

This was intentional. I don't think we should be over-prescriptive; folks should be able to adapt this feature to their own needs (including ones we aren't even thinking of at this time).

For the sake of simplicity, if users or implementors don't want to be bothered choosing or coordinating algorithms to use by default, this draft does say:

For 2018, recommended default full-file hash functions to include are SHA1 (for popularity and interoperability) and blake2b-256 (already used in other parts of the Dat protocol stack).
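
Purely as an illustration of that default (the constant name and shape are my own, not from the draft), the recommended 2018 set could be expressed as:

```ts
// Hypothetical default set matching the draft's 2018 recommendation:
// SHA1 for popularity/interoperability, blake2b-256 because it is already
// used elsewhere in the Dat protocol stack. Codes are multihash table values.
const DEFAULT_FULL_FILE_HASHES = [
  { name: "sha1", multihashCode: 0x11 },
  { name: "blake2b-256", multihashCode: 0xb220 },
] as const;
```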

martinheidegger (Contributor) commented Mar 20, 2018

I seem to have overlooked that section 😊

For 2018, recommended default full-file hash functions to include are SHA1 (for popularity and interoperability) and blake2b-256 (already used in other parts of the Dat protocol stack).

Now I am all good, though @mafintosh might have something to say about additionally having to compute SHA1.

holepunchto/hyperdrive#203 (comment)

bnewbold (Author) commented Mar 20, 2018 via email
