Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store version separately from hash in integrity checks #335

Closed
MonoidMusician opened this issue Dec 30, 2018 · 10 comments
Closed

Store version separately from hash in integrity checks #335

MonoidMusician opened this issue Dec 30, 2018 · 10 comments

Comments

@MonoidMusician
Copy link
Collaborator

As it stands, the version (of the standard) is an input to the sha256 hash which is generated for and used in the semantic integrity checks. It makes sense to invalidate hashes when the version changes, but there are version changes that don't affect the hash, other than changing the version string, right?

My (incomplete) understanding is that this causes a few difficulties: hashes need to be completely regenerated for even minor version changes (which increases the size of diffs and requires the use of tools such as dhall freeze to upgrade every import), and it's not easy for an implementation to detect if a hash is from a version it is still compatible with. (Well, I guess this is the part that I don't fully understand – see the second half of this issue.)

Anyhow, my proposal is that the version of the standard be no longer included in the hash itself but instead be stored next to the hash, with a syntax that looks like this: sha256:5.0.0/bb1e096305428d2e155282d156ddec47cca75cd61bc0ef1aa6ce4cf5eae91e38.

I could see two advantages of this check:

  • (1) Implementations would be able to immediately determine whether they are compatible with the version under which the hash was generated, by inspecting the now-transparent version string. So an implementation of version 5.1.0 could easily recognize version 5.0.0 and verify the hashes using its same algorithm to obtain the same result (since the integrity check should backwards compatible across the minor release).
  • (2a) For releases that don't change the contents of the integrity check, hashes would not even need to be recomputed for imports, just the version string updated. (This results in a very clean diff, and is easy to accomplish by hand.) In fact, the version string might not even need to change, if the new implementation still recognizes the former one! This seems like it would be an easier way to be more flexible in accommodating what versions a Dhall import will work with.
  • (2b) Even for releases that change the contents of the integrity check, but only for some language constructs, files that don't use the affected parts of the language will still have the same hash, so their version string can be updated by itself, again resulting in a cleaner diff. (Obviously one would want to recompute the hash just to verify it is the same, but there will a greater share of cases where the hash does not change.)

I'm raising this proposal purely for the purposes of discussion, since this is a format that would make sense to me. But I don't actually work with integrity checks, so I don't have a preference either way, and I don't know that I have a particularly good understanding of the issues at hand. In particular, I see that the standard specifies that backwards-compatibility SHOULD be supported already, but I guess I am unfamiliar with the mechanism by which it would be accomplished – is it easier to achieve than I thought?

All three changes require users to update their semantic integrity checks if they upgrade since the entire version string is an input to the hash in a semantic integrity check. Implementations SHOULD support older versions of the standard that match in the first version component so that users can defer upgrading their code until they are ready to regenerate their semantic integrity checks.

(This proposal would effectively delete the first sentence there, and hopefully make the rest of it easier for both users and implementations.)

Prior discussions:

  • It seems that the current behavior was introduced in Update versioning policy for the standard and binary protocol #243 (comment) as part of the solution that there should be one version used to version all parts of the standard (which makes sense), but without an explanation of why the version is required to be a part of the hashed content (which I think bears examination):

    Newer releases of the standard (even semantics-preserving ones) always break the hash

  • I believe my proposal could also help mitigate the difficulty of keeping hashes for tests up to date with unreleased versions, as discussed here: Update test import hash with dhall-1.19.1 suggestion #324 (comment)

    However, as you can see above, even e2d014696fb7d773727ae5aa42dc20bbd2447ea82bcb5971ccbb7763906edace is technically not correct because the hash assumes that the version string is still "4.0.0", which is not really correct since that unpublished behavior does not correspond to version "4.0.0" of the standard.

@Gabriella439
Copy link
Contributor

@MonoidMusician: Yeah, I think this is a great idea. My only minor suggestion is to do it in such a way that it provides a smooth migration path (i.e. treat a semantic integrity check without a version as the last standard version published before this change). Then we can provide dhall lint support for migration and then fully deprecate it.

Do you think you would be able to standardize this change?

@Gabriella439
Copy link
Contributor

@MonoidMusician: So I'm looking at this more closely so I can standardize this soon. I have one main suggested change, which is to remove the version tag entirely. In other words:

  • Don't include the version tag in the user-facing integrity check

    i.e. it would always have the form sha256:${HASH} as it currently does

  • Don't include the version in the encoded expression

  • Therefore, the version is also not included when computing the hash

    ... since it's not part of the encoded expression

I believe we can do this so long as we take care to avoid the same encoded expression meaning two different things across different standard versions. For example, suppose we deprecate the constructors tag (i.e. [ 13, e ]) in the CBOR encoding then we either (A) never use the tag ever again or (B) don't use the tag again until after a suitably long deprecation/migration period based on user feedback.

Gabriella439 added a commit that referenced this issue Jan 31, 2019
Fixes #335

The binary representation of Dhall expressions no longer includes the
version string, which enables hash stability across standard version
changes.

This implies that changes to the language standard should take care to avoid
reusing the same encoding to mean different things over time.  This also
implies that interpreters will need to make a greater effort to anticipate
likely decoding failures for older expressions and provide helpful
error messages.
@Gabriella439
Copy link
Contributor

Here's the proposal to remove the version tag entirely: #362

Gabriella439 added a commit that referenced this issue Feb 4, 2019
Fixes #335

The binary representation of Dhall expressions no longer includes the
version string, which enables hash stability across standard version
changes.

This implies that changes to the language standard should take care to avoid
reusing the same encoding to mean different things over time.  This also
implies that interpreters will need to make a greater effort to anticipate
likely decoding failures for older expressions and provide helpful
error messages.
@Nadrieril
Copy link
Member

Ah it's been merged already. I was going to suggest using a version string distinct from the dhall version (a dhall-binary-format version number), that we would only change when we have made a breaking change to the standard. We'd probably change it very rarely but to me it feels more future-proof to keep this possibility open.

@Gabriella439
Copy link
Contributor

@Nadrieril: That's how the binary format started out, if I remember correctly.

In my view, the main reason against tracking the the version number in the encoding was to compare the following two representative proposals:

  • Proposal 1: If the expression's version header contains a version string signaling a potentially incompatible difference, reject the expression
  • Proposal 2: Always attempt to decode the expression and if decoding succeeds then accept the expression, otherwise reject the expression

These are not the only two possible proposals, but they are useful to compare.

Proposal 2 accepts strictly more expressions because there are many expressions that Proposal 1 would reject (false negatives) purely on the basis of the version string being a too-coarse-grained measure of protocol incompatibility. Really what we want is a more fine-grained measure of compatibility that permits older expressions that were still compatible.

That fine-grained measure of compatibility already exists: just attempt to decode the expression. Rather than relying on the version string to determine whether or not an older expression is backwards compatible, we can empirically determine compatibility by directly trying to decode the older expression. This empiric determination is more accurate because it doesn't rely on humans to correctly assign version numbers and it accepts the subset of older expressions that don't contain any backwards-incompatible changes.

@jneira
Copy link
Collaborator

jneira commented Mar 6, 2019

we can empirically determine compatibility by directly trying to decode the older expression

But would it possible that the new version is able to decode it but returning another result? It could drive to, maybe improbable, but serious bugs, right? Taking in account that it is possible to decode binary representantions without semantic integrity checks.

I think @Nadrieril argument is: we can get the actual behaviour simply not changing the bynary specific version but we have it just in case we want to be more strict about it. And we could only change that version only if it is possible to get two different results for the same representation.

@Nadrieril
Copy link
Member

Inded, I was thinking we would only increment it for major breaking changes that change the format enough that it can't be decoded the same anymore. So in all likelyhood it wouldn't be incremented any time soon, or ever.
It's mostly a reflex I try to have to put version numbers on any kind of interchange format because it can often bite you when you realize something important has to change and you can't anymore. Whether we think it's worth the future-proofing depends on how confident we are that we won't want to change it.

@Gabriella439
Copy link
Contributor

But would it possible that the new version is able to decode it but returning another result? It could drive to, maybe improbable, but serious bugs, right? Taking in account that it is possible to decode binary representantions without semantic integrity checks.

It is possible, but I view it more likely that we have a false negative (rejecting an old expression even though it was compatible) than a false positive (decoding an old expression the wrong way). Usually, if we reclaim an unused older protocol encoding to mean something new we are probably doing so intentionally (such as reclaiming the [13, u] encoding that used to be taken by the now-defunct constructors prefix).

Regarding future proofing, I don't believe we paint ourselves into a corner by omitting the version number. We can carve out additional unused protocol space by using new list prefixes. This is very similar in spirit to how protocol buffers 3 works for protocol evolution: you allocate new numeric fields and reserve older no-longer-used numeric fields if you want to ensure that they are never repurposed to have a different meaning.

In practice, we will probably make breaking changes in the short term while the protocol format is still evolving rapidly so that new implementations don't have to support legacy protocol encodings. Once the language stabilizes and we have enough language bindings then we'll probably be more careful to avoid breaking backwards compatibility.

@Nadrieril
Copy link
Member

Fair enough then, thanks for taking the time to explain :)

@Gabriella439
Copy link
Contributor

@Nadrieril: You're welcome! 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants