Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SHA256 for cache invalidation #1985

Merged
merged 1 commit into from
Nov 21, 2021

Conversation

martijnbastiaan
Copy link
Member

@martijnbastiaan martijnbastiaan commented Nov 6, 2021

Before this commit, we used hashables hash. For the purposes of
cache invalidation, it is very important that if hash a == hash b,
the chance that a /= b is vanishingly small. It is not clear that
hashable satisfies this requirement, whereas a cryptographically
secure hash such as SHA-256 clearly does.

TODO:

  • Write a changelog entry (see changelog/README.md)
  • Check copyright notices are up to date in edited files

@christiaanb
Copy link
Member

So it’s considered broken then, since our equality on terms is alpha-equality, and our Binary instance doesn’t store terms in a nameless form.

@martijnbastiaan
Copy link
Member Author

Yes, that's correct. Cache invalidation will throw out more caches than strictly necessary, however it will never keep a cache that's stale. I'd like to point out that this is true before this patch too - Hashable Term and Hashable Type don't account for alpha equivalency either. This PR just addresses the fact that we're using the completely wrong algorithm in the first place. I.e., even if Hashable Term and Hashable Type were correct, they couldn't be used for cache invalidation. If we ever implement Hashing Modulo Alpha-Equivalence we should remember to make our cache invalidation logic use it too.

I'm working on another proposal (two actually) to fix - or rather work around - the broken Hashable instance for Term and Type.

@martijnbastiaan
Copy link
Member Author

martijnbastiaan commented Nov 7, 2021

As an addendum, I don't think our caching logic is completely broken. As long GHC keeps its generated names stable (either by being deterministic, or because entities were loaded from hi-files) caching should still work. Only in cases where GHC generates the same Core modulo alpha equivalence a cache gets pessimistically thrown out. This is annoying, but not nearly as bad as using a stale cache - which can happen in the current state.

@DigitalBrains1
Copy link
Member

After discussion on Slack, I'm suggesting the following PR cover letter and commit message:

Before this commit, we used hashables hash. For the purposes of cache invalidation, it is very important that if hash a == hash b, the chance that a /= b is vanishingly small. It is not clear that hashable satisfies this requirement, whereas a cryptographically secure hash such as SHA-256 clearly does.

Copy link
Member

@leonschoorl leonschoorl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with the reasoning that we "need" a cryptographic hash function because only that would "guarantee" that we won't have collisions.

We shouldn't have to worry about someone malicious trying to create hash collisions here. If someone were to do that they'd only hurt themself.

The stuff we're hashing is essentially random data as far as the hashing algorithm is concerned. And any competent hashing algorithm should be able to handle that.
Especially because we're not hashing thousands of things, like for example a hashmaps might.
In fact we're only comparing two hashes, the current compile target and one cached compile target.

As I see it the most meaningful effect of this change is a big increase in the hash digest size, from 64 to 256 bits.
Which does drop the collision chance astronomically. But I intuitively feel the collision chance was quiet small already.

The added work of the more expensive hash function is unlikely to be noticeable compared to the rest clash has to do.
And if this allows us to sidestep the issues with the #1916 Hashable instances 👍

@DigitalBrains1
Copy link
Member

DigitalBrains1 commented Nov 18, 2021

Right, that's largely why I suggested the new commit message and PR cover letter, to better cover the reasoning for the change.

Given the tradeoffs with speed versus uniformity that hashable makes, it's not clear that hashable actually is well suited for a scheme where equality is fully decided by the hash, rather than using it to narrow down the list of suspects drastically, followed by a full comparison.

For that reason, I think it's a good idea to switch to SHA-256 and not use hashable anymore. Not because SHA-256 has properties that withstand targeted attack, but because it clearly provides the one thing we need: that when hash a == hash b, the chances that a /= b are vanishingly small, so we can rely on a always being equal to b in practice.

We really don't want the cache invalidation to pick up a stale entity in practice; using hashable for this feels uneasy.

[edit]
Also, the hash is a puny 32 bits in size for 32-bit OSes.
[/edit]

@martijnbastiaan
Copy link
Member Author

Good comments, thanks Leon and Peter. The cover letter can indeed be a lot better. I also should have included the reason why I'm suspicious of hash in the first place. It's not even that I think collisions can be trivially generated, it's that I can easily produce them without looking up the implementation:

>>> hash (1 :: Int, 1 :: Int)
16777618
>>> hash (1 :: Int, 16777618 :: Int)
1
>>> hash (0 :: Int, 1 :: Int)
1

Additionally, things like this do not sit well with me:

>>> hash 0
0
>>> hash (0 :: Int, 0 :: Int)
0

I realize these are somewhat cherry-picked examples, and that they are by no means proof that we can't use it. But given the seriousness of a stale cache hit and the tiny change needed I'd like to go through with this PR.

I will fix the XXX and amend the commit messages and reasoning soon(tm).

Before this commit, we used `hashable`s `hash`. For the purposes of
cache invalidation, it is very important that if `hash a == hash b`,
the chance that `a /= b` is vanishingly small. It is not clear that
`hashable` satisfies this requirement, whereas a cryptographically
secure hash such as SHA-256 clearly does.
@martijnbastiaan
Copy link
Member Author

I thought about backporting it but... #2008.

@martijnbastiaan martijnbastiaan enabled auto-merge (squash) November 21, 2021 19:13
@martijnbastiaan martijnbastiaan merged commit 3204648 into master Nov 21, 2021
@martijnbastiaan martijnbastiaan deleted the use-sha256-cache-invalidation branch November 21, 2021 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants