Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More reliable checksums using NAR hash #4200

Open
Micket opened this issue Jan 31, 2023 · 11 comments
Open

More reliable checksums using NAR hash #4200

Micket opened this issue Jan 31, 2023 · 11 comments
Milestone

Comments

@Micket
Copy link
Contributor

Micket commented Jan 31, 2023

Summarizing discussion from slack:

Background: Changing datestamps in auto-generated tarballs. Change in compression level (like github did recently, but reverted). Change in tar.gz vs tar.xz. Sources from git_config lacking stable checksums. Our checksum implementation is lacking.

NAR Hash would solve this by serializing the file tree and its contents ( https://gist.github.com/jbeda/5c79d2b1434f0018d693 )

Ref. https://nixos.wiki/wiki/Nix_Hash

I assume this could even be made opt-in? If sha256sum is used, then apply to binary (still relevant in many cases, where we e.g. have iso-files for MATLAB etc.). i.e.

checksums = ['nar:a_narred_sha256sum', 'a_normal_sha_256sum']

Ref. easybuilders/easybuild-easyconfigs#5151

@tgamblin
Copy link

tgamblin commented Jan 31, 2023

If you do this, you are vulnerable to malformed tarball CVEs (like nix is), and EB does not have a sandboxed environment to mitigate this risk (somewhat) like nix does.

So, please don’t do this in EB.

see https://github.com/orgs/community/discussions/45830#discussioncomment-4824886

@boegel boegel added this to the 4.x milestone Jan 31, 2023
@boegel
Copy link
Member

boegel commented Jan 31, 2023

If you do this, you are vulnerable to malformed tarball CVEs (like nix is), and EB does not have a sandboxed environment to mitigate this risk (somewhat) like nix does.

So, please don’t do this in EB.

see https://github.com/orgs/community/discussions/45830#discussioncomment-4824886

That's a valid point of course, since we basically need to start playing with an unchecked tarball (unzipping/unpacking it) before we can actually obtain a checksum for it...

Does that mean we'll just need to suffer the pain if the tarballs served by GitHub start changing (again, since this also happened in Sept'17, see easybuilders/easybuild-easyconfigs#5151 + spack/spack#5411)? :-/

@grahamc
Copy link

grahamc commented Jan 31, 2023

Please also know that it the way we've done it in Nix isn't amazing. It would be better to not do it this way. We would much rather know that the archives produced by GitHub are hash-stable over time.

@Flamefire
Copy link
Contributor

checksums = ['nar:a_narred_sha256sum', 'a_normal_sha_256sum']

Side note: This syntax would required both checksums to match. So we'd need a tuple and likely better invert the order, so the tar checksum is checked first and only on failure the NAR is used.

@mboisson
Copy link
Contributor

mboisson commented Feb 2, 2023

I understand Todd's point about a vulnerability, but regardless of how the Github issue evolves, having a way to check whether the content has changed seem to go through NAR hashes. Implementing it as a second step, after having checked the tarball's checksum, would get us one step closer to being ready for a possible future change in Git.

I don't see the downfall (other than a bit more time to build stuff) to add NAR hashes to reduce our work in the future. If checksums from github change down the line, we will be one step closer to being able to validate that new archives are indeed the same as previous ones.

Note that I would not automatically use the NAR one if the tarball's hash fails. I would error out, provide an expert flag to do check the NAR and suggest a way to add a secondary checksum for the tarball through a PR if the NAR matches.

@mboisson
Copy link
Contributor

mboisson commented Feb 2, 2023

Also, moving checksums outside of EasyConfigs, into checksums.json (which admittedly would need to become a bit more complete), would make an eventual future massive update of checksums a lot easier to handle (because it would require updating a lot fewer files).

@boegel
Copy link
Member

boegel commented Feb 2, 2023

checksums = ['nar:a_narred_sha256sum', 'a_normal_sha_256sum']

Side note: This syntax would required both checksums to match. So we'd need a tuple and likely better invert the order, so the tar checksum is checked first and only on failure the NAR is used.

That doesn't seem wise taking into account @tgamblin's very valid remark above, to put it mildly.

If we get the wrong SHA256 checksum, either it's a benign change (like GitHub suddenly serving a different source tarball, but with the same contents), or something malicious is going on (someone manages to serve you a zip bomb via MITM).
In the latter case, you most definitely don't want to unpack the tarball to compute the NAR hash, I think it's clear now that that would be a mistake.

That doesn't mean a NAR hash is entirely useless: if we have two source tarballs that have a different SHA256 checksum, and we're pretty sure they're both safe (both served by GitHub), we could use a function that computes the NAR hash (in an isolated environment) to easily check whether the contents of the tarball are still the same.
I'm not sure that's a good enough reason to implement support for checking a NAR hash though, since the output produced by diff -ru is usually more informative, since it not only tells you whether there is a change, but also what has changed (see easybuilders/easybuild-easyconfigs#17233 for example).

@mboisson
Copy link
Contributor

mboisson commented Feb 2, 2023

I'm not sure that's a good enough reason to implement support for checking a NAR hash though, since the output produced by diff -ru is usually more informative, since it not only tells you whether there is a change, but also what has changed (see easybuilders/easybuild-easyconfigs#17233 for example).

diff -ru implies that you still have the "original" copy, which may not be the case. Computing the NAR hash and storing it would allow you to validate with confidence without needing to have the old archive.

@tgamblin
Copy link

tgamblin commented Feb 2, 2023

Computing the NAR hash and storing it would allow you to validate with confidence without needing to have the old archive.

I think this is a nice benefit because you don't have to maintain a mirror (or fetch from it). I added number 5 here after @mboisson mentioned it. You still have to keep around two hashes though, and you have to train your contributors to NAR stuff (or automate that in CI)

@mboisson
Copy link
Contributor

mboisson commented Feb 2, 2023

The way I could see NAR hash used is:

  1. Automate their calculation and storage when adding checksums (similar to --inject-checksums & co)
  2. Always test the tarball's checksum first. If it matches, and you already have the NAR hash, no need to do anything else.
  3. If the tarball's checksum does not match, error out. Someone has to make a decision on re-validating the NAR hash or not, through some expert flag.
  4. If someone decided it is ok to compute the NAR hash:
  5. If the NAR hash match, inject a new alternate checksum for that tarball
  6. If the NAR hash don't match, someone has to decide again, investigate with diff -ru to figure out what has changed

@boegel
Copy link
Member

boegel commented Feb 2, 2023

One potential safe (?) use of NAR hashes could be stuff that we download via git_config, since then a SHA256 checksum of the tarball is not possible (cfr. #2726)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants