
SHA1 sum of package dists should be more reliable than a SHA1 on the zip result #2540

Open
Seldaek opened this issue Dec 31, 2013 · 47 comments

@Seldaek
Member

Seldaek commented Dec 31, 2013

(Replacing #1496 which has become a mess, references #5940)

If multiple servers create archives, then those archives can have different SHA1s which is problematic. Potential solutions:

  • hash the contents in a reproducible way (might be very slow)
  • just avoid recreating archives all the time (depends on ecosystem but preferable)
@fadoe

fadoe commented May 13, 2014

Is there a solution yet?

@david0

david0 commented Feb 13, 2015

I'm not an expert in the ZIP format, but I guess the file access times (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) are changing and thereby messing up the checksum.
Would it be an option to mask out the relevant parts?
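One way to test this guess is to blank the timestamp fields before hashing. A rough sketch (an illustration only, not part of the original comment; it scans for header signatures instead of properly parsing the format, and it ignores the variable-length "extra" fields, which would need real parsing to strip):

```python
import hashlib


def masked_sha1(zip_bytes: bytes) -> str:
    """SHA-1 of a zip archive with per-entry timestamps zeroed out.

    Rough proof of concept: scan for local file headers and central
    directory headers and blank the DOS mod-time/mod-date fields
    before hashing, so two archives of identical content that differ
    only in timestamps hash the same.
    """
    data = bytearray(zip_bytes)
    for i in range(len(data) - 3):
        sig = bytes(data[i:i + 4])
        if sig == b'PK\x03\x04':      # local file header: time/date at offset +10
            data[i + 10:i + 14] = b'\x00\x00\x00\x00'
        elif sig == b'PK\x01\x02':    # central directory header: time/date at +12
            data[i + 12:i + 16] = b'\x00\x00\x00\x00'
    return hashlib.sha1(bytes(data)).hexdigest()
```

Timestamps can be zeroed in place because they're fixed-width; the "extra" field (where Unix atime/mtime live) changes the header length, so removing it would shift every offset after it.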

@Seldaek
Member Author

Seldaek commented Feb 14, 2015

@david0 it's a good theory/idea: either that, or even modification times. I am not sure if these metadata bits are part of the header or if they are compressed together with the file contents. If it's in the headers, then maybe we could more or less easily mask them and then hash the outcome, yes.

If anyone is up for digging further in the spec and possibly trying to build a small proof of concept (it really does not need to be part of composer as a first step) that'd be great.

@david0

david0 commented Feb 14, 2015

My proof of concept.

I also noticed that file access times etc. are stored in an optional field, "extra". It is possible to omit this field when creating zip files using zip -X. So if we could find a way to create the zip files without "extra" for Satis, the problem should be solved.

@Seldaek
Member Author

Seldaek commented Feb 16, 2015

@david0 ideally we should also support this for zips created by GitHub, though, where we can't control the process, so if we can strip the headers that'd be best.

I am not sure about the proof of concept, because I get the same shasums anyway for the testdata1/2 and 3/4 pairs. So the sha.php code does not prove much, but it does indeed produce different shas that are also consistent.

It'd be nice if we could reproduce the exact issue (two different zips of same content) and get the same sha.

@david0

david0 commented Feb 17, 2015

@Seldaek must be some difference on your system (with atime). I updated the poc and included my testfiles.

$ shasum *.zip
689595f5f847e9edc4c01fb2c81a17472f009df5  testdata1_extra.zip
0911f01200a2a9a0b7f251b7308fcbe3a18c8f56  testdata2_extra.zip
312d94af3ac4c929169e5e515b53ee69be4b1e1c  testdata3_noextra.zip
312d94af3ac4c929169e5e515b53ee69be4b1e1c  testdata4_noextra.zip
34ef754c68ffbb559954f7932b70e46b5d826cc3  testdata5_corrupted_extra.zip
823ced3f35d98acbb4cb33dd1b1bb25e2f611323  testdata5_corrupted_noextra.zip
$ php sha.php
testdata1_extra.zip: e122eb938e4a680503b97c278f71583d27537463
testdata2_extra.zip: e122eb938e4a680503b97c278f71583d27537463
testdata3_noextra.zip: 31b74a5869270919e41f185f593f14306d9dc8c9
testdata4_noextra.zip: 31b74a5869270919e41f185f593f14306d9dc8c9
testdata5_corrupted_extra.zip: 63eae002cb6deb46240d300308ea783f9fd41735
testdata5_corrupted_noextra.zip: 3b0d719237c790a00cf28fc998ebf6b7dc2a8f71

It would also be possible to make testdata 1-4 have the same hash by ignoring the central directory structure as well.
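The idea behind the PoC, hashing only entry names and uncompressed contents while skipping the headers and the central directory entirely, could be sketched like this (a Python illustration of the approach, not the actual sha.php code):

```python
import hashlib
import zipfile


def content_sha1(archive) -> str:
    """Hash only entry names and uncompressed contents.

    Entries are read through zipfile, so local headers, "extra" fields
    and the central directory never enter the hash; sorting by name
    also makes the result independent of the order in which entries
    were added to the archive.
    """
    h = hashlib.sha1()
    with zipfile.ZipFile(archive) as z:
        for name in sorted(z.namelist()):
            h.update(name.encode('utf-8'))
            h.update(b'\x00')          # separator: keeps names and contents apart
            h.update(z.read(name))
    return h.hexdigest()
```

With this, two archives of the same tree hash identically no matter what timestamps, extra fields or entry order they were written with.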

@Seldaek
Member Author

Seldaek commented Feb 17, 2015

OK, I get the same hashes as you do, so that's a good start ;) Not sure what the central directory structure is. Does that include the paths/directory structure of the unzipped output? Because ignoring that might be risky.

@david0

david0 commented Feb 20, 2015

The central directory structure seems to be a repetition of the file headers, used for random access.
I updated the poc once again and now I get the same hashes for testdata 1-4.

@sroze
Contributor

sroze commented Aug 3, 2015

@Seldaek do you have some feedback on @david0's work? It would be really interesting to fix this checksum "issue" IMO, because it is causing problems with private dependency management.

@Seldaek
Member Author

Seldaek commented Aug 10, 2015

@sroze I just haven't had time to look into this in more detail, but yes, in theory it's a good way to do hashing for zip files; it's probably not very hard to add it as an optional hashing mechanism to verify the shasum of zips.

@rawkode

rawkode commented Aug 10, 2015

What about the ability to disable the checksum until a real solution can be found?

@Seldaek
Member Author

Seldaek commented Aug 10, 2015

AFAIK Packagist doesn't have checksums and Toran Proxy doesn't either. I am not sure about Satis, but overall checksums aren't really used atm due to the issues they caused.

@stof
Contributor

stof commented Aug 10, 2015

Well, Packagist does not use the checksum because the GitHub API does not provide a checksum for its archives (probably because it changes over time, as they generate archives on demand for commits).
However, Satis sets the checksum, as it already has the archive and so can compute it.

@Seldaek
Member Author

Seldaek commented Aug 10, 2015 via email

@rawkode

rawkode commented Aug 10, 2015

So if I submit a PR with a config option inside satis.json to allow disabling Satis checksums inside packages.json, there'd be no initial objections?

@stof
Contributor

stof commented Aug 10, 2015

@Seldaek that would be both inefficient (downloading all archives in the GithubDriver just to get the checksum) and a bad idea (as GitHub archives are generated on the fly and so can have different metadata over time)

@sroze
Contributor

sroze commented Aug 11, 2015

I would agree with having an option (a command-line one could be sufficient) to disable checksum checks. At the same time, having a real solution like this one would be far better.

@Seldaek if you're in with this idea, I can PR that --disable-checksums option.

@mbrodala
Contributor

mbrodala commented Feb 4, 2016

Any updates here? I am regularly seeing issues like this which slow down our deployments due to Composer falling back to Git cloning instead of downloading dist archives:

- Installing typo3/cms (6.2.17)
    Downloading
Failed to download typo3/cms from dist: The checksum verification of the file failed (downloaded from https://api.github.com/repos/TYPO3/TYPO3.CMS/zipball/449cf97a914c2dd719b728dc77d7b547ba557f33)
Now trying to download from source
- Installing typo3/cms (6.2.17)
Cloning 449cf97a914c2dd719b728dc77d7b547ba557f33

@stof
Contributor

stof commented Feb 4, 2016

How are you ending up with a checksum for GitHub downloads? Packagist and Composer don't add them, as GitHub does not provide checksums (and it cannot, as they generate the archive on demand, meaning the checksum is not stable over time)

@mbrodala
Contributor

mbrodala commented Feb 4, 2016

@stof Good question, I cannot say what causes these checksums to show up. Could the Composer cache be causing this issue? I am unable to reproduce the issue when pulling typo3/cms locally, but I can see it reliably in our build logs (Shippable).

@oligriffiths

Any update to this? We're using satis and keep seeing this same issue...

@fgrosse

fgrosse commented Apr 19, 2016

We also had the issue that we tried to keep the lockfile clean of any shasum values, but they would always magically come back upon composer update, even though they were not present on Packagist.

Following @mbrodala's hint I deleted $HOME/.cache/composer, but the checksums still came back.

It turned out that those checksums were also present in vendor/installed.json, and they stopped reappearing out of nowhere once I also removed ./vendor and ran a clean composer install

@oligriffiths

Whilst that may work, it doesn't seem like a solution to me. The real question is why they are different...


@oligriffiths

I've also noticed that when re-creating the tar/zip, Satis loses the executable flag on files (e.g. https://github.com/brianium/paratest/tree/master/bin): these get reset. However, if you download the zip or clone the repo from GitHub, those flags are set.

@mbrodala
Contributor

mbrodala commented May 3, 2016

@oligriffiths Seems to be a known issue in Satis, but not really related to the issue here. ;-)

@oligriffiths

Cool, just saying that a direct download from GitHub would be a solution

@boneskull

I'm having this issue with .tar archives specifically. Please update the issue title.

@pwolanin

Is there an existing issue for moving the hashing to SHA-2 instead of MD5 or SHA-1? I'm a bit surprised that Composer, as a newish tool, is using an outdated and deprecated hash for verification

@emanuelb

emanuelb commented Dec 5, 2016

Regarding stripping metadata from ZIP files, look at the implementation in:
https://anonscm.debian.org/git/reproducible/strip-nondeterminism.git/tree/lib/File/StripNondeterminism/handlers/zip.pm
from strip-nondeterminism:
https://packages.debian.org/sid/strip-nondeterminism
It also contains handlers (code to strip metadata) for other archive/file formats:
https://anonscm.debian.org/git/reproducible/strip-nondeterminism.git/tree/lib/File/StripNondeterminism/handlers/

@Ilyes512

Ilyes512 commented May 30, 2018

So I have wasted hours on finding out why I am getting this warning:

The checksum verification of the file failed

Are there any plans to fix this?

@rgpublic

Don't know if anyone finds this interesting, but I usually use "faketime" to create consistent archives:

http://manpages.ubuntu.com/manpages/trusty/man1/faketime.1.html

@mpdude
Contributor

mpdude commented Jan 3, 2020

Could somebody please add a little more context regarding when exactly this problem occurs? What are these "multiple servers" from the OP, and how do they create the archives?

For the record, I am able to create a ZIP file from the https://github.com/symfony/symfony repo that has the same checksum as the ZIP archive downloadable on the GitHub releases page. Try:

TZ=America/Los_Angeles git archive --format zip --prefix=symfony-5.0.2/ v5.0.2 | sha384sum

Edit:

To be a bit more elaborate myself, here's an example for doctrine/cache 1.10.0 as a random choice.

In a composer.lock file, I've got this...

{
            "name": "doctrine/cache",
            "version": "1.10.0",
            "source": {
                "type": "git",
                "url": "https://github.com/doctrine/cache.git",
                "reference": "382e7f4db9a12dc6c19431743a2b096041bcdd62"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62",
                "reference": "382e7f4db9a12dc6c19431743a2b096041bcdd62",
                "shasum": ""
            },
...

So:

$ wget -q https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62 -O gh.zip
$ sha384sum gh.zip
2251d87b5a288ba10edbbf9c22540cc63ba2e8636a9b6460407b5a55252b5cdc6f89fe9f683cc922d1237035d25d3fcf  gh.zip
$ git clone https://github.com/doctrine/cache.git
$ cd cache
$ TZ=America/Los_Angeles git archive --format zip --prefix=doctrine-cache-382e7f4/ 382e7f4 | sha384sum
2251d87b5a288ba10edbbf9c22540cc63ba2e8636a9b6460407b5a55252b5cdc6f89fe9f683cc922d1237035d25d3fcf  -

Yes, the America/Los_Angeles timezone is a GitHub peculiarity. But I guess Packagist has to clone the repo anyway, so it might also generate and include checksums for the dist archives?

@Seldaek Seldaek modified the milestones: 2.0, Nice To Have Jan 31, 2020
@stof
Contributor

stof commented Feb 20, 2020

But I guess Packagist has to clone the repo anyway

No it does not. For github repos, the GithubDriver extracts all data from the API.

@mpdude
Contributor

mpdude commented May 27, 2021

Hey @peff 👋🏼,

you've been very helpful back in 2017 over at Homebrew/homebrew-core#18044, where the Homebrew team had to deal with checksums for GitHub repo archives that suddenly changed. To my understanding, this was due to an internal fix in Git and how it exports archives.

Counting on your expertise and hoping that you're still in some way affiliated with GitHub, and also referring to your comment regarding byte-stable tarballs, can you tell us...

  • Is it reasonable to assume that checksums of zipballs (like https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62) are stable over time?
  • What might be possible reasons for a change?

Thanks!

Update for clarification: Obviously I gave one reason myself why the checksum might change, namely changes to the way Git exports and packages archives. But since you mentioned that GitHub might take measures to guarantee byte-stable archives, and given that Homebrew also seems to rely on checksums of GitHub archives (arbitrary example), we might consider the possible fall-out of a checksum change to be big enough that it's not going to happen by accident 😉 Does that make sense?

@peff

peff commented May 27, 2021

Counting on your expertise and hoping that you're still in some way affiliated with GitHub, and also referring to your comment regarding byte-stable tarballs, can you tell us...

  • Is it reasonable to assume that checksums of zipballs (like https://api.github.com/repos/doctrine/cache/zipball/382e7f4db9a12dc6c19431743a2b096041bcdd62) are stable over time?

  • What might be possible reasons for a change?

Not much (nothing?) has changed since that comment. Zipfiles are treated the same as tarballs: generated by git-archive on the fly. We did talk about cementing byte-for-byte output at the time of tagging, but it's complicated at scale. And it wouldn't help with the example you gave anyway, which is requesting a non-tag. The only reason for a change would be Git's output changing, which itself would likely be caused by a bugfix.

given the fact that also Homebrew seems to rely on checksums of GitHub archives (arbitrary example), we might consider the possible fall-out of a checksum change to be big enough that its not going to happen by accident

Yeah. I think the situation remains "changes should be very rare and avoided if possible, but technically there are no promises".

@Seldaek
Member Author

Seldaek commented May 27, 2021

Thanks for the clarification. This seems like a no-go to me, as it means there is a chance a Git or GitHub change would render virtually all composer.lock files in existence invalid and uninstallable due to security warnings. Even if it's very rare, it seems like a pretty bad disruption to the whole ecosystem that I'd rather avoid.

@mpdude
Contributor

mpdude commented May 27, 2021

Jordi, thank you for getting back to this (c)old case so quickly.

The reason why I was browsing this and related issues (#5940, #4022) is that I was hoping that there would be an (easy?) way of making sure that the code that we see and install is the same on our local development machines, on CI and in production.

I cannot really say if this is already addressed by the composer.lock file, or if another level of verification on the dist file level would be helpful. It seemed to me that if Packagist can derive and provide a checksum for the dist archive at the time a package is added to the registry, and that checksum is verified and written into my composer.lock file during composer update on a dev machine, it would give me some more peace of mind knowing that the code that will be downloaded and used by automatic processes down the road (CI, deployments etc.) will be exactly the same.

I have no experience when it comes to designing security for systems like this, so I cannot even tell what kind of attacks this would prevent or at least mitigate... Maybe a misbehaving package owner trying a targeted attack, serving different dist versions from their own URL? Mirrors providing modified versions of dist archives?

On the other hand: As long as we're downloading GitHub-provided ZIP archives over HTTPS directly from GitHub, and since we can trust GitHub as a platform, and since they create the ZIPs from the corresponding commit ref, just having the reference in the .lock file should be enough.

So, Jordi, do you still recall what exactly your motivation was for opening this issue? Were it security/integrity concerns?

Regarding possible checksum changes: Yes, I see the risk and I understand that users would probably be yelling at Composer/Packagist. OTOH, since at least Homebrew would be affected as well, there would probably be a lot of visibility and support by all parties involved to sort such issues out. And anyway, in my mind, all this would be opt-in (or out?) by a switch like composer install --verify-checksums.

But again, before we're discussing such technical details I'd need to understand what benefits a simple checksum-based approach would bring security-wise.

@Seldaek
Member Author

Seldaek commented May 27, 2021

This isn't opt-in or opt-out; all existing Composer versions will verify a checksum if one is defined, and throw if verification fails. That's why we disabled it for GitHub early on, as it wasn't stable enough.

On the other hand: As long as we're downloading GitHub-provided ZIP archives over HTTPS directly from GitHub, and since we can trust GitHub as a platform, and since they create the ZIPs from the corresponding commit ref, just having the reference in the .lock file should be enough.

That is correct. I'd say security-wise it doesn't add a ton for GitHub, but for other, less reputable sources it wouldn't hurt to have. Then again, last I checked ~97% of packages were on GitHub, so all in all I'm not sure it's worth spending resources on.

@Seldaek
Member Author

Seldaek commented May 28, 2021

Gave this some more thought yesterday, and I'm thinking maybe what we can do is this:

  • Add a new checksums key, which would be a dictionary instead of the current sha1sum-or-bust, which isn't very future-proof.
  • checksums.sha256 or similar would provide a simple hash of the zip, hoping that remains stable over time
  • checksums.contentSha256 would be a more bulletproof hash of the contents of each file, sorted in a stable manner. Something we can reproduce for sure, but that'd be slower to generate. It could be used as a fallback in case the other hash fails to validate, so we'd have a slowdown, but not a catastrophic community-wide install failure should GitHub change their output.
  • In addition I would add config options:
    • verify-checksums: { foo/bar: true, acme/*: false, *: true } (defaults to {"*": true}) to allow opting out of verification.
    • require-checksums: { foo/bar: true, acme/*: false, *: true } (defaults to {"*": false}) to allow opting in to requiring verification.

That should allow us to roll this in smoothly and in a future-proof manner. That said, before proceeding any further we should first assess what can be done to share/integrate this together with the TUF work going on in #9740

These are just notes, I don't want to promise real progress on this any time soon as we have still tons of other work already planned.
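The fast-hash-then-fallback flow proposed above could be sketched like this (a hypothetical Python illustration of the idea; the checksums key names come from the notes above, not from any real Composer code):

```python
import hashlib
import zipfile


def content_sha256(path) -> str:
    """The slow-but-stable hash: per-entry sha256 over names and
    uncompressed contents, processed in sorted order so the result
    survives re-zipping with different metadata or entry order."""
    h = hashlib.sha256()
    with zipfile.ZipFile(path) as z:
        for name in sorted(z.namelist()):
            h.update(name.encode('utf-8'))
            h.update(hashlib.sha256(z.read(name)).digest())
    return h.hexdigest()


def verify(path, checksums: dict) -> bool:
    """Fast path first: hash the raw archive bytes. If that fails
    (say the zip was regenerated with different metadata), fall back
    to the content hash before rejecting the download."""
    with open(path, 'rb') as f:
        if hashlib.sha256(f.read()).hexdigest() == checksums.get('sha256'):
            return True
    return content_sha256(path) == checksums.get('contentSha256')
```

A regenerated archive with different timestamps fails the fast check but passes the content check, while actually tampered contents fail both.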

@peff

peff commented May 28, 2021

checksums.contentSha256 would be a more bulletproof hash of the contents of each file sorted in a stable manner.

This should mostly work, but I'll mention one case where it might not: if the dependent package uses Git's export-subst mechanism, then the resulting file contents might not be stable. That mechanism replaces a string like $Format:%H$ in the file using %-placeholders like git-log. Most of those are immutable with respect to the commit, but a few aren't (e.g., %d is the list of refs that point to the commit). The only case I've seen of this in the wild is somebody using %h (the abbreviated sha1). That may change over time if another object appears which collides with the abbreviation, forcing Git to extend it to make it unique.

This should be quite rare, I'd think (again, I've seen only one report total of it messing up Homebrew). And it can perhaps be solved by politely explaining to the project you're hashing why %h isn't a great idea. So I'd probably disregard it, but I thought I'd mention it in the name of completeness. :)

@Seldaek
Member Author

Seldaek commented May 28, 2021

There is always a way things can go wrong :) Thanks for the pointer, but I would indeed expect this to be rare enough that the combination of using export-subst + github changing its output would only impact a small number of people.

@talbergs

talbergs commented Aug 3, 2023

The purpose of a checksum is to lock the integrity of a not-yet-downloaded file. We could implement this as an opt-in feature and let it work as slowly as it needs to, as proposed by @Seldaek: generate a checksum for each of the files in the package, sort the checksums, and take the SHA of that. Well, not sure if this raises the possibility of collisions; I hope they're rare.

Speaking up here because nix community will bring php into wider application if our package manager becomes nix-compatible.

@drupol
Contributor

drupol commented Dec 7, 2023

Speaking up here because nix community will bring php into wider application if our package manager becomes nix-compatible.

I am actively working on this integration. A notable challenge arises from the lack of checksums in composer.lock, which invariably complicates the process.

Regarding stability concerns with GitHub-produced archives, it's true that relying on them "as-is" is not advisable (GitHub has said we shouldn't do that, because the archives might change without notice). However, the Nix package manager ingeniously bypasses this issue by computing hashes based on the content of the archives rather than the archives themselves. This approach ensures that potential changes in GitHub's archive format won't affect the hashes.

A few months ago, I experimented with creating stable hashes using PHP (https://gist.github.com/drupol/2bb45818f4ccb73362d26b4d9aee9ec2). Unfortunately, the process was quite slow and I didn't pursue it further.

Currently, I don't have a definitive solution for this specific issue, but I am convinced that Nix excels in many areas, particularly in generating stable hashes.
