-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SHA-1 hash collisions? #173
Comments
Currently yes - the approach being used to prevent this is Git internals now having built in counter-cryptanalysis. Hopefully by the time a feasible attack becomes easy, Git will have moved to a transition plan which upgrades their hashing and can be adopted also by CWL Viewer. (A related and interesting talk about this attack which came out of defcon: https://youtu.be/NbHL0SYlrSQ) |
Right - my concern was the URL scheme assumes globally unique hashes, not just unique within a single git repository. i.e. The risk of collision is much higher (still unlikely to be a problem by accident). |
That's an excellent point, there is some risk of this kind of attack globally which may be difficult to check for under this scheme. Thanks for bringing it up in an issue. |
You are right in that a sha1 conflict here is theoretically possible. The current https://view.commonwl.org/git will just pick the first entry with that commit. Using these URIs as internal graph names should mean that only first time the sha1 is seen at the viewer will win. We can add to the listing "Appears in repository: a,b,c,d" and then it would be at least possible to discover multiple attacks. Is there any additional hash we can easily get out of git? I guess we could do our own SHA-256 or so hashing by directory listing instead, but then we need to be very careful with ordering, file name cases, ignore files, etc. Git commits are easily accessed - and presumably as @MarkRobbo suggests git will have to deal with this at some point. In theory we could make a SHA-256 that also embedded the origin repository URL, but that is not an option I prefer, as it means we can no longer identify the same workflow commit in multiple git repositories / forks. (Note that the git commit ID of a CWL workflow is still naive - there could be a non-important commit like a change in a README file that does not affect the workflow. Someone has already used git subtree, where there could be new commits in a subtree while the top-level tree is at the same tree. (Uu.. I know!) Comparing the packed workflow in addition to default files (and their secondaries) would give better equality, but is again something that would need to be very carefully defined and implemented. Docker images and local binaries are also moving targets that to some extent should be included in the measurement of "workflow equality". By having |
I think we can settle Linus' comments that sha1 is still "OK" for content addressing, but not really for cryptographic security (e.g. What is our leap of fate here, as @peterjc points out, is assuming that the same commit sha1 in a different repository is actually the same commit (e.g. same file content). That GitHub does not currently check or validate. We could detect that by for instance doing a SHA512 of the generated research object - perhaps turn it into a bagit-ro that contains checksum manifests for different algorithms. The known sha1 attack needs to modify both the "good" and the "bad" commits, say by inserting weird binaries into a .cwl file - and in CWL Viewer we have many simpler attacks you could do (e.g. I could write a rubbish workflow with fake metadata claiming @peterjc as author); I will close this issue as "Won't fix" - but feel free to re-raise to suggest good alternative identifiers! |
This sounds like a prudent course of action:
But can we/should we tweak the database behind the scenes to insist the hash is unique, and give a clear error/warning on registering a workflow with a hash collision - in which case the user can likely easily avoid this (e.g. minor README file update to generate a new hash)? Use case: Person/Team B takes forks a copy of a an existing workflow repository by Team A (which have previously been published under A's name), and starts by attempting to upload existing version(s) under B's name before moving on to sharing their derived workflow. Here there would absolutely be git hash collisions as not only is the content identical, so are the first commits of the two repositories A and B. |
Given the proposed CWL permalink URI scheme,
https://github.com/common-workflow-language/cwlviewer/wiki/Permalinks
If I understand this correctly this depends on globally unique git commit hashes.
By design hashes are unique within a git repository, but potentially two different repositories could have the same commit hash with different content. This is extremely unlikely by accident, but might one day be possible by malicious intent - see https://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob/42450327#42450327 and the proof of principle SHA-1 collision with two PDF files https://shattered.io/
Does it fall to https://view.commonwl.org/git to detect and cope with a duplicate commit hash?
The text was updated successfully, but these errors were encountered: