Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SHA-1 hash collisions? #173

Closed
peterjc opened this issue Aug 29, 2017 · 6 comments
Closed

SHA-1 hash collisions? #173

peterjc opened this issue Aug 29, 2017 · 6 comments

Comments

@peterjc
Copy link

peterjc commented Aug 29, 2017

Given the proposed CWL permalink URI scheme,

https://w3id.org/cwl/view/{scheme}/{commit}/{path}#{anchor}

https://github.com/common-workflow-language/cwlviewer/wiki/Permalinks

If I understand this correctly this depends on globally unique git commit hashes.

By design hashes are unique within a git repository, but potentially two different repositories could have the same commit hash with different content. This is extremely unlikely by accident, but might one day be possible by malicious intent - see https://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob/42450327#42450327 and the proof of principle SHA-1 collision with two PDF files https://shattered.io/

Does it fall to https://view.commonwl.org/git to detect and cope with a duplicate commit hash?

@MarkRobbo
Copy link
Member

Currently yes - the approach being used to prevent this is Git internals now having built in counter-cryptanalysis.

Hopefully by the time a feasible attack becomes easy, Git will have moved to a transition plan which upgrades their hashing and can be adopted also by CWL Viewer.

(A related and interesting talk about this attack which came out of defcon: https://youtu.be/NbHL0SYlrSQ)

@peterjc
Copy link
Author

peterjc commented Aug 30, 2017

Right - my concern was the URL scheme assumes globally unique hashes, not just unique within a single git repository.

i.e. The risk of collision is much higher (still unlikely to be a problem by accident).

@MarkRobbo
Copy link
Member

That's an excellent point, there is some risk of this kind of attack globally which may be difficult to check for under this scheme.

Thanks for bringing it up in an issue.

@stain
Copy link
Member

stain commented Oct 6, 2017

You are right in that a sha1 conflict here is theoretically possible. The current https://view.commonwl.org/git will just pick the first entry with that commit. Using these URIs as internal graph names should mean that only first time the sha1 is seen at the viewer will win.

We can add to the listing "Appears in repository: a,b,c,d" and then it would be at least possible to discover multiple attacks.

Is there any additional hash we can easily get out of git? I guess we could do our own SHA-256 or so hashing by directory listing instead, but then we need to be very careful with ordering, file name cases, ignore files, etc. Git commits are easily accessed - and presumably as @MarkRobbo suggests git will have to deal with this at some point.

In theory we could make a SHA-256 that also embedded the origin repository URL, but that is not an option I prefer, as it means we can no longer identify the same workflow commit in multiple git repositories / forks.

(Note that the git commit ID of a CWL workflow is still naive - there could be a non-important commit like a change in a README file that does not affect the workflow. Someone has already used git subtree, where there could be new commits in a subtree while the top-level tree is at the same tree. (Uu.. I know!)

Comparing the packed workflow in addition to default files (and their secondaries) would give better equality, but is again something that would need to be very carefully defined and implemented.

Docker images and local binaries are also moving targets that to some extent should be included in the measurement of "workflow equality".

By having git/ as prefix we can add other alternative identifiers later. These would then be related with prov:alternateOf as they would have different time/content spans.

@stain
Copy link
Member

stain commented Oct 6, 2017

I think we can settle Linus' comments that sha1 is still "OK" for content addressing, but not really for cryptographic security (e.g. git --sign). I did not propose the CWL permalinks for security, but for identity - although if used together with say blockchain technologies then the security concerns would come back.

What is our leap of fate here, as @peterjc points out, is assuming that the same commit sha1 in a different repository is actually the same commit (e.g. same file content). That GitHub does not currently check or validate. We could detect that by for instance doing a SHA512 of the generated research object - perhaps turn it into a bagit-ro that contains checksum manifests for different algorithms.

The known sha1 attack needs to modify both the "good" and the "bad" commits, say by inserting weird binaries into a .cwl file - and in CWL Viewer we have many simpler attacks you could do (e.g. I could write a rubbish workflow with fake metadata claiming @peterjc as author); I will close this issue as "Won't fix" - but feel free to re-raise to suggest good alternative identifiers!

@stain stain closed this as completed Oct 6, 2017
@peterjc
Copy link
Author

peterjc commented Oct 7, 2017

This sounds like a prudent course of action:

You are right in that a sha1 conflict here is theoretically possible. The current https://view.commonwl.org/git will just pick the first entry with that commit. Using these URIs as internal graph names should mean that only first time the sha1 is seen at the viewer will win.

We can add to the listing "Appears in repository: a,b,c,d" and then it would be at least possible to discover multiple attacks.

But can we/should we tweak the database behind the scenes to insist the hash is unique, and give a clear error/warning on registering a workflow with a hash collision - in which case the user can likely easily avoid this (e.g. minor README file update to generate a new hash)?

Use case: Person/Team B takes forks a copy of a an existing workflow repository by Team A (which have previously been published under A's name), and starts by attempting to upload existing version(s) under B's name before moving on to sharing their derived workflow. Here there would absolutely be git hash collisions as not only is the content identical, so are the first commits of the two repositories A and B.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants