SHA-1 hash collisions? #173

peterjc · 2017-08-29T15:01:38Z

Given the proposed CWL permalink URI scheme,

https://w3id.org/cwl/view/{scheme}/{commit}/{path}#{anchor}

https://github.com/common-workflow-language/cwlviewer/wiki/Permalinks

If I understand this correctly this depends on globally unique git commit hashes.

By design hashes are unique within a git repository, but potentially two different repositories could have the same commit hash with different content. This is extremely unlikely by accident, but might one day be possible by malicious intent - see https://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob/42450327#42450327 and the proof of principle SHA-1 collision with two PDF files https://shattered.io/

Does it fall to https://view.commonwl.org/git to detect and cope with a duplicate commit hash?

The text was updated successfully, but these errors were encountered:

MarkRobbo · 2017-08-30T09:51:16Z

Currently yes - the approach being used to prevent this is Git internals now having built in counter-cryptanalysis.

Hopefully by the time a feasible attack becomes easy, Git will have moved to a transition plan which upgrades their hashing and can be adopted also by CWL Viewer.

(A related and interesting talk about this attack which came out of defcon: https://youtu.be/NbHL0SYlrSQ)

peterjc · 2017-08-30T09:59:15Z

Right - my concern was the URL scheme assumes globally unique hashes, not just unique within a single git repository.

i.e. The risk of collision is much higher (still unlikely to be a problem by accident).

MarkRobbo · 2017-08-30T10:02:37Z

That's an excellent point, there is some risk of this kind of attack globally which may be difficult to check for under this scheme.

Thanks for bringing it up in an issue.

stain · 2017-10-06T09:03:02Z

You are right in that a sha1 conflict here is theoretically possible. The current https://view.commonwl.org/git will just pick the first entry with that commit. Using these URIs as internal graph names should mean that only first time the sha1 is seen at the viewer will win.

We can add to the listing "Appears in repository: a,b,c,d" and then it would be at least possible to discover multiple attacks.

Is there any additional hash we can easily get out of git? I guess we could do our own SHA-256 or so hashing by directory listing instead, but then we need to be very careful with ordering, file name cases, ignore files, etc. Git commits are easily accessed - and presumably as @MarkRobbo suggests git will have to deal with this at some point.

In theory we could make a SHA-256 that also embedded the origin repository URL, but that is not an option I prefer, as it means we can no longer identify the same workflow commit in multiple git repositories / forks.

(Note that the git commit ID of a CWL workflow is still naive - there could be a non-important commit like a change in a README file that does not affect the workflow. Someone has already used git subtree, where there could be new commits in a subtree while the top-level tree is at the same tree. (Uu.. I know!)

Comparing the packed workflow in addition to default files (and their secondaries) would give better equality, but is again something that would need to be very carefully defined and implemented.

Docker images and local binaries are also moving targets that to some extent should be included in the measurement of "workflow equality".

By having git/ as prefix we can add other alternative identifiers later. These would then be related with prov:alternateOf as they would have different time/content spans.

stain · 2017-10-06T10:28:32Z

I think we can settle Linus' comments that sha1 is still "OK" for content addressing, but not really for cryptographic security (e.g. git --sign). I did not propose the CWL permalinks for security, but for identity - although if used together with say blockchain technologies then the security concerns would come back.

What is our leap of fate here, as @peterjc points out, is assuming that the same commit sha1 in a different repository is actually the same commit (e.g. same file content). That GitHub does not currently check or validate. We could detect that by for instance doing a SHA512 of the generated research object - perhaps turn it into a bagit-ro that contains checksum manifests for different algorithms.

The known sha1 attack needs to modify both the "good" and the "bad" commits, say by inserting weird binaries into a .cwl file - and in CWL Viewer we have many simpler attacks you could do (e.g. I could write a rubbish workflow with fake metadata claiming @peterjc as author); I will close this issue as "Won't fix" - but feel free to re-raise to suggest good alternative identifiers!

peterjc · 2017-10-07T13:50:17Z

This sounds like a prudent course of action:

You are right in that a sha1 conflict here is theoretically possible. The current https://view.commonwl.org/git will just pick the first entry with that commit. Using these URIs as internal graph names should mean that only first time the sha1 is seen at the viewer will win.

We can add to the listing "Appears in repository: a,b,c,d" and then it would be at least possible to discover multiple attacks.

But can we/should we tweak the database behind the scenes to insist the hash is unique, and give a clear error/warning on registering a workflow with a hash collision - in which case the user can likely easily avoid this (e.g. minor README file update to generate a new hash)?

Use case: Person/Team B takes forks a copy of a an existing workflow repository by Team A (which have previously been published under A's name), and starts by attempting to upload existing version(s) under B's name before moving on to sharing their derived workflow. Here there would absolutely be git hash collisions as not only is the content identical, so are the first commits of the two repositories A and B.

stain closed this as completed Oct 6, 2017

FarahZKhan mentioned this issue Oct 30, 2017

SHA1 or SHA256 for file checksums and Provenance document common-workflow-language/common-workflow-language#537

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SHA-1 hash collisions? #173

SHA-1 hash collisions? #173

peterjc commented Aug 29, 2017 •

edited

MarkRobbo commented Aug 30, 2017

peterjc commented Aug 30, 2017

MarkRobbo commented Aug 30, 2017

stain commented Oct 6, 2017

stain commented Oct 6, 2017

peterjc commented Oct 7, 2017

SHA-1 hash collisions? #173

SHA-1 hash collisions? #173

Comments

peterjc commented Aug 29, 2017 • edited

MarkRobbo commented Aug 30, 2017

peterjc commented Aug 30, 2017

MarkRobbo commented Aug 30, 2017

stain commented Oct 6, 2017

stain commented Oct 6, 2017

peterjc commented Oct 7, 2017

peterjc commented Aug 29, 2017 •

edited