to further copy git (why not), this switches to 40-character hex sha1s for folders + files
also this replaces the event-stream dep with stream-combiner (it does the same thing but is less grab-baggy than event-stream)
we should probably make the getHash and getPath functions configurable at some point
switch to sha1, switch some deps
i did half of this already so I'm going to need to manually bring your changes in. one thing to think about is that git isn't all that good with large files, so we may want to switch from a single folder level to a sensible number of folder levels (i.e. fewer than the 35 we had before, but more than one)
agreed, I think the depth and hash functions should be pluggable
also @dominictarr revealed that he wrote this recently https://github.com/dominictarr/content-addressable-store and pushed it to git yesterday. cabs is pure streams whereas content-addressable-store isn't quite as streamy, but they are similar.
one thing cabs should steal though is https://github.com/dominictarr/content-addressable-store/blob/master/index.js#L77 and then https://github.com/dominictarr/content-addressable-store/blob/master/index.js#L89, which will make sure that corrupted files never get written to the blob folder
also I ran some basic benchmarks, for a ~700mb AVI:

- calculating the SHA-1 in node takes 2950ms
- copying the file in node takes 1704ms
- but cabs'ing the file takes 30368ms (with the current defaults)
the problem is that if I bump the limit up to, say, 1gb, then byte-stream will buffer 1gb, which is baaad. maybe we need to rethink our approach here. I'm willing to bet that if we just got rid of the file limit it would be faster and simplify things a lot (e.g. store entire blobs as single files)
of course this means that for super huge files we might run into limits, but i'm not as concerned with that as I am the number of files-per-directory. i'd rather have cabs be fast
You should definitely have a pluggable hash function. sha1 should not be used in new systems: weaknesses have been found that mean you can generate collisions in 2^52 evaluations (avg). This is infeasible currently, but in a few years it won't be.
estimated cost to generate a sha1 collision: weakened to 2^52 evaluations.
If sha1 hadn't been weakened it would be 2^80 evaluations - to put this in perspective, 2^80 / 2^52 = 2^28, which is about 268 million, so the weakened attack is 268 million times easier. if it cost 50k to generate 2^52 hashes, then 2^80 would cost 50k * 268m ≈ 13 trillion, about 1/4 of the total world gdp.
Using sha1 is acceptable if you need to be backwards compatible with other systems currently in use - but if you are building something that you hope may be in use (or future revisions of it may be in use) in 20 years then you should not use sha1.
sha256 is okay, although it's vulnerable to a length extension attack: if you know sha256(X) you can calculate sha256(X + foo) (strictly, sha256(X + padding + foo)) even if you don't know what X is. This can be avoided by using double sha256: sha256(sha256(X))
Also, if you want to make a performant blob store, this has some very promising ideas: http://www.youtube.com/watch?v=T4DgxvS9Xho
@calvinmetcalf you mention "editing" - what do you mean here? I'm confused, because you can't edit in a content addressable store - because changing the file means that the hash is now different.
aha so you mean removing something and adding a new thing?
sha256 is good, the key isn't too long, and there are reasonable implementations in pure js in case you want to run in the browser (if that is a design goal), although blake2s is better for that.
ok guys, I made it default to sha256, but since @maxogden wants to focus on performance for his application I made everything configurable, including the folder depth (default 3).