Sync files with the same size but not content #29

amiel · 2014-03-21T16:59:23Z

This is done by comparing an MD5 sum of the file with the ETag from S3. The ETag
from S3 is an MD5 of the entire file unless it is a multipart upload.

My assumption is that the filesize comparison was used to avoid downloading or
opening large files. This commit only runs the MD5 comparison for "small files"
(< 50 kilobytes), for two reasons: we avoid the cost of calculating an MD5 sum
for very large files, and we avoid having to deal with multipart uploads. The
reasonable assumption is that if a file changes without its size changing, it
is likely to be a small file. For example, my use case is syncing a REVISION
file that contains the git revision as a SHA-1. This file is always 41 bytes,
but changes frequently.

The "small file" threshold is 50 kilobytes for now, but could easily be changed.
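
A minimal sketch of the comparison described above, not the code from the commit itself. The names `SMALL_FILE_LIMIT` and `needs_upload?` are hypothetical, and `remote_size` / `remote_etag` are assumed to come from whatever S3 listing the sync already performs.

```ruby
require "digest/md5"

# Hypothetical threshold, matching the 50 kilobytes mentioned above.
SMALL_FILE_LIMIT = 50 * 1024

# Returns true when the local file should be uploaded again.
# remote_size and remote_etag are assumed to describe the existing S3 object.
def needs_upload?(local_path, remote_size, remote_etag)
  local_size = File.size(local_path)

  # A different size always means different content.
  return true if local_size != remote_size

  # Same size: only pay for hashing when the file is small.
  return false if local_size >= SMALL_FILE_LIMIT

  # S3 returns the ETag wrapped in double quotes, so strip them first.
  Digest::MD5.file(local_path).hexdigest != remote_etag.delete('"')
end
```

With a 41-byte REVISION file, the size check passes and the MD5 check is what detects the new revision.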

clarete · 2014-03-21T17:10:30Z

Thank you so much for the patch. I definitely agree with you about trying to achieve flexibility when deciding how we want to compare each file. We can easily change the 50k value or even allow the caller to supply a different value, but the patch as it is right now definitely helps a lot! Thanks!

amiel · 2014-03-21T17:15:08Z

😀 Thanks for the quick turnaround. Let me know when there's an update to the gem :)

clarete · 2014-03-21T17:23:29Z

Just pushed the version 2.0.2 to rubygems! Have a great day!!! :)

amiel · 2014-03-21T17:27:41Z

> We can easily change the 50k value or even allow the caller to inform a different value

Yeah, I thought about having the caller supply the small-file value. This would be nice, as the expense / feasibility of running the comparison changes depending on the context. For example, there is no cost difference for the computation on S3 objects from 0 to 5 GB, but at 5 GB it becomes impossible since AWS will split the file up. However, a reasonable size to File.read is considerably smaller...

Anyway, this seemed like a reasonable place to start.

Thanks for the gem update, now I can fix my deploy script :)
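
As a rough sketch of the caller-supplied threshold discussed above (not part of this patch): the method names `multipart_etag?` and `md5_comparable?` and the `small_file_limit:` keyword are hypothetical. It relies on the fact that S3 ETags for multipart uploads end in a dash followed by the part count, so they can be told apart from plain MD5 ETags.

```ruby
# Multipart ETags look like "<32 hex chars>-<part count>" and are not a
# plain MD5 of the whole object.
def multipart_etag?(etag)
  etag.delete('"').match?(/\A\h{32}-\d+\z/)
end

# Hypothetical helper: decides whether an MD5 comparison is worthwhile.
# small_file_limit is a caller-supplied threshold in bytes (default 50 KB).
def md5_comparable?(size, etag, small_file_limit: 50 * 1024)
  size < small_file_limit && !multipart_etag?(etag)
end
```

A caller that is comfortable reading larger files could pass, for example, `small_file_limit: 1024 * 1024`.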

clarete added a commit that referenced this pull request Mar 21, 2014

Merge pull request #29 from carnesmedia/27-fix-same-filesize

35747b0

Sync files with the same size but not content

clarete merged commit 35747b0 into clarete:master Mar 21, 2014

clarete mentioned this pull request Mar 21, 2014

sync doesn't update files when filesize is same, but content mismatches #27

Closed

kristianfreeman deleted the 27-fix-same-filesize branch March 21, 2014 17:14
