Sync files with the same size but not content #29

Merged 1 commit on Mar 21, 2014

amiel commented Mar 21, 2014

This is done by comparing an MD5 sum of the file with the ETag from S3. The ETag
from S3 is an MD5 of the entire file unless it is a multipart upload.

My assumption is that the file size comparison was used to avoid downloading or
opening large files. This commit only runs the MD5 comparison for "small files"
(< 50 kilobytes), for two reasons: we avoid the cost of calculating an MD5 sum
for very large files, and we avoid having to deal with multipart uploads. The
reasonable assumption is that if a file changes without its size changing, it
is likely to be a small file. For example, my use case is syncing a REVISION
file that contains the git revision as a SHA-1. This file is always 41 bytes,
but changes frequently.

The "small file" threshold is 50 kilobytes for now, but could easily be changed.
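
For readers who want to see the shape of the check described above, here is a minimal sketch. It is not the code from the commit: the constant `SMALL_FILE_SIZE`, the method `same_content?`, and its parameters are hypothetical names chosen for illustration, and it assumes the remote size and ETag have already been fetched from S3 by whatever client is in use.

```ruby
require "digest"

# Hypothetical constant mirroring the 50-kilobyte cutoff described above.
SMALL_FILE_SIZE = 50 * 1024

# Returns true when the local file is considered identical to the remote
# object. `remote_size` and `remote_etag` are assumed to come from the S3
# object listing; both parameter names are illustrative.
def same_content?(local_path, remote_size, remote_etag)
  local_size = File.size(local_path)

  # Different sizes always mean different content.
  return false unless local_size == remote_size

  # For larger files, keep the size-only comparison so no MD5 is computed
  # and multipart ETags never have to be interpreted.
  return true if local_size >= SMALL_FILE_SIZE

  # S3 commonly returns the ETag wrapped in double quotes; strip them
  # before comparing with the local hex digest.
  Digest::MD5.file(local_path).hexdigest == remote_etag.delete('"')
end
```

With a check like this, a 41-byte REVISION file whose content changed would be reported as different even though its size is the same.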

clarete added a commit that referenced this pull request Mar 21, 2014
Sync files with the same size but not content
@clarete clarete merged commit 35747b0 into clarete:master Mar 21, 2014

clarete commented Mar 21, 2014

Thank you so much for the patch. I definitely agree with you about trying to achieve flexibility when deciding how we want to compare each file. We can easily change the 50k value or even allow the caller to inform a different value, but the patch as it is right now definitely helps a lot! Thanks!


amiel commented Mar 21, 2014

😀 Thanks for the quick turnaround. Let me know when there's an update to the gem :)


clarete commented Mar 21, 2014

Just pushed version 2.0.2 to rubygems! Have a great day!!! :)


amiel commented Mar 21, 2014

We can easily change the 50k value or even allow the caller to inform a different value

Yeah, I thought about having the caller specify the small-file threshold. This would be nice, as the expense and feasibility of running the comparison change depending on the context. For example, there is no difference in the cost of the computation for S3 objects from 0 to 5 GB, but at 5 GB it becomes impossible since AWS will split the file up. However, a reasonable size to File.read is considerably smaller...

Anyway, this seemed like a reasonable place to start.

Thanks for the gem update, now I can fix my deploy script :)
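
As a rough illustration of the caller-supplied threshold discussed in this thread, the sketch below takes the cutoff as a keyword argument and also skips the MD5 check when the ETag looks like a multipart one. Nothing here is the gem's actual API: `DEFAULT_SMALL_FILE_SIZE`, `multipart_etag?`, `content_changed?`, and the `small_file_size:` option are hypothetical names. It relies on the fact that multipart ETags carry a `-<part count>` suffix and are not an MD5 of the whole object.

```ruby
require "digest"

# Hypothetical default; the thread only says the 50k value could be changed.
DEFAULT_SMALL_FILE_SIZE = 50 * 1024

# Multipart ETags have the form "<32 hex chars>-<part count>", so they
# cannot be compared directly with an MD5 of the local file.
def multipart_etag?(etag)
  etag.delete('"').match?(/\A\h{32}-\d+\z/)
end

# Assumes the caller has already found the local and remote sizes equal.
# Returns true only when an MD5 comparison was possible and found a
# difference; otherwise the size-only result stands and false is returned.
def content_changed?(local_path, remote_etag, small_file_size: DEFAULT_SMALL_FILE_SIZE)
  return false if File.size(local_path) >= small_file_size
  return false if multipart_etag?(remote_etag)

  Digest::MD5.file(local_path).hexdigest != remote_etag.delete('"')
end
```

A deploy script syncing only tiny files could pass something like `small_file_size: 1024`, while a caller with memory to spare could raise it.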
