Hash sample optimization #908

Merged · 7 commits merged into arsenetar:master on Aug 14, 2021

Conversation

@glubsy (Contributor) commented on Jun 21, 2021

For files bigger than the user-defined threshold, instead of computing the full md5 (which takes time), pick three chunks: one at 25% of the file length, one at 60%, and the last chunk, then compute an "md5samples" hash from these three samples (sketched below).

Closes #734.

* Big files above the user-selected threshold can be partially hashed in three places.
* If the user is willing to take the risk, we consider files with identical md5samples to be identical.
* Instead of keeping the md5 samples separate, merge them into one hash computed from the selected chunks.
* We don't need a boolean to track whether the user chose to optimize; we can simply check the threshold value, since 0 means the optimization is disabled.
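
A minimal sketch of the sampling approach described above, assuming a hypothetical partial_md5 helper and a 1 MiB CHUNK_SIZE (the names are illustrative, not the actual PR code):

import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching the chunk size discussed in this PR

def partial_md5(path):
    # Hash three CHUNK_SIZE samples (at 25%, 60%, and the end of the file)
    # instead of the whole file. Hypothetical helper for illustration only.
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        fp.seek(size // 4)                            # sample starting at 25% of the file
        md5.update(fp.read(CHUNK_SIZE))
        fp.seek(size * 60 // 100)                     # sample starting at 60% of the file
        md5.update(fp.read(CHUNK_SIZE))
        fp.seek(-min(size, CHUNK_SIZE), os.SEEK_END)  # last chunk, clamped for small files
        md5.update(fp.read(CHUNK_SIZE))
    return md5.hexdigest()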
@glubsy (Contributor, Author) commented on Jun 21, 2021

Here is what it looks like in the preference dialog window:
[Screenshot: preferences dialog showing the new threshold option (2021-06-21_22-42-27)]

@arsenetar (Owner) commented

Preliminary review looks good; I'll need to pull this and give it a try.

@arsenetar (Owner) commented

@glubsy, did some testing with this, and I am wondering if we should enforce a minimum value here (it currently allows 1MB). The reason is that the CHUNK_SIZE used for the three hashes is 1MB, so if you set the limit to anything smaller than 3MB you are actually doing more work than just hashing the whole file once. I don't know why you would deliberately set the limit that low, but one could. I tested with the limit at 1MB: it does not cause issues, but it does do more work than simply hashing the files I tested with (just over 1MB).
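
A hypothetical guard illustrating the point (the names here are mine, not the PR's): below 3 * CHUNK_SIZE, three 1MB samples read at least as many bytes as hashing the whole file once, so the partial hash saves nothing.

CHUNK_SIZE = 1024 * 1024  # 1 MiB sample size

def should_partially_hash(size, threshold):
    # threshold is the user-selected big-file limit; 0 disables the optimization.
    # Requiring size > 3 * CHUNK_SIZE avoids doing more work than a full hash would.
    return threshold > 0 and size >= threshold and size > CHUNK_SIZE * 3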

@arsenetar (Owner) commented

Ran the tests locally; while they pass, the negative seek raises an exception (which is caught) in test cases that use files smaller than the chunk size. I would prefer either updating the tests to use files that are 1MB or larger (since you cannot actually request the optimization for smaller files), or writing that seek as fp.seek(-min(size, CHUNK_SIZE), 2).
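
For reference, a self-contained sketch of the suggested seek (read_last_chunk is a hypothetical name): clamping the offset to the file size keeps the seek from moving before the start of files smaller than CHUNK_SIZE, so no exception is raised.

import os

CHUNK_SIZE = 1024 * 1024

def read_last_chunk(path):
    size = os.path.getsize(path)
    with open(path, "rb") as fp:
        # os.SEEK_END == 2; a file smaller than CHUNK_SIZE is simply read from
        # its beginning instead of raising on an out-of-range negative offset.
        fp.seek(-min(size, CHUNK_SIZE), os.SEEK_END)
        return fp.read(CHUNK_SIZE)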

@glubsy (Contributor, Author) commented on Aug 8, 2021

That's a very good point, thanks! I'll work on this very shortly.

@ElDavoo commented on Aug 12, 2021

Saved my life in analyzing my movie collection, thanks!

@glubsy (Contributor, Author) commented on Aug 13, 2021

Had a quick look today, and it just occurred to me:

if size <= CHUNK_SIZE * 3:
    setattr(self, field, self.md5)

Wouldn't this be enough?

Computing 3 hash samples for files smaller than 3MiB (3 * CHUNK_SIZE) is not efficient, since the spans of later samples would overlap earlier ones; for a 2MiB file, for example, the 60% sample and the last chunk largely cover the same bytes.
Therefore we can simply return the hash of the entire small file instead.
The "walrus" operator is only available in python 3.8 and later. Fall back to more traditional notation.
Perhaps the python byte code is already optimized, but just in case it is not, keep pre-compute the constant expression.
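
For context, a sketch of the kind of whole-file read loop the walrus note above refers to (md5_whole_file is a hypothetical name): on Python 3.8+ the loop could be written with the walrus operator, while the fallback assigns the chunk before and at the end of the loop body.

import hashlib

CHUNK_SIZE = 1024 * 1024

def md5_whole_file(path):
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        # Python 3.8+ equivalent: while chunk := fp.read(CHUNK_SIZE): md5.update(chunk)
        chunk = fp.read(CHUNK_SIZE)
        while chunk:
            md5.update(chunk)
            chunk = fp.read(CHUNK_SIZE)
    return md5.hexdigest()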
@arsenetar (Owner) commented

Yeah, that looks good to me, and the tests seem to be good now as well. I am thinking of changing the edit boxes for this value and for "ignore small files" to spinboxes to prevent some input validation issues unrelated to this change, so I'll handle that separately from this PR. Thanks for the work.

@arsenetar merged commit e11f996 into arsenetar:master on Aug 14, 2021
@glubsy deleted the hash_sample_optimization branch on Aug 14, 2021 at 16:07
Linked issue: Substantial speed-up for hash scanning of huge files with same file size (4GB)