Hash sample optimization #908

Merged · 7 commits merged into arsenetar:master on Aug 14, 2021

Conversation

@glubsy (Contributor) commented on Jun 21, 2021

For files bigger than the user-defined threshold, instead of computing the full md5 (which takes time), pick three chunks: one at 25% of the file length, one at 60%, and the last chunk, then compute an "md5samples" hash from these three samples (sketched below).

Closes #734.

* Big files above the user-selected threshold can be partially hashed in three places.
* If the user is willing to take the risk, we consider files with identical md5samples to be identical.
* Instead of keeping the md5 samples separate, merge them into one hash computed from the selected chunks.
* We don't need a boolean to track whether the user chose to optimize; we can simply check the threshold value, since 0 means the optimization is disabled.
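
A minimal sketch of the sampling approach described above, assuming a hypothetical partial_md5 helper and a 1 MiB CHUNK_SIZE (the names are illustrative, not the actual PR code):

import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching the chunk size discussed in this PR

def partial_md5(path):
    # Hash three CHUNK_SIZE samples (at 25%, 60%, and the end of the file)
    # instead of the whole file. Hypothetical helper for illustration only.
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        fp.seek(size // 4)                            # sample starting at 25% of the file
        md5.update(fp.read(CHUNK_SIZE))
        fp.seek(size * 60 // 100)                     # sample starting at 60% of the file
        md5.update(fp.read(CHUNK_SIZE))
        fp.seek(-min(size, CHUNK_SIZE), os.SEEK_END)  # last chunk, clamped for small files
        md5.update(fp.read(CHUNK_SIZE))
    return md5.hexdigest()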
@glubsy (Contributor, Author) commented on Jun 21, 2021

Here is what it looks like in the preference dialog window:
[Screenshot: preferences dialog showing the new threshold option (2021-06-21_22-42-27)]

@arsenetar (Owner) commented

Preliminary review looks good; I'll need to pull this and give it a try.

@arsenetar (Owner) commented

@glubsy, did some testing with this, and I am wondering if we should enforce a minimum value here (it currently allows 1MB). The reason is that the CHUNK_SIZE used for the three hashes is 1MB, so if you set the limit to anything smaller than 3MB you are actually doing more work than just hashing the whole file once. I don't know why you would deliberately set the limit that low, but one could. I tested with the limit at 1MB: it does not cause issues, but it does do more work than simply hashing the files I tested with (just over 1MB).
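
A hypothetical guard illustrating the point (the names here are mine, not the PR's): below 3 * CHUNK_SIZE, three 1MB samples read at least as many bytes as hashing the whole file once, so the partial hash saves nothing.

CHUNK_SIZE = 1024 * 1024  # 1 MiB sample size

def should_partially_hash(size, threshold):
    # threshold is the user-selected big-file limit; 0 disables the optimization.
    # Requiring size > 3 * CHUNK_SIZE avoids doing more work than a full hash would.
    return threshold > 0 and size >= threshold and size > CHUNK_SIZE * 3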

@arsenetar (Owner) commented

Ran the tests locally; while they pass, the negative seek raises an exception (which is caught) in test cases that use files smaller than the chunk size. I would prefer either updating the tests to use files that are 1MB or larger (since you cannot actually request the optimization for smaller files), or writing that seek as fp.seek(-min(size, CHUNK_SIZE), 2).
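
For reference, a self-contained sketch of the suggested seek (read_last_chunk is a hypothetical name): clamping the offset to the file size keeps the seek from moving before the start of files smaller than CHUNK_SIZE, so no exception is raised.

import os

CHUNK_SIZE = 1024 * 1024

def read_last_chunk(path):
    size = os.path.getsize(path)
    with open(path, "rb") as fp:
        # os.SEEK_END == 2; a file smaller than CHUNK_SIZE is simply read from
        # its beginning instead of raising on an out-of-range negative offset.
        fp.seek(-min(size, CHUNK_SIZE), os.SEEK_END)
        return fp.read(CHUNK_SIZE)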

@glubsy (Contributor, Author) commented on Aug 8, 2021

That's a very good point, thanks! I'll work on this very shortly.

@ElDavoo commented on Aug 12, 2021

Saved my life in analyzing my movie collection, thanks!

@glubsy (Contributor, Author) commented on Aug 13, 2021

Had a quick look today, and it just occurred to me:

if size <= CHUNK_SIZE * 3:
    setattr(self, field, self.md5)

Wouldn't this be enough?

Computing 3 hash samples for files smaller than 3MiB (3 * CHUNK_SIZE) is not efficient, since the spans of later samples would overlap earlier ones; for a 2MiB file, for example, the 60% sample and the last chunk largely cover the same bytes.
Therefore we can simply return the hash of the entire small file instead.
The "walrus" operator is only available in python 3.8 and later. Fall back to more traditional notation.
Perhaps the python byte code is already optimized, but just in case it is not, keep pre-compute the constant expression.
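
For context, a sketch of the kind of whole-file read loop the walrus note above refers to (md5_whole_file is a hypothetical name): on Python 3.8+ the loop could be written with the walrus operator, while the fallback assigns the chunk before and at the end of the loop body.

import hashlib

CHUNK_SIZE = 1024 * 1024

def md5_whole_file(path):
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        # Python 3.8+ equivalent: while chunk := fp.read(CHUNK_SIZE): md5.update(chunk)
        chunk = fp.read(CHUNK_SIZE)
        while chunk:
            md5.update(chunk)
            chunk = fp.read(CHUNK_SIZE)
    return md5.hexdigest()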
@arsenetar (Owner) commented

Yeah, that looks good to me, and the tests seem to be good now as well. I am thinking of changing the edit boxes for this value and for "ignore small files" to spinboxes to prevent some input validation issues unrelated to this change, so I'll handle that separately from this PR. Thanks for the work.

@arsenetar merged commit e11f996 into arsenetar:master on Aug 14, 2021
@glubsy deleted the hash_sample_optimization branch on Aug 14, 2021 at 16:07
Linked issue: Substantial speed-up for hash scanning of huge files with same file size (4GB)