
How to make sure program reads content of the file? #1086

Open
rikuiki opened this issue Dec 31, 2022 · 7 comments
Labels
bug Bug reports.

Comments

@rikuiki

rikuiki commented Dec 31, 2022

I have 1.2 TB of data (photos and videos), out of which DupeGuru identified 400 GB of duplicates.

My concern is that it finished the analysis very fast (within one minute), and I didn't see much disk read activity in Windows Task Manager.
I would expect DupeGuru to spend significant time (30 minutes?) reading the content of 400 GB of duplicate files to calculate hashes.
I have the "partial hash for large files" option disabled.

How can this behavior be explained, and how can I make sure DupeGuru is actually checking file contents and not just sizes?
OS: Windows 11, filesystem: NTFS, disk: NVMe SSD.

@rikuiki added the bug label Dec 31, 2022
@glubsy
Contributor

glubsy commented Dec 31, 2022

It depends on the Application Mode you use and perhaps some other preference settings.

@rikuiki
Author

rikuiki commented Dec 31, 2022

Thank you. I use Content mode. What other preferences can have an impact?

@arsenetar
Owner

When doing a content mode scan, dupeguru first collects all the files within the specified directories, excluding any directory contents marked with exclude and any files in the exclude list. Additionally, if you have set preferences to ignore files smaller or larger than a given size, those files are also ignored.

The collected files are then put into groups based on file size (if the file size does not match, a file is not a duplicate for a contents scan). Within each size group, a very small section of each file is read to create a partial digest; if that does not match the partial digest of other files in the group, no further reads are done on that file, as it has already been verified as different. If a file's partial digest matches that of another file in the group, then either the full digest is computed, or, if "partially hash large files" is selected, a set of samples is taken from the file instead.

The digests calculated at each step are cached, so each digest is calculated at most once per file. The resulting digests are also cached across runs, so if a file has not changed size or mtime, the digest is not recalculated.

The sooner a file is determined not to be a duplicate, the less file data the program needs to read. So it is possible for it to process a large number of files in a relatively short time.
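
For illustration, a minimal sketch of that flow (not dupeGuru's actual code; the chunk sizes, hash function, and function names here are assumptions) might look like this:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

PARTIAL_CHUNK = 64 * 1024  # assumed size of the "very small section" read for the partial digest


def partial_digest(path: Path) -> bytes:
    # Hash only a small leading chunk of the file.
    with path.open("rb") as f:
        return hashlib.md5(f.read(PARTIAL_CHUNK)).digest()


def full_digest(path: Path) -> bytes:
    # Hash the whole file in chunks so large files do not fill memory.
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.digest()


def find_content_duplicates(paths: list[Path]) -> list[list[Path]]:
    # Step 1: only files with identical sizes can be content duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: no file data is read at all
        # Step 2: cheap partial digests; files whose partial digest is unique
        # are dropped without any further reads.
        by_partial = defaultdict(list)
        for p in same_size:
            by_partial[partial_digest(p)].append(p)
        # Step 3: full digests only for files that still look identical.
        for candidates in by_partial.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for p in candidates:
                by_full[full_digest(p)].append(p)
            groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

This is why a scan over 1.2 TB can finish quickly: most files are eliminated at the size or partial-digest stage without ever being read in full.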

@1024mb

1024mb commented Jan 8, 2023

Since you are talking about the cache here I would like to ask something.

I was searching for duplicate images with a high filter (99%), and dupeGuru managed to find various duplicates. Browsing through them, I noticed a duplicate group with two totally different files; there was zero similarity between them other than their filename, which was pretty simple, something like 4555_21.jpg. So I started to look into why dG would report those two files as 100% duplicates, and found out it was because of the cache; deleting the cache db file solved this.
So my question is: what does dG check other than the filename? I forgot to check the sizes of the files, but I think they had the exact same size too, so I'm guessing it checks the filename + size? If so, can I request a change?

I guess one of the files was in the path of the other file and got cached by dG, then I moved the file to another location, and another file with the same name and size appeared in the old location, so dG thought it was the old file. I'm thinking of an option to select a minimum filename length before caching a file (as files with filenames like a.jpg or 12.jpg are too common to be treated as unique), and also to compare the modification date, like backup software does (path, size, modtime). An option to disable caching would be great too, since processing is quite fast.

@arsenetar
Owner

arsenetar commented Jan 8, 2023

@1024mb The cache database uses the full path as the file key, and verifies that the cached entry has the same size and modification time as the file on disk. So the only way I can see it doing the sort of thing you describe is:

  1. At least one of the files was hashed and had its digests cached in a prior scan.
  2. That file was somehow replaced with another with the same name, or modified without updating the modification time or size (resulting in the same full path, modification time, and size).
  3. Another scan was done with the replaced file along with another file that was a duplicate of the original from 1., resulting in both the cache loading the old digests and the duplicate of the original also matching the old digests.

As long as the filesystem being used supports modification time, I would think it is highly unlikely to come across a situation where a file on disk successfully matches a cache entry while at the same time being a different file (outside of the user or other tools modifying the modification time).
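
As a rough sketch of that validation (assumed names and schema, not dupeGuru's actual cache code), a digest cache keyed by full path that only trusts an entry when the stored size and mtime still match the file on disk could look like this:

```python
import os
import sqlite3

# Hypothetical schema; dupeGuru's real cache layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS digests (
    path TEXT PRIMARY KEY,
    size INTEGER,
    mtime_ns INTEGER,
    digest BLOB
)
"""


class DigestCache:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(SCHEMA)

    def get(self, path: str) -> bytes | None:
        # Return a cached digest only if size and mtime still match the file on disk.
        row = self.conn.execute(
            "SELECT size, mtime_ns, digest FROM digests WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            return None
        size, mtime_ns, digest = row
        st = os.stat(path)
        if st.st_size != size or st.st_mtime_ns != mtime_ns:
            return None  # file changed since it was cached, so recompute
        return digest

    def put(self, path: str, digest: bytes) -> None:
        st = os.stat(path)
        self.conn.execute(
            "INSERT OR REPLACE INTO digests VALUES (?, ?, ?, ?)",
            (path, st.st_size, st.st_mtime_ns, digest),
        )
        self.conn.commit()
```

In a scheme like this, a replaced file would only be served a stale digest if its path, size, and mtime all happened to match the old entry.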

Not opposed to adding an option to disable the cache; depending on the files and storage used, the cache may not speed things up that much, although in some situations it is very helpful for subsequent scans.

I would be interested to know whether the modification time was valid for these files in this particular case (I have seen cases before where the mtime was a very large, invalid value, but that results in the digest not being cached).

For tracking this should probably be a separate issue.

@rikuiki
Author

rikuiki commented Jan 16, 2023

> The resulting digests are also cached across runs, so if a file has not changed size or mtime, the digest is not recalculated.

So it is possible that in my case the results were cached from my previous runs, and that's why I see a very fast processing time.
Is there any way I can clear the cache to check whether that's the case?

My worry is that, as I mentioned, the program found 400 GB of duplicates in my case, which should take some time to calculate all the digests for, and I'm afraid I may delete something that is not actually a duplicate if my setup is not correct.

Thank you for your prompt answers!

@arsenetar
Owner

There is an option to clear the cache in the file menu.
