How to make sure program reads content of the file? #1086
It depends on the Application Mode you use and perhaps some other preferences settings.
Thank you. I use Content mode. What other preferences can have an impact?
When doing a content mode scan, dupeguru first collects all the files within the specified directories, excluding any directory contents marked as excluded and any files in the exclude list. Additionally, if you have set preferences to ignore files smaller or larger than a given size, those files are also ignored.

The collected files are then put into groups based on file size. (If the file size does not match, a file is not a duplicate for a contents scan.) Within each size group, a very small section of each file is read to create a partial digest; if that does not match the partial digest of other files in the group, no further reads are done on that file, as it has already been verified as different. If a file's partial digest matches another file's in the group, then either the full digest is computed or, if "partially hash large files" is selected, a set of samples is read from the file instead.

The digests calculated at each step are cached, so each digest is calculated at most once per file. The cache also persists across runs, so if a file's size and mtime have not changed, its digest is not recalculated. The sooner a file is determined not to be a duplicate, the less file data the program needs to read, so it is possible to process a large number of files in a relatively short time.
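The pipeline above (size groups, then cheap partial digests, then full digests only for remaining candidates) can be sketched roughly like this. The hash algorithm, chunk sizes, and function names here are illustrative assumptions, not dupeGuru's actual implementation:

```python
import hashlib
from collections import defaultdict

PARTIAL_READ_SIZE = 64 * 1024  # assumed chunk size; dupeGuru's real value may differ


def partial_digest(path):
    """Hash only the first chunk of the file (cheap pre-filter)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(PARTIAL_READ_SIZE)).digest()


def full_digest(path):
    """Hash the entire file contents."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.digest()


def find_duplicates(paths):
    import os
    # Step 1: group candidates by exact file size.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.stat(p).st_size].append(p)

    duplicates = []
    for group in by_size.values():
        if len(group) < 2:
            continue  # unique size -> cannot be a duplicate, no read needed
        # Step 2: within a size group, compare cheap partial digests first.
        by_partial = defaultdict(list)
        for p in group:
            by_partial[partial_digest(p)].append(p)
        # Step 3: only files sharing a partial digest get a full read.
        for candidates in by_partial.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for p in candidates:
                by_full[full_digest(p)].append(p)
            duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates
```

Note how most files never get past step 1 or 2, which is why a scan can finish with relatively little disk reading.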
Since you are talking about the cache here, I would like to ask something. I was searching for duplicate images with a high filter (99%), and dupeGuru managed to find various duplicates. Browsing through them, I noticed there was a duplicate group with two totally different files; I mean there was zero similarity between them other than their filename, which was pretty simple, something like … I guess one of the files was in the path of the other file, got cached by dG, then I moved the file to another location, and another file appeared with the same name and size in the old location, and dG thought it was the old file. I'm thinking of an option to select a minimum filename length (as files with a filename like …)
@1024mb The cache database uses the full path as the file key, and verifies that the file has the same size and modification time. So the only ways I can see it doing the sort of thing you describe are:
As long as the filesystem in use supports modification time, I would think it is highly unlikely to come across a situation where a file on disk matches a cache entry while actually being a different file (outside of the user or other tools modifying the modification time). I am not opposed to adding an option to disable the cache; depending on the files and storage used, the cache may not speed things up that much, although in some situations it is very helpful for subsequent scans. I would be interested to know whether the modification time was valid for these files in this particular case (I have seen cases before where the mtime was a very large, invalid value, but that results in the digest not being cached). For tracking, this should probably be a separate issue.
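A minimal sketch of that kind of cache validation, keyed on the full path and invalidated when size or mtime changes. The dict-based cache and function names here are hypothetical (dupeGuru persists its cache to disk), but the hit/miss logic is the idea described above:

```python
import os

# Hypothetical in-memory stand-in for the on-disk digest cache:
# keyed by absolute path, valid only while size and mtime are unchanged.
cache = {}


def cached_digest(path, compute):
    """Return the cached digest for path, recomputing only when stale."""
    st = os.stat(path)
    key = os.path.abspath(path)
    entry = cache.get(key)
    if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime:
        return entry["digest"]  # cache hit: no file read needed
    digest = compute(path)      # cache miss or stale entry: recompute
    cache[key] = {"size": st.st_size, "mtime": st.st_mtime, "digest": digest}
    return digest
```

With this scheme, a different file can only be mistaken for a cached one if it appears at the same path with the same size *and* the same modification time.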
So it is possible that results are cached from my previous runs, and that's why I see very fast processing times. My worry is that, as I mentioned, the program found 400GB of duplicates in my case, and it should take some time to calculate all the digests; I'm afraid I may delete something that is not a duplicate if my setup is not correct. Thank you for your prompt answers!
There is an option to clear the cache in the File menu.
I have 1.2TB of data (photos and videos), out of which dupeGuru identified 400GB of duplicates.
My concern is that it finished analysis very fast (within one minute), and I didn't see much disk reading activity in Windows Task Manager.
I would expect dupeGuru to spend significant time (30 minutes?) reading the content of 400GB of duplicate files to calculate hashes.
I have "partial hash for large files" option disabled.
How can I explain such behavior, and how can I make sure dupeGuru is actually checking file contents and not just sizes?
OS: Windows 11, filesystem: NTFS, disk: SSD NVME.