
How to make sure program reads content of the file? #1086

Open
rikuiki opened this issue Dec 31, 2022 · 7 comments
Labels
bug Bug reports.

Comments

@rikuiki

rikuiki commented Dec 31, 2022

I have 1.2 TB of data (photos and videos), out of which DupeGuru identified 400 GB of duplicates.

My concern is that it finished the analysis very fast (within one minute), and I didn't see much disk read activity in Windows Task Manager.
I would expect DupeGuru to spend significant time (30 minutes?) reading the content of 400 GB of duplicate files to calculate hashes.
I have the "partial hash for large files" option disabled.

How can this behavior be explained, and how can I make sure DupeGuru is actually checking file contents and not just sizes?
OS: Windows 11, filesystem: NTFS, disk: NVMe SSD.

@rikuiki added the bug label Dec 31, 2022
@glubsy
Contributor

glubsy commented Dec 31, 2022

It depends on the Application Mode you use and perhaps some other preference settings.

@rikuiki
Author

rikuiki commented Dec 31, 2022

Thank you. I use Content mode. What other preferences can have an impact?

@arsenetar
Owner

When doing a content mode scan, dupeguru first collects all the files within the specified directories, excluding any directory contents marked with exclude and any files in the exclude list. Additionally, if you have set preferences to ignore files smaller or larger than a given size, those files are also ignored.

The collected files are then put into groups based on file size (if the file size does not match, a file is not a duplicate for a contents scan). Within each size group, a very small section of each file is read to create a partial digest; if that does not match the partial digest of other files in the group, no further reads are done on that file, as it has already been verified as different. If a file's partial digest matches that of another file in the group, then either the full digest is computed, or, if "partially hash large files" is selected, a set of samples is taken from the file instead.

The digests calculated at each step are cached, so each digest is calculated at most once per file. The resulting digests are also cached across runs, so if a file has not changed size or mtime, the digest is not recalculated.

The sooner a file is determined not to be a duplicate, the less file data the program needs to read. So it is possible for it to process a large number of files in a relatively short time.
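
For illustration, a minimal sketch of that flow (not dupeGuru's actual code; the chunk sizes, hash function, and function names here are assumptions) might look like this:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

PARTIAL_CHUNK = 64 * 1024  # assumed size of the "very small section" read for the partial digest


def partial_digest(path: Path) -> bytes:
    # Hash only a small leading chunk of the file.
    with path.open("rb") as f:
        return hashlib.md5(f.read(PARTIAL_CHUNK)).digest()


def full_digest(path: Path) -> bytes:
    # Hash the whole file in chunks so large files do not fill memory.
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.digest()


def find_content_duplicates(paths: list[Path]) -> list[list[Path]]:
    # Step 1: only files with identical sizes can be content duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: no file data is read at all
        # Step 2: cheap partial digests; files whose partial digest is unique
        # are dropped without any further reads.
        by_partial = defaultdict(list)
        for p in same_size:
            by_partial[partial_digest(p)].append(p)
        # Step 3: full digests only for files that still look identical.
        for candidates in by_partial.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for p in candidates:
                by_full[full_digest(p)].append(p)
            groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

This is why a scan over 1.2 TB can finish quickly: most files are eliminated at the size or partial-digest stage without ever being read in full.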

@1024mb

1024mb commented Jan 8, 2023

Since you are talking about the cache here I would like to ask something.

I was searching for duplicate images with a high filter (99%), and dupeGuru managed to find various duplicates. Browsing through them, I noticed a duplicate group with two totally different files; there was zero similarity between them other than their filename, which was pretty simple, something like 4555_21.jpg. So I started to look into why dG would report those two files as 100% duplicates, and found out it was because of the cache; deleting the cache db file solved this.
So my question is: what does dG check other than the filename? I forgot to check the sizes of the files, but I think they had the exact same size too, so I'm guessing it checks the filename + size? If so, can I request a change?

I guess one of the files was in the path of the other file and got cached by dG, then I moved the file to another location, and another file with the same name and size appeared in the old location, so dG thought it was the old file. I'm thinking of an option to select a minimum filename length before caching a file (as files with filenames like a.jpg or 12.jpg are too common to be treated as unique), and also to compare the modification date, like backup software does (path, size, modtime). An option to disable caching would be great too, since processing is quite fast.

@arsenetar
Owner

arsenetar commented Jan 8, 2023

@1024mb The cache database uses the full path as the file key, and verifies that the cached entry has the same size and modification time as the file on disk. So the only way I can see it doing the sort of thing you describe is:

  1. At least one of the files was hashed and had its digests cached in a prior scan.
  2. That file was somehow replaced with another with the same name, or modified without updating the modification time or size (resulting in the same full path, modification time, and size).
  3. Another scan was done with the replaced file along with another file that was a duplicate of the original from 1., resulting in both the cache loading the old digests and the duplicate of the original also matching the old digests.

As long as the filesystem being used supports modification time, I would think it is highly unlikely to come across a situation where a file on disk successfully matches a cache entry while at the same time being a different file (outside of the user or other tools modifying the modification time).
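
As a rough sketch of that validation (assumed names and schema, not dupeGuru's actual cache code), a digest cache keyed by full path that only trusts an entry when the stored size and mtime still match the file on disk could look like this:

```python
import os
import sqlite3

# Hypothetical schema; dupeGuru's real cache layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS digests (
    path TEXT PRIMARY KEY,
    size INTEGER,
    mtime_ns INTEGER,
    digest BLOB
)
"""


class DigestCache:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(SCHEMA)

    def get(self, path: str) -> bytes | None:
        # Return a cached digest only if size and mtime still match the file on disk.
        row = self.conn.execute(
            "SELECT size, mtime_ns, digest FROM digests WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            return None
        size, mtime_ns, digest = row
        st = os.stat(path)
        if st.st_size != size or st.st_mtime_ns != mtime_ns:
            return None  # file changed since it was cached, so recompute
        return digest

    def put(self, path: str, digest: bytes) -> None:
        st = os.stat(path)
        self.conn.execute(
            "INSERT OR REPLACE INTO digests VALUES (?, ?, ?, ?)",
            (path, st.st_size, st.st_mtime_ns, digest),
        )
        self.conn.commit()
```

In a scheme like this, a replaced file would only be served a stale digest if its path, size, and mtime all happened to match the old entry.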

Not opposed to adding an option to disable the cache; depending on the files and storage used, the cache may not speed things up that much, although in some situations it is very helpful for subsequent scans.

I would be interested to know whether the modification time was valid for these files in this particular case (I have seen cases before where the mtime was a very large, invalid value, but that results in the digest not being cached).

For tracking this should probably be a separate issue.

@rikuiki
Author

rikuiki commented Jan 16, 2023

> The resulting digests are also cached across runs, so if a file has not changed size or mtime, the digest is not recalculated.

So it is possible that in my case the results were cached from my previous runs, and that's why I see a very fast processing time.
Is there any way I can clear the cache to check whether that's the case?

My worry is that, as I mentioned, the program found 400 GB of duplicates in my case, which should take some time to calculate all the digests for, and I'm afraid I may delete something that is not actually a duplicate if my setup is not correct.

Thank you for your prompt answers!

@arsenetar
Owner

There is an option to clear the cache in the file menu.
