Feature request: Allow usage of perceptual hashes for images #65

grayfallstown · 2022-12-28T15:21:24Z

This would allow users to dedup pictures even if they have different file formats, resolutions or quality. Usually one wants to keep the largest file.

im_hash provides this functionality in Rust.

darakian · 2023-01-09T18:20:28Z

Interesting. I think this would break my difference algorithm, but I'm intrigued. Can you elaborate on the use case a bit more?

grayfallstown · 2023-01-10T04:01:57Z

First use case (saved wallpapers or memes)

I have a lot of images saved on my computer. Images are often reuploaded on the internet (think memes, for example), and each reupload is most likely recompressed, with additional quality loss. If I come across an image I like, I might not remember having downloaded it already (think wallpapers for a slideshow) and download it again. Now I have the same image twice. The two files are unlikely to share a sha512 or similar hash, though, even when they were downloaded from the same page at different URLs, because the second URL is likely a reupload with a different quality or even a different image format (jpg vs png vs webp, etc.).

A perceptual hash can identify duplicates despite quality loss, such as compression artifacts or reduced resolution.
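
To make the idea concrete, here is a minimal sketch of one simple perceptual hash, the average hash (aHash), using only the standard library. It is not necessarily the algorithm any particular crate implements. It assumes the image has already been decoded and downscaled to an 8x8 grayscale thumbnail; `average_hash` and `hamming_distance` are illustrative names, not an existing API.

```rust
/// Average hash (aHash) of an 8x8 grayscale thumbnail.
/// Assumption: the caller has already decoded the image and
/// downscaled it to 64 luma values (row-major order).
fn average_hash(pixels: &[u8; 64]) -> u64 {
    let sum: u32 = pixels.iter().map(|&p| u32::from(p)).sum();
    let mean = sum / 64;
    let mut hash = 0u64;
    for (i, &p) in pixels.iter().enumerate() {
        // One bit per pixel: set when the pixel is brighter than the mean.
        if u32::from(p) > mean {
            hash |= 1u64 << i;
        }
    }
    hash
}

/// Number of differing bits between two hashes (0 = identical, 64 = opposite).
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Small pixel-level changes such as compression noise rarely push a pixel across the mean, so recompressed copies tend to produce hashes that differ in only a few bits, while unrelated images differ in many.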

Second use case (memes again)

Memes are known to reuse the same template over and over again, each time producing new (hopefully) original content by adding different text or using faceswap (bron memes).
A user might want to find all memes by searching by the template (either to organize memes or to find a specific one), as each meme from the same template is technically just a duplicate of the template with some 'dirt' on it (text + compression artifacts).
Using perceptual hashes, the user can find all memes made from the same template. The comparison parameters must be set somewhat loose for this, and therefore must be configurable in the tool.
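
A configurable comparison could look roughly like the sketch below, assuming each image has already been reduced to a 64-bit perceptual hash. `find_similar` and `max_distance` are hypothetical names for illustration, not part of any existing tool or crate.

```rust
/// Hypothetical helper: returns the indices of all candidate hashes that are
/// within `max_distance` differing bits of the `template` hash.
/// Assumption: hashes are 64-bit perceptual hashes computed elsewhere.
fn find_similar(template: u64, candidates: &[u64], max_distance: u32) -> Vec<usize> {
    candidates
        .iter()
        .enumerate()
        // XOR leaves a 1 bit wherever the two hashes disagree.
        .filter(|&(_, &h)| (template ^ h).count_ones() <= max_distance)
        .map(|(i, _)| i)
        .collect()
}
```

A `max_distance` of 0 only accepts identical hashes; raising it makes the search looser, which is what the template search would need. Suitable values depend on the hash algorithm and would have to be found by experiment.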

Third use case (searching for a higher quality version of an image)

A user might download an image and then remember having the same image, but in better quality, somewhere on his PC. By searching for perceptual duplicates of the image, the user will find the higher quality version already stored on his PC.
This applies especially to photos that the user keeps backing up to a cloud like Google Drive. Google Drive, by default, compresses images further, losing quality. Such compressed images do not count against the used drive space (not sure if this is still the case with gdrive, but it used to be); uncompressed images do. If the user wants a better version of an image he found in his Google Photos on his gdrive and has a manual backup of the original image on an external hard drive, he can find the original file by searching for duplicates using perceptual hashes. 'Normal' file hashes would see them as different files, which they actually are, but perceptually they are the same image at different quality.

darakian · 2023-01-10T17:43:35Z

So if I have this right

Use case 1

I have one image so I perceptual hash (phash) that and look for similar images

Use case 2

I have one or more directories of images and I want to see the buckets of images as defined by the phash (maybe ordered by quality or file size or something)

Use case three seems like a repeat of use case one to me: both follow the "I have one image, show me similar images" pattern. Let me know if I'm wrong there.
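
One way the bucketing could work is sketched below, under the assumption that every file already has a 64-bit phash. It uses a simple greedy grouping against the first member of each bucket, so the result depends on input order; `ImageEntry` and `bucket_by_phash` are made-up names for illustration.

```rust
/// Hypothetical record for one scanned file.
#[derive(Debug, Clone, PartialEq)]
struct ImageEntry {
    path: String,
    size_bytes: u64,
    phash: u64, // assumed to be computed elsewhere
}

/// Greedy grouping: an image joins the first bucket whose first member is
/// within `max_distance` bits, otherwise it starts a new bucket.
/// Each bucket is then sorted with the largest file first.
fn bucket_by_phash(images: Vec<ImageEntry>, max_distance: u32) -> Vec<Vec<ImageEntry>> {
    let mut buckets: Vec<Vec<ImageEntry>> = Vec::new();
    for image in images {
        let slot = buckets
            .iter()
            .position(|b| (b[0].phash ^ image.phash).count_ones() <= max_distance);
        match slot {
            Some(i) => buckets[i].push(image),
            None => buckets.push(vec![image]),
        }
    }
    for bucket in &mut buckets {
        bucket.sort_by(|a, b| b.size_bytes.cmp(&a.size_bytes));
    }
    buckets
}
```

Sorting by file size puts the likely keeper at the front of each bucket, matching the "usually one wants to keep the largest file" idea from the original request.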

But two questions

Is this strictly for bitmap images? E.g. no svg, no gimp multilayer projects, no video, etc.
Say the user has tens of images that fit a given phash; how should that be displayed?

This strikes me as its own project to be honest, but if you're willing to provide feedback and testing then I'm at least willing to give it a crack 😄

grayfallstown · 2023-01-11T11:46:01Z

I would totally beta test and use this. No idea how to design the user interface though.

Hashes for videos are a bit more difficult. There was a C# solution for gifs. Maybe it just took a frame every 0.25 seconds and compared the frames.
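
If the guess about frame sampling is right, the approach might be sketched like this. Everything here is an assumption for illustration: the function names are made up, frame extraction and per-frame hashing are taken to happen elsewhere, and this is not how any known tool is confirmed to work.

```rust
/// Timestamps (in milliseconds) at which to grab a frame,
/// e.g. every 250 ms for the "every 0.25 seconds" idea.
fn sample_timestamps_ms(duration_ms: u64, interval_ms: u64) -> Vec<u64> {
    assert!(interval_ms > 0, "interval must be positive");
    (0..=duration_ms).step_by(interval_ms as usize).collect()
}

/// Fraction of frame pairs (compared position by position) whose 64-bit
/// perceptual hashes differ in at most `max_distance` bits.
/// Extra frames in the longer sequence are ignored.
fn matching_frame_ratio(a: &[u64], b: &[u64], max_distance: u32) -> f64 {
    let pairs = a.len().min(b.len());
    if pairs == 0 {
        return 0.0;
    }
    let matches = a
        .iter()
        .zip(b)
        .filter(|&(x, y)| (x ^ y).count_ones() <= max_distance)
        .count();
    matches as f64 / pairs as f64
}
```

Two clips could then be treated as duplicates when the ratio exceeds some cutoff, though a position-by-position comparison like this would miss clips that are trimmed or re-timed.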

You would have to rasterize svg files to make use of phash. Same for gimp projects.

darakian · 2023-01-18T19:07:09Z

Cool. I'll try to get some time this weekend to start experimenting with it. Probably just jpegs to start with but we'll see. I'll ping you on the new repo 👍

darakian · 2023-01-31T23:31:03Z

@grayfallstown give darakian/rustExperiments#2 a gander. It's a quick and dirty first pass but let me know.

darakian closed this as completed Jan 31, 2023
