First gather.py, then index.py, then search.py.

What are we going to do?

We are going to utilize image fingerprinting to perform near-duplicate image detection. This technique is commonly called “perceptual image hashing” or simply “image hashing”.


What is image fingerprinting/hashing?

Image hashing is the process of examining the contents of an image and then constructing a value that uniquely identifies an image based on these contents.

For example, take a look at the image at the top of this post. Given an input image, we are going to apply a hash function and compute an “image hash” based on the image’s visual appearance. Images that are “similar” should have hashes that are “similar” as well. Using image hashing algorithms makes performing near-duplicate image detection substantially easier.

In particular, we’ll be using the “difference hash”, or simply dHash algorithm to compute our image fingerprints. Simply put, the dHash algorithm looks at the difference between adjacent pixel values. Then, based on these differences, a hash value is created.
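To make the idea concrete, here is a minimal sketch of the difference-hash idea in pure Python. It assumes the image has already been converted to grayscale and shrunk to a small grid with one more column than rows (8 rows of 9 pixels is a common choice, but that is an assumption here). The helper names dhash_bits and bits_to_hex are hypothetical, and the imagehash library used later in this post may differ in details such as bit order.

```python
def dhash_bits(pixels):
    """Return one bit per horizontally adjacent pixel pair.

    `pixels` is a list of rows of grayscale values (0-255). A bit is 1
    when the left pixel is brighter than its right-hand neighbour.
    """
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits


def bits_to_hex(bits):
    """Pack a list of 0/1 bits into a fixed-width hexadecimal string."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    width = (len(bits) + 3) // 4  # hex digits needed for len(bits) bits
    return format(value, "0{}x".format(width))


# Tiny 2x3 example grid (a real dHash would typically use a larger grid).
grid = [
    [10, 20, 15],
    [200, 100, 100],
]
# The same grid, uniformly brightened (no value reaches the 255 ceiling).
brighter = [[min(255, p + 30) for p in row] for row in grid]

fingerprint = bits_to_hex(dhash_bits(grid))
# Every left/right comparison is unchanged by the brightening,
# so the fingerprint is unchanged too.
same_after_brightening = fingerprint == bits_to_hex(dhash_bits(brighter))
print(fingerprint, same_after_brightening)
```

Because only the relative order of neighbouring pixels matters, small global changes to the image tend to leave most or all of the bits untouched, which is the property we want from a fingerprint.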


Why can’t we use md5, sha-1, etc.?

Unfortunately, we cannot use cryptographic hashing algorithms in our implementation. Due to the nature of cryptographic hashing algorithms, very tiny changes in the input file will result in a substantially different hash. In the case of image fingerprinting, we actually want our similar inputs to have similar output hashes as well.
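A quick way to see this for yourself is to hash two byte strings that differ in a single byte. The sketch below uses only Python's standard hashlib module. The byte strings are made-up stand-ins for image files, not real image data.

```python
import hashlib

original = b"pretend these bytes are an image file"
altered = original[:-1] + b"F"  # change only the final byte

md5_original = hashlib.md5(original).hexdigest()
md5_altered = hashlib.md5(altered).hexdigest()

# Count how many of the 32 hex digits still agree position by position.
matching_digits = sum(a == b for a, b in zip(md5_original, md5_altered))

print(md5_original)
print(md5_altered)
print("matching hex digits:", matching_digits, "of", len(md5_original))
```

The two digests have nothing useful in common, so comparing them tells us only that the files differ, not how much.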


Simply put, you can use image fingerprinting/hashing in nearly any setting where you are concerned with detecting near-duplicate copies of an image.

Among computer vision researchers, the CALTECH-101 dataset is legendary. It contains over 7,500 images from 101 different categories, including people, motorcycles, and airplanes.

From these ~7,500 images, I have randomly selected 17 of them.

Then, from these 17 randomly selected images, I have created N new images by randomly resizing them by +/- a few percentage points. Our goal here is to find these near-duplicate images — kind of like finding a needle in a haystack.

Again, these images are identical in every way except for width and height. Because their dimensions differ, the bytes of the files differ, so we cannot rely on simple md5 checksums: images with the same visual content may have dramatically different md5 hashes. Instead, we can resort to image hashing, where images with similar content will also have similar hash fingerprints.

So let’s get started by writing the code to fingerprint our dataset. Open up a new file, name it index.py, and let’s get to work:

The first thing we’ll do is import the packages we’ll need. We’ll use the Image class from PIL or Pillow to load our images off disk. Then the imagehash library can be utilized to construct the perceptual hash.

From there, argparse is used to parse command line arguments, shelve is used as a simple key-value database (Python dictionary) residing on disk, and glob is utilized to easily gather the paths to our images.

We then parse our command line arguments. The first, --dataset, is the path to our input directory of images. The second, --shelve, is the output path to our shelve database.
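As a rough illustration of the argument parsing just described, here is a minimal sketch using argparse. The two long switch names come from the text. The short flags, the help strings, the build_arg_parser name, and the example values images and db.shelve are assumptions made for this sketch.

```python
import argparse


def build_arg_parser():
    """Build the parser for the two switches described above."""
    parser = argparse.ArgumentParser(description="Fingerprint a directory of images")
    parser.add_argument("-d", "--dataset", required=True,
                        help="path to the input directory of images")
    parser.add_argument("-s", "--shelve", required=True,
                        help="output path for the shelve database")
    return parser


# Parsing an explicit list keeps the example runnable without a real command line.
args = vars(build_arg_parser().parse_args(["--dataset", "images", "--shelve", "db.shelve"]))
print(args["dataset"], args["shelve"])
```

In a real script you would call parse_args() with no arguments so that the values are read from the command line.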

Next, we open our shelve database for writing. This db will store our image hashes. More on that next:

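# keep only the filename (assumes "/" is the path separator)
filename = imagePath[imagePath.rfind("/") + 1:]
# append it to the list of filenames that share the hash h
db[h] = db.get(h, []) + [filename]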


Like I mentioned earlier in this post, images with the same fingerprint are considered to be identical.

Thus, if our goal is to find near-identical images, we need to maintain a list of images that have the same fingerprint value.

And that’s exactly what those lines do.

The former extracts the filename of the image, and the latter maintains a list of filenames that have the same image hash.
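Putting those pieces together, the indexing loop might look something like the sketch below. The full listing is not shown here, so treat this as one possible arrangement rather than the exact script: the function names, the hash_fn parameter, and the *.jpg pattern are assumptions. Image.open and imagehash.dhash are the Pillow and imagehash calls for loading and hashing an image, and they are imported inside dhash_file so the rest of the sketch runs without those packages.

```python
import glob
import os
import shelve


def build_index(image_paths, db, hash_fn):
    """Store, for every hash, the list of filenames that produced it."""
    for image_path in image_paths:
        h = str(hash_fn(image_path))  # shelve keys must be strings
        filename = os.path.basename(image_path)
        db[h] = db.get(h, []) + [filename]


def dhash_file(image_path):
    """Hash one image with Pillow and imagehash (both must be installed)."""
    from PIL import Image
    import imagehash
    with Image.open(image_path) as image:
        return imagehash.dhash(image)


def index_directory(dataset_dir, shelve_path, pattern="*.jpg"):
    """Fingerprint every file in dataset_dir that matches pattern."""
    paths = sorted(glob.glob(os.path.join(dataset_dir, pattern)))
    with shelve.open(shelve_path) as db:
        build_index(paths, db, dhash_file)
```

Passing the hash function in as a parameter is purely a convenience for this sketch: it lets you try build_index with a plain dictionary and a stand-in hash function before pointing it at real images.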

To extract image fingerprints from our dataset and build our database of hashes, issue the following command:

The script will run for a few seconds and once it is done, you’ll have a file named db.shelve that contains the key-value pairs of image fingerprints and filenames.
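If you want to peek inside the database once it is built, a small sketch like the following lists every fingerprint that is shared by more than one file. The function names are hypothetical. Also note that, depending on which dbm backend Python uses on your system, shelve may add an extension to the filename you give it or create more than one file on disk.

```python
import shelve


def duplicate_groups(db):
    """Return {hash: filenames} for every hash shared by more than one file."""
    return {h: names for h, names in db.items() if len(names) > 1}


def print_duplicate_groups(shelve_path):
    """Open the database read-only and list the groups of matching files."""
    with shelve.open(shelve_path, flag="r") as db:
        for h, names in sorted(duplicate_groups(db).items()):
            print(h, ", ".join(names))
```

Calling print_duplicate_groups("db.shelve") after indexing should print one line per group of files that hashed to the same value.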

This same basic algorithm is what I utilized years ago when I was working for the dating startup. We took our dataset of inappropriate images, constructed an image fingerprint for each image, and then stored them in our database. When a new image arrived, I simply computed the hash of the image and checked the database to see if the upload had already been flagged for invalid content.

In the next step I’ll show you how to perform the actual search to determine if an image already exists in the database with the same hash value.

Step 2: Searching a Dataset

Now that we have built a database of image fingerprints, it’s time to search our dataset.

Open up a new file, name it search.py, and we’ll get coding.

Once again, we’ll import our relevant packages. We then parse our command line arguments. We’ll need three switches: --dataset, which is the path to our original dataset of images; --shelve, the path to where our shelve database of key-value pairs resides; and --query, the path to our query/uploaded image. Our goal will be to take the query image and determine if it already exists in our database.

Now, let’s write the code to perform the actual search:

We first open our database, and then we load our image off of disk, compute the image fingerprint, and find all images with the same fingerprint value.

If there are any images with the same hash value, we loop over these images and display them to our screen.

Using this code we will be able to determine if an image already exists in our database using nothing but the fingerprint value.
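For reference, the search steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the exact search.py: the short flags, help strings, and function names are made up for the sketch, and it prints the path of each match instead of displaying the image on screen. Pillow and imagehash are imported inside main so the helper functions can be tried without them.

```python
import argparse
import os
import shelve


def build_arg_parser():
    """Build the parser for the three switches described above."""
    parser = argparse.ArgumentParser(description="Look up an image by fingerprint")
    parser.add_argument("-d", "--dataset", required=True,
                        help="path to the original dataset of images")
    parser.add_argument("-s", "--shelve", required=True,
                        help="path to the shelve database")
    parser.add_argument("-q", "--query", required=True,
                        help="path to the query image")
    return parser


def find_matches(db, query_hash):
    """Return the filenames stored under the query's hash (empty if none)."""
    return list(db.get(str(query_hash), []))


def main(argv=None):
    args = build_arg_parser().parse_args(argv)

    # Imported here so the helpers above work without Pillow or imagehash.
    from PIL import Image
    import imagehash

    # Fingerprint the query image with the same hash used for indexing.
    with Image.open(args.query) as query:
        query_hash = imagehash.dhash(query)

    # Open the database read-only and look up the exact hash value.
    with shelve.open(args.shelve, flag="r") as db:
        matches = find_matches(db, query_hash)

    print("Found {} image(s)".format(len(matches)))
    for filename in matches:
        # Printing the path stands in for displaying the image.
        print(os.path.join(args.dataset, filename))
```

To use it as a script, call main() at the bottom of the file. The lookup is an exact match on the hash string, so it only finds images whose fingerprint is identical to the query's.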