## Check that there are no duplicate images in your dataset

To do a quick automated check for duplicate images, we'll work from this article: [Detect and remove duplicate images from a dataset for deep learning](https://www.pyimagesearch.com/2020/04/20/detect-and-remove-duplicate-images-from-a-dataset-for-deep-learning/).

The article explains:
- why it is important to check for duplicates
- how to check for duplicates in a large dataset using image hashing

An image hash is a function that maps an image to a value, such that two identical images always map to the same value. The space of possible hash values, however, should be so large that two different images are unlikely to map to the same value.

The article also gives an implementation of a very simple image hashing function, _dhash_(). The space of possible output values of _dhash_() is a very large but finite subset of the non-negative integers. In addition, the function is designed to be somewhat invariant to rescaling of the image intensities: if a dataset contains two RGB images that are identical except that one has had all of its values in all bands multiplied by a single positive value, then their hashes should be identical (as long as no values were clipped by the multiplication).
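To make the idea concrete, here is a minimal NumPy-only sketch of a difference hash. It is an illustration under stated assumptions, not the `dhash` shipped in `ramp.utils.img_utils` or the article's exact code: the name `dhash_sketch` is hypothetical, it expects a 2-D grayscale array at least `hash_size` by `hash_size + 1` pixels, and it downsamples by block averaging rather than with an image library's resize.

```python
import numpy as np

def dhash_sketch(gray, hash_size=8):
    """Difference hash of a 2-D grayscale array (illustrative only)."""
    gray = np.asarray(gray, dtype=np.float64)
    rows, cols = hash_size, hash_size + 1
    # Downsample to (hash_size, hash_size + 1) by averaging blocks of pixels.
    row_edges = np.linspace(0, gray.shape[0], rows + 1).astype(int)
    col_edges = np.linspace(0, gray.shape[1], cols + 1).astype(int)
    small = np.array([
        [gray[row_edges[r]:row_edges[r + 1],
              col_edges[c]:col_edges[c + 1]].mean()
         for c in range(cols)]
        for r in range(rows)
    ])
    # One bit per horizontally adjacent pair: is the right pixel brighter?
    bits = small[:, 1:] > small[:, :-1]
    # Pack the hash_size * hash_size bits into one non-negative integer.
    return sum(1 << i for i, bit in enumerate(bits.flatten()) if bit)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
other = rng.integers(0, 256, size=(64, 64)).astype(np.float64)

h_original = dhash_sketch(img)
h_scaled = dhash_sketch(img * 0.5)   # same image, every value halved
h_other = dhash_sketch(other)        # an unrelated image

print(h_original == h_scaled)
print(h_original == h_other)
```

Because each bit records only whether one pixel is brighter than its neighbor, multiplying every value by the same positive factor leaves the bits unchanged. This is the rescaling invariance described above.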

This hash function is not designed to be robust to a bad actor who is actually trying to mess up your dataset (by, for example, adding the same image twice, with two different labels). It's intended to find duplicate images that landed in the dataset by accident. This is a common occurrence.

_dhash_() takes a parameter, hashSize, that controls the length of the hash. The larger hashSize is, the larger the space of possible hashes, and the lower the likelihood of hash collisions: images that are different but have the same hash value.

Keep in mind, though, that the space of possible hash values for the default hashSize, 8, is huge (2^64). This is big enough for most purposes.
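For a rough sense of scale, the sketch below applies the standard birthday-bound approximation to a hash of `hash_size * hash_size` bits. It assumes that hash values are spread uniformly over the whole space, which real image hashes are not, so treat the result as an optimistic estimate rather than a guarantee. The function name `collision_probability` is hypothetical.

```python
import math

def collision_probability(n_images, hash_size=8):
    """Approximate chance that at least two of n_images uniformly random
    hashes coincide (birthday bound)."""
    n_bits = hash_size * hash_size
    space = 2 ** n_bits
    # P(collision) is approximately 1 - exp(-n(n-1) / (2 * space)).
    return -math.expm1(-n_images * (n_images - 1) / (2 * space))

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,d} images: {collision_probability(n):.3e}")
```

In practice the uniformity assumption fails most visibly for featureless images: under a strictly-greater-than comparison of neighboring pixels, every constant-valued image (for example, an all-black chip) produces the hash 0, so such images will be reported as duplicates of each other.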

In [None]:
%matplotlib inline

import logging
logging.basicConfig(level = logging.INFO)

from pathlib import Path
import sys, os
import numpy as np
RAMP_HOME = os.environ["RAMP_HOME"]

from osgeo import gdal, gdalnumeric as gdn
import matplotlib.pyplot as plt

In [None]:
# I added this code to prevent mysterious out of memory problems with the GPU.
# see this webpage for explanation: 
# https://stackoverflow.com/questions/43147983/could-not-create-cudnn-handle-cudnn-status-internal-error
# Ask tensorflow to print errors only; this must be set before tensorflow is imported.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "No GPU hardware devices available"
# Enable memory growth on every GPU that is present, however many there are.
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

In [None]:
from ramp.utils.img_utils import gdal_get_mask_tensor, gdal_get_image_tensor, convert_RGB_array_to_grayscale, dhash

In [None]:
# Construct pathnames for all images in the dataset. 

dataset = Path(RAMP_HOME) / 'ramp-code/notebooks/sample-data/training_data/chips'
img_files = [str(fn) for fn in sorted(dataset.glob('**/*.tif'))]

##### The hashing method first converts the image to a grayscale image. 

First we look at the color image, then its grayscale version, for a sanity check. 

In [None]:
# first look at color image
f1 = gdal_get_image_tensor(img_files[6])
plt.imshow(f1)

In [None]:
# view after conversion to grayscale
gray1 = convert_RGB_array_to_grayscale(f1)
plt.imshow(gray1, cmap='gray')

### Check for duplicates using the hash function. 

This is done the same way as described in the reference article. 

Create a dictionary with the hash value of an image as the key and a list of image paths as the value. 
Any hash value with more than one image path in its list denotes a set of duplicate images.

No duplicates should be found. If any are found, they should be removed.

In [None]:
# construct a dictionary with dhash-values as keys, and a list of files with that dhash-value as values
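hashes = {}
for filepath in img_files:
    c_img = gdal_get_image_tensor(filepath)
    h = dhash(c_img)
    # setdefault stores the new list in the dictionary the first time h is seen;
    # hashes.get(h, []) would build a throwaway list and leave hashes empty.
    hashes.setdefault(h, []).append(filepath)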

In [None]:
# finally, check the data set for duplicates
found_dups = False
for hashval, pathlist in hashes.items():
    if len(pathlist) > 1:

        # found a hash matching more than one image
        found_dups = True
        print("found duplicate images:")
        for p in pathlist: 
            print(p)
        print("")
        
if not found_dups:
    print("No duplicates found")
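If duplicates are reported, one way to decide which files to remove is to keep a single path from each group and list the rest. The sketch below assumes a dictionary shaped like `hashes` above; the function name `select_redundant_paths` and the example paths are hypothetical. It only builds a list and deletes nothing, so the candidates can be reviewed visually first.

```python
def select_redundant_paths(hashes):
    """Return every path except the first (in sorted order) from each
    group of paths that share a hash value."""
    redundant = []
    for paths in hashes.values():
        # A group of length 1 contributes nothing to the list.
        redundant.extend(sorted(paths)[1:])
    return redundant

# Hypothetical example: one unique image and one group of three duplicates.
example_hashes = {
    101: ["chips/a.tif"],
    202: ["chips/c.tif", "chips/b.tif", "chips/d.tif"],
}
to_remove = select_redundant_paths(example_hashes)
print(to_remove)
```

If each image chip has a matching label or mask file, that file presumably needs to be removed along with the chip so that the dataset stays paired.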

In [None]:
##### Created for ramp project, August 2022
##### Author: carolyn.johnston@dev.global