**Data duplication rationale**

As we collected images of climbing holds from multiple data sources on the same climbing site, there is a significant risk of having duplicate images in our dataset. Duplicates can bias the training process by causing the model to overfit, distort the evaluation metrics, and reduce its ability to generalize to new data. To ensure the quality and reliability of our model, we will implement a deduplication step to remove identical or highly similar images before training.

In [1]:
import os

PROJECT_PATH = os.getcwd().split('notebooks')[0]

IMAGES_DIR_PATHS = [
    os.path.join(PROJECT_PATH, 'data/', data_dir, sub_dir, 'images')
    for data_dir in ['boulder/', 'mountain/']
    for sub_dir in ['train/', 'valid/', 'test/']
]

IMAGE_PATHS = [
    os.path.join(dir_path, file_name)
    for dir_path in IMAGES_DIR_PATHS
    for file_name in os.listdir(dir_path)
    if file_name.endswith('.jpg')
]

assert len(IMAGE_PATHS) == 9427, f'Expected 9427 images, but found {len(IMAGE_PATHS)} images.'  # checked manually to get the number
len(IMAGE_PATHS), IMAGE_PATHS[:5]

(9427,
 ['/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/yellow_v2_cave_feb21_frame_sec_10_jpg.rf.d39c54db62f22b16fa6554ace0b92e98.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/img6_jpg.rf.46f24b188d525ec02e0cae1aee5d8acd.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/wall_img_16_jpg.rf.6cf49f5119aaa43993ae718e3056654b.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/yellow_v2_cave_feb22_frame_sec_002_jpg.rf.588c3fc9f1addb7eee0dbe17480007b1.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/34-3_jpg.rf.7c1029353ec68c88fe1d617affa199fb.jpg'])

**Duplicate detection method**

To detect duplicate images, we first compare the MD5 hash of each image file. MD5 is a widely used hashing algorithm that maps a file's bytes to a 128-bit digest. Byte-identical files always produce the same digest, and accidental collisions between different files are extremely unlikely, so matching digests reliably indicate exact duplicates. The method is also fast, which makes it well suited to large datasets.

To evaluate the effectiveness of hash-based deduplication, I manually selected images that look exactly the same but have different file names. I will then compare their hash values. This step will help determine if the hash method is sufficient for identifying all duplicates, or if some visually similar images require a different approach.

In [2]:
from hashlib import md5

def file_hash(file_path: str) -> str:
    """Calculate the MD5 hash of a file."""
    h = md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            h.update(chunk)
    return h.hexdigest()

[file_hash(os.path.join('/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/test/images/', file_name))
 for file_name in [
     'ArticleImageHandler_jpeg_jpg.rf.2fe6836468df96f3757cda82d79de585.jpg',
     'ArticleImageHandler_jpeg_jpg.rf.7f7778d73a1cf7151e20834b4e9c1cb2.jpg',
     'ArticleImageHandler_jpeg_jpg.rf.ae70d55cfee35c218ba3b99d2cc98646.jpg',
     'ArticleImageHandler_jpeg_jpg.rf.c63f4489daf9f9f511b7421d01db8e95.jpg'
    ]
]

['7df45aeed145811d58f7ed2e6ab20858',
 '7df45aeed145811d58f7ed2e6ab20858',
 'd89c1c20f13afc2518e141f11f90dd95',
 'd89c1c20f13afc2518e141f11f90dd95']

**Limitations of binary hash comparison**

After comparing four visually similar images, I noticed that they do not all share the same hash value: the four files fall into two pairs with two different digests. This happens because MD5 hashes the binary content of a file, not its visual appearance. Small differences in encoding or compression produce different hashes even when the images look almost identical.
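
To make the point concrete, here is a minimal sketch with made-up byte strings standing in for two encodings of the same picture (the payloads are illustrative, not real JPEG data). Flipping a single bit is enough to produce a different MD5 digest.

```python
from hashlib import md5

# Two hypothetical payloads that differ in a single bit, standing in for
# the same picture saved with slightly different encoder settings.
original = b'\xff\xd8' + bytes(range(64)) + b'\xff\xd9'
re_encoded = bytearray(original)
re_encoded[10] ^= 0x01  # flip one bit in one byte

digest_a = md5(original).hexdigest()
digest_b = md5(bytes(re_encoded)).hexdigest()

print(digest_a == digest_b)                   # False
print(md5(original).hexdigest() == digest_a)  # True: hashing is deterministic
```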

To address this, I will use a perceptual hash (`phash` from the `imagehash` library). A perceptual hash summarizes the visual features of an image rather than its exact binary content. This lets me compare images by visual similarity and improves the accuracy of the deduplication process.
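
As an intuition for how a perceptual hash works, the sketch below implements a simple average hash on tiny grayscale grids in pure Python. It is a deliberately simplified stand-in, not the DCT-based algorithm behind `phash`. The function names `average_hash` and `hamming_distance` and the toy pixel grids are assumptions made for illustration.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(int(value > mean) for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


# Toy 4x4 grayscale image (0 = black, 255 = white), bright on the right.
image = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# Same picture with a small uniform brightness shift, so the bytes differ.
noisy = [[value + 3 for value in row] for row in image]
# A different picture: the bright half is on the left instead of the right.
mirrored = [list(reversed(row)) for row in image]

print(hamming_distance(average_hash(image), average_hash(noisy)))     # 0
print(hamming_distance(average_hash(image), average_hash(mirrored)))  # 16
```

The shifted copy keeps the same hash because every pixel stays on the same side of the mean, whereas the mirrored picture differs in all 16 bits.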

In [3]:
from PIL.Image import open
from imagehash import phash

visually_similar = [phash(
    open(
        os.path.join('/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/test/images/', file_name)))
        for file_name in [
            'ArticleImageHandler_jpeg_jpg.rf.2fe6836468df96f3757cda82d79de585.jpg',
            'ArticleImageHandler_jpeg_jpg.rf.7f7778d73a1cf7151e20834b4e9c1cb2.jpg',
            'ArticleImageHandler_jpeg_jpg.rf.ae70d55cfee35c218ba3b99d2cc98646.jpg',
            'ArticleImageHandler_jpeg_jpg.rf.c63f4489daf9f9f511b7421d01db8e95.jpg'
        ]
]

not_visually_similar = [phash(
    open(
        os.path.join('/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/test/images/', file_name)))
        for file_name in [
            'AF1QipMHSzeW9a3RBngF-kjg-s9qVsoXgyhNGw6BYzV8-w406-h318-k-no_jpeg_jpg.rf.d7b7ca9416109dc17a31d7f55684a32f.jpg',
            'AF1QipN-svP3tQtZP4mILTpB1G0B3FFXGYrasBg6YHSi-s1024_jpeg_jpg.rf.1b8616ab82bb3bef2e73045336221457.jpg',
            'AF1QipP1IcXZ_ySbNmSOzJ24yxOjVLe5VAIsa8oNif9e-w406-h721-k-no_jpeg_jpg.rf.ba8e81ace008620c80842b1d73acfc70.jpg',
            'AF1QipP4tRtE_sz_okNuAp7_kKMf86QMxVzsr8GC1QS9-s512_jpeg_jpg.rf.49b0efdb7cd460f62cb4b04263e7d84d.jpg'
        ]
]

assert all(visually_similar[0] == img_hash for img_hash in visually_similar[1:]), 'Some images are not identified as visually similar, but they should be'
assert all(not_visually_similar[i] != not_visually_similar[j]
           for i in range(len(not_visually_similar))
           for j in range(i + 1, len(not_visually_similar))), 'Some images are identified as visually similar, but they should not be'

**Perceptual hash results**

The tests with the perceptual hash are promising. The four visually similar images all have the same hash value, while the four visually different images each have a unique hash. On this small sample, the perceptual hash captures visual similarity and still distinguishes between different images, so I apply it to the whole dataset next.

In [4]:
images_hash = {
    path: phash(open(path))
    for path in IMAGE_PATHS
}

assert len(images_hash) == 9427, f'Expected 9427 images, but found {len(images_hash)} images'  # checked manually to get the number
assert None not in images_hash.values(), 'Some images have no hash'

hash_to_paths = {
    img_hash: [p for p, h in images_hash.items() if h == img_hash]
    for img_hash in set(images_hash.values())
    if len([p for p, h in images_hash.items() if h == img_hash]) > 1  # only hashes corresponding to several images are retained
}

FILES_TO_REMOVE = [
    path
    for duplicated in hash_to_paths.values()  # browse each group of duplicated images
    for path in duplicated[1:]  # skip the first image of each group (the one to keep)
]

len(FILES_TO_REMOVE), FILES_TO_REMOVE[:5]

(1733,
 ['/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/valid/images/wall_img_18_jpg.rf.6b8aa3c81004f3e7cbf5c8fde179dee1.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/IMG_4274_JPG_jpg.rf.5ab2eb7a56fa6eb6ec06dac728a74d27.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/IMG_2520_jpeg_jpg.rf.1a7b9acfa489d1d5277c84c874e28f73.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/IMG_2520_jpeg_jpg.rf.3ccc7b202e82eb2122022e282cccfb2d.jpg',
  '/Users/alessandroarensberg/Documents/summit-seeker/data/boulder/train/images/IMG_2520_jpeg_jpg.rf.7caa1072e993b7db2d8d5559b73e09f8.jpg'])
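
The comprehension above rescans every image for each distinct hash, so its cost grows quadratically with the dataset size. A possible single-pass alternative is sketched here on made-up paths and string hashes. The names `group_duplicates`, `toy_hashes` and `files_to_remove` are illustrative and not part of the notebook. The same idea should work with any hashable hash value.

```python
from collections import defaultdict


def group_duplicates(path_to_hash):
    """Map each hash shared by several paths to the list of those paths."""
    groups = defaultdict(list)
    for path, img_hash in path_to_hash.items():
        groups[img_hash].append(path)
    # Keep only the hashes that correspond to more than one image.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


# Hypothetical path -> hash mapping standing in for `images_hash`.
toy_hashes = {
    'train/a.jpg': 'c3d1',
    'train/b.jpg': 'c3d1',
    'valid/c.jpg': '9f00',
    'test/d.jpg': 'c3d1',
}

duplicates = group_duplicates(toy_hashes)
# Keep the first path of each group and mark the others for removal.
files_to_remove = [path for paths in duplicates.values() for path in paths[1:]]
print(files_to_remove)  # ['train/b.jpg', 'test/d.jpg']
```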

**Image and label consistency check**

Before removing the duplicate files, we will check that the filenames of the images and their corresponding labels match correctly. After deleting the duplicate files, we will repeat this check to ensure that the associated labels have also been properly removed. This step helps maintain consistency between the images and their labels throughout the cleaning process.

In [5]:
def image_label_consistency(images_dir_paths: list, labels_dir_paths: list) -> tuple:
    """
    Check if image and label file names in all data directories match

    Args:
        images_dir_paths (list[str]): Paths to the directories containing images
        labels_dir_paths (list[str]): Paths to the directories containing labels, in the same order

    Returns:
        tuple: (True, None) if all image and label file names match,
            otherwise (False, list of directories where the file names do not match)
    """
    
    problematic_dirs = []

    for image_dir, label_dir in zip(images_dir_paths, labels_dir_paths):
        image_files = [file.split('.jpg')[0] for file in os.listdir(image_dir) if file.endswith('.jpg')]
        label_files = [file.split('.txt')[0] for file in os.listdir(label_dir) if file.endswith('.txt')]
        
        dir_name = image_dir.split('data')[1].split('/images')[0]
        
        if set(image_files) != set(label_files):
            problematic_dirs.append(dir_name)
            print(f'❌ {dir_name}')
        else:
            print(f'✅ {dir_name}')

    return (True, None) if not problematic_dirs else (False, problematic_dirs)

LABELS_DIR_PATHS = [
    os.path.join(PROJECT_PATH, 'data/', data_dir, sub_dir, 'labels')
    for data_dir in ['boulder/', 'mountain/']
    for sub_dir in ['train/', 'valid/', 'test/']
]

assert image_label_consistency(IMAGES_DIR_PATHS, LABELS_DIR_PATHS)[0], 'Some images and labels do not match'

✅ /boulder/train
✅ /boulder/valid
✅ /boulder/test
✅ /mountain/train
✅ /mountain/valid
✅ /mountain/test
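
When the check fails, it can help to know which files are unmatched rather than only which directory is affected. The helper below is a sketch of that idea. `unmatched_stems` is a hypothetical name, and the temporary directories exist only to make the example runnable.

```python
import tempfile
from pathlib import Path


def unmatched_stems(image_dir, label_dir):
    """Return (images without a label, labels without an image) as sorted lists."""
    image_stems = {p.stem for p in Path(image_dir).glob('*.jpg')}
    label_stems = {p.stem for p in Path(label_dir).glob('*.txt')}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


with tempfile.TemporaryDirectory() as root:
    images, labels = Path(root, 'images'), Path(root, 'labels')
    images.mkdir()
    labels.mkdir()
    # 'b' has no label and 'c' has no image.
    for name in ['a.jpg', 'b.jpg']:
        (images / name).touch()
    for name in ['a.txt', 'c.txt']:
        (labels / name).touch()
    result = unmatched_stems(images, labels)

print(result)  # (['b'], ['c'])
```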


In [None]:
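for image_path in FILES_TO_REMOVE:
    images_dir, image_name = os.path.split(image_path)
    # build the label path from the sibling 'labels' directory, so that only the directory and the extension change
    label_path = os.path.join(os.path.dirname(images_dir), 'labels', os.path.splitext(image_name)[0] + '.txt')
    os.remove(image_path)  # remove the image
    os.remove(label_path)  # remove the corresponding label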

assert image_label_consistency(IMAGES_DIR_PATHS, LABELS_DIR_PATHS)[0], 'Deleting the files led to inconsistencies between images and labels'

NEW_IMAGE_PATHS = [
    os.path.join(dir_path, file_name)
    for dir_path in IMAGES_DIR_PATHS
    for file_name in os.listdir(dir_path)
    if file_name.endswith('.jpg')
]

assert len(NEW_IMAGE_PATHS) == len(IMAGE_PATHS) - len(FILES_TO_REMOVE), f'Expected {len(IMAGE_PATHS) - len(FILES_TO_REMOVE)} images ; found {len(NEW_IMAGE_PATHS)}'
len(NEW_IMAGE_PATHS), NEW_IMAGE_PATHS[:5]

✅ /boulder/train
✅ /boulder/valid
✅ /boulder/test
✅ /mountain/train
✅ /mountain/valid
✅ /mountain/test


**Deduplication summary**

We started with 9427 images in the dataset. After removing duplicates, we have 7694 images left. This means we found and removed 1733 duplicate images. The dataset is now cleaner and more reliable for training.