**Data Deduplication Rationale**

Because we collected images of climbing holds from multiple data sources covering the same climbing site, the dataset is likely to contain duplicate images. Duplicates can bias training by encouraging the model to overfit, distort the evaluation metrics (for example, when a copy lands in both the training and test splits), and reduce the model's ability to generalize to new data. To keep the model reliable, we add a deduplication step that removes identical or highly similar images before training.

In [None]:
import os

PROJECT_DIR = os.getcwd().split('notebooks')[0]

IMAGES_DIR = [
    os.path.join(PROJECT_DIR, 'data/', data_dir, sub_dir, 'images')
    for data_dir in ['block/', 'montagne/']
    for sub_dir in ['train/', 'valid/', 'test/']
]

images_path = [
    os.path.join(images_dir, image)
    for images_dir in IMAGES_DIR
    for image in os.listdir(images_dir)
    if image.endswith('.jpg')
]

assert len(images_path) == 9427, f'Expected 9427 images, but found {len(images_path)} images.'  # checked manually to get the number

**Duplicate Detection Method**

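To detect duplicate images, we compare the MD5 hash of each image file. MD5 maps a file's bytes to a 128-bit digest, so byte-identical files always produce the same hash, and an accidental collision between two different files is extremely unlikely. MD5 is not safe against deliberately crafted collisions, but that is not a concern for this use. The method is fast and well suited to finding exact duplicates in large datasets. It does not detect near-duplicates: a resized, re-encoded, or cropped copy has different bytes and therefore a different hash.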

In [None]:
import hashlib

def file_hash(path):
    """Compute the MD5 hash of a file."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            h.update(chunk)
    return h.hexdigest()
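
As a quick sanity check of this byte-level behaviour, the sketch below hashes three temporary files: two with identical bytes and one that differs by a single byte. The helper names `md5_of_file` and `write_temp` and the placeholder file contents are assumptions made for illustration; `md5_of_file` mirrors `file_hash` but is redefined so the snippet runs on its own.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=4096):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def write_temp(directory, name, payload):
    """Write payload to directory/name and return the full path."""
    path = os.path.join(directory, name)
    with open(path, 'wb') as f:
        f.write(payload)
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder bytes, not a real JPEG: only the raw bytes matter to MD5.
    payload = b'\xff\xd8' + b'fake image bytes' * 1000
    original = write_temp(tmp, 'a.jpg', payload)
    exact_copy = write_temp(tmp, 'b.jpg', payload)
    altered = write_temp(tmp, 'c.jpg', payload[:-1] + b'!')  # last byte changed

    same_digest = md5_of_file(original) == md5_of_file(exact_copy)
    changed_digest = md5_of_file(original) != md5_of_file(altered)

print(same_digest, changed_digest)
```

The exact copy matches and the altered file does not, so the same image saved at a different size or quality would not be flagged by this approach.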

In [None]:
def find_duplicates(folder):
    """Return (duplicate_path, first_seen_path) pairs for the .jpg files under folder."""
    hashes = {}      # MD5 digest -> first path seen with that digest
    duplicates = []  # (duplicate_path, first_seen_path) pairs

    for root, _, files in os.walk(folder):
        for fname in files:
            if fname.lower().endswith('.jpg'):
                fpath = os.path.join(root, fname)
                h = file_hash(fpath)
                if h in hashes:
                    duplicates.append((fpath, hashes[h]))
                else:
                    hashes[h] = fpath

    return duplicates

# Look for duplicates inside each images directory.
# Note: each directory is scanned on its own, so a copy shared between two
# directories (for example a train and a test split) is not reported here.
duplicates = {}
for data_dir in ['block/', 'montagne/']:
    for sub_dir in ['train/', 'valid/', 'test/']:
        dir_path = os.path.join(PROJECT_DIR, 'data/', data_dir, sub_dir, 'images')
        duplicates[dir_path] = find_duplicates(dir_path)

In [3]:
[os.path.join(PROJECT_DIR, 'data/', data_dir, sub_dir, 'images') for data_dir in ['block/', 'montagne/'] for sub_dir in ['train/', 'valid/', 'test/']]

['/Users/alessandroarensberg/Documents/summit-seeker/data/block/train/images',
 '/Users/alessandroarensberg/Documents/summit-seeker/data/block/valid/images',
 '/Users/alessandroarensberg/Documents/summit-seeker/data/block/test/images',
 '/Users/alessandroarensberg/Documents/summit-seeker/data/montagne/train/images',
 '/Users/alessandroarensberg/Documents/summit-seeker/data/montagne/valid/images',
 '/Users/alessandroarensberg/Documents/summit-seeker/data/montagne/test/images']
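
Since the duplication risk comes from several sources covering the same site, it may be worth hashing the six directories listed above together rather than one at a time, so that a copy shared between, say, `block/train` and `montagne/test` is also caught. The sketch below is one possible way to do that, not the notebook's own method; `group_duplicates` and the demo file names are hypothetical and chosen only for illustration.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def group_duplicates(paths):
    """Map each MD5 digest shared by two or more files to the sorted list of those files."""
    by_hash = defaultdict(list)
    for path in paths:
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(4096), b''):
                h.update(chunk)
        by_hash[h.hexdigest()].append(path)
    # Keep only digests that occur more than once.
    return {digest: sorted(group) for digest, group in by_hash.items() if len(group) > 1}


# Demo on a throwaway tree that imitates two sources sharing one image.
with tempfile.TemporaryDirectory() as tmp:
    files = {
        os.path.join('block', 'train', 'images', 'a.jpg'): b'hold-1',
        os.path.join('montagne', 'test', 'images', 'b.jpg'): b'hold-1',  # same bytes as a.jpg
        os.path.join('montagne', 'train', 'images', 'c.jpg'): b'hold-2',
    }
    paths = []
    for rel, payload in files.items():
        full = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, 'wb') as f:
            f.write(payload)
        paths.append(full)

    groups = group_duplicates(paths)
    duplicate_names = [[os.path.basename(p) for p in g] for g in groups.values()]

print(duplicate_names)
```

In the notebook, calling `group_duplicates(images_path)` would cover every image collected in the first cell. Each group can then be reviewed before deciding which copy to keep, which matters most when the copies sit in different splits.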