# <center> <u>**Finding (overly) similar iNaturalist images with the VGG16 network** </u></center>



<div style="display: flex; justify-content: center; margin-top: 60px;">
    <div style="text-align: center; margin-right: 0px;">
        <img src="../datafiles/images/173740578.jpg" alt="173740578.jpg" style="width: 500px;">
        <p style="font-weight: ;">173740578.jpg</p>
    </div>
    <div style="text-align: center; margin-left: 0px;">
        <img src="../datafiles/images/173740578.jpg" alt="173740578.jpg" style="width: 500px;">
        <p style="font-weight: ;">173740578.jpg</p>
    </div>
</div>


# Context



* As seen in the previous notebook (./2_3_2_find_duplicates_with_vgg16.ipynb), it makes sense to compare only photos that share the same taxon_id and the same observation_id.

* We use the VGG16 network pretrained on ImageNet to extract features from the iNaturalist images.

* We then look at the cosine similarity between the features of the images.

Ideally, by the end of the experiment, we will have defined a threshold above which two images are considered similar.
The images that are too similar can then be removed.
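As a reminder of what the two metrics used in this notebook measure, here is a minimal NumPy sketch. The helper names `cosine_sim` and `euclidean_dist` are illustrative and not part of the notebook, which relies on Keras and TensorFlow for these computations.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Straight-line distance between two 1-D vectors (0.0 means identical)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

v = np.array([1.0, 2.0, 3.0])
same = cosine_sim(v, v)                        # identical vectors
scaled = cosine_sim(v, 10 * v)                 # same direction, different magnitude
orth = cosine_sim([1.0, 0.0], [0.0, 1.0])      # orthogonal vectors
```

Cosine similarity ignores the magnitude of the vectors, whereas Euclidean distance does not. Because the last VGG16 pooling layer follows ReLU activations, the extracted features are non-negative, so the cosine similarity between two images falls between 0 and 1.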

# 0. Test on 3 images

In [14]:
import cv2
import keras
from keras.applications.vgg16 import VGG16

feature_extractor = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

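def extract_features(img_path):
    # cv2.imread returns None (it does not raise) when the file cannot be read
    img = cv2.imread(img_path)
    if img is None:
        raise ValueError('could not read image: {}'.format(img_path))
    img = cv2.resize(img, (224, 224))
    img = img.reshape((1, 224, 224, 3))
    # caveat: cv2 loads BGR, while preprocess_input expects RGB input
    img = keras.applications.vgg16.preprocess_input(img)
    features = feature_extractor.predict(img)
    return features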

img_1 = '../datafiles/images/173740578.jpg'
img_2 = '../datafiles/images/173740328.jpg'
img_3 = '../datafiles/images/perimeter.png'

# 7 x 7 x 512 = 25088 features per image once flattened
features_1 = extract_features(img_1).flatten()
features_2 = extract_features(img_2).flatten()
features_3 = extract_features(img_3).flatten()



In [20]:
from keras.metrics import cosine_similarity
import tensorflow as tf

cs_1 = cosine_similarity(features_1, features_2).numpy()
cs_2 = cosine_similarity(features_1, features_3).numpy()

print('Cosine similarity between image 1 and image 2 (similar) : {:.2f}'.format(cs_1))
print('Cosine similarity between image 1 and image 3 (different) : {:.2f}'.format(cs_2))

print('\n')

de_1 = tf.norm(features_1-features_2,ord='euclidean').numpy()
de_2 = tf.norm(features_1-features_3,ord='euclidean').numpy()

print('Euclidean distance between image 1 and image 2 (similar) : {:.2f}'.format(de_1))
print('Euclidean distance between image 1 and image 3 (different) : {:.2f}'.format(de_2))


Cosine similarity between image 1 and image 2 (similar) : 1.00
Cosine similarity between image 1 and image 3 (different) : 0.10


Euclidean distance between image 1 and image 2 (similar) : 0.00
Euclidean distance between image 1 and image 3 (different) : 1854.70


# 1. Building and checking the pipeline on a test folder

In [1]:
import os 
import pandas as pd
import cv2 
import numpy as np
import matplotlib.pyplot as plt
import itertools

import tensorflow as tf
import keras
from keras.applications.vgg16 import VGG16
from keras.metrics import cosine_similarity

feature_extractor = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))


def extract_features(img_path):
    img = cv2.imread(img_path)

    try:
        img = cv2.resize(img, (224, 224))
        img = img.reshape((1, 224, 224, 3))
    except Exception:
        # unreadable or malformed image: fall back to an all-zero image
        # (two such images will then look identical to each other)
        print('exception occurred with {} while reshaping the picture'.format(img_path))
        img = np.zeros((1, 224, 224, 3))

    img = keras.applications.vgg16.preprocess_input(img)
    features = feature_extractor.predict(img, verbose=0)
    return features.flatten()


def compare_images(paths_to_compare):

    # compute features
    features_dict= {path:extract_features(path) for path in paths_to_compare}

    # create combinations list
    combinations = list(itertools.combinations(paths_to_compare, 2))

    # compute cosines
    cosines = {str(pair): cosine_similarity(features_dict[pair[0]],features_dict[pair[1]]).numpy().sum() for pair in combinations}

    # compute Euclidean distances
    euclidians = {str(pair): tf.norm(features_dict[pair[0]]-features_dict[pair[1]],ord='euclidean').numpy() for pair in combinations}

    # readable img names 
    img_names = {str(pair): pair[0].split('/')[-1]+' vs '+pair[1].split('/')[-1] for pair in combinations}

    # create dataframe
    df = pd.DataFrame({'compared_images': img_names,'cosine_similarity': cosines, 'euclidean_distance': euclidians})

    # sort by cosine similarity and euclidean distance
    df.sort_values(by=['cosine_similarity', 'euclidean_distance'], ascending=False, inplace=True)

    # reset index
    df.reset_index(drop=True, inplace=True)

    return df




In [38]:
paths_to_compare = os.listdir('../../data_bees_detection/test_duplicates')
paths_to_compare = [os.path.join('../../data_bees_detection/test_duplicates',path) for path in paths_to_compare]

df = compare_images(paths_to_compare)





In [24]:
df.describe()

Unnamed: 0,cosine_similarity,euclidean_distance
count,325.0,325.0
mean,0.181309,1992.003052
std,0.065332,212.014236
min,0.054448,1288.347412
25%,0.138802,1839.710327
50%,0.170699,1985.522949
75%,0.213472,2142.566162
max,0.523685,2566.611328


# 2. Importing and shaping the data

In [2]:
import pandas as pd
import os

df = pd.read_csv('../../data_bees_detection/inat_utils/inaturalist_2305_observation_uuid.csv')

# keep only rows whose observation_uuid appears more than once
df = df[df['observation_uuid'].duplicated(keep=False)]

df['path'] = df.apply(lambda x: os.path.join('../../data_bees_detection/whole_dataset/inaturalist_2305',x['genus_species'],str(x['photo_id'])+'.'+x['extension']), axis=1)


df.head(10)

Unnamed: 0,photo_id,genus_species,observation_uuid,extension,path
0,132248441,Hylaeus hyalinatus,e29b4186-381c-4dae-92e8-61438d2ab5ea,jpg,../../data_bees_detection/whole_dataset/inatur...
3,198353246,Hylaeus hyalinatus,c139e219-2436-4df6-8128-f83490a89d3f,jpg,../../data_bees_detection/whole_dataset/inatur...
4,212828135,Hylaeus hyalinatus,9abe51ae-d287-403d-a5e4-7fea0382d122,jpg,../../data_bees_detection/whole_dataset/inatur...
5,129674943,Hylaeus hyalinatus,bed9595c-1c70-49df-b8f9-9ca32b70e8cc,jpg,../../data_bees_detection/whole_dataset/inatur...
6,212828085,Hylaeus hyalinatus,9abe51ae-d287-403d-a5e4-7fea0382d122,jpg,../../data_bees_detection/whole_dataset/inatur...
7,198353300,Hylaeus hyalinatus,c139e219-2436-4df6-8128-f83490a89d3f,jpg,../../data_bees_detection/whole_dataset/inatur...
8,199335007,Hylaeus hyalinatus,ce01cc40-21a9-43b0-a112-12f867f030a1,jpeg,../../data_bees_detection/whole_dataset/inatur...
9,145388431,Hylaeus hyalinatus,2c57d7b2-c093-4373-8e4a-1ba3a83aea91,jpg,../../data_bees_detection/whole_dataset/inatur...
10,198351228,Hylaeus hyalinatus,576eb944-52dd-42d8-af63-4b1c6d2333b3,jpg,../../data_bees_detection/whole_dataset/inatur...
11,198351249,Hylaeus hyalinatus,576eb944-52dd-42d8-af63-4b1c6d2333b3,jpg,../../data_bees_detection/whole_dataset/inatur...


In [7]:
# count the feature vectors and pairwise comparisons to compute
uuids = df['observation_uuid'].unique()
print('number of uuids : {}'.format(len(uuids)))

photos_per_uuid = df['observation_uuid'].value_counts()
sum_photos = photos_per_uuid.sum()
print('number of feature vectors to compute : {}'.format(sum_photos))

# an observation with n photos yields n * (n - 1) / 2 pairs to compare
nb_comparisons_per_uuid = [n * (n - 1) // 2 for n in photos_per_uuid]
sum_comparisons = sum(nb_comparisons_per_uuid)
print('number of comparisons to make : {}'.format(sum_comparisons))




number of uuids : 42058
number of feature vectors to compute : 140172
number of comparisons to make : 245810


# 3. Defining a first threshold

## 3.1 Naive approach: display samples of similar images under different thresholds

In [None]:
observation_uuids = df['observation_uuid'].unique()

# take 1000 random uuids (np.random.choice samples with replacement by default)
observation_uuids = np.random.choice(observation_uuids,1000)

# get corresponding paths
paths = [df[df['observation_uuid'] == uuid]['path'] for uuid in observation_uuids]

# compare paths
dfs_comparisons = [compare_images(path_list) for path_list in paths]

final_df = pd.concat(dfs_comparisons)

final_df.to_csv('../datafiles/scrap_inat/duplicates.csv')

In [3]:
import cv2
import matplotlib.pyplot as plt

def save_duplicate(img_1_path,img_2_path,cosine_similarity,euclidean_distance,plot=False):
    # plot two images side by side, with their similarity scores as the title

    fig, axs = plt.subplots(1, 2, figsize=(10, 10))

    img_1 = cv2.imread(img_1_path, cv2.IMREAD_COLOR)
    img_2 = cv2.imread(img_2_path, cv2.IMREAD_COLOR)

    # cv2 loads images as BGR, matplotlib expects RGB
    img_1_rgb = cv2.cvtColor(img_1, cv2.COLOR_BGR2RGB)
    img_2_rgb = cv2.cvtColor(img_2, cv2.COLOR_BGR2RGB)

    axs[0].imshow(img_1_rgb)
    axs[1].imshow(img_2_rgb)

    img_1_name , img_2_name = img_1_path.split('/')[-1].split('.')[0], img_2_path.split('/')[-1].split('.')[0]

    axs[0].set_title('Image 1: {}'.format(img_1_name))
    axs[1].set_title('Image 2: {}'.format(img_2_name))

    fig.suptitle('Cosine similarity: {:.2f}, Euclidean distance: {:.2f}'.format(cosine_similarity, euclidean_distance))

    # save image
    plt.savefig('../datafiles/scrap_inat/duplicates/{}vs{}.png'.format(img_1_name,img_2_name))

    if plot:
        plt.show()

    # release the figure so memory does not grow over many calls
    plt.close(fig)


In [None]:
# iterate over the dataframe
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# RETRIEVE THE TAXON TO BE ABLE TO COMPUTE PER-SPECIES STATISTICS


# Load the data
df_duplicates = pd.read_csv('../datafiles/scrap_inat/duplicates.csv',index_col=0)

df_duplicates['image_1'] = df_duplicates['compared_images'].apply(lambda x: x.split(' ')[0])
df_duplicates['image_2'] = df_duplicates['compared_images'].apply(lambda x: x.split(' ')[-1])
df_duplicates['photo_id'] = df_duplicates['image_1'].apply(lambda x : x.split('.')[0]).astype(int)


# join with inat dataset
df_inat = pd.read_csv('../../data_bees_detection/inat_utils/inaturalist_2305.csv')

# left join on photo_id to retrieve the species of image_1
df_duplicates = df_duplicates.merge(df_inat,left_on=['photo_id'],right_on=['photo_id'],how ='left')

# drop compared_images column
df_duplicates.drop('compared_images',axis=1,inplace=True)

base_path = '../../data_bees_detection/whole_dataset/inaturalist_2305/'


# shuffle the dataframe
df_duplicates = df_duplicates.sample(frac=1)

# save the first 10 pairs of the shuffled dataframe
for index, row in df_duplicates.head(10).iterrows():

    img_1, img_2 = row['image_1'], row['image_2']
    label = row['genus_species']

    img_1_path , img_2_path = os.path.join(base_path,label,img_1), os.path.join(base_path,label,img_2)

    # named cos_sim / eucl_dist to avoid shadowing keras' cosine_similarity
    cos_sim, eucl_dist = row['cosine_similarity'], row['euclidean_distance']

    save_duplicate(img_1_path,img_2_path,cos_sim,eucl_dist,plot=False)

    print('Image 1: {}, Image 2: {}, Cosine similarity: {:.2f}, Euclidean distance: {:.2f}'.format(img_1_path,img_2_path,cos_sim, eucl_dist))


## 3.2 Statistical approach: influence of the threshold on the number of similar images

Predictions on part of the dataset.

In [None]:
import numpy as np


observation_uuids = df['observation_uuid'].unique()


# split uuids in 100 batches
observation_uuids_batches = np.array_split(observation_uuids, 100)

for i, batch in enumerate(observation_uuids_batches):

    # get corresponding paths
    paths = [df[df['observation_uuid'] == uuid]['path'] for uuid in batch]

    # compare paths
    dfs_comparisons = [compare_images(path_list) for path_list in paths]

    final_df = pd.concat(dfs_comparisons)
    final_df.to_csv('../datafiles/scrap_inat/to_merge/duplicates_{}.csv'.format(i))

(We intended to run this on the whole dataset, but it takes too long; the code above would be enough to predict on the whole dataset.)
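One possible way to reduce the cost of the comparison step is sketched below, under assumptions; it is not what the notebook runs. Once the feature vectors of one observation are stacked in a matrix, all pairwise cosine similarities come out of a single matrix product. The function name `pairwise_cosine` and the toy vectors are hypothetical.

```python
import itertools
import numpy as np

def pairwise_cosine(features):
    """Return {(i, j): cosine} for all i < j, given a (n_images, n_features) array."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # normalise each row to unit length; the clip avoids a division by zero
    unit = features / np.clip(norms, 1e-12, None)
    # entry (i, j) of this product is the cosine similarity of rows i and j
    sim = unit @ unit.T
    n = features.shape[0]
    return {(i, j): float(sim[i, j]) for i, j in itertools.combinations(range(n), 2)}

# toy stand-in for three flattened feature vectors
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
pairs = pairwise_cosine(toy)
```

This only speeds up the comparisons. The VGG16 forward passes are probably the larger share of the runtime and would have to be batched separately.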

In [8]:
# count number of already computed comparisons
import glob
import pandas as pd

path = '../datafiles/scrap_inat/to_merge/duplicates_*.csv'
all_files = glob.glob(path)

nb = 0
li = []

for filename in all_files:
    # named df_part so that the main `df` is not overwritten
    df_part = pd.read_csv(filename,index_col=0,header=0)
    nb += df_part.shape[0]
    li.append(df_part)


frame = pd.concat(li, axis=0, ignore_index=False)

frame.to_csv('../datafiles/scrap_inat/duplicates_0710.csv',index=False)
print('number of comparisons already computed : {}'.format(nb))

frame.describe()

number of comparisons already computed : 72404


Unnamed: 0,cosine_similarity,euclidean_distance
count,72404.0,72404.0
mean,0.341891,1599.952511
std,0.167263,456.821912
min,0.015183,0.0
25%,0.223082,1295.7259
50%,0.30329,1615.8384
75%,0.42053,1915.51905
max,1.0,3399.182


Retrieving the labels to get an idea of the impact of the threshold per species.

In [29]:

# Load the data
df_duplicates = pd.read_csv('../datafiles/scrap_inat/duplicates_0710.csv',index_col=None)

df_duplicates['image_1'] = df_duplicates['compared_images'].apply(lambda x: x.split(' ')[0])
df_duplicates['image_2'] = df_duplicates['compared_images'].apply(lambda x: x.split(' ')[-1])
df_duplicates['photo_id'] = df_duplicates['image_1'].apply(lambda x : x.split('.')[0]).astype(int)


# join with inat dataset
df_inat = pd.read_csv('../../data_bees_detection/inat_utils/inaturalist_2305.csv')

# left join on photo_id to retrieve the species of image_1
df_duplicates = df_duplicates.merge(df_inat,left_on=['photo_id'],right_on=['photo_id'],how ='left')

# drop the helper columns that are no longer needed
df_duplicates.drop(['image_1','image_2','photo_id','compared_images'],inplace=True,axis=1)

df_duplicates.groupby(['genus_species']).quantile(0.5).describe()

Unnamed: 0,cosine_similarity,euclidean_distance
count,80.0,80.0
mean,0.318352,1631.177934
std,0.096663,279.714303
min,0.123936,799.45264
25%,0.265161,1516.869688
50%,0.310879,1617.960725
75%,0.346382,1800.772725
max,0.667839,2336.42165


In [21]:
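# 9 evenly spaced thresholds: 0.2, 0.3, ..., 1.0
# (the third argument of np.linspace is a number of points, not a step)
cosine_thresholds = np.linspace(0.2,1,9)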



Unnamed: 0,cosine_similarity,euclidean_distance,genus_species
0,0.286500,1741.91280,Anthidium manicatum
1,0.448318,1731.69340,Anthidium manicatum
2,0.336898,2034.01640,Anthidium manicatum
3,0.308329,1750.44790,Anthidium manicatum
4,0.269743,1404.17700,Anthidium manicatum
...,...,...,...
72399,0.935105,571.19135,Osmia bicornis
72400,0.552981,1645.89990,Osmia bicornis
72401,0.217821,2136.90230,Osmia bicornis
72402,0.502273,1210.88850,Osmia bicornis
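The threshold sweep that this section builds towards could look like the sketch below. It assumes an array of cosine similarities such as the `cosine_similarity` column shown above; the toy values and the helper name `count_pairs_above` are made up for illustration and are not results from the dataset.

```python
import numpy as np

def count_pairs_above(cosines, thresholds):
    """Map each threshold to the number of pairs whose cosine similarity exceeds it."""
    cosines = np.asarray(cosines, dtype=float)
    return {round(float(t), 2): int((cosines > t).sum()) for t in thresholds}

# hypothetical similarities for six image pairs
toy_cosines = [0.15, 0.29, 0.45, 0.55, 0.94, 1.0]

# 9 evenly spaced thresholds from 0.2 to 1.0
thresholds = np.linspace(0.2, 1.0, 9)
counts = count_pairs_above(toy_cosines, thresholds)

for threshold, nb_pairs in counts.items():
    print('threshold {:.1f} : {} pairs flagged as similar'.format(threshold, nb_pairs))
```

Applied per species (for instance after a `groupby` on `genus_species`), the same count would show which species lose the most images at a given threshold.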


## As a first approach: keep only the pairs whose cosine similarity is above 0.4, to lighten the load

# 4. Filtering with this first threshold

## 4.1 Final code

In [3]:
import os 
import pandas as pd
import cv2 
import numpy as np
import matplotlib.pyplot as plt
import itertools

import tensorflow as tf
import keras
from keras.applications.vgg16 import VGG16
from keras.metrics import cosine_similarity

feature_extractor = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))


def extract_features(img_path):
    img = cv2.imread(img_path)

    try:
        img = cv2.resize(img, (224, 224))
        img = img.reshape((1, 224, 224, 3))
    except Exception:
        # unreadable or malformed image: fall back to an all-zero image
        # (two such images will then look identical to each other)
        print('exception occurred with {} while reshaping the picture'.format(img_path))
        img = np.zeros((1, 224, 224, 3))

    img = keras.applications.vgg16.preprocess_input(img)
    features = feature_extractor.predict(img, verbose=0)
    return features.flatten()


def compare_images(paths_to_compare,cosine_threshold =0.4):

    # compute features
    features_dict= {path:extract_features(path) for path in paths_to_compare}

    # create combinations list
    combinations = list(itertools.combinations(paths_to_compare, 2))

    # compute cosines
    cosines = {str(pair): cosine_similarity(features_dict[pair[0]],features_dict[pair[1]]).numpy().sum() for pair in combinations}

    # compute Euclidean distances
    euclidians = {str(pair): tf.norm(features_dict[pair[0]]-features_dict[pair[1]],ord='euclidean').numpy() for pair in combinations}

    # readable img names 
    img_names = {str(pair): pair[0].split('/')[-1]+' vs '+pair[1].split('/')[-1] for pair in combinations}

    # create dataframe
    df = pd.DataFrame({'compared_images': img_names,'cosine_similarity': cosines, 'euclidean_distance': euclidians})

    # filter dataframe
    df_filtered = df[df['cosine_similarity']>cosine_threshold].copy()

    # sort by cosine similarity and euclidean distance
    df_filtered.sort_values(by=['cosine_similarity', 'euclidean_distance'], ascending=False, inplace=True)

    # reset index
    df_filtered.reset_index(drop=True, inplace=True)

    return df_filtered




## 4.2 Importing the data

In [6]:
import pandas as pd
import os

df = pd.read_csv('../../data_bees_detection/inat_utils/inaturalist_2305_observation_uuid.csv')

# keep only rows whose observation_uuid appears more than once
df = df[df['observation_uuid'].duplicated(keep=False)]

df['path'] = df.apply(lambda x: os.path.join('../../data_bees_detection/whole_dataset/inaturalist_2305',x['genus_species'],str(x['photo_id'])+'.'+x['extension']), axis=1)

# skip the photos that have already been predicted
df_predicted = pd.read_csv('../datafiles/scrap_inat/biggest_duplicates.csv')
df_predicted['image_1'] = df_predicted['compared_images'].apply(lambda x: x.split(' ')[0].split('.')[0]).astype(int)
df_predicted['image_2'] = df_predicted['compared_images'].apply(lambda x: x.split(' ')[-1].split('.')[0]).astype(int)



In [7]:
# Boolean mask: True when photo_id already appears in a predicted pair, as image_1 or image_2
# (both conditions are combined; filtering twice from df would discard the first filter)
mask = df['photo_id'].isin(df_predicted['image_1']) | df['photo_id'].isin(df_predicted['image_2'])

# Keep only the photos that have not been predicted yet
df_filtered = df[~mask]

# keep only rows whose observation_uuid still appears more than once
df_filtered = df_filtered[df_filtered['observation_uuid'].duplicated(keep=False)]


In [8]:
df_filtered

Unnamed: 0,photo_id,genus_species,observation_uuid,extension,path
43217,239605746,Anthidium manicatum,21086320-9536-42f4-9456-6982cd4b5fe7,jpeg,../../data_bees_detection/whole_dataset/inatur...
43218,237055751,Anthidium manicatum,7ff53329-be36-4b12-8d02-85c9fc6b47be,jpg,../../data_bees_detection/whole_dataset/inatur...
43221,4239494,Anthidium manicatum,84039afc-e2ea-45e3-a35f-7a597828ac3d,jpg,../../data_bees_detection/whole_dataset/inatur...
43223,207758917,Anthidium manicatum,22aaa8d1-32ea-4bc0-8225-abd54f2b761a,jpg,../../data_bees_detection/whole_dataset/inatur...
43225,20476871,Anthidium manicatum,633d3f3d-cfbb-44d0-ae3f-58c075218f0d,jpeg,../../data_bees_detection/whole_dataset/inatur...
...,...,...,...,...,...
178755,145103301,Megachile albisecta,dfb73991-fd60-4dab-b847-4a191670d785,jpeg,../../data_bees_detection/whole_dataset/inatur...
178758,89133683,Megachile albisecta,2596ad92-b4f1-43bf-a292-293d07deea3d,jpg,../../data_bees_detection/whole_dataset/inatur...
178765,52581492,Megachile albisecta,3eca66fd-06b0-4e3e-a41a-a1aa3dab24d2,jpeg,../../data_bees_detection/whole_dataset/inatur...
178774,89133727,Megachile albisecta,2596ad92-b4f1-43bf-a292-293d07deea3d,jpg,../../data_bees_detection/whole_dataset/inatur...


In [None]:
import numpy as np

df = df_filtered.copy()


observation_uuids = df['observation_uuid'].unique()

# split uuids in 50 batches
observation_uuids_batches = np.array_split(observation_uuids,50)

# shuffle batches
np.random.shuffle(observation_uuids_batches)



for i, batch in enumerate(observation_uuids_batches):

    # get corresponding paths
    paths = [df[df['observation_uuid'] == uuid]['path'] for uuid in batch]

    # compare paths
    dfs_comparisons = [compare_images(path_list,cosine_threshold=0.5) for path_list in paths]
    
    final_df = pd.concat(dfs_comparisons)
    final_df.to_csv('../datafiles/scrap_inat/to_merge_11/duplicates_{}.csv'.format(i))

    print('batch {} done'.format(i))

  

# 5. Filtering the whole dataset (not implemented)

The threshold is set to 0.5, i.e. only the image pairs whose cosine similarity is above 0.5 are retained as duplicates.

Arbitrarily, of two such images we keep the first.

We test a fine-mesh filter: only one copy of a series of duplicates is kept.



In [None]:
df_duplicates = pd.read_csv('../datafiles/scrap_inat/final_duplicates',index_col=0)

df_duplicates['image_1'] = df_duplicates['compared_images'].apply(lambda x: x.split(' ')[0])
df_duplicates['image_2'] = df_duplicates['compared_images'].apply(lambda x: x.split(' ')[-1])


# keep only comparisons with cosine similarity > 0.5
df_duplicates = df_duplicates[df_duplicates['cosine_similarity']>0.5]


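# arbitrarily, the second image of each duplicate pair is removed
to_remove = df_duplicates['image_2']

# keep the first images that are not themselves flagged for removal
to_keep = df_duplicates[~df_duplicates['image_1'].isin(to_remove)]['image_1']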
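The "one copy per series of duplicates" idea can be made precise by grouping images into connected components: if A matches B and B matches C, then A, B and C form one series and a single representative is kept. Below is a minimal union-find sketch of that grouping. The function name `keep_one_per_group` and the sample file names are hypothetical, and which image ends up as the representative depends on the order of the pairs, an arbitrary choice like the one stated above.

```python
def keep_one_per_group(duplicate_pairs):
    """Given (image_1, image_2) duplicate pairs, return (to_keep, to_remove)."""
    parent = {}

    def find(x):
        # return the representative of x's group, registering x if needed
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # merge the two groups; the left-hand root stays representative
            parent[root_b] = root_a

    to_keep, to_remove = [], []
    for img in parent:  # dicts preserve insertion order
        if find(img) == img:
            to_keep.append(img)
        else:
            to_remove.append(img)
    return to_keep, to_remove

# hypothetical pairs already filtered on cosine similarity > 0.5
pairs = [('a.jpg', 'b.jpg'), ('b.jpg', 'c.jpg'), ('d.jpg', 'e.jpg')]
to_keep, to_remove = keep_one_per_group(pairs)
```

Unlike the column-based filter above, this also merges two images that are each similar to a common third image without having been flagged as similar to each other directly.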





