# remove_duplicate_images.ipynb

This notebook removes duplicate and near-duplicate images from the dataset.
Duplicates and near-duplicates were created when multiple images were taken at a single location because the survey vehicle had stopped.

I followed the technique used in https://colab.research.google.com/github/voxel51/fiftyone-examples/blob/master/examples/image_deduplication.ipynb.

In [2]:
import fiftyone as fo
# import fiftyone.brain as fob
from icecream import ic
# from sklearn.metrics.pairwise import cosine_similarity
import fiftyone.zoo as foz
import pickle
import numpy as np
import os
from datetime import datetime

In [3]:
dataset = fo.load_dataset('dataset3')

In [25]:
# Launch app in browser
session = fo.launch_app(dataset, auto=False)
session

Session launched. Run `session.show()` to open the App in a cell output.


Dataset:          dataset3
Media type:       image
Num samples:      70888
Selected samples: 0
Selected labels:  0
Session URL:      http://localhost:5151/

In [4]:
print(dataset)

Name:        dataset3
Media type:  image
Num samples: 70888
Persistent:  True
Tags:        []
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    ground_truth:       fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    YOLOv8_predictions: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    uniqueness:         fiftyone.core.fields.FloatField
    max_similarity:     fiftyone.core.fields.FloatField
    datetime:           fiftyone.core.fields.DateTimeField
    embeddings:         fiftyone.core.fields.VectorField


## Generate Embeddings

Images store a lot of information in their pixel values. Comparing images pixel by pixel would be expensive and would give poor results, because small changes in framing or lighting alter many pixel values at once.

Instead, we can use a pretrained computer vision model to generate an embedding for each image. An embedding is an intermediate representation extracted from inside the model as the image passes through it: a vector of a few thousand values that distills the information stored in millions of pixels.

For deep learning models, one typically uses the output of a fully-connected layer near the end of the forward pass to generate embeddings.

The [FiftyOne Model Zoo](https://voxel51.com/docs/fiftyone/user_guide/model_zoo/index.html) contains a host of different pretrained models that we can use for this task. In this example, we will use a [MobileNet v2 model trained on ImageNet](https://voxel51.com/docs/fiftyone/user_guide/model_zoo/models.html#mobilenet-v2-imagenet-torch). This model performs relatively well and, most importantly, is lightweight, so it can process our dataset faster than heavier models.

Any off-the-shelf model will be informative, but one can easily experiment with other models that may be more useful for particular datasets.

We can easily load the model and compute embeddings on our dataset.

In [None]:
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")

In [None]:
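# When embeddings_field is given, compute_embeddings() stores the vectors on the
# samples and returns None, so read them back into an array afterwards.
dataset.compute_embeddings(model=model, embeddings_field='embeddings')
embeddings = np.array(dataset.values('embeddings'))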
dataset.save()
print(dataset)

In [None]:
ic(embeddings.shape)
# embeddings32 = embeddings.astype(np.float32)
# print(embeddings32[0][:5])

In [None]:
sorted_by_datetime_view = dataset.sort_by('datetime')
dataset.save_view('sorted_by_datatime_view', sorted_by_datetime_view)

In [7]:
dataset.list_saved_views()

['max_similarity_view', 'sorted_by_datatime_view']

In [23]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (norm(a) * norm(b))
 
# a = np.array([2,1,2,3,2,9])
# b = np.array([3,4,2,4,5,5])
# cosine_similarity(a, b)

In [34]:
sorted_by_datetime_view = dataset.load_saved_view('sorted_by_datatime_view')

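# An image is tagged as a near-duplicate when its cosine similarity to the
# previous image (in datetime order) exceeds this threshold.
thresh = 0.92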

first_sample = True
for sample in sorted_by_datetime_view:
    if first_sample:
        current_embeddings = sample.embeddings
        similarity = 0.0
        first_sample = False
    else:
        previous_embeddings = current_embeddings
        current_embeddings = sample.embeddings
        similarity = cosine_similarity(previous_embeddings, current_embeddings)
        sample['similarity_with_prev_img'] = similarity
    if similarity > thresh:
        sample.tags.append(f'similarity>{thresh}')
    else:
        sample.tags.append('similarity OK') 
    sample.save()
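
The loop above compares each image only with the one captured immediately before it, so a stopped vehicle produces a run of flagged frames. The following is a minimal sketch of that idea on plain Python lists, so it runs without the dataset. The names `cosine`, `flag_near_duplicates`, and `toy_embeddings` are hypothetical and exist only for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flag_near_duplicates(vectors, thresh):
    """Return one bool per vector: True if it is too similar to its predecessor."""
    flags = []
    previous = None
    for current in vectors:
        # The first vector has no predecessor, so its similarity is taken as 0.0
        similarity = 0.0 if previous is None else cosine(previous, current)
        flags.append(similarity > thresh)
        previous = current
    return flags

# Hypothetical 2-D "embeddings" in capture order
toy_embeddings = [
    [1.0, 0.0],   # first image: nothing to compare with
    [1.0, 0.01],  # almost the same direction as the previous one
    [0.0, 1.0],   # very different from the previous one
    [0.0, 1.0],   # identical to the previous one
]
flags = flag_near_duplicates(toy_embeddings, thresh=0.92)
print(flags)  # [False, True, False, True]
```

Because the later image of each similar pair is the one flagged, the first frame of every run of near-identical images is kept.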

## Remove samples (images) tagged with "similarity>0.92" and save in dataset4

In [None]:
# Clone current dataset
dataset4 = dataset.clone('dataset4', persistent=True)
fo.list_datasets()

['2024.11.14.08.24.25', 'dataset3', 'dataset4']

In [58]:
# Delete tagged samples (images) and save new dataset

from fiftyone import ViewField as F

dataset = fo.load_dataset('dataset4')
view = dataset.filter_field('tags', F().contains('similarity>0.92'))
dataset.delete_samples(view)
dataset.save()
dataset

Name:        dataset4
Media type:  image
Num samples: 59997
Persistent:  True
Tags:        []
Sample fields:
    id:                       fiftyone.core.fields.ObjectIdField
    filepath:                 fiftyone.core.fields.StringField
    tags:                     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:                 fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:               fiftyone.core.fields.DateTimeField
    last_modified_at:         fiftyone.core.fields.DateTimeField
    ground_truth:             fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    YOLOv8_predictions:       fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    uniqueness:               fiftyone.core.fields.FloatField
    max_similarity:           fiftyone.core.fields.FloatField
    datetime:                 fiftyone.core.fields.DateTimeField
    embeddings:           

In [59]:
# Check the saved dataset

fo.load_dataset('dataset4')

Name:        dataset4
Media type:  image
Num samples: 59997
Persistent:  True
Tags:        []
Sample fields:
    id:                       fiftyone.core.fields.ObjectIdField
    filepath:                 fiftyone.core.fields.StringField
    tags:                     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:                 fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:               fiftyone.core.fields.DateTimeField
    last_modified_at:         fiftyone.core.fields.DateTimeField
    ground_truth:             fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    YOLOv8_predictions:       fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    uniqueness:               fiftyone.core.fields.FloatField
    max_similarity:           fiftyone.core.fields.FloatField
    datetime:                 fiftyone.core.fields.DateTimeField
    embeddings:           

In [63]:
dataset.stats(include_media=True)

Computing metadata...
 100% |█████████████| 59997/59997 [14.8s elapsed, 0s remaining, 4.3K samples/s]      


{'samples_count': 59997,
 'samples_bytes': 353156633,
 'samples_size': '336.8MB',
 'media_bytes': 7188092459,
 'media_size': '6.7GB',
 'total_bytes': 7541249092,
 'total_size': '7.0GB'}

## Calculate Similarity

Now that we have significantly reduced the dimensionality of our images, we can use classical similarity algorithms to compute how similar every image embedding is to every other image embedding.

In this case, we will use the [cosine similarity provided by scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), since this measure is simple and works fairly well in high-dimensional spaces.
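
As a rough illustration of what a pairwise cosine similarity computes, the sketch below builds the N x N matrix with NumPy alone: each row is scaled to unit length, and the dot products of unit vectors are the cosines. The function name `pairwise_cosine` and the `toy` array are hypothetical, and the sketch assumes no row is all zeros.

```python
import numpy as np

def pairwise_cosine(vectors):
    """Return the N x N matrix of cosine similarities between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms   # scale every row to length 1 (assumes non-zero rows)
    return unit @ unit.T     # dot products of unit vectors are cosines

# Hypothetical toy "embeddings": three vectors in two dimensions
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sim = pairwise_cosine(toy)
print(np.round(sim, 3))
```

The result is symmetric, has 1 on the diagonal, and does not depend on the length of the vectors, only on their direction.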

In [None]:
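# The scikit-learn import at the top of the notebook is commented out, and the
# two-argument cosine_similarity(a, b) defined earlier cannot take a whole
# matrix, so import the pairwise version under a separate name.
from sklearn.metrics.pairwise import cosine_similarity as pairwise_cosine_similarity

similarity_matrix = pairwise_cosine_similarity(embeddings)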

print(similarity_matrix.shape)
print(similarity_matrix.dtype)
print(similarity_matrix)

As you can see, all diagonal values are 1 since every image is identical to itself. We can subtract the identity matrix (an N x N matrix with 1's on the diagonal and 0's elsewhere) to zero out the diagonal, so that those values don't show up when we look for each sample's maximum similarity.
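
To make the effect of the subtraction concrete, here is a small sketch on a hypothetical 3 x 3 similarity matrix. The values and the names `sim`, `off_diag`, and `nearest` are made up for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 similarity matrix; the diagonal holds self-similarity
sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
], dtype=np.float32)

n = len(sim)
off_diag = sim - np.identity(n, dtype=np.float32)  # diagonal becomes 0

max_similarity = off_diag.max(axis=1)  # best match score for each image
nearest = off_diag.argmax(axis=1)      # index of that best match
print(max_similarity, nearest)
```

Without the subtraction, every row's maximum would be its own diagonal entry. If similarities could be negative, `np.fill_diagonal(sim, -np.inf)` would be a safer way to exclude the diagonal, since a zeroed diagonal could then still be the row maximum.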

In [None]:
n = len(similarity_matrix)
similarity_matrix = similarity_matrix - np.identity(n, dtype=np.float32)

In [None]:
print(similarity_matrix.shape)
print(similarity_matrix.dtype)
print(similarity_matrix)

## Visualize and remove duplicates

We can now iterate through every sample and find which other samples are the most similar to it.

In [None]:
id_map = [s.id for s in dataset.select_fields(["id"])]

for idx, sample in enumerate(dataset):
    sample["max_similarity"] = similarity_matrix[idx].max()
    sample.save()

In [None]:
from fiftyone import ViewField as F

# Create a view
max_similarity_view = (
    dataset
    .select_fields("max_similarity")
    .sort_by(F("max_similarity"), reverse=True)
)

# Save the view
dataset.save_view("max_similarity_view", max_similarity_view)


In [None]:
dataset.list_saved_views()

In [None]:
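# Create the datetime field from the timestamp embedded in each filename
# (this field must exist before the dataset can be sorted by 'datetime')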
for sample in dataset:
    timestamp_str = os.path.basename(sample.filepath).replace('.jpg', '').replace('IMG', '').replace('_', '')
    dt = datetime.strptime(timestamp_str, '%Y%m%d%H%M%S')
    sample['datetime'] = dt
    sample.save()
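
The parsing step in the cell above can be tried on its own. This sketch assumes filenames follow a pattern like `IMG_YYYYMMDD_HHMMSS.jpg`, which is inferred from the string replacements and the format string. The helper name `parse_capture_time` and the example path are hypothetical.

```python
import os
from datetime import datetime

def parse_capture_time(filepath):
    """Parse a capture time from a name like IMG_YYYYMMDD_HHMMSS.jpg (assumed pattern)."""
    stem = (
        os.path.basename(filepath)
        .replace('.jpg', '')
        .replace('IMG', '')
        .replace('_', '')
    )
    return datetime.strptime(stem, '%Y%m%d%H%M%S')

# Hypothetical path, used only to illustrate the pattern
dt = parse_capture_time('/data/survey/IMG_20241114_082425.jpg')
print(dt)  # 2024-11-14 08:24:25
```

A filename that does not match the assumed pattern makes `strptime` raise a `ValueError`, which is a useful early warning before the values are written to the dataset.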