**Purpose of this notebook**

This notebook shows how machine learning can be used to measure the similarity between images.
In particular, we would like to distinguish several types of relationship between images:
1. exact duplicate
1. near-duplicate
1. similar
1. different

![alt text](categories_similarity_openfoodfact.jpg "Title")

**Proposal**  
Use machine learning to represent each image in a new space where distance is correlated with similarity.

**Hypothesis**  
Deep learning with neural networks (NN) is expected to learn patterns from the training dataset that help discriminate between its instances. With a trained neural network, it should therefore be possible to represent each picture as an embedding that `would be easier to discriminate`, or that allows us to build `a metric of similarity`.
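
As a minimal sketch of this idea, the snippet below compares toy embedding vectors with the Euclidean distance. The vectors and the names `embeddings` and `pairwise_distances` are made up for illustration; real ViT embeddings have hundreds of dimensions, and `torch.cdist` computes the same kind of distance matrix in one call.

```python
import math

# Toy 3-dimensional embeddings (made-up values, for illustration only).
embeddings = {
    "img_a": [0.10, 0.90, 0.30],
    "img_a_copy": [0.10, 0.90, 0.30],  # exact duplicate of img_a
    "img_a_crop": [0.12, 0.88, 0.33],  # slightly altered version of img_a
    "img_b": [0.80, 0.10, 0.70],       # unrelated image
}


def pairwise_distances(vectors):
    """Return {(name_i, name_j): euclidean distance} for every unordered pair."""
    names = list(vectors)
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


distances = pairwise_distances(embeddings)
# Print the pairs from closest to farthest.
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(dist, 3))
```

If the hypothesis holds, the duplicate pair sits at distance zero, the altered copy stays close, and the unrelated image ends up far away.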

**Protocol**  
1. load images
1. download an already trained NN for images
1. use the backbone of the model to generate the embeddings of the images (more precisely, to transform the pixels of each image into another representation called an embedding). We therefore work under the following hypothesis: `the euclidean distance in the embedding space` is correlated with the `similarity`.
1. for each product, look at the distances between images and tag some pairs as `exact_duplicate`, `near_duplicate`, `very_similar` or `different`
1. build a small model that determines the optimal threshold.

\*: if you do not understand something, be curious :)

# Protocol

## Load images

In [1]:
from pathlib import Path
from datasets import Dataset, Image, load_dataset

In [2]:
# to clean data if necessary
# import os
# path = Path('../data/images').resolve()
# for dir in os.listdir(path):
#     for file in os.listdir(path / dir):
#         if 'front.' in file:
#             os.remove(path / f'{dir}/{file}')
#         if 'ingredients.' in file:
#             os.remove(path / f'{dir}/{file}')
#         if 'nutrition.' in file:
#             os.remove(path / f'{dir}/{file}')
#         if 'packaging' in file:
#             os.remove(path / f'{dir}/{file}')
#         if 'other' in file:
#             os.remove(path / f'{dir}/{file}')

In [3]:
images = load_dataset("imagefolder", data_dir="../data/images")
images = images['train'].cast_column('image', Image(decode=True)) # all images are in train

Resolving data files:   0%|          | 0/25141 [00:00<?, ?it/s]

Found cached dataset imagefolder (/home/machine_learning/.cache/huggingface/datasets/imagefolder/default-b8cf0324ec202c2e/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)


  0%|          | 0/1 [00:00<?, ?it/s]

## Load models and produce embeddings

In [4]:
from transformers import AutoFeatureExtractor, AutoModel
import torch
import torchvision.transforms as T

In [5]:
model_ckpt = "nateraw/vit-base-beans"
extractor = AutoFeatureExtractor.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

Some weights of the model checkpoint at nateraw/vit-base-beans were not used when initializing ViTModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing ViTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ViTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ViTModel were not initialized from the model checkpoint at nateraw/vit-base-beans and are newly initialized: ['vit.pooler.dense.weight', 'vit.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
candidate_subset = images.filter(lambda x: x['label'] == 1)
# candidate_subset

Loading cached processed dataset at /home/machine_learning/.cache/huggingface/datasets/imagefolder/default-b8cf0324ec202c2e/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f/cache-05395bdf6f5cc8b4.arrow


In [7]:
# Data transformation chain.
transformation_chain = T.Compose(
    [
        # Resize the shorter side to 256 px (for a 224 px model input), then take a center crop.
        T.Resize(int((256 / 224) * extractor.size["height"])),
        T.CenterCrop(extractor.size["height"]),
        T.ToTensor(),
        T.Normalize(mean=extractor.image_mean, std=extractor.image_std),
    ]
)


def extract_embeddings(model: torch.nn.Module):
    """Utility to compute embeddings."""
    device = model.device

    def pp(batch):
        images = batch["image"]
        image_batch_transformed = torch.stack(
            [transformation_chain(image) for image in images]
        )
        new_batch = {"pixel_values": image_batch_transformed.to(device)}
        with torch.no_grad():
            embeddings = model(**new_batch).last_hidden_state[:, 0].cpu()
        return {"embeddings": embeddings}
    return pp

In [9]:
# Here, we map embedding extraction utility on our subset of candidate images.
batch_size = 1
device = "cuda" if torch.cuda.is_available() else "cpu"
extract_fn = extract_embeddings(model.to(device))

In [81]:
extract_fn(candidate_subset)

RuntimeError: output with shape [1, 224, 224] doesn't match the broadcast shape [3, 224, 224]

This error most likely comes from `T.Normalize`: at least one image in the subset is grayscale, so `T.ToTensor()` yields a single-channel tensor of shape `[1, 224, 224]`, which cannot be normalized with the three-channel `image_mean` and `image_std`. Converting each image to RGB before applying the transformation chain avoids the mismatch.

In [None]:
def extract_embeddings(model: torch.nn.Module):
    """Utility to compute embeddings."""
    device = model.device

    def pp(batch):
        images = batch["image"]
        # `transformation_chain` is the composition of preprocessing
        # transformations applied to the input images to prepare them
        # for the model. Each image is first converted to RGB so that
        # grayscale images also yield three channels, which `T.Normalize`
        # requires.
        image_batch_transformed = torch.stack(
            [transformation_chain(image.convert("RGB")) for image in images]
        )
        new_batch = {"pixel_values": image_batch_transformed.to(device)}
        with torch.no_grad():
            # Keep the hidden state of the first ([CLS]) token as the image embedding.
            embeddings = model(**new_batch).last_hidden_state[:, 0].cpu()
        return {"embeddings": embeddings}

    return pp

## Distance / similarity to determine threshold
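
One possible way to approach this step, sketched under assumptions: given the distances between tagged pairs, choose the cut-off that best separates duplicates from the rest. The distances, the tags and the helper `best_threshold` below are hypothetical; the real values would come from the embeddings and the manual tagging described in the protocol.

```python
# Hypothetical tagged pairs: (euclidean distance, is_duplicate).
# is_duplicate is True for pairs tagged `exact_duplicate` or `near_duplicate`.
tagged_pairs = [
    (0.0, True),
    (0.4, True),
    (0.9, True),
    (1.6, False),
    (2.3, False),
    (5.1, False),
]


def accuracy(threshold, pairs):
    """Fraction of pairs classified correctly when `distance <= threshold` means duplicate."""
    correct = sum((dist <= threshold) == label for dist, label in pairs)
    return correct / len(pairs)


def best_threshold(pairs):
    """Try the midpoint between consecutive sorted distances and keep the most accurate one."""
    dists = sorted(dist for dist, _ in pairs)
    candidates = [(low + high) / 2 for low, high in zip(dists, dists[1:])]
    return max(candidates, key=lambda t: accuracy(t, pairs))


threshold = best_threshold(tagged_pairs)
print(threshold, accuracy(threshold, tagged_pairs))
```

The same search could be repeated for each boundary between the four categories (`exact_duplicate`, `near_duplicate`, `very_similar`, `different`), giving one threshold per boundary.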

## Evaluate
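
A sketch of how a chosen threshold could be evaluated on held-out tagged pairs, using precision and recall for the duplicate class. The pairs and the threshold value are made up for illustration, and `precision_recall` is a hypothetical helper, not part of any library used above.

```python
def precision_recall(threshold, pairs):
    """Precision and recall of the rule `distance <= threshold` means duplicate."""
    tp = sum(1 for dist, label in pairs if dist <= threshold and label)
    fp = sum(1 for dist, label in pairs if dist <= threshold and not label)
    fn = sum(1 for dist, label in pairs if dist > threshold and label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical held-out pairs: (euclidean distance, is_duplicate).
held_out_pairs = [(0.1, True), (0.7, True), (1.1, False), (1.4, True), (3.0, False)]

precision, recall = precision_recall(1.25, held_out_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision would mean distinct images are being flagged as duplicates, while low recall would mean real duplicates are being missed; which matters more depends on how the duplicates are handled afterwards.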