<a href="https://colab.research.google.com/github/basadhi/VPR_Image_Search/blob/main/VPR_image_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visual product recognition model develop

**Group 25**

## Introduction

The example demonstrates how to build an encoder neural network
model to search for images using sample image. The model is inspired by
the [CLIP](https://openai.com/blog/clip/)
approach, introduced by Alec Radford et al. The idea is to train a vision encoder to project the representation of images a same embedding
space, such that the image embeddings are located near the similar embeddings of the images they describe.

This example requires TensorFlow 2.4 or higher.
In addition, [TensorFlow Hub](https://www.tensorflow.org/hub)
 and [TensorFlow Addons](https://www.tensorflow.org/addons)
is required for the AdamW optimizer. These libraries can be installed using the
following command:

```python
pip install -q -U tensorflow-hub tensorflow-text tensorflow-addons
```

## Setup

In [1]:
import os
import collections
import json
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_hub as hub

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from tqdm import tqdm

# Suppressing tf.hub warnings
tf.get_logger().setLevel("ERROR")

## Prepare the data

We will use the [MS-COCO](https://cocodataset.org/#home) dataset to train our
dual encoder model. MS-COCO contains over 82,000 images, each of which has at least
5 different caption annotations. The dataset is usually used for
[image captioning](https://www.tensorflow.org/tutorials/text/image_captioning)
tasks, but we can repurpose the image-caption pairs to train our dual encoder
model for image search.

###
Download and extract the data

First, let's download the dataset, which consists of two compressed folders:
one with images, and the other—with associated image captions.
Note that the compressed images folder is 13GB in size.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# !gdown 1AqQVhmEkb8NvnoGodZdHydagpizMSWaU

In [3]:
%%capture
!unzip /content/drive/MyDrive/archive.zip -d /content/training-dataset/

In [None]:
!rm /content/archive.zip

rm: cannot remove '/content/archive.zip': No such file or directory


In [None]:
!mv "/content/training-dataset/train/train" "/content/"
!mv "/content/training-dataset/test/test" "/content/"

!rmdir "/content/training-dataset/train"
!rmdir "/content/training-dataset/test"

!mv "/content/train" "/content/training-dataset/"
!mv "/content/test" "/content/training-dataset/"

mv: cannot stat '/content/training-dataset/train/train': No such file or directory
mv: cannot stat '/content/training-dataset/test/test': No such file or directory
rmdir: failed to remove '/content/training-dataset/train': Directory not empty
rmdir: failed to remove '/content/training-dataset/test': Directory not empty
mv: cannot stat '/content/train': No such file or directory
mv: cannot stat '/content/test': No such file or directory


In [None]:
! gdown 1OpTvvc66Olx4P2oktK40emmJ99E0h0jF

Downloading...
From: https://drive.google.com/uc?id=1OpTvvc66Olx4P2oktK40emmJ99E0h0jF
To: /content/test-archive.zip
100% 441M/441M [00:07<00:00, 60.8MB/s]


In [None]:
%%capture
!unzip /content/test-archive.zip -d /content/testing-dataset

In [None]:
!rm /content/test-archive.zip

In [None]:
!mv "/content/testing-dataset/development_test_data/gallery" "/content/"
!mv "/content/testing-dataset/development_test_data/queries/" "/content/"
!mv "/content/testing-dataset/development_test_data/gallery.csv" "/content/"
!mv "/content/testing-dataset/development_test_data/queries.csv" "/content/"

!rmdir "/content/testing-dataset/development_test_data"

!mv "/content/gallery" "/content/testing-dataset/"
!mv "/content/queries" "/content/testing-dataset/"
!mv "/content/gallery.csv" "/content/testing-dataset/"
!mv "/content/queries.csv" "/content/testing-dataset/"

mv: cannot move '/content/testing-dataset/development_test_data/gallery' to '/content/gallery': Directory not empty
mv: cannot move '/content/testing-dataset/development_test_data/queries/' to '/content/queries': Directory not empty
rmdir: failed to remove '/content/testing-dataset/development_test_data': Directory not empty
mv: cannot move '/content/gallery' to '/content/testing-dataset/gallery': Directory not empty
mv: cannot move '/content/queries' to '/content/testing-dataset/queries': Directory not empty


In [None]:
import os

# Path to your image dataset directory
dataset_dir = '/content/training-dataset/train'

# List all image files in the dataset directory
image_files = [f for f in os.listdir(dataset_dir) if f.endswith('.jpg') or f.endswith('.png')]

# Count the number of image files
num_images = len(image_files)

print(f'Total number of images in the dataset: {num_images}')


Total number of images in the dataset: 141931


In [None]:
# import os
import shutil

# Path to your original image dataset directory
original_dataset_dir = '/content/training-dataset/train'

# Path to the directory where you want to save the smaller subsets
output_dir = '/content/training-dataset/NewTrain'

# Number of images in each subset
subset_size = 10000

# Create the output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# List all image files in the original dataset directory
image_files = [f for f in os.listdir(original_dataset_dir) if f.endswith('.jpg') or f.endswith('.png')]

# Loop through image files and create subsets
for i in range(0, len(image_files), subset_size):
    subset_images = image_files[i:i + subset_size]
    subset_dir = os.path.join(output_dir, f'subset_{i // subset_size}')
    os.makedirs(subset_dir, exist_ok=True)

    for image_file in subset_images:
        source_path = os.path.join(original_dataset_dir, image_file)
        target_path = os.path.join(subset_dir, image_file)
        shutil.copy(source_path, target_path)

print("Dataset has been broken into subsets.")


Dataset has been broken into subsets.


In [None]:


# Path to the folder containing images you want to delete
folder_to_delete = '/content/training-dataset/train'

# List all image files in the folder
image_files = [f for f in os.listdir(folder_to_delete) if f.endswith('.jpg') or f.endswith('.png')]

# Delete each image file
for image_file in image_files:
    image_path = os.path.join(folder_to_delete, image_file)
    os.remove(image_path)

print(f"All images in {folder_to_delete} have been deleted.")


All images in /content/training-dataset/train have been deleted.


### Process and save the data to TFRecord files

You can change the `sample_size` parameter to control many image-caption pairs
will be used for training the dual encoder model.
In this example we set `train_size` to 30,000 images,
which is about 35% of the dataset. We use 2 captions for each
image, thus producing 60,000 image-caption pairs. The size of the training set
affects the quality of the produced encoders, but more examples would lead to
longer training time.

In [None]:
# import os
# import tensorflow as tf

# Path to the image folder
image_folder = '/content/training-dataset/NewTrain/subset_0'

# Path to save TFRecord files
tfrecords_dir = '/content/training-dataset/Tfrecords'

# Create the TFRecords directory if it doesn't exist
if not os.path.exists(tfrecords_dir):
    os.makedirs(tfrecords_dir)

# List all image files in the folder
image_files = [f for f in os.listdir(image_folder) if f.endswith('.jpg') or f.endswith('.png')]

# Define function to read and preprocess images
def preprocess_image(image_path):
    # Load and preprocess your image (resize, normalize, etc.)
    # Return the preprocessed image data as bytes
    with open(image_path, 'rb') as f:
        image_data = f.read()
    return image_data

# Create and write images to TFRecord files
tfrecord_count = 0
images_per_tfrecord = 100  # Number of images per TFRecord file

for i in range(0, len(image_files), images_per_tfrecord):
    tfrecord_filename = os.path.join(tfrecords_dir, f'images_{tfrecord_count}.tfrecord')
    with tf.io.TFRecordWriter(tfrecord_filename) as writer:
        for image_file in image_files[i:i + images_per_tfrecord]:
            image_path = os.path.join(image_folder, image_file)
            image_data = preprocess_image(image_path)

            # Create a feature dictionary
            feature_dict = {
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_data]))
            }

            # Create an Example
            example = tf.train.Example(features=tf.train.Features(feature=feature_dict))

            # Serialize and write the Example to the TFRecord file
            writer.write(example.SerializeToString())

        print(f'TFRecord file {tfrecord_filename} created')
        tfrecord_count += 1


TFRecord file /content/training-dataset/Tfrecords/images_0.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_1.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_2.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_3.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_4.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_5.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_6.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_7.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_8.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_9.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_10.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_11.tfrecord created
TFRecord file /content/training-dataset/Tfrecords/images_12.tfrecord created
TFRecord 

### Create `tf.data.Dataset` for training and evaluation

In [None]:

feature_description = {
    "caption": tf.io.FixedLenFeature([], tf.string),
    "raw_image": tf.io.FixedLenFeature([], tf.string),
}


def read_example(example):
    features = tf.io.parse_single_example(example, feature_description)
    raw_image = features.pop("raw_image")
    features["image"] = tf.image.resize(
        tf.image.decode_jpeg(raw_image, channels=3), size=(299, 299)
    )
    return features


def get_dataset(file_pattern, batch_size):

    return (
        tf.data.TFRecordDataset(tf.data.Dataset.list_files(file_pattern))
        .map(
            read_example,
            num_parallel_calls=tf.data.AUTOTUNE,
            deterministic=False,
        )
        .shuffle(batch_size * 10)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
        .batch(batch_size)
    )


## Implement the projection head

The projection head is used to transform the image and the text embeddings to
the same embedding space with the same dimensionality.

In [None]:

def project_embeddings(
    embeddings, num_projection_layers, projection_dims, dropout_rate
):
    projected_embeddings = layers.Dense(units=projection_dims)(embeddings)
    for _ in range(num_projection_layers):
        x = tf.nn.gelu(projected_embeddings)
        x = layers.Dense(projection_dims)(x)
        x = layers.Dropout(dropout_rate)(x)
        x = layers.Add()([projected_embeddings, x])
        projected_embeddings = layers.LayerNormalization()(x)
    return projected_embeddings


## Implement the vision encoder

In this example, we use [Xception](https://keras.io/api/applications/xception/)
from [Keras Applications](https://keras.io/api/applications/) as the base for the
vision encoder.

Xception is a deep convolutional neural network architecture. It uses depthwise seperable convolutions. Unlike other traditional convolutionals that apply a single convolutional filter to all input channels, depthwise seperable convolutions apply seperate convolutional filters to each input channel before combining them. Xception model contains 71 convolutional layers.

In [None]:

def create_vision_encoder(
    num_projection_layers, projection_dims, dropout_rate, trainable=False
):
    # Load the pre-trained Xception model to be used as the base encoder.
    xception = keras.applications.Xception(
        include_top=False, weights="imagenet", pooling="avg"
    )
    # Set the trainability of the base encoder.
    for layer in xception.layers:
        layer.trainable = trainable
    # Receive the images as inputs.
    inputs = layers.Input(shape=(299, 299, 3), name="image_input")
    # Preprocess the input image.
    xception_input = tf.keras.applications.xception.preprocess_input(inputs)
    # Generate the embeddings for the images using the xception model.
    embeddings = xception(xception_input)
    # Project the embeddings produced by the model.
    outputs = project_embeddings(
        embeddings, num_projection_layers, projection_dims, dropout_rate
    )
    # Create the vision encoder model.
    return keras.Model(inputs, outputs, name="vision_encoder")


## Implement the text encoder

We use [BERT](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1)
from [TensorFlow Hub](https://tfhub.dev) as the text encoder

In [None]:

# def create_text_encoder(
#     num_projection_layers, projection_dims, dropout_rate, trainable=False
# ):
#     # Load the BERT preprocessing module.
#     preprocess = hub.KerasLayer(
#         "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2",
#         name="text_preprocessing",
#     )
#     # Load the pre-trained BERT model to be used as the base encoder.
#     bert = hub.KerasLayer(
#         "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1",
#         "bert",
#     )
#     # Set the trainability of the base encoder.
#     bert.trainable = trainable
#     # Receive the text as inputs.
#     inputs = layers.Input(shape=(), dtype=tf.string, name="text_input")
#     # Preprocess the text.
#     bert_inputs = preprocess(inputs)
#     # Generate embeddings for the preprocessed text using the BERT model.
#     embeddings = bert(bert_inputs)["pooled_output"]
#     # Project the embeddings produced by the model.
#     outputs = project_embeddings(
#         embeddings, num_projection_layers, projection_dims, dropout_rate
#     )
#     # Create the text encoder model.
#     return keras.Model(inputs, outputs, name="text_encoder")


## Implement the dual encoder

To calculate the loss, we compute the pairwise dot-product similarity between
each `caption_i` and `images_j` in the batch as the predictions.
The target similarity between `caption_i`  and `image_j` is computed as
the average of the (dot-product similarity between `caption_i` and `caption_j`)
and (the dot-product similarity between `image_i` and `image_j`).
Then, we use crossentropy to compute the loss between the targets and the predictions.

In [None]:

class Encoder(keras.Model):
    def __init__(self, text_encoder, image_encoder, temperature=1.0, **kwargs):
        super().__init__(**kwargs)
        self.image_encoder = image_encoder
        self.temperature = temperature
        self.loss_tracker = keras.metrics.Mean(name="loss")

    @property
    def metrics(self):
        return [self.loss_tracker]

    def call(self, features, training=False):
        # Place each encoder on a separate GPU (if available).
        # TF will fallback on available devices if there are fewer than 2 GPUs.
        # with tf.device("/gpu:0"):
        #     # Get the embeddings for the captions.
        #     caption_embeddings = text_encoder(features["caption"], training=training)
        with tf.device("/gpu:1"):
            # Get the embeddings for the images.
            image_embeddings = vision_encoder(features["image"], training=training)
        return image_embeddings

    def compute_loss(self, image_embeddings):
        # logits[i][j] is the dot_similarity(caption_i, image_j).
        # logits = (
        #     tf.matmul(image_embeddings,image_embeddings, transpose_b=True)
        #     / self.temperature
        # )
        # images_similarity[i][j] is the dot_similarity(image_i, image_j).
        images_similarity = tf.matmul(
            image_embeddings, image_embeddings, transpose_b=True
        )
        # captions_similarity[i][j] is the dot_similarity(caption_i, caption_j).
        # captions_similarity = tf.matmul(
        #     caption_embeddings, caption_embeddings, transpose_b=True
        # )
        # targets[i][j] = avarage dot_similarity(caption_i, caption_j) and dot_similarity(image_i, image_j).
        targets = keras.activations.softmax(
             images_similarity
        )
        # # Compute the loss for the captions using crossentropy
        # captions_loss = keras.losses.categorical_crossentropy(
        #     y_true=targets, y_pred=images_similarity, from_logits=True
        # )
        # Compute the loss for the images using crossentropy
        images_loss = keras.losses.categorical_crossentropy(
            y_true=tf.transpose(targets), y_pred=tf.transpose(images_embeddings), from_logits=True
        )
        # Return the mean of the loss over the batch.
        return images_loss

    def train_step(self, features):
        with tf.GradientTape() as tape:
            # Forward pass
            image_embeddings = self(features, training=True)
            loss = self.compute_loss(image_embeddings)
        # Backward pass
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        # Monitor loss
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}

    def test_step(self, features):
        image_embeddings = self(features, training=False)
        loss = self.compute_loss(image_embeddings)
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}


## Train the encoder model

In this experiment, we freeze the base encoders for text and images, and make only
the projection head trainable.

In [None]:
num_epochs = 5  # In practice, train for at least 30 epochs
batch_size = 256

vision_encoder = create_vision_encoder(
    num_projection_layers=1, projection_dims=256, dropout_rate=0.1
)
# text_encoder = create_text_encoder(
#     num_projection_layers=1, projection_dims=256, dropout_rate=0.1
# )
encoder = Encoder(vision_encoder, temperature=0.05)
encoder.compile(
    optimizer=tfa.optimizers.AdamW(learning_rate=0.001, weight_decay=0.001)
)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/xception/xception_weights_tf_dim_ordering_tf_kernels_notop.h5


TypeError: ignored

Note that training the model with 60,000 image-caption pairs, with a batch size of 256,
takes around 12 minutes per epoch using a V100 GPU accelerator. If 2 GPUs are available,
the epoch takes around 8 minutes.

In [None]:
print(f"Number of GPUs: {len(tf.config.list_physical_devices('GPU'))}")
# print(f"Number of examples (caption-image pairs): {train_example_count}")
print(f"Batch size: {batch_size}")
print(f"Steps per epoch: {int(np.ceil(train_example_count / batch_size))}")
train_dataset = get_dataset(os.path.join(tfrecords_dir, "train-*.tfrecord"), batch_size)
valid_dataset = get_dataset(os.path.join(tfrecords_dir, "valid-*.tfrecord"), batch_size)
# Create a learning rate scheduler callback.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.2, patience=3
)
# Create an early stopping callback.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
history = dual_encoder.fit(
    train_dataset,
    epochs=num_epochs,
    validation_data=valid_dataset,
    callbacks=[reduce_lr, early_stopping],
)
print("Training completed. Saving vision encoder...")
vision_encoder.save("vision_encoder")

print("Models are saved.")

Plotting the training loss:

In [None]:
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.legend(["train", "valid"], loc="upper right")
plt.show()

## Search for images using natural language queries

We can then retrieve images corresponding to natural language queries via
the following steps:

1. Generate embeddings for the images by feeding them into the `vision_encoder`.
2. Feed the natural language query to the `text_encoder` to generate a query embedding.
3. Compute the similarity between the query embedding and the image embeddings
in the index to retrieve the indices of the top matches.
4. Look up the paths of the top matching images to display them.

Note that, after training the `dual encoder`, only the fine-tuned `vision_encoder`
and `text_encoder` models will be used, while the `dual_encoder` model will be discarded.

### Generate embeddings for the images

We load the images and feed them into the `vision_encoder` to generate their embeddings.
In large scale systems, this step is performed using a parallel data processing framework,
such as [Apache Spark](https://spark.apache.org) or [Apache Beam](https://beam.apache.org).
Generating the image embeddings may take several minutes.

In [None]:
print("Loading vision and text encoders...")
vision_encoder = keras.models.load_model("vision_encoder")
print("Models are loaded.")


def read_image(image_path):
    image_array = tf.image.decode_jpeg(tf.io.read_file(image_path), channels=3)
    return tf.image.resize(image_array, (299, 299))


print(f"Generating embeddings for {len(image_paths)} images...")
image_embeddings = vision_encoder.predict(
    tf.data.Dataset.from_tensor_slices(image_paths).map(read_image).batch(batch_size),
    verbose=1,
)
print(f"Image embeddings shape: {image_embeddings.shape}.")

### Retrieve relevant images

In this example, we use exact matching by computing the dot product similarity
between the input query embedding and the image embeddings, and retrieve the top k
matches. However, *approximate* similarity matching, using frameworks like
[ScaNN](https://github.com/google-research/google-research/tree/master/scann),
[Annoy](https://github.com/spotify/annoy), or [Faiss](https://github.com/facebookresearch/faiss)
is preferred in real-time use cases to scale with a large number of images.

In [None]:

def find_matches(image_embeddings, img_queries, k=9, normalize=True):
    # Get the embedding for the query.
    query_embedding = text_encoder(tf.convert_to_tensor(queries))
    # Normalize the query and the image embeddings.
    if normalize:
        image_embeddings = tf.math.l2_normalize(image_embeddings, axis=1)
        img_query_embedding = tf.math.l2_normalize(img_query_embedding, axis=1)
    # Compute the dot product between the query and the image embeddings.
    dot_similarity = tf.matmul(img_query_embedding, image_embeddings, transpose_b=True)
    # Retrieve top k indices.
    results = tf.math.top_k(dot_similarity, k).indices.numpy()
    # Return matching image paths.
    return [[image_paths[idx] for idx in indices] for indices in results]


Set the `query` variable to the type of images you want to search for.
Try things like: 'a plate of healthy food',
'a woman wearing a hat is walking down a sidewalk',
'a bird sits near to the water', or 'wild animals are standing in a field'.

In [None]:
#here add image query path
# query =
matches = find_matches(image_embeddings, [query], normalize=True)[0]

plt.figure(figsize=(20, 20))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(mpimg.imread(matches[i]))
    plt.axis("off")


## Evaluate the retrieval quality

To evaluate the dual encoder model, we use the captions as queries.
We use the out-of-training-sample images and captions to evaluate the retrieval quality,
using top k accuracy. A true prediction is counted if, for a given caption, its associated image
is retrieved within the top k matches.

In [None]:

def compute_top_k_accuracy(image_paths, k=100):
    hits = 0
    num_batches = int(np.ceil(len(image_paths) / batch_size))
    for idx in tqdm(range(num_batches)):
        start_idx = idx * batch_size
        end_idx = start_idx + batch_size
        current_image_paths = image_paths[start_idx:end_idx]
        queries = [
            image_path_to_caption[image_path][0] for image_path in current_image_paths
        ]
        result = find_matches(image_embeddings, queries, k)
        hits += sum(
            [
                image_path in matches
                for (image_path, matches) in list(zip(current_image_paths, result))
            ]
        )

    return hits / len(image_paths)


print("Scoring training data...")
train_accuracy = compute_top_k_accuracy(train_image_paths)
print(f"Train accuracy: {round(train_accuracy * 100, 3)}%")

print("Scoring evaluation data...")
eval_accuracy = compute_top_k_accuracy(image_paths[train_size:])
print(f"Eval accuracy: {round(eval_accuracy * 100, 3)}%")


## Final remarks

You can obtain better results by increasing the size of the training sample,
train for more  epochs, explore other base encoders for images and text,
set the base encoders to be trainable, and tune the hyperparameters,
especially the `temperature` for the softmax in the loss computation.

Example available on HuggingFace

| Trained Model | Demo |
| :--: | :--: |
| [![Generic badge](https://img.shields.io/badge/%F0%9F%A4%97%20Model-nl%20image%20search-black.svg)](https://huggingface.co/keras-io/dual-encoder-image-search) | [![Generic badge](https://img.shields.io/badge/%F0%9F%A4%97%20Spaces-nl%20image%20search-black.svg)](https://huggingface.co/spaces/keras-io/dual-encoder-image-search) |
