# Deep Learning Assignment 2023
## From words to bounding boxes: exploring visual grounding using CLIP

|     #    |                 |                                 @ |
|:--------:|-----------------|----------------------------------:|
| `238746` | Luca Mosetti    | luca.mosetti-1@studenti.unitn.it  |
| `240074` | Stefano Genetti | stefano.genetti@studenti.unitn.it |


In [None]:
%%shell
if ! [ -d assets ]; then
  gdown -q 1WTQNojr6KvWbzowuqfBDj5z4CYd03eeQ &&
  tar --warning=no-unknown-keyword -xf assets.tar &&
  rm assets.tar
fi

## 1 Abstract

Visual grounding links language and perception by grounding linguistic symbols in the visual world. In this work we address the problem usually referred to in the literature as *Referring expression comprehension* (REC): the goal is to localize the target object in an image that is described by a referring expression phrased in natural language. To accomplish this task we rely on the pre-trained CLIP (*Contrastive Language-Image Pre-training*) [2] model as a starting point for transfer learning. The capabilities of this foundation model make it a natural basis for a joint embedding approach to the problem. In this report we provide an overview of the strategies we adopted to fine-tune CLIP for this task. We evaluated the proposed models on the commonly used RefCOCOg dataset [3]. In addition, we contribute three instances of the dataset enriched with the bounding boxes proposed by well-known object detection algorithms. As explained later in this report, this solution considerably speeds up the training procedure. We provide these datasets, together with the code to generate them, at the following GitHub repository: https://github.com/StefanoGenettiUniTN/refcocog-augmentation. Throughout the notebook, text cells alternate with code cells containing the implementation.

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/01.png)

**Figure 1**

*two woman one in black eatting and the other has a white shirt at the desk*

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/02.png)

**Figure 2**

*a brown bull in front of feeding tub*

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/03.png)

**Figure 3**

*green color vegetable in between potato and carrot*

## 2 Introduction

Language and vision are closely related in daily life. We naturally use verbal descriptions in our conversations to refer to the objects in a given scene. Although this is straightforward for humans, referring expression comprehension remains challenging for a software agent, which has to bridge computer vision and natural language processing in order to achieve a comprehensive understanding of complex language semantics and various types of visual information. The problem has been receiving increasing attention from both academia and industry due to its great potential in vision-and-language navigation [1] and natural human-computer interaction. The aim of our work is to train a model which takes as input an image and a natural language prompt, and outputs a single bounding box corresponding to the entity referred to in the textual description (Figure 4).

\begin{equation}
    f: I \times P \rightarrow B
\end{equation}

According to the literature, the most common methods for this task encode image regions and expressions into the same vector space [4]. To this end we adopt CLIP (Contrastive Language-Image Pre-training) [2] as the foundation model for our framework. This strategy lets us take advantage of a powerful model pre-trained with massive computational resources, and consequently reduces the amount of computation we need to obtain meaningful results. Visual grounding is not the original purpose of CLIP, so we perform transfer learning, fine-tuning the original model into a customized one suited to our downstream task. In this report we provide a detailed overview of the solutions we studied. We trained and evaluated several model architectures, according to the metrics commonly suggested by the literature, on the RefCOCOg dataset [3], a variant of the Referring Expression Generation (REG) dataset which is particularly suitable in our case. For each implemented model we report the obtained performance, bringing to light its strengths and weaknesses. A methodical comparison of the proposed architectures allowed us to select the most promising implementation. As outlined at the end of this document, the final model has been further refined in an attempt to improve its generalization capabilities. The overall goal of this work is not to achieve state-of-the-art performance. Rather, our contribution is to suggest original solutions to the problem and to highlight promising directions which should be further investigated with stronger hardware. In this regard, throughout the report we draw attention to several strategies adopted to deal with limited time and computational resources. The notebook is organized as follows.
* First, we give a brief overview of the related work proposed in the literature over the years to face similar challenges.
* Then, we explain the peculiarities of our reference dataset. There is no predefined dataset class appropriate for the visual grounding task, so we describe the custom dataset classes we created to load and read the available data correctly.
* In Section 6 we describe the metrics adopted to evaluate and compare the implemented solutions.
* In the subsequent section we describe our training-free baseline algorithm, which was a convenient starting point for the project.
* In the following sections we comment on the architectures designed on top of the observations and experience we progressively gained. First, in Section 9, we illustrate the object detection algorithms examined in order to maximize the quality of the candidate bounding boxes.
* In Section 10 we describe our first fine-tuned architecture, which follows a standard fine-tuning approach.
* Subsequently, inspired by the work of Sachin Goyal et al. [5], we fine-tune our model following the same training procedure employed by CLIP.
* In Section 12 we propose an original technique which exploits the self-attention mechanism to produce contextualized latent space representations of visual and textual prompts.
* This pipeline allowed us to identify the most promising model among the ones considered. In Section 13 we present our solutions to further improve its generalization capabilities.
* Finally, we conclude the report with some final considerations and suggestions for further research in the field.

Between the textual descriptions of our findings we provide the code we implemented. Everything is written in Python, using the PyTorch machine learning framework. We strive to keep the code as clear as possible. To this end we have carefully annotated the data types, using Python [typing](https://docs.python.org/3/library/typing.html) and [Pydantic](https://docs.pydantic.dev/latest/), which should make the overall implementation easier to follow. Furthermore, in some places we use the [doctest](https://docs.python.org/3/library/doctest.html) module to verify that the implemented functions behave exactly as shown.
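
As a minimal sketch of this convention, a type-annotated function can carry its own doctest. The function `clamp` below is a hypothetical example and not part of the project; `doctest.run_docstring_examples` then checks that it behaves as documented.

```python
import doctest


def clamp(x: float, lo: float, hi: float) -> float:
    """
    Hypothetical helper: restrict `x` to the closed interval [lo, hi].

    >>> clamp(5.0, 0.0, 1.0)
    1.0
    >>> clamp(-2.0, 0.0, 1.0)
    0.0
    >>> clamp(0.25, 0.0, 1.0)
    0.25
    """
    return max(lo, min(x, hi))


# Runs the examples in the docstring; prints nothing when they all pass.
doctest.run_docstring_examples(clamp, {"clamp": clamp}, name="clamp")
```

A failing example would be reported with the expected and the actual value side by side, which keeps documentation and behaviour from drifting apart.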


![input-output.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/04.png)

**Figure 4**

On the left, the input of our problem. On the right, a bounding box is drawn around the portion of the image which best matches the description. As clarified later, the dataset sometimes provides more than one description for a given bounding box. In this case we take advantage of the multiple prompts available to try to achieve better predictions.

## 3 Hardware

In this section we briefly mention the hardware we used to run our experiments, train the proposed neural networks and evaluate our models. Throughout the project we carefully planned our executions in order to deal with limited resources and to invest the computational time at our disposal as fruitfully as possible.

The whole project was written in the environment provided by the free version of Google Colab. The platform allows code to run either on CPU only or on Nvidia T4 GPUs. Unfortunately, GPU execution is subject to strict time limits. Consequently, we use the CPU mainly for debugging and to verify that the implemented code works as intended. Once the overall architecture behaves correctly, we run short tests on GPU with small subsets of the dataset, with the aim of understanding whether the model is learning and how its performance could be improved.

In addition, we had at our disposal 50 hours of execution on the more powerful Tesla V100 GPUs provided by Microsoft Azure. The performance evaluations presented in this report were computed on models trained on this hardware for a reasonable amount of time.


In [None]:
%load_ext tensorboard

In [None]:
%%shell
tee requirements.txt << END
ftfy
jaxtyping
jupyter
matplotlib
optuna
pandas
pydantic
regex
sentencepiece
tensorboard
textaugment
torch
torchinfo
torchvision
tqdm
transformers
END

pip install -q -r requirements.txt
pip install -q git+https://github.com/openai/CLIP.git

In [None]:
import csv
import doctest
import itertools as it
import math
import os
import typing as t
import random

import clip
import matplotlib.pyplot as plt
import optuna
import pandas as pd
import torch
import torch.nn as nn

from collections import defaultdict
from clip.model import CLIP
from jaxtyping import Float, UInt, Int
from pydantic.dataclasses import dataclass
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard import SummaryWriter
from torchinfo import summary
from torchinfo.model_statistics import ModelStatistics
from torchvision.io import read_image, ImageReadMode
from torchvision.ops import box_iou, box_convert
from torchvision.transforms import (
    Compose,
    Resize,
    CenterCrop,
    Normalize,
    InterpolationMode,
    ConvertImageDtype,
    ColorJitter,
    GaussianBlur,
    RandomChoice,
    RandomInvert,
    RandomPosterize,
    RandomSolarize,
    RandomAdjustSharpness,
    RandomAutocontrast,
    RandomEqualize,
    Grayscale,
)
from torchvision.transforms.functional import crop
from tqdm.auto import tqdm, trange
from optuna.study import Study
from optuna.visualization import (
    plot_contour,
    plot_edf,
    plot_intermediate_values,
    plot_optimization_history,
    plot_parallel_coordinate,
    plot_param_importances,
    plot_rank,
    plot_slice,
    plot_timeline,
)
from optuna.storages import RDBStorage
from optuna.trial import Trial

In [None]:
device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)

In [None]:
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.use_deterministic_algorithms(False)  # CLIP uses non-deterministic algorithms
g: torch.Generator = torch.Generator(device=device).manual_seed(42)
random.seed(42)

In [None]:
T = t.TypeVar("T")
K = t.TypeVar("K")
V = t.TypeVar("V")


def groupby(
    xs: list[T],
    map_key: t.Callable[[T], K],
    map_value: t.Callable[[T], V] = lambda x: x,
) -> dict[K, list[V]]:
    return {
        k: [map_value(v) for v in vs]
        for k, vs in it.groupby(sorted(xs, key=map_key), key=map_key)
    }



In [None]:
def unzip(batch: list[tuple[T, ...]]) -> tuple[tuple[T, ...], ...]:
    """

    >>> unzip([('A', 1), ('B', 2)])
    (('A', 'B'), (1, 2))

    """
    return tuple(zip(*batch))

In [None]:
def best_bbox(
    pred: Float[torch.Tensor, "crops 4"], groundtruth: Float[torch.Tensor, "1 4"]
) -> int:
    """

    >>> best_bbox(
    ...     torch.tensor([[0, 0, 1, 1], [0, 0, 2, 2], [1, 1, 2, 2]]),
    ...     torch.tensor([[0, 0, 1, 1]])
    ... )
    0

    >>> best_bbox(
    ...     torch.tensor([[0, 0, 0, 0], [0, 0, 2, 2], [1, 1, 2, 2]]),
    ...     torch.tensor([[0, 0, 1, 1]])
    ... )
    1

    """
    return torch.argmax(box_iou(pred, groundtruth)).item()

In [None]:
def eval_summary(model: nn.Module) -> ModelStatistics:
    return summary(
        model,
        input_size=[(5, 3, 244, 244), (2, 77)],
        dtypes=[torch.float, torch.int],
        col_names=["input_size", "output_size", "num_params", "trainable"],
    )

In [None]:
def contrastive_summary(model: nn.Module) -> ModelStatistics:
    return summary(
        model,
        input_size=[(8, 3, 244, 244), (8, 77)],
        dtypes=[torch.float, torch.int],
        col_names=["input_size", "output_size", "num_params", "trainable"],
    )

## 4 Related work

It is not obvious whether our task falls under Referring Expression Comprehension or Visual Grounding; the two categories are very similar. According to the exhaustive survey of Yanyuan Qiao, Chaorui Deng and Qi Wu [4], visual grounding localizes multiple object regions in an image corresponding to multiple noun phrases from a sentence that describes the underlying scene, while the goal of referring expression comprehension is to find the region that best matches a given expression.
More broadly, referring expressions are normally associated with three tasks: generation, segmentation and comprehension.
* Referring expression generation (REG) aims at generating a discriminative description of an object in an image, which is very similar to the image captioning task. Different from general image captions, referring expressions are more specific about an object or region in the image [4].
* Referring expression segmentation (RES) aims to segment the referenced objects according to the referring expression [6].
* Referring expression comprehension (REC) is the reverse task of REG, which aims at localizing objects in an image based on natural language descriptions. The REC problem is typically formulated as selecting the best region from a set of region proposals extracted from the image.

Object detection also resembles our objective. However, whereas object detection uses predefined category labels to classify a fixed set of objects, in our project we use natural language expressions to refer to objects. These phrases are more practical because they vary according to the content of images and texts, so they are better suited to real application scenarios. Succeeding in this task is of crucial importance for other vision-and-language problems, such as Visual Question Answering [7][8] and Visual Dialogue [9][10]. Though these have diverse model architectures, they all require a prior localization of the objects corresponding to a given language description or question. Notably, since the textual information is not a fixed label, a simple detection method cannot meet the requirements.

In more detail, the methods proposed in the literature over the years fall into seven categories: joint embedding approaches, modular-based approaches, graph-based approaches, approaches using external parsers, weakly supervised approaches, one-stage approaches and vision-language pre-training approaches. In this work, we focus on a joint embedding approach. In essence, the main idea behind these methods is to encode the image regions and the natural language prompts into the same vector space in order to link visual and textual representations. A representative and pioneering work in this field is the one proposed by Mao et al. [11]. As depicted in Figure 5, they use a Convolutional Neural Network to generate rich image representations by embedding input images into fixed-length vectors, and an LSTM network to generate text features.
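
The joint embedding idea can be sketched in a few lines. The sketch below assumes that the region and prompt embeddings have already been computed by some encoders: the toy vectors and the helper names `cosine` and `best_region` are illustrative assumptions, not part of the original code. The predicted region is then simply the proposal whose embedding is most similar to the prompt embedding.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_region(regions: list[list[float]], prompt: list[float]) -> int:
    """Index of the region embedding most similar to the prompt embedding."""
    scores = [cosine(region, prompt) for region in regions]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-dimensional embeddings standing in for encoder outputs.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
prompt = [0.1, 0.9, 0.0]

print(best_region(regions, prompt))  # prints 1
```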

Following the core principles of previous joint embedding approaches, a crucial part of our solution is the representation of text and images in a shared latent space. To accomplish this, we rely on CLIP (Contrastive Language-Image Pre-training) [2], a recent large-scale model pre-trained jointly on image and text data. More specifically, the architecture proposed by Radford et al. has been trained via a contrastive loss that finds a joint embedding over a set of paired image and text data. Other studies have shown that this model performs exceptionally well on many downstream tasks. For instance, many works have fine-tuned the original model to perform zero-shot classification tasks [12][13]. In line with these approaches, in the rest of the report we present our ideas to fine-tune CLIP for our downstream visual grounding problem.
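
To make the contrastive objective concrete, the following is a minimal pure-Python sketch of a symmetric cross-entropy loss over an image-text similarity matrix, in the spirit of the loss described for CLIP [2]. The function names and the toy similarity values are assumptions made for illustration; the actual CLIP implementation operates on batched tensors with a learned temperature.

```python
import math


def cross_entropy_diagonal(logits: list[list[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(sim: list[list[float]]) -> float:
    """Average of the image-to-text and text-to-image cross-entropies."""
    transposed = [list(col) for col in zip(*sim)]
    return (cross_entropy_diagonal(sim) + cross_entropy_diagonal(transposed)) / 2


# sim[i][j]: scaled similarity between image i and text j (toy values).
matched = [[10.0, 0.0], [0.0, 10.0]]  # matching pairs score highest
confused = [[0.0, 0.0], [0.0, 0.0]]   # no pair stands out

print(symmetric_contrastive_loss(matched))   # close to 0
print(symmetric_contrastive_loss(confused))  # log(2), about 0.693
```

The loss is small when each image is most similar to its own text and grows towards the logarithm of the batch size when the pairs are indistinguishable.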


![formerApproach.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/05.png)

**Figure 5**

A common approach to the visual grounding task using a CNN-LSTM framework.

## 5 Dataset

To train and assess the models presented in this report, we used the umd split of the RefCOCOg dataset. This dataset contains 85,474 referring expressions, each corresponding to one of 54,822 unique objects present in 26,711 images.
The aim of this section is to concisely describe the structure of the RefCOCOg dataset and the custom classes we prepared to load and read it correctly.

We adapted the RefCOCOg dataset to our needs by condensing the key information into two CSV files:

- `refs.csv`
- `sentences.csv`

> `refs.csv`
>
> Each row refers to an image in the dataset and holds one ground-truth bounding box, expressed in `xyxy` coordinates

> `sentences.csv`
>
> Each row refers to a bounding box in `refs.csv` and holds one possible textual description of the region it encloses

In addition, as detailed in Section 9, the dataset has been enriched with three CSV files:
- `bboxes[YOLOv5].csv`
- `bboxes[YOLOv8].csv`
- `bboxes[DETR].csv`

> `bboxes[V].csv`
>
> Each row in these files corresponds to a bounding box proposed by the `V` model, expressed in `xyxy` coordinates, together with its confidence level

Following a fail-fast style, the CSV files are first read in full, and only then are the `torch.utils.data.Dataset` classes defined.
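
As an illustration of how the two annotation files relate, the sketch below parses two tiny in-memory CSV snippets and groups the sentences by `ref_id`. The column names follow the `Ref` and `Sentence` dataclasses defined in Section 5.1, while the rows themselves are made-up values, and the standard library `csv` module stands in for the Pydantic validation used in the notebook.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; only the column names mirror the real files.
refs_csv = io.StringIO(
    "ref_id,file_name,split,xmin,ymin,xmax,ymax\n"
    "1,a.jpg,train,10.0,20.0,110.0,220.0\n"
    "2,a.jpg,val,5.0,5.0,50.0,60.0\n"
)
sentences_csv = io.StringIO(
    "ref_id,sent\n"
    "1,the dog on the left\n"
    "1,a brown dog\n"
    "2,the red ball\n"
)

# Reading and converting everything up front means a malformed row fails
# immediately (fail-fast), before any dataset or model is built.
toy_refs = [
    {
        **row,
        "ref_id": int(row["ref_id"]),
        **{k: float(row[k]) for k in ("xmin", "ymin", "xmax", "ymax")},
    }
    for row in csv.DictReader(refs_csv)
]

toy_id2sents: dict[int, list[str]] = defaultdict(list)
for row in csv.DictReader(sentences_csv):
    toy_id2sents[int(row["ref_id"])].append(row["sent"])

# Each ground-truth box is now paired with all of its descriptions.
for ref in toy_refs:
    box = (ref["xmin"], ref["ymin"], ref["xmax"], ref["ymax"])
    print(ref["file_name"], box, toy_id2sents[ref["ref_id"]])
```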

For training and evaluating the models presented in this report we implemented four `torch.utils.data.Dataset` classes. The role of each of these custom classes will become clearer in the rest of the document.

- `CocoDataset`
- `Coco4MetricsDataset`
- `Coco4TrainingDataset`
- `Coco4ContrastiveDataset`

> `CocoDataset`
>
> Each item is a tuple with the original image in `torch.Tensor` format, the prompts, the bounding boxes proposed by the visual model, and the ground-truth bounding box.
> `CocoDataset` filters the bounding boxes proposed by the model by setting a lower bound on the confidence level.

> `Coco4MetricsDataset`
>
> Each item is a pair: the bounding boxes proposed by the visual model, each with its confidence level, and the ground-truth bounding box.

> `Coco4TrainingDataset`
>
> Each item resembles the one of the aforementioned `CocoDataset`, but holds the crops of the original image instead of the full image, together with the index of the proposed bounding box that best overlaps the ground truth.
>
> `Coco4TrainingDataset` filters the bounding boxes by setting a lower bound on:
>
> - The confidence level
> - The width in pixels of the bounding box
> - The height in pixels of the bounding box
>
> In addition, `Coco4TrainingDataset` filters items based on the number of bounding boxes found by the visual model: during the training phase, the model can only learn if it has at least two options to choose from. Items for which no proposed bounding box exceeds an IoU of 0.5 with the ground truth are discarded as well.

> `Coco4ContrastiveDataset`
>
> Each item is a pair: the crop of the original image delimited by the ground-truth bounding box, and the prompts.
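
The bounding-box filtering rules can be sketched in plain Python. The thresholds (confidence above 0.25, width and height above 16 pixels, at least two surviving boxes) are taken from the `Coco4TrainingDataset` code in Section 5.1; the `Box` tuple and the function names are hypothetical stand-ins for the notebook's `BBox` dataclass and list comprehensions.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    confidence: float


def keep(box: Box, min_conf: float = 0.25, min_side: float = 16) -> bool:
    """True if the proposal is confident enough and not too small."""
    return (
        box.confidence > min_conf
        and box.xmax - box.xmin > min_side
        and box.ymax - box.ymin > min_side
    )


def usable_for_training(boxes: list[Box]) -> Optional[list[Box]]:
    """Surviving proposals, or None when fewer than two remain."""
    kept = [box for box in boxes if keep(box)]
    return kept if len(kept) > 1 else None


# Made-up proposals for a single image.
proposals = [
    Box(0, 0, 100, 100, 0.90),  # kept
    Box(0, 0, 10, 100, 0.80),   # too narrow
    Box(5, 5, 60, 60, 0.10),    # confidence too low
    Box(20, 20, 80, 90, 0.50),  # kept
]

print(usable_for_training(proposals))
```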

Since the items of the custom `torch.utils.data.Dataset` classes defined so far have variable size, a custom collate function is necessary. In particular, in this report we have defined the following:

- `unzip`
- `augment`
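
To see why a custom collate function helps, consider a batch whose items hold a different number of prompts: default collation expects items of equal size, whereas `unzip` merely transposes the batch and leaves each field as a tuple. The sketch below repeats the definition of `unzip` given earlier in the notebook (without its type annotations) and applies it to made-up items.

```python
def unzip(batch):
    """Transpose a list of tuples into a tuple of tuples."""
    return tuple(zip(*batch))


# Made-up items: (prompts, index of the target box); prompt counts differ.
batch = [
    (["a brown dog", "the dog on the left"], 0),
    (["the red ball"], 2),
]

prompts, targets = unzip(batch)
print(prompts)  # (['a brown dog', 'the dog on the left'], ['the red ball'])
print(targets)  # (0, 2)
```

With PyTorch, such a function is passed to `DataLoader` through its `collate_fn` argument.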

### 5.1 Dataset and type declaration

In [None]:
%%shell
if ! [ -d refcocog ]; then
  gdown 1i-LHWSRp2F6--yhAi4IG3DiiCHmgE4cw &&
  tar -xf refcocog.tar &&
  rm refcocog.tar
fi

In [None]:
path_root: str = os.path.join("refcocog", "")
path_annotations: str = os.path.join(path_root, "annotations", "")
path_bboxes: str = os.path.join(path_root, "bboxes", "")
path_images: str = os.path.join(path_root, "images", "")

path_refs: str = os.path.join(path_annotations, "refs.csv")
path_sentences: str = os.path.join(path_annotations, "sentences.csv")

path_DETR: str = os.path.join(path_bboxes, "bboxes[DETR].csv")
path_YOLOv5: str = os.path.join(path_bboxes, "bboxes[YOLOv5].csv")
path_YOLOv8: str = os.path.join(path_bboxes, "bboxes[YOLOv8].csv")

In [None]:
Split = t.Literal["train", "test", "val"]


@dataclass
class Ref:
    ref_id: int  # unique id for referring expression
    file_name: str  # file name of image relative to img_root
    split: Split
    xmin: float
    ymin: float
    xmax: float
    ymax: float


with open(path_refs, "r") as f:
    raw = csv.DictReader(f)
    refs: list[Ref] = [Ref(**row) for row in raw]


In [None]:
@dataclass
class Sentence:
    ref_id: int  # unique id for referring expression
    sent: str


with open(path_sentences, "r") as f:
    raw = csv.DictReader(f)
    sentences: list[Sentence] = [Sentence(**row) for row in raw]


id2sents: dict[int, list[str]] = groupby(
    sentences, lambda x: x.ref_id, lambda x: x.sent
)



In [None]:
@dataclass
class BBox:
    file_name: str  # file name of image relative to img_root
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    confidence: float


with open(path_DETR, "r") as f:
    raw = csv.DictReader(f)
    bboxes: list[BBox] = [BBox(**row) for row in raw]

img2detr: dict[str, list[BBox]] = defaultdict(
    list, groupby(bboxes, lambda x: x.file_name)
)


with open(path_YOLOv5, "r") as f:
    raw = csv.DictReader(f)
    bboxes: list[BBox] = [BBox(**row) for row in raw]

img2yolov5: dict[str, list[BBox]] = defaultdict(
    list, groupby(bboxes, lambda x: x.file_name)
)


with open(path_YOLOv8, "r") as f:
    raw = csv.DictReader(f)
    bboxes: list[BBox] = [BBox(**row) for row in raw]

img2yolov8: dict[str, list[BBox]] = defaultdict(
    list, groupby(bboxes, lambda x: x.file_name)
)



In [None]:
TensorImage = UInt[torch.Tensor, "3 H W"]

In [None]:
class CocoDataset(Dataset[tuple[TensorImage, list[str], Float[torch.Tensor, "X 4"], Float[torch.Tensor, "1 4"]]]):
    def __init__(
        self,
        split: Split,
        img2bboxes: dict[str, list[BBox]],
        limit: int = -1,
    ):
        self.items: list[
            tuple[
                str, list[str], Float[torch.Tensor, "X 4"], Float[torch.Tensor, "1 4"]
            ]
        ] = [
            (img, sents, xyxys, xyxy)
            for ref in refs
            if ref.split == split
            for img in [os.path.join(path_images, ref.file_name)]
            for sents in [id2sents[ref.ref_id]]
            for bboxes in [img2bboxes[ref.file_name]]
            for xyxys in [
                torch.tensor([
                    (bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax)
                    for bbox in bboxes
                    if bbox.confidence > .25  # lower bound on confidence
                ])
            ]
            for xyxy in [torch.tensor([(ref.xmin, ref.ymin, ref.xmax, ref.ymax)])]
        ]
        self.len: int = len(self.items) if limit < 0 else min(limit, len(self.items))

    def __len__(self) -> int:
        return self.len

    def __getitem__(
        self, index: int
    ) -> tuple[
        TensorImage, list[str], Float[torch.Tensor, "X 4"], Float[torch.Tensor, "1 4"]
    ]:
        file_name, sents, xyxys, xyxy = self.items[index]
        return read_image(file_name, ImageReadMode.RGB).to(device), sents, xyxys, xyxy



In [None]:
class Coco4MetricsDataset(Dataset[tuple[Float[torch.Tensor, 'X 5'], Float[torch.Tensor, '1 4']]]):

    def __init__(
        self,
        split: Split,
        img2bboxes: dict[str, list[BBox]],
        limit: int = -1,
    ):
        self.items: list[tuple[Float[torch.Tensor, 'X 5'], Float[torch.Tensor, '1 4']]] = [
            (xyxys, xyxy)
            for ref in refs
            if ref.split == split
            for bboxes in [img2bboxes[ref.file_name]]
            for xyxys in [torch.tensor([ (bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax, bbox.confidence) for bbox in bboxes ], dtype=torch.float)]
            for xyxy in [torch.tensor([(ref.xmin, ref.ymin, ref.xmax, ref.ymax)], dtype=torch.float)]
        ]
        self.len: int = len(self.items) if limit < 0 else min(limit, len(self.items))

    def __len__(self) -> int:
        return self.len

    def __getitem__(self, index: int) -> tuple[Float[torch.Tensor, 'X 5'], Float[torch.Tensor, '1 4']]:
        return self.items[index]



In [None]:
class Coco4TrainingDataset(
    Dataset[
        tuple[
            list[TensorImage],
            list[str],
            int,
            Float[torch.Tensor, "crops 4"],
            Float[torch.Tensor, "1 4"],
        ]
    ]
):
    def __init__(
        self,
        split: Split,
        img2bboxes: dict[str, list[BBox]],
        limit: int = -1,
    ):
        self.items: list[
            tuple[
                str,
                list[str],
                int,
                Float[torch.Tensor, "X 4"],
                Float[torch.Tensor, "1 4"],
            ]
        ] = [
            (img, sents, i, xyxys, xyxy)
            for ref in refs
            if ref.split == split
            for img in [os.path.join(path_images, ref.file_name)]
            for sents in [id2sents[ref.ref_id]]
            for bboxes in [img2bboxes[ref.file_name]]
            for xyxys in [
                torch.tensor([
                    (bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax)
                    for bbox in bboxes
                    if bbox.confidence > .25  # lower bound on confidence
                    if bbox.xmax - bbox.xmin > 16  # lower bound on width
                    if bbox.ymax - bbox.ymin > 16  # lower bound on height
                ])
            ]
            if xyxys.shape[0] > 1 # lower bound on bbox per image
            for xyxy in [
                torch.tensor([(ref.xmin, ref.ymin, ref.xmax, ref.ymax)])
            ]
            for ious in [box_iou(xyxys, xyxy)]
            if torch.max(ious).item() > .5  # ensure at least .5 of maximum IoU
            for i in [torch.argmax(ious).item()]
        ]
        self.len: int = len(self.items) if limit < 0 else min(limit, len(self.items))

    def __len__(self) -> int:
        return self.len

    def __getitem__(
        self, index: int
    ) -> tuple[
        list[TensorImage],
        list[str],
        int,
        Float[torch.Tensor, "crops 4"],
        Float[torch.Tensor, "1 4"],
    ]:
        file_name, sents, i, xyxys, xyxy = self.items[index]
        img: TensorImage = read_image(file_name, ImageReadMode.RGB).to(device)

        xywhs: Int[torch.Tensor, "X 4"] = box_convert(xyxys, in_fmt="xyxy", out_fmt="xywh").round().int()

        crops: list[TensorImage] = [
            crop(img, top=y, left=x, height=h, width=w)
            for xywh in xywhs
            for [x, y, w, h] in [xywh.tolist()]
        ]

        return crops, sents, i, xyxys, xyxy



In [None]:
class Coco4ContrastiveDataset(Dataset[tuple[TensorImage, list[str]]]):
    def __init__(
        self,
        split: Split,
        limit: int = -1,
    ):
        self.items: list[tuple[str, list[str], Float[torch.Tensor, "1 4"]]] = [
            (img, sents, xyxy)
            for ref in refs
            if ref.split == split
            for img in [os.path.join(path_images, ref.file_name)]
            for sents in [id2sents[ref.ref_id]]
            for xyxy in [torch.tensor([(ref.xmin, ref.ymin, ref.xmax, ref.ymax)])]
        ]
        self.len: int = len(self.items) if limit < 0 else min(limit, len(self.items))

    def __len__(self) -> int:
        return self.len

    def __getitem__(self, index: int) -> tuple[TensorImage, list[str]]:
        file_name, sents, xyxy = self.items[index]
        img: TensorImage = read_image(file_name, ImageReadMode.RGB).to(device)

        xywh: Int[torch.Tensor, "1 4"] = box_convert(xyxy, in_fmt="xyxy", out_fmt="xywh").round().int()
        [[x, y, w, h]] = xywh.tolist()

        return crop(img, top=y, left=x, height=h, width=w), sents



## 6 Evaluation metrics

Recent works [12][13] have shown that even subtle changes in the fine-tuning process can lead to surprisingly large differences in the final performance. In order to provide meaningful comparisons between our proposed solutions and to ensure that the implemented models perform as intended, it is essential to evaluate them according to appropriate metrics. To quantitatively measure the capabilities of our algorithms, we considered the following criteria.

### 6.1 Localization accuracy

Localization accuracy measures how accurately the fine-tuned network can ground a language description to the referred object. Intersection over Union (IoU) is a common measure of localization accuracy (Figure 6). In particular, we keep track of:
* mean intersection over union (**mIoU**)
* the fraction of correct predictions over the total number of processed samples, considering an IoU threshold of 0.3 (**mAP [IoU .3]**). That is, the predicted bounding box $\hat{b}$ is considered correct (true positive) if the IoU between $\hat{b}$ and the ground-truth bounding box $b$ is at least 30%. This metric is usually referred to in the literature as *mean Average Precision* (mAP) [15].
* the fraction of correct predictions over the total number of processed samples, considering an IoU threshold of 0.5 (**mAP [IoU .5]**)
* the fraction of correct predictions over the total number of processed samples, considering an IoU threshold of 0.7 (**mAP [IoU .7]**)
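
The localization metrics can be illustrated with a small pure-Python sketch. It assumes boxes in `xyxy` format and uses made-up predictions; the helper names `iou` and `accuracy_at` are hypothetical, while the notebook itself imports `torchvision.ops.box_iou` for the IoU computation.

```python
def iou(
    a: tuple[float, float, float, float],
    b: tuple[float, float, float, float],
) -> float:
    """Intersection over Union of two boxes in xyxy format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(ious: list[float], threshold: float) -> float:
    """Fraction of predictions whose IoU reaches the threshold."""
    return sum(v >= threshold for v in ious) / len(ious)


# Made-up (prediction, ground truth) pairs.
pairs = [
    ((0, 0, 10, 10), (0, 0, 10, 10)),    # perfect match, IoU = 1
    ((0, 0, 10, 10), (5, 0, 15, 10)),    # half overlap, IoU = 1/3
    ((0, 0, 10, 10), (20, 20, 30, 30)),  # disjoint, IoU = 0
]
ious = [iou(p, t) for p, t in pairs]

print(sum(ious) / len(ious))   # mIoU
print(accuracy_at(ious, 0.3))  # 2 of 3 predictions reach IoU .3
print(accuracy_at(ious, 0.5))  # 1 of 3 predictions reaches IoU .5
```

Raising the threshold can only lower the fraction of predictions counted as correct, which is why the three thresholds are reported together.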

### 6.2 Semantic similarity

Semantic similarity (Formulas 1 and 2) measures the similarity between the portion of the image delimited by the predicted bounding box $\hat{b}$ and the portion delimited by the ground-truth bounding box $b$. Consider the output shown in Figure 7. The neural network predicts the red bounding box $\hat{b}$, while in the dataset the green bounding box $b$ is the annotated ground truth. Evidently, the algorithm's response is not a true mistake: conceptually, the two portions of the picture delimited by $\hat{b}$ and $b$ are semantically equivalent. To compute the semantic similarity between two crops we evaluate the mean cosine similarity (**mCos**, Formula 2) and the mean Euclidean distance (**mED**, Formula 1) between the $\hat{b}_z$s and the $b_z$s, which are the latent space representations of the predicted bounding boxes $\hat{b}$s and the ground-truth bounding boxes $b$s respectively. In order to achieve a fair comparison, we consistently use the original CLIP visual encoder (`encode_image()` method) to represent a cropped portion of the image as a vector of size 1024 (Figure 8).

**Formula 1**

$d(b_z, \hat{b}_z) = \lVert b_z - \hat{b}_z \rVert_2 = \sqrt{\sum_{i=1}^{n}(b_{z_i} - \hat{b}_{z_i})^2}$

**Formula 2**

$S_C(b_z, \hat{b}_z) = \frac{b_z \cdot \hat{b}_z}{\lVert{} b_z \rVert{} \lVert{}\hat{b}_z\rVert{}} = \frac{\sum_{i=1}^{n}b_{z_i}\hat{b}_{z_i}}{\sqrt{\sum_{i=1}^{n}b_{z_i}^2 \cdot \sum_{i=1}^{n}\hat{b}_{z_i}^2}}$
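
To make Formula 1 and Formula 2 concrete, here is a small worked example on toy 3-dimensional vectors standing in for the 1024-dimensional CLIP embeddings. The function names and the toy values are assumptions made for illustration only.

```python
import math


def euclidean_distance(u, v):
    """Formula 1: L2 norm of the difference between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Formula 2: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# toy stand-ins for the latent representations b_z and b_hat_z
b_z = [1.0, 0.0, 0.0]
b_hat_z = [1.0, 1.0, 0.0]

distance = euclidean_distance(b_z, b_hat_z)    # sqrt(0 + 1 + 0) = 1
similarity = cosine_similarity(b_z, b_hat_z)   # 1 / (1 * sqrt(2))
print(distance, similarity)
```

Note that the two measures point in opposite directions: a good prediction has a high cosine similarity and a low Euclidean distance.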

![iou.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/06.png)

**Figure 6**

Intersection over Union


![semantic1.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/07.png)

**Figure 7**

![semantic2.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/08.png)

**Figure 8**

### 6.3 Code

In [None]:
def eval_step(
    model: nn.Module,
    data_loader: DataLoader[tuple[TensorImage, list[str], Float[torch.Tensor, "X 4"], Float[torch.Tensor, "4"]]],
    img_preprocess: t.Callable[[TensorImage], Float[torch.Tensor, "3 244 244"]],
) -> pd.DataFrame:
    model.eval()

    ious: list[float] = []
    coss: list[float] = []
    euds: list[float] = []

    with torch.inference_mode():
        img: TensorImage
        prompts: list[str]
        xyxys: Float[torch.Tensor, "crops 4"]
        true_xyxy: Float[torch.Tensor, "1 4"]

        progress = tqdm(data_loader, desc="eval")

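        for img, prompts, xyxys, true_xyxy in progress:

            # no region proposals: fall back to a single box covering the whole image
            if xyxys.shape[0] == 0:
                xyxys = torch.tensor(
                    [[0, 0, img.shape[-1], img.shape[-2]]],
                    dtype=torch.float,
                    device=true_xyxy.device,
                )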

            # from xyxys to crops
            xywhs: Int[torch.Tensor, "X 4"] = box_convert(xyxys, in_fmt="xyxy", out_fmt="xywh").round().int()

            crops: list[TensorImage] = [
                crop(img, top=y, left=x, height=h, width=w)
                for xywh in xywhs
                for [x, y, w, h] in [xywh.tolist()]
            ]

            # from true_xyxy to true_crop
            true_xywh: Int[torch.Tensor, "1 4"] = box_convert(true_xyxy, in_fmt="xyxy", out_fmt="xywh").round().int()

            true_crop: TensorImage
            [true_crop] = [
                crop(img, top=y, left=x, height=h, width=w)
                for xywh in true_xywh
                for [x, y, w, h] in [xywh.tolist()]
            ]

            # forward pass
            model_output: Float[torch.Tensor, "crops"] = model(crops, prompts)

            # get index of the predicted bounding box to compute IoU accuracy
            pred_i: int = torch.argmax(model_output).item()

            # get predicted bounding box
            pred_xyxy: Float[torch.Tensor, "1 4"] = xyxys[pred_i].unsqueeze(0)

            iou: float = box_iou(true_xyxy, pred_xyxy).item()
            ious.append(iou)

            true_z: Float[torch.Tensor, "1 1024"] = clip_frozen_img_encoder(img_preprocess(true_crop).unsqueeze(0))
            pred_z: Float[torch.Tensor, "1 1024"] = clip_frozen_img_encoder(img_preprocess(crops[pred_i]).unsqueeze(0))

            cos: float = torch.nn.functional.cosine_similarity(true_z, pred_z).item()
            coss.append(cos)

            eud: float = torch.cdist(true_z, pred_z, p=2).item()
            euds.append(eud)

        return pd.DataFrame(
            {
                "iou": ious,
                "cos similarity": coss,
                "euclidean distance": euds,
            }
        )



In [None]:
def showtime(
    model: nn.Module,
    data_loader: DataLoader[tuple[TensorImage, list[str], Float[torch.Tensor, "X 4"], Float[torch.Tensor, "4"]]],
    writer: SummaryWriter,
    global_step: int,
) -> None:
    model.eval()

    with torch.inference_mode():
        img: TensorImage
        prompts: list[str]
        xyxys: Float[torch.Tensor, "crops 4"]
        true_xyxy: Float[torch.Tensor, "1 4"]

        progress = tqdm(data_loader, desc="showtime")

        for iter, (img, prompts, xyxys, true_xyxy) in zip(it.count(1), progress):
            true_i: int = best_bbox(xyxys, true_xyxy)

            # from xyxys to crops
            xywhs: Int[torch.Tensor, "X 4"] = box_convert(xyxys, in_fmt="xyxy", out_fmt="xywh").round().int()

            crops: list[TensorImage] = [
                crop(img, top=y, left=x, height=h, width=w)
                for xywh in xywhs
                for [x, y, w, h] in [xywh.tolist()]
            ]

            # forward pass
            model_output: Float[torch.Tensor, "crops"] = model(crops, prompts)

            # get index of the predicted bounding box to compute IoU accuracy
            pred_i: int = torch.argmax(model_output).item()

            # https://github.com/pytorch/pytorch/issues/65449
            writer.add_image_with_boxes(
                tag=f"{iter}: {' ¶ '.join(prompts)}",
                img_tensor=img,
                box_tensor=torch.stack((xyxys[pred_i], xyxys[true_i], true_xyxy.squeeze())),
                labels=["prediction", "best region proposal", "ground truth"],
                global_step=global_step,
            )



In [None]:
def compare(reports: dict[str, pd.DataFrame]) -> pd.DataFrame:
    return pd.DataFrame(
        {
            "mAP[IoU .3]": [(report["iou"] >= 0.3).sum() / report["iou"].count() for report in reports.values()],
            "mAP[IoU .5]": [(report["iou"] >= 0.5).sum() / report["iou"].count() for report in reports.values()],
            "mAP[IoU .7]": [(report["iou"] >= 0.7).sum() / report["iou"].count() for report in reports.values()],
            "mIoU": [report["iou"].mean() for report in reports.values()],
            "mCos": [report["cos similarity"].mean() for report in reports.values()],
            "mED": [report["euclidean distance"].mean() for report in reports.values()],
        },
        index=reports.keys(),
    )

## 7 Baseline

`BASELINE`

In [None]:
%tensorboard --logdir ./assets/baseline/runs --port 6001

At the beginning of our project we implemented a training-free baseline algorithm. Developing this solution, hereafter referred to as `BASELINE`, was worthwhile because it allowed us to familiarize ourselves with the task of visual grounding, the dataset and the CLIP model. Moreover, evaluating its results highlighted several aspects that can be improved and gave us an approximate reference for the performance that our further solutions should reach.

The purpose of this section is to describe how the baseline algorithm works, report the obtained performance and provide a readable implementation of the described functionalities.

The method is a training-free approach that combines zero-shot CLIP with a Yolo architecture [14]. More specifically, we rely on the Yolov5 implementation provided by TorchHub: [Ultralytics Yolov5](https://pytorch.org/hub/ultralytics_yolov5/). The computation consists of extracting all the bounding boxes proposed by Yolo and evaluating their similarity with the textual query. To compare crops and prompts, we use the CLIP visual encoder (`encode_image()`) and the CLIP text encoder (`encode_text()`) to map images and texts into the same latent space. As a result, visual and textual prompts are represented as vectors belonging to the same 1024-dimensional vector space. The vectors which correspond to the embedded crops ($\hat{b}_{z_1}, ..., \hat{b}_{z_N}$) can then be compared with the prompt encoding ($p$) using the cosine similarity function ($S_C$). The output of the algorithm is the bounding box corresponding to the vector $\hat{b}_{z_i} \in \{\hat{b}_{z_1}, ..., \hat{b}_{z_N}\}$ with the highest similarity to the latent space representation of the textual prompt. For the sake of clarity, Figure 9 and Figure 10 give a schematic illustration of the overall architecture.


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/09.png)

**Figure 9**

The first step of the baseline is to perform object detection with the Yolov5 algorithm.

![baseline.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/10.png)

**Figure 10**

The second step of the baseline algorithm compares the vector embeddings of the previously extracted crops with the latent space representation of the input prompt $p$. The overall output is the bounding box $\hat{b}_i$ whose embedding $\hat{b}_{z_i}$ has the highest similarity with the vector $p$. We rely on the cosine similarity function $S_C$ for the comparison.
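
The selection step can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions: the function name `pick_box` is hypothetical, the embeddings are toy 2-dimensional tensors rather than real CLIP outputs, and when several prompts are given their similarities are averaged.

```python
import torch
import torch.nn.functional as F


def pick_box(crops_z: torch.Tensor, prompts_z: torch.Tensor) -> int:
    """Return the index of the crop whose embedding best matches the prompts."""
    crops_z = F.normalize(crops_z, dim=-1)      # [crops, d], unit length
    prompts_z = F.normalize(prompts_z, dim=-1)  # [prompts, d], unit length
    similarity = prompts_z @ crops_z.T          # [prompts, crops] cosine similarities
    # average over prompts, then take the most similar crop
    return int(similarity.mean(dim=0).argmax())


# toy embeddings standing in for the 1024-dimensional CLIP features
crops_z = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
prompts_z = torch.tensor([[0.0, 2.0]])

best = pick_box(crops_z, prompts_z)
print(best)
```

Because both sets of vectors are normalized first, the matrix product yields cosine similarities, so the length of an embedding does not influence which crop is selected.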

### 7.1 Results

Although our zero-shot `BASELINE` involves no additional training at all, it already performs reasonably well on our downstream task: on the test split it reaches an mIoU of about 0.55 and an mAP [IoU .5] of about 0.57. The full results are reported in Table 1 and Table 2 (Section 7.3).

### 7.2 Observations

In the two-step `BASELINE` algorithm, the comparison between the crop embeddings and the latent representation of the textual prompt is restricted to the regions of the input image proposed by Yolov5. As a consequence, the quality of the final predicted bounding box is inevitably tied to the quality of the bounding boxes proposed by the detection step. More precisely, let $\hat{b}_{\text{max}}$ be the bounding box proposed by Yolov5 with the highest intersection over union with the ground-truth bounding box of the dataset. The localization accuracy of the `BASELINE` is then bounded above by the average intersection over union between the $\hat{b}_{\text{max}}$s and the corresponding ground truths. In this regard, a good starting point to improve performance is to replace Yolov5 with another region proposal algorithm that suggests better bounding boxes, i.e. rectangles with a higher intersection over union with the annotated ground truth.

By manually inspecting the regions proposed by Yolo, we noticed that some crops are very small areas made of only a few pixels (Figure 11). In general, it does not make sense to consider these regions, which only make the overall computation more expensive. A simple remedy is to discard all the bounding boxes whose edges are both smaller than a given threshold.
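
A possible form of this filter is sketched below. The function name `drop_tiny_boxes` and the threshold of 16 pixels are assumptions chosen for illustration; the report does not state which threshold, if any, was used.

```python
def drop_tiny_boxes(boxes, min_side=16.0):
    """Discard (x1, y1, x2, y2) boxes whose width and height are both below min_side."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        width, height = x2 - x1, y2 - y1
        if width < min_side and height < min_side:
            continue  # both edges are too small: skip this proposal
        kept.append((x1, y1, x2, y2))
    return kept


proposals = [
    (10, 10, 200, 180),  # large region: kept
    (50, 60, 55, 66),    # 5 x 6 pixels: discarded
    (0, 0, 8, 120),      # narrow but tall: kept, only one edge is small
]
kept = drop_tiny_boxes(proposals, min_side=16)
print(kept)
```

Requiring both edges to be small keeps thin but elongated objects, which may still be valid referents.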

We consider the two-step pipeline of the `BASELINE` algorithm, in which the predicted bounding box is chosen among the regions proposed in the first stage, to be an appropriate way to tackle the problem. In the process, CLIP is used to map the image crops and the input prompt into a shared latent space, so the final answer of the algorithm strongly depends on the ability of CLIP to extract good features. To improve the performance of our solution, we believe that the most promising strategy is to apply transfer learning and fine-tune the CLIP image and text encoders so that they extract features which are more discriminative for our narrow domain. Intuitively, features that are more specialized for the task at hand should lead to a better end-to-end representation of visual and textual prompts, and therefore to a finer comprehension of the scene. In Section 10 we present the solutions that we have come up with to further improve performance via supervised fine-tuning.

In our reference dataset many samples include spatial relationships (Figure 12). In general, the baseline model displays good results on appearance-based descriptions that are independent of the viewer's perspective. However, the algorithm struggles when the textual descriptions contain location words, as in "the dog on the left". We consider this limitation and describe our attempt to mitigate it in Section 12.

Finally, in this first implementation we always base our predictions on a single textual prompt describing the region of interest. However, the dataset sometimes provides more than one description for a given ground-truth bounding box. In the remainder of this report we describe our approach to considering more than one prompt in order to output more accurate predictions. More broadly, in Section 13 we further investigate the use of data augmentation to improve the generalization capabilities of our final model.


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/11.png)

**Figure 11**

Three crops proposed by Yolo. Evidently, two of them are meaningless and are too small to accommodate interesting portions of the picture.

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/12.png)

**Figure 12**

We have noticed that the `BASELINE` algorithm has severe issues in dealing with spatial relationships. For instance, in this case it has difficulties in choosing between the two dogs in the picture.

### 7.3 Code

In [None]:
clip_model, clip_preprocessor = clip.load("RN50", device=device)
clip_model.float()
clip_model.eval()

for p in clip_model.parameters():
    p.requires_grad = False

In [None]:
def transform(n_px: int) -> Compose:
    """
    https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/clip.py#L75-L86
    """
    return Compose([
        ConvertImageDtype(torch.float),
        Resize(n_px, interpolation=InterpolationMode.BICUBIC, antialias=True),
        CenterCrop(n_px),
        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ])


preprocess: Compose = transform(224)

In [None]:
class ClipFrozenImgEnc(nn.Module):
    def forward(
        self, image: Float[torch.Tensor, "crops 3 244 244"]
    ) -> Float[torch.Tensor, "crops 1024"]:
        with torch.no_grad():
            return clip_model.encode_image(image).float()


class ClipFrozenTxtEnc(nn.Module):
    def forward(
        self, text: Int[torch.Tensor, "prompts 77"]
    ) -> Float[torch.Tensor, "prompts 1024"]:
        with torch.no_grad():
            return clip_model.encode_text(text).float()

In [None]:
clip_frozen_img_encoder: ClipFrozenImgEnc = ClipFrozenImgEnc()
clip_frozen_txt_encoder: ClipFrozenTxtEnc = ClipFrozenTxtEnc()

---

In [None]:
class ClipWrapper(nn.Module):
    def __init__(self, clip_model: CLIP):
        super().__init__()
        self.img_preprocess: Compose = preprocess
        self.txt_preprocess: t.Callable[[t.Union[str, list[str]]], Int[torch.Tensor, "prompts 77"]] = clip.tokenize
        self.clip_model: CLIP = clip_model

    def forward(
        self, crops: list[TensorImage], prompts: list[str]
    ) -> Float[torch.Tensor, "crops"]:
        with torch.no_grad():
            # step 1: preprocess crops as required by the visual encoder
            crops_preprocessed: Float[torch.Tensor, "crops 3 244 244"] = torch.stack([
                self.img_preprocess(crop)
                for crop in crops
            ])

            # step 2: preprocess prompts as required by the text encoder
            prompts_preprocessed: Int[torch.Tensor, "prompts 77"] = self.txt_preprocess(prompts)

            similarity_matrix: Float[torch.Tensor, "prompts crops"]
            _, similarity_matrix = self.clip_model(
                crops_preprocessed,
                prompts_preprocessed,
            )

            return torch.mean(similarity_matrix, dim=0)

In [None]:
eval_summary(
    clip_model
)

Layer (type:depth-idx)                             Input Shape               Output Shape              Param #                   Trainable
CLIP                                               [5, 3, 244, 244]          [5, 2]                    563,713                   False
├─ModifiedResNet: 1-1                              [5, 3, 244, 244]          [5, 1024]                 --                        False
│    └─Conv2d: 2-1                                 [5, 3, 244, 244]          [5, 32, 122, 122]         (864)                     False
│    └─BatchNorm2d: 2-2                            [5, 32, 122, 122]         [5, 32, 122, 122]         (64)                      False
│    └─ReLU: 2-3                                   [5, 32, 122, 122]         [5, 32, 122, 122]         --                        --
│    └─Conv2d: 2-4                                 [5, 32, 122, 122]         [5, 32, 122, 122]         (9,216)                   False
│    └─BatchNorm2d: 2-5                            [5,

In [None]:
splits: list[Split] = ["train", "val", "test"]

pd.concat(
    [
        pd.read_csv(f"assets/baseline/eval-baseline-{split}.csv", index_col=0)
        for split in splits
    ],
    axis=1,
    keys=splits,
)

|  | train iou | train cos similarity | train euclidean distance | val iou | val cos similarity | val euclidean distance | test iou | test cos similarity | test euclidean distance |
|---|---|---|---|---|---|---|---|---|---|
| count | 42226.0 | 42226.0 | 42226.0 | 2573.0 | 2573.0 | 2573.0 | 5023.0 | 5023.0 | 5023.0 |
| mean | 0.54818 | 0.885833 | 0.836584 | 0.545417 | 0.884527 | 0.840914 | 0.547995 | 0.88389 | 0.840413 |
| std | 0.394632 | 0.119882 | 0.467774 | 0.395396 | 0.120132 | 0.466306 | 0.394547 | 0.12347 | 0.473257 |
| min | 0.0 | 0.238326 | 0.0 | 0.0 | 0.386646 | 0.078198 | 0.0 | 0.3055 | 0.0 |
| 25% | 0.094556 | 0.82279 | 0.444478 | 0.085112 | 0.820903 | 0.449173 | 0.099722 | 0.819754 | 0.444113 |
| 50% | 0.74995 | 0.941208 | 0.693325 | 0.72944 | 0.938441 | 0.705266 | 0.748046 | 0.940718 | 0.691364 |
| 75% | 0.920812 | 0.976068 | 1.177713 | 0.921546 | 0.975418 | 1.182641 | 0.921519 | 0.976637 | 1.185348 |
| max | 0.999468 | 1.0 | 2.96567 | 0.990354 | 0.999322 | 2.501004 | 0.995819 | 1.0 | 2.962296 |


**Table 1**

In [None]:
pd.read_csv(f"assets/baseline/comparing.csv", index_col=0)

|  | mAP[IoU .3] | mAP[IoU .5] | mAP[IoU .7] | mIoU | mCos | mED |
|---|---|---|---|---|---|---|
| train | 0.628239 | 0.571709 | 0.520414 | 0.54818 | 0.885833 | 0.836584 |
| val | 0.628838 | 0.570152 | 0.514963 | 0.545417 | 0.884527 | 0.840914 |
| test | 0.629305 | 0.568784 | 0.518813 | 0.547995 | 0.88389 | 0.840413 |


**Table 2**

## 8 Approach

Motivated by these observations, in this work we propose a two-step joint embedding algorithm to solve the visual grounding problem. The general framework is depicted in Figure 13.

The purpose of the first stage is to select a collection of regions of interest in the input image. To this end we need a deep learning model that predicts a set of bounding boxes which, in most cases, contains a box with a high intersection over union with the ground-truth bounding box of the RefCOCOg dataset.

In the second phase we use our fine-tuned versions of the CLIP visual and textual encoders to encode the image regions delimited by the proposed bounding boxes and the referring expression into the same vector space. By doing so we can locate the portion of the image which is most in line with the given natural language description by means of the cosine similarity function, as proposed in our baseline.

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/13.png)

**Figure 13**

## 9 Region proposal algorithms

In line with the schema in Figure 13, the purpose of the first step of our algorithm is to propose potential regions of interest within the field of view. Understanding the location of the relevant objects in a given scene is a trivial task for humans. However, developing a software agent with such a capability was an uphill task until the turn of the last decade [15]. In recent years, motivated by a huge range of possible applications, the computer vision literature has proposed several models to accomplish this task. In the `BASELINE` algorithm we employed the widely used Yolov5 model. As mentioned above, the quality of the bounding box returned at the end of the overall execution is inherently tied to the quality of the bounding boxes proposed by Yolov5. More precisely, let $\hat{B}=\{\hat{b}_1, …, \hat{b}_N\}$ be the set of bounding boxes proposed by Yolov5 and $\hat{b}_{\text{max}} \in \hat{B}$ the proposed bounding box with the highest intersection over union with the ground-truth bounding box $b$; the localization accuracy of the final predicted bounding box is then bounded above by IoU($\hat{b}_{\text{max}}, b$). The cardinality of $\hat{B}$ also matters, since the computational cost of the subsequent steps depends on the number of bounding boxes to be evaluated. Based on these observations, we tested multiple freely available object detection algorithms with the aim of finding the best performing one on our dataset. To this end, we considered some of the state-of-the-art methods listed by paperswithcode.com [16]. In particular, in this project we ran Yolov5 [14], Yolov8 [17] and DETR [18]. For a proper comparison, all the algorithms were configured with the same confidence threshold. The obtained results are reported in Table 3.
As we can see, though all the models perform decently, DETR provides the best tradeoff between number of bounding boxes and average IoU with the dataset ground truth. Consequently, all the results presented in the remainder of the project are obtained on top of the regions proposed by DETR.
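
The upper bound discussed above can be written explicitly. As a sketch, assume the dataset has $M$ samples (the symbol $M$ and the superscript $(j)$ are introduced here only for illustration), and let $\hat{b}^{(j)}$ be the box finally selected for sample $j$ among the proposals $\hat{B}^{(j)}$:

$$
\text{mIoU} = \frac{1}{M}\sum_{j=1}^{M}\text{IoU}(\hat{b}^{(j)}, b^{(j)}) \le \frac{1}{M}\sum_{j=1}^{M}\text{IoU}(\hat{b}^{(j)}_{\text{max}}, b^{(j)})
$$

The inequality holds term by term because $\hat{b}^{(j)} \in \hat{B}^{(j)}$ and $\hat{b}^{(j)}_{\text{max}}$ maximizes the IoU over $\hat{B}^{(j)}$. The right-hand side corresponds to the mean of the `iou` column in Table 3: for Yolov5 on the test split it is about 0.826, against a `BASELINE` mIoU of about 0.548 in Table 2.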


### 9.1 Bounding box preprocessing
As described in Section 8 and depicted in Figure 13, once the first step of the overall algorithm has been completed, the obtained bounding boxes are not refined further during the rest of the execution. Therefore, we decided to preprocess our entire dataset in order to fill it with the bounding boxes proposed by the aforementioned object detectors. In doing so we considerably sped up the overall execution without any loss of generality: a single epoch lasts 90 minutes on average without this preprocessing and 50 minutes with it. As a consequence, we were able to run more experiments and to train our models on more data for a longer time. We believe that this preprocessing can potentially be applied in many other deep learning domains. Hence, as a contribution of our work, we have made the code to compute this preprocessing with any object detection algorithm available in our GitHub repository (https://github.com/StefanoGenettiUniTN/refcocog-augmentation). Moreover, in the repository we have published the `yolov5.csv`, `yolov8.csv` and `detr.csv` files, which contain the results computed by the three aforementioned object detectors.
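
The idea of computing the proposals once and reusing them at every epoch can be sketched as follows. The column layout (`image_id`, box corners, `confidence`) and the helper names are hypothetical: they are not taken from the published `yolov5.csv`, `yolov8.csv` or `detr.csv` files, whose actual schema may differ.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ("x1", "y1", "x2", "y2", "confidence")


def save_proposals(handle, proposals):
    """Write one CSV row per proposed box (hypothetical layout)."""
    writer = csv.writer(handle)
    writer.writerow(["image_id", *COLUMNS])
    for image_id, boxes in proposals.items():
        for box in boxes:
            writer.writerow([image_id, *box])


def load_proposals(handle):
    """Read the cached boxes back, grouped by image id."""
    img2boxes = defaultdict(list)
    for row in csv.DictReader(handle):
        box = tuple(float(row[name]) for name in COLUMNS)
        img2boxes[int(row["image_id"])].append(box)
    return dict(img2boxes)


# toy detector output: image id -> list of (x1, y1, x2, y2, confidence)
detected = {
    42: [(10.0, 20.0, 110.0, 220.0, 0.97), (0.0, 0.0, 50.0, 40.0, 0.61)],
    7: [(5.0, 5.0, 300.0, 200.0, 0.88)],
}

buffer = io.StringIO()  # stands in for a file on disk
save_proposals(buffer, detected)
buffer.seek(0)
cached = load_proposals(buffer)
print(cached[42])
```

With such a cache, the training loop only needs a dictionary lookup per image instead of a forward pass of the object detector. One limitation of this row-per-box layout is that images with no proposals leave no trace in the file and have to be handled separately.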

### 9.2 Code

In [None]:
def metrics(dataset: Dataset[tuple[Float[torch.Tensor, 'X 5'], Float[torch.Tensor, '1 4']]]) -> pd.DataFrame:

    dataloader: DataLoader[tuple[Float[torch.Tensor, 'X 5'], Float[torch.Tensor, '1 4']]] = DataLoader(dataset, batch_size=None)
    Z: Float[torch.Tensor, '1 5'] = torch.zeros(1, 5)

    ious: list[float] = [ torch.max(box_iou(true_xyxy, torch.cat((Z, xyxys))[:, :4])).item() for xyxys, true_xyxy in tqdm(dataloader) ]
    rs: list[int] = [ xyxys.shape[0] for xyxys, _ in tqdm(dataloader) ]

    return pd.DataFrame({'iou': ious, '#': rs})

In [None]:
splits: list[Split] = ['train', 'val', 'test']
report: pd.DataFrame = pd.concat(
    [
        pd.concat(
            [yolov5, yolov8, detr],
            axis=1,
            keys=['yolov5', 'yolov8', 'detr']
        ).describe()
        for split in splits
        for yolov5 in [metrics(Coco4MetricsDataset(split, img2yolov5))]
        for yolov8 in [metrics(Coco4MetricsDataset(split, img2yolov8))]
        for detr in [metrics(Coco4MetricsDataset(split, img2detr))]
    ],
    axis=1,
    keys=splits
)

In [None]:
report

|  | train / yolov5 / iou | train / yolov5 / # | train / yolov8 / iou | train / yolov8 / # | train / detr / iou | train / detr / # | val / yolov5 / iou | val / yolov5 / # | val / yolov8 / iou | val / yolov8 / # | val / detr / iou | val / detr / # | test / yolov5 / iou | test / yolov5 / # | test / yolov8 / iou | test / yolov8 / # | test / detr / iou | test / detr / # |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 |
| mean | 0.825308 | 11.371288 | 0.917952 | 11.666438 | 0.916711 | 25.652181 | 0.823866 | 11.614069 | 0.914574 | 12.156627 | 0.915542 | 26.676642 | 0.825951 | 11.19351 | 0.916911 | 11.474617 | 0.917533 | 25.201672 |
| std | 0.183443 | 9.895845 | 0.114324 | 9.746315 | 0.082985 | 23.022665 | 0.18291 | 9.82122 | 0.118846 | 10.222522 | 0.082728 | 23.541012 | 0.181452 | 9.891445 | 0.118269 | 9.782079 | 0.080795 | 22.871364 |
| min | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.026681 | 1.0 | 0.103549 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.20701 | 1.0 |
| 25% | 0.790363 | 5.0 | 0.916166 | 5.0 | 0.900589 | 8.0 | 0.778337 | 5.0 | 0.912336 | 5.0 | 0.899168 | 8.0 | 0.784575 | 5.0 | 0.916092 | 5.0 | 0.901468 | 8.0 |
| 50% | 0.897469 | 8.0 | 0.952232 | 9.0 | 0.941296 | 18.0 | 0.89947 | 8.0 | 0.951201 | 9.0 | 0.940851 | 18.0 | 0.89694 | 8.0 | 0.953306 | 8.0 | 0.942469 | 17.0 |
| 75% | 0.939805 | 15.0 | 0.971201 | 15.0 | 0.964255 | 37.0 | 0.939672 | 15.0 | 0.970959 | 16.0 | 0.963627 | 38.0 | 0.940734 | 14.0 | 0.971352 | 14.0 | 0.964712 | 35.0 |
| max | 0.999468 | 127.0 | 0.998281 | 117.0 | 0.998923 | 100.0 | 0.990354 | 72.0 | 0.996324 | 87.0 | 0.996266 | 99.0 | 0.995819 | 96.0 | 0.9972 | 99.0 | 0.998143 | 100.0 |


**Table 3**

## 10 Standard fine-tuning

`net1` `net2` `net3` `net4`

In [None]:
%tensorboard --logdir ./assets/standard-finetuning/runs --port 6002

Given a set of proposed bounding boxes $\hat{B}=\{\hat{b}_1, …, \hat{b}_N\}$ and a natural language description $p$ referring to one of them, the purpose of the second step of our solution is to map both visual and textual prompts into a shared vector space. In the latent space, the bounding boxes and the referring expression are represented as feature vectors. Since these vectors share a common geometrical space, they have the same dimensionality. As a consequence, we can compare the visual information within the bounding boxes with the semantics of the provided description simply using cosine similarity. In this way, our algorithm can determine which image region is most in line with the input expression.

Evidently, the hard part of this road map is the definition of a meaningful mapping between visual and textual information. To this end, we rely on the powerful pretrained CLIP model proposed by OpenAI [2]. More specifically, the deep neural network provides a visual encoder $\Psi: I \rightarrow Z$ and a textual encoder $\Gamma: P \rightarrow Z$ such that, given an image $i \in I$ and a text $p \in P$ respectively, they produce two latent space representations $i_z, p_z \in Z$ in a joint 1024-dimensional vector space. In our `BASELINE` algorithm (Section 7) we apply the two encoders with no additional training. Although the two networks achieve reasonable performance and robustness, motivated by several previous works [19][20][21] our aim is to fine-tune the two architectures in order to make them more proficient in our downstream task. Intuitively, by applying transfer learning we attempt to build two encoders ($\Psi^*, \Gamma^*$) which are trained to provide more refined text and image feature descriptions for our narrow domain. A more characterizing representation of visual and textual prompts should in turn lead to a more accurate image-text comparison and consequently to a better-informed choice of the final bounding box. Adopting a pretrained model not only saves computational costs and lets us benefit from state-of-the-art models without having to train one from scratch, but also reduces the carbon footprint, which is a growing concern in contemporary scientific literature [22].

As a first attempt to accomplish our objectives we experimented with standard fine-tuning procedures that add multilayer perceptrons on top of the frozen pretrained architecture. Specifically, we designed, developed, trained and evaluated the following architectures:
*   `net1` (Figure 14). In this architecture we fine-tune only the image encoder module of CLIP while preserving the pretrained weights of the textual counterpart. We add a non-linear activation function (ReLU) and a linear head on top of the 1024 CLIP features. The network is trained with the Stochastic Gradient Descent (SGD) optimizer. The implementation of this architecture is reported in Section 10.1.
*   `net2` (Figure 15). In this second standard fine-tuned architecture, we add a bottleneck on top of the 1024 features proposed by CLIP. Such a shrinkage encourages the network to compress feature representations to best fit in the available space. This time, both the image encoder and the text encoder are fine-tuned symmetrically. The network is trained with the SGD optimizer. The implementation of this architecture is reported in Section 10.1.
*   `net3` (Figure 16). This third architecture resembles the previous one. We maintain the idea of the bottleneck but we use a different activation function in the text encoder portion of the network, whose neurons are activated by a Sigmoid. Moreover, for the sake of experimenting as much as possible, this network is trained with the Adagrad optimizer instead of SGD. The implementation of this architecture is reported in Section 10.1.
*   `net4` (Figure 17). In this network we try to reduce the number of features of the final output layer. To accomplish this we append a linear head with 1024 input features and 512 output neurons on top of the CLIP text and image encoders. As in the case of `net1`, we use the ReLU activation function and the SGD optimizer.
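
As a rough sketch of what such a head looks like, the snippet below builds the 1024 to 512 projection described for `net4` and applies it to random tensors standing in for frozen CLIP features. The class name `ProjectionHead` is hypothetical, and the order of the layers (ReLU first, then the linear layer) is an assumption based on the textual description rather than on the actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ProjectionHead(nn.Module):
    """ReLU followed by a linear layer, applied on top of frozen encoder features."""

    def __init__(self, in_features: int = 1024, out_features: int = 512):
        super().__init__()
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(in_features, out_features))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.head(z)


torch.manual_seed(0)
img_head, txt_head = ProjectionHead(), ProjectionHead()

# random stand-ins for the 1024-dimensional CLIP features of 5 crops and 2 prompts
crops_z = torch.randn(5, 1024)
prompts_z = torch.randn(2, 1024)

crops_out = img_head(crops_z)      # [5, 512]
prompts_out = txt_head(prompts_z)  # [2, 512]

# cosine similarity between every prompt and every crop in the reduced space
similarity = F.normalize(prompts_out, dim=-1) @ F.normalize(crops_out, dim=-1).T
print(similarity.shape)
```

Only the parameters of the two heads would be passed to the optimizer in this setting, since the encoders underneath stay frozen.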

Throughout the epochs we keep track of the loss and of the accuracy on the training and validation sets. Recall that in this stage the algorithm should predict the bounding box proposed by DETR in step 1 with the highest IoU with the dataset ground truth; the most suitable loss function to minimize is therefore the cross-entropy loss. As a representative metric to quantify the accuracy of the predictions, we track the average intersection over union.
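
The following toy example illustrates how this objective can be set up: the proposals play the role of classes, and the target class is the proposal with the highest IoU with the ground truth. The score and IoU values are made up for illustration, and whether the actual implementation rescales the similarities before the loss is not shown here.

```python
import torch
import torch.nn.functional as F

# similarity scores between one referring expression and four proposals (toy values)
scores = torch.tensor([0.2, 1.5, -0.3, 0.9])
# IoU of each proposal with the ground-truth box (toy values)
proposal_ious = torch.tensor([0.10, 0.85, 0.00, 0.40])

# the "class" to predict is the proposal that best overlaps the ground truth
target = proposal_ious.argmax()

# cross_entropy expects [batch, classes] logits and [batch] class indices
loss = F.cross_entropy(scores.unsqueeze(0), target.unsqueeze(0))

# quantity tracked as accuracy: IoU of the proposal the model would pick
picked_iou = proposal_ious[scores.argmax()]
print(loss.item(), picked_iou.item())
```

Minimizing this loss pushes the score of the best proposal above the scores of the others, which is exactly what the final `argmax` selection relies on.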

We trained each of these architectures on Azure GPUs (see Section 3) for 1.5 hours in order to understand the effects of the various design choices. To save computational resources while still achieving a meaningful comparison of the proposed architectures, we systematically limited the dataset. As described in Section 5, the dataset contains more than one annotation for a given image. Consequently, to maximize the number of images seen by our model, we consider only one annotation for every image. Since the dataset is still too large for this first sequence of tests, we further reduced its size by selecting a random subset of training items.
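
This reduction can be sketched as below. The function name, the choice of keeping the first annotation encountered for each image, and the fixed seed are assumptions made for the example; the report does not specify which annotation is kept or how the random subset is drawn.

```python
import random


def one_annotation_per_image(samples, max_images, seed=0):
    """Keep one annotation per image, then draw a random subset of images."""
    first_annotation = {}
    for image_id, annotation in samples:
        # setdefault keeps the first annotation seen for each image
        first_annotation.setdefault(image_id, annotation)

    rng = random.Random(seed)  # fixed seed so that the subset is reproducible
    image_ids = sorted(first_annotation)
    chosen = rng.sample(image_ids, k=min(max_images, len(image_ids)))
    return [(image_id, first_annotation[image_id]) for image_id in chosen]


# toy (image id, referring expression) pairs
samples = [
    (1, "the dog on the left"),
    (1, "a brown dog"),
    (2, "red car"),
    (3, "man with a hat"),
    (3, "person standing"),
]
subset = one_annotation_per_image(samples, max_images=2)
print(subset)
```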

The obtained results are reported in the following tables and graphs. As we can see, all the networks perform decently well and, except for `net3`, they all display well-behaved training curves. Looking at the loss and accuracy curves of the third standard fine-tuned neural network, we see that Adagrad converges much more slowly than Stochastic Gradient Descent.
From these preliminary tests, we notice that although the standard fine-tuning approach is reasonable, it does not seem able to achieve satisfactory improvements with respect to the `BASELINE` solution. In the following section we describe an alternative fine-tuning solution based on contrastive learning, which immediately displays more promising results.


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/14.png)

**Figure 14**

`net1`

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/15.png)

**Figure 15**

`net2`

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/16.png)

**Figure 16**

`net3`

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/17.png)

**Figure 17**

`net4`

![loss-net1.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/18.png)

**Figure 18**

`net1` loss

![loss-net2.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/19.png)

**Figure 19**

`net2` loss

![loss-net3.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/20.png)

**Figure 20**

`net3` loss

![loss-net4.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/21.png)

**Figure 21**

`net4` loss

![acc-net1.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/22.png)

**Figure 22**

`net1` acc

![acc-net2.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/23.png)

**Figure 23**

`net2` acc

![acc-net3.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/24.png)

**Figure 24**

`net3` acc

![acc-net4.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/25.png)

**Figure 25**

`net4` acc

### 10.1 Code

In [None]:
class ClipSfCore(nn.Module):
    def __init__(
        self,
        img_encoder: nn.Module,
        txt_encoder: nn.Module,
    ):
        super().__init__()
        self.img_encoder = img_encoder
        self.txt_encoder = txt_encoder

    def cosine_similarity(
        self,
        crops_z: Float[torch.Tensor, "crops 1024"],
        prompts_z: Float[torch.Tensor, "prompts 1024"],
    ) -> Float[torch.Tensor, "prompts crops"]:
        # normalise the image and the text
        crops_z: Float[torch.Tensor, "crops 1024"] = crops_z / crops_z.norm(dim=-1, keepdim=True)
        prompts_z: Float[torch.Tensor, "prompts 1024"] = prompts_z / prompts_z.norm(dim=-1, keepdim=True)

        # evaluate the cosine similarity between the sets of features
        return prompts_z @ crops_z.T

    def forward(
        self,
        crops: Float[torch.Tensor, "crops 3 244 244"],
        prompts: Int[torch.Tensor, "prompts 77"],
    ) -> Float[torch.Tensor, "crops 1"]:
        # step 1: compute crop representation in the latent space
        crop_z: Float[torch.Tensor, "crops 1024"] = self.img_encoder(crops)

        # step 2: compute prompt representation in the latent space
        prompt_z: Float[torch.Tensor, "prompts 1024"] = self.txt_encoder(prompts)

        # step 3: evaluate logits
        similarity_matrix: Float[torch.Tensor, "prompts crops"] = self.cosine_similarity(crop_z, prompt_z)

        # step 4: crops classification
        return torch.mean(similarity_matrix, dim=0)

In [None]:
class ClipSf(nn.Module):
    def __init__(
        self,
        img_encoder: nn.Module,
        txt_encoder: nn.Module,
    ):
        super().__init__()
        self.img_preprocess: Compose = preprocess
        self.txt_preprocess: t.Callable[[t.Union[str, list[str]]], Float[torch.Tensor, "77"]] = clip.tokenize
        self.core = ClipSfCore(img_encoder, txt_encoder)

    def forward(self, crops: list[TensorImage], prompts: list[str]) -> Float[torch.Tensor, "crops 1"]:
        # step 1: preprocess crops as required by the visual encoder
        with torch.no_grad():
            crops_preprocessed: Float[torch.Tensor, "crops 3 244 244"] = torch.stack([
                self.img_preprocess(crop)
                for crop in crops
            ])

        # step 2: preprocess prompts as required by the text encoder
        with torch.no_grad():
            prompts_preprocessed: Int[torch.Tensor, "prompts 77"] = self.txt_preprocess(prompts)

        return self.core(crops_preprocessed, prompts_preprocessed)
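
To make the scoring rule of `ClipSfCore` concrete: every candidate crop receives the mean of its cosine similarities with all the prompts of the annotation, and the crop with the highest score is the one selected. The sketch below reproduces this rule in plain Python on hand-made 2-dimensional embeddings. The vectors and the helper names `cosine` and `score_crops` are illustrative assumptions; the real embeddings are 1024-dimensional tensors.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def score_crops(crops_z: list[list[float]], prompts_z: list[list[float]]) -> list[float]:
    """One score per crop: cosine similarity averaged over the prompts."""
    return [
        sum(cosine(prompt_z, crop_z) for prompt_z in prompts_z) / len(prompts_z)
        for crop_z in crops_z
    ]


# toy embeddings: 3 candidate crops, 2 prompts describing the same object
crops_z = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
prompts_z = [[0.0, 2.0], [1.0, 3.0]]

scores = score_crops(crops_z, prompts_z)

# index of the selected crop, i.e. the argmax of the scores
predicted_crop = max(range(len(scores)), key=scores.__getitem__)
```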

---

In [None]:
_ = lambda params: torch.optim.SGD(params=params, lr=.01, weight_decay=.01, momentum=.9)

eval_summary(
    ClipSf(
        img_encoder=nn.Sequential(
            clip_frozen_img_encoder,
            nn.ReLU(),
            nn.Linear(1024, 1024)
        ),
        txt_encoder=clip_frozen_txt_encoder,
    ).to(device).core
)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
ClipSfCore                               [5, 3, 244, 244]          [5]                       --                        True
├─Sequential: 1-1                        [5, 3, 244, 244]          [5, 1024]                 --                        True
│    └─ClipFrozenImgEnc: 2-1             [5, 3, 244, 244]          [5, 1024]                 --                        --
│    └─ReLU: 2-2                         [5, 1024]                 [5, 1024]                 --                        --
│    └─Linear: 2-3                       [5, 1024]                 [5, 1024]                 1,049,600                 True
├─ClipFrozenTxtEnc: 1-2                  [2, 77]                   [2, 1024]                 --                        --
Total params: 1,049,600
Trainable params: 1,049,600
Non-trainable params: 0
Total mult-adds (M): 5.25
Input size (MB): 3.57
Forward/b

In [None]:
_ = lambda params: torch.optim.SGD(params=params, lr=.01, weight_decay=.01, momentum=.9)

eval_summary(
    ClipSf(
        img_encoder=nn.Sequential(
            clip_frozen_img_encoder,
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 1024),
        ),
        txt_encoder=nn.Sequential(
            clip_frozen_txt_encoder,
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 1024),
        ),
    ).to(device).core
)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
ClipSfCore                               [5, 3, 244, 244]          [5]                       --                        True
├─Sequential: 1-1                        [5, 3, 244, 244]          [5, 1024]                 --                        True
│    └─ClipFrozenImgEnc: 2-1             [5, 3, 244, 244]          [5, 1024]                 --                        --
│    └─ReLU: 2-2                         [5, 1024]                 [5, 1024]                 --                        --
│    └─Linear: 2-3                       [5, 1024]                 [5, 512]                  524,800                   True
│    └─ReLU: 2-4                         [5, 512]                  [5, 512]                  --                        --
│    └─Linear: 2-5                       [5, 512]                  [5, 256]                  131,328                   True
│    └─Re

In [None]:
_ = lambda params: torch.optim.Adadelta(params=params, lr=.0015, weight_decay=.01)

eval_summary(
    ClipSf(
        img_encoder=nn.Sequential(
            clip_frozen_img_encoder,
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 1024),
        ),
        txt_encoder=nn.Sequential(
            clip_frozen_txt_encoder,
            nn.Sigmoid(),
            nn.Linear(1024, 512),
            nn.Sigmoid(),
            nn.Linear(512, 256),
            nn.Sigmoid(),
            nn.Linear(256, 1024),
        ),
    ).to(device).core
)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
ClipSfCore                               [5, 3, 244, 244]          [5]                       --                        True
├─Sequential: 1-1                        [5, 3, 244, 244]          [5, 1024]                 --                        True
│    └─ClipFrozenImgEnc: 2-1             [5, 3, 244, 244]          [5, 1024]                 --                        --
│    └─ReLU: 2-2                         [5, 1024]                 [5, 1024]                 --                        --
│    └─Linear: 2-3                       [5, 1024]                 [5, 512]                  524,800                   True
│    └─ReLU: 2-4                         [5, 512]                  [5, 512]                  --                        --
│    └─Linear: 2-5                       [5, 512]                  [5, 256]                  131,328                   True
│    └─Re

In [None]:
_ = lambda params: torch.optim.SGD(params=params, lr=.01, weight_decay=.01, momentum=.9)

eval_summary(
    ClipSf(
        img_encoder=nn.Sequential(
            clip_frozen_img_encoder,
            nn.ReLU(),
            nn.Linear(1024, 512),
        ),
        txt_encoder=nn.Sequential(
            clip_frozen_txt_encoder,
            nn.ReLU(),
            nn.Linear(1024, 512),
        ),
    ).to(device).core
)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
ClipSfCore                               [5, 3, 244, 244]          [5]                       --                        True
├─Sequential: 1-1                        [5, 3, 244, 244]          [5, 512]                  --                        True
│    └─ClipFrozenImgEnc: 2-1             [5, 3, 244, 244]          [5, 1024]                 --                        --
│    └─ReLU: 2-2                         [5, 1024]                 [5, 1024]                 --                        --
│    └─Linear: 2-3                       [5, 1024]                 [5, 512]                  524,800                   True
├─Sequential: 1-2                        [2, 77]                   [2, 512]                  --                        True
│    └─ClipFrozenTxtEnc: 2-4             [2, 77]                   [2, 1024]                 --                        --
│    └─Re

---

In [None]:
loss_fn: t.Callable[[Float[torch.Tensor, "crops"], Int[torch.Tensor, "1"]], Float[torch.Tensor, "1"]] = nn.functional.cross_entropy
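
With `nn.functional.cross_entropy`, the scores of the candidate crops are treated as the logits of a classification problem whose classes are the crops themselves, and whose target is the index of the crop that best matches the ground truth box. As a worked example in plain Python (the scores and the helper name `cross_entropy` are made up for illustration):

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability that softmax(logits) assigns to the target index."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[target]


scores = [0.21, 0.35, 0.28]  # one illustrative score per candidate crop

loss_right = cross_entropy(scores, target=1)  # the target is the top-scored crop
loss_wrong = cross_entropy(scores, target=0)  # the target is the worst-scored crop
```

Because the scores produced by `ClipSfCore` are averaged cosine similarities, they lie in $[-1, 1]$; with $n$ candidate crops the loss therefore cannot fall below $\log(1 + (n - 1)e^{-2})$.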

In [None]:
def training_step(
        model: nn.Module,
        data_loader: DataLoader[
            tuple[
                tuple[TensorImage, ...],
                tuple[str, ...],
                int,
                Float[torch.Tensor, "crops 4"],
                Float[torch.Tensor, "1 4"],
            ]
        ],
        optimizer: torch.optim.Optimizer,
) -> tuple[float, float]:
    model.train()

    running_loss: float = 0
    running_acc: float = 0
    progress = tqdm(data_loader, desc="training")

    cropss: tuple[tuple[TensorImage, ...], ...]
    promptss: tuple[tuple[str, ...], ...]
    true_is: tuple[int, ...]
    xyxyss: tuple[Float[torch.Tensor, "crops 4"], ...]
    true_xyxys: tuple[Float[torch.Tensor, "1 4"], ...]

    for iter, (cropss, promptss, true_is, xyxyss, true_xyxys) in zip(it.count(1), progress):
        # forward pass
        preds: list[Float[torch.Tensor, "crops"]] = [
            model(crops, prompts) for crops, prompts in zip(cropss, promptss)
        ]

        # calculate loss
        losses: Float[torch.Tensor, "batch"] = torch.stack([
            loss_fn(pred, torch.tensor(true_i))
            for pred, true_i in zip(preds, true_is)
        ])
        loss: Float[torch.Tensor, "1"] = torch.mean(losses)
        running_loss += loss.item()

        # optimizer zero grad
        optimizer.zero_grad()

        # loss backward
        loss.backward()

        # optimizer step
        optimizer.step()

        # calculate IoU accuracy
        with torch.inference_mode():
            # # get indexes of the predicted bounding box to compute IoU accuracy
            pred_is: list[int] = [
                torch.argmax(pred).item()
                for pred in preds
            ]

            # # get predicted bounding boxes
            pred_xyxys: list[Float[torch.Tensor, "4"]] = [
                xyxys[pred_i]
                for xyxys, pred_i in zip(xyxyss, pred_is)
            ]

            # # IoU
            acc: float = torch.mean(box_iou(torch.cat(true_xyxys), torch.stack(pred_xyxys)).diagonal()).item()
            running_acc += acc

            progress.set_postfix(
                {
                    "loss": running_loss / iter,
                    "iou": running_acc / iter,
                },
                refresh=False,
            )

    return running_loss / len(data_loader), running_acc / len(data_loader)

In [None]:
def test_step(
        model: nn.Module,
        data_loader: DataLoader[tuple[TensorImage, list[str], Float[torch.Tensor, "X 4"], Float[torch.Tensor, "4"]]],
) -> tuple[float, float]:
    model.eval()

    running_loss: float = 0
    running_acc: float = 0
    progress = tqdm(data_loader, desc="testing")

    with torch.inference_mode():
        img: TensorImage
        prompts: list[str]
        xyxys: Float[torch.Tensor, "crops 4"]
        true_xyxy: Float[torch.Tensor, "4"]

        for iter, (img, prompts, xyxys, true_xyxy) in zip(it.count(1), progress):
            true_i: int = best_bbox(xyxys, true_xyxy)

            # from xyxys to crops
            xywhs: Int[torch.Tensor, "X 4"] = box_convert(xyxys, in_fmt="xyxy", out_fmt="xywh").round().int()

            crops: list[TensorImage] = [
                crop(img, top=y, left=x, height=h, width=w)
                for xywh in xywhs
                for [x, y, w, h] in [xywh.tolist()]
            ]

            # forward pass
            model_output: Float[torch.Tensor, "crops"] = model(crops, prompts)

            # calculate loss
            loss: float = loss_fn(model_output, torch.tensor(true_i)).item()
            running_loss += loss

            # calculate IoU accuracy

            # # get index of the predicted bounding box to compute IoU accuracy
            pred_i: int = torch.argmax(model_output).item()

            # # get predicted bounding box
            pred_xyxy: Float[torch.Tensor, "4"] = xyxys[pred_i]

            # # IoU
            acc: float = box_iou(true_xyxy, pred_xyxy.unsqueeze(0)).item()
            running_acc += acc

            progress.set_postfix(
                {
                    "loss": running_loss / iter,
                    "iou": running_acc / iter,
                },
                refresh=False,
            )

        return running_loss / len(data_loader), running_acc / len(data_loader)
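
The accuracy reported by both steps is the intersection over union (IoU) between the ground truth box and the selected proposal, which the notebook obtains from `box_iou`. For reference, the quantity for a single pair of boxes in `xyxy` format can be sketched in plain Python as follows; `iou_xyxy` is a hypothetical helper and is not part of the notebook.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou_xyxy(a: Box, b: Box) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # width and height of the overlap, clamped at 0 when the boxes are disjoint
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


true_box = (0.0, 0.0, 10.0, 10.0)
pred_box = (5.0, 0.0, 15.0, 10.0)  # same size, shifted right by half its width

iou = iou_xyxy(true_box, pred_box)  # overlap 50, union 150
```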

---

In [None]:
keys: list[str] = [f"net{i + 1}" for i in range(4)]

In [None]:
pd.concat(
    [
        pd.read_csv(f"assets/standard-finetuning/eval-{key}.csv", index_col=0)
        for key in keys
    ],
    axis=1,
    keys=keys,
)

|       | net1 iou | net1 cos similarity | net1 euclidean distance | net2 iou | net2 cos similarity | net2 euclidean distance | net3 iou | net3 cos similarity | net3 euclidean distance | net4 iou | net4 cos similarity | net4 euclidean distance |
|-------|---------:|--------------------:|------------------------:|---------:|--------------------:|------------------------:|---------:|--------------------:|------------------------:|---------:|--------------------:|------------------------:|
| count | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 |
| mean | 0.457677 | 0.852569 | 0.954881 | 0.380285 | 0.813678 | 1.112312 | 0.355783 | 0.804044 | 1.150898 | 0.405283 | 0.83384 | 1.026658 |
| std | 0.412066 | 0.140639 | 0.531035 | 0.3969 | 0.157147 | 0.576089 | 0.390836 | 0.158404 | 0.575043 | 0.408287 | 0.145446 | 0.542099 |
| min | 0.0 | 0.285263 | 0.0 | 0.0 | 0.276203 | 0.0 | 0.0 | 0.259929 | 0.0 | 0.0 | 0.290687 | 0.0 |
| 25% | 0.028989 | 0.757906 | 0.445398 | 0.016817 | 0.698765 | 0.528522 | 0.006893 | 0.688604 | 0.575107 | 0.007908 | 0.731535 | 0.482025 |
| 50% | 0.325832 | 0.895899 | 0.932816 | 0.178222 | 0.839692 | 1.153889 | 0.14899 | 0.825239 | 1.205171 | 0.202125 | 0.865183 | 1.051583 |
| 75% | 0.93584 | 0.976137 | 1.367539 | 0.901817 | 0.967429 | 1.566146 | 0.867523 | 0.961532 | 1.606206 | 0.921305 | 0.971797 | 1.455355 |
| max | 0.99605 | 1.0 | 2.881575 | 0.998143 | 1.0 | 2.881575 | 0.99605 | 1.0 | 2.998202 | 0.998143 | 1.0 | 2.559178 |


**Table 4**

In [None]:
pd.read_csv("assets/standard-finetuning/comparing.csv", index_col=0)

|      | mAP[IoU .3] | mAP[IoU .5] | mAP[IoU .7] | mIoU | mCos | mED |
|------|------------:|------------:|------------:|-----:|-----:|----:|
| net1 | 0.511248 | 0.44555 | 0.401354 | 0.457677 | 0.852569 | 0.954881 |
| net2 | 0.414692 | 0.348995 | 0.314155 | 0.380285 | 0.813678 | 1.112312 |
| net3 | 0.386024 | 0.323512 | 0.290066 | 0.355783 | 0.804044 | 1.150898 |
| net4 | 0.448935 | 0.385626 | 0.347999 | 0.405283 | 0.83384 | 1.026658 |


**Table 5**
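
This section does not show how the columns of Table 5 are derived from the per-sample measurements of Table 4. A plausible reading, stated here as an assumption, is that `mAP[IoU t]` is the fraction of test samples whose predicted box reaches an IoU of at least `t` with the ground truth, while `mIoU` is the plain average (it coincides with the mean IoU of Table 4). Under that assumption the aggregation could be sketched as follows, with made-up IoU values and a hypothetical helper name:

```python
def summarize(
    ious: list[float],
    thresholds: tuple[float, ...] = (0.3, 0.5, 0.7),
) -> dict[str, float]:
    """Mean IoU plus, for every threshold, the share of samples with IoU >= threshold."""
    summary = {"mIoU": sum(ious) / len(ious)}
    for threshold in thresholds:
        hits = sum(iou >= threshold for iou in ious)
        summary[f"IoU>={threshold}"] = hits / len(ious)
    return summary


ious = [0.0, 0.2, 0.4, 0.6, 0.8]  # illustrative values, not taken from the experiments
summary = summarize(ious)
```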

## 11 Fine-tune like you pretrain

`FLYP`


In [None]:
%tensorboard --logdir ./assets/flyp/runs --port 6003

In [None]:
%tensorboard --logdir ./assets/flyp-solve-overfitting/runs --port 6004

In [None]:
%tensorboard --logdir ./assets/flyp-optuna/runs --port 6005

In [None]:
%tensorboard --logdir ./assets/flyp-augmented/runs --port 6006

One of the trickiest aspects of standard fine-tuning lies in understanding the effect of each small architectural change, since there is no simple rule of thumb for choosing the right modification. As an alternative to standard fine-tuning techniques, inspired by the work of Sachin Goyal et al., "Finetune like you pretrain: Improved finetuning of zero-shot vision models" [5], we propose to fine-tune the visual and textual encoder components of CLIP with the same contrastive objective employed by the original authors to pretrain the model.

More precisely, the purpose of CLIP is to learn a multi-modal embedding of image and text.

Let $\Psi: \mathcal{I} \rightarrow \mathbb{R}^d$ denote the image encoder that maps an image to a $d$-dimensional image-text embedding space. $\Psi$ is parameterized by $\theta_{\text{img}}$.

Let $\mathcal{P}$ be the space of text descriptions of images. Analogously, $\Gamma: \mathcal{P} \rightarrow \mathbb{R}^d$ is the language encoder with model parameters $\theta_{\text{text}}$.

The backbone of the pretraining objective is contrastive learning, where the goal is to align the embedding $\Psi(I_i)$ of an image close to the embedding $\Gamma(P_i)$ of its corresponding text description, and away from other text embeddings $\Gamma(P_j)$ in the batch.

Given a batch of $B$ images with their corresponding text descriptions $D = \{(I_1, P_1), \ldots, (I_B, P_B)\}$, the pretraining objective is defined as follows:

$$
\mathcal{L}_{\text{pre}}(D, \theta) :=
\sum_{i=1}^{B} -\log{\frac{\exp(\bar{\Psi}(I_i) \cdot \bar{\Gamma}(P_i))}{\sum_{j=1}^{B}\exp(\bar{\Psi}(I_i) \cdot \bar{\Gamma}(P_j))}} + \sum_{i=1}^{B} -\log{\frac{\exp(\bar{\Psi}(I_i) \cdot \bar{\Gamma}(P_i))}{\sum_{j=1}^{B}\exp(\bar{\Psi}(I_j) \cdot \bar{\Gamma}(P_i))}}
$$

where

- $\theta = [\theta_{\text{img}}, \theta_{\text{text}}]$ are the image and text encoder parameters
- $\bar{\Psi}$ and $\bar{\Gamma}$ are the $\ell_2$-normalized versions of $\Psi$ and $\Gamma$ respectively.

In other words, by minimizing this objective function, the algorithm maximizes the cosine similarity of the image and text embeddings of the $B$ matching pairs in the batch while minimizing the cosine similarity of the embeddings of the $B^2-B$ incorrect pairings.

Following this promising direction, we have designed, developed and trained a contrastive learning architecture, summarized in Figure 26. As depicted in the illustration, we append a symmetric linear head to both pre-trained CLIP encoders to host our trainable parameters. Moreover, to train the architecture we have implemented a second customized dataset class together with the corresponding data loaders, since the input and output of the network to be trained change significantly. In order to apply contrastive learning, at each iteration we need to compare a collection of $B \in \mathbb{N}$ ground truth bounding boxes with their corresponding $B$ textual descriptions. Throughout the training epochs the goal is to maximize the cosine similarity between the crop-text pairs which actually occur in the batch while minimizing the cosine similarity of the other pairs. To this end we have adapted the loss function proposed in [this GitHub repository](https://github.com/locuslab/FLYP/), whose code has been made available by the authors of the aforementioned "Finetune like you pretrain" paper [5]. [The official GitHub repository of CLIP](https://github.com/OpenAI/CLIP), on the other hand, does not include the implementation of the original procedure used to train the model. We have therefore also adapted to our purposes the code published at [this GitHub repository](https://github.com/mlfoundations/open_clip/), whose goal is to enable the training of models with contrastive image-text supervision. The resulting implementation closely follows the pseudocode proposed in the CLIP paper, which we report here:

```python
# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T)  #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
```
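
The pseudocode above is not directly executable, since `l2_normalize` and `cross_entropy_loss` are placeholders. The following is a minimal runnable sketch of the same symmetric loss in plain Python, starting from embeddings that are already projected into the joint space. The function names, the toy embeddings and the fixed temperature `t` are assumptions made for illustration; the implementation used in this notebook is the `ClipLoss` module of Section 11.2.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits: list[float], target: int) -> float:
    shift = max(logits)  # for numerical stability
    return shift + math.log(sum(math.exp(x - shift) for x in logits)) - logits[target]


def clip_loss(imgs: list[list[float]], txts: list[list[float]], t: float) -> float:
    """Symmetric contrastive loss where imgs[i] and txts[i] form a matching pair."""
    n = len(imgs)
    imgs_e = [l2_normalize(v) for v in imgs]
    txts_e = [l2_normalize(v) for v in txts]

    # scaled pairwise cosine similarities: logits[i][j] compares image i with text j
    scale = math.exp(t)
    logits = [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in txts_e]
        for img in imgs_e
    ]

    # image -> text: every row is classified against all the texts
    loss_i = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> image: every column is classified against all the images
    loss_t = sum(cross_entropy([row[j] for row in logits], j) for j in range(n)) / n

    return (loss_i + loss_t) / 2


# toy 2-dimensional embeddings: imgs[i] points roughly in the direction of txts[i]
imgs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
txts = [[0.9, 0.0], [0.0, 0.8], [-0.7, 0.1]]
t = math.log(1 / 0.07)  # same initial value as `logit_scale` in Section 11.2

aligned = clip_loss(imgs, txts, t)
shuffled = clip_loss(imgs, txts[::-1], t)  # first and last text swapped
```

In this toy setting `aligned` is much smaller than `shuffled`, since reversing the list of texts breaks two of the three correct pairings.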

The resulting training procedure is computationally efficient in terms of both time and memory consumption. Consequently, we have been able to train our fine-tuned image and text encoders on the whole dataset for many epochs in a reasonable amount of time. Also in this case, stochastic gradient descent has been the optimizer of choice. As regards the cardinality of the batches, contrastive learning procedures tend to provide better outcomes with larger batch sizes [23]. Intuitively, with only one (image, text) pair at a time, i.e. a batch size of 1, there are no negative pairs: each softmax is taken over a single logit, so the contrastive loss is identically zero and provides no training signal. With our implementation we have succeeded in using a batch size of 1024 (image, text) pairs without exceeding memory limits.

Once the two multi-modal encoders $\Psi^*, \Gamma^*$ have been symmetrically fine-tuned with the above-mentioned training strategy, we have tested them on our downstream task. The obtained results are reported at the end of this section. The model outperforms all the previously proposed standard fine-tuning networks.


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/26.png)

**Figure 26**

![Screenshot 2023-09-01 at 01.23.15.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/27.png)

**Figure 27**

`FLYP` loss

### 11.1 Observations

`FLYP solve overfitting`

Looking at the training curves depicted in Figure 27, it is evident that the model starts to overfit around epoch 10. On the positive side, this behavior strongly suggests that there is still room for improvement with this architecture.

Pursuing this objective, we have refined the neural network and the training procedure by incorporating regularization techniques in order to prevent overfitting. Specifically, in our improved implementation we consider the following strategies.

- early stopping: we save the state of the network at the end of every epoch, which barely affects the speed of the computation. After training, we can then manually inspect the training and validation error and select the best version of the model.
- dropout: we add a dropout layer that, at each training step, randomly zeroes out a fraction of the activations it receives (25% in `ClipFlypCore`).
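
The checkpoint selection behind this form of early stopping can be sketched as follows. The per-epoch validation losses are invented for the example and `best_epoch` is a hypothetical helper; in our experiments the choice was made by inspecting the curves manually.

```python
def best_epoch(val_losses: list[float]) -> int:
    """Return the 1-based epoch whose checkpoint has the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# illustrative validation losses: they decrease, then rise again once the model overfits
val_losses = [2.10, 1.64, 1.31, 1.18, 1.22, 1.35, 1.51]

selected = best_epoch(val_losses)  # epoch of the checkpoint to restore after training
```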

The effects of these regularization strategies are shown at the end of this subsection. The training curves now exhibit the desired descending behavior. The encouraging trend is visually reflected in the matrices shown in Figure 43, Figure 44 and Figure 45, in which the diagonal becomes progressively brighter throughout the training process.

Consistently with the improvements in the training objective, the evaluation on the visual grounding downstream task has also produced very good results, as reported in Table 10.


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/28.png)

**Figure 28**

Baseline prediction.

*a chili dog with slices of cheese visible under the chili.*


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/29.png)

**Figure 29**

Contrastive learning model prediction.

*a chili dog with slices of cheese visible under the chili.*


---

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/30.png)

**Figure 30**

"train" split

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/31.png)

**Figure 31**

"test" split

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/32.png)

**Figure 32**

"val" split

![Screenshot 2023-09-01 at 01.23.36.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/33.png)

**Figure 33**

`FLYP solve overfitting`

### 11.2 Code

In [None]:
class ClipFlypCore(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_encoder = nn.Sequential(
            clip_frozen_img_encoder,
            nn.ReLU(),
            nn.Dropout(.25),
            nn.Linear(1024, 1024),
        )
        self.txt_encoder = nn.Sequential(
            clip_frozen_txt_encoder,
            nn.ReLU(),
            nn.Dropout(.25),
            nn.Linear(1024, 1024),
        )
        # the temperature parameter is added as suggested by the original paper in order to prevent training instability
        self.logit_scale: Float[torch.Tensor, "1"] = nn.Parameter(
            torch.log(torch.tensor(1 / 0.07))
        )

    def forward(
        self,
        crop: Float[torch.Tensor, "entries 3 244 244"],
        prompt: Int[torch.Tensor, "entries 77"],
    ) -> tuple[
        Float[torch.Tensor, "entries 1024"],
        Float[torch.Tensor, "entries 1024"],
        Float[torch.Tensor, "1"],
    ]:
        # step 1: compute crop representation in the latent space
        crop_z: Float[torch.Tensor, "entries 1024"] = self.img_encoder(crop)

        # step 2: compute prompt representation in the latent space
        prompt_z: Float[torch.Tensor, "entries 1024"] = self.txt_encoder(prompt)

        return crop_z, prompt_z, self.logit_scale.exp()
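
The parameter `logit_scale` is stored in log space and exponentiated in `forward`, so its initial value corresponds to a multiplier of $1 / 0.07 \approx 14.3$ applied to the similarities. The following plain-Python sketch, with made-up cosine similarities, shows how this multiplier sharpens the softmax distribution compared with unscaled similarities:

```python
import math


def softmax(logits: list[float]) -> list[float]:
    shift = max(logits)  # for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logit_scale = math.exp(math.log(1 / 0.07))  # value returned by forward() at initialization

similarities = [0.30, 0.25, 0.20]  # illustrative cosine similarities of one row

unscaled = softmax(similarities)
scaled = softmax([logit_scale * s for s in similarities])
```

With the scaling, the probability assigned to the most similar entry grows while the one assigned to the least similar entry shrinks, which gives the cross entropy a stronger gradient for small differences in cosine similarity.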

In [None]:
class ClipFlyp(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_preprocess: Compose = preprocess
        self.txt_preprocess: t.Callable[[t.Union[str, list[str]]], Float[torch.Tensor, "77"]] = clip.tokenize
        self.core = ClipFlypCore()

    def forward(
        self, entries: list[tuple[TensorImage, str]]
    ) -> tuple[
        Float[torch.Tensor, "entries 1024"],
        Float[torch.Tensor, "entries 1024"],
        Float[torch.Tensor, "1"],
    ]:
        # step 1: preprocess crops as required by the visual encoder
        with torch.no_grad():
            crops_preprocessed: Float[torch.Tensor, "entries 3 244 244"] = torch.stack([
                self.img_preprocess(crop)
                for crop, _ in entries
            ])

        # step 2: preprocess prompts as required by the text encoder
        with torch.no_grad():
            prompts_preprocessed: Int[torch.Tensor, "entries 77"] = self.txt_preprocess([
                prompt
                for _, prompt in entries
            ])

        return self.core(crops_preprocessed, prompts_preprocessed)

In [None]:
_ = lambda params: torch.optim.SGD(params=params, lr=.01, weight_decay=.01, momentum=.9)
_ = lambda params: torch.optim.Adam(params=params, lr=.00043, weight_decay=.01)

contrastive_summary(
    ClipFlyp().to(device).core
)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
ClipFlypCore                             [8, 3, 244, 244]          [8, 1024]                 1                         True
├─Sequential: 1-1                        [8, 3, 244, 244]          [8, 1024]                 --                        True
│    └─ClipFrozenImgEnc: 2-1             [8, 3, 244, 244]          [8, 1024]                 --                        --
│    └─ReLU: 2-2                         [8, 1024]                 [8, 1024]                 --                        --
│    └─Dropout: 2-3                      [8, 1024]                 [8, 1024]                 --                        --
│    └─Linear: 2-4                       [8, 1024]                 [8, 1024]                 1,049,600                 True
├─Sequential: 1-2                        [8, 77]                   [8, 1024]                 --                        True
│    └─Cl

---

In [None]:
class ClipFlypEvalCore(nn.Module):
    def __init__(
        self,
        img_encoder: nn.Module,
        txt_encoder: nn.Module,
    ):
        super().__init__()
        self.img_encoder = img_encoder
        self.txt_encoder = txt_encoder

    def cosine_similarity(
        self,
        crops_z: Float[torch.Tensor, "crops 1024"],
        prompts_z: Float[torch.Tensor, "prompts 1024"],
    ) -> Float[torch.Tensor, "prompts crops"]:
        # normalise the image and the text
        crops_z: Float[torch.Tensor, "crops 1024"] = crops_z / crops_z.norm(dim=-1, keepdim=True)
        prompts_z: Float[torch.Tensor, "prompts 1024"] = prompts_z / prompts_z.norm(dim=-1, keepdim=True)

        # evaluate the cosine similarity between the sets of features
        return prompts_z @ crops_z.T

    def forward(
        self,
        crops: Float[torch.Tensor, "crops 3 244 244"],
        prompts: Int[torch.Tensor, "prompts 77"],
    ) -> Float[torch.Tensor, "crops 1"]:
        # step 1: compute crop representation in the latent space
        crop_z: Float[torch.Tensor, "crops 1024"] = self.img_encoder(crops)

        # step 2: compute prompt representation in the latent space
        prompt_z: Float[torch.Tensor, "prompts 1024"] = self.txt_encoder(prompts)

        # step 3: evaluate logits
        similarity_matrix: Float[torch.Tensor, "prompts crops"] = self.cosine_similarity(crop_z, prompt_z)

        # step 4: crops classification
        return torch.mean(similarity_matrix, dim=0)

In [None]:
class ClipFlypEval(nn.Module):
    def __init__(
        self,
        img_encoder: nn.Module,  # visual encoder
        txt_encoder: nn.Module,  # natural language prompts encoder
    ):
        super().__init__()
        self.img_preprocess: Compose = preprocess
        self.txt_preprocess: t.Callable[[t.Union[str, list[str]]], Float[torch.Tensor, "77"]] = clip.tokenize
        self.core = ClipFlypEvalCore(img_encoder, txt_encoder)

    def forward(
        self, crops: list[TensorImage], prompts: list[str]
    ) -> Float[torch.Tensor, "crops 1"]:
        # step 1: preprocess crops as required by the visual encoder
        with torch.no_grad():
            crops_preprocessed: Float[torch.Tensor, "crops 3 244 244"] = torch.stack([
                self.img_preprocess(crop)
                for crop in crops
            ])

        # step 2: preprocess prompts as required by the text encoder
        with torch.no_grad():
            prompts_preprocessed: Int[torch.Tensor, "prompts 77"] = self.txt_preprocess(prompts)

        return self.core(crops_preprocessed, prompts_preprocessed)

In [None]:
eval_summary(
    ClipFlypEval(
        ClipFlyp().core.img_encoder,
        ClipFlyp().core.txt_encoder,
    ).to(device).core
)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
ClipFlypEvalCore                         [5, 3, 244, 244]          [5]                       --                        True
├─Sequential: 1-1                        [5, 3, 244, 244]          [5, 1024]                 --                        True
│    └─ClipFrozenImgEnc: 2-1             [5, 3, 244, 244]          [5, 1024]                 --                        --
│    └─ReLU: 2-2                         [5, 1024]                 [5, 1024]                 --                        --
│    └─Dropout: 2-3                      [5, 1024]                 [5, 1024]                 --                        --
│    └─Linear: 2-4                       [5, 1024]                 [5, 1024]                 1,049,600                 True
├─Sequential: 1-2                        [2, 77]                   [2, 1024]                 --                        True
│    └─Cl

---

In [None]:
class ClipLoss(nn.Module):
    def forward(
        self,
        imgs_features: Float[torch.Tensor, "entries 1024"],
        txts_features: Float[torch.Tensor, "entries 1024"],
        logit_scale: Float[torch.Tensor, "1"],
    ) -> Float[torch.Tensor, "1"]:
        # compute logits per image and logits per text
        logits_per_image: Float[torch.Tensor, "entries entries"] = logit_scale * imgs_features @ txts_features.T
        logits_per_text: Float[torch.Tensor, "entries entries"] = logit_scale * txts_features @ imgs_features.T

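        # get ground truth labels for the computation of the cross entropy loss
        # (created on the same device as the logits to avoid a device mismatch)
        labels: Int[torch.Tensor, "entries"] = torch.arange(logits_per_image.shape[0], device=logits_per_image.device)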

        return torch.stack((
            nn.functional.cross_entropy(logits_per_image, labels),
            nn.functional.cross_entropy(logits_per_text, labels),
        )).mean()
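
To make the symmetric objective computed by `ClipLoss` concrete, the following is a minimal sketch in plain Python (no torch, so that it can be run anywhere). The function names, the toy 2x2 similarity matrices and the `scale` value are assumptions made for illustration and are not part of the implementation above.

```python
import math


def cross_entropy(logits: list[list[float]], labels: list[int]) -> float:
    """Mean cross-entropy over rows, computed with a stable log-sum-exp."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[label]
    return total / len(logits)


def symmetric_contrastive_loss(similarity: list[list[float]], scale: float) -> float:
    """Average of the image-to-text and text-to-image cross-entropies.

    similarity[i][j] is the similarity between image i and text j;
    the matching pairs sit on the diagonal.
    """
    per_image = [[scale * v for v in row] for row in similarity]
    per_text = [list(col) for col in zip(*per_image)]  # transpose
    labels = list(range(len(similarity)))
    return 0.5 * (cross_entropy(per_image, labels) + cross_entropy(per_text, labels))


aligned = [[1.0, 0.0], [0.0, 1.0]]     # matching pairs are the most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # matching pairs are the least similar

loss_aligned = symmetric_contrastive_loss(aligned, scale=10.0)
loss_misaligned = symmetric_contrastive_loss(misaligned, scale=10.0)
```

In this toy setting the loss is close to zero when every image is most similar to its own description and grows by roughly `scale` when the pairs are swapped, which is the behavior the training loop relies on.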

In [None]:
contrastive_loss_fn: t.Callable[
    [
        Float[torch.Tensor, "entries 1024"],
        Float[torch.Tensor, "entries 1024"],
        Float[torch.Tensor, "1"],
    ],
    Float[torch.Tensor, "1"],
] = ClipLoss()

In [None]:
def contrastive_training_step(
    model: ClipFlyp,
    data_loader: DataLoader[tuple[TensorImage, list[str]]],
    optimizer: torch.optim.Optimizer,
) -> float:
    running_loss: float = 0.0
    progress = tqdm(data_loader, desc="training")

    model.train()

    entries: tuple[tuple[TensorImage, list[str]], ...]
    entry: tuple[TensorImage, list[str]]

    for iter, entries in zip(it.count(1), progress):
        # forward computation
        imgs_features, txts_features, logit_scale = model(
            [(img, prompts[0]) for img, prompts in entries]
        )

        # calculate loss
        loss: Float[torch.Tensor, "1"] = contrastive_loss_fn(imgs_features, txts_features, logit_scale)
        running_loss += loss.item()

        # optimizer zero grad
        optimizer.zero_grad()

        # loss backward
        loss.backward()

        # optimizer step
        optimizer.step()

        # Note: we clamp to 4.6052 = ln(100), as in the original paper.
        with torch.no_grad():
            model.core.logit_scale.clamp_(0, math.log(100))

            progress.set_postfix({"loss": running_loss / iter}, refresh=False)

    return running_loss / len(data_loader)

In [None]:
def contrastive_test_step(
    model: ClipFlyp,
    data_loader: DataLoader[tuple[TensorImage, list[str]]],
) -> float:
    running_loss: float = 0.0
    progress = tqdm(data_loader, desc="testing")

    model.eval()

    with torch.inference_mode():
        entries: tuple[tuple[TensorImage, list[str]], ...]
        entry: tuple[TensorImage, list[str]]

        for iter, entries in zip(it.count(1), progress):
            # forward computation
            imgs_features, txts_features, logit_scale = model(
                [(img, prompts[0]) for img, prompts in entries]
            )

            # calculate loss
            loss: Float[torch.Tensor, "1"] = contrastive_loss_fn(imgs_features, txts_features, logit_scale)
            running_loss += loss.item()

            progress.set_postfix({"loss": running_loss / iter}, refresh=False)

    return running_loss / len(data_loader)

In [None]:
def contrastive_showtime(
    model: ClipFlyp,
    spli2loader: dict[Split, DataLoader[tuple[TensorImage, list[str]]]],
    writer: SummaryWriter,
    global_step: int,
) -> None:
    model.eval()

    with torch.inference_mode():

        for split, data_loader in spli2loader.items():

            progress = tqdm(data_loader, desc=f"showtime [{split}]")

            entries: tuple[tuple[TensorImage, list[str]], ...]
            entry: tuple[TensorImage, list[str]]

            for iter, entries in zip(it.count(1), progress):

                # forward computation
                imgs_features, txts_features, _ = model([
                    (img, prompts[0])
                    for img, prompts in entries
                ])

                imgs_features: Float[torch.Tensor, "entries 1024"] = imgs_features / imgs_features.norm(dim=-1, keepdim=True)
                txts_features: Float[torch.Tensor, "entries 1024"] = txts_features / txts_features.norm(dim=-1, keepdim=True)
                similarity: Float[torch.Tensor, "entries entries"] = (txts_features @ imgs_features.T).cpu()

                f: plt.Figure
                ax: plt.Axes
                f, ax = plt.subplots(1, 1, figsize=(10, 8))

                ax.imshow(similarity, vmin=torch.min(similarity).item(), vmax=torch.max(similarity).item())

                ax.set_yticks(
                    range(len(entries)),
                    ["\n".join(prompts) for _, prompts in entries],
                    fontsize=10,
                )
                ax.set_xticks([])

                for i, image in enumerate([ crop for crop, _ in entries ]):
                    ax.imshow(
                        image.permute(1, 2, 0).cpu(),
                        extent=(i - 0.5, i + 0.5, -1.6, -0.6),
                        origin="lower",
                    )

                for x in range(similarity.shape[1]):
                    for y in range(similarity.shape[0]):
                        ax.text(
                            x,
                            y,
                            f"{similarity[y, x]:.2f}",
                            ha="center",
                            va="center",
                            size=12,
                        )

                for side in ["left", "top", "right", "bottom"]:
                    f.gca().spines[side].set_visible(False)

                ax.set_xlim([-0.5, len(entries) - 0.5])
                ax.set_ylim([len(entries) + 0.5, -2])

                f.tight_layout()

                writer.add_figure(tag=f"matrix {iter}/{split}", figure=f, global_step=global_step)

## 12 Exploiting self-attention to provide contextualized latent space representations

`CLIP context`

As the results achieved by our fine-tuned architecture on the test set of the RefCOCOg dataset show, the trained model is well prepared to match the visual appearance of the image regions delimited by the proposed bounding boxes with the semantics of the given textual description. Even though the algorithm achieves remarkable results, by manually inspecting the wrong predictions we have noticed that the model makes mistakes in the presence of location words such as "in the middle", "on the right", "on the left" and similar. However, if we carefully reason about the pipeline followed by our software to make the final prediction (Figure 13), this deficiency is understandable. With our implementation, the model can only look at each image region proposed in step 1 in isolation, disregarding the surrounding elements which populate the field of view. Under such circumstances even a human would struggle to understand which is "The dog on the left" and which is "The dog on the right" (Figure 12). In order to figure out which dog the natural language description refers to, we need to grasp the content of the picture as a whole. Even though a human observer achieves a general understanding of the global scene in less than one second, the task is not trivial for a machine. The purpose of this section is to describe our proposed solution to partially overcome this limitation.

The crucial observation behind our idea is that we need a way to enhance the latent representations of visual and textual prompts so that the obtained encodings embody the overall context of the given picture. Intuitively, the image crops and the textual description have to be interpreted as words in an arbitrarily long sentence. The encoding of a bounding box region, just like a pronoun, might acquire different meanings depending on the rest of the phrase. Consistently with this perspective, we have devised a self-attention mechanism as originally proposed by the "Attention Is All You Need" research paper [24]. For the sake of clarity the architecture is summarized in Figure 34. In this additional encoding procedure, the visual and textual CLIP embeddings are fed as query, key and value inputs of a self-attention module. Even so, the algorithm still has severe problems in understanding how the objects are positioned relative to one another in the scene, because no bounding box coordinates are fed to the attention module. However, a cropped apple within a bounding box is no longer analyzed in isolation: its feature representation incorporates the presence of the other fruits in the basket (Figure 35). Given the query $p$: “A yellow banana fruit in a basket”, the probability assigned to the apple bounding box $b$ is probably still not negligible. However, this time, the cosine similarity between $p$ and $b$ is mitigated by the awareness that somewhere in the picture there is also a bunch of bananas.
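
The mixing effect described above can be illustrated with a minimal sketch in plain Python. It is a deliberate simplification: it uses single-head scaled dot-product attention with identity projections, whereas the module adopted in our code has learned projections. The 2-dimensional embeddings and the helper names `softmax` and `self_attention` are hypothetical and chosen only for this example.

```python
import math


def softmax(row: list[float]) -> list[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Scaled dot-product self-attention with identity projections.

    Returns the contextualized tokens and the attention weights.
    """
    dim = len(tokens[0])
    weights = [
        softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens])
        for query in tokens
    ]
    contextualized = [
        [sum(w * token[d] for w, token in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return contextualized, weights


# hypothetical 2-d embeddings: two crops followed by one prompt
crops = [[1.0, 0.0], [0.0, 1.0]]
prompts = [[1.0, 1.0]]
tokens = crops + prompts

contextualized, weights = self_attention(tokens)

# split the sequence back: crops first, prompts last
crops_context = contextualized[: len(crops)]
prompts_context = contextualized[-len(prompts):]
```

Each output row is a weighted average of all the input rows, so the first crop, which initially had a zero second component, ends up with a representation that also reflects the other crop and the prompt.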

As always, before launching the training procedure on our powerful but time-limited Azure GPUs, we have executed some representative experiments on Google Colaboratory. Sadly, the results are not promising at all. After several unsuccessful attempts, we believe that the reason behind this poor outcome is the limited portion of the dataset that we are constrained to consider by the limited resources made available by Google Colaboratory. Indeed, attention modules typically require very large amounts of training data to deliver satisfactory outcomes. Unfortunately, at this point of the project, we had spent most of our GPU execution time on Azure and we considered it too hazardous to spend the remaining hours on something that could potentially end in a discouraging stalemate. We strongly believe that training this architecture on the whole dataset for at least 50 epochs is a worthy future direction that we cannot afford to consider in this work. In the end, given that the deep neural network trained with contrastive learning has shown excellent results, we consider it more profitable and interesting to investigate promising strategies to further improve the results obtained so far. In this regard, we discuss our proposed solutions to enhance the generalization capabilities of our model in the next section.


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/34.png)

**Figure 34**

![fruits.jpg](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/35.jpeg)

**Figure 35**

### 12.1 Code

In [None]:
class ClipContexCore(nn.Module):
    def __init__(
        self,
        img_encoder: nn.Module,  # visual encoder
        txt_encoder: nn.Module,  # natural language prompts encoder
    ):
        super().__init__()
        self.img_encoder = img_encoder
        self.txt_encoder = txt_encoder
        self.attention = nn.MultiheadAttention(embed_dim=1024, num_heads=1)

    def contextualize(
        self,
        crops_z: Float[torch.Tensor, "crops 1024"],
        prompts_z: Float[torch.Tensor, "prompts 1024"],
    ) -> tuple[Float[torch.Tensor, "crops 1024"], Float[torch.Tensor, "prompts 1024"]]:
        # concatenate image embeddings and prompt embeddings in the same latent context
        concat: Float[torch.Tensor, "crops+prompts 1024"] = torch.cat((crops_z, prompts_z), dim=0)

        contextualized: Float[torch.Tensor, "crops+prompts 1024"]
        contextualized, _ = self.attention(concat, concat, concat)

        # split the contextualized sequence back: the first rows are the crops, the last rows are the prompts
        return contextualized[: crops_z.shape[0]], contextualized[-prompts_z.shape[0] :]

    def cosine_similarity(
        self,
        crops_z: Float[torch.Tensor, "crops 1024"],
        prompts_z: Float[torch.Tensor, "prompts 1024"],
    ) -> Float[torch.Tensor, "prompts crops"]:
        # normalise the image and the text
        crops_z: Float[torch.Tensor, "crops 1024"] = crops_z / crops_z.norm(dim=-1, keepdim=True)
        prompts_z: Float[torch.Tensor, "prompts 1024"] = prompts_z / prompts_z.norm(dim=-1, keepdim=True)

        # evaluate the cosine similarity between the sets of features
        return prompts_z @ crops_z.T

    def forward(
        self,
        crops: Float[torch.Tensor, "crops 3 244 244"],
        prompts: Int[torch.Tensor, "prompts 77"],
    ) -> Float[torch.Tensor, "crops 1"]:
        # step 1: compute crop representation in the latent space
        crops_z: Float[torch.Tensor, "crops 1024"] = self.img_encoder(crops)

        # step 2: compute prompt representation in the latent space
        prompts_z: Float[torch.Tensor, "prompts 1024"] = self.txt_encoder(prompts)

        # step 3: refine the latent representation of each text and image according to the overall context by means of the attention mechanism
        crop_context_z: Float[torch.Tensor, "crops 1024"]
        prompt_context_z: Float[torch.Tensor, "prompts 1024"]
        crop_context_z, prompt_context_z = self.contextualize(crops_z, prompts_z)

        # step 4: evaluate logits
        similarity_matrix: Float[torch.Tensor, "prompts crops"] = self.cosine_similarity(crop_context_z, prompt_context_z)

        # step 5: crops classification
        return torch.mean(similarity_matrix, dim=0)

In [None]:
class ClipContex(nn.Module):
    def __init__(
        self,
        img_encoder: nn.Module,  # visual encoder
        txt_encoder: nn.Module,  # natural language prompts encoder
    ):
        super().__init__()
        self.core = ClipContexCore(img_encoder, txt_encoder)

        self.img_preprocess: Compose = preprocess
        self.txt_preprocess: t.Callable[[t.Union[str, list[str]]], Float[torch.Tensor, "77"]] = clip.tokenize

    def forward(
        self, crops: list[TensorImage], prompts: list[str]
    ) -> Float[torch.Tensor, "crops 1"]:
        # step 1: preprocess crops as required by the visual encoder
        with torch.no_grad():
            crops_preprocessed: Float[torch.Tensor, "crops 3 244 244"] = torch.stack([
                self.img_preprocess(crop)
                for crop in crops
            ])

        # step 2: preprocess prompts as required by the text encoder
        with torch.no_grad():
            prompts_preprocessed: Int[torch.Tensor, "prompts 77"] = self.txt_preprocess(
                prompts
            )

        return self.core(crops_preprocessed, prompts_preprocessed)

In [None]:
eval_summary(
    ClipContex(
        img_encoder=clip_frozen_img_encoder,
        txt_encoder=clip_frozen_txt_encoder,
    ).to(device).core
)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
ClipContexCore                           [5, 3, 244, 244]          [5]                       --                        True
├─ClipFrozenImgEnc: 1-1                  [5, 3, 244, 244]          [5, 1024]                 --                        --
├─ClipFrozenTxtEnc: 1-2                  [2, 77]                   [2, 1024]                 --                        --
├─MultiheadAttention: 1-3                [7, 1024]                 [7, 1024]                 4,198,400                 True
Total params: 4,198,400
Trainable params: 4,198,400
Non-trainable params: 0
Total mult-adds (M): 0
Input size (MB): 3.57
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 3.57

## 13 Strategies to improve model generalization

The results obtained with the architecture fine-tuned in a contrastive learning fashion outperform all the other deep neural networks we have proposed so far. Nevertheless, there is still room for improvement. The purpose of this section is to present, together with an open source implementation, the solutions that we have put in place to refine our most promising architecture, with the aim of further improving its generalization capabilities on the downstream task at hand.

### 13.1 Hyperparameter tuning

`FLYP optuna`

In the architectures presented so far we have adopted hyperparameter values which are commonly used in deep learning code. Although this configuration led to satisfactory results, with the aim of exploring more effectively the highly non-linear landscape of our objective function, we have leveraged automatic optimization tools for hyperparameter tuning. Specifically, in the implementation presented in the rest of this section, we use [Optuna](https://optuna.org/), an automatic hyperparameter optimization framework designed for machine learning. The software enables efficient hyperparameter optimization by adopting state-of-the-art algorithms for sampling hyperparameters and for pruning unpromising trials. We execute Optuna with the combination of sampling and pruning algorithms suggested by the official documentation for our use case. More in depth, the sampling algorithm of choice is the Tree-structured Parzen Estimator [25] while the pruning algorithm is Hyperband [26].
Unfortunately, the computational cost of the search rapidly increases with the number of hyperparameters to be optimized. Because of our limited computational resources, we cannot afford to tune the values of all the hyperparameters of our architecture. As a consequence, we have selected the variables which in our opinion have the largest impact on the training procedure.


*   optimizer: the model's convergence and performance highly depend on the adopted optimizer. For this hyperparameter optimization process we select two widely used algorithms: Stochastic Gradient Descent (SGD) and Adam.
*   learning rate: the learning rate determines the step size at each iteration while moving toward a minimum of the loss, so it significantly impacts the final outcome. In our implementation Optuna is asked to find the optimal learning rate within the interval [0.00001, 0.1].
*   dropout: in Section 11.1 we describe the regularization techniques that we have put in place in order to prevent overfitting. Among these, the dropout rate has turned out to be one of the most influential. Therefore, in the implementation proposed in the rest of this section, we configure Optuna to search for the most suitable dropout rate within the interval [0.1, 0.75].
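
The search space listed above can be sketched in plain Python as follows. Note that this is plain random search, used here only to show how a configuration is drawn: it is not the Tree-structured Parzen Estimator adopted by Optuna, and the helper `sample_configuration` is a hypothetical name. The learning rate is sampled log-uniformly, which we assume matches the `log=True` option passed to `suggest_float` in our objective function.

```python
import math
import random


def sample_configuration(rng: random.Random) -> dict[str, object]:
    """Draw one configuration from the search space described in the list."""
    # log-uniform in [1e-5, 1e-1]: sample the exponent uniformly
    log_lr = rng.uniform(math.log(1e-5), math.log(1e-1))
    return {
        "learning rate": math.exp(log_lr),
        "dropout": rng.uniform(0.1, 0.75),
        "optimizer": rng.choice(["Adam", "SGD"]),
    }


rng = random.Random(42)
configurations = [sample_configuration(rng) for _ in range(1000)]

# fraction of sampled learning rates below 1e-3, the geometric midpoint of the interval
below_midpoint = sum(c["learning rate"] < 1e-3 for c in configurations) / len(configurations)
```

With log-uniform sampling about half of the draws fall below 1e-3, so every order of magnitude of the interval is explored equally often. A uniform draw on the same interval would instead spend almost all of its budget on learning rates above 0.001.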

We let Optuna explore these parameters for five hours on Azure GPUs. A longer period of time would have been desirable, but we had to select an interval compatible with the resources available at that point of the project.

The following (Figure 37) are the hyperparameters suggested by Optuna at the end of its execution.

Given this optimized hyperparameter configuration, we have assessed its quality by repeating the overall training procedure with the recommended parameters. A more accurate selection of the hyperparameters which govern the training led to better results, as reported below.

![Screenshot 2023-09-01 at 01.24.14.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/36.png)

**Figure 36**

`FLYP optuna` loss

In [None]:
OPTUNA_BATCH_SIZE: int = 1024
OPTUNA_LIMIT: int = 30 * OPTUNA_BATCH_SIZE
OPTUNA_EPOCHS: int = 10

In [None]:
optuna_split2loader: dict[Split, DataLoader[tuple[TensorImage, list[str]]]] = {
    split: DataLoader(
        dataset=Coco4ContrastiveDataset(split=split, limit=OPTUNA_LIMIT),
        generator=g,
        batch_size=OPTUNA_BATCH_SIZE,
        collate_fn=lambda x: x,
        shuffle=(split == "train"),
    )
    for split in ["train", "val"]
}

In [None]:
def objective(trial: Trial):

    lr: float = trial.suggest_float("learning rate", 1e-5, .1, log=True)
    p: float = trial.suggest_float("dropout", .1, .9)
    optim: t.Literal["Adam", "SGD"] = trial.suggest_categorical("optimizer", ["Adam", "SGD"])

    optuna_model: ClipFlyp = ClipFlyp(p=p).to(device)

    match optim:
        case "Adam":
            optimizer: torch.optim.Optimizer = torch.optim.Adam(
                params=optuna_model.parameters(),
                lr=lr,
                weight_decay=.01,
            )

        case "SGD":
            optimizer: torch.optim.Optimizer = torch.optim.SGD(
                params=optuna_model.parameters(),
                lr=lr,
                weight_decay=.01,
                momentum=.9,
            )

    for epoch in trange(OPTUNA_EPOCHS):

        contrastive_training_step(
            model = optuna_model,
            data_loader = optuna_split2loader["train"],
            optimizer = optimizer,
        )

        val_loss = contrastive_test_step(
            model = optuna_model,
            data_loader = optuna_split2loader["val"],
        )

        trial.report(val_loss, epoch)

        # Handle pruning based on the intermediate value.
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_loss

In [None]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

study: optuna.study.Study = optuna.create_study(
    study_name="optuna-hyperparameter-optimization",
    direction="minimize",
    pruner=optuna.pruners.HyperbandPruner(),
    load_if_exists=False,
)

# study.optimize(
#     func=objective,
#     timeout=5 * 60 * 60,
#     show_progress_bar=True,
# )

---

In [None]:
study: Study = optuna.create_study(
    storage=RDBStorage("sqlite:///assets/optuna.db"),
    study_name="flyp",
    load_if_exists=True
)

In [None]:
plot_parallel_coordinate(study)

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/37.png)

**Figure 37**

In [None]:
plot_param_importances(study)

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/38.png)

**Figure 38**

In [None]:
plot_timeline(study)

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/39.png)

**Figure 39**

### 13.2 Data augmentation and noise injection

`FLYP augmented`

As detailed in Section 5, for each annotation the RefCOCOg dataset provides one, two or three equivalent descriptions. Up to this point, we have always considered only one of them. Basing the decisions on all the available referring expressions rather than a subset of them is clearly more desirable. It is important to recall that the training loop of our contrastive learning procedure is efficient enough to process the entire dataset in an acceptable amount of time. With this in mind, in the next pieces of code we not only consider all the prompts in the dataset, but we also develop data augmentation and noise injection techniques with the purpose of enhancing the generalization capabilities of our model.

#### 13.2.1 Text augmentation

Given a natural language description referring to a region of interest of the input image, our aim is to produce a set of different, but semantically equivalent, expressions. To this end, we have developed a set of transformations which are randomly applied to the input sequences in order to produce a desired number of phrases with the same meaning. The rest of this subsection outlines the text augmentation techniques that we have applied in this work.

Inspired by the CLIP paper [2], we generate new sentences following the concept of prompt engineering [27] [28], which here simply consists in prepending a prefix, chosen at random from a set of suggested templates ("A photo of {}", "A picture of {}", "An image of {}", "This is {}", "We can see {}"), to a given input sequence.
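
As a minimal illustration of the templates listed above, the following sketch expands a referring expression into one variant per prefix. The names `PREFIXES` and `with_prefixes` are hypothetical and used only in this example.

```python
PREFIXES: tuple[str, ...] = (
    "A photo of {}",
    "A picture of {}",
    "An image of {}",
    "This is {}",
    "We can see {}",
)


def with_prefixes(expression: str) -> list[str]:
    """Return one variant of the referring expression per template."""
    return [template.format(expression) for template in PREFIXES]


variants = with_prefixes("a blonde woman in a white shirt and long black skirt")
```

Every variant keeps the original description untouched and only changes the framing around it, so the meaning of the referring expression is preserved by construction.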

In addition to this, we also leverage more sophisticated text generation tools which are able, given a sentence, to generate paraphrases with the same meaning [29]. More in depth, we have configured and executed the following NLP algorithms:


*   EDA [30]. The Easy Data Augmentation algorithm consists of four simple operations: synonym replacement, random insertion, random swap, and random deletion. For our purposes, we have used only the synonym replacement operation. The reference guidelines for using EDA are documented [at this GitHub repository](https://github.com/dsfsi/textaugment).
*   PEGASUS [31]. We use a version of the PEGASUS transformer-based encoder-decoder model fine-tuned specifically for paraphrasing. The reference implementation of this deep neural network is provided by [huggingface.co at this link](https://huggingface.co/tuner007/pegasus_paraphrase).
*   BART [32]. Additionally, we employ a large BART sequence-to-sequence generation model fine-tuned on three paraphrase datasets, made available by [huggingface.co at this URL address](https://huggingface.co/eugenesiow/bart-paraphrase).
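
To give an intuition of the synonym replacement operation, the following is a toy sketch in plain Python. It is not the EDA implementation: EDA draws its synonyms from WordNet, whereas here we assume a tiny hand-written lookup table (`TOY_THESAURUS`), and `replace_synonyms` is a hypothetical helper.

```python
import random

# hypothetical lookup table, used only for this illustration
TOY_THESAURUS: dict[str, list[str]] = {
    "computer": ["calculator", "machine"],
    "giraffe": ["camelopard"],
}


def replace_synonyms(sentence: str, n: int, rng: random.Random) -> str:
    """Replace up to `n` words that have an entry in the toy thesaurus."""
    words = sentence.split()
    candidates = [i for i, word in enumerate(words) if word in TOY_THESAURUS]
    for i in rng.sample(candidates, k=min(n, len(candidates))):
        words[i] = rng.choice(TOY_THESAURUS[words[i]])
    return " ".join(words)


rng = random.Random(0)
augmented = replace_synonyms("the adult giraffe", n=1, rng=rng)
unchanged = replace_synonyms("a red car", n=1, rng=rng)
```

A sentence without any known word is returned as is, while a replacement such as "calculator" for "computer" shows how a context-free lookup can drift away from the original meaning.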


```
a apple desktop computer
- a apple desktop calculator

a blonde woman in a white shirt and long black skirt
- We can see a blonde woman in a white shirt and long black skirt

an old truck covered in snow except for the grill and door
- An old truck is covered in snow.

the adult giraffe
- the adult camelopard
```

**Figure 40**

#### 13.2.2 Image augmentation

In every human language, given a sentence describing a specific object in a scene or a certain situation, there exist endless equivalent expressions with the same meaning: "Sara exhibited considerable bravery during the challenging ordeal."; "Sara displayed significant courage throughout the demanding experience."; "Sara showcased noteworthy valor while facing the arduous trial.". Each of these sentences can be used interchangeably without affecting our semantic comprehension. Equivalently, given a portion of an image delimited by a rectangular bounding box, we can think about endless possible modifications of its appearance that do not affect our understanding of the content. For instance, we can straightforwardly recognize a cat in a crop even if it is gray scaled, rotated or slightly blurred. Following this idea, given a picture and a sequence of semantically equivalent expressions referring to a particular object in the field of view, for each description we augment the dataset by introducing a "visual synonym" of the bounding box region which delimits the object of interest. To accomplish this, we apply a visual alteration randomly chosen from a set of [torchvision transformations](https://pytorch.org/vision/stable/transforms.html).
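
The idea of a "visual synonym" can be sketched in plain Python on a handful of 8-bit pixel values. The three functions below are simplified stand-ins written for this example and are not the torchvision implementations. In particular we assume that solarization inverts the values at or above the threshold and that grayscale conversion uses the ITU-R 601 luma weights.

```python
import random


def solarize(pixels: list[int], threshold: int) -> list[int]:
    """Invert every 8-bit value at or above the threshold."""
    return [255 - p if p >= threshold else p for p in pixels]


def posterize(pixels: list[int], bits: int) -> list[int]:
    """Keep only the `bits` most significant bits of every 8-bit value."""
    mask = (0xFF << (8 - bits)) & 0xFF
    return [p & mask for p in pixels]


def grayscale(rgb: list[tuple[int, int, int]]) -> list[int]:
    """Weighted sum of the RGB channels (ITU-R 601 luma weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb]


crop = [0, 100, 192, 255]  # a hypothetical single-channel crop

# pick one alteration at random, as done for each description
transforms = [lambda px: solarize(px, 192), lambda px: posterize(px, 2)]
rng = random.Random(0)
visual_synonym = rng.choice(transforms)(crop)
```

Both alterations change the pixel values but leave the spatial layout of the crop untouched, which is why the content remains recognizable.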


![color.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/40.png)

**Figure 41**

#### 13.2.3 Noise injection

Experimental results suggest that noise injection might improve the generalization ability of the resulting neural network [33]. In line with this widely used concept, we enrich our text and visual data augmentation solution with slight perturbations applied randomly.

As regards textual data, we have experimentally noticed that the aforementioned EDA generation algorithm sometimes fails to replace text words with appropriate synonyms. In these cases, the resulting sentences do not precisely reflect the content of the initial phrase. The frequency at which this occurs makes EDA not only a useful tool for data augmentation but unintentionally also a valuable noise injection mechanism.

On the other hand, the situation is considerably easier for images thanks to the large number of existing computer vision libraries. In our work we perturb the visual content of a random subset of the image crops of our dataset by blurring them.

![noise.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/41.png)

**Figure 42**

### 13.3 Code

In [None]:
def contrastive_training_step_with_synonyms(
    model: ClipFlyp,
    data_loader: DataLoader[tuple[TensorImage, list[str]]],
    synonyms: int,
    optimizer: torch.optim.Optimizer,
) -> float:
    running_loss: float = 0.0
    progress = tqdm(data_loader, desc="training")

    model.train()

    entries: list[list[tuple[TensorImage, str]]]
    entry: list[tuple[TensorImage, str]]

    for iter, entries in zip(it.count(1), progress):

        # forward computation
        imgs_features: Float[torch.Tensor, "entries*synonyms 1024"]
        txts_features: Float[torch.Tensor, "entries*synonyms 1024"]
        logit_scale: Float[torch.Tensor, "1"]
        imgs_features, txts_features, logit_scale = model(list(it.chain(*entries)))

        imgs_features_3d: Float[torch.Tensor, "entries synonyms 1024"] = imgs_features.view(len(entries), synonyms, 1024)
        imgs_features_3d: Float[torch.Tensor, "synonyms entries 1024"] = imgs_features_3d.transpose(0, 1)

        txts_features_3d: Float[torch.Tensor, "entries synonyms 1024"] = txts_features.view(len(entries), synonyms, 1024)
        txts_features_3d: Float[torch.Tensor, "synonyms entries 1024"] = txts_features_3d.transpose(0, 1)

        # calculate loss
        loss: Float[torch.Tensor, "1"] = torch.stack([
            contrastive_loss_fn(imgs_features_2d, txts_features_2d, logit_scale)
            for imgs_features_2d, txts_features_2d in zip(imgs_features_3d, txts_features_3d)
        ]).mean()
        running_loss += loss.item()

        # optimizer zero grad
        optimizer.zero_grad()

        # loss backward
        loss.backward()

        # optimizer step
        optimizer.step()

        # Note: we clamp to 4.6052 = ln(100), as in the original paper.
        with torch.no_grad():
            model.core.logit_scale.clamp_(0, math.log(100))

            progress.set_postfix({"loss": running_loss / iter}, refresh=False)

    return running_loss / len(data_loader)
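
The reshaping performed in the training step above can be sketched in plain Python on a list of labels instead of a tensor of features. The helper `group_by_synonym` and the labels are hypothetical. The sketch assumes that the flattened batch is entry-major, which is what chaining the entries one after the other produces.

```python
def group_by_synonym(flat: list[str], entries: int, synonyms: int) -> list[list[str]]:
    """Regroup an entry-major flat batch into `synonyms` groups of `entries` items.

    flat[e * synonyms + s] is the s-th variant of entry e, which mirrors
    `view(entries, synonyms, ...)` followed by `transpose(0, 1)`.
    """
    assert len(flat) == entries * synonyms
    return [[flat[e * synonyms + s] for e in range(entries)] for s in range(synonyms)]


# three entries, two variants each, flattened entry by entry
flat = ["dog#0", "dog#1", "cat#0", "cat#1", "cow#0", "cow#1"]
groups = group_by_synonym(flat, entries=3, synonyms=2)
```

Each group contains every entry exactly once, so within a group the matching pairs still sit on the diagonal of the similarity matrix and two variants of the same entry are never treated as negatives of one another. The final loss is the mean of the per-group losses.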

---

In [None]:
# EDA
# paper: https://aclanthology.org/D19-1670.pdf
# paper: https://arxiv.org/abs/1907.03752
# code reference: https://github.com/dsfsi/textaugment
from textaugment import EDA

import nltk  # NLTK is a leading platform for building Python programs to work with human language data

nltk.download("stopwords")
nltk.download("wordnet")

In [None]:
# PEGASUS fine-tuned for paraphrasing
# paper: https://arxiv.org/abs/1912.08777
# code reference: https://huggingface.co/tuner007/pegasus_paraphrase
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

pegasus_model_name = "tuner007/pegasus_paraphrase"
pegasus_tokenizer = PegasusTokenizer.from_pretrained(pegasus_model_name)
pegasus_model = PegasusForConditionalGeneration.from_pretrained(pegasus_model_name).to(device)

pegasus_model.eval()

for p in pegasus_model.parameters():
    p.requires_grad = False

In [None]:
# A large BART seq2seq (text2text generation) model fine-tuned on 3 paraphrase datasets.
# paper: https://arxiv.org/abs/1910.13461
# code reference: https://huggingface.co/eugenesiow/bart-paraphrase
from transformers import BartForConditionalGeneration, BartTokenizer

bart_model_name = "eugenesiow/bart-paraphrase"
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name).to(device)

bart_model.eval()

for p in bart_model.parameters():
    p.requires_grad = False

In [None]:
def pegasus(txt: str) -> str:
    with torch.inference_mode():
        batch: dict[str, Int[torch.Tensor, "1 P"]] = pegasus_tokenizer(
            [txt],
            truncation=True,
            padding="longest",
            max_length=60,
            return_tensors="pt",
        ).to(device)  # keep the inputs on the same device as the model
        translated: Int[torch.Tensor, "1 X"] = pegasus_model.generate(
            **batch,
            max_length=60,
            num_beams=10,
            num_return_sequences=1,
            temperature=1.5
        )
        [out] = pegasus_tokenizer.batch_decode(translated, skip_special_tokens=True)
        return out

In [None]:
def bart(txt: str) -> str:
    """Paraphrase `txt` with the BART model fine-tuned for paraphrasing."""
    with torch.inference_mode():
        # the tokenized inputs must live on the same device as the model
        batch: dict[str, Int[torch.Tensor, "1 P"]] = bart_tokenizer(txt, return_tensors="pt").to(device)
        translated: Int[torch.Tensor, "1 X"] = bart_model.generate(batch["input_ids"])
        [out] = bart_tokenizer.batch_decode(translated, skip_special_tokens=True)
        return out

In [None]:
eda: EDA = EDA(random_state=42)

In [None]:
txt_transform: RandomChoice = RandomChoice([
    "A photo of {}".format,
    "A picture of {}".format,
    "An image of {}".format,
    "This is {}".format,
    "We can see {}".format,
    eda.synonym_replacement,
    pegasus,
    bart,
])

In [None]:
img_transform: RandomChoice = RandomChoice([
    ColorJitter(brightness=0.5, hue=0.3),  # randomly changes the brightness and the hue of an image
    GaussianBlur(kernel_size=(5, 9), sigma=(0.1, 5)),  # performs gaussian blur transform on an image
    RandomPosterize(bits=2),  # randomly posterizes the image by reducing the number of bits of each color channel
    RandomSolarize(threshold=192.0),  # randomly solarizes the image by inverting all pixel values above the threshold
    RandomAdjustSharpness(sharpness_factor=2),  # randomly adjusts the sharpness of the given image
    RandomAutocontrast(),  # randomly applies autocontrast to the given image
    RandomEqualize(),  # randomly equalizes the histogram of the given image
    Grayscale(num_output_channels=3),  # converts an image to grayscale
])

In [None]:
SYNONYMS: int = 2

In [None]:
def augment(batch: list[tuple[TensorImage, list[str]]]) -> list[list[tuple[TensorImage, str]]]:
    return [
        list(zip(
            random.sample(
                [img] +
                [img_transform(img) for _ in range(SYNONYMS - 1)],
                SYNONYMS
            ),
            random.sample(
                prompts +
                [txt_transform(random.choice(prompts)) for _ in range(SYNONYMS - len(prompts))],
                SYNONYMS
            )
        ))
        for img, prompts in batch
    ]
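
The padding logic of `augment` can be checked on toy data without loading any model. The sketch below mirrors the function above but takes the transforms as arguments; `toy_augment` and the string "images" are hypothetical stand-ins used only for illustration.

```python
import random
from typing import Callable


def toy_augment(
    batch: list[tuple[str, list[str]]],
    img_transform: Callable[[str], str],
    txt_transform: Callable[[str], str],
    synonyms: int,
) -> list[list[tuple[str, str]]]:
    """Same structure as `augment`, with strings standing in for images."""
    out = []
    for img, prompts in batch:
        # the original image plus (synonyms - 1) transformed copies, in random order
        imgs = random.sample(
            [img] + [img_transform(img) for _ in range(synonyms - 1)],
            synonyms,
        )
        # the original prompts, padded with generated ones when fewer than `synonyms` are available
        txts = random.sample(
            prompts + [txt_transform(random.choice(prompts)) for _ in range(synonyms - len(prompts))],
            synonyms,
        )
        out.append(list(zip(imgs, txts)))
    return out


if __name__ == "__main__":
    random.seed(0)
    toy_batch = [
        ("img0", ["a cat"]),
        ("img1", ["a dog", "a brown dog", "the dog on the left"]),
    ]
    for row in toy_augment(toy_batch, lambda s: s + "*", "A photo of {}".format, synonyms=2):
        print(row)
```

When an item already has at least `synonyms` prompts, no text is generated and a random subset of the existing prompts is used; otherwise the missing prompts are produced by `txt_transform`.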

### 13.4 Results

In the following part of the notebook we present our implementation, which incorporates the data augmentation logic presented in this section. Consistently with the aforementioned concepts, during training the batch is no longer populated by $B$ (bounding box image, description) pairs. Instead, given a configurable number of synonyms $m$, the resulting batch has the following structure:

$$
\begin{bmatrix}
\text{bbox}_{1, 1} & \text{bbox}_{1, 2} & \cdots & \text{bbox}_{1, m} \\
\text{bbox}_{2, 1} & \text{bbox}_{2, 2} & \cdots & \text{bbox}_{2, m} \\
\vdots & \vdots & \ddots & \vdots \\
\text{bbox}_{B, 1} & \text{bbox}_{B, 2} & \cdots & \text{bbox}_{B, m} \\
\end{bmatrix}
\quad
\begin{bmatrix}
\text{prompt}_{1, 1} & \text{prompt}_{1, 2} & \cdots & \text{prompt}_{1, m} \\
\text{prompt}_{2, 1} & \text{prompt}_{2, 2} & \cdots & \text{prompt}_{2, m} \\
\vdots & \vdots & \ddots & \vdots \\
\text{prompt}_{B, 1} & \text{prompt}_{B, 2} & \cdots & \text{prompt}_{B, m} \\
\end{bmatrix}
$$

where $\text{bbox}_{i, j}$ is the $j$th synonym of the image region delimited by bounding box $\text{bbox}_i$ and $\text{prompt}_{i, j}$ is the $j$th synonym of the corresponding referring expression.
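
One way to read the two $B \times m$ matrices is as a flat list of $B \cdot m$ (image, prompt) pairs in which all entries coming from the same row describe the same object. The sketch below builds the corresponding mask of matching pairs. It assumes row-major flattening and that every same-row entry counts as a positive; whether `contrastive_training_step_with_synonyms` uses exactly this convention is not shown here, and `flat_index` and `positive_mask` are hypothetical helper names.

```python
def flat_index(i: int, j: int, m: int) -> int:
    """Position of entry (i, j) of a B x m matrix after row-major flattening."""
    return i * m + j


def positive_mask(B: int, m: int) -> list[list[bool]]:
    """mask[r][c] is True when flat entries r and c come from the same original row."""
    n = B * m
    return [[(r // m) == (c // m) for c in range(n)] for r in range(n)]


if __name__ == "__main__":
    # B = 2 objects, m = 2 synonyms each: a block-diagonal mask with 2 x 2 blocks
    for row in positive_mask(B=2, m=2):
        print(["x" if cell else "." for cell in row])
```

With $m = 1$ the mask reduces to the identity matrix, which corresponds to the batch structure used before augmentation.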

Unfortunately, the number $m$ of bounding box and description synonyms heavily impacts the execution time and the memory consumption of the training. Our training set is populated by around 40,000 items. If we augment the data with $m=4$ synonyms for each (bounding box, text) pair, we end up with a training set of 160,000 instances, and we cannot afford the execution time required to process that many examples. Furthermore, as a consequence of our data augmentation solution, the size of each batch increases by a factor of $m$. During the execution, the variables required by the calculations are stored in GPU memory. The Azure GPUs have 15 gigabytes of memory, which quickly saturates as the number of synonyms $m$ increases.
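
The growth described above can be made explicit with a few lines of arithmetic. The sketch below is only illustrative: the helper names are hypothetical, and it assumes that the contrastive loss compares every image in the batch with every prompt, so that the similarity matrix has $(B \cdot m)^2$ entries.

```python
def augmented_instances(n_pairs: int, m: int) -> int:
    """Number of training instances after adding m synonyms per (bounding box, text) pair."""
    return n_pairs * m


def effective_batch_size(batch_pairs: int, m: int) -> int:
    """Number of (image, prompt) items actually processed in one batch."""
    return batch_pairs * m


def similarity_matrix_entries(batch_pairs: int, m: int) -> int:
    """Entries of the image-prompt similarity matrix, assuming all-vs-all comparison."""
    n = effective_batch_size(batch_pairs, m)
    return n * n


if __name__ == "__main__":
    print(augmented_instances(40_000, 4))  # training instances with m = 4
    for m in (1, 2, 4):
        print(m, effective_batch_size(512, m), similarity_matrix_entries(512, m))
```

The number of encoder forward passes grows linearly with $m$, while the similarity matrix grows quadratically; in practice the encoder activations are likely to dominate memory usage.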

Ultimately, we did our best to deal with these limitations and we were able to train our model for 60 epochs on our augmented dataset with $m=2$ synonyms and a batch size of 512 (which, with the $m=2$ synonyms for each ($I,P$) pair, is in fact 1024). The obtained results are reported at the end of this section. Observing the learning curves, we notice that the objective loss goes down much faster than in the previous solutions. Moreover, the colors on the diagonal of the matrices depicted in Figure 43, Figure 44 and Figure 45 are much brighter. Consistently, we expected the evaluation on our downstream task to improve as well. On the contrary, as we can see in Table 10, the results have improved very little. We can formulate two possible reasons behind this outcome. First, the number of synonyms $m=2$ used to augment the dataset may be too small to yield a substantial performance boost. Alternatively, it could be that the results cannot be further improved with this architecture; in other words, only a radical modification of the model and of the training procedure could affect the final results. We label this as an interesting and promising aspect to be investigated in a future expansion of this work.


![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/42.png)

**Figure 43**

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/43.png)

**Figure 44**

![image.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/44.png)

**Figure 45**

![Screenshot 2023-09-01 at 01.24.35.png](https://raw.githubusercontent.com/gekoramy/uni.deep-learning/refs/heads/main/assets/45.png)

**Figure 46**

`FLYP augmented` loss

---

In [None]:
BATCH_SIZE: int = 1024
LIMIT: int = -1
EPOCHS: int = 50

In [None]:
split2loader: dict[Split, DataLoader[tuple[TensorImage, list[str]]]] = {
    "train": DataLoader(
        dataset=Coco4ContrastiveDataset(split="train", limit=LIMIT),
        generator=g,
        batch_size=BATCH_SIZE,
        collate_fn=augment,
        shuffle=True,
    ),
    **{
        split: DataLoader(
            dataset=Coco4ContrastiveDataset(split=split, limit=LIMIT),
            batch_size=BATCH_SIZE,
            collate_fn=lambda x: x,
        )
        for split in ["val", "test"]
    }
}

split2showtime_dataloader: dict[Split, DataLoader[tuple[TensorImage, list[str]]]] = {
    split: DataLoader(
        dataset=Coco4ContrastiveDataset(split=split, limit=5 * 6),
        batch_size=6,
        collate_fn=lambda x: x,
        shuffle=False,
    )
    for split in ['train', 'val', 'test']
}

In [None]:
def training_loop(
        name: str,
        model: ClipFlyp,
        optimizer: t.Callable[[t.Iterable[torch.Tensor]], torch.optim.Optimizer],
) -> pd.DataFrame:
    loss: dict[str, list[float]] = defaultdict(list)

    # create a logger for the experiment
    with SummaryWriter(f"runs/{name}") as writer:
        # computes evaluation results before training
        print("Before training:")
        test_loss: float = contrastive_test_step(
            model=model,
            data_loader=split2loader["test"],
        )
        val_loss: float = contrastive_test_step(
            model=model,
            data_loader=split2loader["val"],
        )

        loss["test"].append(test_loss)
        loss["val"].append(val_loss)

        contrastive_showtime(
            model,
            split2showtime_dataloader,
            writer,
            0
        )

        # log to TensorBoard
        writer.add_scalars(
            main_tag="loss",
            tag_scalar_dict={
                "test": test_loss,
                "val": val_loss,
            },
            global_step=0,
        )

        progress = trange(EPOCHS, desc="EPOCHS")
        for epoch in progress:
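            # note: `optimizer` is a factory, hence a fresh optimizer (with reset internal state)
            # is instantiated at the beginning of every epoch
            train_loss: float = contrastive_training_step_with_synonyms(
                model=model,
                data_loader=split2loader["train"],
                optimizer=optimizer(model.parameters()),
                synonyms=SYNONYMS
            )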

            val_loss: float = contrastive_test_step(
                model=model,
                data_loader=split2loader["val"],
            )

            loss["train"].append(train_loss)
            loss["val"].append(val_loss)

            # log to TensorBoard
            writer.add_scalars(
                main_tag="loss",
                tag_scalar_dict={
                    "train": train_loss,
                    "val": val_loss,
                },
                global_step=epoch + 1,
            )

            progress.set_postfix(
                {
                    "train/loss": train_loss,
                    "val/loss": val_loss,
                },
                refresh=False,
            )

            # store model
            torch.save(obj=model.state_dict(), f=f"{name}-{(epoch + 1):02d}.pth")

        # compute final evaluation results
        print("After training:")

        test_loss: float = contrastive_test_step(
            model=model,
            data_loader=split2loader["test"],
        )

        loss["test"].append(test_loss)

        contrastive_showtime(
            model,
            split2showtime_dataloader,
            writer,
            EPOCHS
        )

        # log to TensorBoard
        writer.add_scalars(
            main_tag="loss",
            tag_scalar_dict={
                "test": test_loss,
            },
            global_step=EPOCHS,
        )

        return pd.concat(
            [
                pd.concat(
                    [pd.Series(v).describe() for v in loss.values()],
                    axis=1,
                    keys=[k for k in loss.keys()],
                ),
            ],
            axis=1,
            keys=["loss"],
        )

---

In [None]:
keys: list[str] = ["flyp", "flyp-solve-overfitting", "flyp-optuna", "flyp-augmented"]

In [None]:
pd.concat(
    [
        pd.read_csv(f"assets/{key}/eval-FLYP-train.csv", index_col=0)
        for key in keys
    ],
    axis=1,
    keys=keys,
)

| statistic | flyp / iou | flyp / cos similarity | flyp / euclidean distance | flyp-solve-overfitting / iou | flyp-solve-overfitting / cos similarity | flyp-solve-overfitting / euclidean distance | flyp-optuna / iou | flyp-optuna / cos similarity | flyp-optuna / euclidean distance | flyp-augmented / iou | flyp-augmented / cos similarity | flyp-augmented / euclidean distance |
|:------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| count | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 | 42226.0 |
| mean | 0.655846 | 0.90949 | 0.71047 | 0.610673 | 0.905833 | 0.743905 | 0.61616 | 0.906825 | 0.740328 | 0.619415 | 0.906646 | 0.739636 |
| std | 0.381167 | 0.116469 | 0.467298 | 0.391974 | 0.10673 | 0.441283 | 0.389926 | 0.106528 | 0.441471 | 0.389647 | 0.107497 | 0.443674 |
| min | 0.0 | 0.224193 | 0.0 | 0.0 | 0.220139 | 0.0 | 0.0 | 0.220139 | 0.0 | 0.0 | 0.220139 | 0.0 |
| 25% | 0.25242 | 0.878059 | 0.371801 | 0.167114 | 0.856192 | 0.387568 | 0.178334 | 0.858358 | 0.387047 | 0.18358 | 0.858426 | 0.386145 |
| 50% | 0.89352 | 0.966205 | 0.510608 | 0.858108 | 0.958639 | 0.574556 | 0.862752 | 0.959355 | 0.568923 | 0.867282 | 0.959949 | 0.564835 |
| 75% | 0.952867 | 0.982813 | 0.971743 | 0.950501 | 0.981684 | 1.060347 | 0.950845 | 0.981756 | 1.053231 | 0.951531 | 0.981854 | 1.053846 |
| max | 0.998923 | 1.0 | 3.284631 | 0.998923 | 1.0 | 2.925502 | 0.998923 | 1.0 | 2.925502 | 0.998923 | 1.0 | 2.790918 |


**Table 6**

In [None]:
pd.concat(
    [
        pd.read_csv(f"assets/{key}/eval-FLYP-val.csv", index_col=0)
        for key in keys
    ],
    axis=1,
    keys=keys,
)

| statistic | flyp / iou | flyp / cos similarity | flyp / euclidean distance | flyp-solve-overfitting / iou | flyp-solve-overfitting / cos similarity | flyp-solve-overfitting / euclidean distance | flyp-optuna / iou | flyp-optuna / cos similarity | flyp-optuna / euclidean distance | flyp-augmented / iou | flyp-augmented / cos similarity | flyp-augmented / euclidean distance |
|:------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| count | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 | 2573.0 |
| mean | 0.44474 | 0.845191 | 0.964074 | 0.559027 | 0.892097 | 0.802483 | 0.566936 | 0.892032 | 0.802094 | 0.561768 | 0.890657 | 0.807573 |
| std | 0.411338 | 0.147209 | 0.5373 | 0.400013 | 0.113741 | 0.457968 | 0.398357 | 0.115519 | 0.462112 | 0.397323 | 0.116169 | 0.464711 |
| min | 0.0 | 0.310306 | 0.070107 | 0.0 | 0.36521 | 0.057728 | 0.0 | 0.36521 | 0.057728 | 0.0 | 0.36521 | 0.057728 |
| 25% | 0.009099 | 0.757501 | 0.45735 | 0.102296 | 0.826868 | 0.411174 | 0.109141 | 0.831729 | 0.411081 | 0.112111 | 0.82723 | 0.410835 |
| 50% | 0.305549 | 0.89093 | 0.922397 | 0.724764 | 0.941753 | 0.677477 | 0.754302 | 0.943839 | 0.663102 | 0.724764 | 0.941327 | 0.677092 |
| 75% | 0.921113 | 0.972397 | 1.349703 | 0.944698 | 0.979479 | 1.143081 | 0.945063 | 0.979501 | 1.137241 | 0.94424 | 0.979479 | 1.150478 |
| max | 0.993985 | 0.999632 | 2.726751 | 0.994411 | 0.999632 | 2.47404 | 0.994411 | 0.999632 | 2.47404 | 0.993985 | 0.999632 | 2.47404 |


**Table 7**

In [None]:
pd.concat(
    [
        pd.read_csv(f"assets/{key}/eval-FLYP-test.csv", index_col=0)
        for key in keys
    ],
    axis=1,
    keys=keys,
)

| statistic | flyp / iou | flyp / cos similarity | flyp / euclidean distance | flyp-solve-overfitting / iou | flyp-solve-overfitting / cos similarity | flyp-solve-overfitting / euclidean distance | flyp-optuna / iou | flyp-optuna / cos similarity | flyp-optuna / euclidean distance | flyp-augmented / iou | flyp-augmented / cos similarity | flyp-augmented / euclidean distance |
|:------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| count | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 | 5023.0 |
| mean | 0.451015 | 0.846404 | 0.961283 | 0.580078 | 0.896025 | 0.785909 | 0.577689 | 0.895028 | 0.789919 | 0.576252 | 0.893777 | 0.794152 |
| std | 0.4105 | 0.1475 | 0.545707 | 0.397923 | 0.113407 | 0.462311 | 0.398581 | 0.115062 | 0.46661 | 0.398819 | 0.116289 | 0.469464 |
| min | 0.0 | 0.263428 | 0.0 | 0.0 | 0.342195 | 0.0 | 0.0 | 0.342195 | 0.0 | 0.0 | 0.280131 | 0.0 |
| 25% | 0.013494 | 0.753941 | 0.450327 | 0.122994 | 0.836112 | 0.397114 | 0.122399 | 0.834798 | 0.39796 | 0.119033 | 0.832247 | 0.399706 |
| 50% | 0.326402 | 0.891457 | 0.921334 | 0.797624 | 0.951409 | 0.630213 | 0.794154 | 0.950434 | 0.631388 | 0.788843 | 0.950299 | 0.631416 |
| 75% | 0.924748 | 0.974087 | 1.36028 | 0.948823 | 0.980802 | 1.12464 | 0.948546 | 0.980756 | 1.128206 | 0.948779 | 0.980454 | 1.13875 |
| max | 0.998143 | 1.0 | 2.888018 | 0.99605 | 1.0 | 2.777238 | 0.995533 | 1.0 | 2.794651 | 0.99605 | 1.0 | 2.565383 |


**Table 8**

In [None]:
pd.read_csv("assets/comparing[train].csv", index_col=0)

| model | mAP[IoU .3] | mAP[IoU .5] | mAP[IoU .7] | mIoU | mCos | mED |
|:------|------:|------:|------:|------:|------:|------:|
| FLYP | 0.735542 | 0.686875 | 0.638232 | 0.655846 | 0.90949 | 0.71047 |
| FLYP solve overfitting | 0.691967 | 0.628712 | 0.573509 | 0.610673 | 0.905833 | 0.743905 |
| FLYP optuna | 0.698385 | 0.635248 | 0.579809 | 0.61616 | 0.906825 | 0.740328 |
| FLYP augmented | 0.700374 | 0.639345 | 0.584142 | 0.619415 | 0.906646 | 0.739636 |


**Table 9**

In [None]:
pd.read_csv("assets/comparing[val].csv", index_col=0)

| model | mAP[IoU .3] | mAP[IoU .5] | mAP[IoU .7] | mIoU | mCos | mED |
|:------|------:|------:|------:|------:|------:|------:|
| FLYP | 0.501749 | 0.447338 | 0.395647 | 0.44474 | 0.845191 | 0.964074 |
| FLYP solve overfitting | 0.640497 | 0.570152 | 0.511465 | 0.559027 | 0.892097 | 0.802483 |
| FLYP optuna | 0.649048 | 0.582588 | 0.519627 | 0.566936 | 0.892032 | 0.802094 |
| FLYP augmented | 0.643995 | 0.573649 | 0.511465 | 0.561768 | 0.890657 | 0.807573 |


**Table 10**

In [None]:
pd.read_csv("assets/comparing[test].csv", index_col=0)

| model | mAP[IoU .3] | mAP[IoU .5] | mAP[IoU .7] | mIoU | mCos | mED |
|:------|------:|------:|------:|------:|------:|------:|
| FLYP | 0.511447 | 0.448139 | 0.394983 | 0.451015 | 0.846404 | 0.961283 |
| FLYP solve overfitting | 0.66096 | 0.591678 | 0.532749 | 0.580078 | 0.896025 | 0.785909 |
| FLYP optuna | 0.655385 | 0.588891 | 0.530958 | 0.577689 | 0.895028 | 0.789919 |
| FLYP augmented | 0.654788 | 0.588095 | 0.528967 | 0.576252 | 0.893777 | 0.794152 |


**Table 11**
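
For readers who want to relate the `mIoU` and `mAP[IoU t]` columns to individual predictions, the following sketch computes the intersection over union of two boxes and the fraction of samples whose IoU reaches a threshold. Two assumptions are made for illustration: boxes are given as `(x1, y1, x2, y2)`, and the thresholded columns are read as "fraction of samples with IoU at or above the threshold". How the CSV files above were actually produced is not shown in this section, and both function names are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fraction_at_or_above(ious: list[float], threshold: float) -> float:
    """Share of samples whose IoU is at least `threshold`."""
    return sum(v >= threshold for v in ious) / len(ious)


if __name__ == "__main__":
    predicted = [(0, 0, 2, 2), (0, 0, 1, 1), (0, 0, 1, 1)]
    target = [(1, 1, 3, 3), (0, 0, 1, 1), (2, 2, 3, 3)]
    scores = [iou(p, t) for p, t in zip(predicted, target)]
    print(scores, fraction_at_or_above(scores, 0.5))
```

Under this reading, a large gap between the median and the mean IoU, as in the tables above, indicates that predictions tend to be either almost exact or completely off.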

### 13.5 Weight initialization

In addition to the aforementioned strategies, an attempt worth mentioning is the change of the weight initialization algorithm. More in depth, we have implemented the popular Xavier initialization, originally proposed by Xavier Glorot and Yoshua Bengio in their work "Understanding the difficulty of training deep feedforward neural networks" [34]. However, in line with [this post](https://stats.stackexchange.com/a/319849) on stackexchange.com, we have experimentally observed that the PyTorch default Kaiming initialization [35] performs considerably better in our case. Hence, we have not further investigated this aspect.
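
To give a sense of how the two schemes differ, the sketch below compares the bounds of the uniform distributions they sample from. It is a standard-library illustration and not the code used in our experiments: in PyTorch one would call `torch.nn.init.xavier_uniform_` on the weights, while, as far as we are aware, `torch.nn.Linear` defaults to `kaiming_uniform_` with `a = sqrt(5)`. The layer size of 512 is an arbitrary example.

```python
import math


def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Weights are drawn from U(-bound, bound) with bound = gain * sqrt(6 / (fan_in + fan_out))."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def kaiming_uniform_bound(fan_in: int, a: float = math.sqrt(5.0)) -> float:
    """Weights are drawn from U(-bound, bound) with bound = gain * sqrt(3 / fan_in),
    where gain = sqrt(2 / (1 + a^2)) for a leaky ReLU with negative slope `a`."""
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    return gain * math.sqrt(3.0 / fan_in)


if __name__ == "__main__":
    fan_in = fan_out = 512  # arbitrary example of a square linear layer
    print("xavier :", xavier_uniform_bound(fan_in, fan_out))
    print("kaiming:", kaiming_uniform_bound(fan_in))
```

With `a = sqrt(5)` the Kaiming bound simplifies to `1 / sqrt(fan_in)`, so for a square layer the Xavier scheme draws weights from a wider interval. This only illustrates the difference in scale; it does not by itself explain the gap we observed.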

## 14 Conclusion and further research directions

In this work we have built and evaluated several deep learning frameworks on top of the zero-shot capabilities of CLIP to perform visual grounding on the RefCOCOg dataset. To this end, motivated by other related works (Section 4), we have presented a two-step joint embedding approach to tackle the problem.

Given an input image, the purpose of the first step is to propose a set of regions delimited by rectangular bounding boxes containing potentially relevant objects. In order to accomplish this we have compared several modern object detection algorithms. In this regard, as an important contribution, we have made available three new versions of the RefCOCOg dataset filled with the bounding boxes predicted by YOLOv5, YOLOv8 and DETR respectively, together with code which can be readily customized for other object detection algorithms. With this preprocessing we increase the number of iterations per second of our training loop by 70 percent. We believe that this resource can be helpful for many other future research directions.

The solutions proposed to address the second step of the overall procedure can be divided into four categories:
* training-free approach
* standard fine-tuning approach
* contrastive learning approach
* self-attention approach

All of them have been evaluated with commonly used metrics on their ability to ground textual descriptions in the visual world. As an important contribution of this notebook, we show that, despite its simplicity, fine-tuning the CLIP image and text encoders following the same contrastive learning pattern used by the original authors for pretraining consistently outperforms the alternative approaches.

Unfortunately, due to our limited computational resources, we have not been able to train and evaluate our self-attention model on an adequate amount of data for an adequate amount of time. We leave this as a prominent and interesting future research direction.

Finally, in Section 13, we have further refined the generalization capabilities of our most promising architecture by means of the Optuna automatic hyperparameter optimization tool, data augmentation, and noise injection.

Last but not least, we have provided in this document the Python code implementing every aspect discussed.

We conclude our project with some future research directions which might inspire further work on this topic.

In our opinion, the first stage of our two-step joint embedding approach is the component with the most room for improvement. In order to find the regions of the image which best reflect the content of a textual description, we, as humans, do not first extract all the relevant objects which populate the field of view. Rather, the attention of our cognitive system is guided by the referring expression to immediately rule out portions of the image which are not compatible with the input query. In this regard, we suggest exploring the literature on text-guided attention models as a good starting point to improve our framework [36][37].

The most challenging problem which remains substantially unsolved by this work concerns the presence of spatial relationships, like "The fruit in the middle", in the natural language descriptions. Our self-attention approach detailed in Section 12 partially overcomes this limitation. However, given the frequency with which location words are used in real world applications, this topic definitely deserves further investigation in the future.

> ***Remark:***
> *in order to systematically keep track of the experiments that we have made throughout the development of the project and to conveniently visualize their corresponding outcomes we have used TensorBoard. Unfortunately, we cannot deliver the notebook with the interactive TensorBoard dashboard, since the resulting file would be too big to be sent via email.*

## 15 References

[[1]](https://www.cs.utexas.edu/users/ai-lab/downloadPublication.php?filename=http://www.cs.utexas.edu/users/ml/papers/thomason.robonlp17.pdf&pubid=127642) Jesse Thomason, Jivko Sinapov, and Raymond Mooney, "Guiding interaction behaviors for multi-modal grounded language learning," in Proceedings of the First Workshop on Language Grounding for Robotics, 2017.

[[2]](https://arxiv.org/abs/2103.00020) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021.

[[3]](https://arxiv.org/abs/1608.00272) Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69-85. Springer, 2016.


[[4]](https://arxiv.org/abs/2007.09554) Yanyuan Qiao and Chaorui Deng and Qi Wu. Referring Expression Comprehension: A Survey of Methods and Datasets. Year 2020.

[[5]](https://arxiv.org/abs/2212.00638) Sachin Goyal and Ananya Kumar and Sankalp Garg and Zico Kolter and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. Year 2022.

[[6]](https://ieeexplore.ieee.org/document/8845685) S. Qiu, Y. Zhao, J. Jiao, Y. Wei, and S. Wei, "Referring image segmentation by generative adversarial learning", IEEE Trans. Multimedia. Year 2020.

[[7]](https://arxiv.org/abs/1505.00468) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, "VQA: Visual question answering", in Proc. IEEE Int. Conf. Comput. Vis. Year 2015.

[[8]](https://ieeexplore.ieee.org/document/9422035) Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, "Visual question answering: A survey of methods and datasets", Comput. Vis. Image Underst. Year 2017.

[[9]](https://arxiv.org/abs/1904.05548) Z. Zheng, W. Wang, S. Qi, and S. Zhu, "Reasoning visual dialogs with structural and partial observations", in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Year 2019.

[[10]](https://arxiv.org/abs/1809.01816) S. Kottur, J. M. F. Moura, D. Parikh, D. Batra, and M. Rohrbach, "Visual coreference resolution in visual dialog using neural module networks", in Proc. Eur. Conf. Comput. Vis. Year 2018.

[[11]](https://arxiv.org/abs/1511.02283) J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, "Generation and comprehension of unambiguous object descriptions", in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Year 2016.

[[12]](https://arxiv.org/abs/2202.10054) Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. "Fine-tuning can distort pretrained features and underperform out-of-distribution". In International Conference on Learning Representations (ICLR). Year 2022.

[[13]](https://arxiv.org/abs/2109.01903) Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. CoRR. Year 2021.

[[14]](https://arxiv.org/abs/1803.01534) Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. Year 2018.

[[15]](https://arxiv.org/abs/2104.11892) Syed Sahil Abbas Zaidi and Mohammad Samar Ansari and Asra Aslam and Nadia Kanwal and Mamoona Asghar and Brian Lee. A Survey of Modern Deep Learning based Object Detection Models. Year 2021.

[[16]](https://paperswithcode.com/task/object-detection) paperswithcode.com list of state of the art Object Detection algorithms. Last visited on 28th August 2023.

[[17]](https://arxiv.org/abs/1506.02640) Joseph Redmon and Santosh Divvala and Ross Girshick and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. Year 2015.

[[18]](https://arxiv.org/abs/2005.12872) Nicolas Carion and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Alexander Kirillov and Sergey Zagoruyko. End-to-End Object Detection with Transformers. Year 2020.

[[19]](https://arxiv.org/abs/2109.01903) Mitchell Wortsman and Gabriel Ilharco and Jong Wook Kim and Mike Li and Simon Kornblith and Rebecca Roelofs and Raphael Gontijo-Lopes and Hannaneh Hajishirzi and Ali Farhadi and Hongseok Namkoong and Ludwig Schmidt. Robust fine-tuning of zero-shot models. Year 2021.

[[20]](https://arxiv.org/abs/2308.12919) Jian Liang and Lijun Sheng and Zhengbo Wang and Ran He and Tieniu Tan. Towards Realistic Unsupervised Fine-tuning with CLIP. Year 2023.

[[21]](https://arxiv.org/abs/2212.03640) Hanoona Rasheed and Muhammad Uzair Khattak and Muhammad Maaz and Salman Khan and Fahad Shahbaz Khan. Fine-tuned CLIP Models are Efficient Video Learners. Year 2022.

[[22]](https://arxiv.org/abs/2007.03051) Lasse F. Wolff Anthony and Benjamin Kanding and Raghavendra Selvan. Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models. Year 2020.

[[23]](https://arxiv.org/abs/2101.06983) Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan. Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. Year 2021.

[[24]](https://arxiv.org/abs/1706.03762) Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin. Attention Is All You Need. Year 2017.

[[25]](https://arxiv.org/abs/2304.11127) Shuhei Watanabe. Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance. Year 2023.

[[26]](https://arxiv.org/abs/1603.06560) Lisha Li and Kevin Jamieson and Giulia DeSalvo and Afshin Rostamizadeh and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Year 2016.

[[27]](https://arxiv.org/abs/2005.14165) Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert-Voss and Gretchen Krueger and Tom Henighan and Rewon Child and Aditya Ramesh and Daniel M. Ziegler and Jeffrey Wu and Clemens Winter and Christopher Hesse and Mark Chen and Eric Sigler and Mateusz Litwin and Scott Gray and Benjamin Chess and Jack Clark and Christopher Berner and Sam McCandlish and Alec Radford and Ilya Sutskever and Dario Amodei. Language Models are Few-Shot Learners. Year 2020.

[[28]](https://arxiv.org/abs/2012.15723) Tianyu Gao and Adam Fisch and Danqi Chen. Making Pre-trained Language Models Better Few-shot Learners. Year 2020.

[[29]](https://arxiv.org/abs/1907.03752) Marivate, Vukosi and Sefara, Tshephisho. Improving Short Text Classification Through Global Augmentation Methods. Year 2020.


[[30]](https://aclanthology.org/D19-1670.pdf) Jason Wei and Kai Zou. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. Year 2019.

[[31]](https://arxiv.org/abs/1912.08777) Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Year 2019.

[[32]](https://arxiv.org/abs/1910.13461) Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and Abdelrahman Mohamed and Omer Levy and Ves Stoyanov and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Year 2019.

[[33]](https://doi.org/10.1162/neco.1997.9.5.1093) Yves Grandvalet, Stéphane Canu, Stéphane Boucheron. Noise Injection: Theoretical Prospects. Year 1997.

[[34]](https://paperswithcode.com/paper/understanding-the-difficulty-of-training-deep) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Year 2010.

[[35]](https://arxiv.org/abs/1502.01852v1) Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Year 2015.

[[36]](https://ieeexplore.ieee.org/document/9207417) L. Zhang, X. Wang, L. Yao and F. Zheng. Zero-Shot Object Detection with Textual Descriptions Using Convolutional Neural Networks. Year 2020.

[[37]](https://arxiv.org/abs/1612.03557) Jonghwan Mun and Minsu Cho and Bohyung Han. Text-guided Attention Model for Image Captioning. Year 2016.