In [None]:
!pip install jaxtyping

In [None]:
import torch
from jaxtyping import Float
import torchvision.ops
import torch.nn.functional as F

FYI
---

When evaluating a visual grounding framework, it is essential to consider appropriate metrics to ensure that the model is performing as intended.
Some good metrics to consider include:

- localization accuracy
- grounding accuracy
- semantic similarity

**Localization accuracy** measures how accurately the system can localize an object in the image.

e.g., Intersection over Union (IoU)

_IoU between predicted bbox and ground truth bbox_

**Grounding accuracy** measures how accurately the system can ground the localized object to a language description.

e.g., Recall

**Semantic similarity** measures the similarity between the predicted bounding boxes and the ground-truth descriptions.

e.g., Cosine similarity, Euclidean distance

_cosine similarity / Euclidean distance between the latent-space coordinates of the predicted bbox and those of the ground-truth bbox_

Evaluating the model using these metrics can provide valuable insights into the model’s performance and areas for improvement.

![metrics](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*S0P4uuuK_w4JVcNeE1O-bg.jpeg)
![TP, FP, FN](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*mdqpx5V7TYhXRz046tm6zQ.jpeg)

IoU (Intersection over Union)
---

$$
J(A, B) = \frac {|A \cap B|} {|A \cup B|}
$$

In [None]:
b1: Float[torch.Tensor, '4'] = torch.tensor([167, 238, 249, 276], dtype=torch.double)
b2: Float[torch.Tensor, '4'] = torch.tensor([146, 230, 228, 268], dtype=torch.double)
b3: Float[torch.Tensor, '4'] = torch.tensor([0, 10, 10, 20], dtype=torch.double)

In [None]:
[
    (x, y, torchvision.ops.box_iou(torch.unsqueeze(x, 0), torch.unsqueeze(y, 0)))
    for x in [b1, b2, b3]
    for y in [b1, b2, b3]
]
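
As a cross-check on the formula, the same quantity can be computed by hand. The sketch below is only illustrative: the helper name `iou_xyxy` is made up here, and it assumes boxes in `(x1, y1, x2, y2)` format, which is the format `torchvision.ops.box_iou` expects.

```python
import torch


def iou_xyxy(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    """IoU of two boxes given as (x1, y1, x2, y2) tensors of shape (4,).

    Hypothetical helper written for illustration only.
    """
    # Corners of the intersection rectangle
    top_left = torch.maximum(box_a[:2], box_b[:2])
    bottom_right = torch.minimum(box_a[2:], box_b[2:])
    # Clamp width and height at 0 so that disjoint boxes give an empty intersection
    wh = (bottom_right - top_left).clamp(min=0)
    intersection = wh[0] * wh[1]
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union


b1 = torch.tensor([167, 238, 249, 276], dtype=torch.double)
b2 = torch.tensor([146, 230, 228, 268], dtype=torch.double)
b3 = torch.tensor([0, 10, 10, 20], dtype=torch.double)

iou_12 = iou_xyxy(b1, b2)  # overlapping boxes: 1830 / 4402
iou_13 = iou_xyxy(b1, b3)  # disjoint boxes: 0
iou_11 = iou_xyxy(b1, b1)  # identical boxes: 1
```

For `b1` and `b2` the intersection is 61 × 30 = 1830 and each box has area 82 × 38 = 3116, so the union is 2 × 3116 − 1830 = 4402.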

Cosine similarity
---

$$
\text{cosine similarity}(\mathbf{A}, \mathbf{B}) := \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
$$

Cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space.

Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths.
It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle.
The cosine similarity always belongs to the interval $[-1, 1]$.

For example:

- two proportional vectors have a cosine similarity of 1
- two orthogonal vectors have a similarity of 0
- two opposite vectors have a similarity of -1
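
The three cases listed above can be checked directly from the definition (dot product over the product of the lengths). This is a minimal sketch; the helper `cosine_similarity` and the example vectors are chosen here purely for illustration.

```python
import torch


def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Dot product of two 1-D vectors divided by the product of their L2 norms."""
    return torch.dot(a, b) / (torch.linalg.vector_norm(a) * torch.linalg.vector_norm(b))


v = torch.tensor([1.0, 1.0], dtype=torch.double)
w = torch.tensor([-1.0, 1.0], dtype=torch.double)

same_direction = cosine_similarity(v, 2 * v)  # proportional vectors: close to 1
orthogonal = cosine_similarity(v, w)          # orthogonal vectors: 0
opposite = cosine_similarity(v, -v)           # opposite vectors: close to -1
```

Scaling `v` by 2 leaves the result unchanged, which is the sense in which the measure depends only on the angle.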

In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in $[0, 1]$.

> It is important to note that the cosine distance is not a true distance metric as it does not exhibit the triangle inequality property — or, more formally, the Schwarz inequality — and it violates the coincidence axiom.
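
A small counterexample makes the note above concrete. The sketch assumes the common definition of cosine distance as `1 - cosine similarity`; the three vectors are picked by hand for illustration.

```python
import torch
import torch.nn.functional as F


def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine distance of two 1-D vectors, assumed here to be 1 - cosine similarity."""
    return 1 - F.cosine_similarity(a, b, dim=0)


x = torch.tensor([1.0, 0.0], dtype=torch.double)
y = torch.tensor([1.0, 1.0], dtype=torch.double)
z = torch.tensor([0.0, 1.0], dtype=torch.double)

# Triangle inequality would require d(x, z) <= d(x, y) + d(y, z)
direct = cosine_distance(x, z)                          # 1
detour = cosine_distance(x, y) + cosine_distance(y, z)  # 2 * (1 - 1/sqrt(2)), about 0.586
triangle_violated = bool(direct > detour)

# Coincidence axiom would require d(u, v) = 0 only when u == v
coincidence = cosine_distance(y, 2 * y)  # about 0 although y != 2 * y
```

Going through `y` is "shorter" than going from `x` to `z` directly, and two distinct but proportional vectors are at distance zero.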

In [None]:
p1: Float[torch.Tensor, '1 2'] = torch.tensor([[1, 1]], dtype=torch.double)
p2: Float[torch.Tensor, '1 2'] = torch.tensor([[2, 2]], dtype=torch.double)
p3: Float[torch.Tensor, '1 2'] = torch.tensor([[-1, -1]], dtype=torch.double)
p4: Float[torch.Tensor, '1 2'] = torch.tensor([[-1, 1]], dtype=torch.double)

[
    (x, y, F.cosine_similarity(x, y))
    for x in [p1, p2, p3, p4]
    for y in [p1, p2, p3, p4]
]

Euclidean distance
---

The Euclidean distance between two points in Euclidean space is the length of a line segment between the two points.
It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and is therefore occasionally called the Pythagorean distance.

In general, for points given by Cartesian coordinates in $n$-dimensional Euclidean space, the distance is:

$$
d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
$$

### Norms

$$
\begin{aligned}
\| \mathbf{x} \|_1 &= \sum_{i=1}^n |x_i|
\\
\| \mathbf{x} \|_2 &= \left( \sum_{i=1}^n |x_i|^2 \right)^{\frac{1}{2}} = \sqrt{\mathbf{x}^T \mathbf{x}}
\\
\| \mathbf{x} \|_\infty &= \max_{i=1, \cdots, n} |x_i|
\end{aligned}
$$

![norms](https://secure-res.craft.do/v2/LDKtNmspDKuemrRt7B2CN321caVNTbsv8AGosCiqG492ZxaKVK2FGUfzzgFWYadAReRTY7Jsr7ScmWfxcwNvqBXnBi59zSUTa2eZKtDweo9XhQ4vaeG3MKMBQhr2EP2EM1RvqDUHbcfDnoLMMSXoVc436ninA1mnw6dJrq315TFVYDknrLzgBkJCXtV4Lbjjd8nKWRrd3VswUGBBTtupdDNYeNWGUCdrokWjd6gRLMj5EQVLWrtzuVdTn5PZ51cmLhHToKEyYXZTaNnvsXpZ3mYLN8HbenHuxzv5JK6oWqaZd7ofCZ/norms.svg)
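
The three norms can be evaluated with `torch.linalg.vector_norm`. The vector below is an arbitrary example chosen so that the results are easy to verify by hand.

```python
import torch

x = torch.tensor([3.0, -4.0], dtype=torch.double)

l1 = torch.linalg.vector_norm(x, ord=1)               # |3| + |-4| = 7
l2 = torch.linalg.vector_norm(x, ord=2)               # sqrt(9 + 16) = 5
linf = torch.linalg.vector_norm(x, ord=float("inf"))  # max(|3|, |-4|) = 4

# The L2 norm written as sqrt(x^T x), matching the second formula
l2_alt = torch.sqrt(torch.dot(x, x))
```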

In [None]:
a: Float[torch.Tensor, 'A 2'] = torch.tensor([[0, 0], [1, 0], [0, 1], [2, 2]], dtype=torch.double)
b: Float[torch.Tensor, 'B 2'] = torch.tensor([[0, 0], [1, 1]], dtype=torch.double)
c: Float[torch.Tensor, 'A B'] = torch.cdist(a, b, p=2)  # pairwise L2 distances, one row per point in a
c  # display the result as the cell output
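
For comparison, the same pairwise distances can be written out with broadcasting, following the definition term by term. This is only a sketch to make the formula explicit; `torch.cdist` remains the more convenient call.

```python
import torch

a = torch.tensor([[0, 0], [1, 0], [0, 1], [2, 2]], dtype=torch.double)
b = torch.tensor([[0, 0], [1, 1]], dtype=torch.double)

# Broadcasting: (A, 1, 2) - (1, B, 2) -> (A, B, 2) coordinate differences
diff = a[:, None, :] - b[None, :, :]
# Square, sum over the coordinate axis, take the root: shape (A, B)
manual = torch.sqrt((diff ** 2).sum(dim=-1))

reference = torch.cdist(a, b, p=2)
```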


Recall
---

![precision vs recall](https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg)

Recall is a metric used to evaluate the performance of a machine learning model, particularly in classification tasks.
In the context of visual grounding, recall can be used to measure how well your model is able to identify and correctly label relevant visual objects or features in an image.

More specifically, recall is the fraction of relevant items (in this case, visual objects or features) that are correctly identified by the model, out of all the relevant items in the image.
In other words, it measures how well your model is able to "recall" the correct information.

$$
\text{Pre} = \frac{TP}{TP + FP} \qquad \text{Rec} = \frac{TP}{TP + FN}
$$
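
The two formulas translate directly into code. In the sketch below the counts are made up for illustration, and returning `0.0` for an empty denominator is a convention assumed here, not something the formulas prescribe.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); returns 0.0 when there are no positive predictions (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); returns 0.0 when there are no relevant items (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts, not taken from any experiment
tp, fp, fn = 8, 2, 4

pre = precision(tp, fp)  # 8 / 10
rec = recall(tp, fn)     # 8 / 12
```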

- [article a](https://towardsdatascience.com/what-is-average-precision-in-object-detection-localization-algorithms-and-how-to-calculate-it-3f330efe697b)
- [article b](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173)
- [article c](https://www.v7labs.com/blog/mean-average-precision)
- [article d](https://hasty.ai/docs/mp-wiki/metrics/map-mean-average-precision) $\to$ [colab](https://colab.research.google.com/drive/1eR1Qyo99k0O7XrHpscPOiLPRPx-yJ57p?usp=sharing#scrollTo=MNZOW-_5dyWu)
- [library](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)

$TP$ is when the IoU between the ground-truth bbox and the bbox chosen by CLIP is $\geq X$

~~$FN$ is when the IoU between the ground-truth bbox and a bbox (proposed by YOLO) discarded by CLIP is $\geq X$~~ $\to$ This is not the definition of $FN$, for the reason given below

Binary classification among YOLO's bboxes:

- Correct when it "matches" the ground truth
- Wrong otherwise

However, the bboxes returned by YOLO could all be wrong...

These metrics must be independent of the implementation, so the $\top / \bot$ ... are not based on the bboxes that were correctly accepted / rejected, but only on the predicted bbox and the ground truth.
Why? Because otherwise they would not be comparable with other implementations.
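
One way to read the notes above is to score each sample using only its predicted bbox and its ground-truth bbox, counting a hit when their IoU reaches the threshold $X$. The sketch below follows that reading under several assumptions: one predicted box per ground-truth box, boxes in `(x1, y1, x2, y2)` format, a default threshold of 0.5, hypothetical helper names (`paired_iou`, `hit_rate`), and made-up boxes.

```python
import torch


def paired_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between pred[i] and gt[i]; both are (N, 4) tensors in (x1, y1, x2, y2) format."""
    top_left = torch.maximum(pred[:, :2], gt[:, :2])
    bottom_right = torch.minimum(pred[:, 2:], gt[:, 2:])
    wh = (bottom_right - top_left).clamp(min=0)  # empty intersection for disjoint boxes
    intersection = wh[:, 0] * wh[:, 1]
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return intersection / (area_pred + area_gt - intersection)


def hit_rate(pred: torch.Tensor, gt: torch.Tensor, threshold: float = 0.5) -> float:
    """Fraction of samples whose predicted box has IoU >= threshold with the ground truth."""
    return (paired_iou(pred, gt) >= threshold).double().mean().item()


# Made-up boxes: IoUs are 1, 0.5, 25/175 and 0
pred = torch.tensor(
    [[0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10], [20, 20, 30, 30]], dtype=torch.double
)
gt = torch.tensor(
    [[0, 0, 10, 10], [0, 0, 10, 20], [5, 5, 15, 15], [0, 0, 10, 10]], dtype=torch.double
)

rate_at_050 = hit_rate(pred, gt)                 # 2 of 4 samples reach IoU >= 0.5
rate_at_010 = hit_rate(pred, gt, threshold=0.1)  # 3 of 4 samples reach IoU >= 0.1
```

Because the score looks only at the final prediction and the ground truth, it does not depend on which proposals were generated or discarded along the way.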

Might it make sense to include the cosine similarity to distinguish between $FN$ and $FP$?