# Precision & Recall

Precision is the ratio of true positives to all detected positives. Recall is the ratio of true positives to all ground truth positives.

$$
\def\tp{\text{True Positives}}
\def\fp{\text{False Positives}}
\def\tn{\text{True Negatives}}
\def\fn{\text{False Negatives}}
\begin{aligned}
\text{Precision} &= \frac{\tp}{\tp + \fp} \\
\text{Recall} &= \frac{\tp}{\tp + \fn}
\end{aligned}
$$

The two metrics trade off against each other. Assuming that the model's number of true positives stays fixed:

- Maximizing precision is the same as minimizing false positives, possibly at the expense of more false negatives.
- Maximizing recall is the same as minimizing false negatives, possibly at the expense of more false positives.
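
As a minimal sketch of the formulas above, the snippet below computes precision and recall from raw counts. The function name `precision_recall` and the counts are hypothetical, chosen only to show two models with the same true positives but opposite error profiles.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: same true positives, different kinds of error.
conservative = precision_recall(tp=8, fp=1, fn=6)  # few false positives
permissive = precision_recall(tp=8, fp=6, fn=1)    # few false negatives
print("Conservative (precision, recall):", conservative)
print("Permissive (precision, recall):", permissive)
```

The conservative model scores higher on precision and lower on recall, and the permissive model shows the reverse.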

# Intersection over Union (IoU)

<div>
<img src="./assets/iou.png" width="600"/>
</div>

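IoU measures how well a predicted bounding box overlaps a ground truth box: the area of their overlap divided by the area of their union. When the area of overlap equals the area of union, i.e. `iou == 1`, we have a perfect bounding box detection. In practice we pick an IoU threshold (0.5 is a common choice): a detection whose IoU with a ground truth box meets the threshold counts as a true positive, otherwise it counts as a false positive.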

<div>
<img src="./assets/iou-threshold.jpeg" width="600"/>
</div>
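
The IoU computation and thresholding can be sketched as follows. This assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` format. The helper name `box_iou`, the example boxes, and the 0.5 threshold are illustrative assumptions, not part of any particular library.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero so non-overlapping boxes give zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


IOU_THRESHOLD = 0.5  # assumed threshold for this example

gt_box = (0, 0, 10, 10)
pred_box = (5, 0, 15, 10)  # shifted right by half the box width
iou = box_iou(gt_box, pred_box)
is_true_positive = iou >= IOU_THRESHOLD
print(f"IoU: {iou:.3f}, true positive: {is_true_positive}")
```

Here the overlap is 50 and the union is 150, so the IoU is 1/3 and the detection falls below the assumed threshold.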

## Average Precision

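For a set of predictions, we fix the IoU threshold and sweep the detection confidence threshold to produce a set of precision and recall values. Plotting precision against recall gives a **precision-recall curve** (related to, but distinct from, a receiver operating characteristic curve, which plots true positive rate against false positive rate). The area under the precision-recall curve is the average precision.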

![ROC](./assets/roc.png)
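
A minimal sketch of the area-under-curve computation is below. It assumes each detection has already been marked as a true or false positive at a fixed IoU threshold, and that the curve is traced by sweeping the confidence score from high to low. The function name `average_precision` and the example data are hypothetical. The area is a plain, un-interpolated sum; benchmarks such as PASCAL VOC and COCO use interpolated variants, so their numbers can differ.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """Un-interpolated area under the precision-recall curve.

    scores: confidence score per detection
    is_tp:  True if the detection matched a ground truth box
    num_gt: total number of ground truth boxes
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # high confidence first
    tp_flags = np.asarray(is_tp, dtype=bool)[order]

    cum_tp = np.cumsum(tp_flags)
    cum_fp = np.cumsum(~tp_flags)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_gt

    # Weight each precision value by the recall gained at that detection.
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * recall_steps)), precision, recall


# Hypothetical detections: 4 predictions, 4 ground truth boxes.
scores = [0.9, 0.8, 0.7, 0.6]
is_tp = [True, False, True, True]
ap, precision, recall = average_precision(scores, is_tp, num_gt=4)
print("Precision:", precision)
print("Recall:", recall)
print(f"AP: {ap:.4f}")
```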

# Mean Average Precision

Average precision (AP) is the area under the precision-recall curve, so it is precision _averaged_ over all detection thresholds. The mean average precision (mAP) is the mean of the per-class AP values across all classes. For a model that predicts a single class, mAP is the same as AP because there is only one class.
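
As a small illustration, the mean over classes can be written as below. The class names and AP values are made up for the example.

```python
# Hypothetical per-class AP values.
per_class_ap = {"car": 0.72, "person": 0.65, "bicycle": 0.40}

mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(f"mAP: {mean_ap:.2f}")

# With a single class, mAP reduces to that class's AP.
single_class_ap = {"car": 0.72}
single_mean_ap = sum(single_class_ap.values()) / len(single_class_ap)
print(f"Single-class mAP: {single_mean_ap:.2f}")
```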

# Mean IoU

In **segmentation**, we can define true/false positives and negatives directly at the pixel level for each class.

In [21]:
import numpy as np


def calculate_confusion_matrix_values(gt_mask, pred_mask):
    """
    Calculates the true positive, false positive, and false negative values for
    two binary segmentation masks.

    Args:
        gt_mask (numpy.ndarray): binary mask of shape (H, W)
        pred_mask (numpy.ndarray): binary mask of shape (H, W)

    Returns:
        tuple: true positive, false positive, false negative values
    """
    TP = np.logical_and(gt_mask, pred_mask)
    FP = np.logical_and(np.logical_not(gt_mask), pred_mask)
    FN = np.logical_and(gt_mask, np.logical_not(pred_mask))

    return np.sum(TP), np.sum(FP), np.sum(FN)

In [33]:
# Let softmax be one-hot encoding of classes for each pixel (flatten into 1D array of one-hot encodings)
softmax = np.array([
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
])
# argmax returns 0-based indices; add 1 so class labels start at 1
pred_labels = np.argmax(softmax, axis=1) + 1
gt_labels = np.array([1, 1, 2, 1, 1])
print("Predicted labels", pred_labels)
print("Ground truth labels", gt_labels)

# For a given class, say class = 1
pred_mask = np.where(pred_labels == 1, 1, 0)
gt_mask = np.where(gt_labels == 1, 1, 0)
print("Class = 1 prediction mask", pred_mask)
print("Class = 1 ground truth mask", gt_mask)
TP, FP, FN = calculate_confusion_matrix_values(gt_mask, pred_mask)
print(f"TP: {TP}, FP: {FP}, FN: {FN}")

Predicted labels [1 1 1 2 3]
Ground truth labels [1 1 2 1 1]
Class = 1 prediction mask [1 1 1 0 0]
Class = 1 ground truth mask [1 1 0 1 1]
TP: 2, FP: 1, FN: 2


In segmentation, IoU is used differently: rather than serving as a matching threshold, it measures how well the model segments a scene (or image) for a given class.

$$
\def\tp{\text{TP}}
\def\fp{\text{FP}}
\def\fn{\text{FN}}
\text{IoU} = \frac{\tp}{\tp + \fp + \fn}
$$
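
Plugging the class 1 counts from the example above (TP = 2, FP = 1, FN = 2) into this formula gives a quick check. The helper name `segmentation_iou` is an assumption for illustration.

```python
def segmentation_iou(tp, fp, fn):
    """Per-class IoU from pixel counts; NaN if the class appears nowhere."""
    denom = tp + fp + fn
    return tp / denom if denom > 0 else float("nan")


iou_class_1 = segmentation_iou(tp=2, fp=1, fn=2)
print("Class = 1 IoU:", iou_class_1)  # 2 / (2 + 1 + 2) = 0.4
```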

Mean IoU (mIoU) is the average of the IoU values over all classes. However, there are a few issues with mean IoU.

1. **Class Imbalance**: In many real-world datasets, some classes are much more common than others. A model that segments the common classes well but the rare classes poorly can still achieve a fairly high mIoU when the poorly segmented classes are few, because their low IoU values are averaged away.
2. **Misinterpretation of Mean**: mIoU is a simple arithmetic mean of the per-class IoU values, so each class gets equal weight regardless of how many pixels it covers. A single number therefore hides which classes are segmented well and which are not, which is closely related to the class imbalance issue.
3. **Failure to Capture Spatial Discrepancies**: mIoU does not take into account the relative spatial distribution of predicted regions. Two predictions with the same IoU can look quite different if one has errors scattered throughout the image and the other has a single consolidated region of error.
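
To make the averaging concrete, here is a minimal sketch of mIoU on the same five-pixel example used earlier. The function name `mean_iou` is hypothetical, and skipping classes that appear in neither mask is an assumption; other conventions exist.

```python
import numpy as np


def mean_iou(gt_labels, pred_labels, class_ids):
    """Return (mIoU, per-class IoU dict) for flat label arrays."""
    ious = {}
    for c in class_ids:
        gt_mask = gt_labels == c
        pred_mask = pred_labels == c
        union = np.logical_or(gt_mask, pred_mask).sum()
        if union == 0:
            continue  # assumption: skip classes absent from both masks
        inter = np.logical_and(gt_mask, pred_mask).sum()
        ious[c] = float(inter / union)
    if not ious:
        return float("nan"), ious
    return float(np.mean(list(ious.values()))), ious


gt_labels = np.array([1, 1, 2, 1, 1])
pred_labels = np.array([1, 1, 1, 2, 3])
miou, per_class = mean_iou(gt_labels, pred_labels, class_ids=[1, 2, 3, 4])
print("Per-class IoU:", per_class)
print(f"mIoU: {miou:.4f}")
```

Class 1 covers four of the five ground truth pixels, yet it contributes only one third of the score, which illustrates the equal-weighting behavior described above.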

# Dice Coefficient

$$
\text{Dice} = \frac{2 \cdot \text{Intersection}}{\text{Union} + \text{Intersection}}
$$

This is equivalent to the F1 score.

$$
\def\tp{\text{TP}}
\def\fp{\text{FP}}
\def\tn{\text{TN}}
\def\fn{\text{FN}}
\begin{aligned}
\text{F1}
&= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
&= 2 \cdot \frac{\tp}{\tp + \fp} \cdot \frac{\tp}{\tp + \fn} \cdot \frac{(\tp + \fp)(\tp + \fn)}{\tp \, (2\,\tp + \fp + \fn)} \\
&= \frac{2\,\tp}{2\,\tp + \fp + \fn}
\end{aligned}
$$

By definition, union and intersection are

$$
\def\tp{\text{TP}}
\def\fp{\text{FP}}
\def\tn{\text{TN}}
\def\fn{\text{FN}}
\begin{aligned}
\text{Union} &= \tp + \fp + \fn \\
\text{Intersection} &= \tp
\end{aligned}
$$

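Substituting these into the Dice formula gives $\frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}$, the same expression as F1. F1 is in turn the harmonic mean of precision and recall.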

$$
H(x_1, x_2, \dots, x_n) = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}
$$

The harmonic mean is biased toward the lowest value. Thus, the F1 score (or Dice coefficient) is pulled toward whichever of precision and recall is lower.

> One characteristic of the harmonic mean is that it is dominated by the smallest elements of the set, more so 
  than either the arithmetic mean or the geometric mean. This makes the harmonic mean valuable in certain 
  situations where you want smaller values to have a higher influence on the mean.

In [39]:
def harmonic_mean(values):
    reciprocal = 1 / values
    return len(values) / np.sum(reciprocal)

values = np.array([1.0, 10.0])
print("Harmonic mean", harmonic_mean(values))
print("Mean", values.mean())

Harmonic mean 1.8181818181818181
Mean 5.5
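
As a numeric check of the Dice and F1 equivalence, the sketch below reuses the class 1 counts from the segmentation example (TP = 2, FP = 1, FN = 2). The function names are hypothetical, and the F1 helper assumes TP > 0 so that precision and recall are both non-zero.

```python
def dice_from_counts(tp, fp, fn):
    """Dice coefficient with intersection = TP and union = TP + FP + FN."""
    intersection = tp
    union = tp + fp + fn
    return 2 * intersection / (union + intersection)


def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 / (1 / precision + 1 / recall)


tp, fp, fn = 2, 1, 2
dice = dice_from_counts(tp, fp, fn)
f1 = f1_from_counts(tp, fp, fn)
print(f"Dice: {dice:.4f}")
print(f"F1:   {f1:.4f}")
```

Both values come out to 4/7 (about 0.5714), which sits closer to the recall of 0.5 than to the arithmetic mean of precision and recall.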
