# Computing evaluation metrics for detection

## TP, FP, TN and FN for object detection?

We have already learned about [evaluation metrics for classification](../2_classification/2_evaluation_metrics.ipynb). In object detection, we also perform some kind of classification. More specifically, a detector returns a set of bounding boxes with corresponding class labels. As such, the model has classified the image regions that correspond to the bounding boxes.

Just as in any classification problem, the label that is assigned to a certain image region can either be correct or incorrect. Let's say the model assigns a certain class label *A* to an image region. Then, there are two options:

1. **True positive** (TP) of label *A*: the region is predicted to have label *A* and indeed has label *A*
2. **False positive** (FP) of label *A*: the region is predicted to have label *A*, but *does not* have label *A*

Of course, there will be many image regions that **do not receive a certain label** from the detector. Let's say the model did not assign label *A* to a certain image region. Then, again, there are two possibilities:

1. **True negative** (TN) of label *A*: the region is predicted to *not* have label *A* and indeed does not have label *A*
2. **False negative** (FN) of label *A*: the region is predicted to *not* have label *A*, but *does* have label *A*

## TP, FP, FN at an IoU level

To evaluate the predictions, we compare them with the **ground-truth bounding boxes**. However, **predictions will almost never exactly overlap** with a ground-truth bounding box. Of course, this does not mean that all predicted bounding boxes are *bad* and should be counted as false positives. Instead, we use the **Intersection over Union** (IoU) to define when a predicted bounding box is a TP, FP or FN for a certain class.

We can choose the IoU threshold ourselves. Let's say we give it a value $\alpha$. Then a TP, FP and FN of a class label *A* at this IoU level $\alpha$ are defined as follows:

- **True positive at $\alpha$** ($\text{TP}_\alpha$) of label *A*: the predicted bounding box has label *A* and has an IoU $\ge \alpha$ with a ground truth bounding box that has label *A*
- **False positive at $\alpha$** ($\text{FP}_\alpha$) of label *A*: the predicted bounding box has label *A*, but does not have an IoU $\ge \alpha$ with a ground truth bounding box that has label *A*
- **False negative at $\alpha$** ($\text{FN}_\alpha$) of label *A*: a ground truth bounding box that has label *A*, without there being any prediction of label *A* that has an IoU $\ge \alpha$ with it

Defining *True negatives at $\alpha$* is not necessary, since we do not use them to calculate precision and recall. This makes sense, because the number of regions in an image that *do not* belong to a certain class is huge and rather uninformative.
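The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration for a single class and a single image, not the matching code used by any particular library. The function names `iou` and `count_tp_fp_fn` are hypothetical. The sketch also assumes a convention that the definitions above do not spell out: predictions are processed from highest to lowest score, and each ground-truth box can be matched by at most one prediction (as PASCAL VOC and COCO do).

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in [x0, y0, x1, y1] format."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(pred_boxes, gt_boxes, alpha=0.5):
    """Count TP, FP and FN at IoU level alpha for one class.

    Assumption: pred_boxes is already sorted by descending confidence score,
    and every ground-truth box may be matched only once.
    """
    matched = set()  # indices of ground-truth boxes that already have a match
    tp = 0
    for pred in pred_boxes:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            if idx in matched:
                continue
            overlap = iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= alpha:
            matched.add(best_idx)
            tp += 1
    fp = len(pred_boxes) - tp            # predictions without a matching ground truth
    fn = len(gt_boxes) - len(matched)    # ground-truth boxes that were never matched
    return tp, fp, fn


gt = [[0, 0, 10, 10], [20, 20, 30, 30]]
preds = [[0, 0, 10, 8], [50, 50, 60, 60]]
print(count_tp_fp_fn(preds, gt, alpha=0.5))  # (1, 1, 1)
```

The first prediction overlaps the first ground-truth box with an IoU of 0.8, so it counts as a TP at $\alpha = 0.5$. Raising $\alpha$ to 0.9 would turn it into an FP and the ground-truth box into an FN.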

## Precision and Recall at an IoU level

Now that we have defined how to calculate $\text{TP}_\alpha$, $\text{FP}_\alpha$ and $\text{FN}_\alpha$ for a certain class label at a given IoU level $\alpha$, we can calculate the precision and recall at that IoU level:

$$\text{Precision}_\alpha = \frac{\text{TP}_\alpha}{\text{TP}_\alpha + \text{FP}_\alpha}$$

$$\text{Recall}_\alpha = \frac{\text{TP}_\alpha}{\text{TP}_\alpha + \text{FN}_\alpha}$$
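As a small worked example of these two formulas, the sketch below computes both values from raw counts. The helper name `precision_recall` and the counts are made up for illustration. Returning 0 when a denominator is empty is one possible convention, not something the formulas prescribe.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from TP/FP/FN counts at a fixed IoU level."""
    # Assumption: return 0.0 when there are no predictions or no ground-truth boxes.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Illustrative counts: 8 correct boxes, 2 spurious boxes, 4 missed objects
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f}, recall={recall:.3f}")  # precision=0.800, recall=0.667
```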

## Average Precision at an IoU level

The detector will return a score that indicates how convinced it is of its prediction. As with classification, we can threshold this score to improve our precision. When we plot the precision and recall for all possible threshold values, we obtain a PR-curve for the chosen IoU level.

To summarize the PR-curve at the chosen IoU level, we can again compute the **average precision**, which is the area under the PR-curve. In practice, it is approximated by a sum over the thresholds:

$$
\text{AP}_\alpha = \sum_{t} P_\alpha(t)\cdot (R_{\alpha}(t) - R_\alpha(t-1))
$$

with $P_\alpha(t)$ and $R_\alpha(t)$ the precision, resp. recall, at the $t$-th threshold and IoU level $\alpha$. The thresholds are ordered from high to low, so that the recall increases with $t$, and $R_\alpha(0)$ is defined as 0.
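The sum can be sketched as follows for a single class. The sketch visits the detections from the highest to the lowest score, so that every detection acts as one threshold and the recall starts at 0 before any detection is accepted. The function name `average_precision` and its arguments are hypothetical, and the TP flags are assumed to have been determined beforehand at the chosen IoU level. Note that `pycocotools` uses an interpolated variant of this computation, so its numbers can differ from this plain sum.

```python
def average_precision(scores, is_tp, num_gt):
    """Non-interpolated AP for one class at a fixed IoU level.

    scores: confidence score of each detection
    is_tp:  whether each detection is a true positive at the chosen IoU level
    num_gt: number of ground-truth boxes of this class (assumed > 0)
    """
    # Visit detections from most to least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # one term of the sum
        prev_recall = recall
    return ap


# Four detections, three of them correct, four objects in the ground truth
ap = average_precision(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_tp=[True, False, True, True],
    num_gt=4,
)
print(f"AP = {ap:.3f}")  # AP = 0.604
```

Only true positives change the recall, so false positives contribute nothing to the sum directly. They do lower the precision of every true positive that comes after them.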

The **mean Average Precision at $\alpha$** ($\text{mAP}_\alpha$) is the mean of the $\text{AP}_\alpha$ of each class in the dataset.

$$
\text{mAP}_\alpha = \frac{1}{C} \sum_{i=1}^{C} \text{AP}_{\alpha,i}
$$

with $C$ the number of classes and $\text{AP}_{\alpha,i}$ the average precision for class $i$ at IoU $\alpha$.
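In code, this is a plain mean over the per-class values. The class names and AP values below are made up purely for illustration, and `mean_average_precision` is a hypothetical helper. For a dataset with a single class, the mAP is simply the AP of that class.

```python
from statistics import mean


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values at a fixed IoU level."""
    return mean(ap_per_class.values())


# Made-up AP values for three hypothetical classes
ap_per_class = {"pedestrian": 0.80, "cyclist": 0.60, "car": 0.70}
print(f"mAP = {mean_average_precision(ap_per_class):.2f}")  # mAP = 0.70
```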

We will use the [*Penn-Fudan Database for Pedestrian Detection and Segmentation*](https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip) here. It contains 170 images with 345 instances of pedestrians. To run the notebook, download the dataset, unzip it and move it into the `data/` directory.

In [1]:
from lib.penn_fundan import PennFudanDataset
from lib.detection.transforms import ToTensor

dataset = PennFudanDataset('data/PennFudanPed/', ToTensor())

## COCO AP

It is common to simply use $\alpha=0.5$ when computing $\text{mAP}_\alpha$ for a detector. However, we can also calculate $\text{mAP}_\alpha$ for multiple values of $\alpha$. In the literature, **COCO AP** is frequently used as a metric. It is computed as the average of $\text{mAP}_{0.50}$, $\text{mAP}_{0.55}$, $\text{mAP}_{0.60}$, ..., $\text{mAP}_{0.95}$. It is written as $\text{mAP}_{[.5:.05:.95]}$ or $\text{AP}_{\text{COCO}}$ or even just $\text{AP}$.
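To make the averaging concrete, the sketch below lists the ten IoU levels and averages a per-level score over them. Both function names are hypothetical, and `map_at_alpha` stands for any function that returns $\text{mAP}_\alpha$ for a given $\alpha$. The lambda in the usage example is a made-up stand-in, not a real detector result.

```python
def coco_iou_thresholds():
    """The ten IoU levels 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def coco_ap(map_at_alpha):
    """Average of mAP over the COCO IoU levels.

    map_at_alpha: callable that returns the mAP for a given IoU level
    """
    thresholds = coco_iou_thresholds()
    return sum(map_at_alpha(alpha) for alpha in thresholds) / len(thresholds)


print(coco_iou_thresholds())
# [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]

# Made-up detector whose mAP drops linearly as the IoU requirement gets stricter
print(f"{coco_ap(lambda alpha: 1.0 - alpha):.3f}")  # 0.275
```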

When reporting COCO AP, it is recommended to use the official `pycocotools` to guarantee a fair comparison with other published work. The `pycocotools` API itself is rather cumbersome, however, so we will use a wrapper around it. This wrapper comes from the torchvision reference scripts, which we have copied from [here](https://github.com/pytorch/vision/tree/v0.3.0/references/detection) and put into `lib.detection`.

## Converting a `Dataset` into a `COCO` instance

To compute the COCO metrics, we first need to create a `pycocotools.coco.COCO` instance. This is an object that represents the entire dataset with ground truth annotations. With our wrapper, we can easily convert a PyTorch `Dataset` into such a `pycocotools.coco.COCO` instance. For this to work, the dataset `__getitem__` should return:

* image: a torch Tensor of shape (C, H, W)
* target: a dict containing the following fields
    * `boxes` (`FloatTensor[N, 4]`): the coordinates of the `N` bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to `W` and `0` to `H`
    * `labels` (`Int64Tensor[N]`): the label for each bounding box
    * `image_id` (`Int64Tensor[1]`): an image identifier. It should be unique between all the images in the dataset, and is used during evaluation
    * `area` (`Tensor[N]`): The area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
    * `iscrowd` (`UInt8Tensor[N]`): instances with `iscrowd=True` will be ignored during evaluation.
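The `area` field can be derived directly from `boxes`. The sketch below uses plain Python lists instead of tensors to keep it self-contained, and both helper names are hypothetical. The size buckets follow the COCO definition of small (area below $32^2$ pixels), medium (between $32^2$ and $96^2$) and large (above $96^2$). The handling of boxes that fall exactly on a boundary is simplified here.

```python
def box_area(box):
    """Area in pixels of a box in [x0, y0, x1, y1] format."""
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def coco_size_bucket(area):
    """Size category that COCO uses to split its metrics (boundaries simplified)."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# A made-up pedestrian box of 100 x 300 pixels
box = [10, 20, 110, 320]
area = box_area(box)
print(area, coco_size_bucket(area))  # 30000 large
```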

In [2]:
from lib.detection.coco_utils import get_coco_api_from_dataset

coco = get_coco_api_from_dataset(dataset)

creating index...
index created!


## Creating an evaluator

The `pycocotools.coco.COCO` instance represents the ground-truth data. It will be used by `pycocotools.cocoeval.COCOeval` to compute the COCO AP (among others). Again, we will use a torchvision reference script to hide the `pycocotools` API and make it a bit easier to use.

In [3]:
from lib.detection.coco_eval import CocoEvaluator

coco_evaluator = CocoEvaluator(coco, ["bbox"])

Once our evaluator is set up, it is only a matter of passing our predictions to it and computing the COCO metrics!

This happens in two stages, however. First, we **update** the evaluator with each image for which we have calculated a prediction. Next, we **accumulate** the results of the images.

## Updating the evaluator

This is done with a result `dict` that contains the corresponding `image_id`s as dictionary keys. The dictionary value of each of these keys is also a dictionary with the following fields:

* `boxes` (`FloatTensor[N, 4]`): the coordinates of the `N` predicted bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to `W` and `0` to `H`
* `labels` (`Int64Tensor[N]`): the predicted label for each bounding box
* `scores` (`Tensor[N]`): The confidence score for each bounding box.

You can call `update()` multiple times, each time with a different set of images. As long as you do not call `accumulate()`, the results will simply be *appended* (unless you call `update()` with the same image id again, in which case the previous result for that image will be overwritten by the new one).
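COCO itself stores boxes as `[x, y, width, height]` and expects results as a flat list with one entry per box, with the keys `image_id`, `category_id`, `bbox` and `score`. To our understanding, the reference wrapper takes care of this conversion internally, so you normally do not have to do it yourself. The sketch below only illustrates what that conversion amounts to. It uses plain Python lists instead of tensors, and both function names are hypothetical.

```python
def xyxy_to_xywh(box):
    """Convert a box from [x0, y0, x1, y1] to COCO's [x, y, width, height]."""
    x0, y0, x1, y1 = box
    return [x0, y0, x1 - x0, y1 - y0]


def to_coco_results(res):
    """Flatten a {image_id: prediction} dict into a list of COCO result entries."""
    results = []
    for image_id, pred in res.items():
        for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
            results.append({
                "image_id": image_id,
                "category_id": label,
                "bbox": xyxy_to_xywh(box),
                "score": score,
            })
    return results


# One made-up prediction for the image with id 5
example = {5: {"boxes": [[10, 20, 110, 320]], "labels": [1], "scores": [0.9]}}
print(to_coco_results(example))
# [{'image_id': 5, 'category_id': 1, 'bbox': [10, 20, 100, 300], 'score': 0.9}]
```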

In [4]:
img_idx = 5
img, target = dataset[img_idx]

In [5]:
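import torch

# As a sanity check, we pass the ground-truth boxes as if they were predictions.
res = {
    # The key must be the image id; we assume here that it equals the dataset index.
    img_idx: {
        'boxes': target['boxes'],
        # Give every box the maximum confidence score
        'scores': torch.ones(len(target['boxes'])),
        'labels': target['labels'],
    }
}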

coco_evaluator.update(res)

## Accumulate the evaluator

Once we have updated the evaluator with each image for which we have a prediction, we call `accumulate()` to accumulate the evaluations of these predictions.

In [6]:
coco_evaluator.accumulate()

Accumulating evaluation results...
DONE (t=0.02s).


## Print the results

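To show the results, we can call `summarize()`. Since we passed the ground-truth boxes as predictions, we expect a perfect AP. A value of -1.000 means that the metric could not be computed, because there are no ground-truth boxes in that category (here: no small or medium boxes).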

In [7]:
coco_evaluator.summarize()

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 1.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 1.000
