# Object Detection Evaluation

This notebook shows the kinds of graphs you can draw thanks to the Lours evaluator.

In [1]:
%load_ext autoreload

%autoreload 2
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from lours.dataset import from_coco
from lours.evaluation.detection import DetectionEvaluator as de
from lours.evaluation.detection.util import display_confusion_matrix
from lours.utils.grouper import ContinuousGroup

warnings.simplefilter(action="ignore", category=FutureWarning)

## Loading the dataset and the predictions

Note that they are both treated as datasets at first; we only get a detection evaluator when creating the eval object.

As a second note, you can add several prediction datasets at the same time.

In [2]:
coco_eval = from_coco("notebook_data/coco_valid.json").remap_from_preset(
    "coco", "supercategory"
)
coco_darknet = from_coco(
    "notebook_data/yolov4_prediction_coco_eval.json"
).remap_from_preset("coco", "supercategory")
evaluator = de(
    groundtruth=coco_eval, predictions=coco_darknet, predictions2=coco_darknet
)

In [3]:
evaluator

VBox(children=(HTML(value='<b> Evaluation object, containing 5,000 images, 36,781 groundtruth objects, and 2 p…

## Compute the matches

This is arguably the slowest part.

Hopefully, it can be multiprocessed in the future.

You can compute them by taking category into account or not.

- Category-agnostic matching is useful for e.g. computing confusion matrices
- Category-specific matching is useful for e.g. computing precision-recall curves
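
To make the difference between the two matching modes concrete, here is a minimal sketch in plain Python. It is not the Lours implementation: `box_iou`, `greedy_match` and the toy boxes are hypothetical, and a greedy, highest-confidence-first strategy is assumed purely for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(predictions, groundtruths, category_agnostic=True):
    """Assumed strategy: visit predictions by decreasing confidence and give
    each one the free groundtruth box with the best non-zero IoU.
    Returns a list of (prediction_index, groundtruth_index or None, iou)."""
    order = sorted(
        range(len(predictions)), key=lambda i: -predictions[i]["confidence"]
    )
    taken = set()
    matches = []
    for p in order:
        best_gt, best_iou = None, 0.0
        for g, gt in enumerate(groundtruths):
            if g in taken:
                continue
            # Category-specific matching only compares boxes of the same class.
            if not category_agnostic and gt["category"] != predictions[p]["category"]:
                continue
            iou = box_iou(predictions[p]["box"], gt["box"])
            if iou > best_iou:
                best_gt, best_iou = g, iou
        if best_gt is not None:
            taken.add(best_gt)
        matches.append((p, best_gt, best_iou))
    return matches


# Toy data (hypothetical): one well-localized but misclassified prediction,
# and one prediction that overlaps nothing.
groundtruths = [
    {"box": (0, 0, 10, 10), "category": "person"},
    {"box": (20, 20, 30, 30), "category": "vehicle"},
]
predictions = [
    {"box": (0, 0, 10, 8), "category": "animal", "confidence": 0.9},
    {"box": (50, 50, 60, 60), "category": "person", "confidence": 0.2},
]

agnostic = greedy_match(predictions, groundtruths, category_agnostic=True)
specific = greedy_match(predictions, groundtruths, category_agnostic=False)
print("agnostic:", agnostic)
print("specific:", specific)
```

In this toy case the misclassified prediction is still matched in category-agnostic mode, which is the information a confusion matrix needs, while in category-specific mode it stays unmatched with an IoU of 0.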

In [None]:
matches = evaluator.compute_matches("predictions", category_agnostic=True)
display(matches["predictions"])
matches = evaluator.compute_matches("predictions", category_agnostic=False)
display(matches["predictions"])

computing matches between groundtruth and predictions (category agnostic)



Unnamed: 0,prediction_id,iou,groundtruth_id
0,48019,0.954652,34646
1,48033,0.912193,104368
2,48034,0.939746,103487
3,48042,0.895466,230831
4,48020,0.922641,35802
...,...,...,...
60,17979,0.000000,
61,17980,0.000000,
62,17981,0.000000,
63,17982,0.000000,


computing matches between groundtruth and predictions (category specific)



See how two new tabs have been added to the dataset widget

In [None]:
evaluator

Here, we just plot the IoU distribution. As you can see, more than half of the detections have an IoU of 0. These predictions typically have a very low confidence as well, which means they will be easily filtered out and won't have much influence on the evaluation.

In [None]:
plt.plot(
    evaluator.matches["category_specific"]["predictions"]["iou"].sort_values().values
)

### Computing confusion matrix

The confusion matrix can be computed for all matches or by groups if the argument *groups* is defined.

The values are normalized over the groundtruth.

Notes:
- The class *None* corresponds to false positives and false negatives.
- The `model` column indicates the name of the given predictions. Here, we get the confusion data for the predictions named `predictions` and `predictions2` (which are the same, for the sake of the example)
- Since matches have already been computed for `predictions` we only have to compute them for `predictions2`
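
As a rough illustration of what "normalized over the groundtruth" means, the sketch below builds a tiny confusion table by hand. The `pairs` list is hypothetical, and the way the *None* row is normalized here is an assumption, not necessarily what Lours does.

```python
from collections import Counter

# Hypothetical (groundtruth_category, predicted_category) pairs coming from
# category-agnostic matching; None marks a missing counterpart.
pairs = [
    ("person", "person"),
    ("person", "person"),
    ("person", "vehicle"),  # confusion between two classes
    ("person", None),       # false negative: no prediction matched
    ("vehicle", "vehicle"),
    (None, "vehicle"),      # false positive: no groundtruth matched
]

counts = Counter(pairs)
row_totals = Counter(gt for gt, _ in pairs)

# Normalize each row (one row per groundtruth class) so that it sums to 1.
confusion = {(gt, pred): n / row_totals[gt] for (gt, pred), n in counts.items()}

for (gt, pred), ratio in sorted(confusion.items(), key=str):
    print(f"groundtruth={gt!s:8} predicted={pred!s:8} ratio={ratio:.2f}")
```

With this normalization, each row reads as "what happened to the groundtruth objects of this class": here half of the persons are correctly detected, a quarter are confused with vehicles and a quarter are missed.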

In [None]:
confusion_data = evaluator.compute_confusion_matrix()
confusion_data

### Display confusion matrix for prediction dataframe named "predictions"

In [None]:
display_confusion_matrix(
    confusion_data.loc[confusion_data["model"] == "predictions"].drop(columns="model"),
    title="All data",
)

### Display confusion matrix for a specific group of prediction dataframe named "predictions"

Here, we divide the evaluation dataset in 3 groups of equal size based on box_height
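
The idea of "groups of equal size" can be sketched without any library. This is only an approximation of quantile-based binning, assuming that `qcut=True` behaves in a similar way; `quantile_bins` and the toy heights are hypothetical.

```python
def quantile_bins(values, bins):
    """Split values into `bins` groups holding (almost) the same number of
    items. Requires len(values) >= bins.
    Returns a list of (lowest, highest, members) tuples."""
    ordered = sorted(values)
    n = len(ordered)
    groups = []
    for i in range(bins):
        chunk = ordered[i * n // bins : (i + 1) * n // bins]
        groups.append((chunk[0], chunk[-1], chunk))
    return groups


# Toy box heights in pixels (hypothetical).
box_heights = [12, 15, 18, 25, 40, 64, 90, 150, 300]
groups = quantile_bins(box_heights, bins=3)
for (low, high, members), name in zip(groups, ["small", "medium", "big"]):
    print(f"{name}: {low}px to {high}px, {len(members)} boxes")
```

Each group holds the same number of boxes, but the pixel ranges have very different widths, which is why the range is worth displaying in each plot title.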

In [None]:
box_height_group = ContinuousGroup(name="box_height", bins=3, qcut=True)
confusion_data = evaluator.compute_confusion_matrix(
    "predictions", groups=[box_height_group]
)
for (range_data, data), name in zip(
    confusion_data.groupby("box_height"), ["small", "medium", "big"]
):
    display_confusion_matrix(
        data.drop("model", axis=1),
        title=(
            f"Confusion for {name} bounding boxes ({range_data.left:.1f}px to"
            f" {range_data.right:.1f}px)"
        ),
    )

## Computing AP + Yolov5 metrics

Here, we follow the usual convention by computing the Average Precision per class and per IoU threshold.

Then we get the AP per category, the AP\@0.5:0.95 per class and finally the mAP and the mAP\@0.5:0.95
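
For readers unfamiliar with these quantities, here is a small worked sketch. It assumes an "all points" precision envelope for the area under the PR curve, which may differ from the interpolation used by Lours or YOLOv5; `average_precision` and the true-positive flags are hypothetical.

```python
def average_precision(is_true_positive, n_groundtruth):
    """AP from detections sorted by decreasing confidence.

    Each precision is replaced by the best precision reached at an equal or
    higher recall (envelope), then the area under the step curve is summed.
    """
    tp = 0
    recalls, precisions = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += hit
        recalls.append(tp / n_groundtruth)
        precisions.append(tp / rank)
    # Build the envelope by sweeping from the last point to the first.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


# Hypothetical true-positive flags per (category, IoU threshold).
# A stricter IoU threshold turns some true positives into false positives.
flags = {
    ("person", 0.5): ([True, True, False, True], 4),
    ("person", 0.75): ([True, False, False, True], 4),
    ("vehicle", 0.5): ([True, True], 2),
    ("vehicle", 0.75): ([True, False], 2),
}
ap_table = {key: average_precision(hits, n_gt) for key, (hits, n_gt) in flags.items()}

at_05 = [ap for (_, iou), ap in ap_table.items() if iou == 0.5]
map_05 = sum(at_05) / len(at_05)
map_all = sum(ap_table.values()) / len(ap_table)
print(ap_table)
print(f"mAP@0.5 = {map_05}, mAP over all IoU thresholds = {map_all}")
```

The mAP values are plain means of the per-category, per-threshold AP table, which mirrors what the `ap['AP'].mean()` calls in the next cell do on the real data.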

See the original YOLOv5 code (if you dare) here: https://github.com/ultralytics/yolov5/blob/master/val.py

Namely, in addition to AP and mAP, we want the precision\@0.5 at the best F1 score averaged over categories, and the recall\@0.5 at the best F1 score averaged over categories.

Notice how we use the `index_column` and `index_values` arguments to enforce that every category has the same confidence_threshold coordinates, i.e. 101 evenly spaced points between 0 and 1 (a step of 0.01).

* `ious` are the different minimum IoU values required to consider a detection valid
* `index_column` is the name of the value we want to use as index. This will force all values in the PR curve to be aligned. If not set, the resulting PR dataframe only contains the points where each curve actually changes, which depends on the category, so the values are no longer aligned across categories. This value can be `recall`, `precision` or `confidence_threshold`.
* `index_values` are the values we want the curves to be aligned on. Typically, a set of increasing values between 0 and 1
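
The need for alignment can be seen on a toy example: two categories whose curves change at different confidence values can only be averaged once they are sampled on the same grid. The sketch below is plain Python, not the Lours API. `pr_at_thresholds` and the detections are hypothetical, and precision is set to 1 by convention when no detection is kept.

```python
def pr_at_thresholds(detections, n_groundtruth, thresholds):
    """detections: list of (confidence, is_true_positive) for one category."""
    rows = {}
    for t in thresholds:
        kept = [hit for conf, hit in detections if conf >= t]
        tp = sum(kept)
        precision = tp / len(kept) if kept else 1.0  # convention when empty
        recall = tp / n_groundtruth
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows[t] = {"precision": precision, "recall": recall, "f1_score": f1}
    return rows


thresholds = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00

# Hypothetical detections per category, with the number of groundtruth objects.
per_category = {
    "person": ([(0.9, True), (0.8, True), (0.3, False)], 2),
    "vehicle": ([(0.7, True), (0.6, False), (0.2, True)], 2),
}
curves = {
    name: pr_at_thresholds(dets, n_gt, thresholds)
    for name, (dets, n_gt) in per_category.items()
}

# Averaging over categories is only meaningful because every curve
# is sampled at the same thresholds.
mean_f1 = {
    t: sum(curve[t]["f1_score"] for curve in curves.values()) / len(curves)
    for t in thresholds
}
best_threshold = max(mean_f1, key=mean_f1.get)
print(f"best mean F1 = {mean_f1[best_threshold]:.4f} at threshold {best_threshold}")
```

Without the common grid, the "person" curve would only have points at 0.3, 0.8 and 0.9 and the "vehicle" curve at 0.2, 0.6 and 0.7, so a per-threshold mean would mix unrelated rows.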

In [None]:
pr, ap = evaluator.compute_precision_recall(
    predictions_names="predictions",
    ious=np.linspace(0.5, 0.95, 10).round(3),
    index_column=None,
)

print(f"mAP@0.5 = {ap[ap['iou_threshold'] == 0.5]['AP'].mean()}")
print(f"mAP@0.5:0.95 = {ap['AP'].mean()}")

pr50, ap50 = evaluator.compute_precision_recall(
    predictions_names="predictions",
    ious=0.5,
    index_column="confidence_threshold",
    index_values=np.linspace(0, 1, 101),
    f_scores_betas=(0.5, 1, 2),
)


# Note that next line would be invalid if we did not force the data points
# to be aligned on the same confidence thresholds
mean_f1 = pr50.groupby("confidence_threshold").mean(numeric_only=True)
best_mean_f1_score = mean_f1.loc[mean_f1["f1_score"].idxmax()]
print("F1 scores averaged over classes")
print(f"best F1 = {best_mean_f1_score['f1_score']}")
print(f"precision @ best F1 = {best_mean_f1_score['precision']}")
print(f"recall @ best F1 = {best_mean_f1_score['recall']}")

Detailed view of Average Precision

In [None]:
display(ap)
ap_consolidated = pd.pivot_table(
    ap, values=["AP"], index="category_id", columns="iou_threshold"
)
ap_consolidated["mean"] = ap_consolidated["AP"].mean(axis=1)
ap_consolidated

In [None]:
mAP = ap_consolidated.mean(axis=0)
mAP

mAP\@0.5:0.95 is thus equal to $0.510150$


### Showing Curves

Now we can show the PR curve to look at precision vs recall for a particular class and different IoU values. Here is an example with class 2 (persons).

First, we plot the PR curves for different IoU threshold values, and then we plot the F1 score vs confidence_threshold.

Finally, for an IoU threshold of 0.5, we plot recall, precision and F1 score vs confidence threshold.

#### Recall vs Precision vs IoU threshold

In [None]:
pr_persons = pr[pr["category_id"] == 2]
sns.relplot(
    data=pr_persons,
    x="recall",
    y="precision",
    hue="iou_threshold",
    kind="line",
    estimator=None,
    sort=False,
)
plt.show()

#### F1 score vs confidence_threshold vs IoU threshold

Notice how the optimal confidence threshold is lower with the IoU

In [None]:
sns.relplot(
    data=pr_persons,
    x="confidence_threshold",
    y="f1_score",
    hue="iou_threshold",
    kind="line",
    estimator=None,
    sort=False,
)
plt.show()

#### Precision, recall, $F_\beta$ score \@0.5 vs confidence threshold for persons
Here, we graph recall, precision and F0.5, F1, and F2 with respect to confidence_threshold, for an IoU threshold of 0.5.

In addition, we annotate the confidence values where the F0.5, F1 and F2 scores are the highest, to show how each score weights precision and recall.
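
As a reminder of how $\beta$ shifts the balance, the sketch below applies the standard $F_\beta$ formula, $(1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$, to a few made-up operating points. The `curve` values are hypothetical and only meant to show the trend.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta**2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical operating points: (confidence_threshold, precision, recall).
curve = [
    (0.1, 0.45, 0.95),
    (0.3, 0.60, 0.80),
    (0.5, 0.80, 0.55),
    (0.7, 0.95, 0.30),
]

best = {}
for beta in (0.5, 1, 2):
    scores = {conf: f_beta(p, r, beta) for conf, p, r in curve}
    best[beta] = max(scores, key=scores.get)
    print(f"beta={beta}: best confidence threshold = {best[beta]}")
```

On these points, F0.5 peaks at the highest of the three thresholds (more precision), F2 at the lowest (more recall), and F1 sits in between.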

Note that we don't use seaborn for this plot

Side note: we can very clearly see that this set of predictions was cut off at a confidence threshold of 0.05.

We could lower that threshold, but it would dramatically increase the number of predictions without adding much information to the plot.

In [None]:
to_plot = pr50[pr50["category_id"] == 2].set_index("confidence_threshold")

f_scores = to_plot[["f1_score", "f0.5_score", "f2_score"]]
best_confidences = f_scores.idxmax()

fig, ax = plt.subplots()
to_plot[["precision", "recall"]].plot(ax=ax)
to_plot[["f1_score", "f0.5_score", "f2_score"]].plot(
    style=["r--", "b--", "g--"], ax=ax, linewidth=0.5
)
plt.scatter(f_scores.idxmax(), f_scores.max(), marker="+")
for x, y in zip(f_scores.idxmax(), f_scores.max()):
    ax.annotate(
        f"{x:.2f}",
        [x + 0.01, y + 0.01],
    )
plt.show()

## Computing grouped pr and ap curves

Now it is time to make things more interesting.

* `box_group` is how we want to split the data. The most usual group is `category_id`, but here we add the `box_height` group with 10 bins.
  Be careful: the more groups you add, the more granular your curves become, but the less data you have for each.
* `image_group` is not used here, but could be used the same way as `box_groups`, e.g. with weather conditions or focal length

Notice we don't use index alignment anymore

In [None]:
from lours.utils.grouper import ContinuousGroup

box_height_group = ContinuousGroup(name="box_height", bins=10, qcut=True)
pr, ap = evaluator.compute_precision_recall(
    predictions_names="predictions",
    ious=(0.3, 0.5, 0.7, 0.9),
    groups=["category_id", box_height_group],
    index_column=None,
)

### Exploring the `pr` and `ap` DataFrames

Each given group in the former function call will have its dedicated column

In [None]:
ap[(ap["iou_threshold"] == 0.5) & (ap["category_id"] == 1)].sort_values(
    by="AP"
).reset_index()

In [None]:
pr[
    (pr["category_id"] == 2)
    & (pr["iou_threshold"] == 0.5)
    & (pr["box_height"].apply(lambda x: x.left) == 12.196)
]

### Plotting Precision - Recall curves

Here we use a filtered dataframe with only category 1 and an iou_threshold of 0.5.
Notice the parameters `estimator=None` and `sort=False`, which make it possible to plot vertical lines.

In [None]:
sns.relplot(
    data=pr[(pr["category_id"] == 1) & (pr["iou_threshold"] == 0.5)],
    x="recall",
    y="precision",
    hue="box_height",
    kind="line",
    estimator=None,
    sort=False,
)
plt.show()

Here is a more complicated example for persons (class id = 1, the most represented class by far).

Colors and line styles can help you understand the strengths and weaknesses of the network.

In [None]:
sns.relplot(
    data=pr[(pr["category_id"] == 1)],
    x="recall",
    y="precision",
    hue="box_height",
    style="iou_threshold",
    kind="line",
    estimator=None,
    sort=False,
)
plt.show()

## Getting Average Precision with respect to other parameters

Usually, mean AP is just a single number giving you a general idea of the network quality.

Here, we try to have a better understanding of the influence of some parameters.

Namely here, we want to know if the network is better with small or large targets.

Seaborn can let us visualise several dimensions at the same time like in the following graph

In [None]:
data = ap.copy()
data["box_mean_height"] = data["box_height"].apply(lambda x: x.mid)
data["category_str"] = data["category_id"].replace(evaluator.label_map)
display(data)
g = sns.relplot(
    data=data, x="box_mean_height", y="AP", kind="line", hue="iou_threshold"
)
g.set(xscale="log")
plt.show()

The previous plot shows the mean AP across all categories.

The next (very large!) grid lets you see AP vs box height for each class.

In [None]:
g = sns.relplot(
    data=data[data["iou_threshold"] == 0.5],
    x="box_mean_height",
    y="AP",
    col="category_str",
    col_wrap=4,
    kind="line",
)
g.set(xscale="log")
for axis in g.axes.flat:
    axis.tick_params(labelbottom=True)
plt.subplots_adjust(hspace=0.15)
plt.show()

## Dealing with more absolute metrics : target precision

The next use case aims at being closer to real-life metrics than AP.

In the real world, AP is not that interesting because you ultimately have to choose a confidence threshold, and thus a single point on the Precision/Recall curve. You will then have to make compromises between precision and recall.

Here we are interested in a target precision. Given a wanted precision (because I want to minimize false positives), what recall can I hope for? Of course, this problem can easily be transposed to a target recall and the corresponding precisions.

The next graph shows an example where we want a precision of 60%. The recall values are where the different curves cross the horizontal line of value 0.6.
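
Reading the crossing point off a curve amounts to a linear interpolation between the two neighboring points. The sketch below does it by hand on a made-up PR curve. `recall_at_precision` is hypothetical and assumes that precision never increases as recall increases, which real curves do not always satisfy.

```python
def recall_at_precision(recalls, precisions, target):
    """Recall where the PR curve crosses the horizontal line `target`.

    Assumes points are sorted by increasing recall and that precision is
    non-increasing along the curve. Outside the curve's precision range,
    the recall of the nearest end point is returned.
    """
    points = list(zip(recalls, precisions))
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        if p0 >= target >= p1:
            if p0 == p1:
                return r1
            # Linear interpolation between the two surrounding points.
            return r0 + (p0 - target) / (p0 - p1) * (r1 - r0)
    # No crossing: the target is above or below the whole curve.
    return points[0][0] if target > points[0][1] else points[-1][0]


# Toy PR curve (hypothetical values).
recalls = [0.0, 0.2, 0.5, 0.8, 1.0]
precisions = [1.0, 0.9, 0.7, 0.5, 0.2]

recall_60 = recall_at_precision(recalls, precisions, 0.6)
print(f"recall at 60% precision = {recall_60:.2f}")
```

Here the 0.6 line is crossed halfway between the points (0.5, 0.7) and (0.8, 0.5), giving a recall of about 0.65.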

In [None]:
persons = pr[(pr["category_id"] == 1) & (pr["iou_threshold"] == 0.5)]
plt.figure(figsize=(7, 7))
precision = plt.plot([0, 1], [0.6, 0.6], label="precision @0.6", linestyle="--")
pl = sns.lineplot(
    data=persons,
    x="recall",
    y="precision",
    hue="box_height",
    estimator=None,
    sort=False,
    palette="bright",
)
plt.show()

For this example, we want the recall values for 5 different target precisions.

In [None]:
from functools import partial


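def interpolate_precision(data, value):
    """Interpolate the recall reached at the given target precision(s).

    np.interp expects increasing x-coordinates, so the curve is reversed.
    This is only valid if precision decreases as recall increases.
    """
    # Accept a single target (int or float) as well as a sequence of targets.
    if np.isscalar(value):
        value = [value]
    recall_values = np.interp(
        value, xp=data["precision"][::-1], fp=data["recall"][::-1]
    )
    # One row per target precision, with the interpolated recall as value.
    recall_values = pd.Series(
        recall_values, index=pd.Index(value, name="target_precision"), name="recall"
    ).to_frame()
    return recall_values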

In [None]:
recall_at_precision_persons = persons.groupby("box_height").apply(
    partial(interpolate_precision, value=np.linspace(0.1, 0.9, 5).round(3)),
    include_groups=False,
)
recall_at_precision_persons = recall_at_precision_persons.reset_index()
recall_at_precision_persons["box_mean_height"] = recall_at_precision_persons[
    "box_height"
].apply(lambda x: x.mid)

In [None]:
g = sns.relplot(
    data=recall_at_precision_persons,
    x="box_mean_height",
    hue="target_precision",
    y="recall",
    kind="line",
)
g.set(xscale="log")
plt.show()

In [None]:
sns.relplot(
    data=recall_at_precision_persons,
    x="target_precision",
    hue="box_height",
    y="recall",
    kind="line",
    palette="bright",
)
plt.show()

Next example covers all classes

In [None]:
all_classes_iou_05 = pr[pr["iou_threshold"] == 0.5]
recall_at_precision = all_classes_iou_05.groupby(["box_height", "category_id"]).apply(
    partial(interpolate_precision, value=np.linspace(0.1, 0.9, 5).round(2)),
    include_groups=False,
)
recall_at_precision = recall_at_precision.reset_index()
recall_at_precision["box_mean_height"] = recall_at_precision["box_height"].apply(
    lambda x: x.mid
)
recall_at_precision["category_str"] = recall_at_precision["category_id"].replace(
    evaluator.label_map
)

In [None]:
sns.relplot(
    data=recall_at_precision,
    x="target_precision",
    hue="box_height",
    y="recall",
    kind="line",
    palette="bright",
)
plt.show()

In [None]:
g = sns.relplot(
    data=recall_at_precision,
    x="box_mean_height",
    hue="target_precision",
    y="recall",
    kind="line",
    palette="bright",
)
g.set(xscale="log")
plt.show()

In [None]:
g = sns.relplot(
    data=recall_at_precision,
    x="box_mean_height",
    hue="target_precision",
    y="recall",
    col="category_str",
    col_wrap=4,
    kind="line",
    palette="bright",
)
g.set(xscale="log")
for axis in g.axes.flat:
    axis.tick_params(labelbottom=True)
plt.subplots_adjust(hspace=0.15)
plt.show()