# Evaluate Detections

This recipe demonstrates how to use FiftyOne to perform hands-on evaluation of your detection model.

It covers the following concepts:

- Loading a dataset with ground truth labels [into FiftyOne](https://docs.voxel51.com/user_guide/dataset_creation/index.html)
- [Adding model predictions](https://docs.voxel51.com/recipes/adding_detections.html) to your dataset
- [Evaluating your model](https://docs.voxel51.com/user_guide/evaluation.html#detections) using FiftyOne’s evaluation API
- Viewing the best and worst performing samples in your dataset

## Setup

If you haven't already, install FiftyOne:

In [None]:
!pip install fiftyone

In this tutorial, we’ll run an [Ultralytics](https://www.ultralytics.com/) detection model through the FiftyOne + Ultralytics [integration](https://docs.voxel51.com/integrations/ultralytics.html). If you don’t already have `ultralytics` installed, install it now:

In [None]:
!pip install "ultralytics>=8.1.0" "torch>=1.8"

## Loading In Our Dataset

For the walkthrough, we will be using the [MSCOCO 2017](https://cocodataset.org/#home) validation split from the [FiftyOne Dataset Zoo](https://docs.voxel51.com/user_guide/dataset_zoo/datasets.html#coco-2017). We can load it in with the following:

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    dataset_name="evaluate-detections-tutorial",
)
dataset.persistent = True

We can see the fields on our dataset by printing it out. Here we can see that MSCOCO 2017 comes with the `ground_truth` detections field prepopulated.

In [2]:
print(dataset)

Name:        evaluate-detections-tutorial
Media type:  image
Num samples: 5000
Persistent:  True
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)


We can also inspect a detection to get an idea of the format:

In [3]:
# Print a ground truth detection
sample = dataset.first()
print(sample.ground_truth.detections[0])

<Detection: {
    'id': '663ba4b4cff68a59420a974f',
    'attributes': {},
    'tags': [],
    'label': 'potted plant',
    'bounding_box': [
        0.37028125,
        0.3345305164319249,
        0.038593749999999996,
        0.16314553990610328,
    ],
    'mask': None,
    'confidence': None,
    'index': None,
    'supercategory': 'furniture',
    'iscrowd': 0,
}>
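To make the bounding box format concrete, here is a minimal plain-Python sketch that converts the relative `[top-left-x, top-left-y, width, height]` box printed above into pixel values. The helper name `to_pixel_box` and the 640x480 image size are assumptions made for illustration; in practice the image dimensions would come from the sample's metadata.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a relative [x, y, width, height] box to absolute pixel values."""
    x, y, w, h = bounding_box
    return [
        x * image_width,
        y * image_height,
        w * image_width,
        h * image_height,
    ]


# Bounding box copied from the detection printed above
relative_box = [
    0.37028125,
    0.3345305164319249,
    0.038593749999999996,
    0.16314553990610328,
]

# ASSUMPTION: hypothetical image size, chosen only for illustration
pixel_box = to_pixel_box(relative_box, image_width=640, image_height=480)
print([round(v, 1) for v in pixel_box])
```

Storing coordinates relative to the image size means the same box remains valid if the image is resized.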


Before we go further, let’s launch the FiftyOne App and use the GUI to explore the dataset visually:

In [None]:
session = fo.launch_app(dataset)

![coco](../assets/coco.png)

## Add Predictions

Now let’s generate some predictions to analyze.

We can load a model from the FiftyOne Model Zoo or from Ultralytics and apply it directly to our dataset (or any subset thereof) using the sample collection’s `apply_model()` method. Here we load the model from the [Model Zoo](https://docs.voxel51.com/user_guide/model_zoo/index.html):

In [None]:
model = foz.load_zoo_model("yolov8s-coco-torch")

dataset.apply_model(model, label_field="predictions")

session.show()

Now we can check out the predictions on our dataset! Try exploring the newly added predictions using the sidebar: add or remove classes, or adjust the confidence threshold.

![Yolo-COCO](../assets/coco-yolo.png)

## Analyzing Detections

Let’s analyze the raw predictions we’ve added to our dataset in more detail. FiftyOne provides the ability to write [expressions](https://docs.voxel51.com/user_guide/using_views.html#filtering) that match, filter, and sort detections based on their attributes. See using [DatasetViews](https://docs.voxel51.com/user_guide/using_views.html) for full details.

We can start by creating a view that filters our labels so that only detections with a `confidence` greater than 0.75 remain.

Note the `only_matches=False` argument. When filtering labels, any samples that no longer contain labels would normally be removed from the view. However, this is not desired when performing evaluations since it can skew your results between views. We set `only_matches=False` so that all samples will be retained, even if some no longer contain labels.

In [2]:
from fiftyone import ViewField as F

# Only contains detections with confidence > 0.75
high_conf_view = dataset.filter_labels("predictions", F("confidence") > 0.75, only_matches=False)

In [10]:
# Print some information about the view
print(high_conf_view)

Dataset:     evaluate-detections-tutorial
Media type:  image
Num samples: 5000
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    predictions:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
View stages:
    1. FilterLabels(field='predictions', filter={'$gt': ['$$this.confidence', 0.75]}, only_matches=False, trajectories=False)


In [11]:
# Print a prediction from the view to verify that its confidence is > 0.75
sample = high_conf_view.first()
print(sample.predictions.detections[0])

<Detection: {
    'id': '664289ef3962bfad6410992d',
    'attributes': {},
    'tags': [],
    'label': 'tv',
    'bounding_box': [
        0.007986664772033691,
        0.3902428150177002,
        0.23381657898426056,
        0.22575199604034424,
    ],
    'mask': None,
    'confidence': 0.9355676174163818,
    'index': None,
}>
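The effect of `only_matches=False` can be illustrated without FiftyOne. The sketch below is a plain-Python stand-in for the behavior described above, not FiftyOne's implementation; the function and the toy samples are hypothetical.

```python
def filter_labels(samples, min_confidence, only_matches=True):
    """Keep only detections above a confidence threshold in each sample.

    When only_matches is True, samples left with no detections are dropped.
    """
    filtered = []
    for sample in samples:
        kept = [
            d for d in sample["predictions"] if d["confidence"] > min_confidence
        ]
        if kept or not only_matches:
            filtered.append({"filepath": sample["filepath"], "predictions": kept})
    return filtered


# Hypothetical toy data: the second sample has no high confidence prediction
samples = [
    {"filepath": "a.jpg", "predictions": [{"label": "tv", "confidence": 0.94}]},
    {"filepath": "b.jpg", "predictions": [{"label": "cat", "confidence": 0.40}]},
]

# Only samples that still contain labels
print(len(filter_labels(samples, 0.75)))

# Every sample is retained, some with an empty list of predictions
print(len(filter_labels(samples, 0.75, only_matches=False)))
```

Keeping the emptied samples matters for evaluation because their ground truth objects still count as missed detections.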


Visualizing each object separately is useful in several situations, for example when a sample contains dozens of overlapping objects or when you want to look specifically for instances of one class.

You can also view the dataset as a set of object patches using the [patches view button](https://docs.voxel51.com/user_guide/app.html#viewing-object-patches). The button lets you take any `Detections` field in your dataset and visualize each object as an individual patch in the image grid.

![object-patches](../assets/object-patches.gif)

## Evaluate Detections

Now that we have samples with ground truth and predicted objects, let’s use FiftyOne to evaluate the quality of the detections.

FiftyOne provides a powerful [evaluation API](https://docs.voxel51.com/user_guide/evaluation.html) that contains a collection of methods for performing evaluation of model predictions. Since we’re working with object detections here, we’ll use [detection evaluation](https://docs.voxel51.com/user_guide/evaluation.html#detections).

### Running Evaluation

We can run evaluation on our samples via [evaluate_detections()](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.evaluate_detections). This method is available on both the `Dataset` and `DatasetView` classes, which means that we can also run evaluation on our `high_conf_view` to assess the quality of only the high confidence predictions in our dataset.

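By default, this method uses the [COCO evaluation protocol](https://cocodataset.org/#detection-eval). Because we pass an `eval_key`, it also stores per-sample true positive, false positive, and false negative counts on the dataset, which we will use later.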

In [None]:
# Evaluate the predictions in the `predictions` field of our dataset
# with respect to the objects in the `ground_truth` field
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
    compute_mAP=True,
)
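COCO-style evaluation matches predictions to ground truth objects by intersection over union (IoU), and the mAP computation repeats the matching across IoU thresholds from 0.50 to 0.95. The sketch below is a simplified plain-Python illustration of that idea, not FiftyOne's implementation; the boxes are hypothetical and details such as crowd handling are ignored.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping rectangle (zero if disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical ground truth and predicted boxes in relative coordinates
gt_box = [0.10, 0.10, 0.40, 0.40]
pred_box = [0.15, 0.10, 0.40, 0.40]

overlap = iou(gt_box, pred_box)

# IoU thresholds 0.50, 0.55, ..., 0.95 used by COCO-style mAP
thresholds = [0.50 + 0.05 * i for i in range(10)]
matched_at = [t for t in thresholds if overlap >= t]
print(f"IoU = {overlap:.2f}, a match at {len(matched_at)} of 10 thresholds")
```

A prediction that counts as correct at a loose threshold can become a false positive at a stricter one, which is why mAP rewards tight localization.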

### Aggregate Results

The `results` object returned by the evaluation routine provides a number of convenient methods for analyzing our predictions.

For example, let’s print a classification report for the top-10 most common classes in the dataset, as well as the mAP score:

In [27]:
# Get the 10 most common classes in the dataset
counts = dataset.count_values("ground_truth.detections.label")
classes_top10 = sorted(counts, key=counts.get, reverse=True)[:10]

# Print a classification report for the top-10 classes
results.print_report(classes=classes_top10)

# Print out the mAP score
print(f"Yolov8s mAP score: {results.mAP()}")

               precision    recall  f1-score   support

       person       0.85      0.75      0.79     11723
         kite       0.74      0.65      0.69       365
          car       0.73      0.63      0.67      1979
         bird       0.80      0.52      0.63       487
       carrot       0.55      0.46      0.50       411
         boat       0.62      0.45      0.52       436
    surfboard       0.70      0.57      0.63       269
     airplane       0.79      0.87      0.83       143
traffic light       0.68      0.47      0.56       637
        bench       0.55      0.37      0.45       413

    micro avg       0.80      0.69      0.74     16863
    macro avg       0.70      0.57      0.63     16863
 weighted avg       0.80      0.69      0.74     16863

Yolov8s mAP score: 0.39593808224616606
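For readers unfamiliar with the columns in the report, the following sketch shows how precision, recall, and F1 follow from true positive, false positive, and false negative counts, and how micro and macro averages differ. The counts are hypothetical and are not taken from the report above; this is an illustration, not the code FiftyOne runs.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical per-class counts as (tp, fp, fn)
class_counts = {
    "person": (80, 20, 20),
    "kite": (5, 5, 15),
}

per_class = {label: prf(*c) for label, c in class_counts.items()}

# Macro average: unweighted mean of the per-class scores
macro_precision = sum(p for p, _, _ in per_class.values()) / len(per_class)

# Micro average: pool the counts across classes, then compute the score once
total_tp, total_fp, total_fn = (sum(col) for col in zip(*class_counts.values()))
micro_precision, _, _ = prf(total_tp, total_fp, total_fn)

print(f"macro precision: {macro_precision:.2f}")
print(f"micro precision: {micro_precision:.2f}")
```

Micro averages are dominated by frequent classes, which is consistent with the report above, where `person` accounts for most of the support and the micro average sits well above the macro average.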


We can also evaluate our detections on a specific view. For instance, we can evaluate the detections in our `high_conf_view` from before:

In [4]:
# Evaluate the predictions in the `predictions` field of our `high_conf_view`
# with respect to the objects in the `ground_truth` field
results = high_conf_view.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval_high_conf",
    compute_mAP=True,
)

Evaluating detections...
 100% |███████████████| 5000/5000 [31.9s elapsed, 0s remaining, 158.3 samples/s]      
Performing IoU sweep...
 100% |███████████████| 5000/5000 [23.1s elapsed, 0s remaining, 226.1 samples/s]      


We can also create widgets in our Jupyter notebook to display interactive plots such as PR curves. Let’s try it below, after making sure that a compatible version of `ipywidgets` is installed.

In [None]:
!pip install 'ipywidgets>=8,<9'

In [30]:
plot = results.plot_pr_curves(classes=["person", "car"])
plot.show()





*(Interactive plot output: precision-recall curves for the `person` and `car` classes.)*
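To connect PR curves to the mAP score, the sketch below computes an interpolated average precision for a single class by sampling the curve at 101 evenly spaced recall levels, in the spirit of the COCO protocol. It is a simplified illustration under stated assumptions (a single class, a single IoU threshold, hypothetical detections) and may differ in detail from what FiftyOne computes.

```python
def average_precision(detections, num_gt, num_points=101):
    """Interpolated average precision for a single class.

    detections: list of (confidence, is_true_positive) pairs
    num_gt: number of ground truth objects of this class
    """
    if num_gt == 0:
        return 0.0

    # Walk the detections from most to least confident, tracing the PR curve
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)

    # Make precision non-increasing (running maximum from right to left)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sample the interpolated curve at evenly spaced recall levels
    total = 0.0
    for k in range(num_points):
        level = k / (num_points - 1)
        # Precision at the first point whose recall reaches this level, else 0
        total += next((p for p, r in zip(precisions, recalls) if r >= level), 0.0)
    return total / num_points


# Hypothetical detections: two correct, one incorrect, two ground truth objects
detections = [(0.9, True), (0.8, False), (0.7, True)]
ap = average_precision(detections, num_gt=2)
print(f"AP = {ap:.3f}")
```

Averaging this quantity over classes, and over the IoU thresholds of the sweep, gives a mAP-style score.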

### Evaluation Views

Now that we have a sense of the aggregate performance of our model, let’s dive into sample-level analysis by creating an [evaluation view](https://docs.voxel51.com/user_guide/app.html#viewing-evaluation-patches).

Any evaluation that you have stored on your dataset can be used to generate an evaluation view: a patches view that contains one patch for every true positive, false positive, and false negative in your dataset. Through this view, you can quickly filter and sort evaluated detections by their type (TP/FP/FN), their IoU, and whether they are matched to a crowd object.

These evaluation views can be created through Python or directly in the App as shown below.



In [5]:
eval_patches = dataset.to_evaluation_patches("eval")
print(eval_patches)

Dataset:     evaluate-detections-tutorial
Media type:  image
Num patches: 45509
Patch fields:
    id:           fiftyone.core.fields.ObjectIdField
    sample_id:    fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    predictions:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    crowd:        fiftyone.core.fields.BooleanField
    type:         fiftyone.core.fields.StringField
    iou:          fiftyone.core.fields.FloatField
View stages:
    1. ToEvaluationPatches(eval_key='eval', config=None)
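To see where the TP/FP/FN types come from, here is a minimal greedy matching sketch in plain Python. It assumes all boxes belong to one class and that the IoU between every prediction and ground truth object has already been computed; the function name and inputs are hypothetical, and FiftyOne's matching handles more cases (class labels, crowd objects) than this does.

```python
def match_detections(pred_confidences, ious, num_gt, iou_threshold=0.5):
    """Greedily label predictions as TP or FP and count unmatched ground truth.

    pred_confidences: confidence of each prediction
    ious: ious[i][j] is the IoU between prediction i and ground truth j
    num_gt: number of ground truth objects
    """
    matched_gt = set()
    pred_types = [None] * len(pred_confidences)

    # Most confident predictions get first pick of the ground truth objects
    order = sorted(range(len(pred_confidences)), key=lambda i: -pred_confidences[i])
    for i in order:
        candidates = [
            (ious[i][j], j)
            for j in range(num_gt)
            if j not in matched_gt and ious[i][j] >= iou_threshold
        ]
        if candidates:
            _, best_j = max(candidates)
            matched_gt.add(best_j)
            pred_types[i] = "tp"
        else:
            pred_types[i] = "fp"

    # Ground truth objects that no prediction claimed are false negatives
    num_fn = num_gt - len(matched_gt)
    return pred_types, num_fn


# Hypothetical example: three predictions, two ground truth objects
pred_confidences = [0.9, 0.8, 0.6]
ious = [
    [0.8, 0.1],
    [0.7, 0.0],
    [0.0, 0.2],
]
pred_types, num_fn = match_detections(pred_confidences, ious, num_gt=2)
print(pred_types, num_fn)
```

In this example the second prediction overlaps an object that was already claimed, so it is a false positive, and the second ground truth object is never matched, so it is a false negative.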


In [None]:
session.view = eval_patches

![eval_patches](../assets/eval_patches.gif)

## View the Best-Performing Samples

To dig in further, let’s create a view that sorts by the `eval_tp` field, which stores the number of true positives in each sample, so we can see the best-performing cases of our model (i.e., the samples with the most correct predictions):

In [None]:
# Show samples with most true positives
session.view = high_conf_view.sort_by("eval_tp", reverse=True)

Similarly, we can sort by the `eval_fp` field to see the worst-performing cases of our model (i.e., the samples with the most false positive predictions):

In [None]:
# Show samples with most false positives
session.view = high_conf_view.sort_by("eval_fp", reverse=True)

## Filtering by Bounding Box Area

Dataset views are extremely powerful. For example, let’s look at how our model performed on small objects by creating a view that contains only predictions whose bounding box area is less than `32^2` pixels:

In [10]:
# Compute metadata so we can reference image height/width in our view
dataset.compute_metadata()

In [None]:
#
# Create an expression that will match objects whose bounding boxes have
# area less than 32^2 pixels
#
# Bounding box format is [top-left-x, top-left-y, width, height]
# with relative coordinates in [0, 1], so we multiply by image
# dimensions to get pixel area
#
bbox_area = (
    F("$metadata.width") * F("bounding_box")[2] *
    F("$metadata.height") * F("bounding_box")[3]
)
small_boxes = bbox_area < 32 ** 2

# Create a view that contains only small (and high confidence) predictions
small_boxes_view = high_conf_view.filter_labels("predictions", small_boxes)

session.view = small_boxes_view

![eval_boxes](../assets/eval_boxes.png)
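The area expression above can be checked by hand with a small plain-Python sketch. The `32^2` cutoff matches the COCO convention for small objects; the sketch also includes the usual `96^2` boundary between medium and large objects. The helper names and the 640x480 image size are assumptions made for illustration.

```python
def pixel_area(bounding_box, image_width, image_height):
    """Pixel area of a relative [x, y, width, height] bounding box."""
    _, _, w, h = bounding_box
    return (w * image_width) * (h * image_height)


def size_bucket(area):
    """COCO-style size bucket for a pixel area."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# ASSUMPTION: hypothetical boxes on a hypothetical 640x480 image
boxes = [
    [0.1, 0.1, 0.04, 0.05],
    [0.1, 0.1, 0.10, 0.10],
    [0.1, 0.1, 0.20, 0.30],
]
buckets = [size_bucket(pixel_area(b, 640, 480)) for b in boxes]
print(buckets)
```

Because the boxes are stored in relative coordinates, the same relative box can land in different buckets depending on the image resolution, which is why the view expression multiplies by the image metadata.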

In [None]:
# Create a view that contains only small GT and predicted boxes
small_boxes_eval_view = (
    high_conf_view
    .filter_labels("ground_truth", small_boxes, only_matches=False)
    .filter_labels("predictions", small_boxes, only_matches=False)
)

# Run evaluation
small_boxes_results = small_boxes_eval_view.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
)

In [14]:
# Get the 10 most common small object classes
small_counts = small_boxes_eval_view.count_values("ground_truth.detections.label")
classes_top10_small = sorted(small_counts, key=small_counts.get, reverse=True)[:10]

# Print a classification report for the top-10 small object classes
small_boxes_results.print_report(classes=classes_top10_small)

               precision    recall  f1-score   support

       person       0.89      0.02      0.05      3217
          car       0.91      0.05      0.09      1005
         book       0.00      0.00      0.00       677
       bottle       0.88      0.03      0.06       521
traffic light       1.00      0.02      0.05       490
        chair       0.67      0.00      0.01       461
          cup       0.82      0.05      0.09       379
      handbag       0.00      0.00      0.00       235
         bird       0.71      0.02      0.04       234
  sports ball       0.93      0.32      0.47       218

    micro avg       0.89      0.03      0.06      7437
    macro avg       0.68      0.05      0.09      7437
 weighted avg       0.77      0.03      0.06      7437

