![Roboflow banner](https://media.roboflow.com/banner.jpeg?updatedAt=1682523622384)

# Roboflow Model Evaluation 🔎

[Roboflow Evaluations](https://github.com/roboflow/evaluations) is a framework for evaluating the results of computer vision models. Think OpenAI Evals, but for computer vision models.

Using Evaluations, you can:

1. Evaluate the difference between ground truth (your annotated data) and predictions from your [Roboflow models](https://roboflow.com). You can use this information to better understand the quality of predictions from your model and areas where improvement is needed.

2. Evaluate ground truth against results from a zero-shot model with a text input. [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) and [CLIP](https://github.com/openai/clip) are supported.

## Steps in this Tutorial

In this tutorial, we are going to cover:

- How to set up Roboflow Evaluations with a Roboflow model, and;
- How to run ground truth / Roboflow prediction analysis on an existing model.

By the end of this guide, we will have a confusion matrix for our model, as well as the following statistics:

- Precision
- Recall
- F1 Score

Without further ado, let's begin!

## Step 1: Create a Data Loader 🗃️

Evaluations uses ground truth data from either:

1. An existing Roboflow model, or;
2. A JSON file that contains ground truth mapped to file names (see the `evaluations.dataloaders.JSONDataLoader` class docstring for more information on how to compose this file).

In this example, we will evaluate a model in Roboflow.

In [1]:
from evaluations.dataloaders import (RoboflowDataLoader, RoboflowPredictionsDataLoader)
from evaluations.roboflow import RoboflowEvaluator



## Create an Evaluator 💻

An Evaluator uses a model to run inference on the images in a dataset. The inference results are then compared to the ground truth for the same images.
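The source does not say how Evaluations decides that a prediction matches a ground truth annotation. For object detection, a common approach is to compare bounding boxes by intersection over union (IoU). The sketch below illustrates that general idea only. The function names `iou` and `is_match`, the dictionary layout, and the 0.5 threshold are all assumptions made for illustration, not the library's API.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero so disjoint boxes produce no intersection.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


def is_match(ground_truth, prediction, threshold=0.5):
    """Hypothetical rule: same class and enough box overlap."""
    same_class = ground_truth["class"] == prediction["class"]
    return same_class and iou(ground_truth["box"], prediction["box"]) >= threshold


# Illustrative values, not taken from the mug dataset.
annotation = {"class": "cup", "box": (0, 0, 10, 10)}
detection = {"class": "cup", "box": (1, 1, 10, 10)}
matched = is_match(annotation, detection)
```

Under a rule like this, a matched pair would count as a true positive, an unmatched prediction as a false positive, and an unmatched annotation as a false negative.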

Confusion matrices and ground truth vs. inference result visualizations are created for each image on which inference is run, saved in `output/matrices` and `output/images/` respectively.
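To make the idea of a per-image confusion matrix concrete, here is a minimal sketch in plain Python. It is not the code Evaluations uses: the helper names `confusion_matrix` and `add_matrices`, the row and column convention (rows are ground truth, columns are predictions), and the sample labels are assumptions for illustration.

```python
def confusion_matrix(pairs, class_names):
    """Count (ground truth, prediction) label pairs for one image."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = [[0] * len(class_names) for _ in class_names]
    for truth, predicted in pairs:
        # Row = ground truth class, column = predicted class.
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def add_matrices(a, b):
    """Element-wise sum, used to combine per-image matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


class_names = ["cup", "background"]

# Illustrative label pairs for two images.
image_1 = confusion_matrix([("cup", "cup"), ("cup", "background")], class_names)
image_2 = confusion_matrix([("cup", "cup"), ("background", "cup")], class_names)

aggregate = add_matrices(image_1, image_2)
```

Here the diagonal of `aggregate` holds the correct predictions, and the off-diagonal cells hold the missed cups and the false detections.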

In the code below, we will download our dataset and model from the Roboflow API, then run the model on the downloaded images to collect the predictions that the evaluator will compare against the ground truth.

In [2]:
class_names, ground_truth, model = RoboflowDataLoader(
    workspace_url="james-gallagher-87fuq",
    project_url="mug-detector-eocwp",
    project_version=12,
    image_files="/Users/james/src/clip/model_eval/dataset-new",
).download_dataset()

predictions = RoboflowPredictionsDataLoader(
    model=model,
    model_type="object-detection",
    image_files="/Users/james/src/clip/model_eval/dataset-new/",
    class_names=class_names,
).process_files()

You are already logged into Roboflow. To make a different login, run roboflow.login(force=True).
loading Roboflow workspace...
loading Roboflow project...


## Run Analysis 📊

The following lines of code will plot an aggregate confusion matrix representing the results of inference from all images in your dataset and display it in this notebook.

After showing the confusion matrix, we will calculate and display the precision, recall, and F1 score associated with our model.
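For reference, precision, recall, and F1 follow from counts of true positives, false positives, and false negatives. The sketch below shows the standard formulas. It is not the implementation behind `calculate_statistics()`: the function name `precision_recall_f1` is hypothetical and the counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions, returning 0.0 where a ratio is undefined."""
    # Precision: of everything the model predicted, how much was correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything in the ground truth, how much did the model find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative counts, not results from the mug detector.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=6)
```

A model with high precision but low recall makes few false detections but misses many objects. F1 summarizes the balance between the two.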

In [3]:
evaluator = RoboflowEvaluator(
    ground_truth=ground_truth, predictions=predictions, class_names=class_names, mode="batch"
)

cf = evaluator.eval_model_predictions()

data = evaluator.calculate_statistics()

print("Precision:", data.precision)
print("Recall:", data.recall)
print("f1 Score:", data.f1)

evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new/valid/images/IMG_1549-Large_jpeg_jpg.rf.21ff07dae783140d021c11ded2304576.jpg ... ['cup', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new/valid/images/IMG_1511-Large_jpeg_jpg.rf.b230e98b40b8ef4ece04e375dce34d75.jpg ... ['cup', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new/valid/images/IMG_4769_JPG_jpg.rf.a6774c9040a1adbe5652b28199552b9a.jpg ... ['cup', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new/valid/images/IMG_1544-Large_jpeg_jpg.rf.3d49ae7cd57da2930617648df0ba1df3.jpg ... ['cup', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new/valid/images/IMG_1540-Large_jpeg_jpg.rf.c14bd974a669f3a3cd4f8a4f2cf6b140.jpg ... ['cup', 'background']
evaluating image 

## Compare Prompts

You can use the `CompareEvaluations` class to run multiple evaluations and return the class associated with the best performing model.

Note: The `CompareEvaluations` class takes in single class names.
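Conceptually, comparing evaluations means scoring each candidate and keeping the best one. The source does not state which metric `CompareEvaluations` ranks by, so the sketch below assumes F1 purely for illustration. `EvalResult` and `pick_best` are hypothetical names, and the scores are placeholders rather than measured results.

```python
from typing import NamedTuple


class EvalResult(NamedTuple):
    """Hypothetical container for the statistics of one evaluation."""
    class_name: str
    precision: float
    recall: float
    f1: float


def pick_best(results):
    """Return the result with the highest F1 score (assumed criterion)."""
    return max(results, key=lambda result: result.f1)


# Placeholder scores for two prompts, for illustration only.
results = [
    EvalResult("orange", precision=0.2, recall=0.1, f1=0.13),
    EvalResult("red apple", precision=0.9, recall=0.8, f1=0.85),
]

best_result = pick_best(results)
```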

In the example below, we will compare two prompts against CLIP to find out which prompt most effectively classifies our data.

We will work with a dataset of apples. The sample size is 10 so inference should not take too long.

In the code cell below, we will load the dataset with which we will be working in our project.

In [2]:
from evaluations.clip import CLIPEvaluator
from evaluations.dataloaders import RoboflowDataLoader
from evaluations.dataloaders.cliploader import CLIPDataLoader
from evaluations import CompareEvaluations
import copy

EVAL_DATA_PATH = "/Users/james/src/clip/model_eval/dataset-new-apples"

class_names, predictions, model = RoboflowDataLoader(
    workspace_url="mit-3xwsm",
    project_url="appling",
    project_version=1,
    image_files=EVAL_DATA_PATH,
    model_type="classification",
).download_dataset()

You are already logged into Roboflow. To make a different login, run roboflow.login(force=True).
loading Roboflow workspace...
loading Roboflow project...
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/valid/apple/000325191_jpg.rf.8e478cb228cb5e7a7b17b14e26466968.jpg ... ['orange', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/valid/apple/apple-fuji-1-kg-product-images-o590000001-p590000001-0-202203151906_jpg.rf.f518348005b1d18b089c26bc018a02d2.jpg ... ['orange', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/test/apple/images-1-_jpg.rf.23bed4416b487cf1419b2761f3b6a492.jpg ... ['orange', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/train/apple/GP_74888233_GP_L_jpg.rf.c016cba8bc070a691955b37c0df96b02.jpg ... ['orange', 'backgr

### Run Comparison

Next, we need to choose the prompts we want to evaluate. In the example below, we'll evaluate "orange" and "red apple" to see which one classifies the most images in our dataset correctly.

We'll use the prompts to create a list of `CLIPEvaluator` objects, one per prompt, to pass into the `CompareEvaluations` class for comparison.

In [3]:
evals = [
    ["orange", "background"],
    ["red apple", "background"],
]

best = CompareEvaluations(
    [
        CLIPEvaluator(
            data=CLIPDataLoader(
                data=copy.deepcopy(predictions),
                class_names=cn,
                eval_data_path=EVAL_DATA_PATH,
            ).process_files(),
            class_names=cn,
            mode="batch",
        )
        for cn in evals
    ]
)

precision, recall, f1, class_name = best.compare()

print("Precision:", precision)
print("Recall:", recall)
print("f1 Score:", f1)
print("Class Name:", class_name)

evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/valid/apple/000325191_jpg.rf.8e478cb228cb5e7a7b17b14e26466968.jpg ... ['orange', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/valid/apple/apple-fuji-1-kg-product-images-o590000001-p590000001-0-202203151906_jpg.rf.f518348005b1d18b089c26bc018a02d2.jpg ... ['orange', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/test/apple/images-1-_jpg.rf.23bed4416b487cf1419b2761f3b6a492.jpg ... ['orange', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/train/apple/GP_74888233_GP_L_jpg.rf.c016cba8bc070a691955b37c0df96b02.jpg ... ['orange', 'background']
evaluating image predictions against ground truth /Users/james/src/clip/model_eval/dataset-new-apples/train/apple/images_jpg.rf.b21a1b7b5a0ea2c68ac8


# Next steps 🚀

Congratulations on completing this notebook! Use the insights you have derived from this notebook to improve your existing model on Roboflow or to find the ideal prompt for zero-shot labelling.

## Learning Resources

Roboflow has produced many resources that you may find interesting as you advance your knowledge of computer vision:

- [Roboflow Notebooks](https://github.com/roboflow/notebooks): A repository of over 20 notebooks that walk through how to train custom models with a range of model types, from YOLOv7 to SegFormer. (This notebook is in the Notebooks repository!)
- [Roboflow Supervision](https://github.com/roboflow/supervision): Utilities to implement common computer vision functions into your project, from drawing bounding boxes to counting predictions in specified zones.
- [Roboflow YouTube](https://www.youtube.com/c/Roboflow): Our library of videos featuring deep dives into the latest in computer vision, detailed tutorials that accompany our notebooks, and more.
- [Roboflow Discuss](https://discuss.roboflow.com/): Have a question about how to do something on Roboflow? Ask your question on our discussion forum.
- [Roboflow Models](https://roboflow.com): Learn about state-of-the-art models and their performance. Find links and tutorials to guide your learning.

## Connect computer vision to your project logic

[Roboflow Templates](https://roboflow.com/templates) is a public gallery of code snippets that you can use to connect computer vision to your project logic. Code snippets range from sending emails after inference to measuring object distance between detections.