# Prawn Counting Analysis Notebook

This notebook provides comprehensive analysis tools for counting prawns in underwater images, including:

1. **Dataset Creation & Management** - FiftyOne dataset setup and validation
2. **Model Evaluation** - YOLO model performance analysis across different confidence thresholds  
3. **Detection Statistics** - Statistical analysis of detection counts per image and split
4. **Confusion Matrix Analysis** - Performance metrics extraction from confusion matrices
5. **Utilities** - Helper functions for file management and data processing

## Table of Contents
- [Setup & Imports](#setup)
- [Dataset Creation](#dataset-creation)
- [Model Evaluation](#model-evaluation) 
- [Detection Statistics](#detection-statistics)
- [Confusion Matrix Analysis](#confusion-matrix-analysis)
- [Utilities](#utilities)


## Setup & Imports 

Import all necessary libraries for dataset management, model evaluation, and analysis.


In [1]:
# Core libraries
import os
import shutil
import glob
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# FiftyOne for dataset management
import fiftyone as fo
from fiftyone import ViewField as F

# YOLO model
from ultralytics import RTDETR

# Image processing
from PIL import Image
import pytesseract

print("✅ All libraries imported successfully")

✅ All libraries imported successfully


## Configuration

Set up paths and configuration parameters for the analysis.


In [2]:
# Configuration parameters
CONFIG = {
    'dataset_name': 'circle-pond-analysis',
    'dataset_dir': r"/Users/gilbenor/Downloads/circle pond.v20i.yolov8",
    'model_path': r"/Users/gilbenor/Library/CloudStorage/OneDrive-Personal/measurement_paper_images/detection drone/runs-detections-drone-14.08/detect/train/weights/best.pt",
    'data_yaml': r"/Users/gilbenor/Downloads/circle pond.v23i.yolov8/data.yaml",
    'splits': ["test", "valid", "train"],
    'confidence_thresholds': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}

print("📁 Configuration loaded:")
for key, value in CONFIG.items():
    print(f"   {key}: {value}")
    
# Verify paths exist
if os.path.exists(CONFIG['dataset_dir']):
    print(f"✅ Dataset directory exists: {len(os.listdir(CONFIG['dataset_dir']))} items")
else:
    print("❌ Dataset directory not found")


📁 Configuration loaded:
   dataset_name: circle-pond-analysis
   dataset_dir: /Users/gilbenor/Downloads/circle pond.v20i.yolov8
   model_path: /Users/gilbenor/Library/CloudStorage/OneDrive-Personal/measurement_paper_images/detection drone/runs-detections-drone-14.08/detect/train/weights/best.pt
   data_yaml: /Users/gilbenor/Downloads/circle pond.v23i.yolov8/data.yaml
   splits: ['test', 'valid', 'train']
   confidence_thresholds: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
✅ Dataset directory exists: 5 items


In [3]:
import os
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from PIL import Image
import pytesseract

# Path to the uploaded files
file_paths = glob.glob("/mnt/data/val*_confusion_matrix.png")

# Function to extract text from confusion matrix images using OCR
def extract_confusion_matrix_values(image_path):
    img = Image.open(image_path)
    text = pytesseract.image_to_string(img)
    lines = text.strip().split('\n')
    
    # Find numeric values in the extracted text
    values = []
    for line in lines:
        nums = [int(s) for s in line.split() if s.isdigit()]
        values.extend(nums)
    
    if len(values) >= 4:
        # Assuming order: TP, FP, FN, TN
        tp, fp, fn, tn = values[:4]
        return tp, fp, fn, tn
    else:
        return None

# Extract data from all matrices
results = {}
for file in file_paths:
    name = os.path.basename(file)
    values = extract_confusion_matrix_values(file)
    if values:
        tp, fp, fn, tn = values
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        results[name] = {
            'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn,
            'Accuracy': accuracy,
            'Precision': precision,
            'Recall': recall,
            'F1 Score': f1_score
        }

# Sort results by F1 score (best overall measure for classification)
best_model = sorted(results.items(), key=lambda x: x[1]['F1 Score'], reverse=True)

# Display best model and summary of metrics
best_model


[]

# This notebook cell evaluates an RT-DETR object detection model at multiple confidence thresholds

This code block loads a trained RT-DETR model from a specified path and evaluates its performance on the test split of a YOLO-format dataset at various confidence thresholds. For each threshold (from 0.1 to 1.0), it runs validation, saves the results in JSON format, generates plots, and prints the results. This allows you to analyze how the model's performance changes as the detection confidence threshold varies.

---


In [4]:
from ultralytics import RTDETR

model= RTDETR(r"/Users/gilbenor/Library/CloudStorage/OneDrive-Personal/measurement_paper_images/detection drone/runs-detections-drone-14.08/detect/train/weights/best.pt")


for threshold in [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]:
    results = model.val(save_json=True, data=r"/Users/gilbenor/Downloads/circle pond.v23i.yolov8/data.yaml", plots=True,split='test',conf=threshold)
    print(f"threshold: {threshold}")
    print(results)

Ultralytics 8.3.100 🚀 Python-3.9.6 torch-2.0.1 CPU (Apple M4)
rt-detr-l summary: 302 layers, 31,985,795 parameters, 0 gradients, 103.4 GFLOPs


[34m[1mval: [0mScanning /Users/gilbenor/Downloads/circle pond.v23i.yolov8/test/labels.cache... 79 images, 1 backgrounds, 0 corrupt: 100%|██████████| 79/79 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95):   0%|          | 0/5 [00:05<?, ?it/s]


KeyboardInterrupt: 



->

# Creating and Inspecting a FiftyOne Dataset from YOLO-Formatted Data

This code block creates a new FiftyOne dataset called `circle-pond-analysis` from a directory containing images and YOLO-format label files, organized into `train`, `valid`, and `test` splits.

- **Dataset Structure:**  
  The code expects the following folder structure for each split:
  ```
  <dataset_dir>/<split>/images/
  <dataset_dir>/<split>/labels/
  ```
  where `<split>` is one of `train`, `valid`, or `test`.

- **Label Parsing:**  
  For each image, the code looks for a corresponding YOLO label file. Each label file is parsed, and the bounding boxes are converted into FiftyOne `Detection` objects (with the label `'ground_truth'`).

- **Sample Creation:**  
  Each image and its detections are added as a sample to the FiftyOne dataset, with the split name added as a tag.

- **Dataset Summary:**  
  After loading, the code prints:
  - The number of files in the test directory
  - The number of samples in the dataset
  - A summary and schema of the dataset
  - Details of the first sample, including its ground truth detections

- **Visualization:**  
  Optionally, you can launch the FiftyOne app to visually inspect the dataset by uncommenting the last two lines.

This setup is useful for preparing and verifying your dataset before training or evaluation.

In [5]:
import fiftyone as fo
import os
import glob

# Set the dataset name
name = "circle-pond-analysis"

# Set the dataset directory
dataset_dir = r"/Users/gilbenor/Downloads/circle pond.v20i.yolov8"

# The splits to load
splits = ["test", "valid","train"]


dataset = fo.Dataset(name,overwrite=True)
print(f"Created new dataset '{name}'")


#print dataset dir
print(len(os.listdir("/Users/gilbenor/Downloads/circle pond.v20i.yolov8/test")))

def read_yolo_label(label_path):
    with open(label_path, 'r') as file:
        lines = file.readlines()
    detections = []
    for line in lines:
        class_id, x_center, y_center, width, height = map(float, line.strip().split())
        detections.append(
            fo.Detection(
                label='ground_truth' , # Convert class_id to string
                bounding_box=[x_center - width/2, y_center - height/2, width, height]
            )
        )
    return detections

for split in splits:
    split_dir = os.path.join(dataset_dir, split)
    images_dir = os.path.join(split_dir, "images")
    labels_dir = os.path.join(split_dir, "labels")
    
    # Add images and labels
    for image_path in glob.glob(os.path.join(images_dir, "*")):
        image_name = os.path.basename(image_path)
        label_name = os.path.splitext(image_name)[0] + ".txt"
        label_path = os.path.join(labels_dir, label_name)
        
        sample = fo.Sample(filepath=image_path)
        sample.tags.append(split)

        if os.path.exists(label_path):
            detections = read_yolo_label(label_path)
            sample["ground_truth"] = fo.Detections(detections=detections,label_field="ground_truth")
           
        #add sample to dataset
        dataset.add_sample(sample)
        print(f"Added {image_name} with label (if exists)")

print(f"\nDataset '{name}' now has {len(dataset)} samples")

# Print detailed information about the dataset
print("\nDataset Summary:")
print(dataset.summary())

# Print schema of the dataset
print("\nDataset Schema:")
for field_name, field_type in dataset.get_field_schema().items():
    print(f"{field_name}: {field_type}")

# If you want to examine a specific sample
if len(dataset) > 0:
    sample = dataset.first()
    print("\nFirst Sample Details:")
    print(sample)

    if "ground_truth" in sample:
        print("\nGround Truth for First Sample:")
        print(sample.ground_truth)
    else:
        print("\nWarning: 'ground_truth' field not found in the first sample")
else:
    print("\nWarning: Dataset is empty")

# Optionally, you can visualize the dataset
# session = fo.launch_app(dataset)
# session.wait()


Created new dataset 'circle-pond-analysis'
2
Added 20230920_120410_jpg.rf.228c5eec6b39d44ad174701845916268.jpg with label (if exists)
Added 20230920_115951_jpg.rf.cd3046bb49355e395045e8631aded018.jpg with label (if exists)
Added 20230920_121523_jpg.rf.dbe991c9c4bd750d730ba2a50ef21087.jpg with label (if exists)
Added 20230920_121416_jpg.rf.afffe41c31395002ec7d0dbca1d9d2e1.jpg with label (if exists)
Added 20230920_115310_jpg.rf.b8a6ccd11c7598ec045e2fda3e7dc524.jpg with label (if exists)
Added 20230920_120814_jpg.rf.f4cbaf661cdcdb67b2b820182fd80a51.jpg with label (if exists)
Added 20230920_120425_jpg.rf.f3d9a39cfeb1b2857480894f05663172.jpg with label (if exists)
Added 20230920_120359_jpg.rf.6c0789ba372afb3eb463288e5a235837.jpg with label (if exists)
Added 20230920_115213_jpg.rf.c73c0a0cec2c595a80e6a7ef7066d63c.jpg with label (if exists)
Added 20230920_121213_jpg.rf.638b72a42e4db7183e192536efd0057a.jpg with label (if exists)
Added 20230920_115946_jpg.rf.76816e35cb54c653ab07be62c6caabab.jpg

### check version ####

In [None]:
import fiftyone as fo
print(fo.__version__)

1.6.0


# tag each sample for test, valid train #

In [6]:
# Print detection count statistics per image

import numpy as np
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning, 
                       module="numpy.core")

from fiftyone import ViewField as F
#for each split, print the avg number of detections and std
splits = ["test", "valid","train"]

splits_test=[]
splits_val=[]
splits_train=[]

for sample in dataset:
    if "test" in sample.tags:
        splits_test.append(sample)
    elif "valid" in sample.tags:
        splits_val.append(sample)
    elif "train" in sample.tags:
        splits_train.append(sample)

print(f"Average number of detections for split test: {np.mean([len(sample.ground_truth.detections) for sample in splits_test]):.2f}")
print(f"Standard deviation of detections for split test: {np.std([len(sample.ground_truth.detections) for sample in splits_test]):.2f}")

print(f"Average number of detections for split val: {np.mean([len(sample.ground_truth.detections) for sample in splits_val]):.2f}")
print(f"Standard deviation of detections for split val: {np.std([len(sample.ground_truth.detections) for sample in splits_val]):.2f}")


#number of samples in each split
print(f"Number of samples in split test: {len(splits_test)}")
print(f"Number of samples in split val: {len(splits_val)}")
print(f"Number of samples in split train: {len(splits_train)}")

print(f"Average number of detections for split train: {np.mean([len(sample.ground_truth.detections) for sample in splits_train]):.2f}")
print(f"Standard deviation of detections for split train: {np.std([len(sample.ground_truth.detections) for sample in splits_train]):.2f}")


print("\nDetection Count Statistics:")
detection_counts = []
for sample in dataset:
    if "ground_truth" in sample:
        count = len(sample.ground_truth.detections)
        detection_counts.append(count)

if detection_counts:
    print(f'min detection counts that are not 0: {min([count for count in detection_counts if count != 0])}')
    print(f"Min detections per image: {min(detection_counts)}")
    print(f"Max detections per image: {max(detection_counts)}")
    print(f"Average detections per image: {sum(detection_counts)/len(detection_counts):.2f}")
    #std
    #total number of detections
    print(f"Total number of detections: {sum(detection_counts)}")
    print(f"Standard deviation of detections per image: {np.std(detection_counts):.2f}")
else:
    print("No detection counts found in dataset")


#for each tag, print the avg number of detections and std
for tag in dataset.distinct("tags"):
    print(f"Average number of detections for tag {tag}: {np.mean([len(sample.ground_truth.detections) for sample in dataset.filter_labels('ground_truth', F('tags') == tag)]):.2f}")
    print(f"Standard deviation of detections for tag {tag}: {np.std([len(sample.ground_truth.detections) for sample in dataset.filter_labels('ground_truth', F('tags') == tag)]):.2f}")
    

Average number of detections for split test: 7.10
Standard deviation of detections for split test: 4.32
Average number of detections for split val: 2.51
Standard deviation of detections for split val: 2.37
Number of samples in split test: 79
Number of samples in split val: 102
Number of samples in split train: 494
Average number of detections for split train: 4.29
Standard deviation of detections for split train: 3.24

Detection Count Statistics:
min detection counts that are not 0: 1
Min detections per image: 0
Max detections per image: 20
Average detections per image: 4.35
Total number of detections: 2938
Standard deviation of detections per image: 3.48
Average number of detections for tag test: nan
Standard deviation of detections for tag test: nan
Average number of detections for tag train: nan
Standard deviation of detections for tag train: nan
Average number of detections for tag valid: nan
Standard deviation of detections for tag valid: nan


 # Launch app #

In [7]:
import fiftyone as fo

dataset = fo.load_dataset("circle-pond-2333")


session = fo.launch_app(dataset, port=5160, auto=False)
session.show()

Session launched. Run `session.show()` to open the App in a cell output.




->

# Model Evaluation on Test Set

This code applies the RT-DETR model to the test split of the dataset and evaluates its performance against ground truth detections.

**What it does:**
- Filters the dataset to get only samples tagged with "test"
- Applies the RT-DETR model to generate predictions with confidence threshold 0.3
- Evaluates the model performance using IoU threshold of 0.5
- Compares predicted detections against ground truth detections

**Key correction:** We use `match_tags("test")` to filter samples by their tags, not `filter_labels()` which filters individual detections.

In [14]:
dataset.app_config.sidebar_groups = []
dataset.save()

In [8]:
import fiftyone as fo
import fiftyone.utils.eval as foue
from ultralytics import RTDETR
from fiftyone import ViewField as F
from fiftyone.core.odm.dataset import SidebarGroupDocument

# Load the model
model = RTDETR(r"/Users/gilbenor/Library/CloudStorage/OneDrive-Personal/measurement_paper_images/detection drone/runs-detections-drone-14.08/detect/train/weights/best.pt")

# CORRECT WAY: Filter samples (not labels) by tags
test_set_view = dataset.match_tags("test")

results = None
print(f"Number of samples in test set: {len(test_set_view)}")

# Check if we have any samples
if len(test_set_view) == 0:
    print("No test samples found! Available tags:")
    print(dataset.distinct("tags"))
else:
    # Apply the model to the filtered test set
    test_set_view.apply_model(model, label_field="prawn", confidence_thresh=0.3)
    
    # Evaluate the model's performance - using a simpler approach to avoid sidebar validation error
    try:
        foue.evaluate_detections(
        test_set_view,
        pred_field="prawn",
        gt_field="ground_truth",
        eval_key="eval",
        iou=0.5,         # lower threshold to accept slight misalignments
        classwise=False  # don’t require identical labels
    )

        print("Evaluation results:")
        print(results)
        
    except Exception as e:
        print(f"FiftyOne evaluation failed: {e}")
        print("Using manual evaluation instead...")
        
        # Manual evaluation as fallback
        results = None
        print("Applied model predictions successfully. Use manual evaluation in next cell.")

Number of samples in test set: 79
 100% |███████████████████| 79/79 [56.2s elapsed, 0s remaining, 1.6 samples/s]      
Evaluating detections...
 100% |███████████████████| 79/79 [826.0ms elapsed, 0s remaining, 95.6 samples/s]      
Evaluation results:
None


In [15]:
# Manual Evaluation Functions - Bug Fix for TP=0 Issue

def calculate_iou(box1, box2):
    """
    Calculate Intersection over Union (IoU) between two bounding boxes.
    
    Args:
        box1, box2: [x_center, y_center, width, height] in normalized coordinates (0-1)
    
    Returns:
        IoU value between 0 and 1
    """
    # Convert from center format to corner format
    def center_to_corners(box):
        x_center, y_center, width, height = box
        x1 = x_center - width / 2
        y1 = y_center - height / 2
        x2 = x_center + width / 2
        y2 = y_center + height / 2
        return [x1, y1, x2, y2]
    
    box1_corners = center_to_corners(box1)
    box2_corners = center_to_corners(box2)
    
    # Calculate intersection
    x1 = max(box1_corners[0], box2_corners[0])
    y1 = max(box1_corners[1], box2_corners[1])
    x2 = min(box1_corners[2], box2_corners[2])
    y2 = min(box1_corners[3], box2_corners[3])
    
    if x2 <= x1 or y2 <= y1:
        return 0.0
    
    intersection = (x2 - x1) * (y2 - y1)
    
    # Calculate union
    area1 = box1[2] * box1[3]  # width * height
    area2 = box2[2] * box2[3]  # width * height
    union = area1 + area2 - intersection
    
    if union <= 0:
        return 0.0
    
    return intersection / union

def manual_evaluate_detections(sample, iou_threshold=0.5):
    """
    Manually evaluate detections for a single sample.
    
    Args:
        sample: FiftyOne sample with 'ground_truth' and 'prawn' fields
        iou_threshold: IoU threshold for considering a detection as TP
    
    Returns:
        dict with 'tp', 'fp', 'fn' counts
    """
    # Get ground truth and predictions
    gt_detections = sample.ground_truth.detections if sample.ground_truth else []
    pred_detections = sample.prawn.detections if sample.prawn else []
    
    # Extract bounding boxes
    gt_boxes = []
    for det in gt_detections:
        # det.bounding_box is [x, y, width, height] where x,y is top-left corner
        # Convert to center format: [x_center, y_center, width, height]
        x, y, w, h = det.bounding_box
        x_center = x + w/2
        y_center = y + h/2
        gt_boxes.append([x_center, y_center, w, h])
    
    pred_boxes = []
    for det in pred_detections:
        x, y, w, h = det.bounding_box
        x_center = x + w/2
        y_center = y + h/2
        pred_boxes.append([x_center, y_center, w, h])
    
    # Match predictions to ground truth using Hungarian algorithm (greedy approach)
    gt_matched = [False] * len(gt_boxes)
    pred_matched = [False] * len(pred_boxes)
    
    tp = 0
    
    # For each prediction, find the best matching ground truth
    for pred_idx, pred_box in enumerate(pred_boxes):
        best_iou = 0
        best_gt_idx = -1
        
        for gt_idx, gt_box in enumerate(gt_boxes):
            if gt_matched[gt_idx]:  # Already matched
                continue
                
            iou = calculate_iou(pred_box, gt_box)
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = gt_idx
        
        # If best IoU is above threshold, it's a match
        if best_iou >= iou_threshold and best_gt_idx != -1:
            tp += 1
            gt_matched[best_gt_idx] = True
            pred_matched[pred_idx] = True
    
    # Count unmatched predictions as FP and unmatched ground truth as FN
    fp = sum(1 for matched in pred_matched if not matched)
    fn = sum(1 for matched in gt_matched if not matched)
    
    return {
        'tp': tp,
        'fp': fp, 
        'fn': fn
    }

print("✅ Manual evaluation functions created!")
print("   - calculate_iou(): Computes IoU between two bounding boxes")
print("   - manual_evaluate_detections(): Evaluates TP/FP/FN for a sample")
print("   - Uses IoU threshold of 0.5 by default")
print("   - Handles center vs corner coordinate conversion")

# Test the manual evaluation on a sample
test_sample = dataset.match_tags("test").first()
if test_sample:
    eval_results = manual_evaluate_detections(test_sample, iou_threshold=0.5)
    print(f"\n🔧 Testing manual evaluation on first test sample:")
    print(f"   TP: {eval_results['tp']}, FP: {eval_results['fp']}, FN: {eval_results['fn']}")
    print(f"   (Compare this to old broken evaluation: {test_sample['eval_tp']}, {test_sample['eval_fp']}, {test_sample['eval_fn']})")
else:
    print("\n⚠️  No test samples found")

✅ Manual evaluation functions created!
   - calculate_iou(): Computes IoU between two bounding boxes
   - manual_evaluate_detections(): Evaluates TP/FP/FN for a sample
   - Uses IoU threshold of 0.5 by default
   - Handles center vs corner coordinate conversion

🔧 Testing manual evaluation on first test sample:
   TP: 7, FP: 2, FN: 0
   (Compare this to old broken evaluation: 0, 9, 7)


In [13]:
print(dataset.get_field_schema())

OrderedDict([('id', <fiftyone.core.fields.ObjectIdField object at 0x337bf3880>), ('filepath', <fiftyone.core.fields.StringField object at 0x337bf3e80>), ('tags', <fiftyone.core.fields.ListField object at 0x337bf3eb0>), ('metadata', <fiftyone.core.fields.EmbeddedDocumentField object at 0x337bf3940>), ('created_at', <fiftyone.core.fields.DateTimeField object at 0x337bf3f70>), ('last_modified_at', <fiftyone.core.fields.DateTimeField object at 0x337bf3670>), ('ground_truth', <fiftyone.core.fields.EmbeddedDocumentField object at 0x337bf3a00>), ('prawn', <fiftyone.core.fields.EmbeddedDocumentField object at 0x337bf37f0>), ('eval_simple_tp', <fiftyone.core.fields.IntField object at 0x3438bec10>), ('eval_simple_fp', <fiftyone.core.fields.IntField object at 0x3438bef70>), ('eval_simple_fn', <fiftyone.core.fields.IntField object at 0x3438beb50>), ('eval_tp', <fiftyone.core.fields.IntField object at 0x336352970>), ('eval_fp', <fiftyone.core.fields.IntField object at 0x336352220>), ('eval_fn', <fi



->

# Interactive Detection Performance Analysis

This code creates an interactive Plotly visualization showing TP/FP/FN breakdown per image with MAE overlay. 

**Features:**
- Stacked bars showing True Positives, False Positives, and False Negatives per image
- MAE line overlay to show counting errors
- Interactive hover showing image details (filename, counts, MAE)
- Sortable by MAE, FP, or FN to identify problematic cases
- Reveals that images with same MAE can have very different error patterns

**How to interpret:**
- **Bar height** = total detection activity per image
- **Green** = correctly detected objects (TP)
- **Red** = incorrect detections (FP) 
- **Orange** = missed objects (FN)
- **Blue line** = counting error magnitude (MAE)
- **Hover** to see image filename and exact counts

In [12]:
sample = test_set_view.first()
print(sample["eval_tp"], sample["eval_fp"], sample["eval_fn"])


0 9 7


In [10]:
import pandas as pd

print(dataset.get_field_schema())

# Check if 'tp' attribute exists in the dataset and print it
if hasattr(dataset, 'tp'):
    print("True Positives (tp):", dataset.eval_simple_tp)
else:
    print("The dataset does not have a 'tp' attribute.")

OrderedDict([('id', <fiftyone.core.fields.ObjectIdField object at 0x337bf3880>), ('filepath', <fiftyone.core.fields.StringField object at 0x337bf3e80>), ('tags', <fiftyone.core.fields.ListField object at 0x337bf3eb0>), ('metadata', <fiftyone.core.fields.EmbeddedDocumentField object at 0x337bf3940>), ('created_at', <fiftyone.core.fields.DateTimeField object at 0x337bf3f70>), ('last_modified_at', <fiftyone.core.fields.DateTimeField object at 0x337bf3670>), ('ground_truth', <fiftyone.core.fields.EmbeddedDocumentField object at 0x337bf3a00>), ('prawn', <fiftyone.core.fields.EmbeddedDocumentField object at 0x337bf37f0>), ('eval_simple_tp', <fiftyone.core.fields.IntField object at 0x3438bec10>), ('eval_simple_fp', <fiftyone.core.fields.IntField object at 0x3438bef70>), ('eval_simple_fn', <fiftyone.core.fields.IntField object at 0x3438beb50>), ('eval_tp', <fiftyone.core.fields.IntField object at 0x336352970>), ('eval_fp', <fiftyone.core.fields.IntField object at 0x336352220>), ('eval_fn', <fi

In [16]:
# Fixed Manual Evaluation - Bug Fix for TP=0 Issue

records = []
test_set_view = dataset.match_tags("test")

print("🔧 Using manual evaluation to fix TP=0 bug...")
print("📊 Processing test samples with proper IoU calculation...")

for sample in test_set_view.iter_samples(progress=True):
    # Use our manual evaluation function instead of broken FiftyOne evaluation
    eval_results = manual_evaluate_detections(sample, iou_threshold=0.5)
    
    tp = eval_results['tp']
    fp = eval_results['fp'] 
    fn = eval_results['fn']

    print(tp, fp, fn)
    gt = tp + fn
    pred = tp + fp
    mae = abs(pred - gt)

    records.append({
        "image_id": sample.filepath.split("/")[-1],
        "filepath": sample.filepath.split("/")[-1],
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "GT": gt,
        "Pred": pred,
        "MAE": mae
    })

print(f"✅ Number of records collected: {len(records)}")
print(f"🎯 Total TP across all images: {sum(r['TP'] for r in records)}")
print(f"❌ Total FP across all images: {sum(r['FP'] for r in records)}")  
print(f"⚠️  Total FN across all images: {sum(r['FN'] for r in records)}")

# Quick verification that we fixed the TP=0 bug
tp_count = sum(r['TP'] for r in records)
if tp_count > 0:
    print(f"🎉 SUCCESS: Fixed TP=0 bug! Now have {tp_count} true positives")
else:
    print("⚠️  Still have TP=0 issue - may need to adjust IoU threshold or check coordinate format")


🔧 Using manual evaluation to fix TP=0 bug...
📊 Processing test samples with proper IoU calculation...
7 2 0                                                                       
4 2 0
10 1 0
0 0 0
5 1 0
6 2 0
7 2 1
5 1 0
9 0 0
10 3 3
1 2 0
11 1 4
7 1 1
9 0 0
3 1 0
8 2 1
4 0 0
2 0 0
3 0 0
12 1 2
10 0 0
2 0 3
4 4 1
4 3 0
2 5 2
7 1 0
1 0 1
7 1 1
1 1 0
8 3 4
5 1 0
4 0 2
3 0 0
10 2 0
6 1 0
6 3 5
4 1 0
2 2 0
2 1 0
9 2 1
12 2 1
3 1 1
5 3 0
5 1 0
8 3 5
6 0 0
10 2 9                                                                                 
8 2 5
6 1 2
14 1 0
2 2 0
11 3 9
11 2 4
3 0 0
8 2 0
4 0 0
2 2 0
7 5 2
4 3 3
5 1 0
14 2 2
3 1 1
10 1 5
4 0 0
2 1 0
7 3 1
4 1 0
2 2 1
4 1 0
8 0 0
9 2 1
6 2 0
7 1 1
8 0 1
3 1 1
9 0 1
6 3 1
5 2 0
7 2 0
 100% |███████████████████| 79/79 [192.1ms elapsed, 0s remaining, 411.3 samples/s]     
✅ Number of records collected: 79
🎯 Total TP across all images: 472
❌ Total FP across all images: 114
⚠️  Total FN across all images: 89
🎉 SUCCESS: Fixed TP=0 bug! Now hav

In [None]:
sample = test_set_view.first()
print("Number of predicted boxes:", len(sample["prawn"].detections))


Number of predicted boxes: 9


In [19]:
import plotly.graph_objects as go
import pandas as pd

# Create and sort DataFrame
df = pd.DataFrame(records)
df = df.sort_values("MAE", ascending=False).reset_index(drop=True)

# Clean up image names for better readability
df["clean_name"] = df["filepath"].str.replace("20230920_", "").str.replace("_jpg.rf.", "_").str.replace(".jpg", "")

# Create figure with clean hover templates
fig = go.Figure()

# TP bar - clean hover
fig.add_trace(go.Bar(
    x=df.index,
    y=df["TP"],
    name="True Positives",
    marker_color="green",
    hovertemplate="<b>%{customdata}</b><br>" +
                  "True Positives: %{y}<br>" +
                  "<extra></extra>",
    customdata=df["clean_name"]
))

# FP bar - clean hover  
fig.add_trace(go.Bar(
    x=df.index,
    y=df["FP"],
    name="False Positives", 
    marker_color="orange",
    base=df["TP"],
    hovertemplate="<b>%{customdata}</b><br>" +
                  "False Positives: %{text}<br>" +
                  "<extra></extra>",
    customdata=df["clean_name"],
    text=df["FP"]  # Show actual FP count, not stacked height
))

# FN bar - clean hover
fig.add_trace(go.Bar(
    x=df.index,
    y=-df["FN"],
    name="False Negatives",
    marker_color="red",
    hovertemplate="<b>%{customdata}</b><br>" +
                  "False Negatives: %{text}<br>" +
                  "<extra></extra>",
    customdata=df["clean_name"],
    text=df["FN"]  # Show positive FN value
))

# MAE line - simplified hover
fig.add_trace(go.Scatter(
    x=df.index,
    y=df["MAE"],
    name="MAE",
    mode="lines+markers",
    line=dict(color="black", dash="dot", width=2),
    marker=dict(size=4),
    yaxis="y2",
    hovertemplate="<b>%{customdata}</b><br>" +
                  "MAE: %{y}<br>" +
                  "Predicted: %{text}<br>" +
                  "<extra></extra>",
    customdata=df["clean_name"],
    text=df["Pred"].astype(str) + " | GT: " + df["GT"].astype(str)
))

# Layout settings
fig.update_layout(
    title="🎯 Object Detection Performance: TP/FP/FN Breakdown with MAE",
    barmode="relative",
    xaxis=dict(
        title="Images (sorted by MAE: worst → best)",
        showgrid=True,
        gridcolor="lightgray"
    ),
    yaxis=dict(
        title="Detection Counts",
        showgrid=True,
        gridcolor="lightgray"
    ),
    yaxis2=dict(
        title="MAE (Mean Absolute Error)",
        overlaying="y", 
        side="right",
        showgrid=False
    ),
    legend=dict(x=1.05, y=1),
    hovermode="closest",  # Changed from "x unified" to reduce clutter
    height=600,
    width=1200,
    plot_bgcolor="white"
)

# Show it
fig.show()
