# Test Evaluation System

**Purpose:** Quick test of the new evaluation module with minimal training.

This notebook:
1. Downloads dataset (Roboflow)
2. Builds evaluation indices
3. Trains a small model (YOLOv8n) for up to 50 epochs, with early stopping (just for testing)
4. Generates predictions on test set
5. Runs all 3 evaluation metrics
6. Generates plots

**NOT for actual experiments** - just for debugging the evaluation pipeline!

## 0. Setup & Clone Repo

In [None]:
# Check if in Colab
import os
import sys
from pathlib import Path

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab")
    # Clone and enter the repo, unless a re-run of this cell is already inside it
    if os.path.basename(os.getcwd()) != 'Deep_Learning_Gil_Alon':
        if not os.path.exists('Deep_Learning_Gil_Alon'):
            !git clone https://github.com/gil-attar/Deep_Learning_Project_Gil_Alon.git Deep_Learning_Gil_Alon
        %cd Deep_Learning_Gil_Alon
else:
    print("Running locally")
    # Navigate to project root if in notebooks/
    if os.path.basename(os.getcwd()) == 'notebooks':
        os.chdir('..')

print(f"Working directory: {os.getcwd()}")

In [None]:
# Install dependencies
!pip install -q ultralytics roboflow pyyaml pillow numpy matplotlib pandas tqdm

In [None]:
# Check GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 1. Download Dataset

In [None]:
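# Set Roboflow API key (read from the environment, or prompt; never commit a real key)
import os
from getpass import getpass

if not os.environ.get("ROBOFLOW_API_KEY"):
    os.environ["ROBOFLOW_API_KEY"] = getpass("Roboflow API key: ")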

# Download dataset
!python scripts/download_dataset.py --output_dir data/raw

In [None]:
# Verify dataset downloaded
!echo "Train images: $(ls data/raw/train/images/ 2>/dev/null | wc -l)"
!echo "Valid images: $(ls data/raw/valid/images/ 2>/dev/null | wc -l)"
!echo "Test images: $(ls data/raw/test/images/ 2>/dev/null | wc -l)"

## 2. Build Evaluation Indices
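
The index build in this section uses each image's actual dimensions. Here is a minimal sketch of why that matters, assuming the labels are in the usual YOLO text format (normalized center x, center y, width, height). The same label maps to different pixel boxes depending on the image size used. `yolo_to_xyxy` is a hypothetical helper for illustration, not a function from `scripts/build_evaluation_indices.py`.

```python
def yolo_to_xyxy(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height) to pixel xyxy."""
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return [x1, y1, x2, y2]


# One normalized label, converted with two different image sizes
label = (0.5, 0.5, 0.25, 0.5)
box_actual = yolo_to_xyxy(*label, img_w=1280, img_h=720)   # real image size
box_assumed = yolo_to_xyxy(*label, img_w=640, img_h=640)   # wrong, assumed size

print(box_actual)   # [480.0, 180.0, 800.0, 540.0]
print(box_assumed)  # [240.0, 160.0, 400.0, 480.0]
```

Ultralytics returns `boxes.xyxy` in pixel coordinates of the original image, so ground-truth boxes computed with an assumed size would not line up with predictions on images of any other size.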

In [None]:
# Build train/val/test indices (with ACTUAL image dimensions - this is critical!)
# Remove old indices first to ensure fresh rebuild
import shutil
from pathlib import Path

if Path("data/processed/evaluation").exists():
    shutil.rmtree("data/processed/evaluation")
    print("✓ Removed old indices")

!python scripts/build_evaluation_indices.py \
    --dataset_root data/raw \
    --output_dir data/processed/evaluation

In [None]:
# Verify indices created
import json
from pathlib import Path

test_index_path = "data/processed/evaluation/test_index.json"

if Path(test_index_path).exists():
    with open(test_index_path) as f:
        test_data = json.load(f)
    print(f"✓ Test index: {test_data['metadata']['num_images']} images")
    print(f"  Total objects: {test_data['metadata']['total_objects']}")
    print(f"  Classes: {test_data['metadata']['num_classes']}")
else:
    print("❌ Test index not found! Re-run the index build cell above.")

## 3. Create data.yaml for Training

In [None]:
# Create data.yaml with absolute paths for Colab
import yaml
from pathlib import Path
import os

# Get absolute path to dataset
dataset_root = Path('data/raw').resolve()

# Read original data.yaml to get class names
with open(dataset_root / 'data.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create config with ABSOLUTE paths
train_config = {
    'path': str(dataset_root),  # Absolute base path
    'train': 'train/images',
    'val': 'valid/images', 
    'test': 'test/images',
    'names': config['names'],
    'nc': len(config['names'])
}

# Save to data/processed/
output_path = Path('data/processed/data.yaml')
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, 'w') as f:
    yaml.dump(train_config, f, default_flow_style=False, sort_keys=False)

print(f"✓ Created data.yaml with absolute paths")
print(f"  Path: {train_config['path']}")
print(f"  Classes: {train_config['nc']}")

## 4. Clean Previous Results & Train Fresh Model (50 epochs)

In [None]:
# Clean up old results to force fresh training
import shutil
from pathlib import Path

# Remove old training results
if Path('runs/test_eval/quick_test').exists():
    shutil.rmtree('runs/test_eval/quick_test')
    print("✓ Removed old training results")

# Remove old predictions
if Path('evaluation/metrics/test_quick_predictions.json').exists():
    Path('evaluation/metrics/test_quick_predictions.json').unlink()
    print("✓ Removed old predictions")

print("\nStarting fresh training...")

from ultralytics import YOLO

# Load pretrained model
model = YOLO('yolov8n.pt')  # Nano model (starts with COCO pretrained weights)

print("Training for 50 epochs...")
results = model.train(
    data='data/processed/data.yaml',
    epochs=50,
    imgsz=640,
    batch=-1,  # Auto batch size (AutoBatch picks a size that fits GPU memory)
    patience=10,  # Early stopping after 10 epochs without improvement
    save=True,
    project='runs/test_eval',
    name='quick_test',
    exist_ok=True,
    pretrained=True,  # Use pretrained weights
    optimizer='auto',  # Auto optimizer selection
    verbose=True,
    seed=42,  # Reproducibility
    cache=True  # Cache images for faster training
)

print("\n✓ Training complete!")
print(f"✓ Model saved to: runs/test_eval/quick_test/weights/best.pt")
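
With `patience=10`, training may stop before epoch 50, which is why the epoch count reported later can be lower than 50. The following is a rough sketch of patience-based early stopping in general, not Ultralytics' actual implementation. `epochs_run` and the fitness values are made up for illustration.

```python
def epochs_run(fitness_per_epoch, patience):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = float("-inf")
    best_epoch = 0
    for epoch, fitness in enumerate(fitness_per_epoch, start=1):
        if fitness > best:
            # New best score: remember it and reset the patience window
            best, best_epoch = fitness, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here
            return epoch
    return len(fitness_per_epoch)


# Hypothetical validation fitness per epoch; the best value is reached at epoch 3
history = [0.10, 0.20, 0.30, 0.30, 0.29, 0.28, 0.30]
print(epochs_run(history, patience=3))  # 6
```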

## 5. Verify Training Results

In [None]:
# Check training metrics (mAP50)
import pandas as pd

# Find results.csv
results_csv_paths = [
    'runs/test_eval/quick_test/results.csv',
    'runs/detect/runs/test_eval/quick_test/results.csv'
]

results_csv = None
for path in results_csv_paths:
    if Path(path).exists():
        results_csv = path
        break

if results_csv:
    df = pd.read_csv(results_csv)
    # Get final epoch metrics
    final_metrics = df.iloc[-1]
    
    print("=" * 60)
    print("TRAINING RESULTS (Final Epoch)")
    print("=" * 60)
    
    # Find mAP columns (names vary by YOLO version and may carry leading spaces)
    map_cols = [col for col in df.columns if 'map' in col.lower()]
    if map_cols:
        for col in map_cols[:2]:  # Show first 2 mAP columns
            print(f"{col}: {final_metrics[col]:.4f}")
    
    print(f"\nTotal epochs trained: {len(df)}")
    print("=" * 60)
    
    if len(map_cols) > 0 and final_metrics[map_cols[0]] > 0.3:
        print("✓ Model trained successfully! (mAP > 0.3)")
    else:
        print("⚠️ Warning: Low mAP - model may not have learned properly")
        print("   Check if data.yaml has correct absolute paths!")
else:
    print("❌ Could not find results.csv")
    print("Training may have failed or saved to unexpected location")

## 6. Generate Predictions

In [None]:
# Load trained model and test images
from ultralytics import YOLO
from tqdm import tqdm
import time

# Find the trained weights
possible_paths = [
    "runs/test_eval/quick_test/weights/best.pt",
    "runs/detect/runs/test_eval/quick_test/weights/best.pt"
]

weights_path = None
for path in possible_paths:
    if Path(path).exists():
        weights_path = path
        break

if not weights_path:
    # Search for any best.pt in runs/
    found = list(Path("runs").rglob("quick_test/weights/best.pt"))
    if found:
        weights_path = str(found[0])
    else:
        raise FileNotFoundError("Could not find trained weights!")

model_eval = YOLO(weights_path)
print(f"✓ Loaded model from {weights_path}")

# Load test index
with open(test_index_path) as f:
    test_index = json.load(f)

test_images = test_index['images'][:50]  # Use only first 50 images for quick test
print(f"Running inference on {len(test_images)} test images...")

# Run inference and collect predictions
predictions = []

for img_data in tqdm(test_images, desc="Inference"):
    image_id = img_data['image_id']
    image_filename = img_data['image_filename']
    image_path = Path("data/raw/test/images") / image_filename
    
    if not image_path.exists():
        print(f"Warning: {image_path} not found")
        continue
    
    # Run inference with LOW confidence threshold (save almost everything)
    # Named `result` so it does not shadow the training `results` object
    result = model_eval.predict(
        source=str(image_path),
        conf=0.01,  # Very low threshold so later confidence sweeps see nearly all detections
        imgsz=640,
        verbose=False
    )[0]  # predict() returns a list; one image in, one Results object out
    
    # Extract detections (boxes.xyxy is in pixel coordinates of the original image)
    detections = []
    boxes = result.boxes
    for i in range(len(boxes)):
        class_id = int(boxes.cls[i].item())
        detections.append({
            "class_id": class_id,
            "class_name": result.names[class_id],
            "confidence": float(boxes.conf[i].item()),
            "bbox": boxes.xyxy[i].tolist(),
            "bbox_format": "xyxy"
        })
    
    predictions.append({
        "image_id": image_id,
        "detections": detections
    })

print(f"\n✓ Generated predictions for {len(predictions)} images")
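
Before saving, it can help to sanity-check the detection records. The sketch below checks the fields written by the inference cell above (`class_id`, `class_name`, `confidence`, `bbox`, `bbox_format`). `validate_detection` is a hypothetical helper, not part of the `evaluation` module, and the sample records (including the class name) are invented.

```python
REQUIRED_KEYS = ("class_id", "class_name", "confidence", "bbox", "bbox_format")


def validate_detection(det):
    """Return a list of problems found in one detection dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in det]
    if problems:
        return problems  # Cannot check values when keys are missing

    if not 0.0 <= det["confidence"] <= 1.0:
        problems.append("confidence outside [0, 1]")
    if det["bbox_format"] != "xyxy":
        problems.append("bbox_format is not 'xyxy'")
    if len(det["bbox"]) != 4:
        problems.append("bbox must have 4 values")
    else:
        x1, y1, x2, y2 = det["bbox"]
        if x2 <= x1 or y2 <= y1:
            problems.append("bbox has non-positive width or height")
    return problems


# Invented sample records for illustration
good = {"class_id": 0, "class_name": "example", "confidence": 0.8,
        "bbox": [10.0, 20.0, 50.0, 60.0], "bbox_format": "xyxy"}
bad = {"class_id": 0, "class_name": "example", "confidence": 1.5,
       "bbox": [50.0, 20.0, 10.0, 60.0], "bbox_format": "xyxy"}

print(validate_detection(good))  # []
print(validate_detection(bad))
```

In the notebook, the same check could be run over every `det` in every entry of `predictions`.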

In [None]:
# Save predictions in new format
pred_output_path = "evaluation/metrics/test_quick_predictions.json"
Path(pred_output_path).parent.mkdir(parents=True, exist_ok=True)

pred_json = {
    "run_id": "test_quick_50epochs",
    "split": "test",
    "model_family": "yolo",
    "model_name": "yolov8n",
    "inference_settings": {
        "conf_threshold": 0.01,
        "iou_threshold": 0.7,  # NMS IoU: predict() above passes no iou=, so the Ultralytics default applies
        "imgsz": 640
    },
    "predictions": predictions
}

with open(pred_output_path, 'w') as f:
    json.dump(pred_json, f, indent=2)

print(f"✓ Saved predictions to {pred_output_path}")

# DIAGNOSTIC: Check how many detections we got
total_detections = sum(len(p['detections']) for p in predictions)
images_with_detections = sum(1 for p in predictions if len(p['detections']) > 0)
print(f"\nDIAGNOSTIC INFO:")
print(f"  Total predictions across {len(predictions)} images: {total_detections}")
print(f"  Images with at least 1 detection: {images_with_detections}")
if total_detections > 0:
    # Show confidence range
    all_confs = [d['confidence'] for p in predictions for d in p['detections']]
    print(f"  Confidence range: {min(all_confs):.3f} - {max(all_confs):.3f}")
    print(f"  Detections above 0.1: {sum(1 for c in all_confs if c >= 0.1)}")
    print(f"  Detections above 0.5: {sum(1 for c in all_confs if c >= 0.5)}")
else:
    print("  ⚠️ WARNING: Model made ZERO detections! Try training longer.")

# DEBUG: Compare prediction and GT for first image
print("\n" + "="*60)
print("DEBUG: Comparing first image's prediction vs ground truth")
print("="*60)

# Get first prediction
first_pred = predictions[0]
print(f"\nImage ID: {first_pred['image_id']}")
print(f"Predictions ({len(first_pred['detections'])} total):")
for det in first_pred['detections'][:3]:  # Show first 3
    print(f"  Class: {det['class_name']} (id={det['class_id']})")
    print(f"  BBox: {det['bbox']}")
    print(f"  Confidence: {det['confidence']:.3f}")
    print()

# Get corresponding ground truth
with open(test_index_path) as f:
    test_data = json.load(f)
    
first_gt = next((img for img in test_data['images'] if img['image_id'] == first_pred['image_id']), None)
if first_gt:
    print(f"Ground Truth ({len(first_gt['ground_truth'])} objects):")
    for obj in first_gt['ground_truth'][:3]:  # Show first 3
        print(f"  Class: {obj['class_name']} (id={obj['class_id']})")
        print(f"  BBox (xyxy): {obj['bbox_xyxy']}")
        print()
    
    # Check actual image dimensions
    from PIL import Image as PILImage
    img_path = Path("data/raw/test/images") / first_gt['image_filename']
    if img_path.exists():
        with PILImage.open(img_path) as img:
            print(f"Actual image size: {img.size} (width x height)")
        print("GT boxes above should fall within this size (indices are built from actual image dimensions)")
else:
    print("Ground truth not found for this image!")

## 7. Run Evaluation

Test all 3 evaluation metrics.
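
The implementation of the metrics lives in `evaluation.metrics` and is not shown in this notebook. As a rough illustration of what precision, recall, and F1 at an IoU threshold mean, here is a minimal single-image sketch. It assumes greedy, class-aware matching in descending confidence order, which may differ from what the module does. `iou_xyxy` and `prf_for_image` are hypothetical names; the `bbox` and `bbox_xyxy` keys follow the prediction and ground-truth records used above.

```python
def iou_xyxy(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def prf_for_image(dets, gts, iou_thr=0.5, conf_thr=0.25):
    """Greedy class-aware matching; each ground-truth box is matched at most once."""
    kept = sorted((d for d in dets if d["confidence"] >= conf_thr),
                  key=lambda d: d["confidence"], reverse=True)
    matched, tp = set(), 0
    for det in kept:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched or gt["class_id"] != det["class_id"]:
                continue
            iou = iou_xyxy(det["bbox"], gt["bbox_xyxy"])
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(kept) - tp, len(gts) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented example: two ground-truth objects, three detections
gts = [{"class_id": 0, "bbox_xyxy": [0, 0, 10, 10]},
       {"class_id": 0, "bbox_xyxy": [20, 20, 30, 30]}]
dets = [{"class_id": 0, "confidence": 0.9, "bbox": [0, 0, 10, 10]},
        {"class_id": 0, "confidence": 0.6, "bbox": [21, 21, 31, 31]},
        {"class_id": 0, "confidence": 0.3, "bbox": [50, 50, 60, 60]}]

print(prf_for_image(dets, gts, conf_thr=0.25))  # 2 TP, 1 FP, 0 FN
print(prf_for_image(dets, gts, conf_thr=0.5))   # the low-confidence FP is filtered out
```

Raising the confidence threshold drops the stray detection, so precision rises while recall stays the same in this example. This is the trade-off the threshold sweep in this section explores.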

In [None]:
# Import evaluation module
from evaluation.io import load_predictions, load_ground_truth, load_class_names
from evaluation.metrics import (
    eval_detection_prf_at_iou,
    eval_per_class_metrics_and_confusions,
    eval_counting_quality
)
from evaluation.plots import plot_all_metrics

print("✓ Evaluation module imported successfully")

In [None]:
# Load predictions and ground truth
preds = load_predictions(pred_output_path, split="test")
gts = load_ground_truth(test_index_path, split="test")
class_names = load_class_names(test_index_path)

# Filter GTs to match predictions (first 50 images)
pred_image_ids = {p['image_id'] for p in preds}
gts = [g for g in gts if g['image_id'] in pred_image_ids]

print(f"✓ Loaded {len(preds)} predictions")
print(f"✓ Loaded {len(gts)} ground truths")
print(f"✓ Loaded {len(class_names)} classes")

In [None]:
# 1. P/R/F1 at multiple thresholds
print("1. Running detection P/R/F1 evaluation...")
threshold_sweep = eval_detection_prf_at_iou(
    preds, gts,
    iou_threshold=0.5,
    conf_thresholds=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
)

print("\nResults by confidence threshold:")
for conf_thr, metrics in threshold_sweep.items():
    print(f"  conf={conf_thr}: P={metrics['precision']:.3f}, R={metrics['recall']:.3f}, F1={metrics['f1']:.3f}")

best_thr = max(threshold_sweep.keys(), key=lambda k: threshold_sweep[k]['f1'])
print(f"\n✓ Best threshold: {best_thr} (F1={threshold_sweep[best_thr]['f1']:.3f})")

In [None]:
# 2. Per-class metrics
print("\n2. Running per-class evaluation...")
per_class_results = eval_per_class_metrics_and_confusions(
    preds, gts,
    iou_threshold=0.5,
    conf_threshold=float(best_thr),
    class_names=class_names
)

print(f"\n✓ Evaluated {len(per_class_results['per_class'])} classes")
print(f"✓ Found {len(per_class_results['top_confusions'])} confusion pairs")

# Show top 3 classes by F1
sorted_classes = sorted(
    per_class_results['per_class'].items(),
    key=lambda x: x[1]['f1'],
    reverse=True
)
print("\nTop 3 classes by F1:")
for class_name, metrics in sorted_classes[:3]:
    print(f"  {class_name}: F1={metrics['f1']:.3f} (support={metrics['support']})")

In [None]:
# 3. Counting quality
print("\n3. Running counting quality evaluation...")
counting_results = eval_counting_quality(
    preds, gts,
    iou_threshold=0.5,
    conf_threshold=float(best_thr),
    class_names=class_names
)

print(f"\n✓ Matched-only MAE: {counting_results['matched_only']['global_mae']:.4f}")
print(f"✓ All-predictions MAE: {counting_results['all_predictions']['global_mae']:.4f}")
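
The exact definitions of the matched-only and all-predictions MAE live in `evaluation.metrics` and are not shown in this notebook. As a rough illustration only, one plausible form of a counting error compares per-class object counts for a single image. `count_mae` is a hypothetical helper, and the real module may aggregate differently.

```python
from collections import Counter


def count_mae(pred_class_ids, gt_class_ids, num_classes):
    """Mean absolute difference between predicted and true per-class counts."""
    pred_counts = Counter(pred_class_ids)
    gt_counts = Counter(gt_class_ids)
    # Counter returns 0 for classes that never appear
    errors = [abs(pred_counts[c] - gt_counts[c]) for c in range(num_classes)]
    return sum(errors) / num_classes


# Invented example: class ids of the detections kept for one image vs. its ground truth
pred_ids = [0, 0, 1]
gt_ids = [0, 1, 1, 2]
print(count_mae(pred_ids, gt_ids, num_classes=3))  # 1.0
```

Under this reading, an "all predictions" variant would count every detection above the confidence threshold, while a "matched only" variant would count only detections matched to a ground-truth box.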

## 8. Generate Plots

In [None]:
# Generate all plots
output_dir = "evaluation/results/test_quick/"
Path(output_dir).mkdir(parents=True, exist_ok=True)

plot_all_metrics(
    threshold_sweep=threshold_sweep,
    per_class_results=per_class_results['per_class'],
    confusion_data=per_class_results,
    counting_results=counting_results,
    output_dir=output_dir,
    run_name="Quick Test (50 epochs)"
)

print(f"\n✓ All plots saved to {output_dir}")

## 9. Test CLI Script

In [None]:
# Test the standalone evaluation script
!python scripts/evaluate_run.py \
    --predictions evaluation/metrics/test_quick_predictions.json \
    --ground_truth data/processed/evaluation/test_index.json \
    --output_dir evaluation/results/test_quick_cli/ \
    --run_name "Quick Test CLI" \
    --conf_thresholds 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8

In [None]:
from IPython.display import Image, display
import os

# Show all plots from the notebook results
plot_dir = "evaluation/results/test_quick/"

plots = [
    "threshold_sweep.png",
    "per_class_f1.png", 
    "confusion_matrix.png",
    "count_mae_comparison.png"
]

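for plot_name in plots:
    plot_path = os.path.join(plot_dir, plot_name)
    if os.path.exists(plot_path):
        print(f"\n{'='*60}")
        print(f"{plot_name}")
        print('='*60)
        display(Image(plot_path))
    else:
        # Surface missing files instead of skipping them silently
        print(f"⚠️ Plot not found: {plot_path}")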


## Summary

If every cell above ran without errors, the evaluation system is working!

**What was tested:**
- ✅ Dataset download
- ✅ Evaluation indices generation
- ✅ Model training (up to 50 epochs, early stopping enabled)
- ✅ Prediction generation and saving
- ✅ Loading predictions and ground truth
- ✅ Detection P/R/F1 at multiple thresholds
- ✅ Per-class metrics and confusion matrix
- ✅ Counting quality (both methods)
- ✅ Plot generation
- ✅ CLI evaluation script

**Next steps:**
1. Run actual experiments with proper training
2. Use the evaluation system on train/val/test splits
3. Compare models (YOLO vs RT-DETR)