# Test Evaluation System

**Purpose:** This notebook runs Experiment 1 (E1) and processes its results.

This notebook:

1. Downloads the dataset (Roboflow)
2. Downloads COCO-pretrained models (YOLOv8m and RT-DETR-L)
3. Builds evaluation indices
4. Runs the experiment
5. Generates predictions on the test set
6. Runs all 3 evaluation metrics
7. Generates plots


## 0. Setup & Clone Repo

In [None]:
# Check if in Colab
import os
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab")
    # Clone repo if not already cloned
    if not os.path.exists('Deep_Learning_Gil_Alon'):
        !git clone https://github.com/gil-attar/Deep_Learning_Project_Gil_Alon.git Deep_Learning_Gil_Alon
    %cd Deep_Learning_Gil_Alon
else:
    print("Running locally")
    # Navigate to project root if in notebooks/
    if os.path.basename(os.getcwd()) == 'notebooks':
        os.chdir('..')

print(f"Working directory: {os.getcwd()}")

In [None]:
# Install dependencies
!pip install -q ultralytics roboflow pyyaml pillow numpy matplotlib pandas tqdm


In [None]:
# Check GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 1. Download Dataset

In [None]:
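# Set Roboflow API key: read it from the environment and prompt if unset,
# so the key is never stored in the notebook
import os
from getpass import getpass

if not os.environ.get("ROBOFLOW_API_KEY"):
    os.environ["ROBOFLOW_API_KEY"] = getpass("Roboflow API key: ")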

# Download dataset
!python scripts/download_dataset.py --output_dir data/raw

In [None]:
# Verify dataset downloaded
!echo "Train images: $(ls data/raw/train/images/ 2>/dev/null | wc -l)"
!echo "Valid images: $(ls data/raw/valid/images/ 2>/dev/null | wc -l)"
!echo "Test images: $(ls data/raw/test/images/ 2>/dev/null | wc -l)"

## 2. Fetch COCO-Pretrained Weights (YOLOv8m + RT-DETR-L)

The weights are stored under `artifacts/weights/`.
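
As a quick check after fetching, a minimal sketch like the following lists what actually landed in that directory. The helper name `list_weight_files` and the `*.pt` filter are assumptions for illustration; the exact file names depend on `scripts/fetch_weights.sh`.

```python
from pathlib import Path


def list_weight_files(weights_dir="artifacts/weights", pattern="*.pt"):
    """Return the sorted names of weight files found directly under weights_dir.

    Hypothetical helper: the '*.pt' pattern is an assumption about how the
    fetch script names its outputs.
    """
    root = Path(weights_dir)
    if not root.is_dir():
        # Nothing fetched yet (or wrong working directory)
        return []
    return sorted(p.name for p in root.glob(pattern) if p.is_file())


found = list_weight_files()
print(f"Found {len(found)} weight file(s): {found}")
```

An empty list usually means the fetch step has not run yet or the notebook is not at the project root.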


In [None]:
# Fetch pretrained weights (idempotent)
!bash scripts/fetch_weights.sh


## 3. Build Evaluation Indices

In [None]:
# Build train/val/test indices (with ACTUAL image dimensions - this is critical!)
# Remove old indices first to ensure fresh rebuild
import shutil
from pathlib import Path

if Path("data/processed/evaluation").exists():
    shutil.rmtree("data/processed/evaluation")
    print("✓ Removed old indices")

!python scripts/build_evaluation_indices.py \
    --dataset_root data/raw \
    --output_dir data/processed/evaluation

In [None]:
# Verify indices created
import json
from pathlib import Path

test_index_path = "data/processed/evaluation/test_index.json"

if Path(test_index_path).exists():
    with open(test_index_path) as f:
        test_data = json.load(f)
    print(f"✓ Test index: {test_data['metadata']['num_images']} images")
    print(f"  Total objects: {test_data['metadata']['total_objects']}")
    print(f"  Classes: {test_data['metadata']['num_classes']}")
else:
    print("❌ Test index not found!")

### 3.1. Create data.yaml for Training

In [None]:
# Create data.yaml with absolute paths for Colab
import yaml
from pathlib import Path
import os

# Get absolute path to dataset
dataset_root = Path('data/raw').resolve()

# Read original data.yaml to get class names
with open(dataset_root / 'data.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create config with ABSOLUTE paths
train_config = {
    'path': str(dataset_root),  # Absolute base path
    'train': 'train/images',
    'val': 'valid/images', 
    'test': 'test/images',
    'names': config['names'],
    'nc': len(config['names'])
}

# Save to data/processed/
output_path = Path('data/processed/data.yaml')
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, 'w') as f:
    yaml.dump(train_config, f, default_flow_style=False, sort_keys=False)

print("✓ Created data.yaml with absolute paths")
print(f"  Path: {train_config['path']}")
print(f"  Classes: {train_config['nc']}")

## 4. Run Experiment 1 (Freeze Ladder: YOLOv8m vs RT-DETR-L)

This section runs the full E1 sweep using the experiment runner scripts.
Each run trains a model, exports predictions, and runs the custom evaluator.
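
The 8-run matrix can be sketched as the product of two models and four freeze levels. In the sketch below, the directory names in `MODELS` and the helper `expected_run_dirs` are assumptions; the `F0`–`F3` identifiers and the `runs/<model>/<F#>` layout follow the verification and aggregation cells of this notebook. The authoritative list of runs is whatever `run_experiment1.sh` launches.

```python
from itertools import product
from pathlib import Path

# Assumed model directory names; the real ones are set by the runner scripts.
MODELS = ["yolov8m", "rtdetr-l"]
# Freeze-ladder steps, matching the F[0-3] directories globbed later on.
FREEZE_IDS = ["F0", "F1", "F2", "F3"]


def expected_run_dirs(runs_root="experiments/Experiment_1/runs"):
    """Return one expected run directory per (model, freeze_id) pair."""
    root = Path(runs_root)
    return [root / model / freeze_id for model, freeze_id in product(MODELS, FREEZE_IDS)]


for run_dir in expected_run_dirs():
    print(run_dir.as_posix())
```

Comparing this list against the directories that exist after the sweep is one way to spot a run that never started.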


In [None]:
# E1 sweep configuration
DRY_RUN = True   # True for quick testing: 1 epoch for each of the 8 runs. False for the real sweep.
EPOCHS  = 1 if DRY_RUN else 50
IMGSZ   = 640
SEED    = 42

print(f"DRY_RUN={DRY_RUN} | EPOCHS={EPOCHS} | IMGSZ={IMGSZ} | SEED={SEED}")


In [None]:
# Run the full E1 matrix (8 runs)
# This calls experiments/Experiment_1/runOneTest.py for each (model, freeze_id).
!EPOCHS={EPOCHS} IMGSZ={IMGSZ} SEED={SEED} bash experiments/Experiment_1/run_experiment1.sh


## 5. Verify Run Outputs

Sanity checks: verify that each run directory contains manifests, predictions, and evaluation outputs.


In [None]:
from pathlib import Path
import json

runs_root = Path("experiments/Experiment_1/runs")
assert runs_root.exists(), f"Missing runs root: {runs_root}"

expected_files = [
    "run_manifest.json",
    "train_summary.json",
    "predictions/val_predictions.json",
    "predictions/test_predictions.json",
    "eval/val/metrics.json",
    "eval/test/metrics.json",
]

missing = []
run_dirs = sorted([p for p in runs_root.glob("**/F[0-3]") if p.is_dir()])
print(f"Found {len(run_dirs)} run directories")
for rd in run_dirs:
    for ef in expected_files:
        if not (rd / ef).exists():
            missing.append((str(rd), ef))

if missing:
    print("WARNING: missing expected artifacts in some runs:")
    for rd, ef in missing[:40]:
        print(f"  - {rd} :: {ef}")
else:
    print("✓ All runs contain the expected artifacts.")


## 6. Aggregate Metrics Across Runs

Build a single table across the 8 runs, reading `run_manifest.json`, `eval/test/metrics.json`, and the predictions timing metadata.


In [None]:
import pandas as pd
import math

def _safe_get(d, keys, default=None):
    cur = d
    for k in keys:
        if not isinstance(cur, dict) or k not in cur:
            return default
        cur = cur[k]
    return cur

def extract_primary_metric(metrics_json: dict):
    """Robust extraction across evaluator versions. Tries common locations/keys."""
    # Common patterns we might produce (depending on evaluator implementation)
    candidates = [
        ("overall_f1", ["overall", "f1"]),
        ("overall_precision", ["overall", "precision"]),
        ("overall_recall", ["overall", "recall"]),
        ("f1", ["f1"]),
        ("precision", ["precision"]),
        ("recall", ["recall"]),
        ("map50", ["map50"]),
        ("map", ["map"]),
    ]
    for name, path in candidates:
        val = _safe_get(metrics_json, path)
        if isinstance(val, (int, float)) and not math.isnan(val):
            return name, float(val)
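    # Fallback: first top-level numeric field (skip bools and NaN, which are not usable metrics)
    for k, v in metrics_json.items():
        if isinstance(v, (int, float)) and not isinstance(v, bool) and not math.isnan(v):
            return str(k), float(v)
    return None, None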

rows = []
for rd in run_dirs:
    # Parse run identifiers from path: .../runs/<model>/<F#>
    parts = rd.parts
    model = parts[-2]
    freeze_id = parts[-1]
    manifest = json.loads((rd / "run_manifest.json").read_text())
    test_metrics = json.loads((rd / "eval/test/metrics.json").read_text())
    preds_test = json.loads((rd / "predictions/test_predictions.json").read_text())

    metric_name, metric_val = extract_primary_metric(test_metrics)

    rows.append({
        "model": model,
        "freeze_id": freeze_id,
        "trainable_params": _safe_get(manifest, ["trainable_params"]),
        "total_params": _safe_get(manifest, ["total_params"]),
        "primary_metric_name": metric_name,
        "primary_metric": metric_val,
        "avg_inference_time_ms": _safe_get(preds_test, ["inference_time_ms", "avg_inference_time_ms"]),
        "num_images": _safe_get(preds_test, ["inference_time_ms", "num_images"])
    })

df = pd.DataFrame(rows).sort_values(["model","freeze_id"])
display(df)


## 7. Plots

Key plots for E1:
1. Performance vs. trainable parameters (transfer learning / fine-tuning tradeoff).
2. Speed–accuracy tradeoff using average inference time.
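
Since the two architectures have different total parameter counts, it can help to read Plot 1 alongside the fraction of parameters that is trainable in each run. A minimal sketch, assuming `trainable_params` and `total_params` are the counts stored in `run_manifest.json` (the helper name `trainable_fraction` is hypothetical):

```python
def trainable_fraction(trainable_params, total_params):
    """Return trainable_params / total_params, or None if it cannot be computed."""
    # Missing counts or a zero/missing total make the ratio undefined.
    if trainable_params is None or not total_params:
        return None
    return trainable_params / total_params
```

Applied to the aggregated table, this could become an extra column, for example `df["trainable_frac"] = [trainable_fraction(t, n) for t, n in zip(df["trainable_params"], df["total_params"])]`.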


In [None]:
import matplotlib.pyplot as plt

# Plot 1: primary metric vs trainable parameters
plt.figure(figsize=(7,4))
for model, g in df.groupby("model"):
    plt.plot(g["trainable_params"], g["primary_metric"], marker="o", label=model)
plt.xscale("log")
plt.xlabel("Trainable parameters (log scale)")
plt.ylabel(df["primary_metric_name"].dropna().iloc[0] if df["primary_metric_name"].notna().any() else "primary_metric")
plt.title("E1: Performance vs Trainable Parameters")
plt.grid(True)
plt.legend()
plt.show()

# Plot 2: speed–accuracy (if timing is available)
if df["avg_inference_time_ms"].notna().any():
    plt.figure(figsize=(7,4))
    for model, g in df.groupby("model"):
        plt.scatter(g["avg_inference_time_ms"], g["primary_metric"], label=model)
        for _, r in g.iterrows():
            plt.annotate(r["freeze_id"], (r["avg_inference_time_ms"], r["primary_metric"]))
    plt.xlabel("Avg inference time per image (ms)")
    plt.ylabel(df["primary_metric_name"].dropna().iloc[0] if df["primary_metric_name"].notna().any() else "primary_metric")
    plt.title("E1: Speed–Accuracy Tradeoff")
    plt.grid(True)
    plt.legend()
    plt.show()
else:
    print("No inference timing found in predictions JSON (avg_inference_time_ms).")


## 8. Inspect One Run (Optional)

Display a few evaluator plots for a selected run directory.


In [None]:
from IPython.display import Image, display

# Choose a run to inspect
inspect_run = run_dirs[0] if run_dirs else None
print(f"Inspecting: {inspect_run}")

if inspect_run:
    for rel in [
        "eval/test/threshold_sweep.png",
        "eval/test/per_class_f1.png",
        "eval/test/confusion_matrix.png",
        "eval/test/count_mae_comparison.png",
    ]:
        p = inspect_run / rel
        if p.exists():
            print(f"\n{rel}:")
            display(Image(filename=str(p)))
        else:
            print(f"Missing: {rel}")


## Summary

- Stages 1–3 prepare the dataset, build ground-truth indices, and generate `data/processed/data.yaml`.
- Stage 4 runs E1 (8 runs) using the experiment runner scripts.
- Stages 5–8 verify artifacts and aggregate results into tables and plots suitable for reporting.
