# Week 5: Deep Learning for Perception

## Module II: Perception & Localization

### Topics Covered

- Neural Networks Fundamentals (CNNs)
- Object Detection (YOLO, SSD, Faster R-CNN)
- Semantic Segmentation (FCN, U-Net, DeepLab)
- Instance Segmentation
- Model Training and Evaluation

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand convolutional neural network (CNN) architectures for computer vision
2. Implement and evaluate object detection models for autonomous driving
3. Apply semantic segmentation for scene understanding
4. Distinguish between classification, detection, and segmentation tasks
5. Evaluate perception models using standard metrics (mAP, IoU, etc.)
6. Understand trade-offs between accuracy and inference speed
7. Apply data augmentation and training strategies for robust models

---

## Setup

Import the libraries used for numerical computation and visualization.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, Circle, Polygon
from matplotlib.colors import ListedColormap
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Plotting configuration
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 10

print("Libraries loaded successfully!")
print("NumPy version:", np.__version__)

## 1. Neural Networks Fundamentals

**Deep Learning** has revolutionized perception for autonomous vehicles, enabling robust detection and classification of objects, lanes, and road signs.

### Why Deep Learning for Autonomous Driving?

**Traditional Computer Vision** (hand-crafted features):
- HOG (Histogram of Oriented Gradients)
- SIFT (Scale-Invariant Feature Transform)
- Haar cascades

**Limitations**:
- ❌ Brittle to lighting/weather variations
- ❌ Requires manual feature engineering
- ❌ Poor generalization to novel scenarios

**Deep Learning**:
- ✅ Learns features automatically from data
- ✅ Robust to variations (lighting, occlusion, etc.)
- ✅ State-of-the-art performance on most camera-based perception benchmarks

---

### Convolutional Neural Networks (CNNs)

**CNNs** are specialized neural networks for processing grid-like data (images).

#### **Key Components**

##### **1. Convolutional Layer**

Applies learnable filters to extract features:

$$y[i, j] = \sum_{m} \sum_{n} x[i+m, j+n] \cdot w[m, n] + b$$

Where:
- **x**: Input feature map
- **w**: Convolutional kernel (weights)
- **b**: Bias term
- **y**: Output feature map

**Example**: A 3×3 kernel sliding over a 5×5 image (stride 1, no padding) produces a 3×3 output, since 5 − 3 + 1 = 3.

**Key properties**:
- **Local connectivity**: Each neuron sees only a small region
- **Parameter sharing**: Same kernel applied across entire image
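- **Translation equivariance**: Shifting the input shifts the feature map by the same amount, so a feature is detected wherever it appears (pooling then adds a degree of invariance)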

##### **2. Pooling Layer**

Downsamples feature maps to reduce computation and improve invariance.

**Max Pooling** (2×2 with stride 2):
$$y[i, j] = \max(x[2i:2i+2, 2j:2j+2])$$

**Average Pooling**:
$$y[i, j] = \frac{1}{4} \sum_{m=0}^{1} \sum_{n=0}^{1} x[2i+m, 2j+n]$$

**Effect**: Reduces spatial dimensions by factor of 2
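
The two pooling formulas above can be sketched in NumPy with a reshape trick. This is a minimal illustration, assuming a 2D input whose height and width are divisible by the window size; the helper name `pool2d` is hypothetical.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling (stride == size) on a 2D array.

    Assumes x.shape is divisible by `size` in both dimensions.
    """
    h, w = x.shape
    # Split each axis into (block index, offset within block) so that every
    # size x size window gets its own pair of axes (1 and 3).
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(x, mode="max")  # [[ 5.,  7.], [13., 15.]]
avg_pooled = pool2d(x, mode="avg")  # [[ 2.5,  4.5], [10.5, 12.5]]
```

Both outputs are 2×2: each spatial dimension of the 4×4 input is halved.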

##### **3. Activation Functions**

Introduce non-linearity to learn complex patterns.

**ReLU (Rectified Linear Unit)** - Most common:
$$f(x) = \max(0, x)$$

**Sigmoid** - Output layer for binary classification:
$$f(x) = \frac{1}{1 + e^{-x}}$$

**Softmax** - Output layer for multi-class:
$$f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
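
As a rough sketch, the three activations translate directly into NumPy. The max-subtraction in `softmax` is a common numerical-stability trick and is not part of the formula above; it leaves the result unchanged.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """f(x) = 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, -1.0])
probs = softmax(logits)  # non-negative values that sum to 1
```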

##### **4. Fully Connected Layer**

Standard dense layer connecting all inputs to all outputs:
$$y = Wx + b$$

---

### CNN Architecture Example: LeNet-5

```
Input (32×32×1)
    ↓
Conv1 (5×5, 6 filters) → 28×28×6
    ↓
MaxPool (2×2) → 14×14×6
    ↓
Conv2 (5×5, 16 filters) → 10×10×16
    ↓
MaxPool (2×2) → 5×5×16
    ↓
Flatten → 400
    ↓
FC1 → 120
    ↓
FC2 → 84
    ↓
Output → 10 classes
```
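
The shapes in the diagram follow from the output-size rule `(n + 2p - k) // s + 1`. The sketch below checks them and counts parameters, assuming every Conv2 filter sees all 6 input channels and every filter has one bias (the original LeNet-5 used a sparser connection scheme, so its count differs). The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(kernel, c_in, c_out):
    return (kernel * kernel * c_in + 1) * c_out  # +1 for each filter's bias

def fc_params(n_in, n_out):
    return (n_in + 1) * n_out  # +1 for each output's bias

size = 32
size = conv_out(size, kernel=5)            # Conv1 -> 28
size = conv_out(size, kernel=2, stride=2)  # Pool  -> 14
size = conv_out(size, kernel=5)            # Conv2 -> 10
size = conv_out(size, kernel=2, stride=2)  # Pool  -> 5
flat = size * size * 16                    # 5 * 5 * 16 = 400

total_params = (
    conv_params(5, 1, 6) + conv_params(5, 6, 16)
    + fc_params(flat, 120) + fc_params(120, 84) + fc_params(84, 10)
)
print(size, flat, total_params)  # 5 400 61706
```

Most of the roughly 62k parameters sit in the first fully connected layer (48,120), not in the convolutions.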

---

### Modern CNN Architectures

| Architecture | Year | Key Innovation | Parameters | ImageNet Top-5 Accuracy (approx.) |
|--------------|------|----------------|------------|-----------------------------------|
| **AlexNet** | 2012 | Deep CNN + ReLU + Dropout | 61M | 84.6% |
| **VGG-16** | 2014 | Deeper (16 layers), 3×3 kernels | 138M | 92.7% |
| **ResNet-50** | 2015 | Residual connections (skip) | 25M | ~93% |
| **MobileNetV2** | 2018 | Efficient for mobile/edge | 3.5M | ~91% |
| **EfficientNet** | 2019 | Compound scaling | 5-66M | up to 97.1% (B7) |

**ResNet Residual Block**:
```
x → Conv → BN → ReLU → Conv → BN → (+) → ReLU
↓                                    ↑
└────────────────────────────────────┘ (skip connection)
```

**Key insight**: Skip connections allow training very deep networks (100+ layers) by addressing vanishing gradients.
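
A minimal sketch of the residual idea, with two simplifying assumptions: dense (matrix-multiply) layers stand in for the convolutions, and batch normalization is omitted. When the learned branch F(x) outputs zero, the block simply passes its input through the final ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is a two-layer transform (no batch norm)."""
    f = relu(x @ w1) @ w2
    return relu(f + x)  # the "+ x" term is the skip connection

x = rng.normal(size=(4, 8))         # batch of 4 feature vectors
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = np.zeros((8, 8))               # F(x) == 0, so the block reduces to ReLU(x)
y = residual_block(x, w1, w2)
```

Because the identity path is always present, each block only has to learn a correction to its input, and gradients have a direct route back to earlier layers.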

---

### Transfer Learning

**Problem**: Training a deep network from scratch typically requires a very large labeled dataset, often millions of images.

**Solution**: Use pre-trained weights from ImageNet (1.2M images, 1000 classes).

**Approach**:
1. **Feature extraction**: Freeze early layers, train only final layers
2. **Fine-tuning**: Unfreeze all layers, train with small learning rate

**Benefits**:
- ✅ Faster convergence (often far fewer training iterations than training from scratch)
- ✅ Better performance with limited data
- ✅ Reduced computational cost

---

### Training Deep Networks

#### **Loss Functions**

**Classification** (Cross-Entropy Loss):
$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Where:
- **y**: True label (one-hot encoded)
- **ŷ**: Predicted probability
- **C**: Number of classes

**Regression** (Mean Squared Error):
$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
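
Both losses can be sketched in a few lines of NumPy. The example assumes one-hot labels and already-normalized probabilities, and averages the cross-entropy over the batch; the clipping constant `eps` is an added safeguard against `log(0)`.

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy over a batch; y_true is one-hot, y_prob are probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
ce = cross_entropy(y_true, y_prob)   # (-ln 0.7 - ln 0.8) / 2 ≈ 0.290
err = mse(np.array([1.0, 2.0]), np.array([1.5, 2.5]))  # 0.25
```

Only the probability assigned to the true class contributes to the cross-entropy, because every other term is multiplied by zero.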

#### **Optimization**

**Stochastic Gradient Descent (SGD)**:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

**Adam** (Adaptive Moment Estimation) - Most popular:
- Combines momentum and adaptive learning rates
- Relatively robust to hyperparameter choices

**Learning rate scheduling**:
- **Step decay**: Reduce LR by factor every N epochs
- **Cosine annealing**: Smooth reduction following cosine curve
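
The SGD update and both schedules fit in a few lines. This is a sketch with illustrative defaults (`drop=0.1`, `every=30`) that are assumptions, not values prescribed by this notebook.

```python
import math

def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad (element-wise on lists)."""
    return [t - lr * g for t, g in zip(theta, grad)]

def step_decay(base_lr, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    """Decay from base_lr to min_lr along half a cosine period."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos

theta = sgd_step([1.0, -2.0], grad=[0.5, -1.0], lr=0.1)  # ≈ [0.95, -1.9]
lrs_step = [step_decay(0.1, e) for e in (0, 30, 60)]     # ≈ [0.1, 0.01, 0.001]
lrs_cos = [cosine_annealing(0.1, e, 100) for e in (0, 50, 100)]  # ≈ [0.1, 0.05, 0.0]
```

Step decay changes the rate abruptly at fixed epochs, while cosine annealing lowers it smoothly over the whole run.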

#### **Regularization**

Prevents overfitting:

1. **Dropout**: Randomly drop neurons during training (commonly p = 0.5)
2. **L2 regularization**: Add penalty $\lambda \|\theta\|^2$ to loss
3. **Data augmentation**: Artificially increase dataset size
4. **Batch normalization**: Normalize activations per mini-batch
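
The first two techniques can be sketched as follows. The dropout shown is the "inverted" variant, an assumption about implementation style: survivors are rescaled during training so nothing needs to change at inference time. The function names are hypothetical.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p during training
    and rescale the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def l2_penalty(weights, lam=1e-4):
    """L2 regularization term: lam * ||theta||^2, summed over all weight arrays."""
    return lam * sum(float(np.sum(w ** 2)) for w in weights)

activations = np.ones((2, 4))
dropped = dropout(activations, p=0.5, rng=np.random.default_rng(0))
penalty = l2_penalty([np.array([3.0, 4.0])], lam=0.1)  # 0.1 * (9 + 16) = 2.5
```

With `p = 0.5`, every entry of `dropped` is either 0 (dropped) or 2 (kept and rescaled).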

#### **Data Augmentation for Autonomous Driving**

Critical for robustness:

- **Geometric**: Flip, rotate, crop, scale
- **Photometric**: Brightness, contrast, saturation, hue
- **Weather simulation**: Rain, fog, snow overlays
- **Occlusion**: Random erasing/cutout
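
A minimal sketch of three of these augmentations, assuming an H×W×3 float image in [0, 1] and boxes in pixel corner format. The function name, probability, and ranges are illustrative choices. For detection data a geometric transform must be applied to the boxes as well as the pixels.

```python
import numpy as np

def augment(image, boxes, rng):
    """Random horizontal flip, brightness scaling, and cutout.

    image: HxWx3 float array in [0, 1]
    boxes: (N, 4) float array of (x_min, y_min, x_max, y_max) in pixels
    """
    image, boxes = image.copy(), boxes.copy()
    h, w = image.shape[:2]

    # Geometric: horizontal flip (box x-coordinates are mirrored too)
    if rng.random() < 0.5:
        image = image[:, ::-1]
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]

    # Photometric: random brightness scaling, clipped back to [0, 1]
    image = np.clip(image * rng.uniform(0.7, 1.3), 0.0, 1.0)

    # Occlusion: zero out a random square patch (cutout)
    size = h // 4
    y0 = rng.integers(0, h - size + 1)
    x0 = rng.integers(0, w - size + 1)
    image[y0:y0 + size, x0:x0 + size] = 0.0
    return image, boxes

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
gt_boxes = np.array([[10.0, 20.0, 30.0, 40.0]])
aug_img, aug_boxes = augment(img, gt_boxes, rng)
```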

---

### Evaluation Metrics

#### **Classification Metrics**

**Confusion Matrix**:
```
                Predicted
              Pos    Neg
Actual Pos    TP     FN
       Neg    FP     TN
```

**Precision**: Of predicted positives, how many are correct?
$$\text{Precision} = \frac{TP}{TP + FP}$$

**Recall**: Of actual positives, how many did we find?
$$\text{Recall} = \frac{TP}{TP + FN}$$

**F1 Score**: Harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

**Accuracy**: Overall correctness:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
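
The four formulas can be computed directly from confusion-matrix counts. The counts in the example are hypothetical and chosen to show how accuracy can look good while recall is poor on an imbalanced problem.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical pedestrian classifier: 80 hits, 20 false alarms,
# 40 misses, 860 true negatives
m = classification_metrics(tp=80, fp=20, fn=40, tn=860)
# precision = 0.80, recall ≈ 0.667, F1 ≈ 0.727, accuracy = 0.94
```

Accuracy is 94% even though one third of the pedestrians are missed, which is why precision and recall are reported separately for safety-relevant classes.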

---

### Computational Requirements

| Model | Parameters | FLOPs | Inference (ms, hardware-dependent) | Use Case |
|-------|------------|-------|----------------|----------|
| MobileNetV2 | 3.5M | 300M | 5-10 | Mobile/embedded |
| ResNet-50 | 25M | 4B | 20-30 | Edge compute |
| EfficientNet-B7 | 66M | 37B | 100+ | Cloud/datacenter |

**Real-time constraint**: Running perception at 20 Hz leaves at most 50 ms per frame (1 s / 20 = 50 ms), so autonomous vehicles need **<50 ms** inference.

**Hardware accelerators**:
- **GPU**: NVIDIA Drive AGX (200+ TOPS)
- **TPU**: Google Edge TPU
- **ASIC**: Tesla FSD computer (144 TOPS across two chips)

---

### Advantages & Limitations

**Advantages**:
- ✅ State-of-the-art accuracy
- ✅ End-to-end learning from raw pixels
- ✅ Robust to variations
- ✅ Continual improvement with more data

**Limitations**:
- ❌ Requires large labeled datasets (100k+ images)
- ❌ Computationally expensive
- ❌ "Black box" - hard to interpret decisions
- ❌ Vulnerable to adversarial attacks
- ❌ Distribution shift (train vs. deployment)

**Mitigation**:
- Active learning and data curation
- Model compression (quantization, pruning)
- Uncertainty estimation
- Multi-sensor fusion for redundancy

# CNN Operations Visualization

# Simulate convolution operation
def conv2d(image, kernel):
    """
    Simple 2D convolution (without padding).
    
    Args:
        image: Input 2D array
        kernel: Convolution kernel
    
    Returns:
        Feature map after convolution
    """
    h_out = image.shape[0] - kernel.shape[0] + 1
    w_out = image.shape[1] - kernel.shape[1] + 1
    output = np.zeros((h_out, w_out))
    
    for i in range(h_out):
        for j in range(w_out):
            output[i, j] = np.sum(image[i:i+kernel.shape[0], j:j+kernel.shape[1]] * kernel)
    
    return output

def relu(x):
    """ReLU activation function."""
    return np.maximum(0, x)

def max_pool2d(image, pool_size=2):
    """
    Max pooling with stride = pool_size.
    
    Args:
        image: Input 2D array
        pool_size: Size of pooling window
    
    Returns:
        Downsampled feature map
    """
    h_out = image.shape[0] // pool_size
    w_out = image.shape[1] // pool_size
    output = np.zeros((h_out, w_out))
    
    for i in range(h_out):
        for j in range(w_out):
            output[i, j] = np.max(image[i*pool_size:(i+1)*pool_size, 
                                       j*pool_size:(j+1)*pool_size])
    
    return output

# Create synthetic input image (8x8)
input_image = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0]
])

# Define edge detection kernels
vertical_edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

horizontal_edge_kernel = np.array([
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1]
])

# Apply convolution
vertical_edges = conv2d(input_image, vertical_edge_kernel)
horizontal_edges = conv2d(input_image, horizontal_edge_kernel)

# Apply ReLU
vertical_edges_relu = relu(vertical_edges)
horizontal_edges_relu = relu(horizontal_edges)

# Apply max pooling
vertical_pooled = max_pool2d(vertical_edges_relu)
horizontal_pooled = max_pool2d(horizontal_edges_relu)

# Visualization
fig = plt.figure(figsize=(16, 10))

# Row 1: Input and Kernels
ax1 = fig.add_subplot(3, 4, 1)
im1 = ax1.imshow(input_image, cmap='gray', vmin=0, vmax=1)
ax1.set_title('Input Image (8×8)', fontsize=12, fontweight='bold')
ax1.axis('off')
plt.colorbar(im1, ax=ax1, fraction=0.046)

ax2 = fig.add_subplot(3, 4, 2)
im2 = ax2.imshow(vertical_edge_kernel, cmap='RdBu_r', vmin=-1, vmax=1)
ax2.set_title('Vertical Edge Kernel (3×3)', fontsize=12, fontweight='bold')
ax2.axis('off')
for i in range(3):
    for j in range(3):
        ax2.text(j, i, f'{vertical_edge_kernel[i,j]:.0f}', 
                ha='center', va='center', color='white', fontweight='bold')
plt.colorbar(im2, ax=ax2, fraction=0.046)

ax3 = fig.add_subplot(3, 4, 3)
im3 = ax3.imshow(horizontal_edge_kernel, cmap='RdBu_r', vmin=-1, vmax=1)
ax3.set_title('Horizontal Edge Kernel (3×3)', fontsize=12, fontweight='bold')
ax3.axis('off')
for i in range(3):
    for j in range(3):
        ax3.text(j, i, f'{horizontal_edge_kernel[i,j]:.0f}', 
                ha='center', va='center', color='white', fontweight='bold')
plt.colorbar(im3, ax=ax3, fraction=0.046)

# Row 2: After Convolution
ax4 = fig.add_subplot(3, 4, 5)
im4 = ax4.imshow(vertical_edges, cmap='RdBu_r')
ax4.set_title('After Conv (Vertical)\n(6×6)', fontsize=12, fontweight='bold')
ax4.axis('off')
plt.colorbar(im4, ax=ax4, fraction=0.046)

ax5 = fig.add_subplot(3, 4, 6)
im5 = ax5.imshow(horizontal_edges, cmap='RdBu_r')
ax5.set_title('After Conv (Horizontal)\n(6×6)', fontsize=12, fontweight='bold')
ax5.axis('off')
plt.colorbar(im5, ax=ax5, fraction=0.046)

# Row 3: After ReLU
ax6 = fig.add_subplot(3, 4, 9)
im6 = ax6.imshow(vertical_edges_relu, cmap='viridis')
ax6.set_title('After ReLU (Vertical)\n(6×6)', fontsize=12, fontweight='bold')
ax6.axis('off')
plt.colorbar(im6, ax=ax6, fraction=0.046)

ax7 = fig.add_subplot(3, 4, 10)
im7 = ax7.imshow(horizontal_edges_relu, cmap='viridis')
ax7.set_title('After ReLU (Horizontal)\n(6×6)', fontsize=12, fontweight='bold')
ax7.axis('off')
plt.colorbar(im7, ax=ax7, fraction=0.046)

# Row 3: After Max Pooling
ax8 = fig.add_subplot(3, 4, 11)
im8 = ax8.imshow(vertical_pooled, cmap='viridis')
ax8.set_title('After MaxPool (Vertical)\n(3×3)', fontsize=12, fontweight='bold')
ax8.axis('off')
plt.colorbar(im8, ax=ax8, fraction=0.046)

ax9 = fig.add_subplot(3, 4, 12)
im9 = ax9.imshow(horizontal_pooled, cmap='viridis')
ax9.set_title('After MaxPool (Horizontal)\n(3×3)', fontsize=12, fontweight='bold')
ax9.axis('off')
plt.colorbar(im9, ax=ax9, fraction=0.046)

# Add pipeline diagram
ax10 = fig.add_subplot(3, 4, 4)
ax10.axis('off')
ax10.text(0.5, 0.8, 'CNN Pipeline:', ha='center', fontsize=14, fontweight='bold')
ax10.text(0.5, 0.6, '1. Convolution\n   (Feature extraction)', ha='center', fontsize=10)
ax10.text(0.5, 0.4, '2. ReLU\n   (Non-linearity)', ha='center', fontsize=10)
ax10.text(0.5, 0.2, '3. Pooling\n   (Downsampling)', ha='center', fontsize=10)
ax10.set_xlim([0, 1])
ax10.set_ylim([0, 1])

plt.tight_layout()
plt.show()

print("=" * 70)
print("CNN OPERATION SUMMARY")
print("=" * 70)
print(f"Input shape:           {input_image.shape}")
print(f"Kernel shape:          {vertical_edge_kernel.shape}")
print(f"After Conv:            {vertical_edges.shape}")
print(f"After ReLU:            {vertical_edges_relu.shape}")
print(f"After MaxPool(2×2):    {vertical_pooled.shape}")
print(f"\nDimensionality reduction: {input_image.shape} → {vertical_pooled.shape}")
print(f"Parameters in kernel: {vertical_edge_kernel.size}")
print("=" * 70)

## 2. Object Detection

**Object Detection** combines **classification** (what is it?) with **localization** (where is it?). It is essential in autonomous driving for detecting vehicles, pedestrians, cyclists, and other road users.

### Task Definition

**Input**: Image (H × W × 3)  
**Output**: List of detections, each with:
- **Bounding box**: (x, y, w, h) or (x₁, y₁, x₂, y₂)
- **Class label**: Car, Pedestrian, Cyclist, etc.
- **Confidence score**: Probability [0, 1]

---

### Evolution of Object Detection

#### **Two-Stage Detectors** (Accuracy-focused)

**R-CNN Family** (2014-2017):
```
R-CNN → Fast R-CNN → Faster R-CNN → Mask R-CNN
```

**Faster R-CNN Pipeline**:
1. **Backbone CNN**: Extract feature maps (e.g., ResNet-50)
2. **Region Proposal Network (RPN)**: Propose ~2000 candidate boxes
3. **ROI Pooling**: Extract fixed-size features for each proposal
4. **Detection Head**: Classify + refine box coordinates

**Advantages**:
- ✅ High accuracy (mAP ~40% on COCO)
- ✅ Precise localization

**Disadvantages**:
- ❌ Slow (5-7 FPS)
- ❌ Complex training (multi-stage)

---

#### **One-Stage Detectors** (Speed-focused)

**Key Innovation**: Predict bounding boxes and classes directly from feature maps, skipping the separate region-proposal stage.

##### **YOLO (You Only Look Once)**

**Philosophy**: "Look at the image once" - single forward pass.

**YOLOv1 (2016)**:
1. Divide image into S×S grid (7×7)
2. Each cell predicts B bounding boxes (2) + confidence
3. Each cell predicts C class probabilities (20 for PASCAL VOC)
4. Output: S×S×(B×5 + C) = 7×7×30 tensor
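
The output size in step 4 is simple arithmetic, sketched below. The helper `responsible_cell` is a hypothetical illustration of the grid assignment in YOLOv1, where the cell containing an object's center is the one responsible for predicting it.

```python
S, B, C = 7, 2, 20          # grid size, boxes per cell, classes (PASCAL VOC)
per_box = 5                 # x, y, w, h, confidence
depth = B * per_box + C     # 2 * 5 + 20 = 30
output_shape = (S, S, depth)
max_boxes = S * S * B       # 98 candidate boxes per image

def responsible_cell(x_center, y_center, S=7):
    """Grid cell (row, col) containing a normalized box center in [0, 1]."""
    col = min(int(x_center * S), S - 1)  # clamp so x = 1.0 stays in the grid
    row = min(int(y_center * S), S - 1)
    return row, col

print(output_shape, max_boxes)     # (7, 7, 30) 98
print(responsible_cell(0.5, 0.5))  # (3, 3), the central cell
```

With only 98 candidate boxes per image, YOLOv1 struggles with dense groups of small objects, which is one motivation for the anchor boxes and multi-scale heads of later versions.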

**Loss Function** (Multi-part):
$$L = \lambda_{coord} L_{box} + L_{obj} + \lambda_{noobj} L_{noobj} + L_{class}$$

**Evolution**:
- **YOLOv2** (2017): Batch norm, anchor boxes, multi-scale
- **YOLOv3** (2018): 3 scales, residual blocks, 53 layers
- **YOLOv4** (2020): CSPDarknet, PANet, Mish activation
- **YOLOv5** (2020): PyTorch, optimized for production
- **YOLOv8** (2023): SOTA accuracy + speed

**Performance** (YOLOv5; FPS depends on hardware and is shown for relative comparison only):
- **YOLOv5n** (nano): 1.9M params, 4.5 GFLOPs, **45 FPS**
- **YOLOv5s** (small): 7.2M params, 16.5 GFLOPs, **30 FPS**
- **YOLOv5m** (medium): 21.2M params, 49.0 GFLOPs, **20 FPS**
- **YOLOv5x** (extra-large): 86.7M params, 205.7 GFLOPs, **10 FPS**

---

##### **SSD (Single Shot MultiBox Detector)**

**Key Idea**: Multi-scale feature maps for detecting objects of different sizes.

**Architecture**:
```
Input (300×300×3)
    ↓
Backbone (VGG-16 / ResNet)
    ↓
Feature Pyramid: [38×38, 19×19, 10×10, 5×5, 3×3, 1×1]
    ↓
Predictions at each scale
```

**Multi-scale Detection**:
- **Large feature maps** (38×38): Detect small objects
- **Small feature maps** (3×3): Detect large objects

**Anchor Boxes**:
- Pre-defined boxes of various aspect ratios: [1:1, 2:1, 1:2, 3:1, 1:3]
- 8732 total anchors across all scales
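
The total of 8732 can be reproduced from the feature-map sizes. The per-location box counts below are an assumption taken from the standard SSD300 configuration (they are not stated in this notebook): the first and the last two maps use 4 boxes per location and the others use 6.

```python
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_location = [4, 6, 6, 6, 4, 4]  # assumed SSD300 configuration

per_map = [s * s * k for s, k in zip(feature_map_sizes, boxes_per_location)]
total = sum(per_map)
print(per_map)  # [5776, 2166, 600, 150, 36, 4]
print(total)    # 8732
```

About two thirds of the anchors come from the 38×38 map, the one responsible for small objects.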

**Performance** (PASCAL VOC 2007 test; FPS is hardware-dependent):
- **SSD300**: 300×300 input, **59 FPS**, 77.2% mAP
- **SSD512**: 512×512 input, **22 FPS**, 79.8% mAP

---

### Bounding Box Representation

**Two common formats**:

1. **YOLO format**: (x_center, y_center, width, height) - normalized [0, 1]
2. **Pascal VOC format**: (x_min, y_min, x_max, y_max) - absolute pixels

**Conversion**:
```python
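def yolo_to_voc(x_center, y_center, width, height, image_width, image_height):
    """YOLO (normalized center/size) → Pascal VOC (absolute pixel corners)."""
    x1 = (x_center - width / 2) * image_width
    y1 = (y_center - height / 2) * image_height
    x2 = (x_center + width / 2) * image_width
    y2 = (y_center + height / 2) * image_height
    return x1, y1, x2, y2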
```
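
The reverse direction is often needed when preparing training labels. A sketch, with `voc_to_yolo` as a hypothetical helper name:

```python
def voc_to_yolo(x_min, y_min, x_max, y_max, image_width, image_height):
    """Pascal VOC (absolute corners) -> YOLO (normalized center, width, height)."""
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return x_center, y_center, width, height

# A 100x100-pixel box in a 640x480 image
yolo_box = voc_to_yolo(100, 200, 200, 300, 640, 480)
print(yolo_box)
```

All four returned values lie in [0, 1], so the label stays valid if the image is resized.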

---

### Evaluation Metrics

#### **Intersection over Union (IoU)**

Measures overlap between predicted and ground truth boxes:

$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$

**Thresholds**:
- IoU > 0.5: Typically considered a correct detection
- IoU > 0.7: Stricter threshold
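
A small worked example, using plain tuples in `(x_min, y_min, x_max, y_max)` format; the function name `iou` is illustrative. A 100×100 box shifted by only 5 pixels in each direction already drops to an IoU of about 0.82.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

gt = (100, 200, 200, 300)
pred = (105, 205, 205, 305)      # shifted by 5 px in x and y
print(round(iou(gt, pred), 3))   # 0.822 (= 9025 / 10975)
```

Because the union grows as the intersection shrinks, IoU falls faster than the raw overlap fraction, which makes the 0.7 threshold considerably stricter than it may first appear.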

#### **Precision-Recall Curve**

- **Precision**: What fraction of detections are correct?
- **Recall**: What fraction of ground truth objects were detected?

Plot Precision vs. Recall as confidence threshold varies.

#### **Average Precision (AP)**

Area under the Precision-Recall curve for a single class.

$$\text{AP} = \int_0^1 P(r) \, dr$$

#### **mean Average Precision (mAP)**

Average AP across all classes:

$$\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$$

**Variants**:
- **mAP@0.5**: IoU threshold = 0.5
- **mAP@0.5:0.95**: Average over IoU ∈ [0.5, 0.55, ..., 0.95]
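
The integral for AP is approximated in practice. The sketch below uses all-point interpolation, which is one common convention and an assumption here (the implementation later in this notebook uses 11-point interpolation instead). It also lists the ten IoU thresholds averaged by mAP@0.5:0.95.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the PR curve using all-point interpolation.

    recalls must be sorted ascending; precisions are the matching values.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Interpolation step: make precision monotonically non-increasing
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas at the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Four detections against four ground truths: TP, TP, FP, TP
recalls = np.array([0.25, 0.5, 0.5, 0.75])
precisions = np.array([1.0, 1.0, 2 / 3, 0.75])
ap = average_precision(recalls, precisions)  # 0.25 + 0.25 + 0.1875 = 0.6875

# IoU thresholds averaged by mAP@0.5:0.95
iou_thresholds = np.linspace(0.5, 0.95, 10)  # 0.50, 0.55, ..., 0.95
```

The curve never reaches recall 1.0 because one ground-truth object is never detected, so the last quarter of the recall axis contributes nothing to AP.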

---

### Non-Maximum Suppression (NMS)

**Problem**: Multiple overlapping detections for same object.

**Solution**: Keep only the highest confidence detection, suppress others.

**Algorithm**:
1. Sort detections by confidence score (descending)
2. Select detection with highest confidence
3. Remove all detections with IoU > threshold (e.g., 0.5) with selected box
4. Repeat until no detections remain

**Code**:
```python
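import numpy as np

def box_iou(box, boxes):
    """IoU of one box against an (N, 4) array; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS; boxes is an (N, 4) array, scores an (N,) array."""
    keep = []
    indices = np.argsort(scores)[::-1]  # Sort by score descending

    while len(indices) > 0:
        current = indices[0]
        keep.append(current)

        # Compute IoU of current box with all remaining boxes
        ious = box_iou(boxes[current], boxes[indices[1:]])

        # Keep only boxes with IoU < threshold
        indices = indices[1:][ious < iou_threshold]

    return keep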
```

---

### Automotive-Specific Datasets

| Dataset | Year | Images | Boxes | Classes | Notes |
|---------|------|--------|-------|---------|-------|
| **KITTI** | 2012 | 7,481 | 80k | 8 | Benchmark for autonomous driving |
| **nuScenes** | 2019 | 40k | 1.4M | 23 | 360° coverage, Boston/Singapore |
| **Waymo Open** | 2019 | 200k | 12M | 4 | One of the largest AV datasets |
| **BDD100K** | 2018 | 100k | 1.8M | 10 | Diverse weather/time of day |
| **Argoverse** | 2019 | 30k | - | 15 | HD maps + trajectories |

**Class distribution challenges** (typical proportions; exact figures vary by dataset):
- Vehicles: 70-80% of instances
- Pedestrians: 15-20%
- Cyclists: 2-5%
- Other: <1% (motorcycles, animals, etc.)

**Mitigations**: Class-balanced sampling and focal loss, which down-weights easy, well-classified examples.
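
Focal loss is named above without a definition, so here is a minimal binary sketch. It assumes the usual form, $-\alpha_t (1 - p_t)^\gamma \log(p_t)$, with the commonly used defaults `alpha=0.25` and `gamma=2.0`; these values are assumptions, not settings taken from this notebook.

```python
import numpy as np

def focal_loss(y_true, p, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t), averaged.

    y_true: array of 0/1 labels; p: predicted probability of class 1.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

y = np.array([1, 1])
easy = focal_loss(y, np.array([0.95, 0.95]))  # well-classified: tiny loss
hard = focal_loss(y, np.array([0.10, 0.10]))  # misclassified: much larger loss
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of confident, correct predictions, so the abundant easy examples (typically background and vehicles) no longer dominate the gradient.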

---

### Challenges in Autonomous Driving

1. **Occlusion**: Pedestrians behind cars
2. **Scale variation**: Close vs. distant objects
3. **Class imbalance**: Many cars, few cyclists
4. **Domain shift**: Train on sunny, test in rain/snow
5. **Real-time constraints**: Must run at 20-30 FPS
6. **Safety-critical**: False negatives (missed objects) can be catastrophic

**Solutions**:
- Multi-scale training
- Data augmentation (weather, occlusion)
- Ensemble models
- Temporal consistency (tracking across frames)
- Sensor fusion (camera + LiDAR + radar)

---

### Model Selection for Autonomous Driving

| Use Case | Model | Rationale |
|----------|-------|-----------|
| **Production AV** | YOLOv5m / YOLOv8 | Balance of speed + accuracy |
| **Safety-critical** | Faster R-CNN | Highest accuracy, redundancy |
| **Embedded systems** | YOLOv5n / MobileNet-SSD | Low latency, low power |
| **Research** | Transformer-based (DETR) | SOTA accuracy |

**Typical pipeline**: YOLOv5s (30 FPS) + Multi-object tracking (DeepSORT)

# Object Detection Implementation

from dataclasses import dataclass
from typing import List, Tuple
from collections import defaultdict

@dataclass
class BoundingBox:
    """Bounding box representation."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    confidence: float
    class_id: int
    class_name: str = ""

def compute_iou(box1: BoundingBox, box2: BoundingBox) -> float:
    """
    Compute Intersection over Union (IoU) between two bounding boxes.

    Args:
        box1: First bounding box
        box2: Second bounding box

    Returns:
        IoU value [0, 1]
    """
    # Compute intersection area
    x_left = max(box1.x_min, box2.x_min)
    y_top = max(box1.y_min, box2.y_min)
    x_right = min(box1.x_max, box2.x_max)
    y_bottom = min(box1.y_max, box2.y_max)

    if x_right < x_left or y_bottom < y_top:
        return 0.0

    intersection_area = (x_right - x_left) * (y_bottom - y_top)

    # Compute union area
    box1_area = (box1.x_max - box1.x_min) * (box1.y_max - box1.y_min)
    box2_area = (box2.x_max - box2.x_min) * (box2.y_max - box2.y_min)
    union_area = box1_area + box2_area - intersection_area

    return intersection_area / union_area if union_area > 0 else 0.0

def non_max_suppression(boxes: List[BoundingBox], iou_threshold: float = 0.5) -> List[BoundingBox]:
    """
    Apply Non-Maximum Suppression (NMS) to remove overlapping detections.

    Args:
        boxes: List of bounding boxes
        iou_threshold: IoU threshold for suppression

    Returns:
        Filtered list of bounding boxes
    """
    if len(boxes) == 0:
        return []

    # Sort boxes by confidence (descending)
    boxes_sorted = sorted(boxes, key=lambda x: x.confidence, reverse=True)

    keep = []
    while len(boxes_sorted) > 0:
        # Keep the highest confidence box
        current = boxes_sorted[0]
        keep.append(current)
        boxes_sorted = boxes_sorted[1:]

        # Remove all boxes with high IoU with current box
        filtered = []
        for box in boxes_sorted:
            if compute_iou(current, box) < iou_threshold:
                filtered.append(box)
        boxes_sorted = filtered

    return keep

def compute_average_precision(detections: List[BoundingBox],
                               ground_truths: List[BoundingBox],
                               iou_threshold: float = 0.5) -> Tuple[float, np.ndarray, np.ndarray]:
    """
    Compute Average Precision (AP) for a single class.

    Args:
        detections: Predicted bounding boxes (any order; sorted by confidence internally)
        ground_truths: Ground truth bounding boxes
        iou_threshold: IoU threshold for considering a detection as correct

    Returns:
        Tuple of (AP, precision array, recall array)
    """
    if len(ground_truths) == 0:
        return 0.0, np.array([]), np.array([])

    # Sort detections by confidence
    detections = sorted(detections, key=lambda x: x.confidence, reverse=True)

    # Track which ground truths have been matched
    gt_matched = [False] * len(ground_truths)

    true_positives = np.zeros(len(detections))
    false_positives = np.zeros(len(detections))

    for i, det in enumerate(detections):
        # Find best matching ground truth
        best_iou = 0.0
        best_gt_idx = -1

        for j, gt in enumerate(ground_truths):
            if not gt_matched[j]:
                iou = compute_iou(det, gt)
                if iou > best_iou:
                    best_iou = iou
                    best_gt_idx = j

        # Check if detection is correct. Only unmatched ground truths were
        # considered above, so any match here is a new true positive.
        if best_iou >= iou_threshold and best_gt_idx >= 0:
            true_positives[i] = 1
            gt_matched[best_gt_idx] = True
        else:
            false_positives[i] = 1

    # Compute cumulative sums
    tp_cumsum = np.cumsum(true_positives)
    fp_cumsum = np.cumsum(false_positives)

    # Compute precision and recall
    recalls = tp_cumsum / len(ground_truths)
    precisions = tp_cumsum / (tp_cumsum + fp_cumsum + 1e-10)

    # Compute AP using 11-point interpolation
    ap = 0.0
    for t in np.linspace(0, 1, 11):
        precision_at_recall = precisions[recalls >= t]
        if len(precision_at_recall) > 0:
            ap += np.max(precision_at_recall) / 11

    return ap, precisions, recalls

# Simulate object detection on a driving scene

# Create synthetic driving scene
np.random.seed(42)
scene_width, scene_height = 640, 480

# Define classes
classes = ['Car', 'Pedestrian', 'Cyclist', 'Traffic Light']
class_colors = {
    'Car': 'blue',
    'Pedestrian': 'green',
    'Cyclist': 'orange',
    'Traffic Light': 'red'
}

# Generate ground truth objects
ground_truth_boxes = [
    BoundingBox(100, 200, 200, 300, 1.0, 0, 'Car'),
    BoundingBox(250, 180, 350, 320, 1.0, 0, 'Car'),
    BoundingBox(400, 220, 480, 350, 1.0, 0, 'Car'),
    BoundingBox(150, 150, 180, 220, 1.0, 1, 'Pedestrian'),
    BoundingBox(320, 160, 350, 240, 1.0, 1, 'Pedestrian'),
    BoundingBox(520, 140, 560, 230, 1.0, 2, 'Cyclist'),
    BoundingBox(580, 50, 600, 100, 1.0, 3, 'Traffic Light'),
]

# Generate predicted detections (with some errors and noise)
predicted_boxes = [
    # True positives (good detections)
    BoundingBox(105, 205, 205, 305, 0.95, 0, 'Car'),
    BoundingBox(255, 185, 355, 325, 0.92, 0, 'Car'),
    BoundingBox(405, 225, 485, 355, 0.88, 0, 'Car'),
    BoundingBox(152, 152, 182, 222, 0.85, 1, 'Pedestrian'),
    BoundingBox(522, 142, 562, 232, 0.78, 2, 'Cyclist'),
    BoundingBox(582, 52, 602, 102, 0.91, 3, 'Traffic Light'),

    # False positive (wrong detection)
    BoundingBox(450, 100, 500, 150, 0.65, 0, 'Car'),

    # Duplicate detection (should be suppressed by NMS)
    BoundingBox(110, 210, 210, 310, 0.72, 0, 'Car'),

    # False negative: pedestrian at (320, 160) is missing

    # Low confidence detection (borderline)
    BoundingBox(560, 300, 620, 400, 0.45, 0, 'Car'),
]

# Apply NMS
nms_boxes = non_max_suppression(predicted_boxes, iou_threshold=0.5)

# Filter by confidence threshold
confidence_threshold = 0.5
filtered_boxes = [box for box in nms_boxes if box.confidence >= confidence_threshold]

# Compute metrics per class
metrics = {}
for class_id, class_name in enumerate(classes):
    gt_class = [box for box in ground_truth_boxes if box.class_id == class_id]
    det_class = [box for box in filtered_boxes if box.class_id == class_id]

    if len(gt_class) > 0:
        ap, precisions, recalls = compute_average_precision(det_class, gt_class)
        metrics[class_name] = {
            'AP': ap,
            'Precision': precisions,
            'Recall': recalls,
            'GT_count': len(gt_class),
            'Det_count': len(det_class)
        }

# Compute mAP
mAP = np.mean([m['AP'] for m in metrics.values()])

# Visualization
fig = plt.figure(figsize=(18, 12))

# Plot 1: Raw detections (before NMS)
ax1 = fig.add_subplot(2, 3, 1)
ax1.set_xlim([0, scene_width])
ax1.set_ylim([scene_height, 0])
ax1.set_aspect('equal')
ax1.set_title('Raw Detections (Before NMS)\n{} boxes'.format(len(predicted_boxes)),
              fontsize=12, fontweight='bold')
ax1.set_xlabel('X (pixels)')
ax1.set_ylabel('Y (pixels)')
ax1.grid(True, alpha=0.3)

for box in predicted_boxes:
    rect = Rectangle((box.x_min, box.y_min), box.x_max - box.x_min, box.y_max - box.y_min,
                      linewidth=2, edgecolor=class_colors[box.class_name], facecolor='none', alpha=0.7)
    ax1.add_patch(rect)
    ax1.text(box.x_min, box.y_min - 5, f'{box.class_name}\n{box.confidence:.2f}',
             fontsize=8, color=class_colors[box.class_name], fontweight='bold',
             bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7))

# Plot 2: After NMS
ax2 = fig.add_subplot(2, 3, 2)
ax2.set_xlim([0, scene_width])
ax2.set_ylim([scene_height, 0])
ax2.set_aspect('equal')
ax2.set_title('After NMS\n{} boxes (removed {} duplicates)'.format(len(nms_boxes), len(predicted_boxes) - len(nms_boxes)),
              fontsize=12, fontweight='bold')
ax2.set_xlabel('X (pixels)')
ax2.set_ylabel('Y (pixels)')
ax2.grid(True, alpha=0.3)

for box in nms_boxes:
    rect = Rectangle((box.x_min, box.y_min), box.x_max - box.x_min, box.y_max - box.y_min,
                      linewidth=2, edgecolor=class_colors[box.class_name], facecolor='none', alpha=0.7)
    ax2.add_patch(rect)
    ax2.text(box.x_min, box.y_min - 5, f'{box.class_name}\n{box.confidence:.2f}',
             fontsize=8, color=class_colors[box.class_name], fontweight='bold',
             bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7))

# Plot 3: Final detections (after confidence filtering)
ax3 = fig.add_subplot(2, 3, 3)
ax3.set_xlim([0, scene_width])
ax3.set_ylim([scene_height, 0])
ax3.set_aspect('equal')
ax3.set_title('Final Detections (Conf ≥ {:.2f})\n{} boxes'.format(confidence_threshold, len(filtered_boxes)),
              fontsize=12, fontweight='bold')
ax3.set_xlabel('X (pixels)')
ax3.set_ylabel('Y (pixels)')
ax3.grid(True, alpha=0.3)

for box in filtered_boxes:
    rect = Rectangle((box.x_min, box.y_min), box.x_max - box.x_min, box.y_max - box.y_min,
                      linewidth=3, edgecolor=class_colors[box.class_name], facecolor='none')
    ax3.add_patch(rect)
    ax3.text(box.x_min, box.y_min - 5, f'{box.class_name}\n{box.confidence:.2f}',
             fontsize=9, color=class_colors[box.class_name], fontweight='bold',
             bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))

# Plot 4: Ground truth
ax4 = fig.add_subplot(2, 3, 4)
ax4.set_xlim([0, scene_width])
ax4.set_ylim([scene_height, 0])
ax4.set_aspect('equal')
ax4.set_title('Ground Truth\n{} objects'.format(len(ground_truth_boxes)),
              fontsize=12, fontweight='bold')
ax4.set_xlabel('X (pixels)')
ax4.set_ylabel('Y (pixels)')
ax4.grid(True, alpha=0.3)

for box in ground_truth_boxes:
    rect = Rectangle((box.x_min, box.y_min), box.x_max - box.x_min, box.y_max - box.y_min,
                      linewidth=3, edgecolor=class_colors[box.class_name], facecolor='none', linestyle='--')
    ax4.add_patch(rect)
    ax4.text(box.x_min, box.y_min - 5, box.class_name,
             fontsize=9, color=class_colors[box.class_name], fontweight='bold',
             bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))

# Plot 5: Precision-Recall curves
ax5 = fig.add_subplot(2, 3, 5)
for class_name, data in metrics.items():
    if len(data['Recall']) > 0:
        ax5.plot(data['Recall'], data['Precision'], marker='o', label=f"{class_name} (AP={data['AP']:.3f})", linewidth=2)
ax5.set_xlabel('Recall', fontsize=11)
ax5.set_ylabel('Precision', fontsize=11)
ax5.set_title('Precision-Recall Curves', fontsize=12, fontweight='bold')
ax5.grid(True, alpha=0.3)
ax5.legend()
ax5.set_xlim([0, 1])
ax5.set_ylim([0, 1.05])

# Plot 6: Detection statistics
ax6 = fig.add_subplot(2, 3, 6)
ax6.axis('off')
ax6.text(0.5, 0.95, 'Detection Performance Summary', ha='center', fontsize=14, fontweight='bold')
ax6.text(0.5, 0.85, f'mAP@0.5: {mAP:.3f}', ha='center', fontsize=12,
         bbox=dict(boxstyle='round,pad=0.5', facecolor='lightblue', alpha=0.8))

y_pos = 0.72
for class_name, data in metrics.items():
    text = f"{class_name}: AP={data['AP']:.3f}, GT={data['GT_count']}, Det={data['Det_count']}"
    ax6.text(0.1, y_pos, text, fontsize=10, family='monospace')
    y_pos -= 0.08

# Add statistics
stats_text = f"\nNMS IoU threshold: {0.5:.2f}\n"
stats_text += f"Confidence threshold: {confidence_threshold:.2f}\n"
stats_text += f"True Positives: {len([b for b in filtered_boxes if any(compute_iou(b, gt) > 0.5 for gt in ground_truth_boxes)])}\n"
stats_text += f"False Positives: {len([b for b in filtered_boxes if all(compute_iou(b, gt) < 0.5 for gt in ground_truth_boxes)])}\n"
stats_text += f"False Negatives: {len(ground_truth_boxes) - len([gt for gt in ground_truth_boxes if any(compute_iou(det, gt) > 0.5 for det in filtered_boxes)])}"

ax6.text(0.1, y_pos - 0.1, stats_text, fontsize=9, family='monospace',
         bbox=dict(boxstyle='round,pad=0.5', facecolor='lightyellow', alpha=0.8))

plt.tight_layout()
plt.show()

print("=" * 80)
print("OBJECT DETECTION METRICS")
print("=" * 80)
print(f"{'Class':<15} {'AP@0.5':<10} {'Ground Truth':<15} {'Detections':<15}")
print("-" * 80)
for class_name, data in metrics.items():
    print(f"{class_name:<15} {data['AP']:<10.3f} {data['GT_count']:<15} {data['Det_count']:<15}")
print("-" * 80)
print(f"{'mAP@0.5':<15} {mAP:<10.3f}")
print("=" * 80)

