# Evaluation in Computer Vision

Evaluating computer vision models requires specific metrics and methods that account for the particular characteristics of visual tasks. This notebook covers the most important evaluation methods for several computer vision tasks.

## Classification Metrics

### Accuracy

The simplest metric, the fraction of correct predictions:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

### Precision en Recall

**Precision**: the fraction of positive predictions that are correct

$$Precision = \frac{TP}{TP + FP}$$

**Recall**: the fraction of actual positive instances that are correctly identified

$$Recall = \frac{TP}{TP + FN}$$

### F1-Score

The harmonic mean of precision and recall:

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
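
As a quick worked example, the sketch below computes precision, recall, and F1 from raw counts. The helper name `precision_recall_f1` and the example counts are illustrative assumptions, not part of the notebook; ratios with a zero denominator are reported as 0.0 here, which is one common convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    Hypothetical helper for illustration; a ratio with a zero
    denominator is returned as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Note that true negatives do not appear anywhere, which is why these metrics are often preferred over accuracy on imbalanced data.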

### Confusion Matrix

```python
# Confusion matrix for multi-class classification
import torch
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

def plot_confusion_matrix(y_true, y_pred, classes):
    """Plot confusion matrix"""
    cm = confusion_matrix(y_true, y_pred)
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=classes, yticklabels=classes)
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    plt.show()
    
    return cm
```

## Object Detection Metrics

### Intersection over Union (IoU)

Measures the overlap between a predicted bounding box $A$ and a ground-truth bounding box $B$:

$$IoU = \frac{|A \cap B|}{|A \cup B|}$$

```python
def calculate_iou(box1, box2):
    """Compute the IoU between two bounding boxes"""
    # box format: [x1, y1, x2, y2]
    x1_inter = max(box1[0], box2[0])
    y1_inter = max(box1[1], box2[1])
    x2_inter = min(box1[2], box2[2])
    y2_inter = min(box1[3], box2[3])

    # No overlap
    if x2_inter <= x1_inter or y2_inter <= y1_inter:
        return 0.0

    # Intersection area
    inter_area = (x2_inter - x1_inter) * (y2_inter - y1_inter)
    
    # Union area
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area
    
    return inter_area / union_area
```

### Mean Average Precision (mAP)

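The standard metric for object detection. mAP is the mean of the per-class Average Precision (AP); the function below computes AP for a single set of boxes at one IoU threshold: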

```python
def calculate_map(predictions, ground_truth, iou_threshold=0.5):
    """Compute Average Precision (AP) at a single IoU threshold"""

    # Sort predictions by confidence score, highest first
    predictions = sorted(predictions, key=lambda x: x['confidence'], reverse=True)

    # Fix the number of ground-truth boxes up front: recall is TP / n_gt
    n_gt = len(ground_truth)
    matched = set()  # indices of ground-truth boxes that are already matched

    # Build the precision-recall curve
    tp = fp = 0
    precisions = []
    recalls = []

    for pred in predictions:
        # Find the best-matching unmatched ground-truth box
        best_iou = 0
        best_gt_idx = -1

        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            iou = calculate_iou(pred['bbox'], gt['bbox'])
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = i

        if best_iou >= iou_threshold and best_gt_idx >= 0:
            # True positive: mark the box as matched instead of deleting it,
            # so the caller's list is left untouched
            tp += 1
            matched.add(best_gt_idx)
        else:
            # False positive
            fp += 1

        # Precision and recall after this prediction
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt if n_gt > 0 else 0.0)

    # AP = area under the precision-recall curve, starting from recall 0
    ap = 0.0
    prev_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall

    return ap
```
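
For reference, the area-under-curve sum that such a function accumulates can be written as follows, where $P_k$ and $R_k$ are the precision and recall after the $k$-th highest-scoring prediction, $R_0 = 0$, $n$ is the number of predictions, and $C$ is the number of classes. This is the plain, uninterpolated form; benchmarks such as Pascal VOC and COCO use interpolated variants of the precision-recall curve.

$$AP = \sum_{k=1}^{n} (R_k - R_{k-1}) \cdot P_k \qquad mAP = \frac{1}{C} \sum_{c=1}^{C} AP_c$$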

## Segmentation Metrics

### Pixel Accuracy

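The fraction of correctly classified pixels, where $n_{ii}$ is the number of pixels of class $i$ predicted as class $i$ and $t_i$ is the total number of pixels of class $i$: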

$$Pixel\ Accuracy = \frac{\sum_i n_{ii}}{\sum_i t_i}$$
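
A minimal sketch of this formula, assuming the predictions have already been summarized in a pixel-level confusion matrix. The function name `pixel_accuracy` and the example counts are hypothetical and only serve to illustrate the sums.

```python
def pixel_accuracy(conf):
    """Pixel accuracy from a square confusion matrix (hypothetical helper).

    conf[i][j] is the number of pixels of true class i predicted as
    class j, so the diagonal holds n_ii and each row sums to t_i.
    """
    correct = sum(conf[i][i] for i in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total if total > 0 else 0.0


# Illustrative 3-class confusion matrix covering 100 pixels
conf = [
    [50, 2, 3],
    [4, 30, 1],
    [1, 2, 7],
]
print(f"Pixel accuracy: {pixel_accuracy(conf):.2f}")
```

Because large classes dominate both sums, pixel accuracy can look high even when small classes are segmented poorly.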

### Mean Intersection over Union (mIoU)

The IoU averaged over all classes:

$$mIoU = \frac{1}{C} \sum_{c=1}^C \frac{TP_c}{TP_c + FP_c + FN_c}$$

```python
def calculate_miou(pred_mask, gt_mask, num_classes):
    """Compute mean IoU for segmentation (expects torch tensors of class ids)"""
    ious = []

    for class_id in range(num_classes):
        # Binary masks for the current class
        pred_binary = (pred_mask == class_id)
        gt_binary = (gt_mask == class_id)

        # Intersection and union
        intersection = torch.logical_and(pred_binary, gt_binary).sum().float()
        union = torch.logical_or(pred_binary, gt_binary).sum().float()

        # Skip classes absent from both masks: counting them as IoU 0
        # would drag the mean down
        if union == 0:
            continue
        ious.append(intersection / union)

    # No class present at all (e.g. empty masks)
    if not ious:
        return torch.tensor(0.0)
    return torch.mean(torch.stack(ious))
```

### Dice Coefficient

An alternative overlap metric for segmentation, related to IoU by $Dice = 2 \cdot IoU / (1 + IoU)$:

$$Dice = \frac{2 \cdot |A \cap B|}{|A| + |B|}$$
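
The sketch below follows the set notation of the formula directly by representing each mask as a set of `(row, col)` pixel coordinates. Both function names are hypothetical, and treating two empty masks as a perfect match is an assumed convention rather than part of the definition.

```python
def dice_coefficient(pred_pixels, gt_pixels):
    """Dice coefficient between two masks given as collections of pixel coordinates."""
    a, b = set(pred_pixels), set(gt_pixels)
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def dice_from_iou(iou):
    """Convert an IoU value to the equivalent Dice value."""
    return 2 * iou / (1 + iou)


# Illustrative masks: 4 predicted pixels, 3 ground-truth pixels, 2 shared
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt = {(0, 1), (1, 1), (2, 1)}

dice = dice_coefficient(pred, gt)
iou = len(pred & gt) / len(pred | gt)
print(f"Dice: {dice:.4f}, IoU: {iou:.4f}, Dice from IoU: {dice_from_iou(iou):.4f}")
```

For the same pair of masks Dice is never lower than IoU, so the two numbers should not be compared across papers without checking which one is reported.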

## Keypoint Detection Metrics

### Percentage of Correct Keypoints (PCK)

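The fraction of keypoints that lie within a threshold distance of the ground truth, where $d_i$ is the distance between the $i$-th predicted keypoint and its ground truth and $d_{ref}$ is a reference length such as the head or torso size: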

$$PCK@\alpha = \frac{1}{N} \sum_{i=1}^N \mathbb{1}(d_i < \alpha \cdot d_{ref})$$
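
A minimal sketch of PCK for 2D keypoints, assuming a single reference length `d_ref` shared by all keypoints (in practice it is usually computed per person). The function name `pck` and the coordinates are illustrative assumptions.

```python
import math


def pck(pred, gt, d_ref, alpha=0.5):
    """Fraction of keypoints whose error is below alpha * d_ref.

    pred, gt: equally long lists of (x, y) coordinates.
    d_ref: reference length, assumed identical for all keypoints here.
    """
    if not gt:
        return 0.0
    threshold = alpha * d_ref
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) < threshold)
    return correct / len(gt)


# Illustrative keypoints with errors of 1, 6, 0 and 9 pixels
gt = [(10, 10), (20, 20), (30, 30), (40, 40)]
pred = [(11, 10), (20, 26), (30, 30), (49, 40)]

print(f"PCK@0.5: {pck(pred, gt, d_ref=10, alpha=0.5):.2f}")
print(f"PCK@1.0: {pck(pred, gt, d_ref=10, alpha=1.0):.2f}")
```

Raising $\alpha$ loosens the threshold, so PCK values are only comparable when both $\alpha$ and the choice of $d_{ref}$ match.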

### Mean Per Joint Position Error (MPJPE)

The mean Euclidean distance between predicted and ground-truth keypoints, over $N$ samples with $K$ joints each:

$$MPJPE = \frac{1}{N \cdot K} \sum_{i=1}^N \sum_{j=1}^K \|p_{ij} - \hat{p}_{ij}\|_2$$
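
The double sum can be sketched in a few lines. The function name `mpjpe` and the nested-list layout (a list of samples, each a list of 3D joints) are assumptions made for this example; real pipelines typically hold the joints in an array or tensor.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: lists of N samples, each a list of K joints given as
    coordinate tuples (here 3D). Returns the mean Euclidean distance.
    """
    total = 0.0
    count = 0
    for pred_joints, gt_joints in zip(pred, gt):
        for p, g in zip(pred_joints, gt_joints):
            total += math.dist(p, g)
            count += 1
    return total / count if count else 0.0


# Illustrative data: N=2 samples, K=2 joints, errors of 5, 0, 0 and 2 units
gt = [[(0, 0, 0), (1, 0, 0)], [(0, 1, 0), (0, 0, 1)]]
pred = [[(3, 4, 0), (1, 0, 0)], [(0, 1, 0), (0, 0, 3)]]

print(f"MPJPE: {mpjpe(pred, gt):.2f}")
```

The result carries the unit of the coordinates, commonly millimetres for 3D pose benchmarks.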

## OCR Metrics

### Character Error Rate (CER)

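The character-level edit distance between prediction and ground truth, normalized by the ground-truth length:

$$CER = \frac{I + D + S}{N}$$

Where:
- I = number of insertions
- D = number of deletions
- S = number of substitutions
- N = number of characters in the ground truth

Because insertions count as errors, CER can exceed 1.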

### Word Error Rate (WER)

The same as CER, but computed at the word level, so $I$, $D$, and $S$ count word edits and $N$ is the number of words in the ground truth:

$$WER = \frac{I + D + S}{N}$$
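
Both rates reduce to a Levenshtein (edit) distance, taken over characters for CER and over words for WER. The sketch below is a minimal pure-Python version; the function names are hypothetical, and splitting words on whitespace is an assumption that ignores punctuation and casing rules a real OCR benchmark may prescribe.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, i.e. the minimum I + D + S."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference unit
                curr[j - 1] + 1,     # insertion of a hypothesis unit
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length N."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    # Assumption: words are separated by whitespace
    return error_rate(reference.split(), hypothesis.split())


print(f"CER: {cer('kitten', 'sitting'):.3f}")
print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.3f}")
```

A single wrong character costs one full word in WER, so WER is usually noticeably higher than CER on the same output.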

## Advanced Evaluation Methods

### Cross-Validation

```python
# K-fold cross-validation for vision tasks
from sklearn.model_selection import KFold
import numpy as np

def kfold_cross_validation(model_class, X, y, k=5, **model_kwargs):
    """K-fold cross-validation"""
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    
    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        print(f"Training fold {fold + 1}/{k}")
        
        # Split data
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        
        # Train model
        model = model_class(**model_kwargs)
        # ... training code ...
        
        # Evaluate (evaluate_model is a placeholder for your own scoring function)
        score = evaluate_model(model, X_val, y_val)
        scores.append(score)
        print(f"Fold {fold + 1} score: {score:.4f}")
    
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    print(f"\nMean CV Score: {mean_score:.4f} (+/- {std_score:.4f})")
    
    return scores
```

### Bootstrapping

Bootstrapping estimates confidence intervals by resampling the evaluation set with replacement:

```python
def bootstrap_evaluation(predictions, ground_truth, n_bootstrap=1000):
    """Bootstrap confidence intervals"""
    n_samples = len(predictions)
    bootstrap_scores = []

    for _ in range(n_bootstrap):
        # Sample with replacement
        indices = np.random.choice(n_samples, size=n_samples, replace=True)
        boot_pred = [predictions[i] for i in indices]
        boot_gt = [ground_truth[i] for i in indices]

        # Score the bootstrap sample (calculate_metric is a placeholder
        # for whichever metric function you are evaluating)
        score = calculate_metric(boot_pred, boot_gt)
        bootstrap_scores.append(score)

    # 95% confidence interval from the 2.5th and 97.5th percentiles
    lower_bound = np.percentile(bootstrap_scores, 2.5)
    upper_bound = np.percentile(bootstrap_scores, 97.5)
    mean_score = np.mean(bootstrap_scores)
    
    return mean_score, (lower_bound, upper_bound)
```

## Visualizing Results

### Bounding Box Visualization

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def visualize_detections(image_path, detections, save_path=None):
    """Visualize object detections"""
    # Load the image
    image = Image.open(image_path)
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image)
    
    # Plot detections
    for detection in detections:
        x1, y1, x2, y2 = detection['bbox']
        confidence = detection['confidence']
        class_name = detection['class']
        
        # Bounding box
        rect = patches.Rectangle(
            (x1, y1), x2-x1, y2-y1,
            linewidth=2, edgecolor='red', facecolor='none'
        )
        ax.add_patch(rect)
        
        # Label
        ax.text(x1, y1-5, f'{class_name}: {confidence:.2f}',
               bbox=dict(boxstyle='round,pad=0.3', facecolor='red', alpha=0.8),
               color='white', fontsize=10)
    
    plt.axis('off')
    if save_path:
        plt.savefig(save_path, bbox_inches='tight', dpi=150)
    plt.show()
```

### Segmentation Mask Visualization

```python
def visualize_segmentation(image, mask, alpha=0.6):
    """Visualize segmentation results (image: H x W x 3, mask: H x W class ids)"""
    # Work in float RGB in [0, 1] so image and overlay share one scale
    image = np.asarray(image, dtype=np.float32)
    if image.max() > 1.0:
        image = image / 255.0

    # tab20 provides 20 distinct colors; class ids wrap around beyond that
    cmap = plt.get_cmap('tab20')

    # Build the overlay
    overlay = np.zeros_like(image)
    for class_id in np.unique(mask):
        if class_id == 0:  # Skip background
            continue
        overlay[mask == class_id] = cmap(int(class_id) % 20)[:3]

    # Blend the original image with the overlay
    blended = image * (1 - alpha) + overlay * alpha
    
    plt.figure(figsize=(15, 5))
    
    plt.subplot(1, 3, 1)
    plt.imshow(image)
    plt.title('Original Image')
    plt.axis('off')
    
    plt.subplot(1, 3, 2)
    plt.imshow(mask, cmap='tab20')
    plt.title('Segmentation Mask')
    plt.axis('off')
    
    plt.subplot(1, 3, 3)
    plt.imshow(blended)
    plt.title('Overlay')
    plt.axis('off')
    
    plt.tight_layout()
    plt.show()
```

## Model Comparison

### Statistical Significance Testing

```python
from scipy import stats

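def compare_models(model1_scores, model2_scores, alpha=0.05):
    """Compare two models statistically"""

    # Independent two-sample Student's t-test. If both models were scored on
    # the same folds or samples, the paired stats.ttest_rel is the better fit.
    t_stat, p_value = stats.ttest_ind(model1_scores, model2_scores)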
    
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    
    if p_value < alpha:
        print(f"Significant difference (α={alpha})")
        if np.mean(model1_scores) > np.mean(model2_scores):
            print("Model 1 performs significantly better")
        else:
            print("Model 2 performs significantly better")
    else:
        print(f"No significant difference (α={alpha})")
    
    return t_stat, p_value
```

### McNemar's Test for Classification

```python
def mcnemar_test(model1_preds, model2_preds, true_labels):
    """McNemar's test for comparing two classifiers on the same samples"""

    # Contingency table
    n11 = n00 = n10 = n01 = 0

    for m1, m2, true in zip(model1_preds, model2_preds, true_labels):
        if m1 == true and m2 == true:
            n11 += 1  # Both correct
        elif m1 != true and m2 != true:
            n00 += 1  # Both wrong
        elif m1 == true and m2 != true:
            n10 += 1  # Only model 1 correct
        else:
            n01 += 1  # Only model 2 correct
    
    # McNemar's test statistic
    if (n10 + n01) > 0:
        chi2 = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
        p_value = 1 - stats.chi2.cdf(chi2, 1)
    else:
        chi2 = 0
        p_value = 1
    
    return chi2, p_value
```
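
Only the discordant counts `n10` and `n01` enter the statistic, which the following worked example makes concrete. It is a sketch with made-up counts; the helper `mcnemar_from_counts` is hypothetical and uses the identity that, for 1 degree of freedom, the chi-square survival function equals `erfc(sqrt(x / 2))`, so it runs without SciPy.

```python
import math


def mcnemar_from_counts(n10, n01):
    """Continuity-corrected McNemar statistic and p-value from discordant counts."""
    if n10 + n01 == 0:
        return 0.0, 1.0
    chi2 = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
    # Chi-square survival function for 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative counts: model 1 alone is right on 15 samples, model 2 alone on 5
chi2, p_value = mcnemar_from_counts(n10=15, n01=5)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The samples on which both models agree do not affect the result, however many there are. With very few discordant samples the chi-square approximation becomes unreliable and an exact binomial test is the usual alternative.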

## Error Analysis

### Confusion Matrix Analysis

```python
def analyze_errors(confusion_matrix, class_names):
    """Analyze the errors in a confusion matrix (rows = true, columns = predicted)"""
    n_classes = len(class_names)

    # Per-class metrics
    for i in range(n_classes):
        tp = confusion_matrix[i, i]
        fn = np.sum(confusion_matrix[i, :]) - tp
        fp = np.sum(confusion_matrix[:, i]) - tp

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        print(f"\n{class_names[i]}:")
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall: {recall:.3f}")
        print(f"  F1-score: {f1:.3f}")

        # Most frequent errors: zero out the diagonal so that correct
        # predictions never show up in the ranking
        errors = confusion_matrix[i, :].copy()
        errors[i] = 0
        error_indices = np.argsort(errors)[::-1]
        print("  Most confused with:")
        for j in error_indices[:3]:  # Top 3 errors
            if errors[j] > 0:
                print(f"    {class_names[j]}: {errors[j]} instances")
```

### Visual Error Analysis

```python
def visualize_errors(images, true_labels, pred_labels, class_names, n_errors=5):
    """Visualize incorrect predictions"""

    # Collect the misclassified samples
    error_indices = []
    for i, (true, pred) in enumerate(zip(true_labels, pred_labels)):
        if true != pred:
            error_indices.append((i, true, pred))

    # Show the first n errors
    n_errors = min(n_errors, len(error_indices))
    if n_errors == 0:
        print("No misclassified samples to show")
        return
    # squeeze=False keeps axes 2-D even when n_errors == 1
    fig, axes = plt.subplots(2, n_errors, figsize=(15, 6), squeeze=False)

    for i in range(n_errors):
        idx, true_label, pred_label = error_indices[i]

        # Plot the image
        axes[0, i].imshow(images[idx])
        axes[0, i].set_title(f'True: {class_names[true_label]}')
        axes[0, i].axis('off')
        
        axes[1, i].imshow(images[idx])
        axes[1, i].set_title(f'Pred: {class_names[pred_label]}')
        axes[1, i].axis('off')
    
    plt.tight_layout()
    plt.show()
```

## Performance Benchmarks

### Standard Datasets

- **ImageNet**: 1000 classes, 1.2M training images
- **COCO**: 80 object classes for detection/segmentation
- **Cityscapes**: urban scene understanding
- **Pascal VOC**: multi-label classification and detection

### Leaderboards

Compare models on standard benchmarks:

```python
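def compare_with_benchmarks(model_score, benchmark_scores=None):
    """Compare a model score with benchmark results.

    The built-in numbers are approximate reference values and may be
    outdated; pass your own dict via benchmark_scores to override them.
    model_score is compared against every entry, so only the benchmark
    that uses the same metric as model_score is meaningful.
    """

    benchmarks = benchmark_scores or {
        'ImageNet': {
            'AlexNet': 63.3,
            'VGG16': 71.3,
            'ResNet50': 76.0,
            'Vision Transformer': 85.3,
            'State-of-the-Art': 90.9
        },
        'COCO mAP': {
            'Faster R-CNN': 36.4,
            'YOLOv3': 33.0,
            'EfficientDet': 47.5,
            'State-of-the-Art': 63.2
        }
    }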
    
    print("Benchmark Comparison:")
    print(f"Your model: {model_score:.2f}")
    
    for benchmark, scores in benchmarks.items():
        print(f"\n{benchmark}:")
        for model, score in scores.items():
            diff = model_score - score
            status = "✅ Better" if diff > 0 else "❌ Worse" if diff < 0 else "➖ Equal"
            print(f"  {model}: {score:.2f} {status}")
    
    return benchmarks
```

## Model Interpretation

### Grad-CAM Visualization

```python
def generate_gradcam(model, image, target_class):
    """Generate a Grad-CAM heatmap.

    Assumes a ResNet-style model whose last conv block is `model.layer4`
    and whose parameters (or the input) require gradients.
    """
    model.eval()

    # Capture the activations of the last conv block and keep their gradient
    features = []
    def hook_fn(module, input, output):
        output.retain_grad()
        features.append(output)

    # Register the hook on the last conv block
    hook = model.layer4.register_forward_hook(hook_fn)

    # Forward pass with gradients enabled
    output = model(image)
    hook.remove()

    # Backpropagate the score of the target class
    model.zero_grad()
    target = output[0, target_class]
    target.backward()

    # Grad-CAM uses the gradient w.r.t. the activations, not the weights
    feature_map = features[0][0].detach()  # (C, H, W)
    gradients = features[0].grad[0]        # (C, H, W)

    # Global average pooling of the gradients: one weight per channel
    weights = torch.mean(gradients, dim=(1, 2))

    # Weighted combination of the feature maps
    cam = (weights[:, None, None] * feature_map).sum(dim=0)

    # ReLU and normalization to [0, 1]; the epsilon avoids division by zero
    cam = torch.clamp(cam, min=0)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    return cam.cpu().numpy()
```

### Saliency Maps

Visualize which pixels contribute most to the prediction:

```python
def generate_saliency_map(model, image, target_class):
    """Generate a saliency map (image: float tensor of shape (1, C, H, W))"""
    model.eval()
    image.requires_grad_(True)

    # Forward pass
    output = model(image)
    target = output[0, target_class]

    # Backward pass
    model.zero_grad()
    target.backward()

    # Absolute value of the input gradient, shape (C, H, W)
    saliency = torch.abs(image.grad[0])

    # Normalize to [0, 1]; the epsilon avoids division by zero
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)

    return saliency.detach().cpu().numpy()
```

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# Extended evaluation utilities for computer vision

class VisionEvaluator:
    """Comprehensive evaluation toolkit for computer vision"""
    
    def __init__(self):
        self.metrics = {}
        self.predictions = []
        self.ground_truth = []
    
    def calculate_classification_metrics(self, y_true, y_pred, class_names=None):
        """Compute classification metrics"""
        # Basic metrics
        precision, recall, f1, support = precision_recall_fscore_support(
            y_true, y_pred, average=None, zero_division=0
        )
        
        # Confusion matrix
        cm = confusion_matrix(y_true, y_pred)
        
        # Per-class results
        results = {}
        for i, class_name in enumerate(class_names or range(len(precision))):
            results[class_name] = {
                'precision': precision[i],
                'recall': recall[i],
                'f1': f1[i],
                'support': support[i]
            }
        
        # Macro averages
        macro_precision = np.mean(precision)
        macro_recall = np.mean(recall)
        macro_f1 = np.mean(f1)
        
        # Accuracy
        accuracy = np.sum(np.diag(cm)) / np.sum(cm)
        
        return {
            'accuracy': accuracy,
            'macro_precision': macro_precision,
            'macro_recall': macro_recall,
            'macro_f1': macro_f1,
            'per_class': results,
            'confusion_matrix': cm
        }
    
    def calculate_iou(self, box1, box2):
        """Compute the IoU between two bounding boxes"""
        # box format: [x1, y1, x2, y2]
        x1_inter = max(box1[0], box2[0])
        y1_inter = max(box1[1], box2[1])
        x2_inter = min(box1[2], box2[2])
        y2_inter = min(box1[3], box2[3])
        
        if x2_inter <= x1_inter or y2_inter <= y1_inter:
            return 0.0
        
        inter_area = (x2_inter - x1_inter) * (y2_inter - y1_inter)
        box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
        box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
        union_area = box1_area + box2_area - inter_area
        
        return inter_area / union_area
    
    def calculate_map(self, predictions, ground_truth, iou_thresholds=(0.5,)):
        """Compute AP averaged over IoU thresholds (class labels are ignored)"""
        aps = []
        n_gt = len(ground_truth)

        # Sort predictions by confidence, highest first
        sorted_preds = sorted(predictions, key=lambda x: x['confidence'], reverse=True)

        for iou_thresh in iou_thresholds:
            # Track matches per threshold instead of deleting from
            # ground_truth, which would corrupt recall and later thresholds
            matched = set()
            tp = fp = 0
            ap = 0.0
            prev_recall = 0.0

            for pred in sorted_preds:
                # Find the unmatched ground-truth box with the highest IoU
                best_iou = 0
                best_gt_idx = -1

                for i, gt in enumerate(ground_truth):
                    if i in matched:
                        continue
                    iou = self.calculate_iou(pred['bbox'], gt['bbox'])
                    if iou > best_iou:
                        best_iou = iou
                        best_gt_idx = i

                if best_iou >= iou_thresh and best_gt_idx >= 0:
                    tp += 1
                    matched.add(best_gt_idx)
                else:
                    fp += 1

                precision = tp / (tp + fp)
                recall = tp / n_gt if n_gt > 0 else 0.0

                # Accumulate the area under the precision-recall curve
                ap += (recall - prev_recall) * precision
                prev_recall = recall

            aps.append(ap)

        return float(np.mean(aps))
    
    def plot_precision_recall_curve(self, precisions, recalls):
        """Plot precision-recall curve"""
        plt.figure(figsize=(8, 6))
        plt.plot(recalls, precisions, 'b-', linewidth=2)
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.title('Precision-Recall Curve')
        plt.grid(True, alpha=0.3)
        plt.xlim([0, 1])
        plt.ylim([0, 1])
        plt.show()
    
    def plot_roc_curve(self, fpr, tpr, auc_score):
        """Plot ROC curve"""
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC curve (AUC = {auc_score:.3f})')
        plt.plot([0, 1], [0, 1], 'r--', label='Random classifier')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('ROC Curve')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()

# Example usage
evaluator = VisionEvaluator()

# Simulate classification results
y_true = np.random.randint(0, 10, 100)
y_pred = y_true.copy()
# Inject some errors
error_indices = np.random.choice(100, size=15, replace=False)
y_pred[error_indices] = np.random.randint(0, 10, 15)

# Compute metrics
class_names = [f'Class_{i}' for i in range(10)]
metrics = evaluator.calculate_classification_metrics(y_true, y_pred, class_names)

print("Classification Results:")
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"Macro F1: {metrics['macro_f1']:.3f}")
print(f"Confusion Matrix Shape: {metrics['confusion_matrix'].shape}")

# Simulate object detection results
predictions = [
    {'bbox': [0.1, 0.1, 0.5, 0.5], 'confidence': 0.9, 'class': 'person'},
    {'bbox': [0.2, 0.2, 0.6, 0.6], 'confidence': 0.8, 'class': 'car'}
]

ground_truth = [
    {'bbox': [0.15, 0.15, 0.55, 0.55], 'class': 'person'},
    {'bbox': [0.25, 0.25, 0.65, 0.65], 'class': 'car'}
]

map_score = evaluator.calculate_map(predictions, ground_truth)
print(f"\nmAP Score: {map_score:.3f}")

print("\nEvaluation utilities available for:")
print("- Classification metrics (accuracy, precision, recall, F1)")
print("- Object detection metrics (mAP, IoU)")
print("- Segmentation metrics (mIoU, Dice)")
print("- Visualization of results")
print("- Statistical significance testing")
print("- Error analysis and interpretation")