# Tutorial 12: Offline Evaluation Metrics

## Module 5: Evaluation

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Master classification metrics** - precision, recall, F1-score, ROC-AUC, PR-AUC, confusion matrix
2. **Apply regression metrics** - MSE, RMSE, MAE, R-squared, MAPE
3. **Understand ranking metrics** - Precision@K, Recall@K, MRR, mAP, nDCG
4. **Evaluate NLP models** - BLEU, ROUGE scores
5. **Choose appropriate metrics** based on business objectives and model types

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import warnings

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve, average_precision_score,
    mean_squared_error, mean_absolute_error, r2_score
)
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

warnings.filterwarnings('ignore')
np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries imported successfully!")

---

## 1. Introduction to Evaluation Metrics

### Why Offline Evaluation Matters

Offline evaluation metrics are crucial for:
- **Model selection**: Comparing different models before deployment
- **Hyperparameter tuning**: Finding optimal model configurations
- **Quality assurance**: Ensuring models meet minimum performance thresholds
- **Debugging**: Identifying specific weaknesses in model predictions

In [None]:
# Create sample datasets for demonstrations

def create_classification_data(n_samples=1000, imbalance_ratio=0.3):
    """Create binary classification dataset with configurable class imbalance."""
    X, y = make_classification(
        n_samples=n_samples, n_features=20, n_informative=10,
        n_redundant=5, n_clusters_per_class=2,
        weights=[1 - imbalance_ratio, imbalance_ratio], random_state=42
    )
    return train_test_split(X, y, test_size=0.3, random_state=42)

def create_regression_data(n_samples=1000, noise=10):
    """Create regression dataset."""
    X, y = make_regression(
        n_samples=n_samples, n_features=20, n_informative=10,
        noise=noise, random_state=42
    )
    return train_test_split(X, y, test_size=0.3, random_state=42)

# Create datasets
X_train_clf, X_test_clf, y_train_clf, y_test_clf = create_classification_data()
X_train_reg, X_test_reg, y_train_reg, y_test_reg = create_regression_data()

print(f"Classification: {len(y_train_clf)} train, {len(y_test_clf)} test")
print(f"Class distribution (test): {np.bincount(y_test_clf)}")
print(f"Regression: {len(y_train_reg)} train, {len(y_test_reg)} test")

---

## 2. Classification Metrics

### 2.1 The Confusion Matrix

The confusion matrix is the foundation for understanding classification performance.
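
Before plotting, it can help to see where the four cells come from. The sketch below counts them by hand for a small set of labels; the toy labels and the helper name `confusion_counts` are hypothetical and used only for illustration.

```python
# Hypothetical toy labels (1 = positive, 0 = negative), for illustration only.
toy_actual = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1   # positive correctly flagged
        elif actual == 0 and predicted == 1:
            fp += 1   # negative wrongly flagged
        elif actual == 1 and predicted == 0:
            fn += 1   # positive missed
        else:
            tn += 1   # negative correctly ignored
    return tn, fp, fn, tp

toy_tn, toy_fp, toy_fn, toy_tp = confusion_counts(toy_actual, toy_predicted)
print(f"TN={toy_tn} FP={toy_fp} FN={toy_fn} TP={toy_tp}")  # TN=3 FP=1 FN=1 TP=3
```

The tuple order matches the `tn, fp, fn, tp` unpacking used with `confusion_matrix(...).ravel()` in the cell that follows.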

In [None]:
# Train a classifier for demonstration
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train_clf, y_train_clf)

y_pred = clf.predict(X_test_clf)
y_pred_proba = clf.predict_proba(X_test_clf)[:, 1]

def plot_confusion_matrix(y_true, y_pred, labels=['Negative', 'Positive']):
    """Plot a detailed confusion matrix with annotations."""
    cm = confusion_matrix(y_true, y_pred)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Raw counts
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=labels, yticklabels=labels, ax=axes[0])
    axes[0].set_title('Confusion Matrix (Counts)')
    axes[0].set_ylabel('Actual')
    axes[0].set_xlabel('Predicted')
    
    # Normalized by row
    cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Blues',
                xticklabels=labels, yticklabels=labels, ax=axes[1])
    axes[1].set_title('Confusion Matrix (Normalized)')
    axes[1].set_ylabel('Actual')
    axes[1].set_xlabel('Predicted')
    
    plt.tight_layout()
    plt.show()
    
    tn, fp, fn, tp = cm.ravel()
    print(f"\nConfusion Matrix Breakdown:")
    print(f"  TN: {tn} | FP: {fp}")
    print(f"  FN: {fn} | TP: {tp}")

plot_confusion_matrix(y_test_clf, y_pred)

### 2.2 Precision, Recall, and F1-Score

- **Precision**: TP / (TP + FP) - How reliable are positive predictions?
- **Recall**: TP / (TP + FN) - How many positives did we find?
- **F1-Score**: 2 * Precision * Recall / (Precision + Recall) - The harmonic mean of precision and recall, which stays low unless both are high
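
As a quick worked example of the three definitions above, the sketch below computes them from raw counts. The counts and the helper name `precision_recall_f1` are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
toy_p, toy_r, toy_f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"Precision={toy_p:.4f} Recall={toy_r:.4f} F1={toy_f1:.4f}")
# Precision=0.7500 Recall=0.6000 F1=0.6667
```

The harmonic mean (0.6667) sits below the arithmetic mean of the two values (0.675), so F1 is pulled toward the weaker of precision and recall.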

In [None]:
@dataclass
class ClassificationMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float
    specificity: float
    
    def display(self):
        print("=" * 50)
        print("Classification Metrics Report")
        print("=" * 50)
        for name, value in [("Accuracy", self.accuracy), ("Precision", self.precision),
                            ("Recall", self.recall), ("F1-Score", self.f1),
                            ("Specificity", self.specificity)]:
            bar = "#" * int(value * 20) + "-" * (20 - int(value * 20))
            print(f"{name:12} [{bar}] {value:.4f}")

def compute_classification_metrics(y_true, y_pred):
    # labels=[0, 1] keeps the matrix 2x2 even if one class is missing from the data
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return ClassificationMetrics(
        accuracy=accuracy_score(y_true, y_pred),
        precision=precision_score(y_true, y_pred, zero_division=0),
        recall=recall_score(y_true, y_pred, zero_division=0),
        f1=f1_score(y_true, y_pred, zero_division=0),
        specificity=tn / (tn + fp) if (tn + fp) > 0 else 0.0
    )

metrics = compute_classification_metrics(y_test_clf, y_pred)
metrics.display()

### 2.3 The Accuracy Paradox

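Accuracy can be misleading with imbalanced datasets: a model that always predicts the majority class scores high accuracy while finding no positives at all.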

In [None]:
def demonstrate_accuracy_paradox():
    """Show why accuracy is misleading for imbalanced datasets."""
    np.random.seed(42)
    n_samples = 10000
    y_imb = np.concatenate([np.zeros(9900), np.ones(100)]).astype(int)
    
    # Always predict negative
    y_pred_neg = np.zeros(n_samples).astype(int)
    # Random predictions
    y_pred_random = np.random.randint(0, 2, n_samples)
    
    strategies = [("Always Negative", y_pred_neg), ("Random", y_pred_random)]
    
    print("Accuracy Paradox Demonstration")
    print("=" * 60)
    print(f"Dataset: 99% negative, 1% positive\n")
    
    for name, y_pred in strategies:
        acc = accuracy_score(y_imb, y_pred)
        rec = recall_score(y_imb, y_pred, zero_division=0)
        f1 = f1_score(y_imb, y_pred, zero_division=0)
        print(f"{name:20} Acc: {acc:.2%} | Recall: {rec:.2%} | F1: {f1:.4f}")
    
    print("\nKey Insight: 'Always Negative' has 99% accuracy but 0% recall!")

demonstrate_accuracy_paradox()

### 2.4 ROC Curve and AUC

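ROC-AUC measures the model's ability to rank positive examples higher than negative ones. It equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, so 0.5 corresponds to random ranking and 1.0 to perfect ranking.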
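
The ranking interpretation can be checked directly by counting correctly ordered (positive, negative) pairs. This is a minimal sketch, not how scikit-learn computes the score; the helper name `pairwise_auc` and the toy data are hypothetical, and the double loop is only practical for small inputs.

```python
def pairwise_auc(y_true, y_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_scores) if y == 1]
    neg = [s for y, s in zip(y_true, y_scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p_score in pos:
        for n_score in neg:
            if p_score > n_score:
                wins += 1.0
            elif p_score == n_score:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 2 negatives, 2 positives.
auc_labels = [0, 0, 1, 1]
auc_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(auc_labels, auc_scores))  # 0.75 (3 of 4 pairs ordered correctly)
```

Only the ordering of the scores matters here, which is why ROC-AUC says nothing about whether the predicted probabilities are well calibrated.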

In [None]:
def plot_roc_curve(y_true, y_scores, model_name="Model"):
    """Plot ROC curve with annotations."""
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    roc_auc = roc_auc_score(y_true, y_scores)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # ROC Curve
    axes[0].plot(fpr, tpr, 'b-', linewidth=2, label=f'{model_name} (AUC = {roc_auc:.4f})')
    axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.5)')
    axes[0].fill_between(fpr, tpr, alpha=0.3)
    
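    # Youden's J statistic (TPR - FPR): the point farthest above the diagonal.
    # It is "optimal" only if false positives and false negatives cost the same.
    optimal_idx = np.argmax(tpr - fpr)
    axes[0].plot(fpr[optimal_idx], tpr[optimal_idx], 'ro', markersize=10,
                 label=f"Max Youden's J Threshold = {thresholds[optimal_idx]:.3f}")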
    
    axes[0].set_xlabel('False Positive Rate')
    axes[0].set_ylabel('True Positive Rate')
    axes[0].set_title('ROC Curve')
    axes[0].legend(loc='lower right')
    axes[0].grid(True, alpha=0.3)
    
    # Metrics vs Threshold
    thresh_range = np.linspace(0.01, 0.99, 50)
    precs, recs, f1s = [], [], []
    for t in thresh_range:
        y_pred_t = (y_scores >= t).astype(int)
        precs.append(precision_score(y_true, y_pred_t, zero_division=0))
        recs.append(recall_score(y_true, y_pred_t, zero_division=0))
        f1s.append(f1_score(y_true, y_pred_t, zero_division=0))
    
    axes[1].plot(thresh_range, precs, 'b-', label='Precision')
    axes[1].plot(thresh_range, recs, 'g-', label='Recall')
    axes[1].plot(thresh_range, f1s, 'r-', label='F1-Score')
    axes[1].set_xlabel('Threshold')
    axes[1].set_ylabel('Score')
    axes[1].set_title('Metrics vs Threshold')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    return roc_auc

roc_auc = plot_roc_curve(y_test_clf, y_pred_proba, "Logistic Regression")
print(f"\nROC-AUC Score: {roc_auc:.4f}")

### 2.5 Precision-Recall Curve and PR-AUC

For imbalanced datasets, PR-AUC is often more informative than ROC-AUC.
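
One common way to summarize the precision-recall curve is average precision (AP): the precision at each rank where a positive is found, weighted by the recall gained at that rank. The sketch below is a simplified version that assumes no tied scores; the helper name `average_precision_sketch` and the toy data are hypothetical.

```python
def average_precision_sketch(y_true, y_scores):
    """Step-wise AP: sum of precision@rank times the recall increment at each hit.

    Assumes binary 0/1 labels and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += (hits / rank) * (1 / n_pos)  # precision here * recall step
    return ap

# Hypothetical toy data: positives land at ranks 1 and 3.
ap_labels = [0, 0, 1, 1]
ap_scores = [0.1, 0.4, 0.35, 0.8]
print(f"{average_precision_sketch(ap_labels, ap_scores):.4f}")  # 0.8333
```

A random ranker's AP is roughly the positive rate (0.5 in this toy data), so the baseline for PR-based scores moves with class balance, unlike the fixed 0.5 baseline of ROC-AUC.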

In [None]:
def plot_pr_curve(y_true, y_scores, model_name="Model"):
    """Plot Precision-Recall curve."""
    precision_vals, recall_vals, thresholds = precision_recall_curve(y_true, y_scores)
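    # average_precision_score is a step-wise summary of the PR curve (AP);
    # it is used here as the PR-AUC estimate.
    pr_auc = average_precision_score(y_true, y_scores)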
    baseline = y_true.sum() / len(y_true)
    
    plt.figure(figsize=(10, 6))
    plt.plot(recall_vals, precision_vals, 'b-', linewidth=2,
             label=f'{model_name} (AP = {pr_auc:.4f})')
    plt.axhline(y=baseline, color='r', linestyle='--', 
                label=f'Random Baseline = {baseline:.4f}')
    plt.fill_between(recall_vals, precision_vals, alpha=0.3)
    
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    return pr_auc

pr_auc = plot_pr_curve(y_test_clf, y_pred_proba, "Logistic Regression")
print(f"\nPR-AUC Score: {pr_auc:.4f}")

### 2.6 Multi-class Classification Metrics

In [None]:
def demonstrate_multiclass_metrics():
    """Demonstrate metrics for multi-class classification."""
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=15,
        n_classes=4, n_clusters_per_class=2, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print("Multi-class Classification Metrics")
    print("=" * 60)
    
    for avg in ['macro', 'micro', 'weighted']:
        prec = precision_score(y_test, y_pred, average=avg)
        rec = recall_score(y_test, y_pred, average=avg)
        f1 = f1_score(y_test, y_pred, average=avg)
        print(f"{avg:10} | P: {prec:.4f} | R: {rec:.4f} | F1: {f1:.4f}")
    
    print("\nPer-Class Report:")
    print(classification_report(y_test, y_pred))
    
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Multi-class Confusion Matrix')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.tight_layout()
    plt.show()

demonstrate_multiclass_metrics()
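
The `macro`, `micro`, and `weighted` options used above differ only in how per-class results are combined. The following sketch computes all three F1 variants by hand for single-label data; the helper name `averaged_f1` and the toy labels are hypothetical.

```python
def averaged_f1(y_true, y_pred):
    """Return macro, micro, and weighted F1 for single-label multi-class data."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1, supports = [], []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        # One-vs-rest counts for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom > 0 else 0.0)
        supports.append(tp + fn)  # number of true examples of class c
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class_f1) / len(classes)            # every class counts equally
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom > 0 else 0.0  # pooled counts
    weighted = sum(f * s for f, s in zip(per_class_f1, supports)) / sum(supports)
    return {"macro": macro, "micro": micro, "weighted": weighted}

# Hypothetical imbalanced labels: class 0 has 6 examples, classes 1 and 2 have 2 each.
mc_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
mc_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]
for name, value in averaged_f1(mc_true, mc_pred).items():
    print(f"{name:8} F1 = {value:.4f}")
```

In this toy case the macro score is the lowest of the three because the weak minority class 1 counts as much as the large class 0. For single-label problems, micro F1 equals plain accuracy.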

---

## 3. Regression Metrics

| Metric | Formula | Use Case |
|--------|---------|----------|
| MSE | mean((y - y_hat)^2) | Penalize large errors |
| RMSE | sqrt(MSE) | Same units as target |
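| MAE | mean(abs(y - y_hat)) | Less sensitive to outliers than MSE/RMSE |
| R-squared | 1 - SS_res/SS_tot | Fraction of variance explained |
| MAPE | mean(abs((y - y_hat) / y)) * 100 | Scale-independent; undefined when y = 0 |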
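
To make the formulas in the table concrete, here is a small hand computation, followed by the same data with one prediction replaced by a large miss. The helper name `regression_metrics_sketch` and the toy values are hypothetical.

```python
import math

def regression_metrics_sketch(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R-squared directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }

# Hypothetical toy values.
reg_actual = [3.0, -0.5, 2.0, 7.0]
reg_predicted = [2.5, 0.0, 2.0, 8.0]
reg_with_outlier = [2.5, 0.0, 2.0, 17.0]  # same predictions, one large miss

base = regression_metrics_sketch(reg_actual, reg_predicted)
outlier = regression_metrics_sketch(reg_actual, reg_with_outlier)
print({k: round(v, 4) for k, v in base.items()})
print({k: round(v, 4) for k, v in outlier.items()})
```

Comparing the two printed rows shows RMSE growing faster than MAE once the large miss is introduced, which is the sense in which squared-error metrics penalize large errors more heavily.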

In [None]:
# Train regression model
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)

@dataclass
class RegressionMetrics:
    mse: float
    rmse: float
    mae: float
    r2: float
    mape: float
    
    def display(self):
        print("=" * 50)
        print("Regression Metrics Report")
        print("=" * 50)
        for name, value in [("MSE", self.mse), ("RMSE", self.rmse),
                            ("MAE", self.mae), ("R-squared", self.r2),
                            ("MAPE", self.mape)]:
            print(f"{name:12}: {value:.4f}")

def compute_regression_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
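    # MAPE is undefined where y_true == 0, so those samples are excluded.
    # Targets close to zero can still inflate it, so read it with care here.
    mask = y_true != 0
    mape = np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100 if mask.any() else np.inf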
    
    return RegressionMetrics(
        mse=mse,
        rmse=np.sqrt(mse),
        mae=mean_absolute_error(y_true, y_pred),
        r2=r2_score(y_true, y_pred),
        mape=mape
    )

reg_metrics = compute_regression_metrics(y_test_reg, y_pred_reg)
reg_metrics.display()

In [None]:
def visualize_regression_performance(y_true, y_pred):
    """Create comprehensive regression visualization."""
    residuals = y_true - y_pred
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 12))
    
    # Actual vs Predicted
    axes[0, 0].scatter(y_true, y_pred, alpha=0.5)
    min_val, max_val = min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())
    axes[0, 0].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
    axes[0, 0].set_xlabel('Actual')
    axes[0, 0].set_ylabel('Predicted')
    axes[0, 0].set_title('Actual vs Predicted')
    
    # Residual Distribution
    axes[0, 1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
    axes[0, 1].axvline(x=0, color='r', linestyle='--')
    axes[0, 1].set_xlabel('Residual')
    axes[0, 1].set_title('Residual Distribution')
    
    # Residuals vs Predicted
    axes[1, 0].scatter(y_pred, residuals, alpha=0.5)
    axes[1, 0].axhline(y=0, color='r', linestyle='--')
    axes[1, 0].set_xlabel('Predicted')
    axes[1, 0].set_ylabel('Residuals')
    axes[1, 0].set_title('Residuals vs Predicted')
    
    # Error Percentiles
    abs_errors = np.abs(residuals)
    percentiles = np.arange(0, 101, 5)
    error_pcts = np.percentile(abs_errors, percentiles)
    axes[1, 1].plot(percentiles, error_pcts, 'b-o')
    axes[1, 1].set_xlabel('Percentile')
    axes[1, 1].set_ylabel('Absolute Error')
    axes[1, 1].set_title('Error Distribution by Percentile')
    
    plt.tight_layout()
    plt.show()

visualize_regression_performance(y_test_reg, y_pred_reg)

---

## 4. Ranking Metrics

Ranking metrics are essential for search engines, recommendation systems, and information retrieval, where the order of results matters as much as their correctness.

### 4.1 Precision@K and Recall@K

In [None]:
def precision_at_k(relevance, k):
    """Calculate Precision@K."""
    if k <= 0:
        return 0.0
    return np.sum(np.asarray(relevance)[:k]) / k

def recall_at_k(relevance, k, total_relevant):
    """Calculate Recall@K."""
    if total_relevant == 0:
        return 0.0
    return np.sum(np.asarray(relevance)[:k]) / total_relevant

# Example: Recommendation system
recommended = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
total_relevant = 10

print("Precision@K and Recall@K Analysis")
print("=" * 50)
print(f"Total relevant items: {total_relevant}")
print(f"Recommendations: {recommended}\n")

for k in [1, 3, 5, 10, 20]:
    p = precision_at_k(recommended, k)
    r = recall_at_k(recommended, k, total_relevant)
    print(f"K={k:2d} | P@K: {p:.4f} | R@K: {r:.4f}")

### 4.2 Mean Reciprocal Rank (MRR)

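MRR measures how early the first relevant item appears in the ranked list: it averages, over queries, the reciprocal of that item's rank (0 if nothing relevant is returned).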

In [None]:
def reciprocal_rank(relevance):
    """Calculate reciprocal rank."""
    relevance = np.asarray(relevance)
    positions = np.where(relevance == 1)[0]
    if len(positions) == 0:
        return 0.0
    return 1.0 / (positions[0] + 1)

def mean_reciprocal_rank(rankings):
    """Calculate MRR across multiple queries."""
    return np.mean([reciprocal_rank(r) for r in rankings])

# Example: Search results for 5 queries
query_results = [
    np.array([1, 0, 0, 0, 0]),  # RR = 1.0
    np.array([0, 0, 1, 0, 0]),  # RR = 0.33
    np.array([0, 1, 0, 0, 0]),  # RR = 0.5
    np.array([0, 0, 0, 0, 1]),  # RR = 0.2
    np.array([0, 0, 0, 0, 0]),  # RR = 0
]

print("Mean Reciprocal Rank (MRR)")
print("=" * 50)
for i, results in enumerate(query_results):
    rr = reciprocal_rank(results)
    print(f"Query {i+1}: {results} -> RR = {rr:.4f}")

mrr = mean_reciprocal_rank(query_results)
print(f"\nMRR = {mrr:.4f}")

### 4.3 Mean Average Precision (mAP)

Average Precision (AP) averages the precision measured at the rank of each relevant item; mAP is the mean of AP over all queries.
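
A step-by-step trace for one query shows where each term comes from. The helper name `precision_at_hits` and the toy ranking are hypothetical. The sketch normalizes by the number of relevant items in the list; some definitions divide by the total number of relevant items in the collection instead, which gives a lower value when relevant items are missing from the list.

```python
def precision_at_hits(relevance):
    """Return (rank, precision@rank) for each relevant item in a binary ranked list."""
    trace = []
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        if rel == 1:
            hits += 1
            trace.append((rank, hits / rank))
    return trace

# Hypothetical ranked list: relevant items at ranks 1, 2, 4, and 8.
ap_ranking = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
ap_trace = precision_at_hits(ap_ranking)
for rank, prec in ap_trace:
    print(f"rank {rank}: precision = {prec:.4f}")

ap_value = sum(prec for _, prec in ap_trace) / len(ap_trace)
print(f"AP = {ap_value:.4f}")  # (1.0 + 1.0 + 0.75 + 0.5) / 4 = 0.8125
```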

In [None]:
def average_precision(relevance):
    """Calculate Average Precision for a single query."""
    relevance = np.asarray(relevance)
    n_relevant = np.sum(relevance)
    if n_relevant == 0:
        return 0.0
    
    precisions = []
    n_relevant_seen = 0
    for i, rel in enumerate(relevance):
        if rel == 1:
            n_relevant_seen += 1
            precisions.append(n_relevant_seen / (i + 1))
    
    return np.mean(precisions)

def mean_average_precision(rankings):
    """Calculate mAP across multiple queries."""
    return np.mean([average_precision(r) for r in rankings])

# Example
rankings = [
    np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0]),  # AP = (1 + 1 + 3/4 + 4/8) / 4 = 0.81
    np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 1]),  # AP = (1 + 2/4 + 3/10) / 3 = 0.60
    np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0]),  # AP = 0.25
]

print("Mean Average Precision (mAP)")
print("=" * 50)
for i, ranking in enumerate(rankings):
    ap = average_precision(ranking)
    print(f"Query {i+1}: AP = {ap:.4f}")

map_score = mean_average_precision(rankings)
print(f"\nmAP = {map_score:.4f}")

### 4.4 Normalized Discounted Cumulative Gain (nDCG)

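nDCG handles graded relevance scores and discounts each item's gain logarithmically by position, then normalizes by the DCG of the ideal ordering so that scores fall between 0 and 1.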
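
The arithmetic can be traced on a short list. The sketch below uses a linear gain (the relevance grade itself) with a log2 position discount; the helper names `dcg_sketch` and `ndcg_sketch` and the toy grades are hypothetical.

```python
import math

def dcg_sketch(relevance, k):
    """DCG@k with linear gain: sum of rel_i / log2(i + 1) over positions i = 1..k."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevance[:k], start=1))

def ndcg_sketch(relevance, k):
    """nDCG@k: DCG of the given order divided by DCG of the best possible order."""
    ideal = sorted(relevance, reverse=True)
    idcg = dcg_sketch(ideal, k)
    return dcg_sketch(relevance, k) / idcg if idcg > 0 else 0.0

# Hypothetical graded relevance (0 = not relevant ... 3 = highly relevant).
toy_grades = [3, 2, 0, 1, 2]

# DCG@3  = 3/log2(2) + 2/log2(3) + 0/log2(4)
# IDCG@3 = 3/log2(2) + 2/log2(3) + 2/log2(4)   (ideal order: 3, 2, 2, 1, 0)
print(f"DCG@3  = {dcg_sketch(toy_grades, 3):.4f}")
print(f"nDCG@3 = {ndcg_sketch(toy_grades, 3):.4f}")  # about 0.81
```

The score falls short of 1.0 because an irrelevant item sits at rank 3 while a grade-2 item is pushed down to rank 5.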

In [None]:
def dcg_at_k(relevance, k):
    """Calculate DCG@K."""
    relevance = np.asarray(relevance)[:k]
    positions = np.arange(1, len(relevance) + 1)
    discounts = np.log2(positions + 1)
    return np.sum(relevance / discounts)

def ndcg_at_k(relevance, k):
    """Calculate nDCG@K."""
    dcg = dcg_at_k(relevance, k)
    ideal_relevance = np.sort(relevance)[::-1]
    idcg = dcg_at_k(ideal_relevance, k)
    return dcg / idcg if idcg > 0 else 0.0

# Example with graded relevance (0=not relevant, 1=somewhat, 2=relevant, 3=highly)
graded_relevance = np.array([3, 2, 0, 1, 2, 0, 0, 1, 0, 0])

print("nDCG@K Analysis")
print("=" * 50)
print(f"Graded relevance: {graded_relevance}\n")

for k in [1, 3, 5, 10]:
    dcg = dcg_at_k(graded_relevance, k)
    ndcg = ndcg_at_k(graded_relevance, k)
    print(f"K={k:2d} | DCG@K: {dcg:.4f} | nDCG@K: {ndcg:.4f}")

---

## 5. NLP Metrics

### 5.1 BLEU Score

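BLEU measures n-gram overlap between generated and reference text: it combines clipped n-gram precisions (here for n = 1 to 4) with a brevity penalty that punishes candidates shorter than the reference.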
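
The overlap count in BLEU is clipped: a candidate n-gram is credited at most as many times as it occurs in the reference. A minimal unigram-only sketch shows why this matters; the helper name `clipped_unigram_precision` and the sentences are illustrative.

```python
from collections import Counter

def clipped_unigram_precision(reference, candidate):
    """Unigram precision where each candidate word is credited at most
    as many times as it appears in the reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[token])
                  for token, count in cand_counts.items())
    return clipped / total

toy_reference = "the cat is on the mat"
toy_candidate = "the the the the the the the"

# Without clipping, all 7 candidate words "match", giving 7/7.
# With clipping, "the" is credited only twice, giving 2/7.
print(f"{clipped_unigram_precision(toy_reference, toy_candidate):.4f}")  # 0.2857
```

In the `compute_bleu` cell that follows, the same clipping is done by the `Counter` intersection (`cand_ngrams & ref_ngrams`), which keeps the minimum count of each n-gram.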

In [None]:
from collections import Counter
import math

def get_ngrams(tokens, n):
    """Extract n-grams from token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def compute_bleu(reference, candidate, max_n=4):
    """Compute BLEU score."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    if len(cand_tokens) == 0:
        return 0.0
    
    # Brevity penalty
    bp = min(1.0, math.exp(1 - len(ref_tokens) / len(cand_tokens)))
    
    # N-gram precisions
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(get_ngrams(ref_tokens, n))
        cand_ngrams = Counter(get_ngrams(cand_tokens, n))
        
        matches = sum((cand_ngrams & ref_ngrams).values())
        total = sum(cand_ngrams.values())
        
        if total > 0:
            precisions.append(matches / total)
        else:
            precisions.append(0.0)
    
    # Geometric mean of the n-gram precisions (no smoothing: any zero precision gives BLEU = 0)
    if all(p > 0 for p in precisions):
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    else:
        geo_mean = 0.0
    
    return bp * geo_mean

# Example
reference = "The quick brown fox jumps over the lazy dog"
candidates = [
    "The quick brown fox jumps over the lazy dog",  # Perfect match
    "The fast brown fox jumps over the lazy dog",   # One word different
    "A quick brown fox jumped over a lazy dog",     # More differences
    "The dog is lazy",                               # Very different
]

print("BLEU Score Examples")
print("=" * 60)
print(f"Reference: {reference}\n")

for cand in candidates:
    bleu = compute_bleu(reference, cand)
    print(f"BLEU: {bleu:.4f} | {cand}")

### 5.2 ROUGE Score

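ROUGE is a recall-oriented family of metrics for text summarization: ROUGE-N counts overlapping n-grams, and ROUGE-L uses the longest common subsequence (LCS) between reference and candidate.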

In [None]:
def compute_rouge_n(reference, candidate, n=1):
    """Compute ROUGE-N score."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    ref_ngrams = Counter(get_ngrams(ref_tokens, n))
    cand_ngrams = Counter(get_ngrams(cand_tokens, n))
    
    matches = sum((ref_ngrams & cand_ngrams).values())
    ref_total = sum(ref_ngrams.values())
    cand_total = sum(cand_ngrams.values())
    
    recall = matches / ref_total if ref_total > 0 else 0.0
    precision = matches / cand_total if cand_total > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {'precision': precision, 'recall': recall, 'f1': f1}

def compute_rouge_l(reference, candidate):
    """Compute ROUGE-L using LCS."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    # Compute LCS length
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i-1] == cand_tokens[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    lcs_len = dp[m][n]
    
    recall = lcs_len / m if m > 0 else 0.0
    precision = lcs_len / n if n > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {'precision': precision, 'recall': recall, 'f1': f1}

# Example
reference = "The cat sat on the mat"
candidate = "The cat was sitting on the mat"

print("ROUGE Scores")
print("=" * 50)
print(f"Reference: {reference}")
print(f"Candidate: {candidate}\n")

for n in [1, 2]:
    scores = compute_rouge_n(reference, candidate, n)
    print(f"ROUGE-{n}: P={scores['precision']:.4f} R={scores['recall']:.4f} F1={scores['f1']:.4f}")

scores_l = compute_rouge_l(reference, candidate)
print(f"ROUGE-L: P={scores_l['precision']:.4f} R={scores_l['recall']:.4f} F1={scores_l['f1']:.4f}")

---

## 6. Metric Selection Guidelines

### Decision Framework

In [None]:
class MetricSelector:
    """Guide for selecting appropriate evaluation metrics."""
    
    RECOMMENDATIONS = {
        'binary_classification': {
            'balanced': ['ROC-AUC', 'F1-Score', 'Accuracy'],
            'imbalanced': ['PR-AUC', 'F1-Score', 'Precision/Recall'],
            'fp_costly': ['Precision', 'Specificity'],
            'fn_costly': ['Recall (Sensitivity)'],  # recall and sensitivity are the same quantity
        },
        'multiclass_classification': {
            'balanced': ['Macro F1', 'Accuracy'],
            'imbalanced': ['Weighted F1', 'Macro F1'],
        },
        'regression': {
            'general': ['RMSE', 'MAE', 'R-squared'],
            'outlier_sensitive': ['RMSE', 'MSE'],
            'outlier_robust': ['MAE', 'Median AE'],
            'scale_independent': ['MAPE', 'R-squared'],
        },
        'ranking': {
            'top_k_matters': ['Precision@K', 'Recall@K', 'nDCG@K'],
            'first_result_matters': ['MRR'],
            'overall_ranking': ['mAP', 'nDCG'],
        },
    }
    
    @classmethod
    def recommend(cls, task_type, scenario):
        """Get metric recommendations."""
        if task_type in cls.RECOMMENDATIONS:
            task_recs = cls.RECOMMENDATIONS[task_type]
            if scenario in task_recs:
                return task_recs[scenario]
        return ['Please specify valid task and scenario']

# Display recommendations
print("Metric Selection Guide")
print("=" * 60)

examples = [
    ('binary_classification', 'imbalanced', 'Fraud Detection'),
    ('binary_classification', 'fn_costly', 'Medical Diagnosis'),
    ('regression', 'outlier_robust', 'House Price Prediction'),
    ('ranking', 'top_k_matters', 'Search Engine'),
]

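for task, scenario, use_case in examples:
    # Use a distinct name so the earlier `metrics` report object is not overwritten
    recommended_metrics = MetricSelector.recommend(task, scenario)
    print(f"\n{use_case}:")
    print(f"  Task: {task}, Scenario: {scenario}")
    print(f"  Recommended: {', '.join(recommended_metrics)}")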

---

## 7. Hands-on Exercises

### Exercise 1: Comprehensive Model Evaluation

In [None]:
def comprehensive_evaluation(y_true, y_pred, y_proba=None):
    """Perform comprehensive model evaluation."""
    print("=" * 60)
    print("Comprehensive Model Evaluation")
    print("=" * 60)
    
    # Basic metrics
    print("\n1. Basic Classification Metrics:")
    print(f"   Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
    print(f"   Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"   Recall:    {recall_score(y_true, y_pred):.4f}")
    print(f"   F1-Score:  {f1_score(y_true, y_pred):.4f}")
    
    if y_proba is not None:
        print("\n2. Probability-Based Metrics:")
        print(f"   ROC-AUC:   {roc_auc_score(y_true, y_proba):.4f}")
        print(f"   PR-AUC:    {average_precision_score(y_true, y_proba):.4f}")
    
    # Confusion matrix
    print("\n3. Confusion Matrix:")
    cm = confusion_matrix(y_true, y_pred)
    print(f"   TN: {cm[0,0]:4d} | FP: {cm[0,1]:4d}")
    print(f"   FN: {cm[1,0]:4d} | TP: {cm[1,1]:4d}")

# Run comprehensive evaluation
comprehensive_evaluation(y_test_clf, y_pred, y_pred_proba)

### Exercise 2: Compare Different Thresholds

In [None]:
def threshold_analysis(y_true, y_proba, thresholds=(0.3, 0.5, 0.7)):
    """Analyze model performance at different thresholds."""
    print("Threshold Analysis")
    print("=" * 60)
    
    results = []
    for thresh in thresholds:
        y_pred = (y_proba >= thresh).astype(int)
        results.append({
            'Threshold': thresh,
            'Precision': precision_score(y_true, y_pred, zero_division=0),
            'Recall': recall_score(y_true, y_pred, zero_division=0),
            'F1': f1_score(y_true, y_pred, zero_division=0),
            'Positives': y_pred.sum()
        })
    
    df = pd.DataFrame(results)
    print(df.to_string(index=False))
    return df

threshold_analysis(y_test_clf, y_pred_proba, [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
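
A natural follow-up to the table above is to pick the threshold that maximizes a chosen metric. The sketch below does this for F1 in plain Python; the helper name `best_f1_threshold` and the toy scores are hypothetical. In practice the threshold would usually be chosen on a validation split rather than on the test set, to avoid an optimistic estimate.

```python
def best_f1_threshold(y_true, y_scores, thresholds):
    """Return (threshold, f1) for the candidate threshold with the highest F1.

    Ties are resolved in favor of the first threshold encountered.
    """
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, y_scores) if y == 1 and s < t)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1

# Hypothetical validation labels and predicted probabilities.
val_labels = [0, 0, 1, 0, 1, 1]
val_probs = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
chosen_t, chosen_f1 = best_f1_threshold(
    val_labels, val_probs, [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
)
print(f"Best threshold = {chosen_t}, F1 = {chosen_f1:.4f}")  # 0.7, 0.8000
```

Swapping F1 for a cost-weighted score is a small change to the loop and is often closer to the business objective than F1 itself.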

---

## 8. Summary

### Key Takeaways

1. **Classification Metrics**
   - Use precision when FP is costly, recall when FN is costly
   - F1-score balances precision and recall
   - ROC-AUC for balanced data, PR-AUC for imbalanced data

2. **Regression Metrics**
   - RMSE penalizes large errors; MAE is less sensitive to outliers
   - R-squared shows the fraction of variance explained
   - MAPE provides scale-independent comparison but is undefined when actual values are zero

3. **Ranking Metrics**
   - Precision@K for top-K recommendation quality
   - MRR for first-result importance
   - nDCG for graded relevance with position discounting

4. **Best Practices**
   - Always choose metrics aligned with business objectives
   - Use multiple metrics for comprehensive evaluation
   - Consider class imbalance when selecting metrics

### Next Steps

In the next tutorial, we'll cover **Online Evaluation and A/B Testing** to learn how to validate models in production environments.