# 07 - Evaluation: Offline and Online Metrics

---

## What the Chapter Says

The chapter covers **Evaluation = Offline + Online**:

### Offline Metrics (Chapter Table)
| Task | Metrics |
|------|--------|
| Classification | Precision, Recall, F1, Accuracy, ROC-AUC, PR-AUC, Confusion Matrix |
| Regression | MSE, MAE, RMSE |
| Ranking | Precision@k, Recall@k, MRR, mAP, nDCG |
| Image Generation | FID, Inception Score |
| NLP | BLEU, METEOR, ROUGE, CIDEr, SPICE |

### Online Metrics (Chapter Table)
| System | Metrics |
|--------|--------|
| Ads | CTR, Revenue lift |
| Harmful content detection | Prevalence, Valid appeals |
| Video recommendation | CTR, Watch time, Completed videos |
| Friend recommendation | Requests sent per day, Requests accepted per day |

The chapter also raises **fairness/bias questions** and emphasizes **business sense** when choosing metrics.

---

## Meta Interview Signal

| Level | Expectations |
|-------|-------------|
| **E5** | Knows offline metrics for each task type. Can compute and interpret them. Understands why online metrics matter. |
| **E6** | Ties metrics to business impact. Discusses metric tradeoffs (precision vs recall). Proposes fairness metrics. Designs A/B test metrics. |

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix,
    mean_squared_error, mean_absolute_error,
    precision_recall_curve, roc_curve
)
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

---

## Offline Evaluation

**Definition**: Evaluation on held-out test data before deployment

### Classification Metrics (Chapter Table)

In [None]:
# Classification metrics from chapter
classification_metrics = pd.DataFrame({
    'Metric': [
        'Precision',
        'Recall',
        'F1 Score',
        'Accuracy',
        'ROC-AUC',
        'PR-AUC',
        'Confusion Matrix'
    ],
    'Formula/Definition': [
        'TP / (TP + FP)',
        'TP / (TP + FN)',
        '2 * (P * R) / (P + R)',
        '(TP + TN) / Total',
        'Area under ROC curve',
        'Area under Precision-Recall curve',
        'Matrix of TP, TN, FP, FN'
    ],
    'Use When': [
        'Cost of false positives is high (spam detection)',
        'Cost of false negatives is high (fraud, disease)',
        'Need balance of precision and recall',
        'Balanced classes (not recommended for imbalanced)',
        'Comparing models, threshold-independent',
        'Imbalanced datasets (usually more informative than ROC-AUC)',
        'Understanding error types'
    ]
})

print("="*90)
print("CLASSIFICATION METRICS (Chapter Table)")
print("="*90)
print(classification_metrics.to_string(index=False))

In [None]:
# Create synthetic classification data (CTR prediction)
n = 2000
y_true = np.random.choice([0, 1], n, p=[0.9, 0.1])  # 10% CTR (imbalanced)

# Simulate model predictions (not perfect)
y_scores = np.where(
    y_true == 1,
    np.random.beta(5, 2, n),  # Higher scores for positives
    np.random.beta(2, 5, n)   # Lower scores for negatives
)
y_pred = (y_scores > 0.5).astype(int)

print("SYNTHETIC CTR DATA")
print("="*40)
print(f"Total samples: {n}")
print(f"Positive rate: {y_true.mean()*100:.1f}%")
print(f"Predicted positive rate: {y_pred.mean()*100:.1f}%")

In [None]:
# Compute all classification metrics
print("\n" + "="*50)
print("CLASSIFICATION METRICS COMPUTATION")
print("="*50)

# Basic metrics
print(f"\nPrecision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.4f}")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_true, y_scores):.4f}")
print(f"PR-AUC: {average_precision_score(y_true, y_scores):.4f}")

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"\nConfusion Matrix:")
print(f"            Predicted")
print(f"             0     1")
print(f"Actual 0  [{cm[0,0]:4d}  {cm[0,1]:4d}]")
print(f"       1  [{cm[1,0]:4d}  {cm[1,1]:4d}]")
print(f"\nTP={cm[1,1]}, TN={cm[0,0]}, FP={cm[0,1]}, FN={cm[1,0]}")

In [None]:
# Visualize ROC and PR curves
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ROC Curve
fpr, tpr, thresholds_roc = roc_curve(y_true, y_scores)
roc_auc = roc_auc_score(y_true, y_scores)

axes[0].plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
axes[0].plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend(loc='lower right')
axes[0].grid(True, alpha=0.3)

# PR Curve
precision, recall, thresholds_pr = precision_recall_curve(y_true, y_scores)
# average_precision_score is a step-wise summary of the PR curve; it is used here as PR-AUC
pr_auc = average_precision_score(y_true, y_scores)

axes[1].plot(recall, precision, color='green', lw=2, label=f'PR curve (AUC = {pr_auc:.3f})')
axes[1].axhline(y=y_true.mean(), color='gray', linestyle='--', label=f'Baseline (prevalence={y_true.mean():.2f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend(loc='upper right')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n[Key Insight]: PR-AUC is more informative for imbalanced datasets")
print("               because it doesn't get inflated by true negatives.")
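
The PR curve shows how precision and recall move as the threshold changes, but it does not say which threshold to deploy. One common approach is to attach a cost to each error type and pick the threshold with the lowest total cost. The sketch below assumes hypothetical costs (`cost_fp`, `cost_fn`) and a hypothetical helper name `pick_threshold`; neither comes from the chapter.

```python
import numpy as np

def pick_threshold(y_true, y_scores, cost_fp=1.0, cost_fn=5.0, n_grid=99):
    """Return (threshold, cost) minimizing cost_fp*FP + cost_fn*FN on a uniform grid.

    cost_fp and cost_fn are hypothetical business costs, not values from the chapter.
    Ties are broken in favor of the lowest threshold.
    """
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    best_t, best_cost = None, np.inf
    for t in np.linspace(0.01, 0.99, n_grid):
        y_hat = y_scores >= t
        fp = np.sum(y_hat & (y_true == 0))   # flagged but actually negative
        fn = np.sum(~y_hat & (y_true == 1))  # missed positives
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy data shaped like the synthetic CTR example (about 10% positives)
rng = np.random.default_rng(0)
labels = rng.choice([0, 1], size=2000, p=[0.9, 0.1])
scores = np.where(labels == 1, rng.beta(5, 2, 2000), rng.beta(2, 5, 2000))

t_fn_heavy, _ = pick_threshold(labels, scores, cost_fp=1.0, cost_fn=5.0)
t_fp_heavy, _ = pick_threshold(labels, scores, cost_fp=5.0, cost_fn=1.0)
print(f"Threshold when misses are costly:       {t_fn_heavy:.2f}")
print(f"Threshold when false alarms are costly: {t_fp_heavy:.2f}")
```

Making false negatives more expensive pushes the chosen threshold down (more recall), while making false positives more expensive pushes it up (more precision).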

### Regression Metrics (Chapter Table)

In [None]:
# Regression metrics from chapter
regression_metrics = pd.DataFrame({
    'Metric': ['MSE', 'MAE', 'RMSE'],
    'Formula': [
        '(1/n) * Σ(y - ŷ)²',
        '(1/n) * Σ|y - ŷ|',
        '√MSE'
    ],
    'Characteristics': [
        'Penalizes large errors more (squared)',
        'Less sensitive to outliers than MSE (absolute error)',
        'Same unit as target variable'
    ]
})

print("="*70)
print("REGRESSION METRICS (Chapter Table)")
print("="*70)
print(regression_metrics.to_string(index=False))

In [None]:
# Create synthetic regression data (watch time prediction)
n = 1000
y_true_reg = np.random.exponential(300, n)  # Watch time in seconds
y_pred_reg = y_true_reg + np.random.normal(0, 50, n)  # Predictions with noise
y_pred_reg = np.clip(y_pred_reg, 0, None)  # No negative watch time

print("REGRESSION METRICS COMPUTATION (Watch Time)")
print("="*50)
print(f"\nMSE: {mean_squared_error(y_true_reg, y_pred_reg):.2f}")
print(f"MAE: {mean_absolute_error(y_true_reg, y_pred_reg):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)):.2f} seconds")

print(f"\n[Interpretation]: On average, predictions are off by {mean_absolute_error(y_true_reg, y_pred_reg):.0f} seconds")
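
The characteristics column says MSE penalizes large errors more than MAE. A minimal sketch of that effect, using made-up watch-time values (all numbers here are illustrative assumptions):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# Case 1: every prediction is off by 10 seconds
y_pred_clean = y_true + 10.0

# Case 2: same as case 1, except one prediction is off by 510 seconds
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true[-1] + 510.0

for name, y_pred in [("clean", y_pred_clean), ("one outlier", y_pred_outlier)]:
    print(f"{name:12s} MAE={mae(y_true, y_pred):7.2f}  RMSE={rmse(y_true, y_pred):7.2f}")
```

With uniform errors the two metrics agree. A single large miss moves RMSE far more than MAE, which is why the choice between them depends on how costly rare large errors are.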

### Ranking Metrics (Chapter Table)

In [None]:
# Ranking metrics from chapter
ranking_metrics = pd.DataFrame({
    'Metric': ['Precision@k', 'Recall@k', 'MRR', 'mAP', 'nDCG'],
    'Definition': [
        'Relevant items in top-k / k',
        'Relevant items in top-k / total relevant',
        'Mean Reciprocal Rank - 1/rank of first relevant',
        'Mean Average Precision across queries',
        'Normalized Discounted Cumulative Gain'
    ],
    'Use Case': [
        'Are top-k results relevant?',
        'How many relevant items are in top-k?',
        'How quickly do we find first relevant item?',
        'Overall ranking quality',
        'Ranking quality with graded relevance'
    ]
})

print("="*80)
print("RANKING METRICS (Chapter Table)")
print("="*80)
print(ranking_metrics.to_string(index=False))

In [None]:
# Implement ranking metrics
def precision_at_k(relevant, retrieved, k):
    """Precision@k: fraction of the top-k retrieved items that are relevant"""
    relevant_set = set(relevant)
    hits = sum(item in relevant_set for item in retrieved[:k])
    return hits / k

def recall_at_k(relevant, retrieved, k):
    """Recall@k: fraction of all relevant items that appear in the top-k"""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # No relevant items: define recall as 0 to avoid dividing by zero
    hits = sum(item in relevant_set for item in retrieved[:k])
    return hits / len(relevant_set)

def mrr(relevant, retrieved):
    """Reciprocal rank for one query (MRR is the mean of this value over queries)"""
    relevant_set = set(relevant)
    for rank, item in enumerate(retrieved, 1):
        if item in relevant_set:
            return 1.0 / rank
    return 0.0  # No relevant item was retrieved

def ndcg_at_k(relevance_scores, k):
    """Normalized DCG@k"""
    dcg = sum((2**rel - 1) / np.log2(i + 2) for i, rel in enumerate(relevance_scores[:k]))
    ideal = sorted(relevance_scores, reverse=True)[:k]
    idcg = sum((2**rel - 1) / np.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: Video recommendation
print("RANKING METRICS EXAMPLE: Video Recommendations")
print("="*50)

# Ground truth: videos the user actually watched/liked
relevant_videos = ['V1', 'V5', 'V8', 'V12']  # 4 relevant videos

# Model's ranked output
recommended = ['V3', 'V1', 'V7', 'V5', 'V2', 'V8', 'V4', 'V12', 'V9', 'V10']

# Relevance scores for nDCG (graded: 3=highly relevant, 2=relevant, 1=somewhat, 0=not)
relevance_scores = [0, 3, 0, 2, 0, 2, 0, 1, 0, 0]

print(f"\nRelevant videos: {relevant_videos}")
print(f"Recommended order: {recommended}")
print(f"Relevance scores: {relevance_scores}")

for k in [3, 5, 10]:
    print(f"\n--- k={k} ---")
    print(f"Precision@{k}: {precision_at_k(relevant_videos, recommended, k):.3f}")
    print(f"Recall@{k}: {recall_at_k(relevant_videos, recommended, k):.3f}")

print(f"\nReciprocal rank (MRR for this single query): {mrr(relevant_videos, recommended):.3f}")
print(f"nDCG@10: {ndcg_at_k(relevance_scores, 10):.3f}")
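
The ranking table lists mAP, but the cell above does not implement it. A minimal sketch follows, assuming binary relevance and no duplicate items in the ranked list. The function names and the second query are hypothetical additions for illustration.

```python
def average_precision(relevant, retrieved):
    """AP for one query: average of Precision@k over the ranks k that hold a relevant item."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(retrieved, 1):
        if item in relevant_set:
            hits += 1
            precision_sum += hits / rank  # Precision@rank at this hit
    # Relevant items that were never retrieved contribute zero
    return precision_sum / len(relevant_set)

def mean_average_precision(queries):
    """mAP: mean of AP over a list of (relevant, retrieved) pairs."""
    return sum(average_precision(rel, ret) for rel, ret in queries) / len(queries)

# Query 1 reuses the video example; query 2 is a made-up second user
q1 = (['V1', 'V5', 'V8', 'V12'],
      ['V3', 'V1', 'V7', 'V5', 'V2', 'V8', 'V4', 'V12', 'V9', 'V10'])
q2 = (['A', 'B'], ['A', 'C', 'B'])

print(f"AP (query 1): {average_precision(*q1):.3f}")   # hits at ranks 2, 4, 6, 8 -> 0.500
print(f"AP (query 2): {average_precision(*q2):.3f}")
print(f"mAP:          {mean_average_precision([q1, q2]):.3f}")
```

Unlike the reciprocal rank, AP rewards placing every relevant item high in the list, not only the first one.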

### Other Metrics (Chapter Table)

In [None]:
# NLP and Image Generation metrics from chapter
other_metrics = pd.DataFrame({
    'Domain': ['Image Generation', 'Image Generation', 'NLP', 'NLP', 'NLP', 'NLP', 'NLP'],
    'Metric': ['FID', 'Inception Score', 'BLEU', 'METEOR', 'ROUGE', 'CIDEr', 'SPICE'],
    'Description': [
        'Fréchet Inception Distance - distance between real/generated distributions',
        'Quality and diversity of generated images',
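        'N-gram precision against reference translations, with a brevity penalty',
        'Unigram matching with stemming and synonyms; weights recall above precision',
        'Recall-oriented n-gram overlap for summary evaluation',
        'Consensus-based image description evaluation',
        'Scene-graph (semantic proposition) overlap for image captions'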
    ]
})

print("="*90)
print("NLP & IMAGE GENERATION METRICS (Chapter Table)")
print("="*90)
print(other_metrics.to_string(index=False))
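
BLEU and ROUGE both rest on n-gram overlap, viewed from the precision side and the recall side respectively. The sketch below shows only that core idea for a single reference. It is not full BLEU (no brevity penalty, no geometric mean over n-gram orders) and not a full ROUGE implementation; the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified precision: each candidate n-gram is credited
    at most as many times as it occurs in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: share of reference n-grams found in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
degenerate = ["the"] * 7  # repeats one common word

print(f"Unigram precision: {clipped_precision(candidate, reference, 1):.3f}")
print(f"Bigram precision:  {clipped_precision(candidate, reference, 2):.3f}")
print(f"Unigram recall:    {ngram_recall(candidate, reference, 1):.3f}")
print(f"Degenerate output: {clipped_precision(degenerate, reference, 1):.3f}")
```

Clipping is what keeps the degenerate candidate from scoring a perfect precision: "the" appears only twice in the reference, so only two of its seven occurrences count.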

---

## Online Evaluation

**Definition**: Evaluation in production with real users

In [None]:
# Online metrics from chapter table
online_metrics = pd.DataFrame({
    'System': [
        'Ads',
        'Ads',
        'Harmful content detection',
        'Harmful content detection',
        'Video recommendation',
        'Video recommendation',
        'Video recommendation',
        'Friend recommendation',
        'Friend recommendation'
    ],
    'Metric': [
        'CTR (Click-Through Rate)',
        'Revenue lift',
        'Prevalence',
        'Valid appeals',
        'CTR',
        'Watch time',
        'Completed videos',
        'Requests sent per day',
        'Requests accepted per day'
    ],
    'Definition': [
        'Clicks / Impressions',
        'Incremental revenue from model',
        'Fraction of harmful content on platform',
        'Share of appealed decisions that are overturned',
        'Video clicks / Impressions',
        'Total time users spend watching',
        'Videos watched to completion',
        'Friend requests users send',
        'Friend requests accepted'
    ],
    'Higher/Lower is Better': [
        'Higher',
        'Higher',
        'Lower',
        'Lower',
        'Higher',
        'Higher',
        'Higher',
        'Higher',
        'Higher'
    ]
})

print("="*90)
print("ONLINE METRICS (Chapter Table)")
print("="*90)
print(online_metrics.to_string(index=False))

In [None]:
# Simulate online metrics for A/B test
np.random.seed(42)

# Simulate 14 days of A/B test data
n_days = 14
users_per_day = 10000  # Context only: the draws below are daily aggregates, not per-user data

# Control group metrics
control_ctr = np.random.normal(0.025, 0.002, n_days)  # ~2.5% CTR
control_watch_time = np.random.normal(180, 10, n_days)  # ~180 sec average

# Treatment group metrics (model improvement)
treatment_ctr = np.random.normal(0.028, 0.002, n_days)  # ~2.8% CTR (+12%)
treatment_watch_time = np.random.normal(195, 10, n_days)  # ~195 sec (+8%)

ab_data = pd.DataFrame({
    'Day': range(1, n_days + 1),
    'Control CTR': control_ctr,
    'Treatment CTR': treatment_ctr,
    'Control Watch Time': control_watch_time,
    'Treatment Watch Time': treatment_watch_time,
})

print("A/B TEST SIMULATION: Video Recommendation")
print("="*60)
print(f"\nControl (Old Model):")
print(f"  Average CTR: {control_ctr.mean()*100:.2f}%")
print(f"  Average Watch Time: {control_watch_time.mean():.1f} sec")

print(f"\nTreatment (New Model):")
print(f"  Average CTR: {treatment_ctr.mean()*100:.2f}%")
print(f"  Average Watch Time: {treatment_watch_time.mean():.1f} sec")

print(f"\nRelative Lift:")
print(f"  CTR: {(treatment_ctr.mean()/control_ctr.mean()-1)*100:+.1f}%")
print(f"  Watch Time: {(treatment_watch_time.mean()/control_watch_time.mean()-1)*100:+.1f}%")

In [None]:
# Visualize A/B test results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# CTR over time
axes[0].plot(ab_data['Day'], ab_data['Control CTR']*100, 'b-o', label='Control', markersize=6)
axes[0].plot(ab_data['Day'], ab_data['Treatment CTR']*100, 'g-o', label='Treatment', markersize=6)
axes[0].axhline(y=control_ctr.mean()*100, color='blue', linestyle='--', alpha=0.5)
axes[0].axhline(y=treatment_ctr.mean()*100, color='green', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Day')
axes[0].set_ylabel('CTR (%)')
axes[0].set_title('CTR Over A/B Test Period')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Watch Time over time
axes[1].plot(ab_data['Day'], ab_data['Control Watch Time'], 'b-o', label='Control', markersize=6)
axes[1].plot(ab_data['Day'], ab_data['Treatment Watch Time'], 'g-o', label='Treatment', markersize=6)
axes[1].axhline(y=control_watch_time.mean(), color='blue', linestyle='--', alpha=0.5)
axes[1].axhline(y=treatment_watch_time.mean(), color='green', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Day')
axes[1].set_ylabel('Watch Time (sec)')
axes[1].set_title('Watch Time Over A/B Test Period')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Statistical significance check
from scipy import stats

print("\nSTATISTICAL SIGNIFICANCE (Welch's t-test on daily means)")
print("="*50)

# Each sample is one daily average, so n=14 per arm.
# equal_var=False runs Welch's t-test, which does not assume equal variances.

# CTR significance
t_stat_ctr, p_val_ctr = stats.ttest_ind(treatment_ctr, control_ctr, equal_var=False)
print(f"\nCTR:")
print(f"  t-statistic: {t_stat_ctr:.3f}")
print(f"  p-value: {p_val_ctr:.4f}")
print(f"  Significant at α=0.05: {p_val_ctr < 0.05}")

# Watch Time significance
t_stat_wt, p_val_wt = stats.ttest_ind(treatment_watch_time, control_watch_time, equal_var=False)
print(f"\nWatch Time:")
print(f"  t-statistic: {t_stat_wt:.3f}")
print(f"  p-value: {p_val_wt:.4f}")
print(f"  Significant at α=0.05: {p_val_wt < 0.05}")
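
The t-test above treats each daily average as one sample. When raw click and impression totals are available, a two-proportion z-test is a common alternative for a rate metric such as CTR. The sketch below uses hypothetical totals (14 days at 10,000 impressions per arm) and assumes impressions are independent, which is only an approximation when the same user contributes many impressions.

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for H0: CTR_a == CTR_b, using the pooled standard error."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical totals: control at 2.5% CTR, treatment at 2.8% CTR
z, p_value = two_proportion_ztest(clicks_a=3500, n_a=140_000,
                                  clicks_b=3920, n_b=140_000)
print(f"z = {z:.2f}, p-value = {p_value:.2e}")
print(f"Significant at alpha=0.05: {p_value < 0.05}")
```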

---

## Fairness and Bias (Chapter Mention)

In [None]:
print("="*70)
print("FAIRNESS AND BIAS EVALUATION (Chapter Mention)")
print("="*70)

print("""
The chapter emphasizes including fairness/bias questions:

1. DEMOGRAPHIC PARITY:
   - Does the model have similar positive rates across groups?
   - Example: Are job ads shown equally to all genders?

2. EQUALIZED ODDS:
   - Same true positive rate and false positive rate across groups?
   - Example: Harmful content detection equally accurate for all communities?

3. CALIBRATION:
   - When the model says 70% probability, is it actually 70% within each group?
   - Important for decision-making based on scores

4. REPRESENTATION:
   - Are all groups adequately represented in training data?
   - Underrepresented groups may have worse performance

INTERVIEW SIGNAL:
- E5: Knows fairness metrics exist, can name a few
- E6: Proposes specific fairness constraints for the use case,
      discusses tradeoffs (accuracy vs fairness)
""")

In [None]:
# Simple fairness metric example
np.random.seed(42)

# Simulate model predictions for two groups
n_each = 500

# Group A
group_a_true = np.random.choice([0, 1], n_each, p=[0.85, 0.15])
group_a_pred = np.where(
    group_a_true == 1,
    np.random.choice([0, 1], n_each, p=[0.2, 0.8]),  # 80% recall
    np.random.choice([0, 1], n_each, p=[0.95, 0.05])  # 5% FPR
)

# Group B (model is less accurate)
group_b_true = np.random.choice([0, 1], n_each, p=[0.85, 0.15])
group_b_pred = np.where(
    group_b_true == 1,
    np.random.choice([0, 1], n_each, p=[0.4, 0.6]),  # 60% recall (worse!)
    np.random.choice([0, 1], n_each, p=[0.90, 0.10])  # 10% FPR (worse!)
)

print("FAIRNESS ANALYSIS: Model Performance Across Groups")
print("="*60)

print(f"\nGroup A:")
print(f"  Precision: {precision_score(group_a_true, group_a_pred):.3f}")
print(f"  Recall (TPR): {recall_score(group_a_true, group_a_pred):.3f}")
print(f"  FPR: {(group_a_pred[group_a_true == 0] == 1).mean():.3f}")

print(f"\nGroup B:")
print(f"  Precision: {precision_score(group_b_true, group_b_pred):.3f}")
print(f"  Recall (TPR): {recall_score(group_b_true, group_b_pred):.3f}")
print(f"  FPR: {(group_b_pred[group_b_true == 0] == 1).mean():.3f}")

print(f"\n[Fairness Issue]: Group B was simulated with lower recall and higher FPR - an equalized odds violation")
print(f"[Action]: Investigate data representation, consider group-specific tuning")
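
The per-group numbers above can be condensed into the two gap metrics named earlier, demographic parity and equalized odds. The sketch below is one simple way to define those gaps. The helper names and the tiny hand-made dataset are assumptions for illustration, and each group is assumed to contain at least one positive and one negative label.

```python
import numpy as np

def group_rates(y_true, y_pred):
    """Positive prediction rate, TPR, and FPR for one group."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "positive_rate": float(y_pred.mean()),
        "tpr": float(y_pred[y_true == 1].mean()),
        "fpr": float(y_pred[y_true == 0].mean()),
    }

def fairness_gaps(rates_a, rates_b):
    """Absolute gaps between two groups (0 means parity on that criterion)."""
    return {
        # Demographic parity: compare how often each group is predicted positive
        "demographic_parity_diff": abs(rates_a["positive_rate"] - rates_b["positive_rate"]),
        # Equalized odds: both TPR and FPR should match, so report the larger gap
        "equalized_odds_gap": max(abs(rates_a["tpr"] - rates_b["tpr"]),
                                  abs(rates_a["fpr"] - rates_b["fpr"])),
    }

# Hand-made example: 4 positives and 6 negatives in each group
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # TPR 3/4, FPR 1/6
pred_b = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]  # TPR 2/4, FPR 2/6

gaps = fairness_gaps(group_rates(y_true, pred_a), group_rates(y_true, pred_b))
for name, value in gaps.items():
    print(f"{name}: {value:.3f}")
```

Both groups receive positive predictions at the same rate here, so demographic parity holds, yet the error rates differ. This is why the choice of fairness criterion needs to be argued for the specific use case.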

---

## Business Sense Emphasis (Chapter Mention)

In [None]:
print("="*70)
print("BUSINESS SENSE IN METRIC SELECTION (Chapter Emphasis)")
print("="*70)

business_sense = pd.DataFrame({
    'System': ['Harmful Content', 'Ad CTR', 'Friend Recs', 'Video Recs'],
    'Wrong Metric Choice': [
        'Accuracy (ignores class imbalance)',
        'Clicks only (ignores conversions)',
        'Requests sent (ignores quality)',
        'CTR only (clickbait problem)'
    ],
    'Better Metric': [
        'Precision + Recall (balance FP/FN cost)',
        'Revenue lift + CTR (business outcome)',
        'Accepted requests (meaningful connections)',
        'Watch time + completion (genuine engagement)'
    ],
    'Business Rationale': [
        'FP = censorship complaints, FN = safety risk',
        'Ad that gets clicks but no purchase = bad',
        'Spam friend requests = bad experience',
        'User who clicks but leaves = not engaged'
    ]
})

print(business_sense.to_string(index=False))

print("\n[Interview Signal]: Always tie metrics back to business impact!")
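
One way to act on the table above is to pair a primary metric with guardrail metrics when deciding whether to ship a model. The sketch below is a simplified decision rule. The function name, metric names, and tolerance values are hypothetical, and statistical significance is deliberately left out.

```python
def launch_decision(lifts, primary, guardrails,
                    min_primary_lift=0.0, max_guardrail_drop=0.01):
    """Return a launch recommendation from relative lifts (treatment / control - 1).

    The thresholds are illustrative defaults, not recommendations from the chapter.
    """
    if lifts[primary] <= min_primary_lift:
        return "no-launch: primary metric did not improve"
    regressed = [m for m in guardrails if lifts[m] < -max_guardrail_drop]
    if regressed:
        return "no-launch: guardrail regression in " + ", ".join(regressed)
    return "launch"

# Clickbait-style outcome: CTR is up, but users watch less
clickbait = {"ctr": 0.12, "watch_time": -0.05, "completion_rate": 0.00}
# Healthy outcome: CTR and watch time both improve
healthy = {"ctr": 0.12, "watch_time": 0.08, "completion_rate": 0.01}

for name, lifts in [("clickbait", clickbait), ("healthy", healthy)]:
    verdict = launch_decision(lifts, primary="ctr",
                              guardrails=["watch_time", "completion_rate"])
    print(f"{name}: {verdict}")
```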

---

## Tradeoffs (Chapter-Aligned)

| Tradeoff | Discussion | Interview Signal |
|----------|------------|------------------|
| **Precision vs Recall** | FP cost vs FN cost | E5: Knows formulas. E6: Maps to business impact |
| **Offline vs Online** | Test set vs real users | E5: Knows both. E6: Discusses offline-online correlation |
| **Short-term vs Long-term** | CTR vs retention | E5: Aware of difference. E6: Proposes guardrail metrics |
| **Accuracy vs Fairness** | Overall performance vs group equity | E5: Knows fairness metrics. E6: Proposes constraints |

---

## Meta Interview Signal (Detailed)

### E5 Answer Expectations

- Knows the metrics table for each task type (classification, regression, ranking)
- Can compute metrics given predictions
- Understands precision/recall tradeoff
- Can name relevant online metrics for a system

### E6 Additions

- **Business impact**: "Increasing recall from 0.7 to 0.8 catches an extra 10% of all harmful content (about 14% more than before) = X fewer bad experiences"
- **Metric tradeoffs**: "Higher CTR might hurt watch time if we optimize for clickbait - need both as guardrails"
- **Fairness**: "We should slice metrics by demographic to ensure model works equally well for all users"
- **Offline-online gap**: "Offline AUC improved 2% but online CTR was flat - investigating feature freshness"
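
The business-impact bullet turns a recall change into a count of bad experiences. A small sketch of that arithmetic, assuming a hypothetical volume of 1,000 harmful posts and a hypothetical helper name (requires `0 < recall_before < 1`):

```python
def recall_impact(n_harmful, recall_before, recall_after):
    """Translate a recall change into caught and missed item counts."""
    caught_before = n_harmful * recall_before
    caught_after = n_harmful * recall_after
    missed_before = n_harmful - caught_before
    missed_after = n_harmful - caught_after
    return {
        "extra_caught": caught_after - caught_before,
        "relative_gain_in_catches": caught_after / caught_before - 1,
        "relative_drop_in_misses": 1 - missed_after / missed_before,
    }

impact = recall_impact(n_harmful=1000, recall_before=0.7, recall_after=0.8)
print(f"Extra harmful posts caught: {impact['extra_caught']:.0f}")
print(f"Relative gain in catches:   {impact['relative_gain_in_catches']:.1%}")
print(f"Relative drop in misses:    {impact['relative_drop_in_misses']:.1%}")
```

The same 10-point recall change reads as roughly a 14% gain in catches or a one-third cut in misses, so it helps to state which framing is being used.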

---

## Interview Drills

### Drill 1: Metrics Table Recall
From memory, list:
- 5 classification metrics
- 3 regression metrics
- 4 ranking metrics

### Drill 2: Metric Selection
For each scenario, choose the best offline metric and justify:
- Spam email detection (which matters more: catching every spam message, or never flagging a legitimate email?)
- Search ranking (top 10 results matter most)
- Watch time prediction

### Drill 3: Online Metrics
For a news feed ranking system, propose:
- 2 primary online metrics
- 2 guardrail metrics

### Drill 4: Precision-Recall Tradeoff
For harmful content detection:
- What's the cost of high precision, low recall?
- What's the cost of high recall, low precision?
- How would you balance them?

### Drill 5: Fairness Analysis
For a job recommendation system:
- What demographic slices would you analyze?
- What fairness metric would you use?
- How would you address disparities?