# Tutorial 13: Online Evaluation and A/B Testing

## Module 5: Evaluation

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Design online evaluation experiments** - understand online metrics and experiment setup
2. **Implement A/B testing frameworks** - sample size calculation, randomization, analysis
3. **Analyze experiment results** - statistical significance, confidence intervals
4. **Apply advanced testing methods** - multi-armed bandits, interleaving, canary releases
5. **Avoid common pitfalls** - peeking, multiple testing, novelty effects

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
from dataclasses import dataclass
from scipy import stats
from scipy.stats import norm, ttest_ind
import warnings

warnings.filterwarnings('ignore')
np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries imported successfully!")

---

## 1. Introduction to Online Evaluation

### Why Online Evaluation?

Offline metrics don't always correlate with real-world performance:
- **Offline metrics** measure model quality on historical data
- **Online metrics** measure actual business impact in production

In [None]:
def demonstrate_offline_online_gap():
    """Demonstrate the gap between offline and online metrics."""
    
    models = [
        {"name": "Model A", "offline_auc": 0.92, "online_ctr": 0.045, "revenue": 2.1},
        {"name": "Model B", "offline_auc": 0.89, "online_ctr": 0.052, "revenue": 2.8},
        {"name": "Model C", "offline_auc": 0.95, "online_ctr": 0.041, "revenue": 1.9},
    ]
    
    print("Offline vs Online Performance Gap")
    print("=" * 60)
    print(f"{'Model':<10} {'Offline AUC':<15} {'Online CTR':<15} {'Revenue/User':<15}")
    print("-" * 60)
    
    for m in models:
        print(f"{m['name']:<10} {m['offline_auc']:<15.3f} {m['online_ctr']:<15.3f} ${m['revenue']:<14.2f}")
    
    print("\nKey Insight: Model C has the best offline AUC but the worst online performance!")

demonstrate_offline_online_gap()

---

## 2. Online Metrics

| Metric Type | Examples | Use Case |
|-------------|----------|----------|
| Engagement | CTR, Time on Site | Content platforms |
| Conversion | Sign-ups, Purchases | E-commerce |
| Revenue | ARPU, LTV | Monetization |
| Quality | Error Rate, Bounce Rate | User experience |

In [None]:
@dataclass
class OnlineMetrics:
    """Container for common online metrics."""
    impressions: int
    clicks: int
    conversions: int
    revenue: float
    sessions: int
    
    @property
    def ctr(self):
        return self.clicks / self.impressions if self.impressions > 0 else 0.0
    
    @property
    def conversion_rate(self):
        return self.conversions / self.clicks if self.clicks > 0 else 0.0
    
    @property
    def revenue_per_session(self):
        return self.revenue / self.sessions if self.sessions > 0 else 0.0
    
    def display(self):
        print("=" * 50)
        print("Online Metrics Report")
        print("=" * 50)
        print(f"Impressions: {self.impressions:,}")
        print(f"Clicks: {self.clicks:,}")
        print(f"Conversions: {self.conversions:,}")
        print(f"Revenue: ${self.revenue:,.2f}")
        print(f"CTR: {self.ctr:.2%}")
        print(f"Conversion Rate: {self.conversion_rate:.2%}")
        print(f"Revenue/Session: ${self.revenue_per_session:.2f}")

metrics = OnlineMetrics(
    impressions=100000, clicks=4500, conversions=450, revenue=22500, sessions=50000
)
metrics.display()

---

## 3. A/B Testing Fundamentals

A/B testing compares two versions:
- **Control (A)**: Current production version
- **Treatment (B)**: New version with changes
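
Each user has to land in the same group on every visit, otherwise the two groups contaminate each other. A common way to get this is to hash the user ID together with the experiment name. The sketch below is a minimal illustration, not a production assignment service: `bucket_for` is a hypothetical helper, and SHA-256 is an assumption chosen only because its output is stable across processes (Python's built-in `hash()` on strings is randomized per process by default).

```python
import hashlib
from collections import Counter


def bucket_for(experiment: str, user_id: str, control_ratio: float = 0.5) -> str:
    """Hypothetical helper: deterministically map a user to 'control' or 'treatment'."""
    key = f"{experiment}_{user_id}".encode("utf-8")
    # Take the digest as a big integer and reduce it to a bucket in [0, 100)
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "control" if bucket < control_ratio * 100 else "treatment"


# The same input always yields the same group
first = bucket_for("recommendation_v2", "user_123")
second = bucket_for("recommendation_v2", "user_123")
print("Stable assignment:", first == second)

# Over many users the split should be close to the configured ratio
counts = Counter(bucket_for("recommendation_v2", f"user_{i}") for i in range(10_000))
print(dict(counts))
```

Including the experiment name in the hashed key means a user's group in one experiment does not determine their group in another.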

In [None]:
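import hashlib


class ABTest:
    """A/B Test implementation."""
    
    def __init__(self, name, control_ratio=0.5):
        self.name = name
        self.control_ratio = control_ratio
        self.control_data = []
        self.treatment_data = []
    
    def assign_user(self, user_id):
        # Built-in hash() is salted per process for strings, so it would
        # reassign users between sessions; a SHA-256 digest is stable.
        digest = hashlib.sha256(f"{self.name}_{user_id}".encode("utf-8")).hexdigest()
        hash_val = int(digest, 16) % 100
        return "control" if hash_val < self.control_ratio * 100 else "treatment"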
    
    def record_outcome(self, group, value):
        if group == "control":
            self.control_data.append(value)
        else:
            self.treatment_data.append(value)
    
    def get_summary(self):
        control = np.array(self.control_data)
        treatment = np.array(self.treatment_data)
        return {
            "control": {"n": len(control), "mean": np.mean(control) if len(control) > 0 else 0},
            "treatment": {"n": len(treatment), "mean": np.mean(treatment) if len(treatment) > 0 else 0}
        }

# Simulate an A/B test
test = ABTest("recommendation_v2")
np.random.seed(42)

for i in range(10000):
    user_id = f"user_{i}"
    group = test.assign_user(user_id)
    
    if group == "control":
        outcome = np.random.binomial(1, 0.05)  # 5% conversion
    else:
        outcome = np.random.binomial(1, 0.055)  # 5.5% conversion (10% lift)
    
    test.record_outcome(group, outcome)

summary = test.get_summary()
print("A/B Test Summary")
print("=" * 50)
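# Named group_stats so the loop does not shadow the imported scipy `stats` module
for group, group_stats in summary.items():
    print(f"{group.upper()}: n={group_stats['n']}, rate={group_stats['mean']:.4f}")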

---

## 4. Statistical Analysis

- **Null Hypothesis (H0)**: No difference between control and treatment
- **p-value**: Probability of observing a result at least as extreme as the one measured, assuming H0 is true
- **Significance Level (alpha)**: Threshold for rejecting H0 (typically 0.05)
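
For the two-proportion case used in this tutorial, these definitions can be written out explicitly. The notation is an assumption chosen to mirror the arguments of `z_test_proportions` in the next cell: $\hat{p}_1, n_1$ are the control rate and sample size, $\hat{p}_2, n_2$ are the treatment rate and sample size, and $\Phi$ is the standard normal CDF.

$$
\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad
p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$

$$
\text{CI}_{1-\alpha} = (\hat{p}_2 - \hat{p}_1) \;\pm\; z_{1-\alpha/2}\,
\sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}
$$

The test statistic uses the pooled rate $\hat{p}$ because H0 assumes both groups share one rate. The confidence interval uses the unpooled standard error because it estimates the difference without assuming H0.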

In [None]:
class ABTestAnalyzer:
    """Statistical analysis for A/B tests."""
    
    def __init__(self, alpha=0.05):
        self.alpha = alpha
    
    def z_test_proportions(self, n1, p1, n2, p2):
        """Two-proportion z-test for comparing conversion rates."""
        p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)
        se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
        z_stat = (p2 - p1) / se if se > 0 else 0
        p_value = 2 * (1 - norm.cdf(abs(z_stat)))
        
        se_diff = np.sqrt(p1 * (1-p1) / n1 + p2 * (1-p2) / n2)
        z_crit = norm.ppf(1 - self.alpha/2)
        ci_lower = (p2 - p1) - z_crit * se_diff
        ci_upper = (p2 - p1) + z_crit * se_diff
        
        return {
            "z_statistic": z_stat,
            "p_value": p_value,
            "significant": p_value < self.alpha,
            "absolute_lift": p2 - p1,
            "relative_lift": (p2 - p1) / p1 if p1 > 0 else 0,
            "ci_lower": ci_lower,
            "ci_upper": ci_upper,
        }

analyzer = ABTestAnalyzer(alpha=0.05)
results = analyzer.z_test_proportions(
    n1=summary["control"]["n"], p1=summary["control"]["mean"],
    n2=summary["treatment"]["n"], p2=summary["treatment"]["mean"]
)

print("Statistical Analysis Results")
print("=" * 50)
print(f"Control Rate: {summary['control']['mean']:.4f}")
print(f"Treatment Rate: {summary['treatment']['mean']:.4f}")
print(f"Absolute Lift: {results['absolute_lift']:.4f}")
print(f"Relative Lift: {results['relative_lift']:.2%}")
print(f"Z-Statistic: {results['z_statistic']:.4f}")
print(f"P-Value: {results['p_value']:.4f}")
print(f"95% CI: [{results['ci_lower']:.4f}, {results['ci_upper']:.4f}]")
print(f"Significant: {'YES' if results['significant'] else 'NO'}")

In [None]:
def visualize_ab_results(control_rate, treatment_rate, control_n, treatment_n, alpha=0.05):
    """Visualize A/B test results."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Bar chart
    groups = ['Control', 'Treatment']
    rates = [control_rate, treatment_rate]
    colors = ['steelblue', 'coral']
    
    se_control = np.sqrt(control_rate * (1 - control_rate) / control_n)
    se_treatment = np.sqrt(treatment_rate * (1 - treatment_rate) / treatment_n)
    z = norm.ppf(1 - alpha/2)
    errors = [z * se_control, z * se_treatment]
    
    bars = axes[0].bar(groups, rates, color=colors, alpha=0.7)
    axes[0].errorbar(groups, rates, yerr=errors, fmt='none', color='black', capsize=5)
    axes[0].set_ylabel('Conversion Rate')
    axes[0].set_title('Control vs Treatment')
    
    for bar, rate in zip(bars, rates):
        axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
                     f'{rate:.2%}', ha='center', va='bottom')
    
    # Distribution of difference
    diff = treatment_rate - control_rate
    se_diff = np.sqrt(se_control**2 + se_treatment**2)
    
    x = np.linspace(diff - 4*se_diff, diff + 4*se_diff, 100)
    y = norm.pdf(x, diff, se_diff)
    
    axes[1].plot(x, y, 'b-', linewidth=2)
    axes[1].fill_between(x, y, alpha=0.3)
    axes[1].axvline(x=0, color='red', linestyle='--', label='No Effect')
    axes[1].axvline(x=diff, color='green', linestyle='-', label=f'Observed: {diff:.4f}')
    axes[1].set_xlabel('Difference (Treatment - Control)')
    axes[1].set_ylabel('Probability Density')
    axes[1].set_title('Approximate Sampling Distribution of the Difference')
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()

visualize_ab_results(
    summary['control']['mean'], summary['treatment']['mean'],
    summary['control']['n'], summary['treatment']['n']
)

---

## 5. Sample Size Calculation

Key parameters:
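- **Alpha**: Probability of a false positive, i.e. rejecting H0 when there is no real effect (typically 0.05)
- **Power**: Probability of detecting a true effect of at least the MDE (typically 0.80)
- **MDE**: Minimum Detectable Effect, the smallest lift the test is sized to detect; the code below expresses it relative to the baseline rate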
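
The calculator in the next cell appears to follow the usual normal-approximation formula for comparing two proportions with equal group sizes. Written out, with symbols chosen here to match the code's variables `p1`, `p2`, `p_avg`, `z_alpha`, and `z_beta`:

$$
p_2 = p_1\,(1 + \text{MDE}_{\text{rel}}), \qquad \bar{p} = \frac{p_1 + p_2}{2}
$$

$$
n = \left\lceil \frac{\left( z_{1-\alpha/2}\sqrt{2\,\bar{p}\,(1-\bar{p})} \;+\; z_{1-\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{(p_2 - p_1)^2} \right\rceil
$$

Here $n$ is the sample size per group and $1-\beta$ is the power. Because the denominator is the squared absolute difference, halving the MDE roughly quadruples $n$. For a 5% baseline and a 10% relative MDE ($p_2 = 0.055$) with $\alpha = 0.05$ and power 0.80, this works out to roughly 31,000 users per group.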

In [None]:
class SampleSizeCalculator:
    """Calculate required sample sizes for A/B tests."""
    
    @staticmethod
    def for_proportions(baseline_rate, mde_relative, alpha=0.05, power=0.80):
        """Sample size for conversion rate test."""
        p1 = baseline_rate
        p2 = baseline_rate * (1 + mde_relative)
        
        z_alpha = norm.ppf(1 - alpha/2)
        z_beta = norm.ppf(power)
        p_avg = (p1 + p2) / 2
        
        numerator = (z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
                     z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2
        denominator = (p2 - p1)**2
        
        return int(np.ceil(numerator / denominator))

# Example calculations
calc = SampleSizeCalculator()
baseline_cvr = 0.05

print("Sample Size Calculator")
print("=" * 60)
print(f"Baseline conversion rate: {baseline_cvr:.0%}")
print()
print(f"{'MDE (Relative)':<20} {'Sample Size/Group':<20} {'Total Sample':<15}")
print("-" * 55)

for mde in [0.05, 0.10, 0.20]:
    n = calc.for_proportions(baseline_cvr, mde)
    print(f"{mde:>10.0%}           {n:>15,}      {2*n:>12,}")

In [None]:
def plot_sample_size_sensitivity(baseline_rate=0.05):
    """Plot sample size requirements for different MDE values."""
    calc = SampleSizeCalculator()
    mde_range = np.linspace(0.02, 0.30, 30)
    sample_sizes = [calc.for_proportions(baseline_rate, mde) for mde in mde_range]
    
    plt.figure(figsize=(10, 6))
    plt.plot(mde_range * 100, sample_sizes, 'b-', linewidth=2)
    
    for mde in [0.05, 0.10, 0.20]:
        n = calc.for_proportions(baseline_rate, mde)
        plt.plot(mde * 100, n, 'ro', markersize=10)
        plt.annotate(f'{n:,}', (mde * 100, n), textcoords="offset points",
                     xytext=(10, 5), fontsize=10)
    
    plt.xlabel('Minimum Detectable Effect (%)')
    plt.ylabel('Sample Size per Group')
    plt.title(f'Sample Size vs MDE (Baseline Rate = {baseline_rate:.0%})')
    plt.grid(True, alpha=0.3)
    plt.yscale('log')
    plt.tight_layout()
    plt.show()

plot_sample_size_sensitivity()

---

## 6. Multi-Armed Bandits

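Bandits balance exploration (gathering data on every variant) and exploitation (sending traffic to the variant that currently looks best), shifting traffic adaptively instead of holding a fixed split.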
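
Thompson sampling, one of the two algorithms implemented in the next cell, makes this trade-off by drawing a plausible conversion rate for each arm from its Beta posterior and playing the arm with the highest draw. The sketch below uses only the standard library and made-up counts (both arms and their numbers are hypothetical) to show why an arm with little data still gets explored.

```python
import random

random.seed(0)


def posterior_draws(successes: int, failures: int, n_draws: int = 20_000) -> list:
    """Draw from Beta(1 + successes, 1 + failures), the posterior under a uniform prior."""
    return [random.betavariate(1 + successes, 1 + failures) for _ in range(n_draws)]


# Hypothetical counts: arm A has plenty of data, arm B has very little
arm_a = posterior_draws(successes=60, failures=940)  # observed rate 6.0%
arm_b = posterior_draws(successes=1, failures=29)    # observed rate about 3.3%

# Share of paired draws in which arm B's sampled rate beats arm A's,
# i.e. how often Thompson sampling would choose arm B between these two
b_wins = sum(b > a for a, b in zip(arm_a, arm_b)) / len(arm_a)
print(f"Arm B chosen in {b_wins:.1%} of draws")
```

Arm B has the lower observed rate, but its posterior is wide, so it still wins a sizeable share of draws. As its counts grow and the posterior narrows, a genuinely worse arm is picked less and less often.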

In [None]:
class EpsilonGreedyBandit:
    """Epsilon-Greedy Multi-Armed Bandit."""
    
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
    
    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)
        return np.argmax(self.values)
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] + (reward - self.values[arm]) / n


class ThompsonSamplingBandit:
    """Thompson Sampling for Bernoulli bandits."""
    
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
    
    def select_arm(self):
        samples = [np.random.beta(self.alpha[i], self.beta[i]) for i in range(self.n_arms)]
        return np.argmax(samples)
    
    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
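
The `update` rule in `EpsilonGreedyBandit` is an incremental mean: it nudges the current estimate toward each new reward by `1/n`, which gives the same value as averaging every reward seen so far without storing them. A small standard-library check of that identity, using simulated rewards with an assumed 5% success rate:

```python
import random
import statistics

random.seed(0)

# Simulated Bernoulli rewards; the 5% success rate is an illustrative assumption
rewards = [1 if random.random() < 0.05 else 0 for _ in range(1_000)]

estimate = 0.0
for n, reward in enumerate(rewards, start=1):
    # Same rule as EpsilonGreedyBandit.update: step toward the new reward by 1/n
    estimate += (reward - estimate) / n

batch_mean = statistics.fmean(rewards)
print("Incremental and batch means agree:", abs(estimate - batch_mean) < 1e-9)
```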

In [None]:
def simulate_bandit_comparison(n_rounds=10000):
    """Compare different bandit algorithms."""
    true_probs = [0.04, 0.05, 0.06, 0.045]
    n_arms = len(true_probs)
    
    bandits = {
        "Epsilon-Greedy": EpsilonGreedyBandit(n_arms, epsilon=0.1),
        "Thompson Sampling": ThompsonSamplingBandit(n_arms),
    }
    
    cumulative_rewards = {name: [] for name in bandits}
    arm_selections = {name: np.zeros(n_arms) for name in bandits}
    
    for t in range(n_rounds):
        for name, bandit in bandits.items():
            arm = bandit.select_arm()
            reward = np.random.binomial(1, true_probs[arm])
            bandit.update(arm, reward)
            
            arm_selections[name][arm] += 1
            prev = cumulative_rewards[name][-1] if cumulative_rewards[name] else 0
            cumulative_rewards[name].append(prev + reward)
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    for name, rewards in cumulative_rewards.items():
        axes[0].plot(rewards, label=name, linewidth=2)
    axes[0].set_xlabel('Round')
    axes[0].set_ylabel('Cumulative Reward')
    axes[0].set_title('Cumulative Reward Over Time')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    x = np.arange(n_arms)
    width = 0.35
    for i, (name, selections) in enumerate(arm_selections.items()):
        axes[1].bar(x + i * width, selections / n_rounds, width, label=name)
    axes[1].set_xlabel('Arm')
    axes[1].set_ylabel('Selection Proportion')
    axes[1].set_title('Arm Selection Distribution')
    axes[1].set_xticks(x + width/2)
    axes[1].set_xticklabels([f'Arm {i} (p={p})' for i, p in enumerate(true_probs)])
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()
    
    print("\nBandit Summary (Best arm: Arm 2, p=0.06)")
    print("=" * 50)
    for name, selections in arm_selections.items():
        best_pct = selections[2] / n_rounds
        total = cumulative_rewards[name][-1]
        print(f"{name}: Best arm {best_pct:.1%}, Total reward {total}")

simulate_bandit_comparison()

---

## 7. Common Pitfalls

### 7.1 Peeking Problem

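Checking results repeatedly and stopping at the first significant reading inflates the false positive rate, because every extra look is another chance for noise to cross the threshold.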

In [None]:
def demonstrate_peeking_problem(n_simulations=1000):
    """Show how peeking inflates false positives."""
    n_samples = 5000
    peek_points = [500, 1000, 2000, 3000, 4000, 5000]
    true_rate = 0.05
    
    false_positives_any = 0
    false_positives_final = 0
    
    for _ in range(n_simulations):
        # Generate A/A test data (no real difference)
        control = np.random.binomial(1, true_rate, n_samples)
        treatment = np.random.binomial(1, true_rate, n_samples)
        
        significant_at_any = False
        significant_at_final = False
        
        for peek in peek_points:
            c_sum = control[:peek].sum()
            t_sum = treatment[:peek].sum()
            
            # Guard against an empty sample before dividing
            if peek == 0:
                continue
            
            p_c = c_sum / peek
            p_t = t_sum / peek
            
            p_pool = (c_sum + t_sum) / (2 * peek)
            se = np.sqrt(2 * p_pool * (1 - p_pool) / peek)
            
            if se > 0:
                z = abs(p_t - p_c) / se
                p_value = 2 * (1 - norm.cdf(z))
                
                if p_value < 0.05:
                    significant_at_any = True
                    if peek == n_samples:
                        significant_at_final = True
        
        if significant_at_any:
            false_positives_any += 1
        if significant_at_final:
            false_positives_final += 1
    
    print("Peeking Problem Demonstration")
    print("=" * 50)
    print(f"Simulations: {n_simulations}")
    print(f"\nFalse Positive Rate:")
    print(f"  Without peeking (final only): {false_positives_final/n_simulations:.1%}")
    print(f"  With peeking (any peek): {false_positives_any/n_simulations:.1%}")
    print(f"\nExpected rate: 5%")
    print(f"\nConclusion: Peeking inflates false positive rate significantly!")

demonstrate_peeking_problem()

### 7.2 Multiple Testing Problem

In [None]:
def demonstrate_multiple_testing():
    """Show how testing multiple metrics inflates false positives."""
    n_metrics = [1, 5, 10, 20]
    alpha = 0.05
    
    print("Multiple Testing Problem")
    print("=" * 50)
    print(f"{'# Metrics':<15} {'P(At least 1 FP)':<20} {'Correction Needed'}")
    print("-" * 50)
    
    for n in n_metrics:
        # P(at least one false positive) = 1 - P(no false positives), assuming independent tests
        p_at_least_one_fp = 1 - (1 - alpha) ** n
        bonferroni_alpha = alpha / n
        
        print(f"{n:<15} {p_at_least_one_fp:<20.1%} alpha = {bonferroni_alpha:.4f}")
    
    print("\nSolution: Use Bonferroni correction (alpha / n_tests)")

demonstrate_multiple_testing()
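
To show how the correction changes decisions, here is a minimal sketch that applies it to a set of p-values. The metric names and p-values are made up for illustration, and `bonferroni_significant` is a hypothetical helper rather than part of any library.

```python
def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag a metric as significant only if its p-value is below alpha / number_of_tests."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values for four metrics measured in one experiment
p_values = {"ctr": 0.004, "conversion": 0.030, "revenue": 0.200, "bounce_rate": 0.049}

naive = {name: p < 0.05 for name, p in p_values.items()}
corrected = bonferroni_significant(p_values)

print("Uncorrected:", naive)
print("Bonferroni: ", corrected)
```

With four tests the per-metric threshold drops to 0.0125, so only the strongest result in this made-up set stays significant. Bonferroni is conservative, and that is the price of keeping the chance of any false positive at or below alpha.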

---

## 8. Hands-on Exercises

### Exercise 1: Complete A/B Test Analysis

In [None]:
def run_complete_ab_analysis(control_visitors, control_conversions,
                              treatment_visitors, treatment_conversions,
                              alpha=0.05):
    """Run complete A/B test analysis."""
    
    p_c = control_conversions / control_visitors
    p_t = treatment_conversions / treatment_visitors
    
    print("A/B Test Analysis Report")
    print("=" * 60)
    
    # Basic stats
    print("\n1. Sample Summary")
    print(f"   Control: {control_visitors:,} visitors, {control_conversions:,} conversions")
    print(f"   Treatment: {treatment_visitors:,} visitors, {treatment_conversions:,} conversions")
    
    print("\n2. Conversion Rates")
    print(f"   Control: {p_c:.4f} ({p_c:.2%})")
    print(f"   Treatment: {p_t:.4f} ({p_t:.2%})")
    
    # Statistical test
    analyzer = ABTestAnalyzer(alpha=alpha)
    results = analyzer.z_test_proportions(control_visitors, p_c, treatment_visitors, p_t)
    
    print("\n3. Lift Analysis")
    print(f"   Absolute Lift: {results['absolute_lift']:.4f}")
    print(f"   Relative Lift: {results['relative_lift']:.2%}")
    
    print("\n4. Statistical Significance")
    print(f"   Z-Statistic: {results['z_statistic']:.4f}")
    print(f"   P-Value: {results['p_value']:.4f}")
    print(f"   95% CI: [{results['ci_lower']:.4f}, {results['ci_upper']:.4f}]")
    print(f"   Significant at alpha={alpha}: {'YES' if results['significant'] else 'NO'}")
    
    # Recommendation
    print("\n5. Recommendation")
    if results['significant'] and results['absolute_lift'] > 0:
        print("   -> SHIP IT! Treatment shows significant improvement.")
    elif results['significant'] and results['absolute_lift'] < 0:
        print("   -> DO NOT SHIP. Treatment performs significantly worse.")
    else:
        print("   -> No significant difference detected. Check that the planned sample size was reached before concluding.")

# Example usage
run_complete_ab_analysis(
    control_visitors=50000,
    control_conversions=2500,
    treatment_visitors=50000,
    treatment_conversions=2750
)

### Exercise 2: Power Analysis

In [None]:
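def power_analysis_table(baseline_rate, daily_traffic, traffic_allocation=0.5):
    """Generate power analysis table for different scenarios.

    traffic_allocation is the fraction of daily traffic entered into the
    experiment; that share is then split evenly between the two groups.
    """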
    
    calc = SampleSizeCalculator()
    mde_options = [0.05, 0.10, 0.15, 0.20]
    
    print(f"Power Analysis (Baseline = {baseline_rate:.1%}, Daily Traffic = {daily_traffic:,})")
    print("=" * 70)
    print(f"{'MDE':<10} {'Sample/Group':<15} {'Total Sample':<15} {'Duration (days)':<15}")
    print("-" * 70)
    
    for mde in mde_options:
        n = calc.for_proportions(baseline_rate, mde)
        total = 2 * n
        daily_per_group = daily_traffic * traffic_allocation * 0.5
        days = int(np.ceil(n / daily_per_group))
        
        print(f"{mde:<10.0%} {n:<15,} {total:<15,} {days:<15}")

power_analysis_table(baseline_rate=0.05, daily_traffic=100000)

---

## 9. Summary

### Key Takeaways

1. **Online vs Offline Metrics**
   - Offline metrics don't always correlate with business impact
   - Always validate with online experiments

2. **A/B Testing Fundamentals**
   - Randomize users properly
   - Calculate sample size before running
   - Run until the planned sample size is reached instead of stopping at the first significant result

3. **Statistical Analysis**
   - Use appropriate tests (z-test for proportions, t-test for means)
   - Report confidence intervals, not just p-values
   - Consider practical significance

4. **Common Pitfalls**
   - Don't peek at results early
   - Correct for multiple testing
   - Account for novelty effects

5. **Advanced Methods**
   - Multi-armed bandits for exploration/exploitation
   - Shadow deployment to run a new model on live traffic without serving its output to users
   - Canary releases for gradual rollout

### Next Steps

In the next module, we'll cover **Deployment Strategies** to learn how to deploy models to production safely and efficiently.