# Sample Size and Power Analysis

## Learning Objectives
By the end of this notebook, you will understand:
- Why sample size matters and how to calculate it
- The relationship between power, sample size, and effect size
- How to calculate the minimum detectable effect (MDE)
- Trade-offs in experiment design

## Prerequisites
- Completed `01_ab_testing_fundamentals.ipynb`
- Understanding of p-values and confidence intervals

In [None]:
# Setup
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from experiments import (
    sample_size_proportion,
    sample_size_continuous,
    power_proportion,
    power_continuous,
    mde_proportion,
    mde_continuous,
    plot_power_curve,
    plot_sample_size_curve,
    create_sample_size_table,
)

np.random.seed(42)

---
## 1. The Power Triangle

Three key quantities are interconnected in experiment design (for a fixed significance level α):

```
        Sample Size (n)
              ▲
             / \
            /   \
           /     \
          /       \
   Power (1-β) ◄---► Effect Size (d)
```

**Given any two, you can calculate the third:**

| Given | Calculate | Question Answered |
|-------|-----------|-------------------|
| Effect size + Power | Sample size | "How many users do I need?" |
| Sample size + Power | Effect size (MDE) | "What's the smallest effect I can detect?" |
| Sample size + Effect size | Power | "What's my chance of detecting the effect?" |
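
Holding α fixed, the triangle can be written down with a standard normal approximation. The sketch below assumes a two-sided test comparing two equal-sized groups, with $d$ as the standardized effect size (the difference divided by the standard deviation), $n$ as the sample size per group, $z_q$ as the standard normal quantile, and $\Phi$ as the standard normal CDF. It is a textbook approximation for intuition, not necessarily the formula used by the helpers in `experiments`.

$$
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}
\quad\Longleftrightarrow\quad
d \approx (z_{1-\alpha/2} + z_{1-\beta})\sqrt{\frac{2}{n}}
\quad\Longleftrightarrow\quad
1-\beta \approx \Phi\!\left(d\sqrt{\frac{n}{2}} - z_{1-\alpha/2}\right)
$$

Each form is the same equation solved for a different corner of the triangle.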

---
## 2. Sample Size for Proportions (Conversion Rates)

The most common A/B test: comparing conversion rates.

### Formula (Intuition)
Sample size depends on:
- **Baseline rate**: Variance is p(1 − p), so rates closer to 50% are noisier and need more data for the same absolute effect
- **Effect size**: Smaller effects need more data
- **Alpha (α)**: A lower α makes the test more stringent and needs more data
- **Power (1-β)**: Higher power needs more data
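
To make these dependencies concrete, here is a minimal sketch of one common normal-approximation formula for a two-sided, two-proportion z-test with unpooled variances. The function name `approx_sample_size_proportion` is hypothetical, and the `experiments.sample_size_proportion` helper may use a different variant (for example pooled variance), so the numbers need not match exactly.

```python
from math import ceil
from statistics import NormalDist


def approx_sample_size_proportion(baseline_rate, mde, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided, two-proportion z-test.

    Illustrative only: an assumed textbook formula, not the library's code.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde                       # treatment rate under the alternative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)   # unpooled Bernoulli variances
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / mde ** 2)


n_2pp = approx_sample_size_proportion(0.10, 0.02)
n_1pp = approx_sample_size_proportion(0.10, 0.01)
print(f"2 pp lift: {n_2pp:,} users per group")
print(f"1 pp lift: {n_1pp:,} users per group ({n_1pp / n_2pp:.1f}x as many)")
```

Because `mde` is squared in the denominator, halving it multiplies the required sample size by roughly four.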

In [None]:
# Example: Calculate sample size for conversion rate test

# Current conversion rate is 10%, we want to detect a 2% absolute increase
result = sample_size_proportion(
    baseline_rate=0.10,  # 10% current conversion rate
    mde=0.02,            # 2% absolute minimum detectable effect
    alpha=0.05,          # 5% significance level
    power=0.8            # 80% power
)

print(result)
print(f"\nWith {result.sample_size_per_group:,} users per group:")
print("- If the true lift is 2 percentage points, you have an 80% chance of detecting it")
print("- If there is no true effect, you have a 5% chance of a false positive (p < 0.05)")

In [None]:
# How sample size changes with effect size
baseline = 0.10
effects = [0.01, 0.02, 0.03, 0.04, 0.05]  # 1% to 5% absolute improvement

print("Sample Size Required for Different Effect Sizes")
print("="*60)
print(f"Baseline rate: {baseline:.1%}")
print(f"Alpha: 0.05, Power: 0.80")
print("-"*60)
print(f"{'MDE (absolute)':<20} {'MDE (relative)':<20} {'Sample Size/Group':<20}")
print("-"*60)

for mde in effects:
    result = sample_size_proportion(baseline_rate=baseline, mde=mde)
    relative_mde = mde / baseline
    # Width and alignment must precede the precision and type in a format spec
    print(f"{mde:<20.1%} {relative_mde:<20.1%} {result.sample_size_per_group:,}")

In [None]:
# Visualize: Sample size vs Effect size
fig = plot_sample_size_curve(
    baseline_rate=0.10,
    mde_range=(0.005, 0.05),
    metric_type='proportion'
)
plt.title('Sample Size Grows Quickly for Smaller Effects')
plt.show()

print("Key insight: Detecting small effects requires MUCH more data!")
print("Halving the effect size roughly quadruples the sample size.")

---
## 3. Sample Size for Continuous Metrics

For metrics like revenue, time on site, or engagement scores.
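
For a continuous metric, the same normal approximation depends only on the standardized effect (MDE divided by the standard deviation). The sketch below assumes a two-sided test with equal group sizes and equal variances. `approx_sample_size_continuous` is a hypothetical name, and `experiments.sample_size_continuous` may apply a different formula or a t-distribution correction.

```python
from math import ceil
from statistics import NormalDist


def approx_sample_size_continuous(baseline_std, mde, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided, two-sample z-test on means.

    Illustrative only: an assumed textbook formula, not the library's code.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    d = mde / baseline_std                         # standardized effect (Cohen's d)
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# Standard deviation of $30 and a $5 minimum detectable effect
n_revenue = approx_sample_size_continuous(baseline_std=30, mde=5)
print(f"Approximate users per group: {n_revenue:,}")
```

Note that the baseline mean does not appear: only the ratio of the effect to the spread matters.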

In [None]:
# Example: Revenue per user test
# Current average: $50, standard deviation: $30
# Want to detect $5 increase

result = sample_size_continuous(
    baseline_std=30,   # Standard deviation of revenue
    mde=5,             # Minimum detectable effect ($5)
    alpha=0.05,
    power=0.8
)

print(result)
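cohens_d = 5 / 30  # MDE divided by the standard deviation
print(f"\nCohen's d = {cohens_d:.2f} (MDE / std)")
# Cohen's conventional benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large
size_label = 'very small' if cohens_d < 0.2 else 'small' if cohens_d < 0.5 else 'medium' if cohens_d < 0.8 else 'large'
print(f"This is considered a {size_label} effect size")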

---
## 4. Power Curves

Power curves show how power changes with effect size for a fixed sample size.
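
Each point on such a curve can be approximated directly. The sketch below assumes a two-sided, two-proportion z-test with unpooled variances and ignores the negligible chance of rejecting in the wrong direction. `approx_power_proportion` is a hypothetical name, so its values may differ slightly from `experiments.power_proportion`.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_proportion(n_per_group, baseline_rate, mde, alpha=0.05):
    """Approximate power of a two-sided, two-proportion z-test.

    Illustrative only: an assumed textbook formula, not the library's code.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde
    # Standard error of the difference in observed rates
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability that the test statistic clears the critical value
    return NormalDist().cdf(abs(mde) / se - z_alpha)


for lift in (0.01, 0.02, 0.03):
    power = approx_power_proportion(3000, 0.10, lift)
    print(f"Lift of {lift:.0%} points: power is about {power:.0%}")
```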

In [None]:
# Generate power curve
fig = plot_power_curve(
    baseline_rate=0.10,
    n_per_group=3000,
    mde_range=(0.005, 0.05),
    metric_type='proportion'
)
plt.title('Power Curve: What Effects Can You Detect with 3,000 Users/Group?')
plt.show()

In [None]:
# Compare power at different sample sizes
fig, ax = plt.subplots(figsize=(10, 6))

baseline_rate = 0.10
mde_values = np.linspace(0.005, 0.05, 50)
sample_sizes = [1000, 3000, 5000, 10000]

for n in sample_sizes:
    powers = [power_proportion(n, baseline_rate, mde) for mde in mde_values]
    ax.plot(mde_values * 100, powers, linewidth=2, label=f'n = {n:,}')

ax.axhline(y=0.8, color='red', linestyle='--', alpha=0.7, label='80% Power')
ax.set_xlabel('Effect Size (percentage points)')
ax.set_ylabel('Power')
ax.set_title('Power Comparison: More Users = Better Detection')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()

---
## 5. Minimum Detectable Effect (MDE)

Given your sample size, what's the smallest effect you can detect at your target power and significance level?

This is crucial for deciding whether an experiment is worth running.
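
A closed-form approximation makes the scaling visible. The sketch below assumes a two-sided test and uses the baseline variance for both groups, which is a simplification that slightly understates the MDE when the treatment rate moves toward 50%. `approx_mde_proportion` is a hypothetical name, and `experiments.mde_proportion` may solve the problem differently (for example numerically).

```python
from math import sqrt
from statistics import NormalDist


def approx_mde_proportion(n_per_group, baseline_rate, alpha=0.05, power=0.8):
    """Approximate absolute MDE for a two-sided, two-proportion z-test.

    Illustrative only: assumes both groups share the baseline variance.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return (z_alpha + z_beta) * se


mde_5k = approx_mde_proportion(5000, 0.10)
mde_20k = approx_mde_proportion(20000, 0.10)
print(f"n = 5,000:  MDE is about {mde_5k:.2%} (absolute)")
print(f"n = 20,000: MDE is about {mde_20k:.2%} (absolute)")
```

The MDE shrinks with the square root of the sample size, so quadrupling the traffic only halves the detectable effect.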

In [None]:
# Example: You have 5,000 users per group. What's your MDE?

n_per_group = 5000
baseline_rate = 0.10

mde = mde_proportion(
    n_per_group=n_per_group,
    baseline_rate=baseline_rate,
    alpha=0.05,
    power=0.8
)

print(f"With {n_per_group:,} users per group:")
print(f"  Baseline rate: {baseline_rate:.1%}")
print(f"  MDE (absolute): {mde:.2%}")
print(f"  MDE (relative): {mde/baseline_rate:.1%}")
print(f"\nMeaning: You can reliably detect a {mde:.2%} absolute improvement")
print(f"         or a {mde/baseline_rate:.1%} relative improvement")

In [None]:
# MDE for different sample sizes
sample_sizes = [1000, 2000, 5000, 10000, 20000, 50000]

print("MDE for Different Sample Sizes")
print("="*60)
print(f"Baseline: {baseline_rate:.1%}, Alpha: 0.05, Power: 0.80")
print("-"*60)

mdes = []
for n in sample_sizes:
    mde = mde_proportion(n, baseline_rate)
    mdes.append(mde)
    print(f"n = {n:>6,}  →  MDE = {mde:.3%} ({mde/baseline_rate:.1%} relative)")

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(sample_sizes, [m*100 for m in mdes], 'bo-', markersize=8)
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('MDE (percentage points)')
ax.set_title('Minimum Detectable Effect vs Sample Size')
ax.set_xscale('log')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## 6. Sample Size Tables

Handy reference tables for quick planning.

In [None]:
# Create a sample size lookup table
table = create_sample_size_table(
    baseline_rate=0.10,
    mde_values=[0.01, 0.02, 0.03, 0.04, 0.05],
    power_values=[0.7, 0.8, 0.9, 0.95],
    metric_type='proportion'
)

print("Sample Size Table (per group)")
print("Baseline rate: 10%, Alpha: 5%")
print("="*60)
print(table.to_string())

---
## 7. Practical Considerations

### How to Choose Your Parameters

**Effect Size (MDE):**
- What's the minimum improvement that would justify implementation?
- Consider: development cost, opportunity cost, risk

**Power:**
- 80% is standard, 90% for critical experiments
- Higher power = larger sample = longer test

**Alpha:**
- 5% is standard
- Lower for high-stakes decisions (e.g., 1%)
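
The cost of stricter settings can be estimated before running any full calculation. Under the normal approximation, sample size scales with the square of the sum of the two z-quantiles, so the ratio below gives an approximate multiplier relative to the standard α = 0.05 and 80% power. `relative_sample_size` is a hypothetical helper, not part of `experiments`.

```python
from statistics import NormalDist


def relative_sample_size(alpha, power, ref_alpha=0.05, ref_power=0.8):
    """Approximate sample size multiplier relative to a reference design.

    Illustrative only: assumes a two-sided test and the normal approximation.
    """
    def z_sum_squared(a, p):
        z_alpha = NormalDist().inv_cdf(1 - a / 2)
        z_beta = NormalDist().inv_cdf(p)
        return (z_alpha + z_beta) ** 2

    return z_sum_squared(alpha, power) / z_sum_squared(ref_alpha, ref_power)


for alpha, power in [(0.05, 0.8), (0.05, 0.9), (0.01, 0.8), (0.01, 0.9)]:
    factor = relative_sample_size(alpha, power)
    print(f"alpha = {alpha:.2f}, power = {power:.0%}: about {factor:.2f}x the sample size")
```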

### The Business Trade-off

```
Smaller MDE  ←→  Larger Sample Size  ←→  Longer Test Duration
     ↓                  ↓                        ↓
More sensitive    More costly           Delayed decisions
```

In [None]:
# Example: What MDE is realistic given your traffic?

daily_visitors = 10000  # 10k visitors per day
test_duration_days = 14  # 2 week test
split_ratio = 0.5  # 50% in each group

# Assumes every daily visitor is a new, unique user; returning visitors would lower this count
total_per_group = int(daily_visitors * test_duration_days * split_ratio)

print(f"Traffic Planning")
print("="*50)
print(f"Daily visitors: {daily_visitors:,}")
print(f"Test duration: {test_duration_days} days")
print(f"Split: {split_ratio:.0%} / {1-split_ratio:.0%}")
print(f"Users per group: {total_per_group:,}")
print()

# Calculate MDE
baseline = 0.08
mde = mde_proportion(total_per_group, baseline)

print(f"With baseline rate of {baseline:.1%}:")
print(f"  MDE (absolute): {mde:.2%}")
print(f"  MDE (relative): {mde/baseline:.1%}")
print()

# Is this practical?
print("Questions to ask:")
print(f"  - Is a {mde/baseline:.1%} improvement meaningful for the business?")
print(f"  - Do we expect the change to have at least a {mde:.2%} effect?")
print(f"  - If not, we need more traffic or a longer test.")

---
## 8. Key Takeaways

1. **Always calculate sample size BEFORE starting** - don't guess!

2. **Smaller effects need more data** - sample size roughly quadruples when you halve the MDE

3. **MDE is critical for experiment planning** - if your MDE is larger than a meaningful business impact, the test may not be worth running

4. **Trade-offs are inevitable** - balance sensitivity, duration, and cost

5. **Use power curves** to understand what you can and cannot detect

## Next Steps
- `03_analyzing_ab_results.ipynb` - Deep dive into result interpretation
- `05_interval_hypothesis_testing.ipynb` - Beyond "is there a difference?"