# A/B Testing Fundamentals

## Learning Objectives
By the end of this notebook, you will understand:
- What A/B testing is and when to use it
- Key statistical concepts: null/alternative hypotheses, p-values, Type I/II errors
- How to set up, run, and analyze a basic A/B test
- Common pitfalls to avoid

## Prerequisites
- Basic understanding of statistics (mean, standard deviation)
- Python basics

## Real-World Use Cases
- Testing a new website design to see if it increases conversions
- Comparing two email subject lines for click-through rate
- Evaluating whether a new product feature improves user engagement

---
## 1. What is A/B Testing?

**A/B testing** (also called split testing) is a method to compare two versions of something to determine which performs better.

### The Basic Setup
```
Population of Users
       |
   Random Split
      /    \
Control    Treatment
  (A)         (B)
   |           |
Measure    Measure
Metric     Metric
   |           |
   Compare Results
```

**Control (A)**: The current/baseline version  
**Treatment (B)**: The new version you're testing

### Why Randomization Matters
Random assignment makes the groups comparable on average, in both observed and unobserved characteristics. Without randomization, differences in outcomes might be due to differences between the users, not the treatment.
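
To make this concrete, here is a minimal simulation sketch. The "power user" trait, its 30% share, and the opt-in probabilities are all made-up assumptions for illustration. It compares random assignment with self-selection and reports how many power users end up in each group. It uses its own random generator so it does not disturb the notebook's global seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users = 10_000

# Hypothetical pre-existing trait: about 30% of users are "power users"
is_power_user = rng.random(n_users) < 0.30

# Random assignment: every user has a 50% chance of getting the treatment
in_treatment = rng.random(n_users) < 0.5

# Self-selection (assumed): power users opt in to the new version far more often
opt_in_prob = np.where(is_power_user, 0.8, 0.2)
self_selected = rng.random(n_users) < opt_in_prob

def power_user_share(assigned):
    """Share of power users in the treatment group and in the control group."""
    return is_power_user[assigned].mean(), is_power_user[~assigned].mean()

random_treat, random_ctrl = power_user_share(in_treatment)
biased_treat, biased_ctrl = power_user_share(self_selected)

print(f"Random assignment: treatment {random_treat:.1%}, control {random_ctrl:.1%}")
print(f"Self-selection:    treatment {biased_treat:.1%}, control {biased_ctrl:.1%}")
```

Under random assignment the two shares should be close. Under self-selection the treatment group is dominated by power users, so any difference in outcomes would be confounded with user type.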

In [None]:
# Setup - Import libraries
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Import our experimentation library
from experiments import ABTest, MetricType, TestType

# Set random seed for reproducibility
np.random.seed(42)

---
## 2. Key Statistical Concepts

### Hypothesis Testing Framework

Every A/B test is framed as a hypothesis test:

- **Null Hypothesis (H₀)**: There is NO difference between A and B
- **Alternative Hypothesis (H₁)**: There IS a difference between A and B

We collect data and calculate the probability of seeing results at least as extreme as ours IF the null hypothesis were true. If this probability (the p-value) falls below a threshold chosen in advance (commonly 0.05), we "reject" the null hypothesis.

### Visualizing the Concept
Imagine flipping a coin 100 times. If it's fair, you'd expect about 50 heads. But what if you got 65 heads? Is the coin biased, or did you just get lucky?

In [None]:
# Visualizing the sampling distribution under the null hypothesis
fig, ax = plt.subplots(figsize=(10, 6))

# Simulate 10,000 experiments with a fair coin (100 flips each)
n_simulations = 10000
n_flips = 100
results = np.random.binomial(n_flips, 0.5, n_simulations)  # Fair coin = 50% probability

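# Plot histogram with one bin per integer outcome, so bar heights are probabilities
bins = np.arange(results.min(), results.max() + 2) - 0.5
ax.hist(results, bins=bins, density=True, alpha=0.7, edgecolor='black')
ax.axvline(x=50, color='green', linestyle='--', linewidth=2, label='Expected (H₀)')
ax.axvline(x=65, color='red', linestyle='-', linewidth=2, label='Observed (65 heads)')

# Shade the two-sided rejection region: outcomes at least as extreme as 65 heads
x_min, x_max = ax.get_xlim()
ax.axvspan(65, x_max, color='red', alpha=0.15, label='At least as extreme (two-sided)')
ax.axvspan(x_min, 35, color='red', alpha=0.15)
ax.set_xlim(x_min, x_max)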

ax.set_xlabel('Number of Heads', fontsize=12)
ax.set_ylabel('Probability', fontsize=12)
ax.set_title('Sampling Distribution Under Null Hypothesis (Fair Coin)', fontsize=14)
ax.legend(fontsize=10)

# Calculate p-value
p_value = 2 * (1 - stats.binom.cdf(64, n_flips, 0.5))  # Two-tailed
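# Position the label in axes coordinates so it stays inside the plot area
ax.text(0.03, 0.92, f'p-value = {p_value:.4f}', fontsize=12, va='top',
        transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))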

plt.tight_layout()
plt.show()

print(f"Getting 65+ or 35- heads with a fair coin has probability: {p_value:.4f}")
print(f"This is {'significant' if p_value < 0.05 else 'not significant'} at the 5% level")

### Type I and Type II Errors

| Decision | H₀ is True (No Effect) | H₀ is False (Real Effect) |
|----------|------------------------|---------------------------|
| Reject H₀ | **Type I Error** (False Positive) | Correct! |
| Don't Reject H₀ | Correct! | **Type II Error** (False Negative) |

- **α (alpha)**: Probability of Type I error (typically 0.05 or 5%)
- **β (beta)**: Probability of Type II error
- **Power = 1 - β**: Probability of detecting a real effect (typically 0.80 or 80%)

### The Trade-off
- Lowering α reduces false positives but, for a fixed sample size and effect size, increases false negatives
- Increasing power reduces false negatives but requires larger samples (or a larger effect to detect)
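
The trade-off can be sketched numerically. The helper below, `approx_power`, is a hypothetical function written for this illustration (it is not part of the `experiments` library). It uses the normal approximation for a two-sided two-proportion z-test, with assumed rates of 10% and 12%.

```python
import numpy as np
from scipy import stats

def approx_power(p_control, p_treatment, n_per_group, alpha):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    se = np.sqrt(p_control * (1 - p_control) / n_per_group
                 + p_treatment * (1 - p_treatment) / n_per_group)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    z_effect = abs(p_treatment - p_control) / se
    # Probability that the test statistic lands beyond either critical value
    return stats.norm.sf(z_crit - z_effect) + stats.norm.cdf(-z_crit - z_effect)

for alpha in (0.10, 0.05, 0.01):
    for n in (1_000, 4_000, 16_000):
        power = approx_power(0.10, 0.12, n, alpha)
        print(f"alpha={alpha:.2f}  n/group={n:>6,}  power={power:.1%}  beta={1 - power:.1%}")
```

Reading down the output, a smaller α lowers power at every sample size, and a larger sample raises power at every α.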

In [None]:
# Visualizing Type I and Type II errors
fig, ax = plt.subplots(figsize=(12, 6))

# Create two distributions: null (no effect) and alternative (real effect)
x = np.linspace(-4, 8, 1000)
null_dist = stats.norm.pdf(x, 0, 1)  # H0: effect = 0
alt_dist = stats.norm.pdf(x, 2.5, 1)  # H1: effect = 2.5 (true effect)

# Critical value for alpha = 0.05 (one-sided)
critical_value = stats.norm.ppf(0.95)

# Plot distributions
ax.plot(x, null_dist, 'b-', linewidth=2, label='Null (H₀): No Effect')
ax.plot(x, alt_dist, 'g-', linewidth=2, label='Alternative (H₁): True Effect')

# Shade Type I error (alpha)
ax.fill_between(x, 0, null_dist, where=(x >= critical_value), 
                color='red', alpha=0.3, label=f'Type I Error (α = 5%)')

# Shade Type II error (beta)
ax.fill_between(x, 0, alt_dist, where=(x <= critical_value), 
                color='orange', alpha=0.3, label='Type II Error (β)')

# Critical value line
ax.axvline(x=critical_value, color='black', linestyle='--', linewidth=2, 
           label=f'Critical Value ({critical_value:.2f})')

ax.set_xlabel('Test Statistic', fontsize=12)
ax.set_ylabel('Probability Density', fontsize=12)
ax.set_title('Type I and Type II Errors Visualized', fontsize=14)
ax.legend(loc='upper right')
ax.set_ylim(0, 0.5)

# Add annotations
ax.annotate('Reject H₀\n(Declare winner)', xy=(3.5, 0.05), fontsize=10, ha='center')
ax.annotate('Do not reject H₀\n(No conclusion)', xy=(-1, 0.05), fontsize=10, ha='center')

plt.tight_layout()
plt.show()

# Calculate power
power = 1 - stats.norm.cdf(critical_value, 2.5, 1)
print(f"Power (probability of detecting the effect): {power:.1%}")

---
## 3. Running Your First A/B Test

Let's walk through a complete example. Imagine we're testing a new checkout button design to see if it improves conversion rate.

**Current (Control)**: Blue "Buy Now" button - 10% conversion rate  
**New (Treatment)**: Green "Complete Purchase" button - we want to detect a 2 percentage point (absolute) improvement, from 10% to 12%
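
Before running anything, it can help to estimate by hand how many users this setup needs. The sketch below uses a standard textbook formula for comparing two proportions. The function name `sample_size_two_proportions` is an assumption made for this example, and the `experiments` library may use a different formula, so its result can differ somewhat.

```python
import math
from scipy import stats

def sample_size_two_proportions(baseline_rate, mde, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion z-test (absolute MDE)."""
    p1 = baseline_rate
    p2 = baseline_rate + mde
    p_bar = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = stats.norm.ppf(power)            # quantile matching the desired power
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / mde ** 2)

approx_n = sample_size_two_proportions(baseline_rate=0.10, mde=0.02)
print(f"Approximate sample size per group: {approx_n:,}")
```

Because the MDE is squared in the denominator, halving it roughly quadruples the required sample.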

In [None]:
# Step 1: Create the test configuration
test = ABTest(
    alpha=0.05,      # 5% significance level
    power=0.8,       # 80% power
    metric_type=MetricType.PROPORTION,  # Conversion rate is a proportion
    test_type=TestType.TWO_SIDED        # We care about both increases and decreases
)

print("Test Configuration:")
print(f"  Significance level (α): {test.alpha}")
print(f"  Power (1-β): {test.power}")
print(f"  Metric type: {test.metric_type.value}")
print(f"  Test type: {test.test_type.value}")

In [None]:
# Step 2: Calculate required sample size
baseline_rate = 0.10  # Current 10% conversion rate
mde = 0.02           # Minimum detectable effect: 2 percentage points

sample_size = test.get_sample_size(baseline_rate=baseline_rate, mde=mde)

print(f"\nSample Size Calculation:")
print(f"  Baseline conversion rate: {baseline_rate:.1%}")
print(f"  Minimum detectable effect: {mde:.1%} (absolute)")
print(f"  Expected treatment rate: {baseline_rate + mde:.1%}")
print(f"  Relative lift: {mde/baseline_rate:.1%}")
print(f"\n  Required sample size per group: {sample_size:,}")
print(f"  Total sample size: {sample_size * 2:,}")

In [None]:
# Step 3: Simulate the experiment
# (In reality, you would collect real data)

# True effect: treatment is actually better by 2.5 percentage points
true_control_rate = 0.10
true_treatment_rate = 0.125  # 12.5%

# Simulate data collection
n_per_group = sample_size

control_conversions = np.random.binomial(n_per_group, true_control_rate)
treatment_conversions = np.random.binomial(n_per_group, true_treatment_rate)

print(f"\nSimulated Experiment Results:")
print(f"  Control: {control_conversions:,} conversions out of {n_per_group:,} ({control_conversions/n_per_group:.2%})")
print(f"  Treatment: {treatment_conversions:,} conversions out of {n_per_group:,} ({treatment_conversions/n_per_group:.2%})")

In [None]:
# Step 4: Analyze the results
result = test.analyze_proportions(
    control_conversions=control_conversions,
    control_total=n_per_group,
    treatment_conversions=treatment_conversions,
    treatment_total=n_per_group
)

# Display the result
print(result)

In [None]:
# Step 5: Visualize the results
fig = test.plot_results()
plt.show()

# Print full summary
print(test.summary())

---
## 4. Interpreting Results

### What the Numbers Mean

1. **P-value**: The probability of seeing a difference this large (or larger) if there really were no difference. 
   - p < 0.05 → "Statistically significant" (reject H₀)
   - p ≥ 0.05 → "Not statistically significant" (don't reject H₀)

2. **Confidence Interval (CI)**: A range of plausible values for the true effect.
   - If the 95% CI for the lift doesn't include 0, the result is significant at α = 0.05
   - Wider CI = more uncertainty about the size of the effect

3. **Lift**: The difference between treatment and control.
   - Absolute lift: Treatment rate - Control rate
   - Relative lift: (Treatment rate - Control rate) / Control rate
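
The three quantities above can be computed directly from conversion counts. This is a minimal sketch with made-up counts, using a pooled z-test for the p-value and a Wald interval for the confidence interval. The `experiments` library may compute these differently.

```python
import numpy as np
from scipy import stats

# Illustrative counts (made up for this example)
control_conversions, control_total = 500, 5_000
treatment_conversions, treatment_total = 600, 5_000

p_c = control_conversions / control_total
p_t = treatment_conversions / treatment_total

# Lift
absolute_lift = p_t - p_c
relative_lift = absolute_lift / p_c

# Two-sided p-value from a pooled two-proportion z-test
p_pool = (control_conversions + treatment_conversions) / (control_total + treatment_total)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / control_total + 1 / treatment_total))
z = absolute_lift / se_pool
p_value = 2 * stats.norm.sf(abs(z))

# 95% Wald confidence interval for the absolute lift (unpooled standard error)
se_diff = np.sqrt(p_c * (1 - p_c) / control_total + p_t * (1 - p_t) / treatment_total)
z_95 = stats.norm.ppf(0.975)
ci_low, ci_high = absolute_lift - z_95 * se_diff, absolute_lift + z_95 * se_diff

print(f"Absolute lift: {absolute_lift:.2%}, relative lift: {relative_lift:.1%}")
print(f"P-value: {p_value:.4f}")
print(f"95% CI for absolute lift: [{ci_low:.2%}, {ci_high:.2%}]")
```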

### Important: Statistical vs Practical Significance

A result can be statistically significant but practically meaningless:
- With a huge sample, even tiny differences become "significant"
- Always consider: Is this effect large enough to matter for the business?

In [None]:
# Demonstrating statistical vs practical significance
# Tiny effect with a huge sample size (10 million users per group)

huge_test = ABTest()
huge_result = huge_test.analyze_proportions(
    control_conversions=1_000_000,
    control_total=10_000_000,
    treatment_conversions=1_005_000,  # Only 0.05 percentage points higher
    treatment_total=10_000_000
)

print("Example: Statistically Significant but Practically Meaningless")
print("="*60)
print(f"Control rate: {huge_result.control_mean:.4%}")
print(f"Treatment rate: {huge_result.treatment_mean:.4%}")
print(f"Absolute difference: {huge_result.absolute_lift:.4%}")
print(f"P-value: {huge_result.p_value:.4f}")
print(f"Significant: {huge_result.is_significant}")
print(f"\nBut wait... is a 0.05 percentage point improvement worth implementing?")

---
## 5. Common Pitfalls to Avoid

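### Pitfall 1: Peeking at Results Too Early
Checking the results repeatedly and stopping the first time you see significance inflates your false positive rate well above the nominal α, because every extra look is another chance to cross the threshold by luck.

### Pitfall 2: Stopping Too Early
Decide the sample size before the test starts and run until you reach it. Running until you get a significant result all but guarantees you'll eventually get one, even if there's no real effect.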
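
A small simulation makes both pitfalls visible. The sketch below runs A/A tests (both groups share the same true rate, so every "significant" result is a false positive) and compares analyzing once at the end with stopping at the first significant look. The number of looks, users per look, and conversion rate are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 2_000      # simulated A/A tests (no true effect)
n_looks = 10               # interim analyses per experiment
users_per_look = 500       # new users per group between looks
rate = 0.10                # same conversion rate in both groups
alpha = 0.05

# Cumulative conversions per group at each look, shape (n_experiments, n_looks)
control = rng.binomial(users_per_look, rate, (n_experiments, n_looks)).cumsum(axis=1)
treatment = rng.binomial(users_per_look, rate, (n_experiments, n_looks)).cumsum(axis=1)
n = users_per_look * np.arange(1, n_looks + 1)   # cumulative users per group

# Pooled two-proportion z-test at every look
p_pool = (control + treatment) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (treatment - control) / n / se
p_values = 2 * stats.norm.sf(np.abs(z))

fpr_single_look = (p_values[:, -1] < alpha).mean()     # analyze once, at the end
fpr_peeking = (p_values < alpha).any(axis=1).mean()    # stop at the first p < alpha

print(f"False positive rate, single final analysis: {fpr_single_look:.1%}")
print(f"False positive rate, stop at first significant look: {fpr_peeking:.1%}")
```

The single final analysis should land near the nominal 5%, while the peeking strategy comes out substantially higher.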

### Pitfall 3: Sample Ratio Mismatch (SRM)
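If you expect a 50/50 split but observe 55/45 on a reasonably large sample, something is likely wrong with your randomization or data logging. Investigate before trusting any metric from the test.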
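
Whether a given imbalance is alarming depends on the sample size. One common way to check is a chi-square goodness-of-fit test, sketched here with `scipy.stats.chisquare`. The helper name `srm_p_value` is an assumption made for this example and is separate from the library's own SRM check.

```python
from scipy import stats

def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """P-value of a chi-square test that the observed split matches the expected one."""
    total = n_control + n_treatment
    expected = [total * expected_control_share, total * (1 - expected_control_share)]
    return stats.chisquare([n_control, n_treatment], f_exp=expected).pvalue

# The same 55/45 split at two very different sample sizes
print(f"55 vs 45 users:       p = {srm_p_value(55, 45):.3f}")
print(f"5,500 vs 4,500 users: p = {srm_p_value(5_500, 4_500):.2e}")
```

With 100 users, a 55/45 split is well within chance. With 10,000 users, the same ratio is essentially impossible under a true 50/50 assignment.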

### Pitfall 4: Multiple Comparisons
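Testing many metrics increases the chance of a false positive. If you test 20 metrics at α=0.05 and none has a real effect, you expect 1 false positive on average, and (if the metrics are independent) the chance of at least one is about 64%.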
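
The arithmetic behind this can be checked in a few lines. The sketch assumes all 20 metrics are independent and have no real effect. It also shows the Bonferroni correction, which is one simple (and conservative) way to control the problem.

```python
alpha = 0.05
n_metrics = 20

# Expected number of false positives if no metric has a real effect
expected_false_positives = n_metrics * alpha

# Probability of at least one false positive (assumes independent metrics)
familywise_error_rate = 1 - (1 - alpha) ** n_metrics

# Bonferroni correction: test each metric at alpha / n_metrics
bonferroni_alpha = alpha / n_metrics
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** n_metrics

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one false positive: {familywise_error_rate:.1%}")
print(f"Bonferroni per-metric alpha: {bonferroni_alpha:.4f}")
print(f"Chance of at least one false positive after Bonferroni: {fwer_bonferroni:.1%}")
```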

In [None]:
# Demonstrating sample ratio mismatch (SRM) detection
from experiments import detect_srm

print("Pitfall 3: Sample Ratio Mismatch Detection")
print("="*50)

# Good: No SRM
srm_good = detect_srm(n_control=5000, n_treatment=5100)
print(f"\nBalanced split (5000 vs 5100):")
print(f"  Expected ratio: {srm_good.expected_ratio}")
print(f"  Observed ratio: {srm_good.observed_ratio:.3f}")
print(f"  P-value: {srm_good.p_value:.4f}")
print(f"  SRM detected: {srm_good.is_mismatch}")

# Bad: SRM detected
srm_bad = detect_srm(n_control=5000, n_treatment=6000)
print(f"\nUnbalanced split (5000 vs 6000):")
print(f"  Expected ratio: {srm_bad.expected_ratio}")
print(f"  Observed ratio: {srm_bad.observed_ratio:.3f}")
print(f"  P-value: {srm_bad.p_value:.6f}")
print(f"  SRM detected: {srm_bad.is_mismatch} ⚠️ INVESTIGATE!")

---
## 6. Practice Exercises

### Exercise 1: Calculate Sample Size
You want to test a new email subject line. Your current open rate is 20%, and you want to detect a 3 percentage point absolute improvement (from 20% to 23%).

**Question**: How many emails do you need to send per group?

In [None]:
# Your code here
email_test = ABTest()

# TODO: Calculate sample size for email open rate test
# sample_size = email_test.get_sample_size(...)
# print(f"Required sample size per group: {sample_size}")

<details>
<summary>Click for solution</summary>

```python
email_test = ABTest()
sample_size = email_test.get_sample_size(baseline_rate=0.20, mde=0.03)
print(f"Required sample size per group: {sample_size}")
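# Answer: about 2,940 per group under the standard two-proportion formula
# (two-sided α=0.05, power=0.80); the library's exact figure may differ slightly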
```
</details>

### Exercise 2: Analyze Results
You ran a test with the following results:
- Control: 450 conversions out of 4,500 visitors
- Treatment: 520 conversions out of 4,600 visitors

**Question**: Is the treatment significantly better? What's the lift?

In [None]:
# Your code here
exercise_test = ABTest()

# TODO: Analyze the results
# result = exercise_test.analyze_proportions(...)
# print(result)
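
<details>
<summary>Click for one possible solution</summary>

This sketch works from first principles with SciPy rather than the `experiments` library, so you can compare it against whatever `analyze_proportions` reports. The helper `two_proportion_ztest` is a hypothetical name for this example. It runs a pooled two-sided z-test.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(conv_c, total_c, conv_t, total_t):
    """Pooled two-sided z-test for a difference between two conversion rates."""
    p_c, p_t = conv_c / total_c, conv_t / total_t
    p_pool = (conv_c + conv_t) / (total_c + total_t)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / total_c + 1 / total_t))
    z = (p_t - p_c) / se
    return p_c, p_t, z, 2 * stats.norm.sf(abs(z))

p_c, p_t, z, p_value = two_proportion_ztest(450, 4_500, 520, 4_600)

print(f"Control rate: {p_c:.2%}, treatment rate: {p_t:.2%}")
print(f"Absolute lift: {p_t - p_c:.2%}, relative lift: {(p_t - p_c) / p_c:.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print(f"Significant at the 5% level: {p_value < 0.05}")
```
</details>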

---
## 7. Key Takeaways

1. **A/B testing is a statistical framework** for comparing two options with proper randomization

2. **Plan your test before starting**: Calculate sample size, define success criteria

3. **P-value < 0.05 doesn't mean "important"**: Consider practical significance

4. **Avoid common pitfalls**: Don't peek, check for SRM, account for multiple comparisons

5. **Confidence intervals are your friend**: They show the range of plausible effects

## Further Reading
- Next notebook: `02_sample_size_power.ipynb` - Deep dive into power analysis
- `06_experiment_diagnostics.ipynb` - How to validate your experiments