# A/B Testing Framework - Complete Tutorial

**Author:** Anik Tahabilder  
**Project:** 15 of 22 - Kaggle ML Portfolio  
**Difficulty:** 7/10 | **Learning Value:** 9/10

---

## What Will You Learn?

This tutorial covers **statistical experiment design** for data-driven decisions:

| Topic | What You'll Understand |
|-------|------------------------|
| **Hypothesis Testing** | Null/Alternative hypotheses, p-values |
| **Statistical Tests** | t-test, z-test, chi-square, Mann-Whitney |
| **Sample Size** | Power analysis, minimum detectable effect |
| **Effect Size** | Cohen's d, practical significance |
| **Confidence Intervals** | Uncertainty quantification |
| **Common Pitfalls** | Multiple testing, peeking, Simpson's paradox |
| **Bayesian A/B Testing** | Alternative to frequentist approach |
| **Multi-Armed Bandits** | Adaptive experimentation |

---

## A/B Testing Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         A/B TESTING FRAMEWORK                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│                         ┌─────────────┐                                 │
│                         │   USERS     │                                 │
│                         │  (Traffic)  │                                 │
│                         └──────┬──────┘                                 │
│                                │                                        │
│                    ┌───────────┴───────────┐                            │
│                    │     RANDOM SPLIT      │                            │
│                    │       50/50           │                            │
│                    └───────────┬───────────┘                            │
│                                │                                        │
│              ┌─────────────────┴─────────────────┐                      │
│              │                                   │                      │
│              ▼                                   ▼                      │
│     ┌─────────────────┐                ┌─────────────────┐             │
│     │   CONTROL (A)   │                │  TREATMENT (B)  │             │
│     │                 │                │                 │             │
│     │  Current Design │                │   New Design    │             │
│     │  (Baseline)     │                │   (Variant)     │             │
│     └────────┬────────┘                └────────┬────────┘             │
│              │                                   │                      │
│              ▼                                   ▼                      │
│     ┌─────────────────┐                ┌─────────────────┐             │
│     │ Measure Metric  │                │ Measure Metric  │             │
│     │ (Conversion,    │                │ (Conversion,    │             │
│     │  Revenue, etc.) │                │  Revenue, etc.) │             │
│     └────────┬────────┘                └────────┬────────┘             │
│              │                                   │                      │
│              └───────────────┬───────────────────┘                      │
│                              │                                          │
│                              ▼                                          │
│                    ┌─────────────────────┐                              │
│                    │ STATISTICAL TEST    │                              │
│                    │                     │                              │
│                    │ Is B significantly  │                              │
│                    │ better than A?      │                              │
│                    └─────────────────────┘                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Table of Contents

1. [Part 1: A/B Testing Fundamentals](#part1)
2. [Part 2: Hypothesis Testing](#part2)
3. [Part 3: Statistical Tests](#part3)
4. [Part 4: Sample Size & Power Analysis](#part4)
5. [Part 5: Effect Size & Practical Significance](#part5)
6. [Part 6: Common Pitfalls](#part6)
7. [Part 7: Bayesian A/B Testing](#part7)
8. [Part 8: Multi-Armed Bandits](#part8)
9. [Part 9: Complete Framework](#part9)
10. [Part 10: Summary](#part10)

---

<a id='part1'></a>
# Part 1: A/B Testing Fundamentals

---

## 1.1 What is A/B Testing?

A/B testing is a **randomized controlled experiment** to compare two versions:

| Term | Definition |
|------|------------|
| **Control (A)** | Current/baseline version |
| **Treatment (B)** | New variant being tested |
| **Metric** | What we measure (conversion, revenue, CTR) |
| **Randomization** | Users randomly assigned to A or B |
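
The table above says users are randomly assigned to A or B. A common way to do this in practice (an assumption here, not something this notebook implements) is to hash a stable user ID, so the same user always lands in the same group. The sketch below uses only the standard library; `assign_group`, the experiment name `checkout_test`, and the `user_<i>` ID format are hypothetical names chosen for illustration.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "checkout_test",
                 treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing "<experiment>:<user_id>" gives a stable, roughly uniform
    bucket in [0, 1), so repeat visits get the same assignment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled to [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# Check the split on 10,000 hypothetical user IDs
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_group(f"user_{i}")] += 1
print(counts)
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who is in treatment for one test is not systematically in treatment for the next.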

## 1.2 When to Use A/B Testing

| Use Case | Example |
|----------|--------|
| **Product Changes** | New checkout flow, button color |
| **ML Models** | New recommendation algorithm |
| **Pricing** | Different price points |
| **Marketing** | Email subject lines, ad copy |
| **UI/UX** | Layout changes, new features |

## 1.3 Key Metrics

| Metric Type | Examples | Test Type |
|-------------|----------|----------|
| **Binary** | Conversion (yes/no), Click (yes/no) | Chi-square, Z-test |
| **Continuous** | Revenue, Time on site | t-test |
| **Count** | Page views, Purchases | Poisson test |

In [None]:
# ============================================================
# SETUP AND IMPORTS
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, t, chi2_contingency, mannwhitneyu
from scipy.stats import ttest_ind, fisher_exact
import warnings
warnings.filterwarnings('ignore')

# For power analysis
from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Display settings
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("="*70)
print("A/B TESTING FRAMEWORK - TUTORIAL")
print("="*70)
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print("\nAll libraries loaded!")

In [None]:
# ============================================================
# LOAD A/B TEST DATA
# ============================================================
print("="*70)
print("LOADING A/B TEST DATA")
print("="*70)

# ============================================================
# KAGGLE DATASET CONFIGURATION
# ============================================================
# Dataset: https://www.kaggle.com/datasets/adarsh0806/ab-testing-practice
# This is a real A/B testing dataset for practice

import os

USE_KAGGLE = os.path.exists('/kaggle/input')

df = None

# Try loading from Kaggle
if USE_KAGGLE:
    # Try different possible paths
    possible_paths = [
        '/kaggle/input/ab-testing-practice/ab_testing_data.csv',
        '/kaggle/input/ab-testing-practice/AB_Testing.csv',
        '/kaggle/input/ab-testing-practice'
    ]
    
    for path in possible_paths:
        if os.path.exists(path):
            if os.path.isdir(path):
                # List files in directory
                files = os.listdir(path)
                csv_files = [f for f in files if f.endswith('.csv')]
                if csv_files:
                    df = pd.read_csv(os.path.join(path, csv_files[0]))
                    print(f"✓ Loaded from: {os.path.join(path, csv_files[0])}")
            else:
                df = pd.read_csv(path)
                print(f"✓ Loaded from: {path}")
            break
    
    if df is None:
        print("Dataset not found. Add 'ab-testing-practice' dataset in Kaggle.")

# Try kagglehub if available and df is None
if df is None:
    try:
        import kagglehub
        from kagglehub import KaggleDatasetAdapter
        
        df = kagglehub.load_dataset(
            KaggleDatasetAdapter.PANDAS,
            "adarsh0806/ab-testing-practice",
            "",
        )
        print("✓ Loaded via kagglehub")
    except Exception as e:
        print(f"kagglehub not available: {e}")

# Fallback: Generate synthetic data
if df is None:
    print("\nGenerating synthetic A/B test data...")
    print("(Add 'ab-testing-practice' dataset in Kaggle for real data)")
    
    def generate_ab_data(n_control=5000, n_treatment=5000, 
                         control_rate=0.10, treatment_rate=0.12,
                         seed=42):
        """Generate synthetic A/B test data."""
        np.random.seed(seed)
        
        control_conversions = np.random.binomial(1, control_rate, n_control)
        treatment_conversions = np.random.binomial(1, treatment_rate, n_treatment)
        
        df = pd.DataFrame({
            'user_id': range(n_control + n_treatment),
            'group': ['control'] * n_control + ['treatment'] * n_treatment,
            'converted': np.concatenate([control_conversions, treatment_conversions])
        })
        
        df['time_on_site'] = np.where(
            df['group'] == 'control',
            np.random.exponential(120, len(df)),
            np.random.exponential(135, len(df))
        )
        
        df['revenue'] = np.where(
            df['converted'] == 1,
            np.random.lognormal(3.5, 0.8, len(df)),
            0
        )
        
        return df
    
    df = generate_ab_data()

# ============================================================
# STANDARDIZE COLUMN NAMES
# ============================================================
print(f"\nOriginal columns: {list(df.columns)}")

# Common column name mappings
column_mappings = {
    'variant': 'group',
    'Variant': 'group',
    'test_group': 'group',
    'experiment_group': 'group',
    'ab_group': 'group',
    'conversion': 'converted',
    'Conversion': 'converted',
    'convert': 'converted',
    'purchased': 'converted',
    'clicked': 'converted',
    'user': 'user_id',
    'User': 'user_id',
    'userid': 'user_id',
    'id': 'user_id'
}

# Rename columns if needed
for old_name, new_name in column_mappings.items():
    if old_name in df.columns and new_name not in df.columns:
        df = df.rename(columns={old_name: new_name})

# Standardize group values
if 'group' in df.columns:
    # Convert to lowercase
    df['group'] = df['group'].astype(str).str.lower().str.strip()
    
    # Map common variations
    group_mappings = {
        'a': 'control',
        'b': 'treatment',
        'control': 'control',
        'treatment': 'treatment',
        'test': 'treatment',
        'variant': 'treatment',
        'experiment': 'treatment',
        '0': 'control',
        '1': 'treatment'
    }
    df['group'] = df['group'].map(lambda x: group_mappings.get(x, x))

# Add time_on_site if not present
if 'time_on_site' not in df.columns:
    df['time_on_site'] = np.where(
        df['group'] == 'control',
        np.random.exponential(120, len(df)),
        np.random.exponential(135, len(df))
    )

# Add revenue if not present
if 'revenue' not in df.columns:
    if 'converted' in df.columns:
        df['revenue'] = np.where(
            df['converted'] == 1,
            np.random.lognormal(3.5, 0.8, len(df)),
            0
        )
    else:
        df['revenue'] = 0

print(f"Standardized columns: {list(df.columns)}")

# ============================================================
# DATA SUMMARY
# ============================================================
print(f"\n" + "="*50)
print("DATASET SUMMARY")
print("="*50)
print(f"Shape: {df.shape}")
print(f"\nGroup distribution:")
print(df['group'].value_counts())

print(f"\nSample data:")
print(df.head(10))

if 'converted' in df.columns:
    print(f"\nConversion rates by group:")
    print(df.groupby('group')['converted'].agg(['sum', 'count', 'mean']))

In [None]:
# Visualize the data
print("="*70)
print("DATA VISUALIZATION")
print("="*70)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Conversion rates by group
ax1 = axes[0]
conv_rates = df.groupby('group')['converted'].mean()
colors = ['steelblue', 'coral']
bars = ax1.bar(conv_rates.index, conv_rates.values, color=colors, edgecolor='black')
ax1.set_ylabel('Conversion Rate')
ax1.set_title('Conversion Rate by Group', fontweight='bold')
for bar, rate in zip(bars, conv_rates.values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
             f'{rate:.2%}', ha='center', fontweight='bold')

# 2. Revenue distribution
ax2 = axes[1]
for group, color in zip(['control', 'treatment'], colors):
    data = df[(df['group'] == group) & (df['revenue'] > 0)]['revenue']
    ax2.hist(data, bins=30, alpha=0.6, label=group.capitalize(), color=color, edgecolor='black')
ax2.set_xlabel('Revenue ($)')
ax2.set_ylabel('Frequency')
ax2.set_title('Revenue Distribution (Converters)', fontweight='bold')
ax2.legend()

# 3. Time on site
ax3 = axes[2]
df.boxplot(column='time_on_site', by='group', ax=ax3)
ax3.set_xlabel('Group')
ax3.set_ylabel('Time on Site (seconds)')
ax3.set_title('Time on Site by Group', fontweight='bold')
plt.suptitle('')  # Remove automatic title

plt.tight_layout()
plt.show()

---

<a id='part2'></a>
# Part 2: Hypothesis Testing

---

## 2.1 The Hypothesis Testing Framework

```
┌─────────────────────────────────────────────────────────────────────┐
│                    HYPOTHESIS TESTING                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Step 1: Define Hypotheses                                         │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ H₀ (Null):        No difference between A and B               │ │
│  │ H₁ (Alternative): There IS a difference (B ≠ A, B > A, B < A) │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Step 2: Choose Significance Level (α)                             │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ α = 0.05 (5%) is standard                                     │ │
│  │ This is the probability of false positive (Type I error)     │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Step 3: Collect Data & Calculate Test Statistic                   │
│                                                                     │
│  Step 4: Calculate p-value                                         │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ p-value = P(data this extreme or more | H₀ is true)           │ │
│  │ Small p-value → Evidence against H₀                           │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Step 5: Make Decision                                             │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │ if p-value < α: Reject H₀ (statistically significant)        │ │
│  │ if p-value ≥ α: Fail to reject H₀ (not significant)          │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## 2.2 Types of Errors

| | H₀ True (No Effect) | H₀ False (Real Effect) |
|---|---|---|
| **Reject H₀** | Type I Error (α) - False Positive | Correct! (Power = 1-β) |
| **Fail to Reject H₀** | Correct! | Type II Error (β) - False Negative |
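
To make the error table concrete, the following sketch simulates many experiments and counts how often a two-sided z-test rejects H₀. With identical true rates (an A/A test) the rejection rate estimates the Type I error; with a real difference it estimates power. The rates (10% vs 16%), the group size of 500, and the helper names are illustrative assumptions, not values from the dataset.

```python
import math
import random
from statistics import NormalDist

def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value of the pooled z-test for two proportions."""
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence of a difference
    z = (x_b / n_b - x_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_rejection_rate(p_a, p_b, n, n_sims=1000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x_a = sum(rng.random() < p_a for _ in range(n))
        x_b = sum(rng.random() < p_b for _ in range(n))
        if two_proportion_p_value(x_a, n, x_b, n) < alpha:
            rejections += 1
    return rejections / n_sims

type_i_rate = simulate_rejection_rate(0.10, 0.10, n=500)  # H0 true
power = simulate_rejection_rate(0.10, 0.16, n=500)        # H0 false
print(f"Estimated Type I error rate: {type_i_rate:.3f}")
print(f"Estimated power:             {power:.3f}")
```

The first number should land near α = 0.05, and one minus the second number estimates β, the Type II error rate for this particular effect and sample size.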

## 2.3 One-Tailed vs Two-Tailed Tests

| Test Type | Hypothesis | Use When |
|-----------|------------|----------|
| **Two-tailed** | H₁: μ_B ≠ μ_A | Don't know if B is better or worse |
| **One-tailed (right)** | H₁: μ_B > μ_A | Only care if B is better |
| **One-tailed (left)** | H₁: μ_B < μ_A | Only care if B is worse |
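
The sketch below shows how the three alternatives in the table turn the same z statistic into different p-values. The counts (100/1,000 vs 125/1,000) are made up for illustration, and `z_test_p_values` is a hypothetical helper, not part of the notebook.

```python
import math
from statistics import NormalDist

def z_test_p_values(x_a, n_a, x_b, n_b):
    """Pooled z-test for two proportions under each alternative hypothesis."""
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (x_b / n_b - x_a / n_a) / se
    cdf = NormalDist().cdf
    return {
        "z": z,
        "two_sided": 2 * (1 - cdf(abs(z))),  # H1: p_B != p_A
        "right_tailed": 1 - cdf(z),          # H1: p_B > p_A
        "left_tailed": cdf(z),               # H1: p_B < p_A
    }

result = z_test_p_values(x_a=100, n_a=1000, x_b=125, n_b=1000)
for name, value in result.items():
    print(f"{name:>12}: {value:.4f}")
```

When B looks better than A (z > 0), the right-tailed p-value is exactly half the two-sided one, so a result can clear α = 0.05 one-tailed yet miss it two-tailed. That is why the direction of the test should be fixed before looking at the data.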

In [None]:
# ============================================================
# HYPOTHESIS TESTING EXAMPLE
# ============================================================
print("="*70)
print("HYPOTHESIS TESTING FOR A/B TEST")
print("="*70)

# Extract data
control = df[df['group'] == 'control']
treatment = df[df['group'] == 'treatment']

# Calculate statistics
n_control = len(control)
n_treatment = len(treatment)
conversions_control = control['converted'].sum()
conversions_treatment = treatment['converted'].sum()
rate_control = conversions_control / n_control
rate_treatment = conversions_treatment / n_treatment

print(f"""
STEP 1: Define Hypotheses
{'='*50}
H₀ (Null):        p_treatment = p_control
                  (No difference in conversion rates)

H₁ (Alternative): p_treatment ≠ p_control
                  (There is a difference)

STEP 2: Choose Significance Level
{'='*50}
α = 0.05 (5%)

STEP 3: Collect Data
{'='*50}
Control (A):   {conversions_control:,} / {n_control:,} = {rate_control:.4f} ({rate_control:.2%})
Treatment (B): {conversions_treatment:,} / {n_treatment:,} = {rate_treatment:.4f} ({rate_treatment:.2%})

Observed Difference: {rate_treatment - rate_control:.4f} ({(rate_treatment - rate_control):.2%})
Relative Lift: {(rate_treatment - rate_control) / rate_control * 100:.1f}%
""")

In [None]:
# Perform Z-test for proportions
print("="*70)
print("STEP 4 & 5: STATISTICAL TEST")
print("="*70)

# Z-test for two proportions
count = np.array([conversions_treatment, conversions_control])
nobs = np.array([n_treatment, n_control])

z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')

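# Calculate confidence interval for the difference.
# The pooled rate is what the z-test assumes under H₀; a confidence
# interval should use the unpooled standard error of the difference.
pooled_rate = (conversions_control + conversions_treatment) / (n_control + n_treatment)
se = np.sqrt(rate_control * (1 - rate_control) / n_control
             + rate_treatment * (1 - rate_treatment) / n_treatment)
diff = rate_treatment - rate_control
z_crit = norm.ppf(0.975)  # ≈ 1.96 for a 95% interval
ci_low = diff - z_crit * se
ci_high = diff + z_crit * se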

print(f"""
Z-TEST FOR PROPORTIONS
{'='*50}
Test Statistic (Z): {z_stat:.4f}
P-value:            {p_value:.6f}

95% Confidence Interval for Difference:
  [{ci_low:.4f}, {ci_high:.4f}]
  [{ci_low:.2%}, {ci_high:.2%}]

DECISION (α = 0.05)
{'='*50}
""")

if p_value < 0.05:
    print(f"p-value ({p_value:.6f}) < 0.05")
    print("→ REJECT H₀")
    print("→ The difference IS statistically significant!")
    print(f"→ Treatment conversion rate is significantly {'higher' if diff > 0 else 'lower'}")
else:
    print(f"p-value ({p_value:.6f}) ≥ 0.05")
    print("→ FAIL TO REJECT H₀")
    print("→ The difference is NOT statistically significant")
    print("→ Cannot conclude treatment is different from control")

In [None]:
# Visualize the hypothesis test
print("="*70)
print("HYPOTHESIS TEST VISUALIZATION")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Normal distribution with test statistic
ax1 = axes[0]
x = np.linspace(-4, 4, 1000)
y = norm.pdf(x)
ax1.plot(x, y, 'b-', linewidth=2, label='Null Distribution')
ax1.fill_between(x, y, where=(x < -1.96) | (x > 1.96), alpha=0.3, color='red', label='Rejection Region (α=0.05)')
ax1.axvline(x=z_stat, color='green', linestyle='--', linewidth=2, label=f'Observed Z = {z_stat:.2f}')
ax1.axvline(x=-1.96, color='red', linestyle=':', alpha=0.7)
ax1.axvline(x=1.96, color='red', linestyle=':', alpha=0.7)
ax1.set_xlabel('Z-score')
ax1.set_ylabel('Probability Density')
ax1.set_title('Hypothesis Test Visualization', fontweight='bold')
ax1.legend()

# 2. Confidence interval
ax2 = axes[1]
ax2.errorbar(1, diff, yerr=[[diff - ci_low], [ci_high - diff]], 
             fmt='o', markersize=10, capsize=10, capthick=2, color='steelblue')
ax2.axhline(y=0, color='red', linestyle='--', label='No Effect')
ax2.set_xlim(0.5, 1.5)
ax2.set_xticks([1])
ax2.set_xticklabels(['Treatment - Control'])
ax2.set_ylabel('Difference in Conversion Rate')
ax2.set_title('95% Confidence Interval', fontweight='bold')
ax2.legend()

# Add text annotation
ax2.text(1.1, diff, f'{diff:.4f}\n({diff:.2%})', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - If CI doesn't include 0: Statistically significant")
print("  - If CI includes 0: Not statistically significant")

In [None]:
# ============================================================
# P-VALUE EXPLAINED VISUALLY
# ============================================================
print("="*70)
print("UNDERSTANDING P-VALUE")
print("="*70)

print("""
WHAT IS A P-VALUE?
==================
The p-value is the probability of observing data as extreme (or more extreme)
than what we actually observed, ASSUMING the null hypothesis is true.

Small p-value → Our observed data would be unlikely if H₀ were true
             → Evidence AGAINST the null hypothesis
             → Note: the p-value is NOT the probability that H₀ is true

Example interpretation:
- p-value = 0.03 means: "If there were truly NO difference between A and B,
  we'd see results this extreme only 3% of the time by random chance."
""")

# Create detailed p-value visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. P-value explanation with bell curve
ax1 = axes[0]
x = np.linspace(-4, 4, 1000)
y = norm.pdf(x)

# Plot the distribution
ax1.plot(x, y, 'b-', linewidth=2.5, label='Null Distribution\n(If H₀ were true)')
ax1.fill_between(x, y, alpha=0.1, color='blue')

# Mark the center as "More likely observations"
ax1.annotate('More likely\nobservations', xy=(0, 0.4), fontsize=10, ha='center',
            fontweight='bold')

# Mark the tails as "Very unlikely observations"
ax1.annotate('Very unlikely\nobservations', xy=(-2.8, 0.05), fontsize=9, ha='center',
            color='red')
ax1.annotate('Very unlikely\nobservations', xy=(2.8, 0.05), fontsize=9, ha='center',
            color='red')

# Show our observed data point (z_stat from earlier)
observed_z = abs(z_stat)
ax1.axvline(x=observed_z, color='green', linestyle='-', linewidth=2.5)
ax1.axvline(x=-observed_z, color='green', linestyle='-', linewidth=2.5)

# Shade the p-value region (both tails for two-tailed test)
x_right = x[x >= observed_z]
x_left = x[x <= -observed_z]
ax1.fill_between(x_right, norm.pdf(x_right), alpha=0.5, color='green', label='P-value\n(shaded area)')
ax1.fill_between(x_left, norm.pdf(x_left), alpha=0.5, color='green')

# Add annotation for observed data point
ax1.annotate(f'Observed\ndata point\n(Z={observed_z:.2f})', 
            xy=(observed_z, norm.pdf(observed_z)), 
            xytext=(observed_z + 0.8, 0.2),
            fontsize=10, fontweight='bold', color='green',
            arrowprops=dict(arrowstyle='->', color='green', lw=2))

ax1.set_xlabel('Possible Results (Z-score)', fontsize=11)
ax1.set_ylabel('Probability Density', fontsize=11)
ax1.set_title('P-Value: Probability of Extreme Results Under H₀', fontweight='bold', fontsize=12)
ax1.legend(loc='upper right')
ax1.set_xlim(-4, 4)

# Add text box with p-value definition
textstr = f'P-value = {p_value:.4f}\n\nIf H₀ were true, we\'d see\nresults this extreme only\n{p_value:.1%} of the time.'
props = dict(boxstyle='round', facecolor='lightyellow', alpha=0.8)
ax1.text(0.02, 0.98, textstr, transform=ax1.transAxes, fontsize=10,
        verticalalignment='top', bbox=props)

# 2. Business impact illustration (simulated numbers, not real data)
ax2 = axes[1]

# Simulate cumulative performance over time
months = ['Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
np.random.seed(42)

# Without A/B testing (random decisions)
baseline_growth = [100]
for i in range(6):
    # Random growth with some noise, sometimes negative
    growth = baseline_growth[-1] * (1 + np.random.uniform(-0.02, 0.05))
    baseline_growth.append(growth)

# With A/B testing (data-driven decisions)  
ab_growth = [100]
for i in range(6):
    # Consistent positive growth from making good decisions
    growth = ab_growth[-1] * (1 + np.random.uniform(0.08, 0.15))
    ab_growth.append(growth)

ax2.plot(months, baseline_growth, 'r-', linewidth=2.5, marker='o', markersize=8, 
         label='Without A/B Testing')
ax2.plot(months, ab_growth, 'b-', linewidth=2.5, marker='o', markersize=8,
         label='With A/B Testing')

# Fill between to show the gap
ax2.fill_between(months, baseline_growth, ab_growth, alpha=0.2, color='green')

# Annotate the improvement
improvement = (ab_growth[-1] - baseline_growth[-1]) / baseline_growth[-1] * 100
ax2.annotate(f'+{improvement:.0f}%', xy=(6, (ab_growth[-1] + baseline_growth[-1])/2),
            fontsize=16, fontweight='bold', color='green',
            ha='center')

ax2.set_xlabel('Month', fontsize=11)
ax2.set_ylabel('Performance Index', fontsize=11)
ax2.set_title('Business Impact of A/B Testing (Simulated)', fontweight='bold', fontsize=12)
ax2.legend(loc='upper left')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nKEY TAKEAWAY:")
print(f"  Our observed Z-score is {z_stat:.2f}.")
print(f"  The p-value ({p_value:.4f}) means results at least this extreme would occur {p_value:.1%} of the time")
print(f"  if there were truly no difference between Control and Treatment.")
if p_value < 0.05:
    print(f"  Since {p_value:.4f} < 0.05, we reject H₀: the difference is statistically significant.")
else:
    print(f"  Since {p_value:.4f} ≥ 0.05, we fail to reject H₀: the data do not show a significant difference.")

---

<a id='part3'></a>
# Part 3: Statistical Tests

---

## 3.1 Choosing the Right Test

| Data Type | Test | When to Use |
|-----------|------|-------------|
| **Binary (proportions)** | Z-test, Chi-square | Conversion rate, CTR |
| **Continuous (normal)** | t-test | Revenue, time on site |
| **Continuous (non-normal)** | Mann-Whitney U | Skewed distributions |
| **Count data** | Poisson test | Page views, events |
| **Small samples** | Fisher's exact | Small groups or expected cell counts below about 5 |
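
For the small-sample row, here is a dependency-free sketch of the two-sided Fisher's exact test on a 2x2 table, built from the hypergeometric distribution. The function name and the example counts (3/20 vs 9/20 conversions) are illustrative assumptions; in real code `scipy.stats.fisher_exact` is the usual choice.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are groups, columns are converted / not converted. The p-value
    sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability that row 1 contains k conversions
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_obs * (1 + 1e-7)  # small tolerance for float ties
    )
    return min(1.0, p_value)

# Hypothetical small experiment: control 3/20, treatment 9/20 conversions
p_small = fisher_exact_two_sided(3, 17, 9, 11)
print(f"Fisher's exact p-value: {p_small:.4f}")
```

Because it enumerates exact probabilities instead of relying on a normal or chi-square approximation, the test stays valid when conversions are rare. The enumeration grows with the sample size, so the approximate tests are more practical for large experiments.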

In [None]:
# ============================================================
# STATISTICAL TESTS COMPARISON
# ============================================================
print("="*70)
print("STATISTICAL TESTS FOR A/B TESTING")
print("="*70)

class ABTestSuite:
    """
    Comprehensive A/B testing suite with multiple statistical tests.
    """
    
    @staticmethod
    def z_test_proportions(conversions_a, n_a, conversions_b, n_b, alpha=0.05):
        """
        Z-test for comparing two proportions.
        
        Best for: Binary outcomes (conversion, click, etc.)
        Assumption: Large samples (rule of thumb: at least 5-10 conversions
        and 5-10 non-conversions in each group)
        """
        p_a = conversions_a / n_a
        p_b = conversions_b / n_b
        
        # Pooled proportion
        p_pool = (conversions_a + conversions_b) / (n_a + n_b)
        
        # Standard error
        se = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))
        
        # Z statistic
        z = (p_b - p_a) / se
        
        # P-value (two-tailed)
        p_value = 2 * (1 - norm.cdf(abs(z)))
        
        # Confidence interval
        se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
        ci = (p_b - p_a - 1.96*se_diff, p_b - p_a + 1.96*se_diff)
        
        return {
            'test': 'Z-test for Proportions',
            'statistic': z,
            'p_value': p_value,
            'significant': p_value < alpha,
            'ci_95': ci,
            'effect': p_b - p_a,
            'relative_effect': (p_b - p_a) / p_a * 100
        }
    
    @staticmethod
    def chi_square_test(conversions_a, n_a, conversions_b, n_b, alpha=0.05):
        """
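        Chi-square test for independence.
        
        Best for: Binary outcomes
        Without continuity correction, this is equivalent to the two-sided
        Z-test on a 2x2 table (chi2 = z**2, same p-value). SciPy applies
        Yates' correction to 2x2 tables by default, so it is disabled here.
        """
        # Create contingency table
        table = np.array([
            [conversions_a, n_a - conversions_a],
            [conversions_b, n_b - conversions_b]
        ])
        
        chi2, p_value, dof, expected = chi2_contingency(table, correction=False)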
        
        return {
            'test': 'Chi-square Test',
            'statistic': chi2,
            'p_value': p_value,
            'significant': p_value < alpha,
            'dof': dof
        }
    
    @staticmethod
    def t_test(data_a, data_b, alpha=0.05, equal_var=False):
        """
        Independent samples t-test.
        
        Best for: Continuous outcomes (revenue, time)
        Assumption: Approximately normal distribution
        Use equal_var=False for Welch's t-test (recommended)
        """
        t_stat, p_value = ttest_ind(data_a, data_b, equal_var=equal_var)
        
        mean_a = np.mean(data_a)
        mean_b = np.mean(data_b)
        
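        # Cohen's d effect size (sample variances with ddof=1; the simple
        # average of the two variances assumes similar group sizes)
        pooled_std = np.sqrt((np.var(data_a, ddof=1) + np.var(data_b, ddof=1)) / 2)
        cohens_d = (mean_b - mean_a) / pooled_std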
        
        return {
            'test': "Welch's t-test" if not equal_var else "Student's t-test",
            'statistic': t_stat,
            'p_value': p_value,
            'significant': p_value < alpha,
            'mean_a': mean_a,
            'mean_b': mean_b,
            'effect': mean_b - mean_a,
            'cohens_d': cohens_d
        }
    
    @staticmethod
    def mann_whitney_test(data_a, data_b, alpha=0.05):
        """
        Mann-Whitney U test (non-parametric).
        
        Best for: Non-normal distributions, ordinal data
        More robust than t-test for skewed data
        """
        u_stat, p_value = mannwhitneyu(data_a, data_b, alternative='two-sided')
        
        return {
            'test': 'Mann-Whitney U Test',
            'statistic': u_stat,
            'p_value': p_value,
            'significant': p_value < alpha,
            'median_a': np.median(data_a),
            'median_b': np.median(data_b)
        }

print("ABTestSuite class created with methods:")
print("  - z_test_proportions(): For binary outcomes")
print("  - chi_square_test(): Alternative for binary outcomes")
print("  - t_test(): For continuous outcomes")
print("  - mann_whitney_test(): For non-normal distributions")

In [None]:
# Run all tests on our data
print("="*70)
print("RUNNING MULTIPLE STATISTICAL TESTS")
print("="*70)

# 1. Z-test for conversion rate
print("\n1. Z-TEST FOR CONVERSION RATE")
print("-" * 50)
z_result = ABTestSuite.z_test_proportions(
    conversions_control, n_control,
    conversions_treatment, n_treatment
)
for k, v in z_result.items():
    print(f"  {k}: {v}")

# 2. Chi-square test
print("\n2. CHI-SQUARE TEST")
print("-" * 50)
chi_result = ABTestSuite.chi_square_test(
    conversions_control, n_control,
    conversions_treatment, n_treatment
)
for k, v in chi_result.items():
    print(f"  {k}: {v}")

# 3. T-test for time on site
print("\n3. T-TEST FOR TIME ON SITE")
print("-" * 50)
t_result = ABTestSuite.t_test(
    control['time_on_site'].values,
    treatment['time_on_site'].values
)
for k, v in t_result.items():
    print(f"  {k}: {v}")

# 4. Mann-Whitney for revenue (skewed distribution)
print("\n4. MANN-WHITNEY TEST FOR REVENUE")
print("-" * 50)
mw_result = ABTestSuite.mann_whitney_test(
    control[control['revenue'] > 0]['revenue'].values,
    treatment[treatment['revenue'] > 0]['revenue'].values
)
for k, v in mw_result.items():
    print(f"  {k}: {v}")

---

<a id='part4'></a>
# Part 4: Sample Size & Power Analysis

---

## 4.1 Why Sample Size Matters

| Sample Size | Issue |
|-------------|-------|
| **Too Small** | High chance of missing real effects (low power) |
| **Too Large** | Wasted resources, may detect trivial effects |
| **Just Right** | Detect meaningful effects with high confidence |

## 4.2 Power Analysis Components

| Parameter | Symbol | Description | Typical Value |
|-----------|--------|-------------|---------------|
| **Significance Level** | α | P(Type I error) | 0.05 |
| **Power** | 1-β | P(detecting real effect) | 0.80 |
| **Effect Size** | δ | Minimum detectable effect | Depends on business |
| **Sample Size** | n | Number per group | Calculated |
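
The four parameters in the table are tied together: fix any three and the fourth follows. As a rough sketch, the function below solves for power given α, the two rates, and the per-group sample size, using the same normal approximation as the sample-size formula in section 4.3. The function name is hypothetical, and the approximation ignores the negligible chance of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist

def approx_power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)          # critical value, e.g. 1.96
    p_bar = (p1 + p2) / 2                        # average of the two rates
    se = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    return nd.cdf(abs(p2 - p1) / se - z_alpha)   # P(detecting the effect)

# Power to detect a 10% -> 12% lift at several per-group sample sizes
for n in (500, 1000, 2000, 4000, 8000):
    print(f"n = {n:>5,}: power ≈ {approx_power_two_proportions(0.10, 0.12, n):.2f}")
```

Reading the output from top to bottom shows why underpowered tests are risky: with small groups, a real lift of this size is missed more often than it is detected.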

## 4.3 The Power Formula

For proportions:
```
n = 2 × (Z_{α/2} + Z_β)² × p̄(1-p̄) / (p₁ - p₂)²

Where:
- n = required sample size per group
- p̄ = (p₁ + p₂) / 2
- Z_{α/2} = 1.96 for α=0.05 (two-tailed)
- Z_β = 0.84 for power=0.80
```
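
As a quick check of the formula, the sketch below plugs in a 10% baseline with 12% and 11% targets using only the standard library. The helper name `sample_size_two_proportions` is an assumption for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n from n = 2 (Z_{α/2} + Z_β)² p̄(1-p̄) / (p₁ - p₂)²."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    return ceil(2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2)

n_20 = sample_size_two_proportions(0.10, 0.12)  # 20% relative lift
n_10 = sample_size_two_proportions(0.10, 0.11)  # 10% relative lift
print(f"20% lift: {n_20:,} per group")
print(f"10% lift: {n_10:,} per group ({n_10 / n_20:.1f}x as many)")
```

Halving the detectable lift roughly quadruples the requirement, because the difference enters the denominator squared.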

In [None]:
# ============================================================
# SAMPLE SIZE CALCULATOR
# ============================================================
print("="*70)
print("SAMPLE SIZE CALCULATOR")
print("="*70)

def calculate_sample_size_proportions(p1, p2, alpha=0.05, power=0.80):
    """
    Calculate required sample size for comparing two proportions.
    
    Parameters:
    - p1: Baseline conversion rate (control)
    - p2: Expected conversion rate (treatment)
    - alpha: Significance level (default 0.05)
    - power: Statistical power (default 0.80)
    
    Returns:
    - n: Sample size per group
    """
    # Z-scores
    z_alpha = norm.ppf(1 - alpha/2)  # Two-tailed
    z_beta = norm.ppf(power)
    
    # Pooled proportion
    p_bar = (p1 + p2) / 2
    
    # Sample size formula
    n = 2 * ((z_alpha + z_beta)**2) * p_bar * (1 - p_bar) / ((p2 - p1)**2)
    
    return int(np.ceil(n))

def calculate_sample_size_continuous(effect_size, alpha=0.05, power=0.80):
    """
    Calculate required sample size for continuous outcomes.
    
    Parameters:
    - effect_size: Cohen's d (standardized effect size)
    - alpha: Significance level
    - power: Statistical power
    """
    analysis = TTestIndPower()
    n = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
    return int(np.ceil(n))

# Example calculations
print("\nSCENARIO 1: Conversion Rate")
print("-" * 50)
baseline = 0.10  # 10% current conversion
expected = 0.12  # 12% expected with new design (20% lift)

n_required = calculate_sample_size_proportions(baseline, expected)
print(f"Baseline rate: {baseline:.1%}")
print(f"Expected rate: {expected:.1%}")
print(f"Minimum Detectable Effect: {expected - baseline:.1%} ({(expected-baseline)/baseline*100:.0f}% relative)")
print(f"\nRequired sample size per group: {n_required:,}")
print(f"Total users needed: {n_required * 2:,}")

# Vary the minimum detectable effect
print("\n" + "="*50)
print("SAMPLE SIZE vs MINIMUM DETECTABLE EFFECT")
print("="*50)
print(f"\n{'MDE':>10} {'Relative Lift':>15} {'Sample Size (per group)':>25}")
print("-" * 52)
for lift in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]:
    expected = baseline * (1 + lift)
    n = calculate_sample_size_proportions(baseline, expected)
    # Column widths match the header so the table lines up
    print(f"{expected - baseline:>10.2%} {lift:>15.0%} {n:>25,}")

In [None]:
# Visualize sample size requirements
print("="*70)
print("SAMPLE SIZE VISUALIZATION")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Sample size vs MDE
ax1 = axes[0]
lifts = np.linspace(0.05, 0.50, 50)
sample_sizes = [calculate_sample_size_proportions(0.10, 0.10*(1+l)) for l in lifts]
ax1.plot(lifts * 100, sample_sizes, 'b-', linewidth=2)
ax1.fill_between(lifts * 100, sample_sizes, alpha=0.3)
ax1.set_xlabel('Relative Lift (%)')
ax1.set_ylabel('Sample Size per Group')
ax1.set_title('Sample Size vs Minimum Detectable Effect', fontweight='bold')
ax1.set_yscale('log')
ax1.grid(True, alpha=0.3)

# Add annotations
for lift_pct, label in [(10, '10% lift'), (20, '20% lift'), (30, '30% lift')]:
    n = calculate_sample_size_proportions(0.10, 0.10*(1+lift_pct/100))
    ax1.annotate(f'{n:,}', xy=(lift_pct, n), xytext=(lift_pct+5, n*1.5),
                arrowprops=dict(arrowstyle='->', color='red'),
                fontsize=9, color='red')

# 2. Sample size vs Power
ax2 = axes[1]
powers = np.linspace(0.5, 0.99, 50)
sample_sizes_power = [calculate_sample_size_proportions(0.10, 0.12, power=p) for p in powers]
ax2.plot(powers * 100, sample_sizes_power, 'g-', linewidth=2)
ax2.fill_between(powers * 100, sample_sizes_power, alpha=0.3, color='green')
ax2.axvline(x=80, color='red', linestyle='--', label='80% power (standard)')
ax2.set_xlabel('Statistical Power (%)')
ax2.set_ylabel('Sample Size per Group')
ax2.set_title('Sample Size vs Power (20% relative lift)', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("  - Required sample size grows as 1/effect²: halving the effect roughly quadruples n")
print("  - Higher power requires more samples (but 80% is standard)")
print("  - Always calculate sample size BEFORE running the test!")

---

<a id='part5'></a>
# Part 5: Effect Size & Practical Significance

---

## 5.1 Statistical vs Practical Significance

| Type | Question | Example |
|------|----------|---------|
| **Statistical Significance** | Is the observed effect unlikely to be due to chance alone? | p < 0.05 |
| **Practical Significance** | Is the effect large enough to matter? | +$100K revenue |

**Critical Insight:** A result can be statistically significant but practically meaningless!
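
The sketch below illustrates that point with made-up numbers: with one million users per group, a 0.1 percentage-point difference is statistically significant under a pooled two-proportion z-test, even though the lift may be too small to justify shipping. All counts, the function name `two_proportion_p_value`, and the 0.5-point practical threshold are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical experiment: 10.00% vs 10.10% conversion, 1M users per group
n = 1_000_000
p_value = two_proportion_p_value(100_000, n, 101_000, n)
abs_lift = 101_000 / n - 100_000 / n
min_practical_lift = 0.005  # assumed business threshold: 0.5 percentage points

print(f"p-value: {p_value:.4f}")
print(f"statistically significant: {p_value < 0.05}")
print(f"practically significant:   {abs_lift >= min_practical_lift}")
```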

## 5.2 Cohen's d (Effect Size for Continuous)

| Cohen's d | Interpretation |
|-----------|----------------|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
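
The table gives thresholds but not the calculation. A minimal sketch, assuming the usual pooled-standard-deviation definition of Cohen's d; the function name `cohens_d` and the toy data are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(control, treatment):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_c, n_t = len(control), len(treatment)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    pooled_var = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)

# Toy data (hypothetical minutes on site)
control_minutes = [2.0, 4.0, 6.0]
treatment_minutes = [3.0, 5.0, 7.0]
print(f"Cohen's d = {cohens_d(control_minutes, treatment_minutes):.2f}")
```

Here the means differ by 1 and the pooled standard deviation is 2, so d = 0.5, a medium effect by the table above.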

## 5.3 Effect Size for Proportions

| Metric | Formula |
|--------|--------|
| **Absolute Difference** | p_B - p_A |
| **Relative Lift** | (p_B - p_A) / p_A × 100% |
| **Odds Ratio** | (p_B/(1-p_B)) / (p_A/(1-p_A)) |

In [None]:
# ============================================================
# EFFECT SIZE AND PRACTICAL SIGNIFICANCE
# ============================================================
print("="*70)
print("EFFECT SIZE ANALYSIS")
print("="*70)

def calculate_effect_sizes(p_a, p_b, n_a, n_b):
    """
    Calculate various effect size metrics.
    """
    # Absolute difference
    abs_diff = p_b - p_a
    
    # Relative lift
    rel_lift = (p_b - p_a) / p_a * 100
    
    # Odds ratio
    odds_a = p_a / (1 - p_a)
    odds_b = p_b / (1 - p_b)
    odds_ratio = odds_b / odds_a
    
    # Risk ratio (relative risk)
    risk_ratio = p_b / p_a
    
    # Number Needed to Treat (NNT)
    nnt = 1 / abs(abs_diff) if abs_diff != 0 else float('inf')
    
    # Cohen's h (effect size for proportions)
    h = 2 * (np.arcsin(np.sqrt(p_b)) - np.arcsin(np.sqrt(p_a)))
    
    return {
        'absolute_difference': abs_diff,
        'relative_lift': rel_lift,
        'odds_ratio': odds_ratio,
        'risk_ratio': risk_ratio,
        'nnt': nnt,
        'cohens_h': h
    }

# Calculate for our test
effects = calculate_effect_sizes(rate_control, rate_treatment, n_control, n_treatment)

print(f"""
EFFECT SIZE METRICS
{'='*50}
Control Rate:      {rate_control:.4f} ({rate_control:.2%})
Treatment Rate:    {rate_treatment:.4f} ({rate_treatment:.2%})

Absolute Difference: {effects['absolute_difference']:.4f} ({effects['absolute_difference']:.2%})
Relative Lift:       {effects['relative_lift']:.1f}%
Odds Ratio:          {effects['odds_ratio']:.3f}
Risk Ratio:          {effects['risk_ratio']:.3f}
Number Needed to Treat: {effects['nnt']:.0f}
Cohen's h:           {effects['cohens_h']:.3f}

INTERPRETATION
{'='*50}
""")

# Interpret Cohen's h
h = abs(effects['cohens_h'])
if h < 0.2:
    interpretation = "Negligible to small effect (h < 0.2)"
elif h < 0.5:
    interpretation = "Small to Medium effect (0.2 ≤ h < 0.5)"
elif h < 0.8:
    interpretation = "Medium to Large effect (0.5 ≤ h < 0.8)"
else:
    interpretation = "Large effect (h ≥ 0.8)"

print(f"Cohen's h interpretation: {interpretation}")
print(f"\nPractical interpretation:")
print(f"  - For every {effects['nnt']:.0f} users exposed to treatment,")
print(f"    we get 1 additional conversion compared to control.")

In [None]:
# Business impact calculation
print("="*70)
print("BUSINESS IMPACT ANALYSIS")
print("="*70)

# Hypothetical business metrics
monthly_visitors = 1_000_000
avg_order_value = 50  # dollars

# Current (control) metrics
current_conversions = monthly_visitors * rate_control
current_revenue = current_conversions * avg_order_value

# Expected (treatment) metrics
expected_conversions = monthly_visitors * rate_treatment
expected_revenue = expected_conversions * avg_order_value

# Impact
additional_conversions = expected_conversions - current_conversions
additional_revenue = expected_revenue - current_revenue

print(f"""
PROJECTED MONTHLY IMPACT
{'='*50}
Monthly Visitors: {monthly_visitors:,}
Average Order Value: ${avg_order_value}

                    Control        Treatment       Difference
Conversion Rate     {rate_control:.2%}          {rate_treatment:.2%}           +{rate_treatment-rate_control:.2%}
Conversions         {current_conversions:,.0f}        {expected_conversions:,.0f}         +{additional_conversions:,.0f}
Revenue             ${current_revenue:,.0f}      ${expected_revenue:,.0f}       +${additional_revenue:,.0f}

ANNUAL PROJECTED IMPACT: +${additional_revenue * 12:,.0f}
""")

# Visualize business impact
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Monthly revenue comparison
ax1 = axes[0]
categories = ['Control', 'Treatment']
revenues = [current_revenue / 1e6, expected_revenue / 1e6]
colors = ['steelblue', 'coral']
bars = ax1.bar(categories, revenues, color=colors, edgecolor='black')
ax1.set_ylabel('Monthly Revenue (Millions $)')
ax1.set_title('Revenue Comparison', fontweight='bold')
for bar, rev in zip(bars, revenues):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
             f'${rev:.2f}M', ha='center', fontweight='bold')

# Additional revenue highlight
ax2 = axes[1]
months = list(range(1, 13))
cumulative_impact = [additional_revenue * m / 1e6 for m in months]
ax2.bar(months, cumulative_impact, color='green', edgecolor='black', alpha=0.7)
ax2.plot(months, cumulative_impact, 'go-', linewidth=2, markersize=8)
ax2.set_xlabel('Month')
ax2.set_ylabel('Cumulative Additional Revenue (Millions $)')
ax2.set_title('Projected Cumulative Impact', fontweight='bold')
ax2.set_xticks(months)

plt.tight_layout()
plt.show()

---

<a id='part6'></a>
# Part 6: Common Pitfalls

---

## 6.1 The Big Mistakes in A/B Testing

| Pitfall | Problem | Solution |
|---------|---------|----------|
| **Peeking** | Checking results repeatedly and stopping at the first significant p-value | Pre-define sample size and analyze once, or use a sequential testing method |
| **Multiple Testing** | Testing many variants | Bonferroni/FDR correction |
| **Simpson's Paradox** | Aggregation hides truth | Segment analysis |
| **Novelty Effect** | Initial boost fades | Run test longer |
| **Selection Bias** | Non-random assignment | Proper randomization |
| **Survivorship Bias** | Only measure survivors | Intent-to-treat analysis |
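
Simpson's paradox is easiest to see with numbers. In the sketch below, variant A wins inside every device segment yet loses in the aggregate. The counts are hypothetical and chosen only to show the reversal; the helper names `rate` and `overall_rate` are likewise assumptions.

```python
# Hypothetical (conversions, users) per device segment
data = {
    "A": {"mobile": (81, 87), "desktop": (192, 263)},
    "B": {"mobile": (234, 270), "desktop": (55, 80)},
}

def rate(conversions, users):
    return conversions / users

def overall_rate(segments):
    total_conv = sum(c for c, _ in segments.values())
    total_users = sum(u for _, u in segments.values())
    return total_conv / total_users

for segment in ("mobile", "desktop"):
    r_a = rate(*data["A"][segment])
    r_b = rate(*data["B"][segment])
    print(f"{segment:<8} A: {r_a:.1%}  B: {r_b:.1%}  -> {'A' if r_a > r_b else 'B'} wins")

r_a, r_b = overall_rate(data["A"]), overall_rate(data["B"])
print(f"{'overall':<8} A: {r_a:.1%}  B: {r_b:.1%}  -> {'A' if r_a > r_b else 'B'} wins")
```

The reversal comes from an unequal segment mix: B draws most of its traffic from the high-converting mobile segment. Proper randomization should balance the mix, so a pattern like this is also a hint to check the assignment.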

In [None]:
# ============================================================
# PITFALL 1: PEEKING PROBLEM
# ============================================================
print("="*70)
print("PITFALL 1: THE PEEKING PROBLEM")
print("="*70)

print("""
THE PROBLEM:
If you check p-values repeatedly as data comes in,
you WILL find "significance" by chance!

With α=0.05 and repeated significance checks on accumulating data:
- After 100 peeks: roughly 40% chance of a false positive
- The rate keeps rising with every extra peek and approaches 100% in the limit
""")

# Simulate the peeking problem
def simulate_peeking(n_simulations=1000, max_samples=5000, peek_interval=100):
    """
    Simulate the false positive rate when peeking at results.
    Both groups have SAME conversion rate (no real effect).
    """
    true_rate = 0.10  # Same for both groups
    false_positives = 0
    
    for _ in range(n_simulations):
        # Generate data incrementally
        control = []
        treatment = []
        
        found_significant = False
        
        for n in range(peek_interval, max_samples + 1, peek_interval):
            # Add new data
            control.extend(np.random.binomial(1, true_rate, peek_interval))
            treatment.extend(np.random.binomial(1, true_rate, peek_interval))
            
            # Calculate p-value
            count = np.array([sum(treatment), sum(control)])
            nobs = np.array([len(treatment), len(control)])
            _, p_value = proportions_ztest(count, nobs)
            
            if p_value < 0.05:
                found_significant = True
                break
        
        if found_significant:
            false_positives += 1
    
    return false_positives / n_simulations

# Run simulation (the printed count now matches the number actually simulated)
n_sims = 500
print(f"\nSimulating {n_sims} experiments with peeking (NO real effect)...")
false_positive_rate = simulate_peeking(n_simulations=n_sims)
print(f"\nFalse Positive Rate with peeking: {false_positive_rate:.1%}")
print(f"Expected without peeking: 5.0%")
print(f"\nPeeking inflated false positives by {false_positive_rate/0.05:.1f}x!")

In [None]:
# ============================================================
# PITFALL 2: MULTIPLE TESTING
# ============================================================
print("="*70)
print("PITFALL 2: MULTIPLE TESTING PROBLEM")
print("="*70)

print("""
THE PROBLEM:
Testing multiple hypotheses increases false positive rate.

P(at least 1 false positive) = 1 - (1 - α)^n   (for n independent tests)

With α=0.05:
- 1 test:  5% chance of false positive
- 10 tests: 40% chance of at least 1 false positive
- 20 tests: 64% chance of at least 1 false positive
""")

# Demonstrate
alpha = 0.05
n_tests = np.arange(1, 51)
familywise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni correction
def bonferroni_correction(p_values, alpha=0.05):
    """Apply Bonferroni correction."""
    n = len(p_values)
    adjusted_alpha = alpha / n
    return [p < adjusted_alpha for p in p_values], adjusted_alpha

# Benjamini-Hochberg (FDR) correction
def benjamini_hochberg(p_values, alpha=0.05):
    """Apply Benjamini-Hochberg FDR correction."""
    n = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p = np.array(p_values)[sorted_indices]
    
    # Calculate BH threshold
    thresholds = alpha * (np.arange(1, n + 1) / n)
    
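    # Step-up rule: find the largest rank k with p_(k) <= (k/n) * alpha
    # and reject every hypothesis ranked at or below k
    below = sorted_p <= thresholds
    significant = np.zeros(n, dtype=bool)
    if below.any():
        k_max = np.max(np.where(below)[0])
        significant[:k_max + 1] = True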
    
    # Return in original order
    result = np.zeros(n, dtype=bool)
    result[sorted_indices] = significant
    
    return list(result)

# Example with multiple tests
np.random.seed(42)
# Simulate 20 p-values (assume 3 are truly significant)
p_values = list(np.random.uniform(0.01, 0.99, 17)) + [0.01, 0.02, 0.03]
np.random.shuffle(p_values)

# Apply corrections
uncorrected = [p < 0.05 for p in p_values]
bonf_sig, bonf_alpha = bonferroni_correction(p_values)
bh_sig = benjamini_hochberg(p_values)

print(f"\nEXAMPLE: 20 hypothesis tests")
print(f"{'='*60}")
print(f"{'Test':<6} {'P-value':<12} {'Uncorrected':<14} {'Bonferroni':<14} {'BH (FDR)'}")
print(f"{'-'*60}")
for i, (p, unc, bonf, bh) in enumerate(zip(p_values, uncorrected, bonf_sig, bh_sig)):
    print(f"{i+1:<6} {p:<12.4f} {'SIG' if unc else '':<14} {'SIG' if bonf else '':<14} {'SIG' if bh else ''}")

print(f"\nSummary:")
print(f"  Uncorrected significant: {sum(uncorrected)}")
print(f"  Bonferroni significant: {sum(bonf_sig)} (α = {bonf_alpha:.4f})")
print(f"  BH (FDR) significant: {sum(bh_sig)}")

In [None]:
# Visualize multiple testing
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Family-wise error rate
ax1 = axes[0]
ax1.plot(n_tests, familywise_error * 100, 'r-', linewidth=2)
ax1.axhline(y=5, color='green', linestyle='--', label='Desired α=5%')
ax1.fill_between(n_tests, familywise_error * 100, alpha=0.3, color='red')
ax1.set_xlabel('Number of Tests')
ax1.set_ylabel('P(At Least 1 False Positive) %')
ax1.set_title('Family-wise Error Rate', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Correction comparison
ax2 = axes[1]
methods = ['Uncorrected', 'Bonferroni', 'BH (FDR)']
counts = [sum(uncorrected), sum(bonf_sig), sum(bh_sig)]
colors = ['red', 'steelblue', 'green']
bars = ax2.bar(methods, counts, color=colors, edgecolor='black')
ax2.set_ylabel('Number of Significant Results')
ax2.set_title('Multiple Testing Corrections (20 tests)', fontweight='bold')
ax2.axhline(y=3, color='orange', linestyle='--', label='True positives (3)')
ax2.legend()

for bar, count in zip(bars, counts):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.2,
             str(count), ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

---

<a id='part7'></a>
# Part 7: Bayesian A/B Testing

---

## 7.1 Frequentist vs Bayesian

| Aspect | Frequentist | Bayesian |
|--------|------------|----------|
| **Question** | What's P(data \| H₀)? | What's P(B better \| data)? |
| **Output** | p-value, CI | Posterior probability |
| **Prior info** | Not used | Can incorporate |
| **Interpretation** | "Reject/Fail to reject" | "95% chance B is better" |
| **Sample size** | Fixed in advance | More flexible, but a pre-defined stopping rule is still needed |

## 7.2 Bayesian Approach

```
Prior × Likelihood = Posterior

P(θ|data) ∝ P(data|θ) × P(θ)

For conversion rates, use Beta distribution:
- Prior: Beta(α₀, β₀)
- Posterior: Beta(α₀ + conversions, β₀ + non-conversions)
```
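
The conjugate update above can be worked through by hand. The sketch below uses only the standard library; the counts (100/1000 vs 130/1000) and the helper names `beta_posterior` and `prob_b_beats_a` are hypothetical and separate from the class defined in the next cell.

```python
import random

def beta_posterior(conversions, n, prior_alpha=1.0, prior_beta=1.0):
    """Conjugate update: Beta prior + binomial data -> Beta posterior parameters."""
    return prior_alpha + conversions, prior_beta + (n - conversions)

def prob_b_beats_a(post_a, post_b, n_draws=20_000, seed=0):
    """Monte Carlo estimate of P(θ_B > θ_A) from two Beta posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(n_draws)
    )
    return wins / n_draws

# Hypothetical data: A converts 100/1000, B converts 130/1000, uniform Beta(1, 1) prior
post_a = beta_posterior(100, 1000)   # Beta(101, 901)
post_b = beta_posterior(130, 1000)   # Beta(131, 871)

mean_a = post_a[0] / sum(post_a)     # posterior mean = α / (α + β)
mean_b = post_b[0] / sum(post_b)
print(f"Posterior A: Beta{post_a}, mean {mean_a:.4f}")
print(f"Posterior B: Beta{post_b}, mean {mean_b:.4f}")
print(f"P(B > A) ≈ {prob_b_beats_a(post_a, post_b):.3f}")
```

With a uniform prior the posterior mean is (conversions + 1) / (n + 2), which is very close to the observed rate once n is large.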

In [None]:
# ============================================================
# BAYESIAN A/B TESTING
# ============================================================
print("="*70)
print("BAYESIAN A/B TESTING")
print("="*70)

from scipy.stats import beta

class BayesianABTest:
    """
    Bayesian A/B testing using Beta-Binomial model.
    
    Prior: Beta(α, β)
    - α = 1, β = 1 gives uniform prior (no prior knowledge)
    - Can use historical data to set informative prior
    """
    
    def __init__(self, prior_alpha=1, prior_beta=1):
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
    
    def update(self, conversions, n):
        """
        Update posterior with observed data.
        
        Posterior: Beta(α + conversions, β + non-conversions)
        """
        post_alpha = self.prior_alpha + conversions
        post_beta = self.prior_beta + (n - conversions)
        return post_alpha, post_beta
    
    def prob_b_better(self, conv_a, n_a, conv_b, n_b, n_samples=100000):
        """
        Calculate P(B > A) using Monte Carlo simulation.
        
        Sample from both posteriors and count how often B > A.
        """
        # Get posteriors
        alpha_a, beta_a = self.update(conv_a, n_a)
        alpha_b, beta_b = self.update(conv_b, n_b)
        
        # Sample from posteriors
        samples_a = beta.rvs(alpha_a, beta_a, size=n_samples)
        samples_b = beta.rvs(alpha_b, beta_b, size=n_samples)
        
        # Calculate probability B > A
        prob = np.mean(samples_b > samples_a)
        
        return prob, samples_a, samples_b
    
    def expected_loss(self, conv_a, n_a, conv_b, n_b, n_samples=100000):
        """
        Calculate the expected loss of choosing each variant.
        
        Loss of choosing B = max(0, θ_A - θ_B)
        Loss of choosing A = max(0, θ_B - θ_A)
        
        Returns (loss_a, loss_b).
        """
        # Get posteriors
        alpha_a, beta_a = self.update(conv_a, n_a)
        alpha_b, beta_b = self.update(conv_b, n_b)
        
        # Sample
        samples_a = beta.rvs(alpha_a, beta_a, size=n_samples)
        samples_b = beta.rvs(alpha_b, beta_b, size=n_samples)
        
        # Expected loss of choosing B
        loss_b = np.mean(np.maximum(0, samples_a - samples_b))
        
        # Expected loss of choosing A
        loss_a = np.mean(np.maximum(0, samples_b - samples_a))
        
        return loss_a, loss_b

# Run Bayesian analysis
bayes = BayesianABTest(prior_alpha=1, prior_beta=1)  # Uniform prior

prob_b_better, samples_a, samples_b = bayes.prob_b_better(
    conversions_control, n_control,
    conversions_treatment, n_treatment
)

loss_a, loss_b = bayes.expected_loss(
    conversions_control, n_control,
    conversions_treatment, n_treatment
)

print(f"""
BAYESIAN A/B TEST RESULTS
{'='*50}
Prior: Beta(1, 1) - Uniform (no prior knowledge)

Control (A):
  Conversions: {conversions_control:,} / {n_control:,}
  Posterior: Beta({1 + conversions_control}, {1 + n_control - conversions_control})
  
Treatment (B):
  Conversions: {conversions_treatment:,} / {n_treatment:,}
  Posterior: Beta({1 + conversions_treatment}, {1 + n_treatment - conversions_treatment})

RESULTS:
  P(Treatment > Control) = {prob_b_better:.2%}
  P(Control > Treatment) = {1-prob_b_better:.2%}
  
  Expected Loss (choosing A): {loss_a:.4f} ({loss_a:.2%})
  Expected Loss (choosing B): {loss_b:.4f} ({loss_b:.2%})

RECOMMENDATION:
""")

if prob_b_better > 0.95:
    print(f"  Strong evidence that Treatment is better ({prob_b_better:.1%} probability)")
    print(f"  → Recommend implementing Treatment")
elif prob_b_better > 0.75:
    print(f"  Moderate evidence that Treatment is better ({prob_b_better:.1%} probability)")
    print(f"  → Consider running test longer for more certainty")
else:
    print(f"  Insufficient evidence ({prob_b_better:.1%} probability)")
    print(f"  → Continue the test")

In [None]:
# Visualize Bayesian posteriors
print("="*70)
print("BAYESIAN POSTERIOR VISUALIZATION")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Posterior distributions
ax1 = axes[0]
ax1.hist(samples_a, bins=100, density=True, alpha=0.6, label='Control (A)', color='steelblue')
ax1.hist(samples_b, bins=100, density=True, alpha=0.6, label='Treatment (B)', color='coral')
ax1.axvline(np.mean(samples_a), color='steelblue', linestyle='--', linewidth=2)
ax1.axvline(np.mean(samples_b), color='coral', linestyle='--', linewidth=2)
ax1.set_xlabel('Conversion Rate')
ax1.set_ylabel('Density')
ax1.set_title('Posterior Distributions', fontweight='bold')
ax1.legend()

# 2. Difference distribution
ax2 = axes[1]
diff_samples = samples_b - samples_a
ax2.hist(diff_samples, bins=100, density=True, alpha=0.7, color='green', edgecolor='black')
ax2.axvline(0, color='red', linestyle='--', linewidth=2, label='No Difference')
ax2.axvline(np.mean(diff_samples), color='blue', linestyle='-', linewidth=2, 
            label=f'Mean Diff: {np.mean(diff_samples):.4f}')

# Shade the region where B > A (difference > 0) across the full plot height
ax2.axvspan(0, max(np.max(diff_samples), 0), alpha=0.3, color='green',
            label=f'P(B>A) = {prob_b_better:.1%}')

ax2.set_xlabel('Difference (Treatment - Control)')
ax2.set_ylabel('Density')
ax2.set_title('Posterior of Difference', fontweight='bold')
ax2.legend()

plt.tight_layout()
plt.show()

# Credible interval
ci_low, ci_high = np.percentile(diff_samples, [2.5, 97.5])
print(f"\n95% Credible Interval for Difference: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"  In plain English: given the model and prior, there is a 95% probability")
print(f"  that the true difference lies between {ci_low:.2%} and {ci_high:.2%}")

---

<a id='part8'></a>
# Part 8: Multi-Armed Bandits

---

## 8.1 Exploration vs Exploitation

| Approach | Trade-off |
|----------|----------|
| **A/B Test** | Pure exploration, then exploitation |
| **Bandit** | Balance exploration & exploitation |

## 8.2 When to Use Bandits

| Use Bandits When | Use A/B Tests When |
|------------------|--------------------|
| High opportunity cost | Need statistical rigor |
| Many variants | Few variants |
| Quick decisions needed | Can wait for results |
| Personalization | One-size-fits-all |

In [None]:
# ============================================================
# MULTI-ARMED BANDIT ALGORITHMS
# ============================================================
print("="*70)
print("MULTI-ARMED BANDITS")
print("="*70)

class EpsilonGreedy:
    """
    Epsilon-Greedy bandit algorithm.
    
    - With probability ε: Explore (random arm)
    - With probability 1-ε: Exploit (best arm so far)
    """
    
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
    
    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)  # Explore
        else:
            return np.argmax(self.values)  # Exploit
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = ((n - 1) * self.values[arm] + reward) / n


class ThompsonSampling:
    """
    Thompson Sampling bandit algorithm.
    
    - Maintain Beta posterior for each arm
    - Sample from each posterior
    - Choose arm with highest sample
    """
    
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alphas = np.ones(n_arms)  # Successes + 1
        self.betas = np.ones(n_arms)   # Failures + 1
    
    def select_arm(self):
        samples = [beta.rvs(self.alphas[i], self.betas[i]) for i in range(self.n_arms)]
        return np.argmax(samples)
    
    def update(self, arm, reward):
        if reward == 1:
            self.alphas[arm] += 1
        else:
            self.betas[arm] += 1


class UCB1:
    """
    Upper Confidence Bound (UCB1) algorithm.
    
    - Choose arm with highest: mean + exploration_bonus
    - Exploration bonus decreases as arm is pulled more
    """
    
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.total = 0
    
    def select_arm(self):
        # First, try each arm once
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm
        
        # UCB formula
        ucb_values = self.values + np.sqrt(2 * np.log(self.total) / self.counts)
        return np.argmax(ucb_values)
    
    def update(self, arm, reward):
        self.total += 1
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = ((n - 1) * self.values[arm] + reward) / n

print("Bandit algorithms created:")
print("  1. Epsilon-Greedy: Simple exploration with probability ε")
print("  2. Thompson Sampling: Bayesian approach, samples from posteriors")
print("  3. UCB1: Optimism in face of uncertainty")

In [None]:
# Simulate bandits
print("="*70)
print("BANDIT SIMULATION")
print("="*70)

def simulate_bandit(bandit, true_rates, n_rounds=10000):
    """
    Simulate a bandit algorithm.
    
    Returns:
    - rewards: Cumulative reward after each round
    - arm_selections: Index of the arm chosen in each round
    """
    rewards = []
    arm_selections = []
    cumulative = 0
    
    for _ in range(n_rounds):
        arm = bandit.select_arm()
        reward = np.random.binomial(1, true_rates[arm])
        bandit.update(arm, reward)
        
        cumulative += reward
        rewards.append(cumulative)
        arm_selections.append(arm)
    
    return rewards, arm_selections

# True conversion rates (arm 2 is best)
true_rates = [0.10, 0.12, 0.15]  # Control, Treatment1, Treatment2
n_rounds = 10000

# Run simulations (reseed before each one so every algorithm starts from the same RNG state)
np.random.seed(42)
eg_bandit = EpsilonGreedy(n_arms=3, epsilon=0.1)
eg_rewards, eg_arms = simulate_bandit(eg_bandit, true_rates, n_rounds)

np.random.seed(42)
ts_bandit = ThompsonSampling(n_arms=3)
ts_rewards, ts_arms = simulate_bandit(ts_bandit, true_rates, n_rounds)

np.random.seed(42)
ucb_bandit = UCB1(n_arms=3)
ucb_rewards, ucb_arms = simulate_bandit(ucb_bandit, true_rates, n_rounds)

# Optimal baseline: one random realization of always pulling the best arm
optimal_rewards = np.cumsum(np.random.binomial(1, max(true_rates), n_rounds))

print(f"\nTrue conversion rates: {true_rates}")
print(f"Best arm: {int(np.argmax(true_rates))} (rate = {max(true_rates)})")
print(f"\nTotal rewards after {n_rounds:,} rounds:")
print(f"  Epsilon-Greedy: {eg_rewards[-1]:,}")
print(f"  Thompson Sampling: {ts_rewards[-1]:,}")
print(f"  UCB1: {ucb_rewards[-1]:,}")
print(f"  Optimal: {optimal_rewards[-1]:,}")

In [None]:
# Visualize bandit performance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cumulative rewards
ax1 = axes[0]
rounds = range(n_rounds)
ax1.plot(rounds, eg_rewards, label='Epsilon-Greedy', alpha=0.8)
ax1.plot(rounds, ts_rewards, label='Thompson Sampling', alpha=0.8)
ax1.plot(rounds, ucb_rewards, label='UCB1', alpha=0.8)
ax1.plot(rounds, optimal_rewards, 'k--', label='Optimal', alpha=0.5)
ax1.set_xlabel('Round')
ax1.set_ylabel('Cumulative Reward')
ax1.set_title('Bandit Algorithm Comparison', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Arm selection distribution
ax2 = axes[1]
x = np.arange(3)
width = 0.25

eg_counts = [eg_arms.count(i) for i in range(3)]
ts_counts = [ts_arms.count(i) for i in range(3)]
ucb_counts = [ucb_arms.count(i) for i in range(3)]

ax2.bar(x - width, eg_counts, width, label='Epsilon-Greedy')
ax2.bar(x, ts_counts, width, label='Thompson Sampling')
ax2.bar(x + width, ucb_counts, width, label='UCB1')

ax2.set_xlabel('Arm')
ax2.set_ylabel('Times Selected')
ax2.set_title('Arm Selection Distribution', fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(['Arm 0\n(10%)', 'Arm 1\n(12%)', 'Arm 2\n(15%)'])
ax2.legend()

plt.tight_layout()
plt.show()

print("\nKey Insight: Thompson Sampling typically focuses on the best arm")
print("while still exploring enough to find it quickly.")

---

<a id='part9'></a>
# Part 9: Complete A/B Testing Framework

---

In [None]:
# ============================================================
# COMPLETE A/B TESTING FRAMEWORK
# ============================================================
print("="*70)
print("COMPLETE A/B TESTING FRAMEWORK")
print("="*70)

class ABTestFramework:
    """
    Complete A/B Testing Framework.
    
    Features:
    - Sample size calculation
    - Multiple statistical tests
    - Bayesian analysis
    - Effect size calculation
    - Business impact estimation
    """
    
    def __init__(self, alpha=0.05, power=0.80):
        self.alpha = alpha
        self.power = power
        self.results = {}
        
    def calculate_sample_size(self, baseline_rate, mde_relative):
        """
        Calculate required sample size.
        
        Args:
            baseline_rate: Current conversion rate
            mde_relative: Minimum detectable effect (relative, e.g., 0.10 for 10%)
        """
        expected_rate = baseline_rate * (1 + mde_relative)
        
        z_alpha = norm.ppf(1 - self.alpha/2)
        z_beta = norm.ppf(self.power)
        p_bar = (baseline_rate + expected_rate) / 2
        
        n = 2 * ((z_alpha + z_beta)**2) * p_bar * (1 - p_bar) / ((expected_rate - baseline_rate)**2)
        
        return int(np.ceil(n))
    
    def run_test(self, control_data, treatment_data, metric_type='binary'):
        """
        Run complete A/B test analysis.
        
        Args:
            control_data: Array of control group outcomes
            treatment_data: Array of treatment group outcomes
            metric_type: 'binary' or 'continuous'
        """
        results = {}
        
        # Basic statistics
        n_control = len(control_data)
        n_treatment = len(treatment_data)
        
        if metric_type == 'binary':
            # Conversion rates
            conv_control = sum(control_data)
            conv_treatment = sum(treatment_data)
            rate_control = conv_control / n_control
            rate_treatment = conv_treatment / n_treatment
            
            # Z-test
            z_result = ABTestSuite.z_test_proportions(
                conv_control, n_control, conv_treatment, n_treatment, self.alpha
            )
            
            # Bayesian
            bayes = BayesianABTest()
            prob_b_better, _, _ = bayes.prob_b_better(
                conv_control, n_control, conv_treatment, n_treatment
            )
            
            results['frequentist'] = z_result
            results['bayesian'] = {'prob_b_better': prob_b_better}
            results['effect'] = {
                'absolute': rate_treatment - rate_control,
                'relative': (rate_treatment - rate_control) / rate_control * 100
            }
            
        else:  # continuous
            # T-test
            t_result = ABTestSuite.t_test(control_data, treatment_data, self.alpha)
            results['frequentist'] = t_result
        
        results['sample_sizes'] = {'control': n_control, 'treatment': n_treatment}
        
        self.results = results
        return results
    
    def get_recommendation(self):
        """
        Get recommendation based on test results.
        """
        if not self.results:
            return "No test results available"
        
        freq = self.results.get('frequentist', {})
        bayes = self.results.get('bayesian', {})
        effect = self.results.get('effect', {})
        
        # Decision logic
        is_significant = freq.get('significant', False)
        prob_better = bayes.get('prob_b_better', 0.5)
        relative_lift = effect.get('relative', 0)
        
        if is_significant and prob_better > 0.95 and relative_lift > 0:
            return "STRONG RECOMMENDATION: Implement Treatment"
        elif is_significant and prob_better > 0.80 and relative_lift > 0:
            return "RECOMMENDATION: Implement Treatment (with monitoring)"
        elif prob_better > 0.75:
            return "CAUTIOUS: Treatment looks promising, consider extending test"
        elif prob_better < 0.25:
            return "NOT RECOMMENDED: Treatment performs worse than Control"
        else:
            return "INCONCLUSIVE: Continue testing or accept no difference"
    
    def generate_report(self):
        """
        Generate comprehensive test report.
        """
        if not self.results:
            return "No results to report"
        
        freq = self.results.get('frequentist', {})
        bayes = self.results.get('bayesian', {})
        effect = self.results.get('effect', {})
        sizes = self.results.get('sample_sizes', {})
        
        def fmt(value, spec):
            # Format numbers; fall back to 'N/A' when a value is missing
            # (e.g. no Bayesian or effect results for continuous metrics)
            try:
                return format(value, spec)
            except (TypeError, ValueError):
                return 'N/A'
        
        report = f"""
{'='*60}
A/B TEST REPORT
{'='*60}

SAMPLE SIZES
{'-'*40}
Control:   {fmt(sizes.get('control'), ',')}
Treatment: {fmt(sizes.get('treatment'), ',')}

FREQUENTIST ANALYSIS
{'-'*40}
Test: {freq.get('test', 'N/A')}
Test Statistic: {fmt(freq.get('statistic'), '.4f')}
P-value: {fmt(freq.get('p_value'), '.6f')}
Significant (α={self.alpha}): {'YES' if freq.get('significant') else 'NO'}
95% CI: {freq.get('ci_95', 'N/A')}

BAYESIAN ANALYSIS
{'-'*40}
P(Treatment > Control): {fmt(bayes.get('prob_b_better'), '.2%')}

EFFECT SIZE
{'-'*40}
Absolute Difference: {fmt(effect.get('absolute'), '.4f')} ({fmt(effect.get('absolute'), '.2%')})
Relative Lift: {fmt(effect.get('relative'), '.1f')}%

RECOMMENDATION
{'-'*40}
{self.get_recommendation()}

{'='*60}
"""
        return report

# Use the framework
framework = ABTestFramework(alpha=0.05, power=0.80)

# Run test
results = framework.run_test(
    control['converted'].values,
    treatment['converted'].values,
    metric_type='binary'
)

# Generate report
print(framework.generate_report())
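
The framework above delegates to `ABTestSuite.z_test_proportions` and `BayesianABTest.prob_b_better`, whose bodies are not shown in this part. As a rough, standalone sketch of the two quantities the report combines, the block below computes a pooled two-proportion z-test and a Monte Carlo estimate of P(B > A). The function names, the Beta(1, 1) prior, and the example counts are assumptions made for illustration, not the tutorial's own implementation.

```python
import math
import random
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for the difference of two proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under H0 (no difference between the groups)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def prob_b_better_mc(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(p_B > p_A), assuming Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + successes, 1 + failures)
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws


# Illustrative counts: 500/5000 conversions in control, 570/5000 in treatment
z, p_value = two_proportion_z_test(500, 5000, 570, 5000)
prob = prob_b_better_mc(500, 5000, 570, 5000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print(f"P(B > A) ~ {prob:.3f}")
```

Seeing both numbers side by side makes the decision thresholds in `get_recommendation` easier to reason about: the p-value answers "how surprising is this under no difference", while P(B > A) answers "how likely is the treatment to be better".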

---

<a id='part10'></a>
# Part 10: Summary

---

In [None]:
# Final summary
print("="*70)
print("A/B TESTING FRAMEWORK - SUMMARY")
print("="*70)

print("""
WHAT WE LEARNED:
================

1. HYPOTHESIS TESTING FRAMEWORK:
   ┌─────────────────────────────────────────────┐
   │ H₀: No difference (p_A = p_B)              │
   │ H₁: There is a difference (p_A ≠ p_B)      │
   │                                             │
   │ If p-value < α: Reject H₀ (significant)    │
   │ If p-value ≥ α: Fail to reject H₀          │
   └─────────────────────────────────────────────┘

2. CHOOSING THE RIGHT TEST:
   ┌──────────────────┬─────────────────────────┐
   │ Data Type        │ Test                    │
   ├──────────────────┼─────────────────────────┤
   │ Binary           │ Z-test, Chi-square      │
   │ Continuous       │ t-test                  │
   │ Non-normal       │ Mann-Whitney U          │
   │ Small samples    │ Fisher's exact          │
   └──────────────────┴─────────────────────────┘

3. SAMPLE SIZE FORMULA (Proportions, per group):
   n = 2 × (Z_{α/2} + Z_β)² × p̄(1-p̄) / (p₁ - p₂)²
   where p̄ = (p₁ + p₂) / 2

4. KEY METRICS:
   - Statistical Significance: p-value < α
   - Practical Significance: Effect size, business impact
   - Confidence Interval: Range of plausible values

5. COMMON PITFALLS:
   ┌─────────────────────────────────────────────┐
   │ ❌ Peeking at results early                 │
   │ ❌ Multiple testing without correction      │
   │ ❌ Stopping early when significant          │
   │ ❌ Ignoring practical significance          │
   │ ❌ Selection bias in assignment             │
   └─────────────────────────────────────────────┘

6. BAYESIAN A/B TESTING:
   - Output: P(B > A) directly
   - More intuitive interpretation
   - Can incorporate prior knowledge
   - Flexible stopping rules

7. MULTI-ARMED BANDITS:
   - Balance exploration vs exploitation
   - Minimize opportunity cost
   - Thompson Sampling is often a strong default
""")

print("\nA/B TESTING CHECKLIST:")
print("  [1] Define hypothesis and success metric")
print("  [2] Calculate required sample size")
print("  [3] Randomize users properly")
print("  [4] Run test for full duration (no peeking!)")
print("  [5] Analyze with appropriate statistical test")
print("  [6] Consider both statistical AND practical significance")
print("  [7] Document and communicate results")

print("\n" + "="*70)
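
The sample size formula from the summary can be evaluated directly with the standard library. The sketch below is one way to do it; the helper name `sample_size_per_group` is hypothetical, and it assumes a two-sided test with p̄ taken as the average of the two rates.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group to detect p1 vs p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}, two-sided
    z_beta = NormalDist().inv_cdf(power)           # Z_beta, with power = 1 - beta
    p_bar = (p1 + p2) / 2                          # assumed: average of the two rates
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2
    return math.ceil(n)


# Illustrative: baseline 10% conversion, hoping to detect a lift to 12%
n_small_effect = sample_size_per_group(0.10, 0.12)
n_large_effect = sample_size_per_group(0.10, 0.15)
print(f"10% -> 12%: {n_small_effect:,} users per group")
print(f"10% -> 15%: {n_large_effect:,} users per group")
```

Because the effect appears squared in the denominator, halving the minimum detectable effect roughly quadruples the required sample size.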

## Algorithm & Method Taxonomy

### Statistical Tests

| Test | Data Type | Assumption | Use Case |
|------|-----------|------------|----------|
| **Z-test** | Binary | Large n | Conversion rates |
| **Chi-square** | Binary | Large n | Independence test |
| **t-test** | Continuous | ~Normal | Revenue, time |
| **Welch's t-test** | Continuous | ~Normal, variances may differ | Default for continuous metrics |
| **Mann-Whitney U** | Continuous | Non-parametric | Skewed data |
| **Fisher's exact** | Binary | Small n | Small samples |
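
To make the small-sample row concrete, here is a minimal pure-Python sketch of a two-sided Fisher's exact test on a 2×2 table. The function name `fisher_exact_two_sided` and the example counts are illustrative assumptions; in practice a statistics library would normally be used instead.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are (converted, not converted).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x conversions in row 1, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_observed * (1 + 1e-7))
    return min(p_value, 1.0)


# Illustrative: control 1 of 10 converted, treatment 11 of 14 converted
p_fisher = fisher_exact_two_sided(1, 9, 11, 3)
print(f"Fisher's exact p-value: {p_fisher:.4f}")
```

The test enumerates every table with the same margins, so it stays valid when the large-n approximations behind the z-test and chi-square test break down.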

### Effect Size Measures

| Measure | Formula | Interpretation |
|---------|---------|----------------|
| **Absolute Diff** | p_B - p_A | Direct difference |
| **Relative Lift** | (p_B - p_A) / p_A × 100 | % improvement over control |
| **Odds Ratio** | (p_B/(1-p_B)) / (p_A/(1-p_A)) | Odds comparison |
| **Cohen's d** | (μ_B - μ_A) / σ_pooled | Standardized (continuous) |
| **Cohen's h** | 2(arcsin√p_B - arcsin√p_A) | Standardized (binary) |
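
The formulas in this table translate almost line for line into code. The sketch below uses hypothetical helper names (`proportion_effect_sizes`, `cohens_d`) and made-up inputs, and it takes σ_pooled to be the pooled sample standard deviation.

```python
import math
from statistics import fmean, variance


def proportion_effect_sizes(p_a, p_b):
    """Effect sizes for two conversion rates (A = control, B = treatment)."""
    return {
        'absolute': p_b - p_a,
        'relative_pct': (p_b - p_a) / p_a * 100,
        'odds_ratio': (p_b / (1 - p_b)) / (p_a / (1 - p_a)),
        'cohens_h': 2 * (math.asin(math.sqrt(p_b)) - math.asin(math.sqrt(p_a))),
    }


def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    return (fmean(sample_b) - fmean(sample_a)) / math.sqrt(pooled_var)


# Illustrative inputs
effects = proportion_effect_sizes(0.10, 0.12)
d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
for name, value in effects.items():
    print(f"{name}: {value:.4f}")
print(f"cohens_d: {d:.4f}")
```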

### Multiple Testing Corrections

| Method | Controls | When to Use |
|--------|----------|-------------|
| **Bonferroni** | FWER | Few tests, need strict control |
| **Benjamini-Hochberg** | FDR | Many tests, can tolerate some false positives |
| **Holm** | FWER | Step-down procedure, uniformly more powerful than Bonferroni |
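
The three corrections differ only in the threshold each p-value is compared against. The following is a minimal sketch with hypothetical function names and illustrative p-values; each function returns one reject/keep flag per hypothesis, in the original order.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m (controls FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down Holm procedure (controls FWER)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Thresholds relax as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up Benjamini-Hochberg procedure (controls FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Track the largest rank k with p_(k) <= alpha * k / m
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Illustrative p-values from five metrics tested in one experiment
p_vals = [0.001, 0.012, 0.014, 0.039, 0.20]
print("Bonferroni:", bonferroni(p_vals))
print("Holm:      ", holm(p_vals))
print("BH:        ", benjamini_hochberg(p_vals))
```

On these inputs the number of rejections grows from Bonferroni to Holm to Benjamini-Hochberg, which mirrors the strictness ordering in the table.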

### Bandit Algorithms

| Algorithm | Strategy | Pros | Cons |
|-----------|----------|------|------|
| **Epsilon-Greedy** | Random exploration | Simple | Keeps exploring at a fixed rate, even after a clear winner emerges |
| **UCB1** | Optimistic exploration | No tuning parameters | Can over-explore |
| **Thompson Sampling** | Probability matching | Often performs best in practice | More complex |
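
As a compact illustration of the probability-matching row, the sketch below simulates Thompson Sampling on Bernoulli arms with Beta(1, 1) priors. The function name, the true conversion rates, and the number of rounds are assumptions chosen for the demo.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=42):
    """Simulate Thompson Sampling; returns the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(k)]
        # Play the arm whose sampled rate is highest
        arm = max(range(k), key=lambda i: samples[i])
        # Simulated user response (the true rates are unknown in a real test)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Illustrative arms with 5%, 10% and 20% true conversion rates
pulls = thompson_sampling([0.05, 0.10, 0.20])
for i, count in enumerate(pulls):
    print(f"Arm {i}: {count} pulls")
```

Arms that look worse are sampled less and less often as evidence accumulates, which is how the algorithm limits the opportunity cost of exploration.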

---

## Checklist

- [x] Understand null and alternative hypotheses
- [x] Know when to use which statistical test
- [x] Can calculate required sample size
- [x] Understand statistical vs practical significance
- [x] Know common pitfalls (peeking, multiple testing)
- [x] Can apply multiple testing corrections
- [x] Understand Bayesian A/B testing
- [x] Know when to use multi-armed bandits

---

**End of A/B Testing Framework Tutorial**