# Week 8: A/B Testing & Hypothesis Testing

**Goal:** Master statistical hypothesis testing and A/B testing for marketing experiments.

**Time Commitment:** ~1 hour per day × 7 days = 7 hours total

**What You'll Learn:**
- A/B test fundamentals and experimental design
- Sample size calculations and power analysis
- Two-proportion z-tests for conversion rates
- Statistical vs practical significance
- Multiple testing corrections
- Sequential testing for faster decisions
- Real marketing experiment design and analysis

**Why This Matters:**
As a Marketing Measurement Partner, A/B testing is your primary tool for:
- Proving that creative changes improve performance
- Optimizing landing pages and user experiences
- Making data-driven decisions with confidence
- Avoiding costly mistakes based on random noise
- Calculating the ROI of optimization efforts

A/B testing separates opinions from evidence. Master it, and you can tie incremental revenue to specific changes instead of guessing.

---

## 📅 Day 50: A/B Test Fundamentals (~60 min)

### Learning Objectives
- Understand the scientific method applied to marketing
- Learn the components of an A/B test
- Formulate null and alternative hypotheses
- Understand Type I and Type II errors

### The Business Problem
Your marketing team believes a new landing page design will increase conversion rates. Before rolling it out to all traffic, you need to prove it works with statistical rigor.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
np.random.seed(42)
sns.set_style('whitegrid')
pd.set_option('display.precision', 4)

### 📖 Concept: The A/B Testing Framework

**Components of an A/B Test:**
1. **Control (A)**: Current version (baseline)
2. **Treatment (B)**: New version (variant)
3. **Metric**: What you're measuring (e.g., conversion rate)
4. **Hypothesis**: Prediction about the treatment's effect
5. **Sample Size**: How many users in each group
6. **Significance Level (α)**: Risk of false positive (typically 0.05 or 5%)

**Hypotheses:**
- **Null Hypothesis (H₀)**: No difference between A and B
- **Alternative Hypothesis (H₁)**: There is a difference

**Decision Errors:**
- **Type I Error (α)**: False positive - declaring a winner when there's no real difference
- **Type II Error (β)**: False negative - missing a real difference
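
The Type I error rate can be checked empirically with A/A tests: both groups share the same true conversion rate, so every "significant" result is a false positive. The sketch below is a minimal simulation using only the standard library. The 5% rate, the group size, and the number of simulated tests are illustrative assumptions, not values from a real experiment.

```python
import math
import random

random.seed(42)

TRUE_RATE = 0.05      # assumed: both groups convert at the same true rate
N_PER_GROUP = 2000    # assumed visitors per group
N_TESTS = 1000        # number of simulated A/A tests
Z_CRITICAL = 1.96     # two-sided critical value for alpha = 0.05


def simulate_conversions(n, rate):
    """Count conversions among n visitors who each convert with probability `rate`."""
    return sum(random.random() < rate for _ in range(n))


false_positives = 0
for _ in range(N_TESTS):
    conv_a = simulate_conversions(N_PER_GROUP, TRUE_RATE)
    conv_b = simulate_conversions(N_PER_GROUP, TRUE_RATE)

    # Two-proportion z-statistic with a pooled standard error
    pooled = (conv_a + conv_b) / (2 * N_PER_GROUP)
    se = math.sqrt(pooled * (1 - pooled) * (2 / N_PER_GROUP))
    z = ((conv_b - conv_a) / N_PER_GROUP) / se if se > 0 else 0.0

    if abs(z) > Z_CRITICAL:
        false_positives += 1

false_positive_rate = false_positives / N_TESTS
print(f"False positive rate across {N_TESTS} A/A tests: {false_positive_rate:.3f}")
```

The printed rate should land near α = 0.05: even with no real difference, roughly one test in twenty crosses the threshold by chance.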

In [None]:
# Example: Landing page A/B test data
test_data = pd.DataFrame({
    'variant': ['Control', 'Treatment'],
    'visitors': [10000, 10000],
    'conversions': [350, 412]
})

# Calculate conversion rates
test_data['conversion_rate'] = test_data['conversions'] / test_data['visitors']

print("A/B Test Results:")
print(test_data)
print(f"\nObserved Lift: {((test_data.loc[1, 'conversion_rate'] / test_data.loc[0, 'conversion_rate']) - 1) * 100:.2f}%")

### 💡 Try It: Calculate Relative and Absolute Lift

Understanding different ways to measure improvement is crucial for communicating results.

In [None]:
# YOUR CODE HERE
# Given the test_data above:
# 1. Calculate absolute lift (treatment_rate - control_rate)
# 2. Calculate relative lift ((treatment_rate - control_rate) / control_rate)
# 3. Which metric is more meaningful for business decisions?



### 📖 Concept: Statistical Significance

Just observing a difference doesn't mean it's real. We need to account for random chance.

**P-value**: The probability of seeing results this extreme (or more) if there's actually no difference.
- If p < 0.05, we reject the null hypothesis (declare significance)
- If p ≥ 0.05, we fail to reject the null (insufficient evidence)

In [None]:
# Visualize the concept of p-value
control_rate = test_data.loc[0, 'conversion_rate']
treatment_rate = test_data.loc[1, 'conversion_rate']

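# Simulate the rates we'd observe in a group whose true rate equals the control rate (the null hypothesis).
# Simplification: this treats the control rate as known; a two-sample test also accounts for noise in the control group.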
null_distribution = np.random.binomial(10000, control_rate, size=10000) / 10000

plt.figure(figsize=(12, 5))
plt.hist(null_distribution, bins=50, alpha=0.7, edgecolor='black')
plt.axvline(treatment_rate, color='red', linestyle='--', linewidth=2, label=f'Treatment Rate: {treatment_rate:.4f}')
plt.axvline(control_rate, color='blue', linestyle='--', linewidth=2, label=f'Control Rate: {control_rate:.4f}')
plt.xlabel('Conversion Rate')
plt.ylabel('Frequency')
plt.title('Null Distribution: What We\'d Expect by Random Chance')
plt.legend()
plt.show()

print(f"Treatment rate is {abs(treatment_rate - control_rate) / np.std(null_distribution):.2f} standard deviations from control")

### ✏️ Exercise 1: Interpret Test Results

You ran three different A/B tests. Interpret each result.

In [None]:
results = pd.DataFrame({
    'test_name': ['Email Subject Line', 'Ad Creative', 'Pricing Page'],
    'control_cvr': [0.025, 0.042, 0.105],
    'treatment_cvr': [0.028, 0.045, 0.098],
    'p_value': [0.023, 0.112, 0.086],
    'sample_size_per_group': [50000, 20000, 8000]
})

# YOUR CODE HERE
# For each test:
# 1. Calculate relative lift
# 2. Determine if it's statistically significant (p < 0.05)
# 3. Make a recommendation: Implement, Don't Implement, or Run Longer
# 4. Explain your reasoning



### 🎯 Day 50 Mini-Project: Design Your First A/B Test

Design a complete A/B test for a marketing scenario.

In [None]:
# Scenario: You want to test a new CTA button color on your product page
# Current performance:
# - Daily visitors: 5,000
# - Current conversion rate: 4.0%
# - Expected lift from new button: 10% relative improvement

# YOUR CODE HERE
# Design the test by specifying:
# 1. Null hypothesis
# 2. Alternative hypothesis  
# 3. Primary metric
# 4. Significance level (α)
# 5. Desired power (1 - β), typically 0.80
# 6. Minimum detectable effect
# 7. How you'll split traffic (50/50, 90/10, etc.)
# 8. Any guardrail metrics you'll monitor
#
# Create a document/dict with your test plan



### 🎓 Day 50 Key Takeaways

✅ A/B testing applies the scientific method to marketing  
✅ Always formulate hypotheses before running tests  
✅ A significance threshold controls the false positive rate; it does not eliminate false positives  
✅ A p-value is the probability of results at least this extreme if the null hypothesis is true  
✅ Consider both Type I and Type II errors  

**Next:** Tomorrow we'll calculate required sample sizes!

---

## 📅 Day 51: Sample Size Calculation (~60 min)

### Learning Objectives
- Calculate required sample size for A/B tests
- Understand the relationship between sample size, power, and MDE
- Estimate test duration
- Make trade-offs between speed and sensitivity

### The Business Problem
Before launching a test, you need to know: "How long will this take?" Running tests too short leads to false conclusions. Running them too long wastes time.

### 📖 Concept: Sample Size Factors

Sample size depends on four factors:
1. **Baseline conversion rate (p₁)**: Current performance
2. **Minimum detectable effect (MDE)**: Smallest lift you care about
3. **Significance level (α)**: Usually 0.05 (5% false positive risk)
4. **Statistical power (1-β)**: Usually 0.80 (80% chance to detect real effect)

**Trade-offs:**
- Smaller MDE → Larger sample needed
- Higher power → Larger sample needed
- Lower α → Larger sample needed
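
These trade-offs follow from a standard normal-approximation formula for equal-sized groups, n ≈ 2(z₁₋α/₂ + z₁₋β)² / h², where h is Cohen's h for the two rates. The sketch below implements that approximation with the standard library so the arithmetic is visible. `approx_sample_size` is a hypothetical helper name, and its output should be close to, but not necessarily identical to, what a dedicated power solver returns.

```python
import math
from statistics import NormalDist


def approx_sample_size(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    treatment_rate = baseline_rate * (1 + relative_mde)

    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * (math.asin(math.sqrt(treatment_rate)) - math.asin(math.sqrt(baseline_rate)))

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.80

    return math.ceil(2 * (z_alpha + z_power) ** 2 / h ** 2)


n_10 = approx_sample_size(0.05, 0.10)
n_05 = approx_sample_size(0.05, 0.05)
print(f"10% relative MDE: {n_10:,} per group")
print(f" 5% relative MDE: {n_05:,} per group ({n_05 / n_10:.1f}x as many)")
```

Because h scales roughly linearly with the MDE and enters the formula squared, halving the MDE roughly quadruples the required sample.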

In [None]:
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

def calculate_sample_size(baseline_rate, minimum_detectable_effect, alpha=0.05, power=0.80):
    """
    Calculate required sample size per group for A/B test.
    
    Parameters:
    - baseline_rate: Current conversion rate (e.g., 0.05 for 5%)
    - minimum_detectable_effect: Relative lift to detect (e.g., 0.10 for 10%)
    - alpha: Significance level (default 0.05)
    - power: Statistical power (default 0.80)
    
    Returns:
    - Required sample size per group
    """
    # Calculate treatment rate
    treatment_rate = baseline_rate * (1 + minimum_detectable_effect)
    
    # Calculate effect size (Cohen's h); treatment rate goes first so h is positive for a lift
    effect_size = proportion_effectsize(treatment_rate, baseline_rate)
    
    # Calculate sample size per group
    sample_size = zt_ind_solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        ratio=1.0,  # Equal group sizes
        alternative='two-sided'
    )
    
    return int(np.ceil(sample_size))

# Example: Landing page test
baseline_cvr = 0.05  # 5% conversion rate
mde = 0.10  # Want to detect 10% relative lift

required_sample = calculate_sample_size(baseline_cvr, mde)
print(f"Required sample size per group: {required_sample:,}")
print(f"Total sample size (both groups): {required_sample * 2:,}")

### 💡 Try It: Calculate Test Duration

Given your daily traffic, estimate how long the test needs to run.

In [None]:
# YOUR CODE HERE
# Given:
# - Required sample per group: from calculation above
# - Daily visitors: 5,000
# - Traffic allocation: 50% to each group
#
# Calculate:
# 1. Daily sample per group
# 2. Days needed to reach required sample
# 3. Should you run for full weeks to account for day-of-week effects?



### 📖 Concept: MDE Trade-offs

Smaller MDE = more sensitivity = longer tests. You need to balance business needs with statistical requirements.

In [None]:
# Compare sample sizes for different MDEs
baseline = 0.05
mde_values = [0.05, 0.10, 0.15, 0.20, 0.25]

sample_sizes = []
for mde in mde_values:
    n = calculate_sample_size(baseline, mde)
    sample_sizes.append(n)

# Visualize
plt.figure(figsize=(10, 6))
plt.plot([m*100 for m in mde_values], sample_sizes, marker='o', linewidth=2, markersize=8)
plt.xlabel('Minimum Detectable Effect (%)', fontsize=12)
plt.ylabel('Required Sample Size Per Group', fontsize=12)
plt.title('Sample Size vs. Minimum Detectable Effect\n(Baseline CVR: 5%, Power: 80%, α: 0.05)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.ticklabel_format(style='plain', axis='y')

# Add annotations
for mde, n in zip(mde_values, sample_sizes):
    plt.annotate(f'{n:,}', xy=(mde*100, n), xytext=(5, 5), textcoords='offset points')

plt.show()

print("\nSample Size Comparison:")
for mde, n in zip(mde_values, sample_sizes):
    print(f"MDE: {mde*100:5.1f}% → Sample size: {n:6,} per group ({n*2:7,} total)")

### ✏️ Exercise 2: Multi-Scenario Planning

Create a sample size calculator for different scenarios.

In [None]:
# Test scenarios
scenarios = pd.DataFrame({
    'test_name': ['Email Open Rate', 'Landing Page CVR', 'Add-to-Cart Rate', 'Purchase CVR'],
    'baseline_rate': [0.25, 0.05, 0.15, 0.03],
    'target_mde': [0.10, 0.15, 0.10, 0.20],
    'daily_traffic': [50000, 10000, 8000, 5000]
})

# YOUR CODE HERE
# For each scenario:
# 1. Calculate required sample size per group
# 2. Calculate total sample size
# 3. Estimate test duration in days (50/50 split)
# 4. Round up to full weeks
# 5. Create a summary table
# 6. Which test will take longest? Which is fastest?



### 🎯 Day 51 Mini-Project: Sample Size Calculator Tool

Build an interactive sample size calculator with multiple what-if scenarios.

In [None]:
# YOUR CODE HERE
# Create a comprehensive function that:
# 
# Takes inputs:
# - baseline_rate
# - mde
# - alpha (default 0.05)
# - power (default 0.80)
# - daily_traffic
# - traffic_allocation (default 0.50)
#
# Returns a detailed report including:
# - Required sample per group
# - Total required sample
# - Estimated days to complete
# - Recommended duration (rounded to full weeks)
# - Expected absolute lift (conversions per day)
# - Sensitivity analysis (what if MDE is 20% higher/lower?)
#
# Test it with:
# - Baseline: 4%
# - MDE: 12.5%
# - Daily traffic: 8,000



### 🎓 Day 51 Key Takeaways

✅ Sample size calculations prevent underpowered tests  
✅ Smaller MDEs require larger samples (longer tests)  
✅ Always calculate duration before starting  
✅ Business value should inform MDE selection  
✅ Account for weekly seasonality in test planning  

**Next:** Tomorrow we'll perform two-proportion z-tests!

---

## 📅 Day 52: Two-Proportion Z-Tests (~60 min)

### Learning Objectives
- Perform two-proportion z-tests
- Calculate confidence intervals
- Interpret test statistics and p-values
- Make statistical decisions

### The Business Problem
Your A/B test has collected enough data. Now you need to analyze it properly and determine if the treatment truly outperforms the control.

### 📖 Concept: Two-Proportion Z-Test

Tests whether two proportions (conversion rates) are significantly different.

**Formula:**
```
z = (p₁ - p₂) / SE

where SE = sqrt(p_pooled * (1 - p_pooled) * (1/n₁ + 1/n₂))
p_pooled = (x₁ + x₂) / (n₁ + n₂)
```

**Decision Rule:**
- If |z| > 1.96 (for α=0.05), reject null hypothesis
- Or equivalently, if p-value < 0.05, reject null hypothesis
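
To make the formula concrete, here is a hand computation for the landing page example (350 of 10,000 control visitors converting versus 412 of 10,000 treatment visitors). It is a sketch that uses only the standard library and follows the pooled formula above step by step; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Landing page example: conversions (x) and visitors (n) per group
x_control, n_control = 350, 10000
x_treatment, n_treatment = 412, 10000

p_control = x_control / n_control
p_treatment = x_treatment / n_treatment

# Pooled proportion under the null hypothesis of no difference
p_pooled = (x_control + x_treatment) / (n_control + n_treatment)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))

z = (p_treatment - p_control) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if abs(z) > 1.96 else "Fail to reject H0")
```

Here z is about 2.29, which exceeds 1.96, so both forms of the decision rule reject the null hypothesis.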

In [None]:
from statsmodels.stats.proportion import proportions_ztest

def analyze_ab_test(control_conversions, control_visitors, 
                    treatment_conversions, treatment_visitors,
                    alpha=0.05):
    """
    Perform two-proportion z-test for A/B test.
    
    Returns:
    - Dictionary with test results and interpretation
    """
    # Conversion rates
    control_rate = control_conversions / control_visitors
    treatment_rate = treatment_conversions / treatment_visitors
    
    # Absolute and relative lift
    absolute_lift = treatment_rate - control_rate
    relative_lift = (treatment_rate - control_rate) / control_rate
    
    # Perform z-test
    count = np.array([treatment_conversions, control_conversions])
    nobs = np.array([treatment_visitors, control_visitors])
    z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')
    
    # 95% confidence interval for the difference (unpooled SE; the 1.96 multiplier is fixed and does not follow alpha)
    se_diff = np.sqrt(
        (control_rate * (1 - control_rate) / control_visitors) +
        (treatment_rate * (1 - treatment_rate) / treatment_visitors)
    )
    ci_lower = absolute_lift - 1.96 * se_diff
    ci_upper = absolute_lift + 1.96 * se_diff
    
    # Decision
    is_significant = p_value < alpha
    
    results = {
        'control_rate': control_rate,
        'treatment_rate': treatment_rate,
        'absolute_lift': absolute_lift,
        'relative_lift': relative_lift,
        'z_statistic': z_stat,
        'p_value': p_value,
        'ci_95_lower': ci_lower,
        'ci_95_upper': ci_upper,
        'is_significant': is_significant,
        'decision': 'SIGNIFICANT' if is_significant else 'NOT SIGNIFICANT'
    }
    
    return results

# Example: Landing page test
results = analyze_ab_test(
    control_conversions=350,
    control_visitors=10000,
    treatment_conversions=412,
    treatment_visitors=10000
)

print("A/B Test Results")
print("=" * 60)
print(f"Control CVR:     {results['control_rate']:.4f} ({results['control_rate']*100:.2f}%)")
print(f"Treatment CVR:   {results['treatment_rate']:.4f} ({results['treatment_rate']*100:.2f}%)")
print(f"Absolute Lift:   {results['absolute_lift']:.4f} ({results['absolute_lift']*100:.2f} percentage points)")
print(f"Relative Lift:   {results['relative_lift']:.4f} ({results['relative_lift']*100:.2f}%)")
print(f"\nZ-statistic:     {results['z_statistic']:.4f}")
print(f"P-value:         {results['p_value']:.4f}")
print(f"95% CI:          [{results['ci_95_lower']:.4f}, {results['ci_95_upper']:.4f}]")
print(f"\nDecision:        {results['decision']}")

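if results['is_significant'] and results['absolute_lift'] > 0:
    print("\n✅ RECOMMENDATION: Implement the treatment variant.")
    print(f"   Observed lift: {results['relative_lift']*100:.1f}%")
elif results['is_significant']:
    print("\n⚠️  RECOMMENDATION: Treatment is significantly worse. Keep the control.")
else:
    print("\n⚠️  RECOMMENDATION: Insufficient evidence. Don't implement.")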

### 💡 Try It: Analyze Real Test Data

Analyze multiple tests and make recommendations.

In [None]:
# YOUR CODE HERE
# Analyze these three tests:
#
# Test 1: Email subject line
# Control: 1,250 opens / 50,000 sends
# Treatment: 1,450 opens / 50,000 sends
#
# Test 2: CTA button color
# Control: 180 clicks / 8,000 visitors
# Treatment: 195 clicks / 8,000 visitors
#
# Test 3: Pricing page layout
# Control: 420 purchases / 5,000 visitors
# Treatment: 485 purchases / 5,000 visitors
#
# For each: Calculate results and make recommendation



### 📖 Concept: Confidence Intervals

Confidence intervals provide a range of plausible values for the true lift. If the 95% CI includes zero, the result is not significant at α = 0.05.
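
The link between interval width and confidence level can be seen by computing the interval at several levels for the same data. The sketch below reuses the landing page counts (350 vs 412 conversions out of 10,000 each) and a Wald interval with an unpooled standard error; the choice of 90%, 95%, and 99% levels is an illustrative assumption.

```python
import math
from statistics import NormalDist

x_control, n_control = 350, 10000
x_treatment, n_treatment = 412, 10000

p_control = x_control / n_control
p_treatment = x_treatment / n_treatment
diff = p_treatment - p_control

# Unpooled standard error of the difference in proportions
se_diff = math.sqrt(
    p_control * (1 - p_control) / n_control
    + p_treatment * (1 - p_treatment) / n_treatment
)

intervals = {}
for confidence in (0.90, 0.95, 0.99):
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    intervals[confidence] = (diff - z_crit * se_diff, diff + z_crit * se_diff)

for confidence, (low, high) in intervals.items():
    verdict = "excludes zero" if low > 0 or high < 0 else "includes zero"
    print(f"{confidence:.0%} CI: [{low:.4f}, {high:.4f}] ({verdict})")
```

For these counts the 95% interval excludes zero but the 99% interval does not, so the conclusion depends on the confidence level chosen before the test.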

In [None]:
# Visualize confidence intervals for multiple tests
test_results = [
    ('Email A', 0.015, 0.005, 0.025),
    ('Landing Page B', 0.032, 0.018, 0.046),
    ('Ad Creative C', 0.008, -0.002, 0.018),
    ('Checkout Flow D', -0.005, -0.015, 0.005),
]

fig, ax = plt.subplots(figsize=(12, 6))

for i, (name, lift, ci_low, ci_high) in enumerate(test_results):
    color = 'green' if ci_low > 0 else ('red' if ci_high < 0 else 'gray')
    ax.plot([ci_low, ci_high], [i, i], 'o-', linewidth=3, markersize=8, color=color)
    ax.plot([lift], [i], 'D', markersize=10, color=color)

ax.axvline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax.set_yticks(range(len(test_results)))
ax.set_yticklabels([name for name, *_ in test_results])
ax.set_xlabel('Absolute Lift (difference in conversion rate)', fontsize=12)
ax.set_title('95% Confidence Intervals for Test Results\n(Green = Significant Win, Red = Significant Loss, Gray = Inconclusive)', fontsize=14)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### ✏️ Exercise 3: One-Sided vs Two-Sided Tests

Understand when to use one-sided vs two-sided tests.

In [None]:
# Test data
control_conv = 400
control_vis = 10000
treatment_conv = 445
treatment_vis = 10000

# YOUR CODE HERE
# 1. Perform a two-sided test (can treatment be better OR worse?)
# 2. Perform a one-sided test (can treatment be better?)
# 3. Compare the p-values
# 4. When would you use one-sided? When two-sided?
# 5. What are the risks of using one-sided tests?
#
# Hint: Use proportions_ztest with alternative='larger' for one-sided



### 🎯 Day 52 Mini-Project: Automated Test Analyzer

Build a comprehensive A/B test analysis tool.

In [None]:
# YOUR CODE HERE
# Create a function that takes test data and produces:
#
# 1. Statistical results (z-stat, p-value, CIs)
# 2. Business metrics (absolute lift, relative lift, incremental conversions)
# 3. Visual report:
#    - Conversion rate comparison (bar chart)
#    - Confidence interval visualization
# 4. Written recommendation with reasoning
# 5. Revenue impact calculation (if revenue per conversion is provided)
#
# Test it with this data:
# Control: 3,250 conversions from 100,000 visitors
# Treatment: 3,680 conversions from 100,000 visitors
# Revenue per conversion: $45



### 🎓 Day 52 Key Takeaways

✅ Two-proportion z-tests are the workhorse of A/B testing  
✅ P-values quantify evidence against the null hypothesis  
✅ Confidence intervals show the range of plausible effects  
✅ Statistical significance ≠ practical significance  
✅ Always report both relative and absolute lift  

**Next:** Tomorrow we'll explore statistical vs practical significance!

---

## 📅 Day 53: Statistical vs Practical Significance (~60 min)

### Learning Objectives
- Distinguish statistical from practical significance
- Calculate business impact of test results
- Make economically rational decisions
- Avoid the "significance trap"

### The Business Problem
You found a statistically significant 1% lift in conversion rate. Should you implement it? What if it requires 3 months of engineering work?

### 📖 Concept: Statistical ≠ Practical Significance

**Statistical Significance:** The effect is unlikely due to chance (p < 0.05)

**Practical Significance:** The effect is large enough to matter for business decisions

With large samples, tiny effects become statistically significant but may not be worth implementing.
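
The sketch below holds the observed rates fixed at 5.0% and 5.1% and varies only the sample size, to show how the same small lift moves from "not significant" to "significant" as data accumulates. The rates and sample sizes are illustrative assumptions, and `two_sided_p_value` is a hypothetical helper for a pooled two-proportion z-test with equal groups.

```python
import math
from statistics import NormalDist

control_rate = 0.050    # assumed observed rate, for illustration
treatment_rate = 0.051  # 2% relative lift


def two_sided_p_value(p1, p2, n_per_group):
    """P-value of a pooled two-proportion z-test with equal group sizes."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * (2 / n_per_group))
    z = (p2 - p1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


sample_sizes = [50_000, 200_000, 1_000_000]
p_values = [two_sided_p_value(control_rate, treatment_rate, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    label = "significant" if p < 0.05 else "not significant"
    print(f"n = {n:>9,} per group -> p = {p:.4f} ({label})")
```

The lift is identical in every row; only the amount of data changes. Whether a 0.1 percentage point gain is worth acting on is a separate, economic question.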

In [None]:
# Example: Large sample, small effect
control_conv = 50000
control_vis = 1000000  # 5.00% CVR
treatment_conv = 51000
treatment_vis = 1000000  # 5.10% CVR (2% relative lift)

results = analyze_ab_test(control_conv, control_vis, treatment_conv, treatment_vis)

print("Large Sample, Small Effect")
print("=" * 60)
print(f"Sample size per group: {control_vis:,}")
print(f"Relative lift: {results['relative_lift']*100:.2f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Statistical significance: {results['decision']}")
print(f"\nBut is a {results['relative_lift']*100:.1f}% lift worth implementing?")
print(f"That depends on:")
print(f"  - Implementation cost")
print(f"  - Maintenance burden")
print(f"  - Opportunity cost")
print(f"  - Revenue impact")

### 💡 Try It: Calculate Business Impact

Determine if statistically significant results are worth implementing.

In [None]:
# YOUR CODE HERE
# Given:
# - Current monthly visitors: 500,000
# - Current CVR: 5.0%
# - Treatment CVR: 5.1% (2% relative lift)
# - Revenue per conversion: $50
# - Implementation cost: $25,000
# - Monthly maintenance: $2,000
#
# Calculate:
# 1. Monthly incremental conversions
# 2. Monthly incremental revenue
# 3. Annual incremental revenue
# 4. Payback period for implementation cost
# 5. ROI after 1 year
# 6. Should you implement? Why or why not?



### 📖 Concept: Minimum Practical Difference

Before testing, define the **minimum practical difference (MPD)**: the smallest effect worth implementing.

This is different from MDE (minimum detectable effect):
- **MDE**: What you can reliably detect
- **MPD**: What's worth implementing

Ideally, MDE ≤ MPD

In [None]:
def calculate_mpd(monthly_volume, baseline_cvr, revenue_per_conversion, 
                  implementation_cost, desired_payback_months=6):
    """
    Calculate minimum practical difference based on economic criteria.
    
    Returns:
    - Minimum relative lift needed to justify implementation
    """
    # Current monthly conversions
    baseline_conversions = monthly_volume * baseline_cvr
    
    # Monthly incremental revenue and conversions needed to pay back within the desired timeframe
    required_incremental_revenue = implementation_cost / desired_payback_months
    required_incremental_conversions = required_incremental_revenue / revenue_per_conversion
    
    # Required lift
    required_relative_lift = required_incremental_conversions / baseline_conversions
    
    return required_relative_lift

# Example
mpd = calculate_mpd(
    monthly_volume=500000,
    baseline_cvr=0.05,
    revenue_per_conversion=50,
    implementation_cost=25000,
    desired_payback_months=6
)

print(f"Minimum Practical Difference: {mpd*100:.2f}%")
print(f"\nAny lift below {mpd*100:.2f}% won't pay back within 6 months.")
print(f"Even if statistically significant, it's not worth implementing.")

### ✏️ Exercise 4: Economic Decision Framework

Build a decision framework that considers both statistical and practical significance.

In [None]:
test_scenarios = pd.DataFrame({
    'test': ['A', 'B', 'C', 'D'],
    'relative_lift': [0.15, 0.03, 0.08, 0.25],
    'p_value': [0.001, 0.032, 0.156, 0.089],
    'implementation_cost': [50000, 5000, 30000, 100000],
    'monthly_visitors': [1000000, 500000, 200000, 2000000],
    'baseline_cvr': [0.04, 0.06, 0.03, 0.05],
    'revenue_per_conversion': [75, 45, 120, 60]
})

# YOUR CODE HERE
# For each test:
# 1. Determine statistical significance (p < 0.05)
# 2. Calculate annual incremental revenue
# 3. Calculate first-year ROI, defined here as annual_revenue / implementation_cost
# 4. Calculate payback period in months
# 5. Make a recommendation:
#    - "Implement" if statistically significant AND ROI > 200%
#    - "Don't implement" if not statistically significant
#    - "Consider" if significant but ROI 100-200%
#    - "Don't implement" if ROI < 100%



### 🎯 Day 53 Mini-Project: Test Prioritization Framework

Create a framework to prioritize which tests to run based on potential business impact.

In [None]:
# YOUR CODE HERE
# You have 5 test ideas. Prioritize them based on:
# - Expected lift
# - Affected traffic volume
# - Implementation complexity (cost)
# - Time to implement
# - Confidence in success (probability)
#
# Create a scoring system that considers:
# 1. Expected value = (probability_of_success × expected_lift × baseline_cvr × volume × value_per_conversion)
# 2. Implementation cost
# 3. Time to results
#
# Calculate a priority score and rank the tests.
#
# Test ideas:
test_ideas = pd.DataFrame({
    'test': ['New homepage hero', 'Simplified checkout', 'Product page redesign', 
             'Email frequency', 'Mobile app onboarding'],
    'expected_lift': [0.10, 0.20, 0.15, 0.08, 0.25],
    'probability_of_success': [0.60, 0.40, 0.50, 0.70, 0.30],
    'monthly_affected_users': [500000, 100000, 300000, 200000, 50000],
    'baseline_cvr': [0.05, 0.30, 0.08, 0.12, 0.20],
    'revenue_per_conversion': [60, 80, 75, 50, 100],
    'implementation_weeks': [6, 12, 8, 2, 10],
    'implementation_cost': [40000, 100000, 60000, 10000, 80000]
})



### 🎓 Day 53 Key Takeaways

✅ Statistical significance doesn't guarantee business value  
✅ Always calculate ROI before implementing  
✅ Define minimum practical difference upfront  
✅ Consider implementation costs and payback period  
✅ Large samples can make tiny effects "significant"  

**Next:** Tomorrow we'll tackle the multiple testing problem!

---

## 📅 Day 54-56: Advanced Topics (Condensed)

### Day 54: Multiple Testing Problem
- Bonferroni correction
- False discovery rate (FDR)
- When and how to correct for multiple comparisons
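
As a preview, the two corrections can be sketched in a few lines of plain Python. The p-values below are hypothetical (eight variants compared against one control), and the helper names are illustrative rather than taken from any library.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for p-values at or below alpha / m (controls family-wise error)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= (k / m) * fdr
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank

    # Reject every hypothesis ranked at or below k
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values from eight variants tested against one control
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

uncorrected = [p < 0.05 for p in p_values]
bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)

print(f"Uncorrected: {sum(uncorrected)} significant")
print(f"Bonferroni:  {sum(bonferroni)} significant")
print(f"BH (FDR):    {sum(bh)} significant")
```

With these inputs, five results pass the uncorrected 0.05 threshold, one survives Bonferroni, and two survive Benjamini-Hochberg, which illustrates that Bonferroni is the more conservative of the two.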

### Day 55: Sequential Testing
- Always-valid p-values
- Sequential probability ratio test (SPRT)
- Stopping rules
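
A minimal sketch of Wald's SPRT for a single stream of conversions is shown below. It is a simplification: it tests one group's rate against two fixed hypotheses (H₀: rate = p0, H₁: rate = p1) rather than comparing two arms, and it uses Wald's approximate boundaries. The function name, the rates, and the toy input streams are all hypothetical.

```python
import math


def sprt_decision(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for a Bernoulli rate: H0 rate = p0 versus H1 rate = p1.

    Returns (decision, n_used), where decision is 'accept_h1', 'accept_h0',
    or 'continue' if the data ran out before a boundary was crossed.
    """
    upper = math.log((1 - beta) / alpha)   # crossing above: accept H1
    lower = math.log(beta / (1 - alpha))   # crossing below: accept H0

    llr = 0.0  # cumulative log-likelihood ratio
    n_used = 0
    for converted in observations:
        n_used += 1
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))

        # Stopping rule: check both boundaries after every observation
        if llr >= upper:
            return "accept_h1", n_used
        if llr <= lower:
            return "accept_h0", n_used

    return "continue", n_used


# Deterministic toy streams (hypothetical): all conversions, no conversions, too little data
print(sprt_decision([1] * 100, p0=0.04, p1=0.05))
print(sprt_decision([0] * 500, p0=0.04, p1=0.05))
print(sprt_decision([0, 1, 0, 0], p0=0.04, p1=0.05))
```

The boundaries are set by α and β before any data arrives, which is what makes it valid to check the result after every observation instead of only at a fixed sample size.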

### Day 56: Capstone - Design and Analyze A/B Test
- Full end-to-end test design
- Sample size calculation
- Analysis and recommendation
- Business case presentation

*Note: These sections would be fully expanded in a production version with detailed code examples, exercises, and mini-projects.*

---

### 🎓 Week 8 Complete!

**Congratulations!** You've mastered A/B testing fundamentals.

**What You've Learned:**
- ✅ A/B test design and hypothesis formulation
- ✅ Sample size calculations and power analysis
- ✅ Two-proportion z-tests and statistical inference
- ✅ Business-focused decision making
- ✅ Multiple testing and sequential testing

**Next Week:** Attribution modeling - understanding the customer journey!

---