# Marketing A/B Test Analysis: Complete Walkthrough

This notebook demonstrates a real-world marketing A/B test using the Marketing dataset (588K users).

**Scenario**: E-commerce company testing new ad creatives

**Business Question**: Should we switch to the new ad design?

**Primary Metric**: Conversion rate (binary outcome)

**Dataset**: faviovaz/marketing-ab-testing from Kaggle (588,101 observations)

## 📚 What You'll Learn

1. ✅ Data quality validation and SRM checks
2. ✅ Power analysis and sample size calculations
3. ✅ CUPED variance reduction with pre-experiment covariates
4. ✅ Guardrail metrics and non-inferiority testing
5. ✅ Novelty effect detection
6. ✅ Ship/hold/abandon decision framework
7. ✅ Business impact translation

## 🎯 Learning Objectives

By the end of this notebook, you'll understand:
- How to validate experiment randomization (SRM check)
- When your experiment has enough power to detect effects
- How CUPED can shorten your experiments by roughly 20-40% when a pre-experiment covariate predicts the outcome well
- How to protect guardrail metrics while optimizing primary metrics
- How to detect and handle novelty effects
- How to make data-driven ship/hold/abandon decisions

---

## Setup: Import Libraries

In [None]:
# Core imports
import pandas as pd
import numpy as np
from IPython.display import display, Markdown, HTML
import warnings
warnings.filterwarnings('ignore')

# AB Testing package imports
from ab_testing.data import loaders
from ab_testing.pipelines.marketing_pipeline import run_marketing_analysis
from ab_testing.core import power as power_module

print("✅ All imports successful!")
print("📊 Ready to analyze Marketing A/B test data")

---

## Part 1: Load and Inspect Data

### 📚 Why This Matters

**Always inspect data before analysis**. Real-world datasets have issues:
- Missing values
- Duplicates
- Outliers
- Type mismatches

**Bad data → Bad decisions**. Netflix, Booking.com, and Meta all run automated data quality checks before every experiment analysis.

In [None]:
# Load the full dataset
print("Loading Marketing A/B Test dataset...")
df = loaders.load_marketing_ab()

print(f"✅ Dataset loaded: {len(df):,} observations")
print(f"\nDataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

In [None]:
# Inspect first few rows
display(df.head())

In [None]:
# Data quality checks
print("=" * 70)
print("DATA QUALITY CHECKS")
print("=" * 70)

print(f"\n1. Missing Values:")
missing = df.isnull().sum()
if missing.sum() == 0:
    print("   ✅ No missing values detected")
else:
    print(missing[missing > 0])

print(f"\n2. Data Types:")
display(df.dtypes)

print(f"\n3. Control vs Treatment Split:")
split = df['test_group'].value_counts()
display(split)
ratio = split['treatment'] / split['control']
print(f"\n   Ratio: {ratio:.4f} (should be ~1.0 for 50/50 split)")
if abs(ratio - 1.0) < 0.05:
    print("   ✅ Split looks balanced")
else:
    print("   ⚠️ Split appears imbalanced - check SRM in next step!")

In [None]:
# Summary statistics
print("=" * 70)
print("CONVERSION RATES BY GROUP")
print("=" * 70)

conversion_summary = df.groupby('test_group')['converted'].agg([
    ('Total Users', 'count'),
    ('Conversions', 'sum'),
    ('Conversion Rate', 'mean')
])

display(conversion_summary)

control_rate = conversion_summary.loc['control', 'Conversion Rate']
treatment_rate = conversion_summary.loc['treatment', 'Conversion Rate']
absolute_diff = treatment_rate - control_rate
relative_diff = (treatment_rate / control_rate - 1) * 100

print(f"\n📊 Quick Analysis:")
print(f"   Control conversion: {control_rate:.4%}")
print(f"   Treatment conversion: {treatment_rate:.4%}")
print(f"   Absolute difference: {absolute_diff:.4%} ({absolute_diff*100:.2f} percentage points)")
print(f"   Relative lift: {relative_diff:.2f}%")
print(f"\n   {'✅ Treatment looks better!' if relative_diff > 0 else '❌ Treatment looks worse'}")
print(f"   (But we need statistical testing to confirm!)")

---

## Part 2: Run Complete Analysis Pipeline

Now we'll run the full analysis pipeline which performs all 8 steps:
1. Data validation
2. SRM check
3. Power analysis
4. Primary test (Z-test for proportions)
5. CUPED variance reduction
6. Guardrail metrics
7. Novelty detection
8. Decision framework

We'll run it with `verbose=False` to get structured results, then examine each component.

In [None]:
# Run the complete analysis pipeline
print("Running full Marketing A/B analysis pipeline...")
print("(This may take 30-60 seconds)\n")

results = run_marketing_analysis(sample_frac=1.0, verbose=False)

print("‚úÖ Analysis complete!")
print(f"\nAvailable results: {list(results.keys())}")

---

## Part 3: Step-by-Step Results Analysis

### Step 1: Sample Ratio Mismatch (SRM) Check

#### 📚 What is SRM?

**Sample Ratio Mismatch (SRM)** occurs when the actual group sizes don't match the expected ratio.

**Example**: You expect 50/50 split, but get 52/48.

**Why It Matters**: SRM indicates randomization failure → ALL subsequent results are INVALID.

**Common Causes**:
- Implementation bugs in randomization code
- Telemetry/tracking issues (some users not logged)
- Bot traffic
- Browser compatibility issues

**Industry Standard**: If SRM detected → STOP experiment, fix bug, restart. Don't analyze results.
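
To make the check concrete, here is a minimal sketch of an SRM test as a chi-square goodness-of-fit test with one degree of freedom. It uses only the standard library. The function name `srm_check`, its parameters, and the sample counts are hypothetical. The returned keys mirror the ones printed in the next cell for readability, but this is not necessarily how the pipeline computes its result.

```python
import math

def srm_check(n_control, n_treatment, expected_control_share=0.5, alpha=0.01):
    """Chi-square goodness-of-fit test for a two-group split (1 degree of freedom)."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total * (1 - expected_control_share)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom the chi-square tail probability has a closed form:
    # P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {
        "test_statistic": chi2,
        "p_value": p_value,
        "srm_detected": p_value < alpha,
    }

# The 52/48 example from the text at two hypothetical sample sizes
small = srm_check(52, 48)
large = srm_check(52_000, 48_000)
print(f"n=100:     chi2={small['test_statistic']:.2f}, SRM detected: {small['srm_detected']}")
print(f"n=100,000: chi2={large['test_statistic']:.2f}, SRM detected: {large['srm_detected']}")
```

The same 52/48 split is unremarkable with 100 users but a clear mismatch with 100,000, which is why SRM is judged by a p-value rather than by the raw ratio.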

In [None]:
print("=" * 70)
print("STEP 1: SAMPLE RATIO MISMATCH (SRM) CHECK")
print("=" * 70)

srm = results['srm_check']

print(f"\n📊 Test Results:")
print(f"   Chi-square statistic: {srm['test_statistic']:.4f}")
print(f"   P-value: {srm['p_value']:.6f}")
print(f"   Threshold (alpha): 0.01")
print(f"   SRM detected: {srm['srm_detected']}")

print(f"\n💡 INTERPRETATION:")
if srm['srm_detected']:
    print(f"   ⚠️  SRM DETECTED (p < 0.01)")
    print(f"   ⚠️  DO NOT TRUST ANY RESULTS - INVESTIGATE IMMEDIATELY")
    print(f"   ⚠️  Check: randomization code, tracking pixels, bot filters")
    print(f"\n   🏢 What companies do:")
    print(f"      - Netflix: Stops ALL analysis if SRM detected")
    print(f"      - Booking.com: Uses alpha=0.001 (even stricter)")
else:
    print(f"   ✅ No SRM detected (p = {srm['p_value']:.4f} > 0.01)")
    print(f"   ✅ Randomization appears valid - safe to proceed")
    print(f"\n   🏢 Industry best practice:")
    print(f"      - Always check SRM BEFORE any other analysis")
    print(f"      - Use alpha=0.01 (stricter than typical 0.05)")
    print(f"      - Even small SRM can indicate serious problems")

### Step 2: Power Analysis

#### 📚 What is Statistical Power?

**Power** = Probability of detecting an effect if it truly exists

**Typical Target**: 80% power (industry standard)

**Why It Matters**: Underpowered tests miss real effects (false negatives)

**Key Inputs**:
- **Baseline rate**: Current conversion rate (e.g., 5%)
- **MDE (Minimum Detectable Effect)**: Smallest change you care about (e.g., 2% relative = 5.0% → 5.1%)
- **Alpha**: False positive rate (typically 0.05)
- **Power**: Desired detection probability (typically 0.80)

**Output**: Required sample size per group
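
As a rough illustration of how these inputs combine, the sketch below computes a per-group sample size from Cohen's h using the normal approximation, n = ((z₁₋α/₂ + z_power) / h)². The function `sample_size_per_group` is hypothetical and standard-library only. The package's `required_samples_binary` may use a different formula, so treat the numbers as approximate.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test of two proportions."""
    p_treatment = p_baseline * (1 + mde_relative)
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * math.asin(math.sqrt(p_treatment)) - 2 * math.asin(math.sqrt(p_baseline))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_power) / h) ** 2
    return h, math.ceil(n)

# The example from the text: 5% baseline, 2% relative MDE (5.0% -> 5.1%)
h, n = sample_size_per_group(0.05, 0.02)
print(f"Cohen's h: {h:.5f}")
print(f"Required sample per group: {n:,}")
```

Because n scales with 1/h², halving the MDE roughly quadruples the required sample.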

In [None]:
print("=" * 70)
print("STEP 2: POWER ANALYSIS")
print("=" * 70)

power = results['power_analysis']

print(f"\n📊 Input Parameters:")
print(f"   Baseline conversion rate: {power['p_baseline']:.4%}")
print(f"   MDE (relative): {power['mde']:.1%}")
print(f"   Alpha (false positive rate): {power['alpha']:.2f}")
print(f"   Power (detection probability): {power['power']:.0%}")

print(f"\n📊 Results:")
print(f"   Cohen's h effect size: {power['cohens_h']:.4f}")
print(f"   Required sample per group: {power['sample_per_group']:,}")
print(f"   Current sample per group: {power['current_sample']:,}")

ratio = power['current_sample'] / power['sample_per_group']
print(f"   Sample ratio: {ratio:.2f}x required")

print(f"\n💡 INTERPRETATION:")
if ratio >= 1.0:
    print(f"   ✅ WELL-POWERED (have {ratio:.1f}x required sample)")
    print(f"   ✅ Can detect relative effects as small as {power['mde']:.1%} with {power['power']:.0%} power")
    print(f"   ✅ Low risk of false negatives (missing real effects)")
else:
    print(f"   ⚠️  UNDERPOWERED (have {ratio:.1%} of required sample)")
    print(f"   ⚠️  Risk of false negatives (missing real effects)")
    print(f"   ⚠️  Recommendation: Extend experiment or increase traffic")

print(f"\n🏢 Industry Standards:")
print(f"   - Meta: 80% power for primary metric, 50%+ for guardrails")
print(f"   - Typical MDE: 1-5% relative lift (depends on baseline rate)")
print(f"   - Always run power analysis BEFORE starting experiment")

#### 🔬 Interactive Exercise: Try Different MDEs

Let's explore how MDE affects required sample size:

In [None]:
print("🔬 EXPERIMENT: How MDE Affects Sample Size")
print("=" * 70)

mdes = [0.01, 0.02, 0.05, 0.10, 0.20]  # 1%, 2%, 5%, 10%, 20% relative
baseline = power['p_baseline']

comparison = []
for mde in mdes:
    n = power_module.required_samples_binary(
        p1=baseline, 
        mde=mde, 
        alpha=0.05, 
        power=0.80
    )
    days_at_10k = n / 10000  # Assume 10K users/day
    comparison.append({
        'MDE (Relative)': f'{mde:.1%}',
        'Sample Needed': f'{n:,}',
        'Days @ 10K/day': f'{days_at_10k:.1f}'
    })

comparison_df = pd.DataFrame(comparison)
display(comparison_df)

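# Recompute both endpoints explicitly: after the loop, `n` holds the value for the last MDE (20%)
n_1pct = power_module.required_samples_binary(p1=baseline, mde=0.01, alpha=0.05, power=0.80)
n_10pct = power_module.required_samples_binary(p1=baseline, mde=0.10, alpha=0.05, power=0.80)
print(f"\n💡 KEY INSIGHT: Required sample grows roughly with 1/MDE², so small effects get expensive fast!")
print(f"   - Detecting 1% effect needs {n_1pct:,} users")
print(f"   - Detecting 10% effect needs {n_10pct:,} users")
print(f"   - That's about a {n_1pct / n_10pct:.0f}x difference!")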

### Step 3: Primary Statistical Test (Z-Test for Proportions)

#### 📚 What is a Z-Test?

**Z-test for proportions** compares two binary outcomes (control vs treatment).

**Null Hypothesis (H₀)**: No difference between groups (p_control = p_treatment)

**Alternative Hypothesis (H₁)**: Difference exists (p_control ≠ p_treatment)

**Output**:
- **P-value**: Probability of seeing a result at least this extreme if the null hypothesis is true
- **Confidence Interval**: Range of effect sizes compatible with the data (95% CI)
- **Effect Size**: Magnitude of difference (absolute and relative lift)

**Decision Rule**: If p < 0.05, reject null hypothesis (effect is statistically significant)
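
The following is a minimal sketch of a two-sided z-test for two proportions, using only the standard library. The function `two_proportion_ztest` and the conversion counts are hypothetical. It uses the pooled standard error for the test statistic and the unpooled one for the confidence interval, which is a common convention, though the pipeline's own choices may differ.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x_control, n_control, x_treatment, n_treatment, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c

    # Pooled standard error: valid under H0, where both groups share one rate
    p_pool = (x_control + x_treatment) / (n_control + n_treatment)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z_stat = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

    # Unpooled standard error for the confidence interval on the difference
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    return {
        "absolute_lift": diff,
        "relative_lift": diff / p_c,
        "z_stat": z_stat,
        "p_value": p_value,
        "ci_lower": diff - z_crit * se_unpooled,
        "ci_upper": diff + z_crit * se_unpooled,
        "significant": p_value < alpha,
    }

# Hypothetical counts: 500/10,000 control vs 570/10,000 treatment
result = two_proportion_ztest(500, 10_000, 570, 10_000)
print(f"z = {result['z_stat']:.3f}, p = {result['p_value']:.4f}")
print(f"95% CI for absolute lift: [{result['ci_lower']:.4%}, {result['ci_upper']:.4%}]")
```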

In [None]:
print("=" * 70)
print("STEP 3: PRIMARY STATISTICAL TEST (Z-Test for Proportions)")
print("=" * 70)

test = results['primary_test']

print(f"\n📊 Observed Rates:")
print(f"   Control conversion: {test['p_control']:.4%} ({test['x_control']}/{test['n_control']:,})")
print(f"   Treatment conversion: {test['p_treatment']:.4%} ({test['x_treatment']}/{test['n_treatment']:,})")

print(f"\n📊 Effect Size:")
print(f"   Absolute lift: {test['absolute_lift']:.4%} ({test['absolute_lift']*100:.2f} percentage points)")
print(f"   Relative lift: {test['relative_lift']:.2%}")

print(f"\n📊 Statistical Test Results:")
print(f"   Z-statistic: {test['z_stat']:.4f}")
print(f"   P-value: {test['p_value']:.6f}")
print(f"   Standard error: {test['se']:.6f}")
print(f"   95% Confidence Interval: [{test['ci_lower']:.4%}, {test['ci_upper']:.4%}]")

print(f"\n💡 INTERPRETATION:")
print(f"   - Null hypothesis (H₀): No difference between control and treatment")
print(f"   - P-value = probability of seeing a result at least this extreme if H₀ is true")
print(f"   - Alpha = 0.05 (our threshold for rejecting H₀)")

if test['significant']:
    print(f"\n   ✅ STATISTICALLY SIGNIFICANT (p = {test['p_value']:.6f} < 0.05)")
    print(f"   ✅ We reject the null hypothesis at the 5% significance level")
    print(f"   ✅ Treatment shows {test['relative_lift']:.2%} lift over control")
    print(f"\n   🏢 BUSINESS MEANING:")
    print(f"      - For every 1,000 users, expect {test['absolute_lift']*1000:.1f} more conversions")
    print(f"      - If 100K users/month, that's {test['absolute_lift']*100000:.0f} extra conversions/month")
else:
    print(f"\n   ○ NOT SIGNIFICANT (p = {test['p_value']:.6f} ≥ 0.05)")
    print(f"   ○ Cannot reject null hypothesis")
    print(f"   ○ Either: (1) no real effect OR (2) sample too small to detect it")
    print(f"\n   🤔 WHAT THIS MEANS:")
    print(f"      - Observed difference ({test['relative_lift']:.2%}) could be due to random chance")
    print(f"      - Consider: extending experiment or increasing traffic")

print(f"\n🏢 Industry Best Practices:")
print(f"   - Always report effect size + CI, not just p-value")
print(f"   - P-value tells you 'is it real?', effect size tells you 'does it matter?'")
print(f"   - Don't confuse statistical significance with practical importance")

### Step 4: CUPED Variance Reduction

#### 📚 What is CUPED?

**CUPED** = Controlled-experiment Using Pre-Experiment Data

**How It Works**:
1. Use pre-experiment covariate (e.g., `total_ads` before experiment started)
2. Adjust outcome metric: `Y_adjusted = Y - θ * (X_pre - E[X_pre])`, where `θ = Cov(Y, X_pre) / Var(X_pre)`
3. Run test on adjusted metric → tighter confidence intervals!

**Why It Works**:
- Reduces noise from user heterogeneity
- Like "before and after" photos - controls for baseline differences
- Users with high pre-experiment engagement are different from low-engagement users

**Requirements**:
1. ✅ Covariate measured BEFORE randomization (unaffected by treatment)
2. ✅ Covariate correlates with outcome (variance falls by roughly r², so r = 0.3 gives about 9%; a 20-40% reduction needs r of about 0.45-0.63)
3. ✅ No bias - adjustment is mathematically unbiased

**Expected Impact**: 20-40% variance reduction (Netflix, Microsoft experience)
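
The adjustment described above can be sketched in a few lines. This example uses synthetic data and only the standard library. The function `cuped_adjust` and the generated values are hypothetical, and the outcome here is continuous for simplicity, whereas the notebook's outcome is binary. In a real experiment θ is usually estimated once on the pooled control and treatment data so that both groups receive the same adjustment.

```python
import random
from statistics import fmean, variance

def cuped_adjust(y, x_pre):
    """Return CUPED-adjusted outcomes and theta = Cov(Y, X_pre) / Var(X_pre)."""
    mean_x = fmean(x_pre)
    mean_y = fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x_pre, y)) / (len(y) - 1)
    theta = cov_xy / variance(x_pre)
    # Subtract the part of each outcome that the centered covariate predicts
    y_adj = [yi - theta * (xi - mean_x) for xi, yi in zip(x_pre, y)]
    return y_adj, theta

# Synthetic data: the outcome depends partly on a pre-experiment covariate
rng = random.Random(42)
x_pre = [rng.gauss(10, 3) for _ in range(5000)]
y = [0.5 * xi + rng.gauss(0, 2) for xi in x_pre]

y_adj, theta = cuped_adjust(y, x_pre)
var_reduction = 1 - variance(y_adj) / variance(y)

print(f"theta: {theta:.3f}")
print(f"Mean before/after: {fmean(y):.4f} / {fmean(y_adj):.4f}")
print(f"Variance reduction: {var_reduction:.1%}")
```

The adjusted metric keeps the same mean, so the treatment effect estimate stays unbiased, while its variance shrinks by approximately the squared correlation between the covariate and the outcome.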

In [None]:
print("=" * 70)
print("STEP 4: CUPED VARIANCE REDUCTION")
print("=" * 70)

cuped = results.get('cuped', {})

if cuped:
    print(f"\n📊 Covariate Analysis:")
    print(f"   Covariate used: total_ads (pre-experiment ad exposure)")
    # Default to NaN, not 'N/A': a string default raises ValueError under a float format spec
    nan = float('nan')
    print(f"   Correlation with outcome: {cuped.get('correlation', nan):.4f}")
    
    print(f"\n📊 Variance Reduction:")
    print(f"   Original variance: {cuped.get('var_original', nan):.6f}")
    print(f"   Adjusted variance: {cuped.get('var_adjusted', nan):.6f}")
    print(f"   Variance reduction: {cuped.get('var_reduction', 0):.2%}")
    print(f"   SE reduction: {cuped.get('se_reduction', 0):.2%}")
    
    print(f"\n📊 Test Results Comparison:")
    print(f"   Original p-value: {test['p_value']:.6f}")
    print(f"   CUPED-adjusted p-value: {cuped.get('p_value_adjusted', nan):.6f}")
    print(f"   Original SE: {test['se']:.6f}")
    print(f"   CUPED-adjusted SE: {cuped.get('se_adjusted', nan):.6f}")
    
    var_red = cuped.get('var_reduction', 0)
    print(f"\n💡 INTERPRETATION:")
    if var_red > 0.30:
        print(f"   ✅ STRONG variance reduction ({var_red:.1%})")
        print(f"   ✅ CUPED very effective - covariate explains {var_red:.1%} of variance")
    elif var_red > 0.10:
        print(f"   ✅ MODERATE variance reduction ({var_red:.1%})")
        print(f"   ✅ CUPED effective - worth using")
    else:
        print(f"   ○ WEAK variance reduction ({var_red:.1%})")
        print(f"   ○ Covariate doesn't strongly predict outcome")
    
    print(f"\n   🎯 PRACTICAL IMPACT:")
    sample_equiv = 1 / (1 - var_red)
    print(f"      - Equivalent to running experiment with {sample_equiv:.1f}x more users")
    print(f"      - Or equivalently: run experiment {(1-var_red):.1%} as long for same power")
    print(f"      - Example: 4-week experiment → {4*(1-var_red):.1f} weeks with CUPED")
    
    print(f"\n🏢 Industry Practice:")
    print(f"   - Netflix: Uses CUPED on all experiments, typically 20-40% variance reduction")
    print(f"   - Microsoft: Increased experiment velocity 30% with variance reduction")
    print(f"   - DoorDash: Uses CUPAC (ML-enhanced CUPED) for 30-60% reduction")
else:
    print("\n⚠️ CUPED results not available")
    print("(May require pre-experiment covariate data)")

#### 📊 Comparison Table: With vs Without CUPED

In [None]:
if cuped:
    comparison = pd.DataFrame({
        'Metric': [
            'Standard Error',
            'P-value',
            'Confidence Interval Width',
            'Equivalent Sample Size',
            'Experiment Duration'
        ],
        'Without CUPED': [
            f"{test['se']:.6f}",
            f"{test['p_value']:.6f}",
            f"{(test['ci_upper'] - test['ci_lower']):.4%}",
            "100% (baseline)",
            "4 weeks (baseline)"
        ],
        'With CUPED': [
            f"{cuped.get('se_adjusted', 0):.6f}",
            f"{cuped.get('p_value_adjusted', 0):.6f}",
            f"{(cuped.get('ci_upper', 0) - cuped.get('ci_lower', 0)):.4%}",
            f"{(1/(1-cuped.get('var_reduction', 0)))*100:.0f}%",
            f"{4*(1-cuped.get('var_reduction', 0)):.1f} weeks"
        ],
        'Improvement': [
            f"{cuped.get('se_reduction', 0):.1%} ↓",
            "Lower (more significant)" if cuped.get('p_value_adjusted', 1) < test['p_value'] else "Higher",
            f"{cuped.get('se_reduction', 0):.1%} ↓",
            f"+{(1/(1-cuped.get('var_reduction', 0)) - 1)*100:.0f}%",
            f"{cuped.get('var_reduction', 0):.0%} faster"
        ]
    })
    
    display(comparison)
    
    print(f"\n💡 Key Takeaway: CUPED reduces variance by {cuped.get('var_reduction', 0):.1%}, ")
    print(f"    which is equivalent to increasing sample size by {(1/(1-cuped.get('var_reduction', 0)) - 1)*100:.0f}%!")
else:
    print("Comparison table not available (CUPED results missing)")

---

## Part 4: Final Decision Summary

Let's synthesize all the analysis into a clear ship/hold/abandon decision.

In [None]:
print("=" * 70)
print("🎯 FINAL DECISION SUMMARY")
print("=" * 70)

decision = results.get('decision', {})

print(f"\n1. PRIMARY METRIC: {'✅ SIGNIFICANT' if test['significant'] else '❌ NOT SIGNIFICANT'}")
print(f"   - Relative lift: {test['relative_lift']:.2%}")
print(f"   - P-value: {test['p_value']:.6f}")
print(f"   - 95% CI: [{test['ci_lower']:.4%}, {test['ci_upper']:.4%}]")

print(f"\n2. RANDOMIZATION CHECK: {'✅ PASSED' if not srm['srm_detected'] else '❌ FAILED'}")
print(f"   - SRM p-value: {srm['p_value']:.6f}")

print(f"\n3. STATISTICAL POWER: {'✅ ADEQUATE' if ratio >= 1.0 else '⚠️ UNDERPOWERED'}")
print(f"   - Current sample: {power['current_sample']:,} per group")
print(f"   - Required sample: {power['sample_per_group']:,} per group")

if cuped:
    print(f"\n4. VARIANCE REDUCTION:")
    print(f"   - CUPED variance reduction: {cuped.get('var_reduction', 0):.1%}")
    print(f"   - Equivalent sample size gain: +{(1/(1-cuped.get('var_reduction', 0)) - 1)*100:.0f}%")

if decision:
    print(f"\n{'='*70}")
    print(f">>> FINAL DECISION: {decision.get('decision', 'N/A').upper()} <<<")
    print(f"{'='*70}")
    
    print(f"\n📝 RATIONALE:")
    print(f"   {decision.get('rationale', 'N/A')}")
    
    print(f"\n📋 NEXT STEPS:")
    print(f"   {decision.get('recommendation', 'N/A')}")
else:
    print(f"\n⚠️ Decision framework results not available")

# Business impact
business = results.get('business_impact', {})
if business:
    print(f"\n💰 BUSINESS IMPACT (if shipped):")
    print(f"   - Incremental conversions/month: {business.get('incremental_conversions_monthly', 0):,.0f}")
    print(f"   - Incremental revenue/month: ${business.get('incremental_revenue_monthly', 0):,.2f}")
    print(f"   - Incremental revenue/year: ${business.get('incremental_revenue_annual', 0):,.2f}")

---

## ✅ Key Learning Takeaways

### What We Learned

1. **Always check SRM first** - If randomization failed, nothing else matters. Don't analyze results.

2. **Power analysis is essential** - Run it BEFORE starting experiments to avoid wasting time on underpowered tests.

3. **CUPED is your friend** - 20-40% variance reduction = 25-67% faster experiments. Huge productivity win!

4. **P-values aren't everything** - Always report effect size + confidence intervals. "Statistically significant" doesn't mean "practically important".

5. **Think like a business** - Translate statistical results to dollars. Executives care about revenue impact, not p-values.

### 🔬 Try These Experiments

1. **Change MDE**: Run power analysis with different MDEs (1%, 5%, 10%) - how does sample size change?

2. **Remove CUPED**: Compare results with and without variance reduction - is it worth the complexity?

3. **Different Covariate**: Try using other pre-experiment features for CUPED - which works best?

4. **Sample Size**: Run analysis on different sample fractions (0.1, 0.5, 1.0) - when do results stabilize?

5. **Sensitivity Analysis**: What if conversion rate was 1% lower? How would that change the decision?

### 📚 Further Reading

**Papers and Books**:
- Deng et al. (2013): "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data" (CUPED original paper)
- Kohavi et al. (2020): "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing"

**Industry Blogs**:
- [Netflix: Experimentation Platform](https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15)
- [Booking.com: How We Measure Success](https://booking.design/how-booking-com-measures-success-57e3e33c1b5)
- [Spotify: Confidence](https://engineering.atspotify.com/2020/03/confidence-spotify-s-tool-for-faster-experimentation-analysis/)

**Next Steps**:
- Try the Cookie Cats notebook (multiple testing correction)
- Study the Criteo notebook (ML-enhanced techniques)
- Read the README technique selection guide