# A/B Testing Masterclass: Complete End-to-End Workflow
## Marketing Campaign Analysis

---

### Learning Objectives

By the end of this notebook, you will understand:

1. **Data Quality Validation** - Why and how to validate your data first
2. **SRM Detection** - Identifying randomization failures
3. **Power Analysis** - Determining if you have enough data
4. **CUPED Variance Reduction** - Using pre-experiment data for efficiency
5. **Guardrail Metrics** - Protecting critical metrics
6. **Novelty Effect Detection** - Identifying temporary effects
7. **Business Impact** - Translating statistics to dollars
8. **Decision Framework** - Ship/Hold/Abandon logic

---

### The Business Context

This dataset contains ~588K observations from a marketing A/B test comparing:
- **Control (PSA)**: Public Service Announcement (no product ad)
- **Treatment (Ad)**: Actual product advertisement

**Primary Question:** Does showing the ad increase conversion rate?

**Important Note:** This dataset has 96%/4% allocation (treatment/control), which suggests it may be **observational data** rather than a true randomized experiment. We'll address this in the analysis.

---

## Setup

In [4]:
import os

if not os.getcwd().endswith("ab_testing"):
    try:
        os.chdir("../")
    except OSError as exc:
        raise FileNotFoundError(
            "Could not change to the parent directory (expected 'ab_testing')."
        ) from exc

print(f"Current working directory: {os.getcwd()}")


Current working directory: c:\docker_projects\ab_testing


In [5]:
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, Any

# A/B Testing modules
from ab_testing.data import loaders
from ab_testing.core import randomization, frequentist, power
from ab_testing.variance_reduction import cuped
from ab_testing.diagnostics import guardrails, novelty

# Set up plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("✓ Modules loaded successfully")

✓ Modules loaded successfully


---

## Step 1: Load and Validate Data

### Why Data Quality Matters

**Garbage In, Garbage Out (GIGO)**

Common data quality issues that invalidate experiments:
- Missing values (especially systematic)
- Duplicates (inflate sample size)
- Outliers (skew results)
- Data type errors
- Group imbalance

**Always validate BEFORE running any analysis.**

---

In [6]:
# Load data with 10% sample (for learning speed)
SAMPLE_FRAC = 0.1

print(f"Loading Marketing A/B dataset (sample={SAMPLE_FRAC})...")
df = loaders.load_marketing_ab(sample_frac=SAMPLE_FRAC)

print(f"\n✓ Loaded {len(df):,} observations")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData types:")
print(df.dtypes)

Loading Marketing A/B dataset (sample=0.1)...
Loading Marketing A/B dataset from data\raw\marketing_ab\marketing_AB.csv...
Loaded Marketing A/B dataset: 588,101 rows, 7 columns
  Conversion rate (ad): 2.55%
  Conversion rate (psa): 1.79%
  Sampled to 58,810 rows (10.0% of full dataset)

✓ Loaded 58,810 observations

Columns: ['user_id', 'test_group', 'converted', 'total_ads', 'most_ads_day', 'most_ads_hour', 'treatment']

Data types:
user_id           int64
test_group       object
converted          bool
total_ads         int64
most_ads_day     object
most_ads_hour     int64
treatment         int64
dtype: object


In [7]:
# Data quality checks
print("Data Quality Validation:")
print("=" * 50)

# 1. Missing values
print(f"\n1. Missing Values:")
missing = df.isnull().sum()
if missing.sum() == 0:
    print(f"   ✓ No missing values")
else:
    print(f"   ⚠️  Missing values found:")
    print(missing[missing > 0])

# 2. Duplicates
print(f"\n2. Duplicates:")
n_duplicates = df.duplicated().sum()
if n_duplicates == 0:
    print(f"   ✓ No duplicate rows")
else:
    print(f"   ⚠️  {n_duplicates:,} duplicate rows found")

# 3. Group distribution
print(f"\n3. Group Distribution:")
group_counts = df['test_group'].value_counts()
print(f"   {group_counts.to_dict()}")

# 4. Outcome variable
print(f"\n4. Outcome Variable (converted):")
print(f"   Unique values: {df['converted'].unique()}")
print(f"   Type: {df['converted'].dtype}")

Data Quality Validation:

1. Missing Values:
   ✓ No missing values

2. Duplicates:
   ✓ No duplicate rows

3. Group Distribution:
   {'ad': 56424, 'psa': 2386}

4. Outcome Variable (converted):
   Unique values: [False  True]
   Type: bool


In [8]:
# Separate groups
control = df[df['test_group'] == 'psa']
treatment = df[df['test_group'] == 'ad']

print("\nGroup Summary:")
print("=" * 50)
print(f"Control (PSA):   {len(control):,} observations")
print(f"Treatment (Ad):  {len(treatment):,} observations")
print(f"Total:           {len(df):,} observations")

# Allocation ratio
ratio = len(treatment) / len(control)
print(f"\nAllocation ratio (treatment/control): {ratio:.2f}")

if ratio > 10:
    print(f"\n⚠️  SEVERE IMBALANCE: This is likely OBSERVATIONAL data")
    print(f"   True experiments typically have ratios near 1.0")
    print(f"   Causal claims should be made with caution")


Group Summary:
Control (PSA):   2,386 observations
Treatment (Ad):  56,424 observations
Total:           58,810 observations

Allocation ratio (treatment/control): 23.65

⚠️  SEVERE IMBALANCE: This is likely OBSERVATIONAL data
   True experiments typically have ratios near 1.0
   Causal claims should be made with caution


---

## Step 2: Sample Ratio Mismatch (SRM) Check

### Understanding This Dataset

This dataset has **~96%/4% allocation** (treatment/control). This is unusual for a true A/B test and suggests:

1. **Observational data** - Users self-selected into groups
2. **Designed imbalance** - Intentional (for cost/risk reasons)
3. **Data labeling error** - Mislabeled as "A/B test"

**Important:** For this analysis, we'll treat this as **observational data**, which means:
- We can measure **association**, not **causation**
- Confounding variables may bias results
- Propensity score matching would improve causal inference
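
At its core, an SRM check is a chi-square goodness-of-fit test on the group counts. The sketch below uses only the standard library and is not the implementation behind `randomization.srm_check`; the function name `srm_chi_square` and its return keys are hypothetical. It takes the group counts from the 10% sample and the 4%/96% baseline pattern as inputs.

```python
import math


def srm_chi_square(n_control, n_treatment, expected=(0.5, 0.5), alpha=0.01):
    """Chi-square goodness-of-fit test for a two-group allocation (1 degree of freedom)."""
    total = n_control + n_treatment
    observed = (n_control, n_treatment)
    expected_counts = tuple(total * share for share in expected)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected_counts))
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "srm_detected": p_value < alpha}


# Group counts from the 10% sample, checked against the 4% / 96% baseline pattern
srm_sketch = srm_chi_square(2386, 56424, expected=(0.04, 0.96))
print(f"chi2 = {srm_sketch['chi2']:.4f}, p = {srm_sketch['p_value']:.4f}")
```

A strict threshold such as `alpha=0.01` is a common choice for SRM checks because a false alarm here halts the whole analysis.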

---

In [9]:
# Calculate observed allocation
n_control = len(control)
n_treatment = len(treatment)
total = n_control + n_treatment

observed_ratio_control = n_control / total
observed_ratio_treatment = n_treatment / total

# This dataset is OBSERVATIONAL with ~4%/96% split
# We'll check if sample matches this expected pattern
IS_RCT = False  # NOT a true randomized controlled trial
EXPECTED_ALLOCATION = [0.04, 0.96]  # Observed baseline pattern

print("Dataset Classification:")
print("=" * 50)
print(f"\nType: OBSERVATIONAL DATA (not a true RCT)")
print(f"\nReason: {n_treatment/n_control:.1f}x imbalance suggests self-selection")
print(f"        True experiments typically have 1:1 to 4:1 ratios")
print(f"\nImplications:")
print(f"  • Results show ASSOCIATION, not CAUSATION")
print(f"  • Confounding variables may bias estimates")
print(f"  • Use for learning, but interpret with caution")

Dataset Classification:

Type: OBSERVATIONAL DATA (not a true RCT)

Reason: 23.6x imbalance suggests self-selection
        True experiments typically have 1:1 to 4:1 ratios

Implications:
  • Results show ASSOCIATION, not CAUSATION
  • Confounding variables may bias estimates
  • Use for learning, but interpret with caution


In [10]:
# Run allocation check (diagnostic only for observational data)
srm_result = randomization.srm_check(
    n_control=n_control,
    n_treatment=n_treatment,
    expected_ratio=EXPECTED_ALLOCATION,
    alpha=0.01
)

print("\nAllocation Check (Diagnostic):")
print("=" * 50)
print(f"Expected (baseline pattern): {EXPECTED_ALLOCATION[0]:.1%} / {EXPECTED_ALLOCATION[1]:.1%}")
print(f"Observed (this sample):      {observed_ratio_control:.2%} / {observed_ratio_treatment:.2%}")
print(f"\nChi-square: {srm_result['chi2_statistic']:.4f}")
print(f"P-value:    {srm_result['p_value']:.6f}")

if not srm_result['srm_detected']:
    print(f"\n✓ Sample matches expected pattern")
else:
    print(f"\n⚠️  Sample differs from expected pattern")
    print(f"   Check data loading/filtering logic")


Allocation Check (Diagnostic):
Expected (baseline pattern): 4.0% / 96.0%
Observed (this sample):      4.06% / 95.94%

Chi-square: 0.4999
P-value:    0.479537

✓ Sample matches expected pattern


---

## Step 3: Power Analysis

### Why Power Analysis Matters

**Power** = Probability of detecting a real effect when it exists (1 - β)

**MDE** = Minimum Detectable Effect (smallest change we can reliably detect)

| Power | Interpretation |
|-------|---------------|
| 80% | Industry standard - 20% chance of missing real effect |
| 90% | Conservative - 10% chance of missing real effect |
| 95% | Very conservative - 5% chance of missing real effect |

### The Formula

For binary outcomes (conversion), required sample per group:

$$n = \frac{2(z_{\alpha/2} + z_\beta)^2 \cdot p(1-p)}{\delta^2}$$

Where:
- $z_{\alpha/2}$ = 1.96 for α=0.05 (two-sided)
- $z_\beta$ = 0.84 for 80% power
- $p$ = baseline conversion rate
- $\delta$ = MDE (absolute difference)
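
The formula above can be evaluated directly. This is a minimal sketch using only the standard library; `sample_size_per_group` is a hypothetical helper, not part of the `power` module. It assumes the MDE is given as a relative lift and converts it to the absolute difference $\delta$. A library implementation may use a slightly different variance term, so its figure need not match this one exactly.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-group n from n = 2 (z_{alpha/2} + z_beta)^2 p(1-p) / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    delta = p_baseline * relative_mde              # relative MDE -> absolute difference
    n = 2 * (z_alpha + z_beta) ** 2 * p_baseline * (1 - p_baseline) / delta ** 2
    return math.ceil(n)


# Baseline of 1.89% (the control rate in this sample) and two candidate MDEs
n_small_mde = sample_size_per_group(0.0189, 0.02)
n_large_mde = sample_size_per_group(0.0189, 0.20)
print(f"2% relative MDE:  {n_small_mde:,} per group")
print(f"20% relative MDE: {n_large_mde:,} per group")
```

Because $\delta$ enters squared, a 10x larger MDE needs roughly 100x fewer users, which is why the choice of MDE dominates the sample size requirement.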

---

In [11]:
# Calculate baseline conversion rate
p_control = control['converted'].mean()
p_treatment = treatment['converted'].mean()
observed_lift = (p_treatment / p_control - 1) * 100

print("Conversion Rates:")
print("=" * 50)
print(f"Control (PSA):   {p_control:.4f} ({p_control*100:.2f}%)")
print(f"Treatment (Ad):  {p_treatment:.4f} ({p_treatment*100:.2f}%)")
print(f"\nObserved relative lift: {observed_lift:.2f}%")

Conversion Rates:
Control (PSA):   0.0189 (1.89%)
Treatment (Ad):  0.0256 (2.56%)

Observed relative lift: 35.69%


In [12]:
# Run power analysis
TARGET_MDE = 0.02  # 2% relative MDE (e.g., 2% → 2.04%)

power_result = power.power_analysis_summary(
    p_baseline=p_control,
    mde=TARGET_MDE,  # 2% relative lift
    alpha=0.05,
    power=0.80
)

print("Power Analysis:")
print("=" * 50)
print(f"\nInputs:")
print(f"  Baseline conversion: {p_control:.2%}")
print(f"  Target MDE:          {TARGET_MDE:.1%} relative")
print(f"  Alpha (Type I):      0.05 (5%)")
print(f"  Power (1 - Type II): 0.80 (80%)")

print(f"\nResults:")
print(f"  Treatment rate with MDE: {power_result['p_treatment']:.2%}")
print(f"  Required sample per group: {power_result['sample_per_group']:,}")
print(f"  Total required: {power_result['sample_per_group'] * 2:,}")
print(f"  Effect size (Cohen's h): {power_result['cohens_h']:.4f}")

Power Analysis:

Inputs:
  Baseline conversion: 1.89%
  Target MDE:          2.0% relative
  Alpha (Type I):      0.05 (5%)
  Power (1 - Type II): 0.80 (80%)

Results:
  Treatment rate with MDE: 1.92%
  Required sample per group: 2,061,546
  Total required: 4,123,092
  Effect size (Cohen's h): 0.0028


In [13]:
# Compare to actual sample
actual_per_group = min(n_control, n_treatment)
required_per_group = power_result['sample_per_group']

print("\nPower Assessment:")
print("=" * 50)
print(f"Required per group: {required_per_group:,}")
print(f"Actual (smaller group): {actual_per_group:,}")

if actual_per_group >= required_per_group:
    ratio_achieved = actual_per_group / required_per_group
    print(f"\n✓ WELL-POWERED: {ratio_achieved:.1f}x required sample")
    print(f"  Can detect effects as small as {TARGET_MDE:.1%} with 80% power")
else:
    ratio_achieved = actual_per_group / required_per_group
    print(f"\n⚠️  UNDERPOWERED: Only {ratio_achieved:.1%} of required sample")
    print(f"  Risk: May miss real effects (false negatives)")
    print(f"\n  Options:")
    print(f"  1. Extend experiment to collect more data")
    print(f"  2. Target larger MDE (accept coarser resolution)")
    print(f"  3. Use variance reduction (CUPED/CUPAC)")


Power Assessment:
Required per group: 2,061,546
Actual (smaller group): 2,386

⚠️  UNDERPOWERED: Only 0.1% of required sample
  Risk: May miss real effects (false negatives)

  Options:
  1. Extend experiment to collect more data
  2. Target larger MDE (accept coarser resolution)
  3. Use variance reduction (CUPED/CUPAC)


---

## Step 4: Primary Metric Test

---

In [14]:
# Z-test for conversion rate
x_control = control['converted'].sum()
x_treatment = treatment['converted'].sum()

primary_result = frequentist.z_test_proportions(
    x_control=x_control,
    n_control=n_control,
    x_treatment=x_treatment,
    n_treatment=n_treatment,
    alpha=0.05,
    two_sided=True
)

print("Primary Metric: Conversion Rate")
print("=" * 50)
print(f"\nControl:   {p_control:.4f} ({x_control:,}/{n_control:,})")
print(f"Treatment: {p_treatment:.4f} ({x_treatment:,}/{n_treatment:,})")
print(f"\nAbsolute difference: {primary_result['absolute_lift']:.4f} ({primary_result['absolute_lift']*100:.2f}pp)")
print(f"Relative lift:       {primary_result['relative_lift']:.2%}")
print(f"\nZ-statistic: {primary_result['z_statistic']:.4f}")
print(f"P-value:     {primary_result['p_value']:.6f}")
print(f"95% CI:      [{primary_result['ci_lower']:.4f}, {primary_result['ci_upper']:.4f}]")
print(f"\nStatistically significant: {primary_result['significant']}")

Primary Metric: Conversion Rate

Control:   0.0189 (45/2,386)
Treatment: 0.0256 (1,444/56,424)

Absolute difference: 0.0067 (0.67pp)
Relative lift:       35.69%

Z-statistic: 2.0504
P-value:     0.040330
95% CI:      [0.0011, 0.0123]

Statistically significant: True
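
To see what sits behind these numbers, the test can be reproduced by hand. The sketch below is a standard two-proportion z-test written with the standard library; `two_proportion_z_test` is a hypothetical name, and the choice of a pooled standard error for the test statistic and an unpooled one for the confidence interval is an assumption about the method, not a statement of how `frequentist.z_test_proportions` is implemented. With the counts printed above it gives figures that agree with the output to the displayed precision.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x_control, n_control, x_treatment, n_treatment, alpha=0.05):
    """Two-sided z-test for a difference in proportions, with a CI on the difference."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c
    # Pooled standard error under H0: both groups share one conversion rate
    p_pool = (x_control + x_treatment) / (n_control + n_treatment)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the difference
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


z_sketch = two_proportion_z_test(45, 2386, 1444, 56424)
print(f"z = {z_sketch['z']:.4f}, p = {z_sketch['p_value']:.4f}")
print(f"95% CI: [{z_sketch['ci'][0]:.4f}, {z_sketch['ci'][1]:.4f}]")
```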


In [15]:
# Interpret the result
print("\nInterpretation:")
print("=" * 50)

if primary_result['significant']:
    print(f"✓ STATISTICALLY SIGNIFICANT (p = {primary_result['p_value']:.4f} < 0.05)")
    print(f"\nWhat this means:")
    print(f"  • If there were truly no effect, we'd see a result at least this")
    print(f"    extreme only {primary_result['p_value']*100:.2f}% of the time by chance")
    print(f"  • Treatment shows {primary_result['relative_lift']:.2%} lift")
    print(f"  • 95% CI for the absolute lift runs from")
    print(f"    {primary_result['ci_lower']*100:.2f}pp to {primary_result['ci_upper']*100:.2f}pp")
    
    print(f"\n⚠️  CAUTION (Observational Data):")
    print(f"  • This shows ASSOCIATION, not necessarily CAUSATION")
    print(f"  • Users who saw ads may differ from PSA users")
    print(f"  • Confounding variables may explain the difference")
else:
    print(f"○ NOT STATISTICALLY SIGNIFICANT (p = {primary_result['p_value']:.4f} ≥ 0.05)")
    print(f"\nWhat this means:")
    print(f"  • Cannot confidently say treatment differs from control")
    print(f"  • Either: (1) No real effect, or (2) Sample too small")


Interpretation:
✓ STATISTICALLY SIGNIFICANT (p = 0.0403 < 0.05)

What this means:
  • If there were truly no effect, we'd see a result at least this
    extreme only 4.03% of the time by chance
  • Treatment shows 35.69% lift
  • 95% CI for the absolute lift runs from
    0.11pp to 1.23pp

⚠️  CAUTION (Observational Data):
  • This shows ASSOCIATION, not necessarily CAUSATION
  • Users who saw ads may differ from PSA users
  • Confounding variables may explain the difference


---

## Step 5: CUPED Variance Reduction

### What is CUPED?

**CUPED** = Controlled-experiment Using Pre-Experiment Data

It uses **pre-experiment covariates** to reduce noise in your metrics:

$$Y_{\text{adjusted}} = Y - \theta(X - \bar{X})$$

Where:
- $Y$ = outcome (conversion)
- $X$ = pre-experiment covariate (must be unaffected by treatment)
- $\theta = \text{Cov}(Y, X) / \text{Var}(X)$ = optimal adjustment coefficient

### Why CUPED Works

If a covariate predicts the outcome, it explains some of the variance. By adjusting for this, we reduce noise and increase power.

**Variance reduction** = $r^2$ (correlation squared)

| Correlation | Variance Reduction | Sample Equivalent |
|-------------|-------------------|-------------------|
| 0.30 | 9% | 1.1x more users |
| 0.50 | 25% | 1.33x more users |
| 0.70 | 49% | 2x more users |
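
The adjustment itself is only a few lines. The sketch below applies the CUPED formula to hypothetical toy data (a synthetic covariate `x` and outcome `y`, not the marketing dataset) and checks that the variance reduction equals $r^2$. `cuped_adjust` is an illustrative helper, not the `cuped` module's API.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return (adjusted outcomes, theta) using Y_adj = Y - theta * (X - mean(X))."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)], theta


# Hypothetical toy data: x is a pre-experiment covariate that partly predicts y
random.seed(0)
x = [random.gauss(10, 2) for _ in range(1000)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

y_adj, theta = cuped_adjust(y, x)
var_reduction = 1 - variance(y_adj) / variance(y)

# Squared sample correlation between x and y, for comparison
x_bar, y_bar = mean(x), mean(y)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
r_squared = cov_xy ** 2 / (variance(x) * variance(y))

print(f"theta = {theta:.3f}")
print(f"variance reduction = {var_reduction:.1%}, r^2 = {r_squared:.1%}")
print(f"sample size equivalent = {1 / (1 - var_reduction):.2f}x")
```

The adjustment leaves the mean of the outcome unchanged and only removes variance, so the treatment effect estimate stays unbiased as long as the covariate is unaffected by treatment.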

---

In [16]:
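# Use total_ads as the covariate (a proxy for user engagement).
# Caveat: CUPED requires a covariate that is unaffected by treatment. total_ads
# is also used as the guardrail metric in Step 6, so it may not be a true
# pre-experiment covariate; treat this adjustment as illustrative.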

control_outcome = control['converted'].values
control_covariate = control['total_ads'].values

treatment_outcome = treatment['converted'].values
treatment_covariate = treatment['total_ads'].values

print("CUPED Setup:")
print("=" * 50)
print(f"Outcome: converted (binary)")
print(f"Covariate: total_ads (pre-experiment ad exposure)")
print(f"\nCovariate statistics:")
print(f"  Control mean:    {control_covariate.mean():.2f}")
print(f"  Treatment mean:  {treatment_covariate.mean():.2f}")

CUPED Setup:
Outcome: converted (binary)
Covariate: total_ads (pre-experiment ad exposure)

Covariate statistics:
  Control mean:    24.29
  Treatment mean:  24.86


In [17]:
# Run CUPED
cuped_result = cuped.cuped_ab_test(
    y_control=control_outcome,
    y_treatment=treatment_outcome,
    x_control=control_covariate,
    x_treatment=treatment_covariate,
    alpha=0.05
)

print("\nCUPED Results:")
print("=" * 50)
print(f"\nCovariate Information:")
print(f"  Correlation with outcome: {cuped_result['correlation']:.4f}")
print(f"  Theta (adjustment coef):  {cuped_result['theta']:.6f}")

print(f"\nVariance Reduction:")
print(f"  Variance reduction: {cuped_result['var_reduction']:.1%}")
print(f"  SE reduction:       {cuped_result['se_reduction']:.1%}")
print(f"  Sample size equivalent: {1/(1-cuped_result['var_reduction']):.2f}x")

print(f"\nAdjusted Test:")
print(f"  Raw p-value:      {primary_result['p_value']:.6f}")
print(f"  CUPED p-value:    {cuped_result['p_value_adjusted']:.6f}")
print(f"  Change:           {cuped_result['p_value_adjusted'] - primary_result['p_value']:.6f}")


CUPED Results:

Covariate Information:
  Correlation with outcome: 0.2173
  Theta (adjustment coef):  0.000570

Variance Reduction:
  Variance reduction: 4.4%
  SE reduction:       1.4%
  Sample size equivalent: 1.05x

Adjusted Test:
  Raw p-value:      0.040330
  CUPED p-value:    0.017084
  Change:           -0.023246


In [18]:
# Assess CUPED effectiveness
print("\nCUPED Effectiveness:")
print("=" * 50)

if cuped_result['var_reduction'] > 0.20:
    print(f"✓ STRONG variance reduction ({cuped_result['var_reduction']:.1%})")
    print(f"  Covariate explains {cuped_result['var_reduction']:.1%} of outcome variance")
    print(f"  Like running with {1/(1-cuped_result['var_reduction']):.1f}x more users!")
elif cuped_result['var_reduction'] > 0.10:
    print(f"✓ MODERATE variance reduction ({cuped_result['var_reduction']:.1%})")
    print(f"  Helpful noise reduction")
elif cuped_result['var_reduction'] > 0.05:
    print(f"○ MODEST variance reduction ({cuped_result['var_reduction']:.1%})")
    print(f"  Some benefit but not dramatic")
else:
    print(f"○ WEAK variance reduction ({cuped_result['var_reduction']:.1%})")
    print(f"  Covariate doesn't strongly predict outcome")
    print(f"  Consider finding better pre-experiment predictors")


CUPED Effectiveness:
○ WEAK variance reduction (4.4%)
  Covariate doesn't strongly predict outcome
  Consider finding better pre-experiment predictors


---

## Step 6: Guardrail Metrics

### The Guardrail Framework

| Metric Type | Purpose | Test Type |
|-------------|---------|----------|
| **Primary** | What we OPTIMIZE | Standard hypothesis test |
| **Guardrail** | What we PROTECT | Non-inferiority test |

**Non-inferiority test:**
- Question: "Is degradation within acceptable threshold?"
- NOT asking "is it better?" - just "is it not too bad?"
- Pass if: Lower bound of CI > threshold
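
The pass rule can be sketched as follows. This is a minimal, standard-library illustration and not the `guardrails.non_inferiority_test` implementation: `non_inferiority` is a hypothetical name, the one-sided normal-approximation bound is an assumption, and the data are toy values.

```python
import math
from statistics import NormalDist, mean, variance


def non_inferiority(control, treatment, rel_margin=-0.05, alpha=0.05):
    """Pass when the one-sided lower bound of (treatment - control) exceeds the margin."""
    diff = mean(treatment) - mean(control)
    se = math.sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    lower = diff - NormalDist().inv_cdf(1 - alpha) * se
    margin = rel_margin * mean(control)  # relative tolerance converted to absolute units
    return {"difference": diff, "ci_lower": lower, "margin": margin, "passed": lower > margin}


# Toy guardrail metric: identical groups pass, a clear drop fails
toy_control = [20, 24, 28] * 200
same_result = non_inferiority(toy_control, list(toy_control))
drop_result = non_inferiority(toy_control, [v - 3 for v in toy_control])
print(f"No change: lower={same_result['ci_lower']:.3f}, passed={same_result['passed']}")
print(f"Drop of 3: lower={drop_result['ci_lower']:.3f}, passed={drop_result['passed']}")
```

Note that the margin is expressed in the metric's own units once multiplied by the control mean, so the lower bound and the threshold must be compared on the same scale.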

---

In [19]:
# Guardrail: Average ads shown should not decrease significantly
guardrail_control = control['total_ads'].values
guardrail_treatment = treatment['total_ads'].values

guardrail_result = guardrails.non_inferiority_test(
    control=guardrail_control,
    treatment=guardrail_treatment,
    delta=-0.05,  # Allow up to 5% degradation
    metric_type='relative',
    alpha=0.05
)
guardrail_result['metric_name'] = 'avg_ads_shown'

print("Guardrail: Average Ads Shown")
print("=" * 50)
print(f"Tolerance: -5.0% (max allowed degradation)")
print(f"\nControl mean:   {guardrail_result['mean_control']:.2f}")
print(f"Treatment mean: {guardrail_result['mean_treatment']:.2f}")

# Calculate relative change
rel_change = guardrail_result['difference'] / guardrail_result['mean_control']
print(f"\nRelative change: {rel_change:.2%}")
print(f"95% CI lower:    {guardrail_result['ci_lower']:.4f}")
print(f"\nResult: {'✓ PASSED' if guardrail_result['passed'] else '✗ FAILED'}")

Guardrail: Average Ads Shown
Tolerance: -5.0% (max allowed degradation)

Control mean:   24.29
Treatment mean: 24.86

Relative change: 2.33%
95% CI lower:    -0.8023

Result: ✓ PASSED


---

## Step 7: Novelty Effect Detection

### What are Novelty Effects?

**Novelty effect** = Temporary spike due to user curiosity, not genuine value

| Week | Effect | Interpretation |
|------|--------|----------------|
| 1 | +15% | Users exploring new feature |
| 2 | +10% | Novelty wearing off |
| 3 | +5% | Returning to baseline |
| 4 | +3% | True sustained effect |

**Why it matters:**
- Shipping novelty = wasted engineering
- Users may actually dislike feature once novelty wears off
- Need to distinguish temporary from sustained effects
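
One simple way to look for this pattern is to compare the average effect in an early window with the average effect in a late window. The sketch below is only an illustration on hypothetical daily rates; `effect_decay` is not the `novelty` module's API, it assumes the series is in chronological order, and it omits the significance test a real detector would need.

```python
from statistics import mean


def effect_decay(control_rates, treatment_rates, window=3):
    """Compare the mean treatment effect in the first and last `window` periods."""
    effects = [t - c for c, t in zip(control_rates, treatment_rates)]
    if len(effects) < 2 * window:
        raise ValueError("Need at least 2 * window time points")
    early = mean(effects[:window])
    late = mean(effects[-window:])
    return {"early_effect": early, "late_effect": late, "effect_decay": early - late}


# Hypothetical daily conversion rates: the treatment lift shrinks over time
toy_control_rates = [0.020] * 8
toy_treatment_rates = [0.035, 0.032, 0.029, 0.026, 0.024, 0.023, 0.0225, 0.022]

decay_sketch = effect_decay(toy_control_rates, toy_treatment_rates, window=3)
print(f"Early effect: {decay_sketch['early_effect']:.4f}")
print(f"Late effect:  {decay_sketch['late_effect']:.4f}")
print(f"Decay:        {decay_sketch['effect_decay']:.4f}")
```

A positive decay means the effect is weakening; whether it is large enough to matter depends on the noise in the daily rates, which this sketch does not model.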

---

In [20]:
# Aggregate conversion by day of week (a rough proxy for time).
# Note: sort_index() orders the day names alphabetically, not chronologically.
daily_control = control.groupby('most_ads_day')['converted'].mean().sort_index()
daily_treatment = treatment.groupby('most_ads_day')['converted'].mean().sort_index()

print("Time-Based Analysis:")
print("=" * 50)
print(f"\nTime periods available: {len(daily_control)}")
print(f"\nConversion by day (control):")
print(daily_control)
print(f"\nConversion by day (treatment):")
print(daily_treatment)

Time-Based Analysis:

Time periods available: 7

Conversion by day (control):
most_ads_day
Friday       0.010782
Monday       0.022857
Saturday     0.019934
Sunday       0.032468
Thursday     0.019465
Tuesday      0.021127
Wednesday    0.008310
Name: converted, dtype: float64

Conversion by day (treatment):
most_ads_day
Friday       0.018317
Monday       0.033400
Saturday     0.024449
Sunday       0.023664
Thursday     0.023724
Tuesday      0.030918
Wednesday    0.025439
Name: converted, dtype: float64


In [21]:
# Check if we have enough time points for novelty analysis
MIN_TIME_POINTS = 10

if len(daily_control) >= MIN_TIME_POINTS:
    # Run novelty detection
    novelty_result = novelty.detect_novelty_effect(
        metrics_control=daily_control.values,
        metrics_treatment=daily_treatment.values,
        window_size=3,
        alpha=0.05
    )
    
    print("Novelty Effect Analysis:")
    print("=" * 50)
    print(f"\nEarly period effect: {novelty_result['early_effect']:.4f}")
    print(f"Late period effect:  {novelty_result['late_effect']:.4f}")
    print(f"Effect decay:        {novelty_result['effect_decay']:.4f}")
    print(f"Decay p-value:       {novelty_result['decay_pvalue']:.4f}")
    
    if novelty_result['novelty_detected']:
        print(f"\n⚠️  NOVELTY EFFECT DETECTED!")
        print(f"   Effect is WEAKENING over time")
        print(f"   Recommendation: Run post-launch holdout (2-4 weeks)")
    else:
        print(f"\n✓ NO NOVELTY EFFECT DETECTED")
        print(f"   Effect appears STABLE across time")
else:
    novelty_result = None
    print(f"\n○ Insufficient time points ({len(daily_control)}) for novelty analysis")
    print(f"   Need ≥{MIN_TIME_POINTS} time points for reliable detection")
    print(f"   Recommendation: Run longer experiments (2-4 weeks)")


○ Insufficient time points (7) for novelty analysis
   Need ≥10 time points for reliable detection
   Recommendation: Run longer experiments (2-4 weeks)


---

## Step 8: Business Impact Calculation

### From Statistics to Dollars

P-values tell us "is it real?" but business impact tells us "does it matter?"

**Components:**
1. **Scale**: How many users affected?
2. **Magnitude**: How big is the effect per user?
3. **Value**: What's each conversion worth?
4. **Time**: Annualized for comparisons

**Formula:**
$$\text{Annual Value} = \text{Monthly Users} \times \text{Absolute Lift} \times \text{Value per Conversion} \times 12$$
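
As a worked instance of the formula, the sketch below plugs in the observed conversion counts together with the notebook's business assumptions (1,000,000 monthly users, 10.00 per conversion). `annual_value` is a hypothetical helper; the lift is the absolute difference in conversion rates, not the relative lift.

```python
def annual_value(monthly_users, absolute_lift, value_per_conversion, months=12):
    """Annualized incremental value from an absolute lift in conversion rate."""
    return monthly_users * absolute_lift * value_per_conversion * months


# Absolute lift = treatment rate - control rate, from the observed counts
absolute_lift = 1444 / 56424 - 45 / 2386

expected_annual = annual_value(1_000_000, absolute_lift, 10.0)
half_users_annual = annual_value(500_000, absolute_lift, 10.0)
print(f"Expected annual value: ${expected_annual:,.2f}")
print(f"With half the users:   ${half_users_annual:,.2f}")
```

The formula is linear in every input, so halving the user base or the value per conversion halves the estimate.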

---

In [22]:
# Business assumptions (customize for your business)
MONTHLY_USERS = 1_000_000
AVG_ORDER_VALUE = 10.0

print("Business Impact Calculation:")
print("=" * 50)
print(f"\n📊 ASSUMPTIONS:")
print(f"   Monthly active users: {MONTHLY_USERS:,}")
print(f"   Average order value:  ${AVG_ORDER_VALUE:.2f}")
print(f"   Baseline conversion:  {p_control:.2%}")
print(f"   Treatment conversion: {p_treatment:.2%}")

Business Impact Calculation:

📊 ASSUMPTIONS:
   Monthly active users: 1,000,000
   Average order value:  $10.00
   Baseline conversion:  1.89%
   Treatment conversion: 2.56%


In [23]:
# Calculate business impact
baseline_conversions_monthly = MONTHLY_USERS * p_control
treatment_conversions_monthly = MONTHLY_USERS * p_treatment
incremental_conversions = treatment_conversions_monthly - baseline_conversions_monthly

incremental_revenue_monthly = incremental_conversions * AVG_ORDER_VALUE
incremental_revenue_annual = incremental_revenue_monthly * 12

# Confidence interval bounds: the CI is on the ABSOLUTE difference in
# conversion rates (not the relative lift), so multiply by users and value
worst_case_annual = MONTHLY_USERS * primary_result['ci_lower'] * AVG_ORDER_VALUE * 12
best_case_annual = MONTHLY_USERS * primary_result['ci_upper'] * AVG_ORDER_VALUE * 12

print(f"\n📈 MONTHLY IMPACT:")
print(f"   Baseline conversions:     {baseline_conversions_monthly:,.0f}")
print(f"   Treatment conversions:    {treatment_conversions_monthly:,.0f}")
print(f"   Incremental conversions:  {incremental_conversions:,.0f}")
print(f"   Incremental revenue:      ${incremental_revenue_monthly:,.2f}")

print(f"\n💰 ANNUALIZED IMPACT:")
print(f"   Best case (95% CI upper):  ${best_case_annual:,.2f}")
print(f"   Expected (point estimate): ${incremental_revenue_annual:,.2f}")
print(f"   Worst case (95% CI lower): ${worst_case_annual:,.2f}")


📈 MONTHLY IMPACT:
   Baseline conversions:     18,860
   Treatment conversions:    25,592
   Incremental conversions:  6,732
   Incremental revenue:      $67,319.30

💰 ANNUALIZED IMPACT:
   Best case (95% CI upper):  $1,481,219.81
   Expected (point estimate): $807,831.59
   Worst case (95% CI lower): $134,443.37


In [24]:
# ROI analysis
print(f"\n📊 COST-BENEFIT ANALYSIS:")
print(f"   Expected annual value: ${incremental_revenue_annual:,.2f}")
print(f"\n   If implementation cost = ${incremental_revenue_annual * 0.5:,.0f}:")
print(f"      → Break-even in ~6 months - GOOD investment")
print(f"\n   If implementation cost > ${incremental_revenue_annual:,.0f}:")
print(f"      → Need multi-year value or strategic reasons")

print(f"\n⚠️  Risk assessment:")
if worst_case_annual > 0:
    print(f"   ✓ Even in worst case, still profitable (${worst_case_annual:,.0f})")
    print(f"   ✓ Low downside risk")
else:
    print(f"   ⚠️  Worst case is NEGATIVE (${worst_case_annual:,.0f})")
    print(f"   ⚠️  Consider: Is upside worth the downside risk?")


📊 COST-BENEFIT ANALYSIS:
   Expected annual value: $807,831.59

   If implementation cost = $403,916:
      → Break-even in ~6 months - GOOD investment

   If implementation cost > $807,832:
      → Need multi-year value or strategic reasons

⚠️  Risk assessment:
   ✓ Even in worst case, still profitable ($134,443)
   ✓ Low downside risk


---

## Step 9: Final Decision

---

In [25]:
# Make final decision
decision_result = guardrails.evaluate_guardrails(
    primary_result=primary_result,
    guardrail_results=[guardrail_result]
)

print("\n" + "=" * 60)
print("FINAL DECISION FRAMEWORK")
print("=" * 60)

print(f"\n🎯 Primary Metric: Conversion Rate")
print(f"   Significant: {decision_result['primary_significant']}")
print(f"   Positive:    {decision_result['primary_positive']}")
print(f"   Lift:        {primary_result['relative_lift']:.2%}")

print(f"\n🛡️  Guardrails:")
print(f"   Passed: {decision_result['guardrails_passed']}/{decision_result['guardrails_total']}")

if novelty_result is not None:
    print(f"\n⏱️  Novelty:")
    if novelty_result['novelty_detected']:
        print(f"   ⚠️  Detected - effect may be temporary")
    else:
        print(f"   ✓ Not detected - effect appears stable")

print(f"\n💰 Business Impact:")
print(f"   Annual value: ${incremental_revenue_annual:,.0f}")

decision = decision_result['decision'].upper()
print(f"\n" + "=" * 60)
print(f">>> FINAL DECISION: {decision} <<<")
print("=" * 60)


FINAL DECISION FRAMEWORK

🎯 Primary Metric: Conversion Rate
   Significant: True
   Positive:    True
   Lift:        35.69%

🛡️  Guardrails:
   Passed: 1/1

💰 Business Impact:
   Annual value: $807,832

>>> FINAL DECISION: SHIP <<<


In [26]:
# Decision explanation
print("\n" + "=" * 60)
print("DECISION INTERPRETATION")
print("=" * 60)

if decision == 'SHIP':
    print(f"\n✅ SHIP RECOMMENDATION")
    print(f"\nRationale:")
    print(f"  • Primary metric improved significantly ({primary_result['relative_lift']:.2%})")
    print(f"  • All guardrails passed")
    print(f"  • Positive business impact (${incremental_revenue_annual:,.0f}/year)")
    if novelty_result is not None and novelty_result['novelty_detected']:
        print(f"\n⚠️  Caveat: Novelty effect detected")
        print(f"   Run 2-4 week post-launch holdout to verify sustained effect")
    print(f"\nNext steps:")
    print(f"  1. Prepare rollout plan")
    print(f"  2. Set up monitoring dashboards")
    print(f"  3. Define rollback criteria")

elif decision == 'ABANDON':
    print(f"\n❌ ABANDON RECOMMENDATION")
    print(f"\nRationale:")
    if not decision_result['primary_positive']:
        print(f"  • Primary metric showed NEGATIVE impact")
    if decision_result['guardrails_passed'] < decision_result['guardrails_total']:
        print(f"  • Guardrail(s) FAILED")
    print(f"\nNext steps:")
    print(f"  1. Analyze why it failed")
    print(f"  2. Generate new hypotheses")
    print(f"  3. Design improved treatment")

else:  # HOLD
    print(f"\n⚪ HOLD RECOMMENDATION")
    print(f"\nRationale:")
    if not decision_result['primary_significant']:
        print(f"  • Primary metric not statistically significant")
    print(f"\nOptions:")
    print(f"  1. Extend experiment (more data)")
    print(f"  2. Increase traffic allocation")
    print(f"  3. Use variance reduction (CUPED/CUPAC)")


DECISION INTERPRETATION

✅ SHIP RECOMMENDATION

Rationale:
  • Primary metric improved significantly (35.69%)
  • All guardrails passed
  • Positive business impact ($807,832/year)

Next steps:
  1. Prepare rollout plan
  2. Set up monitoring dashboards
  3. Define rollback criteria


---

## Summary: The Complete A/B Testing Checklist

### Always Do These (in order)

| Step | Question | Tool |
|------|----------|------|
| 1 | Is data quality OK? | Manual checks |
| 2 | Is randomization valid? | `srm_check()` |
| 3 | Do we have enough data? | `power_analysis_summary()` |
| 4 | Is the effect real? | `z_test_proportions()` |
| 5 | Can we reduce noise? | `cuped_ab_test()` |
| 6 | Did we harm anything? | `non_inferiority_test()` |
| 7 | Is it sustainable? | `detect_novelty_effect()` |
| 8 | Is it worth it? | Business impact calculation |
| 9 | Ship/Hold/Abandon? | `evaluate_guardrails()` |
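
The last row of the checklist can be written as a small rule. The sketch below is one plausible reading of the ship/hold/abandon logic, inferred from the rationale printed in Step 9; `decide` is a hypothetical function and the exact rules inside `evaluate_guardrails()` may differ.

```python
def decide(primary_significant, primary_positive, guardrails_passed, guardrails_total):
    """Hypothetical ship/hold/abandon rule based on the primary metric and guardrails."""
    if guardrails_passed < guardrails_total:
        return "abandon"  # a protected metric degraded beyond tolerance
    if primary_significant and not primary_positive:
        return "abandon"  # significant harm to the primary metric
    if primary_significant and primary_positive:
        return "ship"     # significant win with all guardrails intact
    return "hold"         # inconclusive: gather more data or reduce variance


# The situation in this notebook: significant, positive, 1 of 1 guardrails passed
notebook_decision = decide(True, True, 1, 1)
print(notebook_decision.upper())
```

Checking guardrails first reflects the idea that a primary-metric win never justifies shipping a change that breaks a protected metric.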

### Common Pitfalls

1. **Peeking**: Stopping early without proper sequential testing
2. **Ignoring SRM**: Trusting results from broken randomization
3. **No guardrails**: Optimizing primary at expense of critical metrics
4. **Novelty blind**: Shipping temporary effects
5. **P-value obsession**: Significant ≠ meaningful
6. **No power analysis**: Running underpowered experiments

---

## Exercises

### Exercise 1: Try Different Sample Sizes

Run the analysis with sample_frac=0.5 and sample_frac=1.0. How do the results change?

In [27]:
# YOUR CODE HERE


### Exercise 2: Vary Guardrail Thresholds

What happens if you use a stricter guardrail threshold (-2% instead of -5%)?

In [28]:
# YOUR CODE HERE


### Exercise 3: Business Assumptions

How sensitive is the business impact to your assumptions? Try:
- 500K users instead of 1M
- $5 AOV instead of $10

In [29]:
# YOUR CODE HERE
