# A/B Testing Masterclass: Complete End-to-End Workflow
## Marketing Campaign Analysis

---

## 🎯 From Start to Finish: The Full Experimentation Workflow

This notebook completes the trilogy by walking through **everything** that happens in a real experiment, from data quality validation to business impact calculation.

### Why This Matters for Interviews

In data science interviews, A/B testing questions reveal how you think about the **entire process**, not just the statistics:

> *"Many candidates can calculate a p-value, but struggle to explain what happens before and after. They don't know how to validate data quality, estimate business impact, or explain why their results might not generalize."*

This notebook covers the parts that separate strong candidates from average ones:

| Phase | What Most Candidates Do | What Strong Candidates Do |
|-------|-------------------------|---------------------------|
| **Data Quality** | Assume it's clean | Validate systematically |
| **Power Analysis** | Skip it | Use it to set expectations |
| **Interpretation** | Report p-values | Translate to business impact |
| **Novelty Effects** | Ignore them | Check for temporal patterns |
| **Decision** | "Significant = ship" | Consider full context |

### The Complete Checklist

```
□ 1. Data Quality Validation
    └── Missing values, duplicates, outliers, group balance
□ 2. SRM Check
    └── Did randomization work? (Or is this observational?)
□ 3. Power Analysis
    └── Do we have enough data? What's our MDE?
□ 4. Primary Metric Test
    └── Statistical and practical significance
□ 5. Variance Reduction
    └── CUPED if pre-experiment data available
□ 6. Guardrail Metrics
    └── Are we causing unacceptable harm elsewhere?
□ 7. Novelty Effect Check
    └── Is the effect temporary?
□ 8. Business Impact
    └── What does this mean in dollars/users?
□ 9. Final Decision
    └── Ship / Hold / Abandon with full context
```

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Validate data quality** systematically before any analysis
2. **Distinguish RCT from observational data** and adjust interpretation
3. **Conduct power analysis** to set realistic expectations
4. **Detect novelty effects** using time-based analysis
5. **Calculate business impact** (revenue, ROI, user impact)
6. **Make holistic decisions** considering all factors

---

## The Business Context

This dataset contains ~588K observations from a marketing A/B test:
- **Control (PSA)**: Public Service Announcement (no product ad)
- **Treatment (Ad)**: Actual product advertisement

**Primary Question**: Does showing the ad increase conversion rate?

### 💡 Interview Insight: Spotting Observational Data

**Important**: This dataset has a **96%/4% allocation** (treatment/control). This is a red flag.

*"Why is 96/4 allocation suspicious?"*

- Randomized experiments rarely use such an extreme split, because a very small arm limits statistical power
- Without documentation of how users were assigned, the split may reflect **observational data** (who happened to see vs. not see ads)
- Observational data is prone to **selection bias** (users who saw ads may differ systematically from those who did not)

**What changes with observational data**:
- Can't claim causal effects with confidence
- Need to acknowledge limitations upfront
- Results are **associational**, not **causal**
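
The selection-bias concern can be made concrete with a small simulation. The sketch below is purely illustrative: the user "intent" variable, the exposure probabilities, and the conversion rates are all made-up assumptions, and the ad is given **no** true effect. Because high-intent users are more likely to be exposed, the naive exposed-vs-unexposed comparison still shows a gap, which largely disappears once users are compared within the same intent stratum.

```python
import random

def simulate_confounded_exposure(n_users=100_000, seed=0):
    """Simulate ad exposure that depends on user intent, with NO true ad effect."""
    rng = random.Random(seed)
    # counts[(high_intent, saw_ad)] = [users, conversions]
    counts = {(h, a): [0, 0] for h in (False, True) for a in (False, True)}
    for _ in range(n_users):
        high_intent = rng.random() < 0.5
        # Assumption: high-intent users browse more, so they see the ad more often.
        saw_ad = rng.random() < (0.9 if high_intent else 0.5)
        # Conversion depends on intent only; the ad itself does nothing here.
        converted = rng.random() < (0.10 if high_intent else 0.01)
        counts[(high_intent, saw_ad)][0] += 1
        counts[(high_intent, saw_ad)][1] += converted
    return counts

def rate(cells):
    """Conversion rate pooled over a list of [users, conversions] cells."""
    users = sum(c[0] for c in cells)
    conversions = sum(c[1] for c in cells)
    return conversions / users

counts = simulate_confounded_exposure()

# Naive comparison: everyone who saw the ad vs. everyone who did not.
exposed = [counts[(False, True)], counts[(True, True)]]
unexposed = [counts[(False, False)], counts[(True, False)]]
naive_diff = rate(exposed) - rate(unexposed)

# Like-for-like comparison: difference within each intent stratum, then averaged.
stratified_diff = sum(
    rate([counts[(h, True)]]) - rate([counts[(h, False)]]) for h in (False, True)
) / 2

print(f"Naive difference:      {naive_diff:+.4f}")
print(f"Stratified difference: {stratified_diff:+.4f}")
```

In real observational data the confounder is usually unobserved, which is why stratifying is not always possible and why the results are treated as associational.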

---

## Setup

In [1]:
import os

if not os.getcwd().endswith("ab_testing"):
    try:
        os.chdir("../")
    except OSError:
        raise FileNotFoundError("Could not change into 'ab_testing' from the current directory.")

print(f"Current working directory: {os.getcwd()}")


Current working directory: c:\docker_projects\ab_testing


In [2]:
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, Any
from datetime import datetime, timedelta

# A/B Testing modules
from ab_testing.data import loaders
from ab_testing.core import randomization, frequentist, power
from ab_testing.variance_reduction import cuped
from ab_testing.diagnostics import guardrails, novelty

# Set up plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("✓ Modules loaded successfully")

✓ Modules loaded successfully


---

## Step 1: Data Quality Validation

### 💡 Interview Insight: GIGO (Garbage In, Garbage Out)

The first thing strong candidates do is **validate the data**. This shows methodological rigor.

*"Before looking at results, I always check data quality. Bad data leads to wrong conclusions no matter how sophisticated the analysis."*

### Data Quality Checklist

| Check | Why It Matters |
|-------|----------------|
| Missing values | Systematic missingness can bias results |
| Duplicates | Inflates sample size, underestimates variance |
| Outliers | Can skew means and increase variance |
| Data types | Wrong types cause calculation errors |
| Group balance | Extreme imbalance may indicate problems |

---

In [3]:
# Load the data
df = loaders.load_marketing_ab(sample_frac=1.0)
print(f"Dataset loaded: {len(df):,} observations")
print(f"\nColumns: {list(df.columns)}")

Loading Marketing A/B dataset from data\raw\marketing_ab\marketing_AB.csv...
Loaded Marketing A/B dataset: 588,101 rows, 7 columns
  Conversion rate (ad): 2.55%
  Conversion rate (psa): 1.79%
Dataset loaded: 588,101 observations

Columns: ['user_id', 'test_group', 'converted', 'total_ads', 'most_ads_day', 'most_ads_hour', 'treatment']


In [4]:
# Data quality validation function
def validate_data_quality(df):
    """Comprehensive data quality check."""
    print("DATA QUALITY VALIDATION")
    print("=" * 60)
    
    issues = []
    
    # 1. Missing values
    print("\n1️⃣  MISSING VALUES")
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print("   ⚠️  Found missing values:")
        for col in missing[missing > 0].index:
            pct = missing[col] / len(df) * 100
            print(f"      {col}: {missing[col]:,} ({pct:.2f}%)")
            issues.append(f"Missing values in {col}")
    else:
        print("   ✓ No missing values")

    # 2. Duplicates
    print("\n2️⃣  DUPLICATES")
    n_duplicates = df.duplicated().sum()
    if n_duplicates > 0:
        pct = n_duplicates / len(df) * 100
        print(f"   ⚠️  Found {n_duplicates:,} duplicate rows ({pct:.2f}%)")
        issues.append("Duplicate rows found")
    else:
        print("   ✓ No duplicate rows")

    # 3. Group balance
    print("\n3️⃣  GROUP BALANCE")
    if 'test' in df.columns:
        group_col = 'test'
    elif 'treatment' in df.columns:
        group_col = 'treatment'
    else:
        group_col = None

    if group_col:
        balance = df[group_col].value_counts(normalize=True)
        print(f"   Group distribution ({group_col}):")
        for val, pct in balance.items():
            print(f"      {val}: {pct:.2%}")

        # Check for extreme imbalance
        min_pct = balance.min()
        if min_pct < 0.1:  # Less than 10%
            print(f"   ⚠️  Extreme imbalance detected (smallest group: {min_pct:.2%})")
            issues.append("Extreme group imbalance")
        elif min_pct < 0.3:
            print(f"   ⚠️  Moderate imbalance (smallest group: {min_pct:.2%})")
        else:
            print("   ✓ Groups reasonably balanced")

    # 4. Data types
    print("\n4️⃣  DATA TYPES")
    print(f"   {df.dtypes.to_string().replace(chr(10), chr(10) + '   ')}")

    # Summary
    print("\n" + "=" * 60)
    if issues:
        print(f"⚠️  ISSUES FOUND: {len(issues)}")
        for issue in issues:
            print(f"   - {issue}")
    else:
        print("✓ DATA QUALITY PASSED")
    print("=" * 60)
    
    return issues

# Run validation
data_issues = validate_data_quality(df)

DATA QUALITY VALIDATION

1️⃣  MISSING VALUES
   ✓ No missing values

2️⃣  DUPLICATES
   ✓ No duplicate rows

3️⃣  GROUP BALANCE
   Group distribution (treatment):
      1: 96.00%
      0: 4.00%
   ⚠️  Extreme imbalance detected (smallest group: 4.00%)

4️⃣  DATA TYPES
   user_id           int64
   test_group       object
   converted          bool
   total_ads         int64
   most_ads_day     object
   most_ads_hour     int64
   treatment         int64

⚠️  ISSUES FOUND: 1
   - Extreme group imbalance


### 💡 Interview Insight: What To Do About Data Issues

If you find data quality issues, don't panic. Explain your approach:

| Issue | Possible Actions |
|-------|------------------|
| **Missing values** | Investigate pattern, impute or exclude |
| **Duplicates** | Remove if true duplicates, investigate if not |
| **Outliers** | Winsorize, trim, or analyze separately |
| **Extreme imbalance** | Acknowledge, adjust interpretation |

The key is showing you **thought about it** and have a reasoned approach.
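
As one example of a "reasoned approach" to outliers, the sketch below winsorizes a numeric column by clipping it to its empirical 5th and 95th percentiles. The `winsorize` helper and the sample values are hypothetical (they are not part of the `ab_testing` package), and the percentile rule used here is a simple nearest-rank convention chosen for brevity.

```python
def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the empirical [lower_q, upper_q] range (nearest rank, rounded down)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[int(lower_q * last)]
    high = ordered[int(upper_q * last)]
    return [min(max(v, low), high) for v in values]

# Hypothetical ad-impression counts with one extreme user.
total_ads = list(range(1, 20)) + [500]
clipped = winsorize(total_ads)

print(f"Max before: {max(total_ads)}, max after: {max(clipped)}")
print(f"Mean before: {sum(total_ads) / len(total_ads):.1f}, "
      f"mean after: {sum(clipped) / len(clipped):.1f}")
```

Winsorizing keeps every row (unlike trimming), so group sizes are unchanged; the trade-off is that the clipped metric no longer estimates the raw mean.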

---

## Step 2: SRM Check (RCT vs. Observational)

Given the extreme imbalance (96/4), we need to classify this dataset properly.
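
For reference, a sample ratio mismatch (SRM) check on two groups is a chi-square goodness-of-fit test with one degree of freedom. The sketch below is a standard-library stand-in, not the implementation behind `randomization.srm_check`; the function name and the 50/50 design ratio are assumptions made for illustration.

```python
import math

def srm_chi_square(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit test for sample ratio mismatch (two groups, 1 dof)."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# A small deviation from a 50/50 design: not alarming.
chi2_small, p_small = srm_chi_square(5_050, 4_950)
print(f"5,050 vs 4,950:    chi2={chi2_small:.2f}, p={p_small:.4f}")

# The group sizes in this notebook, tested against a hypothetical 50/50 design.
chi2_obs, p_obs = srm_chi_square(564_577, 23_524)
print(f"564,577 vs 23,524: chi2={chi2_obs:,.0f}, p={p_obs:.3g}")
```

The test is only meaningful when the intended allocation is known. If the design really was 96/4, the expected share should be set to that value rather than 0.5.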

---

In [5]:
# Identify treatment column
treatment_col = 'test' if 'test' in df.columns else 'treatment'
outcome_col = 'converted' if 'converted' in df.columns else 'conversion'

# Calculate group sizes
group_sizes = df[treatment_col].value_counts()
total = len(df)

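# Determine control/treatment by group size.
# NOTE: this heuristic labels the larger group as "control". In this dataset the
# larger group is treatment == 1 (the ad group, 2.55% conversion per the loader
# output), so "control" below is the ad group and "treatment" is the PSA group.
control_val = group_sizes.idxmax()  # Larger group (here: 1, the ad group)
treatment_val = group_sizes.idxmin()  # Smaller group (here: 0, the PSA group)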

control_count = group_sizes[control_val]
treatment_count = group_sizes[treatment_val]

treatment_ratio = treatment_count / total

print("Dataset Classification")
print("=" * 50)
print(f"\nControl ({control_val}): {control_count:,} ({control_count/total:.2%})")
print(f"Treatment ({treatment_val}): {treatment_count:,} ({treatment_ratio:.2%})")

Dataset Classification

Control (1): 564,577 (96.00%)
Treatment (0): 23,524 (4.00%)


In [6]:
# Classify the dataset
def classify_dataset(treatment_ratio):
    """Classify dataset as RCT or observational based on allocation."""
    if 0.4 <= treatment_ratio <= 0.6:
        return 'RCT', 'Balanced allocation suggests randomized controlled trial'
    elif 0.1 <= treatment_ratio <= 0.9:
        return 'DESIGNED_IMBALANCE', 'Intentionally unequal allocation (10-90 range)'
    else:
        return 'OBSERVATIONAL', 'Extreme imbalance suggests observational data'

data_type, explanation = classify_dataset(treatment_ratio)

print("\nDataset Classification")
print("=" * 50)
print(f"\nType: {data_type}")
print(f"Reason: {explanation}")

if data_type == 'OBSERVATIONAL':
    print("\n⚠️  IMPLICATIONS:")
    print("   - Cannot claim causal effects")
    print("   - Selection bias likely present")
    print("   - Results are associational only")
    print("   - Need to acknowledge limitations in conclusions")


Dataset Classification

Type: OBSERVATIONAL
Reason: Extreme imbalance suggests observational data

⚠️  IMPLICATIONS:
   - Cannot claim causal effects
   - Selection bias likely present
   - Results are associational only
   - Need to acknowledge limitations in conclusions


In [7]:
# Run SRM check (but interpret carefully given observational nature)
# For observational data, SRM is less meaningful but we do it for completeness

if data_type == 'OBSERVATIONAL':
    print("SRM Check (Observational Data Context)")
    print("=" * 50)
    print("\n⚠️  Note: SRM is designed for RCTs with known expected allocation.")
    print("   For observational data, the allocation IS the phenomenon we observe.")
    print("   Checking for information purposes only.")
    # Set a dummy srm_result for use later in the notebook
    srm_result = {'srm_detected': False, 'srm_severe': False, 'srm_warning': False}
else:
    srm_result = randomization.srm_check(
        n_control=control_count,
        n_treatment=treatment_count,
        expected_ratio=[0.5, 0.5],  # Expected 50/50 split as list
        alpha=0.01
    )
    print("SRM Check Results")
    print("=" * 50)
    print(f"\nChi-square statistic: {srm_result['chi2_statistic']:.4f}")
    print(f"P-value: {srm_result['p_value']:.6f}")
    print(f"SRM Detected: {srm_result['srm_detected']}")
    print(f"Severe SRM: {srm_result['srm_severe']}")

SRM Check (Observational Data Context)

⚠️  Note: SRM is designed for RCTs with known expected allocation.
   For observational data, the allocation IS the phenomenon we observe.
   Checking for information purposes only.


---

## Step 3: Power Analysis

### 💡 Interview Insight: Why Power Analysis Matters

Power analysis answers: *"Do we have enough data to detect a meaningful effect?"*

- **Before the experiment**: Helps determine sample size
- **After the experiment**: Helps interpret null results

*"If the test is not significant, was it because there's no effect, or because we didn't have enough power to detect one?"*

### Key Concepts

| Term | Definition | Typical Value |
|------|------------|---------------|
| **Power (1-β)** | Probability of detecting a true effect | 80% |
| **Alpha (α)** | False positive rate | 5% |
| **MDE** | Minimum Detectable Effect | Varies by context |
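
To see how these three quantities interact, the sketch below computes an approximate per-group sample size for a two-sided two-proportion z-test. It assumes equal group sizes and an unpooled variance; `required_n_per_group` is a hypothetical helper, and the notebook's `power` module may use a slightly different formula.

```python
import math
from statistics import NormalDist

def required_n_per_group(p_baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    p_treatment = p_baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_treatment - p_baseline) ** 2
    return math.ceil(n)

# Smaller effects need disproportionately more data (roughly 1 / effect^2).
for lift in (0.05, 0.10, 0.20):
    n = required_n_per_group(0.0255, lift)
    print(f"{lift:.0%} relative lift on a 2.55% baseline: {n:,} users per group")
```

Halving the MDE roughly quadruples the required sample, which is why agreeing on a realistic MDE before launch matters.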

---

In [8]:
# Calculate baseline conversion rate
control_df = df[df[treatment_col] == control_val]
treatment_df = df[df[treatment_col] == treatment_val]

baseline_rate = control_df[outcome_col].mean()

print("Baseline Metrics")
print("=" * 50)
print(f"\nControl conversion rate: {baseline_rate:.4f} ({baseline_rate:.2%})")
print(f"Control sample size: {len(control_df):,}")
print(f"Treatment sample size: {len(treatment_df):,}")

Baseline Metrics

Control conversion rate: 0.0255 (2.55%)
Control sample size: 564,577
Treatment sample size: 23,524


In [9]:
# Power analysis
# The power_analysis_summary function takes p_baseline and mde, returns required sample size
# We want to find MDE given our sample sizes, so we solve for it with a root finder (Brent's method)

from scipy.optimize import brentq

def find_mde(n_control, n_treatment, p_baseline, target_power=0.80, alpha=0.05):
    """Find MDE that achieves target power given sample sizes."""
    # Use smaller sample size for conservative estimate
    n = min(n_control, n_treatment)
    
    def power_diff(mde):
        """Difference between achieved power and target power."""
        p_treatment = p_baseline * (1 + mde)
        if p_treatment >= 1 or p_treatment <= 0:
            return -1  # Invalid
        achieved_power = power.power_binary(p_baseline, p_treatment, n, alpha)
        return achieved_power - target_power
    
    # Root-find the MDE with Brent's method
    try:
        mde = brentq(power_diff, 0.001, 2.0)  # Search between 0.1% and 200% relative lift
        return mde
    except ValueError:
        return None

# Calculate MDE
mde_result = find_mde(
    n_control=len(control_df),
    n_treatment=len(treatment_df),
    p_baseline=baseline_rate,
    target_power=0.80,
    alpha=0.05
)

print("Power Analysis")
print("=" * 50)
print(f"\nWith current sample sizes:")
print(f"  Control: {len(control_df):,}")
print(f"  Treatment: {len(treatment_df):,}")
print(f"\nBaseline conversion rate: {baseline_rate:.2%}")

if mde_result:
    mde_absolute = baseline_rate * mde_result
    print(f"\nMinimum Detectable Effect (MDE) at 80% power:")
    print(f"  Relative: {mde_result:.2%}")
    print(f"  Absolute: {mde_absolute:.4f}")
    print(f"\nInterpretation:")
    print(f"  We can detect a {mde_result:.2%} relative lift with 80% power.")
    print(f"  Smaller effects may not be detectable with this sample.")
    
    # Store for later use
    power_result = {
        'mde_relative': mde_result,
        'mde_absolute': mde_absolute,
        'p_baseline': baseline_rate,
        'n_control': len(control_df),
        'n_treatment': len(treatment_df)
    }
else:
    print("\n⚠️  Could not calculate MDE (sample too small or baseline too extreme)")
    power_result = {'mde_relative': 0, 'mde_absolute': 0}

Power Analysis

With current sample sizes:
  Control: 564,577
  Treatment: 23,524

Baseline conversion rate: 2.55%

Minimum Detectable Effect (MDE) at 80% power:
  Relative: 16.57%
  Absolute: 0.0042

Interpretation:
  We can detect a 16.57% relative lift with 80% power.
  Smaller effects may not be detectable with this sample.


### 💡 Interview Insight: Interpreting MDE

*"What does the MDE tell you?"*

**Strong answer**: *"The MDE tells us the smallest effect we can reliably detect. If the true effect is smaller than our MDE, we're unlikely to find a statistically significant result, even if the effect is real. This is important for interpreting null results: a non-significant result doesn't mean no effect, it means we couldn't detect an effect at least as large as our MDE."*
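
The point about effects below the MDE can be checked numerically. The sketch below approximates the power of a two-sided two-proportion z-test for several true lifts, assuming two equal groups of 23,524 users (the smaller group in this notebook). `power_two_proportions` is an illustrative helper and is not claimed to match `power.power_binary` exactly.

```python
from statistics import NormalDist

def power_two_proportions(p_baseline, relative_lift, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with equal group sizes."""
    normal = NormalDist()
    p_treatment = p_baseline * (1 + relative_lift)
    se = ((p_baseline * (1 - p_baseline)
           + p_treatment * (1 - p_treatment)) / n_per_group) ** 0.5
    z_crit = normal.inv_cdf(1 - alpha / 2)
    z_effect = abs(p_treatment - p_baseline) / se
    # Probability of landing in either rejection region when the effect is real.
    return normal.cdf(z_effect - z_crit) + normal.cdf(-z_effect - z_crit)

n = 23_524  # smaller group in this notebook
for lift in (0.05, 0.10, 0.1657):
    print(f"True lift {lift:.2%}: power = {power_two_proportions(0.0255, lift, n):.2f}")
```

Under these assumptions, a true effect well below the MDE would be missed most of the time, so a non-significant result says little about small effects.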

---

## Step 4: Primary Metric Test (Conversion Rate)
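
Before calling the library function, it can help to see the test written out. The sketch below is a standard-library two-proportion z-test that uses a pooled standard error for the test statistic and an unpooled (Wald) interval for the difference. It is an illustration, not the source of `frequentist.z_test_proportions`, whose internals are not shown in this notebook. The conversion counts are back-computed from the rates and group sizes printed in this notebook, and the groups follow the notebook's own control/treatment labels.

```python
from statistics import NormalDist

def two_proportion_z_test(x_control, n_control, x_treatment, n_treatment, alpha=0.05):
    """Two-sided z-test for a difference in proportions, with an unpooled Wald CI."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c
    # Pooled standard error for the test statistic (under H0: p_c == p_t).
    p_pool = (x_control + x_treatment) / (n_control + n_treatment)
    se_pooled = (p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment)) ** 0.5
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval.
    se = (p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "diff": diff,
        "relative_lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se, diff + z_crit * se),
    }

# Counts back-computed from the printed rates (approximate, for illustration).
result = two_proportion_z_test(x_control=14_423, n_control=564_577,
                               x_treatment=420, n_treatment=23_524)
print(f"Absolute difference: {result['diff']:.4f}")
print(f"Relative lift: {result['relative_lift']:.2%}")
print(f"95% CI: [{result['ci'][0]:.4f}, {result['ci'][1]:.4f}]")
```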

In [10]:
# Extract outcome arrays
control_outcome = control_df[outcome_col].values
treatment_outcome = treatment_df[outcome_col].values

# z_test_proportions expects counts, not arrays
x_control = control_outcome.sum()
n_control = len(control_outcome)
x_treatment = treatment_outcome.sum()
n_treatment = len(treatment_outcome)

# Run z-test
conversion_result = frequentist.z_test_proportions(
    x_control=x_control,
    n_control=n_control,
    x_treatment=x_treatment,
    n_treatment=n_treatment,
    alpha=0.05
)

print("Primary Metric: Conversion Rate")
print("=" * 50)
print(f"\nControl:   {conversion_result['p_control']:.4f} ({conversion_result['p_control']:.2%})")
print(f"Treatment: {conversion_result['p_treatment']:.4f} ({conversion_result['p_treatment']:.2%})")
print(f"\nAbsolute difference: {conversion_result['absolute_lift']:.4f}")
print(f"Relative lift: {conversion_result['relative_lift']:.2%}")
print(f"\n95% CI: [{conversion_result['ci_lower']:.4f}, {conversion_result['ci_upper']:.4f}]")
print(f"P-value: {conversion_result['p_value']:.6f}")
print(f"\nStatistically significant: {conversion_result['significant']}")

Primary Metric: Conversion Rate

Control:   0.0255 (2.55%)
Treatment: 0.0179 (1.79%)

Absolute difference: -0.0077
Relative lift: -30.11%

95% CI: [-0.0094, -0.0060]
P-value: 0.000000

Statistically significant: True


In [11]:
# Compare observed effect to MDE
observed_relative_lift = abs(conversion_result['relative_lift'])
mde_relative = abs(power_result['mde_relative'])

print("\nEffect Size vs. MDE")
print("=" * 50)
print(f"\nObserved relative lift: {conversion_result['relative_lift']:.2%}")
print(f"Minimum Detectable Effect: {mde_relative:.2%}")

if observed_relative_lift >= mde_relative:
    print(f"\n✓ Observed effect is larger than MDE")
    print("  We had sufficient power to detect this effect.")
else:
    print(f"\n⚠️  Observed effect is smaller than MDE")
    print("  The effect may be real but we lack power to confirm.")


Effect Size vs. MDE

Observed relative lift: -30.11%
Minimum Detectable Effect: 16.57%

✓ Observed effect is larger than MDE
  We had sufficient power to detect this effect.


---

## Step 5: CUPED Variance Reduction

CUPED uses pre-experiment data to reduce variance. Even without true pre-experiment data, we can demonstrate the concept.
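
The core CUPED adjustment is small enough to write out. The sketch below uses synthetic data (a made-up pre-period metric `x_pre` that explains most of the outcome `y`) and a hypothetical `cuped_adjust` helper; it is not the implementation of `cuped.cuped_ab_test`. The adjustment subtracts the part of the outcome predicted by the covariate, which leaves the mean unchanged and shrinks the variance.

```python
import random
from statistics import fmean, variance

def cuped_adjust(y, x_pre):
    """Return CUPED-adjusted outcomes y - theta * (x_pre - mean(x_pre)) and theta."""
    mean_x, mean_y = fmean(x_pre), fmean(y)
    cov_xy = sum((x - mean_x) * (v - mean_y) for x, v in zip(x_pre, y)) / (len(y) - 1)
    theta = cov_xy / variance(x_pre)  # regression slope of y on x_pre
    adjusted = [v - theta * (x - mean_x) for x, v in zip(x_pre, y)]
    return adjusted, theta

# Synthetic data: the pre-period metric explains most of the outcome's variance.
rng = random.Random(7)
x_pre = [rng.gauss(0, 1) for _ in range(5_000)]
y = [x + rng.gauss(0, 0.5) for x in x_pre]

y_adj, theta = cuped_adjust(y, x_pre)
reduction = 1 - variance(y_adj) / variance(y)
print(f"theta = {theta:.3f}, variance reduction = {reduction:.1%}")
```

In an experiment, `theta` is typically estimated on both groups pooled and the adjusted outcomes are then compared between groups. The covariate must be measured before exposure, otherwise the adjustment can absorb part of the treatment effect.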

---

In [12]:
# Check for available covariates
print("Available columns for CUPED:")
covariate_cols = [col for col in df.columns if col not in [treatment_col, outcome_col]]
print(covariate_cols)

# If we have a usable covariate, run CUPED
if 'tot_impr' in df.columns:
    # Total impressions can serve as a covariate
    covariate_col = 'tot_impr'
    
    cuped_result = cuped.cuped_ab_test(
        y=df[outcome_col].values,
        treatment=df[treatment_col].map({control_val: 0, treatment_val: 1}).values,
        x_pre=df[covariate_col].values,
        alpha=0.05
    )
    
    print("\nCUPED Analysis Results")
    print("=" * 50)
    print(f"\nUsing covariate: {covariate_col}")
    print(f"\nTreatment effect: {cuped_result['ate']:.6f}")
    print(f"Standard error: {cuped_result['se']:.6f}")
    print(f"\n95% CI: [{cuped_result['ci_lower']:.6f}, {cuped_result['ci_upper']:.6f}]")
    print(f"P-value: {cuped_result['p_value']:.6f}")
    
    # Calculate variance reduction
    basic_se = (conversion_result['ci_upper'] - conversion_result['ci_lower']) / (2 * 1.96)
    cuped_se = cuped_result['se']
    variance_reduction = 1 - (cuped_se ** 2) / (basic_se ** 2)
    
    print(f"\nVariance reduction: {variance_reduction:.1%}")
else:
    print("\n⚠️  No suitable covariate found for CUPED.")
    print("   In practice, you'd use a pre-experiment measure of the outcome.")

Available columns for CUPED:
['user_id', 'test_group', 'total_ads', 'most_ads_day', 'most_ads_hour']

⚠️  No suitable covariate found for CUPED.
   In practice, you'd use a pre-experiment measure of the outcome.


---

## Step 6: Guardrail Metrics

Beyond conversion, we need to check that we haven't caused harm elsewhere.
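
A guardrail is usually framed as a non-inferiority test: rather than asking whether treatment is better, it asks whether treatment is *not worse than control by more than a tolerated margin*. The sketch below works from summary statistics and uses a one-sided z-test. The function name, the 10% relative margin, and the example numbers are assumptions for illustration, and the margin is treated as fixed (uncertainty in the control mean is ignored). It is not the code behind `guardrails.non_inferiority_test`.

```python
from statistics import NormalDist

def non_inferiority_z_test(mean_c, sd_c, n_c, mean_t, sd_t, n_t,
                           relative_margin=0.10, alpha=0.05):
    """One-sided test of H0: treatment is worse than control by at least the margin."""
    margin = relative_margin * abs(mean_c)  # tolerated absolute drop
    diff = mean_t - mean_c
    se = (sd_c ** 2 / n_c + sd_t ** 2 / n_t) ** 0.5
    # Large positive z means the drop is convincingly smaller than the margin.
    z = (diff + margin) / se
    p_value = 1 - NormalDist().cdf(z)
    return {"diff": diff, "z": z, "p_value": p_value, "passed": p_value < alpha}

# Hypothetical engagement metric where higher is better.
ok = non_inferiority_z_test(10.0, 2.0, 1_000, 9.9, 2.0, 1_000)   # 1% drop
bad = non_inferiority_z_test(10.0, 2.0, 1_000, 8.9, 2.0, 1_000)  # 11% drop

print(f"1% drop:  passed={ok['passed']}, p={ok['p_value']:.4f}")
print(f"11% drop: passed={bad['passed']}, p={bad['p_value']:.4f}")
```

Note the burden of proof: the guardrail passes only when the data rule out a drop as large as the margin, so an underpowered guardrail fails by default.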

---

In [13]:
# Check available guardrail metrics
print("Potential Guardrail Metrics")
print("=" * 50)

# Look for engagement metrics
if 'tot_impr' in df.columns:
    print("\nFound: tot_impr (total impressions)")
    print("  This measures ad exposure; we don't want to over-serve ads.")
    
    control_impr = control_df['tot_impr'].values
    treatment_impr = treatment_df['tot_impr'].values
    
    # Non-inferiority test (we're checking we didn't INCREASE impressions too much)
    # Or that we didn't decrease engagement too much
    guardrail_result = guardrails.non_inferiority_test(
        control=control_impr,
        treatment=treatment_impr,
        delta=-0.10,  # Allow max 10% degradation
        metric_type='relative',
        alpha=0.05
    )
    
    print("\nGuardrail: Ad Impressions")
    print(f"  Control mean: {guardrail_result['mean_control']:.2f}")
    print(f"  Treatment mean: {guardrail_result['mean_treatment']:.2f}")
    print(f"  Relative change: {(guardrail_result['difference'] / guardrail_result['mean_control']):.2%}")
    print(f"  Guardrail passed: {'✓ Yes' if guardrail_result['passed'] else '✗ No'}")
else:
    print("\n⚠️  No additional metrics available for guardrails.")
    guardrail_result = {'passed': True}  # Default to passed

Potential Guardrail Metrics

⚠️  No additional metrics available for guardrails.


---

## Step 7: Novelty Effect Detection

### 💡 Interview Insight: Why Check for Novelty Effects?

**Novelty effects** occur when users respond to *newness* rather than the actual feature.

Example: A new UI gets high engagement initially (curiosity) but engagement drops as users get used to it.

**How to detect**:
- Analyze effect over time
- If effect decreases, novelty may be at play
- Recommend holdout for long-term monitoring

**Industry Practice**:
- Zynga: All major changes run 2-4 weeks minimum
- King: Uses 2-week holdouts for game changes
- Supercell: Monitors metrics for weeks post-launch
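
One simple way to "analyze effect over time" is to fit a straight line through the per-period lift and look at the sign and size of the slope. The sketch below does this with ordinary least squares on hypothetical weekly lifts; the numbers are invented to show a fading effect and are not taken from this dataset.

```python
def lift_slope(lifts):
    """Least-squares slope of lift against period index (1, 2, 3, ...)."""
    n = len(lifts)
    periods = range(1, n + 1)
    mean_x = sum(periods) / n
    mean_y = sum(lifts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(periods, lifts))
    sxx = sum((x - mean_x) ** 2 for x in periods)
    return sxy / sxx

# Hypothetical weekly relative lifts that fade after launch.
fading = [0.20, 0.12, 0.08, 0.05]
slope = lift_slope(fading)
print(f"Lift changes by {slope:+.3f} per week")
```

With only a handful of periods the slope is noisy, so a negative value is a prompt to keep a holdout running rather than proof of a novelty effect.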

---

In [14]:
# Simulate time-based analysis (if we had timestamp data)
# For demonstration, we'll create synthetic time periods

print("Novelty Effect Analysis")
print("=" * 50)

# Check if we have time data
time_cols = [col for col in df.columns if 'time' in col.lower() or 'date' in col.lower()]

if time_cols:
    print(f"\nTime columns found: {time_cols}")
    # Would run actual time-based analysis here
else:
    print("\n⚠️  No timestamp data available.")
    print("\nSimulating time-based analysis with random assignment to weeks:")
    
    # Simulate 4 weeks of data
    np.random.seed(42)
    df['simulated_week'] = np.random.randint(1, 5, size=len(df))
    
    # Calculate effect by week
    print(f"\n{'Week':<8} {'Control CR':>12} {'Treatment CR':>14} {'Lift':>10}")
    print("-" * 50)

    weekly_lifts = []
    for week in range(1, 5):
        week_df = df[df['simulated_week'] == week]
        ctrl_cr = week_df[week_df[treatment_col] == control_val][outcome_col].mean()
        treat_cr = week_df[week_df[treatment_col] == treatment_val][outcome_col].mean()
        lift = (treat_cr - ctrl_cr) / ctrl_cr if ctrl_cr > 0 else 0
        weekly_lifts.append(lift)
        print(f"Week {week:<4} {ctrl_cr:>12.2%} {treat_cr:>14.2%} {lift:>10.2%}")

    # Check for declining trend
    if len(weekly_lifts) >= 2:
        trend = weekly_lifts[-1] - weekly_lifts[0]
        print(f"\nTrend (Week 4 - Week 1): {trend:.2%}")
        if trend < -0.05:
            print("⚠️  Possible novelty effect detected (declining lift)")
        else:
            print("✓ No clear novelty effect (lift stable or increasing)")

Novelty Effect Analysis

⚠️  No timestamp data available.

Simulating time-based analysis with random assignment to weeks:

Week       Control CR   Treatment CR       Lift
--------------------------------------------------
Week 1           2.58%          1.78%    -30.91%
Week 2           2.61%          1.80%    -30.99%
Week 3           2.53%          1.96%    -22.74%
Week 4           2.50%          1.60%    -35.79%

Trend (Week 4 - Week 1): -4.88%
✓ No clear novelty effect (lift stable or increasing)


---

## Step 8: Business Impact Calculation

### 💡 Interview Insight: Translating Statistics to Dollars

This is where many candidates fall short. They report p-values but can't answer:
*"What does this mean for the business?"*

**Strong candidates** translate results into:
- Revenue impact
- User impact
- ROI calculations
- Confidence intervals on business metrics

---

In [15]:
# Business impact calculation
print("BUSINESS IMPACT ANALYSIS")
print("=" * 60)

# Assumptions (would come from business context)
MONTHLY_VISITORS = 10_000_000  # 10M visitors
REVENUE_PER_CONVERSION = 50  # $50 per conversion
IMPLEMENTATION_COST = 50_000  # $50K to implement

print("\nAssumptions:")
print(f"  Monthly visitors: {MONTHLY_VISITORS:,}")
print(f"  Revenue per conversion: ${REVENUE_PER_CONVERSION}")
print(f"  Implementation cost: ${IMPLEMENTATION_COST:,}")

# Current state (use p_control instead of mean_control)
current_conversions = MONTHLY_VISITORS * conversion_result['p_control']
current_revenue = current_conversions * REVENUE_PER_CONVERSION

print(f"\nCurrent State (Control):")
print(f"  Conversion rate: {conversion_result['p_control']:.2%}")
print(f"  Monthly conversions: {current_conversions:,.0f}")
print(f"  Monthly revenue: ${current_revenue:,.0f}")

# Projected state (use p_treatment instead of mean_treatment)
projected_conversions = MONTHLY_VISITORS * conversion_result['p_treatment']
projected_revenue = projected_conversions * REVENUE_PER_CONVERSION

print(f"\nProjected State (Treatment):")
print(f"  Conversion rate: {conversion_result['p_treatment']:.2%}")
print(f"  Monthly conversions: {projected_conversions:,.0f}")
print(f"  Monthly revenue: ${projected_revenue:,.0f}")

# Impact
monthly_impact = projected_revenue - current_revenue
annual_impact = monthly_impact * 12
roi = (annual_impact - IMPLEMENTATION_COST) / IMPLEMENTATION_COST

print(f"\nImpact:")
print(f"  Additional monthly conversions: {projected_conversions - current_conversions:,.0f}")
print(f"  Additional monthly revenue: ${monthly_impact:,.0f}")
print(f"  Additional annual revenue: ${annual_impact:,.0f}")
print(f"  ROI (first year): {roi:.0%}")

BUSINESS IMPACT ANALYSIS

Assumptions:
  Monthly visitors: 10,000,000
  Revenue per conversion: $50
  Implementation cost: $50,000

Current State (Control):
  Conversion rate: 2.55%
  Monthly conversions: 255,466
  Monthly revenue: $12,773,280

Projected State (Treatment):
  Conversion rate: 1.79%
  Monthly conversions: 178,541
  Monthly revenue: $8,927,053

Impact:
  Additional monthly conversions: -76,925
  Additional monthly revenue: $-3,846,227
  Additional annual revenue: $-46,154,719
  ROI (first year): -92409%


In [16]:
# Confidence interval on business impact
print("\nConfidence Interval on Annual Revenue Impact")
print("=" * 50)

# Convert CI to revenue
impact_lower = conversion_result['ci_lower'] * MONTHLY_VISITORS * REVENUE_PER_CONVERSION * 12
impact_upper = conversion_result['ci_upper'] * MONTHLY_VISITORS * REVENUE_PER_CONVERSION * 12

print(f"\n95% CI on annual revenue impact:")
print(f"  Lower: ${impact_lower:,.0f}")
print(f"  Point estimate: ${annual_impact:,.0f}")
print(f"  Upper: ${impact_upper:,.0f}")

if impact_lower > 0:
    print(f"\n✓ Even the conservative estimate is positive.")
elif impact_upper < 0:
    print(f"\n⚠️  Even the optimistic estimate is negative.")
else:
    print(f"\n⚠️  CI includes zero; impact is uncertain.")


Confidence Interval on Annual Revenue Impact

95% CI on annual revenue impact:
  Lower: $-56,603,844
  Point estimate: $-46,154,719
  Upper: $-35,705,595

⚠️  Even the optimistic estimate is negative.


### 💡 Interview Insight: Communicating Uncertainty

Notice we provide a **range**, not just a point estimate. This shows you understand:

1. Statistical results have uncertainty
2. Business decisions should account for downside risk
3. CI on business metrics is more actionable than p-values

*"The expected annual revenue impact is $X, with a 95% confidence interval of $Y to $Z. Even in the pessimistic scenario, the ROI exceeds our threshold."*

---

## Step 9: Final Decision

Now we synthesize everything into a decision.
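
The decision rules applied in this step can also be expressed as a small pure function, which makes them easy to unit-test and reuse. The sketch below mirrors the branching used in this notebook; `recommend` and its argument names are hypothetical and not part of the `ab_testing` package.

```python
def recommend(data_type, significant, relative_lift, guardrails_passed):
    """Map analysis results to a recommendation, mirroring this notebook's rules."""
    if data_type == "OBSERVATIONAL":
        # Causal claims are not supported, whatever the test says.
        return "HOLD / INVESTIGATE FURTHER"
    if (significant and relative_lift < 0) or not guardrails_passed:
        return "ABANDON"
    if significant and relative_lift > 0:
        return "SHIP"
    # Inconclusive primary metric: keep collecting data.
    return "HOLD"

# This notebook's result, followed by a hypothetical clean RCT win.
print(recommend("OBSERVATIONAL", True, -0.3011, True))
print(recommend("RCT", True, 0.08, True))
```

Checking the data type first encodes the idea that a significant result from non-randomized data should not trigger a ship or abandon call on its own.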

---

In [17]:
# Final decision framework
print("\n" + "=" * 70)
print("FINAL DECISION SYNTHESIS")
print("=" * 70)

# Summarize all factors
print("\n📊 DATA QUALITY:")
print(f"   Issues found: {len(data_issues)}")
if data_issues:
    for issue in data_issues:
        print(f"   - {issue}")

print(f"\n🔬 DATASET TYPE: {data_type}")
if data_type == 'OBSERVATIONAL':
    print("   ⚠️  Causal claims limited")

print(f"\n📈 PRIMARY METRIC (Conversion Rate):")
print(f"   Lift: {conversion_result['relative_lift']:.2%}")
print(f"   P-value: {conversion_result['p_value']:.4f}")
print(f"   Significant: {conversion_result['significant']}")
print(f"   Direction: {'Positive' if conversion_result['relative_lift'] > 0 else 'Negative'}")

print(f"\n⚡ POWER ANALYSIS:")
print(f"   MDE: {power_result['mde_relative']:.2%}")
print(f"   Observed effect {'>' if observed_relative_lift >= mde_relative else '<'} MDE")

print(f"\n🛡️  GUARDRAILS:")
print(f"   Passed: {'✓ Yes' if guardrail_result['passed'] else '✗ No'}")

print(f"\n⏱️  NOVELTY EFFECT:")
print(f"   Detected: Limited data for assessment")

print(f"\n💰 BUSINESS IMPACT:")
print(f"   Annual revenue: ${annual_impact:,.0f}")
print(f"   ROI: {roi:.0%}")

# Make decision
print("\n" + "=" * 70)

# Decision logic
primary_positive = conversion_result['significant'] and conversion_result['relative_lift'] > 0
primary_negative = conversion_result['significant'] and conversion_result['relative_lift'] < 0
guardrails_passed = guardrail_result['passed']

if data_type == 'OBSERVATIONAL':
    # More cautious with observational data
    print("\n⚪ RECOMMENDATION: HOLD / INVESTIGATE FURTHER")
    print("\nReasoning:")
    print("  • This is observational data (96/4 allocation)")
    print("  • Cannot claim causal effects with confidence")
    print("  • Selection bias may explain observed differences")
    print("\nNext Steps:")
    print("  1. Design a proper RCT with balanced allocation")
    print("  2. Investigate why allocation is so imbalanced")
    print("  3. Consider propensity score matching for causal inference")
elif primary_negative or not guardrails_passed:
    print("\n❌ RECOMMENDATION: ABANDON")
    print("\nReasoning:")
    if primary_negative:
        print(f"  • Primary metric is significantly negative ({conversion_result['relative_lift']:.2%})")
    if not guardrails_passed:
        print("  • Guardrail metric failed")
elif primary_positive and guardrails_passed:
    print("\n✅ RECOMMENDATION: SHIP")
    print("\nReasoning:")
    print(f"  • Primary metric positive and significant ({conversion_result['relative_lift']:.2%})")
    print("  • All guardrails passed")
    print(f"  • Positive ROI ({roi:.0%})")
else:
    print("\n⚪ RECOMMENDATION: HOLD")
    print("\nReasoning:")
    if not conversion_result['significant']:
        print("  • Primary metric not statistically significant")
    print("\nNext Steps:")
    print("  1. Continue collecting data")
    print("  2. Re-evaluate in 1-2 weeks")

print("\n" + "=" * 70)


FINAL DECISION SYNTHESIS

📊 DATA QUALITY:
   Issues found: 1
   - Extreme group imbalance

🔬 DATASET TYPE: OBSERVATIONAL
   ⚠️  Causal claims limited

📈 PRIMARY METRIC (Conversion Rate):
   Lift: -30.11%
   P-value: 0.0000
   Significant: True
   Direction: Negative

⚡ POWER ANALYSIS:
   MDE: 16.57%
   Observed effect > MDE

🛡️  GUARDRAILS:
   Passed: ✓ Yes

⏱️  NOVELTY EFFECT:
   Detected: Limited data for assessment

💰 BUSINESS IMPACT:
   Annual revenue: $-46,154,719
   ROI: -92409%


⚪ RECOMMENDATION: HOLD / INVESTIGATE FURTHER

Reasoning:
  • This is observational data (96/4 allocation)
  • Cannot claim causal effects with confidence
  • Selection bias may explain observed differences

Next Steps:
  1. Design a proper RCT with balanced allocation
  2. Investigate why allocation is so imbalanced
  3. Consider propensity score matching for causal inference



---

## Summary: The Complete Workflow

| Step | What We Did | Key Question Answered |
|------|-------------|----------------------|
| 1. Data Quality | Validated for issues | Is our data trustworthy? |
| 2. SRM Check | Classified dataset type | Is this an RCT or observational? |
| 3. Power Analysis | Calculated MDE | Can we detect meaningful effects? |
| 4. Primary Test | Z-test for conversion | Is there a statistically significant effect? |
| 5. CUPED | Variance reduction | Can we improve precision? |
| 6. Guardrails | Non-inferiority tests | Are we causing harm elsewhere? |
| 7. Novelty | Time-based analysis | Is the effect temporary? |
| 8. Business Impact | Revenue calculation | What does this mean in dollars? |
| 9. Decision | Synthesized all factors | Ship / Hold / Abandon? |

---

## Key Takeaways for Interviews

1. **Validate data first.** Don't trust that data is clean.

2. **Know your limitations.** Observational ≠ causal. Say it upfront.

3. **Power analysis isn't optional.** It helps interpret results.

4. **Translate to business impact.** P-values don't pay salaries; revenue does.

5. **Acknowledge uncertainty.** Use confidence intervals, not just point estimates.

6. **Check for novelty.** Short-term wins can be long-term losses.

7. **Decisions need context.** Significant ≠ ship. Consider the full picture.

---

## 🎓 Exercises for Practice

### Exercise 1: Sensitivity Analysis
How does the decision change if we use α=0.01 instead of α=0.05?

### Exercise 2: Business Impact Scenarios
What if revenue per conversion is $20 instead of $50? What's the break-even?

### Exercise 3: Interview Practice
Write a 2-minute executive summary of this analysis for a VP of Product. Focus on: what we tested, what we found, what we recommend, and what we're uncertain about.

---

**Next Notebook**: [04_reference_guide.ipynb](04_reference_guide.ipynb) - Quick reference for all A/B testing concepts and techniques.