# Criteo Uplift Analysis: ML-Enhanced Experimentation

**Scenario**: Ad tech company optimizing targeted campaigns

**Business Question**: Who should we target to maximize incremental conversions?

**Dataset**: 13.9M observations (use a sample for faster execution)

**Key Features**: 11 user characteristics for ML-enhanced variance reduction

## 📚 What You'll Learn

1. ✅ CUPAC (ML-enhanced CUPED with GradientBoosting)
2. ✅ X-Learner for heterogeneous treatment effects (HTE)
3. ✅ Sequential testing for early stopping
4. ✅ Large-scale data best practices

In [None]:
# Setup
import pandas as pd
import numpy as np
from ab_testing.data import loaders
from ab_testing.pipelines.criteo_pipeline import run_criteo_analysis

print("✅ Ready to analyze Criteo data!")
print("\n⚠️ NOTE: Full dataset is 13.9M rows. Using sample_frac=0.01 (1%) for faster execution.")
print("   This gives ~140K rows, which is still large enough for all techniques.")

## Part 1: Load Data (1% Sample)

For development/learning, we'll use 1% of the data (~140K rows).

For production analysis, use a larger sample or the full dataset.

In [None]:
# Load 1% sample (adjust sample_frac as needed)
df = loaders.load_criteo_uplift(sample_frac=0.01)

print(f"Loaded {len(df):,} observations (1% sample)")
print(f"\nColumns: {df.columns.tolist()}")
display(df.head())

print(f"\nTreatment split:")
print(df['treatment'].value_counts())

print(f"\nOutcome rates:")
print(f"Visit rate: {df['visit'].mean():.4%}")
print(f"Conversion rate: {df['conversion'].mean():.4%}")

## Part 2: Run Complete Analysis

**⚠️ This may take 2-5 minutes** due to ML model training (CUPAC, X-Learner).

The cell below passes `verbose=False` to keep the output short; switch it to `verbose=True` if you want progress shown during execution.

In [None]:
# Run with 1% sample (faster execution)
results = run_criteo_analysis(sample_frac=0.01, verbose=False)

print(f"✅ Analysis complete!")
print(f"Available results: {list(results.keys())}")

## Part 3: CUPAC vs CUPED Comparison

### 📚 The Difference

**CUPED (Linear)**:
- Uses 1 covariate
- Linear adjustment: `Y_adj = Y - θ * (X - E[X])`
- Typical: 20-40% variance reduction

**CUPAC (ML-Enhanced)**:
- Uses multiple features (11 in Criteo)
- ML model (GradientBoosting): `Y_adj = Y - Y_pred`
- Captures non-linear relationships
- Typical: 30-60% variance reduction
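
To make the two adjustments concrete, here is a minimal sketch on synthetic data. It is not the pipeline's implementation: the data-generating process, the `cuped_adjust`, `simulate`, and `design` helpers, and the polynomial least-squares model (a numpy-only stand-in for GradientBoosting) are all assumptions for illustration. The CUPAC-style variant shown here plugs the model's prediction into the CUPED formula as the covariate, with the model fitted on a separate pre-experiment sample.

```python
import numpy as np

def cuped_adjust(y, x):
    """Linear CUPED adjustment: y - theta * (x - mean(x))."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

def simulate(rng, n):
    """Hypothetical data: the outcome depends on x1 linearly and on x2 non-linearly."""
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = x1 + x2**2 + rng.normal(size=n)
    return x1, x2, y

def design(x1, x2):
    """Feature matrix for the stand-in prediction model."""
    return np.column_stack([np.ones(len(x1)), x1, x2, x2**2])

rng = np.random.default_rng(0)
x1_pre, x2_pre, y_pre = simulate(rng, 20_000)  # pre-experiment sample
x1, x2, y = simulate(rng, 20_000)              # experiment sample

# CUPED: a single covariate, linear adjustment
y_cuped = cuped_adjust(y, x1)

# CUPAC-style: fit a model on pre-experiment data, then use its
# prediction on the experiment sample as the covariate.
coef, *_ = np.linalg.lstsq(design(x1_pre, x2_pre), y_pre, rcond=None)
y_pred = design(x1, x2) @ coef
y_cupac = cuped_adjust(y, y_pred)

var_red_cuped = 1 - y_cuped.var() / y.var()
var_red_cupac = 1 - y_cupac.var() / y.var()
print(f"CUPED variance reduction: {var_red_cuped:.1%}")
print(f"CUPAC-style variance reduction: {var_red_cupac:.1%}")
```

On this synthetic outcome the single-covariate adjustment can only remove the variance that `x1` explains linearly, while the prediction-based covariate also captures the squared term in `x2`.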

In [None]:
cupac = results.get('cupac', {})

if cupac:
    print("=" * 70)
    print("CUPAC RESULTS")
    print("=" * 70)
    
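    print(f"\n📊 Model Performance:")
    print(f"   Model type: {cupac.get('model', 'N/A')}")
    print(f"   Features used: {cupac.get('n_features', 11)}")
    # Formatting the 'N/A' fallback with :.4f would raise, so handle a missing value explicitly
    model_r2 = cupac.get('model_r2')
    print(f"   Model R²: {model_r2:.4f}" if model_r2 is not None else "   Model R²: N/A")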
    
    print(f"\n📊 Variance Reduction:")
    print(f"   Variance reduction: {cupac.get('var_reduction', 0):.2%}")
    print(f"   SE reduction: {cupac.get('se_reduction', 0):.2%}")
    
    var_red = cupac.get('var_reduction', 0)
    sample_equiv = 1 / (1 - var_red) if var_red < 1 else 1
    
    print(f"\n💡 PRACTICAL IMPACT:")
    print(f"   - Equivalent to {sample_equiv:.1f}x more users")
    print(f"   - Or run experiment {(1-var_red):.0%} as long")
    print(f"   - Example: 4-week test → {4*(1-var_red):.1f} weeks with CUPAC")
    
    print(f"\n🏢 When CUPAC Beats CUPED:")
    print(f"   ✅ Multiple features available (11+ features)")
    print(f"   ✅ Non-linear relationships (power users behave differently)")
    print(f"   ✅ Large dataset (n > 10K for ML training)")
    print(f"   ✅ Worth the complexity (30-60% vs 20-40% reduction)")

## Part 4: Heterogeneous Treatment Effects (X-Learner)

### 📚 Why HTE Matters

**Average Treatment Effect (ATE)**: Overall impact across all users

**Problem**: Not everyone benefits equally!
- Some users: +20% conversion (high uplift)
- Some users: -5% conversion (negative effect)
- Average: +5% (misleading!)

**Solution**: Estimate treatment effects conditional on user features (CATE)
- **CATE** = Conditional Average Treatment Effect: the average effect among users who share the same features
- Enables targeting: Focus on high-uplift users
- Example: "Don't send promo to users who'd convert anyway"
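
The X-Learner can be sketched in three stages: fit an outcome model per arm, impute each user's effect using the other arm's model, then blend two effect models with the propensity score. The version below is an illustration under stated assumptions, not the pipeline's code: it uses plain least squares as the base learner (the real analysis may use a different model), a constant propensity, and synthetic data where the true effect is known. `fit_linear` and `x_learner_cate` are hypothetical helper names.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def x_learner_cate(X, t, y, propensity):
    """X-Learner CATE estimates with linear base learners."""
    X0, y0 = X[t == 0], y[t == 0]
    X1, y1 = X[t == 1], y[t == 1]
    # Stage 1: one outcome model per arm
    mu0 = fit_linear(X0, y0)
    mu1 = fit_linear(X1, y1)
    # Stage 2: imputed effects, each using the other arm's model
    d1 = y1 - mu0(X1)   # treated: observed minus predicted control outcome
    d0 = mu1(X0) - y0   # control: predicted treated outcome minus observed
    tau1 = fit_linear(X1, d1)
    tau0 = fit_linear(X0, d0)
    # Stage 3: propensity-weighted blend of the two effect models
    return propensity * tau0(X) + (1 - propensity) * tau1(X)

# Synthetic check: the true effect depends on the first feature only
rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 2))
t = rng.binomial(1, 0.5, size=n)
true_effect = 0.5 * X[:, 0]
y = X[:, 1] + t * true_effect + rng.normal(size=n)

cate = x_learner_cate(X, t, y, propensity=t.mean())
corr = np.corrcoef(cate, true_effect)[0, 1]
print(f"Correlation between estimated and true effect: {corr:.3f}")
```

Because the synthetic effect is linear in the features, linear base learners are enough here; with non-linear effects a more flexible learner would be needed in each stage.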

In [None]:
hte = results.get('hte', {})

if hte:
    print("=" * 70)
    print("HETEROGENEOUS TREATMENT EFFECTS (X-Learner)")
    print("=" * 70)
    
    print(f"\n📊 CATE Distribution:")
    cate_values = hte.get('cate_estimates', [])
    if len(cate_values) > 0:
        print(f"   Mean CATE: {np.mean(cate_values):.4f}")
        print(f"   Std CATE: {np.std(cate_values):.4f}")
        print(f"   Min CATE: {np.min(cate_values):.4f} (lowest estimated effect)")
        print(f"   Max CATE: {np.max(cate_values):.4f} (highest estimated effect)")
        
        # Analyze subgroups
        top_10pct = np.percentile(cate_values, 90)
        bottom_10pct = np.percentile(cate_values, 10)
        
        print(f"\n📊 Subgroup Analysis:")
        print(f"   Top 10% CATE threshold: {top_10pct:.4f}")
        print(f"   Bottom 10% CATE threshold: {bottom_10pct:.4f}")
        print(f"   Spread: {top_10pct - bottom_10pct:.4f}")
        
        print(f"\n💡 TARGETING DECISION:")
        print(f"   - Target top 10%: estimated lift of at least {top_10pct:.2%} per user")
        print(f"   - Deprioritize bottom 10%: estimated lift of at most {bottom_10pct:.2%}")
        print(f"   - Or use continuous score for dynamic targeting")
        
        print(f"\n🏢 Industry Applications:")
        print(f"   - Netflix: Personalize which shows to promote per user")
        print(f"   - Uber: Target promos to users with high incremental value")
        print(f"   - E-commerce: Personalize discounts based on uplift")
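
A percentile threshold is the smallest CATE inside the targeted group, so it understates the lift you would expect from that group on average. A minimal sketch of the difference, using synthetic CATE values (the distribution and its parameters are made up for illustration and say nothing about the Criteo results):

```python
import numpy as np

# Synthetic CATE estimates, for illustration only
rng = np.random.default_rng(1)
cate = rng.normal(loc=0.01, scale=0.02, size=50_000)

threshold = np.percentile(cate, 90)   # cutoff for the top 10%
top = cate[cate >= threshold]         # users at or above the cutoff
mean_top = top.mean()                 # expected per-user lift in that group
share_negative = (cate < 0).mean()    # share of users with a negative estimate

print(f"Top-10% cutoff:      {threshold:.4f}")
print(f"Mean CATE in top 10%: {mean_top:.4f}")
print(f"Share with CATE < 0:  {share_negative:.1%}")
```

The same pattern applies to the `cate_values` array from the results: replace the synthetic `cate` with it to compare the cutoff with the group mean.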

## Part 5: Sequential Testing (Early Stopping)

### 📚 The Benefit

**Traditional Approach**: Run experiment for fixed duration (e.g., 4 weeks)

**Sequential Testing**: Check results at interim points, stop early if effect is clear

**Average Savings**: 30-50% reduction in experiment duration

**Key**: Properly control Type I error via alpha spending function (O'Brien-Fleming)
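
As a rough illustration of how an alpha spending function allocates the error budget, the sketch below evaluates the Lan-DeMets O'Brien-Fleming-type spending function, `alpha(t) = 2 * (1 - Φ(z_{α/2} / sqrt(t)))`, at five equally spaced looks. Equal spacing and the helper name `obf_alpha_spent` are assumptions; the pipeline's own boundary calculation may differ. Note that these are cumulative amounts of alpha spent, not the nominal per-look p-value thresholds, which require numerical integration over the joint distribution of the test statistics.

```python
from statistics import NormalDist

def obf_alpha_spent(info_fractions, alpha=0.05):
    """Cumulative two-sided alpha spent (Lan-DeMets O'Brien-Fleming-type)."""
    norm = NormalDist()
    z = norm.inv_cdf(1 - alpha / 2)
    return [2 * (1 - norm.cdf(z / t ** 0.5)) for t in info_fractions]

# Five equally spaced looks: 20%, 40%, ..., 100% of the planned sample
looks = [k / 5 for k in range(1, 6)]
cumulative = obf_alpha_spent(looks)

# Alpha newly spent at each look is the increment in the cumulative amount
incremental = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for t, cum, inc in zip(looks, cumulative, incremental):
    print(f"info fraction {t:.1f}: cumulative alpha {cum:.6f}, spent at this look {inc:.6f}")
```

The function spends almost nothing at early looks and reaches the full alpha only at the last one, which is why O'Brien-Fleming designs stop early only for very strong effects.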

In [None]:
seq = results.get('sequential', {})

if seq:
    print("=" * 70)
    print("SEQUENTIAL TESTING RESULTS")
    print("=" * 70)
    
    print(f"\n📊 Configuration:")
    print(f"   Number of looks: {seq.get('n_looks', 5)}")
    print(f"   Overall alpha: 0.05")
    print(f"   Method: O'Brien-Fleming")
    
    boundaries = seq.get('boundaries', [])
    if len(boundaries) > 0:  # works for lists and NumPy arrays alike
        print(f"\n📊 Alpha Spending at Each Look:")
        for i, bound in enumerate(boundaries, 1):
            print(f"   Look {i}: alpha = {bound:.6f}")
    
    print(f"\n💡 HOW TO USE:")
    print(f"   1. Pre-commit to analysis plan (e.g., check every week for 5 weeks)")
    print(f"   2. At each look, compare p-value to boundary")
    print(f"   3. If p < boundary → STOP, effect is significant")
    print(f"   4. If p ≥ boundary → CONTINUE to next look")
    print(f"   5. At final look, use the final boundary (slightly below 0.05 under O'Brien-Fleming)")
    
    print(f"\n🏢 Industry Impact:")
    print(f"   - Average 30-50% shorter experiments")
    print(f"   - Saves money and increases velocity")
    print(f"   - Trade-off: Slight power loss if run to end")

## ✅ Key Takeaways

1. **CUPAC > CUPED** when you have rich features (11+) and large samples (n > 10K)
2. **HTE enables targeting** - "Who benefits most?" is the modern question
3. **X-Learner estimates feature-conditional effects (CATE)** - Enables personalization decisions
4. **Sequential testing reduces duration** - Stop early when possible (30-50% faster)
5. **Large-scale data requires sampling** - Use sample_frac for development

## 🔬 Try These Experiments

1. **Different sample sizes**: Try 0.001, 0.01, 0.10 - how do results change?
2. **ML model comparison**: CUPAC with RandomForest vs GradientBoosting
3. **Subgroup targeting**: Calculate ROI of targeting top 10% vs top 50%
4. **Sequential boundaries**: Try different # of looks (3, 5, 10)
5. **Feature importance**: Which features matter most for CATE?

## 📚 Further Reading

**Papers**:
- Athey & Imbens (2016): "Recursive partitioning for heterogeneous causal effects"
- Künzel et al. (2019): "Metalearners for estimating heterogeneous treatment effects"

**Industry Blogs**:
- [DoorDash: CUPAC](https://careersatdoordash.com/blog/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/)
- [Uber: Causal ML](https://www.uber.com/blog/causal-inference-at-uber/)
- [Netflix: Experimentation at Scale](https://netflixtechblog.com/experimentation-at-netflix-6ab9a47e7caa)