# Cookie Cats Retention Analysis: Multiple Metrics & Product Trade-offs

**Scenario**: Mobile game testing gate placement (level 30 vs level 40)

**Business Question**: Does moving the gate improve retention?

**Metrics**: 1-day retention, 7-day retention, rounds played per player

**Dataset**: 90,189 players

## 📚 What You'll Learn

1. ✅ Testing multiple outcomes simultaneously (1d + 7d retention)
2. ✅ Multiple testing correction (Bonferroni vs Benjamini-Hochberg)
3. ✅ Ratio metrics with delta method
4. ✅ Product trade-off decisions (what if metrics conflict?)

In [None]:
# Setup
import pandas as pd
import numpy as np
from ab_testing.data import loaders
from ab_testing.pipelines.cookie_cats_pipeline import run_cookie_cats_analysis

print("✅ Ready to analyze Cookie Cats data!")

## Part 1: Load Data

In [None]:
df = loaders.load_cookie_cats()
print(f"Loaded {len(df):,} players")
display(df.head())

print(f"\nGroup split:")
print(df['version'].value_counts())

print(f"\nRetention rates by group:")
retention = df.groupby('version')[['retention_1', 'retention_7']].mean()
display(retention)

## Part 2: Run Complete Analysis

This pipeline tests BOTH retention metrics and applies multiple testing correction.
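
The internals of `run_cookie_cats_analysis` are not shown in this notebook. As a rough sketch of what a per-metric retention test might look like, the following pooled two-proportion z-test uses only the standard library. The helper name `two_proportion_ztest` and the counts in the usage lines are made up for illustration; they are not the pipeline's API or the dataset's values.

```python
import math

def two_proportion_ztest(x_c, n_c, x_t, n_t):
    """Pooled two-sided z-test for a difference in proportions (hypothetical helper)."""
    p_c, p_t = x_c / n_c, x_t / n_t
    # Pooled proportion under the null hypothesis of no difference
    p_pool = (x_c + x_t) / (n_c + n_t)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "p_control": p_c,
        "p_treatment": p_t,
        "relative_lift": (p_t - p_c) / p_c,
        "z": z,
        "p_value": p_value,
    }

# Illustrative counts only (not the Cookie Cats numbers)
res = two_proportion_ztest(x_c=450, n_c=1000, x_t=400, n_t=1000)
print(res)
```

The returned keys mirror the fields printed later in this notebook, but that correspondence is an assumption about the pipeline, not something the source confirms.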

In [None]:
results = run_cookie_cats_analysis(sample_frac=1.0, verbose=False)
print("✅ Analysis complete!")
print(f"Available results: {list(results.keys())}")

## Part 3: Multiple Testing Problem

### 📚 The Challenge

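When testing **k independent metrics**, each at alpha=0.05, the chance of at least one false positive is 1 - 0.95^k:
- **Single test**: 5% false positive rate
- **2 tests**: ~10% false positive rate (1 - 0.95²)
- **5 tests**: ~23% false positive rate
- **10 tests**: ~40% false positive rate!

**Solution**: Apply a multiple testing correction. Bonferroni controls the family-wise error rate (FWER); Benjamini-Hochberg controls the false discovery rate (FDR).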
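
The percentages above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the k tests are independent; the function name `familywise_error_rate` is introduced here for illustration only.

```python
alpha = 0.05

def familywise_error_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 5, 10):
    fwer = familywise_error_rate(k, alpha)
    # Bonferroni keeps the family-wise rate at or below alpha by testing each metric at alpha / k
    print(f"k={k:>2}: FWER = {fwer:.1%}, Bonferroni per-test alpha = {alpha / k:.4f}")
```

Metrics such as 1-day and 7-day retention are measured on the same players and are correlated, so the true inflation is typically somewhat lower than this formula suggests. The Bonferroni bound does not rely on independence.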

In [None]:
print("=" * 70)
print("1-DAY RETENTION TEST")
print("=" * 70)

ret1d = results['retention_1d']
print(f"Control: {ret1d['p_control']:.4%}")
print(f"Treatment: {ret1d['p_treatment']:.4%}")
print(f"Lift: {ret1d['relative_lift']:.2%}")
print(f"P-value: {ret1d['p_value']:.6f}")
print(f"Significant: {ret1d['significant']}")

print("\n" + "=" * 70)
print("7-DAY RETENTION TEST")
print("=" * 70)

ret7d = results['retention_7d']
print(f"Control: {ret7d['p_control']:.4%}")
print(f"Treatment: {ret7d['p_treatment']:.4%}")
print(f"Lift: {ret7d['relative_lift']:.2%}")
print(f"P-value: {ret7d['p_value']:.6f}")
print(f"Significant: {ret7d['significant']}")

### Multiple Testing Correction Comparison
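
To make the difference between the two corrections concrete, here is a minimal NumPy sketch of both decision rules. The function names and the p-values are made up for illustration; this is not the pipeline's implementation.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each p-value below alpha / m (controls FWER)."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls FDR)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices that sort p ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * i / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Largest rank whose p-value clears its threshold; reject it and all smaller p-values
        cutoff = np.max(np.nonzero(below)[0])
        reject[order[: cutoff + 1]] = True
    return reject

p_vals = [0.030, 0.040]  # made-up p-values, not the notebook's results
print("Bonferroni:", bonferroni_reject(p_vals))
print("BH-FDR:    ", benjamini_hochberg_reject(p_vals))
```

With these made-up p-values, Bonferroni rejects neither hypothesis (both exceed 0.025), while Benjamini-Hochberg rejects both, because the larger p-value clears its threshold of 0.05 and the step-up rule then carries the smaller one with it.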

In [None]:
mt = results.get('multiple_testing', {})

if mt:
    comparison = pd.DataFrame({
        'Metric': ['1-day retention', '7-day retention'],
        'Original P-value': [
            f"{ret1d['p_value']:.6f}",
            f"{ret7d['p_value']:.6f}"
        ],
        'Bonferroni Threshold': [
            f"{0.05/2:.4f}",
            f"{0.05/2:.4f}"
        ],
        'Bonferroni Significant': [
            '✅' if ret1d['p_value'] < 0.025 else '❌',
            '✅' if ret7d['p_value'] < 0.025 else '❌'
        ],
        'BH-FDR Significant': [
            '✅' if mt.get('bh_significant', [False, False])[0] else '❌',
            '✅' if mt.get('bh_significant', [False, False])[1] else '❌'
        ]
    })
    
    display(comparison)
    
    print("\n💡 KEY INSIGHT:")
    print(f"   - Bonferroni: More conservative (higher bar for significance)")
    print(f"   - BH-FDR: More power (better at detecting real effects)")
    print(f"   - For k=2 metrics, Bonferroni is reasonable")
    print(f"   - For k>5 metrics, consider BH-FDR")

## Part 4: Decision Scenarios

What if metrics conflict?

In [None]:
print("=" * 70)
print("DECISION SCENARIOS")
print("=" * 70)

scenarios = pd.DataFrame({
    'Scenario': [
        'Both improve',
        'Both worsen',
        '1d improves, 7d worsens',
        '1d worsens, 7d improves',
        'Neither significant'
    ],
    'Decision': [
        '✅ SHIP',
        '❌ ABANDON',
        '🤔 DEPENDS (short-term gain, long-term loss)',
        '✅ SHIP (long-term matters more)',
        '⏸️ HOLD (extend test or abandon)'
    ],
    'Rationale': [
        'Clear winner across all timeframes',
        'Clear loser - damages retention',
        'Product dilemma: prioritize long-term',
        '7-day retention > 1-day for games',
        'Insufficient evidence to decide'
    ]
})

display(scenarios)

# Actual results
print("\n🎯 YOUR ACTUAL SCENARIO:")
ret1d_sig = ret1d['significant']
ret7d_sig = ret7d['significant']
ret1d_pos = ret1d['relative_lift'] > 0
ret7d_pos = ret7d['relative_lift'] > 0

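if ret1d_sig and ret1d_pos and ret7d_sig and ret7d_pos:
    print("   ✅ SHIP - Both metrics improved")
elif ret1d_sig and not ret1d_pos and ret7d_sig and not ret7d_pos:
    print("   ❌ ABANDON - Both metrics worsened")
elif ret1d_sig and ret1d_pos and ret7d_sig and not ret7d_pos:
    print("   🤔 TRADE-OFF - 1d improved but 7d worsened (prioritize long-term!)")
elif ret1d_sig and not ret1d_pos and ret7d_sig and ret7d_pos:
    print("   ✅ SHIP - 1d worsened but 7d improved (long-term matters more)")
elif not ret1d_sig and not ret7d_sig:
    print("   ⏸️ HOLD - Neither metric is significant (extend test or abandon)")
else:
    # Exactly one metric is significant; the scenario table has no dedicated row for this
    print("   Only one metric is significant - see scenario table above and use judgment")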

## ✅ Key Takeaways

1. **Always correct for multiple testing** - Testing k metrics inflates the false positive rate
2. **Bonferroni for few metrics** (k ≤ 5), **BH-FDR for many** (k > 5)
3. **Product decisions are complex** - Metrics often conflict, and resolving them requires judgment
4. **Prioritize long-term metrics** - For games, 7-day retention matters more than 1-day retention
5. **Ratio metrics need the delta method** - The numerator and denominator are both random and usually correlated, so naive CIs misstate the variance
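
The delta method in the last takeaway can be sketched as follows. This is a minimal illustration for a ratio of two per-user means, using the first-order approximation Var(R) ≈ (Var(Y) - 2·R·Cov(Y, X) + R²·Var(X)) / (n·mean(X)²). The function name `ratio_ci_delta` and the synthetic "rounds per session" data are assumptions introduced here; the Cookie Cats dataset itself is not used.

```python
import numpy as np

def ratio_ci_delta(num, den, z=1.96):
    """Approximate CI for mean(num) / mean(den) via the delta method.

    num and den hold one value per randomized user (hypothetical helper).
    """
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    n = num.size
    mean_den = den.mean()
    ratio = num.mean() / mean_den
    var_num = num.var(ddof=1)
    var_den = den.var(ddof=1)
    cov = np.cov(num, den, ddof=1)[0, 1]
    # First-order Taylor expansion of the ratio around the two means
    var_ratio = (var_num - 2 * ratio * cov + ratio**2 * var_den) / (n * mean_den**2)
    se = np.sqrt(var_ratio)
    return ratio, (ratio - z * se, ratio + z * se)

# Synthetic per-user data: sessions played and total rounds (illustrative only)
rng = np.random.default_rng(0)
sessions = rng.poisson(5, size=1000) + 1
rounds = sessions * 3 + rng.poisson(2, size=1000)
ratio, (lo, hi) = ratio_ci_delta(rounds, sessions)
print(f"rounds per session: {ratio:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

When the denominator is constant (for example, one row per user with `den` equal to 1), the formula reduces to the usual standard error of a mean, which is a quick sanity check on the implementation.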

## 📚 Next Steps

- Try the Criteo notebook (ML-enhanced techniques)
- Read about sequential testing (early stopping)
- Study CUPAC (ML-enhanced variance reduction)