# A/B Testing Masterclass: The Complete Experimentation Lifecycle
## Cookie Cats Mobile Game Analysis

---

## 🎯 What This Notebook Is Really About

**In data science interviews, A/B testing is rarely just about p-values and confidence intervals.**

It's a proxy for something much bigger. Interviewers use A/B testing questions to understand:
- **How you think** through problems from start to finish
- **How you deal with ambiguity** when there's no single correct answer
- **How you turn messy data into decisions** that actually matter to the business

This is where many strong candidates struggle, not because they don't know the math (most do), but because they lack a **clear mental model for the full A/B testing lifecycle**. They jump straight to analysis when they should be asking: *What hypothesis are we testing? Why these metrics? What trade-offs are we accepting?*

### The A/B Testing Lifecycle

This notebook follows the complete experimentation lifecycle:

```
┌─────────────────────────────────────────────────────────────────────┐
│  1. FRAME THE QUESTION                                              │
│     └── What business problem are we solving?                       │
│     └── What's our hypothesis and why?                              │
│                                                                     │
│  2. CHOOSE METRICS                                                  │
│     └── Primary (what we're optimizing)                             │
│     └── Guardrails (what we can't harm)                             │
│     └── Trade-offs between them                                     │
│                                                                     │
│  3. VALIDATE THE EXPERIMENT                                         │
│     └── SRM check (did randomization work?)                         │
│     └── Data quality validation                                     │
│                                                                     │
│  4. ANALYZE RESULTS                                                 │
│     └── Statistical tests (the math part)                           │
│     └── Practical significance (does it matter?)                    │
│                                                                     │
│  5. INTERPRET & DECIDE                                              │
│     └── What does this mean for the business?                       │
│     └── Ship / Hold / Abandon decision                              │
│     └── What are we uncertain about?                                │
└─────────────────────────────────────────────────────────────────────┘
```

**Foundations first, sophistication later.** This notebook covers the fundamentals that every experimentation analysis should include. Later notebooks (Criteo, Marketing) build on this foundation with advanced techniques.

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Frame a hypothesis** properly (not just state it)
2. **Choose metrics** and articulate trade-offs between them
3. **Validate randomization** using two-stage SRM gating
4. **Interpret results** with multiple testing correction
5. **Make decisions** using the Ship/Hold/Abandon framework
6. **Communicate findings** in business terms, not just statistics

---

## Phase 1: Frame the Question

### The Business Context

**Cookie Cats** is a popular mobile puzzle game. Like many free-to-play games, it uses "gates": points where players must wait or make an in-app purchase to continue.

Gates serve two purposes:
1. **Monetization**: Players can pay to skip the wait
2. **Engagement**: Breaks prevent burnout and give players a reason to return

### 💡 Interview Insight: How to Frame a Hypothesis

In interviews, many candidates state their hypothesis as a simple prediction:

> *"Moving the gate will increase retention."*

Strong candidates frame it with **reasoning** and **risks**:

> *"Moving the gate from level 30 to level 40 might improve early retention because players get more uninterrupted gameplay before hitting the first paywall, which reduces frustration during the critical onboarding period. However, there's a risk: players who reach level 40 before any gate may become so invested that the sudden stop feels more jarring, causing them to quit entirely rather than wait or pay."*

The second framing shows you've thought about:
- **The mechanism** (why you think it will work)
- **The counterfactual** (what could go wrong)
- **The trade-off** (early retention vs. long-term engagement)

---

### Our Experimental Design

| Aspect | Control (gate_30) | Treatment (gate_40) |
|--------|-------------------|---------------------|
| Gate Position | Level 30 | Level 40 |
| Allocation | 50% | 50% |
| Sample Size | ~45,000 | ~45,000 |

**Hypothesis**: Moving the gate later will improve 1-day retention by reducing early frustration.

**Risk**: Players may quit at higher levels when they finally hit the gate.

**Key Insight**: Notice we're running a **50/50 randomized controlled trial (RCT)**; this is critical for SRM validation later.

---

## Phase 2: Choose Metrics (and Understand Trade-offs)

### 💡 Interview Insight: Why Metric Selection Matters

One of the most common interview questions is: *"What metrics would you use?"*

Weak answers list metrics. Strong answers explain the **hierarchy** and **trade-offs**:

#### Our Metric Framework

| Type | Metric | Why This Metric | Trade-off |
|------|--------|-----------------|------------|
| **Primary** | 1-Day Retention | Most sensitive to onboarding changes; captures immediate impact | May miss long-term effects |
| **Guardrail** | 7-Day Retention | Captures whether players stick around | Slower to move; more variance |
| **Guardrail** | Engagement (rounds/player) | Ensures we're not gaming retention at the cost of depth | Can increase even if quality decreases |

### Why These Specific Metrics?

**1-Day Retention as Primary:**
- Moves faster than 7-day (shorter feedback loop)
- Most sensitive to early game experience
- Strong predictor of long-term value

**7-Day Retention as Guardrail (not secondary):**
- The gate change could improve day-1 but hurt week-1
- We need to ensure we're not just delaying churn

**Engagement as Guardrail:**
- A player who returns but plays less is a warning sign
- Prevents Goodhart's Law (optimizing the metric, not the goal)

### 💡 Trade-off Discussion

**What we're optimizing**: Short-term retention (1-day)

**What we're protecting**: Long-term retention (7-day) and engagement depth

**The implicit bet**: We believe improving early retention will have positive downstream effects, but we're setting guardrails to detect if we're wrong.

---

## Setup and Data Loading

In [1]:
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A/B Testing modules
from ab_testing.data import loaders
from ab_testing.core import randomization, frequentist
from ab_testing.advanced import multiple_testing, ratio_metrics
from ab_testing.diagnostics import guardrails

# Set up plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("✓ Modules loaded successfully")

✓ Modules loaded successfully


In [2]:
# Load the Cookie Cats dataset
df = loaders.load_cookie_cats(sample_frac=1.0)

print(f"Dataset loaded: {len(df):,} players")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

Loading Cookie Cats dataset from data\raw\cookie_cats\cookie_cats.csv...
Loaded Cookie Cats dataset: 90,189 rows, 6 columns
  7-day retention (gate_30): 19.02%
  7-day retention (gate_40): 18.20%
Dataset loaded: 90,189 players

Columns: ['userid', 'version', 'sum_gamerounds', 'retention_1', 'retention_7', 'treatment']

First few rows:


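|   | userid | version | sum_gamerounds | retention_1 | retention_7 | treatment |
|---|--------|---------|----------------|-------------|-------------|-----------|
| 0 | 116 | gate_30 | 3 | False | False | 0 |
| 1 | 337 | gate_30 | 38 | True | False | 0 |
| 2 | 377 | gate_40 | 165 | True | False | 1 |
| 3 | 483 | gate_40 | 1 | False | False | 1 |
| 4 | 488 | gate_40 | 179 | True | True | 1 |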


In [3]:
# Understand the data
print("Dataset Summary")
print("=" * 50)
print(f"\nGroup distribution:")
print(df['version'].value_counts())
print(f"\n1-Day Retention by group:")
print(df.groupby('version')['retention_1'].mean())
print(f"\n7-Day Retention by group:")
print(df.groupby('version')['retention_7'].mean())
print(f"\nGame rounds statistics:")
print(df.groupby('version')['sum_gamerounds'].describe())

Dataset Summary

Group distribution:
version
gate_40    45489
gate_30    44700
Name: count, dtype: int64

1-Day Retention by group:
version
gate_30    0.448188
gate_40    0.442283
Name: retention_1, dtype: float64

7-Day Retention by group:
version
gate_30    0.190201
gate_40    0.182000
Name: retention_7, dtype: float64

Game rounds statistics:
           count       mean         std  min  25%   50%   75%      max
version                                                               
gate_30  44700.0  52.456264  256.716423  0.0  5.0  17.0  50.0  49854.0
gate_40  45489.0  51.298776  103.294416  0.0  5.0  16.0  52.0   2640.0


---

## Phase 3: Validate the Experiment

### Step 1: Sample Ratio Mismatch (SRM) Check

### 💡 Interview Insight: Why Validation Comes Before Analysis

Many candidates jump straight to analyzing results. Experienced practitioners **always validate first**.

> *"Before we look at any treatment effects, we need to verify the experiment ran correctly. If randomization failed, all downstream analysis is meaningless."*

This is a lifecycle principle: **garbage in, garbage out**.

### What is SRM?

Sample Ratio Mismatch occurs when actual group sizes don't match expected allocation. For a 50/50 split, we expect roughly equal groups (allowing for random variation).

**Why SRM is Critical:**

If groups are imbalanced beyond what random chance would produce, it indicates:
- Bug in randomization code
- Tracking/logging issues (events lost for one group)
- Group-specific crashes (treatment causes technical problems)
- Bot traffic concentrated in one group

**Industry Practice:**
- Microsoft: Blocks all analysis if SRM detected
- Netflix: Triggers immediate engineering alerts
- Booking.com: Uses stricter alpha (0.001)

### Two-Stage SRM Gating

With large samples (90K+ users), even tiny deviations become statistically significant. We use a two-stage approach:

1. **Statistical Significance**: p-value < 0.01 (chi-square test)
2. **Practical Significance**: Deviation > 1 percentage point from expected

**Only if BOTH conditions are met do we halt the analysis.**

This prevents false alarms from large-sample statistical sensitivity.
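
The two-stage gate can be sketched with the standard library alone. This is a minimal illustration, not the implementation behind `randomization.srm_check`: the function name `two_stage_srm`, its parameters, and its return keys are hypothetical. For a two-group split the chi-square test has one degree of freedom, so its p-value can be written as `erfc(sqrt(chi2 / 2))`. The group sizes are the ones printed in the dataset summary.

```python
import math


def two_stage_srm(n_control, n_treatment, expected_control=0.5,
                  alpha=0.01, practical_threshold=0.01):
    """Hypothetical two-stage SRM gate for a two-arm experiment."""
    total = n_control + n_treatment
    exp_control = total * expected_control
    exp_treatment = total - exp_control

    # Stage 1: chi-square goodness-of-fit statistic (1 degree of freedom).
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Stage 2: absolute deviation of the observed control share from the plan.
    deviation = abs(n_control / total - expected_control)

    statistically_significant = p_value < alpha
    practically_significant = deviation > practical_threshold
    return {
        "chi2": chi2,
        "p_value": p_value,
        "deviation": deviation,
        # Halt only when BOTH stages fire; warn when only the statistical one does.
        "halt": statistically_significant and practically_significant,
        "warn": statistically_significant and not practically_significant,
    }


result = two_stage_srm(44_700, 45_489)
print(result)
```

Keeping the two stages as separate booleans makes it easy to log a warning without blocking the analysis.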

---

In [4]:
# Calculate actual group proportions
group_counts = df['version'].value_counts()
total = len(df)

control_count = group_counts.get('gate_30', 0)
treatment_count = group_counts.get('gate_40', 0)

actual_control_ratio = control_count / total
actual_treatment_ratio = treatment_count / total

print("Group Distribution Analysis")
print("=" * 50)
print(f"Total users: {total:,}")
print(f"\nControl (gate_30): {control_count:,} ({actual_control_ratio:.2%})")
print(f"Treatment (gate_40): {treatment_count:,} ({actual_treatment_ratio:.2%})")
print(f"\nExpected: 50% / 50%")
print(f"Deviation from expected: {abs(actual_control_ratio - 0.5):.4%}")

Group Distribution Analysis
Total users: 90,189

Control (gate_30): 44,700 (49.56%)
Treatment (gate_40): 45,489 (50.44%)

Expected: 50% / 50%
Deviation from expected: 0.4374%


In [5]:
# Run formal SRM check using two-stage gating
srm_result = randomization.srm_check(
    n_control=control_count,
    n_treatment=treatment_count,
    expected_ratio=[0.5, 0.5],  # Expected 50/50 split as list
    alpha=0.01
)

print("SRM Check Results")
print("=" * 50)
print(f"\nStatistical test:")
print(f"  Chi-square statistic: {srm_result['chi2_statistic']:.4f}")
print(f"  P-value: {srm_result['p_value']:.6f}")
print(f"  Statistically significant: {srm_result['srm_detected']}")

print(f"\nPractical significance:")
print(f"  Actual ratio (control): {srm_result['ratio_control']:.4f}")
print(f"  Expected ratio (control): 0.5")
print(f"  Deviation: {srm_result['max_pp_deviation']:.4f}")
print(f"  Severe (>1%): {srm_result['srm_severe']}")

SRM Check Results

Statistical test:
  Chi-square statistic: 6.9024
  P-value: 0.008608
  Statistically significant: True

Practical significance:
  Actual ratio (control): 0.4956
  Expected ratio (control): 0.5
  Deviation: 0.0044
  Severe (>1%): False


In [6]:
# Apply two-stage gating logic
IS_RCT = True  # This is a designed 50/50 experiment
PRACTICAL_THRESHOLD = 0.01  # 1 percentage point

# Use the correct keys from srm_result
# max_pp_deviation is the calculated deviation from expected ratio
deviation = srm_result['max_pp_deviation']
statistically_significant = srm_result['srm_detected']
practically_significant = srm_result['practical_significant']

print("\nTwo-Stage SRM Gating")
print("=" * 50)
print(f"\n[1] Statistical Significance: {'⚠️ YES' if statistically_significant else '✓ NO'}")
print(f"    (p-value = {srm_result['p_value']:.6f}, threshold = 0.01)")
print(f"\n[2] Practical Significance: {'⚠️ YES' if practically_significant else '✓ NO'}")
print(f"    (deviation = {deviation:.4f}, threshold = {PRACTICAL_THRESHOLD})")

if IS_RCT and srm_result['srm_severe']:
    print("\n" + "=" * 50)
    print("🚫 HARD GATE: Analysis should STOP here.")
    print("   Investigate randomization before proceeding.")
    print("=" * 50)
elif srm_result['srm_warning']:
    print("\n" + "=" * 50)
    print("⚠️  WARNING: Statistical but not practical significance.")
    print("   This is common with large samples (90K+).")
    print("   Proceeding with caution.")
    print("=" * 50)
else:
    print("\n" + "=" * 50)
    print("✓ SRM CHECK PASSED")
    print("  Randomization appears to have worked correctly.")
    print("  Proceeding to treatment effect analysis.")
    print("=" * 50)


Two-Stage SRM Gating

[1] Statistical Significance: ⚠️ YES
    (p-value = 0.008608, threshold = 0.01)

[2] Practical Significance: ✓ NO
    (deviation = 0.0044, threshold = 0.01)

⚠️  WARNING: Statistical but not practical significance.
   This is common with large samples (90K+).
   Proceeding with caution.


### 💡 Interview Insight: Explaining SRM Decisions

In an interview, you might be asked: *"The SRM check is statistically significant. Should we stop?"*

**Weak answer**: *"Yes, the p-value is below 0.01."*

**Strong answer**: *"It depends. With 90K users, a deviation of less than half a percentage point is already statistically significant at α = 0.01. The key question is whether the deviation is large enough to bias our results. A 49.6% vs 50.4% split is unlikely to meaningfully affect treatment effect estimates, whereas a 45% vs 55% split would be concerning. I'd look at the practical significance: is the deviation large enough to matter?"*

This shows you understand the difference between **statistical** and **practical** significance.
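
To see how sample size drives the statistical side of this argument, the sketch below holds the observed split fixed and varies only the number of users. It relies on the fact that, for a planned 50/50 split, the chi-square statistic simplifies to `4 * n * d**2`, where `d` is the deviation of the control share from 0.5. The helper name `srm_p_value` is hypothetical, and the sample sizes other than 90,189 are illustrative.

```python
import math


def srm_p_value(total_users, control_share):
    """P-value of the 1-dof chi-square SRM test for a planned 50/50 split."""
    deviation = control_share - 0.5
    chi2 = 4 * total_users * deviation ** 2
    return math.erfc(math.sqrt(chi2 / 2))


# Same 49.56% / 50.44% split, three different experiment sizes.
for n in (10_000, 90_189, 1_000_000):
    print(f"n = {n:>9,}  ->  p = {srm_p_value(n, 0.4956):.6f}")
```

The split that is unremarkable at ten thousand users crosses the 0.01 threshold at roughly ninety thousand, which is why the p-value alone is a poor stopping rule at scale.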

---

## Phase 4: Analyze Results

### Step 2: Primary Metric Analysis (1-Day Retention)

Now that we've validated the experiment, we can analyze treatment effects.
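
For readers who want to see what a two-proportion z-test does under the hood, here is a minimal standard-library sketch. It is not the code behind `frequentist.z_test_proportions`; the function name and return keys are hypothetical, and the choice of a pooled standard error for the test with an unpooled one for the interval is an assumption. The retained-player counts are the ones implied by the group sizes and 1-day retention rates in the dataset summary.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x_control, n_control, x_treatment, n_treatment, alpha=0.05):
    """Pooled two-sided z-test for a difference in proportions (illustrative)."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c

    # Pooled standard error under H0: both groups share one retention rate.
    p_pool = (x_control + x_treatment) / (n_control + n_treatment)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled),
    }


# 20,034 of 44,700 control players and 20,119 of 45,489 treatment players returned on day 1.
res = two_proportion_z_test(20_034, 44_700, 20_119, 45_489)
print(res)
```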

---

In [7]:
# Prepare data for analysis
control_df = df[df['version'] == 'gate_30']
treatment_df = df[df['version'] == 'gate_40']

# Extract arrays for testing
control_retention_1d = control_df['retention_1'].values
treatment_retention_1d = treatment_df['retention_1'].values

print("Sample Sizes")
print("=" * 40)
print(f"Control: {len(control_retention_1d):,}")
print(f"Treatment: {len(treatment_retention_1d):,}")

Sample Sizes
Control: 44,700
Treatment: 45,489


In [8]:
# Run z-test for 1-day retention
# z_test_proportions expects: x_control (successes), n_control (total), x_treatment, n_treatment
x_control_1d = control_retention_1d.sum()  # Number of retained players
n_control_1d = len(control_retention_1d)   # Total control players
x_treatment_1d = treatment_retention_1d.sum()
n_treatment_1d = len(treatment_retention_1d)

retention_1d_result = frequentist.z_test_proportions(
    x_control=x_control_1d,
    n_control=n_control_1d,
    x_treatment=x_treatment_1d,
    n_treatment=n_treatment_1d,
    alpha=0.05
)

print("1-Day Retention Analysis")
print("=" * 50)
print(f"\nControl:   {retention_1d_result['p_control']:.4f} ({retention_1d_result['p_control']:.2%})")
print(f"Treatment: {retention_1d_result['p_treatment']:.4f} ({retention_1d_result['p_treatment']:.2%})")
print(f"\nAbsolute difference: {retention_1d_result['absolute_lift']:.4f}")
print(f"Relative lift: {retention_1d_result['relative_lift']:.2%}")
print(f"\n95% CI: [{retention_1d_result['ci_lower']:.4f}, {retention_1d_result['ci_upper']:.4f}]")
print(f"P-value: {retention_1d_result['p_value']:.6f}")
print(f"\nStatistically significant: {retention_1d_result['significant']}")

1-Day Retention Analysis

Control:   0.4482 (44.82%)
Treatment: 0.4423 (44.23%)

Absolute difference: -0.0059
Relative lift: -1.32%

95% CI: [-0.0124, 0.0006]
P-value: 0.074410

Statistically significant: False


### 💡 Interview Insight: Interpreting Results

Notice that the point estimate is **negative**: retention was lower when we moved the gate later, although the 95% CI [-0.0124, 0.0006] includes zero, so the difference is not statistically significant.

In an interview, you might be asked: *"The results contradict your hypothesis. What do you conclude?"*

**Weak answer**: *"The experiment failed."*

**Strong answer**: *"This is actually valuable information. It suggests our mental model was wrong: delaying the gate doesn't reduce frustration, and it may actually increase it. The sunk-cost hypothesis might be at play: players who've invested 40 levels feel more entitled to continue and react worse to being stopped. This gives us insight for future experiments; maybe the gate timing matters less than how it's presented."*

Experiments that disprove hypotheses are still successful experiments.

---

### Step 3: Multiple Testing Correction

We're testing multiple metrics (1-day retention, 7-day retention). This inflates our false positive rate.

**The Problem** (assuming independent tests):
- Testing 1 metric at α=0.05: 5% false positive rate
- Testing 2 metrics at α=0.05: 1 - (0.95)² = 9.75% chance of at least one false positive
- Testing 5 metrics at α=0.05: 1 - (0.95)⁵ = 22.6% chance of at least one false positive

**The Solution: Benjamini-Hochberg FDR Control**

Instead of controlling the probability of ANY false positive (FWER), we control the expected proportion of false positives among rejected hypotheses (FDR).
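
Both ideas fit in a few lines of plain Python. The sketch below is illustrative and independent of `multiple_testing.benjamini_hochberg`; the function names are hypothetical and the three example p-values are made up. The first helper reproduces the inflation arithmetic for independent tests, and the second implements the standard Benjamini-Hochberg step-up adjustment.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and reject flags, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted, [p <= alpha for p in adjusted]


for k in (1, 2, 5):
    print(f"{k} tests -> chance of any false positive = {family_wise_error_rate(k):.2%}")

# Made-up p-values for three metrics.
adjusted, rejected = benjamini_hochberg([0.03, 0.20, 0.004])
print(adjusted, rejected)
```

With only two metrics the adjustment is mild: the smaller p-value is doubled and the larger one is left unchanged.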

---

In [9]:
# Test 7-day retention
control_retention_7d = control_df['retention_7'].values
treatment_retention_7d = treatment_df['retention_7'].values

# Convert to counts for z_test_proportions
x_control_7d = control_retention_7d.sum()
n_control_7d = len(control_retention_7d)
x_treatment_7d = treatment_retention_7d.sum()
n_treatment_7d = len(treatment_retention_7d)

retention_7d_result = frequentist.z_test_proportions(
    x_control=x_control_7d,
    n_control=n_control_7d,
    x_treatment=x_treatment_7d,
    n_treatment=n_treatment_7d,
    alpha=0.05
)

print("7-Day Retention Analysis")
print("=" * 50)
print(f"\nControl:   {retention_7d_result['p_control']:.4f} ({retention_7d_result['p_control']:.2%})")
print(f"Treatment: {retention_7d_result['p_treatment']:.4f} ({retention_7d_result['p_treatment']:.2%})")
print(f"\nRelative lift: {retention_7d_result['relative_lift']:.2%}")
print(f"P-value: {retention_7d_result['p_value']:.6f}")
print(f"\nStatistically significant: {retention_7d_result['significant']}")

7-Day Retention Analysis

Control:   0.1902 (19.02%)
Treatment: 0.1820 (18.20%)

Relative lift: -4.31%
P-value: 0.001554

Statistically significant: True


In [10]:
# Apply Benjamini-Hochberg correction
p_values = [
    retention_1d_result['p_value'],
    retention_7d_result['p_value']
]
metric_names = ['1-Day Retention', '7-Day Retention']

bh_result = multiple_testing.benjamini_hochberg(
    p_values=p_values,
    alpha=0.05
)

print("Multiple Testing Correction (Benjamini-Hochberg)")
print("=" * 60)
print(f"\nFDR level: {bh_result['alpha']:.2%}")
print(f"\n{'Metric':<20} {'P-value':>12} {'Adjusted P':>12} {'Significant':>12}")
print("-" * 60)

# Use correct keys: 'significant' (not 'reject_null'), 'n_significant' (not 'n_discoveries')
for i, metric in enumerate(metric_names):
    print(f"{metric:<20} {p_values[i]:>12.6f} {bh_result['adjusted_p_values'][i]:>12.6f} {str(bh_result['significant'][i]):>12}")

print(f"\nNumber of discoveries: {bh_result['n_significant']}")

Multiple Testing Correction (Benjamini-Hochberg)

FDR level: 5.00%

Metric                    P-value   Adjusted P  Significant
------------------------------------------------------------
1-Day Retention          0.074410     0.074410        False
7-Day Retention          0.001554     0.003108         True

Number of discoveries: 1


### 💡 Interview Insight: When to Correct for Multiple Testing

*"When should you use Bonferroni vs. Benjamini-Hochberg?"*

**Strong answer**: *"It depends on the cost of false positives vs. false negatives. Bonferroni is more conservative: it controls the probability of any false positive, which is appropriate when false positives are very costly (like medical trials). Benjamini-Hochberg is less conservative: it controls the expected proportion of false positives among discoveries, which is appropriate when you're doing exploratory analysis and can tolerate some false positives in exchange for not missing true effects. For A/B tests with 2-5 metrics, BH is usually the right choice because we don't want to be so conservative that we miss real improvements."*

---

### Step 4: Ratio Metrics (Delta Method)

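Engagement (game rounds per player) is a **ratio metric**: total rounds divided by total players. When the denominator varies per randomization unit (for example, rounds per session), the numerator and denominator are correlated and the usual variance of a mean no longer applies. Here every player contributes a denominator of exactly 1, so the ratio reduces to the per-player mean, but running it through the ratio-metric test keeps the workflow general.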

The **Delta Method** provides the correct standard error for ratios.
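
A minimal sketch of the delta-method standard error for a ratio of means follows. It is an illustration under stated assumptions, not the code inside `ratio_metrics.ratio_metric_test`: the function name is hypothetical, and the toy `rounds` and `sessions` values are made up (the dataset has no per-session column).

```python
import math


def ratio_of_means_se(numerators, denominators):
    """Delta-method standard error of sum(numerators) / sum(denominators)."""
    n = len(numerators)
    mean_num = sum(numerators) / n
    mean_den = sum(denominators) / n
    ratio = mean_num / mean_den

    # Sample variances and covariance of the per-unit numerator and denominator.
    var_num = sum((x - mean_num) ** 2 for x in numerators) / (n - 1)
    var_den = sum((y - mean_den) ** 2 for y in denominators) / (n - 1)
    cov = sum((x - mean_num) * (y - mean_den)
              for x, y in zip(numerators, denominators)) / (n - 1)

    # First-order Taylor expansion of the ratio around the two means.
    var_ratio = (var_num - 2 * ratio * cov + ratio ** 2 * var_den) / (n * mean_den ** 2)
    return ratio, math.sqrt(var_ratio)


# Toy data (made up): rounds played and sessions per player.
rounds = [3, 38, 165, 1, 179, 12]
sessions = [1, 4, 9, 1, 11, 2]
ratio, se = ratio_of_means_se(rounds, sessions)
print(f"rounds per session = {ratio:.2f} (SE {se:.2f})")

# With a denominator of 1 per player, the result is the plain standard error of the mean.
_, se_mean = ratio_of_means_se(rounds, [1] * len(rounds))
print(f"SE with unit denominators = {se_mean:.2f}")
```

The second call shows why the notebook's use of `np.ones(...)` as the denominator is harmless: the variance and covariance terms for the denominator vanish.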

---

In [11]:
# Extract engagement data
control_rounds = control_df['sum_gamerounds'].values
treatment_rounds = treatment_df['sum_gamerounds'].values

# For ratio metric, we need numerator (total rounds) and denominator (player count)
# In this case, we're computing rounds per player = mean rounds
# Correct params: numerator_control, denominator_control, numerator_treatment, denominator_treatment
ratio_result = ratio_metrics.ratio_metric_test(
    numerator_control=control_rounds,
    denominator_control=np.ones(len(control_rounds)),  # 1 player each
    numerator_treatment=treatment_rounds,
    denominator_treatment=np.ones(len(treatment_rounds)),
    alpha=0.05
)

print("Engagement Analysis (Game Rounds per Player)")
print("=" * 50)
# Correct return keys: ratio_control, ratio_treatment, ratio_diff, relative_lift
print(f"\nControl mean:   {ratio_result['ratio_control']:.2f} rounds")
print(f"Treatment mean: {ratio_result['ratio_treatment']:.2f} rounds")
print(f"\nDifference: {ratio_result['ratio_diff']:.2f} rounds")
print(f"Relative change: {ratio_result['relative_lift']:.2%}")
print(f"\n95% CI: [{ratio_result['ci_lower']:.2f}, {ratio_result['ci_upper']:.2f}]")
print(f"P-value: {ratio_result['p_value']:.6f}")
print(f"\nStatistically significant: {ratio_result['significant']}")

Engagement Analysis (Game Rounds per Player)

Control mean:   52.46 rounds
Treatment mean: 51.30 rounds

Difference: -1.16 rounds
Relative change: -2.21%

95% CI: [-3.72, 1.40]
P-value: 0.375921

Statistically significant: False


---

## Phase 5: Interpret and Decide

### Step 5: Guardrail Evaluation

Guardrails use **non-inferiority tests**: we're not trying to prove improvement, just that we haven't caused unacceptable harm.

| Metric | Tolerance | Meaning |
|--------|-----------|----------|
| 7-Day Retention | -1% | We accept up to 1% relative decrease |
| Engagement | -5% | We accept up to 5% relative decrease |

These thresholds reflect **business judgment** about acceptable trade-offs.
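
As a rough illustration of the non-inferiority logic for a retention rate, the sketch below compares a one-sided lower confidence bound on the difference against the tolerated margin. The function name and return keys are hypothetical, and this is not the implementation of `guardrails.non_inferiority_test`. It also makes a simplifying assumption: the relative tolerance is converted to an absolute margin using the observed control rate, ignoring the uncertainty in that rate. The 7-day retained counts are the ones implied by the group sizes and rates in the dataset summary.

```python
import math
from statistics import NormalDist


def non_inferiority_proportions(x_control, n_control, x_treatment, n_treatment,
                                relative_margin=0.01, alpha=0.05):
    """One-sided non-inferiority check on a retention rate (illustrative)."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)

    # One-sided lower confidence bound on the absolute difference.
    lower = diff - NormalDist().inv_cdf(1 - alpha) * se
    # Convert the relative tolerance into an absolute margin on the control rate.
    margin = -relative_margin * p_c
    # Pass only if even the pessimistic bound stays above the tolerated loss.
    return {"diff": diff, "ci_lower": lower, "margin": margin, "passed": lower > margin}


# 8,502 of 44,700 control players and 8,279 of 45,489 treatment players returned on day 7.
check = non_inferiority_proportions(8_502, 44_700, 8_279, 45_489, relative_margin=0.01)
print(check)
```

Note the asymmetry with a superiority test: a guardrail fails whenever the data cannot rule out a loss larger than the margin, even if the observed loss itself is small.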

---

In [12]:
# Guardrail 1: 7-Day Retention (must not degrade more than 1%)
guardrail_retention_7d = guardrails.non_inferiority_test(
    control=control_retention_7d,
    treatment=treatment_retention_7d,
    delta=-0.01,  # Allow max 1% relative degradation
    metric_type='relative',
    alpha=0.05
)
guardrail_retention_7d['metric_name'] = '7-Day Retention'

print("Guardrail 1: 7-Day Retention")
print("=" * 50)
print(f"Tolerance: -1.0% (max allowed degradation)")
print(f"\nControl mean:   {guardrail_retention_7d['mean_control']:.4f}")
print(f"Treatment mean: {guardrail_retention_7d['mean_treatment']:.4f}")

rel_change_7d = guardrail_retention_7d['difference'] / guardrail_retention_7d['mean_control']
rel_ci_lower_7d = guardrail_retention_7d['ci_lower'] / guardrail_retention_7d['mean_control']

print(f"\nRelative change:    {rel_change_7d:.2%}")
print(f"95% CI lower bound: {rel_ci_lower_7d:.2%}")
print(f"\nResult: {'✓ PASSED' if guardrail_retention_7d['passed'] else '✗ FAILED'}")

Guardrail 1: 7-Day Retention
Tolerance: -1.0% (max allowed degradation)

Control mean:   0.1902
Treatment mean: 0.1820

Relative change:    -4.31%
95% CI lower bound: -6.55%

Result: ✗ FAILED


In [13]:
# Guardrail 2: Engagement (must not degrade more than 5%)
guardrail_engagement = guardrails.non_inferiority_test(
    control=control_rounds,
    treatment=treatment_rounds,
    delta=-0.05,
    metric_type='relative',
    alpha=0.05
)
guardrail_engagement['metric_name'] = 'Engagement (Game Rounds)'

print("Guardrail 2: Engagement (Game Rounds per Player)")
print("=" * 50)
print(f"Tolerance: -5.0% (max allowed degradation)")
print(f"\nControl mean:   {guardrail_engagement['mean_control']:.2f} rounds")
print(f"Treatment mean: {guardrail_engagement['mean_treatment']:.2f} rounds")

rel_change_eng = guardrail_engagement['difference'] / guardrail_engagement['mean_control']
rel_ci_lower_eng = guardrail_engagement['ci_lower'] / guardrail_engagement['mean_control']

print(f"\nRelative change:    {rel_change_eng:.2%}")
print(f"95% CI lower bound: {rel_ci_lower_eng:.2%}")
print(f"\nResult: {'✓ PASSED' if guardrail_engagement['passed'] else '✗ FAILED'}")

Guardrail 2: Engagement (Game Rounds per Player)
Tolerance: -5.0% (max allowed degradation)

Control mean:   52.46 rounds
Treatment mean: 51.30 rounds

Relative change:    -2.21%
95% CI lower bound: -6.31%

Result: ✗ FAILED


### Step 6: The Ship / Hold / Abandon Decision

### 💡 Interview Insight: The Decision Framework

This is often the most important part of an interview discussion. Interviewers want to see how you synthesize statistical results into business decisions.

**The Framework:**

| Decision | Criteria |
|----------|----------|
| **SHIP** | Primary metric significant AND positive AND all guardrails pass |
| **ABANDON** | Primary metric significant AND (negative OR any guardrail fails) |
| **HOLD** | Primary metric not significant OR mixed signals |

**Key Principle**: We don't ship on neutral results, and we don't ship if we're causing harm elsewhere.
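
One way to encode the table is as a small function. The sketch below is hypothetical and is not the logic of `guardrails.evaluate_guardrails`. It assumes that primary-metric significance is checked first, so a non-significant primary metric yields HOLD even when guardrails fail; a stricter policy could instead abandon on any guardrail failure.

```python
def ship_hold_abandon(primary_significant, primary_lift, guardrails_passed):
    """Hypothetical decision rule that checks primary significance first."""
    if not primary_significant:
        # No detectable primary effect: neutral results are never shipped.
        return "hold"
    if primary_lift > 0 and all(guardrails_passed):
        return "ship"
    if primary_lift < 0 or not all(guardrails_passed):
        return "abandon"
    # Remaining case: a significant lift of exactly zero with clean guardrails.
    return "hold"


print(ship_hold_abandon(True, 0.02, [True, True]))
print(ship_hold_abandon(True, 0.02, [True, False]))
print(ship_hold_abandon(False, -0.0132, [False, False]))
```

Writing the rule down as code forces the precedence question into the open, which is usually worth settling with stakeholders before the experiment starts.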

---

In [14]:
# Make final decision using the framework
decision_result = guardrails.evaluate_guardrails(
    primary_result={
        'significant': retention_1d_result['significant'],
        'relative_lift': retention_1d_result['relative_lift'],
        'p_value': retention_1d_result['p_value']
    },
    guardrail_results=[guardrail_retention_7d, guardrail_engagement]
)

print("\n" + "=" * 60)
print("DECISION FRAMEWORK EVALUATION")
print("=" * 60)
print(f"\n🎯 Primary Metric: 1-Day Retention")
print(f"   Significant: {decision_result['primary_significant']}")
print(f"   Positive:    {decision_result['primary_positive']}")
print(f"   Lift:        {retention_1d_result['relative_lift']:.2%}")

print(f"\n🛡️  Guardrail Metrics:")
print(f"   Passed: {decision_result['guardrails_passed']} / {decision_result['guardrails_total']}")
print(f"   - 7-Day Retention: {'✓ PASSED' if guardrail_retention_7d['passed'] else '✗ FAILED'}")
print(f"   - Engagement:      {'✓ PASSED' if guardrail_engagement['passed'] else '✗ FAILED'}")

print(f"\n" + "=" * 60)
decision = decision_result['decision'].upper()
print(f">>> FINAL DECISION: {decision} <<<")
print("=" * 60)


DECISION FRAMEWORK EVALUATION

🎯 Primary Metric: 1-Day Retention
   Significant: False
   Positive:    False
   Lift:        -1.32%

🛡️  Guardrail Metrics:
   Passed: 0 / 2
   - 7-Day Retention: ✗ FAILED
   - Engagement:      ✗ FAILED

>>> FINAL DECISION: HOLD <<<


### Step 7: Interpreting the Decision (Connecting to Business)

### 💡 Interview Insight: Connecting Statistics to Business Impact

The final step is translating your analysis into business terms. This is where judgment matters more than formulas.

---

In [15]:
# Business impact interpretation
decision = decision_result['decision'].upper()

print("\n" + "=" * 60)
print("BUSINESS INTERPRETATION")
print("=" * 60)

if decision == 'SHIP':
    print("\n✅ RECOMMENDATION: SHIP")
    print("\nWhy ship?")
    print(f"  • Primary metric improved by {retention_1d_result['relative_lift']:.2%}")
    print(f"  • All guardrails passed")
    print("\nBusiness impact (per 100,000 new players):")
    additional_returns = abs(retention_1d_result['absolute_lift']) * 100000
    print(f"  • Additional day-1 returns: {additional_returns:.0f} players")
    print("  • More engaged base → more monetization opportunities")
    print("\nNext steps:")
    print("  1. Roll out to 100% of players")
    print("  2. Monitor 7-day and 30-day retention post-launch")
    print("  3. Track revenue impact")

elif decision == 'ABANDON':
    print("\n❌ RECOMMENDATION: ABANDON")
    print("\nWhy abandon?")
    if not decision_result['primary_positive']:
        print(f"  • Primary metric showed NEGATIVE impact ({retention_1d_result['relative_lift']:.2%})")
    if not guardrail_retention_7d['passed']:
        print(f"  • 7-day retention guardrail FAILED")
    if not guardrail_engagement['passed']:
        print(f"  • Engagement guardrail FAILED")
    print("\nBusiness impact:")
    lost_returns = abs(retention_1d_result['absolute_lift']) * 100000
    print(f"  • Would lose ~{lost_returns:.0f} day-1 returns per 100K players")
    print("\nLearnings:")
    print("  • Delaying the gate doesn't reduce frustration; it may increase it")
    print("  • Sunk cost: Players invested in 40 levels react worse to being stopped")
    print("\nNext steps:")
    print("  1. Test alternative gate presentations (instead of positions)")
    print("  2. Consider gate at level 35 as a middle ground")
    print("  3. Analyze user feedback for qualitative insights")

else:  # HOLD
    print("\n⚪ RECOMMENDATION: HOLD")
    print("\nWhy hold?")
    if not decision_result['primary_significant']:
        print(f"  • Primary metric not statistically significant")
        print(f"  • Observed {retention_1d_result['relative_lift']:.2%} lift could be random noise")
    print("\nOptions:")
    print("  1. Extend experiment duration for more data")
    print("  2. Increase traffic allocation for faster results")
    print("  3. Analyze subgroups (new vs. existing players)")


BUSINESS INTERPRETATION

⚪ RECOMMENDATION: HOLD

Why hold?
  • Primary metric not statistically significant
  • Observed -1.32% lift could be random noise

Options:
  1. Extend experiment duration for more data
  2. Increase traffic allocation for faster results
  3. Analyze subgroups (new vs. existing players)


---

## Summary: The Complete Lifecycle

We've walked through the full A/B testing lifecycle:

| Phase | What We Did | Key Insight |
|-------|-------------|-------------|
| **1. Frame** | Articulated hypothesis with mechanism and risk | Hypotheses need reasoning, not just predictions |
| **2. Metrics** | Chose primary + guardrails with trade-offs | Metric selection involves business judgment |
| **3. Validate** | Two-stage SRM check | Don't analyze until you've validated |
| **4. Analyze** | Z-tests + multiple testing correction | Statistics are just one piece |
| **5. Decide** | Ship/Hold/Abandon with business context | Judgment > formulas |

---

## 🎓 Exercises for Practice

### Exercise 1: Different Guardrail Thresholds
What if we set a stricter guardrail (-0.5% instead of -1%) for 7-day retention? How does this change the decision?

### Exercise 2: Segment Analysis
Do the results differ for high-engagement players (>50 rounds) vs. low-engagement players?

### Exercise 3: Interview Practice
Write a 3-minute explanation of this experiment and its results as if presenting to a non-technical product manager. Focus on: what we learned, what we recommend, and what uncertainties remain.

---

In [16]:
# Exercise 1: Stricter guardrail
# Try running the non-inferiority test with delta=-0.005 (0.5%) instead of -0.01
# Your code here:

# guardrail_strict = guardrails.non_inferiority_test(
#     control=control_retention_7d,
#     treatment=treatment_retention_7d,
#     delta=-0.005,  # Stricter threshold
#     metric_type='relative',
#     alpha=0.05
# )
# print(f"Stricter guardrail passed: {guardrail_strict['passed']}")

In [17]:
# Exercise 2: Segment analysis
# Your code here:

# high_engagement_control = control_df[control_df['sum_gamerounds'] > 50]['retention_1'].values
# high_engagement_treatment = treatment_df[treatment_df['sum_gamerounds'] > 50]['retention_1'].values
# 
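# # Caution: sum_gamerounds is measured after assignment, so this segment
# # is itself affected by the treatment; interpret the result as exploratory.
# # z_test_proportions takes success counts and totals, as in cells 8 and 9.
# segment_result = frequentist.z_test_proportions(
#     x_control=high_engagement_control.sum(),
#     n_control=len(high_engagement_control),
#     x_treatment=high_engagement_treatment.sum(),
#     n_treatment=len(high_engagement_treatment),
#     alpha=0.05
# )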
# print(f"High-engagement segment lift: {segment_result['relative_lift']:.2%}")

---

## Key Takeaways for Interviews

1. **Lead with the lifecycle, not the math.** Start by framing the question and choosing metrics before touching any code.

2. **Validate before you analyze.** SRM checks aren't optional; they're the first line of defense against bad data.

3. **Trade-offs are everywhere.** Articulating what you're optimizing vs. protecting shows business maturity.

4. **Negative results are valuable.** An experiment that disproves your hypothesis is still a successful experiment.

5. **Connect to business impact.** Translate statistical results into dollars, users, or concrete outcomes.

6. **Know your uncertainty.** Strong candidates acknowledge what they don't know and suggest next steps.

---

**Next Notebook**: [02_criteo_advanced_techniques.ipynb](02_criteo_advanced_techniques.ipynb) - Advanced techniques for complex, real-world scenarios where assumptions break down and judgment matters more than formulas.