# Cookie Cats Retention A/B Test Analysis

---


### Executive summary
This A/B test evaluates whether moving the first gate from level 30 to level 40 affects player retention and engagement.

- **Primary metric (Day-7 retention)**  
  - Control rate (`gate_30`): 19.02% 
  - Treatment rate (`gate_40`): 18.20%
  - Δ = -0.82 pp with 95% CI [-1.33, -0.31], p-value = 0.0016, relative change = -4.31%, Cohen's h = 0.02
- **Decision rule:** Pre-specified MDE = 1.0 pp at α = 0.05.  
- **Outcome:** Fail. The result is statistically significant, but the effect is negative (the treatment lowers retention) and its magnitude is below the 1.0 pp MDE threshold. No guardrail metric shows a significant difference.

**Conclusion:** Based on the pre-specified criteria, the gate move to level 40 **is not recommended** for rollout.

### Project description

Cookie Cats is a mobile puzzle game developed by Tactile Entertainment. The game involves "gates" that serve as intentional breaks between levels, a system typical of puzzle games that is intended to boost player enjoyment and play duration. In this project, I test whether moving the gate from level 30 to level 40 has a significant effect on player retention at day 7, using retention at day 1 and engagement as guardrail metrics.

The dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/mursideyarkin/mobile-games-ab-testing-cookie-cats).

`cookie_cats.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `userid` | `int` | Unique player ID |
| `version` | `str` | Experiment split (``gate_30`` = control, ``gate_40`` = treatment) |
| `sum_gamerounds` | `int` | Total rounds per player |
| `retention_1` | `bool` | Active 1 day after installing the game |
| `retention_7` | `bool` | Active 7 days after installing the game |

Each row corresponds to 1 user.

I pre-specified day-7 retention as the primary metric, while specifying day-1 retention and game rounds as guardrails used diagnostically to catch unintended harmful effects.

**Primary metric:** Day-7 retention (``retention_7``).  
**Test applied:** Two-proportion z-test (α = 0.05, two-sided).  
**Power/MDE:** Target 80% power, operational MDE = 1.0 pp.  

- **H₀:** Gate move has no effect on D7 retention.
- **H₁:** Gate move changes D7 retention.

**Guardrails (diagnostic):**
- **Day-1 retention (``retention_1``):** z-test (same structure as primary).
- **Game rounds (``sum_gamerounds``):** Mann–Whitney U (non-parametric) and Welch's t-test on log-transformed data.
  - **H₀:** Distributions are equal across groups.

**Multiple testing:** Guardrail p-values are corrected with the Holm method.

**Decision rule:**
- Roll out only if D7 is significant (p < 0.05) AND uplift ≥ 1.0 pp.
- Otherwise, fail or flag for review.

**Population:** Control (``gate_30``) and treatment (``gate_40``) players

**Sanity checks (Sample Ratio Test):**
- Chi-square test

In [None]:
# Success criteria
primary_metric = "retention_7"  # day-7 retention
guardrail_metrics = [
    "retention_1",
    "sum_gamerounds",
]  # day-1 retention and sum of game rounds per player
alpha = 0.05  # 5% significance level
confidence_level = 1 - alpha  # 95% confidence level
power = 0.80  # 80% statistical power
mde_pp = 1.0  # pre-specified absolute uplift threshold in percentage points

### 1. Experiment setup

In [None]:
# Imports
from cookiecats.io import load_cookiecats
import cookiecats.plots as ccp
import cookiecats.stats as ccs
from cookiecats.tables import build_results_table
import numpy as np
import pandas as pd

### 2. Load data & quick inspection

In [None]:
cookie_df = load_cookiecats()
cookie_df.head()

In [None]:
cookie_df.info()

In [None]:
cookie_df.groupby("version")["userid"].nunique()

### 3. Data cleaning & sanity checks

#### 3.1 Type conversions

In [None]:
# Convert retention values from boolean to integer
cookie_df[["retention_1", "retention_7"]] = cookie_df[
    ["retention_1", "retention_7"]
].astype(int)
cookie_df[["retention_1", "retention_7"]].head()

#### 3.2 Missing values & duplicates

In [None]:
# Missing values per column
missing_values = cookie_df.isnull().sum()
print(f"Missing values per column: \n{missing_values}")

In [None]:
# Check unique users and duplicates
unique_users = cookie_df["userid"].nunique()
duplicate_users = cookie_df[cookie_df.duplicated("userid", keep=False)]

print(f"Unique users: {unique_users}")
print(f"Duplicate users: {duplicate_users['userid'].unique()}")

#### 3.3 Outliers

Exploring outliers in ``sum_gamerounds`` distribution.

In [None]:
# Inspect top players with highest game rounds
cookie_df.sort_values("sum_gamerounds", ascending=False).head()

**Note:** A total of 49,854 game rounds is implausible for a single player, so this value is more likely a data integrity error than a naturally occurring outlier. The remaining top values are in the 2-3k range, which indicates extreme but plausible engagement. I removed the ~50k row and use the filtered dataset (``plot_df``) only for the engagement plots, to keep them readable. The sanity checks and tests use the main dataset (``cookie_df``) and are unaffected by this decision.

In [None]:
# Drop the anomalous value
plot_df = cookie_df[cookie_df["sum_gamerounds"] < 3000].copy()
plot_df.sort_values("sum_gamerounds", ascending=False).head()

In [None]:
# Inspect game rounds distribution
percentiles = cookie_df["sum_gamerounds"].describe(
    percentiles=[0.5, 0.9, 0.95, 0.99, 0.999]
)
print(percentiles)
# IQR method
Q1 = cookie_df["sum_gamerounds"].quantile(0.25)
Q3 = cookie_df["sum_gamerounds"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Check outliers
outliers = cookie_df[
    (cookie_df["sum_gamerounds"] < lower_bound)
    | (cookie_df["sum_gamerounds"] > upper_bound)
]
print(
    f"Outliers detected: {len(outliers)} rows ({len(outliers) / len(cookie_df):.2%} of dataset)"
)

In [None]:
# Plot the distribution of game rounds by version
ccp.plot_game_rounds(plot_df, lower_bound=lower_bound, upper_bound=upper_bound)

In [None]:
# Create a column to log-transform sum_gamerounds
plot_df.loc[:, "log_sum_gamerounds"] = np.log1p(plot_df["sum_gamerounds"])

# Plot the log-transformed distribution of game rounds by version
ccp.plot_game_rounds(plot_df, lower_bound=lower_bound, upper_bound=upper_bound, log=True)

#### 3.4 Assignment counts & Sample Ratio Test

Conduct a sanity check for sample ratio mismatch (SRM) between versions ``gate_30`` (control group) and ``gate_40`` (treatment group).

In [None]:
# Assign unique player counts to variables
control_players = cookie_df[cookie_df["version"] == "gate_30"]["userid"].nunique()
treatment_players = cookie_df[cookie_df["version"] == "gate_40"]["userid"].nunique()

# Implement chi-square test for SRM
srm_result = ccs.test_srm_chi2(control_players, treatment_players)

# Print SRM results
print(f"SRM (chi-square) p = {srm_result[0]:.6f}")
print(
    f"Allocation ratio: control = {control_players} ({srm_result[1]:.2%}), treatment = {treatment_players} ({srm_result[2]:.2%})"
)

if srm_result[0] < 0.001:
    print("Strong SRM indication")
else:
    print("No strong SRM indication")
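For readers without access to the project's `ccs.test_srm_chi2` helper, the SRM check can be sketched with the standard library alone. This is a minimal sketch, not the helper's actual implementation: the function name `srm_chi2_sketch`, the 50/50 intended allocation, and the example counts are all assumptions made for illustration.

```python
import math


def srm_chi2_sketch(n_control, n_treatment, expected_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch.

    expected_share is the intended share of users in control; 0.5 is an
    assumption here, since the intended allocation is not stated in the dataset.
    """
    total = n_control + n_treatment
    exp_control = total * expected_share
    exp_treatment = total * (1 - expected_share)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only (not the Cookie Cats group sizes)
chi2_balanced, p_balanced = srm_chi2_sketch(5_000, 5_000)
chi2_skewed, p_skewed = srm_chi2_sketch(5_000, 5_200)
print(f"balanced: chi2 = {chi2_balanced:.3f}, p = {p_balanced:.4f}")
print(f"skewed:   chi2 = {chi2_skewed:.3f}, p = {p_skewed:.4f}")
```

A strict threshold such as the p < 0.001 used above is a common choice for SRM checks, because even small p-values arise by chance when many experiments are monitored.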

### 4. Exploratory data analysis (EDA)

#### 4.1 Summary statistics

Mean, count, and standard deviation summary

In [None]:
# Retention summary statistics
cookie_df.groupby("version")[["retention_1", "retention_7"]].agg(
    ["mean", "std", "count"]
)

In [None]:
# Game rounds summary statistics
agg = cookie_df.groupby("version").agg(
    mean=("sum_gamerounds", "mean"),
    std=("sum_gamerounds", "std"),
    count=("sum_gamerounds", "count"),
)

agg[["mean", "std"]] = agg[["mean", "std"]].round(2)
agg["count"] = agg["count"].astype(int)
agg

#### 4.2 Assignment bar plot

In [None]:
# Plot assignment counts by version
ccp.plot_assignment_counts(cookie_df)

#### 4.3 Retention rates (day-1 and day-7) with 95% CI

In [None]:
# Plot retention rates by version with 95% CI
ccp.plot_retention_rates(cookie_df)

#### 4.4 Game rounds distribution

In [None]:
# Plot histogram for player count - game rounds distribution
ccp.plot_game_rounds_dist(plot_df)

In [None]:
# Plot log-transformed histogram for player count - game rounds distribution
ccp.plot_game_rounds_dist(plot_df, log=True)

### 5. Power analysis & MDE exploration

Minimum Detectable Effect at current sample sizes (80% power, alpha = 0.05)

In [None]:
# Baseline p0 (control group day-7 retention)
p0 = cookie_df[cookie_df["version"] == "gate_30"]["retention_7"].mean()

mde_result = ccs.solve_mde(
    cookie_df,
    alpha=alpha,
    power=power,
    p0=p0
)

print(
    f"MDE at current N ({mde_result[0]}) (80% power, two-sided alpha=0.05): {mde_result[1]:.4f} pp"
)

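At the current sample size, a true absolute difference of about 0.74 pp would be detected with ≈80% power; larger differences would be detected with higher power.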

Required N per group for target MDEs (balanced)

In [None]:
required_n_result = ccs.solve_required_n(
    alpha=alpha,
    power=power,
    p0=p0
)

required_n_result

Power vs. MDE plot

In [None]:
ccp.plot_power_vs_mde(
    p0=p0,
    nob=mde_result[0],
    alpha=alpha,
    mde_pp_current=mde_result[1]
)

The curve crosses 80% power near the computed MDE.

MDE vs. N per group plot (balanced at 80% power)

In [None]:
ccp.plot_mde_vs_sample(
    p0=p0,
    alpha=alpha,
    power=power,
    n=mde_result[0],
    mde_pp_current=mde_result[1]
)

#### Interpretation

The experiment design targets 80% power at an MDE of 1.0 pp. The current sample size is larger than that design requires, so the achieved MDE is smaller: about 0.74 pp.
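The relationship between sample size and MDE can be approximated in closed form. The sketch below uses the normal approximation for a balanced two-sided two-proportion test and assumes both arms share the baseline variance. The helper names `approx_mde` and `approx_required_n` are hypothetical, `ccs.solve_mde` may use a different method (for example one based on Cohen's h), and the group size of 45,000 is an illustrative round number rather than a value taken from the dataset.

```python
import math
from statistics import NormalDist


def approx_mde(p0, n_per_group, alpha=0.05, power=0.80):
    """Approximate absolute MDE (as a proportion) for a balanced two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Simplification: both arms are assumed to have variance p0 * (1 - p0)
    se = math.sqrt(2 * p0 * (1 - p0) / n_per_group)
    return (z_alpha + z_power) * se


def approx_required_n(p0, mde, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect an absolute difference `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * p0 * (1 - p0) * (z_alpha + z_power) ** 2 / mde ** 2)


p0_example = 0.1902  # control day-7 retention from the executive summary
n_example = 45_000   # hypothetical group size, for illustration only
mde_example = approx_mde(p0_example, n_example)
n_for_1pp = approx_required_n(p0_example, mde=0.01)

print(f"Approximate MDE at n = {n_example:,} per group: {mde_example * 100:.2f} pp")
print(f"Approximate n per group for a 1.0 pp MDE: {n_for_1pp:,}")
```

Because the MDE scales with 1/sqrt(n), halving the MDE requires roughly four times as many users per group.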

### 6. Primary metric analysis: day-7 retention

In [None]:
ret7_results = ccs.test_two_prop_z(
    df=cookie_df,
    ctrl=control_players,
    treat=treatment_players,
    col="retention_7",
    alpha=alpha,
    p0=p0
)

# Print day-7 retention test results
print(f"gate_30 size: {control_players}")
print(f"gate_40 size: {treatment_players}")
print(
    f"Control group (gate_30) day-7 retention rate: {ret7_results[0]:.6f} (95% CI [{ret7_results[1]:.6f}, {ret7_results[2]:.6f}])"
)
print(
    f"Treatment group (gate_40) day-7 retention rate: {ret7_results[3]:.6f} (95% CI [{ret7_results[4]:.6f}, {ret7_results[5]:.6f}])"
)
print(f"z = {ret7_results[6]:.6f}, p-value = {ret7_results[7]:.6f}")
print(
    f"Absolute difference: {ret7_results[8]:.6f} pp (95% CI [{ret7_results[9] * 100:.3f}, {ret7_results[10] * 100:.3f}] pp)"
)
print(f"Relative difference: {ret7_results[11]:.2f}%")
print(f"Effect size (Cohen's h): {ret7_results[12]:.6f}")

#### Interpretation

With p < 0.05, the difference in day-7 retention between the two gate versions is statistically significant. The estimated effect is -0.82 pp: its direction is negative, and its magnitude is below the pre-specified MDE of 1.0 pp. Cohen's h = 0.02 is well below the conventional 0.2 threshold for a small effect. In other words, the treatment (``gate_40``) hurts retention relative to control (``gate_30``), but by a small margin.
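The quantities reported above (z statistic, p-value, confidence interval for the difference, and Cohen's h) can be reproduced from raw counts. The following is a minimal standard-library sketch under stated assumptions, not the implementation of `ccs.test_two_prop_z`: the function name `two_prop_z_sketch` is hypothetical, it uses a pooled standard error for the test and an unpooled (Wald) interval for the difference, and the example counts are invented for illustration.

```python
import math
from statistics import NormalDist


def two_prop_z_sketch(x_ctrl, n_ctrl, x_treat, n_treat, alpha=0.05):
    """Two-sided two-proportion z-test with a Wald CI and Cohen's h."""
    p_ctrl = x_ctrl / n_ctrl
    p_treat = x_treat / n_treat
    diff = p_treat - p_ctrl  # treatment minus control

    # Test statistic: pooled standard error under H0 (equal proportions)
    p_pool = (x_ctrl + x_treat) / (n_ctrl + n_treat)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_treat))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error
    se = math.sqrt(p_ctrl * (1 - p_ctrl) / n_ctrl + p_treat * (1 - p_treat) / n_treat)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * math.asin(math.sqrt(p_treat)) - 2 * math.asin(math.sqrt(p_ctrl))
    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci, "cohens_h": h}


# Hypothetical counts, for illustration only
res = two_prop_z_sketch(x_ctrl=1_900, n_ctrl=10_000, x_treat=1_820, n_treat=10_000)
print(f"diff = {res['diff'] * 100:.2f} pp, z = {res['z']:.3f}, p = {res['p_value']:.4f}")
print(f"95% CI = [{res['ci'][0] * 100:.2f}, {res['ci'][1] * 100:.2f}] pp")
print(f"Cohen's h = {res['cohens_h']:.4f}")
```

With these smaller hypothetical groups, a difference of similar size is not significant, which illustrates why the large sample in this experiment matters for detecting sub-1 pp effects.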

#### Exposure sensitivity analysis

The dataset does not indicate whether users actually reached the level gate, so some players may not have been exposed to the change.  
The current estimate is the average effect over all players. If only a fraction _q_ of users actually reach the gate, the estimate is diluted. Assuming no effect on unexposed players, the exposed-only effect can be calculated as:   

`exposed effect = current effect / q`  

95% confidence intervals can be similarly calculated.

In [None]:
# Calculate exposure sensitivity
def exposure_sensitivity(delta, ci_low, ci_high, qs=(0.2,0.3,0.4,0.5,0.6,0.7,1)):
    rows = []
    for q in qs:
        rows.append({
            "Exposure rate (q)": f"{q * 100:.0f}%",
            "Exposed effect (pp)": f"{delta / q:.2f}",
            "95% CI low (pp)": f"{ci_low / q:.2f}",
            "95% CI high (pp)": f"{ci_high / q:.2f}"
        })
    return pd.DataFrame(rows)

exp_sens_df = exposure_sensitivity(delta=-0.82, ci_low=-1.33, ci_high=-0.31)
exp_sens_df

**Note:**  
Exposure also affects power sensitivity. MDE among exposed players would be roughly equal to MDE / q, meaning that it is harder to detect the effect if q is small.

### 7. Guardrail metrics analyses: day-1 retention & game rounds

#### 7.1 Day-1 retention

In [None]:
ret1_results = ccs.test_two_prop_z(
    df=cookie_df,
    ctrl=control_players,
    treat=treatment_players,
    col="retention_1",
    alpha=alpha,
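    p0=cookie_df.loc[cookie_df["version"] == "gate_30", "retention_1"].mean()  # day-1 baseline, not the day-7 p0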
)

# Print day-1 retention test results
print(f"gate_30 size: {control_players}")
print(f"gate_40 size: {treatment_players}")
print(
    f"Control group (gate_30) day-1 retention rate: {ret1_results[0]:.6f} (95% CI [{ret1_results[1]:.6f}, {ret1_results[2]:.6f}])"
)
print(
    f"Treatment group (gate_40) day-1 retention rate: {ret1_results[3]:.6f} (95% CI [{ret1_results[4]:.6f}, {ret1_results[5]:.6f}])"
)
print(f"z = {ret1_results[6]:.6f}, p-value = {ret1_results[7]:.6f}")
print(
    f"Absolute difference: {ret1_results[8]:.6f} pp (95% CI [{ret1_results[9] * 100:.3f}, {ret1_results[10] * 100:.3f}] pp)"
)
print(f"Relative difference: {ret1_results[11]:.2f}%")
print(f"Effect size (Cohen's h): {ret1_results[12]:.6f}")

##### Interpretation

With p = 0.0744 (above α = 0.05), the difference in day-1 retention between the experiment groups is not statistically significant.

#### 7.2 Game rounds

The distribution of game rounds is heavy-tailed (right-skewed), so I use a non-parametric test (Mann–Whitney U) alongside Welch's t-test on log-transformed rounds.

In [None]:
# Engagement stats and Mann-Whitney U test results
engagement_stats = ccs.calculate_engagement_stats(cookie_df)
rounds_results = ccs.test_game_rounds(engagement_stats)

print(f"Control (gate_30) mean: {engagement_stats[2]:.2f}, median: {engagement_stats[3]:.2f}")
print(f"Treatment (gate_40) mean: {engagement_stats[4]:.2f}, median: {engagement_stats[5]:.2f}")
print(f"Mann-Whitney U statistic: {rounds_results[0]:.0f}, p-value: {rounds_results[1]:.6f}")
print(f"Absolute difference in mean game rounds: {engagement_stats[6]:.2f}")

In [None]:
# Welch's t-test results
print(
    f"Welch's t-test on log-transformed rounds: t-statistic = {rounds_results[2]:.6f}, p-value = {rounds_results[3]:.6f}"
)
print(f"Control (gate_30) log mean = {engagement_stats[7].mean():.6f}")
print(f"Treatment (gate_40) log mean = {engagement_stats[8].mean():.6f}")

Bootstrap 95% CI for mean difference

In [None]:
# Bootstrap results for mean difference in game rounds
bootstrap_result = ccs.bootstrap_mean_diff(rounds_ctrl=engagement_stats[0], rounds_treat=engagement_stats[1])

print(f"Mean difference: {bootstrap_result[0]:.2f}")
print(f"Bootstrap 95% CI for mean difference: [{bootstrap_result[1]:.2f}, {bootstrap_result[2]:.2f}]")

##### Interpretation

Across all three methods (MWU, Welch t, bootstrap CI), results consistently demonstrate:
- No significant difference between ``gate_30`` and ``gate_40`` in game rounds played.
- Effect size is negligible in practice (less than ±1–2 rounds out of ~50 average).
- In other words, changing the gate does not affect overall engagement measured by total rounds played.
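The bootstrap interval used above can be illustrated with a percentile bootstrap written against the standard library. This is a sketch under assumptions rather than the code behind `ccs.bootstrap_mean_diff`: the function name `percentile_bootstrap_mean_diff`, the treatment-minus-control direction, the number of resamples, and the synthetic right-skewed data are all illustrative choices.

```python
import random
from statistics import fmean


def percentile_bootstrap_mean_diff(ctrl, treat, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for mean(treat) - mean(ctrl)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement
        ctrl_sample = rng.choices(ctrl, k=len(ctrl))
        treat_sample = rng.choices(treat, k=len(treat))
        diffs.append(fmean(treat_sample) - fmean(ctrl_sample))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return fmean(treat) - fmean(ctrl), low, high


# Synthetic right-skewed "game rounds" for illustration only
data_rng = random.Random(0)
ctrl_rounds = [data_rng.expovariate(1 / 50) for _ in range(300)]
treat_rounds = [data_rng.expovariate(1 / 50) for _ in range(300)]

obs_diff, ci_low, ci_high = percentile_bootstrap_mean_diff(ctrl_rounds, treat_rounds)
print(f"Mean difference: {obs_diff:.2f}")
print(f"Bootstrap 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

The bootstrap makes no normality assumption about the rounds distribution, which is why it is a useful complement to the Welch t-test on log-transformed data.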

### 8. Multiple testing correction

In [None]:
# Adjust guardrail p-values using Holm correction
guardrail_adj = ccs.correct_pvals(ret1_results[7], rounds_results[1], rounds_results[3], alpha=alpha)
print(f"Significant tests: {guardrail_adj[0]}")
print(f"Corrected p-values: {guardrail_adj[1]}")
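The Holm step-down procedure itself is short enough to write out. The sketch below shows one way to compute Holm-adjusted p-values. The function name `holm_adjust` and the example p-values are hypothetical, and `ccs.correct_pvals` may be implemented differently (for example by delegating to a statistics library).

```python
def holm_adjust(pvals, alpha=0.05):
    """Return (reject flags, Holm-adjusted p-values) in the original order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Enforce monotonicity: adjusted p-values never decrease along the ordering
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return reject, adjusted


# Hypothetical p-values, for illustration only
reject_flags, adjusted_pvals = holm_adjust([0.01, 0.04, 0.03])
print(reject_flags)    # [True, False, False]
print(adjusted_pvals)  # approximately [0.03, 0.06, 0.06]
```

Holm controls the family-wise error rate like Bonferroni but is uniformly at least as powerful, since only the smallest p-value is multiplied by the full number of tests.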

### 9. Results summary table

In [None]:
results_table = build_results_table(
    ret1_results=ret1_results,
    ret7_results=ret7_results,
    rounds_results=rounds_results,
    bootstrap_result=bootstrap_result,
    guardrail_adj=guardrail_adj,
    engagement_stats=engagement_stats,
    alpha=alpha
)

results_table

In [None]:
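def decide(pval, delta, mde_pp, alpha):
    """Apply the pre-specified decision rule; delta and mde_pp are in percentage points."""
    if pval < alpha:
        if delta >= mde_pp:
            return "Roll out"
        if delta <= -mde_pp:
            return "Do not roll out, treatment harmful beyond MDE"
        return "Significant but below MDE, do not roll out"
    return "No significant effect, do not roll out"


# Day-7 effect in percentage points (treatment rate minus control rate)
delta_pp = (ret7_results[3] - ret7_results[0]) * 100

# Compare against the pre-specified MDE (mde_pp = 1.0), not the achieved MDE
print("Decision:", decide(pval=ret7_results[7], delta=delta_pp, mde_pp=mde_pp, alpha=alpha))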

### 10. Business impact projection

In [None]:
# Project the Day-7 retention difference onto K new players
def calculate_impact(ret7_results, K, ARPU_7):
    # Change in retained players at Day-7 per K new players (treatment minus control)
    retained_players = K * (ret7_results[3] - ret7_results[0])
    
    # Calculate the approximate revenue impact per K new players
    revenue_impact = retained_players * ARPU_7
    
    # Calculate the confidence interval for the difference per K retained players
    delta_ci = (K * ret7_results[9], K * ret7_results[10])

    return retained_players, revenue_impact, delta_ci

K = 100_000
ARPU_7 = 0.50
business_projection = calculate_impact(ret7_results, K=K, ARPU_7=ARPU_7)
print(
    f"Change in retained players at Day-7 per {K:,} new players: {business_projection[0]:.0f} [{business_projection[2][0]:.0f}, {business_projection[2][1]:.0f}]"
)

# Calculate the approximate revenue impact per 100,000 new players
print(f"Approximate revenue impact per {K:,} players: ${business_projection[1]:.0f}")

### 11. Conclusion
- **Primary metric**: Moving the gate from level 30 to level 40 **reduces Day-7 retention by 0.82 percentage points**.
- **Statistical significance**: Significant at α = 0.05 (p = 0.0016).
- **Effect**: Negative, with a magnitude below the pre-specified MDE threshold of 1.0 percentage points.

**Decision:**
**Do not roll out** the treatment. While the effect is statistically significant, it fails to meet the business threshold for implementation and actually harms retention.

#### 11.1 Limitations & next steps

- **No timestamps in dataset:** cannot perform time-to-event analysis.
- **No covariates:** cannot adjust for potential confounding or run subgroup analysis.
- **No exposure flag:** results are population estimates, effect may be diluted if many users never reach the gate.
- **No revenue data:** business-impact projection is illustrative and requires real ARPU values.

In [None]:
# Save the results tables to a CSV file
# results_table.to_csv(r"reports/results_table.csv", index=False)
# exp_sens_df.to_csv(r"reports/exposure_sensitivity.csv", index=False)