In [1]:
# Import necessary libraries
import numpy as np
import scipy.stats as stats
import pandas as pd

# 1. Data Definition
# Define the given data
data = {
    'Group': ['Control', 'Test'],
    'Users': [10000, 10000],
    'Revenue_D1': [4500, 4800],
    'Retained_D1': [3200, 3500]
}
df = pd.DataFrame(data)

# Calculate KPIs
df['ARPI_D1'] = df['Revenue_D1'] / df['Users']
df['D1_Retention'] = df['Retained_D1'] / df['Users']

print("Calculated KPIs:")
print(df)
print("\n")

# 2. Statistical Significance Assessment

## 2.1 ARPI_D1 Significance Test

# ARPI_D1 is a mean (D1 revenue per user), so the two groups are compared with a test on
# the difference in means. With 10,000 users per group, the Central Limit Theorem makes the
# sample mean approximately normal, so a two-sample z-test is appropriate.
# The standard error of the difference in means for two independent samples is:
# SE_diff = sqrt( (s1^2 / n1) + (s2^2 / n2) )
# where s1 and s2 are the standard deviations of individual user revenue in each group.

# Limitation: only aggregate revenue is available, so s1 and s2 cannot be computed from
# the data. With user-level revenue, the sample variance would give the standard error
# directly, and a bootstrap or a skew-aware model (e.g., Gamma or log-normal) would be
# more robust for skewed revenue.

# To proceed, the standard deviation of individual D1 revenue has to be assumed.
# One option is a coefficient of variation (CV), e.g., CV = 2.0 gives
# std_dev_individual = CV * ARPI_D1 and Var(ARPI_D1) = (CV * ARPI_D1)^2 / N.
# The first attempt below instead assumes a constant $1.0 for both groups.
s_ind_rev = 1.0 # Hypothetical standard deviation of individual user's D1 revenue
# This is a very strong assumption and if I had to do this in a real scenario,
# I would strongly push back for more granular data or make a more data-driven assumption.
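
# A minimal sketch of the bootstrap mentioned above, for the case where user-level D1
# revenue becomes available. The function name, its parameters, and the per-user revenue
# arrays it expects are hypothetical (they are not part of the given data). The function
# is only defined here and is not called, so the results below are unaffected.
import numpy as np

def bootstrap_mean_diff_ci(control_rev, test_rev, n_boot=2000, conf_level=0.95, seed=0):
    """Percentile bootstrap CI for mean(test_rev) - mean(control_rev)."""
    rng = np.random.default_rng(seed)
    control_rev = np.asarray(control_rev, dtype=float)
    test_rev = np.asarray(test_rev, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample users with replacement, independently within each group
        c = rng.choice(control_rev, size=control_rev.size, replace=True)
        t = rng.choice(test_rev, size=test_rev.size, replace=True)
        diffs[i] = t.mean() - c.mean()
    tail = (1 - conf_level) / 2
    lower, upper = np.quantile(diffs, [tail, 1 - tail])
    return lower, upper

# Example call (control_rev and test_rev would be arrays of per-user D1 revenue):
# lower, upper = bootstrap_mean_diff_ci(control_rev, test_rev)
# An interval that excludes 0 would indicate a difference in ARPI_D1 without assuming a std dev.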

# Standard error of ARPI_D1 (mean revenue per user) = s_ind_rev / sqrt(N)
se_control_arpi = s_ind_rev / np.sqrt(df.loc[0, 'Users'])
se_test_arpi = s_ind_rev / np.sqrt(df.loc[1, 'Users'])

# Z-statistic for difference in means
diff_arpi = df.loc[1, 'ARPI_D1'] - df.loc[0, 'ARPI_D1']
se_diff_arpi = np.sqrt(se_control_arpi**2 + se_test_arpi**2)
z_arpi = diff_arpi / se_diff_arpi
p_value_arpi = 2 * (1 - stats.norm.cdf(abs(z_arpi))) # Two-tailed test

print(f"ARPI_D1 Control: {df.loc[0, 'ARPI_D1']:.4f}")
print(f"ARPI_D1 Test: {df.loc[1, 'ARPI_D1']:.4f}")
print(f"Difference in ARPI_D1: {diff_arpi:.4f}")
print(f"Z-statistic for ARPI_D1: {z_arpi:.4f}")
print(f"P-value for ARPI_D1: {p_value_arpi:.4f} (based on hypothetical individual revenue std dev)\n")

# The p-value above depends entirely on the assumed standard deviation. The statement
# "ARPI is approximately normally distributed" is read here as applying to the *sample mean*
# (via the Central Limit Theorem); it does not supply the standard error needed for the test.
# Aggregate totals (4500 vs. 4800) give the means but not the variance of individual revenue.

# With user-level data, the options would be a bootstrap, the delta method (if ARPI is treated
# as a ratio of two random quantities), or a model of the revenue distribution
# (e.g., Poisson or Negative Binomial for counts, log-normal for skewed continuous values).
# A standard deviation taken from historical data or industry benchmarks would also be defensible.

# Alternative assumption (computed for reference, not used in the test below): the standard
# deviation of individual D1 revenue is 2 times the group's ARPI, reflecting typical
# skewed revenue distributions.
std_dev_individual_revenue_control = 2 * df.loc[0, 'ARPI_D1']
std_dev_individual_revenue_test = 2 * df.loc[1, 'ARPI_D1']
# The test below uses a single assumed standard deviation for both groups instead,
# which corresponds to the equal-variance case.
std_dev_user_revenue = 1.0 # First assumed value ($1.0); superseded by s_ind_revenue_assumed below.

n_control = df.loc[0, 'Users']
n_test = df.loc[1, 'Users']

mean_arpi_control = df.loc[0, 'ARPI_D1']
mean_arpi_test = df.loc[1, 'ARPI_D1']

# Z-test for ARPI_D1 (difference in means, with assumed std dev for individual revenue)
# A z-test treats the standard deviation of individual revenue as known. A t-test would
# need sample standard deviations, which aggregate data cannot provide.
# Second attempt: assume a standard deviation of $1.5 per user for both groups.
# The value is arbitrary; it is used to show how strongly the conclusion depends on it.
s_ind_revenue_assumed = 1.5

se_control_arpi = s_ind_revenue_assumed / np.sqrt(df.loc[0, 'Users'])
se_test_arpi = s_ind_revenue_assumed / np.sqrt(df.loc[1, 'Users'])

diff_arpi = mean_arpi_test - mean_arpi_control
se_diff_arpi = np.sqrt(se_control_arpi**2 + se_test_arpi**2)
z_arpi = diff_arpi / se_diff_arpi
p_value_arpi = 2 * (1 - stats.norm.cdf(abs(z_arpi))) # Two-tailed test

print(f"ARPI_D1 Significance Test (assuming individual revenue std dev = ${s_ind_revenue_assumed}):")
print(f"  Control ARPI_D1: ${mean_arpi_control:.4f}")
print(f"  Test ARPI_D1: ${mean_arpi_test:.4f}")
print(f"  Difference: ${diff_arpi:.4f}")
print(f"  Z-statistic: {z_arpi:.4f}")
print(f"  P-value: {p_value_arpi:.4f}")
alpha = 0.05
if p_value_arpi < alpha:
    print(f"  Result: Statistically significant improvement in ARPI_D1 at alpha={alpha}")
else:
    print(f"  Result: No statistically significant improvement in ARPI_D1 at alpha={alpha}")
print("\n")
# Important note: The ARPI_D1 result is highly dependent on the assumed individual revenue standard deviation.
# In a real scenario, this would be derived from actual user-level revenue data.
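
# The flip between the two assumed standard deviations ($1.0 vs. $1.5) can be made explicit
# with a small sensitivity check. This is a sketch under the same equal-variance, equal-n
# z-test assumption used above; both function names are hypothetical helpers, and they are
# only defined here (nothing is printed).
import numpy as np
import scipy.stats as stats

def arpi_z_test_for_sigma(diff, n_per_group, sigma):
    """Return (z, two-sided p-value) for a mean difference with an assumed common sigma."""
    se_diff = sigma * np.sqrt(2.0 / n_per_group)
    z = diff / se_diff
    p = 2 * stats.norm.sf(abs(z))
    return z, p

def max_sigma_for_significance(diff, n_per_group, alpha=0.05):
    """Largest assumed sigma at which the observed difference is still significant."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return abs(diff) / (z_crit * np.sqrt(2.0 / n_per_group))

# Example: max_sigma_for_significance(0.03, 10000) gives the break-even standard deviation;
# any assumed value above it makes the ARPI_D1 result non-significant at alpha = 0.05.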

## 2.2 D1 Retention Significance Test (Proportion Test)

# Number of users
n_control = df.loc[0, 'Users']
n_test = df.loc[1, 'Users']

# Number of retained users
retained_control = df.loc[0, 'Retained_D1']
retained_test = df.loc[1, 'Retained_D1']

# D1 Retention rates
p_control = df.loc[0, 'D1_Retention']
p_test = df.loc[1, 'D1_Retention']

# Z-test for difference in proportions
# Pooled proportion
p_pooled = (retained_control + retained_test) / (n_control + n_test)

# Standard error of the difference in proportions
se_diff_retention = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_test))

# Z-statistic
z_retention = (p_test - p_control) / se_diff_retention

# P-value (two-tailed test)
p_value_retention = 2 * (1 - stats.norm.cdf(abs(z_retention)))

print("D1 Retention Significance Test:")
print(f"  Control D1 Retention: {p_control:.4f}")
print(f"  Test D1 Retention: {p_test:.4f}")
print(f"  Difference: {p_test - p_control:.4f}")
print(f"  Z-statistic: {z_retention:.4f}")
print(f"  P-value: {p_value_retention:.4f}")
if p_value_retention < alpha:
    print(f"  Result: Statistically significant improvement in D1 Retention at alpha={alpha}")
else:
    print(f"  Result: No statistically significant improvement in D1 Retention at alpha={alpha}")
print("\n")
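
# A confidence interval complements the p-value by showing the plausible size of the uplift.
# The sketch below computes a Wald interval for the difference in proportions using the
# unpooled standard error (the pooled SE above is specific to the null hypothesis).
# The function name and its parameters are hypothetical; it is only defined, not called.
import numpy as np
import scipy.stats as stats

def retention_diff_ci(x_control, n_control_users, x_test, n_test_users, conf_level=0.95):
    """Wald confidence interval for (test retention rate - control retention rate)."""
    p_c = x_control / n_control_users
    p_t = x_test / n_test_users
    se = np.sqrt(p_c * (1 - p_c) / n_control_users + p_t * (1 - p_t) / n_test_users)
    z = stats.norm.ppf(1 - (1 - conf_level) / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Example: retention_diff_ci(3200, 10000, 3500, 10000)
# An interval that excludes 0 agrees with a significant two-sided z-test at the same level.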


# 3. Calculate the power of the test for both metrics. Decide whether the test had sufficient sample size.

## 3.1 Power for ARPI_D1

# To calculate power, we need:
# - Alpha level (significance level)
# - Effect size (observed difference)
# - Standard deviation of the metric (or standard error of the difference)
# - Sample size

# Given the assumed standard deviation (s_ind_revenue_assumed), we can calculate power.
# Effect size (delta) = diff_arpi
# Standard deviation of the difference (sigma_diff) = se_diff_arpi
# Z_alpha/2 for two-tailed test
alpha = 0.05
beta = 0.2 # Common target for 80% power
z_alpha_div_2 = stats.norm.ppf(1 - alpha/2)

# Power of a two-sided two-sample z-test for a true effect delta and standard error sigma_diff:
# Power = Phi(delta/sigma_diff - Z_alpha/2) + Phi(-delta/sigma_diff - Z_alpha/2)
# Power is normally computed for a pre-specified *minimum detectable effect (MDE)*.
# Here the observed difference is used as the hypothesized true effect, so
# delta/sigma_diff equals the observed z-statistic (z_arpi).
power_arpi = stats.norm.cdf(z_arpi - z_alpha_div_2) + stats.norm.cdf(-z_arpi - z_alpha_div_2)
# The second term is the probability of rejecting in the opposite tail. It is small when
# the observed effect is positive, so Phi(z_arpi - Z_alpha/2) alone is a close approximation.
# Equivalently, power = 1 - beta, where beta = P(Type II error) and the test rejects when
# |Z| > Z_critical.
# `statsmodels.stats.power` provides ready-made solvers (e.g., `NormalIndPower`), but this
# notebook only uses scipy, so the formula is applied manually.
# Calculate critical Z-value
z_critical = stats.norm.ppf(1 - alpha/2)

# Power calculation using the observed effect as the true effect. Reference:
# https://stats.stackexchange.com/questions/213512/how-to-calculate-power-for-a-two-sample-t-test-given-summary-statistics
# Inputs for the ARPI_D1 power calculation:
# Effect size = diff_arpi (0.03)
# Standard error of the difference for a two-sample z-test:
# se_diff = sqrt(sigma1^2/n1 + sigma2^2/n2)
# With sigma1 = sigma2 = s_ind_revenue_assumed = 1.5 and n1 = n2 = 10000:
# se_diff_arpi = sqrt((1.5**2 / 10000) + (1.5**2 / 10000)) = 0.021213
# Recompute se_diff_arpi explicitly from the assumed standard deviation.
se_diff_arpi = np.sqrt((s_ind_revenue_assumed**2 / n_control) + (s_ind_revenue_assumed**2 / n_test))
# This matches the se_diff_arpi used in the significance test.
# Power for a two-sided test, with the observed z-statistic z_arpi = diff_arpi / se_diff_arpi:
# Power = Phi(z_arpi - z_critical) + Phi(-z_arpi - z_critical)
# The formula is symmetric in z_arpi, so it holds for either sign of the difference.
power_arpi_calculated = stats.norm.cdf(z_arpi - z_critical) + (1 - stats.norm.cdf(z_arpi + z_critical))
# Since 1 - Phi(x) = Phi(-x), this is identical to:
# power_arpi_calculated = stats.norm.cdf(z_arpi - z_critical) + stats.norm.cdf(-z_arpi - z_critical)
# It treats the observed effect as the true effect, so it is an observed (post-hoc) power.

print(f"Power for ARPI_D1 (based on observed effect and assumed individual revenue std dev = ${s_ind_revenue_assumed}):")
print(f"  Observed Effect (diff_arpi): {diff_arpi:.4f}")
print(f"  Standard Error of Difference (SE_diff_arpi): {se_diff_arpi:.4f}")
print(f"  Calculated Power: {power_arpi_calculated:.4f}")
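
# The power formula used above can be wrapped in a small reusable helper. This is a sketch
# of the same two-sided z-test power expression; the function name and parameters are
# hypothetical, and the function is only defined here (nothing is printed).
import scipy.stats as stats

def two_sided_z_power(z_effect, alpha=0.05):
    """Power of a two-sided z-test when the true effect, in SE units, is z_effect."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Probability of rejecting in the upper tail plus the lower tail under the alternative
    return stats.norm.cdf(z_effect - z_crit) + stats.norm.cdf(-z_effect - z_crit)

# Example: two_sided_z_power(z_arpi) reproduces power_arpi_calculated.
# With z_effect = 0 the function returns alpha, the false-positive rate under the null.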

# Sample size calculation for ARPI_D1 (for 80% power and observed effect)
# Formula: N_per_group = ( (Z_alpha/2 + Z_beta)^2 * (sigma1^2 + sigma2^2) ) / (delta^2)
# Assuming sigma1 = sigma2 = s_ind_revenue_assumed:
# N_per_group = (Z_alpha/2 + Z_beta)^2 * (2 * s_ind_revenue_assumed^2) / (diff_arpi^2)
# Z_beta is the standard normal quantile at the target power: stats.norm.ppf(1 - beta).
z_beta = stats.norm.ppf(0.8) # For 80% power (beta=0.2)
# With beta = 0.2 (80% power), stats.norm.ppf(1 - 0.2) = stats.norm.ppf(0.8).
# The two-sided nature of the test enters through z_critical (Z_alpha/2), not through z_beta.
required_n_arpi = ((z_critical + z_beta)**2 * 2 * (s_ind_revenue_assumed**2)) / (diff_arpi**2)
print(f"  Required Sample Size per group for 80% Power (ARPI_D1): {int(np.ceil(required_n_arpi)):,}")
print(f"  Current Sample Size per group: {n_control:,}")
if n_control >= required_n_arpi:
    print("  Conclusion: Sufficient sample size for ARPI_D1 (based on observed effect and assumed std dev).")
else:
    print("  Conclusion: Insufficient sample size for ARPI_D1 (based on observed effect and assumed std dev).")
print("\n")
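
# Inverting the sample size formula gives the minimum detectable effect (MDE) for the
# sample that was actually collected. This is a sketch under the same assumptions as above
# (equal group sizes, a common assumed sigma, and the far rejection tail ignored); the
# function name and parameters are hypothetical, and nothing is printed.
import numpy as np
import scipy.stats as stats

def mde_for_mean_diff(sigma, n_per_group, alpha=0.05, power=0.8):
    """Smallest true mean difference detectable with the given power (two-sided z-test)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return (z_crit + z_power) * sigma * np.sqrt(2.0 / n_per_group)

# Example: mde_for_mean_diff(s_ind_revenue_assumed, n_control)
# Comparing the MDE with the observed difference shows how far the test is from 80% power.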


## 3.2 Power for D1 Retention

# Effect size (delta) = p_test - p_control
# Standard error of the difference (sigma_diff) = se_diff_retention
# For proportions, the standard error is sqrt(p_pooled * (1-p_pooled) * (1/n1 + 1/n2))

# Power calculation for D1 Retention
# z_critical = stats.norm.ppf(1 - alpha/2) (already calculated)
power_retention_calculated = stats.norm.cdf(z_retention - z_critical) + (1 - stats.norm.cdf(z_retention + z_critical))
# The two-sided formula is symmetric in z_retention, so it holds for either sign of the
# observed difference. z_retention was computed above from p_test - p_control.

print("Power for D1 Retention (based on observed effect):")
print(f"  Observed Effect (diff_retention): {p_test - p_control:.4f}")
print(f"  Standard Error of Difference (SE_diff_retention): {se_diff_retention:.4f}")
print(f"  Calculated Power: {power_retention_calculated:.4f}")

# Sample size calculation for D1 Retention (for 80% power and observed effect)
# Standard formula for two proportions, using unpooled variances:
# n_per_group = ((Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2))) / (delta^2)
# The observed proportions p_control and p_test are used as the 'true' proportions.
required_n_retention = ((z_critical + z_beta)**2 * (p_control*(1-p_control) + p_test*(1-p_test))) / ((p_test - p_control)**2)
print(f"  Required Sample Size per group for 80% Power (D1 Retention): {int(np.ceil(required_n_retention)):,}")
print(f"  Current Sample Size per group: {n_control:,}")
if n_control >= required_n_retention:
    print("  Conclusion: Sufficient sample size for D1 Retention (based on observed effect).")
else:
    print("  Conclusion: Insufficient sample size for D1 Retention (based on observed effect).")
print("\n")

# 4. Summarize findings and recommendation

print("--- Summary of Findings and Recommendation ---")

print("1. Statistical Significance Assessment:")
print(f"   - ARPI_D1: P-value = {p_value_arpi:.4f}. With alpha = {alpha}, the feature is {'statistically significant' if p_value_arpi < alpha else 'NOT statistically significant'} for ARPI_D1.")
print("     (Note: This result for ARPI_D1 is based on an assumed standard deviation of individual D1 revenue. In a real scenario, this assumption would need to be validated with granular data.)")
print(f"   - D1 Retention: P-value = {p_value_retention:.4f}. With alpha = {alpha}, the feature is {'statistically significant' if p_value_retention < alpha else 'NOT statistically significant'} for D1 Retention.")

print("\n2. Power of the Test and Sample Size:")
print(f"   - ARPI_D1: Calculated Power = {power_arpi_calculated:.4f}. Required sample size for 80% power for the observed effect = {int(np.ceil(required_n_arpi)):,}.")
if n_control >= required_n_arpi:
    print("     The current sample size of 10,000 users per group was sufficient for ARPI_D1 (given the assumed individual revenue std dev).")
else:
    print("     The current sample size of 10,000 users per group was INSUFFICIENT for ARPI_D1 (given the assumed individual revenue std dev).")
print(f"   - D1 Retention: Calculated Power = {power_retention_calculated:.4f}. Required sample size for 80% power for the observed effect = {int(np.ceil(required_n_retention)):,}.")
if n_control >= required_n_retention:
    print("     The current sample size of 10,000 users per group was sufficient for D1 Retention.")
else:
    print("     The current sample size of 10,000 users per group was INSUFFICIENT for D1 Retention.")

print("\n3. Recommendation:")
if p_value_retention < alpha and p_value_arpi < alpha: # Both significant
    print("a. Should the feature be rolled out? YES.")
    print("b. Is there enough evidence? YES. Both D1 Retention and ARPI_D1 show a statistically significant improvement.")
elif p_value_retention < alpha and p_value_arpi >= alpha: # Retention significant, ARPI not
    print("a. Should the feature be rolled out? CONSIDER CAREFULLY. While D1 Retention shows a statistically significant improvement, ARPI_D1 does not (based on the current assumptions).")
    print("b. Is there enough evidence? Partially. There is strong evidence for D1 Retention improvement. For ARPI_D1, the evidence is not strong enough for statistical significance under the current assumptions.")
    print("c. If not significant (for ARPI_D1):")
    print("   - Investigate the assumed individual revenue standard deviation for ARPI_D1 more accurately, potentially by getting granular user-level data.")
    print("   - If the current data setup is robust, consider running the test for a longer duration to gather more data and increase the sample size effectively, or increase the sample size for a new test.")
    print("   - Analyze qualitative feedback about the feature to understand user perception and identify potential reasons for non-significant ARPI improvement.")
    print("   - Re-evaluate the feature's potential business impact. If the D1 retention uplift is valuable enough on its own, it might warrant rollout even without ARPI_D1 significance, but this depends on business priorities.")
elif p_value_retention >= alpha and p_value_arpi < alpha: # ARPI significant, Retention not
    print("a. Should the feature be rolled out? CONSIDER CAREFULLY. While ARPI_D1 shows a statistically significant improvement (based on assumptions), D1 Retention does not.")
    print("b. Is there enough evidence? Partially. There is strong evidence for ARPI_D1 improvement (under current assumptions). For D1 Retention, the evidence is not strong enough.")
    print("c. If not significant (for D1 Retention):")
    print("   - Analyze why retention was not impacted. Is the feature engaging enough? Are there any bugs or usability issues?")
    print("   - Consider iterating on the feature to improve retention, or gather more data to see if the current trend reaches significance with a larger sample.")
    print("   - If the ARPI_D1 uplift is deemed very valuable, and the D1 retention is not worse, it might still be considered for rollout, but with monitoring.")
else: # Neither significant
    print("a. Should the feature be rolled out? NO, NOT YET.")
    print("b. Is there enough evidence? NO. Neither D1 Retention nor ARPI_D1 show a statistically significant improvement.")
    print("c. If not significant, what would you do next?")
    print("   - Conduct a deeper dive into user behavior within the test group to understand why the feature didn't perform as expected. This could involve qualitative research (surveys, user interviews).")
    print("   - Check for implementation issues or bugs in the feature that might be skewing results.")
    print("   - Refine the feature based on insights and re-run the A/B test with an updated version.")
    print("   - Consider increasing the sample size for a new test or extending the duration of the current test if the observed effect sizes are close to the minimum detectable effect you'd be interested in.")
    print("   - Re-evaluate the initial hypothesis for the feature. Is it truly expected to impact these KPIs?")
    print("   - Explore other KPIs that might be affected by the feature that were not measured in this test.")

Calculated KPIs:
     Group  Users  Revenue_D1  Retained_D1  ARPI_D1  D1_Retention
0  Control  10000        4500         3200     0.45          0.32
1     Test  10000        4800         3500     0.48          0.35


ARPI_D1 Control: 0.4500
ARPI_D1 Test: 0.4800
Difference in ARPI_D1: 0.0300
Z-statistic for ARPI_D1: 2.1213
P-value for ARPI_D1: 0.0339 (based on hypothetical individual revenue std dev)

ARPI_D1 Significance Test (assuming individual revenue std dev = $1.5):
  Control ARPI_D1: $0.4500
  Test ARPI_D1: $0.4800
  Difference: $0.0300
  Z-statistic: 1.4142
  P-value: 0.1573
  Result: No statistically significant improvement in ARPI_D1 at alpha=0.05


D1 Retention Significance Test:
  Control D1 Retention: 0.3200
  Test D1 Retention: 0.3500
  Difference: 0.0300
  Z-statistic: 4.4944
  P-value: 0.0000
  Result: Statistically significant improvement in D1 Retention at alpha=0.05


Power for ARPI_D1 (based on observed effect and assumed individual revenue std dev = $1.5):
  Observe