# Case Study 1: E-Commerce Conversion Rate A/B Test

## Scenario
An e-commerce company wants to test whether a **new checkout page design** increases the purchase conversion rate. The current conversion rate is approximately **13%**. The product team believes the new design could improve it by at least **2 percentage points**.

**Tests used:** Z-test for proportions, Chi-Square test, Fisher's Exact Test

## 1. Business Understanding & Hypothesis

**Business context:** The checkout page is the final step before purchase. Even small improvements in conversion rate can translate to significant revenue.

**Hypotheses:**
- $H_0$: The new checkout page has no effect on conversion rate ($p_{treatment} = p_{control}$)
- $H_a$: The new checkout page has a different conversion rate ($p_{treatment} \neq p_{control}$)

**PICOT:**
- **P**opulation: Users who reach the checkout page
- **I**ntervention: New checkout page design
- **C**omparison: Old checkout page (control)
- **O**utcome: Purchase conversion rate
- **T**ime: 2 weeks

## 2. Experiment Design

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
from statsmodels.stats.power import NormalIndPower
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

In [None]:
# Experiment parameters
baseline_conversion = 0.13    # Current conversion rate
mde = 0.02                    # Minimum detectable effect (2 percentage points)
alpha = 0.05                  # Significance level
power = 0.80                  # Statistical power

# Calculate effect size (Cohen's h for proportions)
from statsmodels.stats.proportion import proportion_effectsize
effect_size = proportion_effectsize(baseline_conversion + mde, baseline_conversion)
print(f"Effect size (Cohen's h): {effect_size:.4f}")

# Calculate required sample size per group
analysis = NormalIndPower()
required_n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative='two-sided'
)
required_n = int(np.ceil(required_n))
print(f"Required sample size per group: {required_n:,}")
print(f"Total sample size needed: {required_n * 2:,}")

# Estimate test duration
daily_visitors = 2000
duration_days = np.ceil((required_n * 2) / daily_visitors)
print(f"\nWith {daily_visitors:,} daily visitors:")
print(f"Estimated duration: {duration_days:.0f} days")
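
As a cross-check on the `solve_power` result, the same sample size can be approximated in closed form as $n = 2\left(\frac{z_{1-\alpha/2} + z_{\text{power}}}{h}\right)^2$ per group. The sketch below uses only the standard library. The helper name `sample_size_per_group` is hypothetical, and because `solve_power` solves the power equation numerically, its answer may differ from this approximation by a unit or so.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, mde, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test of proportions."""
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * math.asin(math.sqrt(p_baseline + mde)) - 2 * math.asin(math.sqrt(p_baseline))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = NormalDist().inv_cdf(power)          # quantile matching the target power
    # Equal group sizes: n = 2 * ((z_alpha + z_power) / h)^2, rounded up
    return math.ceil(2 * ((z_alpha + z_power) / h) ** 2)


n_approx = sample_size_per_group(0.13, 0.02)
print(f"Approximate sample size per group: {n_approx:,}")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE should be agreed with the product team before the test starts.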

## 3. Generate Simulated Data

In practice, this data comes from the experiment platform. Here we simulate it.

In [None]:
n_control = 4700
n_treatment = 4700

# Simulate: treatment has a true improvement of 2 percentage points (13% -> 15%)
control_conversions = np.random.binomial(1, 0.13, n_control)
treatment_conversions = np.random.binomial(1, 0.15, n_treatment)

df = pd.DataFrame({
    'group': ['control'] * n_control + ['treatment'] * n_treatment,
    'converted': np.concatenate([control_conversions, treatment_conversions])
})

print(f"Dataset shape: {df.shape}")
print(f"\nGroup sizes:")
print(df['group'].value_counts())
print(f"\nConversion rates:")
print(df.groupby('group')['converted'].mean())

## 4. Validity Checks

In [None]:
# Sample Ratio Mismatch (SRM) check using Chi-square goodness-of-fit
observed = [n_control, n_treatment]
expected_ratio = [0.5, 0.5]
total = n_control + n_treatment
expected_counts = [total * r for r in expected_ratio]

srm_chi2, srm_pval = stats.chisquare(observed, expected_counts)
print("=== Sample Ratio Mismatch Check ===")
print(f"Observed: Control={n_control}, Treatment={n_treatment}")
print(f"Expected: 50/50 split = {total//2} each")
print(f"Chi-square statistic: {srm_chi2:.4f}")
print(f"p-value: {srm_pval:.4f}")
print(f"Result: {'PASS - No SRM detected' if srm_pval > 0.01 else 'FAIL - SRM detected!'}")

## 5. Statistical Analysis

### 5.1 Z-Test for Proportions

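The two-proportion Z-test is the most common test for comparing conversion rates in A/B testing. It is appropriate when:
- Metric is binary (converted / not converted)
- Sample size is large enough for the normal approximation (rule of thumb: at least 10 expected conversions and 10 expected non-conversions per group)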
- Observations are independent
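
The sample-size condition can be checked before running the test. The sketch below applies a common rule of thumb (at least 10 expected successes and 10 expected failures per group; some texts use 5). The helper name `normal_approx_ok` and the threshold default are assumptions for illustration.

```python
def normal_approx_ok(n, p, min_count=10):
    """Rule-of-thumb check: expect at least `min_count` successes and failures."""
    return n * p >= min_count and n * (1 - p) >= min_count


# A group of 4,700 users at the 13% baseline: 611 expected conversions
print(normal_approx_ok(4700, 0.13))
# A rare-event case: 200 users at a 0.1% rate gives only 0.2 expected conversions
print(normal_approx_ok(200, 0.001))
```

When the check fails, an exact test such as Fisher's is the safer choice.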

In [None]:
# Compute observed values
control_data = df[df['group'] == 'control']['converted']
treatment_data = df[df['group'] == 'treatment']['converted']

successes = np.array([control_data.sum(), treatment_data.sum()])
nobs = np.array([len(control_data), len(treatment_data)])

p_control = successes[0] / nobs[0]
p_treatment = successes[1] / nobs[1]
p_diff = p_treatment - p_control
relative_lift = (p_treatment - p_control) / p_control * 100

print("=== Descriptive Statistics ===")
print(f"Control:   {successes[0]:,} / {nobs[0]:,} = {p_control:.4f} ({p_control*100:.2f}%)")
print(f"Treatment: {successes[1]:,} / {nobs[1]:,} = {p_treatment:.4f} ({p_treatment*100:.2f}%)")
print(f"Absolute difference: {p_diff:.4f} ({p_diff*100:.2f} pp)")
print(f"Relative lift: {relative_lift:.2f}%")

In [None]:
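# Z-test for proportions (two-sided). Counts are ordered [control, treatment],
# so a negative Z-statistic means the treatment rate is higher.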
z_stat, z_pval = proportions_ztest(successes, nobs, alternative='two-sided')

print("=== Z-Test for Proportions ===")
print(f"Z-statistic: {z_stat:.4f}")
print(f"p-value: {z_pval:.4f}")
print(f"\nAt alpha = {alpha}:")
if z_pval < alpha:
    print(f"REJECT H0 - The difference IS statistically significant (p={z_pval:.4f} < {alpha})")
else:
    print(f"FAIL TO REJECT H0 - The difference is NOT statistically significant (p={z_pval:.4f} >= {alpha})")

# Confidence intervals for each group's conversion rate (normal approximation; used in the plots)
ci_control = proportion_confint(successes[0], nobs[0], alpha=alpha, method='normal')
ci_treatment = proportion_confint(successes[1], nobs[1], alpha=alpha, method='normal')

# CI for the difference in proportions
se_diff = np.sqrt(p_control*(1-p_control)/nobs[0] + p_treatment*(1-p_treatment)/nobs[1])
z_crit = stats.norm.ppf(1 - alpha/2)
ci_diff = (p_diff - z_crit * se_diff, p_diff + z_crit * se_diff)

print(f"\n95% CI for difference: ({ci_diff[0]:.4f}, {ci_diff[1]:.4f})")
print(f"95% CI for difference: ({ci_diff[0]*100:.2f}pp, {ci_diff[1]*100:.2f}pp)")

### 5.2 Chi-Square Test

The chi-square test of independence is an alternative for categorical data. It tests whether the distribution of outcomes differs between groups.

In [None]:
# Build contingency table
contingency_table = pd.crosstab(df['group'], df['converted'], margins=True)
contingency_table.columns = ['Not Converted', 'Converted', 'Total']
contingency_table.index = ['Control', 'Treatment', 'Total']
print("=== Contingency Table ===")
print(contingency_table)

# Chi-square test
table = [[successes[0], nobs[0] - successes[0]],
         [successes[1], nobs[1] - successes[1]]]

chi2, chi2_pval, dof, expected = stats.chi2_contingency(table)

print(f"\n=== Chi-Square Test ===")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {chi2_pval:.4f}")
print(f"\nExpected frequencies:")
# Column order matches `table`: successes (converted) first, then failures
print(pd.DataFrame(expected, columns=['Converted', 'Not Converted'],
                    index=['Control', 'Treatment']).round(1))

print(f"\nAll expected frequencies >= 5: {(np.array(expected) >= 5).all()}")
if chi2_pval < alpha:
    print(f"\nREJECT H0 - Statistically significant (p={chi2_pval:.4f} < {alpha})")
else:
    print(f"\nFAIL TO REJECT H0 - Not significant (p={chi2_pval:.4f} >= {alpha})")

### 5.3 Fisher's Exact Test

Fisher's Exact Test is included here for demonstration. In practice, use it when any expected cell frequency is below 5.

In [None]:
odds_ratio, fisher_pval = stats.fisher_exact(table)

print("=== Fisher's Exact Test ===")
print(f"Odds ratio (control odds / treatment odds): {odds_ratio:.4f}")
print(f"p-value: {fisher_pval:.4f}")
print(f"\nNote: With large samples, Fisher's and Chi-square give very similar results.")
print(f"Fisher's is preferred when any expected cell count < 5.")

### 5.4 Comparison of All Three Tests

In [None]:
results = pd.DataFrame({
    'Test': ['Z-test (proportions)', 'Chi-Square', "Fisher's Exact"],
    'Statistic': [z_stat, chi2, odds_ratio],
    'p-value': [z_pval, chi2_pval, fisher_pval],
    'Significant (alpha=0.05)': [z_pval < alpha, chi2_pval < alpha, fisher_pval < alpha]
})
print("=== Test Comparison ===")
print(results.to_string(index=False))
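print("\nNote: Z^2 ≈ Chi-square statistic for 2x2 tables. They match exactly only with")
print("chi2_contingency(..., correction=False); the default applies Yates' continuity correction.")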
print(f"Z^2 = {z_stat**2:.4f}, Chi2 = {chi2:.4f}")

## 6. Visualization

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Conversion rates with CI
groups = ['Control', 'Treatment']
rates = [p_control, p_treatment]
errors = [
    [p_control - ci_control[0], ci_control[1] - p_control],
    [p_treatment - ci_treatment[0], ci_treatment[1] - p_treatment]
]
errors = np.array(errors).T

bars = axes[0].bar(groups, rates, color=['#3498db', '#e74c3c'], width=0.5, alpha=0.8)
axes[0].errorbar(groups, rates, yerr=errors, fmt='none', color='black', capsize=5)
axes[0].set_ylabel('Conversion Rate')
axes[0].set_title('Conversion Rate by Group (with 95% CI)')
for bar, rate in zip(bars, rates):
    axes[0].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.002,
                f'{rate:.2%}', ha='center', va='bottom', fontweight='bold')

# Plot 2: CI for the difference
axes[1].errorbar(0, p_diff, yerr=[[p_diff - ci_diff[0]], [ci_diff[1] - p_diff]],
                 fmt='o', color='#2ecc71', markersize=10, capsize=10, linewidth=2)
axes[1].axhline(y=0, color='red', linestyle='--', alpha=0.7, label='No effect')
axes[1].set_xlim(-0.5, 0.5)
axes[1].set_ylabel('Difference in Conversion Rate')
axes[1].set_title('95% CI for Difference (Treatment - Control)')
axes[1].set_xticks([])
axes[1].legend()

# Plot 3: p-value comparison
test_names = ['Z-test', 'Chi-Square', "Fisher's"]
pvalues = [z_pval, chi2_pval, fisher_pval]
colors = ['#27ae60' if p < alpha else '#e74c3c' for p in pvalues]
axes[2].barh(test_names, pvalues, color=colors, alpha=0.8)
axes[2].axvline(x=alpha, color='red', linestyle='--', label=f'alpha = {alpha}')
axes[2].set_xlabel('p-value')
axes[2].set_title('p-values Across Tests')
axes[2].legend()

plt.tight_layout()
plt.show()

## 7. Conclusion & Recommendation

**Statistical conclusion:** All three tests agree. We have sufficient evidence to reject the null hypothesis at the 5% significance level.

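**Practical significance:** The observed lift is approximately 2 percentage points (roughly a 15% relative lift over the 13% baseline), which aligns with our MDE. The 95% confidence interval for the difference gives the range of plausible true effects, so its lower bound is the conservative figure to plan around.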

**Recommendation:** Launch the new checkout page design. The improvement is both statistically and practically significant.

---

## Interview Follow-Up Questions & Answers

### Q1: Why did you choose a Z-test instead of a t-test for this problem?

**Answer:**

The Z-test for proportions is the correct choice when comparing **binary outcomes** (converted vs. not converted) between two groups with **large sample sizes**. The t-test is designed for **continuous metrics** (like average revenue). Since our metric is a proportion (conversion rate), the Z-test is the natural fit.

Additionally, with samples this large (hundreds of expected conversions and non-conversions in each group), the sampling distribution of the sample proportion is approximately normal by the Central Limit Theorem, making the Z-test appropriate.

### Q2: What is the relationship between the Z-test and the Chi-square test for a 2x2 table?

**Answer:**

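For a 2x2 contingency table, the **Pearson chi-square statistic (without continuity correction) equals the square of the pooled Z-statistic**: $\chi^2 = Z^2$. The two-sided Z-test and the uncorrected chi-square test therefore give the same p-value. If a continuity correction (such as Yates') is applied to the chi-square test, the two differ slightly. The differences are:

- The **Z-test** can be directional (you can run one-tailed tests)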
- The **Chi-square test** is always two-tailed and extends naturally to larger tables (e.g., comparing 3+ groups)

In practice for a standard two-group A/B test, they are interchangeable.
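
The identity can be verified numerically. The sketch below computes both statistics by hand with the standard library, using hypothetical counts chosen only for illustration. It holds for the pooled Z-statistic and the Pearson statistic without continuity correction; `scipy.stats.chi2_contingency` applies Yates' correction to 2x2 tables by default, so `correction=False` is needed to reproduce it there.

```python
import math

# Hypothetical counts, chosen only for illustration (13% vs 15%)
conv_c, n_c = 611, 4700
conv_t, n_t = 705, 4700

# Pooled two-proportion z-statistic
p_c, p_t = conv_c / n_c, conv_t / n_t
p_pool = (conv_c + conv_t) / (n_c + n_t)
z = (p_t - p_c) / math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))

# Pearson chi-square statistic, no continuity correction
observed = [[conv_c, n_c - conv_c], [conv_t, n_t - conv_t]]
total = n_c + n_t
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
chi2 = sum(
    (observed[i][j] - row_sums[i] * col_sums[j] / total) ** 2
    / (row_sums[i] * col_sums[j] / total)
    for i in range(2)
    for j in range(2)
)

print(f"z^2  = {z ** 2:.6f}")
print(f"chi2 = {chi2:.6f}")
```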

### Q3: When would you use Fisher's Exact Test instead of Chi-square?

**Answer:**

Use Fisher's Exact Test when any **expected cell frequency in the contingency table is less than 5**. This typically happens with:
- Very small sample sizes
- Very rare events (e.g., testing a feature with 0.1% conversion rate and only 200 users)

Fisher's test computes the **exact probability** rather than relying on the chi-square approximation, so it's always valid regardless of sample size. The downside is it becomes computationally expensive for large tables.
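
To make "exact probability" concrete, here is a minimal sketch of a two-sided Fisher's test for a 2x2 table, built from the hypergeometric distribution with the standard library. The function name and the example table are hypothetical, and summing all tables no more likely than the observed one is one common two-sided convention, not the only one.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-7))


# Hypothetical small experiment: 3 of 4 convert in one group, 1 of 4 in the other
small_table = [[3, 1], [1, 3]]
p_value = fisher_exact_two_sided(small_table)
print(f"Two-sided p-value: {p_value:.4f}")
```

Every expected cell count in this table is 2, well below 5, so the chi-square approximation would not be trustworthy here.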

### Q4: Your test is significant but the lift is only 2 percentage points. How do you convince stakeholders to launch?

**Answer:**

I would frame it in **business terms**:

1. **Revenue impact:** If we have 1M monthly visitors reaching checkout with an AOV (Average Order Value) of \$50, a 2pp increase in conversion means: $1M \times 0.02 \times \$50 = \$1M$ additional monthly revenue.

2. **Confidence interval:** The 95% CI tells us the true effect is likely between X and Y, giving stakeholders a range to plan around.

3. **Risk assessment:** We ran validity checks (SRM, guardrail metrics) and everything looks clean. The result is robust.

4. **Implementation cost:** If the new design is easy to maintain, the risk-reward ratio is very favorable.
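
The revenue arithmetic in point 1 can be extended across the confidence interval so stakeholders see a range rather than a single number. This is a sketch: the traffic and AOV come from the answer above, while the CI bounds and the helper name `incremental_monthly_revenue` are hypothetical placeholders to be replaced with the values from the actual analysis.

```python
def incremental_monthly_revenue(checkout_visitors, lift, aov):
    """Extra revenue from an absolute conversion lift given as a fraction (0.02 = 2pp)."""
    return checkout_visitors * lift * aov


visitors, aov = 1_000_000, 50.0  # figures from the answer above
# Hypothetical CI bounds for the lift, for illustration only
scenarios = [("lower bound", 0.005), ("point estimate", 0.02), ("upper bound", 0.035)]
for label, lift in scenarios:
    revenue = incremental_monthly_revenue(visitors, lift, aov)
    print(f"{label:>14}: ${revenue:,.0f}")
```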

### Q5: You notice that the sample ratio is 52/48 instead of 50/50. What do you do?

**Answer:**

I would run a **Sample Ratio Mismatch (SRM) test** using a chi-square goodness-of-fit test:

1. If the p-value is **above 0.01**, the mismatch is likely due to random variation - proceed with analysis.
2. If the p-value is **below 0.01**, there's likely a **bug in the randomization system** (e.g., bot traffic, broken redirect, caching issue).

If SRM is detected, I would **NOT trust the test results**. Instead, I'd investigate the root cause with engineering, fix the issue, and re-run the experiment.
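
Whether a 52/48 split is alarming depends heavily on the sample size. The sketch below runs the goodness-of-fit test with the standard library at two hypothetical sample sizes, using the fact that the chi-square survival function with 1 degree of freedom is $\operatorname{erfc}(\sqrt{x/2})$. The helper name `srm_p_value` is an assumption.

```python
import math


def srm_p_value(n_control, n_treatment, expected_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_control + n_treatment
    exp_control = total * expected_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))


# The same 52/48 split at two hypothetical sample sizes
p_small = srm_p_value(520, 480)
p_large = srm_p_value(5200, 4800)
print(f"n = 1,000:  p = {p_small:.4f}")
print(f"n = 10,000: p = {p_large:.6f}")
```

With 1,000 users the split is consistent with chance, while with 10,000 users the same ratio falls far below the 0.01 threshold and points to a randomization problem.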

### Q6: What if the p-value is 0.06? What would you recommend?

**Answer:**

A p-value of 0.06 means we **fail to reject** $H_0$ at the 5% level, but the result is close. My approach:

1. **Look at the confidence interval**: If the CI includes practically meaningful effects, the test may be **underpowered** - we might need more data.
2. **Check practical significance**: If the point estimate of the effect is large and business-relevant, I might recommend **extending the test** to collect more data.
3. **Never change alpha after seeing results**: That would be p-hacking. The threshold should be set before the experiment.
4. **Context matters**: For a low-risk, easy-to-implement change, the business might accept the risk. For a high-cost change, we need stronger evidence.

I would NOT say "it's almost significant" - a result either meets the pre-defined threshold or it doesn't.
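
To judge whether a borderline result reflects an underpowered test, it helps to estimate the power the test actually had. The sketch below uses the normal approximation with Cohen's h and only the standard library. The helper name `approx_power` and the group sizes are hypothetical; the 13% and 15% rates are the baseline and target from this case study.

```python
import math
from statistics import NormalDist


def approx_power(n_per_group, p_baseline, p_treatment, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    # Cohen's h effect size for the two proportions
    h = 2 * math.asin(math.sqrt(p_treatment)) - 2 * math.asin(math.sqrt(p_baseline))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction
    return NormalDist().cdf(abs(h) * math.sqrt(n_per_group / 2) - z_crit)


for n in (2000, 4720, 9000):
    print(f"n per group = {n:>5,}: power ~ {approx_power(n, 0.13, 0.15):.2f}")
```

If the test stopped well short of the planned sample, a p-value of 0.06 says little about whether the effect exists.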

### Q7: How would you handle a situation where conversion rate improved but revenue per user decreased?

**Answer:**

This is a classic **guardrail metric** scenario. It could mean:

1. **The new design attracts low-value conversions**: Maybe the simpler checkout reduces friction for small purchases but doesn't help larger ones.
2. **Cannibalization**: Users might be buying cheaper items instead of browsing more.

**What I'd do:**
- Segment the analysis by user type, purchase amount, and product category
- Calculate **total revenue impact** (conversion rate × average order value × traffic)
- If net revenue is negative despite higher conversion, **do not launch**
- Consider if the OEC (Overall Evaluation Criterion) should be revenue rather than conversion rate
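
The total-revenue check can be sketched as revenue per visitor (conversion rate × average order value) compared across groups. All figures below are hypothetical and chosen to show a case where conversion rises but net revenue falls; the helper name `revenue_per_visitor` is an assumption.

```python
def revenue_per_visitor(conversion_rate, avg_order_value):
    """Expected revenue per checkout visitor."""
    return conversion_rate * avg_order_value


# Hypothetical figures: conversion rises by 2pp while average order value falls
control_rpv = revenue_per_visitor(0.13, 50.00)
treatment_rpv = revenue_per_visitor(0.15, 42.00)

monthly_visitors = 1_000_000  # hypothetical traffic
net_impact = (treatment_rpv - control_rpv) * monthly_visitors
print(f"Control RPV:   ${control_rpv:.2f}")
print(f"Treatment RPV: ${treatment_rpv:.2f}")
print(f"Net monthly revenue impact: ${net_impact:,.0f}")
```

In this scenario the treatment wins on conversion yet loses revenue, which is the situation where revenue per visitor is the better OEC.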