# Case Study 4: Multi-Variant Test - Landing Page Optimization (A/B/C/D)

## Scenario
A marketing team wants to test **four different landing page designs** (A, B, C, D) simultaneously to determine which one drives the most sign-ups and highest engagement (time on page). Instead of running multiple A/B tests sequentially, they run a single multi-variant test.

**Tests used:** ANOVA, Kruskal-Wallis, Chi-Square (for proportions), Tukey HSD post-hoc, Bonferroni correction

## 1. Why Multi-Variant Testing?

| Approach | Pros | Cons |
|----------|------|------|
| **Sequential A/B tests** (A vs B, then winner vs C...) | Simple to analyze | Takes much longer; environment may change between tests |
| **Multi-variant A/B/n test** | Tests all variants simultaneously; faster overall | More complex analysis; requires more total traffic; multiple comparison problem |

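**Key challenge:** When comparing multiple groups, the probability of at least one false positive grows with the number of comparisons $k$. If the $k$ tests were independent and each were run at level $\alpha$:

$$P(\text{at least 1 false positive}) = 1 - (1 - \alpha)^k$$

For 4 groups there are $\binom{4}{2} = 6$ pairwise comparisons, so at $\alpha = 0.05$: $1 - (0.95)^6 \approx 0.265$ (a 26.5% chance!). Pairwise comparisons share groups and are therefore not truly independent, so treat this figure as an approximation.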
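
The growth of this error rate with the number of variants can be tabulated in a few lines. The following is a small illustrative sketch; the helper name `familywise_error_rate` is hypothetical, and it uses the same independence approximation as the formula above.

```python
from math import comb


def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> tuple[int, float]:
    """Return (number of pairwise comparisons, approximate family-wise error rate).

    Assumes the pairwise tests are independent, which is only an approximation.
    """
    k = comb(n_groups, 2)  # all unordered pairs of groups
    return k, 1 - (1 - alpha) ** k


for n_groups in (2, 3, 4, 5):
    k, fwer = familywise_error_rate(n_groups)
    print(f"{n_groups} groups -> {k} comparisons, FWER ~ {fwer:.3f}")
```

With two groups there is a single comparison and the rate stays at the nominal $\alpha$; every added variant raises it quickly.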

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.proportion import proportions_ztest
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

## 2. Generate Simulated Data

In [None]:
n_per_group = 500

# Time on page (seconds) - continuous metric
# Design A: Current (baseline) 
# Design B: Minor tweak (no real effect)
# Design C: Significant improvement
# Design D: Even better
time_A = np.random.normal(45, 15, n_per_group)   # mean=45s, std=15s
time_B = np.random.normal(46, 14, n_per_group)   # tiny difference from A
time_C = np.random.normal(52, 16, n_per_group)   # meaningful improvement
time_D = np.random.normal(55, 15, n_per_group)   # best

# Clip negative values (time can't be negative)
time_A = np.clip(time_A, 1, None)
time_B = np.clip(time_B, 1, None)
time_C = np.clip(time_C, 1, None)
time_D = np.clip(time_D, 1, None)

# Sign-up rate (binary metric)
signup_A = np.random.binomial(1, 0.10, n_per_group)
signup_B = np.random.binomial(1, 0.11, n_per_group)
signup_C = np.random.binomial(1, 0.14, n_per_group)
signup_D = np.random.binomial(1, 0.15, n_per_group)

# Create DataFrame
df = pd.DataFrame({
    'variant': ['A'] * n_per_group + ['B'] * n_per_group + ['C'] * n_per_group + ['D'] * n_per_group,
    'time_on_page': np.concatenate([time_A, time_B, time_C, time_D]),
    'signed_up': np.concatenate([signup_A, signup_B, signup_C, signup_D])
})

print("=== Summary Statistics ===")
summary = df.groupby('variant').agg(
    n=('time_on_page', 'count'),
    mean_time=('time_on_page', 'mean'),
    std_time=('time_on_page', 'std'),
    signup_rate=('signed_up', 'mean')
).round(3)
print(summary)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Time on page distributions
colors = ['#3498db', '#e67e22', '#2ecc71', '#e74c3c']
for variant, color in zip(['A', 'B', 'C', 'D'], colors):
    data = df[df['variant'] == variant]['time_on_page']
    axes[0].hist(data, bins=30, alpha=0.5, color=color, label=f'Variant {variant}', density=True)
axes[0].set_xlabel('Time on Page (seconds)')
axes[0].set_ylabel('Density')
axes[0].set_title('Time on Page Distribution by Variant')
axes[0].legend()

# Sign-up rates
signup_rates = df.groupby('variant')['signed_up'].mean()
bars = axes[1].bar(signup_rates.index, signup_rates.values, color=colors, alpha=0.8)
for bar, rate in zip(bars, signup_rates.values):
    axes[1].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.002,
                f'{rate:.1%}', ha='center', va='bottom', fontweight='bold')
axes[1].set_ylabel('Sign-up Rate')
axes[1].set_title('Sign-up Rate by Variant')

plt.tight_layout()
plt.show()

## 3. Analysis: Continuous Metric (Time on Page)

### 3.1 One-Way ANOVA

Tests whether the means of the groups are all equal:
- $H_0$: $\mu_A = \mu_B = \mu_C = \mu_D$
- $H_a$: At least one mean is different

In [None]:
# Check assumptions first
# 1. Normality (Shapiro-Wilk) - at n=500 it flags even trivial departures; ANOVA stays robust here via the CLT
print("=== Normality Check (Shapiro-Wilk) ===")
for variant in ['A', 'B', 'C', 'D']:
    data = df[df['variant'] == variant]['time_on_page']
    stat, pval = stats.shapiro(data)
    print(f"Variant {variant}: W={stat:.4f}, p={pval:.4f} {'(normal)' if pval > 0.05 else '(not normal)'}")

# 2. Equal variances (Levene's test)
levene_stat, levene_pval = stats.levene(time_A, time_B, time_C, time_D)
print(f"\n=== Levene's Test for Equal Variances ===")
print(f"Statistic: {levene_stat:.4f}, p-value: {levene_pval:.4f}")
print(f"{'Equal variances assumed' if levene_pval > 0.05 else 'Unequal variances - use Welch ANOVA'}")
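
The cell above points to Welch's ANOVA when Levene's test rejects equal variances. A minimal hand-rolled sketch of Welch's F-test follows, using only NumPy and SciPy. The function name `welch_anova` and the simulated `demo_groups` are illustrative assumptions, not part of the original analysis; statistical libraries may offer a ready-made version.

```python
import numpy as np
from scipy import stats


def welch_anova(*groups):
    """Welch's one-way ANOVA for unequal variances; returns (F, df1, df2, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    w = n / variances                       # precision weights per group
    grand_mean = np.sum(w * means) / np.sum(w)
    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)          # approximate denominator df
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p_value


# Simulated data mirroring the four variants (assumed parameters)
rng = np.random.default_rng(0)
demo_groups = [rng.normal(mu, sd, 500)
               for mu, sd in [(45, 15), (46, 14), (52, 16), (55, 15)]]

f_w, df1_w, df2_w, p_w = welch_anova(*demo_groups)
print(f"Welch F({df1_w}, {df2_w:.1f}) = {f_w:.3f}, p = {p_w:.3g}")
```

With only two groups this reduces to the square of Welch's t-statistic, which is a convenient sanity check.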

In [None]:
# One-Way ANOVA
f_stat, anova_pval = stats.f_oneway(time_A, time_B, time_C, time_D)

print("=== One-Way ANOVA ===")
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {anova_pval:.6f}")
if anova_pval < 0.05:
    print("REJECT H0: At least one group mean is significantly different.")
    print("But ANOVA doesn't tell us WHICH groups differ -> need post-hoc tests.")
else:
    print("FAIL TO REJECT H0: No significant difference among group means.")
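
To make the F-statistic less of a black box, it can be rebuilt from the between-group and within-group sums of squares. This is a sketch on freshly simulated data (the `groups` list and its parameters are assumptions for illustration), compared against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Simulated groups (assumed means and a common std of 15)
rng = np.random.default_rng(1)
groups = [rng.normal(mu, 15, 200) for mu in (45, 46, 52, 55)]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group variation: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}")
print(f"manual p = {p_manual:.3g}, scipy p = {p_scipy:.3g}")
```

A large F means the group means are spread out relative to the noise within each group, which is exactly what $H_0$ rules out.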

### 3.2 Kruskal-Wallis Test (Non-parametric alternative)

In [None]:
kw_stat, kw_pval = stats.kruskal(time_A, time_B, time_C, time_D)

print("=== Kruskal-Wallis Test ===")
print(f"H-statistic: {kw_stat:.4f}")
print(f"p-value: {kw_pval:.6f}")
if kw_pval < 0.05:
    print("REJECT H0: Distributions differ significantly.")
else:
    print("FAIL TO REJECT H0: No significant difference.")

print(f"\nNote: Kruskal-Wallis is the non-parametric equivalent of ANOVA.")
print(f"Use when data is not normally distributed or for ordinal data.")

### 3.3 Post-Hoc Test: Tukey HSD

ANOVA tells us *that* groups differ. **Tukey's Honestly Significant Difference** tells us *which* pairs differ, while controlling for multiple comparisons.

In [None]:
tukey = pairwise_tukeyhsd(
    endog=df['time_on_page'],
    groups=df['variant'],
    alpha=0.05
)

print("=== Tukey HSD Post-Hoc Test ===")
print(tukey)

# Visualize
fig = tukey.plot_simultaneous(figsize=(10, 5))
plt.title('Tukey HSD: Simultaneous 95% Confidence Intervals for Group Means')
plt.xlabel('Time on Page (seconds)')
plt.tight_layout()
plt.show()

## 4. Analysis: Binary Metric (Sign-up Rate)

### 4.1 Chi-Square Test for Multiple Proportions

In [None]:
# Contingency table
contingency = pd.crosstab(df['variant'], df['signed_up'])
contingency.columns = ['Not Signed Up', 'Signed Up']
print("=== Contingency Table ===")
print(contingency)

# Chi-square test
chi2, chi2_pval, dof, expected = stats.chi2_contingency(contingency.values)

print(f"\n=== Chi-Square Test ===")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {chi2_pval:.4f}")
print(f"\nAll expected frequencies >= 5: {(expected >= 5).all()}")
if chi2_pval < 0.05:
    print("REJECT H0: Sign-up rates differ significantly across variants.")
else:
    print("FAIL TO REJECT H0: No significant difference in sign-up rates.")

### 4.2 Pairwise Comparisons with Bonferroni Correction

For proportions, Tukey HSD doesn't apply directly. We use pairwise Z-tests with Bonferroni correction.

In [None]:
variants = ['A', 'B', 'C', 'D']
n_comparisons = len(variants) * (len(variants) - 1) // 2  # 6 pairwise comparisons
bonferroni_alpha = 0.05 / n_comparisons

print(f"Number of pairwise comparisons: {n_comparisons}")
print(f"Bonferroni-corrected alpha: {bonferroni_alpha:.4f}")
print(f"\n{'Pair':<10} {'p-value':<12} {'Corrected p':<14} {'Significant':<12}")
print("-" * 50)

pairwise_results = []
for i in range(len(variants)):
    for j in range(i+1, len(variants)):
        v1, v2 = variants[i], variants[j]
        d1 = df[df['variant'] == v1]['signed_up']
        d2 = df[df['variant'] == v2]['signed_up']
        
        z, p = proportions_ztest(
            [d1.sum(), d2.sum()],
            [len(d1), len(d2)],
            alternative='two-sided'
        )
        
        corrected_p = min(p * n_comparisons, 1.0)  # Bonferroni correction
        sig = corrected_p < 0.05
        
        print(f"{v1} vs {v2:<5} {p:<12.4f} {corrected_p:<14.4f} {'Yes *' if sig else 'No'}")
        pairwise_results.append({'pair': f'{v1} vs {v2}', 'p_value': p, 'corrected_p': corrected_p, 'significant': sig})

## 5. Summary & Recommendation

In [None]:
print("=" * 60)
print("SUMMARY OF MULTI-VARIANT TEST")
print("=" * 60)

print("\n--- Time on Page (Continuous Metric) ---")
print(f"ANOVA p-value: {anova_pval:.6f} -> {'Significant' if anova_pval < 0.05 else 'Not significant'}")
print("\nTukey HSD significant pairs:")
# Each data row of the results table already holds the two group labels
for row, reject in zip(tukey._results_table.data[1:], tukey.reject):
    if reject:
        print(f"  {row[0]} vs {row[1]}")

print("\n--- Sign-up Rate (Binary Metric) ---")
print(f"Chi-square p-value: {chi2_pval:.4f} -> {'Significant' if chi2_pval < 0.05 else 'Not significant'}")
print("\nBonferroni-corrected significant pairs:")
for r in pairwise_results:
    if r['significant']:
        print(f"  {r['pair']}")

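print("\n--- Recommendation ---")
# Derive the leaders from the observed data instead of hard-coding a variant
best_time = summary['mean_time'].idxmax()
best_signup = summary['signup_rate'].idxmax()
print("Based on both metrics:")
print(f"  Best variant for time on page: {best_time} (mean={summary.loc[best_time, 'mean_time']:.1f}s)")
print(f"  Best variant for sign-up rate: {best_signup} (rate={summary.loc[best_signup, 'signup_rate']:.1%})")
if best_time == best_signup:
    print(f"  Variant {best_time} leads on both metrics -> recommend launching it,")
    print("  provided the pairwise tests above show a significant gain over baseline A.")
else:
    print("  The metrics disagree -> decide which metric is primary before choosing a variant.")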

---

## Interview Follow-Up Questions & Answers

### Q1: Why not just run multiple separate A/B tests (A vs B, A vs C, A vs D)?

**Answer:**

Running separate tests has two major problems:

1. **Multiple comparisons inflate false positives.** With 3 separate tests at $\alpha = 0.05$, the probability of at least one false positive is $1 - (0.95)^3 = 14.3\%$, nearly 3x the intended 5% rate.

2. **Temporal confounds.** Running tests sequentially means the environment may change between tests (seasonality, marketing campaigns, product changes), making comparisons invalid.

A multi-variant test with proper correction (Bonferroni, Tukey HSD) tests all variants simultaneously while controlling the family-wise error rate.

### Q2: What is the difference between Bonferroni and Tukey HSD corrections?

**Answer:**

| Method | Approach | Strictness | Best For |
|--------|----------|------------|----------|
| **Bonferroni** | Divides $\alpha$ by number of comparisons | Very conservative | Any type of test; few comparisons |
| **Tukey HSD** | Uses studentized range distribution | Less conservative | All pairwise mean comparisons; balanced designs |
| **Benjamini-Hochberg** | Controls False Discovery Rate (FDR) | Least conservative | Many comparisons; exploratory analysis |

Bonferroni is **more conservative** (fewer false positives, but higher false negative rate). Tukey HSD is specifically designed for all pairwise mean comparisons after ANOVA and is generally preferred in that context. Benjamini-Hochberg is useful when you're running many tests and are willing to tolerate some false discoveries.
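
The difference in strictness between Bonferroni and Benjamini-Hochberg is easy to see on a fixed list of p-values. The sketch below implements both adjustments by hand with NumPy. The function names and the `example_pvals` values are made up for illustration; Tukey HSD is left out because it needs the raw observations rather than p-values.

```python
import numpy as np


def bonferroni_adjust(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # Enforce monotonicity by taking running minima from the largest p-value down
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted


example_pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2]  # hypothetical raw p-values
bonf = bonferroni_adjust(example_pvals)
bh = benjamini_hochberg_adjust(example_pvals)

for raw, b, h in zip(example_pvals, bonf, bh):
    print(f"raw={raw:<6} bonferroni={b:.3f} bh={h:.3f}")
print("Rejected at 0.05 -> Bonferroni:", int((bonf < 0.05).sum()),
      "| Benjamini-Hochberg:", int((bh < 0.05).sum()))
```

On these made-up values Benjamini-Hochberg rejects more hypotheses than Bonferroni, which is the price and the benefit of controlling the false discovery rate instead of the family-wise error rate.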

### Q3: ANOVA is significant but none of the pairwise comparisons are. How is that possible?

**Answer:**

This can happen because:

1. **ANOVA is an omnibus test** - it tests whether ANY group differs from ANY other. The overall F-test pools information across all groups, giving it more power.

2. **Post-hoc tests apply corrections** that reduce power for individual comparisons. Bonferroni in particular can be very conservative.

3. **The effect may be distributed** - small differences between many pairs can produce a significant overall F-test without any single pair being significant.

**What to do:** Consider using a less conservative correction (Tukey HSD or Benjamini-Hochberg), or increase sample size. Report the overall ANOVA result honestly and note that specific pairwise differences were not detectable with the available power.

### Q4: How much more traffic do you need for an A/B/C/D test vs an A/B test?

**Answer:**

With 4 variants:
- **Total traffic** is roughly 2x a single A/B test at the same per-group sample size (4 groups of n instead of 2)
- If using Bonferroni correction with 6 pairwise comparisons, each comparison uses $\alpha/6 \approx 0.0083$, which requires **larger samples per group** to maintain power (about 1.5x at 80% power for a two-sided z-test)
- Rule of thumb: expect to need about **2-3x the total traffic** of a simple A/B test to achieve similar power for detecting a difference between any two groups

**Trade-off:** More variants = more traffic needed = longer test duration, BUT you learn about all variants simultaneously rather than testing sequentially.
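
The cost of the stricter per-comparison threshold can be estimated with the standard normal-approximation sample size formula for two proportions. This is a rough sketch: the helper `n_per_group` is hypothetical, and the 10% vs 15% sign-up rates are borrowed from the simulated variants A and D purely as an example.

```python
from math import ceil

from scipy.stats import norm


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided z-test of two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


n_ab = n_per_group(0.10, 0.15)                      # plain A/B test
n_abcd = n_per_group(0.10, 0.15, alpha=0.05 / 6)    # Bonferroni over 6 pairs

print(f"A/B:     {n_ab} per group, {2 * n_ab} total")
print(f"A/B/C/D: {n_abcd} per group, {4 * n_abcd} total")
print(f"Per-group ratio: {n_abcd / n_ab:.2f}, total ratio: {4 * n_abcd / (2 * n_ab):.2f}")
```

The per-group increase comes from the smaller $\alpha$; the rest of the extra traffic comes simply from having four arms instead of two.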

### Q5: When would you use Kruskal-Wallis instead of ANOVA?

**Answer:**

Use Kruskal-Wallis when:

1. **Data is not normally distributed** and sample sizes are small (CLT doesn't help)
2. **Ordinal data** (e.g., user satisfaction: 1-5 stars)
3. **Heavy outliers** that distort means
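4. **Variances are moderately unequal** while the distributions are otherwise similar in shape (for strongly unequal variances with roughly normal data, Welch's ANOVA is usually the better fix)

Kruskal-Wallis compares **rank distributions** rather than means, making it robust to non-normality and outliers. The trade-offs: it has somewhat less statistical power than ANOVA when ANOVA assumptions hold, a significant result can only be read as a difference in medians when the groups share a common shape, and its usual post-hoc procedure (Dunn's test) is less widely available than Tukey HSD, so it often requires a separate package or manual pairwise rank tests with a correction.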
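
A simple way to follow up a significant Kruskal-Wallis result is sketched below. Note that this is **not** Dunn's test: it swaps in pairwise Mann-Whitney U tests with a Bonferroni correction as a plainer stand-in, and the simulated `samples` are assumed values for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated time-on-page data per variant (assumed parameters)
rng = np.random.default_rng(2)
samples = {
    'A': rng.normal(45, 15, 500),
    'B': rng.normal(46, 14, 500),
    'C': rng.normal(52, 16, 500),
    'D': rng.normal(55, 15, 500),
}

# Omnibus test first; only look at pairs if it rejects
h_stat, kw_p = stats.kruskal(*samples.values())
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {kw_p:.3g}")

pairs = list(combinations(samples, 2))
corrected = {}
for g1, g2 in pairs:
    _, p = stats.mannwhitneyu(samples[g1], samples[g2], alternative='two-sided')
    corrected[(g1, g2)] = min(p * len(pairs), 1.0)  # Bonferroni correction

for (g1, g2), p_corr in corrected.items():
    flag = 'Yes *' if p_corr < 0.05 else 'No'
    print(f"{g1} vs {g2}: corrected p = {p_corr:.4f} -> {flag}")
```

Dunn's test reuses the ranks from the pooled Kruskal-Wallis ranking, whereas this stand-in re-ranks each pair separately, so the two can disagree on borderline pairs.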