# Case Study 3: Bayesian A/B Test - Subscription Page Pricing

## Scenario
A streaming platform wants to test whether a **new pricing page layout** increases the subscription sign-up rate. Instead of a traditional frequentist approach, the team wants to use **Bayesian A/B testing** to get a direct probability that variant B is better than A, and to potentially stop the test early if there's strong evidence.

Current sign-up rate: ~8%. The team wants to know *"What is the probability that the new page is better?"*

**Methods used:** Beta-Binomial model, Posterior analysis, Monte Carlo simulation, Expected Loss

## 1. Bayesian vs Frequentist: Key Differences

| Aspect | Frequentist | Bayesian |
|--------|-------------|----------|
| **Question answered** | "How likely is data at least this extreme if there is no effect?" | "What's the probability B is better than A?" |
| **Core output** | p-value, confidence interval | Posterior distribution, credible interval |
| **Parameters** | Fixed (unknown) constants | Random variables with distributions |
| **Prior knowledge** | Not incorporated | Incorporated through priors |
| **Peeking** | Inflates false positives | Posterior stays valid at every look, but stopping rules still affect error rates |
| **Sample size** | Must be fixed upfront | More flexible |
| **Interpretation** | "95% of such intervals contain the true value" | "95% probability the true value is in this interval" |

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

## 2. The Beta-Binomial Model

For binary outcomes (signed up / didn't sign up), the conjugate prior is the **Beta distribution**:

- **Prior:** $p \sim \text{Beta}(\alpha, \beta)$
- **Likelihood:** $x \sim \text{Binomial}(n, p)$
- **Posterior:** $p \mid x \sim \text{Beta}(\alpha + x, \beta + n - x)$

Where $x$ = number of successes, $n$ = number of trials.

The beauty of conjugacy is that the posterior has a **closed-form solution** - no MCMC needed.
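
The closed-form update can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper name `beta_posterior` and the counts (24 sign-ups out of 300) are made up for illustration. It multiplies prior by likelihood on a grid, normalizes by brute force, and compares the result with the Beta density given by the conjugate formula.

```python
import numpy as np
from scipy import stats


def beta_posterior(prior_alpha, prior_beta, successes, trials):
    """Conjugate update: Beta prior + Binomial data -> Beta posterior parameters."""
    return prior_alpha + successes, prior_beta + trials - successes


# Hypothetical counts, chosen only for illustration
successes, trials = 24, 300
post_a, post_b = beta_posterior(1, 1, successes, trials)

# Brute-force check: evaluate prior x likelihood at bin midpoints and normalize
grid = np.linspace(0.0005, 0.9995, 1000)
step = grid[1] - grid[0]
unnormalized = stats.beta.pdf(grid, 1, 1) * stats.binom.pmf(successes, trials, grid)
grid_posterior = unnormalized / (unnormalized.sum() * step)

# Closed-form posterior density from the conjugate formula
closed_form = stats.beta.pdf(grid, post_a, post_b)
max_abs_diff = np.max(np.abs(grid_posterior - closed_form))

print(f"Posterior: Beta({post_a}, {post_b})")
print(f"Largest gap between grid and closed-form density: {max_abs_diff:.2e}")
```

The two densities should agree up to discretization error, which is the practical meaning of "no MCMC needed" for this model.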

In [None]:
# Simulate experiment data
n_control = 3000
n_treatment = 3000

true_p_control = 0.08
true_p_treatment = 0.095  # ~19% relative lift

# Simulate results
conversions_control = np.random.binomial(n_control, true_p_control)
conversions_treatment = np.random.binomial(n_treatment, true_p_treatment)

print("=== Observed Data ===")
print(f"Control:   {conversions_control} / {n_control} = {conversions_control/n_control:.4f}")
print(f"Treatment: {conversions_treatment} / {n_treatment} = {conversions_treatment/n_treatment:.4f}")

## 3. Setting the Prior

The choice of prior reflects our **prior knowledge** about the conversion rate.

In [None]:
# Uninformative prior: Beta(1, 1) = Uniform(0, 1)
# This says "we have no prior knowledge about the conversion rate"
prior_alpha = 1
prior_beta = 1

# Alternative: weakly informative prior centered around 8%
# Beta(8, 92) -> mean = 8/(8+92) = 0.08, roughly equivalent to 100 prior observations
informative_alpha = 8
informative_beta = 92

x = np.linspace(0, 0.25, 1000)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.plot(x, stats.beta.pdf(x, 1, 1), 'b--', label='Uninformative: Beta(1,1)', linewidth=2)
ax.plot(x, stats.beta.pdf(x, 8, 92), 'r-', label='Weakly informative: Beta(8,92)', linewidth=2)
ax.set_xlabel('Conversion Rate')
ax.set_ylabel('Density')
ax.set_title('Prior Distributions')
ax.legend()
plt.tight_layout()
plt.show()

print(f"Uninformative prior mean: {1/(1+1):.2f} (flat, no preference)")
print(f"Informative prior mean: {8/(8+92):.2f} (centered on historical rate)")
print(f"\nWe'll use the uninformative prior Beta(1,1) to let the data speak.")

## 4. Compute Posterior Distributions

In [None]:
# Posterior parameters (with uninformative prior)
post_alpha_A = prior_alpha + conversions_control
post_beta_A = prior_beta + n_control - conversions_control

post_alpha_B = prior_alpha + conversions_treatment
post_beta_B = prior_beta + n_treatment - conversions_treatment

print("=== Posterior Distributions ===")
print(f"Control:   Beta({post_alpha_A}, {post_beta_A})")
print(f"  Mean: {post_alpha_A / (post_alpha_A + post_beta_A):.4f}")
print(f"  95% Credible Interval: {stats.beta.ppf([0.025, 0.975], post_alpha_A, post_beta_A)}")
print(f"\nTreatment: Beta({post_alpha_B}, {post_beta_B})")
print(f"  Mean: {post_alpha_B / (post_alpha_B + post_beta_B):.4f}")
print(f"  95% Credible Interval: {stats.beta.ppf([0.025, 0.975], post_alpha_B, post_beta_B)}")

In [None]:
# Plot posterior distributions
x = np.linspace(0.04, 0.14, 1000)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, stats.beta.pdf(x, post_alpha_A, post_beta_A),
        'b-', linewidth=2, label=f'Control posterior (mean={post_alpha_A/(post_alpha_A+post_beta_A):.4f})')
ax.plot(x, stats.beta.pdf(x, post_alpha_B, post_beta_B),
        'r-', linewidth=2, label=f'Treatment posterior (mean={post_alpha_B/(post_alpha_B+post_beta_B):.4f})')
ax.fill_between(x, stats.beta.pdf(x, post_alpha_A, post_beta_A), alpha=0.2, color='blue')
ax.fill_between(x, stats.beta.pdf(x, post_alpha_B, post_beta_B), alpha=0.2, color='red')
ax.set_xlabel('Conversion Rate')
ax.set_ylabel('Posterior Density')
ax.set_title('Posterior Distributions of Conversion Rate')
ax.legend()
plt.tight_layout()
plt.show()

## 5. Key Bayesian Metrics
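
The metrics in this section are estimated by sampling from the two posteriors. As a cross-check, $P(p_B > p_A)$ for independent Beta posteriors can also be written as the one-dimensional integral $\int f_B(p)\,F_A(p)\,dp$ and evaluated numerically. The sketch below assumes hypothetical posterior parameters (240/3000 and 285/3000 conversions with a flat prior) rather than the simulated counts, and the function name `prob_b_beats_a` is illustrative.

```python
import numpy as np
from scipy import integrate, stats


def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b):
    """P(p_B > p_A) for independent Beta posteriors, via 1-D numerical integration."""
    def integrand(p):
        # Density of p_B at p, times the probability that p_A falls below p
        return stats.beta.pdf(p, alpha_b, beta_b) * stats.beta.cdf(p, alpha_a, beta_a)

    # Integrate only where B's posterior has mass so the adaptive rule sees the peak
    lower, upper = stats.beta.ppf([1e-12, 1 - 1e-12], alpha_b, beta_b)
    value, _abs_error = integrate.quad(integrand, lower, upper)
    return value


# Hypothetical posteriors: flat prior + 240/3000 (A) and 285/3000 (B)
exact = prob_b_beats_a(241, 2761, 286, 2716)

# Monte Carlo estimate of the same quantity for comparison
rng = np.random.default_rng(0)
mc = (rng.beta(286, 2716, size=200_000) > rng.beta(241, 2761, size=200_000)).mean()

print(f"Numerical integration: {exact:.4f}")
print(f"Monte Carlo:           {mc:.4f}")
```

Agreement between the two numbers to a couple of decimal places is a quick sanity check that the number of simulation draws is adequate.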

In [None]:
# Monte Carlo simulation to compute P(B > A)
n_simulations = 100_000

samples_A = stats.beta.rvs(post_alpha_A, post_beta_A, size=n_simulations)
samples_B = stats.beta.rvs(post_alpha_B, post_beta_B, size=n_simulations)

# Probability that B is better than A
prob_B_better = (samples_B > samples_A).mean()

# Distribution of the uplift
uplift = samples_B - samples_A
relative_uplift = (samples_B - samples_A) / samples_A * 100

# Expected loss (risk of choosing wrong variant)
loss_choose_B = np.maximum(samples_A - samples_B, 0).mean()  # Loss if B is actually worse
loss_choose_A = np.maximum(samples_B - samples_A, 0).mean()  # Loss if A is actually worse

print("=== Bayesian Decision Metrics ===")
print(f"\nP(Treatment > Control): {prob_B_better:.4f} ({prob_B_better*100:.1f}%)")
print(f"P(Control > Treatment): {1-prob_B_better:.4f} ({(1-prob_B_better)*100:.1f}%)")
print(f"\nExpected uplift (absolute): {uplift.mean():.4f} ({uplift.mean()*100:.2f}pp)")
print(f"95% Credible Interval for uplift: ({np.percentile(uplift, 2.5):.4f}, {np.percentile(uplift, 97.5):.4f})")
print(f"\nExpected relative uplift: {relative_uplift.mean():.1f}%")
print(f"\nExpected loss if we choose Treatment: {loss_choose_B:.5f}")
print(f"Expected loss if we choose Control:   {loss_choose_A:.5f}")
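# Report whichever variant actually has the lower expected loss on this data
better_choice = "Treatment" if loss_choose_B < loss_choose_A else "Control"
print(f"\n--> Choosing {better_choice} minimizes expected loss.")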

In [None]:
# Visualize the uplift distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Absolute uplift
axes[0].hist(uplift * 100, bins=80, color='#2ecc71', alpha=0.7, edgecolor='white')
axes[0].axvline(x=0, color='red', linestyle='--', linewidth=2, label='No effect')
axes[0].axvline(x=np.percentile(uplift * 100, 2.5), color='orange', linestyle=':', label='95% CI')
axes[0].axvline(x=np.percentile(uplift * 100, 97.5), color='orange', linestyle=':')
axes[0].set_xlabel('Uplift (percentage points)')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Distribution of Uplift\nP(B > A) = {prob_B_better:.1%}')
axes[0].legend()

# Relative uplift
axes[1].hist(relative_uplift, bins=80, color='#9b59b6', alpha=0.7, edgecolor='white')
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2, label='No effect')
axes[1].set_xlabel('Relative Uplift (%)')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Distribution of Relative Uplift\nMean = {relative_uplift.mean():.1f}%')
axes[1].legend()

plt.tight_layout()
plt.show()

## 6. Sequential Monitoring (Bayesian Advantage)

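One key advantage of Bayesian testing: the posterior is a valid summary of the evidence at every look, so we can **monitor it as data accumulates**. Stopping the first time a threshold is crossed still raises the chance of a wrong call, so treat a threshold crossing as a signal to examine, not an automatic stop.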

In [None]:
# Simulate sequential data collection and track P(B > A) over time
np.random.seed(42)
n_total = 3000
step_size = 100

control_stream = np.random.binomial(1, true_p_control, n_total)
treatment_stream = np.random.binomial(1, true_p_treatment, n_total)

checkpoints = range(step_size, n_total + 1, step_size)
prob_b_better_over_time = []

for n in checkpoints:
    # Cumulative conversions
    conv_a = control_stream[:n].sum()
    conv_b = treatment_stream[:n].sum()
    
    # Posteriors
    samples_a = stats.beta.rvs(1 + conv_a, 1 + n - conv_a, size=10_000)
    samples_b = stats.beta.rvs(1 + conv_b, 1 + n - conv_b, size=10_000)
    
    prob_b_better_over_time.append((samples_b > samples_a).mean())

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(list(checkpoints), prob_b_better_over_time, 'b-', linewidth=2)
ax.axhline(y=0.95, color='green', linestyle='--', alpha=0.7, label='95% threshold')
ax.axhline(y=0.50, color='gray', linestyle=':', alpha=0.5, label='50% (no preference)')
ax.fill_between(list(checkpoints), 0.95, 1.0, alpha=0.1, color='green')
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('P(Treatment > Control)')
ax.set_title('Sequential Monitoring: Probability B is Better Over Time')
ax.set_ylim(0, 1.05)
ax.legend()
plt.tight_layout()
plt.show()

# Find first time we cross 95%
crossed = [(n, p) for n, p in zip(checkpoints, prob_b_better_over_time) if p >= 0.95]
if crossed:
    first_n, first_p = crossed[0]
    print(f"First crossed 95% threshold at n={first_n} per group (P(B>A)={first_p:.3f})")
    print(f"Stopping there would have saved {n_total - first_n} observations per group.")
else:
    print(f"P(B>A) never reached 95% within {n_total} observations per group.")

## 7. Comparison with Frequentist Result

In [None]:
from statsmodels.stats.proportion import proportions_ztest

# Frequentist Z-test
z_stat, z_pval = proportions_ztest(
    [conversions_control, conversions_treatment],
    [n_control, n_treatment],
    alternative='two-sided'
)

print("=== Frequentist vs Bayesian ===")
print(f"\nFrequentist (Z-test):")
print(f"  Z-statistic: {z_stat:.4f}")
print(f"  p-value: {z_pval:.4f}")
print(f"  Decision: {'Reject H0' if z_pval < 0.05 else 'Fail to reject H0'}")
print(f"\nBayesian:")
print(f"  P(Treatment > Control): {prob_B_better:.4f}")
print(f"  Expected loss (choosing Treatment): {loss_choose_B:.5f}")
print(f"  Decision: {'Choose Treatment' if prob_B_better > 0.95 else 'Need more evidence'}")
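# Compare the two decisions instead of assuming they match
# (z_stat < 0 means the treatment rate is higher, given the argument order above)
freq_favors_b = z_pval < 0.05 and z_stat < 0
bayes_chooses_b = prob_B_better > 0.95
agreement = "agree" if freq_favors_b == bayes_chooses_b else "disagree"
print(f"\nThe two approaches {agreement} on this data. The Bayesian output adds directly")
print(f"interpretable probabilities and expected loss for decision-making.")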

---

## Interview Follow-Up Questions & Answers

### Q1: What are the advantages of Bayesian A/B testing over frequentist?

**Answer:**

1. **Interpretability**: You get direct probability statements ("82% chance B is better") instead of p-values which are often misinterpreted.

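2. **Peeking is natural**: The posterior is a valid summary of the evidence at any point, so you can monitor results continuously. Stopping rules still matter: stopping the first time a threshold is crossed raises the chance of launching a variant that is not actually better.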

3. **Incorporate prior knowledge**: If you have historical data about conversion rates, you can incorporate it through priors.

4. **Decision-theoretic framework**: Expected loss lets you make cost-aware decisions. You can ask "What's the expected cost of choosing B, given that it might actually be worse?"

5. **No fixed sample size required**: While you still need sufficient data, Bayesian methods are more flexible about when to stop.

### Q2: What are the disadvantages of Bayesian A/B testing?

**Answer:**

1. **Prior sensitivity**: Poor choice of priors can bias results, especially with small samples. Teams may disagree on what prior to use.

2. **Computational cost**: For complex models (non-conjugate priors), you need MCMC sampling which is slower.

3. **Less established in industry**: Most experimentation platforms default to frequentist methods. Bayesian requires more education for stakeholders.

4. **False sense of flexibility**: While peeking is safer, stopping very early can still lead to unreliable decisions. The posterior with few data points has high uncertainty.

5. **Harder to pre-register**: Frequentist tests have clearer pre-registration frameworks (alpha, power, sample size).

### Q3: How do you choose the prior? What if you get it wrong?

**Answer:**

**Choosing the prior:**
- **Uninformative / flat prior** (Beta(1,1)): When you have no prior knowledge. Lets data speak entirely.
- **Weakly informative**: Based on historical data. E.g., if historical conversion rate is 8%, use Beta(8, 92) - equivalent to 100 prior observations.
- **Skeptical prior**: Centers on zero effect to be conservative about detecting improvements.

**What if you get it wrong:**
- With sufficient data, the posterior is **dominated by the likelihood** (data overwhelms the prior). This is called "prior washing out."
- Run **sensitivity analysis**: Check if your conclusions change with different reasonable priors.
- Rule of thumb: If your prior's "effective sample size" is small relative to actual data, the prior has minimal impact.
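
A sensitivity analysis along these lines can be sketched as follows. This is an illustration under stated assumptions, not the notebook's own analysis: the counts (240/3000 vs 285/3000) are hypothetical, the helper names are made up, and the same prior is applied to both arms. Each prior is described by its mean and effective sample size, where `alpha + beta` plays the role of prior observations.

```python
import numpy as np


def beta_from_mean_and_ess(mean, ess):
    """Beta(alpha, beta) with the given mean and effective sample size alpha + beta."""
    return mean * ess, (1.0 - mean) * ess


def prob_b_better(prior, conv_a, n_a, conv_b, n_b, rng, draws=100_000):
    """Monte Carlo estimate of P(p_B > p_A) with the same Beta prior on both arms."""
    alpha0, beta0 = prior
    samples_a = rng.beta(alpha0 + conv_a, beta0 + n_a - conv_a, size=draws)
    samples_b = rng.beta(alpha0 + conv_b, beta0 + n_b - conv_b, size=draws)
    return float((samples_b > samples_a).mean())


# Hypothetical counts, for illustration only
conv_a, n_a = 240, 3000
conv_b, n_b = 285, 3000

priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "8% mean, ESS 100": beta_from_mean_and_ess(0.08, 100),    # i.e. Beta(8, 92)
    "8% mean, ESS 1000": beta_from_mean_and_ess(0.08, 1000),  # much stronger prior
}

rng = np.random.default_rng(0)
sensitivity = {
    name: prob_b_better(prior, conv_a, n_a, conv_b, n_b, rng)
    for name, prior in priors.items()
}
for name, prob in sensitivity.items():
    print(f"{name:<20} P(B > A) = {prob:.3f}")
```

If the conclusion is the same across the rows, the prior is not driving the decision. The stronger the prior relative to the 3,000 observations per arm, the more it pulls both arms toward 8% and shrinks the estimated difference.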

### Q4: What is "expected loss" and how do you use it for decisions?

**Answer:**

Expected loss quantifies the **risk of making the wrong decision**:

$$\text{Expected Loss(choose B)} = E[\max(p_A - p_B, 0)]$$

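This is the conversion rate you would give up on average by choosing B, taken over the whole posterior: draws where B is better contribute zero, and draws where B is worse contribute the shortfall $p_A - p_B$.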

**Decision rule:** Choose the variant with the **lower expected loss**.

**Threshold approach:** Set a "loss threshold" (e.g., 0.1 percentage points of conversion rate). If the expected loss of choosing B is below the threshold, launch B. This accounts for the magnitude of the potential loss, not just for whether B is probably better.

**Business advantage:** You can convert expected loss to dollars: if expected loss is 0.05pp and you have 1M monthly visitors with \$50 AOV, the risk is 1M x 0.0005 x \$50 = \$25K/month.
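
The threshold rule and the dollar conversion can be sketched in a few lines. The posterior parameters, the 0.1pp threshold, and the function names below are assumptions for illustration; the final line reproduces the arithmetic from the example above.

```python
import numpy as np


def expected_loss(samples_chosen, samples_other):
    """Average conversion rate given up by picking `chosen`; zero on draws where it wins."""
    return float(np.maximum(samples_other - samples_chosen, 0.0).mean())


def loss_in_dollars(loss_rate, monthly_visitors, value_per_conversion):
    """Translate an expected loss in conversion rate into dollars per month."""
    return loss_rate * monthly_visitors * value_per_conversion


rng = np.random.default_rng(0)
# Hypothetical posteriors: flat prior + 240/3000 (A) and 285/3000 (B)
samples_a = rng.beta(241, 2761, size=200_000)
samples_b = rng.beta(286, 2716, size=200_000)

loss_b = expected_loss(samples_b, samples_a)
loss_threshold = 0.001  # 0.1 percentage points (hypothetical business tolerance)
launch_b = loss_b < loss_threshold

print(f"Expected loss of choosing B: {loss_b * 100:.4f}pp -> launch B: {launch_b}")
# Arithmetic from the text: 0.05pp loss, 1M visitors, 50 dollars per conversion
print(f"Monthly risk: {loss_in_dollars(0.0005, 1_000_000, 50):,.0f} dollars")
```

Expressing the threshold in the same units as the business cost makes it easier to agree on before the test starts.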

### Q5: Can you peek at results in Bayesian testing without consequences?

**Answer:**

**Partially true, with caveats:**

Unlike frequentist testing where repeated peeking inflates the Type I error rate, Bayesian posteriors are **always valid given the observed data**. Each time you compute P(B > A), it's a correct posterior probability under the chosen prior and model.

**However:**
- Stopping very early means the **posterior has high uncertainty** (wide credible intervals)
- If you stop the moment P(B > A) first exceeds 95%, you'll make more errors than if you wait for the posterior to stabilize
- The **expected loss** metric helps here: even if P(B > A) is high, if expected loss is also high, you need more data

**Best practice:** Monitor continuously but set a **minimum sample size** and use expected loss thresholds rather than just P(B > A).
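
The cost of stopping at the first threshold crossing can be seen in a small A/A simulation, where both arms share the same true rate and every "B wins" call is therefore a mistake. This is a rough sketch under assumed settings (400 simulated experiments, 10 looks of 300 users each, flat priors, 1,000 posterior draws per look), not a calibrated study.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experiments = 400   # simulated A/A tests: no true difference between arms
true_rate = 0.08
look_size = 300       # new users per arm between looks
n_looks = 10
draws = 1000          # posterior draws per look
threshold = 0.95

checkpoints = look_size * np.arange(1, n_looks + 1)

# Cumulative conversions per experiment at each look
conv_a = rng.binomial(look_size, true_rate, size=(n_experiments, n_looks)).cumsum(axis=1)
conv_b = rng.binomial(look_size, true_rate, size=(n_experiments, n_looks)).cumsum(axis=1)

# Monte Carlo estimate of P(B > A) at every look, with a flat Beta(1, 1) prior
prob_b_better = np.empty((n_experiments, n_looks))
for j, n in enumerate(checkpoints):
    a = rng.beta(1 + conv_a[:, j, None], 1 + n - conv_a[:, j, None], size=(n_experiments, draws))
    b = rng.beta(1 + conv_b[:, j, None], 1 + n - conv_b[:, j, None], size=(n_experiments, draws))
    prob_b_better[:, j] = (b > a).mean(axis=1)

# Share of A/A tests that would (wrongly) declare B the winner under each rule
final_look_only = (prob_b_better[:, -1] >= threshold).mean()
stop_at_first_crossing = (prob_b_better >= threshold).any(axis=1).mean()

print(f"Declare B only at the final look:    {final_look_only:.1%}")
print(f"Declare B at the first 95% crossing: {stop_at_first_crossing:.1%}")
```

The second rate can never be lower than the first, because the final look is one of the looks being checked. Requiring a minimum sample size or an expected-loss threshold before stopping is one way to narrow that gap.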

### Q6: A stakeholder asks: "Is 90% probability good enough to launch?" How do you respond?

**Answer:**

It depends on the **cost of being wrong**:

1. **Low-cost, easily reversible change** (e.g., button color): 90% might be fine. The cost of being wrong is minimal and you can always revert.

2. **High-cost, hard-to-reverse change** (e.g., pricing model): You'd want 95%+ and low expected loss. Getting pricing wrong can cause churn.

3. **Look at expected loss, not just probability**: If P(B > A) = 90% but expected loss is only \$100/month if wrong, it's probably fine. If expected loss is \$1M/month, you need stronger evidence.

4. **Consider opportunity cost**: How much do you lose by waiting for more data? If the potential gain is large and the risk is small, 90% might be sufficient.