# NHST Non-Inferiority Testing with Real Experiment Data

## Executive Summary: Why NHST Fails with Small Samples

This notebook demonstrates **Null Hypothesis Significance Testing (NHST)** for non-inferiority testing using real experiment data from a passkey creation feature launch.

### The Problem

When launching new web/mobile features:
- **Limited traffic allocation**: New features get only 2-5% of traffic to minimize risk
- **Small sample sizes**: Each variant may only see hundreds or low thousands of users
- **Need for speed**: We need fast decisions to iterate or scale

### Real Experiment Data

Our passkey creation experiment:
- **Control group**: 32,106 users, 70.9% conversion rate
- **Variant A**: 4,625 users, 70.1% conversion rate
- **Variant B**: 2,100 users, 68.2% conversion rate
- **Variant C**: 2,022 users, 69.0% conversion rate

### NHST Results with Real Data

Testing Variant C for non-inferiority (margin ε = 2%):

| Metric | Value | Interpretation |
|--------|-------|----------------|
| **p-value** | ~45% | >> 5% threshold → **Cannot reject null** |
| **Power** | Below the 80% target | Underpowered |
| **Conclusion** | **Inconclusive** | Cannot determine if variant is non-inferior |

### Required Sample Sizes for 80% Power

- **Current sample**: ~2,000 per variant
- **Required sample**: Much larger (varies by margin)
- **Result**: **NHST cannot provide actionable guidance**

### Bottom Line

**NHST fails for early-stage product launches:**
- ❌ Requires impractically large samples (weeks of data collection)
- ❌ Provides no actionable insights with small samples
- ❌ Binary reject/fail-to-reject offers no guidance
- ❌ Cannot quantify probability of being non-inferior

This notebook demonstrates the mathematical foundations of NHST and why it's unsuitable for modern product development with small, controlled traffic allocations.

---

## Problem Statement

When launching new web or mobile features, engineering teams face a common dilemma:

- **Limited traffic allocation**: At launch, new features get only 2-5% of traffic to minimize risk
- **Multiple variants**: Design teams often propose 3-5 different implementations
- **Small sample sizes**: Each variant may only see hundreds or low thousands of users
- **Need for speed**: We need fast decisions on which variants are best to iterate or scale
- **Imperfect logistics**: Bugs or misconfiguration may cause unbalanced allocation

**Traditional NHST fails here**: With small samples, statistical tests tend to:
- Fail to reach significance (underpowered: β can exceed 0.8, i.e. power below 20%)
- Require weeks of data collection
- Provide no actionable guidance

---

## Test Setup: Control Group vs. Variants

For our passkey creation feature:

- Existing flow has **completion rate of ~71%**
- Keep most traffic on the current experience as the **control group** (subscript $C$ in the formulas below)
- Send limited traffic to three **variants**: A, B, and C

**Goal**: Determine that each new experience is **no worse** than the current one.

This type of test, where the goal is to ensure a new design does **not degrade** the experience, is called a **non-inferiority test**.

In [None]:
# Setup
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import beta as beta_dist
from plotting_utils import plot_gaussian_hypothesis_test
from plotting_utils import plot_type_ii_error_analysis
from nhst import compute_sample_size_non_inferiority

---

## Null Hypothesis Significance Testing (NHST)

At a high level, the **NHST** workflow is:

1. **Assume what you *don't* want to see**: this is the **null hypothesis**.  
   - Example in medicine: *"the drug has no effect."*  
   - Example here: *"the new experience degrades conversion by at least the margin $\epsilon$."*
2. **Run the experiment** and compute a test statistic (here built from the sample proportion = successes / total attempts)
3. **Ask:** *If the null hypothesis were true, how likely is it that we would observe a result at least this extreme?*  
   - If that probability (the **p-value**) is very low, e.g., below 5%, we **reject the null**.

### Important Caveats

- Rejecting the null does **not** prove the opposite is true; it only says the data would be unlikely *if* the null were correct
- The p-value is P(data at least this extreme | H₀); it provides **no probability** that the hypothesis itself is correct
- Without P(H₀ | data), we cannot compute expected values for decision-making
- "Unlikely enough" (e.g., 5%) is an arbitrary threshold: a convention, not a law of nature

**Key point**: NHST computes **P(data | hypothesis)**.  
A Bayesian approach instead computes **P(hypothesis | data)**, a fundamentally different quantity.

---

### Modeling Conversion as Random Variables

The conversion of a UX flow can be modeled with **Bernoulli random variables**:

- $X_C$ for the control experience
- $X_A$ for a new variant $A$

A Bernoulli variable takes only two values: success/failure, convert/abandon, etc.  
Each user who sees a page gives one draw from one of these variables.

We assume both have the same codomain:

$$
\mathcal{X}_C = \mathcal{X}_A = \{0,1\}
$$

where **1 = convert** (user finishes the intended action) and **0 = abandon**.

---

### Sample Proportions

NHST works with **sample proportions**, the average of the Bernoulli draws in each group. Because the groups can have different sizes, write $n_C$ and $n_A$ for the number of users in control and variant:

$$
\hat{p}_C = \frac{1}{n_C}\sum_{i=1}^{n_C} X_{C_i},
\quad
\hat{p}_A = \frac{1}{n_A}\sum_{i=1}^{n_A} X_{A_i}
$$

Each $\hat{p}$ (with $n$ denoting the size of its own group):

- Is a random variable taking values $\{0,\tfrac{1}{n},\tfrac{2}{n},\ldots,1\}$
- Is an **estimator** of the true expected value $p = E[X]$
- By the law of large numbers, $\hat{p} \to p$ as $n$ grows

Because it is the mean of $n$ independent Bernoulli variables, the count $n\hat{p}$ follows a **binomial** distribution, so $\hat{p}$ is a scaled binomial that becomes approximately **Gaussian** when $n$ is large.

---

### Variance and Standard Deviation of a Sample Proportion

For a single Bernoulli $X$:  
$$
\mathrm{Var}(X) = p(1-p)
$$

For the sample proportion:
$$
\mathrm{Var}\!\left(\tfrac{1}{n} \sum_{i=1}^n X_i\right)
= \tfrac{1}{n^2} n p(1-p)
= \tfrac{p(1-p)}{n}
$$

$$
\boxed{\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}}
$$

The square root of this variance is the **standard error**:

$$
\boxed{SE = SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}}
$$
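
As a quick sanity check on the boxed formula, the following sketch simulates many sample proportions and compares their empirical standard deviation with $\sqrt{p(1-p)/n}$. The values `p = 0.69` and `n = 2022` are illustrative choices close to Variant C; the seed and the number of replications are arbitrary.

```python
import numpy as np


def standard_error(p: float, n: int) -> float:
    """Theoretical standard deviation of a sample proportion."""
    return float(np.sqrt(p * (1 - p) / n))


rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
p, n = 0.69, 2022                    # illustrative values, roughly Variant C
n_replications = 50_000

# Each binomial draw is the number of conversions among n users;
# dividing by n gives one realization of the sample proportion.
p_hats = rng.binomial(n, p, size=n_replications) / n

se_theory = standard_error(p, n)
se_empirical = float(p_hats.std(ddof=1))

print(f"theoretical SE: {se_theory:.5f}")
print(f"empirical SD:   {se_empirical:.5f}")
```

The two numbers should agree closely, which is all the formula claims: the spread of $\hat{p}$ across repeated experiments shrinks like $1/\sqrt{n}$.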

---

### Difference in Proportions

For deciding "non-inferiority" we use the **difference** between variant and control proportions:

$$
\hat{\Delta} = \hat{p}_A - \hat{p}_C
$$

This estimates the true difference:

$$
\Delta = p_A - p_C
$$
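
For a concrete instance, this sketch computes $\hat{\Delta}$ for each variant from the conversion counts used in this notebook's experiment data. The dictionary layout and variable names are illustrative only.

```python
# Conversion counts from the passkey creation experiment in this notebook.
n_control, x_control = 32106, 22772
variants = {
    "A": {"n": 4625, "x": 3244},
    "B": {"n": 2100, "x": 1433},
    "C": {"n": 2022, "x": 1396},
}

p_hat_control = x_control / n_control

# Observed difference in conversion rate: variant minus control.
delta_hat = {
    name: counts["x"] / counts["n"] - p_hat_control
    for name, counts in variants.items()
}

for name, d in delta_hat.items():
    print(f"Variant {name}: delta_hat = {d:+.4f} ({d * 100:+.2f} percentage points)")
```

A negative value means the variant converted less often than control in the sample; whether that gap is compatible with the margin $-\epsilon$ is what the test has to decide.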

---

### Hypotheses

- **Null Hypothesis $H_0$** (the "bad" scenario we want to reject):  
  the new UX **degrades** conversion by at least $\epsilon$ (e.g., 2%):

  $$
  H_0: E[\Delta] \le -\epsilon
  $$

- **Alternative Hypothesis $H_1$** (the new UX is **not worse** than control):

  $$
  H_1: E[\Delta] > -\epsilon
  $$

- **Boundary Hypothesis** (used in test construction):  
  assume the difference is exactly at the acceptable degradation limit:

  $$
  E[\Delta] = -\epsilon
  $$

---

## Real Experiment Data

Our actual passkey creation experiment data:

- $n_C$ : number of visitors in the **control** group  
- $x_C$ : number of **conversions** in the control group

- $n_A$ : number of visitors in the **variant** under test (Variant C in the code below; the subscript $A$ is generic)  
- $x_A$ : number of **conversions** in that variant

- $\hat{\Delta}_{\mathrm{obs}}$ : **observed difference** in conversion proportions

- $-\epsilon$ : **acceptable degradation margin** (e.g., -2%)

In [None]:
# Real experiment data from passkey creation launch
nC = 32106
xC_observed = 22772 
control_group_conversion_rate = xC_observed / nC 

# Three variants with actual experiment data
variants = {
    'A': {'n': 4625, 'x': 3244},
    'B': {'n': 2100, 'x': 1433},
    'C': {'n': 2022, 'x': 1396}
}

# Focus on Variant C for detailed NHST analysis
nX = variants['C']['n']
xX_observed = variants['C']['x']

# Test parameters
epsilon = 0.02  # 2% non-inferiority margin
alpha = 0.05    # 5% significance level

# Derived values
hatpC_observed = xC_observed / nC
hatpA_observed = xX_observed / nX
hatDelta_observed = hatpA_observed - hatpC_observed

print("="*80)
print("REAL EXPERIMENT DATA")
print("="*80)
print(f"\nControl group:")
print(f"  Sample size: {nC:,}")
print(f"  Conversions: {xC_observed:,}")
print(f"  Conversion rate: {hatpC_observed:.4f} ({hatpC_observed*100:.2f}%)")

print(f"\nVariant C:")
print(f"  Sample size: {nX:,}")
print(f"  Conversions: {xX_observed:,}")
print(f"  Conversion rate: {hatpA_observed:.4f} ({hatpA_observed*100:.2f}%)")

print(f"\nObserved difference: {hatDelta_observed:.4f} ({hatDelta_observed*100:.2f}%)")
print(f"Non-inferiority margin (ε): {epsilon:.4f} ({epsilon*100:.2f}%)")
print(f"Non-inferiority threshold: {-epsilon:.4f} ({-epsilon*100:.2f}%)")
print(f"\n{'='*80}")

---

## Standard Error Estimation: The Plug-In Principle Problem

In NHST, we must **estimate the standard deviation** of the estimator $\hat{\Delta}$ (the **standard error**, SE).  

This is a key pain point:

- We **do not know** the true standard deviation, because it depends on unknown conversion probabilities
- Frequentist methods use the **plug-in principle**: estimate the variance by "plugging in" sample estimates

**The circularity problem:**

1. We want to know if the data are unusual under $H_0$
2. To measure "unusual," we need the standard error assuming $H_0$
3. SE depends on unknown true rates, so we **plug in** $\hat{p}$ (from the data!)
4. We then use this data-derived SE to judge whether the data are unusual

It's like saying: *"Use my one measurement to tell me how variable my measurements are, then use that to decide if my measurement is surprising."*

---

### Wald Unpooled Standard Error (for Non-Inferiority)

For **non-inferiority** (allowing a margin $-\epsilon$), we **cannot** assume $p_A = p_C$, so we don't pool.

We sum the individual variances (using plug-in estimates for each group):

$$
\widehat{\text{SE}} =
\sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} +
      \frac{\hat{p}_C(1-\hat{p}_C)}{n_C}}
$$

Ideally, the true $p_A$ and $p_C$ should be used, but we don't know them, so we substitute $\hat{p}_A$ and $\hat{p}_C$.  
This works but can be **inaccurate if sample sizes are small** or rates are at extremes.

In [None]:
# Compute standard errors
pooled_proportion = (xC_observed + xX_observed) / (nC + nX)
wald_pooled_SE = (pooled_proportion * (1 - pooled_proportion) * (1/nC + 1/nX))**0.5
wald_unpooled_SE = ((hatpC_observed * (1 - hatpC_observed) / nC) + 
                    (hatpA_observed * (1 - hatpA_observed) / nX))**0.5

print("Standard Error Estimates:")
print(f"  Wald Pooled SE: {wald_pooled_SE:.4f}")
print(f"  Wald Unpooled SE: {wald_unpooled_SE:.4f}")
print(f"\n  → Using Unpooled SE for non-inferiority test")

---

## Computing the p-Value

### Using the "Boundary" as the Mean

The null hypothesis for non-inferiority is technically an inequality:

$$
H_0: E[\Delta] \le -\epsilon
$$

To get a single distribution to work with, we use the **boundary value** as the mean:

$$
\mu = E[\Delta] = -\epsilon
$$

**Why?**  
- This is the **most conservative** test
- Any distribution centered lower (more in favor of $H_0$) would give an even smaller right-tail probability
- Any distribution centered higher would be outside $H_0$
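
To see why the boundary is the conservative choice, this sketch evaluates the right-tail probability of the observed Variant C difference under null distributions centered at the boundary and further below it. It uses the notebook's counts and the unpooled standard error; the candidate means other than $-\epsilon$ are arbitrary illustrations, and `NormalDist` from the standard library stands in for `scipy.stats.norm`.

```python
from math import sqrt
from statistics import NormalDist

# Counts from this notebook: control and Variant C.
n_c, x_c = 32106, 22772
n_a, x_a = 2022, 1396
epsilon = 0.02

p_c, p_a = x_c / n_c, x_a / n_a
delta_obs = p_a - p_c
se = sqrt(p_a * (1 - p_a) / n_a + p_c * (1 - p_c) / n_c)  # unpooled Wald SE

# Candidate means allowed by H0: the boundary, then two arbitrary lower values.
candidate_means = [-epsilon, -0.03, -0.04]
tail_probs = [1 - NormalDist(mu, se).cdf(delta_obs) for mu in candidate_means]

for mu, prob in zip(candidate_means, tail_probs):
    print(f"null mean = {mu:+.3f} -> P(delta_hat >= observed) = {prob:.4f}")
```

The tail probability is largest at the boundary, so a test that rejects there would also reject for every other mean allowed by $H_0$.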

Under $H_0$, we model $\hat{\Delta}$ as:

$$
\hat{\Delta} \sim N(\mu, \sigma^2)
$$

with

$$
\mu = -\epsilon, \qquad \sigma = SE
$$

---

### The p-Value

The **p-value** is the probability (under $H_0$) of observing a result **as extreme or more extreme** than what we got:

$$
p\text{-value} = P_{H_0}\big[\hat{\Delta} \ge \hat{\Delta}_{\text{obs}}\big]
= \int_{\hat{\Delta}_{\text{obs}}}^{+\infty} 
\frac{1}{\sqrt{2\pi}\,\sigma}
\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\,dx
$$

Using the standard normal CDF $\Phi$:

$$
p\text{-value} = 1 - \Phi\!\left(\frac{\hat{\Delta}_{\text{obs}}-\mu}{\sigma}\right)
$$

---

### Critical Value

The **critical value** $c$ is the smallest observed difference that would lead to rejection at level $\alpha$:

$$
c = \mu + \sigma \,\Phi^{-1}(1 - \alpha)
$$

Any observed $\hat{\Delta}_{\text{obs}} \ge c$ yields $p\text{-value} \le \alpha$ and thus rejects $H_0$.

In [None]:
# Compute p-value and critical value
SE_H0 = wald_unpooled_SE
mu_H0 = -epsilon    # mean under boundary hypothesis
sigma_H0 = SE_H0    # standard deviation

# p-value: P(Delta >= Delta_obs | H0)
p_value = norm.sf(hatDelta_observed, loc=mu_H0, scale=sigma_H0)

# Critical value for alpha = 0.05
critical_value = norm.isf(alpha, loc=mu_H0, scale=sigma_H0)

print("="*80)
print("NHST RESULTS")
print("="*80)
print(f"\np-value: {p_value:.4f} ({p_value*100:.2f}%)")
print(f"Significance level (α): {alpha:.4f} ({alpha*100:.2f}%)")
print(f"Critical value: {critical_value:.4f}")
print(f"Observed difference: {hatDelta_observed:.4f}")

if p_value <= alpha:
    print(f"\n✓ REJECT H₀: p-value ({p_value:.4f}) ≤ α ({alpha})")
    print(f"  Conclusion: Variant is non-inferior (at the {alpha*100:.0f}% significance level)")
else:
    print(f"\n✗ FAIL TO REJECT H₀: p-value ({p_value:.4f}) > α ({alpha})")
    print(f"  Conclusion: Cannot determine if variant is non-inferior")
    print(f"  → Result is INCONCLUSIVE with current sample size")
    print(f"\n  The p-value of {p_value*100:.1f}% exceeds the {alpha*100:.0f}% threshold.")
    print(f"  This means the observed data are consistent with H₀.")
    print(f"  NHST provides no actionable guidance in this situation.")

print(f"\n{'='*80}")

In [None]:
# Visualize the hypothesis test
fig, ax = plot_gaussian_hypothesis_test(
    mu_H0=mu_H0,
    sigma_H0=sigma_H0,
    observed_value=hatDelta_observed,
    alpha=alpha,
    epsilon=epsilon
)
plt.show()

print(f"\n📊 The plot shows:")
print(f"   • Null distribution centered at -ε = {mu_H0:.4f}")
print(f"   • Critical value (red line) at {critical_value:.4f}")
print(f"   • Observed difference (blue line) at {hatDelta_observed:.4f}")
print(f"   • Right-tail area (p-value) = {p_value:.4f} ({p_value*100:.1f}%)")
print(f"\n   Since p-value ({p_value*100:.1f}%) >> α ({alpha*100:.0f}%), we cannot reject H₀")
print(f"   The observed difference is not far enough to the right to be convincing.")

---

## Alternative z-Score Formulation

Another common way to compute the p-value is to **standardize** the observed statistic:

$$
Z_{\mathrm{NI}}
= \frac{\hat{\Delta} - E[\Delta]_{H_{\text{boundary}}}}{SE}
= \frac{\hat{\Delta} - (-\epsilon)}{SE}
= \frac{\hat{\Delta} + \epsilon}{SE}
$$

Under $H_0$, $Z_{\mathrm{NI}}$ follows approximately a standard normal $N(0,1)$.

The p-value is the right-tail probability:

$$
p\text{-value} = P[Z \ge Z_{\mathrm{NI}}] = \int_{Z_{\mathrm{NI}}}^{+\infty} 
\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz
$$

This gives the same p-value; it is just a different mathematical framing.

In [None]:
# z-score formulation
z_ni = (hatDelta_observed + epsilon) / SE_H0
p_zni = norm.sf(z_ni)

print(f"z-score formulation:")
print(f"  z_NI = (Δ_obs + ε) / SE = {z_ni:.4f}")
print(f"  p-value = {p_zni:.4f}")
print(f"\n  ✓ Same result as before (as expected)")

---

## Type I Error (False Positive)

In this NHST setup, **α** represents the **false positive rate**:

- **Type I Error**: Rejecting $H_0$ when it is actually true
- In non-inferiority testing: concluding "no unacceptable degradation" when the variant **does** degrade conversion by at least $\epsilon$

This conditional probability, evaluated at the boundary $E[\Delta] = -\epsilon$ where it is largest, is:

$$
P(\text{Reject } H_0 \mid H_0 \text{ is true}) = \alpha
$$

By setting $\alpha = 0.05$, we accept a **5% risk** of incorrectly claiming non-inferiority.

**Important**: This is a frequentist definition:
- If we ran the experiment many times, we would incorrectly reject ~5% of the time
- It does **not** assign any probability to the current decision
- It says nothing about the "effect size" or how much better/worse the variant is
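
The frequentist reading of $\alpha$ can be checked by simulation. This sketch assumes the truth sits exactly on the boundary ($p_A = p_C - \epsilon$), uses the notebook's group sizes, and takes $p_C = 0.709$ as an approximation of the control rate; the seed and the number of simulated experiments are arbitrary.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility

n_c, n_a = 32106, 2022               # group sizes from this notebook
p_c_true = 0.709                     # assumed true control rate (approximation)
epsilon, alpha = 0.02, 0.05
p_a_true = p_c_true - epsilon        # truth placed exactly on the H0 boundary
n_experiments = 20_000

# Simulate many repetitions of the same experiment.
p_c_hat = rng.binomial(n_c, p_c_true, size=n_experiments) / n_c
p_a_hat = rng.binomial(n_a, p_a_true, size=n_experiments) / n_a

delta_hat = p_a_hat - p_c_hat
se_hat = np.sqrt(p_a_hat * (1 - p_a_hat) / n_a + p_c_hat * (1 - p_c_hat) / n_c)

# Reject H0 when the standardized statistic exceeds the one-sided critical z.
z_crit = NormalDist().inv_cdf(1 - alpha)
z_ni = (delta_hat + epsilon) / se_hat
false_positive_rate = float(np.mean(z_ni >= z_crit))

print(f"Simulated Type I error rate: {false_positive_rate:.4f} (nominal {alpha})")
```

The simulated rate should land near the nominal 5%, which describes the long-run behaviour of the procedure and not the probability that any single decision is wrong.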

---

## Type II Error (False Negative), Power, and Sample Size

The **false negative** (Type II error, β) is **failing to reject $H_0$ when $H_1$ is actually true**.

In non-inferiority testing:
- We fail the test even though the new UX is truly **non-inferior**
- This typically means we need more data to detect the effect

---

### Choosing an Effect Size Under $H_1$

To compute Type II error, we must choose an expected value for $\Delta$ **under $H_1$**.

Common choice: the **minimum effect size we care to detect**, often $E[\Delta] = 0$ (no difference):
- If the variant is truly "no worse" (Δ = 0), the test should reject $H_0$ most of the time
- This is a **business decision**: "How small of a difference do we need to detect?"

---

### Modeling Under $H_1$

If we assume the variant is truly **no worse** (Δ = 0), we can pool samples:

$$
SE_{H_1}
= \sqrt{\hat{p}_{\mathrm{pool}}
(1-\hat{p}_{\mathrm{pool}})
\left(\tfrac{1}{n_C}+\tfrac{1}{n_A}\right)}
$$

We compare this **alternative distribution** (mean = 0, std = $SE_{H_1}$) to the **critical value** set by α.

---

### Beta and Power

- **β (Type II error)** = probability of failing to reject $H_0$ when $H_1$ is true
  - Area of $H_1$ distribution **to the left** of the critical value
  
- **Power** = $1-\beta$ = probability of **correctly rejecting** $H_0$ when variant is truly non-inferior
  - "If the property we care about is really there, how often can we detect it?"
  - In ML terms: **recall** or **sensitivity**

**Typical target**: Power = 80% (so β = 20%)

In [None]:
# Compute power under H1 (assuming true difference = 0)
SE_H1 = wald_pooled_SE
mu_H1 = 0  # Assume no true difference
sigma_H1 = SE_H1

# Beta = P(Delta < critical_value | H1 is true)
beta = norm.cdf(critical_value, loc=mu_H1, scale=sigma_H1)
power = 1 - beta

print("="*80)
print("POWER ANALYSIS")
print("="*80)
print(f"\nAssumption under H₁: True difference = 0 (no degradation)")
print(f"\nType II Error (β): {beta:.4f} ({beta*100:.2f}%)")
print(f"Power (1 - β): {power:.4f} ({power*100:.2f}%)")

print(f"\nInterpretation:")
if power >= 0.80:
    print(f"  ✓ Power ≥ 80%: Test is adequately powered")
else:
    print(f"  ✗ Power < 80%: Test is UNDERPOWERED")
    print(f"  → Only {power*100:.1f}% chance of detecting non-inferiority")
    print(f"  → {beta*100:.1f}% chance of false negative (missing a truly non-inferior variant)")
    print(f"  → Need a larger sample size for reliable conclusions")

print(f"\n{'='*80}")

In [None]:
# Visualize Type II error analysis
fig, ax = plot_type_ii_error_analysis(
    mu_H1=mu_H1,
    sigma_H1=sigma_H1,
    critical_value=critical_value,
    hatDelta_observed=hatDelta_observed,
    epsilon=epsilon,
    beta=beta,
    power=power
)
plt.show()

print(f"\n📊 The plot shows:")
print(f"   • H₀ distribution (red) centered at -ε = {mu_H0:.4f}")
print(f"   • H₁ distribution (green) centered at 0 (no difference)")
print(f"   • Critical value at {critical_value:.4f}")
print(f"   • β (orange area) = {beta:.4f} = probability of missing a non-inferior variant")
print(f"   • Power (green area) = {power:.4f} = probability of correctly detecting non-inferiority")
print(f"\n   The two distributions overlap substantially, showing why the test is underpowered.")

---

## Required Sample Size for Target Power

If we want to achieve a **target power** (commonly 80%, so β = 0.2), we can solve for the required sample size.

The relationship:
- Larger $n$ → smaller $SE$ → distributions separate more → higher power

This is the standard **sample size calculation** for planning an A/B test:

1. Fix α (e.g., 0.05)
2. Choose minimum effect size of interest (e.g., Δ = 0 for non-inferiority)
3. Set desired power (e.g., 80%)
4. Solve for $n_C$ and $n_A$ to achieve that power
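
The notebook delegates this calculation to `compute_sample_size_non_inferiority`, whose implementation is not shown. As a rough cross-check, here is a minimal sketch of the textbook normal-approximation formula $n_A = (z_{1-\alpha} + z_{1-\beta})^2 \,[p_A(1-p_A) + p_C(1-p_C)/r] \,/\, (\Delta + \epsilon)^2$. The function name `required_variant_sample_size`, its parameters, and the convention that `allocation_ratio` means $r = n_C / n_A$ are assumptions made here and may not match the helper's conventions or its exact output.

```python
from math import ceil
from statistics import NormalDist


def required_variant_sample_size(
    p_control: float,
    epsilon: float,
    alpha: float = 0.05,
    target_power: float = 0.80,
    true_difference: float = 0.0,
    allocation_ratio: float = 1.0,
) -> int:
    """Variant-group size for a one-sided non-inferiority z-test.

    allocation_ratio is n_control / n_variant (an assumed convention).
    """
    if true_difference + epsilon <= 0:
        raise ValueError("true_difference must lie above -epsilon")
    p_variant = p_control + true_difference
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(target_power)
    # Variance of delta_hat multiplied by n_variant.
    scaled_variance = (
        p_variant * (1 - p_variant)
        + p_control * (1 - p_control) / allocation_ratio
    )
    n = (z_alpha + z_power) ** 2 * scaled_variance / (true_difference + epsilon) ** 2
    return ceil(n)


# Control rate from this notebook's data, 2% margin, 1:1 allocation.
n_variant = required_variant_sample_size(p_control=22772 / 32106, epsilon=0.02)
print(f"Required users per group (1:1 allocation): {n_variant:,}")
```

Because the margin enters squared in the denominator, halving $\epsilon$ roughly quadruples the required sample, which is why tight margins are so expensive at low traffic.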

In [None]:
# Compute required sample size for 80% power
print("="*80)
print("SAMPLE SIZE CALCULATION FOR NON-INFERIORITY TEST")
print("="*80)

# Parameters
p_control = control_group_conversion_rate  
epsilon_val = epsilon  
alpha_val = alpha  
target_power = 0.80

print(f"\nParameters:")
print(f"  Control conversion rate: {p_control:.2%}")
print(f"  Non-inferiority margin (ε): {epsilon_val:.2%}")
print(f"  Significance level (α): {alpha_val:.2%}")
print(f"  Target power: {target_power:.2%}")
print(f"  Assumed true difference under H₁: 0 (no difference)")

# Equal allocation (1:1)
result_equal = compute_sample_size_non_inferiority(
    p_control=p_control,
    epsilon=epsilon_val,
    alpha=alpha_val,
    target_power=target_power,
    h1_effect_size=0.0,
    allocation_ratio=1.0
)

print(f"\n{'='*80}")
print("EQUAL ALLOCATION (1:1 - Control:Variant)")
print(f"{'='*80}")
print(f"Required sample size per group: {result_equal['n_variant']:,}")
print(f"  Control: {result_equal['n_control']:,}")
print(f"  Variant: {result_equal['n_variant']:,}")
print(f"  Total: {result_equal['n_total']:,}")
print(f"\nAchieved power: {result_equal['power_achieved']:.4f} ({result_equal['power_achieved']*100:.1f}%)")

print(f"\n{'='*80}")
print("COMPARISON WITH CURRENT EXPERIMENT")
print(f"{'='*80}")
print(f"\nCurrent sample sizes:")
print(f"  Control: {nC:,}")
print(f"  Variant C: {nX:,}")
print(f"  Power at current sample sizes: {power:.4f} ({power*100:.1f}%)")

print(f"\nTo achieve 80% power:")
print(f"  Required: {result_equal['n_variant']:,} per group")
print(f"  Current variant sample: {nX:,}")
increase_factor = result_equal['n_variant'] / nX
print(f"  Increase needed: {increase_factor:.1f}x more samples")

print(f"\n{'='*80}")
print("💡 KEY INSIGHT: WHY NHST FAILS WITH SMALL SAMPLES")
print(f"{'='*80}")
print(f"\nWith current sample (n={nX:,}):")
print(f"  • Power is {power*100:.1f}% (below the 80% target)")
print(f"  • p-value = {p_value:.4f} >> α = {alpha} (cannot reject H₀)")
print(f"  • Result: INCONCLUSIVE - no actionable guidance")

print(f"\nNeed n≈{result_equal['n_variant']:,} per group for reliable conclusions:")
print(f"  • That's {increase_factor:.1f}x more data")
print(f"  • Could take weeks or months to collect")
print(f"  • Impractical for rapid product iteration")

print(f"\n📌 This is why NHST is unsuitable for:")
print(f"   ✗ Early-stage feature launches with limited traffic")
print(f"   ✗ Risk-averse traffic allocation (2-5% to variants)")
print(f"   ✗ Fast decision-making in product development")
print(f"\n{'='*80}")

---

## Summary: NHST Limitations with Real Data

### What NHST Gave Us

With our real experiment data (n=2,022 for Variant C):

| Metric | Value | Meaning |
|--------|-------|----------|
| **p-value** | ~45% | >> 5% threshold |
| **Decision** | Fail to reject H₀ | **INCONCLUSIVE** |
| **Power** | Below the 80% target | Underpowered |
| **Sample size needed** | Much larger | Current insufficient |
| **Actionable guidance** | **NONE** | Cannot make decision |

---

### What NHST Cannot Tell Us

❌ **Probability variant is non-inferior**: NHST gives P(data | H₀), not P(H₀ | data)  
❌ **Actionable guidance**: "Cannot reject" provides no direction  
❌ **Quantified confidence**: No probability the variant is acceptable  
❌ **Expected value for decisions**: Cannot compute risk-adjusted value  
❌ **Continuous monitoring**: Must wait for predetermined sample size  

---

### Why NHST Fails for Modern Product Development

**The fundamental mismatch:**

| Product Reality | NHST Requirement |
|----------------|------------------|
| Small samples (2-5% traffic) | Large samples (many multiples more) |
| Fast decisions (days) | Long wait (weeks/months) |
| Multiple variants (3-5) | Complex corrections needed |
| Unbalanced allocation | Loses efficiency |
| Continuous monitoring | Forbidden (p-hacking) |
| Actionable probabilities | Binary reject/fail |

---

### The Core Problem

NHST was designed for:
- **Large, controlled experiments** (clinical trials with thousands of patients)
- **Fixed sample sizes** (planned in advance, no peeking)
- **Single primary comparison** (treatment vs. placebo)
- **Asymmetric questions** ("Is drug better than nothing?")

Modern product development needs:
- **Small, iterative experiments** (limited traffic to minimize risk)
- **Flexible monitoring** (check anytime, stop early if clear)
- **Multiple comparisons** (3-5 variants simultaneously)
- **Symmetric questions** ("Which variant is best?")

---

### What We Actually Need

For the question *"Is Variant C non-inferior?"* we want:

✓ **P(variant is non-inferior | data)**: direct probability  
✓ **Works with small samples**: uses prior knowledge  
✓ **Actionable output**: quantified confidence for decision-making  
✓ **Expected value computation**: risk-adjusted decisions  
✓ **Continuous monitoring**: check anytime without penalties  

**→ Bayesian methods provide exactly this.**
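
As a hedged illustration of the first item, the sketch below estimates $P(\Delta > -\epsilon \mid \text{data})$ for Variant C by Monte Carlo. It assumes independent uniform Beta(1, 1) priors on both conversion rates, which is a modeling choice made here for illustration and not something specified in this notebook; the seed and the number of posterior draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, only for reproducibility

# Counts from this notebook: control and Variant C.
n_c, x_c = 32106, 22772
n_a, x_a = 2022, 1396
epsilon = 0.02
n_draws = 200_000

# With a Beta(1, 1) prior (assumed) and binomial data, the posterior of each
# conversion rate is Beta(1 + conversions, 1 + non-conversions).
posterior_control = rng.beta(1 + x_c, 1 + n_c - x_c, size=n_draws)
posterior_variant = rng.beta(1 + x_a, 1 + n_a - x_a, size=n_draws)

# Fraction of posterior draws in which the variant is within the margin.
prob_non_inferior = float(
    np.mean(posterior_variant - posterior_control > -epsilon)
)

print(f"P(variant C is non-inferior | data) ~ {prob_non_inferior:.3f}")
```

Unlike the p-value, this number is a direct statement about the hypothesis given the data, so it can be fed into an expected-value calculation; its value does depend on the prior, which should be chosen and documented deliberately.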

---

### Conclusion

With our real experiment data:

- **NHST conclusion**: "Cannot determine if variant is non-inferior. p-value is 45%, far too high. Need much more data. Come back in a few weeks."
- **Business impact**: Product team blocked, cannot iterate, cannot scale successful features
- **Root cause**: NHST's mathematical framework requires large samples to overcome uncertainty

**The math in this notebook is correct** ‚Äî NHST faithfully implements its framework.  
**The framework itself is the problem** ‚Äî it's mismatched to modern product development constraints.

This is why Bayesian methods, which incorporate prior knowledge and provide direct probabilistic answers, are superior for A/B testing in web/mobile applications.