# From Manual Bayesian A/B Tests to Automated Thompson Sampling

**A practitioner's guide to modern A/B testing for web and mobile product teams**

This presentation shows how we analyze product launches (e.g. passkey rollouts) with Bayesian A/B test analysis, then suggests how to systematize and automate experimentation with Thompson Sampling, a bandit algorithm that uses the same Bayesian posteriors to allocate traffic automatically.

---

## Presentation Overview

1. **Classical Statistics vs. Bayesian Statistics** — NHST $P(D \mid \theta)$ vs. Bayesian $P(\theta \mid D)$
2. **Why Bayesian is Better for CX / Pricing A/B Tests** — practical, technical, and institutional reasons
3. **Our First Test: Non-Inferiority** — NHST provides no answer; our Bayesian test does, even on a small sample
4. **Select Best Variant** — choose the winning variant with direct probability
5. **Practical Issues in Large Corporate Environments** — release engineering, approvals, and iteration speed
6. **Multi-Armed Bandits and Thompson Sampling** — the fully automated solution

# Part 1: Classical Statistics vs. Bayesian Statistics

---

## The Fundamental Question

When we run an experiment and observe data, two very different questions can be asked:

| Framework | Question | Notation |
|-----------|----------|----------|
| **NHST (Frequentist)** | "How likely is this data, assuming the hypothesis is true?" | $P(\text{data} \mid \theta)$ |
| **Bayesian** | "How likely is the hypothesis, given the data we observed?" | $P(\theta \mid \text{data})$ |

These look similar but are **fundamentally different**.

- **NHST** bakes the decision rule *into* the probability computation — you set a significance level $\alpha$ (e.g., 5%), compute a p-value, and get a binary reject/fail-to-reject answer.
- **Bayesian** keeps the decision rule *outside* the probability computation — you get a full posterior distribution, then apply whatever business logic you need.

> **Analogy**: NHST is like a smoke detector (binary alarm). Bayesian is like a thermometer (continuous reading you can act on however you choose).

---

## NHST in a Nutshell

1. **Assume what you *don't* want to see** — the **null hypothesis** $H_0$
   - Medicine: "the drug has no effect"
   - A/B test: "the new experience degrades conversion"
2. **Run the experiment** and compute a test statistic
3. **Ask**: If $H_0$ were true, how likely is a result at least this extreme?
   - If that probability (the **p-value**) is below threshold $\alpha$, **reject** $H_0$

### Key Caveats

- Rejecting $H_0$ does **not** prove the alternative is true — it only says the data would be unlikely *if* $H_0$ were correct.
- The p-value is the probability, under $H_0$, of data at least as extreme as what we observed. It is **not** $P(H_0 \mid \text{data})$, so the resulting decision comes with **no probability of being correct**.
- "Unlikely enough" is arbitrary — the 5% threshold is a convention, not a law of nature.
- Without $P(H_0 \mid \text{data})$, we **cannot** compute expected values for decision-making. If deploying a bad variant costs \$100k and a good one gains \$50k, NHST provides no framework to quantify the expected value of the decision.
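
To make the last caveat concrete, here is a minimal sketch of the expected-value calculation that becomes possible once a posterior probability is available. The function names and the probabilities in the loop are illustrative assumptions; only the \$100k cost and \$50k gain come from the bullet above.

```python
def expected_value(p_good: float, gain: float = 50_000.0, loss: float = 100_000.0) -> float:
    """Expected payoff of deploying, given P(variant is good | data)."""
    return p_good * gain - (1.0 - p_good) * loss


def break_even_probability(gain: float = 50_000.0, loss: float = 100_000.0) -> float:
    """P(good) at which deploying has exactly zero expected value."""
    return loss / (gain + loss)


# Illustrative posterior probabilities (not results from our experiment)
for p in (0.5, 0.7, 0.9):
    print(f"P(good) = {p:.0%}: expected value = ${expected_value(p):,.0f}")

print(f"Break-even P(good) = {break_even_probability():.1%}")
```

With these payoffs the break-even point is $100/(100+50) = 2/3$: deploying only has positive expected value when the posterior probability of a good variant exceeds roughly 67%. A p-value cannot be plugged into this formula.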

---

## Bayesian in a Nutshell

1. **Pick a prior belief** — express it as a probability distribution (e.g., Beta distribution)
2. **Run the experiment** — observe data
3. **Apply Bayes' theorem** — update the prior into a **posterior**
4. **Rinse and repeat** — the posterior becomes the new prior for the next batch of data

$$
\boxed{P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}}
$$

The posterior gives us a **full probability distribution** over the unknown parameter — we can compute any quantity we need: point estimates, credible intervals, probability of being better than a threshold, expected loss, etc.
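
The four steps above can be sketched for a conversion rate with a Beta prior. This is a minimal illustration, assuming a uniform Beta(1, 1) prior and two made-up batches of data; the helper name `update` and the 65% threshold are hypothetical.

```python
from scipy.stats import beta


def update(prior, successes, trials):
    """Beta-Binomial update: add successes to alpha and failures to beta."""
    a, b = prior
    return (a + successes, b + trials - successes)


prior = (1, 1)                       # uniform prior (assumption)
post1 = update(prior, 30, 40)        # hypothetical batch 1: 30 conversions / 40 users
post2 = update(post1, 41, 60)        # batch 2 uses the batch-1 posterior as its prior
post_all = update(prior, 71, 100)    # same data processed in a single step

print(post2, post_all)               # both are (72, 30)

a, b = post2
mean = a / (a + b)
lo, hi = beta.ppf([0.025, 0.975], a, b)   # 95% credible interval
p_above = beta.sf(0.65, a, b)             # P(theta > 0.65 | data)
print(f"mean={mean:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}], P(theta > 0.65)={p_above:.3f}")
```

Updating batch by batch and updating once with all the data give the same posterior, which is what makes the "rinse and repeat" step safe.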

---

### References

- E.T. Jaynes, *Probability Theory: The Logic of Science* — the philosophical case for Bayesian reasoning as "the language of science"
  ([Cambridge University Press](https://www.cambridge.org/core/books/probability-theory/9CA08E224FF30123304E6D8935CF1A99))
- Kruschke, J.K., *Doing Bayesian Data Analysis* — accessible introduction with practical examples
  ([Academic Press](https://sites.google.com/site/doingbayesiandataanalysis/))


# Part 2: Why Bayesian is Better for CX / Pricing A/B Tests

---

## Known Issues with NHST for Product A/B Tests

| NHST Problem | Why It Matters for Product Teams |
|---|---|
| **Underpowered with small samples** | At launch, variants get only 2-5% of traffic (hundreds of users). NHST often "fails to reject" — giving no actionable answer. |
| **p-value hacking** | Checking results before the planned sample size inflates false positives. Teams are tempted to "peek" — and NHST forbids it. |
| **Unbalanced samples** | Bugs or misconfiguration in traffic splitters create wildly unequal groups (e.g., 7,000 control vs. 150 variant). NHST loses efficiency; Bayesian handles this naturally. |
| **No updatability** | Cannot incorporate previous experiment results or domain knowledge. Each test starts from scratch. |
| **Binary output** | "Reject" or "fail to reject" — no probability of being better, no expected value, no quantified risk. |
| **Multiple comparisons** | Comparing 3+ variants requires Bonferroni or other corrections, making an already underpowered test even more conservative. |
| **Replication crisis** | The accumulated effect of these issues has led to a well-documented crisis of replicability in medicine and social sciences. |
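
The "peeking" row can be checked with a small simulation. The sketch below runs A/A tests (both arms share the same true rate, so every significant result is a false positive) and compares a single final look with a look after every batch. All parameter values and the function name are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm


def false_positive_rates(n_sims=2000, n_per_arm=2000, n_looks=10,
                         p=0.7, alpha=0.05, seed=0):
    """Return (rate with one final look, rate when stopping at any significant look)."""
    rng = np.random.default_rng(seed)
    step = n_per_arm // n_looks
    n_at_look = step * np.arange(1, n_looks + 1)   # users per arm at each look
    z_crit = norm.ppf(1 - alpha / 2)               # two-sided critical value

    # Cumulative conversions for two arms that share the SAME true rate
    xa = rng.binomial(step, p, size=(n_sims, n_looks)).cumsum(axis=1)
    xb = rng.binomial(step, p, size=(n_sims, n_looks)).cumsum(axis=1)

    # Pooled two-proportion z-test at every look
    pooled = (xa + xb) / (2 * n_at_look)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_at_look)
    z = (xa - xb) / n_at_look / se
    significant = np.abs(z) > z_crit

    return significant[:, -1].mean(), significant.any(axis=1).mean()


single_look, peeking = false_positive_rates()
print(f"False positives, one planned look: {single_look:.1%}")
print(f"False positives, peeking 10 times: {peeking:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate is several times higher, even though no arm is actually better.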

---

## Why Bayesian Excels Here

| Bayesian Advantage | Detail |
|---|---|
| **Works with small samples** | Incorporates prior knowledge; provides meaningful conclusions even with n=150 per variant. |
| **Handles unbalanced allocation** | 90% control, 10% variants? Each variant analyzed independently — no "balanced design" needed. |
| **Scales to many variants** | Single coherent analysis — no multiple comparison penalties. Direct answer: P(A is best)=31%, P(B is best)=47%, etc. |
| **Provides actionable probabilities** | Instead of "cannot reject $H_0$": "47% chance B is best, 22% it's worse than control." Can compute expected values. |
| **Continuous monitoring** | Check results anytime without p-hacking concerns. Update posteriors incrementally as data arrives. Stop early if a clear winner emerges. |
| **Updatable** | The posterior from one experiment becomes the prior for the next — mathematically rigorous sequential learning. |

---

## When Does It *Not* Matter?

All of this matters most when working with **smaller samples** and needing to **make decisions quickly**. Once you have millions of data points and can wait weeks, NHST and Bayesian approaches converge to similar conclusions. The advantage is clearest in the early, high-uncertainty phase of product launches.

---

## Institutional Shift: FDA Bayesian Guidance (January 2026)

> In January 2026, the FDA issued a landmark draft guidance titled **"Use of Bayesian Methodology in Clinical Trials of Drugs and Biological Products"**, marking a significant shift in how the agency approaches drug approval.

If a regulator as conservative as the FDA is embracing Bayesian methods for drug trials, the case for product A/B testing is even stronger: our stakes are lower and our iteration speed is higher.

- [FDA Draft Guidance (2026)](https://www.fda.gov/regulatory-information/search-fda-guidance-documents/use-bayesian-methodology-clinical-trials-drugs-and-biological-products)

---

## Summary Comparison Table

| Aspect | Traditional NHST | Bayesian Approach |
|--------|-----------------|-------------------|
| Small samples | Underpowered, inconclusive | Works well with prior knowledge |
| Unbalanced allocation | Loses efficiency | No problem |
| Multiple variants | Complex corrections needed | Natural single analysis |
| Interpretation | p-value (hard to explain) | Probability (intuitive) |
| Decision making | Binary reject/fail | Quantified risk/confidence |
| Continuous monitoring | Forbidden (p-hacking) | Allowed and rigorous |
| Time to decision | Weeks (need larger n) | Days (works with small n) |

---

### References

- Ioannidis, J.P.A. (2005), "Why Most Published Research Findings Are False" — the paper that launched the replication crisis discussion
  ([PLOS Medicine](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124))
- Gelman, A. et al., *Bayesian Data Analysis* (3rd ed.) — the standard graduate reference
  ([Columbia University](http://www.stat.columbia.edu/~gelman/book/))
- Google/Optimizely engineering blogs on Bayesian A/B testing:
  - [Optimizely Stats Engine](https://www.optimizely.com/optimization-glossary/stats-engine/)
  - [Google Analytics Bayesian approach](https://support.google.com/analytics/answer/2846882)
- VWO knowledge base on Bayesian testing:
  ([VWO](https://vwo.com/bayesian-ab-testing/))

# Part 3: Our First Test — Non-Inferiority

---

## The Setup

We have an existing digital identity + credentials creation flow with a **completion rate of ~71%**. We are adding passkey creation, which adds extra pages and clicks.

- Keep **x%** of traffic on the current experience as the **control group** $C$
- Send the remaining traffic to one or more **variants** $A_1$, $A_2$, $A_3$

**First question**: Does adding passkey creation cause an **unacceptable degradation** of the completion rate?

This is a **non-inferiority test** — we want to show the new experience is "no worse" than the current one (within a tolerance $\epsilon$).

---

## What Happened with NHST

With our real experiment data:
- **Control**: n=32,106, conversion rate ~70.9%
- **Variant C** (smallest): n=2,022, conversion rate ~69.0%
- **Non-inferiority margin**: $\epsilon$ = 2 percentage points (0.02 absolute)

The NHST non-inferiority test computes a p-value by:
1. Estimating the standard error from the data (plug-in principle — circular but pragmatically accepted)
2. Modeling the test statistic under $H_0$ as Gaussian with mean $-\epsilon$
3. Computing the right-tail probability

**Result**: p-value $\approx$ 0.45 — far from the 5% threshold. NHST **fails to reject** $H_0$.

**Translation**: "We can't say anything. We don't know whether the new CX causes unacceptable degradation."

This is because the sample is small. At launch, you typically put a tiny fraction of traffic on new features — and those small samples are often insufficient for NHST.
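
How small is too small? A rough sample-size sketch, under the normal approximation and assuming the true difference is exactly zero, a one-sided 5% test, and 80% power. The function name and these defaults are assumptions, not part of the companion notebooks.

```python
import math
from scipy.stats import norm


def required_variant_n(p, epsilon, n_control, alpha=0.05, power=0.80):
    """Variant sample size needed to establish non-inferiority (true difference = 0)."""
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    target_var = (epsilon / z) ** 2          # largest variance of the difference we can afford
    control_var = p * (1 - p) / n_control    # variance contributed by the control group
    if target_var <= control_var:
        raise ValueError("control group alone is too small for this margin")
    return math.ceil(p * (1 - p) / (target_var - control_var))


n_needed = required_variant_n(p=0.709, epsilon=0.02, n_control=32_106)
print(f"Variant users needed for 80% power: {n_needed:,}")
```

Under these assumptions the requirement lands well above the 2,022 users Variant C received, which is consistent with the inconclusive result.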

> See `ABmethodologies.ipynb` cells 5-22 for the full NHST derivation and numerical examples.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, beta as beta_dist

# --- Experiment Data ---
nC = 32106
xC = 22772
control_rate = xC / nC

variants = {
    'A': {'n': 4625, 'x': 3244},
    'B': {'n': 2100, 'x': 1433},
    'C': {'n': 2022, 'x': 1396}
}

epsilon = 0.02  # 2% non-inferiority margin

print(f"Control: n={nC:,}, rate={control_rate:.2%}")
for name, d in variants.items():
    r = d['x'] / d['n']
    print(f"Variant {name}: n={d['n']:,}, rate={r:.2%}")
print(f"\nNon-inferiority margin (epsilon): {epsilon:.0%}")
print(f"Non-inferiority threshold: {control_rate - epsilon:.2%}")

## NHST Result: Inconclusive

Let's run the NHST non-inferiority test on Variant C (the smallest sample) to see why it fails.

In [None]:
# NHST non-inferiority test on Variant C
nX = variants['C']['n']
xX = variants['C']['x']
hatpC = xC / nC
hatpA = xX / nX
hatDelta = hatpA - hatpC

# Unpooled SE (appropriate for non-inferiority)
SE = np.sqrt(hatpC * (1 - hatpC) / nC + hatpA * (1 - hatpA) / nX)
mu_H0 = -epsilon

# p-value: right tail P(Delta >= observed | H0)
p_value = norm.sf(hatDelta, loc=mu_H0, scale=SE)

# Power analysis: probability of rejecting H0 when the true difference is exactly 0
pooled_p = (xC + xX) / (nC + nX)
SE_H1 = np.sqrt(pooled_p * (1 - pooled_p) * (1/nC + 1/nX))
critical_value = norm.isf(0.05, loc=mu_H0, scale=SE)
power = 1 - norm.cdf(critical_value, loc=0, scale=SE_H1)

print("NHST Non-Inferiority Test (Variant C)")
print("=" * 50)
print(f"Observed difference: {hatDelta:.4f}")
print(f"Standard error: {SE:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Power: {power:.1%}")
print()
if p_value <= 0.05:
    print("Result: REJECT H0 — non-inferiority established")
else:
    print("Result: FAIL TO REJECT — inconclusive")
    print(f"  (Power is only {power:.1%} — test is severely underpowered)")
    print(f"  NHST cannot help us with n={nX} for this variant.")

## Bayesian Non-Inferiority: Actionable Results

The Bayesian approach uses a **weakly informative prior** centered near the control rate (since variants operate in the same range), then updates with observed data. In the code below, the prior mean sits 1 percentage point under the control rate (`expected_degradation = 0.01`).

**Key insight**: the prior reflects domain knowledge ("we added pages, so conversion should be *around* the control rate") while the non-inferiority threshold reflects a business requirement ("we can tolerate up to 2 percentage points of degradation").

**Posterior update rule** (Beta-Binomial conjugacy):
$$
\text{Prior: } \text{Beta}(\alpha_0, \beta_0) \quad + \quad \text{Data: } k \text{ successes in } n \text{ trials} \quad \Rightarrow \quad \text{Posterior: } \text{Beta}(\alpha_0 + k, \; \beta_0 + n - k)
$$

Then we directly compute: $P(\text{variant rate} > \text{control rate} - \epsilon \mid \text{data})$
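
A minimal closed-form sketch of that computation follows. It is not the implementation in the `bayesian` module: it assumes the prior is parameterized as `Beta(mean * strength, (1 - mean) * strength)`, treats the control rate as a fixed number, and uses hypothetical counts.

```python
from scipy.stats import beta


def prob_non_inferior(x_variant, n_variant, control_rate, epsilon,
                      prior_mean, prior_strength=20.0):
    """P(variant rate > control_rate - epsilon | data) under a Beta prior."""
    # Prior pseudo-counts (assumed parameterization: mean and total strength)
    a0 = prior_mean * prior_strength
    b0 = (1.0 - prior_mean) * prior_strength
    # Conjugate update: successes go to alpha, failures go to beta
    a_post = a0 + x_variant
    b_post = b0 + (n_variant - x_variant)
    threshold = control_rate - epsilon
    return beta.sf(threshold, a_post, b_post)   # posterior mass above the threshold


# Hypothetical counts, for illustration only
p_good = prob_non_inferior(150, 200, control_rate=0.71, epsilon=0.02, prior_mean=0.70)
p_bad = prob_non_inferior(120, 200, control_rate=0.71, epsilon=0.02, prior_mean=0.70)
print(f"75% observed: P(non-inferior) = {p_good:.3f}")
print(f"60% observed: P(non-inferior) = {p_bad:.3f}")
```

Treating the control rate as fixed is a simplification that is reasonable when the control group is much larger than the variant; otherwise one would sample both posteriors and compare the draws.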

> See `ABmethodologies.ipynb` cells 25-41 and `Bayesian_AB_Test_Workflow.ipynb` cells 5-8 for the full treatment.

In [None]:
from bayesian import test_non_inferiority_weakly_informative
from plotting_utils import plot_weakly_informative_prior_with_variants

# Run Bayesian non-inferiority test on all variants
expected_degradation = 0.01  # Domain knowledge: adding clicks may degrade by ~1%

results_ni = test_non_inferiority_weakly_informative(
    n_control=nC,
    x_control=xC,
    variants_data=variants,
    epsilon=epsilon,
    expected_degradation=expected_degradation,
    alpha_prior_strength=20,  # Weak prior (high entropy)
    threshold=0.95
)

print("Bayesian Non-Inferiority Test Results")
print("=" * 60)
print(f"Prior centered at: {control_rate - expected_degradation:.2%}")
print(f"Test threshold: {control_rate - epsilon:.2%}")
print()

for name, res in results_ni.items():
    status = "NON-INFERIOR" if res['is_non_inferior'] else "NOT NON-INFERIOR"
    observed_rate = variants[name]['x'] / variants[name]['n']
    print(f"Variant {name}: {status}")
    print(f"  Observed rate: {observed_rate:.2%}, Posterior mean: {res['variant_rate']:.2%}")
    print(f"  P(variant > threshold): {res['probability']:.2%}")
    print()

# Visualize
fig, ax = plot_weakly_informative_prior_with_variants(results_ni)
plt.show()

## Takeaway: Same Data, Different Answers

| Method | Variant C Result | Actionable? |
|--------|-----------------|-------------|
| **NHST** | p-value ~0.45, "fail to reject" | No — inconclusive |
| **Bayesian** | P(non-inferior) > 95% | Yes — non-inferiority established |

The Bayesian approach succeeds because it:
1. Uses a **weakly informative prior** reflecting domain knowledge (variants should perform "around" the control rate)
2. Provides a **direct probability** rather than a p-value
3. Works naturally with the **small, unbalanced samples** typical of product launches

---

### References

- Christensen, R. et al., *Bayesian Ideas and Data Analysis* — practical Bayesian methods
  ([CRC Press](https://www.routledge.com/Bayesian-Ideas-and-Data-Analysis/Christensen-Johnson-Branscum-Hanson/p/book/9781439803547))
- FDA guidance on non-inferiority trial design:
  ([FDA Non-Inferiority Guidance](https://www.fda.gov/regulatory-information/search-fda-guidance-documents/non-inferiority-clinical-trials-establish-effectiveness))

# Part 4: Select Best Variant

---

## The Problem NHST Was Not Designed For

NHST was built for **asymmetric** questions: "Is this drug better than placebo?" It struggles with **symmetric** questions: "Which of A, B, C is best?"

| NHST Approach | Problem |
|---|---|
| **Winner-takes-all** (highest observed rate) | Ignores uncertainty; easily picks wrong variant with small samples |
| **Pairwise t-tests + Bonferroni** | Very conservative (higher Type II error); only gives significant/not-significant |
| **ANOVA + post-hoc** | Only says "something differs" — not which is best or by how much |
| **Confidence interval overlap** | Inconclusive; overlapping CIs don't mean "no difference" |

**None of these directly answer**: "What is the probability that variant A is the best?"

---

## Bayesian: Direct Probability of Being Best

The Bayesian framework answers the question we actually care about:

1. Compute the **posterior Beta distribution** for each variant
2. Draw a large number of samples (e.g., 100k) from each posterior via **Monte Carlo simulation**
3. For each draw, identify which variant has the highest conversion rate
4. Report: $P(A \text{ is best}), \; P(B \text{ is best}), \; P(C \text{ is best})$
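
The four steps can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the `select_best_variant` function from the companion module; the function name `prob_best` and the counts are hypothetical.

```python
import numpy as np


def prob_best(variants, alpha_prior=1.0, beta_prior=1.0, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(each variant has the highest conversion rate)."""
    rng = np.random.default_rng(seed)
    names = list(variants)
    # Step 1: posterior Beta parameters per variant
    a = np.array([alpha_prior + variants[v]["x"] for v in names])
    b = np.array([beta_prior + variants[v]["n"] - variants[v]["x"] for v in names])
    # Step 2: one column of posterior draws per variant
    draws = rng.beta(a, b, size=(n_draws, len(names)))
    # Step 3: which variant wins each draw
    wins = np.bincount(draws.argmax(axis=1), minlength=len(names))
    # Step 4: share of draws won
    return {v: wins[i] / n_draws for i, v in enumerate(names)}


# Hypothetical counts, for illustration only
demo = {"A": {"n": 100, "x": 80}, "B": {"n": 100, "x": 60}, "C": {"n": 100, "x": 50}}
probs = prob_best(demo)
for name, p in probs.items():
    print(f"P({name} is best) = {p:.2%}")
```

Because every draw has exactly one winner, the probabilities sum to 1 regardless of how many variants are compared.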

**Advantages**:
- **Direct answer**: "Variant A is best with 88% probability"
- **No multiple-comparison corrections** — single coherent analysis
- **Scales naturally** to any number of variants
- **Quantifies uncertainty** — not just yes/no
- **Business-friendly** — easy to factor in risk, cost, implementation difficulty

> See `ABmethodologies.ipynb` cells 42-50 and `Bayesian_AB_Test_Workflow.ipynb` cells 9-12 for the full analysis.

In [None]:
from bayesian import select_best_variant
from plotting_utils import plot_multiple_posteriors_comparison

# Select the best variant using Monte Carlo simulation
selection = select_best_variant(
    variants_data=variants,
    alpha_prior=1,   # Non-informative prior for fair comparison
    beta_prior=1,
    credible_level=0.95,
    n_simulations=100000
)

# Display results
print("Probability Each Variant is Best")
print("=" * 50)
for name in ['A', 'B', 'C']:
    prob = selection['probabilities'][name]
    bar = '#' * int(prob * 50)
    print(f"  P({name} is best) = {prob:.2%}  {bar}")

winner = selection['best_variant']
print(f"\nWinner: Variant {winner}")
print(f"  Probability of being best: {selection['probabilities'][winner]:.2%}")
print(f"  Posterior mean: {selection['posterior_means'][winner]:.2%}")
ci = selection['credible_intervals'][winner]
print(f"  95% Credible interval: [{ci[0]:.2%}, {ci[1]:.2%}]")
print(f"  Expected loss: {selection['expected_loss'][winner]:.4f}")

# Visualize posterior distributions
posteriors = {}
for name, data in variants.items():
    alpha_post = data['x'] + 1
    beta_post = data['n'] - data['x'] + 1
    posteriors[name] = {
        'alpha': alpha_post,
        'beta': beta_post,
        'mean': alpha_post / (alpha_post + beta_post),
        'ci_95': (beta_dist.ppf(0.025, alpha_post, beta_post),
                  beta_dist.ppf(0.975, alpha_post, beta_post))
    }

fig, ax = plot_multiple_posteriors_comparison(
    posteriors=posteriors,
    control_group_conversion_rate=control_rate,
    epsilon=epsilon
)
plt.show()

# Part 5: Practical Issues in Large Corporate Environments

---

## The Iteration Speed Problem

Even after a Bayesian analysis delivers a clear winner, **deploying that winner can take months** in a large organization:

### A Real-World Timeline

| Date | Event |
|------|-------|
| **Early November** | Analysis complete. Variant A identified as winner with >88% probability. Decision made to deploy. |
| **Mid November** | Release freeze begins (Black Friday / holiday season). No changes allowed. |
| **Late November** | Bug discovered in the custom A/B traffic splitter. Needs fix before deployment. |
| **December** | Legal review required for the change. Holiday schedules slow approvals. |
| **January** | Winning variant finally deployed to 100% of traffic. |

**Result**: ~2 months between "we know the answer" and "users benefit from it."

---

## The Bottlenecks

1. **Release Engineering**: Code freezes, deployment windows, staging environments, QA cycles
2. **Approvals**: Legal, compliance, product management sign-offs
3. **Custom Infrastructure**: Bespoke traffic splitters that need manual reconfiguration
4. **Organizational Inertia**: Multiple teams need to coordinate for a simple traffic reallocation

---

## The Cost of Delay

Every day we keep showing inferior variants to users:
- **Lost conversions** — users see a worse experience than necessary
- **Opportunity cost** — the team can't start the next experiment
- **Compounding delay** — each experiment in the pipeline waits for the previous one

If the winning variant converts 2 percentage points better, we see 10,000 users/day, and all of them stay on the inferior experience, a 60-day delay means ~12,000 lost conversions.
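
The arithmetic behind that figure, as a small sketch. The function name and the `share_on_inferior` parameter are assumptions added for illustration; the default of 1.0 matches the worst case where every user sees the inferior experience.

```python
def lost_conversions(uplift, users_per_day, delay_days, share_on_inferior=1.0):
    """Conversions forgone while users keep seeing the inferior experience.

    uplift is the absolute difference in conversion rate (0.02 = 2 percentage points).
    """
    return uplift * users_per_day * delay_days * share_on_inferior


worst_case = lost_conversions(0.02, 10_000, 60)
partial = lost_conversions(0.02, 10_000, 60, share_on_inferior=0.9)
print(f"All traffic on inferior experience: {worst_case:,.0f} lost conversions")
print(f"90% of traffic on inferior experience: {partial:,.0f} lost conversions")
```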

---

## The Solution: Automate the Decision Loop

What if the system could **automatically shift traffic** to better-performing variants — without manual intervention, release cycles, or approvals for each reallocation?

This is exactly what **Thompson Sampling** provides.

> Instead of: Experiment → Analyze → Decide → Request release → Wait → Deploy → Repeat
>
> We get: Deploy all variants → Algorithm continuously optimizes traffic → Winner emerges automatically

# Part 6: Multi-Armed Bandits and Thompson Sampling

---

## The Multi-Armed Bandit Problem

Imagine a casino with **K slot machines** ("one-armed bandits"), each with an unknown payout probability. You have a limited budget. How do you maximize total payout?

- **Exploration**: Try different machines to learn which is best
- **Exploitation**: Play the machine you currently think is best

Too much exploration wastes pulls on bad machines. Too much exploitation might miss a better machine.

### A/B Testing *Is* a Bandit Problem

| Casino | A/B Testing |
|--------|-------------|
| Slot machines ("arms") | Variants (A, B, C, control) |
| Pull a lever | Show a variant to a user |
| Payout | User converts |
| Unknown probability | True conversion rate |
| Limited budget | Finite users |

**Goal**: Maximize total conversions (not just *identify* the best variant).

**Regret**: The difference between what we *would* have achieved always showing the best variant vs. what we *actually* achieved.
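
As a sketch, expected regret can be computed from the traffic each variant received. It requires the true rates, so it is only available in simulation; the rates and traffic splits below are hypothetical.

```python
def expected_regret(true_rates, shown):
    """Expected conversions lost versus always showing the best variant."""
    best = max(true_rates.values())
    return sum(n * (best - true_rates[v]) for v, n in shown.items())


rates = {"A": 0.70, "B": 0.68, "C": 0.69}          # hypothetical true rates
equal_split = {"A": 1000, "B": 1000, "C": 1000}    # fixed 33/33/33 allocation
adaptive = {"A": 2600, "B": 150, "C": 250}         # traffic skewed toward A

print(f"Equal split regret: {expected_regret(rates, equal_split):.1f}")
print(f"Adaptive regret:    {expected_regret(rates, adaptive):.1f}")
```

With the same 3,000 users, the equal split gives up about 30 expected conversions and the skewed allocation about 5.5, which is the gap a bandit algorithm tries to close.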

---

## Thompson Sampling: The Algorithm

Thompson Sampling has **provable logarithmic-regret guarantees** for Bernoulli rewards (Agrawal & Goyal, 2012) and is **incredibly simple**:

For each incoming user:
1. **Sample** once from each variant's posterior: $\theta_i \sim \text{Beta}(\alpha_i, \beta_i)$
2. **Choose** the variant with the highest sample: $i^* = \arg\max_i \theta_i$
3. **Show** that variant to the user
4. **Observe** the outcome: conversion (1) or not (0)
5. **Update** that variant's posterior: $\alpha_{i^*} \mathrel{+}= r, \; \beta_{i^*} \mathrel{+}= (1 - r)$

**That's it.** Five lines of logic — simpler than any classical statistical test.

### Why It Works

- **Early on**: Wide posteriors → high variance in samples → more exploration
- **Later**: Narrow posteriors → low variance → exploitation of the best variant
- **Automatically**: No parameters to tune, no stopping rules, no sample size calculations

The algorithm assigns each user to variant $i$ with probability equal to $P(\text{variant } i \text{ is best} \mid \text{data})$ (probability matching), which is *exactly* the quantity we computed in the Bayesian best-variant selection above.

> See `ThompsonSampling_DynamicTrafficAllocation.ipynb` for the full treatment, simulation code, and production considerations.

In [None]:
# --- Thompson Sampling Simulation ---
np.random.seed(42)

true_rates = {
    'A': 3244 / 4625,  # ~70.1%
    'B': 1433 / 2100,  # ~68.2%
    'C': 1396 / 2022,  # ~69.0%
}

def run_thompson_sampling(true_rates, n_users):
    """Simulate Thompson sampling and return results."""
    variants_list = list(true_rates.keys())
    alpha = {v: 1 for v in variants_list}
    beta = {v: 1 for v in variants_list}
    n_shown = {v: 0 for v in variants_list}
    n_conv = {v: 0 for v in variants_list}
    total_conv = 0
    
    # One probability trace per variant, so the function works for any variant names
    history = {'user': [], **{f'prob_{v}': [] for v in variants_list}}
    
    for uid in range(n_users):
        samples = {v: np.random.beta(alpha[v], beta[v]) for v in variants_list}
        chosen = max(samples, key=samples.get)
        converted = int(np.random.random() < true_rates[chosen])
        
        alpha[chosen] += converted
        beta[chosen] += (1 - converted)
        n_shown[chosen] += 1
        n_conv[chosen] += converted
        total_conv += converted
        
        if uid % 50 == 0:
            # Estimate P(each variant is best) with one vectorized Monte Carlo draw
            mc = 10000
            a = [alpha[v] for v in variants_list]
            b = [beta[v] for v in variants_list]
            draws = np.random.beta(a, b, size=(mc, len(variants_list)))
            wins = np.bincount(draws.argmax(axis=1), minlength=len(variants_list))
            history['user'].append(uid)
            for j, v in enumerate(variants_list):
                history[f'prob_{v}'].append(wins[j] / mc)
    
    return n_shown, n_conv, total_conv, history

def run_fixed_allocation(true_rates, n_users):
    """Simulate fixed equal allocation."""
    variants_list = list(true_rates.keys())
    n_shown = {v: 0 for v in variants_list}
    n_conv = {v: 0 for v in variants_list}
    total_conv = 0
    for uid in range(n_users):
        chosen = variants_list[uid % len(variants_list)]
        converted = int(np.random.random() < true_rates[chosen])
        n_shown[chosen] += 1
        n_conv[chosen] += converted
        total_conv += converted
    return n_shown, n_conv, total_conv

n_users = 5000

# Run both strategies
ts_shown, ts_conv, ts_total, ts_history = run_thompson_sampling(true_rates, n_users)
fx_shown, fx_conv, fx_total = run_fixed_allocation(true_rates, n_users)

best = max(true_rates, key=true_rates.get)
optimal = n_users * true_rates[best]

print("THOMPSON SAMPLING vs FIXED ALLOCATION")
print("=" * 60)
print(f"{'':20s} {'Thompson':>12s} {'Fixed':>12s}")
print("-" * 60)
for v in ['A', 'B', 'C']:
    ts_pct = 100 * ts_shown[v] / n_users
    fx_pct = 100 * fx_shown[v] / n_users
    print(f"Variant {v} traffic:    {ts_pct:10.1f}%  {fx_pct:10.1f}%")
print("-" * 60)
print(f"Total conversions:  {ts_total:10d}   {fx_total:10d}")
print(f"Conversion rate:    {100*ts_total/n_users:10.2f}%  {100*fx_total/n_users:10.2f}%")
print(f"Regret:             {optimal - ts_total:10.0f}   {optimal - fx_total:10.0f}")
print(f"\nThompson Sampling minus fixed allocation: {ts_total - fx_total:+d} conversions (single simulated run; can be negative by chance)")

In [None]:
# Visualize: P(variant is best) over time
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(ts_history['user'], ts_history['prob_A'], label='P(A is best)', lw=2, color='#2ecc71')
ax.plot(ts_history['user'], ts_history['prob_B'], label='P(B is best)', lw=2, color='#e74c3c')
ax.plot(ts_history['user'], ts_history['prob_C'], label='P(C is best)', lw=2, color='#3498db')
ax.axhline(y=0.95, color='gray', ls='--', lw=1, alpha=0.5, label='95% threshold')
ax.set_xlabel('Number of Users', fontsize=12)
ax.set_ylabel('Probability of Being Best', fontsize=12)
ax.set_title('Thompson Sampling: Learning Which Variant is Best Over Time', fontsize=14, fontweight='bold')
ax.legend(loc='right', fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1.05)
plt.tight_layout()
plt.show()

# Find when the posterior probability that A is best first reaches 95%
for i, p in enumerate(ts_history['prob_A']):
    if p >= 0.95:
        print(f"P(A is best) first reached 95% after ~{ts_history['user'][i]:,} users")
        break
else:
    print(f"P(A is best) did not reach 95% within {n_users:,} users")

## Key Benefits of Thompson Sampling

### Dynamic Traffic Allocation
Thompson Sampling **automatically** routes more traffic to better-performing variants. Inferior variants naturally fade out without manual intervention.

### Adding New Variants Dynamically
One of the greatest practical advantages: **new variants can enter at any time**.

1. New variant arrives → initialize with prior Beta(1, 1)
2. It immediately competes in sampling with existing variants
3. Wide posterior → gets explored (sometimes samples high → gets traffic)
4. Good variants prove themselves; bad ones fade out

No need to stop the test, redistribute traffic, recalculate sample sizes, or worry about multiple comparisons.
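
A minimal sketch of how this could look in code, assuming Beta(1, 1) priors and in-memory state. The class name `ThompsonAllocator`, its methods, and the true rates in the simulation are hypothetical and are not taken from the companion notebook.

```python
import numpy as np


class ThompsonAllocator:
    """Beta-Bernoulli Thompson Sampling with variants that can join at any time."""

    def __init__(self, variants, seed=None):
        self.rng = np.random.default_rng(seed)
        self.params = {v: [1.0, 1.0] for v in variants}   # [alpha, beta] per variant

    def add_variant(self, name, alpha=1.0, beta=1.0):
        """Register a new variant with its own prior; existing posteriors are untouched."""
        if name in self.params:
            raise ValueError(f"variant {name!r} already exists")
        self.params[name] = [alpha, beta]

    def choose(self):
        """Sample once from each posterior and return the variant with the highest draw."""
        samples = {v: self.rng.beta(a, b) for v, (a, b) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, name, converted):
        """Add one success or one failure to the chosen variant's posterior."""
        self.params[name][0] += int(converted)
        self.params[name][1] += 1 - int(converted)


# Simulation with hypothetical true rates; variant D joins after 1,000 users
true_rates = {"A": 0.70, "B": 0.68, "D": 0.80}
alloc = ThompsonAllocator(["A", "B"], seed=0)
shown = {v: 0 for v in true_rates}
for t in range(4000):
    if t == 1000:
        alloc.add_variant("D")
    v = alloc.choose()
    alloc.update(v, alloc.rng.random() < true_rates[v])
    shown[v] += 1
print(shown)
```

Adding a variant is a single dictionary insert: nothing about the running experiment has to be reset or recalculated.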

> See `ThompsonSampling_DynamicTrafficAllocation.ipynb` cells 13-15 for a full simulation of adding variant D mid-experiment.

### Production Considerations

When implementing Thompson Sampling in production, there are important real-world considerations beyond the basic algorithm:

| Consideration | Issue | Solution |
|---|---|---|
| **Delayed feedback** | Conversion may happen minutes/hours after variant shown | Batch updates (every 10-60 min); weakly informative priors reduce early regret |
| **Non-stationarity** | Conversion rates drift over time (seasonality, product changes) | Exponential decay or sliding window on observations |
| **Scalability** | High-traffic systems need low-latency decisions | Store $(\alpha, \beta)$ in distributed cache; batch posterior updates |
| **Cold start** | New variants start with no data | Weakly informative priors; accept initial exploration phase |
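
For the non-stationarity row, one possible form of exponential decay is to shrink the accumulated evidence toward the prior before each batch update. This is a sketch of that idea under assumed names and defaults (`discounted_update`, `gamma=0.99`), not a description of the notebook's implementation.

```python
def discounted_update(alpha, beta, conversions, non_conversions,
                      gamma=0.99, prior=(1.0, 1.0)):
    """Decay past evidence toward the prior, then add the new batch of outcomes."""
    a0, b0 = prior
    alpha = a0 + gamma * (alpha - a0) + conversions
    beta = b0 + gamma * (beta - b0) + non_conversions
    return alpha, beta


# gamma = 1 reproduces the ordinary Beta-Binomial update
print(discounted_update(11.0, 21.0, 3, 2, gamma=1.0))   # (14.0, 23.0)

# gamma < 1 with an empty batch halves the evidence accumulated beyond the prior
print(discounted_update(11.0, 21.0, 0, 0, gamma=0.5))   # (6.0, 11.0)
```

Applying the update once per batch also fits the delayed-feedback row: conversions that arrive late are simply counted in the batch in which they are observed. The choice of `gamma` trades responsiveness to drift against noisier posteriors and would need tuning per product.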

> See `ThompsonSampling_DynamicTrafficAllocation.ipynb` Appendix for detailed treatment of delayed feedback, non-stationarity, and production architecture.

# Summary

---

## The Journey

| Step | Manual / NHST | Bayesian / Automated |
|------|--------------|---------------------|
| **1. Non-inferiority** | NHST: "fail to reject" (inconclusive) | Bayesian: "95%+ probability of non-inferiority" |
| **2. Best variant** | Pairwise tests + corrections (complex, conservative) | Monte Carlo: "Variant A is best with 88% probability" |
| **3. Deploy winner** | Manual release cycle (weeks/months) | Thompson Sampling: automatic, continuous optimization |
| **4. Add new variant** | Stop test, redesign, restart | Add anytime — algorithm adapts seamlessly |

---

## The Full Comparison

| Aspect | Traditional A/B (NHST) | Thompson Sampling |
|--------|----------------------|-------------------|
| Traffic allocation | Fixed (e.g., 33/33/33) | Dynamic (adapts to performance) |
| Total conversions | Suboptimal (wastes traffic) | Near-optimal (minimizes regret) |
| Time to decision | Wait for significance | Continuous improvement |
| Adding variants | Restart test | Add anytime |
| Removing variants | Manual rebalance | Automatic fade-out |
| Multiple comparisons | Need corrections | No problem |
| Stopping rule | Pre-determined | Flexible |
| Implementation | Complex statistics | 5 lines of code |

---

## Bottom Line

For modern product development with rapid iteration cycles, risk-averse traffic allocation, and multiple design options:

1. **Use Bayesian methods** for non-inferiority testing and variant selection — they work with small samples, provide actionable probabilities, and handle unbalanced designs naturally.

2. **Use Thompson Sampling** to automate the entire experimentation loop — it minimizes regret, adapts traffic dynamically, and eliminates the release-engineering bottleneck.

3. **Start simple** — the mathematical foundations (Beta-Binomial conjugacy) are elegant and the code is minimal. A production MVP can be built with a distributed cache for $(\alpha, \beta)$ parameters and a few lines of sampling logic.

---

## Companion Notebooks

| Notebook | Content |
|----------|---------|
| `ABmethodologies.ipynb` | Full mathematical derivation of NHST and Bayesian approaches, numerical examples, plotting utilities |
| `Bayesian_AB_Test_Workflow.ipynb` | Concise Bayesian workflow: non-inferiority test → variant selection, with utility functions |
| `ThompsonSampling_DynamicTrafficAllocation.ipynb` | Thompson Sampling algorithm, simulations, dynamic variant addition, production considerations |

---

## References

### Books
- Jaynes, E.T. *Probability Theory: The Logic of Science* — [Cambridge University Press](https://www.cambridge.org/core/books/probability-theory/9CA08E224FF30123304E6D8935CF1A99)
- Gelman, A. et al. *Bayesian Data Analysis* (3rd ed.) — [Columbia University](http://www.stat.columbia.edu/~gelman/book/)
- Kruschke, J.K. *Doing Bayesian Data Analysis* — [Academic Press](https://sites.google.com/site/doingbayesiandataanalysis/)

### Papers
- Ioannidis, J.P.A. (2005) "Why Most Published Research Findings Are False" — [PLOS Medicine](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124)
- Thompson, W.R. (1933) "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples" — [Biometrika](https://doi.org/10.1093/biomet/25.3-4.285)
- Chapelle, O. & Li, L. (2011) "An Empirical Evaluation of Thompson Sampling" — [NeurIPS](https://papers.nips.cc/paper/2011/hash/e53a0a2978c28872a4505bdb51db06dc-Abstract.html)
- Russo, D.J. et al. (2018) "A Tutorial on Thompson Sampling" — [Foundations and Trends in Machine Learning](https://arxiv.org/abs/1707.02038)
- Agrawal, S. & Goyal, N. (2012) "Analysis of Thompson Sampling for the Multi-armed Bandit Problem" — [COLT](https://arxiv.org/abs/1111.1797)

### Industry & Regulatory
- FDA Draft Guidance (2026): "Use of Bayesian Methodology in Clinical Trials" — [FDA](https://www.fda.gov/regulatory-information/search-fda-guidance-documents/use-bayesian-methodology-clinical-trials-drugs-and-biological-products)
- Optimizely Stats Engine — [Optimizely](https://www.optimizely.com/optimization-glossary/stats-engine/)
- VWO Bayesian A/B Testing — [VWO](https://vwo.com/bayesian-ab-testing/)
- Google "Multi-Armed Bandits" (2024) — [Google AI Blog](https://ai.googleblog.com/)

### Tutorials
- Cam Davidson-Pilon, *Bayesian Methods for Hackers* — [GitHub / online book](https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/)
- Lilian Weng, "The Multi-Armed Bandit Problem and Its Solutions" — [Blog post](https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/)