# Bayesian A/B Test Workflow

This notebook demonstrates a complete Bayesian A/B testing workflow built on the utility functions in the `bayesian` and `plotting_utils` modules.

## Workflow Overview

1. **Setup**: Define experiment data (control + variants)
2. **Non-Inferiority Test**: Verify variants don't degrade performance
3. **Visualize Results**: Plot prior and posteriors for all variants
4. **Select Best Variant**: Choose the winning variant and report the probability that it is best
5. **Visualize Selection**: Compare all variant posteriors

## Key Advantages of Bayesian Approach

- ✅ Works with small, unbalanced samples
- ✅ Provides actionable probabilities (not just p-values)
- ✅ Scales to many variants effortlessly
- ✅ Allows continuous monitoring without p-hacking concerns
- ✅ Directly answers: "Which variant is best?"

## 1. Setup: Import Libraries

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

# Import Bayesian utility functions
from bayesian import (
    test_non_inferiority_weakly_informative,
    select_best_variant
)

# Import plotting utilities
from plotting_utils import (
    plot_weakly_informative_prior_with_variants,
    plot_multiple_posteriors_comparison
)

## 2. Define Experiment Data

Example: Testing 3 passkey creation UX variants against a control.

- **Control**: Current experience (~71% completion rate)
- **Variants A, B, C**: New passkey creation flows
- **Goal**: Ensure variants don't degrade completion, then pick the best one

In [None]:
# Control group data
n_control = 4411  # Number of users
x_control = 3138  # Number who completed
control_rate = x_control / n_control

print("Control Group:")
print(f"  Sample size: {n_control:,}")
print(f"  Conversions: {x_control:,}")
print(f"  Conversion rate: {control_rate:.2%}")

# Variant data
variants_data = {
    'A': {'n': 561, 'x': 381},
    'B': {'n': 285, 'x': 192},
    'C': {'n': 294, 'x': 201}
}

print("\nVariants:")
for name, data in variants_data.items():
    rate = data['x'] / data['n']
    print(f"  {name}: n={data['n']:3d}, x={data['x']:3d}, rate={rate:.2%}")

# Test parameters
epsilon = 0.05  # Non-inferiority margin: 5 percentage points (absolute) of acceptable degradation
print(f"\nNon-inferiority margin (ε): {epsilon:.1%}")
print(f"Non-inferiority threshold: {control_rate - epsilon:.2%}")

## 3. Non-Inferiority Test

Test whether each variant is **non-inferior** to control, that is, whether its completion rate is unlikely to fall more than the margin ε below the control rate.

### Key Insight: Domain Knowledge vs Business Tolerance

The test separates two important concepts:
1. **Expected degradation** (domain knowledge): "Adding 2 extra clicks will reduce completion by ~2 percentage points"
2. **Business tolerance** (epsilon): "We can accept up to 5 percentage points of degradation"

The prior is centered at your **expected** performance (domain knowledge), while the test checks against the **maximum acceptable** threshold (business requirement).
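
The separation between prior and threshold can be sketched in a few lines. This is a minimal illustration, not the implementation of `test_non_inferiority_weakly_informative`: the helper name `non_inferiority_probability` is hypothetical, and the sketch assumes that the prior strength is the total pseudo-count (α + β) of a Beta prior and that the control rate is treated as a fixed number.

```python
from scipy.stats import beta


def non_inferiority_probability(n_control, x_control, n_variant, x_variant,
                                epsilon, expected_degradation, prior_strength):
    """Return (prior_mean, posterior_mean, P(variant rate > threshold))."""
    control_rate = x_control / n_control

    # Prior centred on the expected variant rate (domain knowledge).
    prior_mean = control_rate - expected_degradation
    # Assumption: prior_strength is the total pseudo-count alpha + beta.
    alpha_prior = prior_mean * prior_strength
    beta_prior = (1 - prior_mean) * prior_strength

    # Conjugate Beta-Binomial update with the variant's observed data.
    alpha_post = alpha_prior + x_variant
    beta_post = beta_prior + (n_variant - x_variant)

    # The threshold comes from business tolerance, not from the prior.
    threshold = control_rate - epsilon
    prob = beta.sf(threshold, alpha_post, beta_post)  # P(rate > threshold)

    posterior_mean = alpha_post / (alpha_post + beta_post)
    return prior_mean, posterior_mean, prob


# Variant A from this notebook, with ~2 points expected degradation
# and a 5-point tolerance.
prior_mean_a, post_mean_a, prob_a = non_inferiority_probability(
    n_control=4411, x_control=3138, n_variant=561, x_variant=381,
    epsilon=0.05, expected_degradation=0.02, prior_strength=20,
)
print(f"prior mean={prior_mean_a:.2%}, posterior mean={post_mean_a:.2%}, "
      f"P(non-inferior)={prob_a:.2%}")
```

With a prior strength of 20 pseudo-observations against several hundred real ones, the posterior is dominated by the data, which is what makes the prior "weakly" informative under this assumption.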

In [None]:
# Domain knowledge: adding passkey creation (2 extra clicks) is expected to degrade completion by ~2 points
expected_degradation = 0.02

# Run non-inferiority test
results = test_non_inferiority_weakly_informative(
    n_control=n_control,
    x_control=x_control,
    variants_data=variants_data,
    epsilon=epsilon,  # Business: can tolerate 5% degradation
    expected_degradation=expected_degradation,  # Domain: expect 2% degradation
    alpha_prior_strength=20,  # Weak prior (high entropy)
    threshold=0.95  # 95% probability required
)

print("="*80)
print("PRIOR AND THRESHOLD SETUP")
print("="*80)
print(f"Control rate: {control_rate:.2%}")
print(f"Expected degradation (domain knowledge): {expected_degradation:.1%}")
print(f"  → Prior centered at: {control_rate - expected_degradation:.2%}")
print(f"Maximum acceptable degradation (business): {epsilon:.1%}")
print(f"  → Test threshold at: {control_rate - epsilon:.2%}")
print("\nThis means:")
print(f"  • Prior says: 'I expect variant around {control_rate - expected_degradation:.1%}'")
print(f"  • Test says: 'Must be above {control_rate - epsilon:.1%} to pass'")

# Display results
print("\n" + "="*80)
print("NON-INFERIORITY TEST RESULTS")
print("="*80)

for variant_name, result in results.items():
    status = "✓ NON-INFERIOR" if result['is_non_inferior'] else "✗ NOT NON-INFERIOR"
    print(f"\nVariant {variant_name}: {status}")
    print(f"  P(variant > threshold): {result['probability']:.2%}")
    print(f"  Posterior mean: {result['variant_rate']:.2%}")
    print(f"  Prior mean: {result['prior_mean']:.2%}")
    print(f"  Observed rate: {variants_data[variant_name]['x']/variants_data[variant_name]['n']:.2%}")

# Summary
non_inferior_count = sum(1 for r in results.values() if r['is_non_inferior'])
print(f"\n{'='*80}")
print(f"Summary: {non_inferior_count}/{len(variants_data)} variants are non-inferior")
print(f"{'='*80}")

## 4. Visualize Non-Inferiority Test

The plot shows:
- **Gray dashed line**: Common weakly informative prior
- **Colored lines**: Posterior distribution for each variant
- **Red dotted line**: Non-inferiority threshold
- **Black dash-dot line**: Control conversion rate
- **Text box**: Probability each variant exceeds threshold

In [None]:
# Create the visualization directly from the test results
fig, ax = plot_weakly_informative_prior_with_variants(results)
plt.show()

## 5. Select Best Variant

Among non-inferior variants, which one is most likely the best?

Uses Monte Carlo simulation to compute **P(variant is best)** for each variant.
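
As a rough sketch of what such a simulation involves (the actual `select_best_variant` may differ), the hypothetical helper `simulate_best_variant` below draws from each variant's Beta posterior, counts how often each variant has the highest rate, and averages the shortfall against the best rate in each draw as an expected loss. The prior parameters, simulation count, and seed are illustrative assumptions.

```python
import numpy as np


def simulate_best_variant(variants, alpha_prior=1.0, beta_prior=1.0,
                          n_simulations=100_000, seed=0):
    """Estimate P(variant is best) and expected loss by posterior sampling."""
    rng = np.random.default_rng(seed)
    names = list(variants)

    # One column of posterior draws per variant: Beta(alpha + x, beta + n - x).
    draws = np.column_stack([
        rng.beta(alpha_prior + variants[k]['x'],
                 beta_prior + variants[k]['n'] - variants[k]['x'],
                 size=n_simulations)
        for k in names
    ])

    # P(best): fraction of simulations in which each variant has the top rate.
    winners = draws.argmax(axis=1)
    p_best = np.bincount(winners, minlength=len(names)) / n_simulations

    # Expected loss: average shortfall versus the best rate in each simulation.
    loss = (draws.max(axis=1, keepdims=True) - draws).mean(axis=0)

    return dict(zip(names, p_best.tolist())), dict(zip(names, loss.tolist()))


example_variants = {'A': {'n': 561, 'x': 381},
                    'B': {'n': 285, 'x': 192},
                    'C': {'n': 294, 'x': 201}}
p_best, expected_loss = simulate_best_variant(example_variants)
for name in example_variants:
    print(f"{name}: P(best)={p_best[name]:.2%}, "
          f"expected loss={expected_loss[name]:.4f}")
```

Because every simulation has exactly one winner, the estimated probabilities sum to one, which is why no separate multiple-comparison step appears in this calculation.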

In [None]:
# select_best_variant is run on all variants in this example.
# To restrict the comparison to non-inferior variants, filter variants_data
# on results[name]['is_non_inferior'] before calling it.
selection_results = select_best_variant(
    variants_data=variants_data,
    alpha_prior=1,  # Non-informative prior for selection
    beta_prior=1,
    credible_level=0.95,
    n_simulations=100000
)

# Display results
print("="*80)
print("BEST VARIANT SELECTION")
print("="*80)

print(f"\nProbability each variant is best:")
for name, prob in selection_results['probabilities'].items():
    bar = '█' * int(prob * 60)
    print(f"  {name}: {prob:.2%} {bar}")

winner = selection_results['best_variant']
winner_prob = selection_results['probabilities'][winner]
print(f"\n{'='*80}")
print(f"WINNER: Variant {winner}")
print(f"  Probability of being best: {winner_prob:.2%}")
print(f"  Posterior mean: {selection_results['posterior_means'][winner]:.2%}")
print(f"  95% Credible interval: [{selection_results['credible_intervals'][winner][0]:.2%}, {selection_results['credible_intervals'][winner][1]:.2%}]")
print(f"  Expected loss: {selection_results['expected_loss'][winner]:.4f}")
print(f"{'='*80}")

## 6. Visualize Variant Comparison

Compare posterior distributions of all variants to see overlap and separation.
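
Overlap can also be put into numbers. The sketch below, which is an illustrative addition and not part of `plotting_utils`, estimates the pairwise probability that one variant's rate exceeds another's under independent Beta(1, 1) priors. The helper name `pairwise_superiority` and its defaults are hypothetical. Values near 50% indicate heavily overlapping posteriors; values near 0% or 100% indicate clear separation.

```python
import numpy as np


def pairwise_superiority(variants, n_simulations=100_000, seed=0):
    """Return {(i, j): P(rate_i > rate_j)} under independent Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    # Posterior draws per variant: Beta(x + 1, n - x + 1).
    draws = {
        name: rng.beta(d['x'] + 1, d['n'] - d['x'] + 1, size=n_simulations)
        for name, d in variants.items()
    }
    return {
        (i, j): float(np.mean(draws[i] > draws[j]))
        for i in draws for j in draws if i != j
    }


pairwise = pairwise_superiority({'A': {'n': 561, 'x': 381},
                                 'B': {'n': 285, 'x': 192},
                                 'C': {'n': 294, 'x': 201}})
for (i, j), prob in pairwise.items():
    print(f"P({i} > {j}) = {prob:.2%}")
```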

In [None]:
# Prepare posteriors for plotting
from scipy.stats import beta as beta_dist

posteriors = {}
for name, data in variants_data.items():
    # Using non-informative prior Beta(1,1) for fair comparison
    alpha_post = data['x'] + 1
    beta_post = data['n'] - data['x'] + 1
    
    posteriors[name] = {
        'alpha': alpha_post,
        'beta': beta_post,
        'mean': alpha_post / (alpha_post + beta_post),
        'ci_95': (
            beta_dist.ppf(0.025, alpha_post, beta_post),
            beta_dist.ppf(0.975, alpha_post, beta_post)
        )
    }

# Create comparison plot
fig, ax = plot_multiple_posteriors_comparison(
    posteriors=posteriors,
    control_group_conversion_rate=control_rate,
    epsilon=epsilon
)
plt.show()

## Summary

This workflow demonstrates:

1. ✅ **Non-Inferiority Testing**: Verify variants don't degrade performance beyond the margin ε
   - Uses a weakly informative prior based on control data
   - Provides direct probability: P(variant > threshold)
   - Works with small, unbalanced samples

2. ✅ **Variant Selection**: Choose the best performing variant
   - Direct answer: P(variant A is best), P(variant B is best), etc.
   - No multiple comparison corrections needed
   - Scales to any number of variants

3. ✅ **Actionable Results**: Business-friendly outputs
   - Example phrasing: "Variant B is best with 47% probability"
   - Example phrasing: "Expected loss if we choose C instead: 0.0023"
   - Clear decision-making support

## Key Takeaway

Bayesian methods provide:
- **Faster decisions** (works with small samples)
- **Better interpretability** (probabilities, not p-values)
- **Greater flexibility** (any sample sizes, multiple variants)
- **Continuous monitoring** (no p-hacking issues)

For modern product development with rapid iteration and risk-averse traffic allocation, Bayesian methods are often a better fit than traditional null hypothesis significance testing (NHST).