# The Math Behind A/B Testing: Formulas, Calculations, and More...

A/B testing is not just a marketing tool—it is a statistical experiment. This guide explains the math behind A/B testing, including how to calculate sample sizes, standard errors, confidence intervals, and how to analyze discrete and continuous metrics. Each formula is derived step by step, with clear justification, so you can understand the reasoning behind every calculation.

## What is A/B Testing?

A/B testing (split testing) is an experiment where you split your audience to test two or more variations and determine which performs better.  
It helps answer questions like: “Does changing this button color increase clicks?”  

**Metric Types:**  
- **Continuous metrics:** measurable on a scale (e.g., revenue per user, time on page).  
- **Discrete metrics:** countable events (e.g., clicks, signups, purchases).  

Choosing the correct metric type determines the statistical method and sample size calculation.

## When to Use A/B Testing

- **Best for:** incremental changes (UX tweaks, small features, page load optimization).  
- **Not ideal for:** major changes (new product, full rebranding, or entirely new UX), because users' emotional reactions to a drastic change (novelty or aversion to change) can confound the measured effect.

## Common Mistakes to Avoid

- **Invalid hypothesis:** Be explicit about what you are testing and the expected outcome.  
- **Testing too many elements at once:** Makes it hard to identify what caused changes.  
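- **Ignoring statistical significance:** Run each test until it reaches the planned sample size before judging significance; stopping as soon as a result looks significant inflates false positives.  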
- **Ignoring external factors:** Compare similar periods (avoid holiday spikes or promotions).

## How A/B Testing Works

1. Create two versions of content (control and treatment), changing only **one variable**.  
2. Split audiences randomly and equally.  
3. Measure results over sufficient duration.  
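
One common way to get a random, roughly equal split is to hash a stable user identifier. The sketch below is only an illustration: the function name `assign_variant`, the experiment label `"cta-test"`, and the `user-N` ids are hypothetical, and a real testing tool may assign users differently.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "cta-test") -> str:
    """Deterministically map a user to 'control' or 'treatment' (about 50/50)."""
    # Hash the experiment label together with the user id, so the same user
    # always sees the same variant within one experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer in 0..99
    return "treatment" if bucket < 50 else "control"

# Sanity check on hypothetical user ids: the split should be close to equal.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
print(counts)
```

Hashing rather than drawing a fresh random number on each visit keeps a returning user in the same group, which the independence assumption behind the formulas in this guide relies on.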

### Example: User Experience

- **Version A (control):** Sidebar CTA  
- **Version B (challenger):** Top-of-page CTA  
- Measure **clicks** (discrete metric) to determine the winner.

### Example: Design Test

- **Version A (control):** Red CTA button  
- **Version B (challenger):** Green CTA button  
- Measure **time on page** (continuous metric) to determine engagement difference.

## A/B Testing Goals

- **Discrete metrics:** conversion rate, signups, purchases  
- **Continuous metrics:** revenue per user, session duration  
- **Mixed:** cart abandonment, product image improvements  

## Designing an A/B Test

1. Select **one variable** to test.  
2. Identify your **primary metric** (continuous or discrete).  
3. Create **control** and **treatment** versions.  
4. Randomly split your audience equally.  
5. Calculate required **sample size**.  
6. Choose the error rates: significance level $\alpha$ (Type I error rate) and $\beta$ (Type II error rate; power is $1-\beta$).  
7. Run **one test at a time**.



## Step 1: Test Statistic

For measuring the difference between two groups:

$$
T = \frac{\text{observed difference} - \text{expected difference under $H_0$}}{SE}
$$

Where:

- **Observed difference:** difference in metrics between treatment and control.  
- **Expected difference under $H_0$:** usually 0 (no difference).  
- **SE (standard error):** accounts for variability in the metric.

### Standard Error Formulas

#### Continuous Metrics

$$
SE = \sqrt{\frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}}
$$

- $s_c^2$, $s_t^2$: sample variances  
- $n_c$, $n_t$: sample sizes

> **Note:** Variances add for independent samples; divide by sample size to adjust for averaging; square root returns SE in original units.

#### Discrete Metrics (Binary)

$$
SE = \sqrt{\frac{p_c(1-p_c)}{n_c} + \frac{p_t(1-p_t)}{n_t}}
$$

- $p_c$, $p_t$: conversion rates  
- $n_c$, $n_t$: sample sizes  

> **Note:** A Bernoulli variable has variance $p(1-p)$; dividing by the sample size gives the variance of the sample proportion, and the variances of the two independent groups add.
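
As a minimal sketch, the test statistic for a binary metric can be computed with the unpooled standard error given above. The conversion counts are hypothetical, the helper names `se_binary` and `z_test` are made up for this illustration, and the two-sided p-value assumes a normal approximation.

```python
import math
from statistics import NormalDist

def se_binary(p_c: float, n_c: int, p_t: float, n_t: int) -> float:
    """Unpooled standard error of the difference between two proportions."""
    return math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)

def z_test(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Return (test statistic, two-sided p-value) under H0: no difference."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = se_binary(p_c, n_c, p_t, n_t)
    t = ((p_t - p_c) - 0.0) / se  # expected difference under H0 is 0
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value

# Hypothetical data: 350/3500 conversions in control, 420/3500 in treatment.
t_stat, p_value = z_test(conv_c=350, n_c=3500, conv_t=420, n_t=3500)
print(f"T = {t_stat:.2f}, p = {p_value:.4f}")
```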



## Step 2: Confidence Interval

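95% CI for the difference in conversion rates (for a continuous metric, replace $p_t - p_c$ with $\bar{X}_t - \bar{X}_c$):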

$$
(p_t - p_c) \pm Z_{0.975} \cdot SE
$$

- $Z_{0.975} = 1.96$  
- CI shows plausible range of true difference.
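
The interval can be computed directly from the formula. This is a sketch under the same normal approximation; `diff_ci` is a hypothetical helper name and the input rates and sample sizes are illustrative.

```python
import math
from statistics import NormalDist

def diff_ci(p_c: float, n_c: int, p_t: float, n_t: int, confidence: float = 0.95):
    """Confidence interval for p_t - p_c using the unpooled standard error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Hypothetical rates: 10% in control, 12% in treatment, 3500 users per group.
low, high = diff_ci(0.10, 3500, 0.12, 3500)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
```

If the interval excludes 0, the difference is statistically significant at the chosen confidence level.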



## Step 3: Sample Size Calculation

### Continuous Metrics

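Goal: detect a difference $\Delta$ with significance $\alpha$ and power $1-\beta$, assuming equal group sizes $n$ and a common variance $\sigma^2$. Here $Z_{\alpha/2}$ and $Z_\beta$ denote upper-tail critical values of the standard normal distribution, i.e. the quantiles $Z_{1-\alpha/2}$ and $Z_{1-\beta}$ (for example, $Z_{0.975}$ when $\alpha = 0.05$).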

1. Test statistic:

$$
T = \frac{\bar{X}_t - \bar{X}_c}{\sqrt{\frac{\sigma^2}{n} + \frac{\sigma^2}{n}}} = \frac{\Delta}{\sqrt{2\sigma^2 / n}}
$$

2. Solve for $n$:

$$
Z_{\alpha/2} + Z_\beta = \frac{\Delta}{\sqrt{2\sigma^2 / n}}
$$

$$
\sqrt{\frac{2\sigma^2}{n}} = \frac{\Delta}{Z_{\alpha/2} + Z_\beta}
$$

$$
\frac{2\sigma^2}{n} = \frac{\Delta^2}{(Z_{\alpha/2} + Z_\beta)^2}
$$

$$
n = \frac{2 (Z_{\alpha/2} + Z_\beta)^2 \sigma^2}{\Delta^2}
$$

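> **Note:** $n$ is the required size of each group. It grows with the variance $\sigma^2$ and with stricter confidence and power requirements, and it shrinks with the square of the difference $\Delta$ you want to detect.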
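
The final formula translates into a few lines of code. In this sketch the function name `sample_size_continuous` and the example values (a standard deviation of 30 seconds and a target difference of 3 seconds) are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

def sample_size_continuous(sigma: float, delta: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n = 2 (Z_{alpha/2} + Z_beta)^2 sigma^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # upper-tail critical value for alpha/2
    z_beta = NormalDist().inv_cdf(power)           # upper-tail critical value for beta
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Hypothetical metric: time on page with sigma = 30 s, detect a 3 s change.
n_per_group = sample_size_continuous(sigma=30.0, delta=3.0)
print(n_per_group)
```

Halving $\Delta$ quadruples the required sample size, which is why very small effects are expensive to detect.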

### Discrete Metrics (Binary)

$$
n = \frac{2 (Z_{\alpha/2} + Z_\beta)^2 \hat{p}(1-\hat{p})}{\Delta^2}
$$

- $\hat{p}$: baseline conversion rate  
- $\Delta$: minimum detectable effect  

> **Note:** This is the continuous formula with $\sigma^2$ replaced by the Bernoulli variance $\hat{p}(1-\hat{p})$; the factor 2 accounts for the two independent groups.

### Example: Discrete Metric

- Baseline $\hat{p} = p_c = 0.10$  
- Minimum detectable effect $\Delta = 0.02$  
- 95% confidence ($Z_{0.975}=1.96$)  
- 80% power ($Z_{0.8}=0.84$)

Step-by-step:

1. $Z_{0.975}+Z_{0.8} = 1.96+0.84 = 2.8$  
2. Square: $(2.8)^2 = 7.84$  
3. Multiply by $2 \cdot \hat{p}(1-\hat{p}) = 2 \cdot 0.1 \cdot 0.9 = 0.18$  
4. $7.84 \cdot 0.18 = 1.4112$  
5. Divide by $\Delta^2 = 0.0004$ → $1.4112 / 0.0004 = 3528$

Required sample size per group: **≈ 3528 users**
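
The hand calculation can be checked with a short script. This is a sketch, with `sample_size_binary` as a hypothetical helper name; it uses unrounded normal quantiles instead of the rounded 1.96 and 0.84.

```python
import math
from statistics import NormalDist

def sample_size_binary(p_hat: float, delta: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n = 2 (Z_{alpha/2} + Z_beta)^2 p(1-p) / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * p_hat * (1 - p_hat) / delta ** 2
    return math.ceil(n)

# Same inputs as the worked example: baseline 10%, minimum detectable effect 2 points.
n_exact = sample_size_binary(p_hat=0.10, delta=0.02)

# Reproduce the hand calculation with the rounded z-values 1.96 and 0.84.
n_hand = 2 * (1.96 + 0.84) ** 2 * 0.10 * 0.90 / 0.02 ** 2
print(n_exact, round(n_hand))
```

With unrounded quantiles the result lands slightly above 3528, because the sum of the exact z-values is a little larger than 2.8.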



## Step 4: Running the Test

- Use a testing tool  
- Split users randomly  
- Run until each group reaches the sample size from Step 3, so that the SE and CI are reliable  
- Collect qualitative feedback if needed  



## Step 5: Analyzing Results

### Discrete Metric

- Standard error:

$$
SE = \sqrt{\frac{p_c (1-p_c)}{n_c} + \frac{p_t (1-p_t)}{n_t}}
$$

- Confidence interval:

$$
(p_t - p_c) \pm Z_{0.975} \cdot SE
$$

### Continuous Metric

- Standard error:

$$
SE = \sqrt{\frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}}
$$

- Confidence interval:

$$
(\bar{X}_t - \bar{X}_c) \pm Z_{0.975} \cdot SE
$$

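Take action based on the CI and practical significance: a CI that excludes 0 indicates a statistically significant difference, and comparing its bounds with $\Delta$ shows whether the effect is large enough to matter.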
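
For a continuous metric, the same analysis can be run on raw per-user values. The sketch below uses synthetic session durations drawn with `random.gauss`; the data, the seed, and the helper name `analyze_continuous` are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance

def analyze_continuous(control, treatment, confidence: float = 0.95):
    """Return (difference in means, (CI low, CI high)) for two independent samples."""
    n_c, n_t = len(control), len(treatment)
    diff = mean(treatment) - mean(control)
    # statistics.variance is the sample variance (divides by n - 1).
    se = math.sqrt(variance(control) / n_c + variance(treatment) / n_t)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, (diff - z * se, diff + z * se)

# Synthetic session durations in seconds (hypothetical true means 60 and 62).
rng = random.Random(42)
control = [rng.gauss(60.0, 15.0) for _ in range(2000)]
treatment = [rng.gauss(62.0, 15.0) for _ in range(2000)]

diff, (low, high) = analyze_continuous(control, treatment)
print(f"difference = {diff:.2f} s, 95% CI = [{low:.2f}, {high:.2f}]")
```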



## Advanced Concepts

### Multi-Armed Bandit

The multi-armed bandit approach dynamically allocates traffic to the best-performing variant, instead of splitting traffic equally like traditional A/B testing. This can improve overall performance during the test by favoring better options earlier.  

- **Thompson Sampling** is commonly used for discrete metrics (e.g., clicks, conversions). It models the conversion probability of each variant with a Beta distribution:

$$
\theta_i \sim \text{Beta}(\alpha_i + \text{successes}_i, \beta_i + \text{failures}_i)
$$

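Here, $\theta_i$ is the conversion rate of variant $i$, and $\alpha_i$, $\beta_i$ are the parameters of its Beta prior (unrelated to the error rates $\alpha$ and $\beta$ used earlier; $\alpha_i = \beta_i = 1$ gives a uniform prior). The posterior is updated as data arrives, and in each round the variant with the highest sampled $\theta_i$ receives the traffic.  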

- **Expected regret** quantifies the loss from not always choosing the best variant:

$$
R(T) = \sum_{t=1}^{T} (\mu^* - \mu_{a_t})
$$

Where $\mu^*$ is the conversion rate of the optimal variant and $\mu_{a_t}$ is the conversion rate of the variant chosen at time $t$.
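
A small simulation shows how Thompson Sampling and the regret sum fit together. This is a sketch only: it assumes a uniform Beta(1, 1) prior for every variant, and the true conversion rates are hypothetical values that would be unknown in a real test.

```python
import random

def thompson_sampling(true_rates, rounds: int = 5000, seed: int = 0):
    """Simulate Thompson Sampling; return (pulls per variant, cumulative regret)."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    best_rate = max(true_rates)
    regret = 0.0
    for _ in range(rounds):
        # Draw theta_i ~ Beta(1 + successes_i, 1 + failures_i) for each variant.
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        # Simulate one visitor converting (or not) on the chosen variant.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        regret += best_rate - true_rates[arm]  # one term of R(T)
    pulls = [successes[i] + failures[i] for i in range(k)]
    return pulls, regret

# Hypothetical true conversion rates for variants A and B.
pulls, regret = thompson_sampling([0.10, 0.12])
print(pulls, round(regret, 2))
```

The regret grows only in rounds where the weaker variant is shown, so it equals the gap between the rates times the number of such rounds.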

### Sequential Testing

Sequential testing allows you to continuously monitor results and stop the experiment early if there is enough evidence, rather than waiting for a fixed sample size. This can save time and resources but requires adjusting significance thresholds to avoid false positives.  

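- Adjust the significance threshold over time, for example with a linear alpha-spending rule, where $\alpha(t)$ is the cumulative Type I error budget available after $t$ of $T$ planned observations: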

$$
\alpha(t) = \alpha \cdot \frac{t}{T}
$$

- Use the **Sequential Probability Ratio Test (SPRT)** to compare likelihoods of hypotheses as data accumulates:

$$
\Lambda_t = \frac{L(\text{data}_t | H_1)}{L(\text{data}_t | H_0)}
$$

**Stop rules** (with Wald's approximate thresholds $A = \beta/(1-\alpha)$ and $B = (1-\beta)/\alpha$):  
- Accept $H_1$ if $\Lambda_t > B$ (enough evidence for treatment effect)  
- Accept $H_0$ if $\Lambda_t < A$ (enough evidence for no effect)  
- Continue collecting data otherwise  
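
The stop rules can be sketched for the simplest case: a single stream of binary outcomes with two simple hypotheses, $H_0: p = p_0$ and $H_1: p = p_1$. This one-sample setup, the function name `sprt_bernoulli`, and the use of Wald's approximate thresholds are assumptions for illustration; comparing two variants needs a suitably adapted likelihood ratio.

```python
import math
import random

def sprt_bernoulli(observations, p0: float, p1: float,
                   alpha: float = 0.05, beta: float = 0.20):
    """Run the SPRT on 0/1 outcomes; return (decision, observations used)."""
    log_a = math.log(beta / (1 - alpha))   # lower threshold, log A
    log_b = math.log((1 - beta) / alpha)   # upper threshold, log B
    llr = 0.0  # log of the likelihood ratio Lambda_t
    n = 0
    for x in observations:
        n += 1
        # Add the log-likelihood ratio contribution of one observation.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr > log_b:
            return "accept H1", n
        if llr < log_a:
            return "accept H0", n
    return "continue", n

# Hypothetical stream of conversions whose true rate equals p1 = 0.12.
rng = random.Random(1)
stream = [1 if rng.random() < 0.12 else 0 for _ in range(20_000)]
print(sprt_bernoulli(stream, p0=0.10, p1=0.12))
```

Working with the log of $\Lambda_t$ turns the product of likelihoods into a running sum, which avoids numerical underflow on long streams.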

### Bayesian A/B Testing

Bayesian A/B testing incorporates prior beliefs about metrics and updates them with observed data, giving a probability distribution over the true effect rather than a single point estimate.  

- **Posterior update:**

$$
p(\theta | \text{data}) \propto p(\text{data} | \theta) \cdot p(\theta)
$$

- The 95% **credible interval** provides the range where the true parameter lies with 95% probability:

$$
\text{CI}_{95\%} = [\theta_{\text{lower}}, \theta_{\text{upper}}]
$$

- Bayesian methods allow **probability-based decisions**, e.g., “There is a 90% chance that variant B is better than A,” which can be more intuitive than p-values.
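
For a binary metric, both the probability that B beats A and a credible interval can be estimated by sampling from Beta posteriors. The sketch assumes a uniform Beta(1, 1) prior and hypothetical conversion counts; the helper names are made up, and the Monte Carlo estimates vary slightly with the seed and the number of draws.

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(theta_B > theta_A | data)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior under a Beta(1, 1) prior: Beta(1 + successes, 1 + failures).
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += theta_b > theta_a
    return wins / draws

def credible_interval(conv: int, n: int, level: float = 0.95,
                      draws: int = 20_000, seed: int = 0):
    """Equal-tailed credible interval estimated from posterior samples."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(1 + conv, 1 + n - conv) for _ in range(draws))
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper

# Hypothetical data: A converts 350/3500 users, B converts 420/3500.
p_better = prob_b_beats_a(350, 3500, 420, 3500)
ci_b = credible_interval(420, 3500)
print(f"P(B > A) = {p_better:.3f}, 95% credible interval for B = {ci_b}")
```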

## How to Read Results

When analyzing A/B test results, consider both statistical and practical significance:  

1. **Verify statistical significance** – Check p-values or credible intervals to ensure observed differences are unlikely to be due to chance.  
2. **Compute confidence intervals** – For discrete or continuous metrics, intervals quantify uncertainty around the estimated difference.  
3. **Assess practical effect size ($\Delta$)** – Consider if the difference is meaningful for business or user experience.  
4. **Decide next steps** – Roll out the winning variant, iterate with new tests, or discard changes that don’t show improvement.

## Decision Flow
```
Start
  │
  ▼
Select Metric
  │
  ├─► Is it Discrete? (clicks, signups, purchases)
  │       │
  │       ├─► Yes 
  │       │     │
  │       │     ├─► Use standard A/B test for proportions
  │       │     ├─► Optional: Multi-Armed Bandit for faster allocation
  │       │     └─► Sample size via binary formula
  │       │
  │       └─► No (continuous: time on page, revenue)
  │             │
  │             ├─► Use t-test / ANOVA
  │             ├─► Optional: Sequential or Bayesian testing
  │             └─► Sample size via variance-based formula
  │
  ▼
Randomly split audience
  │
  ▼
Run test (ensure sufficient duration)
  │
  ▼
Analyze Results
  │
  ├─► Statistically Significant?
  │       │
  │       ├─► Yes ─► Implement winning variant
  │       │
  │       └─► No ─► Iterate or increase sample size
  │
  ▼
End
```