# How to Do A/B Testing

## What is A/B Testing?

A/B testing, or **split testing**, is a controlled experiment in which you randomly split your audience between two or more variations and measure which performs better.  
It is valuable because audiences differ: what works for one company may not work for another.  

**Continuous vs. Discrete Perspective:**  
- **Continuous metrics**: measurable on a scale (e.g., time on page, revenue per user).  
- **Discrete metrics**: countable events (e.g., clicks, signups, purchases).  

Understanding your metric type is crucial because it determines the statistical method and sample size calculation.



## How Does A/B Testing Work?

You create two versions of content that differ in only **one variable**, then show them to two randomly assigned, similarly sized audiences over the same period.  
Run the test long enough to reach the planned sample size, then analyze which version performs better.

![A/B Testing Explanation](img/a-b-testing-explanation.webp)

### Example: User Experience Test
- Version A: Sidebar CTA (**control**)  
- Version B: Top-of-page CTA (**challenger**)  

Compare performance between equally sized groups of randomly assigned visitors using discrete metrics like click counts.

### Example: Design Test
- Version A: Red CTA button (**control**)  
- Version B: Green CTA button (**challenger**)  

Compare using **continuous metrics** like average time spent on page or scroll depth.



## A/B Testing Goals

- **Higher Conversion Rate** (discrete)  
- **Increased Website Traffic** (discrete)  
- **Lower Bounce Rate** (discrete: each session either bounces or does not)  
- **Better Product Images** (continuous/discrete mix)  
- **Lower Cart Abandonment** (discrete)



## Designing an A/B Test

1. **Select one variable** (independent variable)  
2. **Identify your primary metric** (dependent variable, continuous or discrete)  
3. **Create control and challenger versions**  
4. **Split your audience randomly and equally**  
5. **Calculate required sample size**  
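6. **Set the significance level and power** ($\alpha$ and $1-\beta$, commonly 0.05 and 0.80)  
7. **Run only one test at a time** on a given audience, so that overlapping changes do not confound the result  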
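One common way to implement step 4 is to hash a stable user identifier so that each user always lands in the same group. The sketch below is an illustration under stated assumptions, not a prescribed method: the function name `assign_variant`, the experiment label, and the use of SHA-256 are all hypothetical choices.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "challenger")):
    """Deterministically map a user to a variant (hypothetical helper).

    Hashing the experiment name together with the user ID gives a
    pseudo-random, roughly equal split that stays stable across visits.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Usage: the same user always sees the same version.
print(assign_variant("user-123", "cta-position"))
```

Including the experiment name in the hash means the same user can fall into different groups in different experiments, which avoids carrying one split over to the next test.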

### Sample Size Calculation

- **For discrete (binary) metrics** like conversion rate:

$$
n = \frac{2 \cdot (Z_{\alpha/2} + Z_\beta)^2 \cdot \hat{p}(1-\hat{p})}{\Delta^2}
$$

- **For continuous metrics** like revenue per user:

$$
n = \frac{2 \cdot (Z_{\alpha/2} + Z_\beta)^2 \cdot \sigma^2}{\Delta^2}
$$

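Where:  
- $n$ = required sample size **per group**  
- $\hat{p}$ = baseline conversion rate (discrete), so $\hat{p}(1-\hat{p})$ approximates the variance of a single binary outcome  
- $\sigma^2$ = variance of the continuous metric  
- $\Delta$ = minimum detectable effect, expressed as an absolute difference in the metric's own units  
- $Z_{\alpha/2}, Z_\beta$ = standard normal critical values for the two-sided significance level and the power (about 1.96 for $\alpha = 0.05$ and 0.84 for 80% power)  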
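The two formulas above can be evaluated with the Python standard library alone. This is a minimal sketch: the function names and the example inputs (a 10% baseline rate, a 2-point absolute lift, a standard deviation of 20) are illustrative assumptions, not values from a real test.

```python
import math
from statistics import NormalDist

def z_values(alpha=0.05, power=0.80):
    """Return (Z_{alpha/2}, Z_beta) from the standard normal distribution."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return z_alpha, z_beta

def sample_size_binary(p_hat, delta, alpha=0.05, power=0.80):
    """Per-group sample size for a binary metric such as conversion rate."""
    z_alpha, z_beta = z_values(alpha, power)
    n = 2 * (z_alpha + z_beta) ** 2 * p_hat * (1 - p_hat) / delta ** 2
    return math.ceil(n)

def sample_size_continuous(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size for a continuous metric with std. dev. sigma."""
    z_alpha, z_beta = z_values(alpha, power)
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Hypothetical inputs: 10% baseline, detect an absolute change of 2 points.
print(sample_size_binary(p_hat=0.10, delta=0.02))
# Hypothetical inputs: revenue std. dev. of 20, detect a change of 2.
print(sample_size_continuous(sigma=20, delta=2))
```

With these inputs the binary case needs roughly 3,500 users per group and the continuous case roughly 1,600. Halving $\Delta$ quadruples both numbers, because $\Delta$ enters the formulas squared.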



## During the A/B Test

- Use a testing tool  
- Test variations simultaneously  
- Allow sufficient duration for statistical significance  
- Collect qualitative user feedback  

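**Tip:** Low-rate discrete metrics (e.g., signups) may need longer test periods to accumulate enough events than continuous metrics (e.g., revenue per user) that produce a value for every user. For either type, repeatedly checking for significance before the planned sample size is reached inflates the false-positive rate unless you use a sequential method.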



## After the A/B Test

- Focus on your goal metric  
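- Compare conversion rates between control ($p_c$, $n_c$ users) and treatment ($p_t$, $n_t$ users) using the standard error of their difference: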

$$
SE = \sqrt{\frac{p_c (1-p_c)}{n_c} + \frac{p_t (1-p_t)}{n_t}} \quad \text{(discrete)}
$$

- Confidence interval (95%):

$$
(p_t - p_c) \pm Z_{0.975} \cdot SE
$$

- For continuous metrics, build the same interval around the difference in means, using the sample variances $s_c^2$ and $s_t^2$:

$$
SE = \sqrt{\frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}}
$$

- Take action based on results  
- Plan your next test  
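The standard errors and the 95% interval above can be sketched as follows. The function names and the counts in the usage lines are hypothetical, and the continuous case uses the normal approximation, which is reasonable for large samples only.

```python
import math
from statistics import NormalDist

def proportion_diff_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Difference in conversion rates (treatment - control) with its CI."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_t - p_c
    return diff, (diff - z * se, diff + z * se)

def mean_diff_ci(mean_c, s_c, n_c, mean_t, s_t, n_t, confidence=0.95):
    """Difference in means (treatment - control) with its CI."""
    se = math.sqrt(s_c ** 2 / n_c + s_t ** 2 / n_t)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = mean_t - mean_c
    return diff, (diff - z * se, diff + z * se)

# Hypothetical counts: 500/5000 conversions in control, 600/5000 in treatment.
diff, (low, high) = proportion_diff_ci(500, 5000, 600, 5000)
print(f"lift = {diff:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

If the interval excludes zero, the difference is statistically significant at the chosen level. Whether it is large enough to act on is a separate question, answered by comparing the interval with $\Delta$.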



## Advanced A/B Testing Concepts with Practical Examples

### 1. Multi-Armed Bandit Testing
- **Concept:** Dynamically allocate traffic to better-performing variants to maximize expected reward.
- **Metrics:** Discrete (clicks, signups), Continuous (revenue, time on site).
- **Thompson Sampling Example (Discrete):**

$$
\theta_i \sim \text{Beta}(\alpha_i + \text{successes}_i, \beta_i + \text{failures}_i)
$$

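- **Traffic Allocation Rule:** For each new user, draw one $\theta_i$ per variant from its posterior (here $\alpha_i, \beta_i$ are the prior parameters, unrelated to the error rates $\alpha$ and $\beta$ used earlier) and show the variant with the highest draw.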
- **Expected Regret Formula:**

$$
R(T) = \sum_{t=1}^{T} (\mu^* - \mu_{a_t})
$$

where:  
$\mu^*$ = mean reward of best variant, $\mu_{a_t}$ = mean reward of chosen variant at time $t$.

**Practical Example:**  
- Website has 3 headline variants.  
- Initial prior: $\text{Beta}(1,1)$ for each.  
- Users arrive sequentially; Thompson Sampling updates $\alpha, \beta$ for each variant based on clicks.  
- Over time, most traffic is sent to the variant performing best while still exploring others.
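The headline example can be simulated in a few lines. This is a sketch under stated assumptions: the true click rates are invented for the simulation (in a real test they are unknown), and the function name `thompson_sampling` is hypothetical.

```python
import random

def thompson_sampling(true_rates, n_users, seed=0):
    """Simulate Thompson Sampling with a Beta(1, 1) prior on each variant.

    Returns how many users each variant received and the expected regret,
    which can only be computed here because the true rates are known.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_users):
        # Draw one theta per variant from its current Beta posterior.
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        pulls[arm] += 1
        # Simulate whether this user clicks on the chosen variant.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    best = max(true_rates)
    regret = sum((best - true_rates[i]) * pulls[i] for i in range(k))
    return pulls, regret

# Hypothetical click rates for the 3 headline variants.
pulls, regret = thompson_sampling([0.04, 0.06, 0.12], n_users=5000)
print(pulls, round(regret, 1))
```

In runs like this one, the third variant should end up with most of the traffic, while the other two still receive occasional exposure.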

### 2. Sequential Testing
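- **Concept:** Lets you check results repeatedly during the test while keeping the overall Type I error at $\alpha$, provided the stopping rule is fixed in advance.
- **Adjusted significance threshold:** one simple option is a linear alpha-spending schedule, where $\alpha(t)$ is the cumulative Type I error you may spend by look $t$ of $T$ planned looks: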

$$
\alpha(t) = \alpha \cdot \frac{t}{T}
$$

- **Sequential Probability Ratio Test (SPRT):**  

$$
\Lambda_t = \frac{L(\text{data}_t \mid H_1)}{L(\text{data}_t \mid H_0)}
$$

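Stop rules, with $A < 1 < B$ (Wald's approximation sets $A \approx \beta/(1-\alpha)$ and $B \approx (1-\beta)/\alpha$):  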
- Accept $H_1$ if $\Lambda_t > B$  
- Accept $H_0$ if $\Lambda_t < A$  
- Continue sampling otherwise  

**Practical Example:**  
- Email click-through test (variant A vs B).  
- Check results daily.  
- SPRT thresholds set to stop early if one variant clearly outperforms the other, saving time and resources.
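The stop rules can be illustrated with the simplest form of SPRT: a single stream of click/no-click outcomes tested against two fixed rates, $H_0: p = p_0$ versus $H_1: p = p_1$. This is a simplification of the two-variant email test, chosen to keep the sketch short. The function name and the example rates are assumptions, and the thresholds use Wald's approximation.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for Bernoulli data, working on the log scale.

    Returns (decision, number of observations used).
    """
    log_a = math.log(beta / (1 - alpha))   # lower threshold, log(A)
    log_b = math.log((1 - beta) / alpha)   # upper threshold, log(B)
    llr = 0.0                              # log of the likelihood ratio
    for t, clicked in enumerate(observations, start=1):
        if clicked:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr > log_b:
            return "accept H1", t
        if llr < log_a:
            return "accept H0", t
    return "continue", len(observations)

# Hypothetical stream of outcomes (1 = click, 0 = no click).
print(sprt_bernoulli([0, 0, 1, 0, 0, 0, 0, 1, 0, 0], p0=0.10, p1=0.20))
```

Because the thresholds are set before any data arrive, checking after every observation does not inflate the error rates the way repeated fixed-sample tests would.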

### 3. Bayesian A/B Testing
- **Concept:** Updates posterior distribution as new data arrives. Provides probability-based decision-making.
- **Posterior:**

$$
p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta) \cdot p(\theta)
$$

- **Credible Interval:**  
95% credible interval for a metric $\theta$:

$$
\text{CI}_{95\%} = [\theta_{\text{lower}}, \theta_{\text{upper}}] \quad \text{such that} \quad P(\theta \in \text{CI}_{95\%} \mid \text{data}) = 0.95
$$

**Practical Example:**  
- A/B test for revenue per user (continuous metric).  
- Prior: $\theta \sim \text{Normal}(50, 20^2)$  
- Data collected: 200 users per variant.  
- Posterior shows $P(\text{Variant B > Variant A}) = 0.92$, which favors B. Whether that is enough to roll out depends on the decision threshold you set before the test (for example 0.95).
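A posterior like the one in this example can be sketched with a conjugate Normal update. Several things here are assumptions made for illustration: the observation standard deviation (`sigma=30`) is treated as known, the sample means are invented, the two variants are treated as independent, and the function names are hypothetical. Only the prior and the 200 users per variant come from the example.

```python
import math
from statistics import NormalDist

def normal_posterior(prior_mean, prior_sd, sample_mean, sigma, n):
    """Posterior for a Normal mean with a Normal prior and known sigma."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return NormalDist(post_mean, math.sqrt(post_var))

def credible_interval(posterior, level=0.95):
    """Equal-tailed credible interval from a Normal posterior."""
    tail = (1 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)

def prob_b_beats_a(post_a, post_b):
    """P(theta_B > theta_A) for two independent Normal posteriors."""
    diff_sd = math.sqrt(post_a.variance + post_b.variance)
    return 1 - NormalDist(post_b.mean - post_a.mean, diff_sd).cdf(0)

# Prior Normal(50, 20^2) and n = 200 as in the example; means are invented.
post_a = normal_posterior(50, 20, sample_mean=50.0, sigma=30, n=200)
post_b = normal_posterior(50, 20, sample_mean=54.0, sigma=30, n=200)
print(credible_interval(post_b))
print(prob_b_beats_a(post_a, post_b))
```

With 200 users per variant the data dominate the prior, so the posterior means sit very close to the sample means. Real revenue data are often skewed, in which case a Normal model is only a rough approximation.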

### 4. Metric Selection & Pitfalls
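- **Continuous metrics:** revenue, session duration. They can reveal subtle shifts in behavior but tend to be noisy and sensitive to outliers.
- **Discrete metrics:** signups, purchases. They map to clear user actions and are easy to interpret.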
- **Pitfalls:**
  - Multiple comparisons inflate Type I error.
  - Peeking at data without correction biases results.
  
**Practical Example:**  
- Testing 4 product layouts.  
- Tracking both average revenue (continuous) and purchase count (discrete).  
- Adjust significance using the Bonferroni correction (test each of the $m$ comparisons at $\alpha / m$) when comparing multiple variants or metrics.
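The Bonferroni adjustment amounts to dividing $\alpha$ by the number of comparisons. A minimal sketch, with a hypothetical function name and invented p-values for three challenger layouts compared against the control:

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-comparison threshold and which tests pass it."""
    m = len(p_values)
    threshold = alpha / m
    return threshold, [p <= threshold for p in p_values]

# Hypothetical p-values: layouts B, C, D each compared with control A.
threshold, significant = bonferroni([0.010, 0.030, 0.200])
print(threshold, significant)
```

Here the second layout would pass an unadjusted 0.05 threshold but fails the adjusted one. Tracking two metrics per layout doubles the number of comparisons, so $m$ should count those as well.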

### 5. Practical Considerations
- **Segmentation:** device, geography, traffic source.
- **Duration:** ensure enough users for reliable results (consider both discrete & continuous metrics).
- **Randomization:** prevent allocation bias.
- **External confounders:** holidays, marketing campaigns, or outages.

**Practical Example:**  
- Mobile app A/B test with two variants of onboarding flow.  
- Segment by OS (iOS/Android), traffic source (organic/paid).  
- Run for 4 weeks to capture weekday/weekend usage patterns.  
- Check metrics separately for each segment to detect interaction effects.
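Checking metrics per segment comes down to grouping outcomes by segment and variant before comparing them. The sketch below assumes a hypothetical record layout of `(segment, variant, converted)` tuples, and the sample records are invented.

```python
from collections import defaultdict

def conversion_by_segment(records):
    """Conversion rate for every (segment, variant) pair.

    records: iterable of (segment, variant, converted) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [conversions, users]
    for segment, variant, converted in records:
        cell = counts[(segment, variant)]
        cell[0] += int(converted)
        cell[1] += 1
    return {key: conv / users for key, (conv, users) in counts.items()}

# Hypothetical onboarding outcomes.
records = [
    ("iOS", "A", True), ("iOS", "A", False),
    ("iOS", "B", True), ("Android", "A", False),
]
for key, rate in sorted(conversion_by_segment(records).items()):
    print(key, rate)
```

Each segment has fewer users than the full test, so per-segment differences are noisier, and looking at many segments is itself a form of multiple comparison.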




## How to Read Results

1. Verify statistical significance  
2. Compute confidence intervals for discrete or continuous metrics  
3. Examine heterogeneous effects  
4. Assess practical significance ($\Delta$)  
5. Decide next steps: rollout, iterate, or discard  


## A/B Testing Decision Flowchart
```
Start
  │
  ▼
Select Metric
  │
  ├─► Is it Discrete? (clicks, signups, purchases)
  │       │
  │       ├─► Yes 
  │       │     │
  │       │     ├─► Use standard A/B test for proportions
  │       │     ├─► Optional: Multi-Armed Bandit, Sequential, or Bayesian testing
  │       │     └─► Sample size via binary formula
  │       │
  │       └─► No (continuous: time on page, revenue)
  │             │
  │             ├─► Use t-test / ANOVA
  │             ├─► Optional: Multi-Armed Bandit, Sequential, or Bayesian testing
  │             └─► Sample size via variance-based formula
  │
  ▼
Randomly split audience
  │
  ▼
Run test (ensure sufficient duration)
  │
  ▼
Analyze Results
  │
  ├─► Statistically Significant?
  │       │
  │       ├─► Yes ─► Implement winning variant
  │       │
  │       └─► No ─► Iterate or increase sample size
  │
  ▼
End
```
