# A/B Testing - Interview Questions & Answers

Comprehensive Q&A covering fundamentals, experiment design, metric selection, common pitfalls, and advanced topics.

## Fundamentals

### Q1: What is A/B testing and why is it important?

**Answer:**

A/B testing is a controlled experiment comparing two or more variants to measure the impact of a change on a specific metric. It's the gold standard for **causal inference** in product development because it eliminates confounders through randomization.

Unlike observational studies, A/B tests let you attribute changes **directly** to your intervention.

### Q2: Explain the difference between Type I and Type II errors in A/B testing context.

**Answer:**

- **Type I Error** (false positive, $\alpha$): Concluding the treatment works when it doesn't. You ship a change that has no real effect (or is harmful). Controlled by significance level $\alpha$ (typically 5%).

- **Type II Error** (false negative, $\beta$): Missing a real effect. You fail to ship a beneficial change. Controlled by statistical power $= 1 - \beta$ (typically 80%).

**Trade-off:** For a fixed sample size, effect size, and variance, reducing $\alpha$ (more conservative) increases $\beta$ (more false negatives). To improve both at once, **increase the sample size** or reduce the metric's variance.

### Q3: What is a p-value? What is the most common misconception?

**Answer:**

The p-value is the probability of observing a result at least as extreme as the one actually observed, **assuming $H_0$ is true**.

$$p\text{-value} = P(\text{data at least as extreme as observed} \mid H_0)$$

**Common misconception:** "The p-value is the probability that $H_0$ is true" — this is **WRONG**. The p-value says nothing about the probability of hypotheses. It only tells you how surprising the data would be under $H_0$.

Another misconception: $p = 0.04$ doesn't mean there's a 96% chance the treatment works.
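
To make the definition concrete, here is a minimal sketch of a two-sided p-value for a difference in conversion rates, using a pooled two-proportion z-test. The function name and the counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: the best estimate of the common rate if H0 is true.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a |z| at least this large when H0 holds.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 5.0% vs 5.6% conversion on 10,000 users each.
p = two_proportion_p_value(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"p-value = {p:.3f}")
```

The returned number describes how surprising the data would be under $H_0$; it is not the probability that $H_0$ is true.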

### Q4: What is statistical power and why does it matter?

**Answer:**

$$\text{Power} = P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$$

It's the probability of detecting a real effect when one exists. Standard target is **80%**.

Power depends on:
1. **Sample size** — larger $n$ = more power
2. **Effect size** — larger effect = easier to detect
3. **Significance level** — higher $\alpha$ = more power but more false positives
4. **Variance** — lower variance = more power

An underpowered test wastes resources because you're unlikely to detect the effect even if it exists.
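
The dependence on sample size can be sketched with a normal approximation for a two-sided two-proportion z-test. The function name and the example rates (10% baseline, 11% treatment) are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return norm.cdf(abs(p2 - p1) / se - z_crit)


# Power grows with sample size for the same true effect.
for n in (2_000, 8_000, 32_000):
    print(n, round(power_two_proportions(0.10, 0.11, n), 2))
```

Raising `alpha` or increasing the gap between `p1` and `p2` raises the returned power in the same way, which matches the list above.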

### Q5: Explain the difference between statistical significance and practical significance.

**Answer:**

- **Statistical significance** means a difference at least as large as the one observed would be unlikely if there were no true effect ($p < \alpha$).
- **Practical significance** means the difference is large enough to matter for the business.

They don't always align:
1. **Large sample + tiny effect** = statistically significant but practically useless (e.g., 0.001% conversion lift on millions of users).
2. **Small sample + large effect** = practically significant but not statistically significant (underpowered).

Always evaluate **BOTH**. The confidence interval helps: it shows the range of plausible effect sizes, letting you assess practical importance.
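
As an illustration of the first case, the sketch below computes a Wald confidence interval for the absolute lift and compares it with a business threshold. The counts and the 0.5 percentage point threshold are made up, and `diff_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the absolute lift p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical: 10 million users per arm, 10.00% vs 10.05% conversion.
low, high = diff_confidence_interval(1_000_000, 10_000_000, 1_005_000, 10_000_000)
min_worthwhile_lift = 0.005  # assumed business threshold (0.5 percentage points)

print(f"95% CI for lift: [{low:.5f}, {high:.5f}]")
print("statistically significant:", low > 0)
print("could be practically significant:", high >= min_worthwhile_lift)
```

Here the interval excludes zero, yet its upper end sits well below the threshold, so the effect is real but too small to matter under the assumed business rule.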

## Experiment Design

### Q6: How do you determine the sample size for an A/B test?

**Answer:**

Sample size depends on four parameters:
1. **Significance level** $\alpha$ (typically 0.05)
2. **Power** $1 - \beta$ (typically 0.80)
3. **Baseline metric value** (current conversion rate)
4. **Minimum Detectable Effect (MDE)** — the smallest improvement worth detecting

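Formula for proportions (sample size **per group**, two-sided test):

$$n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \cdot [p_1(1 - p_1) + p_2(1 - p_2)]}{(p_1 - p_2)^2}$$

where $p_1$ is the baseline rate, $p_2$ is the rate you want to be able to detect (baseline plus MDE), and $Z_{\alpha/2}$ and $Z_\beta$ are upper-tail standard normal quantiles (about 1.96 and 0.84 for $\alpha = 0.05$ and power $= 0.80$).

Rule of thumb: $n \approx 16\sigma^2 / \delta^2$ per group for $\alpha = 0.05$, power $= 0.80$, where $\sigma^2$ is the metric's variance and $\delta$ is the absolute MDE.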

**Key insight:** Halving the MDE **quadruples** the required sample size. Always calculate sample size **BEFORE** running the test.
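
The proportions formula above can be sketched directly in code. The function name and the example rates are assumptions; the result is the sample size per group.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = norm.inv_cdf(power)           # about 0.84 for power = 0.80
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


n_full = sample_size_per_group(0.10, 0.12)  # MDE of 2 percentage points
n_half = sample_size_per_group(0.10, 0.11)  # MDE of 1 percentage point
print(n_full, n_half, round(n_half / n_full, 2))
```

The ratio comes out close to four rather than exactly four because the variance term also shifts slightly when $p_2$ changes.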

### Q7: How do you decide how long to run an A/B test?

**Answer:**

$$\text{Duration} = \frac{\text{required\_sample\_size}}{\text{daily\_eligible\_traffic}}$$

Here the required sample size is the total across all variants, not the per-group figure. Also consider:
1. **Minimum 1 week** to capture weekday/weekend patterns
2. **Avoid** holidays, sales events, or other anomalies
3. **Account for novelty effect** — users react to ANY change initially
4. **Account for primacy effect** — users resist change initially

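**Never** stop a test early just because results look significant (the peeking problem), and **never** keep extending a fixed-horizon test until it reaches significance (a form of p-hacking). If you need that flexibility, plan for it up front with a sequential design.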
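
A small sketch of the duration arithmetic follows. Rounding up to whole weeks is an assumption added here so that every weekday is represented equally; the function name and traffic numbers are hypothetical.

```python
from math import ceil


def planned_duration_days(n_per_group, num_variants, daily_eligible_traffic, min_days=7):
    """Days needed to reach the total sample size, rounded up to full weeks."""
    total_needed = n_per_group * num_variants
    raw_days = ceil(total_needed / daily_eligible_traffic)
    # Enforce the one-week minimum, then round up to a whole number of weeks.
    weeks = ceil(max(raw_days, min_days) / 7)
    return weeks * 7


# Hypothetical: 14,749 users per group, 2 variants, 3,000 eligible users per day.
days = planned_duration_days(14_749, 2, 3_000)
print(days)
```

Fixing this number before launch is what makes the stopping rule credible later.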

### Q8: What is the Minimum Detectable Effect (MDE) and how do you choose it?

**Answer:**

MDE is the smallest effect size that would be worthwhile to detect from a **business perspective**.

Choosing MDE:
1. Work with stakeholders to understand what improvement justifies the cost of implementing the change
2. Consider the baseline metric and what's realistic
3. Translate to business impact (e.g., 1pp conversion increase = \$X revenue/month)
4. Balance against duration — smaller MDE needs more data

- **For large platforms** (millions of users): MDE can be small (0.5–1%) because even small effects have huge aggregate impact.
- **For small companies**: MDE usually has to be larger (5–10%) because limited traffic cannot detect smaller effects in a reasonable time, and only bigger effects justify the investment.

### Q9: What is an A/A test and why would you run one?

**Answer:**

An A/A test shows the **same experience** to both groups (no treatment).

Purposes:
1. **Validate** the randomization system works correctly
2. **Verify** the logging and metric computation pipeline
3. **Calibrate** false positive rates (you should reject $H_0$ about 5% of the time at $\alpha = 0.05$)
4. **Detect** selection bias or instrumentation issues

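A single significant A/A result is expected about 5% of the time at $\alpha = 0.05$. If A/A tests reject $H_0$ much more often than that, or the same imbalance keeps reappearing, your experimentation platform likely has a bug. **Do NOT** run A/B tests until it's fixed.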
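
The calibration idea can be sketched as a simulation: run many A/A comparisons on synthetic data and count how often a z-test rejects. Everything here (function name, rates, sample sizes, seed) is a hypothetical setup; on a real platform you would replay the actual assignment and metric pipeline instead of generating random data.

```python
import random
from math import sqrt
from statistics import NormalDist


def aa_false_positive_rate(n_tests=2000, n_per_group=1000, rate=0.10,
                           alpha=0.05, seed=0):
    """Share of simulated A/A tests in which a pooled z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_tests):
        # Both groups get the same true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < rate for _ in range(n_per_group))
        p_pool = (conv_a + conv_b) / (2 * n_per_group)
        se = sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
        z = (conv_b - conv_a) / n_per_group / se if se > 0 else 0.0
        rejections += abs(z) > z_crit
    return rejections / n_tests


observed_rate = aa_false_positive_rate()
print(f"A/A rejection rate: {observed_rate:.3f}")
```

A rejection rate near `alpha` suggests the pipeline is calibrated; a rate far above it points to a problem in randomization, logging, or the variance estimate.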

## Metric Selection

### Q10: How do you choose the right metric for an A/B test?

**Answer:**

A good primary metric must be **MAST**:
- **M**easurable — quantifiable
- **A**ttributable — causally linked to the change
- **S**ensitive — responsive to the change, low noise
- **T**imely — observable within the test window

Approach:
1. Start with the business goal
2. Choose **ONE** primary metric (decision metric)
3. Define guardrail metrics (must not degrade)
4. Add secondary metrics (nice to understand but don't drive decisions)

**Example:** Testing a new checkout flow —
- *Primary:* conversion rate
- *Guardrails:* page load time, error rate, revenue per user
- *Secondary:* cart abandonment rate, time to purchase

### Q11: What are guardrail metrics and why are they important?

**Answer:**

Guardrail metrics are metrics that **must NOT degrade**, even if the primary metric improves. They protect against unintended consequences.

Examples:
1. Testing a more aggressive notification strategy → *Primary:* engagement. *Guardrail:* unsubscribe rate.
2. Testing a simpler checkout → *Primary:* conversion. *Guardrail:* average order value, revenue.

**Rule:** If a guardrail metric degrades significantly, **DO NOT launch** the treatment regardless of the primary metric result.

## Common Pitfalls & Challenges

### Q12: What is the peeking problem and how do you address it?

**Answer:**

**Peeking** = checking test results repeatedly before the planned end date and stopping early when you see significance.

**Problem:** At any point during the test, random fluctuations can produce a "significant" result. If you check daily for 14 days and stop at the first significant look, your effective false positive rate can climb above **20%** instead of 5%.

**Solutions:**
1. Pre-commit to a sample size and don't peek
2. Use **sequential testing** methods (e.g., group sequential designs with O'Brien-Fleming bounds)
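3. Use **Bayesian methods** where peeking is more natural (stopping rules still need to be chosen in advance, because stopping at the first favorable look can still mislead)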
4. Use **always-valid confidence intervals**
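
The inflation caused by peeking can be checked with a simulation. This is a minimal sketch under assumed parameters (no true effect, 14 daily looks, 200 users per group per day); the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(n_experiments=1000, days=14, users_per_day=200,
                                rate=0.10, alpha=0.05, seed=0):
    """Share of null experiments declared significant at any daily look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    stopped_early = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n = 0
        for _day in range(days):
            n += users_per_day
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            p_pool = (conv_a + conv_b) / (2 * n)
            se = sqrt(p_pool * (1 - p_pool) * 2 / n)
            if se > 0 and abs(conv_b - conv_a) / n / se > z_crit:
                stopped_early += 1  # a "significant" peek would end the test
                break
    return stopped_early / n_experiments


peeking_rate = peeking_false_positive_rate()
print(f"False positive rate with daily peeking: {peeking_rate:.2f}")
```

Because both groups share the same true rate, every early stop here is a false positive, so the printed rate is directly comparable with the nominal 5%.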

### Q13: What is Simpson's Paradox and how can it affect A/B tests?

**Answer:**

Simpson's Paradox occurs when a trend that appears in **aggregate** data **reverses** when the data is segmented.

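**Example:** Treatment wins overall but loses in every individual segment (mobile, desktop, tablet) because the treatment/control split differs across segments, for instance when the allocation ratio changes mid-test or a bug skews assignment on one platform.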

How to handle:
1. Always **segment** your analysis by key dimensions
2. Check for **interaction effects** between the treatment and important covariates
3. Ensure randomization is balanced across segments (**stratified randomization**)
4. Use the segment-level analysis to understand **WHY**, but the overall metric for the decision (unless there's a specific reason to focus on a segment)
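
A worked example with made-up counts shows how the reversal arises. Each tuple is `(conversions, users)`; the treatment arm is heavily skewed toward desktop, which converts better regardless of variant.

```python
# Hypothetical counts: (conversions, users) per segment and arm.
data = {
    "mobile":  {"control": (45, 900), "treatment": (4, 100)},
    "desktop": {"control": (20, 100), "treatment": (162, 900)},
}


def rate(conversions, users):
    return conversions / users


def overall(arm):
    """Conversion rate for one arm with all segments pooled together."""
    conversions = sum(segment[arm][0] for segment in data.values())
    users = sum(segment[arm][1] for segment in data.values())
    return rate(conversions, users)


for name, arms in data.items():
    print(name, {arm: round(rate(*counts), 3) for arm, counts in arms.items()})
print("overall", {arm: round(overall(arm), 3) for arm in ("control", "treatment")})
```

Control wins inside both segments, yet treatment wins in the pooled numbers, purely because of the uneven split.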

### Q14: What are network effects (interference) and how do they affect A/B tests?

**Answer:**

Network effect / interference occurs when a user's experience in one group **affects users in another group**, violating the independence assumption (**SUTVA** — Stable Unit Treatment Value Assumption).

Examples:
1. **Social networks:** if treatment users share content differently, control users' feeds change too
2. **Marketplace:** if treatment sellers change pricing, control buyers see different prices
3. **Ride-sharing:** if treatment changes driver incentives, control riders are affected

**Solutions:**
1. **Cluster randomization** — randomize by geography, social cluster
2. **Switchback experiments** — alternate treatment/control over time
3. Use **ego-network randomization** for social products
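
Cluster randomization can be sketched with deterministic hashing: the variant is derived from the cluster identifier rather than the user identifier, so everyone in a cluster shares one experience. The experiment name, the city-based clusters, and the `assign_variant` helper are all hypothetical.

```python
import hashlib


def assign_variant(unit_id, experiment="exp_demo", treatment_share=0.5):
    """Deterministically map a randomization unit to a variant."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Randomize by cluster (here, an assumed city field), not by user.
users = [("u1", "berlin"), ("u2", "berlin"), ("u3", "paris"), ("u4", "paris")]
assignments = {user: assign_variant(city) for user, city in users}
print(assignments)
```

The trade-off is statistical: the effective sample size becomes the number of clusters, so the analysis should use cluster-level variance.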

### Q15: What is the multiple testing problem? How do you handle it?

**Answer:**

When running multiple tests or checking multiple metrics, the probability of at least one false positive increases. For $k$ independent tests, each at level $\alpha$:

$$P(\text{at least 1 FP}) = 1 - (1 - \alpha)^k$$

With 20 metrics at $\alpha = 0.05$: **64% chance** of at least one false positive!

**Correction methods:**
1. **Bonferroni:** divide $\alpha$ by number of tests. Simple but very conservative.
2. **Holm-Bonferroni:** step-down procedure, less conservative.
3. **Benjamini-Hochberg (FDR):** controls false discovery rate instead of family-wise error rate. Best for exploratory analysis with many metrics.
4. **Pre-register ONE primary metric** and treat others as exploratory.
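
The family-wise error formula and two of the corrections can be sketched in a few lines. The p-values are invented for illustration, and the function names are hypothetical helpers rather than a library API.

```python
def family_wise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    return [p <= alpha / len(p_values) for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the smallest p-values up to the largest rank r with p_(r) <= r/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected


p_values = [0.001, 0.012, 0.021, 0.035, 0.30]  # made-up metric p-values
print(round(family_wise_error_rate(0.05, 20), 2))
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these inputs Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps four, which illustrates why the latter suits exploratory analysis across many metrics.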

## Advanced Topics

### Q16: When would you use Bayesian A/B testing over frequentist?

**Answer:**

**Use Bayesian when:**
1. Stakeholders want interpretable results ("82% probability B is better" vs "$p = 0.03$")
2. You need to stop tests early or monitor continuously
3. You have strong prior information to incorporate
4. You want to make risk-aware decisions using expected loss

**Use frequentist when:**
1. You need well-established, widely accepted methodology
2. Regulatory or compliance requirements demand frequentist methods
3. Simplicity and computational efficiency matter
4. You have clear pre-defined sample sizes and timelines
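
For conversion metrics, the Bayesian quantities mentioned above can be sketched with a Beta-Binomial model and Monte Carlo draws. The uniform Beta(1, 1) prior, the counts, and the function name are assumptions for illustration.

```python
import random


def posterior_summary(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Return P(B beats A) and the expected loss of shipping B."""
    rng = random.Random(seed)
    wins = 0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        # Beta(1, 1) prior updated with observed successes and failures.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        # Loss is the conversion rate given up when B is actually worse.
        loss_if_ship_b += max(rate_a - rate_b, 0.0)
    return wins / draws, loss_if_ship_b / draws


# Hypothetical counts: 100/1000 conversions for A, 120/1000 for B.
prob_b_better, expected_loss = posterior_summary(100, 1000, 120, 1000)
print(f"P(B > A) = {prob_b_better:.2f}, expected loss = {expected_loss:.4f}")
```

A decision rule could then ship B once the expected loss falls below a tolerance agreed with stakeholders; the tolerance itself is a business choice, not something the model provides.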

### Q17: What is CUPED and why is it used?

**Answer:**

**CUPED** (Controlled-experiment Using Pre-Experiment Data) is a **variance reduction** technique. It uses pre-experiment data (covariates) correlated with the metric to reduce noise.

Formula:

$$Y_{\text{adjusted}} = Y - \theta \cdot (X - \bar{X})$$

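where $X$ is the pre-experiment covariate (often the same metric measured before the experiment), $\bar{X}$ is its mean, and $\theta = \text{Cov}(X, Y) / \text{Var}(X)$ is the coefficient that minimizes the variance of $Y_{\text{adjusted}}$.

**Benefits:** The variance shrinks by a factor of $1 - \rho^2$, where $\rho$ is the correlation between $X$ and $Y$. With a strongly correlated covariate this can cut variance by **50%+**, which means you need much smaller samples (or shorter tests) to detect the same effect.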

It's widely used at Microsoft, Netflix, Uber, and other large tech companies. It's essentially applying **regression adjustment** to experimental data.
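
A minimal sketch of the adjustment on synthetic data follows. The data-generating process (pre-period values strongly predictive of in-experiment values) and the function name are assumptions; in a real experiment the coefficient is typically estimated on data pooled across both arms.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = fmean([(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)])
    theta = cov_xy / pvariance(x, mu=x_bar)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Synthetic data: the in-experiment metric depends on the pre-period metric.
rng = random.Random(0)
pre = [rng.gauss(100, 20) for _ in range(5000)]
post = [0.8 * value + rng.gauss(0, 10) for value in pre]

adjusted = cuped_adjust(post, pre)
reduction = 1 - pvariance(adjusted) / pvariance(post)
print(f"variance reduction: {reduction:.0%}")
```

The adjustment leaves the mean of the metric unchanged and only removes the part of the noise that the covariate explains.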

### Q18: What is a Multi-Armed Bandit? How does it differ from A/B testing?

**Answer:**

**Multi-Armed Bandit (MAB)** dynamically allocates more traffic to better-performing variants during the experiment.

Key differences from A/B testing:

| | A/B Test | Multi-Armed Bandit |
|---|---|---|
| **Allocation** | Fixed (e.g., 50/50) | Dynamic (adapts over time) |
| **Optimizes for** | Statistical error minimization | Opportunity cost (regret) minimization |
| **Inference** | Clear statistical inference | Trades inference quality for optimization |

**Use MAB when:**
1. Opportunity cost of showing inferior variant is high
2. You care more about optimization than inference
3. Short-term decisions with limited traffic

**Use A/B when:**
1. You need rigorous causal inference
2. You want to understand the magnitude of the effect
3. Long-term product decisions
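
Thompson sampling is one common bandit strategy, shown here as a sketch of dynamic allocation; the source does not name a specific algorithm, so this choice is an assumption. The true rates are invented and would be unknown in practice.

```python
import random


def thompson_sampling(true_rates, n_users=5000, seed=0):
    """Simulate Thompson sampling and return the number of users sent to each arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_users):
        # Draw a plausible rate per arm from its Beta posterior, then play the best.
        samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


pulls = thompson_sampling([0.05, 0.10])  # hypothetical true conversion rates
print(pulls)
```

Most traffic ends up on the better arm, which lowers regret but leaves few observations on the weaker arm, so the estimate of the effect size is less precise than in a fixed-split test.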

### Q19: Your A/B test shows no significant result. What do you tell stakeholders?

**Answer:**

"Fail to reject $H_0$" does **NOT** mean "no effect exists."

My approach:
1. **Check the confidence interval**: if it includes meaningful positive effects, the test may be underpowered. Recommend a follow-up test with a larger, pre-computed sample size rather than extending until the result turns significant.
2. **Check if the CI is narrow and centered around zero**: then you can confidently say "the effect, if any, is too small to matter."
3. **Check for bugs**: sample ratio mismatch (SRM), instrumentation issues, incorrect metric computation.
4. **Segment analysis**: the effect may exist in a subgroup. Treat segment findings as exploratory, since slicing creates multiple comparisons.
5. **Be honest:** "We don't have enough evidence to conclude the treatment is better. Here's what the data tells us about the range of possible effects."

**Never** say "the treatment doesn't work" — that's accepting $H_0$, which is different from failing to reject it.
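
The SRM check mentioned in the list above can be sketched as a chi-square goodness-of-fit test on group sizes. The counts are hypothetical, and the 0.001 alert threshold is a common convention assumed here, not something stated in this document.

```python
from math import sqrt
from statistics import NormalDist


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """P-value that observed group sizes match the planned split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # With one degree of freedom, chi2 is the square of a standard normal.
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))


SRM_THRESHOLD = 0.001  # assumed alerting convention

for counts in [(50_000, 50_100), (50_000, 51_200)]:
    p = srm_p_value(*counts)
    print(counts, f"p = {p:.5f}", "SRM suspected" if p < SRM_THRESHOLD else "ok")
```

If the check fails, the metric comparison itself is not trustworthy and the assignment or logging path should be investigated first.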

### Q20: Walk me through how you would design an A/B test from scratch for a new feature.

**Answer:**

Step-by-step framework:

1. **UNDERSTAND:** What business problem does this solve? What's the expected user behavior change?
2. **HYPOTHESIZE:** Define $H_0$ and $H_a$. Use the PICOT framework (Population, Intervention, Comparison, Outcome, Time).
3. **METRICS:** Choose primary metric (one!), guardrail metrics, secondary metrics.
4. **DESIGN:** Set $\alpha$ (0.05) and power (0.80), agree on the MDE with stakeholders, compute sample size, estimate duration.
5. **VALIDATE:** Run A/A test to verify the platform. Check for sample ratio mismatch (SRM).
6. **RUN:** Deploy to randomized groups. Monitor guardrail metrics daily. Don't peek at primary metric.
7. **ANALYZE:** After planned duration, compute test statistic and p-value. Calculate confidence interval. Check practical significance.
8. **DECIDE:**
   - Significant + practical → **launch**
   - Not significant + narrow CI around 0 → **don't launch**
   - Inconclusive → **rerun with a larger planned sample or redesign**
   - Guardrails degraded → **don't launch**