# Case Study 2: SaaS Session Duration A/B Test (Continuous Metrics)

## Scenario
A SaaS company redesigned its onboarding flow to improve user engagement. The product team wants to test whether the new onboarding increases **average session duration** (a continuous metric). Current average session duration is **5.2 minutes** with a standard deviation of ~3.1 minutes.

**Tests used:** Two-sample t-test, Welch's t-test, Mann-Whitney U test

## 1. Business Understanding & Hypothesis

**Business context:** Longer session duration correlates with feature adoption and retention. The new onboarding introduces interactive tutorials instead of static walkthroughs.

**Hypotheses:**
- $H_0$: The new onboarding has no effect on session duration ($\mu_{treatment} = \mu_{control}$)
- $H_a$: The new onboarding changes session duration ($\mu_{treatment} \neq \mu_{control}$)

**Experiment parameters:**
- $\alpha = 0.05$
- Power = 0.80
- MDE: 0.5-minute increase in average session duration (the test is two-sided, so a decrease of the same size would also be detected)

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

## 2. Experiment Design & Sample Size

In [None]:
# Parameters
baseline_mean = 5.2      # minutes
baseline_std = 3.1       # minutes
mde = 0.5                # 0.5 minute increase
alpha = 0.05
power = 0.80

# Cohen's d effect size
cohens_d = mde / baseline_std
print(f"Cohen's d effect size: {cohens_d:.4f}")
print(f"  (small: 0.2, medium: 0.5, large: 0.8)")

# Sample size calculation
analysis = TTestIndPower()
required_n = analysis.solve_power(
    effect_size=cohens_d,
    alpha=alpha,
    power=power,
    alternative='two-sided'
)
required_n = int(np.ceil(required_n))
print(f"\nRequired sample size per group: {required_n:,}")
print(f"Total sample size needed: {required_n * 2:,}")
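As a cross-check on the solver, the per-group sample size can be approximated in closed form with the normal approximation $n \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$. The sketch below is illustrative only: the variable names are not part of the original notebook, and because it ignores the t-distribution correction it should land slightly below the value from `TTestIndPower`.

```python
import numpy as np
from scipy import stats

alpha, power = 0.05, 0.80
d = 0.5 / 3.1  # Cohen's d: MDE divided by the baseline standard deviation

z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
z_beta = stats.norm.ppf(power)           # quantile matching the target power

# Normal-approximation sample size per group
n_approx = 2 * (z_alpha + z_beta) ** 2 / d ** 2
n_approx_ceiled = int(np.ceil(n_approx))
print(f"Normal-approximation sample size per group: {n_approx_ceiled:,}")
```

If the two numbers differ by more than a few users, that usually points to a mismatch in the effect size or in the one-sided versus two-sided setting.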

## 3. Generate Simulated Data

Session duration data is typically **right-skewed** (some users have very long sessions). We simulate this with a gamma distribution, which is more realistic than a normal distribution because it is skewed and cannot produce negative durations.

In [None]:
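# Fixed at a round 600 per group; compare with required_n from Section 2.
# If required_n is larger, the simulated test is slightly underpowered.
n_per_group = 600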

# Control: gamma distribution with mean=5.2, std~3.1
# shape=k, scale=theta -> mean=k*theta, var=k*theta^2
k_control = (baseline_mean / baseline_std) ** 2
theta_control = baseline_std ** 2 / baseline_mean

# Treatment: slight improvement -> mean=5.7, similar std
treatment_mean = 5.7
treatment_std = 3.2
k_treatment = (treatment_mean / treatment_std) ** 2
theta_treatment = treatment_std ** 2 / treatment_mean

control = np.random.gamma(k_control, theta_control, n_per_group)
treatment = np.random.gamma(k_treatment, theta_treatment, n_per_group)

df = pd.DataFrame({
    'group': ['control'] * n_per_group + ['treatment'] * n_per_group,
    'session_duration': np.concatenate([control, treatment])
})

print("=== Descriptive Statistics ===")
summary = df.groupby('group')['session_duration'].agg(['count', 'mean', 'std', 'median'])
summary.columns = ['N', 'Mean', 'Std', 'Median']
print(summary.round(3))
print(f"\nDifference in means: {treatment.mean() - control.mean():.3f} minutes")
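The shape and scale used in the simulation come from matching moments: with $\text{mean} = k\theta$ and $\text{var} = k\theta^2$, solving gives $k = (\text{mean}/\text{std})^2$ and $\theta = \text{std}^2/\text{mean}$. A minimal sketch that verifies the round trip, using a hypothetical helper `gamma_params` that is not in the original notebook:

```python
import numpy as np

def gamma_params(mean, std):
    """Return (shape k, scale theta) of a gamma distribution with the given mean and std."""
    k = (mean / std) ** 2
    theta = std ** 2 / mean
    return k, theta

k, theta = gamma_params(5.2, 3.1)

# Recover the moments from the parameters to confirm the algebra
implied_mean = k * theta
implied_std = np.sqrt(k) * theta
print(f"k={k:.4f}, theta={theta:.4f}, mean={implied_mean:.2f}, std={implied_std:.2f}")
```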

## 4. Check Assumptions

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# Distribution plots
axes[0].hist(control, bins=40, alpha=0.6, color='#3498db', label='Control', density=True)
axes[0].hist(treatment, bins=40, alpha=0.6, color='#e74c3c', label='Treatment', density=True)
axes[0].set_xlabel('Session Duration (min)')
axes[0].set_ylabel('Density')
axes[0].set_title('Distribution of Session Duration')
axes[0].legend()

# QQ plots
stats.probplot(control, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Control')

stats.probplot(treatment, dist="norm", plot=axes[2])
axes[2].set_title('Q-Q Plot: Treatment')

plt.tight_layout()
plt.show()

In [None]:
# Test for normality (Shapiro-Wilk)
shapiro_control = stats.shapiro(control)
shapiro_treatment = stats.shapiro(treatment)
print("=== Normality Tests (Shapiro-Wilk) ===")
print(f"Control:   W={shapiro_control.statistic:.4f}, p={shapiro_control.pvalue:.3g}")
print(f"Treatment: W={shapiro_treatment.statistic:.4f}, p={shapiro_treatment.pvalue:.3g}")
print("\nNote: With large samples, Shapiro-Wilk flags even minor departures from normality.")
print(f"By the CLT, the sampling distribution of the mean is approximately normal for n={n_per_group}.")

# Test for equal variances (Levene's test)
levene_stat, levene_pval = stats.levene(control, treatment)
print(f"\n=== Equal Variance Test (Levene's) ===")
print(f"Statistic: {levene_stat:.4f}, p-value: {levene_pval:.4f}")
if levene_pval < alpha:
    print("Variances are significantly different -> Use Welch's t-test")
else:
    print("No evidence of unequal variances -> Standard t-test is OK (Welch's remains a safe default)")

## 5. Statistical Analysis

### 5.1 Two-Sample t-test (assumes equal variance)

In [None]:
t_stat, t_pval = stats.ttest_ind(treatment, control, equal_var=True)

print("=== Two-Sample t-test (equal variance) ===")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {t_pval:.4f}")
print(f"Degrees of freedom: {n_per_group * 2 - 2}")
if t_pval < alpha:
    print(f"REJECT H0 (p={t_pval:.4f} < {alpha})")
else:
    print(f"FAIL TO REJECT H0 (p={t_pval:.4f} >= {alpha})")

### 5.2 Welch's t-test (does NOT assume equal variance)

This is the **recommended default** in practice because it is robust to unequal variances.

In [None]:
welch_stat, welch_pval = stats.ttest_ind(treatment, control, equal_var=False)

# Calculate Welch-Satterthwaite degrees of freedom
s1, s2 = control.std(ddof=1), treatment.std(ddof=1)
n1, n2 = len(control), len(treatment)
welch_df = ((s1**2/n1 + s2**2/n2)**2) / ((s1**2/n1)**2/(n1-1) + (s2**2/n2)**2/(n2-1))

print("=== Welch's t-test (unequal variance) ===")
print(f"t-statistic: {welch_stat:.4f}")
print(f"p-value: {welch_pval:.4f}")
print(f"Welch-Satterthwaite df: {welch_df:.1f}")
if welch_pval < alpha:
    print(f"REJECT H0 (p={welch_pval:.4f} < {alpha})")
else:
    print(f"FAIL TO REJECT H0 (p={welch_pval:.4f} >= {alpha})")

# Confidence interval for the difference in means
mean_diff = treatment.mean() - control.mean()
se_diff = np.sqrt(s1**2/n1 + s2**2/n2)
t_crit = stats.t.ppf(1 - alpha/2, welch_df)
ci = (mean_diff - t_crit * se_diff, mean_diff + t_crit * se_diff)

print(f"\nMean difference: {mean_diff:.3f} minutes")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}) minutes")
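A percentile bootstrap gives a distribution-free cross-check on the Welch interval. The sketch below is an illustration under stated assumptions: `control_demo` and `treatment_demo` are stand-in gamma samples with made-up parameters rather than the notebook's arrays, and the 5,000 resamples are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the notebook's `control` and `treatment` arrays
control_demo = rng.gamma(2.8, 1.85, size=600)
treatment_demo = rng.gamma(3.2, 1.8, size=600)

observed_diff = treatment_demo.mean() - control_demo.mean()

n_boot = 5000
# Resample each group with replacement, independently, n_boot times
idx_c = rng.integers(0, control_demo.size, size=(n_boot, control_demo.size))
idx_t = rng.integers(0, treatment_demo.size, size=(n_boot, treatment_demo.size))
boot_diffs = treatment_demo[idx_t].mean(axis=1) - control_demo[idx_c].mean(axis=1)

# 95% percentile interval for the difference in means
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Observed difference: {observed_diff:.3f}")
print(f"Bootstrap 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

With several hundred observations per group and moderate skew, the bootstrap and Welch intervals should agree closely; a large gap would suggest heavy tails or outliers.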

### 5.3 Mann-Whitney U Test (non-parametric alternative)

Use when data is **non-normal or highly skewed**. It compares the **ranks** of observations rather than the raw values.

In [None]:
u_stat, u_pval = stats.mannwhitneyu(treatment, control, alternative='two-sided')

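# Calculate rank-biserial correlation (effect size).
# SciPy >= 1.7 returns U for the first sample (treatment), so this is
# positive when treatment values tend to exceed control values.
rank_biserial = (2 * u_stat) / (n1 * n2) - 1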

print("=== Mann-Whitney U Test ===")
print(f"U-statistic: {u_stat:.1f}")
print(f"p-value: {u_pval:.4f}")
print(f"Rank-biserial correlation (effect size): {rank_biserial:.4f}")
if u_pval < alpha:
    print(f"REJECT H0 (p={u_pval:.4f} < {alpha})")
else:
    print(f"FAIL TO REJECT H0 (p={u_pval:.4f} >= {alpha})")
print(f"\nNote: Mann-Whitney tests if one distribution is stochastically greater than the other.")
print(f"It does NOT directly test means - it tests the probability that a random")
print(f"observation from treatment is greater than a random observation from control.")
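The "probability that a random treatment observation exceeds a random control observation" can be read directly off the U statistic: $P(X > Y) = U_X / (n_X n_Y)$, and the rank-biserial correlation is $2P - 1$. The sketch below checks this against a brute-force pairwise count. It computes $U_X$ from rank sums so that it does not depend on which U a given SciPy version returns; the data are made-up gamma samples, not the notebook's arrays.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(3.2, 1.8, size=200)   # stand-in for treatment
y = rng.gamma(2.8, 1.85, size=200)  # stand-in for control
n_x, n_y = len(x), len(y)

# U for sample x, derived from its rank sum in the pooled data
ranks = stats.rankdata(np.concatenate([x, y]))
u_x = ranks[:n_x].sum() - n_x * (n_x + 1) / 2

# Brute-force count of (x, y) pairs where x wins; ties count one half
wins = (x[:, None] > y[None, :]).sum() + 0.5 * (x[:, None] == y[None, :]).sum()

prob_superiority = u_x / (n_x * n_y)       # estimate of P(X > Y)
rank_biserial_demo = 2 * prob_superiority - 1
print(f"U_x = {u_x:.1f}, pairwise wins = {wins:.1f}")
print(f"P(X > Y) = {prob_superiority:.3f}, rank-biserial = {rank_biserial_demo:.3f}")
```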

### 5.4 Comparison of All Tests

In [None]:
results = pd.DataFrame({
    'Test': ['Two-sample t-test', "Welch's t-test", 'Mann-Whitney U'],
    'Statistic': [t_stat, welch_stat, u_stat],
    'p-value': [t_pval, welch_pval, u_pval],
    'Significant': [t_pval < alpha, welch_pval < alpha, u_pval < alpha],
    'Assumes Normality': ['Yes', 'Yes', 'No'],
    'Assumes Equal Var': ['Yes', 'No', 'No']
})
print("=== Test Comparison ===")
print(results.to_string(index=False))

## 6. Visualization

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot comparison
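bp = axes[0].boxplot([control, treatment],
                     patch_artist=True, showmeans=True,
                     meanprops={'marker': 'D', 'markerfacecolor': 'red', 'markersize': 8})
# Set tick labels separately: boxplot's `labels` keyword is deprecated in Matplotlib 3.9+
axes[0].set_xticklabels(['Control', 'Treatment'])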
bp['boxes'][0].set_facecolor('#3498db')
bp['boxes'][1].set_facecolor('#e74c3c')
for box in bp['boxes']:
    box.set_alpha(0.6)
axes[0].set_ylabel('Session Duration (minutes)')
axes[0].set_title('Session Duration by Group (diamond = mean)')

# CI for difference in means
axes[1].errorbar(0, mean_diff, yerr=[[mean_diff - ci[0]], [ci[1] - mean_diff]],
                 fmt='o', color='#2ecc71', markersize=10, capsize=10, linewidth=2)
axes[1].axhline(y=0, color='red', linestyle='--', alpha=0.7, label='No effect')
axes[1].axhline(y=mde, color='blue', linestyle=':', alpha=0.5, label=f'MDE = {mde} min')
axes[1].set_xlim(-0.5, 0.5)
axes[1].set_ylabel('Difference in Mean Session Duration (min)')
axes[1].set_title("95% CI for Difference (Welch's t-test)")
axes[1].set_xticks([])
axes[1].legend()

plt.tight_layout()
plt.show()

---

## Interview Follow-Up Questions & Answers

### Q1: Why is Welch's t-test preferred over the standard t-test in practice?

**Answer:**

Welch's t-test is preferred because:

1. **It does not assume equal variances** between groups, making it more robust.
2. When variances ARE equal, Welch's t-test gives results nearly identical to the standard t-test (minimal power loss).
3. When variances are NOT equal, the standard t-test can have inflated Type I error rates.

The asymmetry in risk makes Welch's the safer default: you lose very little when variances are equal, but you avoid significant errors when they're not.
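A small simulation makes the asymmetry concrete. The setup below is an assumption chosen to stress the standard t-test, not the case study's design: both groups share the same mean (so $H_0$ is true), but the smaller group has the larger variance. Under these conditions the pooled test should reject far more often than 5%, while Welch's test should stay near the nominal rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims = 5000

# Same mean in both groups (H0 is true); the smaller group is the noisier one
small_noisy = rng.normal(0.0, 5.0, size=(n_sims, 30))
large_quiet = rng.normal(0.0, 1.0, size=(n_sims, 150))

# One test per simulated experiment (row)
p_student = stats.ttest_ind(small_noisy, large_quiet, axis=1, equal_var=True).pvalue
p_welch = stats.ttest_ind(small_noisy, large_quiet, axis=1, equal_var=False).pvalue

student_rate = np.mean(p_student < 0.05)
welch_rate = np.mean(p_welch < 0.05)
print(f"Type I error, standard t-test: {student_rate:.3f}")
print(f"Type I error, Welch's t-test:  {welch_rate:.3f}")
```

With equal group sizes, as in this case study, the standard t-test is much less sensitive to unequal variances, which is why the two tests usually agree in balanced A/B tests.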

### Q2: The data is right-skewed (not normal). Can you still use the t-test?

**Answer:**

Yes, thanks to the **Central Limit Theorem (CLT)**. The t-test compares **sample means**, and the CLT guarantees that the distribution of sample means approaches normality as sample size increases, for **any underlying distribution with finite variance**. How large a sample is needed depends on how skewed or heavy-tailed the data are.

Rules of thumb:
- **n > 30**: Usually sufficient for mild skew
- **n > 100**: Adequate for moderate skew
- **Very heavy tails / extreme outliers**: Consider Mann-Whitney U or bootstrapping even with large n

In this case study with n=600 per group, the CLT provides strong protection, so the t-test is valid despite the skewed data.
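For gamma data this can be quantified exactly: the mean of $n$ independent $\text{Gamma}(k, \theta)$ draws is itself $\text{Gamma}(nk, \theta/n)$, so its skewness is $2/\sqrt{nk}$, which is the raw skewness divided by $\sqrt{n}$. A minimal sketch, assuming the control-group parameters from the simulation (mean 5.2, std 3.1):

```python
import numpy as np
from scipy import stats

k = (5.2 / 3.1) ** 2     # gamma shape for the control group
theta = 3.1 ** 2 / 5.2   # gamma scale
n = 600

# Skewness of a single session duration
skew_raw = float(stats.gamma.stats(k, scale=theta, moments='s'))

# The mean of n i.i.d. Gamma(k, theta) draws is Gamma(n*k, theta/n)
skew_mean = float(stats.gamma.stats(n * k, scale=theta / n, moments='s'))

print(f"Skewness of raw data:    {skew_raw:.3f}")
print(f"Skewness of sample mean: {skew_mean:.3f}")
```

The sample mean is nearly symmetric at this sample size, which is what justifies the normal-theory t-test here.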

### Q3: When would you choose Mann-Whitney U over the t-test?

**Answer:**

Use Mann-Whitney U when:

1. **Ordinal data** (e.g., satisfaction ratings 1-5) where means are not meaningful
2. **Small samples with non-normal data** where the CLT approximation is unreliable
3. **Heavy outliers** that would distort the mean
4. You care about the **overall distribution shift** rather than just the mean

**Important caveat:** Mann-Whitney tests whether one group tends to have larger values than the other. It does NOT test means directly. If you specifically need to compare means, the t-test (with sufficient sample size) is more appropriate.

### Q4: How would you handle heavy outliers in session duration data?

**Answer:**

Several strategies:

1. **Winsorization**: Cap extreme values at a percentile (e.g., 95th or 99th). This preserves the data point but limits its influence.

2. **Trimmed mean**: Remove the top and bottom X% of observations before computing the mean.

3. **Log transformation**: Apply $\log(x + 1)$ to compress the right tail, making the distribution more symmetric.

4. **Use robust test**: Mann-Whitney U or bootstrap-based tests are naturally less affected by outliers.

5. **Investigate the outliers**: Are they real users or bots? Users who left the tab open? Understanding the cause informs the treatment.

The key principle: **decide on outlier handling BEFORE looking at results** to avoid biasing the analysis.
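The first three strategies can be sketched in a few lines. Everything below is illustrative: the data are simulated, the five extreme values are hypothetical "tab left open" sessions, and the 99th-percentile cap and 5% trim are arbitrary thresholds that would need to be fixed in the analysis plan beforehand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
durations = rng.gamma(2.8, 1.85, size=1000)
durations[:5] = [180, 240, 300, 420, 600]  # hypothetical "tab left open" sessions

# 1. Winsorize: cap values above the 99th percentile (keeps every row)
cap = np.percentile(durations, 99)
winsorized = np.minimum(durations, cap)

# 2. Trimmed mean: drop the lowest and highest 5% before averaging
trimmed_mean = stats.trim_mean(durations, proportiontocut=0.05)

# 3. Log transform: compress the right tail
logged = np.log1p(durations)

print(f"Raw mean:        {durations.mean():.2f}")
print(f"Winsorized mean: {winsorized.mean():.2f}")
print(f"Trimmed mean:    {trimmed_mean:.2f}")
print(f"Skewness raw vs log: {stats.skew(durations):.2f} vs {stats.skew(logged):.2f}")
```

Note that a test on log-transformed data compares means of $\log(x + 1)$, which is closer to comparing geometric means than arithmetic means.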

### Q5: Your test shows the new onboarding increases session duration by 0.5 min. Is this a good metric?

**Answer:**

Session duration as a primary metric has limitations:

**Pros:**
- Easy to measure and understand
- Correlates with engagement

**Cons:**
- **Not always directional**: Longer sessions could mean confusion, not engagement
- **Can be gamed**: Changes that slow users down (extra steps, a confusing UI) raise time on page without adding value
- **Doesn't capture value**: 5 minutes of frustrated searching ≠ 5 minutes of productive use

**Better approach:** Use session duration as a **guardrail metric** and pair it with **action-based metrics** like:
- Feature adoption rate
- Task completion rate
- 7-day retention rate
- Number of key actions completed per session

### Q6: How do you calculate and interpret Cohen's d?

**Answer:**

Cohen's d measures the **standardized effect size** - the difference in means expressed in units of standard deviation:

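$$d = \frac{\bar{X}_{treatment} - \bar{X}_{control}}{s_{pooled}}$$

where $s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$, with subscript 1 denoting the control group and subscript 2 the treatment group.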

**Interpretation guidelines (Cohen's conventions):**

| d | Interpretation |
|---|---|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |

**Why it matters:** For any nonzero effect, p-values shrink as sample size grows, whereas Cohen's d estimates the same population quantity at any sample size. A tiny effect can be "significant" with enormous samples. Cohen's d helps assess **practical importance** independently of sample size.
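The pooled-standard-deviation formula translates directly into code. The helper name `cohens_d_from_samples` is hypothetical (chosen so it does not shadow the notebook's `cohens_d` variable), and the two short arrays are toy data for checking the arithmetic by hand.

```python
import numpy as np

def cohens_d_from_samples(treatment, control):
    """Cohen's d using the pooled sample standard deviation (ddof=1)."""
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * treatment.var(ddof=1) + (n_c - 1) * control.var(ddof=1)
    ) / (n_t + n_c - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

# Toy data: means differ by 1, both sample variances equal 2.5
d_example = cohens_d_from_samples([2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
print(f"Cohen's d: {d_example:.4f}")
```

Here $d = 1 / \sqrt{2.5}$, which falls between the conventional "medium" and "large" thresholds.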