# Two-Sample Hypothesis Testing and ANOVA

In this notebook, we extend hypothesis testing to comparisons between groups.

We cover:
- two-sample t-tests
- variance assumptions and Welch correction
- one-way ANOVA
- post-hoc testing

These methods answer the question:
**Do different groups share the same population mean?**


### 🟦 Imports & Data

In [None]:
import sys
from pathlib import Path

# Get project root: notebooks/ → project/
PROJECT_ROOT = Path().resolve().parent
sys.path.insert(0, str(PROJECT_ROOT))

print("Added to path:", PROJECT_ROOT)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

from src.data_generation import generate_student_dataset

sns.set(style="whitegrid")

df = generate_student_dataset(n=4000, random_state=42)


## 🟩 Part I: Two-Sample Testing

### 🟦 Research Question

Do **male and female students** have the same average exam score?

### 🟦 Hypotheses (Two-Sample t-test)


Let:
- $ \mu_M $ = mean score of male students  
- $ \mu_F $ = mean score of female students  

$$
H_0: \mu_M = \mu_F
$$

$$
H_1: \mu_M \neq \mu_F
$$

This is a **two-sided independent samples test**.


### 🟦 Assumptions of the Two-Sample t-Test

1. Independence of observations
2. Approximate normality within each group
3. Equality of variances (only for the classical t-test)

If assumption (3) fails, we use **Welch’s t-test**.

### 🟦 Group Separation

In [None]:
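# Split exam scores by gender into two independent samples
group_m = df.loc[df["gender"] == "M", "score"]
group_f = df.loc[df["gender"] == "F", "score"]

# Compare the two sample means
group_m.mean(), group_f.mean()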

### 🟦 Variance Equality Test (Levene)

In [None]:
# Levene's test of equal variances (SciPy centers on the group medians by default)
levene_stat, levene_p = stats.levene(group_m, group_f)
levene_stat, levene_p


### 🟦 Variance Test Interpretation

#### Variance Equality

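- If $ p > 0.05 $: no evidence against equal variances (this does not prove they are equal)
- If $ p \le 0.05 $: evidence that the variances differ

Whatever the outcome, we proceed with **Welch’s t-test**, which remains valid when the variances are unequal and loses little power when they are equal.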

### 🟦 Mathematical Formulation (Welch t-test)

The test statistic is:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}
{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

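Here $ \bar{x}_i $, $ s_i^2 $, and $ n_i $ are the sample mean, sample variance, and size of group $ i $.
The degrees of freedom are approximated by the Welch–Satterthwaite equation.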
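
As a rough illustration, the statistic and its approximate degrees of freedom can be computed by hand and compared with SciPy. The Welch–Satterthwaite approximation used here is

$$
\nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}
{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}
$$

The sketch below assumes synthetic normal data rather than the notebook's dataset. The helper `welch_t` and the arrays `x1`, `x2` are hypothetical names introduced only for this example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (assumption: not the notebook's student data)
rng = np.random.default_rng(0)
x1 = rng.normal(loc=70.0, scale=10.0, size=200)
x2 = rng.normal(loc=72.0, scale=14.0, size=150)


def welch_t(a, b):
    """Return Welch's t statistic, Welch-Satterthwaite df, and two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Estimated variance of each sample mean: s^2 / n (unbiased s^2, ddof=1)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution with (non-integer) dof
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p


t_manual, dof_manual, p_manual = welch_t(x1, x2)
t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=False)

print("manual:", t_manual, dof_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The manual and SciPy values should agree up to floating-point error. The degrees of freedom are generally not an integer and lie between $ \min(n_1, n_2) - 1 $ and $ n_1 + n_2 - 2 $.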

### 🟦 Welch’s t-Test (SciPy)

In [None]:
t_stat, p_value = stats.ttest_ind(
    group_m,
    group_f,
    equal_var=False
)

t_stat, p_value


### 🟦 Conclusion (Two-Sample Test)

- The p-value falls below the significance level ($ \alpha = 0.05 $), so we reject $ H_0 $
- Mean scores differ between male and female students
- Practical significance will be assessed using effect size

At this stage, we conclude **a difference exists**, not how large it is.

## 🟩 Part II: One-Way ANOVA

### 🟦 ANOVA Research Question

Does the **teaching method** influence exam scores?

### 🟦 ANOVA Hypotheses

Let:
- $ \mu_A, \mu_B, \mu_C $ be mean scores for methods A, B, and C

$$
H_0: \mu_A = \mu_B = \mu_C
$$

$$
H_1: \text{At least one group mean differs}
$$

### 🟦 Why ANOVA? Why Not Multiple t-Tests?

Running multiple pairwise t-tests inflates the familywise **Type I error rate**:
each test carries its own chance of a false positive, and those chances accumulate.

ANOVA avoids this by testing all group means in a single test
at the chosen significance level.
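
To get a feel for the size of this inflation, the sketch below computes the familywise error rate for all pairwise comparisons among $ k $ groups. It assumes the tests are independent, which is only an approximation because pairwise t-tests share data, so the numbers are indicative rather than exact. The function `familywise_error_rate` is a hypothetical helper written for this example.

```python
from math import comb


def familywise_error_rate(k_groups, alpha=0.05):
    """Return (number of pairwise tests, approximate familywise error rate).

    Assumes the pairwise tests are independent, which is a simplification.
    """
    m = comb(k_groups, 2)  # number of distinct pairs among k groups
    return m, 1 - (1 - alpha) ** m


for k in (3, 5, 10):
    m, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {m} pairwise tests, familywise error rate ~ {fwer:.3f}")
```

With three teaching methods there are three pairwise tests, and under this approximation the chance of at least one false positive rises from 0.05 to roughly 0.14.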

### 🟦 Mathematical Formulation (ANOVA)

The F-statistic is defined as:

$$
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
$$

Where $ k $ is the number of groups and $ N $ is the total number of observations:
$$
MS_{\text{between}} = \frac{SS_{\text{between}}}{k - 1}
\quad
MS_{\text{within}} = \frac{SS_{\text{within}}}{N - k}
$$

Under $ H_0 $:

$$
F \sim F_{k-1, N-k}
$$
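
The formulas above can be checked numerically. The sketch below assumes the standard definitions of the sums of squares, which the text does not spell out: $ SS_{\text{between}} = \sum_j n_j (\bar{x}_j - \bar{x})^2 $ and $ SS_{\text{within}} = \sum_j \sum_i (x_{ij} - \bar{x}_j)^2 $. It uses synthetic data, not the notebook's dataset, and `one_way_anova` is a hypothetical helper for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in groups (assumption: not the notebook's student data)
rng = np.random.default_rng(1)
groups = [
    rng.normal(70.0, 10.0, size=50),
    rng.normal(73.0, 10.0, size=60),
    rng.normal(75.0, 10.0, size=40),
]


def one_way_anova(samples):
    """Return the one-way ANOVA F statistic and p-value computed from sums of squares."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    k = len(samples)                        # number of groups
    n_total = sum(s.size for s in samples)  # total observations N
    grand_mean = np.concatenate(samples).mean()

    # Between-group variation: group means around the grand mean, weighted by group size
    ss_between = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
    # Within-group variation: observations around their own group mean
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f_stat = ms_between / ms_within
    # Upper-tail probability of F with (k - 1, N - k) degrees of freedom
    p = stats.f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p


f_manual, p_manual = one_way_anova(groups)
f_scipy, p_scipy = stats.f_oneway(*groups)

print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

Both routes should give the same F statistic and p-value up to floating-point error.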

### 🟦 Visual Inspection

In [None]:
plt.figure(figsize=(7, 4))
sns.boxplot(x="teaching_method", y="score", data=df)
plt.title("Exam Scores by Teaching Method")
plt.show()


### 🟦 One-Way ANOVA (SciPy)

In [None]:
stats.f_oneway(
    df[df.teaching_method == "A"]["score"],
    df[df.teaching_method == "B"]["score"],
    df[df.teaching_method == "C"]["score"]
)


### 🟦 ANOVA via Statsmodels (Recommended)

In [None]:
anova_model = ols("score ~ C(teaching_method)", data=df).fit()
anova_table = sm.stats.anova_lm(anova_model, typ=2)
anova_table


### 🟦 ANOVA Interpretation

- The F-statistic is large
- The p-value is extremely small

We reject the null hypothesis:
**Teaching method has a statistically significant effect on exam scores.**

### 🟦 Post-hoc Testing Motivation

#### Why Post-hoc Tests?

ANOVA tells us **that a difference exists**, but not **where**.

Post-hoc tests identify which pairs of groups differ
while controlling for multiple comparisons.

### 🟦 Tukey HSD Post-hoc Test

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(
    endog=df["score"],
    groups=df["teaching_method"],
    alpha=0.05
)

print(tukey)


### 🟦 Post-hoc Interpretation

- Some teaching methods differ significantly
- Others may not
- The magnitude of differences will be quantified using effect size

This completes the group comparison analysis.

## Summary

In this notebook, we demonstrated:

- Two-sample hypothesis testing with variance diagnostics
- Welch’s t-test for robustness
- One-way ANOVA with mathematical foundation
- Post-hoc testing for detailed group comparisons

These methods generalize hypothesis testing beyond single populations.
