# Hypothesis Testing

Hypothesis testing is a statistical framework for making decisions from data under uncertainty. We define a **null hypothesis** (H₀) representing the status quo, and an **alternative hypothesis** (H₁) representing the claim we want to support. A test statistic is computed from sample data and compared against a critical value or converted to a p-value.

**Key concepts:**
- **p-value**: probability of observing data at least as extreme as the sample, assuming H₀ is true
- **Significance level (α)**: threshold chosen before the test; we reject H₀ when the p-value falls below it (commonly α = 0.05)
- **Type I error**: rejecting a true H₀ (false positive, probability = α)
- **Type II error**: failing to reject a false H₀ (false negative, probability = β)
- **Power**: probability of correctly rejecting a false H₀ (= 1 − β)

This notebook works through the most common parametric tests, the chi-squared test, and multiple-testing correction; the summary table at the end also lists their non-parametric alternatives.
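
To make the error-rate definitions above concrete, the sketch below estimates the Type I error rate and the power of a one-sample t-test by simulation. The helper `rejection_rate`, the effect size of 0.5 standard deviations, and the simulation settings are illustrative assumptions rather than part of the analysis that follows.

```python
import numpy as np
from scipy import stats


def rejection_rate(true_mean, n=30, alpha=0.05, n_sim=5000, seed=0):
    """Fraction of simulated experiments in which H0: mu = 0 is rejected."""
    sim_rng = np.random.default_rng(seed)
    # Each row is one simulated experiment with n observations
    samples = sim_rng.normal(loc=true_mean, scale=1.0, size=(n_sim, n))
    pvals = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
    return float(np.mean(pvals < alpha))


# H0 is true: every rejection is a Type I error, so the rate should be near alpha
type_i_rate = rejection_rate(true_mean=0.0)

# H0 is false: the rejection rate estimates power (1 - beta)
power = rejection_rate(true_mean=0.5)

print(f'Estimated Type I error rate      : {type_i_rate:.3f}')
print(f'Estimated power (0.5 SD, n = 30) : {power:.3f}')
```

Raising `n` or the effect size in `rejection_rate` increases the estimated power, while the Type I error rate stays close to α.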

In [None]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from itertools import combinations

# Reproducibility
rng = np.random.default_rng(seed=42)

# Plot aesthetics
plt.rcParams.update({
    'figure.dpi': 100,
    'axes.spines.top': False,
    'axes.spines.right': False,
    'font.size': 11,
})

print('NumPy:', np.__version__)
print('SciPy:', __import__('scipy').__version__)
print('Matplotlib:', __import__('matplotlib').__version__)

## 1. One-Sample t-Test

The **one-sample t-test** asks: is the mean of a population equal to a hypothesised value μ₀?

$$H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0$$

The test statistic is:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size. Under H₀ this follows a t-distribution with $n - 1$ degrees of freedom.

**Assumptions:** observations are independent and (approximately) normally distributed. By the central limit theorem, the test is robust to mild departures from normality once $n$ is moderately large (a common rule of thumb is $n \geq 30$).
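
As a quick check of the formula above, this sketch computes the t-statistic and two-sided p-value by hand and compares them with `stats.ttest_1samp`. The eight measurements in `x_demo` are made-up values for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (mm); not data from the bolt scenario
x_demo = np.array([10.4, 9.8, 10.9, 10.1, 10.6, 9.9, 10.7, 10.3])
mu0 = 10.0
n_demo = x_demo.size

# t = (x_bar - mu0) / (s / sqrt(n)), with s the sample standard deviation (ddof=1)
t_manual = (x_demo.mean() - mu0) / (x_demo.std(ddof=1) / np.sqrt(n_demo))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_demo - 1)

result = stats.ttest_1samp(x_demo, popmean=mu0)
print(f'manual: t = {t_manual:.4f}, p = {p_manual:.4f}')
print(f'scipy : t = {result.statistic:.4f}, p = {result.pvalue:.4f}')
```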

In [None]:
# Scenario: a factory claims its bolts have mean diameter 10 mm.
# We sample 40 bolts and want to test that claim.

mu_claimed = 10.0          # hypothesised population mean (mm)
n = 40
true_mean = 10.3           # the factory is actually producing slightly larger bolts
true_std  = 0.8

sample = rng.normal(loc=true_mean, scale=true_std, size=n)

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_claimed)

print(f'Sample mean : {sample.mean():.4f} mm')
print(f'Sample std  : {sample.std(ddof=1):.4f} mm')
print(f't-statistic : {t_stat:.4f}')
print(f'p-value     : {p_value:.4f}')

alpha = 0.05
if p_value < alpha:
    print(f'\nDecision: Reject H₀ (p={p_value:.4f} < α={alpha})')
    print(f'The sample mean differs significantly from {mu_claimed} mm.')
else:
    print(f'\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ α={alpha})')
    print(f'No significant evidence that the mean differs from {mu_claimed} mm.')

# Visualise: sample distribution vs hypothesised mean
fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(sample, bins=10, color='steelblue', edgecolor='white', alpha=0.8, label='Sample')
ax.axvline(sample.mean(), color='steelblue', lw=2, ls='--', label=f'Sample mean ({sample.mean():.2f})')
ax.axvline(mu_claimed,    color='tomato',   lw=2, ls='--', label=f'Claimed mean ({mu_claimed})')
ax.set_xlabel('Bolt diameter (mm)')
ax.set_ylabel('Count')
ax.set_title(f'One-sample t-test  |  t={t_stat:.3f}, p={p_value:.4f}')
ax.legend()
plt.tight_layout()
plt.show()

## 2. Two-Sample t-Test (Independent Groups)

The **independent two-sample t-test** compares the means of two separate groups:

$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$

SciPy's `ttest_ind` assumes equal variances by default (`equal_var=True`, Student's t-test). Passing `equal_var=False` selects **Welch's t-test**, which does **not** assume equal variances and is generally preferred over Student's t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

**When to use:** comparing outcomes between two independent groups (e.g., A/B test, treatment vs control).
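
The sketch below reproduces Welch's statistic by hand and compares it with `stats.ttest_ind`. The degrees of freedom use the Welch-Satterthwaite approximation, which is not written out above; the simulated groups and the `demo_rng` generator are illustrative assumptions.

```python
import numpy as np
from scipy import stats

demo_rng = np.random.default_rng(0)
a = demo_rng.normal(loc=72, scale=10, size=35)
b = demo_rng.normal(loc=77, scale=12, size=40)

# Per-group variance of the mean: s_i^2 / n_i
v_a = a.var(ddof=1) / a.size
v_b = b.var(ddof=1) / b.size

t_manual = (a.mean() - b.mean()) / np.sqrt(v_a + v_b)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (v_a + v_b) ** 2 / (v_a ** 2 / (a.size - 1) + v_b ** 2 / (b.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df=df_welch)

welch = stats.ttest_ind(a, b, equal_var=False)
student = stats.ttest_ind(a, b)  # default equal_var=True (Student's t-test)

print(f'manual Welch : t = {t_manual:.4f}, p = {p_manual:.4f}, df = {df_welch:.1f}')
print(f'scipy Welch  : t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}')
print(f'scipy Student: t = {student.statistic:.4f}, p = {student.pvalue:.4f}')
```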

In [None]:
# Scenario: compare exam scores between two teaching methods.
n1, n2 = 35, 40
group_A = rng.normal(loc=72, scale=10, size=n1)   # Method A
group_B = rng.normal(loc=77, scale=12, size=n2)   # Method B (slightly better)

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_A, group_B, equal_var=False)

print('Group A — mean: {:.2f}, std: {:.2f}, n={}'.format(group_A.mean(), group_A.std(ddof=1), n1))
print('Group B — mean: {:.2f}, std: {:.2f}, n={}'.format(group_B.mean(), group_B.std(ddof=1), n2))
print(f'\nt-statistic : {t_stat:.4f}')
print(f'p-value     : {p_value:.4f}')

alpha = 0.05
conclusion = 'Reject H₀' if p_value < alpha else 'Fail to reject H₀'
print(f'Decision    : {conclusion} at α={alpha}')

# Visualise distributions
fig, ax = plt.subplots(figsize=(8, 4))
bins = np.linspace(40, 115, 25)
ax.hist(group_A, bins=bins, alpha=0.6, color='steelblue', edgecolor='white', label=f'Method A (mean={group_A.mean():.1f})')
ax.hist(group_B, bins=bins, alpha=0.6, color='salmon',    edgecolor='white', label=f'Method B (mean={group_B.mean():.1f})')
ax.axvline(group_A.mean(), color='steelblue', lw=2, ls='--')
ax.axvline(group_B.mean(), color='salmon',    lw=2, ls='--')
ax.set_xlabel('Exam Score')
ax.set_ylabel('Count')
ax.set_title(f'Two-sample t-test (Welch)  |  t={t_stat:.3f}, p={p_value:.4f}')
ax.legend()
plt.tight_layout()
plt.show()

## 3. Paired t-Test

The **paired t-test** is used when each observation in one group is matched with exactly one observation in the other group. Common examples: measuring the same subject before and after treatment, or comparing two sensors on the same specimen.

$$H_0: \mu_d = 0 \qquad H_1: \mu_d \neq 0$$

where $d_i = x_{i,\text{after}} - x_{i,\text{before}}$. This reduces to a one-sample t-test on the differences:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

**Advantage over two-sample t-test:** pairing removes between-subject variability, increasing statistical power.
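
To illustrate both points above, this sketch shows that `stats.ttest_rel` gives the same result as a one-sample t-test on the differences, and that ignoring the pairing loses power. The simulated `before` and `after` arrays are illustrative assumptions.

```python
import numpy as np
from scipy import stats

demo_rng = np.random.default_rng(1)
before = demo_rng.normal(loc=130, scale=15, size=25)
after = before + demo_rng.normal(loc=-8, scale=5, size=25)

paired = stats.ttest_rel(after, before)
one_sample = stats.ttest_1samp(after - before, popmean=0.0)

# Treating the same data as two independent groups ignores the pairing
unpaired = stats.ttest_ind(after, before, equal_var=False)

print(f'paired t-test          : t = {paired.statistic:.3f}, p = {paired.pvalue:.2e}')
print(f'one-sample on diffs    : t = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.2e}')
print(f'unpaired Welch (wrong) : t = {unpaired.statistic:.3f}, p = {unpaired.pvalue:.2e}')
```

The unpaired test has to contend with the large between-subject spread (standard deviation 15), whereas the paired test only sees the spread of the individual differences (standard deviation 5).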

In [None]:
# Scenario: blood pressure (mmHg) measured before and after a 12-week exercise programme.
n = 25
baseline = rng.normal(loc=130, scale=15, size=n)                 # baseline (before)
treatment_effect = rng.normal(loc=-8, scale=5, size=n)           # individual response
followup = baseline + treatment_effect                            # after

t_stat, p_value = stats.ttest_rel(followup, baseline)

differences = followup - baseline
print(f'Mean difference (after − before): {differences.mean():.2f} mmHg')
print(f'Std of differences              : {differences.std(ddof=1):.2f} mmHg')
print(f't-statistic : {t_stat:.4f}')
print(f'p-value     : {p_value:.6f}')

alpha = 0.05
if p_value < alpha:
    print(f'\nDecision: Reject H₀ — the exercise programme significantly changed blood pressure.')
else:
    print(f'\nDecision: Fail to reject H₀.')

# Plot differences
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Paired lines
ax = axes[0]
for i in range(n):
    color = 'steelblue' if followup[i] < baseline[i] else 'salmon'
    ax.plot([0, 1], [baseline[i], followup[i]], color=color, alpha=0.5, lw=1)
ax.plot([0, 1], [baseline.mean(), followup.mean()], 'k-o', lw=2.5, markersize=8, label='Group mean')
ax.set_xticks([0, 1])
ax.set_xticklabels(['Before', 'After'])
ax.set_ylabel('Blood Pressure (mmHg)')
ax.set_title('Individual trajectories')
ax.legend()

# Histogram of differences
ax = axes[1]
ax.hist(differences, bins=8, color='steelblue', edgecolor='white', alpha=0.8)
ax.axvline(0, color='tomato', lw=2, ls='--', label='H₀: mean diff = 0')
ax.axvline(differences.mean(), color='steelblue', lw=2, ls='--',
           label=f'Observed mean ({differences.mean():.1f})')
ax.set_xlabel('After − Before (mmHg)')
ax.set_ylabel('Count')
ax.set_title(f'Paired t-test  |  p={p_value:.4f}')
ax.legend()

plt.tight_layout()
plt.show()

## 4. Chi-Squared Test for Independence

The **χ² test for independence** determines whether two categorical variables are associated. Given an observed contingency table with counts $O_{ij}$ and expected counts $E_{ij}$ under independence:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \cdot C_j}{N}$$

Degrees of freedom: $(r - 1)(c - 1)$ where $r$ = rows, $c$ = columns.

**Assumptions:**
- Observations are independent
- Expected cell counts ≥ 5 (use Fisher's exact test when this fails)

**When to use:** testing association between categorical variables (e.g., survey responses vs demographic group).
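
The expected counts and the χ² statistic can be computed directly from the row and column totals. The sketch below does so for a small made-up 2 × 3 table (`table` is hypothetical data) and compares the result with `stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 contingency table
table = np.array([
    [20, 30, 50],
    [30, 30, 40],
])

row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)
total = table.sum()

# E_ij = R_i * C_j / N
expected_manual = np.outer(row_totals, col_totals) / total
chi2_manual = ((table - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (table.shape[0] - 1) * (table.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, df=dof_manual)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(table, correction=False)

print(f'manual: chi2 = {chi2_manual:.4f}, dof = {dof_manual}, p = {p_manual:.4f}')
print(f'scipy : chi2 = {chi2_scipy:.4f}, dof = {dof_scipy}, p = {p_scipy:.4f}')
print('All expected counts >= 5:', bool((expected_manual >= 5).all()))
```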

In [None]:
# Scenario: Is preference for a product (A/B/C) independent of age group (18-34 / 35-54 / 55+)?
observed = np.array([
    #  Prod A  Prod B  Prod C
    [   45,    30,    25],   # 18-34
    [   35,    50,    15],   # 35-54
    [   20,    20,    60],   # 55+
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print('Observed frequencies:')
print(observed)
print('\nExpected frequencies (under independence):')
print(np.round(expected, 1))
print(f'\nχ² statistic : {chi2:.4f}')
print(f'Degrees of freedom: {dof}')
print(f'p-value      : {p_value:.6f}')

alpha = 0.05
if p_value < alpha:
    print(f'\nDecision: Reject H₀ — product preference IS associated with age group.')
else:
    print(f'\nDecision: Fail to reject H₀ — no significant association detected.')

# Heatmap of Pearson (standardised) residuals: (O - E) / sqrt(E)
std_residuals = (observed - expected) / np.sqrt(expected)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

age_labels    = ['18-34', '35-54', '55+']
product_labels = ['Product A', 'Product B', 'Product C']

for ax, data, title, fmt in zip(
        axes,
        [observed, std_residuals],
        ['Observed counts', 'Standardised residuals  (|r| > 2 = notable)'],
        ['.0f', '.2f']):
    im = ax.imshow(data, cmap='RdYlGn' if 'residual' in title.lower() else 'Blues',
                   aspect='auto')
    ax.set_xticks(range(3)); ax.set_xticklabels(product_labels)
    ax.set_yticks(range(3)); ax.set_yticklabels(age_labels)
    for i in range(3):
        for j in range(3):
            ax.text(j, i, format(data[i, j], fmt), ha='center', va='center', fontsize=12)
    ax.set_title(title)
    plt.colorbar(im, ax=ax, shrink=0.8)

plt.suptitle(f'Chi-squared test  |  χ²={chi2:.2f}, df={dof}, p={p_value:.4f}', y=1.02)
plt.tight_layout()
plt.show()

## 5. One-Way ANOVA

**Analysis of Variance (ANOVA)** tests whether the means of three or more independent groups are all equal:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \qquad H_1: \text{at least one } \mu_i \neq \mu_j$$

ANOVA partitions the total variability into two components, and the F-statistic is the ratio of their mean squares:
- **Between-group variance** (signal): variation explained by group membership
- **Within-group variance** (noise): variation within each group

$$F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}$$

**Assumptions:** independence, normality within groups, and homogeneity of variances (which can be checked with Levene's test, `scipy.stats.levene`).

A significant ANOVA result only tells us *some* means differ — post-hoc tests (Tukey HSD, Bonferroni) identify *which* pairs.
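
The F-statistic can be assembled from the between-group and within-group sums of squares. This sketch does so for three simulated groups (`demo_groups` is illustrative data), compares the result with `stats.f_oneway`, and runs Levene's test for the equal-variance assumption.

```python
import numpy as np
from scipy import stats

demo_rng = np.random.default_rng(2)
demo_groups = [demo_rng.normal(loc=m, scale=8, size=20) for m in (50, 55, 62)]

k = len(demo_groups)
n_total = sum(g.size for g in demo_groups)
grand_mean = np.concatenate(demo_groups).mean()

# Between-group sum of squares: group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in demo_groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in demo_groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*demo_groups)
levene_stat, levene_p = stats.levene(*demo_groups)

print(f'manual: F = {f_manual:.4f}, p = {p_manual:.6f}')
print(f'scipy : F = {f_scipy:.4f}, p = {p_scipy:.6f}')
print(f'Levene: W = {levene_stat:.4f}, p = {levene_p:.4f}')
```

A small Levene p-value would suggest unequal variances, in which case the ANOVA result should be interpreted with caution.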

In [None]:
# Scenario: compare crop yield (kg/plot) under three fertiliser treatments.
n_per_group = 20
control  = rng.normal(loc=50, scale=8, size=n_per_group)
fert_A   = rng.normal(loc=55, scale=8, size=n_per_group)
fert_B   = rng.normal(loc=62, scale=9, size=n_per_group)

f_stat, p_value = stats.f_oneway(control, fert_A, fert_B)

print(f'Control  — mean: {control.mean():.2f}, std: {control.std(ddof=1):.2f}')
print(f'Fert A   — mean: {fert_A.mean():.2f}, std: {fert_A.std(ddof=1):.2f}')
print(f'Fert B   — mean: {fert_B.mean():.2f}, std: {fert_B.std(ddof=1):.2f}')
print(f'\nF-statistic : {f_stat:.4f}')
print(f'p-value     : {p_value:.6f}')

alpha = 0.05
if p_value < alpha:
    print(f'\nDecision: Reject H₀ — at least one group mean differs significantly.')
else:
    print(f'\nDecision: Fail to reject H₀.')

# Post-hoc pairwise t-tests with Bonferroni correction
groups = {'Control': control, 'Fert A': fert_A, 'Fert B': fert_B}
names = list(groups.keys())
pairs = list(combinations(names, 2))
print('\nPost-hoc pairwise Welch t-tests (Bonferroni corrected):')
for g1, g2 in pairs:
    t, p = stats.ttest_ind(groups[g1], groups[g2], equal_var=False)
    p_adj = min(p * len(pairs), 1.0)   # Bonferroni adjustment
    sig = '***' if p_adj < 0.001 else ('**' if p_adj < 0.01 else ('*' if p_adj < 0.05 else 'ns'))
    print(f'  {g1} vs {g2}: p_adj={p_adj:.4f} {sig}')

# Boxplot
fig, ax = plt.subplots(figsize=(7, 5))
data_to_plot = [control, fert_A, fert_B]
bp = ax.boxplot(data_to_plot, patch_artist=True, widths=0.5,
                medianprops={'color': 'black', 'lw': 2})
colors = ['#6baed6', '#74c476', '#fd8d3c']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.75)
ax.set_xticklabels(['Control', 'Fertiliser A', 'Fertiliser B'])
ax.set_ylabel('Yield (kg / plot)')
ax.set_title(f'One-way ANOVA  |  F={f_stat:.3f}, p={p_value:.4f}')

# Strip plot overlay
for i, arr in enumerate(data_to_plot, start=1):
    jitter = rng.uniform(-0.1, 0.1, size=len(arr))
    ax.scatter(i + jitter, arr, color='black', alpha=0.3, s=15, zorder=3)

plt.tight_layout()
plt.show()

## 6. Multiple Testing Correction

When we perform $m$ hypothesis tests simultaneously, the probability of at least one false positive increases rapidly. If the tests are independent and each has Type I error rate α, the **family-wise error rate (FWER)** is:

$$\text{FWER} = 1 - (1 - \alpha)^m$$

For $m = 20$ tests at α = 0.05: FWER ≈ 64%!

**Bonferroni correction** is the simplest method: adjust the threshold to $\alpha^* = \alpha / m$, or equivalently multiply each p-value by $m$ (capping at 1).
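
The arithmetic behind these numbers is short enough to check directly. The helper `fwer` below is a hypothetical name, and the calculation assumes independent tests, as the formula does.

```python
def fwer(m, alpha=0.05):
    """Family-wise error rate for m independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 20, 100):
    uncorrected = fwer(m)                     # every test at alpha = 0.05
    bonferroni = fwer(m, alpha=0.05 / m)      # every test at alpha / m
    print(f'm = {m:>3}: uncorrected FWER = {uncorrected:.3f}, '
          f'Bonferroni FWER = {bonferroni:.3f}')
```

With the Bonferroni threshold the family-wise error rate stays just below 0.05 regardless of `m`.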

**Benjamini-Hochberg (BH)** controls the **False Discovery Rate (FDR)** instead — less conservative and more powerful when many tests are conducted (e.g., genomics, neuroimaging).

| Method | Controls | Conservative? | Use when |
|--------|----------|---------------|---------|
| Bonferroni | FWER | Very | Few tests, confirmatory study |
| Holm-Bonferroni | FWER | Moderately | Few tests; step-down procedure that is never less powerful than Bonferroni |
| Benjamini-Hochberg | FDR | Mildly | Many tests, exploratory study |
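
The Holm-Bonferroni step-down procedure from the table can be sketched in a few lines. The function `holm_reject` and the five p-values in `demo_pvals` are illustrative assumptions, not part of the simulation below.

```python
import numpy as np


def holm_reject(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Holm-Bonferroni procedure."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    # Compare the smallest p-value with alpha/m, the next with alpha/(m-1), ...
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


demo_pvals = np.array([0.001, 0.012, 0.014, 0.04, 0.3])  # hypothetical p-values
bonferroni_mask = demo_pvals <= 0.05 / demo_pvals.size
holm_mask = holm_reject(demo_pvals)

print('Bonferroni rejections:', int(bonferroni_mask.sum()))
print('Holm rejections      :', int(holm_mask.sum()))
```

Because the threshold relaxes after each rejection, Holm rejects everything Bonferroni does and sometimes more, while still controlling the FWER.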

In [None]:
# Simulate 20 one-sample t-tests.
# 15 null hypotheses are TRUE (μ=0), 5 are FALSE (μ≠0 — real effects).

m_null = 15   # true null
m_alt  = 5    # true alternatives
m_total = m_null + m_alt
n_obs = 30

null_samples = [rng.normal(0, 1, n_obs) for _ in range(m_null)]
alt_samples  = [rng.normal(0.9, 1, n_obs) for _ in range(m_alt)]  # real effect
all_samples  = null_samples + alt_samples
true_labels  = ['null'] * m_null + ['alternative'] * m_alt

raw_pvals = np.array([stats.ttest_1samp(s, popmean=0).pvalue for s in all_samples])

# Bonferroni correction
bonferroni_pvals = np.minimum(raw_pvals * m_total, 1.0)

# Benjamini-Hochberg (manual implementation)
def bh_correction(pvals, fdr=0.05):
    n = len(pvals)
    sorted_idx = np.argsort(pvals)
    sorted_p = pvals[sorted_idx]
    thresholds = fdr * np.arange(1, n + 1) / n
    reject_mask = sorted_p <= thresholds
    # Find the largest rank k with p_(k) <= (k / n) * fdr, then reject ranks 1..k
    final_reject = np.zeros(n, dtype=bool)
    if reject_mask.any():
        cutoff = np.where(reject_mask)[0].max()
        final_reject[sorted_idx[:cutoff + 1]] = True
    return final_reject

bh_reject = bh_correction(raw_pvals, fdr=0.05)

alpha = 0.05
print(f'Uncorrected  α=0.05: {(raw_pvals < alpha).sum()} significant')
print(f'Bonferroni corrected: {(bonferroni_pvals < alpha).sum()} significant')
print(f'BH FDR-corrected    : {bh_reject.sum()} significant')

# Visualise p-values
fig, ax = plt.subplots(figsize=(10, 4))
x = np.arange(m_total)
colors_bar = ['#e74c3c' if t == 'alternative' else '#3498db' for t in true_labels]
bars = ax.bar(x, -np.log10(raw_pvals), color=colors_bar, alpha=0.7, edgecolor='white')

ax.axhline(-np.log10(alpha),           color='black',  ls='-',  lw=1.5, label=f'Uncorrected α={alpha}')
ax.axhline(-np.log10(alpha / m_total), color='orange', ls='--', lw=1.5, label=f'Bonferroni α*={alpha/m_total:.4f}')

from matplotlib.patches import Patch
legend_handles = [
    Patch(facecolor='#e74c3c', alpha=0.7, label='True alternative (real effect)'),
    Patch(facecolor='#3498db', alpha=0.7, label='True null (no effect)'),
]
leg1 = ax.legend(handles=legend_handles, loc='upper left')
ax.add_artist(leg1)
ax.legend(loc='upper right')

ax.set_xlabel('Test index')
ax.set_ylabel('-log₁₀(p-value)')
ax.set_title('Multiple testing: raw p-values and correction thresholds')
ax.set_xticks(x)

plt.tight_layout()
plt.show()

# FWER inflation demo
m_range = np.arange(1, 101)
fwer = 1 - (1 - 0.05) ** m_range

fig, ax = plt.subplots(figsize=(7, 3))
ax.plot(m_range, fwer, color='tomato', lw=2)
ax.axhline(0.05, color='steelblue', ls='--', label='α = 0.05')
ax.axhline(0.50, color='gray',      ls=':',  label='50% FWER')
ax.set_xlabel('Number of tests (m)')
ax.set_ylabel('Family-wise error rate')
ax.set_title('FWER inflation as number of simultaneous tests grows')
ax.legend()
plt.tight_layout()
plt.show()

## Summary: Choosing the Right Test

| Test | Data type | Groups | Key assumption | SciPy function |
|------|-----------|--------|----------------|----------------|
| One-sample t-test | Continuous | 1 vs hypothesised μ | Normality | `ttest_1samp` |
| Independent two-sample t-test (Welch) | Continuous | 2 independent | Normality | `ttest_ind(equal_var=False)` |
| Paired t-test | Continuous | 2 matched/repeated | Normality of differences | `ttest_rel` |
| One-way ANOVA | Continuous | ≥ 3 independent | Normality, equal variance | `f_oneway` |
| Chi-squared test | Categorical | 2+ categories × 2+ categories | Expected counts ≥ 5 | `chi2_contingency` |
| Mann-Whitney U | Ordinal/non-normal | 2 independent | No normality assumption; similar distribution shapes to compare medians | `mannwhitneyu` |
| Wilcoxon signed-rank | Ordinal/non-normal | 2 matched | Symmetry of differences | `wilcoxon` |
| Kruskal-Wallis | Ordinal/non-normal | ≥ 3 independent | No normality assumption; similar distribution shapes to compare medians | `kruskal` |

**Decision flowchart:**
1. How many groups? → 1, 2, or ≥ 3
2. Are groups independent or paired?
3. Is the outcome continuous or categorical?
4. Are normality assumptions met? (use Q-Q plots + Shapiro-Wilk)
5. If conducting multiple tests → apply correction (Bonferroni or BH)
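
For the two-independent-groups branch, steps 2 to 4 can be sketched as a small helper. The function `compare_two_groups` is hypothetical, and letting a Shapiro-Wilk p-value alone pick the test is a simplification; in practice Q-Q plots and subject knowledge should inform the choice as well.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05):
    """Pick Welch's t-test or Mann-Whitney U based on a Shapiro-Wilk check."""
    # Shapiro-Wilk: a small p-value is evidence against normality
    looks_normal = all(stats.shapiro(g)[1] > alpha for g in (a, b))
    if looks_normal:
        name = 'Welch t-test'
        p = stats.ttest_ind(a, b, equal_var=False).pvalue
    else:
        name = 'Mann-Whitney U'
        p = stats.mannwhitneyu(a, b, alternative='two-sided').pvalue
    return name, float(p)


demo_rng = np.random.default_rng(3)
# Strongly right-skewed data, for which a t-test is a poor fit
skewed_a = demo_rng.lognormal(mean=0.0, sigma=1.0, size=200)
skewed_b = demo_rng.lognormal(mean=0.3, sigma=1.0, size=200)

test_name, p_val = compare_two_groups(skewed_a, skewed_b)
print(f'{test_name}: p = {p_val:.4f}')
```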

**Effect size matters:** a statistically significant result does not imply practical significance. Report Cohen's d (t-tests), η² (ANOVA), or Cramér's V (chi-squared) alongside p-values.
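
The three effect sizes named above can be computed with short helpers. The functions `cohens_d` (pooled standard deviation version), `eta_squared`, and `cramers_v` are minimal sketches with hypothetical names; the contingency table reuses the product-preference counts from section 4.

```python
import numpy as np
from scipy import stats


def cohens_d(x, y):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)


def eta_squared(*groups):
    """Share of total variability explained by group membership (one-way ANOVA)."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total


def cramers_v(table):
    """Cramér's V for an r x c contingency table."""
    table = np.asarray(table)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    return np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))


preference_table = [[45, 30, 25], [35, 50, 15], [20, 20, 60]]
print(f"Cohen's d  : {cohens_d([2, 4, 6], [1, 3, 5]):.3f}")
print(f'eta squared: {eta_squared([1, 2, 3], [4, 5, 6]):.3f}')
print(f"Cramér's V : {cramers_v(preference_table):.3f}")
```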