[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/01-statistics-probability/notebooks/04-hypothesis-testing.ipynb)

# Lesson 4: Hypothesis Testing

*"In the Archives, victory is not decided by truth, but by the margin of doubt. A Stone School philosopher may seem undefeated—but sample fifty of their debates, and you'll find that chance alone could explain their record."*  
— Mink Pavar, testimony before the Senate Inquiry

---

## The Core Problem

The Capital Archives preserve records of 256 formal debates between philosophical schools. Looking at the raw numbers, the Stone School appears dominant—their scholars seem to win more often. But is this difference **real**, or could it be explained by random chance?

This is the fundamental question of **hypothesis testing**: when we see a pattern in our data, is it signal or noise?

---

## Learning Objectives

By the end of this lesson, you will:
1. Understand the null hypothesis as a "default assumption"
2. Interpret p-values correctly (and avoid common misconceptions)
3. Recognize Type I and Type II errors and their tradeoffs
4. Detect and avoid the multiple comparisons trap (p-hacking)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load the scholar debates dataset
debates = pd.read_csv(BASE_URL + "scholar_debates.csv")

print(f"Loaded {len(debates)} debate records")
print(f"Years covered: {debates['year'].min()} - {debates['year'].max()}")
debates.head()

## Part 1: The Three Schools

The Capital's intellectual life is dominated by three philosophical schools:

- **Stone School**: Emphasizes permanence, structure, and tradition
- **Water School**: Values flexibility, adaptation, and flow
- **Pebble School**: Seeks compromise, often dismissed as "fence-sitters"

Let's examine the win rates by school:

In [None]:
# Helper: compute a school's win rate over its decisive (non-draw) debates
def calculate_win_rate(df, school):
    """Calculate the win rate for a given school."""
    # Debates where this school participated as scholar_a
    as_a = df[df['scholar_a_school'] == school]
    wins_as_a = (as_a['outcome'] == 'victory_a').sum()
    
    # Debates where this school participated as scholar_b
    as_b = df[df['scholar_b_school'] == school]
    wins_as_b = (as_b['outcome'] == 'victory_b').sum()
    
    total_debates = len(as_a) + len(as_b)
    total_wins = wins_as_a + wins_as_b
    
    # Exclude draws for win rate calculation
    draws_as_a = (as_a['outcome'] == 'draw').sum()
    draws_as_b = (as_b['outcome'] == 'draw').sum()
    decisive_debates = total_debates - draws_as_a - draws_as_b
    
    return {
        'school': school,
        'total_debates': total_debates,
        'wins': total_wins,
        'decisive_debates': decisive_debates,
        'win_rate': total_wins / decisive_debates if decisive_debates > 0 else 0
    }

# Calculate win rates for each school
schools = ['stone_school', 'water_school', 'pebble_school']
win_rates = pd.DataFrame([calculate_win_rate(debates, school) for school in schools])

print("Scholar Debate Win Rates by School")
print("=" * 60)
for _, row in win_rates.iterrows():
    school_name = row['school'].replace('_', ' ').title()
    print(f"{school_name:20} | {row['wins']:3d} wins / {row['decisive_debates']:3d} decisive debates | Win Rate: {row['win_rate']:.1%}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#8B4513', '#4169E1', '#808080']  # Brown, Blue, Gray
bars = ax.bar(win_rates['school'].str.replace('_', ' ').str.title(), 
              win_rates['win_rate'], color=colors, edgecolor='black')
ax.axhline(0.5, color='red', linestyle='--', linewidth=2, label='50% (fair odds)')
ax.set_ylabel('Win Rate', fontsize=12)
ax.set_title('Scholar Debate Win Rates by Philosophical School', fontsize=14)
ax.set_ylim(0, 0.7)
ax.legend()

# Add counts on bars
for bar, (_, row) in zip(bars, win_rates.iterrows()):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
            f"n={row['decisive_debates']}", ha='center', fontsize=11)

plt.tight_layout()
plt.show()

## Part 2: The Null Hypothesis

Looking at the data, one school appears to outperform the others. But before we declare a winner, we must ask: **could this difference arise by chance alone?**

This is where the **null hypothesis (H₀)** comes in. The null hypothesis represents the "boring" or "default" assumption—typically that there is no real effect.

### For our question:
- **H₀ (Null)**: The school has a 50% win rate (no advantage)
- **H₁ (Alternative)**: The school has a win rate ≠ 50% (real advantage or disadvantage)

We assume H₀ is true, then calculate how likely we'd be to observe our data under that assumption.
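
To make "how likely" concrete, here is a sketch of the calculation, assuming debates are independent of one another. The symbols $X$ (number of wins), $n$ (number of decisive debates), and $k$ (observed wins) are notation introduced here for illustration. Under H₀, each possible win count $x$ has probability

$$P(X = x \mid H_0) = \binom{n}{x}\left(\tfrac{1}{2}\right)^{n}$$

and the two-sided p-value adds up every outcome at least as far from $n/2$ as the observed count:

$$p\text{-value} = \sum_{x \,:\, |x - n/2| \,\ge\, |k - n/2|} \binom{n}{x}\left(\tfrac{1}{2}\right)^{n}$$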

In [None]:
# Let's test: Is Stone School's win rate significantly different from 50%?
stone_stats = win_rates[win_rates['school'] == 'stone_school'].iloc[0]
n_debates = int(stone_stats['decisive_debates'])
n_wins = int(stone_stats['wins'])
observed_rate = stone_stats['win_rate']

print("Testing Stone School Dominance")
print("=" * 50)
print(f"Observed: {n_wins} wins out of {n_debates} decisive debates")
print(f"Observed win rate: {observed_rate:.1%}")
print(f"\nH₀: True win rate = 50%")
print(f"H₁: True win rate ≠ 50%")

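# Under H₀, what's the probability of a win count at least this far
# from n/2 in either direction? That is a two-sided exact binomial test.
# Newer SciPy provides binomtest; binom_test is the older name,
# deprecated and removed in recent SciPy releases.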
try:
    from scipy.stats import binomtest
    result = binomtest(n_wins, n_debates, 0.5, alternative='two-sided')
    p_value = result.pvalue
except ImportError:
    # Fallback for older scipy
    p_value = stats.binom_test(n_wins, n_debates, 0.5)

print(f"\nP-value: {p_value:.4f}")

## Part 3: The P-Value — What It Really Means

The **p-value** is perhaps the most misunderstood concept in statistics. Let's be clear about what it is and isn't:

### What the p-value IS:
> The probability of observing data as extreme as (or more extreme than) what we saw, **assuming the null hypothesis is true**.

### What the p-value is NOT:
- ❌ NOT the probability that H₀ is true
- ❌ NOT the probability that H₁ is true
- ❌ NOT the probability that the result is due to chance

### Intuition: The Surprise Interpretation

Think of the p-value as a measure of **surprise**. If the null hypothesis were true, how surprised would we be to see this data?
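
As a small illustration of "surprise", the sketch below computes a two-sided p-value by hand for a made-up record of 60 wins in 100 decisive debates. The numbers are hypothetical and are not taken from the Archives data; only `scipy.stats.binom` is used.

```python
from scipy.stats import binom

# Hypothetical record: 60 wins in 100 decisive debates (made-up numbers).
n, k = 100, 60

# Upper tail under H0 (win rate 0.5): P(X >= 60).
# sf(k - 1) is P(X > k - 1), which equals P(X >= k).
upper_tail = binom.sf(k - 1, n, 0.5)

# The null distribution is symmetric around n/2, so the two-sided
# p-value adds the mirror-image lower tail, P(X <= 40).
lower_tail = binom.cdf(n - k, n, 0.5)
p_two_sided = upper_tail + lower_tail

print(f"P(X >= {k} | H0) = {upper_tail:.4f}")
print(f"Two-sided p-value = {p_two_sided:.4f}")
```

The two-sided value lands a little above 0.05: a record this lopsided would be fairly surprising under H₀, but not wildly so.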

In [None]:
# Visualize the p-value with a simulation
# If Stone School truly had a 50% win rate, what would their win counts look like?

n_simulations = 10000
simulated_wins = np.random.binomial(n_debates, 0.5, n_simulations)

fig, ax = plt.subplots(figsize=(12, 6))

# Histogram of simulated wins under H₀
counts, bins, _ = ax.hist(simulated_wins, bins=range(0, n_debates+2), 
                          color='steelblue', edgecolor='black', alpha=0.7,
                          density=True, align='left')

# Mark the observed value and region more extreme
ax.axvline(n_wins, color='red', linewidth=3, linestyle='--', 
           label=f'Observed: {n_wins} wins')

# Shade the "as extreme or more" regions (both tails, mirrored around n/2)
extreme_high = max(n_wins, n_debates - n_wins)
extreme_low = min(n_wins, n_debates - n_wins)  # Mirror for two-sided test
for i, count in enumerate(counts):
    if bins[i] >= extreme_high or bins[i] <= extreme_low:
        ax.bar(bins[i], count, color='red', alpha=0.5, edgecolor='black', width=0.8)

ax.set_xlabel('Number of Wins (out of {})'.format(n_debates), fontsize=12)
ax.set_ylabel('Probability', fontsize=12)
ax.set_title('Distribution of Wins Under H₀ (50% Win Rate)\nRed region = p-value', fontsize=14)
ax.legend(fontsize=11)

plt.tight_layout()
plt.show()

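# Calculate empirical p-value from simulation: the share of simulated win
# counts at least as far from n/2 as the observed count (both tails)
center = n_debates / 2
empirical_p = np.mean(np.abs(simulated_wins - center) >= abs(n_wins - center))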
print(f"Empirical p-value (from simulation): {empirical_p:.4f}")
print(f"Exact p-value (from binomial test): {p_value:.4f}")

## Part 4: Statistical Significance and Alpha

By convention, we compare the p-value to a threshold called **α (alpha)**, typically set at 0.05.

- If p < α: **Reject H₀** — the result is "statistically significant"
- If p ≥ α: **Fail to reject H₀** — we don't have enough evidence

### Important Caveats:

1. **α = 0.05 is arbitrary** — it's just a convention from the 1920s
2. **"Not significant" ≠ "No effect"** — it means we can't tell
3. **"Significant" ≠ "Important"** — a tiny effect can be statistically significant with enough data
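
The third caveat can be seen with a quick sketch: the same 51% win rate tested at two sample sizes. Both records are hypothetical, and `two_sided_p` is a helper defined here for illustration, not part of SciPy.

```python
from scipy.stats import binom

def two_sided_p(wins, n):
    """Exact two-sided binomial p-value against H0: win rate = 0.5."""
    # With a symmetric null, doubling the smaller tail gives the exact value.
    tail = min(binom.cdf(wins, n, 0.5), binom.sf(wins - 1, n, 0.5))
    return min(1.0, 2 * tail)

# Same 51% win rate, two hypothetical sample sizes (made-up numbers).
p_small = two_sided_p(51, 100)
p_large = two_sided_p(51_000, 100_000)

print(f"51 wins in 100 debates:            p = {p_small:.4f}")
print(f"51,000 wins in 100,000 debates:    p = {p_large:.2e}")
```

The effect (one percentage point above 50%) is identical in both cases. Only the amount of data changes, and with it the verdict.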

In [None]:
# Test all schools
print("Hypothesis Tests for Each School")
print("=" * 60)
print(f"{'School':<20} {'Win Rate':<12} {'p-value':<12} {'Significant?':<12}")
print("-" * 60)

alpha = 0.05

for _, row in win_rates.iterrows():
    n = int(row['decisive_debates'])
    wins = int(row['wins'])
    
    try:
        result = binomtest(wins, n, 0.5, alternative='two-sided')
        p = result.pvalue
    except NameError:  # binomtest was not importable (older SciPy)
        p = stats.binom_test(wins, n, 0.5)
    
    sig = "Yes" if p < alpha else "No"
    school_name = row['school'].replace('_', ' ').title()
    print(f"{school_name:<20} {row['win_rate']:<12.1%} {p:<12.4f} {sig:<12}")

print(f"\n(Using α = {alpha})")

## Part 5: Type I and Type II Errors

When making decisions based on hypothesis tests, we can make two kinds of mistakes:

| | H₀ is Actually True | H₀ is Actually False |
|---|---|---|
| **Reject H₀** | Type I Error (False Positive) | Correct (True Positive) |
| **Fail to Reject H₀** | Correct (True Negative) | Type II Error (False Negative) |

### In the Archives context:

- **Type I Error**: Declaring Stone School superior when they're actually just lucky
- **Type II Error**: Failing to detect a real advantage due to too little data

### The Tradeoff:
- Lower α → Fewer false positives, but more false negatives
- Higher α → Fewer false negatives, but more false positives
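
The tradeoff can also be computed exactly rather than simulated. The sketch below assumes a two-sided exact binomial test against 50% with 50 decisive debates, and a hypothetical true win rate of 55%. `exact_power` is a helper written for this illustration.

```python
import numpy as np
from scipy.stats import binom

def exact_power(n, true_rate, alpha):
    """Probability of rejecting H0 (rate = 0.5) when the true rate is `true_rate`."""
    k = np.arange(n + 1)
    # Two-sided p-value for every possible win count under H0.
    tail = np.minimum(binom.cdf(k, n, 0.5), binom.sf(k - 1, n, 0.5))
    p_values = np.minimum(1.0, 2 * tail)
    reject = p_values < alpha
    # Sum the true probabilities of the win counts that lead to rejection.
    return float(binom.pmf(k[reject], n, true_rate).sum())

print(f"{'alpha':<8} {'Type I rate':<14} {'Type II rate':<14}")
for a in (0.01, 0.05, 0.10, 0.20):
    type_1 = exact_power(50, 0.50, a)       # H0 true: any rejection is a false positive
    type_2 = 1 - exact_power(50, 0.55, a)   # H0 false: any non-rejection is a miss
    print(f"{a:<8} {type_1:<14.3f} {type_2:<14.3f}")
```

Reading down the table, the Type I rate rises with α while the Type II rate falls. The Type I rate stays at or below α because the binomial distribution is discrete.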

In [None]:
# Demonstrate the tradeoff with simulation
# True scenario: Water School actually has a 55% win rate (small real advantage)

TRUE_RATE = 0.55  # Small real advantage
n_debates_sim = 50  # Sample size
n_experiments = 10000

# Track decisions at different alpha levels
alphas = [0.01, 0.05, 0.10, 0.20]
results = {alpha: {'reject': 0, 'fail_reject': 0} for alpha in alphas}

for _ in range(n_experiments):
    # Simulate debates with true 55% win rate
    wins = np.random.binomial(n_debates_sim, TRUE_RATE)
    
    # Calculate p-value
    try:
        p = binomtest(wins, n_debates_sim, 0.5, alternative='two-sided').pvalue
    except NameError:  # binomtest was not importable (older SciPy)
        p = stats.binom_test(wins, n_debates_sim, 0.5)
    
    # Make decision at each alpha
    for alpha in alphas:
        if p < alpha:
            results[alpha]['reject'] += 1
        else:
            results[alpha]['fail_reject'] += 1

print(f"True win rate: {TRUE_RATE:.0%} (there IS a real effect)")
print(f"Sample size: {n_debates_sim} debates per experiment")
print(f"\nDetection rates (statistical power) at different α levels:")
print("=" * 60)
print(f"{'Alpha':<10} {'Detect Effect':<20} {'Miss Effect (Type II)':<20}")
print("-" * 60)

for alpha in alphas:
    power = results[alpha]['reject'] / n_experiments
    miss_rate = results[alpha]['fail_reject'] / n_experiments
    print(f"{alpha:<10} {power:<20.1%} {miss_rate:<20.1%}")

print(f"\n⚠️  With a small effect (55% vs 50%) and limited data (n={n_debates_sim}),")
print(f"   we often fail to detect the real difference!")

## Part 6: The Multiple Comparisons Trap

### P-Hacking: The Scholar's Temptation

Mink Pavar, the forger, was also known for his statistical manipulations. He understood that if you test enough hypotheses, some will appear "significant" by pure chance.

**The problem**: If you test 20 independent hypotheses at α = 0.05 and every null hypothesis is true, you should still expect about 1 false positive on average (20 × 0.05 = 1).
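
The arithmetic behind this can be sketched in a few lines, assuming the tests are independent and every null hypothesis is true. The variable names are chosen here for illustration.

```python
ALPHA = 0.05

fwer_by_m = {}
for m in (1, 5, 20, 100):
    expected_false_positives = m * ALPHA
    # P(at least one false positive) = 1 - P(no false positives in m tests)
    fwer_by_m[m] = 1 - (1 - ALPHA) ** m
    print(f"m = {m:>3}: expected false positives = {expected_false_positives:.2f}, "
          f"P(at least one) = {fwer_by_m[m]:.1%}")
```

With 20 tests, the chance of at least one spurious "discovery" is already about 64%.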

Let's demonstrate this with the debate data:

In [None]:
# Test many hypotheses on the debate data.
# For most of these we would expect no real effect.

hypotheses = []

def add_side_a_test(label, subset):
    """Test whether scholar_a's win rate differs from 50% within `subset`."""
    total = len(subset[subset['outcome'] != 'draw'])
    if total <= 10:
        return  # too few decisive debates to test
    a_wins = (subset['outcome'] == 'victory_a').sum()
    try:
        p = binomtest(a_wins, total, 0.5).pvalue
    except NameError:  # binomtest was not importable (older SciPy)
        p = stats.binom_test(a_wins, total, 0.5)
    hypotheses.append((label, a_wins / total, total, p))

# 1. Does venue affect outcome?
for venue in debates['venue'].unique():
    add_side_a_test(f'Venue: {venue}', debates[debates['venue'] == venue])

# 2. Does topic affect outcome?
for topic in debates['topic_category'].unique():
    add_side_a_test(f'Topic: {topic}', debates[debates['topic_category'] == topic])

# 3. Does judge count affect outcome?
for judges in debates['judge_count'].unique():
    add_side_a_test(f'{judges} judges', debates[debates['judge_count'] == judges])

# 4. Does year period affect outcome?
for start, end in [(850, 860), (860, 870), (870, 880)]:
    in_period = (debates['year'] >= start) & (debates['year'] < end)
    add_side_a_test(f'Years {start}-{end}', debates[in_period])

# Sort by p-value
hypotheses.sort(key=lambda x: x[3])

print(f"Tested {len(hypotheses)} hypotheses")
print("\nResults sorted by p-value:")
print("=" * 70)
print(f"{'Hypothesis':<30} {'Rate':<10} {'n':<8} {'p-value':<12} {'Sig?':<6}")
print("-" * 70)

for hyp, rate, n, p in hypotheses:
    sig = "*" if p < 0.05 else ""
    print(f"{hyp:<30} {rate:<10.1%} {n:<8} {p:<12.4f} {sig:<6}")

n_significant = sum(1 for h in hypotheses if h[3] < 0.05)
print(f"\n{n_significant} hypotheses are 'significant' at α = 0.05")
print(f"Expected by chance alone: {len(hypotheses) * 0.05:.1f}")

### The Bonferroni Correction

One solution to the multiple comparisons problem is the **Bonferroni correction**: divide α by the number of tests.

If testing m hypotheses:
$$\alpha_{\text{corrected}} = \frac{\alpha}{m}$$

This is conservative—it reduces Type I errors at the cost of more Type II errors.
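
Why dividing by $m$ works can be sketched with the union bound. Write $E_i$ for the event that test $i$ produces a false positive (notation introduced here for illustration), and suppose all $m$ null hypotheses are true. Testing each at level $\alpha / m$ gives

$$P\left(\bigcup_{i=1}^{m} E_i\right) \le \sum_{i=1}^{m} P(E_i) \le m \cdot \frac{\alpha}{m} = \alpha$$

The bound holds whether or not the tests are independent, which is part of why the correction tends to be conservative.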

In [None]:
# Apply Bonferroni correction
n_tests = len(hypotheses)
alpha_corrected = 0.05 / n_tests

print(f"Bonferroni Correction")
print(f"=" * 50)
print(f"Number of tests: {n_tests}")
print(f"Original α: 0.05")
print(f"Corrected α: {alpha_corrected:.4f}")

print(f"\nHypotheses significant after Bonferroni correction:")
print("-" * 50)

any_significant = False
for hyp, rate, n, p in hypotheses:
    if p < alpha_corrected:
        print(f"{hyp}: p = {p:.4f}")
        any_significant = True

if not any_significant:
    print("None survive the correction. Any findings 'significant' at α = 0.05 may well be false positives.")

## Part 7: Effect Size — Beyond Significance

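Statistical significance tells us whether an observed effect is hard to explain by chance alone. **Effect size** tells us how big it is.

For proportions, a simple effect size is the absolute difference between the observed win rate $\hat{p}$ and 50%:

$$\text{Effect Size} = |\hat{p} - 0.5|$$

Here $\hat{p}$ is the observed win rate, not the p-value.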

A school with 52% win rate has a small effect. A school with 70% win rate has a large effect.

In [None]:
# Effect sizes for each school
print("Effect Sizes by School")
print("=" * 50)
print(f"{'School':<20} {'Win Rate':<12} {'Effect Size':<15} {'Interpretation':<15}")
print("-" * 50)

for _, row in win_rates.iterrows():
    effect = abs(row['win_rate'] - 0.5)
    if effect < 0.05:
        interp = "Negligible"
    elif effect < 0.10:
        interp = "Small"
    elif effect < 0.20:
        interp = "Medium"
    else:
        interp = "Large"
    
    school_name = row['school'].replace('_', ' ').title()
    print(f"{school_name:<20} {row['win_rate']:<12.1%} {effect:<15.1%} {interp:<15}")

## Part 8: Comparing Two Groups

Often we want to compare two groups directly. For example: Is there a difference between Stone School and Water School win rates?

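### Head-to-Head Binomial Test

When two schools debate each other directly, every decisive debate is a win for exactly one of them. The comparison therefore reduces to a one-sample binomial test: is Stone School's share of the decisive head-to-head debates different from 50%?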

In [None]:
# Head-to-head record: Stone School vs Water School
# Only look at debates where these two schools faced each other

stone_vs_water = debates[
    ((debates['scholar_a_school'] == 'stone_school') & (debates['scholar_b_school'] == 'water_school')) |
    ((debates['scholar_a_school'] == 'water_school') & (debates['scholar_b_school'] == 'stone_school'))
]

# Count wins for each school
stone_wins = (
    ((stone_vs_water['scholar_a_school'] == 'stone_school') & (stone_vs_water['outcome'] == 'victory_a')).sum() +
    ((stone_vs_water['scholar_b_school'] == 'stone_school') & (stone_vs_water['outcome'] == 'victory_b')).sum()
)

water_wins = (
    ((stone_vs_water['scholar_a_school'] == 'water_school') & (stone_vs_water['outcome'] == 'victory_a')).sum() +
    ((stone_vs_water['scholar_b_school'] == 'water_school') & (stone_vs_water['outcome'] == 'victory_b')).sum()
)

draws = (stone_vs_water['outcome'] == 'draw').sum()
total_decisive = stone_wins + water_wins

print("Stone School vs Water School Head-to-Head")
print("=" * 50)
print(f"Total debates: {len(stone_vs_water)}")
print(f"Draws: {draws}")
print(f"Stone School wins: {stone_wins}")
print(f"Water School wins: {water_wins}")
print(f"\nStone School win rate: {stone_wins/total_decisive:.1%}")
print(f"Water School win rate: {water_wins/total_decisive:.1%}")

# Binomial test
try:
    p = binomtest(stone_wins, total_decisive, 0.5).pvalue
except NameError:  # binomtest was not importable (older SciPy)
    p = stats.binom_test(stone_wins, total_decisive, 0.5)

print(f"\nP-value (test if different from 50/50): {p:.4f}")
print(f"Significant at α=0.05? {'Yes' if p < 0.05 else 'No'}")
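
When the two groups' results come from separate sets of debates rather than head-to-head meetings, a chi-square test of independence on a 2×2 table is a common choice. The sketch below uses made-up counts for two hypothetical schools. It is not computed from the Archives data, and the null hypothesis is that winning is independent of school.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (made up for illustration):
# rows = school, columns = [wins, losses] against other opponents.
table = np.array([
    [60, 40],   # hypothetical school X
    [45, 55],   # hypothetical school Y
])

# correction=True applies Yates' continuity correction,
# which is SciPy's default for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(table, correction=True)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"Degrees of freedom:   {dof}")
print(f"P-value:              {p:.4f}")
print("Expected counts under independence:")
print(expected)
```

The `expected` array shows the counts each cell would have if both schools shared one common win rate. The statistic measures how far the observed table sits from that.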

## Summary

| Concept | Key Insight | Common Mistake |
|---------|-------------|----------------|
| Null Hypothesis (H₀) | Default assumption of "no effect" | Thinking it's what we want to prove |
| P-value | P(data \| H₀), not P(H₀ \| data) | Interpreting as "probability H₀ is true" |
| Significance (α) | Arbitrary threshold, usually 0.05 | Treating as magical cutoff |
| Type I Error | False positive (rejecting true H₀) | Ignoring when p-hacking |
| Type II Error | False negative (missing real effect) | Concluding "no effect" from high p |
| Multiple Comparisons | Test enough, something will be "significant" | Not correcting for many tests |
| Effect Size | How big is the effect? | Confusing significance with importance |

---

## Exercises

### Exercise 1: The Age Effect

Test whether older scholars (age > 50) have a different win rate than younger scholars (age ≤ 50). 
1. Calculate win rates for each group
2. Perform an appropriate hypothesis test
3. Calculate the effect size

In [None]:
# Exercise 1: Your code here
# Hint: Create columns for winner_age, then compare groups


### Exercise 2: Publication Power

Does having more publications help? Test whether scholars with above-median publications win more debates.
1. Find the median publication count
2. Compare win rates above vs below median
3. What's the p-value?

In [None]:
# Exercise 2: Your code here


### Exercise 3: The Controversy Connection

Are close debates (margin = 'narrow' or outcome = 'draw') more controversial (higher `controversy_score`)?
1. Calculate mean controversy score for close vs decisive debates
2. Perform a t-test to compare the means
3. Interpret the result

In [None]:
# Exercise 3: Your code here
# Hint: Use stats.ttest_ind() for comparing two means


### Exercise 4: P-Hacking Simulation

Simulate the p-hacking problem:
1. Generate 20 random "coin flip" experiments (each with n=50 trials, true p=0.5)
2. Test each for deviation from 0.5
3. How many are "significant" at α=0.05?
4. Repeat this simulation 1000 times and show the distribution of "significant" findings

In [None]:
# Exercise 4: Your code here


---

## Next Lesson

In **Lesson 5: Bayesian Classification**, we'll move from "is there an effect?" to "what should we believe?" We'll investigate manuscript forgeries in the Archives, using evidence to update our beliefs about whether a document is genuine.

*"The frequentist asks: if Mink were innocent, how often would we see this evidence? The Bayesian asks: given this evidence, how likely is Mink innocent?"*  
— From the Senate Inquiry transcripts