[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/01-statistics-probability/notebooks/02-distributions-as-terrain.ipynb)

# Lesson 2: Distributions as Terrain

*"The price of a Grimslew is as unpredictable as the creature itself—most trades are quick copper exchanges for common specimens, but once in a generation, a perfect Mottled Lungfish sells for enough to buy a Senate seat."*  
— Ledger annotation, Capital Creature Market, 1847

---

## The Core Problem

In the Capital's creature market, merchants face a peculiar challenge: **most sales are small, but rare sales are enormous**. A seller who prices inventory based on the average sale risks overvaluing it, because the average is pulled upward by rare, spectacular transactions that may never repeat.

Meanwhile, in the Dens, mapmakers recording ground stability face the opposite problem: their measurement errors cluster predictably around the true value, with large errors being genuinely rare.

These are two different **terrains of possibility**. Understanding the shape of your data's terrain is essential before applying any statistical model.

---

## Learning Objectives

By the end of this lesson, you will:
1. Recognize the normal distribution and understand why it emerges from accumulated small effects
2. Identify skewed distributions and understand why mean ≠ median
3. Know when standard statistical models will fail due to distribution shape
4. Transform skewed data to enable standard analyses

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load the datasets
creature_market = pd.read_csv(BASE_URL + "creature_market.csv")
dens_boundary = pd.read_csv(BASE_URL + "dens_boundary_observations.csv")

print(f"Loaded {len(creature_market)} creature market transactions")
print(f"Loaded {len(dens_boundary)} boundary observations")

## Part 1: The Normal Distribution — Measurement Errors in the Dens

### The Mapmakers' Dilemma

In the Dens, where "yesterday's map is gossip," mapmakers like Vagabu Olt and The Pickbox Man spend their lives surveying the shifting boundaries between solid ground and densmuck. Their instruments—theodolites, measuring rods, even simple pacing—all introduce small errors.

But here's the remarkable thing: when you accumulate many independent small effects, the errors follow a predictable pattern called the **normal distribution** (or Gaussian, or bell curve).

Let's examine the measurement errors from our boundary observations:

In [None]:
# Extract measurement errors (difference between observed and true stability)
errors = dens_boundary['measurement_error']

# Plot the distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(errors, bins=40, density=True, alpha=0.7, color='steelblue', edgecolor='black')

# Overlay theoretical normal distribution
x = np.linspace(errors.min(), errors.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, errors.mean(), errors.std()), 
             'r-', linewidth=2, label='Normal distribution')
axes[0].axvline(0, color='green', linestyle='--', linewidth=2, label='Zero error')
axes[0].set_xlabel('Measurement Error')
axes[0].set_ylabel('Density')
axes[0].set_title('Distribution of Mapmaker Measurement Errors')
axes[0].legend()

# Q-Q plot to test normality
stats.probplot(errors, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Do Errors Follow Normal Distribution?')

plt.tight_layout()
plt.show()

print(f"Mean error: {errors.mean():.4f} (should be near 0 if unbiased)")
print(f"Standard deviation: {errors.std():.4f}")
print(f"Median error: {errors.median():.4f}")
print(f"\nNote: Mean ≈ Median for symmetric distributions")

### Why the Bell Curve Emerges

A mapmaker's measurement error comes from many independent sources:
- Slight trembling of the hand
- Wind affecting the theodolite
- Imperfect calibration
- Judgment calls on where exactly the boundary lies
- Fatigue

Each source contributes a tiny positive or negative error. When you **add up many independent random effects**, none of which dominates the rest, the total is approximately normally distributed, almost regardless of what each individual effect looks like (as long as each has finite variance).

This is a preview of the **Central Limit Theorem**, which we'll explore deeply in Lesson 3.
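
As a rough illustration of that claim, the sketch below simulates a hypothetical surveyor whose total error is the sum of 100 small, independent effects. Each effect is drawn from a deliberately lopsided (shifted exponential) distribution. The distribution, the count of 100, and all variable names are assumptions made for this demo, not properties of the Dens dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

n_surveys = 20_000   # hypothetical number of simulated measurements
n_effects = 100      # hypothetical number of small error sources per measurement

# Each small effect is skewed: exponential with mean 1, shifted to have mean 0
effects = rng.exponential(scale=1.0, size=(n_surveys, n_effects)) - 1.0
total_error = effects.sum(axis=1)

def skewness(x):
    """Population skewness: third central moment over std cubed."""
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

skew_single = skewness(effects[:, 0])   # one effect on its own
skew_total = skewness(total_error)      # sum of all 100 effects

# Fraction of summed errors within one standard deviation of their mean
frac_within_1 = (np.abs(total_error - total_error.mean()) <= total_error.std()).mean()

print(f"Skewness of a single effect:  {skew_single:.2f}")
print(f"Skewness of the summed error: {skew_total:.2f}")
print(f"Summed errors within 1 std:   {frac_within_1:.1%}")
```

Under these assumptions a single effect is strongly skewed (theoretical skewness 2), while the sum is much closer to symmetric (theoretical skewness 0.2), with roughly 68% of totals falling within one standard deviation.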

### The 68-95-99.7 Rule

For normal distributions, we can make precise probability statements:
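
The theoretical percentages can be computed rather than memorized. Here is a minimal sketch using only the standard library; the helper name `prob_within` is made up for this example, and the same values come from `stats.norm.cdf(k) - stats.norm.cdf(-k)`.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} std: {prob_within(k):.2%}")
```

The next cell checks how closely the mapmakers' errors match these theoretical values.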

In [None]:
mean = errors.mean()
std = errors.std()

within_1_std = ((errors >= mean - std) & (errors <= mean + std)).mean()
within_2_std = ((errors >= mean - 2*std) & (errors <= mean + 2*std)).mean()
within_3_std = ((errors >= mean - 3*std) & (errors <= mean + 3*std)).mean()

print("The 68-95-99.7 Rule for Normal Distributions:")
print("="*50)
print(f"Within 1 std: {within_1_std:.1%} (theory: 68.3%)")
print(f"Within 2 std: {within_2_std:.1%} (theory: 95.4%)")
print(f"Within 3 std: {within_3_std:.1%} (theory: 99.7%)")

print(f"\nPractical interpretation for mapmakers:")
print(f"  - Most errors ({within_1_std:.0%}) are within ±{std:.3f}")
print(f"  - Errors beyond ±{2*std:.3f} are rare ({1-within_2_std:.1%})")
print(f"  - Errors beyond ±{3*std:.3f} almost never happen ({1-within_3_std:.2%})")

### Instrument Quality Matters

Different instruments have different precision. Let's compare the measurement error distributions by instrument type:

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

instruments = dens_boundary['instrument_type'].unique()
colors = ['steelblue', 'coral', 'seagreen']

for instrument, color in zip(instruments, colors):
    subset = dens_boundary[dens_boundary['instrument_type'] == instrument]['measurement_error']
    ax.hist(subset, bins=30, alpha=0.5, label=f'{instrument} (std={subset.std():.4f})', 
            color=color, density=True)

ax.axvline(0, color='black', linestyle='--', linewidth=2)
ax.set_xlabel('Measurement Error')
ax.set_ylabel('Density')
ax.set_title('Measurement Error by Instrument Type')
ax.legend()
plt.show()

print("\nPrecision by instrument (lower std = more precise):")
for instrument in instruments:
    subset = dens_boundary[dens_boundary['instrument_type'] == instrument]['measurement_error']
    print(f"  {instrument:15s}: std = {subset.std():.4f}, mean = {subset.mean():+.4f}")

## Part 2: Skewed Distributions — The Creature Market

### A Very Different Terrain

Now let's look at creature market prices. Unlike measurement errors, prices don't cluster symmetrically around a center. Instead, they exhibit **right skew**: most transactions are small, but rare transactions are enormous.

*"For every hundred specimens of Swamp Hornet sold for pocket change, there's one collector who'll pay a year's wages for a pristine Grimslew."*  
— Market proverb

In [None]:
prices = creature_market['price_per_unit']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of raw prices
axes[0].hist(prices, bins=50, color='goldenrod', edgecolor='black', alpha=0.7)
axes[0].axvline(prices.mean(), color='red', linewidth=2, linestyle='-', 
                label=f'Mean = {prices.mean():.1f}')
axes[0].axvline(prices.median(), color='blue', linewidth=2, linestyle='--', 
                label=f'Median = {prices.median():.1f}')
axes[0].set_xlabel('Price per Unit')
axes[0].set_ylabel('Number of Transactions')
axes[0].set_title('Creature Market Prices: Right-Skewed Distribution')
axes[0].legend()

# Log-scale histogram
axes[1].hist(np.log10(prices + 1), bins=50, color='goldenrod', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Log₁₀(Price + 1)')
axes[1].set_ylabel('Number of Transactions')
axes[1].set_title('Log-Transformed Prices: More Symmetric')

plt.tight_layout()
plt.show()

print("Price statistics:")
print(f"  Mean:   {prices.mean():>10.2f}")
print(f"  Median: {prices.median():>10.2f}")
print(f"  Min:    {prices.min():>10.2f}")
print(f"  Max:    {prices.max():>10.2f}")
print(f"\n  Mean / Median ratio: {prices.mean() / prices.median():.2f}")
print(f"  (For symmetric distributions, this ratio ≈ 1.0)")

### Why Mean ≠ Median Matters

In a symmetric distribution (like mapmaker errors), the mean and median are nearly equal. But in a skewed distribution, they diverge.

**The mean is pulled toward the tail.** A few extremely expensive specimens drag the average upward, even though most transactions are much smaller.

This has practical implications:
- If you're a **seller**, the mean might make your inventory seem more valuable than it is
- If you're a **buyer**, the median better represents what you'll actually pay
- If you're a **statistician**, many formulas assume symmetry and will give misleading results
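
A tiny made-up example shows the pull of the tail. The prices below are invented for illustration and are not taken from `creature_market.csv`.

```python
import numpy as np

# Nine hypothetical everyday sales, in copper
typical_sales = np.array([2, 3, 3, 4, 5, 4, 3, 2, 4])

# The same ledger after one rare, spectacular sale
with_rare_sale = np.append(typical_sales, 900)

for label, sales in [("Everyday sales", typical_sales),
                     ("With one rare sale", with_rare_sale)]:
    print(f"{label:<20} mean = {np.mean(sales):6.2f}   median = {np.median(sales):4.1f}")
```

A single sale moves the mean from about 3.3 to 93, while the median only moves from 3 to 3.5.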

In [None]:
# Demonstrate how extreme values affect the mean
print("The Influence of Extreme Values")
print("="*50)

# Original statistics
print(f"\nAll {len(prices)} transactions:")
print(f"  Mean:   {prices.mean():.2f}")
print(f"  Median: {prices.median():.2f}")

# Remove top 1%
threshold_99 = prices.quantile(0.99)
prices_trimmed = prices[prices <= threshold_99]
print(f"\nWithout top 1% (prices > {threshold_99:.2f}):")
print(f"  Mean:   {prices_trimmed.mean():.2f}  (dropped {(prices.mean() - prices_trimmed.mean()):.2f})")
print(f"  Median: {prices_trimmed.median():.2f}  (barely changed)")

# Remove top 5%
threshold_95 = prices.quantile(0.95)
prices_more_trimmed = prices[prices <= threshold_95]
print(f"\nWithout top 5% (prices > {threshold_95:.2f}):")
print(f"  Mean:   {prices_more_trimmed.mean():.2f}")
print(f"  Median: {prices_more_trimmed.median():.2f}")

### The Log-Normal Distribution

Creature prices are well approximated by what's called a **log-normal distribution**. This means:
- The prices themselves are skewed
- But the **logarithm** of prices is normally distributed

Why does this happen? Prices are affected by **multiplicative factors**:
- Rarity multiplies the base price
- Condition (pristine vs. damaged) multiplies again
- Collector demand multiplies again

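When factors multiply rather than add, you get log-normal distributions. Taking logarithms turns the product into a sum, log(price) = log(base) + log(rarity) + log(condition) + log(demand), and a sum of independent terms tends toward a normal shape. So the log of the price is approximately normal, which makes the price itself approximately log-normal.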
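
The multiplicative story can be checked with a small simulation. This is only a sketch: the 20 independent factors, their uniform range of 0.7x to 1.4x, and the base price are all hypothetical choices, not values from the market data.

```python
import numpy as np

rng = np.random.default_rng(1)

n_prices = 50_000    # hypothetical number of simulated sales
n_factors = 20       # hypothetical number of independent price multipliers
base_price = 10.0    # hypothetical base price

# Each factor scales the price by somewhere between 0.7x and 1.4x
factors = rng.uniform(0.7, 1.4, size=(n_prices, n_factors))
sim_prices = base_price * factors.prod(axis=1)
log_sim_prices = np.log(sim_prices)

def skewness(x):
    """Population skewness: third central moment over std cubed."""
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

print(f"Raw prices: mean = {sim_prices.mean():.2f}, median = {np.median(sim_prices):.2f}")
print(f"Skewness of raw prices: {skewness(sim_prices):.2f}")
print(f"Skewness of log prices: {skewness(log_sim_prices):.2f}")
```

In this setup the raw prices come out right-skewed with the mean above the median, while their logarithms are close to symmetric.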

In [None]:
# Test log-normality
log_prices = np.log(prices + 1)  # Add 1 to handle any zeros

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Q-Q plot for raw prices (should NOT be linear)
stats.probplot(prices, dist="norm", plot=axes[0])
axes[0].set_title('Q-Q Plot: Raw Prices vs. Normal\n(Not linear = not normal)')

# Q-Q plot for log prices (should BE linear)
stats.probplot(log_prices, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Log Prices vs. Normal\n(Linear = log-normal)')

plt.tight_layout()
plt.show()

## Part 3: When Models Fail — Distribution Assumptions

Many statistical models assume your data follows a normal distribution (or at least a symmetric one). When this assumption is violated, results can be misleading.

### Example: Confidence Intervals for Mean Price

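The standard formula for a 95% confidence interval of the mean assumes that the sample mean is approximately normally distributed. That holds for normal data, or for large enough samples, but with strong skew "large enough" can be much larger than you'd expect: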

$$\text{CI} = \bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}$$

Let's see what happens when we apply this to skewed data:

In [None]:
# Standard CI calculation
n = len(prices)
mean = prices.mean()
std = prices.std()
se = std / np.sqrt(n)

ci_lower = mean - 1.96 * se
ci_upper = mean + 1.96 * se

print("Standard 95% Confidence Interval for Mean Price")
print("="*50)
print(f"Point estimate: {mean:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")

print("\n⚠️  Problems with this approach:")
print(f"  - The CI is forced to be symmetric around the mean")
print(f"  - But the data is NOT symmetric (median = {prices.median():.2f})")
print(f"  - With strong skew the sample mean is itself skewed, so true coverage can fall short of 95%")
print(f"  - The CI describes the mean, not the range of individual prices")

### Bootstrap: A Distribution-Free Alternative

When your data is skewed, **bootstrapping** builds a confidence interval without assuming that the sampling distribution is normal. It works by:
1. Resampling your data with replacement
2. Calculating the statistic for each resample
3. Using the percentiles of these resampled statistics as the CI
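
Because step 2 works with "the statistic" in general, the same three steps apply to the median as easily as to the mean. The sketch below wraps them in a helper; `bootstrap_ci` and its parameters are hypothetical names for this example, and the data is synthetic log-normal noise standing in for real prices.

```python
import numpy as np

def bootstrap_ci(data, stat_fn, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for any statistic (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n_obs = len(data)
    boot_stats = np.empty(n_boot)
    for i in range(n_boot):
        # Step 1: resample with replacement
        resample = rng.choice(data, size=n_obs, replace=True)
        # Step 2: compute the statistic on the resample
        boot_stats[i] = stat_fn(resample)
    # Step 3: percentiles of the resampled statistics form the interval
    tail = (1 - level) / 2
    lower, upper = np.quantile(boot_stats, [tail, 1 - tail])
    return lower, upper

# Synthetic right-skewed "prices" (an assumption, not the market data)
rng = np.random.default_rng(42)
fake_prices = rng.lognormal(mean=3.0, sigma=1.0, size=500)

mean_ci = bootstrap_ci(fake_prices, np.mean)
median_ci = bootstrap_ci(fake_prices, np.median)

print(f"95% CI for the mean:   [{mean_ci[0]:.2f}, {mean_ci[1]:.2f}]")
print(f"95% CI for the median: [{median_ci[0]:.2f}, {median_ci[1]:.2f}]")
```

Passing the statistic as a function keeps the resampling logic in one place. The loop is written for clarity rather than speed.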

In [None]:
# Bootstrap confidence interval
n_bootstrap = 10000
bootstrap_means = []

for _ in range(n_bootstrap):
    resample = np.random.choice(prices, size=n, replace=True)
    bootstrap_means.append(resample.mean())

bootstrap_means = np.array(bootstrap_means)
boot_ci_lower = np.percentile(bootstrap_means, 2.5)
boot_ci_upper = np.percentile(bootstrap_means, 97.5)

print("Bootstrap 95% Confidence Interval for Mean Price")
print("="*50)
print(f"Point estimate: {mean:.2f}")
print(f"95% CI: [{boot_ci_lower:.2f}, {boot_ci_upper:.2f}]")

print(f"\nComparison:")
print(f"  Standard CI: [{ci_lower:.2f}, {ci_upper:.2f}]  (symmetric by construction)")
print(f"  Bootstrap CI: [{boot_ci_lower:.2f}, {boot_ci_upper:.2f}]  (shape follows the data)")

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(bootstrap_means, bins=50, color='goldenrod', edgecolor='black', alpha=0.7)
ax.axvline(mean, color='red', linewidth=2, label=f'Sample Mean = {mean:.1f}')
ax.axvline(boot_ci_lower, color='blue', linewidth=2, linestyle='--', label=f'2.5th percentile')
ax.axvline(boot_ci_upper, color='blue', linewidth=2, linestyle='--', label=f'97.5th percentile')
ax.set_xlabel('Bootstrap Sample Mean')
ax.set_ylabel('Frequency')
ax.set_title('Bootstrap Distribution of Mean Price')
ax.legend()
plt.show()

## Part 4: Kurtosis and Fat Tails

Beyond skewness, another important distribution property is **kurtosis**—how heavy the tails are compared to a normal distribution.
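
Both skewness and excess kurtosis are ratios of central moments, which can be sketched directly in NumPy. The helper names below are made up for this example; the formulas are the biased (population) versions that `stats.skew` and `stats.kurtosis` use by default, and the samples are synthetic.

```python
import numpy as np

def skewness(x):
    """Third central moment divided by the variance to the power 1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

def excess_kurtosis(x):
    """Fourth central moment over variance squared, minus 3 (the normal's value)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**4).mean() / (centered**2).mean() ** 2 - 3.0

rng = np.random.default_rng(7)
normal_sample = rng.normal(size=200_000)        # symmetric, normal tails
skewed_sample = rng.exponential(size=200_000)   # right-skewed, heavier tail

for label, sample in [("Normal", normal_sample), ("Exponential", skewed_sample)]:
    print(f"{label:<12} skew = {skewness(sample):+.2f}   "
          f"excess kurtosis = {excess_kurtosis(sample):+.2f}")
```

Subtracting 3 is what makes a normal distribution score zero, so positive values indicate tails heavier than normal.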

In the Quarry, expedition casualties exhibit **fat tails**: most expeditions have zero or few casualties, but catastrophic events (creature attacks, cave-ins) occasionally kill many crew members at once.

Let's examine this with our data:

In [None]:
# Load expedition data to examine casualties
expeditions = pd.read_csv(BASE_URL + "expedition_outcomes.csv")

casualties = expeditions['casualties']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of casualties
axes[0].hist(casualties, bins=30, color='darkred', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Casualties per Expedition')
axes[0].set_ylabel('Number of Expeditions')
axes[0].set_title('Distribution of Expedition Casualties')

# Compare to normal distribution with same mean/std
x = np.linspace(0, casualties.max(), 100)
normal_pdf = stats.norm.pdf(x, casualties.mean(), casualties.std())
axes[1].hist(casualties, bins=30, density=True, color='darkred', edgecolor='black', alpha=0.5, label='Actual')
axes[1].plot(x, normal_pdf, 'b-', linewidth=2, label='Normal (same mean/std)')
axes[1].set_xlabel('Casualties per Expedition')
axes[1].set_ylabel('Density')
axes[1].set_title('Casualties vs. Normal Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Casualty statistics:")
print(f"  Mean: {casualties.mean():.2f}")
print(f"  Std:  {casualties.std():.2f}")
print(f"  Skewness: {stats.skew(casualties):.2f}  (>0 means right-skewed)")
print(f"  Excess Kurtosis: {stats.kurtosis(casualties):.2f}  (>0 means fatter tails than normal)")

### Why Fat Tails Matter

If you model casualty risk using a normal distribution, you will **systematically underestimate catastrophic events**. A normal distribution puts only about 0.13% of outcomes more than three standard deviations above the mean. In fat-tailed distributions, such events happen far more often.

*"The Quarry doesn't kill in ones and twos. It waits until your guard is down, then takes the whole crew."*  
— Gull's Remnants saying
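
To get a feel for the size of the gap, the sketch below compares upper-tail probabilities of a normal distribution with those of a Laplace distribution scaled to the same mean and variance. The Laplace (excess kurtosis 3) is only a convenient stand-in for a fat-tailed shape, chosen because its tail has a closed form; it is not fitted to the casualty data, and the function names are hypothetical.

```python
import math

def normal_tail(k):
    """P(Z > k) for a standard normal Z."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def laplace_tail(k):
    """P(X > k) for a Laplace variable with mean 0 and variance 1 (k >= 0)."""
    scale = 1 / math.sqrt(2)   # Laplace variance is 2 * scale**2
    return 0.5 * math.exp(-k / scale)

print(f"{'Threshold':<12}{'Normal':>12}{'Laplace':>12}{'Ratio':>10}")
for k in (2, 3, 4):
    p_norm = normal_tail(k)
    p_lap = laplace_tail(k)
    print(f"> {k} std     {p_norm:>12.6f}{p_lap:>12.6f}{p_lap / p_norm:>9.1f}x")
```

The ratio grows with the threshold: the further out you look, the worse the normal assumption understates the risk.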

In [None]:
# Compare tail probabilities: actual vs. normal assumption
mean_c = casualties.mean()
std_c = casualties.std()

thresholds = [mean_c + 2*std_c, mean_c + 3*std_c, mean_c + 4*std_c]

print("Probability of Extreme Casualties")
print("="*60)
print(f"{'Threshold':<20} {'Actual':<15} {'Normal Predicts':<15} {'Ratio':<10}")
print("-"*60)

for t in thresholds:
    actual = (casualties > t).mean()
    normal_pred = 1 - stats.norm.cdf(t, mean_c, std_c)
    ratio = actual / normal_pred if normal_pred > 0 else float('inf')
    print(f">{t:>5.1f} casualties  {actual:>10.4f}      {normal_pred:>10.6f}      {ratio:>8.1f}x")

print(f"\n⚠️  Extreme events happen MORE often than normal distribution predicts!")

## Part 5: Identifying Distribution Types

Here's a practical guide for recognizing distribution shapes in your data:

| Characteristic | Normal | Log-Normal | Fat-Tailed |
|---------------|--------|------------|------------|
| Symmetry | Symmetric | Right-skewed | Often right-skewed |
| Mean vs Median | Approximately equal | Mean > Median | Often Mean > Median |
| Q-Q Plot | Linear | Curved up-right | S-shaped |
| Extreme values | Very rare | Somewhat common | More common than a normal predicts |
| Example | Measurement errors | Prices, incomes | Casualties, extreme events |

### Decision Tree for Distribution Choice

In [None]:
def diagnose_distribution(data, name="Data"):
    """Diagnose the distribution type of a dataset."""
    
    mean_val = data.mean()
    median_val = data.median()
    skew_val = stats.skew(data)
    kurt_val = stats.kurtosis(data)
    
    print(f"\n{'='*50}")
    print(f"Distribution Diagnosis: {name}")
    print(f"{'='*50}")
    print(f"Mean:     {mean_val:.3f}")
    print(f"Median:   {median_val:.3f}")
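    # The ratio is undefined when the median is 0 (e.g. mostly-zero counts)
    if median_val != 0:
        print(f"Mean/Median Ratio: {mean_val/median_val:.2f}")
    else:
        print("Mean/Median Ratio: undefined (median is 0)")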
    print(f"Skewness: {skew_val:.2f}")
    print(f"Excess Kurtosis: {kurt_val:.2f}")
    
    print(f"\nDiagnosis:")
    
    # Check symmetry
    if abs(skew_val) < 0.5:
        print(f"  ✓ Approximately symmetric (skew = {skew_val:.2f})")
    elif skew_val > 0:
        print(f"  → Right-skewed (skew = {skew_val:.2f})")
    else:
        print(f"  ← Left-skewed (skew = {skew_val:.2f})")
    
    # Check tails
    if abs(kurt_val) < 1:
        print(f"  ✓ Normal-like tails (kurtosis = {kurt_val:.2f})")
    elif kurt_val > 0:
        print(f"  ⚠️  Fat tails - extreme values more likely (kurtosis = {kurt_val:.2f})")
    else:
        print(f"  ⚠️  Thin tails - extreme values less likely (kurtosis = {kurt_val:.2f})")
    
    # Recommendation
    print(f"\nRecommendation:")
    if abs(skew_val) < 0.5 and abs(kurt_val) < 1:
        print(f"  → Standard normal-based methods should work well")
    elif skew_val > 1:
        print(f"  → Consider log transformation or log-normal models")
        print(f"  → Use median instead of mean for central tendency")
    if kurt_val > 2:
        print(f"  → Use robust methods (bootstrap, trimmed means)")
        print(f"  → Be cautious of 'rare event' underestimation")

# Diagnose our datasets
diagnose_distribution(dens_boundary['measurement_error'], "Mapmaker Measurement Errors")
diagnose_distribution(creature_market['price_per_unit'], "Creature Market Prices")
diagnose_distribution(expeditions['casualties'], "Expedition Casualties")

## Summary

| Concept | Key Insight | Densworld Example |
|---------|-------------|-------------------|
| Normal Distribution | Emerges from many small, additive effects | Mapmaker measurement errors |
| Skewness | Mean ≠ Median; mean pulled toward tail | Creature market prices |
| Log-Normal | Multiplicative factors create log-normal data | Price = base × rarity × condition × demand |
| Fat Tails | Extreme events more likely than normal predicts | Expedition casualties |
| 68-95-99.7 Rule | Only works for normal distributions | Mapmaker errors stay in bounds |
| Bootstrap | Distribution-free confidence intervals | Works for any terrain |

---

## Exercises

### Exercise 1: Creature Categories

Different creature categories may have different price distributions. Compare the price distribution for `insect` vs `mammal` categories. Which is more skewed? Why might this be?

In [None]:
# Exercise 1: Your code here
# Hint: creature_market[creature_market['category'] == 'insect']['price_per_unit']


### Exercise 2: Experience and Precision

Do more experienced mapmakers have smaller measurement errors? Calculate the standard deviation of `measurement_error` grouped by `observer_experience` (you may want to bin experience into categories like 0-10, 11-20, 21+ years).

In [None]:
# Exercise 2: Your code here
# Hint: pd.cut() can bin continuous variables


### Exercise 3: Transform and Test

Apply a log transformation to creature prices. Then:
1. Calculate the mean and median of the log-transformed prices
2. Create a Q-Q plot to verify the transformed data is approximately normal
3. Calculate a 95% confidence interval for the mean of log-prices
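4. Transform this CI back to the original scale (note that the result brackets the geometric mean of prices, which for log-normal data is the median rather than the arithmetic mean)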

In [None]:
# Exercise 3: Your code here
# Hint: Use np.log() to transform, np.exp() to back-transform
# (np.log1p() and np.expm1() are the matching pair if any price is 0)


### Exercise 4: Tail Risk Assessment

The Boss is planning a large expedition and wants to understand casualty risk. What is the probability of suffering 5 or more casualties on an expedition? Calculate this:
1. From the actual data (empirical probability)
2. Assuming a normal distribution
3. Assuming a Poisson distribution (often used for count data)

Which assumption is most conservative (predicts highest risk)?

In [None]:
# Exercise 4: Your code here
# Hint: stats.poisson.sf(k, mu) gives P(X > k) for Poisson,
# so P(X >= 5) is stats.poisson.sf(4, mu)


---

## Next Lesson

In **Lesson 3: The Central Limit Theorem**, we'll discover why averaging many observations reveals truth—even when individual observations are noisy and non-normal. We'll follow mapmakers as they combine multiple surveys to find the true boundary of the Dens.

*"One measurement is a guess. Ten measurements are a vote. A hundred measurements are the truth."*  
— Vagabu Olt