[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/01-statistics-probability/notebooks/01-uncertainty-intuition.ipynb)

# Lesson 1: The Intuition of Uncertainty

*"All maps are wrong, but some are useful."* - Vagabu Olt, wandering cartographer

---

## The Core Problem

In the Capital Archives, scholars study thousands of expedition reports from Yeller Quarry. But they face a fundamental challenge:

- **Population**: All expeditions that have ever been or will be conducted (infinite, unknowable)
- **Sample**: The 1,000 expedition records we have on file

When the Senate asks "What is the true success rate of Quarry expeditions?", they're asking about the **population**. But we can only calculate from our **sample**.

This gap between sample and population is the main source of uncertainty in statistical estimates, and it is the focus of this lesson.

---

## Learning Objectives

By the end of this lesson, you will:
1. Understand the difference between populations and samples
2. See why every statistic is an *estimate* with uncertainty
3. Grasp the concept of a random variable

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load the expedition outcomes dataset
expeditions = pd.read_csv(BASE_URL + "expedition_outcomes.csv")
print(f"Loaded {len(expeditions)} expedition records")
expeditions.head()

## Part 1: Populations vs. Samples

### The Fundamental Truth of the Archives

The Boss ran expeditions into Yeller Quarry for 8 years before the disaster. Gull's Remnants continue today. Countless other crews have ventured into the marsh.

Our archive contains 1,000 expedition records. But this is just a **sample** of all expeditions that have ever occurred. Many were never recorded. Some records were lost. Others are still being written.

Let's pretend, for teaching purposes, that we have access to the "true" population of all 100,000 expeditions ever conducted:

In [None]:
# Simulate the "true population" of all expeditions
# In reality, we'd never have this - it's omniscient knowledge
POPULATION_SIZE = 100_000

# True parameters (unknown to actual researchers)
TRUE_SUCCESS_RATE = 0.72  # 72% of all expeditions are successful
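# Mean of lognormal(mean=4.0, sigma=1.0) is exp(mean + sigma^2 / 2), about 90.0
TRUE_AVG_CATCH_VALUE = np.exp(4.0 + 1.0**2 / 2)  # Average catch value across all time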
TRUE_CASUALTY_RATE = 0.18  # 18% of expeditions have casualties

# Generate the population
np.random.seed(42)
population = {
    'success': np.random.binomial(1, TRUE_SUCCESS_RATE, POPULATION_SIZE),
    'catch_value': np.random.lognormal(mean=4.0, sigma=1.0, size=POPULATION_SIZE),
    'had_casualties': np.random.binomial(1, TRUE_CASUALTY_RATE, POPULATION_SIZE)
}
population_df = pd.DataFrame(population)

print(f"Population size: {len(population_df):,}")
print(f"\nTrue population statistics (normally unknowable):")
print(f"  Success rate: {population_df['success'].mean():.1%}")
print(f"  Average catch value: {population_df['catch_value'].mean():.1f}")
print(f"  Casualty rate: {population_df['had_casualties'].mean():.1%}")

Now, let's see what happens when we only have a **sample** - which is the reality scholars in the Archives face:

In [None]:
# Take a sample of 100 expeditions (like what an archivist might have)
sample_size = 100
sample = population_df.sample(n=sample_size, random_state=42)

print(f"Sample size: {len(sample)}")
print(f"\nSample statistics (what the archivist calculates):")
print(f"  Sample success rate: {sample['success'].mean():.1%}")
print(f"  Sample avg catch value: {sample['catch_value'].mean():.1f}")
print(f"  Sample casualty rate: {sample['had_casualties'].mean():.1%}")

print(f"\nError in estimates:")
print(f"  Success rate error: {abs(sample['success'].mean() - TRUE_SUCCESS_RATE):.1%}")
print(f"  Catch value error: {abs(sample['catch_value'].mean() - TRUE_AVG_CATCH_VALUE):.1f}")

### The Key Insight

Notice that our sample statistics are **close** to the true values, but not exact.

This is the fundamental challenge faced by scholars in the Capital Archives: **every measurement is an estimate with error.**

When Yasho Krent debates Grigsu Haldo about expedition success rates, they're both working from incomplete samples of an unknowable truth.

Let's see how much estimates vary across different samples:

In [None]:
# Imagine 1000 different archivists, each with their own sample of 100 expeditions
n_archivists = 1000
sample_success_rates = []

for _ in range(n_archivists):
    archivist_sample = population_df.sample(n=sample_size)
    sample_success_rates.append(archivist_sample['success'].mean())

sample_success_rates = np.array(sample_success_rates)

# Visualize how estimates vary
fig, ax = plt.subplots()
ax.hist(sample_success_rates, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
ax.axvline(TRUE_SUCCESS_RATE, color='red', linewidth=2, linestyle='--', 
           label=f'True Rate = {TRUE_SUCCESS_RATE:.0%}')
ax.axvline(sample_success_rates.mean(), color='green', linewidth=2, 
           label=f'Average of Samples = {sample_success_rates.mean():.1%}')
ax.set_xlabel('Estimated Success Rate')
ax.set_ylabel('Number of Archivists')
ax.set_title('1000 Archivists, Each with 100 Expedition Records')
ax.legend()
plt.show()

print(f"Range of estimates: {sample_success_rates.min():.1%} to {sample_success_rates.max():.1%}")
print(f"Standard deviation: {sample_success_rates.std():.1%}")

## Part 2: The Standard Error - Quantifying Uncertainty

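The spread of estimates across repeated samples is called the **Standard Error**: the standard deviation of an estimate's sampling distribution. It tells us how far a single sample's estimate typically lands from the true value.

For the mean of a sample, there's a beautiful formula: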

$$\text{Standard Error} = \frac{\sigma}{\sqrt{n}}$$

Where:
- $\sigma$ = population standard deviation (in practice unknown, so we estimate it with the sample standard deviation)
- $n$ = sample size
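
As a quick sanity check on the formula, the sketch below compares $\sigma / \sqrt{n}$ with the spread of simulated sample means for a 0/1 success outcome. It assumes a success probability of 0.72 (the value used for the simulated population earlier) and an arbitrary seed; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

p = 0.72  # success probability, matching the simulated population
n = 100   # records per sample

# For a 0/1 outcome, the population standard deviation is sqrt(p * (1 - p))
sigma = np.sqrt(p * (1 - p))
se_formula = sigma / np.sqrt(n)

# Draw 20,000 samples of size n; each binomial count / n is one sample mean
sample_means = rng.binomial(n, p, size=20_000) / n
se_simulated = sample_means.std()

print(f"Formula SE:   {se_formula:.4f}")
print(f"Simulated SE: {se_simulated:.4f}")
```

The two numbers should agree closely, which is the point: the formula predicts the spread we would see if many archivists each drew their own sample.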

### The Semantic Meaning:

**Larger samples = Less uncertainty**

This is why the Capital demands detailed expedition reports. More data means better estimates. When Gull tells stories of expeditions to young trappers, those stories—if recorded—would reduce our uncertainty about the true nature of Yeller Quarry.

Let's verify this with catch values:

In [None]:
# Compare different sample sizes using catch values
sample_sizes = [10, 50, 100, 500, 1000]
n_experiments = 500

fig, axes = plt.subplots(1, len(sample_sizes), figsize=(16, 4))

for idx, n in enumerate(sample_sizes):
    means = [population_df.sample(n=n)['catch_value'].mean() for _ in range(n_experiments)]
    
    axes[idx].hist(means, bins=20, edgecolor='black', alpha=0.7, color='goldenrod')
    axes[idx].axvline(population_df['catch_value'].mean(), color='red', linewidth=2)
    axes[idx].set_title(f'n = {n}\nSE = {np.std(means):.1f}')
    axes[idx].set_xlabel('Avg Catch Value')
    axes[idx].set_xlim(30, 200)
    
plt.suptitle('How Sample Size Affects Precision of Catch Value Estimates', fontsize=14)
plt.tight_layout()
plt.show()

print("Quadrupling the sample size roughly halves the standard error;")
print("going from n = 10 to n = 1000 (100x the data) shrinks it by about 10x.")
print("This is the sqrt(n) relationship in action.")
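
To see the halving directly, here is a minimal sketch that uses sample sizes spaced 4x apart (25, 100, 400). It assumes catch values follow the same lognormal shape as the simulated population above, and it draws fresh independent values rather than sampling from `population_df`, so treat it as an illustration rather than a replica of the cell above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this sketch

# Same lognormal shape the lesson uses for simulated catch values
mu, sigma_log = 4.0, 1.0

# Closed-form standard deviation of a lognormal variable
sigma = np.sqrt((np.exp(sigma_log**2) - 1) * np.exp(2 * mu + sigma_log**2))

simulated_se = {}
for n in [25, 100, 400]:
    # 2,000 independent samples of size n, one sample mean per row
    sample_means = rng.lognormal(mu, sigma_log, size=(2000, n)).mean(axis=1)
    simulated_se[n] = sample_means.std()
    print(f"n = {n:>3}: simulated SE = {simulated_se[n]:5.1f}, "
          f"sigma / sqrt(n) = {sigma / np.sqrt(n):5.1f}")
```

Each 4x step in `n` should roughly halve the simulated standard error, though catch values are heavy-tailed, so the match is noisier than it would be for success rates.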

## Part 3: Random Variables - A New Way of Thinking

In arithmetic, when we write $x = 5$, we mean $x$ has one specific value.

In statistics, a **random variable** is different. It's a quantity whose value depends on a random outcome, so it can take several values, each with a certain probability.

### Example: Creature Encounters

Let $X$ = the number of creature encounters on an expedition.

$X$ doesn't equal any single number. Instead, it follows a distribution:
- $P(X = 0) = $ some probability
- $P(X = 1) = $ some probability
- ... and so on
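
One way to make this concrete is to simulate a random variable and tabulate how often each value appears. The sketch below assumes, purely for illustration, that encounters follow a Poisson distribution with a mean of 2; that choice is hypothetical and is not taken from the archive data.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed for this sketch

# ASSUMPTION: encounters ~ Poisson(2). Hypothetical, for illustration only.
simulated_encounters = rng.poisson(lam=2.0, size=10_000)

# Relative frequency of each observed value estimates P(X = value)
values, counts = np.unique(simulated_encounters, return_counts=True)
pmf = counts / counts.sum()

for v, p in zip(values, pmf):
    print(f"P(X = {v}) ≈ {p:.3f}")

print(f"Probabilities sum to {pmf.sum():.3f}")
```

No single run of the expedition "equals" $X$; the table of probabilities is the full description of it.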

### Why This Matters for the Quarry:

When The Boss plans an expedition, she can't know exactly how many creatures they'll encounter. But she can estimate the *distribution* of possibilities based on past expeditions.

In [None]:
# Analyze creature encounters from our actual expedition data
encounters = expeditions['creature_encounters']

# Count frequencies
values, counts = np.unique(encounters, return_counts=True)
frequencies = counts / len(encounters)

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(values, frequencies, edgecolor='black', alpha=0.7, color='darkred')
ax.set_xlabel('Number of Creature Encounters')
ax.set_ylabel('Estimated Probability (relative frequency)')
ax.set_title('Distribution of Creature Encounters per Expedition\n(Random Variable X = encounters)')
ax.set_xticks(values)
plt.show()

print("Probability distribution of creature encounters:")
for v, p in zip(values[:10], frequencies[:10]):
    print(f"  P(X = {v}) = {p:.3f}")
if len(values) > 10:
    print(f"  ...")

### Connecting to Expedition Planning

The Boss doesn't know she'll have exactly 3 encounters. But from past expeditions she can estimate:
- The *expected value* (mean) of encounters
- The *variance* (spread) of possibilities

This probabilistic thinking is what separates experienced crew leaders from novices.
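
Both quantities can be computed directly from a probability table: the expected value is $\sum_x x \, P(X = x)$ and the variance is $\sum_x (x - E[X])^2 \, P(X = x)$. The sketch below uses a small made-up distribution of encounters; the probabilities are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# HYPOTHETICAL distribution of encounters, invented for illustration
outcomes = np.array([0, 1, 2, 3, 4])
probs = np.array([0.10, 0.25, 0.35, 0.20, 0.10])
assert np.isclose(probs.sum(), 1.0)  # a valid distribution sums to 1

# Expected value: probability-weighted average of the outcomes
expected = np.sum(outcomes * probs)

# Variance: probability-weighted average squared distance from the mean
variance = np.sum((outcomes - expected) ** 2 * probs)

print(f"Expected encounters: {expected:.2f}")
print(f"Variance:            {variance:.2f}")
print(f"Standard deviation:  {np.sqrt(variance):.2f}")
```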

In [None]:
print("Summary statistics for creature encounters:")
print(f"  Expected (mean): {encounters.mean():.1f} encounters")
print(f"  Standard deviation: {encounters.std():.1f}")
print(f"  Minimum observed: {encounters.min()}")
print(f"  Maximum observed: {encounters.max()}")
print(f"\nThe Boss prepares for {encounters.mean():.0f} encounters,")
print(f"but knows it could reasonably be as high as {int(encounters.mean() + 2*encounters.std())}")

## Part 4: Working with Real Expedition Data

Let's apply these concepts to our actual expedition archive:

In [None]:
# Our sample from the archives
print("=" * 50)
print("YELLER QUARRY EXPEDITION ARCHIVE")
print("=" * 50)
print(f"Total records: {len(expeditions)}")
print(f"Years covered: {expeditions['year'].min()} - {expeditions['year'].max()}")

print(f"\n--- Estimated Statistics (with uncertainty) ---")

# Success rate with standard error
success_rate = expeditions['success'].mean()
success_se = np.sqrt(success_rate * (1 - success_rate) / len(expeditions))
print(f"\nSuccess Rate: {success_rate:.1%} ± {1.96*success_se:.1%} (95% CI)")

# Average catch value with standard error
catch_mean = expeditions['catch_value'].mean()
catch_se = expeditions['catch_value'].std() / np.sqrt(len(expeditions))
print(f"Avg Catch Value: {catch_mean:.1f} ± {1.96*catch_se:.1f} (95% CI)")

# Casualty rate
casualty_rate = (expeditions['casualties'] > 0).mean()
casualty_se = np.sqrt(casualty_rate * (1 - casualty_rate) / len(expeditions))
print(f"Casualty Rate: {casualty_rate:.1%} ± {1.96*casualty_se:.1%} (95% CI)")

print(f"\n(The ± values represent our uncertainty due to limited sample size)")
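
What does "95%" mean here? A rough way to check is to simulate many archives from a known success rate and count how often the interval `estimate ± 1.96 * SE` contains that rate. The sketch below assumes a true rate of 0.72 and 1,000 records per archive; both numbers are assumptions for the simulation, not facts about the real data.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed for this sketch

true_p = 0.72      # known only because we are simulating
n = 1000           # records per simulated archive
n_archives = 5000  # number of simulated archives

# Each archive's estimated success rate and its standard error
p_hat = rng.binomial(n, true_p, size=n_archives) / n
se = np.sqrt(p_hat * (1 - p_hat) / n)

# 95% interval for each archive
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se

# Fraction of intervals that contain the true rate
coverage = np.mean((lower <= true_p) & (true_p <= upper))
print(f"Intervals containing the true rate: {coverage:.1%}")
```

The fraction should land near 95%. Any single archive's interval either contains the truth or it doesn't; the 95% describes how often the procedure succeeds across many archives.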

## Key Takeaways

1. **You never see the truth** - only samples from an unknown population. The Archives contain shadows, not the reality of Yeller Quarry.

2. **Every statistic is an estimate** - the sample success rate is not the true success rate; it's our best guess, with uncertainty attached.

3. **Larger samples = less uncertainty** - the Standard Error shrinks as $\frac{1}{\sqrt{n}}$. This is why detailed records matter.

4. **Random variables have distributions, not single values** - the number of creature encounters isn't deterministic; each possible count has a probability. Experienced crews think probabilistically.

---

## Exercises

1. **Sector Analysis**: Compare success rates between the 'Deep Quarry' sector and 'Surface Flats'. Which has higher success? Calculate the standard error for each.

2. **Sample Size Experiment**: Take random samples of size 25, 100, and 400 from the expedition data. How much does the catch value estimate vary?

3. **Yeller Groups**: Expeditions with Yeller groups are rare (~10%). Calculate the success rate for expeditions with and without Yeller groups. Given the small sample size of Yeller expeditions, how confident can we be in the difference?

In [None]:
# Exercise 1: Sector Analysis
# Hint: Use expeditions[expeditions['sector'] == 'Deep Quarry'] to filter



In [None]:
# Exercise 2: Sample Size Experiment
# Hint: Use expeditions.sample(n=25)['catch_value'].mean()



In [None]:
# Exercise 3: Yeller Group Analysis
# Hint: expeditions['has_yeller_group'] is True/False



---

## Next Lesson

In **Lesson 2: Distributions as Terrain**, we'll explore the creature market price data and learn to recognize when your data violates the assumptions that standard models make. The long-tailed distribution of prices tells a story about the Quarry's economy.