[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/02-linear-algebra/notebooks/01-vectors-dual-view.ipynb)

# Lesson 1: Vectors — Two Ways of Seeing

*"Each creature is a point in the space of all possible creatures. The Leatherback Burrower and the Stakdur exist as distant stars in the same constellation of traits—one timid and burrowing, the other a territorial apex predator. Between them lies every creature that could ever evolve in these depths."*  
— Boffa Trent, *Natural Philosophy of the Quarry*, 1823

---

## The Core Insight

A **vector** is simply an ordered list of numbers. But there are **two equally valid ways** to interpret this humble list:

1. **The Archivist's View**: A row of data (creature statistics, manuscript features, expedition records)
2. **The Cartographer's View**: A coordinate in n-dimensional space

Understanding both views is essential for machine learning—and for understanding how the Capital Archives classify knowledge. The archivists see data as records to be catalogued. The mapmakers see the same data as locations in an abstract space where **similar things cluster together**.
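
As a minimal sketch of the two views, the snippet below stores one creature's numbers twice: once as a labelled record and once as a bare coordinate array. The values are the Cave Bat's behavioral traits quoted in Part 1 of this lesson; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

traits = ["aggression", "sociality", "nocturnality", "territoriality", "hunting_strategy"]
values = [0.10, 0.95, 0.95, 0.20, 0.80]

# Archivist's view: a labelled record, read field by field
record = pd.Series(values, index=traits, name="Cave Bat")
print(record["sociality"])      # look up one attribute by name

# Cartographer's view: the same numbers as coordinates of a point in 5-D space
point = np.array(values)
print(point.shape)              # one axis per trait
print(np.linalg.norm(point))    # distance of the point from the origin
```

Nothing about the numbers changes between the two views; only the questions we ask of them do.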

---

## Learning Objectives

By the end of this lesson, you will:
1. See creatures and manuscripts as points in high-dimensional space
2. Visualize how "similar" things are "close" in this space
3. Understand why this geometric view enables classification and recommendation
4. Calculate distances between data points in feature space

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.spatial.distance import pdist, squareform

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load our datasets
creature_vectors = pd.read_csv(BASE_URL + "creature_vectors.csv")
creature_similarity = pd.read_csv(BASE_URL + "creature_similarity.csv")
manuscripts = pd.read_csv(BASE_URL + "manuscript_features.csv")

print(f"Loaded {len(creature_vectors)} creatures with behavioral/habitat vectors")
print(f"Loaded {len(creature_similarity)} pairwise similarity calculations")
print(f"Loaded {len(manuscripts)} manuscript records")

## Part 1: The Archivist's View

In the Capital Archives, every creature catalogued from the Quarry has an entry with multiple attributes. The archivists record each creature as a row of numbers—a **vector** of features.

*"We measure what we can: aggression on a scale of zero to one, sociality, nocturnality, territoriality, hunting strategy. Each number pins down one facet of the creature's nature. Together, they form a portrait more precise than any sketch."*  
— Archives Cataloguing Manual, 1851

Let's examine how the Archives record creature data:

In [None]:
# Display the creature catalog
print("The Creature Catalog — Behavioral Features:")
print("="*80)
behavioral_cols = ['creature_id', 'common_name', 'category', 'aggression', 'sociality', 
                   'nocturnality', 'territoriality', 'hunting_strategy']
print(creature_vectors[behavioral_cols].to_string(index=False))

In [None]:
# Each creature is a vector
print("\nEach creature is a VECTOR of behavioral attributes:")
print("="*70)

behavioral_features = ['aggression', 'sociality', 'nocturnality', 'territoriality', 'hunting_strategy']

for _, row in creature_vectors.head(6).iterrows():
    vector = row[behavioral_features].values
    print(f"{row['common_name']:25} → [{', '.join(f'{v:.2f}' for v in vector)}]")

print(f"\n{'':25}   (aggression, sociality, nocturnality, territoriality, hunting)")

### The Key Insight

The **Cave Bat** is represented by the vector `[0.10, 0.95, 0.95, 0.20, 0.80]`:
- Low aggression (0.10) — docile
- High sociality (0.95) — lives in colonies
- High nocturnality (0.95) — active at night
- Low territoriality (0.20) — shares space easily
- High hunting strategy (0.80) — active predator (of insects)

The **Witch Creature** is `[1.00, 0.00, 0.90, 1.00, 0.50]`:
- Maximum aggression (1.00) — attacks on sight
- Zero sociality (0.00) — purely solitary
- High nocturnality (0.90) — darkness hunter
- Maximum territoriality (1.00) — kills intruders
- Mid-range hunting strategy (0.50), the only trait near the middle of the scale

These are just lists of attributes. But watch what happens when we think **geometrically**...
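
Before any geometry, one quick way to compare the two records is to subtract them trait by trait. This sketch hard-codes the two vectors quoted above instead of reading them from `creature_vectors`, so it runs on its own.

```python
import numpy as np

traits = ["aggression", "sociality", "nocturnality", "territoriality", "hunting_strategy"]
cave_bat = np.array([0.10, 0.95, 0.95, 0.20, 0.80])
witch = np.array([1.00, 0.00, 0.90, 1.00, 0.50])

# Signed gap per trait: positive means the Witch Creature scores higher
gap = witch - cave_bat
for name, g in zip(traits, gap):
    print(f"{name:18} {g:+.2f}")

# Trait with the largest absolute gap
largest = traits[int(np.argmax(np.abs(gap)))]
print("Largest gap:", largest)
```

The two creatures nearly agree on nocturnality but sit at opposite ends of the sociality and aggression scales.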

---

## Part 2: The Cartographer's View

Vagabu Olt, the wandering map-sketcher, would see it differently.

Those same vectors are **coordinates in n-dimensional space**:
- Each feature becomes an axis
- Each creature becomes a **point** at those coordinates
- The entire catalog becomes a **cloud of points** in "creature space"
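
In array terms, the "cloud of points" is a matrix with one row per creature and one column per feature. The sketch below builds a two-row version from the vectors quoted in Part 1; with the real catalog, selecting the chosen feature columns from `creature_vectors` and calling `.to_numpy()` would play the same role.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the catalog (values taken from the Part 1 vectors)
catalog = pd.DataFrame(
    {
        "common_name": ["Cave Bat", "Witch Creature"],
        "aggression": [0.10, 1.00],
        "sociality": [0.95, 0.00],
        "nocturnality": [0.95, 0.90],
    }
)

axes = ["aggression", "sociality", "nocturnality"]  # each feature is one axis
X = catalog[axes].to_numpy()                         # each row is one point

print(X.shape)          # (number of points, number of dimensions)
print(X.mean(axis=0))   # centre of the cloud
```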

*"I drew creatures on paper. Boffa Trent taught me to see them as stars—each one a fixed point in the heavens of possibility, with similar beasts clustered like constellations."*  
— Vagabu Olt, field notes

Let's visualize creatures in 3D feature space (using aggression, sociality, and nocturnality):

In [None]:
# Create 3D plot of creatures in feature space
fig = plt.figure(figsize=(14, 10))
ax = fig.add_subplot(111, projection='3d')

# Color by category
category_colors = {
    'bird': 'skyblue',
    'reptile': 'green',
    'insect': 'orange',
    'mammal': 'brown',
    'amphibian': 'olive',
    'fish': 'blue',
    'worm': 'purple',
    'unknown': 'red'
}

for _, row in creature_vectors.iterrows():
    color = category_colors.get(row['category'], 'gray')
    ax.scatter(row['aggression'], row['sociality'], row['nocturnality'], 
               c=color, s=100, edgecolor='black', alpha=0.8)
    # Label a few notable creatures
    if row['common_name'] in ['Cave Bat', 'Witch Creature', 'Marsh Hornet', 'Yeller Frog', 'Stakdur']:
        ax.text(row['aggression']+0.03, row['sociality']+0.03, row['nocturnality']+0.03,
                row['common_name'], fontsize=9, fontweight='bold')

ax.set_xlabel('Aggression', fontsize=11)
ax.set_ylabel('Sociality', fontsize=11)
ax.set_zlabel('Nocturnality', fontsize=11)
ax.set_title('Each Creature is a Point in 3D "Behavioral Space"', fontsize=13)

# Add legend
for cat, color in category_colors.items():
    ax.scatter([], [], c=color, label=cat, s=60)
ax.legend(loc='upper left', fontsize=9)

plt.tight_layout()
plt.show()

## Part 3: Why This Matters — Similarity = Proximity

Here's the insight that powers much of machine learning:

**Similar things are close together in feature space.**

If two creatures have similar behavioral traits, their vectors point to nearby locations in this abstract space. This is how the Archives could build:

- **Classification**: A new creature near the "deadly predator" cluster is probably dangerous
- **Recommendation**: Expeditions that encountered the Wharver should prepare for the Stakdur (nearby in behavioral space)
- **Clustering**: Group creatures by natural behavioral similarities
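
The classification idea can be sketched as a one-nearest-neighbour lookup: give a new creature the label of the closest known creature. The two reference vectors are the ones quoted in Part 1; the labels and the `newcomer` vector are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Reference creatures (vectors from Part 1) with illustrative labels
known = {
    "Cave Bat": (np.array([0.10, 0.95, 0.95, 0.20, 0.80]), "docile"),
    "Witch Creature": (np.array([1.00, 0.00, 0.90, 1.00, 0.50]), "dangerous"),
}

# Hypothetical new sighting: aggressive, solitary, territorial
newcomer = np.array([0.90, 0.10, 0.80, 0.90, 0.60])

# Euclidean distance from the newcomer to every known creature
dists = {name: float(np.linalg.norm(newcomer - vec)) for name, (vec, _) in known.items()}
nearest = min(dists, key=dists.get)

print(dists)
print(f"Nearest: {nearest} -> predicted label: {known[nearest][1]}")
```

A practical classifier would draw on many labelled creatures and usually vote among several neighbours rather than trusting a single one.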

Let's calculate the actual distances between creatures:

In [None]:
# Create feature matrix for behavioral traits
behavioral_features = ['aggression', 'sociality', 'nocturnality', 'territoriality', 'hunting_strategy']
X_behavioral = creature_vectors[behavioral_features].values

# Calculate pairwise Euclidean distances
distances = squareform(pdist(X_behavioral, metric='euclidean'))

# Create distance matrix DataFrame
distance_df = pd.DataFrame(
    distances, 
    index=creature_vectors['common_name'],
    columns=creature_vectors['common_name']
)

print("Creature Distance Matrix (based on 5 behavioral traits):")
print("Smaller distance = more similar behavior")
print("="*80)
# Show a subset
sample_creatures = ['Cave Bat', 'Yeller Bat', 'Witch Creature', 'Stakdur', 'Yeller Frog', 'Mud Worm']
print(distance_df.loc[sample_creatures, sample_creatures].round(3).to_string())

In [None]:
# Find most similar and most different pairs
pairs = []
names = creature_vectors['common_name'].values
for i in range(len(names)):
    for j in range(i+1, len(names)):
        pairs.append((names[i], names[j], distances[i, j]))

pairs_sorted = sorted(pairs, key=lambda x: x[2])

print("Most Similar Creature Pairs (smallest distance):")
print("-" * 70)
for p1, p2, dist in pairs_sorted[:7]:
    print(f"  {p1:22} <-> {p2:22}  distance = {dist:.3f}")

print("\nMost Different Creature Pairs (largest distance):")
print("-" * 70)
for p1, p2, dist in pairs_sorted[-5:]:
    print(f"  {p1:22} <-> {p2:22}  distance = {dist:.3f}")

### Interpreting the Results

Notice the patterns:
- **Cave Bat and Yeller Bat** are extremely similar (both highly social, nocturnal, colony-dwelling mammals)
- **Witch Creature and Yeller Frog** are maximally different (solitary apex predator vs. docile social amphibian)

This isn't coincidence: creatures with similar recorded traits land close together, so the geometry recovers groupings that a naturalist would recognize.

---

## Part 4: Habitat Vectors — Another Dimension of Similarity

Creatures aren't just defined by behavior. They also have habitat preferences. Our dataset includes a second set of features:

- `depth_preference` — How deep in the Quarry they live
- `moisture_preference` — Wet vs. dry environments
- `light_tolerance` — Can they survive in lit areas?
- `cave_affinity` — Preference for enclosed spaces
- `surface_affinity` — Preference for open areas

In [None]:
# Display habitat features
print("The Creature Catalog — Habitat Features:")
print("="*90)
habitat_cols = ['common_name', 'depth_preference', 'moisture_preference', 'light_tolerance', 
                'cave_affinity', 'surface_affinity']
print(creature_vectors[habitat_cols].head(10).to_string(index=False))

In [None]:
# Compare behavioral similarity vs habitat similarity
habitat_features = ['depth_preference', 'moisture_preference', 'light_tolerance', 
                    'cave_affinity', 'surface_affinity']

X_habitat = creature_vectors[habitat_features].values
habitat_distances = squareform(pdist(X_habitat, metric='euclidean'))

# Compare for specific pairs
print("Behavioral vs Habitat Similarity:")
print("="*70)
print(f"{'Creature Pair':<45} {'Behavioral':<12} {'Habitat':<12}")
print("-"*70)

interesting_pairs = [
    ('Cave Bat', 'Yeller Bat'),
    ('Grimslew Fish', 'Marsh Eel'),
    ('Stakdur', 'Maw Beast'),
    ('Cave Bat', 'Marsh Hornet'),
    ('Witch Creature', 'Yeller Frog')
]

for c1, c2 in interesting_pairs:
    idx1 = creature_vectors[creature_vectors['common_name'] == c1].index[0]
    idx2 = creature_vectors[creature_vectors['common_name'] == c2].index[0]
    b_dist = distances[idx1, idx2]
    h_dist = habitat_distances[idx1, idx2]
    # Pad the whole pair label so the numbers line up under the header
    pair = f"{c1} <-> {c2}"
    print(f"{pair:<45} {b_dist:<12.3f} {h_dist:<12.3f}")

### The Full Vector: Behavior + Habitat

The Archives' complete creature vector combines **all 10 features**. This is the "full profile" that captures both how a creature acts and where it lives.
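
Combining the two feature sets is just concatenation, and the squared Euclidean distance over the full vector equals the sum of the squared behavioral and habitat distances. The sketch below checks this with the Part 1 behavioral vectors; the habitat numbers are placeholders invented for the example, not values from `creature_vectors.csv`.

```python
import numpy as np

beh_a = np.array([0.10, 0.95, 0.95, 0.20, 0.80])
beh_b = np.array([1.00, 0.00, 0.90, 1.00, 0.50])

# Placeholder habitat vectors (hypothetical values)
hab_a = np.array([0.60, 0.40, 0.10, 0.90, 0.10])
hab_b = np.array([0.90, 0.50, 0.00, 1.00, 0.00])

# Full profile = behavioral features followed by habitat features
full_a = np.concatenate([beh_a, hab_a])
full_b = np.concatenate([beh_b, hab_b])

d_beh = np.linalg.norm(beh_a - beh_b)
d_hab = np.linalg.norm(hab_a - hab_b)
d_full = np.linalg.norm(full_a - full_b)

print(f"behavioral={d_beh:.3f}  habitat={d_hab:.3f}  full={d_full:.3f}")
print(np.isclose(d_full**2, d_beh**2 + d_hab**2))
```

One consequence is that whichever block of features has the larger spread dominates the full distance, which matters when features are on different scales.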

The pre-calculated `creature_similarity.csv` file contains similarity metrics using all features:

In [None]:
# Look at pre-calculated similarity metrics
print("Pre-calculated Similarity Metrics (using all 10 features):")
print(creature_similarity[['creature_a_name', 'creature_b_name', 
                           'euclidean_dist_full', 'cosine_sim_full']].head(15).to_string(index=False))

print("\n(Euclidean distance: lower = more similar)")
print("(Cosine similarity: higher = more similar, max = 1.0)")

## Part 5: Vectors as Arrows — Magnitude and Direction

There's another interpretation: a vector as an **arrow from the origin**.

This view emphasizes:
- **Magnitude**: How "extreme" is this creature? (length of arrow)
- **Direction**: What *kind* of extreme? (where does it point?)
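
Magnitude and direction can be separated explicitly: the L2 norm gives the length, and dividing by it leaves a unit vector that carries only direction. Cosine similarity then compares directions while ignoring length. A small sketch with the two Part 1 vectors follows; `cosine_sim` is a helper defined here for the example.

```python
import numpy as np

cave_bat = np.array([0.10, 0.95, 0.95, 0.20, 0.80])
witch = np.array([1.00, 0.00, 0.90, 1.00, 0.50])

def cosine_sim(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

length = np.linalg.norm(cave_bat)   # magnitude of the arrow
direction = cave_bat / length       # unit vector: direction only

print(f"length = {length:.4f}")
print(f"|direction| = {np.linalg.norm(direction):.4f}")
print(f"cosine(Cave Bat, Witch Creature) = {cosine_sim(cave_bat, witch):.4f}")
print(f"cosine(Cave Bat, 2 x Cave Bat)   = {cosine_sim(cave_bat, 2 * cave_bat):.4f}")
```

Doubling a vector changes its length but not its direction, so its cosine similarity with the original stays at 1.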

The dataset includes pre-calculated **vector norms** (magnitudes). We first draw a few creatures as arrows, then list the norms:

In [None]:
# Visualize creatures as arrows (2D: aggression vs sociality)
fig, ax = plt.subplots(figsize=(11, 10))

# Select representative creatures
selected_names = ['Cave Bat', 'Witch Creature', 'Marsh Hornet', 'Yeller Frog', 
                  'Stakdur', 'Mud Worm', 'Metal-Beaked Finch', 'Maw Beast']
selected = creature_vectors[creature_vectors['common_name'].isin(selected_names)]

colors = plt.cm.viridis(np.linspace(0, 1, len(selected)))

for (_, row), color in zip(selected.iterrows(), colors):
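    # length_includes_head=True makes the arrow tip land exactly on the data point
    ax.arrow(0, 0, row['aggression'], row['sociality'], 
             head_width=0.03, head_length=0.02, length_includes_head=True,
             fc=color, ec=color, linewidth=2.5, alpha=0.8)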
    ax.annotate(row['common_name'], 
                xy=(row['aggression'], row['sociality']),
                xytext=(row['aggression']+0.02, row['sociality']+0.02),
                fontsize=10, fontweight='bold')

ax.set_xlim(-0.05, 1.1)
ax.set_ylim(-0.05, 1.1)
ax.set_xlabel('Aggression', fontsize=12)
ax.set_ylabel('Sociality', fontsize=12)
ax.set_title('Creatures as Arrows in Feature Space\n(Direction = trait balance, Length = intensity)', fontsize=13)
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Display vector magnitudes (norms)
print("Vector Magnitudes (L2 Norms) — Measuring Overall 'Intensity':")
print("="*70)
print(f"{'Creature':<25} {'Behavioral Norm':<18} {'Habitat Norm':<18} {'Full Norm':<15}")
print("-"*70)

for _, row in creature_vectors.iterrows():
    print(f"{row['common_name']:<25} {row['behavioral_norm_l2']:<18.4f} {row['habitat_norm_l2']:<18.4f} {row['full_norm_l2']:<15.4f}")

print("\nInterpretation:")
print("  - Higher norm = more 'extreme' creature (further from origin)")
print("  - Marsh Hornet has highest behavioral norm (extreme on many traits)")
print("  - Mud Worm has lowest behavioral norm (mild on all traits)")

## Part 6: Manuscript Vectors — The Same Concept, Different Domain

The same concept applies to manuscripts in the Archives. Each manuscript can be represented as a vector of stylometric features:

- Sentence length patterns
- Vocabulary richness
- Alignment with Stone School terminology
- Alignment with Water School terminology
- Alignment with Pebble School terminology

This lets the Archives detect forgeries by seeing which manuscripts are "out of place" in feature space.

*"Mink Pavar's forgeries fooled the eyes of scholars for decades. But in the space of stylometric features, they occupy an impossible position—too close to multiple schools at once."*  
— Archives Investigation Report, 1863
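
One way to make "out of place" concrete is to measure how far each manuscript sits from the centre of the authentic manuscripts attributed to the same school. The sketch below uses a tiny invented table: the column names follow the dataset used in this lesson, but every number is hypothetical, and centroid distance is only one possible scoring rule, not necessarily the Archives' method.

```python
import numpy as np
import pandas as pd

align_cols = ["school_alignment_stone", "school_alignment_water", "school_alignment_pebble"]

# Hypothetical manuscripts: four authentic Stone School texts and one suspect
ms = pd.DataFrame(
    {
        "manuscript_id": ["M1", "M2", "M3", "M4", "M5"],
        "attributed_school": ["stone"] * 5,
        "is_forgery": [False, False, False, False, True],
        "school_alignment_stone": [0.90, 0.85, 0.88, 0.92, 0.60],
        "school_alignment_water": [0.10, 0.15, 0.12, 0.08, 0.55],
        "school_alignment_pebble": [0.05, 0.10, 0.08, 0.06, 0.10],
    }
)

# Centroid of the authentic manuscripts for each attributed school
centroids = ms[~ms["is_forgery"]].groupby("attributed_school")[align_cols].mean()

# Distance from every manuscript to its own school's centroid
own_centroid = centroids.loc[ms["attributed_school"].tolist()].to_numpy()
ms["dist_to_school"] = np.linalg.norm(ms[align_cols].to_numpy() - own_centroid, axis=1)

print(ms[["manuscript_id", "is_forgery", "dist_to_school"]].round(3))
```

In this toy table the suspect manuscript has by far the largest distance, because its Water School alignment is high for a text attributed to the Stone School.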

In [None]:
# Look at manuscript feature vectors
print("Manuscript Feature Vectors:")
print("="*90)
ms_cols = ['manuscript_id', 'attributed_author', 'is_forgery',
           'school_alignment_stone', 'school_alignment_water', 'school_alignment_pebble']
print(manuscripts[ms_cols].head(12).to_string(index=False))

In [None]:
# Visualize manuscripts in 2D: Stone vs Water alignment
fig, ax = plt.subplots(figsize=(11, 8))

# Separate authentic and forged
# (cast to bool so ~ is a logical NOT even if the column is stored as 0/1)
is_forged = manuscripts['is_forgery'].astype(bool)
authentic = manuscripts[~is_forged]
forged = manuscripts[is_forged]

ax.scatter(authentic['school_alignment_stone'], authentic['school_alignment_water'],
           c='steelblue', s=50, alpha=0.6, label=f'Authentic (n={len(authentic)})', edgecolor='white')
ax.scatter(forged['school_alignment_stone'], forged['school_alignment_water'],
           c='crimson', s=100, alpha=0.9, label=f'Forged (n={len(forged)})', edgecolor='black', marker='X')

ax.set_xlabel('Stone School Alignment', fontsize=12)
ax.set_ylabel('Water School Alignment', fontsize=12)
ax.set_title('Manuscripts in "Philosophical Space"\n(Forgeries often show suspicious mixed signals)', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice: Forgeries (red X) often appear between the natural clusters.")
print("Mink Pavar's forgeries show Water School contamination despite being")
print("attributed to Stone School scholars like Grigsu Haldo.")

## Part 7: The Curse of Dimensions

We visualized 2-3 features easily. But real data has many more dimensions:

- A manuscript might have 100 stylometric features
- A creature might be described by 50 attributes  
- A single expedition record has dozens of variables

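We can't visualize 100 dimensions, but **the math works identically**. What changes is our intuition: in very high dimensions, points tend to sit at similar distances from one another, which is the "curse" in this section's title.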

The distance between two n-dimensional vectors is computed the same way:

$$\text{distance}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

In [None]:
# Distance formula works in ANY dimension
print("The Euclidean Distance Formula Works in Any Dimension:")
print("="*60)

# 2D distance
a_2d = np.array([1, 2])
b_2d = np.array([4, 6])
dist_2d = np.sqrt(np.sum((a_2d - b_2d)**2))
print(f"\n2D:   a = {a_2d}, b = {b_2d}")
print(f"      distance = {dist_2d:.4f}")

# 3D distance
a_3d = np.array([1, 2, 3])
b_3d = np.array([4, 6, 8])
dist_3d = np.sqrt(np.sum((a_3d - b_3d)**2))
print(f"\n3D:   a = {a_3d}, b = {b_3d}")
print(f"      distance = {dist_3d:.4f}")

# 10D distance (our full creature vectors)
idx_bat = creature_vectors[creature_vectors['common_name'] == 'Cave Bat'].index[0]
idx_witch = creature_vectors[creature_vectors['common_name'] == 'Witch Creature'].index[0]

all_features = behavioral_features + habitat_features
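# Cast to float so NumPy treats the rows as numeric arrays rather than object arrays
vec_bat = creature_vectors.loc[idx_bat, all_features].values.astype(float)
vec_witch = creature_vectors.loc[idx_witch, all_features].values.astype(float)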
dist_10d = np.sqrt(np.sum((vec_bat - vec_witch)**2))
print(f"\n10D:  Cave Bat vs Witch Creature (all features)")
print(f"      distance = {dist_10d:.4f}")

# 100D distance - same formula!
a_100d = np.random.randn(100)
b_100d = np.random.randn(100)
dist_100d = np.sqrt(np.sum((a_100d - b_100d)**2))
print(f"\n100D: Random vectors")
print(f"      distance = {dist_100d:.4f}")

print("\nThe formula is always: sqrt(sum of squared differences)")
print("Geometry works even when we can't visualize it!")

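Although the formula is unchanged, distances themselves behave differently as dimensions are added: for random points, the nearest and farthest neighbours end up at nearly the same distance, so "close" carries less information. The sketch below illustrates this effect, often called distance concentration, with uniformly random points (synthetic data, not the creature catalog). It reports the ratio of the farthest to the nearest distance from one reference point.

```python
import numpy as np

rng = np.random.default_rng(0)

def farthest_to_nearest_ratio(n_dims, n_points=500):
    """Ratio of the largest to the smallest distance from one random
    reference point to n_points other random points in [0, 1]^n_dims."""
    reference = rng.random(n_dims)
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points - reference, axis=1)
    return dists.max() / dists.min()

ratios = {d: farthest_to_nearest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>5} dimensions: farthest / nearest = {r:.2f}")
```

The ratio shrinks toward 1 as the dimension grows, which is why distance-based methods often benefit from feature selection or dimensionality reduction.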
## Summary

| Concept | Key Insight | Densworld Example |
|---------|-------------|-------------------|
| **Vector** | An ordered list of numbers | Creature behavioral traits: [0.1, 0.95, 0.95, 0.2, 0.8] |
| **Archivist's View** | Vector as a data record (row in a table) | Each creature is a catalog entry |
| **Cartographer's View** | Vector as coordinates in n-dimensional space | Each creature is a point in "creature space" |
| **Similarity = Proximity** | Similar things are close in feature space | Cave Bat and Yeller Bat cluster together |
| **Distance** | Euclidean formula works in any dimension | sqrt(sum of squared differences) |
| **Vector Magnitude** | Length of the arrow (norm) | Marsh Hornet is "extreme"; Mud Worm is "mild" |
| **High Dimensions** | The distance formula is unchanged, even where we can't visualize the space | 10-feature creature vectors, 100-feature manuscripts |

---

## Exercises

### Exercise 1: Nearest Neighbor Recommendation

An expedition team just encountered a **Coil Tube Serpent**. Find the 3 most similar creatures (by full 10-feature vector) to help them prepare for what else they might encounter in that area.

In [None]:
# Exercise 1: Find 3 nearest neighbors to Coil Tube Serpent
# Hint: Use the creature_similarity DataFrame or calculate distances manually

# Your code here


### Exercise 2: Manuscript Clustering

Plot all manuscripts colored by their `attributed_school`. Do authentic manuscripts form natural clusters by school? Where do forgeries tend to fall?

In [None]:
# Exercise 2: Manuscript clustering visualization
# Hint: Use different colors for each attributed_school
# Mark forgeries with a different marker shape

# Your code here


### Exercise 3: Behavioral vs Habitat Similarity

Some creatures are behaviorally similar but live in different habitats (e.g., nocturnal hunters in caves vs. swamps). Find a pair of creatures that have:
- **Small** behavioral distance (similar behavior)
- **Large** habitat distance (different environments)

What might explain this pattern?

In [None]:
# Exercise 3: Find creatures with similar behavior but different habitats
# Hint: Calculate both distance matrices and look for large differences

# Your code here


### Exercise 4: Vector Arithmetic

Create a "hypothetical creature" by averaging the feature vectors of the Cave Bat and the Stakdur. What would this hybrid creature's behavioral profile look like? Is it close to any real creature in the catalog?

In [None]:
# Exercise 4: Vector arithmetic - create a hybrid creature
# Hint: (vector_bat + vector_stakdur) / 2

# Your code here


---

## Next Lesson

In **Lesson 2: Vector Norms**, we'll explore different ways to measure distance—and why the choice matters. The L1 "Manhattan" distance tells a different story than the L2 "Euclidean" distance, with real consequences for creature classification and manuscript analysis.

*"The shortest path between two points depends on how you're allowed to travel. In the Dens, where tunnels twist and branch, the straight line is a fantasy."*  
— The Pickbox Man