[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/buildLittleWorlds/ml-math-with-densworld/blob/main/modules/02-linear-algebra/notebooks/05-rank-independence.ipynb)

# Lesson 5: Rank and Linear Independence

*"The Archives contain ten thousand manuscripts, but only three philosophical schools. Most variation is redundant—knowing the Stone School alignment almost determines the others. The true dimension of the data is far smaller than it appears."*  
— Archivist's note, Capital Archives

---

## The Core Question

When we add a new feature to our dataset, are we actually adding *new information*? Or is the new feature just a combination of what we already have?

Consider the manuscript school alignments:
- Stone School alignment
- Water School alignment  
- Pebble School alignment

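If these three always sum to 1.0 (a manuscript must align with *some* school), then the third is completely determined by the first two: `pebble = 1 - stone - water`. We have **3 columns but only 2 independent dimensions**. Strictly speaking, the sum-to-1 constraint is affine rather than linear, so the redundancy shows up in the matrix rank once the columns are centered or an intercept column is added, which is exactly the situation in regression.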

This matters because:
- **Multicollinearity** makes linear regression coefficients unstable or, with exact dependence, non-unique
- **Dimensionality reduction** exploits redundancy
- **Feature selection** removes redundant variables
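
A minimal sketch of the sum-to-1 case above, using synthetic data rather than the Archive manuscripts (the Dirichlet sample and all variable names here are assumptions for illustration). It shows where the redundancy appears: the raw three-column matrix typically still has rank 3, while the centered matrix and the matrix with an added intercept column both lose one dimension.

```python
import numpy as np
from numpy.linalg import matrix_rank

rng = np.random.default_rng(0)

# Hypothetical alignments: every row is non-negative and sums to 1
alignments = rng.dirichlet([2.0, 2.0, 2.0], size=200)

# Raw columns: the sum-to-1 constraint is affine, so rank is usually still 3
raw_rank = matrix_rank(alignments)

# Centered columns: the constraint becomes "rows sum to 0", a linear dependence
centered_rank = matrix_rank(alignments - alignments.mean(axis=0))

# Regression design matrix: intercept column equals the sum of the three columns
design = np.column_stack([np.ones(len(alignments)), alignments])
design_rank = matrix_rank(design)

print(f"raw rank: {raw_rank}")
print(f"centered rank: {centered_rank}")
print(f"design matrix: {design.shape[1]} columns, rank {design_rank}")
```

This is why the problem bites in regression: the intercept column is itself a combination of the three alignment columns.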

---

## Learning Objectives

By the end of this lesson, you will:
1. Understand linear independence and dependence
2. Calculate and interpret matrix rank
3. Detect multicollinearity in real datasets
4. See why low rank causes problems in regression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.linalg import matrix_rank, svd

# Set random seed for reproducibility
np.random.seed(42)

# Nice plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Colab-ready data loading
BASE_URL = "https://raw.githubusercontent.com/buildLittleWorlds/ml-math-with-densworld/main/data/"

# Load our datasets
creature_vectors = pd.read_csv(BASE_URL + "creature_vectors.csv")
manuscripts = pd.read_csv(BASE_URL + "manuscript_features.csv")
expeditions = pd.read_csv(BASE_URL + "expedition_outcomes.csv")

print(f"Loaded {len(creature_vectors)} creatures")
print(f"Loaded {len(manuscripts)} manuscripts")
print(f"Loaded {len(expeditions)} expedition records")

## Part 1: Linear Independence — The Intuition

A set of vectors is **linearly independent** if no vector in the set can be written as a linear combination (a weighted sum) of the others.

**Independent example:**
- $\mathbf{v}_1 = [1, 0]$ (points east)
- $\mathbf{v}_2 = [0, 1]$ (points north)

No multiple of east can give you north. Together, these two vectors span the full 2D plane.

**Dependent example:**
- $\mathbf{v}_1 = [1, 0]$ (points east)
- $\mathbf{v}_2 = [2, 0]$ (also points east, just longer)

Both point in the same direction! $\mathbf{v}_2 = 2 \cdot \mathbf{v}_1$. They only span a 1D line.
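
The two examples above can also be checked numerically. The sketch below uses a hypothetical helper, `is_linearly_independent` (not part of NumPy), which stacks the vectors as columns and tests whether the rank equals the number of vectors.

```python
import numpy as np
from numpy.linalg import matrix_rank


def is_linearly_independent(vectors):
    """Hypothetical helper: True if the vectors are linearly independent.

    Stacks the vectors as columns and checks that the matrix rank
    equals the number of vectors.
    """
    M = np.column_stack(vectors)
    return bool(matrix_rank(M) == M.shape[1])


east = np.array([1, 0])
north = np.array([0, 1])
long_east = np.array([2, 0])

independent_pair = is_linearly_independent([east, north])
dependent_pair = is_linearly_independent([east, long_east])

print(f"east, north independent?      {independent_pair}")
print(f"east, long_east independent?  {dependent_pair}")
```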

In [None]:
# Visualize linear independence vs dependence
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Independent vectors
ax = axes[0]
v1 = np.array([1, 0])
v2 = np.array([0, 1])

ax.arrow(0, 0, v1[0]*0.9, v1[1], head_width=0.05, fc='blue', ec='blue', linewidth=2, label='v₁ = [1, 0]')
ax.arrow(0, 0, v2[0], v2[1]*0.9, head_width=0.05, fc='red', ec='red', linewidth=2, label='v₂ = [0, 1]')

# Show span (the whole plane)
ax.fill([-1.2, 1.2, 1.2, -1.2], [-1.2, -1.2, 1.2, 1.2], alpha=0.1, color='green')
ax.text(0.7, 0.7, 'Span = entire 2D plane', fontsize=11, style='italic')

ax.set_xlim(-1.2, 1.2)
ax.set_ylim(-1.2, 1.2)
ax.set_aspect('equal')
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_title('Linearly INDEPENDENT\n(Span 2D space)', fontsize=13)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

# Dependent vectors
ax = axes[1]
v1 = np.array([1, 0])
v2 = np.array([2, 0])

ax.arrow(0, 0, v1[0]*0.9, v1[1], head_width=0.05, fc='blue', ec='blue', linewidth=2, label='v₁ = [1, 0]')
ax.arrow(0, 0, v2[0]*0.45, v2[1], head_width=0.05, fc='red', ec='red', linewidth=2, label='v₂ = [2, 0]')

# Show span (just the x-axis)
ax.axhline(0, color='green', linewidth=4, alpha=0.3)
ax.text(0.5, 0.3, 'Span = only x-axis!', fontsize=11, style='italic')
ax.text(0.5, -0.3, '(v₂ = 2·v₁)', fontsize=10, color='gray')

ax.set_xlim(-0.5, 2.5)
ax.set_ylim(-1.2, 1.2)
ax.set_aspect('equal')
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_title('Linearly DEPENDENT\n(Only span 1D line)', fontsize=13)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 2: Matrix Rank — Counting Independent Dimensions

The **rank** of a matrix is the number of linearly independent columns (or rows—they're equal!).

- If a 5×5 matrix has rank 5: **full rank** (all columns independent)
- If a 5×5 matrix has rank 3: **rank deficient** (only 3 independent columns, 2 are redundant)

Rank tells you the "true dimensionality" of your data, regardless of how many columns you have.
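
As a rough sketch of how a numerical rank can be computed, the block below counts the singular values that sit above a small tolerance and compares the result with `matrix_rank` on the matrix and on its transpose. The tolerance formula mirrors the default that NumPy documents for `matrix_rank`; the variable names are illustrative.

```python
import numpy as np
from numpy.linalg import matrix_rank, svd

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)

# Singular values only (no U or V needed)
singular_values = svd(A, compute_uv=False)

# Treat anything below this tolerance as numerically zero
tol = singular_values.max() * max(A.shape) * np.finfo(A.dtype).eps
rank_by_hand = int(np.sum(singular_values > tol))

col_rank = matrix_rank(A)     # rank from the columns
row_rank = matrix_rank(A.T)   # rank from the rows (columns of the transpose)

print(f"singular values: {singular_values}")
print(f"rank by counting: {rank_by_hand}, matrix_rank(A): {col_rank}, matrix_rank(A.T): {row_rank}")
```

Because floating-point arithmetic rarely produces an exact zero, rank in practice always depends on a tolerance like this one.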

In [None]:
# Demonstrate rank calculation
print("Matrix Rank Examples:")
print("="*60)

# Full rank matrix (3x3 with rank 3)
A_full = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
print(f"\nIdentity Matrix (3×3):")
print(A_full)
print(f"Rank: {matrix_rank(A_full)}  (Full rank: 3 independent columns)")

# Rank-deficient matrix (3x3 with rank 2)
A_deficient = np.array([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]])
print(f"\nSequential Matrix (3×3):")
print(A_deficient)
print(f"Rank: {matrix_rank(A_deficient)}  (Column 3 = -col1 + 2*col2... deficient!)")

# Obvious dependency
A_obvious = np.array([[1, 2],
                      [2, 4],
                      [3, 6]])
print(f"\nObvious Dependency (3×2):")
print(A_obvious)
print(f"Rank: {matrix_rank(A_obvious)}  (Column 2 = 2 × Column 1)")

## Part 3: Manuscript School Alignments — A Real Example

In the Archives, manuscripts are rated for alignment to three philosophical schools. But do we really have 3 independent pieces of information?

Let's investigate...

In [None]:
# Extract school alignment features
school_features = ['school_alignment_stone', 'school_alignment_water', 'school_alignment_pebble']
school_matrix = manuscripts[school_features].values

print("School Alignment Matrix (first 10 manuscripts):")
print("="*60)
print(manuscripts[['manuscript_id'] + school_features].head(10).to_string(index=False))

# Check if they sum to a constant
row_sums = school_matrix.sum(axis=1)
print(f"\nRow sums (should be constant if dependent):")
print(f"  Min: {row_sums.min():.4f}")
print(f"  Max: {row_sums.max():.4f}")
print(f"  Mean: {row_sums.mean():.4f}")
print(f"  Std: {row_sums.std():.4f}")

In [None]:
# Calculate rank of school alignment matrix
print("\nRank Analysis of School Alignments:")
print("="*60)

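rank = matrix_rank(school_matrix)
# A sum-to-constant constraint is affine, so it only lowers the rank
# after centering (or once an intercept column is added)
centered_rank = matrix_rank(school_matrix - school_matrix.mean(axis=0))
n_cols = school_matrix.shape[1]
print(f"\nMatrix shape: {school_matrix.shape}")
print(f"Matrix rank (raw): {rank}")
print(f"Matrix rank (centered): {centered_rank}")

if centered_rank < n_cols:
    print(f"\n⚠️  RANK DEFICIENT!")
    print(f"   We have {n_cols} columns but only {centered_rank} independent dimensions after centering.")
    print(f"   {n_cols - centered_rank} column(s) are redundant.")
else:
    print(f"\n✓ Full rank: all {centered_rank} columns are independent, even after centering.")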

In [None]:
# Visualize the near-dependency
# Build the figure by hand so the 3D panel is not drawn over an empty 2D axes
fig = plt.figure(figsize=(14, 5))

# 3D scatter - if truly 3D, points fill space; if 2D, they lie on a plane
ax = fig.add_subplot(121, projection='3d')
ax.scatter(school_matrix[:, 0], school_matrix[:, 1], school_matrix[:, 2], alpha=0.5)
ax.set_xlabel('Stone')
ax.set_ylabel('Water')
ax.set_zlabel('Pebble')
ax.set_title('Manuscripts in 3D School Space\n(Do they fill 3D or lie on a plane?)')

# Pairwise correlations
ax = fig.add_subplot(122)
corr_matrix = np.corrcoef(school_matrix.T)
im = ax.imshow(corr_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
ax.set_xticks([0, 1, 2])
ax.set_yticks([0, 1, 2])
ax.set_xticklabels(['Stone', 'Water', 'Pebble'])
ax.set_yticklabels(['Stone', 'Water', 'Pebble'])
ax.set_title('Correlation Matrix\n(High |correlation| = dependence)')

# Add correlation values
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{corr_matrix[i,j]:.2f}', ha='center', va='center', fontsize=12)

plt.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.show()

## Part 4: Singular Value Decomposition (SVD) — Finding the True Dimensions

The **Singular Value Decomposition** reveals the "true" dimensions of your data:

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

The **singular values** (diagonal of $\Sigma$) tell you how much "energy" is in each dimension:
- Large singular values = important, real dimensions
- Tiny singular values = noise or redundant dimensions

If some singular values are near zero, those dimensions are effectively empty—the data doesn't vary in those directions.
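
To make "effectively empty" concrete, here is a small sketch on synthetic data (the matrix below is an assumption for illustration, not the Archive data). The third column is almost the sum of the first two, so the third singular value is tiny, and dropping it loses almost nothing: the Frobenius-norm error of the rank-$k$ reconstruction equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np
from numpy.linalg import norm, svd

rng = np.random.default_rng(0)

# Hypothetical data: 50 rows, third column is nearly the sum of the first two
base = rng.normal(size=(50, 2))
third = base.sum(axis=1) + rng.normal(scale=0.01, size=50)
A = np.column_stack([base, third])

U, S, Vt = svd(A, full_matrices=False)

# Keep only the k largest singular values
k = 2
A_k = (U[:, :k] * S[:k]) @ Vt[:k]

# norm() on a 2D array defaults to the Frobenius norm
error = norm(A - A_k)
expected_error = np.sqrt(np.sum(S[k:] ** 2))

print(f"singular values: {S}")
print(f"reconstruction error: {error:.6f}")
print(f"sqrt(sum of discarded s^2): {expected_error:.6f}")
```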

In [None]:
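# Perform SVD on the centered school alignments
# (after centering, squared singular values are proportional to variance)
school_centered = school_matrix - school_matrix.mean(axis=0)
U, S, Vt = svd(school_centered, full_matrices=False)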

print("Singular Value Decomposition:")
print("="*60)
print(f"\nSingular values: {S}")
print(f"\nExplained variance ratio:")

variance_explained = (S ** 2) / np.sum(S ** 2)
cumulative_variance = np.cumsum(variance_explained)

for i, (s, var, cum) in enumerate(zip(S, variance_explained, cumulative_variance)):
    print(f"  Dimension {i+1}: singular value = {s:.4f}, variance = {var:.2%}, cumulative = {cum:.2%}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax = axes[0]
ax.bar(range(1, len(S)+1), S, color='steelblue')
ax.set_xlabel('Dimension')
ax.set_ylabel('Singular Value')
ax.set_title('Singular Values\n(Drop-off indicates effective dimensionality)')
ax.set_xticks(range(1, len(S)+1))

ax = axes[1]
ax.bar(range(1, len(variance_explained)+1), variance_explained, color='steelblue', label='Individual')
ax.plot(range(1, len(cumulative_variance)+1), cumulative_variance, 'ro-', label='Cumulative')
ax.axhline(0.95, color='green', linestyle='--', label='95% threshold')
ax.set_xlabel('Dimension')
ax.set_ylabel('Variance Explained')
ax.set_title('Variance Explained by Each Dimension')
ax.set_xticks(range(1, len(S)+1))
ax.legend()

plt.tight_layout()
plt.show()

## Part 5: Multicollinearity — When Regression Breaks

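If features are linearly dependent (or nearly so), **linear regression coefficients become unreliable**. With exact dependence, infinitely many coefficient vectors fit the data equally well. With near dependence, the solution is unique, but small changes in the data can swing the coefficients dramatically. Either way, the model can't decide which feature to "credit" for the effect, even though its predictions may stay accurate.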
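
A minimal sketch of the exact-dependence case, with made-up numbers (the feature values and coefficient vectors are assumptions for illustration). When `supplies` is exactly `2 * crew`, very different coefficient vectors produce identical predictions, so the data cannot single one of them out.

```python
import numpy as np
from numpy.linalg import matrix_rank

rng = np.random.default_rng(0)
crew = rng.uniform(10, 50, size=20)
supplies = 2 * crew                      # exactly collinear, no noise
X = np.column_stack([crew, supplies])

beta_a = np.array([0.5, 0.0])            # all credit to crew
beta_b = np.array([0.0, 0.25])           # all credit to supplies
beta_c = np.array([10.5, -5.0])          # large weights that cancel out

pred_a, pred_b, pred_c = X @ beta_a, X @ beta_b, X @ beta_c
same_predictions = bool(np.allclose(pred_a, pred_b) and np.allclose(pred_a, pred_c))

# Explicit tolerance so the check does not hinge on rounding in the SVD
x_rank = matrix_rank(X, tol=1e-8)

print(f"identical predictions: {same_predictions}")
print(f"columns: {X.shape[1]}, rank: {x_rank}")
```

Adding a little noise to `supplies` restores a unique least-squares solution, but it remains highly sensitive to the data.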

Let's create a scenario where this happens and see the consequences.

In [None]:
# Create synthetic example: predicting expedition success
# with nearly collinear features

np.random.seed(42)
n = 100

# Feature 1: crew_size
crew_size = np.random.uniform(10, 50, n)

# Feature 2: supplies (highly correlated with crew_size!)
supplies = 2 * crew_size + np.random.normal(0, 2, n)

# Feature 3: independent feature
weather_score = np.random.uniform(0, 1, n)

# Target: expedition success (truly depends on crew_size and weather)
true_success = 0.5 * crew_size + 20 * weather_score + np.random.normal(0, 5, n)

# Combine into matrix
X = np.column_stack([crew_size, supplies, weather_score])
y = true_success

print("Synthetic Expedition Data:")
print("="*60)
print(f"Features: crew_size, supplies (≈2×crew_size), weather_score")
print(f"True relationship: success = 0.5×crew + 20×weather + noise")
print(f"")
print(f"Correlation matrix:")
corr = np.corrcoef(X.T)
print(f"                crew_size  supplies  weather")
print(f"  crew_size     {corr[0,0]:.3f}      {corr[0,1]:.3f}     {corr[0,2]:.3f}")
print(f"  supplies      {corr[1,0]:.3f}      {corr[1,1]:.3f}     {corr[1,2]:.3f}")
print(f"  weather       {corr[2,0]:.3f}      {corr[2,1]:.3f}     {corr[2,2]:.3f}")
print(f"")
print(f"⚠️  crew_size and supplies have correlation {corr[0,1]:.3f}!")

In [None]:
# Fit linear regression and examine coefficients
from numpy.linalg import lstsq, cond

# Add intercept
X_with_intercept = np.column_stack([np.ones(n), X])

# Fit model
# Use a distinct name so the school-alignment `rank` from Part 3 is not overwritten
coefficients, residuals, lstsq_rank, singular_values = lstsq(X_with_intercept, y, rcond=None)

print("Linear Regression with Collinear Features:")
print("="*60)
print(f"\nFitted coefficients:")
print(f"  Intercept:    {coefficients[0]:>10.4f}")
print(f"  crew_size:    {coefficients[1]:>10.4f}  (true: 0.5)")
print(f"  supplies:     {coefficients[2]:>10.4f}  (true: 0.0)")
print(f"  weather:      {coefficients[3]:>10.4f}  (true: 20.0)")

print(f"\nMatrix condition number: {cond(X_with_intercept):.2f}")
print("  (High condition number = unstable coefficients)")

In [None]:
# Show coefficient instability: small changes in data → big changes in coefficients
print("Coefficient Instability Test:")
print("="*60)
print("\nFitting on different random subsets of data...\n")

bootstrap_coeffs = []
for i in range(10):
    # Random subsample
    idx = np.random.choice(n, size=n//2, replace=False)
    X_sub = X_with_intercept[idx]
    y_sub = y[idx]
    
    coeffs, _, _, _ = lstsq(X_sub, y_sub, rcond=None)
    bootstrap_coeffs.append(coeffs)
    print(f"  Sample {i+1}: crew_size={coeffs[1]:>7.3f}, supplies={coeffs[2]:>7.3f}, weather={coeffs[3]:>7.3f}")

bootstrap_coeffs = np.array(bootstrap_coeffs)
print(f"\nCoefficient standard deviations:")
print(f"  crew_size:  {bootstrap_coeffs[:,1].std():.4f}")
print(f"  supplies:   {bootstrap_coeffs[:,2].std():.4f}")
print(f"  weather:    {bootstrap_coeffs[:,3].std():.4f}")
print(f"\n⚠️  crew_size and supplies swap their weights wildly!")
print(f"   The model can't tell which one matters.")

## Part 6: Detecting Collinearity in Creature Data

Let's check if our creature behavioral features have any hidden dependencies.

In [None]:
# Analyze creature behavioral features
behavioral_features = ['aggression', 'sociality', 'nocturnality', 'territoriality', 'hunting_strategy']
X_creatures = creature_vectors[behavioral_features].values

print("Creature Behavioral Features — Collinearity Check:")
print("="*60)

# Matrix rank
print(f"\nMatrix shape: {X_creatures.shape}")
print(f"Matrix rank: {matrix_rank(X_creatures)}")

# Correlation matrix
print(f"\nCorrelation matrix:")
corr = np.corrcoef(X_creatures.T)
print(f"{'':>15}", end='')
for f in behavioral_features:
    print(f"{f[:8]:>10}", end='')
print()
for i, f in enumerate(behavioral_features):
    print(f"{f:>15}", end='')
    for j in range(len(behavioral_features)):
        print(f"{corr[i,j]:>10.2f}", end='')
    print()

# SVD to find effective dimensionality
_, S, _ = svd(X_creatures - X_creatures.mean(axis=0), full_matrices=False)  # Center first
variance_explained = (S ** 2) / np.sum(S ** 2)
cumulative = np.cumsum(variance_explained)

print(f"\nSingular value analysis:")
for i, (s, var, cum) in enumerate(zip(S, variance_explained, cumulative)):
    print(f"  Dim {i+1}: σ={s:.3f}, var={var:.1%}, cumulative={cum:.1%}")

In [None]:
# Visualize correlation heatmap
fig, ax = plt.subplots(figsize=(9, 7))

im = ax.imshow(corr, cmap='RdBu_r', vmin=-1, vmax=1)
ax.set_xticks(range(len(behavioral_features)))
ax.set_yticks(range(len(behavioral_features)))
ax.set_xticklabels([f.replace('_', '\n') for f in behavioral_features], fontsize=10)
ax.set_yticklabels(behavioral_features, fontsize=10)

# Add correlation values
for i in range(len(behavioral_features)):
    for j in range(len(behavioral_features)):
        color = 'white' if abs(corr[i,j]) > 0.5 else 'black'
        ax.text(j, i, f'{corr[i,j]:.2f}', ha='center', va='center', fontsize=11, color=color)

ax.set_title('Creature Behavioral Feature Correlations', fontsize=13)
plt.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.show()

print("\nInterpretation:")
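# Pull the numbers from the data instead of hard-coding them
print(f"- Aggression and territoriality are positively correlated ({corr[0, 3]:.2f})")
print(f"- Sociality is negatively correlated with aggression ({corr[1, 0]:.2f})")
print("- These correlations are meaningful but not perfect collinearity")
print(f"- Rank {matrix_rank(X_creatures)} of {len(behavioral_features)}: full rank means every feature contributes some unique information")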

## Part 7: Variance Inflation Factor (VIF) — The Standard Tool

The **Variance Inflation Factor** measures how much a feature's regression coefficient variance is "inflated" due to collinearity with other features.

- VIF = 1: No collinearity
- VIF = 5: Moderate (coefficient variance 5× higher than if independent)
- VIF > 10: Severe collinearity problem!
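
For feature $j$, the definition is $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the other features. As a cross-check, the sketch below uses the equivalent identity that the VIFs are the diagonal of the inverse correlation matrix, and compares it with the two-feature special case $1 / (1 - r^2)$. The synthetic features `x1` and `x2` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=n)   # correlated with x1
X = np.column_stack([x1, x2])

# Correlation matrix of the columns
R = np.corrcoef(X, rowvar=False)

# VIFs are the diagonal entries of the inverse correlation matrix
vif_from_inverse = np.diag(np.linalg.inv(R))

# With only two features, R_j^2 is just the squared pairwise correlation
r = R[0, 1]
vif_from_r = 1.0 / (1.0 - r ** 2)

print(f"correlation: {r:.3f}")
print(f"VIF from inverse correlation matrix: {vif_from_inverse}")
print(f"VIF from 1 / (1 - r^2): {vif_from_r:.3f}")
```

The inverse-matrix route is convenient for a quick check, but it breaks down when the correlation matrix is singular, which is the exact-collinearity case.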

In [None]:
def calculate_vif(X):
    """Calculate Variance Inflation Factor for each feature."""
    vifs = []
    for i in range(X.shape[1]):
        # Regress feature i on all other features
        y_i = X[:, i]
        X_others = np.delete(X, i, axis=1)
        X_others = np.column_stack([np.ones(len(X_others)), X_others])  # Add intercept
        
        # Get predictions
        coeffs, _, _, _ = np.linalg.lstsq(X_others, y_i, rcond=None)
        y_pred = X_others @ coeffs
        
        # Calculate R²
        ss_res = np.sum((y_i - y_pred) ** 2)
        ss_tot = np.sum((y_i - y_i.mean()) ** 2)
        r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else 0
        
        # VIF = 1 / (1 - R²)
        vif = 1 / (1 - r_squared) if r_squared < 1 else np.inf
        vifs.append(vif)
    
    return np.array(vifs)

# Calculate VIF for creature behavioral features
vifs = calculate_vif(X_creatures)

print("Variance Inflation Factors (VIF) for Creature Features:")
print("="*60)
print(f"\n{'Feature':<20} {'VIF':>10} {'Interpretation':>25}")
print("-"*60)

for feature, vif in zip(behavioral_features, vifs):
    if vif < 2:
        interp = "Low collinearity"
    elif vif < 5:
        interp = "Moderate"
    elif vif < 10:
        interp = "High - investigate"
    else:
        interp = "⚠️ SEVERE"
    print(f"{feature:<20} {vif:>10.2f} {interp:>25}")

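# Report what the data actually shows instead of assuming the outcome
if np.all(vifs < 5):
    print(f"\n✓ All VIFs < 5: No severe multicollinearity in creature features!")
else:
    print(f"\n⚠️ Some VIFs are 5 or higher: investigate those features.")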

In [None]:
# Compare to our synthetic collinear data
print("\nVIF for Synthetic Data (crew_size, supplies, weather):")
print("="*60)

vifs_synthetic = calculate_vif(X)
synthetic_features = ['crew_size', 'supplies', 'weather_score']

print(f"\n{'Feature':<15} {'VIF':>10}")
print("-"*30)
for feature, vif in zip(synthetic_features, vifs_synthetic):
    flag = "⚠️ COLLINEAR!" if vif > 10 else ""
    print(f"{feature:<15} {vif:>10.2f}  {flag}")

print(f"\n⚠️ crew_size and supplies both have VIF > 10!")
print(f"   This confirms the near-perfect collinearity we created.")

## Summary

| Concept | Key Insight | Densworld Example |
|---------|-------------|-------------------|
| **Linear Independence** | Can't express one vector from others | East and North directions |
| **Linear Dependence** | Redundant information | supplies ≈ 2 × crew_size |
| **Matrix Rank** | Count of independent columns | 3 school columns → rank 2 after centering |
| **SVD** | Reveals true dimensionality | Singular values show real dims |
| **Multicollinearity** | Near-dependent features destabilize regression | Unstable coefficients |
| **VIF** | Standard diagnostic for collinearity | VIF > 10 = problem |

---

## Exercises

### Exercise 1: Habitat Feature Independence

Calculate the rank and VIF for the creature habitat features. Are there any hidden dependencies?

In [None]:
# Exercise 1: Your code here
# habitat_features = ['depth_preference', 'moisture_preference', 'light_tolerance', 
#                     'cave_affinity', 'surface_affinity']



### Exercise 2: Creating Perfect Collinearity

Create a new feature that is an exact linear combination of existing behavioral features (e.g., `threat_score = aggression + territoriality`). Add it to the feature matrix and verify that the rank doesn't increase.

In [None]:
# Exercise 2: Your code here



### Exercise 3: Dimensionality Reduction Preview

Using SVD, project the 5 behavioral features down to 2 dimensions (keeping only the first 2 principal components). Plot all creatures in this 2D space. Do similar creatures cluster together?

In [None]:
# Exercise 3: Your code here
# Hint: with U and S from the SVD of the centered feature matrix,
#       U[:, :2] @ np.diag(S[:2]) gives the 2D coordinates



### Exercise 4: Fixing Multicollinearity

In the synthetic expedition data, we have collinear features (crew_size and supplies). Try these fixes:
1. Remove one of the collinear features
2. Create a new feature: ratio = supplies / crew_size

Verify that VIF decreases and regression coefficients become stable.

In [None]:
# Exercise 4: Your code here



---

## Module 2 Complete!

You've completed the Linear Algebra module. You now understand:

1. **Vectors** as both data records and points in space
2. **Norms** for measuring size and distance (L1, L2)
3. **Dot products** for measuring alignment and similarity
4. **Matrix transformations** as functions on vectors
5. **Rank and independence** for detecting redundancy

These concepts are the foundation of:
- Principal Component Analysis (PCA)
- Neural network layers
- Regularization (Ridge and Lasso)
- Recommendation systems
- And much more!

*"The Archives are built on vectors. Every creature, every manuscript, every expedition—all are points in vast spaces. To understand these spaces is to understand the hidden structure of the world."*  
— Boffa Trent