# Chapter 32: Dimension Reduction with PCA

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/32_dimension_reduction.ipynb)

This notebook contains all the executable code examples from Chapter 32 of the BANA 4080 textbook. You can run each code cell and experiment with the examples to deepen your understanding of Principal Component Analysis (PCA) and dimension reduction.

## Learning Objectives

By working through this notebook, you will be able to:

- Understand the curse of dimensionality and why dimension reduction is important
- Explain the difference between feature selection and feature extraction
- Understand how PCA finds directions of maximum variance
- Apply PCA using scikit-learn's `PCA` class
- Interpret principal components, loadings, and explained variance
- Use scree plots and the elbow method to select the number of components
- Transform data to principal component space
- Integrate PCA into machine learning pipelines
- Compare model performance with and without PCA

## Setup: Import Required Libraries

First, let's import all the libraries we'll need throughout this notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting style
plt.style.use('default')

print("Libraries imported successfully!")

---

## Part 1: Understanding PCA with a Simple Example

We'll start with a small dataset of student study habits to understand how PCA works step-by-step.

### Create Student Study Habits Dataset

This simple dataset has 6 students and 4 features that are likely correlated (students who study more tend to do more practice problems, attend more classes, and score higher).

In [None]:
# Create a simple student study habits dataset
data = {
    'Student': ['Alice', 'Bob', 'Carol', 'David', 'Emma', 'Frank'],
    'Hours_Studied': [10, 5, 8, 12, 6, 9],
    'Practice_Problems': [50, 20, 40, 60, 25, 45],
    'Attendance_Pct': [95, 70, 85, 98, 75, 90],
    'Quiz_Score': [88, 65, 78, 92, 68, 82]
}

df = pd.DataFrame(data)
print("Student Study Habits Dataset:")
df

### Step 1: Standardize the Features

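**Why standardize?** PCA is sensitive to feature scales: features with larger variances (often simply those measured in larger units) will dominate the principal components. Standardization (z-score normalization) rescales every feature to mean 0 and standard deviation 1, so all features contribute on an equal footing.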

In [None]:
# Extract features for PCA (exclude student names)
X = df[['Hours_Studied', 'Practice_Problems', 'Attendance_Pct', 'Quiz_Score']].values
feature_names = ['Hours_Studied', 'Practice_Problems', 'Attendance_Pct', 'Quiz_Score']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Show standardized data
print("Standardized features (mean=0, std=1 for each column):")
pd.DataFrame(
    np.round(X_scaled, 2),
    columns=feature_names,
    index=df['Student']
)
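
To make the z-score step concrete, here is a minimal sketch that reproduces the standardization by hand with NumPy and compares it with `StandardScaler`. It assumes the scaler's default settings, which divide by the population standard deviation (`ddof=0`). The names `X_demo`, `X_manual`, and `X_sklearn` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative copy of the four student features used above
X_demo = np.array([
    [10, 50, 95, 88],
    [5, 20, 70, 65],
    [8, 40, 85, 78],
    [12, 60, 98, 92],
    [6, 25, 75, 68],
    [9, 45, 90, 82],
], dtype=float)

# z = (x - column mean) / column standard deviation (population std, ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)

# Compare against scikit-learn's StandardScaler with default settings
X_sklearn = StandardScaler().fit_transform(X_demo)

print("Manual z-scores match StandardScaler:", np.allclose(X_manual, X_sklearn))
print("Column means:", np.round(X_manual.mean(axis=0), 6))
print("Column stds: ", np.round(X_manual.std(axis=0), 6))
```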

### Step 2: Compute the Covariance Matrix

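The covariance matrix shows how pairs of features vary together. Because the features are standardized, the off-diagonal entries behave like correlations, and large values (in absolute terms) indicate redundancy. That redundancy is what PCA exploits to reduce dimensions!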

In [None]:
# Compute covariance matrix (conceptually what PCA works from; scikit-learn typically reaches the same result via an SVD)
cov_matrix = np.cov(X_scaled.T)

print("Covariance Matrix (4x4 for our 4 features):")
print(np.round(cov_matrix, 2))

# Create a labeled version for easier interpretation
cov_df = pd.DataFrame(cov_matrix,
                      columns=feature_names,
                      index=feature_names)
print("\nLabeled Covariance Matrix:")
cov_df.round(2)
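
As a rough sketch of what `np.cov` computes here: for centered columns, the sample covariance is `Z.T @ Z / (n - 1)`. One detail worth noticing is that `StandardScaler` divides by `n` while `np.cov` divides by `n - 1`, so for six students the diagonal comes out as 6/5 = 1.2 rather than exactly 1, and the whole matrix equals the correlation matrix scaled by that factor. The names `X_demo`, `Z`, and `cov_manual` are illustrative.

```python
import numpy as np

# Illustrative copy of the four student features used above
X_demo = np.array([
    [10, 50, 95, 88],
    [5, 20, 70, 65],
    [8, 40, 85, 78],
    [12, 60, 98, 92],
    [6, 25, 75, 68],
    [9, 45, 90, 82],
], dtype=float)

n = X_demo.shape[0]

# Standardize the same way StandardScaler does (population std)
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Sample covariance of centered data: Z^T Z / (n - 1)
cov_manual = Z.T @ Z / (n - 1)
cov_numpy = np.cov(Z.T)          # same call as in the cell above
corr = np.corrcoef(X_demo.T)     # correlation matrix of the raw features

print("Manual formula matches np.cov:", np.allclose(cov_manual, cov_numpy))
print("Equals correlation * n/(n-1): ", np.allclose(cov_manual, corr * n / (n - 1)))
print("Diagonal entries:", np.round(np.diag(cov_manual), 2))
```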

### Steps 3-4: Fit PCA and Extract Components

Now PCA will:
- Find the eigenvectors of the covariance matrix (directions of maximum variance) = **principal components**
- Find the matching eigenvalues (amount of variance along each direction) = **explained variance**

In [None]:
# Fit PCA on our student data (keep all 4 components initially)
pca = PCA(n_components=4)
pca.fit(X_scaled)

print("PCA fitted successfully!")
print("PCA has computed eigenvectors and eigenvalues internally.")

### Examine Principal Component Loadings

**Loadings** show how much each original feature contributes to each principal component. Think of them as the "recipe" for each PC.

In [None]:
# Extract eigenvectors (feature weights/loadings)
components_df = pd.DataFrame(
    pca.components_,
    columns=feature_names,
    index=['PC1', 'PC2', 'PC3', 'PC4']
)

print("Principal Component Loadings:")
print("(How much each feature contributes to each PC)\n")
components_df.round(3)
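
To connect the loadings back to the covariance matrix, the sketch below runs an eigen-decomposition with `np.linalg.eigh` and checks that each row of `components_` satisfies the eigenvector equation `cov @ v = lambda * v`. Eigenvectors are only defined up to sign, so the PC1 comparison uses absolute values. Variable names such as `X_demo`, `Z`, and `pca_demo` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative copy of the four student features used above
X_demo = np.array([
    [10, 50, 95, 88],
    [5, 20, 70, 65],
    [8, 40, 85, 78],
    [12, 60, 98, 92],
    [6, 25, 75, 68],
    [9, 45, 90, 82],
], dtype=float)
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigen-decomposition of the symmetric covariance matrix
cov = np.cov(Z.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA reports them descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are eigenvectors

pca_demo = PCA(n_components=4).fit(Z)
V = pca_demo.components_.T              # columns are principal components
lam = pca_demo.explained_variance_

# Each principal component should satisfy cov @ v = lambda * v
is_eigen = np.allclose(cov @ V, V * lam)
same_values = np.allclose(eigenvalues, lam)
# PC1 direction agrees up to an arbitrary sign flip
same_pc1 = np.allclose(np.abs(eigenvectors[:, 0]), np.abs(V[:, 0]))

print("components_ rows are eigenvectors of cov:", is_eigen)
print("explained_variance_ matches eigenvalues: ", same_values)
print("PC1 matches top eigenvector (up to sign):", same_pc1)
```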

### Examine Eigenvalues (Explained Variance)

**Eigenvalues** tell us how much variance each PC captures. Larger eigenvalues = more important components.

In [None]:
print("Raw Eigenvalues (Explained Variance):")
print(pca.explained_variance_)

print("\nNicely formatted Eigenvalues:")
for i, eigenvalue in enumerate(pca.explained_variance_, 1):
    print(f"PC{i}: {eigenvalue:.3f}")

### Variance Explained Ratios

It's more useful to look at the **proportion** of total variance explained by each component.

In [None]:
# Look at variance explained by each component
print("Variance Explained by Each PC:")
for i, var in enumerate(pca.explained_variance_ratio_, 1):
    print(f"  PC{i}: {var*100:.1f}%")

print("\nCumulative Variance Explained:")
cumsum = np.cumsum(pca.explained_variance_ratio_)
for i, var in enumerate(cumsum, 1):
    print(f"  First {i} PC(s): {var*100:.1f}%")
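
The ratios above are simply each eigenvalue divided by the sum of all eigenvalues. A minimal NumPy-only sketch, assuming the same standardized student data (the names `X_demo`, `Z`, `ratio`, and `cumulative` are illustrative):

```python
import numpy as np

# Illustrative copy of the four student features used above
X_demo = np.array([
    [10, 50, 95, 88],
    [5, 20, 70, 65],
    [8, 40, 85, 78],
    [12, 60, 98, 92],
    [6, 25, 75, 68],
    [9, 45, 90, 82],
], dtype=float)
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigenvalues of the covariance matrix, sorted largest first
eigenvalues = np.linalg.eigvalsh(np.cov(Z.T))[::-1]

# Proportion of total variance carried by each component
ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

print("Total variance (sum of eigenvalues):", round(eigenvalues.sum(), 3))
for i, (r, c) in enumerate(zip(ratio, cumulative), 1):
    print(f"  PC{i}: {r*100:5.1f}%  (cumulative: {c*100:5.1f}%)")
```

The total variance equals the trace of the covariance matrix, which is why the ratios always sum to 1.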

### Choosing the Number of Components

You can choose the number of components based on:
1. **Variance threshold**: Keep the fewest components whose cumulative explained variance reaches a target (for example, 90%)
2. **Fixed number**: Keep a specific number of components (for example, 2 for plotting)

In [None]:
# Method 1: Keep enough components to explain 90% of variance
pca_90 = PCA(n_components=0.90)
pca_90.fit(X_scaled)
print(f"To explain 90% of variance, keep {pca_90.n_components_} components")
print(f"Actual variance explained: {pca_90.explained_variance_ratio_.sum()*100:.1f}%")

print("\n" + "="*60)

# Method 2: Keep exactly 2 components for visualization
pca_2 = PCA(n_components=2)
pca_2.fit(X_scaled)
print(f"\nKeeping 2 components explains {pca_2.explained_variance_ratio_.sum()*100:.1f}% of variance")
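
The variance-threshold rule can be sketched by hand: take the cumulative sum of the explained variance ratios and keep the first position where it reaches the target. `components_for_threshold` below is a hypothetical helper written for this illustration, not a scikit-learn function, and it assumes the target is below 1. The comparison with `PCA(n_components=threshold)` is a sanity check on the same student data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative copy of the four student features used above
X_demo = np.array([
    [10, 50, 95, 88],
    [5, 20, 70, 65],
    [8, 40, 85, 78],
    [12, 60, 98, 92],
    [6, 25, 75, 68],
    [9, 45, 90, 82],
], dtype=float)
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)


def components_for_threshold(ratios, threshold):
    """Hypothetical helper: smallest k whose cumulative ratio reaches the threshold (< 1)."""
    cumulative = np.cumsum(ratios)
    return int(np.argmax(cumulative >= threshold) + 1)


# Ratios from a PCA that keeps every component
ratios = PCA().fit(Z).explained_variance_ratio_

for threshold in (0.90, 0.99):
    k_manual = components_for_threshold(ratios, threshold)
    k_sklearn = PCA(n_components=threshold).fit(Z).n_components_
    print(f"{threshold:.0%} target: manual k = {k_manual}, scikit-learn k = {k_sklearn}")
```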

### Step 5: Transform Data to Principal Component Space

Now we'll **transform** our original 4-feature data into 2-component data.

In [None]:
# Transform the original data into principal component space (using 2 components)
X_pca = pca_2.transform(X_scaled)

print("Dimensionality Reduction:")
print(f"  Original Data Shape:    {X_scaled.shape} (6 students × 4 features)")
print(f"  Transformed Data Shape: {X_pca.shape} (6 students × 2 components)")

# Show the transformed data for our students
pca_df = pd.DataFrame(X_pca,
                      columns=['PC1', 'PC2'],
                      index=df['Student'])
print("\nStudent Data in PC Space:")
print(pca_df.round(2))

# Compare Alice vs Bob. The sign of a PC is arbitrary, so read the direction from the PC1 loadings.
print("\nInterpretation Example:")
print(f"  Alice's PC1 score: {X_pca[0, 0]:+.2f}")
print(f"  Bob's PC1 score:   {X_pca[1, 0]:+.2f}")
print("  → Alice and Bob land far apart on PC1, which summarizes overall study effort and performance")
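
Under the hood, the transform step is a centering followed by a matrix multiplication with the component directions. The sketch below reproduces `transform` by hand from the fitted `mean_` and `components_` attributes, then uses `inverse_transform` to map the 2-component scores back to 4 features and measure how much information the compression lost. It assumes the default `whiten=False`; the names `pca_demo`, `scores_manual`, and `Z_approx` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative copy of the four student features used above
X_demo = np.array([
    [10, 50, 95, 88],
    [5, 20, 70, 65],
    [8, 40, 85, 78],
    [12, 60, 98, 92],
    [6, 25, 75, 68],
    [9, 45, 90, 82],
], dtype=float)
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

pca_demo = PCA(n_components=2).fit(Z)

# Projection by hand: center the data, then multiply by the component directions
scores_manual = (Z - pca_demo.mean_) @ pca_demo.components_.T
scores_sklearn = pca_demo.transform(Z)
print("Manual projection matches transform():", np.allclose(scores_manual, scores_sklearn))

# Map the 2-D scores back to the 4-D standardized space (an approximation)
Z_approx = pca_demo.inverse_transform(scores_sklearn)
reconstruction_mse = np.mean((Z - Z_approx) ** 2)
print(f"Reconstruction MSE with 2 components: {reconstruction_mse:.4f}")
```

A small reconstruction error means the discarded components carried little variance.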

---

## Part 2: Real-World Application - Breast Cancer Classification

Now let's apply PCA to a real dataset with 30 features! This is where PCA really shines.

### Load the Breast Cancer Dataset

This dataset has measurements from tumor cell nuclei images. **30 features** - perfect for dimension reduction!

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

print("Breast Cancer Dataset:")
print(f"  Samples: {X.shape[0]}")
print(f"  Features: {X.shape[1]}")
print(f"  Classes: {len(np.unique(y))} (malignant=0, benign=1)")

print("\nFeature names:")
for i, name in enumerate(data.feature_names, 1):
    print(f"  {i:2d}. {name}")

### Split Data and Standardize

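Always split **before** standardizing: fit the scaler on the training set only, then apply it to both sets, so no information from the test set leaks into preprocessing.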

In [None]:
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")

# Standardize the features (fit on train, transform both)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nScaled training data shape: {X_train_scaled.shape}")
print(f"Number of features: {X_train_scaled.shape[1]}")

### Fit PCA with All Components

First, let's fit PCA with all 30 components to see the full variance breakdown.

In [None]:
# Fit PCA with all components to see the full picture
pca = PCA()
pca.fit(X_train_scaled)

print(f"Number of components: {pca.n_components_}")
print(f"\nFirst 10 components explain:")
for i in range(10):
    print(f"  PC{i+1:2d}: {pca.explained_variance_ratio_[i]*100:5.2f}%")

### Visualize with Scree Plot

A **scree plot** shows the variance explained by each component. Look for the "elbow" where adding more components gives diminishing returns.

In [None]:
# Create scree plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Individual variance explained
ax1.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Principal Component', fontsize=12, fontweight='bold')
ax1.set_ylabel('Variance Explained', fontsize=12, fontweight='bold')
ax1.set_title('Scree Plot: Variance per Component', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Right plot: Cumulative variance explained
cumsum = np.cumsum(pca.explained_variance_ratio_)
ax2.plot(range(1, len(cumsum) + 1), cumsum, 'ro-', linewidth=2, markersize=8)
ax2.axhline(y=0.95, color='green', linestyle='--', linewidth=2, label='95% threshold')
ax2.axhline(y=0.90, color='orange', linestyle='--', linewidth=2, label='90% threshold')
ax2.set_xlabel('Number of Components', fontsize=12, fontweight='bold')
ax2.set_ylabel('Cumulative Variance Explained', fontsize=12, fontweight='bold')
ax2.set_title('Cumulative Variance Explained', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find number of components for 95% variance
n_components_95 = np.argmax(cumsum >= 0.95) + 1
n_components_90 = np.argmax(cumsum >= 0.90) + 1

print(f"Components needed for 95% variance: {n_components_95}")
print(f"Components needed for 90% variance: {n_components_90}")
print(f"\nDimensionality reduction: 30 → {n_components_95} features ({(1-n_components_95/30)*100:.0f}% reduction)")

### Refit PCA with Optimal Number of Components

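Based on the scree plot, let's use 7 components, which capture roughly 90% of the variance in this dataset. Reaching 95% takes a few more components, as the counts printed by the previous cell show.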

In [None]:
# Refit with 7 components
pca_7 = PCA(n_components=7)
pca_7.fit(X_train_scaled)

# Show variance explained by each component
print("Variance explained by each component:")
total = 0
for i, var in enumerate(pca_7.explained_variance_ratio_, 1):
    total += var
    print(f"  PC{i}: {var*100:5.1f}%  (cumulative: {total*100:5.1f}%)")

print(f"\nTotal variance explained by 7 components: {pca_7.explained_variance_ratio_.sum()*100:.1f}%")

### Interpret Principal Component Loadings

Let's examine what each principal component represents by looking at which original features contribute most.

In [None]:
# Create loadings dataframe
loadings_df = pd.DataFrame(
    pca_7.components_.T,
    columns=[f'PC{i}' for i in range(1, 8)],
    index=data.feature_names
)

# Display top contributing features for each PC
print("Top 3 features (by absolute loading) for each PC:")
print("=" * 70)

for pc in loadings_df.columns:
    top_features = loadings_df[pc].abs().nlargest(3)
    print(f"\n{pc}:")
    for feature, _ in top_features.items():
        actual_loading = loadings_df.loc[feature, pc]
        print(f"  {feature:30s}: {actual_loading:+.3f}")

### Transform the Data

Now let's transform both training and test data from 30 features to 7 principal components.

In [None]:
# Transform both training and test data
X_train_pca = pca_7.transform(X_train_scaled)
X_test_pca = pca_7.transform(X_test_scaled)

print("Dimensionality Reduction Results:")
print("=" * 50)
print(f"Original training shape:    {X_train_scaled.shape} (30 features)")
print(f"Transformed training shape: {X_train_pca.shape} (7 components)")
print(f"\nOriginal test shape:        {X_test_scaled.shape} (30 features)")
print(f"Transformed test shape:     {X_test_pca.shape} (7 components)")
print(f"\nDimensionality reduced from 30 to 7 features ({(1-7/30)*100:.0f}% reduction)")

### Build and Compare Models

Let's compare classification performance using:
1. All 30 original features
2. Only 7 PCA components

This tests the **practical promise** of PCA: can we keep accuracy similar while using far fewer features?

In [None]:
# Model with all 30 original features
print("Training model with 30 original features...")
model_original = LogisticRegression(max_iter=5000, random_state=42)
model_original.fit(X_train_scaled, y_train)
y_pred_original = model_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

# Model with 7 PCA components
print("Training model with 7 PCA components...")
model_pca = LogisticRegression(max_iter=5000, random_state=42)
model_pca.fit(X_train_pca, y_train)
y_pred_pca = model_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# Compare results
print("\n" + "=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
print(f"Accuracy with 30 original features: {acc_original:.3f}")
print(f"Accuracy with 7 PCA components:     {acc_pca:.3f}")
print(f"Accuracy difference:                {abs(acc_original - acc_pca):.3f}")
print(f"\nFeatures used: 30 → 7 ({(1-7/30)*100:.0f}% reduction)")
print(f"Variance retained: {pca_7.explained_variance_ratio_.sum()*100:.1f}%")

if acc_pca >= acc_original - 0.02:  # Within 0.02 (2 percentage points) of the original accuracy
    print("\n✓ PCA achieves comparable accuracy with far fewer features!")
else:
    print("\n⚠ PCA reduced accuracy significantly - may need more components.")
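
The scale, reduce, and classify steps above can also be chained in a scikit-learn `Pipeline`, which fits the scaler and PCA on the training data only and reapplies them automatically at prediction time. This is a minimal sketch that mirrors the settings used above (7 components, the same split and seed). The step names `"scaler"`, `"pca"`, and `"clf"` and the variable names are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset and split settings as the cells above
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42, stratify=y_bc
)

# Each step is fitted on the training data only, which guards against leakage
pca_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=7)),
    ("clf", LogisticRegression(max_iter=5000, random_state=42)),
])

pca_pipeline.fit(X_tr, y_tr)
pipeline_acc = pca_pipeline.score(X_te, y_te)
print(f"Pipeline test accuracy (7 components): {pipeline_acc:.3f}")
```

Because the pipeline behaves like a single estimator, it can be passed to tools such as `cross_val_score` or `GridSearchCV` (for example, to tune `pca__n_components`) without refitting the preprocessing by hand.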

---

## Part 3: Visualizing High-Dimensional Data in 2D

One powerful use of PCA is visualizing high-dimensional data by projecting it to 2 or 3 dimensions.

In [None]:
# Fit PCA with just 2 components for visualization
pca_2d = PCA(n_components=2)
X_train_2d = pca_2d.fit_transform(X_train_scaled)

# Create scatter plot colored by class
plt.figure(figsize=(10, 7))

# Plot malignant (class 0) in red
malignant = y_train == 0
plt.scatter(X_train_2d[malignant, 0], X_train_2d[malignant, 1],
           c='red', label='Malignant', alpha=0.6, s=60, edgecolors='black', linewidths=0.5)

# Plot benign (class 1) in blue
benign = y_train == 1
plt.scatter(X_train_2d[benign, 0], X_train_2d[benign, 1],
           c='blue', label='Benign', alpha=0.6, s=60, edgecolors='black', linewidths=0.5)

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)', 
          fontsize=12, fontweight='bold')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)', 
          fontsize=12, fontweight='bold')
plt.title('Breast Cancer Data Projected to 2D with PCA', fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"2D visualization captures {pca_2d.explained_variance_ratio_.sum()*100:.1f}% of variance")
print("Notice how the two classes (malignant vs benign) show some separation!")

---

## Summary and Key Takeaways

Congratulations! You've mastered Principal Component Analysis. Here's what you learned:

### Core Concepts
1. **Dimension Reduction**: Compress many features into fewer components while retaining information
2. **PCA Algorithm**: Finds directions of maximum variance (eigenvectors) and their importance (eigenvalues)
3. **Principal Components**: Linear combinations of original features that capture variance
4. **Loadings**: Weights showing how original features contribute to each PC

### Practical Skills
1. **Standardization**: Standardize features before PCA whenever they are measured on different scales
2. **Scree Plots**: Visualize variance explained to choose number of components
3. **Elbow Method**: Look for the "elbow" where variance gains diminish
4. **Interpretation**: Understand what each PC represents by examining loadings
5. **Transformation**: Convert data from original features to PC space

### When to Use PCA
✓ **High-dimensional data** (many features)  
✓ **Correlated features** (redundancy to exploit)  
✓ **Visualization** (project to 2D or 3D)  
✓ **Speed up training** (fewer features = faster models)  
✓ **Noise reduction** (minor components often capture noise)  

### Tradeoffs
⚠ **Loss of interpretability**: PCs are combinations, not original features  
⚠ **Linear assumptions**: PCA assumes linear relationships  
⚠ **Variance ≠ Importance**: High variance doesn't always mean predictive power  

---

## Next Steps

Now that you've mastered PCA, try:
1. Experimenting with different numbers of components
2. Comparing PCA performance on different datasets
3. Using PCA for visualization of your own high-dimensional data
4. Exploring other dimension reduction techniques (t-SNE, UMAP)
5. Integrating PCA into your machine learning pipelines

Happy dimension reducing!