# Principal Component Analysis (PCA)

Dimensionality reduction and exploratory data analysis using PCA.

## Contents
1. Data Exploration: USArrests Dataset
2. Principal Component Analysis
3. Variance Explained
4. Biplot
5. Interpreting Principal Components
6. Choosing Number of Components

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## 1. Data Exploration: USArrests Dataset

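The dataset contains arrests per 100,000 residents for murder, assault, and rape in each of the 50 US states in 1973, plus the percentage of each state's population living in urban areas (`UrbanPop`).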

In [None]:
# Load USArrests dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv"
USArrests = pd.read_csv(url)

# Check if first column is state names (unnamed or 'Unnamed: 0')
if USArrests.columns[0] in ['Unnamed: 0', 'State'] or USArrests.iloc[0, 0] == 'Alabama':
    USArrests = USArrests.set_index(USArrests.columns[0])
    USArrests.index.name = 'State'

print(f"Dataset shape: {USArrests.shape}")
print(f"\nStates (first 10):")
print(USArrests.index[:10].tolist())
print(f"\nVariables:")
print(USArrests.columns.tolist())
print(f"\nFirst few rows:")
USArrests.head()

In [None]:
# Summary statistics
print("Summary Statistics:")
print(USArrests.describe())

### Check Mean and Variance

In [None]:
# Calculate mean and variance for each variable
means = USArrests.mean()
variances = USArrests.var()

print("Means:")
print(means)
print(f"\nVariances:")
print(variances)

print(f"\n" + "="*60)
print("IMPORTANT: Variables have very different scales!")
print("="*60)
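print(f"Murder: mean = {means['Murder']:.2f}, var = {variances['Murder']:.2f}")
print(f"Assault: mean = {means['Assault']:.2f}, var = {variances['Assault']:.2f}")
print(f"UrbanPop: mean = {means['UrbanPop']:.2f}, var = {variances['UrbanPop']:.2f}")
print(f"\nStandardize before PCA, otherwise Assault (largest variance) dominates PC1!")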
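
To see why scale matters, here is a minimal sketch on made-up data (not USArrests). The names `X_demo` and `first_loading` are hypothetical helpers introduced only for this illustration. On raw data, the first principal direction points almost entirely at the large-scale variable. After standardization, both variables get equal weight, which is always the case with exactly two standardized variables.

```python
import numpy as np

# Made-up data: two roughly independent variables on very different scales
rng = np.random.default_rng(4)
n = 200
small = rng.normal(scale=1.0, size=n)      # e.g. a rate measured in single digits
large = rng.normal(scale=100.0, size=n)    # e.g. a rate measured in hundreds
X_demo = np.column_stack([small, large])

def first_loading(M):
    """Return the first principal direction of M via SVD of the centered data."""
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt[0]

raw_pc1 = first_loading(X_demo)
std_pc1 = first_loading((X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0))

print("PC1 loadings, raw data:         ", np.round(np.abs(raw_pc1), 3))
print("PC1 loadings, standardized data:", np.round(np.abs(std_pc1), 3))
```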

In [None]:
# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for idx, col in enumerate(USArrests.columns):
    ax = axes[idx // 2, idx % 2]
    ax.hist(USArrests[col], bins=15, edgecolor='black', alpha=0.7)
    ax.set_xlabel(col, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.set_title(f'{col} Distribution', fontsize=12)
    ax.axvline(means[col], color='red', linestyle='--', linewidth=2, label=f'Mean = {means[col]:.1f}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 2. Principal Component Analysis

Standardize data and compute principal components.

In [None]:
# Standardize the data (mean=0, std=1)
scaler = StandardScaler()
USArrests_scaled = scaler.fit_transform(USArrests)

# Verify standardization
# StandardScaler divides by the population std (ddof=0), so check with ddof=0
print("After standardization:")
print(f"Means (should be ~0): {USArrests_scaled.mean(axis=0)}")
print(f"Std devs (should be 1): {USArrests_scaled.std(axis=0, ddof=0)}")
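
One detail worth knowing when checking the scaled data: `StandardScaler` divides by the population standard deviation (`ddof=0`), so the sample standard deviation (`ddof=1`) of a scaled column is sqrt(n / (n - 1)) rather than exactly 1, which is about 1.0102 for n = 50. The sketch below illustrates this on made-up data; `X_demo` and `Z_demo` are hypothetical names used only here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with n = 50 rows and two columns on different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 100.0], scale=[2.0, 30.0], size=(50, 2))

Z_demo = StandardScaler().fit_transform(X_demo)

pop_std = Z_demo.std(axis=0, ddof=0)      # what StandardScaler normalizes to 1
sample_std = Z_demo.std(axis=0, ddof=1)   # larger by a factor sqrt(n / (n - 1))

print("ddof=0:", pop_std)
print("ddof=1:", sample_std)
print("sqrt(50/49):", np.sqrt(50 / 49))
```

The same factor carries over to `pca.explained_variance_`, which scikit-learn computes with n - 1 in the denominator. The proportions of variance explained are unaffected because the factor cancels.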

In [None]:
# Compute PCA
pca = PCA()
pca.fit(USArrests_scaled)

print("PCA Components:")
print(f"Number of components: {pca.n_components_}")
print(f"\nComponent shape: {pca.components_.shape}")
print(f"(4 components × 4 original features)")

### Principal Component Loadings (Rotation Matrix)

In [None]:
# Loadings (rotation matrix)
loadings = pd.DataFrame(
    pca.components_.T,  # Transpose to match R output
    columns=[f'PC{i+1}' for i in range(pca.n_components_)],
    index=USArrests.columns
)

print("Principal Component Loadings:")
print(loadings)

print(f"\n" + "="*60)
print("Interpretation of PC1:")
print("="*60)
print(f"PC1 = {loadings.loc['Murder', 'PC1']:.3f} × Murder")
print(f"    + {loadings.loc['Assault', 'PC1']:.3f} × Assault")
print(f"    + {loadings.loc['UrbanPop', 'PC1']:.3f} × UrbanPop")
print(f"    + {loadings.loc['Rape', 'PC1']:.3f} × Rape")
print(f"\nPC1 roughly tracks overall crime: Murder, Assault, and Rape get similar weights; UrbanPop gets about half as much")

In [None]:
# Visualize loadings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# PC1 vs PC2 loadings
axes[0].scatter(loadings['PC1'], loadings['PC2'], s=100)
for i, var in enumerate(loadings.index):
    axes[0].annotate(var, (loadings.loc[var, 'PC1'], loadings.loc[var, 'PC2']),
                    fontsize=12, ha='center', va='bottom')
    axes[0].arrow(0, 0, loadings.loc[var, 'PC1']*0.9, loadings.loc[var, 'PC2']*0.9,
                 head_width=0.05, head_length=0.05, fc='blue', ec='blue', alpha=0.5)
axes[0].axhline(0, color='black', linewidth=0.5)
axes[0].axvline(0, color='black', linewidth=0.5)
axes[0].set_xlabel('PC1 Loading', fontsize=12)
axes[0].set_ylabel('PC2 Loading', fontsize=12)
axes[0].set_title('Variable Loadings on PC1 and PC2', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Heatmap of all loadings
sns.heatmap(loadings, annot=True, fmt='.3f', cmap='RdBu_r', center=0,
           cbar_kws={'label': 'Loading'}, ax=axes[1])
axes[1].set_title('All Principal Component Loadings', fontsize=14)

plt.tight_layout()
plt.show()

### Principal Component Scores

In [None]:
# Transform data to PC space
scores = pca.transform(USArrests_scaled)
scores_df = pd.DataFrame(
    scores,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)],
    index=USArrests.index
)

print(f"Principal Component Scores shape: {scores_df.shape}")
print(f"\nFirst few states:")
print(scores_df.head(10))

## 3. Variance Explained

How much variance does each PC capture?

In [None]:
# Standard deviation of each PC
std_devs = np.sqrt(pca.explained_variance_)

print("Standard Deviations of Principal Components:")
for i, std in enumerate(std_devs):
    print(f"  PC{i+1}: {std:.4f}")

# Variance explained by each PC
var_explained = pca.explained_variance_

print("\nVariance Explained by Each PC:")
for i, var in enumerate(var_explained):
    print(f"  PC{i+1}: {var:.4f}")

In [None]:
# Proportion of variance explained (PVE)
pve = pca.explained_variance_ratio_

print("Proportion of Variance Explained (PVE):")
for i, ratio in enumerate(pve):
    print(f"  PC{i+1}: {ratio:.4f} ({ratio*100:.2f}%)")

print(f"\nTotal variance explained by all PCs: {pve.sum():.4f} (100%)")
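
Written as a formula, the proportion of variance explained can be expressed through the eigenvalues of the covariance matrix of the scaled data, which are the values in `pca.explained_variance_`. The symbols $\lambda_m$ (eigenvalue of the m-th PC) and $p$ (number of variables) are notation introduced here, not names from the code.

```latex
\mathrm{PVE}_m = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j},
\qquad
\sum_{m=1}^{p} \mathrm{PVE}_m = 1
```

Because the denominator is the total variance, the PVEs do not depend on whether variances are computed with n or n - 1: the factor cancels in the ratio.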

In [None]:
# Cumulative variance explained
cumulative_pve = np.cumsum(pve)

print("Cumulative Variance Explained:")
for i, cum_var in enumerate(cumulative_pve):
    print(f"  PC1 to PC{i+1}: {cum_var:.4f} ({cum_var*100:.2f}%)")

print(f"\n" + "="*60)
print(f"First 2 PCs explain {cumulative_pve[1]*100:.1f}% of variance!")
print(f"="*60)

In [None]:
# Scree plot and cumulative variance plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scree plot (Proportion of variance explained)
axes[0].plot(range(1, len(pve)+1), pve, marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('Principal Component', fontsize=12)
axes[0].set_ylabel('Proportion of Variance Explained', fontsize=12)
axes[0].set_title('Scree Plot', fontsize=14)
axes[0].set_ylim([0, 1])
axes[0].set_xticks(range(1, len(pve)+1))
axes[0].grid(True, alpha=0.3)

# Cumulative variance plot
axes[1].plot(range(1, len(cumulative_pve)+1), cumulative_pve, 
            marker='o', linewidth=2, markersize=8, color='orange')
axes[1].axhline(0.8, color='red', linestyle='--', linewidth=1, label='80% threshold')
axes[1].set_xlabel('Principal Component', fontsize=12)
axes[1].set_ylabel('Cumulative Proportion of Variance Explained', fontsize=12)
axes[1].set_title('Cumulative Variance Explained', fontsize=14)
axes[1].set_ylim([0, 1.05])  # small headroom so the 100% point is not clipped
axes[1].set_xticks(range(1, len(cumulative_pve)+1))
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Elbow in scree plot suggests using 2-3 components")

## 4. Biplot

Visualize both observations (states) and variables in PC space.

In [None]:
# Create biplot
def biplot(scores, loadings, labels=None, pc1=0, pc2=1):
    """
    Create a biplot showing both observations and variables.
    
    Parameters:
    - scores: Principal component scores (observations)
    - loadings: Principal component loadings (variables)
    - labels: Labels for observations
    - pc1, pc2: Which PCs to plot (0-indexed)

    Note: axis labels read the global `pve` array computed earlier.
    """
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Plot observations (states)
    ax.scatter(scores[:, pc1], scores[:, pc2], alpha=0.6, s=50, c='blue')
    
    # Add state labels
    if labels is not None:
        for i, label in enumerate(labels):
            ax.annotate(label, (scores[i, pc1], scores[i, pc2]),
                       fontsize=8, alpha=0.7)
    
    # Scale factor for arrows
    scale = 4
    
    # Plot variables (arrows)
    for i, var in enumerate(loadings.index):
        ax.arrow(0, 0, 
                loadings.iloc[i, pc1] * scale, 
                loadings.iloc[i, pc2] * scale,
                head_width=0.2, head_length=0.2, 
                fc='red', ec='red', linewidth=2, alpha=0.7)
        ax.text(loadings.iloc[i, pc1] * scale * 1.15,
               loadings.iloc[i, pc2] * scale * 1.15,
               var, fontsize=12, color='red', fontweight='bold')
    
    ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
    ax.axvline(0, color='black', linewidth=0.5, linestyle='--')
    ax.set_xlabel(f'PC{pc1+1} ({pve[pc1]*100:.1f}% variance)', fontsize=12)
    ax.set_ylabel(f'PC{pc2+1} ({pve[pc2]*100:.1f}% variance)', fontsize=12)
    ax.set_title('PCA Biplot', fontsize=14)
    ax.grid(True, alpha=0.3)
    
    return fig, ax

# Create biplot
fig, ax = biplot(scores, loadings, labels=USArrests.index)
plt.tight_layout()
plt.show()

print("Biplot Interpretation:")
print("- Blue points = States")
print("- Red arrows = Original variables")
print("- States in direction of arrow have high values for that variable")
print("- Longer arrow = variable is better represented by these two PCs")

### Sign Ambiguity

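Each principal component is unique only up to a sign flip: negating a loading vector together with its scores describes the same direction and the same variance. Different software packages may therefore report opposite signs for the same component.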

In [None]:
# Flip signs of loadings and scores
loadings_flipped = -loadings
scores_flipped = -scores

# Create biplot with flipped signs
fig, ax = biplot(scores_flipped, loadings_flipped, labels=USArrests.index)
ax.set_title('PCA Biplot (signs flipped)', fontsize=14)
plt.tight_layout()
plt.show()

print("\nNote: Sign flip doesn't change interpretation!")
print("Relative positions of states and variables remain the same.")
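
When results need to be compared across runs or packages, one option is to fix a sign convention. The sketch below, on made-up data, flips each component so that its largest-magnitude loading is positive. The helper `align_signs` and the `*_demo` variables are hypothetical names, and this particular convention is an assumption chosen for illustration, not the rule any specific library uses.

```python
import numpy as np

# Made-up correlated data
rng = np.random.default_rng(1)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X_demo = rng.normal(size=(30, 3)) @ mix
Xc = X_demo - X_demo.mean(axis=0)

# PCA via SVD: columns of V_demo are the loading vectors
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_demo = Vt.T
scores_demo = Xc @ V_demo

def align_signs(V, scores):
    """Flip each PC so that its largest-|loading| entry is positive."""
    top = np.abs(V).argmax(axis=0)                   # row of the largest |loading| per PC
    signs = np.sign(V[top, np.arange(V.shape[1])])   # one sign per PC
    return V * signs, scores * signs

V_aligned, scores_aligned = align_signs(V_demo, scores_demo)
V_from_flipped, _ = align_signs(-V_demo, -scores_demo)

# Both sign choices land on the same aligned loadings,
# and the reconstruction scores @ V.T is unchanged by the flip
print(np.allclose(V_aligned, V_from_flipped))
print(np.allclose(scores_demo @ V_demo.T, scores_aligned @ V_aligned.T))
```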

## 5. Interpreting Principal Components

In [None]:
# States with highest/lowest scores on PC1
pc1_scores = scores_df['PC1'].sort_values()

print("="*60)
print("PC1 INTERPRETATION")
print("="*60)
print("\nPC1 loadings:")
print(loadings['PC1'].sort_values(ascending=False))

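# The sign of a PC is arbitrary, so read the direction off the loadings
crime_sign = np.sign(loadings.loc['Murder', 'PC1'])
low_label, high_label = ('low crime', 'high crime') if crime_sign > 0 else ('high crime', 'low crime')

print(f"\nStates with LOWEST PC1 ({low_label}):")
print(pc1_scores.head(5))

print(f"\nStates with HIGHEST PC1 ({high_label}):")
print(pc1_scores.tail(5))

print("\n→ PC1 represents overall crime severity")
print("  Murder, Assault, and Rape load with the same sign and similar magnitude")
print(f"  High PC1 = {high_label} state")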

In [None]:
# PC2 interpretation
pc2_scores = scores_df['PC2'].sort_values()

print("="*60)
print("PC2 INTERPRETATION")
print("="*60)
print("\nPC2 loadings:")
print(loadings['PC2'].sort_values(ascending=False))

print(f"\nStates with LOWEST PC2:")
print(pc2_scores.head(5))

print(f"\nStates with HIGHEST PC2:")
print(pc2_scores.tail(5))

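# The sign of PC2 is also arbitrary; check which end corresponds to high UrbanPop
urban_sign = np.sign(loadings.loc['UrbanPop', 'PC2'])
urban_end = 'High' if urban_sign > 0 else 'Low'
print("\n→ PC2 mainly reflects urbanization (UrbanPop has by far the largest loading)")
print("  Murder and Assault load with the opposite sign; Rape loads weakly with UrbanPop's sign")
print(f"  {urban_end} PC2 = more urbanized state")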

In [None]:
# Correlation between original variables and PCs
# corr(x_j, PC_m) = loading_jm × sqrt(eigenvalue_m) / sd(x_j)
# sd uses ddof=1 to match explained_variance_ (StandardScaler used ddof=0)
sample_sd = USArrests_scaled.std(axis=0, ddof=1)
correlations = loadings * np.sqrt(pca.explained_variance_)
correlations = correlations.div(sample_sd, axis=0)

print("Correlations between original variables and PCs:")
print(correlations)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlations, annot=True, fmt='.3f', cmap='RdBu_r', 
           center=0, vmin=-1, vmax=1, cbar_kws={'label': 'Correlation'})
plt.title('Correlation: Original Variables vs Principal Components', fontsize=14)
plt.tight_layout()
plt.show()
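
The loading-times-root-eigenvalue identity can be checked against directly computed correlations. The sketch below does this on made-up data standardized to unit sample variance (`ddof=1`), in which case no extra division by the standard deviation is needed. All names (`X_demo`, `formula_corr`, `direct_corr`) are hypothetical and used only for this check.

```python
import numpy as np

# Made-up correlated data, standardized to unit sample variance
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)

# Eigen-decomposition of the covariance matrix, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(Z_demo, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc_scores = Z_demo @ eigvecs

# Formula: column m of the loadings scaled by sqrt(eigenvalue_m)
formula_corr = eigvecs * np.sqrt(eigvals)

# Direct computation: Pearson correlation of each variable with each PC
direct_corr = np.array([
    [np.corrcoef(Z_demo[:, j], pc_scores[:, m])[0, 1] for m in range(4)]
    for j in range(4)
])

print(np.allclose(formula_corr, direct_corr))
```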

## 6. Choosing Number of Components

How many PCs should we keep?

In [None]:
# Summary table
summary_df = pd.DataFrame({
    'PC': [f'PC{i+1}' for i in range(pca.n_components_)],
    'Std Dev': std_devs,
    'Variance': pca.explained_variance_,
    'Prop. Var': pve,
    'Cumulative Var': cumulative_pve
})

print("PCA Summary:")
print(summary_df.to_string(index=False))

print("\n" + "="*60)
print("DECISION RULES:")
print("="*60)
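# Kaiser rule assumes unit-variance inputs (eigenvalues sum to the number of variables).
# explained_variance_ uses n-1 while StandardScaler uses n, so rescale by (n-1)/n first.
n_obs = USArrests_scaled.shape[0]
eigenvalues = pca.explained_variance_ * (n_obs - 1) / n_obs
kaiser_keep = [f'PC{i+1}' for i, ev in enumerate(eigenvalues) if ev > 1]
print(f"1. Kaiser Rule (eigenvalue > 1): Keep {', '.join(kaiser_keep)}")
for i, ev in enumerate(eigenvalues[:3]):
    print(f"   PC{i+1} eigenvalue = {ev:.2f} {'> 1 ✓' if ev > 1 else '≤ 1 ✗'}")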

print(f"\n2. Scree Plot Elbow: Around PC2-PC3")

print(f"\n3. Variance Threshold (80%): Need {np.argmax(cumulative_pve >= 0.8) + 1} PCs")

print(f"\n→ Recommendation: Keep 2 components ({cumulative_pve[1]*100:.1f}% variance)")

In [None]:
# Reduced dataset with 2 PCs
USArrests_pca = scores_df[['PC1', 'PC2']].copy()

print(f"Original data: {USArrests.shape}")
print(f"Reduced data: {USArrests_pca.shape}")
print(f"\nDimensionality reduced from 4 to 2!")
print(f"Retained {cumulative_pve[1]*100:.1f}% of variance")

print("\nReduced dataset (first 10 states):")
print(USArrests_pca.head(10))
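
One way to read "retained X% of variance" is through reconstruction: mapping the k kept scores back to the original space loses exactly the discarded share of the total squared variation. The sketch below checks this on made-up data using a plain SVD. The names `Z_demo`, `Z_hat`, `retained`, and `rel_error` are hypothetical.

```python
import numpy as np

# Made-up correlated data, standardized column-wise
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# PCA via SVD: rows of Vt are the loading vectors
_, S, Vt = np.linalg.svd(Z_demo, full_matrices=False)

k = 2
scores_k = Z_demo @ Vt[:k].T     # reduced representation (50 x 2)
Z_hat = scores_k @ Vt[:k]        # reconstruction in the original 4-D space

retained = (S[:k] ** 2).sum() / (S ** 2).sum()               # cumulative PVE of k PCs
rel_error = ((Z_demo - Z_hat) ** 2).sum() / (Z_demo ** 2).sum()

print(f"retained = {retained:.4f}, relative reconstruction error = {rel_error:.4f}")
```

The two printed numbers sum to 1, so a higher cumulative PVE means a smaller reconstruction error.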

## Summary

This notebook covered:

### **Principal Component Analysis (PCA)**
- **Goal**: Reduce dimensionality while preserving variance
- **Method**: Find orthogonal directions of maximum variance
- **Output**: New uncorrelated variables (principal components)

### **Key Steps**
1. **Standardize** data (mean=0, std=1)
2. Compute **covariance matrix** (for standardized data, this is the correlation matrix)
3. Find **eigenvectors** (loading vectors) and **eigenvalues** (variance of each PC)
4. **Project** data onto PCs
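
The four steps listed above can be sketched in plain NumPy. This is an illustrative version on made-up data, not the code path scikit-learn uses internally. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # made-up correlated data

# 1. Standardize (ddof=1 so the covariance below has ones on its diagonal)
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized data (= correlation matrix of X_demo)
C = np.cov(Z_demo, rowvar=False)

# 3. Eigen-decomposition; eigh returns ascending eigenvalues, so reverse the order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the eigenvectors to get the scores
scores_demo = Z_demo @ eigvecs

# The scores are uncorrelated and their variances equal the eigenvalues
print(np.round(np.cov(scores_demo, rowvar=False), 3))
print(np.round(eigvals, 3))
```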

### **Interpretation**
- **Loadings**: How original variables contribute to PCs
- **Scores**: Position of observations in PC space
- **Variance explained**: How much information each PC captures

### **Choosing Number of Components**
- **Kaiser Rule**: Keep PCs with eigenvalue > 1
- **Scree Plot**: Look for "elbow"
- **Variance Threshold**: e.g., 80% cumulative variance
- **Cross-validation**: For supervised learning
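
The variance-threshold rule can also be delegated to scikit-learn: when `n_components` is a float between 0 and 1 and the full SVD solver is used, `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below compares this with the manual rule on made-up data. `X_demo`, `k_manual`, and `pca_80` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up correlated data with 6 variables
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))
Z_demo = StandardScaler().fit_transform(X_demo)

# Manual threshold rule on the full decomposition
full_pca = PCA().fit(Z_demo)
cum_pve = np.cumsum(full_pca.explained_variance_ratio_)
k_manual = int(np.argmax(cum_pve > 0.8)) + 1

# Same rule delegated to scikit-learn: a float in (0, 1) is read as a variance target
pca_80 = PCA(n_components=0.8, svd_solver='full').fit(Z_demo)

print(f"manual: {k_manual} components, sklearn: {pca_80.n_components_} components")
```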

### **Biplot**
- **Observations** (blue points): States
- **Variables** (red arrows): Original features
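- **Interpretation**: Arrows pointing in similar directions indicate positively correlated variables; states lying far along an arrow have high values of that variable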

### **USArrests Results**
- **PC1 (62% variance)**: Overall crime severity
  - Murder, Assault, and Rape load with the same sign and similar magnitude
  - One end of PC1 = high-crime states, the other = low-crime states (which end depends on the arbitrary sign)
- **PC2 (25% variance)**: Urbanization
  - UrbanPop has by far the largest loading
  - Murder and Assault load with the opposite sign
- **2 PCs capture 87%** of total variance

### **When to Use PCA**
- ✅ Exploratory data analysis
- ✅ Visualization (reduce to 2-3 dimensions)
- ✅ Remove multicollinearity
- ✅ Noise reduction
- ✅ Speed up algorithms (fewer features)

### **Limitations**
- ❌ Linear method (won't capture non-linear patterns)
- ❌ PCs are linear combinations (hard to interpret)
- ❌ Assumes variance = information
- ❌ Sensitive to scaling (standardize when variables are measured in different units)

### **Key Takeaway**
PCA transforms correlated variables into uncorrelated principal components,
enabling dimensionality reduction while preserving most information.