# Tier 4: Principal Component Analysis (PCA)

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 87895554-973d-42a5-ad90-e7c11fa3846a

---

## Citation
Brandon Deloatch, "Tier 4: Principal Component Analysis (PCA)," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 87895554-973d-42a5-ad90-e7c11fa3846a
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
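
A minimal sketch of what capturing that metadata could look like, using only the standard library. The helper name `capture_execution_metadata` and the default package list are illustrative assumptions, not part of the original notebook.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(packages=("numpy", "pandas", "scikit-learn", "plotly")):
    """Collect interpreter, OS, and package versions for a provenance log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Package is not installed in this environment
    return {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


execution_metadata = capture_execution_metadata()
print(json.dumps(execution_metadata, indent=2))
```

Writing the resulting dictionary next to the notebook output is one simple way to record which environment produced a given run.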

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

print("Tier 4: Principal Component Analysis - Libraries Loaded!")
print("=" * 55)
print("PCA Techniques:")
print("• Dimensionality reduction via eigendecomposition of the covariance matrix (computed by SVD in scikit-learn)")
print("• Explained variance analysis and component selection")
print("• Feature transformation and data compression")
print("• Noise reduction and signal enhancement")
print("• High-dimensional data visualization")
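
The link between PCA and eigendecomposition can be checked numerically. The sketch below uses small synthetic data (an assumption for illustration, not the notebook's dataset) and compares the eigenvalues and eigenvectors of the sample covariance matrix with the `explained_variance_` and `components_` attributes of scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 samples, 5 features (illustrative only)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_centered = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance matrix (symmetric, so eigh applies)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; PCA orders components by descending variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

pca = PCA().fit(X)

variance_matches = np.allclose(eigenvalues, pca.explained_variance_)
# Each principal axis is defined only up to sign, so compare absolute values
axes_match = np.allclose(np.abs(eigenvectors.T), np.abs(pca.components_), atol=1e-6)

print(f"Eigenvalues match explained_variance_: {variance_matches}")
print(f"Eigenvectors match components_ up to sign: {axes_match}")
```

The sign ambiguity is worth remembering when reading loadings: flipping the sign of a whole component leaves the decomposition equally valid.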

In [None]:
# Create comprehensive high-dimensional datasets
np.random.seed(42)

# 1. Customer behavior dataset (high-dimensional)
n_customers = 1500
n_features = 50

# Generate correlated customer features
customer_segments = ['Premium', 'Standard', 'Budget']
segment_data = []

for i, segment in enumerate(customer_segments):
    n_segment = n_customers // 3

    # Base behavior patterns for each segment
    if segment == 'Premium':
        base_spending = np.random.normal(5000, 1000, n_segment)
        frequency_mult = np.random.normal(2.5, 0.5, n_segment)
        quality_pref = np.random.beta(8, 2, n_segment)
    elif segment == 'Standard':
        base_spending = np.random.normal(2000, 500, n_segment)
        frequency_mult = np.random.normal(1.5, 0.3, n_segment)
        quality_pref = np.random.beta(5, 5, n_segment)
    else:  # Budget
        base_spending = np.random.normal(800, 300, n_segment)
        frequency_mult = np.random.normal(0.8, 0.2, n_segment)
        quality_pref = np.random.beta(2, 8, n_segment)

    # Create correlated features
    for j in range(n_segment):
        customer_features = []

        # Spending-related features (10 features)
        spending_features = base_spending[j] * np.random.normal(1, 0.2, 10) * frequency_mult[j]
        customer_features.extend(spending_features)

        # Engagement features (15 features)
        engagement_base = frequency_mult[j] * 100
        engagement_features = engagement_base * np.random.gamma(2, 0.5, 15)
        customer_features.extend(engagement_features)

        # Product preferences (15 features)
        pref_features = quality_pref[j] * np.random.normal(1, 0.3, 15)
        customer_features.extend(pref_features)

        # Demographic proxies (10 features)
        demo_features = np.random.normal(0, 1, 10)
        customer_features.extend(demo_features)

        segment_data.append({
            'customer_id': f"CUST_{i:03d}_{j:03d}",
            'segment': segment,
            'segment_code': i,
            **{f'feature_{k:02d}': customer_features[k] for k in range(len(customer_features))}
        })

customer_df = pd.DataFrame(segment_data)

# Feature matrix for PCA
feature_columns = [col for col in customer_df.columns if col.startswith('feature_')]
X_customer = customer_df[feature_columns].values
y_customer = customer_df['segment_code'].values

print("PCA Dataset Created:")
print(f"Customers: {len(customer_df)}")
print(f"Features: {len(feature_columns)}")
print(f"Segments: {customer_df['segment'].value_counts().to_dict()}")
print(f"Feature range: {X_customer.min():.2f} to {X_customer.max():.2f}")

# 2. Load digits dataset for image PCA demonstration
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"\nDigits dataset: {X_digits.shape[0]} samples, {X_digits.shape[1]} features")
print(f"Represents 8x8 pixel images of handwritten digits 0-9")
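
The customer features span very different scales (spending in the thousands, preference scores below 1), which is why PCA is usually run on standardized data. The sketch below uses two independent toy features with assumed, illustrative distributions to show how the raw scale alone decides the first component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
# Two independent toy features on very different scales (illustrative values)
spending = rng.normal(2000, 500, n)      # dollars
preference = rng.normal(0.5, 0.15, n)    # unitless score, roughly in [0, 1]
X_toy = np.column_stack([spending, preference])

# Share of total variance captured by each component, without and with scaling
ratio_raw = PCA().fit(X_toy).explained_variance_ratio_
ratio_scaled = PCA().fit(StandardScaler().fit_transform(X_toy)).explained_variance_ratio_

print(f"PC1 share without scaling: {ratio_raw[0]:.4f}")
print(f"PC1 share with scaling:    {ratio_scaled[0]:.4f}")
```

Without scaling, the first component is essentially the spending axis; after standardization the two independent features contribute comparably.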

In [None]:
# 1. PCA ANALYSIS AND COMPONENT SELECTION
print("1. PCA ANALYSIS AND COMPONENT SELECTION")
print("=" * 42)

# Standardize customer data
scaler_customer = StandardScaler()
X_customer_scaled = scaler_customer.fit_transform(X_customer)

# Fit PCA with all components
pca_full = PCA()
X_customer_pca = pca_full.fit_transform(X_customer_scaled)

# Calculate cumulative explained variance
explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)

# Smallest number of components reaching 80%, 90%, and 95% cumulative variance
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
n_components_90 = np.argmax(cumulative_variance >= 0.90) + 1
n_components_80 = np.argmax(cumulative_variance >= 0.80) + 1

print(f"Original dimensions: {X_customer.shape[1]}")
print(f"Components for 80% variance: {n_components_80}")
print(f"Components for 90% variance: {n_components_90}")
print(f"Components for 95% variance: {n_components_95}")

# Refit PCA keeping only the components needed for 95% variance
pca_optimal = PCA(n_components=n_components_95)
X_customer_reduced = pca_optimal.fit_transform(X_customer_scaled)

print(f"Dimensionality reduction: {X_customer.shape[1]} → {X_customer_reduced.shape[1]}")
print(f"Compression ratio: {X_customer.shape[1] / X_customer_reduced.shape[1]:.1f}x")

# Analyze component loadings (each row of components_ is a unit-norm principal axis)
n_loading_pcs = min(5, pca_optimal.n_components_)
components_df = pd.DataFrame(
    pca_optimal.components_[:n_loading_pcs].T,  # First 5 components (fewer if fewer were kept)
    columns=[f'PC{i+1}' for i in range(n_loading_pcs)],
    index=feature_columns
)

print("\nTop feature loadings for first 3 components:")
for i in range(min(3, n_loading_pcs)):
    pc_name = f'PC{i+1}'
    # Rank by absolute loading, but report the signed value
    top_idx = components_df[pc_name].abs().nlargest(5).index
    print(f"\n{pc_name} (explains {explained_variance_ratio[i]*100:.1f}% variance):")
    for feature, loading in components_df.loc[top_idx, pc_name].items():
        print(f"  {feature}: {loading:+.3f}")

# Digits PCA for comparison
scaler_digits = StandardScaler()
X_digits_scaled = scaler_digits.fit_transform(X_digits)

pca_digits = PCA(n_components=0.95) # 95% variance
X_digits_reduced = pca_digits.fit_transform(X_digits_scaled)

print(f"\nDigits PCA: {X_digits.shape[1]} → {X_digits_reduced.shape[1]} dimensions")
print(f"Digits compression ratio: {X_digits.shape[1] / X_digits_reduced.shape[1]:.1f}x")
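
The cell above selects component counts in two ways: a manual threshold on the cumulative variance for the customer data, and `PCA(n_components=0.95)` for the digits. The sketch below applies both to the standardized digits data to show they pick the same count. The helper `components_for_variance` is a hypothetical name introduced here, with a guard for thresholds that are never reached.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(cumulative_variance, threshold):
    """Smallest k whose first k components reach `threshold` cumulative variance."""
    reached = cumulative_variance >= threshold
    if not reached.any():
        # np.argmax on an all-False mask returns 0, which would wrongly report k = 1
        raise ValueError(f"Threshold {threshold} is never reached")
    return int(np.argmax(reached)) + 1


X = StandardScaler().fit_transform(load_digits().data)
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)

k_manual = components_for_variance(cumulative, 0.95)
k_sklearn = PCA(n_components=0.95).fit(X).n_components_
print(f"Manual selection: {k_manual}, PCA(n_components=0.95): {k_sklearn}")
```

Passing a float to `n_components` is the shorter route; the manual version is useful when several thresholds are compared from a single fit.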

In [None]:
# 2. INTERACTIVE PCA VISUALIZATIONS
print("2. INTERACTIVE PCA VISUALIZATIONS")
print("=" * 36)

# Create comprehensive PCA dashboard
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[
        'Explained Variance by Component',
        'Customer Segments in PC Space (2D)',
        'Component Loadings Heatmap',
        'Customer Segments in PC Space (3D)',
        'Digits Reconstruction Comparison',
        'PCA Performance Metrics'
    ],
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"type": "scatter3d"}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Explained variance plot
fig.add_trace(
    go.Scatter(
        x=np.arange(1, len(explained_variance_ratio) + 1),
        y=explained_variance_ratio,
        mode='lines+markers',
        name='Individual Variance',
        line=dict(color='blue', width=2),
        marker=dict(size=6)
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=np.arange(1, len(cumulative_variance) + 1),
        y=cumulative_variance,
        mode='lines+markers',
        name='Cumulative Variance',
        line=dict(color='red', width=2),
        marker=dict(size=6)
    ),
    row=1, col=1
)

# Add threshold lines
fig.add_hline(y=0.95, line=dict(color='green', dash='dash'),
              annotation_text="95% threshold", row=1, col=1)
fig.add_vline(x=n_components_95, line=dict(color='green', dash='dash'),
              annotation_text=f"n={n_components_95}", row=1, col=1)

# 2. Customer segments in 2D PC space
colors = ['red', 'blue', 'green']
segments = ['Premium', 'Standard', 'Budget']

for segment, color in zip(segments, colors):
    mask = (customer_df['segment'] == segment).to_numpy()
    fig.add_trace(
        go.Scatter(
            x=X_customer_reduced[mask, 0],
            y=X_customer_reduced[mask, 1],
            mode='markers',
            name=f'{segment} Customers',
            marker=dict(color=color, size=8, opacity=0.7),
            text=customer_df.loc[mask, 'customer_id'],
            hovertemplate=f'{segment}<br>PC1: %{{x:.2f}}<br>PC2: %{{y:.2f}}<extra></extra>'
        ),
        row=1, col=2
    )

# 3. Component loadings heatmap
fig.add_trace(
    go.Heatmap(
        z=components_df.T.values,
        x=components_df.index,
        y=components_df.columns,
        colorscale='RdBu',
        zmid=0,
        showscale=True,
        hovertemplate='Feature: %{x}<br>Component: %{y}<br>Loading: %{z:.3f}<extra></extra>'
    ),
    row=2, col=1
)

# 4. Customer segments in 3D PC space
for segment, color in zip(segments, colors):
    mask = (customer_df['segment'] == segment).to_numpy()
    fig.add_trace(
        go.Scatter3d(
            x=X_customer_reduced[mask, 0],
            y=X_customer_reduced[mask, 1],
            z=X_customer_reduced[mask, 2],
            mode='markers',
            name=f'{segment} 3D',
            marker=dict(color=color, size=5, opacity=0.7),
            text=customer_df.loc[mask, 'customer_id'],
            hovertemplate=f'{segment}<br>PC1: %{{x:.2f}}<br>PC2: %{{y:.2f}}<br>PC3: %{{z:.2f}}<extra></extra>'
        ),
        row=2, col=2
    )

# 5. Digits reconstruction comparison
# Select a few digit samples for reconstruction
sample_indices = [0, 100, 200, 300, 400]
original_images = X_digits[sample_indices].reshape(-1, 8, 8)

# Reconstruct from PCA, then undo the standardization so the result is in pixel units.
# pca_digits was fit on scaled data, so its inverse_transform alone returns scaled values.
X_digits_reconstructed = scaler_digits.inverse_transform(
    pca_digits.inverse_transform(X_digits_reduced)
)
reconstructed_images = X_digits_reconstructed[sample_indices].reshape(-1, 8, 8)

# Per-sample mean squared error between original and reconstructed pixels
reconstruction_errors = np.mean(
    (X_digits[sample_indices] - X_digits_reconstructed[sample_indices])**2, axis=1
)

fig.add_trace(
    go.Bar(
        x=[f'Sample {i}' for i in sample_indices],  # Dataset row indices, not digit labels
        y=reconstruction_errors,
        name='Reconstruction Error',
        marker_color='orange'
    ),
    row=3, col=1
)

# 6. PCA performance metrics comparison
# Sorted, de-duplicated component counts, capped at the number of features
dimensions = sorted({n for n in [5, 10, 20, 30, n_components_95]
                     if n <= X_customer_scaled.shape[1]})
performance_metrics = []

# Split once so every component count is scored on the same held-out customers
X_train_s, X_test_s, y_train, y_test = train_test_split(
    X_customer_scaled, y_customer, test_size=0.3, random_state=42, stratify=y_customer
)

for n_comp in dimensions:
    # Fit PCA on the training split only, then project the test split,
    # so the held-out rows do not influence the learned components
    pca_temp = PCA(n_components=n_comp)
    X_train = pca_temp.fit_transform(X_train_s)
    X_test = pca_temp.transform(X_test_s)

    # Classification performance with reduced dimensions
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))

    variance_explained = np.sum(pca_temp.explained_variance_ratio_)

    performance_metrics.append({
        'components': n_comp,
        'accuracy': accuracy,
        'variance_explained': variance_explained
    })

perf_df = pd.DataFrame(performance_metrics)

# row/col already assign this trace to the (3, 2) subplot axes
fig.add_trace(
    go.Scatter(
        x=perf_df['components'],
        y=perf_df['accuracy'],
        mode='lines+markers',
        name='Classification Accuracy',
        line=dict(color='purple', width=3),
        marker=dict(size=8)
    ),
    row=3, col=2
)

# Update layout
fig.update_layout(
    height=1200,
    title="Principal Component Analysis (PCA) Dashboard",
    showlegend=True
)

# Update axis labels
fig.update_xaxes(title_text="Component Number", row=1, col=1)
fig.update_xaxes(title_text="First Principal Component", row=1, col=2)
fig.update_xaxes(title_text="Features", row=2, col=1)
fig.update_xaxes(title_text="Digit Samples", row=3, col=1)
fig.update_xaxes(title_text="Number of Components", row=3, col=2)

fig.update_yaxes(title_text="Explained Variance Ratio", row=1, col=1)
fig.update_yaxes(title_text="Second Principal Component", row=1, col=2)
fig.update_yaxes(title_text="Components", row=2, col=1)
fig.update_yaxes(title_text="Reconstruction Error", row=3, col=1)
fig.update_yaxes(title_text="Classification Accuracy", row=3, col=2)

fig.show()

# Business insights
print("\nPCA BUSINESS INSIGHTS:")

# Customer segment analysis in PC space
for segment in segments:
    segment_mask = (customer_df['segment'] == segment).to_numpy()
    segment_pca = X_customer_reduced[segment_mask]

    pc1_mean = np.mean(segment_pca[:, 0])
    pc2_mean = np.mean(segment_pca[:, 1])

    print(f"\n{segment} Customers:")
    print(f"• PC1 mean (primary behavior axis): {pc1_mean:.2f}")
    print(f"• PC2 mean (secondary behavior axis): {pc2_mean:.2f}")
    print(f"• PC1 spread (std; lower means a tighter cluster): {np.std(segment_pca[:, 0]):.2f}")

# Storage benefits (component scores only; the fitted scaler and PCA components
# must also be stored, which adds a small fixed overhead)
original_storage = X_customer.shape[0] * X_customer.shape[1] * 8  # 8 bytes per float64
reduced_storage = X_customer_reduced.shape[0] * X_customer_reduced.shape[1] * 8
storage_savings = (original_storage - reduced_storage) / original_storage

print("\nDATA COMPRESSION BENEFITS:")
print(f"• Original storage: {original_storage / 1024:.1f} KB")
print(f"• Reduced storage: {reduced_storage / 1024:.1f} KB")
print(f"• Storage savings: {storage_savings*100:.1f}%")
print(f"• Variance retained: {np.sum(pca_optimal.explained_variance_ratio_)*100:.1f}%")

# ROI calculation (illustrative: every dollar figure and rate below is an
# assumed input, not a measurement from this notebook)
data_storage_cost_per_gb = 100  # Assumed: $100 per GB per year
processing_cost_reduction = 0.60  # Assumed: 60% reduction in processing time
model_accuracy_maintained = perf_df[perf_df['components'] == n_components_95]['accuracy'].iloc[0]

annual_data_volume_gb = 1000  # Assumed: 1 TB of customer data
storage_cost_savings = annual_data_volume_gb * data_storage_cost_per_gb * storage_savings
processing_cost_savings = 200_000 * processing_cost_reduction  # Assumed: $200k annual processing costs

total_benefits = storage_cost_savings + processing_cost_savings
implementation_cost = 80_000  # Assumed one-time cost

print("\nPCA IMPLEMENTATION ROI (illustrative assumptions):")
print(f"• Storage cost savings: ${storage_cost_savings:,.0f}/year")
print(f"• Processing cost savings: ${processing_cost_savings:,.0f}/year")
print(f"• Model accuracy at {n_components_95} components: {model_accuracy_maintained*100:.1f}%")
print(f"• Total annual benefits: ${total_benefits:,.0f}")
print(f"• Implementation cost: ${implementation_cost:,.0f}")
print(f"• First-year ROI: {(total_benefits - implementation_cost)/implementation_cost*100:.0f}%")
print(f"• Payback period: {implementation_cost/total_benefits*12:.1f} months")

print("\nCross-Reference Learning Path:")
print("• Foundation: Tier3_Statistics.ipynb (variance concepts)")
print("• Application: Tier4_Clustering.ipynb (dimensionality preprocessing)")
print("• Advanced: Tier5_NeuralNetworks.ipynb (autoencoder comparison)")
print("• Next: Tier6_AdvancedML.ipynb (curse of dimensionality)")
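
The component sweep in this notebook scores a classifier on PCA-reduced features. One way to keep such an evaluation free of preprocessing leakage is to wrap scaling, PCA, and the classifier in a single scikit-learn `Pipeline` and cross-validate the whole thing. The sketch below uses `make_classification` stand-in data and an assumed set of component counts; in the notebook the inputs would be `X_customer` and `y_customer`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption); replace with the notebook's X_customer, y_customer
X, y = make_classification(
    n_samples=600, n_features=50, n_informative=10,
    n_classes=3, random_state=42
)

cv_results = {}
for n_comp in [5, 10, 20]:
    model = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=n_comp)),
        ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ])
    # cross_val_score refits the whole pipeline on each training fold,
    # so the scaler and PCA never see the held-out rows
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[n_comp] = scores.mean()
    print(f"{n_comp:>2} components: accuracy {scores.mean():.3f} ± {scores.std():.3f}")
```

Cross-validated scores also give a spread per component count, which a single train/test split cannot provide.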