# Capstone 4: Evaluating & Visualizing Clusters

**Goal**: Assess cluster quality, visualize results in 2D, and interpret cluster characteristics.

**Deliverables**:
- 2D PCA projection (mandatory)
- Optional: t-SNE/UMAP projection
- Cluster characteristics summary table
- Feature importance analysis
- Insights documented in `DECISIONS_LOG.md`

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib
import warnings
warnings.filterwarnings('ignore')

# Import modules
from src.paths import get_processed_file, get_artifact_file
from src.preprocess import prepare_clustering_features, create_preprocessing_pipeline
from src.clustering import run_kmeans, compute_metrics
from src.interpretation import describe_clusters, interpret_clusters
from src.visualization import (
    plot_pca_projection,
    plot_tsne_projection,
    create_characteristics_table,
    plot_feature_importance_pca,
    plot_explained_variance,
    plot_cluster_size_distribution
)

print("✓ Libraries imported successfully")

---
## A) Load Champion Model Results

Load the champion clustering model from Capstone 3 (or re-run if needed).

In [None]:
# Load cleaned data
print("Loading cleaned dataset...")
df_clean = pd.read_csv(get_processed_file('trips_clean.csv'))
df_clean['started_at'] = pd.to_datetime(df_clean['started_at'])
df_clean['ended_at'] = pd.to_datetime(df_clean['ended_at'])

print(f"✓ Loaded {len(df_clean):,} trips")

# Prepare features
X = prepare_clustering_features(df_clean)
X_scaled, pipeline = create_preprocessing_pipeline(X, apply_pca=False, verbose=False)

print(f"✓ Feature matrix: {X_scaled.shape}")

In [None]:
# Re-run champion clustering (or load from saved model if implemented)
# Using K-Means k=5 as example (adjust based on Capstone 3 results)

print("Running champion clustering model...")
CHAMPION_K = 5  # TODO: Set this to the k selected in Capstone 3

labels, model = run_kmeans(X_scaled, k=CHAMPION_K, n_init=20, random_state=42, verbose=True)

# Compute final metrics
metrics = compute_metrics(X_scaled, labels, verbose=True)
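
The comment above mentions loading a saved model, and `joblib` is imported but not yet used. A minimal sketch of that save/load round trip follows, assuming a scikit-learn `KMeans` estimator and synthetic data in place of `X_scaled`. The file name `champion_kmeans.joblib` is hypothetical, and `run_kmeans` may return a different object than the plain estimator used here.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for X_scaled (assumption: real features come from the pipeline above)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 3)),
    rng.normal(loc=4.0, scale=0.5, size=(100, 3)),
])

model_demo = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; in the project this would sit under the artifacts directory
    model_path = Path(tmp_dir) / "champion_kmeans.joblib"
    joblib.dump(model_demo, model_path)
    reloaded = joblib.load(model_path)

# A reloaded model should assign the same labels as the original
labels_match = bool(
    np.array_equal(model_demo.predict(X_demo), reloaded.predict(X_demo))
)
print(f"Reloaded model reproduces labels: {labels_match}")
```

Persisting the fitted model avoids re-running `n_init` restarts on every notebook execution and guarantees that later capstones see the same cluster assignments.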

---
## B) Cluster Quality Evaluation

Review metrics and compare to success criteria.

In [None]:
# Evaluation summary
print("="*60)
print("CLUSTER QUALITY ASSESSMENT")
print("="*60)
print(f"Number of clusters: {metrics['n_clusters']}")
print(f"\nMetrics:")
print(f"  Silhouette Score: {metrics['silhouette']:.4f}")
print(f"  Davies-Bouldin Index: {metrics['davies_bouldin']:.4f}")
print(f"  Calinski-Harabasz Index: {metrics['calinski_harabasz']:.1f}")

# Compare to success criteria
print(f"\nSuccess Criteria (from EVALUATION_PLAN.md):")
print(f"  Silhouette ≥ 0.35: {'PASS' if metrics['silhouette'] >= 0.35 else 'FAIL'} ({metrics['silhouette']:.4f})")
print(f"  Davies-Bouldin < 1.5: {'PASS' if metrics['davies_bouldin'] < 1.5 else 'FAIL'} ({metrics['davies_bouldin']:.4f})")

if metrics['silhouette'] >= 0.35 and metrics['davies_bouldin'] < 1.5:
    print("\n✅ CLUSTERING QUALITY: EXCELLENT (meets all criteria)")
elif metrics['silhouette'] >= 0.25:
    # 0.25 is a looser fallback threshold, not one of the formal success criteria
    print("\n⚠️  CLUSTERING QUALITY: ACCEPTABLE (silhouette ≥ 0.25, full criteria not met)")
else:
    print("\n❌ CLUSTERING QUALITY: POOR (criteria not met)")

print("="*60)
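
The body of `compute_metrics` lives in `src.clustering` and is not shown in this notebook. As a rough sketch of how the three scores could be computed with scikit-learn, assuming synthetic blobs in place of `X_scaled` (the dictionary keys mirror the ones used above, but the real function may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Three well-separated synthetic blobs (illustrative data only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-3.0, scale=0.6, size=(150, 4)),
    rng.normal(loc=0.0, scale=0.6, size=(150, 4)),
    rng.normal(loc=3.0, scale=0.6, size=(150, 4)),
])
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

demo_metrics = {
    "n_clusters": int(len(np.unique(labels_demo))),
    # sample_size subsamples rows so the pairwise-distance cost stays manageable
    "silhouette": float(
        silhouette_score(X_demo, labels_demo, sample_size=300, random_state=0)
    ),
    "davies_bouldin": float(davies_bouldin_score(X_demo, labels_demo)),
    "calinski_harabasz": float(calinski_harabasz_score(X_demo, labels_demo)),
}

for name, value in demo_metrics.items():
    print(f"{name}: {value}")
```

Silhouette ranges from -1 to 1 (higher is better), Davies-Bouldin is non-negative (lower is better), and Calinski-Harabasz is unbounded (higher is better). On a trip dataset with many rows, the `sample_size` argument is usually worth setting because the full silhouette computation scales quadratically with the number of rows.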

---
## C) 2D Visualization: PCA Projection (Mandatory)

Project high-dimensional clusters to 2D for visualization.

In [None]:
# Get cluster interpretations for plot labels
feature_cols = ['duration_min', 'distance_km', 'start_hour', 'weekday', 'is_weekend', 'is_member', 'is_round_trip']
profiles = describe_clusters(df_clean, labels, feature_cols=feature_cols, verbose=False)
interpretations = interpret_clusters(profiles, verbose=True)

# Create PCA projection
X_pca, pca_model = plot_pca_projection(X_scaled, labels, cluster_names=interpretations, save=True)
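
For readers without access to `src.visualization`, the following is a minimal sketch of what a 2D PCA projection plot might involve. The function name `project_to_2d` and the synthetic data are assumptions for illustration; the real `plot_pca_projection` may style, label, and save the figure differently.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA


def project_to_2d(X, labels, random_state=42):
    """Fit a 2-component PCA and draw one scatter series per cluster."""
    pca = PCA(n_components=2, random_state=random_state)
    X_2d = pca.fit_transform(X)

    fig, ax = plt.subplots(figsize=(7, 5))
    for cluster_id in np.unique(labels):
        mask = labels == cluster_id
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1], s=8, alpha=0.5,
                   label=f"Cluster {cluster_id}")

    # Put the explained variance on the axes so readers see how lossy the view is
    var = pca.explained_variance_ratio_ * 100
    ax.set_xlabel(f"PC1 ({var[0]:.1f}% variance)")
    ax.set_ylabel(f"PC2 ({var[1]:.1f}% variance)")
    ax.set_title("Clusters in PCA space")
    ax.legend()
    return X_2d, pca, fig


# Illustrative data: two synthetic clusters in five dimensions
rng = np.random.default_rng(3)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 5)),
    rng.normal(loc=3.0, scale=1.0, size=(100, 5)),
])
labels_demo = np.repeat([0, 1], 100)

X_2d, pca_demo, fig = project_to_2d(X_demo, labels_demo)
plt.close(fig)  # release the figure once it is no longer needed
```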

In [None]:
# Analyze PCA feature contributions
print("\nPCA Feature Importance:")
plot_feature_importance_pca(pca_model, X.columns.tolist(), save=True)

In [None]:
# Explained variance analysis
plot_explained_variance(pca_model, save=True)

print(f"\nPCA Insights:")
print(f"  PC1 explains {pca_model.explained_variance_ratio_[0]*100:.1f}% of variance")
print(f"  PC2 explains {pca_model.explained_variance_ratio_[1]*100:.1f}% of variance")
print(f"  Total (2D): {pca_model.explained_variance_ratio_[:2].sum()*100:.1f}%")

# Interpret PC1 and PC2
pc1_top = np.argsort(np.abs(pca_model.components_[0]))[::-1][:3]
pc2_top = np.argsort(np.abs(pca_model.components_[1]))[::-1][:3]

print(f"\n  PC1 driven by: {[X.columns[i] for i in pc1_top]}")
print(f"  PC2 driven by: {[X.columns[i] for i in pc2_top]}")
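
The top-3 lists above hide the sign and size of each loading. A small sketch of a fuller view, assuming a PCA fitted on all components and synthetic data that reuses three of the feature names from this notebook: it builds a loadings table and counts how many components are needed to reach 90% cumulative variance (the 90% target is an arbitrary choice for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Three feature names borrowed from feature_cols; the values are synthetic
feature_names = ["duration_min", "distance_km", "start_hour"]
rng = np.random.default_rng(1)
base = rng.normal(size=(500, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),  # strongly correlated pair...
    base + 0.1 * rng.normal(size=(500, 1)),  # ...standing in for duration/distance
    rng.normal(size=(500, 1)),               # independent feature
])

pca_full = PCA().fit(X_demo)

# Rows are features, columns are components; entries are signed loadings
loadings = pd.DataFrame(
    pca_full.components_.T,
    index=feature_names,
    columns=[f"PC{i + 1}" for i in range(pca_full.n_components_)],
)

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(loadings.round(2))
print(f"Components needed for 90% variance: {n_for_90}")
```

If the number of components needed is well above 2, the 2D scatter should be read as a partial view rather than a faithful picture of cluster separation.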

---
## D) Optional: t-SNE Projection

Alternative non-linear dimensionality reduction (may take time for large datasets).

In [None]:
# Uncomment to run t-SNE (may take 5-10 minutes for large datasets)
# print("Running t-SNE projection...")
# X_tsne = plot_tsne_projection(
#     X_scaled,
#     labels,
#     cluster_names=interpretations,
#     perplexity=30,
#     n_iter=1000,
#     save=True
# )

print("Skipping t-SNE (optional). Uncomment above to run.")
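
One common way to keep t-SNE runtime manageable is to embed a random subsample rather than every trip. The sketch below assumes scikit-learn's `TSNE` and synthetic data; `tsne_on_sample` and `max_points` are hypothetical names, and the project's `plot_tsne_projection` may already handle sampling internally.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_on_sample(X, labels, max_points=2000, perplexity=30, random_state=42):
    """Embed at most max_points randomly chosen rows with t-SNE."""
    rng = np.random.default_rng(random_state)
    n_rows = X.shape[0]
    idx = rng.choice(n_rows, size=min(n_rows, max_points), replace=False)
    embedding = TSNE(
        n_components=2,
        perplexity=perplexity,  # must be smaller than the number of sampled rows
        init="pca",
        random_state=random_state,
    ).fit_transform(X[idx])
    return embedding, labels[idx], idx


# Illustrative data: three synthetic clusters in five dimensions
rng = np.random.default_rng(5)
X_demo = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(200, 5)) for center in (0.0, 4.0, 8.0)
])
labels_demo = np.repeat([0, 1, 2], 200)

embedding, sample_labels, sample_idx = tsne_on_sample(
    X_demo, labels_demo, max_points=300
)
print(embedding.shape)
```

Distances between clusters in a t-SNE plot are not directly comparable, so the embedding is best used to check whether clusters form distinct groups, not how far apart they are.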

---
## E) Cluster Characteristics Summary

Create comprehensive table of cluster properties.

In [None]:
# Generate characteristics table
char_table = create_characteristics_table(
    df_clean,
    labels,
    interpretations=interpretations,
    save=True
)

# Display as formatted table
char_table
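
The core of a characteristics table is a per-cluster aggregation. A minimal sketch with pandas, using a six-row toy DataFrame; the real `create_characteristics_table` presumably adds interpretation labels and formatting on top of something like this.

```python
import pandas as pd

# Toy trips (values invented for illustration only)
trips_demo = pd.DataFrame({
    "duration_min": [8.0, 10.0, 35.0, 45.0, 12.0, 40.0],
    "distance_km": [1.5, 2.0, 5.0, 6.0, 2.5, 5.5],
    "is_member": [1, 1, 0, 0, 1, 0],
})
labels_demo = pd.Series([0, 0, 1, 1, 0, 1], name="cluster")

# Mean of every feature per cluster
summary = trips_demo.groupby(labels_demo).mean()

# Prepend cluster size and share of all trips
summary.insert(0, "size", trips_demo.groupby(labels_demo).size())
summary.insert(1, "pct", summary["size"] / len(trips_demo) * 100)

print(summary)
```

Binary columns such as `is_member` average to a share between 0 and 1, which is why the notebook multiplies them by 100 when printing percentages.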

---
## F) Cluster Size Distribution

In [None]:
# Plot cluster sizes
plot_cluster_size_distribution(labels, cluster_names=interpretations, save=True)

# Check for imbalanced clusters
unique, counts = np.unique(labels[labels >= 0], return_counts=True)
percentages = counts / len(labels[labels >= 0]) * 100

print("\nCluster Balance Check:")
for cluster_id, pct in zip(unique, percentages):
    if pct < 5:
        print(f"  ⚠️  Cluster {cluster_id} is very small ({pct:.1f}%) - may be unstable")
    elif pct > 50:
        print(f"  ⚠️  Cluster {cluster_id} is dominant ({pct:.1f}%) - other clusters may be niche")
    else:
        print(f"  ✓ Cluster {cluster_id}: {pct:.1f}% (balanced)")
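
The "may be unstable" warning above is a heuristic based on size alone. One way to test stability more directly is to refit with different seeds and compare assignments using the adjusted Rand index (ARI). This is a sketch on synthetic data, not part of the project's `src` modules:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Illustrative data: three synthetic clusters
rng = np.random.default_rng(7)
X_demo = np.vstack([
    rng.normal(loc=center, scale=0.7, size=(150, 4)) for center in (-4.0, 0.0, 4.0)
])

# Reference clustering, then refits with different random seeds
reference = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
ari_scores = [
    adjusted_rand_score(
        reference,
        KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X_demo),
    )
    for seed in range(1, 6)
]
mean_ari = float(np.mean(ari_scores))
print(f"Mean ARI across seeds: {mean_ari:.3f}")
```

ARI is invariant to label permutation, so it compares partitions even when cluster IDs are shuffled between runs. Values near 1 suggest the partition is reproducible; clearly lower values suggest that some clusters depend on initialization.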

---
## G) Actionable Insights & Recommendations

Translate cluster findings into stakeholder actions.

In [None]:
print("="*80)
print("ACTIONABLE INSIGHTS BY CLUSTER")
print("="*80 + "\n")

for cluster_id in sorted(interpretations.keys()):
    profile = profiles.loc[cluster_id]
    interp = interpretations[cluster_id]
    
    print(f"**Cluster {cluster_id}: {interp}**")
    print(f"  Size: {int(profile['size']):,} trips ({profile['pct']:.1f}%)")
    print(f"  Profile: {profile['duration_min']:.1f} min, {profile['distance_km']:.2f} km, hour {profile['start_hour']:.1f}")
    print(f"  Weekend: {profile['is_weekend']*100:.0f}%, Members: {profile['is_member']*100:.0f}%")
    
    # Cluster-specific recommendations
    if 'Commuter' in interp:
        print(f"  → Recommendations:")
        print(f"     • Prioritize protected bike lanes on high-traffic corridors")
        print(f"     • Expand stations near office districts and transit hubs")
        print(f"     • Ensure bike availability during peak hours (7-9 AM, 5-7 PM)")
    
    elif 'Tourist' in interp or 'Leisure' in interp:
        print(f"  → Recommendations:")
        print(f"     • Add stations near parks, waterfronts, and tourist attractions")
        print(f"     • Design scenic routes (Brooklyn Bridge, Central Park loops)")
        print(f"     • Market to hotels and visitor centers")
    
    elif 'Last-Mile' in interp:
        print(f"  → Recommendations:")
        print(f"     • Integrate with public transit (bike racks at subway entrances)")
        print(f"     • Ensure high station density near transit hubs")
        print(f"     • Promote 'bike + transit' combo passes")
    
    else:
        print(f"  → Recommendations:")
        print(f"     • Ensure coverage in residential and commercial areas")
        print(f"     • Flexible pricing for diverse trip types")
        print(f"     • Monitor and adapt to emerging usage patterns")
    
    print()

print("="*80)

---
## H) Unexpected Findings & Deep Dives

Investigate anomalies or surprising patterns.

In [None]:
# Check for unexpected cluster characteristics
print("Checking for unexpected patterns...\n")

for cluster_id in sorted(interpretations.keys()):
    profile = profiles.loc[cluster_id]
    
    # Anomaly 1: High weekend activity among members
    if profile['is_weekend'] > 0.3 and profile['is_member'] > 0.7:
        print(f"🔍 Cluster {cluster_id}: High weekend member activity ({profile['is_weekend']*100:.0f}% weekend, {profile['is_member']*100:.0f}% members)")
        print(f"   → Possible 'weekend workers' or 'leisure members' segment\n")
    
    # Anomaly 2: Night riders
    if profile['start_hour'] < 6 or profile['start_hour'] > 22:
        print(f"🔍 Cluster {cluster_id}: Night/early morning trips (avg hour {profile['start_hour']:.1f})")
        print(f"   → Possible 'shift workers' or 'nightlife' segment\n")
    
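    # Anomaly 3: Midday weekday trips
    # profile['weekday'] is a cluster mean and is almost always < 5,
    # so the weekend share is used to identify weekday-dominated clusters
    if profile['is_weekend'] < 0.3 and 9 < profile['start_hour'] < 16: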
        print(f"🔍 Cluster {cluster_id}: Midday weekday trips (hour {profile['start_hour']:.1f})")
        print(f"   → Possible 'flexible workers' or 'lunch-break riders'\n")

print("\n(Review cluster profiles above and add manual observations here)")
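
The night-rider check above compares a cluster's average `start_hour` against 6 and 22. If that average is an arithmetic mean (an assumption, since `describe_clusters` is not shown here), a cluster of trips around midnight can average out to midday and be missed. A circular mean treats the hour as an angle on a 24-hour clock; `circular_mean_hour` is a hypothetical helper:

```python
import numpy as np


def circular_mean_hour(hours):
    """Mean hour on a 24-hour clock, returned in the range [0, 24)."""
    angles = np.asarray(hours, dtype=float) / 24.0 * 2.0 * np.pi
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    return float((mean_angle / (2.0 * np.pi) * 24.0) % 24.0)


late_night = [23, 23, 1, 1]
arithmetic = float(np.mean(late_night))    # lands at midday
circular = circular_mean_hour(late_night)  # lands at (or just next to) midnight

# Distance from midnight on the clock face, robust to wrap-around
hours_from_midnight = min(circular, 24.0 - circular)
print(f"Arithmetic mean: {arithmetic:.1f}")
print(f"Hours from midnight (circular mean): {hours_from_midnight:.2f}")
```

The circular mean is undefined when trips are spread evenly around the clock, so it is most useful for clusters that already look concentrated in time.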

---
## I) Reflection: Evaluation Quality & Limitations

### Evaluation Summary

✅ **Quantitative Metrics**:
- Silhouette score: [value from Section B] (pass if ≥ 0.35)
- Davies-Bouldin: [value from Section B] (pass if < 1.5)
- Calinski-Harabasz: [value from Section B] (higher is better)

✅ **Qualitative Assessment**:
- Clusters are **interpretable** (align with commuter/tourist/last-mile hypotheses)
- PCA projection shows **visible separation** (though overlap exists)
- Cluster sizes are **reasonably balanced** (no cluster <5% or >50%, the thresholds used in Section F)

⚠️ **Limitations**:
1. **PCA captures only [2D total from Section C]% of variance** in 2D
   - Some cluster separation may exist in higher dimensions
   - 2D visualization is inherently lossy

2. **Overlap in PCA space** doesn't mean poor clustering
   - Clusters may be separated in original 7D space
   - PCA optimizes for variance, not cluster separation

3. **Interpretability is subjective**
   - Cluster names based on heuristics (not ground truth)
   - Real-world validation needed (user surveys, operator feedback)
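
Limitation 2 can be made concrete by scoring the same labels in the full feature space and in the 2D PCA space. The sketch below uses synthetic data constructed so that the clusters differ only along a low-variance dimension, which PCA discards; the data and variable names are illustrative assumptions, not results from the trip dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(11)
n_per_cluster = 200

# Two high-variance dimensions shared by both clusters (no separation here)
shared_noise = rng.normal(scale=2.0, size=(2 * n_per_cluster, 2))

# One lower-variance dimension that actually separates the clusters
separating = np.concatenate([
    rng.normal(loc=-1.5, scale=0.3, size=n_per_cluster),
    rng.normal(loc=1.5, scale=0.3, size=n_per_cluster),
]).reshape(-1, 1)

X_demo = np.hstack([shared_noise, separating])
labels_demo = np.repeat([0, 1], n_per_cluster)

# PCA keeps the two highest-variance directions, which are mostly noise
X_2d = PCA(n_components=2).fit_transform(X_demo)

sil_full = float(silhouette_score(X_demo, labels_demo))
sil_2d = float(silhouette_score(X_2d, labels_demo))

print(f"Silhouette in full space: {sil_full:.3f}")
print(f"Silhouette in 2D PCA space: {sil_2d:.3f}")
```

A noticeably lower score in the projected space indicates that the overlap visible in the scatter plot comes from the projection rather than from the clustering itself.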

### Key Findings

**Most Important Features** (from PCA):
- PC1 likely separates by **trip duration/distance** (long vs short trips)
- PC2 likely separates by **time** (weekday/weekend, hour of day)

**Cluster Insights**:
- [Review char_table and note key patterns: e.g., "Cluster 0 (Commuters) shows 90% weekday, 80% members, peak at hour 8"]
- [Identify unexpected findings from section H]

### Actionability

✅ **Stakeholder Value**:
- City planners can use cluster maps to prioritize infrastructure (bike lanes, stations)
- Operators can tailor pricing and marketing by cluster
- Advocates can quantify impact (e.g., "40% of trips are commuters → XX tons CO₂ saved")

⚠️ **Caveats**:
- Seasonal bias (spring/summer data)
- Geographic skew (Manhattan/Brooklyn dominant)
- Recommend validation with fall/winter data and other cities

---

## Summary: Capstone 4 Deliverables

✅ **2D PCA Projection**: `reports/figures/pca_clusters_2d.png`

✅ **Feature Importance**: `reports/figures/pca_feature_importance.png`

✅ **Explained Variance**: `reports/figures/pca_explained_variance.png`

✅ **Cluster Sizes**: `reports/figures/cluster_size_distribution.png`

✅ **Characteristics Table**: `reports/cluster_characteristics_table.csv`

✅ **Quality Assessment**: Metrics evaluated against success criteria (silhouette ≥ 0.35, DB < 1.5); record the PASS/FAIL outcome from Section B

✅ **Actionable Insights**: Cluster-specific policy recommendations documented

### Next Steps (Capstone 5)
- Synthesize findings into **IMPACT_REPORT.md** (stakeholder-focused)
- Create **EXECUTIVE_SUMMARY.md** (one-page, non-technical)
- Update **DECISIONS_LOG.md** with final evaluation insights
- Populate **05_impact_reporting.ipynb**
- Document lessons learned and future work

---

*Ready for Capstone 5: Impact Reporting* 🚴‍♀️📊🎯✨