# Module 17: Unsupervised Learning - Clustering and Dimensionality Reduction

## Topics Covered
1. Introduction to Unsupervised Learning
2. K-Means Clustering
3. Hierarchical Clustering
4. Clustering Evaluation (Elbow Method, Silhouette Score)
5. Introduction to Dimensionality Reduction
6. Principal Component Analysis (PCA)
7. Practical Applications

## Learning Objectives

By the end of this module, you will be able to:
- Understand the difference between supervised and unsupervised learning
- Apply K-Means clustering to segment data into groups
- Use hierarchical clustering and interpret dendrograms
- Evaluate clustering quality using Elbow Method and Silhouette Score
- Understand why dimensionality reduction is important
- Apply PCA to reduce dimensions while preserving information
- Visualize high-dimensional data in 2D or 3D
- Apply these techniques to real-world problems like customer segmentation

---

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Clustering algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering

# Dimensionality reduction
from sklearn.decomposition import PCA

# Evaluation metrics
from sklearn.metrics import silhouette_score, silhouette_samples

# Preprocessing
from sklearn.preprocessing import StandardScaler

# Datasets
from sklearn.datasets import make_blobs, load_iris

# Visualization
from scipy.cluster.hierarchy import dendrogram, linkage

# Settings
np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

---
# Section 1: Introduction to Unsupervised Learning
---

## What is Unsupervised Learning?

Unlike supervised learning, where we have labeled data (input-output pairs), unsupervised learning works with **unlabeled data**. The algorithm finds patterns, structures, or relationships in the data without being told what to look for.

### Supervised vs Unsupervised Learning

| Aspect | Supervised Learning | Unsupervised Learning |
|--------|--------------------|-----------------------|
| Data | Labeled (X, y) | Unlabeled (X only) |
| Goal | Predict y from X | Find patterns in X |
| Examples | Classification, Regression | Clustering, Dimensionality Reduction |
| Use Cases | Spam detection, Price prediction | Customer segmentation, Data compression |

## Two Main Types of Unsupervised Learning

### 1. Clustering
Group similar data points together
- **Customer Segmentation**: Group customers by purchasing behavior
- **Image Compression**: Group similar colors together
- **Anomaly Detection**: Identify unusual patterns

### 2. Dimensionality Reduction
Reduce the number of features while preserving information
- **Data Visualization**: Project high-dimensional data to 2D/3D
- **Noise Reduction**: Remove redundant features
- **Speed Up Training**: Fewer features = faster models

### Why This Matters in Data Science

Most real-world data is unlabeled! Labels are expensive and time-consuming to obtain. Unsupervised learning helps us:
- Discover hidden patterns we didn't know existed
- Understand data structure before building models
- Reduce costs (no manual labeling needed)
- Preprocess data for supervised learning

In [None]:
# Visualize supervised vs unsupervised learning
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42, cluster_std=1.0)

# Supervised Learning view (we have labels)
axes[0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50, alpha=0.6)
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_title('Supervised Learning\n(We have labels - colored by class)')

# Unsupervised Learning view (no labels)
axes[1].scatter(X[:, 0], X[:, 1], c='gray', s=50, alpha=0.6)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('Unsupervised Learning\n(No labels - find patterns ourselves)')

plt.tight_layout()
plt.show()

print("In unsupervised learning, we need to discover the 3 groups on our own!")

---
# Section 2: K-Means Clustering
---

## What is K-Means Clustering?

K-Means is one of the most widely used clustering algorithms. It partitions data into **K clusters** by:
1. Initializing K cluster centers (centroids), either randomly or with a smarter scheme such as k-means++ (scikit-learn's default)
2. Assigning each point to the nearest centroid
3. Updating centroids as the mean of assigned points
4. Repeating steps 2-3 until convergence
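
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all assumptions made for this example, and it uses plain random initialization.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means loop (illustrative only, not sklearn's implementation)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (hypothetical data)
X_toy_km = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_toy_km, centroids_toy_km = kmeans_sketch(X_toy_km, k=2)
print("Labels:", labels_toy_km)
print("Centroids:\n", centroids_toy_km.round(2))
```

Which group ends up as cluster 0 depends on the initialization, so only the grouping is meaningful, not the label numbers.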

### How It Works

Imagine you want to group customers into 3 segments:
1. Start with 3 random "representative customers" (centroids)
2. Assign each customer to their most similar representative
3. Update each representative to be the average of their group
4. Repeat until groups stop changing

### When to Use K-Means

**Good for:**
- Spherical, evenly-sized clusters
- Large datasets (very fast)
- When you know or can guess K

**Not good for:**
- Non-spherical shapes (elongated, curved clusters)
- Clusters of very different sizes
- Clusters of very different densities
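
To see the non-spherical limitation in practice, here is a small sketch on scikit-learn's two-moons toy dataset. The dataset choice, the variable names, and the use of the adjusted Rand index (1.0 means perfect agreement with the true grouping) are assumptions for this illustration. Single-linkage agglomerative clustering (covered in Section 3) is included only as a contrast; it usually follows curved shapes better on data like this, but that is not guaranteed.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clearly two groups, but not spherical
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

km_moons = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
sl_moons = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X_moons)

# Compare each clustering with the known crescent membership
ari_kmeans = adjusted_rand_score(y_moons, km_moons)
ari_single = adjusted_rand_score(y_moons, sl_moons)
print(f"K-Means ARI:        {ari_kmeans:.2f}")
print(f"Single-linkage ARI: {ari_single:.2f}")
```

K-Means splits the plane with a straight boundary, so it tends to cut each crescent in half rather than follow its curve.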

## Syntax

```python
from sklearn.cluster import KMeans

# Create K-Means model
kmeans = KMeans(
    n_clusters=3,      # Number of clusters
    random_state=42    # For reproducibility
)

# Fit to data and predict clusters
cluster_labels = kmeans.fit_predict(X)

# Access cluster centers
centroids = kmeans.cluster_centers_

# Predict cluster for new data
new_clusters = kmeans.predict(X_new)
```

In [None]:
# Example: Customer Segmentation with K-Means
print("Customer Segmentation Example")
print("="*50)

# Generate synthetic customer data
np.random.seed(42)
n_customers = 300

# Create 3 distinct customer segments
# Segment 1: High income, high spending
segment1 = np.random.randn(100, 2) * 10 + [80, 70]
# Segment 2: Medium income, medium spending
segment2 = np.random.randn(100, 2) * 8 + [50, 40]
# Segment 3: Low income, low spending
segment3 = np.random.randn(100, 2) * 6 + [25, 20]

X_customers = np.vstack([segment1, segment2, segment3])

customer_df = pd.DataFrame(X_customers, columns=['Annual_Income_k', 'Spending_Score'])
print(f"\nDataset shape: {customer_df.shape}")
print(f"\nFirst few customers:")
print(customer_df.head())

# Visualize unlabeled data
plt.figure(figsize=(10, 6))
plt.scatter(customer_df['Annual_Income_k'], customer_df['Spending_Score'], 
           c='gray', alpha=0.6, s=50)
plt.xlabel('Annual Income ($1000s)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Data (Unlabeled)')
plt.grid(alpha=0.3)
plt.show()

print("\nCan you see 3 groups? Let's use K-Means to find them!")

In [None]:
# Apply K-Means clustering
print("Applying K-Means Clustering")
print("="*50)

# Create and fit K-Means model
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_customers)

# Add cluster labels to dataframe
customer_df['Cluster'] = cluster_labels

print(f"\nClusters found: {kmeans.n_clusters}")
print(f"\nCluster distribution:")
print(customer_df['Cluster'].value_counts().sort_index())

print(f"\nCluster Centers (Centroids):")
centroids_df = pd.DataFrame(kmeans.cluster_centers_, 
                            columns=['Annual_Income_k', 'Spending_Score'],
                            index=[f'Cluster {i}' for i in range(3)])
print(centroids_df.round(2))

# Visualize clusters
plt.figure(figsize=(12, 6))

# Plot customers colored by cluster
scatter = plt.scatter(customer_df['Annual_Income_k'], 
                     customer_df['Spending_Score'],
                     c=customer_df['Cluster'], 
                     cmap='viridis', 
                     alpha=0.6, s=50)

# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], 
           kmeans.cluster_centers_[:, 1],
           c='red', marker='X', s=300, 
           edgecolors='black', linewidths=2,
           label='Centroids')

plt.xlabel('Annual Income ($1000s)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation with K-Means (K=3)')
plt.colorbar(scatter, label='Cluster')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

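# Cluster numbers are arbitrary, so name the segments by ranking centroids by income
print("\nInterpretation:")
ranked = centroids_df.sort_values('Annual_Income_k', ascending=False)
for name, level in zip(ranked.index, ['High', 'Medium', 'Low']):
    print(f"  {name}: {level} income, {level.lower()} spending")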

## Practice Exercise 2.1

**Task:** Apply K-Means clustering to the Iris dataset

Steps:
1. Load the Iris dataset using `load_iris()`
2. Use only the first 2 features (sepal length and width) for visualization
3. Apply K-Means with K=3
4. Visualize the clusters and centroids
5. Compare cluster labels with true species labels

**Expected Output:**
```
3 clusters found
Visualization showing clusters
```

In [None]:
# Your code here


In [None]:
# Solution 2.1

# Load Iris dataset
iris = load_iris()
X_iris = iris.data[:, :2]  # Use only first 2 features
y_iris = iris.target

print("Iris Dataset Clustering")
print("="*50)
print(f"Dataset shape: {X_iris.shape}")
print(f"True species: {iris.target_names}")

# Apply K-Means
kmeans_iris = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters_iris = kmeans_iris.fit_predict(X_iris)

print(f"\nClusters found: {kmeans_iris.n_clusters}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True labels
axes[0].scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris, cmap='viridis', alpha=0.6)
axes[0].set_xlabel('Sepal Length')
axes[0].set_ylabel('Sepal Width')
axes[0].set_title('True Species Labels')

# K-Means clusters
axes[1].scatter(X_iris[:, 0], X_iris[:, 1], c=clusters_iris, cmap='viridis', alpha=0.6)
axes[1].scatter(kmeans_iris.cluster_centers_[:, 0], 
               kmeans_iris.cluster_centers_[:, 1],
               c='red', marker='X', s=300, edgecolors='black', linewidths=2)
axes[1].set_xlabel('Sepal Length')
axes[1].set_ylabel('Sepal Width')
axes[1].set_title('K-Means Clusters')

plt.tight_layout()
plt.show()

print("\nCompare the panels: the clusters only partly match the species, because two species overlap in sepal measurements.")
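
Step 5 of the exercise asks for a comparison with the true species. One possible way to make that comparison quantitative is a contingency table plus the adjusted Rand index; both are choices made for this sketch rather than part of the original solution, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris_cmp = load_iris()
X_sepal = iris_cmp.data[:, :2]  # sepal length and width only
labels_cmp = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_sepal)

# Rows: true species, columns: K-Means cluster id (numbering is arbitrary)
species = pd.Series(iris_cmp.target_names[iris_cmp.target], name='species')
table = pd.crosstab(species, pd.Series(labels_cmp, name='cluster'))
print(table)

# Adjusted Rand index: 1.0 = identical grouping, about 0 = random agreement
ari_iris = adjusted_rand_score(iris_cmp.target, labels_cmp)
print(f"\nAdjusted Rand index: {ari_iris:.3f}")
```

A species whose row is concentrated in one column was recovered cleanly; rows spread across columns show where clusters and species disagree.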

---
# Section 3: Hierarchical Clustering
---

## What is Hierarchical Clustering?

Hierarchical clustering builds a tree of clusters (dendrogram) by either:
- **Agglomerative (bottom-up)**: Start with each point as its own cluster, merge closest pairs
- **Divisive (top-down)**: Start with one cluster, split recursively

We'll focus on **agglomerative clustering** as it's more common.

### How It Works (Agglomerative)

1. Start: Each point is its own cluster (N clusters)
2. Find the two closest clusters
3. Merge them into one cluster
4. Repeat steps 2-3 until one cluster remains
5. Cut the dendrogram at desired height to get K clusters
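
As a sketch of step 5, SciPy's `fcluster` can cut the merge tree either by a target number of clusters or by a height. The toy data, the variable names, and the cut height of 20 are assumptions chosen so that the three groups are unambiguous.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three tight, far-apart groups of 10 points each (hypothetical data)
X_groups = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(10, 2))
    for center in ([0, 0], [50, 50], [100, 0])
])

# Steps 1-4: record the full merge history (one row per merge)
Z = linkage(X_groups, method='ward')

# Step 5, option A: ask for at most 3 clusters
labels_by_k = fcluster(Z, t=3, criterion='maxclust')
# Step 5, option B: cut the tree at a chosen height
labels_by_height = fcluster(Z, t=20, criterion='distance')

print("Merges recorded:", Z.shape[0])
print("Clusters (maxclust=3):", len(set(labels_by_k)))
print("Clusters (height=20): ", len(set(labels_by_height)))
```

Both cuts use the same tree, which is why K can be chosen after the clustering has run.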

### Advantages over K-Means

- **No need to specify K upfront** (can decide after seeing dendrogram)
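- Works with **many distance metrics** (Ward linkage is the exception: it requires Euclidean distance)
- Can find **non-spherical clusters**, depending on the linkage (single linkage in particular)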
- Provides **hierarchy** of clusters

### Disadvantages

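- **Slower** than K-Means: it needs at least O(n²) time and memory for the pairwise distances, while each K-Means iteration is roughly linear in n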
- Not suitable for very large datasets
- Can't undo merges (greedy algorithm)

## Linkage Methods

How do we measure distance between clusters?

| Method | Description | Use Case |
|--------|-------------|----------|
| Single | Minimum distance between points | Long, stringy clusters |
| Complete | Maximum distance between points | Compact clusters |
| Average | Average distance between all points | Balanced approach |
| Ward | Minimizes within-cluster variance | Most common, similar to K-Means |
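
To make the table concrete, the sketch below computes each linkage's cluster-to-cluster distance for two tiny hand-made clusters. The arrays `A` and `B` are hypothetical, and the Ward line uses the standard formula for the increase in within-cluster sum of squares caused by a merge.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny clusters on a line (hypothetical data)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

pairwise = cdist(A, B)            # every point in A against every point in B

single_dist = pairwise.min()      # closest pair
complete_dist = pairwise.max()    # farthest pair
average_dist = pairwise.mean()    # mean over all pairs

# Ward: how much the within-cluster sum of squares grows if A and B merge
n_a, n_b = len(A), len(B)
centroid_gap_sq = np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
ward_increase = (n_a * n_b) / (n_a + n_b) * centroid_gap_sq

print(f"Single:   {single_dist}")    # 2.0
print(f"Complete: {complete_dist}")  # 5.0
print(f"Average:  {average_dist}")   # 3.5
print(f"Ward SS increase: {ward_increase}")  # 12.25
```

At each step the algorithm merges the pair of clusters with the smallest value under the chosen rule, so the same data can produce different trees.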

## Syntax

```python
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Create hierarchical clustering model
hc = AgglomerativeClustering(
    n_clusters=3,       # Number of clusters
    linkage='ward'      # Linkage method
)

# Fit and predict
clusters = hc.fit_predict(X)

# Create dendrogram (requires scipy)
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
```

In [None]:
# Example: Hierarchical Clustering with Dendrogram
print("Hierarchical Clustering Example")
print("="*50)

# Use smaller customer dataset for clearer dendrogram
np.random.seed(42)
X_small = np.vstack([
    np.random.randn(10, 2) * 5 + [70, 60],
    np.random.randn(10, 2) * 5 + [40, 35],
    np.random.randn(10, 2) * 5 + [20, 15]
])

print(f"Dataset size: {X_small.shape[0]} customers")

# Create dendrogram
plt.figure(figsize=(12, 5))

# Calculate linkage matrix
linkage_matrix = linkage(X_small, method='ward')

# Plot dendrogram
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.axhline(y=50, color='r', linestyle='--', label='Cut at distance=50 (3 clusters)')
plt.legend()
plt.tight_layout()
plt.show()

print("\nHow to read the dendrogram:")
print("  - Height shows distance between clusters")
print("  - Cutting horizontally gives K clusters")
print("  - Cut at distance=50 gives 3 clusters")

In [None]:
# Apply hierarchical clustering
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
hc_labels = hc.fit_predict(X_small)

print("Hierarchical Clustering Results")
print("="*50)
print(f"Clusters found: {len(np.unique(hc_labels))}")
print(f"\nCluster distribution:")
print(pd.Series(hc_labels).value_counts().sort_index())

# Visualize clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_small[:, 0], X_small[:, 1], c=hc_labels, cmap='viridis', s=100, alpha=0.6)
plt.xlabel('Annual Income ($1000s)')
plt.ylabel('Spending Score')
plt.title('Hierarchical Clustering Results (K=3)')
plt.colorbar(label='Cluster')
plt.grid(alpha=0.3)
plt.show()

---
# Section 4: Clustering Evaluation
---

## How Do We Know If Clustering Is Good?

Unlike supervised learning, we don't have true labels to compare against. We need different evaluation methods.

## 1. Elbow Method (Choosing K)

The elbow method helps us choose the optimal number of clusters (K) by plotting:
- **X-axis**: Number of clusters (K)
- **Y-axis**: Within-cluster sum of squares (WCSS) or inertia

**Inertia/WCSS**: Sum of squared distances from each point to its cluster center (lower is better)

**How to use it:**
1. Try K = 1, 2, 3, ..., 10
2. Plot inertia for each K
3. Look for the "elbow" where decrease slows down
4. Choose K at the elbow
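
Inertia is easy to verify by hand. The sketch below recomputes it from the definition and compares it with the `inertia_` attribute that scikit-learn reports; the dataset and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_inertia, _ = make_blobs(n_samples=200, centers=3, random_state=42)
km_inertia = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_inertia)

# Look up the centroid each point was assigned to
assigned_centers = km_inertia.cluster_centers_[km_inertia.labels_]
# Sum of squared distances from every point to its own centroid
manual_inertia = np.sum((X_inertia - assigned_centers) ** 2)

print(f"Manual inertia:  {manual_inertia:.4f}")
print(f"sklearn inertia: {km_inertia.inertia_:.4f}")
```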

## 2. Silhouette Score

Measures how similar a point is to its own cluster compared to other clusters.

**Range**: -1 to +1
- **+1**: Point is far from neighboring clusters (excellent)
- **0**: Point is on the boundary between clusters
- **-1**: Point might be in the wrong cluster (poor)

**Formula** (for one point):
```
s = (b - a) / max(a, b)
```
- a = average distance to the other points in the same cluster
- b = average distance to points in nearest different cluster

**Average Silhouette Score**: Average across all points
- **> 0.7**: Strong structure
- **0.5 - 0.7**: Reasonable structure
- **0.25 - 0.5**: Weak structure
- **< 0.25**: No substantial structure
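
The per-point formula above can be checked by hand on a tiny example. The four points and their labels are hypothetical; the manual value is compared with scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two clusters of two points each (hypothetical data)
X_tiny = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_tiny = np.array([0, 0, 1, 1])

point = X_tiny[0]
# a: mean distance to the other points in the same cluster (only one here)
a = np.linalg.norm(point - X_tiny[1])
# b: mean distance to the points in the nearest other cluster
b = np.mean([np.linalg.norm(point - q) for q in X_tiny[labels_tiny == 1]])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_tiny, labels_tiny)[0]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"Manual silhouette:  {s_manual:.3f}")
print(f"sklearn silhouette: {s_sklearn:.3f}")
```

Here a = 1 and b is about 5.05, so s is about 0.80: the point sits much closer to its own cluster than to the other one.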

In [None]:
# Example 1: Elbow Method
print("Elbow Method for Choosing K")
print("="*50)

# Use customer data from Section 2
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_customers)
    inertias.append(kmeans.inertia_)
    print(f"K={k}: Inertia = {kmeans.inertia_:.2f}")

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.xticks(K_range)
plt.grid(alpha=0.3)

# Mark the elbow at K=3
plt.axvline(x=3, color='r', linestyle='--', linewidth=2, label='Elbow at K=3')
plt.legend()
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - Inertia always decreases as K increases")
print("  - At K=3, we see an 'elbow' where the decrease slows")
print("  - K=3 is a good choice (matches our data generation!)")

In [None]:
# Example 2: Silhouette Score
print("Silhouette Score Analysis")
print("="*50)

# Calculate silhouette scores for different K
silhouette_scores = []
K_range = range(2, 11)  # Silhouette needs at least 2 clusters

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_customers)
    score = silhouette_score(X_customers, cluster_labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(K_range, silhouette_scores, 'go-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Different K')
plt.xticks(K_range)
plt.grid(alpha=0.3)
plt.axhline(y=0.5, color='orange', linestyle='--', label='Reasonable threshold (0.5)')
best_k = list(K_range)[np.argmax(silhouette_scores)]
plt.axvline(x=best_k, color='r', linestyle='--', linewidth=2, label=f'Best K={best_k}')
plt.legend()
plt.tight_layout()
plt.show()

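best_score = max(silhouette_scores)
# Pick the label from the rule-of-thumb bands above instead of hard-coding it
structure = ("Strong" if best_score > 0.7 else "Reasonable" if best_score >= 0.5
             else "Weak" if best_score >= 0.25 else "No substantial")
print(f"\nBest K based on Silhouette Score: {best_k}")
print(f"Score: {best_score:.3f} ({structure} structure)")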

## Practice Exercise 4.1

**Task:** Use Elbow Method and Silhouette Score to find optimal K

Generate a dataset with 4 clusters using:
```python
X, y = make_blobs(n_samples=400, centers=4, random_state=42)
```

1. Apply Elbow Method for K=1 to 10
2. Calculate Silhouette Scores for K=2 to 10
3. Plot both metrics
4. Determine optimal K

**Expected Output:**
```
Optimal K: 4
```

In [None]:
# Your code here


In [None]:
# Solution 4.1

# Generate dataset with 4 clusters
X_4clusters, y_4clusters = make_blobs(n_samples=400, centers=4, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow Method
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_4clusters)
    inertias.append(km.inertia_)

axes[0].plot(range(1, 11), inertias, 'bo-')
axes[0].axvline(x=4, color='r', linestyle='--', label='Elbow at K=4')
axes[0].set_xlabel('K')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Silhouette Score
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_4clusters)
    score = silhouette_score(X_4clusters, labels)
    sil_scores.append(score)

axes[1].plot(range(2, 11), sil_scores, 'go-')
best_k = range(2, 11)[np.argmax(sil_scores)]
axes[1].axvline(x=best_k, color='r', linestyle='--', label=f'Best K={best_k}')
axes[1].set_xlabel('K')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Optimal K: {best_k} (by Silhouette Score; check that the elbow plot agrees)")

---
# Section 5: Dimensionality Reduction and PCA
---

## What is Dimensionality Reduction?

Dimensionality reduction transforms high-dimensional data (many features) into lower dimensions while preserving important information.

### The Curse of Dimensionality

As the number of features increases:
- Data becomes sparse (points are far apart)
- Models need exponentially more data
- Training becomes slower
- Overfitting risk increases
- Visualization becomes impossible (can't plot 100D)

**Example**: With 10 features and 10 values per feature, you need 10^10 = 10 billion samples to fill the space!
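
One way to see the sparsity effect numerically is to compare the nearest and farthest neighbor of a random query point as the number of dimensions grows. This is a rough sketch on uniform random data; the helper name `nearest_to_farthest_ratio` and the sample sizes are assumptions for this illustration.

```python
import numpy as np

def nearest_to_farthest_ratio(n_dims, n_points=500, seed=42):
    """Ratio of the smallest to the largest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

for d in (2, 10, 100, 1000):
    print(f"{d:>4} dimensions: nearest/farthest = {nearest_to_farthest_ratio(d):.3f}")
```

As the ratio approaches 1, the "nearest" point is barely closer than the farthest one, which weakens any method that relies on distances.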

## Why Reduce Dimensions?

1. **Visualization**: Plot 100D data in 2D/3D
2. **Speed**: Fewer features = faster training
3. **Memory**: Store less data
4. **Noise Reduction**: Remove irrelevant features
5. **Better Performance**: Sometimes less is more!

## Two Approaches

### 1. Feature Selection
Keep a subset of original features
- Example: Keep only "age" and "income", remove "zip code"

### 2. Feature Extraction (PCA)
Create new features as combinations of original features
- Example: PC1 = 0.7×age + 0.3×income
- More powerful but less interpretable
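
To illustrate "new features as combinations of original features", the sketch below prints the weights of each principal component on the Iris features and rebuilds the PCA scores with a plain matrix product. The variable names are hypothetical; `components_`, `mean_`, and `transform` are scikit-learn attributes and methods.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_w = load_iris()
X_w = StandardScaler().fit_transform(iris_w.data)
pca_w = PCA(n_components=2).fit(X_w)

# Each row holds the weights one PC puts on the original features
weights = pd.DataFrame(pca_w.components_,
                       columns=iris_w.feature_names,
                       index=['PC1', 'PC2'])
print(weights.round(2))

# A PC score is just the weighted sum of the (centered) original features
manual_scores = (X_w - pca_w.mean_) @ pca_w.components_.T
print("\nMatches pca.transform:", np.allclose(manual_scores, pca_w.transform(X_w)))
```

Each weight vector has unit length, so the weights describe a direction in feature space rather than percentages that sum to 1.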

## Principal Component Analysis (PCA)

PCA is one of the most widely used dimensionality reduction techniques. It finds new axes (principal components) that capture maximum variance in the data.

### How PCA Works

1. **Standardize** the data (mean=0, std=1)
2. **Compute covariance** matrix (how features relate)
3. **Find eigenvectors** (directions of maximum variance)
4. **Sort by eigenvalues** (variance explained)
5. **Project data** onto top K components

Think of it as rotating your data to find the best viewing angle!
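
The five steps can be followed literally with NumPy. This is a teaching sketch, not how scikit-learn computes PCA internally (it uses an SVD-based solver); the variable names are assumptions, and the result is compared with `sklearn.decomposition.PCA` at the end. Principal component signs are arbitrary, so the projections are compared by absolute value.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_steps = StandardScaler().fit_transform(load_iris().data)  # 1. standardize
cov = np.cov(X_steps, rowvar=False)                         # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                      # 3. eigenvectors (symmetric matrix)
order = np.argsort(eigvals)[::-1]                           # 4. sort by eigenvalue, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
X_projected = X_steps @ eigvecs[:, :2]                      # 5. project onto the top 2 components

manual_ratio = eigvals / eigvals.sum()
sk_pca = PCA(n_components=2).fit(X_steps)

print("Manual explained variance ratio: ", manual_ratio[:2].round(4))
print("sklearn explained variance ratio:", sk_pca.explained_variance_ratio_.round(4))
```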

### Key Concepts

**Principal Components (PCs)**:
- PC1: Direction of maximum variance
- PC2: Direction of second-most variance (perpendicular to PC1)
- PC3: Third-most variance (perpendicular to PC1 and PC2)
- ...

**Explained Variance**:
- How much information each PC captures
- Example: PC1=70%, PC2=20%, PC3=8%, PC4=2%
- Keep enough PCs to explain 90-95% of variance
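
The "keep enough PCs" rule can be applied in code in two ways: by reading the cumulative variance curve, or by passing a fraction as `n_components`, which scikit-learn supports with `svd_solver='full'`. The 95% threshold and the variable names are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_thr = StandardScaler().fit_transform(load_iris().data)

# Option A: find the first component count whose cumulative variance reaches 95%
cum_var = np.cumsum(PCA().fit(X_thr).explained_variance_ratio_)
n_manual = int(np.argmax(cum_var >= 0.95) + 1)

# Option B: let PCA choose the count for the requested variance fraction
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_thr)

print("Cumulative variance:", cum_var.round(3))
print("Components needed (manual): ", n_manual)
print("Components needed (sklearn):", pca_95.n_components_)
```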

## Syntax

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always standardize first: PCA is sensitive to feature scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create PCA model
pca = PCA(n_components=2)  # Reduce to 2 dimensions

# Fit and transform
X_pca = pca.fit_transform(X_scaled)

# Access results
explained_var = pca.explained_variance_ratio_  # % variance explained
components = pca.components_  # Principal component vectors
```

In [None]:
# Example: PCA on Iris Dataset
print("PCA Example: Iris Dataset")
print("="*50)

# Load full Iris dataset (4 features)
iris = load_iris()
X_iris_full = iris.data
y_iris_full = iris.target

print(f"Original data shape: {X_iris_full.shape}")
print(f"Features: {iris.feature_names}")
print(f"\nCannot visualize 4D data easily!")

# Standardize
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris_full)

# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2D for visualization
X_iris_pca = pca.fit_transform(X_iris_scaled)

print(f"\nPCA-transformed data shape: {X_iris_pca.shape}")
print(f"\nExplained Variance Ratio:")
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"  PC{i+1}: {var:.3f} ({var*100:.1f}%)")

total_var = sum(pca.explained_variance_ratio_)
print(f"\nTotal variance explained by 2 PCs: {total_var:.3f} ({total_var*100:.1f}%)")
print(f"We kept {total_var*100:.1f}% of the variance with just 2 dimensions!")

In [None]:
# Visualize before and after PCA
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before PCA (using first 2 original features)
axes[0].scatter(X_iris_full[:, 0], X_iris_full[:, 1], 
               c=y_iris_full, cmap='viridis', alpha=0.6, s=50)
axes[0].set_xlabel(iris.feature_names[0])
axes[0].set_ylabel(iris.feature_names[1])
axes[0].set_title('Original Features (2 of 4)\nNot well separated')

# After PCA (2 principal components)
axes[1].scatter(X_iris_pca[:, 0], X_iris_pca[:, 1], 
               c=y_iris_full, cmap='viridis', alpha=0.6, s=50)
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
axes[1].set_title('Principal Components\nMuch better separation!')

plt.tight_layout()
plt.show()

print("\nPCA found the 2D projection that preserves the most variance; it never sees the labels, but here it also separates the species well.")

In [None]:
# Example: Explained Variance Analysis
print("How Many Components Should We Keep?")
print("="*50)

# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X_iris_scaled)

# Calculate cumulative variance
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)

print("\nVariance explained by each component:")
for i, (var, cum_var) in enumerate(zip(pca_full.explained_variance_ratio_, cumulative_var)):
    print(f"  PC{i+1}: {var:.3f} (cumulative: {cum_var:.3f})")

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual variance
axes[0].bar(range(1, 5), pca_full.explained_variance_ratio_)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Variance Explained by Each PC')
axes[0].set_xticks(range(1, 5))

# Cumulative variance
axes[1].plot(range(1, 5), cumulative_var, 'bo-', linewidth=2, markersize=8)
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance Explained')
axes[1].set_xticks(range(1, 5))
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nRule of thumb: Keep enough PCs to explain 90-95% variance")
print(f"For Iris: 2 components explain {cumulative_var[1]*100:.1f}% (good enough!)")

---
# Section 6: Practical Applications
---

## Application 1: Customer Segmentation + PCA

Combining clustering and PCA for real-world customer analysis.

In [None]:
# Create a synthetic customer dataset with many features
# (each column is drawn independently, so the segments found below are illustrative only)
np.random.seed(42)
n = 500

# Generate 8 features for customers
customer_features = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.normal(50000, 20000, n).clip(20000, 150000),
    'spending_score': np.random.randint(1, 100, n),
    'years_customer': np.random.randint(0, 20, n),
    'num_purchases': np.random.poisson(10, n),
    'avg_transaction': np.random.normal(100, 30, n).clip(10, 500),
    'loyalty_points': np.random.randint(0, 10000, n),
    'satisfaction': np.random.randint(1, 11, n)
})

print("Customer Dataset")
print("="*50)
print(f"Shape: {customer_features.shape}")
print(f"\nFeatures: {list(customer_features.columns)}")
print(f"\nSample data:")
print(customer_features.head())

print("\nChallenge: 8 features - too many to visualize!")

In [None]:
# Step 1: Apply PCA to reduce to 2D
scaler = StandardScaler()
X_customers_scaled = scaler.fit_transform(customer_features)

pca = PCA(n_components=2)
X_customers_pca = pca.fit_transform(X_customers_scaled)

print("Step 1: PCA Dimensionality Reduction")
print("="*50)
print(f"Reduced from {customer_features.shape[1]}D to {X_customers_pca.shape[1]}D")
print(f"Variance explained: {sum(pca.explained_variance_ratio_)*100:.1f}%")

# Step 2: Apply K-Means clustering on PCA features
kmeans_final = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans_final.fit_predict(X_customers_pca)

print(f"\nStep 2: K-Means Clustering")
print("="*50)
print(f"Number of clusters: 4")
print(f"\nCluster sizes:")
print(pd.Series(clusters).value_counts().sort_index())

# Visualize
plt.figure(figsize=(12, 6))
scatter = plt.scatter(X_customers_pca[:, 0], X_customers_pca[:, 1], 
                     c=clusters, cmap='viridis', alpha=0.6, s=50)
plt.scatter(kmeans_final.cluster_centers_[:, 0], 
           kmeans_final.cluster_centers_[:, 1],
           c='red', marker='X', s=300, edgecolors='black', linewidths=2)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.title('Customer Segmentation using PCA + K-Means')
plt.colorbar(scatter, label='Cluster')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Analyze segments
customer_features['Cluster'] = clusters
print("\nCustomer Segment Profiles:")
print(customer_features.groupby('Cluster').mean().round(2))
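
The scale, reduce, and cluster steps above can also be chained into a single scikit-learn pipeline, so that new customers pass through exactly the same transformations. This is a sketch under assumptions: the random matrix `X_synth` is a hypothetical stand-in for the customer table, and the names `segmenter` and `segments` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_synth = rng.normal(size=(200, 8))  # stand-in for 200 customers x 8 features

segmenter = make_pipeline(
    StandardScaler(),                                # scale every feature
    PCA(n_components=2),                             # reduce 8D to 2D
    KMeans(n_clusters=4, random_state=42, n_init=10) # cluster in the reduced space
)

# Fit all three steps and get a segment id for every row
segments = segmenter.fit_predict(X_synth)
print("Segment sizes:", np.bincount(segments))

# New rows are scaled and projected with the fitted steps before assignment
new_segments = segmenter.predict(rng.normal(size=(5, 8)))
print("Segments for 5 new rows:", new_segments)
```

Keeping the scaler and PCA inside the pipeline avoids accidentally refitting them on new data.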

---
# Module Summary

## Key Takeaways

### Unsupervised Learning
- Works with **unlabeled data** to find patterns
- Two main types: **Clustering** and **Dimensionality Reduction**
- Most real-world data is unlabeled (cheap, abundant)

### K-Means Clustering
- One of the most widely used clustering algorithms
- Partitions data into K clusters by minimizing within-cluster variance
- **Pros**: Fast, scalable, simple
- **Cons**: Need to specify K, assumes spherical clusters
- **Use cases**: Customer segmentation, image compression, anomaly detection

### Hierarchical Clustering
- Builds a tree (dendrogram) of clusters
- **Agglomerative**: Bottom-up (merge clusters)
- Don't need to specify K upfront
- Dendrograms help visualize cluster hierarchy
- **Slower** than K-Means but more flexible

### Clustering Evaluation
- **Elbow Method**: Plot inertia vs K, look for elbow
- **Silhouette Score**: Measures cluster quality (-1 to +1)
  - >0.7: Strong structure
  - 0.5-0.7: Reasonable
  - <0.5: Weak/no structure

### PCA (Principal Component Analysis)
- Reduces dimensionality while preserving information
- Finds axes of maximum variance (principal components)
- **Always standardize** data first!
- **Explained variance**: How much info each PC captures
- **Rule of thumb**: Keep PCs that explain 90-95% variance
- **Use cases**: Visualization, noise reduction, speed up training

### When to Use What

| Task | Method | Why |
|------|--------|-----|
| Group similar customers | K-Means | Fast, simple, works well for marketing segments |
| Understand cluster hierarchy | Hierarchical | Get dendrogram, flexible K |
| Visualize high-D data | PCA | Reduce to 2D/3D for plotting |
| Speed up model training | PCA | Fewer features = faster training |
| Remove noise | PCA | Keep only high-variance components |
| Preprocessing for supervised learning | Both | PCA then classification/regression |

## Next Module

In **Module 18: Model Optimization and Hyperparameter Tuning**, we'll learn how to:
- Systematically tune model parameters
- Use Grid Search and Random Search
- Prevent overfitting
- Build robust ML pipelines

## Additional Practice

Try these challenges to reinforce your learning:

1. **Customer Segmentation**: Load a real customer dataset (e.g., from Kaggle) and:
   - Apply K-Means with different K values
   - Use Elbow Method and Silhouette Score to choose K
   - Profile each segment (what makes them unique?)

2. **Image Compression**: Use K-Means to compress an image:
   - Load an image as RGB pixel values
   - Cluster pixels into K=16 colors
   - Replace each pixel with its cluster center
   - Compare original vs compressed size

3. **PCA Visualization**: Load the Wine or Breast Cancer dataset:
   - Apply PCA to reduce to 2D
   - Visualize in 2D colored by class
   - How many PCs needed for 95% variance?

4. **Combined Workflow**: On Iris dataset:
   - Apply PCA to reduce from 4D to 2D
   - Apply K-Means on PCA features
   - Compare with K-Means on original features
   - Which works better? Why?

5. **Hierarchical Clustering**: Create a dendrogram for:
   - A small dataset (20-30 points)
   - Interpret the dendrogram
   - Try different linkage methods (ward, complete, average)
   - How do results differ?