# 🎯 Clustering & Unsupervised Learning

**Author**: Data Science Master System  
**Difficulty**: ⭐⭐ Intermediate  
**Time**: 45 minutes  
**Prerequisites**: 06_regression_models

## Learning Objectives
- Apply K-Means, DBSCAN, and hierarchical clustering
- Reduce dimensionality with PCA and t-SNE
- Evaluate clusterings with validation metrics
- Recognize which method fits which kind of data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, calinski_harabasz_score

np.random.seed(42)

## 1. Generate Data

In [None]:
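# Blob data: 500 points around 4 centers; random_state makes the cell reproducible when re-run
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)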
X_scaled = StandardScaler().fit_transform(X)

plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_true, cmap='viridis', alpha=0.6)
plt.title('True Clusters')
plt.show()

## 2. K-Means Clustering

In [None]:
# Elbow method: fit K-Means for k = 1..9 and record the inertia (within-cluster sum of squares)
inertias = []
K_range = range(1, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# Final model with k=4, matching the number of generated centers
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
print(f"Silhouette Score: {silhouette_score(X_scaled, labels):.3f}")
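
The objectives list hierarchical clustering and validation metrics, and the imports include `AgglomerativeClustering` and `calinski_harabasz_score`, but the cells above only report one silhouette score. The following is a minimal standalone sketch, not part of the original notebook, that compares both metrics across several values of k for K-Means and Ward-linkage agglomerative clustering. The names `X_demo` and `scores`, the range of k, and the choice of Ward linkage are assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Blob data generated like Section 1, with a fixed seed (assumed for this sketch).
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

scores = {}
for k in range(2, 8):  # both metrics are undefined for a single cluster
    km_labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    ag_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_demo)
    scores[k] = {
        "kmeans_sil": silhouette_score(X_demo, km_labels),
        "kmeans_ch": calinski_harabasz_score(X_demo, km_labels),
        "ward_sil": silhouette_score(X_demo, ag_labels),
        "ward_ch": calinski_harabasz_score(X_demo, ag_labels),
    }

# Higher is better for both metrics; silhouette lies in [-1, 1].
for k, s in scores.items():
    print(
        f"k={k}: K-Means sil={s['kmeans_sil']:.3f}, CH={s['kmeans_ch']:.1f} | "
        f"Ward sil={s['ward_sil']:.3f}, CH={s['ward_ch']:.1f}"
    )
```

Unlike inertia, which always decreases as k grows, both scores can peak at an intermediate k, so they can complement the elbow plot when the bend is ambiguous.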

## 3. DBSCAN

In [None]:
# DBSCAN for non-globular clusters: two interleaving half-moons
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# K-Means fails: it partitions by distance to centroids, so it cuts across the moons
kmeans_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, cmap='viridis')
axes[0].set_title('K-Means (fails)')

# DBSCAN works
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, cmap='viridis')
axes[1].set_title('DBSCAN (works)')

plt.show()
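
The scatter plot hides two things worth checking after a DBSCAN run: how many clusters were found and how many points were labelled as noise (label `-1`). Here is a minimal standalone sketch of that inspection, together with the common k-distance heuristic for judging `eps`. The names `X_moons_demo`, `db_labels`, and `k_distances` are assumptions for illustration, and the heuristic is one option rather than the notebook's prescribed method.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# Same moons setup as above, with a fixed seed (assumed for this sketch).
X_moons_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
min_samples = 5
db_labels = DBSCAN(eps=0.3, min_samples=min_samples).fit_predict(X_moons_demo)

# DBSCAN marks noise points with the label -1; all other labels are cluster ids.
noise_mask = db_labels == -1
n_noise = int(noise_mask.sum())
cluster_sizes = {
    int(c): int((db_labels == c).sum()) for c in np.unique(db_labels[~noise_mask])
}
n_clusters = len(cluster_sizes)
print(f"clusters: {n_clusters}, noise points: {n_noise}, sizes: {cluster_sizes}")

# k-distance heuristic: distance from each point to its min_samples-th nearest
# neighbor, counting the point itself (as DBSCAN does). Sorting these distances
# and looking for a sharp bend gives a rough candidate for eps.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_moons_demo)
distances, _ = nn.kneighbors(X_moons_demo)
k_distances = np.sort(distances[:, -1])
print(f"median k-distance: {np.median(k_distances):.3f}, max: {k_distances[-1]:.3f}")
```

Plotting `k_distances` with `plt.plot` shows the bend visually. If `eps` sits far below it, most points become noise; far above it, separate clusters tend to merge.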

## 4. Dimensionality Reduction

In [None]:
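# High-dimensional data: 300 samples of 50-dimensional isotropic Gaussian noise.
# There is no cluster structure here, so any apparent grouping in the plots is an artifact.
X_high = np.random.randn(300, 50)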

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_high)
print(f"PCA Explained Variance: {pca.explained_variance_ratio_.sum():.1%}")

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_high)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5)
axes[0].set_title('PCA')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.5)
axes[1].set_title('t-SNE')
plt.show()
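
Because `X_high` is pure noise, the explained variance printed above will be small, which can make PCA look less useful than it is. As a hedged illustration, the sketch below contrasts noise with synthetic data that has three underlying latent factors and counts how many components are needed to reach 90% of the variance. The helper `components_for`, the 90% threshold, and the synthetic data are all assumptions introduced here, not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for(X, threshold=0.90):
    """Return the smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold` (hypothetical helper)."""
    cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    return int(np.argmax(cumulative >= threshold) + 1)


rng = np.random.default_rng(42)

# Structured data: 3 latent factors mixed linearly into 50 features, plus small noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 50))
X_struct = latent @ mixing + 0.1 * rng.normal(size=(300, 50))

# Unstructured data: isotropic noise, comparable to X_high above.
X_noise = rng.normal(size=(300, 50))

n_struct = components_for(X_struct)
n_noise_components = components_for(X_noise)
print(f"components for 90% variance, structured data: {n_struct}")
print(f"components for 90% variance, pure noise: {n_noise_components}")
```

When the variance concentrates in a few components, a 2D PCA plot is a faithful summary. When it is spread evenly, as with noise, the plot discards most of the information.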

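## 🎯 Key Takeaways
- K-Means: suits roughly globular clusters; k must be chosen up front (elbow plot, silhouette score)
- DBSCAN: finds clusters of arbitrary shape and labels outliers as noise; sensitive to `eps` and `min_samples`
- PCA: linear, fast, and interpretable through explained variance
- t-SNE: non-linear, best suited to visualization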

**Next**: 08_feature_engineering.ipynb