# Unsupervised Learning — Practical Notebook

This notebook contains runnable examples for clustering, dimensionality reduction, and anomaly detection using scikit-learn. It generates sample datasets and shows common workflows: scaling, pipelines, hyperparameter selection (internal metrics), and visualization.

In [ ]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris, make_moons, make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## 1. Create sample datasets
We load one real dataset and generate two synthetic ones: Iris (classic), blobs (for KMeans), and moons (for DBSCAN / non-spherical clusters).

In [ ]:
# Iris dataset (for demonstration with labels, but we'll ignore labels for unsupervised tasks)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Blobs: well-separated clusters for KMeans
X_blobs, y_blobs = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)

# Moons: non-spherical clusters (good for DBSCAN and other density-based methods)
X_moons, y_moons = make_moons(n_samples=500, noise=0.06, random_state=42)

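# Standardize each dataset with its own scaler (important for distance-based methods).
# A fresh StandardScaler per dataset avoids accidentally reusing another dataset's statistics.
X_iris_s = StandardScaler().fit_transform(X_iris)
X_blobs_s = StandardScaler().fit_transform(X_blobs)
X_moons_s = StandardScaler().fit_transform(X_moons)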

print('Shapes:', X_iris_s.shape, X_blobs_s.shape, X_moons_s.shape)
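
As a quick check on what standardization does, the sketch below regenerates the blobs with the same parameters and confirms that each scaled column has mean close to 0 and standard deviation close to 1. The names `X_demo`, `X_demo_s`, `col_means`, and `col_stds` are illustrative and not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same blob parameters as the notebook; variable names are illustrative only
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)
X_demo_s = StandardScaler().fit_transform(X_demo)

# StandardScaler uses the population standard deviation (ddof=0), as does np.std by default
col_means = X_demo_s.mean(axis=0)
col_stds = X_demo_s.std(axis=0)
print('Column means after scaling:', np.round(col_means, 6))
print('Column stds after scaling: ', np.round(col_stds, 6))
```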

## 2. KMeans clustering — Elbow and Silhouette examples
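Use inertia (elbow) and silhouette score to select k. Inertia is the within-cluster sum of squared distances; it always decreases as k grows, so look for the point where the improvement levels off. The silhouette score ranges from -1 to 1, and higher values indicate better-separated clusters.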

In [ ]:
# Explore different k values on the blobs dataset
inertias = []
sil_scores = []
K = range(2, 9)
for k in K:
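    # n_init is set explicitly so results do not depend on the installed scikit-learn version's default
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs_s)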
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_blobs_s, km.labels_))

fig, ax = plt.subplots(1,2, figsize=(12,4))
ax[0].plot(K, inertias, '-o')
ax[0].set_xlabel('k')
ax[0].set_ylabel('Inertia')
ax[0].set_title('Elbow method (inertia)')
ax[1].plot(K, sil_scores, '-o')
ax[1].set_xlabel('k')
ax[1].set_ylabel('Silhouette score')
ax[1].set_title('Silhouette score')
plt.show()

# Fit KMeans with chosen k (e.g., 4) and visualize
km4 = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_blobs_s)
plt.figure(figsize=(6,4))
plt.scatter(X_blobs_s[:,0], X_blobs_s[:,1], c=km4.labels_, cmap='tab10', s=20)
plt.title('KMeans (k=4) on blobs')
plt.show()
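
Because the blobs are synthetic, the generating labels are available, so the chosen k can be sanity-checked against them. The sketch below uses the adjusted Rand index for that comparison. This is an external check that is not available for real unlabeled data, and the names `X_chk`, `km_chk`, and `ari` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Regenerate the blobs with the notebook's parameters (names are illustrative)
X_chk, y_chk = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)
X_chk_s = StandardScaler().fit_transform(X_chk)

km_chk = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_chk_s)

# The adjusted Rand index ignores label permutations; 1.0 means identical partitions
ari = adjusted_rand_score(y_chk, km_chk.labels_)
print(f'Adjusted Rand index vs. generating labels: {ari:.3f}')
```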

## 3. DBSCAN — density-based clustering and outlier detection
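DBSCAN can find arbitrarily shaped clusters and assigns the label -1 to points it treats as noise. Its two main parameters are `eps` (the neighborhood radius) and `min_samples` (the number of points needed to form a dense region).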

In [ ]:
db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons_s)
labels_db = db.labels_
n_clusters = len(set(labels_db) - {-1})  # exclude the noise label
n_noise = int(np.sum(labels_db == -1))
print('Estimated clusters:', n_clusters, 'Noise points:', n_noise)

plt.figure(figsize=(6,4))
plt.scatter(X_moons_s[:,0], X_moons_s[:,1], c=labels_db, cmap='tab10', s=20)
plt.title('DBSCAN on moons (labels: -1 = noise)')
plt.show()
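
One common heuristic for picking `eps` is to look at the distance from each point to its k-th nearest neighbor, with k tied to `min_samples`, and choose a value near where the sorted distances start to rise sharply. The sketch below prints a few percentiles of that distance instead of plotting the full curve. It is a rough guide under these assumptions, not a rule, and the names `X_m_s`, `nn`, and `k_dist` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Same moons parameters as the notebook (names are illustrative)
X_m, _ = make_moons(n_samples=500, noise=0.06, random_state=42)
X_m_s = StandardScaler().fit_transform(X_m)

min_samples = 5
# Querying the training points returns each point as its own nearest neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_m_s)
distances, _ = nn.kneighbors(X_m_s)

# Sorted distance to the farthest of the min_samples neighbors
k_dist = np.sort(distances[:, -1])
for q in (50, 90, 95, 99):
    print(f'{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}')
```

An `eps` somewhat above the bulk of these distances keeps dense regions connected while leaving isolated points labeled as noise.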

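## 4. Dimensionality reduction — PCA and UMAP/t-SNE for visualization
PCA is linear and fast. UMAP and t-SNE are nonlinear and useful for 2D/3D visualizations. UMAP is often faster than t-SNE on larger datasets and tends to preserve more of the global structure, but both methods mainly preserve local neighborhoods, so distances between clusters in the embedding should be read with caution. This cell uses UMAP if the `umap-learn` package is installed and otherwise falls back to t-SNE.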

In [ ]:
# PCA example on Iris (reduce to 2D for plotting)
pca = PCA(n_components=2)
X_iris_pca = pca.fit_transform(X_iris_s)
plt.figure(figsize=(6,4))
plt.scatter(X_iris_pca[:,0], X_iris_pca[:,1], c=y_iris, cmap='tab10', s=30)
plt.title('Iris PCA (2D)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
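
To judge how much information a 2D PCA projection keeps, it helps to look at the explained variance ratio of each component. The sketch below fits a full PCA on standardized Iris and prints the per-component and cumulative ratios. The names `X_std`, `pca_full`, `ratios`, and `cum` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize Iris, then keep all components to see the full variance breakdown
X_std = StandardScaler().fit_transform(load_iris().data)
pca_full = PCA().fit(X_std)

ratios = pca_full.explained_variance_ratio_
cum = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cum), start=1):
    print(f'PC{i}: explained={r:.3f}, cumulative={c:.3f}')
```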

# Try UMAP, else use TSNE
try:
    import umap
    emb = umap.UMAP(n_components=2, random_state=42).fit_transform(X_blobs_s)
    method = 'UMAP'
except ImportError:  # umap-learn is not installed; fall back to t-SNE
    from sklearn.manifold import TSNE
    emb = TSNE(n_components=2, random_state=42).fit_transform(X_blobs_s)
    method = 'TSNE'

plt.figure(figsize=(6,4))
plt.scatter(emb[:,0], emb[:,1], c=km4.labels_, cmap='tab10', s=20)
plt.title(f'{method} visualization of blobs colored by KMeans labels')
plt.show()

## 5. Anomaly detection — IsolationForest
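IsolationForest is an unsupervised method that scores points by how easily random splits isolate them; points that are isolated quickly are flagged as outliers. The `contamination` parameter sets the fraction of points to flag, so with `contamination=0.02` roughly 2% of the blobs are marked even though no outliers were injected. We run it on the blobs dataset and visualize the flagged points.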

In [ ]:
iso = IsolationForest(contamination=0.02, random_state=42)
iso.fit(X_blobs_s)
is_outlier = iso.predict(X_blobs_s) == -1

plt.figure(figsize=(6,4))
plt.scatter(X_blobs_s[:,0], X_blobs_s[:,1], c='lightgrey', s=20)
plt.scatter(X_blobs_s[is_outlier,0], X_blobs_s[is_outlier,1], c='red', s=30, label='outliers')
plt.title('IsolationForest outliers (on blobs)')
plt.legend()
plt.show()
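
Beyond the binary flag from `predict`, `decision_function` returns a continuous score where negative values correspond to predicted outliers. The sketch below shows how the two relate and how many points the contamination setting ends up flagging. The names `X_if`, `iso_demo`, `scores`, and `pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Same blob parameters as the notebook (names are illustrative)
X_raw, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)
X_if = StandardScaler().fit_transform(X_raw)

iso_demo = IsolationForest(contamination=0.02, random_state=42).fit(X_if)
scores = iso_demo.decision_function(X_if)  # negative => predicted outlier
pred = iso_demo.predict(X_if)              # -1 => outlier, 1 => inlier

n_flagged = int(np.sum(pred == -1))
print('Flagged points:', n_flagged, 'of', len(X_if))
print('Five lowest scores:', np.round(np.sort(scores)[:5], 4))
```

Ranking by score can be more useful than the hard flag when the right contamination level is unknown.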

## 6. Pipeline example: scaling → PCA → KMeans
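Use an sklearn `Pipeline` to encapsulate preprocessing and the model in one object. Every step is fitted only on the data passed to `fit`, which helps avoid leakage once the pipeline is combined with a train/test split, and makes experiments easier to reproduce. Note that the pipeline receives the unscaled `X_blobs`, since scaling is its first step.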

In [ ]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),  # blobs have only 2 features, so n_components cannot exceed 2
    ('kmeans', KMeans(n_clusters=4, n_init=10, random_state=42))
])
pipe.fit(X_blobs)
labels_pipe = pipe.named_steps['kmeans'].labels_
print('Labels from pipeline (unique):', np.unique(labels_pipe))

plt.figure(figsize=(6,4))
plt.scatter(X_blobs_s[:,0], X_blobs_s[:,1], c=labels_pipe, cmap='tab10', s=20)
plt.title('Pipeline: scaler → PCA → KMeans')
plt.show()
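
A fitted pipeline can also assign clusters to unseen points, applying the same scaling and projection before KMeans. The sketch below illustrates this with two made-up points. `X_new` and its values are hypothetical, and `pipe_demo` is an illustrative copy of the pipeline idea above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('kmeans', KMeans(n_clusters=4, n_init=10, random_state=42)),
])
pipe_demo.fit(X_train)

# Hypothetical new observations in the original (unscaled) feature space
X_new = np.array([[0.0, 0.0], [5.0, 2.0]])
new_labels = pipe_demo.predict(X_new)  # scaler and PCA are applied automatically
print('Cluster assignments for new points:', new_labels)
```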

## 7. Hyperparameter selection (internal metric)
For clustering, use internal metrics like silhouette score to choose parameters. Here we pick k for KMeans by silhouette on the blobs dataset.

In [ ]:
best = (None, -1)  # (k, score)
for k in range(2,9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs_s)
    s = silhouette_score(X_blobs_s, km.labels_)
    print(f'k={k} silhouette={s:.4f}')
    if s > best[1]:
        best = (k, s)
print('Best k by silhouette:', best)
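
Silhouette is not the only internal metric. scikit-learn also provides the Davies-Bouldin index (lower is better) and the Calinski-Harabasz index (higher is better). The sketch below computes all three side by side, since agreement between metrics gives somewhat more confidence in a choice of k. The names `X_hp` and `rows` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.preprocessing import StandardScaler

# Same blob parameters as the notebook (names are illustrative)
X_raw, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)
X_hp = StandardScaler().fit_transform(X_raw)

rows = []
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_hp)
    rows.append((
        k,
        silhouette_score(X_hp, labels),         # higher is better
        davies_bouldin_score(X_hp, labels),     # lower is better
        calinski_harabasz_score(X_hp, labels),  # higher is better
    ))

print(' k  silhouette  davies_bouldin  calinski_harabasz')
for k, sil, dbi, ch in rows:
    print(f'{k:2d}  {sil:10.4f}  {dbi:14.4f}  {ch:17.1f}')
```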

## 8. Next steps and notes
- For production, save fitted transformers and models (joblib, pickle), and build an inference pipeline.
- For large datasets, use `MiniBatchKMeans` (in `sklearn.cluster`) or scalable libraries (Dask-ML, FAISS for nearest neighbors).
- For text data, embed text (sentence-transformers) and then cluster embeddings.
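
The first note above can be sketched as follows: a fitted pipeline is saved with `joblib` and loaded back, and the restored object is checked to give the same predictions. The file name `cluster_pipeline.joblib` and the temporary directory are illustrative choices, not a recommended layout.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_save, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)

pipe_save = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=4, n_init=10, random_state=42)),
]).fit(X_save)

# Persist to a temporary location (illustrative path) and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'cluster_pipeline.joblib')
    joblib.dump(pipe_save, model_path)
    restored = joblib.load(model_path)

same = np.array_equal(pipe_save.predict(X_save), restored.predict(X_save))
print('Restored pipeline gives identical predictions:', same)
```

As with pickle, only load files from sources you trust, and keep the scikit-learn version consistent between saving and loading.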

References:
- scikit-learn: https://scikit-learn.org
- UMAP: https://umap-learn.readthedocs.io
- t-SNE in scikit-learn: sklearn.manifold.TSNE