# Chapter 9: Unsupervised Learning Techniques

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

Although most applications of Machine Learning today are based on supervised learning (and, as a result, this is where most of the investment goes), the vast majority of available data is unlabeled: we have the input features $\mathbf{X}$, but we do not have the labels $\mathbf{y}$. The computer scientist Yann LeCun famously said that "if intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake." In other words, there is huge potential in unsupervised learning that we have only barely started to sink our teeth into.

In this chapter we will look at a few unsupervised learning tasks and algorithms:
* **Clustering:** The goal is to group similar instances together into clusters. Clustering is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more.
* **Anomaly detection:** The objective is to learn what "normal" data looks like, and then use that to detect abnormal instances, such as defective items on a production line or a new trend in a time series.
* **Density estimation:** This is the task of estimating the probability density function (PDF) of the random process that generated the dataset. Density estimation is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis and visualization.

We will start with clustering, using K-Means and DBSCAN, and then we will discuss Gaussian mixture models and see how they can be used for density estimation, clustering, and anomaly detection.

## 2. Clustering

### A. K-Means

Consider a dataset containing blobs of instances. K-Means is a simple algorithm capable of clustering this kind of dataset very quickly and efficiently, often in just a few iterations.

**Deep Dive Mechanism:**
K-Means is a centroid-based algorithm. The goal is to partition the dataset into $k$ distinct, non-overlapping subgroups (clusters). It attempts to minimize the within-cluster sum of squared distances to the centroids, also known as **inertia**.

1.  **Initialization:** The algorithm starts by randomly selecting $k$ data points as initial centroids.
2.  **Assignment Step:** Every data point in the dataset is assigned to the nearest centroid based on Euclidean distance. This creates $k$ clusters.
3.  **Update Step:** The centroids are recomputed by taking the mean of all data points assigned to that centroid.
4.  **Convergence:** Steps 2 and 3 repeat until the centroids no longer move (or move very little), meaning the algorithm has converged.
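
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not Scikit-Learn's implementation: the function name `kmeans_lloyd`, its parameters, and the two-blob toy data are assumptions made for the example, and it uses purely random initialization.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal K-Means sketch: random init, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: index of the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        #    (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Convergence: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Toy data (hypothetical): two well-separated blobs of 100 points each.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2)),
])
centroids, labels = kmeans_lloyd(X_demo, k=2)
print("Centroids:\n", centroids)
```

On data this well separated, the centroids should land close to the two blob centers regardless of which points are drawn at initialization; harder datasets are where the initialization issues discussed next start to matter.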

**Optimization & Challenges:**
* **Inertia:** The objective is to minimize inertia, the sum of the squared distances between each training instance and its closest centroid. Here $m$ is the number of instances, $k$ is the number of clusters, and $\boldsymbol{\mu}_j$ is the centroid of cluster $j$.

    **Equation 9-1: K-Means Inertia**
    $$ \text{Inertia} = \sum_{i=1}^{m} \min_{j=1,\dots,k} \|\mathbf{x}^{(i)} - \boldsymbol{\mu}_j\|^2 $$

* **Local Optima:** K-Means is guaranteed to converge, but it might converge to a local minimum (suboptimal solution) depending on initialization. It is common to run the algorithm multiple times with different random initializations (`n_init`) and keep the best solution (lowest inertia).
* **K-Means++:** A smarter initialization strategy used by default in Scikit-Learn. It chooses the first centroid randomly, then chooses subsequent centroids from the remaining data points with probability proportional to the squared distance from the closest existing centroid. This drastically reduces the probability of poor initialization.
* **Hard vs. Soft Clustering:**
    * *Hard Clustering:* Each instance is assigned to exactly one cluster.
    * *Soft Clustering:* Each instance is assigned a score (e.g., distance or affinity) for every cluster. This can be used as a dimensionality reduction technique (transforming instances into vectors of distances to centroids).
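
As a small sketch of how inertia and soft clustering relate in Scikit-Learn: `KMeans.transform()` returns the distance from each instance to every centroid, and both the hard labels and the inertia can be recovered from that matrix. The toy dataset and variable names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: 500 points around 4 blob centers.
X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# Soft clustering: one distance per (instance, centroid) pair.
# Each row can be used as a new 4-dimensional feature vector.
distances = km.transform(X_demo)

# Hard clustering is recovered by picking the closest centroid.
hard_labels = distances.argmin(axis=1)

# Inertia: sum of squared distances to the closest centroid.
manual_inertia = np.sum(distances.min(axis=1) ** 2)

print("Distance features shape:", distances.shape)
print("Manual inertia: ", manual_inertia)
print("Sklearn inertia:", km.inertia_)
```

The two inertia values should agree up to floating-point rounding.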

**Determining Optimal k:**
* **Elbow Method:** Plot inertia vs. $k$. As $k$ increases, inertia decreases. The "elbow" of the curve represents a point of diminishing returns where adding more clusters doesn't significantly improve the model.
* **Silhouette Score:** A more precise metric. It measures how similar an instance is to its own cluster (cohesion) compared to other clusters (separation). The score ranges from -1 to +1. A high value indicates that the instance is well matched to its own cluster and poorly matched to neighboring clusters. The silhouette coefficient for an instance is equal to $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other instances in the same cluster (i.e., the mean intra-cluster distance), and $b$ is the mean nearest-cluster distance (i.e., the mean distance to the instances of the next closest cluster).
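
Both criteria can be computed in one loop over candidate values of $k$. The sketch below assumes a synthetic dataset with four well-separated blobs (the centers and the range of $k$ are arbitrary choices for the example), so the silhouette score is expected to peak at $k = 4$.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: four blobs placed at the corners of a square.
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_demo, _ = make_blobs(n_samples=600, centers=centers,
                       cluster_std=0.6, random_state=1)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias[k] = km.inertia_                              # for the elbow plot
    silhouettes[k] = silhouette_score(X_demo, km.labels_)  # mean coefficient
    print(f"k={k}: inertia={inertias[k]:10.1f}  silhouette={silhouettes[k]:.3f}")

# Inertia keeps shrinking as k grows, so it cannot be maximized or minimized
# directly; the silhouette score has a clear maximum instead.
best_k = max(silhouettes, key=silhouettes.get)
print("Best k by silhouette score:", best_k)
```

Plotting `inertias` against $k$ gives the elbow curve; the silhouette values are usually easier to read because the best $k$ is simply the one with the highest score.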

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# 1. Generate Data
# We create 5 blobs with unequal variances to show K-Means limitations (it assumes equal variance).
# This setup simulates a common real-world scenario where clusters aren't perfectly spherical.
blob_centers = np.array([
    [ 0.2,  2.3],
    [-1.5 ,  2.3],
    [-2.8,  1.8],
    [-2.8,  2.8],
    [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

# 2. Train K-Means (k=5)
# 'n_clusters=5' specifies that we are looking for 5 distinct groups.
# 'n_init=10' runs the algorithm 10 times with different centroid seeds and keeps
# the model with the lowest inertia (sum of squared distances). It is passed
# explicitly because recent Scikit-Learn releases default to n_init="auto",
# which performs a single run when the k-means++ initialization is used.
# Multiple runs help avoid local optima where centroids are poorly placed.
k = 5
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)

# 3. Inspect Results
# 'cluster_centers_' gives the coordinates of the 5 centroids found.
print("Cluster Centers:\n", kmeans.cluster_centers_)

# 'inertia_' is the performance metric (lower is better).
# It represents the sum of squared distances of samples to their closest cluster center.
print("Inertia (Sum of Squared Errors):", kmeans.inertia_)

# 4. Predict new instances
# The centroids define a Voronoi tessellation; new points are simply assigned to the nearest center.
# This is a "Hard Clustering" approach.
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
print("Predictions for new data:", kmeans.predict(X_new))

# 5. Silhouette Score
# Calculate Silhouette Score for our chosen k=5.
# Range: -1 (wrong cluster) to +1 (perfectly clustered).
# A score near 0 means overlapping clusters.
print("Silhouette Score (k=5):", silhouette_score(X, kmeans.labels_))

### B. Limits of K-Means
Despite its many merits, K-Means is not perfect. It is necessary to run the algorithm several times to avoid suboptimal solutions, plus you need to specify the number of clusters, which can be quite a hassle. Moreover, K-Means does not behave very well when the clusters have varying sizes, different densities, or non-spherical shapes.
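
A quick way to see the non-spherical failure mode is to run K-Means on two interleaving half circles and compare the result with the true labels. This is a sketch under assumptions: the adjusted Rand index (1.0 means perfect agreement, values near 0 mean chance-level agreement) is used here as the comparison metric, which is a choice made for this example rather than something prescribed by the chapter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters with known ground-truth labels.
X_moons, y_moons = make_moons(n_samples=1000, noise=0.05, random_state=42)

# K-Means splits the plane with a straight boundary between two centroids,
# so it cuts across both crescents instead of following their shape.
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

ari = adjusted_rand_score(y_moons, labels)
print(f"Adjusted Rand index of K-Means on two moons: {ari:.2f}")
```

The score stays well below 1.0 even though the two crescents are clearly separated to the eye.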

### C. Using Clustering for Image Segmentation
Image segmentation is the task of partitioning an image into multiple segments. In *semantic segmentation*, all pixels that are part of the same object type get assigned to the same segment. In *color segmentation*, we simply assign pixels to the same segment if they have a similar color. We can use K-Means for this.

In [None]:
from matplotlib.image import imread

# The original example segments a ladybug photo loaded with imread().
# That file is not available here, so a random RGB image stands in for it;
# replace this line with an imread(...) call to segment a real image.
image = np.random.rand(100, 100, 3)

X = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=8, random_state=42).fit(X)
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(image.shape)
print("Image segmented into 8 colors.")

### D. Using Clustering for Preprocessing
Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing step before a supervised learning algorithm. Let's tackle the digits dataset, which is a simple MNIST-like dataset containing 1,797 grayscale $8 \times 8$ images representing the digits 0 to 9.

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Load Digits Data
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

# Pipeline Construction:
# 1. K-Means creates 50 clusters. This acts as feature engineering.
#    Inside a Pipeline, KMeans.transform() replaces the 64 raw pixel intensities
#    with the distances to these 50 centroids.
# 2. Logistic Regression trains on these 50 new features, one-vs-rest.
#    OneVsRestClassifier is used because the multi_class="ovr" argument of
#    LogisticRegression is deprecated in recent Scikit-Learn releases.
pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50, random_state=42)),
    ("log_reg", OneVsRestClassifier(
        LogisticRegression(solver="lbfgs", max_iter=5000, random_state=42))),
])
pipeline.fit(X_train, y_train)

print("Accuracy with K-Means Preprocessing:", pipeline.score(X_test, y_test))
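
The pipeline's accuracy is only meaningful next to a baseline. A minimal sketch of that comparison, assuming the same split and a plain logistic regression trained directly on the 64 pixel features (the name `baseline` and the default solver settings are choices made for this example):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data and split as the pipeline above.
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, random_state=42)

# Baseline: no clustering step, the classifier sees raw pixel intensities.
baseline = LogisticRegression(max_iter=5000, random_state=42)
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

print("Baseline accuracy (raw pixels):", baseline_acc)
```

If the K-Means pipeline scores higher than this baseline, the distance-to-centroid features helped; if not, the number of clusters is the first hyperparameter worth tuning.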

### E. DBSCAN

This algorithm defines clusters as continuous regions of high density. It is based on the idea that clusters are dense regions in the data space, separated by regions of lower density.

**Core Concepts:**
1.  **$\epsilon$-neighborhood:** The region within distance $\epsilon$ of a data point; $\epsilon$ is the radius of that region.
2.  **MinPts (min_samples):** The minimum number of points required within an $\epsilon$-neighborhood to form a dense region.
3.  **Core Point:** A point is a core point if it has at least `min_samples` points (including itself) within its $\epsilon$-neighborhood.
4.  **Border Point:** A point that is within the $\epsilon$-neighborhood of a core point but does not have enough neighbors to be a core point itself.
5.  **Noise Point (Outlier):** A point that is neither a core point nor a border point.

**Algorithm Logic:**
* For each instance, the algorithm counts how many instances are located within a small distance $\epsilon$ (epsilon) from it. This region is called the instance’s $\epsilon$-neighborhood.
* If an instance has at least `min_samples` instances in its $\epsilon$-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.
* All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster.
* Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
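
The counting rules above can be checked by hand against Scikit-Learn's result. The sketch below is an illustration under assumptions: it uses `NearestNeighbors.radius_neighbors()` to list each $\epsilon$-neighborhood, and the dataset, `eps`, and `min_samples` values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_moons(n_samples=500, noise=0.1, random_state=0)
eps, min_samples = 0.1, 5

# List the instances inside each eps-neighborhood (a point counts itself).
nn = NearestNeighbors(radius=eps).fit(X_demo)
neighborhoods = nn.radius_neighbors(X_demo, return_distance=False)
counts = np.array([len(idx) for idx in neighborhoods])

# Core instances: at least min_samples instances in the neighborhood.
is_core = counts >= min_samples

# Anomalies: no core instance anywhere in the neighborhood
# (a core instance always finds itself, so it is never an anomaly).
has_core_nearby = np.array([is_core[idx].any() for idx in neighborhoods])
is_noise = ~has_core_nearby

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)
print("Core instances (manual / DBSCAN):",
      is_core.sum(), "/", len(db.core_sample_indices_))
print("Anomalies      (manual / DBSCAN):",
      is_noise.sum(), "/", np.sum(db.labels_ == -1))
```

Instances that are neither core nor anomalous are the border points: they sit inside some core instance's neighborhood without being dense enough themselves.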

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate 'Moons' dataset (two interleaving half circles).
# K-Means cannot separate these shapes correctly: it assigns each point to the
# nearest centroid, which only works well for roughly spherical clusters.
X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)

# DBSCAN configuration:
# eps=0.2: The radius of the neighborhood to look for nearby points.
# min_samples=5: Minimum points required within 'eps' radius to form a dense region (Core Point).
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)

# Labels of -1 are considered anomalies (noise points) that didn't fit into any cluster.
# This built-in outlier detection is a major advantage of DBSCAN.
print("First 10 labels:", dbscan.labels_[:10])

# Core instances are the "anchors" of the clusters.
print("Number of core instances found:", len(dbscan.core_sample_indices_))

## 3. Gaussian Mixture Models (GMM)

A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size, density, and orientation.

When you observe an instance $\mathbf{x}$, you know it was generated from one of the Gaussian distributions, but you don't know which one, and you don't know the parameters of these distributions.

**Expectation-Maximization (EM) Algorithm:**
Finding the optimal parameters (mean $\mu$, covariance $\Sigma$, and mixing weight $\pi$ for each cluster) has no closed-form solution, so we use EM, an iterative procedure that resembles K-Means with soft assignments instead of hard ones:
1.  **Expectation (E-step):** Given the current parameter estimates, calculate the probability (responsibility) that each data point belongs to each cluster.
2.  **Maximization (M-step):** Update the parameters ($\mu, \Sigma, \pi$) to maximize the likelihood of the data, weighting each data point by the responsibility calculated in the E-step.
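
The two steps can be written out directly in NumPy and SciPy. This is a minimal sketch, not Scikit-Learn's implementation: the function name `em_step`, the toy data, and the initialization (uniform weights, identity covariances, two random data points as means) are assumptions for illustration, and it omits the regularization and log-space arithmetic a robust version would need.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM iteration for a Gaussian mixture (illustrative sketch)."""
    k = len(weights)
    # E-step: weighted density of every point under every component ...
    dens = np.column_stack([
        weights[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
        for j in range(k)
    ])
    # ... log-likelihood of the data under the current parameters ...
    log_likelihood = np.log(dens.sum(axis=1)).sum()
    # ... and responsibilities, normalized so each row sums to 1.
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted estimates of pi, mu, and Sigma.
    nk = resp.sum(axis=0)
    new_weights = nk / len(X)
    new_means = (resp.T @ X) / nk[:, None]
    new_covs = []
    for j in range(k):
        diff = X - new_means[j]
        cov = (resp[:, j, None] * diff).T @ diff / nk[j]
        new_covs.append((cov + cov.T) / 2)  # remove rounding asymmetry
    return new_weights, new_means, np.array(new_covs), resp, log_likelihood

# Hypothetical toy data: two Gaussian blobs with different spreads.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(150, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(150, 2)),
])
k = 2
weights = np.full(k, 1.0 / k)
means = X_demo[rng.choice(len(X_demo), size=k, replace=False)]
covs = np.array([np.eye(2) for _ in range(k)])

log_likelihoods = []
for _ in range(30):
    weights, means, covs, resp, ll = em_step(X_demo, weights, means, covs)
    log_likelihoods.append(ll)

print("Mixing weights:", weights)
print("Means:\n", means)
print("Log-likelihood, first vs last iteration:",
      log_likelihoods[0], log_likelihoods[-1])
```

Each iteration can only keep or increase the log-likelihood, which is why EM always converges, although (like K-Means) possibly to a local optimum.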

**Covariance Types:**
* **Spherical:** Clusters are spheres (like K-Means), but can have different diameters.
* **Diagonal:** Clusters can be ellipsoids, but axes are parallel to coordinate axes.
* **Tied:** All clusters share the same covariance matrix (same shape/orientation).
* **Full:** Each cluster can take on any ellipsoidal shape and orientation.
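
The constraint imposed by each option shows up in the shape of the fitted `covariances_` attribute. A short sketch, assuming a toy dataset with 3 components and 2 features:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: 3 blobs in 2 dimensions.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

shapes = {}
for cov_type in ("spherical", "diag", "tied", "full"):
    gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                         random_state=0).fit(X_demo)
    shapes[cov_type] = gm.covariances_.shape
    print(f"{cov_type:>9}: covariances_.shape = {shapes[cov_type]}")

# spherical: (3,)       one variance per component
# diag:      (3, 2)     one variance per component and feature
# tied:      (2, 2)     a single matrix shared by all components
# full:      (3, 2, 2)  one full matrix per component
```

Fewer free parameters make EM easier to fit on small or high-dimensional datasets, at the cost of less flexible cluster shapes.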

**Anomaly Detection:**
Gaussian Mixtures can be used for anomaly detection: instances located in low-density regions can be considered anomalies. You must define what density threshold you want to use. For example, in a manufacturing company that tries to detect defective products, the ratio of defective products is usually well known (e.g., 4%). You then set the density threshold to be the value that results in having 4% of the instances located in areas below that threshold density.

In [None]:
from sklearn.mixture import GaussianMixture

# Fit GMM to the moons data.
# n_components=2: We assume there are 2 underlying distributions.
# n_init=10: The EM algorithm can get stuck in local optima, so we run it 10 times and keep the best result.
gm = GaussianMixture(n_components=2, n_init=10, random_state=42)
gm.fit(X)

print("Estimated Means of the components:\n", gm.means_)
print("Did the algorithm converge?", gm.converged_)

# Anomaly Detection Logic:
# 1. 'score_samples' returns the log of the probability density at each instance.
#    Lower (more negative) values indicate the instance is in a lower-density region.
densities = gm.score_samples(X)

# 2. Define a threshold. We classify the 4% of instances with the lowest density as anomalies.
#    This is a common technique for outlier detection in unsupervised settings.
density_threshold = np.percentile(densities, 4)
anomalies = X[densities < density_threshold]

print(f"Density Threshold (4th percentile): {density_threshold:.2f}")
print("Number of anomalies detected:", len(anomalies))