# Elbow Method for Determining the Optimal Value of k in KMeans

The **Elbow Method** is a technique used to determine the optimal number of clusters ($k$) in a **KMeans** clustering algorithm. The steps are as follows:

1. **Run KMeans for a range of k values**: For example, from $k=1$ to some maximum value (e.g., $k=10$ or $k=20$).
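2. **Calculate the Within-Cluster Sum of Squares (WCSS)** for each $k$:
   - WCSS is the sum of squared distances from each point to the centroid of its assigned cluster.
   - A smaller WCSS indicates tighter clusters, but WCSS keeps shrinking as $k$ grows (it reaches zero when every point is its own cluster), so a low value alone does not indicate a good $k$.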
3. **Plot WCSS vs. k**: The plot typically decreases rapidly at first and then slows down.
4. **Identify the "elbow point"**:
   - The elbow point is where the marginal gain in reducing WCSS starts to diminish.
   - This point is taken as a reasonable number of clusters, balancing variance reduction against model simplicity.

**Intuition**: Before the elbow, adding clusters significantly reduces WCSS (better fit). After the elbow, adding clusters provides minimal improvement, so the extra complexity is not justified.


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

In [None]:
# Creating the data
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
# Stack the two coordinate arrays into an (n_samples, 2) feature table
X = pd.DataFrame(np.column_stack([x1, x2]))

# Visualizing the data
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()

From the scatter plot, the points appear to form about **3** groups. However, visual inspection alone is not always reliable. To complement it, we can compute quantitative metrics such as **distortion** and **inertia** (also called **WCSS**) for different values of $k$.

### Definitions

* **Distortion**: The average of the squared distances from each point to the centroid of its assigned cluster. Formally, if we have a dataset of $n$ points $\{x_1, x_2, \dots, x_n\}$ and $k$ cluster centroids $\{c_1, c_2, \dots, c_k\}$, then

$$
\text{Distortion} = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - c_{x_i} \rVert^2
$$

where $c_{x_i}$ is the centroid of the cluster assigned to point $x_i$.

* **Inertia / WCSS**: The sum of the squared distances from each point to the centroid of its assigned cluster. This is equivalent to the **Within-Cluster Sum of Squares (WCSS)**:

$$
\text{WCSS} = \text{Inertia} = \sum_{i=1}^{n} \lVert x_i - c_{x_i} \rVert^2
$$

Notice that distortion is simply WCSS divided by the number of points:

$$
\text{Distortion} = \frac{\text{WCSS}}{n}
$$
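The definitions above can be checked directly with NumPy. The sketch below assumes a hand-picked assignment of the toy data into the three visually obvious groups (the `labels` array is an illustration, not the output of a fitted model) and uses each group's mean as its centroid. The helper name `wcss_and_distortion` is hypothetical.

```python
import numpy as np

# Same toy data as in the cells above
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
points = np.column_stack([x1, x2]).astype(float)

# Assumption: hand-picked labels for the three visually obvious groups
labels = np.array([0] * 5 + [1] * 5 + [2] * 7)


def wcss_and_distortion(points, labels):
    """Return (WCSS, distortion), using each cluster's mean as its centroid."""
    wcss = 0.0
    for cluster_id in np.unique(labels):
        members = points[labels == cluster_id]
        centroid = members.mean(axis=0)
        # Add this cluster's squared distances to its centroid
        wcss += np.sum((members - centroid) ** 2)
    return wcss, wcss / len(points)


wcss, distortion = wcss_and_distortion(points, labels)
print(f"WCSS = {wcss:.3f}, distortion = {distortion:.3f}")
```

If a fitted `KMeans` model converges to the same partition, its `inertia_` attribute should match the WCSS computed here.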

### Procedure

1. Iterate over a range of $k$ values (e.g., $k = 1$ to $9$).  
2. Fit the KMeans algorithm for each value of $k$.  
3. Calculate and record **distortion** and **inertia (WCSS)** for each $k$.
4. Use these metrics, in combination with the elbow plot, to determine the optimal number of clusters.

> **Note**: Inertia and WCSS are the same quantity, and distortion is WCSS divided by the number of points $n$. Since $n$ is fixed for a given dataset, the distortion and inertia curves have the same shape and put the elbow at the same $k$; they differ only in scale.

### Pros and Cons of Using Distortion, Inertia, and WCSS

**Pros:**
* Provides **quantitative measures** of clustering quality.
* Easy to **compute and compare** across different values of $k$.  
* Directly reflects **cluster compactness**.

**Cons:**
* WCSS, inertia, and distortion **keep decreasing** as $k$ increases, so their minimum cannot identify the optimal $k$; only the shape of the curve can.
* Sensitive to **outliers**, which can inflate distances and distort metrics.  
* Does not account for **cluster separation**; only measures compactness.  
* Choosing $k$ still requires **some subjective judgment** when interpreting the elbow plot.
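To make the outlier sensitivity concrete, the following sketch compares WCSS on the toy data with and without one extra far-away point. The outlier `(15, 15)`, the hand-picked `labels`, and the helper `wcss` are assumptions introduced for illustration; none of them are part of the original dataset or analysis.

```python
import numpy as np


def wcss(points, labels):
    """Sum of squared distances of each point to the mean of its cluster."""
    total = 0.0
    for cluster_id in np.unique(labels):
        members = points[labels == cluster_id]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total


# Same toy data as in the cells above
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
points = np.column_stack([x1, x2]).astype(float)

# Assumption: hand-picked labels for the three visually obvious groups
labels = np.array([0] * 5 + [1] * 5 + [2] * 7)

# Hypothetical outlier, assigned to the group whose centroid is nearest (group 1)
points_with_outlier = np.vstack([points, [15.0, 15.0]])
labels_with_outlier = np.append(labels, 1)

wcss_clean = wcss(points, labels)
wcss_with_outlier = wcss(points_with_outlier, labels_with_outlier)
print(f"WCSS without outlier: {wcss_clean:.2f}")
print(f"WCSS with outlier:    {wcss_with_outlier:.2f}")
```

Because distances are squared, this single point makes WCSS several times larger even though the other 17 points have not moved.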

In [None]:
# Assuming X is your DataFrame from earlier
distortions = []
inertias = []
clusters = []
K = range(1, 10)

for k in K:
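    # Build and fit the KMeans model (n_init is set explicitly so the number
    # of restarts does not depend on the installed scikit-learn version)
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)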
    y_kmeans = kmeans.fit_predict(X)
    
    # Store results
    clusters.append(y_kmeans)
    distortions.append(kmeans.inertia_/X.shape[0])
    inertias.append(kmeans.inertia_)

# Plot distortion (Elbow Method)
plt.figure(figsize=(6, 4))
plt.plot(K, distortions, 'bx-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Distortion')
plt.title('Elbow Method using Distortion')
plt.grid(True)
plt.show()

# Plot inertia (another Elbow Method metric)
plt.figure(figsize=(6, 4))
plt.plot(K, inertias, 'bx-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method using Inertia')
plt.grid(True)
plt.show()


To determine the optimal number of clusters, we select the value of $k$ at the “elbow,” i.e., the point after which the distortion/inertia starts decreasing in a roughly linear fashion.  

For the given data, we conclude that the optimal number of clusters is **3**.
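Reading the elbow off a plot is subjective, so it can help to cross-check it with a simple numeric heuristic. The sketch below is one possible heuristic, not part of the analysis above: it rescales the curve to the unit square and picks the $k$ whose point lies farthest from the straight line joining the first and last points. The function name `elbow_by_chord_distance` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same toy data as in the cells above
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
points = np.column_stack([x1, x2]).astype(float)


def elbow_by_chord_distance(ks, values):
    """Return the k whose (k, value) point is farthest from the chord
    joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    values = np.asarray(values, dtype=float)
    # Rescale both axes to [0, 1] so that neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (values - values.min()) / (values.max() - values.min())
    # Perpendicular distance of every point to the chord (first -> last)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distances = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(distances)])


K = range(1, 10)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    for k in K
]
elbow_k = elbow_by_chord_distance(K, inertias)
print(f"Heuristic elbow at k = {elbow_k}")
```

The result depends on the range of $k$ that is scanned, so it is best treated as a sanity check on the visual reading rather than a replacement for it.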


# Silhouette Method for Optimal Value of $k$ in KMeans

The **silhouette method** can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters, providing a visual way to assess clustering quality and determine the optimal number of clusters ($k$).

### Silhouette Coefficient

For a given sample $i$, the **silhouette coefficient** $s(i)$ is defined as:

$$
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
$$

where:

* $a(i)$ = average distance between sample $i$ and all other points in the **same cluster** (intra-cluster distance)
* $b(i)$ = the smallest, over all **other clusters**, of the average distance between sample $i$ and the points in that cluster (nearest-cluster distance)
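As a concrete illustration of the formula, the sketch below computes $s(i)$ for a single sample with plain NumPy and Euclidean distances. It assumes a hand-picked assignment of the toy data into three groups (the `labels` array is illustrative, not a fitted result) and that the sample's cluster has at least two members. The helper name `silhouette_of_sample` is hypothetical.

```python
import numpy as np

# Same toy data as in the other cells
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
points = np.column_stack([x1, x2]).astype(float)

# Assumption: hand-picked labels for the three visually obvious groups
labels = np.array([0] * 5 + [1] * 5 + [2] * 7)


def silhouette_of_sample(points, labels, i):
    """Silhouette coefficient s(i) using Euclidean distances.

    Assumes the cluster containing sample i has at least two members.
    """
    distances = np.linalg.norm(points - points[i], axis=1)
    own = labels == labels[i]
    # a(i): mean distance to the other members of the same cluster
    # (the distance of the sample to itself is 0, so divide by size - 1)
    a = distances[own].sum() / (own.sum() - 1)
    # b(i): smallest mean distance to the members of any other cluster
    b = min(
        distances[labels == c].mean()
        for c in np.unique(labels)
        if c != labels[i]
    )
    return (b - a) / max(a, b)


s0 = silhouette_of_sample(points, labels, 0)
print(f"s(0) = {s0:.3f}")
```

For the same data and labels, `sklearn.metrics.silhouette_samples` should return the same value at index 0.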

### Interpretation

* **$s(i) \approx 1$** → The sample is well-matched to its own cluster and far from neighboring clusters.
* **$s(i) \approx 0$** → The sample is near the boundary between two clusters.
* **$s(i) < 0$** → The sample may have been assigned to the wrong cluster.

By averaging the silhouette coefficients over all samples, we get the **mean silhouette score**, which can be used to compare different values of $k$:

$$
\text{Mean Silhouette Score} = \frac{1}{n} \sum_{i=1}^{n} s(i)
$$

The value of $k$ that **maximizes the mean silhouette score** is often considered the optimal number of clusters.
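A minimal sketch of this selection rule, assuming the same toy data and scikit-learn's `silhouette_score`: fit KMeans for each candidate $k$, record the mean silhouette score, and keep the $k$ with the highest score. The names `scores` and `best_k` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same toy data as in the other cells
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
points = np.column_stack([x1, x2]).astype(float)

# The silhouette score needs at least two clusters, so start at k = 2
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest mean silhouette score
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k = {k}: mean silhouette = {score:.3f}")
print(f"Best k by mean silhouette: {best_k}")
```

When several values of $k$ score almost equally, the per-sample silhouette plots are a useful tie-breaker.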

### Pros and Cons of the Silhouette Method

**Pros:**
* Provides a **quantitative measure** of clustering quality.
* Works for **any distance metric**, not just Euclidean.
* Helps **identify poorly clustered samples**.
* Easy to **visualize** with silhouette plots.

**Cons:**
* Computationally **expensive** for large datasets (requires pairwise distance calculations).
* Can be **misleading** for clusters with very different sizes or densities.
* Less effective when the **true cluster structure is not well-separated**.
* Sensitive to **noise and outliers**, which can reduce the mean silhouette score.
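The point about distance metrics can be tried with the `metric` parameter of scikit-learn's `silhouette_score`. The sketch below scores one fixed partition under two metrics. It assumes a hand-picked assignment of the toy data into three groups (the `labels` array is illustrative, not a fitted result).

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Same toy data as in the other cells
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
points = np.column_stack([x1, x2]).astype(float)

# Assumption: hand-picked labels for the three visually obvious groups
labels = np.array([0] * 5 + [1] * 5 + [2] * 7)

# Score the same partition under two different distance metrics
score_euclidean = silhouette_score(points, labels, metric="euclidean")
score_manhattan = silhouette_score(points, labels, metric="manhattan")
print(f"Euclidean: {score_euclidean:.3f}")
print(f"Manhattan: {score_manhattan:.3f}")
```

Note that KMeans itself minimizes squared Euclidean distances, so a non-Euclidean silhouette score evaluates the partition under a different notion of closeness than the one used to build it.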

In [None]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Creating the data
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])
X = pd.DataFrame(list(zip(x1, x2)))

K = range(2, 10)

for n_clusters in K:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))

    # Initialize KMeans
    clusterer = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
    cluster_labels = clusterer.fit_predict(X)

    # Compute average silhouette score
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(f"For n_clusters = {n_clusters}, the average silhouette_score is: {silhouette_avg:.3f}")

    # Compute silhouette values for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    # Silhouette plot
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_values.sort()
        size_cluster_i = ith_cluster_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10

    ax1.set_title("Silhouette plot for clusters")
    ax1.set_xlabel("Silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks(np.arange(-0.1, 1.1, 0.2))

    # Cluster scatter plot
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X.iloc[:, 0], X.iloc[:, 1], marker='.', c=colors, edgecolor='k')

    # Plot cluster centers
    centers = clusterer.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o', c='white', s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker=f"${i}$", alpha=1, s=50, edgecolor='k')

    ax2.set_title("Cluster visualization")

    plt.suptitle(f"Silhouette analysis for KMeans with n_clusters = {n_clusters}", fontsize=14, fontweight="bold")
    # Show each figure as soon as it is built instead of all at once at the end
    plt.show()
