K-Means Clustering is an unsupervised learning algorithm used for grouping data into ‘K’ clusters. After choosing K centroids, each data point is assigned to the closest centroid, with the goal of minimizing the distance between the data points and the centroid of their cluster.

The algorithm assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid is minimized. The lower the variance within a cluster, the more homogeneous its data points are.
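To make the assign-then-update idea concrete, here is a minimal NumPy sketch of the classic iteration (often called Lloyd's algorithm). It is an illustration only, not scikit-learn's implementation. The function name `kmeans_sketch`, the toy data, and the choice to pass the starting centroids in explicitly are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, n_iter=100):
    """Illustrative K-Means loop: assign points, then recompute centroids."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    k = len(centroids)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Final assignment and the objective: sum of squared distances to the assigned centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    wcss = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, wcss

# Toy data (made up for illustration): two obvious groups of three points
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids, wcss = kmeans_sketch(X_toy, init_centroids=X_toy[[0, 3]])
print(labels)
print(centroids)
print(wcss)
```

Real implementations add smarter initialization and several restarts, because the result of this loop depends on where the centroids start.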
Evaluation Metrics

    Inertia: The total squared distance of the samples to their nearest cluster center is known as inertia. Lower values indicate more compact clusters.
    Silhouette Score: Indicates how closely an item fits its own cluster compared with how far it is from other clusters. A high silhouette score means that the item is well matched to its own cluster and poorly matched to nearby clusters. The silhouette score ranges from -1 to 1.
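To see what these two numbers measure, we can recompute them by hand and compare them with scikit-learn's values. The sketch below assumes Euclidean distance and uses the common silhouette definition s = (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its cluster and b is the mean distance to the members of the nearest other cluster. The variable names and the choice of sample index 0 are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = model.labels_

# Inertia by hand: squared distance from each sample to its assigned centroid
manual_inertia = ((X - model.cluster_centers_[labels]) ** 2).sum()
print(manual_inertia, model.inertia_)

# Silhouette by hand for a single sample (index 0 chosen arbitrarily)
i = 0
d = np.linalg.norm(X - X[i], axis=1)       # distances from sample i to all samples
same = labels == labels[i]
a = d[same].sum() / (same.sum() - 1)       # mean distance to its own cluster, excluding itself
b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
s_i = (b - a) / max(a, b)
print(s_i, silhouette_samples(X, labels)[i])

# The overall silhouette score is the mean of the per-sample values
print(silhouette_samples(X, labels).mean(), silhouette_score(X, labels))
```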

Applying with scikit-learn

Let’s use the Iris dataset for K-Means Clustering. The task will be to group the iris plants into clusters based on their flower measurements. We’ll train the model, assign the plants to clusters, and evaluate the clustering.

1. Load the Iris Dataset:

    The Iris dataset contains measurements of iris flowers, including sepal length, sepal width, petal length, and petal width. The dataset is typically used for classification tasks, but here we’ll use it for clustering.
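Before clustering, it can help to take a quick look at what we loaded. This short sketch only inspects the dataset object returned by `load_iris`; nothing here is required for the clustering itself.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

print(X.shape)             # 150 samples, 4 features
print(iris.feature_names)  # the four flower measurements
print(iris.target_names)   # species names, which K-Means will not see
print(X[:3])               # first three rows of measurements
```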

2. Apply K-Means Clustering:

    We initialize a K-Means clustering algorithm with n_clusters=3, as there are three species of iris in the dataset. However, the algorithm is unaware of these species; it will simply try to find the best way to group the data into three clusters.
    We fit the model to the data X, which includes our four features. The K-Means algorithm iteratively assigns each data point to the nearest of the three cluster centroids and then recomputes each centroid from the points assigned to it, repeating until the assignments settle.
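Once the model is fitted, the learned centroids and the number of iterations are available as attributes. The following sketch shows how to inspect them. Setting `n_init=10` explicitly is a choice made here so the example behaves the same across scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# One centroid per cluster, one coordinate per feature
print(model.cluster_centers_.shape)
print(model.cluster_centers_)

# Number of assign/update iterations used by the best run
print(model.n_iter_)
```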

3. Predict Clusters:

    The predict method assigns each data point in X to the nearest of the three fitted centroids. For the training data this step is largely redundant, since fitting already computes these assignments and stores them in the labels_ attribute; predict is mainly useful for labeling new data. Either way, each data point is now labeled with a cluster number.
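The relationship between fitting and predicting can be checked directly. In this sketch, `fit_predict` fits the model and returns the training labels in one call, and `predict` is then applied to a new sample. The measurements in `new_flower` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit and get the training labels in a single step
labels = model.fit_predict(X)
print(np.array_equal(labels, model.labels_))

# Label a new, hypothetical flower: predict picks the nearest centroid
new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])
new_label = model.predict(new_flower)

# The same answer computed by hand
nearest = np.linalg.norm(model.cluster_centers_ - new_flower, axis=1).argmin()
print(new_label[0], nearest)
```

Note that the cluster numbers themselves are arbitrary: cluster 0 in one run may be cluster 2 in another run with a different seed.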

4. Evaluate the Clustering:

    We evaluate our clustering using two metrics:
    • Inertia: This is the sum of squared distances of samples to their closest cluster center. It’s a measure of how internally coherent clusters are. We aim for lower inertia.
    • Silhouette Score: This measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

In [1]:
# Load necessary libraries
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [2]:
# Load the Iris dataset
iris = load_iris()
X = iris.data

In [3]:
# Apply K-Means Clustering
# n_init is set explicitly to avoid the FutureWarning about its changing default
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X)


In [4]:
# Predicting the cluster for each data point
y_pred = model.predict(X)


In [6]:
# Evaluating the model
inertia = model.inertia_
silhouette = silhouette_score(X, y_pred)

In [7]:
# Printing the results
print("Inertia:", inertia)
print("Silhouette:", silhouette)

Inertia: 78.85144142614601
Silhouette: 0.5528190123564095


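A silhouette score of about 0.55 suggests that the K-Means algorithm has found reasonably well-separated clusters in the Iris dataset, though there’s room for improvement in terms of cluster compactness and separation. The inertia of about 78.85 has no fixed scale, so it is most useful when compared with other runs or other values of K on the same data.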
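One way to put these numbers in context is to repeat the experiment for a few values of K. The sketch below is an optional extension rather than part of the walkthrough above, and the range of K values is an arbitrary choice. Keep in mind that inertia tends to fall as K grows whether or not the extra clusters are meaningful, so it is usually read alongside the silhouette score rather than on its own.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data

# Fit one model per candidate K and record both metrics
results = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"K={k}: inertia={inertia:.2f}, silhouette={sil:.3f}")
```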