# K-Means Clustering and Choosing the Right k

This tutorial demonstrates the K-Means clustering algorithm using the Iris dataset, and explains how to choose the optimal number of clusters (`k`) using the Elbow Method and Silhouette Score.

---

## 1. Introduction

**K-Means** is a popular unsupervised machine learning algorithm that partitions data into `k` clusters, assigning each point to the cluster whose centroid (mean) is nearest. Because `k` must be fixed before fitting, choosing it well is one of the key challenges in applying K-Means effectively.
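
To make the idea concrete, here is a minimal NumPy sketch of Lloyd's algorithm, the iteration commonly used to fit K-Means. It is an illustration rather than scikit-learn's implementation: the function names `nearest` and `lloyd_kmeans`, the random initialisation, and the toy data are assumptions made for this example.

```python
import numpy as np

def nearest(X, centroids):
    """Index of the closest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = nearest(X, centroids)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return nearest(X, centroids), centroids

# Two well-separated toy blobs (made-up data for illustration).
toy_rng = np.random.default_rng(1)
toy = np.vstack([toy_rng.normal(0, 0.3, (20, 2)), toy_rng.normal(5, 0.3, (20, 2))])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids)
```

In practice, prefer `sklearn.cluster.KMeans`, which adds better initialisation (k-means++) and multiple restarts on top of this basic loop.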

---

## 2. Imports and Data Preparation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the Iris dataset and keep only the first two features
# (sepal length and sepal width) so the clusters can be plotted in 2D
iris = load_iris()
X = iris.data[:, :2]

## 3. Fitting and Visualizing K-Means (with k=3)

In [None]:
# Fit KMeans with 3 clusters (k=3); n_init is set explicitly because its
# default differs between scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plotting the clusters and their centroids
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=50, label='Data points')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            c='red', marker='X', s=200, alpha=0.75, label='Centroids')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('K-Means Clustering (k=3) on Iris Data')
plt.legend()
plt.show()

## 4. Choosing the Number of Clusters (`k`)

### a) The Elbow Method

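The Elbow Method plots the WCSS (Within-Cluster Sum of Squares, exposed by scikit-learn as `inertia_`) against `k`. WCSS generally keeps decreasing as `k` grows, so rather than looking for a minimum, look for the "elbow": the value of `k` after which adding more clusters yields only small improvements.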
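
As a small worked example of what WCSS measures, the sketch below computes it by hand on four made-up points. The helper name `within_cluster_ss` and the toy data are assumptions for illustration; on a fitted scikit-learn model the same quantity is available as `inertia_`.

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: four points forming two obvious pairs.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k=1: a single centroid at the overall mean (5, 1).
wcss_k1 = within_cluster_ss(
    pts, np.zeros(4, dtype=int), pts.mean(axis=0, keepdims=True)
)

# k=2: one centroid per pair.
wcss_k2 = within_cluster_ss(
    pts, np.array([0, 0, 1, 1]), np.array([[0.0, 1.0], [10.0, 1.0]])
)

print(wcss_k1, wcss_k2)  # 104.0 4.0
```

The sharp drop from `k=1` to `k=2` is what shows up as the elbow: once `k` matches the natural grouping, further splits can only remove the small remaining within-pair spread.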

In [None]:
# Calculate WCSS for k = 1..10
wcss = []
ks = range(1, 11)
for k in ks:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model

# Plotting the Elbow
plt.figure(figsize=(8, 5))
plt.plot(ks, wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster Sum of Squares (WCSS)')
plt.title('Elbow Method for Optimal k')
plt.xticks(list(ks))
plt.show()

### b) The Silhouette Score

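The Silhouette Score measures how well each data point fits its own cluster compared with the nearest other cluster. It ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster. It is only defined for two or more clusters, which is why the loop below starts at `k=2`.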
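
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The sketch below computes this by hand with NumPy. The helper name `silhouette_sample` and the toy data are assumptions for illustration; `silhouette_score` from scikit-learn reports the mean of these values over all points.

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette value of point i; assumes at least two clusters."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    if not same.any():
        return 0.0  # convention: a point alone in its cluster scores 0
    a = dists[same].mean()  # mean distance within its own cluster
    b = min(                # mean distance to the nearest other cluster
        dists[labels == c].mean()
        for c in np.unique(labels)
        if c != labels[i]
    )
    return float((b - a) / max(a, b))

# Hypothetical toy data: four points forming two obvious pairs.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good = silhouette_sample(pts, np.array([0, 0, 1, 1]), 0)  # sensible grouping
bad = silhouette_sample(pts, np.array([0, 1, 0, 1]), 0)   # pairs split apart
print(good, bad)
```

With the sensible grouping the first point scores close to 1, while the deliberately bad grouping gives it a negative value, signalling that it sits nearer to the other cluster than to its own.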

In [None]:
# Compute the mean silhouette score for k = 2..10
sil = []
k_list = range(2, 11)
for k in k_list:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    sil.append(silhouette_score(X, labels))

plt.figure(figsize=(8, 5))
plt.plot(k_list, sil, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for Different k')
plt.xticks(list(k_list))
plt.show()

# Select the k with the highest mean silhouette score
best_k = k_list[int(np.argmax(sil))]
print(f"The best k according to Silhouette Score is: {best_k}")

## 5. Conclusions

- K-Means is a fast and simple clustering algorithm.
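- The **Elbow Method** and **Silhouette Score** are heuristics that help in selecting `k`; they do not always agree with each other.
- On the Iris dataset, `k=3` matches our domain knowledge (three species in Iris). The silhouette score may nevertheless favor `k=2`, because versicolor and virginica overlap heavily in the sepal features, so check the elbow plot and the printed `best_k` rather than assuming both methods return 3.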
- These methods generalize to other datasets; always visualize clusters and evaluation scores.

---

## 6. References

- [scikit-learn: KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
- [Wikipedia: Elbow Method](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_Elbow_Method)
- Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006

---

*End of tutorial. Try using different datasets or more features for further exploration!*