## K-Means Clustering

K-Means is an **unsupervised learning algorithm** used to group data into **clusters** based on similarity.

### Key Concepts

#### 1. Centroids

* Each cluster is represented by a **centroid**, which is the mean (average) of all points assigned to that cluster.
* The centroid acts as the "center" of the cluster.
* During the algorithm, centroids move as points get assigned or reassigned to clusters.

#### 2. Inertia

* **Inertia** measures how internally coherent clusters are.
* It is calculated as the **sum of squared distances** between each data point and its assigned cluster centroid.
* Lower inertia means points are closer to their centroids, indicating tighter clusters.
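
The definition above can be written as a formula. The notation is introduced here for illustration and is not taken from the text: $S_k$ denotes the set of points assigned to cluster $k$, and $\mu_k$ is that cluster's centroid.

$$
\text{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - \mu_k \rVert^2
$$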

#### 3. The Elbow Method

* The elbow method helps choose the **optimal number of clusters (K)**.
* Run K-Means with different values of $K$ (e.g., 1 to 10) and plot the inertia for each.
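* The plot typically shows a sharp drop in inertia at first, then levels off. Inertia keeps shrinking as $K$ grows (it reaches 0 when every point is its own cluster), so its minimum alone cannot pick $K$.
* The **“elbow” point**, where the decrease slows, is a common heuristic choice for $K$: it balances cluster compactness against simplicity.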
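
The procedure above might be sketched as follows with scikit-learn. The five toy points and the choices `n_init=10` and `random_state=0` are assumptions made for this illustration, and the loop stops at $K = 5$ only because the toy data has five points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data, chosen only for illustration
points = np.array([[1, 2], [1, 4], [3, 2], [5, 8], [6, 9]], dtype=float)

inertias = []
k_values = range(1, len(points) + 1)  # K cannot exceed the number of points
for k in k_values:
    # n_init and random_state are set explicitly so repeated runs agree
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"K={k}: inertia={inertia:.2f}")
```

For this data the inertia falls from 64.8 at $K=1$ to about 6.33 at $K=2$ and by much smaller amounts afterwards, which suggests an elbow at $K=2$. Plotting `inertias` against `k_values` makes the bend easier to see.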

---

## How K-Means Works (Simplified)

1. Choose $K$ initial centroids (randomly or by other methods).
2. Assign each data point to the nearest centroid.
3. Update centroids by computing the mean of assigned points.
4. Repeat steps 2 and 3 until assignments stop changing or a maximum number of iterations is reached.
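
The four steps can be sketched as a short NumPy loop. This is a minimal illustration, not a production implementation: the function name `kmeans_sketch`, the `max_iter` default, the toy points, and the rule of keeping the old centroid when a cluster is empty are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, init_centroids, max_iter=100):
    """Minimal Lloyd-style loop; names and defaults are illustrative."""
    centroids = np.array(init_centroids, dtype=float)  # copy so the input is not modified
    labels = None
    for _ in range(max_iter):
        # Step 2: distance from every point to every centroid, shape (n_points, K)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # assumption: keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids, labels

# Toy data, with the first and fourth points as starting centroids (Step 1)
points = np.array([[1, 2], [1, 4], [3, 2], [5, 8], [6, 9]], dtype=float)
centroids, labels = kmeans_sketch(points, init_centroids=points[[0, 3]])
print(centroids)
print(labels)
```

Passing the initial centroids in explicitly keeps the sketch deterministic; random or k-means++ initialization would replace only that argument.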

---

## Summary

* **Centroids** are the centers of clusters.
* **Inertia** measures cluster tightness; for a fixed $K$, lower is better.
* **Elbow method** helps pick the right number of clusters by finding where adding more clusters doesn’t improve inertia much.

## Dataset: Points and Goal

| Point | Coordinates (x, y) |
| ----- | ------------------ |
| A     | (1, 2)             |
| B     | (1, 4)             |
| C     | (3, 2)             |
| D     | (5, 8)             |
| E     | (6, 9)             |

Goal: Group these into **2 clusters**.

---

## Step 1: Initialize Centroids

Pick initial centroids — for example:

* Centroid 1: Point A → (1, 2)
* Centroid 2: Point D → (5, 8)

---

## Step 2: Assign Each Point to Closest Centroid

Calculate the Euclidean distance from each point to each centroid.

Distance formula:

$$
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
$$

Calculate and assign:

| Point   | Distance to $C_1$                       | Distance to $C_2$                       | Assigned Cluster |
| ------- | --------------------------------------- | --------------------------------------- | ---------------- |
| A (1,2) | $\sqrt{(1-1)^2 + (2-2)^2} = 0$          | $\sqrt{(1-5)^2 + (2-8)^2} \approx 7.21$ | 1                |
| B (1,4) | $\sqrt{(1-1)^2 + (4-2)^2} = 2$          | $\sqrt{(1-5)^2 + (4-8)^2} \approx 5.66$ | 1                |
| C (3,2) | $\sqrt{(3-1)^2 + (2-2)^2} = 2$          | $\sqrt{(3-5)^2 + (2-8)^2} \approx 6.32$ | 1                |
| D (5,8) | $\sqrt{(5-1)^2 + (8-2)^2} \approx 7.21$ | $\sqrt{(5-5)^2 + (8-8)^2} = 0$          | 2                |
| E (6,9) | $\sqrt{(6-1)^2 + (9-2)^2} \approx 8.60$ | $\sqrt{(6-5)^2 + (9-8)^2} \approx 1.41$ | 2                |
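
The table can be reproduced with a few lines of standard-library Python. This is a sketch: the dictionary layout and variable names are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math

points = {"A": (1, 2), "B": (1, 4), "C": (3, 2), "D": (5, 8), "E": (6, 9)}
centroids = {1: (1, 2), 2: (5, 8)}  # initial centroids: points A and D

assignments = {}
for name, p in points.items():
    # Euclidean distance from this point to each centroid
    dists = {k: math.dist(p, c) for k, c in centroids.items()}
    assignments[name] = min(dists, key=dists.get)  # label of the nearest centroid
    print(f"{name}: d(C1)={dists[1]:.2f}  d(C2)={dists[2]:.2f}  -> cluster {assignments[name]}")
```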

---

## Step 3: Update Centroids

Calculate new centroids by averaging points in each cluster.

* Cluster 1 points: A (1,2), B (1,4), C (3,2)

$$
C_1 = \left( \frac{1 + 1 + 3}{3}, \frac{2 + 4 + 2}{3} \right) = (1.67, 2.67)
$$

* Cluster 2 points: D (5,8), E (6,9)

$$
C_2 = \left( \frac{5 + 6}{2}, \frac{8 + 9}{2} \right) = (5.5, 8.5)
$$

---

## Step 4: Re-Assign Points to Closest Centroid

Calculate distances with new centroids:

| Point   | Distance to $C_1$                             | Distance to $C_2$                           | Assigned Cluster |
| ------- | --------------------------------------------- | ------------------------------------------- | ---------------- |
| A (1,2) | $\sqrt{(1-1.67)^2 + (2-2.67)^2} \approx 0.94$ | $\sqrt{(1-5.5)^2 + (2-8.5)^2} \approx 7.91$ | 1                |
| B (1,4) | $\sqrt{(1-1.67)^2 + (4-2.67)^2} \approx 1.49$ | $\sqrt{(1-5.5)^2 + (4-8.5)^2} \approx 6.36$ | 1                |
| C (3,2) | $\sqrt{(3-1.67)^2 + (2-2.67)^2} \approx 1.49$ | $\sqrt{(3-5.5)^2 + (2-8.5)^2} \approx 6.96$ | 1                |
| D (5,8) | $\sqrt{(5-1.67)^2 + (8-2.67)^2} \approx 6.29$ | $\sqrt{(5-5.5)^2 + (8-8.5)^2} \approx 0.71$ | 2                |
| E (6,9) | $\sqrt{(6-1.67)^2 + (9-2.67)^2} \approx 7.67$ | $\sqrt{(6-5.5)^2 + (9-8.5)^2} \approx 0.71$ | 2                |

The distances to $C_1$ are computed from the unrounded centroid $(5/3, 8/3)$; the table shows it rounded to two decimals.
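
One way to check whether the reassignment changed anything is to compare the labels produced by the old and new centroids. This is a sketch; the helper name `assign` is hypothetical, and clusters are indexed from 0 here rather than 1.

```python
import math

points = [(1, 2), (1, 4), (3, 2), (5, 8), (6, 9)]

def assign(points, centroids):
    """Return the index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

old_labels = assign(points, [(1, 2), (5, 8)])              # initial centroids
new_labels = assign(points, [(5 / 3, 8 / 3), (5.5, 8.5)])  # updated centroids
converged = old_labels == new_labels
print(converged)
```

When `converged` is true, another centroid update would produce the same centroids, so the loop can stop.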

---

## Step 5: Calculate Inertia

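Inertia is the sum of squared distances from each point to its assigned centroid. The squared distances below use the unrounded centroids $(5/3, 8/3)$ and $(5.5, 8.5)$.

* Cluster 1:

$$
0.94^2 + 1.49^2 + 1.49^2 \approx 0.89 + 2.22 + 2.22 = 5.33
$$

* Cluster 2:

$$
0.71^2 + 0.71^2 \approx 0.50 + 0.50 = 1.00
$$

* Total inertia:

$$
5.33 + 1.00 = 6.33
$$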
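
The same sum can be computed without intermediate rounding. In this sketch the `labels` array (0 for Cluster 1, 1 for Cluster 2) and the variable names are assumptions made for the example.

```python
import numpy as np

points = np.array([[1, 2], [1, 4], [3, 2], [5, 8], [6, 9]], dtype=float)
centroids = np.array([[5 / 3, 8 / 3], [5.5, 8.5]])  # unrounded centroids
labels = np.array([0, 0, 0, 1, 1])                  # 0 = Cluster 1, 1 = Cluster 2

# Squared distance from each point to the centroid of its own cluster
sq_dists = ((points - centroids[labels]) ** 2).sum(axis=1)

per_cluster = np.bincount(labels, weights=sq_dists)  # inertia contributed by each cluster
inertia = sq_dists.sum()

print(per_cluster)
print(inertia)
```

With unrounded centroids the per-cluster sums are 16/3 and 1, for a total of 19/3 ≈ 6.33.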

---

## Step 6: Repeat Steps 3-5

Repeat the centroid update and reassignment until the assignments no longer change or inertia stabilizes. In this example, the Step 4 assignments match those from Step 2, so the algorithm has already converged.

---

## Summary

* Calculate distances using the Euclidean distance formula.
* Assign points to closest centroid.
* Update centroids by averaging cluster points.
* Calculate inertia to measure clustering quality.
* Repeat until stable.


```python
import numpy as np

# Our points (x, y)
points = np.array([
    [1, 2],  # Point A
    [1, 4],  # Point B
    [3, 2],  # Point C
    [5, 8],  # Point D
    [6, 9]   # Point E
])
```

```python
# Step 1: Choose two starting centroids manually
centroid1 = points[0]  # (1, 2)
centroid2 = points[3]  # (5, 8)
```

```python
# Function to calculate the Euclidean distance between two 2-D points
def distance(p1, p2):
    return np.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)
```

```python
# Step 2: Assign each point to the nearest centroid
cluster1 = []
cluster2 = []

for point in points:
    dist1 = distance(point, centroid1)
    dist2 = distance(point, centroid2)

    if dist1 < dist2:
        cluster1.append(point)
    else:
        cluster2.append(point)

cluster1 = np.array(cluster1)
cluster2 = np.array(cluster2)
```

```python
# Step 3: Calculate new centroids by averaging points in clusters
def calculate_centroid(cluster):
    x_mean = np.mean(cluster[:, 0])
    y_mean = np.mean(cluster[:, 1])
    return np.array([x_mean, y_mean])

centroid1_new = calculate_centroid(cluster1)
centroid2_new = calculate_centroid(cluster2)
```

```python
# --- New Step: Predict which cluster a new point belongs to ---
new_point = np.array([4, 5])  # Example new point

dist_to_c1 = distance(new_point, centroid1_new)
dist_to_c2 = distance(new_point, centroid2_new)
```

```python
dist_to_c1
```

```
np.float64(3.2998316455372216)
```

```python
dist_to_c2
```

```
np.float64(3.8078865529319543)
```

```python
if dist_to_c1 < dist_to_c2:
    print("New point belongs to Cluster 1")
else:
    print("New point belongs to Cluster 2")
```

```
New point belongs to Cluster 1
```
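
The two-way comparison generalizes to any number of clusters by taking the index of the smallest distance. This is a sketch: the centroids are typed in directly from the hand calculation, and the variable names are illustrative.

```python
import numpy as np

centroids = np.array([[5 / 3, 8 / 3], [5.5, 8.5]])  # updated centroids from Step 3
new_point = np.array([4, 5])

# Distance from the new point to every centroid at once
dists = np.linalg.norm(centroids - new_point, axis=1)
nearest = int(dists.argmin())  # 0-based index of the closest centroid

print(f"New point belongs to Cluster {nearest + 1}")
```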


```python
import numpy as np
from sklearn.cluster import KMeans
```

```python
# Our points (x, y)
points = np.array([
    [1, 2],  # Point A
    [1, 4],  # Point B
    [3, 2],  # Point C
    [5, 8],  # Point D
    [6, 9]   # Point E
])

# Create and fit KMeans model with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(points)

# Print the cluster centers (centroids)
print("Centroids:")
print(kmeans.cluster_centers_)

# Print cluster labels for each point
print("\nCluster assignments for each point:")
for i, label in enumerate(kmeans.labels_):
    print(f"Point {points[i]} is in Cluster {label}")

# Predict cluster for a new point
new_point = np.array([[4, 5]])
predicted_cluster = kmeans.predict(new_point)

print(f"\nNew point {new_point[0]} belongs to Cluster {predicted_cluster[0]}")
```

```
Centroids:
[[1.66666667 2.66666667]
 [5.5        8.5       ]]

Cluster assignments for each point:
Point [1 2] is in Cluster 0
Point [1 4] is in Cluster 0
Point [3 2] is in Cluster 0
Point [5 8] is in Cluster 1
Point [6 9] is in Cluster 1

New point [4 5] belongs to Cluster 0
```
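
scikit-learn numbers clusters from 0, and which cluster receives which number is not guaranteed, so a comparison with the manual result is safer when it ignores label order. The sketch below sorts centroids by their x coordinate before comparing; `manual_centroids` is typed in from the hand calculation, and `n_init=10` is an assumption added to keep the run reproducible across library versions.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [3, 2], [5, 8], [6, 9]], dtype=float)
manual_centroids = np.array([[5 / 3, 8 / 3], [5.5, 8.5]])  # from the hand calculation

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Cluster numbering is arbitrary, so sort the fitted centroids by x before comparing
order = np.argsort(kmeans.cluster_centers_[:, 0])
sk_centroids = kmeans.cluster_centers_[order]
centroids_match = np.allclose(sk_centroids, manual_centroids)

print(centroids_match)
print(round(kmeans.inertia_, 2))
```

The fitted model also exposes the total inertia as `kmeans.inertia_`, which can be compared with the hand-computed sum of squared distances.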
