## What is Clustering?

- **Clustering** is an **unsupervised learning technique** that groups similar data points into clusters **without using labeled data**.
- Data points within the same cluster are **more similar to each other** than to points in other clusters.


## What is K-Means Clustering?

- **K-Means Clustering** is an **unsupervised learning algorithm** that divides a dataset into **K distinct clusters**.
- It works by **minimizing the sum of squared distances** between data points and their assigned **cluster centroids**.
- Each cluster is represented by a **centroid**, which is the **mean of all data points** in that cluster.
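
More precisely, the quantity K-Means tries to minimize is the within-cluster sum of squared distances. The symbols below ($S_k$ for the set of points in cluster $k$, $\mu_k$ for its centroid, and $J$ for the total) are notation introduced here for illustration:

$$
J = \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - \mu_k \rVert^2
$$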


## Why is it called K-Means?

- **K** → Number of clusters  
- **Means** → The **average (centroid)** of points in each cluster

![k means clustering.png](attachment:4ff95bc0-1cc9-4f18-9ca2-dfda329a7014.png)

## How K-Means Works (Algorithm Steps)

1. Choose the number of clusters **K**.
2. Initialize **K centroids randomly**.
3. Calculate the distance between each data point and the centroids.
4. Assign each data point to the **nearest centroid**.
5. Update the centroids by calculating the **mean** of all points in each cluster.
6. Repeat steps **3–5** until the centroids **no longer change** (or a maximum number of iterations is reached).
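
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans` and its parameters `max_iter` and `seed` are assumptions introduced here, the initial centroids are drawn from the data points themselves, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: use k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Steps 3-4: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Empty cluster: keep the old centroid (a simplifying assumption).
                new_centroids.append(centroids[j])
        # Step 6: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans([(1, 1), (1, 2), (4, 4), (5, 4)], k=2)` on the four points used in the worked example groups the first two points together and the last two together. Which group receives label 0 depends on the random initialization.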


## Formulas Used in K-Means

### Euclidean Distance

For two points $(x_1, y_1)$ and $(x_2, y_2)$, the distance is:

$$
d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
$$
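
The formula above covers two dimensions, but the same idea extends to any number of features by summing the squared differences over every coordinate. A small sketch, where the helper name `euclidean` is an assumption made for illustration:

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean((1, 1), (4, 4)))  # sqrt(18), about 4.24
```

The standard library function `math.dist` (Python 3.8+) computes the same quantity.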

### Centroid Formula

For a cluster containing $n$ points, the centroid is the mean of their coordinates:

$$
\text{Centroid} = \left( \frac{\sum x}{n}, \frac{\sum y}{n} \right)
$$
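
In code, the centroid is the coordinate-wise mean of the points in a cluster. The helper name `centroid` below is an assumption for illustration, and the function expects a non-empty list:

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


print(centroid([(1, 1), (1, 2)]))  # (1.0, 1.5)
```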


## Advantages of K-Means

- Simple and easy to understand  
- Fast and efficient: the cost of each iteration grows linearly with the number of data points  
- Works well for large datasets  


## Limitations of K-Means

- The value of **K** must be chosen in advance  
- Sensitive to outliers, because a single extreme point can pull a centroid (a mean) toward it  
- Different initial centroids may give different final clusters  
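
The sensitivity to outliers can be seen directly in the centroid calculation. In the sketch below, the points and the helper name `centroid` are hypothetical values chosen only to illustrate the effect: one distant point shifts the mean well away from the rest of the cluster.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


cluster = [(1, 1), (1, 2), (2, 1), (2, 2)]   # hypothetical tight cluster
with_outlier = cluster + [(20, 20)]          # same cluster plus one extreme point

print(centroid(cluster))       # (1.5, 1.5)
print(centroid(with_outlier))  # (5.2, 5.2)
```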

## Example: K-Means Clustering (Step-by-Step)

### Step 1: Dataset

| Point | x | y |
|------|---|---|
| A | 1 | 1 |
| B | 1 | 2 |
| C | 4 | 4 |
| D | 5 | 4 |

![k means before.png](attachment:e95c50a8-f811-4f1d-acae-548c92279688.png)

Number of clusters:  
$$
K = 2
$$

### Step 2: Initialize Centroids (Randomly)

Let the initial centroids be:

- **Centroid 1 (C₁)** = A(1, 1)  
- **Centroid 2 (C₂)** = C(4, 4)


### Step 3: Calculate Distance (Euclidean Distance)

#### Distance from Point A (1,1)

$$
d(A, C_1) = \sqrt{(1-1)^2 + (1-1)^2} = 0
$$

$$
d(A, C_2) = \sqrt{(4-1)^2 + (4-1)^2} = \sqrt{18} \approx 4.24
$$

➡ A is assigned to **Cluster 1**


#### Distance from Point B (1,2)

$$
d(B, C_1) = \sqrt{(1-1)^2 + (2-1)^2} = \sqrt{1} = 1
$$

$$
d(B, C_2) = \sqrt{(4-1)^2 + (4-2)^2} = \sqrt{13} \approx 3.61
$$

➡ B is assigned to **Cluster 1**


#### Distance from Point C (4,4)

$$
d(C, C_1) = \sqrt{(1-4)^2 + (1-4)^2} = \sqrt{18} \approx 4.24
$$

$$
d(C, C_2) = \sqrt{(4-4)^2 + (4-4)^2} = 0
$$

➡ C is assigned to **Cluster 2**


#### Distance from Point D (5,4)

$$
d(D, C_1) = \sqrt{(1-5)^2 + (1-4)^2} = \sqrt{25} = 5
$$

$$
d(D, C_2) = \sqrt{(4-5)^2 + (4-4)^2} = \sqrt{1} = 1
$$

➡ D is assigned to **Cluster 2**

#### Cluster Assignment Table (Iteration 1)

| Point | Distance to C₁ | Distance to C₂ | Assigned Cluster |
|------|----------------|----------------|------------------|
| A (1,1) | 0 | 4.24 | C₁ |
| B (1,2) | 1 | 3.61 | C₁ |
| C (4,4) | 4.24 | 0 | C₂ |
| D (5,4) | 5 | 1 | C₂ |
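
The assignment table can be reproduced with a few lines of Python. This is a sketch under the example's own assumptions (the four points and the initial centroids from Step 2); the dictionary names `points`, `centroids`, and `assignments` are introduced here for illustration.

```python
import math

points = {"A": (1, 1), "B": (1, 2), "C": (4, 4), "D": (5, 4)}
centroids = {"C1": (1, 1), "C2": (4, 4)}  # initial centroids from Step 2

assignments = {}
for name, p in points.items():
    # Distance from this point to every centroid.
    dists = {c: math.dist(p, pos) for c, pos in centroids.items()}
    # The nearest centroid determines the cluster.
    assignments[name] = min(dists, key=dists.get)
    print(name, {c: round(d, 2) for c, d in dists.items()}, "->", assignments[name])
```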


### Step 4: Update Centroids

#### New Centroid for Cluster 1 (A, B)

$$
C_1 = \left( \frac{1+1}{2}, \frac{1+2}{2} \right) = (1, 1.5)
$$

#### New Centroid for Cluster 2 (C, D)

$$
C_2 = \left( \frac{4+5}{2}, \frac{4+4}{2} \right) = (4.5, 4)
$$
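
The same update can be checked in Python by averaging the coordinates of the points assigned to each cluster. The variable names are illustrative, and the assignments are taken from Iteration 1 of the example:

```python
points = {"A": (1, 1), "B": (1, 2), "C": (4, 4), "D": (5, 4)}
assignments = {"A": "C1", "B": "C1", "C": "C2", "D": "C2"}  # from Iteration 1

new_centroids = {}
for cluster in sorted(set(assignments.values())):
    # Collect the points that belong to this cluster.
    members = [points[name] for name, c in assignments.items() if c == cluster]
    # The new centroid is the coordinate-wise mean of those points.
    new_centroids[cluster] = tuple(
        sum(coords) / len(members) for coords in zip(*members)
    )

print(new_centroids)  # {'C1': (1.0, 1.5), 'C2': (4.5, 4.0)}
```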


### Step 5: Repeat Distance Calculation (Iteration 2)

#### Distance from Point A (1,1)

$$
d(A, C_1) = \sqrt{(1-1)^2 + (1-1.5)^2} = 0.5
$$

$$
d(A, C_2) = \sqrt{(4.5-1)^2 + (4-1)^2} = \sqrt{12.25 + 9} \approx 4.61
$$

➡ A remains in **Cluster 1**


#### Distance from Point B (1,2)

$$
d(B, C_1) = \sqrt{(1-1)^2 + (1.5-2)^2} = 0.5
$$

$$
d(B, C_2) = \sqrt{(4.5-1)^2 + (4-2)^2} = \sqrt{12.25 + 4} \approx 4.03
$$

➡ B remains in **Cluster 1**


#### Distance from Point C (4,4)

$$
d(C, C_1) = \sqrt{(1-4)^2 + (1.5-4)^2} = \sqrt{9 + 6.25} \approx 3.91
$$

$$
d(C, C_2) = \sqrt{(4.5-4)^2 + (4-4)^2} = 0.5
$$

➡ C remains in **Cluster 2**


#### Distance from Point D (5,4)

$$
d(D, C_1) = \sqrt{(1-5)^2 + (1.5-4)^2}
           = \sqrt{16 + 6.25}
           = \sqrt{22.25}
           \approx 4.72
$$

$$
d(D, C_2) = \sqrt{(4.5-5)^2 + (4-4)^2}
           = \sqrt{0.25}
           = 0.5
$$

➡ D remains in **Cluster 2**


#### Distance Table (Iteration 2)

| Point | Distance to C₁ | Distance to C₂ | New Cluster |
|------|----------------|----------------|-------------|
| A (1,1) | 0.5 | 4.61 | C₁ |
| B (1,2) | 0.5 | 4.03 | C₁ |
| C (4,4) | 3.91 | 0.5 | C₂ |
| D (5,4) | 4.72 | 0.5 | C₂ |
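
A short check of the second iteration, using the updated centroids (1, 1.5) and (4.5, 4). Because no point changes cluster, the centroids would not move on the next update, so the algorithm can stop. The names `previous`, `current`, and `converged` are assumptions introduced for this sketch.

```python
import math

points = {"A": (1, 1), "B": (1, 2), "C": (4, 4), "D": (5, 4)}
centroids = {"C1": (1, 1.5), "C2": (4.5, 4)}                 # updated in Step 4
previous = {"A": "C1", "B": "C1", "C": "C2", "D": "C2"}      # Iteration 1 result

# Reassign every point to its nearest updated centroid.
current = {
    name: min(centroids, key=lambda c: math.dist(p, centroids[c]))
    for name, p in points.items()
}

# If nothing moved, recomputing the means would give the same centroids.
converged = current == previous
print(current)
print("converged:", converged)  # converged: True
```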


### Step 6: Final Clusters

- **Cluster 1**  
  $$
  A(1,1),\; B(1,2)
  $$

- **Cluster 2**  
  $$
  C(4,4),\; D(5,4)
  $$


![K means after.png](attachment:4c2eb0ea-7f0b-403e-b65c-476910afb146.png)
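
As a final check, the quality of this result can be summarized by the within-cluster sum of squared distances, which is the quantity K-Means tries to make small. The sketch below assumes the final centroids (1, 1.5) and (4.5, 4) from the example; the names `clusters` and `wcss` are illustrative.

```python
clusters = {
    (1.0, 1.5): [(1, 1), (1, 2)],  # final centroid of Cluster 1 and its points
    (4.5, 4.0): [(4, 4), (5, 4)],  # final centroid of Cluster 2 and its points
}

# Sum the squared distance from every point to its own cluster's centroid.
wcss = sum(
    sum((a - b) ** 2 for a, b in zip(point, centroid))
    for centroid, members in clusters.items()
    for point in members
)
print(wcss)  # 1.0
```

Each of the four points lies 0.5 from its centroid, so each contributes 0.25 to the total.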