# Unsupervised Learning Algorithms

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Clustering

## **K-Means Clustering: Elbow Method for Optimal $ k $**  

### **1. What is K-Means Clustering?**  
K-Means is a **partition-based clustering algorithm** that groups $ n $ data points into $ k $ clusters by minimizing the **intra-cluster variance**, i.e. the sum of squared distances between each point and the centroid of its cluster.

### **2. Mathematical Explanation**  

#### **Objective Function (Minimization of Intra-Cluster Variance)**  
For a dataset $ X = \{x_1, x_2, ..., x_n\} $, the K-Means algorithm finds $ k $ cluster centroids $ \mu_1, \mu_2, ..., \mu_k $ that minimize the following objective function:

$
J = \sum_{i=1}^{n} \sum_{j=1}^{k} \mathbb{1}(c_i = j) ||x_i - \mu_j||^2
$

where:  
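- $ c_i $ is the index of the cluster to which $ x_i $ is assigned.  
- $ \mu_j $ is the centroid of cluster $ j $.  
- $ \mathbb{1}(c_i = j) $ is the indicator function: it equals 1 if $ x_i $ belongs to cluster $ j $ and 0 otherwise, so each point contributes only the distance to its own centroid.  
- $ ||x_i - \mu_j||^2 $ is the squared Euclidean distance between $ x_i $ and $ \mu_j $.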
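To make the objective concrete, here is a minimal sketch that evaluates $ J $ for a hand-picked toy dataset. The helper name `kmeans_objective` and the toy arrays are illustrative assumptions, not part of the implementation used later in this notebook.

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]  # row i holds x_i - mu_{c_i}
    return float(np.sum(diffs ** 2))

# Toy data: four points in 2-D, two clusters (values chosen for easy arithmetic)
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels_toy = np.array([0, 0, 1, 1])
centroids_toy = np.array([[0.0, 1.0], [4.0, 1.0]])

J = kmeans_objective(X_toy, labels_toy, centroids_toy)
print(J)  # every point lies at distance 1 from its centroid, so J = 4.0
```

Indexing `centroids[labels]` plays the role of the indicator $ \mathbb{1}(c_i = j) $: it selects, for each point, only the centroid of the cluster the point belongs to.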

### **3. Steps of K-Means Algorithm**  

1. **Initialize** $ k $ cluster centroids randomly, for example by picking $ k $ distinct data points.  
2. **Assign each point** to the nearest centroid using Euclidean distance:  

   $
   c_i = \arg\min_{j} ||x_i - \mu_j||^2
   $

3. **Update centroids** by taking the mean of all points in the cluster:  

   $
   \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
   $

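4. **Repeat steps 2 and 3** until convergence, i.e. until the centroids move by less than a small tolerance between iterations, or until a maximum number of iterations is reached.  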
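The steps above can be traced by hand on a tiny example. The following sketch runs a single assignment and update pass; the toy points and starting centroids are assumed values chosen so the arithmetic is easy to check.

```python
import numpy as np

# Toy data and starting centroids (illustrative values only)
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])

# Step 2: squared distance from every point to every centroid, shape (n, k)
sq_dist = ((X_toy[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
labels = sq_dist.argmin(axis=1)

# Step 3: each centroid moves to the mean of the points assigned to it
new_centroids = np.array([X_toy[labels == j].mean(axis=0) for j in range(len(centroids))])

# Step 4: total centroid movement, compared against a tolerance to decide when to stop
shift = np.linalg.norm(new_centroids - centroids)

print(labels)         # [0 0 1 1]
print(new_centroids)  # rows [0. 1.] and [4. 1.]
```

Here both centroids move by one unit, so `shift` is $ \sqrt{2} $ and another iteration would be needed; on the next pass the assignments and centroids no longer change.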

---

### **4. Choosing the Optimal $ k $ Using the Elbow Method**  

The **Elbow Method** determines the optimal $ k $ by analyzing the **Within-Cluster Sum of Squares (WCSS)**:

$
WCSS(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} ||x_i - \mu_j||^2
$

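WCSS generally decreases as $ k $ grows (it reaches 0 when every point is its own cluster), so minimizing it directly is not useful. The goal is instead to **find the "elbow point"**: the value of $ k $ beyond which adding more clusters yields only marginal reductions in WCSS.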
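The elbow is usually read off a plot by eye, but it can also be estimated numerically. The sketch below uses one simple heuristic, the largest second difference of the WCSS curve. The function name `elbow_by_second_difference` and the WCSS values are hypothetical, made up for illustration; this is not the only way to locate an elbow and it can be misled by noisy curves.

```python
import numpy as np

def elbow_by_second_difference(wcss):
    """Return the k (1-based) at which the WCSS curve bends most sharply.

    Heuristic only: picks the k with the largest second difference
    wcss(k-1) - 2 * wcss(k) + wcss(k+1).
    """
    wcss = np.asarray(wcss, dtype=float)
    second_diff = wcss[:-2] - 2 * wcss[1:-1] + wcss[2:]
    # second_diff[0] corresponds to k = 2, hence the +2 offset
    return int(np.argmax(second_diff)) + 2

# Hypothetical WCSS values for k = 1..6 (made up for illustration)
wcss_example = [1000.0, 600.0, 200.0, 180.0, 165.0, 155.0]
print(elbow_by_second_difference(wcss_example))  # 3
```

In this made-up curve the drop from $ k = 2 $ to $ k = 3 $ is large and every later drop is small, so the heuristic selects $ k = 3 $.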

---


In [None]:
# Generate synthetic data
np.random.seed(42)
X1 = np.random.randn(100, 2) + np.array([2, 2])
X2 = np.random.randn(100, 2) + np.array([-2, -2])
X3 = np.random.randn(100, 2) + np.array([2, -2])

# Combine data into a DataFrame
X = np.vstack([X1, X2, X3])
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])


In [None]:
class KMeans:
    def __init__(self, k=3, iterations=100, tolerance=1e-4):
        self.k = k
        self.iterations = iterations
        self.tolerance = tolerance  # Tolerance for convergence
        self.centroids = None

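    def fit(self, X):
        # Initialize centroids as k distinct data points, using a local
        # generator so the global NumPy random state is left untouched
        rng = np.random.default_rng(42)
        self.centroids = X[rng.choice(len(X), self.k, replace=False)]

        for _ in range(self.iterations):
            # Assign each point to its nearest centroid
            labels = self._assignClusters(X)

            # Compute new centroids; keep the old centroid if a cluster is empty
            # (the mean of an empty selection would otherwise be NaN)
            newCentroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else self.centroids[j]
                for j in range(self.k)
            ])

            # Check convergence: total centroid movement below the tolerance
            shift = np.linalg.norm(self.centroids - newCentroids)
            self.centroids = newCentroids
            if shift < self.tolerance:
                break

        # Final labels, consistent with the final centroids
        self.labels_ = self._assignClusters(X)
        return self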

    def _assignClusters(self, X):
        # Compute distances to centroids and assign cluster labels
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
        return np.argmin(distances, axis=1)
    
    def predict(self, X):
        return self._assignClusters(X)

# Elbow Method to Find Optimal (k)

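def WCSS(X, maxK=10):
    # Within-cluster sum of squares for k = 1..maxK
    wcss = []
    for k in range(1, maxK + 1):
        kMeans = KMeans(k=k)
        kMeans.fit(X)
        # Squared distance from each point to every centroid, shape (n, k)
        sqDistances = np.linalg.norm(X[:, np.newaxis] - kMeans.centroids, axis=2) ** 2
        # Each point contributes the squared distance to its nearest centroid
        wcss.append(np.sum(np.min(sqDistances, axis=1)))
    return wcss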

# Compute WCSS for k = 1 to 10
wcssValues = WCSS(X, maxK=10)

# Plot Elbow Method
plt.plot(range(1, len(wcssValues) + 1), wcssValues, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal k')
plt.show()
