### K-means algorithm

K-means clustering is a popular unsupervised machine learning algorithm that partitions a dataset into k clusters of similar data points, where k is chosen in advance.

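The k-means algorithm starts by randomly initializing k cluster centers (centroids), one for each cluster. Each data point in the dataset is then assigned to the nearest cluster center. The distance metric is typically Euclidean distance. Variants of the algorithm use other measures such as Manhattan distance or cosine distance (one minus cosine similarity), but the mean-based update in the next step is matched to squared Euclidean distance, so those variants usually change the update as well (k-medians, for example, pairs Manhattan distance with the median).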
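The assignment step can be sketched on its own with NumPy broadcasting. This is a minimal illustration assuming Euclidean distance; the helper name `assign_to_nearest` and the sample points are hypothetical and chosen only for the example.

```python
import numpy as np

def assign_to_nearest(X, centroids):
    """Return, for each row of X, the index of the closest centroid (Euclidean)."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # diffs[i, j] is point i minus centroid j; shape (n_points, k, n_features)
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # shape (n_points, k)
    return np.argmin(distances, axis=1)

# Hypothetical sample data: four points and two candidate centers
points = np.array([[1, 2], [3, 4], [8, 8], [10, 12]])
centers = np.array([[3, 4], [9, 10]])
labels = assign_to_nearest(points, centers)
print(labels)  # [0 0 1 1]
```

Computing all point-to-centroid distances in one array avoids a Python loop over the data points, at the cost of an intermediate array of size n_points × k × n_features.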

After all the data points have been assigned to a cluster, the algorithm recomputes each cluster center as the mean of the data points assigned to that cluster. The algorithm then repeats the assignment and update steps until the cluster assignments (and therefore the centers) no longer change, or until a maximum number of iterations is reached.
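The update step can likewise be sketched in isolation. This is a minimal illustration, not a definitive implementation; the helper name `update_centroids`, the starting centers, and the choice to leave an empty cluster's center where it was are assumptions made for the example.

```python
import numpy as np

def update_centroids(X, labels, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    new_centroids = np.array(centroids, dtype=float)  # float copy of the old centers
    for c in range(len(new_centroids)):
        members = X[labels == c]
        if len(members) > 0:  # assumption: an empty cluster keeps its previous center
            new_centroids[c] = members.mean(axis=0)
    return new_centroids

points = np.array([[1, 2], [3, 4], [5, 6], [8, 8], [9, 10], [10, 12]])
labels = np.array([0, 0, 0, 1, 1, 1])
start = np.array([[1, 2], [10, 12]])  # hypothetical starting centers
updated = update_centroids(points, labels, start)
print(updated)
```

With these labels the two centers move to (3, 4) and (9, 10), the means of the first three and last three points respectively.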

The final output of the k-means algorithm is a set of k clusters in which every data point belongs to the cluster whose center is nearest to it under the chosen distance metric. Because the result depends on the random initialization, different runs can produce different clusterings. The algorithm is commonly used in applications such as image segmentation, market segmentation, and customer profiling.

In [35]:
import numpy as np

class KMeans:
    def __init__(self, k, max_iterations=100):
        self.k = k
        self.max_iterations = max_iterations
        self.centroids = None
        self.cluster_assignments = None

    def fit(self, X):
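        # Work in floating point so that centroid means are not truncated
        # when X has an integer dtype
        X = np.asarray(X, dtype=float)

        # Initialize centroids by randomly selecting k distinct data points from X
        self.centroids = X[np.random.choice(len(X), self.k, replace=False)]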
        
        for _ in range(self.max_iterations):
            # Assign each data point to the nearest centroid
            cluster_assignments = np.zeros(len(X), dtype=int)
            for j in range(len(X)):
                distances = np.linalg.norm(X[j] - self.centroids, axis=1)
                cluster_assignments[j] = np.argmin(distances)

            # Update centroids
            new_centroids = np.copy(self.centroids)
            for k in range(self.k):
                cluster_data_points = X[cluster_assignments == k]
                if len(cluster_data_points) > 0:
                    new_centroids[k] = np.mean(cluster_data_points, axis=0)
            
            # Check for convergence
            if np.array_equal(self.centroids, new_centroids):
                break
            
            # Adopt the new centroids for the next iteration
            self.centroids = new_centroids

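        # Store final cluster assignments, recomputed against the final
        # centroids so they stay consistent if max_iterations was reached
        self.cluster_assignments = np.asarray(self.predict(X))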
    
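    def predict(self, X):
        # Assign each data point to the nearest centroid and return the
        # labels as an integer array, matching self.cluster_assignments
        X = np.asarray(X, dtype=float)
        cluster_assignments = np.zeros(len(X), dtype=int)
        for j in range(len(X)):
            distances = np.linalg.norm(X[j] - self.centroids, axis=1)
            cluster_assignments[j] = np.argmin(distances)
        return cluster_assignments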

# Example usage:
X = np.array([[1, 2], [3, 4], [5, 6], [8, 8], [9, 10], [10, 12]])
kmeans = KMeans(k=2, max_iterations=100)
kmeans.fit(X)
print("Centroids:", kmeans.centroids)
print("Cluster Assignments:", kmeans.cluster_assignments)

Centroids: [[ 3  4]
 [ 9 10]]
Cluster Assignments: [0 0 0 1 1 1]
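
One way to sanity-check a result like the one above is to compute the within-cluster sum of squared distances (often called inertia), which the assignment and update steps never increase. The sketch below assumes the centroids and assignments printed above; the helper name `inertia` is hypothetical and not part of the class.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    residuals = X - centroids[labels]  # each point minus its own centroid
    return float(np.sum(residuals ** 2))

X = np.array([[1, 2], [3, 4], [5, 6], [8, 8], [9, 10], [10, 12]])
centroids = np.array([[3, 4], [9, 10]])  # values taken from the output above
labels = np.array([0, 0, 0, 1, 1, 1])    # values taken from the output above
total = inertia(X, centroids, labels)
print(total)  # 26.0
```

Since k-means depends on its random initialization, comparing this value across several runs is a simple way to pick the best of them.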
