K-means with kmeans++ initialization #166
Hello @pmisiakos, here is the code (adapted from our KMeans tutorial):
########################################################################
# Setup
# -----------------
# Standard imports:
import time
import torch
from matplotlib import pyplot as plt
from pykeops.torch import LazyTensor
use_cuda = torch.cuda.is_available()
dtype = torch.float32 if use_cuda else torch.float64
device_id = "cuda:0" if use_cuda else "cpu"
########################################################################
# Simple implementation of the K-means algorithm:
def KMeans(x, K=10, Niter=100, verbose=True, init="default"):
    """Implements Lloyd's algorithm for the Euclidean metric."""
    N, D = x.shape  # Number of samples, dimension of the ambient space

    if verbose:
        print(
            f"K-means for the Euclidean metric with {N:,} points in dimension {D:,}, K = {K:,}:"
        )

    start = time.time()
    if init == "default":
        c = x[:K, :].clone()  # Simplistic initialization for the centroids
    elif init == "kmeans++":
        c = kmeans_plus_plus(x, K)
    else:
        raise ValueError(f"Unknown init scheme: {init}")
    if verbose:
        if use_cuda:
            torch.cuda.synchronize()
        end = time.time()
        print("Time for {} initialization: {:.5f}s".format(init, end - start))

    start = time.time()
    x_i = LazyTensor(x.view(N, 1, D))  # (N, 1, D) samples
    c_j = LazyTensor(c.view(1, K, D))  # (1, K, D) centroids

    # E step: assign points to the closest cluster -------------------------
    D_ij = ((x_i - c_j) ** 2).sum(-1)  # (N, K) symbolic squared distances
    cl = D_ij.argmin(dim=1).long().view(-1)  # Points -> Nearest cluster

    # K-means loop:
    # - x is the (N, D) point cloud,
    # - cl is the (N,) vector of class labels
    for i in range(Niter):
        # Keep a record of the current labels to detect convergence:
        cl_old = cl.clone()

        # M step: update the centroids to the normalized cluster average: ------
        # Compute the sum of points per cluster:
        c.zero_()
        c.scatter_add_(0, cl[:, None].repeat(1, D), x)

        # Divide by the number of points per cluster:
        Ncl = torch.bincount(cl, minlength=K).type_as(c).view(K, 1)
        c /= Ncl  # in-place division to compute the average

        # E step: assign points to the closest cluster -------------------------
        D_ij = ((x_i - c_j) ** 2).sum(-1)  # (N, K) symbolic squared distances
        cl = D_ij.argmin(dim=1).long().view(-1)  # Points -> Nearest cluster

        # Detect convergence:
        if torch.all(cl == cl_old):
            Niter = i
            break

    if Niter > 0 and verbose:  # Fancy display -----------------------------------------------
        if use_cuda:
            torch.cuda.synchronize()
        end = time.time()
        print(
            "Timing for {} iterations: {:.5f}s = {} x {:.5f}s\n".format(
                Niter, end - start, Niter, (end - start) / Niter
            )
        )

    return cl, c
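For reference, the scatter_add_ call in the M step accumulates the coordinates of each point into the centroid row of its cluster; dividing by the cluster counts then yields the means. Here is a tiny standalone sketch with made-up toy values to make the mechanics transparent:

import torch

pts = torch.tensor([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])  # three 2D points
lab = torch.tensor([0, 0, 1])                             # their cluster labels
cent = torch.zeros(2, 2)                                  # K=2 centroids in D=2
cent.scatter_add_(0, lab[:, None].repeat(1, 2), pts)      # per-cluster sums: [[4,0],[0,2]]
cent /= torch.bincount(lab, minlength=2).float().view(2, 1)
# cent is now [[2., 0.], [0., 2.]]: the mean of each cluster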
def kmeans_plus_plus(x, K):
    """Kmeans++ initialization of the K centroids."""
    N, D = x.shape
    c = torch.empty(K, D, dtype=x.dtype, device=x.device)

    # 1. Choose one center uniformly at random among the data points:
    ind = int(torch.randint(N, (1,)))
    c[0, :] = x[ind, :]

    # 2. For each data point x not chosen yet, compute D(x)^2, the squared
    #    distance between x and the nearest center that has already been chosen.
    # N.B. sq_dists is initialized with infinite values and is updated through the iterations.
    sq_dists = torch.full((N,), float("inf"), dtype=x.dtype, device=x.device)
    # N.B. invarangeN is used in step 3 below: multiplying a boolean mask by this
    #      decreasing ramp and taking the argmax returns the index of the first True entry.
    invarangeN = torch.arange(N, 0, -1, device=x.device, dtype=x.dtype)

    for k in range(K - 1):
        sq_dists = torch.minimum(sq_dists, ((x - c[k, :]) ** 2).sum(-1))

        # 3. Choose one new data point at random as a new center, using a weighted
        #    probability distribution where a point x is chosen with probability
        #    proportional to D(x)^2 (inverse transform sampling on the cumulative sums):
        distrib = torch.cumsum(sq_dists, dim=0)
        ind = torch.argmax(invarangeN * (float(torch.rand(1)) * distrib[-1] < distrib))
        c[k + 1, :] = x[ind, :]

    return c
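As a side note, the weighted draw in step 3 could also be written with torch.multinomial, the function mentioned in the question, which samples an index with probability proportional to the given non-negative weights. Both versions perform the same inverse transform sampling; the cumsum/argmax variant just makes the mechanics explicit. A minimal standalone sketch:

import torch

# Draw one index i with probability weights[i] / weights.sum(),
# e.g. with the squared distances D(x)^2 as weights:
weights = torch.tensor([1.0, 3.0, 6.0])
ind = int(torch.multinomial(weights, 1))  # 0, 1 or 2 with prob. 0.1, 0.3, 0.6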
###############################################################
# K-means in 2D
# ----------------------
# First experiment with N=10,000 points in dimension D=2, with K=50 classes:
#
N, D, K = 10000, 2, 50
###############################################################
# Define our dataset:
x = 0.7 * torch.randn(N, D, dtype=dtype, device=device_id) + 0.3
###############################################################
# Perform the computation:
cl, c = KMeans(x, K, Niter=1000)
cl, c = KMeans(x, K, Niter=1000, init="kmeans++")
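To compare the two initializations quantitatively, one can evaluate the K-means objective (the inertia, i.e. the total squared distance from each point to its nearest centroid) with the same symbolic reduction. This is a sketch, assuming pykeops' standard .min reduction; the inertia helper below is introduced here for illustration and is not part of the tutorial:

def inertia(x, c):
    """Sum of squared distances from each point to its nearest centroid."""
    x_i = LazyTensor(x.view(x.shape[0], 1, x.shape[1]))  # (N, 1, D) samples
    c_j = LazyTensor(c.view(1, *c.shape))                # (1, K, D) centroids
    D_ij = ((x_i - c_j) ** 2).sum(-1)                    # (N, K) symbolic squared distances
    return D_ij.min(dim=1).sum().item()                  # reduce over clusters, then over points

print("Inertia after the kmeans++ run: {:.3f}".format(inertia(x, c)))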
###############################################################
# Fancy display:
plt.figure(figsize=(8, 8))
plt.scatter(x[:, 0].cpu(), x[:, 1].cpu(), c=cl.cpu(), s=30000 / len(x), cmap="tab10")
plt.scatter(c[:, 0].cpu(), c[:, 1].cpu(), c="black", s=50, alpha=0.8)
plt.axis([-2, 2, -2, 2])
plt.tight_layout()
plt.show()
####################################################################
# K-means in dimension 100
# -------------------------
# Second experiment with N=1,000,000 points in dimension D=100, with K=1,000 classes:
if use_cuda:
    N, D, K = 1000000, 100, 1000
    x = torch.randn(N, D, dtype=dtype, device=device_id)
    cl, c = KMeans(x, K, Niter=1000)
    cl, c = KMeans(x, K, Niter=1000, init="kmeans++")
Hi!
I was trying to apply kmeans to big data and I was really surprised by the capability of your library for this task!
I was also trying to implement kmeans with more advanced initializations. Random initialization is easy to write without causing memory overflows, but kmeans++ seems more challenging: a random choice function that samples from a specific distribution probably needs to be implemented first (e.g. something similar to torch.multinomial).
Do you think it is possible to add this feature to your library?
Thanks in advance!
Cheers,
Panagiotis Misiakos.