Progress Report 1
----

**Team Members**

Yaqian Cheng, Department of Statistical Science

Mengrun Li, Department of Statistical Science

**Github repository**

<https://github.com/cici7941/Sta_663_Statistical_Computation_Final_Project>

**Choice of paper** 

*Scalable K-Means++*

**Abstract**

*K-means* is one of the most popular clustering methods, and a good initialization is essential both for the quality of the solution it reaches and for its efficiency. The traditional *k-means* method has two main obstacles: it is inefficient in theory, and its final solution is only locally optimal. A better algorithm, *k-means++*, addresses the second problem with an improved procedure for initializing the cluster centers. However, the *k-means++* initialization is not parallelizable, because the selection of the *i*th center depends on the previous *i-1* centers [1]. *K-means||*, a parallelizable version of *k-means++*, has therefore been proposed; it aims both to improve the final solution and to run faster. In this report, we implement the algorithm from the paper "Scalable K-Means++" in Python, compare the clustering cost and runtime of *k-means*, *k-means++* and *k-means||*, test the main functions, profile the algorithm to identify bottlenecks, and optimize it using Cython. We then apply *k-means||* to a massive dataset to evaluate its performance.

**Outline**

1. Introduction
2. Algorithm  
    2.1 K-Means  
    2.2 K-Means++  
    2.3 K-Means||  
3. Code Testing
4. Profiling and Optimization
5. Application and Comparison

In [3]:
import scipy.linalg as la
import numpy as np

In [4]:
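# helper functions
def euc_dist(x, y):
    """Euclidean distance between points x and y."""
    return la.norm(x-y)

def centroid(X):
    """Mean of the rows of X."""
    return X.mean(0)

def d(x, Y):
    """Distance from point x to the nearest point in Y."""
    minDist = float("inf")
    for yi in Y:
        dist = euc_dist(x, yi)
        if dist < minDist:
            minDist = dist
    return minDist

def cost(Y, C):
    """Sum of squared distances from each point in Y to its nearest center in C."""
    total = 0
    for yi in Y:
        total += d(yi, C)**2
    return total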
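Since the report plans to profile the algorithm and look for bottlenecks, the nested Python loops in `d` and `cost` are a likely candidate. The block below is a minimal sketch of the same cost computed with NumPy broadcasting. The name `cost_vectorized` is hypothetical and not part of the notebook, and the sketch assumes `Y` and `C` are 2-D arrays with one point per row.

```python
import numpy as np

def cost_vectorized(Y, C):
    """Sum of squared distances from each row of Y to its nearest row of C.

    Hypothetical helper: a broadcasting-based alternative to cost(Y, C).
    """
    Y = np.asarray(Y, dtype=float)
    C = np.asarray(C, dtype=float)
    # diffs[i, j, :] is Y[i] - C[j]; the result has shape (n_points, n_centers, n_dims)
    diffs = Y[:, None, :] - C[None, :, :]
    # squared Euclidean distance from every point to every center
    sq_dist = np.sum(diffs**2, axis=2)
    # keep the nearest center for each point, then add up
    return sq_dist.min(axis=1).sum()

# toy data from the K-Means|| cell, with [2, 0] as the only center
X_demo = np.array([[1,0],[2,0],[3,5],[77,34],[6,88],[24,66],[90,12],[26,23],[91,100]])
C_demo = np.array([[2, 0]])
print(cost_vectorized(X_demo, C_demo))  # 46322.0
```

The intermediate array holds one entry per (point, center, dimension) triple, so for large inputs it may be preferable to process the points in chunks.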

In [5]:
# K-Means++
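This cell is still empty in the notebook. As a placeholder, here is a minimal sketch of the *k-means++* seeding step as it is usually described: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_pp_init` and the `rng` parameter are assumptions made for illustration, not the authors' implementation, and the sketch assumes `X` contains at least `k` distinct points.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Pick k initial centers from the rows of X by D^2 sampling (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # first center: uniform at random
    centers = [X[rng.integers(n)]]
    # squared distance from every point to its nearest chosen center
    d2 = np.sum((X - centers[0])**2, axis=1)
    for _ in range(1, k):
        # sample the next center with probability proportional to d2
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        # update the nearest-center distances with the new center
        d2 = np.minimum(d2, np.sum((X - X[idx])**2, axis=1))
    return np.array(centers)

X_demo = np.array([[1,0],[2,0],[3,5],[77,34],[6,88],[24,66],[90,12],[26,23],[91,100]])
print(kmeans_pp_init(X_demo, 3, rng=0))
```

Each draw depends on `d2`, which in turn depends on all centers chosen before it. This is the sequential dependence that keeps the seeding from being parallelized directly.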


In [29]:
# K-Means||
X_ori = np.array([[1,0],[2,0],[3,5],[77,34],[6,88],[24,66],[90,12],[26,23],[91,100]])
X = X_ori
k = 3
l = 2
##step1
##Sample a point uniformly at random from X
idx1 = np.random.choice(X_ori.shape[0],1,replace = False)
C = X[idx1]
print(idx1)
print("first centroid:",C)
##step2
##initial cost
gamma1 = cost(X,C)
print("initial cost:",gamma1)
##remove first centroid
X = np.delete(X, idx1, axis=0)
X

[1]
first centroid: [[2 0]]
initial cost: 46322.0


array([[  1,   0],
       [  3,   5],
       [ 77,  34],
       [  6,  88],
       [ 24,  66],
       [ 90,  12],
       [ 26,  23],
       [ 91, 100]])

In [30]:
##step3
##run about log(initial cost) oversampling rounds
for i in range(int(round(np.log(gamma1)))):
    ##squared distance from every remaining point to the current centers C
    dist2 = np.array([d(x, C)**2 for x in X])
    gamma = dist2.sum()
    ##stop once no remaining point has a positive distance to C
    if gamma == 0:
        break
    ##sample each point independently with probability l*d^2/gamma
    picked = np.random.uniform(size = X.shape[0]) < l*dist2/gamma
    ##update C and X only after the whole pass, so indices stay valid
    C = np.concatenate((C, X[picked]), axis = 0)
    X = X[~picked]
C
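After the oversampling rounds, the candidate set `C` generally holds more than `k` points. The paper's remaining steps weight each candidate by the number of data points closest to it and then recluster the weighted candidates into `k` centers. The block below is a minimal sketch of that idea. The function names `candidate_weights` and `weighted_kmeans_pp` are hypothetical, and using weighted D^2 sampling for the reclustering is one possible choice, not necessarily the one the authors will adopt. The sketch assumes at least `k` distinct candidates have positive weight.

```python
import numpy as np

def candidate_weights(X, C):
    """For each candidate in C, count the rows of X that are nearest to it."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    # squared distance from every point to every candidate
    sq_dist = np.sum((X[:, None, :] - C[None, :, :])**2, axis=2)
    nearest = sq_dist.argmin(axis=1)
    return np.bincount(nearest, minlength=C.shape[0])

def weighted_kmeans_pp(C, w, k, rng=None):
    """Choose k of the candidates by weighted D^2 sampling (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    C = np.asarray(C, dtype=float)
    w = np.asarray(w, dtype=float)
    # first center: sampled in proportion to its weight
    first = rng.choice(C.shape[0], p=w / w.sum())
    chosen = [first]
    d2 = np.sum((C - C[first])**2, axis=1)
    for _ in range(1, k):
        # later centers: weight times squared distance to the nearest chosen center
        probs = w * d2
        idx = rng.choice(C.shape[0], p=probs / probs.sum())
        chosen.append(idx)
        d2 = np.minimum(d2, np.sum((C - C[idx])**2, axis=1))
    return C[chosen]

X_demo = np.array([[1,0],[2,0],[3,5],[77,34],[6,88],[24,66],[90,12],[26,23],[91,100]])
C_demo = X_demo[[0, 3, 4, 8]]        # pretend these four points were oversampled
w_demo = candidate_weights(X_demo, C_demo)
print(w_demo)                        # [4 2 2 1]
print(weighted_kmeans_pp(C_demo, w_demo, 3, rng=0))
```

The weights are computed against the full dataset, so `X_demo` here plays the role of `X_ori` rather than the reduced `X`.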

In [25]:
idx = 2
#np.concatenate((C,X[idx,:]),axis = 0)
print(C)
[X[idx,:].tolist()]


[[ 6 88]]


[[3, 5]]

In [138]:
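gamma = cost(X,C)
##collect this round's samples in Ct and mark them for removal
Ct = np.empty((0, X.shape[1]), dtype = X.dtype)
keep = np.ones(X.shape[0], dtype = bool)
for idx in range(X.shape[0]):
    p = l*(d(X[idx,:],C))**2/gamma
    point = np.random.uniform(size = 1)
    if point < p:
        Ct = np.concatenate((Ct,[X[idx,:]]),axis = 0)
        keep[idx] = False
##remove the sampled points only after the loop, so idx stays in bounds
X = X[keep]
C = np.concatenate((C,Ct),axis = 0)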

In [124]:
A = np.array([[2,3]])
B = np.array([[4,5]])

In [125]:
np.concatenate((A,B),axis=0)

array([[2, 3],
       [4, 5]])

In [127]:
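# empty array with 0 rows and 2 columns, so rows can be appended with np.concatenate
Ct = np.empty((0, 2), dtype = int)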