# K-means Clustering
## Introduction  
The following Python code demonstrates basic clustering with the __k-means__ algorithm on Spark. In this example, we use the `KMeans` class from the __MLlib__ package. The algorithm does the following: 
1. Assign points to the "closest" cluster mean. 
2. Update the cluster mean.
3. Iterate until the assignments converge.  
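
The three steps above can be sketched in plain Python, without Spark. This is a minimal illustration only, not MLlib's implementation: the names `closest` and `lloyd_kmeans` are hypothetical, and the initial centers are simply sampled at random from the data.

```python
import random


def closest(point, centers):
    """Return the index of the center nearest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def lloyd_kmeans(points, k, max_iterations=20, seed=0):
    """Plain k-means: assign, update, repeat until the assignments stop changing."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    assignments = None
    for _ in range(max_iterations):
        # Step 1: assign each point to its closest center.
        new_assignments = [closest(p, centers) for p in points]
        # Step 3: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / float(len(members)) for col in zip(*members)]
    return centers, assignments
```

Calling `lloyd_kmeans(points, k=2)` on a list of coordinate tuples returns the final centers and one cluster index per point.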

The main __cost__ of the algorithm is calculating the distance between each point and the current cluster centers (centroids). To test the algorithm, we generate artificial data using another __MLlib__ module, `RandomRDDs`, which generates random RDDs. One of its functions is `normalVectorRDD`, which creates an RDD of vectors whose entries are drawn from a standard normal distribution. 
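
To see what this kind of test data looks like without a Spark context, here is a small local sketch. It assumes the same group sizes and centers as the Spark cell further down, but it uses Python's `random` module, so the actual draws differ from what `normalVectorRDD` produces. The helper name `shifted_normal_points` is hypothetical.

```python
import random


def shifted_normal_points(n, center, seed):
    """Draw n 2D points from a standard normal and shift them by `center`."""
    rng = random.Random(seed)
    return [(center[0] + rng.gauss(0, 1), center[1] + rng.gauss(0, 1))
            for _ in range(n)]


# Three groups of 20, 16 and 12 points around (1,5), (5,1) and (4,6).
points = (shifted_normal_points(20, (1, 5), seed=1)
          + shifted_normal_points(16, (5, 1), seed=2)
          + shifted_normal_points(12, (4, 6), seed=3))
```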

__Note:__ The __cost__ function returns the __Sum of Squared Errors__ (SSE) and is used to compare different __k-means__ runs.
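
Written out, with notation introduced here for illustration ($x_i$ for the $n$ data points and $c_j$ for the $k$ cluster centers), the sum of squared errors adds up each point's squared distance to its nearest center:

$$
\mathrm{SSE} = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2
$$

Dividing by the number of points gives the mean squared error per point, $\mathrm{SSE} / n$.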

In [1]:
# Import the necessary functions
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

# To generate a random data RDD we need `RandomRDDs`
from pyspark.mllib.random import RandomRDDs

# Generate three groups of random 2D points and shift each group to its own cluster center
c1_v = RandomRDDs.normalVectorRDD(sc,20,2,numPartitions=2,seed=1L).map(lambda v:np.add([1,5],v))
c2_v = RandomRDDs.normalVectorRDD(sc,16,2,numPartitions=2,seed=2L).map(lambda v:np.add([5,1],v))
c3_v = RandomRDDs.normalVectorRDD(sc,12,2,numPartitions=2,seed=3L).map(lambda v:np.add([4,6],v))

# Concatenate 2 RDDs with the `.union(other)` function
tmp_c = c1_v.union(c2_v)
my_data = tmp_c.union(c3_v)   #this now has all points, as RDD

# View the summary statistics of the data
print my_data.stats()

(count: 48, mean: [ 3.22923238  3.81929245], stdev: [ 2.05035446  2.52269532], max: [ 6.95641707  8.46783831], min: [-1.08338413 -0.49359456])


To train the model, we initialize the centers with `k-means||`, the parallelized variant of __k-means++__ implemented in __MLlib__. The training parameters are: 
- __k__ is the number of desired clusters. 
- __maxIterations__ is the maximum number of iterations to run.
- __initializationMode__ specifies either random initialization or initialization via `k-means||`.
- __runs__ is the number of times to run the k-means algorithm. Because __k-means__ is not guaranteed to find a globally optimal solution, the algorithm can be run several times on the same dataset, and the best clustering result is returned. (In Spark 2.0 and later this parameter has no effect.) 
- __initializationSteps__ determines the number of steps in the `k-means||` algorithm. 
- __epsilon__ determines the distance threshold within which we consider k-means to have converged. 
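
To give a feel for what the initialization does, here is a sketch of sequential __k-means++__ seeding in plain Python. It is not Spark's `k-means||`, which samples several candidates per step in parallel. It only illustrates the shared idea of preferring starting centers that are far apart. The function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus_seeds(points, k, seed=0):
    """Pick k initial centers, favouring points far from those already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        total = sum(weights)
        if total == 0:  # all remaining points coincide with a chosen center
            centers.append(rng.choice(points))
            continue
        # Sample one point with probability proportional to its weight.
        threshold = rng.random() * total
        cumulative = 0.0
        for point, weight in zip(points, weights):
            cumulative += weight
            if cumulative > threshold:
                centers.append(point)
                break
    return centers
```

Points that already serve as a center have weight zero, so they are not picked again as long as the data contains other distinct points.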

In [2]:
# Train the model using k-means++
my_kmmodel = KMeans.train(my_data,k=1,
               maxIterations=20,runs=1,
               initializationMode='k-means||',seed=10L)

# View the functions available on the model
dir(my_kmmodel)

['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'centers',
 'clusterCenters',
 'computeCost',
 'k',
 'load',
 'predict',
 'save']

In [3]:
# Use `computeCost` to compute the Sum of Squared Errors
print "Sum of Squared Error using computeCost():"
my_kmmodel.computeCost(my_data)

Sum of Squared Error using computeCost():


507.26136309159807

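__Note:__ If the `computeCost` function is not available, the following code manually computes each point's squared distance to its assigned center, and gives an example of coding a metric. Note that it averages these values with `.mean()` rather than summing them, so it reports the mean squared error per point, not the sum that `computeCost` returns. 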

In [4]:
# Get the squared error of a point to the center of the cluster it is assigned to
def getsse(point):
    # Look up the center of the cluster this point is assigned to
    this_center = my_kmmodel.centers[my_kmmodel.predict(point)]
    return sum([x**2 for x in (point - this_center)])


my_sse=my_data.map(getsse).collect()  #collect list of sse of each pt to its center

print "Mean Squared Error:"
np.array(my_sse).mean() 

Mean Squared Error:


10.567945064408294
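
The value above is smaller than the `computeCost` result because the cell takes the mean over the points instead of their sum. A quick check using only the numbers printed in this notebook:

```python
sse = 507.26136309159807   # computeCost() output for k=1, copied from above
n_points = 48              # count reported by my_data.stats()

# Mean squared error per point is the SSE divided by the number of points.
mean_squared_error = sse / n_points
```

`mean_squared_error` agrees with the 10.5679... printed by the manual calculation.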

## Changing the number of clusters: `k=4`

In [5]:
# Import the necessary functions
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

# To generate a random data RDD we need `RandomRDDs`
from pyspark.mllib.random import RandomRDDs

# Generate three groups of random 2D points and shift each group to its own cluster center
c1_v = RandomRDDs.normalVectorRDD(sc,20,2,numPartitions=2,seed=1L).map(lambda v:np.add([1,5],v))
c2_v = RandomRDDs.normalVectorRDD(sc,16,2,numPartitions=2,seed=2L).map(lambda v:np.add([5,1],v))
c3_v = RandomRDDs.normalVectorRDD(sc,12,2,numPartitions=2,seed=3L).map(lambda v:np.add([4,6],v))

# Concatenate 2 RDDs with the `.union(other)` function
tmp_c = c1_v.union(c2_v)
my_data = tmp_c.union(c3_v)   #this now has all points, as RDD

# Train the model using k-means++
my_kmmodel = KMeans.train(my_data,k=4,
               maxIterations=20,runs=1,
               initializationMode='k-means||',seed=10L)

# Get the squared error of a point to the center of the cluster it is assigned to
def getsse(point):
    # Look up the center of the cluster this point is assigned to
    this_center = my_kmmodel.centers[my_kmmodel.predict(point)]
    return sum([x**2 for x in (point - this_center)])


my_sse=my_data.map(getsse).collect()  #collect list of sse of each pt to its center

print "Mean Squared Error:"
np.array(my_sse).mean()

Mean Squared Error:


1.6550659149901916

## Changing the number of clusters: `k=3`

In [6]:
# Import the necessary functions
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

# To generate a random data RDD we need `RandomRDDs`
from pyspark.mllib.random import RandomRDDs

# Generate three groups of random 2D points and shift each group to its own cluster center
c1_v = RandomRDDs.normalVectorRDD(sc,20,2,numPartitions=2,seed=1L).map(lambda v:np.add([1,5],v))
c2_v = RandomRDDs.normalVectorRDD(sc,16,2,numPartitions=2,seed=2L).map(lambda v:np.add([5,1],v))
c3_v = RandomRDDs.normalVectorRDD(sc,12,2,numPartitions=2,seed=3L).map(lambda v:np.add([4,6],v))

# Concatenate 2 RDDs with the `.union(other)` function
tmp_c = c1_v.union(c2_v)
my_data = tmp_c.union(c3_v)   #this now has all points, as RDD

# Train the model using k-means++
my_kmmodel = KMeans.train(my_data,k=3,
               maxIterations=20,runs=1,
               initializationMode='k-means||',seed=10L)

# Get the squared error of a point to the center of the cluster it is assigned to
def getsse(point):
    # Look up the center of the cluster this point is assigned to
    this_center = my_kmmodel.centers[my_kmmodel.predict(point)]
    return sum([x**2 for x in (point - this_center)])


my_sse=my_data.map(getsse).collect()  #collect list of sse of each pt to its center

print "Mean Squared Error:"
np.array(my_sse).mean()

Mean Squared Error:


2.0255546450123463
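
To compare the three runs side by side, the sketch below collects the per-point values printed above and computes the relative reduction between consecutive values of `k`. The numbers are copied from this notebook. The variable names are only for illustration.

```python
# Mean squared error per point, copied from the runs with k=1, k=3 and k=4.
mean_sq_error = {1: 10.567945064408294, 3: 2.0255546450123463, 4: 1.6550659149901916}

ks = sorted(mean_sq_error)
# Relative reduction in error when moving from one k to the next larger one.
reduction = {
    (a, b): 1.0 - mean_sq_error[b] / mean_sq_error[a]
    for a, b in zip(ks, ks[1:])
}
```

Going from `k=1` to `k=3` removes roughly 81% of the error, while going from `k=3` to `k=4` removes only about 18% more. Since the error generally keeps shrinking as `k` grows, the lowest value alone is not a reliable way to choose `k`. Here the large drop up to `k=3` is consistent with the three centers used to generate the data.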