# Parallel KMeans implementation
Based on Bowen Wang et al., “Parallelizing K-means-based Clustering on Spark,” International Conference on Advanced Cloud and Big Data, 2016.

## Parallel partition-based algorithm outline
1. Initialize centroids by randomly selecting k points from the data set. Broadcast the selected centroids to all nodes
1. While the centroids are moving:
    1. Broadcast the current centroids
    1. For each point:
        1. Compute the distance to all centroids
        1. Assign the point to the closest cluster
    1. For each cluster:
        1. Compute the local mean within each partition
    1. Combine the local means from all partitions into the new centroid of each cluster
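
One iteration of the outline above can be sketched without Spark by treating plain Python lists as partitions. This is only an illustration under that assumption, not the paper's implementation: the helper names `assign`, `local_stats`, and `merge_stats` and the toy data are hypothetical.

```python
import math

def assign(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def local_stats(partition, centroids):
    """Per-partition step: sum and count of the points assigned to each cluster."""
    stats = {}
    for point in partition:
        j = assign(point, centroids)
        total, count = stats.get(j, ([0.0] * len(point), 0))
        stats[j] = ([t + x for t, x in zip(total, point)], count + 1)
    return stats

def merge_stats(all_stats, centroids):
    """Global step: combine per-partition sums and counts into new centroids."""
    new_centroids = []
    for j, old in enumerate(centroids):
        total, count = [0.0] * len(old), 0
        for stats in all_stats:
            if j in stats:
                total = [t + s for t, s in zip(total, stats[j][0])]
                count += stats[j][1]
        # A cluster that received no points keeps its previous centroid.
        new_centroids.append([t / count for t in total] if count else list(old))
    return new_centroids

# Two toy "partitions" of 2-D points (made-up data).
partitions = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(2.0, 0.0), (2.0, 2.0), (12.0, 10.0)],
]
centroids = [(0.0, 0.0), (10.0, 10.0)]

new_centroids = merge_stats([local_stats(p, centroids) for p in partitions], centroids)
print(new_centroids)  # [[1.0, 1.0], [11.0, 10.0]]
```

The sketch passes sums and counts between the two steps rather than local means, because averaging per-partition means directly gives a wrong result when partitions hold different numbers of points for a cluster.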
 

## Adaptations made to the suggested implementation of the algorithm:
1. The authors suggest using SparseVector; for the chosen data sets, regular arrays are a better fit
1. We initialize the centroids from a random sample, as described in *Scalable K-Means++*, because the quality of the initial centroids has a major effect on the quality of the resulting clustering
1. We use crisp clustering only, i.e. each point can be a member of exactly one cluster
1. We use Euclidean distance because it is the recommended distance function for dense data.
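
As a small illustration of the dense-array, crisp, Euclidean choices listed above, the assignment step can be written with NumPy broadcasting. This is a sketch on made-up points, and the function name `assign_clusters` is hypothetical rather than part of the notebook.

```python
import numpy as np

def assign_clusters(points, centroids):
    """Crisp assignment: index of the nearest centroid for every point."""
    # Broadcasting gives pairwise differences of shape (n_points, k, n_dims).
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Euclidean distance from every point to every centroid, shape (n_points, k).
    distances = np.linalg.norm(diffs, axis=2)
    # Exactly one label per point, so membership is never shared between clusters.
    return np.argmin(distances, axis=1)

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_clusters(points, centroids)
print(labels)  # [0 0 1 1]
```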

## Description of production cluster on Azure:
We run the Spark cluster on Azure HDInsight with the following configuration:
* two master nodes: Standard_D12_V2, 4 CPU cores, 28 GB RAM
* eight worker nodes: Standard_D13_V2, 8 CPU cores, 56 GB RAM


In [42]:
import os
import requests
import time
from pprint import pprint

from numpy import array
import numpy as np
import pandas as pd
from itertools import groupby

In [21]:
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import split, col, size, trim, lit
from pyspark.ml.linalg import Vectors, DenseVector

# Initialize a Spark session
spark = SparkSession.builder \
    .appName("FirstSparkJob") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
RANDOM_SEED = 301191
K = 3 

In [4]:
A3_DATASET_URL = "https://cs.joensuu.fi/sipu/datasets/a3.txt"
DATA_FOLDER = "/home/jovyan/work/data"
A3_LOCAL_PATH = os.path.join(DATA_FOLDER, "a3.txt")

In [5]:
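# Download the data set only if it is not already cached locally
if not os.path.exists(A3_LOCAL_PATH):
    os.makedirs(DATA_FOLDER, exist_ok=True)
    response = requests.get(A3_DATASET_URL, timeout=30)
    response.raise_for_status()  # Fail loudly instead of saving an error page
    with open(A3_LOCAL_PATH, 'wb') as file:
        file.write(response.content)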

In [9]:
# Load clean data into spark
data = sc.textFile(A3_LOCAL_PATH)
parsed_data = data.map(lambda row: Vectors.dense([float(x) for x in row.strip().split()]))
parsed_data.cache()

parsed_data.take(5)

[DenseVector([53920.0, 42968.0]),
 DenseVector([52019.0, 42206.0]),
 DenseVector([52570.0, 42476.0]),
 DenseVector([54220.0, 42081.0]),
 DenseVector([54268.0, 43420.0])]

In [60]:
def compute_distance(p, centroid):
    """Euclidean distance between a point and a centroid."""
    return np.linalg.norm(np.array(p) - np.array(centroid))

def kmeans(rdd, centroids, max_iters=10, tolerance=0.1):
    for i in range(max_iters):
        timer = time.time()

        # Broadcast the current centroids to all nodes
        broadcast_centroids = rdd.context.broadcast(centroids)

        # Assign each point to its closest centroid: (cluster index, (point, 1))
        clustered_rdd = rdd.map(
            lambda p: (int(np.argmin([compute_distance(p, c) for c in broadcast_centroids.value])), (p, 1))
        )

        # Recompute centroids by averaging the points in each cluster
        new_centroids = (
            clustered_rdd
            .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # Sum points and counts
            .mapValues(lambda s: s[0] / s[1])  # Mean = sum / count
            .collectAsMap()
        )

        # Free the broadcast copies on the executors (also when we stop early)
        broadcast_centroids.unpersist()

        # A cluster that received no points keeps its previous centroid
        new_centroids_arr = np.array(
            [new_centroids.get(j, centroids[j]) for j in range(len(centroids))], dtype=np.float64
        )

        # Sum of the distances the centroids moved in this iteration
        shifts = np.sum(np.linalg.norm(new_centroids_arr - centroids, axis=1))
        print(f"Iteration:{i}\tShifts:{shifts}\ttime taken:{time.time()-timer}")

        # Update before the convergence test so the latest centroids are returned
        centroids = new_centroids_arr
        if shifts < tolerance:
            break

    return centroids

# Run the K-Means with broadcasting
centroids = np.array(parsed_data.takeSample(False, K, seed=RANDOM_SEED), dtype=np.float64)
final_centroids = kmeans(parsed_data, centroids)
print("Final centroids:", final_centroids)

Iteration:0	Shifts:28496.909826879117	time taken:0.21987652778625488
Iteration:1	Shifts:14512.573275107046	time taken:0.2187638282775879
Iteration:2	Shifts:6404.757819968364	time taken:0.21590971946716309
Iteration:3	Shifts:3147.2457307404798	time taken:0.21690726280212402
Iteration:4	Shifts:1799.0314255011751	time taken:0.4547231197357178
Iteration:5	Shifts:1222.2122357790731	time taken:0.21929264068603516
Iteration:6	Shifts:1011.5807744583558	time taken:0.2185688018798828
Iteration:7	Shifts:756.896907787936	time taken:0.21982431411743164
Iteration:8	Shifts:472.1670675801665	time taken:0.2169511318206787
Iteration:9	Shifts:253.05295585440612	time taken:0.2142651081085205
Final centroids: [[19110.20186335 50176.14995563]
 [23398.7486223  15450.88639254]
 [50551.12642882 36918.3820575 ]]
