- Initialize the centers using kmeans++ algorithm
- Assign each point to the cluster with the nearest
- Update the clusters centers
- Repeats step 2 and 3 until convergence
The data set was generated from a gaussian mixture distribution.
Clusters obtained by runing the sequential Kmeans clustering algorithm.
- C++ compiler with c++-17 support (tested on clang)
- Make
- [Optional] Julia (DataFrames, CSV and Plots packages)
To build run make
The following results are obtained for:
- number of clusters : 3
- number of threads: 32
- number of repetitions: 30
- maximum number of iterations: 1000
- threshold : 0.1
To reproduce the results run ./benchmark 3 32 30 1000 0.1
- Used OPenMP for parallelization
- Used speedup to evaluate the performances of the parallelized kmeans algorithm
- Maximum speedup is ~7x
- T Kanungo, DM Mount, NS Netanyahu, C.D. Piatko, R. Silverman, A.Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, July 2002.
- Arthur, David and Sergei Vassilvitski. k-means++: The Advantages of Careful Seeding, 2007.
- Barbara Chapman, Gabrielle Jost, Ruud Van Der Pas. Using OpenMP, 2007.
- Joe Pitt-Francis, Jonathan Whiteley. Guide to Scientific Computing in C++, Second Edition, 2017.
- OpenMP 4.5 Reference Guide – C/C++