K-Means++ is the original seeding algorithm for K-Means as proposed by Arthur and Vassilvitskii in 2007 [kmpp].
It chooses one center uniformly, then computes distance of every datapoint to already chosen centers in order to use distance as weights when sampling next center. These steps are repeated until k centers are chosen.
Chosing good seedings speeds up convergence for K-Means, but extra time cost is occurred calculating all distances.
afkmc2.kmpp
KMC^2 was proposed as an improvement over K-Means++ in 2016 [kmc2]. While K-Means++ requires k full passes over the dataset, KMC^2 replaces the D^2 sampling step with Markov Chain Monte Carlo sampling. Runtime is no longer tied to number of datapoints while new centers will be chosen far from current centers.
afkmc2.kmc2
AFKMC^2 is an improvement proposed by the same authors [afkmc2]. While KMC^2 requires assumptions about the data generating distribution (in our implementation uniformity), this algorithm works without such assumptions. It the true D^2-sampling distribution with regards to the first center c_1
as a proposal distribution that can approximate nonuniform distributions.
This means an added runtime cost of O(nd)
to calculate the initial distribution, but performance improvements for nonuniform samples.
afkmc2.afkmc2
The author of this package proposed this slight runtime improvement for AFKMC^2 (it could also be applied to KMC^2). Since the first O(nd)
pass already calculates all distances between X
and c_1
we can at minimum save ourselves k*m distance calculations by storing these results to be reused in the Markov Chain. This comes with an additional space cost of O(nk)
.
The savings are highest for small datasets but can yield significant runtime improvement for very large/high-dimensional ones as well.
afkmc2.afkmc2_c
- afkmc2
Bachem, O., Lucic, M., Hassani, H., & Krause, A. (2016). Fast and Provably Good Seedings for k-Means. In Advances in Neural Information Processing Systems (pp. 55-63).
- kmc2
Bachem, O., Lucic, M., Hassani, S. H., & Krause, A. (2016, February). Approximate K-Means++ in Sublinear Time. In AAAI (pp. 1459-1467).
- kmpp
Arthur, D., & Vassilvitskii, S. (2007, January). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 1027-1035). Society for Industrial and Applied Mathematics.