[SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' #14948

srowen · 2016-09-03T08:24:52Z

What changes were proposed in this pull request?

Deprecate KMeans 'runs' param which was a no-op since 2.0
optimize/simplify code now that runs = 1 always
most significantly, choose initializationSteps = 2 as default for default k-means|| init because the paper implies that's as optimal and is much faster.

How was this patch tested?

Jenkins

…implify code now that runs = 1 always; most significantly, choose initializationSteps = 2 as default for default k-means|| init because the paper implies that's as optimal and is much faster.

SparkQA · 2016-09-03T09:00:11Z

Test build #64894 has finished for PR 14948 at commit dbed7d7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-03T10:58:06Z

mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala

@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {

    // Make sure code runs.
    var model = KMeans.train(data, k = 2, maxIterations = 1)
-    assert(model.clusterCenters.size === 2)
+    assert(model.clusterCenters.size === 1)


The optimization above that returns early if # centers <= k causes this behavior change. I think the new behavior is more correct, because before you could get duplicate centers.

This actually fixed SPARK-3261 too

SparkQA · 2016-09-03T11:53:14Z

Test build #64896 has finished for PR 14948 at commit 0a51e1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…e returned when there are < k inputs

srowen · 2016-09-03T14:12:37Z

Note this also now resolves SPARK-3261. This change already means that with k-means|| init, fewer than k cluster centers may be returned, which is probably correct (and faster). Now random init will also return no duplicate centers, and thus < k clusters when the input has size < k.

SparkQA · 2016-09-03T15:06:19Z

Test build #64899 has finished for PR 14948 at commit e7f12fa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-09-03T18:08:00Z

I think #14937 also removes runs. cc @yanboliang can we coordinate these PRs?

srowen · 2016-09-03T18:29:25Z

Ah yeah, it also removes runs from k-means||. The good news is I think these changes are actually not overlapping for the most part, and where they do, they're essentially the same change.

@yanboliang what do you think of this? This takes out runs entirely and I think simplifies the code even a bit further. But the real win was reducing the default init steps. I'm also here trying to fix the fact that duplicate centroids can be returned.

yanboliang · 2016-09-04T08:49:59Z

Yep, these changes are not overlapping with #14937 for the most part, and they are actually same changes for the overlapping part. Let's make this PR focus on the optimization of initialization step and to fix it's likely to return duplicate centroids.

For reducing initialization steps to 2 for default k-means||, I wonder the impact to the total training iteration number. It will definitely reduce the initialization time, but whether it will introduce more training iterations due to not good enough initial centers? If it does not introduce extra iterations for most cases, I think it's OK. Or we should trade off the initialization and training iterations. Thanks!

yanboliang · 2016-09-04T08:53:56Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

-    }.toArray)
+  private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
+    val sample = data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt())
+    sample.map(v => new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm))


This can be simplified as data.takeSample(false, k, new XORShiftRandom(this.seed).nextLong()).toSeq.toArray.

srowen · 2016-09-04T11:16:15Z

@yanboliang I'm going to close this PR and instead 'port' a few changes I made in this version for your consideration. Yours should be the primary PR for removing runs I think. I'll break out the two other changes in separate PRs.

You're right that this is the question regarding init steps -- more steps could make for a better clustering, which could indirectly mean a faster convergence too. Maybe you can check my work. I think the default of 5 was taken from Table 6 in http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf but it's not saying 5 is necessarily an optimal value. In fact Figure 5.2/5.3 imply that (for l/k=2 as we've chosen here) there's virtually no improvement for more than 2 init steps.

Coupled with the fact that an init step now takes about 5x longer than a single iteration, it seems like 5 is pretty expensive as a default too.

…|| init steps = 2 ## What changes were proposed in this pull request? Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion. See also #14948 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #14956 from srowen/SPARK-17389.2.

…|| init steps = 2 ## What changes were proposed in this pull request? Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion. See also apache#14948 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes apache#14956 from srowen/SPARK-17389.2.

## What changes were proposed in this pull request? This is a revival of #14948 and related to #14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it. This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #15342 from srowen/SPARK-11560.

## What changes were proposed in this pull request? This is a revival of apache#14948 and related to apache#14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it. This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes apache#15342 from srowen/SPARK-11560.

Deprecate KMeans 'runs' param which was a no-op since 2.0; optimize/s…

dbed7d7

…implify code now that runs = 1 always; most significantly, choose initializationSteps = 2 as default for default k-means|| init because the paper implies that's as optimal and is much faster.

Fix a test problem

0a51e1e

srowen reviewed Sep 3, 2016
View reviewed changes

srowen changed the title ~~[SPARK-17389] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs'~~ [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' Sep 3, 2016

Don't create duplicate cluster centers. This means < k clusters may b…

e7f12fa

…e returned when there are < k inputs

yanboliang reviewed Sep 4, 2016
View reviewed changes

srowen closed this Sep 4, 2016

This was referenced Sep 4, 2016

[SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans implementation. #14937

Closed

[SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 #14956

Closed

srowen deleted the SPARK-17389 branch September 6, 2016 14:37

srowen mentioned this pull request Oct 4, 2016

[SPARK-11560] [MLLIB] Optimize KMeans implementation / remove 'runs' #15342

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' #14948

[SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' #14948

srowen commented Sep 3, 2016

SparkQA commented Sep 3, 2016

srowen Sep 3, 2016

SparkQA commented Sep 3, 2016

srowen commented Sep 3, 2016

SparkQA commented Sep 3, 2016

sethah commented Sep 3, 2016

srowen commented Sep 3, 2016

yanboliang commented Sep 4, 2016

yanboliang Sep 4, 2016

srowen commented Sep 4, 2016

[SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' #14948

[SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' #14948

Conversation

srowen commented Sep 3, 2016

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented Sep 3, 2016

srowen Sep 3, 2016

Choose a reason for hiding this comment

SparkQA commented Sep 3, 2016

srowen commented Sep 3, 2016

SparkQA commented Sep 3, 2016

sethah commented Sep 3, 2016

srowen commented Sep 3, 2016

yanboliang commented Sep 4, 2016

yanboliang Sep 4, 2016

Choose a reason for hiding this comment

srowen commented Sep 4, 2016