[SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 #14956

srowen · 2016-09-04T11:50:42Z

What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also #14948

How was this patch tested?

Existing tests.

SparkQA · 2016-09-04T12:29:02Z

Test build #64923 has finished for PR 14956 at commit 014f26a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-04T16:46:00Z

Test build #64924 has finished for PR 14956 at commit 3343acc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-07T11:33:43Z

@mateiz pardon for the ping but I think you authored the line / comment I'm changing here. See my note at #14948 (comment) -- does that logic make sense to you regarding R=2 steps instead of 5?

mateiz · 2016-09-07T18:05:57Z

I think the number 5 is indeed from that paper (I think from figure 5.1 actually), but have you tested the effect of using R=2 empirically? It would be good to check that they match what's in the paper.

mateiz · 2016-09-07T18:06:16Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala

@@ -395,7 +395,7 @@ object PowerIterationClustering extends Logging {
    val points = v.mapValues(x => Vectors.dense(x)).cache()
    val model = new KMeans()
      .setK(k)
-      .setSeed(0L)
+      .setSeed(5L)


Why change this?

The desired test result here depends on the seed, since some random initializations don't happen to produce the clustering that the test has in mind. 0 no longer worked after the change above but 5 did. This does indicate the clustering is different. Yes that sounds like a good quick science experiment, to verify more empirically that the clustering results here match what the paper advertises.

Wait, but I don't get it, is this a test? This seems to be changing the main Power Iteration Clustering file. For the tests, we usually created data with very obvious clusterings where nearly any initialization would find the right thing. You should look over whichever test is failing without this change.

Power iteration clustering uses k-means as an internal step, which is why the change affects PIC and its test. You're right, really a test should assert something that is true no matter how the implementation behaves. It'll probably never work for 100% of seeds though I agree, it should work for most. It worked for, I think, the third seed I tried. Maybe those odds are too low. Let me look into what goes wrong with seed 0 and see if it implies that something should be improved about the test or impl.

Sorry Matei I kind of missed your point. Yes it's a bit more strange to be changing a seed in a non-test file. Same reasoning, but I agree. There's a seed here to begin with for determinism but probably shouldn't matter.

I think I understand the problem, and I think it's the test. k-means is used to cluster 1D data. The test case generates two concentric circles of points of radius 1 and 4, which are intended to form k=2 separate clusters in the derived values that are clustered by k-means internally. That's even clear from looking at the similarities plotted:

While it's clear what the clustering is supposed to be, it's not actually the lowest-cost k-means clustering. Many clusterings do find the 'wrong' better clustering which is one that would include a few of the leftmost elements of the right group into the left one. Many other clusterings get the 'right' answer which is a big local minimum but not optimal. In fact, k-means|| init seems to do worse than random here exactly because it's less likely to find the local minimum.

I don't think the choice of radii matters here, since the resulting values above are basically invariant.

I'm going to have to read the paper more to understand what the difference is here. It's not quite the k-means change here, and, we can make this test pass easily by either

Set seed back to 0 and fix init steps = 5 for this use of k-means, because that happens to work. Then this implementation doesn't change at all. It means it does more work just to make the test pass.

Set seed to, say, 5 to get this to pass, on the theory that the choice of seed still doesn't seem to matter per se, and 5 is no worse than 0.

Obviously I want to understand a little more about how this is ever supposed to work in PIC, though it ends up being a slightly different issue.

I got the tests to pass reliably by simply making the two sets of points generated in this test both contain 10 points, not 10 and 40. Balancing them made the issue go away.

As to why the paper 'works', I'm actually not clear it does. It does not actually just k-means cluster the values. They say they run 100 clusterings and take the most common cluster assignment. It's a little ambiguous what this means, but may be the source of difference. AFAICT the current PIC test does present a situation that PIC clustering won't get right, often, if it uses straight k-means internally.

srowen · 2016-09-09T23:08:00Z

@mateiz I was able to reproduce the result in Figure 5.3 very closely (Spambase data set) and the result in Figure 5.2 pretty closely (mixture of Gaussians). In each case it seems like the final cost doesn't change much after 2 init steps. I didn't find as much of a jump in cost for 1 init step in mixture of Gaussians, but otherwise it matched well in terms of the log10 of sum of squares cost reported in the paper vs the implementation. (I have code if anyone cares.)

For example, for the Spambase data (corresponds to Figure 5.3) here's number of steps vs log10 of cost for...

k=20
1 7.344
2 7.404
3 7.401
4 7.389
5 7.396
6 7.423
7 7.374
8 7.442
9 7.402
10 7.345

k=50
1 6.940
2 6.817
3 6.831
4 6.798
5 6.888
6 6.803
7 6.838
8 6.815
9 6.818
10 6.804

k=100
1 6.663
2 6.359
3 6.400
4 6.345
5 6.397
6 6.407
7 6.367
8 6.452
9 6.368
10 6.413

mateiz · 2016-09-10T03:52:50Z

Cool, then it does make sense to change it.

…discussion)

…uster sizes and pass more reliably

SparkQA · 2016-09-10T14:35:56Z

Test build #65210 has finished for PR 14956 at commit b5aaec9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mateiz · 2016-09-10T18:17:22Z

Cool, thanks for improving the PIC test.

srowen · 2016-09-11T07:01:01Z

Merged to master

…|| init steps = 2 ## What changes were proposed in this pull request? Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion. See also #14948 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #14956 from srowen/SPARK-17389.2.

…ps from 5 to 2. ## What changes were proposed in this pull request? apache#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#15050 from yanboliang/spark-17389.

…|| init steps = 2 ## What changes were proposed in this pull request? Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion. See also apache#14948 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes apache#14956 from srowen/SPARK-17389.2.

…ps from 5 to 2. ## What changes were proposed in this pull request? apache#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#15050 from yanboliang/spark-17389.

mateiz reviewed Sep 7, 2016
View reviewed changes

srowen added 3 commits September 10, 2016 11:23

Reduce default k-means|| init steps to 2 from 5 (see SPARK-17389 for …

2704042

…discussion)

Update seed so that PIC test passes again

0e02bfd

Alter PIC test emulating "3-circles" data set to use more balanced cl…

b5aaec9

…uster sizes and pass more reliably

srowen force-pushed the SPARK-17389.2 branch from 3343acc to b5aaec9 Compare September 10, 2016 13:40

srowen closed this Sep 11, 2016

srowen deleted the SPARK-17389.2 branch September 11, 2016 07:02

yanboliang mentioned this pull request Sep 11, 2016

[SPARK-17389][Follow-up][ML] Change KMeans k-means|| default init steps from 5 to 2. #15050

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 #14956

[SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 #14956

srowen commented Sep 4, 2016 •

edited

Loading

SparkQA commented Sep 4, 2016

SparkQA commented Sep 4, 2016

srowen commented Sep 7, 2016

mateiz commented Sep 7, 2016

mateiz Sep 7, 2016

srowen Sep 7, 2016

mateiz Sep 10, 2016

srowen Sep 10, 2016

srowen Sep 10, 2016

srowen Sep 10, 2016

srowen commented Sep 9, 2016 •

edited

Loading

mateiz commented Sep 10, 2016

SparkQA commented Sep 10, 2016

mateiz commented Sep 10, 2016

srowen commented Sep 11, 2016

[SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 #14956

[SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 #14956

Conversation

srowen commented Sep 4, 2016 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented Sep 4, 2016

SparkQA commented Sep 4, 2016

srowen commented Sep 7, 2016

mateiz commented Sep 7, 2016

mateiz Sep 7, 2016

Choose a reason for hiding this comment

srowen Sep 7, 2016

Choose a reason for hiding this comment

mateiz Sep 10, 2016

Choose a reason for hiding this comment

srowen Sep 10, 2016

Choose a reason for hiding this comment

srowen Sep 10, 2016

Choose a reason for hiding this comment

srowen Sep 10, 2016

Choose a reason for hiding this comment

srowen commented Sep 9, 2016 • edited Loading

mateiz commented Sep 10, 2016

SparkQA commented Sep 10, 2016

mateiz commented Sep 10, 2016

srowen commented Sep 11, 2016

srowen commented Sep 4, 2016 •

edited

Loading

srowen commented Sep 9, 2016 •

edited

Loading