Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17389] [ML] [MLLIB] KMeans speedup with better choice of k-means|| init steps = 2 #14956

Closed
wants to merge 3 commits into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Sep 4, 2016

What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also #14948

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Sep 4, 2016

Test build #64923 has finished for PR 14956 at commit 014f26a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 4, 2016

Test build #64924 has finished for PR 14956 at commit 3343acc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Sep 7, 2016

@mateiz pardon for the ping but I think you authored the line / comment I'm changing here. See my note at #14948 (comment) -- does that logic make sense to you regarding R=2 steps instead of 5?

@mateiz
Copy link
Contributor

mateiz commented Sep 7, 2016

I think the number 5 is indeed from that paper (I think from figure 5.1 actually), but have you tested the effect of using R=2 empirically? It would be good to check that they match what's in the paper.

@@ -395,7 +395,7 @@ object PowerIterationClustering extends Logging {
val points = v.mapValues(x => Vectors.dense(x)).cache()
val model = new KMeans()
.setK(k)
.setSeed(0L)
.setSeed(5L)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The desired test result here depends on the seed, since some random initializations don't happen to produce the clustering that the test has in mind. 0 no longer worked after the change above but 5 did. This does indicate the clustering is different. Yes that sounds like a good quick science experiment, to verify more empirically that the clustering results here match what the paper advertises.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, but I don't get it, is this a test? This seems to be changing the main Power Iteration Clustering file. For the tests, we usually created data with very obvious clusterings where nearly any initialization would find the right thing. You should look over whichever test is failing without this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Power iteration clustering uses k-means as an internal step, which is why the change affects PIC and its test. You're right, really a test should assert something that is true no matter how the implementation behaves. It'll probably never work for 100% of seeds though I agree, it should work for most. It worked for, I think, the third seed I tried. Maybe those odds are too low. Let me look into what goes wrong with seed 0 and see if it implies that something should be improved about the test or impl.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry Matei I kind of missed your point. Yes it's a bit more strange to be changing a seed in a non-test file. Same reasoning, but I agree. There's a seed here to begin with for determinism but probably shouldn't matter.

I think I understand the problem, and I think it's the test. k-means is used to cluster 1D data. The test case generates two concentric circles of points of radius 1 and 4, which are intended to form k=2 separate clusters in the derived values that are clustered by k-means internally. That's even clear from looking at the similarities plotted:

rplot

While it's clear what the clustering is supposed to be, it's not actually the lowest-cost k-means clustering. Many clusterings do find the 'wrong' better clustering which is one that would include a few of the leftmost elements of the right group into the left one. Many other clusterings get the 'right' answer which is a big local minimum but not optimal. In fact, k-means|| init seems to do worse than random here exactly because it's less likely to find the local minimum.

I don't think the choice of radii matters here, since the resulting values above are basically invariant.

I'm going to have to read the paper more to understand what the difference is here. It's not quite the k-means change here, and, we can make this test pass easily by either

  • Set seed back to 0 and fix init steps = 5 for this use of k-means, because that happens to work. Then this implementation doesn't change at all. It means it does more work just to make the test pass.
  • Set seed to, say, 5 to get this to pass, on the theory that the choice of seed still doesn't seem to matter per se, and 5 is no worse than 0.

Obviously I want to understand a little more about how this is ever supposed to work in PIC, though it ends up being a slightly different issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the tests to pass reliably by simply making the two sets of points generated in this test both contain 10 points, not 10 and 40. Balancing them made the issue go away.

As to why the paper 'works', I'm actually not clear it does. It does not actually just k-means cluster the values. They say they run 100 clusterings and take the most common cluster assignment. It's a little ambiguous what this means, but may be the source of difference. AFAICT the current PIC test does present a situation that PIC clustering won't get right, often, if it uses straight k-means internally.

@srowen
Copy link
Member Author

srowen commented Sep 9, 2016

@mateiz I was able to reproduce the result in Figure 5.3 very closely (Spambase data set) and the result in Figure 5.2 pretty closely (mixture of Gaussians). In each case it seems like the final cost doesn't change much after 2 init steps. I didn't find as much of a jump in cost for 1 init step in mixture of Gaussians, but otherwise it matched well in terms of the log10 of sum of squares cost reported in the paper vs the implementation. (I have code if anyone cares.)

For example, for the Spambase data (corresponds to Figure 5.3) here's number of steps vs log10 of cost for...

k=20
1 7.344
2 7.404
3 7.401
4 7.389
5 7.396
6 7.423
7 7.374
8 7.442
9 7.402
10 7.345

k=50
1 6.940
2 6.817
3 6.831
4 6.798
5 6.888
6 6.803
7 6.838
8 6.815
9 6.818
10 6.804

k=100
1 6.663
2 6.359
3 6.400
4 6.345
5 6.397
6 6.407
7 6.367
8 6.452
9 6.368
10 6.413

@mateiz
Copy link
Contributor

mateiz commented Sep 10, 2016

Cool, then it does make sense to change it.

@SparkQA
Copy link

SparkQA commented Sep 10, 2016

Test build #65210 has finished for PR 14956 at commit b5aaec9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mateiz
Copy link
Contributor

mateiz commented Sep 10, 2016

Cool, thanks for improving the PIC test.

@srowen
Copy link
Member Author

srowen commented Sep 11, 2016

Merged to master

@srowen srowen closed this Sep 11, 2016
@srowen srowen deleted the SPARK-17389.2 branch September 11, 2016 07:02
asfgit pushed a commit that referenced this pull request Sep 11, 2016
…|| init steps = 2

## What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also #14948

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #14956 from srowen/SPARK-17389.2.
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 11, 2016
…ps from 5 to 2.

## What changes were proposed in this pull request?
apache#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#15050 from yanboliang/spark-17389.
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…|| init steps = 2

## What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also apache#14948

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#14956 from srowen/SPARK-17389.2.
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…ps from 5 to 2.

## What changes were proposed in this pull request?
apache#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#15050 from yanboliang/spark-17389.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants