Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' #14948

Closed
wants to merge 3 commits into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Sep 3, 2016

What changes were proposed in this pull request?

  • Deprecate KMeans 'runs' param which was a no-op since 2.0
  • optimize/simplify code now that runs = 1 always
  • most significantly, choose initializationSteps = 2 as default for default k-means|| init because the paper implies that's as optimal and is much faster.

How was this patch tested?

Jenkins

…implify code now that runs = 1 always; most significantly, choose initializationSteps = 2 as default for default k-means|| init because the paper implies that's as optimal and is much faster.
@SparkQA
Copy link

SparkQA commented Sep 3, 2016

Test build #64894 has finished for PR 14948 at commit dbed7d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {

// Make sure code runs.
var model = KMeans.train(data, k = 2, maxIterations = 1)
assert(model.clusterCenters.size === 2)
assert(model.clusterCenters.size === 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optimization above that returns early if # centers <= k causes this behavior change. I think the new behavior is more correct, because before you could get duplicate centers.

This actually fixed SPARK-3261 too

@SparkQA
Copy link

SparkQA commented Sep 3, 2016

Test build #64896 has finished for PR 14948 at commit 0a51e1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen changed the title [SPARK-17389] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' Sep 3, 2016
@srowen
Copy link
Member Author

srowen commented Sep 3, 2016

Note this also now resolves SPARK-3261. This change already means that with k-means|| init, fewer than k cluster centers may be returned, which is probably correct (and faster). Now random init will also return no duplicate centers, and thus < k clusters when the input has size < k.

@SparkQA
Copy link

SparkQA commented Sep 3, 2016

Test build #64899 has finished for PR 14948 at commit e7f12fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Sep 3, 2016

I think #14937 also removes runs. cc @yanboliang can we coordinate these PRs?

@srowen
Copy link
Member Author

srowen commented Sep 3, 2016

Ah yeah, it also removes runs from k-means||. The good news is I think these changes are actually not overlapping for the most part, and where they do, they're essentially the same change.

@yanboliang what do you think of this? This takes out runs entirely and I think simplifies the code even a bit further. But the real win was reducing the default init steps. I'm also here trying to fix the fact that duplicate centroids can be returned.

@yanboliang
Copy link
Contributor

Yep, these changes are not overlapping with #14937 for the most part, and they are actually same changes for the overlapping part. Let's make this PR focus on the optimization of initialization step and to fix it's likely to return duplicate centroids.

For reducing initialization steps to 2 for default k-means||, I wonder the impact to the total training iteration number. It will definitely reduce the initialization time, but whether it will introduce more training iterations due to not good enough initial centers? If it does not introduce extra iterations for most cases, I think it's OK. Or we should trade off the initialization and training iterations. Thanks!

}.toArray)
private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
val sample = data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt())
sample.map(v => new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified as data.takeSample(false, k, new XORShiftRandom(this.seed).nextLong()).toSeq.toArray.

@srowen
Copy link
Member Author

srowen commented Sep 4, 2016

@yanboliang I'm going to close this PR and instead 'port' a few changes I made in this version for your consideration. Yours should be the primary PR for removing runs I think. I'll break out the two other changes in separate PRs.

You're right that this is the question regarding init steps -- more steps could make for a better clustering, which could indirectly mean a faster convergence too. Maybe you can check my work. I think the default of 5 was taken from Table 6 in http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf but it's not saying 5 is necessarily an optimal value. In fact Figure 5.2/5.3 imply that (for l/k=2 as we've chosen here) there's virtually no improvement for more than 2 init steps.

Coupled with the fact that an init step now takes about 5x longer than a single iteration, it seems like 5 is pretty expensive as a default too.

@srowen srowen closed this Sep 4, 2016
@srowen srowen deleted the SPARK-17389 branch September 6, 2016 14:37
asfgit pushed a commit that referenced this pull request Sep 11, 2016
…|| init steps = 2

## What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also #14948

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #14956 from srowen/SPARK-17389.2.
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…|| init steps = 2

## What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also apache#14948

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#14956 from srowen/SPARK-17389.2.
asfgit pushed a commit that referenced this pull request Oct 12, 2016
## What changes were proposed in this pull request?

This is a revival of #14948 and related to #14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it.

This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15342 from srowen/SPARK-11560.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

This is a revival of apache#14948 and related to apache#14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it.

This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes apache#15342 from srowen/SPARK-11560.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants