-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs' #14948
Conversation
…implify code now that runs = 1 always; most significantly, choose initializationSteps = 2 as default for default k-means|| init because the paper implies that's as optimal and is much faster.
Test build #64894 has finished for PR 14948 at commit
|
@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext { | |||
|
|||
// Make sure code runs. | |||
var model = KMeans.train(data, k = 2, maxIterations = 1) | |||
assert(model.clusterCenters.size === 2) | |||
assert(model.clusterCenters.size === 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The optimization above that returns early if # centers <= k causes this behavior change. I think the new behavior is more correct, because before you could get duplicate centers.
This actually fixed SPARK-3261 too
Test build #64896 has finished for PR 14948 at commit
|
…e returned when there are < k inputs
Note this also now resolves SPARK-3261. This change already means that with k-means|| init, fewer than k cluster centers may be returned, which is probably correct (and faster). Now random init will also return no duplicate centers, and thus < k clusters when the input has size < k. |
Test build #64899 has finished for PR 14948 at commit
|
I think #14937 also removes runs. cc @yanboliang can we coordinate these PRs? |
Ah yeah, it also removes runs from k-means||. The good news is I think these changes are actually not overlapping for the most part, and where they do, they're essentially the same change. @yanboliang what do you think of this? This takes out |
Yep, these changes are not overlapping with #14937 for the most part, and they are actually same changes for the overlapping part. Let's make this PR focus on the optimization of initialization step and to fix it's likely to return duplicate centroids. For reducing initialization steps to 2 for default k-means||, I wonder the impact to the total training iteration number. It will definitely reduce the initialization time, but whether it will introduce more training iterations due to not good enough initial centers? If it does not introduce extra iterations for most cases, I think it's OK. Or we should trade off the initialization and training iterations. Thanks! |
}.toArray) | ||
private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = { | ||
val sample = data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt()) | ||
sample.map(v => new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified as data.takeSample(false, k, new XORShiftRandom(this.seed).nextLong()).toSeq.toArray
.
@yanboliang I'm going to close this PR and instead 'port' a few changes I made in this version for your consideration. Yours should be the primary PR for removing You're right that this is the question regarding init steps -- more steps could make for a better clustering, which could indirectly mean a faster convergence too. Maybe you can check my work. I think the default of 5 was taken from Table 6 in http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf but it's not saying 5 is necessarily an optimal value. In fact Figure 5.2/5.3 imply that (for l/k=2 as we've chosen here) there's virtually no improvement for more than 2 init steps. Coupled with the fact that an init step now takes about 5x longer than a single iteration, it seems like 5 is pretty expensive as a default too. |
…|| init steps = 2 ## What changes were proposed in this pull request? Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion. See also #14948 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #14956 from srowen/SPARK-17389.2.
…|| init steps = 2 ## What changes were proposed in this pull request? Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion. See also apache#14948 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes apache#14956 from srowen/SPARK-17389.2.
## What changes were proposed in this pull request? This is a revival of #14948 and related to #14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it. This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #15342 from srowen/SPARK-11560.
## What changes were proposed in this pull request? This is a revival of apache#14948 and related to apache#14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it. This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes apache#15342 from srowen/SPARK-11560.
What changes were proposed in this pull request?
How was this patch tested?
Jenkins