[SPARK-11560] [MLLIB] Optimize KMeans implementation / remove 'runs' #15342

Closed · wants to merge 6 commits into master from srowen:SPARK-11560

Conversation · 4 participants
srowen (Member) commented Oct 4, 2016

What changes were proposed in this pull request?

This is a revival of #14948 and related to #14937. It removes the 'runs' parameter, which has already been disabled, from the K-means implementation, and further deprecates API methods that involve it.

This also happens to resolve the issue that K-means should not return duplicate centers, meaning it may return fewer than k centroids if not enough distinct data is available.

How was this patch tested?

Existing tests
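The duplicate-centers point can be seen in a small, Spark-free sketch (all names here are hypothetical, not the MLlib code): when a dataset has fewer distinct points than k, an initialization that samples with replacement necessarily produces repeated centroids, and deduplicating them leaves fewer than k.

```scala
import scala.util.Random

// Hypothetical sketch, not the MLlib implementation: sample k initial
// centers with replacement, as a random init would, from a dataset with
// only two distinct points.
def initRandom(points: Seq[Seq[Double]], k: Int, seed: Long): Seq[Seq[Double]] = {
  val rng = new Random(seed)
  Seq.fill(k)(points(rng.nextInt(points.length)))
}

val points  = Seq(Seq(0.0, 0.0), Seq(1.0, 1.0)) // only 2 distinct points
val centers = initRandom(points, k = 5, seed = 42L)

// With 5 draws from 2 distinct points, duplicates are unavoidable;
// keeping only distinct centers yields fewer than k centroids.
val distinctCenters = centers.distinct
```

Returning `distinctCenters` here is the "fewer than k" behavior this PR describes; returning `centers` unchanged is the keep-duplicates behavior it replaces.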


SparkQA commented Oct 4, 2016

Test build #66310 has finished for PR 15342 at commit cd14b65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -558,6 +475,7 @@ object KMeans {
* Trains a k-means model using specified parameters and the default values for unspecified.
*/
@Since("0.8.0")
@deprecated("Use train method without 'runs'", "2.1.0")

sethah (Contributor) commented Oct 4, 2016

There are two other train signatures that use runs, but have not been marked as deprecated.

srowen (Member) commented Oct 5, 2016

Yes, though there's no alternative to those with the same arguments. We could add another overload and deprecate the others. I'm OK with that too; it just felt a little gross to add yet more.

sethah (Contributor) commented Oct 10, 2016

I think we should add them for completeness, and deprecate all overloads using runs.
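The pattern being proposed can be sketched as follows (simplified, hypothetical signatures; the real overloads take an RDD[Vector] and more parameters): keep each runs overload, mark it @deprecated, and forward to a runs-free overload that ignores the parameter, as the implementation already does.

```scala
// Hypothetical, simplified sketch of the deprecation pattern (not the real
// MLlib signatures): the runs-free overload does the work, and the old
// overload forwards to it, silently ignoring 'runs'.
object KMeansSketch {
  def train(data: Seq[Seq[Double]], k: Int, maxIterations: Int): Int =
    k // stand-in for actual training; returns a placeholder value

  @deprecated("Use train method without 'runs'", "2.1.0")
  def train(data: Seq[Seq[Double]], k: Int, maxIterations: Int, runs: Int): Int =
    train(data, k, maxIterations) // 'runs' has no effect
}
```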

sethah (Contributor) commented Oct 4, 2016

Looking good. My main concern is that you can now have the following:

scala> model.getK
res2: Int = 3

scala> model.clusterCenters.length
res3: Int = 1

We could set the model's k to match the cluster centers' length during training, before creating the model. Or we could leave it as is, but then what does k mean, if not the number of centers?
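The first option mentioned above, deriving k from the centers actually found, can be sketched like this (hypothetical model class, not the real KMeansModel):

```scala
// Hypothetical sketch: make k a function of the centers rather than a
// separately stored parameter, so the getK/clusterCenters.length
// mismatch above cannot occur.
final case class KMeansModelSketch(clusterCenters: Array[Array[Double]]) {
  def getK: Int = clusterCenters.length
}

// Training asked for k = 3 but only one distinct center survived:
val model = KMeansModelSketch(Array(Array(0.0, 0.0)))
// model.getK equals model.clusterCenters.length by construction
```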

srowen (Member) commented Oct 5, 2016

That's right. k seems like the requested number of centroids, which may not match the actual number in corner cases. What about just documenting that more?

srowen (Member) commented Oct 5, 2016

Otherwise updated to reflect all the other review comments, thanks.

SparkQA commented Oct 5, 2016

Test build #66381 has finished for PR 15342 at commit ebbb852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
yanboliang (Contributor) commented Oct 5, 2016

I would prefer to maintain the original logic, which keeps model.clusterCenters.length equal to k. Was there any discussion about making this change?
I checked scikit-learn, the popular Python machine learning library: it returns the requested number of centroids even if that is greater than the number of distinct data points. (screenshot)

R's kmeans, on the other hand, throws an error if more cluster centers are requested than there are distinct data points. (screenshot)

srowen (Member) commented Oct 5, 2016

This is what SPARK-3261 is about. It's a corner case, to be sure. To me, having duplicate centroids seems worse because the model loses some of its meaning: points may be assigned arbitrarily to one or the other of two identical centroids. Of the 3 possible behaviors, it looks like we have all 3 on the table:

  1. error
  2. return < k centroids
  3. return k centroids

I suppose I prefer the new behavior, but I can't say I feel that strongly. I guess matching scikit-learn has some value.
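The three behaviors on the table can be made concrete in a small sketch (names invented for illustration; none of this is Spark code):

```scala
// The three options: error out (R-style), drop duplicates and return
// fewer than k (this PR's behavior), or keep duplicates and always
// return k (scikit-learn-style).
sealed trait DupPolicy
case object ErrorOut    extends DupPolicy
case object ReturnFewer extends DupPolicy
case object ReturnAllK  extends DupPolicy

def resolveCenters(centers: Seq[Seq[Double]], k: Int, policy: DupPolicy): Seq[Seq[Double]] = {
  val distinct = centers.distinct
  policy match {
    case ErrorOut if distinct.length < k =>
      throw new IllegalArgumentException(s"only ${distinct.length} distinct centers for k=$k")
    case ReturnFewer => distinct // model may hold fewer than k centers
    case _           => centers  // keep duplicates, always k centers
  }
}

val found = Seq(Seq(0.0), Seq(0.0), Seq(1.0)) // 3 centers, 2 distinct
val fewer = resolveCenters(found, 3, ReturnFewer)
val allK  = resolveCenters(found, 3, ReturnAllK)
```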

sethah (Contributor) commented Oct 6, 2016

What are the circumstances that lead to duplicate cluster centers, other than the obvious case of having fewer data points than requested centers? The comment on the original JIRA said that training on 1.3M points while asking for 10k clusters returned only 1k centers.

srowen (Member) commented Oct 6, 2016

Good question. I think he's saying that it returned 1K centers after this change. It's a good point that this would also speed things up considerably, because computing the distance to duplicate centroids is all superfluous work.

sethah (Contributor) commented Oct 7, 2016

@srowen That is not the impression that I got from "I just ran clustering on 1.3M points, asking for 10,000 clusters. This clustering run resulted in 1019 unique cluster centers."

@derrickburns Can you clarify a bit here? Also, could you tell us the nature of the data that was used for your clustering?

srowen (Member) commented Oct 8, 2016

I'm wondering: what's the use case for allowing duplicate centroids? It doesn't have a reasonable meaning, and it slows down execution. I don't feel that strongly about it and would like to get the removal of 'runs' in regardless, so I could back that part out, but I'd be a little more convinced if it were about more than just matching scikit-learn.

Sean Owen added some commits Oct 4, 2016
srowen (Member) commented Oct 8, 2016

I backed out the change for SPARK-3261; that part is actually tiny and separable now anyway. We can still discuss it here, but I wanted to split it from the main change for expediency.

@srowen srowen changed the title from [SPARK-11560] [SPARK-3261] [MLLIB] Optimize KMeans implementation / remove 'runs' / KMeans clusterer can return duplicate cluster centers to [SPARK-11560] [MLLIB] Optimize KMeans implementation / remove 'runs' Oct 8, 2016

SparkQA commented Oct 8, 2016

Test build #66578 has finished for PR 15342 at commit 68e3d90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
srowen (Member) commented Oct 10, 2016

@sethah are you OK with this part? We can still talk about the k centroids bit, either here or on the JIRA.

sethah (Contributor) commented Oct 10, 2016

@srowen I will take a look shortly.

new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm)
}.toArray)
private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
data.takeSample(true, k, new XORShiftRandom(this.seed).nextInt()).map(_.toDense)

yanboliang (Contributor) commented Oct 10, 2016

Is it necessary to cast the vector to a dense one?

sethah (Contributor) commented Oct 10, 2016

I guess not, but the centers become dense immediately, in the first iteration of runAlgorithm.

srowen (Member) commented Oct 10, 2016

At least, it's what the existing code did.

yanboliang (Contributor) commented Oct 11, 2016

Maybe we can optimize this in #14937, since that change will treat dense and sparse vectors differently.

chosen.foreach { case (p, rs) =>
rs.foreach(newCenters(_) += p.toDense)
}
newCenters = chosen.map(_.toDense)

yanboliang (Contributor) commented Oct 10, 2016

Ditto.

bcCenters.destroy(blocking = false)
// Update the cluster centers and costs
converged = true

yanboliang (Contributor) commented Oct 10, 2016

I think changed would be more intuitive.

srowen (Member) commented Oct 10, 2016

Hm, I thought the opposite flag, converged, was more intuitive. If you don't feel strongly about it, let's leave it; but if you'd moderately prefer changed, I don't mind that. I think it's the same thing with the flag inverted.

// Update the cluster centers and costs
converged = true
totalContribs.foreach { case (j, (sum, count)) =>

yanboliang (Contributor) commented Oct 10, 2016

Compared with the original code, foreach may be slower than a while loop if you have a large k.

srowen (Member) commented Oct 10, 2016

Why is that? I'm aware that Scala for comprehensions can desugar into something surprisingly expensive, but this seems clearer and about the same cost as a while loop.

sethah (Contributor) commented Oct 10, 2016

In general, while is faster than foreach (which creates and calls an anonymous function), but I'd be surprised if it affected performance here, because this runs only once per iteration and the bulk of the cost is the distributed computation.
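The two styles under discussion can be compared in a Spark-free sketch (hypothetical data; in the real code totalContribs holds per-cluster vector sums and counts gathered from the executors):

```scala
// Both loops fold the same per-cluster (sum, count) contributions into a
// mean. `foreach` allocates and invokes a closure per element, which
// `while` avoids, but either way this runs once per k-means iteration
// on the driver.
val totalContribs = Array(0 -> (10.0, 5L), 1 -> (6.0, 3L))

// foreach style (the PR's version):
val means = new Array[Double](2)
totalContribs.foreach { case (j, (sum, count)) => means(j) = sum / count }

// while-loop style (closer to the original code):
val meansWhile = new Array[Double](2)
var i = 0
while (i < totalContribs.length) {
  val (j, (sum, count)) = totalContribs(i)
  meansWhile(j) = sum / count
  i += 1
}
```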

sethah (Contributor)

A few minor things, but otherwise LGTM.

bcCenters.destroy(blocking = false)
// Update the cluster centers and costs
converged = true

sethah (Contributor) commented Oct 10, 2016

Why don't we just leave converged false, and only change it to true inside the foreach?

srowen (Member) commented Oct 10, 2016

Unless I'm overlooking some obviously nicer expression, I think the loop works the same either way: on each iteration, you have to assume you terminate unless a distance proves otherwise.

sethah (Contributor) commented Oct 10, 2016

The logic is the same, yes, but it seems really strange to set something to false, then on each iteration set it to true and then back to false if some condition holds. Why not leave it false and change it to true when the convergence criterion is met? This is a trivial detail, so only change it if you want. I'm fine either way.

srowen (Member) commented Oct 10, 2016

I don't think it can be done the way you're suggesting; it's not just preference. Usually you could set it with a simple .forall call, as you're suggesting, but here we also need the side effect of visiting each element. To do both, I think we have to 'unroll' the equivalent logic, and it amounts to this.

sethah (Contributor) commented Oct 10, 2016

Yep, you're correct. Thanks!
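The loop shape discussed above — assume convergence, update every center as a side effect, and flip the flag back if any center moved — can be sketched like this (hypothetical one-dimensional data and an invented epsilon; not the MLlib code):

```scala
// Sketch: per k-means iteration, converged starts true and is falsified by
// any center that moved more than epsilon. The centers are updated either
// way, which is the side effect that rules out a plain .forall.
val epsilon    = 1e-4
val centers    = Array(Array(0.0), Array(10.0))
val newCenters = Array(Array(0.05), Array(10.0))

var converged = true
newCenters.zipWithIndex.foreach { case (c, j) =>
  val dist = math.sqrt(c.zip(centers(j)).map { case (a, b) => (a - b) * (a - b) }.sum)
  if (dist > epsilon) converged = false
  centers(j) = c // update regardless of whether it moved
}
// center 0 moved by 0.05 > epsilon, so converged ends up false
```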

srowen (Member) commented Oct 10, 2016

@sethah OK, I will add a new overload of train and deprecate the others.

SparkQA commented Oct 10, 2016

Test build #66673 has finished for PR 15342 at commit 5cb9e5f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -531,6 +471,7 @@ object KMeans {
* "k-means||". (default: "k-means||")
*/
@Since("0.8.0")
@deprecated("Use train method without 'runs'", "2.1.0")
def train(

sethah (Contributor) commented Oct 10, 2016

This signature does not have a direct alternative without runs.

yanboliang (Contributor) commented Oct 11, 2016

+1 @sethah

SparkQA commented Oct 10, 2016

Test build #66684 has finished for PR 15342 at commit 84fb22f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
yanboliang (Contributor) commented Oct 11, 2016

Only the last two minor items remain; otherwise this looks ready to me. Thanks!

srowen (Member) commented Oct 11, 2016

Yeah, but now we have yet two more overloads. I had intended to point people to one new overload, but I guess it's weird to make people specify the seed arg. And default arguments, the normal solution, break binary compatibility, IIRC.
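The binary-compatibility point is about how Scala compiles default arguments (a hedged illustration with invented signatures, not Spark code): a default value becomes a synthetic method such as `train$default$3`, and adding a default parameter to an existing method changes that method's bytecode signature, while a new overload leaves the existing signatures untouched.

```scala
// Overload-based evolution (binary compatible): the old two-arg method
// keeps its exact bytecode shape, and a new overload adds the seed.
object TrainSketch {
  def train(data: Seq[Double], k: Int): Int = train(data, k, 0L)
  def train(data: Seq[Double], k: Int, seed: Long): Int = k // stand-in body
}
// By contrast, rewriting the old method as
//   def train(data: Seq[Double], k: Int, seed: Long = 0L): Int
// removes the two-arg method from the bytecode, so callers compiled
// against the old signature fail at link time.
```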

SparkQA commented Oct 11, 2016

Test build #66729 has finished for PR 15342 at commit ba52582.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
sethah (Contributor) commented Oct 11, 2016

LGTM

srowen (Member) commented Oct 12, 2016

Merged to master. I'm going to reopen a PR for just the duplicate centroids issue to re-table that.

@srowen srowen closed this Oct 12, 2016

@srowen srowen deleted the srowen:SPARK-11560 branch Oct 12, 2016

asfgit pushed a commit that referenced this pull request Oct 12, 2016

[SPARK-11560][MLLIB] Optimize KMeans implementation / remove 'runs'
## What changes were proposed in this pull request?

This is a revival of #14948 and related to #14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it.

This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15342 from srowen/SPARK-11560.

ThySinner pushed a commit to ThySinner/spark that referenced this pull request Oct 19, 2016: [SPARK-11560][MLLIB] Optimize KMeans implementation / remove 'runs'

uzadude added a commit to uzadude/spark that referenced this pull request Jan 27, 2017: [SPARK-11560][MLLIB] Optimize KMeans implementation / remove 'runs'