
[SPARK-11560] [MLLIB] Optimize KMeans implementation / remove 'runs' #15342

Closed
wants to merge 6 commits into apache:master from srowen:SPARK-11560

Conversation

@srowen (Member) commented Oct 4, 2016:

What changes were proposed in this pull request?

This is a revival of #14948 and related to #14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it.

This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return fewer than k centroids if not enough data is available.
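
For illustration, a minimal sketch of the API change (hedged: `sc` is an assumed SparkContext, and the exact deprecated overloads are those shown in the diff comments below):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0)))

    // Deprecated by this PR: the 'runs' argument (the 4th) is already a no-op.
    val oldStyle = KMeans.train(data, 2, 20, 5)

    // Preferred: the same call without 'runs'.
    val newStyle = KMeans.train(data, 2, 20)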

How was this patch tested?

Existing tests

@srowen (Member, Author) commented Oct 4, 2016:

CC @yanboliang

@SparkQA commented Oct 4, 2016:

Test build #66310 has finished for PR 15342 at commit cd14b65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  }.toArray)

  private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
    val sample = data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt())
    sample.map(v => new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm))
Contributor:

sample.map(_.toDense)?

}

  /**
-  * Initialize `runs` sets of cluster centers using the k-means|| algorithm by Bahmani et al.
+  * Initialize set of cluster centers using the k-means|| algorithm by Bahmani et al.
Contributor:

nit: "Initialize a set"

}

  /**
-  * Initialize `runs` sets of cluster centers using the k-means|| algorithm by Bahmani et al.
+  * Initialize set of cluster centers using the k-means|| algorithm by Bahmani et al.
   * (Bahmani et al., Scalable K-Means++, VLDB 2012). This is a variant of k-means++ that tries
   * to find with dissimilar cluster centers by starting with a random center and then doing
Contributor:

"to find with dissimilar" -> "to find dissimilar" (while we're here)

  // Initialize empty centers and point costs.
- val centers = Array.tabulate(runs)(r => ArrayBuffer.empty[VectorWithNorm])
- var costs = data.map(_ => Array.fill(runs)(Double.PositiveInfinity))
+ var costs = data.map(_ => Double.PositiveInfinity)

  // Initialize each run's first center to a random point.
Contributor:

"Initialize the first center to a random point."

@@ -558,6 +475,7 @@ object KMeans {
   * Trains a k-means model using specified parameters and the default values for unspecified.
   */
  @Since("0.8.0")
+ @deprecated("Use train method without 'runs'", "2.1.0")
Contributor:

There are two other train signatures that use runs, but have not been marked as deprecated.

Member Author (srowen):

Yes, though there's no alternative to those with the same arguments. We could add another overload and deprecate the others. I'm OK with that too; it just felt a little gross to add yet more.

Contributor:

I think we should add them for completeness, and deprecate all overloads using runs.

  costs.unpersist(blocking = false)
  bcNewCentersList.foreach(_.destroy(false))

- // Finally, we might have a set of more than k candidate centers for each run; weigh each
+ if (centers.size <= k) {
+   return centers.toArray
Contributor:

I prefer to avoid the return keyword and just put the other code under the else here. But it is a small preference.
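
That is, roughly (a sketch of the suggested restructuring, with the candidate-weighing logic elided):

    if (centers.size <= k) {
      centers.toArray
    } else {
      // weigh each candidate center and select k of them via local k-means++
    }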

- val costs = Array.fill(numRuns)(0.0)
- var activeRuns = new ArrayBuffer[Int] ++ (0 until numRuns)
+ var active = true
Contributor:

I find converged to be more intuitive, but not a strong preference.

@sethah (Contributor) commented Oct 4, 2016:

Looking good. My main concern is that now you can have the following:

scala> model.getK
res2: Int = 3

scala> model.clusterCenters.length
res3: Int = 1

We could set the model's k to match the number of cluster centers before creating the model, during training. Or we could leave it, but then what does k mean, if not the number of centers?

@srowen (Member, Author) commented Oct 5, 2016:

That's right. k seems like the requested number of centroids, which may not match the actual number in corner cases. What about just documenting that more?

@srowen (Member, Author) commented Oct 5, 2016:

Otherwise updated to reflect all the other review comments, thanks.

@SparkQA commented Oct 5, 2016:

Test build #66381 has finished for PR 15342 at commit ebbb852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) commented Oct 5, 2016:

I'd prefer to maintain the original logic that keeps model.clusterCenters.length equal to k. Was there some discussion motivating this change?
I checked the popular Python machine learning library scikit-learn: it returns the requested number of centroids even if k is greater than the number of distinct data points:
[screenshot: scikit-learn returning k centroids]

And R's kmeans throws an error if there are more cluster centers than distinct data points:
[screenshot: R kmeans error message]

@srowen (Member, Author) commented Oct 5, 2016:

This is what SPARK-3261 is about. It's a corner case, to be sure. To me it seems like having duplicate centroids is worse, because the model loses some of its meaning: points may be arbitrarily assigned to one or the other of two identical centroids. Of the 3 possible behaviors, it looks like we have all 3 on the table:

  1. error
  2. return < k centroids
  3. return k centroids

I suppose I prefer the new behavior, but I can't say I feel that strongly. I guess matching scikit-learn has some value.
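
To make the corner case concrete, a toy sketch (hypothetical, not the PR's code) of the three options with 2 distinct points and k = 3:

    import org.apache.spark.mllib.linalg.Vectors

    val points = Seq(Vectors.dense(1.0), Vectors.dense(1.0), Vectors.dense(2.0))
    val k = 3
    // 1. error: require(points.distinct.length >= k, "fewer distinct points than k")
    // 2. return fewer than k centroids (this PR's behavior before it was backed out):
    val deduped = points.distinct.take(k)   // length 2, i.e. fewer than k
    // 3. return k centroids, duplicates allowed (pre-PR / scikit-learn behavior):
    //    two identical centroids then compete arbitrarily for the same points.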

@sethah (Contributor) commented Oct 6, 2016:

What are the circumstances that lead to duplicate cluster centers, other than the obvious one of having less data than requested centers? The comment on the original JIRA said that training on 1.3M points while asking for 10k clusters returned only ~1k centers.

@srowen (Member, Author) commented Oct 6, 2016:

Good question. I think he's saying that it returned 1K centers after this change. It's a good point that this would also speed things up considerably, because computing the distance to duplicate centroids is all superfluous work.

@sethah (Contributor) commented Oct 7, 2016:

@srowen That is not the impression that I got from "I just ran clustering on 1.3M points, asking for 10,000 clusters. This clustering run resulted in 1019 unique cluster centers."

@derrickburns Can you clarify a bit here? Also, could you tell us the nature of the data that was used for your clustering?

@srowen (Member, Author) commented Oct 8, 2016:

I'm wondering: what's the use case for allowing duplicate centroids? It doesn't have a reasonable meaning, and it slows down execution. I don't feel that strongly about it, and I'd like to get the change to remove 'runs' in regardless, so I could back that out, but I'd be a little more convinced if it were more than just matching scikit-learn.

@srowen (Member, Author) commented Oct 8, 2016:

I backed out the change for SPARK-3261; that part is actually tiny and separable now anyway. We can discuss it here too, but I wanted to split it from the main change for expediency.

@srowen srowen changed the title [SPARK-11560] [SPARK-3261] [MLLIB] Optimize KMeans implementation / remove 'runs' / KMeans clusterer can return duplicate cluster centers [SPARK-11560] [MLLIB] Optimize KMeans implementation / remove 'runs' Oct 8, 2016
@SparkQA commented Oct 8, 2016:

Test build #66578 has finished for PR 15342 at commit 68e3d90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member, Author) commented Oct 10, 2016:

@sethah are you OK with this part? We can still talk about the k centroids bit, either here or on the JIRA.

@sethah (Contributor) commented Oct 10, 2016:

@srowen I will take a look shortly.


  /**
-  * Number of clusters to create (k).
+  * Number of clusters to create (k). Note that if the input has fewer than k elements,
+  * then it's possible that fewer than k clusters are created.
Contributor:

If we back out the change to avoid duplicate centroids, doesn't this annotation become invalid?

Member Author (srowen):

Oops, right.

      new VectorWithNorm(Vectors.dense(v.vector.toArray), v.norm)
    }.toArray)

  private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
    data.takeSample(true, k, new XORShiftRandom(this.seed).nextInt()).map(_.toDense)
Contributor:

Is it necessary to convert the vector to a dense one?

Contributor:

I guess not, but the centers become immediately dense in the first iteration of runAlgorithm.

Member Author (srowen):

At least, that's what the existing code did.

Contributor:

Maybe we can optimize this at #14937, since that will treat dense and sparse vectors differently.
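
For reference on what the conversion does, a sketch mirroring the private helper implied by the quoted diff lines (VectorWithNorm is package-private in Spark; this shape is inferred, not verbatim):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    class VectorWithNorm(val vector: Vector, val norm: Double) {
      def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
      // Copies the (possibly sparse) values into a dense array; the norm is reused.
      def toDense: VectorWithNorm =
        new VectorWithNorm(Vectors.dense(vector.toArray), norm)
    }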

  // On each step, sample 2 * k points on average for each run with probability proportional
  // to their squared distance from that run's centers. Note that only distances between points
  val centers = ArrayBuffer[VectorWithNorm]()
  var newCenters = Seq(sample.head.toDense)
Contributor (@yanboliang, Oct 10, 2016):

Maybe irrelevant to this PR, but why do we need to convert it to a dense vector?

Contributor:

I'm not sure this is performance critical, but centers ++= newCenters will be faster if newCenters is an Array instead of List.

Member Author (srowen):

Why is it faster if it's an Array instead of a Seq? Or am I getting the comments cross-wired?

Contributor:

ArrayBuffer.++= is optimized for IndexedSeq, but not for List.

Contributor:

The difference is probably negligible. I just thought we could use Array if there is no specific preference for using a List.

Member Author (srowen):

It's a Seq, but yeah, no specific preference. Where do you see that optimization, BTW? I just see it implemented for TraversableOnce, and that's what my IDE says it calls even when given an Array.
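
For reference, a small sketch of the fast path in question, assuming the Scala 2.11/2.12 collections library (where ArrayBuffer.++= pattern-matches on IndexedSeqLike):

    import scala.collection.mutable.ArrayBuffer

    val buf = ArrayBuffer.empty[Int]
    // An Array is wrapped as a WrappedArray (an IndexedSeq), so ++= can read its
    // length, grow the backing array once, and bulk-copy the elements.
    buf ++= Array(1, 2, 3)
    // A List has no cheap length, so ++= falls back to the generic
    // element-by-element append path.
    buf ++= List(4, 5, 6)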

- chosen.foreach { case (p, rs) =>
-   rs.foreach(newCenters(_) += p.toDense)
- }
+ newCenters = chosen.map(_.toDense)
Contributor:

Ditto.

  bcCenters.destroy(blocking = false)

  // Update the cluster centers and costs
  converged = true
Contributor:

I think changed would be more intuitive.

Member Author (srowen):

Hm, I thought the opposite flag, converged, was more intuitive. If you don't feel strongly about it, let's leave it, but if you'd moderately prefer changed then I don't mind. I think it's the same thing with the flag inverted.


  // Update the cluster centers and costs
  converged = true
  totalContribs.foreach { case (j, (sum, count)) =>
Contributor (@yanboliang, Oct 10, 2016):

Compared with the original code, foreach may be slower than a while loop if k is large.

Member Author (srowen):

Why is that? I'm aware that Scala for-comprehensions can desugar into something surprisingly expensive, but this seems clearer and about the same speed as a while loop.

Contributor:

In general, while is faster than foreach (which creates and calls an anonymous function), but I'd be surprised if it affected performance here, because we only run this once per iteration and the bulk of the cost will be the distributed computation.
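
A sketch of the two styles under discussion, assuming totalContribs is the per-iteration map from center index to (vector sum, count) suggested by the diff context:

    // Assumed shape for illustration:
    val totalContribs: Map[Int, (Array[Double], Long)] =
      Map(0 -> (Array(3.0, 3.0), 2L))

    // PR style: concise, but invokes a closure per element.
    totalContribs.foreach { case (j, (sum, count)) =>
      // ... update center j from (sum, count)
    }

    // Pre-PR style: a while loop avoids the closure, which can matter for very
    // large k, though this runs only once per iteration over at most k entries.
    val entries = totalContribs.toArray
    var i = 0
    while (i < entries.length) {
      val (j, (sum, count)) = entries(i)
      // ... update center j
      i += 1
    }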

Contributor (@sethah) left a review:

A few minor things, but otherwise LGTM.

  bcCenters.destroy(blocking = false)

  // Update the cluster centers and costs
  converged = true
Contributor:

Why don't we just leave converged false, and only change it to true inside the foreach?

Member Author (srowen):

Unless I'm overlooking some obviously nicer expression, I think the loop is going to work the same either way: you have to assume you terminate unless a distance proves otherwise, per iteration.

Contributor:

The logic is the same, yes, but it seems really strange to set something to false, then each iteration set it to true and then set it back to false under some condition. Why not leave it false and change it to true when the convergence criterion is met? This is basically a trivial detail, so only change it if you want. I'm fine either way.

Member Author (srowen):

I don't think it can be done the way you're suggesting; it's not just preference. Usually you could just set it with a nice simple .forall call, as you're suggesting, but here we also need the side effect of visiting each element. To do both, I think we have to 'unroll' the equivalent logic, and it amounts to this.
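
Concretely, a sketch of the 'unrolled' pattern being described (names and shapes are assumptions based on the diff context; Vectors.sqdist stands in for the private distance helper):

    import org.apache.spark.mllib.linalg.Vectors

    // Assumed stand-ins for the private KMeans state:
    val epsilon = 1e-4
    val centers = Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0))
    val totalContribs: Map[Int, (Array[Double], Long)] =
      Map(0 -> (Array(1.0, 1.0), 2L), 1 -> (Array(11.0, 9.0), 2L))

    var converged = true
    totalContribs.foreach { case (j, (sum, count)) =>
      var i = 0
      while (i < sum.length) { sum(i) /= count; i += 1 }  // mean of assigned points
      val newCenter = Vectors.dense(sum)
      // Updating centers(j) is a side effect needed for every j, which is why a
      // plain .forall over the contributions doesn't fit here.
      if (converged && Vectors.sqdist(newCenter, centers(j)) > epsilon * epsilon) {
        converged = false  // at least one center moved more than epsilon
      }
      centers(j) = newCenter
    }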

Contributor:

Yep, you're correct. Thanks!


@srowen (Member, Author) commented Oct 10, 2016:

@sethah OK I will add a new overload of train and deprecate the others.
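
For reference, a sketch of what such an overload plausibly looks like (the builder-style body is an assumption; the authoritative version is in the final diff):

    @Since("2.1.0")
    def train(
        data: RDD[Vector],
        k: Int,
        maxIterations: Int,
        initializationMode: String,
        seed: Long): KMeansModel = {
      new KMeans().setK(k)
        .setMaxIterations(maxIterations)
        .setInitializationMode(initializationMode)
        .setSeed(seed)
        .run(data)
    }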

@SparkQA commented Oct 10, 2016:

Test build #66673 has finished for PR 15342 at commit 5cb9e5f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

   * on system time.
   */
  @Since("2.1.0")
  def train(data: RDD[Vector],
Contributor:

minor: style should match other train signatures

Contributor:

+1 @sethah

@@ -531,6 +471,7 @@ object KMeans {
   * "k-means||". (default: "k-means||")
   */
  @Since("0.8.0")
+ @deprecated("Use train method without 'runs'", "2.1.0")
  def train(
Contributor:

This signature does not have a direct alternative without runs.

Contributor:

+1 @sethah

@SparkQA commented Oct 10, 2016:

Test build #66684 has finished for PR 15342 at commit 84fb22f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) commented:
Only the last two minor items; otherwise this looks ready to me. Thanks!

@srowen (Member, Author) commented Oct 11, 2016:

Yeah, but now we have yet two more overloads. I had intended to point people to one new overload, but I guess it's weird to make people specify the seed arg. And optional args, the normal solution, break binary compatibility IIRC.
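
A toy sketch of the binary-compatibility concern (hypothetical Api object; general Scala behavior, not anything in this PR):

    object Api {
      // v1 shipped these two overloads:
      def train(k: Int): String = train(k, 20)
      def train(k: Int, maxIterations: Int): String = s"model($k, $maxIterations)"
    }
    // Replacing both with a single defaulted method,
    //   def train(k: Int, maxIterations: Int = 20): String = ...
    // stays source-compatible, but the one-argument train(Int) symbol disappears
    // from the bytecode, so callers compiled against v1 hit NoSuchMethodError.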

@SparkQA commented Oct 11, 2016:

Test build #66729 has finished for PR 15342 at commit ba52582.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Oct 11, 2016:

LGTM

@srowen (Member, Author) commented Oct 12, 2016:

Merged to master. I'm going to reopen a PR for just the duplicate centroids issue to re-table that.

@srowen srowen closed this Oct 12, 2016
@srowen srowen deleted the SPARK-11560 branch October 12, 2016 09:02
asfgit pushed a commit that referenced this pull request Oct 12, 2016
Author: Sean Owen <sowen@cloudera.com>

Closes #15342 from srowen/SPARK-11560.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017