
[SPARK-22119][ML] Add cosine distance to KMeans #19340

Closed
wants to merge 16 commits

Conversation

mgaido91 (Contributor)

What changes were proposed in this pull request?

Currently, KMeans assumes that the only possible distance measure is the Euclidean one. This PR adds support for the cosine distance to the KMeans algorithm.
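For example, a minimal sketch of the intended usage from the ML API (the "cosine" value mirrors the constant discussed in the review below; the data and column names are illustrative):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cosine-kmeans").getOrCreate()
import spark.implicits._

// Toy data: with cosine distance, direction matters but magnitude doesn't,
// so (1.0, 1.0) and (10.0, 10.0) should land in the same cluster.
val data = Seq(
  Vectors.dense(1.0, 1.0),
  Vectors.dense(10.0, 10.0),
  Vectors.dense(1.0, 0.0),
  Vectors.dense(8.0, 0.5)
).map(Tuple1.apply).toDF("features")

val model = new KMeans()
  .setK(2)
  .setSeed(1L)
  .setDistanceMeasure("cosine") // proposed param; the default stays "euclidean"
  .fit(data)

model.clusterCenters.foreach(println)
```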

How was this patch tested?

Existing and newly added UTs.

srowen (Member) left a comment:

Not a bad idea but I'm not sure about the design direction here.

@@ -260,7 +269,8 @@ class KMeans @Since("1.5.0") (
     maxIter -> 20,
     initMode -> MLlibKMeans.K_MEANS_PARALLEL,
     initSteps -> 2,
-    tol -> 1e-4)
+    tol -> 1e-4,
+    distanceMeasure -> DistanceSuite.EUCLIDEAN)
Member:

"DistanceSuite" sounds like a test case, which you can't use here, but, looks like it's an object you added in non-test code. That's confusing.

Contributor Author:

Do you have any suggestion about what might be an appropriate name? Thanks.

@@ -71,6 +71,15 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
@Since("1.5.0")
def getInitMode: String = $(initMode)

@Since("2.3.0")
final val distanceMeasure = new Param[String](this, "distanceMeasure", "The distance measure. " +
Member:

Interesting question here -- what about supplying a function as an argument, for full generality? But then that doesn't translate to PySpark, I guess, and there are probably only 2-3 distance functions that ever make sense here.

mgaido91 (Contributor Author) commented Sep 25, 2017:

This would be hard for two main reasons:
1 - as I will explain later, even though in theory we would only need a function, in practice this is not true for performance reasons;
2 - saving and loading a function would be much harder (I'm not sure it would even be feasible).

@@ -40,20 +40,29 @@ import org.apache.spark.util.random.XORShiftRandom
* to it should be cached by the user.
*/
@Since("0.8.0")
-class KMeans private (
+class KMeans @Since("2.3.0") private (
Member:

If it's a private constructor, it shouldn't have API "Since" annotations. You don't need to preserve it for compatibility.

  }

  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
    distanceMeasure match {
      case DistanceSuite.EUCLIDEAN => true
Member:

You can use two labels in one statement if the result is the same; that might be clearer. Match is probably overkill anyway.

Contributor Author:

I just wanted to be consistent with the similar implementation three lines above. Doing the same thing in two different ways a few lines apart might be very confusing, IMHO.

}

@Since("2.3.0")
object DistanceSuite {
Member:

Why is this an API (and why so named)? This can all be internal.

Contributor Author:

About the name, if you have any better suggestion, I'd be happy to change it. Maybe DistanceMeasure?
This is not internal because it contains the definitions of the two constants which users might need to set the right distance measure.

/**
* Returns the index of the closest center to the given point, as well as the squared distance.
*/
def findClosest(
Member:

Why would something like this vary with the distance function? Finding the closest thing is the same for all definitions of closeness.

Contributor Author:

Even though you are right in theory, if you look at the implementation for the Euclidean distance, the current code contains an optimization which, for performance reasons, doesn't compute the real distance measure. Thus, dropping this method and introducing a more generic one would cause a performance regression for the Euclidean distance, which is something I'd definitely avoid.
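For context, a simplified sketch of the existing Euclidean shortcut being referred to (adapted from mllib's KMeans.findClosest; VectorWithNorm and the fastSquaredDistance helper are the existing ones in this file): by the triangle inequality, (||c|| - ||p||)^2 <= ||c - p||^2, so a cheap norm-based lower bound lets most centers be skipped without computing the full distance.

```scala
def findClosest(
    centers: TraversableOnce[VectorWithNorm],
    point: VectorWithNorm): (Int, Double) = {
  var bestDistance = Double.PositiveInfinity
  var bestIndex = 0
  var i = 0
  centers.foreach { center =>
    // Cheap lower bound on the squared distance: (||c|| - ||p||)^2.
    var lowerBound = center.norm - point.norm
    lowerBound *= lowerBound
    if (lowerBound < bestDistance) {
      // Only now pay for the exact squared distance.
      val d = fastSquaredDistance(center, point)
      if (d < bestDistance) {
        bestDistance = d
        bestIndex = i
      }
    }
    i += 1
  }
  (bestIndex, bestDistance)
}
```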

Member:

It seems like this should have a default implementation, then, that does the obvious thing.

mgaido91 (Contributor Author):

@yanboliang could you please take a look at this when you have time? Thanks.

mgaido91 changed the title [SPARK-22119] Add cosine distance to KMeans → [SPARK-22119][ML] Add cosine distance to KMeans on Sep 27, 2017
@@ -246,14 +271,16 @@ class KMeans private (

val initStartTime = System.nanoTime()

val distanceSuite = DistanceMeasure.decodeFromString(this.distanceMeasure)
Member:

Why is this called "suite"?

/**
* Returns the index of the closest center to the given point, as well as the squared distance.
*/
def findClosest(
Member:

It seems like this should have a default implementation, then, that does the obvious thing.

/**
* Returns whether a center converged or not, given the epsilon parameter.
*/
def isCenterConverged(
Member:

Likewise this always seems to be "distance < epsilon"; does it ever vary?

Contributor Author:

It's not for the EuclideanDistance: in that case, for performance reasons, the check is distance^2 < epsilon^2.

Anyway, as you suggested for findClosest, I refactored the base class to have a default implementation based on the distance method I introduced. Then, for the Euclidean distance, I am overriding all the methods with the more efficient implementations. This looks like the best and cleanest approach to me, since it allows adding more distance measures by implementing only the distance method, as the current CosineDistance implementation does.

Thank you for your comments. Please, when you have time take a look at the new structure and let me know if it looks good to you now.

Thanks.
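Roughly, the refactored structure described above might look like the following (an illustrative sketch, not the exact committed code; VectorWithNorm is the existing helper class, and dot stands for a BLAS-style dot product):

```scala
private[spark] abstract class DistanceMeasure extends Serializable {

  /** The only method a new distance measure has to implement. */
  def distance(v1: VectorWithNorm, v2: VectorWithNorm): Double

  /** Default implementation: a plain linear scan using `distance`. */
  def findClosest(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      val d = distance(center, point)
      if (d < bestDistance) {
        bestDistance = d
        bestIndex = i
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }

  /** Default convergence check, in terms of the plain distance. */
  def isCenterConverged(
      oldCenter: VectorWithNorm,
      newCenter: VectorWithNorm,
      epsilon: Double): Boolean = {
    distance(oldCenter, newCenter) <= epsilon
  }
}

private[spark] class CosineDistanceMeasure extends DistanceMeasure {
  // 1 - cos(angle between v1 and v2); assumes non-zero norms.
  override def distance(v1: VectorWithNorm, v2: VectorWithNorm): Double = {
    1 - dot(v1.vector, v2.vector) / v1.norm / v2.norm
  }
}

// EuclideanDistanceMeasure then overrides findClosest and isCenterConverged
// with the optimized implementations discussed above.
```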

mgaido91 (Contributor Author) commented Oct 6, 2017:

A kind reminder to @srowen and @yanboliang: could you take a look at this when you have time? Thanks.

srowen (Member) commented Oct 6, 2017:

I'm kind of neutral given the complexity of adding this, but maybe it's the least complexity you can get away with. @hhbyyh was adding something related: https://issues.apache.org/jira/browse/SPARK-22195

mgaido91 (Contributor Author) commented Oct 6, 2017:

Thanks for your reply @srowen. I saw it. My feeling is that so far there is no distance metric defined on Vectors. If we add the cosine distance there, then we should add the Euclidean one too. Do you have any suggestion about how to move this PR forward, then?
Thank you.

SparkQA commented Oct 7, 2017:

Test build #3945 has finished for PR 19340 at commit 1a5acdb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

mgaido91 (Contributor Author):

@mengxr @yu-iskw, sorry for pinging you. I saw from the commits that you contributed to KMeans; could you please help review this PR?
Thanks.

mgaido91 (Contributor Author):

kindly pinging @yanboliang

Kevin-Ferret:
@mgaido91 I actually needed something like this recently and stumbled upon your PR (and the JIRA, which I unfortunately cannot update).
Your approach looks good to me, but I was wondering: are you keeping the existing code to find the new centers, in the sense of "newCenter = the arithmetic mean of all points in the cluster"? (I couldn't find any git diff on this piece: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L291-L298)
This will lead to misleading results (your centroids won't minimize the within-cluster cosine distance anymore) unless you make sure to normalize all your input vectors AND your newly computed centroids (i.e. spherical k-means).
I might be wrong/have misread the code; forgive me if that's the case. If you want to discuss further, what's the best place (assuming you don't want this PR derailed by my blabbering)?

mgaido91 (Contributor Author):

Hi @Kevin-Ferret, thanks for looking at this. Yes, you are right, I have not changed the method for updating the centroids. The current method seems to me the most widely adopted one for cosine similarity too. Indeed, the same approach is used in RapidMiner (https://docs.rapidminer.com/latest/studio/operators/modeling/segmentation/k_means.html) and also in this paper (https://s3.amazonaws.com/academia.edu.documents/32952068/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1513706450&Signature=MFPcahadw35IpP2o0v%2F51xW7KOM%3D&response-content-disposition=inline%3B%20filename%3DSimilarity_Measures_for_Text_Document_Cl.pdf).
I think this is the right place to discuss it, since it is related to the implementation I am proposing in the PR.
Thanks.

mgaido91 (Contributor Author):

@srowen this has been stuck for a while now. Nobody so far has been able to provide a "less complex" proposal. I have tried to ping all the people I was aware of who might be able to help. Do you have any suggestion on how to proceed? Thanks.

srowen (Member) left a comment:

Ideally @jkbradley could look at this, or @MLnick, as they are closer to this part, but it's looking good to me.


}

@Since("2.3.0")
Member:

All the "2.3.0" would likely have to change. I don't know if this would get in for 2.3.0.

Contributor Author:

Yes. Any idea which version I should target here?

Member:

I'd default to 2.4.0

object DistanceMeasure {

@Since("2.3.0")
val EUCLIDEAN = "euclidean"
Member:

Ideally we'd use an enum for this but I don't think Scala's enums are encouraged, and probably not worth involving Java enums.
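For reference, a sketch of how the string constants could still be validated without an enum, using ParamValidators.inArray as other ml string params do (trait and names are illustrative):

```scala
import org.apache.spark.ml.param.{Param, Params, ParamValidators}

private[clustering] trait HasDistanceMeasure extends Params {
  // String-valued param restricted to the supported constants.
  final val distanceMeasure: Param[String] = new Param[String](this,
    "distanceMeasure",
    "The distance measure. Supported options: 'euclidean' and 'cosine'.",
    ParamValidators.inArray(Array("euclidean", "cosine")))

  def getDistanceMeasure: String = $(distanceMeasure)
}
```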

@@ -149,4 +173,38 @@ object KMeansModel extends Loader[KMeansModel] {
new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
}
}
object SaveLoadV2_0 {
Member:

Nit: blank line before?

}

/**
* Returns the K-means cost of a given point against the given cluster centers.
Member:

Nit: might make this @return in order to get it to render in docs as the return documentation

  def pointCost(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): Double =
    findClosest(centers, point)._2
Member:

Another nit, might put braces around these one-line functions just for clarity.

oldCenter: VectorWithNorm,
newCenter: VectorWithNorm,
epsilon: Double): Boolean = {
EuclideanDistanceMeasure.fastSquaredDistance(newCenter, oldCenter) <= epsilon * epsilon
Member:

Do we need to override default isCenterConverged here? Seems to me it is equal to the default one.

mgaido91 (Contributor Author) commented Jan 14, 2018:

It is not: here we compare against epsilon * epsilon using the squared distance, in order to avoid computing the square root, which is an expensive operation.

Member:

Add sqrt to both sides?

Math.sqrt(EuclideanDistanceMeasure.fastSquaredDistance(newCenter, oldCenter)) <= Math.sqrt(epsilon * epsilon) = epsilon

The left-hand side is just the overridden distance, isn't it?

Contributor Author:

Using sqrt would introduce a performance regression. This is the reason why I can't use only a function to differentiate the two distance measures: the implementation for the Euclidean distance is highly optimized, and this is one of its optimizations. Avoiding sqrt can be a great performance improvement, since it is an expensive operation.

private[spark] abstract class DistanceMeasure extends Serializable {

/**
* @return the index of the closest center to the given point, as well as the squared distance.
Member:

It doesn't always return the squared distance now.

viirya (Member) commented Jan 14, 2018:

Based on some discussion I could quickly find, I am not sure we can support cosine distance by just replacing the distance function:

https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric

mgaido91 (Contributor Author):

@srowen thank you for pointing out my style issue. Addressed, thanks.

viirya (Member) commented Jan 19, 2018:

That link also mentions that Matlab allows cosine distance: http://www.mathworks.com/help/stats/kmeans.html?s_tid=gn_loc_drop

The linked Matlab doc explicitly describes how it computes cluster centroids differently for the different supported distance measures. For cosine distance, the points are normalized to unit length before the mean is computed. In this respect, Matlab's approach seems more comprehensive than RapidMiner's, which only takes the mean of the points without normalization.

I quickly looked at Spark's KMeans implementation; it looks like we currently also compute the centroids as the mean of the points without normalization.

I'm not sure whether this can be an issue in practical usage of KMeans and affect its results or correctness. If we don't want to update centroids differently for different distance measures, I think we should at least clarify it in the documentation to warn users.

mgaido91 (Contributor Author):

@viirya yes, you're right in your analysis. Where in the docs should we put this?

@srowen, if you think this is OK, could you start a build? Thanks.

SparkQA commented Jan 19, 2018:

Test build #4064 has finished for PR 19340 at commit 5ed87ea.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* A vector with its norm for fast distance computation.
*
* @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
Member:

This seems to fail the doc build for some reason. You can just remove it.

mgaido91 (Contributor Author):

Jenkins, retest this please

mgaido91 (Contributor Author):

@srowen sorry, I don't know why, but it seems that I cannot start new Jenkins jobs for this PR... Could you white-list it or trigger a new test, please? Thanks.

srowen (Member) commented Jan 19, 2018:

I think it may not be responding now, for whatever reason. I use https://spark-prs.appspot.com/ to view and trigger tests.

mgaido91 (Contributor Author):

Thanks, I didn't know it existed.

SparkQA commented Jan 19, 2018:

Test build #4065 has finished for PR 19340 at commit fda93ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) commented Jan 21, 2018:

Merged to master

asfgit closed this in 4f43d27 on Jan 21, 2018
zhengruifeng (Contributor):

@mgaido91 @srowen I have the same concern as @Kevin-Ferret and @viirya.
I don't see any normalization of the vectors before training, and the update of the centers seems incorrect.
The arithmetic mean of all points in the cluster is not automatically the new cluster center:
For EUCLIDEAN distance, we need to update the center to minimize the squared loss, and the arithmetic mean is the closed-form solution;
For COSINE similarity, we need to update the center to maximize the cosine similarity, and the solution is also the arithmetic mean only if all vectors are of unit length.

In Matlab's doc for KMeans, it says: "One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length."

I think RapidMiner's implementation of KMeans with cosine similarity is wrong, if it just assigns the new center to the arithmetic mean.

Some references:
Spherical k-Means Clustering

Scikit-Learn's example: Clustering text documents using k-means

https://stats.stackexchange.com/questions/299013/cosine-distance-as-similarity-measure-in-kmeans

https://www.quora.com/How-can-I-use-cosine-similarity-in-clustering-For-example-K-means-clustering

mgaido91 (Contributor Author) commented Jan 30, 2018:

@Kevin-Ferret pointed out that both the input and the centers should be normalized to unit Euclidean length. Citing you, @zhengruifeng:

the solution is also the arithmetic mean only if all vectors are of unit length.

Therefore ensuring convergence means that the input dataset should contain unit-length vectors, but this should be done by the user. I think we can either add a comment in the documentation or add a check and a WARN, but the latter has a performance impact.
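For illustration, the user-side normalization could be as simple as this sketch using the existing ml Normalizer (column names are made up):

```scala
import org.apache.spark.ml.feature.Normalizer

// L2-normalize the feature vectors so that the arithmetic-mean centroid
// update is also optimal for cosine distance (spherical k-means).
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)
val normalized = normalizer.transform(data) // data: DataFrame with a "features" column
```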

srowen (Member) commented Jan 30, 2018:

I think you could reasonably define it either way; depends on how much you think the cluster center is always defined as the mean (in "k-means") regardless of distance function, or not.

However I think I'm more sympathetic now to defining the center as the point that minimizes intra-cluster distance, which isn't quite the same thing. In that case yes you must normalize the inputs in order for Euclidean distance and cosine distance to match up.

Yeah, you could tell the user that she can basically choose this behavior by normalizing or not. But I now believe that's more potential for surprise than a useful choice. So yes, I'd also support going back and normalizing the inputs in all cases here when cosine distance is used.

zhengruifeng (Contributor):

The updating of the centers should be viewed as the M-step in an EM algorithm, in which some objective is optimized.

Since cosine similarity does not take the vector norm into account:

  1. the optimal solution for the normalized points (V) should also be optimal for the original points;
  2. any scaled solution (k*V, k>0) is also optimal for both the normalized points and the original points.

If we want to optimize intra-cluster cosine similarity (like Matlab), then the arithmetic mean of the normalized points should be a better solution than the arithmetic mean of the original points.

Suppose two 2D points (x=0,y=1) and (x=100,y=0):

  1. If we choose the arithmetic mean (x=50,y=0.5) as the center, the sum of cosine similarities is about 1.0;
  2. If we choose the arithmetic mean of the normalized points (x=0.5,y=0.5), the sum of cosine similarities is about 1.414;
  3. this center can then be normalized for computational convenience in the following assignment (E-step) or prediction.

Since VectorWithNorm is used as the input, the norms of the vectors are already computed, so I think we only need to update this line to:

if (point.norm > 0) {
  axpy(1.0 / point.norm, point.vector, sum)
}
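A quick plain-Scala check of the 2D example above (illustrative only):

```scala
def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)
def cos(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum / (norm(a) * norm(b))

val points = Seq(Array(0.0, 1.0), Array(100.0, 0.0))
val rawMean  = Array(50.0, 0.5) // mean of the original points
val normMean = Array(0.5, 0.5)  // mean of the L2-normalized points

println(points.map(p => cos(p, rawMean)).sum)  // ~1.01
println(points.map(p => cos(p, normMean)).sum) // ~1.414
```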

mgaido91 (Contributor Author):

@zhengruifeng I agree with you, but then we should also normalize the center points, since the user can access them and would therefore expect them to be unit-length vectors. WDYT?

srowen (Member) commented Jan 31, 2018:

@zhengruifeng yes, I understand why the solutions aren't the same, though it depends on whether you think that's what k-means is supposed to do or not. We're not actually maximizing an expectation here, but this is just semantics, and I agree with you.

srowen (Member) commented Feb 6, 2018:

@mgaido91 what do you think is the right follow-up here? As in your comment just above?

zhengruifeng (Contributor):

@mgaido91 agreed, it is better to normalize the centers.

mgaido91 (Contributor Author) commented Feb 6, 2018:

@srowen honestly, I don't think we should change the current implementation. RapidMiner, ELKI and nltk work like this. Matlab instead works differently, and does what is suggested by @Kevin-Ferret and @zhengruifeng.

Anyway, it looks like a majority (@viirya, @Kevin-Ferret, @zhengruifeng) think that the other solution is better. So I think that if we change it, we should basically apply the change suggested by @zhengruifeng plus the normalization of the centers; otherwise we would end up with a hybrid and unclear solution.

I can submit a follow-up PR with this second solution, and maybe we can continue the discussion there. What do you think?

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 12, 2018
## What changes were proposed in this pull request?

In apache#19340, some comments argued that spherical KMeans should be used when the cosine distance measure is specified, as Matlab does, instead of an implementation based on the behavior of other tools/libraries like RapidMiner, nltk and ELKI, i.e. computing the centroids as the mean of all the points in the clusters.

This PR introduces the approach used in spherical KMeans. This behavior has the nice property of minimizing the within-cluster cosine distance.

## How was this patch tested?

Existing/improved UTs.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20518 from mgaido91/SPARK-22119_followup.
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018