
[SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette. #18538

Closed · wants to merge 18 commits

Conversation

@mgaido91 (Contributor) commented Jul 5, 2017

What changes were proposed in this pull request?

This PR adds ClusteringEvaluator, an Evaluator providing two metrics:

  • cosineSilhouette: the Silhouette measure using the cosine distance;
  • squaredSilhouette: the Silhouette measure using the squared Euclidean distance.

The implementation of the two metrics follows the algorithm proposed and explained here. These algorithms were designed for a distributed and parallel environment, so they achieve reasonable performance, unlike a naive Silhouette implementation that follows the definition directly.
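For reference, the Silhouette Coefficient of a point $i$ that both metrics estimate is the standard definition (the same formula appears in the API docs reviewed below):

    $$
    s_{i} = \frac{b_{i} - a_{i}}{\max\{a_{i}, b_{i}\}}
    $$

where $a_{i}$ is the average dissimilarity of $i$ to the other points of its own cluster and $b_{i}$ is the lowest average dissimilarity of $i$ to any cluster of which $i$ is not a member.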

How was this patch tested?

The patch has been tested with the additional unit tests added (comparing the results with those produced by the Python sklearn library).

@yanboliang (Contributor):

ok to test

testDefaultReadWrite(evaluator)
}

test("squared euclidean Silhouette") {
Contributor:
Could you add Python code which can help to reproduce the result in scikit-learn, like we did in other algorithms?

Contributor Author:
Thanks for the reference, I have added it.

@yanboliang (Contributor):
@gatorsmile Could you help to trigger the test job? It seems I can't do it now. Thanks.

@gatorsmile (Member):
test this please

@gatorsmile (Member):
ok to test

@gatorsmile (Member):
retest this please

@SparkQA commented Aug 5, 2017

Test build #80281 has finished for PR 18538 at commit cfcb106.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 5, 2017

Test build #80285 has finished for PR 18538 at commit 923418a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment:

@mgaido91 Could you refactor the code as I suggested? It should be more succinct and efficient. And try to organize all your code in ClusteringEvaluator. Any questions, feel free to let me know. Thanks.

private[this] def computeCosineSilhouette(dataset: Dataset[_]): Double = {
CosineSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)

val computeCsi = dataset.sparkSession.udf.register("computeCsi",
Contributor:
Could we use more descriptive name? We can't get what this function does from its name.

count("*").alias("count"),
sum("csi").alias("psi"),
Yudaf(col(featuresCol)).alias("y")
)
Contributor:
Aggregate function performance is not ideal for columns of non-primitive types (like the vector type here), so we would still use an RDD-based aggregate. You can refactor this part of the code following NaiveBayes, like:

    import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}
    import org.apache.spark.sql.functions._

    val numFeatures = ...
    // Squared L2 norm of the features vector.
    val squaredNorm = udf { features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0) }

    df.select(col(predictionCol), col(featuresCol))
      .withColumn("squaredNorm", squaredNorm(col(featuresCol)))
      .rdd
      .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) }
      .aggregateByKey[(DenseVector, Double)]((Vectors.zeros(numFeatures).toDense, 0.0))(
        seqOp = {
          case ((featureSum: DenseVector, squaredNormSum: Double), (features, squaredNorm)) =>
            // Accumulate the element-wise feature sum and the squared-norm sum per cluster.
            BLAS.axpy(1.0, features, featureSum)
            (featureSum, squaredNormSum + squaredNorm)
        },
        combOp = {
          case ((featureSum1, squaredNormSum1), (featureSum2, squaredNormSum2)) =>
            BLAS.axpy(1.0, featureSum2, featureSum1)
            (featureSum1, squaredNormSum1 + squaredNormSum2)
        }).collect()

In my suggestion, you can compute csi and y in a single data pass, which should be more efficient.


case class ClusterStats(Y: Vector, psi: Double, count: Long)

def computeCsi(vector: Vector): Double = {
Contributor:
Can we use Vectors.norm(vector, 2.0)? It should be more efficient for both dense and sparse vectors. Actually, we can remove this function entirely if you refactor the code as I suggested below.
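For example, a minimal sketch of what this would look like (assuming a features: Vector is in scope; an illustration, not the final code):

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // Squared L2 norm via the vector norm; efficient for both dense and sparse vectors.
    def squaredNorm(features: Vector): Double =
      math.pow(Vectors.norm(features, 2.0), 2.0)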

.agg(
count("*").alias("count"),
sum("csi").alias("psi"),
Yudaf(col(featuresCol)).alias("y")
Contributor:
Please rename csi to squaredNorm, psi to squaredNormSum, and y to featureSum, if I haven't misunderstood. We should use more descriptive names.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

private[evaluation] object CosineSilhouette {
Contributor:
There are no clustering algorithms currently using distance metrics other than the squared Euclidean distance. I'd suggest removing the CosineSilhouette implementation first; we can add it back when it's needed. This also makes this PR easier to review.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, sum}

private[evaluation] object SquaredEuclideanSilhouette {
Contributor:
Let's move this to file ClusteringEvaluator.

@SparkQA commented Aug 9, 2017

Test build #80453 has finished for PR 18538 at commit ffc17f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor Author) commented Aug 9, 2017

@yanboliang thanks for your review.
I refactored the code according to your suggestions and I removed the cosine implementation.
Could you please review it now?
Thanks.

.toMap
}

def computeSquaredSilhouetteCoefficient(
Contributor:
I'd suggest renaming computeSquaredSilhouetteCoefficient to computeSilhouetteCoefficient: since this function is already inside SquaredEuclideanSilhouette, it isn't necessary to highlight SquaredEuclidean. What do you think?


}

def computeSquaredSilhouette(dataset: Dataset[_],
Contributor:
Ditto: rename computeSquaredSilhouette to computeSilhouetteScore, which makes it clearer to users that this is the silhouette score. Meanwhile, could you add a doc comment for this function like the following?

/**
 * Compute the mean Silhouette Coefficient of all samples.
 */

}

def computeSquaredSilhouette(dataset: Dataset[_],
predictionCol: String,
Contributor:
The indentation should be four spaces in this and the following lines.


val squaredNorm = udf {
features: Vector =>
math.pow(Vectors.norm(features, 2.0), 2.0)
Contributor:
Move this line to above.


val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)

val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
Contributor:
What do you think about renaming computeSilhouette to computeSilhouetteCoefficientUDF?

Contributor:
Why not follow the same approach used above to create the squaredNorm udf?

* in this document</a>.
*/
@Experimental
class ClusteringEvaluator (val uid: String)
Contributor:
Add @Since("2.3.0") here and in the other places where necessary.

class ClusteringEvaluator (val uid: String)
extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {

def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
Contributor:
SquaredEuclideanSilhouette -> cluEval

* @group param
*/
val metricName: Param[String] = {
val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
Contributor:
squaredSilhouette -> silhouette? If we support other distances like cosine, the metric name should stay the same; the distance metric should be controlled by another param.

Contributor Author:
Should I then introduce a new param for the distance metric? I think it is important to highlight that the distance measure used is the squared Euclidean distance, because otherwise anybody would assume the plain Euclidean distance is used, IMHO.

Contributor:
Yeah, I think we can add a new param for the distance metric in the future. As MLlib only supports the squared Euclidean distance, we can skip this param for now and add a note in the API docs to clarify it. You can check MLlib KMeans: there is no param to set the distance metric. cc @jkbradley @MLnick @hhbyyh @zhengruifeng

Contributor:
Yes, the idea often crosses my mind. Even though there's a claim that K-Means is for Euclidean distances only, I often see people needing custom distance computation in practice, so I would like to see KMeans support it.


}

private[evaluation] object SquaredEuclideanSilhouette {
Contributor:
It's better to have some documentation explaining how we compute the Silhouette Coefficient with this efficient distributed implementation. You can refer to what we did in LogisticRegression.

Contributor Author:
I included the link to the design document here: https://github.com/mgaido91/spark/blob/ffc17f929dd86d1e7e73931eac5663bc08b6ba7a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala#L37. Should I move it from there? Or should I rewrite the content of the document in an annotation here? Thanks!

Contributor:
Usually we paste the formulas here to explain how we compute the Silhouette Coefficient with the efficient distributed implementation. Because your design document is not a publication, I think we need to move the content here, but you can simplify it.


import testImplicits._

val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
Contributor:
It's good to have this to verify the correctness of your implementation, but usually we don't hard-code so much data for a test. Could you try to find existing data in KMeansSuite or GaussianMixtureSuite for testing? If hard-coding is necessary, please try to use a small dataset.

Contributor Author:
Unfortunately KMeansSuite and GaussianMixtureSuite use randomly generated data, so it is not possible to know the expected Silhouette value in advance. What if I move the data to a resource file and read it?

Contributor:
I think we can't put test data in a resource file, as resource files will be packaged into the final jar and make it bigger. What about randomly generating some small data (10-20 samples), getting the sklearn output in Python, and hard-coding them here? Just like what we did in GaussianMixtureSuite.

Contributor:
@mgaido91 You can set a seed to control the randomly generated data.

Contributor Author:
Sorry, but I don't understand your point. Resources in the test scope are not included in the compiled jars. The same approach is used in the sql component, for instance, where the test data is in the resources (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data).
If I randomly generate test data, I first have to perform a clustering on those points, while with this dataset the clustering result is already available. I am not sure that is the best approach, but maybe I am missing something. Can you please clarify this for me?

Contributor:
@mgaido91 Sorry, I mistakenly thought you meant the src resources rather than the test resources. Usually we generate datasets to verify MLlib results; until now we have never put existing datasets in the resources, even in the test scope, because we use different datasets to verify different algorithms, which could lead to a large amount of data being added. But the iris dataset is so popular and can be used to verify lots of algorithms, so I'm OK with putting it there. Thanks.

@yanboliang (Contributor):
@mgaido91 I made another pass and left some comments, mainly about naming and documentation. This looks in good shape now. I'd suggest following the naming in sklearn, which should be easy to understand for both developers and users. Thanks for this great contribution.

* Evaluator for clustering results.
* At the moment, the supported metrics are:
* squaredSilhouette: silhouette measure using the squared Euclidean distance;
* cosineSilhouette: silhouette measure using the cosine distance.
Contributor Author:
I forgot to remove this line; I am removing it now.

* squaredSilhouette: silhouette measure using the squared Euclidean distance;
* cosineSilhouette: silhouette measure using the cosine distance.
* The implementation follows the proposal explained
* <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
Contributor:
It may be better to refer to the wiki and explain your method in ml-clustering.md.

@mgaido91 (Contributor Author) commented Aug 17, 2017:
@zhengruifeng I can't see any of the other evaluators in the wiki, and I don't see a detailed explanation of the maths behind the algorithms either, so I am not sure it is the best place.

SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)

val metric: Double = $(metricName) match {
Contributor:
If only Euclidean is supported for now, val metric and the match are not needed here; directly return SquaredEuclideanSilhouette.computeSquaredSilhouette...

(clusterCurrentPoint.numOfPoints - 1)
}

var silhouetteCoeff = 0.0
Contributor:
What about changing this to

if (clusterSil < minOther) {
  1 - clusterSil / minOther
} else if (clusterSil > minOther) {
  minOther / clusterSil - 1
} else {
  0.0
}

}
}
silhouetteCoeff

Contributor:
remove empty line

}
metric
}

Contributor:
remove empty line, and otherwise


val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)

val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
Contributor:
Why not follow the same approach used above to create the squaredNorm udf?

computeSquaredSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Int, _: Double)
)

val squaredSilhouetteDF = dfWithSquaredNorm
Contributor:
Use select(avg(computeSilhouette(...)))
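A minimal sketch of what I mean, assuming dfWithSquaredNorm still carries the featuresCol, predictionCol and "squaredNorm" columns and that the registered UDF is named computeSilhouetteCoefficientUDF as discussed above:

    import org.apache.spark.sql.functions.{avg, col}

    // Average the per-point Silhouette Coefficient in a single select.
    val silhouetteScore = dfWithSquaredNorm
      .select(avg(computeSilhouetteCoefficientUDF(
        col(featuresCol), col(predictionCol), col("squaredNorm"))))
      .collect()(0)
      .getDouble(0)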


import testImplicits._

val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
Contributor:
@mgaido91 You can set a seed to control the randomly generated data.

}

/*
Use the following python code to load the data and evaluate it using scikit-learn package.
Contributor:
You should add the expected output of your Python code (refer to FPGrowthSuite.scala), and mind the indentation.

@SparkQA commented Aug 18, 2017

Test build #80860 has finished for PR 18538 at commit a4ca3cd.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 19, 2017

Test build #80862 has finished for PR 18538 at commit a7db896.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val irisCsvPath = Thread.currentThread()
.getContextClassLoader
.getResource("test-data/iris.csv")
.toString
Contributor:
So this test suite references another test data file. Can we generate the test data in the code, like other test suites do?

Contributor Author:
There was a discussion about this in the outdated comments. The main reason to avoid test data generation, in my view, is that the generated data must be clustered before running the Silhouette.
The iris dataset is well known and already contains clustered data, so it seemed the best option.

*
* <blockquote>
* s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
* </blockquote>
Contributor:
The LaTeX formula should be surrounded by $$; change it here and in the other places as:

<blockquote>
    $$
    s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    $$
</blockquote>

* distance measure.
*
* With this assumption, the average of the distance of the point `X`
* to the points `C_{i}` belonging to the cluster `\Gamma` is:
Contributor:
C_{i} -> $C_{i}$; otherwise the doc can't be generated correctly. Change it here and in the other places.

* <blockquote>
* s_{i}=\left\{ \begin{tabular}{cc}
* $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
* $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
Contributor:
There is a syntax error in this LaTeX formula; I checked the generated doc and found it doesn't render correctly. You can also paste the formula into http://www.hostmath.com/ to check.
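For example, one form that should render correctly (a suggestion only; please double-check it against the generated doc):

    $$
    s_{i}=\begin{cases}
    1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} > b_{i}
    \end{cases}
    $$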

Contributor Author:
Thanks @yanboliang, could you please tell me how to check the generated doc? Thank you!

Contributor:
1. Remove private[evaluation] from object SquaredEuclideanSilhouette. We only generate docs for public APIs; the docs of private APIs are for developers to understand the code.
2. cd docs
3. Run jekyll build
4. Then you can find the API docs under docs/_site/api/scala/index.html; try searching for SquaredEuclideanSilhouette.

Contributor Author:
Thank you! You're always kind. Just fixed everything, thanks.

* thus we can name it `\Psi_{\Gamma}`
*
* <blockquote>
* sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
Contributor:
Ditto, there is a syntax error in this LaTeX formula.

*
* The implementation follows the proposal explained
* <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
* in this document</a>.
Contributor:
BTW, we have the necessary docs on object SquaredEuclideanSilhouette to explain the proposed algorithm, so we can remove this. Usually we only refer to public publications.

@SparkQA commented Sep 4, 2017

Test build #81369 has finished for PR 18538 at commit 9abe9e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor) left a comment:
LGTM. Thanks!

*
* where `$a_{i}$` is the average dissimilarity of `i` with all other data
* within the same cluster, `$b_{i}$` is the lowest average dissimilarity
* of to any other cluster, of which `i` is not a member.
Contributor:
of to -> of `i` to

*
* <blockquote>
* $$
* \sum\limits_{i=1}^N d(X, C_{i} )^2 =
Contributor:
I'd suggest changing d(X, C_{i})^2 to d(X, C_{i}): as we don't define d() as the Euclidean distance, we can regard it as the squared Euclidean distance. What do you think?

Contributor Author:
yes, you are right, thanks.

* \sum\limits_{i=1}^N d(X, C_{i} )^2 =
* \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
* = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
* \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
Contributor:
x_{i}c_{ij} -> x_{ij}c_{ij}? Since x_{i} is a vector and c_{ij} is a double, here we compute dot product.

Contributor Author:
No, x_{i} is not a vector. X is a vector (which represents a point); x_{i} is a typo that I am fixing to x_{j}, which is a scalar, not a vector.

* \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
* = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
* \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
* -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
Contributor:
Ditto, x_{i}c_{ij} -> x_{ij}c_{ij}.
BTW, could you also check this issue in the following description? Thanks.

Contributor Author:
As above, I am checking for the typo everywhere, thanks for pointing it out.
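For clarity, the corrected chain should read as follows, using the naming from the rest of the doc, if I read it correctly ($\xi_{X} = \sum_{j=1}^D x_{j}^2$ is the squared norm of X, $\Psi_{\Gamma}$ is the sum of the squared norms of the cluster's points, and $Y_{\Gamma j} = \sum_{i=1}^N c_{ij}$ is the j-th component of their element-wise sum):

    $$
    \sum\limits_{i=1}^N d(X, C_{i}) =
    \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 -
    2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    N \xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D x_{j} Y_{\Gamma j}
    $$

so the per-cluster statistics N, $\Psi_{\Gamma}$ and $Y_{\Gamma}$ are all that is needed to evaluate the distance of a point to a whole cluster.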

* and parallel implementation of the Silhouette using the squared Euclidean
* distance measure.
*
* With this assumption, the average of the distance of the point `X`
Contributor:
the average of the distance of the point -> the total distance of the point? Should it be the total distance rather than the average distance?

}


object ClusteringEvaluator
Contributor:
@Since("2.3.0")

object ClusteringEvaluator
extends DefaultParamsReadable[ClusteringEvaluator] {

override def load(path: String): ClusteringEvaluator = super.load(path)
Contributor:
@Since("2.3.0")

.setFeaturesCol("features")
.setPredictionCol("label")

assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
Contributor:
Checking with a tolerance of 1e-5 is good enough.

splits(splits.length-1).toInt
)
}
.toDF()
Contributor:
Can we store the test data in libsvm format rather than csv? Then we can use spark.read.format("libsvm").load(irisPath) to load it into a DataFrame with two columns: features and label.
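A rough sketch of the suggested loading, assuming the data is converted to a hypothetical test-data/iris.libsvm resource (the file name is illustrative):

    // Load the libsvm-formatted iris data into a DataFrame with "features" and "label" columns.
    val irisPath = Thread.currentThread()
      .getContextClassLoader
      .getResource("test-data/iris.libsvm")
      .toString
    val iris = spark.read.format("libsvm").load(irisPath)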


assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
}

Contributor:
It's better to add another corner case: a single cluster. We should guarantee it outputs a result consistent with sklearn. You can just select one cluster from the iris dataset and test it.

Contributor Author:
Actually sklearn throws an exception in this case. Should we do the same? Thanks.

@yanboliang (Contributor) commented Sep 6, 2017:
Yeah, I support keeping the results consistent. Otherwise, any real value would be a confusing result that doesn't make sense. What do you think? Thanks.

Contributor Author:
yes, I agree. Thanks.
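A rough sketch of the corner-case test I have in mind, assuming the single-cluster check ends up as a plain Scala assert (so an AssertionError is thrown) and that filtering on label 0.0 keeps exactly one iris cluster; the names are illustrative:

    import org.apache.spark.sql.functions.col

    // Keep only one cluster and expect the evaluator to reject it.
    val singleCluster = iris.filter(col("label") === 0.0)
    intercept[AssertionError] {
      evaluator.evaluate(singleCluster)
    }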

@yanboliang (Contributor):
@mgaido91 I left some minor comments, otherwise, this looks good. Thanks.

@SparkQA commented Sep 6, 2017

Test build #81463 has finished for PR 18538 at commit 7b8149a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* where `$a_{i}$` is the average dissimilarity of `i` with all other data
* within the same cluster, `$b_{i}$` is the lowest average dissimilarity
* of `i` to any other cluster, of which `i` is not a member.
* `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
Contributor:
Remove duplicated as.

*
* <blockquote>
* $$
* \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
Contributor:
Like above, d(X, C_{i})^2 -> d(X, C_{i}); we reached consensus on this in the last round of discussion.

* about a cluster which are needed by the algorithm.
*
* @param df The DataFrame which contains the input data
* @param predictionCol The name of the column which contains the cluster id for the point.
Contributor:
the cluster id -> the predicted cluster id

* Compute the mean Silhouette values of all samples.
*
* @param dataset The input dataset (previously clustered) on which compute the Silhouette.
* @param predictionCol The name of the column which contains the cluster id for the point.
Contributor:
Ditto.


// compute aggregate values for clusters needed by the algorithm
val clustersStatsMap = SquaredEuclideanSilhouette
.computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
Contributor:
We can check here whether the number of clusters is greater than 1, to avoid unnecessary computation.

assert(clustersStatsMap.size != 1, "...")

Contributor Author:
This comment is addressed just one line below. Thanks.

SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)

// Silhouette is reasonable only when the number of clusters is greater than 1
assert(dataset.select($(predictionCol)).distinct().count() > 1,
Contributor:
Move this check to L418, to avoid another unnecessary computation in most cases (cluster count > 1). See my comment at L418.

*/
@Since("2.3.0")
val metricName: Param[String] = {
val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
Contributor:
I'd suggest the metric name be silhouette, since we may add silhouette for other distances; then we can add another param like distance to control that. The metricName param should not be bound to any particular distance computation. There are lots of other metrics for clustering algorithms, like those in sklearn. We would not add all of them to MLlib, but we may add some of them in the future.
cc @jkbradley @MLnick @WeichenXu123

@SparkQA commented Sep 11, 2017

Test build #81639 has finished for PR 18538 at commit b0b7853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dataset,
$(predictionCol),
$(featuresCol)
)
Contributor:
Reorg as:

$(metricName) match {
  case "squaredSilhouette" =>
    SquaredEuclideanSilhouette.computeSilhouetteScore(
      dataset, $(predictionCol), $(featuresCol))
}

import org.apache.spark.sql.types.IntegerType

/**
* :: Experimental ::
Contributor:
Usually we leave a blank line under :: Experimental ::.

@Since("2.3.0")
override def evaluate(dataset: Dataset[_]): Double = {
SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
Contributor:
We should support all numeric types for the prediction column, not only integers.

SchemaUtils.checkNumericType(schema, $(labelCol))

"metricName",
"metric name in evaluation (silhouette)",
allowedParams
)
Contributor:
Reorg as:

val metricName: Param[String] = {
    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    new Param(
      this, "metricName", "metric name in evaluation (squaredSilhouette)", allowedParams)
}

@yanboliang (Contributor):
@mgaido91 These are my last comments, it should be ready to merge once they are addressed. Thanks for your contribution.

@SparkQA commented Sep 12, 2017

Test build #81666 has finished for PR 18538 at commit a7c1481.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor Author):
@yanboliang I addressed them. Thank you very much for your time, help and your great reviews.

@yanboliang (Contributor) left a comment:
LGTM

@yanboliang (Contributor):
I'm merging this into master, thanks all. If anyone has more comments, we can address them in follow-up PRs.

@asfgit closed this in dd78167 on Sep 12, 2017
@yanboliang (Contributor):
@mgaido91 I opened SPARK-21981 for Python API, would you like to work on it? Thanks.

@mgaido91 (Contributor Author):
@yanboliang yes, thank you very much.

@jkbradley (Member):
@yanboliang @mgaido91 I just saw this PR. It creates a new test data directory. Could you please send a quick update to move the data to the existing data directory: https://github.com/apache/spark/tree/master/data/mllib ? Thanks!

@mgaido91 (Contributor Author) commented Nov 2, 2017:

@jkbradley I am not sure that we should put the data for tests of the ml package in the mllib package. Is this the right approach?

@yanboliang (Contributor):
@mgaido91 Don't worry, I'll post a follow-up PR for discussion in a few days. Thanks.

@yanboliang (Contributor):
@jkbradley @mgaido91 I just sent #19648 to move test data to data/mllib, please feel free to review it. Thanks.
