
[SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette. #18538

Closed · wants to merge 18 commits

Conversation

@mgaido91 (Contributor) commented Jul 5, 2017

What changes were proposed in this pull request?

This PR adds ClusteringEvaluator, an Evaluator providing two metrics:

  • cosineSilhouette: the Silhouette measure using the cosine distance;
  • squaredSilhouette: the Silhouette measure using the squared Euclidean distance.

The implementation of the two metrics follows the algorithm proposed and explained here. These algorithms were designed for a distributed and parallel environment, so they achieve reasonable performance, unlike a naive Silhouette implementation that follows the definition directly.
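For reference, the Silhouette Coefficient of a point $i$ that both metrics estimate is the standard definition (the same formula appears in the API docs reviewed below):

    $$
    s_{i} = \frac{b_{i} - a_{i}}{\max\{a_{i}, b_{i}\}}
    $$

where $a_{i}$ is the average dissimilarity of $i$ to the other points of its own cluster and $b_{i}$ is the lowest average dissimilarity of $i$ to any cluster of which $i$ is not a member.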

How was this patch tested?

The patch has been tested with the additional unit tests added (comparing the results with those produced by the Python sklearn library).

@yanboliang (Contributor):

ok to test

testDefaultReadWrite(evaluator)
}

test("squared euclidean Silhouette") {
Contributor:
Could you add Python code which can help to reproduce the result in scikit-learn, like we did in other algorithms?

Contributor Author:
Thanks for the reference, I have added it.

@yanboliang (Contributor):
@gatorsmile Could you help to trigger the test job? It seems I can't do it now. Thanks.

@gatorsmile (Member):
test this please

@gatorsmile (Member):
ok to test

@gatorsmile (Member):
retest this please

@SparkQA commented Aug 5, 2017

Test build #80281 has finished for PR 18538 at commit cfcb106.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 5, 2017

Test build #80285 has finished for PR 18538 at commit 923418a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment:

@mgaido91 Could you refactor the code as I suggested? It should be more succinct and efficient. And try to organize all your code in ClusteringEvaluator. Any questions, feel free to let me know. Thanks.

private[this] def computeCosineSilhouette(dataset: Dataset[_]): Double = {
CosineSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)

val computeCsi = dataset.sparkSession.udf.register("computeCsi",
Contributor:
Could we use more descriptive name? We can't get what this function does from its name.

count("*").alias("count"),
sum("csi").alias("psi"),
Yudaf(col(featuresCol)).alias("y")
)
Contributor:
Aggregate function performance is not ideal for columns of non-primitive types (like the vector type here), so we would still use an RDD-based aggregate. You can refactor this part of the code following NaiveBayes, like:

    import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}
    import org.apache.spark.sql.functions._

    val numFeatures = ...
    // Squared L2 norm of the features vector.
    val squaredNorm = udf { features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0) }

    df.select(col(predictionCol), col(featuresCol))
      .withColumn("squaredNorm", squaredNorm(col(featuresCol)))
      .rdd
      .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) }
      .aggregateByKey[(DenseVector, Double)]((Vectors.zeros(numFeatures).toDense, 0.0))(
        seqOp = {
          case ((featureSum: DenseVector, squaredNormSum: Double), (features, squaredNorm)) =>
            // Accumulate the element-wise feature sum and the squared-norm sum per cluster.
            BLAS.axpy(1.0, features, featureSum)
            (featureSum, squaredNormSum + squaredNorm)
        },
        combOp = {
          case ((featureSum1, squaredNormSum1), (featureSum2, squaredNormSum2)) =>
            BLAS.axpy(1.0, featureSum2, featureSum1)
            (featureSum1, squaredNormSum1 + squaredNormSum2)
        }).collect()

In my suggestion, you can compute csi and y in a single data pass, which should be more efficient.


case class ClusterStats(Y: Vector, psi: Double, count: Long)

def computeCsi(vector: Vector): Double = {
Contributor:
Can we use Vectors.norm(vector, 2.0)? It should be more efficient for both dense and sparse vectors. Actually, we can remove this function entirely if you refactor the code as I suggested below.
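For example, a minimal sketch of what this would look like (assuming a features: Vector is in scope; an illustration, not the final code):

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // Squared L2 norm via the vector norm; efficient for both dense and sparse vectors.
    def squaredNorm(features: Vector): Double =
      math.pow(Vectors.norm(features, 2.0), 2.0)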

.agg(
count("*").alias("count"),
sum("csi").alias("psi"),
Yudaf(col(featuresCol)).alias("y")
Contributor:
Please rename csi to squaredNorm, psi to squaredNormSum, and y to featureSum, if I haven't misunderstood. We should use more descriptive names.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

private[evaluation] object CosineSilhouette {
Contributor:
There are no clustering algorithms currently using distance metrics other than the squared Euclidean distance. I'd suggest removing the CosineSilhouette implementation first; we can add it back when it's needed. This also makes this PR easier to review.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, sum}

private[evaluation] object SquaredEuclideanSilhouette {
Contributor:
Let's move this to file ClusteringEvaluator.

@SparkQA commented Aug 9, 2017

Test build #80453 has finished for PR 18538 at commit ffc17f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor Author) commented Aug 9, 2017

@yanboliang thanks for your review.
I refactored the code according to your suggestions and I removed the cosine implementation.
Could you please review it now?
Thanks.

.toMap
}

def computeSquaredSilhouetteCoefficient(
Contributor:
I'd suggest renaming computeSquaredSilhouetteCoefficient to computeSilhouetteCoefficient: since this function is already inside SquaredEuclideanSilhouette, it isn't necessary to highlight SquaredEuclidean. What do you think?


}

def computeSquaredSilhouette(dataset: Dataset[_],
Contributor:
Ditto: rename computeSquaredSilhouette to computeSilhouetteScore, which makes it clearer to users that this is the silhouette score. Meanwhile, could you add a doc comment for this function like the following?

/**
 * Compute the mean Silhouette Coefficient of all samples.
 */

}

def computeSquaredSilhouette(dataset: Dataset[_],
predictionCol: String,
Contributor:
The indentation should be four spaces in this and the following lines.


val squaredNorm = udf {
features: Vector =>
math.pow(Vectors.norm(features, 2.0), 2.0)
Contributor:
Move this line to above.


val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)

val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
Contributor:
What do you think about renaming computeSilhouette to computeSilhouetteCoefficientUDF?

Contributor:
Why not follow the same approach used above to create the squaredNorm udf?

* in this document</a>.
*/
@Experimental
class ClusteringEvaluator (val uid: String)
Contributor:
Add @Since("2.3.0") here and in the other places where necessary.

class ClusteringEvaluator (val uid: String)
extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {

def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
Contributor:
SquaredEuclideanSilhouette -> cluEval

* @group param
*/
val metricName: Param[String] = {
val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
Contributor:
squaredSilhouette -> silhouette? If we support other distances like cosine, the metric name should stay the same; the distance metric should be controlled by another param.

Contributor Author:
Should I then introduce a new param for the distance metric? I think it is important to highlight that the distance measure used is the squared Euclidean distance, because otherwise anybody would assume the plain Euclidean distance is used, IMHO.

Contributor:
Yeah, I think we can add a new param for the distance metric in the future. As MLlib only supports the squared Euclidean distance, we can skip this param for now and add a note in the API docs to clarify it. You can check MLlib KMeans: there is no param to set the distance metric. cc @jkbradley @MLnick @hhbyyh @zhengruifeng

Contributor:
Yes, the idea often crosses my mind. Even though there's a claim that K-Means is for Euclidean distances only, I often see people needing custom distance computation in practice, so I would like to see KMeans support it.


}

private[evaluation] object SquaredEuclideanSilhouette {
Contributor:
It's better to have some documentation explaining how we compute the Silhouette Coefficient with this efficient distributed implementation. You can refer to what we did in LogisticRegression.

Contributor Author:
I included the link to the design document here: https://github.com/mgaido91/spark/blob/ffc17f929dd86d1e7e73931eac5663bc08b6ba7a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala#L37. Should I move it from there? Or should I rewrite the content of the document in an annotation here? Thanks!

Contributor:
Usually we paste the formulas here to explain how we compute the Silhouette Coefficient with the efficient distributed implementation. Because your design document is not a publication, I think we need to move the content here, but you can simplify it.


import testImplicits._

val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
Contributor:
It's good to have this to verify the correctness of your implementation, but usually we don't hard-code so much data for a test. Could you try to find existing data in KMeansSuite or GaussianMixtureSuite for testing? If hard-coding is necessary, please try to use a small dataset.

Contributor Author:
Unfortunately KMeansSuite and GaussianMixtureSuite use randomly generated data, so it is not possible to know the expected Silhouette value in advance. What if I move the data to a resource file and read it?

Contributor:
I think we can't put test data in a resource file, as resource files will be packaged into the final jar and make it bigger. What about randomly generating some small data (10-20 samples), getting the sklearn output in Python, and hard-coding them here? Just like what we did in GaussianMixtureSuite.

Contributor:
@mgaido91 You can set a seed to control the randomly generated data.

Contributor Author:
Sorry, but I don't understand your point. Resources in the test scope are not included in the compiled jars. The same approach is used in the sql component, for instance, where the test data is in the resources (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data).
If I randomly generate test data, I first have to perform a clustering on those points, while with this dataset the clustering result is already available. I am not sure that is the best approach, but maybe I am missing something. Can you please clarify this for me?

Contributor:
@mgaido91 Sorry, I mistakenly thought you meant the src resources rather than the test resources. Usually we generate datasets to verify MLlib results; until now we have never put existing datasets in the resources, even in the test scope, because we use different datasets to verify different algorithms, which could lead to a large amount of data being added. But the iris dataset is so popular and can be used to verify lots of algorithms, so I'm OK with putting it there. Thanks.

@yanboliang (Contributor):
@mgaido91 I made another pass and left some comments, mainly about naming and documentation. This looks in good shape now. I'd suggest following the naming in sklearn, which should be easy to understand for both developers and users. Thanks for this great contribution.

* Evaluator for clustering results.
* At the moment, the supported metrics are:
* squaredSilhouette: silhouette measure using the squared Euclidean distance;
* cosineSilhouette: silhouette measure using the cosine distance.
Contributor Author:
I forgot to remove this line; I am removing it now.

* squaredSilhouette: silhouette measure using the squared Euclidean distance;
* cosineSilhouette: silhouette measure using the cosine distance.
* The implementation follows the proposal explained
* <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
Contributor:
It may be better to refer to the wiki and explain your method in ml-clustering.md.

@mgaido91 (Contributor Author) commented Aug 17, 2017:
@zhengruifeng I can't see any of the other evaluators in the wiki, and I don't see a detailed explanation of the maths behind the algorithms either, so I am not sure it is the best place.

SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)

val metric: Double = $(metricName) match {
Contributor:
If only Euclidean is supported for now, val metric and the match are not needed here; directly return SquaredEuclideanSilhouette.computeSquaredSilhouette...

(clusterCurrentPoint.numOfPoints - 1)
}

var silhouetteCoeff = 0.0
Contributor:
What about changing this to

if (clusterSil < minOther) {
  1 - clusterSil / minOther
} else if (clusterSil > minOther) {
  minOther / clusterSil - 1
} else {
  0.0
}

}
}
silhouetteCoeff

Contributor:
remove empty line

}
metric
}

Contributor:
remove empty line, and otherwise


val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)

val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
Contributor:
Why not follow the same approach used above to create the squaredNorm udf?

computeSquaredSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Int, _: Double)
)

val squaredSilhouetteDF = dfWithSquaredNorm
Contributor:
Use select(avg(computeSilhouette(...)))
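A minimal sketch of what I mean, assuming dfWithSquaredNorm still carries the featuresCol, predictionCol and "squaredNorm" columns and that the registered UDF is named computeSilhouetteCoefficientUDF as discussed above:

    import org.apache.spark.sql.functions.{avg, col}

    // Average the per-point Silhouette Coefficient in a single select.
    val silhouetteScore = dfWithSquaredNorm
      .select(avg(computeSilhouetteCoefficientUDF(
        col(featuresCol), col(predictionCol), col("squaredNorm"))))
      .collect()(0)
      .getDouble(0)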


import testImplicits._

val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
Contributor:
@mgaido91 You can set a seed to control the randomly generated data.

}

/*
Use the following python code to load the data and evaluate it using scikit-learn package.
Contributor:
You should add the expected output of your Python code (refer to FPGrowthSuite.scala), and mind the indentation.

@SparkQA commented Aug 18, 2017

Test build #80860 has finished for PR 18538 at commit a4ca3cd.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 19, 2017

Test build #80862 has finished for PR 18538 at commit a7db896.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val irisCsvPath = Thread.currentThread()
.getContextClassLoader
.getResource("test-data/iris.csv")
.toString
Contributor:
So this test suite references another test data file. Can we generate the test data in the code, like other test suites do?

Contributor Author:
There was a discussion about this in the outdated comments. The main reason to avoid test data generation, in my view, is that the generated data must be clustered before running the Silhouette.
The iris dataset is well known and already contains clustered data, so it seemed the best option.

*
* <blockquote>
* s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
* </blockquote>
Contributor:
The LaTeX formula should be surrounded by $$; change it here and in the other places as:

<blockquote>
    $$
    s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    $$
</blockquote>

* distance measure.
*
* With this assumption, the average of the distance of the point `X`
* to the points `C_{i}` belonging to the cluster `\Gamma` is:
Contributor:
C_{i} -> $C_{i}$; otherwise the doc can't be generated correctly. Change it here and in the other places.

* <blockquote>
* s_{i}=\left\{ \begin{tabular}{cc}
* $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
* $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
Contributor:
There is a syntax error in this LaTeX formula; I checked the generated doc and found it doesn't render correctly. You can also paste the formula into http://www.hostmath.com/ to check.
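For example, one form that should render correctly (a suggestion only; please double-check it against the generated doc):

    $$
    s_{i}=\begin{cases}
    1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} > b_{i}
    \end{cases}
    $$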

Contributor Author:
Thanks @yanboliang, could you please tell me how to check the generated doc? Thank you!

Contributor:
1. Remove private[evaluation] from object SquaredEuclideanSilhouette. We only generate docs for public APIs; the docs of private APIs are for developers to understand the code.
2. cd docs
3. Run jekyll build
4. Then you can find the API docs under docs/_site/api/scala/index.html; try searching for SquaredEuclideanSilhouette.

Contributor Author:
Thank you! You're always kind. Just fixed everything, thanks.

* thus we can name it `\Psi_{\Gamma}`
*
* <blockquote>
* sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
Contributor:
Ditto, there is a syntax error in this LaTeX formula.

*
* The implementation follows the proposal explained
* <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
* in this document</a>.
Contributor:
BTW, we have the necessary docs on object SquaredEuclideanSilhouette to explain the proposed algorithm, so we can remove this. Usually we only refer to public publications.

@SparkQA commented Sep 4, 2017

Test build #81369 has finished for PR 18538 at commit 9abe9e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor) left a comment:
LGTM. Thanks!

*
* where `$a_{i}$` is the average dissimilarity of `i` with all other data
* within the same cluster, `$b_{i}$` is the lowest average dissimilarity
* of to any other cluster, of which `i` is not a member.
Contributor:
of to -> of `i` to

*
* <blockquote>
* $$
* \sum\limits_{i=1}^N d(X, C_{i} )^2 =
Contributor:
I'd suggest changing d(X, C_{i})^2 to d(X, C_{i}): as we don't define d() as the Euclidean distance, we can regard it as the squared Euclidean distance. What do you think?

Contributor Author:
yes, you are right, thanks.

* \sum\limits_{i=1}^N d(X, C_{i} )^2 =
* \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
* = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
* \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
Contributor:
x_{i}c_{ij} -> x_{ij}c_{ij}? Since x_{i} is a vector and c_{ij} is a double, here we compute dot product.

Contributor Author:
No, x_{i} is not a vector. X is a vector (which represents a point); x_{i} is a typo that I am fixing to x_{j}, which is a scalar, not a vector.

* \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
* = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
* \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
* -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
Contributor:
Ditto, x_{i}c_{ij} -> x_{ij}c_{ij}.
BTW, could you also check this issue in the following description? Thanks.

Contributor Author:
As above, I am checking for the typo everywhere, thanks for pointing it out.
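For clarity, the corrected chain should read as follows, using the naming from the rest of the doc, if I read it correctly ($\xi_{X} = \sum_{j=1}^D x_{j}^2$ is the squared norm of X, $\Psi_{\Gamma}$ is the sum of the squared norms of the cluster's points, and $Y_{\Gamma j} = \sum_{i=1}^N c_{ij}$ is the j-th component of their element-wise sum):

    $$
    \sum\limits_{i=1}^N d(X, C_{i}) =
    \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 -
    2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    N \xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D x_{j} Y_{\Gamma j}
    $$

so the per-cluster statistics N, $\Psi_{\Gamma}$ and $Y_{\Gamma}$ are all that is needed to evaluate the distance of a point to a whole cluster.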

* and parallel implementation of the Silhouette using the squared Euclidean
* distance measure.
*
* With this assumption, the average of the distance of the point `X`
Contributor:
the average of the distance of the point -> the total distance of the point? Should it be the total distance rather than the average distance?

}


object ClusteringEvaluator
Contributor:
@Since("2.3.0")

object ClusteringEvaluator
extends DefaultParamsReadable[ClusteringEvaluator] {

override def load(path: String): ClusteringEvaluator = super.load(path)
Contributor:
@Since("2.3.0")

.setFeaturesCol("features")
.setPredictionCol("label")

assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
Contributor:
Checking with a tolerance of 1e-5 is good enough.

splits(splits.length-1).toInt
)
}
.toDF()
Contributor:
Can we store the test data in libsvm format rather than csv? Then we can use spark.read.format("libsvm").load(irisPath) to load it into a DataFrame with two columns: features and label.
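A rough sketch of the suggested loading, assuming the data is converted to a hypothetical test-data/iris.libsvm resource (the file name is illustrative):

    // Load the libsvm-formatted iris data into a DataFrame with "features" and "label" columns.
    val irisPath = Thread.currentThread()
      .getContextClassLoader
      .getResource("test-data/iris.libsvm")
      .toString
    val iris = spark.read.format("libsvm").load(irisPath)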


assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
}

Contributor:
It's better to add another corner case: a single cluster. We should guarantee it outputs a result consistent with sklearn. You can just select one cluster from the iris dataset and test it.

Contributor Author:
Actually sklearn throws an exception in this case. Should we do the same? Thanks.

@yanboliang (Contributor) commented Sep 6, 2017:
Yeah, I support keeping the results consistent. Otherwise, any real value would be a confusing result that doesn't make sense. What do you think? Thanks.

Contributor Author:
yes, I agree. Thanks.
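A rough sketch of the corner-case test I have in mind, assuming the single-cluster check ends up as a plain Scala assert (so an AssertionError is thrown) and that filtering on label 0.0 keeps exactly one iris cluster; the names are illustrative:

    import org.apache.spark.sql.functions.col

    // Keep only one cluster and expect the evaluator to reject it.
    val singleCluster = iris.filter(col("label") === 0.0)
    intercept[AssertionError] {
      evaluator.evaluate(singleCluster)
    }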

@yanboliang (Contributor):
@mgaido91 I left some minor comments, otherwise, this looks good. Thanks.

@SparkQA commented Sep 6, 2017

Test build #81463 has finished for PR 18538 at commit 7b8149a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* where `$a_{i}$` is the average dissimilarity of `i` with all other data
* within the same cluster, `$b_{i}$` is the lowest average dissimilarity
* of `i` to any other cluster, of which `i` is not a member.
* `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
Contributor:
Remove duplicated as.

*
* <blockquote>
* $$
* \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
Contributor:
Like above, d(X, C_{i})^2 -> d(X, C_{i}); we reached consensus on this in the last round of discussion.

* about a cluster which are needed by the algorithm.
*
* @param df The DataFrame which contains the input data
* @param predictionCol The name of the column which contains the cluster id for the point.
Contributor:
the cluster id -> the predicted cluster id

* Compute the mean Silhouette values of all samples.
*
* @param dataset The input dataset (previously clustered) on which compute the Silhouette.
* @param predictionCol The name of the column which contains the cluster id for the point.
Contributor:
Ditto.


// compute aggregate values for clusters needed by the algorithm
val clustersStatsMap = SquaredEuclideanSilhouette
.computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
Contributor:
We can check here whether the number of clusters is greater than 1, to avoid unnecessary computation.

assert(clustersStatsMap.size != 1, "...")

Contributor Author:
This comment is addressed just one line below. Thanks.

SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)

// Silhouette is reasonable only when the number of clusters is greater than 1
assert(dataset.select($(predictionCol)).distinct().count() > 1,
Contributor:
Move this check to L418, to avoid another unnecessary computation in most cases (cluster count > 1). See my comment at L418.

*/
@Since("2.3.0")
val metricName: Param[String] = {
val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
Contributor:
I'd suggest the metric name be silhouette, since we may add silhouette for other distances; then we can add another param like distance to control that. The metricName param should not be bound to any particular distance computation. There are lots of other metrics for clustering algorithms, like those in sklearn. We would not add all of them to MLlib, but we may add some of them in the future.
cc @jkbradley @MLnick @WeichenXu123

@SparkQA commented Sep 11, 2017

Test build #81639 has finished for PR 18538 at commit b0b7853.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dataset,
$(predictionCol),
$(featuresCol)
)
Contributor:
Reorg as:

$(metricName) match {
  case "squaredSilhouette" =>
    SquaredEuclideanSilhouette.computeSilhouetteScore(
      dataset, $(predictionCol), $(featuresCol))
}

import org.apache.spark.sql.types.IntegerType

/**
* :: Experimental ::
Contributor:
Usually we leave a blank line under :: Experimental ::.

@Since("2.3.0")
override def evaluate(dataset: Dataset[_]): Double = {
SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
Contributor:
We should support all numeric types for the prediction column, not only integers.

SchemaUtils.checkNumericType(schema, $(labelCol))

"metricName",
"metric name in evaluation (silhouette)",
allowedParams
)
Contributor:
Reorg as:

val metricName: Param[String] = {
    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    new Param(
      this, "metricName", "metric name in evaluation (squaredSilhouette)", allowedParams)
}

@yanboliang (Contributor):
@mgaido91 These are my last comments, it should be ready to merge once they are addressed. Thanks for your contribution.

@SparkQA commented Sep 12, 2017

Test build #81666 has finished for PR 18538 at commit a7c1481.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor Author):
@yanboliang I addressed them. Thank you very much for your time, help and your great reviews.

@yanboliang (Contributor) left a comment:
LGTM

@yanboliang (Contributor):
I'm merging this into master, thanks all. If anyone has more comments, we can address them in follow-up PRs.

@asfgit closed this in dd78167 on Sep 12, 2017
@yanboliang (Contributor):
@mgaido91 I opened SPARK-21981 for Python API, would you like to work on it? Thanks.

@mgaido91 (Contributor Author):
@yanboliang yes, thank you very much.

@jkbradley (Member):
@yanboliang @mgaido91 I just saw this PR. It creates a new test data directory. Could you please send a quick update to move the data to the existing data directory: https://github.com/apache/spark/tree/master/data/mllib ? Thanks!

@mgaido91 (Contributor Author) commented Nov 2, 2017:

@jkbradley I am not sure that we should put the data for tests of the ml package in the mllib package. Is this the right approach?

@yanboliang (Contributor):
@mgaido91 Don't worry, I'll post a follow-up PR for discussion in a few days. Thanks.

@yanboliang (Contributor):
@jkbradley @mgaido91 I just sent #19648 to move test data to data/mllib, please feel free to review it. Thanks.
