[SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette. #18538
Conversation
ok to test

testDefaultReadWrite(evaluator)
}

test("squared euclidean Silhouette") {
Could you add Python code which can help reproduce the result in scikit-learn, like we did for other algorithms?
Thanks for the reference, I have added it.
@gatorsmile Could you help trigger the test job? It seems I can't do it now. Thanks.
test this please
ok to test
retest this please
Test build #80281 has finished for PR 18538 at commit
Test build #80285 has finished for PR 18538 at commit
@mgaido91 Could you refactor the code as I suggested? It should be more succinct and efficient. And try to organize all your code in ClusteringEvaluator. If you have any questions, feel free to let me know. Thanks.
private[this] def computeCosineSilhouette(dataset: Dataset[_]): Double = {
  CosineSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)

  val computeCsi = dataset.sparkSession.udf.register("computeCsi",
Could we use a more descriptive name? We can't tell what this function does from its name.
count("*").alias("count"), | ||
sum("csi").alias("psi"), | ||
Yudaf(col(featuresCol)).alias("y") | ||
) |
Aggregate function performance is not ideal for columns of non-primitive types (like the vector type here), so we should still use an RDD-based aggregate. You can refactor this part of the code following NaiveBayes, like:
import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors}
import org.apache.spark.sql.functions._

val numFeatures = ...
val squaredNorm = udf { features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0) }
df.select(col(predictionCol), col(featuresCol))
  .withColumn("squaredNorm", squaredNorm(col(featuresCol)))
  .rdd
  .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) }
  .aggregateByKey[(DenseVector, Double)]((Vectors.zeros(numFeatures).toDense, 0.0))(
    seqOp = {
      case ((featureSum: DenseVector, squaredNormSum: Double), (features, squaredNorm)) =>
        BLAS.axpy(1.0, features, featureSum)
        (featureSum, squaredNormSum + squaredNorm)
    },
    combOp = {
      case ((featureSum1, squaredNormSum1), (featureSum2, squaredNormSum2)) =>
        BLAS.axpy(1.0, featureSum2, featureSum1)
        (featureSum1, squaredNormSum1 + squaredNormSum2)
    }).collect()
In my suggestion, you can compute csi and y in a single pass over the data, which should be more efficient.
case class ClusterStats(Y: Vector, psi: Double, count: Long)

def computeCsi(vector: Vector): Double = {
Can we use Vectors.norm(vector, 2.0)? It should be more efficient for both dense and sparse vectors. Actually we can remove this function entirely if you refactor the code as I suggest below.
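For illustration, a minimal sketch of the norm-based variant being suggested (the UDF name is an assumption, not from the PR):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Vectors.norm handles both dense and sparse vectors efficiently;
// squaring the 2-norm yields the squared Euclidean norm of the point.
val squaredNormUDF = udf { features: Vector =>
  math.pow(Vectors.norm(features, 2.0), 2.0)
}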
.agg(
  count("*").alias("count"),
  sum("csi").alias("psi"),
  Yudaf(col(featuresCol)).alias("y")
Please rename csi to squaredNorm, psi to squaredNormSum, and y to featureSum, if I haven't misunderstood. We should use more descriptive names.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

private[evaluation] object CosineSilhouette {
There are currently no clustering algorithms using distance metrics other than squared Euclidean distance. I'd suggest removing the CosineSilhouette implementation first; we can add it back when it's needed. This will also make this PR easier to review.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, sum}

private[evaluation] object SquaredEuclideanSilhouette {
Let's move this to the file ClusteringEvaluator.
Test build #80453 has finished for PR 18538 at commit
@yanboliang thanks for your review.
.toMap
}

def computeSquaredSilhouetteCoefficient(
I'd suggest renaming computeSquaredSilhouetteCoefficient to computeSilhouetteCoefficient: since this function is already inside the class SquaredEuclideanSilhouette, it isn't necessary to highlight SquaredEuclidean. What do you think?
}

def computeSquaredSilhouette(dataset: Dataset[_],
Ditto, rename computeSquaredSilhouette to computeSilhouetteScore, which should make it clearer to users that this is the silhouette score. Meanwhile, could you add an annotation for this function like the following?
/**
* Compute the mean Silhouette Coefficient of all samples.
*/
}

def computeSquaredSilhouette(dataset: Dataset[_],
predictionCol: String,
The indentation should be four spaces in this and the following lines.
|
||
val squaredNorm = udf { | ||
features: Vector => | ||
math.pow(Vectors.norm(features, 2.0), 2.0) |
Move this line up to join the line above.
val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)

val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
What do you think about renaming computeSilhouette to computeSilhouetteCoefficientUDF?
Why not follow the same approach used above to create the squaredNorm UDF?
* in this document</a>.
*/
@Experimental
class ClusteringEvaluator (val uid: String)
Add @Since("2.3.0") here and in other places where necessary.
class ClusteringEvaluator (val uid: String)
extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {

def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
SquaredEuclideanSilhouette -> cluEval
* @group param
*/
val metricName: Param[String] = {
val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
squaredSilhouette -> silhouette? If we support silhouette with other distances like cosine, the metric name should stay the same; the distance metric should be controlled by another param.
Should I then introduce a new param for the distance metric? I think it is important to highlight that the distance measure used is the squared Euclidean distance, because anybody would assume plain Euclidean distance if we don't specify it very clearly, IMHO.
Yeah, I think we can add a new param for the distance metric in the future. As MLlib only supports squared Euclidean distance, we can omit this param for now and add an annotation in the API to clarify it. You can check MLlib KMeans: there is no param to set the distance metric. cc @jkbradley @MLnick @hhbyyh @zhengruifeng
Yes, the idea often crosses my mind. Even though there's a claim that K-Means is for Euclidean distances only, I often see people who need custom distance computations in practice, so I would like to see KMeans support it.
}

private[evaluation] object SquaredEuclideanSilhouette {
It's better to have some annotation explaining how we compute the Silhouette Coefficient with this highly efficient distributed implementation. You can refer to what we did in LogisticRegression.
I included the link to the design document here: https://github.com/mgaido91/spark/blob/ffc17f929dd86d1e7e73931eac5663bc08b6ba7a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala#L37. Should I move it from there? Or should I rewrite the content of the document in an annotation here? Thanks!
Usually we paste the formula here to explain how we compute the Silhouette Coefficient with the highly efficient distributed implementation. Because your design document is not a publication, I think we need to move the content from there, though you can simplify it.
import testImplicits._

val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
It's good to have this to verify the correctness of your implementation, but usually we don't hard-code so much data for a test. Could you try to find existing data in KMeansSuite or GaussianMixtureSuite for testing? If hard-coding is necessary, please try to use a small dataset.
Unfortunately KMeansSuite and GaussianMixtureSuite use randomly generated data: thus it is not possible to know in advance what the output value for the Silhouette should be. What if I move the data to a resource file and read it?
I think we can't put test data in a resource file, as resource files are packaged into the final jar, which makes it bigger. What about randomly generating some small data (10-20 samples), getting the sklearn output in Python, and hard-coding it here, just like what we did in GaussianMixtureSuite?
@mgaido91 You can set a seed to control the randomly generated data.
Sorry but I can't understand your point. Resources in the test scope are not included in the compiled jars. The same approach is used in the sql component, for instance, where the test data is in the resources (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data).
If I randomly generate test data, I first have to perform a clustering on those points, while with this dataset the clustering result is already available. I am not sure this is the best approach, but maybe I am missing something. Can you please clarify this for me?
@mgaido91 Sorry, I mistakenly thought you were putting it in the src resources rather than the test resources. Usually we generate a dataset to verify MLlib results; we have never put an existing dataset in resources, even in test scope, until now, because we use different datasets to verify different algorithms, which could lead to a large amount of data being added. But the iris dataset is so popular and can be used to verify lots of algorithms, so I'm OK with putting it there. Thanks.
@mgaido91 I made another pass and left some comments, mainly about naming and annotations. This looks in good shape now. I'd suggest following the naming in sklearn, which should be easy to understand for both developers and users. Thanks for this great contribution.
* Evaluator for clustering results.
* At the moment, the supported metrics are:
* squaredSilhouette: silhouette measure using the squared Euclidean distance;
* cosineSilhouette: silhouette measure using the cosine distance.
I forgot to remove this line, I am doing it.
* squaredSilhouette: silhouette measure using the squared Euclidean distance;
* cosineSilhouette: silhouette measure using the cosine distance.
* The implementation follows the proposal explained
* <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
It may be better to refer to the wiki and explain your method in ml-clustering.md.
@zhengruifeng I can't see any of the other evaluators in the wiki, and I don't see a detailed explanation of the maths behind the algorithms either, so I am not sure it is the best place.
SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)

val metric: Double = $(metricName) match {
If only Euclidean is supported for now, val metric and the match are not needed here; directly return SquaredEuclideanSilhouette.computeSquaredSilhouette...
(clusterCurrentPoint.numOfPoints - 1)
}

var silhouetteCoeff = 0.0
What about changing this to
if (clusterSil < minOther) {
  1 - clusterSil / minOther
} else if (clusterSil > minOther) {
  minOther / clusterSil - 1
} else {
  0.0
}
}
}
silhouetteCoeff
remove empty line
}
metric
}
remove empty line, and otherwise
val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)

val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
Why not follow the same approach used above to create the squaredNorm UDF?
computeSquaredSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Int, _: Double)
)

val squaredSilhouetteDF = dfWithSquaredNorm
Use select(avg(computeSilhouette(...)))
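A minimal sketch of that pattern, assuming the computeSilhouette UDF and the column names from the surrounding diff:

import org.apache.spark.sql.functions.{avg, col}

// Average the per-point silhouette coefficients in a single select,
// instead of materializing an intermediate DataFrame first.
val silhouetteScore = dfWithSquaredNorm
  .select(avg(computeSilhouette(col(featuresCol), col(predictionCol), col("squaredNorm"))))
  .head()
  .getDouble(0)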
import testImplicits._

val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
@mgaido91 You can set a seed to control the randomly generated data.
}

/*
Use the following python code to load the data and evaluate it using scikit-learn package.
You should add the expected output of your Python code (refer to FPGrowthSuite.scala), and mind the indentation.
Test build #80860 has finished for PR 18538 at commit
Test build #80862 has finished for PR 18538 at commit
val irisCsvPath = Thread.currentThread()
  .getContextClassLoader
  .getResource("test-data/iris.csv")
  .toString
So this test suite references another test data file. Can we generate the test data in the code, like other test suites do?
There was a discussion about this in the outdated comments. The main reason to avoid test data generation, from my point of view, is that the generated data must be clustered before running the Silhouette.
The iris dataset is a well-known one and already contains clustered data, so it seemed the best option.
*
* <blockquote>
* s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
* </blockquote>
The LaTeX formula should be surrounded by $$; change here and in other places, as:
<blockquote>
$$
s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
$$
</blockquote>
* distance measure.
*
* With this assumption, the average of the distance of the point `X`
* to the points `C_{i}` belonging to the cluster `\Gamma` is:
C_{i} ->
* <blockquote>
* s_{i}=\left\{ \begin{tabular}{cc}
* $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
* $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
There is a syntax error in this LaTeX formula; I checked the generated doc and found it can't render correctly. You can paste the formula into http://www.hostmath.com/ to check.
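(For reference, a MathJax-friendly way to write this piecewise definition uses \begin{cases} instead of tabular; this is a sketch, not the wording finally merged:)

$$
s_{i} = \begin{cases}
1 - \frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
\frac{b_{i}}{a_{i}} - 1 & \text{if } a_{i} > b_{i}
\end{cases}
$$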
thanks @yanboliang, could you please tell me how to check the generated doc? Thank you!
1. Remove private[evaluation] from object SquaredEuclideanSilhouette. We only generate docs for public APIs; the docs of private APIs are meant for developers to understand the code.
2. cd docs
3. Run jekyll build
4. Then you can find the API docs under docs/_site/api/scala/index.html; try searching for SquaredEuclideanSilhouette.
thank you! You're always nice. Just fixed everything, thanks.
* thus we can name it `\Psi_{\Gamma}`
*
* <blockquote>
* sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
Ditto, there is a syntax error in this LaTeX formula.
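(For reference, the error appears to be the missing backslash on the first \sum; the corrected form, using the name the surrounding doc gives this quantity:)

$$
\Psi_{\Gamma} = \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
$$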
*
* The implementation follows the proposal explained
* <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
* in this document</a>.
BTW, we have the necessary docs at object SquaredEuclideanSilhouette to explain our proposed algorithm, so we can remove this. Usually we only refer to public publications.
Test build #81369 has finished for PR 18538 at commit
LGTM. Thanks!
*
* where `$a_{i}$` is the average dissimilarity of `i` with all other data
* within the same cluster, `$b_{i}$` is the lowest average dissimilarity
* of to any other cluster, of which `i` is not a member.
of to -> of `i` to
*
* <blockquote>
* $$
* \sum\limits_{i=1}^N d(X, C_{i} )^2 =
I'd suggest changing d(X, C_{i})^2 to d(X, C_{i}): since we don't define d() as the Euclidean distance, we can regard it as the squared Euclidean distance. What do you think?
yes, you are right, thanks.
* \sum\limits_{i=1}^N d(X, C_{i} )^2 =
* \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
* = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
* \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
x_{i}c_{ij} -> x_{ij}c_{ij}? Since x_{i} is a vector and c_{ij} is a double, here we compute a dot product.
No, x_{i} is not a vector. X is a vector (which represents a point); x_{i} is a typo I am fixing: it should be x_{j}, which is a scalar, not a vector.
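(For reference, here is the expansion with both agreed fixes applied: d(X, C_{i}) without the exponent, per the earlier thread, and x_{j} in place of the x_{i} typo:)

$$
\sum\limits_{i=1}^N d(X, C_{i})
= \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
= \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2
+ \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
- 2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
$$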
* \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
* = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
* \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
* -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
Ditto, x_{i}c_{ij} -> x_{ij}c_{ij}. BTW, could you also check for this issue in the following description? Thanks.
As above, I am checking for the typo everywhere; thanks for pointing it out.
* and parallel implementation of the Silhouette using the squared Euclidean
* distance measure.
*
* With this assumption, the average of the distance of the point `X`
the average of the distance of the point -> the total distance of the point? Should it be the total distance rather than the average distance?
}

object ClusteringEvaluator
@Since("2.3.0")
object ClusteringEvaluator
extends DefaultParamsReadable[ClusteringEvaluator] {

override def load(path: String): ClusteringEvaluator = super.load(path)
@Since("2.3.0")
.setFeaturesCol("features")
.setPredictionCol("label")

assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
Checking with a tolerance of 1e-5 is good enough.
  splits(splits.length-1).toInt
  )
}
.toDF()
Can we store the test data in libsvm format rather than csv? Then we can use spark.read.format("libsvm").load(irisPath) to load it into a DataFrame with two columns: features and label.
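A minimal sketch of that loading approach, assuming the iris data were converted to a hypothetical test-data/iris.libsvm resource:

// "libsvm" is a built-in Spark data source; it yields exactly two
// columns: "label" (Double) and "features" (Vector).
val irisLibsvmPath = Thread.currentThread()
  .getContextClassLoader
  .getResource("test-data/iris.libsvm") // hypothetical file name
  .toString
val irisDF = spark.read.format("libsvm").load(irisLibsvmPath)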
assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
}
It's better to add another corner case: a single cluster. We should guarantee it outputs results consistent with sklearn. You can just select one cluster from the iris dataset and test it.
Actually sklearn throws an exception in this case. Should we do the same? Thanks.
Yeah, I support keeping the results consistent. Otherwise, any real value is a confusing result that doesn't make sense. What do you think? Thanks.
yes, I agree. Thanks.
@mgaido91 I left some minor comments; otherwise, this looks good. Thanks.
Test build #81463 has finished for PR 18538 at commit
* where `$a_{i}$` is the average dissimilarity of `i` with all other data
* within the same cluster, `$b_{i}$` is the lowest average dissimilarity
* of `i` to any other cluster, of which `i` is not a member.
* `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
Remove the duplicated as.
*
* <blockquote>
* $$
* \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
Like above, d(X, C_{i})^2 -> d(X, C_{i}); we reached consensus on this in the last round of discussion.
* about a cluster which are needed by the algorithm.
*
* @param df The DataFrame which contains the input data
* @param predictionCol The name of the column which contains the cluster id for the point.
the cluster id -> the predicted cluster id
* Compute the mean Silhouette values of all samples.
*
* @param dataset The input dataset (previously clustered) on which compute the Silhouette.
* @param predictionCol The name of the column which contains the cluster id for the point.
Ditto.
// compute aggregate values for clusters needed by the algorithm
val clustersStatsMap = SquaredEuclideanSilhouette
  .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
We can check whether the number of clusters is greater than 1 here, to avoid unnecessary computation:
assert(clustersStatsMap.size != 1, "...")
this comment has been addressed just one line below. Thanks.
SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)

// Silhouette is reasonable only when the number of clusters is grater then 1
assert(dataset.select($(predictionCol)).distinct().count() > 1,
Move this check to L418, to avoid another unnecessary computation in the common case (cluster count > 1). See my comment at L418.
*/
@Since("2.3.0")
val metricName: Param[String] = {
val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
I'd suggest the metric name be silhouette, since we may add silhouette for other distances; then we can add another param like distance to control that. The param metricName should not be bound to any particular distance computation. There are lots of other metrics for clustering algorithms, like those in sklearn. We would not add all of them to MLlib, but we may add some of them in the future.
cc @jkbradley @MLnick @WeichenXu123
Test build #81639 has finished for PR 18538 at commit
dataset,
$(predictionCol),
$(featuresCol)
)
Reorg as:
$(metricName) match {
  case "squaredSilhouette" =>
    SquaredEuclideanSilhouette.computeSilhouetteScore(
      dataset, $(predictionCol), $(featuresCol))
}
import org.apache.spark.sql.types.IntegerType

/**
* :: Experimental ::
Usually we leave a blank line under :: Experimental ::.
@Since("2.3.0") | ||
override def evaluate(dataset: Dataset[_]): Double = { | ||
SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) | ||
SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) |
We should support all numeric types for the prediction column, not only integers:
SchemaUtils.checkNumericType(schema, $(labelCol))
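A sketch of how that check might be adapted here (the parameter name is adjusted from the reviewer's labelCol example; the cast is a hypothetical follow-up step):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Accept any numeric prediction column, then normalize it to Double
// before the silhouette computation.
SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
val df = dataset.select(col($(featuresCol)), col($(predictionCol)).cast(DoubleType))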
"metricName", | ||
"metric name in evaluation (silhouette)", | ||
allowedParams | ||
) |
Reorg as:
val metricName: Param[String] = {
  val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
  new Param(
    this, "metricName", "metric name in evaluation (squaredSilhouette)", allowedParams)
}
@mgaido91 These are my last comments; it should be ready to merge once they are addressed. Thanks for your contribution.
Test build #81666 has finished for PR 18538 at commit
@yanboliang I addressed them. Thank you very much for your time, help, and your great reviews.
LGTM
I'm merging this into master, thanks all. If anyone has more comments, we can address them in follow-up PRs.
@mgaido91 I opened SPARK-21981 for the Python API, would you like to work on it? Thanks.
@yanboliang yes, thank you very much.
@yanboliang @mgaido91 I just saw this PR. It creates a new test data directory. Could you please send a quick update to move the data to the existing data directory: https://github.com/apache/spark/tree/master/data/mllib ? Thanks!
@jkbradley I am not sure that we should put the data for tests of the ml package in the mllib data directory. Is this the right approach?
@mgaido91 Don't worry, I'll post a follow-up PR for discussion in a few days. Thanks.
@jkbradley @mgaido91 I just sent #19648 to move the test data to data/mllib, please feel free to review it. Thanks.
What changes were proposed in this pull request?
This PR adds the ClusteringEvaluator Evaluator, which contains two metrics: squaredSilhouette (silhouette measure using the squared Euclidean distance) and cosineSilhouette (silhouette measure using the cosine distance).
The implementation of the two metrics follows the algorithm proposed and explained here. These algorithms were designed for a distributed and parallel environment, so they achieve reasonable performance, unlike a naive Silhouette implementation that follows its definition directly.
How was this patch tested?
The patch has been tested with the additional unit tests added (comparing the results with those provided by the Python sklearn library).
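For illustration, a minimal end-to-end usage sketch of the evaluator this PR adds (the dataset variable and column names are assumptions):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Cluster the feature vectors, then score the resulting assignment;
// evaluate() returns the mean silhouette over all points, in [-1, 1].
val predictions = new KMeans().setK(3).fit(dataset).transform(dataset)
val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val silhouette = evaluator.evaluate(predictions)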