[SPARK-23975][ML] Add support of array input for all clustering methods #21195
lu-wang-dl wants to merge 6 commits into apache:master
Conversation
Looking now.

add to whitelist
Test build #89981 has finished for PR 21195 at commit

Rerunning tests in case the R CRAN failure was from flakiness.
Test build #4166 has finished for PR 21195 at commit
Thanks Lu! I had a pass over this PR and it looks pretty straightforward. One thing I noticed is that there are two patterns we keep repeating, and I think we should add private APIs for them and delegate to those. The first pattern is the validate-schema method defined in terms of typeCandidates. The second pattern is going from a DataFrame and column name to an RDD[OldVector]; let's add a method that does this. Both are sketched below.
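A minimal sketch of what those two private helpers could look like, assuming the code lives inside Spark's ml package; the names validateVectorCompatibleColumn and columnToOldVectors are illustrative, not taken from the PR, and nullability handling is elided:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, DoubleType, FloatType, StructType}

// Pattern 1: schema validation driven by a list of accepted column types.
def validateVectorCompatibleColumn(schema: StructType, colName: String): Unit = {
  val typeCandidates = Seq(VectorType,
    ArrayType(DoubleType, containsNull = false),
    ArrayType(FloatType, containsNull = false))
  val actual = schema(colName).dataType
  require(typeCandidates.contains(actual),
    s"Column $colName must be one of ${typeCandidates.mkString(", ")} but was $actual.")
}

// Pattern 2: DataFrame & column name -> RDD[OldVector], converting array
// columns to dense vectors along the way.
def columnToOldVectors(dataset: Dataset[_], colName: String): RDD[OldVector] = {
  val featureCol = dataset.schema(colName).dataType match {
    case dt if dt == VectorType => col(colName)
    case _: ArrayType => col(colName).cast(ArrayType(DoubleType)) // widens Float to Double
    case other => throw new IllegalArgumentException(s"Unsupported features type $other")
  }
  dataset.select(featureCol).rdd.map {
    case Row(v: Vector) => OldVectors.fromML(v)
    case Row(values: Seq[Double] @unchecked) => OldVectors.dense(values.toArray)
  }
}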
Test build #90160 has finished for PR 21195 at commit
mengxr left a comment:
In general, we should keep each PR minimal. First implement Array support for one estimator, get it reviewed and merged, and then implement support for other estimators. If other estimators share exactly the same pattern, we may put the rest in a single PR. But it is totally fine if we split them into multiple PRs. This helps avoid unnecessary code refactoring during code review.
| /**
|  * Check whether the given column in the schema is one of the supported vector types:
|  * Vector, Array[Float], Array[Double].
| val featuresColNameF = "array_float_features"
| val doubleUDF = udf { (features: Vector) =>
|   val featureArray = Array.fill[Double](features.size)(0.0)
|   features.foreachActive((idx, value) => featureArray(idx) = value.toFloat)
- If .toFloat is there to keep the same precision, we should leave an inline comment; features.toArray.map(_.toFloat.toDouble) should do the work.
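A minimal sketch of that suggestion applied to the UDF quoted above, with the inline comment the reviewer asks for (variable names follow the test code):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

val doubleUDF = udf { (features: Vector) =>
  // Truncate to single precision first, so the Array[Double] column carries
  // exactly the same values as the Array[Float] column built from it.
  features.toArray.map(_.toFloat.toDouble)
}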
|   featureArray
| }
| val newdatasetD = dataset.withColumn(featuresColNameD, doubleUDF(col("features")))
|   .drop("features")
- Unnecessary to drop features. Or you can simply replace the features column:

val newdatasetD = dataset.withColumn(FEATURES, doubleUDF(col(FEATURES)))

| val transformedF = modelF.transform(newdatasetF)
| val predictDifference = transformedD.select("prediction")
|   .except(transformedF.select("prediction"))
| assert(predictDifference.count() == 0)
This only verifies that Array[Double] and Array[Float] inputs are handled the same way; it doesn't guarantee that the result is correct. We can define a method that takes a dataset, applies one iteration, and returns the cost:

def trainAndComputeCost(dataset: DataFrame): Double = {
  val model = new BisectingKMeans()
    .setK(k).setMaxIter(1).setSeed(1)
    .fit(dataset)
  model.computeCost(dataset)
}

val trueCost = trainAndComputeCost(dataset)
val floatArrayCost = trainAndComputeCost(newDatasetF)
assert(floatArrayCost === trueCost)
val doubleArrayCost = trainAndComputeCost(newDatasetD)
assert(doubleArrayCost === trueCost)

We can map the original dataset to single precision to get an exact match, or we can test equality with a threshold. See https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala
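A small sketch of the threshold-based variant, using the ~== comparator from TestingUtils (the tolerance 1e-6 is an arbitrary choice here, not from the review):

import org.apache.spark.mllib.util.TestingUtils._

val trueCost = trainAndComputeCost(dataset)
assert(trainAndComputeCost(newDatasetF) ~== trueCost absTol 1e-6)
assert(trainAndComputeCost(newDatasetD) ~== trueCost absTol 1e-6)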
| assert(predictDifference.count() == 0)
| val probabilityDifference = transformedD.select("probability")
|   .except(transformedF.select("probability"))
| assert(probabilityDifference.count() == 0)
| val lpF = modelF.logPerplexity(newdatasetF)
| // assert(lpD == lpF)
| assert(lpD >= 0.0 && lpD != Double.NegativeInfinity)
| assert(lpF >= 0.0 && lpF != Double.NegativeInfinity)
Test build #90233 has finished for PR 21195 at commit

Test build #90335 has finished for PR 21195 at commit
| val floatLikelihood = trainAndComputlogLikelihood(newDatasetF)
| // checking the cost is fine enough as a sanity check
| assert(trueLikelihood == doubleLikelihood)
minor: should use === instead of == for assertions; the former gives a better error message. (not necessary to update in this PR)
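For illustration, with ScalaTest's === the failure output includes both operands, whereas plain == does not:

assert(trueLikelihood === doubleLikelihood) // fails with e.g. "-3.2 did not equal -3.3"
assert(trueLikelihood == doubleLikelihood)  // fails with only "assertion failed"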
| }
| val (newDataset, newDatasetD, newDatasetF) = MLTestingUtils.generateArrayFeatureDataset(dataset)
| val (ll, lp) = trainAndLogLikelihoodAndPerplexity(newDataset)
minor: the outputs are not used. I expect they will be used once we fix SPARK-22210.
Yes. I want to use this as the base for the comparison after we fix SPARK-22210.
| /**
|  * Helper function for testing different input types for features. Given a DataFrame, generate
|  * three output DataFrames: one having vector feature column with float precision, one having
minor: should say features column to make the contract clear.
| val toFloatVectorUDF = udf { (features: Vector) => features.toArray.map(_.toFloat).toVector }
| val toDoubleArrayUDF = udf { (features: Vector) => features.toArray }
| val toFloatArrayUDF = udf { (features: Vector) => features.toArray.map(_.toFloat) }
| val newDataset = dataset.withColumn("features", toFloatVectorUDF(col("features")))
minor: maybe useful to define "features" as a constant at the beginning of the function
| val toDoubleArrayUDF = udf { (features: Vector) => features.toArray }
| val toFloatArrayUDF = udf { (features: Vector) => features.toArray.map(_.toFloat) }
| val newDataset = dataset.withColumn("features", toFloatVectorUDF(col("features")))
| val newDatasetD = dataset.withColumn("features", toDoubleArrayUDF(col("features")))
This doesn't truncate the precision to single. Did you want to use newDataset instead of dataset?
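A hedged sketch of the fix this comment points at: derive both array columns from newDataset (already truncated to float precision) rather than from the original dataset, so all three DataFrames carry identical values. This assumes toFloatVectorUDF is also adjusted to return an ml Vector so the downstream UDFs still see a Vector column:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

val toFloatVectorUDF = udf { (features: Vector) =>
  // Round-trip through Float to truncate to single precision.
  Vectors.dense(features.toArray.map(_.toFloat.toDouble))
}
val newDataset = dataset.withColumn("features", toFloatVectorUDF(col("features")))
val newDatasetD = newDataset.withColumn("features", toDoubleArrayUDF(col("features")))
val newDatasetF = newDataset.withColumn("features", toFloatArrayUDF(col("features")))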
Test build #90344 has finished for PR 21195 at commit
LGTM. Merged into master. Thanks!
What changes were proposed in this pull request?
Add support for array-typed input (Array[Float] and Array[Double], in addition to Vector) for the features column of all clustering methods.
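As an illustration of the new behavior (a sketch assuming a SparkSession named spark), a clustering estimator can now be fit directly on an array column:

import org.apache.spark.ml.clustering.KMeans

// Two tiny, well-separated points; features is an Array[Double] column, not a Vector.
val df = spark.createDataFrame(Seq(
  Tuple1(Array(0.0, 0.1)),
  Tuple1(Array(9.0, 9.1))
)).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(df) // no manual conversion to Vector needed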
How was this patch tested?
Unit tests added.
Please review http://spark.apache.org/contributing.html before opening a pull request.