[SPARK-27944][ML] Unify the behavior of checking empty output column names #24793

zhengruifeng · 2019-06-04T11:31:47Z

What changes were proposed in this pull request?

In regression/clustering/ovr/als, if an output column name is empty, ignore it. If all names are empty, log a warning message and do nothing.
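The behavior described above can be sketched without Spark. This is only an illustration, not the code in this patch: rows are modelled as plain maps, and `EmptyOutputColSketch`, `transform`, and `warnings` are hypothetical names introduced here. Real code would call `logWarning` rather than collect messages in a buffer.

```scala
object EmptyOutputColSketch {
  // Assumption: a row is a plain map from column name to value (no Spark needed).
  type Row = Map[String, Double]

  // Hypothetical stand-in for logWarning, so the no-op case can be inspected.
  val warnings = scala.collection.mutable.ArrayBuffer.empty[String]

  // Appends only the outputs whose column name is non-empty.
  // If every name is empty, records a warning and returns the input unchanged.
  def transform(rows: Seq[Row], outputs: Seq[(String, Row => Double)]): Seq[Row] = {
    val active = outputs.filter { case (name, _) => name.nonEmpty }
    if (active.isEmpty) {
      warnings += "transform() does nothing because no output columns were set."
      rows
    } else {
      rows.map { row =>
        // Each output is computed from the original row, not from earlier outputs.
        active.foldLeft(row) { case (acc, (name, f)) => acc + (name -> f(row)) }
      }
    }
  }
}
```

Filtering the `(name, function)` pairs first keeps the "all names empty" check and the "skip this column" check in one place, which is roughly the unification this PR aims for.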

How was this patch tested?

existing tests

SparkQA · 2019-06-04T11:36:21Z

Test build #106149 has finished for PR 24793 at commit 6c2a81a.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-04T12:47:51Z

Test build #106150 has finished for PR 24793 at commit a0e4d54.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-06-25T10:36:18Z

This PR can help avoid unnecessary computation. For example, in GMM, the current implementation always predicts twice, once for predictionCol and once for probabilityCol. If we only need the probability, the first column can be skipped.
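To make the saving concrete, here is a toy sketch in plain Scala. It is not the GaussianMixtureModel code: the "model" is a logistic score on a single feature, and `GmmOutputColSketch`, `probabilityCalls`, and the method names are assumptions chosen for illustration. It only counts how often the probability computation runs when an output column name is left empty.

```scala
object GmmOutputColSketch {
  // Counts how often the (assumed expensive) probability computation runs.
  var probabilityCalls = 0

  // Toy stand-in for a two-component mixture: a logistic score on one feature.
  def predictProbability(x: Double): Array[Double] = {
    probabilityCalls += 1
    val p1 = 1.0 / (1.0 + math.exp(-x))
    Array(1.0 - p1, p1)
  }

  // The prediction is the index of the most probable component.
  def predict(x: Double): Int = {
    val probs = predictProbability(x)
    probs.indexOf(probs.max)
  }

  // Returns the requested outputs per input value; empty column names are skipped.
  def transform(
      xs: Seq[Double],
      predictionCol: String,
      probabilityCol: String): Seq[Map[String, Any]] =
    xs.map { x =>
      val pred: Map[String, Any] =
        if (predictionCol.nonEmpty) Map(predictionCol -> predict(x)) else Map.empty
      val prob: Map[String, Any] =
        if (probabilityCol.nonEmpty) Map(probabilityCol -> predictProbability(x).toSeq)
        else Map.empty
      pred ++ prob
    }
}
```

With both names set, the probability is computed twice per row in this sketch; with `predictionCol` set to the empty string, it is computed once per row.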

zhengruifeng · 2019-07-11T06:57:54Z

ping @srowen, would you mind helping review this?

srowen · 2019-07-11T14:40:44Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala

+        .withColumns(predictionColNames, predictionColumns)
+        .drop(accColName)
+    } else {
+      this.logWarning(s"$uid: OneVsRestModel.transform() was called as NOOP" +


I don't see how this happens given the check at the start of the method?

srowen · 2019-07-11T14:42:44Z

mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala

-    val predictUDF = udf((vector: Vector) => predict(vector))
-    dataset.withColumn($(predictionCol),
-      predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
+    if ($(predictionCol).nonEmpty) {


Likewise, how would this be empty unless the user set it to nothing? In that case, this seems not worth worrying about.

zhengruifeng · 2019-07-12

I changed this to stay in line with other algorithms such as LDA.transform.
Or should we leave algorithms with only one output column alone, and remove the check in algorithms like LDA?

srowen · 2019-07-12

Hm, I'd say we don't need this check anywhere the user would have to explicitly set no prediction column to get no output; in that case, I don't think it's worth checking and warning. I'm neutral on removing the other checks, but not against it.

Some checks are OK like the ones above as it might be easier to accidentally get into this situation because there are multiple prediction cols.

srowen · 2019-07-11T14:43:32Z

mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala

+      dataset.withColumn($(predictionCol),
+        predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
+    } else {
+      this.logWarning(s"$uid: BisectingKMeansModel.transform() was called as NOOP" +


PS we should fix these messages. "called as NOOP" is cryptic. Just say "transform() does nothing because ..."

SparkQA · 2019-07-12T04:56:55Z

Test build #107577 has finished for PR 24793 at commit 90eb0ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-15T06:09:44Z

Test build #107665 has finished for PR 24793 at commit f010ff9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2019-07-15T14:24:51Z

mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala

@@ -33,7 +33,7 @@ import org.apache.spark.ml.util.Instrumentation.instrumented
 import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Matrix => OldMatrix,
  Vector => OldVector, Vectors => OldVectors}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.sql._


nit: can we avoid this?

zhengruifeng · 2019-07-16

OK, I will revert these places.

mgaido91 · 2019-07-15T14:25:24Z

mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala

@@ -37,7 +37,7 @@ import org.apache.spark.mllib.linalg.VectorImplicits._
 import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
 import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql._


mgaido91 · 2019-07-15T14:25:38Z

mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala

@@ -34,7 +34,7 @@ import org.apache.spark.ml.util.Instrumentation.instrumented
 import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, Strategy => OldStrategy}
 import org.apache.spark.mllib.tree.model.{DecisionTreeModel => OldDecisionTreeModel}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql._


SparkQA · 2019-07-16T02:45:40Z

Test build #107711 has finished for PR 24793 at commit 388e5fc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-07-16T13:56:19Z

Merged to master

…names ## What changes were proposed in this pull request? In regression/clustering/ovr/als, if an output column name is empty, igore it. And if all names are empty, log a warning msg, then do nothing. ## How was this patch tested? existing tests Closes apache#24793 from zhengruifeng/aft_iso_check_empty_outputCol. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

dongjoon-hyun added IMPROVEMENT ML and removed IMPROVEMENT labels Jun 13, 2019

srowen reviewed Jul 11, 2019

zhengruifeng force-pushed the aft_iso_check_empty_outputCol branch from a0e4d54 to f02051b Compare July 12, 2019 03:38

srowen reviewed Jul 12, 2019

mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala

zhengruifeng added 7 commits July 15, 2019 11:30

init

d2d0abc

fix scala style

d19846c

update log msg

96a8a41

nit

2d0f634

nit

6296091

nit lda

276a062

revert algs with only one output col

f010ff9

zhengruifeng force-pushed the aft_iso_check_empty_outputCol branch from 90eb0ac to f010ff9 Compare July 15, 2019 04:57

srowen approved these changes Jul 15, 2019

mgaido91 reviewed Jul 15, 2019

zhengruifeng added 2 commits July 16, 2019 09:37

nits

66e9060

nits

388e5fc

srowen approved these changes Jul 16, 2019

srowen closed this in 282a12d Jul 16, 2019

zhengruifeng deleted the aft_iso_check_empty_outputCol branch July 17, 2019 01:48
