Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27944][ML] Unify the behavior of checking empty output column names #24793

Closed

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

In regression/clustering/ovr/als, if an output column name is empty, igore it. And if all names are empty, log a warning msg, then do nothing.

How was this patch tested?

existing tests

@SparkQA
Copy link

SparkQA commented Jun 4, 2019

Test build #106149 has finished for PR 24793 at commit 6c2a81a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 4, 2019

Test build #106150 has finished for PR 24793 at commit a0e4d54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

This pr can help to avoid uncessary computation. For example, in GMM, current impl always predict twice, one for probabilityCol and one for probabilityCol. if we only need probability, the first col can be skipped.

@zhengruifeng
Copy link
Contributor Author

ping @srowen , would you mind help reviewing this?

.withColumns(predictionColNames, predictionColumns)
.drop(accColName)
} else {
this.logWarning(s"$uid: OneVsRestModel.transform() was called as NOOP" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this happens given the check at the start of the method?

val predictUDF = udf((vector: Vector) => predict(vector))
dataset.withColumn($(predictionCol),
predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
if ($(predictionCol).nonEmpty) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise how would this be empty, unless the user set it to nothing? in which case, this seems not worth worrying about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed here to keep in line with other algs like LDA.transform
Or leave alone algs with only one output column, and remove the check in algs like LDA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'd say we don't need this check anywhere that the user would have to explicitly set no prediction column to get no output, and in that case, I don't think it's worth checking and warning. I'm neutral on removing the other checks, but not against it.

Some checks are OK like the ones above as it might be easier to accidentally get into this situation because there are multiple prediction cols.

dataset.withColumn($(predictionCol),
predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
} else {
this.logWarning(s"$uid: BisectingKMeansModel.transform() was called as NOOP" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS we should fix these messages. "called as NOOP" is cryptic. Just say "transform() does nothing because ..."

@zhengruifeng zhengruifeng force-pushed the aft_iso_check_empty_outputCol branch from a0e4d54 to f02051b Compare July 12, 2019 03:38
@SparkQA
Copy link

SparkQA commented Jul 12, 2019

Test build #107577 has finished for PR 24793 at commit 90eb0ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng force-pushed the aft_iso_check_empty_outputCol branch from 90eb0ac to f010ff9 Compare July 15, 2019 04:57
@SparkQA
Copy link

SparkQA commented Jul 15, 2019

Test build #107665 has finished for PR 24793 at commit f010ff9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -33,7 +33,7 @@ import org.apache.spark.ml.util.Instrumentation.instrumented
import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Matrix => OldMatrix,
Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we avoid this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will revert these place.

@@ -37,7 +37,7 @@ import org.apache.spark.mllib.linalg.VectorImplicits._
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -34,7 +34,7 @@ import org.apache.spark.ml.util.Instrumentation.instrumented
import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, Strategy => OldStrategy}
import org.apache.spark.mllib.tree.model.{DecisionTreeModel => OldDecisionTreeModel}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@SparkQA
Copy link

SparkQA commented Jul 16, 2019

Test build #107711 has finished for PR 24793 at commit 388e5fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 16, 2019

Merged to master

@srowen srowen closed this in 282a12d Jul 16, 2019
@zhengruifeng zhengruifeng deleted the aft_iso_check_empty_outputCol branch July 17, 2019 01:48
vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
…names

## What changes were proposed in this pull request?
In regression/clustering/ovr/als, if an output column name is empty, igore it. And if all names are empty, log a warning msg, then do nothing.

## How was this patch tested?
existing tests

Closes apache#24793 from zhengruifeng/aft_iso_check_empty_outputCol.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants