[SPARK-27944][ML] Unify the behavior of checking empty output column names #24793
Test build #106149 has finished for PR 24793 at commit
Test build #106150 has finished for PR 24793 at commit
This PR helps avoid unnecessary computation. For example, in GMM, the current implementation always predicts twice, one for
ping @srowen, would you mind helping review this?
.withColumns(predictionColNames, predictionColumns)
.drop(accColName)
} else {
this.logWarning(s"$uid: OneVsRestModel.transform() was called as NOOP" +
I don't see how this happens given the check at the start of the method?
val predictUDF = udf((vector: Vector) => predict(vector))
dataset.withColumn($(predictionCol),
predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
if ($(predictionCol).nonEmpty) {
Likewise, how would this be empty unless the user set it to nothing? In that case, this does not seem worth worrying about.
I changed this to keep it in line with other algorithms, like LDA.transform.
Or should we leave algorithms with only one output column alone, and remove the check in algorithms like LDA?
Hm, I'd say we don't need this check anywhere that the user would have to explicitly set no prediction column to get no output, and in that case, I don't think it's worth checking and warning. I'm neutral on removing the other checks, but not against it.
Some checks are OK like the ones above as it might be easier to accidentally get into this situation because there are multiple prediction cols.
dataset.withColumn($(predictionCol),
predictUDF(DatasetUtils.columnToVector(dataset, getFeaturesCol)))
} else {
this.logWarning(s"$uid: BisectingKMeansModel.transform() was called as NOOP" +
PS we should fix these messages. "called as NOOP" is cryptic. Just say "transform() does nothing because ..."
a0e4d54 to f02051b
Test build #107577 has finished for PR 24793 at commit
mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
90eb0ac to f010ff9
Test build #107665 has finished for PR 24793 at commit
@@ -33,7 +33,7 @@ import org.apache.spark.ml.util.Instrumentation.instrumented
 import org.apache.spark.mllib.linalg.{Matrices => OldMatrices, Matrix => OldMatrix,
 Vector => OldVector, Vectors => OldVectors}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.sql._
nit: can we avoid this?
OK, I will revert these places.
@@ -37,7 +37,7 @@ import org.apache.spark.mllib.linalg.VectorImplicits._
 import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
 import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql._
ditto
@@ -34,7 +34,7 @@ import org.apache.spark.ml.util.Instrumentation.instrumented
 import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, Strategy => OldStrategy}
 import org.apache.spark.mllib.tree.model.{DecisionTreeModel => OldDecisionTreeModel}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql._
ditto
Test build #107711 has finished for PR 24793 at commit
Merged to master
…names

## What changes were proposed in this pull request?
In regression/clustering/ovr/als, if an output column name is empty, ignore it. And if all names are empty, log a warning msg, then do nothing.

## How was this patch tested?
existing tests

Closes apache#24793 from zhengruifeng/aft_iso_check_empty_outputCol.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
In regression/clustering/ovr/als, if an output column name is empty, ignore it. If all names are empty, log a warning message, then do nothing.
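For readers skimming the thread, the behavior described above can be sketched roughly as follows. This is a toy model in plain Scala, not the code from this PR: the "dataset" is just a map from column name to values, and `EmptyOutputColSketch`, `warnings`, the `outputs` parameter, and the warning wording are all hypothetical names chosen for illustration.

```scala
// Toy model only, not Spark code: a "dataset" maps column names to values.
object EmptyOutputColSketch {
  type Dataset = Map[String, Seq[Double]]

  // Hypothetical stand-in for a logger, so the warning can be inspected.
  val warnings = scala.collection.mutable.ListBuffer.empty[String]

  /** Adds one column per non-empty output name; if every name is empty,
    * records a warning and returns the input unchanged. */
  def transform(
      uid: String,
      dataset: Dataset,
      outputs: Seq[(String, Seq[Double] => Seq[Double])],
      featuresCol: String): Dataset = {
    // Skip outputs whose column name is empty, so they are never computed.
    val active = outputs.filter { case (name, _) => name.nonEmpty }
    if (active.isEmpty) {
      // Assumed wording, in the spirit of the "does nothing because" suggestion.
      warnings += s"$uid: transform() does nothing because no output columns were set."
      dataset
    } else {
      val features = dataset(featuresCol)
      active.foldLeft(dataset) { case (ds, (name, compute)) =>
        ds + (name -> compute(features))
      }
    }
  }
}
```

Filtering before computing is what avoids the unnecessary work: an output whose name is empty never has its function evaluated, and the all-empty case returns the original dataset untouched.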
How was this patch tested?
existing tests