[SPARK-34356][ML] OVR transform fix potential column conflict #31472

zhengruifeng · 2021-02-04T10:19:13Z

What changes were proposed in this pull request?

1, clear predictionCol and probabilityCol on each sub-model and write raw predictions to a temporary rawPrediction column, to avoid potential column conflicts;
2, use an array instead of a map, to keep in line with the Python side;
3, simplify transform.
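A minimal sketch of the idea in items 1 and 2, not the code from this PR. It assumes every sub-model is a `LogisticRegressionModel`; the helper name `positiveScores`, the `outputCol` parameter, and the `ovr$...` temporary column prefixes are hypothetical. Setting an output column name to the empty string tells the sub-model not to produce that column, so only the uniquely named temporary raw-prediction column is appended.

```scala
import java.util.UUID

import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, udf}

// Hypothetical helper for illustration only; not the OneVsRestModel implementation.
def positiveScores(
    models: Seq[LogisticRegressionModel],
    dataset: DataFrame,
    outputCol: String): DataFrame = {
  // A binary sub-model emits a 2-element raw prediction; index 1 is the positive class.
  val positive = udf { (raw: MLVector) => raw(1) }

  val (scored, scoreCols) =
    models.foldLeft((dataset, Seq.empty[String])) { case ((df, names), model) =>
      val tmpRaw = "ovr$raw" + UUID.randomUUID().toString
      val tmpScore = "ovr$score" + UUID.randomUUID().toString

      // Work on a copy so the caller's model keeps its own column settings.
      val tmpModel = model.copy(ParamMap.empty)
      tmpModel.setRawPredictionCol(tmpRaw) // unique temporary name
      tmpModel.setPredictionCol("")        // empty name: column is not produced
      tmpModel.setProbabilityCol("")       // empty name: column is not produced

      val next = tmpModel.transform(df)
        .withColumn(tmpScore, positive(col(tmpRaw)))
        .drop(tmpRaw)
      (next, names :+ tmpScore)
    }

  // Collect the per-model scores into one array column, ordered like `models`.
  // Note: withColumn overwrites an existing column named `outputCol`.
  scored
    .withColumn(outputCol, array(scoreCols.map(col): _*))
    .drop(scoreCols: _*)
}
```

Because the sub-models no longer try to append `prediction` or `probability`, an input dataset that already carries columns with those names passes schema validation in this sketch.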

Why are the changes needed?

If the input dataset has a column whose name matches predictionCol, probabilityCol, or rawPredictionCol, transform will fail.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a test to OneVsRestSuite.

zhengruifeng · 2021-02-04T10:21:48Z

mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala

@@ -223,6 +223,13 @@ class OneVsRestSuite extends MLTest with DefaultReadWriteTest {
    assert(oldCols === newCols)
  }

+  test("SPARK-34356: OneVsRestModel.transform should avoid potential column conflict") {


This test fails on master and (maybe) on all versions of OVR,
but I think fixing it in master may be enough.

zhengruifeng · 2021-02-04T10:29:42Z

In 3.0.1 and master:

scala> val df = spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt").withColumn("probability", lit(0.0))
21/02/04 18:06:36 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 1 more field]

scala> 

scala> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
classifier: org.apache.spark.ml.classification.LogisticRegression = logreg_5900509aa825

scala> val ovr = new OneVsRest().setClassifier(classifier)
ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_dd2b3e9da4e3

scala> val ovrm = ovr.fit(df)
ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: uid=oneVsRest_dd2b3e9da4e3, classifier=logreg_5900509aa825, numClasses=3, numFeatures=4

scala> ovrm.transform(df)
java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
  at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
  at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
  at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
  at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
  at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
  at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
  at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
  at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
  at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
  at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
  at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
  at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
  ... 49 elided

scala>
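The trace above ends in a `require` inside `SchemaUtils.appendColumn`. The following is a simplified model of that check in plain Scala, with the schema reduced to a list of column names; the function here is a stand-in written for illustration, and its empty-name branch reflects the assumption that an empty column name means "do not append".

```scala
import scala.util.Try

// Simplified stand-in for a schema: just the ordered column names.
def appendColumn(columns: Seq[String], name: String): Seq[String] =
  if (name.isEmpty) {
    columns // empty name: nothing is appended
  } else {
    require(!columns.contains(name), s"Column $name already exists.")
    columns :+ name
  }

val input = Seq("label", "features", "probability")

// The sub-model tries to append its default "probability" column and fails.
val failed = Try(appendColumn(input, "probability"))

// With the output column cleared, the schema passes through unchanged.
val skipped = appendColumn(input, "")
```

Here `failed` holds an `IllegalArgumentException` with the same message as in the trace, while `skipped` equals `input`.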

SparkQA · 2021-02-04T12:00:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39456/

SparkQA · 2021-02-04T12:33:36Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39456/

SparkQA · 2021-02-04T12:36:23Z

Test build #134869 has finished for PR 31472 at commit 8a47b6e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2021-02-04T13:34:00Z

mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala

+        tmpModel.setPredictionCol("")
+        tmpModel match {
+          case m: ProbabilisticClassificationModel[_, _] => m.setProbabilityCol("")
+          case _ =>


Should this case be silently ignored? If it's always a ProbabilisticClassificationModel, then just cast?
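The trade-off behind this question can be sketched in plain Scala. The traits and classes below (`Model`, `Probabilistic`, `ProbModel`, `MarginModel`) are hypothetical stand-ins, not Spark types. The sketch assumes OneVsRest may wrap a classifier that is not probabilistic (LinearSVC is one such classifier in Spark ML), in which case a pattern match with a fall-through succeeds where an unconditional cast would throw.

```scala
import scala.util.Try

// Hypothetical stand-ins for the model hierarchy; not Spark classes.
trait Model { var predictionCol: String = "prediction" }
trait Probabilistic extends Model { var probabilityCol: String = "probability" }
class ProbModel extends Probabilistic
class MarginModel extends Model // has no probability output

// Pattern match: clears probabilityCol only where it exists.
def clearOutputs(m: Model): Model = {
  m.predictionCol = ""
  m match {
    case p: Probabilistic => p.probabilityCol = ""
    case _ => // nothing to clear for non-probabilistic models
  }
  m
}

// Unconditional cast: throws ClassCastException for MarginModel.
def clearOutputsByCast(m: Model): Model = {
  m.predictionCol = ""
  m.asInstanceOf[Probabilistic].probabilityCol = ""
  m
}

val matched = clearOutputs(new MarginModel)
val cast = Try(clearOutputsByCast(new MarginModel))
```

If every sub-model were guaranteed to be probabilistic, the cast would be the simpler choice; the match is the safer one when that guarantee does not hold.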

SparkQA · 2021-02-05T07:56:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39501/

SparkQA · 2021-02-05T08:25:40Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39501/

SparkQA · 2021-02-05T11:51:31Z

Test build #134918 has finished for PR 31472 at commit 02725b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2021-02-07T02:59:50Z

thanks @srowen for reviewing!

init 1431ade

zhengruifeng added 2 commits February 4, 2021 18:53

array append 7eaa60c
nit 8a47b6e

github-actions bot added the ML label Feb 4, 2021

srowen reviewed Feb 4, 2021

address comments 02725b0

srowen approved these changes Feb 6, 2021

srowen closed this in 178dc50 Feb 6, 2021

zhengruifeng deleted the ovr_submodel_skip_pred_prob branch February 7, 2021 01:59
