[SPARK-13961][ML] spark.ml ChiSqSelector and RFormula should support other numeric types for label #12467

BenFradet · 2016-04-18T07:08:58Z

What changes were proposed in this pull request?

Made ChiSqSelector and RFormula accept all numeric types for label

How was this patch tested?

Unit tests

yanboliang · 2016-04-18T16:14:40Z

@BenFradet Thanks for this PR. I don't think we can make RFormula support other numeric types for the label as you propose. If the label column already exists, it must be of type DoubleType; otherwise the downstream model will not recognize the label column in validateAndTransformSchema.

BenFradet · 2016-04-18T16:27:48Z

@yanboliang thanks for your input, I reverted the affected commits

yanboliang · 2016-04-18T16:43:30Z

@BenFradet I'm really sorry, I did not notice that #10355 had been merged. Please ignore my last comment; it is no longer valid after #10355. Would you mind adding RFormula support back? Thanks!

BenFradet · 2016-04-18T16:45:51Z

@yanboliang yup, no problem.

SparkQA · 2016-04-18T18:28:02Z

Test build #56068 has finished for PR 12467 at commit f05c217.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-19T12:22:58Z

Test build #56220 has finished for PR 12467 at commit c9edad0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-04-19T14:52:56Z

mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala

@@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
    val newModel = testDefaultReadWrite(model)
    checkModelData(model, newModel)
  }
+
+  test("should support all NumericType labels") {


Can we use MLTestingUtils.checkNumericTypes to test this? It will eliminate some redundant code.

It'd work except for the expected exception when dealing with a dataframe containing string labels, because the label column gets indexed by RFormula's fit.
Consequently, an exception is thrown by StringIndexer.

What I could do is add a validateSchema to RFormula (called at the beginning of the fit method) checking that the label column is of numeric type; then I could use MLTestingUtils.checkNumericTypes.

What do you think?

Or simply just:

    val schema = dataset.schema
    SchemaUtils.checkNumericType(schema, $(labelCol))

at the beginning of RFormula's fit method.
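For readers outside the Spark code base, here is a minimal sketch of what such a numeric-type check could look like using only the public `org.apache.spark.sql.types` API. It assumes `SchemaUtils` is internal to Spark and so not callable from user code; `LabelTypeCheck` and `requireNumeric` are hypothetical names, not part of Spark or of this PR.

```scala
import org.apache.spark.sql.types._

object LabelTypeCheck {
  // Hypothetical helper (not part of Spark): fails fast unless `colName`
  // exists in `schema` and has a numeric data type.
  def requireNumeric(schema: StructType, colName: String): Unit = {
    // StructType.apply(name) throws if the column is missing.
    val actual = schema(colName).dataType
    require(
      actual.isInstanceOf[NumericType],
      s"Column $colName must be numeric but was of type $actual.")
  }
}
```

A check like this would reject string labels outright, which is why it does not fit RFormula as-is: RFormula also accepts string labels and indexes them.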

After reviewing the suite, I don't think the same tests apply since RFormula also accepts string labels.

Consequently, I think it's best as is.

@BenFradet Sorry for the late response.
I'm OK with what you have done here for the issue mentioned above. Could you add more tests for the RFormulaModel equality check? Here you have checked resolvedFormula, which is produced by RFormulaParser, rather than the entire RFormula. It would be better to also check the equality of the pipelineModel of RFormulaModel.

Sure, thanks for your input.

SparkQA · 2016-04-19T20:53:14Z

Test build #56258 has finished for PR 12467 at commit 4cb27cf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-22T11:59:05Z

mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala

@@ -254,8 +254,8 @@ class RFormulaModel private[feature](
    val columnNames = schema.map(_.name)
    require(!columnNames.contains($(featuresCol)), "Features column already exists.")
    require(
-      !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType,
-      "Label column already exists and is not of type DoubleType.")
+      !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType.isInstanceOf[NumericType],


Should the || not be &&?

I don't think so, no. What do you think, @yanboliang?

+1 @BenFradet It should be ||

e.g. before this PR this works (and I don't believe it's supposed to?).

scala> val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
original: org.apache.spark.sql.DataFrame = [x: int, y: int]

scala> formula.fit(original).transform(original).show
+---+---+--------+-----+
|  x|  y|features|label|
+---+---+--------+-----+
|  0|  1|   [0.0]|  1.0|
|  2|  2|   [2.0]|  2.0|
+---+---+--------+-----+

And to make it clear that this check is not actually being performed:

scala> val original = sqlContext.createDataFrame(Seq((0, Seq(1)), (2, Seq(2)))).toDF("x", "y")
original: org.apache.spark.sql.DataFrame = [x: int, y: array<int>]

scala> formula.fit(original).transform(original).show
java.lang.IllegalArgumentException: Unsupported type for label: ArrayType(IntegerType,false)
  at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:246)
  at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:211)
  ... 48 elided

... so it's catching it, but at L244 not here.

ah I see now, never mind.
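The `||` versus `&&` question can be made concrete with a small stand-alone sketch. It does not use Spark: `ColType` and its cases are hypothetical stand-ins for Spark SQL data types, and the schema is modelled as a plain `Map`. It assumes, as the diff suggests, that the condition guards the output label column name (`$(labelCol)`), not the input column named in the formula.

```scala
object LabelColumnGuard {
  // Stand-ins for Spark SQL data types; hypothetical, for illustration only.
  sealed trait ColType
  case object IntCol extends ColType
  case object DoubleCol extends ColType
  case object StringCol extends ColType

  def isNumeric(t: ColType): Boolean = t match {
    case IntCol | DoubleCol => true
    case StringCol          => false
  }

  // Same shape as the condition in the diff: the schema is acceptable when
  // the output label column is absent, or when it is present and numeric.
  def acceptsWithOr(schema: Map[String, ColType], labelCol: String): Boolean =
    !schema.contains(labelCol) || isNumeric(schema(labelCol))

  // The `&&` variant. If the column is present, the left operand is false.
  // If it is absent, there is no type to inspect on the right (a safe lookup
  // is used here so the sketch does not throw). Both cases are rejected.
  def acceptsWithAnd(schema: Map[String, ColType], labelCol: String): Boolean =
    !schema.contains(labelCol) && schema.get(labelCol).exists(isNumeric)
}
```

Under these assumptions, a schema with only `x` and `y` columns passes the `||` check whatever the type of `y`, because no column named `label` exists yet. That is consistent with the REPL sessions above, where the unsupported array type is only caught later in `transformLabel`.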

MLnick · 2016-04-27T18:47:45Z

This LGTM. @yanboliang anything further?

yanboliang · 2016-04-28T04:46:34Z

Looks good overall. I have one last inline comment; after that, it should be ready to go.

SparkQA · 2016-04-29T21:29:39Z

Test build #57359 has finished for PR 12467 at commit 79b0f9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BenFradet · 2016-04-29T21:52:23Z

mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala

+      assert(expected.resolvedFormula.label === actual.resolvedFormula.label)
+      assert(expected.resolvedFormula.terms === actual.resolvedFormula.terms)
+      assert(expected.resolvedFormula.hasIntercept === actual.resolvedFormula.hasIntercept)
+    }


@yanboliang is this what you had in mind?

BenFradet · 2016-05-12T11:27:40Z

Ping @yanboliang

yanboliang · 2016-05-12T12:36:03Z

Sorry for the late response. This LGTM; the conflicts just need to be resolved. Thanks!

BenFradet · 2016-05-12T12:40:01Z

@yanboliang thanks a lot!
Will rebase soon.

…d optional

SparkQA · 2016-05-12T16:51:49Z

Test build #58502 has finished for PR 12467 at commit 23f80ee.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

…abel column

SparkQA · 2016-05-12T17:35:12Z

Test build #58503 has finished for PR 12467 at commit 3786ef9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BenFradet · 2016-05-12T19:04:32Z

pinging @MLnick

MLnick · 2016-05-13T07:08:18Z

Merged to master and branch-2.0. Thanks!

…other numeric types for label

## What changes were proposed in this pull request?

Made ChiSqSelector and RFormula accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12467 from BenFradet/SPARK-13961.

(cherry picked from commit 31f1aeb)
Signed-off-by: Nick Pentreath <nick.pentreath@gmail.com>

yanboliang reviewed Apr 19, 2016

MLnick reviewed Apr 22, 2016

BenFradet reviewed Apr 29, 2016

BenFradet added 5 commits May 12, 2016 18:28

ChiSqSelector now accepts numeric labels

fc494d3

spec for ChiSqSelector accepting numeric types

69b6470

RFormula now accepts all numeric type

8c843ef

spec for RFormula accepting all numeric types

a8e5aa0

made the isClassification parameter for the check numeric types metho…

ce19549

…d optional

extended checking when testing RFormula against other types for the l…

3786ef9

…abel column

asfgit closed this in 31f1aeb May 13, 2016
