[SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms #12414

yanboliang · 2016-04-15T10:12:27Z

What changes were proposed in this pull request?

Please see SPARK-14657 for details of this bug.
I searched online and tested some other cases, and found that when we fit an R glm model (or other models powered by R formula) w/o intercept on a dataset including string/category features, the reference category of the first category feature is output as well, i.e. R does not drop any category for that feature.
I think we should keep the semantics of Spark RFormula consistent with R formula.
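To make the description above concrete, here is a minimal sketch in plain Scala (no Spark dependency) of one-hot encoding a string term with and without a dropped reference category. The object name `ReferenceCategoryDemo` and the helper `oneHot` are hypothetical and are not part of this patch; the sketch also assumes the dropped category is the last level, mirroring the `dropLast` naming used by the encoder.

```scala
// Illustrative only: hypothetical names, not code from this pull request.
object ReferenceCategoryDemo {

  /** One-hot encode `value` against the ordered `levels`.
    * When `dropLast` is true, the last level acts as the reference
    * category and is encoded as an all-zero vector (assumption: the
    * reference category is the last level).
    */
  def oneHot(value: String, levels: Seq[String], dropLast: Boolean): Seq[Double] = {
    val kept = if (dropLast) levels.dropRight(1) else levels
    kept.map(level => if (level == value) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    val levels = Seq("a", "b", "c")

    // With an intercept: one category is dropped, so 3 levels give 2 columns.
    println(oneHot("c", levels, dropLast = true))  // List(0.0, 0.0)

    // Without an intercept: the first string term keeps every category,
    // so 3 levels give 3 columns and "c" gets its own indicator.
    println(oneHot("c", levels, dropLast = false)) // List(0.0, 0.0, 1.0)
  }
}
```

With an intercept, the all-zero row is distinguishable because the intercept column carries the reference category; without one, dropping a category would leave that category with no column at all, which is why the full set of indicators is kept.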

How was this patch tested?

Add standard unit tests.

cc @mengxr

SparkQA · 2016-04-15T10:53:27Z

Test build #55918 has finished for PR 12414 at commit 167beae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-29T21:45:53Z

test this please

SparkQA · 2016-04-29T22:28:27Z

Test build #57366 has finished for PR 12414 at commit 167beae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-11T05:19:35Z

Test build #62074 has finished for PR 12414 at commit 167beae.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-09T05:51:49Z

hi - where are we on this?
@yanboliang could you add [SPARKR] to the title

HyukjinKwon · 2017-05-11T12:24:26Z

(gentle ping)

yanboliang · 2017-05-19T17:03:35Z

@felixcheung @HyukjinKwon This is still active, and I'll push it forward soon.

SparkQA · 2017-05-22T14:55:09Z

Test build #77188 has finished for PR 12414 at commit bd84af5.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-22T16:34:02Z

Test build #77189 has finished for PR 12414 at commit 9c884f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-05-24T13:55:47Z

@felixcheung @actuaryzhang Would you mind having a look at this? Thanks.

felixcheung · 2017-05-24T16:54:19Z

mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala

@@ -163,12 +163,20 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
    }.toMap

    // Then we handle one-hot encoding and interactions between terms.
+    var hasReferenceCategory = false


Should hasReferenceCategory be set at runtime, or is it always initialized to false?

Because then the 2nd condition, !hasReferenceCategory, is always true on L175 https://github.com/apache/spark/pull/12414/files#diff-bedaa993ebc2cc2e0d496859095270feR175

actuaryzhang · May 24, 2017

I think this is just used as an indicator of whether dropLast(false) has been set.
However, I do think we can find a better name than hasReferenceCategory.
Basically, all that's needed is to set dropLast(false) for the first string term.
How about something like this:

    var firstString = true
    if (!hasIntercept && firstString) {
      encoder = encoder.setDropLast(false)
      firstString = false
    }

yanboliang · Jun 27, 2017

I renamed hasReferenceCategory to keepReferenceCategory, with a default initial value of false. If users fit with an intercept, keepReferenceCategory is never triggered, so it always stays false. @actuaryzhang I think firstString is not an expressive name: if users fit with an intercept, RFormula still checks the first string term, but doesn't keep all categories (and doesn't switch firstString). So after running past this code snippet, firstString is still true, which would confuse developers.
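The flag logic discussed in this thread can be sketched as follows. This is a standalone illustration in plain Scala, not the code in RFormula.scala: `Term`, `dropLastPerTerm` and the object name are hypothetical, and only the names `hasIntercept` and `keepReferenceCategory` come from the discussion.

```scala
// Illustrative only: a simplified model of the per-term decision, assuming
// terms are visited in formula order and only string terms are encoded.
object KeepReferenceCategoryDemo {

  // Hypothetical stand-in for a resolved formula term.
  final case class Term(name: String, isString: Boolean)

  /** Returns, for each string term, whether its encoder should drop the
    * last category. Without an intercept, only the first string term
    * keeps all of its categories.
    */
  def dropLastPerTerm(terms: Seq[Term], hasIntercept: Boolean): Map[String, Boolean] = {
    // Becomes true once one string term has kept its reference category.
    // With an intercept it is never triggered and stays false.
    var keepReferenceCategory = false
    terms.filter(_.isString).map { term =>
      val dropLast =
        if (!hasIntercept && !keepReferenceCategory) {
          keepReferenceCategory = true
          false // keep every category for this term
        } else {
          true // drop the reference category as usual
        }
      term.name -> dropLast
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val terms = Seq(Term("x", isString = false), Term("a", isString = true), Term("b", isString = true))
    println(dropLastPerTerm(terms, hasIntercept = false)) // a keeps all categories, b drops one
    println(dropLastPerTerm(terms, hasIntercept = true))  // both a and b drop one
  }
}
```

Under this reading the flag records that a reference category has already been kept, which is the state the name keepReferenceCategory is meant to convey.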

actuaryzhang · 2017-05-25T00:01:37Z

mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala

@@ -129,6 +129,23 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
    assert(result.collect() === expected.collect())
  }

+  test("formula w/o intercept, we should output reference category when encoding string terms") {


It would also be great to test the impact of the change when interactions are present.
Will this create the same design matrix as R?

yanboliang · Jun 27, 2017

Added a test for interactions.

SparkQA · 2017-06-27T03:59:27Z

Test build #78674 has finished for PR 12414 at commit 097da70.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-27T09:15:08Z

Test build #78686 has finished for PR 12414 at commit 097da70.

This patch fails due to an unknown error code, -10.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-06-27T10:12:52Z

Jenkins, test this please.

SparkQA · 2017-06-27T10:53:12Z

Test build #78692 has finished for PR 12414 at commit 097da70.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-06-27T22:29:55Z

LGTM once it clears Jenkins. Thanks.

SparkQA · 2017-06-28T03:09:56Z

Test build #78742 has finished for PR 12414 at commit 097da70.

This patch fails due to an unknown error code, -10.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-28T11:00:50Z

Test build #78789 has finished for PR 12414 at commit 097da70.

This patch fails due to an unknown error code, -10.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-06-28T11:06:25Z

Jenkins, test this please.

SparkQA · 2017-06-28T11:56:36Z

Test build #78793 has finished for PR 12414 at commit 097da70.

This patch fails due to an unknown error code, -10.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-28T14:39:37Z

Test build #78804 has finished for PR 12414 at commit 15bb587.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-06-29T02:32:08Z

Merged into master. Thanks for all your reviews.

…nce category when encoding string terms

## What changes were proposed in this pull request?

Please see [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657) for detail of this bug.
I searched online and test some other cases, found when we fit R glm model(or other models powered by R formula) w/o intercept on a dataset including string/category features, one of the categories in the first category feature is being used as reference category, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R formula.

## How was this patch tested?

Add standard unit tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#12414 from yanboliang/spark-14657.

yanboliang changed the title ~~[SPARK-14657] [ML] RFormula w/o intercept, we should output reference category when encoding string terms~~ [SPARK-14657] [ML] RFormula w/o intercept should output reference category when encoding string terms Apr 15, 2016

yanboliang changed the title ~~[SPARK-14657] [ML] RFormula w/o intercept should output reference category when encoding string terms~~ [SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms May 19, 2017

yanboliang force-pushed the spark-14657 branch from 167beae to bd84af5 Compare May 22, 2017 14:45

felixcheung reviewed May 24, 2017

actuaryzhang reviewed May 25, 2017

yanboliang force-pushed the spark-14657 branch from 9c884f9 to 097da70 Compare June 27, 2017 03:17

felixcheung approved these changes Jun 28, 2017

yanboliang closed this Jun 28, 2017

yanboliang reopened this Jun 28, 2017

yanboliang added 2 commits June 28, 2017 21:41

formula w/o intercept, we should output reference category when encod…

348244e

…ing string terms

Update test suites.

8312090

Rename to keepReferenceCategory and add test cases.

15bb587

yanboliang force-pushed the spark-14657 branch from 097da70 to 15bb587 Compare June 28, 2017 13:41

asfgit closed this in 0c8444c Jun 29, 2017

yanboliang deleted the spark-14657 branch June 29, 2017 02:35
