Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms #12414

Closed
wants to merge 3 commits into from

Conversation

yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Apr 15, 2016

What changes were proposed in this pull request?

Please see SPARK-14657 for detail of this bug.
I searched online and test some other cases, found when we fit R glm model(or other models powered by R formula) w/o intercept on a dataset including string/category features, one of the categories in the first category feature is being used as reference category, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R formula.

How was this patch tested?

Add standard unit tests.

cc @mengxr

@yanboliang yanboliang changed the title [SPARK-14657] [ML] RFormula w/o intercept, we should output reference category when encoding string terms [SPARK-14657] [ML] RFormula w/o intercept should output reference category when encoding string terms Apr 15, 2016
@SparkQA
Copy link

SparkQA commented Apr 15, 2016

Test build #55918 has finished for PR 12414 at commit 167beae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

test this please

@SparkQA
Copy link

SparkQA commented Apr 29, 2016

Test build #57366 has finished for PR 12414 at commit 167beae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62074 has finished for PR 12414 at commit 167beae.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

hi - where are we on this?
@yanboliang could you add [SPARKR] to the title

@HyukjinKwon
Copy link
Member

(gentle ping)

@yanboliang yanboliang changed the title [SPARK-14657] [ML] RFormula w/o intercept should output reference category when encoding string terms [SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms May 19, 2017
@yanboliang
Copy link
Contributor Author

@felixcheung @HyukjinKwon This is still active, and I'll push forward this soon.

@SparkQA
Copy link

SparkQA commented May 22, 2017

Test build #77188 has finished for PR 12414 at commit bd84af5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 22, 2017

Test build #77189 has finished for PR 12414 at commit 9c884f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

@felixcheung @actuaryzhang Would you mind to have a look at this? Thanks.

@@ -163,12 +163,20 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
}.toMap

// Then we handle one-hot encoding and interactions between terms.
var hasReferenceCategory = false
Copy link
Member

@felixcheung felixcheung May 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should hasReferenceCategory be set at runtime or always initialize to false?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because then the 2nd condition !hasReferenceCategory is always true on L175 https://github.com/apache/spark/pull/12414/files#diff-bedaa993ebc2cc2e0d496859095270feR175

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just used as an indicator whether a dropLast(false) has been set.
However, I do think we can have better name than hasReferenceCategory.
Basically what's needed is just set dropLast(false) for the first string.
How about something like this:

var firstString = true 
if (!hasIntercept && firstString) {
          encoder = encoder.setDropLast(false)
          firstString = false
}

Copy link
Contributor Author

@yanboliang yanboliang Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rename hasReferenceCategory to keepReferenceCategory with default initial value false. This is because if users fit with intercept, it will not trigger keepReferenceCategory, so keepReferenceCategory always as false. @actuaryzhang I think firstString is not an expressive name, if users fit with intercept, RFormula will check the first string as well, but doesn't keep all categories(and don't switch firstString). So after running pass this code snippet, firstString is still true which will make developer confused.

@@ -129,6 +129,23 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
assert(result.collect() === expected.collect())
}

test("formula w/o intercept, we should output reference category when encoding string terms") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would also be great to test the impact of the change when there is are interactions present?
Will this create the same design matrix as R?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test for interactions.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78674 has finished for PR 12414 at commit 097da70.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78686 has finished for PR 12414 at commit 097da70.

  • This patch fails due to an unknown error code, -10.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78692 has finished for PR 12414 at commit 097da70.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor

LGTM once it clears Jenkins. Thanks.

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78742 has finished for PR 12414 at commit 097da70.

  • This patch fails due to an unknown error code, -10.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78789 has finished for PR 12414 at commit 097da70.

  • This patch fails due to an unknown error code, -10.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78793 has finished for PR 12414 at commit 097da70.

  • This patch fails due to an unknown error code, -10.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang closed this Jun 28, 2017
@yanboliang yanboliang reopened this Jun 28, 2017
@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78804 has finished for PR 12414 at commit 15bb587.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

Merged into master. Thanks for all your review.

@asfgit asfgit closed this in 0c8444c Jun 29, 2017
@yanboliang yanboliang deleted the spark-14657 branch June 29, 2017 02:35
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…nce category when encoding string terms

## What changes were proposed in this pull request?

Please see [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657) for detail of this bug.
I searched online and test some other cases, found when we fit R glm model(or other models powered by R formula) w/o intercept on a dataset including string/category features, one of the categories in the first category feature is being used as reference category, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R formula.
## How was this patch tested?

Add standard unit tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#12414 from yanboliang/spark-14657.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants