-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms #12414
Conversation
Test build #55918 has finished for PR 12414 at commit
|
test this please |
Test build #57366 has finished for PR 12414 at commit
|
Test build #62074 has finished for PR 12414 at commit
|
hi - where are we on this? |
(gentle ping) |
@felixcheung @HyukjinKwon This is still active, and I'll push forward this soon. |
Test build #77188 has finished for PR 12414 at commit
|
Test build #77189 has finished for PR 12414 at commit
|
@felixcheung @actuaryzhang Would you mind to have a look at this? Thanks. |
@@ -163,12 +163,20 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) | |||
}.toMap | |||
|
|||
// Then we handle one-hot encoding and interactions between terms. | |||
var hasReferenceCategory = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should hasReferenceCategory
be set at runtime or always initialize to false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because then the 2nd condition !hasReferenceCategory
is always true on L175 https://github.com/apache/spark/pull/12414/files#diff-bedaa993ebc2cc2e0d496859095270feR175
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is just used as an indicator whether a dropLast(false)
has been set.
However, I do think we can have better name than hasReferenceCategory
.
Basically what's needed is just set dropLast(false)
for the first string.
How about something like this:
var firstString = true
if (!hasIntercept && firstString) {
encoder = encoder.setDropLast(false)
firstString = false
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rename hasReferenceCategory
to keepReferenceCategory
with default initial value false
. This is because if users fit with intercept, it will not trigger keepReferenceCategory
, so keepReferenceCategory
always as false
. @actuaryzhang I think firstString
is not an expressive name, if users fit with intercept, RFormula
will check the first string as well, but doesn't keep all categories(and don't switch firstString
). So after running pass this code snippet, firstString
is still true
which will make developer confused.
@@ -129,6 +129,23 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul | |||
assert(result.collect() === expected.collect()) | |||
} | |||
|
|||
test("formula w/o intercept, we should output reference category when encoding string terms") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would also be great to test the impact of the change when there is are interactions present?
Will this create the same design matrix as R?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test for interactions.
Test build #78674 has finished for PR 12414 at commit
|
Test build #78686 has finished for PR 12414 at commit
|
Jenkins, test this please. |
Test build #78692 has finished for PR 12414 at commit
|
LGTM once it clears Jenkins. Thanks. |
Test build #78742 has finished for PR 12414 at commit
|
Test build #78789 has finished for PR 12414 at commit
|
Jenkins, test this please. |
Test build #78793 has finished for PR 12414 at commit
|
Test build #78804 has finished for PR 12414 at commit
|
Merged into master. Thanks for all your review. |
…nce category when encoding string terms ## What changes were proposed in this pull request? Please see [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657) for detail of this bug. I searched online and test some other cases, found when we fit R glm model(or other models powered by R formula) w/o intercept on a dataset including string/category features, one of the categories in the first category feature is being used as reference category, we will not drop any category for that feature. I think we should keep consistent semantics between Spark RFormula and R formula. ## How was this patch tested? Add standard unit tests. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#12414 from yanboliang/spark-14657.
What changes were proposed in this pull request?
Please see SPARK-14657 for detail of this bug.
I searched online and test some other cases, found when we fit R glm model(or other models powered by R formula) w/o intercept on a dataset including string/category features, one of the categories in the first category feature is being used as reference category, we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R formula.
How was this patch tested?
Add standard unit tests.
cc @mengxr