Factor treatment designTreatmentsC #1
rsuhada, thanks for your issue. I definitely see the problem you are describing. To repeat: in the vtreat.html example we process data as follows: library('vtreat')

Contrary to the claims in the document, the derived variable x_lev_x.b is not scaled and doesn't meet the expected scaling claims (mean zero, and a slope of 1 when regressed against y). The cause appears to be that the conditional distribution of y|x_lev_x.b is equal to the base distribution of y and also equal to the complementary distribution y|!x_lev_x.b. This means the variable x_lev_x.b is perfectly independent of y, so the scaling slope would have to be zero (which it is). It looks like the code detected this exceptional condition and skipped scaling this variable (instead of writing NA, throwing an error, or dropping the variable). We would not have seen this variable if we had pruned with pruneSig<1.

I am guessing the best the library can do is get the mean to zero (why not) and fix the documentation/example on this one. Sorry about sowing a bit of confusion there. We will incorporate a fix today.
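The independence argument above can be checked numerically. What follows is a minimal Python sketch, not vtreat code: the six-row data set and the `ols_slope` helper are made up for illustration, and it assumes the scaling slope is an ordinary least-squares slope of y on the 0/1 indicator.

```python
from statistics import fmean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data (not the tutorial's): the indicator is on in 1/3 of the
# rows, and y has the same rate of 1s whether the indicator is on or off.
indicator = [1, 0, 0, 1, 0, 0]
y = [1, 1, 0, 0, 0, 1]

# Conditional rates of y given the indicator, plus the base rate.
rate_on = fmean(yi for xi, yi in zip(indicator, y) if xi == 1)
rate_off = fmean(yi for xi, yi in zip(indicator, y) if xi == 0)
rate_all = fmean(y)

# Independence makes the covariance, and hence the slope, zero
# (up to floating-point error).
slope = ols_slope(indicator, y)

# Centering is still possible: it moves the mean to zero without needing
# a usable slope.
centered = [xi - fmean(indicator) for xi in indicator]

print("rates:", rate_on, rate_off, rate_all)
print("uncentered mean:", fmean(indicator))
print("slope:", slope)
print("centered mean:", fmean(centered))
```

With a slope of zero there is no multiplier that makes the regression slope 1, which is consistent with centering being the only adjustment still available for such a variable.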
rsuhada, thanks for your issue report! I have pushed a fix to the dev version of vtreat (https://github.com/WinVector/vtreat); since pruning did the right thing, I am going to put off pushing this to CRAN. I have also fixed some documentation, added a test (to try to catch any reversion), and produced a vignette explaining the nuance of the issue: http://winvector.github.io/vtreathtml/vtreatScaleMode.html
rsuhada commented Apr 18, 2016
Thank you for the nice package!
I've just gone through the initial tutorial and noticed (and fully reproduced) an odd result in:
https://winvector.github.io/vtreathtml/vtreat.html
After applying the treatment from designTreatmentsC, we test
Neither of these is true for x_lev_x.b:
mean is 3.333333e-01
slope is -1.442222e-16
For all the other variables, as well as for everything in the designTreatmentsN example, it all seems OK.
Is this expected behavior?
Oddly, x_lev_x.b is coded with 0s and 1s, while the other variables are fractional.
Thank you very much!