-
Notifications
You must be signed in to change notification settings - Fork 92
Add support for PostalCode, SubRegionCode, CountryCode logical types #2946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2946 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 307 307
Lines 29049 29197 +148
=======================================
+ Hits 28958 29106 +148
Misses 91 91
Continue to review full report at Codecov.
|
freddyaboulton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay Thank you for this! I think this is ready to merge except we should make sure the one-way partial dependence plot is a bar plot for postal code etc instead of a line plot.
| i | ||
| for i, val in enumerate(X.ww.types["Logical Type"].items()) | ||
| if str(val[1]) in {"Boolean", "Categorical"} | ||
| for i, val in enumerate(X.ww.semantic_tags.items()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we also need to check for boolean here? Why not do X.ww.select(['category', 'boolean']) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait @bchen1116 I think there might be a bug here on main? If there are either categorical or boolean features in the input along with numerics, the sampler should be SMOTENC not SMOTE right? I thought that's why we had changed the one hot encoder to encode the created features as boolean?
Repro on main
from evalml.automl import AutoMLSearch
from evalml.demos import load_fraud
import imblearn.over_sampling as imb
X, y = load_fraud(100)
X = X.ww[["provider", "country", "amount", "region"]]
automl = AutoMLSearch(X, y, "binary", verbose=True)
automl.search()
pipeline_3 = automl.get_pipeline(3)
pipeline_3.fit(X, y)
assert pipeline_3.get_component("Oversampler").sampler == imb.SMOTENot blocking this pr since it preserves this behavior but if this is a bug we should file another issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton Yeah, you're right! That's a good catch, this should be SMOTENC in the case thata there are both numeric and categorical. It's likely through this line. I think we should be grabbing both categorical and booleans, not just categoricals.
I can file an issue here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @bchen1116 !
bchen1116
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just left one nit pick just for clarity bcause I was confused for a while.
Closes #2856