-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for PostalCode, SubRegionCode, CountryCode logical types #2946
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2946 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 307 307
Lines 29049 29197 +148
=======================================
+ Hits 28958 29106 +148
Misses 91 91
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay Thank you for this! I think this is ready to merge except we should make sure the one-way partial dependence plot is a bar plot for postal code etc instead of a line plot.
@@ -88,8 +88,8 @@ def _get_categorical(self, X): | |||
X = infer_feature_types(X) | |||
self.categorical_features = [ | |||
i | |||
for i, val in enumerate(X.ww.types["Logical Type"].items()) | |||
if str(val[1]) in {"Boolean", "Categorical"} | |||
for i, val in enumerate(X.ww.semantic_tags.items()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we also need to check for boolean here? Why not do X.ww.select(['category', 'boolean']) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait @bchen1116 I think there might be a bug here on main? If there are either categorical or boolean features in the input along with numerics, the sampler should be SMOTENC
not SMOTE
right? I thought that's why we had changed the one hot encoder to encode the created features as boolean?
Repro on main
from evalml.automl import AutoMLSearch
from evalml.demos import load_fraud
import imblearn.over_sampling as imb
X, y = load_fraud(100)
X = X.ww[["provider", "country", "amount", "region"]]
automl = AutoMLSearch(X, y, "binary", verbose=True)
automl.search()
pipeline_3 = automl.get_pipeline(3)
pipeline_3.fit(X, y)
assert pipeline_3.get_component("Oversampler").sampler == imb.SMOTE
Not blocking this pr since it preserves this behavior but if this is a bug we should file another issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton Yeah, you're right! That's a good catch, this should be SMOTENC
in the case thata there are both numeric and categorical. It's likely through this line. I think we should be grabbing both categorical and booleans, not just categoricals.
I can file an issue here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @bchen1116 !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just left one nit pick just for clarity bcause I was confused for a while.
Closes #2856