
Add support for PostalCode, SubRegionCode, CountryCode logical types #2946

Merged: eccabay merged 20 commits into main from 2856_postalcode on Oct 25, 2021

Conversation


@eccabay eccabay commented Oct 21, 2021

Closes #2856

@codecov

codecov bot commented Oct 21, 2021

Codecov Report

Merging #2946 (1ed6e42) into main (c11809c) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2946     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        307     307             
  Lines      29049   29197    +148     
=======================================
+ Hits       28958   29106    +148     
  Misses        91      91             
Impacted Files                                          Coverage Δ
evalml/model_understanding/graphs.py                    100.0% <100.0%> (ø)
...es/components/transformers/samplers/oversampler.py   100.0% <100.0%> (ø)
evalml/tests/component_tests/test_oversampler.py        100.0% <100.0%> (ø)
...s/prediction_explanations_tests/test_explainers.py   100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py   99.4% <100.0%> (+0.1%) ⬆️
...understanding_tests/test_permutation_importance.py   100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c11809c...1ed6e42.

@eccabay eccabay marked this pull request as ready for review October 22, 2021 13:32

@freddyaboulton freddyaboulton left a comment

@eccabay Thank you for this! I think this is ready to merge except we should make sure the one-way partial dependence plot is a bar plot for postal code etc instead of a line plot.

@@ -88,8 +88,8 @@ def _get_categorical(self, X):
         X = infer_feature_types(X)
         self.categorical_features = [
             i
-            for i, val in enumerate(X.ww.types["Logical Type"].items())
-            if str(val[1]) in {"Boolean", "Categorical"}
+            for i, val in enumerate(X.ww.semantic_tags.items())

Don't we also need to check for boolean here? Why not do X.ww.select(['category', 'boolean']) ?
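
To make the concern concrete, here is a minimal sketch in plain Python. It does not call Woodwork: `SCHEMA` is a hand-written stand-in for a table's typing information, and the tag sets in it (the `category` tag on the postal and country code types, no tag on the Boolean type) are assumptions for illustration. `categorical_indices` is a hypothetical helper, not evalml code.

```python
# Hand-written stand-in for a typed table: column -> (logical type, semantic tags).
# The tag sets are assumptions for illustration; they are not read from Woodwork.
SCHEMA = {
    "amount": ("Double", {"numeric"}),
    "provider": ("Categorical", {"category"}),
    "zip": ("PostalCode", {"category"}),
    "country": ("CountryCode", {"category"}),
    "is_weekend": ("Boolean", set()),
}


def categorical_indices(schema, include_boolean=True):
    """Return the positions of columns that should be treated as categorical.

    A column qualifies if it carries the "category" semantic tag or,
    when include_boolean is True, if its logical type is Boolean.
    """
    indices = []
    for i, (logical_type, tags) in enumerate(schema.values()):
        is_category = "category" in tags
        is_boolean = include_boolean and logical_type == "Boolean"
        if is_category or is_boolean:
            indices.append(i)
    return indices


# Checking only the semantic tag picks up the new code types but skips the boolean column.
print(categorical_indices(SCHEMA, include_boolean=False))  # [1, 2, 3]
# Adding the explicit Boolean check restores it.
print(categorical_indices(SCHEMA))  # [1, 2, 3, 4]
```

Under these assumptions, switching from a logical-type check to a tag-only check gains PostalCode and CountryCode but loses Boolean, which is why selecting on both is suggested above.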

Wait @bchen1116 I think there might be a bug here on main? If there are either categorical or boolean features in the input along with numerics, the sampler should be SMOTENC not SMOTE right? I thought that's why we had changed the one hot encoder to encode the created features as boolean?

Repro on main

from evalml.automl import AutoMLSearch
from evalml.demos import load_fraud
import imblearn.over_sampling as imb

X, y = load_fraud(100)
X = X.ww[["provider", "country", "amount", "region"]]

automl = AutoMLSearch(X, y, "binary", verbose=True)
automl.search()

pipeline_3 = automl.get_pipeline(3)
pipeline_3.fit(X, y)
assert pipeline_3.get_component("Oversampler").sampler == imb.SMOTE

Not blocking this PR since it preserves this behavior, but if this is a bug we should file another issue.

@bchen1116 bchen1116 Oct 22, 2021

@freddyaboulton Yeah, you're right! That's a good catch, this should be SMOTENC in the case that there are both numeric and categorical features. It's likely through this line. I think we should be grabbing both categoricals and booleans, not just categoricals.

I can file an issue here!
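
For reference, the selection rule being discussed can be sketched as follows. This is a plain-Python illustration, not the evalml Oversampler implementation: `choose_oversampler` is a hypothetical helper that returns the name of the imbalanced-learn sampler class, and the per-column flags are assumed to already count boolean columns as categorical.

```python
def choose_oversampler(is_categorical):
    """Pick an imbalanced-learn sampler name from per-column categorical flags.

    SMOTE handles all-numeric data, SMOTEN handles all-categorical data,
    and SMOTENC handles a mix of the two.
    """
    flags = list(is_categorical)
    if not flags:
        raise ValueError("need at least one feature column")
    if all(flags):
        return "SMOTEN"
    if any(flags):
        return "SMOTENC"
    return "SMOTE"


# Flags for provider, country, amount, region, assuming amount is the only numeric column.
print(choose_oversampler([True, True, False, True]))  # SMOTENC
```

With mixed inputs like the repro above, this rule yields SMOTENC, so an assertion that the sampler is SMOTE passing on main points to categorical or boolean columns being missed when the flags are computed.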

Thank you @bchen1116 !

evalml/model_understanding/graphs.py (resolved)

@bchen1116 bchen1116 left a comment

LGTM! Just left one nitpick for clarity because I was confused for a while.

evalml/model_understanding/graphs.py (resolved)
@eccabay eccabay merged commit 910fbd0 into main Oct 25, 2021
@eccabay eccabay deleted the 2856_postalcode branch October 25, 2021 12:42
@chukarsten chukarsten mentioned this pull request Oct 27, 2021
Successfully merging this pull request may close these issues.

Add support for PostalCode, SubRegionCode, CountryCode logical types
3 participants