-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply standard scalar only to numeric columns #3686
Conversation
@@ -21,14 +21,31 @@ class StandardScaler(Transformer): | |||
def __init__(self, random_seed=0, **kwargs): | |||
parameters = {} | |||
parameters.update(kwargs) | |||
|
|||
self._supported_types = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this covers all the types we want to manage here.
Codecov Report
@@ Coverage Diff @@
## main #3686 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 339 339
Lines 34239 34251 +12
=======================================
+ Hits 34108 34122 +14
+ Misses 131 129 -2
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@@ -41,15 +68,20 @@ def transform(self, X, y=None): | |||
""" | |||
X = infer_feature_types(X) | |||
X = X.ww.select(exclude=["datetime"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dropping datetime columns like this is weird but we're doing this to keep this compatible with partial dependence. Just going to leave this here for now but I do the same here and removed the separate dropping in fit_transform()
evalml/tests/model_understanding_tests/test_partial_dependence.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/scalers/standard_scaler.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid! I just left a few nitpicks and conciseness comments, but nothing blocking.
""" | ||
X = infer_feature_types(X) | ||
X_can_scale_columns = X.ww.select(self._supported_types) | ||
X_can_scale_columns = X_can_scale_columns.ww.select(exclude=["datetime"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why we need to drop datetime in transform
, but I don't think we need to here, shouldn't datetime columns already be excluded since it's not a supported type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using this to keep a reference of the columns that scaling is applied to in the training data!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah nvm I got what you mean!
evalml/pipelines/components/transformers/scalers/standard_scaler.py
Outdated
Show resolved
Hide resolved
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == { | ||
0: Double, | ||
} | ||
def test_standard_scaler_applies_to_numeric_columns_only(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only a nit as it relates to my current work with hardening our tests against woodwork changes - using pytest fixtures for tests means we have all our woodwork typing consolidated and have to change fewer things should we need to tweak types for compatibility reasons. Plus, it should reduce the number of lines of code here.
I think either the get_test_data_from_configuration
or the imputer_test_data
fixture should have the same column types you're looking for.
0.2821023107719306: 1, | ||
0.08407392065748104: 63, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the fraud test data set contains Categoricals
that are being ignored now.
No description provided.