Apply standard scalar only to numeric columns #3686

Merged
merged 17 commits into from
Sep 6, 2022

jeremyliweishih
Collaborator

No description provided.

@@ -21,14 +21,31 @@ class StandardScaler(Transformer):
    def __init__(self, random_seed=0, **kwargs):
        parameters = {}
        parameters.update(kwargs)

        self._supported_types = [
Collaborator Author

I think this covers all the types we want to manage here.
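For readers following along, here is a minimal pure-Python sketch of the idea under review: standardize only the columns whose values are numeric and pass everything else through untouched. It is not evalml's implementation. `SUPPORTED_TYPES`, `is_scalable`, and `scale_numeric_columns` are hypothetical names, a plain dict of lists stands in for the Woodwork-typed DataFrame, and the actual contents of `self._supported_types` are truncated in the hunk above and are not reproduced here.

```python
from statistics import fmean, pstdev

# Hypothetical stand-in for the component's supported-type list; the real
# list in the PR is truncated in the hunk above.
SUPPORTED_TYPES = (int, float)


def is_scalable(values):
    """Return True for a non-empty column of non-boolean ints or floats."""
    return bool(values) and all(
        isinstance(v, SUPPORTED_TYPES) and not isinstance(v, bool) for v in values
    )


def scale_numeric_columns(table):
    """Standardize numeric columns; pass all other columns through unchanged."""
    scaled = {}
    for name, values in table.items():
        if not is_scalable(values):
            scaled[name] = list(values)
            continue
        mean = fmean(values)
        # Population standard deviation; fall back to 1.0 for constant columns
        # so the division below never hits zero.
        std = pstdev(values) or 1.0
        scaled[name] = [(v - mean) / std for v in values]
    return scaled


table = {
    "amount": [1.0, 2.0, 3.0],
    "count": [10, 20, 30],
    "region": ["a", "b", "a"],
}
result = scale_numeric_columns(table)
```

In this sketch `result["region"]` is returned as-is, while `amount` and `count` end up with zero mean and unit population variance.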

@codecov
codecov bot commented Aug 30, 2022

Codecov Report

Merging #3686 (3e27d95) into main (31f4715) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3686     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        339     339             
  Lines      34239   34251     +12     
=======================================
+ Hits       34108   34122     +14     
+ Misses       131     129      -2     
Impacted Files                                           Coverage Δ
...ansformers/preprocessing/replace_nullable_types.py    100.0% <ø> (ø)
...del_understanding_tests/test_partial_dependence.py    100.0% <ø> (ø)
...valml/tests/pipeline_tests/test_component_graph.py    99.9% <ø> (ø)
...components/transformers/scalers/standard_scaler.py    100.0% <100.0%> (ø)
...alml/tests/component_tests/test_standard_scaler.py    100.0% <100.0%> (+10.6%) ⬆️
evalml/utils/woodwork_utils.py                           100.0% <100.0%> (ø)

@@ -41,15 +68,20 @@ def transform(self, X, y=None):
        """
        X = infer_feature_types(X)
        X = X.ww.select(exclude=["datetime"])
Collaborator Author

I think dropping datetime columns like this is weird, but we do it to stay compatible with partial dependence. I'm going to leave it as is for now. I do the same drop here and have removed the separate drop in fit_transform().

@jeremyliweishih marked this pull request as ready for review on September 2, 2022, 14:48
Contributor

@eccabay left a comment

Looks solid! I just left a few nitpicks and conciseness comments, but nothing blocking.

        """
        X = infer_feature_types(X)
        X_can_scale_columns = X.ww.select(self._supported_types)
        X_can_scale_columns = X_can_scale_columns.ww.select(exclude=["datetime"])
Contributor

I understand why we need to drop datetime in transform, but I don't think we need to here. Shouldn't datetime columns already be excluded, since datetime is not a supported type?

Collaborator Author

I'm using this to keep a reference of the columns that scaling is applied to in the training data!

Collaborator Author

Ah, never mind, I see what you mean!
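The point settled in this thread can be sketched in plain Python: once fit selects columns by supported numeric types, datetime columns are already left out, so a second datetime exclusion in fit is redundant, and the fitted column set can be kept as a reference for transform. This is an illustrative toy, not the merged code. `NumericOnlyScaler` and `scaled_columns` are hypothetical names, and a dict of lists stands in for the DataFrame. The drop of datetime columns in `transform` mirrors the `exclude=["datetime"]` line in the transform hunk.

```python
from datetime import datetime
from statistics import fmean, pstdev


class NumericOnlyScaler:
    """Toy scaler that remembers which columns it was fitted on."""

    def fit(self, table):
        # Selecting by supported (numeric) types already leaves datetime
        # columns out, so no separate datetime exclusion is needed here.
        self._stats = {
            name: (fmean(values), pstdev(values) or 1.0)
            for name, values in table.items()
            if values and all(type(v) in (int, float) for v in values)
        }
        return self

    @property
    def scaled_columns(self):
        """Names of the columns that scaling was fitted on."""
        return list(self._stats)

    def transform(self, table):
        out = {}
        for name, values in table.items():
            if values and all(isinstance(v, datetime) for v in values):
                continue  # datetime columns are dropped, as in the hunk above
            if name in self._stats:
                mean, std = self._stats[name]
                out[name] = [(v - mean) / std for v in values]
            else:
                out[name] = list(values)  # unsupported types pass through
        return out


train = {
    "x": [0.0, 2.0],
    "when": [datetime(2022, 9, 1), datetime(2022, 9, 2)],
    "label": ["a", "b"],
}
scaler = NumericOnlyScaler().fit(train)
out = scaler.transform(train)
```

Here `scaler.scaled_columns` contains only `"x"`, even though fit never mentions datetime explicitly.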

    assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {
        0: Double,
    }


def test_standard_scaler_applies_to_numeric_columns_only():
Contributor

This is only a nit as it relates to my current work with hardening our tests against woodwork changes - using pytest fixtures for tests means we have all our woodwork typing consolidated and have to change fewer things should we need to tweak types for compatibility reasons. Plus, it should reduce the number of lines of code here.
I think either the get_test_data_from_configuration or the imputer_test_data fixture should have the same column types you're looking for.

Comment on lines -529 to +519
0.2821023107719306: 1,
0.08407392065748104: 63,
Contributor

Why did this change?

Collaborator Author

The fraud test dataset contains Categorical columns, which are now ignored.

docs/source/release_notes.rst (outdated, resolved)
@jeremyliweishih merged commit 5afdbd8 into main on Sep 6, 2022
@jeremyliweishih deleted the js_test_scalar branch on September 6, 2022, 17:15
4 participants