Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update dfs transformer features arg to be empty list for fast mode #3875

Merged
merged 15 commits into from
Dec 7, 2022

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Dec 5, 2022

Closes #3874

Avoids passing the DFS Transformer's full list of features into the cloned pipeline for partial dependence fast mode. To achieve this, updates the DFS Transformer to allow an empty list as input for the features arg.

Note: This PR is currently set to merge into #3873 because it's dependent on changes made in that PR. We should merge that first and then merge this into main.

@tamargrey tamargrey force-pushed the fix-fast-mode-features-arg branch from 623ad8d to b497fb8 Compare December 5, 2022 22:50
@codecov
Copy link

codecov bot commented Dec 5, 2022

Codecov Report

Merging #3875 (6c577a7) into main (9fedf3b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3875     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        344     344             
  Lines      36209   36262     +53     
=======================================
+ Hits       36072   36125     +53     
  Misses       137     137             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.5% <ø> (ø)
...ponents/transformers/preprocessing/featuretools.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.5% <100.0%> (+0.1%) ⬆️
evalml/tests/component_tests/test_featuretools.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -27,7 +27,7 @@ def __init__(self, index="index", features=None, random_seed=0, **kwargs):

self.index = index
self.features = features
self._passed_in_features = True if features else None
self._passed_in_features = True if features is not None else None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this fix, I needed the ability to set the parameters of the DFS Transformer so that a transform call would be no op. One option I considered was only passing in the IdentityFeature objects in at _handle_partial_dependence_fast_mode, but I felt that left some ambiguity as to what the behavior would be.

Therefore, I updated the behavior here to no longer treat an empty list passed in for features the same as features being unspecified. This change made sense to me, but I wanted to highlight it in case anyone disagrees or thinks this could be problematic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little curious about this. It opens up the opportunity for a user to pass in features=[] during AutoMLSearch, which would turn the whole component into a no-op. My intuition says we might want to add some sort of warning to make users aware of what they're doing - though it may be a weird warning in the pd fast mode case since users aren't controlling it there.
Does anyone else have an opinion about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the warning could come from AutoMLSearch itself. If it sees that a user is trying to have a DFS Transformer in automl search with no features, it could raise that warning. If it's passed directly into a pipeline, maybe we're okay not warning users since it might be more clear to users at that point what the impact would be.

Then again, my goal here was specifically to have no-op functionality in the DFS Transformer, but I definitely get the discomfort. At the very least, I need to update the dosctring making this explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the warning within AutoMLSearch is a good idea. Having it in the docstring is also a super good callout, sounds like a good path forward to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out AutomlSearch doesn’t add the DFS Transformer to pipelines if features is an empty list.

We definitely add it to the internal search parameters here because we explicitly check if it’s None or not, but when we make_pipeline we exclude it if features is falsy: https://github.com/alteryx/evalml/blob/main/evalml/pipelines/utils.py#L412. We do the same for split pipelines. So currently in automl, an empty features list already is no-op.

My preferance here would be to not touch the automl code (though, maybe it’d be worth adding a note to the docstring there about the behavior when features is an empty list). And it makes me more confident that we probably want this behavior for the component itself as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I do think the distinction between an empty list and None can be murky to new users. I think it's mostly mitigated by a docstring callout though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, updated the automl docstring to make this clear there as well


assert not mock_calculate_feature_matrix.called
# --> Going through automl, because we have to instantiate the DFSTransformer to make a pipeline directly,
# which means we don't get the updated parameters applied, causing this test to fail
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an odd behavior I noticed while testing: Calling pipeline.new with an updated features parameter for the DFS Transformer wasn't resetting the features on the new pipeline. After digging into what was happening, it was because I was defining my pipeline with an instantiated DFSTransformer object DFSTransformer(features=features) so that I could specify features.

It seems at handle_component_class, we’re returning the instantiated component_class with features defined, which is why the features param is ignored when we try to instantiate the component with the new parameters.

Going to automl was my workaround for this, but I feel like there must be a better way. Is this a known behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I do need to keep this, I'll plan to use the AutoMLTestEnv

@tamargrey tamargrey force-pushed the fix-fast-mode-features-arg branch from 136edbb to a412921 Compare December 6, 2022 16:08
Base automatically changed from fix-dfs-feature-df-names to main December 6, 2022 16:49
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Just left a few clarification questions.

@@ -27,7 +27,7 @@ def __init__(self, index="index", features=None, random_seed=0, **kwargs):

self.index = index
self.features = features
self._passed_in_features = True if features else None
self._passed_in_features = True if features is not None else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little curious about this. It opens up the opportunity for a user to pass in features=[] during AutoMLSearch, which would turn the whole component into a no-op. My intuition says we might want to add some sort of warning to make users aware of what they're doing - though it may be a weird warning in the pd fast mode case since users aren't controlling it there.
Does anyone else have an opinion about this?

Comment on lines 169 to 172
# Pass in empty list of features so we don't run calculate feature matrix
# which would happen with the full set of features for a single column at refit
dfs_transformer["features"] = []
pipeline_parameters["DFS Transformer"] = dfs_transformer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A clarification for my own understanding: we only use this function with the single column pipeline, correct? This code would never run when we're generating the transformations from the full dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - the parameters returned here will only be used to train a pipeline on a single column, and the check above this that confirms all features have corresponding columns in X confirms that calculate feature matrix is never called in the original pipeline; this parameter change makes that the case for the single column as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to just do:

pipeline_parameters["DFS Transformer"]["features"] = []

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep totally doable

Comment on lines +253 to +254
"""Confirms that the features arg being an empty list is not treated the same as
it being unspecified."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be explicit about testing this? i.e. generate two DFS objects, one with features=[] and one with features=None and test the differences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check with features=None to this test and confirmed that dfs and calculate feature matrix get called and that the component results in features being made


assert part_dep.feature_values.notnull().all()
assert part_dep.partial_dependence.notnull().all()
pd.testing.assert_frame_equal(part_dep, fast_part_dep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question just for my own understanding: does this test fail with our current main branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does fail, as calculate_feature_matrix gets called in fast mode causing the mock checks above to fail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. I guess I'm more concerned about the pd.testing.assert_frame_equal(part_dep, fast_part_dep) line specifically, since that's what this PR is aimed at combatting

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just something to remove

@@ -4,6 +4,7 @@
from collections import OrderedDict, defaultdict
from itertools import product
from math import ceil
from pdb import set_trace
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to remove this!

@@ -27,7 +27,7 @@ def __init__(self, index="index", features=None, random_seed=0, **kwargs):

self.index = index
self.features = features
self._passed_in_features = True if features else None
self._passed_in_features = True if features is not None else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I do think the distinction between an empty list and None can be murky to new users. I think it's mostly mitigated by a docstring callout though.

Comment on lines 169 to 172
# Pass in empty list of features so we don't run calculate feature matrix
# which would happen with the full set of features for a single column at refit
dfs_transformer["features"] = []
pipeline_parameters["DFS Transformer"] = dfs_transformer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to just do:

pipeline_parameters["DFS Transformer"]["features"] = []

@@ -4,6 +4,7 @@
from collections import OrderedDict, defaultdict
from itertools import product
from math import ceil
from pdb import set_trace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tsktsk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed 🙈

@chukarsten chukarsten merged commit eb02c44 into main Dec 7, 2022
@chukarsten chukarsten deleted the fix-fast-mode-features-arg branch December 7, 2022 19:08
@christopherbunn christopherbunn mentioned this pull request Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants