-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update dfs transformer features arg to be empty list for fast mode #3875
Conversation
623ad8d
to
b497fb8
Compare
Codecov Report
@@ Coverage Diff @@
## main #3875 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 344 344
Lines 36209 36262 +53
=======================================
+ Hits 36072 36125 +53
Misses 137 137
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@@ -27,7 +27,7 @@ def __init__(self, index="index", features=None, random_seed=0, **kwargs): | |||
|
|||
self.index = index | |||
self.features = features | |||
self._passed_in_features = True if features else None | |||
self._passed_in_features = True if features is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make this fix, I needed the ability to set the parameters of the DFS Transformer so that a transform
call would be no op. One option I considered was only passing in the IdentityFeature
objects in at _handle_partial_dependence_fast_mode
, but I felt that left some ambiguity as to what the behavior would be.
Therefore, I updated the behavior here to no longer treat an empty list passed in for features
the same as features
being unspecified. This change made sense to me, but I wanted to highlight it in case anyone disagrees or thinks this could be problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little curious about this. It opens up the opportunity for a user to pass in features=[]
during AutoMLSearch, which would turn the whole component into a no-op. My intuition says we might want to add some sort of warning to make users aware of what they're doing - though it may be a weird warning in the pd fast mode case since users aren't controlling it there.
Does anyone else have an opinion about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the warning could come from AutoMLSearch itself. If it sees that a user is trying to have a DFS Transformer in automl search with no features, it could raise that warning. If it's passed directly into a pipeline, maybe we're okay not warning users since it might be more clear to users at that point what the impact would be.
Then again, my goal here was specifically to have no-op functionality in the DFS Transformer, but I definitely get the discomfort. At the very least, I need to update the dosctring making this explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the warning within AutoMLSearch is a good idea. Having it in the docstring is also a super good callout, sounds like a good path forward to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out AutomlSearch doesn’t add the DFS Transformer to pipelines if features is an empty list.
We definitely add it to the internal search parameters here because we explicitly check if it’s None or not, but when we make_pipeline we exclude it if features is falsy: https://github.com/alteryx/evalml/blob/main/evalml/pipelines/utils.py#L412. We do the same for split pipelines. So currently in automl, an empty features list already is no-op.
My preferance here would be to not touch the automl code (though, maybe it’d be worth adding a note to the docstring there about the behavior when features is an empty list). And it makes me more confident that we probably want this behavior for the component itself as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I do think the distinction between an empty list and None can be murky to new users. I think it's mostly mitigated by a docstring callout though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, updated the automl docstring to make this clear there as well
|
||
assert not mock_calculate_feature_matrix.called | ||
# --> Going through automl, because we have to instantiate the DFSTransformer to make a pipeline directly, | ||
# which means we don't get the updated parameters applied, causing this test to fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an odd behavior I noticed while testing: Calling pipeline.new
with an updated features
parameter for the DFS Transformer wasn't resetting the features
on the new pipeline. After digging into what was happening, it was because I was defining my pipeline with an instantiated DFSTransformer object DFSTransformer(features=features)
so that I could specify features.
It seems at handle_component_class, we’re returning the instantiated component_class with features defined, which is why the features param is ignored when we try to instantiate the component with the new parameters.
Going to automl was my workaround for this, but I feel like there must be a better way. Is this a known behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I do need to keep this, I'll plan to use the AutoMLTestEnv
136edbb
to
a412921
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good! Just left a few clarification questions.
@@ -27,7 +27,7 @@ def __init__(self, index="index", features=None, random_seed=0, **kwargs): | |||
|
|||
self.index = index | |||
self.features = features | |||
self._passed_in_features = True if features else None | |||
self._passed_in_features = True if features is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little curious about this. It opens up the opportunity for a user to pass in features=[]
during AutoMLSearch, which would turn the whole component into a no-op. My intuition says we might want to add some sort of warning to make users aware of what they're doing - though it may be a weird warning in the pd fast mode case since users aren't controlling it there.
Does anyone else have an opinion about this?
# Pass in empty list of features so we don't run calculate feature matrix | ||
# which would happen with the full set of features for a single column at refit | ||
dfs_transformer["features"] = [] | ||
pipeline_parameters["DFS Transformer"] = dfs_transformer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A clarification for my own understanding: we only use this function with the single column pipeline, correct? This code would never run when we're generating the transformations from the full dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - the parameters returned here will only be used to train a pipeline on a single column, and the check above this that confirms all features have corresponding columns in X confirms that calculate feature matrix is never called in the original pipeline; this parameter change makes that the case for the single column as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible to just do:
pipeline_parameters["DFS Transformer"]["features"] = []
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep totally doable
"""Confirms that the features arg being an empty list is not treated the same as | ||
it being unspecified.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be explicit about testing this? i.e. generate two DFS objects, one with features=[]
and one with features=None
and test the differences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check with features=None
to this test and confirmed that dfs and calculate feature matrix get called and that the component results in features being made
|
||
assert part_dep.feature_values.notnull().all() | ||
assert part_dep.partial_dependence.notnull().all() | ||
pd.testing.assert_frame_equal(part_dep, fast_part_dep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another question just for my own understanding: does this test fail with our current main branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does fail, as calculate_feature_matrix
gets called in fast mode causing the mock checks above to fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha. I guess I'm more concerned about the pd.testing.assert_frame_equal(part_dep, fast_part_dep)
line specifically, since that's what this PR is aimed at combatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - just something to remove
@@ -4,6 +4,7 @@ | |||
from collections import OrderedDict, defaultdict | |||
from itertools import product | |||
from math import ceil | |||
from pdb import set_trace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove this!
@@ -27,7 +27,7 @@ def __init__(self, index="index", features=None, random_seed=0, **kwargs): | |||
|
|||
self.index = index | |||
self.features = features | |||
self._passed_in_features = True if features else None | |||
self._passed_in_features = True if features is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I do think the distinction between an empty list and None can be murky to new users. I think it's mostly mitigated by a docstring callout though.
# Pass in empty list of features so we don't run calculate feature matrix | ||
# which would happen with the full set of features for a single column at refit | ||
dfs_transformer["features"] = [] | ||
pipeline_parameters["DFS Transformer"] = dfs_transformer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible to just do:
pipeline_parameters["DFS Transformer"]["features"] = []
@@ -4,6 +4,7 @@ | |||
from collections import OrderedDict, defaultdict | |||
from itertools import product | |||
from math import ceil | |||
from pdb import set_trace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tsktsk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed 🙈
Closes #3874
Avoids passing the DFS Transformer's full list of features into the cloned pipeline for partial dependence fast mode. To achieve this, updates the DFS Transformer to allow an empty list as input for the
features
arg.Note: This PR is currently set to merge into #3873 because it's dependent on changes made in that PR. We should merge that first and then merge this into main.