Update dfs transformer features arg to be empty list for fast mode#3875
Update dfs transformer features arg to be empty list for fast mode#3875chukarsten merged 15 commits intomainfrom
Conversation
623ad8d to
b497fb8
Compare
Codecov Report
@@ Coverage Diff @@
## main #3875 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 344 344
Lines 36209 36262 +53
=======================================
+ Hits 36072 36125 +53
Misses 137 137
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
| self.index = index | ||
| self.features = features | ||
| self._passed_in_features = True if features else None | ||
| self._passed_in_features = True if features is not None else None |
There was a problem hiding this comment.
To make this fix, I needed the ability to set the parameters of the DFS Transformer so that a transform call would be no op. One option I considered was only passing in the IdentityFeature objects in at _handle_partial_dependence_fast_mode, but I felt that left some ambiguity as to what the behavior would be.
Therefore, I updated the behavior here to no longer treat an empty list passed in for features the same as features being unspecified. This change made sense to me, but I wanted to highlight it in case anyone disagrees or thinks this could be problematic.
There was a problem hiding this comment.
I am a little curious about this. It opens up the opportunity for a user to pass in features=[] during AutoMLSearch, which would turn the whole component into a no-op. My intuition says we might want to add some sort of warning to make users aware of what they're doing - though it may be a weird warning in the pd fast mode case since users aren't controlling it there.
Does anyone else have an opinion about this?
There was a problem hiding this comment.
Maybe the warning could come from AutoMLSearch itself. If it sees that a user is trying to have a DFS Transformer in automl search with no features, it could raise that warning. If it's passed directly into a pipeline, maybe we're okay not warning users since it might be more clear to users at that point what the impact would be.
Then again, my goal here was specifically to have no-op functionality in the DFS Transformer, but I definitely get the discomfort. At the very least, I need to update the dosctring making this explicit.
There was a problem hiding this comment.
Ah, the warning within AutoMLSearch is a good idea. Having it in the docstring is also a super good callout, sounds like a good path forward to me
There was a problem hiding this comment.
Turns out AutomlSearch doesn’t add the DFS Transformer to pipelines if features is an empty list.
We definitely add it to the internal search parameters here because we explicitly check if it’s None or not, but when we make_pipeline we exclude it if features is falsy: https://github.com/alteryx/evalml/blob/main/evalml/pipelines/utils.py#L412. We do the same for split pipelines. So currently in automl, an empty features list already is no-op.
My preferance here would be to not touch the automl code (though, maybe it’d be worth adding a note to the docstring there about the behavior when features is an empty list). And it makes me more confident that we probably want this behavior for the component itself as well.
There was a problem hiding this comment.
Agreed, I do think the distinction between an empty list and None can be murky to new users. I think it's mostly mitigated by a docstring callout though.
There was a problem hiding this comment.
Yep, updated the automl docstring to make this clear there as well
|
|
||
| assert not mock_calculate_feature_matrix.called | ||
| # --> Going through automl, because we have to instantiate the DFSTransformer to make a pipeline directly, | ||
| # which means we don't get the updated parameters applied, causing this test to fail |
There was a problem hiding this comment.
This was an odd behavior I noticed while testing: Calling pipeline.new with an updated features parameter for the DFS Transformer wasn't resetting the features on the new pipeline. After digging into what was happening, it was because I was defining my pipeline with an instantiated DFSTransformer object DFSTransformer(features=features) so that I could specify features.
It seems at handle_component_class, we’re returning the instantiated component_class with features defined, which is why the features param is ignored when we try to instantiate the component with the new parameters.
Going to automl was my workaround for this, but I feel like there must be a better way. Is this a known behavior?
There was a problem hiding this comment.
If I do need to keep this, I'll plan to use the AutoMLTestEnv
136edbb to
a412921
Compare
eccabay
left a comment
There was a problem hiding this comment.
Looks pretty good! Just left a few clarification questions.
| self.index = index | ||
| self.features = features | ||
| self._passed_in_features = True if features else None | ||
| self._passed_in_features = True if features is not None else None |
There was a problem hiding this comment.
I am a little curious about this. It opens up the opportunity for a user to pass in features=[] during AutoMLSearch, which would turn the whole component into a no-op. My intuition says we might want to add some sort of warning to make users aware of what they're doing - though it may be a weird warning in the pd fast mode case since users aren't controlling it there.
Does anyone else have an opinion about this?
| # Pass in empty list of features so we don't run calculate feature matrix | ||
| # which would happen with the full set of features for a single column at refit | ||
| dfs_transformer["features"] = [] | ||
| pipeline_parameters["DFS Transformer"] = dfs_transformer |
There was a problem hiding this comment.
A clarification for my own understanding: we only use this function with the single column pipeline, correct? This code would never run when we're generating the transformations from the full dataset?
There was a problem hiding this comment.
Correct - the parameters returned here will only be used to train a pipeline on a single column, and the check above this that confirms all features have corresponding columns in X confirms that calculate feature matrix is never called in the original pipeline; this parameter change makes that the case for the single column as well.
There was a problem hiding this comment.
Is it not possible to just do:
pipeline_parameters["DFS Transformer"]["features"] = []
There was a problem hiding this comment.
yep totally doable
| """Confirms that the features arg being an empty list is not treated the same as | ||
| it being unspecified.""" |
There was a problem hiding this comment.
Can we be explicit about testing this? i.e. generate two DFS objects, one with features=[] and one with features=None and test the differences?
There was a problem hiding this comment.
Added a check with features=None to this test and confirmed that dfs and calculate feature matrix get called and that the component results in features being made
|
|
||
| assert part_dep.feature_values.notnull().all() | ||
| assert part_dep.partial_dependence.notnull().all() | ||
| pd.testing.assert_frame_equal(part_dep, fast_part_dep) |
There was a problem hiding this comment.
Another question just for my own understanding: does this test fail with our current main branch?
There was a problem hiding this comment.
Yes, it does fail, as calculate_feature_matrix gets called in fast mode causing the mock checks above to fail
There was a problem hiding this comment.
Ah gotcha. I guess I'm more concerned about the pd.testing.assert_frame_equal(part_dep, fast_part_dep) line specifically, since that's what this PR is aimed at combatting
jeremyliweishih
left a comment
There was a problem hiding this comment.
LGTM - just something to remove
| from collections import OrderedDict, defaultdict | ||
| from itertools import product | ||
| from math import ceil | ||
| from pdb import set_trace |
There was a problem hiding this comment.
Need to remove this!
| self.index = index | ||
| self.features = features | ||
| self._passed_in_features = True if features else None | ||
| self._passed_in_features = True if features is not None else None |
There was a problem hiding this comment.
Agreed, I do think the distinction between an empty list and None can be murky to new users. I think it's mostly mitigated by a docstring callout though.
| # Pass in empty list of features so we don't run calculate feature matrix | ||
| # which would happen with the full set of features for a single column at refit | ||
| dfs_transformer["features"] = [] | ||
| pipeline_parameters["DFS Transformer"] = dfs_transformer |
There was a problem hiding this comment.
Is it not possible to just do:
pipeline_parameters["DFS Transformer"]["features"] = []
| from collections import OrderedDict, defaultdict | ||
| from itertools import product | ||
| from math import ceil | ||
| from pdb import set_trace |
Closes #3874
Avoids passing the DFS Transformer's full list of features into the cloned pipeline for partial dependence fast mode. To achieve this, updates the DFS Transformer to allow an empty list as input for the
featuresarg.Note: This PR is currently set to merge into #3873 because it's dependent on changes made in that PR. We should merge that first and then merge this into main.