Add support for known-in-advance features for time series #3149

freddyaboulton merged 10 commits into main
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main   #3149    +/-  ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%
=======================================
  Files        318     318
  Lines      30818   30908     +90
=======================================
+ Hits       30714   30804     +90
  Misses       104     104
=======================================
Continue to review full report at Codecov.
        assert pipeline.parameters[k] == parameters[k]
    if is_classification(problem_type):
        assert (
            len([c for c in pipeline.component_graph if "Label Encoder" in c.name])
            == 2
        )
        assert (
            len([c for c in pipeline.component_graph if "Undersampler" in c.name])
component_parameters["n_jobs"] = self.n_jobs
if "number_features" in init_params:
    component_parameters["number_features"] = self.number_features
names_to_check = [
question for my understanding: why do we need to check only specific components for _pipeline_params and not every component that shows up? That is how it works in DefaultAlgorithm.
I think this is precisely the tech debt around the #3150 and #3138. Our current approach for specifying pipeline parameters through the search is really confusing.
My guess of what happened:
- In "Add ability to freeze hyperparameters in AutoMLSearch" #1676, we added the ability to freeze parameters/hyperparameters via the pipeline_parameters argument. This had a check for batch 0, to make sure we initialized the pipelines with values that are in the hyperparameter ranges. This could only apply to batch 0, otherwise we would always override the values proposed by the tuner with random values.
- Then in "Handle index columns in AutoMLSearch and DataChecks" #2138, we added support to drop index columns throughout search. Since we can't specify the parameters for DropColumns as a hyperparameter, we had to manually add logic to add them.
- In "Separate pipeline_parameters from custom_hyperparameters" #2317, we separated the custom hyperparameters from pipeline parameters. The design choice was for pipeline parameters to only apply to the first batch and for custom hyperparameters to apply to all subsequent batches. Therefore we're in a position where the only components whose parameters should be overwritten with what's in names_to_check are the select/drop columns transformers.

Regarding the difference in behavior with DefaultAlgorithm, that may be a bug? I think pipeline parameters should only apply in the first batch - at least that's what the design stipulated when we separated parameters from custom hyperparameters.
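The batch-0 rule described above can be sketched in a few lines. This is a hypothetical illustration of the design, not EvalML's actual API; the function and argument names are made up for clarity.

```python
# Hypothetical sketch of the batch-0 rule: user-supplied pipeline_parameters
# seed the first batch only, while later batches take whatever the tuner
# proposes. Names here are illustrative, not EvalML's real interface.
def propose_parameters(batch_number, pipeline_parameters, tuner_proposal):
    """Return the parameters a pipeline in this batch should be built with."""
    if batch_number == 0:
        # First batch: start from the tuner proposal but freeze any
        # user-specified values so pipelines initialize as requested.
        merged = dict(tuner_proposal)
        merged.update(pipeline_parameters)
        return merged
    # Later batches: applying pipeline_parameters here would override the
    # tuner's suggestions with the same static values every time.
    return dict(tuner_proposal)
```

Under this rule, only parameters that cannot be tuned (like the column lists for the select/drop columns transformers) need to be re-applied in every batch, which is what names_to_check captures.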
Thanks for the context @freddyaboulton, really helpful in catching up on this logic. I made the explicit choice to pass down pipeline_parameters into each DefaultAlgorithm pipeline because my thought was that pipeline_params represented static parameters that a user wanted in each pipeline. But I think you're right that the design in #2317 made it only apply in the first batch. I filed #3153 to track this divergent behavior for further discussion!
        "known_in_advance": ["bool_feature", "cat_feature"],
    },
    optimize_thresholds=False,
    sampler_method="Undersampler",
chukarsten
left a comment
This looks great, Freddy. Thanks for putting this together and thanks especially for the linking of all the new and existing issues. I find this extremely helpful for people that pick up refactoring issues to find all the places where resolution of their underlying issue should lead to a refactor.
    {"features": range(101, 601), "date": pd.date_range("2010-10-01", periods=500)}
)
y = pd.Series(range(500))
super-nit: perhaps consolidate the periods into a var and refactor the range calls and even the generation of the target to use that var?
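One way the suggested refactor could look, assuming the fixture keeps the same start values as the quoted diff; this is a sketch of the reviewer's nit, not the code that was actually merged:

```python
import pandas as pd

# Derive the feature range, the date range, and the target from a single
# `periods` variable so their lengths can never drift apart.
periods = 500
X = pd.DataFrame(
    {
        "features": range(101, 101 + periods),
        "date": pd.date_range("2010-10-01", periods=periods),
    }
)
y = pd.Series(range(periods))
```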
        ProblemTypes.TIME_SERIES_BINARY,
    ],
)
def test_automl_passes_known_in_advance_pipeline_parameters_to_all_pipelines(
One tech debt issue with the algorithm upgrade is separating pipeline building logic out of search and removing pipeline building out of tests as well (#2868 tracks this). Do you think this test better belongs in test_iterative_algorithm.py? You can just test against allowed_pipelines to see if the pipelines generated match what you would expect. Once I get to #2691 I can add a test in DefaultAlgorithm as well!
Thanks for bringing this up! The known-in-advance implementation requires us to set the parameters for SelectColumns correctly for all pipelines. That will have to hold for all automl algorithms - iterative, default, whatever we come up with in the future. Because of that, I thought it'd be cleaner to have a test in test_automl and then parametrize over automl algorithms (when default algorithm supports time series) as opposed to a test in every automl test file.
What do you think?
Gotcha, that makes a whole lot of sense to me. Thanks for explaining, the current test is great!
@freddyaboulton overall looks good to me! I'm adding a comment to test support for this in #2691 (my catch-all for TS issues and Default Algorithm). Will post updates there once I get to it.
Force-pushed from fb1da94 to 9fd6dc8
Force-pushed from 9ea6033 to 8220eb4
ParthivNaresh
left a comment
Looks great, the splitting into subgraphs is done very neatly and I love it. Also great test coverage.
**Future Releases**
    * Enhancements
        * Added the ability to accept serialized features and skip computation in ``DFSTransformer`` :pr:`3106`
        * Added support for known-if-advance features :pr:`3149`
Fixing now 😆
# Pre-processing components do not depend on problem type so we
# are ok by specifying regression for the known-in-advance sub pipeline
# Since we specify the correct problem type for the not known-in-advance pipeline
# the label encoder and time series featurizer will be correctly added
I still don't think I fully understand why problem_type can't be passed here as well
Great question. We can't specify a time series problem type because then the pipeline for the known-in-advance features will have a time series featurizer which is not what we want. We could convert a time series problem type to a non-time-series problem type (time series binary -> binary) but that's the same as just using regression. Since the not-known-in-advance pipeline will have the "right" components for the problem type, e.g. label encoder, samplers, etc, the overall pipeline will have them too.
Elaborating on this in the comment!
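The reasoning in that reply can be sketched as a tiny helper. The names here are illustrative only (not EvalML's API): mapping a time series problem type to its non-time-series counterpart would also keep the time series featurizer out of the known-in-advance sub-pipeline, but since that sub-pipeline only holds problem-type-independent pre-processing components, a constant "regression" is equivalent and simpler.

```python
# Hypothetical mapping from a time series problem type to its
# non-time-series counterpart (illustrative, not EvalML's API).
NON_TIME_SERIES_EQUIVALENT = {
    "time series regression": "regression",
    "time series binary": "binary",
    "time series multiclass": "multiclass",
}

def known_in_advance_sub_pipeline_problem_type(problem_type):
    # Either the mapped equivalent or "regression" keeps the time series
    # featurizer out of this sub-pipeline; "regression" is the simpler
    # constant because the pre-processing components don't depend on it.
    assert problem_type in NON_TIME_SERIES_EQUIVALENT
    return "regression"
```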
if known_in_advance:
    not_known_in_advance = [c for c in X.columns if c not in known_in_advance]
    X_not_known_in_advance = X.ww[not_known_in_advance]
    X_known_in_advance = X.ww[known_in_advance]
Nice and clean implementation
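A minimal standalone version of the splitting step quoted above. The real code uses the Woodwork `.ww` accessor to preserve typing information; plain pandas indexing is enough to show the idea, and the function name here is made up for the sketch:

```python
import pandas as pd

def split_known_in_advance(X, known_in_advance):
    """Split X into the known-in-advance columns and the rest."""
    not_known_in_advance = [c for c in X.columns if c not in known_in_advance]
    return X[known_in_advance], X[not_known_in_advance]

# Column names match the test parameters used elsewhere in this PR.
X = pd.DataFrame(
    {"bool_feature": [True, False], "cat_feature": ["a", "b"], "lag": [1, 2]}
)
X_known, X_not_known = split_known_in_advance(X, ["bool_feature", "cat_feature"])
```

Each half then feeds its own sub-pipeline, and the two subgraphs are joined back together downstream.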
Pull Request Description
Fixes #2511
Design doc here
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.