
Add support for known-in-advanced features for time series #3149

Merged
freddyaboulton merged 10 commits into main from 2511-known-in-advanced on Dec 15, 2021

Conversation

@freddyaboulton
Contributor

Pull Request Description

Fixes #2511

Design doc here


After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding ``:pr:`123```.

@codecov

codecov bot commented Dec 13, 2021

Codecov Report

Merging #3149 (e493759) into main (2732ca6) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #3149     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        318     318             
  Lines      30818   30908     +90     
=======================================
+ Hits       30714   30804     +90     
  Misses       104     104             
Impacted Files Coverage Δ
evalml/automl/automl_algorithm/automl_algorithm.py 100.0% <100.0%> (ø)
...lml/automl/automl_algorithm/iterative_algorithm.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.7% <100.0%> (+0.1%) ⬆️
evalml/tests/automl_tests/test_automl.py 99.5% <100.0%> (+0.1%) ⬆️
.../integration_tests/test_time_series_integration.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.7% <100.0%> (+0.1%) ⬆️

Continue to review the full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2732ca6...e493759.

assert pipeline.parameters[k] == parameters[k]
if is_classification(problem_type):
    assert (
        len([c for c in pipeline.component_graph if "Label Encoder" in c.name])
Contributor Author

This is #2987

        == 2
    )
    assert (
        len([c for c in pipeline.component_graph if "Undersampler" in c.name])
Contributor Author

This is #3076

@freddyaboulton changed the title "Add support for known-in-advanced" to "Add support for known-in-advanced features for time series" on Dec 13, 2021
component_parameters["n_jobs"] = self.n_jobs
if "number_features" in init_params:
    component_parameters["number_features"] = self.number_features
names_to_check = [
Contributor Author

This is #3150

Collaborator

Question for my understanding: why do we need to check only specific components for `_pipeline_params`, rather than every component that shows up? This is how it looks in `DefaultAlgorithm`.

Contributor Author

@freddyaboulton Dec 14, 2021

I think this is precisely the tech debt around #3150 and #3138. Our current approach for specifying pipeline parameters through the search is really confusing.

My guess of what happened:

  • In Add ability to freeze hyperparameters in AutoMLSearch #1676, we added the ability to freeze parameters/hyperparameters via pipeline_parameters. This had a check for batch 0 to make sure we initialized the pipelines with values that are in the hyperparameter ranges. It could only apply to batch 0; otherwise we would always override the values proposed by the tuner with random values.
  • Then in Handle index columns in AutoMLSearch and DataChecks #2138, we added support for dropping index columns throughout search. Since we can't specify the parameters for DropColumns as a hyperparameter, we had to manually add logic for them.
  • In Separate pipeline_parameters from custom_hyperparameters #2317, we separated the custom hyperparameters from pipeline parameters. The design choice was for pipeline parameters to only apply to the first batch and for custom hyperparameters to apply to all subsequent batches. Therefore we're in a position where the only components whose parameters should be overwritten with what's in names_to_check are the select/drop columns transformers.

Regarding the difference in behavior with DefaultAlgorithm, that may be a bug? I think pipeline parameters should only apply in the first batch - at least that's what the design stipulated when we separated parameters from custom hyperparameters.

Collaborator

Thanks for the context @freddyaboulton, really helpful for catching up on this logic. I made the explicit choice to pass pipeline_parameters down into each DefaultAlgorithm pipeline because my thought was that pipeline_params represented static parameters that a user wanted in each pipeline. But I think you're right that the design in #2317 made it only apply in the first batch. I filed #3153 to track this divergent behavior for further discussion!

"known_in_advance": ["bool_feature", "cat_feature"],
},
optimize_thresholds=False,
sampler_method="Undersampler",
Contributor Author

Using the undersampler because of #3151.

Contributor

@chukarsten left a comment

This looks great, Freddy. Thanks for putting this together, and thanks especially for linking all the new and existing issues. I find this extremely helpful for people who pick up refactoring issues to find all the places where resolving the underlying issue should lead to a refactor.

Comment thread evalml/automl/automl_algorithm/automl_algorithm.py Outdated
Comment thread evalml/pipelines/utils.py Outdated
Comment on lines +79 to +81
{"features": range(101, 601), "date": pd.date_range("2010-10-01", periods=500)}
)
y = pd.Series(range(500))
Contributor

Super-nit: perhaps consolidate the number of periods into a variable and refactor the range calls, and even the generation of the target, to use that variable?
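A sketch of the suggested consolidation (variable name is illustrative, not from the PR): a single `periods` variable drives the feature range, the date range, and the target, so they can't drift out of sync.

```python
import pandas as pd

# Hypothetical refactor of the test fixture discussed above: one
# `periods` variable keeps the feature range, date range, and target aligned.
periods = 500
X = pd.DataFrame(
    {
        "features": range(101, 101 + periods),
        "date": pd.date_range("2010-10-01", periods=periods),
    }
)
y = pd.Series(range(periods))
```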

Comment thread evalml/tests/integration_tests/test_time_series_integration.py
ProblemTypes.TIME_SERIES_BINARY,
],
)
def test_automl_passes_known_in_advance_pipeline_parameters_to_all_pipelines(
Collaborator

One tech debt issue with the algorithm upgrade is separating pipeline building logic out of search and removing pipeline building out of tests as well (#2868 tracks this). Do you think this test better belongs in test_iterative_algorithm.py? You can just test against allowed_pipelines to see if the pipelines generated match what you would expect. Once I get to #2691 I can add a test in DefaultAlgorithm as well!

Contributor Author

Thanks for bringing this up! The known-in-advance implementation requires us to set the parameters for SelectColumns correctly for all pipelines. That has to hold for every automl algorithm - iterative, default, or whatever we come up with in the future. Because of that, I thought it would be cleaner to have one test in test_automl and parametrize over automl algorithms (once the default algorithm supports time series), as opposed to a test in every automl test file.

What do you think?

Collaborator

Gotcha, that makes a whole lot of sense to me. Thanks for explaining; the current test is great!

@jeremyliweishih
Collaborator

@freddyaboulton overall looks good to me! I'm adding a comment to test support for this in #2691 (my catch-all for TS issues and Default Algorithm). Will post updates there once I get to it.

Contributor

@ParthivNaresh left a comment

Looks great, the splitting into subgraphs is done very neatly and I love it. Also great test coverage.

Comment thread docs/source/release_notes.rst Outdated
**Future Releases**
* Enhancements
* Added the ability to accept serialized features and skip computation in ``DFSTransformer`` :pr:`3106`
* Added support for known-if-advance features :pr:`3149`
Contributor

known-in*-advance?

Contributor Author

Fixing now 😆

Comment thread evalml/pipelines/utils.py Outdated
# Pre-processing components do not depend on the problem type, so we
# are ok specifying regression for the known-in-advance sub-pipeline.
# Since we specify the correct problem type for the not-known-in-advance
# pipeline, the label encoder and time series featurizer will be correctly added.
Contributor

I still don't think I fully understand why problem_type can't be passed here as well

Contributor Author

Great question. We can't specify a time series problem type because then the pipeline for the known-in-advance features will have a time series featurizer which is not what we want. We could convert a time series problem type to a non-time-series problem type (time series binary -> binary) but that's the same as just using regression. Since the not-known-in-advance pipeline will have the "right" components for the problem type, e.g. label encoder, samplers, etc, the overall pipeline will have them too.
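To illustrate the reasoning, here is a toy sketch (not the evalml implementation; component names and the function are made up for illustration): only the not-known-in-advance sub-pipeline carries the problem-type-specific components, so the known-in-advance sub-pipeline can be built against a generic regression problem type without losing anything.

```python
# Toy sketch, not the evalml implementation: which components each
# sub-pipeline would get. Preprocessing is problem-type-agnostic; the
# type-specific components only go into the not-known-in-advance side.
def sub_pipeline_components(problem_type: str, known_in_advance: bool) -> list:
    components = ["Imputer", "One Hot Encoder"]  # type-agnostic preprocessing
    if not known_in_advance:
        # Only the features that vary with the target's timeline need these.
        if problem_type.startswith("time series"):
            components.insert(0, "Time Series Featurizer")
        if "binary" in problem_type or "multiclass" in problem_type:
            components.insert(0, "Label Encoder")
    return components
```

Under this sketch, passing `known_in_advance=True` with any problem type yields the same component list as regression would, which is the point of the comment above.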

Contributor Author

Elaborating on this in the comment!

Comment thread evalml/pipelines/utils.py
if known_in_advance:
    not_known_in_advance = [c for c in X.columns if c not in known_in_advance]
    X_not_known_in_advance = X.ww[not_known_in_advance]
    X_known_in_advance = X.ww[known_in_advance]
Contributor

Nice and clean implementation
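The same split can be sketched with plain pandas, dropping the woodwork `.ww` accessor that evalml uses (the column names here are made up for illustration):

```python
import pandas as pd

# Sketch of the known-in-advance column split using plain pandas
# instead of the woodwork (.ww) accessor; feature names are illustrative.
X = pd.DataFrame(
    {
        "price": [1.0, 2.0, 3.0],    # not known in advance
        "holiday": [0, 1, 0],        # known in advance
        "promo": ["a", "b", "a"],    # known in advance
    }
)
known_in_advance = ["holiday", "promo"]
if known_in_advance:
    not_known_in_advance = [c for c in X.columns if c not in known_in_advance]
    X_not_known_in_advance = X[not_known_in_advance]
    X_known_in_advance = X[known_in_advance]
```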

@freddyaboulton merged commit 80aa901 into main on Dec 15, 2021
@freddyaboulton deleted the 2511-known-in-advanced branch on Dec 15, 2021, 16:25
@angela97lin mentioned this pull request on Dec 22, 2021

Development

Successfully merging this pull request may close these issues.

Spike - Handle known-in-advance features

4 participants