Added drop NaN component to some time series pipelines #3310
Force-pushed f9934d6 to 8b0365c
Codecov Report

@@            Coverage Diff            @@
##             main    #3310    +/-   ##
========================================
+ Coverage    99.7%    99.7%    +0.1%
  Files         327      329       +2
  Lines       31840    31912      +72
========================================
+ Hits        31715    31789      +74
+ Misses        125      123       -2

Continue to review full report at Codecov.
Force-pushed 4b597b9 to 724b53c
        and pipeline.component_graph.compute_order[-2] == name
    ):
        n_rows_to_drop = (
            self._pipeline_params["pipeline"]["max_delay"]
@christopherbunn not all of the pipelines have the same number of NaN rows. After the first batch, the number of delays and the rolling window size change as the search progresses.
from evalml.automl import AutoMLSearch
from evalml.demos import load_weather
X, y = load_weather()
automl = AutoMLSearch(X, y, "time series regression",
max_batches=4, _automl_algorithm="iterative",
problem_configuration={"max_delay": 30, "forecast_horizon": 7, "gap": 1,
"time_index": "Date"},
verbose=True)
automl.search()
light_gbm_pipelines = [automl.get_pipeline(i) for
i in automl.full_rankings.loc[automl.full_rankings.pipeline_name.str.contains("LightGBM")].id]
[pl.parameters['Time Series Featurizer']['conf_level'] for pl in light_gbm_pipelines]
[pl.parameters['Time Series Featurizer']['rolling_window_size'] for pl in light_gbm_pipelines]
fitted_light_gbm_pipelines = [pl.fit(X, y) for pl in light_gbm_pipelines]
[pl.transform_all_but_final(X, y).isna().sum(axis=0).max() for pl in fitted_light_gbm_pipelines]

Note that the number of NaN rows is never max_delay + forecast_horizon + gap. I worry we may be throwing out too much data with this approach. I like the idea of only dropping NaNs for some estimators. Can we test with the DropNaN component for the estimators that need it, for more than two batches?
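As a pandas-level illustration of the point above (hypothetical data, not evalml's actual featurizer): a delay feature built by shifting has exactly as many leading NaNs as its delay, so the rows lost to NaNs depend on the largest delay a given pipeline actually uses, not on a fixed max_delay + forecast_horizon + gap.

```python
import pandas as pd

y = pd.Series(range(12), dtype="float64")

# A delay-k column produced by shift(k) has exactly k leading NaNs, so the
# maximum NaN count per row set is driven by the largest delay chosen by
# the featurizer, which varies from pipeline to pipeline.
features = pd.DataFrame({f"delay_{k}": y.shift(k) for k in (2, 5)})

print(features.isna().sum(axis=0).max())  # → 5, the largest delay used
```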
Sure! Here are the results I got for dropping the first N rows vs. dropping NaNs only for select estimators. It looks like across the board we're getting better results with the drop NaN component than with the drop rows component. As you pointed out, that's most likely because the drop rows component was dropping rows that do not have NaN values. I'll adjust the preprocessing chain to use the drop NaN component for time series pipelines when needed.
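A minimal sketch of why the two strategies keep different amounts of data (hypothetical toy frame, not the evalml components themselves): dropping a fixed first-N prefix discards fully populated rows whenever N overshoots the actual NaN span, while a NaN mask keeps every complete row.

```python
import numpy as np
import pandas as pd

# Toy featurized data: only the first 3 rows contain NaNs.
X = pd.DataFrame({"feat_delay_3": [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0]})
y = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Strategy 1: drop a fixed first N rows (e.g. max_delay + forecast_horizon + gap = 5).
n = 5
X_first_n, y_first_n = X.iloc[n:], y.iloc[n:]

# Strategy 2: drop only rows that actually contain a NaN.
mask = X.notna().all(axis=1)
X_nan_only, y_nan_only = X[mask], y[mask]

# The NaN-only strategy retains two extra fully populated training rows here.
print(len(X_first_n), len(X_nan_only))  # → 2 4
```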
Force-pushed 03b53ff to 5384c6d
Force-pushed a3e9f65 to 558dca1
freddyaboulton
left a comment
@christopherbunn Looks good to me. I think we just need to make sure the DropNaNRows... doesn't reset the woodwork schema. Let's add a test for that too.
    y_t = infer_feature_types(y) if y is not None else None

    X_t, y_t = drop_rows_with_nans(X_t, y_t)
    X_t.ww.init()
@christopherbunn I think we need to store the schema prior to drop and then call init with the old schema so that we make sure we don't reset the schema.
Force-pushed 37c1f64 to 659738d
christopherbunn
left a comment
@freddyaboulton do you mind taking another look? The change making DefaultAlgorithm the default got merged to main after your review, and I ended up having to make some changes...
evalml/pipelines/components/transformers/preprocessing/drop_nan_rows_transformer.py
    if need_drop_nan:
        last_component_name = pipeline.component_graph.get_last_component().name
        pipeline.component_graph.component_dict[estimator.name] = [
            estimator,
            last_component_name + ".x",
            last_component_name + ".y",
        ]
I feel like this and lines 381-383 + 392 are a bit hacky, but I couldn't find a better function to do this.
OK, yeah, this part seems a little hacky. To make sure I understand, we're modifying the component_graph so that the estimator is fed by the DropNaNRowsTransformer, which can modify the target. But why can't we do this within _make_pipeline_from_multiple_graphs()? Wouldn't that be the function responsible for that logic?
Yep, my current implementation basically has both sub-pipelines feed into the DropNaNRowsTransformer. The drop NaN rows transformer then feeds its output into the estimator.
I know I asked for a better alternative to this implementation, but I'm a bit hesitant to move this logic into _make_pipeline_from_multiple_graphs(). I think that this specific function should just be responsible for combining graphs; housing the DropNaNRowsTransformer insertion logic there could make maintaining _make_pipeline_from_multiple_graphs() confusing in the long run.
I guess my ideal would be for _make_pipeline_from_multiple_graphs() to take in a "postprocessing graph" of multiple components rather than just a single estimator, but I think adding that functionality should be its own issue.
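The rewiring being discussed can be sketched with a plain dict (names and helper are hypothetical; evalml's ComponentGraph uses a similar edge-list dict where each entry lists a component followed by its ".x"/".y" inputs):

```python
# Hypothetical component-graph edges, string placeholders standing in for
# component instances.
component_dict = {
    "Imputer": ["Imputer", "X", "y"],
    "Estimator": ["Estimator", "Imputer.x", "y"],
}

def insert_before_estimator(graph, estimator_name, new_name):
    """Feed the estimator through a new component (e.g. a drop-NaN-rows step)."""
    graph = dict(graph)
    component, *inputs = graph[estimator_name]
    # The new component takes over the estimator's former inputs...
    graph[new_name] = [new_name, *inputs]
    # ...and the estimator now reads both X and y from the new component,
    # since dropping NaN rows must also realign the target.
    graph[estimator_name] = [component, f"{new_name}.x", f"{new_name}.y"]
    return graph

new_graph = insert_before_estimator(
    component_dict, "Estimator", "Drop NaN Rows Transformer"
)
```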
I think this is another example of why we need this: #2997
    def test_drop_rows_transformer_retain_ww_schema():
        # Expecting float because of np.NaN values
        X = pd.DataFrame(
            {"a column": [np.NaN, 2, 3, 4], "another col": ["a", np.NaN, "c", "d"]}
        )
        X.ww.init()
        X.ww.set_types(
            logical_types={"another col": "PersonFullName"},
            semantic_tags={"a column": "custom_tag"},
        )

        X_expected = pd.DataFrame({"a column": [3], "another col": ["c"]}, index=[2])
        X_expected = X_expected.astype({"a column": "float", "another col": "string"})
        X_expected_schema = X.ww.schema

        y = pd.Series([3, 2, 1, np.NaN])
        y = init_series(y, logical_type="IntegerNullable", semantic_tags="y_custom_tag")

        # Only row 2 survives: rows 0 and 1 have NaNs in X, row 3 in y.
        y_expected = pd.Series([1], index=[2])
        y_expected = init_series(
            y_expected, logical_type="IntegerNullable", semantic_tags="y_custom_tag"
        )
        y_expected_schema = y.ww.schema

        drop_rows_transformer = DropNaNRowsTransformer()
        transformed_X, transformed_y = drop_rows_transformer.fit_transform(X, y)
        assert_frame_equal(transformed_X, X_expected)
        assert_series_equal(transformed_y, y_expected)
        assert _schema_is_equal(transformed_X.ww.schema, X_expected_schema)
        assert transformed_y.ww.schema == y_expected_schema
Saw I forgot to include this file in the last push 😅 but added a ww check test
Force-pushed 659738d to 1ac479e
chukarsten
left a comment
Hey Chris, great work, as per usual. I think we should consider refactoring out the check for what needs the DropNaNRowsTransformer so that it can be reused in the places I call out in the review. That will protect us from any weird divergence bugs where we update one place but forget the other. I also had a lingering question about why we can't let the normal make_pipeline-style functions handle the component graph adjustments!
evalml/pipelines/components/transformers/preprocessing/drop_nan_rows_transformer.py
evalml/pipelines/components/transformers/preprocessing/drop_nan_rows_transformer.py
freddyaboulton
left a comment
Thank you @christopherbunn !
        return estimator_classes


    def estimator_unable_to_handle_nans(estimator_class):
It may be safer to implement this as the opposite, e.g. "able_to_handle_nans". The concern I have with doing it like this is that if we add a new estimator, this function will return False if we forget to update this file and a false negative in this case is costly because it will mean AutoMLSearch will error out.
Not blocking - let's handle it in a follow-up.
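The allowlist approach suggested above could be sketched like this (estimator names and the helper are hypothetical, not evalml's actual API): list only the estimators known to tolerate NaNs, so an estimator we forget to register defaults to the safe side.

```python
# Hypothetical allowlist: estimators known to tolerate NaN inputs.
NAN_TOLERANT_ESTIMATORS = {"LightGBM Regressor", "CatBoost Regressor"}

def estimator_requires_drop_nan(estimator_name):
    """Safe default: an unknown estimator is assumed to need the drop step.

    The worst case for a forgotten entry is an unnecessary drop component,
    not an AutoMLSearch crash on unhandled NaNs.
    """
    return estimator_name not in NAN_TOLERANT_ESTIMATORS

print(estimator_requires_drop_nan("Brand New Estimator"))  # → True
print(estimator_requires_drop_nan("LightGBM Regressor"))   # → False
```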
chukarsten
left a comment
Looks good, thanks for making the changes and making them so quickly!
Force-pushed 395383c to 826c79e


Resolves #3259