Conversation

@christopherbunn (Contributor)

Resolves #3259

@christopherbunn christopherbunn force-pushed the 3259_drop_nan_rows_component branch from f9934d6 to 8b0365c Compare February 8, 2022 17:00
@christopherbunn christopherbunn changed the title Add drop first N component and removed drop NaN from time series fit Replaced drop NaN for time series with drop rows component Feb 8, 2022
@codecov bot commented Feb 8, 2022

Codecov Report

Merging #3310 (395383c) into main (2398f09) will increase coverage by 0.1%.
The diff coverage is 100.0%.

❗ Current head 395383c differs from pull request most recent head 826c79e. Consider uploading reports for the commit 826c79e to get more accurate results

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3310     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        327     329      +2     
  Lines      31840   31912     +72     
=======================================
+ Hits       31715   31789     +74     
+ Misses       125     123      -2     
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.0% <ø> (ø)
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_components.py 99.3% <ø> (ø)
...s/prediction_explanations_tests/test_explainers.py 100.0% <ø> (ø)
.../components/transformers/preprocessing/__init__.py 100.0% <100.0%> (ø)
...formers/preprocessing/drop_nan_rows_transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/components/utils.py 95.3% <100.0%> (+0.2%) ⬆️
evalml/pipelines/time_series_pipeline_base.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.5% <100.0%> (+0.1%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2398f09...826c79e. Read the comment docs.

@christopherbunn christopherbunn force-pushed the 3259_drop_nan_rows_component branch 2 times, most recently from 4b597b9 to 724b53c Compare February 10, 2022 19:54
@christopherbunn christopherbunn requested a review from a team February 10, 2022 22:23
@christopherbunn christopherbunn marked this pull request as ready for review February 11, 2022 16:10
```python
    and pipeline.component_graph.compute_order[-2] == name
):
    n_rows_to_drop = (
        self._pipeline_params["pipeline"]["max_delay"]
```
Contributor:

@christopherbunn not all of the pipelines have the same number of NaN rows. After the first batch, the number of delays and the rolling window size will change as the search progresses.

```python
from evalml.automl import AutoMLSearch
from evalml.demos import load_weather

X, y = load_weather()

automl = AutoMLSearch(X, y, "time series regression",
                      max_batches=4, _automl_algorithm="iterative",
                      problem_configuration={"max_delay": 30, "forecast_horizon": 7, "gap": 1,
                                             "time_index": "Date"},
                      verbose=True)

automl.search()

light_gbm_pipelines = [automl.get_pipeline(i) for
                       i in automl.full_rankings.loc[automl.full_rankings.pipeline_name.str.contains("LightGBM")].id]

[pl.parameters['Time Series Featurizer']['conf_level'] for pl in light_gbm_pipelines]

[pl.parameters['Time Series Featurizer']['rolling_window_size'] for pl in light_gbm_pipelines]

fitted_light_gbm_pipelines = [pl.fit(X, y) for pl in light_gbm_pipelines]

[pl.transform_all_but_final(X, y).isna().sum(axis=0).max() for pl in fitted_light_gbm_pipelines]
```

(two screenshots: the per-pipeline conf_level / rolling_window_size values and the resulting NaN row counts)

Note that the number of NaN rows is never max_delay + forecast_horizon + gap. I worry we may be throwing out too much data with this approach. I like the idea of only dropping NaNs for some estimators. Can we test with the DropNaN component for the estimators that need it, for more than two batches?

@christopherbunn (Contributor Author):

Sure! Here are the results I got for drop first N vs. drop NaN only for select estimators. It looks like across the board we're getting better results with the drop NaN component than with the drop rows component. As you pointed out, that's most likely because the drop rows component drops rows that do not have NaN values. I'll adjust the preprocessing chain to use the drop NaN component for time series pipelines when needed.

drop_n_vs_drop_nan_necessary.html.zip

@christopherbunn christopherbunn force-pushed the 3259_drop_nan_rows_component branch from 03b53ff to 5384c6d Compare February 14, 2022 20:14
@christopherbunn christopherbunn changed the title Replaced drop NaN for time series with drop rows component Added drop NaN component to some time series pipelines Feb 14, 2022
@christopherbunn christopherbunn force-pushed the 3259_drop_nan_rows_component branch from a3e9f65 to 558dca1 Compare February 14, 2022 20:48
@freddyaboulton left a comment (Contributor):

@christopherbunn Looks good to me. I think we just need to make sure the DropNaNRowsTransformer doesn't reset the woodwork schema. Let's add a test for that too.

```python
y_t = infer_feature_types(y) if y is not None else None

X_t, y_t = drop_rows_with_nans(X_t, y_t)
X_t.ww.init()
```
Contributor:
@christopherbunn I think we need to store the schema prior to drop and then call init with the old schema so that we make sure we don't reset the schema.
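The fix suggested here can be sketched with plain pandas, using dtypes as a stand-in for the Woodwork schema. `drop_rows_with_nans` below is a hypothetical minimal version of the evalml utility, not its real implementation; the point is only to show the save/re-apply pattern around the drop:

```python
import numpy as np
import pandas as pd

def drop_rows_with_nans(X, y):
    # Minimal stand-in: drop every row that has a NaN in X or in y.
    mask = ~(X.isna().any(axis=1) | y.isna())
    return X[mask], y[mask]

X = pd.DataFrame({"a": [np.nan, 2.0, 3.0], "b": ["x", "y", None]})
y = pd.Series([1.0, 2.0, np.nan])

# Store the typing information *before* the drop...
X_dtypes = X.dtypes.to_dict()
X_t, y_t = drop_rows_with_nans(X, y)
# ...and re-apply it afterwards, so dropping rows does not reset it.
X_t = X_t.astype(X_dtypes)
```

With Woodwork the same pattern would save `X_t.ww.schema` before the drop and pass it back via the init call afterwards.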

@christopherbunn christopherbunn force-pushed the 3259_drop_nan_rows_component branch from 37c1f64 to 659738d Compare February 15, 2022 05:20
@christopherbunn left a comment (Contributor Author):
@freddyaboulton do you mind taking another look? The change making DefaultAlgorithm the default got merged to main after your review, and I ended up having to make some changes.

Comment on lines +400 to +398
```python
if need_drop_nan:
    last_component_name = pipeline.component_graph.get_last_component().name
    pipeline.component_graph.component_dict[estimator.name] = [
        estimator,
        last_component_name + ".x",
        last_component_name + ".y",
    ]
```
Contributor Author:
I feel like this and lines 381-383 + 392 are a bit hacky, but I couldn't find a better function to do this.
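For reference, the rewiring in the snippet above can be sketched with a plain dict standing in for evalml's `ComponentGraph.component_dict`. The component names and the `feed_estimator_from` helper are illustrative, not evalml API:

```python
# A toy component graph: each entry maps a component name to
# [component, X input, y input], mirroring component_dict's layout.
component_dict = {
    "Time Series Featurizer": ["Time Series Featurizer", "X", "y"],
    "Drop NaN Rows Transformer": [
        "Drop NaN Rows Transformer",
        "Time Series Featurizer.x",
        "y",
    ],
    "Estimator": ["Estimator", "Time Series Featurizer.x", "y"],
}

def feed_estimator_from(graph, estimator_name, last_component_name):
    # Repoint the estimator's X and y inputs at the outputs of the given
    # component, as the snippet above does for the DropNaNRowsTransformer.
    graph[estimator_name] = [
        graph[estimator_name][0],
        last_component_name + ".x",
        last_component_name + ".y",
    ]
    return graph

feed_estimator_from(component_dict, "Estimator", "Drop NaN Rows Transformer")
```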

Contributor:

OK, yeah, this part seems a little hacky. To make sure I understand: we're modifying the component_graph so that the estimator is fed by the DropNaNRowsTransformer, which can modify the target. But why can't we do this within _make_pipeline_from_multiple_graphs()? Wouldn't that be the function responsible for that logic?

Contributor Author:

Yep, my current implementation basically has both sub-pipelines feed into the DropNaNRowsTransformer, which then feeds its output into the estimator.

I know I asked for a better alternative to this implementation, but I'm a bit hesitant to move this logic into _make_pipeline_from_multiple_graphs(). I think that function should just be responsible for combining graphs; having the insert-DropNaNRowsTransformer logic there could make maintaining _make_pipeline_from_multiple_graphs() confusing in the long run.

I guess my ideal would be for _make_pipeline_from_multiple_graphs() to be able to take in a "postprocessing graph" of multiple components rather than just a single estimator, but I think adding that functionality should be its own issue.

Contributor:

I think this is another example of why we need this: #2997

Comment on lines 28 to 62
```python
def test_drop_rows_transformer_retain_ww_schema():
    # Expecting float because of np.NaN values
    X = pd.DataFrame(
        {"a column": [np.NaN, 2, 3, 4], "another col": ["a", np.NaN, "c", "d"]}
    )
    X.ww.init()
    X.ww.set_types(
        logical_types={"another col": "PersonFullName"},
        semantic_tags={"a column": "custom_tag"},
    )

    X_expected = pd.DataFrame({"a column": [3], "another col": ["c"]}, index=[2])
    X_expected = X_expected.astype({"a column": "float", "another col": "string"})
    X_expected_schema = X.ww.schema

    y = pd.Series([3, 2, 1, np.NaN])
    y = init_series(y, logical_type="IntegerNullable", semantic_tags="y_custom_tag")

    y_expected = pd.Series([True], index=[2])
    y_expected = init_series(
        y_expected, logical_type="IntegerNullable", semantic_tags="y_custom_tag"
    )
    y_expected_schema = y.ww.schema

    drop_rows_transformer = DropNaNRowsTransformer()
    transformed_X, transformed_y = drop_rows_transformer.fit_transform(X, y)
    assert_frame_equal(transformed_X, X_expected)
    assert_series_equal(transformed_y, y_expected)
    assert _schema_is_equal(transformed_X.ww.schema, X_expected_schema)
    assert transformed_y.ww.schema == y_expected_schema
```
@christopherbunn commented Feb 15, 2022 (Contributor Author):

I saw that I forgot to include this file in the last push 😅 but I've added a ww check test.

@christopherbunn christopherbunn force-pushed the 3259_drop_nan_rows_component branch from 659738d to 1ac479e Compare February 15, 2022 14:53
@chukarsten left a comment (Contributor):

Hey Chris, great work, as per usual. I think we should reconsider refactoring out the test for what needs the DropNaNRowsTransformer so that it can be reused in the places I call out in the review. That will protect us from any weird divergence bugs where we forget to update both. I also had a lingering question about why we can't let the normal make_pipeline-style functions handle the component graph adjustments!


@freddyaboulton left a comment (Contributor):

Thank you @christopherbunn !

```python
    return estimator_classes


def estimator_unable_to_handle_nans(estimator_class):
```
Contributor:

It may be safer to implement this as the opposite, e.g. "able_to_handle_nans". The concern I have with doing it like this is that if we add a new estimator, this function will return False if we forget to update this file and a false negative in this case is costly because it will mean AutoMLSearch will error out.

Not blocking - let's handle it in a follow-up.
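A minimal sketch of the fail-safe direction suggested here (the set contents and function name are illustrative; the real check in evalml may differ):

```python
# Estimators known to tolerate NaN inputs. Anything *not* listed is
# treated as unable to handle NaNs, so a brand-new estimator falls back
# to getting the DropNaNRowsTransformer instead of erroring out in
# AutoMLSearch when this file isn't updated.
_ESTIMATORS_ABLE_TO_HANDLE_NANS = {
    "LightGBM Classifier",
    "LightGBM Regressor",
}

def estimator_able_to_handle_nans(estimator_name):
    # Fail-safe inverse of estimator_unable_to_handle_nans.
    return estimator_name in _ESTIMATORS_ABLE_TO_HANDLE_NANS
```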

@chukarsten left a comment (Contributor):

Looks good, thanks for making the changes and making them so quickly!

@christopherbunn christopherbunn force-pushed the 3259_drop_nan_rows_component branch from 395383c to 826c79e Compare February 17, 2022 17:47
@christopherbunn christopherbunn merged commit c7d229a into main Feb 17, 2022
@chukarsten chukarsten mentioned this pull request Feb 18, 2022


Development

Successfully merging this pull request may close these issues.

Create drop NaN rows component and use for time series pipelines
