-
Notifications
You must be signed in to change notification settings - Fork 862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[timeseries] Implement missing value imputation for TimeSeriesDataFrame #2781
Conversation
Job PR-2781-9c1586a is done. |
---------- | ||
method : {"ffill", "interpolate"}, default = "ffill" | ||
Method used to impute missing values. | ||
"ffill" - propagate last valid observation forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should support all the filling methods available in pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two potential problems with bfill
/backfill
:
- it doesn't fill the trailing NaNs, and these are the ones that are actually important to ensure that all models can generate predictions over the forecast horizon. E.g, after we bfill
[1, 1, NaN, NaN]
, we again get[1, 1, NaN, NaN]
and this cannot be processed byTimeSeriesPredictor
. We cannot just drop the trailing NaNs as easily as we can drop leading NaNs withffill
. - it introduces information leakage from the test/val set, which might affect model selection.
Do you think there is a strong potential use case for bfill
that we should support & find a way around these problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Information leakage is a critical point, agree that we should drop bfill
for now. Curious for interpolate, will linear interpolation cause leakage as well by any chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point regarding interpolation
. I've checked how other libraries (sktime, darts) handle this and updated the functionality to be more in line with them.
- Do not change the index (never drop the leading NaNs)
- By default, use
ffill
to fill gaps + trailing NaNs, then usebfill
to fill the leading NaNs. - Added options
constant
andbfill
. - Added warnings for
bfill
andinterpolate
that these may lead to data leakage.
Job PR-2781-faa668b is done. |
Job PR-2781-6b6a994 is done. |
Job PR-2781-7ab5a94 is done. |
605b03b
to
4acf1a8
Compare
Job PR-2781-44bf527 is done. |
Description of changes:
TimeSeriesPredictor
.to_regular_index(freq)
- fills gaps in an irregularly-sampled time series with NaNsfill_missing_values(method)
- drop leading NaNs & replace other NaNs (middle/trailing) using the chosen method (forward fill or interpolation)By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.