[timeseries] Implement missing value imputation for TimeSeriesDataFrame #2781

shchur · 2023-01-30T15:20:55Z

Description of changes:

This PR adds two new methods for imputing missing values. They should be called before passing the data to the TimeSeriesPredictor.
- to_regular_index(freq) - fills gaps in an irregularly sampled time series with NaNs
- fill_missing_values(method) - drops leading NaNs and replaces the remaining NaNs (middle and trailing) using the chosen method (forward fill or interpolation)
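To make the two steps concrete, here is a minimal sketch on a single series using plain pandas. It only illustrates the behavior described above and is not the code from this PR; the variable names and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily series (assumed data): 2023-01-03 and 2023-01-04 are absent
# from the index, and the first observation is NaN.
ts = pd.Series(
    [np.nan, 2.0, 5.0, 6.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05", "2023-01-06"]),
)

# Step 1 (analogous to to_regular_index): reindex onto a regular daily grid.
# Timestamps that were absent now appear with NaN values.
regular = ts.asfreq("D")

# Step 2 (analogous to fill_missing_values as described above):
# drop the leading NaNs, then forward-fill the middle/trailing NaNs.
trimmed = regular.loc[regular.first_valid_index():]
filled = trimmed.ffill()

print(regular)
print(filled)
```

Forward fill only ever copies past observations, which is why it is the default choice in the description above.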

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2023-01-30T16:46:42Z

Job PR-2781-9c1586a is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2781/9c1586a/index.html

tonyhoo · 2023-01-30T18:28:48Z

timeseries/src/autogluon/timeseries/dataset/ts_dataframe.py

+        ----------
+        method : {"ffill", "interpolate"}, default = "ffill"
+            Method used to impute missing values.
+            "ffill" - propagate last valid observation forward.


We should support all the filling methods available in pandas.

shchur · 2023-01-30

I see two potential problems with bfill/backfill:

1. It doesn't fill the trailing NaNs, which are the ones that actually matter for ensuring that all models can generate predictions over the forecast horizon. E.g., after we bfill [1, 1, NaN, NaN], we again get [1, 1, NaN, NaN], which cannot be processed by TimeSeriesPredictor. We cannot just drop the trailing NaNs as easily as we can drop leading NaNs with ffill.
2. It introduces information leakage from the test/val set, which might affect model selection.

Do you think there is a strong potential use case for bfill that we should support & find a way around these problems?
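Both problems can be reproduced with plain pandas. The snippet below is a small illustration using the example values from this comment; the variable names are assumptions, and it does not touch TimeSeriesDataFrame.

```python
import numpy as np
import pandas as pd

# Problem 1: bfill has nothing to copy from after the last observation,
# so trailing NaNs stay NaN.
s = pd.Series([1.0, 1.0, np.nan, np.nan])
after_bfill = s.bfill()   # trailing NaNs remain
after_ffill = s.ffill()   # trailing NaNs replaced by the last observation

# Problem 2: bfill fills a gap with a *later* observation, so the value
# at position 1 is taken from the future (position 2).
gap = pd.Series([1.0, np.nan, 3.0])
gap_bfill = gap.bfill()

print(after_bfill.tolist())
print(after_ffill.tolist())
print(gap_bfill.tolist())
```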

tonyhoo · 2023-01-30

Information leakage is a critical point; I agree that we should drop bfill for now. I'm curious about interpolate: could linear interpolation cause leakage as well?
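A quick way to check this with plain pandas: linearly interpolated values depend on the next valid observation, so changing a later value changes the earlier filled values. This is an illustrative sketch with assumed toy data, not code from the PR.

```python
import numpy as np
import pandas as pd

# Two series that differ only in their last (future) observation.
a = pd.Series([1.0, np.nan, np.nan, 4.0])
b = pd.Series([1.0, np.nan, np.nan, 10.0])

# Linear interpolation (the pandas default) draws a line between the
# neighbouring valid points, so the filled values differ between a and b.
a_interp = a.interpolate()
b_interp = b.interpolate()

# Forward fill only looks backwards, so the filled values are identical.
a_ffill = a.ffill()
b_ffill = b.ffill()

print(a_interp.tolist(), b_interp.tolist())
print(a_ffill.tolist(), b_ffill.tolist())
```

In other words, values imputed by interpolation inside the training window can carry information from observations that come later in time.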

shchur · 2023-01-31

Good point regarding interpolation. I've checked how other libraries (sktime, darts) handle this and updated the functionality to be more in line with them:

- Do not change the index (never drop the leading NaNs).
- By default, use ffill to fill gaps + trailing NaNs, then use bfill to fill the leading NaNs.
- Added options constant and bfill.
- Added warnings for bfill and interpolate that these may lead to data leakage.
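The updated behavior can be sketched on a single pandas series as follows. This is a hedged illustration, not the merged implementation: the function name `fill_missing_sketch`, the method string `"default"`, the `value` parameter, and the warning text are all hypothetical.

```python
import warnings

import numpy as np
import pandas as pd


def fill_missing_sketch(series: pd.Series, method: str = "default", value: float = 0.0) -> pd.Series:
    """Hypothetical helper mirroring the behavior listed above; the index is never changed."""
    if method == "default":
        # ffill handles gaps and trailing NaNs; bfill then fills the leading NaNs.
        return series.ffill().bfill()
    if method == "bfill":
        warnings.warn("bfill uses future observations and may lead to data leakage")
        return series.bfill()
    if method == "interpolate":
        warnings.warn("interpolate uses future observations and may lead to data leakage")
        return series.interpolate()
    if method == "constant":
        return series.fillna(value)
    raise ValueError(f"Unknown method: {method}")


s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
default_filled = fill_missing_sketch(s)
constant_filled = fill_missing_sketch(s, method="constant", value=0.0)

print(default_filled.tolist())
print(constant_filled.tolist())
```

Because no rows are dropped, the output always has the same index as the input, which keeps all series aligned with the forecast horizon.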

github-actions · 2023-01-30T23:07:12Z

Job PR-2781-faa668b is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2781/faa668b/index.html

github-actions · 2023-01-31T12:34:39Z

Job PR-2781-6b6a994 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2781/6b6a994/index.html

github-actions · 2023-01-31T12:52:25Z

Job PR-2781-7ab5a94 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2781/7ab5a94/index.html

github-actions · 2023-02-01T20:37:03Z

Job PR-2781-44bf527 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2781/44bf527/index.html

shchur requested a review from tonyhoo January 30, 2023 15:29

tonyhoo reviewed Jan 30, 2023

tonyhoo approved these changes Jan 30, 2023

shchur force-pushed the ffill branch 2 times, most recently from 605b03b to 4acf1a8 Compare February 1, 2023 17:05

shchur added 6 commits February 1, 2023 19:07

Implement missing data imputation in TSDF

396bb51

Add tests for imputation

1376d39

Address PR comments

ef3b68c

Add more imputation methods

eaed561

Add tests

3051185

Update in-depth

44bf527

shchur force-pushed the ffill branch from 4acf1a8 to 44bf527 Compare February 1, 2023 19:07

shchur requested a review from tonyhoo February 2, 2023 17:04

tonyhoo approved these changes Feb 3, 2023

shchur merged commit 746f796 into autogluon:master Feb 3, 2023

shchur deleted the ffill branch February 3, 2023 06:22
