[timeseries] Add native support for missing values #3995

Merged: shchur merged 22 commits into autogluon:master from nan-values-ts-models on Mar 29, 2024

Collaborator

@shchur shchur commented Mar 21, 2024

Issue #, if available: fixes #3886

Description of changes:

  • Instead of imputing all missing values in the target column in TimeSeriesPredictor._check_and_prepare_data_frame, we let each model apply its own logic for handling missing values:
    • GluonTS models (DeepAR, TFT, PatchTST, DLinear) and some local models (Average, SeasonalAverage, NPTS, Naive, SeasonalNaive) handle missing values natively.
    • Other local models (AutoETS, AutoCES, AutoARIMA, Theta, intermittent models) perform imputation first.
    • MLForecast models use a mix of the two strategies: missing values are imputed, but rows that originally contained NaN values are not used for training.
  • Model properties (e.g., whether a model can handle missing values) are stored using the _get_tags() mechanism.
  • TimeSeriesPredictor now removes time series that consist only of NaN values from train_data during fit().
  • Missing values in covariates are still always imputed inside TimeSeriesFeatureGenerator.
  • Missing values are added to DUMMY_TS_DATAFRAME, which is used in the tests, to ensure that NaN support works in all scenarios.
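
A minimal sketch of the strategies listed above, not the code from this PR. It assumes a long-format pandas DataFrame with hypothetical `item_id` and `target` columns, uses forward/backward fill as a stand-in imputation, and uses a hypothetical `allow_nan` tag name with toy model classes:

```python
import numpy as np
import pandas as pd


def drop_all_nan_series(df: pd.DataFrame, target: str = "target") -> pd.DataFrame:
    """Remove every series whose target column contains only NaN values."""
    has_observation = df[target].notna().groupby(df["item_id"]).transform("any")
    return df[has_observation]


def impute(df: pd.DataFrame, target: str = "target") -> pd.DataFrame:
    """Fill gaps per series (forward fill, then backward fill); an assumed strategy."""
    out = df.copy()
    out[target] = out.groupby("item_id")[target].transform(lambda s: s.ffill().bfill())
    return out


def impute_then_mask(df: pd.DataFrame, target: str = "target") -> pd.DataFrame:
    """Mixed strategy: impute, then drop rows whose target was originally NaN."""
    was_missing = df[target].isna()
    return impute(df, target)[~was_missing]


class NativeNaNModel:
    """Toy model that declares native NaN support through its tags."""

    def _get_tags(self) -> dict:
        return {"allow_nan": True}  # hypothetical tag name


class ImputingModel:
    """Toy model that needs imputed input."""

    def _get_tags(self) -> dict:
        return {"allow_nan": False}


def prepare_for(model, df: pd.DataFrame) -> pd.DataFrame:
    """Pass data through unchanged only if the model's tags allow NaN."""
    if model._get_tags()["allow_nan"]:
        return df
    return impute(df)


df = pd.DataFrame(
    {
        "item_id": ["A", "A", "A", "B", "B", "C", "C"],
        "target": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan],
    }
)

cleaned = drop_all_nan_series(df)                   # series "C" is removed
native = prepare_for(NativeNaNModel(), cleaned)     # NaNs are kept
imputed = prepare_for(ImputingModel(), cleaned)     # NaNs are filled
train_rows = impute_then_mask(cleaned)              # only originally observed rows
```

Reading the tag in one place (`prepare_for`) keeps the decision about imputation out of the predictor, which is the design direction the description outlines.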

To do:

  • Add tests
  • Add missing values support to MLForecast models
  • Benchmark the new NaN handling strategy on datasets with missing values

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shchur changed the title from "WIP: [timeseries] Add native support for missing values" to "[timeseries] Add native support for missing values" on Mar 25, 2024
@yinweisu
Collaborator

| Previous CI Run | Current CI Run |
| --- | --- |
| huggingface-hub==0.21.4 | huggingface-hub==0.22.0 |
| lightning-utilities==0.11.0 | lightning-utilities==0.11.1 |
| filelock==3.13.1 | filelock==3.13.2 |

@canerturkmen added the "module: timeseries" label (related to the timeseries module) on Mar 25, 2024
Contributor

@canerturkmen canerturkmen left a comment

LGTM overall, thanks a lot for this! Dropped a few minor comments and questions.

@yinweisu
Collaborator

| Previous CI Run | Current CI Run |
| --- | --- |
| flatbuffers==24.3.7 | flatbuffers==24.3.25 |

Job PR-3995-7ff79e4 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3995/7ff79e4/index.html

Contributor

@canerturkmen canerturkmen left a comment

LGTM! Only one comment.

if known_covariates_names == []:
    assert known_covariates_transformed is None
else:
    assert not known_covariates_transformed[known_covariates_names].isna().any(axis=None)
Contributor

This looks great and I'll probably use it for feature importance.

Should we also test the filling logic, at least for the 'median' and 'mode' scenarios?

Collaborator Author

Thanks, added a test for that in test_learner to ensure that this logic works after loading from disk
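
For illustration only, a check of this kind might look like the sketch below. `CovariateImputer` is a hypothetical stand-in (it is not TimeSeriesFeatureGenerator or the learner), and the JSON round trip only mimics "loading from disk"; the actual test in test_learner may be structured differently.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd


class CovariateImputer:
    """Hypothetical imputer that fills NaNs with a per-column median or mode."""

    def __init__(self, strategy: str = "median"):
        if strategy not in ("median", "mode"):
            raise ValueError(f"Unsupported strategy: {strategy}")
        self.strategy = strategy
        self.fill_values = {}

    def fit(self, df: pd.DataFrame) -> "CovariateImputer":
        # Both reductions skip NaN values by default.
        if self.strategy == "median":
            self.fill_values = df.median().to_dict()
        else:
            self.fill_values = df.mode().iloc[0].to_dict()
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.fillna(self.fill_values)

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump({"strategy": self.strategy, "fill_values": self.fill_values}, f)

    @classmethod
    def load(cls, path: str) -> "CovariateImputer":
        with open(path) as f:
            state = json.load(f)
        imputer = cls(state["strategy"])
        imputer.fill_values = state["fill_values"]
        return imputer


def fill_after_reload(strategy: str, train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Fit on train, save, reload from disk, then fill the test frame."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "imputer.json")
        CovariateImputer(strategy).fit(train).save(path)
        return CovariateImputer.load(path).transform(test)


# Synthetic data: the observed median is 2.0 and the mode is 1.0.
train = pd.DataFrame({"price": [1.0, 1.0, 3.0, 10.0, np.nan]})
test = pd.DataFrame({"price": [np.nan, 4.0]})

median_filled = fill_after_reload("median", train, test)
mode_filled = fill_after_reload("mode", train, test)

assert not median_filled.isna().any(axis=None)
assert not mode_filled.isna().any(axis=None)
```

Choosing training data where the median and the mode differ lets a single fixture tell the two fill strategies apart.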

Collaborator Author

@shchur shchur left a comment

Thanks @canerturkmen, I've added the tests for mode/median imputation + added missing values to DUMMY_TS_DATAFRAME so that we consider more settings with NaNs

@@ -46,7 +46,7 @@ ignore-words-list = 'mape,ans,2st,fo,nd,te,fpr,coo,rouge'


 [tool.ruff]
-ignore = [
+lint.ignore = [
Collaborator Author

This fixes a deprecation warning from ruff

@shchur
Collaborator Author

shchur commented Mar 28, 2024

@canerturkmen I just finished some benchmarking on 12 datasets with missing values (3 folds each). This PR branch has a 65% win rate vs. the current master branch, so I guess we are good to merge once the problems with the tests are resolved.
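
As a rough illustration of how a win rate over dataset/fold pairs could be computed (the benchmark code is not part of this conversation), here is a small sketch. It assumes lower error is better and that ties count as half a win; both are assumptions, and the numbers are synthetic rather than benchmark results.

```python
def win_rate(errors_new: list, errors_old: list) -> float:
    """Fraction of paired tasks where the new method has lower error (ties count 0.5)."""
    if len(errors_new) != len(errors_old) or not errors_new:
        raise ValueError("Both lists must be non-empty and of equal length")
    score = 0.0
    for new, old in zip(errors_new, errors_old):
        if new < old:
            score += 1.0
        elif new == old:
            score += 0.5
    return score / len(errors_new)


# Synthetic per-task errors, one entry per (dataset, fold) pair.
new_branch = [0.8, 1.0, 0.5, 2.0]
old_branch = [1.0, 1.0, 0.4, 2.5]

rate = win_rate(new_branch, old_branch)  # two wins, one tie, one loss
```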

@@ -86,25 +86,6 @@ def test_when_local_model_saved_then_local_model_args_are_saved(model_class, hyp
    assert dict_equal_primitive(model._local_model_args, loaded_model._local_model_args)


@pytest.mark.parametrize("model_class", TESTABLE_MODELS)

Job PR-3995-e50709a is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3995/e50709a/index.html

Contributor

@canerturkmen canerturkmen left a comment

LGTM! 🚀

@shchur shchur merged commit d0d1fa9 into autogluon:master Mar 29, 2024
29 checks passed
prateekdesai04 pushed a commit to prateekdesai04/autogluon that referenced this pull request Apr 3, 2024
@shchur shchur deleted the nan-values-ts-models branch April 3, 2024 09:51
LennartPurucker pushed a commit to LennartPurucker/autogluon that referenced this pull request Jun 1, 2024
Labels
module: timeseries related to the timeseries module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[timeseries] Add first-class support for missing values
3 participants