Relax constraints on test set for time series #3071
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3071 +/- ##
=======================================
+ Coverage 99.8% 99.8% +0.1%
=======================================
Files 312 312
Lines 30421 30433 +12
=======================================
+ Hits 30330 30343 +13
+ Misses 91 90 -1
Continue to review full report at Codecov.
Force-pushed 0e77c0d to 03e071b
Solid work mate, just left a question about our implementation of `_move_index_forward`. I think it might be difficult for open-source users to understand the limitations of what they can do with forecasting time series when the test set is more than `gap` observations beyond the training set.
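For readers following along, the constraint under discussion can be sketched with plain pandas. This is illustrative only: the split arithmetic below mirrors the `gap` / `forecast_horizon` terminology from this thread, not evalml's actual splitter.

```python
import pandas as pd

# A daily series split into train / gap / test, as discussed in this thread.
# The split logic here is a sketch, not evalml's implementation.
y = pd.Series(range(20), index=pd.date_range("2021-01-01", periods=20, freq="D"))

gap = 3
forecast_horizon = 4
train = y.iloc[: len(y) - gap - forecast_horizon]  # rows 0..12
test = y.iloc[len(y) - forecast_horizon :]         # rows 16..19

# The test set starts `gap` observations after the last training row,
# and its size must not exceed forecast_horizon.
assert len(test) <= forecast_horizon
assert (test.index[0] - train.index[-1]).days == gap + 1
```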
@@ -292,7 +292,7 @@
 "You'll notice that in the code snippets here, we use the ``predict_in_sample`` pipeline method as opposed to the usual ``predict`` method. What's the difference?\n",
 "\n",
 "* ``predict_in_sample`` is used when the target value is known on the dates we are predicting on. This is true in cross validation. This method has an expected ``y`` parameter so that we can compute features using previous target values for all of the observations on the holdout set.\n",
-"* ``predict`` is used when the target value is not known, e.g. the test dataset. The y parameter is not expected as only the target is observed in the training set. The test dataset must be separated by ``gap`` days from the training dataset. For the moment, the test set size must match ``forecast_horizon``.\n",
+"* ``predict`` is used when the target value is not known, e.g. the test dataset. The y parameter is not expected as only the target is observed in the training set. The test dataset must be separated by ``gap`` days from the training dataset. For the moment, the test set size be less than or equal to ``forecast_horizon``.\n",
"the test set size must be less than..."
@@ -247,8 +247,11 @@ def predict(self, X, objective=None, X_train=None, y_train=None):
             "Cannot call predict() on a component graph because the final component is not an Estimator."
         )
     X = infer_feature_types(X)
+    X.index = self._move_index_forward(
+        X_train.index[-X.shape[0] :], self.gap + X.shape[0]
+    )
I like this implementation, but I had a question about it. Previously, if a user were to set a forecast for indices that were more than `gap` units ahead of the training data, the second part of the warning above would be relevant: `its index needs to start {self.gap + 1} values ahead of the training index`.
Now the same forecast attempt doesn't raise the error, but still sets the forecasts to the end of the training set.
This is really useful for when the test set starts at 0, but if it continues the index from the end of the training set, this might give a false impression of future predictions. Maybe we should add a check here for test indices that begin more than `gap` observations away from the training set, so the original error can still be raised.
Great question @ParthivNaresh! I think the reason I didn't want the check to be based on the index is that the index will probably not start `gap + 1` units away from the training/validation index.
My reasoning is that the kinds of examples we've been working off, where we manually split the same dataset into separate chunks, are not reflective of what happens in "real life". Users will give one dataset to AutoMLSearch (which will have an index from 0 to X.shape[0], or whatever), and then the test set will come from a different csv or file. When you read that in with pandas, the index will start at 0. I feel it's awkward for users to have to re-index when we can do that calculation for them.
That being said, it does assume the user is uploading data that's temporally right after the dataset given to AutoMLSearch. We can check that via the new `date_index` parameter that we're now requiring!
I was actually thinking of doing that in a separate PR, but the implementation should be small enough to do in this one, maybe?
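A contiguity check based on the date column could be sketched like this. The `date_index` handling below is assumed for illustration and is not the actual implementation:

```python
import pandas as pd

def check_follows_training(X_test, X_train, date_index, gap, freq="D"):
    """Hypothetical sketch: verify the test data starts exactly gap + 1
    periods after the last training date, regardless of the row index."""
    last_train = pd.Timestamp(X_train[date_index].iloc[-1])
    first_test = pd.Timestamp(X_test[date_index].iloc[0])
    expected = last_train + (gap + 1) * pd.tseries.frequencies.to_offset(freq)
    if first_test != expected:
        raise ValueError(
            f"Expected the test set to start at {expected}, got {first_test}."
        )

X_train = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=10, freq="D")})
X_test = pd.DataFrame({"date": pd.date_range("2021-01-13", periods=3, freq="D")})
# Passes: Jan 13 is gap + 1 = 3 days after the last training date, Jan 10.
check_follows_training(X_test, X_train, "date", gap=2)
```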
I absolutely agree with the intent behind this. Users are almost guaranteed, in the vast majority of cases, to pass in a separate dataset for testing and forecasting from index 0, so this works great for that.
In the case that a user passes in a test set separately but tries to do something similar to above, where they might try to forecast values starting from `[50:]` as opposed to starting with the first observation, they'll run into a similar issue. But if that's going in as part of a separate PR, I'm totally good with leaving it out of this one!
Force-pushed 03e071b to 76fe47e
Pull Request Description
Fixes #3070
After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.